We’re definitely not the only ones who believe in Silverlight’s ability to bring BI to the next level. Like us, InfoMod Dashboards is based in Washington and offers impressive dashboards, which you can access here: https://one.xrmdashboards.com/demo.htm. One nice thing about their demo is that it doesn’t require account registration (perhaps we should do the same) and loads very quickly. At the moment, they surface Microsoft Dynamics CRM data, but our guess is that they’ll expand to other products in the future. Check it out!
Several of our customers have asked us whether Data Applied visualizations could be embedded in any web page. Until now, our answer was: sorry, no; use our XML Web API to build your own. But now you can! Simply click the “share” button and copy/paste the URL, or embed the view in an IFrame. Check out our new demo center for examples of HTML embedding.
Let’s talk about the implementation. Sharing views securely is the hard part. Each visualization supports complex operations such as searching results, viewing underlying data, and tagging results with comments. So when you share a view, what should be allowed? When you share with a friend, you may want to allow commenting or full data retrieval. But when you share with the world, you may want to restrict access further.
To support secure sharing, we introduced the concept of restricted tickets in our platform. In short:
- Users receive full tickets upon authentication
- Full tickets can be converted to restricted tickets using a single API call
- Simply pass a list of usage restrictions to apply to the restricted ticket
So for example, using the XML Web API, an authenticated user could present a valid ticket, and request a restricted ticket which:
- Expires 5 minutes from now
- Only allows read access to a given workspace
- Only allows the “retrieve comments” operation
Whoever receives the restricted ticket will be able to retrieve comments for 5 minutes, but nothing else. This concept is incredibly powerful, and obviously extends beyond simple view sharing. For example, it could be used to allow third-party applications to perform a limited set of operations on your behalf. More information about the API can be found here (see “Restricting Access”).
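As a rough illustration of how restricted tickets behave, here is a minimal Python sketch. The names below (`Ticket`, `restrict`, `permits`, the operation strings, the workspace ID) are hypothetical and for illustration only; the real platform exposes this functionality through its XML Web API, not through this interface.

```python
import time

# Hypothetical sketch of the restricted-ticket concept: a restricted
# ticket can only narrow what a full ticket already allows.

class Ticket:
    def __init__(self, operations, workspace=None, expires_at=None):
        self.operations = set(operations)
        self.workspace = workspace      # None means any workspace
        self.expires_at = expires_at    # None means no expiry

    def restrict(self, operations, workspace=None, ttl_seconds=None):
        """Derive a more limited ticket; restrictions only narrow access."""
        allowed = self.operations & set(operations)
        expires = time.time() + ttl_seconds if ttl_seconds else self.expires_at
        return Ticket(allowed, workspace or self.workspace, expires)

    def permits(self, operation, workspace):
        if self.expires_at is not None and time.time() > self.expires_at:
            return False
        if self.workspace is not None and workspace != self.workspace:
            return False
        return operation in self.operations

# A full ticket granted at authentication time...
full = Ticket({"read", "write", "retrieve-comments", "tag"})

# ...converted to a restricted ticket: 5-minute expiry, one workspace,
# and only the "retrieve comments" operation.
shared = full.restrict({"retrieve-comments"}, workspace="ws-42",
                       ttl_seconds=300)

print(shared.permits("retrieve-comments", "ws-42"))  # True
print(shared.permits("write", "ws-42"))              # False
```

Whoever holds `shared` can retrieve comments from that one workspace for five minutes, and nothing else, which mirrors the sharing scenario described above.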
We just learned that Savian has released a .NET connector for SAS data sets. We can’t give more details, but apparently it took a huge investment in man-days to make it happen. We’ve heard that the SAS data format is fairly complicated, but we’re not sure whether that’s because of complex metadata, because of storage optimizations, or because it was intentionally designed to be difficult to interop with. It’s true that data formats which have been around for a long time tend to get complicated because of multi-versioning and old-style binary encoding. Another perfect example of complex encoding is the Microsoft Word binary format, but at least they’ve now done the right thing and fully documented it. In any case, being able to access SAS data sets using .NET technologies is something a lot of people have been waiting for, and it will open a whole new realm of possibilities. It just makes sense!
One thing is for sure: data mining is expensive. For example, this study shows that the average initial annual cost of an SPSS deployment is $342,061 (including $153,500 in licensing fees). Also, this study shows that annual SAS licensing fees alone amount to $123,669 (for a quad-core machine, excluding training and consulting fees). So while the cost to deploy data mining solutions remains extremely high, we think that large companies tolerate it because of the resulting ROI. Using predictive analytics to achieve a 0.5% decrease in churn rates (or a 0.5% increase in campaign response) may result in savings high enough to quickly offset the initial investment. This said, does it make sense to pay this much for data mining? For example, licensing models which charge more depending on the number of processors seem a bit ridiculous at a time when even laptops come with dual or quad cores.
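To make the payback argument concrete, here is a hypothetical back-of-the-envelope calculation. The customer base size and per-customer value are made-up assumptions; only the deployment cost comes from the SPSS figure cited above.

```python
# Hypothetical payback calculation for a data mining deployment.
deployment_cost = 342_061   # average initial annual SPSS cost cited above

customers = 500_000         # assumed subscriber base (made-up)
annual_value = 200.0        # assumed revenue per retained customer (made-up)
churn_reduction = 0.005     # the 0.5% improvement mentioned above

retained = customers * churn_reduction      # customers kept per year
annual_savings = retained * annual_value    # dollars saved per year
payback_months = 12 * deployment_cost / annual_savings

print(f"Retained customers: {retained:,.0f}")     # 2,500
print(f"Annual savings:     ${annual_savings:,.0f}")  # $500,000
print(f"Payback period:     {payback_months:.1f} months")  # 8.2 months
```

Under these (entirely hypothetical) assumptions, a half-percent churn improvement pays back the initial investment in well under a year, which is why large companies tolerate the price tags.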
1. Same algorithms, different prices:
When we decided to start promoting our data mining solution on the web, we experimented with Facebook’s and Google’s ad programs. Google offers a pay-per-click (PPC) model, where advertisers bid on search keywords, with some geographic and demographic preferences allowed. Facebook offers both a pay-per-click (PPC) and a pay-per-impression (PPM) model, where advertisers bid for an audience defined by location, age, workplace, and interests.
We found several articles claiming that conversion rates are much lower with Facebook ads. This makes sense: on Facebook, distracted users stumble on ads and click on them when bored, whereas with Google ads, users have entered specific keywords and are actively looking for something. Still, we feel that Facebook advertising makes more sense for us. Here is why:
1. Some audiences are hard to reach:
One of our target groups is data mining experts. Because they are experts in their field, they are extremely unlikely to search for “data mining” on Google. Rather, these experts are more likely to search for specific keywords (ex: “direct hashing and pruning association mining”), which are difficult to identify. Only Facebook, with its interest-based targeting, gives us an opportunity to reach this type of audience.
You may wonder how we test our data mining algorithms to ensure they correctly turn data into insights. For verification purposes, we use two types of data sets: synthetic and real. Synthetic data is computer-generated data which follows certain statistical patterns, while real data comes from the real world and therefore often includes incorrect measurements or missing values. Using both synthetic and real data sets, we verify our ability to identify hidden relationships, discover large groups of similar records, automatically detect anomalies, etc.
To obtain synthetic data sets, the main ingredient is a set of random generators, each following a different distribution (ex: linear, gaussian, gamma, etc.). For example, our gaussian generator uses the Box-Muller transform to generate random data. Next, each random generator can be constrained to always return values within a certain range. For example, if we want to generate synthetic values representing ages, we may want all values to be positive integers. To generate discrete values, we simply map ranges of numeric values to discrete equivalents. For example, we could map numeric values in the 0.0-0.5 range to a “yes” value, and values in the 0.5-1.0 range to a “no” value. Finally, to simulate complex interdependence between variables, we specify a set of influence functions using a dependence graph. For example, we could define an influence function which increases or decreases income based on specific age + region combinations.
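The pipeline above can be sketched in a few lines of Python: a Gaussian generator built on the Box-Muller transform, range clamping, a numeric-to-discrete mapping, and a simple influence function. All constants (means, ranges, the age + region rule) are illustrative only, not our actual generators.

```python
import math
import random

def gaussian(mean, stddev):
    """Box-Muller transform: two uniform draws -> one normal draw."""
    u1 = 1.0 - random.random()  # shift to (0, 1] so log(u1) is defined
    u2 = random.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mean + stddev * z

def clamp(value, lo, hi):
    """Constrain a generator's output to a fixed range."""
    return max(lo, min(hi, value))

def to_discrete(value):
    """Map numeric ranges to discrete values, as in the yes/no example."""
    return "yes" if value < 0.5 else "no"

def make_record():
    age = int(clamp(gaussian(40, 12), 0, 100))  # positive-integer ages
    region = random.choice(["north", "south"])
    income = clamp(gaussian(50_000, 15_000), 0, 250_000)
    # Influence function: a specific age + region combination shifts income.
    if region == "north" and age > 50:
        income *= 1.2
    return {"age": age, "region": region, "income": round(income, 2)}

random.seed(7)  # reproducible synthetic data
print(make_record())
```

A real generator would also inject controlled noise and missing values so that the synthetic sets exercise the same edge cases as real-world data.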
What does it mean to be a garage startup, especially when your goal is to crunch massive amounts of data? In our case, we mean this quite literally, and since a picture is worth a thousand words, take a peek. Obviously, we’re also using computers hosted by third-party providers whose data centers offer full redundancy and physical data protection. So keep in mind that what you’re seeing below is only used internally and to support our free plan.
This said, it doesn’t have to be on-premises computing vs. cloud computing. We’ve found that it actually makes sense to combine local and cloud computing. For example, local computing can be used on a fixed basis to deal with regular load, while cloud computing can be used to accommodate overflow load. Cloud computing is a bit like renting movies: it’s cheap to rent one, as long as you don’t keep it for too long.
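The local-first, cloud-for-overflow split boils down to a trivial routing rule. The slot count and per-job cost below are made-up numbers, purely to illustrate the idea.

```python
# Illustrative sketch of the local-plus-cloud split: fill fixed local
# capacity first, and send only the overflow to rented cloud machines.
LOCAL_SLOTS = 4             # assumed fixed on-premises workers
CLOUD_COST_PER_JOB = 0.10   # hypothetical per-job rental cost

def route(jobs):
    """Split a batch of jobs into local work and paid cloud overflow."""
    local = jobs[:LOCAL_SLOTS]
    overflow = jobs[LOCAL_SLOTS:]
    cloud_cost = len(overflow) * CLOUD_COST_PER_JOB
    return local, overflow, cloud_cost

local, overflow, cost = route([f"job-{i}" for i in range(7)])
print(f"local: {len(local)}, cloud: {len(overflow)}, cost: ${cost:.2f}")
```

The economics work exactly like the movie-rental analogy: the cloud charge scales with overflow volume, while the fixed local machines absorb the steady baseline for free.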