Several of our customers have asked us whether Data Applied visualizations could be embedded in any web page. Until now, our answer was: sorry, no; use our XML Web API to build your own. But now you can! Simply click a “share” button and copy/paste the URL, or embed the view in an IFrame. Check out our new demo center for examples of HTML embedding.
Let’s talk about the implementation. Sharing views securely is the hard part. Each visualization supports complex operations such as searching results, viewing the underlying data, or tagging results with comments. So when you share a view, what should be allowed? When you share with a friend, you may want to allow commenting or full data retrieval. But when you share with the world, you may want to restrict access further.
To support secure sharing, we introduced the concept of restricted tickets in our platform. In short:
- Users receive full tickets upon authentication
- Full tickets can be converted to restricted tickets using a single API call
- Simply pass a list of usage restrictions to apply to the restricted ticket
So for example, using the XML Web API, an authenticated user could present a valid ticket, and request a restricted ticket which:
- Expires 5 minutes from now
- Only allows read access to a given workspace
- Only allows the “retrieve comments” operation
Whoever receives the restricted ticket will be able to retrieve comments for 5 minutes, but nothing else. This concept is incredibly powerful, and obviously extends beyond simple view sharing. For example, it could be used to allow third-party applications to perform a limited set of operations on your behalf. More information about the API can be found here (see “Restricting Access”).
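The restricted-ticket model above can be sketched in a few lines. To be clear, this is our own illustrative pseudo-implementation, not the actual Data Applied XML Web API: the `Ticket` class, its method names, and the operation names are all hypothetical. It only models the core idea, namely that a full ticket can mint a restricted ticket carrying an expiry and a narrowed whitelist of allowed operations.

```python
import secrets
import time

class Ticket:
    """Hypothetical ticket object (illustration only, not the real API)."""

    def __init__(self, operations, expires_at=None):
        self.token = secrets.token_hex(16)   # opaque ticket identifier
        self.operations = set(operations)    # operations this ticket allows
        self.expires_at = expires_at         # None means no expiry (full ticket)

    def restrict(self, operations, ttl_seconds):
        # A restricted ticket can only narrow access, never widen it:
        # intersect the requested operations with what we already hold.
        allowed = self.operations & set(operations)
        return Ticket(allowed, expires_at=time.time() + ttl_seconds)

    def permits(self, operation):
        # Deny anything once the ticket has expired.
        if self.expires_at is not None and time.time() > self.expires_at:
            return False
        return operation in self.operations

# A full ticket obtained at authentication time...
full = Ticket({"retrieve_comments", "read_data", "tag_results"})

# ...converted into a restricted ticket: 5-minute expiry, comments only.
shared = full.restrict({"retrieve_comments"}, ttl_seconds=300)

print(shared.permits("retrieve_comments"))  # True
print(shared.permits("read_data"))          # False
```

Whoever holds `shared` can retrieve comments until the expiry passes, and nothing else, which is exactly the guarantee described above.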
One thing is for sure: data mining is expensive. For example, this study shows that the average initial annual cost of an SPSS deployment is $342,061 (including $153,500 in licensing fees). Another study puts SAS licensing fees alone at $123,669 per year (for a quad-core machine, excluding training and consulting fees). So while the cost of deploying data mining solutions remains extremely high, we think large companies tolerate it because of the resulting ROI. Using predictive analytics to achieve a 0.5% decrease in churn rates (or a 0.5% increase in campaign response) may produce savings high enough to quickly offset the initial investment. That said, does it make sense to pay this much for data mining? For example, licensing models which charge more depending on the number of processors seem a bit ridiculous at a time when even laptops ship with dual or quad cores.
1. Same algorithms, different prices:
When we decided to start promoting our data mining solution on the web, we experimented with Facebook’s and Google’s ad programs. Google offers a pay-per-click (PPC) model, where advertisers bid on search keywords, with some geographic and demographic preferences allowed. Facebook offers both a pay-per-click (PPC) and a pay-per-thousand-impressions (CPM) model, where advertisers bid for an audience defined by location, age, workplace, and interests.
We found several articles claiming that conversion rates are much lower with Facebook ads. This makes sense: on Facebook, distracted users stumble on ads and click on them when bored, while on Google, users enter specific keywords and are actively looking for something. Still, we feel that Facebook advertising makes more sense for us. Here is why:
1. Some audiences are hard to reach:
One of our target groups is data mining experts. Because they are experts in their field, they are extremely unlikely to search for “data mining” on Google. Rather, these experts are more likely to search for specific keywords (ex: “direct hashing and pruning association mining”), which are difficult to identify. Only Facebook, with its interest-based targeting, gives us an opportunity to reach this type of audience.
You may wonder how we test our data mining algorithms to ensure they correctly turn data into insights. For verification purposes, we use two types of data sets: synthetic and real. Synthetic data is computer-generated data which follows certain statistical patterns, while real data comes from the real world and therefore often includes incorrect measurements or missing values. Using both synthetic and real data sets, we verify our ability to identify hidden relationships, discover large groups of similar records, automatically detect anomalies, etc.
To obtain synthetic data sets, the main ingredient is a set of random generators, each following a different distribution (ex: linear, gaussian, gamma, etc.). For example, our gaussian generator uses the Box-Muller transform to generate random data. Next, each random generator can be constrained to always return values within a certain range. For example, if we want to generate synthetic values representing ages, we may want all values to be positive integers. To generate discrete values, we simply map ranges of numeric values to discrete equivalents. For example, we could map numeric values in the 0.0-0.5 range to a “yes” value, and values in the 0.5-1.0 range to a “no” value. Finally, to simulate complex interdependence between variables, we specify a set of influence functions using a dependence graph. For example, we could define an influence function which increases or decreases income based on specific age + region combinations.
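The recipe above can be sketched as follows. This is a minimal illustration in our own words, not Data Applied’s actual generator code: the function names and the specific influence rule are hypothetical. It shows a Box-Muller gaussian generator, a range constraint, a numeric-to-discrete mapping, and a simple influence function combining age and region.

```python
import math
import random

def gaussian(mean=0.0, stddev=1.0):
    # Box-Muller transform: two uniform deviates -> one normal deviate.
    u1 = 1.0 - random.random()  # shift to (0, 1] so log(u1) is defined
    u2 = random.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mean + stddev * z

def constrained(value, low, high, as_int=False):
    # Constrain a generated value to a range, e.g. ages must be
    # positive integers.
    value = max(low, min(high, value))
    return int(round(value)) if as_int else value

def discretize(value, mapping):
    # Map numeric ranges to discrete equivalents,
    # e.g. 0.0-0.5 -> "yes", 0.5-1.0 -> "no".
    for (low, high), label in mapping.items():
        if low <= value < high:
            return label
    return None

def income_for(age, region):
    # Influence function: income depends on an age + region combination.
    base = gaussian(mean=40000, stddev=8000)
    if region == "urban" and age > 40:
        base *= 1.25
    return constrained(base, 0, 500000, as_int=True)

# Generate one synthetic record.
age = constrained(gaussian(mean=35, stddev=12), 0, 120, as_int=True)
row = {
    "age": age,
    "subscribed": discretize(random.random(),
                             {(0.0, 0.5): "yes", (0.5, 1.0): "no"}),
    "income": income_for(age, "urban"),
}
print(row)
```

Because every record is produced from known distributions and known influence functions, the “hidden” relationships in a synthetic data set are known in advance, which is what makes it useful for verifying that an algorithm recovers them.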
This post is to tell you about a new exciting feature we’ve been working on and which is almost ready: data transformation. Successful data analysis starts with clean data, and this new feature will make preparing and cleansing data easier than ever before.
Using our new web client editor, users will be able to specify simple step-based transformation rules to prepare data. Of course, we’ve followed our number one principle: if you can do something using our Web client, then you can also do it programmatically using our XML web API. Therefore, developers will also be able to create data transformation tasks, and schedule them for background execution.
Another design principle we’ve followed is: if you know how to write a Microsoft Outlook rule, then you’ll know how to use the new data transformation feature. As a result, the feature will empower most users to filter data sets, handle missing or extreme values, take random samples, or set fields to calculated values. Below are some of the built-in transformation rules:
- Copy existing data sets
- Create, delete, rename, or convert fields
- Filter rows with multiple and/or criteria
- Take a random sample of rows
- Set fields to complex formulas
- Scramble data to anonymize its content
- Rank rows using multi-level ordering
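To make the Outlook-rule analogy concrete, here is a rough sketch of what step-based transformation rules look like. This is our own pseudo-implementation for illustration, not the actual feature or its API: each rule is just a function applied in order to a data set held as a list of dicts.

```python
import random

def filter_rows(predicate):
    # Rule: keep only rows matching the given criteria.
    return lambda rows: [r for r in rows if predicate(r)]

def sample_rows(fraction, seed=0):
    # Rule: take a (reproducible) random sample of rows.
    def step(rows):
        rng = random.Random(seed)
        return [r for r in rows if rng.random() < fraction]
    return step

def set_field(name, formula):
    # Rule: set a field to a calculated value on every row.
    return lambda rows: [{**r, name: formula(r)} for r in rows]

def run_pipeline(rows, steps):
    # Apply each transformation rule in order.
    for step in steps:
        rows = step(rows)
    return rows

data = [{"age": 25, "income": 30000}, {"age": 52, "income": 80000}]
result = run_pipeline(data, [
    filter_rows(lambda r: r["age"] >= 30),
    set_field("bracket", lambda r: "high" if r["income"] > 50000 else "low"),
])
print(result)  # [{'age': 52, 'income': 80000, 'bracket': 'high'}]
```

The same pipeline of steps could equally be built through a web editor or submitted programmatically and scheduled for background execution, which is the symmetry the post describes between the web client and the XML Web API.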
Let’s face it: squeezing the most information out of your data requires more than just executive dashboards and pretty reports. Only more advanced techniques such as data mining can automate the process of going over all possible combinations and transform data into actionable insights. Two key obstacles remain however: cost and complexity. Like us, Data Mining Tools.NET is working hard to address these two issues. We like their tag line: watch, learn, mine.
Their company is based in India, and provides free videos and tutorials about a number of data mining products, including Data Applied (see videos or tutorials), InfoChimps (see this), WEKA, etc. They also offer a subscription-based program allowing members to prepare for certification exams. They’re adding new videos at a furious pace, so we recommend visiting them often to check for new content. For example, one of their developers recently contacted us regarding our API, and just posted a new tutorial on how to use our data crunching Web API from Python.
The data mining industry is changing rapidly. Data sets are becoming more affordable and easier to find, while analysis software is becoming more powerful and easier to use. And since you’re still reading this, check out our new training page: http://data-applied.com/Web/Support/Training.aspx!
James Taylor published a “First Look” blog post about Data Applied, which you can read here. Jumping to the end:
The product is visually very appealing and looks very easy to use – delivering data mining results without a need for a lot of data mining know-how.
James is an industry analyst for products and services related to decision management, including data mining, enterprise optimization, and business intelligence. You may be wondering: what does an industry analyst do exactly? Essentially, an industry analyst monitors an industry by watching product releases, keeping track of company news, evaluating products, etc. Therefore, industry analysts are in a unique position to see the big picture, and provide a broad perspective on a specific market. In addition, industry analysts usually offer retainer-based services which include:
- Notifying clients of important trends and news relevant to them
- Providing networking opportunities with other companies
- Writing white papers, evaluating competitors’ products, recommending pricing strategies, etc.
We’re too small to hire an industry analyst right now, but after talking to James, we now understand how useful one could be in the future.