One thing is for sure: data mining is expensive. For example, one study puts the average initial annual cost of an SPSS deployment at $342,061 (including $153,500 in licensing fees). Another study puts annual SAS licensing fees alone at $123,669 (for a quad-core machine, excluding training and consulting fees). So while the cost of deploying data mining solutions remains extremely high, we think large companies tolerate it because of the resulting ROI. Using predictive analytics to achieve a 0.5% decrease in churn rates (or a 0.5% increase in campaign response) may produce savings high enough to quickly offset the initial investment. That said, does it make sense to pay this much for data mining? Licensing models that charge more per processor seem a bit ridiculous at a time when even laptops come with dual or quad cores.
1. Same algorithms, different prices:
When we decided to start promoting our data mining solution on the web, we experimented with Facebook’s and Google’s ad programs. Google offers a pay-per-click (PPC) model, where advertisers bid on search keywords, with some geographic and demographic preferences allowed. Facebook offers both a pay-per-click (PPC) and a pay-per-impression (PPM) model, where advertisers bid for an audience defined by location, age, workplace, and interests.
We found several articles claiming that conversion rates are much lower with Facebook ads. This makes sense: on Facebook, distracted users stumble on ads and click on them when bored. With Google ads, by contrast, users have entered specific keywords and are actively looking for something. Still, we feel that Facebook advertising makes more sense for us. Here is why:
1. Some audiences are hard to reach:
One of our target groups is data mining experts. Because they are experts in their field, they are extremely unlikely to search for “data mining” on Google. Rather, these experts are more likely to search for specific keywords (ex: “direct hashing and pruning association mining”), which are difficult to identify. Only Facebook, with its interest-based targeting, gives us an opportunity to reach this type of audience.
You may wonder how we test our data mining algorithms to ensure they correctly turn data into insights. For verification purposes, we use two types of data sets: synthetic and real. Synthetic data is computer-generated data which follows certain statistical patterns, while real data comes from the real world and therefore often includes incorrect measurements or missing values. Using both synthetic and real data sets, we verify our ability to identify hidden relationships, discover large groups of similar records, automatically detect anomalies, etc.
To obtain synthetic data sets, the main ingredient is a set of random generators, each following a different distribution (ex: linear, Gaussian, gamma, etc.). For example, our Gaussian generator uses the Box-Muller transform to generate random data. Next, each random generator can be constrained to always return values within a certain range. For example, if we want to generate synthetic values representing ages, we may want all values to be positive integers. To generate discrete values, we simply map ranges of numeric values to discrete equivalents. For example, we could map numeric values in the 0.0-0.5 range to a “yes” value, and values in the 0.5-1.0 range to a “no” value. Finally, to simulate complex interdependence between variables, we specify a set of influence functions using a dependence graph. For example, we could define an influence function which increases or decreases income based on specific age + region combinations.
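To make the ingredients above concrete, here is a minimal sketch in Python of how such a synthetic data generator could be built. This is an illustration, not our actual implementation: the function names, the rejection-sampling approach to range constraints, and the sample influence function are all assumptions for the example.

```python
import math
import random

def gaussian(mean=0.0, stdev=1.0):
    """One Gaussian sample via the Box-Muller transform."""
    # 1 - random() lies in (0, 1], so log() is always defined
    u1, u2 = 1.0 - random.random(), random.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mean + stdev * z

def constrained(generator, low, high):
    """Constrain a generator to [low, high] by rejection sampling."""
    while True:
        value = generator()
        if low <= value <= high:
            return value

def discretize(value, mapping):
    """Map numeric ranges to discrete labels.
    mapping is a list of (upper_bound, label) pairs, e.g.
    [(0.5, "yes"), (1.0, "no")]."""
    for upper, label in mapping:
        if value <= upper:
            return label
    return mapping[-1][1]

def influence_income(base_income, age, region):
    """Hypothetical influence function from a dependence graph:
    income rises with age, but only in some regions."""
    bump = 1000 * max(0, age - 25) if region == "urban" else 0
    return base_income + bump

# A positive-integer "age" field constrained to 1..100
age = int(constrained(lambda: gaussian(mean=40, stdev=15), 1, 100))

# A discrete yes/no field mapped from a uniform value
answer = discretize(random.random(), [(0.5, "yes"), (1.0, "no")])
```

A full generator would chain these pieces: draw base values per field, apply range constraints, run the influence functions in dependence-graph order, then discretize where needed.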
What does it mean to be a garage startup, especially when your goal is to crunch massive amounts of data? In our case, we mean this quite literally, and since a picture is worth a thousand words, take a peek. Obviously, we’re also using computers hosted by third-party providers whose data centers offer full redundancy and physical data protection. So keep in mind that what you’re seeing below is used only internally and to support our free plan.
This said, it doesn’t have to be on-premise computing vs. cloud computing. We’ve found that it actually makes sense to combine the two. For example, local computing can handle the regular, predictable load, while cloud computing can accommodate overflow load. Cloud computing is a bit like renting movies: it’s cheap to rent one, as long as you don’t keep it for too long.
This post is to tell you about an exciting new feature we’ve been working on, which is almost ready: data transformation. Successful data analysis starts with clean data, and this new feature will make preparing and cleansing data easier than ever before.
Using our new web client editor, users will be able to specify simple step-based transformation rules to prepare data. Of course, we’ve followed our number one principle: if you can do something using our Web client, then you can also do it programmatically using our XML web API. Therefore, developers will also be able to create data transformation tasks, and schedule them for background execution.
Another design principle we’ve followed is: if you know how to write a Microsoft Outlook rule, then you’ll know how to use the new data transformation feature. As a result, the feature will empower most users to filter data sets, handle missing or extreme values, take random samples, or set fields to calculated values. Below are some of the built-in transformation rules:
- Copy existing data sets
- Create, delete, rename, or convert fields
- Filter rows with multiple and/or criteria
- Take a random sample of rows
- Set fields to complex formulas
- Scramble data to anonymize its content
- Rank rows using multi-level ordering
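To give a feel for the step-based model described above, here is a hypothetical sketch in Python of how a few such rules could compose into a pipeline. None of these function names come from our API; they are illustrative assumptions showing how each rule transforms a list of rows and hands the result to the next step.

```python
import random

# Each "rule" is a function from a list of row dicts to a new list of rows.
def filter_rows(predicate):
    """Keep only rows matching the predicate (a filter rule)."""
    return lambda rows: [r for r in rows if predicate(r)]

def sample_rows(fraction, seed=42):
    """Keep roughly `fraction` of rows (a random-sample rule)."""
    def step(rows):
        rng = random.Random(seed)
        return [r for r in rows if rng.random() < fraction]
    return step

def set_field(name, formula):
    """Set a field to a calculated value (a formula rule)."""
    return lambda rows: [{**r, name: formula(r)} for r in rows]

def run_pipeline(rows, steps):
    """Apply each rule in order, like an ordered list of Outlook rules."""
    for step in steps:
        rows = step(rows)
    return rows

data = [
    {"age": 25, "income": 40000},
    {"age": 60, "income": 80000},
    {"age": -1, "income": 30000},  # an incorrect measurement
]

clean = run_pipeline(data, [
    filter_rows(lambda r: r["age"] > 0),  # drop invalid rows
    set_field("bracket", lambda r: "high" if r["income"] > 50000 else "low"),
])
```

In the actual product, the same kind of pipeline would be defined in the web client editor or submitted as a data transformation task through the XML web API, then scheduled for background execution.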
Let’s face it: squeezing the most information out of your data requires more than just executive dashboards and pretty reports. Only more advanced techniques such as data mining can automate the process of going over all possible combinations and transform data into actionable insights. Two key obstacles remain however: cost and complexity. Like us, Data Mining Tools.NET is working hard to address these two issues. We like their tag line: watch, learn, mine.
Their company is based in India, and provides free videos and tutorials about a number of data mining products, including Data Applied (see videos or tutorials), InfoChimps (see this), WEKA, etc. They also offer a subscription-based program allowing members to prepare for certification exams. They’re adding new videos at a furious pace, so we recommend visiting them often to check for new content. For example, one of their developers recently contacted us regarding our API, and just posted a new tutorial on how to use our data crunching Web API using Python.
The data mining industry is changing rapidly. Data sets are becoming more affordable and easier to find, while analysis software is becoming more powerful and easier to use. And since you’re still reading this, check out our new training page: http://data-applied.com/Web/Support/Training.aspx!
James Taylor published a “First Look” blog post about Data Applied, which you can read here. Jumping to the end:
The product is visually very appealing and looks very easy to use – delivering data mining results without a need for a lot of data mining know-how.
James is an industry analyst for products and services related to decision management, including data mining, enterprise optimization, and business intelligence. You may be wondering: what does an industry analyst do exactly? Essentially, an industry analyst monitors an industry by watching product releases, keeping track of company news, evaluating products, etc. Industry analysts are therefore in a unique position to see the big picture, and to provide a broad perspective on a specific market. In addition, industry analysts usually offer retainer-based services which include:
- Notifying clients of important trends and news relevant to them
- Providing networking opportunities with other companies
- Writing white papers, evaluating competitors’ products, recommending pricing strategies, etc.
We’re too small to hire an industry analyst right now, but after talking to James, we now understand how useful one could be in the future.