We’re excited to tell you about a new feature we’ve been working on that is almost ready: data transformation. Successful data analysis starts with clean data, and this new feature will make preparing and cleansing data easier than ever before.
Using our new web client editor, users will be able to specify simple step-based transformation rules to prepare data. Of course, we’ve followed our number one principle: if you can do something using our Web client, then you can also do it programmatically using our XML web API. Therefore, developers will also be able to create data transformation tasks, and schedule them for background execution.
Another design principle we’ve followed is: if you know how to write a Microsoft Outlook rule, then you’ll know how to use the new data transformation feature. As a result, the feature will empower most users to filter data sets, handle missing or extreme values, take random samples, or set fields to calculated values. Below are some of the built-in transformation rules:
- Copy existing data sets
- Create, delete, rename, or convert fields
- Filter rows with multiple and/or criteria
- Take a random sample of rows
- Set fields to complex formulas
- Scramble data to anonymize its content
- Rank rows using multi-level ordering
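To make the idea of step-based rules more concrete, here is a minimal sketch in plain Python. It is not our editor or our XML web API: every function name, field name, and sample row below is hypothetical and chosen only for illustration. It shows how a few of the rules above (filtering with and/or criteria, setting a calculated field, scrambling, multi-level ranking) can be expressed as small steps applied in order.

```python
import random

# Hypothetical helpers for illustration only; not the Data Applied API.

def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]

def sample_rows(rows, k, seed=0):
    """Take a reproducible random sample of at most k rows."""
    return random.Random(seed).sample(rows, min(k, len(rows)))

def set_field(rows, name, formula):
    """Return copies of the rows with field `name` set to formula(row)."""
    return [{**row, name: formula(row)} for row in rows]

def scramble_field(rows, name, seed=0):
    """Shuffle one field across rows so values no longer line up with their original rows."""
    values = [row[name] for row in rows]
    random.Random(seed).shuffle(values)
    return [{**row, name: value} for row, value in zip(rows, values)]

def rank_rows(rows, keys):
    """Order rows by several (field, descending) keys and add a 1-based 'rank' field."""
    ordered = list(rows)
    # Sort by the least significant key first; Python's sort is stable.
    for field, descending in reversed(keys):
        ordered.sort(key=lambda r: r[field], reverse=descending)
    return [{**row, "rank": i} for i, row in enumerate(ordered, start=1)]

def run_pipeline(rows, steps):
    """Apply each transformation step in order, like a list of rules."""
    for step in steps:
        rows = step(rows)
    return rows

# Made-up sample data.
rows = [
    {"region": "west", "sales": 120, "returns": 4},
    {"region": "east", "sales": 80, "returns": 10},
    {"region": "west", "sales": 200, "returns": None},
    {"region": "north", "sales": 200, "returns": 2},
]

steps = [
    # Drop rows with missing values, then keep "west" OR sales of at least 100.
    lambda rs: filter_rows(
        rs,
        lambda r: r["returns"] is not None
        and (r["region"] == "west" or r["sales"] >= 100),
    ),
    # Set a field to a calculated value.
    lambda rs: set_field(rs, "net", lambda r: r["sales"] - r["returns"]),
    # Rank by net (highest first), then by region alphabetically.
    lambda rs: rank_rows(rs, [("net", True), ("region", False)]),
]

result = run_pipeline(rows, steps)
print(result)
```

Each step takes a list of rows and returns a new list, so steps can be reordered, removed, or scheduled without touching the others, which is the same property that makes rule lists easy to reason about.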
Let’s face it: squeezing the most information out of your data requires more than just executive dashboards and pretty reports. Only more advanced techniques such as data mining can automate the process of going over all possible combinations and transforming data into actionable insights. Two key obstacles remain, however: cost and complexity. Like us, Data Mining Tools.NET is working hard to address these two issues. We like their tagline: watch, learn, mine.
Their company is based in India, and provides free videos and tutorials about a number of data mining products, including Data Applied (see videos or tutorials), InfoChimps (see this), WEKA, etc. They also offer a subscription-based program allowing members to prepare for certification exams. They’re adding new videos at a furious pace, so we recommend visiting them often to check for new content. For example, one of their developers recently contacted us regarding our API, and just posted a new tutorial on how to use our data-crunching Web API from Python.
The data mining industry is changing rapidly. Data sets are becoming more affordable and easier to find, while analysis software is becoming more powerful and easier to use. And since you’re still reading this, check out our new training page: http://data-applied.com/Web/Support/Training.aspx!
James Taylor published a “First Look” blog post about Data Applied, which you can read here. Jumping to the end:
The product is visually very appealing and looks very easy to use – delivering data mining results without a need for a lot of data mining know-how.
James is an industry analyst for products and services related to decision management, including data mining, enterprise optimization, and business intelligence. You may be wondering: what does an industry analyst do exactly? Essentially, an industry analyst monitors an industry by watching product releases, keeping track of company news, evaluating products, etc. Therefore, industry analysts are in a unique position to see the big picture, and provide a broad perspective on a specific market. In addition, industry analysts usually offer retainer-based services which include:
- Notifying clients of important trends and news that are relevant to them
- Providing networking opportunities with other companies
- Writing white papers, evaluating competitors’ products, recommending pricing strategies, etc.
We’re too small to hire an industry analyst right now, but after talking to James, we now understand how useful it could be in the future.
Our technology might be on the edge, but we envision a world where the world’s data is routinely analyzed using more powerful methods than simple pie chart reporting and executive dashboards.
What kind of data? Survey data, marketing data, sales data, inventory data, employee data, engineering data, social data, salesforce data, ebay data – any type of data!
At Data Applied, we use Google analytics to monitor incoming traffic to our website. We then crunch this data using our own product to extract more meaningful information than we would using Google’s UI alone. For example, we use clustering to automatically categorize visits into different groups, based on characteristics such as visit duration, page views, or location. We use association rule mining to identify hidden associations between visit time, keywords, and network names. We perform outlier detection to get a list of visits which may be out of the ordinary. Finally, we use our own super pivots to better visualize this information. In fact, if you have already created a free account, you should notice a new (anonymized) clickstream dataset we uploaded into the demo workspace.
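As a rough illustration of the outlier detection step, here is a minimal Python sketch that flags visits with an unusual duration. It is not the algorithm our product uses: it swaps in a simple, well-known technique (the modified z-score based on the median absolute deviation), and the `duration` field, the 3.5 threshold, and the sample visits are all assumptions made up for this example.

```python
from statistics import median

def mad_outliers(visits, field="duration", threshold=3.5):
    """Return visits whose `field` value has a modified z-score above `threshold`.

    Uses the median absolute deviation (MAD), which is less sensitive
    to extreme values than the mean and standard deviation.
    """
    values = [visit[field] for visit in visits]
    center = median(values)
    mad = median([abs(value - center) for value in values])
    if mad == 0:
        # At least half the values equal the median; the score is undefined here.
        return []
    return [
        visit
        for visit in visits
        if 0.6745 * abs(visit[field] - center) / mad > threshold
    ]

# Made-up visits (duration in seconds), for illustration only.
visits = [
    {"id": 1, "duration": 30},
    {"id": 2, "duration": 45},
    {"id": 3, "duration": 50},
    {"id": 4, "duration": 40},
    {"id": 5, "duration": 35},
    {"id": 6, "duration": 900},
]

flagged = mad_outliers(visits)
print([visit["id"] for visit in flagged])
```

A real clickstream analysis would look at several characteristics at once (page views, location, keywords), but the single-field version shows the basic idea: measure how far each visit sits from what is typical, and list the ones that sit too far.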
Regarding network names, there is some type of asymmetric information warfare at play here. Because we are a small startup, we do not have the luxury of maintaining a private network. This means that, when we connect to any website, we appear under a generic network name (e.g., “Comcast customer”). The same, however, does not apply to large Business Intelligence companies when they pay us a visit. Here is some summary clickstream data regarding recent visits to our web site. We’re publishing this information because we can, or more precisely because we find it interesting that we can but our visitors cannot. But of course, we’d still like to say thank you for paying us a visit!
|sas institute inc.||42||6.666666667||294.7142857|
PS: we reviewed Google Analytics terms of service to make sure it is ok to publish this type of information.
Update: for some unknown reason, we received a lot more visits (over 700 from Microsoft).
Perhaps you haven’t heard of Dr Sandro Saitta yet. He not only works on grid computing and analytics for FinScore, but also runs a successful data mining blog (http://www.dataminingblog.com) in his spare time. Here is his profile if you want to know more. We’re fans of his blog because it contains a lot of practical advice, including book recommendations.
Sandro recently invited us to write a guest post so we obliged. You’ll have to follow this link for more details, but in this post we discuss the broken promises of data mining, and how the community should respond. Let us know what you think!
We just launched and received some great press! Two disappointments, however:
- We wrote to the local press (TechFlash, XConomy) but they chose to ignore us
- We’re #1 for “data applied” on Bing.com and Yahoo.com, but not yet on Google.com
Here are some of the highlights…
ReadWriteWeb + New York Times:
It’s pretty hard to beat this combination in terms of audience! As ReadWriteWeb explains on their web site, they are among the top 10 blogs, while the New York Times is the largest metropolitan newspaper in the US. It’s awesome that the journalist (Marshall Kirkpatrick) was in a good mood, but also that he immediately understood what we were trying to do. These guys are also super efficient in terms of turnaround.