Hungry for data sets
You may wonder how we test our data mining algorithms to ensure they correctly turn data into insights. For verification purposes, we use two types of data sets: synthetic and real. Synthetic data is computer-generated data which follows certain statistical patterns, while real data comes from the real world and therefore often includes incorrect measurements or missing values. Using both synthetic and real data sets, we verify our ability to identify hidden relationships, discover large groups of similar records, automatically detect anomalies, etc.
The main ingredient of our synthetic data sets is a set of random generators, each following a different distribution (e.g., linear, Gaussian, gamma). For example, our Gaussian generator uses the Box-Muller transform, which turns uniformly distributed random numbers into normally distributed ones. Next, each random generator can be constrained to always return values within a certain range and of a certain type. For example, if we want to generate synthetic values representing ages, we may want all values to be positive integers. To generate discrete values, we simply map ranges of numeric values to discrete equivalents. For example, we could map numeric values in the 0.0-0.5 range to a “yes” value, and values in the 0.5-1.0 range to a “no” value. Finally, to simulate complex interdependence between variables, we specify a set of influence functions using a dependence graph. For example, we could define an influence function which increases or decreases income for specific combinations of age and region.
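The steps above can be sketched in a few lines of Python. This is only an illustration, not our actual generator: the function names, the age and income parameters, the region labels, and the multipliers are all hypothetical, and the range constraint is implemented here by redrawing out-of-range values, which is one option among several (clamping is another).

```python
import math
import random


def box_muller(rng, mean=0.0, stddev=1.0):
    """Return one Gaussian sample using the basic Box-Muller transform."""
    u1 = 1.0 - rng.random()  # shift to (0, 1] so that log(u1) is defined
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mean + stddev * z


def constrained(sample, low, high, max_tries=1000):
    """Redraw from sample() until the value falls within [low, high]."""
    for _ in range(max_tries):
        value = sample()
        if low <= value <= high:
            return value
    raise ValueError("no value in range after max_tries draws")


def to_yes_no(x):
    """Map numeric values below 0.5 to "yes" and the rest to "no"."""
    return "yes" if x < 0.5 else "no"


def income_influence(age, region):
    """Hypothetical influence function: income multiplier for age + region."""
    factor = 1.0
    if 35 <= age <= 55:
        factor *= 1.3  # assumed boost for mid-career ages
    if region == "north" and age < 25:
        factor *= 0.8  # assumed penalty for one age + region combination
    return factor


def generate_record(rng):
    """Generate one record; income depends on age and region."""
    age = int(round(constrained(lambda: box_muller(rng, 40, 15), 18, 90)))
    region = rng.choice(["north", "south"])
    base_income = constrained(
        lambda: box_muller(rng, 50000, 12000), 10000, 200000
    )
    income = round(base_income * income_influence(age, region), 2)
    subscribed = to_yes_no(rng.random())
    return {
        "age": age,
        "region": region,
        "income": income,
        "subscribed": subscribed,
    }


rng = random.Random(42)  # fixed seed so a test data set can be regenerated
records = [generate_record(rng) for _ in range(5)]
```

Here the dependence graph is reduced to its simplest form: two independent variables (age and region) feeding one dependent variable (income). A fuller version would store the graph explicitly and evaluate variables in dependency order.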
To obtain real data sets, we mostly rely on third parties, but we also use real data that we generate ourselves. For example, using any system performance monitor, we can quickly collect large amounts of data about memory usage, disk access, network access, etc. We can also analyze web logs from our own web site. For third-party data, we can run simple web searches restricted to files in CSV or Excel format. Another way is to retrieve data sets from data banks such as the University of California, Irvine’s Machine Learning Repository. Because these data sets are often used in data mining research projects, we can compare our results with those of others. Finally, if you’re as hungry for data as we are, you ought to look at next-generation data marketplaces such as InfoChimps.org. Companies such as Infochimps are creating a new data-driven economy where data sets can be exchanged, purchased, or sold.