## Hungry for data sets

You may wonder how we test our data mining algorithms to ensure they correctly turn data into insights. For verification purposes, we use two types of data sets: **synthetic** and **real**. Synthetic data is computer-generated and follows known statistical patterns, while real data comes from the real world and therefore often includes incorrect measurements or missing values. Using both synthetic and real data sets, we verify our ability to identify hidden relationships, discover large groups of similar records, automatically detect anomalies, and so on.

To obtain **synthetic data sets**, the main ingredient is a set of random generators, each following a different distribution (e.g., linear, Gaussian, gamma). For example, our Gaussian generator uses the Box-Muller transform, which turns pairs of uniformly distributed random numbers into normally distributed values. Next, each random generator can be constrained to always return values within a certain range. For example, if we want to generate synthetic values representing ages, we may want all values to be positive integers. To generate discrete values, we simply map ranges of numeric values to discrete equivalents. For example, we could map numeric values from 0.0 up to (but not including) 0.5 to a “yes” value, and values from 0.5 to 1.0 to a “no” value. Finally, to simulate complex interdependence between variables, we specify a set of influence functions using a dependence graph, so that the value generated for one variable can depend on the values of others. For example, we could define an influence function which increases or decreases income based on specific age and region combinations.
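
The steps above can be illustrated with a minimal Python sketch. It is not our production generator: the function names, the age bounds, the region labels, and the income factors are all hypothetical values chosen for illustration. It also clamps out-of-range values, which is only one possible way to constrain a generator, and it replaces the dependence graph with a fixed generation order.

```python
import math
import random


def box_muller(rng, mean=0.0, stddev=1.0):
    """Return one Gaussian sample built from two uniform samples (Box-Muller)."""
    u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) is defined
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mean + stddev * z


def constrained_age(rng, low=18, high=90):
    """Draw a Gaussian age, round it to an integer, and clamp it to [low, high].

    The bounds and the clamping strategy are assumptions for this sketch.
    """
    raw = box_muller(rng, mean=40.0, stddev=15.0)
    return min(high, max(low, round(raw)))


def to_yes_no(value):
    """Map a numeric value in [0.0, 1.0] to a discrete label."""
    return "yes" if value < 0.5 else "no"


def income_influence(age, region, base_income):
    """Hypothetical influence function: adjust income for age + region pairs."""
    factor = 1.0
    if age >= 45 and region == "north":
        factor = 1.2
    elif age < 25 and region == "south":
        factor = 0.8
    return base_income * factor


def generate_record(rng):
    """Generate the independent variables first, then the dependent one."""
    age = constrained_age(rng)
    region = rng.choice(["north", "south"])
    base_income = max(0.0, box_muller(rng, mean=50000.0, stddev=12000.0))
    return {
        "age": age,
        "region": region,
        "subscribed": to_yes_no(rng.random()),
        "income": round(income_influence(age, region, base_income), 2),
    }


rng = random.Random(42)  # fixed seed so the data set is reproducible
records = [generate_record(rng) for _ in range(1000)]
```

Because the generator is seeded, the same 1,000 records are produced on every run, which makes it easier to check whether an algorithm recovers the patterns that were deliberately planted in the data.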
