CSI: Data WarehouseCSI: Data Warehouse
Armed with a knowledge of patterns, you can root out bad data.
My wife and I love all the CSI police procedural dramas that are so popular now. The crime lab crew gets to the crime site knowing nothing about the situation and finds the bad guy in 60 minutes. How about a show called CSI: Data Warehouse on Tech TV?
Imagine that you walk into a client who has a large amount of data, and he wants to know if his data is real or fake. You don't know anything about his data, or even his industry. It turns out that data qua data actually has some patterns that are fairly easy to find in a modern database. Let me give a quick overview, without much mathematics, of some of the easy ones.
Seeing the Pattern
In the old days, before we had pocket calculators and cheap computers, engineers solved equations with slide rulers and books of mathematical tables. A slide ruler is accurate to three decimal places and most of the books were accurate to five or six places at least. (If you don't know what a slide ruler is, then Google it.)
In 1938, Dr. Frank Benford was a physicist at General Electric who noticed that pages in the logarithm tables that corresponded to numbers starting with the numeral one were getting worn out faster than the other pages. Most of us would have stopped at that point and not thought about it very much.
Dr. Benford could see no immediate reason that physicists and engineers would prefer logarithms starting with one. He started data mining unrelated sets of numbers. I mean really unrelated — geography, census data, baseball statistics, numbers in magazine articles, and just about anything he could find. After looking at 20,229 sets of numbers, he found that they all followed the same pattern he had seen in the logarithm tables.
The usual guess would be that all digits, one through nine, would be equally likely to pop up at the start of a string of digits. Nope, not true; Benford's Law says that it can be approximated by the formula P(d is first digit) = LOG10(1.0 + 1.0/d). The pattern is 30.1 percent for one, 17.6 percent for two, and down to 4.6 percent for nine. You can get some confirmation of this in "The First-Digit Phenomenon" by T. P. Hill (American Scientist, July-August 1998). Benford's Law gets better as the sample gets larger and more varied.
What makes Benford's Law useful to a data miner is that you don't have to understand the data. If the data drifts from the pattern, you know to look for a systematic bias or faked data. Like any statistic, it isn't a certainty, but it's a good place to start. In fact, there are fraud detection packages based on Benford's Law that look at patterns in expense reports and other financial data.
A Run of Luck
Another data mining trick is looking at runs. If you toss a fair coin 200 times, you're almost certain to encounter a run of six heads or six tails. But when people fake data, they don't like to repeat long runs of one value. This idea can be extended to multiple values and patterns of runs, but the coin toss is the simplest case and very easy to test.
We can generalize the idea of patterns over time. Given a jar of (n) numbered marbles, draw one out of the jar at random and put it back. How many draws do you have to make to have a better than 50 percent chance of getting a previous marble that was drawn (p) draws ago? You can make this scenario into customers who come back to the store, trucks that pull into the motor pool, or whatever. Most people guess (n/2), but it's much lower. The formula is a bit messy, but here's a quick table.
n | p |
---|---|
10 | 4 |
100 | 12 |
1000 | 37 |
10,000 | 118 |
100,000 | 372 |
In the real world, breaks in the data patterns are often the result of something systematic and not criminal. For example, the digit nine might show up a lot in an expense report sample because there's a $100.00 limit on something and people are padding their reports.
But the digit nine might also show up too often because the data warehouse handles missing data by filling codes that start with nines. Now we have an indication that we have a lot of missing data. We might want to look at this data and see if we need more detailed codes for kinds of missing data, if we need to improve data capture, or if we have to adjust the certainty of predictive reports to allow for this fact.
Joe Celko is an independent consultant in Austin, Texas and the author of Joe Celko's Trees and Hierarchies in SQL for Smarties (Morgan Kaufmann, 2004) and Joe Celko's SQL for Smarties: Advanced SQL Programming (Morgan Kaufmann, 1999).
About the Author
You May Also Like