Data-snooping bias

In statistics, data-snooping bias is a form of statistical bias that arises from the misuse of data mining techniques and can lead to spurious results in scientific research. Although data-snooping bias can occur in any field that uses data mining, it is a particular concern in finance and medical research, both of which make heavy use of data mining techniques.

In the process of data mining, huge numbers of hypotheses about a single data set can be tested in a very short time, by exhaustively searching for combinations of variables that might show a correlation.

Because conventional tests of statistical significance are based on the probability that an observation arose by chance, it is reasonable to expect that about 5% of hypotheses with no real effect behind them will turn out to be significant at the 5% level, about 0.1% will turn out to be significant at the 0.1% level, and so on, simply by chance.

Thus, given enough hypotheses tested, it is virtually certain that some of them will appear to be highly statistically significant, even on a data set with no real correlations at all. Researchers who are using data mining techniques can be easily misled by these apparently significant results, even though they are merely chance artifacts.
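The arithmetic behind this can be sketched in a few lines of Python. The sketch assumes that the tests are independent and that every null hypothesis is true, so each p-value is modelled as a uniform random draw; the function names are illustrative and not taken from any library.

```python
import random


def prob_at_least_one_false_positive(num_tests: int, alpha: float) -> float:
    """Chance that at least one of num_tests independent tests of a true
    null hypothesis comes out significant at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_positives(num_tests: int, alpha: float, seed: int = 0) -> int:
    """Count 'significant' results when there is no real effect anywhere.

    Under a true null hypothesis a p-value is uniformly distributed on
    [0, 1], so each test is modelled as one uniform random draw.
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(num_tests) if rng.random() < alpha)


# The chance of at least one spurious "discovery" grows quickly with the
# number of hypotheses tested.
for num_tests in (1, 20, 100, 1000):
    chance = prob_at_least_one_false_positive(num_tests, 0.05)
    print(f"{num_tests:5d} tests: {chance:.4f}")

# Roughly 5% of 10,000 tests on pure noise still pass the 5% threshold.
print(simulate_false_positives(10_000, 0.05))
```

With 100 such tests at the 5% level, the chance of at least one false positive already exceeds 99%.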

Data-snooping bias most commonly occurs when researchers have not formed a hypothesis in advance, and are therefore open to any hypothesis the data happen to suggest; or when researchers narrow the data used in order to reduce the probability of the sample refuting a specific hypothesis.

Examples

Example 1: Hypothesis suggested by data

In a list of 367 people, at least two will have the same day and month of birth. Suppose Mary and John both celebrate birthdays on August 7.

Data snooping would, by design, try to find additional similarities between Mary and John, such as:

* Are they the youngest and the oldest persons in the list?
* Have they met in person once? Twice? Three times?
* Do their fathers have the same first name, or their mothers the same maiden name?

By going through hundreds or thousands of potential similarities between John and Mary, each having a low probability of being true, we may eventually find apparent support for virtually any hypothesis.

Perhaps John and Mary are the only two persons in the list who switched minors three times in college, a fact we found out by exhaustively comparing their life histories. Our data-snooped hypothesis can then become, "People born on August 7 have a much higher chance of switching minors more than twice in college."

The data itself very strongly supports that correlation, since no one with a different birthday had switched minors three times in college.

However, when we turn to a larger sample of the general population and attempt to reproduce the results, we find that there is no statistical correlation between August 7 birthdays and changing college minors more than twice. The "fact" exists only for a very small, specific sample, not for the public as a whole.
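The whole example can be imitated with random data. The following Python sketch rests on assumptions that are not in the example itself: 2,000 candidate traits, each held independently by 10% of people and unrelated to birthdays, and a fresh population of 200,000 for the replication step. The names `find_shared_birthday_pair`, `snoop_shared_traits` and `replicate` are hypothetical.

```python
import random

DAYS_IN_YEAR = 366   # includes 29 February, matching the 367-person argument
BASE_RATE = 0.1      # assumed chance that any one person has any one trait


def find_shared_birthday_pair(birthdays):
    """Return indexes (i, j) of the first two people with the same birthday."""
    seen = {}
    for index, day in enumerate(birthdays):
        if day in seen:
            return seen[day], index
        seen[day] = index
    return None


def snoop_shared_traits(traits_a, traits_b):
    """List every trait index that both people happen to have."""
    return [t for t, (a, b) in enumerate(zip(traits_a, traits_b)) if a and b]


def replicate(day, population_size, rng):
    """Trait rate among people born on `day` in a fresh, independent population."""
    born_on_day = 0
    with_trait = 0
    for _ in range(population_size):
        if rng.randrange(DAYS_IN_YEAR) == day:
            born_on_day += 1
            with_trait += rng.random() < BASE_RATE
    return with_trait / born_on_day if born_on_day else float("nan")


rng = random.Random(7)

# With 367 people and 366 possible birthdays, a shared birthday is guaranteed.
birthdays = [rng.randrange(DAYS_IN_YEAR) for _ in range(367)]
mary, john = find_shared_birthday_pair(birthdays)

# Snooping step: compare the pair on thousands of unrelated traits.
num_traits = 2000
mary_traits = [rng.random() < BASE_RATE for _ in range(num_traits)]
john_traits = [rng.random() < BASE_RATE for _ in range(num_traits)]
shared = snoop_shared_traits(mary_traits, john_traits)

# Replication step: the trait rate for that birthday in new data.
replication_rate = replicate(birthdays[mary], 200_000, rng)

print(f"traits shared by the pair: {len(shared)} of {num_traits}")
print(f"trait rate for that birthday in a fresh population: {replication_rate:.3f}")
```

In the small list, every shared trait occurs in 100% of the people born on that day; in the fresh population the rate falls back toward the assumed 10% base rate, because the traits were generated independently of birthdays.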

Example 2: Narrow sample to match hypothesis

Suppose medical researchers examine a pool of data representing 10,000 lung cancer patients. They want to find information that suggests non-smokers who develop lung cancer have a better chance of survival than smokers with lung cancer.

The researchers notice that 90 percent of the patients (9,000) smoked cigarettes. Of these smokers, 4 percent (360 people) went into remission with no chemotherapy.

Of the 10 percent (1,000) of patients who were not smokers, 40 people (4 percent) also went into remission with no chemotherapy.

The data, as it stands, suggests that smokers are as likely as non-smokers to go into remission without chemotherapy. But the result is not what the researchers desire, so they reduce the sample size to 1,000 patients, to see if that produces different results.

The new data retains the 90 percent smoker rate (900). In this sample, 36 smokers (4 percent) go into remission without chemotherapy.

However, the new sample of non-smoking patients (100) retains 16 of the 40 people from the original sample who went into remission without chemotherapy. That is 16 percent of the new non-smoking sample.

The researchers therefore claim that non-smokers with lung cancer are four times more likely to go into remission without chemotherapy than smokers are.

By reducing the sample size without regard to statistical significance, after the original sample suggested there is no difference in untreated remission rates, the researchers have produced numbers that seem to bear out the desired result.
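The figures in this example can be checked with a short Python sketch. It also asks a question the example leaves implicit: if the 100 non-smokers had been drawn at random from the original 1,000, how likely is it that 16 or more of the 40 remission cases would be retained? The hypergeometric model used for that question is an assumption added here, and `hypergeom_tail` is an illustrative helper rather than a library function.

```python
from math import comb


def rate(remissions: int, patients: int) -> float:
    """Fraction of patients who went into remission."""
    return remissions / patients


def hypergeom_tail(population: int, successes: int, draws: int, at_least: int) -> float:
    """P(X >= at_least) when `draws` items are taken at random, without
    replacement, from `population` items of which `successes` are marked."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(at_least, upper + 1)
    )
    return favourable / total


# Original sample of 10,000 patients.
full_smokers = rate(360, 9_000)
full_non_smokers = rate(40, 1_000)

# Reduced sample of 1,000 patients.
reduced_smokers = rate(36, 900)
reduced_non_smokers = rate(16, 100)

print(f"full sample:    smokers {full_smokers:.2%}, non-smokers {full_non_smokers:.2%}")
print(f"reduced sample: smokers {reduced_smokers:.2%}, non-smokers {reduced_non_smokers:.2%}")
print(f"claimed ratio:  {reduced_non_smokers / reduced_smokers:.1f}")

# Chance of keeping 16 or more of the 40 remission cases in a truly random
# draw of 100 of the 1,000 non-smokers (about 4 would be expected).
p_random = hypergeom_tail(1_000, 40, 100, 16)
print(f"probability under random subsampling: {p_random:.2e}")
```

Under that assumption the probability is far below one in a hundred thousand, which indicates that the reduced sample behaves like a selected one rather than a random one.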

External links

* A bibliography on data-snooping bias: http://data-snooping.martinsewell.com/


Wikimedia Foundation. 2010.