Data dredging (data fishing, data snooping) is the inappropriate (sometimes deliberately so) use of data mining to uncover misleading relationships in data. Data-snooping bias is a form of statistical bias that arises from this misuse of statistics. Any relationships found might appear to be valid within the test set but they would have no statistical significance in the wider population.
Data dredging and data-snooping bias can occur when researchers either do not form a hypothesis in advance or narrow the data used to reduce the probability of the sample refuting a specific hypothesis. Although data-snooping bias can occur in any field that uses data mining, it is of particular concern in finance and medical research, both of which make heavy use of data mining techniques.
The process of data mining involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching for combinations of variables that might show a correlation. Conventional tests of statistical significance are based on the probability that an observation arose by chance, and necessarily accept some risk of mistaken test results, called the significance. When large numbers of tests are performed, it is always expected that some will produce false results, hence 5% of randomly chosen hypotheses will turn out to be significant at the 5% level, 1% will turn out to be significant at the 1% significance level, and so on, by chance alone.
If enough hypotheses are tested, it is virtually certain that some will falsely appear to be statistically significant, since every data set with any degree of randomness contains some bogus correlations. Researchers using data mining techniques can be easily misled by these apparently significant results, even though they are mere artifacts of random variation.
Circumventing the traditional scientific approach by conducting an experiment without a hypothesis can lead to premature conclusions. Data mining can be used negatively to seek more information from a data set than it actually contains. Failure to adjust existing statistical models when applying them to new datasets can also result in the occurrences of new patterns between different attributes that would otherwise have not shown up. Overfitting, oversearching, overestimation, and attribute selection errors are all actions that can lead to data dredging.
Types of problem
Drawing conclusions from data
The conventional frequentist statistical hypothesis testing procedure is to formulate a research hypothesis, such as "people in higher social classes live longer", then collect relevant data, followed by carrying out a statistical significance test to see whether the results could be due to the effects of chance. (The last step is called testing against the null hypothesis).
A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set will contain some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same population, it is impossible to determine if the patterns found are chance patterns. See testing hypotheses suggested by the data.
Here is a simplistic example. Throwing five coins, with a result of 2 heads and 3 tails, might lead one to ask why the coin favors tails by fifty percent. On the other hand, forming the hypothesis might lead one to conclude that only a 5-0 or 0-5 result would be very surprising, since the odds are 93.75% against this happening by chance.
As a more visual example, on a cloudy day, try the experiment of looking for figures in the clouds; if one looks long enough one may see castles, cattle, and all sort of fanciful images; but the images are not really in the clouds, as can be easily confirmed by looking at other clouds.
It is important to realize that the alleged statistical significance here is completely spurious – significance tests do not protect against data dredging. When testing a data set on which the hypothesis is known to be true, the data set is by definition not a representative data set, and any resulting significance levels are meaningless.
Hypothesis suggested by non-representative data
In a list of 367 people, at least two will have the same day and month of birth. Suppose Mary and John both celebrate birthdays on August 7.
Data snooping would, by design, try to find additional similarities between Mary and John, such as:
- Are they the youngest and the oldest persons in the list?
- Have they met in person once? Twice? Three times?
- Do their fathers have the same first name, or mothers have the same maiden name?
By going through hundreds or thousands of potential similarities between John and Mary, each having a low probability of being true, we may eventually find proof of virtually any hypothesis.
Perhaps John and Mary are the only two persons in the list who switched minors three times in college, a fact we found out by exhaustively comparing their lives' histories. Our data-snooping bias hypothesis can then become, "People born on August 7 have a much higher chance of switching minors more than twice in college."
The data itself very strongly supports that correlation, since no one with a different birthday had switched minors three times in college.
However, when we turn to the larger sample of the general population and attempt to reproduce the results, we find that there is no statistical correlation between August 7 birthdays and changing college minors more than once. The "fact" exists only for a very small, specific sample, not for the public as a whole.
Narrowing the sample to match hypothesis
Suppose medical researchers examine a pool of data representing 10,000 lung cancer patients. They want to find information that suggests non-smokers who develop lung cancer have a better chance of survival than smokers with lung cancer do.
The researchers notice that 90 percent of the patients (9,000) smoked cigarettes. About 4 percent (360 people) went into remission with no chemotherapy.
Of the 10 percent (1,000) of patients who were not smokers, 40 people—4 percent—also went into remission with no chemotherapy.
The data, as it stands, suggests that smokers are as likely as non-smokers to go into remission without chemotherapy. The result is not what the researchers desire, so they reduce the sample size to 1,000 patients, to see if that produces different results.
The new data retains the 90 percent smoker rate (900). In this sample, 36 people—about 4 percent—go into remission without chemotherapy.
However, the new sample of non-smoking patients (100) retains 16 of the 40 people from the original sample who went into remission without chemotherapy. That is 16 percent of the new sample size.
The researchers therefore claim that non-smokers with lung cancer are four times more likely to go into remission without chemotherapy than smokers are.
By reducing the sample size without regard to statistical significance, after the original sample suggested there is no difference in untreated remission rates, the researchers have produced numbers that seem to bear out the desired result.
Example in meteorology
In meteorology, dataset A is often weather data up to the present, which ensures that, even subconsciously, subset B of the data could not influence the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis. This process ensures that no one can accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.
Suppose that observers note that a particular town appears to be a cancer cluster, but lack a firm hypothesis of why this is so. However, they have access to a large amount of demographic data about the town and surrounding area, containing measurements for the area of hundreds or thousands of different variables, mostly uncorrelated. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one variable will be significantly correlated with the cancer rate across the area. While this may suggest a hypothesis, further testing using the same variables but with data from a different location is needed to confirm. Note that a p-value of 0.01 suggests that 1% of the time a result at least that extreme would be obtained by chance; if hundreds or thousands of hypotheses (with mutually relatively uncorrelated independent variables) are tested, then one is more likely than not to get at least one null hypothesis with a p-value less than 0.01.
The practice of looking for patterns in data is legitimate; the vice of applying a statistical test of significance (hypothesis testing) to the same data from which the pattern was learned is wrong. One way to construct hypotheses while avoiding the problems of data dredging is to conduct randomized out-of-sample tests. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset - say, subset A - is examined for creating hypotheses. Once a hypothesis has been formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where such a hypothesis is also supported by B is it reasonable to believe that the hypothesis might be valid.
Another remedy for data dredging is to record the number of all significance tests conducted during the experiment and simply multiply the final significance level by this number (the Bonferroni correction); however, this is a very conservative metric. The use of a false discovery rate is a more sophisticated approach that has become a popular method for control of multiple hypothesis tests.
Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of data and the method used to examine the data. Thus, if someone says that a certain event has probability of 20% ± 2% 19 times out of 20, this means that if the probability of the event is estimated by the same method used to obtain the 20% estimate, the result will be between 18% and 22% with probability 0.95. No claim of statistical significance can be made by only looking, without due regard to the method used to assess the data.
- Base rate fallacy
- Bonferroni inequalities
- False discovery rate
- Multiple comparisons
- Predictive analytics
- Ioannidis, John P.A. (August 30, 2005). "Why Most Published Research Findings Are False". PLoS Medicine (San Francisco: Public Library of Science) 2 (8): e124. doi:10.1371/journal.pmed.0020124. ISSN 1549-1277. PMC 1182327. PMID 16060722. http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0020124. Retrieved 2009-11-29.
Wikimedia Foundation. 2010.
Look at other dictionaries:
data dredging — Stats the process of making comparisons and drawing conclusions from data that was not part of the original brief for a study … The ultimate business dictionary
Data mining — Not to be confused with analytics, information extraction, or data analysis. Data mining (the analysis step of the knowledge discovery in databases process, or KDD), a relatively young and interdisciplinary field of computer science is… … Wikipedia
Testing hypotheses suggested by the data — In statistics, hypotheses suggested by the data must be tested differently from hypotheses formed independently of the data.How to do it wrongFor example, suppose fifty different researchers, unaware of each other s work, run clinical trials to… … Wikipedia
Misuse of statistics — A misuse of statistics occurs when a statistical argument asserts a falsehood. In some cases, the misuse may be accidental. In others, it is purposeful and for the gain of the perpetrator. When the statistical reason involved is false or… … Wikipedia
Dredge (disambiguation) — Dredge or dredging may refer to: Dredging, underwater excavation work Dredge, California, former community in Butte County Dragline excavator, tool for dredging in mining; piece of heavy equipment used to remove overburden in an open pit or strip … Wikipedia
Prosecutor's fallacy — The prosecutor s fallacy is a fallacy of statistical reasoning made in law where the context in which the accused has been brought to court is falsely assumed to be irrelevant to judging how confident a jury can be in evidence against them with a … Wikipedia
List of statistics topics — Please add any Wikipedia articles related to statistics that are not already on this list.The Related changes link in the margin of this page (below search) leads to a list of the most recent changes to the articles listed below. To see the most… … Wikipedia
Overfitting — Noisy (roughly linear) data is fitted to both linear and polynomial functions. Although the polynomial function passes through each data point, and the linear function through few, the linear version is a better fit. If the regression curves were … Wikipedia
Experimenter's bias — In experimental science, experimenter s bias is subjective bias towards a result expected by the human experimenter. David Sackett, in a useful review of biases in clinical studies, states that biases can occur in any one of seven stages of… … Wikipedia
List of mathematics articles (D) — NOTOC D D distribution D module D D Agostino s K squared test D Alembert Euler condition D Alembert operator D Alembert s formula D Alembert s paradox D Alembert s principle Dagger category Dagger compact category Dagger symmetric monoidal… … Wikipedia