Data dredging

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by John Broughton (talk | contribs) at 20:16, 31 August 2005 (Copyediting; added "data fishing"; added reference). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Data dredging is the inappropriate (sometimes deliberate) search for 'statistically significant' relationships in large quantities of data. This activity was formerly known in the statistical community as data mining, but that term is now in widespread use with a substantially different meaning, so data dredging is used instead. Data fishing is another term for the same practice.

Conventional statistical procedure is to formulate a research hypothesis (such as 'people in higher social classes live longer'), then collect relevant data, and finally carry out a statistical significance test to see whether the results could be due to the effects of chance.
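This procedure can be illustrated with a small permutation test. The data below are hypothetical lifespans for two social-class groups (invented for illustration only); under the null hypothesis the group labels are interchangeable, so shuffling them shows how often a difference at least as large as the observed one would arise by chance:

```python
import random

random.seed(0)

# Hypothetical example data: lifespans (in years) for two groups.
higher = [82, 79, 85, 88, 76, 83, 90, 81]
lower = [74, 77, 72, 80, 70, 75, 78, 73]

observed = sum(higher) / len(higher) - sum(lower) / len(lower)

# Permutation test: repeatedly shuffle the pooled data and count how
# often a random relabelling produces a difference in means at least
# as large as the one actually observed.
pooled = higher + lower
n_extreme, n_trials = 0, 10_000
for _ in range(n_trials):
    random.shuffle(pooled)
    diff = sum(pooled[:8]) / 8 - sum(pooled[8:]) / 8
    if diff >= observed:
        n_extreme += 1

p_value = n_extreme / n_trials
print(f"observed difference: {observed:.3f} years, p = {p_value:.4f}")
```

The key point is the order of operations: the hypothesis came first, and the test is then applied once to data collected for that purpose.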

A key point is that one should not formulate a hypothesis as a result of seeing the data, at least not if the data is then used as proof of the hypothesis. If you want to work from data to hypotheses while avoiding the problems of data dredging, you need to collect a data set, then partition it into two subsets, A and B, with data items randomly placed in the two subsets. Only one subset - say, subset B - is examined for interesting hypotheses. Once a hypothesis has been formulated it can be tested on subset A, since subset A was not used to construct the hypothesis; only where it is also supported by subset A is it reasonable to believe that the hypothesis might be valid.
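The split-sample procedure described above can be sketched as follows (the records themselves are invented placeholders; only the random partition into an exploration subset and a held-out confirmation subset is the point):

```python
import random

random.seed(1)

# Hypothetical data set: each record is (household_income, used_service).
records = [(random.gauss(45_000, 10_000), random.random() < 0.3)
           for _ in range(1_000)]

# Randomly place the data items into two subsets, A and B.
random.shuffle(records)
subset_a = records[:500]  # held back for confirmation
subset_b = records[500:]  # examined freely for interesting hypotheses

# Explore subset B however you like to formulate a hypothesis; then test
# that hypothesis on subset A only. Because subset A played no part in
# suggesting the hypothesis, a significance test on it remains valid.
```

Only a hypothesis that also holds up on subset A deserves further belief; a pattern that appears in subset B alone is exactly the kind of chance feature data dredging mistakes for a finding.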

Any large data set contains some chance features which will not be present in similar data sets, and simply declaring these to be 'facts' is spurious. An example would be a major bank's TV marketing campaign to increase the use of its banking services. Suppose the campaign is run in one geographical area but not in another, similar one, which serves as a control group, and that overall sales in the treatment area - where the campaign was run - did not rise significantly more than in the control area. An analysis might nevertheless find that sales did go up more for Spanish-speaking households, or for households with incomes between $35,000 and $50,000, or for households that had refinanced in the past two years, or for some other subgroup, and that such an increase was 'statistically significant'. There would certainly be a temptation to report such findings as 'proof' that the campaign was successful, or would be successful if targeted to such a group in other markets.
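A simple simulation makes the danger concrete. Below, a campaign with no real effect anywhere is simulated: for each of many candidate subgroups, treatment and control samples are drawn from the same distribution, so every 'significant' result is a false positive - yet roughly 5% of subgroups clear the usual threshold by chance alone:

```python
import random

random.seed(0)

# Simulated campaign with NO real effect: for each candidate subgroup
# (Spanish-speaking households, incomes $35,000-$50,000, recent
# refinancers, ...), treatment and control are drawn from the SAME
# standard normal distribution.
n_subgroups = 1_000
n_per_group = 50
# Standard deviation of the difference of two means of N(0, 1) samples.
sd_of_diff = (2 / n_per_group) ** 0.5

false_positives = 0
for _ in range(n_subgroups):
    treatment = [random.gauss(0, 1) for _ in range(n_per_group)]
    control = [random.gauss(0, 1) for _ in range(n_per_group)]
    diff = sum(treatment) / n_per_group - sum(control) / n_per_group
    z = diff / sd_of_diff
    if abs(z) > 1.96:  # the conventional two-sided 5% threshold
        false_positives += 1

print(f"{false_positives} of {n_subgroups} subgroups look 'significant'")
```

An analyst who scans many subgroups and reports only the ones that cross the threshold is guaranteed to find 'effects' even in pure noise.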

It is important to realise that the alleged statistical significance here is completely spurious - significance tests do not protect against data dredging. When a hypothesis is tested on the same data set that suggested it, that data set is by definition one on which the hypothesis holds, not a representative sample, and any resulting significance levels are meaningless.

See, for example:

John P. A. Ioannidis, "Why Most Published Research Findings Are False", PLoS Medicine (Public Library of Science), August 2005.