Data dredging

From Wikipedia, the free encyclopedia
{{Short description|Misuse of data analysis}}
[[File:Spurious correlations - spelling bee spiders.svg|thumb|upright=1.3|A humorous example of a result produced by data dredging, showing a correlation between the number of letters in [[Scripps National Spelling Bee]]'s winning word and the number of people in the United States killed by [[venomous spiders]]]]


'''Data dredging''' (also known as '''data snooping''' or '''''p''-hacking''')<ref name="Wasserstein2016">{{cite journal | last1=Wasserstein | first1=Ronald L. | last2=Lazar | first2=Nicole A. | title=The ASA Statement on p-Values: Context, Process, and Purpose | journal=The American Statistician | publisher=Informa UK Limited | volume=70 | issue=2 | date=2016-04-02 | issn=0003-1305 | doi=10.1080/00031305.2016.1154108 | pages=129–133| doi-access=free }}</ref>{{efn|Other names include data grubbing,<!--per Smith 2014 and others--> data butchery, data fishing, selective inference, significance chasing, and significance questing.}} is the misuse of [[data analysis]] to find patterns in data that can be presented as [[statistically significant]], thus dramatically increasing the risk of [[false positives]] while understating it. This is done by performing many [[statistical test]]s on the data and only reporting those that come back with significant results.<ref name="bmj02" />


The process of data dredging involves testing multiple hypotheses using a single [[data set]] by [[Brute-force search|exhaustively searching]]—perhaps for combinations of variables that might show a [[correlation]], and perhaps for groups of cases or observations that show differences in their mean or in their breakdown by some other variable.


Conventional tests of [[statistical significance]] are based on the probability that a particular result would arise if chance alone were at work, and necessarily accept some risk of mistaken conclusions of a certain type (mistaken rejections of the [[null hypothesis]]). This level of risk is called the [[statistical significance|''significance'']]. When large numbers of tests are performed, some produce false results of this type; hence 5% of randomly chosen hypotheses might be (erroneously) reported to be statistically significant at the 5% significance level, 1% might be (erroneously) reported to be statistically significant at the 1% significance level, and so on, by chance alone. When enough hypotheses are tested, it is virtually certain that some will be reported to be statistically significant (even though this is misleading), since almost every data set with any degree of randomness is likely to contain (for example) some [[spurious correlation|spurious correlations]]. If they are not cautious, researchers using data mining techniques can be easily misled by these results. The term ''p-hacking'' (in reference to [[p-value|''p''-values]]) was coined in a 2014 paper by the three researchers behind the blog [[Data Colada]], which has been focusing on uncovering such problems in social sciences research.<ref name=":22">{{Cite magazine |last=Lewis-Kraus |first=Gideon |date=2023-09-30 |title=They Studied Dishonesty. Was Their Work a Lie? 
|language=en-US |magazine=The New Yorker |url=https://www.newyorker.com/magazine/2023/10/09/they-studied-dishonesty-was-their-work-a-lie |access-date=2023-10-01 |issn=0028-792X}}</ref><ref name=":3">{{Cite web |last=Subbaraman |first=Nidhi |date=2023-09-24 |title=The Band of Debunkers Busting Bad Scientists |url=https://www.wsj.com/science/data-colada-debunk-stanford-president-research-14664f3 |url-status=live |archive-url=https://archive.today/20230924094046/https://www.wsj.com/science/data-colada-debunk-stanford-president-research-14664f3 |archive-date=2023-09-24 |access-date=2023-10-08 |website=[[Wall Street Journal]] |language=en-US}}</ref><ref>{{Cite web |title=APA PsycNet |url=https://psycnet.apa.org/record/2013-25331-001 |access-date=2023-10-08 |website=psycnet.apa.org |language=en}}</ref>
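The multiple-testing arithmetic described above can be checked with a short simulation. The sketch below (all sample sizes, seeds, and the normal-approximation z-test are illustrative choices, not from the article's sources) runs many two-sample comparisons on pure noise and counts how often "significance" appears by chance alone:

```python
import math
import random

def two_sample_p(x, y):
    """Approximate two-sided p-value for a two-sample z-test
    (normal approximation; adequate for n >= 30)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

random.seed(42)
n_tests, alpha, hits = 1000, 0.05, 0
for _ in range(n_tests):
    # Both samples come from the same N(0, 1): every null hypothesis is true.
    x = [random.gauss(0, 1) for _ in range(30)]
    y = [random.gauss(0, 1) for _ in range(30)]
    if two_sample_p(x, y) < alpha:
        hits += 1

print(hits / n_tests)  # close to alpha = 0.05, from chance alone
```

Roughly 5% of the 1,000 null comparisons come back "significant", matching the text's point that reporting only these hits is misleading.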


Data dredging is an example of disregarding the [[multiple comparisons problem]]. One form is when subgroups are compared without alerting the reader to the total number of subgroup comparisons examined.<ref name="Deming">{{Cite journal
|author1=Young, S.&nbsp;S. |author2=Karr, A.
|title = Deming, data and observational studies
|journal = Significance
|volume = 8
|issue = 3
|year = 2011
|url = http://www.niss.org/sites/default/files/Young%20Karr%20Obs%20Study%20Problem.pdf
|doi = 10.1111/j.1740-9713.2011.00506.x
|pages=116–120
|doi-access = free
}}
</ref>


==Types==


=== Drawing conclusions from data ===
The conventional [[statistical hypothesis testing]] procedure using [[frequentist probability]] is to formulate a research hypothesis, such as "people in higher social classes live longer", then collect relevant data. Lastly, a statistical [[significance test]] is carried out to see how likely the results are by chance alone (also called testing against the null hypothesis).


A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every [[data set]] contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same [[statistical population]], it is impossible to assess the likelihood that chance alone would produce such patterns.

For example, [[flipping a coin]] five times with a result of 2 heads and 3 tails might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless. The proper procedure would have been to form in advance a hypothesis of what the tails probability is, and then throw the coin various times to see if the hypothesis is rejected or not. If three tails and two heads are observed, another hypothesis, that the tails probability is 3/5, could be formed, but it could only be tested by a new set of coin tosses. The statistical significance under the incorrect procedure is completely spurious—significance tests do not protect against data dredging.
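The coin example can be made concrete with an exact binomial test (a stdlib sketch; the choice of a 100-toss follow-up sample is an arbitrary illustration):

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes
    no more likely than the observed count k under Binomial(n, p)."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in pmf if q <= pmf[k] + 1e-12)

# Three tails in five flips: every possible outcome is at least this likely,
# so the data provide no evidence at all against a fair coin.
print(binom_two_sided_p(3, 5))   # 1.0

# A fresh, larger sample is what could actually test the 3/5-tails
# hypothesis; e.g. 65 tails in 100 new tosses would reject fairness.
print(binom_two_sided_p(65, 100))
```

The first result shows why "confirming" the 3/5 hypothesis on the same five flips is meaningless; only new tosses can discriminate between the hypotheses.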

=== Optional stopping ===
[[File:P-hacking by early stopping.svg|thumb|315x315px|The figure shows the change in p-values computed from a t-test as the sample size increases, and how early stopping can allow for p-hacking.

Data is drawn from two identical normal distributions, <math>N(0, 10)</math>. For each sample size <math>n</math>, ranging from 5 to <math>10^4</math>, a t-test is performed on the first <math>n</math> samples from each distribution, and the resulting p-value is plotted. The red dashed line indicates the commonly used significance level of 0.05.

If the data collection or analysis were to stop at a point where the p-value happened to fall below the significance level, a spurious statistically significant difference could be reported.]]
Optional stopping is the practice of collecting data until some stopping criterion is reached. While it can be a valid procedure, it is easily misused. The problem is that the p-value of an optionally stopped statistical test is larger than it appears. Intuitively, this is because the p-value is supposed to sum the probabilities of all outcomes at least as rare as the one observed. With optional stopping, there are even rarer outcomes that are difficult to account for, namely those in which the stopping rule is not triggered and still more data is collected before stopping. Neglecting these outcomes yields a reported p-value that is too low. In fact, if the null hypothesis is true, then ''any'' significance level can be reached if one is allowed to keep collecting data and stop when the desired p-value (calculated as if one had always planned to collect exactly this much data) is obtained.<ref name=":9">{{Cite journal |last=Wagenmakers |first=Eric-Jan |date=October 2007 |title=A practical solution to the pervasive problems of p values |url=http://link.springer.com/10.3758/BF03194105 |journal=Psychonomic Bulletin & Review |language=en |volume=14 |issue=5 |pages=779–804 |doi=10.3758/BF03194105 |issn=1069-9384 |pmid=18087943}}</ref> For a concrete example of testing for a fair coin, see {{section link|P-value|Optional stopping|display=''p''-value}}.

Or, more succinctly, the proper calculation of p-value requires accounting for counterfactuals, that is, what the experimenter ''could'' have done in reaction to data that ''might'' have been. Accounting for what might have been is hard, even for honest researchers.<ref name=":9" /> One benefit of preregistration is to account for all counterfactuals, allowing the p-value to be calculated correctly.<ref>{{Cite journal |last1=Wicherts |first1=Jelte M. |last2=Veldkamp |first2=Coosje L. S. |last3=Augusteijn |first3=Hilde E. M. |last4=Bakker |first4=Marjan |last5=van Aert |first5=Robbie C. M. |last6=van Assen |first6=Marcel A. L. M. |date=2016-11-25 |title=Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking |journal=Frontiers in Psychology |volume=7 |page=1832 |doi=10.3389/fpsyg.2016.01832 |issn=1664-1078 |pmc=5122713 |pmid=27933012 |doi-access=free}}</ref>

The problem of early stopping is not just limited to researcher misconduct. There is often pressure to stop early if the cost of collecting data is high. Some animal ethics boards even mandate early stopping if the study obtains a significant result midway.<ref name="mlh">{{Cite journal |last1=Head |first1=Megan L. |last2=Holman |first2=Luke |last3=Lanfear |first3=Rob |last4=Kahn |first4=Andrew T. |last5=Jennions |first5=Michael D. |date=2015-03-13 |title=The Extent and Consequences of P-Hacking in Science |journal=PLOS Biology |language=en |volume=13 |issue=3 |pages=e1002106 |doi=10.1371/journal.pbio.1002106 |issn=1545-7885 |pmc=4359000 |pmid=25768323 |doi-access=free}}</ref>
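The inflation caused by peeking can be simulated directly. The sketch below (the peek interval, sample sizes, and normal-approximation test are all arbitrary illustrative choices) compares a single pre-planned analysis of a fair coin against an analyst who checks every ten tosses and stops at the first "significant" result:

```python
import math
import random

def p_value(heads, n):
    # Two-sided normal-approximation p-value for H0: the coin is fair.
    z = (heads - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
runs, max_n, alpha = 2000, 500, 0.05
fixed_hits = peeking_hits = 0
for _ in range(runs):
    heads, cum = 0, []
    for _t in range(max_n):
        heads += random.random() < 0.5   # a genuinely fair coin
        cum.append(heads)
    # One pre-planned look at the full sample: false positives ~ alpha.
    fixed_hits += p_value(cum[-1], max_n) < alpha
    # Peeking every 10 tosses and stopping at the first p < alpha
    # inflates the false positive rate well above alpha.
    peeking_hits += any(p_value(cum[n - 1], n) < alpha
                        for n in range(30, max_n + 1, 10))

print(fixed_hits / runs, peeking_hits / runs)
```

The fixed-sample analysis rejects about 5% of the time, as advertised; the peeking analyst rejects the true null far more often, despite computing exactly the same test statistic.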

=== Post-hoc data replacement ===
If data is removed ''after'' some analysis has already been performed on it, for example on the pretext of "removing outliers", the false positive rate increases. Replacing the removed "outliers" with substitute data increases the false positive rate further.<ref name=":0">{{Cite journal |last=Szucs |first=Denes |date=2016-09-22 |title=A Tutorial on Hunting Statistical Significance by Chasing N |journal=Frontiers in Psychology |language=English |volume=7 |doi=10.3389/fpsyg.2016.01444 |doi-access=free |pmid=27713723 |issn=1664-1078|pmc=5031612 }}</ref>
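A simulation makes the effect visible. In the sketch below (an invented scenario: the removal rule, sample sizes, and normal-approximation z-test are illustrative only), an analyst who sees a non-significant result discards the two observations in each group that most oppose the observed difference, then tests again:

```python
import math
import random

def two_sample_p(x, y):
    # Two-sided p-value from a two-sample z-test (normal approximation).
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(1)
runs, alpha = 2000, 0.05
honest = dredged = 0
for _ in range(runs):
    # Both groups are drawn from the same N(0, 1): no real effect exists.
    x = [random.gauss(0, 1) for _ in range(30)]
    y = [random.gauss(0, 1) for _ in range(30)]
    p = two_sample_p(x, y)
    honest += p < alpha
    if p < alpha:
        dredged += 1
    else:
        # Post hoc: call the two observations in each group that most
        # oppose the observed difference "outliers", drop them, re-test.
        if sum(x) / 30 > sum(y) / 30:
            x2, y2 = sorted(x)[2:], sorted(y)[:-2]
        else:
            x2, y2 = sorted(x)[:-2], sorted(y)[2:]
        dredged += two_sample_p(x2, y2) < alpha

print(honest / runs, dredged / runs)
```

The honest analysis stays near the nominal 5% false positive rate, while the post-hoc "outlier" removal pushes it several times higher.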

=== Post-hoc grouping ===
If a dataset contains multiple features, one or more of them can be used to group the observations, potentially creating a statistically significant result. For example, if a dataset of patients records their age and sex, a researcher can group them by age and check whether the illness recovery rate correlates with age. If it does not, the researcher might check whether it correlates with sex; if not, then perhaps with age after controlling for sex, and so on. The number of possible groupings grows exponentially with the number of features.<ref name=":0" />
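The subgroup search can be simulated. The sketch below (candidate groupings are modeled, for illustration only, as arbitrary random splits of the patients, and a pooled two-proportion z-test stands in for whatever test a researcher might use) measures how often at least one of eight groupings of a pure-chance outcome looks significant:

```python
import math
import random

def two_prop_p(k1, n1, k2, n2):
    # Two-sided p-value from a pooled two-proportion z-test.
    p = (k1 + k2) / (n1 + n2)
    if p in (0.0, 1.0):
        return 1.0
    z = (k1 / n1 - k2 / n2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(2)
runs, n, n_groupings, alpha = 1000, 200, 8, 0.05
found = 0
for _ in range(runs):
    # Recovery is pure chance: no grouping is genuinely related to it.
    recovered = [random.random() < 0.5 for _ in range(n)]
    for _g in range(n_groupings):
        # Each candidate grouping (age band, sex, ...) splits the patients.
        in_group = [random.random() < 0.5 for _ in range(n)]
        k1 = sum(r for r, g in zip(recovered, in_group) if g)
        n1 = sum(in_group)
        k2, n2 = sum(recovered) - k1, n - n1
        if two_prop_p(k1, n1, k2, n2) < alpha:
            found += 1  # report the first grouping that "works"
            break

print(found / runs)
```

Even though no grouping has any real relationship to recovery, a "significant" subgroup turns up in roughly a third of the simulated studies, close to the 1 − 0.95⁸ ≈ 0.34 predicted by the multiple comparisons arithmetic.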

=== Hypothesis suggested by non-representative data ===
{{main article|Testing hypotheses suggested by the data}}

Suppose that a study of a [[random sample]] of people includes exactly two people with a birthday of August 7: Mary and John. Someone engaged in data dredging might try to find additional similarities between Mary and John. By going through hundreds or thousands of potential similarities between the two, each having a low probability of being true, an unusual similarity can almost certainly be found. Perhaps John and Mary are the only two people in the study who switched minors three times in college. A hypothesis, biased by data dredging, could then be "people born on August 7 have a much higher chance of switching minors more than twice in college."

The data itself taken out of context might be seen as strongly supporting that correlation, since no one with a different birthday had switched minors three times in college. However, if (as is likely) this is a spurious hypothesis, this result will most likely not be [[reproducible]]; any attempt to check if others with an August 7 birthday have a similar rate of changing minors will most likely get contradictory results almost immediately.
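The chance of such a coincidence can be quantified: if k independent traits are checked and each matches with probability p, the probability that at least one matches is 1 − (1 − p)^k. A small illustration (the trait count and per-trait probability are invented for the example):

```python
def p_some_match(k, p):
    """Probability that at least one of k independent traits matches,
    when each trait matches with probability p."""
    return 1 - (1 - p) ** k

# A single rare trait (1 in 1000) is an unimpressive coincidence...
print(p_some_match(1, 0.001))      # ~0.001
# ...but checking 2000 such traits makes some match more likely than not.
print(p_some_match(2000, 0.001))
```

With 2,000 candidate similarities, the probability of finding at least one "unusual" match between Mary and John exceeds 85%, which is why a similarity found this way supports no hypothesis.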

=== Systematic bias ===
{{main article|Bias}}
Bias is a systematic error in the analysis. For example, doctors directed [[HIV]] patients at high cardiovascular risk to a particular HIV treatment, [[abacavir]], and lower-risk patients to other drugs, preventing a simple assessment of abacavir compared to other treatments. An analysis that did not correct for this bias unfairly penalized abacavir, since its patients were more high-risk, so more of them had heart attacks.<ref name="Deming" /> This problem can be very severe, for example, in [[observational study|observational studies]].<ref name="Deming" /><ref name="bmj02">
{{Cite journal
|author1=Davey Smith, G.|author1-link=George Davey Smith
|author2=Ebrahim, S.
|title = Data dredging, bias, or confounding
|journal = BMJ
|volume = 325
|year = 2002
|pmc = 1124898
|doi = 10.1136/bmj.325.7378.1437
|pmid=12493654
|issue=7378
|pages=1437–1438}}
</ref>

Missing factors, unmeasured [[confounders]], and loss to follow-up can also lead to bias.<ref name="Deming" /> By selecting papers with significant [[p-value|''p''-values]], negative studies are selected against, which is [[publication bias]]. This is also known as ''file drawer bias'', because less significant ''p''-value results are left in the file drawer and never published.

=== Multiple modelling ===
Another aspect of the conditioning of [[statistical test]]s by knowledge of the data can be seen while using the {{clarify span|system or machine analysis and [[linear regression]] to observe the frequency of data.|date=October 2019}} A crucial step in the process is to decide which [[covariate]]s to include in a relationship explaining one or more other variables. There are both statistical (see [[stepwise regression]]) and substantive considerations that lead the authors to favor some of their models over others, and there is a liberal use of statistical tests. However, to discard one or more variables from an explanatory relation on the basis of the data means one cannot validly apply standard statistical procedures to the retained variables in the relation as though nothing had happened. In the nature of the case, the retained variables have had to pass some kind of preliminary test (possibly an imprecise intuitive one) that the discarded variables failed. In 1966, Selvin and Stuart compared variables retained in the model to the fish that don't fall through the net—in the sense that their effects are bound to be bigger than those that do fall through the net. Not only does this alter the performance of all subsequent tests on the retained explanatory model, but it may also introduce bias and alter [[mean square error]] in estimation.<ref name="Selvin">
{{Cite journal
|author1=Selvin, H.&nbsp;C.
|author2=Stuart, A.
|title = Data-Dredging Procedures in Survey Analysis
|journal = The American Statistician
|volume = 20
|issue = 3
|pages = 20–23
|year = 1966
|doi=10.1080/00031305.1966.10480401
|jstor=2681493}}
</ref><ref name="BerkBrownZhao">
{{Cite journal
|author1=Berk, R. |author2=Brown, L. |author3=Zhao, L.
|title = Statistical Inference After Model Selection
|journal = J Quant Criminol
|doi = 10.1007/s10940-009-9077-7
|year = 2009 |volume=26 |issue=2 |pages=217–236 |s2cid=10350955 |url=https://repository.upenn.edu/statistics_papers/540 }}
</ref>
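The post-selection problem can be illustrated by screening many candidate covariates on pure noise and then testing the "winner" as if it had been pre-specified. The sketch below uses a rough normal approximation to the correlation test (z ≈ r√n); the variable counts, sample size, and seed are arbitrary illustrative choices:

```python
import math
import random

def corr_p(x, y):
    """Naive two-sided p-value for a Pearson correlation, via the crude
    normal approximation z = r * sqrt(n) (adequate for this sketch)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)
    return math.erfc(abs(r) * math.sqrt(n) / math.sqrt(2))

random.seed(3)
runs, n, n_vars, alpha = 500, 50, 20, 0.05
selected_sig = 0
for _ in range(runs):
    y = [random.gauss(0, 1) for _ in range(n)]
    # Screen 20 pure-noise covariates and retain the best-looking one.
    best_p = min(corr_p([random.gauss(0, 1) for _ in range(n)], y)
                 for _ in range(n_vars))
    # Testing the retained variable as if it were pre-specified:
    selected_sig += best_p < alpha

print(selected_sig / runs)
```

Although every covariate is noise, the retained "best" variable clears the 0.05 threshold in well over half the runs: the fish that did not fall through the net look big precisely because they were selected.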

==Examples==

=== In meteorology and epidemiology ===
In [[meteorology]], hypotheses are often formulated using weather data up to the present and tested against future weather data, which ensures that future data cannot, even subconsciously, influence the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to demonstrate the formulated theory's [[predictive power]] versus the [[null hypothesis]]. This process ensures that no one can accuse the researcher of hand-tailoring the [[predictive modelling|predictive model]] to the data on hand, since the upcoming weather is not yet available.

As another example, suppose that observers note that a particular town appears to have a [[cancer cluster]], but lack a firm hypothesis of why this is so. However, they have access to a large amount of [[demographic data]] about the town and surrounding area, containing measurements for the area of hundreds or thousands of different variables, mostly uncorrelated. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one variable correlates significantly with the cancer rate across the area. While this may suggest a hypothesis, further testing using the same variables but with data from a different location is needed to confirm. Note that a ''p''-value of 0.01 suggests that 1% of the time a result at least that extreme would be obtained by chance; if hundreds or thousands of hypotheses (with mutually relatively uncorrelated independent variables) are tested, then one is likely to obtain a ''p''-value less than 0.01 for many null hypotheses.
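The arithmetic behind this is simple; for example, with 1,000 candidate variables each tested at the 0.01 level:

```python
# With m independent variables tested against the cancer rate, each at
# significance level alpha, chance alone produces on average m * alpha
# "significant" correlations even when no variable is truly related.
m, alpha = 1000, 0.01
expected_false_positives = m * alpha
p_at_least_one = 1 - (1 - alpha) ** m  # chance of at least one false hit

print(expected_false_positives)
print(p_at_least_one)
```

About ten spurious correlations are expected, and finding at least one is a near-certainty, so a single small ''p''-value from such a screen means little on its own.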

== Appearance in media ==
One example is the [[John Bohannon#Intentionally misleading chocolate study|chocolate weight loss hoax study]] conducted by journalist [[John Bohannon]], who explained publicly in a ''Gizmodo'' article that the study was deliberately conducted fraudulently as a [[social experiment]].<ref>{{Cite web |last=Bohannon |first=John |date=2015-05-27 |title=I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How. |url=https://gizmodo.com/i-fooled-millions-into-thinking-chocolate-helps-weight-1707251800 |access-date=2023-10-20 |website=[[Gizmodo]] |language=en}}</ref> The study received wide media coverage around 2015, and many people came to believe the claim that eating a chocolate bar every day would cause them to lose weight. This [[:File:Chocolate with high Cocoa content as a weight-loss accelerator.pdf|study]] was published under the auspices of the Institute of Diet and Health. According to Bohannon, the key to obtaining a ''p''-value below 0.05 was measuring 18 different variables and testing them all.

==Remedies==
While looking for patterns in data is legitimate, applying a statistical test of significance or [[hypothesis test]] to the same data until a pattern emerges is prone to abuse. One way to construct hypotheses while avoiding data dredging is to conduct randomized [[out-of-sample test]]s. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset—say, subset A—is examined for creating hypotheses. Once a hypothesis is formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where B also supports such a hypothesis is it reasonable to believe the hypothesis might be valid. (This is a simple type of [[cross-validation (statistics)|cross-validation]] and is often termed training-test or split-half validation.)
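The value of the held-out subset can be demonstrated by simulation. In the sketch below (an invented scenario: each "hypothesis" is a two-sample comparison of pure noise, tested with a normal-approximation z-test, and subsets A and B simply supply independent samples), hypotheses dredged from subset A are re-tested on subset B:

```python
import math
import random

def two_sample_p(x, y):
    # Two-sided p-value from a two-sample z-test (normal approximation).
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))

def sample():
    return [random.gauss(0, 1) for _ in range(30)]

random.seed(4)
runs, n_hypotheses, alpha = 1000, 10, 0.05
found_on_a = confirmed_on_b = 0
for _ in range(runs):
    for _h in range(n_hypotheses):
        # Dredge subset A: keep the first hypothesis that looks significant.
        if two_sample_p(sample(), sample()) < alpha:
            found_on_a += 1
            # Re-test that one hypothesis on the held-out subset B.
            confirmed_on_b += two_sample_p(sample(), sample()) < alpha
            break

print(found_on_a / runs, confirmed_on_b / max(found_on_a, 1))
```

Dredging subset A turns up a "significant" hypothesis in a large fraction of the simulated studies, but since all effects are noise, subset B confirms only about 5% of them: the held-out test filters out the spurious discoveries.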

Another remedy for data dredging is to record the number of all significance tests conducted during the study and simply divide one's criterion for significance (alpha) by this number; this is the [[Bonferroni correction]]. However, it is a very conservative adjustment: a family-wise alpha of 0.05, divided by 1,000 to account for 1,000 significance tests, yields a very stringent per-hypothesis alpha of 0.00005. Methods particularly useful in analysis of variance, and in constructing simultaneous confidence bands for regressions involving basis functions, are [[Scheffé's method]] and, if the researcher has in mind only [[Pairwise comparison (psychology)|pairwise comparison]]s, the [[Tukey range test|Tukey method]]. To avoid the extreme conservativeness of the Bonferroni correction, more sophisticated selective inference methods are available.<ref name="TaylorTibshirani2015">
{{Cite journal
|author1=Taylor, J. |author2=Tibshirani, R.
|title = Statistical learning and selective inference
|journal = Proceedings of the National Academy of Sciences
|doi = 10.1073/pnas.1507583112
|year = 2015 |volume=112 |issue=25 |pages=7629–7634
|doi-access=free|pmid=26100887
|pmc=4485109|bibcode=2015PNAS..112.7629T
}}
</ref> The most common selective inference method is the use of Benjamini and Hochberg's [[false discovery rate]] controlling procedure: it is a less conservative approach that has become a popular method for control of multiple hypothesis tests.
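The two corrections can be compared on a small example. The sketch below implements the Bonferroni correction and the Benjamini–Hochberg step-up procedure directly from their definitions (the list of p-values is invented for illustration):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject p_i if p_i < alpha / m (controls the family-wise error rate)."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up procedure (controls the false discovery rate at alpha):
    find the largest rank k with p_(k) <= (k / m) * alpha and reject the
    k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

# Ten hypothetical p-values from a study with many comparisons.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bonferroni(pvals))          # only p = 0.001 survives alpha/m = 0.005
print(benjamini_hochberg(pvals))  # the BH procedure also keeps p = 0.008
```

On this example the Bonferroni threshold of 0.05/10 = 0.005 rejects a single hypothesis, while the less conservative BH procedure rejects two, illustrating the trade-off between family-wise error control and false discovery rate control.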

When neither approach is practical, one can make a clear distinction between data analyses that are [[Statistical hypothesis testing|confirmatory]] and analyses that are [[exploratory data analysis|exploratory]]. Statistical inference is appropriate only for the former.<ref name="BerkBrownZhao" />

Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of data and the method used to examine the data. Thus, if someone says that a certain event has probability of 20% ± 2% 19 times out of 20, this means that if the probability of the event is estimated ''by the same method'' used to obtain the 20% estimate, the result is between 18% and 22% with probability 0.95. No claim of statistical significance can be made by only looking, without due regard to the method used to assess the data.

Academic journals are increasingly shifting to the [[registered report]] format, which aims to counteract serious issues such as data dredging and [[HARKing|{{abbr|HARKing|Hypothesizing After Results are Known}}]], which have made theory-testing research unreliable. For example, ''[[Nature Human Behaviour]]'' has adopted the registered report format, as it "shift[s] the emphasis from the results of research to the questions that guide the research and the methods used to answer them".<ref>{{cite journal|title=Promoting reproducibility with registered reports|date=10 January 2017|journal=Nature Human Behaviour|volume=1|issue=1|pages=0034|doi=10.1038/s41562-016-0034|s2cid=28976450|doi-access=free}}</ref> The ''[[European Journal of Personality]]'' defines this format as follows: "In a registered report, authors create a study proposal that includes theoretical and empirical background, research questions/hypotheses, and pilot data (if available). Upon submission, this proposal will then be reviewed prior to data collection, and if accepted, the paper resulting from this peer-reviewed procedure will be published, regardless of the study outcomes."<ref>{{cite web|url=https://www.ejp-blog.com/blog/2017/2/3/streamlined-review-and-registered-reports-coming-soon|title=Streamlined review and registered reports soon to be official at EJP|website=ejp-blog.com|date=6 February 2018 }}</ref>

Methods and results can also be made publicly available, as in the [[open science]] approach, making it yet more difficult for data dredging to take place.<ref>{{cite journal |last1=Vyse |first1=Stuart |title=P-Hacker Confessions: Daryl Bem and Me |journal=[[Skeptical Inquirer]] |date=2017 |volume=41 |issue=5 |pages=25–27 |url=https://www.csicop.org/specialarticles/show/p-hacker_confessions_daryl_bem_and_me |access-date=5 August 2018|archive-url=https://web.archive.org/web/20180805142806/https://www.csicop.org/specialarticles/show/p-hacker_confessions_daryl_bem_and_me |archive-date=2018-08-05 }}</ref>

==See also==
{{Div col|colwidth=27em}}
* {{annotated link|Aliasing}}
* {{annotated link|Base rate fallacy}}
* {{annotated link|Bible code}}
* {{annotated link|Bonferroni inequalities}}
* {{annotated link|Cherry picking}}
* [[Garden of forking paths fallacy]]<ref>{{Cite web |last=Gelman |first=Andrew |date=2013 |title=The garden of forking paths |url=http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf}}</ref>{{snd}}side effect of too many researcher degrees of freedom
* {{annotated link|Circular analysis}}
* {{annotated link|HARKing}}
* {{annotated link|Lincoln–Kennedy coincidences urban legend}}
* {{annotated link|Look-elsewhere effect}}
* {{annotated link|Metascience}}
* {{annotated link|Misuse of statistics}}
* {{annotated link|Overfitting}}
* {{annotated link|Pareidolia}}
* {{annotated link|Post hoc analysis}}
* {{annotated link|Post hoc theorizing}}
* {{annotated link|Predictive analytics}}
* {{annotated link|Texas sharpshooter fallacy}}
{{Div col end}}

==Notes==
{{notelist}}

==References==
{{reflist}}

==Further reading==
* {{Cite journal |last=Ioannidis |first=John P.A. |author-link=John P. A. Ioannidis |title=Why Most Published Research Findings Are False |journal=[[PLOS Medicine]] |volume=2 |issue=8 |pages=e124 |publisher=Public Library of Science |location=San Francisco |date=August 30, 2005 |issn=1549-1277 |doi=10.1371/journal.pmed.0020124 |pmid=16060722 |pmc=1182327 |doi-access=free }}
* {{cite journal|last1=Head|first1=Megan L.|last2=Holman|first2=Luke|last3=Lanfear|first3=Rob|last4=Kahn|first4=Andrew T.|last5=Jennions|first5=Michael D.|title=The Extent and Consequences of P-Hacking in Science|journal=PLOS Biology|date=13 March 2015|volume=13|issue=3|pages=e1002106|doi=10.1371/journal.pbio.1002106|pmid=25768323|pmc=4359000 |doi-access=free }}<!--|access-date=8 April 2015-->
*{{cite news|last1=Insel|first1=Thomas|title=P-Hacking|url=https://www.nimh.nih.gov/about/directors/thomas-insel/blog/2014/p-hacking.shtml|work=NIMH Director's Blog|date=November 14, 2014|language=en}}
*{{cite book |last=Smith |first=Gary |date=2016 |title=Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics |url=https://books.google.com/books?id=B-EoDwAAQBAJ |publisher=Gerald Duckworth & Co |isbn=9780715649749 }}

==External links==
* [http://data-snooping.martinsewell.com/ A bibliography on data-snooping bias]
* [http://www.tylervigen.com/spurious-correlations Spurious Correlations], a gallery of examples of implausible correlations
* {{YouTube|UFhJefdVCjE|StatQuest: ''P''-value pitfalls and power calculations}}
* [https://www.youtube.com/watch?v=A0vEGuOMTyA Video explaining p-hacking] by "[[Neuroskeptic]]", a blogger at Discover Magazine
* [https://journalofbigdata.springeropen.com/articles/10.1186/s40537-018-0143-6 Step Away From Stepwise], an article in the Journal of Big Data criticizing stepwise regression

{{DEFAULTSORT:Data Dredging}}
[[Category:Bias]]
[[Category:Cognitive biases]]
[[Category:Scientific misconduct]]
[[Category:Data mining]]
[[Category:Design of experiments]]
[[Category:Statistical hypothesis testing]]
[[Category:Misuse of statistics]]

Latest revision as of 22:26, 11 November 2024

A humorous example of a result produced by data dredging, showing a correlation between the number of letters in Scripps National Spelling Bee's winning word and the number of people in the United States killed by venomous spiders

Data dredging (also known as data snooping or p-hacking)[1][a] is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives. This is done by performing many statistical tests on the data and only reporting those that come back with significant results.[2]

The process of data dredging involves testing multiple hypotheses using a single data set by exhaustively searching—perhaps for combinations of variables that might show a correlation, and perhaps for groups of cases or observations that show differences in their mean or in their breakdown by some other variable.

Conventional tests of statistical significance are based on the probability that a particular result would arise if chance alone were at work, and necessarily accept some risk of mistaken conclusions of a certain type (mistaken rejections of the null hypothesis). This level of risk is called the significance. When large numbers of tests are performed, some produce false results of this type; hence 5% of randomly chosen hypotheses might be (erroneously) reported to be statistically significant at the 5% significance level, 1% might be (erroneously) reported to be statistically significant at the 1% significance level, and so on, by chance alone. When enough hypotheses are tested, it is virtually certain that some will be reported to be statistically significant (even though this is misleading), since almost every data set with any degree of randomness is likely to contain (for example) some spurious correlations. If they are not cautious, researchers using data mining techniques can be easily misled by these results. The term p-hacking (in reference to p-values) was coined in a 2014 paper by the three researchers behind the blog Data Colada, which has been focusing on uncovering such problems in social sciences research.[3][4][5]

Data dredging is an example of disregarding the multiple comparisons problem. One form is when subgroups are compared without alerting the reader to the total number of subgroup comparisons examined.[6]

Types


Drawing conclusions from data


The conventional statistical hypothesis testing procedure using frequentist probability is to formulate a research hypothesis, such as "people in higher social classes live longer", and then collect relevant data. Lastly, a statistical significance test is carried out to see how likely such results would be if chance alone were at work (also called testing against the null hypothesis).

A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same statistical population, it is impossible to assess the likelihood that chance alone would produce such patterns.

For example, flipping a coin five times with a result of 2 heads and 3 tails might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless. The proper procedure would have been to form a hypothesis in advance of what the tails probability is, and then toss the coin a number of times to see whether the hypothesis is rejected. If three tails and two heads are observed, another hypothesis, that the tails probability is 3/5, could be formed, but it could only be tested by a new set of coin tosses. The statistical significance under the incorrect procedure is completely spurious; significance tests do not protect against data dredging.
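The circularity of testing a hypothesis on the data that suggested it can be made concrete in a few lines. Everything here (the seed, the sample sizes) is illustrative:

```python
import random

random.seed(0)

def tails_fraction(flips):
    return flips.count("T") / len(flips)

def fair_coin():
    return random.choice("HT")

# Form a hypothesis from 5 flips of a coin that is, in fact, fair
first_sample = [fair_coin() for _ in range(5)]
hypothesized_p = tails_fraction(first_sample)

# "Testing" on the same data trivially confirms the hypothesis
assert tails_fraction(first_sample) == hypothesized_p

# Testing on fresh data: with many new tosses the observed fraction
# converges to the true value 0.5, not to the dredged estimate
fresh_sample = [fair_coin() for _ in range(100_000)]
print(hypothesized_p, round(tails_fraction(fresh_sample), 3))
```

The first assertion always holds, which is exactly why it carries no evidential weight; only the fresh tosses constitute a genuine test.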

Optional stopping

Figure: the p-value of a t-test computed on the first n samples from each of two identical normal distributions, plotted as the sample size n grows. The red dashed line indicates the commonly used significance level of 0.05. If the data collection or analysis were to stop at a point where the p-value happened to fall below the significance level, a spurious statistically significant difference could be reported.

Optional stopping is a practice where one collects data until some stopping criterion is reached. While it is a valid procedure, it is easily misused. The problem is that the p-value of an optionally stopped statistical test is larger than it seems. Intuitively, this is because the p-value is supposed to be the sum of all events at least as rare as what is observed. With optional stopping, there are even rarer events that are difficult to account for, i.e. not triggering the stopping rule and collecting even more data before stopping. Neglecting these events leads to a p-value that is too low. In fact, if the null hypothesis is true, then any significance level can be reached if one is allowed to keep collecting data and stop when the desired p-value (calculated as if one had always been planning to collect exactly this much data) is obtained.[7] For a concrete example of testing for a fair coin, see p-value § Optional stopping.

Or, more succinctly, the proper calculation of p-value requires accounting for counterfactuals, that is, what the experimenter could have done in reaction to data that might have been. Accounting for what might have been is hard, even for honest researchers.[7] One benefit of preregistration is to account for all counterfactuals, allowing the p-value to be calculated correctly.[8]

The problem of early stopping is not just limited to researcher misconduct. There is often pressure to stop early if the cost of collecting data is high. Some animal ethics boards even mandate early stopping if the study obtains a significant result midway.[9]
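The peeking behaviour described in this section can be sketched in a small simulation. A z-test with known variance stands in for the figure's t-test, and the thresholds and sample sizes below are illustrative assumptions:

```python
import math
import random

random.seed(1)

def optional_stopping_trial(max_n=1000, alpha=0.05, min_n=5):
    """Peek after every new pair of observations; stop at 'significance'."""
    sx = sy = 0.0
    for n in range(1, max_n + 1):
        sx += random.gauss(0, 1)   # both groups drawn from N(0, 1):
        sy += random.gauss(0, 1)   # the null hypothesis is true
        if n >= min_n:
            z = (sx / n - sy / n) / math.sqrt(2 / n)
            if math.erfc(abs(z) / math.sqrt(2)) < alpha:   # two-sided p-value
                return True        # spurious "significant" difference reported
    return False

trials = 200
hits = sum(optional_stopping_trial() for _ in range(trials))
print(f"{hits}/{trials} true-null experiments reached p < 0.05 by peeking")
```

Far more than 5% of these true-null experiments cross the 0.05 line at some point, even though each individual test is valid when the sample size is fixed in advance.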

Post-hoc data replacement


If data are removed after some analysis has already been performed on them, such as on the pretext of "removing outliers", the false positive rate increases. Replacing the "outliers" with substitute data increases the false positive rate further.[10]
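A minimal sketch of why this inflates false positives: below, observations that weaken the observed difference are retroactively labelled "outliers" and dropped, and the (true-null) comparison is retested. The z-test, group sizes, and dropping rule are illustrative assumptions, not a procedure from the cited source:

```python
import math
import random

random.seed(3)

def p_two_sample(xs, ys):
    """Two-sided z-test for equal means; both samples are N(0, 1) by construction."""
    nx, ny = len(xs), len(ys)
    z = (sum(xs) / nx - sum(ys) / ny) / math.sqrt(1 / nx + 1 / ny)
    return math.erfc(abs(z) / math.sqrt(2))

def trial(n=30, alpha=0.05, max_drops=3):
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [random.gauss(0, 1) for _ in range(n)]
    for _ in range(max_drops + 1):
        if p_two_sample(xs, ys) < alpha:
            return True   # reported as a real difference
        # Post hoc: call the observations that weaken the observed
        # difference "outliers", drop them, and test again
        if sum(xs) / len(xs) > sum(ys) / len(ys):
            xs.remove(min(xs)); ys.remove(max(ys))
        else:
            xs.remove(max(xs)); ys.remove(min(ys))
    return False

hits = sum(trial() for _ in range(500))
print(f"{hits}/500 true-null comparisons ended up 'significant'")
```

The nominal 5% error rate is exceeded by a wide margin, because each round of "outlier" removal is a fresh chance at significance taken only when the previous test failed.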

Post-hoc grouping


If a dataset contains multiple features, one or more of them can be used to define groups, potentially creating a statistically significant result. For example, if a dataset of patients records their age and sex, a researcher may group them by age and check whether the recovery rate is correlated with age. If it is not, the researcher might check whether it correlates with sex; if not, perhaps with age after controlling for sex, and so on. The number of possible groupings grows exponentially with the number of features.[10]
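This search over groupings can be simulated directly. The outcome below is pure chance and the groupings are random binary features (standing in for sex, age band, and so on); all names and sizes are illustrative:

```python
import math
import random

random.seed(5)

def p_diff_in_rates(a, b):
    """Two-sided z-test for a difference in proportions (normal approximation)."""
    pa, pb = sum(a) / len(a), sum(b) / len(b)
    p = (sum(a) + sum(b)) / (len(a) + len(b))
    se = math.sqrt(p * (1 - p) * (1 / len(a) + 1 / len(b)))
    return 1.0 if se == 0 else math.erfc(abs(pa - pb) / se / math.sqrt(2))

def dredge_features(n=200, n_features=20, alpha=0.05):
    """Try every available grouping until one 'predicts' a chance outcome."""
    recovered = [random.random() < 0.5 for _ in range(n)]   # outcome: pure chance
    significant = []
    for f in range(n_features):            # e.g. sex, age band, blood type...
        group = [random.random() < 0.5 for _ in range(n)]
        in_g = [r for r, g in zip(recovered, group) if g]
        out_g = [r for r, g in zip(recovered, group) if not g]
        if p_diff_in_rates(in_g, out_g) < alpha:
            significant.append(f)
    return significant

print(f"features 'predicting' recovery: {dredge_features()}")
```

With 20 independent groupings, the chance that at least one comes back "significant" at the 5% level is roughly 1 − 0.95²⁰ ≈ 64%.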

Hypothesis suggested by non-representative data


Suppose that a study of a random sample of people includes exactly two people with a birthday of August 7: Mary and John. Someone engaged in data dredging might try to find additional similarities between Mary and John. By going through hundreds or thousands of potential similarities between the two, each having a low probability of being true, an unusual similarity can almost certainly be found. Perhaps John and Mary are the only two people in the study who switched minors three times in college. A hypothesis, biased by data dredging, could then be "people born on August 7 have a much higher chance of switching minors more than twice in college."

The data itself taken out of context might be seen as strongly supporting that correlation, since no one with a different birthday had switched minors three times in college. However, if (as is likely) this is a spurious hypothesis, this result will most likely not be reproducible; any attempt to check if others with an August 7 birthday have a similar rate of changing minors will most likely get contradictory results almost immediately.

Systematic bias


Bias is a systematic error in the analysis. For example, doctors directed HIV patients at high cardiovascular risk to a particular HIV treatment, abacavir, and lower-risk patients to other drugs, preventing a simple comparison of abacavir with other treatments. An analysis that did not correct for this bias unfairly penalized abacavir, since its patients were higher-risk and so more of them had heart attacks.[6] This problem can be very severe, for example, in observational studies.[6][2]

Missing factors, unmeasured confounders, and loss to follow-up can also lead to bias.[6] By selecting papers with significant p-values, negative studies are selected against, which is publication bias. This is also known as file drawer bias, because less significant p-value results are left in the file drawer and never published.

Multiple modelling


Another way that statistical tests can be conditioned by knowledge of the data arises in multiple modelling, where linear regression is used to decide which covariates to include in a relationship explaining one or more other variables. There are both statistical (see stepwise regression) and substantive considerations that lead authors to favor some of their models over others, and statistical tests are used liberally. However, discarding one or more variables from an explanatory relation on the basis of the data means one cannot validly apply standard statistical procedures to the retained variables as though nothing had happened. In the nature of the case, the retained variables have had to pass some kind of preliminary test (possibly an imprecise intuitive one) that the discarded variables failed. In 1966, Selvin and Stuart compared the variables retained in a model to the fish that do not fall through the net, in the sense that their effects are bound to be bigger than those of the fish that do. Not only does this alter the performance of all subsequent tests on the retained explanatory model, it may also introduce bias and alter the mean square error in estimation.[11][12]

Examples


In meteorology and epidemiology


In meteorology, hypotheses are often formulated using weather data up to the present and tested against future weather data, which ensures that, even subconsciously, future data could not influence the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis. This process ensures that no one can accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.

As another example, suppose that observers note that a particular town appears to have a cancer cluster, but lack a firm hypothesis of why this is so. However, they have access to a large amount of demographic data about the town and surrounding area, containing measurements for the area of hundreds or thousands of different variables, mostly uncorrelated. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one variable correlates significantly with the cancer rate across the area. While this may suggest a hypothesis, further testing using the same variables but with data from a different location is needed to confirm. Note that a p-value of 0.01 suggests that 1% of the time a result at least that extreme would be obtained by chance; if hundreds or thousands of hypotheses (with mutually relatively uncorrelated independent variables) are tested, then one is likely to obtain a p-value less than 0.01 for many null hypotheses.
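The cancer-cluster scenario can be mimicked with synthetic data: an area-level "cancer rate" and many demographic variables are generated completely independently, yet the smallest of the resulting p-values looks impressive. All quantities below are illustrative assumptions:

```python
import math
import random

random.seed(7)

def pearson_p(xs, ys):
    """Approximate p-value for a Pearson correlation via the Fisher z-transform."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

areas = 50
cancer_rate = [random.gauss(0, 1) for _ in range(areas)]
# 1,000 demographic variables, all generated independently of the cancer rate
p_values = [pearson_p(cancer_rate, [random.gauss(0, 1) for _ in range(areas)])
            for _ in range(1000)]
print(f"smallest of 1,000 p-values: {min(p_values):.5f}")
```

Around 50 of the 1,000 independent variables clear the 0.05 threshold by chance, and the single best one typically has a p-value near 0.001, despite there being no real relationship anywhere in the data.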

Appearance in media


One example is the chocolate weight-loss hoax study conducted by journalist John Bohannon, who explained publicly in a Gizmodo article that the study had been deliberately conducted fraudulently as a social experiment.[13] The study was widely reported by media outlets around 2015, and many people believed the claim that eating a chocolate bar every day would cause them to lose weight. The study was put out under the name of the "Institute of Diet and Health". According to Bohannon, measuring 18 different variables was crucial to obtaining a p-value below 0.05.

Remedies


While looking for patterns in data is legitimate, applying a statistical test of significance or hypothesis test to the same data until a pattern emerges is prone to abuse. One way to construct hypotheses while avoiding data dredging is to conduct randomized out-of-sample tests. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset—say, subset A—is examined for creating hypotheses. Once a hypothesis is formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where B also supports such a hypothesis is it reasonable to believe the hypothesis might be valid. (This is a simple type of cross-validation and is often termed training-test or split-half validation.)
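The split-sample procedure described above can be sketched as follows. The toy dataset, feature count, and "largest mean gap" hypothesis-generation rule are illustrative assumptions:

```python
import random

random.seed(11)

def mean_gap(subset, j):
    """Difference in the mean of feature j between positive and negative outcomes."""
    pos = [x[j] for x, y in subset if y]
    neg = [x[j] for x, y in subset if not y]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

def split_half_trial(n=400, k=10):
    """Dredge subset A for the 'best' feature, then check it on subset B."""
    data = [([random.gauss(0, 1) for _ in range(k)], random.random() < 0.5)
            for _ in range(n)]                       # outcome is pure noise
    a, b = data[:n // 2], data[n // 2:]              # A: explore, B: confirm
    best = max(range(k), key=lambda j: abs(mean_gap(a, j)))
    return abs(mean_gap(a, best)), abs(mean_gap(b, best))

gap_a, gap_b = split_half_trial()
print(f"gap on exploratory half A: {gap_a:.3f}, on held-out half B: {gap_b:.3f}")
```

Because the feature was chosen to maximize the gap on subset A, its gap on the untouched subset B is usually much smaller; a hypothesis that survives B has passed a test it was not fitted to.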

Another remedy for data dredging is to record the number of all significance tests conducted during the study and simply divide one's criterion for significance (alpha) by this number; this is the Bonferroni correction. However, this correction is very conservative. A family-wise alpha of 0.05, divided in this way by 1,000 to account for 1,000 significance tests, yields a very stringent per-hypothesis alpha of 0.00005. Methods particularly useful in analysis of variance and in constructing simultaneous confidence bands for regressions involving basis functions are Scheffé's method and, if the researcher has in mind only pairwise comparisons, the Tukey method. To avoid the extreme conservativeness of the Bonferroni correction, more sophisticated selective inference methods are available.[14] The most common is Benjamini and Hochberg's false discovery rate controlling procedure: a less conservative approach that has become a popular method for controlling multiple hypothesis tests.
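Both corrections mentioned above are short enough to state in code. This is a minimal sketch (the example p-values are invented for illustration):

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i iff p_i < alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up procedure: controls the false discovery rate at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0   # largest rank whose p-value clears its threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:      # reject the k smallest p-values
        reject[i] = True
    return reject

ps = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
# Bonferroni rejects only 0.001 (< 0.05/10); BH also rejects 0.008
print(sum(bonferroni(ps)), sum(benjamini_hochberg(ps)))
```

On the same ten p-values, Bonferroni rejects one hypothesis while Benjamini–Hochberg rejects two, illustrating that FDR control is the less conservative criterion.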

When neither approach is practical, one can make a clear distinction between data analyses that are confirmatory and analyses that are exploratory. Statistical inference is appropriate only for the former.[12]

Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of data and the method used to examine the data. Thus, if someone says that a certain event has probability of 20% ± 2% 19 times out of 20, this means that if the probability of the event is estimated by the same method used to obtain the 20% estimate, the result is between 18% and 22% with probability 0.95. No claim of statistical significance can be made by only looking, without due regard to the method used to assess the data.

Academic journals increasingly shift to the registered report format, which aims to counteract very serious issues such as data dredging and HARKing, which have made theory-testing research very unreliable. For example, Nature Human Behaviour has adopted the registered report format, as it "shift[s] the emphasis from the results of research to the questions that guide the research and the methods used to answer them".[15] The European Journal of Personality defines this format as follows: "In a registered report, authors create a study proposal that includes theoretical and empirical background, research questions/hypotheses, and pilot data (if available). Upon submission, this proposal will then be reviewed prior to data collection, and if accepted, the paper resulting from this peer-reviewed procedure will be published, regardless of the study outcomes."[16]

Methods and results can also be made publicly available, as in the open science approach, making it yet more difficult for data dredging to take place.[17]


Notes

  1. ^ Other names include data grubbing, data butchery, data fishing, selective inference, significance chasing, and significance questing.

References

  1. ^ Wasserstein, Ronald L.; Lazar, Nicole A. (2016-04-02). "The ASA Statement on p-Values: Context, Process, and Purpose". The American Statistician. 70 (2). Informa UK Limited: 129–133. doi:10.1080/00031305.2016.1154108. ISSN 0003-1305.
  2. ^ a b Davey Smith, G.; Ebrahim, S. (2002). "Data dredging, bias, or confounding". BMJ. 325 (7378): 1437–1438. doi:10.1136/bmj.325.7378.1437. PMC 1124898. PMID 12493654.
  3. ^ Lewis-Kraus, Gideon (2023-09-30). "They Studied Dishonesty. Was Their Work a Lie?". The New Yorker. ISSN 0028-792X. Retrieved 2023-10-01.
  4. ^ Subbaraman, Nidhi (2023-09-24). "The Band of Debunkers Busting Bad Scientists". Wall Street Journal. Archived from the original on 2023-09-24. Retrieved 2023-10-08.
  5. ^ "APA PsycNet". psycnet.apa.org. Retrieved 2023-10-08.
  6. ^ a b c d Young, S. S.; Karr, A. (2011). "Deming, data and observational studies" (PDF). Significance. 8 (3): 116–120. doi:10.1111/j.1740-9713.2011.00506.x.
  7. ^ a b Wagenmakers, Eric-Jan (October 2007). "A practical solution to the pervasive problems of p values". Psychonomic Bulletin & Review. 14 (5): 779–804. doi:10.3758/BF03194105. ISSN 1069-9384. PMID 18087943.
  8. ^ Wicherts, Jelte M.; Veldkamp, Coosje L. S.; Augusteijn, Hilde E. M.; Bakker, Marjan; van Aert, Robbie C. M.; van Assen, Marcel A. L. M. (2016-11-25). "Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking". Frontiers in Psychology. 7: 1832. doi:10.3389/fpsyg.2016.01832. ISSN 1664-1078. PMC 5122713. PMID 27933012.
  9. ^ Head, Megan L.; Holman, Luke; Lanfear, Rob; Kahn, Andrew T.; Jennions, Michael D. (2015-03-13). "The Extent and Consequences of P-Hacking in Science". PLOS Biology. 13 (3): e1002106. doi:10.1371/journal.pbio.1002106. ISSN 1545-7885. PMC 4359000. PMID 25768323.
  10. ^ a b Szucs, Denes (2016-09-22). "A Tutorial on Hunting Statistical Significance by Chasing N". Frontiers in Psychology. 7. doi:10.3389/fpsyg.2016.01444. ISSN 1664-1078. PMC 5031612. PMID 27713723.
  11. ^ Selvin, H. C.; Stuart, A. (1966). "Data-Dredging Procedures in Survey Analysis". The American Statistician. 20 (3): 20–23. doi:10.1080/00031305.1966.10480401. JSTOR 2681493.
  12. ^ a b Berk, R.; Brown, L.; Zhao, L. (2009). "Statistical Inference After Model Selection". J Quant Criminol. 26 (2): 217–236. doi:10.1007/s10940-009-9077-7. S2CID 10350955.
  13. ^ Bohannon, John (2015-05-27). "I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How". Gizmodo. Retrieved 2023-10-20.
  14. ^ Taylor, J.; Tibshirani, R. (2015). "Statistical learning and selective inference". Proceedings of the National Academy of Sciences. 112 (25): 7629–7634. Bibcode:2015PNAS..112.7629T. doi:10.1073/pnas.1507583112. PMC 4485109. PMID 26100887.
  15. ^ "Promoting reproducibility with registered reports". Nature Human Behaviour. 1 (1): 0034. 10 January 2017. doi:10.1038/s41562-016-0034. S2CID 28976450.
  16. ^ "Streamlined review and registered reports soon to be official at EJP". ejp-blog.com. 6 February 2018.
  17. ^ Vyse, Stuart (2017). "P-Hacker Confessions: Daryl Bem and Me". Skeptical Inquirer. 41 (5): 25–27. Archived from the original on 2018-08-05. Retrieved 5 August 2018.
  18. ^ Gelman, Andrew (2013). "The garden of forking paths" (PDF).
