Criticism 1 of NHST: Good Tools for Individual Researchers are not Good Tools for Research Communities

Introduction

Over my years as a graduate student, I have built up a long list of complaints about the use of Null Hypothesis Significance Testing (NHST) in the empirical sciences. In the next few weeks, I’m planning to publish a series of blog posts, each of which will articulate one specific weakness of NHST. The weaknesses I will discuss are not novel observations about NHST: people have been complaining about the use of p-values since the 1950s. My intention is simply to gather all of the criticisms of NHST in a single place and to articulate each of the criticisms in a way that permits no confusion. I’m hoping that readers will comment on these pieces and give me enough feedback to sharpen the points into a useful resource for the community.

In the interest of absolute clarity, I should note at the start of this series that I am primarily unhappy with the use of p-values as (1) a threshold that scientific results are expected to pass before they are considered publishable and (2) a measure of the evidence in defense of a hypothesis. I believe that p-values cannot be used for either of these purposes, but I will concede upfront that p-values can be useful to researchers who wish to test their own private hypotheses.

With that limitation of scope in mind, let’s get started.

Communities of Researchers Face Different Problems than Individual Researchers

Many scientists who defend the use of p-values as a threshold for publication employ an argument that, in broad form, can be summarized as follows: “a community of researchers can be thought of as if it were a single decision-maker who must select a set of procedures for coping with the inherent uncertainties of empiricism — foremost of which is the risk that purely chance processes will give rise to data supporting false hypotheses. To prevent our hypothetical decision-maker from believing in every hypothesis for which there exists some supporting data, we must use significance testing to separate results that could plausibly be the product of randomness from those which provide strong evidence of some underlying regularity in Nature.”

While I agree with part of the argument above (p-values, when used appropriately, can help an individual researcher resist their all-too-human inclination to discover patterns in noise), I do not think that this sort of argument applies with similar force to a community of researchers. The types of information necessary for correctly interpreting p-values are always available to an individual researcher acting in isolation, but are seldom available to the members of a community who learn about each other’s work only from published reports. For example, the community will frequently be ignorant of the exact research procedures used by its members, even though the details of these procedures can have profound effects on the interpretation of published p-values. To illustrate this concern, let’s work through a specific hypothetical example of a reported p-value that cannot be taken at face value.

The Hidden Multiple Testing Problem

Imagine that Researcher A has measured twenty variables, which we will call X1 through X20. After collecting data, Researcher A attempts to predict one other variable, Y, using these twenty variables as predictors in a standard linear regression model in which Y ~ X1 + … + X20. Imagine, for the sake of argument, that Researcher A finds that X17 has a statistically significant effect on Y at p < .05 and rushes to publish this result in the new hit paper: "Y Depends upon X17!". How will Researcher B, who sees only this result and no mention of the 19 variables that failed to predict Y, react?

If Researcher B embraces NHST as a paradigm without misgivings or suspicion, B must react to A's findings with a credulity that could never be defended in the face of perfect information about Researcher A's research methods. As I imagine most scientists are already aware, Researcher A's result is statistically invalid, because the significance threshold that has been passed depended upon a set of assumptions violated by the search through twenty different variables for a predictive relationship. When you use standard NHST p-values to evaluate a hypothesis, you must acquire a new set of data and then test exactly one hypothesis on the entire data set. In our case, each of the twenty variables that was evaluated as a potential predictor of Y constitutes a separate hypothesis, so Researcher A has conducted not one hypothesis test but twenty. This is conventionally called multiple testing; in this case, if we treat the twenty tests as approximately independent, the probability that at least one variable is found to predict Y purely by luck is roughly 1 - 0.95^20, or about 64%, far above the 5% level suggested by a reported p < .05.
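
To make the size of this drift concrete, here is a small simulation sketch of Researcher A’s procedure (the sample size, number of simulated studies, and pure-noise setup are my own hypothetical choices): even though Y is unrelated to every predictor, well over half of the simulated studies produce at least one “significant” coefficient.

```python
# Sketch only: Y is pure noise, unrelated to any of the twenty predictors, yet a
# "significant" predictor at p < .05 turns up in most simulated studies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_obs, n_pred, n_sims = 100, 20, 2000
studies_with_false_positive = 0

for _ in range(n_sims):
    X = rng.normal(size=(n_obs, n_pred))
    y = rng.normal(size=n_obs)                      # Y has no relationship to any X
    X_design = np.column_stack([np.ones(n_obs), X])

    # Ordinary least squares: beta_hat = (X'X)^-1 X'y
    XtX_inv = np.linalg.inv(X_design.T @ X_design)
    beta_hat = XtX_inv @ X_design.T @ y

    # Residual variance and coefficient standard errors
    resid = y - X_design @ beta_hat
    dof = n_obs - n_pred - 1
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(XtX_inv) * sigma2)

    # Two-sided t-test p-values for the twenty slope coefficients (intercept skipped)
    t_stats = beta_hat[1:] / se[1:]
    p_values = 2 * stats.t.sf(np.abs(t_stats), df=dof)
    if p_values.min() < 0.05:
        studies_with_false_positive += 1

print(f"Share of pure-noise studies with at least one p < .05 predictor: "
      f"{studies_with_false_positive / n_sims:.2f}")   # typically around 0.64
```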

What is worrisome is that this sort of multiple testing can be effortlessly hidden from Researcher B, our hypothetical reader of a scientific article. If Researcher A does not report the tests that failed, how can Researcher B know that they were conducted? Must Researcher B learn to live in fear of his fellow scientists, lest he be betrayed by their predilection to underreport their methods?

As I hope is clear from our example, NHST as a method depends upon a faith in the perfection of our fellow researchers that will easily fall victim to any mixture of incompetence or malice on their part. Unlike a descriptive statistic such as a mean, a p-value purports to tell us something that it cannot deliver without perfect information about the exact scientific methods used by every researcher in our community. An individual researcher will necessarily have this sort of perfect information about their own work, but a community typically will not. This imperfect information implies that reasoning about the community's ideal standards for measuring evidence as if they were the ideal standards of a hypothetical individual will be systematically misleading.

If an individual researcher conducts multiple tests without correcting p-values for this search through hypotheses, that researcher will develop false hypotheses and harm only themselves. But if even one member of a community of researchers conducts multiple tests and publishes results whose interpretation cannot be sustained in light of the hidden tests that took place, the community as a whole will be left with a permanent record of a hypothesis supported by illusory evidence. And this illusion of evidence cannot easily be discovered after the fact without investing effort in explicit replication studies. Indeed, after Researcher A dies, any evidence of their statistical errors will likely disappear, except for the puzzling persistence of a paper reporting a relationship between Y and X17 that has not been found again.

Conclusion

What should we take away from this example? We should acknowledge that there are deep problems with the theoretical framework used to justify NHST as a scientific institution. NHST, as it stands, is based upon an inappropriate analogy between a community of researchers and a single hypothetical decision-maker who evaluates the research of the whole community using NHST. The actual community of researchers suffers from imperfect information about the research methods being used by its members. The sort of fishing through data for positive results described above may result from either statistical naivete or a genuine lack of scruples on the part of our fellow scientists, but, as Simmons et al. (2011) argue, it is almost certainly occurring. NHST only exacerbates the problem, because there is no credible mechanism for ensuring that we know how many hypotheses were tested before one was found that satisfies our community’s threshold.

Because the framework of NHST is not appropriate for use by a community with imperfect information, I suspect that the core objective of NHST — the prevention of false positive results — is not being achieved. At times, I even suspect that NHST has actually increased the frequency of reporting false positive results, because the universality of the procedure encourages blind searching through hypotheses for one that passes a community’s p-value threshold.

This is an unfortunate situation, because I am very sympathetic to those proponents of NHST who feel that it is an unambiguous, algorithmic procedure that diminishes the role of subjective opinion in evaluating research. While I agree that reducing the dependence of science on subjectivity and personal opinion is always valuable, we should not, in our quest to remove subjectivity, put in its place a method that depends upon an assumption of the perfect wisdom and honesty of our fellow scientists. Despite our strong desires to the contrary, human beings make mistakes. As Lincoln might have said, some researchers make mistakes all of the time and all researchers make mistakes some of the time. Because NHST is being used by a community of researchers rather than by the theoretical individual for which it was designed, NHST is not robust to the imperfections of our fellow scientists.

References

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). ‘False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.’ Psychological Science, 22(11), 1359–1366.

21 responses to “Criticism 1 of NHST: Good Tools for Individual Researchers are not Good Tools for Research Communities”

  1. Ian

    I very much enjoyed this post and look forward to the rest of this series. Perhaps you’ll delve more into this in subsequent posts, but if not, John Kruschke wrote a nice article detailing how the method used to recruit and place subjects into groups affects the p-value.

    http://www.indiana.edu/~kruschke/articles/Kruschke2010WIRES.pdf

  2. Ethan Fosse

    Thanks for starting this discussion and the excellent comments. The biggest problem I’ve experienced with p-values is that it’s so easy to overlook effect size (i.e., you can have a large effect with a huge p-value compared with a very tiny effect with tiny p-value) and the difference between imprecision and “no effect” (i.e., you can have a confidence interval centered well away from the null yet include the null, as compared to a confidence interval that is directly centered over the null). I’ll add another problem with p-values and confidence intervals: it’s incredibly difficult to interpret them correctly, in part because they really are very weird constructions. As is well-known, the p-values and confidence intervals of a particular parameter describe neither the properties of the data set at hand nor the particular model fit to the data. Instead, the p-values and confidence intervals describe the imagined properties of an imagined distribution of an unobserved parameter from a set of unobserved models that we imagine have been fit repeatedly to imagined data sets gathered in a similarly imagined way from the same unobserved population. Thus, a p-value never gives the probability that our parameter is above a certain observed threshold, and a confidence interval never indicates the probability that our parameter lies within a certain set of observed values.
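
    To see what this repeated-sampling story means in practice, here is a small simulation sketch (the population, sample size, and confidence level are made-up choices): the “95%” describes how often the interval-building procedure captures the true mean across many imagined samples, not the probability that any one observed interval contains it.

    ```python
    # Sketch only: a made-up population with a known mean, used to show that the
    # 95% in a "95% confidence interval" is a property of the procedure over
    # repeated samples, not of any single observed interval.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_mean, n, n_sims = 10.0, 30, 10000
    covered = 0

    for _ in range(n_sims):
        sample = rng.normal(loc=true_mean, scale=2.0, size=n)
        se = sample.std(ddof=1) / np.sqrt(n)
        t_crit = stats.t.ppf(0.975, df=n - 1)
        lower = sample.mean() - t_crit * se
        upper = sample.mean() + t_crit * se
        covered += (lower <= true_mean <= upper)

    print(f"Coverage across imagined repetitions: {covered / n_sims:.3f}")  # close to 0.95
    ```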

  3. Michael Großbach

    Thanks for this post! There is even an entire issue of the Zeitschrift für Psychologie / Journal of Psychology available online at http://psycontent.metapress.com/content/ln067244071g/?p=c966f0a7dceb44d6abae4bd73b25ca83&pi=7 with articles on this topic and alternatives to NHST.

  4. toke emil

    I hope you will propose a solution to the problem of pretest bias in your posts to come.

    Thanks for raising the issue.

    Cheers Toke

  5. Michael bishop
  6. Tom

    I’m also looking forward to this series!

    As a current grad student who has a sneaking suspicion that everything he’s currently being taught will be obsolete in ten years’ time, can I make a request for you to also add a section on ‘alternative approaches’? Ignorantly, I don’t know if Bayesian approaches are the only other option. If so, then a Bayesian option presented a la Kruschke with pros highlighted would be very useful. Otherwise I’m left with a dead-end list of things that don’t work.

    Of course, if you’re writing this for people who, unlike me, are already well aware of the alternatives, then this isn’t necessary.

  7. toke emil

    Yes, pretest bias is an econometric term for something all economists do if they work empirically. In general, economists tend to forget to report on the model selection process in which numerous hypotheses are tested before the final model is selected. The models are presented as if they had been tested only once, often with very clever specifications…

    If you google pretest bias and econometrics you will find stuff like this:
    http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CGEQFjAC&url=http%3A%2F%2Farno.uvt.nl%2Fshow.cgi%3Ffid%3D4229%3Bh%3Drepec%3Adgr%3Akubcen%3A200137&ei=5tKwT8fWMYixtAaF0tGuBg&usg=AFQjCNFw3nbtU6ETHeEHdYclG3ahUqkYNA&sig2=FvEaTZGwyWGWPPxQZ8By8g

    Cheers Toke

  8. Tommy O'Dell

    Hey John, I like this xkcd comic. Green jelly beans linked to acne – 95% confidence! http://xkcd.com/882/ Does that illustrate your point?

    Possibly related, I’ve been using Tableau at work quite a bit for exploratory data analysis. See tableausoftware.com if you’re not familiar with it. Anyway, it makes generating visual summaries ridiculously easy and fast. By slicing and dicing data at 200 mph am I just doing a huge amount of visual “hypothesis testing/pattern finding”, increasing the odds that I’ll find appealing but likely random patterns in the data?

  9. Tommy O'Dell

    Hi John,

    Hidden not from the reader of the comic, but hidden perhaps from the readers of the hypothetical newspaper at the bottom?

    Is the additional step to verify a pattern found in Tableau what you’d call “confirmatory data analysis”? What kind of steps would you take to confirm what you’re seeing? Confidence intervals around means? Some googling shows there’s a bit of controversy over what “CDA” actually means. (http://andrewgelman.com/2010/02/exploratory_and/). What’s your opinion?

    A bit more on topic, are you planning to get into a discussion of p-values vs. effect size? “The reporting of effect sizes facilitates the interpretation of the substantive, as opposed to the statistical, significance of a research result.” (http://en.wikipedia.org/wiki/Effect_size). Of course, that isn’t a flaw in p-values themselves. But there does seem to be some conflation of the two in the minds of many, and there’s a huge body of work out there that reports only on significance.

  10. John Myles White

    Hi Tommy,

    You’re right: the testing is hidden from the readers of the hypothetical newspaper at the bottom.

    And, yes, the additional step to verify a pattern found in Tableau is what I’d call “confirmatory data analysis”. I would start by checking, in a new data set that you have not used in your original search for patterns, that some quantification of your pattern still shows up. If your original pattern is that x and y are positively correlated, you should check that x and y are still positively correlated in the new data as well. This will sometimes be impossible, but it’s the right goal: the real worry is that the correlation between x and y will be clearly zero (or even clearly negative) in the new data. This is a sign that your so-called pattern does not generalize and that you’ve done too much fishing in Tableau.
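
    As a minimal sketch of that confirmation step (the variable names, toy data, and fifty-fifty split below are my own assumptions, not a prescription):

    ```python
    # Sketch only: set aside a confirmation sample before exploring, then check
    # whether the exploratory pattern (here, a positive correlation between x and y)
    # still shows up in the half of the data that played no part in the search.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=1000)
    y = 0.1 * x + rng.normal(size=1000)   # toy data standing in for a Tableau extract

    explore_idx = rng.choice(1000, size=500, replace=False)
    confirm_idx = np.setdiff1d(np.arange(1000), explore_idx)

    r_explore = np.corrcoef(x[explore_idx], y[explore_idx])[0, 1]  # free to fish here
    r_confirm = np.corrcoef(x[confirm_idx], y[confirm_idx])[0, 1]  # examined only once

    print(f"exploratory r = {r_explore:.2f}, confirmatory r = {r_confirm:.2f}")
    # A pattern that vanishes or flips sign in the confirmation half was likely fished up.
    ```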

    I discussed effect sizes a bit in the comments on the Criticism 3 post. I’ll mention them again, but I’m not very fond of them: they’re better than p-values, but they still mix up two separate concepts (mean and variance) that I think should never be combined into a single number without great caution.
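
    As a tiny illustration of that mixing, take Cohen’s d, one common standardized effect size (the numbers below are invented):

    ```python
    # Sketch only: Cohen's d divides a mean difference by a standard deviation, so a
    # large difference measured noisily and a tiny difference measured precisely can
    # collapse to exactly the same "effect size".
    mean_diff_a, sd_a = 10.0, 20.0   # big difference, noisy measurements
    mean_diff_b, sd_b = 0.5, 1.0     # tiny difference, precise measurements

    cohens_d_a = mean_diff_a / sd_a
    cohens_d_b = mean_diff_b / sd_b

    print(cohens_d_a, cohens_d_b)    # both 0.5, despite very different situations
    ```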

  11. Tommy O'Dell

    Hi John, thanks for clarifying your thoughts on CDA. “This will sometimes be impossible, but it’s the right goal”. Well put.

    I read your Criticism 3 post last night. I wasn’t familiar with the idea, but I understand where you’re coming from now with regard to reducing the two dimensions into one.

    FYI, I’m going through Machine Learning for Hackers, and enjoying it a great deal. You do a great job explaining the concepts in an easy to follow way. Also, I’ve just started the Stanford Coursera Machine Learning course. I imagine the two will go very well together.

  12. shabbychef

    I don’t see why publication bias is uniquely a deficiency of NHST. To take a straw man, mutual fund managers typically send out quarterly reports that emphasize those funds, among the many they manage, that experienced good luck in the trailing quarter. The authors do not bother to include something as weird as a p-value; rather, they publish the quarterly return, which is sometimes conveniently annualized to amplify the illusion. The reported return of each fund, which is just a sample mean, is biased by selection. There is no NHST here, just publication bias. How is that corrected by discarding NHST? Maybe I did not read the comments closely enough, but it is hard to imagine any publication filter that solves this problem.
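
    A quick sketch of that selection effect (the fund count, volatility, and zero true skill are all made-up assumptions):

    ```python
    # Sketch only: fifty funds with no true edge. Reporting only the best fund's
    # quarterly return yields a strongly positive number on average, even though
    # no p-value was computed anywhere.
    import numpy as np

    rng = np.random.default_rng(7)
    n_funds, n_quarters = 50, 1000
    returns = rng.normal(loc=0.0, scale=0.05, size=(n_quarters, n_funds))

    best_reported = returns.max(axis=1)   # the return that makes it into the letter
    print(f"Average return of an arbitrary fund: {returns.mean():+.3f}")        # about +0.000
    print(f"Average reported (best) return:      {best_reported.mean():+.3f}")  # roughly +0.11
    ```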

  13. shabbychef

    Again, I don’t see how this is a failing particularly of NHST. Any kind of information asymmetry allows strategic manipulation. Whether the bar for publication (or if that is set too low, the bar for ‘interestingness’) is a sacred p-value, or a positive mean return the previous quarter, or a Bayesian posterior mean above some value, or a sufficiently large ‘q-value’ (a la Storey), any bar, as you note, biases the results by selection.