Introduction
Over my years as a graduate student, I have built up a long list of complaints about the use of Null Hypothesis Significance Testing (NHST) in the empirical sciences. In the next few weeks, I’m planning to publish a series of blog posts, each of which will articulate one specific weakness of NHST. The weaknesses I will discuss are not novel observations about NHST: people have been complaining about the use of p-values since the 1950s. My intention is simply to gather all of the criticisms of NHST in a single place and to articulate each of the criticisms in a way that permits no confusion. I’m hoping that readers will comment on these pieces and give me enough feedback to sharpen the points into a useful resource for the community.
In the interest of absolute clarity, I should note at the start of this series that I am primarily unhappy with the use of p-values as (1) a threshold that scientific results are expected to pass before they are considered publishable and (2) a measure of the evidence in defense of a hypothesis. I believe that p-values cannot be used for either of these purposes, but I will concede upfront that p-values can be useful to researchers who wish to test their own private hypotheses.
With that limitation of scope in mind, let’s get started.
Communities of Researchers Face Different Problems than Individual Researchers
Many scientists who defend the use of p-values as a threshold for publication employ an argument that, in broad form, can be summarized as follows: “a community of researchers can be thought of as if it were a single decision-maker who must select a set of procedures for coping with the inherent uncertainties of empiricism — foremost of which is the risk that purely chance processes will give rise to data supporting false hypotheses. To prevent our hypothetical decision-maker from believing in every hypothesis for which there exists some supporting data, we must use significance testing to separate results that could plausibly be the product of randomness from those which provide strong evidence of some underlying regularity in Nature.”
While I agree with part of the argument above (p-values, when used appropriately, can help an individual researcher resist their all-too-human inclination to discover patterns in noise), I do not think that this sort of argument applies with similar force to a community of researchers. The types of information necessary for correctly interpreting p-values are always available to individual researchers acting in isolation, but are seldom available to the members of a community of researchers who learn about each other’s work from published reports. For example, the community will frequently be ignorant of the exact research procedures used by its members, even though the details of these procedures can have profound effects on the interpretation of published p-values. To illustrate this concern, let’s work through a specific hypothetical example of a reported p-value that cannot be taken at face value.
The Hidden Multiple Testing Problem
Imagine that Researcher A has measured twenty variables, which we will call X1 through X20. After collecting data, Researcher A attempts to predict one other variable, Y, using these twenty variables as predictors in a standard linear regression model in which Y ~ X1 + … + X20. Imagine, for the sake of argument, that Researcher A finds that X17 has a statistically significant effect on Y at p < .05 and rushes to publish this result in the new hit paper: "Y Depends upon X17!". How will Researcher B, who sees only this result and no mention of the 19 variables that failed to predict Y, react?
If Researcher B embraces NHST as a paradigm without misgivings or suspicion, B must react to A's findings with a credulity that could never be defended in the face of perfect information about Researcher A's research methods. As I imagine most scientists are already aware, Researcher A's result is statistically invalid, because the significance threshold that has been passed depended upon a set of assumptions violated by the search through twenty different variables for a predictive relationship. When you use standard NHST p-values to evaluate a hypothesis, you must acquire a new set of data and then test exactly one hypothesis on the entire data set. In our case, each of the twenty variables that were evaluated as a potential predictor of Y constitutes a separate hypothesis, so that Researcher A has not conducted one hypothesis test, but rather twenty. This is conventionally called multiple testing. If none of the twenty variables truly predicts Y and the twenty tests are treated as independent, the probability that at least one variable appears to predict Y at p < 0.05 due purely to luck is 1 - 0.95^20, or roughly 64%: far above the 5% level suggested by the reported p-value.
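The size of this multiple testing problem can be checked with a few lines of arithmetic and a small simulation. What follows is a minimal Python sketch, not Researcher A's actual analysis. It assumes that none of the twenty predictors has a real effect and that the twenty tests are independent, so that each p-value is uniformly distributed on [0, 1]. Independence is an idealization, since coefficient tests within a single regression are generally correlated. The function names are hypothetical.

```python
import random


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_familywise_error_rate(
    num_tests: int,
    alpha: float = 0.05,
    num_studies: int = 100_000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the same quantity.

    Assumption: under a true null hypothesis a valid p-value is uniform
    on [0, 1], so each test is modeled as one uniform random draw.
    """
    rng = random.Random(seed)
    studies_with_a_hit = 0
    for _ in range(num_studies):
        # A "hit" is any of the num_tests p-values falling below alpha.
        if any(rng.random() < alpha for _ in range(num_tests)):
            studies_with_a_hit += 1
    return studies_with_a_hit / num_studies


analytic = familywise_error_rate(20)
simulated = simulate_familywise_error_rate(20)
print(f"analytic:  {analytic:.3f}")
print(f"simulated: {simulated:.3f}")
```

Under these assumptions the analytic value is 1 - 0.95^20, about 0.64, and the simulated fraction should land close to it. A reader who sees only the single reported test has no way to tell this situation apart from an honest test at the 5% level.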
What is worrisome is that this sort of multiple testing can be effortlessly hidden from Researcher B, our hypothetical reader of a scientific article. If Researcher A does not report the tests that failed, how can Researcher B know that they were conducted? Must Researcher B learn to live in fear of his fellow scientists, lest he be betrayed by their predilection to underreport their methods?
As I hope is clear from our example, NHST as a method depends upon a faith in the perfection of our fellow researchers that will easily fall victim to any mixture of incompetence or malice on their part. Unlike a descriptive statistic such as a mean, a p-value purports to tell us something that it cannot tell us without perfect information about the exact scientific methods used by every researcher in our community. An individual researcher will necessarily have this sort of perfect information about their own work, but a community will typically not. The imperfect information available to the community implies that reasoning about the community's ideal standards for measuring evidence based on the ideal standards for a hypothetical individual will be systematically misleading.
If an individual researcher conducts multiple tests without correcting p-values for this search through hypotheses, the individual researcher will develop false hypotheses and harm only themselves. But if even one member of a community of researchers conducts multiple tests and publishes results whose interpretation cannot be sustained in the light of knowledge of the hidden tests that took place, the community as a whole will have only a permanent record of a hypothesis supported by illusory evidence. And this illusion of evidence cannot be easily discovered after the fact without investing effort into explicit replication studies. Indeed, after Researcher A dies, any evidence of their statistical errors will likely disappear, except for the puzzling persistence of a paper reporting a relationship between Y and X17 that has not been found again.
Conclusion
What should we take away from this example? We should acknowledge that there are deep problems with the theoretical framework used to justify NHST as a scientific institution. NHST, as it stands, is based upon an inappropriate analogy between a community of researchers and a hypothetical decision-maker who evaluates the research of a whole community using NHST. The actual community of researchers suffers from imperfect information about the research methods being used by its members. The sort of fishing through data for positive results described above may result from either statistical naivete or a genuine lack of scruples on the part of our fellow scientists, but it is almost certainly occurring. NHST is only exacerbating the problem, because there is no credible mechanism for ensuring that we know how many hypotheses have been tested before discovering a hypothesis that satisfies our community’s threshold.
Because the framework of NHST is not appropriate for use by a community with imperfect information, I suspect that the core objective of NHST — the prevention of false positive results — is not being achieved. At times, I even suspect that NHST has actually increased the frequency of reporting false positive results, because the universality of the procedure encourages blind searching through hypotheses for one that passes a community’s p-value threshold.
This is an unfortunate situation, because I am very sympathetic to those proponents of NHST who feel that it is an unambiguous, algorithmic procedure that diminishes the extent of subjective opinion in evaluating research work. While I agree that diminishing the dependence of science on subjectivity and personal opinion is always valuable, we should not, in our quest to remove subjectivity, substitute in its stead a method that depends upon an assumption of the perfect wisdom and honesty of our fellow scientists. Despite our strong desires to the contrary, human beings make mistakes. As Lincoln might have said, some researchers make mistakes all of the time and all researchers make mistakes some of the time. Because NHST is being used by a community of researchers rather than the theoretical individual for which it was designed, NHST is not robust to the imperfections of our fellow scientists.
References
Simmons et al. (2011), ‘False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant’ SSRN
I very much enjoyed this post and look forward to the rest of this series. Perhaps you’ll delve more into this in subsequent posts, but if not, John Kruschke wrote a nice article detailing how the method used to recruit and place subjects into groups affects the p-value.
http://www.indiana.edu/~kruschke/articles/Kruschke2010WIRES.pdf
Thanks for the link, Ian. I hadn’t read that piece by Kruschke before. I’ll see how it fits into the points I’m hoping to make.
Thanks for starting this discussion and the excellent comments. The biggest problem I’ve experienced with p-values is that it’s so easy to overlook effect size (i.e., you can have a large effect with a huge p-value compared with a very tiny effect with tiny p-value) and the difference between imprecision and “no effect” (i.e., you can have a confidence interval centered well away from the null yet include the null, as compared to a confidence interval that is directly centered over the null). I’ll add another problem with p-values and confidence intervals: it’s incredibly difficult to interpret them correctly, in part because they really are very weird constructions. As is well-known, the p-values and confidence intervals of a particular parameter describe neither the properties of the data set at hand nor the particular model fit to the data. Instead, the p-values and confidence intervals describe the imagined properties of an imagined distribution of an unobserved parameter from a set of unobserved models that we imagine have been fit repeatedly to imagined data sets gathered in a similarly imagined way from the same unobserved population. Thus, a p-value never gives the probability that our parameter is above a certain observed threshold, and a confidence interval never indicates the probability that our parameter lies within a certain set of observed values.
Thanks for this post! There is even an entire issue of the Zeitschrift für Psychologie / Journal of Psychology available online at http://psycontent.metapress.com/content/ln067244071g/?p=c966f0a7dceb44d6abae4bd73b25ca83&pi=7 with articles on this topic and alternatives to NHST.
Thanks for the comments, Ethan and Michael. I’ll check out the articles in Zeitschrift für Psychologie.
Also, I’ll say that I agree with all of the issues raised by Ethan. I’ll address all of them in the next few posts, though I won’t reach confidence intervals. The effect size argument for me is summarized as follows: real data is characterized by an estimate of an effect and by an estimate of the uncertainty we have about this estimate. Any attempt to reduce this two-dimensional space into a one-dimensional space is problematic.
The question of interpretability is much more difficult to get right, because it’s so difficult to articulate the correct interpretation in plain English. For p-values, I’m going to focus instead on giving examples in which the correct interpretation seems like a meaningless statement — which it sometimes is. For confidence intervals, I’ve come to think recently that the correct description of confidence is simply to describe it as the long-term reliability of the confidence interval procedure, rather than a statement that depends (in any way at all) on the data in front of you. This failure to modulate the asserted “confidence” in response to the actual data on hand is brought out brilliantly in an example from the second chapter of Berger and Wolpert’s “The Likelihood Principle” in which any thinking person can see that we know a parameter’s value with absolute certainty, yet we are expected to go on reporting less than certain confidence in our interval.
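The long-run reading of confidence described above can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not part of the discussion: the data are normal with a known standard deviation, so a simple z-interval applies, and the true mean, sample size, and function names are arbitrary choices made for illustration.

```python
import math
import random
from statistics import NormalDist


def z_interval(sample, sigma, level=0.95):
    """Confidence interval for a normal mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    mean = sum(sample) / len(sample)
    half_width = z * sigma / math.sqrt(len(sample))
    return mean - half_width, mean + half_width


def coverage(true_mean=3.0, sigma=2.0, n=25, level=0.95,
             num_repeats=20_000, seed=0):
    """Fraction of repeated samples whose interval contains the true mean."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(num_repeats):
        # Each repeat is a fresh, imagined data set from the same population.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma, level)
        if low <= true_mean <= high:
            covered += 1
    return covered / num_repeats


estimated_coverage = coverage()
print(f"estimated coverage: {estimated_coverage:.3f}")
```

The estimated coverage should sit near the nominal 95%. That figure is a property of the procedure across repeated samples; nothing in the calculation adjusts it in response to the particular data set in hand, which is the point being made about confidence.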
I hope you will propose a solution to the problem of pretest bias in your posts to come.
Thanks for raising the issue.
Cheers Toke
I’m afraid that I don’t know what pretest bias is. Is that a term from econometrics?
See and add to the references here
http://stats.stackexchange.com/questions/10510/what-are-good-references-containing-arguments-against-null-hypothesis-significan
I’m also looking forward to this series!
As a current grad student who has a sneaking suspicion that everything he’s currently being taught will be obsolete in ten years’ time, can I make a request for you to also add a section on ‘alternative approaches’? Ignorantly, I don’t know if Bayesian approaches are the only other option. If so, then a Bayesian option presented a la Kruschke with pros highlighted would be very useful. Otherwise I’m left with a dead-end list of things that don’t work.
Of course, if you’re writing this for people unlike me who are already well aware of the alternatives, then this isn’t necessary.
Yes, pretest bias is an econometric term for something all economists do if they work empirically. In general, economists tend to forget to report on the model selection process, in which numerous hypotheses are tested before the final model is selected. The models are presented as if they were tested only once, often with very clever specifications…
If you google pretest bias and econometrics you will find stuff like this:
http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CGEQFjAC&url=http%3A%2F%2Farno.uvt.nl%2Fshow.cgi%3Ffid%3D4229%3Bh%3Drepec%3Adgr%3Akubcen%3A200137&ei=5tKwT8fWMYixtAaF0tGuBg&usg=AFQjCNFw3nbtU6ETHeEHdYclG3ahUqkYNA&sig2=FvEaTZGwyWGWPPxQZ8By8g
Cheers Toke
Tom, at the close of the series I will present a variety of possible options for going forward methodologically. Sadly, I think that many of these options (e.g. insisting that exact replications must always take place) will be radical enough that they will not be put into practice. So I have the suspicion that the exact opposite of your suspicion will occur: in ten years’ time we’ll still be using NHST inappropriately. After all, we should have stopped when Meehl tried to get us to stop in the 50s, but we didn’t. I don’t think that changing institutions is very easy.
I also don’t think that Bayesian methods are a sufficient way forward. I think that Bayesian methods are very beautiful and useful, but I think that their value only becomes clear when you embrace a broader change in your approach as a scientist — the change to a belief that the most basic job of modern scientists is the creation of novel, fully specified statistical models. The alternative hypothesis, as conventionally used, is a very sloppy statistical model, whereas the null hypothesis is an elegantly precise model. In fact, the exactness of the null is precisely the reason why it so often loses in a fight against the vague alternative hypothesis. (This concern is secretly lurking in my post called “Criticism 2 of NHST”: it’s the reason that my complaints in that piece don’t make NHST unusable a priori.)
Bayesian methods are only really valuable when we insist that scientists must replace the null with a new model that is as exact as the null hypothesis — after that is conceded, Bayesian methods provide a very useful way for comparing the performance of this new model to the performance of the old model. Otherwise, I sadly agree with Simmons et al.: Bayesian methods only provide sloppy researchers with more ways to do sloppy work.
Hey John, I like this xkcd comic. Green jelly beans linked to acne – 95% confidence! http://xkcd.com/882/ Does that illustrate your point?
Possibly related, I’ve been using Tableau at work quite a bit for exploratory data analysis. See tableausoftware.com if you’re not familiar with it. Anyway, it makes generating visual summaries ridiculously easy and fast. By slicing and dicing data at 200 mph am I just doing a huge amount of visual “hypothesis testing/pattern finding”, increasing the odds that I’ll find appealing but likely random patterns in the data?
Hi Toke,
Sadly, I don’t think that there are fully trustworthy solutions to the pretest bias concern you raise, which I would describe as the misrepresentation of a very elaborate exploratory data analysis (EDA) process, in which many models are built and tested against a fixed data set in the interest of discovering potential forms of structure in the data. The problem is that one performs EDA, but then presents the results as if one had first formulated a hypothesis and only afterwards gathered data to test the hypothesis. To use a billiards metaphor, we are not insisting that people call their shots before they take them. The difficulty is that any attempt to insist on calling shots beforehand, by publicly specifying your model and only then gathering data to test the model, is susceptible to trivial forms of deception: a person can gather data, perform EDA, build an elaborate model, then publish this model and, years later, publish the data from which the model was derived as if it were new data supporting their model. The two virtues I see with this two-part publication process are that (1) elaborate models without supporting data are less likely to be believed than elaborate models with data that is falsely being presented as supporting evidence (which it is not, because the data was used to construct the hypothesis); and (2) pretest bias is here transformed from the problem of underreporting EDA to an act of blatant deception and misconduct, which will presumably be punished by the immediate retraction of all published work by the scientist at fault.
Another potential solution is to insist that people publish models and that another, wholly separate group of people will gather the data and test these models: in short, to divide every field into explicit theorists and experimentalists. I think that this is conceivably a good strategy, but it is such a sweeping change that I see little likelihood of it ever coming into being. Also, it is not that much more robust to deception: it only makes deception more difficult to perform, because you now need to organize a conspiracy of scientists and cannot commit misconduct on your own. Conspiracies do exist, so this strategy will not solve all of our problems. That said, I do think that this sort of public testing of models is the way forward and I will be presenting a variant of this approach in one of the final pieces of this series. You can think of this as a reverse Kaggle website: instead of publishing data and letting the community build models, you publish models and let the community build data sets to test those models.
Switching gears a bit, I think that all of these problems depend upon a broken idealization of the scientific process and a publication system that incentivizes the continued misrepresentation of how science is really practiced. Our gravest failure is that we do not acknowledge that EDA is a part of both real and ideal science: ideal science _should_ involve gathering data and then fighting with that data to formulate a hypothesis: in short, ideal science requires EDA. The problem is that ideal science also requires a second step of validating the derived hypothesis on new data (this is called confirmatory data analysis or CDA) — and this second step is very rarely performed. Because we do not consider the first set of steps publishable in isolation (i.e. we systematically undervalue EDA), we have created a system in which professional success depends upon misrepresenting one’s work in a way in which you assert that you have performed the second step when you have not. In short, we do not allow publishing pure EDA and so we encourage people to present pure EDA as a mixture of EDA and CDA — or, worse still, as pure CDA.
To me, this misrepresentation of one’s work is the inevitable result of a broken set of ideals that damages scientists in much the same way that the ideals of premarital chastity in previous generations damaged the unmarried. Real people have sex: when you tell them that sex is morally wrong, you don’t prevent sex; you only promote lying. Similarly, real science mostly consists of pure EDA. When you tell people that pure EDA is not publishable without some form of CDA, you don’t prevent EDA; you only promote the tendency to misrepresent EDA as CDA. This problem is made far more severe by the standards of tenure committees, which expect that scientists will have published a lot of work (CDA is hard and won’t produce lots of work) and that this work will be novel (CDA doesn’t count as novel, only the fruits of EDA + CDA do).
The solution I see for this is also one that I doubt will work: start allowing the publication of pure EDA as pure EDA and remove all of the veneer of CDA from this, especially any attempt at NHST. Many research papers should not contain any p-values, because many papers do not contain any CDA: they are purely acts of EDA that are misrepresented in the interests of getting published. But even a systematic attempt to embrace EDA is fraught with problems: when a theory that survives CDA is considered more valuable than one that emerges from EDA (as it should be), you will always have people who want to present their work as the fruits of CDA.
In short, I see no way of preventing people from wanting to present bad science as good science. Everyone wants to think highly of themselves; everyone wants to believe in their own hypotheses; everyone wants to be respected; everyone wants to get tenure. The only possible solutions are either (1) means of making it harder to believe that you have done better work than you have in fact done or (2) trying to make the community value types of work such as pure EDA that it currently undervalues. Either we must adjust our ideals to be more realistically attainable or we must accept that you cannot do better work without also doing more work.
Hi Tommy,
Yes, that xkcd is exactly the same issue I was discussing, except that the multiple tests aren’t hidden from the reader.
And there’s no doubt that searching through data in many, many different ways increases the odds of finding a false pattern. The one reason that doing this visually is likely to be less problematic than the overuse of something like t-tests is that a complex visual pattern like a sine wave is far less likely to arise from pure noise processes than a difference in means between two groups. That said, complex visual patterns can (and do!) arise from nothing more than noise, so you should not conclude that any pattern you’ve found using Tableau is real until you’ve done additional work to demonstrate this pattern in new data.
Hi John,
Hidden not from the reader of the comic, but hidden perhaps from the readers of the hypothetical newspaper at the bottom?
Is the additional step to verify a pattern found in Tableau what you’d called “confirmatory data analysis”? What kind of steps would you do to confirm what you’re seeing? Confidence intervals around means? Some googling shows there’s a bit of controversy over what “CDA” actually means (http://andrewgelman.com/2010/02/exploratory_and/). What’s your opinion?
A bit more on topic, are you planning to get into a discussion of p-values versus effect sizes? “The reporting of effect sizes facilitates the interpretation of the substantive, as opposed to the statistical, significance of a research result” (http://en.wikipedia.org/wiki/Effect_size). Of course, that isn’t a flaw in p-values themselves. But there does seem to be some conflation of the two in the minds of many, and there’s a huge body of work out there that reports only on significance.
Hi Tommy,
You’re right: the testing is hidden from the readers of the hypothetical newspaper at the bottom.
And, yes, the additional step to verify a pattern found in Tableau is what I’d called “confirmatory data analysis”. I would start by testing, in a new data set that you have not used in your original search for patterns, that some quantification of your pattern still shows up. If your original pattern is that x and y are positively correlated, you should check that x and y are still positively correlated in the new data as well. This will sometimes be impossible, but it’s the right goal: the real worry is that the correlation between x and y will be clearly zero (or even clearly negative) in the new data. This is a sign that your so-called pattern does not generalize and that you’ve done too much fishing in Tableau.
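The check described above, finding a pattern in one data set and then re-measuring it in data that played no part in the search, can be sketched in Python. This is a hypothetical illustration rather than a recipe: every variable is pure noise by construction, the number of candidate predictors and the sample sizes are arbitrary assumptions, and the helper names are invented for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def noise(rng, n):
    """n independent standard normal draws."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def explore_then_confirm(rng, num_predictors=20, n_per_half=50):
    """Fish for the best predictor, then re-test it on fresh data."""
    # Exploratory data: y and every candidate predictor are unrelated noise.
    y_explore = noise(rng, n_per_half)
    r_explore = [pearson(noise(rng, n_per_half), y_explore)
                 for _ in range(num_predictors)]
    best = max(range(num_predictors), key=lambda j: abs(r_explore[j]))
    # Confirmatory data: new observations, and only the pre-selected
    # predictor is examined. Because nothing is truly related here,
    # its new values are simply fresh noise.
    r_confirm = pearson(noise(rng, n_per_half), noise(rng, n_per_half))
    return best, r_explore[best], r_confirm


rng = random.Random(1)
results = [explore_then_confirm(rng) for _ in range(500)]
mean_abs_explore = sum(abs(r_e) for _, r_e, _ in results) / len(results)
mean_abs_confirm = sum(abs(r_c) for _, _, r_c in results) / len(results)
print(f"mean |r| of the fished pattern, exploratory data:  {mean_abs_explore:.3f}")
print(f"mean |r| of the same pattern, confirmatory data:   {mean_abs_confirm:.3f}")
```

Because the strongest correlation is selected after looking at the exploratory data, it is inflated on average, while the confirmatory correlation is not. A pattern that shrinks toward zero in this way is the sign of over-fishing described above.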
I discussed effect sizes a bit in the comments on the Criticism 3 post. I’ll mention them again, but I’m not very fond of them: they’re better than p-values, but they still mix up two separate concepts (mean and variance) that I think should never be combined into a single number without great caution.
Hi John, thanks for clarifying your thoughts on CDA. “This will sometimes be impossible, but it’s the right goal”. Well put.
I read your Criticism 3 post last night. I wasn’t familiar with the idea, but I understand where you’re coming from now with regard to reducing the two dimensions into one.
FYI, I’m going through Machine Learning for Hackers, and enjoying it a great deal. You do a great job explaining the concepts in an easy to follow way. Also, I’ve just started the Stanford Coursera Machine Learning course. I imagine the two will go very well together.
I don’t see why publication bias is uniquely a deficiency of NHST. To take a straw man, mutual fund managers typically send out quarterly reports that emphasize those funds, among the many they manage, that experienced good luck in the trailing quarter. The authors do not bother to include something as weird as a p-value, but rather they publish the quarterly return, which is sometimes conveniently annualized to amplify the illusion. The value of the returns of the funds, each just a sample mean, is biased by selection. There is no NHST here, just publication bias. How is that corrected by discarding NHST? Maybe I did not read the comments closely enough, but it is hard to imagine any publication filter that solves this problem.
I don’t think that a publication filter can solve the problem of publication bias: arguably the definition of a filter is to introduce a bias into publication. The problem I wanted to focus on with this post is that, while NHST is a useful tool for personal reasoning, it is not a tool that is even slightly robust to strategic manipulation. I think the goal of a publication filter should be to prevent an excess of false claims. I think that NHST does not do that and that it may even increase the number of them by sanctifying bad science.
Again, I don’t see how this is a failing particularly of NHST. Any kind of information asymmetry allows strategic manipulation. Whether the bar for publication (or if that is set too low, the bar for ‘interestingness’) is a sacred p-value, or a positive mean return the previous quarter, or a Bayesian posterior mean above some value, or a sufficiently large ‘q-value’ (a la Storey), any bar, as you note, biases the results by selection.
The issue is that the use of descriptive statistics like held-out prediction accuracy on a data set under a fixed training/test split (the sort of measure of statistical algorithms common in ML) does not involve any information asymmetry. The p-value is being imbued with an inferential statistical meaning that cannot be sustained under imperfect information and therefore should be replaced in publications, although still used by individual researchers.