Criticism 4 of NHST: No Mechanism for Producing Substantive Cumulative Knowledge

[Note to the Reader: This is a much rougher piece than the previous pieces because the argument is more complex. I ask that you please point out places where things are unclear and where claims are not rigorous.]

In this fourth part of my series of criticisms of NHST, I’m going to focus on broad questions of epistemology: I want to ask what types of knowledge we can hope to obtain using NHST and what types of knowledge we would need to obtain as social scientists before our work will offer the sort of real-world value that older sciences like physics and chemistry provide. Above all else, my argument is based upon my conviction that the older sciences are currently superior to the social sciences because they can make precise numerical predictions about important things in the world — after all, it is this numerical precision that allows humanity to construct powerful technologies like satellites and computers. That we have failed so far to construct comparably precise models of human behavior is no great cause for concern: we have chosen to study topics in which there are no low-hanging fruit like the equations defining Newtonian gravity’s effect on small objects — equations which could be discovered from experimental data quite easily using modern (and not so modern) statistical techniques. We social scientists are going slower and reaching less powerful conclusions than the physical scientists have because we are studying something that is intrinsically much harder to learn about. If our subject matter were easier, then we could turn our attention to developing the sort of complex mathematical machinery and rich formal theory that makes modern physics so interesting, but also so difficult for many people.

In large part, this piece can be considered an elaboration of the claim by Paul Meehl that:

The almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is… basically unsound.

To do this, we will build up an idealized world in which we can clearly see that NHST provides no mechanism for the accumulation of substantive knowledge. In this idealized world, it will be easy to describe (1) the types of knowledge that we might aspire to possess and (2) the types of knowledge that NHST can actually provide. By doing so, we will show that NHST, even when practiced perfectly using infinite data, can achieve only a very weak approximation of the types of knowledge that one could attain by using the most basic quantitative modeling strategies. In addition, we will be able to show that the two forms of NHST in popular use, which are (1) a test of the point null hypothesis that some constant m = 0, whose rejection supports m != 0, and (2) a directional test whose rejection supports m > 0 (or m < 0), are equally unhelpful in our pursuit of cumulative knowledge, because the conjunctive knowledge of having falsified two different null hypotheses about two different constants does not combine into anything stronger: it cannot even tell us the sign of an output that depends on both constants. In short, we will try to demonstrate by example that, even in a much simpler world than our own, NHST cannot form a major part of a successful science's long-term research program.

To construct our idealized world in which it is easy to measure the extent of our knowledge, let us suppose that the exact structure of the universe is entirely linear and deterministic: there are n basic components of reality (which are measured as x_i's) and any one of these basic variables is related to every other variable in a single equation of the form:

beta_n * x_n = beta_1 * x_1 + beta_2 * x_2 + ... + beta_{n-1} * x_{n-1}

In Nature’s caprice, a few of these variables may have zero coefficients (i.e. beta_i = 0 for some i), but we will not assume that most have zero coefficients. In an idealized world similar to the one facing the social sciences, it will generally be the case that there are almost no zero coefficients, because, as David Lykken observed, there is a crud factor in psychology linking all observables about human beings.1

Now that we have constructed our idealized world in which randomness plays no part (except when it is the result of not measuring one of the x_i's) and in which non-linearity never arises, let us set up an idealized social science research program. In this idealized science, we will learn about the structure of the world (which, when conducted correctly, is identical to learning about the equation above) by slowly discovering the x_i's that could be measured through some unspecified process and then performing randomized experiments to learn the coefficients for each. But, because we test only null hypotheses, our knowledge of each of the coefficients ultimately reduces either to (1) knowledge that beta_i != 0 or (2) knowledge that beta_i > 0 or that beta_i < 0. We will call these two types of NHST Form 1 and Form 2.

Because our knowledge about the beta_i reduces at most to knowledge of the signs of the beta_i's, consider what the final fruits of our idealized science would be after an indefinitely long period of development. If we one day wished to learn this model in full quantitative form, then using NHST Form 1 as our primary mechanism for verifying empirical results would leave us with a list of the coefficients that would need to be measured to build up the quantitative theory, i.e. a list of non-zero coefficients. Using NHST Form 2 would leave us with the same list along with the additional piece of prior information that each of those coefficients has a known sign. In short, the perfect state of knowledge from NHST would be little more than a cost-saving measure for the construction of a quantitative model. If we believe in Lykken's crud factor, NHST Form 1 will not even provide us with this much information: there will be no zero coefficients, so the list of non-zero coefficients will contain every coefficient, saving us no work and giving us no information about how to construct a quantitative theory.

But let us ignore the building of a quantitative theory for a moment and consider instead what this perfect state of knowledge would actually mean to us. To do this, we can ask how close we would come to full knowledge of the world: indeed, we could ask how many bits of information we have about the coefficients in this model. Sadly, the answer is at most 1 bit per coefficient under NHST Form 1 (zero or non-zero) and at most log2(3), or about 1.58, bits per coefficient under NHST Form 2 (negative, zero, or positive), so that we have fewer than 2n bits of knowledge after measuring n coefficients with infinitely precise application of NHST. This is in sharp contrast to a person who has measured the actual values of the coefficients: that person will know m * n bits of information about the world, where m is the precision in bits of the average coefficient's measurement. In short, there is an upper bound for any user of NHST on the amount of information they can obtain: it is a small constant multiple of n bits. In contrast, the quantitative theorist has no finite upper bound: every gain in measurement precision adds more bits.
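The bit-counting argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the values of n and m below are hypothetical, and a verdict with k possible outcomes is credited with log2(k) bits, which is the most generous accounting possible (a three-way sign verdict therefore counts for slightly more than one bit).

```python
import math

def nhst_bits(n_coefficients, outcomes_per_test):
    """Upper bound, in bits, on what one categorical verdict per coefficient can convey."""
    return n_coefficients * math.log2(outcomes_per_test)

def quantitative_bits(n_coefficients, precision_bits):
    """Bits obtained by measuring every coefficient to `precision_bits` bits."""
    return n_coefficients * precision_bits

n = 20  # hypothetical number of coefficients in the idealized equation
m = 16  # hypothetical precision, in bits, of each measured coefficient

form1 = nhst_bits(n, 2)             # Form 1: zero vs. non-zero
form2 = nhst_bits(n, 3)             # Form 2: negative, zero, or positive
measured = quantitative_bits(n, m)  # direct measurement of each coefficient

print(f"NHST Form 1: at most {form1:.1f} bits")
print(f"NHST Form 2: at most {form2:.1f} bits")
print(f"Measured coefficients: {measured} bits")
```

However n and m are chosen, the NHST totals grow only with n, while the measured total also grows with the precision m.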

But this argument about bits is unlikely to convince many people who are not already very quantitative by nature: measuring things in bits only appeals to people who like to measure things in units. So let us ask a different sort of question: we can ask what sort of questions about the world we could answer using this perfect state of NHST knowledge. Again, the response is disheartening: under NHST Form 2 (which is strictly stronger than NHST Form 1), we will only be able to answer questions that can be broken apart into pieces, each of which reduces to the form: "is beta_i > 0, beta_i = 0, or beta_i < 0?" But the answer to such a question does not merit the name of quantitative knowledge: it is knowledge of direction alone. This is not surprising: our NHST methodology focused on direction alone from the very start. But we will now show that perfect knowledge of direction is of shockingly little value for making predictions about the world. For instance, we could not answer any of the following questions:

(1) If I set all of the x_i to 1 except for x_n, what will be the value of x_n? This is the idealized form of all prediction tasks: given a set of inputs, make a prediction about an output. Being unable to solve this most basic prediction task means that we have not built up a base of cumulative knowledge with any capacity to predict unseen events. But the value of science is defined in large part by its capacity to predict unseen events.

(2) Is x_1 more important than x_2 in the sense that changing x_1's value from a to b would have more effect on x_n than changing x_2's value from a to b? This is the basic form of any question asking which of our variables is most important. Being unable to answer such questions demonstrates that we have no sense of what objects in our theory really have the capacity to affect human lives.

(3) If we set all of the x_i to 1 except for x_n, what is the sign of x_n? This is the basic form of asking whether our knowledge is cumulative, rather than shattered and lacking conceptual coherence. We cannot answer even this question (which is superficially the aggregation of the basic directional question posed earlier), because, if beta_1 > 0, beta_2 < 0, and beta_i = 0 for all 2 < i < n, then we need to know whether |beta_1| > |beta_2| if we want to predict the sign of x_n. This is particularly damning, because it means that the accumulation of knowledge of the signs of many coefficients is not even sufficient to produce knowledge of the sign of one single output that depends on only two variables.
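The difficulty in question (3) can be made concrete with a small sketch. The two "worlds" below are hypothetical: their coefficient values are invented purely for illustration, and beta_n is assumed to be 1 in both. They share exactly the same sign pattern, so perfect NHST Form 2 knowledge cannot tell them apart, yet they disagree about the sign of x_n.

```python
def solve_x_n(betas, beta_n, inputs):
    """Solve beta_n * x_n = sum(beta_i * x_i) for x_n."""
    return sum(b * x for b, x in zip(betas, inputs)) / beta_n

def sign(value):
    """Return -1, 0, or 1: the only thing NHST Form 2 records about a coefficient."""
    return (value > 0) - (value < 0)

# Two hypothetical worlds (values invented for illustration).
world_a = {"betas": [2.0, -1.0, 0.0], "beta_n": 1.0}
world_b = {"betas": [1.0, -2.0, 0.0], "beta_n": 1.0}

inputs = [1.0, 1.0, 1.0]  # every x_i other than x_n is set to 1

signs_a = [sign(b) for b in world_a["betas"]]
signs_b = [sign(b) for b in world_b["betas"]]

x_n_a = solve_x_n(world_a["betas"], world_a["beta_n"], inputs)
x_n_b = solve_x_n(world_b["betas"], world_b["beta_n"], inputs)

print("Same sign pattern:", signs_a == signs_b)  # True
print("x_n in world A:", x_n_a)                  # 1.0
print("x_n in world B:", x_n_b)                  # -1.0
```

Only the magnitudes of beta_1 and beta_2 distinguish the two worlds, and those magnitudes are exactly what sign-only knowledge leaves out.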

What can we conclude from our failure to answer such simple questions about our field of study in an idealized world in which reality has a simple mathematical structure and in which we assume that we have performed infinitely many experiments with infinite precision? I believe that we should conclude that NHST is basically unusable as a method for doing science: it amounts to what Noam Chomsky would call butterfly-collecting, except that our butterflies are variables that we have proven to have non-zero coefficients. We have no knowledge of the predictive importance of these variables nor of the scale of their effects on human experiences.

And this is true even in our idealized world in which methods like ANOVA and linear regression would be able to tell us the exact structure of the universe. Given that those methods exist and are universally taught to social scientists, why do we persist in using NHST, a method that, if used in isolation, is strictly incapable of producing any substantive knowledge of the structure of the world?

One argument that can be given back is that we do not only have access to null hypothesis testing against zero: we can test against hypotheses like beta_i > 1 and beta_i < 3. By using infinitely many of these enriched tests (let us call these directional tests against arbitrary values NHST Form 3), we could ultimately perform a binary search of the space of all values for beta_i and slowly build up an idea of the quantitative value of each of the coefficients.

But admitting that we are interested in learning the actual values of the beta_i should not encourage us to use more elaborate NHST: it should be the first step towards deciding never to use NHST again. Why? Because the binary search paradigm described right above is noticeably inferior to strategies in which we simply use methods like linear regression to estimate the values of the beta_i's directly while placing confidence intervals around those estimated values to keep a record of the precision of our knowledge.
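A rough sketch of this comparison, under the assumptions of the idealized noise-free world: each NHST Form 3 test is modelled as a perfect oracle answering "is beta > c?", and direct estimation is modelled as an ordinary least-squares slope on a handful of observations. The true coefficient, the search interval, and the tolerance are all hypothetical values chosen for illustration. Because the data here carry no noise, the confidence interval is omitted.

```python
def binary_search_beta(is_greater_than, lo, hi, tol):
    """Locate beta within `tol` using only directional verdicts of the form 'beta > c?'."""
    tests = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_greater_than(mid):
            lo = mid
        else:
            hi = mid
        tests += 1  # every verdict stands for one separate hypothesis test
    return (lo + hi) / 2, tests

def estimate_slope(xs, ys):
    """Ordinary least-squares slope for a single predictor (intercept included)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

true_beta = 1.7  # hypothetical coefficient, unknown to the researcher

def oracle(c):
    """A perfectly reliable directional test against the value c."""
    return true_beta > c

approx, n_tests = binary_search_beta(oracle, -10.0, 10.0, 1e-3)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [true_beta * x for x in xs]  # noise-free observations from one experiment
direct = estimate_slope(xs, ys)

print(f"Binary search: {approx:.4f} after {n_tests} directional tests")
print(f"Direct estimate: {direct:.4f} from a single data set")
```

In this sketch the binary search needs a new test for every halving of the interval, while the direct estimate recovers the coefficient from one data set.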

And this realization that simple linear regression would be a giant leap forward leads us to notice what is perhaps the most perverse part of our continued use of NHST in the social sciences: the machinery of NHST actually requires us to at least estimate all of the coefficients in our idealized equation (albeit in a possibly biased way), because t-tests, ANOVAs and linear regression actually require those estimates before one can compute a p-value. We are already producing, as a byproduct of our obsession with NHST, a strictly superior form of knowledge about the beta_i's, but we throw all of that information away! The entire substance of what I would call powerful scientific knowledge is considered a worthless intermediate step in falsifying the null and supporting one's favorite qualitative hypothesis: the estimates of the beta_i's are typically published, but simple verbal questioning of most social scientists will confirm that people do not walk away from papers they have read with this quantitative knowledge in memory.
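To see why the estimate has to come first, here is a minimal sketch of the arithmetic behind a t-test on a single regression slope, written in plain Python. The data are hypothetical, and the final step of converting the t statistic into a p-value (using a t distribution with n - 2 degrees of freedom) is left out to keep the example free of external libraries.

```python
import math

def slope_t_test(xs, ys):
    """Return the slope estimate, its standard error, and the t statistic against zero."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx                      # the quantitative estimate comes first
    intercept = mean_y - slope * mean_x

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)  # residual variance
    std_err = math.sqrt(sigma2 / sxx)      # precision of the slope estimate

    t_stat = slope / std_err               # only now can the test statistic be formed
    return {"slope": slope, "std_err": std_err, "t_stat": t_stat}

# Hypothetical measurements, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

result = slope_t_test(xs, ys)
print(result)
```

The slope and its standard error are both in hand before the t statistic exists, so reporting only the resulting verdict discards the more informative quantities.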

I find it very odd that, in an age so otherwise concerned with the conservation of resources, we should be so careless with quantitative knowledge that we are willing to discard it as a valueless intermediate byproduct of the process of falsifying the null. I think this careless indifference to the more meaningful quantitative values that are estimated as a means to the end of producing a p-value reflects powerful epistemological misunderstandings on the part of the social sciences: we have much too little interest in quantitative thinking, foremost of which should be an interest in exact numerical prediction. And it is this absence of interest in exact numerical prediction that sets us apart from astronomy, a field that, unlike our own, does not even have limited access to randomized experiments. Astronomers have learned to make do using only observational data by building formal and numerically precise models that make sharp predictions whose accuracy can be rigorously tested. As I have shown, even if practiced perfectly in a simpler world, NHST would not produce a body of knowledge with value even close to the knowledge that any undergraduate in astronomy has.2 As a field, we should not accept this condition. In the long term, scientific knowledge must be quantitative. We are already capable of producing such knowledge: indeed, we produce it as a waste product of our quest to falsify the null. If we abandon our interest in NHST and focus instead on the prediction of quantitative measurements, we will put our field on a healthier path towards catching up with the physical sciences. While we may always trail behind because of the intrinsic difficulty of our work, I also believe that the social sciences, in principle, have much more to offer to humanity than the natural sciences. We should start to live up to that promise more often. But NHST is a serious impediment to that, because it pushes towards the acquisition of a radically inferior type of knowledge that is not cumulative: there is no viable method for combining the successful falsification of the null about beta_i and the falsification of the null about beta_j into a coherent whole that will predict unseen data.


Paul E. Meehl (1978), "Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology", Journal of Consulting and Clinical Psychology

David A. Freedman (1991), "Statistical Models and Shoe Leather", Sociological Methodology

Paul E. Meehl (1990), "Why Summaries of Research on Psychological Theories are Often Uninterpretable", Psychological Reports

  1. I have found that this crud factor surprises some non-psychologists. To convince non-psychologists, I would note that Meehl once collected 15 variables about people and regressed every pair of them. 96% had a highly significant (p < 10^-6) correlation.
  2. I note that a side product of this observation should be the conclusion that we psychologists have an excessive faith in the value of randomized experiments. It is the sharp prediction of unseen data in a way that is readily falsifiable that makes a field of inquiry as valuable to humanity as the physical sciences already are: while we social scientists do, in fact, have much such knowledge, it is attributable almost entirely to the intellectual talent and personal integrity of social scientists, all of which function despite our crippling usage of NHST.

2 responses to “Criticism 4 of NHST: No Mechanism for Producing Substantive Cumulative Knowledge”

  1. Aaron Goodman

    In order for a field to be successful, you need to move beyond having a null hypothesis of ‘no effect’ or randomly distributed. There needs to be some canonical knowledge that can serve as the null hypothesis, and then by rejecting this hypothesis you actually gain knowledge.

    Take genetics, for example. As our understanding of inheritance has improved, our null hypothesis has changed. For example, a naive null hypothesis is that diseases appear at random in the population. However, some diseases have a genetic component: sickle cell anemia, for example, follows traditional Mendelian patterns. So at this point there is a new standard for demonstrating a novel pattern of disease transmission. To prove a new pattern, a scientist needs to show not only that the disease is not randomly distributed, but also that it does not follow a Mendelian distribution.

    One example of such a disease is color-blindness, which is a sex-linked disease and is passed from a mother who has the recessive trait to her son. Careful study of such diseases reveals that the gene for color vision is on the X-chromosome, so it is expressed in males as a dominant trait and in females as a recessive trait.

    Our model of the transmission of genetic diseases continues to become more refined, because we continue to test our model, prove that it is inaccurate, and refine it so that it can accommodate new evidence. We continue to find diseases that have some genetic components but cannot be explained by these models, so we incorporate other factors into our models, such as mutations, genetic imprinting, and epistatic effects.

    The point is that science is iterative, and it is important to build upon previous work, and refine theories that are already well established. The process cannot work if the null hypothesis is always ‘no effect.’ Instead we need to choose a null hypothesis that reflects the current understandings in the field.