Criticism 5 of NHST: p-Values Measure Effort, Not Truth

Introduction

In the third installment of my series of criticisms of NHST, I focused on the notion that a p-value is nothing more than a one-dimensional representation of a two-dimensional space in which (1) the measured size of an effect and (2) the precision of this measurement have been combined in such a way that we can never pull those two dimensions apart again. Because the size of a p-value depends on two fully independent factors, measured p-values will become smaller when either (A) a researcher focuses on studying an effect whose true magnitude is greater or (B) a researcher works harder to increase the precision of the measurement of the effect they are already committed to studying.

I’d like to dwell on this second strategy for producing low p-values for a moment, because I believe it suggests that p-values — in the hands of a social scientist committed to studying effects that are never exactly zero nor likely to be large in magnitude — are simply a confusing way of reporting the precision of our measurements. Moreover, because the precision of our measurements can always be increased by gathering more data or acquiring better equipment, p-values in the social sciences are nothing more than measures of the effort and money we invest in acquiring more precise measurements, even though many of us would like to think of p-values as a measure of the truth of our hypotheses.
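
As a quick illustration of that dependence, the sketch below (a throwaway example of my own, not the code behind the figure later in this post) averages p-values from simulated one-sample t-tests and shows that they fall whether you increase the true effect size or simply increase the sample size:

    # Illustrative only: average p-values shrink when either the true effect
    # grows or the sample size grows, even though only one of those changes
    # has anything to do with how "true" the hypothesis is.
    set.seed(1)

    avg_p <- function(effect, n, reps = 2000) {
      mean(replicate(reps, t.test(rnorm(n, mean = effect, sd = 1))$p.value))
    }

    # Fixed sample size, growing effect size:
    sapply(c(0.1, 0.3, 0.5), function(effect) avg_p(effect, n = 50))

    # Fixed (small) effect size, growing sample size:
    sapply(c(50, 500, 5000), function(n) avg_p(0.1, n))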

To hammer home the notion that p-values are, in practice, just measurements of sample size, I’m going to provide the reader with a simple algorithm that any researcher can use to ensure that essentially any study in the social sciences will, in expectation, be validated as true by the NHST approach to assessing research. For the sake of argument, we’ll insist that our hypothetical study should pass the classical threshold of p < .05.

At a high level, the approach I'll take is similar in spirit to power analysis, but it is different in the details. In power analysis, you would tell me (1) the size of the effect you expect and (2) the sample size you can expect to gather; I would then hand you back the probability of successfully passing the p-value threshold you've set. In the approach described here, you hand me only the size of the expected effect; I then hand you back the minimum sample size you need to gather so that, in expectation, you'll pass the p < .05 threshold. In other words, for any hypothesis you could possibly invent that isn't absolutely false, you can take the information I'll give you and run an experiment with the expectation that you will be able to use NHST to support your hypothesis. This experiment may be extremely expensive to run and may contribute nothing to human welfare if completed successfully, but you can run the experiment with reasonable confidence that you'll be able to publish your results.
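
For reference, the familiar direction of this calculation is built into base R as power.t.test. The sketch below is only an aside of mine: it targets a fixed power rather than an expected p-value, so it illustrates standard power analysis, not the reverse search described above.

    # Standard power analysis for a one-sample t-test: given the true effect
    # size (delta) and the sample size (n), return the probability of getting
    # p < .05.
    power.t.test(n = 50, delta = 0.3, sd = 1, sig.level = 0.05,
                 type = "one.sample", alternative = "two.sided")

    # The same function can be inverted to find the n that achieves a target
    # power, which is related to, but not the same as, requiring the expected
    # p-value to fall below .05.
    power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.8,
                 type = "one.sample", alternative = "two.sided")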

To do this, we only need to turn a simple calculation over to our computer (a rough sketch of this search in R follows the list below):

  1. For any proposed effect size, epsilon, we generate many different random samples of N normally distributed data points centered at epsilon.
  2. For each sample, we run a one-sample t-test comparing the mean of that sample against 0. We store a copy of the p-value associated with this t-test.
  3. Across samples, we average the p-values derived from these t-tests to estimate the expected p-value for a fixed sample size.
  4. If the expected p-value is greater than .05, we increase N to N + 1 and try again.
  5. Eventually we will find a large enough value of N so that the expected p-value is below .05: we store this in a database and move on to the next effect size.
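
The full code is on GitHub (linked below); what follows is only a minimal sketch of the same search, with function names (expected_p, min_n_for_expected_p) chosen here for illustration rather than taken from that repository:

    # A minimal sketch of the search described above (the GitHub code may be
    # organized differently).

    # Steps 1-3: Monte Carlo estimate of the expected p-value of a one-sample
    # t-test against 0 for a given true effect size and sample size.
    expected_p <- function(effect_size, n, n_sims = 1000) {
      mean(replicate(n_sims,
                     t.test(rnorm(n, mean = effect_size, sd = 1))$p.value))
    }

    # Steps 4-5: increase N one observation at a time until the expected
    # p-value drops below the threshold.
    min_n_for_expected_p <- function(effect_size, alpha = 0.05, n_sims = 1000) {
      n <- 2  # a t-test needs at least two observations
      while (expected_p(effect_size, n, n_sims) > alpha) {
        n <- n + 1
      }
      n
    }

    set.seed(1)
    min_n_for_expected_p(0.5)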

If we generate a large enough number of different data sets for each of our effect sizes, our Monte Carlo estimates of the expected p-values will hopefully be accurate. With a large number of these estimates across different effect sizes, we can trace out a smooth curve that shows the minimum sample size required for the expected p-value of a study with that sample size and effect size to fall below .05.
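
Using the illustrative min_n_for_expected_p function from the sketch above, the sweep over effect sizes might look like the following (again, not the GitHub code; the step-by-one search is slow for the smallest effect sizes, where a coarser search would be wiser):

    # Sweep a log-spaced grid of effect sizes, recording the minimum N for
    # each; plotting N against effect size on log axes gives the shape of the
    # figure shown below.
    effect_sizes <- 10 ^ seq(log10(0.1), log10(10), length.out = 15)
    minimum_ns <- vapply(effect_sizes, min_n_for_expected_p, numeric(1))

    plot(effect_sizes, minimum_ns, log = "xy", type = "b",
         xlab = "Effect size", ylab = "Minimum N for expected p < .05")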

To calculate these values, I've coded up this simple search algorithm in R and put the code up on GitHub. In this post, I'll simply provide a graph that shows how large a value of N is required to expect to pass a standard t-test as the size of the effect you're studying grows along a logarithmic scale starting at 0.1 and going all the way up to 10:

[Figure: minimum sample size required for an expected p-value below .05, plotted against effect size on a logarithmic scale from 0.1 to 10.]

As you can see, you need huge sample sizes for effects as small as 0.1, but you can use remarkably few data points when the effect is sufficiently large. No matter how small your effect may be, you can always do the hard work of gathering data in order to pass the threshold of p < .05. As long as the effect you're studying isn't exactly zero, p-values just measure how much effort you've put into collecting data.

Conclusions

What should we take away from this? A variety of conclusions could be drawn from thinking about the graph above, but I'm really only concerned with one of them: if your field always studies effects that aren't exactly zero, it is always possible to design a study that can be expected to pass a significance test. As such, a significance test cannot, in principle, allow science to filter out any idea in the social sciences, except perhaps for those so outlandish that they do not have even a shred of truth in them. All that NHST does is measure how precise your measurement system is, which in turn is nothing more than a metric of how hard you were willing to work to gather data in support of your hypothesis. But the width of a confidence interval is clearly a superior tool for measuring the precision of your measurements, because it directly tells you the precision of your measurements; a p-value only tells you the precision of a measurement after you've peeled off the residue left behind by the size of the effect you're studying.
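
To make that last point concrete, here is a small illustration of my own: two simulated studies with the same sample size and the same noise level have essentially identical confidence-interval widths, yet their p-values differ by many orders of magnitude, because the p-value still carries the effect size along with the precision.

    set.seed(1)

    # Two studies with the same sample size and noise level, i.e. the same
    # measurement precision, differing only in the size of the true effect.
    n <- 100
    small_effect <- rnorm(n, mean = 0.1, sd = 1)
    large_effect <- rnorm(n, mean = 1.0, sd = 1)

    t_small <- t.test(small_effect)
    t_large <- t.test(large_effect)

    # The confidence-interval widths are essentially identical...
    diff(t_small$conf.int)
    diff(t_large$conf.int)

    # ...but the p-values are not even on the same scale.
    t_small$p.value
    t_large$p.value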

Thankfully, I think that there are many viable approaches for presenting scientific research and evaluating the research of others that are far superior to p-values. In the next and final post in this series, I'll focus on some suggestions for developing better strategies for analyzing data in the social sciences. Unfortunately, these better strategies require more than a change in methods in the social sciences: they require a change in culture. We need to do less hypothesis testing and more estimation of formal models. But I'll leave the details of that until my next post.

[P.S. Many thanks to Sean Taylor for useful feedback on a draft of this post.]

9 responses to “Criticism 5 of NHST: p-Values Measure Effort, Not Truth”

  1. Josef Fruehwald

    Very cool. And to make matters worse, if you don’t put in the necessary effort to guarantee passing the p < .05 threshold, but you're lucky enough to still pass it, you're almost guaranteed to drastically overestimate the size of the effect you're studying. http://val-systems.blogspot.com/2012/05/decline-effect-in-linguisics.html

  2. Stian S Ludvigsen

    Josef points to an upside to what you’re describing here: if only small-N studies are published, it is difficult to tease out whether results are selected for publication, but if you have a large sample of studies, and a plot of their effects vs. N resembles the graph you’ve made, then you surely have publication bias! This is likely to happen where opposite effects are either impossible or implausible (with plausible results in both directions, the plot should look like an upside-down funnel), but it can be controlled for through meta-regressions.

    Such meta-regressions will be hard to run if researchers do not work hard to increase their number of observations (and hence the number of published estimates).

    The decline effect is nothing more than a result of this bias: more observations are added over time, smaller effects become significant, and results are published. The initial results were overstated, but as researchers put more effort into collecting data, they slowly approach the genuine effect (which may indeed be nothing).

    Thanks to this effort, meta-analyses can be conducted, and controls for publication bias applied. You can learn more about this in a book published only a few days ago: http://www.routledge.com/books/details/9780415670784/ The authors have several articles and working papers on the subject as well. One interesting piece is http://amstat.tandfonline.com/doi/abs/10.1198/tast.2009.08205?journalCode=utas20

    In the latter, Stanley et al. suggest you discard 90 % of the estimated effects, but if I understand your underlying criticism correctly, you suggest that we discard 100 % of the estimated effects _if_ effects go towards 0 as N reaches infinity!? Wouldn’t you rather say the genuine effect is 0? Perhaps I’ve misunderstood you here.

    And yes: everything should be published, regardless of how it turns out. Then there would be less need for us to run complicated meta-regressions on studies to find the genuine effects: simple averages would (in many cases) do.

  3. Stian S Ludvigsen

    OK, I see my comment may have been beside your point. Nonetheless, let me clarify!

    Your post was on what it takes to get significant effects, and your concern was that authors may polish their results by adding more observations. That concern may be real if only a few studies are published, but with several studies, you can look for (and control for) patterns in the published results. In such a case, you can look at your graph from a different perspective: if you only have a few observations, you need strong effects. If editors and referees desire interesting results, this is what you publish. As more observations are added, you don’t need to go to such extremes to get published. Hence, publication selection is likely if published effects change as more observations are added. If effects don’t change (on average) as more observations are added, then studies are selected for publication on grounds other than the significance and direction of their results (which is a good thing).

    So a funnel plot (and a meta-regression controlling for precision) will reveal publication bias. If you’re not convinced, have a look at http://www.deakin.edu.au/buslaw/aef/workingpapers/papers/2011_4.pdf, especially figures 3 and 4.

    Thanks for an excellent blog!
    /Stian.

  4. Eran

    Hi,

    Thanks for an illuminating post. I think you might find the following post related and interesting:

    http://www.bzst.com/2012/05/policy-changing-results-or-artifacts-of.html

    It is not all about p-values, and we should all be careful now (and more so in the future) with our p-value-based conclusions.

  5. Moritz Büchi

    Hi,
    After having read your interesting series on NHST, I did some research on the topic and came across the following recent article by John K. Kruschke: http://www.indiana.edu/~kruschke/BEST/BEST.pdf

  6. Uwe Czienskowski

    Hi John,
    nice work that you present in your blog, in particular about NHST. Fundamentally, I agree with most publications that have shown that the common way of doing significance tests in Psychology (and other soft “sciences”) is seriously flawed. The problem is: although virtually every single publication (with a few important exceptions, see below) seems to provide evidence that NHST is the problem, NHST itself is not really affected. The real problem is a weird procedure that has dominated applications of NHST since its adoption in Psychology (and related disciplines): instead of testing the research hypothesis as the hypothesis to be nullified (i.e. rejected), a straw-man hypothesis (the common “nil” null hypothesis) is formulated and taken as the hypothesis to be rejected. This is what Cohen and Meehl call the “weak” form of the null hypothesis test, and we all agree that this procedure is simply invalid. The “strong” form, however, which sets the research hypothesis as the null hypothesis to be rejected, is not only valid, but is also not subject to the common criticisms of the new “ban significance tests” orthodoxy. Thus, instead of (inappropriately) bashing NHST, it would be more useful to teach students and practitioners the proper way to use it. For details, see my blog: “http://sciencecovskij.wordpress.com/2012/05/30/nhst-revisited/”

    Cheers
    uwec

  7. Akhmed

    Interesting discussion. The post seems to favor confidence intervals. Yet exactly the same argument of “unless they are totally independent, I can always find N large enough” can be applied to confidence intervals too.

    If a small effect is uninteresting, then neither a p-value nor a confidence interval is likely to save the paper. A good review team will point this out.

    Also, most people know this, but it is worth mentioning that a p-value plus a point estimate can be converted directly into a confidence interval.

    The scales that I typically use as a reference are:

    if p-value=0.05 and measurement is equal to X, then 95% confidence interval is (0,2X) (that is, from 0 to 2X)

    If p-value=0.01 and measurement is equal to X, then 95% confidence interval is (X-0.68X,X+0.68X).

    If p-value=0.0001 and measurement is equal to X, then 95% confidence interval is (X-0.55X,X+0.55X).

    So, unless the p-value is really small (say, < 0.0001), it may not be worth talking about precision at all: it is quite bad anyway.

    I do agree, however, that when the p-value is below 0.0001 and precision is needed, it is more natural to look at confidence intervals.
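
A minimal R sketch of the conversion Akhmed describes, assuming the p-value comes from a two-sided z-test (the multipliers shift somewhat if a t distribution with few degrees of freedom is more appropriate):

    # Convert a two-sided p-value and a point estimate x into an approximate
    # 95% confidence interval, assuming a normal (z) sampling distribution
    # for the estimate.
    ci_from_p <- function(p, x, level = 0.95) {
      z_obs <- qnorm(1 - p / 2)           # implied |estimate / standard error|
      se <- abs(x) / z_obs                # implied standard error
      half_width <- qnorm(1 - (1 - level) / 2) * se
      c(lower = x - half_width, upper = x + half_width)
    }

    ci_from_p(0.05, 1)    # roughly (0, 2X), matching the first reference scale
    ci_from_p(0.01, 1)
    ci_from_p(0.0001, 1)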