Criticism 5 of NHST: p-Values Measure Effort, Not Truth

Introduction

In the third installment of my series of criticisms of NHST, I focused on the notion that a p-value is nothing more than a one-dimensional representation of a two-dimensional space in which (1) the measured size of an effect and (2) the precision of this measurement have been combined in such a way that we can never pull those two dimensions apart again. Because the size of a p-value depends on two fully independent factors, measured p-values will become smaller when either (A) a researcher focuses on studying an effect whose true magnitude is greater or (B) a researcher works harder to increase the precision of the measurement of the effect they are already committed to studying.

I’d like to dwell on this second strategy for producing low p-values for a moment, because I believe it suggests that p-values — in the hands of a social scientist committed to studying effects that are never exactly zero nor likely to be large in magnitude — are simply a confusing way of reporting the precision of our measurements. Moreover, because the precision of our measurements can always be increased by gathering more data or acquiring better equipment, p-values in the social sciences are nothing more than measures of the effort and money we invest in acquiring more precise measurements, even though many of us would like to think of p-values as a measure of the truth of our hypotheses.

To hammer home the notion that p-values are, in practice, just measurements of sample size, I'm going to provide the reader with a simple algorithm that any researcher can use to ensure that essentially any study in the social sciences will, in expectation, be validated as true by the NHST approach to assessing research. For the sake of argument, we'll insist that our hypothetical study should pass the classical threshold of p < .05.

At a high level, the approach I'll take is similar in spirit to power analysis, but it is different in the details. In power analysis, you would tell me (1) the size of the effect you expect and (2) the sample size you can expect to gather; I would then hand you back the probability of successfully passing the p-value threshold you've set. In the approach described here, you hand me only the size of the expected effect; I then hand you back the minimum sample size you need to gather so that, in expectation, you'll pass the p < .05 threshold. In other words, for any hypothesis you could possibly invent that isn't absolutely false, you can take the information I'll give you and run an experiment with the expectation that you will be able to use NHST to support your hypothesis. This experiment may be extremely expensive to run and may contribute nothing to human welfare if completed successfully, but you can run the experiment in reasonable confidence that you'll be able to publish your results.

To do this, we only need to turn a simple calculation over to our computer:

  1. For any proposed effect size, epsilon, we generate many different random samples of N normally distributed data points centered at epsilon.
  2. For each sample, we run a one-sample t-test comparing the mean of that sample against 0. We store a copy of the p-value associated with this t-test.
  3. Across samples, we average the p-values derived from these t-tests to estimate the expected p-value for a fixed sample size.
  4. If the expected p-value is greater than .05, we increase N to N + 1 and try again.
  5. Eventually we will find a large enough value of N so that the expected p-value is below .05: we store this in a database and move on to the next effect size.

If we generate a large enough number of data sets for each of our effect sizes, our Monte Carlo estimates of the expected p-values should be accurate. With these estimates in hand for many different effect sizes, we can interpolate a smooth curve showing the minimum sample size required for the expected p-value of a study with a given effect size to fall below .05.
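
A minimal R sketch of this recipe looks something like the following; the function names are just for this post, and the number of simulated data sets is fixed at 1,000 to keep the sketch short:

expected_p_value <- function(effect_size, n, n_sims = 1000) {
  p_values <- replicate(n_sims, {
    x <- rnorm(n, mean = effect_size, sd = 1)
    t.test(x, mu = 0)$p.value
  })
  mean(p_values)
}

minimum_n <- function(effect_size) {
  n <- 2  # a one-sample t-test needs at least two observations
  while (expected_p_value(effect_size, n) > 0.05) {
    n <- n + 1
  }
  n
}

minimum_n(1.0)  # a large effect requires only a handful of observations

Sweeping a grid of effect sizes through minimum_n() produces the sort of curve plotted below.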

To calculate these values, I've coded up this simple search algorithm in R and put the code up on GitHub. In this post, I'll simply provide a graph that shows how large a value of N is required to expect to pass a standard t-test as the size of the effect you're studying grows along a logarithmic scale starting at 0.1 and going all the way up to 10:

[Figure: minimum sample size required to expect p < .05, as a function of effect size]

As you can see, you need huge sample sizes for effects as small as 0.1, but remarkably few data points when the effect is sufficiently large. No matter how small your effect may be, you can always do the hard work of gathering data in order to pass the threshold of p < .05. As long as the effect you're studying isn't exactly zero, p-values just measure how much effort you've put into collecting data.

Conclusions

What should we take away from this? A variety of conclusions could be drawn from the graph above, but I'm really only concerned with one of them: if your field always studies effects that aren't exactly zero, it is always possible to design a study that can be expected to pass a significance test. As such, a significance test cannot, in principle, allow science to filter out any idea in the social sciences, except perhaps for those so outlandish that they do not have even a shred of truth in them. All that NHST does is measure how precise your measurement system is, which in turn is nothing more than a metric of how hard you were willing to work to gather data in support of your hypothesis. But the width of a confidence interval is clearly a superior tool for measuring the precision of your measurements, because it reports that precision directly; a p-value only tells you the precision of a measurement after you've peeled off the residue left behind by the size of the effect you're studying.

Thankfully, I think that there are many viable approaches for presenting scientific research and evaluating the research of others that are far superior to p-values. In the next and final post in this series, I'll focus on some suggestions for developing better strategies for analyzing data in the social sciences. Unfortunately, these better strategies require more than a change in methods in the social sciences; they require a change in culture. We need to do less hypothesis testing and more estimation of formal models. But I'll leave the details of that until my next post.

[P.S. Many thanks to Sean Taylor for useful feedback on a draft of this post.]

Optimization Functions in Julia

Update 10/30/2013: Since this post was written, Julia has acquired a large body of optimization tools, which have been grouped under the heading of JuliaOpt.

Over the last few weeks, I’ve made a concerted effort to develop a basic suite of optimization algorithms for Julia so that Matlab programmers used to using fminunc() and R programmers used to using optim() can start to transition code over to Julia that requires access to simple optimization algorithms like L-BFGS and the Nelder-Mead method.

As of today, I believe that enough progress has been made to justify publicly announcing the first draft of an optimization suite for Julia tentatively called optim.jl. I intend to keep working on optim.jl for the next few months and expect that it will eventually become one of the first Julia packages.

For an end-user, interactions with optim.jl should take place almost entirely through the `optimize()` function, which I demonstrate how to use below in the context of minimizing the Rosenbrock function:

load("src/init.jl")
 
function rosenbrock(x::Vector)
  (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
end
 
function rosenbrock_gradient(x::Vector)
  [-2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1],
   200.0 * (x[2] - x[1]^2)]
end
 
function rosenbrock_hessian(x::Vector)
  h = zeros(2, 2)
  h[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2
  h[1, 2] = -400.0 * x[1]
  h[2, 1] = -400.0 * x[1]
  h[2, 2] = 200.0
  h
end
 
problem = Dict()
problem[:f] = rosenbrock
problem[:g] = rosenbrock_gradient
problem[:h] = rosenbrock_hessian
problem[:initial_x] = [0.0, 0.0]
problem[:solution] = [1.0, 1.0]
 
algorithms = ["naive_gradient_descent",
              "gradient_descent",
              "newton",
              "bfgs",
              "l-bfgs",
              "nelder-mead",
              "sa"]
 
for algorithm = algorithms
  results = optimize(problem[:f],
                     problem[:g],
                     problem[:h],
                     problem[:initial_x],
                     algorithm,
                     10e-8,
                     true)
  print(results)
end

I think the code in my current draft of optim.jl is basically usable, but I’m still a little unsure of its correctness as I’ve never developed optimization algorithms before. I’ve set up a basic suite of benchmark functions to test these functions, but I could really use some additional use cases and benchmarks. If you’re interested, please submit pull requests through GitHub that provide new functions that you want optim.jl to be able to minimize automatically. If you submit a function, please provide the function itself, its gradient, its Hessian, a starting point and the global minimum of the function.

I’ve already set up five test functions as benchmarks, which are:

  1. A simple exponential function.
  2. A simple parabolic function.
  3. A simple 4th-degree polynomial function.
  4. The Rosenbrock function.
  5. The Powell function.

The code for these five functions can be seen on GitHub along with drafts of a few other functions.

Below you can see plots that depict the performance of the seven algorithms that are currently available on each of these five problems. In some of these plots, the bar for Naive Gradient Descent drops out: that's because the constant step-size gradient descent algorithm behaves pathologically and produces NaNs at some point. I'd stay away from that algorithm unless you're sure it's what you want.

Stepping away from the perils of hard-coding a step-size for gradient descent, I think the takeaway message from these plots is that you should use BFGS, L-BFGS or Newton's Method if you can compute or safely approximate the Hessian of a function. Only fall back on the other four methods if you find that you really need to avoid using even approximate Hessians or can't compute the gradient of your to-be-minimized function.

The next few steps ahead are to add a few more optimization methods including Brent’s method and conjugate gradients; then start working on constrained optimization; and finally get to work on incorporating automatic differentiation and finite differencing. With those things in place, I think we’ll have a basic optimization function for Julia that will let us keep up with R’s optim. I think we’ve got a long way to go to catch up with fminunc, but I’m sure we’ll get there by the time Julia hits a 1.0 release.

[Figures: run times and solution error for the seven algorithms on the five benchmark problems]

Bayesian Nonparametrics in R

On July 25th, I’ll be presenting at the Seattle R Meetup about implementing Bayesian nonparametrics in R. If you’re not sure what Bayesian nonparametric methods are, they’re a family of methods that allow you to fit traditional statistical models, such as mixture models or latent factor models, without having to fully specify the number of clusters or latent factors in advance.

Instead of requiring you to predetermine the number of clusters or latent factors to keep a statistical algorithm from using as many clusters as there are data points, Bayesian nonparametric methods prevent overfitting by using a family of flexible priors, including the Dirichlet Process, the Chinese Restaurant Process and the Indian Buffet Process, that allow for a potentially infinite number of clusters but nevertheless favor a small number of clusters unless the data demand more.

This flexibility has several virtues:

  1. You don’t have to decide how many clusters are present in your data before seeing the results of your analysis. You just specify the strength of a penalty for creating more clusters.
  2. The results of a fitted Bayesian nonparametric model can be used as informed priors when you start processing new data. This is particularly important if you are writing an online algorithm: a prior like the Chinese Restaurant Process, which in principle allows new clusters to be added at any time, lets you take advantage of all the information you have about existing clusters without arbitrarily forcing new data into those clusters. If a new data point is unique enough, it will create its own cluster.

To give you a sense of how Bayesian nonparametric methods work, I recently coded up the newly published dp-Means algorithm, which provides a deterministic, k-means-like algorithm that allows for a potentially infinite number of clusters but, in practice, uses only a small number of them because it penalizes the creation of additional clusters. This penalty comes from a parameter, lambda, that the user specifies before fitting the model.
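
Here is a minimal R sketch of the dp-means idea under the usual squared-Euclidean-distance setup; the function and variable names are illustrative, and this is only a sketch rather than my actual implementation:

dp_means <- function(x, lambda, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  centers <- matrix(colMeans(x), nrow = 1)  # start with a single global cluster
  assignments <- rep(1, n)

  for (iteration in 1:max_iter) {
    old_assignments <- assignments

    for (i in 1:n) {
      # Squared Euclidean distance from point i to each existing center.
      d2 <- apply(centers, 1, function(center) sum((x[i, ] - center)^2))
      if (min(d2) > lambda) {
        # Point i is too far from every existing cluster: open a new one.
        centers <- rbind(centers, x[i, ])
        assignments[i] <- nrow(centers)
      } else {
        assignments[i] <- which.min(d2)
      }
    }

    # Recompute each center as the mean of its assigned points.
    for (k in 1:nrow(centers)) {
      members <- which(assignments == k)
      if (length(members) > 0) {
        centers[k, ] <- colMeans(x[members, , drop = FALSE])
      }
    }

    if (all(assignments == old_assignments)) break  # assignments have stabilized
  }

  list(centers = centers, assignments = assignments)
}

The key design choice is the single penalty parameter lambda: the cost of opening a new cluster is compared directly against squared distances measured in the data's own units, which is exactly what the image sequence below manipulates.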

The series of images below shows how the dp-Means algorithm behaves when you turn it loose on a data set with four distinct Gaussian clusters. Each image represents a higher penalty for using more clusters.

[Figures: dp-means cluster assignments for lambda = 0, 1, 10, 50, 100, 125 and 200]

As you can see, you can shift the model from a preference for using many clusters to using only one by manipulating the lambda parameter. If you’re interested in learning more and will be in the Seattle area in late July, come out.

The Great Julia RNG Refactor

Many readers of this blog will know that I'm a big fan of Bayesian methods, in large part because automated inference tools like JAGS allow modelers to focus on the types of structure they want to extract from data rather than worry about the algorithmic details of how they will fit their models to data. For me, the ease with which we can construct automated inference tools for Bayesian models is the greatest fruit of the MCMC Revolution. And it's embodied in lots of great software packages like WinBUGS, JAGS, and PyMC — among several others I know less well.

When I develop a new model for analyzing small data sets, I typically can do all of my statistical programming in JAGS. But as soon as the data set I’m working with contains more than 100,000 entries, I find that I need to write sampling code by hand. And, of course, I always write my simulation code by hand.

Traditionally, I would use R for both of these tasks because of the wide variety of RNGs provided by R by default. The convenience of having access to functions like rbinom() and rbeta() in out-of-the-box R makes it easy to forgive R's faults.

But there's always been a small hole in the RNGs provided by R: there's no rdirichlet() function in the "stats" package. In Bayesian statistics, the Dirichlet distribution comes up over and over again, because it is the conjugate prior for models with multinomial data, which include almost all Bayesian models for text analysis. Now, it is true that you can find an R function for Dirichlet random number generation in the "MCMCpack" package, but this is inconvenient, and I find that I frequently forget which package contains rdirichlet() if I haven't used it recently.

To avoid finding myself in the same situation with Julia, I recently wrote up samplers for the multinomial and Dirichlet distributions. And, as of this week, those samplers are now in the standard branch of Julia.

These new samplers are particularly nice to work with because they can take advantage of a substantial refactor of the Julia RNG design that was worked out by several Julia developers over the past few weeks. I’m going to describe this new interface in general, so that I can show how one can use this interface to draw samples from the multinomial and Dirichlet distributions in Julia.

Inspired by SciPy, the main idea of the new approach to RNGs in Julia is to exploit Julia's multiple dispatch system so that functions for working with distributions can be named as generically as possible. Instead of relying on naming conventions as R does (think of rbinom, dbinom, pbinom and qbinom), Julia uses the exact same function names for all probability distributions. To generate Gaussian, Gamma or Dirichlet draws, you'll always use the rand() function. For the probability mass or density functions, you'll always use pmf() or pdf(). For the cumulative distribution function, you'll always use cdf(). The list goes on and on: there are even customized mean() and var() functions for many distributions that return the expectation and variance of the theoretical random variable, rather than the empirical mean or variance of a sample from that distribution.

Julia can make this constant naming convention work smoothly by giving each probability distribution its own explicit position in the Julia type system. For example, there is now a Binomial type defined as:

type Binomial
  size::Int
  prob::Float64
end

To generate a random binomial sample of 10 draws, each with probability 0.8 of heads, you run the following code:

load("base/distributions.jl")
 
rand(Binomial(10, 0.8))

I quite like this approach. At times, it makes Julia feel a bit like Mathematica, because you can interact with abstract concepts rather than thinking about data structures, as seen in this calculation of the theoretical variance of a binomial variable:

load("base/distributions.jl")
 
var(Binomial(10, 0.8))

That’s enough of an introduction to Julia’s new RNG design for me to show off the multinomial and Dirichlet functionality. Here we go:

load("base/distributions.jl")
 
##
# Multinomial
##
 
# The multinomial distribution for one draw, with probabilities [0.5, 0.4, 0.1] of drawing each of three types of object.
d = Multinomial(1, [0.5, 0.4, 0.1])
 
# Draw one object from three types, all equally probable.
d = Multinomial(1, 3)
 
# Compute the mean and variance of the distribution under its parameterization.
mean(d)
var(d)
 
# Compute the probability mass function.
pmf(d, [1, 0, 0])
pmf(d, [1, 1, 0])
pmf(d, [0, 1, 0])
 
# Draw a random sample.
rand(d)
 
##
# Dirichlet
##
 
# The Dirichlet distribution with concentration parameter = [1.0, 2.0, 1.0].
d = Dirichlet([1.0, 2.0, 1.0])
 
# The mean and variance of this distribution.
mean(d)
var(d)
 
# Compute the probability density function.
pdf(d, [0.1, 0.8, 0.1])
 
# Draw a random sample.
rand(d)

I’m excited to have these tools in Julia by default now. They’ll make it much easier to write clean code for Gibbs samplers for models like LDA that make heavy use of Dirichlet priors.

I also want to note that while I was writing up these new samplers, Chris DuBois has been building some very general-purpose MCMC functions that will let people write a sampler for an arbitrary Bayesian model fairly easily. You can see his examples on GitHub.

Criticism 4 of NHST: No Mechanism for Producing Substantive Cumulative Knowledge

[Note to the Reader: This is a much rougher piece than the previous pieces because the argument is more complex. I ask that you please point out places where things are unclear and where claims are not rigorous.]

In this fourth part of my series of criticisms of NHST, I’m going to focus on broad questions of epistemology: I want to ask what types of knowledge we can hope to obtain using NHST and what types of knowledge we would need to obtain as social scientists before our work will offer the sort of real-world value that older sciences like physics and chemistry provide. Above all else, my argument is based upon my conviction that the older sciences are currently superior to the social sciences because they can make precise numerical predictions about important things in the world — after all, it is this numerical precision that allows humanity to construct powerful technologies like satellites and computers. That we have failed so far to construct comparably precise models of human behavior is no great cause for concern: we have chosen to study topics in which there are no low-hanging fruit like the equations defining Newtonian gravity’s effect on small objects — equations which could be discovered from experimental data quite easily using modern (and not so modern) statistical techniques. We social scientists are going slower and reaching less powerful conclusions than the physical scientists have because we are studying something that is intrinsically much harder to learn about. If our subject matter were easier, then we could turn our attention to developing the sort of complex mathematical machinery and rich formal theory that makes modern physics so interesting, but also so difficult for many people.

In large part, this piece can be considered an elaboration of the claim by Paul Meehl that:

The almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is… basically unsound.

To do this, we will build up an idealized world in which we can clearly see that NHST provides no mechanism for the accumulation of substantive knowledge. In this idealized world, it will be easy to describe (1) the types of knowledge that we might aspire to possess and (2) the types of knowledge that NHST can actually provide. By doing so, we will show that NHST, even when practiced perfectly using infinite data, can achieve only a very weak approximation of the types of knowledge that one could attain by using the most basic quantitative modeling strategies. In addition, we will be able to show that the two forms of NHST in popular use, which are either (1) a two-sided test of the point null hypothesis that some constant m = 0 or (2) a one-sided test that establishes only the sign of m, are equally unhelpful in our pursuit of cumulative knowledge, because the knowledge gained by falsifying two different null hypotheses about two different constants does not combine into anything stronger: as we will see, it cannot even answer simple questions that involve both constants at once. In short, we will try to demonstrate by example that, even in a much simpler world than our own, NHST cannot form a major part of a successful science's long-term research program.

To construct our idealized world in which it is easy to measure the extent of our knowledge, let us suppose that the exact structure of the universe is entirely linear and deterministic: there are n basic components of reality (which are measured as x_i's) and any one of these basic variables is related to every other variable through a single equation of the form:

beta_n * x_n = beta_1 * x_1 + beta_2 * x_2 + ... + beta_{n-1} * x_{n-1}

In Nature’s caprice, a few of these variables may have zero coefficients (i.e. beta_i = 0 for some i), but we will not assume that most have zero coefficients. In an idealized world similar to the one facing the social sciences, it will generally be the case that there are almost no zero coefficients, because, as David Lykken observed, there is a crud factor in psychology linking all observables about human beings.1

Now that we have constructed our idealized world in which randomness plays no part (except when it is the result of not measuring one of the x_i's) and in which non-linearity never arises, let us set up an idealized social science research program. In this idealized science, we will learn about the structure of the world (which, when done correctly, is identical to learning the equation above) by slowly discovering the x_i's that could be measured through some unspecified process and then performing randomized experiments to learn the coefficient of each. But, because we test only null hypotheses, our knowledge of each coefficient ultimately reduces either to (1) knowledge that beta_i != 0 or (2) knowledge that beta_i > 0 or that beta_i < 0. We will call these two types of NHST Form 1 and Form 2.

Because our knowledge about the beta_i reduces, at most, to knowledge of the sign of each beta_i, suppose that we one day wished to learn this model in full quantitative form. The final fruits of an indefinitely long progression of our idealized science's development, using NHST Form 1 as our primary mechanism for verifying empirical results, would be a list of the coefficients that need to be measured to build up the quantitative theory, i.e. a list of the non-zero coefficients. The results of using NHST Form 2 would be that same list of coefficients along with the additional piece of prior information that each of those coefficients has a known sign. In short, the perfect state of knowledge attainable through NHST would be little more than a cost-saving measure for the construction of a quantitative model. And if we believe in Lykken's crud factor, NHST Form 1 will not even provide us with that much: there will be no zero coefficients, so the list of coefficients that can safely be skipped will be empty, saving us no work and giving us no information about how to construct a quantitative theory.

But let us ignore the building of a quantitative theory for a moment and consider instead what this perfect state of knowledge would actually mean to us. To do this, we can ask how close we would come to full knowledge of the world: indeed, we could ask how many bits of information we have about the coefficients in this model. Sadly, the answer is either 0 or 1 bit per coefficient, so that we have at most n bits of knowledge after measuring n coefficients with infinitely precise application of NHST. This is in sharp contrast to a person who has measured the actual values of the coefficients: that person will know m * n bits of information about the world, where m is the precision in bits of the average coefficient's measurement. In short, there is an upper bound for any user of NHST on the amount of information they can obtain: it is n bits of information. In contrast, the quantitative theorist has an upper bound of infinite bits of information.

But this argument about bits is unlikely to convince many people who are not already very quantitative by nature: measuring things in bits only appeals to people who like to measure things in units. So let us ask a different sort of question: we can ask what sort of questions about the world we could answer using this perfect state of NHST knowledge. Again, the response is disheartening: under NHST Form 2 (which is strictly stronger than NHST Form 1), we will only be able to answer questions that can be broken apart into pieces, each of which reduces to the form: "is beta_i > 0, beta_i = 0, or beta_i < 0?" But the answer to such a question does not merit the name of quantitative knowledge: it is knowledge of direction alone. This is not surprising: our NHST methodology focused on direction alone from the very start. But we will now show that perfect knowledge of direction is of shockingly little value for making predictions about the world. For instance, we could not answer any of the following questions:

(1) If I set all of the x_i to 1 except for x_n, what will be the value of x_n? This is the idealized form of all prediction tasks: given a set of inputs, make a prediction about an output. Being unable to solve this most basic prediction task means that we have not built up a base of cumulative knowledge with any capacity to predict unseen events. But the value of science is defined in large part by its capacity to predict unseen events.

(2) Is x_1 more important than x_2 in the sense that changing x_1's value from a to b would have more effect on x_n than changing x_2's value from a to b? This is the basic form of any question asking which of our variables is most important. Being unable to answer such questions demonstrates that we have no sense of what objects in our theory really have the capacity to affect human lives.

(3) If we set all of the x_i to 1 except for x_n, what is the sign of x_n? We cannot answer even this question (which is superficially just an aggregation of the basic directional questions posed earlier), because, if beta_1 > 0, beta_2 < 0, and beta_i = 0 for all other i, then predicting the sign of x_n requires knowing whether beta_1 > |beta_2|. This is particularly damning, because it means that accumulated knowledge of the signs of many coefficients is not even sufficient to produce knowledge of the sign of a single output that depends on only two variables. This is the basic form of asking whether our knowledge is cumulative, rather than shattered and lacking conceptual coherence.

What can we conclude from our failure to answer such simple questions about our field of study in an idealized world in which reality has a simple mathematical structure and in which we assume that we have performed infinitely many experiments with infinite precision? I believe that we should conclude that NHST is basically unusable as a method for doing science: it amounts to what Noam Chomsky would call butterfly-collecting, except that our butterflies are variables that we have proven to have non-zero coefficients. We have no knowledge of the predictive importance of these variables nor of the scale of their effects on human experiences.

And this is true even in our idealized world, in which methods like ANOVA and linear regression would be able to tell us the exact structure of the universe. Given that those methods exist and are universally taught to social scientists, why do we persist in using NHST, a method that, if used in isolation, is strictly incapable of producing any substantive knowledge of the structure of the world?

One argument that can be given in response is that we do not only have access to null hypothesis tests against zero: we can test against hypotheses like beta_1 > 1 and beta_i < 3. By using infinitely many of these enriched tests (let us call these directional tests against arbitrary values NHST Form 3), we could ultimately perform a binary search of the space of all values for beta_i and slowly build up an idea of the quantitative value of each of the coefficients.
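
To make this concrete, here is a toy R sketch of that binary search for a single coefficient, treated here as the mean of a set of measurements; the bounds, sample size and number of iterations are arbitrary choices for the illustration:

set.seed(1)
x <- rnorm(10000, mean = 2.7, sd = 1)  # measurements of a "coefficient" whose true value is 2.7

lower <- -10
upper <- 10
for (i in 1:20) {
  midpoint <- (lower + upper) / 2
  # Test the null that the mean is at most midpoint against the alternative that it is greater.
  p <- t.test(x, mu = midpoint, alternative = "greater")$p.value
  if (p < 0.05) {
    lower <- midpoint  # rejected "mean <= midpoint", so search above it
  } else {
    upper <- midpoint
  }
}
c(lower, upper)  # after twenty separate tests, a narrow range near 2.7

# A single confidence interval delivers the same information, together with an
# honest statement of its precision, in one step:
t.test(x)$conf.int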

But admitting that we are interested in learning the actual values of the beta_i should not encourage us to use more elaborate forms of NHST: it should be the first step towards deciding never to use NHST again. Why? Because the binary search paradigm described above is noticeably inferior to strategies in which we simply use methods like linear regression to estimate the values of the beta_i's directly while placing confidence intervals around those estimated values to keep a record of the precision of our knowledge.

And this realization that simple linear regression would be a giant leap forward leads us to notice what is perhaps the most perverse part of our continued use of NHST in the social sciences: the machinery of NHST actually requires us to estimate all of the coefficients in our idealized equation (albeit in a possibly biased way), because t-tests, ANOVAs and linear regressions all require those estimates before one can compute a p-value. We are already producing, as a byproduct of our obsession with NHST, a strictly superior form of knowledge about the beta_i's, but we throw all of that information away! The entire substance of what I would call powerful scientific knowledge is considered a worthless intermediate step in falsifying the null and supporting one's favorite qualitative hypothesis: the estimates of the beta_i's are typically published, but simple verbal questioning of most social scientists will confirm that people do not walk away from the papers they have read with this quantitative knowledge in memory.
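
As a small illustration, consider simulating a miniature version of the idealized linear world and fitting an ordinary regression to it; the coefficient values below are arbitrary choices for the example, and the noise term stands in for the unmeasured x_i's:

set.seed(1)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x_outcome <- 2.0 * x1 - 0.5 * x2 + 0.0 * x3 + rnorm(n)

fit <- lm(x_outcome ~ x1 + x2 + x3)
summary(fit)  # coefficient estimates, standard errors, t statistics and p-values
confint(fit)  # the quantitative knowledge that a pure NHST report discards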

I find it very odd that, in an age so otherwise concerned with the conservation of resources, we should be so careless with quantitative knowledge that we are willing to discard it as a valueless intermediate byproduct of the process of falsifying the null. I think this careless indifference to the more meaningful quantitative values that are estimated as a means to the end of producing a p-value reflects powerful epistemological misunderstandings on the part of the social sciences: we have much too little interest in quantitative thinking, foremost of which should be an interest in exact numerical prediction. And it is this absence of interest in exact numerical prediction that sets us apart from astronomy, a field that, unlike our own, does not even have limited access to randomized experiments. Astronomers have learned to make do using only observational data by building formal and numerically precise models that make sharp predictions whose accuracy can be rigorously tested. As I have shown, even if practiced perfectly in a simpler world, NHST would not produce a body of knowledge with value even close to the knowledge that any undergraduate in astronomy has.2

As a field, we should not accept this condition. In the long term, scientific knowledge must be quantitative. We are already capable of producing such knowledge: indeed, we produce it as a waste product of our quest to falsify the null. If we abandon our interest in NHST and focus instead on the prediction of quantitative measurements, we will put our field on a healthier path towards catching up with the physical sciences. While we may always trail behind because of the intrinsic difficulty of our work, I also believe that the social sciences, in principle, have much more to offer to humanity than the natural sciences. We should start to live up to that promise more often. But NHST is a serious impediment to that, because it pushes us towards the acquisition of a radically inferior type of knowledge that is not cumulative: there is no viable method for combining the successful falsification of the null about beta_i with the falsification of the null about beta_j into a coherent whole that will predict unseen data.

References

Paul E. Meehl (1978), "Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology", Journal of Consulting and Clinical Psychology

David A. Freedman (1991), "Statistical Models and Shoe Leather", Sociological Methodology

Paul E. Meehl (1990), "Why Summaries of Research on Psychological Theories are Often Uninterpretable", Psychological Reports

  1. I have found that this crud factor surprises some non-psychologists. To convince them, I would note that Meehl once collected 15 variables about people and regressed every pair of them: 96% of the pairs showed a highly significant (p < 10e-6) correlation.
  2. I note that a side product of this observation should be the conclusion that we psychologists have an excessive faith in the value of randomized experiments. It is the sharp prediction of unseen data in a way that is readily falsifiable that makes a field of inquiry as valuable to humanity as the physical sciences already are: while we social scientists do, in fact, possess a good deal of such knowledge, it is attributable almost entirely to the intellectual talent and personal integrity of social scientists, both of which function despite our crippling usage of NHST.

Criticism 3 of NHST: Essential Information is Lost When Transforming 2D Data into a 1D Measure

Introduction

Continuing on with my series on the weaknesses of NHST, I’d like to focus on an issue that’s not specific to NHST, but rather one that’s relevant to all quantitative analysis: the destruction caused by an inappropriate reduction of dimensionality. In our case, we’ll be concerned with the loss of essential information caused by the reduction of a two-dimensional world of uncertain measurements into the one-dimensional world of p-values.

p-Values Mix Up the Strength of Effects with the Precision of Their Measurement

For NHST, the two independent dimensions of measurement are (1) the strength of an effect, measured using the distance of a point estimate from zero; and (2) the uncertainty we have about the effect’s true strength, measured using something like the expected variance of our measurement device. These two dimensions are reduced into a single p-value in a way that discards much of the meaning of the original data.

When using confidence intervals, the two dimensions I’ve described are equivalent to the position of the center of the confidence interval relative to zero and the width of the confidence interval. Clearly these two dimensions can vary independently. Working through the math, it is easy to show that p-values are simply a one-dimensional representation of these two dimensions.1

To illustrate how many different kinds of data sets receive the same p-value under NHST, let’s consider three very different data sets in which we test for a difference across two groups and then get the same p-value out of our analysis:

[Figure: three studies with very different effect sizes and measurement precisions that produce identical p-values]

Clearly these data sets are substantively different, despite producing identical p-values. Really, we’ve seen three qualitatively different types of effects under study:

  1. An effect that is probably trivial, but which has been measured with considerable precision.
  2. An effect with moderate importance that has been measured moderately well.
  3. An effect that could be quite important, but which has been measured fairly poorly.

No one can argue that these situations are not objectively different. Importantly, I think many of us also feel that the scientific merits of these three types of research are very different: we have some use for the last two types of studies and no real use for the first. Sadly, I suspect that the scientific literature increasingly focuses on the first category, because it is always possible to measure anything precisely if you are willing to invest enough time and money. If the community’s metric for scientific quality is a p-value, which can be no more than a statement about the precision of measurements, then you will find that scientists produce precise measurements of banalities rather than tentative measurements of important effects.
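
A toy calculation makes the point concrete (these are made-up numbers, not the data behind the figure above): any estimate and standard error with the same ratio produce exactly the same p-value.

estimates <- c(0.1, 1.0, 10.0)   # a tiny, a moderate and a large effect
std_errors <- c(0.04, 0.4, 4.0)  # measured precisely, moderately well and poorly

z <- estimates / std_errors      # identical test statistics
p_values <- 2 * pnorm(-abs(z))   # identical two-sided p-values (about 0.012)

# Confidence intervals, by contrast, keep the two dimensions separate.
lower <- estimates - 1.96 * std_errors
upper <- estimates + 1.96 * std_errors

data.frame(estimates, std_errors, p_values, lower, upper)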

How Do We Solve This Problem?

Unlike the problems described in previous posts, this problem with the use of NHST can be solved without any great effort to teach people better methods: to compute a p-value, you already have to estimate both the strength of an effect and the precision of its measurement. Moving forward, we simply need to be certain that we report both of these quantities instead of the one-number p-value summary.

Sadly, people have been arguing for this change for years without much success. To break the impasse, I think we need to push our community to impose a flat-out ban: going forward, researchers should only be allowed to report confidence intervals. Given that p-values can always be derived from the more informative confidence intervals, while the opposite transformation is not possible, how compelling could any argument be for continuing to tolerate p-values?

References

Ziliak, S.T. and McCloskey, D.N. (2008), “The cult of statistical significance: How the standard error costs us jobs, justice, and lives”, Univ of Michigan Press

  1. Indeed, p-values are effectively constructed by dividing the distance of the point estimate from zero by the width of the confidence interval and then passing this normalized distance through a non-linear function. I’m particularly bewildered by the use of this non-linear function: most people have trouble interpreting numbers already, and this transformation seems almost designed to make the numbers harder to interpret.

Criticism 2 of NHST: NHST Conflates Rare Events with Evidence Against the Null Hypothesis

Introduction

This is my second post in a series describing the weaknesses of the NHST paradigm. In the first post, I argued that NHST is a dangerous tool for a community of researchers because p-values cannot be interpreted properly without perfect knowledge of the research practices of other scientists — knowledge that we cannot hope to attain.

In this post, I will switch gears and focus on a weakness of NHST that afflicts individual researchers as severely as it afflicts communities of researchers. This post focuses on the concern that NHST is not robust to “absolutely rare” events, by which I mean events that have low probability under all possible hypotheses. NHST erroneously treats such “absolutely rare” events as evidence against the null hypothesis. In practice, this means that many researchers will treat such “absolutely rare” events as evidence in support of the alternative hypothesis, even when the null hypothesis assigns higher probability to the observed data than the alternative hypothesis and should therefore be considered the superior model.

To explain this concern using a detailed example, I will borrow an idea from Jacob Cohen. Before I do that, I need to outline a version of NHST that I believe is an accurate description of the way many practicing scientists actually use NHST.

In this formulation, the scientist observes an event; then posits a model, called the null hypothesis, that accounts for this and similar events using only chance mechanisms; and then estimates the probability of observing this sort of event under the null hypothesis. If the event and its kind are assigned low probability, the null hypothesis is rejected. For many scientists, this rejection is taken as evidence in favor of their preferred hypothesis, which we call the alternative hypothesis.

Cohen’s Example: Americans and Members of Congress

In our example, we will imagine meeting a new person named John Smith. We entertain two hypotheses about him: one, the null hypothesis, asserts that John Smith is an American. The other, the alternative hypothesis, asserts that John Smith is not an American. Because these hypotheses are mutually exclusive, it seems acceptable to say that any evidence against the null hypothesis should constitute evidence in support of the alternative hypothesis. We will demonstrate that p-values cannot constitute such evidence.

We do this by creating a simple null model: under the null hypothesis, the probability that John Smith is a Member of Congress is roughly 535 / 311,000,000, which is close to two in a million. In short, John Smith being a Member of Congress is a very improbable event under the null hypothesis.

Under the alternative hypothesis, the probability that John Smith is a Member of Congress is 0, because non-Americans cannot serve in Congress. This is an uncommon virtue in probabilistic models, because the existence of an event with zero probability means that the alternative hypothesis is actually falsifiable.

Now suppose that we do not know John Smith’s citizenship, but we do find out that he is a Member of Congress. Using NHST, we will perversely reject the null hypothesis that John is an American because Americans are very rarely Members of Congress. If we continue on from this rejection of the null to an acceptance of the alternative hypothesis, we will conclude that John Smith is not an American.

This is troubling, because a simple calculation using Bayes’ Theorem will demonstrate that the probability that John Smith is an American given that he is a Member of Congress is 1: we are absolutely certain that John Smith is an American. Yet NHST would have us reject this absolutely certain conclusion. For those who harbor an inveterate suspicion of Bayes’ Theorem, we can note that the likelihood ratio in this example is infinite in support of the null hypothesis and that a significance test of this ratio leads us to fail to reject the null hypothesis.1
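
For readers who want to see the arithmetic, here is the Bayes' Theorem calculation written out in R; the population figures are the rough numbers used above, plus an assumed world population of about seven billion:

p_american <- 311e6 / 7e9                 # prior: a randomly chosen person is American
p_congress_given_american <- 535 / 311e6  # roughly two in a million
p_congress_given_foreign <- 0             # non-Americans cannot serve in Congress

# Total probability of meeting a Member of Congress.
p_congress <- p_congress_given_american * p_american +
  p_congress_given_foreign * (1 - p_american)

# Posterior probability that John Smith is American, given that he is in Congress.
p_congress_given_american * p_american / p_congress  # exactly 1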

How Can NHST Make Such An Obvious Mistake?

For many readers, I suspect that this example will seem too outlandish to be trusted: it seems hard to believe that the basic machinery of NHST could be so fragile as to allow an example like ours to break it. Such readers are right to suspect that some trickery is required to power our thought experiment: we must assume that we have observed data that is “absolutely rare”, in the sense that the probability that a person is a Member of Congress, across both the null hypothesis and the alternative hypothesis, is only one in ten million. As such, when applied to randomly selected human beings, our hypothesis test will perform well: we will make the mistake I have highlighted only one in ten million times. As a bet, NHST is safe: indeed, there are few safer bets if we only select randomly among Americans while using a test with such a low false positive rate. But when a rare event does occur, NHST will produce erroneous results. In other words, NHST treats rare events interchangeably with evidence against the null hypothesis, even though this equivalence is not defensible when the data is viewed from another perspective. Moreover, if we begin to select randomly from the true population of the Earth, our use of a procedure with such an incredibly low false positive rate is actually a serious error, because our method has very low power to detect non-Americans — even though they are the vast majority of all human beings. For those who believe that the null hypothesis is almost always false in research studies, the use of underpowered methods is particularly troubling.

Sadly, unambiguous examples like this one are harder to construct using the types of data typically employed in the modern sciences, because our hypotheses and measurements are typically continuous rather than binary and are rarely absolutely falsifiable. And because we typically lack verifiable base rate information, we cannot expect to use Bayes’ Theorem to calculate the relative probability of our hypotheses without subjective decisions about the a priori plausibility of our hypotheses. For some readers, this may make our example seem misleading: NHST only looks bad when clearly superior methods are available, which they seldom are.

But I think that this example nevertheless reveals a deep weakness of NHST: absolutely rare events will happen. A method that rejects the null hypothesis when rare events occur, without even testing whether the alternative hypothesis is plausible, seems very questionable to me, especially when we use a liberal threshold like p < 0.05. Because events only need to be very weakly rare to merit rejection under conventional NHST standards, the sort of hidden multiple testing I discussed in the last post may ensure that rare events appear in the literature far more often than our nominal false positive rate suggests. If an event only needs to be rare at the p = 0.05 level to be publishable, then one only needs to conduct twenty studies in a row to expect to produce such an event. Surely people can work hard enough to produce that sort of rarity when their careers depend upon it.

How Can We Do Better?

What lesson should we take away from this bizarre example in which NHST leads us to reject a hypothesis that the data incontrovertibly supports? In my mind, the lesson that we should take away is that we cannot hope to learn about the relative plausibility of two hypotheses without assessing both hypotheses explicitly. We should not fault the null hypothesis for failing to predict data that the alternative hypothesis would not have predicted any better. Our interest as scientists should always lie in creating and selecting models with superior predictive power: we learn little to nothing by defeating a hypothesis that may not have been given a fair chance because of the quirky data being used to test it.

NHST does not satisfy the demand to evaluate both hypotheses explicitly, because it does all calculations using only the null hypothesis while entirely ignoring the alternative hypothesis. This is particularly troubling when so many practicing scientists treat a low p-value as evidence in support of the alternative hypothesis, even though this conclusion cannot generally be supported and was not part of Fisher’s original intention when introducing NHST. In short, NHST too often leads us to reject true hypotheses and too easily leads us to transform our rejection of true hypotheses into an acceptance of inferior hypotheses.

References

Cohen, J. (1994), ‘The Earth is Round (p < .05)’, American Psychologist

  1. In passing, I note that it should trouble readers that two different types of NHST give opposite answers. This is just one example of what Bayesians would call the incoherence of NHST as a paradigm.

Criticism 1 of NHST: Good Tools for Individual Researchers are not Good Tools for Research Communities

Introduction

Over my years as a graduate student, I have built up a long list of complaints about the use of Null Hypothesis Significance Testing (NHST) in the empirical sciences. In the next few weeks, I'm planning to publish a series of blog posts, each of which will articulate one specific weakness of NHST. The weaknesses I will discuss are not novel observations: people have been complaining about the use of p-values since the 1950s. My intention is simply to gather all of the criticisms of NHST in a single place and to articulate each of them in a way that permits no confusion. I'm hoping that readers will comment on these pieces and give me enough feedback to sharpen the points into a useful resource for the community.

In the interest of absolute clarity, I should note at the start of this series that I am primarily unhappy with the use of p-values as (1) a threshold that scientific results are expected to pass before they are considered publishable and (2) a measure of the evidence in defense of a hypothesis. I believe that p-values cannot be used for either of these purposes, but I will concede upfront that p-values can be useful to researchers who wish to test their own private hypotheses.

With that limitation of scope in mind, let’s get started.

Communities of Researchers Face Different Problems than Individual Researchers

Many scientists who defend the use of p-values as a threshold for publication employ an argument that, in broad form, can be summarized as follows: “a community of researchers can be thought of as if it were a single decision-maker who must select a set of procedures for coping with the inherent uncertainties of empiricism — foremost of which is the risk that purely chance processes will give rise to data supporting false hypotheses. To prevent our hypothetical decision-maker from believing in every hypothesis for which there exists some supporting data, we must use significance testing to separate results that could plausibly be the product of randomness from those which provide strong evidence of some underlying regularity in Nature.”

While I agree with part of the argument above (p-values, when used appropriately, can help an individual researcher resist their all-too-human inclination to discover patterns in noise), I do not think that this sort of argument applies with similar force to a community of researchers, because the types of information necessary for correctly interpreting p-values are always available to individual researchers acting in isolation, but are seldom available to the members of a community who learn about each other's work from published reports. For example, the community will frequently be ignorant of the exact research procedures used by its members, even though the details of these procedures can have profound effects on the interpretation of published p-values. To illustrate this concern, let's work through a specific hypothetical example of a reported p-value that cannot be taken at face value.

The Hidden Multiple Testing Problem

Imagine that Researcher A has measured twenty variables, which we will call X1 through X20. After collecting data, Researcher A attempts to predict one other variable, Y, using these twenty variables as predictors in a standard linear regression model in which Y ~ X1 + … + X20. Imagine, for the sake of argument, that Researcher A finds that X17 has a statistically significant effect on Y at p < .05 and rushes to publish this result in the new hit paper: "Y Depends upon X17!". How will Researcher B, who sees only this result and no mention of the 19 variables that failed to predict Y, react?

If Researcher B embraces NHST as a paradigm without misgivings or suspicion, B must react to A's findings with a credulity that could never be defended in the face of perfect information about Researcher A's research methods. As I imagine most scientists are already aware, Researcher A's result is statistically invalid, because the significance threshold that has been passed depends upon a set of assumptions violated by the search through twenty different variables for a predictive relationship. When you use standard NHST p-values to evaluate a hypothesis, you must acquire a new set of data and then test exactly one hypothesis on the entire data set. In our case, each of the twenty variables that was evaluated as a potential predictor of Y constitutes a separate hypothesis, so that Researcher A has not conducted one hypothesis test, but rather twenty. This is conventionally called multiple testing; in this case, the result of multiple testing is that the actual probability of at least one variable being found to predict Y due purely to luck is closer to 50% than to the 5% level suggested by a reported p-value of p < 0.05 (if the twenty tests were independent, it would be roughly 1 - 0.95^20, or about 64%).

What is worrisome is that this sort of multiple testing can be effortlessly hidden from Researcher B, our hypothetical reader of a scientific article. If Researcher A does not report the tests that failed, how can Researcher B know that they were conducted? Must Researcher B learn to live in fear of his fellow scientists, lest he be betrayed by their predilection to underreport their methods? As I hope is clear from our example, NHST as a method depends upon a faith in the perfection of our fellow researchers that will easily fall victim to any mixture of incompetence or malice on their part. Unlike a descriptive statistic such as a mean, a p-value purports to tell us something that it cannot tell us without perfect information about the exact scientific methods used by every researcher in our community. An individual researcher will necessarily have this sort of perfect information about their own work, but a community typically will not. The imperfect information available to the community implies that reasoning about the community's ideal standards for measuring evidence based on the ideal standards for a hypothetical individual will be systematically misleading.

If an individual researcher conducts multiple tests without correcting p-values for this search through hypotheses, the individual researcher will develop false hypotheses and harm only themselves. But if even one member of a community of researchers conducts multiple tests and publishes results whose interpretation cannot be sustained in the light of knowledge of the hidden tests that took place, the community as a whole will be left with a permanent record of a hypothesis supported by illusory evidence. And this illusion of evidence cannot easily be discovered after the fact without investing effort in explicit replication studies. Indeed, after Researcher A dies, any evidence of their statistical errors will likely disappear, except for the puzzling persistence of a paper reporting a relationship between Y and X17 that has not been found again.
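
A small simulation makes the size of the problem concrete. The sample size and number of simulations below are arbitrary, and Y is generated independently of every predictor, so any "significant" predictor is a false positive:

set.seed(1)
n_simulations <- 1000
n_obs <- 100
n_predictors <- 20

false_positive <- replicate(n_simulations, {
  x <- matrix(rnorm(n_obs * n_predictors), ncol = n_predictors)
  y <- rnorm(n_obs)  # Y is pure noise, unrelated to every predictor
  p_values <- coef(summary(lm(y ~ x)))[-1, "Pr(>|t|)"]  # drop the intercept row
  any(p_values < 0.05)  # did at least one predictor appear to "work"?
})

mean(false_positive)  # roughly 0.6, not the nominal 0.05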

Conclusion

What should we take away from this example? We should acknowledge that there are deep problems with the theoretical framework used to justify NHST as a scientific institution. NHST, as it stands, is based upon an inappropriate analogy between a community of researchers and a hypothetical decision-maker who evaluates the research of the whole community using NHST. The actual community of researchers suffers from imperfect information about the research methods being used by its members. The sort of fishing through data for positive results described above may result from either statistical naivete or a genuine lack of scruples on the part of our fellow scientists, but it is almost certainly occurring. NHST only exacerbates the problem, because there is no credible mechanism for ensuring that we know how many hypotheses were tested before one was found that satisfies our community's threshold.

Because the framework of NHST is not appropriate for use by a community with imperfect information, I suspect that the core objective of NHST — the prevention of false positive results — is not being achieved. At times, I even suspect that NHST has actually increased the frequency of reporting false positive results, because the universality of the procedure encourages blind searching through hypotheses for one that passes a community’s p-value threshold.

This is an unfortunate situation, because I am very sympathetic to those proponents of NHST who feel that it is an unambiguous, algorithmic procedure that diminishes the extent of subjective opinion in evaluating research work. While I agree that diminishing the dependence of science on subjectivity and personal opinion is always valuable, we should not, in our quest to remove subjectivity, substitute in its stead a method that depends upon an assumption of the perfect wisdom and honesty of our fellow scientists. Despite our strong desires to the contrary, human beings make mistakes. As Lincoln might have said, some researchers make mistakes all of the time and all researchers make mistakes some of the time. Because NHST is being used by a community of researchers rather than the theoretical individual for which it was designed, NHST is not robust to the imperfections of our fellow scientists.

References

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). ‘False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.’ Psychological Science, 22(11), 1359–1366. Available on SSRN.

cumplyr: Extending the plyr Package to Handle Cross-Dependencies

Introduction

For me, Hadley Wickham’s reshape and plyr packages are invaluable because they encapsulate omnipresent design patterns in statistical computing: reshape handles switching between the different possible representations of the same underlying data, while plyr automates what Hadley calls the Split-Apply-Combine strategy, in which you split up your data into several subsets, perform some computation on each of these subsets and then combine the results into a new data set. Many of the computations implicit in traditional statistical theory are easily described in this fashion: for example, comparing the means of two groups is computationally equivalent to splitting a data set of individual observations up into subsets based on the group assignments, applying mean to those subsets and then pooling the results back together again.
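As a hypothetical illustration of that pattern, the group-means computation can be written as a single plyr call; the data frame and column names here are invented for the example.

library(plyr)

# Ten observations assigned to two groups.
observations <- data.frame(group = rep(c('a', 'b'), each = 5),
                           value = rnorm(10))

# Split by group, apply mean to each subset, combine into a new data frame.
ddply(observations, .(group), summarise, group.mean = mean(value))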

The Split-Apply-Combine Strategy is Broader than plyr

The only weakness of plyr, which automates so many of the computations that instantiate the Split-Apply-Combine strategy, is that plyr implements one very specific version of the Split-Apply-Combine strategy: plyr always splits your data into disjoint subsets. By disjoint, I mean that any row of the original data set can occur in only one of the subsets created by the splitting function. For computations that involve cross-dependencies between observations, this makes plyr inapplicable: cumulative quantities like running means and broadly local quantities like kernelized means cannot be computed using plyr. To highlight that concern, let’s consider three very simple data analysis problems.

Computing Forward-Running Means

Suppose that you have the following data set:

Time Value
1 1
2 3
3 5

To compute a forward-running mean, you need to split this data into three subsets:

Time Value
1 1
Time Value
1 1
2 3
Time Value
1 1
2 3
3 5

In each of these clearly non-disjoint subsets, you would then compute the mean of Value and combine the results to give:

Time Value
1 1
2 2
3 3

This sort of computation occurs often enough in a simpler form that R provides tools like cumsum and cumprod to deal with cumulative quantities. But the splitting problem in our example is not addressed by those tools, nor by plyr, because the cumulative quantities have to be computed on subsets that are not disjoint.
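For this particular statistic, base R can sidestep the splitting entirely; a two-line check of the worked example above:

value <- c(1, 3, 5)
cumsum(value) / seq_along(value)  # 1 2 3

But this trick is specific to the running mean: it does not generalize to arbitrary functions applied to non-disjoint subsets, which is the gap the rest of this post is concerned with.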

Computing Backward-Running Means

Consider performing the same sort of calculation as described above, but moving in the opposite direction. In that case, the three non-disjoint subsets are:

Time Value
3 5
Time Value
2 3
3 5
Time Value
1 1
2 3
3 5

And the final result is:

Time Value
1 3
2 4
3 5
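The same backward-running result can be checked in base R by explicitly building the three non-disjoint subsets and applying mean to each; this is only a sketch of the Split-Apply-Combine steps involved, not a reusable tool.

time <- 1:3
value <- c(1, 3, 5)

# Split: for each row, keep every observation at or after that time.
subsets <- lapply(time, function(t) value[time >= t])

# Apply and combine.
sapply(subsets, mean)  # 3 4 5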

Computing Local Means (AKA Kernelized Means)

Imagine that, instead of looking forward or backward, we only want to know something about data that is close to the current observation being examined. For example, we might want to know the mean value of each row when pooled with its immediately preceding and succeeding neighbors. This computation requires creating the following subsets of data:

Time Value
1 1
2 3
Time Value
1 1
2 3
3 5
Time Value
2 3
3 5

Within these non-disjoint subsets, means are computed and the result is:

Time Value
1 2
2 3
3 4
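Written as a base R sketch, the splitting rule here is "every observation whose Time lies within 1 of the current row's Time":

time <- 1:3
value <- c(1, 3, 5)

# For each row, average it with its neighbors inside a window of radius 1.
sapply(time, function(t) mean(value[abs(time - t) <= 1]))  # 2 3 4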

A Strategy for Handling Non-Disjoint Subsets

How can we build a general purpose tool to handle these sorts of computations? One way is to rethink how plyr works and then extend it with some trivial variations on its core principles. We can envision plyr as a system whose splitting operation partitions our data into subsets, each of which satisfies a set of equality constraints: you split the data into groups in which Variable 1 = Value 1 AND Variable 2 = Value 2, etc. Because you consider the conjunction of several equality constraints, the resulting subsets are disjoint.

Seen in this fashion, there is a simple relaxation of the equality constraints that allows us to solve the three problems described a moment ago: instead of looking at the conjunction of equality constraints, we use a conjunction of inequality constraints. For the time being, I’ll describe just three instantiations of this broader strategy.

Using Upper Bounds

Here, we divide data into groups in which Variable 1 <= Value 1 AND Variable 2 <= Value 2, etc. We will also allow equality constraints, so that the operations of plyr are a strict subset of the computations in this new model. For example, we might use the constraint Variable 1 = Value 1 AND Variable 2 <= Value 2. If Time is the upper-bounded variable, these constraints allow us to compute the forward-running mean we described earlier.

Using Lower Bounds

Instead of using upper bounds, we can use lower bounds to divide data into groups in which Variable 1 >= Value 1 AND Variable 2 >= Value 2, etc. This allows us to implement the backward-running mean described earlier.

Using Norm Balls

Finally, we can consider a combination of upper and lower bounds. For simplicity, we'll assume that these bounds have a fixed tightness around the "center" of each subset of our split data. To articulate this tightness formally, we take a specific equality constraint like Variable 1 = Value 1 and loosen it so that norm(Variable 1 - Value 1) <= r. When r = 0, we recover the original equality constraint. But when r > 0, we produce a "ball" of data around the constraint whose radius is r. This lets us compute the local means from our third example.
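Putting the three relaxed constraint types side by side as plain base R splitting rules may make the relationship clearer; these helpers are purely illustrative and are not part of plyr or of the package introduced below.

data <- data.frame(Time = 1:3, Value = c(1, 3, 5))

# Upper bound, Time <= center: the forward-running subsets.
upper.subsets <- lapply(data$Time, function(center) data[data$Time <= center, ])

# Lower bound, Time >= center: the backward-running subsets.
lower.subsets <- lapply(data$Time, function(center) data[data$Time >= center, ])

# Norm ball of radius 1, abs(Time - center) <= 1: the local subsets.
ball.subsets <- lapply(data$Time, function(center) data[abs(data$Time - center) <= 1, ])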

Implementation

To demo these ideas in a usable fashion, I've created a draft package for R called cumplyr. Here is an extended example of its usage in solving simple variants of the problems described in this post:

library('cumplyr')

data <- data.frame(Time = 1:5, Value = seq(1, 9, by = 2))

# Equality constraint on Time: disjoint, single-row subsets, just as in plyr.
iddply(data,
       equality.variables = c('Time'),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})

# Lower bound on Time (Time >= t): backward-running means.
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c('Time'),
       upper.bound.variables = c(),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})

# Upper bound on Time (Time <= t): forward-running means.
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c('Time'),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})

# Norm ball of radius 1 around Time: local means over immediate neighbors.
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 1),
       func = function (df) {with(df, mean(Value))})

# Norm ball of radius 2: a wider local window.
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 2),
       func = function (df) {with(df, mean(Value))})

# Norm ball of radius 5: the window spans the whole data set, so every
# subset's mean is the global mean.
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 5),
       func = function (df) {with(df, mean(Value))})

You can download this package from GitHub and play with it to see whether it helps you. Please submit feedback using GitHub if you have any comments, complaints or patches.

Comparing plyr with cumplyr

In the long run, I'm hoping to make the functions in cumplyr robust enough to submit a patch to plyr. I see these tools as one logical extension of plyr to encompass more of the framework described in Hadley's paper on the Split-Apply-Combine strategy.

For the time being, I would advise users of cumplyr not to use it for anything that plyr can already do. cumplyr is very much demo software and I am certain that both its API and implementation will change. In contrast, plyr is fast and stable software that can be trusted to perform its job.

But, if you have a problem that cumplyr will solve and plyr will not, I hope you'll try cumplyr out and submit patches when it breaks.

Happy hacking!

Implementing the Exact Binomial Test in Julia

One major benefit of the time I’ve recently spent adding statistical functionality to Julia is that I’ve learned a lot about the inner guts of algorithmic null hypothesis significance testing.

Implementing Welch’s two-sample t-test last week was a trivial task because of the symmetry of the null distribution, but implementing the exact binomial test has proven more challenging: a skewed null distribution defined over a bounded set means that one has to think a bit more carefully about what is being computed.

To see why, let’s first recap the logic of the standard two-sided hypothesis test. In all NHST situations, you assign a p-value to the observed data by working under the null hypothesis and using this assumption to calculate the probability of observing data sets that are as extreme or more extreme than the observed data.

For the normal, this calculation is easy: you find the probability of seeing a z-score whose absolute value is at least as large as that of the observed data’s z-score by computing the upper-tail probability beyond |z| and then doubling it to account for the symmetric lower tail.
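In R’s notation, for concreteness (the Rmath.jl wrappers used below follow the same d/p/q/r naming), the two-sided normal p-value is just a doubled upper-tail probability:

z <- 2.1  # an arbitrary observed z-score, purely for illustration
2 * (1 - pnorm(abs(z)))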

But for the binomial, the right quantity to use as a definition of extremity is less obvious (at least to me). Suppose that you’ve seen x successes after n samples from a Bernoulli variable with probability p of success.

You might try defining extremity by saying that a hypothetical data set y is more extreme than x if abs(y - n * p) > abs(x - n * p): in short, you could use the count space to assess extremity.

This approach will not work. Consider the case in which x = 4, n = 10 and p = 0.2, so that n * p = 2. Under this definition y = 0 would be exactly as extreme as y = 4, but p(y = 0) > p(y = 4), so 0 should not be considered as extreme as 4. You need to use probability space and not count space to assess extremity.
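A quick check of those two probabilities (written here in R; dbinom mirrors the Rmath call used in the Julia code below):

dbinom(0, 10, 0.2)  # 0.1073742
dbinom(4, 10, 0.2)  # 0.08808038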

This logic leads to the conclusion that the proper definition is one in which y is as extreme or more extreme than x if p(y; n, p) <= p(x; n, p). This is the correct definition for the exact binomial test. Implementing it leads to this piece of code in Julia for computing p-values for the binomial test:

load("extras/Rmath.jl")
 
function binom_p_value(x, n, p)
  sum(filter(d -> d <= dbinom(x, n, p),
             map(i -> dbinom(i, n, p),
                 [0:n])))
end
 
binom_p_value(2, 10, 0.8)

As far as I know, this procedure may be as efficient as possible, but it seems odd to me that we should need to evaluate the PDF at all n + 1 possible counts when, in principle, we should only need to evaluate the CDF of the binomial distribution at two points to find a p-value.

For that reason, you might hope to replace your loop over n + 1 numbers with one that, for extreme data sets, is more efficient by estimating lower and upper bounds on the count values with lower PDF values than the observed data. For the moment, I’m experimenting with doing this as follows:

load("extras/Rmath.jl")
 
function binom_p_value(x, n, p)
  lower_bound = floor(n * p - abs(n * p - x))
  upper_bound = ceil(n * p + abs(n * p - x))
 
  sum(filter(d -> d <= dbinom(x, n, p),
  	     map(i -> dbinom(i, n, p),
	         vcat([0:lower_bound], [upper_bound:n]))))
end
 
binom_p_value(2, 10, 0.8)

Unfortunately, I haven’t yet done the analytic work to demonstrate that these bounds are actually correct. (One of them must be, since one of them is x itself and the strict monotonicity of the distribution function on either side of n * p guarantees that x must be a valid bound for its own tail.) Of course, if these bounds are sufficiently conservative, they’ll function to save computation without any risk of giving corrupt answers, even if they’re not the tightest possible bounds.

Note that, in principle, it should be possible to go further: if you know exact bounds, then the summing and filtering operations are entirely superfluous and we can run:

load("extras/Rmath.jl")
 
function binom_p_value(x, n, p)
  lower_bound = exact_lower_bound(x, n, p)
  upper_bound = exact_upper_bound(x, n, p)
 
  pbinom(lower_bound, n, p) + 1 - pbinom(upper_bound - 1, n, p)
end
 
binom_p_value(2, 10, 0.8)

I don’t know whether these exact bounds can be computed without a lot of work, but if they can be, they give a much more efficient implementation of the exact binomial test.

I wrote all of this up because (a) I’d appreciate knowing if the exact bounds can be computed efficiently and (b) I thought it was a very nice example of (1) thinking through the logic of hypothesis testing in detail (including considerations of what extremity really means) and (2) the constant problem that mathematically equivalent definitions suggest algorithms with very different computational costs.