Covariate-Based Diagnostics for Randomized Experiments are Often Misleading

Over the last couple of years, I’ve seen a large number of people attempt to diagnose the quality of their randomized experiments by looking for imbalances in covariates, because they expect covariate imbalances to be ruled out by successful randomization. But balance is not, in general, guaranteed for any specific random assignment, so many attempts to check for covariate imbalance are misleading.

In this post, I’m going to use the Neyman-Rubin Causal Model to describe an extreme example in which covariate-based diagnostics for randomized experiments are strictly misleading, in the sense that a search for covariate imbalance will always produce a false positive result no matter which random assignment is realized. In particular, we’ll see that all possible randomized assignments generate an exact estimate of the average treatment effect (ATE), even though all assignments also generate covariate imbalances at arbitrarily stringent levels of statistical significance.

Here’s the setup:

  1. We have a large, finite population of \(N\) people, where \(N\) is a large even number like \(100,000\). These people will be assigned to treatment or control by complete randomization.
  2. The potential outcomes for the \(i\)-th person in the population are \(y_i(1) = 1\) and \(y_i(0) = 0\). In other words, the population is perfectly homogeneous and, as such, there’s a constant treatment effect of \(1\).
  3. We have \(K\) covariates, where \(K = \binom{N}{N / 2}\) is the number of distinct ways to arrange \(N / 2\) zeros and \(N / 2\) ones in a column of length \(N\). Each covariate is one such column, and the \(j\)-th column is one of these \(K\) possible permutations. Moreover, each column is distinct, so that we have exactly one covariate column for every possible permutation of \(N / 2\) zeros and \(N / 2\) ones.
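
To make the covariate construction concrete, here’s a minimal sketch in Python that builds every column for a tiny population. The value \(N = 8\) and the helper name `build_covariates` are my own illustrative choices, not part of the setup above; with \(N = 100,000\) the number of columns is far too large to enumerate.

```python
from itertools import combinations
from math import comb


def build_covariates(n):
    """Return every distinct 0/1 column of length n with exactly n/2 ones."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    columns = []
    # Each choice of n/2 positions for the ones gives one distinct column.
    for ones in combinations(range(n), n // 2):
        ones = set(ones)
        columns.append(tuple(1 if i in ones else 0 for i in range(n)))
    return columns


N = 8  # illustrative toy size (assumption), small enough to enumerate
covariates = build_covariates(N)
K = len(covariates)
print(K, comb(N, N // 2))  # 70 70
```

Even at \(N = 8\) there are already 70 covariates, and the count grows very quickly with \(N\).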

Under these assumptions, we’ll randomly assign \(N / 2\) people to the treatment condition and \(N / 2\) people to the control condition using complete randomization. Because people are perfectly homogeneous, the sample mean in the treatment group will always be \(1\) and the sample mean in the control group will always be \(0\). As such, the sample average treatment effect will always be \(1 - 0 = 1\), which is identical to the true average treatment effect in the population. In other words, the homogeneity of the experimental population allows exact inference without any error under all realizations of the randomized assignments.
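
This claim is easy to check by brute force on a toy population. The sketch below assumes \(N = 8\) (my choice, purely so that every assignment can be enumerated) and uses a hypothetical helper, `difference_in_means`, to compute the estimate for each of the 70 possible assignments.

```python
from itertools import combinations

N = 8  # illustrative toy size (assumption)
y1 = [1] * N  # potential outcomes under treatment
y0 = [0] * N  # potential outcomes under control


def difference_in_means(treated):
    """Estimate the ATE from one assignment, given the treated indices."""
    treated = set(treated)
    control = [i for i in range(N) if i not in treated]
    mean_treated = sum(y1[i] for i in treated) / len(treated)
    mean_control = sum(y0[i] for i in control) / len(control)
    return mean_treated - mean_control


# Enumerate every complete randomization with N/2 treated units.
estimates = {difference_in_means(t) for t in combinations(range(N), N // 2)}
print(estimates)  # {1.0}
```

Every assignment produces the same estimate, so the set of distinct estimates contains only the true effect.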

But the covariates, because they exhaust the space of all permutations of a fixed set of balanced values, will include some columns that exhibit substantial imbalances. In particular, there will be covariates that are statistically significantly imbalanced at the \(p < \alpha\) level for a wide range of values of \(\alpha\). This occurs because our columns literally contain all possible permutations of a fixed set of values, so they must contain even those permutations that would be considered extremely improbable under randomization. Indeed, whichever assignment is realized, the permutation that gives a zero to all \(N / 2\) people in the treatment group and a one to all \(N / 2\) people in the control group must occur as one column, with the result that we'll see perfect separation between the groups on that covariate.

This leads to a strange disconnect: the outcome of our experiment is always a perfect estimate of the ATE, but we can always detect arbitrarily strong imbalances in covariates. This disconnect occurs for the simple reason that there is no meaningful connection between covariate imbalance and successful randomization: balancing covariates is neither necessary nor sufficient for an experiment's inferences to be accurate.

Moreover, checking for imbalance often leads to dangerous p-hacking because there are many ways in which imbalance could be defined and checked. In a high-dimensional setting like this one, there will be, by necessity, extreme imbalances along some dimensions. For me, the lesson from this example is simple: we must not go looking for imbalances, because we will always be able to find them.
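
Here's a rough sketch of how the imbalance search plays out on a toy population. The post doesn't commit to a particular balance test, so I'm assuming a two-sided Fisher-style exact test based on the hypergeometric distribution. The size \(N = 8\) and the helper names are likewise my own illustrative choices.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

N = 8  # illustrative toy size (assumption)
HALF = N // 2


def pmf(x):
    """P(X = x), where X counts the covariate's ones that land in treatment."""
    return Fraction(comb(HALF, x) * comb(HALF, HALF - x), comb(N, HALF))


def two_sided_p(x):
    """Exact p-value: total probability of outcomes no more likely than x."""
    observed = pmf(x)
    return sum(pmf(k) for k in range(HALF + 1) if pmf(k) <= observed)


# Represent each covariate by the set of indices where it equals one.
covariates = [set(ones) for ones in combinations(range(N), HALF)]


def min_p_value(treated):
    """Smallest balance-test p-value across all covariates for one assignment."""
    treated = set(treated)
    return min(two_sided_p(len(treated & ones)) for ones in covariates)


# The most extreme imbalance found under each possible assignment.
smallest = {min_p_value(t) for t in combinations(range(N), HALF)}
print(smallest)  # {Fraction(1, 35)}


def perfect_separation_p(n):
    """p-value of the perfectly separated column for a population of size n."""
    return Fraction(2, comb(n, n // 2))


print(float(perfect_separation_p(20)))
```

Under this test, every assignment turns up a covariate with \(p = 2 / 70 = 1 / 35\), which is below \(0.05\) even with only eight people. Because the perfectly separated column has \(p = 2 / \binom{N}{N / 2}\), the smallest p-value shrinks rapidly as \(N\) grows.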