Covariate-Based Diagnostics for Randomized Experiments are Often Misleading

Over the last couple of years, I’ve seen a large number of people attempt to diagnose the quality of their randomized experiments by looking for imbalances in covariates, because they expect covariate imbalances to be ruled out by successful randomization. But balance is, in general, not guaranteed for any specific random assignment, and, as such, many attempts to check for covariate imbalance are misleading.

In this post, I’m going to use the Neyman-Rubin Causal Model to describe an extreme example in which covariate-based diagnostics for randomized experiments are strictly misleading, in the sense that a search for covariate imbalance will always produce a false positive result no matter which random assignment is realized. In particular, we’ll see that all possible randomized assignments generate an exact estimate of the average treatment effect (ATE), even though all assignments also generate covariate imbalances at arbitrarily stringent levels of statistical significance.

Here’s the setup:

  1. We have a large, finite population of \(N\) people, where \(N\) is a large even number like \(100,000\). These people will be completely randomized in order to assign them to treatment or control.
  2. The potential outcomes for the \(i\)-th person in the population are \(y_i(1) = 1\) and \(y_i(0) = 0\). In other words, the population is perfectly homogeneous and, as such, there’s a constant treatment effect of \(1\).
  3. We have \(K\) covariates. Each covariate is a column of \(N / 2\) zeros and \(N / 2\) ones. Each column is distinct, and together the columns include every possible arrangement of \(N / 2\) zeros and \(N / 2\) ones, so that \(K = \binom{N}{N/2}\).

Under these assumptions, we’ll randomly assign \(N / 2\) people to the treatment condition and \(N / 2\) people to the control condition using complete randomization. Because people are perfectly homogeneous, the sample mean in the treatment group will always be \(1\) and the sample mean in the control group will always be \(0\). As such, the sample average treatment effect will always be \(1 - 0 = 1\), which is identical to the true average treatment effect in the population. In other words, the homogeneity of the experimental population allows exact inference without any error under all realizations of the randomized assignments.

But the covariates, because they exhaust the space of all permutations of a fixed set of balanced values, will include some columns that exhibit substantial imbalances. In particular, there will be covariates that are statistically significantly imbalanced at the \(p < \alpha\) level for a large set of values of \(\alpha\). This occurs because our columns literally contain all possible permutations of a fixed set of values, so they must contain even those permutations that would be considered extremely improbable under randomization. Indeed, the permutation that assigns \(N / 2\) zeros to the treatment group and \(N / 2\) ones to the control group must occur as one column, with the result that we'll see perfect separation between the groups on that covariate.

This leads to a strange disconnect: the outcome of our experiment is always a perfect estimate of the ATE, but we can always detect arbitrarily strong imbalances in covariates. This disconnect occurs for the simple reason that there is no meaningful connection between covariate imbalance and successful randomization: balancing covariates is neither necessary nor sufficient for an experiment's inferences to be accurate.

Moreover, checking for imbalance often leads to dangerous p-hacking because there are many ways in which imbalance could be defined and checked. In a high-dimensional setting like this one, there will be, by necessity, extreme imbalances along some dimensions. For me, the lesson from this example is simple: we must not go looking for imbalances, because we will always be able to find them.
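
To make this concrete, here’s a minimal simulation sketch in Julia; the particulars (variable names, the assertion style) are mine rather than part of the argument above:

    # A minimal sketch of the setup described above: homogeneous potential
    # outcomes, complete randomization, and one adversarial covariate column.
    using Random, Statistics

    N = 100_000                       # population size (even)
    y1, y0 = ones(N), zeros(N)        # potential outcomes: y_i(1) = 1, y_i(0) = 0

    # Complete randomization: exactly N/2 units assigned to treatment.
    w = shuffle(vcat(ones(Int, div(N, 2)), zeros(Int, div(N, 2))))

    # The difference-in-means estimate of the ATE is exact for every
    # realized assignment, because the population is homogeneous.
    ate_hat = mean(y1[w .== 1]) - mean(y0[w .== 0])
    @assert ate_hat == 1.0

    # Among the binomial(N, N/2) possible covariate columns, one column is
    # identical to the realized assignment itself, so that covariate shows
    # perfect separation between treatment and control.
    worst_covariate = copy(w)
    imbalance = mean(worst_covariate[w .== 1]) - mean(worst_covariate[w .== 0])
    @assert imbalance == 1.0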

What is an Interaction Effect?

Introduction

Yesterday, Uri Simonsohn published a blog post called “Interactions in Logit Regressions: Why Positive May Mean Negative”. I love Uri’s blog in general, but I was pretty confused by this latest post. I found it very hard to understand exactly what was being claimed in the absence of formal mathematical definitions of the terms being used.

Thankfully Uri’s post linked to the original papers that inspired his post. Those papers provided clear formal definitions of what they meant when they used the term “interaction effect”. This post expands on those definitions a bit.

Defining the Existence of Interactions and the Measurement of Interaction Effects

Let’s assume that we have an outcome variable \(y\) that is a function of two predictor variables, \(x_1\) and \(x_2\). These assumptions allow us to write \(y = f(x_1, x_2)\) for some \(f\).

We can then define the existence of interactions between \(x_1\) and \(x_2\) in a negative way. We’ll do that by specifying an invariant that holds whenever there are no interactions. The invariant we’ll use is this: no interaction between \(x_1\) and \(x_2\) exists whenever the partial derivative of \(f\) with respect to one variable isn’t a function of the other variable.

To make this formal, let’s define a new function, \(g\), as follows:

$$
g(x_1, x_2) = \frac{\partial}{\partial x_1} [f(x_1, x_2)]
$$

Note that we’ve chosen to focus on the partial derivatives of \(f\) with respect to \(x_1\), but that a full treatment would need to apply this same approach to other variables.

Given this definition of \(g\), we can say that interactions do not exist whenever
$$
g(x_1, x_2) = g(x_1, x_2^{\prime})
$$

for all values of \(x_2\) and \(x_2^{\prime}\).

In other words, the partial derivative of \(f\) with respect to \(x_1\) does not change when we evaluate it at different values of \(x_2\). In looser English, the way that our outcome changes in response to the first predictor does not depend upon the specific value of the second predictor.

Instead of working with first derivatives, we could also define interactions in terms of mixed second derivatives, which is what’s done in the papers Uri linked to. Those papers define a quantitative measure called the “interaction effect”, \(I\), as follows:

$$
I(x_1, x_2) = \frac{\partial^2}{\partial x_1 \partial x_2} [f(x_1, x_2)]
$$

Note that this is a function of both \(x_1\) and \(x_2\) and is generally not a constant — and therefore cannot generally be summarized by a single number without engaging in lossy data compression.

Given this quantitative definition of the “interaction effect” function, we can say that no interaction exists when this mixed second derivative is zero everywhere. That is,

$$
I(x_1, x_2) = \frac{\partial^2}{\partial x_1 \partial x_2} [f(x_1, x_2)] = 0
$$

for all values of \(x_1\) and \(x_2\).

Some Examples of Interactions Measured by First and Second Derivatives

To see how these definitions play out in practice, let’s consider some simple models.

Two-Variable Linear Regression without an Interaction Term

First, let’s assume that \(y = \beta_1 x_1 + \beta_2 x_2\). This is a linear regression model in which there is no “interaction term”. In that case,

$$
\frac{\partial}{\partial x_1} [\beta_1 x_1 + \beta_2 x_2] = \beta_1
$$

This derivative is a constant and is therefore clearly invariant to the value of \(x_2\), so we can conclude there will be no “interaction effects” — because the second derivative will be a derivative of a constant, which is always zero.

Two-Variable Linear Regression with an Interaction Term

In contrast, if we assume that \(y = \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2\) we’ll observe a non-zero “interaction effect”. In this changed model, we’ve introduced an “interaction term”, \(x_1 x_2\). That change to the model implies that,

$$
\frac{\partial}{\partial x_1} [\beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2] = \beta_1 + \beta_{12} x_2
$$

This derivative is clearly a function of \(x_2\) and so we can reasonably expect to see a non-zero “interaction effect”.

To generalize from this example: introducing what is often called “an interaction term” into a linear regression model introduces an “interaction effect” according to the definition given in the previous section. Moreover, the coefficient on the “interaction term” is exactly equal to the “interaction effect”:

$$
\frac{\partial^2}{\partial x_1 \partial x_2} [\beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2] = \beta_{12}
$$

But note that this equivalence of coefficients on “interaction terms” and “interaction effects” is a special property of linear regression models and does not generally hold.

But also note that, even though linear regression models have unusually simple derivatives, the special properties of linear regression are not required to prevent the introduction of non-zero “interaction effects”. The absence of interactions has nothing to do with linearity: it’s driven instead by a form of additivity. To see why, let’s consider another example in which the model is non-linear, but exhibits no “interaction effects”.

Two-Variable Generalized Additive Model

Let’s change things again and assume that \(y = f(x_1) + g(x_2)\). In that case,

$$
\frac{\partial}{\partial x_1} [f(x_1) + g(x_2)] = f^{\prime}(x_1)
$$

This derivative is not a constant everywhere (which makes it very different from the linear regression model), but it is also not a function of \(x_2\), so there will be no “interaction effects”. In short, linearity is not necessary for the non-existence of non-zero “interaction effects”.

We can also see this by explicitly calculating the “interaction effect”,

$$
\frac{\partial^2}{\partial x_1 \partial x_2} [f(x_1) + g(x_2)] = \frac{\partial}{\partial x_2} [f^{\prime}(x_1)] = 0
$$

As long as a regression model is additively separable in the sense that the outcome variable can be written as a sum of univariate functions, each depending on only one predictor variable, “interaction effects” will never exist.

Examples of Non-Zero Interaction Effects in GLM’s

It is important to note that only a special form of additivity is required to prevent “interaction effects”. But the point of Uri’s post, derived from the papers he cites, is that “interaction effects” exist in logistic regression models and that they are generally not equal to the coefficients on the “interaction terms”.

He’s totally right about this. I’d even argue that he’s understating this issue by focusing on logistic regression models: the argument applies to all GLM’s that aren’t identical to linear regression. The problem is that GLM’s use an inverse link function — and the inverse link function per se introduces a deviation from additive separability.

To see why, let’s assume that \(y = f(\beta_1 x_1 + \beta_2 x_2)\) and that \(f\) is an inverse link function of the sort used in a GLM. By applying the chain rule to this GLM-style model, we see that,

$$
\frac{\partial}{\partial x_1} [f(\beta_1 x_1 + \beta_2 x_2)] = f^{\prime}(\beta_1 x_1 + \beta_2 x_2) \beta_1
$$

This derivative is clearly a function of \(x_2\) and changes depending on the value of \(x_2\). In other words, an “interaction effect” must be non-zero for some values of \(x_1\) and \(x_2\).

Note how weak the assumptions are that I’ve had to make to introduce an “interaction effect”: I just need \(f\) to be non-linear, unlike the identity link used in linear regression. The inverse logit link function will introduce “interaction effects”, as will the exponential inverse link function used in Poisson regression. The linear regression model is, again, totally non-representative of the other GLM models.
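
As a quick numerical illustration, the sketch below (with arbitrary, assumed coefficient values) approximates the mixed partial derivative by central finite differences. For the linear model with an interaction term, the result is constant and equal to \(\beta_{12}\); for the inverse-logit model, it is non-zero and varies with \((x_1, x_2)\) even though the model contains no interaction term at all:

    # A minimal sketch with assumed coefficients; the cross partial is
    # approximated with a central finite difference.
    invlogit(z) = 1 / (1 + exp(-z))

    β1, β2, β12 = 0.5, -0.3, 0.2                        # arbitrary assumed values
    linear(x1, x2) = β1 * x1 + β2 * x2 + β12 * x1 * x2  # has an interaction term
    logit_model(x1, x2) = invlogit(β1 * x1 + β2 * x2)   # no interaction term

    # Central finite-difference approximation of the mixed partial derivative.
    function cross_partial(f, x1, x2; h = 1e-4)
        (f(x1 + h, x2 + h) - f(x1 + h, x2 - h) -
         f(x1 - h, x2 + h) + f(x1 - h, x2 - h)) / (4h^2)
    end

    cross_partial(linear, 0.0, 0.0)       # ≈ 0.2 = β12
    cross_partial(linear, 2.0, -1.0)      # ≈ 0.2 again: constant everywhere
    cross_partial(logit_model, 1.0, 1.0)  # non-zero, despite no interaction term
    cross_partial(logit_model, 3.0, 0.0)  # and different at other points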

Conclusion

I’m glad to see Uri making a point to distinguish “interaction terms” from “interaction effects”, although I’m not sure I understand his other point about “conceptual interactions” and “mechanical interactions”. Given the importance of the first point, I thought it was worth making the ideas he was addressing more mathematically explicit.

With that in mind, I’m generally inclined to think of “interaction terms” as an old-fashioned form of feature engineering. Introducing “interaction terms” can substantially improve the predictive power of mathematical models, but they are extremely difficult to interpret as a statement about “interaction effects” because implausibly strong assumptions are required to make “interaction terms” line up with “interaction effects”. In general, I think people who are testing hypotheses about “interaction terms” are actually trying to test for the existence of any non-zero “interaction effects”, but I’m skeptical that hypothesis testing for non-zero coefficients on “interaction terms” is generally useful for testing for the existence of non-zero “interaction effects”.

In contrast, I think “interaction effects” are a much deeper and more interesting property of mathematical models, but they’re also functions of all of the original predictors and are therefore almost never the constants that occur in linear regression. There’s simply no reason to think you can easily summarize the way in which the first derivatives of models respond to changes in other predictor variables with a single number. If you’re interested in understanding the “interaction effects”, you should probably model them directly.

Where Predictive Modeling Goes Astray

I recently reread Yarkoni and Westfall’s in-progress paper, “Choosing prediction over explanation in psychology: Lessons from machine learning”. I like this paper even more than I liked their previous paper, but I think a few notes of caution should be raised about the ways in which a transition to predictive modeling could create problems that psychologists are seldom trained to deal with.

In particular, I think there are three main dangers one encounters when engaging in predictive modeling with the aim of gaining generalizable insights about the world.

Danger 1: What kinds of predictions are needed?

The first danger with advocating “prediction” is that a reasonable reader can be uncertain about what kinds of predictions psychologists will generate after the field’s direction changes. In the mathematical literature, the word “prediction” is vague because of polysemy: the term can be used to refer to two different problems in mathematical modeling. The first problem uses “prediction” to refer to the task of making predictions from the position of an all-seeing but never-acting observer who will predict the future without ever attempting to change it. The second problem uses “prediction” to refer to the task of making predictions from the position of a decision-maker who will intervene directly in the world that they observe and who wishes to predict the outcomes of their own actions before they act. For those familiar with the literature on causality, this distinction reflects the gap between probabilistic and causal modeling: the first task involves the use of Judea Pearl’s see() operator and the second task involves the use of his do() operator.

The appropriate techniques for developing and evaluating models in these two settings are closely related, but they are also generally not identical. If the field is going to move towards predictive modeling, it is important to understand which kind of prediction is being advocated, because students will need to be taught to distinguish these two kinds of predictive modeling so that they can apply the proper techniques.

I think this distinction is particularly important to make when contrasting “prediction” and “explanation” because I suspect that much of what we call “explanation” is a hodge-podge of causal modeling and the pursuit of models that are easily interpretable by humans. But these two pursuits are not inevitably linked: we should not assume that interpretable models will be valid causal models, nor should we assume that valid causal models will be interpretable. The focus on explanation, in my opinion, allows one to excuse oneself for failing to construct models that make falsifiable and approximately accurate predictions, because one can always assert that the models, despite their predictive failures, are insightful and educational.

Danger 2: Mis-specified models perform well locally, but do not generalize well.

The second danger with adopting “prediction” as an objective of science is that we should not assume that the models we use will faithfully represent the data generating processes for the datasets we wish to predict against. But if the models we fit to data are usually mis-specified, we should fear that they will not generalize well when applied to new datasets, because typically only a correctly specified model will converge to a parameterization that generalizes well across heterogeneous datasets. (And even this claim requires caveats because of potential failures caused either by using small samples or by using datasets without sufficient variation in the range of inputs to identify the model being learned.)

To see why the use of mis-specified models could be a problem, consider the problem of fitting a linear model to quadratic data described in this pure theory post I wrote a while back. When one fits an approximate model, one needs to be very clear about the domain in which this approximation will be employed: the true model extrapolates flawlessly (for what are essentially tautological reasons), but approximate models often fail disastrously when asked to extrapolate from the range of inputs for which they were optimized.

Danger 3: Empirical model comparisons are often badly confounded.

The problem of extrapolation for mis-specified models hints at another broad problem with evaluating models empirically: the empirical performance of a model depends on many factors and it is prohibitively difficult (or even impossible) to perform all-else-equals comparisons between models using only empirical data. In the language of psychological statistics, empirical model comparisons exhibit substantial interactions and suffer from powerful moderators.

This occurs because, when we compare models using finite data sets, we do not generally compare the intrinsic properties of the models in the way that a pure mathematical analysis would allow us to do. Instead, we are forced to select specific algorithms that compute specific estimators and those estimators are used to compare learned parameterizations of models rather than comparing the parameter families that we typically have in mind when talking about mathematical models. (To clarify this distinction, note that the model f(x) = sin(a * x) is actually an infinite parametric family of parameterized models with a distinct element for every value of a. Training a model family on any given dataset will lead one to select a single element from this set — but comparisons between single elements are not generally equivalent to comparisons between sets.)

To put it another way, measured predictive power is, at a minimum, a function of (1) the evaluation metric used, (2) the model being fit, (3) the estimator being used to do the fitting, (4) the algorithm being used to compute the estimator, and (5) the data set being used. When you change any of these inputs to the function that outputs measured predictive power, you generally get different results. This means that it can be very difficult to understand what exactly accounts for the differences in observed predictive power between two entities we nominally call “models”: is your logistic regression with a quadratic polynomial as input out-performing your probit regression with a cubic polynomial as input because of the cubic term? Or because your computer code for evaluating the inverse logit function is numerically unstable? Or because you’re using a small N sample in which a simpler model does better than the true model — which is being rejected by your, as it were, “underpowered” dataset that cannot reliably lead to the conclusions you would see in the asymptotic limit as N goes to infinity?

This type of confounding is not such a large problem for machine learning, because the goal of many machine learning researchers isn’t to discover one single universally correct model parameterization, but rather to develop algorithmic tools that can be used to learn useful model parametrizations from arbitrary datasets. What sets machine learning and statistics apart from psychology and physics is that those fields are far more dataset agnostic than fields with specific empirical areas of focus.

Instead of looking to machine learning, one should keep classical physics in mind when one wants to see a successful example of partially generalizable (but still imperfect) model parameterizations being learned and generating falsifiable and approximately accurate predictions about new datasets. In physics, the equation predicting the position of a falling object affected by Earth’s gravitation field isn’t a family of arbitrary polynomials, but rather a specific parameterization of a quadratic equation in which parameters like the typical value of the coefficient on the quadratic term are taught to students in introductory courses.

More generally, this danger hints at the same concerns about objectives that the first danger addressed: is the goal of a new version of psychology the construction of generalizable predictive model parameterizations or the construction of general-purpose algorithms that can be applied to custom datasets, but which will generally disagree across datasets? If the latter, how will people know when they’re talking past one another?

Conclusion

Despite these three concerns, I believe that Yarkoni and Westfall’s preference for prediction over explanation is not only the right approach for psychology, but is the only long-term cure to psychology’s problems. More sophisticated statistical methods will not, in my mind, solve psychology’s biggest problems: the field may purge itself of its accumulated stock of false positive claims, but it will not create a cumulative science of true positive claims until it can show that it can generate model parameterizations that make falsifiable, unexpected and approximately accurate predictions on a wide range of new datasets. This is an extremely hard challenge. But I believe Yarkoni and Westfall’s suggestions represent the best chance psychology has to overcome this challenge.

Type Safety and Statistical Computing

I broadly believe that the statistics community would benefit from greater exposure to computer science concepts. Consistent with that belief, I argue in this post that the concept of type-safety could be used to develop a normative theory for how statistical computing systems ought to behave. I also argue that such a normative theory would allow us to clarify the ways in which current systems can be misused. Along the way, I note the numerous and profound challenges that any realistic proposal to implement a more type-safe language for statistics would encounter.

Type Safety as a Concept in Computer Science

Vijay Saraswat defines a type-safe language as follows:

A language is type-safe if the only operations that can be performed on data in the language are those sanctioned by the type of the data.

Although I personally like this definition’s brevity, I suspect it will be helpful to many readers to explain this definition with some examples.

To begin, remember that a computer is ultimately little more than a piece of hardware that (a) can perform operations on bits and that (b) has been given access to other hardware that represents data as bits. As a result, at the hardware level, text files are just bits, audio files are just bits, and image files are just bits.

Because every kind of data is ultimately represented as just a sequence of bits, it is always possible in theory to apply functions that only make sense for one interpretation of those bits (say an operation that computes the FFT of an audio file) to a set of bits that was intended to be interpreted in a different way (say a text file representing the Gettysburg Address). The fact that there seems to be no reasonable interpretation of the FFT of a text file that you decide to pretend is an audio file does not preclude the possibility of its computation.

The classic solution to this problem in computer science is to introduce more information about each piece of data in the system. In particular, the bits that represent the Gettysburg Address should be coupled to a type which indicates that those bits represent a string of characters and should therefore be viewed as text rather than as, say, audio or video. In general, programming languages with types can use type information to ensure that users cannot apply methods designed to process one kind of data to another incompatible kind of data that cannot be correctly processed in the same way.

A language that uses type information to ensure that improper operations are not performed on data is a language that aspires, at some level, to type-safety. In what follows, I argue that existing tools for analyzing data statistically are generally type-unsafe, but I also note that the creation of type-safe languages for statistics would be profoundly challenging.

What Would Type-Safety Mean for Statistical Computing?

Let’s consider three settings in which we might want to compute a mean over a vector of integers. These settings were chosen because they progressively require stronger assumptions to justify the belief that the computations being performed will generate correct results relative to the goals of the person asking for the computation to be performed.

Types that Justify Descriptive Statistics

In the first case, let’s assume that we have a vector of integers. For concreteness, I’ll assume that we have the vector v = [6, 6, 8, 7, 5], where I am using Julia-like notation to define the vector.

Suppose that we want to calculate the arithmetic mean of this vector of integers as a pure descriptive statistic. I’ll refer to this computation as descriptive_mean(v). The arithmetic mean is a well-defined concept on any vector of real numbers. As such, it’s reasonable to assume that a function that simply sums the numbers in a vector and then divides the sum by the length of the vector will generate the correct descriptive statistic. Moreover, we can credibly believe that any vector of integers (which would be called Vector{Int} in Julia-like notation) can be handled correctly by code that uses that algorithmic strategy.
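
A one-line sketch of such a function in Julia (the name descriptive_mean comes from the text above; the implementation is just the sum-then-divide strategy described there):

    # A minimal sketch: sum the entries and divide by the length.
    # Vector{<:Real} accepts any vector of real numbers, including Vector{Int}.
    descriptive_mean(v::Vector{<:Real}) = sum(v) / length(v)

    descriptive_mean([6, 6, 8, 7, 5])   # 6.4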

Types that Justify Inferential Statistics

In addition to computing descriptive statistics, people also frequently want to compute inferential statistics in which the vector of numbers being analyzed is not itself the primary object of study, but is only being analyzed because it is viewed as a random sample from a larger population that represents the primary object of study.

In general, we cannot treat an arbitrary Vector{Int} as a valid random sample. Indeed, if we were handed the vector, v = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34], it would make sense for us to have doubts about whether this vector is really a random sample from a probability distribution.

In other words, the calculation of a mean as a correct inferential quantity requires that the data we analyze have a type more like IIDRandomSample{Int} than like Vector{Int}. If such a type existed, we might make a first pass at type-safety by insisting that inferential_mean(v) can only be applied to data of type IIDRandomSample{Int}.
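
Here’s a minimal sketch in Julia of how such a type could gate the computation. The wrapper type is hypothetical, and wrapping a vector obviously doesn’t make the IID assumption true; it only forces the analyst to state it:

    # A hypothetical wrapper type recording the analyst's claim that the
    # data were collected as an IID random sample.
    struct IIDRandomSample{T}
        data::Vector{T}
    end

    # inferential_mean is only defined for data carrying the IID assumption.
    inferential_mean(s::IIDRandomSample{<:Real}) = sum(s.data) / length(s.data)

    v = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
    # inferential_mean(v)                  # MethodError: a raw Vector{Int} won't do
    inferential_mean(IIDRandomSample(v))   # allowed once the assumption is declared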

Hopefully this example makes clear that many computations performed in existing statistical computing systems are only valid if we make assumptions that are generally not encoded in the type systems used in statistical computing. In this particular case, descriptive_mean(v) and inferential_mean(v) would always generate the same outputs: the type system would only check whether one could assert that inferential_mean(v) had legitimate status as an estimator of another unobserved quantity. But a computation like inferential_sem(v) would, in fact, provide different results for IIDRandomSample{Int} than it would for, say, KDependentRandomSample{Int}. And this kind of differing behavior is, I think, the norm as I suggest with the next example.

Types that Justify Quantitative Error Bounds on Inferential Statistics

To make matters more complex, what would happen if we wanted to compute something richer than just the mean as an inferential statistic? What if we also wanted to get a bound on its probable error as inferential_mean_with_error_bound(v)? In that case, we would need to make stronger assumptions about our vector of numbers because it will no longer be sufficient to assume that we are dealing with IIDRandomSample{Int}.

Instead, we’ll need to also ensure that the vector’s length is great enough that the distribution of the sample means is approximately normal. The assumption of approximate normality will depend, in general, on both the specific distribution from which each element of the vector is drawn and also on the vector’s length. We will generally not have access to this information and will therefore generally not be able to credibly use types to express these assumptions, even though their validity is essential to the correctness of our error bounds.

In general, encoding all of the assumptions we need to employ in a statistical analysis in a type system is likely to be hopeless and would frequently end up requiring that our type system know things about probability theory that are at the boundaries of what statistical theory can currently offer. As such, we should probably conclude that using types to demonstrate the full correctness of arbitrary statistical programs is not tractable. Although it is clear that we could catch many obvious mistakes in statistical programs by using types that encode our assumptions, we will still fail to catch many other serious mistakes.

Attainable Kinds of Type-Safety in Statistical Computing

Hopefully these three examples have illustrated the challenges involved with type-safety as a concept in statistical computing. One might conclude from the error bound problem above that type systems are simply not capable of capturing the kinds of assumptions we need to make in applied statistics. Although we cannot encode all of our assumptions, I think we should not therefore assume it would be useless to try encoding some of our assumptions.

To argue for a more positive take on why I think we might want to develop richer type systems for statistical computing, I’ll try to list some concepts that I think could be formalized fruitfully:

  • In surveys, we can use type systems to distinguish between sampling strategies. Indeed, some R packages for survey analysis already use annotations on values to distinguish simple random sampling from stratified sampling and from more complex sampling strategies.
  • In experiments, we can use type systems to distinguish between distinct strategies for randomizing units’ assignment to treatments. We can, for example, distinguish completely randomized experiments from block-randomized experiments.
  • In analyses of datasets with missing values, we can use type systems to distinguish variables that are missing completely at random from variables that are missing conditionally at random and from variables that are missing not at random.
  • In time series analyses of experiments, we can use type systems to distinguish pre-treatment and post-treatment variables to ensure that we do not introduce biases by conditioning on post-treatment variables.

Hopefully these examples make clear that type systems, despite the many limitations they would have, could make statistical computing safer by making assumptions more explicit and ensuring that invalid analyses are not conducted without users at least stating all of the erroneous assumptions they’ve made. My experience is that explicitness is almost always more tedious than magic, but is also far easier to debug as an author and verify as a reviewer.

At the same time, I hope that it’s clear to readers that most of the current tools for statistical computing make little attempt to ensure type-safe analyses because the types used in SQL and R and Python and Julia do not encode the relevant concepts about random samples, independence structure, etc. Even a computation as basic as SciPy’s sem is type-unsafe in this sense. (Indeed, the SciPy documentation doesn’t even mention that the correctness of the computation depends upon the assumption that the input array is an IID random sample.)

In short, I think we have a long way to go to building truly safe tools for statistical computing. Embracing the concept of type-safety is one potential way to think about making our tools safer.

Once Again: Prefer Confidence Intervals to Point Estimates

Today I saw a claim being made on Twitter that 17% of Jill Stein supporters in Louisiana are also David Duke supporters. For anyone familiar with US politics, this claim is a priori implausible, although certainly not impossible.

Given how non-credible this claim struck me as being, I decided to look into the origin of this number of 17%. I found that this claim is based on numbers reported in a document released by the University of New Orleans Survey Research Center, which I’ll refer to as UNO-SRC going forward. As noted in that document, the question about support for Senate candidates (which includes Duke) was answered by a total of 12 Jill Stein supporters.

If we assume the responses to that survey are like Bernoulli random variables, then a 95% confidence interval for the percentage of Stein supporters who would vote for David Duke would be a range from 2% up to 48%. See this Gist on GitHub for a demonstration of how one might compute that result in R.
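
If you’d rather see the calculation inline, here’s a sketch in Julia (the linked Gist uses R). It assumes 2 of the 12 Stein supporters named Duke, which is where a point estimate near 17% comes from, and uses the exact Clopper-Pearson interval, which reproduces roughly the 2% to 48% range mentioned above:

    # A minimal sketch: exact (Clopper-Pearson) 95% interval for a binomial
    # proportion, assuming k = 2 "Duke" responses out of n = 12 Stein supporters.
    using Distributions

    n, k = 12, 2
    lower = k == 0 ? 0.0 : quantile(Beta(k, n - k + 1), 0.025)
    upper = k == n ? 1.0 : quantile(Beta(k + 1, n - k), 0.975)
    (lower, upper)   # roughly (0.02, 0.48)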

Unfortunately, UNO-SRC’s report doesn’t prominently feature this confidence interval, so people are treating the reported number of 17% as more credible than it is. By itself, this is a great example of the reasons why people should almost always focus their attention on confidence intervals rather than point estimates.

But what’s even more striking about this particular example is that the confidence interval for Duke’s support among Stein voters, when combined with other less noisy point estimates from that same survey, suggests that the point estimate is obviously too high.

In particular, the bottom value in the confidence interval places Duke’s support among Stein supporters at 2%. That number is striking because it’s exactly the rate of support that Duke is gathering from the 205 Clinton supporters who responded to the UNO-SRC poll — and it’s also exactly equal to the 2% rate of support found among the 283 Trump supporters that responded to the UNO-SRC poll.

In short, I would argue that a reasonable person would conclude that approximately 2% of Jill Stein’s supporters support David Duke. Making strong claims that Duke’s support is substantially higher than 2% requires evidence that is simply not available.

[Post-Publication Edit] A couple of people have noted that the column containing Senate candidate preference percentages for Stein supporters doesn’t sum to 100%, so it’s possible that the statistical issues I’ve raised don’t even matter: this whole story may derive from nothing more than a typo.

Claims and Evidence: A Joke

The other day a friend posted the following old joke about the level of rigor that mathematicians usually require. (Disclaimer: if you take the joke as a serious claim about the standards of quality in the other fields referenced in the joke, it is an obviously unfair characterization of both astronomy and physics.)

A Mathematician, a Physicist, and an Astronomer were travelling north by train. They had just crossed the border into Scotland, when the Astronomer looked out of the window and saw a single black sheep in the middle of a field. “All Scottish sheep are black,” he remarked. “No, my friend,” replied the Physicist, “Some Scottish sheep are black.” At which point the Mathematician looked up from his paper and glanced out the window. After a few seconds’ thought he said blandly: “In Scotland, there exists at least one field, in which there exists at least one sheep, at least one side of which is black.”

I thought the joke was particularly timely given the recent round of arguments in psychology about the field’s evidentiary standards. As a former psychologist, I am continually surprised that more people in the field won’t yet acknowledge that the field often behaves like the astronomer in the joke even though the empirical tradition in psychology can only justify very weak claims of the sort that the mathematician produces.

No juice for you, CSV format. It just makes you more awful.

I just found a minimal example of how easy it is to confuse R’s CSV parser when providing it with ill-formatted data. To make it easy to understand, I put material for reproducing the problem up on GitHub.

I’m sure one could construct many similar examples for Python and Julia. The problem here is two-fold: (1) the CSV format is just kind of awful and (2) many end-users complain when CSV parsers don’t “successfully” interpret the mis-formatted files they want to parse. To borrow language from the web, essentially all CSV parsers have been compelled to operate in quirks mode all the time.

As far as I can see, there’s very little that would need to be changed to create a less awful CSV format. Here are some minimal changes that could be used to define a new CSV++ format that would be far easier to parse and validate rigorously (a sketch of what such a file might look like follows the list):

  1. Every CSV++ file starts with a short preamble.
  2. The first section of the preamble indicates the number of rows and columns.
  3. The second section of the preamble indicates the names and types of all of the columns.
  4. The third section of the preamble indicates the special sequence that’s used to represent NULL values.
  5. The fourth section of the preamble indicates (a) the separator sequence that’s used to separate columns within each row and (b) the separator sequence that’s used to separate rows.
  6. String values are always quoted. Values of other types are never quoted.
  7. CSV++ formatted files must never contain comments or any other non-content-carrying bits.
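
To make the proposal concrete, here’s a sketch of what a tiny CSV++ file might look like. The preamble syntax below (key=value lines) is invented purely for illustration, since nothing in the list above pins down a concrete syntax:

    rows=2
    cols=3
    columns=name:String,age:Int,height:Float
    null=\N
    column_separator=,
    row_separator=\n
    "alice",37,1.78
    "bob",\N,1.65

Because the row and column counts and the column types are declared up front, a parser reading a file like this can allocate storage and validate every value against its declared type before doing any real work.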

From my perspective, the only problem with this proposed format is that it assumes CSV++ files are written in batch mode because you need to know the number of rows before you can start writing data. I don’t really think this is so burdensome; in a streaming context, you could just write out files of length N and then do some last-minute patching to the final file that you wrote to correct the inaccurate number of rows that you indicated in the preamble during the initial writing phase.

I have little-to-no hope that such a format would ever become popular, but I think it’s worth documenting just how little work would be required to replace CSV files with a format that’s far easier for machines to parse while only being slightly harder for humans to write. When I think about the trade-off between human and machine needs in computing, the CSV format is my canonical example of something that ignored the machine needs so systematically that it became hard for humans to use without labor-intensive manual data validation.

Update: Ivar Nesje notes that the preamble also needs to indicate the quotation character for quoting strings and specify the rule for escaping that character within strings.

Turning Distances into Distributions

Deriving Distributions from Distances

Several of the continuous univariate distributions that frequently come up in statistical theory can be derived by transforming distances into probabilities. Essentially, these distributions only differ in terms of how frequently values are drawn that lie at a distance \(d\) from the mode. To see how these transformations work (and unify many distinct distributions), let’s make some simplifying assumptions.

First, we’ll assume that the distribution \(D\) we want to construct is unimodal. In other words, there is a typical value, which we’ll call \(m\), that occurs more frequently than any other value drawn from that distribution. For the sake of simplicity, we’re going to assume that \(m = 0\) from now on.

Second, we’ll assume that the probability density at a value \(x\) drawn from the distribution \(D\) decreases as \(x\) gets further from \(0\). To simplify things, we’ll assume that the density function is inversely proportional to a function \(f\) of \(x\)’s distance from \(0\).

Given these assumptions, we know that the density function will look like,
$$
p(x) = \frac{c}{f(|x|)} \propto \frac{1}{f(|x|)},
$$
where \(c\) is chosen to ensure that the density function integrates to \(1\) over \(x \in (-\infty, +\infty)\). From now on, we’ll ignore this constant as it distracts one from seeing the bigger picture.

The really big question given this setup is this: what densities do we get when we consider different transformations for \(f\)?

How Must the Function \(f\) Behave to Define a Valid Density?

What constraints must we place on \(f\) to ensure that the proposed density \(p(x)\) is actually a valid probability density function?

First, we’ll assume that \(f(|x|)\) is bounded away from zero to ensure that \(\frac{1}{f(|x|)}\) doesn’t head off towards \(\infty\) or take on negative values. To simplify things we’re going to make an even stronger assumption: we’ll assume that \(f(0) = 1\) and that \(f(|x|) > f(0)\) for all \(x \neq 0\).

Second, we’ll assume that \(f(|x|)\) increases fast enough to ensure that,
$$
\int_{-\infty}^{+\infty} \frac{1}{f(|x|)} \, dx < \infty.
$$

This latter requirement turns out to be the more interesting one. How fast must \(f(|x|)\) increase to guarantee that the probability integral is finite? To see that it's possible for \(f(|x|)\) not to increase fast enough, imagine that \(f(|x|) = 1 + |x|\). In this case, we'd be integrating something that looks like,

$$
\int_{-\infty}^{+\infty} \frac{1}{1 + |x|} \, dx,
$$

which is not finite. As such, we know that \(f(|x|)\) must grow faster than the absolute value function.

It turns out that the next obvious choice for growth after linear growth is sufficient to give us a valid probability density function: if we look at \(f(|x|) = 1 + |x|^k\) for \(k \geq 2\), we already get a valid density function. Indeed, when \(k = 2\) we've derived the functional form of the Cauchy distribution: a distribution famous for having tails so heavy that it has no finite moments. More generally, when \(k > 2\), we get something that looks a bit like the t-statistic’s distribution, although it’s definitely not the same. Our densities look like,
$$
p(x) \propto \frac{1}{1 + |x|^k},
$$
whereas the t-statistic’s distribution looks more like,
$$
p(x) \propto \frac{1}{(1 + \frac{|x|^2}{2k - 1})^k}.
$$

That said, when \(f(|x|)\) is a polynomial, we generally get a heavy-tailed distribution.
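
As a quick numerical check of the Cauchy claim (a sketch; the QuadGK and Distributions packages are just a convenient way to do the integral and the comparison):

    # Normalize 1 / (1 + |x|^2) and compare it with the standard Cauchy density.
    using QuadGK, Distributions

    f(x) = 1 + abs(x)^2
    c, _ = quadgk(x -> 1 / f(x), -Inf, Inf)   # normalizing constant, equal to π
    p(x) = (1 / f(x)) / c

    p(1.3), pdf(Cauchy(), 1.3)   # the two values agree (up to quadrature error)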

From Polynomial Growth to Exponential Growth

What happens if we consider exponential growth for \(f(|x|)\) instead of polynomial growth?

(1) If \(f(|x|) = \exp(|x|)\), we get the Laplace distribution. This is the canonical example of a sub-exponential distribution, whose tails drop off exponentially fast, but not as fast as a Gaussian’s tails.

(2) If \(f(|x|) = \exp(|x|^2)\), we get the Gaussian distribution. This is the quintessentially well-behaved distribution that places the vast majority of its probability mass at points within a small distance from zero.

(3) If \(f(|x|) = \exp(|x|^k)\) and \(k > 2\), we get a subset of the sub-Gaussian distributions, whose tails are even thinner than the Gaussian’s tails.

Essentially, the distance-to-distribution way of thinking about distributions gives us a way of thinking about how fast the tail probabilities for a distribution head to zero. By talking in terms of \(f(|x|)\) without reference to anything else, we can derive lots of interesting results that don’t depend upon the many constants we might otherwise get distracted by.

Conclusion: Ranking Distributions by Their Tails

We can recover a lot of classic probability distributions by assuming that they put mass at values in inverse proportion to a function of each value’s distance from zero. In particular, we can transform distances into distributions as follows:

* Select a function \(f(|x|)\) that increases faster than \(f(|x|) = 1 + |x|\).

* Define the unnormalized density as \(p'(x) = \frac{1}{f(|x|)}\).

* Figure out the integral of \(p'(x)\) from \(-\infty\) to \(+\infty\) and call it \(c\).

* Define the normalized density as \(p(x) = \frac{p^{\prime}(x)}{c}\).

This approach gives us the following prominent distributions:

(1) If \(f(|x|) = 1 + |x|^k\) and \(k = 2\), we exactly recover the Cauchy distribution. For larger \(k\), we recover heavy-tailed distributions that seem related to the t-statistic’s distribution, but aren’t identical to it.

(2) If \(f(|x|) = \exp(|x|)\), we exactly recover the Laplace distribution.

(3) If \(f(|x|) = \exp(|x|^k)\) and \(1 < k < 2\), we recover a class of sub-exponential distributions between the Laplace distribution and the Gaussian distribution.

(4) If \(f(|x|) = \exp(|x|^2)\), we exactly recover the Gaussian distribution.

(5) If \(f(|x|) = \exp(|x|^k)\) and \(k > 2\), we recover a class of sub-Gaussian distributions.

If you want to see why this ranking of distributions in terms of distance is so helpful, imagine that the limiting distribution for a mean was Laplace rather than Gaussian. You’d find that inference would be vastly more difficult. The extreme speed with which the tails of the normal distribution head to zero does much of the heavy-lifting in statistical inference.
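
To put rough numbers on that claim, here’s a small sketch comparing the tail probabilities of a unit-variance Laplace distribution with those of a standard Gaussian:

    # P(X > k) for a unit-variance Laplace versus a standard Gaussian: the
    # Gaussian tail vanishes far faster as k grows.
    using Distributions

    laplace = Laplace(0.0, 1 / sqrt(2))   # scale chosen so that the variance is 1
    for k in 2:6
        println(k, ": Laplace tail = ", ccdf(laplace, k),
                ", Gaussian tail = ", ccdf(Normal(), k))
    end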

References to the Literature?

I imagine this classification of distributions has a name. It seems linked to the exponential family, although the details of that link aren’t very clear to me yet.

Much of the material for this note is canonical, although I’m less familiar with the focus on relating metrics to distributions. One thing that got me interested in the topic long ago was this blog post by Ben Moran.

The Convexity of Improbability: How Rare are K-Sigma Effects?

In my experience, people seldom appreciate just how much more compelling a 5-sigma effect is than a 2-sigma effect. I suspect part of the problem is that p-values don’t invoke the visceral sense of magnitude that statements of the form, “this would happen 1 in K times”, would invoke.

To that end, I wrote a short Julia script to show how often a K-sigma effect would occur if the null hypothesis were true. A table of examples for K between 1 and 12 is shown below.
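
The script itself isn’t reproduced here, but a minimal sketch of the calculation (two-sided, i.e. \(P(|Z| > K)\) under a standard normal null) looks something like this:

    # For each K, report roughly how rarely a K-sigma (two-sided) effect occurs
    # when the null hypothesis is true. (Floating-point precision limits the
    # trailing digits for large K.)
    using Distributions, Printf

    for k in 1:12
        p = 2 * ccdf(Normal(), k)          # P(|Z| > k) under the null
        @printf("%2d sigma: 1 in %.0f times\n", k, 1 / p)
    end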

K     A K-sigma effect occurs 1 in N times under the null
1     1 in 3 times
2     1 in 22 times
3     1 in 370 times
4     1 in 15,787 times
5     1 in 1,744,278 times
6     1 in 506,797,346 times
7     1 in 390,682,215,445 times
8     1 in 803,734,397,655,352 times
9     1 in 4,430,313,100,526,877,940 times
10    1 in 65,618,063,552,490,270,597,321 times
11    1 in 2,616,897,361,902,875,526,344,715,013 times
12    1 in 281,455,127,862,356,011,929,402,519,726,336 times

One thing I like about showing things in this format: it’s much more obvious that increasing K by one additional unit has an ever-increasing effect on strengthening your claims. This is what I’m jokingly calling the “convexity of improbability” in the title: each additional sigma makes your claim much stronger than the previous one.

Why I’m Not a Fan of R-Squared

The Big Message

People sometimes use \(R^2\) as their preferred measure of model fit. Unlike quantities such as MSE or MAD, \(R^2\) is not a function only of a model’s errors: its definition contains an implicit model comparison between the model being analyzed and the constant model that uses only the observed mean to make predictions. As such, \(R^2\) answers the question: “does my model perform better than a constant model?” But we often would like to answer a very different question: “does my model perform worse than the true model?”

In theoretical examples, it is easy to see that the answers to these two questions are not interchangeable. We can construct examples in which our model performs no better than a constant model, even though it also performs no worse than the true model. But we can also construct examples in which our model performs much better than a constant model, even when it also performs much worse than the true model.

As with all model comparisons, \(R^2\) is a function not only of the models being compared, but also a function of the data set being used to perform the comparison. For almost all models, there exist data sets that are wholly incapable of distinguishing between the constant model and the true model. In particular, when using a dataset with insufficient model-discriminating power, \(R^2\) can be pushed arbitrarily close to zero — even when we are measuring \(R^2\) for the true model. As such, we must always keep in mind that \(R^2\) does not tell us whether our model is a good approximation to the true model: \(R^2\) only tells us whether our model performs noticeably better than the constant model on our dataset.

A Theoretical Example

To see how comparing a proposed model with a constant model could lead to opposite conclusions than comparing the same proposed model with the true model, let’s consider a simple example: we want to model the function \(f(x)\), which is observed noisily over an equally spaced grid of \(n\) points between \(x_{min}\) and \(x_{max}\).

To start, let’s assume that:

  • \(f(x) = \log(x)\).
  • \(x_{min} = 0.99\).
  • \(x_{max} = 1.01\).
  • At 1,000 evenly spaced values of \(x\) between \(x_{min}\) and \(x_{max}\), we observe \(y_i = f(x_i) + \epsilon_i\), where \(\epsilon_i \sim \mathbf{N}(0, \sigma^2)\).

Based on this data, we’ll attempt to learn a model of \(f(x)\) using univariate OLS regression. We’ll fit both a linear model and a quadratic model. An example realization of this modeling process looks like the following:

In this graph, we can see that \(f(x)\) is well approximated by a line and so both our linear and quadratic regression models come close to recovering the true model. This is because \(x_{min}\) and \(x_{max}\) are very close and our target function is well approximated by a line in this region, especially relative to the amount of noise in our observations.

We can see the quality of these simple regression models if we look at two binary comparisons: our model versus the constant model and our model versus the true logarithmic model. To simplify these calculations, we’ll work with an alternative \(R^2\) calculation that ignores corrections for the number of regressors in a model. Thus, we’ll compute \(R^2\) for a model \(m\) (versus the constant model \(c\)) as follows:

$$
R^2 = \frac{\text{MSE}_c - \text{MSE}_m}{\text{MSE}_c} = 1 - \frac{\text{MSE}_m}{\text{MSE}_c}
$$

As with the official definition of \(R^2\), this quantity tells us how much of the residual error left over by the constant model is accounted for by our model. To see how this comparison leads to different conclusions than a comparison against the true model, we’ll also consider a variant of \(R^2\) that we’ll call \(E^2\) to emphasize that it measures how much worse our model’s errors are than those we’d see using the true model \(t\):

$$
E^2 = \frac{\text{MSE}_m - \text{MSE}_t}{\text{MSE}_t} = \frac{\text{MSE}_m}{\text{MSE}_t} - 1
$$

Note that \(E^2\) has the opposite sense as \(R^2\): better fitting models have lower values of \(E^2\).
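
The code linked at the end of the post does the real work, but here’s a minimal sketch of the first example in Julia. The noise level \(\sigma\) is an assumption on my part, since it isn’t stated above:

    # Simulate noisy observations of log(x) on a narrow grid, fit a linear
    # model by OLS, and compute R^2 (vs. the constant model) and E^2
    # (vs. the true model) as defined above.
    using Random, Statistics

    Random.seed!(1)
    n, σ = 1_000, 0.1                        # σ is an assumed noise level
    x = collect(range(0.99, 1.01, length = n))
    y = log.(x) .+ σ .* randn(n)

    X = hcat(ones(n), x)                     # design matrix for the linear model
    β = X \ y                                # OLS fit

    mse(residuals) = mean(abs2, residuals)
    mse_c = mse(y .- mean(y))                # constant model
    mse_m = mse(y .- X * β)                  # fitted linear model
    mse_t = mse(y .- log.(x))                # true model

    R2 = 1 - mse_m / mse_c
    E2 = mse_m / mse_t - 1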

Computing these numbers on our example, we find that \(R^2 = 0.007\) for the linear model and \(R^2 = 0.006\) for the true model. In contrast, \(E^2 = -0.0008\) for the linear model, indicating that it is essentially indistinguishable from the true model on this data set. Even though \(R^2\) suggests our model is not very good, \(E^2\) tells us that our model is close to perfect over the range of \(x\).

Now what would happen if the gap between \(x_{max}\) and \(x_{min}\) grew larger? For a monotonic function like \(f(x) = \log(x)\), we’ll see that \(E^2\) increases monotonically, but that \(R^2\) will do something very strange: it will increase for a while and then start to decrease.

Before considering this idea in general, let’s look at one more specific example in which \(x_{min} = 1\) and \(x_{max} = 1000\):

In this case, visual inspection makes it clear that the linear model and quadratic models are both systematically inaccurate, but their values of \(R^2\) have gone up substantially: \(R^2 = 0.760\) for the linear model and \(R^2 = 0.997\) for the true model. In contrast, \(E^2 = 85.582\) for the linear model, indicating that this data set provides substantial evidence that the linear model is worse than the true model.

These examples show how the linear model’s \(R^2\) can increase substantially, even though most people would agree that the linear model is becoming an increasingly unacceptably bad approximation of the true model. Indeed, although \(R^2\) seems to improve when transitioning from one example to the other, \(E^2\) gets strictly worse as the gap between \(x_{min}\) and \(x_{max}\) increases. This suggests that \(R^2\) might be misleading if we interpret it as a proxy for the generally unmeasurable \(E^2\). But, in truth, our two extreme examples don’t tell the full story: \(R^2\) actually changes non-monotonically as we transition between the two cases we’ve examined.

Consider performing a similar analysis to the ones we’ve done above, but over many possible grids of points starting with \((x_{min}, x_{max}) = (1, 1.1)\) and ending with \((x_{min}, x_{max}) = (1, 1000)\). If we calculate \(R^2\) and \(E^2\) along the way, we get graphs like the following (after scaling the x-axis logarithmically to make the non-monotonicity more visually apparent).

First, \(R^2\):

Second, \(E^2\):

Notice how strange the graph for \(R^2\) looks compared with the graph for \(E^2\). \(E^2\) always grows: as we consider more discriminative data sets, we find increasingly strong evidence that our linear approximation is not the true model. In contrast, \(R^2\) starts off very low (precisely when \(E^2\) is very low because our linear model is little worse than the true model) and then changes non-monotonically: it peaks when there is enough variation in the data to rule out the constant model, but not yet enough variation in the data to rule out the linear model. After that point it decreases. A similar non-monotonicity is seen in the \(R^2\) value for the quadratic model. Only the true model shows a monotonic increase in \(R^2\).

Conclusion

I wrote down all of this not to discourage people from ever using \(R^2\). I use it sometimes and will continue to do so. But I think it’s important to understand both that (a) the value of \(R^2\) is heavily determined by the data set being used and that (b) the value of \(R^2\) can decrease even when your model is becoming an increasingly good approximation to the true model. When deciding whether a model is useful, a high \(R^2\) can be undesirable and a low \(R^2\) can be desirable.

This is an inescapable problem: whether a wrong model is useful always depends upon the domain in which the model will be applied and the way in which we evaluate all possible errors over that domain. Because \(R^2\) contains an implicit model comparison, it suffers from this general dependence on the data set. \(E^2\) also has such a dependence, but it at least seems not to exhibit a non-monotonic relationship with the amount of variation in the values of the regressor variable, \(x\).

Code

The code for this post is on GitHub.

Retrospective Edits

In retrospect, I would like to have made more clear that measures of fit like MSE and MAD also depend upon the domain of application. My preference for their use is largely that the lack of an implicit normalization means that they are prima facie arbitrary numbers, which I hope makes it more obvious that they are highly sensitive to the domain of application. The normalization in \(R^2\), in contrast, makes the number seem less arbitrary and might allow one to forget how data-dependent it is.

In addition, I really ought to have defined \(E^2\) using a different normalization. For a model \(m\) compared with both the constant model \(c\) and the true model \(t\), it would be easier to work with the bounded quantity:

$$
E^2 = \frac{\text{MSE}_m - \text{MSE}_t}{\text{MSE}_c - \text{MSE}_t}
$$

This quantity asks where one lies on a spectrum from the worst defensible model’s performance gap (which is \(\text{MSE}_c - \text{MSE}_t\)) to the best possible model’s performance gap (which is \(\text{MSE}_t - \text{MSE}_t = 0\)). Note that in the homoscedastic regression setup I’ve considered, \(\text{MSE}_t = \sigma^2\), which can be estimated from empirical data using the identifying assumption of homoscedasticity and repeated observations at any fixed value of the regressor, \(x\). (I’m fairly certain that this normalized quantity has a real name in the literature, but I don’t know it off the top of my head.)

I should also have noted that some additional hacks are required to guarantee this number lies in \([0, 1]\) because the lack of model nesting means that violations are possible. (Similar violations are not possible in the linear regression setup because the constant model is almost always nested in the more elaborate models being tested.)

Retrospective Edits Round 2

After waking up to see this on Hacker News, I realized I should probably have included a graphical representation of the core idea to help readers who don’t have any background in theoretical statistics. Here’s that graphical representation, which shows that, given three models to be compared, we can always place our model on a spectrum from the performance of the worst model (i.e. the constant model) to the performance of the best model (i.e. the true model).

[Figure: a model’s performance placed on a spectrum between the constant model and the true model, as measured by \(R^2\) and \(E^2\).]

When measuring a model’s position on this spectrum, \(R^2\) and \(E^2\) are complementary quantities.

Retrospective Edits Round 3

An alternative presentation of these ideas would have focused on a bias/variance decomposition of the quantities being considered. From that perspective, we’d see that:

$$
R^2 = 1 - \frac{\text{MSE}_m}{\text{MSE}_c} = 1 - \frac{\text{bias}_m^2 + \sigma^2}{\text{bias}_c^2 + \sigma^2}
$$

Using the redefinition of \(E^2\) I mentioned in the first round of “Retrospective Edits”, the alternative to \(R^2\) would be:

$$
E^2 = \frac{\text{MSE}_m - \text{MSE}_t}{\text{MSE}_c - \text{MSE}_t} = \frac{(\text{bias}_m^2 + \sigma^2) - (\text{bias}_t^2 + \sigma^2)}{(\text{bias}_c^2 + \sigma^2) - (\text{bias}_t^2 + \sigma^2)} = \frac{\text{bias}_m^2 - \text{bias}_t^2}{\text{bias}_c^2 - \text{bias}_t^2} = \frac{\text{bias}_m^2 - 0}{\text{bias}_c^2 - 0} = \frac{\text{bias}_m^2}{\text{bias}_c^2}
$$

In large part, I find \(R^2\) hard to reason about because of the presence of the \(\sigma^2\) term in the ratios shown above. \(E^2\) essentially removes that term, which I find useful for reasoning about the quality of a model’s fit to data.