Type Safety and Statistical Computing

I broadly believe that the statistics community would benefit from greater exposure to computer science concepts. Consistent with that belief, I argue in this post that the concept of type-safety could be used to develop a normative theory for how statistical computing systems ought to behave. I also argue that such a normative theory would allow us to clarify the ways in which current systems can be misused. Along the way, I note the numerous and profound challenges that any realistic proposal to implement a more type-safe language for statistics would encounter.

Type Safety as a Concept in Computer Science

Vijay Saraswat defines a type-safe language as follows:

A language is type-safe if the only operations that can be performed on data in the language are those sanctioned by the type of the data.

Although I personally like this definition’s brevity, I suspect many readers will find it helpful if I explain the definition with some examples.

To begin, remember that a computer is ultimately little more than a piece of hardware that (a) can perform operations on bits and that (b) has been given access to other hardware that represents data as bits. As a result, at the hardware level, text files are just bits, audio files are just bits, and image files are just bits.

Because every kind of data is ultimately represented as just a sequence of bits, it is always possible, in theory, to apply a function that only makes sense for one interpretation of those bits (say, an operation that computes the FFT of an audio file) to a set of bits that was intended to be interpreted in a different way (say, a text file containing the Gettysburg Address). The fact that there is no reasonable interpretation of the FFT of a text file that we pretend is an audio file does not prevent a computer from computing it.

The classic solution to this problem in computer science is to introduce more information about each piece of data in the system. In particular, the bits that represent the Gettysburg Address should be coupled to a type which indicates that those bits represent a string of characters and should therefore be viewed as text rather than as, say, audio or video. In general, programming languages with types can use type information to ensure that users cannot apply methods designed to process one kind of data to another incompatible kind of data that cannot be correctly processed in the same way.
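To make this concrete, here is a minimal sketch, in Julia, of the mechanism described above. The TextData and AudioData types and the spectrum function are names I’ve invented for illustration, not parts of any real library, and the sketch assumes the FFTW.jl package for the FFT itself.

    # A minimal sketch of types sanctioning operations. TextData, AudioData,
    # and spectrum are illustrative names invented for this example.
    using FFTW  # assumes the FFTW.jl package is installed

    struct TextData
        characters::String
    end

    struct AudioData
        samples::Vector{Float64}
    end

    # An FFT only makes sense for data we have agreed to view as audio.
    spectrum(a::AudioData) = abs.(fft(a.samples))

    gettysburg = TextData("Four score and seven years ago...")
    spectrum(gettysburg)  # MethodError: no method matching spectrum(::TextData)

Because spectrum has no method for TextData, the nonsensical computation is rejected before it can run.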

A language that uses type information to ensure that improper operations are not performed on data is a language that aspires, at some level, to type-safety. In what follows, I argue that existing tools for analyzing data statistically are generally type-unsafe, but I also note that the creation of type-safe languages for statistics would be profoundly challenging.

What Would Type-Safety Mean for Statistical Computing?

Let’s consider three settings in which we might want to compute a mean over a vector of integers. These settings were chosen because they progressively require stronger assumptions to justify the belief that the computations being performed will generate correct results relative to the goals of the person asking for the computation to be performed.

Types that Justify Descriptive Statistics

In the first case, let’s assume that we have a vector of integers. For concreteness, I’ll assume that we have the vector v = [6, 6, 8, 7, 5], where I am using Julia-like notation to define the vector.

Suppose that we want to calculate the arithmetic mean of this vector of integers as a pure descriptive statistic. I’ll refer to this computation as descriptive_mean(v). The arithmetic mean is a well-defined concept on any vector of real numbers. As such, it’s reasonable to assume that a function that simply sums the numbers in a vector and then divides the sum by the length of the vector will generate the correct descriptive statistic. Moreover, we can credibly believe that any vector of integers (which would be called Vector{Int} in Julia-like notation) can be handled correctly by code that uses that algorithmic strategy.
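As a sketch, a Julia implementation of descriptive_mean might look like the following; the function name comes from the discussion above, and the body is just the sum-and-divide strategy.

    # A sketch of descriptive_mean: sum the entries and divide by the length.
    function descriptive_mean(v::Vector{Int})
        return sum(v) / length(v)
    end

    descriptive_mean([6, 6, 8, 7, 5])  # 6.4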

Types that Justify Inferential Statistics

In addition to computing descriptive statistics, people also frequently want to compute inferential statistics in which the vector of numbers being analyzed is not itself the primary object of study, but is only being analyzed because it is viewed as a random sample from a larger population that represents the primary object of study.

In general, we cannot treat an arbitrary Vector{Int} as a valid random sample. Indeed, if we were handed the vector v = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34], which is just the first ten Fibonacci numbers, it would make sense for us to doubt whether this vector is really a random sample from any probability distribution.

In other words, the calculation of a mean as a correct inferential quantity requires that the data we analyze have a type more like IIDRandomSample{Int} than like Vector{Int}. If such a type existed, we might make a first pass at type-safety by insisting that inferential_mean(v) can only be applied to data of type IIDRandomSample{Int}.
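Here is a sketch of what that might look like in Julia. Note that the hypothetical IIDRandomSample constructor cannot verify the IID assumption; wrapping a vector in it only records that the analyst has asserted that assumption.

    using Statistics

    # A hypothetical wrapper type: it records, but cannot verify, the claim
    # that the data are an IID sample from some distribution.
    struct IIDRandomSample{T}
        data::Vector{T}
    end

    # inferential_mean accepts only asserted random samples, not bare vectors.
    inferential_mean(s::IIDRandomSample) = mean(s.data)

    v = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
    # inferential_mean(v)                 # MethodError: no method for Vector{Int}
    inferential_mean(IIDRandomSample(v))  # 8.8, allowed once the assumption is stated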

Hopefully this example makes clear that many computations performed in existing statistical computing systems are only valid if we make assumptions that are generally not encoded in the type systems used in statistical computing. In this particular case, descriptive_mean(v) and inferential_mean(v) would always generate the same outputs: the type system would only check whether one could assert that inferential_mean(v) had legitimate status as an estimator of an unobserved quantity. But a computation like inferential_sem(v) would, in fact, produce different results for IIDRandomSample{Int} than it would for, say, KDependentRandomSample{Int}. And this kind of differing behavior is, I think, the norm, as the next example suggests.
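To illustrate, here is a sketch that builds on the IIDRandomSample type above and adds a hypothetical KDependentRandomSample type. The K-dependent estimator below is a standard plug-in estimator that augments the sample variance with autocovariances up to lag K; I’m stating that formula as an assumption of this sketch rather than taking it from the discussion above.

    # Builds on IIDRandomSample and the Statistics import from the previous sketch.
    # In a KDependentRandomSample, observations more than K indices apart are
    # assumed independent.
    struct KDependentRandomSample{T}
        data::Vector{T}
        K::Int
    end

    # Under the IID assumption, the usual standard error of the mean applies.
    inferential_sem(s::IIDRandomSample) = std(s.data) / sqrt(length(s.data))

    # Under K-dependence, the variance of the sample mean also involves
    # autocovariances up to lag K, so the same numbers yield a different answer.
    function inferential_sem(s::KDependentRandomSample)
        x, K, n = s.data, s.K, length(s.data)
        xbar = mean(x)
        v = sum((x .- xbar) .^ 2) / n
        for k in 1:K
            gamma_k = sum((x[1:(n - k)] .- xbar) .* (x[(1 + k):n] .- xbar)) / n
            v += 2 * (1 - k / n) * gamma_k
        end
        return sqrt(v / n)
    end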

Types that Justify Quantitative Error Bounds on Inferential Statistics

To make matters more complicated, what would happen if we wanted to compute something more than just the mean as an inferential statistic? What if we also wanted a bound on its probable error, as inferential_mean_with_error_bound(v)? In that case, we would need to make stronger assumptions about our vector of numbers, because it will no longer be sufficient to assume that we are dealing with an IIDRandomSample{Int}.

Instead, we’ll also need to ensure that the vector is long enough that the sampling distribution of the mean is approximately normal. Whether approximate normality holds will depend, in general, both on the specific distribution from which each element of the vector is drawn and on the vector’s length. We will generally not have access to this information and will therefore generally not be able to credibly use types to express these assumptions, even though their validity is essential to the correctness of our error bounds.
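Continuing to build on the sketches above, an implementation of inferential_mean_with_error_bound might look like the following. The crucial point is that the code happily computes an interval for any IIDRandomSample, even though the interval is only approximately valid under a normality assumption that the type does not express. The 1.96 multiplier, corresponding to an approximate 95% interval, is my own illustrative choice.

    # A sketch of inferential_mean_with_error_bound, building on the
    # definitions above. The interval is only approximately valid when the
    # sampling distribution of the mean is close to normal, an assumption
    # about the underlying distribution and the sample size that the
    # IIDRandomSample type cannot check.
    function inferential_mean_with_error_bound(s::IIDRandomSample; z::Float64 = 1.96)
        m = inferential_mean(s)
        halfwidth = z * inferential_sem(s)
        return (m - halfwidth, m + halfwidth)
    end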

In general, encoding all of the assumptions we need to employ in a statistical analysis in a type system is likely to be hopeless and would frequently end up requiring that our type system know things about probability theory that are at the boundaries of what statistical theory can currently offer. As such, we should probably conclude that using types to demonstrate the full correctness of arbitrary statistical programs is not tractable. Although it is clear that we could catch many obvious mistakes in statistical programs by using types that encode our assumptions, we will still fail to catch many other serious mistakes.

Attainable Kinds of Type-Safety in Statistical Computing

Hopefully these three examples have illustrated the challenges involved with type-safety as a concept in statistical computing. One might conclude from the error bound problem above that type systems are simply not capable of capturing the kinds of assumptions we need to make in applied statistics. But although we cannot encode all of our assumptions, it does not follow that it would be useless to encode some of them.

To argue for a more positive take on why I think we might want to develop richer type systems for statistical computing, I’ll try to list some concepts that I think could be formalized fruitfully:

  • In surveys, we can use type systems to distinguish between sampling strategies. Indeed, some R packages for survey analysis already use annotations on values to distinguish simple random sampling from stratified sampling and from more complex sampling strategies. (A sketch of this idea appears after this list.)
  • In experiments, we can use type systems to distinguish between distinct strategies for randomizing units' assignment to treatments. We can, for example, distinguish completely randomized experiments from block-randomized experiments.
  • In analyses of datasets with missing values, we can use type systems to distinguish between variables that are missing completely at random from variables that are missing conditionally at random and also from variables that are missing not at random.
  • In time series analyses of experiments, we can use type systems to distinguish pre-treatment and post-treatment variables to ensure that we do not introduce biases by conditioning on post-treatment variables.
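To give a sketch of just the first of these ideas, the hypothetical types below distinguish a simple random sample from a stratified sample and let the mean estimator dispatch on the sampling design. The type and field names are invented for illustration and are not taken from any existing survey package.

    using Statistics

    # Hypothetical types distinguishing two sampling designs.
    abstract type SurveySample end

    struct SimpleRandomSample <: SurveySample
        values::Vector{Float64}
    end

    struct StratifiedSample <: SurveySample
        strata::Vector{Vector{Float64}}  # one vector of responses per stratum
        weights::Vector{Float64}         # population share of each stratum
    end

    # Under simple random sampling, the unweighted sample mean is the estimator.
    estimate_mean(s::SimpleRandomSample) = mean(s.values)

    # Under stratified sampling, stratum means are weighted by population shares.
    estimate_mean(s::StratifiedSample) =
        sum(w * mean(v) for (w, v) in zip(s.weights, s.strata))

The estimator changes with the design, so a bare Vector{Float64} would not be enough to know which computation is correct.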

Hopefully these examples make clear that type systems, despite the many limitations they would have, could make statistical computing safer by making assumptions more explicit and ensuring that invalid analyses are not conducted without users at least stating all of the erroneous assumptions they’ve made. My experience is that explicitness is almost always more tedious than magic, but is also far easier to debug as an author and verify as a reviewer.

At the same time, I hope that it’s clear to readers that most of the current tools for statistical computing make little attempt to ensure type-safe analyses, because the types used in SQL and R and Python and Julia do not encode the relevant concepts about random samples, independence structure, and so on. Even a computation as simple as sem in SciPy is type-unsafe. (Indeed, the SciPy documentation doesn’t even mention that the correctness of the computation depends upon the assumption that the input array is an IID random sample.)

In short, I think we have a long way to go before we have truly safe tools for statistical computing. Embracing the concept of type-safety is one potential way to think about making our tools safer.