Turning Off Comments

A few days ago I disabled the comment system on this site. I’d been debating the change for some time, but reached a final decision while reading the comments on an article about a vaccine for Lyme disease.

Although this site has generally had very high quality comments, I’ve become increasingly opposed (as a matter of principle) to the use of online comment systems. My feelings mirror those of many other people who’ve deactivated comments on their sites, including Marco Arment and Matt Gemmell. As many have said before, comments tend to bring out the worst in people. The conversations that comments are ostensibly supposed to inspire now occur on Twitter and in volleys of posts traded back and forth between blogs. In contrast, comment threads tend to trap the material that people either (a) don’t want to associate with their own name or (b) don’t want to take the time to write up formally. I think we have too much of both of these sorts of writing and would prefer not to encourage either.

What’s Next

The last two weeks have been full of changes for me. For those who’ve been asking about what’s next, I thought I’d write up a quick summary of all the news.

(1) I successfully defended my thesis this past Monday. Completing a Ph.D. has been a massive undertaking for the past five years, and it’s a major relief to be done. From now on I’ll be (perhaps undeservedly) making airline and restaurant reservations under the name Dr. White.

(2) As announced last week, I’ll be one of the residents at Hacker School this summer. The list of other residents is pretty amazing, and I’m really looking forward to meeting the students.

(3) In addition to my residency at Hacker School, I’ll be a temporary postdoc in the applied math department at MIT, where I’ll be working on Julia full-time. Expect to see lots of work on building up the core data analysis infrastructure.

(4) As of today I’ve accepted an offer to join Facebook’s Data Science team in the fall. I’ll be moving out to the Bay Area in November.

That’s all so far.

Using Norms to Understand Linear Regression

Introduction

In my last post, I described how we can derive modes, medians and means as three natural solutions to the problem of summarizing a list of numbers, \((x_1, x_2, \ldots, x_n)\), using a single number, \(s\). In particular, we measured the quality of different potential summaries in three different ways, which led us to modes, medians and means respectively. Each of these quantities emerged from measuring the typical discrepancy between an element of the list, \(x_i\), and the summary, \(s\), using a formula of the form,
$$
\sum_i |x_i - s|^p,
$$
where \(p\) was either \(0\), \(1\) or \(2\).

The \(L_p\) Norms

In this post, I’d like to extend this approach to linear regression. The notion of discrepancies we used in the last post is very closely tied to the idea of measuring the size of a vector in \(\mathbb{R}^n\). Specifically, we were minimizing a measure of discrepancies that was almost identical to the \(L_p\) family of norms that can be used to measure the size of vectors. Understanding \(L_p\) norms makes it much easier to describe several modern generalizations of classical linear regression.

To extend our previous approach to the more standard notion of an \(L_p\) norm, we simply take the sum we used before and rescale things by taking a \(p^{th}\) root. This gives the formula for the \(L_p\) norm of any vector, \(v = (v_1, v_2, \ldots, v_n)\), as,
$$
|v|_p = (\sum_i |v_i|^p)^\frac{1}{p}.
$$
When \(p = 2\), this formula reduces to the familiar formula for the length of a vector:
$$
|v|_2 = \sqrt{\sum_i v_i^2}.
$$

In the last post, the vector we cared about was the vector of elementwise discrepancies, \(v = (x_1 - s, x_2 - s, \ldots, x_n - s)\). We wanted to minimize the overall size of this vector in order to make \(s\) a good summary of \(x_1, \ldots, x_n\). Because we were interested only in the minimum size of this vector, it didn’t matter that we skipped taking the \(p^{th}\) root at the end: one vector, \(v_1\), has a smaller norm than another vector, \(v_2\), only when the \(p^{th}\) power of its norm is smaller than the \(p^{th}\) power of the other’s. What was essential wasn’t the scale of the norm, but rather the value of \(p\) that we chose. Here we’ll follow that approach again. Specifically, we’ll again be working consistently with the \(p^{th}\) power of an \(L_p\) norm:
$$
|v|_p^p = (\sum_i |v_i|^p).
$$
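
To make these formulas concrete, here is a small R sketch of both quantities (the function names lp_norm and lp_norm_p are mine, chosen purely for illustration):

lp_norm_p <- function(v, p) {
  # The p-th power of the L_p norm: the sum of |v_i|^p
  sum(abs(v)^p)
}

lp_norm <- function(v, p) {
  # The L_p norm itself: take the p-th root of the sum
  lp_norm_p(v, p)^(1 / p)
}

v <- c(3, -4)
lp_norm(v, 2)    # 5, the familiar Euclidean length
lp_norm_p(v, 2)  # 25, the quantity we will actually minimize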

The Regression Problem

Using \(L_p\) norms to measure the overall size of a vector of discrepancies extends naturally to other problems in statistics. In the previous post, we were trying to summarize a list of numbers by producing a simple summary statistic. In this post, we’re instead going to summarize the relationship between two lists of numbers in a form that generalizes traditional regression models.

Instead of a single list, we’ll now work with two vectors: \((x_1, x_2, \ldots, x_n)\) and \((y_1, y_2, \ldots, y_n)\). Because we like simple models, we’ll make the very strong (and very convenient) assumption that the second vector is, approximately, a linear function of the first vector, which gives us the formula:
$$
y_i \approx \beta_0 + \beta_1 x_i.
$$

In practice, this linear relationship is never perfect, but only an approximation. As such, for any specific values we choose for \(\beta_0\) and \(\beta_1\), we have to compute a vector of discrepancies: \(v = (y_1 - (\beta_0 + \beta_1 x_1), \ldots, y_n - (\beta_0 + \beta_1 x_n))\). The question then becomes: how do we measure the size of this vector of discrepancies? By choosing different norms to measure its size, we arrive at several different forms of linear regression models. In particular, we’ll work with three norms: the \(L_0\), \(L_1\) and \(L_2\) norms.

As we did with the single vector case, here we’ll define discrepancies as,
$$
d_i = |y_i - (\beta_0 + \beta_1 x_i)|^p,
$$
and the total error as,
$$
E_p = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^p,
$$
which is just the \(p^{th}\) power of the \(L_p\) norm.

Several Forms of Regression

In general, we want to estimate a set of regression coefficients that minimize this total error. Different forms of linear regression appear when we alter the value of \(p\). As before, let’s consider three settings:
$$
E_0 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^0
$$
$$
E_1 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^1
$$
$$
E_2 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^2
$$

What happens in these settings? In the first case, we select regression coefficients so that the line passes through as many points as possible. Clearly we can always select a line that passes through any pair of points. And we can show that there are data sets in which we cannot do better. So the \(L_0\) norm doesn’t seem to provide a very useful form of linear regression, but I’d be interested to see examples of its use.

In contrast, minimizing \(E_1\) and \(E_2\) define quite interesting and familiar forms of linear regression. We’ll start with \(E_2\) because it’s the most familiar: it defines Ordinary Least Squares (OLS) regression, which is the one we all know and love. In the \(L_2\) case, we select \(\beta_0\) and \(\beta_1\) to minimize,
$$
E_2 = \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2,
$$
which is the summed squared error over all of the \((x_i, y_i)\) pairs. In other words, Ordinary Least Squares regression is just an attempt to find an approximating linear relationship between two vectors that minimizes the \(L_2\) norm of the vector of discrepancies.

Although OLS regression is clearly king, the coefficients we get from minimizing \(E_1\) are also quite widely used: using the \(L_1\) norm defines Least Absolute Deviations (LAD) regression, which is also sometimes called Robust Regression. This approach to regression is robust because outliers that produce errors greater than \(1\) are not further amplified by the squaring operation used in OLS regression; instead, only their absolute values are taken. This means that the resulting model will try to match the overall linear pattern in the data even when there are some very large outliers.

We can also relate these two approaches to the strategy employed in the previous post. When we use OLS regression (which would be better called \(L_2\) regression), we predict the mean of \(y_i\) given the value of \(x_i\). And when we use LAD regression (which would be better called \(L_1\) regression), we predict the median of \(y_i\) given the value of \(x_i\). Just as I said in the previous post, the core theoretical tool that we need to understand is the \(L_p\) norm. For single number summaries, it naturally leads to modes, medians and means. For simple regression problems, it naturally leads to LAD regression and OLS regression. But there’s more: it also leads naturally to the two most popular forms of regularized regression.
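
The practical difference between these two forms of regression is easiest to see on data that contain an outlier. Here is a rough R sketch (my own illustration, with made-up variable names) that fits both lines by minimizing \(E_1\) and \(E_2\) directly with optim:

set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20)
y[20] <- y[20] + 100  # one large outlier

# Total error E_p for candidate coefficients beta = (beta0, beta1)
total_error <- function(beta, p) {
  sum(abs(y - (beta[1] + beta[2] * x))^p)
}

ols_fit <- optim(c(0, 0), total_error, p = 2)$par  # minimizes E_2
lad_fit <- optim(c(0, 0), total_error, p = 1)$par  # minimizes E_1

ols_fit  # pulled noticeably toward the outlier
lad_fit  # stays close to the underlying (2, 3) line

In practice you would use lm for the \(L_2\) fit and a routine like rq from the quantreg package for the \(L_1\) fit; optim is used here only to keep the connection to the error functions explicit.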

Regularization

If you’re not familiar with regularization, the central idea is that we don’t exclusively try to find the values of \(\beta_0\) and \(\beta_1\) that minimize the discrepancy between \(\beta_0 + \beta_1 x_i\) and \(y_i\), but also simultaneously try to satisfy a competing requirement that \(\beta_1\) not get too large. Note that we don’t try to control the size of \(\beta_0\) because it describes the overall level of the data rather than the relationship between \(x\) and \(y\).

Because these objectives compete, we have to combine them into a single objective. We do that by working with a linear sum of the two objectives. And because both the discrepancy objective and the size of the coefficients can be described in terms of norms, we’ll assume that we want to minimize the \(L_p\) norm of the discrepancies and the \(L_q\) norm of the \(\beta\)’s. This means that we end up trying to minimize an expression of the form,
$$
(\sum_i |y_i - (\beta_0 + \beta_1 x_i)|^{p}) + \lambda (|\beta_1|^q).
$$

In most regularized regression models that I’ve seen in the wild, people tend to use \(p = 2\) and \(q = 1\) or \(q = 2\). When \(q = 1\), this model is called the LASSO. When \(q = 2\), this model is called ridge regression. In a future post, I’ll try to describe why the LASSO and ridge regression produce such different patterns of coefficients.
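
To make that objective concrete, here is a similar rough R sketch (again my own illustration rather than canonical code) that minimizes the penalized objective directly, using \(p = 2\) for the discrepancies:

set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20)

# Squared-error discrepancies plus an L_q penalty on beta1
penalized_error <- function(beta, q, lambda) {
  sum((y - (beta[1] + beta[2] * x))^2) + lambda * abs(beta[2])^q
}

lasso_like <- optim(c(0, 0), penalized_error, q = 1, lambda = 10)$par  # LASSO-style penalty
ridge_like <- optim(c(0, 0), penalized_error, q = 2, lambda = 10)$par  # ridge-style penalty

For real problems you would reach for a dedicated implementation such as the glmnet package, which fits both penalties far more efficiently than a generic optimizer.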

Modes, Medians and Means: A Unifying Perspective

Introduction / Warning

Any traditional introductory statistics course will teach students the definitions of modes, medians and means. But, because introductory courses can’t assume that students have much mathematical maturity, the close relationship between these three summary statistics can’t be made clear. This post tries to remedy that situation by making it clear that all three concepts arise as specific parameterizations of a more general problem.

To do so, I’ll need to introduce one non-standard definition that may trouble some readers. In order to simplify my exposition, let’s all agree to assume that \(0^0 = 0\). In particular, we’ll want to assume that \(|0|^0 = 0\), even though \(|\epsilon|^0 = 1\) for all \(\epsilon > 0\). This definition is non-standard, but it greatly simplifies what follows and emphasizes the conceptual unity of modes, medians and means.
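
If you want to check the calculations in this post numerically, keep in mind that R (like most languages) evaluates 0^0 as 1, so the simplest way to respect the convention above is to implement the zero-one discrepancy directly as an indicator. A minimal sketch (the function name is mine):

zero_one <- function(x, s) {
  # |x - s|^0 under the convention |0|^0 = 0: 1 when x != s, 0 when x == s
  as.numeric(x != s)
}

zero_one(2, 0)  # 1
zero_one(2, 2)  # 0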

Constructing a Summary Statistic

To see how modes, medians and means arise, let’s assume that we have a list of numbers, \((x_1, x_2, \ldots, x_n)\), that we want to summarize. We want our summary to be a single number, which we’ll call \(s\). How should we select \(s\) so that it summarizes the numbers, \((x_1, x_2, \ldots, x_n)\), effectively?

To answer that, we’ll assume that \(s\) is an effective summary of the entire list if the typical discrepancy between \(s\) and each of the \(x_i\) is small. With that assumption in place, we only need to do two things: (1) define the notion of discrepancy between two numbers, \(x_i\) and \(s\); and (2) define the notion of a typical discrepancy. Because each number \(x_i\) produces its own discrepancy, we’ll need to introduce a method for aggregating the individual discrepancies in order to say something about the typical discrepancy.

Defining a Discrepancy

We could define the discrepancy between a number \(x_i\) and another number \(s\) in many ways. For now, we’ll consider only three possibilities. Each of these three options satisfies a basic intuition we have about the notion of discrepancy: we expect that the discrepancy between \(x_i\) and \(s\) should be \(0\) if \(|x_i - s| = 0\) and that the discrepancy should be greater than \(0\) if \(|x_i - s| > 0\). That leaves us with one obvious question: how much greater should the discrepancy be when \(|x_i - s| > 0\)?

To answer that question, let’s consider three definitions of the discrepancy, \(d_i\):

  1. \(d_i = |x_i - s|^0\)
  2. \(d_i = |x_i - s|^1\)
  3. \(d_i = |x_i - s|^2\)

How should we think about these three possible definitions?

The first definition, \(d_i = |x_i - s|^0\), says that the discrepancy is \(1\) if \(x_i \neq s\) and is \(0\) only when \(x_i = s\). This notion of discrepancy is typically called zero-one loss in machine learning. Note that this definition implies that anything other than exact equality produces a constant measure of discrepancy. Summarizing \(x_i = 2\) with \(s = 0\) is neither better nor worse than using \(s = 1\). In other words, the discrepancy does not increase at all as \(s\) gets further and further from \(x_i\). You can see this reflected in the far-left column of the image below:


[Figure: the three discrepancy functions, \(|x_i - s|^0\), \(|x_i - s|^1\) and \(|x_i - s|^2\), plotted as functions of \(s\)]

The second definition, \(d_i = |x_i - s|^1\), says that the discrepancy is equal to the distance between \(x_i\) and \(s\). This is often called an absolute deviation in machine learning. Note that this definition implies that the discrepancy should increase linearly as \(s\) gets further and further from \(x_i\). This is reflected in the center column of the image above.

The third definition, \(d_i = |x_i - s|^2\), says that the discrepancy is the squared distance between \(x_i\) and \(s\). This is often called a squared error in machine learning. Note that this definition implies that the discrepancy should increase super-linearly as \(s\) gets further and further from \(x_i\). For example, if \(x_i = 1\) and \(s = 0\), then the discrepancy is \(1\). But if \(x_i = 2\) and \(s = 0\), then the discrepancy is \(4\). This is reflected in the far-right column of the image above.

When we consider a list with a single element, \((x_1)\), these definitions all suggest that we should choose the same number: namely, \(s = x_1\).

Aggregating Discrepancies

Although these definitions do not differ for a list with a single element, they suggest using very different summaries of a list with more than one number in it. To see why, let’s first assume that we’ll aggregate the discrepancy between \(x_i\) and \(s\) for each of the \(x_i\) into a single summary of the quality of a proposed value of \(s\). To perform this aggregation, we’ll sum up the discrepancies over each of the \(x_i\) and call the result \(E\).

In that case, our three definitions give three interestingly different possible definitions of the typical discrepancy, which we’ll call \(E\) for error:
$$
E_0 = \sum_{i} |x_i - s|^0.
$$

$$
E_1 = \sum_{i} |x_i - s|^1.
$$

$$
E_2 = \sum_{i} |x_i - s|^2.
$$

When we write down these expressions in isolation, they don’t look very different. But if we select \(s\) to minimize each of these three types of errors, we get very different numbers. And, surprisingly, each of these three numbers will be very familiar to us.

Minimizing Aggregate Discrepancies

For example, suppose that we try to find \(s_0\) that minimizes the zero-one loss definition of the error of a single number summary. In that case, we require that,
$$
s_0 = \arg \min_{s} \sum_{i} |x_i - s|^0.
$$
What value should \(s_0\) take on? If you give this some extended thought, you’ll discover two things: (1) there is not necessarily a single best value of \(s_0\), but potentially many different values; and (2) each of these best values is one of the modes of the \(x_i\).

In other words, the best single number summary of a set of numbers, when you use exact equality as your metric of error, is one of the modes of that set of numbers.

What happens if we consider some of the other definitions? Let’s start by considering \(s_1\):
$$
s_1 = \arg \min_{s} \sum_{i} |x_i - s|^1.
$$
Unlike \(s_0\), \(s_1\) is a unique number: it is the median of the \(x_i\). That is, the best summary of a set of numbers, when you use absolute differences as your metric of error, is the median of that set of numbers.

Since we’ve just found that the mode and the median appear naturally, we might wonder if other familiar basic statistics will appear. Luckily, they will. If we look for,
$$
s_2 = \arg \min_{s} \sum_{i} |x_i - s|^2,
$$
we’ll find that, like \(s_1\), \(s_2\) is again a unique number. Moreover, \(s_2\) is the mean of the \(x_i\). That is, the best summary of a set of numbers, when you use squared differences as your metric of error, is the mean of that set of numbers.

To sum up, we’ve just seen that the three most famous single number summaries of a data set are very closely related: they all minimize the average discrepancy between \(s\) and the numbers being summarized. They only differ in the type of discrepancy being considered:

  1. The mode minimizes the number of times that one of the numbers in our summarized list is not equal to the summary that we use.
  2. The median minimizes the average distance between each number and our summary.
  3. The mean minimizes the average squared distance between each number and our summary.

In equations (checked numerically in the sketch that follows this list),

  1. \(\text{The mode of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^0\)
  2. \(\text{The median of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^1\)
  3. \(\text{The mean of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^2\)
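
These three claims are easy to check numerically. The following R sketch (purely illustrative) evaluates each error over a coarse grid of candidate summaries and reports the minimizer for \(p = 0\), \(1\) and \(2\):

x <- c(1, 2, 2, 3, 7)

total_error <- function(s, p) {
  if (p == 0) {
    sum(x != s)  # zero-one loss, respecting the convention |0|^0 = 0
  } else {
    sum(abs(x - s)^p)
  }
}

# Half-unit steps are exactly representable, so the equality test above is safe
candidates <- seq(0, 10, by = 0.5)

candidates[which.min(sapply(candidates, total_error, p = 0))]  # 2, a mode of x
candidates[which.min(sapply(candidates, total_error, p = 1))]  # 2, the median of x
candidates[which.min(sapply(candidates, total_error, p = 2))]  # 3, the mean of x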

Summary

We’ve just seen that the mode, median and mean all arise from a simple parametric process in which we try to minimize the average discrepancy between a single number \(s\) and a list of numbers, \(x_1, x_2, \ldots, x_n\) that we try to summarize using \(s\). In a future blog post, I’ll describe how the ideas we’ve just introduced relate to the concept of \(L_p\) norms. Thinking about minimizing \(L_p\) norms is a generalization of taking modes, medians and means that leads to almost every important linear method in statistics — ranging from linear regression to the SVD.

Thanks

Thanks to Sean Taylor for reading a draft of this post and commenting on it.

Writing Better Statistical Programs in R

A while back a friend asked me for advice about speeding up some R code that they’d written. Because they were running an extensive Monte Carlo simulation of a model they’d been developing, the poor performance of their code had become an impediment to their work.

After I looked through their code, it was clear that the performance hurdles they were stumbling upon could be overcome by adopting a few best practices for statistical programming. This post tries to describe some of the simplest best practices for statistical programming in R. Following these principles should make it easier for you to write statistical programs that are both highly performant and correct.

Write Out a DAG

Whenever you’re running a simulation study, you should appreciate the fact that you are working with a probabilistic model. Even if you are primarily focused upon the deterministic components of this model, the presence of any randomness in the model means that all of the theory of probabilistic models applies to your situation.

Almost certainly the most important concept in probabilistic modeling when you want to write efficient code is the notion of conditional independence. Conditional independence is important because many probabilistic models can be decomposed into simple pieces that can be computed in isolation. Although your model contains many variables, any one of these variables may depend upon only a few other variables in your model. If you can organize all of the variables in your model based on their dependencies, it will be easier to exploit two computational tricks: vectorization and parallelization.

Let’s go through an example. Imagine that you have the model shown below:

$$
X \sim \text{Normal}(0, 1)
$$

$$
Y1 \sim \text{Uniform}(X, X + 1)
$$

$$
Y2 \sim \text{Uniform}(X - 1, X)
$$

$$
Z \sim \text{Cauchy}(Y1 + Y2, 1)
$$

In this model, the distribution of Y1 and Y2 depends only on the value of X. Similarly, the distribution of Z depends only on the values of Y1 and Y2. We can formalize this notion using a DAG, which is a directed acyclic graph that depicts which variables depend upon which other variables. It will help you appreciate the value of this format if you think of the arrows in the DAG below as indicating the flow of causality:


[Figure: DAG for the example model, with arrows from X to Y1 and Y2, and from Y1 and Y2 to Z]

Having this DAG drawn out for your model will make it easier to write efficient code, because you can generate all of the values of a variable V simultaneously once you’ve computed the values of the variables that V depends upon. In our example, you can generate the values of X for all of your different simulations at once and then generate all of the Y1’s and Y2’s based on the values of X that you generate. You can then exploit this stepwise generation procedure to vectorize and parallelize your code. I’ll discuss vectorization to give you a sense of how to exploit the DAG we’ve drawn to write faster code.

Vectorize Your Simulations

Sequential dependencies are a major bottleneck in languages like R and Matlab that cannot perform loops efficiently. Looking at the DAG for the model shown above, you might think that you can’t get around writing a “for” loop to generate samples of this model because some of the variables need to be generated before others.

But, in reality, each individual sample from this model is independent of all of the others. As such, you can draw all of the X’s for all of your different simulations using vectorized code. Below I show how this model could be implemented using loops and then show how this same model could be implemented using vectorized operations:

Loop Code

run.sims <- function(n.sims)
{
	results <- data.frame()
 
	for (sim in 1:n.sims)
	{
		x <- rnorm(1, 0, 1)
		y1 <- runif(1, x, x + 1)
		y2 <- runif(1, x - 1, x)
		z <- rcauchy(1, y1 + y2, 1)
		results <- rbind(results, data.frame(X = x, Y1 = y1, Y2 = y2, Z = z))
	}
 
	return(results)
}
 
b <- Sys.time()
run.sims(5000)
e <- Sys.time()
e - b

Vectorized Code

run.sims <- function(n.sims)
{
	x <- rnorm(n.sims, 0, 1)
	y1 <- runif(n.sims, x, x + 1)
	y2 <- runif(n.sims, x - 1, x)
	z <- rcauchy(n.sims, y1 + y2, 1)
	results <- data.frame(X = x, Y1 = y1, Y2 = y2, Z = z)
 
	return(results)
}
 
b <- Sys.time()
run.sims(5000)
e <- Sys.time()
e - b

The performance gains for this example are substantial when you move from the naive loop code to the vectorized code. (NB: There are also some gains from avoiding the repeated calls to rbind, although they are less important than one might think in this case.)

We could go further and parallelize the vectorized code, but this can be tedious to do in R.
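
That said, here is one rough sketch of what parallelization might look like using the parallel package that ships with R. It reuses the vectorized run.sims defined above; the function and argument names are mine, and because mclapply forks processes this approach works on Linux and macOS but not on Windows:

library(parallel)

run.sims.parallel <- function(n.sims, n.cores = 2)
{
	# Split the simulations evenly across cores (assumes n.sims is divisible
	# by n.cores) and run the vectorized simulator on each chunk independently.
	chunk.sizes <- rep(n.sims %/% n.cores, n.cores)
	chunks <- mclapply(chunk.sizes, run.sims, mc.cores = n.cores)
	return(do.call(rbind, chunks))
}

run.sims.parallel(5000)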

The Data Generation / Model Fitting Cycle

Vectorization can make code in languages like R much more efficient. But speed is useless if you’re not generating correct output. For me, the essential test of correctness for a probabilistic model only becomes clear after I’ve written two complementary functions:

  1. A data generation function that produces samples from my model. We can call this function generate. The arguments to generate are the parameters of my model.
  2. A model fitting function that estimates the parameters of my model based on a sample of data. We can call this function fit. The arguments to fit are the data points we generated using generate.

The value of these two functions is that they can be set up to feed back into one another in the cycle shown below:


[Figure: the data generation / model fitting cycle linking generate and fit]

I feel confident in the quality of statistical code when these functions interact stably. If the parameters inferred in a single pass through this loop are close to the original inputs, then my code is likely to work correctly. This amounts to a specific instance of the following design pattern:

data <- generate(model, parameters)
inferred.parameters <- fit(model, data)
reliability <- error(model, parameters, inferred.parameters)

To see this pattern in action, let’s step through a process of generating data from a normal distribution and then fitting a normal to the data we generate. You can think of this as a form of “currying” in which we hardcode the value of the model parameter:

n.sims <- 100
n.obs <- 100
 
generate.normal <- function(parameters)
{
	return(rnorm(n.obs, parameters[1], parameters[2]))
}
 
fit.normal <- function(data)
{
	return(c(mean(data), sd(data)))
}
 
distance <- function(true.parameters, inferred.parameters)
{
	return((true.parameters - inferred.parameters)^2)
}
 
reliability <- data.frame()
 
for (sim in 1:n.sims)
{
	parameters <- c(runif(1), runif(1))
	data <- generate.normal(parameters)
	inferred.parameters <- fit.normal(data)
	recovery.error <- distance(parameters, inferred.parameters)
	reliability <- rbind(reliability,
		                 data.frame(True1 = parameters[1],
		                 	        True2 = parameters[2],
		                 	        Inferred1 = inferred.parameters[1],
		                 	        Inferred2 = inferred.parameters[2],
							        Error1 = recovery.error[1],
							        Error2 = recovery.error[2]))
}

If you generate data this way, you will see that our inference code is quite reliable. And you can see that it becomes better if we set n.obs to a larger value like 100,000.

I expect this kind of performance from all of my statistical code. I can’t trust the quality of either generate or fit until I see that they play well together. It is their mutual coherence that inspires faith.

General Lessons

Speed

When writing code in R, you can improve performance by searching for every possible location in which vectorization is possible. Vectorization essentially replaces R’s loops (which are not efficient) with C’s loops (which are efficient) because the computations in a vectorized call are almost always implemented in a language other than R.
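
As a rough illustration of the gap, compare an explicit R loop with the built-in vectorized sum, which is implemented in C:

x <- rnorm(1000000)

loop.sum <- function(v)
{
	total <- 0
	for (i in seq_along(v))
	{
		total <- total + v[i]
	}
	return(total)
}

system.time(loop.sum(x))  # explicit R-level loop
system.time(sum(x))       # vectorized sum implemented in C; far faster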

Correctness

When writing code for model fitting in any language, you should always ensure that your code can infer the parameters of models when given simulated data with known parameter values.

Americans Live Longer and Work Less

Today I saw an article on Hacker News entitled, “America’s CEOs Want You to Work Until You’re 70”. I was particularly surprised by this article appearing out of the blue because I take it for granted that America will eventually have to raise the retirement age to avoid bankruptcy. After reading the article, I wasn’t able to figure out why the story had been run at all. So I decided to do some basic fact-checking.

I tracked down some time series data about life expectancies in the U.S. from Berkeley and then found some time series data about the average age at retirement from the OECD. Plotting just these two bits of information, as shown below, makes it clear that Americans are spending a larger proportion of their life in retirement.


[Figure: U.S. life expectancy and average retirement age over time]

Perhaps I’m just naive, but it seems obvious to me that we can’t afford to take on several additional years of retirement pension liabilities for every living American. If Americans are living longer, we will need them to work longer in order to pay our bills.

Symbolic Differentiation in Julia

A Brief Introduction to Metaprogramming in Julia

In contrast to my previous post, which described one way in which Julia allows (and expects) the programmer to write code that directly employs the atomic operations offered by computers, this post is meant to introduce newcomers to some of Julia’s higher level functions for metaprogramming. To make metaprogramming more interesting, we’re going to build a system for symbolic differentiation in Julia.

Like Lisp, the Julia interpreter represents Julian expressions using normal data structures: every Julian expression is represented using an object of type Expr. You can see this by typing something like :(x + 1) into the Julia REPL:

julia> :(x + 1)
:(+(x,1))
 
julia> typeof(:(x+1))
Expr

Looking at the REPL output when we enter an expression quoted using the : operator, we can see that Julia has rewritten our input expression, originally written using infix notation, as an expression that uses prefix notation. This standardization to prefix notation makes it easier to work with arbitrary expressions because it removes a needless source of variation in the format of expressions.

To develop an intuition for what this kind of expression means to Julia, we can use the dump function to examine its contents:

julia> dump(:(x + 1))
Expr 
  head: Symbol call
  args: Array(Any,(3,))
    1: Symbol +
    2: Symbol x
    3: Int64 1
  typ: Any

Here you can see that a Julian expression consists of three parts:

  1. A head symbol, which describes the basic type of the expression. For this blog post, all of the expressions we’ll work with have head equal to :call.
  2. An Array{Any} that contains the arguments of the head. In our example, the head is :call, which indicates a function call is being made in this expression. The arguments for the function call are:
    1. :+, the symbol denoting the addition function that we are calling.
    2. :x, the symbol denoting the variable x
    3. 1, the number 1 represented as a 64-bit integer.
  3. A typ which stores type inference information. We’ll ignore this information as it’s not relevant to us right now.

Because each expression is built out of normal components, we can construct one piecemeal:

julia> Expr(:call, {:+, 1, 1}, Any)
:(+(1,1))

Because this expression only depends upon constants, we can immediately evaluate it using the eval function:

julia> eval(Expr(:call, {:+, 1, 1}, Any))
2

Symbolic Differentiation in Julia

Now that we know how Julia expressions are built, we can design a very simple prototype system for doing symbolic differentiation in Julia. We’ll build up our system in pieces using some of the most basic rules of calculus:

  1. The Constant Rule: d/dx c = 0
  2. The Symbol Rule: d/dx x = 1, d/dx y = 0
  3. The Sum Rule: d/dx (f + g) = (d/dx f) + (d/dx g)
  4. The Subtraction Rule: d/dx (f - g) = (d/dx f) - (d/dx g)
  5. The Product Rule: d/dx (f * g) = (d/dx f) * g + f * (d/dx g)
  6. The Quotient Rule: d/dx (f / g) = [(d/dx f) * g - f * (d/dx g)] / g^2

Implementing these operations is quite easy once you understand the data structure Julia uses to represent expressions. And some of these operations would be trivial regardless.

For example, here’s the Constant Rule in Julia:

differentiate(x::Number, target::Symbol) = 0

And here’s the Symbol rule:

function differentiate(s::Symbol, target::Symbol)
    if s == target
        return 1
    else
        return 0
    end
end

The first two rules of calculus don’t actually require us to understand anything about Julian expressions. But the interesting parts of a symbolic differentiation system do. To see that, let’s look at the Sum Rule:

function differentiate_sum(ex::Expr, target::Symbol)
    n = length(ex.args)
    new_args = Array(Any, n)
    new_args[1] = :+
    for i in 2:n
        new_args[i] = differentiate(ex.args[i], target)
    end
    return Expr(:call, new_args, Any)
end

The Subtraction Rule can be defined almost identically:

function differentiate_subtraction(ex::Expr, target::Symbol)
    n = length(ex.args)
    new_args = Array(Any, n)
    new_args[1] = :-
    for i in 2:n
        new_args[i] = differentiate(ex.args[i], target)
    end
    return Expr(:call, new_args, Any)
end

The Product Rule is a little more interesting because we need to build up an expression whose components are themselves expressions:

function differentiate_product(ex::Expr, target::Symbol)
    n = length(ex.args)
    res_args = Array(Any, n)
    res_args[1] = :+
    for i in 2:n
       new_args = Array(Any, n)
       new_args[1] = :*
       for j in 2:n
           if j == i
               new_args[j] = differentiate(ex.args[j], target)
           else
               new_args[j] = ex.args[j]
           end
       end
       res_args[i] = Expr(:call, new_args, Any)
    end
    return Expr(:call, res_args, Any)
end

Last, but not least, here’s the Quotient Rule, which is a little more complex. We can code this rule up in a more explicit fashion that doesn’t use any loops so that we can directly see the steps we’re taking:

function differentiate_quotient(ex::Expr, target::Symbol)
    return Expr(:call,
                {
                    :/,
                    Expr(:call,
                         {
                            :-,
                            Expr(:call,
                                 {
                                    :*,
                                    differentiate(ex.args[2], target),
                                    ex.args[3]
                                 },
                                 Any),
                            Expr(:call,
                                 {
                                    :*,
                                    ex.args[2],
                                    differentiate(ex.args[3], target)
                                 },
                                 Any)
                         },
                         Any),
                    Expr(:call,
                         {
                            :^,
                            ex.args[3],
                            2
                         },
                         Any)
                },
                Any)
end

Now that we have all of those basic rules of calculus implemented as functions, we’ll build up a lookup table that we can use to tell our final differentiate function where to send new expressions based on the kind of function that’s being differentiated during each call to differentiate:

differentiate_lookup = {
                          :+ => differentiate_sum,
                          :- => differentiate_subtraction,
                          :* => differentiate_product,
                          :/ => differentiate_quotient
                       }

With all of the core machinery in place, the final definition of differentiate is very simple:

function differentiate(ex::Expr, target::Symbol)
    if ex.head == :call
        if has(differentiate_lookup, ex.args[1])
            return differentiate_lookup[ex.args[1]](ex, target)
        else
            error("Don't know how to differentiate $(ex.args[1])")
        end
    else
        return differentiate(ex.head, target)
    end
end

I’ve put all of these snippets together in a single GitHub Gist. To try out this new differentiation function, let’s copy the contents of that GitHub gist into a file called differentiate.jl. We can then load the contents of that file into Julia at the REPL using include, which will allow us to try out our differentiation tool:

julia> include("differentiate.jl")
 
julia> differentiate(:(x + x*x), :x)
:(+(1,+(*(1,x),*(x,1))))
 
julia> differentiate(:(x + a*x), :x)
:(+(1,+(*(0,x),*(a,1))))

While the expressions that are constructed by our differentiate function are ugly, they are correct: they just need to be simplified so that things like *(0, x) are replaced with 0. If you’d like to see how to write code to perform some basic simplifications, you can see the simplify function I’ve been building for Julia’s new Calculus package. That codebase includes all of the functionality shown here for differentiate, along with several other rules that make the system more powerful.

What I love about Julia is the ease with which one can move from low-level bit operations like those described in my previous post to high-level operations that manipulate Julian expressions. By allowing the programmer to manipulate expressions programmatically, Julia has copied one of the most beautiful parts of Lisp.

Computers are Machines

When people try out Julia for the first time, many of them are worried by the following example:

julia> factorial(n) = n == 0 ? 1 : n * factorial(n - 1)
 
julia> factorial(20)
2432902008176640000
 
julia> factorial(21)
-4249290049419214848

If you’re not familiar with computer architecture, this result is very troubling. Why would Julia claim that the factorial of 21 is a negative number?

The answer is simple, but depends upon a set of concepts that are largely unfamiliar to programmers who, like me, grew up using modern languages like Python and Ruby. Julia thinks that the factorial of 21 is a negative number because computers are machines.

Because they are machines, computers represent numbers using many small groups of bits. Most modern machines work with groups of 64 bits at a time. If an operation has to work with more than 64 bits at a time, that operation will be slower than a similar operation that only works with 64 bits at a time.

As a result, if you want to write fast computer code, it helps to only execute operations that are easily expressible using groups of 64 bits.

Arithmetic involving small integers fits into the category of operations that only require 64 bits at a time. Every integer between -9223372036854775808 and 9223372036854775807 can be expressed using just 64 bits. You can see this for yourself by using the typemin and typemax functions in Julia:

julia> typemin(Int64)
-9223372036854775808
 
julia> typemax(Int64)
9223372036854775807

If you do things like the following, the computer will quickly produce correct results:

julia> typemin(Int64) + 1
-9223372036854775807
 
julia> typemax(Int64) - 1
9223372036854775806

But things go badly if you try to break out of the range of numbers that can be represented using only 64 bits:

julia> typemin(Int64) - 1
9223372036854775807
 
julia> typemax(Int64) + 1
-9223372036854775808

The reasons for this are not obvious at first, but make more sense if you examine the actual bits being operated upon:

julia> bits(typemax(Int64))
"0111111111111111111111111111111111111111111111111111111111111111"
 
julia> bits(typemax(Int64) + 1)
"1000000000000000000000000000000000000000000000000000000000000000"
 
julia> bits(typemin(Int64))
"1000000000000000000000000000000000000000000000000000000000000000"

When it adds 1 to a number, the computer blindly uses a simple arithmetic rule for individual bits that works just like the carry system you learned as a child. This carrying rule is very efficient, but works poorly if you end up flipping the very first bit in a group of 64 bits. The reason is that this first bit represents the sign of an integer. When this special first bit gets flipped by an operation that overflows the space provided by 64 bits, everything else breaks down.

The special interpretation given to certain bits in a group of 64 is the reason that factorial of 21 is a negative number when Julia computes it. You can confirm this by looking at the exact bits involved:

julia> bits(factorial(20))
"0010000111000011011001110111110010000010101101000000000000000000"
 
julia> bits(factorial(21))
"1100010100000111011111010011011010111000110001000000000000000000"

Here, as before, the computer has just executed the operations necessary to perform multiplication by 21. But the result has flipped the sign bit, which causes the result to appear to be a negative number.

There is a way around this: you can tell Julia to work with groups of more than 64 bits at a time when expressing integers using the BigInt type:

julia> require("BigInt")
 
julia> BigInt(typemax(Int))
9223372036854775807
 
julia> BigInt(typemax(Int)) + 1
9223372036854775808
 
julia> BigInt(factorial(20)) * 21
51090942171709440000

Now everything works smoothly. By working with BigInts automatically, languages like Python avoid these concerns:

>>> factorial(20)
2432902008176640000
>>> factorial(21)
51090942171709440000L

The L at the end of the numbers here indicates that Python has automatically converted a normal integer into something like Julia’s BigInt. But this automatic conversion comes at a substantial cost: every operation that stays within the bounds of 64-bit arithmetic is slower in Python than Julia because of the time required to check whether an operation might go beyond the 64-bit bound.

Python’s automatic conversion approach is safer, but slower. Julia’s approach is faster, but requires that the programmer understand more about the computer’s architecture. Julia achieves its performance by confronting the fact that computers are machines head on. This is confusing at first and frustrating at times, but it’s a price that you have to pay for high performance computing. Everyone who grew up with C is used to these issues, but they’re largely unfamiliar to programmers who grew up with modern languages like Python. In many ways, Julia sets itself apart from other new languages by its attempt to recover some of the power that was lost in the transition from C to languages like Python. But the transition comes with a substantial learning curve.

And that’s why I wrote this post.

What is Correctness for Statistical Software?

Introduction

A few months ago, Drew Conway and I gave a webcast that tried to teach people about the basic principles behind linear and logistic regression. To illustrate logistic regression, we worked through a series of progressively more complex spam detection problems.

The simplest data set we used was the following:


[Figure: toy spam classification data set based on MentionsViagra and MentionsNigeria]

This data set has one clear virtue: the correct classifier defines a decision boundary that implements a simple OR operation on the values of MentionsViagra and MentionsNigeria. Unfortunately, that very simplicity causes the logistic regression model to break down, because the MLE coefficients for MentionsViagra and MentionsNigeria should be infinite. In some ways, our elegantly simple example for logistic regression is actually the statistical equivalent of a SQL injection.
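
To make the problem concrete, here is a small illustrative data set of this kind (not the data from the webcast; the rows and the IsSpam column name are made up for illustration). Fitting it with glm will typically reproduce the warning discussed below, and the reported coefficients are very large because the likelihood keeps improving as they grow:

spam.data <- data.frame(MentionsViagra = c(0, 0, 1, 1, 0, 1),
                        MentionsNigeria = c(0, 1, 0, 1, 1, 0),
                        IsSpam = c(0, 1, 1, 1, 1, 1))

# IsSpam is exactly MentionsViagra OR MentionsNigeria, so the data are
# perfectly separable and the true MLE coefficients are infinite.
fit <- glm(IsSpam ~ MentionsViagra + MentionsNigeria,
           data = spam.data,
           family = binomial(link = "logit"))

coef(fit)  # huge finite coefficients standing in for infinite MLEs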

In our webcast, Drew and I decided to ignore that concern because R produces a useful model fit despite the theoretical MLE coefficients being infinite:


[Figure: R’s summary output for the logistic regression fit to the toy classification data]

Although R produces finite coefficients here despite theory telling us to expect something else, I should note that R does produce a somewhat cryptic warning during the model fitting step that alerts the very well-informed user that something has gone awry:

glm.fit: fitted probabilities numerically 0 or 1 occurred

It seems clear to me that R’s warning would be better off if it were substantially more verbose:

Warning from glm.fit():
 
Fitted probabilities could not be distinguished from 0's or 1's 
under finite precision floating point arithmetic. As a result, the 
optimization algorithm for GLM fitting may have failed to converge.
You should check whether your data set is linearly separable.

Broader Questions

Although I’ve started this piece with a very focused example of how R’s implementation of logistic regression differs from the purely mathematical definition of that model, I’m not really that interested in the details of how different pieces of software implement logistic regression. If you’re interested in learning more about that kind of thing, I’d suggest reading the excellent piece on R’s logistic regression function that can be found on the Win-Vector blog.

Instead, what interests me right now are a set of broader questions about how statistical software should work. What is the standard for correctness for statistical software? And what is the standard for usefulness? And how closely related are those two criteria?

Let’s think about each of them separately:

  • Usefulness: If you want to simply make predictions based on your model, then you want R to produce a fitted model for this data set that makes reasonably good predictions on the training data. R achieves that goal: the fitted predictions for R’s logistic regression model are numerically almost indistinguishable from the 0/1 values that we would expect from a maximum likelihood algorithm. If you want useful algorithms, then R’s decision to produce some model fit is justified.
  • Correctness: If you want software to either produce mathematically correct answers or to die trying, then R’s implementation of logistic regression is not for you. If you insist on theoretical purity, it seems clear that R should not merely emit a warning here, but should instead throw an inescapable error rather than return an imperfect model fit. You might even want R to go further and to teach the end-user about the virtues of SVMs or the general usefulness of parameter regularization. Whatever you’d like to see, one thing is sure: you definitely do not want R to produce model fits that are mathematically incorrect.

It’s remarkable that such a simple example can bring the goals of predictive power and theoretical correctness into such direct opposition. In part, the conflict arises because a third consideration intrudes on these purely theoretical concerns: computer algorithms are not generally equivalent to their mathematical idealizations. Purely computational concerns involving floating-point imprecision and finite compute time mean that we cannot generally hope for computers to produce exactly the answers prescribed by theoretical mathematics.

What’s fascinating about this specific example is that there’s something strangely desirable about floating-point numbers having finite precision: no one with any practical interest in modeling is likely to be interested in fitting a model with infinite-valued parameters. R’s decision to blindly run an optimization algorithm here unwittingly achieves a form of regularization like that employed in early stopping algorithms for fitting neural networks. And that may be a good thing if you’re interested in using a fitted model to make predictions, even though it means that R produces quantities like standard errors that have no real coherent interpretation in terms of frequentist estimators.

Whatever your take is on the virtues or vices of R’s implementation of logistic regression, there’s a broad takeaway from this example that I’ve been dealing with constantly while working on Julia: any programmer designing statistical software has to make decisions that involve personal judgment. The requirement for striking a compromise between correctness and usefulness is so nearly omnipresent that one of the most popular pieces of statistical software on Earth implements logistic regression using an algorithm that a pure theorist could argue is basically broken. But it produces an answer that has practical value. And that might just be the more important thing for statistical software to do.

What is Economics Studying?

Having spent all five of my years as a graduate student trying to get psychologists and economists to agree on basic ideas about decision-making, I think the following two pieces complement one another perfectly:

  • Cosma Shalizi’s comments on rereading Blanchard and Fischer’s “Lectures on Macroeconomics”:

    Blanchard and Fischer is about “modern” macro, models based on agents who know what the economy is like optimizing over time, possibly under some limits. This is the DSGE style of macro, which has lately come into so much discredit — thoroughly deserved discredit. Chaikin and Lubensky is about modern condensed matter physics, especially soft condensed matter, based on principles of symmetry-breaking and phase transitions. Both books are about building stylized theoretical models and solving them to see what they imply; implicitly they are also about the considerations which go into building models in their respective domains.

    What is very striking, looking at them side by side, is that while these are both books about mathematical modeling, Chaikin and Lubensky presents empirical data, compares theoretical predictions to experimental results, and goes into some detail into the considerations which lead to this sort of model for nematic liquid crystals, or that model for magnetism. There is absolutely nothing like this in Blanchard and Fischer — no data at all, no comparison of models to reality, no evidence of any kind supporting any of the models. There is not even an attempt, that I can find, to assess different macroeconomic models, by comparing their qualitative predictions to each other and to historical reality. I presume that Blanchard and Fischer, as individual scholars, are not quite so indifferent to reality, but their pedagogy is.

    I will leave readers to draw their own morals.

  • Itzhak Gilboa’s argument that economic theory is a rhetorical apparatus rather than a set of direct predictions about the world in which we live.