Writing Type-Stable Code in Julia

For many of the people I talk to, Julia’s main appeal is speed. But achieving peak performance in Julia requires that programmers absorb a few subtle concepts that are generally unfamiliar to users of dynamically typed languages.

One particularly subtle performance pitfall is the need to write type-stable code. Code is said to be type-stable if the type of every variable does not vary over time. To clarify this idea, consider the following two closely related function definitions:

function sumofsins1(n::Integer)
    r = 0
    for i in 1:n
        r += sin(3.4)
    end
    return r
end

function sumofsins2(n::Integer)
    r = 0.0
    for i in 1:n
        r += sin(3.4)
    end
    return r
end


The only difference between these function definitions is that sumofsins1 initializes r to 0, whereas sumofsins2 initializes r to 0.0.

This seemingly minor distinction has important practical implications because the initialization of r to 0 means that the main loop of sumofsins1 begins with a single iteration in which the computer adds 0 to sin(3.4). This single addition step transforms the type of r from Int, which is the type of 0, to Float64, which is the type of sin(3.4). This means that the type of r is not stable over the course of this loop.

This instability has considerable effects on the performance of sumofsins1. To see this, let’s run some naive benchmarks. As always in Julia, we’ll start with a dry run to get the JIT to compile the functions being compared:

sumofsins1(100_000)
sumofsins2(100_000)

@time [sumofsins1(100_000) for i in 1:100];
@time [sumofsins2(100_000) for i in 1:100];


The results of this timing comparison are quite striking:

julia> @time [sumofsins1(100_000) for i in 1:100];
elapsed time: 0.412261722 seconds (320002496 bytes allocated)

julia> @time [sumofsins2(100_000) for i in 1:100];
elapsed time: 0.008509995 seconds (896 bytes allocated)


As you can see, the type-unstable code in sumofsins1 is 50x slower than the type-stable code. What might have seemed like a nitpicky point about the initial value of r has enormous performance implications.

To understand the reasons for this huge performance gap, it’s worth considering what effect type-instability has on the compiler. In this case, the compiler can’t optimize the contents of the main loop of sumofsins1 because it can’t be certain that the type of r will remain invariant throughout the entire loop. Without this crucial form of invariance, the compiler has to check the type of r on every iteration of the loop, which is a much more intensive computation than repeatedly adding a constant value to a Float64.

You can confirm for yourself that the compiler produces more complex code by examining the LLVM IR for both of these functions.

First, we’ll examine the LLVM IR for sumofsins1:

julia> code_llvm(sumofsins1, (Int, ))

define %jl_value_t* @julia_sumofsins11067(i64) {
top:
%1 = alloca [5 x %jl_value_t*], align 8
%.sub = getelementptr inbounds [5 x %jl_value_t*]* %1, i64 0, i64 0
%2 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 2, !dbg !5145
store %jl_value_t* inttoptr (i64 6 to %jl_value_t*), %jl_value_t** %.sub, align 8
%3 = load %jl_value_t*** @jl_pgcstack, align 8, !dbg !5145
%4 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 1, !dbg !5145
%.c = bitcast %jl_value_t** %3 to %jl_value_t*, !dbg !5145
store %jl_value_t* %.c, %jl_value_t** %4, align 8, !dbg !5145
store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8, !dbg !5145
%5 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 3
store %jl_value_t* null, %jl_value_t** %5, align 8
%6 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 4
store %jl_value_t* null, %jl_value_t** %6, align 8
store %jl_value_t* inttoptr (i64 140379580131904 to %jl_value_t*), %jl_value_t** %2, align 8, !dbg !5150
%7 = icmp slt i64 %0, 1, !dbg !5151
br i1 %7, label %L2, label %pass, !dbg !5151

pass:                                             ; preds = %top, %pass
%8 = phi %jl_value_t* [ %13, %pass ], [ inttoptr (i64 140379580131904 to %jl_value_t*), %top ]
%"#s6.03" = phi i64 [ %14, %pass ], [ 1, %top ]
store %jl_value_t* %8, %jl_value_t** %5, align 8, !dbg !5152
%9 = call %jl_value_t* @alloc_2w(), !dbg !5152
%10 = getelementptr inbounds %jl_value_t* %9, i64 0, i32 0, !dbg !5152
store %jl_value_t* inttoptr (i64 140379580056656 to %jl_value_t*), %jl_value_t** %10, align 8, !dbg !5152
%11 = getelementptr inbounds %jl_value_t* %9, i64 1, i32 0, !dbg !5152
%12 = bitcast %jl_value_t** %11 to double*, !dbg !5152
store double 0xBFD05AC910FF4C6C, double* %12, align 8, !dbg !5152
store %jl_value_t* %9, %jl_value_t** %6, align 8, !dbg !5152
%13 = call %jl_value_t* @jl_apply_generic(%jl_value_t* inttoptr (i64 140379586379936 to %jl_value_t*), %jl_value_t** %5, i32 2), !dbg !5152
store %jl_value_t* %13, %jl_value_t** %2, align 8, !dbg !5152
%14 = add i64 %"#s6.03", 1, !dbg !5152
%15 = icmp sgt i64 %14, %0, !dbg !5151
br i1 %15, label %L2, label %pass, !dbg !5151

L2:                                               ; preds = %pass, %top
%.lcssa = phi %jl_value_t* [ inttoptr (i64 140379580131904 to %jl_value_t*), %top ], [ %13, %pass ]
%16 = load %jl_value_t** %4, align 8, !dbg !5153
%17 = getelementptr inbounds %jl_value_t* %16, i64 0, i32 0, !dbg !5153
store %jl_value_t** %17, %jl_value_t*** @jl_pgcstack, align 8, !dbg !5153
ret %jl_value_t* %.lcssa, !dbg !5153
}


Then we’ll examine the LLVM IR for sumofsins2:

julia> code_llvm(sumofsins2, (Int, ))

define double @julia_sumofsins21068(i64) {
top:
%1 = icmp slt i64 %0, 1, !dbg !5151
br i1 %1, label %L2, label %pass, !dbg !5151

pass:                                             ; preds = %top, %pass
%"#s6.04" = phi i64 [ %3, %pass ], [ 1, %top ]
%r.03 = phi double [ %2, %pass ], [ 0.000000e+00, %top ]
%2 = fadd double %r.03, 0xBFD05AC910FF4C6C, !dbg !5156
%3 = add i64 %"#s6.04", 1, !dbg !5156
%4 = icmp sgt i64 %3, %0, !dbg !5151
br i1 %4, label %L2, label %pass, !dbg !5151

L2:                                               ; preds = %pass, %top
%r.0.lcssa = phi double [ 0.000000e+00, %top ], [ %2, %pass ]
ret double %r.0.lcssa, !dbg !5157
}


The difference in size and complexity between the compiled forms of these two functions is considerable. And this difference is entirely attributable to the compiler’s need to recheck the type of r on every iteration of the main loop in sumofsins1. That check can be optimized out in sumofsins2, where r has a stable type.

Given the potential performance impacts of type-instability, every aspiring Julia programmer needs to learn to recognize potential sources of type-instability in their own code. Future versions of Julia may be configured to issue warnings when type-unstable code is encountered, but, for now, the responsibility lies with the programmer. Thankfully, once you learn about type-stability, it becomes easy to recognize in most cases.
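Until then, two habits help. As a sketch in present-day Julia (newer than the version this post was written against, so the tooling mentioned here postdates it): initialize accumulators with `zero` of the intended type, and use the `@code_warntype` macro from InteractiveUtils to flag unstable variables:

```julia
using InteractiveUtils  # provides @code_warntype outside the REPL

# A type-stable variant of sumofsins1: zero(sin(3.4)) is a Float64 zero,
# so r keeps a single concrete type for the whole loop.
function sumofsins3(n::Integer)
    r = zero(sin(3.4))
    for i in 1:n
        r += sin(3.4)
    end
    return r
end

# Prints the inferred type of every variable; here everything is concrete.
# Running it on sumofsins1 instead would flag r as a Union type.
@code_warntype sumofsins3(10)
```

The advantage of `zero(x)` over a literal `0.0` is that the accumulator stays correctly typed even if you later change the element type being summed.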

September Talks

To celebrate my last full month on the East Coast, I’m doing a bunch of talks. If you’re interested in hearing more about Julia or statistics in general, you might want to come out to one of the events I’ll be at:

• Julia Tutorial at DataGotham: On 9/12, Stefan and I will be giving a 3-hour long, hands on Julia tutorial as part of the Thursday DataGotham activities this year. If you’re in NYC and care about data analysis, you should try to make it out to part of the event, even if you skip the tutorials.
• Online Learning Talk in NYC: On 9/17, I’ll be giving a talk on online learning at the Open Statistical Programming meetup. I’ll talk about using SGD to fit models online. This material is quite basic, but seems to be unfamiliar to a lot of people.
• Julia Talk in DC: On 9/26, I’ll be giving a quick introduction to Julia in DC at the Statistical Programming DC meetup. The goal will be to introduce people to the basics of Julia.

Hopfield Networks in Julia

As a fun side project last night, I decided to implement a basic package for working with Hopfield networks in Julia.

Since I suspect many of the readers of this blog have never seen a Hopfield net before, let me explain what they are and what they can be used for. The short-and-skinny is that Hopfield networks were invented in the 1980s to demonstrate how a network of simple neurons might learn to associate incoming stimuli with a fixed pool of existing memories. As you’ll see from the examples below, this associative ability behaves a little bit like locality-sensitive hashing.

To see how Hopfield networks work, we need to define their internal structure. For the purposes of this blog post, we’ll assume that a Hopfield network is made up of N neurons. At every point in time, this network of neurons has a simple binary state, which I’ll associate with a vector of -1’s and +1’s.

Incoming stimuli are also represented using binary vectors of length N. Every time one of these stimuli is shown to the network, the network will use a simple updating rule to modify its state. The network will keep modifying its state until it settles into a stable state, which will be one of many fixed points for the updating rule. We’ll refer to the stable state that the network reaches as the memory that the network associates with the input stimulus.
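The updating rule itself isn’t spelled out above, so here is a hedged sketch of the textbook construction: train with Hebbian learning (summing outer products of the stored patterns) and repeatedly update each neuron to the sign of its weighted input until nothing changes. The names `hebbian_weights` and `settle` are my own illustrations, not the HopfieldNets API:

```julia
# Sketch of a discrete Hopfield net with standard Hebbian learning.
# These function names are illustrative, not the HopfieldNets package API.
function hebbian_weights(patterns::Matrix{Float64})
    n = size(patterns, 1)
    W = zeros(n, n)
    for j in 1:size(patterns, 2)
        p = patterns[:, j]
        W += p * p'        # store each pattern as an outer product
    end
    for i in 1:n
        W[i, i] = 0.0      # no self-connections
    end
    return W
end

# Asynchronous updates until the state stops changing (a fixed point).
function settle(W::Matrix{Float64}, s::Vector{Float64})
    changed = true
    while changed
        changed = false
        for i in 1:length(s)
            new_si = sum(W[i, :] .* s) >= 0 ? 1.0 : -1.0
            if new_si != s[i]
                s[i] = new_si
                changed = true
            end
        end
    end
    return s
end
```

With a single stored pattern, flipping one bit of the input and calling `settle` recovers the stored pattern, which is exactly the corruption-repair behavior described below.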

For example, let’s assume that we have a network consisting of 42 neurons arranged in a 7×6 matrix. We’ll train our network to recognize the letters X and O, which will also be represented as 7×6 matrices. After training the network, we’ll present corrupted copies of the letters X and O to show that the network is able to associate corrupted stimuli with their uncorrupted memories. We’ll also show the network an uncorrupted copy of the unfamiliar letter F to see what memory it associates with an unfamiliar stimulus.

Using the HopfieldNets package, we can do this in Julia as follows:

using HopfieldNets

include(Pkg.dir("HopfieldNets", "demo", "letters.jl"))

patterns = hcat(X, O)

n = size(patterns, 1)

h = DiscreteHopfieldNet(n)

train!(h, patterns)

Xcorrupt = copy(X)
for i = 2:7
    Xcorrupt[i] = 1
end

Xrestored = associate!(h, Xcorrupt)

In the image below, I show what happens when we present X, O and F to the network after training it on the X and O patterns:

As you can see, the network perfectly recovers X and O from corrupted copies of those letters. In addition, the network associates F with an O, although the O is inverted relative to the O found in the training set. The emergence of untrained memories like this inverted O is common in Hopfield nets. To continue the analogy with LSH, you can think of the memories produced by the Hopfield net as hashes of the input, which have the property that similar inputs tend to produce similar outputs. In practice, you shouldn’t use a Hopfield net to do LSH, because the computations involved are quite costly.

Hopefully this simple example has piqued your interest in Hopfield networks. If you’d like to learn more, you can read through the code I wrote or work through the very readable presentation of the theory of Hopfield networks in David Mackay’s book on Information Theory, Inference, and Learning Algorithms.

A few days ago I disabled the comment system on this site. I’d been debating the change for some time, but reached a final decision while reading the comments on an article about a vaccine for Lyme disease.

Although this site has generally had very high quality comments, I’ve become increasingly opposed (as a matter of principle) to the use of online comment systems. My feelings mirror those of many other people who’ve deactivated comments on their sites, including Marco Arment and Matt Gemmell. As many have said before, comments tend to bring out the worst in people. The conversations that comments are ostensibly supposed to inspire now occur on Twitter and in volleys of blog posts that are traded between multiple blogs. In contrast, comment threads tend to trap the material that people either (a) don’t want to associate with their own name or (b) don’t want to take the time to write up formally. I think we have too much of both of these sorts of writing and would prefer not to encourage either.

What’s Next

The last two weeks have been full of changes for me. For those who’ve been asking about what’s next, I thought I’d write up a quick summary of all the news.

(1) I successfully defended my thesis this past Monday. Completing a Ph.D. has been a massive undertaking for the past five years, and it’s a major relief to be done. From now on I’ll be (perhaps undeservedly) making airline and restaurant reservations under the name Dr. White.

(2) As announced last week, I’ll be one of the residents at Hacker School this summer. The list of other residents is pretty amazing, and I’m really looking forward to meeting the students.

(3) In addition to my residency at Hacker School, I’ll be a temporary postdoc in the applied math department at MIT, where I’ll be working on Julia full-time. Expect to see lots of work on building up the core data analysis infrastructure.

(4) As of today I’ve accepted an offer to join Facebook’s Data Science team in the fall. I’ll be moving out to the Bay Area in November.

That’s all so far.

Using Norms to Understand Linear Regression

Introduction

In my last post, I described how we can derive modes, medians and means as three natural solutions to the problem of summarizing a list of numbers, $$(x_1, x_2, \ldots, x_n)$$, using a single number, $$s$$. In particular, we measured the quality of different potential summaries in three different ways, which led us to modes, medians and means respectively. Each of these quantities emerged from measuring the typical discrepancy between an element of the list, $$x_i$$, and the summary, $$s$$, using a formula of the form,
$$\sum_i |x_i - s|^p,$$
where $$p$$ was either $$0$$, $$1$$ or $$2$$.

The $$L_p$$ Norms

In this post, I’d like to extend this approach to linear regression. The notion of discrepancies we used in the last post is very closely tied to the idea of measuring the size of a vector in $$\mathbb{R}^n$$. Specifically, we were minimizing a measure of discrepancies that was almost identical to the $$L_p$$ family of norms that can be used to measure the size of vectors. Understanding $$L_p$$ norms makes it much easier to describe several modern generalizations of classical linear regression.

To extend our previous approach to the more standard notion of an $$L_p$$ norm, we simply take the sum we used before and rescale things by taking a $$p^{th}$$ root. This gives the formula for the $$L_p$$ norm of any vector, $$v = (v_1, v_2, \ldots, v_n)$$, as,
$$|v|_p = (\sum_i |v_i|^p)^\frac{1}{p}.$$
When $$p = 2$$, this formula reduces to the familiar formula for the length of a vector:
$$|v|_2 = \sqrt{\sum_i v_i^2}.$$

In the last post, the vector we cared about was the vector of elementwise discrepancies, $$v = (x_1 - s, x_2 - s, \ldots, x_n - s)$$. We wanted to minimize the overall size of this vector in order to make $$s$$ a good summary of $$x_1, \ldots, x_n$$. Because we were interested only in the minimum size of this vector, it didn’t matter that we skipped taking the $$p^{th}$$ root at the end, because one vector, $$v_1$$, has a smaller norm than another vector, $$v_2$$, only when the $$p^{th}$$ power of that norm is smaller than the $$p^{th}$$ power of the other. What was essential wasn’t the scale of the norm, but rather the value of $$p$$ that we chose. Here we’ll follow that approach again. Specifically, we’ll again be working consistently with the $$p^{th}$$ power of an $$L_p$$ norm:
$$|v|_p^p = (\sum_i |v_i|^p).$$
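As a quick numerical check (my own sketch, not from the original post), the norm formula translates directly into Julia:

```julia
# Lp norm of a vector, following the formula above:
# |v|_p = (sum_i |v_i|^p)^(1/p)
lp_norm(v, p) = sum(abs.(v) .^ p) ^ (1 / p)

v = [3.0, -4.0]
lp_norm(v, 1)   # 7.0
lp_norm(v, 2)   # 5.0, the familiar Euclidean length of the vector
```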

The Regression Problem

Using $$L_p$$ norms to measure the overall size of a vector of discrepancies extends naturally to other problems in statistics. In the previous post, we were trying to summarize a list of numbers by producing a simple summary statistic. In this post, we’re instead going to summarize the relationship between two lists of numbers in a form that generalizes traditional regression models.

Instead of a single list, we’ll now work with two vectors: $$(x_1, x_2, \ldots, x_n)$$ and $$(y_1, y_2, \ldots, y_n)$$. Because we like simple models, we’ll make the very strong (and very convenient) assumption that the second vector is, approximately, a linear function of the first vector, which gives us the formula:
$$y_i \approx \beta_0 + \beta_1 x_i.$$

In practice, this linear relationship is never perfect, but only an approximation. As such, for any specific values we choose for $$\beta_0$$ and $$\beta_1$$, we have to compute a vector of discrepancies: $$v = (y_1 - (\beta_0 + \beta_1 x_1), \ldots, y_n - (\beta_0 + \beta_1 x_n))$$. The question then becomes: how do we measure the size of this vector of discrepancies? By choosing different norms to measure its size, we arrive at several different forms of linear regression models. In particular, we’ll work with three norms: the $$L_0$$, $$L_1$$ and $$L_2$$ norms.

As we did with the single vector case, here we’ll define discrepancies as,
$$d_i = |y_i - (\beta_0 + \beta_1 x_i)|^p,$$
and the total error as,
$$E_p = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^p,$$
which is just the $$p^{th}$$ power of the $$L_p$$ norm.

Several Forms of Regression

In general, we want to estimate a set of regression coefficients that minimize this total error. Different forms of linear regression appear when we alter the value of $$p$$. As before, let’s consider three settings:
$$E_0 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^0$$
$$E_1 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^1$$
$$E_2 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^2$$

What happens in these settings? In the first case, we select regression coefficients so that the line passes through as many points as possible. Clearly we can always select a line that passes through any pair of points. And we can show that there are data sets in which we cannot do better. So the $$L_0$$ norm doesn’t seem to provide a very useful form of linear regression, but I’d be interested to see examples of its use.

In contrast, minimizing $$E_1$$ and $$E_2$$ define quite interesting and familiar forms of linear regression. We’ll start with $$E_2$$ because it’s the most familiar: it defines Ordinary Least Squares (OLS) regression, which is the one we all know and love. In the $$L_2$$ case, we select $$\beta_0$$ and $$\beta_1$$ to minimize,
$$E_2 = \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2,$$
which is the summed squared error over all of the $$(x_i, y_i)$$ pairs. In other words, Ordinary Least Squares regression is just an attempt to find an approximating linear relationship between two vectors that minimizes the $$L_2$$ norm of the vector of discrepancies.
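As a quick aside (my own sketch, not part of the original post), the $$E_2$$ objective is special in that its minimizer has a simple closed form, which we can write down directly in Julia:

```julia
using Statistics  # mean

# OLS sketch: the (b0, b1) minimizing sum_i (y_i - (b0 + b1 x_i))^2
# has the closed form b1 = cov-like ratio, b0 = mean(y) - b1 * mean(x).
function ols(x::Vector{Float64}, y::Vector{Float64})
    b1 = sum((x .- mean(x)) .* (y .- mean(y))) / sum((x .- mean(x)) .^ 2)
    b0 = mean(y) - b1 * mean(x)
    return b0, b1
end

ols([1.0, 2.0, 3.0, 4.0], [5.0, 7.0, 9.0, 11.0])  # (3.0, 2.0): y = 3 + 2x
```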

Although OLS regression is clearly king, the coefficients we get from minimizing $$E_1$$ are also quite widely used: using the $$L_1$$ norm defines Least Absolute Deviations (LAD) regression, which is also sometimes called Robust Regression. This approach to regression is robust because large outliers that would produce errors greater than $$1$$ are not unnecessarily augmented by the squaring operation that’s used in defining OLS regression, but instead only have their absolute values taken. This means that the resulting model will try to match the overall linear pattern in the data even when there are some very large outliers.

We can also relate these two approaches to the strategy employed in the previous post. When we use OLS regression (which would be better called $$L_2$$ regression), we predict the mean of $$y_i$$ given the value of $$x_i$$. And when we use LAD regression (which would be better called $$L_1$$ regression), we predict the median of $$y_i$$ given the value of $$x_i$$. Just as I said in the previous post, the core theoretical tool that we need to understand is the $$L_p$$ norm. For single number summaries, it naturally leads to modes, medians and means. For simple regression problems, it naturally leads to LAD regression and OLS regression. But there’s more: it also leads naturally to the two most popular forms of regularized regression.

Regularization

If you’re not familiar with regularization, the central idea is that we don’t exclusively try to find the values of $$\beta_0$$ and $$\beta_1$$ that minimize the discrepancy between $$\beta_0 + \beta_1 x_i$$ and $$y_i$$, but also simultaneously try to satisfy a competing requirement that $$\beta_1$$ not get too large. Note that we don’t try to control the size of $$\beta_0$$ because it describes the overall scale of the data rather than the relationship between $$x$$ and $$y$$.

Because these objectives compete, we have to combine them into a single objective. We do that by working with a linear sum of the two objectives. And because both the discrepancy objective and the size of the coefficients can be described in terms of norms, we’ll assume that we want to minimize the $$L_p$$ norm of the discrepancies and the $$L_q$$ norm of the $$\beta$$’s. This means that we end up trying to minimize an expression of the form,
$$(\sum_i |y_i - (\beta_0 + \beta_1 x_i)|^{p}) + \lambda (|\beta_1|^q).$$

In most regularized regression models that I’ve seen in the wild, people tend to use $$p = 2$$ and $$q = 1$$ or $$q = 2$$. When $$q = 1$$, this model is called the LASSO. When $$q = 2$$, this model is called ridge regression. In a future post, I’ll try to describe why the LASSO and ridge regression produce such different patterns of coefficients.
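As an illustration of the $$q = 2$$ case (my own sketch; the function name `ridge` and its shape are hypothetical, not from any particular library), the penalized objective still has a closed-form solution via modified normal equations:

```julia
using LinearAlgebra  # Diagonal

# Ridge regression sketch (p = 2, q = 2): minimize
#   sum_i (y_i - (b0 + b1 x_i))^2 + lam * b1^2,
# leaving the intercept b0 unpenalized, as discussed above.
# Normal equations: (X'X + lam * D) b = X'y with D = diag(0, 1).
function ridge(x::Vector{Float64}, y::Vector{Float64}, lam::Float64)
    X = hcat(ones(length(x)), x)   # design matrix with intercept column
    D = Diagonal([0.0, 1.0])       # penalize only the slope
    return (X' * X + lam * D) \ (X' * y)
end
```

With `lam = 0.0` this reduces to ordinary least squares; as `lam` grows, the fitted slope shrinks toward zero while the intercept remains free.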

Modes, Medians and Means: A Unifying Perspective

Introduction / Warning

Any traditional introductory statistics course will teach students the definitions of modes, medians and means. But, because introductory courses can’t assume that students have much mathematical maturity, the close relationship between these three summary statistics can’t be made clear. This post tries to remedy that situation by making it clear that all three concepts arise as specific parameterizations of a more general problem.

To do so, I’ll need to introduce one non-standard definition that may trouble some readers. In order to simplify my exposition, let’s all agree to assume that $$0^0 = 0$$. In particular, we’ll want to assume that $$|0|^0 = 0$$, even though $$|\epsilon|^0 = 1$$ for all $$\epsilon > 0$$. This definition is non-standard, but it greatly simplifies what follows and emphasizes the conceptual unity of modes, medians and means.

Constructing a Summary Statistic

To see how modes, medians and means arise, let’s assume that we have a list of numbers, $$(x_1, x_2, \ldots, x_n)$$, that we want to summarize. We want our summary to be a single number, which we’ll call $$s$$. How should we select $$s$$ so that it summarizes the numbers, $$(x_1, x_2, \ldots, x_n)$$, effectively?

To answer that, we’ll assume that $$s$$ is an effective summary of the entire list if the typical discrepancy between $$s$$ and each of the $$x_i$$ is small. With that assumption in place, we only need to do two things: (1) define the notion of discrepancy between two numbers, $$x_i$$ and $$s$$; and (2) define the notion of a typical discrepancy. Because each number $$x_i$$ produces its own discrepancy, we’ll need to introduce a method for aggregating the individual discrepancies in order to say something about the typical discrepancy.

Defining a Discrepancy

We could define the discrepancy between a number $$x_i$$ and another number $$s$$ in many ways. For now, we’ll consider only three possibilities. Each of these three options satisfies a basic intuition we have about the notion of discrepancy: we expect that the discrepancy between $$x_i$$ and $$s$$ should be $$0$$ if $$|x_i - s| = 0$$ and that the discrepancy should be greater than $$0$$ if $$|x_i - s| > 0$$. That leaves us with one obvious question: how much greater should the discrepancy be when $$|x_i - s| > 0$$?

To answer that question, let’s consider three definitions of the discrepancy, $$d_i$$:

1. $$d_i = |x_i - s|^0$$
2. $$d_i = |x_i - s|^1$$
3. $$d_i = |x_i - s|^2$$

How should we think about these three possible definitions?

The first definition, $$d_i = |x_i - s|^0$$, says that the discrepancy is $$1$$ if $$x_i \neq s$$ and is $$0$$ only when $$x_i = s$$. This notion of discrepancy is typically called zero-one loss in machine learning. Note that this definition implies that anything other than exact equality produces a constant measure of discrepancy: summarizing $$x_i = 2$$ with $$s = 0$$ is neither better nor worse than using $$s = 1$$. In other words, the discrepancy does not increase at all as $$s$$ gets further and further from $$x_i$$. You can see this reflected in the far-left column of the image below:

The second definition, $$d_i = |x_i - s|^1$$, says that the discrepancy is equal to the distance between $$x_i$$ and $$s$$. This is often called an absolute deviation in machine learning. Note that this definition implies that the discrepancy should increase linearly as $$s$$ gets further and further from $$x_i$$. This is reflected in the center column of the image above.

The third definition, $$d_i = |x_i - s|^2$$, says that the discrepancy is the squared distance between $$x_i$$ and $$s$$. This is often called a squared error in machine learning. Note that this definition implies that the discrepancy should increase super-linearly as $$s$$ gets further and further from $$x_i$$. For example, if $$x_i = 1$$ and $$s = 0$$, then the discrepancy is $$1$$. But if $$x_i = 2$$ and $$s = 0$$, then the discrepancy is $$4$$. This is reflected in the far right column of the image above.

When we consider a list with a single element, $$(x_1)$$, these definitions all suggest that we should choose the same number: namely, $$s = x_1$$.

Aggregating Discrepancies

Although these definitions do not differ for a list with a single element, they suggest using very different summaries of a list with more than one number in it. To see why, let’s first assume that we’ll aggregate the discrepancy between $$x_i$$ and $$s$$ for each of the $$x_i$$ into a single summary of the quality of a proposed value of $$s$$. To perform this aggregation, we’ll sum up the discrepancies over each of the $$x_i$$ and call the result $$E$$.

In that case, our three definitions give three interestingly different possible definitions of the typical discrepancy, which we’ll call $$E$$ for error:
$$E_0 = \sum_{i} |x_i - s|^0.$$

$$E_1 = \sum_{i} |x_i - s|^1.$$

$$E_2 = \sum_{i} |x_i - s|^2.$$

When we write down these expressions in isolation, they don’t look very different. But if we select $$s$$ to minimize each of these three types of errors, we get very different numbers. And, surprisingly, each of these three numbers will be very familiar to us.

Minimizing Aggregate Discrepancies

For example, suppose that we try to find $$s_0$$ that minimizes the zero-one loss definition of the error of a single number summary. In that case, we require that,
$$s_0 = \arg \min_{s} \sum_{i} |x_i - s|^0.$$
What value should $$s_0$$ take on? If you give this some extended thought, you’ll discover two things: (1) there is not necessarily a single best value of $$s_0$$, but potentially many different values; and (2) each of these best values is one of the modes of the $$x_i$$.

In other words, the best single number summary of a set of numbers, when you use exact equality as your metric of error, is one of the modes of that set of numbers.

What happens if we consider some of the other definitions? Let’s start by considering $$s_1$$:
$$s_1 = \arg \min_{s} \sum_{i} |x_i - s|^1.$$
Unlike $$s_0$$, $$s_1$$ is a unique number: it is the median of the $$x_i$$. That is, the best summary of a set of numbers, when you use absolute differences as your metric of error, is the median of that set of numbers.

Since we’ve just found that the mode and the median appear naturally, we might wonder if other familiar basic statistics will appear. Luckily, they will. If we look for,
$$s_2 = \arg \min_{s} \sum_{i} |x_i - s|^2,$$
we’ll find that, like $$s_1$$, $$s_2$$ is again a unique number. Moreover, $$s_2$$ is the mean of the $$x_i$$. That is, the best summary of a set of numbers, when you use squared differences as your metric of error, is the mean of that set of numbers.

To sum up, we’ve just seen that the three most famous single number summaries of a data set are very closely related: they all minimize the average discrepancy between $$s$$ and the numbers being summarized. They only differ in the type of discrepancy being considered:

1. The mode minimizes the number of times that one of the numbers in our summarized list is not equal to the summary that we use.
2. The median minimizes the average distance between each number and our summary.
3. The mean minimizes the average squared distance between each number and our summary.

In equations,

1. $$\text{The mode of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^0$$
2. $$\text{The median of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^1$$
3. $$\text{The mean of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^2$$
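These claims are easy to verify numerically. Here is a brute-force sketch (my own, not from the post) that minimizes the aggregate error over a grid of candidate summaries:

```julia
# Brute-force check: pick the candidate s minimizing sum_i |x_i - s|^p.
best_summary(x, p, candidates) =
    candidates[argmin([sum(abs.(x .- s) .^ p) for s in candidates])]

x = [1.0, 2.0, 2.0, 3.0, 7.0]
grid = 0.0:0.1:10.0
best_summary(x, 1, grid)  # 2.0, the median of x
best_summary(x, 2, grid)  # 3.0, the mean of x
```

(For $$p = 0$$ this sketch would need the $$0^0 = 0$$ convention from above; Julia’s floating-point `0.0 ^ 0` is `1.0`, so the mode case has to be special-cased as a count of mismatches rather than computed with this formula.)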

Summary

We’ve just seen that the mode, median and mean all arise from a simple parametric process in which we try to minimize the average discrepancy between a single number $$s$$ and a list of numbers, $$x_1, x_2, \ldots, x_n$$ that we try to summarize using $$s$$. In a future blog post, I’ll describe how the ideas we’ve just introduced relate to the concept of $$L_p$$ norms. Thinking about minimizing $$L_p$$ norms is a generalization of taking modes, medians and means that leads to almost every important linear method in statistics — ranging from linear regression to the SVD.

Thanks

Thanks to Sean Taylor for reading a draft of this post and commenting on it.

Writing Better Statistical Programs in R

A while back a friend asked me for advice about speeding up some R code that they’d written. Because they were running an extensive Monte Carlo simulation of a model they’d been developing, the poor performance of their code had become an impediment to their work.

After I looked through their code, it was clear that the performance hurdles they were stumbling upon could be overcome by adopting a few best practices for statistical programming. This post tries to describe some of the simplest best practices for statistical programming in R. Following these principles should make it easier for you to write statistical programs that are both highly performant and correct.

Write Out a DAG

Whenever you’re running a simulation study, you should appreciate the fact that you are working with a probabilistic model. Even if you are primarily focused upon the deterministic components of this model, the presence of any randomness in the model means that all of the theory of probabilistic models applies to your situation.

Almost certainly the most important concept in probabilistic modeling when you want to write efficient code is the notion of conditional independence. Conditional independence is important because many probabilistic models can be decomposed into simple pieces that can be computed in isolation. Although your model contains many variables, any one of these variables may depend upon only a few other variables in your model. If you can organize all of the variables in your model based on their dependencies, it will be easier to exploit two computational tricks: vectorization and parallelization.

Let’s go through an example. Imagine that you have the model shown below:

$$X \sim \text{Normal}(0, 1)$$

$$Y1 \sim \text{Uniform}(X, X + 1)$$

$$Y2 \sim \text{Uniform}(X - 1, X)$$

$$Z \sim \text{Cauchy}(Y1 + Y2, 1)$$

In this model, the distribution of Y1 and Y2 depends only on the value of X. Similarly, the distribution of Z depends only on the values of Y1 and Y2. We can formalize this notion using a DAG, which is a directed acyclic graph that depicts which variables depend upon which other variables. It will help you appreciate the value of this format if you think of the arrows in the DAG below as indicating the flow of causality:

Having this DAG drawn out for your model will make it easier to write efficient code, because you can generate all of the values of a variable V simultaneously once you’ve computed the values of the variables that V depends upon. In our example, you can generate the values of X for all of your different simulations at once and then generate all of the Y1’s and Y2’s based on the values of X that you generate. You can then exploit this stepwise generation procedure to vectorize and parallelize your code. I’ll discuss vectorization to give you a sense of how to exploit the DAG we’ve drawn to write faster code.

Sequential dependencies are a major bottleneck in languages like R and Matlab that cannot perform loops efficiently. Looking at the DAG for the model shown above, you might think that you can't avoid writing a "for" loop to generate samples from this model, because some of the variables need to be generated before others.

But, in reality, each individual sample from this model is independent of all of the others. As such, you can draw all of the X’s for all of your different simulations using vectorized code. Below I show how this model could be implemented using loops and then show how this same model could be implemented using vectorized operations:

Loop Code

run.sims <- function(n.sims) {
  results <- data.frame()

  for (sim in 1:n.sims) {
    x <- rnorm(1, 0, 1)
    y1 <- runif(1, x, x + 1)
    y2 <- runif(1, x - 1, x)
    z <- rcauchy(1, y1 + y2, 1)
    results <- rbind(results,
                     data.frame(X = x, Y1 = y1, Y2 = y2, Z = z))
  }

  return(results)
}

b <- Sys.time()
run.sims(5000)
e <- Sys.time()
e - b

Vectorized Code

run.sims <- function(n.sims) {
  x <- rnorm(n.sims, 0, 1)
  y1 <- runif(n.sims, x, x + 1)
  y2 <- runif(n.sims, x - 1, x)
  z <- rcauchy(n.sims, y1 + y2, 1)
  results <- data.frame(X = x, Y1 = y1, Y2 = y2, Z = z)

  return(results)
}

b <- Sys.time()
run.sims(5000)
e <- Sys.time()
e - b

The performance gains for this example are substantial when you move from the naive loop code to the vectorized code. (NB: There are also some gains from avoiding the repeated calls to rbind, although they are less important than one might think in this case.)

We could go further and parallelize the vectorized code, but this can be tedious to do in R.

The Data Generation / Model Fitting Cycle

Vectorization can make code in languages like R much more efficient. But speed is useless if you’re not generating correct output. For me, the essential test of correctness for a probabilistic model only becomes clear after I’ve written two complementary functions:

1. A data generation function that produces samples from my model. We can call this function generate. The arguments to generate are the parameters of my model.
2. A model fitting function that estimates the parameters of my model based on a sample of data. We can call this function fit. The arguments to fit are the data points we generated using generate.

The value of these two functions is that they can be set up to feed back into one another in the cycle shown below:

I feel confident in the quality of statistical code when these functions interact stably. If the parameters inferred in a single pass through this loop are close to the original inputs, then my code is likely to work correctly. This amounts to a specific instance of the following design pattern:

data <- generate(model, parameters)
inferred.parameters <- fit(model, data)
reliability <- error(model, parameters, inferred.parameters)

To see this pattern in action, let’s step through a process of generating data from a normal distribution and then fitting a normal to the data we generate. You can think of this as a form of “currying” in which we hardcode the value of the model parameter:

n.sims <- 100
n.obs <- 100

generate.normal <- function(parameters) {
  return(rnorm(n.obs, parameters[1], parameters[2]))
}

fit.normal <- function(data) {
  return(c(mean(data), sd(data)))
}

distance <- function(true.parameters, inferred.parameters) {
  return((true.parameters - inferred.parameters)^2)
}

reliability <- data.frame()

for (sim in 1:n.sims) {
  parameters <- c(runif(1), runif(1))
  data <- generate.normal(parameters)
  inferred.parameters <- fit.normal(data)
  recovery.error <- distance(parameters, inferred.parameters)
  reliability <- rbind(reliability,
                       data.frame(True1 = parameters[1],
                                  True2 = parameters[2],
                                  Inferred1 = inferred.parameters[1],
                                  Inferred2 = inferred.parameters[2],
                                  Error1 = recovery.error[1],
                                  Error2 = recovery.error[2]))
}

If you generate data this way, you will see that our inference code is quite reliable, and that it becomes more accurate if we set n.obs to a larger value like 100,000.

I expect this kind of performance from all of my statistical code. I can’t trust the quality of either generate or fit until I see that they play well together. It is their mutual coherence that inspires faith.

General Lessons

Speed

When writing code in R, you can improve performance by searching for every possible location in which vectorization is possible. Vectorization essentially replaces R’s loops (which are not efficient) with C’s loops (which are efficient) because the computations in a vectorized call are almost always implemented in a language other than R.
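The same loop-versus-vectorization trade-off shows up in every interpreted language. As an illustration (sketched here in Python with NumPy rather than R; the function names are my own), compare an interpreted loop with a single vectorized call that does the same arithmetic:

```python
import numpy as np

def sum_of_squares_loop(xs):
    # One interpreted-level iteration per element: the slow pattern
    # in R, Python, or Matlab.
    total = 0.0
    for x in xs:
        total += x * x
    return total

def sum_of_squares_vectorized(xs):
    # A single vectorized call: the loop runs inside NumPy's compiled
    # C code, just as vectorized R calls run inside C.
    return float(np.dot(xs, xs))

xs = np.arange(1.0, 6.0)  # 1.0, 2.0, 3.0, 4.0, 5.0
print(sum_of_squares_loop(xs))        # 55.0
print(sum_of_squares_vectorized(xs))  # 55.0
```

Both functions return identical values; only the location of the loop changes, and with it the performance.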

Correctness

When writing code for model fitting in any language, you should always ensure that your code can recover the parameters of a model when given simulated data with known parameter values.
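This check takes only a few lines in any language. A minimal sketch of the generate/fit cycle, written here in Python with NumPy for illustration (the names generate and fit mirror the pattern above; the true parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def generate(mu, sigma, n_obs):
    # Draw simulated data with known parameter values.
    return rng.normal(mu, sigma, n_obs)

def fit(data):
    # Estimate the parameters back from the simulated data.
    return data.mean(), data.std(ddof=1)

true_mu, true_sigma = 0.3, 0.7
data = generate(true_mu, true_sigma, 100_000)
mu_hat, sigma_hat = fit(data)

# With n_obs this large, the recovered parameters should be close
# to the values we generated from.
assert abs(mu_hat - true_mu) < 0.05
assert abs(sigma_hat - true_sigma) < 0.05
```

If the assertions fail, at least one of generate and fit is wrong; when they pass across many simulated parameter settings, the two functions vouch for each other.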

Americans Live Longer and Work Less

Today I saw an article on Hacker News entitled, “America’s CEOs Want You to Work Until You’re 70”. I was particularly surprised by this article appearing out of the blue because I take it for granted that America will eventually have to raise the retirement age to avoid bankruptcy. After reading the article, I wasn’t able to figure out why the story had been run at all. So I decided to do some basic fact-checking.

I tracked down some time series data about life expectancies in the U.S. from Berkeley and then found some time series data about the average age at retirement from the OECD. Plotting just these two bits of information, as shown below, makes it clear that Americans are spending a larger proportion of their life in retirement.

Perhaps I’m just naive, but it seems obvious to me that we can’t afford to take on several additional years of retirement pension liabilities for every living American. If Americans are living longer, we will need them to work longer in order to pay our bills.

Symbolic Differentiation in Julia

A Brief Introduction to Metaprogramming in Julia

In contrast to my previous post, which described one way in which Julia allows (and expects) the programmer to write code that directly employs the atomic operations offered by computers, this post is meant to introduce newcomers to some of Julia’s higher level functions for metaprogramming. To make metaprogramming more interesting, we’re going to build a system for symbolic differentiation in Julia.

Like Lisp, the Julia interpreter represents Julian expressions using normal data structures: every Julian expression is represented using an object of type Expr. You can see this by typing something like :(x + 1) into the Julia REPL:

julia> :(x + 1)
:(+(x,1))

julia> typeof(:(x + 1))
Expr

Looking at the REPL output when we enter an expression quoted using the : operator, we can see that Julia has rewritten our input expression, originally written using infix notation, as an expression that uses prefix notation. This standardization to prefix notation makes it easier to work with arbitrary expressions because it removes a needless source of variation in the format of expressions.

To develop an intuition for what this kind of expression means to Julia, we can use the dump function to examine its contents:

julia> dump(:(x + 1))
Expr
  head: Symbol call
  args: Array(Any,(3,))
    1: Symbol +
    2: Symbol x
    3: Int64 1
  typ: Any

Here you can see that a Julian expression consists of three parts:

1. A head symbol, which describes the basic type of the expression. For this blog post, all of the expressions we’ll work with have head equal to :call.
2. An args field, an Array{Any} that contains the arguments of the expression. In our example, the head is :call, which indicates that this expression represents a function call. The arguments of that call are:
1. :+, the symbol denoting the addition function that we are calling.
2. :x, the symbol denoting the variable x.
3. 1, the number 1 represented as a 64-bit integer.
3. A typ field, which stores type inference information. We’ll ignore this field, as it’s not relevant to us right now.

Because each expression is built out of normal components, we can construct one piecemeal:

julia> Expr(:call, {:+, 1, 1}, Any)
:(+(1,1))

Because this expression only depends upon constants, we can immediately evaluate it using the eval function:

julia> eval(Expr(:call, {:+, 1, 1}, Any))
2

Symbolic Differentiation in Julia

Now that we know how Julia expressions are built, we can design a very simple prototype system for doing symbolic differentiation in Julia. We’ll build up our system in pieces using some of the most basic rules of calculus:

1. The Constant Rule: d/dx c = 0
2. The Symbol Rule: d/dx x = 1, d/dx y = 0
3. The Sum Rule: d/dx (f + g) = (d/dx f) + (d/dx g)
4. The Subtraction Rule: d/dx (f - g) = (d/dx f) - (d/dx g)
5. The Product Rule: d/dx (f * g) = (d/dx f) * g + f * (d/dx g)
6. The Quotient Rule: d/dx (f / g) = [(d/dx f) * g - f * (d/dx g)] / g^2

Implementing these operations is quite easy once you understand the data structure Julia uses to represent expressions. And some of these operations would be trivial regardless.

For example, here’s the Constant Rule in Julia:

differentiate(x::Number, target::Symbol) = 0

And here’s the Symbol rule:

function differentiate(s::Symbol, target::Symbol)
    if s == target
        return 1
    else
        return 0
    end
end

The first two rules of calculus don’t actually require us to understand anything about Julian expressions. But the interesting parts of a symbolic differentiation system do. To see that, let’s look at the Sum Rule:

function differentiate_sum(ex::Expr, target::Symbol)
    n = length(ex.args)
    new_args = Array(Any, n)
    new_args[1] = :+
    for i in 2:n
        new_args[i] = differentiate(ex.args[i], target)
    end
    return Expr(:call, new_args, Any)
end

The Subtraction Rule can be defined almost identically:

function differentiate_subtraction(ex::Expr, target::Symbol)
    n = length(ex.args)
    new_args = Array(Any, n)
    new_args[1] = :-
    for i in 2:n
        new_args[i] = differentiate(ex.args[i], target)
    end
    return Expr(:call, new_args, Any)
end

The Product Rule is a little more interesting because we need to build up an expression whose components are themselves expressions:

function differentiate_product(ex::Expr, target::Symbol)
    n = length(ex.args)
    res_args = Array(Any, n)
    res_args[1] = :+
    for i in 2:n
        new_args = Array(Any, n)
        new_args[1] = :*
        for j in 2:n
            if j == i
                new_args[j] = differentiate(ex.args[j], target)
            else
                new_args[j] = ex.args[j]
            end
        end
        res_args[i] = Expr(:call, new_args, Any)
    end
    return Expr(:call, res_args, Any)
end

Last, but not least, here’s the Quotient Rule, which is a little more complex. We can code this rule up in a more explicit fashion that doesn’t use any loops so that we can directly see the steps we’re taking:

function differentiate_quotient(ex::Expr, target::Symbol)
    return Expr(:call,
                {
                  :/,
                  Expr(:call,
                       {
                         :-,
                         Expr(:call,
                              {
                                :*,
                                differentiate(ex.args[2], target),
                                ex.args[3]
                              },
                              Any),
                         Expr(:call,
                              {
                                :*,
                                ex.args[2],
                                differentiate(ex.args[3], target)
                              },
                              Any)
                       },
                       Any),
                  Expr(:call,
                       {
                         :^,
                         ex.args[3],
                         2
                       },
                       Any)
                },
                Any)
end

Now that we have all of those basic rules of calculus implemented as functions, we’ll build up a lookup table that our final differentiate function can use to dispatch on the kind of function that’s being differentiated during each call to differentiate:

differentiate_lookup = {
    :+ => differentiate_sum,
    :- => differentiate_subtraction,
    :* => differentiate_product,
    :/ => differentiate_quotient
}

With all of the core machinery in place, the final definition of differentiate is very simple:

function differentiate(ex::Expr, target::Symbol)
    if ex.head == :call
        if has(differentiate_lookup, ex.args[1])
            return differentiate_lookup[ex.args[1]](ex, target)
        else
            error("Don't know how to differentiate $(ex.args[1])")
        end
    else
        return differentiate(ex.head, target)
    end
end

I’ve put all of these snippets together in a single GitHub Gist. To try out this new differentiation function, let’s copy the contents of that GitHub gist into a file called differentiate.jl. We can then load the contents of that file into Julia at the REPL using include, which will allow us to try out our differentiation tool:

julia> include("differentiate.jl")

julia> differentiate(:(x + x*x), :x)
:(+(1,+(*(1,x),*(x,1))))

julia> differentiate(:(x + a*x), :x)
:(+(1,+(*(0,x),*(a,1))))

While the expressions that are constructed by our differentiate function are ugly, they are correct: they just need to be simplified so that things like *(0, x) are replaced with 0. If you’d like to see how to write code to perform some basic simplifications, you can see the simplify function I’ve been building for Julia’s new Calculus package. That codebase includes all of the functionality shown here for differentiate, along with several other rules that make the system more powerful.
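To give a flavor of what such simplification involves, here is a toy simplifier sketched in Python over prefix-notation tuples. This is purely illustrative and is not the Calculus package’s actual simplify function; it applies just a few identities like *(0, x) → 0 and +(0, x) → x:

```python
def simplify(ex):
    # Expressions in prefix notation: ('+', 'x', 1); symbols are strings,
    # constants are numbers. (Illustrative sketch only.)
    if not isinstance(ex, tuple):
        return ex
    op, *args = ex
    args = [simplify(a) for a in args]
    if op == '*':
        if 0 in args:
            return 0                          # *(0, x) -> 0
        args = [a for a in args if a != 1]    # *(1, x) -> x
        if len(args) == 0:
            return 1
        if len(args) == 1:
            return args[0]
    if op == '+':
        args = [a for a in args if a != 0]    # +(0, x) -> x
        if len(args) == 0:
            return 0
        if len(args) == 1:
            return args[0]
    return (op, *args)

# d/dx (x + x*x) came back as +(1, +(*(1, x), *(x, 1))):
print(simplify(('+', 1, ('+', ('*', 1, 'x'), ('*', 'x', 1)))))
# -> ('+', 1, ('+', 'x', 'x'))
```

A real simplifier repeats passes like this until the expression stops changing, and knows many more identities; the principle is the same.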

What I love about Julia is the ease with which one can move from low-level bit operations like those described in my previous post to high-level operations that manipulate Julian expressions. By allowing the programmer to manipulate expressions programmatically, Julia has copied one of the most beautiful parts of Lisp.