## Using Norms to Understand Linear Regression

### Introduction

In my last post, I described how we can derive modes, medians and means as three natural solutions to the problem of summarizing a list of numbers, $$(x_1, x_2, \ldots, x_n)$$, using a single number, $$s$$. In particular, we measured the quality of different potential summaries in three different ways, which led us to modes, medians and means respectively. Each of these quantities emerged from measuring the typical discrepancy between an element of the list, $$x_i$$, and the summary, $$s$$, using a formula of the form,
$$\sum_i |x_i - s|^p,$$
where $$p$$ was either $$0$$, $$1$$ or $$2$$.

### The $$L_p$$ Norms

In this post, I’d like to extend this approach to linear regression. The notion of discrepancies we used in the last post is very closely tied to the idea of measuring the size of a vector in $$\mathbb{R}^n$$. Specifically, we were minimizing a measure of discrepancies that was almost identical to the $$L_p$$ family of norms that can be used to measure the size of vectors. Understanding $$L_p$$ norms makes it much easier to describe several modern generalizations of classical linear regression.

To extend our previous approach to the more standard notion of an $$L_p$$ norm, we simply take the sum we used before and rescale things by taking a $$p^{th}$$ root. This gives the formula for the $$L_p$$ norm of any vector, $$v = (v_1, v_2, \ldots, v_n)$$, as,
$$|v|_p = (\sum_i |v_i|^p)^\frac{1}{p}.$$
When $$p = 2$$, this formula reduces to the familiar formula for the length of a vector:
$$|v|_2 = \sqrt{\sum_i v_i^2}.$$

In the last post, the vector we cared about was the vector of elementwise discrepancies, $$v = (x_1 - s, x_2 - s, \ldots, x_n - s)$$. We wanted to minimize the overall size of this vector in order to make $$s$$ a good summary of $$x_1, \ldots, x_n$$. Because we were interested only in the minimum size of this vector, it didn’t matter that we skipped taking the $$p^{th}$$ root at the end: one vector has a smaller $$L_p$$ norm than another exactly when the $$p^{th}$$ power of its norm is smaller than the $$p^{th}$$ power of the other’s. What was essential wasn’t the scale of the norm, but rather the value of $$p$$ that we chose. Here we’ll follow that approach again. Specifically, we’ll again be working consistently with the $$p^{th}$$ power of an $$L_p$$ norm:
$$|v|_p^p = (\sum_i |v_i|^p).$$
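To make the notation concrete, here is a small Python sketch (my illustration, not from the original post) of both quantities:

```python
def lp_norm(v, p):
    """The L_p norm of v: (sum_i |v_i|^p)^(1/p)."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def lp_power(v, p):
    """The p-th power of the L_p norm: sum_i |v_i|^p."""
    return sum(abs(x) ** p for x in v)

v = [3.0, -4.0]
print(lp_norm(v, 2))   # 5.0, the familiar Euclidean length
print(lp_norm(v, 1))   # 7.0
print(lp_power(v, 2))  # 25.0
```

Because the $$p^{th}$$ root is monotonic, minimizing `lp_power` and minimizing `lp_norm` pick out the same vector, which is why we can work with the simpler power form.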

### The Regression Problem

Using $$L_p$$ norms to measure the overall size of a vector of discrepancies extends naturally to other problems in statistics. In the previous post, we were trying to summarize a list of numbers by producing a simple summary statistic. In this post, we’re instead going to summarize the relationship between two lists of numbers in a form that generalizes traditional regression models.

Instead of a single list, we’ll now work with two vectors: $$(x_1, x_2, \ldots, x_n)$$ and $$(y_1, y_2, \ldots, y_n)$$. Because we like simple models, we’ll make the very strong (and very convenient) assumption that the second vector is, approximately, a linear function of the first vector, which gives us the formula:
$$y_i \approx \beta_0 + \beta_1 x_i.$$

In practice, this linear relationship is never perfect, but only an approximation. As such, for any specific values we choose for $$\beta_0$$ and $$\beta_1$$, we have to compute a vector of discrepancies: $$v = (y_1 - (\beta_0 + \beta_1 x_1), \ldots, y_n - (\beta_0 + \beta_1 x_n))$$. The question then becomes: how do we measure the size of this vector of discrepancies? By choosing different norms to measure its size, we arrive at several different forms of linear regression models. In particular, we’ll work with three norms: the $$L_0$$, $$L_1$$ and $$L_2$$ norms.

As we did with the single vector case, here we’ll define discrepancies as,
$$d_i = |y_i - (\beta_0 + \beta_1 x_i)|^p,$$
and the total error as,
$$E_p = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^p,$$
which is just the $$p^{th}$$ power of the $$L_p$$ norm.
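In code, this total error is a one-liner per data point. The sketch below is my own illustration (the function name is made up), and it uses the convention from the previous post that $$0^0 = 0$$, so that $$E_0$$ counts the points the line misses:

```python
def total_error(xs, ys, b0, b1, p):
    """E_p = sum_i |y_i - (b0 + b1 * x_i)|^p, with the convention 0^0 = 0."""
    total = 0.0
    for x, y in zip(xs, ys):
        r = abs(y - (b0 + b1 * x))
        total += 0.0 if (p == 0 and r == 0) else r ** p
    return total

xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 7.0]  # the last point is off the line y = 1 + 2x by 2

print(total_error(xs, ys, 1.0, 2.0, 0))  # 1.0: one point missed
print(total_error(xs, ys, 1.0, 2.0, 1))  # 2.0: total absolute error
print(total_error(xs, ys, 1.0, 2.0, 2))  # 4.0: total squared error
```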

### Several Forms of Regression

In general, we want to estimate a set of regression coefficients that minimizes this total error. Different forms of linear regression appear when we alter the value of $$p$$. As before, let’s consider three settings:
$$E_0 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^0$$
$$E_1 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^1$$
$$E_2 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^2$$

What happens in these settings? In the first case, we select regression coefficients so that the line passes through as many points as possible. Clearly we can always select a line that passes through any pair of points. And we can show that there are data sets in which we cannot do better. So the $$L_0$$ norm doesn’t seem to provide a very useful form of linear regression, but I’d be interested to see examples of its use.

In contrast, minimizing $$E_1$$ and $$E_2$$ define quite interesting and familiar forms of linear regression. We’ll start with $$E_2$$ because it’s the most familiar: it defines Ordinary Least Squares (OLS) regression, which is the one we all know and love. In the $$L_2$$ case, we select $$\beta_0$$ and $$\beta_1$$ to minimize,
$$E_2 = \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2,$$
which is the summed squared error over all of the $$(x_i, y_i)$$ pairs. In other words, Ordinary Least Squares regression is just an attempt to find an approximating linear relationship between two vectors that minimizes the $$L_2$$ norm of the vector of discrepancies.
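In the $$L_2$$ case we don’t even need a numerical search: setting the derivatives of $$E_2$$ with respect to $$\beta_0$$ and $$\beta_1$$ to zero gives the standard closed-form OLS solution. Here is a minimal pure-Python sketch of those textbook formulas (my illustration, not code from the post):

```python
def ols_fit(xs, ys):
    """Closed-form minimizer of E_2 for the simple linear model y ~ b0 + b1 * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    b0 = my - b1 * mx
    return b0, b1

b0, b1 = ols_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(b0, b1)  # 1.0 2.0: recovers the exact line y = 1 + 2x
```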

Although OLS regression is clearly king, the coefficients we get from minimizing $$E_1$$ are also quite widely used: using the $$L_1$$ norm defines Least Absolute Deviations (LAD) regression, which is also sometimes called Robust Regression. This approach to regression is robust because large outliers, whose errors the squaring operation in OLS would inflate, contribute only their absolute values to $$E_1$$. This means that the resulting model will try to match the overall linear pattern in the data even when there are some very large outliers.

We can also relate these two approaches to the strategy employed in the previous post. When we use OLS regression (which would be better called $$L_2$$ regression), we predict the mean of $$y_i$$ given the value of $$x_i$$. And when we use LAD regression (which would be better called $$L_1$$ regression), we predict the median of $$y_i$$ given the value of $$x_i$$. Just as I said in the previous post, the core theoretical tool that we need to understand is the $$L_p$$ norm. For single number summaries, it naturally leads to modes, medians and means. For simple regression problems, it naturally leads to LAD regression and OLS regression. But there’s more: it also leads naturally to the two most popular forms of regularized regression.
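To make the robustness claim concrete, here is a small pure-Python experiment of my own (the brute-force grid search stands in for a real LAD solver, which would normally use linear programming) comparing the LAD and OLS slopes on data with one extreme outlier:

```python
def e1(xs, ys, b0, b1):
    """E_1: the summed absolute error for the line b0 + b1 * x."""
    return sum(abs(y - (b0 + b1 * x)) for x, y in zip(xs, ys))

def lad_fit(xs, ys, grid):
    """Brute-force LAD: the (b0, b1) pair on the grid minimizing E_1."""
    return min(((b0, b1) for b0 in grid for b1 in grid),
               key=lambda b: e1(xs, ys, b[0], b[1]))

# Four points on the line y = 2x, plus one wild outlier.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 2.0, 4.0, 6.0, 100.0]

grid = [i / 2 for i in range(41)]  # candidate coefficients 0.0, 0.5, ..., 20.0
b0, b1 = lad_fit(xs, ys, grid)
print(b0, b1)  # 0.0 2.0: LAD shrugs off the outlier and recovers the trend

# The OLS slope, by contrast, is dragged far from 2 by the single outlier.
mx, my = sum(xs) / 5, sum(ys) / 5
ols_b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
print(ols_b1)  # 20.4
```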

### Regularization

If you’re not familiar with regularization, the central idea is that we don’t exclusively try to find the values of $$\beta_0$$ and $$\beta_1$$ that minimize the discrepancy between $$\beta_0 + \beta_1 x_i$$ and $$y_i$$, but also simultaneously try to satisfy a competing requirement that $$\beta_1$$ not get too large. Note that we don’t try to control the size of $$\beta_0$$ because it describes the overall scale of the data rather than the relationship between $$x$$ and $$y$$.

Because these objectives compete, we have to combine them into a single objective. We do that by working with a linear sum of the two objectives. And because both the discrepancy objective and the size of the coefficients can be described in terms of norms, we’ll assume that we want to minimize the $$L_p$$ norm of the discrepancies and the $$L_q$$ norm of the $$\beta$$’s. This means that we end up trying to minimize an expression of the form,
$$(\sum_i |y_i - (\beta_0 + \beta_1 x_i)|^{p}) + \lambda (|\beta_1|^q).$$

In most regularized regression models that I’ve seen in the wild, people tend to use $$p = 2$$ and $$q = 1$$ or $$q = 2$$. When $$q = 1$$, this model is called the LASSO. When $$q = 2$$, this model is called ridge regression. In a future post, I’ll try to describe why the LASSO and ridge regression produce such different patterns of coefficients.
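As a sketch of what this penalized objective looks like in code (my own illustration; the names are made up, and a brute-force search over a grid of slopes stands in for a real solver), we can watch a ridge-style penalty ($$p = 2$$, $$q = 2$$) shrink the estimated slope toward zero as $$\lambda$$ grows:

```python
def penalized_error(xs, ys, b0, b1, p, q, lam):
    """The generic objective: L_p fit error plus a lambda-weighted L_q penalty on b1."""
    fit = sum(abs(y - (b0 + b1 * x)) ** p for x, y in zip(xs, ys))
    return fit + lam * abs(b1) ** q

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 6.0]  # exactly y = 2x

slopes = [i / 10 for i in range(31)]  # candidate b1 values: 0.0 to 3.0

# Ridge-style penalty (p = 2, q = 2), with b0 fixed at 0 for simplicity.
b1_unpenalized = min(slopes, key=lambda b: penalized_error(xs, ys, 0.0, b, 2, 2, 0.0))
b1_shrunk = min(slopes, key=lambda b: penalized_error(xs, ys, 0.0, b, 2, 2, 14.0))
print(b1_unpenalized)  # 2.0: with no penalty we recover the true slope
print(b1_shrunk)       # 1.0: the penalty pulls the slope toward zero
```

Swapping in $$q = 1$$ gives the LASSO penalty instead; the objective changes only in the exponent on $$|\beta_1|$$.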

## Modes, Medians and Means: A Unifying Perspective

### Introduction / Warning

Any traditional introductory statistics course will teach students the definitions of modes, medians and means. But, because introductory courses can’t assume that students have much mathematical maturity, the close relationship between these three summary statistics can’t be made clear. This post tries to remedy that situation by making it clear that all three concepts arise as specific parameterizations of a more general problem.

To do so, I’ll need to introduce one non-standard definition that may trouble some readers. In order to simplify my exposition, let’s all agree to assume that $$0^0 = 0$$. In particular, we’ll want to assume that $$|0|^0 = 0$$, even though $$|\epsilon|^0 = 1$$ for all $$\epsilon > 0$$. This definition is non-standard, but it greatly simplifies what follows and emphasizes the conceptual unity of modes, medians and means.

### Constructing a Summary Statistic

To see how modes, medians and means arise, let’s assume that we have a list of numbers, $$(x_1, x_2, \ldots, x_n)$$, that we want to summarize. We want our summary to be a single number, which we’ll call $$s$$. How should we select $$s$$ so that it summarizes the numbers, $$(x_1, x_2, \ldots, x_n)$$, effectively?

To answer that, we’ll assume that $$s$$ is an effective summary of the entire list if the typical discrepancy between $$s$$ and each of the $$x_i$$ is small. With that assumption in place, we only need to do two things: (1) define the notion of discrepancy between two numbers, $$x_i$$ and $$s$$; and (2) define the notion of a typical discrepancy. Because each number $$x_i$$ produces its own discrepancy, we’ll need to introduce a method for aggregating the individual discrepancies to order to say something about the typical discrepancy.

### Defining a Discrepancy

We could define the discrepancy between a number $$x_i$$ and another number $$s$$ in many ways. For now, we’ll consider only three possibilities. Each of these three options satisfies a basic intuition we have about the notion of discrepancy: we expect that the discrepancy between $$x_i$$ and $$s$$ should be $$0$$ if $$|x_i - s| = 0$$ and that the discrepancy should be greater than $$0$$ if $$|x_i - s| > 0$$. That leaves us with one obvious question: how much greater should the discrepancy be when $$|x_i - s| > 0$$?

To answer that question, let’s consider three definitions of the discrepancy, $$d_i$$:

1. $$d_i = |x_i - s|^0$$
2. $$d_i = |x_i - s|^1$$
3. $$d_i = |x_i - s|^2$$

How should we think about these three possible definitions?

The first definition, $$d_i = |x_i - s|^0$$, says that the discrepancy is $$1$$ if $$x_i \neq s$$ and is $$0$$ only when $$x_i = s$$. This notion of discrepancy is typically called zero-one loss in machine learning. Note that this definition implies that anything other than exact equality produces a constant measure of discrepancy. Summarizing $$x_i = 2$$ with $$s = 0$$ is neither better nor worse than using $$s = 1$$. In other words, the discrepancy does not increase at all as $$s$$ gets further and further from $$x_i$$. You can see this reflected in the far-left column of the image below:

The second definition, $$d_i = |x_i - s|^1$$, says that the discrepancy is equal to the distance between $$x_i$$ and $$s$$. This is often called an absolute deviation in machine learning. Note that this definition implies that the discrepancy should increase linearly as $$s$$ gets further and further from $$x_i$$. This is reflected in the center column of the image above.

The third definition, $$d_i = |x_i - s|^2$$, says that the discrepancy is the squared distance between $$x_i$$ and $$s$$. This is often called a squared error in machine learning. Note that this definition implies that the discrepancy should increase super-linearly as $$s$$ gets further and further from $$x_i$$. For example, if $$x_i = 1$$ and $$s = 0$$, then the discrepancy is $$1$$. But if $$x_i = 2$$ and $$s = 0$$, then the discrepancy is $$4$$. This is reflected in the far right column of the image above.

When we consider a list with a single element, $$(x_1)$$, these definitions all suggest that we should choose the same number: namely, $$s = x_1$$.

### Aggregating Discrepancies

Although these definitions do not differ for a list with a single element, they suggest using very different summaries of a list with more than one number in it. To see why, let’s first assume that we’ll aggregate the discrepancy between $$x_i$$ and $$s$$ for each of the $$x_i$$ into a single summary of the quality of a proposed value of $$s$$. To perform this aggregation, we’ll sum up the discrepancies over each of the $$x_i$$ and call the result $$E$$.

In that case, our three definitions give three interestingly different possible definitions of the typical discrepancy, which we’ll call $$E$$ for error:
$$E_0 = \sum_{i} |x_i - s|^0.$$

$$E_1 = \sum_{i} |x_i - s|^1.$$

$$E_2 = \sum_{i} |x_i - s|^2.$$

When we write down these expressions in isolation, they don’t look very different. But if we select $$s$$ to minimize each of these three types of errors, we get very different numbers. And, surprisingly, each of these three numbers will be very familiar to us.

### Minimizing Aggregate Discrepancies

For example, suppose that we try to find $$s_0$$ that minimizes the zero-one loss definition of the error of a single number summary. In that case, we require that,
$$s_0 = \arg \min_{s} \sum_{i} |x_i - s|^0.$$
What value should $$s_0$$ take on? If you give this some extended thought, you’ll discover two things: (1) there is not necessarily a single best value of $$s_0$$, but potentially many different values; and (2) each of these best values is one of the modes of the $$x_i$$.

In other words, the best single number summary of a set of numbers, when you use exact equality as your metric of error, is one of the modes of that set of numbers.

What happens if we consider some of the other definitions? Let’s start by considering $$s_1$$:
$$s_1 = \arg \min_{s} \sum_{i} |x_i - s|^1.$$
Unlike $$s_0$$, $$s_1$$ is a unique number: it is the median of the $$x_i$$. That is, the best summary of a set of numbers, when you use absolute differences as your metric of error, is the median of that set of numbers.

Since we’ve just found that the mode and the median appear naturally, we might wonder if other familiar basic statistics will appear. Luckily, they will. If we look for,
$$s_2 = \arg \min_{s} \sum_{i} |x_i - s|^2,$$
we’ll find that, like $$s_1$$, $$s_2$$ is again a unique number. Moreover, $$s_2$$ is the mean of the $$x_i$$. That is, the best summary of a set of numbers, when you use squared differences as your metric of error, is the mean of that set of numbers.

To sum up, we’ve just seen that the three most famous single number summaries of a data set are very closely related: they all minimize the average discrepancy between $$s$$ and the numbers being summarized. They only differ in the type of discrepancy being considered:

1. The mode minimizes the number of times that one of the numbers in our summarized list is not equal to the summary that we use.
2. The median minimizes the average distance between each number and our summary.
3. The mean minimizes the average squared distance between each number and our summary.

In equations,

1. $$\text{The mode of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^0$$
2. $$\text{The median of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^1$$
3. $$\text{The mean of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^2$$
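We can verify all three claims numerically with a small pure-Python experiment (my addition, not part of the original post), minimizing each error over a grid of candidate summaries. Note the special case implementing the post’s convention that $$0^0 = 0$$:

```python
from statistics import mean, median

def error(xs, s, p):
    """E_p = sum_i |x_i - s|^p, with the convention that 0^0 = 0."""
    total = 0.0
    for x in xs:
        d = abs(x - s)
        total += 0.0 if (p == 0 and d == 0) else d ** p
    return total

xs = [1.0, 1.0, 2.0, 3.0, 7.0]
grid = [i / 10 for i in range(81)]  # candidate summaries from 0.0 to 8.0

s0 = min(grid, key=lambda s: error(xs, s, 0))
s1 = min(grid, key=lambda s: error(xs, s, 1))
s2 = min(grid, key=lambda s: error(xs, s, 2))

print(s0)  # 1.0: the mode
print(s1)  # 2.0: the median
print(s2)  # 2.8: the mean
```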

### Summary

We’ve just seen that the mode, median and mean all arise from a simple parametric process in which we try to minimize the average discrepancy between a single number $$s$$ and a list of numbers, $$x_1, x_2, \ldots, x_n$$, that we try to summarize using $$s$$. In a future blog post, I’ll describe how the ideas we’ve just introduced relate to the concept of $$L_p$$ norms. Thinking about minimizing $$L_p$$ norms is a generalization of taking modes, medians and means that leads to almost every important linear method in statistics — ranging from linear regression to the SVD.

### Thanks

Thanks to Sean Taylor for reading a draft of this post and commenting on it.

## Writing Better Statistical Programs in R

A while back a friend asked me for advice about speeding up some R code that they’d written. Because they were running an extensive Monte Carlo simulation of a model they’d been developing, the poor performance of their code had become an impediment to their work.

After I looked through their code, it was clear that the performance hurdles they were stumbling upon could be overcome by adopting a few best practices for statistical programming. This post tries to describe some of the simplest best practices for statistical programming in R. Following these principles should make it easier for you to write statistical programs that are both highly performant and correct.

### Write Out a DAG

Whenever you’re running a simulation study, you should appreciate the fact that you are working with a probabilistic model. Even if you are primarily focused upon the deterministic components of this model, the presence of any randomness in the model means that all of the theory of probabilistic models applies to your situation.

Almost certainly the most important concept in probabilistic modeling when you want to write efficient code is the notion of conditional independence. Conditional independence is important because many probabilistic models can be decomposed into simple pieces that can be computed in isolation. Although your model contains many variables, any one of these variables may depend upon only a few other variables in your model. If you can organize all of the variables in your model based on their dependencies, it will be easier to exploit two computational tricks: vectorization and parallelization.

Let’s go through an example. Imagine that you have the model shown below:

$$X \sim \text{Normal}(0, 1)$$

$$Y1 \sim \text{Uniform}(X, X + 1)$$

$$Y2 \sim \text{Uniform}(X - 1, X)$$

$$Z \sim \text{Cauchy}(Y1 + Y2, 1)$$

In this model, the distribution of Y1 and Y2 depends only on the value of X. Similarly, the distribution of Z depends only on the values of Y1 and Y2. We can formalize this notion using a DAG, which is a directed acyclic graph that depicts which variables depend upon which other variables. It will help you appreciate the value of this format if you think of the arrows in the DAG below as indicating the flow of causality:

Having this DAG drawn out for your model will make it easier to write efficient code, because you can generate all of the values of a variable V simultaneously once you’ve computed the values of the variables that V depends upon. In our example, you can generate the values of X for all of your different simulations at once and then generate all of the Y1’s and Y2’s based on the values of X that you generate. You can then exploit this stepwise generation procedure to vectorize and parallelize your code. I’ll discuss vectorization to give you a sense of how to exploit the DAG we’ve drawn to write faster code.

Sequential dependencies are a major bottleneck in languages like R and Matlab that cannot perform loops efficiently. Looking at the DAG for the model shown above, you might think that you can’t get around writing a “for” loop to generate samples from this model because some of the variables need to be generated before others.

But, in reality, each individual sample from this model is independent of all of the others. As such, you can draw all of the X’s for all of your different simulations using vectorized code. Below I show how this model could be implemented using loops and then show how this same model could be implemented using vectorized operations:

#### Loop Code

```r
run.sims <- function(n.sims) {
  results <- data.frame()

  for (sim in 1:n.sims) {
    x <- rnorm(1, 0, 1)
    y1 <- runif(1, x, x + 1)
    y2 <- runif(1, x - 1, x)
    z <- rcauchy(1, y1 + y2, 1)
    results <- rbind(results, data.frame(X = x, Y1 = y1, Y2 = y2, Z = z))
  }

  return(results)
}

b <- Sys.time()
run.sims(5000)
e <- Sys.time()
e - b
```

#### Vectorized Code

```r
run.sims <- function(n.sims) {
  x <- rnorm(n.sims, 0, 1)
  y1 <- runif(n.sims, x, x + 1)
  y2 <- runif(n.sims, x - 1, x)
  z <- rcauchy(n.sims, y1 + y2, 1)
  results <- data.frame(X = x, Y1 = y1, Y2 = y2, Z = z)

  return(results)
}

b <- Sys.time()
run.sims(5000)
e <- Sys.time()
e - b
```

The performance gains for this example are substantial when you move from the naive loop code to the vectorized code. (NB: There are also some gains from avoiding the repeated calls to rbind, although they are less important than one might think in this case.)

We could go further and parallelize the vectorized code, but this can be tedious to do in R.

### The Data Generation / Model Fitting Cycle

Vectorization can make code in languages like R much more efficient. But speed is useless if you’re not generating correct output. For me, the essential test of correctness for a probabilistic model only becomes clear after I’ve written two complementary functions:

1. A data generation function that produces samples from my model. We can call this function generate. The arguments to generate are the parameters of my model.
2. A model fitting function that estimates the parameters of my model based on a sample of data. We can call this function fit. The arguments to fit are the data points we generated using generate.

The value of these two functions is that they can be set up to feed back into one another in the cycle shown below:

I feel confident in the quality of statistical code when these functions interact stably. If the parameters inferred in a single pass through this loop are close to the original inputs, then my code is likely to work correctly. This amounts to a specific instance of the following design pattern:

```r
data <- generate(model, parameters)
inferred.parameters <- fit(model, data)
reliability <- error(model, parameters, inferred.parameters)
```

To see this pattern in action, let’s step through a process of generating data from a normal distribution and then fitting a normal to the data we generate. You can think of this as a form of “currying” in which we hardcode the value of the parameter model:

```r
n.sims <- 100
n.obs <- 100

generate.normal <- function(parameters) {
  return(rnorm(n.obs, parameters[1], parameters[2]))
}

fit.normal <- function(data) {
  return(c(mean(data), sd(data)))
}

distance <- function(true.parameters, inferred.parameters) {
  return((true.parameters - inferred.parameters)^2)
}

reliability <- data.frame()

for (sim in 1:n.sims) {
  parameters <- c(runif(1), runif(1))
  data <- generate.normal(parameters)
  inferred.parameters <- fit.normal(data)
  recovery.error <- distance(parameters, inferred.parameters)
  reliability <- rbind(reliability,
                       data.frame(True1 = parameters[1],
                                  True2 = parameters[2],
                                  Inferred1 = inferred.parameters[1],
                                  Inferred2 = inferred.parameters[2],
                                  Error1 = recovery.error[1],
                                  Error2 = recovery.error[2]))
}
```

If you generate data this way, you will see that our inference code is quite reliable. And you can see that it becomes better if we set n.obs to a larger value like 100,000.

I expect this kind of performance from all of my statistical code. I can’t trust the quality of either generate or fit until I see that they play well together. It is their mutual coherence that inspires faith.

### General Lessons

#### Speed

When writing code in R, you can improve performance by searching for every opportunity to vectorize your computations. Vectorization essentially replaces R’s loops (which are not efficient) with C’s loops (which are efficient), because the computations in a vectorized call are almost always implemented in a language other than R.

#### Correctness

When writing code for model fitting in any language, you should always ensure that your code can infer the parameters of models when given simulated data with known parameter values.

## Americans Live Longer and Work Less

Today I saw an article on Hacker News entitled, “America’s CEOs Want You to Work Until You’re 70”. I was particularly surprised by this article appearing out of the blue because I take it for granted that America will eventually have to raise the retirement age to avoid bankruptcy. After reading the article, I wasn’t able to figure out why the story had been run at all. So I decided to do some basic fact-checking.

I tracked down some time series data about life expectancies in the U.S. from Berkeley and then found some time series data about the average age at retirement from the OECD. Plotting just these two bits of information, as shown below, makes it clear that Americans are spending a larger proportion of their life in retirement.

Perhaps I’m just naive, but it seems obvious to me that we can’t afford to take on several additional years of retirement pension liabilities for every living American. If Americans are living longer, we will need them to work longer in order to pay our bills.

## Symbolic Differentiation in Julia

### A Brief Introduction to Metaprogramming in Julia

In contrast to my previous post, which described one way in which Julia allows (and expects) the programmer to write code that directly employs the atomic operations offered by computers, this post is meant to introduce newcomers to some of Julia’s higher level functions for metaprogramming. To make metaprogramming more interesting, we’re going to build a system for symbolic differentiation in Julia.

Like Lisp, the Julia interpreter represents Julian expressions using normal data structures: every Julian expression is represented using an object of type Expr. You can see this by typing something like :(x + 1) into the Julia REPL:

```julia
julia> :(x + 1)
:(+(x,1))

julia> typeof(:(x+1))
Expr
```

Looking at the REPL output when we enter an expression quoted using the : operator, we can see that Julia has rewritten our input expression, originally written using infix notation, as an expression that uses prefix notation. This standardization to prefix notation makes it easier to work with arbitrary expressions because it removes a needless source of variation in the format of expressions.

To develop an intuition for what this kind of expression means to Julia, we can use the dump function to examine its contents:

```julia
julia> dump(:(x + 1))
Expr
  head: Symbol call
  args: Array(Any,(3,))
    1: Symbol +
    2: Symbol x
    3: Int64 1
  typ: Any
```

Here you can see that a Julian expression consists of three parts:

1. A head symbol, which describes the basic type of the expression. For this blog post, all of the expressions we’ll work with have head equal to :call.
2. An Array{Any} that contains the arguments of the head. In our example, the head is :call, which indicates that a function call is being made in this expression. The arguments for the function call are:
    1. :+, the symbol denoting the addition function that we are calling.
    2. :x, the symbol denoting the variable x.
    3. 1, the number 1 represented as a 64-bit integer.
3. A typ field, which stores type inference information. We’ll ignore this information as it’s not relevant to us right now.

Because each expression is built out of normal components, we can construct one piecemeal:

```julia
julia> Expr(:call, {:+, 1, 1}, Any)
:(+(1,1))
```

Because this expression only depends upon constants, we can immediately evaluate it using the eval function:

```julia
julia> eval(Expr(:call, {:+, 1, 1}, Any))
2
```

### Symbolic Differentiation in Julia

Now that we know how Julia expressions are built, we can design a very simple prototype system for doing symbolic differentiation in Julia. We’ll build up our system in pieces using some of the most basic rules of calculus:

1. The Constant Rule: d/dx c = 0
2. The Symbol Rule: d/dx x = 1, d/dx y = 0
3. The Sum Rule: d/dx (f + g) = (d/dx f) + (d/dx g)
4. The Subtraction Rule: d/dx (f - g) = (d/dx f) - (d/dx g)
5. The Product Rule: d/dx (f * g) = (d/dx f) * g + f * (d/dx g)
6. The Quotient Rule: d/dx (f / g) = [(d/dx f) * g - f * (d/dx g)] / g^2

Implementing these operations is quite easy once you understand the data structure Julia uses to represent expressions. And some of these operations would be trivial regardless.

For example, here’s the Constant Rule in Julia:

```julia
differentiate(x::Number, target::Symbol) = 0
```

And here’s the Symbol rule:

```julia
function differentiate(s::Symbol, target::Symbol)
    if s == target
        return 1
    else
        return 0
    end
end
```

The first two rules of calculus don’t actually require us to understand anything about Julian expressions. But the interesting parts of a symbolic differentiation system do. To see that, let’s look at the Sum Rule:

```julia
function differentiate_sum(ex::Expr, target::Symbol)
    n = length(ex.args)
    new_args = Array(Any, n)
    new_args[1] = :+
    for i in 2:n
        new_args[i] = differentiate(ex.args[i], target)
    end
    return Expr(:call, new_args, Any)
end
```

The Subtraction Rule can be defined almost identically:

```julia
function differentiate_subtraction(ex::Expr, target::Symbol)
    n = length(ex.args)
    new_args = Array(Any, n)
    new_args[1] = :-
    for i in 2:n
        new_args[i] = differentiate(ex.args[i], target)
    end
    return Expr(:call, new_args, Any)
end
```

The Product Rule is a little more interesting because we need to build up an expression whose components are themselves expressions:

```julia
function differentiate_product(ex::Expr, target::Symbol)
    n = length(ex.args)
    res_args = Array(Any, n)
    res_args[1] = :+
    for i in 2:n
        new_args = Array(Any, n)
        new_args[1] = :*
        for j in 2:n
            if j == i
                new_args[j] = differentiate(ex.args[j], target)
            else
                new_args[j] = ex.args[j]
            end
        end
        res_args[i] = Expr(:call, new_args, Any)
    end
    return Expr(:call, res_args, Any)
end
```

Last, but not least, here’s the Quotient Rule, which is a little more complex. We can code this rule up in a more explicit fashion that doesn’t use any loops so that we can directly see the steps we’re taking:

```julia
function differentiate_quotient(ex::Expr, target::Symbol)
    return Expr(:call,
                {
                    :/,
                    Expr(:call,
                         {
                             :-,
                             Expr(:call,
                                  {
                                      :*,
                                      differentiate(ex.args[2], target),
                                      ex.args[3]
                                  },
                                  Any),
                             Expr(:call,
                                  {
                                      :*,
                                      ex.args[2],
                                      differentiate(ex.args[3], target)
                                  },
                                  Any)
                         },
                         Any),
                    Expr(:call,
                         {
                             :^,
                             ex.args[3],
                             2
                         },
                         Any)
                },
                Any)
end
```

Now that we have all of these basic rules of calculus implemented as functions, we’ll build up a lookup table that our final differentiate function can use to dispatch on the kind of function that’s being differentiated during each call to differentiate:

```julia
differentiate_lookup = {
    :+ => differentiate_sum,
    :- => differentiate_subtraction,
    :* => differentiate_product,
    :/ => differentiate_quotient
}
```

With all of the core machinery in place, the final definition of differentiate is very simple:

```julia
function differentiate(ex::Expr, target::Symbol)
    if ex.head == :call
        if has(differentiate_lookup, ex.args[1])
            return differentiate_lookup[ex.args[1]](ex, target)
        else
            error("Don't know how to differentiate $(ex.args[1])")
        end
    else
        return differentiate(ex.head, target)
    end
end
```

I’ve put all of these snippets together in a single GitHub Gist. To try out this new differentiation function, let’s copy the contents of that Gist into a file called differentiate.jl. We can then load the contents of that file into Julia at the REPL using include, which will allow us to try out our differentiation tool:

```julia
julia> include("differentiate.jl")

julia> differentiate(:(x + x*x), :x)
:(+(1,+(*(1,x),*(x,1))))

julia> differentiate(:(x + a*x), :x)
:(+(1,+(*(0,x),*(a,1))))
```

While the expressions that are constructed by our differentiate function are ugly, they are correct: they just need to be simplified so that things like *(0, x) are replaced with 0. If you’d like to see how to write code to perform some basic simplifications, you can see the simplify function I’ve been building for Julia’s new Calculus package. That codebase includes all of the functionality shown here for differentiate, along with several other rules that make the system more powerful.
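To make the idea of simplification concrete, here is a hedged sketch in Python rather than Julia (this is illustrative only, not the actual simplify code from the Calculus package): expressions are modeled as nested tuples like `('*', 0, 'x')` standing in for Julia Expr objects, and the two rewrites mentioned above are applied recursively:

```python
# Illustrative sketch: expressions are nested tuples, not real Julia Exprs.
def simplify(ex):
    if not isinstance(ex, tuple):
        return ex                              # symbols and numbers are atoms
    op, *args = ex
    args = [simplify(a) for a in args]
    if op == '*':
        if 0 in args:
            return 0                           # zero annihilates a product
        args = [a for a in args if a != 1]     # drop multiplicative identities
        if len(args) <= 1:
            return args[0] if args else 1
    if op == '+':
        args = [a for a in args if a != 0]     # drop additive identities
        if len(args) <= 1:
            return args[0] if args else 0
    return (op, *args)

# differentiate(:(x + a*x), :x) produced +(1, +(*(0,x), *(a,1)));
# the same shape simplifies down to the tuple for 1 + a:
print(simplify(('+', 1, ('+', ('*', 0, 'x'), ('*', 'a', 1)))))  # → ('+', 1, 'a')
```

The real simplifier has to handle many more algebraic identities, but the recursive bottom-up rewriting shown here is the core of the approach.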

What I love about Julia is the ease with which one can move from low-level bit operations like those described in my previous post to high-level operations that manipulate Julian expressions. By allowing the programmer to manipulate expressions programmatically, Julia has copied one of the most beautiful parts of Lisp.

## Computers are Machines

When people try out Julia for the first time, many of them are worried by the following example:

```julia
julia> factorial(n) = n == 0 ? 1 : n * factorial(n - 1)

julia> factorial(20)
2432902008176640000

julia> factorial(21)
-4249290049419214848
```

If you’re not familiar with computer architecture, this result is very troubling. Why would Julia claim that the factorial of 21 is a negative number?

The answer is simple, but depends upon a set of concepts that are largely unfamiliar to programmers who, like me, grew up using modern languages like Python and Ruby. Julia thinks that the factorial of 21 is a negative number because computers are machines.

Because they are machines, computers represent numbers using small groups of bits. Most modern machines work with groups of 64 bits at a time. If an operation has to work with more than 64 bits at a time, it will be slower than a similar operation that only works with 64 bits at a time.

As a result, if you want to write fast computer code, it helps to only execute operations that are easily expressible using groups of 64 bits.

Arithmetic involving small integers fits into the category of operations that only require 64 bits at a time. Every integer between -9223372036854775808 and 9223372036854775807 can be expressed using just 64 bits. You can see this for yourself by using the typemin and typemax functions in Julia:

```julia
julia> typemin(Int64)
-9223372036854775808

julia> typemax(Int64)
9223372036854775807
```

If you do things like the following, the computer will quickly produce correct results:

```julia
julia> typemin(Int64) + 1
-9223372036854775807

julia> typemax(Int64) - 1
9223372036854775806
```

But things go badly if you try to break out of the range of numbers that can be represented using only 64 bits:

```julia
julia> typemin(Int64) - 1
9223372036854775807

julia> typemax(Int64) + 1
-9223372036854775808
```

The reasons for this are not obvious at first, but make more sense if you examine the actual bits being operated upon:

```julia
julia> bits(typemax(Int64))
"0111111111111111111111111111111111111111111111111111111111111111"

julia> bits(typemax(Int64) + 1)
"1000000000000000000000000000000000000000000000000000000000000000"

julia> bits(typemin(Int64))
"1000000000000000000000000000000000000000000000000000000000000000"
```

When it adds 1 to a number, the computer blindly uses a simple arithmetic rule for individual bits that works just like the carry system you learned as a child. This carrying rule is very efficient, but works poorly if you end up flipping the very first bit in a group of 64 bits. The reason is that this first bit represents the sign of an integer. When this special first bit gets flipped by an operation that overflows the space provided by 64 bits, everything else breaks down.

The special interpretation given to certain bits in a group of 64 is the reason that factorial of 21 is a negative number when Julia computes it. You can confirm this by looking at the exact bits involved:

```julia
julia> bits(factorial(20))
"0010000111000011011001110111110010000010101101000000000000000000"

julia> bits(factorial(21))
"1100010100000111011111010011011010111000110001000000000000000000"
```

Here, as before, the computer has just executed the operations necessary to perform multiplication by 21. But the result has flipped the sign bit, which causes the result to appear to be a negative number.
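We can reproduce this wraparound arithmetic outside of Julia. The following Python sketch (illustrative only; Julia's hardware integers do this in silicon, not in software) keeps just the low 64 bits of an exact result and then reads the top bit back as a sign, exactly as two's-complement hardware does:

```python
def wrap_int64(n):
    """Reduce an exact integer the way 64-bit two's-complement hardware does."""
    n &= (1 << 64) - 1          # keep only the low 64 bits
    if n >= (1 << 63):          # top bit set: reinterpret as a negative number
        n -= 1 << 64
    return n

# The exact value of 21! is 51090942171709440000; reduced to 64 bits it
# reproduces the "wrong" answer Julia printed above:
print(wrap_int64(51090942171709440000))  # → -4249290049419214848
```

The same function maps typemax(Int64) + 1 to typemin(Int64), matching the earlier examples.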

There is a way around this: you can tell Julia to represent integers using more than 64 bits at a time by using the BigInt type:

```julia
julia> require("BigInt")

julia> BigInt(typemax(Int))
9223372036854775807

julia> BigInt(typemax(Int)) + 1
9223372036854775808

julia> BigInt(factorial(20)) * 21
51090942171709440000
```

Now everything works smoothly. By working with BigInts automatically, languages like Python avoid these concerns:

```python
>>> factorial(20)
2432902008176640000
>>> factorial(21)
51090942171709440000L
```

The L at the end of the numbers here indicates that Python has automatically converted a normal integer into something like Julia’s BigInt. But this automatic conversion comes at a substantial cost: every operation that stays within the bounds of 64-bit arithmetic is slower in Python than Julia because of the time required to check whether an operation might go beyond the 64-bit bound.

Python’s automatic conversion approach is safer, but slower. Julia’s approach is faster, but requires that the programmer understand more about the computer’s architecture. Julia achieves its performance by confronting the fact that computers are machines head on. This is confusing at first and frustrating at times, but it’s a price that you have to pay for high performance computing. Everyone who grew up with C is used to these issues, but they’re largely unfamiliar to programmers who grew up with modern languages like Python. In many ways, Julia sets itself apart from other new languages by its attempt to recover some of the power that was lost in the transition from C to languages like Python. But the transition comes with a substantial learning curve.

And that’s why I wrote this post.

## What is Correctness for Statistical Software?

### Introduction

A few months ago, Drew Conway and I gave a webcast that tried to teach people about the basic principles behind linear and logistic regression. To illustrate logistic regression, we worked through a series of progressively more complex spam detection problems.

The simplest data set we used was the following:

This data set has one clear virtue: the correct classifier defines a decision boundary that implements a simple OR operation on the values of MentionsViagra and MentionsNigeria. Unfortunately, that very simplicity causes the logistic regression model to break down, because the MLE coefficients for MentionsViagra and MentionsNigeria should be infinite. In some ways, our elegantly simple example for logistic regression is actually the statistical equivalent of a SQL injection.
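To see why perfect separation pushes the MLE to infinity, consider a toy version of the problem. The following Python sketch (hypothetical data, not the webcast's data set) evaluates the logistic log-likelihood at ever larger coefficients and shows that it never stops improving, so no finite coefficient can be the maximizer:

```python
import math

# Toy separable data: the label is 1 exactly when the feature is positive.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def log_likelihood(beta):
    """Bernoulli log-likelihood of the no-intercept model p = sigmoid(beta * x)."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-beta * x))
        ll += math.log(p if y == 1 else 1.0 - p)
    return ll

# The likelihood keeps improving as beta grows; its supremum is only
# approached as beta -> infinity, so the MLE "is infinite".
for beta in (1.0, 10.0, 100.0):
    print(beta, log_likelihood(beta))
```

An iterative fitting routine run on such data will simply keep inflating the coefficient until numerical limits stop it, which is exactly what happens inside R's glm.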

In our webcast, Drew and I decided to ignore that concern because R produces a useful model fit despite the theoretical MLE coefficients being infinite:

Although R produces finite coefficients here despite theory telling us to expect something else, I should note that R does produce a somewhat cryptic warning during the model fitting step that alerts the very well-informed user that something has gone awry:

```
glm.fit: fitted probabilities numerically 0 or 1 occurred
```

It seems clear to me that R’s warning would be better off if it were substantially more verbose:

```
Warning from glm.fit():

Fitted probabilities could not be distinguished from 0's or 1's under
finite precision floating point arithmetic. As a result, the
optimization algorithm for GLM fitting may have failed to converge.
You should check whether your data set is linearly separable.
```

Although I’ve started this piece with a very focused example of how R’s implementation of logistic regression differs from the purely mathematical definition of that model, I’m not really that interested in the details of how different pieces of software implement logistic regression. If you’re interested in learning more about that kind of thing, I’d suggest reading the excellent piece on R’s logistic regression function that can be found on the Win-Vector blog.

Instead, what interests me right now are a set of broader questions about how statistical software should work. What is the standard for correctness for statistical software? And what is the standard for usefulness? And how closely related are those two criteria?

Let’s think about each of them separately:

• Usefulness: If you want to simply make predictions based on your model, then you want R to produce a fitted model for this data set that makes reasonably good predictions on the training data. R achieves that goal: the fitted predictions for R’s logistic regression model are numerically almost indistinguishable from the 0/1 values that we would expect from a maximum likelihood algorithm. If you want useful algorithms, then R’s decision to produce some model fit is justified.
• Correctness: If you want software to either produce mathematically correct answers or to die trying, then R’s implementation of logistic regression is not for you. If you insist on theoretical purity, it seems clear that R should not merely emit a warning here, but should instead throw an inescapable error rather than return an imperfect model fit. You might even want R to go further and to teach the end-user about the virtues of SVMs or the general usefulness of parameter regularization. Whatever you’d like to see, one thing is sure: you definitely do not want R to produce model fits that are mathematically incorrect.

It’s remarkable that such a simple example can bring the goals of predictive power and theoretical correctness into such direct opposition. In part, the conflict arises because both criteria are entangled with a third consideration: computer algorithms are not generally equivalent to their mathematical idealizations. Purely computational concerns involving floating-point imprecision and finite compute time mean that we cannot generally hope for computers to produce answers identical to those prescribed by theoretical mathematics.
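A fitted probability can be strictly less than 1 in exact mathematics and yet indistinguishable from 1 on the machine. A one-line check in Python (which uses the same double-precision arithmetic as R) makes this concrete:

```python
import math

# exp(-40) is about 4e-18, smaller than the gap between 1.0 and the next
# representable double (about 2.2e-16), so 1 + exp(-40) rounds to exactly 1.0
# and the "fitted probability" below compares equal to 1.
p = 1.0 / (1.0 + math.exp(-40.0))
print(p == 1.0)  # → True
```

This is precisely the situation R's warning about fitted probabilities "numerically 0 or 1" is describing.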

What’s fascinating about this specific example is that there’s something strangely desirable about floating-point numbers having finite precision: no one with any practical interest in modeling is likely to be interested in fitting a model with infinite-valued parameters. R’s decision to blindly run an optimization algorithm here unwittingly achieves a form of regularization like that employed in early stopping algorithms for fitting neural networks. And that may be a good thing if you’re interested in using a fitted model to make predictions, even though it means that R produces quantities like standard errors that have no real coherent interpretation in terms of frequentist estimators.

Whatever your take is on the virtues or vices of R’s implementation of logistic regression, there’s a broad take away from this example that I’ve been dealing with constantly while working on Julia: any programmer designing statistical software has to make decisions that involve personal judgment. The requirement for striking a compromise between correctness and usefulness is so nearly omnipresent that one of the most popular pieces of statistical software on Earth implements logistic regression using an algorithm that a pure theorist could argue is basically broken. But it produces an answer that has practical value. And that might just be the more important thing for statistical software to do.

## What is Economics Studying?

Having spent all five of my years as a graduate student trying to get psychologists and economists to agree on basic ideas about decision-making, I think the following two pieces complement one another perfectly:

• Cosma Shalizi’s comments on rereading Blanchard and Fischer’s “Lectures on Macroeconomics”:

Blanchard and Fischer is about “modern” macro, models based on agents who know what the economy is like optimizing over time, possibly under some limits. This is the DSGE style of macro, which has lately come into so much discredit — thoroughly deserved discredit. Chaikin and Lubensky is about modern condensed matter physics, especially soft condensed matter, based on principles of symmetry-breaking and phase transitions. Both books are about building stylized theoretical models and solving them to see what they imply; implicitly they are also about the considerations which go into building models in their respective domains.

What is very striking, looking at them side by side, is that while these are both books about mathematical modeling, Chaikin and Lubensky presents empirical data, compares theoretical predictions to experimental results, and goes into some detail into the considerations which lead to this sort of model for nematic liquid crystals, or that model for magnetism. There is absolutely nothing like this in Blanchard and Fischer — no data at all, no comparison of models to reality, no evidence of any kind supporting any of the models. There is not even an attempt, that I can find, to assess different macroeconomic models, by comparing their qualitative predictions to each other and to historical reality. I presume that Blanchard and Fischer, as individual scholars, are not quite so indifferent to reality, but their pedagogy is.

I will leave readers to draw their own morals.

• Itzhak Gilboa’s argument that economic theory is a rhetorical apparatus rather than a set of direct predictions about the world in which we live.

## A Cheap Criticism of p-Values

One of these days I am going to finish my series on problems with how NHST is used in the social sciences. Until then, here is a cheap criticism of p-values that I came up with today.

To make sense of my complaint, you’ll want to head over to Andy Gelman’s blog and read the comments on his recent blog post about p-values. Reading them makes one thing clear: not even a large group of stats wonks can agree on how to think about p-values. How could we ever hope for understanding from the kind of people who are only reporting p-values because they’re forced to do so by their fields?

## The State of Statistics in Julia

Updated 12.2.2012: Added sample output based on a suggestion from Stefan Karpinski.

### Introduction

Over the last few weeks, the Julia core team has rolled out a demo version of Julia’s package management system. While the Julia package system is still very much in beta, it nevertheless provides the first plausible way for non-expert users to see where Julia’s growing community of developers is heading.

To celebrate some of the amazing work that’s already been done to make Julia usable for day-to-day data analysis, I’d like to give a brief overview of the state of statistical programming in Julia. There are now several packages that, taken as a whole, suggest that Julia may really live up to its potential and become the next generation language for data analysis.

### Getting Julia Installed

If you’d like to try out Julia for yourself, you’ll first need to clone the current Julia repo from GitHub and then build Julia from source as described in the Julia README. Compiling Julia for the first time can take up to two hours, but updating Julia afterwards will be quite fast once you’ve gotten a working copy of the language and its dependencies installed on your system. After you have Julia built, you should add its main directory to your path and then open up the Julia REPL by typing julia at the command line.

### Installing Packages

Once Julia’s REPL is running, you can use the following commands to start installing packages:

```julia
julia> require("pkg")

julia> Pkg.init()
Initialized empty Git repository in /Users/johnmyleswhite/.julia/.git/
Cloning into 'METADATA'...
remote: Counting objects: 443, done.
remote: Compressing objects: 100% (208/208), done.
remote: Total 443 (delta 53), reused 423 (delta 33)
Receiving objects: 100% (443/443), 38.98 KiB, done.
Resolving deltas: 100% (53/53), done.
[master (root-commit) dbd486e] empty package repo
 2 files changed, 4 insertions(+)
 create mode 100644 .gitmodules
 create mode 160000 METADATA
 create mode 100644 REQUIRE

julia> Pkg.add("DataFrames", "Distributions", "MCMC", "Optim", "NHST", "Clustering")
Installing DataFrames: v0.0.0
Cloning into 'DataFrames'...
remote: Counting objects: 1340, done.
remote: Compressing objects: 100% (562/562), done.
remote: Total 1340 (delta 760), reused 1229 (delta 655)
Receiving objects: 100% (1340/1340), 494.79 KiB, done.
Resolving deltas: 100% (760/760), done.
Installing Distributions: v0.0.0
Cloning into 'Distributions'...
remote: Counting objects: 49, done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 49 (delta 8), reused 49 (delta 8)
Receiving objects: 100% (49/49), 17.29 KiB, done.
Resolving deltas: 100% (8/8), done.
Installing MCMC: v0.0.0
Cloning into 'MCMC'...
warning: no common commits
remote: Counting objects: 155, done.
remote: Compressing objects: 100% (97/97), done.
remote: Total 155 (delta 66), reused 140 (delta 51)
Receiving objects: 100% (155/155), 256.68 KiB, done.
Resolving deltas: 100% (66/66), done.
Installing NHST: v0.0.0
Cloning into 'NHST'...
remote: Counting objects: 20, done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 20 (delta 2), reused 19 (delta 1)
Receiving objects: 100% (20/20), 4.31 KiB, done.
Resolving deltas: 100% (2/2), done.
Installing Optim: v0.0.0
Cloning into 'Optim'...
remote: Counting objects: 497, done.
remote: Compressing objects: 100% (191/191), done.
remote: Total 497 (delta 318), reused 476 (delta 297)
Receiving objects: 100% (497/497), 79.68 KiB, done.
Resolving deltas: 100% (318/318), done.
Installing Options: v0.0.0
Cloning into 'Options'...
remote: Counting objects: 10, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 6 (delta 0)
Receiving objects: 100% (10/10), done.
Resolving deltas: 100% (1/1), done.
Installing Clustering: v0.0.0
Cloning into 'Clustering'...
remote: Counting objects: 38, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 38 (delta 7), reused 38 (delta 7)
Receiving objects: 100% (38/38), 7.77 KiB, done.
Resolving deltas: 100% (7/7), done.
```

That will get you started with some of the core tools for doing statistical programming in Julia. You’ll probably also want to install another package called “RDatasets”, which provides access to 570 of the classic data sets available in R. This package has a much larger file size than the others, which is why I recommend installing it after you’ve first installed the other packages:

```julia
julia> require("pkg")

julia> Pkg.add("RDatasets")
Installing RDatasets: v0.0.0
Cloning into 'RDatasets'...
remote: Counting objects: 609, done.
remote: Compressing objects: 100% (588/588), done.
remote: Total 609 (delta 21), reused 605 (delta 17)
Receiving objects: 100% (609/609), 10.56 MiB | 1.15 MiB/s, done.
Resolving deltas: 100% (21/21), done.
```

Assuming that you’ve gotten everything working, you can then type the following to load Fisher’s classic Iris data set:

```julia
julia> load("RDatasets")
Warning: redefinition of constant NARule ignored.
Warning: New definition ==(NAtype,Any) is ambiguous with ==(Any,AbstractArray{T,N}).
         Make sure ==(NAtype,AbstractArray{T,N}) is defined first.
Warning: New definition ==(Any,NAtype) is ambiguous with ==(AbstractArray{T,N},Any).
         Make sure ==(AbstractArray{T,N},NAtype) is defined first.
Warning: New definition replace!(PooledDataVec{S},NAtype,T) is ambiguous with replace!(PooledDataVec{S},T,NAtype).
         Make sure replace!(PooledDataVec{S},NAtype,NAtype) is defined first.
Warning: New definition promote_rule(Type{AbstractDataVec{T}},Type{T}) is ambiguous with promote_rule(Type{AbstractDataVec{S}},Type{T}).
         Make sure promote_rule(Type{AbstractDataVec{T}},Type{T}) is defined first.
Warning: New definition ^(NAtype,T<:Union(String,Number)) is ambiguous with ^(Any,Integer).
         Make sure ^(NAtype,_<:Integer) is defined first.
Warning: New definition ^(DataVec{T},Number) is ambiguous with ^(Any,Integer).
         Make sure ^(DataVec{T},Integer) is defined first.
Warning: New definition ^(DataFrame,Union(NAtype,Number)) is ambiguous with ^(Any,Integer).
         Make sure ^(DataFrame,Integer) is defined first.

julia> using DataFrames

julia> using RDatasets

julia> iris = data("datasets", "iris")
DataFrame  (150,6)
        Sepal.Length Sepal.Width Petal.Length Petal.Width Species
[1,]     1  5.1  3.5  1.4  0.2  "setosa"
[2,]     2  4.9  3.0  1.4  0.2  "setosa"
[3,]     3  4.7  3.2  1.3  0.2  "setosa"
[4,]     4  4.6  3.1  1.5  0.2  "setosa"
[5,]     5  5.0  3.6  1.4  0.2  "setosa"
[6,]     6  5.4  3.9  1.7  0.4  "setosa"
[7,]     7  4.6  3.4  1.4  0.3  "setosa"
[8,]     8  5.0  3.4  1.5  0.2  "setosa"
[9,]     9  4.4  2.9  1.4  0.2  "setosa"
[10,]   10  4.9  3.1  1.5  0.1  "setosa"
[11,]   11  5.4  3.7  1.5  0.2  "setosa"
[12,]   12  4.8  3.4  1.6  0.2  "setosa"
[13,]   13  4.8  3.0  1.4  0.1  "setosa"
[14,]   14  4.3  3.0  1.1  0.1  "setosa"
[15,]   15  5.8  4.0  1.2  0.2  "setosa"
[16,]   16  5.7  4.4  1.5  0.4  "setosa"
[17,]   17  5.4  3.9  1.3  0.4  "setosa"
[18,]   18  5.1  3.5  1.4  0.3  "setosa"
[19,]   19  5.7  3.8  1.7  0.3  "setosa"
[20,]   20  5.1  3.8  1.5  0.3  "setosa"
  :
[131,] 131  7.4  2.8  6.1  1.9  "virginica"
[132,] 132  7.9  3.8  6.4  2.0  "virginica"
[133,] 133  6.4  2.8  5.6  2.2  "virginica"
[134,] 134  6.3  2.8  5.1  1.5  "virginica"
[135,] 135  6.1  2.6  5.6  1.4  "virginica"
[136,] 136  7.7  3.0  6.1  2.3  "virginica"
[137,] 137  6.3  3.4  5.6  2.4  "virginica"
[138,] 138  6.4  3.1  5.5  1.8  "virginica"
[139,] 139  6.0  3.0  4.8  1.8  "virginica"
[140,] 140  6.9  3.1  5.4  2.1  "virginica"
[141,] 141  6.7  3.1  5.6  2.4  "virginica"
[142,] 142  6.9  3.1  5.1  2.3  "virginica"
[143,] 143  5.8  2.7  5.1  1.9  "virginica"
[144,] 144  6.8  3.2  5.9  2.3  "virginica"
[145,] 145  6.7  3.3  5.7  2.5  "virginica"
[146,] 146  6.7  3.0  5.2  2.3  "virginica"
[147,] 147  6.3  2.5  5.0  1.9  "virginica"
[148,] 148  6.5  3.0  5.2  2.0  "virginica"
[149,] 149  6.2  3.4  5.4  2.3  "virginica"
[150,] 150  5.9  3.0  5.1  1.8  "virginica"

julia> head(iris)
DataFrame  (6,6)
        Sepal.Length Sepal.Width Petal.Length Petal.Width Species
[1,]     1  5.1  3.5  1.4  0.2  "setosa"
[2,]     2  4.9  3.0  1.4  0.2  "setosa"
[3,]     3  4.7  3.2  1.3  0.2  "setosa"
[4,]     4  4.6  3.1  1.5  0.2  "setosa"
[5,]     5  5.0  3.6  1.4  0.2  "setosa"
[6,]     6  5.4  3.9  1.7  0.4  "setosa"

julia> tail(iris)
DataFrame  (6,6)
        Sepal.Length Sepal.Width Petal.Length Petal.Width Species
[1,]   145  6.7  3.3  5.7  2.5  "virginica"
[2,]   146  6.7  3.0  5.2  2.3  "virginica"
[3,]   147  6.3  2.5  5.0  1.9  "virginica"
[4,]   148  6.5  3.0  5.2  2.0  "virginica"
[5,]   149  6.2  3.4  5.4  2.3  "virginica"
[6,]   150  5.9  3.0  5.1  1.8  "virginica"
```

Now that you can see that Julia can handle complex data sets, let’s talk a little bit about the packages that make statistical analysis in Julia possible.

### The DataFrames Package

The DataFrames package provides data structures for working with tabular data in Julia. At a minimum, this means that DataFrames provides tools for dealing with individual columns of missing data, which are called DataVecs. A collection of DataVecs allows one to build up a DataFrame, which provides a tabular data structure like that used by R's data.frame type.

```julia
julia> load("DataFrames")

julia> using DataFrames

julia> data = {"Value" => [1, 2, 3], "Label" => ["A", "B", "C"]}
Warning: imported binding for data overwritten in module Main
{"Label"=>["A", "B", "C"],"Value"=>[1, 2, 3]}

julia> df = DataFrame(data)
DataFrame  (3,2)
        Label Value
[1,]    "A"   1
[2,]    "B"   2
[3,]    "C"   3

julia> df["Value"]
3-element DataVec{Int64}
[1,2,3]

julia> df[1, "Value"] = NA
NA

julia> head(df)
DataFrame  (3,2)
        Label Value
[1,]    "A"   NA
[2,]    "B"   2
[3,]    "C"   3
```

### Distributions

The Distributions package provides tools for working with probability distributions in Julia. It reifies distributions as types in Julia’s large type hierarchy, which means that quite generic names like rand can be used to sample from complex distributions:

```julia
julia> load("Distributions")

julia> using Distributions

julia> x = rand(Normal(11.0, 3.0), 10_000)
10000-element Float64 Array:
  6.87693
 13.3676
  7.25008
  8.82833
 10.6911
  7.1004
 13.7449
  5.96412
  8.57957
 15.2737
  ⋮
  4.89007
 15.1509
  6.32376
  7.83847
 14.4476
 14.2974
  9.74783
  9.67398
 14.4992

julia> mean(x)
11.00366217730023

julia> var(x)
Warning: Possible conflict in library symbol ddot_
9.288938550823996
```

### Optim

The Optim package provides tools for numerical optimization of arbitrary functions in Julia. It provides a function, optimize, which works a bit like R’s optim function.

```julia
julia> load("Optim")

julia> using Optim

julia> f = v -> (10.9 - v[1])^2 + (7.3 - v[2])^2
#

julia> initial_guess = [0.0, 0.0]
2-element Float64 Array:
 0.0
 0.0

julia> results = optimize(f, initial_guess)
Warning: Possible conflict in library symbol dcopy_
OptimizationResults("Nelder-Mead",[0.333333, 0.333333],[10.9, 7.29994],3.2848148720460163e-9,38,true)

julia> results.minimum
2-element Float64 Array:
 10.9
  7.29994
```

### MCMC

The MCMC package provides tools for sampling from arbitrary probability distributions using Markov Chain Monte Carlo. It provides functions like slice_sampler, which allows one to sample from a (potentially unnormalized) density function using Radford Neal’s slice sampling algorithm.

```julia
julia> load("MCMC")

julia> using MCMC

julia> d = Normal(17.29, 1.0)
Normal(17.29,1.0)

julia> f = x -> logpdf(d, x)
#

julia> [slice_sampler(0.0, f) for i in 1:100]
100-element (Float64,Float64) Array:
 (2.7589100475626323,-106.49522613611775)
 (22.840595204318323,-16.323492094305458)
 (0.11800384424353683,-148.35766451986206)
 (25.507580447082677,-34.68325273534245)
 (25.794565860846134,-37.08275877393945)
 (25.898128716394307,-37.96887853221083)
 (9.309878825853284,-32.76010551023705)
 (30.824102772255355,-92.50490745818972)
 (9.108789186504177,-34.38504372063516)
 (25.547686903330494,-35.01363502992266)
 ⋮
 (5.795001414731885,-66.98643477086263)
 (15.50115292212293,-2.518925467219337)
 (12.046429369881345,-14.666455009726143)
 (17.25455052645699,-0.919566865791911)
 (25.494698549206657,-34.57747767488159)
 (1.8340810959111111,-120.36165311809079)
 (2.7112428736526177,-107.18901820771696)
 (9.21203292192012,-33.54571459047587)
 (19.12274407701784,-2.5984139591266584)
```

### NHST

The NHST package provides tools for testing standard statistical hypotheses using null hypothesis significance testing tools like the t-test and the chi-squared test.

```julia
julia> load("Distributions")

julia> using Distributions

julia> load("NHST")

julia> using NHST

julia> d1 = Normal(17.29, 1.0)
Normal(17.29,1.0)

julia> d2 = Normal(0.0, 1.0)
Normal(0.0,1.0)

julia> x = rand(d1, 1_000)
1000-element Float64 Array:
 15.7085
 18.585
 16.6036
 18.962
 17.8715
 16.6814
 17.9676
 16.8924
 16.6022
 17.9813
  ⋮
 17.1339
 17.3964
 18.6184
 16.7238
 18.5003
 16.1618
 17.9198
 17.4928
 18.715

julia> y = rand(d2, 1_000)
1000-element Float64 Array:
  0.664885
  0.147182
  0.96265
  0.24282
  1.881
 -0.632478
  0.539297
  0.996562
 -0.483302
  0.514629
  ⋮
  2.06249
 -0.549444
  0.857575
 -1.47464
 -2.33243
  0.510751
 -0.381069
 -1.49165
  0.0521203

julia> t_test(x, y)
HypothesisTest("t-Test",{"t"=>392.2838409538002},{"df"=>1989.732411290855},0.0,[17.1535, 17.3293],{"mean of x"=>17.24357323225425,"mean of y"=>0.0021786523177457794},0.0,"two-sided","Welch Two Sample t-test","x and y",1989.732411290855)
```

### Clustering

The Clustering package provides tools for doing simple k-means style clustering.

```julia
julia> load("Clustering")

julia> using Clustering

julia> srand(1)

julia> n = 100
100

julia> x = vcat(randn(n, 2), randn(n, 2) .+ 10)
200x2 Float64 Array:
  0.0575636  -0.112322
 -1.8329     -0.101326
  0.370699   -0.956183
  1.31816    -1.44351
  0.787598    0.148386
  0.712214   -1.293
 -1.8578     -1.06208
 -0.746303   -0.0439182
  1.12082    -2.00616
  0.364646   -1.09331
  ⋮
 10.1974     10.5583
 11.0832      8.92082
 11.5414     11.6022
  9.0453     11.5093
  8.86714    10.4233
 10.7336     10.7201
  8.60415     9.13942
  8.62482     8.51701
 10.5044     10.3841

julia> true_assignments = vcat(zeros(n), ones(n))
200-element Float64 Array:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

julia> results = k_means(x, 2)
Warning: Possible conflict in library symbol dgesdd_
Warning: Possible conflict in library symbol dsyrk_
Warning: Possible conflict in library symbol dgemm_
KMeansOutput([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ... 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
2x2 Float64 Array:
 -0.0166203  -0.248904
 10.0418     10.0074
,3,422.9820560670007,true)

julia> results.assignments
200-element Int64 Array:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 2
 2
 2
 2
 2
 2
 2
 2
 2
```

While all of this software is still quite new and often still buggy, being able to work with these tools through a simple package system has made me more excited than ever about the future of Julia as a language for data analysis. There is, of course, one thing conspicuously lacking right now: a really powerful visualization toolkit for interactive graphics like that provided by R's ggplot2 package. Hopefully something will come into being within the next few months.