Modes, Medians and Means: A Unifying Perspective

Introduction / Warning

Any traditional introductory statistics course will teach students the definitions of modes, medians and means. But, because introductory courses can’t assume that students have much mathematical maturity, the close relationship between these three summary statistics can’t be made clear. This post tries to remedy that situation by making it clear that all three concepts arise as specific parameterizations of a more general problem.

To do so, I’ll need to introduce one non-standard definition that may trouble some readers. In order to simplify my exposition, let’s all agree to assume that \(0^0 = 0\). In particular, we’ll want to assume that \(|0|^0 = 0\), even though \(|\epsilon|^0 = 1\) for all \(\epsilon > 0\). This definition is non-standard, but it greatly simplifies what follows and emphasizes the conceptual unity of modes, medians and means.

Constructing a Summary Statistic

To see how modes, medians and means arise, let’s assume that we have a list of numbers, \((x_1, x_2, \ldots, x_n)\), that we want to summarize. We want our summary to be a single number, which we’ll call \(s\). How should we select \(s\) so that it summarizes the numbers, \((x_1, x_2, \ldots, x_n)\), effectively?

To answer that, we'll assume that \(s\) is an effective summary of the entire list if the typical discrepancy between \(s\) and each of the \(x_i\) is small. With that assumption in place, we only need to do two things: (1) define the notion of discrepancy between two numbers, \(x_i\) and \(s\); and (2) define the notion of a typical discrepancy. Because each number \(x_i\) produces its own discrepancy, we'll need to introduce a method for aggregating the individual discrepancies in order to say something about the typical discrepancy.

Defining a Discrepancy

We could define the discrepancy between a number \(x_i\) and another number \(s\) in many ways. For now, we'll consider only three possibilities. All three of these options satisfy a basic intuition we have about the notion of discrepancy: we expect that the discrepancy between \(x_i\) and \(s\) should be \(0\) if \(|x_i - s| = 0\) and that the discrepancy should be greater than \(0\) if \(|x_i - s| > 0\). That leaves us with one obvious question: how much greater should the discrepancy be when \(|x_i - s| > 0\)?

To answer that question, let’s consider three definitions of the discrepancy, \(d_i\):

  1. \(d_i = |x_i - s|^0\)
  2. \(d_i = |x_i - s|^1\)
  3. \(d_i = |x_i - s|^2\)

How should we think about these three possible definitions?

The first definition, \(d_i = |x_i - s|^0\), says that the discrepancy is \(1\) if \(x_i \neq s\) and is \(0\) only when \(x_i = s\). This notion of discrepancy is typically called zero-one loss in machine learning. Note that this definition implies that anything other than exact equality produces a constant measure of discrepancy. Summarizing \(x_i = 2\) with \(s = 0\) is neither better nor worse than using \(s = 1\). In other words, the discrepancy does not increase at all as \(s\) gets further and further from \(x_i\). You can see this reflected in the far-left column of the image below:


[Figure: discrepancy as a function of \(s\) under zero-one loss (left column), absolute deviation (center column), and squared error (right column)]

The second definition, \(d_i = |x_i - s|^1\), says that the discrepancy is equal to the distance between \(x_i\) and \(s\). This is often called an absolute deviation in machine learning. Note that this definition implies that the discrepancy should increase linearly as \(s\) gets further and further from \(x_i\). This is reflected in the center column of the image above.

The third definition, \(d_i = |x_i - s|^2\), says that the discrepancy is the squared distance between \(x_i\) and \(s\). This is often called a squared error in machine learning. Note that this definition implies that the discrepancy should increase super-linearly as \(s\) gets further and further from \(x_i\). For example, if \(x_i = 1\) and \(s = 0\), then the discrepancy is \(1\). But if \(x_i = 2\) and \(s = 0\), then the discrepancy is \(4\). This is reflected in the far-right column of the image above.
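To make these definitions concrete, here is a small illustrative sketch in Julia (the name discrepancy and its p argument are just conventions for this post; the p == 0 branch encodes the \(|0|^0 = 0\) assumption from the introduction):

# Discrepancy between a data point x and a proposed summary s, where
# p = 0 gives zero-one loss, p = 1 absolute deviation, p = 2 squared error.
function discrepancy(x, s, p)
    if p == 0
        return x == s ? 0 : 1  # enforces the |0|^0 = 0 convention
    else
        return abs(x - s)^p
    end
end

discrepancy(2, 0, 0)  # => 1
discrepancy(2, 0, 1)  # => 2
discrepancy(2, 0, 2)  # => 4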

When we consider a list with a single element, \((x_1)\), these definitions all suggest that we should choose the same number: namely, \(s = x_1\).

Aggregating Discrepancies

Although these definitions do not differ for a list with a single element, they suggest using very different summaries of a list with more than one number in it. To see why, let’s first assume that we’ll aggregate the discrepancy between \(x_i\) and \(s\) for each of the \(x_i\) into a single summary of the quality of a proposed value of \(s\). To perform this aggregation, we’ll sum up the discrepancies over each of the \(x_i\) and call the result \(E\).

In that case, our three definitions give three interestingly different possible definitions of the typical discrepancy, which we’ll call \(E\) for error:
$$
E_0 = \sum_{i} |x_i - s|^0.
$$

$$
E_1 = \sum_{i} |x_i - s|^1.
$$

$$
E_2 = \sum_{i} |x_i - s|^2.
$$

When we write down these expressions in isolation, they don’t look very different. But if we select \(s\) to minimize each of these three types of errors, we get very different numbers. And, surprisingly, each of these three numbers will be very familiar to us.

Minimizing Aggregate Discrepancies

For example, suppose that we try to find the value \(s_0\) that minimizes the zero-one loss definition of the error of a single-number summary. In that case, we require that,
$$
s_0 = \arg \min_{s} \sum_{i} |x_i - s|^0.
$$
What value should \(s_0\) take on? If you give this some extended thought, you'll discover two things: (1) there is not necessarily a single best value of \(s_0\), but potentially many different values; and (2) each of these best values is one of the modes of the \(x_i\). To see why, note that \(E_0 = n - \#\{i : x_i = s\}\), so minimizing \(E_0\) is the same as maximizing the number of data points exactly equal to \(s\).

In other words, the best single number summary of a set of numbers, when you use exact equality as your metric of error, is one of the modes of that set of numbers.

What happens if we consider some of the other definitions? Let’s start by considering \(s_1\):
$$
s_1 = \arg \min_{s} \sum_{i} |x_i - s|^1.
$$
Unlike \(s_0\), \(s_1\) is essentially a unique number: it is the median of the \(x_i\). (When \(n\) is even, any value between the two middle numbers minimizes \(E_1\) equally well, and the conventional median is one such value.) That is, the best summary of a set of numbers, when you use absolute differences as your metric of error, is the median of that set of numbers.

Since we’ve just found that the mode and the median appear naturally, we might wonder if other familiar basic statistics will appear. Luckily, they will. If we look for,
$$
s_2 = \arg \min_{s} \sum_{i} |x_i - s|^2,
$$
we’ll find that, like \(s_1\), \(s_2\) is again a unique number. Moreover, \(s_2\) is the mean of the \(x_i\). That is, the best summary of a set of numbers, when you use squared differences as your metric of error, is the mean of that set of numbers.
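In the squared-error case you can verify this claim directly: \(E_2\) is a smooth, convex function of \(s\), so we can set its derivative to zero,
$$
\frac{d E_2}{d s} = -2 \sum_{i} (x_i - s) = 0
\quad \Longrightarrow \quad
s_2 = \frac{1}{n} \sum_{i} x_i,
$$
which is exactly the mean of the \(x_i\).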

To sum up, we’ve just seen that the three most famous single number summaries of a data set are very closely related: they all minimize the average discrepancy between \(s\) and the numbers being summarized. They only differ in the type of discrepancy being considered:

  1. The mode minimizes the number of times that one of the numbers in our summarized list is not equal to the summary that we use.
  2. The median minimizes the average distance between each number and our summary.
  3. The mean minimizes the average squared distance between each number and our summary.

In equations,

  1. \(\text{The mode of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^0\)
  2. \(\text{The median of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^1\)
  3. \(\text{The mean of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^2\)
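If you'd rather check this numerically than take it on faith, here is a small brute-force sketch in Julia (it assumes the discrepancy function defined earlier; the grid of candidate values is arbitrary):

# Brute-force search for the summary s that minimizes the total discrepancy
# for a given exponent p, over a supplied set of candidate values.
function best_summary(x, p, candidates)
    best_s, best_e = candidates[1], Inf
    for s in candidates
        e = sum([discrepancy(xi, s, p) for xi in x])
        if e < best_e
            best_s, best_e = s, e
        end
    end
    return best_s
end

x = [1, 2, 2, 3, 7]
best_summary(x, 0, x)             # 2, a mode (search over the data itself)
best_summary(x, 1, 0.0:0.01:10.0) # ~2.0, the median
best_summary(x, 2, 0.0:0.01:10.0) # ~3.0, the mean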

Summary

We’ve just seen that the mode, median and mean all arise from a simple parametric process in which we try to minimize the average discrepancy between a single number \(s\) and a list of numbers, \(x_1, x_2, \ldots, x_n\) that we try to summarize using \(s\). In a future blog post, I’ll describe how the ideas we’ve just introduced relate to the concept of \(L_p\) norms. Thinking about minimizing \(L_p\) norms is a generalization of taking modes, medians and means that leads to almost every important linear method in statistics — ranging from linear regression to the SVD.

Thanks

Thanks to Sean Taylor for reading a draft of this post and commenting on it.

Writing Better Statistical Programs in R

A while back a friend asked me for advice about speeding up some R code that they’d written. Because they were running an extensive Monte Carlo simulation of a model they’d been developing, the poor performance of their code had become an impediment to their work.

After I looked through their code, it was clear that the performance hurdles they were stumbling upon could be overcome by adopting a few best practices for statistical programming. This post tries to describe some of the simplest best practices for statistical programming in R. Following these principles should make it easier for you to write statistical programs that are both highly performant and correct.

Write Out a DAG

Whenever you’re running a simulation study, you should appreciate the fact that you are working with a probabilistic model. Even if you are primarily focused upon the deterministic components of this model, the presence of any randomness in the model means that all of the theory of probabilistic models applies to your situation.

Almost certainly the most important concept in probabilistic modeling when you want to write efficient code is the notion of conditional independence. Conditional independence is important because many probabilistic models can be decomposed into simple pieces that can be computed in isolation. Although your model contains many variables, any one of these variables may depend upon only a few other variables in your model. If you can organize all of the variables in your model based on their dependencies, it will be easier to exploit two computational tricks: vectorization and parallelization.

Let’s go through an example. Imagine that you have the model shown below:

$$
X \sim \text{Normal}(0, 1)
$$

$$
Y1 \sim \text{Uniform}(X, X + 1)
$$

$$
Y2 \sim \text{Uniform}(X - 1, X)
$$

$$
Z \sim \text{Cauchy}(Y1 + Y2, 1)
$$

In this model, the distribution of Y1 and Y2 depends only on the value of X. Similarly, the distribution of Z depends only on the values of Y1 and Y2. We can formalize this notion using a DAG, which is a directed acyclic graph that depicts which variables depend upon which other variables. It will help you appreciate the value of this format if you think of the arrows in the DAG below as indicating the flow of causality:


[Figure: DAG for the model, with arrows from X to Y1 and Y2, and from Y1 and Y2 to Z]

Having this DAG drawn out for your model will make it easier to write efficient code, because you can generate all of the values of a variable V simultaneously once you've computed the values of the variables that V depends upon. In our example, you can generate the values of X for all of your different simulations at once and then generate all of the Y1's and Y2's based on the values of X that you generate. You can then exploit this stepwise generation procedure to vectorize and parallelize your code. I'll discuss vectorization to give you a sense of how to exploit the DAG we've drawn to write faster code.

Vectorize Your Simulations

Sequential dependencies are a major bottleneck in languages like R and Matlab that cannot perform loops efficiently. Looking at the DAG for the model shown above, you might think that you can't get around writing a "for" loop to generate samples of this model because some of the variables need to be generated before others.

But, in reality, each individual sample from this model is independent of all of the others. As such, you can draw all of the X’s for all of your different simulations using vectorized code. Below I show how this model could be implemented using loops and then show how this same model could be implemented using vectorized operations:

Loop Code

run.sims <- function(n.sims)
{
	results <- data.frame()
 
	for (sim in 1:n.sims)
	{
		x <- rnorm(1, 0, 1)
		y1 <- runif(1, x, x + 1)
		y2 <- runif(1, x - 1, x)
		z <- rcauchy(1, y1 + y2, 1)
		results <- rbind(results, data.frame(X = x, Y1 = y1, Y2 = y2, Z = z))
	}
 
	return(results)
}
 
b <- Sys.time()
run.sims(5000)
e <- Sys.time()
e - b

Vectorized Code

run.sims <- function(n.sims)
{
	x <- rnorm(n.sims, 0, 1)
	y1 <- runif(n.sims, x, x + 1)
	y2 <- runif(n.sims, x - 1, x)
	z <- rcauchy(n.sims, y1 + y2, 1)
	results <- data.frame(X = x, Y1 = y1, Y2 = y2, Z = z)
 
	return(results)
}
 
b <- Sys.time()
run.sims(5000)
e <- Sys.time()
e - b

The performance gains for this example are substantial when you move from the naive loop code to the vectorized code. (NB: There are also some gains from avoiding the repeated calls to rbind, although they are less important than one might think in this case.)

We could go further and parallelize the vectorized code, but this can be tedious to do in R.

The Data Generation / Model Fitting Cycle

Vectorization can make code in languages like R much more efficient. But speed is useless if you’re not generating correct output. For me, the essential test of correctness for a probabilistic model only becomes clear after I’ve written two complementary functions:

  1. A data generation function that produces samples from my model. We can call this function generate. The arguments to generate are the parameters of my model.
  2. A model fitting function that estimates the parameters of my model based on a sample of data. We can call this function fit. The arguments to fit are the data points we generated using generate.

The value of these two functions is that they can be set up to feed back into one another in the cycle shown below:


[Figure: the data generation / model fitting cycle]

I feel confident in the quality of statistical code when these functions interact stably. If the parameters inferred in a single pass through this loop are close to the original inputs, then my code is likely to work correctly. This amounts to a specific instance of the following design pattern:

data <- generate(model, parameters)
inferred.parameters <- fit(model, data)
reliability <- error(model, parameters, inferred.parameters)

To see this pattern in action, let's step through a process of generating data from a normal distribution and then fitting a normal to the data we generate. You can think of this as a form of "currying" in which we hardcode the value of the parameter model:

n.sims <- 100
n.obs <- 100
 
generate.normal <- function(parameters)
{
	return(rnorm(n.obs, parameters[1], parameters[2]))
}
 
fit.normal <- function(data)
{
	return(c(mean(data), sd(data)))
}
 
distance <- function(true.parameters, inferred.parameters)
{
	return((true.parameters - inferred.parameters)^2)
}
 
reliability <- data.frame()
 
for (sim in 1:n.sims)
{
	parameters <- c(runif(1), runif(1))
	data <- generate.normal(parameters)
	inferred.parameters <- fit.normal(data)
	recovery.error <- distance(parameters, inferred.parameters)
	reliability <- rbind(reliability,
		                 data.frame(True1 = parameters[1],
		                 	        True2 = parameters[2],
		                 	        Inferred1 = inferred.parameters[1],
		                 	        Inferred2 = inferred.parameters[2],
							        Error1 = recovery.error[1],
							        Error2 = recovery.error[2]))
}

If you generate data this way, you will see that our inference code is quite reliable. And you can see that it becomes better if we set n.obs to a larger value like 100,000.

I expect this kind of performance from all of my statistical code. I can’t trust the quality of either generate or fit until I see that they play well together. It is their mutual coherence that inspires faith.

General Lessons

Speed

When writing code in R, you can improve performance by searching for every possible location in which vectorization is possible. Vectorization essentially replaces R’s loops (which are not efficient) with C’s loops (which are efficient) because the computations in a vectorized call are almost always implemented in a language other than R.

Correctness

When writing code for model fitting in any language, you should always ensure that your code can infer the parameters of models when given simulated data with known parameter values.

Americans Live Longer and Work Less

Today I saw an article on Hacker News entitled, "America's CEOs Want You to Work Until You're 70". I was particularly surprised by this article appearing out of the blue because I take it for granted that America will eventually have to raise the retirement age to avoid bankruptcy. After reading the article, I wasn't able to figure out why the story had been run at all. So I decided to do some basic fact-checking.

I tracked down some time series data about life expectancies in the U.S. from Berkeley and then found some time series data about the average age at retirement from the OECD. Plotting just these two bits of information, as shown below, makes it clear that Americans are spending a larger proportion of their life in retirement.


[Figure: U.S. life expectancy and average retirement age over time]

Perhaps I’m just naive, but it seems obvious to me that we can’t afford to take on several additional years of retirement pension liabilities for every living American. If Americans are living longer, we will need them to work longer in order to pay our bills.

Symbolic Differentiation in Julia

A Brief Introduction to Metaprogramming in Julia

In contrast to my previous post, which described one way in which Julia allows (and expects) the programmer to write code that directly employs the atomic operations offered by computers, this post is meant to introduce newcomers to some of Julia’s higher level functions for metaprogramming. To make metaprogramming more interesting, we’re going to build a system for symbolic differentiation in Julia.

Like Lisp, the Julia interpreter represents Julian expressions using normal data structures: every Julian expression is represented using an object of type Expr. You can see this by typing something like :(x + 1) into the Julia REPL:

julia> :(x + 1)
:(+(x,1))
 
julia> typeof(:(x+1))
Expr

Looking at the REPL output when we enter an expression quoted using the : operator, we can see that Julia has rewritten our input expression, originally written using infix notation, as an expression that uses prefix notation. This standardization to prefix notation makes it easier to work with arbitrary expressions because it removes a needless source of variation in the format of expressions.

To develop an intuition for what this kind of expression means to Julia, we can use the dump function to examine its contents:

julia> dump(:(x + 1))
Expr 
  head: Symbol call
  args: Array(Any,(3,))
    1: Symbol +
    2: Symbol x
    3: Int64 1
  typ: Any

Here you can see that a Julian expression consists of three parts:

  1. A head symbol, which describes the basic type of the expression. For this blog post, all of the expressions we’ll work with have head equal to :call.
  2. An Array{Any} that contains the arguments of the head. In our example, the head is :call, which indicates a function call is being made in this expression. The arguments for the function call are:
    1. :+, the symbol denoting the addition function that we are calling.
    2. :x, the symbol denoting the variable x
    3. 1, the number 1 represented as a 64-bit integer.
  3. A typ which stores type inference information. We’ll ignore this information as it’s not relevant to us right now.

Because each expression is built out of normal components, we can construct one piecemeal:

julia> Expr(:call, {:+, 1, 1}, Any)
:(+(1,1))

Because this expression only depends upon constants, we can immediately evaluate it using the eval function:

julia> eval(Expr(:call, {:+, 1, 1}, Any))
2

Symbolic Differentiation in Julia

Now that we know how Julia expressions are built, we can design a very simple prototype system for doing symbolic differentiation in Julia. We’ll build up our system in pieces using some of the most basic rules of calculus:

  1. The Constant Rule: d/dx c = 0
  2. The Symbol Rule: d/dx x = 1, d/dx y = 0
  3. The Sum Rule: d/dx (f + g) = (d/dx f) + (d/dx g)
  4. The Subtraction Rule: d/dx (f - g) = (d/dx f) - (d/dx g)
  5. The Product Rule: d/dx (f * g) = (d/dx f) * g + f * (d/dx g)
  6. The Quotient Rule: d/dx (f / g) = [(d/dx f) * g - f * (d/dx g)] / g^2

Implementing these operations is quite easy once you understand the data structure Julia uses to represent expressions. And some of these operations would be trivial regardless.

For example, here’s the Constant Rule in Julia:

differentiate(x::Number, target::Symbol) = 0

And here’s the Symbol rule:

function differentiate(s::Symbol, target::Symbol)
    if s == target
        return 1
    else
        return 0
    end
end

The first two rules of calculus don’t actually require us to understand anything about Julian expressions. But the interesting parts of a symbolic differentiation system do. To see that, let’s look at the Sum Rule:

function differentiate_sum(ex::Expr, target::Symbol)
    n = length(ex.args)
    new_args = Array(Any, n)
    new_args[1] = :+
    for i in 2:n
        new_args[i] = differentiate(ex.args[i], target)
    end
    return Expr(:call, new_args, Any)
end

The Subtraction Rule can be defined almost identically:

function differentiate_subtraction(ex::Expr, target::Symbol)
    n = length(ex.args)
    new_args = Array(Any, n)
    new_args[1] = :-
    for i in 2:n
        new_args[i] = differentiate(ex.args[i], target)
    end
    return Expr(:call, new_args, Any)
end

The Product Rule is a little more interesting because we need to build up an expression whose components are themselves expressions:

function differentiate_product(ex::Expr, target::Symbol)
    n = length(ex.args)
    res_args = Array(Any, n)
    res_args[1] = :+
    for i in 2:n
       new_args = Array(Any, n)
       new_args[1] = :*
       for j in 2:n
           if j == i
               new_args[j] = differentiate(ex.args[j], target)
           else
               new_args[j] = ex.args[j]
           end
       end
       res_args[i] = Expr(:call, new_args, Any)
    end
    return Expr(:call, res_args, Any)
end

Last, but not least, here’s the Quotient Rule, which is a little more complex. We can code this rule up in a more explicit fashion that doesn’t use any loops so that we can directly see the steps we’re taking:

function differentiate_quotient(ex::Expr, target::Symbol)
    return Expr(:call,
                {
                    :/,
                    Expr(:call,
                         {
                            :-,
                            Expr(:call,
                                 {
                                    :*,
                                    differentiate(ex.args[2], target),
                                    ex.args[3]
                                 },
                                 Any),
                            Expr(:call,
                                 {
                                    :*,
                                    ex.args[2],
                                    differentiate(ex.args[3], target)
                                 },
                                 Any)
                         },
                         Any),
                    Expr(:call,
                         {
                            :^,
                            ex.args[3],
                            2
                         },
                         Any)
                },
                Any)
end

Now that we have all of those basic rules of calculus implemented as functions, we'll build up a lookup table that we can use to tell our final differentiate function where to send new expressions based on the kind of function that's being differentiated during each call to differentiate:

differentiate_lookup = {
                          :+ => differentiate_sum,
                          :- => differentiate_subtraction,
                          :* => differentiate_product,
                          :/ => differentiate_quotient
                       }

With all of the core machinery in place, the final definition of differentiate is very simple:

function differentiate(ex::Expr, target::Symbol)
    if ex.head == :call
        if has(differentiate_lookup, ex.args[1])
            return differentiate_lookup[ex.args[1]](ex, target)
        else
            error("Don't know how to differentiate $(ex.args[1])")
        end
    else
        error("Don't know how to differentiate expressions with head $(ex.head)")
    end
end

I've put all of these snippets together in a single GitHub Gist. To try out this new differentiation function, let's copy the contents of that GitHub gist into a file called differentiate.jl. We can then load the contents of that file into Julia at the REPL using include, which will allow us to try out our differentiation tool:

julia> include("differentiate.jl")
 
julia> differentiate(:(x + x*x), :x)
:(+(1,+(*(1,x),*(x,1))))
 
julia> differentiate(:(x + a*x), :x)
:(+(1,+(*(0,x),*(a,1))))

While the expressions that are constructed by our differentiate function are ugly, they are correct: they just need to be simplified so that things like *(0, x) are replaced with 0. If you’d like to see how to write code to perform some basic simplifications, you can see the simplify function I’ve been building for Julia’s new Calculus package. That codebase includes all of the functionality shown here for differentiate, along with several other rules that make the system more powerful.
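For a flavor of what that involves, here is a minimal sketch of a simplification pass written against the same old-style Expr(:call, args, Any) representation used above; it only knows that a product containing 0 collapses to 0 and that adding 0 changes nothing, whereas the simplify in the Calculus package handles many more cases:

# Numbers and symbols are already as simple as they get.
simplify(x) = x

function simplify(ex::Expr)
    # Simplify the arguments first, then apply a couple of identities.
    n = length(ex.args)
    new_args = Array(Any, n)
    new_args[1] = ex.args[1]
    for i in 2:n
        new_args[i] = simplify(ex.args[i])
    end
    # A product containing a literal 0 collapses to 0.
    if new_args[1] == :*
        for i in 2:n
            if new_args[i] == 0
                return 0
            end
        end
    end
    # Adding a literal 0 leaves the other term unchanged.
    if new_args[1] == :+ && n == 3
        if new_args[2] == 0
            return new_args[3]
        elseif new_args[3] == 0
            return new_args[2]
        end
    end
    return Expr(:call, new_args, Any)
end

simplify(differentiate(:(x + a*x), :x))  # :(+(1,*(a,1))) under these two rules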

What I love about Julia is the ease with which one can move from low-level bit operations like those described in my previous post to high-level operations that manipulate Julian expressions. By allowing the programmer to manipulate expressions programmatically, Julia has copied one of the most beautiful parts of Lisp.

Computers are Machines

When people try out Julia for the first time, many of them are worried by the following example:

julia> factorial(n) = n == 0 ? 1 : n * factorial(n - 1)
 
julia> factorial(20)
2432902008176640000
 
julia> factorial(21)
-4249290049419214848

If you’re not familiar with computer architecture, this result is very troubling. Why would Julia claim that the factorial of 21 is a negative number?

The answer is simple, but depends upon a set of concepts that are largely unfamiliar to programmers who, like me, grew up using modern languages like Python and Ruby. Julia thinks that the factorial of 21 is a negative number because computers are machines.

Because they are machines, computers represent numbers using many small groups of bits. Most modern machines work with groups of 64 bits at a time. If an operation has to work with more than 64 bits at a time, that operation will be slower than a similar operation that only works with 64 bits at a time.

As a result, if you want to write fast computer code, it helps to only execute operations that are easily expressible using groups of 64 bits.

Arithmetic involving small integers fits into the category of operations that only require 64 bits at a time. Every integer between -9223372036854775808 and 9223372036854775807 can be expressed using just 64 bits. You can see this for yourself by using the typemin and typemax functions in Julia:

julia> typemin(Int64)
-9223372036854775808
 
julia> typemax(Int64)
9223372036854775807

If you do things like the following, the computer will quickly produce correct results:

julia> typemin(Int64) + 1
-9223372036854775807
 
julia> typemax(Int64) - 1
9223372036854775806

But things go badly if you try to break out of the range of numbers that can be represented using only 64 bits:

julia> typemin(Int64) - 1
9223372036854775807
 
julia> typemax(Int64) + 1
-9223372036854775808

The reasons for this are not obvious at first, but make more sense if you examine the actual bits being operated upon:

julia> bits(typemax(Int64))
"0111111111111111111111111111111111111111111111111111111111111111"
 
julia> bits(typemax(Int64) + 1)
"1000000000000000000000000000000000000000000000000000000000000000"
 
julia> bits(typemin(Int64))
"1000000000000000000000000000000000000000000000000000000000000000"

When it adds 1 to a number, the computer blindly uses a simple arithmetic rule for individual bits that works just like the carry system you learned as a child. This carrying rule is very efficient, but works poorly if you end up flipping the very first bit in a group of 64 bits. The reason is that this first bit represents the sign of an integer. When this special first bit gets flipped by an operation that overflows the space provided by 64 bits, everything else breaks down.

The special interpretation given to certain bits in a group of 64 is the reason that factorial of 21 is a negative number when Julia computes it. You can confirm this by looking at the exact bits involved:

julia> bits(factorial(20))
"0010000111000011011001110111110010000010101101000000000000000000"
 
julia> bits(factorial(21))
"1100010100000111011111010011011010111000110001000000000000000000"

Here, as before, the computer has just executed the operations necessary to perform multiplication by 21. But the result has flipped the sign bit, which causes the result to appear to be a negative number.

There is a way around this: you can tell Julia to work with groups of more than 64 bits at a time when expressing integers using the BigInt type:

julia> require("BigInt")
 
julia> BigInt(typemax(Int))
9223372036854775807
 
julia> BigInt(typemax(Int)) + 1
9223372036854775808
 
julia> BigInt(factorial(20)) * 21
51090942171709440000

Now everything works smoothly. By working with BigInt's automatically, languages like Python avoid these concerns:

>>> factorial(20)
2432902008176640000
>>> factorial(21)
51090942171709440000L

The L at the end of the numbers here indicates that Python has automatically converted a normal integer into something like Julia’s BigInt. But this automatic conversion comes at a substantial cost: every operation that stays within the bounds of 64-bit arithmetic is slower in Python than Julia because of the time required to check whether an operation might go beyond the 64-bit bound.

Python’s automatic conversion approach is safer, but slower. Julia’s approach is faster, but requires that the programmer understand more about the computer’s architecture. Julia achieves its performance by confronting the fact that computers are machines head on. This is confusing at first and frustrating at times, but it’s a price that you have to pay for high performance computing. Everyone who grew up with C is used to these issues, but they’re largely unfamiliar to programmers who grew up with modern languages like Python. In many ways, Julia sets itself apart from other new languages by its attempt to recover some of the power that was lost in the transition from C to languages like Python. But the transition comes with a substantial learning curve.

And that’s why I wrote this post.

What is Correctness for Statistical Software?

Introduction

A few months ago, Drew Conway and I gave a webcast that tried to teach people about the basic principles behind linear and logistic regression. To illustrate logistic regression, we worked through a series of progressively more complex spam detection problems.

The simplest data set we used was the following:


[Figure: toy spam data set with MentionsViagra and MentionsNigeria predictors and an IsSpam label]

This data set has one clear virtue: the correct classifier defines a decision boundary that implements a simple OR operation on the values of MentionsViagra and MentionsNigeria. Unfortunately, that very simplicity causes the logistic regression model to break down, because the MLE coefficients for MentionsViagra and MentionsNigeria should be infinite. In some ways, our elegantly simple example for logistic regression is actually the statistical equivalent of a SQL injection.
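A quick way to see why the maximum likelihood coefficients are infinite here: if some coefficient vector \(w\) classifies every training point correctly, then scaling it up only increases the likelihood, since every fitted probability \(p_i(c w)\) moves toward its observed 0/1 label as \(c\) grows,
$$
\lim_{c \to \infty} \prod_{i} p_i(c w)^{y_i} \left(1 - p_i(c w)\right)^{1 - y_i} = 1,
$$
so the likelihood can be pushed arbitrarily close to its supremum of 1 but never attains it at any finite set of coefficients.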

In our webcast, Drew and I decided to ignore that concern because R produces a useful model fit despite the theoretical MLE coefficients being infinite:


[Figure: R's fitted logistic regression coefficients for the toy spam data set]

Although R produces finite coefficients here despite theory telling us to expect something else, I should note that R does produce a somewhat cryptic warning during the model fitting step that alerts the very well-informed user that something has gone awry:

glm.fit: fitted probabilities numerically 0 or 1 occurred

It seems clear to me that R’s warning would be better off if it were substantially more verbose:

Warning from glm.fit():
 
Fitted probabilities could not be distinguished from 0's or 1's 
under finite precision floating point arithmetic. As a result, the 
optimization algorithm for GLM fitting may have failed to converge.
You should check whether your data set is linearly separable.

Broader Questions

Although I’ve started this piece with a very focused example of how R’s implementation of logistic regression differs from the purely mathematical definition of that model, I’m not really that interested in the details of how different pieces of software implement logistic regression. If you’re interested in learning more about that kind of thing, I’d suggest reading the excellent piece on R’s logistic regression function that can be found on the Win-Vector blog.

Instead, what interests me right now are a set of broader questions about how statistical software should work. What is the standard for correctness for statistical software? And what is the standard for usefulness? And how closely related are those two criteria?

Let’s think about each of them separately:

  • Usefulness: If you want to simply make predictions based on your model, then you want R to produce a fitted model for this data set that makes reasonably good predictions on the training data. R achieves that goal: the fitted predictions for R’s logistic regression model are numerically almost indistinguishable from the 0/1 values that we would expect from a maximum likelihood algorithm. If you want useful algorithms, then R’s decision to produce some model fit is justified.
  • Correctness: If you want software to either produce mathematically correct answers or to die trying, then R's implementation of logistic regression is not for you. If you insist on theoretical purity, it seems clear that R should not merely emit a warning here, but should instead throw an inescapable error rather than return an imperfect model fit. You might even want R to go further and to teach the end-user about the virtues of SVMs or the general usefulness of parameter regularization. Whatever you'd like to see, one thing is sure: you definitely do not want R to produce model fits that are mathematically incorrect.

It's remarkable that such a simple example can bring the goals of predictive power and theoretical correctness into such direct opposition. In part, the conflict arises here because these two goals are linked by a third consideration: computer algorithms are not generally equivalent to their mathematical idealizations. Purely computational concerns involving floating-point imprecision and finite compute time mean that we cannot generally hope for computers to produce answers similar to those prescribed by theoretical mathematics.

What’s fascinating about this specific example is that there’s something strangely desirable about floating-point numbers having finite precision: no one with any practical interest in modeling is likely to be interested in fitting a model with infinite-valued parameters. R’s decision to blindly run an optimization algorithm here unwittingly achieves a form of regularization like that employed in early stopping algorithms for fitting neural networks. And that may be a good thing if you’re interested in using a fitted model to make predictions, even though it means that R produces quantities like standard errors that have no real coherent interpretation in terms of frequentist estimators.

Whatever your take is on the virtues or vices of R’s implementation of logistic regression, there’s a broad take away from this example that I’ve been dealing with constantly while working on Julia: any programmer designing statistical software has to make decisions that involve personal judgment. The requirement for striking a compromise between correctness and usefulness is so nearly omnipresent that one of the most popular pieces of statistical software on Earth implements logistic regression using an algorithm that a pure theorist could argue is basically broken. But it produces an answer that has practical value. And that might just be the more important thing for statistical software to do.

What is Economics Studying?

Having spent all five of my years as a graduate student trying to get psychologists and economists to agree on basic ideas about decision-making, I think the following two pieces complement one another perfectly:

  • Cosma Shalizi’s comments on rereading Blanchard and Fischer’s “Lectures on Macroeconomics”:

    Blanchard and Fischer is about "modern" macro, models based on agents who know what the economy is like optimizing over time, possibly under some limits. This is the DSGE style of macro, which has lately come into so much discredit — thoroughly deserved discredit. Chaikin and Lubensky is about modern condensed matter physics, especially soft condensed matter, based on principles of symmetry-breaking and phase transitions. Both books are about building stylized theoretical models and solving them to see what they imply; implicitly they are also about the considerations which go into building models in their respective domains.

    What is very striking, looking at them side by side, is that while these are both books about mathematical modeling, Chaikin and Lubensky presents empirical data, compares theoretical predictions to experimental results, and goes into some detail into the considerations which lead to this sort of model for nematic liquid crystals, or that model for magnetism. There is absolutely nothing like this in Blanchard and Fischer — no data at all, no comparison of models to reality, no evidence of any kind supporting any of the models. There is not even an attempt, that I can find, to assess different macroeconomic models, by comparing their qualitative predictions to each other and to historical reality. I presume that Blanchard and Fischer, as individual scholars, are not quite so indifferent to reality, but their pedagogy is.

    I will leave readers to draw their own morals.

  • Itzhak Gilboa's argument that economic theory is a rhetorical apparatus rather than a set of direct predictions about the world in which we live.

A Cheap Criticism of p-Values

One of these days I am going to finish my series on problems with how NHST is used in the social sciences. Until then, I came up with a cheap criticism of p-values today.

To make sense of my complaint, you’ll want to head over to Andy Gelman’s blog and read the comments on his recent blog post about p-values. Reading them makes one thing clear: not even a large group of stats wonks can agree on how to think about p-values. How could we ever hope for understanding from the kind of people who are only reporting p-values because they’re forced to do so by their fields?

The State of Statistics in Julia

Updated 12.2.2012: Added sample output based on a suggestion from Stefan Karpinski.

Introduction

Over the last few weeks, the Julia core team has rolled out a demo version of Julia’s package management system. While the Julia package system is still very much in beta, it nevertheless provides the first plausible way for non-expert users to see where Julia’s growing community of developers is heading.

To celebrate some of the amazing work that’s already been done to make Julia usable for day-to-day data analysis, I’d like to give a brief overview of the state of statistical programming in Julia. There are now several packages that, taken as a whole, suggest that Julia may really live up to its potential and become the next generation language for data analysis.

Getting Julia Installed

If you’d like to try out Julia for yourself, you’ll first need to clone the current Julia repo from GitHub and then build Julia from source as described in the Julia README. Compiling Julia for the first time can take up to two hours, but updating Julia afterwards will be quite fast once you’ve gotten a working copy of the language and its dependencies installed on your system. After you have Julia built, you should add its main directory to your path and then open up the Julia REPL by typing julia at the command line.

Installing Packages

Once Julia’s REPL is running, you can use the following commands to start installing packages:

julia> require("pkg")
 
julia> Pkg.init()
Initialized empty Git repository in /Users/johnmyleswhite/.julia/.git/
Cloning into 'METADATA'...
remote: Counting objects: 443, done.
remote: Compressing objects: 100% (208/208), done.
remote: Total 443 (delta 53), reused 423 (delta 33)
Receiving objects: 100% (443/443), 38.98 KiB, done.
Resolving deltas: 100% (53/53), done.
[master (root-commit) dbd486e] empty package repo
 2 files changed, 4 insertions(+)
 create mode 100644 .gitmodules
 create mode 160000 METADATA
 create mode 100644 REQUIRE
 
julia> Pkg.add("DataFrames", "Distributions", "MCMC", "Optim", "NHST", "Clustering")
Installing DataFrames: v0.0.0
Cloning into 'DataFrames'...
remote: Counting objects: 1340, done.
remote: Compressing objects: 100% (562/562), done.
remote: Total 1340 (delta 760), reused 1229 (delta 655)
Receiving objects: 100% (1340/1340), 494.79 KiB, done.
Resolving deltas: 100% (760/760), done.
Installing Distributions: v0.0.0
Cloning into 'Distributions'...
remote: Counting objects: 49, done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 49 (delta 8), reused 49 (delta 8)
Receiving objects: 100% (49/49), 17.29 KiB, done.
Resolving deltas: 100% (8/8), done.
Installing MCMC: v0.0.0
Cloning into 'MCMC'...
warning: no common commits
remote: Counting objects: 155, done.
remote: Compressing objects: 100% (97/97), done.
remote: Total 155 (delta 66), reused 140 (delta 51)
Receiving objects: 100% (155/155), 256.68 KiB, done.
Resolving deltas: 100% (66/66), done.
Installing NHST: v0.0.0
Cloning into 'NHST'...
remote: Counting objects: 20, done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 20 (delta 2), reused 19 (delta 1)
Receiving objects: 100% (20/20), 4.31 KiB, done.
Resolving deltas: 100% (2/2), done.
Installing Optim: v0.0.0
Cloning into 'Optim'...
remote: Counting objects: 497, done.
remote: Compressing objects: 100% (191/191), done.
remote: Total 497 (delta 318), reused 476 (delta 297)
Receiving objects: 100% (497/497), 79.68 KiB, done.
Resolving deltas: 100% (318/318), done.
Installing Options: v0.0.0
Cloning into 'Options'...
remote: Counting objects: 10, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 6 (delta 0)
Receiving objects: 100% (10/10), done.
Resolving deltas: 100% (1/1), done.
Installing Clustering: v0.0.0
Cloning into 'Clustering'...
remote: Counting objects: 38, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 38 (delta 7), reused 38 (delta 7)
Receiving objects: 100% (38/38), 7.77 KiB, done.
Resolving deltas: 100% (7/7), done.

That will get you started with some of the core tools for doing statistical programming in Julia. You’ll probably also want to install another package called “RDatasets”, which provides access to 570 of the classic data sets available in R. This package has a much larger file size than the others, which is why I recommend installing it after you’ve first installed the other packages:

require("pkg")
 
julia> Pkg.add("RDatasets")
Installing RDatasets: v0.0.0
Cloning into 'RDatasets'...
remote: Counting objects: 609, done.
remote: Compressing objects: 100% (588/588), done.
remote: Total 609 (delta 21), reused 605 (delta 17)
Receiving objects: 100% (609/609), 10.56 MiB | 1.15 MiB/s, done.
Resolving deltas: 100% (21/21), done.

Assuming that you’ve gotten everything working, you can then type the following to load Fisher’s classic Iris data set:

julia> load("RDatasets")
Warning: redefinition of constant NARule ignored.
Warning: New definition ==(NAtype,Any) is ambiguous with ==(Any,AbstractArray{T,N}).
         Make sure ==(NAtype,AbstractArray{T,N}) is defined first.
Warning: New definition ==(Any,NAtype) is ambiguous with ==(AbstractArray{T,N},Any).
         Make sure ==(AbstractArray{T,N},NAtype) is defined first.
Warning: New definition replace!(PooledDataVec{S},NAtype,T) is ambiguous with replace!(PooledDataVec{S},T,NAtype).
         Make sure replace!(PooledDataVec{S},NAtype,NAtype) is defined first.
Warning: New definition promote_rule(Type{AbstractDataVec{T}},Type{T}) is ambiguous with promote_rule(Type{AbstractDataVec{S}},Type{T}).
         Make sure promote_rule(Type{AbstractDataVec{T}},Type{T}) is defined first.
Warning: New definition ^(NAtype,T<:Union(String,Number)) is ambiguous with ^(Any,Integer).
         Make sure ^(NAtype,_<:Integer) is defined first.
Warning: New definition ^(DataVec{T},Number) is ambiguous with ^(Any,Integer).
         Make sure ^(DataVec{T},Integer) is defined first.
Warning: New definition ^(DataFrame,Union(NAtype,Number)) is ambiguous with ^(Any,Integer).
         Make sure ^(DataFrame,Integer) is defined first.
 
julia> using DataFrames
 
julia> using RDatasets
 
julia> iris = data("datasets", "iris")
DataFrame  (150,6)
              Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
[1,]        1          5.1         3.5          1.4         0.2    "setosa"
[2,]        2          4.9         3.0          1.4         0.2    "setosa"
[3,]        3          4.7         3.2          1.3         0.2    "setosa"
[4,]        4          4.6         3.1          1.5         0.2    "setosa"
[5,]        5          5.0         3.6          1.4         0.2    "setosa"
[6,]        6          5.4         3.9          1.7         0.4    "setosa"
[7,]        7          4.6         3.4          1.4         0.3    "setosa"
[8,]        8          5.0         3.4          1.5         0.2    "setosa"
[9,]        9          4.4         2.9          1.4         0.2    "setosa"
[10,]      10          4.9         3.1          1.5         0.1    "setosa"
[11,]      11          5.4         3.7          1.5         0.2    "setosa"
[12,]      12          4.8         3.4          1.6         0.2    "setosa"
[13,]      13          4.8         3.0          1.4         0.1    "setosa"
[14,]      14          4.3         3.0          1.1         0.1    "setosa"
[15,]      15          5.8         4.0          1.2         0.2    "setosa"
[16,]      16          5.7         4.4          1.5         0.4    "setosa"
[17,]      17          5.4         3.9          1.3         0.4    "setosa"
[18,]      18          5.1         3.5          1.4         0.3    "setosa"
[19,]      19          5.7         3.8          1.7         0.3    "setosa"
[20,]      20          5.1         3.8          1.5         0.3    "setosa"
  :
[131,]    131          7.4         2.8          6.1         1.9 "virginica"
[132,]    132          7.9         3.8          6.4         2.0 "virginica"
[133,]    133          6.4         2.8          5.6         2.2 "virginica"
[134,]    134          6.3         2.8          5.1         1.5 "virginica"
[135,]    135          6.1         2.6          5.6         1.4 "virginica"
[136,]    136          7.7         3.0          6.1         2.3 "virginica"
[137,]    137          6.3         3.4          5.6         2.4 "virginica"
[138,]    138          6.4         3.1          5.5         1.8 "virginica"
[139,]    139          6.0         3.0          4.8         1.8 "virginica"
[140,]    140          6.9         3.1          5.4         2.1 "virginica"
[141,]    141          6.7         3.1          5.6         2.4 "virginica"
[142,]    142          6.9         3.1          5.1         2.3 "virginica"
[143,]    143          5.8         2.7          5.1         1.9 "virginica"
[144,]    144          6.8         3.2          5.9         2.3 "virginica"
[145,]    145          6.7         3.3          5.7         2.5 "virginica"
[146,]    146          6.7         3.0          5.2         2.3 "virginica"
[147,]    147          6.3         2.5          5.0         1.9 "virginica"
[148,]    148          6.5         3.0          5.2         2.0 "virginica"
[149,]    149          6.2         3.4          5.4         2.3 "virginica"
[150,]    150          5.9         3.0          5.1         1.8 "virginica"
 
julia> head(iris)
DataFrame  (6,6)
          Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
[1,]    1          5.1         3.5          1.4         0.2 "setosa"
[2,]    2          4.9         3.0          1.4         0.2 "setosa"
[3,]    3          4.7         3.2          1.3         0.2 "setosa"
[4,]    4          4.6         3.1          1.5         0.2 "setosa"
[5,]    5          5.0         3.6          1.4         0.2 "setosa"
[6,]    6          5.4         3.9          1.7         0.4 "setosa"
 
julia> tail(iris)
DataFrame  (6,6)
            Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
[1,]    145          6.7         3.3          5.7         2.5 "virginica"
[2,]    146          6.7         3.0          5.2         2.3 "virginica"
[3,]    147          6.3         2.5          5.0         1.9 "virginica"
[4,]    148          6.5         3.0          5.2         2.0 "virginica"
[5,]    149          6.2         3.4          5.4         2.3 "virginica"
[6,]    150          5.9         3.0          5.1         1.8 "virginica"

Now that you can see that Julia can handle complex data sets, let’s talk a little bit about the packages that make statistical analysis in Julia possible.

The DataFrames Package

The DataFrames package provides data structures for working with tabular data in Julia. At a minimum, this means that DataFrames provides tools for dealing with individual columns of missing data, which are called DataVec's. A collection of DataVec's allows one to build up a DataFrame, which provides a tabular data structure like that used by R's data.frame type.

julia> load("DataFrames")
 
julia> using DataFrames
 
julia> data = {"Value" => [1, 2, 3], "Label" => ["A", "B", "C"]}
Warning: imported binding for data overwritten in module Main
{"Label"=>["A", "B", "C"],"Value"=>[1, 2, 3]}
 
julia> df = DataFrame(data)
DataFrame  (3,2)
        Label Value
[1,]      "A"     1
[2,]      "B"     2
[3,]      "C"     3
 
julia> df["Value"]
3-element DataVec{Int64}
 
[1,2,3]
 
julia> df[1, "Value"] = NA
NA
 
 
julia> head(df)
DataFrame  (3,2)
        Label Value
[1,]      "A"    NA
[2,]      "B"     2
[3,]      "C"     3

Distributions

The Distributions package provides tools for working with probability distributions in Julia. It reifies distributions as types in Julia’s large type hierarchy, which means that quite generic names like rand can be used to sample from complex distributions:

julia> load("Distributions")
julia> using Distributions
 
julia> x = rand(Normal(11.0, 3.0), 10_000)
10000-element Float64 Array:
  6.87693
 13.3676 
  7.25008
  8.82833
 10.6911 
  7.1004 
 13.7449 
  5.96412
  8.57957
 15.2737 
  ⋮
  4.89007
 15.1509 
  6.32376
  7.83847
 14.4476 
 14.2974 
  9.74783
  9.67398
 14.4992 
 
julia> mean(x)
11.00366217730023
 
julia> var(x)
Warning: Possible conflict in library symbol ddot_
9.288938550823996

Optim

The Optim package provides tools for numerical optimization of arbitrary functions in Julia. It provides a function, optimize, which works a bit like R’s optim function.

julia> load("Optim")
julia> using Optim
 
julia> f = v -> (10.9 - v[1])^2 + (7.3 - v[2])^2
#<function>
 
julia> initial_guess = [0.0, 0.0]
2-element Float64 Array:
 0.0
 0.0
 
julia> results = optimize(f, initial_guess)
Warning: Possible conflict in library symbol dcopy_
OptimizationResults("Nelder-Mead",[0.333333, 0.333333],[10.9, 7.29994],3.2848148720460163e-9,38,true)
 
julia> results.minimum
2-element Float64 Array:
 10.9    
  7.29994

MCMC

The MCMC package provides tools for sampling from arbitrary probability distributions using Markov Chain Monte Carlo. It provides functions like slice_sampler, which allows one to sample from a (potentially unnormalized) density function using Radford Neal’s slice sampling algorithm.

julia> load("MCMC")
 
julia> using MCMC
 
julia> d = Normal(17.29, 1.0)
Normal(17.29,1.0)
 
julia> f = x -> logpdf(d, x)
#<function>
 
julia> [slice_sampler(0.0, f) for i in 1:100]
100-element (Float64,Float64) Array:
 (2.7589100475626323,-106.49522613611775) 
 (22.840595204318323,-16.323492094305458) 
 (0.11800384424353683,-148.35766451986206)
 (25.507580447082677,-34.68325273534245)  
 (25.794565860846134,-37.08275877393945)  
 (25.898128716394307,-37.96887853221083)  
 (9.309878825853284,-32.76010551023705)   
 (30.824102772255355,-92.50490745818972)  
 (9.108789186504177,-34.38504372063516)   
 (25.547686903330494,-35.01363502992266)
  ⋮
 (5.795001414731885,-66.98643477086263)   
 (15.50115292212293,-2.518925467219337)   
 (12.046429369881345,-14.666455009726143) 
 (17.25455052645699,-0.919566865791911)   
 (25.494698549206657,-34.57747767488159)  
 (1.8340810959111111,-120.36165311809079) 
 (2.7112428736526177,-107.18901820771696) 
 (9.21203292192012,-33.54571459047587)    
 (19.12274407701784,-2.5984139591266584)

NHST

The NHST package provides tools for testing standard statistical hypotheses using null hypothesis significance testing tools like the t-test and the chi-squared test.

julia> load("Distributions")
 
julia> using Distributions
 
julia> load("NHST")
 
julia> using NHST
 
julia> d1 = Normal(17.29, 1.0)
Normal(17.29,1.0)
 
julia> d2 = Normal(0.0, 1.0)
Normal(0.0,1.0)
 
julia> x = rand(d1, 1_000)
1000-element Float64 Array:
 15.7085
 18.585 
 16.6036
 18.962 
 17.8715
 16.6814
 17.9676
 16.8924
 16.6022
 17.9813
  ⋮
 17.1339
 17.3964
 18.6184
 16.7238
 18.5003
 16.1618
 17.9198
 17.4928
 18.715 
 
julia> y = rand(d2, 1_000)
1000-element Float64 Array:
  0.664885 
  0.147182 
  0.96265  
  0.24282  
  1.881    
 -0.632478 
  0.539297 
  0.996562 
 -0.483302 
  0.514629 
   ⋮
   2.06249  
 -0.549444 
  0.857575 
 -1.47464  
 -2.33243  
  0.510751 
 -0.381069 
 -1.49165  
  0.0521203
 
julia> t_test(x, y)
HypothesisTest("t-Test",{"t"=>392.2838409538002},{"df"=>1989.732411290855},0.0,[17.1535, 17.3293],{"mean of x"=>17.24357323225425,"mean of y"=>0.0021786523177457794},0.0,"two-sided","Welch Two Sample t-test","x and y",1989.732411290855)

Clustering

The Clustering package provides tools for doing simple k-means style clustering.

julia> load("Clustering")
 
julia> using Clustering
 
julia> srand(1)
 
julia> n = 100
100
 
julia> x = vcat(randn(n, 2), randn(n, 2) .+ 10)
200x2 Float64 Array:
  0.0575636  -0.112322 
 -1.8329     -0.101326 
  0.370699   -0.956183 
  1.31816    -1.44351  
  0.787598    0.148386 
  0.712214   -1.293    
 -1.8578     -1.06208  
 -0.746303   -0.0439182
  1.12082    -2.00616  
  0.364646   -1.09331  
   ⋮
 10.1974     10.5583   
 11.0832      8.92082  
 11.5414     11.6022   
  9.0453     11.5093   
  8.86714    10.4233   
 10.7336     10.7201   
  8.60415     9.13942  
  8.62482     8.51701  
 10.5044     10.3841   
 
julia> true_assignments = vcat(zeros(n), ones(n))
200-element Float64 Array:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
  ⋮
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 
julia> results = k_means(x, 2)
Warning: Possible conflict in library symbol dgesdd_
Warning: Possible conflict in library symbol dsyrk_
Warning: Possible conflict in library symbol dgemm_
KMeansOutput([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1  ...  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],2x2 Float64 Array:
 -0.0166203  -0.248904
 10.0418     10.0074  ,3,422.9820560670007,true)
 
julia> results.assignments
200-element Int64 Array:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 2
 2
 2
 2
 2
 2
 2
 2
 2

While all of this software is still quite new and often still buggy, being able to work with these tools through a simple package system has made me more excited than ever before about the future of Julia as a language for data analysis. There is, of course, one thing conspicuously lacking right now: a really powerful visualization toolkit for interactive graphics like that provided by R's ggplot2 package. Hopefully something will come into being within the next few months.

The Shape of Floating Point Random Numbers

[Updated 10/18/2012: Fixed a typo in which mantissa was replaced with exponent.]

Over the weekend, Viral Shah updated Julia’s implementation of randn() to give a 20% speed boost. Because we all wanted to test that this speed-up had not come at the expense of the validity of Julia’s RNG system, I spent some time this weekend trying to get tests up and running. I didn’t get far, but thankfully others chimed in and got things done.

Testing an RNG is serious business. In total, we’ve considered using four different test suites:

All of these suites can be easily used to test uniform random numbers over unsigned integers. Some are also appropriate for testing uniform random numbers over floating-point values.

But we wanted to test a Gaussian RNG. To do that, we followed Thomas et al.'s lead and mapped the Gaussian RNG's output through a high-precision normal CDF to produce uniform random floating point values. As our high-precision CDF we ended up using the one described in Marsaglia's 2004 JSS paper.
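In other words, we leaned on the probability integral transform: if \(X\) really is Gaussian, then \(\Phi(X)\) is uniform on \([0, 1]\). A rough sketch of the idea in Julia, with the Distributions package's cdf standing in for the high-precision implementation we actually used:

load("Distributions")
using Distributions

# Push Gaussian draws through the normal CDF; if the Gaussian RNG is sound,
# the transformed values should look Uniform(0, 1) to the test suites.
d = Normal(0.0, 1.0)
gaussian_draws = randn(10_000)
uniform_draws = [cdf(d, x) for x in gaussian_draws]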

With that in place, I started to try modifying my previous RNG testing code. When we previously tried to test Julia’s rand() function, I got STS working on my machine and deciphered its manual well enough to run a suite of tests on a bit stream from Julia.

Unfortunately I made a fairly serious error in how I attempted to test Julia's RNG. Because STS expects a stream of random 0's and 1's, I converted random numbers into 0's and 1's by testing whether the floating point numbers being generated were greater than 0.5 or less than 0.5. While this test is not completely wrong, it is very, very weak. Its substantive value comes from two points:

  1. It confirms that the median of the RNG is correctly positioned at 0.5.
  2. It confirms that the placement of successive entries relative to 0.5 is effectively random. In short, there is no trivial correlation between successive values.

Unfortunately that’s about all you learn from this method. We needed something more. So I started exploring how to convert a floating point into bits. Others had the good sense to avoid this and pushed us forward by using the TestU01 suite.

I instead got lost exploring the surprising complexity of trying to work with the individual bits of random floating point numbers. The topic is so subtle because the distribution of bits in a randomly generated floating point number is extremely far from a random source of individual bits.
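If you want to poke at those bit patterns yourself, here is a rough sketch (not the exact code behind the plots below) that reuses the bits function from the previous post, assuming it is applied to Float64 values, to estimate the mean value of each of the 64 bit positions:

# Estimate the mean value of each bit position over n_samples draws from a
# generator passed in as a zero-argument function.
function bit_means(draw::Function, n_samples::Integer)
    counts = zeros(64)
    for s in 1:n_samples
        b = bits(draw())  # 64-character string of '0's and '1's
        for i in 1:64
            if b[i] == '1'
                counts[i] += 1
            end
        end
    end
    return counts / n_samples
end

bit_means(() -> rand(), 10_000)   # signature for uniform draws
bit_means(() -> randn(), 10_000)  # signature for normal draws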

For example, a uniform variable’s representation in floating point has all the following non-random properties:

  1. The sign bit is never random because uniform variables are never negative.
  2. The exponent is not random either because uniform variables are strictly contained in the interval [0, 1].
  3. Even the mantissa isn’t random. Because floating point numbers aren’t evenly spaced in the reals, the mantissa has to have complex patterns in it to simulate the equal-spacing of uniform numbers.

Inspired by all of this, I decided to get a sense for the bit pattern signature of different RNG’s. Below I’ve plotted the patterns for uniform, normal, gamma and Cauchy variables using lines that describe the mean value of the i-th bit in the bit string. At a minimum, a completely random bit stream would have a flat horizontal line through 0.5, which many of the lines touch for a moment, but never perfectly match.


[Figure: mean bit-value signatures for uniform, normal, gamma and Cauchy draws]

Some patterns:

  1. The first bit (shown on the far left) is the sign bit: you can clearly see which distributions are symmetric by looking for a mean value of 0.5 versus those that are strictly positive and have a mean value of 0.0.
  2. The next eleven bits are the exponent and you can clearly see which distributions are largely concentrated in the interval [-1, 1] and which have substantial density outside of that region. These bits would clue you in to the variance of the distribution. (The exact field layout is spelled out in the snippet after this list.)
  3. You can see that there is a lot of non-randomness in the last few bits of the mantissa for uniform variables. There’s also non-randomness in the first few bits for all variables. I don’t yet have any real intuition for those patterns.
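To make the sign / exponent / mantissa split concrete, here is a small illustrative snippet that peels the three IEEE 754 fields of a double off of the string that bits returns (using the standard 1/11/52 field widths):

# Split the 64-bit pattern of a Float64 into its IEEE 754 fields.
b = bits(-0.75)
sign_bit = b[1]      # 1 bit: '1' means the number is negative
exponent = b[2:12]   # 11 bits of biased exponent
mantissa = b[13:64]  # 52 bits of mantissa (fraction)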

You can go beyond looking at the signatures of mean bit patterns by looking at covariance matrices as well. Below I show these covariance matrices in a white-blue coloring scheme in which white indicates negative values, light blue indicates zero and dark blue indicates positive values. Note that the matrices, generated using R's image() function, are reflections of the more intuitive matrix ordering in which the [1,1] entry of the matrix occurs in the top-left instead of the bottom-left.

Uniform Variables

[Figure: bit covariance matrix for uniform variables]

Normal Variables

[Figure: bit covariance matrix for normal variables]

Gamma Variables

[Figure: bit covariance matrix for gamma variables]

Cauchy Variables

[Figure: bit covariance matrix for Cauchy variables]

I find these pictures really helpful for reminding me how strangely floating point numbers behave. The complexity of these images is so far removed from the simplicity of the bit non-patterns in randomly generated unsigned integers, which can be produced by drawing IID random bits and concatenating them together.