A Note on the Johnson-Lindenstrauss Lemma

Introduction

A recent thread on Theoretical CS StackExchange comparing the Johnson-Lindenstrauss Lemma with the Singular Value Decomposition piqued my interest enough that I decided to spend some time last night reading the standard JL papers. Until this week, I only had a vague understanding of what the JL Lemma implied. I previously mistook the JL Lemma for a purely theoretical result that established the existence of distance-preserving projections from high-dimensional spaces into low-dimensional spaces.

This vague understanding of the JL Lemma turns out to be almost correct, but it also led me to neglect the most interesting elements of the literature on the JL Lemma: the papers on the JL Lemma do not simply establish the existence of such projections, but also provide (1) an explicit bound on the dimensionality required for a projection to ensure that it will approximately preserve distances and they even provide (2) an explicit construction of a random matrix, \(A\), that produces the desired projection.

Once I knew that the JL Lemma had a constructive proof, I decided to write Julia code that constructs examples of this family of random projections. The rest of this post walks through that code as a way of explaining the JL Lemma’s practical applications.

Formal Statement of the JL Lemma

The JL Lemma, as stated in “An elementary proof of the Johnson-Lindenstrauss Lemma” by Dasgupta and Gupta, is the following result about dimensionality reduction:

For any \(0 < \epsilon < 1\) and any integer \(n\), let \(k\) be a positive integer such that \(k \geq 4(\epsilon^2/2 - \epsilon^3/3)^{-1}\log(n)\).

Then for any set \(V\) of \(n\) points in \(\mathbb{R}^d\), there is a map \(f : \mathbb{R}^d \to \mathbb{R}^k\) such that for all \(u, v \in V\),

$$
(1 - \epsilon) ||u - v||^2 \leq ||f(u) - f(v)||^2 \leq (1 + \epsilon) ||u - v||^2.
$$

Further this map can be found in randomized polynomial time.

To fully appreciate this result, we can unpack the abstract statement of the lemma into two components.

The JL Lemma in Two Parts

Part 1: Given a number of data points, \(n\), that we wish to project and a relative error, \(\epsilon\), that we are willing to tolerate, we can compute a minimum dimensionality, \(k\), that a projection must map a space into before it can guarantee that distances will be preserved up to a factor of \(\epsilon\).

In particular, \(k = \left \lceil{4(\epsilon^2/2 - \epsilon^3/3)^{-1}\log(n)} \right \rceil\).

Note that this implies that the dimensionality required to preserve distances depends only on the number of points and not on the dimensionality of the original space.

Part 2: Given an input matrix, \(X\), of \(n\) points in \(d\)-dimensional space, we can explicitly construct a map, \(f\), such that the distance between any pair of columns of \(X\) will not be distorted by more than a factor of \(\epsilon\).

Surprisingly, this map \(f\) can be a simple matrix, \(A\), constructed by sampling \(k \times d\) IID draws from a Gaussian with mean \(0\) and variance \(\frac{1}{k}\).

Coding Up The Projections

We can translate the first part of the JL Lemma into a single line of code that computes the dimensionality, \(k\), of our low-dimensional space given the number of data points, \(n\), and the error, \(\epsilon\), that we are willing to tolerate:

mindim(n::Integer, ε::Real) = iceil((4 * log(n)) / (ε^2 / 2 - ε^3 / 3))

Having defined this function, we can try it out on a simple problem:

mindim(3, 0.1)
# => 942

This result was somewhat surprising to me: to represent \(3\) points with no more than \(10\)% error, we require nearly \(1,000\) dimensions. This reflects an important fact about the JL Lemma: it produces results that can be extremely conservative for low-dimensional inputs. It’s obvious that, for data sets that contain \(3\) points in \(100\)-dimensional space, we could use a projection into \(100\) dimensions that would preserve distances perfectly.

But this observation neglects one of the essential aspects of the JL Lemma: the dimensionality required by the lemma will be sufficient whether our data set contains points in \(100\)-dimensional space or points in \(10^{100}\)-dimensional space. No matter what dimensionality the raw data lies in, the JL Lemma says that \(942\) dimensions suffice to preserve the distances between \(3\) points.

I found this statement unintuitive at the start. To see that it’s true, let’s construct a random projection matrix, \(A\), that will let us confirm experimentally that the JL Lemma really works:

using Distributions
 
function projection(
    X::Matrix,
    ε::Real,
    k::Integer = mindim(size(X, 2), ε)
)
    d, n = size(X)
    A = rand(Normal(0, 1 / sqrt(k)), k, d)
    return A, k, A * X
end

This projection function is sufficient to construct a matrix, \(A\), that will satisfy the assumptions of the JL Lemma. It will also return the dimensionality, \(k\), of \(A\) and the result of projecting the input, \(X\), into the new space defined by \(A\). To get a feel for how this works, we can try this out on a very simple data set:

X = eye(3, 3)
 
ε = 0.1
 
A, k, AX = projection(X, ε)
# =>
# (
# 942x3 Array{Float64,2}:
#  -0.035269    -0.0299966   -0.0292959 
#  -0.00501367   0.0316806    0.0460191 
#   0.0633815   -0.0136478   -0.0198676 
#   0.0262627    0.00187459  -0.0122604 
#   0.0417169   -0.0230222   -0.00842476
#   0.0236389    0.0585979   -0.0642437 
#   0.00685299  -0.0513301    0.0501431 
#   0.027723    -0.0151694    0.00274466
#   0.0338992    0.0216184   -0.0494157 
#   0.0612926    0.0276185    0.0271352 
#   ⋮                                   
#  -0.00167347  -0.018576     0.0290964 
#   0.0158393    0.0124403   -0.0208216 
#  -0.00833401   0.0323784    0.0245698 
#   0.019355     0.0057538    0.0150561 
#   0.00352774   0.031572    -0.0262811 
#  -0.0523636   -0.0388993   -0.00794319
#  -0.0363795    0.0633939   -0.0292289 
#   0.0106868    0.0341909    0.0116523 
#   0.0072586   -0.0337501    0.0405171 ,
# 
# 942,
# 942x3 Array{Float64,2}:
#  -0.035269    -0.0299966   -0.0292959 
#  -0.00501367   0.0316806    0.0460191 
#   0.0633815   -0.0136478   -0.0198676 
#   0.0262627    0.00187459  -0.0122604 
#   0.0417169   -0.0230222   -0.00842476
#   0.0236389    0.0585979   -0.0642437 
#   0.00685299  -0.0513301    0.0501431 
#   0.027723    -0.0151694    0.00274466
#   0.0338992    0.0216184   -0.0494157 
#   0.0612926    0.0276185    0.0271352 
#   ⋮                                   
#  -0.00167347  -0.018576     0.0290964 
#   0.0158393    0.0124403   -0.0208216 
#  -0.00833401   0.0323784    0.0245698 
#   0.019355     0.0057538    0.0150561 
#   0.00352774   0.031572    -0.0262811 
#  -0.0523636   -0.0388993   -0.00794319
#  -0.0363795    0.0633939   -0.0292289 
#   0.0106868    0.0341909    0.0116523 
#   0.0072586   -0.0337501    0.0405171 )

According to the JL Lemma, the new matrix, \(AX\), should approximately preserve the distances between columns of \(X\). We can write a quick function that verifies this claim:

function ispreserved(X::Matrix, A::Matrix, ε::Real)
    d, n = size(X)
    k = size(A, 1)
 
    for i in 1:n
        for j in (i + 1):n
            u, v = X[:, i], X[:, j]
            d_old = norm(u - v)^2
            d_new = norm(A * u - A * v)^2
            @printf("Considering the pair X[:, %d], X[:, %d]...\n", i, j)
            @printf("\tOld distance: %f\n", d_old)
            @printf("\tNew distance: %f\n", d_new)
            @printf(
                "\tWithin bounds %f <= %f <= %f\n",
                (1 - ε) * d_old,
                d_new,
                (1 + ε) * d_old
            )
            if !((1 - ε) * d_old <= d_new <= (1 + ε) * d_old)
                return false
            end
        end
    end
 
    return true
end

And then we can test out the results:

ispreserved(X, A, ε)
# =>
# Considering the pair X[:, 1], X[:, 2]...
#     Old distance: 2.000000
#     New distance: 2.104506
#     Within bounds 1.800000 <= 2.104506 <= 2.200000
# Considering the pair X[:, 1], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 2.006130
#     Within bounds 1.800000 <= 2.006130 <= 2.200000
# Considering the pair X[:, 2], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.955495
#     Within bounds 1.800000 <= 1.955495 <= 2.200000

As claimed, the distances are indeed preserved up to a factor of \(\epsilon\). But, as we noted earlier, the JL lemma has a somewhat perverse consequence for our \(3×3\) matrix: we’ve expanded our input into a \(942×3\) matrix rather than reduced its dimensionality.

To get meaningful dimensionality reduction, we need to project a data set from a space that has more than \(942\) dimensions. So let’s try out a \(50,000\)-dimensional example:

X = eye(50_000, 3)
 
A, k, AX = projection(X, ε)
 
ispreserved(X, A, ε)
# =>
# Considering the pair X[:, 1], X[:, 2]...
#     Old distance: 2.000000
#     New distance: 2.021298
#     Within bounds 1.800000 <= 2.021298 <= 2.200000
# Considering the pair X[:, 1], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.955502
#     Within bounds 1.800000 <= 1.955502 <= 2.200000
# Considering the pair X[:, 2], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.988945
#     Within bounds 1.800000 <= 1.988945 <= 2.200000

In this case, the JL Lemma again works as claimed: the pairwise distances between columns of \(X\) are preserved. And we’ve done this while reducing the dimensionality of our data from \(50,000\) to \(942\). Moreover, this same approach would still work if the input space had \(10\) million dimensions.

Conclusion

Contrary to my naive conception of the JL Lemma, the literature on the lemma does not merely tell us that, abstractly, distances can be preserved by dimensionality reduction techniques. It also tells us how to perform this reduction, and the mechanism is both simple and general.

Data corruption in R 3.0.2 when using read.csv

Introduction

It may be old news to some, but I just recently discovered that the automatic type inference system that R uses when parsing CSV files assumes that data sets will never contain 64-bit integer values.

Specifically, if an integer value read from a CSV file is too large to fit in a 32-bit integer field without overflow, the column of data that contains that value will be automatically converted to floating point. This conversion will take place without any warnings, even though it may lead to data corruption.

The reason that the automatic conversion of 64-bit integer-valued data to floating point is problematic is that floating point numbers lack sufficient precision to exactly represent the full range of 64-bit integer values. As a consequence of the lower precision of floating point numbers, two unequal integer values in the input file may be converted to two equal floating point values in the data.frame R uses to represent that data. Subsequent analysis in R will therefore treat unequal values as if they were equal, corrupting any downstream analysis that assumes that the equality predicate can be trusted.
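
To make the precision problem concrete, here is a quick check, written in Julia purely for illustration (the values and calls are my own example, but R's numeric type is the same IEEE 754 double-precision format): adjacent 64-bit integers near 1e17 collapse to a single floating point value because a double carries only 53 bits of significand.

# Two distinct 64-bit integers round to the same Float64 value.
Float64(100000000000000000) == Float64(100000000000000001)
# => true

# The gap between representable doubles near 1e17 is 16, so integers in
# that range that differ by less than 16 become indistinguishable.
eps(1.0e17)
# => 16.0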

Below, I demonstrate this general problem using two specific data sets. The specific failure case that I outline occurred for me while using R 3.0.2 on my x86_64-apple-darwin10.8.0 platform laptop, which is a “MacBook Pro Retina, 13-inch, Late 2013” model.

Failure Case

Consider the following two tables, one containing 32-bit integer values and the other containing 64-bit integer values:

ID
1000
1001

ID
100000000000000000
100000000000000001

What happens when they are read into R using the read.csv function?

32-bit compatible integer values are parsed, correctly, using R’s integer type, which does not lead to data corruption:

data <- "MySQLID\n1000\n1001"
 
ids <- read.csv(text = data)
 
ids[1, 1] == ids[2, 1]
# [1] FALSE
 
class(ids$MySQLID)
# [1] "integer"

64-bit compatible integer values are parsed, incorrectly, using R’s numeric type, which does lead to data corruption:

data <- "MySQLID\n100000000000000000\n100000000000000001"
 
ids <- read.csv(text = data)
 
ids[1, 1] == ids[2, 1]
# [1] TRUE
 
class(ids$MySQLID)
# [1] "numeric"

Conclusions

What should one make of this example? At the minimum, it suggests that R’s default behaviors are not well-suited to a world in which more and more people interact with data derived from commercial web sites, where 64-bit integers are commonplace. I hope that R will change the behavior of read.csv in a future release and deprecate any attempts to treat integer literals as anything other than 64-bit integers.

But I would argue that this example also teaches a much more general point: it suggests that the assertion that scientists can safely ignore the distinction between integer and floating point data types is false. In the example I’ve provided, the very real distinction that modern CPUs make between integer and floating point data leads to very real data corruption. How that data corruption affects downstream analyses is situation-dependent, but it is conceivable that the effects are severe in some settings. I would hope that we will stop asserting that scientists can use computers to analyze data without understanding the inherent limitations of the tools they are working with.

The Relationship between Vectorized and Devectorized Code

Introduction

Some people have come to believe that Julia’s vectorized code is unusably slow. To correct this misconception, I outline a naive benchmark below that suggests that Julia’s vectorized code is, in fact, noticeably faster than R’s vectorized code. When experienced Julia programmers suggest that newcomers should consider devectorizing code, we’re not trying to beat R’s speed — our vectorized code does that already. Instead, we’re trying to match C’s speed.

As the examples below indicate, a little bit of devectorization goes a long way towards this loftier goal. In the specific examples I show, I find that:

  • Julia’s vectorized code is 2x faster than R’s vectorized code
  • Julia’s devectorized code is 140x faster than R’s vectorized code
  • Julia’s devectorized code is 1350x faster than R’s devectorized code

Examples of Vectorized and Devectorized Code in R

Let’s start by contrasting two pieces of R code: a vectorized and a devectorized implementation of a trivial snippet of code that does repeated vector addition.

First, we consider an example of idiomatic, vectorized R code:

vectorized <- function()
{
    a <- c(1, 1)
    b <- c(2, 2)
    x <- c(NaN, NaN)

    for (i in 1:1000000)
    {
        x <- a + b
    }

    return()
}

time <- function (N)
{
    timings <- rep(NA, N)

    for (itr in 1:N)
    {
        start <- Sys.time()
        vectorized()
        end <- Sys.time()
        timings[itr] <- end - start
    }

    return(timings)
}

mean(time(10))

This code takes, on average, 0.49 seconds per iteration to compute 1,000,000 vector additions.

Having considered the vectorized implementation, we can then consider an unidiomatic devectorized implementation of the same operation in R:

devectorized <- function()
{
    a <- c(1, 1)
    b <- c(2, 2)
    x <- c(NaN, NaN)

    for (i in 1:1000000)
    {
        for (index in 1:2)
        {
            x[index] <- a[index] + b[index]
        }
    }

    return()
}

time <- function (N)
{
    timings <- rep(NA, N)

    for (itr in 1:N)
    {
        start <- Sys.time()
        devectorized()
        end <- Sys.time()
        timings[itr] <- end - start
    }

    return(timings)
}

mean(time(10))

This takes, on average, 4.72 seconds per iteration to compute 1,000,000 vector additions.

Examples of Vectorized and Devectorized Code in Julia

Let’s now consider two Julia implementations of this same snippet of code. We’ll start with a vectorized implementation:

function vectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        x = a + b
    end

    return
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    vectorized()

    for itr in 1:N
        timings[itr] = @elapsed vectorized()
    end

    return timings
end

mean(time(10))

This takes, on average, 0.236 seconds per iteration to compute 1,000,000 vector additions.

Next, let’s consider a devectorized implementation of this same snippet:

function devectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        for index in 1:2
            x[index] = a[index] + b[index]
        end
    end

    return
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    devectorized()

    for itr in 1:N
        timings[itr] = @elapsed devectorized()
    end

    return timings
end

mean(time(10))

This takes, on average, 0.0035 seconds per iteration to compute 1,000,000 vector additions.

Comparing Performance in R and Julia

We can summarize the results of the four examples above in a single table:

Approach      Language  Average Time (seconds)
Vectorized    R         0.49
Devectorized  R         4.72
Vectorized    Julia     0.24
Devectorized  Julia     0.0035

All of these examples were timed on my 2.9 GHz Intel Core i7 MacBook Pro. The results are quite striking: Julia is uniformly faster than R. And a very small bit of devectorization produces huge performance improvements. Of course, it would be nice if Julia’s compiler could optimize vectorized code as well as it optimizes devectorized code. But doing so requires a substantial amount of work.

Why is Optimizing Vectorized Code Hard?

What makes automatic devectorization tricky to get right is that even minor variants of the snippet shown above have profoundly different optimization strategies. Consider, for example, the following two snippets of code:

function vectorized2()
    a = [1.0, 1.0]
    b = [2.0, 2.0]

    res = {}

    for i in 1:1000000
        x = [rand(), rand()]
        x += a + b
        push!(res, x)
    end

    return res
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    vectorized2()

    for itr in 1:N
        timings[itr] = @elapsed vectorized2()
    end

    return timings
end

mean(time(10))

This first snippet takes 1.29 seconds on average.

function devectorized2()
    a = [1.0, 1.0]
    b = [2.0, 2.0]

    res = {}

    for i in 1:1000000
        x = [rand(), rand()]
        for dim in 1:2
            x[dim] += a[dim] + b[dim]
        end
        push!(res, x)
    end

    return res
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    devectorized2()

    for itr in 1:N
        timings[itr] = @elapsed devectorized2()
    end

    return timings
end

mean(time(10))

This second snippet takes, on average, 0.27 seconds.

The gap between vectorized and devectorized code is much smaller here because this second set of code snippets uses memory in a very different way than our original snippets did. In the first set of snippets, it was possible to entirely avoid allocating any memory for storing changes to x. The devectorized code for the first set of snippets explicitly made clear to the compiler that no memory needed to be allocated. The vectorized code did not make this clear. Making it clear that no memory needed to be allocated led to a 75x speedup. Explicitly telling the compiler what it can avoid spending time on goes a long way.

In contrast, in the second set of snippets, a new chunk of memory has to be allocated for every x vector that gets created. And the result is that even the devectorized variant of our second snippet cannot offer much of a performance boost over its vectorized analogue. The devectorized variant is slightly faster because it avoids allocating any memory during the steps in which x has a and b added to it, but this makes less of a difference when there is still a lot of other work being done that cannot be avoided by devectorizing operations.

This reflects a more general statement: the vectorization/devectorization contrast is only correlated, not causally related, with the actual performance characteristics of code. What matters for computations that take place on modern computers is the efficient utilization of processor cycles and memory. In many real examples of vectorized code, it is memory management, rather than vectorization per se, that is the core causal factor responsible for performance.
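
To make that point concrete, here is a hedged sketch of a third variant of the original snippet (the name vectorized_inplace is mine, and the in-place broadcasting syntax comes from newer versions of Julia than the one timed above): it keeps the vectorized appearance but reuses a preallocated output buffer, so, like the devectorized loop, it allocates no memory on each iteration.

function vectorized_inplace()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        # Write the elementwise sum into the existing array x rather than
        # allocating a fresh result vector on every iteration.
        x .= a .+ b
    end

    return
end

The syntax is beside the point; what matters is the allocation pattern, which is exactly what the devectorized loop made explicit to the compiler.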

The Reversed Role of Vectorization in R and Julia

Part of what makes it difficult to have a straightforward discussion about vectorization is that vectorization in R conflates issues that are logically unrelated. In R, vectorization is often done for both (a) readability and (b) performance. In Julia, vectorization is only used for readability; it is devectorization that offers superior performance.

This confuses some people who are not familiar with the internals of R. It is therefore worth noting how one improves the speed of R code. The process of performance improvement is quite simple: one starts with devectorized R code, then replaces it with vectorized R code and then finally implements this vectorized R code in devectorized C code. This last step is unfortunately invisible to many R users, who therefore think of vectorization per se as a mechanism for increasing performance. Vectorization per se does not help make code faster. What makes vectorization in R effective is that it provides a mechanism for moving computations into C, where a hidden layer of devectorization can do its magic.

In other words, R is doing exactly what Julia is doing to get better performance. R’s vectorized code is simply a thin wrapper around completely devectorized C code. If you don’t believe me, go read the C code for something like R’s distance function, which involves calls to functions like the following:

static double R_euclidean(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist;
    int count, j;

    count= 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
    if(both_non_NA(x[i1], x[i2])) {
        dev = (x[i1] - x[i2]);
        if(!ISNAN(dev)) {
        dist += dev * dev;
        count++;
        }
    }
    i1 += nr;
    i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return sqrt(dist);
}

It is important to keep this sort of thing in mind: the term vectorization in R actually refers to a step in which you write devectorized code in C. Vectorization, per se, is a red herring when reasoning about performance.

To finish this last point, let’s summarize the performance hierarchy for R and Julia code in a simple table:

Worst Case            Typical Case            Best Case
(none)                Julia Vectorized Code   Julia Devectorized Code
R Devectorized Code   R Vectorized Code       C Devectorized Code

It is the complete absence of one column for Julia that makes it difficult to compare vectorization across the two languages. Nothing in Julia is as bad as R’s devectorized code. On the other end of the spectrum, the performance of Julia’s devectorized code simply has no point of comparison in pure R: it is more similar to the C code used to power R behind the scenes.

Conclusion

Julia aims to (and typically does) provide vectorized code that is as efficient as the vectorized code available in other high-level languages. What sets Julia apart is the possibility of writing, in pure Julia, high performance code that uses CPU and memory resources as effectively as can be done in C.

In particular, vectorization and devectorization stand in the opposite relationship to one another in Julia as they do in R. In R, devectorization makes code unusably slow: R code must be vectorized to perform at an acceptable level. In contrast, Julia programmers view vectorized code as a convenient prototype that can be modified with some clever devectorization to produce production-performance code. Of course, we would like prototype code to perform better. But no popular language offers that kind of functionality. What Julia offers isn’t the requirement for devectorization, but the possibility of doing it in Julia itself, rather than in C.

Writing Type-Stable Code in Julia

For many of the people I talk to, Julia’s main appeal is speed. But achieving peak performance in Julia requires that programmers absorb a few subtle concepts that are generally unfamiliar to users of weakly typed languages.

One particularly subtle performance pitfall is the need to write type-stable code. Code is said to be type-stable if the type of every variable does not vary over time. To clarify this idea, consider the following two closely related function definitions:

function sumofsins1(n::Integer)  
    r = 0  
    for i in 1:n  
        r += sin(3.4)  
    end  
    return r  
end  

function sumofsins2(n::Integer)  
    r = 0.0  
    for i in 1:n  
        r += sin(3.4)  
    end  
    return r  
end  

The only difference between these function definitions is that sumofsins1 initializes r to 0, whereas sumofsins2 initializes r to 0.0.

This seemingly minor distinction has important practical implications because the initialization of r to 0 means that the main loop of sumofsins1 begins with a single iteration in which the computer adds 0 to sin(3.4). This single addition step transforms the type of r from Int, which is the type of 0, to Float64, which is the type of sin(3.4). This means that the type of r is not stable over the course of this loop.

This instability has considerable effects on the performance of sumofsins1. To see this, let’s run some naive benchmarks. As always in Julia, we’ll start with a dry run to get the JIT to compile the functions being compared:

sumofsins1(100_000)  
sumofsins2(100_000)  

@time [sumofsins1(100_000) for i in 1:100];  
@time [sumofsins2(100_000) for i in 1:100];  

The results of this timing comparison are quite striking:

julia> @time [sumofsins1(100_000) for i in 1:100];  
elapsed time: 0.412261722 seconds (320002496 bytes allocated)  

julia> @time [sumofsins2(100_000) for i in 1:100];  
elapsed time: 0.008509995 seconds (896 bytes allocated)  

As you can see, the type-unstable code in sumofsins1 is 50x slower than the type-stable code. What might have seemed like a nitpicky point about the initial value of r has enormous performance implications.

To understand the reasons for this huge performance gap, it’s worth considering what effect type-instability has on the compiler. In this case, the compiler can’t optimize the contents of the main loop of sumofsins1 because it can’t be certain that the type of r will remain invariant throughout the entire loop. Without this crucial form of invariance, the compiler has to check the type of r on every iteration of the loop, which is a much more intensive computation than repeatedly adding a constant value to a Float64.

You can confirm for yourself that the compiler produces more complex code by examining the LLVM IR for both of these functions.

First, we’ll examine the LLVM IR for sumofsins1:

julia> code_llvm(sumofsins1, (Int, ))  

define %jl_value_t* @julia_sumofsins11067(i64) {  
top:  
  %1 = alloca [5 x %jl_value_t*], align 8  
  %.sub = getelementptr inbounds [5 x %jl_value_t*]* %1, i64 0, i64 0  
  %2 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 2, !dbg !5145  
  store %jl_value_t* inttoptr (i64 6 to %jl_value_t*), %jl_value_t** %.sub, align 8  
  %3 = load %jl_value_t*** @jl_pgcstack, align 8, !dbg !5145  
  %4 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 1, !dbg !5145  
  %.c = bitcast %jl_value_t** %3 to %jl_value_t*, !dbg !5145  
  store %jl_value_t* %.c, %jl_value_t** %4, align 8, !dbg !5145  
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8, !dbg !5145  
  %5 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 3  
  store %jl_value_t* null, %jl_value_t** %5, align 8  
  %6 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 4  
  store %jl_value_t* null, %jl_value_t** %6, align 8  
  store %jl_value_t* inttoptr (i64 140379580131904 to %jl_value_t*), %jl_value_t** %2, align 8, !dbg !5150  
  %7 = icmp slt i64 %0, 1, !dbg !5151  
  br i1 %7, label %L2, label %pass, !dbg !5151  

pass:                                             ; preds = %top, %pass  
  %8 = phi %jl_value_t* [ %13, %pass ], [ inttoptr (i64 140379580131904 to %jl_value_t*), %top ]  
  %"#s6.03" = phi i64 [ %14, %pass ], [ 1, %top ]  
  store %jl_value_t* %8, %jl_value_t** %5, align 8, !dbg !5152  
  %9 = call %jl_value_t* @alloc_2w(), !dbg !5152  
  %10 = getelementptr inbounds %jl_value_t* %9, i64 0, i32 0, !dbg !5152  
  store %jl_value_t* inttoptr (i64 140379580056656 to %jl_value_t*), %jl_value_t** %10, align 8, !dbg !5152  
  %11 = getelementptr inbounds %jl_value_t* %9, i64 1, i32 0, !dbg !5152  
  %12 = bitcast %jl_value_t** %11 to double*, !dbg !5152  
  store double 0xBFD05AC910FF4C6C, double* %12, align 8, !dbg !5152  
  store %jl_value_t* %9, %jl_value_t** %6, align 8, !dbg !5152  
  %13 = call %jl_value_t* @jl_apply_generic(%jl_value_t* inttoptr (i64 140379586379936 to %jl_value_t*), %jl_value_t** %5, i32 2), !dbg !5152  
  store %jl_value_t* %13, %jl_value_t** %2, align 8, !dbg !5152  
  %14 = add i64 %"#s6.03", 1, !dbg !5152  
  %15 = icmp sgt i64 %14, %0, !dbg !5151  
  br i1 %15, label %L2, label %pass, !dbg !5151  

L2:                                               ; preds = %pass, %top  
  %.lcssa = phi %jl_value_t* [ inttoptr (i64 140379580131904 to %jl_value_t*), %top ], [ %13, %pass ]  
  %16 = load %jl_value_t** %4, align 8, !dbg !5153  
  %17 = getelementptr inbounds %jl_value_t* %16, i64 0, i32 0, !dbg !5153  
  store %jl_value_t** %17, %jl_value_t*** @jl_pgcstack, align 8, !dbg !5153  
  ret %jl_value_t* %.lcssa, !dbg !5153  
}  

Then we’ll examine the LLVM IR for sumofsins2:

julia> code_llvm(sumofsins2, (Int, ))  

define double @julia_sumofsins21068(i64) {  
top:  
  %1 = icmp slt i64 %0, 1, !dbg !5151  
  br i1 %1, label %L2, label %pass, !dbg !5151  

pass:                                             ; preds = %top, %pass  
  %"#s6.04" = phi i64 [ %3, %pass ], [ 1, %top ]  
  %r.03 = phi double [ %2, %pass ], [ 0.000000e+00, %top ]  
  %2 = fadd double %r.03, 0xBFD05AC910FF4C6C, !dbg !5156  
  %3 = add i64 %"#s6.04", 1, !dbg !5156  
  %4 = icmp sgt i64 %3, %0, !dbg !5151  
  br i1 %4, label %L2, label %pass, !dbg !5151  

L2:                                               ; preds = %pass, %top  
  %r.0.lcssa = phi double [ 0.000000e+00, %top ], [ %2, %pass ]  
  ret double %r.0.lcssa, !dbg !5157  
}  

The difference in size and complexity of code between these two functions in compiled form is considerable. And this difference is entirely attributable to the compiler’s need to recheck the type of r on every iteration of the main loop in sumofsins1, which can be optimized out in sumofsins2, where r has a stable type.

Given the potential performance impacts of type-instability, every aspiring Julia programmer needs to learn to recognize potential sources of type-instability in their own code. Future versions of Julia may be configured to issue warnings when type-unstable code is encountered, but, for now, the responsibility lies with the programmer. Thankfully, once you learn about type-stability, it becomes easy to recognize in most cases.
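
As a final sketch, here is one common form of the fix (the function name and the zero(...) idiom are my own illustration, not the only option): initialize accumulators from the type of the values they will hold rather than from an integer literal.

function sumofsins3(n::Integer)
    # zero(typeof(sin(3.4))) is 0.0, so r starts life as a Float64 and its
    # type never changes inside the loop.
    r = zero(typeof(sin(3.4)))
    for i in 1:n
        r += sin(3.4)
    end
    return r
end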

September Talks

To celebrate my last full month on the East Coast, I’m doing a bunch of talks. If you’re interested in hearing more about Julia or statistics in general, you might want to come out to one of the events I’ll be at:

  • Julia Tutorial at DataGotham: On 9/12, Stefan and I will be giving a 3-hour long, hands on Julia tutorial as part of the Thursday DataGotham activities this year. If you’re in NYC and care about data analysis, you should try to make it out to part of the event, even if you skip the tutorials.
  • Online Learning Talk in NYC: On 9/17, I’ll be giving a talk on online learning at the Open Statistical Programming meetup. I’ll talk about using SGD to fit models online. This material is quite basic, but seems to be unfamiliar to a lot of people.
  • Julia Talk in DC: On 9/26, I’ll be giving a quick introduction to Julia in DC at the Statistical Programming DC meetup. The goal will be to introduce people to the basics of Julia.

Hopfield Networks in Julia

As a fun side project last night, I decided to implement a basic package for working with Hopfield networks in Julia.

Since I suspect many of the readers of this blog have never seen a Hopfield net before, let me explain what they are and what they can be used for. The short-and-skinny is that Hopfield networks were invented in the 1980s to demonstrate how a network of simple neurons might learn to associate incoming stimuli with a fixed pool of existing memories. As you’ll see from the examples below, this associative ability behaves a little bit like locality-sensitive hashing.

To see how Hopfield networks work, we need to define their internal structure. For the purposes of this blog post, we’ll assume that a Hopfield network is made up of N neurons. At every point in time, this network of neurons has a simple binary state, which I’ll associate with a vector of -1’s and +1’s.

Incoming stimuli are also represented using binary vectors of length N. Every time one of these stimuli is shown to the network, the network will use a simple updating rule to modify its state. The network will keep modifying its state until it settles into a stable state, which will be one of many fixed points for the updating rule. We’ll refer to the stable state that the network reaches as the memory that the network associates with the input stimulus.
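
To make that updating rule concrete, here is a minimal textbook-style sketch (my own illustration of one standard formulation, not the internals of the package used below): Hebbian training sums outer products of the stored ±1 patterns, and each update pushes every neuron toward the sign of its weighted input.

# Columns of `patterns` are the stored ±1 memories; `s` is a ±1 state vector.
function hebbian_weights(patterns::Matrix{Float64})
    n, p = size(patterns)
    W = (patterns * patterns') / p    # sum of outer products of the memories
    for i in 1:n
        W[i, i] = 0.0                 # no self-connections
    end
    return W
end

function settle(W::Matrix{Float64}, s::Vector{Float64}, iters::Integer)
    for t in 1:iters
        h = W * s                     # weighted input to every neuron
        s = Float64[h[i] >= 0 ? 1.0 : -1.0 for i in 1:length(s)]
    end
    return s
end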

For example, let’s assume that we have a network consisting of 42 neurons arranged in a 7×6 matrix. We’ll train our network to recognize the letters X and O, which will also be represented as 7×6 matrices. After training the network, we’ll present corrupted copies of the letters X and O to show that the network is able to associate corrupted stimuli with their uncorrupted memories. We’ll also show the network an uncorrupted copy of the unfamiliar letter F to see what memory it associates with an unfamiliar stimulus.

Using the HopfieldNets package, we can do this in Julia as follows:

using HopfieldNets
 
include(Pkg.dir("HopfieldNets", "demo", "letters.jl"))
 
patterns = hcat(X, O)
 
n = size(patterns, 1)
 
h = DiscreteHopfieldNet(n)
 
train!(h, patterns)
 
Xcorrupt = copy(X)
for i = 2:7
     Xcorrupt[i] = 1
end
 
Xrestored = associate!(h, Xcorrupt)

In the image below, I show what happens when we present X, O and F to the network after training it on the X and O patterns:


[Figure: Results of presenting X, O, and F to the network after training on X and O]

As you can see, the network perfectly recovers X and O from corrupted copies of those letters. In addition, the network associates F with an O, although the O is inverted relative to the O found in the training set. The emergence of untrained memories like this is common in Hopfield nets. To continue the analogy with LSH, you can think of the memories produced by the Hopfield net as hashes of the input, which have the property that similar inputs tend to produce similar outputs. In practice, you shouldn’t use a Hopfield net to do LSH, because the computations involved are quite costly.

Hopefully this simple example has piqued your interest in Hopfield networks. If you’d like to learn more, you can read through the code I wrote or work through the very readable presentation of the theory of Hopfield networks in David MacKay’s book, Information Theory, Inference, and Learning Algorithms.

Turning Off Comments

A few days ago I disabled the comment system on this site. I’d been debating the change for some time, but reached a final decision while reading the comments on an article about a vaccine for Lyme disease.

Although this site has generally had very high quality comments, I’ve become increasingly opposed (as a matter of principle) to the use of online comment systems. My feelings mirror those of many other people who’ve deactivated comments on their sites, including Marco Arment and Matt Gemmell. As many have said before, comments tend to bring out the worst in people. The conversations that comments are ostensibly supposed to inspire now occur on Twitter and in volleys of blog posts that are traded between multiple blogs. In contrast, comment threads tend to trap the material that people either (a) don’t want to associate with their own name or (b) don’t want to take the time to write up formally. I think we have too much of both of these sorts of writing and would prefer not to encourage either.

What’s Next

The last two weeks have been full of changes for me. For those who’ve been asking about what’s next, I thought I’d write up a quick summary of all the news.

(1) I successfully defended my thesis this past Monday. Completing a Ph.D. has been a massive undertaking for the past five years, and it’s a major relief to be done. From now on I’ll be (perhaps undeservedly) making airline and restaurant reservations under the name Dr. White.

(2) As announced last week, I’ll be one of the residents at Hacker School this summer. The list of other residents is pretty amazing, and I’m really looking forward to meeting the students.

(3) In addition to my residency at Hacker School, I’ll be a temporary postdoc in the applied math department at MIT, where I’ll be working on Julia full-time. Expect to see lots of work on building up the core data analysis infrastructure.

(4) As of today I’ve accepted an offer to join Facebook’s Data Science team in the fall. I’ll be moving out to the Bay Area in November.

That’s all so far.

Using Norms to Understand Linear Regression

Introduction

In my last post, I described how we can derive modes, medians and means as three natural solutions to the problem of summarizing a list of numbers, \((x_1, x_2, \ldots, x_n)\), using a single number, \(s\). In particular, we measured the quality of different potential summaries in three different ways, which led us to modes, medians and means respectively. Each of these quantities emerged from measuring the typical discrepancy between an element of the list, \(x_i\), and the summary, \(s\), using a formula of the form,
$$
\sum_i |x_i - s|^p,
$$
where \(p\) was either \(0\), \(1\) or \(2\).

The \(L_p\) Norms

In this post, I’d like to extend this approach to linear regression. The notion of discrepancies we used in the last post is very closely tied to the idea of measuring the size of a vector in \(\mathbb{R}^n\). Specifically, we were minimizing a measure of discrepancies that was almost identical to the \(L_p\) family of norms that can be used to measure the size of vectors. Understanding \(L_p\) norms makes it much easier to describe several modern generalizations of classical linear regression.

To extend our previous approach to the more standard notion of an \(L_p\) norm, we simply take the sum we used before and rescale things by taking a \(p^{th}\) root. This gives the formula for the \(L_p\) norm of any vector, \(v = (v_1, v_2, \ldots, v_n)\), as,
$$
|v|_p = (\sum_i |v_i|^p)^\frac{1}{p}.
$$
When \(p = 2\), this formula reduces to the familiar formula for the length of a vector:
$$
|v|_2 = \sqrt{\sum_i v_i^2}.
$$

In the last post, the vector we cared about was the vector of elementwise discrepancies, \(v = (x_1 - s, x_2 - s, \ldots, x_n - s)\). We wanted to minimize the overall size of this vector in order to make \(s\) a good summary of \(x_1, \ldots, x_n\). Because we were interested only in the minimum size of this vector, it didn’t matter that we skipped taking the \(p^{th}\) root at the end: one vector, \(v_1\), has a smaller norm than another vector, \(v_2\), only when the \(p^{th}\) power of that norm is smaller than the \(p^{th}\) power of the other. What was essential wasn’t the scale of the norm, but rather the value of \(p\) that we chose. Here we’ll follow that approach again. Specifically, we’ll again be working consistently with the \(p^{th}\) power of an \(L_p\) norm:
$$
|v|_p^p = (\sum_i |v_i|^p).
$$
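
In code, this quantity is a one-liner (the name lp_power is my own illustration):

# The pth power of the L_p norm of a vector v, for p >= 1. (The p = 0 case
# should be read as counting nonzero entries; this naive formula does not
# special-case it, because 0^0 evaluates to 1 in Julia.)
lp_power(v::Vector, p::Real) = sum([abs(x)^p for x in v])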

The Regression Problem

Using \(L_p\) norms to measure the overall size of a vector of discrepancies extends naturally to other problems in statistics. In the previous post, we were trying to summarize a list of numbers by producing a simple summary statistic. In this post, we’re instead going to summarize the relationship between two lists of numbers in a form that generalizes traditional regression models.

Instead of a single list, we’ll now work with two vectors: \((x_1, x_2, \ldots, x_n)\) and \((y_1, y_2, \ldots, y_n)\). Because we like simple models, we’ll make the very strong (and very convenient) assumption that the second vector is, approximately, a linear function of the first vector, which gives us the formula:
$$
y_i \approx \beta_0 + \beta_1 x_i.
$$

In practice, this linear relationship is never perfect, but only an approximation. As such, for any specific values we choose for \(\beta_0\) and \(\beta_1\), we have to compute a vector of discrepancies: \(v = (y_1 - (\beta_0 + \beta_1 x_1), \ldots, y_n - (\beta_0 + \beta_1 x_n))\). The question then becomes: how do we measure the size of this vector of discrepancies? By choosing different norms to measure its size, we arrive at several different forms of linear regression models. In particular, we’ll work with three norms: the \(L_0\), \(L_1\) and \(L_2\) norms.

As we did with the single vector case, here we’ll define discrepancies as,
$$
d_i = |y_i - (\beta_0 + \beta_1 x_i)|^p,
$$
and the total error as,
$$
E_p = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^p,
$$
which is just the \(p^{th}\) power of the \(L_p\) norm.
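
Written out in code (a small sketch; the name total_error is mine), this is the objective we will minimize for different values of \(p\):

# The pth power of the L_p norm of the regression discrepancies
# for candidate coefficients b0 and b1.
function total_error(x::Vector, y::Vector, b0::Real, b1::Real, p::Real)
    return sum([abs(y[i] - (b0 + b1 * x[i]))^p for i in 1:length(x)])
end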

Several Forms of Regression

In general, we want to estimate a set of regression coefficients that minimize this total error. Different forms of linear regression appear when we alter the values of \(p\). As before, let’s consider three settings:
$$
E_0 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^0
$$
$$
E_1 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^1
$$
$$
E_2 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^2
$$

What happens in these settings? In the first case, we select regression coefficients so that the line passes through as many points as possible. Clearly we can always select a line that passes through any pair of points. And we can show that there are data sets in which we cannot do better. So the \(L_0\) norm doesn’t seem to provide a very useful form of linear regression, but I’d be interested to see examples of its use.

In contrast, minimizing \(E_1\) and \(E_2\) define quite interesting and familiar forms of linear regression. We’ll start with \(E_2\) because it’s the most familiar: it defines Ordinary Least Squares (OLS) regression, which is the one we all know and love. In the \(L_2\) case, we select \(\beta_0\) and \(\beta_1\) to minimize,
$$
E_2 = \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2,
$$
which is the summed squared error over all of the \((x_i, y_i)\) pairs. In other words, Ordinary Least Squares regression is just an attempt to find an approximating linear relationship between two vectors that minimizes the \(L_2\) norm of the vector of discrepancies.

Although OLS regression is clearly king, the coefficients we get from minimizing \(E_1\) are also quite widely used: using the \(L_1\) norm defines Least Absolute Deviations (LAD) regression, which is also sometimes called Robust Regression. This approach to regression is robust because large outliers that would produce errors greater than \(1\) are not unnecessarily augmented by the squaring operation that’s used in defining OLS regression, but instead only have their absolute values taken. This means that the resulting model will try to match the overall linear pattern in the data even when there are some very large outliers.

We can also relate these two approaches to the strategy employed in the previous post. When we use OLS regression (which would be better called \(L_2\) regression), we predict the mean of \(y_i\) given the value of \(x_i\). And when we use LAD regression (which would be better called \(L_1\) regression), we predict the median of \(y_i\) given the value of \(x_i\). Just as I said in the previous post, the core theoretical tool that we need to understand is the \(L_p\) norm. For single number summaries, it naturally leads to modes, medians and means. For simple regression problems, it naturally leads to LAD regression and OLS regression. But there’s more: it also leads naturally to the two most popular forms of regularized regression.

Regularization

If you’re not familiar with regularization, the central idea is that we don’t exclusively try to find the values of \(\beta_0\) and \(\beta_1\) that minimize the discrepancy between \(\beta_0 + \beta_1 x_i\) and \(y_i\), but also simultaneously try to satisfy a competing requirement that \(\beta_1\) not get too large. Note that we don’t try to control the size of \(\beta_0\) because it describes the overall scale of the data rather than the relationship between \(x\) and \(y\).

Because these objectives compete, we have to combine them into a single objective. We do that by working with a linear sum of the two objectives. And because both the discrepancy objective and the size of the coefficients can be described in terms of norms, we’ll assume that we want to minimize the \(L_p\) norm of the discrepancies and the \(L_q\) norm of the \(\beta\)’s. This means that we end up trying to minimize an expression of the form,
$$
(\sum_i |y_i - (\beta_0 + \beta_1 x_i)|^{p}) + \lambda (|\beta_1|^q).
$$
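
As a sketch in code (the name penalized_error is mine, and this is just the objective, not a fitting routine), the regularized criterion adds a scaled penalty on the slope to the earlier total error:

# E_p discrepancy term plus an L_q penalty on the slope, weighted by λ.
function penalized_error(x::Vector, y::Vector, b0::Real, b1::Real,
                         p::Real, q::Real, λ::Real)
    fit = sum([abs(y[i] - (b0 + b1 * x[i]))^p for i in 1:length(x)])
    return fit + λ * abs(b1)^q
end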

In most regularized regression models that I’ve seen in the wild, people tend to use \(p = 2\) and \(q = 1\) or \(q = 2\). When \(q = 1\), this model is called the LASSO. When \(q = 2\), this model is called ridge regression. In a future post, I’ll try to describe why the LASSO and ridge regression produce such different patterns of coefficients.