That Way Madness Lies: Arithmetic on data.frames

tl;dr

Please do not use arithmetic on data.frame objects when programming in R. It’s a hack that only works if you know everything about your datasets. If anything happens to change the order of the rows in your data set, previously safe data.frame arithmetic operations will produce incorrect answers. If you learn to always explicitly merge two tables together before performing arithmetic on their shared columns, you’ll produce code that is both more reliable and more powerful.

Arithmetic between tables: getting wrong answers quickly

You may not be aware of it, but R allows you to do arithmetic on data.frame objects. For example, the following code works in R as of version 3.0.2:

> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(1, 2), Obs = c(2.0, 3.0))
> df3 <- (df1 + df2) / 2
> df3
  ID Obs
1  1 1.5
2  2 2.5

If you discover that you can do this, you might think that it’s a really cool trick. You might even start using data.frame arithmetic without realizing that your specific example had a bunch of special structure that was directly responsible for you getting the right answer.

Unfortunately, other examples that you didn’t see would have produced rather less pleasant outputs and led you to realize that arithmetic operations on data.frame objects don’t really make sense:

> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0))
> df3 <- (df1 + df2) / 2
> df3
   ID Obs
1 1.5   2
2 1.5   2

What happened here is obvious in retrospect: R added the two tables together cell by cell, matching rows by position rather than by ID, and then divided the result by two. That is not the computation you wanted, because you had forgotten that the matching rows in df1 and df2 were not in the same index positions in the two tables. Notice that even the ID column was averaged, which is a hint that the operation never really made sense.

Getting right answers with just a little more typing

Thankfully, it turns out that doing the right thing just requires a few more characters. What you should have done was to call merge before doing any arithmetic:

> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0))
> df3 <- merge(df1, df2, by = "ID")
> df3 <- transform(df3, AvgObs = (Obs.x + Obs.y) / 2)
> df3
  ID Obs.x Obs.y AvgObs
1  1     1     2    1.5
2  2     2     3    2.5

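What makes merge so unequivocally superior to data.frame arithmetic is that it still works when the two inputs have different numbers of rows. By default, merge keeps only the IDs that appear in both tables, so the unmatched ID 3 below is dropped rather than silently misaligned: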

> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(1, 2, 3), Obs = c(5.0, 6.0, 7.0))
> df3 <- merge(df1, df2, by = "ID")
> df3 <- transform(df3, AvgObs = (Obs.x + Obs.y) / 2)
> df3
  ID Obs.x Obs.y AvgObs
1  1     1     5      3
2  2     2     6      4
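
The difference between positional and keyed arithmetic can also be sketched outside R. The snippet below is plain Python and only an illustration: the tables are represented as lists of (ID, Obs) pairs, and the helper names positional_average and merged_average are made up for this example. It reuses the reordered df2 from above.

```python
# Hypothetical stand-ins for df1 and df2: lists of (ID, Obs) rows.
# df2 holds the same IDs as df1, but in a different row order.
df1 = [(1, 1.0), (2, 2.0)]
df2 = [(2, 3.0), (1, 2.0)]

def positional_average(left, right):
    """Average Obs row by row, ignoring the IDs entirely."""
    return [(l_obs + r_obs) / 2 for (_, l_obs), (_, r_obs) in zip(left, right)]

def merged_average(left, right):
    """Match rows on ID first (an inner join), then average the matched Obs."""
    right_by_id = dict(right)
    return {
        row_id: (obs + right_by_id[row_id]) / 2
        for row_id, obs in left
        if row_id in right_by_id
    }

print(positional_average(df1, df2))  # [2.0, 2.0], the wrong answer from above
print(merged_average(df1, df2))      # {1: 1.5, 2: 2.5}, matched by ID
```

The keyed version gives the same answer no matter how the rows of either table are ordered, which is the property that merge provides in R.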

Knowledge is half the battle

Now that you know why performing arithmetic operations on data.frame objects is generally unsafe, I implore you to stop doing it. Learn to love merge.

My Experience at JuliaCon

Introduction

I just got home from JuliaCon, the first conference dedicated entirely to Julia. It was a great pleasure to spend two full days listening to talks about a language that I started advocating for just a little more than two years ago.

What follows is a very brief review of the talks that excited me the most. It’s not in any way exhaustive: there were a bunch of other good talks that I saw as well as a few talks I missed so that I could visit the Data Science for Social Good fellows.

Optimization

The optimization community seems to be the academic field that’s been most ready to adopt Julia. Two talks about using Julia for optimization stood out: Iain Dunning and Joey Huchette’s talk about JuMP.jl, and Madeleine Udell’s talk about CVX.jl.

JuMP implements a DSL that allows users to describe an optimization problem in purely mathematical terms. This problem encoding can then be passed to one of many backend solvers to determine a solution. By abstracting across solvers, JuMP makes it easier for people like me to get access to well-established tools like GLPK.

CVX is quite similar to JuMP, but it implements a symbolic computation system that’s especially focused on allowing users to encode convex optimization problems. One of the things that’s most appealing about CVX is that it automatically confirms whether the problem you’re encoding is convex or not. Until I saw Madeleine’s talk, I hadn’t realized how much progress had been made on CVX.jl. Now that I’ve seen CVX.jl in action, I’m hoping to start using it for some of my work. I’ll probably also write a blog post about it in the future.

Statistics

I really enjoyed the statistics talks given by Doug Bates, Simon Byrne and Dan Wlasiuk. I was especially glad to hear Doug Bates remind the audience that, years ago, he’d attended a small meeting about R that was similar in size to this first iteration of JuliaCon. He noted that, over the intervening decades, the R community has grown from dozens to millions of users.

Language-Level Issues

Given that Julia is still something of a language nerd’s language, it’s no surprise that some of the best talks focused on language-level issues.

Arch Robison gave a really interesting talk about the tools used in Julia 0.3 to automatically vectorize code so that it can take advantage of SIMD instructions. If you’re coming from a language like R or Python, be aware that vectorization means almost the exact opposite thing to compiler writers that it means to high-level language users: vectorization involves the transformation of certain kinds of iterative code into the thread-free parallelized instructions that modern CPUs provide for performing a single operation on multiple data chunks simultaneously. I’ve come to love this kind of compiler design discussion and the invariance properties the compiler needs to prove before it can perform program transformations safely. For example, Arch noted that SIMD instructions can be safely used for operations such as summing many integers, but cannot be applied automatically to the same loop over floating point numbers, because floating point addition is not associative and reordering the operations can change the result.
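
The associativity failure mentioned above is easy to see in a small sketch. The snippet below is Python rather than Julia and is not code from the talk; the variable names are made up for the example. It only shows why a compiler cannot freely regroup floating point additions.

```python
# Floating point addition is not associative: regrouping changes the result.
a, b, c = 0.1, 0.2, 0.3
left_to_right = (a + b) + c
regrouped = a + (b + c)
print(left_to_right == regrouped)  # False

# Integer addition is exact here, so any grouping gives the same sum.
i, j, k = 1, 2, 3
print((i + j) + k == i + (j + k))  # True
```

A SIMD reduction effectively regroups the terms of a sum, so it is only a safe rewrite when every grouping gives the same answer.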

After Arch spoke, Jeff Bezanson gave a nice description of the process by which Julia code is transformed from the raw text users enter into the REPL into the final compiled form that gets executed by CPUs. For those interested in understanding how Julia works under the hood, this talk is likely to be the best place to start.

In addition, Leah Hanson and Keno Fischer both gave good talks about improved tools for debugging Julia code. Leah spoke about TypeCheck.jl, a system for automatically warning about potential code problems. Keno demoed a very rough draft of a Julia debugger built on top of LLDB. As an added plus, Keno also demoed a new C++ FFI for Julia that I’m really looking forward to. I’m hopeful that the new FFI will make it much easier to wrap C++ libraries for use from Julia.

Deploying Julia in Production

Both Avik Sengupta and Michael Bean described their experiences using Julia in production systems. Knowing that Julia was being used in production anywhere was inspiring.

Graphics and Audio

Daniel C. Jones and Spencer Russell both gave great talks about the developments taking place in graphics and audio support. Daniel C. Jones’s demo of a theremin built using Shashi Gowda’s React.jl and Spencer Russell’s AudioIO.jl was especially impressive.

Takeaways

The Julia community really is a community now. It was big enough to sell out a small conference and to field a large variety of discussion topics. I’m really excited to see how the next JuliaCon will turn out.

Falsifiability versus Rationalization

Here are two hypothetical conversations about psychological research. I’ll leave it to others to decide whether these conversations could ever take place.

Theories are just directional assertions about effects

Person A: And, just as I predicted, I found in my early studies that the correlation between X and Y is 0.4.
Person B: What do you make of the fact that subsequent studies have found that the correlation is closer to 0.001?
Person A: Oh, I was right all along: those studies continue to support my theoretical assertion that the empirical effect goes in the direction that my theory predicted. Exact numbers are meaningless in the social sciences, since we only conduct proof-of-concept studies and there are so many intervening variables we can’t measure.

Theories are just assertions about the existence of effects

Person A: And, just as I predicted, I found in my early studies that the correlation between X and Y is 0.4.
Person B: What do you make of the fact that a conceptual replication, which employed words rather than pictures, found that the correlation between X and Y was -0.05?
Person A: Oh, I was right all along: X does have an effect on Y, even though the effect can switch directions under some circumstances. What matters is that X affects Y at all, which is deeply counter-intuitive.

A Note on the Johnson-Lindenstrauss Lemma

Introduction

A recent thread on Theoretical CS StackExchange comparing the Johnson-Lindenstrauss Lemma with the Singular Value Decomposition piqued my interest enough that I decided to spend some time last night reading the standard JL papers. Until this week, I only had a vague understanding of what the JL Lemma implied. I previously mistook the JL Lemma for a purely theoretical result that established the existence of distance-preserving projections from high-dimensional spaces into low-dimensional spaces.

This vague understanding of the JL Lemma turns out to be almost correct, but it also led me to neglect the most interesting elements of the literature on the JL Lemma. The papers on the JL Lemma do not simply establish the existence of such projections. They also provide (1) an explicit bound on the dimensionality a projection needs in order to approximately preserve distances and (2) an explicit construction of a random matrix, \(A\), that produces the desired projection.

Once I knew that the JL Lemma came with a constructive proof, I decided to implement code in Julia to construct examples of this family of random projections. The rest of this post walks through that code as a way of explaining the JL Lemma’s practical applications.

Formal Statement of the JL Lemma

The JL Lemma, as stated in “An elementary proof of the Johnson-Lindenstrauss Lemma” by Dasgupta and Gupta, is the following result about dimensionality reduction:

For any \(0 < \epsilon < 1\) and any integer \(n\), let \(k\) be a positive integer such that \(k \geq 4(\epsilon^2/2 - \epsilon^3/3)^{-1}\log(n)\).

Then for any set \(V\) of \(n\) points in \(\mathbb{R}^d\), there is a map \(f : \mathbb{R}^d \to \mathbb{R}^k\) such that for all \(u, v \in V\),

$$
(1 - \epsilon) ||u - v||^2 \leq ||f(u) - f(v)||^2 \leq (1 + \epsilon) ||u - v||^2.
$$

Further this map can be found in randomized polynomial time.

To fully appreciate this result, we can unpack the abstract statement of the lemma into two components.

The JL Lemma in Two Parts

Part 1: Given a number of data points, \(n\), that we wish to project and a relative error, \(\epsilon\), that we are willing to tolerate, we can compute a minimum dimensionality, \(k\), that a projection must map a space into before it can guarantee that distances will be preserved up to a factor of \(\epsilon\).

In particular, \(k = \left \lceil{4(\epsilon^2/2 - \epsilon^3/3)^{-1}\log(n)} \right \rceil\).

Note that this implies that the dimensionality required to preserve distances depends only on the number of points and not on the dimensionality of the original space.
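
To get a feel for Part 1, here is a minimal sketch of the bound in Python, used only as a neutral illustration. The function name min_dim and the sample values of n are assumptions made for this example. It shows that the bound takes only the number of points and the tolerance as inputs, and never looks at the original dimensionality \(d\).

```python
import math

def min_dim(n, eps):
    """Smallest integer k with k >= 4 * log(n) / (eps^2 / 2 - eps^3 / 3)."""
    if n < 1 or not (0 < eps < 1):
        raise ValueError("need n >= 1 and 0 < eps < 1")
    return math.ceil(4 * math.log(n) / (eps ** 2 / 2 - eps ** 3 / 3))

# The bound depends on n and eps only; d is not an argument at all.
for n in (3, 1_000, 1_000_000):
    print(n, min_dim(n, 0.1))
```

Because \(n\) enters only through \(\log(n)\), multiplying the number of points by a large factor adds comparatively little to the required dimensionality, while tightening \(\epsilon\) is expensive.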

Part 2: Given an input matrix, \(X\), of \(n\) points in \(d\)-dimensional space, we can explicitly construct a map, \(f\), such that the distance between any pair of columns of \(X\) will not be distorted by more than a factor of \(\epsilon\).

Surprisingly, this map \(f\) can be a simple matrix, \(A\), constructed by sampling \(k * d\) IID draws from a Gaussian with mean \(0\) and variance \(\frac{1}{k}\).

Coding Up The Projections

We can translate the first part of the JL Lemma into a single line of code that computes the dimensionality, \(k\), of our low-dimensional space given the number of data points, \(n\), and the error, \(\epsilon\), that we are willing to tolerate:

mindim(n::Integer, ε::Real) = iceil((4 * log(n)) / (ε^2 / 2 - ε^3 / 3))

Having defined this function, we can try it out on a simple problem:

mindim(3, 0.1)
# => 942

This result was somewhat surprising to me: to represent \(3\) points with no more than \(10\)% error, we require nearly \(1,000\) dimensions. This reflects an important fact about the JL Lemma: it produces results that can be extremely conservative for low-dimensional inputs. It’s obvious that, for data sets that contain \(3\) points in \(100\)-dimensional space, we could use a projection into \(100\) dimensions that would preserve distances perfectly.

But this observation neglects one of the essential aspects of the JL Lemma: the dimensions required by the lemma will be sufficient whether our data set contains points in \(100\)-dimensional space or points in \(10^{100}\)-dimensional space. No matter what dimensionality the raw data lies in, the JL Lemma says that \(942\) dimensions suffice to preserve the distances between \(3\) points.

I found this statement unintuitive at the start. To see that it’s true, let’s construct a random projection matrix, \(A\), that will let us confirm experimentally that the JL Lemma really works:

using Distributions
 
function projection(
    X::Matrix,
    ε::Real,
    k::Integer = mindim(size(X, 2), ε)
)
    d, n = size(X)
    A = rand(Normal(0, 1 / sqrt(k)), k, d)
    return A, k, A * X
end

This projection function is sufficient to construct a matrix, \(A\), that will satisfy the assumptions of the JL Lemma. It will also return the dimensionality, \(k\), of \(A\) and the result of projecting the input, \(X\), into the new space defined by \(A\). To get a feel for how this works, we can try this out on a very simple data set:

X = eye(3, 3)
 
ε = 0.1
 
A, k, AX = projection(X, ε)
# =>
# (
# 942x3 Array{Float64,2}:
#  -0.035269    -0.0299966   -0.0292959 
#  -0.00501367   0.0316806    0.0460191 
#   0.0633815   -0.0136478   -0.0198676 
#   0.0262627    0.00187459  -0.0122604 
#   0.0417169   -0.0230222   -0.00842476
#   0.0236389    0.0585979   -0.0642437 
#   0.00685299  -0.0513301    0.0501431 
#   0.027723    -0.0151694    0.00274466
#   0.0338992    0.0216184   -0.0494157 
#   0.0612926    0.0276185    0.0271352 
#   ⋮                                   
#  -0.00167347  -0.018576     0.0290964 
#   0.0158393    0.0124403   -0.0208216 
#  -0.00833401   0.0323784    0.0245698 
#   0.019355     0.0057538    0.0150561 
#   0.00352774   0.031572    -0.0262811 
#  -0.0523636   -0.0388993   -0.00794319
#  -0.0363795    0.0633939   -0.0292289 
#   0.0106868    0.0341909    0.0116523 
#   0.0072586   -0.0337501    0.0405171 ,
# 
# 942,
# 942x3 Array{Float64,2}:
#  -0.035269    -0.0299966   -0.0292959 
#  -0.00501367   0.0316806    0.0460191 
#   0.0633815   -0.0136478   -0.0198676 
#   0.0262627    0.00187459  -0.0122604 
#   0.0417169   -0.0230222   -0.00842476
#   0.0236389    0.0585979   -0.0642437 
#   0.00685299  -0.0513301    0.0501431 
#   0.027723    -0.0151694    0.00274466
#   0.0338992    0.0216184   -0.0494157 
#   0.0612926    0.0276185    0.0271352 
#   ⋮                                   
#  -0.00167347  -0.018576     0.0290964 
#   0.0158393    0.0124403   -0.0208216 
#  -0.00833401   0.0323784    0.0245698 
#   0.019355     0.0057538    0.0150561 
#   0.00352774   0.031572    -0.0262811 
#  -0.0523636   -0.0388993   -0.00794319
#  -0.0363795    0.0633939   -0.0292289 
#   0.0106868    0.0341909    0.0116523 
#   0.0072586   -0.0337501    0.0405171 )

According to the JL Lemma, the new matrix, \(AX\), should approximately preserve the distances between columns of \(X\). We can write a quick function that verifies this claim:

function ispreserved(X::Matrix, A::Matrix, ε::Real)
    d, n = size(X)
    k = size(A, 1)
 
    for i in 1:n
        for j in (i + 1):n
            u, v = X[:, i], X[:, j]
            d_old = norm(u - v)^2
            d_new = norm(A * u - A * v)^2
            @printf("Considering the pair X[:, %d], X[:, %d]...\n", i, j)
            @printf("\tOld distance: %f\n", d_old)
            @printf("\tNew distance: %f\n", d_new)
            @printf(
                "\tWithin bounds %f <= %f <= %f\n",
                (1 - ε) * d_old,
                d_new,
                (1 + ε) * d_old
            )
            if !((1 - ε) * d_old <= d_new <= (1 + ε) * d_old)
                return false
            end
        end
    end
 
    return true
end

And then we can test out the results:

ispreserved(X, A, ε)
# =>
# Considering the pair X[:, 1], X[:, 2]...
#     Old distance: 2.000000
#     New distance: 2.104506
#     Within bounds 1.800000 <= 2.104506 <= 2.200000
# Considering the pair X[:, 1], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 2.006130
#     Within bounds 1.800000 <= 2.006130 <= 2.200000
# Considering the pair X[:, 2], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.955495
#     Within bounds 1.800000 <= 1.955495 <= 2.200000

As claimed, the distances are indeed preserved up to a factor of \(\epsilon\). But, as we noted earlier, the JL Lemma has a somewhat perverse consequence for our \(3 \times 3\) matrix: we’ve expanded our input into a \(942 \times 3\) matrix rather than reduced its dimensionality.

To get meaningful dimensionality reduction, we need to project a data set from a space that has more than \(942\) dimensions. So let’s try out a \(50,000\)-dimensional example:

X = eye(50_000, 3)
 
A, k, AX = projection(X, ε)
 
ispreserved(X, A, ε)
# =>
# Considering the pair X[:, 1], X[:, 2]...
#     Old distance: 2.000000
#     New distance: 2.021298
#     Within bounds 1.800000 <= 2.021298 <= 2.200000
# Considering the pair X[:, 1], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.955502
#     Within bounds 1.800000 <= 1.955502 <= 2.200000
# Considering the pair X[:, 2], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.988945
#     Within bounds 1.800000 <= 1.988945 <= 2.200000

In this case, the JL Lemma again works as claimed: the pairwise distances between columns of \(X\) are preserved. And we’ve done this while reducing the dimensionality of our data from \(50,000\) to \(942\). Moreover, this same approach would still work if the input space had \(10\) million dimensions.

Conclusion

Contrary to my naive conception of the JL Lemma, the literature on the lemma does not merely tell us that, abstractly, distances can be preserved by dimensionality reduction techniques. It also tells us how to perform this reduction, and the mechanism is both simple and general.

Data corruption in R 3.0.2 when using read.csv

Introduction

It may be old news to some, but I just recently discovered that the automatic type inference system that R uses when parsing CSV files assumes that data sets will never contain 64-bit integer values.

Specifically, if an integer value read from a CSV file is too large to fit in a 32-bit integer field without overflow, the column of data that contains that value will be automatically converted to floating point. This conversion will take place without any warnings, even though it may lead to data corruption.

The reason that the automatic conversion of 64-bit integer-valued data to floating point is problematic is that floating point numbers lack sufficient precision to exactly represent the full range of 64-bit integer values. As a consequence of the lower precision of floating point numbers, two unequal integer values in the input file may be converted to two equal floating point values in the data.frame R uses to represent that data. Subsequent analysis in R will therefore treat unequal values as if they were equal, corrupting any downstream analysis that assumes that the equality predicate can be trusted.

Below, I demonstrate this general problem using two specific data sets. The specific failure case that I outline occurred for me while using R 3.0.2 on my x86_64-apple-darwin10.8.0 platform laptop, which is a “MacBook Pro Retina, 13-inch, Late 2013” model.

Failure Case

Consider the following two tables, one containing 32-bit integer values and the other containing 64-bit integer values:

ID
1000
1001

ID
100000000000000000
100000000000000001

What happens when they are read into R using the read.csv function?

32-bit compatible integer values are parsed, correctly, using R’s integer type, which does not lead to data corruption:

data <- "MySQLID\n1000\n1001"
 
ids <- read.csv(text = data)
 
ids[1, 1] == ids[2, 1]
# [1] FALSE
 
class(ids$MySQLID)
# [1] "integer"

64-bit compatible integer values are parsed, incorrectly, using R’s numeric type, which does lead to data corruption:

data <- "MySQLID\n100000000000000000\n100000000000000001"
 
ids <- read.csv(text = data)
 
ids[1, 1] == ids[2, 1]
# [1] TRUE
 
class(ids$MySQLID)
# [1] "numeric"
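
The collapse shown above is a property of double-precision floating point rather than of R itself, so it can be reproduced in any language that uses IEEE 754 doubles. The following Python sketch is only an illustration (the variable names are made up for the example): Python integers are exact, while float is a 64-bit double with a 53-bit significand.

```python
# The two IDs from the second table, held as exact integers.
first_id = 100000000000000000
second_id = 100000000000000001

print(first_id == second_id)                # False: the integers differ
print(float(first_id) == float(second_id))  # True: both map to the same double

# Doubles represent every integer exactly only up to 2**53.
limit = 2 ** 53
print(float(limit) == float(limit + 1))     # True: precision is already lost
print(float(1000) == float(1001))           # False: small values are safe
```

Any integer ID larger than 2**53 (about 9.0e15) is therefore at risk of colliding with a neighbor once it is stored as a double.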

Conclusions

What should one make of this example? At the minimum, it suggests that R’s default behaviors are not well-suited to a world in which more and more people interact with data derived from commercial web sites, where 64-bit integers are commonplace. I hope that R will change the behavior of read.csv in a future release and deprecate any attempts to treat integer literals as anything other than 64-bit integers.

But, I would argue that this example also teaches a much more general point: it suggests that the assertion that scientists can safely ignore the distinction between integer and floating point data types is false. In the example I’ve provided, the very real distinction that modern CPUs make between integer and floating point data leads to very real data corruption occurring. How that data corruption affects downstream analyses is situation-dependent, but it is conceivable that the effects are severe in some settings. I would hope that we will stop asserting that scientists can use computers to analyze data without understanding the inherent limitations of the tools they are working with.

The Relationship between Vectorized and Devectorized Code

Introduction

Some people have come to believe that Julia’s vectorized code is unusably slow. To correct this misconception, I outline a naive benchmark below that suggests that Julia’s vectorized code is, in fact, noticeably faster than R’s vectorized code. When experienced Julia programmers suggest that newcomers should consider devectorizing code, we’re not trying to beat R’s speed — our vectorized code does that already. Instead, we’re trying to match C’s speed.

As the examples below indicate, a little bit of devectorization goes a long way towards this loftier goal. In the specific examples I show, I find that:

  • Julia’s vectorized code is 2x faster than R’s vectorized code
  • Julia’s devectorized code is 140x faster than R’s vectorized code
  • Julia’s devectorized code is 1350x faster than R’s devectorized code

Examples of Vectorized and Devectorized Code in R

Let’s start by contrasting two pieces of R code: a vectorized and a devectorized implementation of a trivial snippet of code that does repeated vector addition.

First, we consider an example of idiomatic, vectorized R code:

vectorized <- function()
{
    a <- c(1, 1)
    b <- c(2, 2)
    x <- c(NaN, NaN)

    for (i in 1:1000000)
    {
        x <- a + b
    }

    return()
}

time <- function (N)
{
    timings <- rep(NA, N)

    for (itr in 1:N)
    {
        start <- Sys.time()
        vectorized()
        end <- Sys.time()
        timings[itr] <- end - start
    }

    return(timings)
}

mean(time(10))

This code takes, on average, 0.49 seconds per iteration to compute 1,000,000 vector additions.

Having considered the vectorized implementation, we can then consider an unidiomatic devectorized implementation of the same operation in R:

devectorized <- function()
{
    a <- c(1, 1)
    b <- c(2, 2)
    x <- c(NaN, NaN)

    for (i in 1:1000000)
    {
        for (index in 1:2)
        {
            x[index] <- a[index] + b[index]
        }
    }

    return()
}

time <- function (N)
{
    timings <- rep(NA, N)

    for (itr in 1:N)
    {
        start <- Sys.time()
        devectorized()
        end <- Sys.time()
        timings[itr] <- end - start
    }

    return(timings)
}

mean(time(10))

This takes, on average, 4.72 seconds per iteration to compute 1,000,000 vector additions.

Examples of Vectorized and Devectorized Code in Julia

Let’s now consider two Julia implementations of this same snippet of code. We’ll start with a vectorized implementation:

function vectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        x = a + b
    end

    return
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    vectorized()

    for itr in 1:N
        timings[itr] = @elapsed vectorized()
    end

    return timings
end

mean(time(10))

This takes, on average, 0.236 seconds per iteration to compute 1,000,000 vector additions.

Next, let’s consider a devectorized implementation of this same snippet:

function devectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        for index in 1:2
            x[index] = a[index] + b[index]
        end
    end

    return
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    devectorized()

    for itr in 1:N
        timings[itr] = @elapsed devectorized()
    end

    return timings
end

mean(time(10))

This takes, on average, 0.0035 seconds per iteration to compute 1,000,000 vector additions.

Comparing Performance in R and Julia

We can summarize the results of the four examples above in a single table:

Approach      Language  Average Time (seconds)
Vectorized    R         0.49
Devectorized  R         4.72
Vectorized    Julia     0.24
Devectorized  Julia     0.0035

All of these examples were timed on my 2.9 GHz Intel Core i7 MacBook Pro. The results are quite striking: Julia is uniformly faster than R. And a very small bit of devectorization produces huge performance improvements. Of course, it would be nice if Julia’s compiler could optimize vectorized code as well as it optimizes devectorized code. But doing so requires a substantial amount of work.
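
The speedup factors quoted in the introduction follow directly from these averages. As a quick check, the arithmetic can be sketched in Python; the dictionary layout and the speedup helper are made up for this example, and the timings are the averages reported in the sections above.

```python
# Average seconds per run, copied from the timings reported above.
timings = {
    ("R", "vectorized"): 0.49,
    ("R", "devectorized"): 4.72,
    ("Julia", "vectorized"): 0.236,
    ("Julia", "devectorized"): 0.0035,
}

def speedup(slow, fast):
    """How many times faster the `fast` variant is than the `slow` one."""
    return timings[slow] / timings[fast]

comparisons = [
    (("R", "vectorized"), ("Julia", "vectorized")),
    (("R", "vectorized"), ("Julia", "devectorized")),
    (("R", "devectorized"), ("Julia", "devectorized")),
    (("Julia", "vectorized"), ("Julia", "devectorized")),
]
for slow, fast in comparisons:
    print(f"{fast} vs {slow}: {speedup(slow, fast):.1f}x")
```

These are ratios of single-machine averages over ten runs, so they are best read as orders of magnitude rather than precise constants.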

Why is Optimizing Vectorized Code Hard?

What makes automatic devectorization tricky to get right is that even minor variants of the snippet shown above have profoundly different optimization strategies. Consider, for example, the following two snippets of code:

function vectorized2()
    a = [1.0, 1.0]
    b = [2.0, 2.0]

    res = {}

    for i in 1:1000000
        x = [rand(), rand()]
        x += a + b
        push!(res, x)
    end

    return res
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    vectorized2()

    for itr in 1:N
        timings[itr] = @elapsed vectorized2()
    end

    return timings
end

mean(time(10))

This first snippet takes 1.29 seconds on average.

function devectorized2()
    a = [1.0, 1.0]
    b = [2.0, 2.0]

    res = {}

    for i in 1:1000000
        x = [rand(), rand()]
        for dim in 1:2
            x[dim] += a[dim] + b[dim]
        end
        push!(res, x)
    end

    return res
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    devectorized2()

    for itr in 1:N
        timings[itr] = @elapsed devectorized2()
    end

    return timings
end

mean(time(10))

This second snippet takes, on average, 0.27 seconds.

The gap between vectorized and devectorized code is much smaller here because this second set of code snippets uses memory in a very different way than our original snippets did. In the first set of snippets, it was possible to entirely avoid allocating any memory for storing changes to x. The devectorized code for the first set of snippets explicitly made clear to the compiler that no memory needed to be allocated. The vectorized code did not make this clear. Making it clear that no memory needed to be allocated led to a roughly 67x speedup (0.236 seconds versus 0.0035 seconds). Explicitly telling the compiler what it can avoid spending time on goes a long way.

In contrast, in the second set of snippets, a new chunk of memory has to be allocated for every x vector that gets created. And the result is that even the devectorized variant of our second snippet cannot offer much of a performance boost over its vectorized analogue. The devectorized variant is slightly faster because it avoids allocating any memory during the steps in which x has a and b added to it, but this makes less of a difference when there is still a lot of other work being done that cannot be avoided by devectorizing operations.

This reflects a more general statement: the vectorization/devectorization contrast is only correlated, not causally related, with the actual performance characteristics of code. What matters for computations that take place on modern computers is the efficient utilization of processor cycles and memory. In many real examples of vectorized code, it is memory management, rather than vectorization per se, that is the core causal factor responsible for performance.
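
The allocation difference between the two styles can be made visible with a small sketch. The snippet below is Python and says nothing about timings in Julia or R; it only shows, under that assumption, that a whole-vector expression produces a new object while element-wise assignment reuses the existing one. The variable names are made up for the example.

```python
# Hypothetical stand-ins for the vectors a and b used in the benchmarks.
a = [1.0, 1.0]
b = [2.0, 2.0]

# "Vectorized" style: the whole-vector expression builds a brand new list
# and x is rebound to it, so a fresh object is allocated on every pass.
x = [float("nan"), float("nan")]
original = x
x = [ai + bi for ai, bi in zip(a, b)]
rebound_to_new_object = x is not original

# "Devectorized" style: each element is written into the existing list,
# so no new container is allocated.
y = [float("nan"), float("nan")]
same_buffer = y
for index in range(len(y)):
    y[index] = a[index] + b[index]
updated_in_place = y is same_buffer

print(rebound_to_new_object, updated_in_place)  # True True
print(x == y)                                   # True: same values either way
```

Both styles compute the same values; what differs is whether the loop body has to obtain new memory, which is the factor the paragraphs above identify as decisive.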

The Reversed Role of Vectorization in R and Julia

Part of what makes it difficult to have a straightforward discussion about vectorization is that vectorization in R conflates issues that are logically unrelated. In R, vectorization is often done for both (a) readability and (b) performance. In Julia, vectorization is only used for readability; it is devectorization that offers superior performance.

This confuses some people who are not familiar with the internals of R. It is therefore worth noting how one improves the speed of R code. The process of performance improvement is quite simple: one starts with devectorized R code, then replaces it with vectorized R code and then finally implements this vectorized R code in devectorized C code. This last step is unfortunately invisible to many R users, who therefore think of vectorization per se as a mechanism for increasing performance. Vectorization per se does not help make code faster. What makes vectorization in R effective is that it provides a mechanism for moving computations into C, where a hidden layer of devectorization can do its magic.

In other words, R is doing exactly what Julia is doing to get better performance. R’s vectorized code is simply a thin wrapper around completely devectorized C code. If you don’t believe me, go read the C code for something like R’s dist function, which involves calls to functions like the following:

static double R_euclidean(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist;
    int count, j;

    count= 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
        if(both_non_NA(x[i1], x[i2])) {
            dev = (x[i1] - x[i2]);
            if(!ISNAN(dev)) {
                dist += dev * dev;
                count++;
            }
        }
        i1 += nr;
        i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return sqrt(dist);
}

It is important to keep this sort of thing in mind: the term vectorization in R actually refers to a step in which you write devectorized code in C. Vectorization, per se, is a red herring when reasoning about performance.

To finish this last point, let’s summarize the performance hierarchy for R and Julia code in a simple table:

Worst Case            Typical Case            Best Case
(none)                Julia Vectorized Code   Julia Devectorized Code
R Devectorized Code   R Vectorized Code       C Devectorized Code

It is the complete absence of one column for Julia that makes it difficult to compare vectorization across the two languages. Nothing in Julia is as bad as R’s devectorized code. On the other end of the spectrum, the performance of Julia’s devectorized code simply has no point of comparison in pure R: it is more similar to the C code used to power R behind the scenes.
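To make the two Julia columns concrete, here is a rough sketch of a Euclidean distance between two rows of a matrix, written once in vectorized style and once as an explicit loop. The function names are hypothetical, the NA handling from the C code above is left out, and the sketch assumes a Julia 1.x installation.

```julia
# Vectorized style: concise, but allocates temporary arrays for the
# row slices and for the squared differences.
euclidean_vectorized(x, i1, i2) = sqrt(sum((x[i1, :] .- x[i2, :]) .^ 2))

# Devectorized style: an explicit loop over columns that accumulates
# the squared differences in a single Float64.
function euclidean_devectorized(x::AbstractMatrix{<:Real}, i1::Integer, i2::Integer)
    dist = 0.0
    for j in axes(x, 2)
        dev = x[i1, j] - x[i2, j]
        dist += dev * dev
    end
    return sqrt(dist)
end

x = [0.0 0.0;
     3.0 4.0]

euclidean_vectorized(x, 1, 2)    # 5.0
euclidean_devectorized(x, 1, 2)  # 5.0
```

The loop version has the same shape as R_euclidean, except that it is written in Julia rather than C.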

Conclusion

Julia aims to (and typically does) provide vectorized code that is as efficient as the vectorized code available in other high-level languages. What sets Julia apart is the possibility of writing, in pure Julia, high performance code that uses CPU and memory resources as effectively as can be done in C.

In particular, the relationship between vectorization and devectorization in Julia is the reverse of the one in R. In R, devectorization makes code unusably slow: R code must be vectorized to perform at an acceptable level. In contrast, Julia programmers view vectorized code as a convenient prototype that can be modified with some clever devectorization to produce production-performance code. Of course, we would like prototype code to perform better. But no popular language offers that kind of functionality. What Julia offers isn’t the requirement for devectorization, but the possibility of doing it in Julia itself, rather than in C.

Writing Type-Stable Code in Julia

For many of the people I talk to, Julia’s main appeal is speed. But achieving peak performance in Julia requires that programmers absorb a few subtle concepts that are generally unfamiliar to users of weakly typed languages.

One particularly subtle performance pitfall is the need to write type-stable code. Code is said to be type-stable if the type of every variable does not vary over time. To clarify this idea, consider the following two closely related function definitions:

function sumofsins1(n::Integer)  
    r = 0  
    for i in 1:n  
        r += sin(3.4)  
    end  
    return r  
end  

function sumofsins2(n::Integer)  
    r = 0.0  
    for i in 1:n  
        r += sin(3.4)  
    end  
    return r  
end  

The only difference between these function definitions is that sumofsins1 initializes r to 0, whereas sumofsins2 initializes r to 0.0.

This seemingly minor distinction has important practical implications because the initialization of r to 0 means that the main loop of sumofsins1 begins with a single iteration in which the computer adds 0 to sin(3.4). This single addition step transforms the type of r from Int, which is the type of 0, to Float64, which is the type of sin(3.4). This means that the type of r is not stable over the course of this loop.
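The change of type can be observed directly. The following sketch uses a hypothetical helper, trace_types, that repeats the loop from sumofsins1 and records typeof(r) before the loop and after each iteration. The Int64 in the comment assumes a 64-bit machine.

```julia
# Hypothetical helper: same loop as sumofsins1, but it records the
# type of r at the start and after every iteration.
function trace_types(n::Integer)
    r = 0
    seen = DataType[typeof(r)]
    for i in 1:n
        r += sin(3.4)
        push!(seen, typeof(r))
    end
    return seen
end

trace_types(2)  # [Int64, Float64, Float64] on a 64-bit machine
```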

This instability has considerable effects on the performance of sumofsins1. To see this, let’s run some naive benchmarks. As always in Julia, we’ll start with a dry run to get the JIT to compile the functions being compared:

sumofsins1(100_000)  
sumofsins2(100_000)  

@time [sumofsins1(100_000) for i in 1:100];  
@time [sumofsins2(100_000) for i in 1:100];  

The results of this timing comparison are quite striking:

julia> @time [sumofsins1(100_000) for i in 1:100];  
elapsed time: 0.412261722 seconds (320002496 bytes allocated)  

julia> @time [sumofsins2(100_000) for i in 1:100];  
elapsed time: 0.008509995 seconds (896 bytes allocated)  

As you can see, the type-unstable code in sumofsins1 is roughly 50x slower than the type-stable code. What might have seemed like a nitpicky point about the initial value of r has enormous performance implications.

To understand the reasons for this huge performance gap, it’s worth considering what effect type-instability has on the compiler. In this case, the compiler can’t optimize the contents of the main loop of sumofsins1 because it can’t be certain that the type of r will remain invariant throughout the entire loop. Without this crucial form of invariance, the compiler has to check the type of r on every iteration of the loop, which is a much more intensive computation than repeatedly adding a constant value to a Float64.

You can confirm for yourself that the compiler produces more complex code by examining the LLVM IR for both of these functions.

First, we’ll examine the LLVM IR for sumofsins1:

julia> code_llvm(sumofsins1, (Int, ))  

define %jl_value_t* @julia_sumofsins11067(i64) {  
top:  
  %1 = alloca [5 x %jl_value_t*], align 8  
  %.sub = getelementptr inbounds [5 x %jl_value_t*]* %1, i64 0, i64 0  
  %2 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 2, !dbg !5145  
  store %jl_value_t* inttoptr (i64 6 to %jl_value_t*), %jl_value_t** %.sub, align 8  
  %3 = load %jl_value_t*** @jl_pgcstack, align 8, !dbg !5145  
  %4 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 1, !dbg !5145  
  %.c = bitcast %jl_value_t** %3 to %jl_value_t*, !dbg !5145  
  store %jl_value_t* %.c, %jl_value_t** %4, align 8, !dbg !5145  
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8, !dbg !5145  
  %5 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 3  
  store %jl_value_t* null, %jl_value_t** %5, align 8  
  %6 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 4  
  store %jl_value_t* null, %jl_value_t** %6, align 8  
  store %jl_value_t* inttoptr (i64 140379580131904 to %jl_value_t*), %jl_value_t** %2, align 8, !dbg !5150  
  %7 = icmp slt i64 %0, 1, !dbg !5151  
  br i1 %7, label %L2, label %pass, !dbg !5151  

pass:                                             ; preds = %top, %pass  
  %8 = phi %jl_value_t* [ %13, %pass ], [ inttoptr (i64 140379580131904 to %jl_value_t*), %top ]  
  %"#s6.03" = phi i64 [ %14, %pass ], [ 1, %top ]  
  store %jl_value_t* %8, %jl_value_t** %5, align 8, !dbg !5152  
  %9 = call %jl_value_t* @alloc_2w(), !dbg !5152  
  %10 = getelementptr inbounds %jl_value_t* %9, i64 0, i32 0, !dbg !5152  
  store %jl_value_t* inttoptr (i64 140379580056656 to %jl_value_t*), %jl_value_t** %10, align 8, !dbg !5152  
  %11 = getelementptr inbounds %jl_value_t* %9, i64 1, i32 0, !dbg !5152  
  %12 = bitcast %jl_value_t** %11 to double*, !dbg !5152  
  store double 0xBFD05AC910FF4C6C, double* %12, align 8, !dbg !5152  
  store %jl_value_t* %9, %jl_value_t** %6, align 8, !dbg !5152  
  %13 = call %jl_value_t* @jl_apply_generic(%jl_value_t* inttoptr (i64 140379586379936 to %jl_value_t*), %jl_value_t** %5, i32 2), !dbg !5152  
  store %jl_value_t* %13, %jl_value_t** %2, align 8, !dbg !5152  
  %14 = add i64 %"#s6.03", 1, !dbg !5152  
  %15 = icmp sgt i64 %14, %0, !dbg !5151  
  br i1 %15, label %L2, label %pass, !dbg !5151  

L2:                                               ; preds = %pass, %top  
  %.lcssa = phi %jl_value_t* [ inttoptr (i64 140379580131904 to %jl_value_t*), %top ], [ %13, %pass ]  
  %16 = load %jl_value_t** %4, align 8, !dbg !5153  
  %17 = getelementptr inbounds %jl_value_t* %16, i64 0, i32 0, !dbg !5153  
  store %jl_value_t** %17, %jl_value_t*** @jl_pgcstack, align 8, !dbg !5153  
  ret %jl_value_t* %.lcssa, !dbg !5153  
}  

Then we’ll examine the LLVM IR for sumofsins2:

julia> code_llvm(sumofsins2, (Int, ))  

define double @julia_sumofsins21068(i64) {  
top:  
  %1 = icmp slt i64 %0, 1, !dbg !5151  
  br i1 %1, label %L2, label %pass, !dbg !5151  

pass:                                             ; preds = %top, %pass  
  %"#s6.04" = phi i64 [ %3, %pass ], [ 1, %top ]  
  %r.03 = phi double [ %2, %pass ], [ 0.000000e+00, %top ]  
  %2 = fadd double %r.03, 0xBFD05AC910FF4C6C, !dbg !5156  
  %3 = add i64 %"#s6.04", 1, !dbg !5156  
  %4 = icmp sgt i64 %3, %0, !dbg !5151  
  br i1 %4, label %L2, label %pass, !dbg !5151  

L2:                                               ; preds = %pass, %top  
  %r.0.lcssa = phi double [ 0.000000e+00, %top ], [ %2, %pass ]  
  ret double %r.0.lcssa, !dbg !5157  
}  

The difference in size and complexity of code between these two functions in compiled form is considerable. And this difference is entirely attributable to the compiler’s need to recheck the type of r on every iteration of the main loop in sumofsins1, which can be optimized out in sumofsins2, where r has a stable type.
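One way to get a stable type without hard-coding 0.0 is to initialize the accumulator from the value being accumulated. This is only a sketch, and sumofsins3 is a hypothetical name that does not appear in the comparison above.

```julia
# Hypothetical variant: r starts as the zero of whatever type is
# being added, so its type never changes inside the loop.
function sumofsins3(n::Integer)
    x = sin(3.4)
    r = zero(x)      # 0.0, because x is a Float64
    for i in 1:n
        r += x
    end
    return r
end

sumofsins3(0)            # 0.0, a Float64 even when the loop never runs
typeof(sumofsins3(10))   # Float64
```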

Given the potential performance impacts of type-instability, every aspiring Julia programmer needs to learn to recognize potential sources of type-instability in their own code. Future versions of Julia may be configured to issue warnings when type-unstable code is encountered, but, for now, the responsibility lies with the programmer. Thankfully, once you learn about type-stability, it becomes easy to recognize in most cases.
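As a sketch of how one might check for this today, assuming a Julia 1.x installation (these tools postdate the version used for the benchmarks above), the inferred return type of each function can be inspected with Base.return_types, and the @code_warntype macro from the InteractiveUtils standard library prints the inferred type of every variable.

```julia
using InteractiveUtils  # provides @code_warntype outside the REPL

function sumofsins1(n::Integer)
    r = 0
    for i in 1:n
        r += sin(3.4)
    end
    return r
end

function sumofsins2(n::Integer)
    r = 0.0
    for i in 1:n
        r += sin(3.4)
    end
    return r
end

# Inferred return types for an Int argument. A non-concrete result,
# such as a Union, is a sign of type-instability.
unstable = Base.return_types(sumofsins1, (Int,))[1]
stable = Base.return_types(sumofsins2, (Int,))[1]

isconcretetype(unstable)  # false: r may be an Int or a Float64
isconcretetype(stable)    # true: r is always a Float64

# Prints the inferred types of all variables in sumofsins1.
@code_warntype sumofsins1(10)
```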

September Talks

To celebrate my last full month on the East Coast, I’m doing a bunch of talks. If you’re interested in hearing more about Julia or statistics in general, you might want to come out to one of the events I’ll be at:

  • Julia Tutorial at DataGotham: On 9/12, Stefan and I will be giving a 3-hour-long, hands-on Julia tutorial as part of the Thursday DataGotham activities this year. If you’re in NYC and care about data analysis, you should try to make it out to part of the event, even if you skip the tutorials.
  • Online Learning Talk in NYC: On 9/17, I’ll be giving a talk on online learning at the Open Statistical Programming meetup. I’ll talk about using SGD to fit models online. This material is quite basic, but seems to be unfamiliar to a lot of people.
  • Julia Talk in DC: On 9/26, I’ll be giving a quick introduction to Julia in DC at the Statistical Programming DC meetup. The goal will be to introduce people to the basics of Julia.

Hopfield Networks in Julia

As a fun side project last night, I decided to implement a basic package for working with Hopfield networks in Julia.

Since I suspect many of the readers of this blog have never seen a Hopfield net before, let me explain what they are and what they can be used for. The short-and-skinny is that Hopfield networks were invented in the 1980s to demonstrate how a network of simple neurons might learn to associate incoming stimuli with a fixed pool of existing memories. As you’ll see from the examples below, this associative ability behaves a little bit like locality-sensitive hashing.

To see how Hopfield networks work, we need to define their internal structure. For the purposes of this blog post, we’ll assume that a Hopfield network is made up of N neurons. At every point in time, this network of neurons has a simple binary state, which I’ll associate with a vector of -1s and +1s.

Incoming stimuli are also represented using binary vectors of length N. Every time one of these stimuli is shown to the network, the network will use a simple updating rule to modify its state. The network will keep modifying its state until it settles into a stable state, which will be one of many fixed points for the updating rule. We’ll refer to the stable state that the network reaches as the memory that the network associates with the input stimulus.
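To give a feel for what such an updating rule can look like, here is a minimal sketch of a textbook discrete Hopfield network: Hebbian weights built from the stored patterns, and a rule that sets each neuron to the sign of its weighted input until nothing changes. The names hebbian_weights and settle! are hypothetical, and this is not necessarily how the HopfieldNets package implements training or association.

```julia
# Hebbian training: sum the outer products of the stored patterns
# (one pattern per column) and leave the diagonal at zero.
function hebbian_weights(patterns::AbstractMatrix{<:Real})
    n = size(patterns, 1)
    W = zeros(Float64, n, n)
    for p in axes(patterns, 2), i in 1:n, j in 1:n
        if i != j
            W[i, j] += patterns[i, p] * patterns[j, p]
        end
    end
    return W
end

# Sweep over the neurons, setting each one to the sign of its weighted
# input, and stop once a full sweep leaves the state unchanged.
function settle!(state::AbstractVector{<:Real}, W::AbstractMatrix{<:Real};
                 maxsweeps::Integer = 100)
    for _ in 1:maxsweeps
        changed = false
        for i in eachindex(state)
            field = 0.0
            for j in eachindex(state)
                field += W[i, j] * state[j]
            end
            s_new = field >= 0 ? 1 : -1
            if s_new != state[i]
                state[i] = s_new
                changed = true
            end
        end
        changed || break
    end
    return state
end

memory = [1, -1, 1, -1, 1, -1]
W = hebbian_weights(reshape(memory, :, 1))

stimulus = copy(memory)
stimulus[1] = -1               # corrupt one entry of the stored pattern
settle!(stimulus, W) == memory # true: the network recovers the memory
```

With a single stored pattern and one flipped entry, the first sweep already restores the pattern and the second sweep confirms that the state is a fixed point.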

For example, let’s assume that we have a network consisting of 42 neurons arranged in a 7×6 matrix. We’ll train our network to recognize the letters X and O, which will also be represented as 7×6 matrices. After training the network, we’ll present corrupted copies of the letters X and O to show that the network is able to associate corrupted stimuli with their uncorrupted memories. We’ll also show the network an uncorrupted copy of the unfamiliar letter F to see what memory it associates with an unfamiliar stimulus.

Using the HopfieldNets package, we can do this in Julia as follows:

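using HopfieldNets

# Load the demo letter patterns (this script is assumed to define X and O)
include(Pkg.dir("HopfieldNets", "demo", "letters.jl"))

# Combine the X and O patterns side by side
patterns = hcat(X, O)

# Number of neurons in the network
n = size(patterns, 1)

h = DiscreteHopfieldNet(n)

# Train the network on the X and O patterns
train!(h, patterns)

# Corrupt a copy of X by forcing entries 2 through 7 to +1
Xcorrupt = copy(X)
for i = 2:7
    Xcorrupt[i] = 1
end

# Find the memory that the network associates with the corrupted X
Xrestored = associate!(h, Xcorrupt)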

In the image below, I show what happens when we present X, O and F to the network after training it on the X and O patterns:


[Figure: Results]

As you can see, the network perfectly recovers X and O from corrupted copies of those letters. In addition, the network associates F with an O, although the O is inverted relative to the O found in the training set. This kind of untrained memory emerging is common in Hopfield nets. To continue the analogy with LSH, you can think of the memories produced by the Hopfield net as hashes of the input, which have the property that similar inputs tend to produce similar outputs. In practice, you shouldn’t use a Hopfield net to do LSH, because the computations involved are quite costly.

Hopefully this simple example has piqued your interest in Hopfield networks. If you’d like to learn more, you can read through the code I wrote or work through the very readable presentation of the theory of Hopfield networks in David MacKay’s book, Information Theory, Inference, and Learning Algorithms.