What’s Wrong with Statistics in Julia?

Introduction

Several months ago, I promised to write an updated version of my old post, “The State of Statistics in Julia”, that would describe how Julia’s support for statistical computing has evolved since December 2012.

I’ve kept putting off writing that post for several reasons, but the most important reason is that all of my attention for the last few months has been focused on what’s wrong with how Julia handles statistical computing. As such, the post I’ve decided to write isn’t a review of what’s already been done in Julia, but a summary of what’s being done right now to improve Julia’s support for statistical computing.

In particular, this post focuses on several big changes to the core data structures that are used in Julia to represent statistical data. These changes should all ship when Julia 0.4 is released.

What’s Wrong with Statistics in Julia Today?

The primary problem with statistical computing in Julia is that the current tools were all designed to emulate R. Unfortunately, R’s approach to statistical computing isn’t amenable to the kinds of static analysis techniques that Julia uses to produce efficient machine code.

In particular, the following differences between R and Julia have repeatedly created problems for developers:

  • In Julia, computations involving scalars are at least as important as computations involving vectors. In particular, iterative computations are first-class citizens in Julia. This implies that statistical libraries must allow developers to write efficient code that iterates over the elements of a vector in pure Julia. Because Julia’s compiler can only produce efficient machine code for computations that are type-stable, the representations of missing values, categorical values and ordinal values in Julia programs must all be type-stable. Whether a value is missing or not, its type must remain the same.
  • In Julia, almost all end-users will end up creating their own types. As such, any tools for statistical computing must be generic enough that they can be extended to arbitrary types with little to no effort. In contrast to R, which can heavily optimize its algorithms for a very small number of primitive types, Julia developers must ensure that their libraries are both highly performant and highly abstract.
  • Julia, like most mainstream languages, eagerly evaluates the arguments passed to functions. This implies that idioms from R which depend upon non-standard evaluation are not appropriate for Julia, although it is possible to emulate some forms of non-standard evaluation using macros. In addition, Julia doesn’t allow programmers to reify scope. This implies that idioms from R that require access to the caller’s scope are not appropriate for Julia.
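The type-stability point can be made concrete with a small sketch (written in present-day Julia syntax, which differs from the 0.3-era spelling):

```julia
# A simple reduction loop: its speed depends on whether the compiler can
# prove a single concrete type for every element the loop visits.
function sumup(xs)
    s = 0.0
    for x in xs
        s += x
    end
    return s
end

concrete = ones(Float64, 3)                     # element type is exactly Float64
mixed = Union{Float64, Nothing}[1.0, 2.0, 3.0]  # element type is a Union

sumup(concrete)  # compiles to a tight, unboxed machine loop
sumup(mixed)     # same answer, but each element's type must be checked at runtime
```

The compiler can emit straight-line machine code for `concrete`, while the `Union` element type of `mixed` forces a per-element branch; that per-element cost is what makes type-stable representations of missingness so important.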

The most important way in which these issues came up in the first generation of statistical libraries was in the representation of a single scalar missing value. In Julia 0.3, this concept is represented by the value NA, but that representation will be replaced when 0.4 is released. Most of this post will focus on the problems created by NA.

In addition to problems involving NA, there were also problems with how expressions were passed to some functions. These have been resolved by removing the statistical function signatures that accepted expressions as arguments. A prototype package called DataFramesMeta, which uses macros to emulate some kinds of non-standard evaluation, is being developed by Tom Short.

Representing Missing Values

In Julia 0.3, missing values are represented by a singleton object, NA, of type NAtype. Thus, a variable x, which might be either a Float64 value or a missing value encoded as NA, will end up with type Union(Float64, NAtype). This Union type is a source of performance problems because it defeats Julia’s compiler’s attempts to assign a unique concrete type to every variable.

We could remove this type-instability by ensuring that every type has a specific value, such as NaN, that signals missingness. This is the approach that both R and pandas take. It offers acceptable performance, but does so at the expense of generic handling of non-primitive types. Given Julia’s rampant usage of custom types, the sentinel values approach is not viable.
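To see why sentinels don't generalize, note that NaN is a creature of floating-point arithmetic; there is no analogous spare value for Int, dates, or a user-defined type. A sketch:

```julia
# NaN can mark missingness in a Float64 vector...
x = [1.0, NaN, 3.0]
total = sum(filter(!isnan, x))  # sum over only the non-missing entries

# ...but every bit pattern of an Int64 is a legitimate integer, so an Int
# vector has no value left over to play the role of NaN. The same is true
# of most custom types, which is why sentinels break down in Julia.
```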

As such, we’re going to represent missing values in Julia 0.4 by borrowing some ideas from functional languages. In particular, we’ll be replacing the singleton object NA with a new parametric type Nullable{T}. Unlike NA, a Nullable object isn’t a direct scalar value. Rather, a Nullable object is a specialized container type that either contains one value or zero values. An empty Nullable container is taken to represent a missing value.
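The container idea can be sketched in a few lines. This is only an illustration of the concept in present-day syntax, not the actual 0.4 implementation:

```julia
# A container that holds either zero or one value of type T.
struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new(false)     # empty: represents a missing value
    Maybe{T}(x) where {T} = new(true, x)  # full: wraps a present value
end

ismissingvalue(m::Maybe) = !m.hasvalue
getvalue(m::Maybe) = m.hasvalue ? m.value : error("value is missing")
```

The crucial property is that a `Maybe{Float64}` variable has the concrete type `Maybe{Float64}` whether or not its value is missing, so the compiler never loses type information.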

The Nullable approach to representing a missing scalar value offers two distinct improvements:

  • Nullable{T} provides radically better performance than Union(T, NAtype). In some benchmarks, I find that iterative constructs can be as much as 100x faster when using Nullable{Float64} instead of Union(Float64, NAtype). I’ve also found that Nullable{Float64} is about 60% slower than using NaN to represent missing values, but it offers a generic approach that trivially extends to arbitrary new types, including integers, dates, complex numbers, quaternions, etc.
  • Nullable{T} provides more type safety by requiring that all attempts to interact with potentially missing values explicitly indicate how missing values should be treated.

In a future blog post, I’ll describe how Nullable works in greater detail.

Categorical Values

In addition to revising the representation of missing values, I’ve also been working on revising our representation of categorical values. Working with categorical data in Julia has always been a little strange, because the main tool for representing categorical data, the PooledDataArray, has always occupied an awkward intermediate position between two incompatible objectives:

  • A container that keeps track of the unique values present in the container and uses this information to efficiently represent values as pointers to a pool of unique values.
  • A container that contains values of a categorical variable drawn from a well-defined universe of possible values. The universe can include values that are not currently present in the container.

These two goals come into severe tension when considering subsets of a PooledDataArray. The uniqueness constraint suggests that the pool should shrink, whereas the categorical variable definition suggests that the pool should be maintained without change. In Julia 0.4, we’re going to commit completely to the latter behavior and leave the problem of efficiently representing highly compressible data for another data structure.
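The behavior we're committing to can be sketched as follows (a hypothetical illustration of the design, not the real PooledDataArray code):

```julia
# A categorical vector stored as integer references into a pool of levels.
struct CatVec
    refs::Vector{Int}     # refs[i] indexes into pool
    pool::Vector{String}  # the full universe of possible levels
end

# Subsetting keeps the entire pool, even when some levels no longer occur.
subset(v::CatVec, I) = CatVec(v.refs[I], v.pool)

v = CatVec([1, 2, 3, 1], ["low", "medium", "high"])
w = subset(v, [1, 4])  # only "low" occurs in w, but the pool is unchanged
```

After subsetting, `w.pool` still lists all three levels, matching the "well-defined universe of possible values" interpretation rather than the uniqueness-tracking one.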

We’ll also begin representing scalar values of categorical variables using custom types. The new CategoricalVariable and OrdinalVariable types that will ship with Julia 0.4 will further the efforts to put scalar computations on an equal footing with vector computations. This will be particularly notable for dealing with ordinal variables, which are not supported at all in Julia 0.3.

Metaprogramming

Many R functions employ non-standard evaluation as a mechanism for augmenting the current scope with the column names of a data.frame. In Julia, it’s often possible to emulate this behavior using macros. The in-progress DataFramesMeta package explores this alternative to non-standard evaluation. We will also be exploring other alternatives to non-standard evaluation in the future.
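A toy version of the idea, operating on plain Dicts rather than DataFrames, shows the mechanism. This is only a sketch of the macro approach, not the DataFramesMeta API:

```julia
# Rewrite quoted symbols like :a into lookups dct[:a] at macro-expansion time.
macro with(d, ex)
    rewrite(x) = x isa QuoteNode ? :(dct[$x]) :
                 x isa Expr ? Expr(x.head, map(rewrite, x.args)...) : x
    quote
        let dct = $(esc(d))
            $(rewrite(ex))
        end
    end
end

df = Dict(:a => [1, 2], :b => [3, 4])
@with(df, :a .+ :b)  # columns referenced by name, no explicit indexing; => [4, 6]
```

Because the rewriting happens at expansion time, the resulting code is ordinary, eagerly evaluated Julia: no caller scope is ever captured at runtime.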

What’s Next

In the long-term future, I’m hoping to improve several other parts of Julia’s core statistical infrastructure. In particular, I’d like to replace DataFrames with a new type that no longer occupies a strange intermediate position between matrices and relational tables. I’ll write another post about those issues later.

The Lesser Known Normal Forms of Database Design

-1st Normal Form: The database contains at least one table that is an exact copy of another table, except with additional columns.

-2nd Normal Form: The database contains at least one table that is a corrupt, out-of-date copy of another table, except with additional columns. It is impossible to determine if these additional columns can be trusted.

-3rd Normal Form: The database contains at least one table whose name contains the string _v1 and another table whose name contains the string _v2. At least one column in the _v1 table must be an undeclared foreign key that refers to rows that no longer exist in any other table.

-4th Normal Form: The database contains at least one table that has two or more columns whose contents are exactly the same, except that one of the columns has a name that is a misspelling of the other column’s name.

-5th Normal Form: The database contains (A) at least one table whose name contains the string do_not_use and (B) at least one other table whose name contains the string do_not_ever_ever_use. In each of these tables separately, at least two columns must contain only NULLs repeated for every single row. In addition, every string in these tables must have random amounts of whitespace padded on the left- and right-hand sides. Finally, at least one row must contain the text, “lasciate ogni speranza, voi ch’entrate.”

Values vs. Bindings: The Map is Not the Territory

Many newcomers to Julia are confused by the seemingly dissimilar behaviors of the following two functions:

julia> a = [1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3
 
julia> function foo!(a)
           a[1] = 10
           return
       end
foo! (generic function with 1 method)
 
julia> foo!(a)
 
julia> a
3-element Array{Int64,1}:
 10
  2
  3
 
julia> function bar!(a)
           a = [1, 2]
           return
       end
bar! (generic function with 1 method)
 
julia> bar!(a)
 
julia> a
3-element Array{Int64,1}:
 10
  2
  3

Why does the first function successfully alter the global variable a, but the second function does not?

To answer that question, we need to explain the distinction between values and bindings. We’ll start with a particularly simple example of a value and a binding.

In Julia, the number 1 is a value:

julia> 1
1

In contrast to operating on a value, the Julia assignment operation shown below creates a binding:

julia> a = 1
1

This newly created binding is an association between the symbolic name a and the value 1. In general, a binding operation always associates a specific value with a specific name. In Julia, the valid names that can be used to create bindings are symbols, because it is important that the names be parseable without ambiguity. For example, the string "a = 1" is not an acceptable name for a binding, because it would be ambiguous with the code that binds the value 1 to the name a.

This first example of values vs. bindings might lead one to believe that values and bindings are very easy to both recognize and distinguish. Unfortunately, the values of many common objects are not obvious to many newcomers.

What, for example, is the value of the following array?

julia> [1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3

To answer this question, note that the value of this array is not defined by the contents of the array. You can confirm this by checking whether Julia considers two objects to be exactly identical using the === operator:

julia> 1 === 1
true
 
julia> [1, 2, 3] === [1, 2, 3]
false

The general rule is simple, but potentially non-intuitive: two arrays with identical contents are not the same array. To motivate this, think of arrays as if they were cardboard boxes. If I have two cardboard boxes, each of which contains a single ream of paper, I would not claim that the two boxes are the exact same box just because they have the same contents. Our intuitive notion of object identity is rich enough to distinguish between two containers with the same contents, but it takes some time for newcomers to programming languages to extend this notion to their understanding of arrays.

Because every container is distinct regardless of what it contains, every array is distinct because every array is its own independent container. An array’s identity is not defined by what it contains. As such, its value is not equivalent to its contents. Instead, an array’s value is a unique identifier that allows one to reliably distinguish each array from every other array. Think of arrays like numbered cardboard boxes. The value of an array is its identifier: thus the value of [1, 2, 3] is something like the identifier “Box 1”. Right now, “Box 1” happens to contain the values 1, 2 and 3, but it will continue to be “Box 1” even after its contents have changed.
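This numbered-box view is easy to check directly in the REPL:

```julia
a = [1, 2, 3]
b = a        # b is bound to the very same box, not a copy of it
b[1] = 10    # change the box's contents
a === b      # => true: still the same array, whatever it contains
a[1]         # => 10: the change is visible through either name
```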

Hopefully that clarifies what the value of an array is. Starting from that understanding, we need to re-examine bindings because bindings themselves behave like containers.

A binding can be thought of as a named box that can contain either 0 or 1 values. Thus, when a new Julia session is launched, the name a has no value associated with it: it is an empty container. But after executing the line, a = 1, the name has a value: the container now has one element in it. Being a container, the name is distinct from its contents. As such, the name can be rebound by a later operation: the line a = 2 will change the contents of the box called a to refer to the value 2.

The fact that bindings behave like containers becomes a source of confusion when the value of a binding is itself a container:

a = [1, 2, 3]

In this case, the value associated with the name a is the identifier of an array that happens to have the values 1, 2, and 3 in it. But if the contents of that array are changed, the name a will still refer to the same array — because the value associated with a is not the contents of the array, but the identifier of the array.

As such, there is a very large difference between the following two operations:

a[1] = 10
a = [1, 2]
  • In the first case, we are changing the contents of the array that a refers to.
  • In the second case, we are changing which array a refers to.

In this second case, we are actually creating a brand new container as an intermediate step to changing the binding of a. This new container has, as its initial contents, the values 1 and 2. After creating this new container, the name a is changed to refer to the value that is the identifier of this new container.

This is why the two functions at the start of this post behave so differently: one mutates the contents of an array, while the other mutates which array a name refers to. Because variable names in functions are local, changing bindings inside of a function does not change the bindings outside of that function. Thus, the function bar! does not behave as some would hope. To change the contents of an array wholesale, you must not change bindings: you must change the contents of the array. To do that, bar! should be written as:

function bar!(a)
    a[:] = [1, 2]
    return
end

The notation a[:] allows one to talk about the contents of an array, rather than its identifier. In general, you should not expect that you can change the contents of any container without employing some indexing syntax that allows you to talk about the contents of the container, rather than the container itself.

That Way Madness Lies: Arithmetic on data.frames

tl;dr

Please do not use arithmetic on data.frame objects when programming in R. It’s a hack that only works if you know everything about your datasets. If anything happens to change the order of the rows in your data set, previously safe data.frame arithmetic operations will produce incorrect answers. If you learn to always explicitly merge two tables together before performing arithmetic on their shared columns, you’ll produce code that is both more reliable and more powerful.

Arithmetic between tables: getting wrong answers quickly

You may not be aware of it, but R allows you to do arithmetic on data.frame objects. For example, the following code works in R as of version 3.0.2:

> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(1, 2), Obs = c(2.0, 3.0))
> df3 <- (df1 + df2) / 2
> df3
  ID Obs
1  1 1.5
2  2 2.5

If you discover that you can do this, you might think that it’s a really cool trick. You might even start using data.frame arithmetic without realizing that your specific example had a bunch of special structure that was directly responsible for you getting the right answer.

Unfortunately, other examples that you didn’t see would have produced rather less pleasant outputs and led you to realize that arithmetic operations on data.frame objects don’t really make sense:

> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0))
> df3 <- (df1 + df2) / 2
> df3
   ID Obs
1 1.5   2
2 1.5   2

What happened here is obvious in retrospect: R added all of the columns together and then divided the result by two. The problem is that you didn’t actually want to add all of the columns together and then divide the result by two, because you had forgotten that the matching rows in df1 and df2 were not in the same index positions in the two tables.
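The fix, in any language, is to align rows by key rather than by position. Translating the idea into this blog's usual Julia, with plain Dicts standing in for tables:

```julia
# Observations keyed by ID rather than by row position.
obs1 = Dict(1 => 1.0, 2 => 2.0)
obs2 = Dict(2 => 3.0, 1 => 2.0)  # same IDs, stored in a different order

# Averaging by key gives the right answer regardless of row order;
# restricting to the shared IDs is what merge's inner join does.
shared = intersect(keys(obs1), keys(obs2))
avg = Dict(id => (obs1[id] + obs2[id]) / 2 for id in shared)
```

Here `avg` maps ID 1 to 1.5 and ID 2 to 2.5, matching the correct merged result above even though the two inputs list their rows in different orders.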

Getting right answers with just a little more typing

Thankfully, it turns out that doing the right thing just requires a few more characters. What you should have done was to call merge before doing any arithmetic:

> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0))
> df3 <- merge(df1, df2, by = "ID")
> df3 <- transform(df3, AvgObs = (Obs.x + Obs.y) / 2)
> df3
  ID Obs.x Obs.y AvgObs
1  1     1     2    1.5
2  2     2     3    2.5

What makes merge so unequivocally superior to data.frame arithmetic is that it still works when the two inputs have different numbers of rows:

> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(1, 2, 3), Obs = c(5.0, 6.0, 7.0))
> df3 <- merge(df1, df2, by = "ID")
> df3 <- transform(df3, AvgObs = (Obs.x + Obs.y) / 2)
> df3
  ID Obs.x Obs.y AvgObs
1  1     1     5      3
2  2     2     6      4

Knowledge is half the battle

Now that you know why performing arithmetic operations on data.frame objects is generally unsafe, I implore you to stop doing it. Learn to love merge.

My Experience at JuliaCon

Introduction

I just got home from JuliaCon, the first conference dedicated entirely to Julia. It was a great pleasure to spend two full days listening to talks about a language that I started advocating for just a little more than two years ago.

What follows is a very brief review of the talks that excited me the most. It’s not in any way exhaustive: there were a bunch of other good talks that I saw as well as a few talks I missed so that I could visit the Data Science for Social Good fellows.

Optimization

The optimization community seems to be the academic field that’s been most ready to adopt Julia. Two talks about using Julia for optimization stood out: Iain Dunning and Joey Huchette’s talk about JuMP.jl, and Madeleine Udell’s talk about CVX.jl.

JuMP implements a DSL that allows users to describe an optimization problem in purely mathematical terms. This problem encoding can be then passed to one of many backend solvers to determine a solution. By abstracting across solvers, JuMP makes it easier for people like me to get access to well-established tools like GLPK.

CVX is quite similar to JuMP, but it implements a symbolic computation system that’s especially focused on allowing users to encode convex optimization problems. One of the things that’s most appealing about CVX is that it automatically confirms whether the problem you’re encoding is convex or not. Until I saw Madeleine’s talk, I hadn’t realized how much progress had been made on CVX.jl. Now that I’ve seen CVX.jl in action, I’m hoping to start using it for some of my work. I’ll probably also write a blog post about it in the future.

Statistics

I really enjoyed the statistics talks given by Doug Bates, Simon Byrne and Dan Wlasiuk. I was especially glad to hear Doug Bates remind the audience that, years ago, he’d attended a small meeting about R that was similar in size to this first iteration of JuliaCon. He noted that, over the intervening decades, the R community has grown from dozens of users to millions.

Language-Level Issues

Given that Julia is still something of a language nerd’s language, it’s no surprise that some of the best talks focused on language-level issues.

Arch Robison gave a really interesting talk about the tools used in Julia 0.3 to automatically vectorize code so that it can take advantage of SIMD instructions. For those coming from languages like R or Python, be aware that vectorization means almost the opposite to compiler writers of what it means to high-level language users: here it is the transformation of certain kinds of iterative code into the thread-free parallel instructions that modern CPUs provide for performing a single operation on multiple chunks of data simultaneously. I’ve come to love this kind of compiler design discussion and the invariance properties the compiler needs to prove before it can perform program transformations safely. For example, Arch noted that SIMD instructions can be applied safely to many integer operations, but not automatically to floating-point operations, because floating-point addition is not associative.
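Julia exposes this trade-off directly: the @simd macro is a per-loop assertion that reassociating a floating-point reduction is acceptable, which frees the compiler to vectorize it. A sketch:

```julia
# Without @simd, the compiler must preserve the exact left-to-right
# floating-point addition order; @simd waives that requirement so the
# loop can be split across SIMD lanes.
function simdsum(x::Vector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

simdsum(ones(1000))  # reassociation changes rounding in general, not this sum
```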

After Arch spoke, Jeff Bezanson gave a nice description of the process by which Julia code is transformed from the raw text users enter into the REPL into the final compiled form that gets executed by CPUs. For those interested in understanding how Julia works under the hood, this talk is likely to be the best place to start.

In addition, Leah Hanson and Keno Fischer both gave good talks about improved tools for debugging Julia code. Leah spoke about TypeCheck.jl, a system for automatically warning about potential code problems. Keno demoed a very rough draft of a Julia debugger built on top of LLDB. As an added plus, Keno also demoed a new C++ FFI for Julia that I’m really looking forward to. I’m hopeful that the new FFI will make it much easier to wrap C++ libraries for use from Julia.

Deploying Julia in Production

Both Avik Sengupta and Michael Bean described their experiences using Julia in production systems. Knowing that Julia was being used in production anywhere was inspiring.

Graphics and Audio

Daniel C. Jones and Spencer Russell both gave great talks about the developments taking place in graphics and audio support. Daniel C. Jones’s demo of a theremin built using Shashi Gowda’s React.jl and Spencer Russell’s AudioIO.jl was especially impressive.

Take Aways

The Julia community really is a community now. It was big enough to sell out a small conference and to field a large variety of discussion topics. I’m really excited to see how the next JuliaCon will turn out.

Falsifiability versus Rationalization

Here are two hypothetical conversations about psychological research. I’ll leave it to others to decide whether these conversations could ever take place.

Theories are just directional assertions about effects

Person A: And, just as I predicted, I found in my early studies that the correlation between X and Y is 0.4.
Person B: What do you make of the fact that subsequent studies have found that the correlation is closer to 0.001?
Person A: Oh, I was right all along: those studies continue to support my theoretical assertion that the empirical effect goes in the direction that my theory predicted. Exact numbers are meaningless in the social sciences, since we only conduct proof-of-concept studies and there are so many intervening variables we can’t measure.

Theories are just assertions about the existence of effects

Person A: And, just as I predicted, I found in my early studies that the correlation between X and Y is 0.4.
Person B: What do you make of the fact that a conceptual replication, which employed words rather than pictures, found that the correlation between X and Y was -0.05?
Person A: Oh, I was right all along: X does have an effect on Y, even though the effect can switch directions under some circumstances. What matters is that X affects Y at all, which is deeply counter-intuitive.

A Note on the Johnson-Lindenstrauss Lemma

Introduction

A recent thread on Theoretical CS StackExchange comparing the Johnson-Lindenstrauss Lemma with the Singular Value Decomposition piqued my interest enough that I decided to spend some time last night reading the standard JL papers. Until this week, I only had a vague understanding of what the JL Lemma implied. I previously mistook the JL Lemma for a purely theoretical result that established the existence of distance-preserving projections from high-dimensional spaces into low-dimensional spaces.

This vague understanding of the JL Lemma turns out to be almost correct, but it also led me to neglect the most interesting elements of the literature on the JL Lemma: the papers on the JL Lemma do not simply establish the existence of such projections, but also provide (1) an explicit bound on the dimensionality required for a projection to ensure that it will approximately preserve distances and they even provide (2) an explicit construction of a random matrix, \(A\), that produces the desired projection.

Once I knew that the JL Lemma was a constructive proof, I decided to implement code in Julia to construct examples of this family of random projections. The rest of this post walks through that code as a way of explaining the JL Lemma’s practical applications.

Formal Statement of the JL Lemma

The JL Lemma, as stated in “An elementary proof of the Johnson-Lindenstrauss Lemma” by Dasgupta and Gupta, is the following result about dimensionality reduction:

For any \(0 < \epsilon < 1\) and any integer \(n\), let \(k\) be a positive integer such that \(k \geq 4(\epsilon^2/2 - \epsilon^3/3)^{-1}\log(n)\).

Then for any set \(V\) of \(n\) points in \(\mathbb{R}^d\), there is a map \(f : \mathbb{R}^d \to \mathbb{R}^k\) such that for all \(u, v \in V\),

$$
(1 - \epsilon) ||u - v||^2 \leq ||f(u) - f(v)||^2 \leq (1 + \epsilon) ||u - v||^2.
$$

Further this map can be found in randomized polynomial time.

To fully appreciate this result, we can unpack the abstract statement of the lemma into two components.

The JL Lemma in Two Parts

Part 1: Given a number of data points, \(n\), that we wish to project and a relative error, \(\epsilon\), that we are willing to tolerate, we can compute a minimum dimensionality, \(k\), that a projection must map a space into before it can guarantee that distances will be preserved up to a factor of \(\epsilon\).

In particular, \(k = \left \lceil{4(\epsilon^2/2 - \epsilon^3/3)^{-1}\log(n)} \right \rceil\).

Note that this implies that the dimensionality required to preserve distances depends only on the number of points and not on the dimensionality of the original space.

Part 2: Given an input matrix, \(X\), of \(n\) points in \(d\)-dimensional space, we can explicitly construct a map, \(f\), such that the distance between any pair of columns of \(X\) will not be distorted by more than a factor of \(\epsilon\).

Surprisingly, this map \(f\) can be a simple matrix, \(A\), constructed by sampling \(k * d\) IID draws from a Gaussian with mean \(0\) and variance \(\frac{1}{k}\).

Coding Up The Projections

We can translate the first part of the JL Lemma into a single line of code that computes the dimensionality, \(k\), of our low-dimensional space given the number of data points, \(n\), and the error, \(\epsilon\), that we are willing to tolerate:

mindim(n::Integer, ε::Real) = iceil((4 * log(n)) / (ε^2 / 2 - ε^3 / 3))

Having defined this function, we can try it out on a simple problem:

mindim(3, 0.1)
# => 942

This result was somewhat surprising to me: to represent \(3\) points with no more than \(10\)% error, we require nearly \(1,000\) dimensions. This reflects an important fact about the JL Lemma: it produces results that can be extremely conservative for low-dimensional inputs. It’s obvious that, for a data set that contains \(3\) points in \(100\)-dimensional space, we could use a projection into \(100\) dimensions that would preserve distances perfectly.

But this observation neglects one of the essential aspects of the JL Lemma: the dimensions required by the lemma will be sufficient whether our data set contains points in \(100\)-dimensional space or points in \(10^{100}\)-dimensional space. No matter what dimensionality the raw data lies in, the JL Lemma says that \(942\) dimensions suffices to preserve the distances between \(3\) points.

I found this statement unintuitive at the start. To see that it’s true, let’s construct a random projection matrix, \(A\), that will let us confirm experimentally that the JL Lemma really works:

using Distributions
 
function projection(
    X::Matrix,
    ε::Real,
    k::Integer = mindim(size(X, 2), ε)
)
    d, n = size(X)
    A = rand(Normal(0, 1 / sqrt(k)), k, d)
    return A, k, A * X
end

This projection function is sufficient to construct a matrix, \(A\), that will satisfy the assumptions of the JL Lemma. It will also return the dimensionality, \(k\), of \(A\) and the result of projecting the input, \(X\), into the new space defined by \(A\). To get a feel for how this works, we can try this out on a very simple data set:

X = eye(3, 3)
 
ε = 0.1
 
A, k, AX = projection(X, ε)
# =>
# (
# 942x3 Array{Float64,2}:
#  -0.035269    -0.0299966   -0.0292959 
#  -0.00501367   0.0316806    0.0460191 
#   0.0633815   -0.0136478   -0.0198676 
#   0.0262627    0.00187459  -0.0122604 
#   0.0417169   -0.0230222   -0.00842476
#   0.0236389    0.0585979   -0.0642437 
#   0.00685299  -0.0513301    0.0501431 
#   0.027723    -0.0151694    0.00274466
#   0.0338992    0.0216184   -0.0494157 
#   0.0612926    0.0276185    0.0271352 
#   ⋮                                   
#  -0.00167347  -0.018576     0.0290964 
#   0.0158393    0.0124403   -0.0208216 
#  -0.00833401   0.0323784    0.0245698 
#   0.019355     0.0057538    0.0150561 
#   0.00352774   0.031572    -0.0262811 
#  -0.0523636   -0.0388993   -0.00794319
#  -0.0363795    0.0633939   -0.0292289 
#   0.0106868    0.0341909    0.0116523 
#   0.0072586   -0.0337501    0.0405171 ,
# 
# 942,
# 942x3 Array{Float64,2}:
#  -0.035269    -0.0299966   -0.0292959 
#  -0.00501367   0.0316806    0.0460191 
#   0.0633815   -0.0136478   -0.0198676 
#   0.0262627    0.00187459  -0.0122604 
#   0.0417169   -0.0230222   -0.00842476
#   0.0236389    0.0585979   -0.0642437 
#   0.00685299  -0.0513301    0.0501431 
#   0.027723    -0.0151694    0.00274466
#   0.0338992    0.0216184   -0.0494157 
#   0.0612926    0.0276185    0.0271352 
#   ⋮                                   
#  -0.00167347  -0.018576     0.0290964 
#   0.0158393    0.0124403   -0.0208216 
#  -0.00833401   0.0323784    0.0245698 
#   0.019355     0.0057538    0.0150561 
#   0.00352774   0.031572    -0.0262811 
#  -0.0523636   -0.0388993   -0.00794319
#  -0.0363795    0.0633939   -0.0292289 
#   0.0106868    0.0341909    0.0116523 
#   0.0072586   -0.0337501    0.0405171 )

According to the JL Lemma, the new matrix, \(AX\), should approximately preserve the distances between columns of \(X\). We can write a quick function that verifies this claim:

function ispreserved(X::Matrix, A::Matrix, ε::Real)
    d, n = size(X)
    k = size(A, 1)
 
    for i in 1:n
        for j in (i + 1):n
            u, v = X[:, i], X[:, j]
            d_old = norm(u - v)^2
            d_new = norm(A * u - A * v)^2
            @printf("Considering the pair X[:, %d], X[:, %d]...\n", i, j)
            @printf("\tOld distance: %f\n", d_old)
            @printf("\tNew distance: %f\n", d_new)
            @printf(
                "\tWithin bounds %f <= %f <= %f\n",
                (1 - ε) * d_old,
                d_new,
                (1 + ε) * d_old
            )
            if !((1 - ε) * d_old <= d_new <= (1 + ε) * d_old)
                return false
            end
        end
    end
 
    return true
end

And then we can test out the results:

ispreserved(X, A, ε)
# =>
# Considering the pair X[:, 1], X[:, 2]...
#     Old distance: 2.000000
#     New distance: 2.104506
#     Within bounds 1.800000 <= 2.104506 <= 2.200000
# Considering the pair X[:, 1], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 2.006130
#     Within bounds 1.800000 <= 2.006130 <= 2.200000
# Considering the pair X[:, 2], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.955495
#     Within bounds 1.800000 <= 1.955495 <= 2.200000

As claimed, the pairwise distances are indeed preserved to within a factor of \(1 \pm \epsilon\). But, as we noted earlier, the JL Lemma has a somewhat perverse consequence for our \(3×3\) matrix: we've expanded our input into a \(942×3\) matrix rather than reduced its dimensionality.
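Where does 942 come from? It matches the standard JL bound \(k \geq 4 \ln(n) / (\epsilon^2/2 - \epsilon^3/3)\), which depends only on the number of points \(n\) and on \(\epsilon\), not on the ambient dimension. A quick check of the arithmetic (a Python sketch; the formula is the one used in standard proofs of the lemma, which I am assuming is the bound behind the \(k = 942\) figure above):

```python
import math

def jl_bound(n, eps):
    # Standard JL bound on the embedding dimension:
    #     k >= 4 * ln(n) / (eps^2 / 2 - eps^3 / 3)
    return math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))

print(jl_bound(3, 0.1))  # 942: n = 3 columns, eps = 0.1
```

Because the bound depends on \(n\) rather than on the input dimension, the same \(k = 942\) applies whether the columns live in 3 dimensions or 50,000.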

To get meaningful dimensionality reduction, we need to project a data set from a space that has more than \(942\) dimensions. So let’s try out a \(50,000\)-dimensional example:

X = eye(50_000, 3)
 
A, k, AX = projection(X, ε)
 
ispreserved(X, A, ε)
# =>
# Considering the pair X[:, 1], X[:, 2]...
#     Old distance: 2.000000
#     New distance: 2.021298
#     Within bounds 1.800000 <= 2.021298 <= 2.200000
# Considering the pair X[:, 1], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.955502
#     Within bounds 1.800000 <= 1.955502 <= 2.200000
# Considering the pair X[:, 2], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.988945
#     Within bounds 1.800000 <= 1.988945 <= 2.200000

In this case, the JL Lemma again works as claimed: the pairwise distances between columns of \(X\) are preserved. And we’ve done this while reducing the dimensionality of our data from \(50,000\) to \(942\). Moreover, this same approach would still work if the input space had \(10\) million dimensions.
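For readers who want to experiment outside Julia, here is a compact NumPy translation of the same pipeline. The choice of i.i.d. \(N(0, 1/k)\) entries for \(A\) and the bound on \(k\) are my assumptions, chosen to match the behavior shown above, not code from this post:

```python
import math
import numpy as np

def random_projection(X, eps, rng):
    # Project the columns of X into k dimensions, with k chosen by the JL
    # bound and A filled with i.i.d. N(0, 1/k) entries.
    d, n = X.shape
    k = math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))
    A = rng.normal(0.0, 1.0 / math.sqrt(k), size=(k, d))
    return A, k, A @ X

def is_preserved(X, A, eps):
    # Check every pairwise squared distance lands in [(1-eps)d, (1+eps)d].
    n = X.shape[1]
    for i in range(n):
        for j in range(i + 1, n):
            d_old = np.sum((X[:, i] - X[:, j]) ** 2)
            d_new = np.sum((A @ X[:, i] - A @ X[:, j]) ** 2)
            if not (1 - eps) * d_old <= d_new <= (1 + eps) * d_old:
                return False
    return True

rng = np.random.default_rng(1)
X = np.eye(3)
A, k, AX = random_projection(X, 0.1, rng)
print(k)  # 942
print(is_preserved(X, A, 0.1))
```

Note that the lemma's guarantee is probabilistic: any single random draw of \(A\) can fail the \(\epsilon\)-check, though with \(k = 942\) it rarely does.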

Conclusion

Contrary to my naive conception of the JL Lemma, the literature on the lemma does not merely tell us that, in the abstract, distances can be preserved by dimensionality reduction techniques. It also tells us exactly how to perform the reduction, and the mechanism is both simple and general.

Data corruption in R 3.0.2 when using read.csv

Introduction

It may be old news to some, but I just recently discovered that the automatic type inference system that R uses when parsing CSV files assumes that data sets will never contain 64-bit integer values.

Specifically, if an integer value read from a CSV file is too large to fit in a 32-bit integer field without overflow, the column of data that contains that value will be automatically converted to floating point. This conversion takes place without any warning, even though it may lead to data corruption.

The reason that the automatic conversion of 64-bit integer-valued data to floating point is problematic is that floating point numbers lack sufficient precision to exactly represent the full range of 64-bit integer values. As a consequence of the lower precision of floating point numbers, two unequal integer values in the input file may be converted to two equal floating point values in the data.frame R uses to represent that data. Subsequent analysis in R will therefore treat unequal values as if they were equal, corrupting any downstream analysis that assumes that the equality predicate can be trusted.
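The precision loss is easy to see directly: doubles carry a 53-bit significand, so above \(2^{53}\) not every integer has its own floating point representation. A quick illustration, using Python only because its integers are arbitrary precision (the same coercion happens in any language when these values become doubles):

```python
# 64-bit floats have 53 significand bits, so consecutive integers near 10**17
# (well above 2**53) round to the same double.
a = 100000000000000000
b = 100000000000000001

print(a == b)                # False as exact integers
print(float(a) == float(b))  # True once both are coerced to doubles
```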

Below, I demonstrate this general problem using two specific data sets. The specific failure case that I outline occurred for me while using R 3.0.2 on my x86_64-apple-darwin10.8.0 platform laptop, which is a “MacBook Pro Retina, 13-inch, Late 2013” model.

Failure Case

Consider the following two tables, one containing 32-bit integer values and the other containing 64-bit integer values:

32-bit integer values:

ID
1000
1001

64-bit integer values:

ID
100000000000000000
100000000000000001

What happens when they are read into R using the read.csv function?

32-bit compatible integer values are parsed, correctly, using R’s integer type, which does not lead to data corruption:

data <- "MySQLID\n1000\n1001"
 
ids <- read.csv(text = data)
 
ids[1, 1] == ids[2, 1]
# [1] FALSE
 
class(ids$MySQLID)
# [1] "integer"

64-bit compatible integer values are parsed, incorrectly, using R’s numeric type, which does lead to data corruption:

data <- "MySQLID\n100000000000000000\n100000000000000001"
 
ids <- read.csv(text = data)
 
ids[1, 1] == ids[2, 1]
# [1] TRUE
 
class(ids$MySQLID)
# [1] "numeric"
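For contrast, a language that parses integer literals as true integers keeps the two IDs distinct. Here is the same input run through Python's csv module, with arbitrary-precision int, followed by the collision reappearing once the values are coerced to doubles (my example, not from the original post):

```python
import csv
import io

data = "MySQLID\n100000000000000000\n100000000000000001"
ids = [int(row["MySQLID"]) for row in csv.DictReader(io.StringIO(data))]

print(ids[0] == ids[1])                # False: exact integer parsing
print(float(ids[0]) == float(ids[1]))  # True: the same corruption as read.csv
```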

Conclusions

What should one make of this example? At the minimum, it suggests that R’s default behaviors are not well-suited to a world in which more and more people interact with data derived from commercial web sites, where 64-bit integers are commonplace. I hope that R will change the behavior of read.csv in a future release and deprecate any attempts to treat integer literals as anything other than 64-bit integers.

But I would argue that this example also teaches a much more general lesson: it shows that the claim that scientists can safely ignore the distinction between integer and floating point data types is false. In the example I've provided, the very real distinction that modern CPUs make between integer and floating point data leads to very real data corruption. How that corruption affects downstream analyses is situation-dependent, but it is conceivable that the effects are severe in some settings. I hope we will stop asserting that scientists can use computers to analyze data without understanding the inherent limitations of the tools they are working with.

The Relationship between Vectorized and Devectorized Code

Introduction

Some people have come to believe that Julia’s vectorized code is unusably slow. To correct this misconception, I outline a naive benchmark below that suggests that Julia’s vectorized code is, in fact, noticeably faster than R’s vectorized code. When experienced Julia programmers suggest that newcomers should consider devectorizing code, we’re not trying to beat R’s speed — our vectorized code does that already. Instead, we’re trying to match C’s speed.

As the examples below indicate, a little bit of devectorization goes a long way towards this loftier goal. In the specific examples I show, I find that:

  • Julia’s vectorized code is 2x faster than R’s vectorized code
  • Julia’s devectorized code is 140x faster than R’s vectorized code
  • Julia’s devectorized code is 1350x faster than R’s devectorized code

Examples of Vectorized and Devectorized Code in R

Let’s start by contrasting two pieces of R code: a vectorized and a devectorized implementation of a trivial snippet of code that does repeated vector addition.

First, we consider an example of idiomatic, vectorized R code:

vectorized <- function()
{
    a <- c(1, 1)
    b <- c(2, 2)
    x <- c(NaN, NaN)

    for (i in 1:1000000)
    {
        x <- a + b
    }

    return()
}

time <- function (N)
{
    timings <- rep(NA, N)

    for (itr in 1:N)
    {
        start <- Sys.time()
        vectorized()
        end <- Sys.time()
        timings[itr] <- end - start
    }

    return(timings)
}

mean(time(10))

This code takes, on average, 0.49 seconds per iteration to compute 1,000,000 vector additions.

Having considered the vectorized implementation, we can then consider an unidiomatic devectorized implementation of the same operation in R:

devectorized <- function()
{
    a <- c(1, 1)
    b <- c(2, 2)
    x <- c(NaN, NaN)

    for (i in 1:1000000)
    {
        for (index in 1:2)
        {
            x[index] <- a[index] + b[index]
        }
    }

    return()
}

time <- function (N)
{
    timings <- rep(NA, N)

    for (itr in 1:N)
    {
        start <- Sys.time()
        devectorized()
        end <- Sys.time()
        timings[itr] <- end - start
    }

    return(timings)
}

mean(time(10))

This takes, on average, 4.72 seconds per iteration to compute 1,000,000 vector additions.

Examples of Vectorized and Devectorized Code in Julia

Let’s now consider two Julia implementations of this same snippet of code. We’ll start with a vectorized implementation:

function vectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        x = a + b
    end

    return
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    vectorized()

    for itr in 1:N
        timings[itr] = @elapsed vectorized()
    end

    return timings
end

mean(time(10))

This takes, on average, 0.236 seconds per iteration to compute 1,000,000 vector additions.

Next, let’s consider a devectorized implementation of this same snippet:

function devectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        for index in 1:2
            x[index] = a[index] + b[index]
        end
    end

    return
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    devectorized()

    for itr in 1:N
        timings[itr] = @elapsed devectorized()
    end

    return timings
end

mean(time(10))

This takes, on average, 0.0035 seconds per iteration to compute 1,000,000 vector additions.

Comparing Performance in R and Julia

We can summarize the results of the four examples above in a single table:

Approach       Language   Average Time (s)
Vectorized     R          0.49
Devectorized   R          4.72
Vectorized     Julia      0.24
Devectorized   Julia      0.0035
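The speedup factors quoted in the introduction follow directly from these timings; a quick check of the arithmetic:

```python
# Average per-call timings, in seconds, from the runs above.
timings = {
    ("Vectorized", "R"): 0.49,
    ("Devectorized", "R"): 4.72,
    ("Vectorized", "Julia"): 0.24,
    ("Devectorized", "Julia"): 0.0035,
}

# Julia vectorized vs. R vectorized: about 2x
print(timings[("Vectorized", "R")] / timings[("Vectorized", "Julia")])
# Julia devectorized vs. R vectorized: about 140x
print(timings[("Vectorized", "R")] / timings[("Devectorized", "Julia")])
# Julia devectorized vs. R devectorized: about 1350x
print(timings[("Devectorized", "R")] / timings[("Devectorized", "Julia")])
```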

All of these examples were timed on my 2.9 GHz Intel Core i7 MacBook Pro. The results are quite striking: Julia is uniformly faster than R. And a very small bit of devectorization produces huge performance improvements. Of course, it would be nice if Julia’s compiler could optimize vectorized code as well as it optimizes devectorized code. But doing so requires a substantial amount of work.

Why is Optimizing Vectorized Code Hard?

What makes automatic devectorization tricky to get right is that even minor variants of the snippet shown above have profoundly different optimization strategies. Consider, for example, the following two snippets of code:

function vectorized2()
    a = [1.0, 1.0]
    b = [2.0, 2.0]

    res = {}

    for i in 1:1000000
        x = [rand(), rand()]
        x += a + b
        push!(res, x)
    end

    return res
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    vectorized2()

    for itr in 1:N
        timings[itr] = @elapsed vectorized2()
    end

    return timings
end

mean(time(10))

This first snippet takes 1.29 seconds on average.

function devectorized2()
    a = [1.0, 1.0]
    b = [2.0, 2.0]

    res = {}

    for i in 1:1000000
        x = [rand(), rand()]
        for dim in 1:2
            x[dim] += a[dim] + b[dim]
        end
        push!(res, x)
    end

    return res
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    devectorized2()

    for itr in 1:N
        timings[itr] = @elapsed devectorized2()
    end

    return timings
end

mean(time(10))

This second snippet takes, on average, 0.27 seconds.

The gap between vectorized and devectorized code is much smaller here because this second set of snippets uses memory very differently than our original snippets did. In the first set of snippets, it was possible to entirely avoid allocating any memory for storing changes to x. The devectorized code made that explicit to the compiler; the vectorized code did not. Making it clear that no memory needed to be allocated produced a nearly 70x speedup. Explicitly telling the compiler what it can avoid spending time on goes a long way.

In contrast, in the second set of snippets, a new chunk of memory has to be allocated for every x vector that gets created. And the result is that even the devectorized variant of our second snippet cannot offer much of a performance boost over its vectorized analogue. The devectorized variant is slightly faster because it avoids allocating any memory during the steps in which x has a and b added to it, but this makes less of a difference when there is still a lot of other work being done that cannot be avoided by devectorizing operations.

This reflects a more general statement: the vectorization/devectorization contrast is only correlated, not causally related, with the actual performance characteristics of code. What matters for computations that take place on modern computers is the efficient utilization of processor cycles and memory. In many real examples of vectorized code, it is memory management, rather than vectorization per se, that is the core causal factor responsible for performance.
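The allocation story is not Julia-specific. The same contrast shows up in NumPy, where `a + b` allocates a fresh array on every iteration while `np.add(a, b, out=x)` reuses existing storage (my analogy, not from the original post):

```python
import timeit
import numpy as np

a = np.array([1.0, 1.0])
b = np.array([2.0, 2.0])
x = np.empty(2)

def allocating():
    for _ in range(100_000):
        y = a + b            # allocates a new 2-element array each time

def in_place():
    for _ in range(100_000):
        np.add(a, b, out=x)  # writes into preexisting storage

print(timeit.timeit(allocating, number=1))
print(timeit.timeit(in_place, number=1))
```

On arrays this small, Python call overhead dominates, so the gap is far smaller than in the Julia snippets; the point is only that eliminating per-iteration allocation is a language-independent optimization, not a Julia quirk.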

The Reversed Role of Vectorization in R and Julia

Part of what makes it difficult to have a straightforward discussion about vectorization is that vectorization in R conflates issues that are logically unrelated. In R, vectorization is often done for both (a) readability and (b) performance. In Julia, vectorization is only used for readability; it is devectorization that offers superior performance.

This confuses some people who are not familiar with the internals of R, so it is worth spelling out how one actually improves the speed of R code. The process is quite simple: one starts with devectorized R code, replaces it with vectorized R code, and then finally implements that vectorized R code in devectorized C. This last step is unfortunately invisible to many R users, who therefore think of vectorization per se as a mechanism for increasing performance. Vectorization per se does not make code faster. What makes vectorization in R effective is that it provides a mechanism for moving computations into C, where a hidden layer of devectorization can do its magic.

In other words, R is doing exactly what Julia is doing to get better performance. R’s vectorized code is simply a thin wrapper around completely devectorized C code. If you don’t believe me, go read the C code for something like R’s distance function, which involves calls to functions like the following:

static double R_euclidean(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist;
    int count, j;

    count= 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
    if(both_non_NA(x[i1], x[i2])) {
        dev = (x[i1] - x[i2]);
        if(!ISNAN(dev)) {
        dist += dev * dev;
        count++;
        }
    }
    i1 += nr;
    i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return sqrt(dist);
}

It is important to keep this sort of thing in mind: the term vectorization in R actually refers to a step in which you write devectorized code in C. Vectorization, per se, is a red herring when reasoning about performance.

To finish this last point, let’s summarize the performance hierarchy for R and Julia code in a simple table:

         Worst Case           Typical Case         Best Case
Julia    (none)               Vectorized Code      Devectorized Code
R        Devectorized Code    Vectorized Code      C Devectorized Code

It is the complete absence of a worst-case entry for Julia that makes it difficult to compare vectorization across the two languages. Nothing in Julia is as bad as R's devectorized code. At the other end of the spectrum, the performance of Julia's devectorized code simply has no point of comparison in pure R: it is more similar to the C code used to power R behind the scenes.

Conclusion

Julia aims to (and typically does) provide vectorized code that is as efficient as the vectorized code available in other high-level languages. What sets Julia apart is the possibility of writing, in pure Julia, high performance code that uses CPU and memory resources as effectively as can be done in C.

In particular, vectorization and devectorization stand in the opposite relationship to one another in Julia as they do in R. In R, devectorization makes code unusably slow: R code must be vectorized to perform at an acceptable level. In contrast, Julia programmers view vectorized code as a convenient prototype that can be modified with some clever devectorization to produce production-performance code. Of course, we would like prototype code to perform better. But no popular language offers that kind of functionality. What Julia offers isn’t the requirement for devectorization, but the possibility of doing it in Julia itself, rather than in C.