Falsifiability versus Rationalization

Here are two hypothetical conversations about psychological research. I’ll leave it to others to decide whether these conversations could ever take place.

Theories are just directional assertions about effects

Person A: And, just as I predicted, I found in my early studies that the correlation between X and Y is 0.4.
Person B: What do you make of the fact that subsequent studies have found that the correlation is closer to 0.001?
Person A: Oh, I was right all along: those studies continue to support my theoretical assertion that the empirical effect goes in the direction that my theory predicted. Exact numbers are meaningless in the social sciences, since we only conduct proof-of-concept studies and there are so many intervening variables we can’t measure.

Theories are just assertions about the existence of effects

Person A: And, just as I predicted, I found in my early studies that the correlation between X and Y is 0.4.
Person B: What do you make of the fact that a conceptual replication, which employed words rather than pictures, found that the correlation between X and Y was -0.05?
Person A: Oh, I was right all along: X does have an effect on Y, even though the effect can switch directions under some circumstances. What matters is that X affects Y at all, which is deeply counter-intuitive.

A Note on the Johnson-Lindenstrauss Lemma

Introduction

A recent thread on Theoretical CS StackExchange comparing the Johnson-Lindenstrauss Lemma with the Singular Value Decomposition piqued my interest enough that I decided to spend some time last night reading the standard JL papers. Until this week, I only had a vague understanding of what the JL Lemma implied. I previously mistook the JL Lemma for a purely theoretical result that established the existence of distance-preserving projections from high-dimensional spaces into low-dimensional spaces.

This vague understanding of the JL Lemma turns out to be almost correct, but it also led me to neglect the most interesting elements of the literature on the JL Lemma: the papers on the JL Lemma do not simply establish the existence of such projections, but also provide (1) an explicit bound on the dimensionality required for a projection to ensure that it will approximately preserve distances and they even provide (2) an explicit construction of a random matrix, \(A\), that produces the desired projection.

Once I knew that the JL Lemma was a constructive proof, I decided to implement code in Julia to construct examples of this family of random projections. The rest of this post walks through that code as a way of explaining the JL Lemma’s practical applications.

Formal Statement of the JL Lemma

The JL Lemma, as stated in “An elementary proof of the Johnson-Lindenstrauss Lemma” by Dasgupta and Gupta, is the following result about dimensionality reduction:

For any \(0 < \epsilon < 1\) and any integer \(n\), let \(k\) be a positive integer such that \(k \geq 4(\epsilon^2/2 - \epsilon^3/3)^{-1}\log(n)\). Then for any set \(V\) of \(n\) points in \(\mathbb{R}^d\), there is a map \(f : \mathbb{R}^d \to \mathbb{R}^k\) such that for all \(u, v \in V\), $$ (1 - \epsilon) ||u - v||^2 \leq ||f(u) - f(v)||^2 \leq (1 + \epsilon) ||u - v||^2. $$ Further this map can be found in randomized polynomial time.

To fully appreciate this result, we can unpack the abstract statement of the lemma into two components.

The JL Lemma in Two Parts

Part 1: Given a number of data points, \(n\), that we wish to project and a relative error, \(\epsilon\), that we are willing to tolerate, we can compute a minimum dimensionality, \(k\), that a projection must map a space into before it can guarantee that distances will be preserved up to a factor of \(\epsilon\).

In particular, \(k = \left \lceil{4(\epsilon^2/2 - \epsilon^3/3)^{-1}\log(n)} \right \rceil\).

Note that this implies that the dimensionality required to preserve distances depends only on the number of points and not on the dimensionality of the original space.

Part 2: Given an input matrix, \(X\), of \(n\) points in \(d\)-dimensional space, we can explicitly construct a map, \(f\), such that the distance between any pair of columns of \(X\) will not be distorted by more than a factor of \(\epsilon\).

Surprisingly, this map \(f\) can be a simple matrix, \(A\), constructed by sampling \(k * d\) IID draws from a Gaussian with mean \(0\) and variance \(\frac{1}{k}\).

Coding Up The Projections

We can translate the first part of the JL Lemma into a single line of code that computes the dimensionality, \(k\), of our low-dimensional space given the number of data points, \(n\), and the error, \(\epsilon\), that we are willing to tolerate:

mindim(n::Integer, ε::Real) = iceil((4 * log(n)) / (ε^2 / 2 - ε^3 / 3))

Having defined this function, we can try it out on a simple problem:

mindim(3, 0.1)
# => 942

This result was somewhat surprising to me: to represent \(3\) points with no more than \(10\)% error, we require nearly \(1,000\) dimensions. This reflects an important fact about the JL Lemma: it produces results that can be extremely conservative for low-dimensional inputs. It’s obvious that, for data sets that contain \(3\) points in \(100\)-dimensional space, we could use a projection into \(100\) dimensions that would preserve distances perfectly.

But this observation neglects one of the essential aspects of the JL Lemma: the dimensions required by the lemma will be sufficient whether our data set contains points in \(100\)-dimensional space or points in \(10^{100}\)-dimensional space. No matter what dimensionality the raw data lies in, the JL Lemma says that \(942\) dimensions suffices to preserve the distances between \(3\) points.
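To make the point about the bound concrete, here is a minimal Python sketch of the same formula. The name `min_dim` is my own port of the `mindim` function above, not part of any library. The original dimension \(d\) never enters the computation, and \(k\) grows only logarithmically with the number of points.

```python
import math

def min_dim(n: int, eps: float) -> int:
    """Smallest k permitted by the Dasgupta-Gupta bound for n points and error eps."""
    return math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))

# The bound depends only on n and eps; d does not appear anywhere.
for n in (3, 1_000, 1_000_000):
    print(n, min_dim(n, 0.1))
```

Going from \(3\) points to a million points multiplies the required dimensionality by the ratio of the logarithms, not by the ratio of the point counts.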

I found this statement unintuitive at the start. To see that it’s true, let’s construct a random projection matrix, \(A\), that will let us confirm experimentally that the JL Lemma really works:

using Distributions
 
function projection(
    X::Matrix,
    ε::Real,
    k::Integer = mindim(size(X, 2), ε)
)
    d, n = size(X)
    A = rand(Normal(0, 1 / sqrt(k)), k, d)
    return A, k, A * X
end

This projection function is sufficient to construct a matrix, \(A\), that will satisfy the assumptions of the JL Lemma. It will also return the dimensionality, \(k\), of \(A\) and the result of projecting the input, \(X\), into the new space defined by \(A\). To get a feel for how this works, we can try this out on a very simple data set:

X = eye(3, 3)
 
ε = 0.1
 
A, k, AX = projection(X, ε)
# =>
# (
# 942x3 Array{Float64,2}:
#  -0.035269    -0.0299966   -0.0292959 
#  -0.00501367   0.0316806    0.0460191 
#   0.0633815   -0.0136478   -0.0198676 
#   0.0262627    0.00187459  -0.0122604 
#   0.0417169   -0.0230222   -0.00842476
#   0.0236389    0.0585979   -0.0642437 
#   0.00685299  -0.0513301    0.0501431 
#   0.027723    -0.0151694    0.00274466
#   0.0338992    0.0216184   -0.0494157 
#   0.0612926    0.0276185    0.0271352 
#   ⋮                                   
#  -0.00167347  -0.018576     0.0290964 
#   0.0158393    0.0124403   -0.0208216 
#  -0.00833401   0.0323784    0.0245698 
#   0.019355     0.0057538    0.0150561 
#   0.00352774   0.031572    -0.0262811 
#  -0.0523636   -0.0388993   -0.00794319
#  -0.0363795    0.0633939   -0.0292289 
#   0.0106868    0.0341909    0.0116523 
#   0.0072586   -0.0337501    0.0405171 ,
# 
# 942,
# 942x3 Array{Float64,2}:
#  -0.035269    -0.0299966   -0.0292959 
#  -0.00501367   0.0316806    0.0460191 
#   0.0633815   -0.0136478   -0.0198676 
#   0.0262627    0.00187459  -0.0122604 
#   0.0417169   -0.0230222   -0.00842476
#   0.0236389    0.0585979   -0.0642437 
#   0.00685299  -0.0513301    0.0501431 
#   0.027723    -0.0151694    0.00274466
#   0.0338992    0.0216184   -0.0494157 
#   0.0612926    0.0276185    0.0271352 
#   ⋮                                   
#  -0.00167347  -0.018576     0.0290964 
#   0.0158393    0.0124403   -0.0208216 
#  -0.00833401   0.0323784    0.0245698 
#   0.019355     0.0057538    0.0150561 
#   0.00352774   0.031572    -0.0262811 
#  -0.0523636   -0.0388993   -0.00794319
#  -0.0363795    0.0633939   -0.0292289 
#   0.0106868    0.0341909    0.0116523 
#   0.0072586   -0.0337501    0.0405171 )

According to the JL Lemma, the new matrix, \(AX\), should approximately preserve the distances between columns of \(X\). We can write a quick function that verifies this claim:

function ispreserved(X::Matrix, A::Matrix, ε::Real)
    d, n = size(X)
    k = size(A, 1)
 
    for i in 1:n
        for j in (i + 1):n
            u, v = X[:, i], X[:, j]
            d_old = norm(u - v)^2
            d_new = norm(A * u - A * v)^2
            @printf("Considering the pair X[:, %d], X[:, %d]...\n", i, j)
            @printf("\tOld distance: %f\n", d_old)
            @printf("\tNew distance: %f\n", d_new)
            @printf(
                "\tWithin bounds %f <= %f <= %f\n",
                (1 - ε) * d_old,
                d_new,
                (1 + ε) * d_old
            )
            if !((1 - ε) * d_old <= d_new <= (1 + ε) * d_old)
                return false
            end
        end
    end
 
    return true
end

And then we can test out the results:

ispreserved(X, A, ε)
# =>
# Considering the pair X[:, 1], X[:, 2]...
#     Old distance: 2.000000
#     New distance: 2.104506
#     Within bounds 1.800000 <= 2.104506 <= 2.200000
# Considering the pair X[:, 1], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 2.006130
#     Within bounds 1.800000 <= 2.006130 <= 2.200000
# Considering the pair X[:, 2], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.955495
#     Within bounds 1.800000 <= 1.955495 <= 2.200000

As claimed, the distances are indeed preserved up to a factor of \(\epsilon\). But, as we noted earlier, the JL lemma has a somewhat perverse consequence for our \(3×3\) matrix: we’ve expanded our input into a \(942×3\) matrix rather than reduced its dimensionality.

To get meaningful dimensionality reduction, we need to project a data set from a space that has more than \(942\) dimensions. So let’s try out a \(50,000\)-dimensional example:

X = eye(50_000, 3)
 
A, k, AX = projection(X, ε)
 
ispreserved(X, A, ε)
# =>
# Considering the pair X[:, 1], X[:, 2]...
#     Old distance: 2.000000
#     New distance: 2.021298
#     Within bounds 1.800000 <= 2.021298 <= 2.200000
# Considering the pair X[:, 1], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.955502
#     Within bounds 1.800000 <= 1.955502 <= 2.200000
# Considering the pair X[:, 2], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.988945
#     Within bounds 1.800000 <= 1.988945 <= 2.200000

In this case, the JL Lemma again works as claimed: the pairwise distances between columns of \(X\) are preserved. And we’ve done this while reducing the dimensionality of our data from \(50,000\) to \(942\). Moreover, this same approach would still work if the input space had \(10\) million dimensions.
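The examples above only project columns of an identity matrix. As a rough cross-check on less special inputs, here is a standard-library-only Python sketch that projects three random points. The helper names (`random_projection`, `project`, `distortion_ratios`), the seed, and the choice of \(d = 1500\) are all my own assumptions for illustration. The matrix entries are drawn from a Gaussian with mean \(0\) and variance \(1/k\), as in Part 2.

```python
import math
import random

def random_projection(d, k, rng):
    """k x d matrix of IID N(0, 1/k) draws (standard deviation 1/sqrt(k))."""
    sigma = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]

def project(A, x):
    """Matrix-vector product A * x using plain lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def distortion_ratios(points, A):
    """Ratio of new to old squared distance for every pair of points."""
    projected = [project(A, p) for p in points]
    ratios = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratios.append(sq_dist(projected[i], projected[j]) / sq_dist(points[i], points[j]))
    return ratios

rng = random.Random(0)          # arbitrary seed, chosen only for repeatability
d, k, n = 1500, 942, 3          # k = 942 matches the bound for 3 points at 10% error
points = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
A = random_projection(d, k, rng)
ratios = distortion_ratios(points, A)
print(ratios)
```

Each ratio should land close to \(1\). Because the construction is random, any single draw can fall slightly outside the \(1 \pm \epsilon\) band, in which case one simply draws a new matrix.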

Conclusion

Contrary to my naive conception of the JL Lemma, the literature on the lemma does not only tell us that, abstractly, distances can be preserved by dimensionality reduction techniques. It also tells us how to perform this reduction, and the mechanism is both simple and general.

Data corruption in R 3.0.2 when using read.csv

Introduction

It may be old news to some, but I just recently discovered that the automatic type inference system that R uses when parsing CSV files assumes that data sets will never contain 64-bit integer values.

Specifically, if an integer value read from a CSV file is too large to fit in a 32-bit integer field without overflow, the column of data that contains that value will be automatically converted to floating point. This conversion will take place without any warnings, even though it may lead to data corruption.

The reason that the automatic conversion of 64-bit integer-valued data to floating point is problematic is that floating point numbers lack sufficient precision to exactly represent the full range of 64-bit integer values. As a consequence of the lower precision of floating point numbers, two unequal integer values in the input file may be converted to two equal floating point values in the data.frame R uses to represent that data. Subsequent analysis in R will therefore treat unequal values as if they were equal, corrupting any downstream analysis that assumes that the equality predicate can be trusted.
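The precision limit is a property of 64-bit floating point itself rather than of R, so it can be sketched in any language. The Python snippet below is only an illustration (Python's `float` is a double, like R's numeric type). Doubles carry 53 bits of significand, so beyond \(2^{53}\) not every integer has its own floating point value.

```python
a = 100000000000000000   # 10**17
b = 100000000000000001   # 10**17 + 1

# Exact integer comparison distinguishes the two values.
print(a == b)

# After conversion to double precision, both round to the same value.
print(float(a) == float(b))

# 2**53 is the point past which consecutive integers start to collide.
print(float(2**53) == float(2**53 + 1))
```

The first comparison is false and the other two are true, which is the same collision that appears in the failure case below.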

Below, I demonstrate this general problem using two specific data sets. The specific failure case that I outline occurred for me while using R 3.0.2 on my x86_64-apple-darwin10.8.0 platform laptop, which is a “MacBook Pro Retina, 13-inch, Late 2013” model.

Failure Case

Consider the following two tables, one containing 32-bit integer values and the other containing 64-bit integer values:

ID
1000
1001

ID
100000000000000000
100000000000000001

What happens when they are read into R using the read.csv function?

32-bit compatible integer values are parsed, correctly, using R’s integer type, which does not lead to data corruption:

data <- "MySQLID\n1000\n1001"
 
ids <- read.csv(text = data)
 
ids[1, 1] == ids[2, 1]
# [1] FALSE
 
class(ids$MySQLID)
# [1] "integer"

64-bit compatible integer values are parsed, incorrectly, using R’s numeric type, which does lead to data corruption:

data <- "MySQLID\n100000000000000000\n100000000000000001"
 
ids <- read.csv(text = data)
 
ids[1, 1] == ids[2, 1]
# [1] TRUE
 
class(ids$MySQLID)
# [1] "numeric"

Conclusions

What should one make of this example? At the minimum, it suggests that R’s default behaviors are not well-suited to a world in which more and more people interact with data derived from commercial web sites, where 64-bit integers are commonplace. I hope that R will change the behavior of read.csv in a future release and deprecate any attempts to treat integer literals as anything other than 64-bit integers.

But I would argue that this example also teaches a much more general point: it suggests that the assertion that scientists can safely ignore the distinction between integer and floating point data types is false. In the example I’ve provided, the very real distinction that modern CPUs make between integer and floating point data leads to very real data corruption. How that data corruption affects downstream analyses is situation-dependent, but it is conceivable that the effects are severe in some settings. I would hope that we will stop asserting that scientists can use computers to analyze data without understanding the inherent limitations of the tools they are working with.

The Sad State of Website Password Standards

Correction: One problem with the argument I make below is that password storage length isn’t actually relevant if websites hash passwords instead of storing them in plaintext. So there’s no need to convince websites to store longer passwords: they should always be storing a fixed length hash to begin with. Thanks to @petrillic for bringing this mistake to my attention.

tl;dr: Every website I use seems to have a slightly different password policy. Here I review some very basic algebraic facts about randomly generated passwords. Based on those facts, I argue that every able-bodied website should adopt a few simple standards for user passwords, including the following rules:

  1. Never prevent users from copying-and-pasting passwords.
  2. Never restrict the length of passwords to anything lower than 40 characters. If you can, give users 255 characters to work with.
  3. Never truncate the password that a user submits, because truncation will invalidate the user’s password. Do not lock users out of their accounts as a punishment for using strong passwords.
  4. Never impose requirements on the types of characters used in passwords. Avoid encouraging l33t-speak passwords.

Introduction

After a long drive last week, I arrived home to find my phone filled with a long sequence of two-factor authentication tokens, which I took as evidence that someone had been trying to break into one of my accounts. After looking into the matter a bit, it became clear that the account to which access had been requested was one of a few accounts that shared a password with my old (and recently compromised) Adobe account. Thankfully, all of my important user accounts had two-factor authentication enabled and, as such, I have not found evidence of any successful intrusions. But the incident was nevertheless sufficient to inspire me to create unique passwords for every single website account that I have.

The Sad State of Website Password Standards

Unfortunately, the process of resetting my passwords en masse served mostly to remind me how poorly password security is managed by most websites. Every website seems to have its own set of ad hoc standards: some require passwords with a lot of different types of characters, whereas others require long passwords. Many websites will allow passwords to be 40 characters or longer, but a large number of websites impose puzzling restrictions on the maximum length of passwords.

In particular, one site, which will remain nameless to protect the guilty, required that passwords contain (a) 1 lowercase letter, (b) 1 uppercase letter, (c) 1 digit and (d) 1 special character, and simultaneously required that passwords not be longer than 10 characters.

The widespread requirement that users employ l33t-speak passwords puzzles me. A little bit of arithmetic makes it clear that a randomly generated password is enormously more likely to be secure if it consists of 40 randomly chosen lowercase characters than if it consists of 5 randomly chosen characters from the extended character set that contains lowercase characters, uppercase characters, digits and special characters. To see this, consider the number of distinct passwords generated by either (a) varying the diversity of the character set or (b) varying the length of the password. Some sample calculations are shown below in which I consider four types of character sets and randomly generated passwords of length 5, 10 and 20:

| Character Set | Password Length | Number of Distinct Passwords |
|---|---|---|
| Lowercase | 5 | 11881376 |
| Lowercase | 10 | 141167095653376 |
| Lowercase | 20 | 19928148895209409152340197376 |
| Lowercase + Uppercase | 5 | 380204032 |
| Lowercase + Uppercase | 10 | 144555105949057024 |
| Lowercase + Uppercase | 20 | 20896178655943101411324274803736576 |
| Lowercase + Uppercase + Digits | 5 | 916132832 |
| Lowercase + Uppercase + Digits | 10 | 839299365868340224 |
| Lowercase + Uppercase + Digits | 20 | 704423425546998022968330264616370176 |
| Lowercase + Uppercase + Digits + Special Characters | 5 | 1934917632 |
| Lowercase + Uppercase + Digits + Special Characters | 10 | 3743906242624487424 |
| Lowercase + Uppercase + Digits + Special Characters | 20 | 14016833953562607293918185758734155776 |
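The counts in the table can be reproduced with a few lines of Python, since each entry is the alphabet size raised to the password length. The alphabet sizes of 26, 52 and 62 follow from the English alphabet and the ten digits. The size of 72 for the last set is an assumption I inferred from the table, which implies ten special characters.

```python
# Alphabet sizes; 72 assumes ten special characters (inferred from the table).
alphabet_sizes = {
    "Lowercase": 26,
    "Lowercase + Uppercase": 52,
    "Lowercase + Uppercase + Digits": 62,
    "Lowercase + Uppercase + Digits + Special Characters": 72,
}

def distinct_passwords(alphabet_size: int, length: int) -> int:
    """Number of strings of the given length over the given alphabet."""
    return alphabet_size ** length

for name, size in alphabet_sizes.items():
    for length in (5, 10, 20):
        print(name, length, distinct_passwords(size, length))
```

Python integers have arbitrary precision, so the 20-character counts are computed exactly.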

Why Does This Table Look Like This?

It’s hard not to be struck by the evidence the table above provides for the enormous superiority of long passwords over diverse passwords. When you go from the least diverse password to the most diverse password for a 5-character long password, you only go from 11881376 different passwords to 1934917632 passwords. If random guessing is feasible for one, it’s not that much harder for the other.

If, instead, you stick with only lowercase passwords and go from 5-characters to 20-characters, you go from 11881376 different passwords to 19928148895209409152340197376 passwords. Even if you could try out all of the smaller passwords in a second, we’d all be dead before you tried out all of the longer passwords.

To see why this happens, consider the number of distinct passwords you get when you either double the number of characters used or you double the length of the passwords generated. If you start with \(a\) different letters and use \(n\) of them, you’ll end up with \(a^n\) different passwords. And when you double the number of different letters, you increase this number from \(a^n\) to \((2a)^n = (2^n)(a^n)\).

If, instead, you double the length of the password, you increase \(a^n\) to \(a^{2n} = (a^n)^2 = (a^n)(a^n)\). This means that the superiority of using longer passwords over diverse passwords grows like \((\frac{a}{2})^n\), which is a huge number for even simple character sets that contain \(a\) different letters.

For example, if you’re looking at just lowercase characters for English, using a longer password instead of more diverse passwords is going to be \(13^n\) times better. In other words, going from a 10-character password to a 20-character password is more than a billion times more secure than allowing both upper and lowercase letters in passwords that are always 10 characters long.
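As a quick check on the \((\frac{a}{2})^n\) claim, here is a small Python sketch for \(a = 26\) and \(n = 10\). The variable names are mine and exist only for illustration.

```python
a, n = 26, 10

longer = a ** (2 * n)          # double the length: 20 lowercase characters
more_diverse = (2 * a) ** n    # double the alphabet: 10 mixed-case characters

# The ratio simplifies to (a / 2) ** n, which is 13 ** 10 here.
advantage = longer // more_diverse
print(advantage)
print(advantage == 13 ** n)
```

The ratio is \(13^{10}\), which is about \(1.4 \times 10^{11}\), comfortably more than a billion.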

What About the Costs of Using Longer Passwords?

In response to my argument, one might ask: “won’t using longer passwords impose a serious storage cost on websites?”

Thankfully, the answer is, “not really”. Suppose you’re a wildly popular site and have a billion users. Then going from 10 character passwords to 255 characters will impose an additional storage cost of about 245 GB on you. Right now, getting a 512 GB hard drive costs about 50 dollars. So, the increased storage cost for most websites should be on the order of magnitude of 100 dollars. That seems quite affordable to me.
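The storage arithmetic can be written out explicitly. This sketch assumes one byte per stored character and decimal gigabytes, both of which are my simplifications. It also sets aside the correction at the top of this post, which points out that hashed passwords have a fixed stored length anyway.

```python
import math

users = 1_000_000_000
old_len, new_len = 10, 255        # characters per stored password
bytes_per_char = 1                # assumption: single-byte characters

extra_bytes = users * (new_len - old_len) * bytes_per_char
extra_gb = extra_bytes / 10**9    # decimal gigabytes

drive_gb, drive_cost = 512, 50    # drive size and price quoted above
drives_needed = math.ceil(extra_gb / drive_gb)
print(extra_gb, drives_needed, drives_needed * drive_cost)
```

Even after adding a second drive for redundancy, the total stays on the order of 100 dollars.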

If Longer Passwords Are So Much Better, Why Don’t We Use Them?

Why did we wind up with a requirement for l33t-speak passwords and no ability to use long passwords? Probably because websites have found that users keep using simple dictionary words like “password” as their password, which you can prevent as long as you force users to set up passwords that contain digits and other special characters. It’s hard to get users to employ truly random passwords, so it’s easier to impose some trivial amount of randomness by making them use “random” characters like @ and $.

A Plan for Action

Once we acknowledge the reasons why we wound up with the current set of heuristics for making passwords secure, we can move on and think about how websites could do a better job in the future. The foremost step towards security, in my mind, is facilitating the use of high-quality password managers, like 1Password. You can make things easier for advanced users by adopting the following steps:

  • Allow users to copy-and-paste their passwords from another source. When you force users to set up passwords they can easily type, you incentivize them to use short, memorable passwords. But short, memorable passwords are terrible passwords.
  • Drop the unhelpful requirement that passwords contain a “diverse” set of characters. Adding a few special characters to a password does almost nothing to make it more secure. It’s not hard for a hacker to set up a dictionary of passwords that adds l33t-speak equivalents for every real English word.
  • Impose a requirement that passwords be at least 12 characters long. Under no circumstances impose on users the requirement that passwords be shorter than 40 characters. Similarly, you should never truncate the password that a user submits to a length shorter than the string they submitted. Truncating passwords makes a password less secure and also locks the user out of their account.
  • Offer to automatically generate a fully random password for every new user with an explanation of how this random password is better for security. Most users won’t take you up on the offer, but those that do will benefit substantially from the polite nudge towards using per-website unique passwords.
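The last suggestion, generating a fully random password, takes only a few lines. The sketch below uses Python's standard `secrets` module. The function name `generate_password` and the defaults of 40 lowercase characters are my own choices, picked to match the lengths discussed above rather than any particular site's policy.

```python
import secrets
import string

def generate_password(length: int = 40, alphabet: str = string.ascii_lowercase) -> str:
    """Draw each character independently and uniformly from the alphabet."""
    return "".join(secrets.choice(alphabet) for _ in range(length))

password = generate_password()
print(len(password))
```

A site could widen the alphabet, but as the table earlier shows, the length is what does most of the work.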

The Relationship between Vectorized and Devectorized Code

Introduction

Some people have come to believe that Julia’s vectorized code is unusably slow. To correct this misconception, I outline a naive benchmark below that suggests that Julia’s vectorized code is, in fact, noticeably faster than R’s vectorized code. When experienced Julia programmers suggest that newcomers should consider devectorizing code, we’re not trying to beat R’s speed — our vectorized code does that already. Instead, we’re trying to match C’s speed.

As the examples below indicate, a little bit of devectorization goes a long way towards this loftier goal. In the specific examples I show, I find that:

  • Julia’s vectorized code is 2x faster than R’s vectorized code
  • Julia’s devectorized code is 140x faster than R’s vectorized code
  • Julia’s devectorized code is 1350x faster than R’s devectorized code

Examples of Vectorized and Devectorized Code in R

Let’s start by contrasting two pieces of R code: a vectorized and a devectorized implementation of a trivial snippet of code that does repeated vector addition.

First, we consider an example of idiomatic, vectorized R code:

vectorized <- function()
{
    a <- c(1, 1)
    b <- c(2, 2)
    x <- c(NaN, NaN)

    for (i in 1:1000000)
    {
        x <- a + b
    }

    return()
}

time <- function (N)
{
    timings <- rep(NA, N)

    for (itr in 1:N)
    {
        start <- Sys.time()
        vectorized()
        end <- Sys.time()
        timings[itr] <- end - start
    }

    return(timings)
}

mean(time(10))

This code takes, on average, 0.49 seconds per iteration to compute 1,000,000 vector additions.

Having considered the vectorized implementation, we can then consider an unidiomatic devectorized implementation of the same operation in R:

devectorized <- function()
{
    a <- c(1, 1)
    b <- c(2, 2)
    x <- c(NaN, NaN)

    for (i in 1:1000000)
    {
        for (index in 1:2)
        {
            x[index] <- a[index] + b[index]
        }
    }

    return()
}

time <- function (N)
{
    timings <- rep(NA, N)

    for (itr in 1:N)
    {
        start <- Sys.time()
        devectorized()
        end <- Sys.time()
        timings[itr] <- end - start
    }

    return(timings)
}

mean(time(10))

This takes, on average, 4.72 seconds per iteration to compute 1,000,000 vector additions.

Examples of Vectorized and Devectorized Code in Julia

Let’s now consider two Julia implementations of this same snippet of code. We’ll start with a vectorized implementation:

function vectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        x = a + b
    end

    return
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    vectorized()

    for itr in 1:N
        timings[itr] = @elapsed vectorized()
    end

    return timings
end

mean(time(10))

This takes, on average, 0.236 seconds per iteration to compute 1,000,000 vector additions.

Next, let’s consider a devectorized implementation of this same snippet:

function devectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        for index in 1:2
            x[index] = a[index] + b[index]
        end
    end

    return
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    devectorized()

    for itr in 1:N
        timings[itr] = @elapsed devectorized()
    end

    return timings
end

mean(time(10))

This takes, on average, 0.0035 seconds per iteration to compute 1,000,000 vector additions.

Comparing Performance in R and Julia

We can summarize the results of the four examples above in a single table:

| Approach | Language | Average Time (seconds) |
|---|---|---|
| Vectorized | R | 0.49 |
| Devectorized | R | 4.72 |
| Vectorized | Julia | 0.24 |
| Devectorized | Julia | 0.0035 |

All of these examples were timed on my 2.9 GHz Intel Core i7 MacBook Pro. The results are quite striking: Julia is uniformly faster than R. And a very small bit of devectorization produces huge performance improvements. Of course, it would be nice if Julia’s compiler could optimize vectorized code as well as it optimizes devectorized code. But doing so requires a substantial amount of work.
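The speedup figures quoted in the introduction follow directly from these timings. Here is a short Python sketch of the arithmetic, using the unrounded 0.236 seconds reported above for Julia's vectorized code. The dictionary layout is just for illustration.

```python
# Average seconds per iteration, as reported above.
timings = {
    ("R", "vectorized"): 0.49,
    ("R", "devectorized"): 4.72,
    ("Julia", "vectorized"): 0.236,
    ("Julia", "devectorized"): 0.0035,
}

def speedup(slow, fast):
    """How many times faster the second configuration is than the first."""
    return timings[slow] / timings[fast]

print(speedup(("R", "vectorized"), ("Julia", "vectorized")))
print(speedup(("R", "vectorized"), ("Julia", "devectorized")))
print(speedup(("R", "devectorized"), ("Julia", "devectorized")))
```

The three ratios come out at roughly 2, 140 and 1350.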

Why is Optimizing Vectorized Code Hard?

What makes automatic devectorization tricky to get right is that even minor variants of the snippet shown above have profoundly different optimization strategies. Consider, for example, the following two snippets of code:

function vectorized2()
    a = [1.0, 1.0]
    b = [2.0, 2.0]

    res = {}

    for i in 1:1000000
        x = [rand(), rand()]
        x += a + b
        push!(res, x)
    end

    return res
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    vectorized2()

    for itr in 1:N
        timings[itr] = @elapsed vectorized2()
    end

    return timings
end

mean(time(10))

This first snippet takes 1.29 seconds on average.

function devectorized2()
    a = [1.0, 1.0]
    b = [2.0, 2.0]

    res = {}

    for i in 1:1000000
        x = [rand(), rand()]
        for dim in 1:2
            x[dim] += a[dim] + b[dim]
        end
        push!(res, x)
    end

    return res
end

function time(N)
    timings = Array(Float64, N)

    # Force compilation
    devectorized2()

    for itr in 1:N
        timings[itr] = @elapsed devectorized2()
    end

    return timings
end

mean(time(10))

This second snippet takes, on average, 0.27 seconds.

The gap between vectorized and devectorized code is much smaller here because this second set of code snippets uses memory in a very different way than our original snippets did. In the first set of snippets, it was possible to entirely avoid allocating any memory for storing changes to x. The devectorized code for the first set of snippets explicitly made clear to the compiler that no memory needed to be allocated. The vectorized code did not make this clear. Making it clear that no memory needed to be allocated led to a roughly 67x speedup (0.236 seconds versus 0.0035 seconds). Explicitly telling the compiler what it can avoid spending time on goes a long way.

In contrast, in the second set of snippets, a new chunk of memory has to be allocated for every x vector that gets created. And the result is that even the devectorized variant of our second snippet cannot offer much of a performance boost over its vectorized analogue. The devectorized variant is slightly faster because it avoids allocating any memory during the steps in which x has a and b added to it, but this makes less of a difference when there is still a lot of other work being done that cannot be avoided by devectorizing operations.
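The allocation difference is easy to see in any language that exposes object identity. The Python sketch below is only an analogy, with hypothetical function names, and says nothing about how Julia's compiler behaves. It shows that the "vectorized" style produces a fresh result object on every call, whereas the "devectorized" style overwrites a buffer that already exists.

```python
def add_allocating(a, b):
    # Builds a brand-new list on every call, like `x = a + b` in the vectorized snippets.
    # (A comprehension is used because `a + b` on Python lists would concatenate.)
    return [ai + bi for ai, bi in zip(a, b)]

def add_in_place(a, b, x):
    # Overwrites the existing buffer element by element, like the devectorized loop.
    for i in range(len(x)):
        x[i] = a[i] + b[i]
    return x

a, b = [1.0, 1.0], [2.0, 2.0]
x = [float("nan"), float("nan")]

y = add_allocating(a, b)
z = add_in_place(a, b, x)
print(y is x, z is x)
```

Both calls compute the same values, but only the in-place version reuses `x`. Repeating the allocating version a million times means a million short-lived objects.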

This reflects a more general statement: the vectorization/devectorization contrast is only correlated, not causally related, with the actual performance characteristics of code. What matters for computations that take place on modern computers is the efficient utilization of processor cycles and memory. In many real examples of vectorized code, it is memory management, rather than vectorization per se, that is the core causal factor responsible for performance.

The Reversed Role of Vectorization in R and Julia

Part of what makes it difficult to have a straightforward discussion about vectorization is that vectorization in R conflates issues that are logically unrelated. In R, vectorization is often done for both (a) readability and (b) performance. In Julia, vectorization is only used for readability; it is devectorization that offers superior performance.

This confuses some people who are not familiar with the internals of R. It is therefore worth noting how one improves the speed of R code. The process of performance improvement is quite simple: one starts with devectorized R code, then replaces it with vectorized R code and then finally implements this vectorized R code in devectorized C code. This last step is unfortunately invisible to many R users, who therefore think of vectorization per se as a mechanism for increasing performance. Vectorization per se does not help make code faster. What makes vectorization in R effective is that it provides a mechanism for moving computations into C, where a hidden layer of devectorization can do its magic.

In other words, R is doing exactly what Julia is doing to get better performance. R’s vectorized code is simply a thin wrapper around completely devectorized C code. If you don’t believe me, go read the C code for something like R’s distance function, which involves calls to functions like the following:

static double R_euclidean(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist;
    int count, j;

    count = 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
        if(both_non_NA(x[i1], x[i2])) {
            dev = (x[i1] - x[i2]);
            if(!ISNAN(dev)) {
                dist += dev * dev;
                count++;
            }
        }
        i1 += nr;
        i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return sqrt(dist);
}

It is important to keep this sort of thing in mind: the term vectorization in R actually refers to a step in which you write devectorized code in C. Vectorization, per se, is a red herring when reasoning about performance.
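For comparison, here is a rough sketch of what the same devectorized loop could look like in pure Julia. It assumes a recent Julia 1.x, a dense Matrix{Float64} and NaN as the missing-value marker (R uses NA_REAL instead); the name row_euclidean is made up for this illustration and is not part of any library.

```julia
# Euclidean distance between rows i1 and i2 of x, skipping columns in
# which either entry is NaN. Hypothetical helper that mirrors the
# structure of R_euclidean above.
function row_euclidean(x::Matrix{Float64}, i1::Int, i2::Int)
    nc = size(x, 2)
    dist = 0.0      # Float64 from the start, so the accumulator never changes type
    nvalid = 0      # number of columns that contributed to the sum
    for j in 1:nc
        dev = x[i1, j] - x[i2, j]
        if !isnan(dev)
            dist += dev * dev
            nvalid += 1
        end
    end
    nvalid == 0 && return NaN   # assumption: NaN stands in for R's NA_REAL
    if nvalid != nc
        dist /= nvalid / nc     # rescale to compensate for the skipped columns
    end
    return sqrt(dist)
end

x = [0.0 0.0; 3.0 4.0]
d = row_euclidean(x, 1, 2)
```

The loop touches each pair of entries once and allocates nothing, which is the same strategy the C code uses.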

To finish this last point, let’s summarize the performance hierarchy for R and Julia code in a simple table:

         Worst Case             Typical Case             Best Case
Julia    (none)                 Julia Vectorized Code    Julia Devectorized Code
R        R Devectorized Code    R Vectorized Code        C Devectorized Code

It is the complete absence of one column for Julia that makes it difficult to compare vectorization across the two languages. Nothing in Julia is as bad as R’s devectorized code. On the other end of the spectrum, the performance of Julia’s devectorized code simply has no point of comparison in pure R: it is more similar to the C code used to power R behind the scenes.

Conclusion

Julia aims to (and typically does) provide vectorized code that is as efficient as the vectorized code available in other high-level languages. What sets Julia apart is the possibility of writing, in pure Julia, high-performance code that uses CPU and memory resources as effectively as can be done in C.

In particular, the relationship between vectorization and devectorization in Julia is the reverse of their relationship in R. In R, devectorization makes code unusably slow: R code must be vectorized to perform at an acceptable level. In contrast, Julia programmers view vectorized code as a convenient prototype that can be modified with some clever devectorization to produce production-performance code. Of course, we would like prototype code to perform better. But no popular language offers that kind of functionality. What Julia offers isn’t the requirement for devectorization, but the possibility of doing it in Julia itself, rather than in C.

Writing Type-Stable Code in Julia

For many of the people I talk to, Julia’s main appeal is speed. But achieving peak performance in Julia requires that programmers absorb a few subtle concepts that are generally unfamiliar to users of weakly typed languages.

One particularly subtle performance pitfall is the need to write type-stable code. Code is said to be type-stable if the type of every variable does not vary over time. To clarify this idea, consider the following two closely related function definitions:

function sumofsins1(n::Integer)  
    r = 0  
    for i in 1:n  
        r += sin(3.4)  
    end  
    return r  
end  

function sumofsins2(n::Integer)  
    r = 0.0  
    for i in 1:n  
        r += sin(3.4)  
    end  
    return r  
end  

The only difference between these function definitions is that sumofsins1 initializes r to 0, whereas sumofsins2 initializes r to 0.0.

This seemingly minor distinction has important practical implications because the initialization of r to 0 means that the main loop of sumofsins1 begins with a single iteration in which the computer adds 0 to sin(3.4). This single addition step transforms the type of r from Int, which is the type of 0, to Float64, which is the type of sin(3.4). This means that the type of r is not stable over the course of this loop.
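The type change described above is easy to observe directly. The following is a small sketch, assuming a recent Julia 1.x; the helper name types_of_r is invented for this illustration.

```julia
# Record the type of r before and after the first addition that
# sumofsins1 performs in its loop.
function types_of_r()
    r = 0
    before = typeof(r)   # type of the integer literal 0
    r += sin(3.4)        # adding a Float64 promotes the result
    after = typeof(r)
    return before, after
end

before, after = types_of_r()
println("before: ", before, ", after: ", after)
```

The two reported types differ, which is exactly the instability that the rest of this post is concerned with.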

This instability has considerable effects on the performance of sumofsins1. To see this, let’s run some naive benchmarks. As always in Julia, we’ll start with a dry run to get the JIT to compile the functions being compared:

sumofsins1(100_000)  
sumofsins2(100_000)  

@time [sumofsins1(100_000) for i in 1:100];  
@time [sumofsins2(100_000) for i in 1:100];  

The results of this timing comparison are quite striking:

julia> @time [sumofsins1(100_000) for i in 1:100];  
elapsed time: 0.412261722 seconds (320002496 bytes allocated)  

julia> @time [sumofsins2(100_000) for i in 1:100];  
elapsed time: 0.008509995 seconds (896 bytes allocated)  

As you can see, the type-unstable code in sumofsins1 is roughly 50x slower than the type-stable code. What might have seemed like a nitpicky point about the initial value of r has enormous performance implications.

To understand the reasons for this huge performance gap, it’s worth considering what effect type-instability has on the compiler. In this case, the compiler can’t optimize the contents of the main loop of sumofsins1 because it can’t be certain that the type of r will remain invariant throughout the entire loop. Without this crucial form of invariance, the compiler has to check the type of r on every iteration of the loop, which is a much more intensive computation than repeatedly adding a constant value to a Float64.

You can confirm for yourself that the compiler produces more complex code by examining the LLVM IR for both of these functions.

First, we’ll examine the LLVM IR for sumofsins1:

julia> code_llvm(sumofsins1, (Int, ))  

define %jl_value_t* @julia_sumofsins11067(i64) {  
top:  
  %1 = alloca [5 x %jl_value_t*], align 8  
  %.sub = getelementptr inbounds [5 x %jl_value_t*]* %1, i64 0, i64 0  
  %2 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 2, !dbg !5145  
  store %jl_value_t* inttoptr (i64 6 to %jl_value_t*), %jl_value_t** %.sub, align 8  
  %3 = load %jl_value_t*** @jl_pgcstack, align 8, !dbg !5145  
  %4 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 1, !dbg !5145  
  %.c = bitcast %jl_value_t** %3 to %jl_value_t*, !dbg !5145  
  store %jl_value_t* %.c, %jl_value_t** %4, align 8, !dbg !5145  
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8, !dbg !5145  
  %5 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 3  
  store %jl_value_t* null, %jl_value_t** %5, align 8  
  %6 = getelementptr [5 x %jl_value_t*]* %1, i64 0, i64 4  
  store %jl_value_t* null, %jl_value_t** %6, align 8  
  store %jl_value_t* inttoptr (i64 140379580131904 to %jl_value_t*), %jl_value_t** %2, align 8, !dbg !5150  
  %7 = icmp slt i64 %0, 1, !dbg !5151  
  br i1 %7, label %L2, label %pass, !dbg !5151  

pass:                                             ; preds = %top, %pass  
  %8 = phi %jl_value_t* [ %13, %pass ], [ inttoptr (i64 140379580131904 to %jl_value_t*), %top ]  
  %"#s6.03" = phi i64 [ %14, %pass ], [ 1, %top ]  
  store %jl_value_t* %8, %jl_value_t** %5, align 8, !dbg !5152  
  %9 = call %jl_value_t* @alloc_2w(), !dbg !5152  
  %10 = getelementptr inbounds %jl_value_t* %9, i64 0, i32 0, !dbg !5152  
  store %jl_value_t* inttoptr (i64 140379580056656 to %jl_value_t*), %jl_value_t** %10, align 8, !dbg !5152  
  %11 = getelementptr inbounds %jl_value_t* %9, i64 1, i32 0, !dbg !5152  
  %12 = bitcast %jl_value_t** %11 to double*, !dbg !5152  
  store double 0xBFD05AC910FF4C6C, double* %12, align 8, !dbg !5152  
  store %jl_value_t* %9, %jl_value_t** %6, align 8, !dbg !5152  
  %13 = call %jl_value_t* @jl_apply_generic(%jl_value_t* inttoptr (i64 140379586379936 to %jl_value_t*), %jl_value_t** %5, i32 2), !dbg !5152  
  store %jl_value_t* %13, %jl_value_t** %2, align 8, !dbg !5152  
  %14 = add i64 %"#s6.03", 1, !dbg !5152  
  %15 = icmp sgt i64 %14, %0, !dbg !5151  
  br i1 %15, label %L2, label %pass, !dbg !5151  

L2:                                               ; preds = %pass, %top  
  %.lcssa = phi %jl_value_t* [ inttoptr (i64 140379580131904 to %jl_value_t*), %top ], [ %13, %pass ]  
  %16 = load %jl_value_t** %4, align 8, !dbg !5153  
  %17 = getelementptr inbounds %jl_value_t* %16, i64 0, i32 0, !dbg !5153  
  store %jl_value_t** %17, %jl_value_t*** @jl_pgcstack, align 8, !dbg !5153  
  ret %jl_value_t* %.lcssa, !dbg !5153  
}  

Then we’ll examine the LLVM IR for sumofsins2:

julia> code_llvm(sumofsins2, (Int, ))  

define double @julia_sumofsins21068(i64) {  
top:  
  %1 = icmp slt i64 %0, 1, !dbg !5151  
  br i1 %1, label %L2, label %pass, !dbg !5151  

pass:                                             ; preds = %top, %pass  
  %"#s6.04" = phi i64 [ %3, %pass ], [ 1, %top ]  
  %r.03 = phi double [ %2, %pass ], [ 0.000000e+00, %top ]  
  %2 = fadd double %r.03, 0xBFD05AC910FF4C6C, !dbg !5156  
  %3 = add i64 %"#s6.04", 1, !dbg !5156  
  %4 = icmp sgt i64 %3, %0, !dbg !5151  
  br i1 %4, label %L2, label %pass, !dbg !5151  

L2:                                               ; preds = %pass, %top  
  %r.0.lcssa = phi double [ 0.000000e+00, %top ], [ %2, %pass ]  
  ret double %r.0.lcssa, !dbg !5157  
}  

The difference in size and complexity of code between these two functions in compiled form is considerable. And this difference is entirely attributable to the compiler’s need to recheck the type of r on every iteration of the main loop in sumofsins1, which can be optimized out in sumofsins2, where r has a stable type.

Given the potential performance impacts of type-instability, every aspiring Julia programmer needs to learn to recognize potential sources of type-instability in their own code. Future versions of Julia may be configured to issue warnings when type-unstable code is encountered, but, for now, the responsibility lies with the programmer. Thankfully, once you learn about type-stability, it becomes easy to recognize in most cases.
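One way to check a function by hand is to ask the compiler what return type it infers. The sketch below assumes a recent Julia 1.x, where Base.return_types and isconcretetype are available; it is meant as an illustration rather than a complete diagnostic. In an interactive session, the @code_warntype macro gives a more detailed view of the same information.

```julia
function sumofsins1(n::Integer)
    r = 0            # starts as an Int
    for i in 1:n
        r += sin(3.4)
    end
    return r
end

function sumofsins2(n::Integer)
    r = 0.0          # starts as a Float64
    for i in 1:n
        r += sin(3.4)
    end
    return r
end

# Ask inference for the return type of each method when called with an Int.
inferred1 = Base.return_types(sumofsins1, (Int,))[1]
inferred2 = Base.return_types(sumofsins2, (Int,))[1]

# A concrete inferred type suggests type-stable code; a Union or
# abstract type is a hint that some variable changes type.
println("sumofsins1: ", inferred1, " (concrete: ", isconcretetype(inferred1), ")")
println("sumofsins2: ", inferred2, " (concrete: ", isconcretetype(inferred2), ")")
```

A non-concrete result for sumofsins1 reflects the fact that it can return either the initial integer (when n is less than 1) or a floating point sum.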

September Talks

To celebrate my last full month on the East Coast, I’m doing a bunch of talks. If you’re interested in hearing more about Julia or statistics in general, you might want to come out to one of the events I’ll be at:

  • Julia Tutorial at DataGotham: On 9/12, Stefan and I will be giving a 3-hour-long, hands-on Julia tutorial as part of the Thursday DataGotham activities this year. If you’re in NYC and care about data analysis, you should try to make it out to part of the event, even if you skip the tutorials.
  • Online Learning Talk in NYC: On 9/17, I’ll be giving a talk on online learning at the Open Statistical Programming meetup. I’ll talk about using SGD to fit models online. This material is quite basic, but seems to be unfamiliar to a lot of people.
  • Julia Talk in DC: On 9/26, I’ll be giving a quick introduction to Julia in DC at the Statistical Programming DC meetup. The goal will be to introduce people to the basics of Julia.

Hopfield Networks in Julia

As a fun side project last night, I decided to implement a basic package for working with Hopfield networks in Julia.

Since I suspect many of the readers of this blog have never seen a Hopfield net before, let me explain what they are and what they can be used for. The short-and-skinny is that Hopfield networks were invented in the 1980s to demonstrate how a network of simple neurons might learn to associate incoming stimuli with a fixed pool of existing memories. As you’ll see from the examples below, this associative ability behaves a little bit like locality-sensitive hashing.

To see how Hopfield networks work, we need to define their internal structure. For the purposes of this blog post, we’ll assume that a Hopfield network is made up of N neurons. At every point in time, this network of neurons has a simple binary state, which I’ll associate with a vector of -1’s and +1’s.

Incoming stimuli are also represented using binary vectors of length N. Every time one of these stimuli is shown to the network, the network will use a simple updating rule to modify its state. The network will keep modifying its state until it settles into a stable state, which will be one of many fixed points for the updating rule. We’ll refer to the stable state that the network reaches as the memory that the network associates with the input stimulus.
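To make the updating rule concrete, here is a minimal sketch of one common textbook formulation: Hebbian weights built from outer products of the stored patterns, followed by repeated sign updates until nothing changes. It assumes a recent Julia 1.x. The names hebbian_weights and settle! are invented for this illustration and are not the HopfieldNets API, and the package may well use a different training or update scheme.

```julia
# Hebbian weights: averaged outer products of the stored patterns
# (one pattern per column), with the diagonal set to zero.
function hebbian_weights(patterns::Matrix{Float64})
    n = size(patterns, 1)
    W = patterns * patterns' / n
    for i in 1:n
        W[i, i] = 0.0
    end
    return W
end

# Sweep over the neurons, setting each one to the sign of its weighted
# input, and stop once a full sweep leaves the state unchanged.
function settle!(state::Vector{Float64}, W::Matrix{Float64}; maxsweeps::Int = 100)
    n = length(state)
    for sweep in 1:maxsweeps
        changed = false
        for i in 1:n
            field = 0.0
            for j in 1:n
                field += W[i, j] * state[j]
            end
            updated = field >= 0 ? 1.0 : -1.0
            if updated != state[i]
                state[i] = updated
                changed = true
            end
        end
        changed || break
    end
    return state
end

# Two orthogonal patterns of length 8, stored as columns.
p1 = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
p2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
W = hebbian_weights(hcat(p1, p2))

# Flip the first entry of p1 and let the network settle.
corrupted = copy(p1)
corrupted[1] = -1.0
restored = settle!(copy(corrupted), W)
```

In this toy case the corrupted stimulus differs from p1 in a single entry, and the settled state is the stored pattern p1, which is itself a fixed point of the rule.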

For example, let’s assume that we have a network consisting of 42 neurons arranged in a 7×6 matrix. We’ll train our network to recognize the letters X and O, which will also be represented as 7×6 matrices. After training the network, we’ll present corrupted copies of the letters X and O to show that the network is able to associate corrupted stimuli with their uncorrupted memories. We’ll also show the network an uncorrupted copy of the unfamiliar letter F to see what memory it associates with an unfamiliar stimulus.

Using the HopfieldNets package, we can do this in Julia as follows:

using HopfieldNets
 
include(Pkg.dir("HopfieldNets", "demo", "letters.jl"))
 
patterns = hcat(X, O)
 
n = size(patterns, 1)
 
h = DiscreteHopfieldNet(n)
 
train!(h, patterns)
 
Xcorrupt = copy(X)
for i = 2:7
     Xcorrupt[i] = 1
end
 
Xrestored = associate!(h, Xcorrupt)

In the image below, I show what happens when we present X, O and F to the network after training it on the X and O patterns:


[Figure: Results]

As you can see, the network perfectly recovers X and O from corrupted copies of those letters. In addition, the network associates F with an O, although the O is inverted relative to the O found in the training set. The emergence of untrained memories like this one is common in Hopfield nets. To continue the analogy with LSH, you can think of the memories produced by the Hopfield net as hashes of the input, which have the property that similar inputs tend to produce similar outputs. In practice, you shouldn’t use a Hopfield net to do LSH, because the computations involved are quite costly.

Hopefully this simple example has piqued your interest in Hopfield networks. If you’d like to learn more, you can read through the code I wrote or work through the very readable presentation of the theory of Hopfield networks in David MacKay’s book on Information Theory, Inference, and Learning Algorithms.

Turning Off Comments

A few days ago I disabled the comment system on this site. I’d been debating the change for some time, but reached a final decision while reading the comments on an article about a vaccine for Lyme disease.

Although this site has generally had very high quality comments, I’ve become increasingly opposed (as a matter of principle) to the use of online comment systems. My feelings mirror those of many other people who’ve deactivated comments on their sites, including Marco Arment and Matt Gemmell. As many have said before, comments tend to bring out the worst in people. The conversations that comments are ostensibly supposed to inspire now occur on Twitter and in volleys of blog posts that are traded between multiple blogs. In contrast, comment threads tend to trap the material that people either (a) don’t want to associate with their own name or (b) don’t want to take the time to write up formally. I think we have too much of both of these sorts of writing and would prefer not to encourage either.

What’s Next

The last two weeks have been full of changes for me. For those who’ve been asking about what’s next, I thought I’d write up a quick summary of all the news.

(1) I successfully defended my thesis this past Monday. Completing a Ph.D. has been a massive undertaking for the past five years, and it’s a major relief to be done. From now on I’ll be (perhaps undeservedly) making airline and restaurant reservations under the name Dr. White.

(2) As announced last week, I’ll be one of the residents at Hacker School this summer. The list of other residents is pretty amazing, and I’m really looking forward to meeting the students.

(3) In addition to my residency at Hacker School, I’ll be a temporary postdoc in the applied math department at MIT, where I’ll be working on Julia full-time. Expect to see lots of work on building up the core data analysis infrastructure.

(4) As of today I’ve accepted an offer to join Facebook’s Data Science team in the fall. I’ll be moving out to the Bay Area in November.

That’s all so far.