Why Julia's DataFrames are Still Slow

2015-11-28 6-minute read

Introduction

Although I’ve recently decided to take a break from working on OSS for a little while, I’m still as excited as ever about Julia as a language.

That said, I’m still unhappy with the performance of Julia’s core data analysis infrastructure. The performance of code that deals with missing values has been substantially improved thanks to the beta release of the NullableArrays package, which David Gold developed during this past Julia Summer of Code. But the DataFrames package is still a source of performance problems.

The goal of this post is to explain why Julia’s DataFrames are still unacceptably slow in many important use cases – and will remain slow even after the current dependency on the DataArrays package is replaced with a dependency on NullableArrays.

Problematic Interactions with Julia’s Compiler

The core problem with the DataFrames library is that a DataFrame is, at its core, a black-box container that could, in theory, contain objects of arbitrary types. In practice, a DataFrame contains highly constrained objects, but those constraints are (a) hard to express to the compiler and (b) still too weak to allow the compiler to produce the most efficient machine code.

The use of any black-box container creates the potential for performance problems in Julia because of the way that Julia’s compiler works. In particular, Julia’s compiler is able to execute code quickly because it can generate custom machine code for every function call – and this custom machine code is specialized for the specific run-time types of the function’s arguments.

This run-time generation of custom machine code is called specialization. When working with black-box containers, Julia’s approach to specialization is not used to full effect because machine code specialization based on run-time types only occurs at function call sites. If you access objects from a black-box container and then perform extended computations on the results, those computations will not be fully specialized because there is no function call between (a) the moment at which type uncertainty about the contents of the black-box container is removed and (b) the moment at which code that could benefit from type information is executed.

A Minimal Example

To see this concern in practice, consider the following minimal example of a hot loop being executed on values that are extracted from a black-box container:

function g1(black_box_container)
    x, y = black_box_container[1], black_box_container[2]
    n = length(x)
    s = 0.0
    for i in 1:n
        s += x[i] * y[i]
    end
    s
end

function hot_loop(x, y)
    n = length(x)
    s = 0.0
    for i in 1:n
        s += x[i] * y[i]
    end
    s
end

function g2(black_box_container)
    x, y = black_box_container[1], black_box_container[2]
    hot_loop(x, y)
end

container = Any[randn(10_000_000), randn(10_000_000)];

@time g1(container)
# 2.258571 seconds (70.00 M allocations: 1.192 GB, 5.03% gc time)

@time g2(container)
# 0.015286 seconds (5 allocations: 176 bytes)

g1 is approximately 150x slower than g2 on my machine. But g2 is, at a certain level of abstraction, exactly equivalent to g1 – the only difference is that the hot loop in g1 has been put inside of a function call. To convince yourself that the function call boundary is the only important difference between these two functions, consider the following variation of g2 and hot_loop:

@inline function hot_loop_alternative(x, y)
    n = length(x)
    s = 0.0
    for i in 1:n
        s += x[i] * y[i]
    end
    s
end

function g3(black_box_container)
    x, y = black_box_container[1], black_box_container[2]
    hot_loop_alternative(x, y)
end

@time g1(container)
# 2.290116 seconds (70.00 M allocations: 1.192 GB, 4.90% gc time)

@time g2(container)
# 0.017835 seconds (5 allocations: 176 bytes)

@time g3(container)
# 2.250301 seconds (70.00 M allocations: 1.192 GB, 5.08% gc time)

On my system, forcing the hot loop code to be inlined removes all of the performance difference between g1 and g2. Somewhat ironically, by inlining the hot loop, we’ve prevented the compiler from generating machine code that’s specialized on the types of the x and y values we pull out of our black_box_container. Inlining removes a function call site – and function call sites are the only times when machine code can be fully specialized based on run-time type information.

This problem is the core issue that needs to be resolved to make Julia’s DataFrames as efficient as they should be. Below I outline three potential solutions to this problem. I do not claim that these three are the only solutions; I offer them only to illustrate important issues that need to be addressed.

Potential Solutions to the Under-Specialization Problem

One possible solution to the problem of under-specialization is to change Julia’s compiler. I think that work on that front could be very effective, but the introduction of specialization strategies beyond Julia’s current “specialize at function call sites” would make Julia’s compiler much more complex – and could, in theory, make some code slower if the compiler were to spend more time performing compilation and less time performing the actual computations that a user wants to perform.

A second possible solution is to generate custom DataFrame types for every distinct DataFrame object. This could convert DataFrames from black-box containers that contain objects of arbitrary type into fully typed containers that can only contain objects of types that are fully known to the compiler.

The danger with this strategy is that you could generate an excessively large number of different specializations – which would again run the risk of spending more time inside the compiler than inside of the code you actually want to execute. It could also create excessive memory pressure as an increasing number of specialized code paths are stored in memory. Despite these concerns, a more aggressively typed DataFrame might be a powerful tool for doing data analysis.

The last possible solution I know of is the introduction of a high-level API that ensures that operations on DataFrames always reduce down to operations on objects whose types are known when hot loops execute. This is essentially the computational model used in traditional databases: take in a SQL specification of a computation, make use of knowledge about the data actually stored in existing tables to formulate an optimized plan for performing that computation, and then perform that optimized computation.

I think this third option is the best because it will also solve another problem Julia’s data infrastructure will hit eventually: the creation of code that is insufficiently generic and not portable to other backends. If people learn to write code that only works efficiently for a specific implementation of DataFrames, then their code will likely not work when they try to apply it to data stored in alternative backends (e.g. traditional databases). This would trap users into data structures that may not suit their needs. The introduction of a layer of appropriate abstractions (as in dplyr and Ibis) would resolve both issues at once.

Take-Aways

Making Julia’s DataFrames better is still a work-in-progress.
The core issue is still the usage of data structures that are not amenable to Julia’s type inference machinery. One of the two main issues is now resolved; another must be addressed before things function smoothly.
Several solutions to this remaining are possible; we will probably see one or more of these solutions gain traction in the near-term future.