Speeding Up MLE Code in R

Recently, I’ve been fitting some models from the behavioral economics literature to choice data. Most of these models amount to non-linear variants of logistic regression in which I want to infer the parameters of a utility function. Because several of these models aren’t widely used, I’ve had to write my own maximum likelihood code to estimate the parameters of these models.

In the process, I’ve started to learn something about how to write code that runs quickly in R. In this post, I’ll try to share some of that knowledge by describing three ways of performing maximum likelihood estimation in R whose runtimes differ by two orders of magnitude. The differences seem to depend upon two factors: (1) how I access the entries of a data frame and (2) whether I use loops or vectorized operations to perform basic arithmetic.

To simplify things, I’ll present a model that should be familiar to people with a background in economics: the exponentially discounted utility model. To implement it in R, we define the discounted value of x dollars at time t as:

```r
discounted.value <- function(x, t, delta) {
  return(x * delta^t)
}
```

In addition to the discounted utility model, we assume that choices originate from a stochastic choice model with logistic noise. To invert this noise during inference, we’ll use the inverse logit transform:

```r
invlogit <- function(z) {
  return(1 / (1 + exp(-z)))
}
```
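As a quick sanity check, both helpers behave as expected at a few reference points (the definitions are repeated here so the snippet runs on its own):

```r
discounted.value <- function(x, t, delta) {
  return(x * delta^t)
}

invlogit <- function(z) {
  return(1 / (1 + exp(-z)))
}

discounted.value(10, 0, 0.9)  # 10: no discounting at t = 0
discounted.value(10, 2, 0.9)  # 8.1: two periods of discounting at delta = 0.9
invlogit(0)                   # 0.5: equal utilities imply a coin-flip choice
invlogit(10)                  # ~1: a large utility gap makes the choice near-deterministic
```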

To test my inference routine, I need to generate “stochastic” data of the sort you would expect to see from an exponentially discounting agent that’s indifferent between having $1 at time t = 0 and $3 at time t = 1. I’ll refer to the first good as (X1, T1) and the second good as (X2, T2). If the agent chooses (X2, T2), I’ll write that as C == 1; if they choose (X1, T1), I’ll write that as C == 0. With those conventions, the sample data is generated as:

```r
n <- 100

choices <- data.frame(X1 = rep(1, each = n),
                      T1 = rep(0, each = n),
                      X2 = rep(3, each = n),
                      T2 = rep(1, each = n),
                      C = rep(c(0, 1), times = n / 2))
```
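Note that the indifference point pins down the true discount factor: the agent values $1 now and $3 at t = 1 equally exactly when 3 * delta = 1, i.e. delta = 1/3, which is the value a correct estimator should recover. A small standalone check (repeating discounted.value so it runs on its own):

```r
discounted.value <- function(x, t, delta) {
  return(x * delta^t)
}

delta <- 1 / 3
u1 <- discounted.value(1, 0, delta)  # discounted value of $1 now
u2 <- discounted.value(3, 1, delta)  # discounted value of $3 at t = 1
u2 - u1  # ~0: the agent is indifferent between the two goods
```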

To fit the exponential model to this data set, we’ll use the optim function to minimize the negative log likelihood of the data by setting two parameters: a, which controls the scale of the noise in the utility comparison; and delta, the discount factor in the discounted utility model. The three implementations of this model that I’ll show only differ in the definition of the log likelihood function, so the final call to optim to perform maximum likelihood estimation is constant across all examples:

```r
logit.estimator <- function(choices) {
  wrapper <- function(x) {-log.likelihood(choices, x[1], x[2])}
  optimization.results <- optim(c(1, 1),
                                wrapper,
                                method = 'L-BFGS-B',
                                lower = c(0, 0),
                                upper = c(Inf, 1))
  return(optimization.results$par)
}
```

Here, I had to specify bounds for the parameters, a and delta, because it’s assumed that a must be positive and that delta must lie in the interval [0, 1]. To deal with these bounds, one has to use the L-BFGS-B method in optim.

The first implementation I’ll show is the one I find most natural to write, even though it turns out to be the least efficient by far:

```r
log.likelihood <- function(choices, a, delta) {
  ll <- 0
  for (i in 1:nrow(choices)) {
    u2 <- discounted.value(choices[i, 'X2'], choices[i, 'T2'], delta)
    u1 <- discounted.value(choices[i, 'X1'], choices[i, 'T1'], delta)
    p <- invlogit(a * (u2 - u1))
    if (choices[i, 'C'] == 1) {
      ll <- ll + log(p)
    } else {
      ll <- ll + log(1 - p)
    }
  }
  return(ll)
}
```

In the second implementation, I define a row-level likelihood function, so that the summing and logarithmic transform are vectorized:
```r
rowwise.likelihood <- function(row, a, delta) {
  u2 <- discounted.value(row['X2'], row['T2'], delta)
  u1 <- discounted.value(row['X1'], row['T1'], delta)
  p <- invlogit(a * (u2 - u1))
  return(ifelse(row['C'] == 1, p, 1 - p))
}

log.likelihood <- function(choices, a, delta) {
  likelihoods <- apply(choices, 1,
                       function(row) {rowwise.likelihood(row, a, delta)})
  return(sum(log(likelihoods)))
}
```

In the third implementation, I define a fully vectorized log likelihood function that avoids any explicit iteration and therefore removes most of the data frame indexing operations:

```r
log.likelihood <- function(choices, a, delta) {
  u2 <- discounted.value(choices$X2, choices$T2, delta)
  u1 <- discounted.value(choices$X1, choices$T1, delta)
  p <- invlogit(a * (u2 - u1))
  likelihoods <- ifelse(choices$C == 1, p, 1 - p)
  return(sum(log(likelihoods)))
}
```

The code I used to call all of these implementations and compare them is up on GitHub for those interested. The results, which strike me as remarkable, are below:

1. On my laptop, implementation 1 takes ~1.0 second to run.
2. On my laptop, implementation 2 takes ~0.25 seconds to run.
3. On my laptop, implementation 3 takes ~0.01 seconds to run.
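The timings above came from a simple wall-clock comparison; a minimal sketch of that kind of benchmark is below. The function names loop.ll and vec.ll are mine (standing in for implementations 1 and 3), and the absolute numbers will of course vary by machine:

```r
# Self-contained timing sketch comparing a loop-based and a vectorized
# log likelihood on the same data; loop.ll / vec.ll are hypothetical names.
discounted.value <- function(x, t, delta) x * delta^t
invlogit <- function(z) 1 / (1 + exp(-z))

n <- 100
choices <- data.frame(X1 = rep(1, n), T1 = rep(0, n),
                      X2 = rep(3, n), T2 = rep(1, n),
                      C = rep(c(0, 1), times = n / 2))

loop.ll <- function(choices, a, delta) {
  ll <- 0
  for (i in 1:nrow(choices)) {
    u2 <- discounted.value(choices[i, 'X2'], choices[i, 'T2'], delta)
    u1 <- discounted.value(choices[i, 'X1'], choices[i, 'T1'], delta)
    p <- invlogit(a * (u2 - u1))
    if (choices[i, 'C'] == 1) {
      ll <- ll + log(p)
    } else {
      ll <- ll + log(1 - p)
    }
  }
  ll
}

vec.ll <- function(choices, a, delta) {
  u2 <- discounted.value(choices$X2, choices$T2, delta)
  u1 <- discounted.value(choices$X1, choices$T1, delta)
  p <- invlogit(a * (u2 - u1))
  sum(log(ifelse(choices$C == 1, p, 1 - p)))
}

# The two implementations agree on the log likelihood value...
stopifnot(abs(loop.ll(choices, 1, 0.5) - vec.ll(choices, 1, 0.5)) < 1e-8)

# ...but differ sharply in runtime once called repeatedly, as under optim:
system.time(for (i in 1:100) loop.ll(choices, 1, 0.5))
system.time(for (i in 1:100) vec.ll(choices, 1, 0.5))
```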

In short, the third implementation is 100x faster than the first implementation with only minor changes to the code I originally wrote. Hopefully this example will help inspire others who have R code they’d like to speed up, but aren’t sure where to start.

9 responses to “Speeding Up MLE Code in R”

1. Hi
Two more suggestions to improve the speed:
1> Compile discounted.value, invlogit and log.likelihood with cmpfun. This method is very easy to implement:

```r
library(compiler)
discounted.value <- cmpfun(discounted.value)
invlogit <- cmpfun(invlogit)
log.likelihood <- cmpfun(log.likelihood)
```

2> Use Rcpp to write discounted.value, invlogit and log.likelihood. I didn’t implement this, but I am sure you could get an improvement in speed if you compute the log.likelihood with C++ code.

2. Hi John, Thanks for the post – super helpful. Do you know how I would fit a logistic function around a non-binomial outcome variable? In other words, the logistic function assumes that Y = 0 or Y = 1. I have a logistic function where Y can take on any value between 0 and 1. I have a linear model that I want to “feed” the logistic function and get back estimated parameters (similar to your approach above). How would that alter the MLE approach?

3. John, thanks for your response. I can use a non-linear regression function nls() that I believe does MLE.

The model I’m trying to predict is:

Y = d / ( 1 + exp(a + bX) )

where X and Y are a column of n observations, d is a constant, and a & b are the parameters of interest (to be measured). The one difference between my model and the standard logit is that Y takes on any value (not just 0 and 1).

Here’s the code in R (d = 1000 in this example):

```r
log1 <- nls(Y ~ 1000 / (1 + exp(a + b * X)),
            start = list(a = 0.12345, b = 0.54321),
            data = logdat)
summary(log1)
```

Cheers,
Kevin

4. yuck r for loop really makes me feel antsy. sapply, lapply are also almost as slow.
john chambers wrote a warning in his R book that vectorization is much faster because the operation becomes truly linear. if you use for loop to implement something that should be linear, the whole operation becomes O(n^2), which is quite sad.

5. Hi,
To whomever it may concern,
I am trying to find the MLE for a two-parameter model using the Newton-Raphson method. Can you help me with R code to find the MLE?

6. Very interesting. I also wondered how much compiled code can help to further improve performance.

7. You can enhance the code further by avoiding ifelse:

```r
likelihoods <- ifelse(choices$C == 1, p, 1 - p)
```