Doing Maximum Likelihood Estimation by Hand in R

Lately I’ve been writing maximum likelihood estimation code by hand for some economic models that I’m working with. It’s actually a fairly simple task, so I thought that I would write up the basic approach in case there are readers who haven’t built a generic estimation system before.

First, let’s start with a toy example for which there is a closed-form analytic solution. We’ll ignore that solution and use optimization functions to do the estimation. Starting with this toy example makes it easy to see how well an approximation system can be expected to perform under the best circumstances — and also where it goes wrong if you make poor programming decisions.

Suppose that you’ve got a sequence of draws from a Bernoulli variable whose parameter is unknown, like so:

p.parameter <- 0.8
sequence <- rbinom(10, 1, p.parameter)
# [1] 0 1 1 1 1 1 1 0 1 0

Given the sequence, we want to estimate the value of the parameter, p, which is not known to us. The maximum likelihood approach says that we should select whatever value of the parameter makes the observed data most probable. For a Bernoulli variable, that means searching the space of possible values of p, i.e., the interval [0, 1], for the value under which the observed sequence is most probable.

It’s worth pointing out that this particular maximum likelihood problem has a closed-form analytic solution: the estimate is just the sample mean. We’ll therefore use mean(sequence) as the benchmark for judging the accuracy of our numerical approximation.
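
With the particular sequence above, that benchmark is easy to compute directly:

mean(sequence)
# [1] 0.7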

How do we find the parameter numerically? First, we want to define a function that specifies the probability of our entire data set. We assume that each observation in the data is independently and identically distributed, so that the probability of the sequence is the product of the probabilities of the individual values. For Bernoulli variables, this becomes the following function:

likelihood <- function(sequence, p.parameter)
{
  likelihood <- 1
 
  for (i in 1:length(sequence))
  {
    if (sequence[i] == 1)
    {
      likelihood <- likelihood * p.parameter
    }
    else
    {
      likelihood <- likelihood * (1 - p.parameter)
    }
  }
 
  return(likelihood)
}
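
As an aside, the same computation can be written without the loop. The vectorized version below is just an equivalent sketch, not part of the original code, and the function name is chosen here purely for illustration:

likelihood.vectorized <- function(sequence, p.parameter)
{
  # The product over observations collapses to two powers:
  # p for every 1 in the sequence and (1 - p) for every 0.
  p.parameter ^ sum(sequence == 1) * (1 - p.parameter) ^ sum(sequence == 0)
}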

To do maximum likelihood estimation, we therefore only need to hand this function to an optimization routine and maximize it. A quick look at the likelihood as a function of p makes it clear that any decent optimization algorithm should be able to find the maximum:

possible.p <- seq(0, 1, by = 0.001)
jpeg('Likelihood_Concavity.jpg')
library('ggplot2')
qplot(possible.p,
      sapply(possible.p, function (p) {likelihood(sequence, p)}),
      geom = 'line',
      main = 'Likelihood as a Function of P',
      xlab = 'P',
      ylab = 'Likelihood')
dev.off()
[Figure: Likelihood as a Function of P (Likelihood_Concavity.jpg)]

For single variable cases, I find that it’s easiest to use R’s base function optimize to solve the optimization problem:

mle.results <- optimize(function(p) {likelihood(sequence, p)},
                        interval = c(0, 1),
                        maximum = TRUE)
 
mle.results
# $maximum
# [1] 0.6999843
#
# $objective
# [1] 0.002223566
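
Incidentally, you can verify that objective value by hand, since the sequence contains seven 1’s and three 0’s:

0.7^7 * 0.3^3
# [1] 0.002223566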

In the call to optimize, I’ve used an anonymous function that returns the likelihood of our current data given a value of p; I’ve also specified that the values of p must lie in the interval [0, 1] and asked optimize to maximize the result rather than minimize it, which is the default behavior. Examining the output, we can see that the likelihood of the data set was maximized very near 0.7, the sample mean, which suggests that the numerical approach can work. Note that the objective value is just the likelihood of the data set at the estimated value of p, as the hand computation above confirms. The smallness of this objective becomes a serious problem for larger data sets. To understand why, it’s worth seeing what happens as the sample size grows from 10 to 2,500 observations:

error.behavior <- data.frame()
 
for (n in 10:2500)
{
  sequence <- rbinom(n, 1, p.parameter)
 
  likelihood.results <- optimize(function(p) {likelihood(sequence, p)},
                                 interval = c(0, 1),
                                 maximum = TRUE)
 
  true.mle <- mean(sequence)
 
  likelihood.error <- true.mle - likelihood.results$maximum
 
  error.behavior <- rbind(error.behavior,
                          data.frame(N = n,
                                     Error = likelihood.error,
                                     Algorithm = 'Likelihood'))
}
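
The listing above only builds the error.behavior data frame; the original post shows a figure at this point, and a plotting call along these lines (not in the original listing, and assuming the ggplot2 library loaded earlier is still attached) would draw something similar:

jpeg('Likelihood_Problems.jpg')
ggplot(error.behavior, aes(x = N, y = Error)) +
  geom_line() +
  labs(title = 'Error of the Likelihood Approach as N Grows',
       x = 'Sample Size',
       y = 'Deviation from True MLE')
dev.off()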
[Figure: Likelihood_Problems.jpg, deviation of the likelihood-based estimate from the true MLE as sample size grows]

As you can see, our approximation approach works great until the data set grows large, and then it falls apart. This is exactly the opposite of what asymptotic statistical theory tells us should happen, so it’s clear that something is going very wrong. A quick examination of the results from the last pass through our loop makes clear what’s wrong:

sequence <- rbinom(2500, 1, p.parameter)
 
likelihood.results <- optimize(function(p) {likelihood(sequence, p)},
                               interval = c(0, 1),
                               maximum = TRUE)
 
likelihood.results
# $maximum
# [1] 0.9999339
#
# $objective
# [1] 0
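
That objective of 0 is pure floating point underflow. As a quick check (not in the original post), the smallest positive double is around 2.2e-308, and even a single probability raised to the 2,500th power falls far below it:

.Machine$double.xmin
# [1] 2.225074e-308
0.7^2500
# [1] 0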

The likelihood of our data is numerically indistinguishable from 0 given the precision of my machine’s floating point values. Multiplying thousands of probabilities together is simply not a viable approach without infinite precision. Thankfully, there’s a very simple solution: replace all of the probabilities with their logarithms. Instead of maximizing the likelihood, we maximize the log likelihood, which involves summing rather than multiplying, and therefore stays numerically stable:

log.likelihood <- function(sequence, p)
{
  log.likelihood <- 0
 
  for (i in 1:length(sequence))
  {
    if (sequence[i] == 1)
    {
      log.likelihood <- log.likelihood + log(p)
    }
    else
    {
      log.likelihood <- log.likelihood + log(1 - p)
    }
  }
 
  return(log.likelihood)
}
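
As a side note, the same log likelihood can be computed in one line with R’s built-in Bernoulli density; this is just an equivalent sketch, not part of the original code:

log.likelihood.vectorized <- function(sequence, p)
{
  # dbinom with size = 1 evaluates Bernoulli probabilities;
  # log = TRUE returns their logarithms, which we then sum.
  sum(dbinom(sequence, size = 1, prob = p, log = TRUE))
}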

You can check that this problem is as easily solved numerically as the original problem by graphing the log likelihood:

sequence <- c(0, 1, 1, 1, 1, 1, 1, 0, 1, 0)
possible.p <- seq(0, 1, by = 0.001)
jpeg('Log_Likelihood_Concavity.jpg')
qplot(possible.p,
      sapply(possible.p, function (p) {log.likelihood(sequence, p)}),
      geom = 'line',
      main = 'Log Likelihood as a Function of P',
      xlab = 'P',
      ylab = 'Log Likelihood')
dev.off()
[Figure: Log Likelihood as a Function of P (Log_Likelihood_Concavity.jpg)]
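
Before rerunning the full diagnostics, a single large sample (a sketch, not from the original post) confirms that the log likelihood version no longer collapses; the maximizer should land very close to the sample mean:

sequence <- rbinom(2500, 1, p.parameter)
optimize(function(p) {log.likelihood(sequence, p)},
         interval = c(0, 1),
         maximum = TRUE)$maximum
# Should agree with mean(sequence) to several decimal places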

And then you can rerun our error diagnostics using both approaches to confirm that the log likelihood approach does not suffer from the same numerical problems:

error.behavior <- data.frame()
 
for (n in 10:2500)
{
  sequence <- rbinom(n, 1, p.parameter)
 
  likelihood.results <- optimize(function(p) {likelihood(sequence, p)},
                                 interval = c(0, 1),
                                 maximum = TRUE)
 
  log.likelihood.results <- optimize(function(p) {log.likelihood(sequence, p)},
                                     interval = c(0, 1),
                                     maximum = TRUE)
 
  true.mle <- mean(sequence)
 
  likelihood.error <- true.mle - likelihood.results$maximum
  log.likelihood.error <- true.mle - log.likelihood.results$maximum
 
  error.behavior <- rbind(error.behavior,
                          data.frame(N = n,
                                     Error = likelihood.error,
                                     Algorithm = 'Likelihood'),
                          data.frame(N = n,
                                     Error = log.likelihood.error,
                                     Algorithm = 'Log Likelihood'))
}
 
jpeg('Long-Term_Error_Behavior.jpg')
ggplot(error.behavior, aes(x = N, y = Error)) +
  geom_line(aes(group = Algorithm, color = Algorithm)) +
  labs(title = 'Long-Term Error Behavior of Two Numerical Approaches',
       x = 'Sample Size',
       y = 'Deviation from True MLE')
dev.off()
[Figure: Long-Term Error Behavior of Two Numerical Approaches (Long-Term_Error_Behavior.jpg)]

More generally, given any data set and any model, you can — at least in principle — solve the maximum likelihood estimation problem using numerical optimization algorithms. The general approach requires that you specify a log likelihood function for your model, analogous to the R-like pseudocode below:

log.likelihood <- function(sequence.as.data.frame, likelihood.function, parameters)
{
  log.likelihood <- 0
 
  for (i in 1:nrow(sequence.as.data.frame))
  {
    log.likelihood <- log.likelihood + log(likelihood.function(sequence.as.data.frame[i,], parameters))
  }
 
  return(log.likelihood)
}

Then you need to apply multivariable, constrained optimization tools to find your maximum likelihood estimates. This actually turns out to be a hard problem in general, so I’m going to bail out on the topic here.
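
That said, a minimal sketch (not from the original post) of an unconstrained two-parameter case gives a flavor of the mechanics. Here we fit the mean and standard deviation of a normal sample by minimizing the negative log likelihood with optim; the function and variable names are chosen here purely for illustration:

set.seed(1)
x <- rnorm(1000, mean = 2, sd = 3)

negative.log.likelihood <- function(theta)
{
  # theta[1] is the mean, theta[2] the standard deviation.
  -sum(dnorm(x, mean = theta[1], sd = theta[2], log = TRUE))
}

normal.fit <- optim(par = c(0, 1),
                    fn = negative.log.likelihood,
                    method = 'L-BFGS-B',
                    lower = c(-Inf, 1e-6))

normal.fit$par
# Should be close to mean(x) and sd(x)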

24 responses to “Doing Maximum Likelihood Estimation by Hand in R”

  1. Fabian

    no need to reinvent the wheel.
    stats4:::mle() is a nice wrapper for optim for this kind of thing……

  2. Eduardo

    You should vectorize your log.likelihood function. e.g.

    log.likelihood.sum <- function(sequence, p) {
    log.likelihood <- sum(log(p)*(sequence==1)) + sum(log(1-p)*(sequence==0))
    }

    For N=10000, it is something like 30 times faster.

  3. Fabian

    You’re right of course – it does have didactic value…
    Great blog you have here, I’ll be coming back.

  4. romunov

    Thank you for this educational post.

  5. Jim Adams

    Another way to define the likelihood function:
    likelihood <- function(sequence, p)
    {cumprod(c(1,rep(p,sum(sequence==1)),rep((1-p),sum(sequence==0))))[length(sequence)+1]}

  7. Kim

    Hi,

    This is a really nice blog, for sure to help others when no standard code is available in R!
    It is of help at least to me!

    Kim

  8. Kiran

    Hi John,
    in this case the function that is trying to be maximized is always:

    n1*log(p)+(n-n1)*log(1-p)
    where n1 is the number of +ve examples and n is the total number of observations

    The differentiation of this function and subsequent equating to 0 should give us the maximum

    f(p) = n1*log(p)+n2*log(1-p)
    f’(p) = n1/p + n2/(1-p)

    f’(p) = 0 implies
    n1/p = -n2/(1-p)
    n1*(1-p) = -n2*p
    n1 – n1*p = -n2*p
    n1 = -n2*p + n1*p = p*(n1-n2)
    p= n1/(n1-n2)

    What am I doing wrong?

  9. Kiran

    Hi John,
    in this case the function that is trying to be maximized is always:

    n1*log(p)+(n-n1)*log(1-p)
    where n1 is the number of +ve examples and n is the total number of observations

    When will maximum likelihood give an answer different from the average value?

  10. Kiran

    Hi John,
    Sounds good.

    In the binomial case, say I have three examples:
    a) 20 out of 100 right
    b) 40 out of 200 right
    c) 60 out of 300 right

    MLE and average for each of the three is 0.2
    But I should have more confidence in c) than in a) or b) as I had more samples to derive the value.

    How can this be accomodated?

    Thanks
    Kiran
    [Btw: I have read your book and it is a gem. Thanks for writing!]

  11. Kiran

    Thanks John. I figured it out.

    In Bernoulli case, we can assume the prior distribution is a beta distribution with parameters alpha1 and alpha; alpha1/alpha being the prior probability. Say MLE = N1/N where N1 and N are numbers in the sample.

    Now the estimate would be
    lambda * prior + (1-lambda) * MLE
    where lambda = N/(N+alpha)

    Only problem here is figuring out what the prior estimate should be! Do you know of any way to do that to choose alpha or alpha1 or maybe use a function to find their values in the optimal way

  12. Shashank

    Hi could any one help in decoding this particular code?
    I am guessing its MLE but definitely not sure what’s happening, is it possible for any of you to explain the code below.

    llmaxM2D=function(b1,b2,b3,k2,g3,dv,ev,wv=1){
    # b1,b2,b3,g3 are given
    # solve for k2
    k21=k2
    k20=k21-1
    thetat=b2*ev*exp(b1+b3*g3)
    s1=sum(dv*b2*wv)
    while(abs(k21-k20) > 0.1)
    {
    k20=k21
    f0=sum((exp(k20*b2)*thetat)*wv)-s1
    df0=sum((exp(k20*b2)*b2*thetat)*wv)
    k21=k20-f0/df0
    }
    k21
    }

    Many Thanks,
    Shashank

  13. Chia-hung Tsai

    Really excellent instruction. Thanks a lot.

  14. Emily Rochefort

    Hello

    I would like to build a few customised function(s) for developing a baseline ARMA(1,1)-GARCH(1,1) stock return volatility model and its variants (e.g. EGARCH, NGARCH, and TGARCH etc). Somehow my current optimization gives infinite values or leads to those results that differ dramatically from those results based on the fGarch command (which takes only a few seconds). Also, the draft optimization is very slow.

    My goal is to build the customised function for each of the 4 GARCH variants and then to adapt each function to allow the volatility coefficients to change over time. I fear that the customised function may be wrong in the first place. It would be good if you could provide some guidance given your expertise in the R programming space. At any rate, I would be happy to communicate via some email exchange. Many thanks.

    Emily

    #=====================================================================================================#
    # This R script offers a suite of functions for estimating the volatility dynamics based on the standard ARMA(1,1)-GARCH(1,1) model and its variants.
    # The baseline ARMA(1,1) model characterizes the dynamic evolution of the return generating process.
    # The baseline GARCH(1,1) model depicts the return volatility dynamics over time.
    # We can extend the GARCH(1,1) volatility model to a variety of alternative specifications to capture the potential asymmetry for a better comparison:
    # GARCH(1,1), EGARCH(1,1), NGARCH(1,1), and TGARCH(1,1).
    #=====================================================================================================#
    options(scipen=10)

    intel= read.csv(file="intel.csv")
    summary(intel)

    raw_data= as.matrix(intel$logret)

    library(fGarch)
    garchFit(~arma(1,1)+garch(1,1), data=raw_data, trace=FALSE)

    negative_log_likelihood_arma11_garch11=
    function(theta, data)
    {mean =theta[1]
    delta=theta[2]
    gamma=theta[3]
    omega=theta[4]
    alpha=theta[5]
    beta= theta[6]

    r= ts(data)
    n= length(r)

    u= vector(length=n)
    u= ts(u)
    u[1]= r[1]- mean

    for (t in 2:n)
    {u[t]= r[t]- mean- delta*r[t-1]- gamma*u[t-1]}

    h= vector(length=n)
    h= ts(h)
    h[1]= omega/(1-alpha-beta)

    for (t in 2:n)
    {h[t]= omega+ alpha*(u[t-1]^2)+ beta*h[t-1]}

    #return(-sum(dnorm(u[2:n], mean=mean, sd=sqrt(h[2:n]), log=TRUE)))
    pi=3.141592653589793238462643383279502884197169399375105820974944592
    return(-sum(-0.5*log(2*pi) -0.5*log(h[2:n]) -0.5*(u[2:n]^2)/h[2:n]))
    }

    theta0=rep(0.01,6)
    negative_log_likelihood_arma11_garch11(theta=theta0, data=raw_data)

    alpha= proc.time()
    maximum_likelihood_fit_arma11_garch11=
    nlm(negative_log_likelihood_arma11_garch11,
    p=theta0,
    data=raw_data,
    hessian=TRUE,
    iterlim=500)
    #optim(theta0,
    # negative_log_likelihood_arma11_garch11,
    # data=raw_data,
    # method="L-BFGS-B",
    # upper=c(+0.999999999999,+0.999999999999,+0.999999999999,0.999999999999,0.999999999999,0.999999999999),
    # lower=c(-0.999999999999,-0.999999999999,-0.999999999999,0.000000000001,0.000000000001,0.000000000001),
    # hessian=TRUE)

    # We record the end time and calculate the total runtime for the above work.
    omega= proc.time()
    runtime= omega-alpha
    zhours = floor(runtime/60/60)
    zminutes=floor(runtime/60- zhours*60)
    zseconds=floor(runtime- zhours*60*60- zminutes*60)
    print(paste("It takes ",zhours," hour(s) ", zminutes," minute(s) ","and ", zseconds," second(s) to finish running this R program",sep=""))

    maximum_likelihood_fit_arma11_garch11

    sqrt(diag(solve(maximum_likelihood_fit_arma11_garch11$hessian)))

  15. lovelygirl

    Hi,
    John, I just want to know how to coding the log-likelihood of a beta distribution in R?

    Thank you very much!

  16. Roman Luštrik

    @lovelygirl, you can find the log likelihood @ wikipedia: http://en.wikipedia.org/wiki/Beta_distribution#Maximum_likelihood Or did I misinterpret the question and you’re having trouble implementing this in R?

  17. lovelygirl

    Yes, I really confused about how to write the log-likelihood of the Beta distribution in R. my lecturer gave me some R codes that I need to finish, and if I can complete this, I can compute the log-likelihood of a Beta distribution in R, but I got some difficulties to complete it. So could you please help me some to have a look at it and give me some hints on these?
    The R codes which the lecturer gave to me are:
    LL.Beta<-function(par){
    a<-par[1]
    b<-par[2]
    ll<- (this is the line I need to complete in R)
    return(ll)
    }

  18. niren sirohi

    hi there–very interesting post. Would you know how in R i can write a log likelihood function and use an EM algorithm to maximize it?