Lately I’ve been writing maximum likelihood estimation code by hand for some economic models that I’m working with. It’s actually a fairly simple task, so I thought that I would write up the basic approach in case there are readers who haven’t built a generic estimation system before.

First, let’s start with a toy example for which there is a closed-form analytic solution. We’ll ignore that solution and use optimization functions to do the estimation. Starting with this toy example makes it easy to see how well an approximation system can be expected to perform under the best circumstances — and also where it goes wrong if you make poor programming decisions.

Suppose that you’ve got a sequence of values from an unknown Bernoulli variable like so:

```r
p.parameter <- 0.8
sequence <- rbinom(10, 1, p.parameter)
# [1] 0 1 1 1 1 1 1 0 1 0
```

Given the sequence, we want to estimate the value of the parameter, *p*, which is not known to us. The maximum likelihood approach says that we should select the parameter that makes the data most probable. For a Bernoulli variable, this is simply a search through the space of possible values for *p* (i.e. the interval [0, 1]) for the value that makes the observed data most probable.

It’s worth pointing out that the analytic solution to this maximum likelihood estimation problem is to use the sample mean. We’ll therefore use `mean(sequence)` as a measure of the accuracy of our approximation algorithm.

How do we find the parameter numerically? First, we want to define a function that specifies the probability of our entire data set. We assume that each observation in the data is independently and identically distributed, so that the probability of the sequence is the product of the probabilities of each value. For the Bernoulli variables, this becomes the following function:

```r
likelihood <- function(sequence, p.parameter)
{
  likelihood <- 1

  for (i in 1:length(sequence))
  {
    if (sequence[i] == 1)
    {
      likelihood <- likelihood * p.parameter
    }
    else
    {
      likelihood <- likelihood * (1 - p.parameter)
    }
  }

  return(likelihood)
}
```

To do maximum likelihood estimation, we therefore only need to use an optimization function to maximize this function. A quick examination of the likelihood function as a function of *p* makes it clear that any decent optimization algorithm should be able to find the maximum:

```r
possible.p <- seq(0, 1, by = 0.001)

jpeg('Likelihood_Concavity.jpg')
library('ggplot2')
qplot(possible.p,
      sapply(possible.p, function (p) {likelihood(sequence, p)}),
      geom = 'line',
      main = 'Likelihood as a Function of P',
      xlab = 'P',
      ylab = 'Likelihood')
dev.off()
```

For single variable cases, I find that it’s easiest to use R’s base function `optimize` to solve the optimization problem:

```r
mle.results <- optimize(function(p) {likelihood(sequence, p)},
                        interval = c(0, 1),
                        maximum = TRUE)

mle.results

# $maximum
# [1] 0.6999843
#
# $objective
# [1] 0.002223566
```

Here I’ve used an anonymous function that returns the likelihood of our current data given a value of *p*; I’ve also specified that the values of *p* must lie in the interval [0, 1] and asked `optimize` to maximize the result, rather than minimize, which is the default behavior. Examining the output of `optimize`, we can see that the likelihood of the data set was maximized very near 0.7, the sample mean. This suggests that the optimization approach can work. It’s worth noting that the objective value is the likelihood of the data set for the specified value of *p*. The smallness of the objective for large problems can become a major problem. To understand why, it’s worth seeing what happens as the size of the sample grows from 10 to 2500 samples:

```r
error.behavior <- data.frame()

for (n in 10:2500)
{
  sequence <- rbinom(n, 1, p.parameter)

  likelihood.results <- optimize(function(p) {likelihood(sequence, p)},
                                 interval = c(0, 1),
                                 maximum = TRUE)

  true.mle <- mean(sequence)

  likelihood.error <- true.mle - likelihood.results$maximum

  error.behavior <- rbind(error.behavior,
                          data.frame(N = n,
                                     Error = likelihood.error,
                                     Algorithm = 'Likelihood'))
}
```

As you can see, our approximation approach works great until our data set grows, and then it falls apart. This is exactly the opposite of what asymptotic statistical theory tells us should be happening, so it’s clear that something is going very wrong. A quick examination of the results from the last pass through our loop makes clear what’s wrong:

```r
sequence <- rbinom(2500, 1, p.parameter)

likelihood.results <- optimize(function(p) {likelihood(sequence, p)},
                               interval = c(0, 1),
                               maximum = TRUE)

likelihood.results

# $maximum
# [1] 0.9999339
#
# $objective
# [1] 0
```

The likelihood of our data is numerically indistinguishable from 0 given the precision of my machine’s floating point values. Multiplying thousands of probabilities together is simply not a viable approach without infinite precision. Thankfully, there’s a very simple solution: replace all of the probabilities with their logarithms. Instead of maximizing the likelihood, we maximize the log likelihood, which involves summing rather than multiplying, and therefore stays numerically stable:

```r
log.likelihood <- function(sequence, p)
{
  log.likelihood <- 0

  for (i in 1:length(sequence))
  {
    if (sequence[i] == 1)
    {
      log.likelihood <- log.likelihood + log(p)
    }
    else
    {
      log.likelihood <- log.likelihood + log(1 - p)
    }
  }

  return(log.likelihood)
}
```

You can check that this problem is as easily solved numerically as the original problem by graphing the log likelihood:

```r
sequence <- c(0, 1, 1, 1, 1, 1, 1, 0, 1, 0)
possible.p <- seq(0, 1, by = 0.001)

jpeg('Log_Likelihood_Concavity.jpg')
qplot(possible.p,
      sapply(possible.p, function (p) {log.likelihood(sequence, p)}),
      geom = 'line',
      main = 'Log Likelihood as a Function of P',
      xlab = 'P',
      ylab = 'Log Likelihood')
dev.off()
```

And then you can rerun our error diagnostics using both approaches to confirm that the log likelihood approach does not suffer from the same numerical problems:

```r
error.behavior <- data.frame()

for (n in 10:2500)
{
  sequence <- rbinom(n, 1, p.parameter)

  likelihood.results <- optimize(function(p) {likelihood(sequence, p)},
                                 interval = c(0, 1),
                                 maximum = TRUE)

  log.likelihood.results <- optimize(function(p) {log.likelihood(sequence, p)},
                                     interval = c(0, 1),
                                     maximum = TRUE)

  true.mle <- mean(sequence)

  likelihood.error <- true.mle - likelihood.results$maximum
  log.likelihood.error <- true.mle - log.likelihood.results$maximum

  error.behavior <- rbind(error.behavior,
                          data.frame(N = n,
                                     Error = likelihood.error,
                                     Algorithm = 'Likelihood'),
                          data.frame(N = n,
                                     Error = log.likelihood.error,
                                     Algorithm = 'Log Likelihood'))
}

jpeg('Long-Term_Error_Behavior.jpg')
ggplot(error.behavior, aes(x = N, y = Error)) +
  geom_line(aes(group = Algorithm, color = Algorithm)) +
  ggtitle('Long-Term Error Behavior of Two Numerical Approaches') +
  xlab('Sample Size') +
  ylab('Deviation from True MLE')
dev.off()
```

More generally, given any data set and any model, you can — at least in principle — solve the maximum likelihood estimation problem using numerical optimization algorithms. The general algorithm requires that you specify a more general log likelihood function analogous to the R-like pseudocode below:

```r
log.likelihood <- function(sequence.as.data.frame,
                           likelihood.function,
                           parameters)
{
  log.likelihood <- 0

  for (i in 1:nrow(sequence.as.data.frame))
  {
    log.likelihood <- log.likelihood +
      log(likelihood.function(sequence.as.data.frame[i, ], parameters))
  }

  return(log.likelihood)
}
```

Then you need to apply multivariable, constrained optimization tools to find your maximum likelihood estimates. This actually turns out to be a hard problem in general, so I’m going to bail out on the topic here.
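Before bailing out entirely, here is a sketch of what the multiparameter version can look like. This is my own illustrative example, not code from the post: jointly estimating a Normal distribution’s mean and standard deviation with `optim`, where all parameter values and the data are made up.

```r
# A sketch, not the post's code: joint MLE of a Normal's mean and sd via optim().
set.seed(1)
x <- rnorm(1000, mean = 3, sd = 2)

negative.log.likelihood <- function(theta, data)
{
  mu <- theta[1]
  sigma <- theta[2]

  # Crude constraint handling: outside sigma > 0, report an infinitely bad fit.
  if (sigma <= 0)
  {
    return(Inf)
  }

  -sum(dnorm(data, mean = mu, sd = sigma, log = TRUE))
}

# optim() minimizes by default, so we hand it the negative log likelihood.
fit <- optim(par = c(0, 1),
             fn = negative.log.likelihood,
             data = x)

fit$par  # Should land near the true values, c(3, 2).
```

With `method = 'L-BFGS-B'` you could instead impose box constraints on `sigma` directly via `lower`, rather than returning `Inf` outside the feasible region.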

No need to reinvent the wheel: stats4:::mle() is a nice wrapper around optim for this kind of thing…

Thanks for the link. I wasn’t aware of stats4:::mle(). In practice, I’ll probably switch over to using it. For me (and hopefully for others), I do think there’s some didactic value in reinventing this specific wheel.

You should vectorize your log.likelihood function, e.g.:

```r
log.likelihood.sum <- function(sequence, p) {
  sum(log(p) * (sequence == 1)) + sum(log(1 - p) * (sequence == 0))
}
```

For N = 10000, it is something like 30 times faster.

Thanks, Eduardo! That was something I was wondering how to do recently.

You’re right of course – it does have didactic value…

Great blog you have here, I’ll be coming back.

Thank you for this educational post.


Another way to define the likelihood function:

```r
likelihood <- function(sequence, p) {
  cumprod(c(1,
            rep(p, sum(sequence == 1)),
            rep(1 - p, sum(sequence == 0))))[length(sequence) + 1]
}
```

Hi,

This is a really nice blog, for sure to help others when no standard code is available in R!

It is of help at least to me!

Kim

Thanks for the compliment, Kim.

Hi John,

in this case the function that is trying to be maximized is always:

n1*log(p)+(n-n1)*log(1-p)

where n1 is the number of +ve examples and n is the total number of observations

The differentiation of this function and subsequent equating to 0 should give us the maximum:

f(p) = n1*log(p) + n2*log(1-p)

f'(p) = n1/p + n2/(1-p)

f'(p) = 0 implies

n1/p = -n2/(1-p)

n1*(1-p) = -n2*p

n1 - n1*p = -n2*p

n1 = -n2*p + n1*p = p*(n1-n2)

p = n1/(n1-n2)

What am I doing wrong?

Hi John,


When will maximum likelihood give an answer different from the average value?

Hi Kiran,

You’re right that this specific example can be solved analytically. You’re also right to conclude that the MLE for p for binomial data will always be the average value. But, in general, the computation of MLE’s isn’t something that can be done analytically. When the calculus-based analysis isn’t tractable, it’s very important to be able to compute an MLE using generic optimization techniques. I decided to illustrate those generic techniques using a simple example in which there is an analytic solution precisely because the analytic solution lets us check our work — but in practice the computational techniques are applied when the analytic technique breaks down.
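To make that concrete, here’s a sketch (my own example, with made-up data, not part of the original post) of a case with no closed-form MLE: the location parameter of a Cauchy distribution. The numerical recipe is exactly the one from the post.

```r
set.seed(1)
x <- rcauchy(500, location = 2)  # made-up sample with true location 2

cauchy.log.likelihood <- function(location, data)
{
  sum(dcauchy(data, location = location, log = TRUE))
}

# optimize() passes the extra `data` argument through to the objective.
cauchy.mle <- optimize(cauchy.log.likelihood,
                       interval = c(-10, 10),
                       data = x,
                       maximum = TRUE)

cauchy.mle$maximum  # Should be close to the true location, 2.
```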

Hi John,

Sounds good.

In the binomial case, say I have three examples:

a) 20 out of 100 right

b) 40 out of 200 right

c) 60 out of 300 right

MLE and average for each of the three is 0.2

But I should have more confidence in c) than in a) or b) as I had more samples to derive the value.

How can this be accommodated?

Thanks

Kiran

[Btw: I have read your book and it is a gem. Thanks for writing!]

Hi Kiran,

The concern that you should have more “confidence” (which is a loaded word in statistical theory and should probably be avoided whenever possible) is that the sampling distribution of the MLE given a true parameter is much tighter in (c) than in (a) or (b). Fully unpacking that statement is a complicated task for which I’ll direct you to standard statistics textbooks like Freedman, Purves and Pisani, but the big idea is simply that the variance of the MLE (as opposed to the mean) goes down as you have more samples.
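A quick simulation sketch of that statement (the sample sizes and replication count below are chosen arbitrarily): the spread of the binomial MLE around the true p shrinks as the sample size grows.

```r
set.seed(1)
p <- 0.2

# Standard deviation of the MLE (the sample mean) across 1000 simulated data sets.
mle.spread <- function(n)
{
  sd(replicate(1000, mean(rbinom(n, 1, p))))
}

sapply(c(100, 200, 300), mle.spread)
# The three values shrink roughly like sqrt(p * (1 - p) / n).
```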

Glad you liked the book.

Thanks John. I figured it out.

In Bernoulli case, we can assume the prior distribution is a beta distribution with parameters alpha1 and alpha; alpha1/alpha being the prior probability. Say MLE = N1/N where N1 and N are numbers in the sample.

Now the estimate would be

lambda * MLE + (1 - lambda) * prior

where lambda = N/(N+alpha)

The only problem here is figuring out what the prior estimate should be! Do you know of any way to choose alpha or alpha1, or maybe a function to find their values in an optimal way?

That’s certainly a viable approach, although I don’t think that there is a constant _optimal_ way to do things. I would personally use a Jeffreys’ prior for the beta distribution.
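As a sketch of how that plays out (the counts below are made up to echo the examples above): with the Jeffreys prior Beta(1/2, 1/2), the posterior mean for a Bernoulli p has a simple closed form, and the shrinkage toward 1/2 fades as n grows.

```r
# Posterior mean of p under a Beta(1/2, 1/2) prior, given n1 successes
# in n trials: (n1 + 1/2) / (n + 1).
posterior.mean <- function(n1, n)
{
  (n1 + 0.5) / (n + 1)
}

posterior.mean(60, 300)  # Very close to the MLE, 0.2.
posterior.mean(1, 5)     # 0.25, pulled toward 1/2 from the MLE, 0.2.
```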

Hi, could anyone help decode this particular code? I am guessing it’s MLE, but I’m definitely not sure what’s happening. Is it possible for any of you to explain the code below?

```r
llmaxM2D = function(b1, b2, b3, k2, g3, dv, ev, wv = 1) {
  # b1, b2, b3, g3 are given
  # solve for k2
  k21 = k2
  k20 = k21 - 1
  thetat = b2 * ev * exp(b1 + b3 * g3)
  s1 = sum(dv * b2 * wv)
  while (abs(k21 - k20) > 0.1)
  {
    k20 = k21
    f0 = sum((exp(k20 * b2) * thetat) * wv) - s1
    df0 = sum((exp(k20 * b2) * b2 * thetat) * wv)
    k21 = k20 - f0 / df0  # looks like a Newton-Raphson update
  }
  k21
}
```

Many thanks,

Shashank

Really excellent instruction. Thanks a lot.

Hello

I would like to build a few customised functions for developing a baseline ARMA(1,1)-GARCH(1,1) stock return volatility model and its variants (e.g. EGARCH, NGARCH, and TGARCH). Somehow my current optimization gives infinite values or leads to results that differ dramatically from those based on the fGarch command (which takes only a few seconds). The draft optimization is also very slow.

My goal is to build the customised function for each of the 4 GARCH variants and then to adapt each function to allow the volatility coefficients to change over time. I fear that the customised function may be wrong in the first place. It would be good if you could provide some guidance given your expertise in the R programming space. At any rate, I would be happy to communicate via some email exchange. Many thanks.

Emily

```r
#===========================================================================#
# This R script offers a suite of functions for estimating the volatility
# dynamics based on the standard ARMA(1,1)-GARCH(1,1) model and its variants.
# The baseline ARMA(1,1) model characterizes the dynamic evolution of the
# return generating process.
# The baseline GARCH(1,1) model depicts the return volatility dynamics
# over time.
# We can extend the GARCH(1,1) volatility model to a variety of alternative
# specifications to capture the potential asymmetry for a better comparison:
# GARCH(1,1), EGARCH(1,1), NGARCH(1,1), and TGARCH(1,1).
#===========================================================================#

options(scipen = 10)

intel <- read.csv(file = "intel.csv")
summary(intel)
raw_data <- as.matrix(intel$logret)

library(fGarch)
garchFit(~ arma(1, 1) + garch(1, 1), data = raw_data, trace = FALSE)

negative_log_likelihood_arma11_garch11 <- function(theta, data)
{
  mean  <- theta[1]
  delta <- theta[2]
  gamma <- theta[3]
  omega <- theta[4]
  alpha <- theta[5]
  beta  <- theta[6]

  r <- ts(data)
  n <- length(r)

  # ARMA(1,1) residuals
  u <- ts(vector(length = n))
  u[1] <- r[1] - mean
  for (t in 2:n)
  {
    u[t] <- r[t] - mean - delta * r[t - 1] - gamma * u[t - 1]
  }

  # GARCH(1,1) conditional variances
  h <- ts(vector(length = n))
  h[1] <- omega / (1 - alpha - beta)
  for (t in 2:n)
  {
    h[t] <- omega + alpha * (u[t - 1]^2) + beta * h[t - 1]
  }

  # return(-sum(dnorm(u[2:n], mean = mean, sd = sqrt(h[2:n]), log = TRUE)))
  return(-sum(-0.5 * log(2 * pi) - 0.5 * log(h[2:n]) - 0.5 * (u[2:n]^2) / h[2:n]))
}

theta0 <- rep(0.01, 6)
negative_log_likelihood_arma11_garch11(theta = theta0, data = raw_data)

start_time <- proc.time()

maximum_likelihood_fit_arma11_garch11 <-
  nlm(negative_log_likelihood_arma11_garch11,
      p = theta0,
      data = raw_data,
      hessian = TRUE,
      iterlim = 500)

# optim(theta0,
#       negative_log_likelihood_arma11_garch11,
#       data = raw_data,
#       method = "L-BFGS-B",
#       upper = c(0.999999999999, 0.999999999999, 0.999999999999,
#                 0.999999999999, 0.999999999999, 0.999999999999),
#       lower = c(-0.999999999999, -0.999999999999, -0.999999999999,
#                 0.000000000001, 0.000000000001, 0.000000000001),
#       hessian = TRUE)

# We record the end time and calculate the total runtime for the above work.
end_time <- proc.time()
runtime  <- end_time - start_time
zhours   <- floor(runtime / 60 / 60)
zminutes <- floor(runtime / 60 - zhours * 60)
zseconds <- floor(runtime - zhours * 60 * 60 - zminutes * 60)
print(paste("It takes ", zhours, " hour(s), ", zminutes, " minute(s), and ",
            zseconds, " second(s) to finish running this R program", sep = ""))

maximum_likelihood_fit_arma11_garch11
sqrt(diag(solve(maximum_likelihood_fit_arma11_garch11$hessian)))
```

Hi,

John, I just want to know how to code the log-likelihood of a beta distribution in R.

Thank you very much!

@lovelygirl, you can find the log likelihood @ wikipedia: http://en.wikipedia.org/wiki/Beta_distribution#Maximum_likelihood Or did I misinterpret the question and you’re having trouble implementing this in R?

Yes, I’m really confused about how to write the log-likelihood of the Beta distribution in R. My lecturer gave me some R code that I need to finish; if I can complete it, I can compute the log-likelihood of a Beta distribution in R, but I’ve had some difficulty completing it. Could you please have a look at it and give me some hints?

The R code the lecturer gave me is:

```r
LL.Beta <- function(par) {
  a <- par[1]
  b <- par[2]
  ll <- # (this is the line I need to complete in R)
  return(ll)
}
```
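For anyone stuck on the same exercise, one standard way to write a Beta log likelihood in R (a sketch, not necessarily what the lecturer intended) leans on `dbeta`’s `log` argument; the data vector `x` below is made up for illustration.

```r
set.seed(1)
x <- rbeta(100, shape1 = 2, shape2 = 3)  # made-up sample in (0, 1)

# Log likelihood of a Beta(a, b) distribution for a data vector x.
beta.log.likelihood <- function(par, x)
{
  a <- par[1]
  b <- par[2]
  sum(dbeta(x, shape1 = a, shape2 = b, log = TRUE))
}

beta.log.likelihood(c(2, 3), x)  # Log likelihood at the true parameters.
```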

Hi there – very interesting post. Would you know how, in R, I can write a log likelihood function and use an EM algorithm to maximize it?