Simulated Annealing in Julia

Building Optimization Functions for Julia

In hopes of adding enough statistical functionality to Julia to make it usable for my day-to-day modeling projects, I’ve written a very basic implementation of the simulated annealing (SA) algorithm, which I’ve placed in the same JuliaVsR GitHub repository I used for the code in my previous post about Julia. For easier reading, my current code for SA is shown below:

The Simulated Annealing Algorithm

# simulated_annealing()
# Arguments:
# * cost: Function from states to the real numbers. Often called an energy function, but this algorithm works for both positive and negative costs.
# * s0: The initial state of the system.
# * neighbor: Function from states to states. Produces what the Metropolis algorithm would call a proposal.
# * temperature: Function specifying the temperature at time i.
# * iterations: How many iterations of the algorithm should be run? This is the only termination condition.
# * keep_best: Do we return the best state visited or the last state visited? (Should default to true.)
# * trace: Do we show a trace of the system's evolution?
 
function simulated_annealing(cost,
                             s0,
                             neighbor,
                             temperature,
                             iterations,
                             keep_best,
                             trace)
 
  # Set our current state to the specified initial state.
  s = s0
 
  # Set the best state we've seen to the initial state.
  best_s = s0
 
  # We always perform a fixed number of iterations.
  for i = 1:iterations
    t = temperature(i)
    s_n = neighbor(s)
    if trace println("$i: s = $s") end
    if trace println("$i: s_n = $s_n") end
    y = cost(s)
    y_n = cost(s_n)
    if trace println("$i: y = $y") end
    if trace println("$i: y_n = $y_n") end
    if y_n <= y
      s = s_n
      if trace println("Accepted") end
    else
      p = exp(- ((y_n - y) / t))
      if trace println("$i: p = $p") end
      if rand() <= p
        s = s_n
        if trace println("Accepted") end
      else
        if trace println("Rejected") end
      end
    end
    if trace println() end
    if cost(s) < cost(best_s)
      best_s = s
    end
  end
 
  if keep_best
    best_s
  else
    s
  end
end

I’ve tried to implement the algorithm as it was presented by Bertsimas and Tsitsiklis in their 1993 review paper in Statistical Science. The major differences between my implementation and their description of the algorithm are that (1) I’ve made it possible to keep the best solution found during the search rather than always use the last solution found and (2) I’ve made no effort to select a value for their d parameter carefully: I’ve simply set it to 1 for all of my examples, though my code will let you specify a rule for setting the temperature of the annealer at time t other than the 1 / log(t) cooling scheme I’ve been using. (In fact, the code currently forces you to select a cooling scheme, since there are no default arguments in Julia yet.)
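
For reference, a cooling schedule along the lines of the 1 / log(t) scheme I’ve been using could be written as follows. This is only a minimal sketch with d fixed at 1; the log_temperature function loaded in the examples below is meant to behave this way, though its exact definition may differ.

# A sketch of the logarithmic cooling schedule described above.
# With d = 1, the temperature at iteration i is 1 / log(i); at i = 1 this
# gives an infinite temperature, so the first proposal is always accepted.
function log_temperature(i)
  1 / log(i)
end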

I chose simulated annealing as my first optimization algorithm to implement in Julia because it’s a natural relative of the Metropolis algorithm that I used in the previous post. Indeed, coding up an implementation of SA made me appreciate the fact that the Metropolis algorithm as used in Bayesian statistics can be considered a special case of the SA algorithm in which the temperature is always 1 and in which the cost function for a state with posterior probability p is -log(p).
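
As a rough sketch of that correspondence (using a toy standard-normal target and a made-up proposal, not anything from the repository), pinning the temperature at 1 and passing the negative log-density as the cost turns each accept/reject step into the Metropolis rule, since exp(-(y_n - y)) is then just the ratio of the two states' posterior densities:

# Sketch: the Metropolis algorithm as SA with a constant temperature of 1.
# The target is a toy standard normal, so the cost is the negative log
# density up to an additive constant.
log_density(s) = -0.5 * s[1]^2

last_state = simulated_annealing(s -> -log_density(s),
                                 [0.0],
                                 s -> [s[1] + 2 * rand() - 1],  # uniform proposal on [s - 1, s + 1]
                                 i -> 1,                        # temperature fixed at 1
                                 10000,
                                 false,                         # keep the last state, as Metropolis would
                                 false)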

Coding up the SA algorithm for myself also made me realize why it’s important that it uses an additive comparison of cost values rather than a ratio (as in the Metropolis algorithm): the ratio goes haywire when the cost function can take on both positive and negative values (which, of course, doesn’t matter for Bayesian methods, because probabilities are strictly non-negative). I discovered this when I initially tried to code up SA from my inaccurate memory without first consulting the literature and found that I couldn’t get a ratio-based algorithm to work no matter how many times I changed the cooling schedule.
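
A small numerical example makes the problem concrete. Suppose the current state has cost -2 and the proposal has cost 1: a ratio of the two costs is negative, so it can’t serve as an acceptance probability, while the additive rule still produces a number between 0 and 1:

# Sketch: signed costs break a ratio-based acceptance rule.
y   = -2.0  # cost of the current state
y_n =  1.0  # cost of the proposed state
t   =  1.0  # temperature

println(y_n / y)              # -0.5: not a usable acceptance probability
println(exp(-(y_n - y) / t))  # ~0.0498: a valid acceptance probability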

To test the SA implementation I’ve written, I’ve written two test cases that attempt to minimize the Rosenbrock and Himmelblau functions, which I found listed as examples of hard-to-minimize functions in the Wikipedia description of the Nelder-Mead method. You can see those two examples below. In addition, I’ve used R to generate plots showing how the SA algorithm behaves under repeated application to the same optimization problem; in these plots, I’ve used a heatmap to show the cost function’s value at each (x, y) position, colored crosshairs to indicate the position of a true minimum of the function in question, and red dots to indicate the purported solutions found by my implementation of SA.

Finding the Minimum of the Rosenbrock Function

# Find a solution of the Rosenbrock function using SA.
load("simulated_annealing.jl")
load("../rng.jl")
 
function rosenbrock(x, y)
  (1 - x)^2 + 100(y - x^2)^2
end
 
function neighbors(z)
  [rand_uniform(z[1] - 1, z[1] + 1), rand_uniform(z[2] - 1, z[2] + 1)]
end
 
srand(1)
 
solution = simulated_annealing(z -> rosenbrock(z[1], z[2]),
                               [0, 0],
                               neighbors,
                               log_temperature,
                               10000,
                               true,
                               false)
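
Since the Rosenbrock function has a unique minimum at (1, 1), where its value is 0, a quick sanity check is to print the returned solution and its cost:

# The true minimum is at (1, 1) with cost 0, so the solution should be
# near (1, 1) and its cost should be near 0.
println(solution)
println(rosenbrock(solution[1], solution[2]))
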
[Figure rosenbrock.png: heatmap of the Rosenbrock function's value at each (x, y) position, with colored crosshairs at the true minimum and red dots marking the solutions found by SA]

Finding the Minima of the Himmelblau Function

# Find a solution of the Himmelblau function using SA.
load("simulated_annealing.jl")
load("../rng.jl")
 
function himmelblau(x, y)
  (x^2 + y - 11)^2 + (x + y^2 - 7)^2
end
 
function neighbors(z)
  [rand_uniform(z[1] - 1, z[1] + 1), rand_uniform(z[2] - 1, z[2] + 1)]
end
 
srand(1)
 
solution = simulated_annealing(z -> himmelblau(z[1], z[2]),
                               [0, 0],
                               neighbors,
                               log_temperature,
                               10000,
                               true,
                               false)
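
Himmelblau’s function has four minima, all with value 0 (one of them sits exactly at (3, 2)), so the analogous check is to confirm that the cost at the returned solution is close to 0:

# All four minima of Himmelblau's function have cost 0, so the cost at
# the returned solution should be close to 0.
println(solution)
println(himmelblau(solution[1], solution[2]))
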
[Figure himmelbrau.png: heatmap of the Himmelblau function's value at each (x, y) position, with colored crosshairs at the true minima and red dots marking the solutions found by SA]

Moving Forward

Now that I’ve got a form of SA working, I’m interested in coding up a suite of optimization functions for Julia so that I can start to do maximum likelihood estimation in pure Julia. Once that’s possible, I can use Julia to do real science, e.g. when I need to fit simple models for which finding the MLE is appropriate. (I will leave the development of cleaner statistical functions for special cases of maximum likelihood estimation to more capable people, like Douglas Bates, who has already produced some GLM code.)

At present my code is meant simply to demonstrate how one could write an implementation of simulated annealing in Julia. I’m sure that the code can be more efficient and I suspect that I’ve violated some of the idioms of the language. In addition, I’d much prefer that this function use default values for many of the arguments, as there is no reason that an end-user needs to be concerned with finding the best cooling schedule if SA seems to work out of the box on their problem with the cooling schedule I’ve been using.

With those disclaimers about my code in place, I’d like to think that I haven’t made any mathematical errors and that this function can be used by others. So, I’d ask that those interested please tear apart my code so that I can make it usable as a general purpose function for optimization in Julia. Alternatively, please tell me that there’s no need for a pure Julia implementation of SA because, for example, there are nice C libraries that would be much easier to link to than to re-implement.

With an implementation of SA in place, I’ll probably start working on implementing L-BFGS-B soon, which is the other optimization algorithm I use often in R. (To be honest, I use L-BFGS-B almost exclusively, but SA was much easier to implement.)

Incidentally, this code base demonstrates how I view the relationship between R and Julia: Julia is a beautiful new language that is still missing many important pieces. We can all work together to rebuild the best pieces of R that are missing from Julia. While we’re working on improving Julia, we’ll need to keep using R to handle things like visualization of our results. For this post, I turned back to ggplot2 for all of the graphics generation.

7 responses to “Simulated Annealing in Julia”

  1. Ben Bolker

    This is kind of cool, but I actually don’t think that rewriting optimization algorithms in Julia is the best use of effort for the open-source community. (Of course, there’s nothing stopping anyone from doing it because they enjoy it!) Rather, I think that work toward polishing and improving the existing open-source algorithms, and making sure there are good interfaces/glue to make them available in as many languages as possible, is the way forward … I may rant further about this on my own (very sporadic) blog … That said, simulated annealing and Metropolis-Hastings are probably the best ones to start with, because they generally require a lot of problem-specific tuning and are ones where you would be likely to want to open the black box. They’re also computationally intensive …

  2. Dennis

    Gah, please ignore Ben Bolker and post more like this, which helped me learn both Julia and simulated annealing.

  3. Andrei Formiga

    I believe default arguments can be simulated using the standard multiple dispatch in the language. At least this works:

    function f(x, y, z)
      (x + y) * z
    end

    f(x, y) = f(x, y, 1)

    julia> f(3, 4)
    7

    julia> f(3, 4, 2)
    14

    For the d parameter, I believe convergence is only guaranteed if d is greater than a constant which is problem-dependent, and the convergence condition is slightly different for SA on continuous state spaces. Also people tend to use geometric cooling schedules which are faster, but have no convergence guarantees (that I know of). The function could have defaults for logarithmic and geometric schedules.

  4. Andrei Formiga

    I meant there could be functions for both default schedules. Also the default argument in the example I’ve shown will work ok only for the last argument, but by declaring argument types it should work for arbitrary arguments, I think.

  5. John Myles White

    Hi Andrei,

    You’re right that the logarithmic cooling scheme requires that a `d` parameter be set. Sadly, my reading of the literature leads to the conclusion that `d`’s proper value cannot be known without already knowing the minimum of the function. Of course, in practice one can try to get around that by searching through values. I just chose a simple starting point, while allowing people infinite flexibility to use other cooling schemes.

    You’re also right that defaults could be handled using multiple dispatch. I’m actually waiting for Harlan Harris’s Options framework to be settled on before starting down that path or any other.

    I’d like to know more about the geometric cooling scheme. Please submit a patch on GitHub for optim.jl if you have a moment.

  6. Andrei Formiga

    Indeed, knowing the proper value would require knowing a measure of distance from the global minimum, so it can’t be calculated (not without doing the minimization anyway). So it’s something users of the SA function must be aware of anyway, and experiment with it if necessary.

    About the geometric cooling schedule, it’s just T(n) = α^n × T_0 for some initial constant T_0, and α between 0 and 1. Sometimes people tune α initially to get a high acceptance rate in the Metropolis algorithm, but I don’t know much about the details. I’ve also seen (somewhat) recent references about using a fixed low temperature instead of cooling, but the chosen temperature value must be tuned to the specific problem being considered, so it’s hard to make into a general optimization function.

  7. John Myles White

    Thanks for the details of the geometric cooling schedule. I’ll throw that into the code.