At MSR this week, we had two very good talks on algorithmic methods for tuning the hyperparameters of machine learning models. Selecting appropriate settings for hyperparameters is a constant problem in machine learning, which is somewhat surprising given how much expertise the machine learning community has in optimization theory. I suspect there’s interesting psychological and sociological work to be done exploring why a problem that could be answered using known techniques wasn’t given an appropriate solution earlier.

Thankfully, the take away message of this blog post is that this problem is starting to be understood.

### A Two-Part Optimization Problem

To set up the problem of hyperparameter tuning, it’s helpful to think of the canonical model-tuning and model-testing setup used in machine learning: one splits the original data set into three parts — a training set, a validation set and a test set. If, for example, we plan to use L2-regularized linear regression to solve our problem, we will use the training set and validation set to select a value for the \(\lambda\) hyperparameter that is used to determine the strength of the penalty for large coefficients relative to the penalty for errors in predictions.

With this context in mind, we can set up our problem using five types of variables:

- Features: \(x\)
- Labels: \(y\)
- Parameters: \(\theta\)
- Hyperparameters: \(\lambda\)
- Cost function: \(C\)

We then estimate our parameters and hyperparameters in the following multi-step way so as to minimize our cost function:

\[

\theta_{Train}(\lambda) = \arg \min_{\theta} C(x_{Train}, y_{Train}, \theta, \lambda)

\]

\[

\lambda_{Validation}^{*} = \arg \min_{\lambda} C(x_{Validation}, y_{Validation}, \theta_{Train}(\lambda), \lambda)

\]

The final model performance is assessed using:

\[

C(x_{Test}, y_{Test}, \theta_{Train + Validation}(\lambda_{Validation}^{*}), \lambda_{Validation}^{*})

\]

This two-part minimization problem is similar in many ways to stepwise regression. Like stepwise regression, it feels like an opportunity for clean abstraction is being passed over, but it’s not clear to me (or anyone I think) if there is any analytic way to solve this problem more abstractly.

Instead, the methods we saw presented in our seminars were ways to find better approximations to \(\lambda^{*}\) using less compute time. I’ll go through the traditional approach, then describe the newer and cleaner methods.

### Grid Search

Typically, hyperparameters are set using the Grid Search algorithm, which works as follows:

- For each parameter \(p_{i}\) the researcher selects a list of values to test empirically.
- For each element of the Cartesian product of these values, the computer evaluates the cost function.
- The computer selects the hyperparameter settings from this grid with the lowest cost.

Grid Search is about the worst algorithm one could possibly use, but it’s in widespread use because (A) machine learning experts seem to have less familiarity with derivative-free optimization techniques than with gradient-based optimization methods and (B) machine learning culture does not traditionally think of hyperparameter tuning as a formal optimization problem. Almost certainly (B) is more important than (A).

### Random Search

James Bergstra’s first proposed solution was so entertaining because, absent evidence that it works, it seems almost flippant to even propose: he suggested replacing Grid Search with Random Search. Instead of selecting a grid of values and walking through it exhaustively, you select a value for each hyperparameter independently using some probability distribution. You then evaluate the cost function given these random settings for the hyperparameters.

Since this approach seems like it might be worst than Grid Search, it’s worth pondering why it should work. James’ argument is this: most ML models have *low-effective dimension*, which means that a small number of parameters really affect the cost function and most have almost no effect. Random search lets you explore a greater variety of settings for each parameter, which allows you to find better values for the few parameters that really matter.

I am sure that Paul Meehl would have a field day with this research if he were alive to hear about it.

### Arbitrary Regression Problem

An alternative approach is to view our problem as one of Bayesian Optimization: we have an arbitrary function that we want to minimize which is costly to evaluate and we would like to find a good approximate minimum in a small number of evaluations.

When viewed in this perspective, the natural strategy is to regress the cost function on the settings of the hyperparameters. Because the cost function may depend on the hyperparameters in strange ways, it is wise to use very general purpose regression methods. I’ve recently seen two clever strategies for this, one of which was presented to us at MSR:

- Jasper Snoek, Hugo Larochelle and Ryan Adams suggest that one use a Gaussian Process.
- Among other methods, Frank Hutter, Holger H. Hoos and Kevin Leyton-Brown suggest that one use Random Forests.

From my viewpoint, it seems that any derivative-free optimization method might be worth trying. While I have yet to see it published, I’d like to see more people try the Nelder-Mead method for tuning hyperparameters.

James Bergstra, that you cite for random search, actually does Bayesian optimization using Gaussian Processes. The random search is only the initialization step.

With regards to Nelder-Mead, something to keep in mind is that is is not guaranteed to converge to this optimal point if the objective function is not strictly convex. I tried it for hyper-parameter selection, and I practice what often happens is that they are large areas of the hyper-parameter space that are flat, with additional noise. This completely confuses the Nelder-Mead algorithm that easily ends up stuck in these areas. My experience was that it sometimes works, and sometimes does not. I thus ended up giving up on it, because I realized that I couldn’t trust it, and would run a grid search just to make sure that it hadn’t done something stupid.

If you are interested in pursuing this idea, I have some Python code implementing a Nelder-Mead hyper-parameter optimizer as a drop-in replacement for the grid search in the scikit-learn ( http://scikit-learn.org ) that I can share.

Hi Gael,

James originally did pure Random Search to set the hyperparameters as described in that paper. From his talk and looking through that paper again, I think the GP is only used to measure the effective dimension of the outer-loop function. Random Search is what’s actually used to set the hyperparameters. But please do check the paper and correct me if I’m wrong. James now has newer approaches, but I didn’t find a publication for them in a cursory search, so I didn’t want to put anything about it in public.

People here who’ve tried Nelder-Mead also reported the same issues you do: it works at times and fails at other times. Thanks for offering code. If my schedule frees up sometime soon, I’ll take you up on it.

Nice post!

While James Bergstra’s 2011 JMLR article only uses random search, his 2011 NIPS paper (http://people.fas.harvard.edu/~bergstra/files/pub/11_nips_hyperopt.pdf) also explores two Bayesian optimization methods: 1) based on Gaussian Processes and 2) based on a tree-structured Parzen density estimate of the “good” configurations. That second approach performed somewhat better in James’ experiments.

In our own experiments, we also achieved better performance than GPs with tree-based methods: random forests deal much better with discrete change points, high dimensions, categorical parameters, conditional parameters, etc (see e.g. http://www.cs.ubc.ca/~hutter/papers/10-LION-TB-SPO.pdf or http://www.cs.ubc.ca/~hutter/papers/11-LION5-SMAC.pdf). Very much in line with your post that we first published these methods in the optimization community and are only now using them for hyperparameter optimization …

What I found interesting is indeed the fact that the area of using derivative-free optimization for parameter tuning seems more explored in the optimization community. See the LION, SLS, GECCO or CEC conferencies, with a lot of material on the subject (stochastic local search, parameter tuning or setting, probabilistic optimization, black-box evolutionary computation, hyperheuristics, etc.)

For example, I found the work published by Bartz-Beilestein about Sequential Parameter Optimization in 2005 [1] very close to the paper of Snoek et al. (basically, it combines latin-square sampling and gaussian process in a iterative search method). And it was in 2005, a lot of work as been done around such methods, since.

The derivative-free optimization community also do not seem to use a lot Nelder-Mead search, especially since algorithms like CMA-ES (which behave like a descent algorithm which use sampling to estimate the gradient) have been found a lot more efficient (~10 years ago).

Maybe that the optimization community must put effort on publishing their algorithms in ML tools and that the ML community must take a closer look to optimizatin toolsâ€¦ isn’t that in the scope of the LION conference?

[1] http://www.informatik.uni-trier.de/~ley/db/conf/cec/cec2005.html#Bartz-BeielsteinLP05