Stochastic Hyperparameter Optimization through Hypernetworks

Jonathan Lorraine et al. · 02/26/2018

Machine learning models are often tuned by nesting optimization of model weights inside the optimization of hyperparameters. We give a method to collapse this nested optimization into joint stochastic optimization of weights and hyperparameters. Our process trains a neural network to output approximately optimal weights as a function of hyperparameters. We show that our technique converges to locally optimal weights and hyperparameters for sufficiently large hypernetworks. We compare this method to standard hyperparameter optimization strategies and demonstrate its effectiveness for tuning thousands of hyperparameters.


1 Introduction

Model selection and hyperparameter tuning is a significant bottleneck in designing predictive models. Hyperparameter optimization is a nested optimization: the inner optimization finds model parameters $w$ which minimize the training loss $\mathcal{L}_{\mathrm{Train}}$ given hyperparameters $\lambda$, and the outer optimization chooses $\lambda$ to reduce a validation loss $\mathcal{L}_{\mathrm{Valid}}$:

$$\lambda^{*} = \operatorname*{argmin}_{\lambda}\, \mathcal{L}_{\mathrm{Valid}}\Big(\operatorname*{argmin}_{w}\, \mathcal{L}_{\mathrm{Train}}(w, \lambda)\Big) \tag{1}$$

Figure 1: Left ("Cross-validation"): a typical computational graph for cross-validation, where $\alpha$ are the optimizer parameters and $\lambda$ are training-loss hyperparameters. It is expensive to differentiate through the entire training procedure. Right ("Hyper-training"): the proposed computational graph with our changes in red, where $\phi$ are the hypernetwork parameters. We can cheaply differentiate through the hypernetwork to optimize the validation loss $\mathcal{L}_{\mathrm{Valid}}$ with respect to the hyperparameters $\lambda$. We use $x$, $t$, and $y$ to refer to a data point, its label, and a prediction respectively.

Figure 2: The validation loss of a neural net, estimated by cross-validation (crosses) or by a hypernetwork (line), which outputs the full vector of network weights. Cross-validation requires optimizing from scratch each time; the hypernetwork can be used to evaluate the validation loss cheaply.

Standard practice in machine learning solves (1) by gradient-free optimization of hyperparameters, such as grid search or random search. Each set of hyperparameters is evaluated by re-initializing weights and training the model to completion. Re-training a model from scratch is wasteful if the hyperparameters change by a small amount. Some approaches, such as Hyperband (Li et al., 2016) and freeze-thaw Bayesian optimization (Swersky et al., 2014), resume model training and do not waste this effort. However, these methods often scale poorly beyond 10 to 20 dimensions.

How can we avoid re-training from scratch each time? Note that the optimal parameters $w^{*}$ are a deterministic function of the hyperparameters $\lambda$:

$$w^{*}(\lambda) = \operatorname*{argmin}_{w}\, \mathcal{L}_{\mathrm{Train}}(w, \lambda) \tag{2}$$

We propose to learn this function. Specifically, we train a neural network $w_{\phi}(\lambda)$ that takes hyperparameters $\lambda$ as input and outputs an approximately optimal set of weights.

This formulation provides two major benefits: First, we can train the hypernetwork to convergence using stochastic gradient descent (SGD) without training any particular model to completion. Second, differentiating through the hypernetwork allows us to optimize hyperparameters with stochastic gradient-based optimization.

Figure 3: A visualization of exact (blue) and approximate (red) optimal weights as a function of the hyperparameter. (Left panel: training loss surface; right panel: validation loss surface.) The approximately optimal weights are output by a linear model fit at the current hyperparameter $\hat{\lambda}$. The true optimal hyperparameter is $\lambda^{*}$, while the hyperparameter estimated using approximately optimal weights is nearby at $\lambda^{*}_{\phi}$.

2 Training a network to output optimal weights

How can we teach a hypernetwork (Ha et al., 2016) to output approximately optimal weights to another neural network? The basic idea is that at each iteration, we ask the hypernetwork to output a set of weights given some hyperparameters: $w = w_{\phi}(\lambda)$. Instead of updating the weights $w$ using the training loss gradient $\partial \mathcal{L}_{\mathrm{Train}} / \partial w$, we update the hypernetwork weights $\phi$ using the chain rule: $\frac{\partial \mathcal{L}_{\mathrm{Train}}}{\partial w_{\phi}} \frac{\partial w_{\phi}}{\partial \phi}$. This formulation allows us to optimize the hyperparameters $\lambda$ with the validation loss gradient $\frac{\partial \mathcal{L}_{\mathrm{Valid}}}{\partial w_{\phi}} \frac{\partial w_{\phi}}{\partial \lambda}$. We call this method hyper-training and contrast it with standard training methods.
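The chain-rule update above is straightforward to realize with an automatic-differentiation package. Below is a minimal sketch in Autograd (the library used in this paper's experiments); the linear hypernetwork, the loss, and all names are illustrative assumptions rather than the authors' implementation.

```python
import autograd.numpy as np
from autograd import grad

def hypernet(phi, lam):
    """Linear hypernetwork: maps hyperparameters lam to a weight vector."""
    W, b = phi
    return np.dot(W, lam) + b

def train_loss(w, lam, x, t):
    # squared error of a linear model plus an L2 penalty scaled by exp(lam[0])
    return np.mean((np.dot(x, w) - t) ** 2) + np.exp(lam[0]) * np.sum(w ** 2)

# Differentiate the training loss w.r.t. phi; Autograd composes
# dL/dw with dw/dphi (the chain rule in the text) automatically.
hyper_grad = grad(lambda phi, lam, x, t:
                  train_loss(hypernet(phi, lam), lam, x, t))

# One hypernetwork SGD step would then be:
#   g_W, g_b = hyper_grad(phi, lam, x, t)
#   phi = (phi[0] - alpha * g_W, phi[1] - alpha * g_b)
```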

We call the function $w^{*}(\lambda)$ that outputs optimal weights for given hyperparameters a best-response function. At convergence, we want our hypernetwork $w_{\phi}(\lambda)$ to match the best-response function closely.

Our method is closely related to the concurrent work of Brock et al. (2017), whose SMASH algorithm also approximates the optimal weights as a function of model architectures, to perform a gradient-free search over discrete model structures. Their work focuses on efficiently estimating the performance of a variety of model architectures, while we focus on efficiently exploring continuous spaces of models. We further extend this idea by formulating an algorithm to optimize the hypernetwork and hyperparameters jointly. Joint optimization of parameters and hyperparameters addresses one of the main weaknesses of SMASH: the hypernetwork must be very large to learn approximately optimal weights for many different settings. During joint optimization, the hypernetwork need only model approximately optimal weights in the neighborhood of the current hyperparameters, allowing us to use even linear hypernetworks.

2.1 Advantages of hypernetwork-based optimization

Hyper-training is a method to learn a mapping from hyperparameters to validation loss which is differentiable and cheap to evaluate. We can compare hyper-training to other model-based hyperparameter schemes. Bayesian optimization (e.g., Lizotte (2008); Snoek et al. (2012)) builds a model of the validation loss as a function of hyperparameters, usually using a Gaussian process (e.g., Rasmussen & Williams (2006)) to track uncertainty. This approach has several disadvantages compared to hyper-training.

First, obtaining data for standard Bayesian optimization requires optimizing models from initialization for each set of hyperparameters. In contrast, hyper-training never needs to fully optimize any single model, removing choices such as how many models to train and for how long.

Second, standard Bayesian optimization treats the validation loss as a black-box function: $\bar{\mathcal{L}}(\lambda) = f(\lambda)$. In contrast, hyper-training takes advantage of the fact that the validation loss is a known, differentiable function: $\bar{\mathcal{L}}(\lambda) = \mathcal{L}_{\mathrm{Valid}}(w_{\phi}(\lambda))$. This information removes the need to learn a model of the validation loss. The function can also be evaluated stochastically by sampling points from the validation set.

Third, hyper-training learns a mapping from hyperparameters to optimized weights, which is then substituted into the validation loss. This mapping often has a better inductive bias for predicting validation loss from hyperparameters than learning the loss directly. Also, the hypernetwork learns a continuous best-response, which may be a beneficial prior for finding weights by enforcing stability.

2.2 Limitations of hypernetwork-based optimization

We can apply this method to unconstrained continuous bi-level optimization problems with an inner loss function over inner parameters and an outer loss function over outer parameters. What sort of parameters can be optimized by our approach? Hyperparameters typically fall into two broad categories: 1) optimization hyperparameters, such as learning rates, which affect which locally optimal point is converged to, and 2) regularization or model-architecture parameters, which change the set of locally optimal points. Hyper-training has no internal training loop and hence no inner optimization hyperparameters, so we cannot optimize these. However, we must still choose optimization parameters for the fused optimization loop. In principle, hyper-training can handle discrete hyperparameters, but it does not offer particular advantages there over the optimization of continuous hyperparameters.

Another limitation is that our approach only proposes local changes to the hyperparameters, and does not perform uncertainty-based exploration. Uncertainty could be incorporated into the hypernetwork by using stochastic variational inference as in Blundell et al. (2015); we leave this for future work. Finally, it is not obvious how to choose the training distribution of hyperparameters $p(\lambda)$. If we do not sample a sufficient range of hyperparameters, the estimated gradient of the validation loss with respect to the hyperparameters may be inaccurate. We discuss several approaches to this problem in section 2.4.

A clear difficulty of this approach is that hypernetworks can require many times as many parameters as the original model. For example, training a fully-connected hypernetwork with one hidden layer of $H$ units to output $D$ parameters requires at least $D \times H$ hypernetwork parameters. To address this problem, in section 2.4, we propose an algorithm that only trains a linear model mapping hyperparameters to model weights.
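As a quick size check of the $D \times H$ claim, with hypothetical numbers (the hidden width $H = 50$ is an illustrative assumption):

```python
# Back-of-the-envelope hypernetwork size for a one-hidden-layer
# hypernetwork outputting D elementary weights from H hidden units.
D = 784 * 10 + 10   # e.g., a linear MNIST classifier: 7,850 weights
H = 50              # hypothetical hidden width
print(D * H)        # 392,500 -- the output layer alone is 50x the model
```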

Algorithm 1 Standard cross-validation with stochastic optimization
  for $i = 1, \dots, T$ do
     initialize $w$
     $\lambda^{(i)} = \text{hyperopt}\big(\lambda^{(1:i-1)}, \mathcal{L}^{(1:i-1)}_{\mathrm{Valid}}\big)$
     loop
        $x \sim$ Training data
        $w = w - \alpha \nabla_{w} \mathcal{L}_{\mathrm{Train}}(w, \lambda^{(i)}, x)$
     end loop
     $w^{(i)} = w$, $\mathcal{L}^{(i)}_{\mathrm{Valid}} = \mathcal{L}_{\mathrm{Valid}}(w^{(i)})$
  end for
  $i^{*} = \operatorname{argmin}_{i} \mathcal{L}^{(i)}_{\mathrm{Valid}}$
  Return $\lambda^{(i^{*})}, w^{(i^{*})}$
Algorithm 2 Optimization of hypernetwork, then hyperparameters
  initialize $\phi$
  initialize $\hat{\lambda}$
  loop
     $\lambda \sim p(\lambda)$, $x \sim$ Training data
     $\phi = \phi - \alpha \nabla_{\phi} \mathcal{L}_{\mathrm{Train}}(w_{\phi}(\lambda), \lambda, x)$
  end loop
  loop
     $x \sim$ Validation data
     $\hat{\lambda} = \hat{\lambda} - \beta \nabla_{\hat{\lambda}} \mathcal{L}_{\mathrm{Valid}}(w_{\phi}(\hat{\lambda}), x)$
  end loop
  Return $\hat{\lambda}, w_{\phi}(\hat{\lambda})$
Algorithm 3 Joint optimization of hypernetwork and hyperparameters
  initialize $\phi$
  initialize $\hat{\lambda}$
  loop
     $\lambda \sim p(\lambda \mid \hat{\lambda})$, $x \sim$ Training data
     $\phi = \phi - \alpha \nabla_{\phi} \mathcal{L}_{\mathrm{Train}}(w_{\phi}(\lambda), \lambda, x)$
     $x \sim$ Validation data
     $\hat{\lambda} = \hat{\lambda} - \beta \nabla_{\hat{\lambda}} \mathcal{L}_{\mathrm{Valid}}(w_{\phi}(\hat{\lambda}), x)$
  end loop
  Return $\hat{\lambda}, w_{\phi}(\hat{\lambda})$
A comparison of standard hyperparameter optimization, our first algorithm, and our joint algorithm. Here, hyperopt refers to a generic hyperparameter optimization method. Instead of updating weights $w$ using the training loss gradient $\partial \mathcal{L}_{\mathrm{Train}} / \partial w$, we update hypernetwork weights $\phi$ and hyperparameters $\hat{\lambda}$ using the chain rule: $\frac{\partial \mathcal{L}_{\mathrm{Train}}}{\partial w_{\phi}} \frac{\partial w_{\phi}}{\partial \phi}$ or $\frac{\partial \mathcal{L}_{\mathrm{Valid}}}{\partial w_{\phi}} \frac{\partial w_{\phi}}{\partial \hat{\lambda}}$ respectively. This allows our method to use gradient-based hyperparameter optimization.
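For concreteness, here is a minimal sketch of Algorithm 3 in Autograd. The Gaussian form of $p(\lambda \mid \hat{\lambda})$, the losses, the step sizes, and all names are illustrative assumptions, not the authors' exact implementation; minibatching is omitted for brevity.

```python
import autograd.numpy as np
import autograd.numpy.random as npr
from autograd import grad

def hypernet(phi, lam):
    W, b = phi
    return np.dot(W, lam) + b          # linear map: hyperparameters -> weights

def train_loss(w, lam, x, t):
    # one log-scale decay hyperparameter per weight (lam has w's shape)
    return np.mean((np.dot(x, w) - t) ** 2) + np.sum(np.exp(lam) * w ** 2)

def valid_loss(w, x, t):
    return np.mean((np.dot(x, w) - t) ** 2)   # no regularizer on validation

d_phi = grad(lambda phi, lam, x, t: train_loss(hypernet(phi, lam), lam, x, t))
d_lam = grad(lambda lam, phi, x, t: valid_loss(hypernet(phi, lam), x, t))

def joint_optimize(phi, lam_hat, train, valid, alpha=1e-3, beta=1e-3,
                   sigma=0.01, steps=1000):
    x_tr, t_tr = train
    x_va, t_va = valid
    for _ in range(steps):
        # sample near the current hyperparameters: lam ~ p(lam | lam_hat)
        lam = lam_hat + sigma * npr.randn(*lam_hat.shape)
        g_W, g_b = d_phi(phi, lam, x_tr, t_tr)             # hypernetwork step
        phi = (phi[0] - alpha * g_W, phi[1] - alpha * g_b)
        # hyperparameter step through the locally-fit hypernetwork
        lam_hat = lam_hat - beta * d_lam(lam_hat, phi, x_va, t_va)
    return phi, lam_hat
```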

2.3 Asymptotic convergence properties

Algorithm 2 trains a hypernetwork using SGD, drawing hyperparameters from a fixed distribution $p(\lambda)$. This section proves that Algorithm 2 converges to a local best-response under mild assumptions. In particular, we show that, for a sufficiently large hypernetwork, the choice of $p(\lambda)$ does not matter as long as it has sufficient support. For simplicity, our notation is written as if $\operatorname{argmin}_{w} \mathcal{L}_{\mathrm{Train}}(w, \lambda)$ had a unique solution for each $\lambda$, which is not true in general.

Theorem 2.1.

Sufficiently powerful hypernetworks can learn continuous best-response functions, which minimize the expected training loss for any hyperparameter distribution with convex support.

Proof.

If $w_{\phi}$ is a universal approximator (Hornik, 1991) and the best-response $w^{*}(\lambda)$ is continuous in $\lambda$ (which allows approximation by $w_{\phi}$), then there exist optimal hypernetwork parameters $\phi^{*}$ such that $w_{\phi^{*}}(\lambda) = w^{*}(\lambda)$ for all hyperparameters $\lambda$, and thus $\mathcal{L}_{\mathrm{Train}}(w_{\phi^{*}}(\lambda), \lambda) = \min_{w} \mathcal{L}_{\mathrm{Train}}(w, \lambda)$ for every $\lambda$. In other words, universal-approximator hypernetworks can learn continuous best-responses.

Substituting $\phi^{*}$ into the expected training loss gives
$$\mathbb{E}_{p(\lambda)}\big[\mathcal{L}_{\mathrm{Train}}(w_{\phi^{*}}(\lambda), \lambda)\big] = \mathbb{E}_{p(\lambda)}\Big[\min_{w} \mathcal{L}_{\mathrm{Train}}(w, \lambda)\Big].$$
By Jensen's inequality, $\mathbb{E}_{p(\lambda)}[\min_{w} \mathcal{L}_{\mathrm{Train}}(w, \lambda)] \le \min_{\phi} \mathbb{E}_{p(\lambda)}[\mathcal{L}_{\mathrm{Train}}(w_{\phi}(\lambda), \lambda)]$. To satisfy Jensen's requirements, we take $\min$ as our concave function over the convex vector space of functions $\{\mathcal{L}_{\mathrm{Train}}(w_{\phi}(\lambda), \lambda) : \lambda \in \operatorname{support}(p(\lambda))\}$; to guarantee convexity of this space, we require that $\operatorname{support}(p(\lambda))$ is convex. Thus, $\phi^{*} \in \operatorname{argmin}_{\phi} \mathbb{E}_{p(\lambda)}\big[\mathcal{L}_{\mathrm{Train}}(w_{\phi}(\lambda), \lambda)\big]$. In other words, if the hypernetwork learns the best-response, it simultaneously minimizes the loss for every point in $\operatorname{support}(p(\lambda))$. ∎

Thus, having a universal approximator and a continuous best-response implies $w_{\phi^{*}}(\lambda) = w^{*}(\lambda)$ for all $\lambda \in \operatorname{support}(p(\lambda))$, because $w_{\phi^{*}}$ attains the minimal training loss at every such $\lambda$. Thus, under mild conditions, we will learn a best-response in the support of the hyperparameter distribution. If the best-response is differentiable, then there is a neighborhood about each hyperparameter where the best-response is approximately linear. If the support of the hyperparameter distribution is restricted to such a neighborhood, we can learn the best-response locally with linear regression.

In practice, there are no guarantees that the network is a universal approximator or that optimization converges in finite time. The optimal hypernetwork will also depend on the hyperparameter distribution $p(\lambda)$, not just the support of this distribution. We appeal to experimental results to show that our method is feasible in practice.

Algorithm 4 Simplified joint optimization of hypernetwork and hyperparameters
  initialize $\phi, \hat{\lambda}$
  loop
     $x \sim$ Training data
     $\phi = \phi - \alpha \nabla_{\phi} \mathcal{L}_{\mathrm{Train}}(w_{\phi}(\hat{\lambda}), \hat{\lambda}, x)$
     $x \sim$ Validation data
     $\hat{\lambda} = \hat{\lambda} - \beta \nabla_{\hat{\lambda}} \mathcal{L}_{\mathrm{Valid}}(w_{\phi}(\hat{\lambda}), x)$
  end loop
  Return $\hat{\lambda}, w_{\phi}(\hat{\lambda})$
Algorithm 4 builds on Algorithm 3 by using gradient updates on $\hat{\lambda}$ as a source of noise. This variant does not have asymptotic guarantees, but performs similarly to Algorithm 3 in practice.

2.4 Jointly training parameters and hyperparameters

Figure 4: Training and validation losses of a neural network, estimated by cross-validation (crosses) or a linear hypernetwork (lines). The hypernetwork’s limited capacity makes it only accurate where the hyperparameter distribution puts mass.

Theorem 2.1 holds for any $p(\lambda)$. In practice, we should choose a $p(\lambda)$ that puts most of its mass on promising hyperparameter values, because limited hypernetwork capacity may make it impossible to learn a best-response for all hyperparameters. Thus, we propose Algorithm 3, which only tries to match a best-response locally. We introduce a "current" hyperparameter $\hat{\lambda}$, which is updated at each iteration, and define a conditional hyperparameter distribution $p(\lambda \mid \hat{\lambda})$ which only puts mass close to $\hat{\lambda}$.

Algorithm 3 combines the two phases of Algorithm 2 into one. Instead of first learning a hypernetwork that can output weights for any hyperparameter then optimizing the hyperparameters, Algorithm 3 only samples hyperparameters near the current guess. This means the hypernetwork just has to be trained to estimate good enough weights for a small set of hyperparameters. There is an extra cost of having to re-train the hypernetwork each time we update . The locally-trained hypernetwork can then be used to provide gradients to update the hyperparameters based on validation set performance.

How simple can we make the hypernetwork and still obtain useful gradients for optimizing hyperparameters? Consider the case in our experiments where the hypernetwork is a linear function of the hyperparameters and the conditional hyperparameter distribution is $p(\lambda \mid \hat{\lambda}) = \mathcal{N}(\hat{\lambda}, \sigma \mathbb{I})$ for some small $\sigma$. This hypernetwork learns a tangent hyperplane to the best-response function, and only needs to make minor adjustments at each step if the hyperparameter updates are sufficiently small. We can further restrict the capacity of a linear hypernetwork by factorizing its weights, effectively adding a bottleneck layer with a linear activation and a small number of hidden units.
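A sketch of such a factorized linear hypernetwork, with illustrative shapes (the bottleneck width $K = 10$ is an assumption), together with the parameter-count saving from the bottleneck:

```python
import autograd.numpy as np

def factorized_hypernet(phi, lam):
    # the D-by-N weight matrix is constrained to rank K via a bottleneck
    U, V, b = phi                        # U: (D, K), V: (K, N), b: (D,)
    return np.dot(U, np.dot(V, lam)) + b

D = N = 7850                             # one decay hyperparameter per weight
K = 10                                   # bottleneck width
full = D * N + D                         # ~61.6M params for a full linear map
factored = D * K + K * N + D             # ~165k params with the bottleneck
```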

3 Related Work

Our work is complementary to the SMASH algorithm of Brock et al. (2017); the differences are discussed in section 2.

Model-free approaches

Model-free approaches use only trial-and-error to explore the hyperparameter space. Simple model-free approaches applied to hyperparameter optimization include grid search and random search (Bergstra & Bengio, 2012). Hyperband (Li et al., 2016) combines bandit approaches with modeling the learning procedure.

Model-based approaches

Model-based approaches try to build a surrogate function, which can allow gradient-based optimization or active learning. A common example is Bayesian optimization. Freeze-thaw Bayesian optimization can condition on partially-optimized model performance.

Optimization-based approaches

Another line of related work attempts to directly approximate gradients of the validation loss with respect to hyperparameters. Domke (2012) proposes to differentiate through unrolled optimization to approximate best-responses in nested optimization, and Maclaurin et al. (2015a) differentiate through entire unrolled learning procedures. DrMAD (Fu et al., 2016) approximates differentiating through an unrolled learning procedure to relax memory requirements for deep neural networks. HOAG (Pedregosa, 2016) finds hyperparameter gradients with implicit differentiation, by deriving an implicit equation for the gradient from optimality conditions. Franceschi et al. (2017) study forward- and reverse-mode differentiation for constructing hyperparameter gradients. Also, Feng & Simon (2017) establish conditions under which the validation loss of best-responding weights is almost everywhere smooth, allowing gradient-based training of hyperparameters.

A closely-related procedure to our method is that of Luketina et al. (2016), which also provides an algorithm for stochastic gradient-based optimization of hyperparameters. The convergence of their procedure to local optima of the validation loss depends on approximating the Hessian of the training loss with respect to the parameters by the identity matrix. In contrast, the convergence of our method depends on having a suitably powerful hypernetwork.

Game theory

Best-response functions are extensively studied as a solution concept in discrete and continuous multi-agent games (e.g., Fudenberg & Levine (1998)). Games where learning a best-response can be applied include adversarial training (Goodfellow et al., 2014), or Stackelberg competitions (e.g., Brückner & Scheffer (2011)). For adversarial training, the analog of our method is a discriminator who observes the generator’s parameters.

4 Experiments

Figure 5: Validation and test losses during hyperparameter optimization with a separate weight decay applied to each weight in the model; models with more parameters therefore have more hyperparameters. Top: We solve the high-dimensional hyperparameter optimization problem for a linear model with multiple algorithms. Hypernetwork-based optimization converges faster than the unrolled optimization of Maclaurin et al. (2015a), though to a sub-optimal solution. Bottom: Hyper-training applied to different layer configurations of the model.

In our experiments, we examine the standard example of stochastic gradient-based optimization of neural networks with a weight regularization penalty. Some gradient-based methods explicitly use the gradient of a loss, while others use the gradient of a learned surrogate loss. Hyper-training learns a surrogate best-response function and substitutes it into the true loss. We may contrast our algorithm with methods that learn the loss directly, like Bayesian optimization; gradient-based methods that only handle hyperparameters affecting the training loss; and gradient-based methods that can also handle optimization parameters. The most direct comparison for hyper-training is to gradient-based methods that only handle parameters affecting the training loss, because the other methods apply to a more general set of problems. In this case, we write the training and validation losses as:

$$\mathcal{L}_{\mathrm{Train}}(w, \lambda) = \frac{1}{|\mathrm{Train}|} \sum_{x \in \mathrm{Train}} \mathcal{L}_{\mathrm{Pred}}(x, w) + \mathcal{L}_{\mathrm{Reg}}(w, \lambda), \qquad \mathcal{L}_{\mathrm{Valid}}(w) = \frac{1}{|\mathrm{Valid}|} \sum_{x \in \mathrm{Valid}} \mathcal{L}_{\mathrm{Pred}}(x, w)$$

In all experiments, Algorithm 2 or 3 is used to optimize weights with a mean squared error on MNIST (LeCun et al., 1998), with $\mathcal{L}_{\mathrm{Reg}}$ an $L_2$ weight-decay penalty weighted by $\exp(\lambda)$. The elementary model is linear, with 7,850 weights. All hidden units in the hypernetwork have a ReLU activation (Nair & Hinton, 2010) unless otherwise specified. Autograd (Maclaurin et al., 2015b) was used to compute all derivatives. For each experiment, each minibatch samples a small number of hyperparameter values along with a subset of the training data points. We used Adam for training the hypernetwork and hyperparameters, and ran all experiments on a CPU.
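A sketch of the losses this setup implies, for a linear elementary model; the $\exp(\lambda)$ parameterization (so each decay hyperparameter lives on a log scale) and the array shapes are assumptions consistent with the description above:

```python
import autograd.numpy as np

def train_loss(W, lam, x, t):
    # mean squared error of a linear model on MNIST digits
    pred = np.dot(x, W)                        # x: (B, 784), W: (784, 10)
    mse = np.mean(np.sum((pred - t) ** 2, axis=1))
    # one log-scale weight-decay hyperparameter per weight
    return mse + np.sum(np.exp(lam) * W ** 2)  # lam: (784, 10)

def valid_loss(W, x, t):
    pred = np.dot(x, W)
    return np.mean(np.sum((pred - t) ** 2, axis=1))  # unregularized
```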

4.1 Learning a global best-response

Our first experiment, shown in Figure 2, demonstrates learning a global approximation to a best-response function using Algorithm 2. To make visualization of the regularization loss easier, we use only a few training data points to exacerbate overfitting. We compare the performance of weights output by the hypernetwork to those trained by standard cross-validation (Algorithm 1): elementary weights were randomly re-initialized for each hyperparameter choice and optimized with Adam (Kingma & Ba, 2014).

When training the hypernetwork, hyperparameters were sampled from a broad Gaussian distribution. The hypernetwork has a hidden layer of ReLU units, giving it many times more parameters than the elementary model it outputs.

The minimum of the best-response in Figure 2 is close to the true minimum of the validation loss, showing that a hypernetwork can satisfactorily approximate a global best-response function on small problems.

4.2 Learning a local best-response

Figure 4 shows the same experiment, but using Algorithm 3. The fused updates find a best-response approximation whose minimum matches the true minimum more quickly than in the prior experiment. The conditional hyperparameter distribution $p(\lambda \mid \hat{\lambda})$ is a Gaussian centered at the current hyperparameter value with a small fixed variance. The hypernetwork is a linear model, with far fewer weights than the global hypernetwork. We use the same optimizer as in the global best-response experiment to update both the hypernetwork and the hyperparameters.

Again, the minimum of the best-response at the end of training minimizes the validation loss. This experiment shows that using only a locally-trained linear best-response function can give sufficient gradient information to optimize hyperparameters on a small problem. Algorithm 3 is also less computationally expensive than Algorithms 1 or 2.

Figure 6: Comparing three approaches to inferring the validation loss (panels: GP mean, hyper-training fixed, hyper-training; axes: inferred loss versus true loss, and frequency of inferred minus true loss). First column: a Gaussian process fit on a set of hyperparameters and the corresponding validation losses. Second column: a hypernetwork fit on the same hyperparameters and the corresponding optimized weights. Third column: our proposed method, a hypernetwork trained with stochastically sampled hyperparameters. Top row: the distribution of inferred and true losses; the diagonal black line is where predicted loss equals true loss. Bottom row: the distribution of differences between inferred and true losses. The Gaussian process often under-predicts the true loss, while the hypernetwork trained on the same data tends to over-predict it.

4.3 Hyper-training and unrolled optimization

To compare hyper-training with other gradient-based hyperparameter optimization methods, we train a 1-layer (linear) model with a separate weight decay applied to each weight, giving as many hyperparameters as elementary weights. The conditional hyperparameter distribution and the optimizer for the hypernetwork and hyperparameters are the same as in the prior experiment. We restrict the hypernetwork's capacity by factorizing its weights with a small linear bottleneck of hidden units. Each hypernetwork iteration is several times as expensive as an iteration on the elementary model alone, because the hyperparameter input is as large as the model itself.

Figure 5, top, shows that Algorithm 3 converges more quickly than the unrolled reverse-mode optimization introduced in Maclaurin et al. (2015a) and implemented by Franceschi et al. (2017). Hyper-training reaches sub-optimal solutions because of limitations on how many hyperparameters can be sampled per update, but it overfits the validation data less than unrolling. Standard Bayesian optimization cannot scale to this many hyperparameters. Thus, this experiment shows that Algorithm 3 can efficiently, if partially, optimize thousands of hyperparameters. It may be useful to combine these methods by using a hypernetwork to output initial parameters and then unrolling several optimization steps to differentiate through.

4.4 Optimizing with deeper networks

To see if we can optimize deeper networks with hyper-training, we optimize models with 1, 2, and 3 layers, with a separate weight decay applied to each weight. The conditional hyperparameter distribution and the optimizer for the hypernetwork and hyperparameters are the same as in the prior experiment. We factorize the weights for each model with the same linear-bottleneck hypernetwork structure.

Figure 5, bottom, shows that Algorithm 3 can scale to networks with multiple hidden layers and outperform hand-tuned settings. As we add more layers, the difference between validation loss and test loss decreases, and the model performs better on the validation set. Future work should compare other architectures, like recurrent or convolutional networks. Additionally, note that deeper models achieve lower training (not shown), validation, and test losses, rather than lower training loss accompanied by higher validation or test loss. This indicates that using a weight decay on each weight could act as a prior for generalization, or that hyper-training enforces another useful prior, like the continuity of the best-response.

4.5 Estimating weights versus estimating loss

Our approach differs from Bayesian optimization, which attempts to model the validation loss of optimized weights directly, whereas we learn to predict optimal weights. In this experiment, we disentangle the reason for our method's better performance: is it because of a better inductive bias, or because it sees more hyperparameter settings during optimization?

First, we constructed a hyper-training set: we optimized several sets of weights to completion, given randomly-sampled hyperparameters; this small-sample regime is where we expect Gaussian-process-based approaches to have the most significant advantage. We also constructed a validation set of (optimized weight, hyperparameter) tuples generated in the same manner. We then fit a Gaussian process (GP) regression model with an RBF kernel from scikit-learn to the validation-loss data. A hypernetwork was fit to the same set of hyperparameters and optimized weights. Finally, we optimized another hypernetwork using Algorithm 2, for the same amount of time that building the GP training set took. The two hypernetworks were linear models, trained with the same optimizer parameters as in the high-dimensional hyperparameter optimization.
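A sketch of the GP baseline using scikit-learn, with illustrative array sizes and placeholder data (the sample counts and dimensions are assumptions); the hypernetwork baselines are noted in the comments:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical hyper-training set: each row of `lams` is a sampled
# hyperparameter vector; `losses` holds the validation loss of weights
# optimized to convergence for that row.
lams = np.random.randn(25, 10)           # illustrative sizes only
losses = np.random.rand(25)

gp = GaussianProcessRegressor(kernel=RBF())
gp.fit(lams, losses)                     # model the loss directly (baseline)
pred_mean = gp.predict(np.random.randn(5, 10))

# The hypernetwork baselines instead regress optimized *weights* on the
# same hyperparameters, then evaluate the known validation loss:
#   preds = [valid_loss(hypernet(phi, lam), x_va, t_va) for lam in test_lams]
```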

Figure 6 shows the distribution of prediction errors of these three models. The Gaussian process tends to underestimate the loss. The hypernetwork trained on the same small fixed set of examples tends to overestimate the loss. We conjecture that this is because the hypernetwork produces poor weights in regions where it has little training data; since the hypernetwork must output actual weights to predict the validation loss, poorly-fit regions yield overestimates. Finally, the hypernetwork trained with Algorithm 2 produces errors tightly centered around zero. The main takeaway is that a hypernetwork can learn a more accurate surrogate function than a GP for an equal compute budget, because it views (noisy) evaluations of many more points.

5 Conclusions and Future Work

In this paper, we addressed the question of tuning hyperparameters using gradient-based optimization, by replacing the training optimization loop with a differentiable hypernetwork. We gave a theoretical justification that sufficiently large networks will learn the best-response for all hyperparameters viewed in training. We also presented a simpler and more scalable method that jointly optimizes both hyperparameters and hypernetwork weights, allowing our method to work with manageably-sized hypernetworks.

Experimentally, we showed that hypernetworks can provide a better inductive bias for hyperparameter optimization than Gaussian processes fit directly to the validation loss.

There are many directions to extend the proposed methods. For instance, the hypernetwork could be composed with several iterations of optimization, as an easily-differentiable fine-tuning step. Or, hypernetworks could be incorporated into meta-learning schemes, such as MAML (Finn et al., 2017), which finds weights that perform a variety of tasks after unrolling gradient descent.

We also note that the prospect of optimizing thousands of hyperparameters raises the question of hyper-regularization, or regularization of hyperparameters.

Acknowledgments

We thank Matthew MacKay, Dougal Maclaurin, Daniel Flam-Shepherd, Daniel Roy, and Jack Klys for helpful discussions.
