Code for Stochastic Hyperparameter Optimization through Hypernetworks
Machine learning models are often tuned by nesting optimization of model weights inside the optimization of hyperparameters. We give a method to collapse this nested optimization into joint stochastic optimization of weights and hyperparameters. Our process trains a neural network to output approximately optimal weights as a function of hyperparameters. We show that our technique converges to locally optimal weights and hyperparameters for sufficiently large hypernetworks. We compare this method to standard hyperparameter optimization strategies and demonstrate its effectiveness for tuning thousands of hyperparameters.
Model selection and hyperparameter tuning is a significant bottleneck in designing predictive models. Hyperparameter optimization is a nested optimization: The inner optimization finds model parameters w which minimize the training loss L_Train(w, λ) given hyperparameters λ. The outer optimization chooses λ to reduce a validation loss L_Valid(w):

λ* = argmin_λ L_Valid( argmin_w L_Train(w, λ) )     (1)
Standard practice in machine learning solves (1) by gradient-free optimization of hyperparameters, such as grid search or random search. Each set of hyperparameters is evaluated by re-initializing weights and training the model to completion. Re-training a model from scratch is wasteful if the hyperparameters change by a small amount. Some approaches, such as Hyperband (Li et al., 2016) and freeze-thaw Bayesian optimization (Swersky et al., 2014), resume model training and do not waste this effort. However, these methods often scale poorly beyond 10 to 20 dimensions.
How can we avoid re-training from scratch each time? Note that the optimal parameters w* are a deterministic function of the hyperparameters λ:

w*(λ) = argmin_w L_Train(w, λ)
We propose to learn this function. Specifically, we train a neural network w_φ(λ) that takes hyperparameters λ as input, and outputs an approximately optimal set of weights.
This formulation provides two major benefits: First, we can train the hypernetwork to convergence using stochastic gradient descent (SGD) without training any particular model to completion. Second, differentiating through the hypernetwork allows us to optimize hyperparameters with stochastic gradient-based optimization.
How can we teach a hypernetwork (Ha et al., 2016) to output approximately optimal weights to another neural network? The basic idea is that at each iteration, we ask a hypernetwork to output a set of weights given some hyperparameters: w = w_φ(λ). Instead of updating the weights w using the training loss gradient ∂L_Train(w)/∂w, we update the hypernetwork weights φ using the chain rule: ∂L_Train(w_φ(λ))/∂φ = ∂L_Train(w)/∂w · ∂w_φ(λ)/∂φ. This formulation allows us to optimize the hyperparameters λ with the validation loss gradient ∂L_Valid(w_φ(λ))/∂λ. We call this method hyper-training and contrast it with standard training methods.
We call the function that outputs optimal weights for hyperparameters a best-response function. At convergence, we want our hypernetwork to match the best-response function closely.
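The two stages above can be sketched on a toy problem: a linear model with a single log-weight-decay hyperparameter λ and a linear hypernetwork w_φ(λ) = aλ + b. Everything below (the data, step sizes, iteration counts, and the sampling distribution p(λ)) is an illustrative assumption rather than the paper's actual setup, and gradients are written out by hand instead of with Autograd.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a linear model with a scalar log-weight-decay hyperparameter lam.
D = 5
X_train, X_valid = rng.normal(size=(10, D)), rng.normal(size=(100, D))
w_true = rng.normal(size=D)
y_train = X_train @ w_true + rng.normal(scale=0.5, size=10)
y_valid = X_valid @ w_true

def train_loss_grad(w, lam):
    # d/dw [ mean((Xw - y)^2) + exp(lam) * ||w||^2 ]
    return (2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
            + 2 * np.exp(lam) * w)

def valid_loss(w):
    return np.mean((X_valid @ w - y_valid) ** 2)

# Linear hypernetwork: w_phi(lam) = a * lam + b, with phi = (a, b).
a, b = np.zeros(D), np.zeros(D)

# Stage 1: train the hypernetwork on hyperparameters drawn from p(lam)
# (here a truncated Gaussian) without fully training any one model.
for _ in range(2000):
    lam = np.clip(rng.normal(), -2.0, 2.0)   # lam ~ p(lam)
    w = a * lam + b                          # hypernetwork output
    g = train_loss_grad(w, lam)              # dL_Train/dw
    a -= 0.01 * g * lam                      # chain rule: dw_phi/da = lam
    b -= 0.01 * g                            # chain rule: dw_phi/db = 1

# Stage 2: optimize lam by descending the validation loss through w_phi.
lam_hat = 2.0
loss_at_init = valid_loss(a * lam_hat + b)
for _ in range(500):
    w = a * lam_hat + b
    dL_dw = 2 * X_valid.T @ (X_valid @ w - y_valid) / len(y_valid)
    lam_hat -= 0.05 * dL_dw @ a              # dL_Valid/dlam = dL_Valid/dw . dw_phi/dlam
loss_at_end = valid_loss(a * lam_hat + b)
```

Note that the second stage never touches the training set: hyperparameter gradients come entirely from differentiating the hypernetwork's output with respect to λ.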
Our method is closely related to the concurrent work of Brock et al. (2017), whose SMASH algorithm also approximates the optimal weights as a function of model architectures, to perform a gradient-free search over discrete model structures. Their work focuses on efficiently estimating the performance of a variety of model architectures, while we focus on efficiently exploring continuous spaces of models. We further extend this idea by formulating an algorithm to optimize the hypernetwork and hyperparameters jointly. Joint optimization of parameters and hyperparameters addresses one of the main weaknesses of SMASH, which is that the hypernetwork must be very large to learn approximately optimal weights for many different settings. During joint optimization, the hypernetwork need only model approximately optimal weights for the neighborhood around the current hyperparameters, allowing us to use even linear hypernetworks.
Hyper-training is a method to learn a mapping from hyperparameters to validation loss which is differentiable and cheap to evaluate. We can compare hyper-training to other model-based hyperparameter schemes. Bayesian optimization (e.g., Lizotte (2008); Snoek et al. (2012)) builds a model of the validation loss as a function of hyperparameters, usually using a Gaussian process (e.g., Rasmussen & Williams (2006)) to track uncertainty. This approach has several disadvantages compared to hyper-training.
First, obtaining data for standard Bayesian optimization requires optimizing models from initialization for each set of hyperparameters. In contrast, hyper-training never needs to fully optimize any one model, removing choices like how many models to train and for how long.
Second, standard Bayesian optimization treats the validation loss as a black-box function of the hyperparameters: L(λ) = L_Valid(w*(λ)). In contrast, hyper-training takes advantage of the fact that the validation loss is a known, differentiable function: L_Valid(w_φ(λ)). This information removes the need to learn a model of the validation loss. This function can also be evaluated stochastically by sampling points from the validation set.
Hyper-training has the benefit of learning a mapping from hyperparameters to optimized weights, which is then substituted into the validation loss. This often provides a better inductive bias than directly learning a mapping from hyperparameters to validation loss. Also, the hypernetwork learns continuous best-responses, which may be a beneficial prior for finding weights by enforcing stability.
We can apply this method to unconstrained continuous bi-level optimization problems with an inner loss function over inner parameters, and an outer loss function over outer parameters. What sort of parameters can be optimized by our approach? Hyperparameters typically fall into two broad categories: 1) optimization hyperparameters, such as learning rates, which affect the choice of locally optimal point converged to, and 2) regularization or model architecture parameters, which change the set of locally optimal points. Hyper-training has no internal training loop, so it cannot optimize optimization hyperparameters. However, we must still choose optimization parameters for the fused optimization loop. In principle, hyper-training can handle discrete hyperparameters, but its advantages are particular to continuous hyperparameters.
Another limitation is that our approach only proposes making local changes to the hyperparameters, and does not do uncertainty-based exploration. Uncertainty can be incorporated into the hypernetwork by using stochastic variational inference as in Blundell et al. (2015), and we leave this for future work. Finally, it is not obvious how to choose the training distribution of hyperparameters, p(λ). If we do not sample a sufficient range of hyperparameters, the implicit estimated gradient of the validation loss w.r.t. the hyperparameters may be inaccurate. We discuss several approaches to this problem in section 2.4.
A clear difficulty of this approach is that hypernetworks can require several times as many parameters as the original model. For example, training a fully-connected hypernetwork with one hidden layer of H units to output D parameters requires at least H × D hypernetwork parameters. To address this problem, in section 2.4, we propose an algorithm that only trains a linear model mapping hyperparameters to model weights.
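To make the scale concrete, here is a back-of-envelope count; the sizes (100 hidden units, a 784 → 10 linear elementary model) are illustrative assumptions, not figures from the paper.

```python
# Parameter count for a fully-connected hypernetwork with one hidden layer
# of H units emitting D elementary-model weights from n_hyper hyperparameters.
def hypernet_param_count(n_hyper, H, D):
    # input->hidden weights and biases, plus hidden->output weights and biases
    return n_hyper * H + H + H * D + D

D = 7850                               # e.g. weights of a 784 -> 10 linear classifier
count = hypernet_param_count(n_hyper=1, H=100, D=D)
print(count)                           # dominated by the H * D output layer
```

Even with a single scalar hyperparameter, the hidden-to-output layer alone costs H × D parameters, roughly a hundred times the size of the elementary model in this example.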
Algorithm 2 trains a hypernetwork using SGD, drawing hyperparameters from a fixed distribution p(λ). This section proves that Algorithm 2 converges to a local best-response under mild assumptions. In particular, we show that, for a sufficiently large hypernetwork, the choice of p(λ) does not matter as long as it has sufficient support. For simplicity, our notation treats w*(λ) as if the inner minimization had a unique solution, which is not true in general.
Sufficiently powerful hypernetworks can learn continuous best-response functions, which minimize the expected loss for all hyperparameter distributions with convex support.
If w_φ is a universal approximator (Hornik, 1991) and the best-response w*(λ) is continuous in λ (which allows approximation by w_φ), then there exist optimal hypernetwork parameters φ* such that for all hyperparameters λ, w_φ*(λ) = w*(λ). Thus, L_Train(w_φ*(λ), λ) = min_w L_Train(w, λ). In other words, universal approximator hypernetworks can learn continuous best-responses.
Substituting w_φ*(λ) into the training loss gives E_{p(λ)}[ L_Train(w_φ*(λ), λ) ] = E_{p(λ)}[ min_w L_Train(w, λ) ]. By Jensen's inequality, for any φ, E_{p(λ)}[ L_Train(w_φ(λ), λ) ] ≥ E_{p(λ)}[ min_w L_Train(w, λ) ]. To satisfy Jensen's requirements, we take the minimum as our convex function on the convex vector space of functions; to guarantee convexity of this space we require that the support of p(λ) is convex and that L_Train is continuous. Thus, φ* ∈ argmin_φ E_{p(λ)}[ L_Train(w_φ(λ), λ) ]. In other words, if the hypernetwork learns the best-response it will simultaneously minimize the loss for every point in the support of p(λ). ∎
Thus, having a universal approximator and a continuous best-response implies L_Valid(w_φ*(λ)) = L_Valid(w*(λ)) for all λ in the support of p(λ), because w_φ*(λ) = w*(λ). Thus, under mild conditions, we will learn a best-response in the support of the hyperparameter distribution. If the best-response is differentiable, then there is a neighborhood about each hyperparameter where the best-response is approximately linear. If the support of the hyperparameter distribution is this neighborhood, then we can learn the best-response locally with linear regression.
In practice, there are no guarantees about the network being a universal approximator or the finite-time convergence of optimization. The optimal hypernetwork will depend on the hyperparameter distribution p(λ), not just the support of this distribution. We appeal to experimental results to show that our method is feasible in practice.
Theorem 2.1 holds for any p(λ). In practice, we should choose a p(λ) that puts most of its mass on promising hyperparameter values, because it may not be possible to learn a best-response for all hyperparameters due to limited hypernetwork capacity. Thus, we propose Algorithm 3, which only tries to match a best-response locally. We introduce a "current" hyperparameter λ̂, which is updated each iteration. We define a conditional hyperparameter distribution, p(λ|λ̂), which only puts mass close to λ̂.
Algorithm 3 combines the two phases of Algorithm 2 into one. Instead of first learning a hypernetwork that can output weights for any hyperparameter then optimizing the hyperparameters, Algorithm 3 only samples hyperparameters near the current guess. This means the hypernetwork just has to be trained to estimate good enough weights for a small set of hyperparameters. There is an extra cost of having to re-train the hypernetwork each time we update . The locally-trained hypernetwork can then be used to provide gradients to update the hyperparameters based on validation set performance.
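A minimal sketch of this fused loop, under illustrative assumptions (a toy linear model, a scalar log-weight-decay hyperparameter, hand-derived gradients, and arbitrary step sizes; none of these settings are from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy ridge-regression setup: few noisy training points, clean validation set.
D = 5
X_tr, X_va = rng.normal(size=(10, D)), rng.normal(size=(100, D))
w_true = rng.normal(size=D)
y_tr = X_tr @ w_true + rng.normal(scale=0.5, size=10)
y_va = X_va @ w_true

def valid_loss(w):
    return np.mean((X_va @ w - y_va) ** 2)

a, b = np.zeros(D), np.zeros(D)   # linear hypernetwork w_phi(lam) = a*lam + b
lam_hat, sigma = 2.0, 0.1         # current hyperparameter; local sampling scale
loss_before = valid_loss(a * lam_hat + b)

for _ in range(3000):
    # (i) hypernetwork step at a hyperparameter sampled near the current guess
    lam = lam_hat + sigma * rng.normal()               # lam ~ p(lam | lam_hat)
    w = a * lam + b
    g = (2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
         + 2 * np.exp(lam) * w)                        # dL_Train/dw
    a -= 0.01 * g * lam
    b -= 0.01 * g
    # (ii) hyperparameter step through the locally-fit hypernetwork
    w = a * lam_hat + b
    dL_dw = 2 * X_va.T @ (X_va @ w - y_va) / len(y_va)
    lam_hat -= 0.01 * dL_dw @ a                        # dL_Valid/dlam

loss_after = valid_loss(a * lam_hat + b)
```

Because λ is only sampled in a small neighborhood of λ̂, the hypernetwork only needs to fit the best-response locally, which is why even a linear hypernetwork suffices here.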
How simple can we make the hypernetwork, and still obtain useful gradients to optimize hyperparameters? Consider the case in our experiments where the hypernetwork is a linear function of the hyperparameters and the conditional hyperparameter distribution is p(λ|λ̂) = N(λ̂, σI) for some small σ. This hypernetwork learns a tangent hyperplane to a best-response function and only needs to make minor adjustments at each step if the hyperparameter updates are sufficiently small. We can further restrict the capacity of a linear hypernetwork by factorizing its weights, effectively adding a bottleneck layer with a linear activation and a small number of hidden units.
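The factorized linear hypernetwork can be sketched as follows; the sizes (7,850 hyperparameters and elementary weights, a 10-unit bottleneck) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: one weight-decay hyperparameter per elementary weight,
# and a K-unit linear bottleneck.
n_hyper, n_weights, K = 7850, 7850, 10

full_params = n_weights * n_hyper + n_weights          # dense linear map + bias
fact_params = n_weights * K + K * n_hyper + n_weights  # factorized map + bias

W1 = 0.01 * rng.normal(size=(K, n_hyper))      # hyperparameters -> bottleneck
W2 = 0.01 * rng.normal(size=(n_weights, K))    # bottleneck -> weights
bias = np.zeros(n_weights)

def w_phi(lam):
    # rank-K linear hypernetwork: w = W2 @ (W1 @ lam) + bias
    return W2 @ (W1 @ lam) + bias

w = w_phi(rng.normal(size=n_hyper))
print(full_params, fact_params)
```

The factorization replaces an n_weights × n_hyper matrix with two thin factors, shrinking the parameter count by orders of magnitude at the cost of restricting the hypernetwork to a rank-K linear map.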
Our work is complementary to the SMASH algorithm of Brock et al. (2017); our differences are discussed in section 2.
Model-free approaches use only trial-and-error to explore the hyperparameter space. Simple model-free approaches applied to hyperparameter optimization include grid search and random search (Bergstra & Bengio, 2012). Hyperband (Li et al., 2016) combines bandit approaches with modeling the learning procedure.
Model-based approaches try to build a surrogate function, which can allow gradient-based optimization or active learning. A common example is Bayesian optimization. Freeze-thaw Bayesian optimization can condition on partially-optimized model performance.
Another line of related work attempts to directly approximate gradients of the validation loss with respect to hyperparameters. Domke (2012) proposes to differentiate through unrolled optimization to approximate best-responses in nested optimization and Maclaurin et al. (2015a) differentiate through entire unrolled learning procedures. DrMAD (Fu et al., 2016) approximates differentiating through an unrolled learning procedure to relax memory requirements for deep neural networks. HOAG (Pedregosa, 2016) finds hyperparameter gradients with implicit differentiation by deriving an implicit equation for the gradient with optimality conditions. Franceschi et al. (2017) study forward and reverse-mode differentiation for constructing hyperparameter gradients. Also, Feng & Simon (2017) establish conditions where the validation loss of best-responding weights are almost everywhere smooth, allowing gradient-based training of hyperparameters.
A closely related procedure is that of Luketina et al. (2016), which also provides an algorithm for stochastic gradient-based optimization of hyperparameters. The convergence of their procedure to local optima of the validation loss depends on approximating the Hessian of the training loss with respect to the parameters by the identity matrix. In contrast, the convergence of our method depends on having a suitably powerful hypernetwork.
Best-response functions are extensively studied as a solution concept in discrete and continuous multi-agent games (e.g., Fudenberg & Levine (1998)). Games where learning a best-response can be applied include adversarial training (Goodfellow et al., 2014), or Stackelberg competitions (e.g., Brückner & Scheffer (2011)). For adversarial training, the analog of our method is a discriminator who observes the generator’s parameters.
In our experiments, we examine the standard example of stochastic gradient-based optimization of neural networks, with a weight regularization penalty. Some gradient-based methods explicitly use the gradient of a loss, while others use the gradient of a learned surrogate loss. Hyper-training learns and substitutes a surrogate best-response function into a real loss. We may contrast our algorithm with methods that learn the loss, like Bayesian optimization; gradient-based methods that only handle hyperparameters affecting the training loss; and gradient-based methods that can also handle optimization parameters. The best comparison for hyper-training is to gradient-based methods which only handle parameters affecting the training loss, because other methods apply to a more general set of problems. In this case, we write the training and validation losses as:

L_Train(w, λ) = E_{x∼Train}[ L_Pred(x, w) ] + L_Reg(w, λ),   L_Valid(w) = E_{x∼Valid}[ L_Pred(x, w) ]

All hidden units in the hypernetwork have a ReLU activation (Nair & Hinton, 2010) unless otherwise specified. Autograd (Maclaurin et al., 2015b) was used to compute all derivatives. For each experiment, each minibatch samples a small batch of hyperparameters together with a subset of the training data points. We used Adam for training the hypernetwork and hyperparameters. We ran all experiments on a CPU.
Our first experiment, shown in Figure 2, demonstrates learning a global approximation to a best-response function using Algorithm 2. To make visualization of the regularization loss easier, we use few training data points to exacerbate overfitting. We compare the performance of weights output by the hypernetwork to those trained by standard cross-validation (Algorithm 1). For the latter, weights were randomly initialized for each hyperparameter choice and optimized using Adam (Kingma & Ba, 2014).
When training the hypernetwork, hyperparameters were sampled from a broad Gaussian distribution. The hypernetwork is a small fully-connected network with a single hidden layer.
The minimum of the best-response in Figure 2 is close to the real minimum of the validation loss, which shows a hypernetwork can satisfactorily approximate a global best-response function in small problems.
Figure 4 shows the same experiment, but using Algorithm 3. The fused updates find a best-response approximation whose minimum matches the actual minimum, and do so faster than the prior experiment. The conditional hyperparameter distribution p(λ|λ̂) is a narrow Gaussian centered on the current λ̂. The hypernetwork is a linear model. We use the same optimizer as in the global best-response experiment to update both the hypernetwork and the hyperparameters.
Again, the minimum of the best-response at the end of training minimizes the validation loss. This experiment shows that using only a locally-trained linear best-response function can give sufficient gradient information to optimize hyperparameters on a small problem. Algorithm 3 is also less computationally expensive than Algorithms 1 or 2.
[Figure 6: distribution of inferred minus true validation loss for the GP mean, hyper-training with a fixed hyperparameter set, and hyper-training.]
To compare hyper-training with other gradient-based hyperparameter optimization methods, we train linear (1-layer) models with a separate weight decay applied to each weight, giving as many hyperparameters as elementary model weights. The conditional hyperparameter distribution and the optimizer for the hypernetwork and hyperparameters are the same as in the prior experiment. We restrict capacity by factorizing the linear hypernetwork's weights with a small bottleneck layer. Each hypernetwork iteration is several times as expensive as an iteration on just the elementary model, because there are as many hyperparameters as model parameters.
Figure 5, top, shows that Algorithm 3 converges more quickly than the unrolled reverse-mode optimization introduced in Maclaurin et al. (2015a) and implemented by Franceschi et al. (2017). Hyper-training reaches sub-optimal solutions because of limitations on how many hyperparameters can be sampled for each update but overfits validation data less than unrolling. Standard Bayesian optimization cannot be scaled to this many hyperparameters. Thus, this experiment shows Algorithm 3 can efficiently partially optimize thousands of hyperparameters. It may be useful to combine these methods by using a hypernetwork to output initial parameters and then unrolling several steps of optimization to differentiate through.
To see if we can optimize deeper networks with hyper-training, we optimize models with 1, 2, and 3 layers and a separate weight decay applied to each weight. The conditional hyperparameter distribution and the optimizer for the hypernetwork and hyperparameters are the same as in the prior experiment. We factorize the weights for each model by using a linear hypernetwork with a small bottleneck layer.
Figure 5, bottom, shows that Algorithm 3 can scale to networks with multiple hidden layers and outperform hand-tuned settings. As we add more layers, the difference between validation loss and test loss decreases, and the model performs better on the validation set. Future work should compare other architectures like recurrent or convolutional networks. Additionally, note that deeper models achieve lower training (not shown), validation, and test losses, rather than lower training loss and higher validation or test loss. This indicates that a separate weight decay on each weight can act as a prior for generalization, or that hyper-training enforces another useful prior like the continuity of a best-response.
Our approach differs from Bayesian optimization, which attempts to directly model the validation loss of optimized weights, whereas we learn to predict optimal weights. In this experiment, we untangle the reason for the better performance of our method: Is it because of a better inductive bias, or because our method can see more hyperparameter settings during optimization?
First, we constructed a hyper-training set: we optimized a number of sets of weights to completion, each given randomly-sampled hyperparameters. We chose a small number of samples since that is the regime in which we expect Gaussian-process-based approaches to have the most significant advantage. We also constructed a validation set of (optimized weight, hyperparameter) tuples generated in the same manner. We then fit a Gaussian process (GP) regression model with an RBF kernel from sklearn on the validation loss data. A hypernetwork was fit to the same set of hyperparameters and data. Finally, we optimized another hypernetwork using Algorithm 2, for the same amount of time as building the GP training set. The two hypernetworks were linear models trained with the same optimizer parameters as in the high-dimensional hyperparameter optimization.
Figure 6 shows the distribution of prediction errors of these three models. We can see that the Gaussian process tends to underestimate loss. The hypernetwork trained with the same small fixed set of examples tends to overestimate loss. We conjecture that this is due to the hypernetwork producing bad weights in regions where it doesn’t have enough training data. Because the hypernetwork must provide actual weights to predict the validation loss, poorly-fit regions will overestimate the validation loss. Finally, the hypernetwork trained with Algorithm 2 produces errors tightly centered around 0. The main takeaway from this experiment is a hypernetwork can learn more accurate surrogate functions than a GP for equal compute budgets because it views (noisy) evaluations of more points.
In this paper, we addressed the question of tuning hyperparameters using gradient-based optimization, by replacing the training optimization loop with a differentiable hypernetwork. We gave a theoretical justification that sufficiently large networks will learn the best-response for all hyperparameters viewed in training. We also presented a simpler and more scalable method that jointly optimizes both hyperparameters and hypernetwork weights, allowing our method to work with manageably-sized hypernetworks.
Experimentally, we showed that hypernetworks could provide a better inductive bias for hyperparameter optimization than Gaussian processes fitting the validation loss empirically.
There are many directions to extend the proposed methods. For instance, the hypernetwork could be composed with several iterations of optimization, as an easily-differentiable fine-tuning step. Or, hypernetworks could be incorporated into meta-learning schemes, such as MAML (Finn et al., 2017), which finds weights that perform a variety of tasks after unrolling gradient descent.
We also note that the prospect of optimizing thousands of hyperparameters raises the question of hyper-regularization, or regularization of hyperparameters.
We thank Matthew MacKay, Dougal Maclaurin, Daniel Flam-Shepherd, Daniel Roy, and Jack Klys for helpful discussions.