hypernet-hypertraining
Code for Stochastic Hyperparameter Optimization through Hypernetworks
Machine learning models are often tuned by nesting optimization of model weights inside the optimization of hyperparameters. We give a method to collapse this nested optimization into joint stochastic optimization of weights and hyperparameters. Our process trains a neural network to output approximately optimal weights as a function of hyperparameters. We show that our technique converges to locally optimal weights and hyperparameters for sufficiently large hypernetworks. We compare this method to standard hyperparameter optimization strategies and demonstrate its effectiveness for tuning thousands of hyperparameters.
Model selection and hyperparameter tuning is a significant bottleneck in designing predictive models. Hyperparameter optimization is a nested optimization: the inner optimization finds model parameters w which minimize the training loss L_train(w, λ) given hyperparameters λ. The outer optimization chooses λ to reduce a validation loss L_valid(w):
λ* = argmin_λ L_valid( argmin_w L_train(w, λ) )    (1)
Standard practice in machine learning solves (1) by gradient-free optimization of the hyperparameters, such as grid search or random search. Each set of hyperparameters is evaluated by re-initializing weights and training the model to completion. Retraining a model from scratch is wasteful if the hyperparameters change by only a small amount. Some approaches, such as Hyperband (Li et al., 2016) and freeze-thaw Bayesian optimization (Swersky et al., 2014), resume model training and do not waste this effort. However, these methods often scale poorly beyond 10 to 20 dimensions.
How can we avoid retraining from scratch each time? Note that the optimal parameters w* are a deterministic function of the hyperparameters λ:
w*(λ) = argmin_w L_train(w, λ)    (2)
We propose to learn this function. Specifically, we train a neural network w_φ(λ), with hypernetwork weights φ, that takes hyperparameters λ as input and outputs an approximately optimal set of model weights.
This formulation provides two major benefits: First, we can train the hypernetwork to convergence using stochastic gradient descent (SGD) without training any particular model to completion. Second, differentiating through the hypernetwork allows us to optimize the hyperparameters with stochastic gradient-based optimization.
How can we teach a hypernetwork (Ha et al., 2016) to output approximately optimal weights for another neural network? The basic idea is that at each iteration, we ask the hypernetwork to output a set of weights given some hyperparameters: w = w_φ(λ). Instead of updating the weights w using the training loss gradient ∂L_train(w)/∂w, we update the hypernetwork weights φ using the chain rule: ∂L_train(w_φ(λ))/∂φ = ∂L_train(w)/∂w · ∂w_φ(λ)/∂φ. This formulation allows us to optimize the hyperparameters λ with the validation loss gradient ∂L_valid(w_φ(λ))/∂λ. We call this method hyper-training and contrast it with standard training methods. We call the function w*(λ) that outputs optimal weights for hyperparameters a best-response function. At convergence, we want our hypernetwork w_φ(λ) to match the best-response function closely.
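To make the chain-rule update concrete, here is a minimal NumPy sketch of one hyper-training step on a toy problem; the scalar quadratic loss, the linear hypernetwork, and all names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy sketch of one hyper-training step (illustrative, not the paper's setup).
# Hypernetwork: w_phi(lam) = phi[0] + phi[1] * lam  (scalar w, scalar lam).
# Training loss: L_train(w, lam) = (w - 1)^2 + exp(lam) * w^2,
# whose best-response is w*(lam) = 1 / (1 + exp(lam)).

def hypernet(phi, lam):
    return phi[0] + phi[1] * lam

def d_train_loss_d_w(w, lam):
    return 2.0 * (w - 1.0) + 2.0 * np.exp(lam) * w

def hypertraining_step(phi, lam_hat, rng, lr=0.01, sigma=0.1):
    """Update phi, not w: chain dL_train/dw through dw/dphi."""
    lam = lam_hat + sigma * rng.standard_normal()  # sample near current lam
    w = hypernet(phi, lam)
    dw_dphi = np.array([1.0, lam])                 # Jacobian of w_phi w.r.t. phi
    grad_phi = d_train_loss_d_w(w, lam) * dw_dphi  # chain rule
    return phi - lr * grad_phi

# After many steps, hypernet(phi, 0.0) should approach the best
# response w*(0) = 1 / (1 + e^0) = 0.5 for this toy loss.
phi = np.zeros(2)
rng = np.random.default_rng(0)
for _ in range(2000):
    phi = hypertraining_step(phi, 0.0, rng)
```

Note that no elementary weight vector is ever stored or trained directly: every weight seen during training is produced by the hypernetwork.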
Our method is closely related to the concurrent work of Brock et al. (2017), whose SMASH algorithm also approximates the optimal weights as a function of model architectures, to perform a gradient-free search over discrete model structures. Their work focuses on efficiently estimating the performance of a variety of model architectures, while we focus on efficiently exploring continuous spaces of models. We further extend this idea by formulating an algorithm to optimize the hypernetwork and hyperparameters jointly. Joint optimization of parameters and hyperparameters addresses one of the main weaknesses of SMASH, which is that the hypernetwork must be very large to learn approximately optimal weights for many different settings. During joint optimization, the hypernetwork need only model approximately optimal weights for the neighborhood around the current hyperparameters, allowing us to use even linear hypernetworks.
Hyper-training is a method to learn a mapping from hyperparameters to validation loss which is differentiable and cheap to evaluate. We can compare hyper-training to other model-based hyperparameter schemes. Bayesian optimization (e.g., Lizotte (2008); Snoek et al. (2012)) builds a model of the validation loss as a function of the hyperparameters, usually using a Gaussian process (e.g., Rasmussen & Williams (2006)) to track uncertainty. This approach has several disadvantages compared to hyper-training.
First, obtaining data for standard Bayesian optimization requires optimizing models from initialization for each set of hyperparameters. In contrast, hyper-training never needs to fully optimize any single model, removing choices like how many models to train and for how long.
Second, standard Bayesian optimization treats the validation loss as a black-box function of the hyperparameters: λ → L_valid(w*(λ)). In contrast, hyper-training takes advantage of the fact that the validation loss is a known, differentiable function of the hypernetwork output: L_valid(w_φ(λ)). This information removes the need to learn a model of the validation loss. The function can also be evaluated stochastically by sampling points from the validation set.
Hyper-training has the benefit of learning a mapping from hyperparameters to optimized weights, which is substituted into the validation loss. This mapping often has a better inductive bias for predicting validation loss from hyperparameters than directly learning the loss. Also, the hypernetwork learns continuous best-responses, which may be a beneficial prior for finding weights by enforcing stability.
We can apply this method to unconstrained continuous bilevel optimization problems with an inner loss function over inner parameters and an outer loss function over outer parameters. What sort of parameters can be optimized by our approach? Hyperparameters typically fall into two broad categories: 1) optimization hyperparameters, such as learning rates, which affect the choice of locally optimal point converged to, and 2) regularization or model architecture hyperparameters, which change the set of locally optimal points. Hyper-training has no inner training loop, and hence no inner optimization hyperparameters, so it cannot optimize these. However, we must still choose optimization parameters for the fused optimization loop. In principle, hyper-training can handle discrete hyperparameters, but it offers no particular advantage there over continuous hyperparameters. Another limitation is that our approach only proposes local changes to the hyperparameters and does not perform uncertainty-based exploration. Uncertainty could be incorporated into the hypernetwork by using stochastic variational inference as in Blundell et al. (2015); we leave this for future work. Finally, it is not obvious how to choose the training distribution of hyperparameters p(λ). If we do not sample a sufficient range of hyperparameters, the implicit estimated gradient of the validation loss w.r.t. the hyperparameters may be inaccurate. We discuss several approaches to this problem in section 2.4.
A clear difficulty of this approach is that hypernetworks can require several times as many parameters as the original model. For example, a fully-connected hypernetwork with a single hidden layer of H units that outputs D model parameters requires at least H × D hypernetwork parameters. To address this problem, in section 2.4 we propose an algorithm that trains only a linear model mapping hyperparameters to model weights.
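The parameter-count claim is easy to check; the sizes below are illustrative assumptions for the sake of the example, not figures from the paper.

```python
def hypernet_param_count(n_hyper, n_hidden, n_weights):
    """Parameters of a fully-connected hypernetwork with one hidden layer
    of n_hidden units mapping n_hyper hyperparameters to n_weights
    elementary weights (weight matrices plus biases for both layers)."""
    input_to_hidden = n_hyper * n_hidden + n_hidden
    hidden_to_output = n_hidden * n_weights + n_weights
    return input_to_hidden + hidden_to_output

# Illustrative sizes: even 100 hidden units outputting 100,000 elementary
# weights already needs ~10 million hypernetwork parameters, dominated by
# the n_hidden * n_weights output layer.
print(hypernet_param_count(10, 100, 100_000))  # 10101100
```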
Algorithm 2: Optimization of hypernetwork, then hyperparameters
  initialize φ, λ̂
  loop
    λ ∼ p(λ), x ∼ Training data
    φ ← φ − α ∇_φ L_train(w_φ(λ), x)
  end loop
  loop
    x ∼ Validation data
    λ̂ ← λ̂ − β ∇_λ̂ L_valid(w_φ(λ̂), x)
  end loop
  Return λ̂, w_φ(λ̂)

Algorithm 3: Joint optimization of hypernetwork and hyperparameters
  initialize φ, λ̂
  loop
    λ ∼ p(λ | λ̂), x ∼ Training data
    φ ← φ − α ∇_φ L_train(w_φ(λ), x)
    x ∼ Validation data
    λ̂ ← λ̂ − β ∇_λ̂ L_valid(w_φ(λ̂), x)
  end loop
  Return λ̂, w_φ(λ̂)
Algorithm 2 trains a hypernetwork using SGD, drawing hyperparameters from a fixed distribution p(λ). This section proves that Algorithm 2 converges to a local best-response under mild assumptions. In particular, we show that, for a sufficiently large hypernetwork, the choice of p(λ) does not matter as long as it has sufficient support. For simplicity, our notation treats w*(λ) as if it were a unique solution of the inner argmin, which is not true in general.
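As a concrete sanity check of Algorithm 2's two-phase scheme, it can be run end-to-end on a one-dimensional toy problem whose best-response is known in closed form. The losses, learning rates, and quadratic hypernetwork below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Toy bilevel problem with a known best-response:
#   L_train(w, lam) = (w - 1)^2 + exp(lam) * w^2  =>  w*(lam) = 1 / (1 + exp(lam))
#   L_valid(w)      = (w - 0.7)^2                 =>  optimum where w*(lam) = 0.7
rng = np.random.default_rng(0)

def w_phi(phi, lam):
    # quadratic-in-lam hypernetwork: enough capacity for this 1-D toy
    return phi[0] + phi[1] * lam + phi[2] * lam ** 2

# Phase 1: train the hypernetwork over a fixed distribution p(lam).
phi = np.zeros(3)
for _ in range(5000):
    lam = rng.uniform(-1.5, 1.5)
    w = w_phi(phi, lam)
    dLt_dw = 2 * (w - 1) + 2 * np.exp(lam) * w
    phi -= 0.01 * dLt_dw * np.array([1.0, lam, lam ** 2])  # chain rule

# Phase 2: descend the validation loss through the trained hypernetwork.
lam_hat = 0.0
for _ in range(2000):
    w = w_phi(phi, lam_hat)
    dLv_dw = 2 * (w - 0.7)
    dw_dlam = phi[1] + 2 * phi[2] * lam_hat
    lam_hat -= 0.1 * dLv_dw * dw_dlam
```

With these toy losses the true optimum is at lam = log(3/7) ≈ -0.85, and the returned lam_hat should land nearby, up to the hypernetwork's approximation error and SGD noise.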
Theorem 2.1. Sufficiently powerful hypernetworks can learn continuous best-response functions, which minimize the expected training loss for any hyperparameter distribution with convex support.
Proof. If w_φ is a universal approximator (Hornik, 1991) and the best-response w*(λ) is continuous in λ (which allows approximation by w_φ), then there exist optimal hypernetwork parameters φ* such that for all hyperparameters λ, w_φ*(λ) = w*(λ) = argmin_w L_train(w, λ). Thus, L_train(w_φ*(λ), λ) = min_w L_train(w, λ). In other words, a universal-approximator hypernetwork can learn a continuous best-response.
Substituting w_φ* into the expected training loss gives E_{p(λ)}[L_train(w_φ*(λ), λ)] = E_{p(λ)}[min_φ L_train(w_φ(λ), λ)]. By Jensen's inequality, min_φ E_{p(λ)}[L_train(w_φ(λ), λ)] ≥ E_{p(λ)}[min_φ L_train(w_φ(λ), λ)]. To satisfy Jensen's requirements, min acts as our concave function on the convex vector space of functions {L_train(w_φ(λ), λ) for λ ∈ support(p(λ))}; to guarantee convexity of this space we require that support(p(λ)) is convex. Thus, φ* ∈ argmin_φ E_{p(λ)}[L_train(w_φ(λ), λ)]. In other words, if the hypernetwork learns the best-response, it simultaneously minimizes the loss for every point in support(p(λ)). ∎

Thus, having a universal approximator and a continuous best-response implies that for all λ ∈ support(p(λ)), L_valid(w_φ*(λ)) = L_valid(w*(λ)), because w_φ*(λ) = w*(λ). Thus, under mild conditions, we will learn a best-response in the support of the hyperparameter distribution. If the best-response is differentiable, then there is a neighborhood about each hyperparameter where the best-response is approximately linear. If the support of the hyperparameter distribution is contained in such a neighborhood, then we can learn the best-response locally with linear regression.
In practice, there are no guarantees that the network is a universal approximator or that optimization converges in finite time. The optimal hypernetwork will also depend on the hyperparameter distribution p(λ), not just on its support. We appeal to experimental results to show that our method is feasible in practice.
Theorem 2.1 holds for any p(λ). In practice, we should choose a p(λ) that puts most of its mass on promising hyperparameter values, because it may not be possible to learn a best-response for all hyperparameters due to limited hypernetwork capacity. Thus, we propose Algorithm 3, which only tries to match a best-response locally. We introduce a "current" hyperparameter λ̂, which is updated each iteration, and define a conditional hyperparameter distribution p(λ | λ̂) which only puts mass close to λ̂.
Algorithm 3 combines the two phases of Algorithm 2 into one. Instead of first learning a hypernetwork that can output weights for any hyperparameter and then optimizing the hyperparameters, Algorithm 3 only samples hyperparameters near the current guess. This means the hypernetwork need only be trained to estimate good-enough weights for a small set of hyperparameters, at the extra cost of having to retrain the hypernetwork each time λ̂ is updated. The locally-trained hypernetwork can then be used to provide gradients to update the hyperparameters based on validation set performance.
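The fused loop can be sketched on a scalar toy problem; all losses, step sizes, and names below are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

# Fused-loop sketch on a self-contained toy problem:
#   L_train(w, lam) = (w - 1)^2 + exp(lam) * w^2,  L_valid(w) = (w - 0.7)^2
rng = np.random.default_rng(0)
phi = np.zeros(2)   # linear hypernetwork: w_phi(lam) = phi[0] + phi[1] * lam
lam_hat = 0.0       # "current" hyperparameter
sigma = 0.1         # width of the conditional distribution p(lam | lam_hat)

for _ in range(4000):
    # Hypernetwork step: fit weights only for hyperparameters near lam_hat.
    lam = lam_hat + sigma * rng.standard_normal()
    w = phi[0] + phi[1] * lam
    dLt_dw = 2 * (w - 1) + 2 * np.exp(lam) * w
    phi -= 0.01 * dLt_dw * np.array([1.0, lam])

    # Hyperparameter step: validation gradient through the local linear fit.
    w_hat = phi[0] + phi[1] * lam_hat
    lam_hat -= 0.05 * 2 * (w_hat - 0.7) * phi[1]

print(lam_hat)  # should approach log(3/7) ≈ -0.85 for these toy losses
```

Because each hypernetwork step fits only a small neighborhood of lam_hat, even a linear hypernetwork suffices: it tracks a moving tangent to the best-response as the hyperparameter drifts.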
How simple can we make the hypernetwork and still obtain useful gradients to optimize hyperparameters? Consider the case in our experiments where the hypernetwork is a linear function of the hyperparameters and the conditional hyperparameter distribution is a Gaussian centered on the current hyperparameters with some small standard deviation σ. This hypernetwork learns a tangent hyperplane to the best-response function and only needs to make minor adjustments at each step if the hyperparameter updates are sufficiently small. We can further restrict the capacity of a linear hypernetwork by factorizing its weights, effectively adding a bottleneck layer with a linear activation and a small number of hidden units.
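A minimal sketch of such a factorized (low-rank) linear hypernetwork, with made-up sizes, shows how the bottleneck cuts the parameter count:

```python
import numpy as np

# Factorized linear hypernetwork (sketch): the full (n_weights x n_hyper)
# matrix is replaced by a rank-k product U @ V, i.e. a linear bottleneck
# of k hidden units. All sizes here are illustrative assumptions.
n_hyper, n_weights, k = 1000, 1000, 10

rng = np.random.default_rng(0)
U = rng.normal(0.0, 0.01, size=(n_weights, k))
V = rng.normal(0.0, 0.01, size=(k, n_hyper))
b = np.zeros(n_weights)

def w_phi(lam):
    # low-rank linear map from hyperparameters to elementary weights
    return U @ (V @ lam) + b

full_params = n_weights * n_hyper + n_weights              # 1,001,000
factored_params = n_weights * k + k * n_hyper + n_weights  # 21,000
print(full_params, factored_params)
```

For these sizes the factorization shrinks the hypernetwork by roughly 50x while still producing a full weight vector for any hyperparameter setting.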
Our work is complementary to the SMASH algorithm of Brock et al. (2017); section 2 discusses the differences.
Model-free approaches use only trial and error to explore the hyperparameter space. Simple model-free approaches applied to hyperparameter optimization include grid search and random search (Bergstra & Bengio, 2012). Hyperband (Li et al., 2016) combines bandit approaches with modeling the learning procedure.
Model-based approaches try to build a surrogate function, which can allow gradient-based optimization or active learning. A common example is Bayesian optimization. Freeze-thaw Bayesian optimization can condition on partially-optimized model performance.
Another line of related work attempts to directly approximate gradients of the validation loss with respect to hyperparameters. Domke (2012) proposes to differentiate through unrolled optimization to approximate best-responses in nested optimization, and Maclaurin et al. (2015a) differentiate through entire unrolled learning procedures. DrMAD (Fu et al., 2016) approximates differentiating through an unrolled learning procedure to relax memory requirements for deep neural networks. HOAG (Pedregosa, 2016) finds hyperparameter gradients with implicit differentiation, deriving an implicit equation for the gradient from optimality conditions. Franceschi et al. (2017) study forward- and reverse-mode differentiation for constructing hyperparameter gradients. Also, Feng & Simon (2017) establish conditions under which the validation loss of best-responding weights is almost everywhere smooth, allowing gradient-based training of hyperparameters.
A closely related procedure to our method is that of Luketina et al. (2016), which also provides an algorithm for stochastic gradient-based optimization of hyperparameters. The convergence of their procedure to local optima of the validation loss depends on approximating the Hessian of the training loss with respect to the parameters by the identity matrix. In contrast, the convergence of our method depends on having a suitably powerful hypernetwork.
Best-response functions are extensively studied as a solution concept in discrete and continuous multi-agent games (e.g., Fudenberg & Levine (1998)). Games where learning a best-response can be applied include adversarial training (Goodfellow et al., 2014) and Stackelberg competitions (e.g., Brückner & Scheffer (2011)). For adversarial training, the analog of our method is a discriminator that observes the generator's parameters.
In our experiments, we examine the standard example of stochastic gradient-based optimization of neural networks with a weight regularization penalty. Some gradient-based methods explicitly use the gradient of a loss, while others use the gradient of a learned surrogate loss. Hyper-training learns a surrogate best-response function and substitutes it into a real loss. We may contrast our algorithm with methods that learn the loss, like Bayesian optimization; gradient-based methods that only handle hyperparameters affecting the training loss; and gradient-based methods that can also handle optimization parameters. The best comparison for hyper-training is to gradient-based methods that only handle parameters affecting the training loss, because the other methods apply to a more general set of problems. In this case, we write the training and validation losses as:

L_train(w, λ) = E_{x ∼ Training}[ L_pred(x, w) ] + L_reg(w, λ)
L_valid(w) = E_{x ∼ Validation}[ L_pred(x, w) ]
In all experiments, Algorithm 2 or 3 is used to optimize weights with a mean squared error on MNIST (LeCun et al., 1998), with L_reg an L2 weight decay penalty weighted by the hyperparameters. All hidden units in the hypernetwork have a ReLU activation (Nair & Hinton, 2010) unless otherwise specified. Autograd (Maclaurin et al., 2015b) was used to compute all derivatives. For each experiment, each minibatch samples pairs of hyperparameters along with training data points. We used Adam for training the hypernetwork and hyperparameters. We ran all experiments on a CPU. Our first experiment, shown in Figure 2, demonstrates learning a global approximation to a best-response function using Algorithm 2. To make visualization of the regularization loss easier, we use few training data points to exacerbate overfitting. We compare the performance of weights output by the hypernetwork to those trained by standard cross-validation (Algorithm 1): elementary weights were randomly initialized for each hyperparameter choice and optimized using Adam (Kingma & Ba, 2014) with a fixed step size.
When training the hypernetwork, hyperparameters were sampled from a broad Gaussian distribution. The minimum of the best-response in Figure 2 is close to the true minimum of the validation loss, which shows that a hypernetwork can satisfactorily approximate a global best-response function on small problems.
Figure 4 shows the same experiment, but using Algorithm 3. The fused updates find a best-response approximation whose minimum matches the actual minimum faster than in the prior experiment. The conditional hyperparameter distribution is a narrow Gaussian centered on the current hyperparameter. The hypernetwork is a linear model. We use the same optimizer as in the global best-response experiment to update both the hypernetwork and the hyperparameters.
Again, the minimum of the best-response at the end of training minimizes the validation loss. This experiment shows that a locally-trained linear best-response function can give sufficient gradient information to optimize hyperparameters on a small problem. Algorithm 3 is also less computationally expensive than Algorithms 1 or 2.
[Figure 6: Inferred versus true validation loss for the GP mean, a hypernetwork trained on a fixed set of hyperparameters, and hyper-training, with a histogram of the inferred minus true loss.]
To compare hyper-training with other gradient-based hyperparameter optimization methods, we train models with many hyperparameters: a separate weight decay applied to each weight in a one-layer (linear) model. The conditional hyperparameter distribution and the optimizer for the hypernetwork and hyperparameters are the same as in the prior experiment. We factorize the weights of the linear hypernetwork with a small number of hidden units. Each hypernetwork iteration is several times as expensive as an iteration on just the model, because there are as many hyperparameters as model parameters.
Figure 5, top, shows that Algorithm 3 converges more quickly than the unrolled reverse-mode optimization introduced in Maclaurin et al. (2015a) and implemented by Franceschi et al. (2017). Hyper-training reaches suboptimal solutions because of limitations on how many hyperparameters can be sampled for each update, but overfits the validation data less than unrolling does. Standard Bayesian optimization cannot be scaled to this many hyperparameters. Thus, this experiment shows that Algorithm 3 can efficiently partially optimize thousands of hyperparameters. It may be useful to combine these methods, using a hypernetwork to output initial parameters and then unrolling several steps of optimization to differentiate through.
To see if we can optimize deeper networks with hyper-training, we optimize models with 1, 2, and 3 layers and a separate weight decay applied to each weight. The conditional hyperparameter distribution and the optimizer for the hypernetwork and hyperparameters are the same as in the prior experiment. We factorize the weights for each model with a small number of hypernetwork hidden units.
Figure 5, bottom, shows that Algorithm 3 can scale to networks with multiple hidden layers and outperform hand-tuned settings. As we add more layers, the difference between validation loss and test loss decreases, and the model performs better on the validation set. Future work should compare other architectures like recurrent or convolutional networks. Additionally, note that deeper models achieve lower training (not shown), validation, and test losses, rather than lower training loss with higher validation or test loss. This indicates that a separate weight decay on each weight could act as a prior for generalization, or that hyper-training enforces another useful prior, like the continuity of a best-response.
Our approach differs from Bayesian optimization, which attempts to directly model the validation loss of optimized weights, whereas we try to learn to predict optimal weights. In this experiment, we untangle the reason for the better performance of our method: is it because of a better inductive bias, or because our method can see more hyperparameter settings during optimization?
First, we constructed a hyper-training set: we optimized sets of weights to completion given randomly-sampled hyperparameters. We chose a small number of samples, since that is the regime in which we expect Gaussian process-based approaches to have the most significant advantage. We also constructed a validation set of (optimized weight, hyperparameter) tuples generated in the same manner. We then fit a Gaussian process (GP) regression model with an RBF kernel from sklearn to the validation loss data. A hypernetwork was fit to the same set of hyperparameters and optimized weights. Finally, we optimized another hypernetwork using Algorithm 2 for the same amount of time as was spent building the GP training set. The two hypernetworks were linear models trained with the same optimizer parameters as in the high-dimensional hyperparameter experiment.
Figure 6 shows the distribution of prediction errors of these three models. The Gaussian process tends to underestimate loss. The hypernetwork trained with the same small fixed set of examples tends to overestimate loss. We conjecture that this is because the hypernetwork produces poor weights in regions where it has too little training data; since the hypernetwork must output actual weights to predict the validation loss, poorly-fit regions overestimate the validation loss. Finally, the hypernetwork trained with Algorithm 2 produces errors tightly centered around 0. The main takeaway from this experiment is that a hypernetwork can learn a more accurate surrogate function than a GP for an equal compute budget, because it views (noisy) evaluations of more points.
In this paper, we addressed the question of tuning hyperparameters using gradient-based optimization by replacing the training optimization loop with a differentiable hypernetwork. We gave a theoretical justification that sufficiently large networks will learn the best-response for all hyperparameters viewed in training. We also presented a simpler and more scalable method that jointly optimizes both hyperparameters and hypernetwork weights, allowing our method to work with manageably-sized hypernetworks.
Experimentally, we showed that hypernetworks can provide a better inductive bias for hyperparameter optimization than Gaussian processes fit directly to the validation loss.
There are many directions to extend the proposed methods. For instance, the hypernetwork could be composed with several iterations of optimization, as an easily-differentiable fine-tuning step. Or, hypernetworks could be incorporated into meta-learning schemes, such as MAML (Finn et al., 2017), which finds weights that perform well on a variety of tasks after unrolling gradient descent.
We also note that the prospect of optimizing thousands of hyperparameters raises the question of hyper-regularization, or regularization of hyperparameters.
We thank Matthew MacKay, Dougal Maclaurin, Daniel Flam-Shepard, Daniel Roy, and Jack Klys for helpful discussions.