Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

by Matthew MacKay et al.

Hyperparameter optimization can be formulated as a bilevel optimization problem, where the optimal parameters on the training set depend on the hyperparameters. We aim to adapt regularization hyperparameters for neural networks by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases. We show how to construct scalable best-response approximations for neural networks by modeling the best-response as a single network whose hidden units are gated conditionally on the regularizer. We justify this approximation by showing the exact best-response for a shallow linear network with L2-regularized Jacobian can be represented by a similar gating mechanism. We fit this model using a gradient-based hyperparameter optimization algorithm which alternates between approximating the best-response around the current hyperparameters and optimizing the hyperparameters using the approximate best-response function. Unlike other gradient-based approaches, we do not require differentiating the training loss with respect to the hyperparameters, allowing us to tune discrete hyperparameters, data augmentation hyperparameters, and dropout probabilities. Because the hyperparameters are adapted online, our approach discovers hyperparameter schedules that can outperform fixed hyperparameter values. Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs).








Code Repositories

Code for Self-Tuning Networks (ICLR 2019)

PyTorch implementation of "STNs" and "Delta-STNs"

1 Introduction

Regularization hyperparameters such as weight decay, data augmentation, and dropout (Srivastava et al., 2014) are crucial to the generalization of neural networks, but are difficult to tune. Popular approaches to hyperparameter optimization include grid search, random search (Bergstra & Bengio, 2012), and Bayesian optimization (Snoek et al., 2012). These approaches work well with low-dimensional hyperparameter spaces and ample computational resources; however, they pose hyperparameter optimization as a black-box optimization problem, ignoring structure which can be exploited for faster convergence, and require many training runs.

We can formulate hyperparameter optimization as a bilevel optimization problem. Let w denote parameters (e.g. weights and biases) and λ denote hyperparameters (e.g. dropout probability). Let L_T and L_V be functions mapping parameters and hyperparameters to training and validation losses, respectively. We aim to solve¹:

λ* = argmin_λ L_V(λ, w*(λ)),  where  w*(λ) = argmin_w L_T(λ, w)    (1)

¹The uniqueness of the argmin is assumed.
Substituting the best-response function w* gives a single-level problem:

λ* = argmin_λ L_V(λ, w*(λ))    (2)
If the best-response w* is known, the validation loss can be minimized directly by gradient descent using Equation 2, offering dramatic speed-ups over black-box methods. However, as the solution to a high-dimensional optimization problem, w* is difficult to compute even approximately.

Following Lorraine & Duvenaud (2018), we propose to approximate the best-response directly with a parametric function ŵ_φ. We jointly optimize φ and λ, first updating φ so that ŵ_φ ≈ w* in a neighborhood around the current hyperparameters, then updating λ by using ŵ_φ as a proxy for w* in Eq. 2:

λ* ≈ argmin_λ L_V(λ, ŵ_φ(λ))    (3)
Finding a scalable approximation ŵ_φ when w represents the weights of a neural network is a significant challenge, as even simple implementations entail significant memory overhead. We show how to construct a compact approximation by modelling the best-response of each row in a layer’s weight matrix/bias as a rank-one affine transformation of the hyperparameters. We show that this can be interpreted as computing the activations of a base network in the usual fashion, plus a correction term dependent on the hyperparameters. We justify this approximation by showing the exact best-response for a shallow linear network with L2-regularized Jacobian follows a similar structure. We call our proposed networks Self-Tuning Networks (STNs) since they update their own hyperparameters online during training.

STNs enjoy many advantages over other hyperparameter optimization methods. First, they are easy to implement by replacing existing modules in deep learning libraries with “hyper” counterparts which accept an additional vector of hyperparameters as input (we illustrate how this is done for the PyTorch library (Paszke et al., 2017) in Appendix G). Second, because the hyperparameters are adapted online, we ensure that computational effort expended to fit ŵ_φ around previous hyperparameters is not wasted. In addition, this online adaptation yields hyperparameter schedules which we find empirically to outperform fixed hyperparameter settings. Finally, the STN training algorithm does not require differentiating the training loss with respect to the hyperparameters, unlike other gradient-based approaches (Maclaurin et al., 2015; Larsen et al., 1996), allowing us to tune discrete hyperparameters, such as the number of holes to cut out of an image (DeVries & Taylor, 2017), data-augmentation hyperparameters, and discrete-noise dropout parameters. Empirically, we evaluate the performance of STNs on large-scale deep-learning problems with the Penn Treebank (Marcus et al., 1993) and CIFAR-10 datasets (Krizhevsky & Hinton, 2009), and find that they substantially outperform baseline methods.

2 Bilevel Optimization

A bilevel optimization problem consists of two sub-problems called the upper-level and lower-level problems, where the upper-level problem must be solved subject to optimality of the lower-level problem. Minimax problems are an example of bilevel programs where the upper-level objective equals the negative lower-level objective. Bilevel programs were first studied in economics to model leader/follower firm dynamics (Von Stackelberg, 2010) and have since found uses in various fields (see Colson et al. (2007) for an overview). In machine learning, many problems can be formulated as bilevel programs, including hyperparameter optimization, GAN training (Goodfellow et al., 2014), meta-learning, and neural architecture search (Zoph & Le, 2016).

Even if all objectives and constraints are linear, bilevel problems are strongly NP-hard (Hansen et al., 1992; Vicente et al., 1994). Due to the difficulty of obtaining exact solutions, most work has focused on restricted settings, considering linear, quadratic, and convex functions. In contrast, we focus on obtaining local solutions in the nonconvex, differentiable, and unconstrained setting. Let F and f denote the upper- and lower-level objectives (e.g., L_V and L_T) and λ and w denote the upper- and lower-level parameters. We aim to solve:

min_λ F(λ, w*)    (4a)
subject to  w* = argmin_w f(λ, w)    (4b)
It is desirable to design a gradient-based algorithm for solving Problem 4, since using gradient information provides drastic speed-ups over black-box optimization methods (Nesterov, 2013). The simplest method is simultaneous gradient descent, which updates λ using ∂F/∂λ and w using ∂f/∂w. However, simultaneous gradient descent often gives incorrect solutions, as it fails to account for the dependence of w on λ. Consider the relatively common situation where F does not depend directly on λ: then ∂F/∂λ ≡ 0, and λ is never updated.
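This failure mode is easy to see in a toy problem (an illustrative sketch of our own, not an example from the paper): take f(λ, w) = (w − λ)², so the best response is w*(λ) = λ, and F(λ, w) = (w − 1)², which has no direct dependence on λ.

```python
# Toy bilevel problem: lower level f(lam, w) = (w - lam)^2 has best
# response w*(lam) = lam; upper level F(lam, w) = (w - 1)^2 does not
# depend on lam directly, so dF/dlam = 0 everywhere.

# Simultaneous gradient descent: lam never moves.
lam, w = 0.0, 0.0
for _ in range(1000):
    w -= 0.1 * 2 * (w - lam)  # descend f in w
    lam -= 0.1 * 0.0          # dF/dlam = 0, so lam is stuck at 0.0

# Descending F*(lam) = F(lam, w*(lam)) = (lam - 1)^2 instead uses the
# response gradient and drives lam to the correct solution.
lam_star = 0.0
for _ in range(1000):
    lam_star -= 0.1 * 2 * (lam_star - 1)

print(lam, lam_star)  # lam stays 0.0; lam_star converges to 1.0
```

Simultaneous gradient descent leaves λ at its initialization because ∂F/∂λ ≡ 0, while descending F* through the best response drives λ to the correct value of 1.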

2.1 Gradient Descent via the Best-Response Function

A more principled approach to solving Problem 4 is to use the best-response function (Gibbons, 1992). Assume the lower-level Problem 4b has a unique optimum w*(λ) for each λ. Substituting the best-response function converts Problem 4 into a single-level problem:

min_λ F*(λ),  where  F*(λ) := F(λ, w*(λ))    (5)
If w* is differentiable, we can minimize Eq. 5 using gradient descent on F* with respect to λ. This method requires a unique optimum of Problem 4b for each λ and differentiability of w*. In general, these conditions are difficult to verify. We give sufficient conditions for them to hold in a neighborhood of a point (λ₀, w₀) where w₀ solves Problem 4b given λ₀.

Lemma 1.

(Fiacco & Ishizuka, 1990) Let w₀ solve Problem 4b for λ₀. Suppose f is twice continuously differentiable in a neighborhood of (λ₀, w₀) and the Hessian (∂²f/∂w²)(λ₀, w₀) is positive definite. Then for some neighborhood U of λ₀, there exists a continuously differentiable function w* : U → W such that w*(λ) is the unique solution to Problem 4b for each λ ∈ U and w*(λ₀) = w₀.


Proof. See Appendix B.1. ∎

The gradient of F* decomposes into two terms, which we term the direct gradient and the response gradient. The direct gradient captures the direct reliance of the upper-level objective on λ, while the response gradient captures how the lower-level parameter responds to changes in the upper-level parameter:

dF*/dλ (λ) = ∂F/∂λ (λ, w*(λ)) + ∂F/∂w (λ, w*(λ)) · ∂w*/∂λ (λ)    (6)
Even if ∂F/∂λ ≠ 0 and simultaneous gradient descent is possible, including the response gradient can stabilize optimization by converting the bilevel problem into a single-level one, as noted by Metz et al. (2016) for GAN optimization. Conversion to a single-level problem ensures that the gradient vector field is conservative, avoiding pathological issues described by Mescheder et al. (2017).

2.2 Approximating the Best-Response Function

In general, the solution to Problem 4b is a set, but assuming uniqueness of a solution and differentiability of w* can yield fruitful algorithms in practice. In fact, gradient-based hyperparameter optimization methods can often be interpreted as approximating either the best-response w* or its Jacobian ∂w*/∂λ, as detailed in Section 5. However, these approaches can be computationally expensive and often struggle with discrete hyperparameters and stochastic hyperparameters like dropout probabilities, since they require differentiating the training loss with respect to the hyperparameters. Promising approaches to approximate w* directly were proposed by Lorraine & Duvenaud (2018), and are detailed below.

1. Global Approximation. The first algorithm proposed by Lorraine & Duvenaud (2018) approximates w* as a differentiable function ŵ_φ with parameters φ. If w represents neural net weights, then the mapping λ ↦ ŵ_φ(λ) is a hypernetwork (Schmidhuber, 1992; Ha et al., 2016). If the distribution p(λ) is fixed, then gradient descent with respect to φ minimizes:

E_{λ∼p(λ)} [ L_T(λ, ŵ_φ(λ)) ]    (7)
If p is broad and ŵ_φ is sufficiently flexible, then ŵ_φ can be used as a proxy for w* in Problem 5, resulting in the following objective:

min_λ L_V(λ, ŵ_φ(λ))    (8)
2. Local Approximation. In practice, ŵ_φ is usually insufficiently flexible to model w* over all of supp(p). The second algorithm of Lorraine & Duvenaud (2018) locally approximates w* in a neighborhood around the current upper-level parameter λ. They set p(ε|σ) to a factorized Gaussian noise distribution with a fixed scale parameter σ, and found φ by minimizing the objective:

min_φ E_{ε∼p(ε|σ)} [ L_T(λ + ε, ŵ_φ(λ + ε)) ]    (9)
Intuitively, the upper-level parameter is perturbed by a small amount, so the lower-level parameter learns how to respond. An alternating gradient descent scheme is used, where φ is updated to minimize Equation 9 and λ is updated to minimize Equation 8. This approach worked for problems using L2 regularization on MNIST (LeCun et al., 1998). However, it is unclear if the approach works with different regularizers or scales to larger problems. It requires the hypernetwork ŵ_φ to produce the entire parameter vector, which is a priori unwieldy for high-dimensional w. It is also unclear how to set σ, which defines the size of the neighborhood on which ŵ_φ is trained, or whether the approach can be adapted to discrete and stochastic hyperparameters.

3 Self-Tuning Networks

In this section, we first construct a best-response approximation ŵ_φ that is memory efficient and scales to large neural networks. We justify this approximation through analysis of simpler situations. Then, we describe a method to automatically adjust the scale σ of the neighborhood on which ŵ_φ is trained. Finally, we formally describe our algorithm and discuss how it easily handles discrete and stochastic hyperparameters. We call the resulting networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs).

3.1 An Efficient Best-Response Approximation for Neural Networks

We propose to approximate the best-response for a given layer’s weight matrix W and bias b as an affine transformation of the hyperparameters λ (we describe modifications for convolutional filters in Appendix C):

Ŵ_φ(λ) = W_elem + (C_W λ) ⊙_row W_hyper,   b̂_φ(λ) = b_elem + (C_b λ) ⊙ b_hyper    (10)

Here, ⊙ indicates elementwise multiplication and ⊙_row indicates row-wise rescaling (each row of W_hyper is scaled by the corresponding entry of C_W λ). This architecture computes the usual elementary weight/bias, plus an additional weight/bias which has been scaled by a linear transformation of the hyperparameters. Alternatively, it can be interpreted as directly operating on the pre-activations of the layer, adding a correction to the usual pre-activation to account for the hyperparameters:

Ŵ_φ(λ) x + b̂_φ(λ) = (W_elem x + b_elem) + (C_W λ) ⊙ (W_hyper x) + (C_b λ) ⊙ b_hyper    (11)
This best-response architecture is tractable to compute and memory-efficient: it requires twice the parameters of a standard layer to represent W_elem, W_hyper, b_elem, and b_hyper, plus 2·D_out·n parameters for C_W and C_b, where n is the number of hyperparameters and D_out is the layer’s output dimension. Furthermore, it enables parallelism: since the predictions can be computed by transforming the pre-activations (Equation 11), the hyperparameters for different examples in a batch can be perturbed independently, improving sample efficiency. In practice, the approximation can be implemented by simply replacing existing modules in deep learning libraries with “hyper” counterparts which accept an additional vector of hyperparameters as input (we illustrate how this is done for the PyTorch library (Paszke et al., 2017) in Appendix G).
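As a concrete sketch of this architecture, the gated layer can be written in a few lines of NumPy. This is an illustrative re-implementation under our own naming (W_elem, W_hyper, C_w, C_b are not the authors' identifiers), not the paper's PyTorch code from Appendix G:

```python
import numpy as np

class HyperLinear:
    """Linear layer whose weights are an affine function of hyperparameters.

    Computes (W_elem x + b_elem) + (C_w lam) * (W_hyper x) + (C_b lam) * b_hyper,
    i.e. the usual pre-activation plus a hyperparameter-dependent correction.
    """

    def __init__(self, d_in, d_out, n_hparams, rng):
        self.W_elem = rng.standard_normal((d_out, d_in)) * 0.1
        self.b_elem = np.zeros(d_out)
        self.W_hyper = rng.standard_normal((d_out, d_in)) * 0.1
        self.b_hyper = np.zeros(d_out)
        # Row-wise scales: initialized to zero so the layer starts out
        # behaving exactly like a standard linear layer.
        self.C_w = np.zeros((d_out, n_hparams))
        self.C_b = np.zeros((d_out, n_hparams))

    def __call__(self, x, lam):
        # x: (batch, d_in); lam: (batch, n_hparams) -- each example in the
        # batch may carry its own hyperparameter sample.
        base = x @ self.W_elem.T + self.b_elem
        correction = (lam @ self.C_w.T) * (x @ self.W_hyper.T) \
                   + (lam @ self.C_b.T) * self.b_hyper
        return base + correction

rng = np.random.default_rng(0)
layer = HyperLinear(d_in=4, d_out=3, n_hparams=2, rng=rng)
x = rng.standard_normal((5, 4))
lam = rng.standard_normal((5, 2))
out = layer(x, lam)  # shape (5, 3)
```

Because the correction acts on the pre-activations, giving each example its own hyperparameter sample costs only one extra matrix multiply per layer.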

3.2 Exact Best-Response for Two-Layer Linear Networks

Given that the best-response function is a mapping from the hyperparameters λ ∈ ℝⁿ to the high-dimensional weight space, why should we expect to be able to represent it compactly? And why in particular would Equation 10 be a reasonable approximation? In this section, we exhibit a model whose best-response function can be represented exactly using a minor variant of Equation 10: a linear network with Jacobian norm regularization. In particular, the best-response takes the form of a network whose hidden units are modulated conditionally on the hyperparameters.

Consider using a 2-layer linear network with weights W₁, W₂ to predict targets t from inputs x:

ŷ(x; W₁, W₂) = W₂ W₁ x    (12)
Suppose we use a squared-error loss regularized with an L2 penalty on the Jacobian ∂ŷ/∂x = W₂W₁, where the hyperparameter λ lies in ℝ and is mapped using exp to lie in (0, ∞):

min_{W₁,W₂} Σᵢ ||W₂ W₁ x⁽ⁱ⁾ − t⁽ⁱ⁾||² + exp(λ) ||W₂ W₁||²_F    (13)

Theorem 2.

Let w₀ = (Q, W₂⁰), where Q is the change-of-basis matrix to the principal components of the data matrix and W₂⁰ solves the unregularized version of Problem 13 given W₁ = Q. Then there exist v, c ∈ ℝʰ (with h the number of hidden units) such that the best-response function⁵ is:

w*(λ) = ( σ(vλ + c) ⊙_row Q, W₂⁰ )    (14)

where σ is the sigmoid function.

⁵This is an abuse of notation since, in general, there is not a unique solution to Problem 13 for each λ.


Proof. See Appendix B.2. ∎

Observe that w*(λ) can be implemented as a regular network with weights (Q, W₂⁰) with an additional sigmoidal gating of its hidden units h:

h(x, λ) = σ(vλ + c) ⊙ (Q x),   ŷ(x, λ) = W₂⁰ h(x, λ)
This architecture is shown in Figure 1. Inspired by this example, we use a similar gating of the hidden units to approximate the best-response for deep, nonlinear networks.
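A one-dimensional analogue of this result can be checked numerically (a sanity check of our own, much simpler than the theorem's setting): for scalar least squares with an exp(λ)-weighted L2 penalty, the regularized solution is exactly the unregularized solution multiplied by a sigmoid in λ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
t = 3.0 * x + 0.1 * rng.standard_normal(200)

sxx, sxt = np.sum(x * x), np.sum(x * t)
w0 = sxt / sxx  # unregularized least-squares solution

for lam in [-2.0, 0.0, 2.0]:
    # Exact minimizer of sum_i (w*x_i - t_i)^2 + exp(lam) * w^2:
    w_star = sxt / (sxx + np.exp(lam))
    # The same value, written as a sigmoidal gate applied to w0:
    #   w* = w0 * sigmoid(log(sxx) - lam)
    assert np.isclose(w_star, w0 * sigmoid(np.log(sxx) - lam))
```

The gate σ(log Σx² − λ) shrinks toward 0 as λ grows, mirroring how the hidden units in the gated architecture are shrunk conditionally on the regularization strength.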

3.3 Linear Best-Response Approximations

The sigmoidal gating architecture of the preceding section can be further simplified if one only needs to approximate the best-response function for a small range of hyperparameter values. In particular, for a narrow enough hyperparameter distribution, a smooth best-response function can be approximated by an affine function (i.e., its first-order Taylor approximation). Hence, we replace the sigmoidal gating with linear gating, so that the weights are affine in the hyperparameters. The following theorem shows that, for quadratic lower-level objectives, using an affine approximation to the best-response function and minimizing Eq. 9 yields the correct best-response Jacobian, thus ensuring gradient descent on the approximate objective converges to a local optimum:

Theorem 3.

Suppose f is quadratic with ∂²f/∂w² ≻ 0, p(ε|σ) is Gaussian with mean 0 and covariance σ²I, and ŵ_φ is affine in λ. Fix λ₀ and let φ* = argmin_φ E_{ε∼p(ε|σ)}[ f(λ₀ + ε, ŵ_φ(λ₀ + ε)) ]. Then ŵ_φ*(λ₀) = w*(λ₀) and (∂ŵ_φ*/∂λ)(λ₀) = (∂w*/∂λ)(λ₀).


Proof. See Appendix B.3. ∎
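This can be illustrated numerically on a toy quadratic (our own illustrative choice, not an experiment from the paper): with lower-level objective f(λ, w) = (w − (3λ + 1))², the exact best response is w*(λ) = 3λ + 1. Fitting an affine ŵ_φ(λ) = a + bλ by stochastic gradient descent on Gaussian perturbations of λ recovers both the best-response value and its Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
lam0, sigma = 1.0, 0.5  # current hyperparameter and perturbation scale
a, b = 0.0, 0.0         # affine approximation w_hat(lam) = a + b*lam
lr = 0.05

for _ in range(5000):
    lam = lam0 + sigma * rng.standard_normal()  # lam0 + eps
    w_hat = a + b * lam
    g = 2.0 * (w_hat - (3.0 * lam + 1.0))  # df/dw evaluated at w_hat
    a -= lr * g          # chain rule: dw_hat/da = 1
    b -= lr * g * lam    # chain rule: dw_hat/db = lam

# The fit matches the best response at lam0 and its Jacobian (slope 3).
assert abs((a + b * lam0) - (3.0 * lam0 + 1.0)) < 0.1
assert abs(b - 3.0) < 0.1
```

Because the lower-level objective is quadratic, the best response is itself affine, so the local fit recovers it exactly up to stochastic-gradient noise.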

Figure 1: Best-response architecture for an L2-Jacobian-regularized two-layer linear network.

3.4 Adapting the Hyperparameter Distribution

Figure 2: The effect of the sampled neighborhood. Left: If the sampled neighborhood is too small (e.g., a point mass) the approximation learned will only match the exact best-response at the current hyperparameter, with no guarantee that its gradient matches that of the best-response. Middle: If the sampled neighborhood is not too small or too wide, the gradient of the approximation will match that of the best-response. Right: If the sampled neighborhood is too wide, the approximation will be insufficiently flexible to model the best-response, and again the gradients will not match.

The entries of σ control the scale of the hyperparameter distribution on which ŵ_φ is trained. If the entries are too large, then ŵ_φ will not be flexible enough to capture the best-response over the samples. However, the entries must remain large enough to force ŵ_φ to capture the shape of the best-response locally around the current hyperparameter values. We illustrate this in Figure 2. As the smoothness of the loss landscape changes during training, it may be beneficial to vary σ.

To address these issues, we propose adjusting σ during training based on the sensitivity of the upper-level objective to the sampled hyperparameters. We include an entropy term weighted by τ ≥ 0 which acts to enlarge the entries of σ. The resulting objective is:

min_{λ,σ} E_{ε∼p(ε|σ)} [ L_V(λ + ε, ŵ_φ(λ + ε)) ] − τ H[p(ε|σ)]    (15)
This is similar to a variational inference objective, where the first term is analogous to the negative log-likelihood. As τ ranges from 0 to 1, our objective interpolates between variational optimization (Staines & Barber, 2012) and variational inference, as noted by Khan et al. (2018). Similar objectives have been used in the variational inference literature for better training (Blundell et al., 2015) and representation learning (Higgins et al., 2017).

Minimizing the first term on its own eventually moves all probability mass towards an optimum λ*, resulting in σ → 0 if λ* is an isolated local minimum. The entropy term compels σ to balance between shrinking to decrease the first term and remaining large enough to avoid a heavy entropy penalty. When benchmarking our algorithm’s performance, we evaluate at the deterministic current hyperparameter λ. (This is a common practice when using stochastic operations during training, such as batch normalization or dropout.)
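Concretely, for a factorized Gaussian p(ε|σ) the entropy has the closed form H[p] = Σᵢ (½ log(2πe) + log σᵢ), so the −τH[p(ε|σ)] term penalizes shrinking any entry of σ toward zero. A minimal check:

```python
import numpy as np

def gaussian_entropy(sigma):
    """Entropy of a factorized Gaussian with per-dimension scales sigma."""
    sigma = np.asarray(sigma, dtype=float)
    return np.sum(0.5 * np.log(2.0 * np.pi * np.e) + np.log(sigma))

# Shrinking any scale lowers the entropy, so -tau * H grows without bound:
# the entropy bonus counteracts collapse of p(eps | sigma) to a point mass.
assert gaussian_entropy([0.1, 1.0]) < gaussian_entropy([1.0, 1.0])
```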

3.5 Training Algorithm

We now describe the complete STN training algorithm and discuss how it can tune hyperparameters that other gradient-based algorithms cannot, such as discrete or stochastic hyperparameters. We use an unconstrained parametrization of the hyperparameters, λ ∈ ℝⁿ. Let r denote the element-wise function which maps λ to the appropriate constrained space; for discrete hyperparameters, r involves a non-differentiable discretization.

Let L̃_T and L̃_V denote training and validation losses which are (possibly stochastic, e.g., if using dropout) functions of the hyperparameters and parameters. Define functions on the unconstrained space by L_T(λ, w) = L̃_T(r(λ), w) and L_V(λ, w) = L̃_V(r(λ), w), where r maps the unconstrained hyperparameters to the constrained space. STNs are trained by a gradient descent scheme which alternates between updating φ for T_train steps to minimize L_T (Eq. 9) and updating λ and σ for T_valid steps to minimize L_V (Eq. 15). We give our complete algorithm as Algorithm 1 and show how it can be implemented in code in Appendix G. The possible non-differentiability of r due to discrete hyperparameters poses no problem. To estimate the derivative of the expectation in Eq. 15 with respect to λ and σ, we can use the reparametrization trick ε = σ ⊙ ζ with ζ ∼ N(0, I) and compute the gradients with respect to λ and σ, neither of whose computation paths involve the discretization. To differentiate with respect to a discrete hyperparameter, there are two cases we must consider:

Algorithm 1 STN Training Algorithm
  Initialize: best-response approximation parameters φ, hyperparameters λ, perturbation scale σ, learning rates α_φ, α_λ
  while not converged do
     for t = 1, …, T_train do
        ε ∼ p(ε|σ)
        φ ← φ − α_φ ∇_φ L_T(λ + ε, ŵ_φ(λ + ε))
     for t = 1, …, T_valid do
        ε ∼ p(ε|σ)
        (λ, σ) ← (λ, σ) − α_λ ∇_{λ,σ} [ L_V(λ + ε, ŵ_φ(λ + ε)) − τ H[p(ε|σ)] ]

Case 1: For most regularization schemes, the hyperparameter enters only the training loss, and hence L_V does not depend on λ directly; the only gradient is through ŵ_φ. Thus, the reparametrization gradient can be used.

Case 2: If L_V relies explicitly on λ, then we can use the REINFORCE gradient estimator (Williams, 1992) to estimate the derivative of the expectation with respect to λ. The number of hidden units in a layer is an example of a hyperparameter that requires this approach, since it directly affects the validation loss. We do not show this in Algorithm 1, since we do not tune any hyperparameters which fall into this case.
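Putting the pieces together, a scalar toy version of Algorithm 1 can be sketched as follows. The losses are illustrative choices of ours (not from the paper): L_T(λ, w) = (w − 1)² + exp(λ)·w², whose exact best response is w*(λ) = 1/(1 + exp(λ)), and L_V(w) = (w − 0.25)², which is minimized when exp(λ) = 3. The sketch alternates between fitting an affine best-response approximation on perturbed hyperparameters and descending the validation loss through it (σ is held fixed here; the full algorithm also adapts it):

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar toy STN loop (a sketch under illustrative losses, not the
# paper's implementation).
#   L_T(lam, w) = (w - 1)^2 + exp(lam) * w^2  =>  w*(lam) = 1 / (1 + exp(lam))
#   L_V(w)      = (w - 0.25)^2                =>  optimal exp(lam) = 3
a, b = 0.5, 0.0          # affine best-response approximation w_hat(lam) = a + b*lam
lam, sigma = 0.0, 0.1    # current hyperparameter; perturbation scale (fixed here)
alpha, beta = 0.03, 0.05 # inner (phi) and outer (lam) learning rates

for _ in range(4000):
    for _ in range(5):  # update phi = (a, b) on the training loss near lam
        z = lam + sigma * rng.standard_normal()
        w = a + b * z
        g = 2.0 * (w - 1.0) + 2.0 * np.exp(z) * w  # dL_T/dw
        a -= alpha * g
        b -= alpha * g * z
    # Update lam on the validation loss *through* the approximation:
    # dL_V/dlam = 2 * (w_hat(lam) - 0.25) * b.
    w = a + b * lam
    lam -= beta * 2.0 * (w - 0.25) * b

print(lam)  # the tuned hyperparameter
```

If the approximation tracks the best response, λ settles where w*(λ) = 0.25, i.e. near log 3 ≈ 1.1; note that the outer update never differentiates the training loss with respect to λ.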

4 Experiments

We applied our method to convolutional networks and LSTMs (Hochreiter & Schmidhuber, 1997), yielding self-tuning CNNs (ST-CNNs) and self-tuning LSTMs (ST-LSTMs). We first investigated the behavior of STNs in a simple setting where we tuned a single hyperparameter, and found that STNs discovered hyperparameter schedules that outperformed fixed hyperparameter values. Next, we compared the performance of STNs to commonly-used hyperparameter optimization methods on the CIFAR-10 (Krizhevsky & Hinton, 2009) and PTB (Marcus et al., 1993) datasets.

4.1 Hyperparameter Schedules

Due to the joint optimization of the hypernetwork weights and hyperparameters, STNs do not use a single, fixed hyperparameter during training. Instead, STNs discover schedules for adapting the hyperparameters online, which can outperform any fixed hyperparameter. We examined this behavior in detail on the PTB corpus (Marcus et al., 1993) using an ST-LSTM to tune the output dropout rate applied to the hidden units.

The schedule discovered by an ST-LSTM for output dropout, shown in Figure 3, outperforms the best fixed output dropout rate (0.68) found by a fine-grained grid search, achieving 82.58 vs. 85.83 validation perplexity. We claim that this is a consequence of the schedule, and not of regularizing effects from sampling hyperparameters or the limited capacity of ŵ_φ.

To rule out the possibility that the improved performance is due to stochasticity introduced by sampling hyperparameters during STN training, we trained a standard LSTM while perturbing its dropout rate around the best value found by grid search. We used (1) random Gaussian perturbations, and (2) sinusoid perturbations for a cyclic regularization schedule. STNs outperformed both perturbation methods (Table 1), showing that the improvement is not merely due to hyperparameter stochasticity. Details and plots of each perturbation method are provided in Appendix F.

Method | Val | Test
Fixed output dropout (p = 0.68) | 85.83 | 83.19
Fixed (p = 0.68) w/ Gaussian noise | 85.87 | 82.29
Fixed (p = 0.68) w/ sinusoid noise | 85.29 | 82.15
Fixed (p = 0.78, final STN value) | 89.65 | 86.90
STN | 82.58 | 79.02
LSTM w/ STN schedule | 82.87 | 79.93
Table 1: Comparing an LSTM trained with fixed and perturbed output dropout rates, an STN, and an LSTM trained with the STN schedule.
Figure 3: Dropout schedules found by the ST-LSTM for different initial dropout rates.

To determine whether the limited capacity of ŵ_φ acts as a regularizer, we trained a standard LSTM from scratch using the schedule for output dropout discovered by the ST-LSTM. Using this schedule, the standard LSTM performed nearly as well as the STN, providing evidence that the schedule itself (rather than some other aspect of the STN) was responsible for the improvement over a fixed dropout rate. To further demonstrate the importance of the hyperparameter schedule, we also trained a standard LSTM from scratch using the final dropout value found by the STN (0.78), and found that it did not perform as well as when following the schedule. The final validation and test perplexities of each variant are shown in Table 1.

Next, we show in Figure 3 that the STN discovers the same schedule regardless of the initial hyperparameter values. Because hyperparameters adapt over a shorter timescale than the weights, we find that at any given point in training, the hyperparameter adaptation has already equilibrated. As shown empirically in Appendix F, low regularization is best early in training, while higher regularization is better later on. We found that the STN schedule implements a curriculum by using a low dropout rate early in training, aiding optimization, and then gradually increasing the dropout rate, leading to better generalization.

Method | Val Perplexity | Test Perplexity | Val Loss | Test Loss
Grid Search | 97.32 | 94.58 | 0.794 | 0.809
Random Search | 84.81 | 81.46 | 0.921 | 0.752
Bayesian Optimization | 72.13 | 69.29 | 0.636 | 0.651
STN | 70.30 | 67.68 | 0.575 | 0.576
Table 2: Final validation and test perplexity of each method on the PTB word-level language-modeling task, and final validation and test loss on the CIFAR-10 image-classification task.

4.2 Language modeling

We evaluated an ST-LSTM on the PTB corpus (Marcus et al., 1993), which is widely used as a benchmark for RNN regularization due to its small size (Gal & Ghahramani, 2016; Merity et al., 2018; Wen et al., 2018). We used a 2-layer LSTM with 650 hidden units per layer and 650-dimensional word embeddings. We tuned 7 hyperparameters: variational dropout rates for the input, hidden state, and output; embedding dropout (which zeroes out entire rows of the embedding matrix); DropConnect (Wan et al., 2013) on the hidden-to-hidden weight matrix; and two coefficients that control the strength of activation regularization and temporal activation regularization, respectively. For LSTM tuning, we obtained the best results when using a fixed perturbation scale of 1 for the hyperparameters. Additional details about the experimental setup and the role of these hyperparameters can be found in Appendix D.

(a) Time comparison
(b) STN schedule for dropouts
(c) STN schedule for the activation and temporal activation regularization coefficients
Figure 4: (a) A comparison of the best validation perplexity achieved on PTB over time, by grid search, random search, Bayesian optimization, and STNs. STNs achieve better (lower) validation perplexity in less time than the other methods. (b) The hyperparameter schedule found by the STN for each type of dropout. (c) The hyperparameter schedule found by the STN for the coefficients of activation regularization and temporal activation regularization.

We compared STNs to grid search, random search, and Bayesian optimization.⁶ Figure 4(a) shows the best validation perplexity achieved by each method over time. STNs outperform the other methods, achieving lower validation perplexity more quickly. The final validation and test perplexities achieved by each method are shown in Table 2. We show the schedules the STN finds for each hyperparameter in Figures 4(b) and 4(c); we observe that they are nontrivial, with some forms of dropout used to a greater extent at the start of training (including input and hidden dropout), some used throughout training (output dropout), and some that are increased over the course of training (embedding and weight dropout).

⁶For grid search and random search, we used the Ray Tune library; for Bayesian optimization, we used Spearmint.

4.3 Image Classification

Figure 5: A comparison of the best validation loss achieved on CIFAR-10 over time, by grid search, random search, Bayesian optimization, and STNs. STNs outperform other methods for many computational budgets.

We evaluated ST-CNNs on the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset, where it is easy to overfit with high-capacity networks. We used the AlexNet architecture (Krizhevsky et al., 2012), and tuned: (1) continuous hyperparameters controlling per-layer activation dropout, input dropout, and scaling noise applied to the input, (2) discrete data augmentation hyperparameters controlling the length and number of cut-out holes (DeVries & Taylor, 2017), and (3) continuous data augmentation hyperparameters controlling the amount of noise to apply to the hue, saturation, brightness, and contrast of an image. In total, we considered 15 hyperparameters.

We compared STNs to grid search, random search, and Bayesian optimization. Figure 5 shows the lowest validation loss achieved by each method over time, and Table 2 shows the final validation and test losses for each method. Details of the experimental set-up are provided in Appendix E. Again, STNs find better hyperparameter configurations in less time than other methods. The hyperparameter schedules found by the STN are shown in Figure 6.

Figure 6: The hyperparameter schedule prescribed by the STN while training for image classification. The dropouts are indexed by the convolutional layer they are applied to. FC dropout is for the fully-connected layers.

5 Related Work

Bilevel Optimization. Colson et al. (2007) provide an overview of bilevel problems, and a comprehensive textbook was written by Bard (2013). When the objectives/constraints are restricted to be linear, quadratic, or convex, a common approach replaces the lower-level problem with its KKT conditions added as constraints for the upper-level problem (Hansen et al., 1992; Vicente et al., 1994). In the unrestricted setting, our work loosely resembles trust-region methods (Colson et al., 2005), which repeatedly approximate the problem locally using a simpler bilevel program. In closely related work, Sinha et al. (2013) used evolutionary techniques to estimate the best-response function iteratively.

Hypernetworks. First considered by Schmidhuber (1993, 1992), hypernetworks are functions mapping to the weights of a neural net. Predicting weights in CNNs has been developed in various forms (Denil et al., 2013; Yang et al., 2015). Ha et al. (2016) used hypernetworks to generate weights for modern CNNs and RNNs. Brock et al. (2017) used hypernetworks to globally approximate a best-response for architecture search. Because the architecture is not optimized during training, they require a large hypernetwork, unlike ours which locally approximates the best-response.

Gradient-Based Hyperparameter Optimization. There are two main approaches. The first approach approximates w*(λ) using w_K(λ), the value of w after K steps of gradient descent on L_T with respect to w starting from an initialization w₀. The descent steps are differentiated through to approximate ∂w*/∂λ ≈ ∂w_K/∂λ. This approach was proposed by Domke (2012) and used by Maclaurin et al. (2015), Luketina et al. (2016) and Franceschi et al. (2018). The second approach uses the Implicit Function Theorem to derive ∂w*/∂λ under certain conditions. This was first developed for hyperparameter optimization in neural networks (Larsen et al., 1996) and developed further by Pedregosa (2016). Similar approaches have been used for hyperparameter optimization in log-linear models (Foo et al., 2008), kernel selection (Chapelle et al., 2002; Seeger, 2007), and image reconstruction (Kunisch & Pock, 2013; Calatroni et al., 2015). Both approaches struggle with certain hyperparameters, since they differentiate gradient descent or the training loss with respect to the hyperparameters. In addition, differentiating gradient descent becomes prohibitively expensive as the number of descent steps increases, while implicitly deriving ∂w*/∂λ requires Hessian-vector products with conjugate gradient solvers to avoid directly computing the Hessian.

Model-Based Hyperparameter Optimization. A common model-based approach is Bayesian optimization, which models p(y | λ, D), the conditional probability of performance y on some metric given hyperparameters λ and a dataset D of past evaluations. The model p can be constructed with various methods (Hutter et al., 2011; Bergstra et al., 2011; Snoek et al., 2012, 2015). The dataset D is built iteratively, where the next λ to train on is chosen by maximizing an acquisition function which balances exploration and exploitation. Training each model to completion can be avoided if assumptions are made on learning curve behavior (Swersky et al., 2014; Klein et al., 2017). These approaches require building inductive biases into p which may not hold in practice, do not take advantage of the network structure when used for hyperparameter optimization, and do not scale well with the number of hyperparameters. However, these approaches have consistency guarantees in the limit, unlike ours.

Model-Free Hyperparameter Optimization. Model-free approaches include grid search and random search. Bergstra & Bengio (2012) advocated using random search over grid search. Successive Halving (Jamieson & Talwalkar, 2016) and Hyperband (Li et al., 2017) extend random search by adaptively allocating resources to promising configurations using multi-armed bandit techniques. These methods ignore structure in the problem, unlike our method, which uses rich gradient information. However, model-free methods are trivial to parallelize over computing resources and tend to perform well in practice.
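A minimal sketch of Successive Halving (ours, not the cited implementation; the toy evaluation function and its noise model are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def successive_halving(configs, evaluate, budget=1, eta=2):
    """Keep the best 1/eta of the configurations each round, multiplying
    the per-configuration budget by eta, until one remains."""
    configs = list(configs)
    while len(configs) > 1:
        scores = [evaluate(c, budget) for c in configs]  # lower is better
        order = np.argsort(scores)
        configs = [configs[i] for i in order[: max(1, len(configs) // eta)]]
        budget *= eta
    return configs[0]

def evaluate(lam, budget):
    # toy stand-in: the loss estimate gets less noisy with more budget,
    # mimicking a longer training run
    return (lam - 0.3) ** 2 + rng.normal(scale=0.1 / budget)

best = successive_halving(np.linspace(0.0, 1.0, 16), evaluate, budget=1)
print(best)  # a configuration near 0.3
```

Because each round's evaluations are independent, the inner loop parallelizes trivially, which is the practical strength noted above.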

Hyperparameter Scheduling. Population Based Training (PBT) (Jaderberg et al., 2017) considers schedules for hyperparameters. In PBT, a population of networks is trained in parallel. The performance of each network is evaluated periodically, and the weights of under-performing networks are replaced by the weights of better-performing ones; the hyperparameters of the better network are also copied and randomly perturbed for training the new network clone. In this way, a single model can experience different hyperparameter settings over the course of training, implementing a schedule. STNs replace the population of networks by a single best-response approximation and use gradients to tune hyperparameters during a single training run.
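The exploit/explore loop of PBT can be sketched as follows (our toy model, not the cited implementation; the objective, population size, and perturbation scale are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_step(w, lam, lr=0.1):
    # one descent step on a toy regularized training objective
    return w - lr * (2 * (w - 2.0) + lam * w)

def val_loss(w):
    # validation prefers a different optimum than the unregularized trainer
    return (w - 1.5) ** 2

# population of (weight, hyperparameter) pairs
pop = [(rng.normal(), 10 ** rng.uniform(-2, 1)) for _ in range(8)]

for step in range(100):
    pop = [(train_step(w, lam), lam) for w, lam in pop]
    if step % 10 == 9:                            # periodic exploit/explore
        pop.sort(key=lambda p: val_loss(p[0]))
        top, bottom = pop[:4], pop[4:]
        replacements = []
        for _ in bottom:
            w, lam = top[rng.integers(len(top))]  # copy a better performer
            lam *= 10 ** rng.uniform(-0.2, 0.2)   # perturb its hyperparameter
            replacements.append((w, lam))
        pop = top + replacements

best_w, best_lam = min(pop, key=lambda p: val_loss(p[0]))
print(best_lam, val_loss(best_w))
```

The surviving hyperparameter values drift over the run, so each member experiences a schedule rather than a fixed setting; STNs obtain a comparable effect with one network and gradient steps on $\lambda$.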

6 Conclusion

We introduced Self-Tuning Networks (STNs), which efficiently approximate the best-response of parameters to hyperparameters by scaling and shifting their hidden units. This allowed us to use gradient-based optimization to tune various regularization hyperparameters, including discrete hyperparameters. We showed that STNs discover hyperparameter schedules that can outperform fixed hyperparameters. We validated the approach on large-scale problems and showed that STNs achieve better generalization performance than competing approaches, in less time. We believe STNs offer a compelling path towards large-scale, automated hyperparameter tuning for neural networks.


We thank Matt Johnson for helpful discussions and advice. MM is supported by an NSERC CGS-M award, and PV is supported by an NSERC PGS-D award. RG acknowledges support from the CIFAR Canadian AI Chairs program.


Appendix A Table of Notation

Hyperparameters and parameters
Current, fixed hyperparameters and parameters
Hyperparameter and elementary parameter dimension
Lower-level & upper-level objective
Function mapping unconstrained hyperparameters to the appropriate restricted space
Training loss & validation loss
Best-response of the parameters to the hyperparameters
Single-level objective from best-response, equals
Optimal hyperparameters
Parametric approximation to the best-response function
Approximate best-response parameters
Scale of the hyperparameter noise distribution
The sigmoid function
Sampled perturbation noise, to be added to hyperparameters
The noise distribution and induced hyperparameter distribution
A learning rate
Number of training steps on the training and validation data
An input datapoint and its associated target
A data set consisting of tuples of inputs and targets
The dimensionality of input data
Prediction function for input data and elementary parameters w
Row-wise rescaling (not elementwise multiplication)
First and second layer weights of the linear network in Problem 13
The basis change matrix and solution to the unregularized Problem 13
The best response weights of the linear network in Problem 13
Activations of hidden units in the linear network of Problem 13
A layer’s weight matrix and bias
A layer’s input dimensionality and output dimensionality
The (validation loss) direct (hyperparameter) gradient
The (elementary parameter) response gradient
The (validation loss) response gradient
The hyperparameter gradient: the sum of the validation loss's direct and response gradients
Table 3: Table of Notation

Appendix B Proofs

B.1 Lemma 1

Because $w^*(\lambda_0)$ solves Problem 4b given $\lambda_0$, by the first-order optimality condition we must have:

$$\nabla_w f(\lambda_0, w^*(\lambda_0)) = 0$$

The Jacobian of $\nabla_w f$ decomposes as a block matrix with sub-blocks given by:

$$\frac{\partial [\nabla_w f]}{\partial \lambda} = \frac{\partial^2 f}{\partial \lambda\, \partial w}, \qquad \frac{\partial [\nabla_w f]}{\partial w} = \frac{\partial^2 f}{\partial w^2}$$

We know that $(\lambda_0, w^*(\lambda_0))$ lies in some neighborhood on which $f$ is twice continuously differentiable, so $\nabla_w f$ is continuously differentiable in this neighborhood. By assumption, the Hessian $\frac{\partial^2 f}{\partial w^2}$ is positive definite and hence invertible at $(\lambda_0, w^*(\lambda_0))$. By the Implicit Function Theorem, there exists a neighborhood $U$ of $\lambda_0$ and a unique continuously differentiable function $w^*$ defined on $U$ such that $\nabla_w f(\lambda, w^*(\lambda)) = 0$ for all $\lambda \in U$, agreeing with $w^*(\lambda_0)$ at $\lambda_0$.

Furthermore, by continuity we know that there is a neighborhood of $(\lambda_0, w^*(\lambda_0))$ on which $\frac{\partial^2 f}{\partial w^2}$ is positive definite. Shrinking $U$ if necessary, we can conclude that $\frac{\partial^2 f}{\partial w^2}(\lambda, w^*(\lambda)) \succ 0$ for all $\lambda \in U$. Combining this with $\nabla_w f(\lambda, w^*(\lambda)) = 0$ and using second-order sufficient optimality conditions, we conclude that $w^*(\lambda)$ is the unique solution to Problem 4b for all $\lambda \in U$.
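As a numerical sanity check (ours, not part of the proof), the response Jacobian the Implicit Function Theorem yields, $\frac{\partial w^*}{\partial \lambda} = -\big(\frac{\partial^2 f}{\partial w^2}\big)^{-1} \frac{\partial^2 f}{\partial w\, \partial \lambda}$, can be verified on an $L_2$-regularized least-squares objective with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X, t = rng.normal(size=(30, 4)), rng.normal(size=30)
lam = 0.7

def w_star(lam):
    # closed-form minimizer of f(lam, w) = 0.5||Xw - t||^2 + 0.5*lam*||w||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ t)

# implicit-function-theorem response Jacobian:
# dw*/dlam = -(d^2f/dw^2)^{-1} (d^2f/dw dlam) = -(X^T X + lam I)^{-1} w*(lam)
H = X.T @ X + lam * np.eye(4)
dw_implicit = -np.linalg.solve(H, w_star(lam))

# finite-difference check of the same derivative
eps = 1e-6
dw_fd = (w_star(lam + eps) - w_star(lam - eps)) / (2 * eps)
print(np.max(np.abs(dw_implicit - dw_fd)))  # small: the two estimates agree
```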

B.2 Lemma 2

This discussion mostly follows Hastie et al. (2001). We let $X$ denote the $n \times d$ data matrix, where $n$ is the number of training examples and $d$ is the dimensionality of the data, and we let $t$ denote the associated targets. We can write the SVD decomposition of $X$ as:

$$X = U S V^\top$$

where $U$ and $V$ are $n \times n$ and $d \times d$ orthogonal matrices and $S$ is a diagonal matrix with entries $s_1 \ge s_2 \ge \cdots \ge s_d \ge 0$. We next simplify the prediction function by collapsing the two layers into a single weight vector $w$, so that the network computes the linear map $x \mapsto w^\top x$. We see that the Jacobian of this map with respect to the input is the constant $w$, and Problem 13 simplifies to standard $L_2$-regularized least-squares linear regression with the following loss function:

$$\mathcal{L}(w) = \|X w - t\|_2^2 + \lambda \|w\|_2^2 \qquad (19)$$

It is well known (see Hastie et al. (2001), Chapter 3) that the optimal solution minimizing Equation 19 is given by:

$$w^*(\lambda) = (X^\top X + \lambda I)^{-1} X^\top t = V \left(S^\top S + \lambda I\right)^{-1} S^\top U^\top t$$

Furthermore, the optimal solution to the unregularized version of Equation 19 is given by:

$$w^*(0) = (X^\top X)^{-1} X^\top t = V \left(S^\top S\right)^{-1} S^\top U^\top t$$
Recall that we defined $Q = V^\top$, i.e., the change-of-basis matrix from the standard basis to the principal components of the data matrix, and we defined $w_0$ to solve the unregularized regression problem given $Q$. Thus, we require that $Q^\top w_0 = w^*(0)$, which implies $w_0 = Q\, w^*(0) = V^\top w^*(0)$.

There are not unique solutions to Problem 13, so we take any functions $W_1^*(\lambda)$ and $W_2^*(\lambda)$ which satisfy $W_2^*(\lambda)\, W_1^*(\lambda) = w^*(\lambda)^\top$ as "best-response functions". We will show that our chosen functions $W_1^*(\lambda) = \operatorname{diag}(c(\lambda))\, Q$ and $W_2^*(\lambda) = w_0^\top$, where $c_i(\lambda) = \frac{s_i^2}{s_i^2 + \lambda}$ for $i = 1, \dots, d$, meet these criteria. We start by noticing that for any $\lambda \ge 0$, we have:

$$w^*(\lambda) = V \left(S^\top S + \lambda I\right)^{-1} S^\top U^\top t = V \operatorname{diag}(c(\lambda)) \left(S^\top S\right)^{-1} S^\top U^\top t = V \operatorname{diag}(c(\lambda))\, V^\top w^*(0)$$

It follows that:

$$W_2^*(\lambda)\, W_1^*(\lambda) = w_0^\top \operatorname{diag}(c(\lambda))\, V^\top = \big(V \operatorname{diag}(c(\lambda))\, w_0\big)^\top = w^*(\lambda)^\top$$
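The closed-form ridge solution and its per-principal-component shrinkage by $s_i^2 / (s_i^2 + \lambda)$, the gating factor behind the lemma, can be verified numerically (our sketch; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 6
X, t = rng.normal(size=(n, d)), rng.normal(size=n)
lam = 2.0

# ridge solution and its SVD form
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) Vt
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ t))
assert np.allclose(w_ridge, w_svd)

# gating view: shrink each principal component of the unregularized
# solution by s_i^2 / (s_i^2 + lam)
w0 = np.linalg.solve(X.T @ X, X.T @ t)
gates = s**2 / (s**2 + lam)
w_gated = Vt.T @ (gates * (Vt @ w0))
print(np.allclose(w_ridge, w_gated))  # True
```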
B.3 Theorem 3

By assumption the objective is quadratic, so there exist $A$, $b$, and $c$ such that:

$$f(w) = \frac{1}{2} w^\top A w + b^\top w + c$$

One can easily compute that:

$$\nabla_w f(w) = A w + b \qquad \text{and} \qquad \nabla_w^2 f(w) = A$$

Since we assume the minimizer is unique, we must have $A \succ 0$. Setting the derivative equal to $0$ and using second-order sufficient conditions, we have:

$$w^* = -A^{-1} b$$

Hence, we find:

$$f(w^*) = c - \frac{1}{2} b^\top A^{-1} b$$
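Before continuing, the minimizer and minimum value of a quadratic $\frac{1}{2} w^\top A w + b^\top w + c$ can be checked numerically (our sketch; the coefficients are random):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)        # symmetric positive definite
b, c = rng.normal(size=5), 0.3

f = lambda w: 0.5 * w @ A @ w + b @ w + c
w_star = -np.linalg.solve(A, b)            # w* = -A^{-1} b

# the closed-form minimum value c - 0.5 b^T A^{-1} b
f_min = c - 0.5 * b @ np.linalg.solve(A, b)
assert np.isclose(f(w_star), f_min)

# any perturbation of w* increases the objective, as A is positive definite
for _ in range(100):
    assert f(w_star + 0.1 * rng.normal(size=5)) >= f_min
print("ok")
```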
We let , and define to be the function given by:


Substituting and simplifying:


Expanding, we find that equation 36 is equal to: