selftuningnetworks
Code for Self-Tuning Networks (ICLR 2019): https://arxiv.org/abs/1903.03088
Hyperparameter optimization can be formulated as a bilevel optimization problem, where the optimal parameters on the training set depend on the hyperparameters. We aim to adapt regularization hyperparameters for neural networks by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases. We show how to construct scalable best-response approximations for neural networks by modeling the best-response as a single network whose hidden units are gated conditionally on the regularizer. We justify this approximation by showing that the exact best-response for a shallow linear network with an L2-regularized Jacobian can be represented by a similar gating mechanism. We fit this model using a gradient-based hyperparameter optimization algorithm which alternates between approximating the best-response around the current hyperparameters and optimizing the hyperparameters using the approximate best-response function. Unlike other gradient-based approaches, we do not require differentiating the training loss with respect to the hyperparameters, allowing us to tune discrete hyperparameters, data augmentation hyperparameters, and dropout probabilities. Because the hyperparameters are adapted online, our approach discovers hyperparameter schedules that can outperform fixed hyperparameter values. Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs).
PyTorch implementation of "STNs" and "Delta-STNs"
Regularization hyperparameters such as weight decay, data augmentation, and dropout (Srivastava et al., 2014) are crucial to the generalization of neural networks, but are difficult to tune. Popular approaches to hyperparameter optimization include grid search, random search (Bergstra & Bengio, 2012), and Bayesian optimization (Snoek et al., 2012). These approaches work well with low-dimensional hyperparameter spaces and ample computational resources; however, they pose hyperparameter optimization as a black-box optimization problem, ignoring structure which can be exploited for faster convergence, and require many training runs.
We can formulate hyperparameter optimization as a bilevel optimization problem. Let w denote parameters (e.g., weights and biases) and λ denote hyperparameters (e.g., dropout probability). Let L_T and L_V be functions mapping parameters and hyperparameters to training and validation losses, respectively. We aim to solve (the uniqueness of the argmin is assumed):
(1)  λ* := argmin_λ L_V(λ, w*)   subject to   w* = argmin_w L_T(λ, w)
Substituting the best-response function w*(λ) := argmin_w L_T(λ, w) gives a single-level problem:
(2)  λ* := argmin_λ L_V(λ, w*(λ))
If the best-response w* is known, the validation loss can be minimized directly by gradient descent using Equation 2, offering dramatic speedups over black-box methods. However, as the solution to a high-dimensional optimization problem, w* is difficult to compute even approximately.
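To make Equation 2 concrete, here is a minimal sketch (our own toy example, not from the paper) using 1-D ridge regression, where the best-response w*(λ) is available in closed form, so the validation loss can be minimized over λ by plain gradient descent and the chain rule:

```python
import math

# Toy illustration of Eq. 2 (our own made-up data and step size):
# 1-D ridge regression, where the best-response w*(lambda) has a closed
# form, so the validation loss can be minimized over lambda directly.
x_tr = [0.0, 1.0, 2.0, 3.0]
t_tr = [0.1, 1.2, 1.9, 3.2]
x_va = [0.5, 1.5, 2.5]
t_va = [0.4, 1.6, 2.4]

sxx = sum(x * x for x in x_tr)
sxt = sum(x * t for x, t in zip(x_tr, t_tr))

def w_star(lam):
    # argmin_w  sum_i (w x_i - t_i)^2 + exp(lam) w^2
    return sxt / (sxx + math.exp(lam))

def val_loss(w):
    return sum((w * x - t) ** 2 for x, t in zip(x_va, t_va))

lam = 2.0
for _ in range(500):
    w = w_star(lam)
    # chain rule: dL_V/dlam = (dL_V/dw) * (dw*/dlam); no inner training loop
    dw_dlam = -math.exp(lam) * sxt / (sxx + math.exp(lam)) ** 2
    dL_dw = sum(2 * (w * x - t) * x for x, t in zip(x_va, t_va))
    lam -= 0.1 * dL_dw * dw_dlam
```

Gradient descent drives λ toward the penalty whose ridge solution best fits the validation set, without ever re-running an inner training loop.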
Following Lorraine & Duvenaud (2018), we propose to approximate the best-response directly with a parametric function ŵ_φ. We jointly optimize φ and λ, first updating φ so that ŵ_φ ≈ w* in a neighborhood around the current hyperparameters, then updating λ by using ŵ_φ as a proxy for w* in Eq. 2:
(3)  λ* ≈ argmin_λ L_V(λ, ŵ_φ(λ))
Finding a scalable approximation ŵ_φ when w represents the weights of a neural network is a significant challenge, as even simple implementations entail significant memory overhead. We show how to construct a compact approximation by modelling the best-response of each row in a layer's weight matrix/bias as a rank-one affine transformation of the hyperparameters. We show that this can be interpreted as computing the activations of a base network in the usual fashion, plus a correction term dependent on the hyperparameters. We justify this approximation by showing that the exact best-response for a shallow linear network with an L2-regularized Jacobian follows a similar structure. We call our proposed networks Self-Tuning Networks (STNs) since they update their own hyperparameters online during training.
STNs enjoy many advantages over other hyperparameter optimization methods. First, they are easy to implement by replacing existing modules in deep learning libraries with "hyper" counterparts which accept an additional vector of hyperparameters as input (we illustrate how this is done for the PyTorch library (Paszke et al., 2017) in Appendix G). Second, because the hyperparameters are adapted online, we ensure that computational effort expended to fit ŵ_φ around previous hyperparameters is not wasted. In addition, this online adaptation yields hyperparameter schedules which we find empirically to outperform fixed hyperparameter settings. Finally, the STN training algorithm does not require differentiating the training loss with respect to the hyperparameters, unlike other gradient-based approaches (Maclaurin et al., 2015; Larsen et al., 1996), allowing us to tune discrete hyperparameters, such as the number of holes to cut out of an image (DeVries & Taylor, 2017), data-augmentation hyperparameters, and discrete-noise dropout parameters. Empirically, we evaluate the performance of STNs on large-scale deep learning problems with the Penn Treebank (Marcus et al., 1993) and CIFAR-10 datasets (Krizhevsky & Hinton, 2009), and find that they substantially outperform baseline methods.

A bilevel optimization problem consists of two subproblems called the upper-level and lower-level problems, where the upper-level problem must be solved subject to optimality of the lower-level problem. Minimax problems are an example of bilevel programs where the upper-level objective equals the negative lower-level objective. Bilevel programs were first studied in economics to model leader/follower firm dynamics (Von Stackelberg, 2010) and have since found uses in various fields (see Colson et al. (2007)
for an overview). In machine learning, many problems can be formulated as bilevel programs, including hyperparameter optimization, GAN training (Goodfellow et al., 2014), meta-learning, and neural architecture search (Zoph & Le, 2016).

Even if all objectives and constraints are linear, bilevel problems are strongly NP-hard (Hansen et al., 1992; Vicente et al., 1994). Due to the difficulty of obtaining exact solutions, most work has focused on restricted settings, considering linear, quadratic, and convex functions. In contrast, we focus on obtaining local solutions in the nonconvex, differentiable, and unconstrained setting. Let F and f denote the upper- and lower-level objectives (e.g., L_V and L_T) and λ and w denote the upper- and lower-level parameters. We aim to solve:
(4a)  min_λ F(λ, w)
(4b)  subject to  w ∈ argmin_{w'} f(λ, w')
It is desirable to design a gradient-based algorithm for solving Problem 4, since using gradient information provides drastic speedups over black-box optimization methods (Nesterov, 2013). The simplest method is simultaneous gradient descent, which updates λ using ∂F/∂λ and w using ∂f/∂w. However, simultaneous gradient descent often gives incorrect solutions as it fails to account for the dependence of w on λ. Consider the relatively common situation where F doesn't depend directly on λ, so that ∂F/∂λ ≡ 0 and hence λ is never updated.
A more principled approach to solving Problem 4 is to use the best-response function (Gibbons, 1992). Assume the lower-level Problem 4b has a unique optimum w*(λ) for each λ. Substituting the best-response function converts Problem 4 into a single-level problem:
(5)  min_λ F*(λ) := F(λ, w*(λ))
If w* is differentiable, we can minimize Eq. 5 using gradient descent on F* with respect to λ. This method requires a unique optimum of Problem 4b for each λ and differentiability of w*. In general, these conditions are difficult to verify. We give sufficient conditions for them to hold in a neighborhood of a point (λ0, w0) where w0 solves Problem 4b given λ0.
See Appendix B.1. ∎
The gradient of F* decomposes into two terms, which we call the direct gradient and the response gradient. The direct gradient captures the direct reliance of the upper-level objective on λ, while the response gradient captures how the lower-level parameter w responds to changes in the upper-level parameter λ:
(6)  dF*/dλ(λ) = ∂F/∂λ(λ, w*(λ))  [direct gradient]  +  ∂F/∂w(λ, w*(λ)) · ∂w*/∂λ(λ)  [response gradient]
Even if ∂F/∂λ is nonzero and simultaneous gradient descent is possible, including the response gradient can stabilize optimization by converting the bilevel problem into a single-level one, as noted by Metz et al. (2016) for GAN optimization. Conversion to a single-level problem ensures that the gradient vector field is conservative, avoiding pathological issues described by Mescheder et al. (2017).
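The decomposition in Equation 6 can be checked numerically on a toy quadratic bilevel problem (our own illustration; the functions and constants are made up):

```python
# Numeric check of the gradient decomposition (Eq. 6) on a toy quadratic
# bilevel problem where the best-response has a closed form.
a = 1.7                                              # w*(lam) = a * lam
F = lambda lam, w: (w - 1.0) ** 2 + 0.1 * lam ** 2   # upper-level objective
w_star = lambda lam: a * lam                         # lower-level best-response
F_star = lambda lam: F(lam, w_star(lam))             # single-level objective

lam = 0.3
direct = 0.2 * lam                        # dF/dlam holding w fixed
response = 2 * (w_star(lam) - 1.0) * a    # (dF/dw) * (dw*/dlam)
total = direct + response

# Compare against a finite difference of F*
eps = 1e-6
fd = (F_star(lam + eps) - F_star(lam - eps)) / (2 * eps)
```

The sum of the direct and response terms matches the finite-difference derivative of the single-level objective.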
In general, the solution to Problem 4b is a set, but assuming uniqueness of a solution and differentiability of w* can yield fruitful algorithms in practice. In fact, gradient-based hyperparameter optimization methods can often be interpreted as approximating either the best-response w* or its Jacobian ∂w*/∂λ, as detailed in Section 5. However, these approaches can be computationally expensive and often struggle with discrete hyperparameters and stochastic hyperparameters like dropout probabilities, since they require differentiating the training loss with respect to the hyperparameters. Promising approaches to approximate w* directly were proposed by Lorraine & Duvenaud (2018), and are detailed below.
1. Global Approximation. The first algorithm proposed by Lorraine & Duvenaud (2018) approximates w* as a differentiable function ŵ_φ with parameters φ. If w represents neural net weights, then the mapping ŵ_φ is a hypernetwork (Schmidhuber, 1992; Ha et al., 2016). If the distribution p(λ) is fixed, then gradient descent with respect to φ minimizes:
(7)  E_{λ∼p(λ)} [ f(λ, ŵ_φ(λ)) ]
If p(λ) is broad and ŵ_φ is sufficiently flexible, then ŵ_φ can be used as a proxy for w* in Problem 5, resulting in the following objective:
(8)  min_λ F(λ, ŵ_φ(λ))
2. Local Approximation. In practice, ŵ_φ is usually insufficiently flexible to model w* over the whole support of p(λ). The second algorithm of Lorraine & Duvenaud (2018) locally approximates w* in a neighborhood around the current upper-level parameter λ. They set p(ε | σ) to a factorized Gaussian noise distribution with a fixed scale parameter σ, and found φ by minimizing the objective:
(9)  E_{ε∼p(ε|σ)} [ f(λ + ε, ŵ_φ(λ + ε)) ]
Intuitively, the upper-level parameter is perturbed by a small amount, so ŵ_φ learns how the lower-level parameter responds. An alternating gradient descent scheme is used, where φ is updated to minimize Equation 9 and λ is updated to minimize Equation 8. This approach worked for problems using L2 regularization on MNIST (LeCun et al., 1998). However, it is unclear if the approach works with different regularizers or scales to larger problems. It requires a mapping from the hyperparameters to the full parameter space, which is a priori unwieldy for high-dimensional w. It is also unclear how to set σ, which defines the size of the neighborhood on which ŵ_φ is trained, or if the approach can be adapted to discrete and stochastic hyperparameters.
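The alternating scheme can be sketched on a toy problem (our own illustration, assuming a scalar hyperparameter and an affine approximation ŵ_φ(λ) = φ0 + φ1·λ; the losses and constants are made up):

```python
import random
random.seed(0)

a = 1.7
f = lambda lam, w: (w - a * lam) ** 2                # lower-level (training) loss
F = lambda lam, w: (w - 1.0) ** 2 + 0.1 * lam ** 2   # upper-level (validation) loss

p0, p1 = 0.0, 0.0     # phi: affine approximation w_hat(lam) = p0 + p1 * lam
lam, sigma = 0.0, 0.5
lr = 0.05

for step in range(4000):
    # (a) fit w_hat on hyperparameters perturbed around the current lambda (Eq. 9)
    lp = lam + random.gauss(0.0, sigma)
    w_hat = p0 + p1 * lp
    g = 2 * (w_hat - a * lp)          # d f / d w_hat
    p0 -= lr * g
    p1 -= lr * g * lp
    # (b) hyperparameter step through the (frozen) approximation (Eq. 8)
    w_hat = p0 + p1 * lam
    dF_dlam = 2 * (w_hat - 1.0) * p1 + 0.2 * lam
    lam -= lr * dF_dlam
```

Here the true best-response is w*(λ) = a·λ, so the fitted slope φ1 should approach a and λ should approach the minimizer of F*(λ) = (aλ − 1)² + 0.1λ².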
In this section, we first construct a best-response approximation that is memory efficient and scales to large neural networks. We justify this approximation through analysis of simpler situations. Then, we describe a method to automatically adjust the scale σ of the neighborhood on which ŵ_φ is trained. Finally, we formally describe our algorithm and discuss how it easily handles discrete and stochastic hyperparameters. We call the resulting networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs).
We propose to approximate the best-response for a given layer's weight matrix W and bias b as an affine transformation of the hyperparameters λ (we describe modifications for convolutional filters in Appendix C):
(10)  Ŵ_φ(λ) = W_elem + (V λ) ⊙_row W_hyper,    b̂_φ(λ) = b_elem + (C λ) ⊙ b_hyper
Here, ⊙ indicates elementwise multiplication and ⊙_row indicates row-wise rescaling. This architecture computes the usual elementary weight/bias, plus an additional weight/bias which has been scaled by a linear transformation of the hyperparameters. Alternatively, it can be interpreted as directly operating on the pre-activations of the layer, adding a correction to the usual pre-activation to account for the hyperparameters:
(11)  Ŵ_φ(λ)x + b̂_φ(λ) = (W_elem x + b_elem) + (V λ) ⊙ (W_hyper x) + (C λ) ⊙ b_hyper
This best-response architecture is tractable to compute and memory-efficient: for a layer with input dimensionality D_in and output dimensionality D_out, it requires D_out(D_in + 1) parameters to represent the elementary weight/bias and D_out(D_in + 1 + 2n) parameters to represent the hyper weight/bias and the scaling matrices, where n is the number of hyperparameters. Furthermore, it enables parallelism: since the predictions can be computed by transforming the pre-activations (Equation 11), the hyperparameters for different examples in a batch can be perturbed independently, improving sample efficiency. In practice, the approximation can be implemented by simply replacing existing modules in deep learning libraries with "hyper" counterparts which accept an additional vector of hyperparameters as input (we illustrate how this is done for the PyTorch library (Paszke et al., 2017) in Appendix G).
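A minimal sketch of such a "hyper" layer, assuming the affine gating form of Equations 10 and 11 (the class name, initialization, and NumPy implementation are ours for illustration, not the paper's reference code):

```python
import numpy as np

class HyperLinear:
    """Linear layer whose weights are an affine function of hyperparameters.

    Illustrative sketch of the best-response layer in Eq. 10/11; names and
    initialization are our own assumptions.
    """
    def __init__(self, d_in, d_out, n_hyper, rng):
        self.W_elem = rng.standard_normal((d_out, d_in)) * 0.1
        self.b_elem = np.zeros(d_out)
        self.W_hyper = rng.standard_normal((d_out, d_in)) * 0.1
        self.b_hyper = rng.standard_normal(d_out) * 0.1
        self.V = rng.standard_normal((d_out, n_hyper)) * 0.1  # row scales for W_hyper
        self.C = rng.standard_normal((d_out, n_hyper)) * 0.1  # scales for b_hyper

    def forward(self, x, lam):
        # Usual pre-activation plus a hyperparameter-dependent correction
        # (Eq. 11); lam may differ per example in the batch.
        base = x @ self.W_elem.T + self.b_elem
        corr = (lam @ self.V.T) * (x @ self.W_hyper.T) + (lam @ self.C.T) * self.b_hyper
        return base + corr

rng = np.random.default_rng(0)
layer = HyperLinear(d_in=4, d_out=3, n_hyper=2, rng=rng)
x = rng.standard_normal((5, 4))     # batch of 5 examples
lam = rng.standard_normal((5, 2))   # independently perturbed hyperparameters
out = layer.forward(x, lam)
```

Because the correction acts on pre-activations, each example in the batch can carry its own perturbed hyperparameter vector, as described above.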
Given that the best-response function is a mapping from the hyperparameters to the high-dimensional weight space, why should we expect to be able to represent it compactly? And why in particular would Equation 10 be a reasonable approximation? In this section, we exhibit a model whose best-response function can be represented exactly using a minor variant of Equation 10: a linear network with Jacobian norm regularization. In particular, the best-response takes the form of a network whose hidden units are modulated conditionally on the hyperparameters.
Consider using a 2-layer linear network with first-layer weights Q and second-layer weights a to predict targets t from inputs x:
(12)  ŷ(x) = aᵀ Q x
Suppose we use a squared-error loss regularized with an L2 penalty on the Jacobian ∂ŷ/∂x, where the penalty weight is parametrized by λ ∈ R, mapped using exp to lie in (0, ∞):
(13)  L_T(λ, Q, a) = Σ_{(x,t)} (ŷ(x) − t)² + exp(λ) ‖∂ŷ/∂x‖²₂
Let Q0 = Vᵀ, where Vᵀ is the change-of-basis matrix to the principal components of the data matrix, and let a0 = Q0 w0, where w0 solves the unregularized version of Problem 13. Then there exists v such that the best-response function (this is an abuse of notation, since in general there is not a unique solution to Problem 13 for each λ) is:

Q*(λ) = σ(v − λ1) ⊙_row Q0,    a*(λ) = a0,

where σ is the sigmoid function.
See Appendix B.2. ∎
Observe that the best-response can be implemented as a regular network with weights (Q0, a0) with an additional sigmoidal gating of its hidden units h = Q0 x:
(14)  ŷ(x, λ) = a0ᵀ [ σ(v − λ1) ⊙ (Q0 x) ]
This architecture is shown in Figure 1. Inspired by this example, we use a similar gating of the hidden units to approximate the best-response for deep, nonlinear networks.
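As a sanity check on the gating mechanism (our own numeric experiment, using the exp(λ) parametrization of the penalty as above): for ridge regression, the per-component shrinkage of the unregularized solution in the principal-component basis is a sigmoid of λ, since s_j²/(s_j² + e^λ) = σ(2 log s_j − λ):

```python
import numpy as np

# Numeric check (our own): the ridge best-response equals the unregularized
# solution with sigmoidal gating of its principal components.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
t = rng.standard_normal(50)
lam = 0.7

# Direct ridge solution with penalty weight exp(lam)
w_ridge = np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(4), X.T @ t)

# Gated form: rotate the unregularized solution into the PC basis and gate
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w0 = np.linalg.solve(X.T @ X, X.T @ t)        # unregularized solution
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
gate = sigmoid(2.0 * np.log(s) - lam)         # per-unit sigmoidal gate
w_gated = Vt.T @ (gate * (Vt @ w0))
```

The two solutions coincide, which is exactly the structure that Equation 14 exploits.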
The sigmoidal gating architecture of the preceding section can be further simplified if one only needs to approximate the best-response function for a small range of hyperparameter values. In particular, for a narrow enough hyperparameter distribution, a smooth best-response function can be approximated by an affine function (i.e., its first-order Taylor approximation). Hence, we replace the sigmoidal gating with linear gating, so that the weights are affine in the hyperparameters. The following theorem shows that, for quadratic lower-level objectives, using an affine approximation to the best-response function and minimizing the objective of Equation 9 yields the correct best-response Jacobian, thus ensuring gradient descent on the approximate objective converges to a local optimum:
Suppose f is quadratic with ∂²f/∂w² ≻ 0, p(ε | σ) is Gaussian with mean 0 and variance σ²I, and ŵ_φ is affine. Fix λ0 and let φ* minimize the objective in Equation 9. Then ∂ŵ_φ*/∂λ(λ0) = ∂w*/∂λ(λ0).

See Appendix B.3. ∎
The entries of σ control the scale of the hyperparameter distribution on which ŵ_φ is trained. If the entries are too large, then ŵ_φ will not be flexible enough to capture the best-response over the sampled hyperparameters. However, the entries must remain large enough to force ŵ_φ to capture the shape of the best-response locally around the current hyperparameter values. We illustrate this in Figure 2. As the smoothness of the loss landscape changes during training, it may be beneficial to vary σ.
To address these issues, we propose adjusting σ during training based on the sensitivity of the upper-level objective to the sampled hyperparameters. We include an entropy term weighted by τ which acts to enlarge the entries of σ. The resulting objective is:
(15)  min_{λ, σ}  E_{ε∼p(ε|σ)} [ F(λ + ε, ŵ_φ(λ + ε)) ]  −  τ H[p(ε | σ)]
This is similar to a variational inference objective, where the first term is analogous to the negative log-likelihood, but with τ ≠ 1 in general. As τ ranges from 0 to 1, our objective interpolates between variational optimization (Staines & Barber, 2012) and variational inference, as noted by Khan et al. (2018). Similar objectives have been used in the variational inference literature for better training (Blundell et al., 2015) and representation learning (Higgins et al., 2017). Minimizing the first term on its own eventually moves all probability mass towards an optimum λ*, resulting in σ = 0 if λ* is an isolated local minimum. This compels σ to balance between shrinking to decrease the first term while remaining sufficiently large to avoid a heavy entropy penalty. When benchmarking our algorithm's performance, we evaluate at the deterministic current hyperparameter λ. (This is a common practice when using stochastic operations during training, such as batch normalization or dropout.)
We now describe the complete STN training algorithm and discuss how it can tune hyperparameters that other gradient-based algorithms cannot, such as discrete or stochastic hyperparameters. We use an unconstrained parametrization of the hyperparameters, and let r denote the elementwise function which maps the unconstrained values to the appropriate constrained space; r involves a non-differentiable discretization for discrete hyperparameters.
Let L_T and L_V denote training and validation losses which are (possibly stochastic, e.g., if using dropout) functions of the hyperparameters and parameters. Define f and F by f(λ, w) = L_T(r(λ), w) and F(λ, w) = L_V(r(λ), w). STNs are trained by a gradient descent scheme which alternates between updating φ for T_train steps to minimize the objective of Eq. 9 and updating λ and σ for T_valid steps to minimize the objective of Eq. 15. We give our complete algorithm as Algorithm 1 and show how it can be implemented in code in Appendix G. The possible non-differentiability of r due to discrete hyperparameters poses no problem. To estimate the derivatives with respect to λ and σ, we can use the reparametrization trick, writing the perturbed hyperparameters as λ + σ ⊙ ε with ε drawn from a fixed noise distribution, so that the resulting computation paths need not involve the discretization r. To differentiate with respect to a discrete hyperparameter, there are two cases we must consider:

Case 1: For most regularization schemes, the validation loss does not depend on the hyperparameter directly, and thus the only gradient is through the approximate best-response ŵ_φ. Thus, the reparametrization gradient can be used.
Case 2: If the validation loss relies explicitly on the hyperparameter, then we can use the REINFORCE gradient estimator (Williams, 1992) to estimate the derivative of the expectation. The number of hidden units in a layer is an example of a hyperparameter that requires this approach, since it directly affects the validation loss. We do not show this in Algorithm 1, since we do not tune any hyperparameters which fall into this case.
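A sketch of the score-function (REINFORCE) estimator for a discretized hyperparameter (our own toy: u ~ N(λ, σ²) is rounded to an integer k which directly affects a made-up loss g, as in Case 2), compared against the exact gradient of the expectation:

```python
import math, random
random.seed(0)

# Score-function (REINFORCE) gradient for a discretized hyperparameter.
g = lambda k: (k - 3) ** 2   # made-up loss depending directly on the integer k
lam, sigma = 1.2, 0.5

# REINFORCE: grad = E[ g(round(u)) * d/dlam log N(u; lam, sigma^2) ]
#                 = E[ g(round(u)) * (u - lam) / sigma^2 ]
N = 200_000
acc = 0.0
for _ in range(N):
    u = random.gauss(lam, sigma)
    acc += g(round(u)) * (u - lam) / sigma ** 2
mc_grad = acc / N

# Exact gradient of E[g(round(u))] via the Gaussian CDF, for comparison
Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
exact = sum(
    g(k) * (phi((k - 0.5 - lam) / sigma) - phi((k + 0.5 - lam) / sigma)) / sigma
    for k in range(-10, 16)
)
```

The estimator never differentiates through the rounding itself, which is why it can handle hyperparameters like the number of cutout holes.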
We applied our method to convolutional networks and LSTMs (Hochreiter & Schmidhuber, 1997), yielding self-tuning CNNs (ST-CNNs) and self-tuning LSTMs (ST-LSTMs). We first investigated the behavior of STNs in a simple setting where we tuned a single hyperparameter, and found that STNs discovered hyperparameter schedules that outperformed fixed hyperparameter values. Next, we compared the performance of STNs to commonly-used hyperparameter optimization methods on the CIFAR-10 (Krizhevsky & Hinton, 2009) and PTB (Marcus et al., 1993) datasets.
Due to the joint optimization of the hypernetwork weights and hyperparameters, STNs do not use a single, fixed hyperparameter during training. Instead, STNs discover schedules for adapting the hyperparameters online, which can outperform any fixed hyperparameter. We examined this behavior in detail on the PTB corpus (Marcus et al., 1993) using an ST-LSTM to tune the output dropout rate applied to the hidden units.
The schedule discovered by an ST-LSTM for output dropout, shown in Figure 3, outperforms the best fixed output dropout rate (0.68) found by a fine-grained grid search, achieving 82.58 vs. 85.83 validation perplexity. We claim that this is a consequence of the schedule, and not of regularizing effects from sampling hyperparameters or the limited capacity of ŵ_φ.
To rule out the possibility that the improved performance is due to stochasticity introduced by sampling hyperparameters during STN training, we trained a standard LSTM while perturbing its dropout rate around the best value found by grid search. We used (1) random Gaussian perturbations, and (2) sinusoid perturbations for a cyclic regularization schedule. STNs outperformed both perturbation methods (Table 1), showing that the improvement is not merely due to hyperparameter stochasticity. Details and plots of each perturbation method are provided in Appendix F.
Method  Val  Test
Dropout = 0.68, Fixed  85.83  83.19
Dropout = 0.68 w/ Gaussian Noise  85.87  82.29
Dropout = 0.68 w/ Sinusoid Noise  85.29  82.15
Dropout = 0.78 (Final STN Value)  89.65  86.90
STN  82.58  79.02
LSTM w/ STN Schedule  82.87  79.93
To determine whether the limited capacity of ŵ_φ acts as a regularizer, we trained a standard LSTM from scratch using the schedule for output dropout discovered by the ST-LSTM. Using this schedule, the standard LSTM performed nearly as well as the STN, providing evidence that the schedule itself (rather than some other aspect of the STN) was responsible for the improvement over a fixed dropout rate. To further demonstrate the importance of the hyperparameter schedule, we also trained a standard LSTM from scratch using the final dropout value found by the STN (0.78), and found that it did not perform as well as when following the schedule. The final validation and test perplexities of each variant are shown in Table 1.
Next, we show in Figure 3 that the STN discovers the same schedule regardless of the initial hyperparameter values. Because hyperparameters adapt over a shorter timescale than the weights, we find that at any given point in training, the hyperparameter adaptation has already equilibrated. As shown empirically in Appendix F, low regularization is best early in training, while higher regularization is better later on. We found that the STN schedule implements a curriculum by using a low dropout rate early in training, aiding optimization, and then gradually increasing the dropout rate, leading to better generalization.
PTB  CIFAR-10  
Method  Val Perplexity  Test Perplexity  Val Loss  Test Loss 
Grid Search  97.32  94.58  0.794  0.809 
Random Search  84.81  81.46  0.921  0.752 
Bayesian Optimization  72.13  69.29  0.636  0.651 
STN  70.30  67.68  0.575  0.576 
We evaluated an ST-LSTM on the PTB corpus (Marcus et al., 1993), which is widely used as a benchmark for RNN regularization due to its small size (Gal & Ghahramani, 2016; Merity et al., 2018; Wen et al., 2018). We used a 2-layer LSTM with 650 hidden units per layer and 650-dimensional word embeddings. We tuned 7 hyperparameters: variational dropout rates for the input, hidden state, and output; embedding dropout (which sets entire rows of the embedding matrix to zero); DropConnect (Wan et al., 2013) on the hidden-to-hidden weight matrix; and coefficients that control the strength of activation regularization and temporal activation regularization, respectively. For LSTM tuning, we obtained the best results when using a fixed perturbation scale of 1 for the hyperparameters. Additional details about the experimental setup and the role of these hyperparameters can be found in Appendix D.



We compared STNs to grid search, random search, and Bayesian optimization. (For grid search and random search, we used the Ray Tune library, https://github.com/ray-project/ray/tree/master/python/ray/tune; for Bayesian optimization, we used Spearmint, https://github.com/HIPS/Spearmint.) Figure 3(a) shows the best validation perplexity achieved by each method over time. STNs outperform other methods, achieving lower validation perplexity more quickly. The final validation and test perplexities achieved by each method are shown in Table 2. We show the schedules the STN finds for each hyperparameter in Figures 3(b) and 3(c); we observe that they are nontrivial, with some forms of dropout used to a greater extent at the start of training (including input and hidden dropout), some used throughout training (output dropout), and some that are increased over the course of training (embedding and weight dropout).
We evaluated ST-CNNs on the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset, where it is easy to overfit with high-capacity networks. We used the AlexNet architecture (Krizhevsky et al., 2012), and tuned: (1) continuous hyperparameters controlling per-layer activation dropout, input dropout, and scaling noise applied to the input, (2) discrete data augmentation hyperparameters controlling the length and number of cutout holes (DeVries & Taylor, 2017), and (3) continuous data augmentation hyperparameters controlling the amount of noise to apply to the hue, saturation, brightness, and contrast of an image. In total, we considered 15 hyperparameters.
We compared STNs to grid search, random search, and Bayesian optimization. Figure 5 shows the lowest validation loss achieved by each method over time, and Table 2 shows the final validation and test losses for each method. Details of the experimental setup are provided in Appendix E. Again, STNs find better hyperparameter configurations in less time than other methods. The hyperparameter schedules found by the STN are shown in Figure 6.
Bilevel Optimization. Colson et al. (2007) provide an overview of bilevel problems, and a comprehensive textbook was written by Bard (2013). When the objectives/constraints are restricted to be linear, quadratic, or convex, a common approach replaces the lower-level problem with its KKT conditions, added as constraints for the upper-level problem (Hansen et al., 1992; Vicente et al., 1994). In the unrestricted setting, our work loosely resembles trust-region methods (Colson et al., 2005), which repeatedly approximate the problem locally using a simpler bilevel program. In closely related work, Sinha et al. (2013) used evolutionary techniques to estimate the best-response function iteratively.
Hypernetworks. First considered by Schmidhuber (1993, 1992), hypernetworks are functions mapping to the weights of a neural net. Predicting weights in CNNs has been developed in various forms (Denil et al., 2013; Yang et al., 2015). Ha et al. (2016) used hypernetworks to generate weights for modern CNNs and RNNs. Brock et al. (2017) used hypernetworks to globally approximate a best-response for architecture search. Because the architecture is not optimized during training, they require a large hypernetwork, unlike ours, which locally approximates the best-response.
Gradient-Based Hyperparameter Optimization. There are two main approaches. The first approach approximates w*(λ) using the value of w after some number of gradient descent steps on the training loss with respect to w. The descent steps are differentiated through to approximate ∂w*/∂λ. This approach was proposed by Domke (2012) and used by Maclaurin et al. (2015), Luketina et al. (2016) and Franceschi et al. (2018). The second approach uses the Implicit Function Theorem to derive ∂w*/∂λ under certain conditions. This was first developed for hyperparameter optimization in neural networks (Larsen et al., 1996) and developed further by Pedregosa (2016). Similar approaches have been used for hyperparameter optimization in log-linear models (Foo et al., 2008), kernel selection (Chapelle et al., 2002; Seeger, 2007), and image reconstruction (Kunisch & Pock, 2013; Calatroni et al., 2015). Both approaches struggle with certain hyperparameters, since they differentiate gradient descent or the training loss with respect to the hyperparameters. In addition, differentiating gradient descent becomes prohibitively expensive as the number of descent steps increases, while implicitly deriving ∂w*/∂λ requires using Hessian-vector products with conjugate gradient solvers to avoid directly computing the Hessian.
Model-Based Hyperparameter Optimization. A common model-based approach is Bayesian optimization, which models the conditional probability of the performance on some metric given hyperparameters λ and a dataset of past evaluations. The model can be constructed with various methods (Hutter et al., 2011; Bergstra et al., 2011; Snoek et al., 2012, 2015). The dataset is constructed iteratively, where the next λ to train on is chosen by maximizing an acquisition function which balances exploration and exploitation. Training each model to completion can be avoided if assumptions are made on learning curve behavior (Swersky et al., 2014; Klein et al., 2017). These approaches require building inductive biases into the model which may not hold in practice, do not take advantage of the network structure when used for hyperparameter optimization, and do not scale well with the number of hyperparameters. However, these approaches have consistency guarantees in the limit, unlike ours.
Model-Free Hyperparameter Optimization. Model-free approaches include grid search and random search. Bergstra & Bengio (2012) advocated using random search over grid search. Successive Halving (Jamieson & Talwalkar, 2016) and Hyperband (Li et al., 2017) extend random search by adaptively allocating resources to promising configurations using multi-armed bandit techniques. These methods ignore structure in the problem, unlike ours, which uses rich gradient information. However, it is trivial to parallelize model-free methods over computing resources, and they tend to perform well in practice.
Hyperparameter Scheduling. Population Based Training (PBT) (Jaderberg et al., 2017) considers schedules for hyperparameters. In PBT, a population of networks is trained in parallel. The performance of each network is evaluated periodically, and the weights of underperforming networks are replaced by the weights of betterperforming ones; the hyperparameters of the better network are also copied and randomly perturbed for training the new network clone. In this way, a single model can experience different hyperparameter settings over the course of training, implementing a schedule. STNs replace the population of networks by a single bestresponse approximation and use gradients to tune hyperparameters during a single training run.
We introduced Self-Tuning Networks (STNs), which efficiently approximate the best-response of parameters to hyperparameters by scaling and shifting their hidden units. This allowed us to use gradient-based optimization to tune various regularization hyperparameters, including discrete hyperparameters. We showed that STNs discover hyperparameter schedules that can outperform fixed hyperparameters. We validated the approach on large-scale problems and showed that STNs achieve better generalization performance than competing approaches, in less time. We believe STNs offer a compelling path towards large-scale, automated hyperparameter tuning for neural networks.
We thank Matt Johnson for helpful discussions and advice. MM is supported by an NSERC CGSM award, and PV is supported by an NSERC PGSD award. RG acknowledges support from the CIFAR Canadian AI Chairs program.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. Choosing multiple parameters for Support Vector Machines. Machine Learning, 46(1-3):131–159, 2002.
Gal, Y. and Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1027–1035, 2016.
Gibbons, R. A Primer in Game Theory. Harvester Wheatsheaf, 1992.
In International Conference on Artificial Intelligence and Statistics, 2016.
Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
In International Conference on Computer Vision, pp. 1476–1483, 2015.

λ, w: Hyperparameters and parameters  
λ0, w0: Current, fixed hyperparameters and parameters  
n, D: Hyperparameter and elementary parameter dimension  
f, F: Lower-level & upper-level objective  
r: Function mapping unconstrained hyperparameters to the appropriate restricted space  
L_T, L_V: Training loss & validation loss  
w*(λ): Best-response of the parameters to the hyperparameters  
F*(λ): Single-level objective from the best-response, equals F(λ, w*(λ))  
λ*: Optimal hyperparameters  
ŵ_φ: Parametric approximation to the best-response function  
φ: Approximate best-response parameters  
σ: Scale of the hyperparameter noise distribution  
σ(·): The sigmoid function  
ε: Sampled perturbation noise, to be added to the hyperparameters  
p(ε | σ), p(λ | λ0, σ): The noise distribution and induced hyperparameter distribution  
α: A learning rate  
T_train, T_valid: Number of training steps on the training and validation data  
x, t: An input datapoint and its associated target  
𝒟: A data set consisting of tuples of inputs and targets  
D: The dimensionality of input data  
ŷ(x, w): Prediction function for input data and elementary parameters w  
⊙_row: Row-wise rescaling, not elementwise multiplication  
Q, a: First and second layer weights of the linear network in Problem 13  
Q0, w0: The basis change matrix and solution to the unregularized Problem 13  
Q*(λ), a*(λ): The best-response weights of the linear network in Problem 13  
h: Activations of hidden units in the linear network of Problem 13  
W, b: A layer's weight matrix and bias  
D_in, D_out: A layer's input dimensionality and output dimensionality  
∂F/∂λ: The (validation loss) direct (hyperparameter) gradient  
∂w*/∂λ: The (elementary parameter) response gradient  
(∂w*/∂λ)ᵀ ∂F/∂w: The (validation loss) response gradient  
dF*/dλ: The hyperparameter gradient: a sum of the validation loss's direct and response gradients
Because w0 solves Problem 4b given λ0, by the first-order optimality condition we must have:

(16)  ∂f/∂w(λ0, w0) = 0

The Jacobian of the mapping (λ, w) ↦ ∂f/∂w(λ, w) decomposes as a block matrix with sub-blocks given by:

(17)  [ ∂²f/∂λ∂w(λ, w),  ∂²f/∂w²(λ, w) ]

We know that f is twice continuously differentiable in some neighborhood of (λ0, w0), so ∂f/∂w is continuously differentiable in this neighborhood. By assumption, the Hessian ∂²f/∂w² is positive definite and hence invertible at (λ0, w0). By the Implicit Function Theorem, there exists a neighborhood U of λ0 and a unique continuously differentiable function w* defined on U such that ∂f/∂w(λ, w*(λ)) = 0 for λ ∈ U and w*(λ0) = w0.

Furthermore, by continuity we know that there is a neighborhood U' ⊆ U of λ0 such that ∂²f/∂w²(λ, w*(λ)) is positive definite for all λ ∈ U'. We can thus conclude that ∂f/∂w(λ, w*(λ)) = 0 for all λ ∈ U'. Combining this with the positive definiteness of the Hessian and using second-order sufficient optimality conditions, we conclude that w*(λ) is the unique solution to Problem 4b for all λ ∈ U'.
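The Implicit Function Theorem step can be illustrated numerically (our own example) on a quadratic lower-level objective, where the response Jacobian ∂w*/∂λ = −(∂²f/∂w²)⁻¹ ∂²f/∂w∂λ is available in closed form:

```python
import numpy as np

# For the quadratic lower-level objective f(lam, w) = 1/2 w^T C w + lam^T B w
# (our own made-up instance), the response Jacobian is dw*/dlam = -C^{-1} B^T.
rng = np.random.default_rng(0)
n, D = 2, 3
A_ = rng.standard_normal((D, D))
C = A_ @ A_.T + D * np.eye(D)     # positive definite Hessian d^2f/dw^2
B = rng.standard_normal((n, D))   # cross term: d^2f/(dw dlam) = B^T

# First-order condition: C w + B^T lam = 0
w_star = lambda lam: np.linalg.solve(C, -B.T @ lam)

lam0 = np.array([0.4, -0.2])
J_ift = -np.linalg.solve(C, B.T)  # IFT formula for the response Jacobian

# Finite-difference check
eps = 1e-6
J_fd = np.column_stack([
    (w_star(lam0 + eps * e) - w_star(lam0 - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
```

The analytic IFT Jacobian matches the finite-difference Jacobian of the argmin.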
This discussion mostly follows Hastie et al. (2001). We let X denote the N×D data matrix, where N is the number of training examples and D is the dimensionality of the data. We let t denote the associated targets. We can write the SVD decomposition of X as:

(18)  X = U S Vᵀ

where U and V are N×N and D×D orthogonal matrices and S is a diagonal matrix with entries s1 ≥ s2 ≥ ⋯ ≥ sD ≥ 0. We next simplify the prediction function by setting w = Qᵀa, so that ŷ(x) = wᵀx. We see that the Jacobian ∂ŷ/∂x = wᵀ is constant, and Problem 13 simplifies to standard L2-regularized least-squares linear regression with the following loss function:
(19)  L(w) = ‖Xw − t‖² + exp(λ)‖w‖²
It is well-known (see Hastie et al. (2001), Chapter 3) that the optimal solution minimizing Equation 19 is given by:
(20)  w*(λ) = (XᵀX + exp(λ)I)⁻¹ Xᵀt = V D(λ) Uᵀt,  where D(λ) is diagonal with entries s_j / (s_j² + exp(λ))
Furthermore, the optimal solution to the unregularized version of Problem 19 is given by:
(21)  w0 = (XᵀX)⁻¹ Xᵀt = V D0 Uᵀt,  where D0 is diagonal with entries 1/s_j
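The SVD form of the ridge solution in Equation 20 can be verified numerically (our own check, under the exp(λ) parametrization of the penalty):

```python
import numpy as np

# Check the SVD form of the ridge solution (Eq. 20) on random data.
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 5))
t = rng.standard_normal(40)
lam = -0.3

# Direct solution of (X^T X + exp(lam) I) w = X^T t
w_direct = np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(5), X.T @ t)

# SVD form: w = V diag(s_j / (s_j^2 + exp(lam))) U^T t
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s ** 2 + np.exp(lam))) * (U.T @ t))
```

Both expressions give the same minimizer of Equation 19.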
Recall that we defined Q0 = Vᵀ, i.e., the change-of-basis matrix from the standard basis to the principal components of the data matrix, and we defined a0 to solve the unregularized regression problem given Q0. Thus, we require that a0ᵀQ0 = w0ᵀ, which implies a0 = Vᵀw0.
There are not unique solutions to Problem 13, so we take any functions Q*(λ), a*(λ) which satisfy a*(λ)ᵀQ*(λ) = w*(λ)ᵀ as "best-response functions". We will show that our chosen functions Q*(λ) = σ(v − λ1) ⊙_row Q0 and a*(λ) = a0, where v is defined by v_j = 2 log s_j for j = 1, …, D, meet this criterion. We start by noticing that for any λ, we have:

(22)  σ(2 log s_j − λ) = s_j² / (s_j² + exp(λ))

It follows that:

(23)  a*(λ)ᵀ Q*(λ) = a0ᵀ [σ(v − λ1) ⊙_row Q0]
(24)  = a0ᵀ diag(σ(v − λ1)) Vᵀ
(25)  = (Vᵀw0)ᵀ diag(σ(v − λ1)) Vᵀ
(26)  = (D0 Uᵀt)ᵀ diag(s_j² / (s_j² + exp(λ))) Vᵀ
(27)  = tᵀU diag((1/s_j) · s_j² / (s_j² + exp(λ))) Vᵀ
(28)  = tᵀU diag(s_j / (s_j² + exp(λ))) Vᵀ
(29)  = (V D(λ) Uᵀt)ᵀ = w*(λ)ᵀ
By assumption f is quadratic, so there exist A, B, and C, with C symmetric positive definite (we omit lower-order terms for simplicity), such that:

(30)  f(λ, w) = ½λᵀAλ + λᵀBw + ½wᵀCw
One can easily compute that:

(31)  ∂f/∂w(λ, w) = Bᵀλ + Cw

(32)  ∂²f/∂w²(λ, w) = C
Since we assume ∂²f/∂w² ≻ 0, we must have C ≻ 0. Setting the derivative equal to 0 and using second-order sufficient conditions, we have:

(33)  w*(λ) = −C⁻¹Bᵀλ
Hence, we find:

(34)  ∂w*/∂λ(λ) = −C⁻¹Bᵀ
We let ŵ_φ(λ) = Wλ + c with φ = (W, c), and define h to be the function given by:

(35)  h(φ) = E_{ε∼N(0, σ²I)} [ f(λ0 + ε, ŵ_φ(λ0 + ε)) ]
Substituting and simplifying:

(36)  h(φ) = E_ε [ ½(λ0 + ε)ᵀA(λ0 + ε) + (λ0 + ε)ᵀB(W(λ0 + ε) + c) + ½(W(λ0 + ε) + c)ᵀC(W(λ0 + ε) + c) ]
Expanding and using E[ε] = 0 and E[εεᵀ] = σ²I, we find that equation 36 is equal to:

h(φ) = ½λ0ᵀAλ0 + (σ²/2)tr(A) + λ0ᵀB(Wλ0 + c) + σ²tr(BW) + ½(Wλ0 + c)ᵀC(Wλ0 + c) + (σ²/2)tr(WᵀCW)

Setting the derivatives of h with respect to c and W to zero gives Bᵀλ0 + C(Wλ0 + c) = 0 and Bᵀλ0λ0ᵀ + σ²Bᵀ + C(Wλ0 + c)λ0ᵀ + σ²CW = 0. Substituting the first condition into the second cancels the λ0λ0ᵀ terms, leaving σ²Bᵀ + σ²CW = 0, i.e., W = −C⁻¹Bᵀ. Comparing with Equation 34, we conclude that ∂ŵ_φ*/∂λ(λ0) = W = ∂w*/∂λ(λ0). ∎