Gradient-based optimization is a cornerstone of modern machine learning. Improvements in optimization have been critical to recent successes on a wide variety of problems. In practice, this typically involves analysis and development of hand-designed optimization algorithms (Nesterov, 1983; Duchi et al., 2011; Tieleman & Hinton, 2012; Kingma & Ba, 2014). These algorithms generally work well on a wide variety of tasks, and are tuned to specific problems via hyperparameter search. On the other hand, a complementary approach is to learn the optimization algorithm (Bengio et al., 1990; Schmidhuber, 1995; Hochreiter et al., 2001; Andrychowicz et al., 2016; Wichrowska et al., 2017; Li & Malik, 2017b; Lv et al., 2017; Bello et al., 2017). That is, to learn a function to perform optimization, targeted to particular problems of interest. In this way, the algorithm may learn task-specific structure, enabling dramatic performance improvements over more general optimizers.
However, training learned optimizers is notoriously difficult. Existing work in this vein can be classified into two broad categories. On one hand are black-box methods such as evolution (Goldberg & Holland, 1988; Bengio et al., 1992), random search (Bergstra & Bengio, 2012), reinforcement learning (Bello et al., 2017; Li & Malik, 2017a, b), or Bayesian optimization (Snoek et al., 2012). However, these methods scale poorly with the number of optimizer parameters. The other approach is to use first-order methods, by computing the gradient of some measure of optimizer effectiveness with respect to the optimizer parameters. Computing these gradients is costly, as we need to iteratively apply the learned update rule and then backpropagate through these applications, a technique commonly referred to as “unrolled optimization” (Bengio, 2000; Maclaurin et al., 2015). To address the problem of backpropagation through many optimization steps (analogous to many timesteps in recurrent neural networks), many works make use of truncated backpropagation through time (TBPTT) to partition the long unrolled computational graph into separate pieces (Werbos, 1990; Tallec & Ollivier, 2017). This not only yields computational savings, at the cost of increased bias (Tallec & Ollivier, 2017), but also limits exploding gradients, which emerge from too many iterated non-linear function applications (Pascanu et al., 2013; Parmas et al., 2018). Existing methods address the bias at the cost of increased variance or computational complexity (Williams & Zipser, 1989; Ollivier et al., 2015; Tallec & Ollivier, 2017). Previous techniques for training RNNs via TBPTT have thus far not been effective for training optimizers.
In this paper, we analytically and experimentally explore the debilitating role of bias and exploding gradients on training optimizers (§2.3). We then show how these pathologies can be remedied by optimizing the parameters of a distribution over the optimizer parameters, known as variational optimization (Staines & Barber, 2012) (§3). We define two unbiased gradient estimators for this objective: a reparameterization based gradient (Kingma & Welling, 2013), and evolutionary strategies (Rechenberg, 1973; Nesterov & Spokoiny, 2011). By dynamically reweighting the contribution of these two gradient estimators, we are able to avoid strongly biased or exploding gradients, and thus stably and efficiently train learned optimizers.
We demonstrate the utility of this approach by training a learned optimizer to target optimization of small convolutional networks on image classification (§4). On the targeted task distribution, this learned optimizer achieves better test loss, and is five times faster in wall-clock time, compared to well-tuned hand-designed optimizers such as SGD+Momentum, RMSProp, and Adam (Figure 1). While not the explicit focus of this work, we also find that the learned optimizer demonstrates promising generalization ability on out-of-distribution tasks (Figure 6).
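|Symbol|Description|
|---|---|
|$\mathcal{D}$|Dataset consisting of train and validation split, $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{valid}}$.|
|$\mathcal{T}$|The set of tasks, where each task is a dataset (e.g., a subset of Imagenet classes).|
|$w^{(t)}$|Parameters of inner-problem at iteration $t$. These are updated by the learned optimizer, and depend implicitly on $\theta$ and $\mathcal{D}_{\text{train}}$.|
|$\ell(x; w^{(t)})$|Loss on inner-problem, for mini-batch $x$.|
|$\theta$|Parameters of the optimizer.|
|$u(\cdot; \theta)$|Function defining the learned optimizer. The inner-loop update is $w^{(t+1)} = u(w^{(t)}, x, \nabla_w \ell, \ldots; \theta)$, for $x \sim \mathcal{D}_{\text{train}}$.|
|$L_{\text{train}}(\theta)$|Outer-level objective targeting training loss, $\mathbb{E}_{\mathcal{D} \sim \mathcal{T}}\, \mathbb{E}_{x \sim \mathcal{D}_{\text{train}}} \left[ \frac{1}{T} \sum_{t=1}^{T} \ell(x; w^{(t)}) \right]$.|
|$L_{\text{valid}}(\theta)$|Outer-level objective targeting validation loss, $\mathbb{E}_{\mathcal{D} \sim \mathcal{T}}\, \mathbb{E}_{x \sim \mathcal{D}_{\text{valid}}} \left[ \frac{1}{T} \sum_{t=1}^{T} \ell(x; w^{(t)}) \right]$.|
|$\mathcal{L}(\theta)$|The variational (smoothed) outer-loop objective, $\mathbb{E}_{\tilde{\theta} \sim \mathcal{N}(\theta, \sigma^2 I)} \left[ L(\tilde{\theta}) \right]$.|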
2 Unrolled optimization for learning optimizers
2.1 Problem Framework
Our goal is to learn an optimizer which is well suited to some set of target optimization tasks. Throughout the paper, we will use the notation defined in Figure 2. Learning an optimizer can be thought of as a bi-level optimization problem (Franceschi et al., 2018), with inner and outer levels. The inner minimization consists of optimizing the weights ($w$) of a target problem by the repeated application of an update rule ($u(\cdot)$). The update rule is a parameterized function that defines how to map the weights at iteration $t$ to iteration $t+1$: $w^{(t+1)} = u(w^{(t)}; \theta)$. Here, $\theta$ represents the parameters of the learned optimizer. In the outer loop, these optimizer parameters ($\theta$) are updated so as to minimize some measure of optimizer performance, the outer-objective ($L(\theta)$). Our choice for $L$ will be the average value of the target loss ($\ell$) measured over either training or validation data. Throughout the paper, we use inner- and outer- prefixes to make it clear when we are referring to applying a learned optimizer on a target problem (inner) versus training a learned optimizer (outer).
2.2 Unrolled optimization
In order to train an optimizer, we wish to compute derivatives of the outer-objective with respect to the optimizer parameters, $\frac{\partial L}{\partial \theta}$. Doing this requires unrolling the optimization process. That is, we can form an unrolled computational graph that consists of iteratively applying an optimizer ($u$) to optimize the weights ($w$) of a target problem (Figure 2). Computing gradients for the optimizer parameters involves backpropagating the outer loss through this unrolled computational graph. This is a costly operation, as the entire inner-optimization problem must be unrolled in order to get a single outer-gradient. Partitioning the unrolled computation into separate segments, known as truncated backpropagation, allows one to compute multiple outer-gradients over shorter segments. That is, rather than computing the full gradient from iteration $0$ to $T$, we compute gradients in windows from $t = a$ to $t = a + \tau$. Choosing the number of inner-steps per truncation is challenging. Using a large number of steps per truncation can result in exploding gradients, making outer-training difficult, while using a small number of steps can produce biased gradients, resulting in poor performance. In the following sections we analyze these two problems.
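The effect of truncation can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the inner loss is a quadratic $\ell(w) = \frac{1}{2} w^\top H w$, the learned optimizer is plain gradient descent with a single learnable learning rate `theta`, and the outer-objective is the average inner loss over the unroll. The function names (`outer_loss`, `truncated_outer_grad`) are hypothetical. Gradients are accumulated in forward mode, and the sensitivity `dw` is reset at each window boundary, which is what truncation does to the gradient signal.

```python
import numpy as np

def outer_loss(theta, w0, H, num_steps):
    """Average inner loss over an unroll of gradient descent with learning rate theta."""
    w = w0.copy()
    total = 0.0
    for _ in range(num_steps):
        w = w - theta * (H @ w)
        total += 0.5 * w @ H @ w
    return total / num_steps

def truncated_outer_grad(theta, w0, H, num_steps, trunc_len):
    """d(outer_loss)/d(theta), with gradient flow cut every trunc_len inner steps."""
    w = w0.copy()
    dw = np.zeros_like(w0)       # sensitivity dw/dtheta inside the current window
    grad = 0.0
    for t in range(num_steps):
        if t % trunc_len == 0:   # new window: treat the incoming weights as constants
            dw = np.zeros_like(w0)
        g = H @ w                # inner gradient at the current weights
        dw = dw - theta * (H @ dw) - g
        w = w - theta * g
        grad += (H @ w) @ dw     # chain rule: <gradient of inner loss, dw/dtheta>
    return grad / num_steps

H = np.diag([1.0, 0.1])
w0 = np.array([1.0, 1.0])
full = truncated_outer_grad(0.5, w0, H, num_steps=20, trunc_len=20)
short = truncated_outer_grad(0.5, w0, H, num_steps=20, trunc_len=1)
print("full unroll:", full, "one-step truncation:", short)
```

With `trunc_len` equal to the unroll length, the result matches a finite-difference estimate of the full outer-gradient; shorter windows give a different (biased) value on this toy problem.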
2.3 Exponential explosion of gradients with increased sequence length
We can illustrate the problem of exploding gradients analytically with a simple example: learning a learning rate. Following the notation in Figure 2, we define the optimizer as
$$u(w^{(t)}; \theta) = w^{(t)} - \theta \nabla \ell(w^{(t)}),$$
where $\theta$ is a scalar learning rate that we wish to learn for minimizing some target problem $\ell(w^{(t)})$. For simplicity, we assume a deterministic loss ($\ell(\cdot)$) with no batch of data ($x$).
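The quantity we are interested in is the derivative of the loss after $T$ steps of gradient descent with respect to $\theta$. We can compute this gradient (see Appendix A) as:
$$\frac{\partial \ell(w^{(T)})}{\partial \theta} = \left\langle g^{(T)},\; -\sum_{i=0}^{T-1} \left( \prod_{j=i+1}^{T-1} \left( I - \theta H^{(j)} \right) \right) g^{(i)} \right\rangle,$$
where $g^{(i)}$ and $H^{(j)}$ are the gradient and Hessian of the target problem at iteration $i$ and $j$, respectively, and the factors of the product are ordered with the latest iteration leftmost. We see that this equation involves a sum of products of Hessians. In particular, the first term in the sum involves a product over the entire sequence of Hessians observed during training. That is, the outer-gradient becomes a matrix polynomial of degree $T$, where $T$ is the number of gradient descent steps. Thus, the outer-gradient can grow exponentially with $T$.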
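The sum-of-products form can be evaluated directly as a sanity check. The sketch below is illustrative only: the helper names (`run_gd`, `unrolled_grad`) and the 1-D quadratic test problem are assumptions, not taken from the paper. It assumes the update $w^{(t+1)} = w^{(t)} - \theta g^{(t)}$ and builds each product of $(I - \theta H^{(j)})$ factors explicitly.

```python
import numpy as np

def run_gd(theta, w0, grad_fn, T):
    """Return the trajectory [w^(0), ..., w^(T)] of gradient descent with step size theta."""
    ws = [np.asarray(w0, dtype=float)]
    for _ in range(T):
        ws.append(ws[-1] - theta * grad_fn(ws[-1]))
    return ws

def unrolled_grad(theta, w0, grad_fn, hess_fn, T):
    """d loss(w^(T)) / d theta via the explicit sum of Hessian products."""
    ws = run_gd(theta, w0, grad_fn, T)
    g = [grad_fn(w) for w in ws]
    eye = np.eye(len(ws[0]))
    total = np.zeros_like(ws[0])
    for i in range(T):
        v = g[i]
        for j in range(i + 1, T):
            # Later iterations multiply on the left of earlier ones.
            v = (eye - theta * hess_fn(ws[j])) @ v
        total += v
    return -float(g[T] @ total)

# Toy 1-D quadratic l(w) = w^2 / 2, so g = w and H = 1.
quad_grad = lambda w: w
quad_hess = lambda w: np.eye(1)

# theta = 2.5 makes |1 - theta * H| = 1.5 > 1, so the outer-gradient grows with T.
for T in (5, 10, 20):
    print(T, unrolled_grad(2.5, [1.0], quad_grad, quad_hess, T))
```

On this toy problem the magnitude grows geometrically with the number of unrolled steps whenever the step size is large enough that $|1 - \theta H| > 1$.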
We can see another problem with long unrolled gradients empirically. Consider the task of optimizing a loss surface with two local minima, starting from a fixed initial condition, using a momentum-based optimizer with a parameterized momentum value (Figure 3a). At low momentum values the optimizer converges to the first of the two local minima, whereas for larger momentum values the optimizer settles in the second minimum. With even larger values of momentum, the iterate oscillates between the two minima before settling. We visualize both the trajectory of $w$ over training and the final loss value for different momentum values in Figures 3b and 3c. With increasing unrolling steps, the loss surface as a function of the momentum becomes less and less smooth, and develops near-discontinuities at some values of the momentum.
2.4 Increasing bias with truncated gradients
Existing work on learned optimizers often avoids exploding gradients (§2.3) by using a short truncation window. Here, we demonstrate the bias that short truncation windows can introduce in unrolled optimization. These results are similar to those presented in Wu et al. (2016), except that we utilize multiple truncations rather than a single, shortened unroll. First, consider outer-learning the learning rate of Adam when optimizing a small two-layer neural network on MNIST (LeCun, 1998). A grid search can be used to find the optimal learning rate. We initialize Adam with a learning rate close to this optimum and outer-train using increasing truncation amounts (Figure 4a). Despite initializing close to the optimal learning rate, when outer-training with severely truncated backprop the resulting learning rate decreases, increasing the outer-loss. The sum of truncated outer-gradients is anti-correlated with the true outer-gradient. We visualize the per-truncation gradients for 500-step unrolls in Figure 4b and cumulative truncated gradients in Figure 4c. Early in inner-training there is a large negative outer-gradient, which increases the learning rate. Later in inner-training, the outer-gradients are positive, decreasing the learning rate. Thus, the optimizer parameter is pulled in opposite directions by truncated gradients early versus late in inner-training, revealing an inherent tension in truncated unrolled optimization.
3 Towards stable training of learned optimizers
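To perform outer-optimization of a loss landscape with high-frequency structure like that in Figure 3, one might intuitively want to smooth the outer-objective loss surface. To do this, instead of optimizing $L(\theta)$ directly, we optimize a smoothed outer-loss $\mathcal{L}(\theta)$,
$$\mathcal{L}(\theta) = \mathbb{E}_{\tilde{\theta} \sim \mathcal{N}(\theta, \sigma^2 I)} \left[ L(\tilde{\theta}) \right],$$
where $\sigma^2$ is a fixed variance (set to 0.01 in all experiments) which determines the degree of smoothing. This is the same approach taken in variational optimization (Staines & Barber, 2012). We can construct two different unbiased gradient estimators for $\mathcal{L}(\theta)$: one via the reparameterization trick (Kingma & Welling, 2013), and one via the log-derivative trick, similar to what is done in evolutionary strategies (ES) (Wierstra et al., 2008). We denote the two estimates as $g_{rp}$ and $g_{es}$ respectively,
$$g_{rp} = \frac{1}{S} \sum_{s} \nabla_\theta L(\theta + \sigma n_s), \quad n_s \sim \mathcal{N}(0, I),$$
$$g_{es} = \frac{1}{S} \sum_{s} L(\tilde{\theta}_s) \nabla_\theta \left[ \log \mathcal{N}(\tilde{\theta}_s; \theta, \sigma^2 I) \right], \quad \tilde{\theta}_s \sim \mathcal{N}(\theta, \sigma^2 I),$$
where $S$ is the sample count, and in implementation the same samples can be reused for $g_{rp}$ and $g_{es}$.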
Following the insight from Parmas et al. (2018) in the context of reinforcement learning, we combine these estimates using inverse variance weighting (Fleiss, 1993),
$$g_{merged} = \frac{g_{rp}\, \sigma_{rp}^{-2} + g_{es}\, \sigma_{es}^{-2}}{\sigma_{rp}^{-2} + \sigma_{es}^{-2}},$$
where $\sigma_{rp}^2$ and $\sigma_{es}^2$ are empirical estimates of the variances of $g_{rp}$ and $g_{es}$ respectively. (Parmas et al. (2018) go on to propose a more sophisticated gradient estimator that operates on a per-iteration level. While this should result in an even lower variance estimator in our setting, we find that the simpler solution of combining both terms at the end is easier to implement and works well in practice.) When outer-training learned optimizers, we find the variances of $g_{rp}$ and $g_{es}$ can differ by as many as 20 orders of magnitude (Figure 5a). This merged estimator addresses this by having at most the lowest of the two variances. To further reduce variance, we use antithetic sampling when estimating both $g_{rp}$ and $g_{es}$.
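A minimal sketch of the merged estimator follows, on a toy outer-loss whose gradient is available in closed form. Several details here are assumptions made for illustration rather than the paper's implementation: the function name `merged_gradient`, the use of one scalar variance per estimator (averaged over coordinates), the small constant added to avoid division by zero, and the toy `loss_fn` and `grad_fn`. The log-derivative term uses $\nabla_\theta \log \mathcal{N}(\tilde{\theta}; \theta, \sigma^2 I) = (\tilde{\theta} - \theta) / \sigma^2$, and antithetic pairs $\theta \pm \sigma n$ are used for both estimators.

```python
import numpy as np

def merged_gradient(loss_fn, grad_fn, theta, sigma=0.1, num_pairs=64, rng=None):
    """Inverse-variance-weighted combination of reparameterization and ES gradients."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal((num_pairs, theta.size))
    rp_samples, es_samples = [], []
    for n in noise:
        plus, minus = theta + sigma * n, theta - sigma * n
        # Reparameterization: average the true gradients at the antithetic pair.
        rp_samples.append(0.5 * (grad_fn(plus) + grad_fn(minus)))
        # ES / log-derivative: only loss values are needed, no backpropagation.
        es_samples.append((loss_fn(plus) - loss_fn(minus)) * n / (2.0 * sigma))
    rp_samples, es_samples = np.array(rp_samples), np.array(es_samples)
    g_rp, g_es = rp_samples.mean(axis=0), es_samples.mean(axis=0)
    # Assumption: one scalar variance per estimator, averaged over coordinates.
    var_rp = rp_samples.var(axis=0, ddof=1).mean() + 1e-30
    var_es = es_samples.var(axis=0, ddof=1).mean() + 1e-30
    w_rp, w_es = 1.0 / var_rp, 1.0 / var_es
    g_merged = (w_rp * g_rp + w_es * g_es) / (w_rp + w_es)
    return g_merged, g_rp, g_es, var_rp, var_es

theta = np.array([1.0, -2.0, 0.5])
loss_fn = lambda t: 0.5 * float(t @ t)   # toy smooth outer-loss
grad_fn = lambda t: t                    # its exact gradient
g_merged, g_rp, g_es, var_rp, var_es = merged_gradient(loss_fn, grad_fn, theta)
print(g_merged, var_rp, var_es)
```

On a smooth toy loss like this one the reparameterization samples have almost no variance, so the merged estimate follows $g_{rp}$. When the unrolled gradient explodes, its variance dominates and the weighting shifts toward $g_{es}$.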
The cost of computing a single sample of $g_{rp}$ and $g_{es}$ is thus 2 forward and backward passes of an unrolled optimization. To compute the empirical variance, we leverage data parallelism to compute multiple samples of $g_{rp}$ and $g_{es}$. In theory, to prevent bias, the samples used to evaluate $\sigma_{rp}^2$ and $\sigma_{es}^2$ must be independent of $g_{rp}$ and $g_{es}$, but in practice we found good performance using the same samples for both. Finally, an increasing curriculum over steps per truncation is used over the course of outer-training. This introduces bias early in training, but also allows for far more frequent outer-parameter updates, resulting in much faster outer-training in terms of wall-clock time. The full outer-training algorithm is described in Appendix B.
4.1 Optimizer architecture
The optimizer architecture used in all experiments consists of a simple fully connected neural network, with one hidden layer containing 32 ReLU units (1k parameters). This network is applied to each target problem weight independently. The outputs of the MLP consist of an update direction and a log learning rate, which are combined to produce weight updates. The network for each weight takes as input: the gradient with respect to that weight, parameter value, RMS gradient terms, exponentially weighted moving averages of gradients at multiple time scales (Lucas et al., 2018), as well as a representation of the current iteration. Many of these input features were motivated by Wichrowska et al. (2017). We conduct ablation studies for these inputs in §4.5. See Appendix C for further architectural details.
4.2 Optimizer target problem
The problem that each learned optimizer is trained against ($\ell$) consists of training a three-layer convolutional neural network (32 units per layer, 20k parameters), inner-trained for ten thousand inner-iterations on 32x32x3 image classification tasks. We split the Imagenet dataset (Russakovsky et al., 2015) by class into 700 training and 300 validation classes, and sample training and validation problems by sampling 10 classes at random, using all images from each class. This experimental design lets the optimizer learn problem-specific structure (e.g. convolutional networks trained on object classification), but does not allow the optimizer to memorize particular weights for the base problem. See Appendix C for further details.
4.3 Outer-training
To train the optimizer, we linearly increase the number of unrolled steps from 50 to 10,000 over the course of 5,000 outer-training weight updates. The number of unrolled steps is additionally jittered by a small percentage (sampled uniformly up to 20%). Due to the heterogeneous, small workloads, we train with asynchronous batched SGD using 128 CPU workers.
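The unroll-length curriculum can be sketched as a small helper. This is an illustration under assumptions: the function name `unroll_length` is hypothetical, the jitter is taken to be symmetric (as stated in Appendix C), and the schedule is assumed to hold at its final value once the ramp is complete.

```python
import random

def unroll_length(outer_step, start=50, end=10_000, ramp_steps=5_000,
                  jitter=0.2, rng=random):
    """Number of inner steps to unroll at a given outer-training step."""
    # Linear ramp from `start` to `end`, clamped after `ramp_steps` (assumption).
    frac = min(max(outer_step / ramp_steps, 0.0), 1.0)
    base = start + frac * (end - start)
    # Multiplicative jitter sampled uniformly from [-jitter, +jitter].
    scale = 1.0 + rng.uniform(-jitter, jitter)
    return max(1, int(round(base * scale)))

for step in (0, 1_000, 2_500, 5_000):
    print(step, unroll_length(step))
```

Short unrolls early in outer-training give frequent, cheap updates, and the jitter varies the window length from one update to the next.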
Figure 5 shows the performance of the optimizer (averaged over 40 randomly sampled outer-train and outer-test inner-problems) while outer-training. Despite the stability improvements described in the last section, there is still variability in optimizer performance over random initializations of the optimizer parameters. We use outer-training loss to select the best model and use this in the remainder of the evaluation.
4.4 Learned optimizer performance
Figure 1 shows performance of the learned optimizer, after outer-training, compared against other first-order methods on a sampled validation task (classes not seen during outer-training). For all first-order methods, we report the best performance after tuning the learning rate by grid search using 11 values over a logarithmically spaced range extending up to 10. When outer-trained against the training outer-objective, $L_{\text{train}}$, we achieve faster convergence on training loss by a factor of 5x (Figure 1a), but poor performance on test loss (Figure 1b). However, when outer-trained against the validation outer-objective, $L_{\text{valid}}$, we also achieve faster optimization and reach a lower test loss (Figure 1b).
Figure 1 summarizes the performance of the learned optimizer across many sampled validation tasks. It shows the difference in final test loss between the best first-order method and the learned optimizer. We choose the best first-order method by selecting the best validation performance over RMSProp, SGD+Momentum, and Adam. This learned optimizer (which does not require tuning on the validation tasks) outperforms the best baseline optimizer 98% of the time.
Although the focus of our approach was not generalization, we find that our learned optimizer nonetheless generalizes to varying degrees to dissimilar datasets, different numbers of units per layer, different numbers of layers, and even to fully connected networks. In Figure 6 we show performance on a six-layer convolutional neural network trained on MNIST. Despite the different number of layers, different dataset, and different input size, the learned optimizer still reduces the loss, and in the case of the validation outer-objective trains faster to a lower validation loss. We further explore the limits of generalization of our learned optimizer on additional tasks in Appendix E.
4.5 Ablations
To assess the importance of the gradient estimator discussed in §3, the unrolling curriculum (§4.3), as well as the features fed to the optimizer enumerated in §4.1, we re-trained the learned optimizer removing each of these additions. In particular, we trained optimizers with: only the reparameterization gradient estimator (Gradients), only evolutionary strategies (ES), a fixed number of unrolled steps (10, 100, 1000) as opposed to a schedule, no RMS gradient scaling (No RMS), no momentum terms (No Mom), no momentum and no RMS scaling (Only Grads), and without the current iteration (No Time). To account for variance, each configuration is repeated with multiple random seeds. Figure 7 summarizes these findings, showing the learned optimizer performance for each of these ablations. We find that the gradient estimator (in §3) and an increasing schedule of unroll steps are critical to performance, along with including momentum as an input to the optimizer.
In this work we demonstrate two difficulties when training learned optimizers: “exploding” gradients, and a bias introduced by truncated backpropagation through time. To combat these, we construct a variational bound of the outer-objective and minimize it via a combination of reparameterization and ES-style gradient estimators. By using our combined estimator and a curriculum over truncation steps, we are able to train learned optimizers that achieve more than a five-times speedup in wall-clock time compared to existing optimizers.
In this work, we focused on applying optimizers to a restricted family of tasks. While useful in its own right (e.g. rapid retraining of models on new data), future work will explore the limits of “no free lunch” (Wolpert & Macready, 1997) in the context of optimizers, to understand how and when learned optimizers generalize across tasks. We are also interested in using these methods to better understand what problem structure our learned optimizers exploit. By analyzing the trained optimizer, we hope to develop insights that may transfer back to hand-designed optimizers. Outside of meta-learning, we believe the outer-gradient estimator presented here can be used to train other recurrent problems with long time dependencies, such as neural Turing machines (Graves et al., 2014) or neural GPUs (Kaiser & Sutskever, 2015).
Much in the same way deep learning has replaced feature design for perceptual tasks, we see meta-learning as a tool capable of learning new and interesting algorithms, especially for domains with unexploited problem-specific structure. With better outer-training stability, we hope to improve our ability to learn interesting algorithms, both for optimizers and beyond.
We would like to thank Madhu Advani, Alex Alemi, Samy Bengio, Brian Cheung, Chelsea Finn, Sam Greydanus, Hugo Larochelle, Ben Poole, George Tucker, and Olga Wichrowska, as well as the rest of the Brain Team for conversations that helped shape this work.
- Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.
- Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
- Bello et al. (2017) Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc Le. Neural optimizer search with reinforcement learning. 2017. URL https://arxiv.org/pdf/1709.07417.pdf.
- Bengio et al. (1992) Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pp. 6–8. Univ. of Texas, 1992.
- Bengio (2000) Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8):1889–1900, 2000.
- Bengio et al. (1990) Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. Université de Montréal, Département d’informatique et de recherche opérationnelle, 1990.
- Bergstra & Bengio (2012) James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
- Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Fleiss (1993) JL Fleiss. Review papers: The statistical basis of meta-analysis. Statistical methods in medical research, 2(2):121–145, 1993.
- Franceschi et al. (2018) Luca Franceschi, Paolo Frasconi, Saverio Salzo, and Massimilano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910, 2018.
- Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256, 2010.
- Goldberg & Holland (1988) David E Goldberg and John H Holland. Genetic algorithms and machine learning. Machine learning, 3(2):95–99, 1988.
- Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
- Hochreiter et al. (2001) Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pp. 87–94. Springer, 2001.
- Kaiser & Sutskever (2015) Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
- Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- LeCun (1998) Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- Li & Malik (2017a) Ke Li and Jitendra Malik. Learning to optimize. International Conference on Learning Representations, 2017a.
- Li & Malik (2017b) Ke Li and Jitendra Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017b.
- Lucas et al. (2018) James Lucas, Richard Zemel, and Roger Grosse. Aggregated momentum: Stability through passive damping. arXiv preprint arXiv:1804.00325, 2018.
- Lv et al. (2017) Kaifeng Lv, Shunhua Jiang, and Jian Li. Learning gradient descent: Better generalization and longer horizons. arXiv preprint arXiv:1703.03633, 2017.
- Maclaurin et al. (2015) Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122, 2015.
- Nesterov (1983) Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). In Doklady AN USSR, volume 269, pp. 543–547, 1983.
- Nesterov & Spokoiny (2011) Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Technical report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2011.
- Ollivier et al. (2015) Yann Ollivier, Corentin Tallec, and Guillaume Charpiat. Training recurrent networks online without backtracking. arXiv preprint arXiv:1507.07680, 2015.
- Parmas et al. (2018) Paavo Parmas, Carl Edward Rasmussen, Jan Peters, and Kenji Doya. Pipps: Flexible model-based policy search robust to the curse of chaos. In International Conference on Machine Learning, pp. 4062–4071, 2018.
- Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318, 2013.
- Rechenberg (1973) Ingo Rechenberg. Evolutionsstrategie–optimierung technisher systeme nach prinzipien der biologischen evolution. 1973.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- Schmidhuber (1995) Juergen Schmidhuber. On learning how to learn learning strategies. 1995.
- Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
- Staines & Barber (2012) Joe Staines and David Barber. Variational optimization. arXiv preprint arXiv:1212.4507, 2012.
- Tallec & Ollivier (2017) Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209, 2017.
- Tieleman & Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
- Werbos (1990) Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
- Wichrowska et al. (2017) Olga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. International Conference on Machine Learning, 2017.
- Wierstra et al. (2008) Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies. In Evolutionary Computation, 2008. CEC 2008.(IEEE World Congress on Computational Intelligence). IEEE Congress on, pp. 3381–3387. IEEE, 2008.
- Williams & Zipser (1989) Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
- Wolpert & Macready (1997) David H Wolpert and William G Macready. No free lunch theorems for optimization. IEEE transactions on evolutionary computation, 1(1):67–82, 1997.
- Wu et al. (2016) Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger B Grosse. Understanding short-horizon bias in stochastic meta-optimization. pp. 478–487, 2016.
Appendix A Derivation of the unrolled gradient
For the case of learning a learning rate, we can derive the unrolled gradient to get some intuition for issues that arise with outer-training. Here, the update rule is given by:
$$w^{(t+1)} = w^{(t)} - \theta \nabla \ell(w^{(t)}),$$
where $w$ are the inner-parameters to train, $\ell$ is the inner-loss, and $\theta$ is a scalar learning rate, the only outer-parameter. We use superscripts to denote the iteration, so $w^{(t)}$ are the parameters at iteration $t$. In addition, we use $g^{(t)}$ and $H^{(t)}$ to denote the gradient and Hessian of the loss at iteration $t$, respectively.
We are interested in computing the gradient of the loss after $T$ steps of gradient descent with respect to the learning rate, $\frac{\partial \ell(w^{(T)})}{\partial \theta}$. This quantity is given by $\left\langle g^{(T)}, \frac{dw^{(T)}}{d\theta} \right\rangle$. The second term in this inner product tells us how changes in the learning rate affect the final parameter value after $T$ steps. This quantity can be defined recursively using the total derivative:
$$\frac{dw^{(T)}}{d\theta} = -g^{(T-1)} + \left( I - \theta H^{(T-1)} \right) \frac{dw^{(T-1)}}{d\theta}, \qquad \frac{dw^{(0)}}{d\theta} = 0.$$
By expanding the above expression from $t = 1$ to $t = T$, we get the following expression for the unrolled gradient:
$$\frac{\partial \ell(w^{(T)})}{\partial \theta} = \left\langle g^{(T)},\; -\sum_{i=0}^{T-1} \left( \prod_{j=i+1}^{T-1} \left( I - \theta H^{(j)} \right) \right) g^{(i)} \right\rangle,$$
where the factors of the product are ordered with the latest iteration leftmost.
This expression highlights where the exploding outer-gradient comes from: the recursive definition of $\frac{dw^{(T)}}{d\theta}$ means that computing it will involve a product of the Hessian at every iteration.
This expression makes intuitive sense if we restrict the number of unrolled steps to one. In this case, the unrolled gradient is the negative inner product between the current and previous gradients: $\frac{\partial \ell(w^{(t+1)})}{\partial \theta} = -\left\langle g^{(t+1)}, g^{(t)} \right\rangle$. This means that if the current and previous gradients are correlated (have positive inner product), then updating the learning rate in the direction of the negative unrolled gradient means that we should increase the learning rate. This makes sense: if the current and previous gradients are correlated, we expect that we should move faster along this direction.
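As an illustrative special case (a worked example added under simplifying assumptions, not part of the original derivation), consider a one-dimensional quadratic inner-loss with curvature $h > 0$. Every Hessian is then the same scalar, the product of Hessian factors collapses to a power, and the growth of the unrolled gradient with $T$ can be read off directly.

```latex
\begin{align*}
\ell(w) &= \tfrac{h}{2} w^2, \qquad g^{(t)} = h\, w^{(t)}, \qquad H^{(t)} = h, \\
w^{(t)} &= (1 - \theta h)^{t}\, w^{(0)}, \\
\frac{dw^{(T)}}{d\theta}
  &= -\sum_{i=0}^{T-1} (1 - \theta h)^{T-1-i}\, h\, (1 - \theta h)^{i}\, w^{(0)}
   = -T\, h\, (1 - \theta h)^{T-1}\, w^{(0)}, \\
\frac{\partial \ell(w^{(T)})}{\partial \theta}
  &= \Big\langle g^{(T)}, \frac{dw^{(T)}}{d\theta} \Big\rangle
   = -T\, h^{2}\, (1 - \theta h)^{2T-1}\, \big(w^{(0)}\big)^{2}.
\end{align*}
```

The magnitude grows geometrically in $T$ whenever $|1 - \theta h| > 1$ (for example $\theta h > 2$), and decays when the inner optimization is stable. Setting $T = 1$ recovers $-\langle g^{(1)}, g^{(0)} \rangle = -h^2 (1 - \theta h) (w^{(0)})^2$.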
Appendix B Outer-Training Algorithm
Appendix C Architecture details
In a similar vein to diagonal preconditioning optimizers and existing learned optimizers, our architecture operates on each parameter independently. Unlike other works, we do not use a recurrent model, as we have not found applications where the performance gains are worth the increased computation. We instead employ a single-hidden-layer feed-forward MLP with 32 hidden units. This MLP takes as input momentum terms, as well as RMS terms inspired by Adam, computed at a few different decay values: [0.5, 0.9, 0.99, 0.999, 0.9999] (Wichrowska et al., 2017). A similar idea has been explored with regard to momentum parameters in Lucas et al. (2018). We also pass in 2 further terms derived from the computations performed in Adam. Despite being critical in Wichrowska et al. (2017), we find these features to have minimal impact (see §4.5). So far, this is 20 features. The current gradient as well as the current weight value are also used as features (2 additional features). By passing in weight values, the optimizer can learn to do arbitrary norm weight decay. To emulate learning rate schedules, the current training iteration is fed in after applying tanh squashing functions at different timescales. We use 9 timescales logarithmically spaced from 3 to 300k. This leaves us with 31 features in total.
All non-time features are normalized by the second moment with regard to other elements in the “batch” dimension (the other weights of the weight tensor). We choose this over other normalization strategies (e.g. batch norm) to preserve directionality. These activations are then passed into a single-hidden-layer, 32-unit MLP with ReLU activations. Many existing optimizer hyperparameters (such as learning rate) operate on an exponential scale. As such, the network produces two outputs, which we combine in an exponential manner, scaled by two temperature parameters that are set to the same value. Without these scaling terms, the default initialization yields steps on the order of size 1, far above the step size of any known optimizer, resulting in highly chaotic regions of $\theta$. It is still possible to optimize given our estimator, but training is slow and the solutions found are quite different.
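The per-parameter network described above can be sketched as follows. This is a rough illustration, not the authors' implementation. The layer sizes (31 inputs, 32 ReLU units, 2 outputs) come from the text, but everything else is an assumption: the function names, the initialization, the placement of the 9 time features in the last columns, the exact way the two outputs are combined (`lin_scale * direction * exp(exp_scale * log_lr)`), and the placeholder value `1e-3` for both temperature parameters.

```python
import numpy as np

def init_optimizer_params(rng, num_features=31, hidden=32):
    """Hypothetical initialization of the per-parameter MLP weights."""
    return {
        "w1": rng.standard_normal((num_features, hidden)) / np.sqrt(num_features),
        "b1": np.zeros(hidden),
        "w2": rng.standard_normal((hidden, 2)) / np.sqrt(hidden),
        "b2": np.zeros(2),
    }

def normalize_non_time(features, num_time=9, eps=1e-8):
    """Divide each non-time feature by its RMS across the weights of the tensor.

    Dividing by a positive scale keeps the sign of every feature intact.
    Assumption: the time features occupy the last `num_time` columns.
    """
    out = features.copy()
    non_time = out[:, :-num_time]
    rms = np.sqrt(np.mean(non_time ** 2, axis=0, keepdims=True) + eps)
    out[:, :-num_time] = non_time / rms
    return out

def mlp_update(params, features, exp_scale=1e-3, lin_scale=1e-3):
    """Map [num_weights, 31] features to one update value per weight."""
    x = normalize_non_time(features)
    hidden = np.maximum(0.0, x @ params["w1"] + params["b1"])   # 31x32 matmul + ReLU
    out = hidden @ params["w2"] + params["b2"]                  # 32x2 matmul
    direction, log_lr = out[:, 0], out[:, 1]
    # Assumed combination: magnitude on an exponential scale, times a direction.
    return lin_scale * direction * np.exp(exp_scale * log_lr)

rng = np.random.default_rng(0)
params = init_optimizer_params(rng)
features = rng.standard_normal((100, 31))   # stand-in features for 100 weights
step = mlp_update(params, features)
print(step.shape, np.abs(step).max())
```

Because the same small network is applied to every weight independently, the whole update reduces to two batched matrix multiplies over the weight dimension.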
The optimizer targets a 3-layer convolutional neural network with 3x3 kernels and 32 units per layer. The first 2 layers have stride 2, and the 3rd layer has stride 1. We use ReLU activations and Glorot initialization (Glorot & Bengio, 2010). After the last convolutional layer, an average pool is performed, and a linear projection is applied to get the 10 output classes.
We train using the algorithm described in Appendix B, using a linear schedule on the number of unrolling steps from 50 to 10k over the course of 5k outer-training iterations. To add variation in length, we additionally shift this length by a percentage uniformly sampled between (-20%, 20%). We optimize the outer-parameters, $\theta$, using Adam (Kingma & Ba, 2014) with a batch size of 128 and with a learning rate of 0.003 for the training outer-objective and 0.0003 for the validation outer-objective, with other settings following existing literature on non-stationary optimization (Arjovsky et al., 2017). While both values of learning rate work for both outer-objectives, we find the validation outer-objective to be considerably harder, and training is more stable with the lower learning rate.
Appendix D Additional inner loop problem learning curves
We plot additional learning curves from both the outer-train task distribution and the outer-validation task distribution. The horizontal lines represent the minimum performance achieved over 20k steps. See Figure 1.
Appendix E Out of domain generalization
In this work, we focus our attention on learning optimizers over a specific task distribution (3-layer convolutional networks trained on 10-way subsets of 32x32 Imagenet). In addition to testing on these in-domain problems (Appendix D), we test our learned optimizer on a variety of out-of-domain target problems. Despite little variation in the outer-training task distribution, our models show promising generalization when transferred to a wide range of different architectures (fully connected, convolutional networks), depths (2 layers to 6 layers), and numbers of parameters (models with roughly 16x more parameters). We see this as a promising sign that our learned optimizer has a reasonable (but not perfect) inductive bias. We leave training with increased variation to encourage better generalization as an area for future work.
Appendix F Inner-loop training speed
When training models, often one cares about taking less wallclock time as compared to loss decrease per weight update. Much like existing first order optimizers, the computation performed in our learned optimizer is linear in terms of number of parameters in the model being trained and smaller than the cost of computing gradients. The bulk of the computation in our model consists of two batched matrix multiplies of size 31x32, and 32x2. When training models that make use of weight sharing, e.g. RNN or CNN, the computation performed per weight often grows super linearly with parameter count. As the learned optimizer methods are scaled up, the additional overhead in performing more complex weight updates will vanish.
For the specific models we test in this paper, we measure the performance of our optimizer on CPU and GPU. We implement Adam, SGD, and our learned optimizer in TensorFlow for this comparison. Given the small scale of the problem we are working at, we implement training in graph in a tf.while_loop to avoid TensorFlow Session overhead. We use random input data instead of real data to avoid any data loading confounding. On CPU the learned optimizer executes at 80 batches per second, whereas Adam runs at 92 batches per second and SGD at 93 batches per second. The learned optimizer is roughly 15% to 16% slower than both.
On a GPU (Nvidia Titan X) we measure 177 batches per second for the learned optimizer, 278 batches per second for Adam, and 358 for SGD. This is 57% slower than Adam and 102% slower than SGD.
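The slowdown percentages can be reproduced from the reported throughputs. The sketch below assumes "X% slower" means the ratio of the baseline rate to the learned optimizer's rate, minus one; the helper name `percent_slower` is hypothetical.

```python
def percent_slower(reference_rate, measured_rate):
    """Extra time per batch at measured_rate relative to reference_rate, in percent."""
    return (reference_rate / measured_rate - 1.0) * 100.0

# Throughputs in batches per second, as reported in the text.
gpu_vs_adam = percent_slower(278, 177)
gpu_vs_sgd = percent_slower(358, 177)
cpu_vs_adam = percent_slower(92, 80)
cpu_vs_sgd = percent_slower(93, 80)

print(f"GPU: {gpu_vs_adam:.0f}% vs Adam, {gpu_vs_sgd:.0f}% vs SGD")
print(f"CPU: {cpu_vs_adam:.0f}% vs Adam, {cpu_vs_sgd:.0f}% vs SGD")
```

Under this definition the GPU figures come out at about 57% and 102%, and the CPU figures at about 15% and 16%.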
Overhead is considerably higher on GPU due to the increased number of ops, and thus kernel executions, sent to the GPU. We expect a fused kernel can dramatically reduce this overhead. Despite the slowdown in computation, the performance gains (greater than 400% faster in steps) far exceed the slowdown, resulting in an optimizer that is still considerably faster when measured in wallclock time.