Deep neural networks (DNNs) have served as the cornerstone of modern machine learning with big data. It has been empirically shown that DNNs achieve state-of-the-art generalization performance on a wide variety of tasks when trained with SGD Zhang et al. (2017); Arpit et al. (2017). Several recent studies observe that SGD tends to select so-called flat minima, which seem to generalize better in practice Hochreiter and Schmidhuber (1997a); Keskar et al. (2017); Dinh et al. (2017); Wu et al. (2018a); Izmailov et al. (2018); Li et al. (2018). In particular, it has been experimentally studied how the learning rate (LR) Goyal et al. (2017); Hoffer et al. (2017); Jastrzębski et al. (2017) influences the minima found by SGD. Theoretically, Wu et al. (2018a) showed that the LR plays an important role in minima selection from a dynamical stability perspective, and He et al. (2019) provided a PAC-Bayes generalization bound for DNNs trained by SGD that is correlated with the LR. The LR thus strongly influences the generalization performance of model training, and finding a proper LR schedule has been widely studied recently Bengio (2012); Schaul et al. (2013); Jastrzębski et al. (2017); Nar and Sastry (2018).
There mainly exist three kinds of hand-designed LR schedules for improving SGD training. 1) Pre-designed LR strategies are most commonly used in current works, such as decaying LR Gower et al. (2019) or cyclic LR Smith (2017); Loshchilov and Hutter (2017). These elaborate heuristic strategies have yielded large improvements in training efficiency, and some theoretical works suggest that decay schedules can achieve faster convergence Ge et al. (2019); Davis et al. (2019) or avoid strict saddles Lee et al. (2019); Panageas et al. (2019) under mild conditions. However, this strategy introduces extra hyper-parameters to tune, e.g., when to decay and the decaying factor. 2) Traditional LR search methods Nocedal and Wright (2006) can be extended to automatically search the LR for SGD when training DNNs, such as Polyak's update rule Rolinek and Martius (2018), the Frank-Wolfe algorithm Berrada et al. (2019), and Armijo line search Vaswani et al. (2019). However, these methods need to heuristically set extra tunable parameters, appearing in their theoretical assumptions, to ensure good practical performance. 3) Adaptive gradient methods and their variants, like Adam, have been developed Duchi et al. (2011); Tieleman and Hinton (2012); Kingma and Ba (2015) to adapt a coordinate-specific LR according to gradient information and thereby avoid tuning the LR. However, it is still suggested to carefully hand-tune the global LR and other hyper-parameters to obtain good performance in practice Wilson et al. (2017).
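For concreteness, the three pre-designed schedule families above can be sketched as simple functions of the epoch index; the formulas are the standard ones, while all constants here are merely illustrative, not prescriptions from any of the cited works:

```python
import math

def multistep_lr(t, base_lr=0.1, milestones=(80, 120), gamma=0.1):
    """Drop the LR by `gamma` at each milestone epoch (MultiStep decay)."""
    passed = sum(1 for m in milestones if t >= m)
    return base_lr * gamma ** passed

def exponential_lr(t, base_lr=0.1, gamma=0.98):
    """Multiply the LR by `gamma` every epoch (Exponential decay)."""
    return base_lr * gamma ** t

def sgdr_lr(t, base_lr=0.1, min_lr=0.0, cycle=50):
    """Cosine annealing with warm restarts, in the style of SGDR."""
    t_cur = t % cycle  # position inside the current cycle
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t_cur / cycle))
```

Each of these exposes the extra hyper-parameters the text mentions (milestones, decay factor, cycle length), which is exactly what must be re-tuned per task.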
Although the above LR schedules (depicted in Fig. 1(a) and 1(b)) can achieve competitive results on their learning tasks, they still have evident deficiencies in practice. On the one hand, these pre-defined LR schedules, as well as their additional hyper-parameters, have limited flexibility to adapt to non-convex optimization problems, whose training dynamics vary significantly. On the other hand, there is no common methodology to guide the design of general LR schedules. When encountering a new problem, one must choose among the LR schedules above and then tune their hyper-parameters, which makes finding a good schedule expensive in time and computation. This tends to increase their application difficulty and harm their performance stability in real problems.
To alleviate the aforementioned issues, this paper presents a model that learns an adaptive LR schedule for the SGD algorithm from data. The main idea is to parameterize the LR schedule as an LSTM network Hochreiter and Schmidhuber (1997b), which is well suited to such long-term information-dependent problems. As shown in Fig. 1(c), the proposed Meta-LR-Schedule-Net (MLR-SNet) learns an explicit loss-LR dependent relationship that can adjust the LR adaptively based on the current training loss, as well as information from past training histories stored in the MLR-SNet, under the guidance of a small validation set. In summary, this paper makes the following three-fold contributions.
We propose MLR-SNet to learn an adaptive LR schedule in a meta-learning manner. MLR-SNet can adjust the LR adaptively to comply with the current training dynamics by leveraging information accumulated during the training process. Due to its explicit parameterized form, MLR-SNet is more flexible than pre-defined LR schedules in finding a proper LR schedule for a specific learning task. Fig. 1(d) and 1(e) show LR schedules learned by our method; they share a similar global tendency with pre-defined strategies, while exhibiting more local variation, demonstrating that our method adaptively adjusts the LR according to the current training dynamics along the iterations.
The trained MLR-SNet, as a ready-made LR schedule, can be used in various other tasks with different batch sizes, epochs, datasets and network architectures. Fig. 1(f) shows that the LR schedules transferred by MLR-SNet take similar forms to the pre-set LR schedules in our experiments. In particular, we transfer the learned LR schedules to large-scale optimization problems, like training ResNet-50 on ImageNet, and obtain results comparable to hand-designed LR schedules (shown in Fig. 16). This can potentially save substantial labor and computation cost in applications.
Different from current hand-designed LR schedules, which vary across tasks, our MLR-SNet learns the LR schedule under a single data-driven methodology, making it easily applicable to different tasks without requiring much prior knowledge about LR settings. Specifically, as shown in Table 2, on datasets with different image corruption types as in Hendrycks and Dietterich (2019), using a single MLR-SNet algorithm, our method performs more robustly and stably on average than conventional hand-designed LR schedules, which require different strategies to be set for each of these datasets.
2 Related Work
Meta learning for optimization. Meta learning, or learning to learn, has a long history in psychology Ward (1937); Lake et al. (2017). Meta learning for optimization dates back to the 1980s-1990s Schmidhuber (1992); Bengio et al. (1991), aiming to meta-learn the optimization process of learning itself. Recently, Andrychowicz et al. (2016); Ravi and Larochelle (2017); Chen et al. (2017); Wichrowska et al. (2017); Li and Malik (2017); Lv et al. (2017) have attempted to scale this approach to larger DNN optimization problems. The main idea is to construct a meta-learner as the optimizer, which takes gradients as input and outputs the updating rules. These approaches aim to select appropriate training algorithms, schedule the LR and tune other hyper-parameters automatically. The meta-learner can be updated by minimizing the generalization error on a validation set. Alternatively, Li and Malik (2017) utilized reinforcement learning, and Ravi and Larochelle (2017) used the test error of few-shot learning tasks, to train the meta-learner. Beyond continuous optimization, these ideas have also been applied to other optimization problems, such as black-box functions Chen et al. (2017), model curvature Park and Oliva (2019), evolution strategies Houthooft et al. (2018), combinatorial functions Rosenfeld et al. (2018), MCMC proposals Wang et al. (2018), etc.
Though faster at decreasing the training loss than traditional optimizers in some cases, the learned optimizers may not always generalize well to diverse problems, especially for longer horizons Lv et al. (2017) and large-scale optimization problems Wichrowska et al. (2017). Moreover, they cannot be guaranteed to output a proper descent direction at each iteration of network training, since they assume that all parameters share one small net and ignore the relationships among the involved parameters. Our proposed method instead learns an adaptive LR schedule rather than the whole updating rule. This makes the schedule easier and more faithful to learn, and the learned schedules can readily generalize to other tasks.
HPO and LR schedule adaptation. Hyper-parameter optimization (HPO) has historically been investigated as selecting proper values for algorithm hyper-parameters to obtain better performance on a validation set (see Hutter et al. (2019) for an overview). Typical methods include grid search, random search Bergstra and Bengio (2012), Bayesian optimization Snoek et al. (2012), gradient-based methods Franceschi et al. (2017); Shu et al. (2020a), etc. Recently, some works attempt to find a proper LR schedule under the framework of gradient-based HPO, which can be solved by bilevel optimization Franceschi et al. (2017). To improve computational efficiency, Baydin et al. (2018) derived greedy updating rules. However, most HPO techniques tend to suffer from short-horizon bias and easily find bad minima Wu et al. (2018b). Meanwhile, since they regard the LR as a hyper-parameter to learn, without a transferable formulation, the learned LR schedules cannot generalize directly to other learning tasks.
Our method attempts to take a further step along this line. Instead of treating the LR as a hyper-parameter, we design a meta-learner with an explicit mapping formulation to parameterize LR schedules, which can adjust the LR adaptively to comply with the current training dynamics by leveraging information from past training histories. Meanwhile, the parameterized formulation makes it possible to generalize to other tasks. Recently, Xu et al. (2019) employed an LR controller to help the learned LR schedule generalize to new tasks. However, they use a reinforcement learning framework to train the controller, which is comparatively hard to scale to long horizons and large-scale optimization problems.
3 The Proposed Meta-LR-Schedule-Net (MLR-SNet) Method
The problem of training DNNs can be formulated as the following non-convex optimization problem:

$\min_{w} \mathcal{L}^{train}(w) := \frac{1}{N}\sum_{i=1}^{N} \ell\big(y_i, f(x_i; w)\big),$  (1)

where $\ell(y, f(x; w))$ is the training loss function for the $N$ data samples $\{(x_i, y_i)\}_{i=1}^{N}$, which characterizes the deviation of the model prediction $f(x; w)$ from the data, and $w$ represents the parameters of the model (e.g., the weight matrices in a neural network) to be optimized. SGD Robbins and Monro (1951); Polyak (1964) and its variants, including Momentum Tseng (1998), Adagrad Duchi et al. (2011), Adadelta Zeiler (2012)
, RMSprop Tieleman and Hinton (2012), and Adam Kingma and Ba (2015), are often used for training DNNs. In general, these algorithms can be summarized by the following formulation Robbins and Monro (1951):

$w_{t+1} = w_t - \mathcal{A}\big(\nabla \mathcal{L}^{train}(w_t), \mathcal{H}_t; \Theta\big),$  (2)

where $w_t$ is the model parameter at the $t$-th update, $\nabla \mathcal{L}^{train}(w_t)$ denotes the gradient of $\mathcal{L}^{train}$ at $w_t$, $\mathcal{H}_t$ represents the historical gradient information, and $\Theta$ is the hyper-parameter of the optimizer $\mathcal{A}$, e.g., the LR. To present our method's efficiency, we focus on the following vanilla SGD formulation:
$w_{t+1} = w_t - \alpha_t \cdot \frac{1}{|B_t|}\sum_{i \in B_t} \nabla \ell_i(w_t),$  (3)

where $B_t$ denotes the mini-batch randomly sampled from the training dataset, $\nabla \ell_i(w_t)$ denotes the gradient of sample $i$ computed at $w_t$, and $\alpha_t$ is the LR at the $t$-th iteration.
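As a toy numerical illustration of this vanilla SGD update (our own sketch, not the paper's code), consider mini-batch SGD on a one-dimensional least-squares problem with a simple step-decay LR schedule; all constants are illustrative:

```python
import random

# Mini-batch SGD on l_i(w) = 0.5 * (w - x_i)^2, whose minimizer is the data
# mean (close to 2.0 here). Constants and names are illustrative only.
random.seed(0)
data = [2.0 + random.gauss(0, 0.1) for _ in range(256)]

def grad(w, batch):
    # mini-batch gradient: average of per-sample gradients (w - x_i)
    return sum(w - x for x in batch) / len(batch)

w, base_lr = 0.0, 0.5
for t in range(200):
    lr = base_lr * (0.1 ** (t // 100))   # drop LR by 10x every 100 steps
    batch = random.sample(data, 32)      # B_t: random mini-batch
    w = w - lr * grad(w, batch)          # vanilla SGD update of Eq. (3)
# w is now close to the data mean 2.0
```

The point of the sketch is that the whole trajectory depends on the scalar sequence lr over iterations, which is exactly the object the LR schedules in Section 1 pre-define and MLR-SNet instead predicts.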
3.1 Existing LR schedule strategies
As Bengio demonstrated in Bengio (2012), the choice of LR remains central to effective DNN training with SGD. As mentioned in Section 1, a variety of hand-designed LR schedules have been proposed. While they achieve competitive results on some learning tasks, they mostly share several drawbacks: 1) pre-defined LR schedules, as well as their additional hyper-parameters, have limited flexibility to adapt to non-convex optimization problems, whose training dynamics vary significantly; 2) there is no common methodology for setting LR schedules, which makes it time-consuming and computationally expensive to find a good schedule for a new problem.
Inspired by recent meta-learning developments Schmidhuber (1992); Finn et al. (2017); Shu et al. (2018, 2019, 2020b), some researchers proposed to learn a generic optimizer from data Andrychowicz et al. (2016); Ravi and Larochelle (2017); Chen et al. (2017); Wichrowska et al. (2017); Li and Malik (2017); Lv et al. (2017). The main idea is to learn a meta-learner as the optimizer that guides the learning of the whole updating rule for a specific problem. For example, Andrychowicz et al. (2016) replace Eq. (2) with the following formulation:

$w_{t+1} = w_t + g_t, \quad [g_t; h_{t+1}] = m\big(\nabla \mathcal{L}^{train}(w_t), h_t; \phi\big),$  (4)

where $g_t$ is the output of an LSTM net $m$, parameterized by $\phi$, whose state is $h_t$. This strategy allows selecting appropriate training algorithms, scheduling the LR and tuning other hyper-parameters in a unified and automatic way. Though faster at decreasing the training loss than traditional optimizers in some cases, the learned optimizer may not always generalize well to more varied and diverse problems, like longer horizons Lv et al. (2017) and large-scale optimization problems Wichrowska et al. (2017). Moreover, it cannot be guaranteed to output a proper descent direction at each iteration of network training. This tends to further increase its application difficulty and harm its performance stability in real problems.
Recently, some methods Franceschi et al. (2017); Baydin et al. (2018) consider the following constrained optimization problem, searching for the optimal LR schedule such that the produced model attains small validation error:

$\min_{\alpha} \mathcal{L}^{val}\big(w_T(\alpha)\big), \quad \text{s.t. } w_{t+1} = \Phi(w_t, \alpha_t), \; t = 0, \dots, T-1,$  (5)

where $\mathcal{L}^{val}$ denotes the validation loss function computed on a hold-out validation set $D^{val}$, $\alpha = \{\alpha_t\}_{t=0}^{T-1}$ is the to-be-solved hyper-parameter, $\Phi$ is a stochastic weight update dynamic, like the updating rule in Eq. (2) or the vanilla SGD in Eq. (3), and $T$ is the maximum iteration step. Though achieving results similar to hand-designed LR schedules on some tasks, these methods still cannot generalize to new tasks, lacking an explicit transferable mapping able to be readily transferred.
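A minimal sketch (our illustration, not the authors' code) of the greedy hypergradient rule of Baydin et al. (2018), which sidesteps the full bilevel problem by nudging the LR with the inner product of consecutive gradients; the 1-D quadratic and all constants are illustrative:

```python
# Greedy hypergradient-descent LR adaptation on f(w) = 0.5 * (w - 3)^2.
# The LR rule lr += beta * g_t * g_{t-1} follows from d f(w_t) / d lr,
# since w_t = w_{t-1} - lr * g_{t-1}. Constants are illustrative only.
def grad(w):
    return w - 3.0

w, lr, beta = 0.0, 0.01, 0.001
g_prev = grad(w)
losses = [0.5 * (w - 3.0) ** 2]
for _ in range(100):
    g = grad(w)
    lr = lr + beta * g * g_prev   # hypergradient step on the LR
    w = w - lr * g                # ordinary gradient step with the adapted LR
    g_prev = g
    losses.append(0.5 * (w - 3.0) ** 2)
```

The LR here grows while consecutive gradients agree and would shrink if they started to oppose each other; being a one-step-lookahead rule, it exhibits exactly the short-horizon behavior the text attributes to greedy HPO updates.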
3.2 Meta-LR-Schedule-Net (MLR-SNet) Method
To address the aforementioned issues, we specifically design a meta-learner with an explicit mapping formulation, called MLR-SNet, to parameterize LR schedules, which learns an adaptive LR schedule complying with the current training dynamics by leveraging information from past training histories. To this end, we replace the LR $\alpha_t$ in Eq. (3) with the output of the MLR-SNet, as shown in Fig. 1(c) (the structure is shown in Fig. 2):

$w_{t+1} = w_t - \mathcal{P}(\ell_t; \theta_t) \cdot \frac{1}{|B_t|}\sum_{i \in B_t} \nabla \ell_i(w_t),$  (6)

where $\theta_t$ is the parameter of the MLR-SNet $\mathcal{P}$ at the $t$-th iteration and $\ell_t$ is the current training loss. At any iteration step, $\mathcal{P}$ can learn an explicit loss-LR dependent relationship, such that the net adaptively predicts the LR according to the current input loss $\ell_t$, as well as the historical information stored in the net. At every iteration step, the whole forward computation process is
$x_t = W_x \ell_t + b_x,$
$i_t = \sigma(W_i [x_t; h_{t-1}] + b_i),$
$f_t = \sigma(W_f [x_t; h_{t-1}] + b_f),$
$o_t = \sigma(W_o [x_t; h_{t-1}] + b_o),$
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [x_t; h_{t-1}] + b_c),$
$h_t = o_t \odot \tanh(c_t),$
$\alpha_t = \gamma \cdot \sigma(W_p h_t + b_p),$  (7)

where $i_t, f_t, o_t$ denote the Input, Forget and Output gates, respectively. Different from a vanilla LSTM, the input training loss $\ell_t$ is first preprocessed by a fully-connected layer. After that, the predicted value is obtained by a linear transform on the hidden state $h_t$ with a Sigmoid activation function $\sigma$. Finally, we introduce a scale factor $\gamma$ to guarantee that the final predicted LR lies in the interval $(0, \gamma)$.¹ Albeit simple, this net is known for dealing with such long-term information-dependent problems, and is thus capable of finding a proper LR schedule that complies with the significant variations of training dynamics.

¹ We find that the loss range of text tasks is around one order of magnitude higher than that of image tasks. In our paper, we empirically set $\gamma = 1$ for image tasks and $\gamma = 20$ for text tasks to eliminate the influence of loss magnitude.
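A minimal pure-Python sketch of this forward pass (hidden size, random weights and all names are placeholders for illustration; the actual MLR-SNet is trained end-to-end with autodiff):

```python
import math, random

# Sketch of an MLR-SNet-style step: an LSTM cell mapping the current training
# loss to an LR in (0, gamma). Weights here are random placeholders.
random.seed(0)
H = 2                                       # hidden size (illustrative)
def vec(n): return [random.uniform(-0.5, 0.5) for _ in range(n)]
def mat(m, n): return [vec(n) for _ in range(m)]
def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))
def dot(row, v): return sum(r * x for r, x in zip(row, v))

Wx, bx = vec(1)[0], 0.0                     # FC preprocessing of the scalar loss
W = {g: mat(H, 1 + H) for g in "ifoc"}      # each gate sees [x_t; h_{t-1}]
b = {g: vec(H) for g in "ifoc"}
Wp, bp = vec(H), 0.0                        # output head
gamma = 1.0                                 # scale factor (1 for image tasks)

def mlr_snet_step(loss, h, c):
    x = Wx * loss + bx                      # fully-connected preprocessing
    z = [x] + h                             # concatenation [x_t; h_{t-1}]
    i = [sigmoid(dot(W["i"][k], z) + b["i"][k]) for k in range(H)]
    f = [sigmoid(dot(W["f"][k], z) + b["f"][k]) for k in range(H)]
    o = [sigmoid(dot(W["o"][k], z) + b["o"][k]) for k in range(H)]
    g = [math.tanh(dot(W["c"][k], z) + b["c"][k]) for k in range(H)]
    c = [f[k] * c[k] + i[k] * g[k] for k in range(H)]
    h = [o[k] * math.tanh(c[k]) for k in range(H)]
    lr = gamma * sigmoid(dot(Wp, h) + bp)   # predicted LR in (0, gamma)
    return lr, h, c

h, c = [0.0] * H, [0.0] * H
lr, h, c = mlr_snet_step(2.3, h, c)
```

The cell state (h, c) is what carries the "past training histories" between calls: feeding a sequence of losses through the same cell lets the predicted LR depend on the whole trajectory, not just the current loss.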
Here, we employ the technique in Finn et al. (2017); Shu et al. (2019) to jointly update the MLR-SNet parameter $\theta$ and the model parameter $w$, so as to explore a proper LR schedule with better generalization for DNN training. Ideally, $\theta$ should minimize the validation loss of the model trained with the predicted LR schedule:

$\theta^* = \arg\min_{\theta} \mathcal{L}^{val}\big(w^*(\theta)\big) := \frac{1}{M}\sum_{j=1}^{M} \ell_j^{val}\big(w^*(\theta)\big),$  (8)

where $M$ is the size of the hold-out validation set $D^{val}$.
Updating $\theta$. At iteration step $t$, we first adjust the MLR-SNet parameter according to the model parameter $w_t$ and MLR-SNet parameter $\theta_t$ obtained in the last step, by minimizing the validation loss defined in Eq. (8). Adam can be employed to optimize the validation loss, i.e.,

$\theta_{t+1} = \theta_t - \eta \, \mathrm{Adam}\Big(\nabla_{\theta} \frac{1}{m}\sum_{j=1}^{m} \ell_j^{val}\big(\hat{w}_{t+1}(\theta)\big)\Big|_{\theta_t}\Big),$  (9)

where $\mathrm{Adam}(\cdot)$ denotes the Adam algorithm, whose input is the gradient of the validation loss with respect to the MLR-SNet parameter $\theta$, computed on a mini-batch of $m$ samples from $D^{val}$, and $\eta$ denotes the LR of Adam. Other SGD variants can be used to update $\theta$; we choose Adam to avoid extra tuning of its LR. The following equation is used to formulate $\hat{w}_{t+1}(\theta)$² on a mini-batch of training samples from $D^{train}$:

$\hat{w}_{t+1}(\theta) = w_t - \mathcal{P}(\ell_t; \theta) \cdot \frac{1}{|B_t|}\sum_{i \in B_t} \nabla \ell_i(w_t).$  (10)

² Notice that $\hat{w}_{t+1}$ here is a function of $\theta$, to guarantee that the gradient in Eq. (9) can be computed.
Updating $w$. Then, the updated $\theta_{t+1}$ is employed to ameliorate the model parameter $w_t$, i.e.,

$w_{t+1} = w_t - \mathcal{P}(\ell_t; \theta_{t+1}) \cdot \frac{1}{|B_t|}\sum_{i \in B_t} \nabla \ell_i(w_t).$  (11)
The MLR-SNet learning algorithm is summarized in Algorithm 1. All gradient computations can be efficiently implemented by automatic differentiation libraries, like PyTorch Paszke et al. (2019), and generalized to any DNN architecture. The MLR-SNet is thus gradually optimized during the learning process and adjusts the LR dynamically based on the training dynamics of the DNN.
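As a toy numerical illustration of this alternating scheme (our own sketch, with a single scalar meta-parameter theta standing in for the full MLR-SNet, and plain SGD in place of Adam), the predicted LR is sigmoid(theta) and the meta-gradient of the validation loss through the one-step lookahead is computed by hand via the chain rule:

```python
import math

# Alternating meta-update sketch on a 1-D quadratic: theta predicts the LR as
# sigmoid(theta); we differentiate the validation loss through the one-step
# lookahead w_hat(theta) by hand. All quantities are illustrative; the real
# method trains an LSTM with autodiff (e.g., PyTorch).
a, b = 3.0, 3.0                     # minimizers of train / validation loss
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

w, theta, eta = 0.0, -2.0, 0.5      # eta: meta step size (plain SGD, not Adam)
for _ in range(50):
    g = w - a                       # train gradient at w_t
    # -- update theta through the lookahead w_hat(theta) = w - sigmoid(theta) * g
    s = sigmoid(theta)
    w_hat = w - s * g
    dval_dtheta = (w_hat - b) * (-s * (1 - s) * g)   # chain rule, by hand
    theta = theta - eta * dval_dtheta                # meta step on theta
    # -- update w with the freshly predicted LR
    w = w - sigmoid(theta) * g

val_loss = 0.5 * (w - b) ** 2
```

Starting from a too-small LR (theta = -2), the meta-gradient consistently pushes theta up while large training gradients persist, so the schedule "learns" to accelerate, then the updates vanish as the validation loss approaches zero.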
4 Experimental Results
To evaluate the proposed MLR-SNet, we first conduct experiments to show that our method is capable of finding proper LR schedules compared with baseline methods. Then we transfer the learned LR schedules to various tasks to show their superior generalization. Finally, we show that our method behaves robustly and stably when the training data contain different corruptions, using the proposed single MLR-SNet algorithm instead of different manually set LR schedules as is conventional.
4.1 Evaluation on the Learned LR Schedule by MLR-SNet
Datasets and models. To verify the general effectiveness of our method, we train different models on four benchmark datasets: ResNet-18 He et al. (2016) on CIFAR-10, WideResNet-28-10 Zagoruyko and Komodakis (2016) on CIFAR-100 Krizhevsky (2009), a 2-layer LSTM on Penn Treebank Marcus and Marcinkiewicz, and a 2-layer Transformer Vaswani et al. (2017) on WikiText-2 Merity et al. (2017).
Baselines. For image classification tasks, the compared methods include SGD with hand-designed LR schedules: 1) Fixed LR, 2) Exponential decay, 3) MultiStep decay, 4) SGD with restarts (SGDR) Loshchilov and Hutter (2017). We also compare with SGD with Momentum (SGDM) under the above four LR schedules, with the momentum fixed at 0.9. Meanwhile, we compare with an adaptive gradient method: 5) Adam; an LR search method: 6) L4 Rolinek and Martius (2018); and current LR schedule adaptation methods: 7) hypergradient descent (HD) Baydin et al. (2018) and 8) real-time hyper-parameter optimization (RTHO) Franceschi et al. (2017). For text classification tasks, we compare with 1) SGD and 2) Adam, with the LR tuned using a validation set; they drop the LR by a factor of 4 when the validation loss stops decreasing. We also compare with 3) L4, 4) HD, and 5) RTHO. We run all experiments with 3 different seeds and report the accuracy. Detailed experimental settings and more experimental results are presented in the supplementary material.
Image tasks. Fig. 3(a) and 3(b) show the classification accuracy on the CIFAR-10 and CIFAR-100 test sets, respectively. It can be observed that: 1) Our algorithm outperforms all other competing methods; the LR schedules learned by MLR-SNet are presented in Fig. 1(d), and have a similar shape to the hand-designed strategies while showing more elaborate local variations to adapt to the training dynamics. 2) The Fixed LR performs similarly to other baselines early in training, but falls into fluctuations later on, implying that a fixed LR cannot finely adapt to the training dynamics of DNNs. 3) The MultiStep LR drops the LR at certain epochs; this simple strategy overcomes the issue of the Fixed LR and obtains higher and more stable performance later in training. 4) The Exponential LR improves test performance faster early in training than other baselines, but makes slow progress later due to the smaller LR. 5) SGDR uses a cyclic LR, which needs more epochs to obtain a stable result. 6) Though Adam has an adaptive coordinate-specific LR, it behaves worse than the MultiStep and Exponential LR, as demonstrated in Wilson et al. (2017); extra tuning is necessary for better performance. 7) L4 greedily searches the LR to decrease the loss, but the complex training dynamics of DNNs do not guarantee that it reaches a good minimum. 8) HD and RTHO achieve performance similar to hand-designed LR schedules. Since image tasks often use SGDM to train DNNs, Fig. 3(d) and 3(e) show the results of baseline methods trained with SGDM, which obtain a remarkable improvement over SGD. Though not using extra historical gradient information to help optimization, our method achieves results comparable to these baselines by finding a proper LR schedule for SGD.
Text tasks. Fig. 3(c) and 3(f) show the test perplexity on the Penn Treebank and WikiText-2 datasets, respectively. Adam and SGD tune their LR using a validation set, and thus always perform well; our method achieves comparable results, while outperforming the other competing methods. The learned LR schedules are presented in Fig. 1(e), and have a similar shape to the hand-designed strategies. L4 easily falls into bad minima, and HD and RTHO sometimes underperform SGD.
4.2 Transferability of Learned LR Schedule
We investigate the transferability of the learned LR schedule by applying it to various tasks. Since methods 6), 7), 8) in Section 4.1 are not able to generalize, we do not compare with them here. The compared methods are trained with SGDM for image tasks, as a stronger baseline. We use the MLR-SNet learned on CIFAR-10 with ResNet-18 in Section 4.1 as the transferred LR schedule.
Generalization to different batch sizes. The MLR-SNets are learned with batch size 128. We can then readily transfer the learned schedule to varying batch sizes, as shown in Fig. 4. Performance comparable to specifically hand-designed LR schedules can be obtained. In particular, when the batch size increases, the test accuracy of our method degrades less than with a fixed LR.
Generalization to different epochs. The MLR-SNets are learned with 200 training epochs, and we transfer the learned LR schedules to different numbers of training epochs. As shown in Fig. 5, the performance of models trained with our transferred LR schedules gradually improves as the number of training epochs increases, while there is little improvement for the competitive Exponential LR.
Generalization to different datasets. We transfer the LR schedules learned on CIFAR-10 to SVHN Netzer et al. (2011), TinyImageNet (available at https://tiny-imagenet.herokuapp.com), and Penn Treebank Marcus and Marcinkiewicz. As shown in Fig. 6, though the datasets vary from image to text, our method still obtains relatively stable generalization performance across the different tasks.
Generalization to different net architectures. We also transfer the LR schedules learned on ResNet-18 to the light-weight nets ShuffleNetV2 Ma et al. (2018) and MobileNetV2 Sandler et al. (2018), and to NASNet Zoph et al. (2018) (the PyTorch code of all these networks can be found at https://github.com/weiaicunzai/pytorch-cifar100). As shown in Fig. 7, our method achieves results almost similar to SGDM with MultiStep or Exponential LR.
Generalization to large-scale optimization. To the best of our knowledge, among the current learning-to-optimize literature, only Wichrowska et al. (2017) attempted to train DNNs on the ImageNet dataset, yet their optimizer could only be executed for thousands of steps, far from the optimization process in practice. We transfer the learned LR schedule to train ResNet-50 on the ImageNet dataset Deng et al. (2009) (the training code can be found at https://github.com/pytorch/examples/tree/master/imagenet). As shown in Fig. 16, the validation accuracy of our method is competitive with the hand-designed baseline methods.
4.3 Robustness on Different Data Corruptions
While hand-designed LR schedules may be elaborate and effective for specific tasks, it is always hard to flexibly adapt them to a new problem without human intervention. Our proposed regime naturally alleviates this issue with a single data-driven, automatic LR-schedule adaptation methodology, under the guidance of a small clean meta dataset. To illustrate this, we design the following experiments: we take CIFAR-10-C and CIFAR-100-C Hendrycks and Dietterich (2019) as our training sets, consisting of 15 types of generated corruptions applied to the test images of CIFAR-10/CIFAR-100, and use the original training set of CIFAR-10/100 as the test set. Though the original images of CIFAR-10/100-C are the same as those of the CIFAR-10/100 test set, the corruptions change the data distributions. To guarantee that the learned models generalize well to the test set, we choose as the validation set 10 clean images per class. Each corruption can roughly be regarded as a task, so we obtain 15 models trained on CIFAR-10/100-C. Table 2 shows the mean test accuracy of the 15 models trained on CIFAR-10/100-C using MLR-SNet and hand-designed LR schedules for SGDM. As can be seen, though our method underperforms the baseline methods of Section 4.1 on the regular CIFAR training, it evidently outperforms them under this new training setting. This implies that our method behaves more robustly and stably than pre-set LR schedules when the learning task changes, since it always tries to find a proper LR schedule that minimizes the generalization error based on the knowledge conveyed by the given data.
5 Conclusion and Discussion
In this paper, we have proposed to learn an adaptive LR schedule in a meta-learning manner. To this end, we design a meta-learner with an explicit mapping formulation to parameterize LR schedules, adaptively adjusting the LR to comply with the current training dynamics based on the training loss and information from past training histories. Comprehensive experiments on various image and text benchmarks substantiate the superiority of our method in adaptability, generalization capability and robustness, compared with current hand-designed LR schedules.
The preliminary experimental evaluations show that our method achieves good convergence performance on various tasks. We observe that the learned LR schedules in our experiments follow a consistent trajectory, as shown in Fig. 1, sharing a similar tendency with the pre-set LR schedules; existing convergence guarantees for such schedules Li et al. (2020) can thus roughly explain our good convergence behavior when training DNNs. A detailed theoretical convergence analysis of our method is left for future work. Furthermore, Keskar et al. (2017); Dinh et al. (2017) suggested that the width of a local optimum is related to generalization: wider optima lead to better generalization. We use the visualization technique in Izmailov et al. (2018) to visualize the "width" of the solutions found by different LR schedules on CIFAR-100 with ResNet-18. As shown in Fig. 9, our method lies in a wide flat region of the training loss. This could explain the better generalization of our method compared with pre-set LR schedules. Deeper understanding of this point will be further investigated.
-  (2016) Learning to learn by gradient descent by gradient descent. In NeurIPS, Cited by: §2, §3.1.
-  (2017) A closer look at memorization in deep networks. In ICML, Cited by: §1.
-  (2018) Online learning rate adaptation with hypergradient descent. In ICLR, Cited by: §2, §3.1, §4.1.
-  (1991) Learning a synaptic learning rule. In IJCNN, Vol. 2, pp. 969–vol. Cited by: §2.
-  (2012) Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade, pp. 437–478. Cited by: §1, §3.1.
-  (2012) Random search for hyper-parameter optimization. JMLR. Cited by: §2.
-  (2019) Deep frank-wolfe for neural network optimization. In ICLR, Cited by: §1.
-  (2017) Learning to learn without gradient descent by gradient descent. In ICML, Cited by: §2, §3.1.
-  (2019) Stochastic algorithms with geometric step decay converge linearly on sharp functions. arXiv:1907.09547. Cited by: §1.
-  (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §4.2.
-  (2017) Sharp minima can generalize for deep nets. In ICML, Cited by: §1, §5.
-  (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research 12 (Jul), pp. 2121–2159. Cited by: §1, §1, §3.
-  (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, Cited by: §3.1, §3.2.
-  (2017) Forward and reverse gradient-based hyperparameter optimization. In ICML, Cited by: Appendix A, §2, §3.1, §4.1.
-  (2019) The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares. In NeurIPS, Cited by: §1.
-  (2019) SGD: general analysis and improved rates. In ICML, Cited by: §1.
-  (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv:1706.02677. Cited by: §1.
-  (2019) Control batch size and learning rate to generalize well: theoretical and empirical evidence. In NeurIPS, Cited by: §1.
-  (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.
-  (2019) Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, Cited by: Appendix C, 3rd item, §4.3.
-  (1997) Flat minima. Neural Computation 9 (1), pp. 1–42. Cited by: §1.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
-  (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NeurIPS, Cited by: §1.
-  (2018) Evolved policy gradients. In NeurIPS, Cited by: §2.
-  (2017) Densely connected convolutional networks. In CVPR, Cited by: Appendix B.
-  (2019) Automated machine learning. Springer. Cited by: §2.
-  (2018) Averaging weights leads to wider optima and better generalization. In UAI, Cited by: §1, §5.
-  (2017) Three factors influencing minima in sgd. arXiv:1711.04623. Cited by: §1.
On large-batch training for deep learning: generalization gap and sharp minima. In ICLR, Cited by: §1, §5.
-  (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §1, §1, §3.
-  (2009) Learning multiple layers of features from tiny images. Technical report Cited by: Appendix A, §4.1.
-  (2017) Building machines that learn and think like people. Behavioral and brain sciences 40. Cited by: §2.
-  (2019) First-order methods almost always avoid saddle points. Mathematical Programming. Cited by: §1.
-  (2018) Visualizing the loss landscape of neural nets. In NeurIPS, Cited by: §1.
-  (2017) Learning to optimize neural nets. In ICLR, Cited by: §2, §3.1.
-  (2020) Exponential step sizes for non-convex optimization. arXiv preprint arXiv:2002.05273. Cited by: §5.
-  (2017) Sgdr: stochastic gradient descent with warm restarts. In ICLR, Cited by: §1, §4.1.
-  (2017) Learning gradient descent: better generalization and longer horizons. In ICML, Cited by: §2, §2, §3.1.
-  (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In ECCV, Cited by: §4.2.
-  Building a large annotated corpus of english: the penn treebank. Computational Linguistics 19 (2). Cited by: Appendix A, §4.1, §4.2.
-  (2017) Pointer sentinel mixture models. In ICLR, Cited by: Appendix A, §4.1.
-  (2018) Step size matters in deep learning. In NeurIPS, Cited by: §1.
-  (2011) Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: §4.2.
-  (2006) Numerical optimization. Springer Science & Business Media. Cited by: §1.
-  (2019) First-order methods almost always avoid saddle points: the case of vanishing step-sizes. In NeurIPS, Cited by: §1.
-  (2019) Meta-curvature. In NeurIPS, Cited by: §2.
-  (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, Cited by: §3.2.
-  (1964) Some methods of speeding up the convergence of iteration methods. Computational Mathematics and Mathematical Physics 4 (5), pp. 1–17. Cited by: §3.
-  (2017) Optimization as a model for few-shot learning. In ICLR, Cited by: §2, §3.1.
-  (1951) A stochastic approximation method. The annals of mathematical statistics, pp. 400–407. Cited by: §1, §3.
-  (2018) L4: practical loss-based stepsize adaptation for deep learning. In NeurIPS, Cited by: §1, §4.1.
-  (2018) Learning to optimize combinatorial functions. In ICML, Cited by: §2.
-  (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, Cited by: §4.2.
-  (2013) No more pesky learning rates. In ICML, Cited by: §1.
-  (1992) Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation 4 (1), pp. 131–139. Cited by: §2, §3.1.
-  (2019) Meta-weight-net: learning an explicit mapping for sample weighting. In NeurIPS, Cited by: §3.1, §3.2.
-  (2018) Small sample learning in big data era. arXiv:1808.04572. Cited by: §3.1.
-  (2020) Learning adaptive loss for robust learning with noisy labels. arXiv:2002.06482. Cited by: §2.
-  (2020) Meta transition adaptation for robust deep learning with noisy labels. arXiv preprint arXiv:2006.05697. Cited by: §3.1.
-  (2017) Cyclical learning rates for training neural networks. In WACV, Cited by: §1.
-  (2012) Practical bayesian optimization of machine learning algorithms. In NeurIPS, Cited by: §2.
-  (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. Neural networks for machine learning. Cited by: §1, §1, §3.
-  (1998) An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization 8 (2), pp. 506–531. Cited by: §3.
-  (2017) Attention is all you need. In NeurIPS, Cited by: §4.1.
-  (2019) Painless stochastic gradient: interpolation, line-search, and convergence rates. In NeurIPS, Cited by: §1.
-  (2018) Meta-learning mcmc proposals. In NeurIPS, Cited by: §2.
-  (1937) Reminiscence and rote learning. Psychological Monographs 49 (4). Cited by: §2.
-  (2017) Learned optimizers that scale and generalize. In ICML, Cited by: §2, §2, §3.1, §4.2.
-  (2017) The marginal value of adaptive gradient methods in machine learning. In NeurIPS, Cited by: §1, §4.1.
-  (2018) How sgd selects the global minima in over-parameterized learning: a dynamical stability perspective. In NeurIPS, Cited by: §1.
-  (2018) Understanding short-horizon bias in stochastic meta-optimization. In ICLR, Cited by: §2.
-  (2019) Learning an adaptive learning rate schedule. arXiv:1909.09712. Cited by: Appendix E, Appendix E, §2.
-  (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: Appendix A, §4.1.
-  (2012) Adadelta: an adaptive learning rate method. arXiv:1212.5701. Cited by: §1, §3.
-  (2017) Understanding deep learning requires rethinking generalization. In ICLR, Cited by: §1.
-  (2018) Learning transferable architectures for scalable image recognition. In CVPR, Cited by: §4.2.
Appendix A Experimental details and additional results in Section 4.1
In this section, we attempt to evaluate the capability of MLR-SNet to learn LR schedules compared with baseline methods. Here, we provide implementation details of all experiments.
Datasets. We choose two image classification datasets (CIFAR-10 and CIFAR-100) and two text classification datasets (Penn Treebank and WikiText-2) to demonstrate the efficiency of our method. CIFAR-10 and CIFAR-100 consist of 32×32 color images arranged in 10 and 100 classes, respectively; both datasets contain 50,000 training and 10,000 test images. Penn Treebank is composed of 929k training words, 73k validation words, and 82k test words, with a 10k vocabulary in total. WikiText-2, with a total vocabulary of 33,278 words, contains more than 2,088k training words, 217k validation words, and 245k test words. Our algorithm and RTHO randomly select 1,000 clean images from the training set of CIFAR-10/100 as validation data, and directly use the validation sets of Penn Treebank and WikiText-2 as validation data.
CIFAR-10 & CIFAR-100. We employ ResNet-18 on CIFAR-10 and WideResNet-28-10 on CIFAR-100. All compared methods and MLR-SNet are trained for 200 epochs with batch size . For baselines using SGD as the base optimizer, we set the initial LR to , the weight decay parameter to , and the momentum to if used. For Adam, we just follow the default parameter setting. The hyper-parameters of the hand-designed LR schedules are as follows: Exponential decay multiplies the LR by every epoch; MultiStep decay decays the LR by every 60 epochs; SGDR sets T_0 to , T_Mult to , and the minimum LR to . L4, HD and RTHO update the LR every data batch; we use the recommended setting in the original L4 paper () and search different hyper-LRs from for HD and RTHO, reporting the best-performing hyper-LR.
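For reference, the three pre-designed schedules above map directly onto PyTorch's built-in schedulers. The sketch below is illustrative only: the concrete numbers (initial LR 0.1, decay factor 0.95, milestones every 60 epochs, T_0 = 10, T_mult = 2, eta_min = 1e-5) are placeholders we chose, not the tuned values used in our experiments.

```python
import torch
from torch.optim.lr_scheduler import (ExponentialLR, MultiStepLR,
                                      CosineAnnealingWarmRestarts)

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# Exponential decay: multiply the LR by a fixed factor every epoch.
sched = ExponentialLR(opt, gamma=0.95)
# MultiStep decay: drop the LR at fixed milestones, e.g. every 60 epochs:
#   sched = MultiStepLR(opt, milestones=[60, 120, 180], gamma=0.1)
# SGDR: cosine annealing with warm restarts:
#   sched = CosineAnnealingWarmRestarts(opt, T_0=10, T_mult=2, eta_min=1e-5)

for epoch in range(3):
    # ... one epoch of SGD updates via opt.step() per batch ...
    sched.step()  # advance the schedule once per epoch

print(opt.param_groups[0]["lr"])  # 0.1 * 0.95 ** 3
```

All three schedulers share the same `step()`-per-epoch interface, so swapping schedules leaves the training loop untouched.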
Penn Treebank. We use a 2-layer LSTM network that follows a word-embedding layer; the output is fed into a linear layer to compute the probability of each word in the vocabulary. The hidden size of the LSTM cell is set to , and so is the word-embedding size. We tie the weights of the word-embedding layer and the final linear layer. Dropout with rate is applied to the output of the word-embedding layer and to both the first and second LSTM layers. For training, the LSTM is trained for 150 epochs with batch size and sequence length . The base optimizer SGD uses an initial LR of without momentum; for Adam, the initial LR is set to and the weight for the moving average of the gradient to . We apply a weight decay of to both base optimizers. All experiments clip the network gradient norm. For both SGD and Adam, we decrease the LR by a factor of 4 when performance on the validation set shows no progress. For L4, we try different values of in and report the best test perplexity among them. For both HD and RTHO, we search the hyper-LR in and report the best results.
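As a concrete reference, the architecture just described (embedding, 2-layer LSTM, tied linear decoder, with dropout around each stage) can be sketched as follows. The sizes here are placeholders of our own choosing, since the exact hidden/embedding size is not restated in this appendix.

```python
import torch
import torch.nn as nn

class LSTMLM(nn.Module):
    """2-layer word-level LSTM LM with tied embedding/output weights (sketch)."""
    def __init__(self, vocab_size, size=650, dropout=0.5):  # size is a placeholder
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.embed = nn.Embedding(vocab_size, size)
        self.lstm = nn.LSTM(size, size, num_layers=2, dropout=dropout)
        self.decoder = nn.Linear(size, vocab_size)
        self.decoder.weight = self.embed.weight  # weight tying (requires equal sizes)

    def forward(self, tokens, state=None):
        x = self.drop(self.embed(tokens))           # (seq_len, batch, size)
        out, state = self.lstm(x, state)
        return self.decoder(self.drop(out)), state  # logits over the vocabulary
```

During training, gradient-norm clipping (e.g., `torch.nn.utils.clip_grad_norm_`) is applied to this model, as mentioned above.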
WikiText-2. We employ a 2-layer Transformer on WikiText-2. Since we target text classification, only the encoder of the Transformer is included in the network, and we simply use a linear layer as the decoder (the detailed architectures of both the 2-layer LSTM and the 2-layer Transformer can be found at https://github.com/pytorch/examples/blob/master/word_language_model/model.py). Each encoder layer has two attention heads, and both the word-embedding size and the hidden size of the encoder are fixed to . We also apply dropout with rate to the positional encoding layer and the encoder in the Transformer. The Transformer network is trained for 40 epochs with batch size and sequence length . For the base optimizer SGD, the initial LR is set to with a weight decay of and no momentum. For Adam, the LR is fixed to , with a zero factor for the moving average of the gradient and the same weight decay. We determine the baseline methods' settings in the same way as for Penn Treebank.
MLR-SNet architecture and parameter setting. The architecture of MLR-SNet is illustrated in Section 3.2. In our experiments, the size of the hidden nodes is set to 40. The PyTorch implementation of MLR-SNet is listed below.
An important parameter of our MLR-SNet is the scale factor , which should differ across tasks. We find that the loss range of text tasks is around one order of magnitude higher than that of image tasks. In our paper, we empirically set it to 1 for image tasks and 20 for text tasks to eliminate the influence of loss magnitude.
We employ the Adam optimizer to train MLR-SNet and just set the parameters as originally recommended, with a weight decay of , which avoids extra hyper-parameter tuning. For image classification tasks, the input of MLR-SNet is the training loss of a mini-batch of samples. The LR of every data batch is predicted by MLR-SNet, and we update the net once per epoch according to the loss on the validation data. For text classification tasks, we instead take as the input of MLR-SNet to deal with the influence of the large number of classes in text, and we update MLR-SNet every 100 batches due to the much larger number of batches per epoch compared with the image datasets.
PyTorch implementation of MLR-SNet.

import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, num_inputs, hidden_size):
        super(LSTMCell, self).__init__()
        self.hidden_size = hidden_size
        self.fc_i2h = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 4 * hidden_size))
        self.fc_h2h = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 4 * hidden_size))

    def forward(self, inputs, state):
        hx, cx = state
        i2h = self.fc_i2h(inputs)
        h2h = self.fc_h2h(hx)
        x = i2h + h2h
        # split the 4 * hidden_size pre-activations into the four gates
        in_gate, forget_gate, out_gate, in_transform = x.split(self.hidden_size, 1)
        in_gate = torch.sigmoid(in_gate)
        forget_gate = torch.sigmoid(forget_gate)
        out_gate = torch.sigmoid(out_gate)
        in_transform = torch.tanh(in_transform)
        cx = forget_gate * cx + in_gate * in_transform
        hx = out_gate * torch.tanh(cx)
        return hx, cx

class MLRNet(nn.Module):
    def __init__(self, num_layers, hidden_size):
        super(MLRNet, self).__init__()
        self.hidden_size = hidden_size
        self.layer1 = LSTMCell(1, hidden_size)
        self.layer2 = nn.Linear(hidden_size, 1)

    def reset_lstm(self, batch_size=1, device="cpu"):
        # initialize the recurrent state before the first forward call
        self.hx = torch.zeros(batch_size, self.hidden_size, device=device)
        self.cx = torch.zeros(batch_size, self.hidden_size, device=device)

    def forward(self, x, gamma):
        self.hx, self.cx = self.layer1(x, (self.hx, self.cx))
        x = self.hx
        x = self.layer2(x)
        out = torch.sigmoid(x)
        # the scale factor gamma bounds the predicted LR in (0, gamma)
        return gamma * out
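To illustrate the interface, the following self-contained sketch reproduces the same input/output behavior, with `torch.nn.LSTMCell` standing in for the MLP-gated cell above (a simplification of ours, for brevity): the net maps a scalar loss to an LR bounded in (0, gamma).

```python
import torch
import torch.nn as nn

class MLRSNetSketch(nn.Module):
    """Simplified stand-in for MLR-SNet: scalar loss in, LR in (0, gamma) out."""
    def __init__(self, hidden_size=40):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden_size)
        self.out = nn.Linear(hidden_size, 1)
        self.hx = self.cx = None  # recurrent state, carried across calls

    def forward(self, loss_value, gamma):
        x = loss_value.view(1, 1)
        if self.hx is None:  # lazily initialize the recurrent state to zeros
            self.hx = x.new_zeros(1, self.cell.hidden_size)
            self.cx = x.new_zeros(1, self.cell.hidden_size)
        self.hx, self.cx = self.cell(x, (self.hx, self.cx))
        # the sigmoid bounds the predicted LR in (0, gamma)
        return gamma * torch.sigmoid(self.out(self.hx))

net = MLRSNetSketch()
lr = net(torch.tensor(2.3), gamma=1.0)  # e.g., feed in a mini-batch loss
assert 0.0 < lr.item() < 1.0
```

Because the LSTM state persists across calls, consecutive predictions depend on the whole loss trajectory, not just the current loss value.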
Results. Due to space limitations, we only present the test accuracy in the main paper. Here, we present the training loss and test accuracy of our method and all compared methods on image and text tasks, as shown in Fig. 10. For image tasks, except for Adam and SGD with a fixed LR, all methods can decrease the loss almost to zero. Though these methods all reach local minima, the generalization ability of these minima differs greatly, as can be seen from the test accuracy curves. As shown in Fig. 10(a), 10(b), 10(g), 10(h), when using SGD to train DNNs, the compared methods SGD with Exponential LR, L4, HD, and RTHO fail to find solutions that generalize well. In particular, L4 greedily searches for an LR that decreases the loss toward zero, making it fairly hard to adapt to the complex training dynamics of DNNs and obtain a good minimum, while our method can adjust the LR to comply with the significant variations of the training dynamics, leading to a solution that generalizes better. As shown in Fig. 10(d), 10(e), 10(j), 10(k), when the baseline methods are trained with SGDM, they make great progress in escaping from bad minima. In spite of this, our method still shows superiority in finding a solution with better generalization compared with these competitive training strategies.
In the third column of Fig. 10, we plot the LR schedules learned by the compared methods and by ours. As can be seen, our method learns LR schedules that approximate the hand-designed ones. HD and RTHO often follow similar trajectories but produce lower values or a faster downward trend than ours. This tends to explain why our final test performance is better than that of HD and RTHO, since our method can adaptively adjust the LR by explicitly utilizing the past training history. L4 greedily searches for an LR that decreases the loss. This often leads to a large value causing fluctuations or even divergence (Fig. 10(l)), a small value causing slow progress (Fig. 10(r)), or both (Fig. 10(c), 10(f), 10(i), 10(o)). Such LR schedules often result in bad minima. Moreover, all compared methods regard the LR as a hyper-parameter to learn without a transferable formulation, so the learned LR schedules cannot generalize to other learning tasks directly, while our parameterized formulation of MLR-SNet makes such generalization possible.
Appendix B Experimental details and additional results in Section 4.2
We investigate the transferability of the learned LR schedules when applied to various tasks in Section 4.2 of the main paper. We employ the learned MLR-SNet to directly predict the LR for the SGD algorithm. Here, we provide implementation details of all experiments.
As shown in Fig. 11, the LR predicted by the learned LR schedules converges after several iterations. This is because the training trajectories in our experiments are long, and the learned MLR-SNet cannot memorize all the information, since we locally adjust MLR-SNet according to the validation error. If we directly select a single MLR-SNet learned at some epoch, overfitting issues arise, as shown in Fig. 11. Thus we should select more than two learned MLR-SNets for test. Here, we propose a heuristic strategy to select MLR-SNets for test: generally, if we want to select nets for test, the MLR-SNet learned at the -th epoch () should be chosen, where denotes the ceiling operator. Fig. 11 shows the training loss and test accuracy with ResNet-18 on CIFAR-100 for different test strategies, i.e., choosing different numbers of nets to transfer. It can be seen that choosing three or more nets yields similar performance. Therefore, in the following experiments we choose three MLR-SNets to show the transferability.
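A minimal sketch of this selection rule, reconstructed from the concrete three-net splits used in the epoch-transfer experiments below (for n nets over T training epochs, keep the snapshots saved at epochs ceil(iT/n), i = 1, ..., n); the function name is ours.

```python
import math

def snapshot_epochs(total_epochs, num_nets):
    """Epochs at which to keep MLR-SNet snapshots for test-time transfer."""
    return [math.ceil(i * total_epochs / num_nets) for i in range(1, num_nets + 1)]

# e.g., 3 nets from a 200-epoch training run
print(snapshot_epochs(200, 3))  # -> [67, 134, 200]
```

At test time, the i-th snapshot then serves the i-th fraction of the new training run, matching the 0–33 / 33–67 / 67–100 segmentation described below for 100-epoch training.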
Generalization to different batch sizes. We transfer the learned LR schedules to training with different batch sizes. All methods are trained with ResNet-18 on CIFAR-100 for 200 epochs with varying batch sizes. The hyper-parameter settings for the compared hand-designed LR schedules are the same as in Section 4.1 of the main paper, as detailed above. Fig. 12 shows the test accuracy of all methods with varying batch sizes (adding results for batch sizes 128, 256, and 1024).
Generalization to different epochs. We transfer the learned LR schedules to training with different numbers of epochs. All methods are trained with ResNet-18 on CIFAR-100 with batch size 128 for different numbers of epochs. The hyper-parameter settings for the compared hand-designed LR schedules are the same as in Section 4.1 of the main paper, except for MultiStep LR: for 100 epochs, MultiStep LR decays the LR by every 30 epochs; for 400 epochs, by every 120 epochs; for 1200 epochs, by every 360 epochs. Other hyper-parameters of MultiStep LR are kept unchanged. For our method, we use the following transfer strategy: 1) for 100 epochs, we employ the 3 nets for epochs 0-33, 33-67, and 67-100, respectively; 2) for 400 epochs, for epochs 0-133, 133-267, and 267-400; 3) for 1200 epochs, for epochs 0-400, 400-800, and 800-1200.
Generalization to the SGDM algorithm. The MLR-SNets are learned with the SGD algorithm, and we transfer the learned LR schedules to the SGDM algorithm with momentum 0.9. All methods are trained with ResNet-18 on CIFAR-100 for 200 epochs with batch size 128. The hyper-parameter settings for the compared hand-designed LR schedules are the same as in Section 4.1 of the main paper. We set here. As shown in Fig. 14, our method outperforms all the baseline methods.
Generalization to different datasets. We transfer the learned LR schedules to the training of different datasets. For image classification, we train a ResNet-18 on SVHN and on TinyImageNet, respectively; the hyper-parameters of all compared methods are set the same as those for CIFAR-10. For text classification, we train a 2-layer LSTM on Penn Treebank, with all compared methods using the same settings as introduced in Section 4.1.
Generalization to different net architectures. We transfer the learned LR schedules to the training of different network architectures. All methods are trained on CIFAR-10, with hyper-parameters the same as in the CIFAR-10 setting with ResNet-18. We test the learned LR schedule on different configurations of DenseNet . As shown in Fig. 13, our method performs slightly more stably than the MultiStep strategy around epochs 75-125, which tends to show the superiority of an adaptive LR for training DenseNets. We also transfer the LR schedules to several novel networks; the results are presented in Fig. 8 in the main paper.
Generalization to large-scale optimization. We transfer the learned LR schedules to the training of large-scale optimization problems. Predicting the LR with MLR-SNet does not substantially increase the complexity of DNN training compared with hand-designed LR schedules, which makes it feasible and reliable to transfer our learned LR schedules to such large-scale problems. We train a ResNet-50 on ImageNet with the hand-designed LR schedules and with our transferred LR schedules. The training code can be found at https://github.com/pytorch/examples/tree/master/imagenet, and the parameter settings are kept unchanged except for the LR. All compared hand-designed LR schedules are trained by SGDM with a momentum , a weight decay , and an initial learning rate for 90 epochs with batch size 256. Fixed LR uses an LR of 0.1 throughout training; Exponential LR multiplies the LR by every epoch; MultiStep LR decays the LR by every 30 epochs; SGDR sets T_0 to , T_Mult to , and the minimum LR to ; Adam just uses the default parameter setting. The results are presented in Fig. 9 in the main paper.
Appendix C Experimental details and additional results in Section 4.3
The datasets CIFAR-10-C and CIFAR-100-C  can be downloaded at https://zenodo.org/record/2535967#.Xt4mVigzZPY and https://zenodo.org/record/3555552#.Xt4mdSgzZPY. Each dataset contains 15 types of algorithmically generated corruptions from the noise, blur, weather, and digital categories: Gaussian Noise, Shot Noise, Impulse Noise, Defocus Blur, Frosted Glass Blur, Motion Blur, Zoom Blur, Snow, Frost, Fog, Brightness, Contrast, Elastic, Pixelate, and JPEG. All corruptions are generated on the 10,000 test set images, and each corruption contains 50,000 images since each type of corruption has five levels of severity. We treat the CIFAR-10-C or CIFAR-100-C dataset as the training set and train a ResNet-18 model for each corruption, finally obtaining 15 models for CIFAR-10/100-C. Each corruption can be roughly regarded as a task, and the average accuracy of the 15 models on test data (we use the original 50,000 training images of CIFAR-10/100 as test data) is used to evaluate the robustness across tasks of each LR schedule strategy.
For the experimental setting in Section 4.3, all compared hand-designed LR schedules are trained with ResNet-18 by SGDM with a momentum , a weight decay , and an initial learning rate for 100 epochs with batch size 128. Fixed LR uses an LR of 0.1 throughout training; Exponential LR multiplies the LR by every epoch; MultiStep LR decays the LR by every 30 epochs; SGDR sets T_0 to , T_Mult to , and the minimum LR to ; Adam just uses the default parameter setting. Our method trains the ResNet-18 by SGD with a weight decay , and the MLR-SNet is learned under the guidance of a small validation set without corruptions; we randomly choose 10 clean images per class as the validation set. The experimental results are listed in Table 1 in the main paper.
Additional robustness results of transferred LR schedules on different data corruptions. Furthermore, we explore the robustness of our transferred LR schedules across tasks. Different from the above experiments, where all 15 models are trained under the guidance of a small validation set, here we train a ResNet-18 only on the Gaussian Noise corruption to learn the MLR-SNet, and then transfer the learned LR schedules to the other 14 corruptions. We report the average accuracy of the 14 models on test data to show the robustness of our transferred LR schedules. All methods are trained with ResNet-18 for 100 epochs with batch size 128. The hyper-parameter settings of the hand-designed LR schedules are kept the same as above. Table 2 shows the mean test accuracy of the 14 models. As can be seen, our transferred LR schedules obtain the best final performance compared with the hand-designed LR schedules. This implies that our transferred LR schedules also perform more robustly and stably than the pre-set LR schedules when the learning tasks change.
Appendix D Computational Complexity Analysis
Our MLR-SNet learning algorithm can be roughly regarded as requiring two extra full forward and backward passes of the network (step 6 in Algorithm 1) in addition to the normal network parameter update (step 8 in Algorithm 1), together with the forward passes of MLR-SNet for every LR prediction. Therefore, compared with normal training, our method needs about computation time for one iteration. Since we only periodically update MLR-SNet after several iterations, this does not substantially increase the computational complexity compared with normal network training. On the other hand, our transferred LR schedules predict the LR for each iteration with a small MLR-SNet, whose computational cost is significantly less than that of the normal network training. To empirically show the differences between hand-designed LR schedules and our method, we conduct experiments with ResNet-18 on CIFAR-10 and report the running time of all methods. All experiments are implemented on a computer with an Intel Xeon(R) CPU E5-2686 v4 and an NVIDIA GeForce RTX 2080 8GB GPU. We follow the corresponding settings in Section 4.1, and the results are shown in Figure 15. Except for RTHO, which costs significantly more time, all methods, including MLR-SNet training and testing, give similar results. Our MLR-SNet takes barely any extra time to complete the training phase, and due to its light-weight structure, little extra time is added in the testing phase compared with hand-designed LR schedules. Thus our method is fully capable of practical application.
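The amortized cost argument can be made concrete with a back-of-the-envelope sketch (our own illustration, not from the paper's analysis): if a meta-update costs roughly two extra forward/backward passes and fires once every k base iterations, the slowdown relative to normal training is about (k + 2) / k.

```python
def amortized_slowdown(extra_passes, update_interval):
    """Approximate training-time factor relative to plain SGD training."""
    return (update_interval + extra_passes) / update_interval

# e.g., a meta-update every 100 iterations adds roughly two passes' worth of work:
print(amortized_slowdown(2, 100))  # -> 1.02, i.e., about 2% overhead
```

As the update interval grows, the factor approaches 1, which is why periodic meta-updates keep the overall cost close to that of normal training.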
Appendix E Experimental Results of Additional Compared Method LR Controller
In this section, we present the experimental results of LR Controller , a related work to ours but under the reinforcement learning framework. Because its learning algorithm is relatively computationally expensive and not very easy to optimize, we show that our method has an advantage in finding a good LR schedule that scales and generalizes.
For a fair comparison, we follow all the training settings and the structure of the LR Controller proposed in , except that we modify the batch size to 128 and increase the number of training steps to cover 200 epochs of data to match our setup in Section 4.1 (code for the LR Controller can be found at https://github.com/nicklashansen/adaptive-learning-rate-schedule). First, we train the LR Controller on CIFAR-10 with ResNet-18 and on CIFAR-100 with WideResNet-28-10, as we do in Section 4.1. As shown in Fig. 16, our method demonstrates an evident advantage in finding a solution with better generalization compared with the LR Controller. The LR Controller performs steadily in the early training phase, but soon fluctuates significantly and fails to make progress. This tends to show that the LR Controller suffers from a severe stability issue as the number of training steps increases, especially compared with our MLR-SNet.
We then transfer the LR schedules learned on CIFAR-10 by our method and by the LR Controller to CIFAR-100 to verify their transferability. The test settings are the same as those described in Section 4.2. As shown in Fig. 17, the LR Controller makes comparatively slow progress throughout training, while our method achieves competitive performance, indicating its capability of transferring to other tasks.