1 Introduction
Stochastic gradient descent (SGD) and its many variants Robbins and Monro (1951); Duchi et al. (2011); Zeiler (2012); Tieleman and Hinton (2012); Kingma and Ba (2015) have served as the cornerstone of modern machine learning with big data. It has been empirically shown that DNNs achieve state-of-the-art generalization performance on a wide variety of tasks when trained with SGD Zhang et al. (2017); Arpit et al. (2017). Several recent studies observe that SGD tends to select so-called flat minima, which seem to generalize better in practice Hochreiter and Schmidhuber (1997a); Keskar et al. (2017); Dinh et al. (2017); Wu et al. (2018a); Izmailov et al. (2018); Li et al. (2018). Specifically, it has been experimentally studied how the learning rate (LR) Goyal et al. (2017); Hoffer et al. (2017); Jastrzębski et al. (2017) influences the minima found by SGD. Theoretically, Wu et al. (2018a) showed from a dynamical stability perspective that the LR plays an important role in minima selection. He et al. (2019) provided a PAC-Bayes generalization bound for DNNs trained by SGD, which is correlated with the LR. The LR therefore strongly influences the generalization performance of the trained model, and finding a proper LR schedule has been widely studied recently Bengio (2012); Schaul et al. (2013); Jastrzębski et al. (2017); Nar and Sastry (2018).
There mainly exist three kinds of hand-designed LR schedules that help improve SGD training. 1) Pre-designed LR strategies are the most widely used in current works, such as decaying LR Gower et al. (2019) or cyclic LR Smith (2017); Loshchilov and Hutter (2017). These elaborate heuristic strategies have resulted in large improvements in training efficiency. Some theoretical works suggest that the decay schedule can yield faster convergence Ge et al. (2019); Davis et al. (2019) or avoid strict saddles Lee et al. (2019); Panageas et al. (2019) under some mild conditions. However, this strategy introduces extra hyperparameters to tune, e.g., when to decay and by what factor. 2) Traditional LR search methods Nocedal and Wright (2006) can be extended to automatically search the LR for SGD when training DNNs, such as Polyak's update rule Rolinek and Martius (2018), the Frank-Wolfe algorithm Berrada et al. (2019), Armijo line-search Vaswani et al. (2019), etc. However, these methods require heuristically setting extra tunable parameters in their theoretical assumptions to ensure practical performance. 3) Adaptive gradient methods and their variants such as Adam Duchi et al. (2011); Tieleman and Hinton (2012); Kingma and Ba (2015) adapt a coordinate-specific LR according to gradient information to avoid tuning the LR. However, it is still suggested to carefully hand-tune the global LR and other hyperparameters to obtain good performance in practice Wilson et al. (2017).
Although the above LR schedules (as depicted in Fig. 1(a) and 1(b)) can achieve competitive results on their learning tasks, they still have evident deficiencies in practice. On the one hand, these predefined LR schedules, together with their additional hyperparameters, have limited flexibility to adapt to non-convex optimization problems, whose training dynamics vary significantly. On the other hand, there is no common methodology to guide the design of general LR schedules. When encountering a new problem, one has to choose among the LR schedules above and then tune the hyperparameters, which costs considerable time and computation. This makes these schedules harder to apply and harms their performance stability in real problems.
To alleviate the aforementioned issues, this paper presents a model that learns an adaptive LR schedule for the SGD algorithm from data. The main idea is to parameterize the LR schedule as an LSTM network Hochreiter and Schmidhuber (1997b), which is capable of handling such long-term information-dependent problems. As shown in Fig. 1(c), the proposed MetaLRScheduleNet (MLRSNet) learns an explicit loss-LR dependent relationship that adjusts the LR adaptively based on the current training loss as well as the information from past training histories stored in the MLRSNet, under the guidance of a small validation set. In summary, this paper makes the following three contributions.

We propose an MLRSNet to learn an adaptive LR schedule in a meta-learning manner. The MLRSNet can adjust the LR adaptively to comply with the current training dynamics by leveraging information gathered during the training process. Owing to its explicit parameterized form, the MLRSNet can be more flexible than predefined LR schedules in finding a proper LR schedule for the specific learning task. Fig. 1(d) and 1(e) show LR schedules learned by our method; they follow a similar overall tendency to the predefined strategies but exhibit more local variation, demonstrating that our method is capable of adaptively adjusting the LR according to the current training dynamics across iterations.

The trained MLRSNet, as a ready-made LR schedule, can be directly applied to various other tasks, including different batch sizes, epochs, datasets and network architectures. Fig. 1(f) shows that the LR schedules transferred by MLRSNet take forms similar to the preset LR schedules in our experiments. In particular, we transfer learned LR schedules to large-scale optimization problems, such as training ResNet50 on ImageNet, and obtain results comparable with hand-designed LR schedules (shown in Fig. 16). This potentially saves considerable labor and computation cost in applications.

Different from current hand-designed LR schedules, which vary across tasks, our MLRSNet learns the LR schedule under a single data-driven learning methodology, making it easy to apply to different tasks without much prior knowledge of LR setting. Specifically, as shown in Table 2, on datasets with different image corruption noise types as in Hendrycks and Dietterich (2019), our method with one and the same MLRSNet algorithm performs more robustly and stably on average than conventional hand-designed LR schedules, which require a different strategy to be set for each of these datasets.
2 Related Work
Meta learning for optimization. Meta learning, or learning to learn, has a long history in psychology Ward (1937); Lake et al. (2017). Meta learning for optimization dates back to the 1980s-1990s Schmidhuber (1992); Bengio et al. (1991), aiming to meta-learn the optimization process of learning itself. Recently, Andrychowicz et al. (2016); Ravi and Larochelle (2017); Chen et al. (2017); Wichrowska et al. (2017); Li and Malik (2017); Lv et al. (2017) have attempted to scale this approach to larger DNN optimization problems. The main idea is to construct a meta-learner as the optimizer, which takes the gradients as input and outputs the updating rules. These approaches aim to select appropriate training algorithms, schedule the LR and tune other hyperparameters automatically. The meta-learner in these approaches can be updated by minimizing the generalization error on the validation set. In addition, Li and Malik (2017) utilized reinforcement learning and Ravi and Larochelle (2017) used the test error of few-shot learning tasks to train the meta-learner. Beyond continuous optimization problems, some works apply these ideas to other settings, such as black-box functions Chen et al. (2017), model's curvature Park and Oliva (2019), evolution strategies Houthooft et al. (2018), combinatorial functions Rosenfeld et al. (2018), MCMC proposals Wang et al. (2018), etc.
Though faster at decreasing the training loss than traditional optimizers in some cases, the learned optimizers may not always generalize well to diverse problems, especially to longer horizons Lv et al. (2017) and large-scale optimization problems Wichrowska et al. (2017). Moreover, they cannot be guaranteed to output a proper descent direction in each iteration of network training, since they assume that all parameters share one small net and ignore the relationships among the parameters involved. Our proposed method instead learns an adaptive LR schedule rather than the whole updating rule. This makes the learning problem easier and more reliable, and the learned schedules can be readily generalized to other tasks.
HPO and LR schedule adaptation. Hyperparameter optimization (HPO) has historically been studied as the selection of proper values for algorithm hyperparameters to obtain better performance on a validation set (see Hutter et al. (2019) for an overview). Typical methods include grid search, random search Bergstra and Bengio (2012), Bayesian optimization Snoek et al. (2012), gradient-based methods Franceschi et al. (2017); Shu et al. (2020a), etc. Recently, some works attempt to find a proper LR schedule under the framework of gradient-based HPO, which can be solved by bilevel optimization Franceschi et al. (2017). To improve computational efficiency, Baydin et al. (2018) derived greedy updating rules. However, most HPO techniques tend to suffer from short-horizon bias and easily find bad minima Wu et al. (2018b). Meanwhile, since they treat the LR as a hyperparameter to learn, without a transferable formulation, the learned LR schedules cannot be directly generalized to other learning tasks.
Our method attempts to go one step further along this line. Instead of treating the LR as a hyperparameter, we propose to design a meta-learner with an explicit mapping formulation to parameterize LR schedules, which can adjust the LR adaptively to comply with the current training dynamics by leveraging information from past training histories. Meanwhile, the parameterized formulation makes it possible to generalize to other tasks. Recently, Xu et al. (2019) employed an LR controller to help the learned LR schedule generalize to new tasks. However, they use a reinforcement learning framework to train the controller, which is comparatively hard to scale to long horizons and large-scale optimization problems.
3 The Proposed MetaLRScheduleNet (MLRSNet) Method
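The problem of training DNNs can be formulated as the following non-convex optimization problem,
$$\min_{w \in \mathbb{R}^n} \mathcal{L}^{Tr}(D_{Tr}; w) := \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i^{Tr}(w), \tag{1}$$
where $\mathcal{L}_i^{Tr}$ is the training loss function for data sample $i \in D_{Tr} = \{1, 2, \ldots, N\}$, which characterizes the deviation of the model prediction from the data, and $w \in \mathbb{R}^n$ represents the parameters of the model (e.g., the weight matrices in a neural network) to be optimized. SGD Robbins and Monro (1951); Polyak (1964) and its variants, including Momentum Tseng (1998), Adagrad Duchi et al. (2011), Adadelta Zeiler (2012), RMSprop Tieleman and Hinton (2012), Adam Kingma and Ba (2015), are often used for training DNNs. In general, these algorithms can be summarized as the following formulation Robbins and Monro (1951),
$$w_{t+1} = w_t + \Delta w_t, \quad \Delta w_t = \mathcal{O}_t\big(\nabla \mathcal{L}^{Tr}(w_t), \mathcal{H}_t; \Theta_t\big), \tag{2}$$
where $w_t$ denotes the model parameters at the $t$-th update, $\nabla \mathcal{L}^{Tr}(w_t)$ denotes the gradient of $\mathcal{L}^{Tr}$ at $w_t$, $\mathcal{H}_t$ represents the historical gradient information, and $\Theta_t$ is the hyperparameter of the optimizer $\mathcal{O}_t$, e.g., the LR. To present our method's efficiency, we focus on the following vanilla SGD formulation,
$$w_{t+1} = w_t - \alpha_t \Big[\frac{1}{|B_t|} \sum_{i \in B_t} \nabla \mathcal{L}_i^{Tr}(w_t)\Big], \tag{3}$$
where $B_t \subset D_{Tr}$ denotes the batch of samples randomly drawn from the training dataset, $\nabla \mathcal{L}_i^{Tr}(w_t)$ denotes the gradient of sample $i$ computed at $w_t$, and $\alpha_t$ is the LR at the $t$-th iteration.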
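As a minimal illustration of the vanilla SGD update in Eq.(3), the following pure-Python sketch fits a one-parameter least-squares model. The toy data, the function name `sgd` and the `lr_schedule` callback are assumptions made for illustration and do not come from the paper.

```python
import random

def sgd(data, lr_schedule, steps=200, batch_size=8, w0=0.0, seed=0):
    """Minibatch SGD on the per-sample loss 0.5 * (w * x - y) ** 2."""
    rng = random.Random(seed)
    w = w0
    for t in range(steps):
        batch = rng.sample(data, batch_size)            # random minibatch at step t
        # Average of the per-sample gradients (w * x - y) * x over the batch.
        grad = sum((w * x - y) * x for x, y in batch) / batch_size
        w -= lr_schedule(t) * grad                      # step with the LR of iteration t
    return w

# Hypothetical noise-free data generated by y = 3 * x.
data_rng = random.Random(1)
data = [(x, 3.0 * x) for x in (data_rng.uniform(-1.0, 1.0) for _ in range(64))]

w_final = sgd(data, lr_schedule=lambda t: 0.5)          # constant LR as the simplest schedule
print(round(w_final, 4))
```

Passing the LR as a function of the iteration index keeps the update rule unchanged while the schedule is swapped, which is the only degree of freedom the rest of this section is concerned with.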
3.1 Existing LR schedule strategies
As Bengio demonstrated in Bengio (2012), the choice of LR remains central to effective DNN training with SGD. As mentioned in Section 1, a variety of hand-designed LR schedules have been proposed. While they achieve competitive results on some learning tasks, they mostly share several drawbacks: 1) The predefined LR schedules, together with their additional hyperparameters, have limited flexibility to adapt to non-convex optimization problems, whose training dynamics vary significantly. 2) There is no common methodology for setting such LR schedules, which makes it time-consuming and computationally expensive to find a good schedule for a new problem.
Inspired by current meta-learning developments Schmidhuber (1992); Finn et al. (2017); Shu et al. (2018, 2019, 2020b), some studies proposed to learn a generic optimizer from data Andrychowicz et al. (2016); Ravi and Larochelle (2017); Chen et al. (2017); Wichrowska et al. (2017); Li and Malik (2017); Lv et al. (2017). The main idea is to learn a meta-learner as the optimizer to guide the learning of the whole updating rule for a specific problem. For example, Andrychowicz et al. (2016) try to replace Eq.(2) with the following formulation,
$$w_{t+1} = w_t + g_t, \quad \begin{bmatrix} g_t \\ h_{t+1} \end{bmatrix} = m(\nabla_t, h_t; \phi), \tag{4}$$
where $g_t$ is the output of an LSTM net $m$, parameterized by $\phi$, whose state is $h_t$, and $\nabla_t$ is the gradient at the current parameters $w_t$.
This strategy can select appropriate training algorithms, schedule the LR and tune other hyperparameters in a unified and automatic way. Though faster at decreasing the training loss than traditional optimizers in some cases, the learned optimizer may not always generalize well to more varied and diverse problems, such as longer horizons Lv et al. (2017) and large-scale optimization problems Wichrowska et al. (2017). Moreover, it cannot be guaranteed to output a proper descent direction in each iteration of network training. This tends to further increase the application difficulty of such learned optimizers and harm their performance stability in real problems.
Recently, some methods Franceschi et al. (2017); Baydin et al. (2018) consider the following constrained optimization problem to search for the optimal LR schedule such that the resulting models are associated with a small validation error,
$$\min_{\Theta} \mathcal{L}^{Val}(D_{Val}; w_T), \quad \text{s.t.} \ \ w_{t+1} = \Phi_t(w_t, \Theta_t), \ \ t = 0, 1, \ldots, T-1, \tag{5}$$
where $\mathcal{L}^{Val}$ denotes the validation loss function, $D_{Val}$ denotes the hold-out validation set, $\Theta = \{\Theta_t\}_{t=0}^{T-1}$ is the to-be-solved hyperparameter, $\Phi_t$ is a stochastic weight update dynamic, like the updating rule in Eq.(2) or the vanilla SGD in Eq.(3), and $T$ is the maximum iteration step. Though achieving results similar to hand-designed LR schedules on some tasks, the learned schedules still cannot generalize to new tasks, because they lack an explicit mapping that can be readily transferred.
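One greedy way to adapt the LR online, in the spirit of the hypergradient descent idea of Baydin et al. (2018) cited above, is to nudge the LR along the product of consecutive gradients. The sketch below applies this to a hypothetical scalar quadratic; the objective, the constants and the function name are assumptions for illustration, and it is a simplified stand-in rather than a solver for Eq.(5).

```python
def hypergradient_sgd(grad_fn, w0, lr0, hyper_lr, steps):
    """SGD whose LR is itself updated by a greedy, one-step gradient rule."""
    w, lr, prev_grad = w0, lr0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        # The derivative of the current loss w.r.t. the previous LR is -g * prev_grad,
        # so a descent step on the LR adds hyper_lr * g * prev_grad.
        lr += hyper_lr * g * prev_grad
        w -= lr * g
        prev_grad = g
    return w, lr

def grad_fn(w):
    return w                              # gradient of the toy loss 0.5 * w ** 2

w_hd, lr_hd = hypergradient_sgd(grad_fn, w0=5.0, lr0=0.01, hyper_lr=0.001, steps=50)

w_fixed = 5.0
for _ in range(50):
    w_fixed -= 0.01 * grad_fn(w_fixed)    # same start, LR frozen at 0.01

print(abs(w_hd) < abs(w_fixed), lr_hd > 0.01)
```

The adapted LR here is a plain number tied to this one run, which illustrates the limitation discussed above: nothing in it can be carried over to a new task.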
3.2 MetaLRScheduleNet (MLRSNet) Method
To address the aforementioned issues, we design a meta-learner with an explicit mapping formulation, called MLRSNet, to parameterize LR schedules; it learns an adaptive LR schedule that complies with the current training dynamics by leveraging information from past training histories. To this end, we formulate the MLRSNet as shown in Fig. 1(c), and its structure is shown in Fig. 2.
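Calculation principle of MLRSNet. The computational graph of MLRSNet is depicted in Fig. 2(a). Let $\mathcal{A}(\cdot; \theta)$ denote the MLRSNet; then the updating equation of SGD in Eq.(3) can be rewritten as
$$w_{t+1} = w_t - \mathcal{A}(L_t; \theta_t) \Big[\frac{1}{|B_t|} \sum_{i \in B_t} \nabla \mathcal{L}_i^{Tr}(w_t)\Big], \tag{6}$$
where $\theta_t$ is the parameter of MLRSNet at the $t$-th iteration ($t = 0, \ldots, T-1$). At any iteration step, $\mathcal{A}(\cdot; \theta)$ can learn an explicit loss-LR dependent relationship, such that the net can adaptively predict the LR according to the current input loss $L_t = \frac{1}{|B_t|} \sum_{i \in B_t} \mathcal{L}_i^{Tr}(w_t)$, as well as the historical information stored in the net. For every iteration step, the whole forward computation process is
$$\begin{aligned}
\begin{pmatrix} I_t \\ F_t \\ O_t \\ g_t \end{pmatrix} &= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W_2 \, \mathrm{ReLU}\!\left( W_1 \begin{pmatrix} h_{t-1} \\ L_t \end{pmatrix} \right), \\
c_t &= F_t \odot c_{t-1} + I_t \odot g_t, \\
h_t &= O_t \odot \tanh(c_t), \\
p_t &= \sigma(W_3 h_t), \\
\alpha_t &= \gamma \cdot p_t,
\end{aligned} \tag{7}$$
where $I_t, F_t, O_t$ denote the Input, Forget and Output gates, respectively. Different from the vanilla LSTM, the input $h_{t-1}$ and the training loss $L_t$ are preprocessed by a fully-connected layer $W_1$ with ReLU activation function. The net then works as an LSTM and obtains the output $h_t$. After that, the predicted value $p_t$ is obtained by a linear transform $W_3$ on $h_t$ with a Sigmoid activation function. Finally, we introduce a scale factor $\gamma$ to guarantee that the final predicted LR lies in the interval $[0, \gamma]$. (Footnote 1: We find that the loss range of text tasks is around one order of magnitude higher than that of image tasks. In our paper, we empirically set $\gamma = 1$ for image tasks and $\gamma = 20$ for text tasks to eliminate the influence of loss magnitude.) Albeit simple, this net is known for dealing with such long-term information-dependent problems, and is thus capable of finding a proper LR schedule that complies with the significant variations of the training dynamics.
Now, the hyperparameter $\Theta$ in Eq.(5) is replaced by MLRSNet, and Eq.(5) can be rewritten as
$$\min_{\theta} \mathcal{L}^{Val}(D_{Val}; w_T(\theta)), \quad \text{s.t.} \ \ w_{t+1}(\theta) = \Phi_t\big(w_t, \mathcal{A}(L_t; \theta)\big), \ \ t = 0, 1, \ldots, T-1, \tag{8}$$
where $\Phi_t$ is the weight update dynamic of Eq.(5).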
Here, we employ the technique in Finn et al. (2017); Shu et al. (2019) to jointly update the MLRSNet parameter $\theta$ and the model parameter $w$, in order to explore a proper LR schedule with better generalization for DNN training.
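The forward computation of Eq.(7) described above (a fully connected layer with ReLU, LSTM gating, a sigmoid output and a scale factor) can be sketched in pure Python as follows. This is only an illustrative reading of the text: the class name `TinyLRNet`, the hidden size, the bias-free layers and the random initialization are assumptions, not the authors' implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(a * b for a, b in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class TinyLRNet:
    """Loss in, LR out; keeps an LSTM-style state (h, c) between calls."""

    def __init__(self, hidden=4, gamma=1.0, seed=0):
        rng = random.Random(seed)
        self.hidden, self.gamma = hidden, gamma
        self.W1 = rand_matrix(hidden, hidden + 1, rng)   # FC layer on [h_{t-1}; loss]
        self.W2 = rand_matrix(4 * hidden, hidden, rng)   # pre-activations of the gates
        self.W3 = rand_matrix(1, hidden, rng)            # linear read-out
        self.h = [0.0] * hidden
        self.c = [0.0] * hidden

    def step(self, loss):
        H = self.hidden
        z = [max(0.0, v) for v in matvec(self.W1, self.h + [loss])]   # ReLU preprocessing
        a = matvec(self.W2, z)
        i_gate = [sigmoid(v) for v in a[0:H]]
        f_gate = [sigmoid(v) for v in a[H:2 * H]]
        o_gate = [sigmoid(v) for v in a[2 * H:3 * H]]
        g = [math.tanh(v) for v in a[3 * H:4 * H]]
        self.c = [f * c + i * gg for f, c, i, gg in zip(f_gate, self.c, i_gate, g)]
        self.h = [o * math.tanh(c) for o, c in zip(o_gate, self.c)]
        p = sigmoid(matvec(self.W3, self.h)[0])          # value in (0, 1)
        return self.gamma * p                            # LR in (0, gamma)

net = TinyLRNet(hidden=4, gamma=1.0, seed=0)
lrs = [net.step(loss) for loss in (2.3, 1.5, 0.9)]       # hypothetical loss values
print(lrs)
```

Because the state `(h, c)` persists across calls, the predicted LR depends on the loss history and not only on the current loss, which is the property the text relies on.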
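Updating $\theta$. At iteration step $t$, we first adjust the MLRSNet parameter $\theta$ according to the model parameter $w_t$ and the MLRSNet parameter $\theta_t$ obtained in the last step, by minimizing the validation loss defined in Eq.(8). Adam can be employed to optimize the validation loss, i.e.,
$$\theta_{t+1} = \theta_t + \mathrm{Adam}\big(\nabla_{\theta} \mathcal{L}^{Val}(D_m; \hat{w}_{t+1}(\theta)); \eta_t\big), \tag{9}$$
where $\mathrm{Adam}$ denotes the Adam algorithm, whose input is the gradient of the validation loss with respect to the MLRSNet parameter $\theta$ on mini-batch samples $D_m$ from $D_{Val}$, and $\eta_t$ denotes the LR of Adam. Other SGD variants can be used to update $\theta$; we choose Adam to avoid extra tuning of the LR. The following equation is used to formulate $\hat{w}_{t+1}(\theta)$ (Footnote 2: Notice that $\hat{w}_{t+1}(\theta)$ here is a function of $\theta$, which guarantees that the gradient in Eq.(9) can be computed.) on a mini-batch of training samples $D_n$ from $D_{Tr}$,
$$\hat{w}_{t+1}(\theta) = w_t - \mathcal{A}\big(\mathcal{L}^{Tr}(D_n; w_t); \theta\big) \cdot \nabla_{w} \mathcal{L}^{Tr}(D_n; w)\big|_{w_t}. \tag{10}$$
Updating $w$. Then, the updated $\theta_{t+1}$ is employed to ameliorate the model parameter $w$, i.e.,
$$w_{t+1} = w_t - \mathcal{A}\big(\mathcal{L}^{Tr}(D_n; w_t); \theta_{t+1}\big) \cdot \nabla_{w} \mathcal{L}^{Tr}(D_n; w)\big|_{w_t}. \tag{11}$$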
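The alternating scheme above (first refresh the schedule parameter through a look-ahead model update, then update the model with the refreshed LR) can be illustrated on a toy problem. The sketch below makes several simplifying assumptions that are not in the paper: the LR is a single scalar `gamma * sigmoid(theta)` instead of an LSTM output, a plain gradient step replaces Adam, the derivative is written by hand instead of using automatic differentiation, and both losses are the quadratic `0.5 * w ** 2`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_grad(w):
    return w            # gradient of the toy training loss 0.5 * w ** 2

def val_grad(w):
    return w            # gradient of the toy validation loss 0.5 * w ** 2

def alternating_step(w, theta, gamma=1.0, meta_lr=0.5):
    g = train_grad(w)
    s = sigmoid(theta)
    # Look-ahead weight as a function of theta (the role played by Eq.(10)).
    w_hat = w - gamma * s * g
    # Chain rule: d val_loss(w_hat) / d theta = val_grad(w_hat) * d w_hat / d theta.
    dw_hat_dtheta = -gamma * s * (1.0 - s) * g
    meta_grad = val_grad(w_hat) * dw_hat_dtheta
    theta_new = theta - meta_lr * meta_grad           # plain gradient step instead of Adam
    # Model update with the refreshed LR (the role played by Eq.(11)).
    w_new = w - gamma * sigmoid(theta_new) * g
    return w_new, theta_new

theta0 = -3.0
w, theta = 4.0, theta0
for _ in range(30):
    w, theta = alternating_step(w, theta)

w_fixed = 4.0
for _ in range(30):
    w_fixed -= sigmoid(theta0) * train_grad(w_fixed)  # LR frozen at its initial value

print(theta > theta0, abs(w) < abs(w_fixed))
```

In this toy setting the validation gradient pushes `theta` upward whenever a larger step would reduce the validation loss, so the LR grows from its small initial value and the model converges faster than with the frozen LR.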
The MLRSNet learning algorithm is summarized in Algorithm 1. All gradient computations can be efficiently implemented by automatic differentiation libraries, such as PyTorch Paszke et al. (2019), and generalized to any DNN architecture. The MLRSNet is thus gradually optimized during the learning process and adjusts the LR dynamically based on the training dynamics of DNNs.
4 Experimental Results
To evaluate the proposed MLRSNet, we first conduct experiments to show that our method is capable of finding proper LR schedules compared with baseline methods. We then transfer the learned LR schedules to various tasks to show their superiority in generalization. Finally, we show that our method behaves robustly and stably when the training data contain different data corruptions, using the single proposed MLRSNet algorithm instead of different manually set LR schedules as is conventional.
4.1 Evaluation on the Learned LR Schedule by MLRSNet
Datasets and models. To verify the general effectiveness of our method, we train different models on four benchmark datasets: ResNet18 He et al. (2016) on CIFAR10, WideResNet-28-10 Zagoruyko and Komodakis (2016) on CIFAR100 Krizhevsky (2009), a 2-layer LSTM on Penn Treebank Marcus and Marcinkiewicz, and a 2-layer Transformer Vaswani et al. (2017) on WikiText2 Merity et al. (2017).
Baselines. For image classification tasks, the compared methods include SGD with hand-designed LR schedules: 1) Fixed LR, 2) Exponential decay, 3) MultiStep decay, 4) SGD with restarts (SGDR) Loshchilov and Hutter (2017). We also compare with SGD with Momentum (SGDM) under the above four LR schedules, with the momentum fixed at 0.9. Meanwhile, we compare with an adaptive gradient method: 5) Adam, an LR search method: 6) L4 Rolinek and Martius (2018), and current LR schedule adaptation methods: 7) hypergradient descent (HD) Baydin et al. (2018), 8) real-time hyperparameter optimization (RTHO) Franceschi et al. (2017). For text classification tasks, we compare with 1) SGD and 2) Adam with the LR tuned using a validation set; both drop the LR by a factor of 4 when the validation loss stops decreasing. We also compare with 3) L4, 4) HD, 5) RTHO. We run all experiments with 3 different seeds and report the accuracy. Details of the experimental settings and more experimental results are presented in the supplementary material.
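For readers unfamiliar with the four hand-designed baselines, the sketch below writes each of them as a function from the epoch index to the LR. All constants (base LR, decay factors, milestones, restart period) are hypothetical placeholders, since the actual settings are given only in the supplementary material, and the SGDR variant shown uses a fixed restart period for simplicity.

```python
import math

def fixed_lr(epoch, base_lr=0.1):
    return base_lr

def exponential_lr(epoch, base_lr=0.1, gamma=0.95):
    return base_lr * gamma ** epoch                   # multiply by gamma every epoch

def multistep_lr(epoch, base_lr=0.1, milestones=(100, 150), gamma=0.1):
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed                  # drop by gamma at each milestone

def sgdr_lr(epoch, base_lr=0.1, min_lr=0.0, period=50):
    t_cur = epoch % period                            # epochs since the last restart
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t_cur / period))

for epoch in (0, 50, 100, 150, 199):
    print(epoch, fixed_lr(epoch), exponential_lr(epoch), multistep_lr(epoch), sgdr_lr(epoch))
```

Each schedule is fully determined by the epoch index and a handful of constants chosen in advance, which is exactly what distinguishes them from a schedule predicted from the observed training loss.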
Image tasks. Fig. 3(a) and 3(b) show the classification accuracy on the CIFAR10 and CIFAR100 test sets, respectively. It can be observed that: 1) Our algorithm outperforms all other competing methods. The LR schedules learned by MLRSNet are presented in Fig. 1(d); they have a similar shape to the hand-designed strategies, but with more elaborate local variations that adapt to the training dynamics. 2) The Fixed LR performs similarly to other baselines early in training, but fluctuates later in training. This implies that the Fixed LR cannot finely adapt to such DNN training dynamics. 3) The MultiStep LR drops the LR at some epochs; this strategy overcomes the issue of the Fixed LR and obtains higher and more stable performance later in training. 4) The Exponential LR improves test performance faster than other baselines early in training, but makes slow progress later in training due to the smaller LR. 5) SGDR uses a cyclic LR, which needs more epochs to obtain a stable result. 6) Though Adam has an adaptive coordinate-specific LR, it behaves worse than the MultiStep and Exponential LR, as demonstrated in Wilson et al. (2017). Extra tuning is necessary for better performance. 7) L4 greedily searches the LR to decrease the loss, but the complex DNN training dynamics cannot guarantee that it reaches good minima. 8) HD and RTHO are able to achieve performance similar to hand-designed LR schedules. Since image tasks often use SGDM to train DNNs, Fig. 3(d) and 3(e) show the results of baseline methods trained with SGDM, which obtain a remarkable improvement over SGD. Though it does not use extra historical gradient information to help optimization, our method achieves results comparable with these baselines by finding a proper LR schedule for SGD.
Text tasks. Fig. 3(c) and 3(f) show the test perplexity on the Penn Treebank and WikiText2 datasets, respectively. Adam and SGD tune the LR using a validation set, so they consistently perform better. Our method achieves results comparable with them, while outperforming the other competing methods. The learned LR schedules are presented in Fig. 1(e), and have a similar shape to the hand-designed strategies. L4 easily falls into bad minima, and HD and RTHO sometimes underperform SGD.
4.2 Transferability of Learned LR Schedule
We investigate the transferability of the learned LR schedule when applying it to various tasks. Since methods 6), 7) and 8) in Section 4.1 are not able to generalize, we do not compare with them here. The compared methods are trained with SGDM for image tasks to provide a stronger baseline. We use the MLRSNet learned on CIFAR10 with ResNet18 in Section 4.1 as the transferred LR schedule.
Generalization to different batch sizes. The learned MLRSNets are trained with batch size 128. We can readily transfer the learned schedule to varying batch sizes, as shown in Fig. 4, and obtain performance comparable to specifically hand-designed LR schedules. In particular, when increasing the batch size, the test accuracy of our method degrades less than that of the fixed LR.
Generalization to different epochs. The learned MLRSNets are trained for 200 epochs, and we transfer the learned LR schedules to other numbers of training epochs. As shown in Fig. 5, the performance of models trained with our transferred LR schedules gradually improves as the number of training epochs increases, while the competitive Exponential LR shows little improvement.
Generalization to different datasets. We transfer the LR schedules learned on CIFAR10 to SVHN Netzer et al. (2011), TinyImageNet (Footnote 3: It can be downloaded at https://tinyimagenet.herokuapp.com.), and Penn Treebank Marcus and Marcinkiewicz. As shown in Fig. 6, though the datasets range from image to text, our method still obtains a relatively stable generalization performance across tasks.
Generalization to different net architectures. We also transfer the LR schedules learned on ResNet18 to the lightweight nets ShuffleNetV2 Ma et al. (2018), MobileNetV2 Sandler et al. (2018) and NASNet Zoph et al. (2018) (Footnote 4: The pytorch code of all these networks can be found on https://github.com/weiaicunzai/pytorchcifar100.). As shown in Fig. 7, our method achieves results close to those of SGDM with MultiStep or Exponential LR.
Generalization to large scale optimization. To the best of our knowledge, only Wichrowska et al. (2017) attempted to train DNNs on the ImageNet dataset in the current learning-to-optimize literature. Yet their method can only be run for thousands of steps, far from the optimization process used in practice. We transfer the learned LR schedule to train ResNet50 on the ImageNet dataset Deng et al. (2009) (Footnote 5: The training code can be found on https://github.com/pytorch/examples/tree/master/imagenet.). As shown in Fig. 16, the validation accuracy of our method is competitive with that of the hand-designed baseline methods.
Datasets/Methods   Fixed       MultiStep   Exponential  SGDR        Adam        Ours(Train)
CIFAR10C   Best    79.78±3.95  85.52±1.72  83.48±1.45   85.94±1.52  81.45±1.42  86.04±1.51
CIFAR10C   Last    77.88±3.91  85.36±1.71  83.32±1.43   78.21±2.01  80.29±1.64  85.87±1.54
CIFAR100C  Best    46.74±3.03  52.26±2.58  49.72±1.97   52.54±2.49  45.45±1.94  52.56±2.26
CIFAR100C  Last    44.79±3.91  52.16±2.59  49.58±1.98   41.58±3.24  43.76±2.22  52.42±2.34
4.3 Robustness on Different Data Corruptions
While hand-designed LR schedules may be elaborate and effective for specific tasks, they are hard to adapt flexibly to a new problem without human intervention. Our proposed regime naturally alleviates this issue through a single data-driven, automatic LR-schedule adaptation methodology, guided by a small clean meta dataset. To illustrate this, we design the following experiments: we take CIFAR10C and CIFAR100C Hendrycks and Dietterich (2019) as our training sets, which consist of 15 types of generated corruptions applied to the test images of CIFAR10/CIFAR100, and use the original training sets of CIFAR10/100 as test sets. Though the original images of CIFAR10/100C are the same as the CIFAR10/100 test sets, the corruptions change the data distributions. To ensure that the trained models generalize well to the test set, we use 10 clean images per class as the validation set. Each corruption can be roughly regarded as a task, and thus we obtain 15 models trained on CIFAR10/100C. Table 2 shows the mean test accuracy of the 15 models, which are trained on CIFAR10/100C using MLRSNet and hand-designed LR schedules for SGDM. As can be seen, though our method underperforms the baseline methods in Section 4.1 on regular CIFAR training, it outperforms them under the new training setting. This implies that our method behaves more robustly and stably than the preset LR schedules when the learning tasks change, since it always tries to find a proper LR schedule that minimizes the generalization error based on the knowledge conveyed by the given data.
5 Conclusion and Discussion
In this paper, we have proposed to learn an adaptive LR schedule in a meta-learning manner. To this end, we design a meta-learner with an explicit mapping formulation to parameterize LR schedules, which adaptively adjusts the LR to comply with the current training dynamics based on the training loss and information from past training histories. Comprehensive experiments on various image and text benchmarks substantiate the superiority of our method in adaptability, generalization capability and robustness, as compared with current hand-designed LR schedules.
The preliminary experimental evaluations show that our method gives good convergence performance on various tasks. We observe that the learned LR schedule in our experiments follows a consistent trajectory as shown in Fig. 1, sharing a similar tendency with the preset LR schedules. The convergence guarantees established for such preset schedules Li et al. (2020) can thus roughly explain the good convergence performance of our method for DNN training. A detailed theoretical analysis of the convergence of our method is left for future work. Furthermore, Keskar et al. (2017); Dinh et al. (2017) suggested that the width of a local optimum is related to generalization, with wider optima leading to better generalization. We use the visualization technique in Izmailov et al. (2018) to visualize the "width" of the solutions for different LR schedules on CIFAR100 with ResNet18. As shown in Fig. 9, our method lies in a wide, flat region of the training loss. This could explain the better generalization of our method compared with preset LR schedules. A deeper understanding of this point will be investigated in future work.
References
 [1] (2016) Learning to learn by gradient descent by gradient descent. In NeurIPS, Cited by: §2, §3.1.
 [2] (2017) A closer look at memorization in deep networks. In ICML, Cited by: §1.
 [3] (2018) Online learning rate adaptation with hypergradient descent. In ICLR, Cited by: §2, §3.1, §4.1.
 [4] (1991) Learning a synaptic learning rule. In IJCNN, Vol. 2, pp. 969–vol. Cited by: §2.
 [5] (2012) Practical recommendations for gradientbased training of deep architectures. In Neural networks: Tricks of the trade, pp. 437–478. Cited by: §1, §3.1.
 [6] (2012) Random search for hyperparameter optimization. JMLR. Cited by: §2.
 [7] (2019) Deep frankwolfe for neural network optimization. In ICLR, Cited by: §1.
 [8] (2017) Learning to learn without gradient descent by gradient descent. In ICML, Cited by: §2, §3.1.
 [9] (2019) Stochastic algorithms with geometric step decay converge linearly on sharp functions. arXiv:1907.09547. Cited by: §1.
 [10] (2009) Imagenet: a largescale hierarchical image database. In CVPR, Cited by: §4.2.
 [11] (2017) Sharp minima can generalize for deep nets. In ICML, Cited by: §1, §5.
 [12] (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research 12 (Jul), pp. 2121–2159. Cited by: §1, §1, §3.
 [13] (2017) Modelagnostic metalearning for fast adaptation of deep networks. In ICML, Cited by: §3.1, §3.2.
 [14] (2017) Forward and reverse gradientbased hyperparameter optimization. In ICML, Cited by: Appendix A, §2, §3.1, §4.1.
 [15] (2019) The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares. In NeurIPS, Cited by: §1.
 [16] (2019) SGD: general analysis and improved rates. In ICML, Cited by: §1.
 [17] (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv:1706.02677. Cited by: §1.
 [18] (2019) Control batch size and learning rate to generalize well: theoretical and empirical evidence. In NeurIPS, Cited by: §1.
 [19] (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.
 [20] (2019) Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, Cited by: Appendix C, 3rd item, §4.3.
 [21] (1997) Flat minima. Neural Computation 9 (1), pp. 1–42. Cited by: §1.
 [22] (1997) Long shortterm memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
 [23] (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NeurIPS, Cited by: §1.
 [24] (2018) Evolved policy gradients. In NeurIPS, Cited by: §2.
 [25] (2017) Densely connected convolutional networks. In CVPR, Cited by: Appendix B.
 [26] (2019) Automated machine learning. Springer. Cited by: §2.
 [27] (2018) Averaging weights leads to wider optima and better generalization. In UAI, Cited by: §1, §5.
 [28] (2017) Three factors influencing minima in sgd. arXiv:1711.04623. Cited by: §1.

 [29] (2017) On largebatch training for deep learning: generalization gap and sharp minima. In ICLR, Cited by: §1, §5.
 [30] (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §1, §1, §3.
 [31] (2009) Learning multiple layers of features from tiny images. Technical report Cited by: Appendix A, §4.1.
 [32] (2017) Building machines that learn and think like people. Behavioral and brain sciences 40. Cited by: §2.
 [33] (2019) Firstorder methods almost always avoid saddle points. Mathematical Programming. Cited by: §1.
 [34] (2018) Visualizing the loss landscape of neural nets. In NeurIPS, Cited by: §1.
 [35] (2017) Learning to optimize neural nets. In ICLR, Cited by: §2, §3.1.
 [36] (2020) Exponential step sizes for nonconvex optimization. arXiv preprint arXiv:2002.05273. Cited by: §5.
 [37] (2017) Sgdr: stochastic gradient descent with warm restarts. In ICLR, Cited by: §1, §4.1.
 [38] (2017) Learning gradient descent: better generalization and longer horizons. In ICML, Cited by: §2, §2, §3.1.
 [39] (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In ECCV, Cited by: §4.2.
 [40] Building a large annotated corpus of english: the penn treebank. Computational Linguistics 19 (2). Cited by: Appendix A, §4.1, §4.2.
 [41] (2017) Pointer sentinel mixture models. In ICLR, Cited by: Appendix A, §4.1.
 [42] (2018) Step size matters in deep learning. In NeurIPS, Cited by: §1.
 [43] (2011) Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: §4.2.
 [44] (2006) Numerical optimization. Springer Science & Business Media. Cited by: §1.
 [45] (2019) First-order methods almost always avoid saddle points: the case of vanishing step-sizes. In NeurIPS, Cited by: §1.
 [46] (2019) Meta-curvature. In NeurIPS, Cited by: §2.
 [47] (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, Cited by: §3.2.
 [48] (1964) Some methods of speeding up the convergence of iteration methods. Computational Mathematics and Mathematical Physics 4 (5), pp. 1–17. Cited by: §3.
 [49] (2017) Optimization as a model for few-shot learning. In ICLR, Cited by: §2, §3.1.
 [50] (1951) A stochastic approximation method. The annals of mathematical statistics, pp. 400–407. Cited by: §1, §3.
 [51] (2018) L4: practical loss-based stepsize adaptation for deep learning. In NeurIPS, Cited by: §1, §4.1.
 [52] (2018) Learning to optimize combinatorial functions. In ICML, Cited by: §2.
 [53] (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In CVPR, Cited by: §4.2.
 [54] (2013) No more pesky learning rates. In ICML, Cited by: §1.
 [55] (1992) Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation 4 (1), pp. 131–139. Cited by: §2, §3.1.
 [56] (2019) Meta-weight-net: learning an explicit mapping for sample weighting. In NeurIPS, Cited by: §3.1, §3.2.
 [57] (2018) Small sample learning in big data era. arXiv:1808.04572. Cited by: §3.1.
 [58] (2020) Learning adaptive loss for robust learning with noisy labels. arXiv:2002.06482. Cited by: §2.
 [59] (2020) Meta transition adaptation for robust deep learning with noisy labels. arXiv preprint arXiv:2006.05697. Cited by: §3.1.
 [60] (2017) Cyclical learning rates for training neural networks. In WACV, Cited by: §1.
 [61] (2012) Practical bayesian optimization of machine learning algorithms. In NeurIPS, Cited by: §2.
 [62] (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. Neural networks for machine learning. Cited by: §1, §1, §3.
 [63] (1998) An incremental gradient (projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization 8 (2), pp. 506–531. Cited by: §3.
 [64] (2017) Attention is all you need. In NeurIPS, Cited by: §4.1.

 [65] (2019) Painless stochastic gradient: interpolation, line-search, and convergence rates. In NeurIPS, Cited by: §1.
 [66] (2018) Meta-learning MCMC proposals. In NeurIPS, Cited by: §2.
 [67] (1937) Reminiscence and rote learning.. Psychological Monographs 49 (4). Cited by: §2.
 [68] (2017) Learned optimizers that scale and generalize. In ICML, Cited by: §2, §2, §3.1, §4.2.
 [69] (2017) The marginal value of adaptive gradient methods in machine learning. In NeurIPS, Cited by: §1, §4.1.
 [70] (2018) How sgd selects the global minima in overparameterized learning: a dynamical stability perspective. In NeurIPS, Cited by: §1.
 [71] (2018) Understanding short-horizon bias in stochastic meta-optimization. In ICLR, Cited by: §2.
 [72] (2019) Learning an adaptive learning rate schedule. arXiv:1909.09712. Cited by: Appendix E, Appendix E, §2.
 [73] (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: Appendix A, §4.1.
 [74] (2012) Adadelta: an adaptive learning rate method. arXiv:1212.5701. Cited by: §1, §3.
 [75] (2017) Understanding deep learning requires rethinking generalization. In ICLR, Cited by: §1.
 [76] (2018) Learning transferable architectures for scalable image recognition. In CVPR, Cited by: §4.2.
Appendix A Experimental details and additional results in Section 4.1
In this section, we attempt to evaluate the capability of MLRSNet to learn LR schedules compared with baseline methods. Here, we provide implementation details of all experiments.
Datasets. We choose two image classification datasets (CIFAR10 and CIFAR100) and two text classification datasets (Penn Treebank and WikiText2) to demonstrate the efficiency of our method. CIFAR10 and CIFAR100 [31] consist of 32×32 color images arranged in 10 and 100 classes, respectively. Both datasets contain 50,000 training and 10,000 test images. Penn Treebank [40] is composed of 929k training words, 73k validation words, and 82k test words, with a 10k vocabulary in total. WikiText2 [41], with a total vocabulary of 33,278 words, contains more than 2,088k training words, 217k validation words, and 245k test words. Our algorithm and RTHO [14] randomly select 1,000 clean images from the training set of CIFAR10/100 as validation data, and directly use the validation sets of Penn Treebank and WikiText2 as validation data.
CIFAR10 & CIFAR100. We employ ResNet18 on CIFAR10 and WideResNet-28-10 [73] on CIFAR100. All compared methods and MLRSNet are trained for 200 epochs with batch size . For baselines that use SGD as the base optimizer, we set the initial LR to , the weight decay parameter to , and the momentum to if used. For Adam, we follow the default parameter setting. The hyperparameters of the hand-designed LR schedules are as follows: Exponential decay multiplies the LR by every epoch; MultiStep decay decays the LR by every 60 epochs; SGDR sets T_0 to , T_Mult to , and the minimum LR to . L4, HD, and RTHO update the LR every data batch. We use the setting recommended in the original L4 paper () and search different hyper-LRs from for HD and RTHO, reporting the best-performing hyper-LR.
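The three hand-designed schedules described above can be sketched in plain Python as follows. This is a minimal illustration rather than the code used in the experiments: the function names and every numeric value in the usage lines are hypothetical, because the concrete hyperparameter values are not reproduced in this text. Only the 60-epoch MultiStep interval is taken from the description.

```python
import math


def exponential_lr(base_lr, gamma, epoch):
    """LR after `epoch` epochs when it is multiplied by `gamma` once per epoch."""
    return base_lr * gamma ** epoch


def multistep_lr(base_lr, factor, step, epoch):
    """LR multiplied by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)


def sgdr_lr(base_lr, min_lr, t_0, t_mult, epoch):
    """Cosine annealing with warm restarts, evaluated at integer epochs.

    The first cycle lasts `t_0` epochs; each later cycle is `t_mult` times longer.
    """
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:  # skip over the cycles that have already finished
        t_cur -= t_i
        t_i *= t_mult
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t_cur / t_i))


# Hypothetical values, chosen only to exercise the functions.
lr_exp = exponential_lr(base_lr=0.1, gamma=0.5, epoch=2)
lr_multi = multistep_lr(base_lr=0.1, factor=0.1, step=60, epoch=60)
lr_restart = sgdr_lr(base_lr=0.1, min_lr=0.0, t_0=10, t_mult=2, epoch=10)
```

Each of these schedules is fixed before training starts, which is the property the learned schedule of MLRSNet is meant to relax.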
Penn Treebank. We use a 2-layer LSTM network that follows a word-embedding layer, and its output is fed into a linear layer to compute the probability of each word in the vocabulary. The hidden size of the LSTM cell is set to , and so is the word-embedding size. We tie the weights of the word-embedding layer and the final linear layer. Dropout is applied to the output of the word-embedding layer and to both the first and second LSTM layers with a rate of . The LSTM network is trained for 150 epochs with a batch size of and a sequence length of . For the base optimizer SGD, we set an initial LR of without momentum; for Adam, the initial LR is set to and the weight for the moving average of the gradient is set to . We apply a weight decay of to both base optimizers. All experiments involve a clipping of the network gradient norm. For both SGD and Adam, we decrease the LR by a factor of 4 when performance on the validation set shows no progress. For L4, we try different in and report the best test perplexity among them. For both HD and RTHO, we search the hyper-LR in and report the best results.
WikiText2. We employ a 2-layer Transformer on WikiText2. Since we target text classification, only the encoder of the Transformer is included in the network, and we simply use a linear layer as the decoder (the detailed architectures of both the 2-layer LSTM and the 2-layer Transformer can be found at https://github.com/pytorch/examples/blob/master/word_language_model/model.py). Each encoder layer has two attention heads, and both the word-embedding size and the hidden size of the encoder are fixed to . We also apply dropout to the positional encoding layer and the encoder of the Transformer with a dropout rate of . The Transformer network is trained for 40 epochs with a batch size of and a sequence length of . For the base optimizer SGD, the initial LR is set to with a weight decay of and no momentum. For Adam, the LR is fixed to , with a zero factor for the moving average of the gradient and a weight decay as well. We determine the settings of the baseline methods in the same way as for Penn Treebank.
MLRSNet architecture and parameter setting. The architecture of MLRSNet is illustrated in Section 3.2. In our experiments, the number of hidden nodes is set to 40. The PyTorch implementation of MLRSNet is listed below.
An important parameter of our MLRSNet is the scale factor γ, which should differ across tasks. We find that the loss range of text tasks is around one order of magnitude higher than that of image tasks. In our paper, we empirically set γ = 1 for image tasks and γ = 20 for text tasks to eliminate the influence of the loss magnitude.
We employ the Adam optimizer to train MLRSNet and set its parameters as originally recommended, with a weight decay of , which avoids extra hyperparameter tuning. For image classification tasks, the input of MLRSNet is the training loss of a mini-batch of samples. MLRSNet predicts the LR for every data batch, and MLRSNet itself is updated once per epoch according to the loss on the validation data. For text classification tasks, we take as the input of MLRSNet to deal with the influence of the large number of classes in text. Here MLRSNet is updated every 100 batches, because the number of batches per epoch is much larger than in the image datasets.
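The two update frequencies described above (once per epoch for image tasks, every 100 batches for text tasks) can be sketched as a small scheduling helper. The function names, the task labels, and the value of 390 batches per epoch are hypothetical and only serve the illustration; the choice to trigger the update at the end of each interval is also an assumption.

```python
def meta_update_interval(task, batches_per_epoch):
    """Number of training batches between two consecutive MLRSNet updates."""
    if task == "image":
        return batches_per_epoch  # one update per epoch
    if task == "text":
        return 100  # one update every 100 batches
    raise ValueError(f"unknown task type: {task!r}")


def is_meta_update_step(batch_count, interval):
    """True when `batch_count` processed batches (1-based) complete an interval."""
    return batch_count > 0 and batch_count % interval == 0


# Hypothetical image task with 390 batches per epoch, trained for two epochs.
interval = meta_update_interval("image", batches_per_epoch=390)
update_steps = [b for b in range(1, 2 * 390 + 1) if is_meta_update_step(b, interval)]
```

Between two updates, MLRSNet is only evaluated to predict the LR of each batch, so the expensive meta-step is amortized over the whole interval.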
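PyTorch implementation of MLRSNet.

    import torch
    import torch.nn as nn


    class LSTMCell(nn.Module):
        """LSTM cell whose input-to-hidden and hidden-to-hidden maps are two-layer MLPs."""

        def __init__(self, num_inputs, hidden_size):
            super(LSTMCell, self).__init__()
            self.hidden_size = hidden_size
            self.fc_i2h = nn.Sequential(
                nn.Linear(num_inputs, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, 4 * hidden_size),
            )
            self.fc_h2h = nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, 4 * hidden_size),
            )

        def forward(self, inputs, state):
            hx, cx = state
            i2h = self.fc_i2h(inputs)
            h2h = self.fc_h2h(hx)
            x = i2h + h2h
            # Split the 4 * hidden_size pre-activations into the four gates.
            gates = x.split(self.hidden_size, 1)
            in_gate = torch.sigmoid(gates[0])
            forget_gate = torch.sigmoid(gates[1])
            out_gate = torch.sigmoid(gates[2])
            in_transform = torch.tanh(gates[3])
            cx = forget_gate * cx + in_gate * in_transform
            hx = out_gate * torch.tanh(cx)
            return hx, cx


    class MLRNet(nn.Module):
        def __init__(self, num_layers, hidden_size):
            # num_layers is accepted but not used in this listing.
            super(MLRNet, self).__init__()
            self.hidden_size = hidden_size
            self.layer1 = LSTMCell(1, hidden_size)
            self.layer2 = nn.Linear(hidden_size, 1)
            # The original listing does not show how the recurrent state is
            # initialized; a lazily created zero state is assumed here.
            self.hx = None
            self.cx = None

        def forward(self, x, gamma):
            # x has shape (batch, 1): the (scaled) training loss.
            if self.hx is None:
                self.hx = x.new_zeros((x.size(0), self.hidden_size))
                self.cx = x.new_zeros((x.size(0), self.hidden_size))
            self.hx, self.cx = self.layer1(x, (self.hx, self.cx))
            x = self.hx
            x = self.layer2(x)
            out = torch.sigmoid(x)
            # The predicted LR lies in (0, gamma).
            return gamma * out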
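The last step of the forward pass, `gamma * sigmoid(x)`, bounds the predicted LR to the open interval (0, gamma). The following dependency-free sketch illustrates this mapping for the two scale factors mentioned above (1 for image tasks, 20 for text tasks). The helper names `stable_sigmoid` and `predicted_lr` are hypothetical, and the raw scores fed to them are arbitrary stand-ins for the output of the final linear layer.

```python
import math


def stable_sigmoid(z):
    """Logistic function written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predicted_lr(raw_score, gamma):
    """Map an unbounded scalar to a learning rate in (0, gamma)."""
    return gamma * stable_sigmoid(raw_score)


# The same raw score yields a 20 times larger LR under the text-task scale factor.
lr_image = predicted_lr(0.0, gamma=1.0)
lr_text = predicted_lr(0.0, gamma=20.0)
```

Because the sigmoid saturates, the scale factor acts as a hard upper bound on the LR that MLRSNet can predict for a given task.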
Results. Due to space limitations, we only present the test accuracy in the main paper. Here, we present the training loss and test accuracy of our method and all compared methods on image and text tasks, as shown in Fig. 10. For image tasks, all methods except Adam and SGD with fixed LR can decrease the loss to almost 0. Although these methods all reach local minima, the generalization ability of these minima differs greatly, as the test accuracy curves show. As shown in Fig. 10(a), 10(b), 10(g), 10(h), when using SGD to train DNNs, the compared methods SGD with Exponential LR, L4, HD, and RTHO fail to find solutions that generalize well. In particular, L4 greedily searches for an LR that decreases the loss to 0, which makes it hard to adapt to the complex DNN training dynamics and obtain a good minimum, while our method can adjust the LR to comply with the significant variations of the training dynamics, leading to a solution with better generalization. As shown in Fig. 10(d), 10(e), 10(j), 10(k), when the baseline methods are trained with SGDM, they make great progress in escaping from bad minima. Nevertheless, our method still shows superiority in finding a solution with better generalization compared with these competitive training strategies.
In the third column of Fig. 10, we plot the learned LR schedules of the compared methods and our method. As can be seen, our method can learn LR schedules that approximate the hand-designed LR schedules. HD and RTHO often follow the same trajectory while producing a lower LR or a faster downward trend than ours. This may explain why our final performance on the test set is better than that of HD and RTHO, since our method can adaptively adjust the LR by explicitly utilizing the past training history. L4 greedily searches for an LR to decrease the loss. This often leads to a large value causing fluctuations or even divergence (Fig. 10(l)), a small value causing slow progress (Fig. 10(r)), or both (Fig. 10(c), 10(f), 10(i), 10(o)). Such LR schedules often result in bad minima. Moreover, all compared methods regard the LR as a hyperparameter to learn without a transferable formulation, so the learned LR schedules cannot generalize to other learning tasks directly. In contrast, the parameterized formulation of MLRSNet makes it possible to generalize to other tasks.
Appendix B Experimental details and additional results in Section 4.2
We investigate the transferability of the learned LR schedule when applied to various tasks in Section 4.2 of the main paper. We employ the learned MLRSNet to directly predict the LR for SGD algorithm. Here, we provide implementation details of all experiments.
As shown in Fig. 11, the LR predicted by the learned LR schedules converges after several iterations. This is because the training trajectories in our experiments are long, and the learned MLRSNet cannot memorize all the information since we locally adjust MLRSNet according to the validation error. If we directly select a single MLRSNet learned at any epoch, this will raise overfitting issues, as shown in Fig. 11. Thus we should select more than two learned MLRSNets for test. Here, we propose a heuristic strategy to select MLRSNets for test. Generally, if we want to select nets for test, the MLRSNet learned at th epoch () should be chosen, where denotes the ceiling operator. Fig. 11 shows the training loss and test accuracy with ResNet18 on CIFAR100 for different test strategies, i.e., choosing different numbers of nets to transfer. It can be seen that choosing more than three nets yields almost the same performance. Therefore, in the following experiments we choose three MLRSNets to show the transferability.
Generalization to different batch sizes. We transfer the learned LR schedules to training with different batch sizes. All the methods are trained with ResNet18 on CIFAR100 for 200 epochs with different batch sizes. The hyperparameter setting for the compared hand-designed LR schedules is the same as in Section 4.1 of the main paper, as described above. Fig. 12 shows the test accuracy of all methods with varying batch sizes (including the results for batch sizes 128, 256, and 1024).
Generalization to different epochs. We transfer the learned LR schedules to training with different numbers of epochs. All the methods are trained with ResNet18 on CIFAR100 with batch size 128 for different numbers of epochs. The hyperparameter setting for the compared hand-designed LR schedules is the same as in Section 4.1 of the main paper, as described above, except for MultiStep LR. For 100 epochs, MultiStep LR decays the LR by every 30 epochs; for 400 epochs, MultiStep LR decays the LR by every 120 epochs; for 1200 epochs, MultiStep LR decays the LR by every 360 epochs. The other hyperparameters of MultiStep LR remain unchanged. For our method, we use the following transfer strategy: 1) for 100 epochs, we employ the 3 nets at epochs 0-33, 33-67, and 67-100, respectively; 2) for 400 epochs, we employ the 3 nets at epochs 0-133, 133-267, and 267-400, respectively; 3) for 1200 epochs, we employ the 3 nets at epochs 0-400, 400-800, and 800-1200, respectively.
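The transfer strategy above splits the training horizon into three equal segments and assigns one learned net to each. A minimal sketch of this bookkeeping is given below. The function names are hypothetical, and treating each segment as a half-open interval (so that a boundary epoch belongs to the later segment) is an assumption, since the text does not say how boundary epochs are handled.

```python
def segment_boundaries(total_epochs, num_nets):
    """Epoch boundaries that split training into `num_nets` equal segments."""
    return [round(i * total_epochs / num_nets) for i in range(num_nets + 1)]


def net_index_for_epoch(epoch, total_epochs, num_nets):
    """Index of the learned net that predicts the LR at a given 0-based epoch."""
    bounds = segment_boundaries(total_epochs, num_nets)
    for i in range(num_nets):
        if bounds[i] <= epoch < bounds[i + 1]:
            return i
    raise ValueError(f"epoch {epoch} is outside [0, {total_epochs})")


# Boundaries for the three training lengths listed in the text.
bounds_100 = segment_boundaries(100, 3)
bounds_400 = segment_boundaries(400, 3)
bounds_1200 = segment_boundaries(1200, 3)
```

Rounding the fractional boundaries reproduces the segment limits quoted in the text for 100, 400, and 1200 epochs.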
Generalization to SGDM algorithm. The learned MLRSNets are trained with the SGD algorithm, and we transfer the learned LR schedules to the SGDM algorithm with momentum 0.9. All the methods are trained with ResNet18 on CIFAR100 for 200 epochs with batch size 128. The hyperparameter setting for the compared hand-designed LR schedules is the same as in Section 4.1 of the main paper. We set here. As shown in Fig. 14, our method outperforms all the baseline methods.
Generalization to different datasets. We transfer the learned LR schedules to training on different datasets. For image classification, we train a ResNet18 on SVHN and TinyImageNet, respectively. The hyperparameters of all compared methods are the same as those for CIFAR10. For text classification, we train a 2-layer LSTM on Penn Treebank. The hyperparameters of all compared methods follow the setting introduced in Section 4.1.
Generalization to different net architectures. We transfer the learned LR schedules to training with different net architectures. All the methods are trained on CIFAR10 with different net architectures. The hyperparameters of all methods are the same as in the setting of CIFAR10 with ResNet18. We test the learned LR schedule on different configurations of DenseNet [25]. As shown in Fig. 13, our method performs slightly more stably than the MultiStep strategy at about 75-125 epochs. This tends to show the superiority of an adaptive LR for training DenseNets. We also transfer the LR schedules to several novel networks; the results are presented in Fig. 8 in the main paper.
Generalization to large-scale optimization. We transfer the learned LR schedules to the training of large-scale optimization problems. Predicting the LR with MLRSNet does not substantially increase the complexity of DNN training compared with hand-designed LR schedules. This makes it feasible and reliable to transfer our learned LR schedules to such large-scale optimization problems. We train a ResNet50 on ImageNet with hand-designed LR schedules and our transferred LR schedules. The training code can be found at https://github.com/pytorch/examples/tree/master/imagenet, and the parameter setting is kept unchanged except for the LR. All compared hand-designed LR schedules are trained by SGDM with a momentum of , a weight decay of , and an initial learning rate of for 90 epochs with batch size 256. Fixed LR uses an LR of 0.1 throughout training; Exponential LR multiplies the LR by every epoch; MultiStep LR decays the LR by every 30 epochs; SGDR sets T_0 to , T_Mult to , and the minimum LR to ; Adam uses the default parameter setting. The results are presented in Fig. 9 in the main paper.
Appendix C Experimental details and additional results in Section 4.3
The datasets CIFAR10C and CIFAR100C [20] can be downloaded at https://zenodo.org/record/2535967#.Xt4mVigzZPY and https://zenodo.org/record/3555552#.Xt4mdSgzZPY. Each dataset contains 15 types of algorithmically generated corruptions from the noise, blur, weather, and digital categories. These corruptions are Gaussian Noise, Shot Noise, Impulse Noise, Defocus Blur, Frosted Glass Blur, Motion Blur, Zoom Blur, Snow, Frost, Fog, Brightness, Contrast, Elastic, Pixelate, and JPEG. All the corruptions are generated on the 10,000 test set images, and each corruption contains 50,000 images since each type of corruption has five levels of severity. We treat the CIFAR10C or CIFAR100C dataset as the training set and train a ResNet18 model for each corruption dataset. In this way we obtain 15 models for CIFAR10/100C. Each corruption can be roughly regarded as a task, and the average accuracy of the 15 models on test data (we use the original 50,000 training images of CIFAR10/100 as test data) is used to evaluate the robustness of each LR schedule strategy across tasks.
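The bookkeeping behind this evaluation protocol can be checked with a few lines of Python. The corruption names and the image counts come from the description above; the helper names and the accuracy values in the usage line are hypothetical placeholders, not reported results.

```python
CORRUPTIONS = [
    "Gaussian Noise", "Shot Noise", "Impulse Noise",
    "Defocus Blur", "Frosted Glass Blur", "Motion Blur", "Zoom Blur",
    "Snow", "Frost", "Fog",
    "Brightness", "Contrast", "Elastic", "Pixelate", "JPEG",
]
TEST_IMAGES = 10_000   # clean test images that each corruption is applied to
SEVERITY_LEVELS = 5    # severity levels per corruption type


def images_per_corruption():
    """Training-set size of one corruption task."""
    return TEST_IMAGES * SEVERITY_LEVELS


def mean_accuracy(per_task_accuracy):
    """Average accuracy over all corruption tasks (one model per corruption)."""
    missing = set(CORRUPTIONS) - set(per_task_accuracy)
    if missing:
        raise ValueError(f"missing results for: {sorted(missing)}")
    return sum(per_task_accuracy[c] for c in CORRUPTIONS) / len(CORRUPTIONS)


# Placeholder accuracies, used only to show the call.
example_score = mean_accuracy({name: 80.0 for name in CORRUPTIONS})
```

Averaging over one model per corruption treats every corruption as an equally weighted task, which matches the robustness measure described above.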
For the experimental setting in Section 4.3, all compared hand-designed LR schedules are trained with a ResNet18 by SGDM with a momentum of , a weight decay of , and an initial learning rate of for 100 epochs with batch size 128. Fixed LR uses an LR of 0.1 throughout training; Exponential LR multiplies the LR by every epoch; MultiStep LR decays the LR by every 30 epochs; SGDR sets T_0 to , T_Mult to , and the minimum LR to ; Adam uses the default parameter setting. Our method trains the ResNet18 by SGD with a weight decay of , and the MLRSNet is learned under the guidance of a small validation set without corruptions. We randomly choose 10 clean images for each class as the validation set. The experimental results are listed in Table 1 in the main paper.
Additional robustness results of transferred LR schedules on different data corruptions. Furthermore, we explore the robustness of our transferred LR schedules across tasks. Unlike the above experiments, where all 15 models are trained under the guidance of a small validation set, we train a ResNet18 on the Gaussian Noise corruption only to learn the MLRSNet, and then transfer the learned LR schedules to the other 14 corruptions. We report the average accuracy of the 14 models on test data to show the robustness of our transferred LR schedules. All the methods are trained with a ResNet18 for 100 epochs with batch size 128. The hyperparameter setting of the hand-designed LR schedules is the same as above. Table 2 shows the mean test accuracy of the 14 models. As can be seen, our transferred LR schedules obtain the best final performance compared with the hand-designed LR schedules. This implies that our transferred LR schedules can also perform more robustly and stably than the preset LR schedules when the learning tasks change.
Table 2: Mean test accuracy of the 14 models on test data for each LR schedule.
Datasets/Methods      Fixed         MultiStep     Exponential   SGDR          Adam          Ours(Train)
CIFAR10C    Best      79.96 ± 4.09  85.64 ± 1.71  83.63 ± 1.38  86.10 ± 1.44  81.57 ± 1.39  85.73 ± 1.71
            Last      77.89 ± 4.05  85.48 ± 1.71  83.47 ± 1.37  78.46 ± 1.92  80.39 ± 1.65  85.62 ± 1.76
CIFAR100C   Best      46.91 ± 3.08  52.38 ± 2.43  49.90 ± 1.93  52.80 ± 2.39  45.58 ± 1.95  52.51 ± 2.38
            Last      44.81 ± 5.98  52.28 ± 2.44  49.75 ± 1.94  41.68 ± 3.33  43.94 ± 2.18  52.35 ± 2.46
Appendix D Computational Complexity Analysis
Our MLRSNet learning algorithm can be roughly regarded as requiring two extra full forward and backward passes of the network (step 6 in Algorithm 1) in addition to the normal network parameter update (step 8 in Algorithm 1), together with the forward passes of MLRSNet for every LR. Therefore, compared to normal training, our method needs about computation time for one iteration. Since we only update MLRSNet periodically, after several iterations, this does not substantially increase the computational complexity compared with normal network training. On the other hand, our transferred LR schedules predict the LR for each iteration with a small MLRSNet, whose computational cost should be significantly less than the cost of normal network training. To empirically show the differences between hand-designed LR schedules and our method, we conduct experiments with ResNet18 on CIFAR10 and report the running time of all methods. All experiments are run on a computer with an Intel Xeon(R) CPU E5-2686 v4 and an NVIDIA GeForce RTX 2080 8GB GPU. We follow the corresponding settings in Section 4.1, and the results are shown in Figure 15. Except for RTHO, which costs significantly more time, all methods, including MLRSNet training and testing, take similar time. MLRSNet takes only slightly longer to complete the training phase and, owing to its lightweight structure, adds little extra time in the testing phase compared to hand-designed LR schedules. Thus our method is fully capable of practical application.
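The amortization argument above can be made concrete with a rough cost model. The sketch below rests on simplifying assumptions that are not stated in the text: each extra forward and backward pass costs the same as a normal one, the forward pass of MLRSNet is expressed as a fraction of a normal iteration (zero by default), and the update interval is a free parameter. The function name `relative_cost` is hypothetical.

```python
def relative_cost(update_interval, extra_passes=2, mlrsnet_cost=0.0):
    """Average per-iteration cost relative to plain training (which costs 1.0).

    update_interval: number of iterations between two MLRSNet updates.
    extra_passes:    extra full forward/backward passes per MLRSNet update.
    mlrsnet_cost:    cost of one MLRSNet forward pass, as a fraction of a
                     normal training iteration (assumed negligible by default).
    """
    if update_interval < 1:
        raise ValueError("update_interval must be at least 1")
    return 1.0 + mlrsnet_cost + extra_passes / update_interval


# Updating MLRSNet at every iteration versus once every 100 iterations.
cost_every_step = relative_cost(update_interval=1)
cost_periodic = relative_cost(update_interval=100)
```

Under these assumptions the overhead shrinks in inverse proportion to the update interval, which is consistent with the similar running times reported for MLRSNet and the hand-designed schedules.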
Appendix E Experimental Results of Additional Compared Method LR Controller
In this section, we present the experimental results of LR Controller [72], which is related to our work but built on a reinforcement learning framework. Because their learning algorithm is relatively computationally expensive and not easy to optimize, we show that our method is superior in finding a good LR schedule that scales and generalizes.
For a fair comparison, we follow all the training settings and the structure of LR Controller proposed in [72], except that we change the batch size to 128 and increase the training steps to cover 200 epochs of data to match our setup in Section 4.1 (code for LR Controller can be found at https://github.com/nicklashansen/adaptivelearningrateschedule). First, we train LR Controller on CIFAR10 with ResNet18 and on CIFAR100 with WideResNet-28-10, as we do in Section 4.1. As shown in Fig. 16, our method demonstrates evident superiority in finding a solution with better generalization compared with the LR Controller strategy. LR Controller performs steadily in the early training phase, but soon fluctuates significantly and fails to progress. This tends to show that LR Controller suffers from a severe stability issue as the number of training steps increases, especially when compared with our MLRSNet.
Then we transfer the LR schedules learned on CIFAR10 by our method and by LR Controller to CIFAR100 to verify their transferability. The test settings are the same as those described in Section 4.2. As shown in Fig. 17, LR Controller makes comparatively slower progress throughout the training process, while our method achieves competitive performance, which indicates its capability of transferring to other tasks.