Deep architectures are generally trained by minimizing a non-convex loss function via an underlying optimization algorithm such as stochastic gradient descent or its variants. Finding the best-suited optimization algorithm and its optimal hyperparameters (such as learning rate and batch size) for training a model to the desired accuracy takes a fairly large amount of time, which is a major challenge for academics and industry practitioners alike. Usually, such tuning is done by optimizing the initial configuration through grid search or random search. Recent works have also formulated it as a bandit problem ().
However, it has been widely demonstrated that hyperparameters, especially the learning rate, often need to be dynamically adjusted as training progresses, irrespective of the initial choice of configuration. If they are not adjusted dynamically, training might get stuck in a bad minimum, and no amount of training time can recover from it. In this work, we focus on the learning rate, which is the foremost hyperparameter that one seeks to tune when training a deep learning model to get favourable results.
), RMSProp (), Adam () among others have been proposed that automatically adjust the learning rate as the training progresses, using functions of the gradient. Yet others have proposed fixed learning rate and/or batch size change regimes (, ) for certain dataset and model combinations.
In addition to traditional natural learning tasks, where a good LR regime might already be known from past experiments, adversarial training for generating robust models has been gaining a lot of popularity of late. In these cases, tuning the LR would generally require multiple time-consuming experiments, since the LR regime is unlikely to be known for every attack for every model and dataset of interest (footnote 2: for example, one can see a piecewise LR schedule given by  at https://github.com/MadryLab/cifar10_challenge/blob/master/config.json for a particular model). Moreover, new models are surfacing every day courtesy of state-of-the-art model synthesis systems, and new datasets are also becoming available quite often in different domains such as healthcare and the automobile industry. In each of these cases, no prior LR regime would be known, and considerable manual tuning would be required in the absence of a universal method with demonstrated effectiveness over a wide range of tasks, models and datasets.
 observed that solutions found by existing adaptive methods often generalize worse than those found by non-adaptive methods. Even though adaptive methods might display faster initial progress on the training set, their performance quickly plateaus on the test set, and learning rate tuning is required to improve the generalization performance of these methods. For the case of SGD with Momentum, learning rate (LR) step decay is very popular (,, ReduceLROnPlateau; footnote 3: see, e.g., https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ReduceLROnPlateau). However, at certain junctures of training, increasing the LR can potentially lead to quick, further exploration of the loss landscape and help the training escape a sharp minimum (having poor generalisation ). Further, recent works have shown that the distance traveled by the model in the parameter space determines how far the training is from convergence . This inspires the idea that increasing the LR to take bigger steps in the loss landscape, while maintaining numerical stability, might help in better generalization.
The idea of increasing and decreasing the LR periodically during training has been demonstrated by [17, 16] in their cyclical learning rate method (CLR). This has also been shown by , in Stochastic Gradient Descent with Warm Restarts (SGDR, popularly referred to as Cosine Annealing with Warm Restarts). In CLR, the LR is varied periodically in a linear manner, between a maximum and a minimum value, and it is shown empirically that such increase of learning rate is overall beneficial to the training compared to fixed schedules. In SGDR, the training periodically restarts from an initial learning rate, and then decreases to a minimum learning rate through a cosine schedule of LR decay. The period typically increases in powers of 2. The authors suggest optimizing the initial LR and minimum LR for good performance.
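The two periodic schedules described above can be sketched in a few lines. This is a minimal illustration rather than the reference implementation of either method: the function and parameter names (`triangular_clr`, `sgdr_lr`, `half_cycle`, `first_period`, `period_mult`) are assumptions introduced here, and the period multiplier of 2 simply mirrors the "powers of 2" remark in the text.

```python
import math

def triangular_clr(step, base_lr, max_lr, half_cycle):
    """Linear cyclical LR: rise from base_lr to max_lr over half_cycle steps,
    then fall back to base_lr over the next half_cycle steps."""
    cycle_pos = step % (2 * half_cycle)
    if cycle_pos <= half_cycle:
        frac = cycle_pos / half_cycle        # rising edge, 0 -> 1
    else:
        frac = 2 - cycle_pos / half_cycle    # falling edge, 1 -> 0
    return base_lr + (max_lr - base_lr) * frac

def sgdr_lr(epoch, max_lr, min_lr, first_period, period_mult=2):
    """Cosine annealing with warm restarts: the LR restarts at max_lr and
    decays to min_lr; each period is period_mult times the previous one."""
    period, start = first_period, 0
    while epoch >= start + period:           # locate the current period
        start += period
        period *= period_mult
    t = (epoch - start) / period             # progress within the period
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))
```

Both functions depend only on the step or epoch index, which is the point made later in the paper: the schedule is preset and does not react to the training state.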
 had suggested an adaptive learning rate schedule that allows the learning rate to increase when the signal is non-stationary and the underlying distribution changes. This is a computationally heavy method, as it requires computing the Hessian in an online manner.
Recently, there has been some work that explores gradients in different forms for hyperparameter optimization.  suggest an approach by which they exactly reverse SGD with momentum to compute gradients with respect to all continuous learning parameters (referred to as hypergradients); this is then propagated through an inner optimization.  suggest a dynamic LR-tuning approach, namely hypergradient descent, that applies gradient-based updates to the learning rate at each iteration in an online fashion.
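To make the idea of a gradient-based update on the learning rate concrete, here is a minimal one-dimensional sketch in the spirit of hypergradient descent. It is an illustration under stated assumptions, not the cited authors' code: the names `hypergradient_sgd`, `grad_fn` and `hyper_lr` are hypothetical, and the LR update uses the product of the current and previous gradients, which is the commonly stated form for plain SGD.

```python
def hypergradient_sgd(grad_fn, theta, lr, hyper_lr, steps):
    """Plain SGD on a scalar parameter whose LR is itself adapted online.

    The LR grows while consecutive gradients agree in sign and shrinks
    when they disagree. hyper_lr is the 'learning rate for the LR'.
    """
    prev_grad = 0.0
    lrs = []
    for _ in range(steps):
        grad = grad_fn(theta)
        lr = lr + hyper_lr * grad * prev_grad   # gradient step on the LR
        theta = theta - lr * grad               # ordinary SGD step
        prev_grad = grad
        lrs.append(lr)
    return theta, lrs


# Toy usage on f(theta) = theta**2, whose gradient is 2 * theta.
theta_final, lr_trace = hypergradient_sgd(lambda th: 2 * th, 1.0, 0.01, 0.001, 50)
```

The sketch also shows the cost mentioned later in the paper: the method still needs an initial LR and introduces `hyper_lr` as one more quantity to choose.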
We propose a new algorithm to automatically determine the learning rate for a deep learning job in an autonomous manner that simply compares the current training loss with the best observed thus far to adapt the LR. The proposed algorithm works across multiple datasets and models for different tasks such as natural as well as adversarial training. It is an ‘optimistic’ method, in the sense that it increases the LR to as high as possible by examining the training loss repeatedly. We show through rigorous experimentation that in spite of its simplicity, the proposed algorithm performs surprisingly well as compared to the state-of-the-art.
We propose a novel, and simple algorithmic approach for autonomous, adaptive learning rate determination that does not require any manual tuning, inspection, or pre-experimental discovery of the algorithmic parameters.
Our proposed algorithm works across datasets and models with no customization and reaches accuracy higher than or comparable to standard baselines in the literature in the same number of epochs on each of these datasets and models. It consistently performs well, finding stable minima with good generalization, and converges smoothly.
Our algorithm works very well in the adversarial learning scenario as well as in natural training, as demonstrated across different models and datasets for White-Box FGSM attacks.
We provide theoretical as well as extensive empirical validation of our algorithm.
2 Proposed Method
We propose an autonomous, adaptive LR tuning algorithm for determining the LR trajectory during the course of training. It operates in two phases: Phase 1: Initial LR exploration, which strives to find a good starting LR; Phase 2: Optimistic Binary Exploration. The pseudocode is provided in Algorithm 1. For the rest of the paper, we refer to the Automated Adaptive Learning Rate tuning algorithm as AALR for short.
The following description uses notation for the patience, the learning rate, the best loss, the current loss, the model, and the loss function. The best loss is initialized as the loss evaluated after initializing the model, before training starts.
Phase 1: Initial LR exploration
Phase 1 starts from an initial learning rate and an initial patience. It trains for an epoch, evaluates the loss, and compares it to the best loss. If the loss has decreased, the best loss is updated, and training continues for another epoch. Otherwise, the model is reloaded and re-initialized, the LR is halved, and the optimizer is reset with the new LR. The patience counter is reset. This continues until the algorithm determines a stable LR, at which it trains for epochs. The loss, the model and the optimizer state after Phase 1 are saved in a checkpoint.
Phase 2: Optimistic Binary Exploration
In this phase, AALR keeps the learning rate as high as possible for as long as possible at any given state of the training. Phase 2 starts by doubling LR to , and setting . After training for epochs, AALR first checks whether the loss is NAN. If so, the checkpoint (model and optimizer) corresponding to the best loss is reloaded along with the best loss value. Then the LR is halved, the patience is doubled, and training continues. If instead the loss is observed to have decreased compared to the best loss, the best loss is updated, and the corresponding model, optimizer and best loss are updated in the checkpoint. This is followed by doubling the LR, resetting the patience, and continuing training for the next epochs.
On the other hand, if the loss has not decreased, AALR trains for another epochs and checks the loss again. This is because, as informally stated before, AALR is ‘optimistic’ and ‘resists’ lowering the LR for as long as possible. (If the newly evaluated loss is NAN, the previous approach is followed.) However, if the new loss has still not decreased, AALR finally lowers the LR: it halves the LR, doubles the patience, and continues training for epochs. If, however, the loss has decreased, the previous approach is followed: AALR doubles the LR, resets the patience, updates the best loss and checkpoint, and repeats training for epochs. The above cycle repeats until the stopping criterion is met. For ease of exposition, the pseudocode is given in Algorithm 1.
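The two phases described above can be sketched as a small framework-agnostic controller. This is a hedged reading of the prose, not the authors' Algorithm 1. The hooks `init_model`, `evaluate` and `train`, the `max_epochs` budget used as the stopping criterion, and the toy problem are all hypothetical. The sketch further assumes that each check happens after the current patience number of epochs, that the best loss is reset when Phase 1 re-initializes the model, and that any non-finite loss is treated as the NAN case.

```python
import math

def aalr(init_model, evaluate, train, lr0, patience0, max_epochs):
    """Two-phase LR controller. Returns (checkpoint, best_loss, lr_history).

    Hypothetical hooks: init_model() -> model, evaluate(model) -> loss,
    train(model, lr, epochs) -> (model, loss).
    """
    used, history = 0, []

    # Phase 1: restart from scratch with half the LR until the loss improves
    # in each of patience0 consecutive epochs (no budget guard in this sketch).
    lr = lr0
    while True:
        model = init_model()
        best = evaluate(model)            # assumed: best loss reset on re-init
        stable = True
        for _ in range(patience0):
            model, loss = train(model, lr, 1)
            used += 1
            history.append(lr)
            if math.isfinite(loss) and loss < best:
                best = loss
            else:
                stable = False
                break
        if stable:
            break
        lr /= 2
    checkpoint = model

    # Phase 2: optimistic binary exploration.
    lr, patience = 2 * lr, patience0
    while used < max_epochs:
        improved = False
        for _ in range(2):                # a second look before lowering the LR
            n = min(patience, max_epochs - used)
            if n <= 0:
                break
            model, loss = train(model, lr, n)
            used += n
            history.extend([lr] * n)
            if not math.isfinite(loss):   # NAN case: stop looking
                break
            if loss < best:
                improved = True
                break
        if improved:
            best, checkpoint = loss, model
            lr, patience = 2 * lr, patience0
        else:
            if not math.isfinite(loss):
                model = checkpoint        # reload the best checkpoint
            lr, patience = lr / 2, 2 * patience
    return checkpoint, best, history


def make_toy(nan_above=None):
    """Gradient descent on f(theta) = theta**2 as a stand-in for training."""
    def train(theta, lr, epochs):
        if nan_above is not None and lr > nan_above:
            return theta, float("nan")    # emulate a run that blew up
        for _ in range(epochs):
            theta = theta - lr * 2 * theta
        return theta, theta * theta
    return (lambda: 1.0), (lambda theta: theta * theta), train
```

A usage example on the toy problem: `init, ev, tr = make_toy()` followed by `aalr(init, ev, tr, lr0=0.8, patience0=3, max_epochs=60)` yields an LR history that alternates between doubling after an improvement and halving after two unsuccessful looks. States are passed by value here, so keeping a reference is enough for a checkpoint; with a real framework the model and optimizer state would need to be copied or saved.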
3 Motivation and Related Work
Increasing the LR optimistically can potentially help the training to escape saddle points that slow down the training, as well as find flatter minima with good generalization performance. This is inspired mainly from the following observations in the literature.
 suggest that saddle points slow down the training of deep networks.  states that SGD moves in valley-like regions of the loss surface in deep networks by jumping from one valley wall to another at a height above the valley floor which is determined by the LR. A large LR can help generalization by helping SGD to quickly cross over the valley floor as well as its barriers, and so to travel far away from the initialization point in a short time. Similarly,  describe the initial training phase as a high-dimensional “random walk on a random potential” process, with an “ultra-slow” logarithmic increase in the distance of the weights from their initialization.
From the above discussion, it seems that if one could increase the step size or LR continuously (as long as stability is maintained), it might considerably speed up the increase in distance of the weights from the initialization point, making the initial ultra-slow diffusion process faster. In this way, further exploration of the loss landscape might be possible, leading to better generalization.
The idea of increasing the LR has been explored by algorithms like SGDR and CLR. In SGDR, the LR is reset to a high value in a periodic manner; this is referred to as warm restart. After this, the LR decays to a low value following a cosine annealing schedule. In CLR, the LR is increased and decreased linearly in a periodic manner. While the regular increase in LR probably helps generalization in most cases and helps in finding flatter minima, these methods follow a preset schedule that does not depend on the training state or progress. Detection of convergence also becomes difficult due to heavy fluctuation in the training output (which happens due to the periodic nature of these methods). Moreover, the authors of each of these methods suggest tuning the parameters of the algorithm for better performance. Even though  suggest a method that uses information about the state and distribution, it is a computationally heavy method. Similarly, hypergradient descent due to  requires additional computation of gradients. Moreover, it requires tuning of the initial LR and introduces additional hyperparameters to be tuned, such as the learning rate for the LR itself.
We propose the simple idea of exploring the LR in a binary fashion, without requiring any parameter tuning. This is an adaptive LR tuning algorithm that tries to follow the training state and set the LR accordingly. Increasing the LR for better generalization through exploration (and also potential acceleration of the initial phase of SGD) is the main motivation for the optimistic doubling. At the same time, once SGD is in the vicinity of a good minimum, the LR might need to be reduced to access the valley. Hence, if the algorithm observes that the loss is not reducing even after a few ‘patience’ iterations, it halves the LR. The reduction is kept conservative, a halving, to encourage finding flatter minima.
The automated adaptive LR algorithm we propose achieves good generalization in all cases, including adversarial scenario, and converges smoothly in roughly the same time as LR-tuned SGD baselines available in the literature and community.
4 Convergence Analysis
Convergence analysis of SGD typically requires that the sequence of step sizes, or learning rates, $\{\eta_t\}$ used during training satisfy the following conditions: $\sum_{t} \eta_t = \infty$ and $\sum_{t} \eta_t^2 < \infty$.
Consider an optimal stochastic gradient approach OPT that at any point in time has oracle access to (and applies) the highest value of the learning rate that would be amenable to good training (ensuring fast convergence and good generalization). The sequence of LRs chosen by OPT satisfies the above conditions. This sequence ensures that OPT will converge (with good generalization); at the same time, the convergence is fast, since the step sizes or LRs are kept as high as possible. Let the LR of OPT at any epoch be denoted as .
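As a worked instance of the classical step-size conditions referred to here (the sum of the step sizes diverges while the sum of their squares stays finite), consider a decaying schedule $\eta_t = \eta_0 / t$. The symbol $\eta_t$ for the LR at step $t$ and the particular schedule are assumptions made only for this illustration.

```latex
\sum_{t=1}^{\infty} \eta_t
  = \eta_0 \sum_{t=1}^{\infty} \frac{1}{t} = \infty ,
\qquad
\sum_{t=1}^{\infty} \eta_t^{2}
  = \eta_0^{2} \sum_{t=1}^{\infty} \frac{1}{t^{2}}
  = \frac{\pi^{2}}{6}\,\eta_0^{2} < \infty .
```

By contrast, a constant schedule $\eta_t = \eta_0 > 0$ satisfies the first condition but not the second, since $\sum_t \eta_0^2 = \infty$.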
One can define OPT as the following:
Definition: An optimal oracle SGD, with LR at any epoch , such that the following properties hold:
Any SGD algorithm that has the same location in parameter space as OPT at the start of an epoch must have LR , otherwise training will diverge (loss will increase),
Any SGD algorithm starting from the same location in parameter space as OPT and achieving similar generalization for a given training task, will require at least as many epochs as OPT for convergence in expectation.
We will compare AALR with OPT, and show that the sequence of LRs chosen by AALR follows the sequence of LRs of OPT with a bounded delay, and hence bound the expected maximum time to convergence, under some assumptions. We also show that divergence will not happen.
A typical well-tuned SGD algorithm can be thought to be a proxy for OPT for a given scenario, and hence this analysis will bound the convergence time of AALR with respect to LR-tuned SGD for the same problem.
In a typical step decay LR-regime for SGD, the LR does not increase, but generally decreases at certain intervals by some factor . For standard LR schedules, one can see that the following rule-of-thumb holds: the number of epochs in between two consecutive LR changes is directly proportional to . In fact one can see that for standard regimes, , where . (For example, change by a factor of happens at every epochs or more, or, change by a factor of happens every epochs or more). Such typical LR regimes are often designed by observing when the loss plateaus. We assume that OPT has a similar behaviour in the following analysis.
4.1 Bounding the Delay in Convergence due to Doubling
Let the LR of OPT at any epoch be denoted as and that of AALR be denoted as . We assume that Phase 1 has estimated a stable initial LR, and that both AALR and OPT are roughly in the same region of the loss valley at the start of Phase 2, denoted as epoch (for simplicity). In the following, we refer to a decrease in loss compared to the best observed loss thus far as an improvement in state.
Assuming that the loss surface is smooth, the loss will continue to decrease for AALR as long as its LR does not exceed that of OPT, and it will start increasing otherwise (by the definition of OPT).
We first argue that AALR will not diverge. From Algorithm 1, it can be seen that every time the state improves, the checkpoint is updated. If and when, due to doubling (or due to the initial LR), the loss diverges and goes to NAN, the last checkpoint is reloaded, the LR is halved and training continues. This will continue until a stable LR is reached and the loss is no longer NAN. In this way, AALR can avoid losing its way due to exploding gradients caused by an undue increase in LR.
Now, let us consider the case, when OPT has increased its LR. AALR, by design is always optimistically doubling the LR whenever state improves. For an increase in by a factor , it can be seen that AALR will require epochs. This is because, when state improves, patience will be reset to , LR will be doubled, and the state will be checked again after training for epochs.
We next show that AALR reaches the same LR as OPT, or a lower one, (with some delay) every time OPT reduces its LR, and hence reaches similar generalization in expectation.
AALR starts with and trains for epochs and checks the state. If OPT has maintained at , then state will not improve. AALR will train for another epochs, and then reduce LR by half, and double the patience. It trains at this LR for epochs. Therefore, effective training for AALR is for epochs out of the epochs it spent. Now if the state improves, AALR would double the LR and the above cycle would repeat till we come to the state where OPT needs to reduce the LR for making progress. Let there be such cycles, such that OPT has trained for epochs and AALR has trained for epochs to arrive at roughly the same location in parameter space (assuming bounded gradients) and both have the same LR.
Now, let OPT reduce its LR by , i.e., (For simplicity, let us assume that is a power of ). AALR would first double the LR to (since its state was improving till epochs), and patience will be reset to . It will need to reduce its LR times before it observes an improvement in state (by the assumption that OPT maintains the highest possible LR for training progress). It will train for epochs, then halve the LR, double , train for epochs, and repeat this for times. One can see by induction that AALR will be spending a total of epochs. At this time, at this time. Now, AALR will train for epochs at this LR, after which it will observe an improvement in state. Note that by the earlier observation regarding typical LR regimes and the behavior of OPT, OPT would train for at least epochs at this new LR. Hence, AALR has trained for a total of roughly twice the number of epochs as OPT, and at the new LR for roughly the same number of epochs as OPT. Therefore, both are now at a similar location in parameter and loss space. After this, AALR would again double the LR, and the earlier cycle would repeat for another times, such that OPT would have trained for epochs and AALR would have trained for epochs till the next LR change happens in OPT. Therefore, one can see that AALR would take at the most times in expectation the number of epochs as OPT to reach the same or lower LR, every time OPT lowers the LR.
Claim: From the above discussion, it follows that AALR would reach a minimum similar to that of an optimal adaptive approach, and hence of a well-tuned SGD, in at most twice the number of epochs.
Since the LR sequence of AALR follows the LR sequence of OPT with some finite delay, it can be argued that the following convergence requirements on the LR sequence $\{\eta_t\}$ hold for AALR: (1) $\sum_{t} \eta_t = \infty$, and (2) $\sum_{t} \eta_t^2 < \infty$.
In practice, we observe that AALR converges in around the same time as LR-tuned SGD.
We trained with AALR on several model-dataset combinations, in multiple scenarios such as natural training as well as adversarial training. We observed that AALR achieves similar or better accuracy compared to the state-of-the-art baselines.
We have compared against standard SOTA (SGD or other) LR-tuned values reported in the literature and against three other adaptive LR tuning algorithms: SGDR (Cosine Scheduling with Warm Restarts), CLR (Cyclic Learning Rates), and ADAM. Since the principal claim of AALR is that it is a completely autonomous adaptive approach that does not require any tuning, for a fair comparison we have not tuned the parameters of any of the other adaptive approaches we compare with. Since AALR does not have any tunable parameters by design, sensitivity analysis experiments were not performed for AALR.
5.1 Settings, Datasets and Models
Experiments were done in PyTorch on x86 systems using 6 cores and 1 GPU. Where baselines for SGDR () and CLR are not available in the literature, the PyTorch-provided implementations of the corresponding LR schedulers with default settings were used (available at https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html).
We have tested on datasets CIFAR10 and CIFAR100 using standard data augmentation for both, on models Resnet-18, WideResnet-28-10 with dropout and with both dropout and cutout, WideResnet-34-10 (for adversarial scenario only), SimplenetV1, and Vgg16 with and without batch normalization. For Resnet-18, WideResNets and Vgg16 models, we ran all algorithms forepochs at a batch size of for both datasets. For SimplenetV1, we ran for epochs and used batch size of and for CIFAR10 and CIFAR100 respectively for all algorithms. The code for SimpleNetV1 was obtained from https://github.com/Coderx7/SimpleNet_Pytorch. The code for Resnet-18 and WideResnets was obtained from https://github.com/uoguelph-mlrg/Cutout. The code for Vgg16 was obtained from https://github.com/chengyangfu/pytorch-vgg-cifar10/blob/master/main.py. We use cross entropy loss in all cases.
Where permitted by compute time and resources, we ran or more runs (for these cases, we report the mean of the peak accuracy) to the same number of epochs as the baselines. In other cases, where we could complete only run, we report the peak values. Some ADAM runs completed only a certain number of epochs; here we report the value along with the number of epochs run. For certain models, the runs had not been scheduled (for running on GPU resources) by the time of reporting; hence, we report ‘-’ in these cases.
Our experiments comprehensively show that AALR is a state-of-the-art automated adaptive LR tuning algorithm that works universally across model-dataset combinations for both natural and adversarial training. It is either better than or comparable to LR-tuned baselines and other adaptive algorithms uniformly and consistently, with smooth convergence behavior. This makes the case that for new models and/or datasets, AALR should be a reliable LR algorithm of choice in the absence of any prior tuning or experimentation.
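The reporting rule above (peak accuracy per run, averaged over runs) can be written out as follows. The function name and the accuracy values in the usage note are hypothetical and are not results from the paper.

```python
from statistics import mean

def mean_peak_accuracy(runs):
    """runs: one list of per-epoch test accuracies per completed run.

    Takes the best epoch of each run, then averages across runs.
    With a single run this reduces to that run's peak value.
    """
    return mean(max(per_epoch) for per_epoch in runs)
```

For example, two made-up runs `[90.0, 95.0, 94.0]` and `[91.0, 93.0, 96.0]` have peaks 95.0 and 96.0, so the reported figure would be their mean.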
Among the adaptive algorithms we compared with, SGDR performs comparably to AALR in most cases of natural training, but it failed catastrophically in at least two natural training cases (which indicates that it requires tuning of either the initial LR or some other parameters, and hence is not a completely standalone automated approach); moreover, it generally did significantly worse than AALR in adversarial training. CLR achieved slightly lower accuracy than AALR in most cases of natural training, and in adversarial training its performance fluctuated on a case-by-case basis. ADAM generally converged to lower accuracy, significantly lower in the adversarial scenario, and furthermore it failed catastrophically in two cases of natural training, which shows that it requires extensive tuning of parameters.
AALR was consistently top-level in every case, which makes the case for its universal and reliable applicability, especially when new models/datasets/training tasks surface for which prior tuning or information is not available.
5.3 Natural Training
The baseline values reported have the following sources. (a) Resnet-18 as reported by , (b) WideResnet-28-10 baseline as reported by , and sgdr as reported by  (c) WideResnet-28-10 with Cutout baseline as reported by , (d) SimplenetV1 baseline as reported at https://github.com/Coderx7/SimpleNet_Pytorch (originally  had reported lower values), (e) Vgg16 with and without Batch Normalization for CIFAR10 as reported at https://github.com/chengyangfu/pytorch-vgg-cifar10 and http://torch.ch/blog/2015/07/30/cifar.html (the former values are higher). The corresponding baseline for CIFAR100 was not available at the same place.
|SimpleNet-V1||95.51||95.17||95.44||93.66||93.99 (till 327 epochs)|
|WRN-28-10 (Dropout + Cutout)||96.92||96.44||96.6||95.42||-|
|SimpleNet-V1||78.51||78.21||77.47||74.02||73.48 (till 340 epochs)|
5.4 Adversarial Training
Here we outline the results and observations from adversarial training. We observe that AALR is particularly effective in Adversarial Training and achieves (to the best of our knowledge) state-of-the-art adversarial test accuracy for FGSM attack in a White Box model. It generally does significantly better compared to the other adaptive algorithms compared with and convergence is easier to detect, unlike the other methods. It would be interesting to explore theoretical justification regarding the effectiveness of AALR in first-order adversarial training, and it might be related to the loss landscape of the min-max saddle point problem defined by .
In the process of these experiments, we discovered that SimpleNetV1 is a very effective adversarially strong model with respect to CIFAR10, especially when trained with AALR, under FGSM attack ( and ) in the White Box model. This is a lightweight model. Adversarial training, generally a compute-heavy and time-consuming process, becomes much easier and faster with it. The effectiveness of AALR in training models for different and new scenarios is clearly underlined by these experiments. All attacks are within ball. All models were trained on CIFAR10 for epochs, using batch size of .
The baseline we could find is as follows. For the FGSM White Box attack on CIFAR10,  report . For other cases, we could not find baseline figures. Therefore, we consider as the representative baseline for each of the cases in FGSM. We use cross entropy loss in all cases.
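For readers unfamiliar with the attack used in these experiments, the FGSM perturbation can be sketched on a tiny logistic-regression model where the input gradient is available in closed form. This is an illustrative assumption-laden sketch, not the paper's setup: the model, the name `fgsm_linear`, the pixel range [0, 1], and the use of an L-infinity budget `eps` are choices made here for the example.

```python
import math

def fgsm_linear(x, y, w, b, eps):
    """One-step FGSM against logistic regression with cross-entropy loss.

    x: input features in [0, 1]; y: label in {0, 1}; w, b: model weights.
    Returns x moved by eps in the sign of the input gradient, clipped to [0, 1].
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))             # predicted P(y = 1)
    grad = [(p - y) * wi for wi in w]          # d(loss)/dx in closed form
    sign = lambda g: (g > 0) - (g < 0)
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]
```

Each coordinate moves by at most `eps`, so the perturbed input stays inside the L-infinity ball of radius `eps` around `x`. Adversarial training then minimizes the loss on such perturbed inputs instead of the clean ones.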
|Resnet-18 (FGSM, , )||66.91||59.89||68.19||33.26|
|WRN-34-10 (FGSM, , )||65.86||61.55||55.63||16.13 (till 68 epochs)|
|SimpleNet-V1 (FGSM, , )||65.02||55.4||62.43||17.88|
|WRN-28-10 (FGSM, , ),|
|WRN-28-10 (FGSM, , ),|
-  (2018) Online learning rate adaptation with hypergradient descent. ICLR. Cited by: §1, §3.
-  (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pp. 2933–2941. Cited by: §3.
- Improved regularization of convolutional neural networks with cutout. https://arxiv.org/abs/1708.04552. Cited by: §5.3.
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §1.
- Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §1, §1.
-  (2016) Lets keep it simple, using simple architectures to outperform deeper and more complex architectures. arXiv preprint arXiv:1608.06037. Cited by: §5.3.
-  (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pp. 1731–1741. Cited by: §1, §3.
-  (2017) Densely connected convolutional networks. CVPR. Cited by: §1.
-  (2016) On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836. Cited by: §1.
-  (2015) Adam: a method for stochastic optimization. ICLR. Cited by: §1.
-  (2017) Hyperband: bandit-based configuration evaluation for hyperparameter optimization. Journal of Machine Learning Research. Cited by: §1.
-  (2016) Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §1, §5.1, §5.3.
-  (2015) Gradient-based hyperparameter optimization through reversible learning. ICML 2015. Cited by: §1.
-  (2018) Towards deep learning models resistant to adversarial attacks. ICLR. Cited by: §5.4, §5.4, footnote 2.
-  (2013) No more pesky learning rates. In International Conference on Machine Learning, pp. 343–351. Cited by: §1, §3.
-  (2017) Exploring loss function topology with cyclical learning rates. arXiv preprint arXiv:1702.04283. Cited by: §1.
- Cyclical learning rates for training neural networks. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. Cited by: §1.
-  (2018) Don’t decay the learning rate, increase the batch size. ICLR. Cited by: §1.
-  (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning. Cited by: §1.
-  (2017) The marginal value of adaptive gradient methods in machine learning. NIPS 2017. Cited by: §1.
-  (2018) A walk with sgd. arXiv preprint arXiv:1802.08770. Cited by: §3.
-  (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §5.3.
-  (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §1.
Appendix A
Here we provide some representative plots obtained during training.