One of the fascinating properties of deep neural networks (DNNs) is their ability to generalize well, i.e., deliver high accuracy on the unseen test dataset. DNNs are typically over-parameterized, i.e., have far more parameters than the number of training samples. Nevertheless, they show remarkable generalization, defying the predictions of traditional complexity measures such as VC dimension (Vapnik, 1999), Rademacher complexity (Bartlett and Mendelson, 2002), etc. Many results have pointed out that the use of optimizers based on stochastic gradient descent (SGD) is crucial to the good generalization behavior of DNNs (Xing et al., 2018; Keskar et al., 2016; Wu et al., 2018). Moreover, the batch size and the learning rate schedules used for these optimizers seem to play an important role in the generalization performance (Keskar et al., 2016; Goyal et al., 2017; Wu et al., 2018).
Although understanding generalization of deep neural networks is an open problem, there have been interesting findings recently. Kawaguchi (2016) found that, under certain simplifying assumptions, deep neural networks have many local minima, but all local minima are also global minima (also see Goodfellow et al., 2016). Also, it is widely believed that wide minima generalize much better than narrow minima (Hochreiter and Schmidhuber, 1997; Keskar et al., 2016; Jastrzebski et al., 2017; Wang et al., 2018), even though they have the same training loss. Keskar et al. (2016) found that small batch SGD generalizes better than large batch SGD and also lands in wider minima, suggesting that noise in SGD acts as an implicit regularizer. Interestingly, however, later work was able to generalize quite well even with very large batch sizes (Goyal et al., 2017; McCandlish et al., 2018; Shallue et al., 2018) by scaling the learning rate linearly as a function of the batch size. This suggests that the lack of noise in large batch SGD can be compensated with high learning rates, and thus the learning rate plays a crucial role in generalization.
In this paper, we study the question: what are the key properties of a learning rate schedule that help DNNs generalize well during training?
We conduct a series of experiments training Resnet18 on Cifar-10 over 200 epochs. We vary the number of epochs trained at a high learning rate of 0.1, called the explore epochs, from 40 to 100, and divide the remaining epochs equally between learning rates of 0.01 and 0.001. Note that the training loss typically stagnates around 50 epochs with a learning rate of 0.1. Despite that, we find that as the number of explore epochs increases to 100, the average test accuracy also increases. Further, we find that the minima found in higher test accuracy runs are wider than the minima from lower test accuracy runs, corroborating past work on wide minima and generalization. What was particularly surprising, however, was that even when using 40 explore epochs, a few runs out of many still resulted in high test accuracies!
Thus, we find that an initial exploration phase with a high learning rate is essential to the good generalization of DNNs. Moreover, this exploration phase needs to be run for sufficient time, even if the training loss stagnates much earlier. Further, even when the exploration phase is not given sufficient time, a few runs still see high test accuracy values.
To explain these observations, we hypothesize that, in the DNN loss landscape, the density of narrow minima is significantly higher than that of wide minima.
Consider the fact that a large learning rate can escape narrow minima easily (as the optimizer can jump out of them with large steps). However, once it reaches a wide minimum, it is likely to get stuck in it (if the “width” of the wide minimum is large compared to the step size). With fewer explore epochs, a large learning rate might still get lucky occasionally and find a wide minimum, but it usually finds only a narrower minimum because of their higher density. As the explore duration increases, the probability of eventually landing in a wide minimum also increases. Thus, a minimum explore duration is necessary to land in a wide minimum with high probability.
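The escape argument above can be illustrated with a minimal sketch. It assumes a one-dimensional quadratic loss 0.5 * curvature * w^2, for which plain gradient descent is stable only when the learning rate is below 2 / curvature. The function name `run_gd` and the curvature values 30 and 5 are hypothetical choices for illustration, not quantities measured in this paper.

```python
def run_gd(curvature, lr, steps=50, w0=1.0):
    """Run gradient descent on the toy loss 0.5 * curvature * w**2.

    Returns |w| after `steps` updates. Each update multiplies w by
    (1 - lr * curvature), so the iterate shrinks only if lr < 2 / curvature.
    """
    w = w0
    for _ in range(steps):
        grad = curvature * w  # derivative of 0.5 * curvature * w**2
        w -= lr * grad
    return abs(w)


# Hypothetical curvatures: a "narrow" (sharp) valley and a "wide" (flat) one.
narrow_high_lr = run_gd(curvature=30.0, lr=0.1)   # 0.1 > 2/30: iterate is thrown out
wide_high_lr = run_gd(curvature=5.0, lr=0.1)      # 0.1 < 2/5: iterate settles
narrow_low_lr = run_gd(curvature=30.0, lr=0.01)   # 0.01 < 2/30: iterate settles

print(narrow_high_lr > 1.0, wide_high_lr < 1.0, narrow_low_lr < 1.0)
```

In this toy setting, the high learning rate cannot remain in the sharp valley but does settle in the flat one, while a low learning rate settles in either. Real DNN loss surfaces are not quadratic, so this is only an intuition aid.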
Motivated by the above wide-minima density hypothesis, we design a novel Explore-Exploit learning rate schedule, where the initial explore phase optimizes at a high learning rate in order to arrive in the vicinity of a wide minimum. This is followed by an exploit phase which descends to the bottom of this wide minimum. We give the explore phase enough time so that the probability of landing in a wide minimum is high. For the exploit phase, we experimented with multiple schemes, and found a simple, parameter-free linear decay to zero to be effective. Thus, our proposed learning rate schedule optimizes at a constant high learning rate for some minimum time, followed by a linear decay to zero. We call this LR schedule the Knee schedule.
We extensively evaluate the Knee schedule across a wide range of models and datasets, ranging from NLP (SQuAD on BERT-base, Transformer on IWSLT) to CNNs (e.g. ImageNet on ResNet-50, Cifar-10 on ResNet18), and spanning multiple optimizers: SGD Momentum, RAdam, and Adam. In all cases, Knee schedule improves on the test accuracy of state-of-the-art hand-tuned learning rate schedules when trained using the original training budget. We also experimented with reducing the training budget, and found that Knee schedule can achieve the same accuracy as the baseline under much reduced training budgets. For example, on SQuAD v1.1 fine-tuning with BERT-base (Devlin et al., 2018), Knee schedule is able to achieve an EM score of 81.38, compared to 80.9 for the baseline. For the ImageNet dataset, we are able to train with a 44% smaller training budget for the same test accuracy.
The main contributions of our work are:
- The observation that an initial explore phase with high learning rate is crucial for good generalization.
- A hypothesis of lower density of wide minima in the DNN loss landscape, explaining why a high learning rate needs to be maintained for a sufficient duration to achieve good generalization.
- Incorporating this hypothesis via an Explore-Exploit learning rate schedule that outperforms prior hand-tuned learning rate schedules.
2 Wide-Minima Density Hypothesis
Many popular learning rate (LR) schedules, such as the step decay schedules for image datasets, start the training with a high LR and then reduce the LR periodically. For example, consider the case of Cifar-10 on Resnet-18, trained using a typical step LR schedule of 0.1, 0.01, 0.001 for 100, 50, 50 epochs, respectively. In many such schedules, an interesting observation is that, even though training loss stagnates after several epochs of high LR, one still needs to continue training at high LR in order to get good generalization.
For example, see Figure 1, which shows the training loss for Cifar-10 on Resnet-18, trained with a fixed LR of 0.1 (orange curve), compared to a model trained via a step schedule with LR reduced at epoch 50 (blue curve). As can be seen from the figure, the training loss stagnates after 50 epochs for the orange curve, and locally it makes sense to reduce the learning rate to decrease the loss. However, as shown in Table 1, generalization is directly correlated with duration of training at high LR, with the highest test accuracy achieved when the high LR is used for 100 epochs, well past the point where training loss stagnates.
|Epochs at 0.1 LR||Test Accuracy, Avg. (Std. Dev.)||Train Loss, Avg. (Std. Dev.)|
|40||94.56 (0.16)||0.0015 (5e-5)|
|60||94.67 (0.17)||0.0016 (6e-5)|
|80||94.74 (0.15)||0.0016 (5e-5)|
|100||94.79 (0.13)||0.0017 (5e-5)|
Table 1: Cifar-10 on Resnet-18, trained for 200 epochs with SGD and momentum 0.9. An LR of 0.1 is used for a varying number of epochs. Half of the remaining epochs are trained at 0.01 and the other half at 0.001. We report the average and standard deviation over 50 runs.
Figure 2: Histogram of the maximum eigenvalue of the Hessian at the final minima for 50 random trials of Cifar-10 on Resnet-18. Panels: (a) Explore 40, (b) Explore 60, (c) Explore 80, (d) Explore 100. Each panel shows the histogram for runs with a different number of explore epochs. The distribution moves toward lower eigenvalues, and sharpens, as the number of explore epochs increases.
Figure 3: Panels: (a) Explore 40, (b) Explore 60, (c) Explore 80, (d) Explore 100.
To understand the above phenomena, we run the following experiment. We train Cifar-10 on Resnet-18 for 200 epochs, using a high LR of 0.1 for only 40 epochs and then LRs of 0.01 and 0.001 for 80 epochs each. We repeat this training 50 times with different random weight initializations. On average, as expected, this training yields a low test accuracy of . However, in 2 of the 50 runs, we find that the test accuracy crosses (with a maximum of ), significantly higher than the average accuracy of obtained while training at high LR for 100 epochs!
Minima Width definition. We would now like to understand the correlation between these test accuracy values and the shape of the minima. In this paper, we characterize the minima width by the curvature of the loss surface around the minimum (Keskar et al., 2016; Chaudhari et al., 2019). Specifically, we use the highest eigenvalue of the Hessian of the loss surface at the end of training as a measure of the minima width (we used the open-source implementation at https://github.com/noahgolmant/pytorch-hessian-eigenthings). Thus, we define one minimum as wider than another if it has a lower eigenvalue in the direction of the sharpest curvature. This metric of using the highest eigenvalue to measure minima width has been used in several previous papers (Wu et al., 2018; Keskar et al., 2016).
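As an illustration of this width metric, the following minimal sketch estimates the largest Hessian eigenvalue by power iteration on Hessian-vector products. It is not the implementation used for the experiments: it assumes NumPy, approximates Hessian-vector products by finite differences of a user-supplied gradient function, and is demonstrated on a toy quadratic loss. The names `top_hessian_eigenvalue` and `toy_grad` and the curvature values are hypothetical.

```python
import numpy as np


def top_hessian_eigenvalue(grad_fn, w, iters=200, eps=1e-4, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue at `w`.

    Power iteration on Hessian-vector products, which are approximated by a
    central finite difference of the gradient. At a minimum the Hessian is
    positive semi-definite, so this is the sharpest curvature.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(iters):
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)
        eigenvalue = float(v @ hv)  # Rayleigh quotient for the current direction
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eigenvalue


# Toy loss 0.5 * w^T A w with known curvatures 10, 1 and 0.1 (assumed values).
A = np.diag([10.0, 1.0, 0.1])


def toy_grad(w):
    return A @ w


sharpness = top_hessian_eigenvalue(toy_grad, np.zeros(3))
print(round(sharpness, 4))
```

Under the definition above, a minimum with a smaller value of this estimate would be called wider. For a real network, the gradient function would be replaced by a gradient of the training loss, which is considerably more expensive.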
To identify the relation between minima width and test accuracy for the above (40-epoch at 0.1 LR) runs, we compute their highest eigenvalues. We find that the high test accuracy runs consistently have smaller eigenvalues (wider minima) compared to the low accuracy runs, corroborating the observations of several previous papers (Hochreiter and Schmidhuber, 1997; Keskar et al., 2016; Jastrzebski et al., 2017; Wang et al., 2018). For example the run with highest test accuracy of (training loss: ) had an eigenvalue of , while the run with median accuracy of (training loss: ) had an eigenvalue of , and the run with minimum accuracy of (training loss: ) had an eigenvalue of .
Hypothesis. To explain the above observations, i.e., using a high learning rate for short duration results in low average test accuracy with rare occurrences of high test accuracy, while using the same high learning rate for long duration achieves high average test accuracy, we introduce a new hypothesis. We hypothesize that, in the DNN loss landscape, the density of narrow minima is significantly higher than that of wide minima.
Now, an intuitive explanation of why high learning rates are necessary to locate wide minima is that a large learning rate can escape narrow minima “valleys” easily (as the optimizer can jump out of them with large steps). However, once it reaches a wide minimum “valley”, it is likely to get stuck in it (if the “width” of the wide valley is large compared to the step size). For example, see Wu et al. (2018) for a result showing that large learning rates are unstable at narrow minima and thus don’t converge to them. Thus the optimizer, when running at a high learning rate, jumps from one narrow minimum region to another, until it lands in a wide minimum region where it then gets stuck. Now, the probability of an optimization step landing in a wide minimum is a direct function of the proportion of wide minima compared to that of narrow minima. Thus, if our hypothesis is true, i.e., wide minima are much fewer than narrow minima, this probability is very low, and the optimizer needs to take a lot of steps to have a high probability of eventually landing in a wide minimum. This explains the observation in Table 1, where the average accuracy continues to improve as we increase the number of high learning rate training steps. The hypothesis also explains why very few (just 2) of the 50 runs trained at 0.1 LR for 40 epochs also managed to attain high accuracy: they simply got lucky and landed in a wide minimum even with a shorter duration. Further, see Section 5 for another experiment in the literature that adds more evidence to this hypothesis.
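The probabilistic argument above can be made concrete with a deliberately simplified model, which is an assumption introduced here for illustration and not part of the experiments. Suppose each "jump" at high learning rate independently lands in a wide minimum with a small probability `p_wide`, and that the optimizer stays there once it does. The function names and the value `p_wide = 0.01` are hypothetical.

```python
import math


def prob_reach_wide(p_wide, num_jumps):
    """Probability of having landed in a wide minimum within `num_jumps` jumps,
    assuming independent jumps that each succeed with probability `p_wide`."""
    return 1.0 - (1.0 - p_wide) ** num_jumps


def jumps_needed(p_wide, target_prob):
    """Smallest number of jumps for which prob_reach_wide reaches `target_prob`."""
    return math.ceil(math.log(1.0 - target_prob) / math.log(1.0 - p_wide))


# Hypothetical density: 1% of jumps end in a wide minimum.
for n in (40, 100, 300):
    print(n, round(prob_reach_wide(0.01, n), 3))
print(jumps_needed(0.01, 0.95))
```

In this toy model the success probability grows monotonically with the number of jumps, yet is nonzero even for few jumps, which mirrors both the improving averages in Table 1 and the occasional lucky short-explore run. The independence assumption is a simplification; actual optimizer trajectories are correlated.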
To validate this hypothesis further, we run experiments similar to the one in Table 1. Specifically, we train Cifar-10 on the Resnet-18 model for 200 epochs using a standard step schedule with LRs of 0.1, 0.01, and 0.001. We vary the number of epochs trained using the high LR of 0.1, called the explore epochs, from 40 to 100 epochs, and divide up the rest of the training equally between 0.01 and 0.001. For each experimental setting, we conduct 50 random trials and plot the distributions of final test accuracy and of the largest eigenvalue of the Hessian of the loss at the final minima. If our hypothesis is true, then the more we explore, the higher the probability of landing (and getting stuck) in a wide minimum region, which should cause the eigenvalue distribution to sharpen and move towards wider minima (lower eigenvalues) as the number of explore steps increases. This is exactly what is observed in Figure 2. Also, since wide minima correlate with higher test accuracy, we should see the test accuracy distribution move towards higher accuracy and sharpen as the number of explore steps increases. This is confirmed as well in Figure 3.
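The step schedule used in these runs can be written down as a small function. This is a sketch under the setup described above (0.1 for the explore epochs, then the remaining epochs split equally between 0.01 and 0.001); the function name and argument names are hypothetical.

```python
def explore_step_schedule_lr(epoch, explore_epochs, total_epochs=200,
                             lrs=(0.1, 0.01, 0.001)):
    """Learning rate for a 0-indexed `epoch`.

    The first `explore_epochs` epochs use lrs[0]; the remaining epochs are
    split equally between lrs[1] and lrs[2].
    """
    if epoch < explore_epochs:
        return lrs[0]
    remaining = total_epochs - explore_epochs
    if epoch < explore_epochs + remaining // 2:
        return lrs[1]
    return lrs[2]


# Epoch boundaries for the four settings studied (40, 60, 80, 100 explore epochs).
for explore in (40, 60, 80, 100):
    schedule = [explore_step_schedule_lr(e, explore) for e in range(200)]
    print(explore, schedule.count(0.1), schedule.count(0.01), schedule.count(0.001))
```

For example, 40 explore epochs leave 80 epochs at each of the two lower learning rates, while 100 explore epochs leave 50 each.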
Multi-scale. Given the importance of explore at high LR, a natural question is whether explore is necessary at smaller LRs as well. To answer this, we train the same network for a total of 200 epochs with an initial high LR of 0.1 for 100 epochs, but now we vary the number of epochs trained with the LR of 0.01 (we call this finer-scale explore), and train with an LR of 0.001 for the remaining epochs.
As can be seen from Table 2, although the final training loss remains similar, we find that finer-scale explore also plays a role similar to the initial explore in determining the final test accuracy. This result indicates that our hypothesis about density of wide/narrow regions indeed holds at multiple scales. To understand how the density of wide minima at this scale compares to the density at the initial scale, we do 50 random trials of the 20 epochs finer-scale explore experiment (Table 2 second row), and plot the distribution of final test accuracy. As shown in Figure 4, the distribution is much tighter than the corresponding distribution of initial explore in Figure 3(a) at higher LR (we compare 20-epoch fine-scale explore runs with the 40-epoch initial LR explore runs since both use 40% of their total 50 and 100 epoch explore budgets). Specifically, test accuracy of the initial 40-epoch explore runs range from to while for the 20-epoch finer-explore runs range from to . This indicates that the explore at higher LR is more important than finer-scale explores.
3 Explore-Exploit Learning Rate Schedule
Given that we need to explore at multiple scales for good generalization, how do we go about designing a good learning rate schedule? The search space of the varying learning rate steps and their respective explore duration is enormous.
Fortunately, since the explore at the initial scale searches over the entire loss surface, while explore at finer scales is confined to the wide-minima region identified by the initial explore, the former is more crucial. In our experiments, we found that the initial portion of the training is much more sensitive to exploration and needs a substantial number of explore steps, while after this initial phase, several decay schemes worked equally well. This is similar to the observations of Golatkar et al. (2019), who found that regularization such as weight decay and data augmentation mattered significantly only during the initial phase of training.
The above observations motivate our Explore-Exploit learning rate schedule, where the explore phase first optimizes at a high learning rate for some minimum time in order to land in the vicinity of a wide minimum. We should give the explore phase enough time (a hyper-parameter) so that the probability of landing in a wide minimum is high. After the explore phase, we know with high probability that the optimizer is in the vicinity of a wide region. We now start the exploit phase to descend to the bottom of this wide region while progressively decreasing the learning rate. Any smoothly decaying learning rate schedule can be thought of as doing micro explore-exploit at progressively reduced scales. A steady descent would allow more explore duration at all scales, while a fast descent would explore less at higher learning rates. We experimented with multiple schedules for the exploit phase, and found a simple linear decay to zero, which does not require any hyper-parameter, to be effective in all the models and datasets we tried. We call our proposed learning rate schedule, which starts at a constant high learning rate for some minimum time (a hyper-parameter) and then decays linearly to zero, the Knee schedule.
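A minimal sketch of the schedule described above follows: a constant learning rate during the explore phase, then a linear decay that reaches zero at the end of the budget. The function name, the per-step granularity, and the exact handling of the boundary step are assumptions made for illustration, not details taken from the experiments.

```python
def knee_schedule_lr(step, total_steps, explore_steps, peak_lr):
    """Learning rate at a 0-indexed `step` under a Knee-style schedule.

    Explore phase: `peak_lr` for the first `explore_steps` steps.
    Exploit phase: linear decay from `peak_lr`, reaching zero at `total_steps`.
    """
    if not 0 <= explore_steps <= total_steps:
        raise ValueError("explore_steps must lie in [0, total_steps]")
    if step < explore_steps:
        return peak_lr
    decay_steps = total_steps - explore_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(decay_steps, 1)


# Example: 200 epochs with 100 explore epochs and a peak (seed) LR of 0.1.
lrs = [knee_schedule_lr(e, total_steps=200, explore_steps=100, peak_lr=0.1)
       for e in range(200)]
print(lrs[0], lrs[99], lrs[150], lrs[199])
```

The only tunable quantity beyond the peak learning rate and the total budget is the explore duration, which matches the single hyper-parameter mentioned in the text.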
Since any learning rate decay scheme incorporates an implicit explore at different learning rates, we compare Knee schedule against several decay schemes, such as linear and cosine, that do not have an explicit explore phase. Interestingly, the results depend on the length of training. For long budget experiments, simple decay schemes (e.g. linear/cosine) perform reasonably well and are comparable to Knee schedule, since the implicit explore duration is also large and helps these schemes achieve good generalization. However, for short budget experiments, these schemes perform significantly worse than Knee schedule, since the implicit explore duration is much shorter. See Tables 8 and 9 for the comparison.
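To make the notion of implicit explore more tangible, the sketch below counts how many epochs a linear or cosine decay spends near its peak learning rate as the budget shrinks. The threshold of 80% of the peak is an arbitrary proxy chosen here for illustration, and the function names are hypothetical; the paper does not define implicit explore this way.

```python
import math


def linear_decay_lr(epoch, total_epochs, peak_lr):
    """Linear decay from peak_lr at epoch 0 towards zero at total_epochs."""
    return peak_lr * (1.0 - epoch / total_epochs)


def cosine_decay_lr(epoch, total_epochs, peak_lr):
    """Half-cosine decay from peak_lr at epoch 0 towards zero at total_epochs."""
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


def implicit_explore_epochs(schedule, total_epochs, peak_lr=0.1, fraction=0.8):
    """Number of epochs whose LR is at least `fraction` of the peak
    (an illustrative proxy for the implicit explore duration)."""
    return sum(
        1 for epoch in range(total_epochs)
        if schedule(epoch, total_epochs, peak_lr) >= fraction * peak_lr
    )


for budget in (200, 100):
    print(budget,
          implicit_explore_epochs(linear_decay_lr, budget),
          implicit_explore_epochs(cosine_decay_lr, budget))
```

Because both decay schemes scale with the budget, halving the budget roughly halves the time spent near the peak, whereas a schedule with an explicit explore phase can keep that duration fixed.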
Warmup. Some optimizers such as Adam use an initial warmup phase to slowly increase the learning rate. However, as shown by Liu et al. (2019), learning rate warmup is needed mainly to reduce variance during the initial training stages and can be eliminated with an optimizer such as RAdam. Learning rate warmup is also used for large-batch training (Goyal et al., 2017). Here, warmup is necessary since the learning rate is scaled to a very large value to compensate for the large batch size. This warmup is complementary and can be incorporated into Knee schedule.
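One way such a warmup could be combined with the schedule is sketched below. This is an assumption about how the pieces might fit together, not the configuration used in the experiments: a linear warmup to the peak learning rate, then the constant explore phase, then a linear decay to zero. The function name and the choice to count warmup steps separately from explore steps are hypothetical.

```python
def knee_with_warmup_lr(step, total_steps, warmup_steps, explore_steps, peak_lr):
    """Learning rate at a 0-indexed `step`: linear warmup, constant explore,
    then linear decay reaching zero at `total_steps`."""
    if warmup_steps + explore_steps > total_steps:
        raise ValueError("warmup and explore must fit in the total budget")
    if step < warmup_steps:
        # Ramp up so that the last warmup step reaches the peak LR.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + explore_steps:
        return peak_lr
    decay_steps = total_steps - warmup_steps - explore_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(decay_steps, 1)


# Example with assumed values: 5 warmup, 30 explore, 90 total epochs.
print([round(knee_with_warmup_lr(e, 90, 5, 30, 0.1), 4) for e in (0, 4, 20, 35, 89)])
```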
We perform two sets of evaluations. First, we validate our hypothesis further, and then we evaluate the effectiveness of Knee schedule on multiple models and datasets.
4.1 Hypothesis Validation
To validate our hypothesis on the density of wide minima vs narrow minima, we ran multiple experiments, most of which were discussed in Section 2. We summarize them here, and run another experiment on the IWSLT German to English dataset (Cettolo et al., 2014) trained on Transformer networks (Vaswani et al., 2017) to demonstrate that our hypothesis holds even on a completely different dataset and network architecture.
In Figures 2 and 3, we showed for Cifar-10 on Resnet-18 that, as the number of explore steps increases, the distributions of minima width and test accuracy sharpen and shift towards wider minima and better accuracy, respectively. This behaviour is predictable from our hypothesis, as increasing the explore steps increases the probability of landing in a wide region. By the same argument, the average accuracy should increase as the number of explore steps increases, which is confirmed in Table 1. Our hypothesis also predicts that even at low explore epochs, although the probability of landing in a wide region is low, it is nonzero. Thus, out of many trials with a low number of explore epochs, a few runs should still yield high test accuracy. This is what we observe in Figure 3, where 2 out of 50 trials obtain an accuracy more than even though the average accuracy is .
We now run a similar experiment on the IWSLT dataset trained on the Transformer network, where we train with the Knee schedule for a total budget of 50 epochs but vary the number of explore epochs. As shown in Table 3, the test BLEU score increases as we increase the number of explore epochs. Note that once the explore duration is high enough, the probability of landing in a wide minimum will be very high according to our hypothesis. Thus, more explore should not help much after a point, which is what we observe here: the test BLEU score stagnates and does not improve much between 25 and 30 explore epochs. Further, one of the low explore epoch runs had a high BLEU score of 35.20, suggesting that the run got lucky. Thus, these results on the IWSLT dataset add more evidence to the wide-minima density hypothesis.
4.2 Knee schedule Evaluation
We extensively evaluate our method on various networks and datasets, spanning multiple optimizers including SGD Momentum, Adam (Kingma and Ba, 2014a) and RAdam (Liu et al., 2019). For all experiments, we used an out of the box policy, where we only change the learning rate schedule, and don’t modify anything else. We evaluate on multiple image datasets – Imagenet on Resnet-50, Cifar-10 on Resnet-18; as well as NLP datasets – Squad v1.1 for BERT finetuning and IWSLT with transformer networks.
For all settings, we compare Knee schedule against the original hand-tuned learning rate baseline for the corresponding model and dataset, showing an improvement in test accuracy in all cases. We also show that Knee schedule can achieve the same accuracy as the baseline with a much reduced training budget (e.g. 44% less for ImageNet). Further, we also run our experiments with other common learning rate schedules such as linear decay, cosine decay, and one-cycle (Smith, 2018). See Table 8 and Table 9 for a comparison of all learning rate schedules on the default and shorter budgets, respectively.
4.2.1 ImageNet image classification on Resnet-50
|LR Schedule||Training Loss||Test Top-1 Acc.||Test Top-5 Acc.|
|Baseline (90 epochs)||0.74 (0.001)||75.87 (0.035)||92.90 (0.015)|
|Knee (90 epochs)||0.79 (0.001)||76.63 (0.105)||93.32 (0.034)|
|Knee (50 epochs)||0.93 (0.0009)||75.77 (0.202)||92.89 (0.049)|
In this experiment, we train the Resnet-50 network (He et al., 2016) on the ImageNet dataset (Russakovsky et al., 2015) with the SGD Momentum optimizer (we used the open-source implementation at https://github.com/cybertronai/imagenet18_old). For baseline runs, we used the standard hand-tuned step learning rate schedule of 0.1, 0.01, 0.001 for 30 epochs each. With Knee schedule, we trained with the original budget of 90 epochs, as well as a reduced budget of 50 epochs. We used 30 explore epochs for both runs, and the same baseline learning rate of 0.1. Table 4 shows the training loss and test accuracies for the various runs. As shown, Knee schedule comfortably beats the test accuracy of the baseline in the full budget run (with absolute gains of 0.8% and 0.4% in top-1 and top-5 accuracy, respectively), while meeting the baseline accuracy even with a much shorter budget. The fact that the baseline schedule takes almost 80% more training time than Knee schedule for the same test accuracy shows the effectiveness of our Explore-Exploit scheme. See Figure 6 in the supplementary material for detailed comparisons of training loss, test accuracy, and learning rate.
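The budget figures quoted for this experiment follow from simple arithmetic on the epoch counts (90 for the baseline, 50 for the reduced Knee run). The helper below is a hypothetical illustration of that calculation, measuring training budget in epochs only.

```python
def budget_comparison(baseline_epochs, reduced_epochs):
    """Return (fraction of budget saved, fraction of extra budget the baseline needs)."""
    saved = (baseline_epochs - reduced_epochs) / baseline_epochs
    extra = (baseline_epochs - reduced_epochs) / reduced_epochs
    return saved, extra


saved, extra = budget_comparison(baseline_epochs=90, reduced_epochs=50)
print(f"{saved:.1%} fewer epochs; baseline needs {extra:.0%} more")
# prints "44.4% fewer epochs; baseline needs 80% more"
```

The same calculation applies to any pair of full and reduced budgets, assuming the cost per epoch is identical for both schedules.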
4.2.2 Cifar-10 image classification on Resnet-18
|LR Schedule||Training Loss||Test Accuracy|
|Baseline (200 epochs)||0.002 (0.0006)||94.81 (0.1)|
|Knee (200 epochs)||0.002 (0.0001)||94.94 (0.09)|
|Knee (150 epochs)||0.004 (0.0001)||94.8 (0.03)|
In this experiment, we train the Resnet-18 network (He et al., 2016) on the Cifar-10 dataset (Krizhevsky et al., 2009) with the SGD Momentum optimizer (we used the implementation at https://github.com/kuangliu/pytorch-cifar). For the baseline, we used the hand-tuned step learning rate schedule of 0.1, 0.01, 0.001 for 100, 50, 50 epochs, respectively. With Knee schedule, we train the network with the original budget of 200 epochs, as well as a reduced budget of 150 epochs. We used 100 explore epochs for both runs, and the same baseline seed learning rate of 0.1. Table 5 shows the training loss and test accuracy for the various runs. As shown, Knee schedule beats the test accuracy of the baseline in the full budget run. Also note that, even in the shorter budget run, Knee schedule matches the test accuracy of the baseline schedule, which takes 33% more training time. See Figure 7 in the supplementary material for detailed comparisons of training loss, test accuracy, and learning rate.
4.2.3 SQuAD fine-tuning on BERT
We now evaluate Knee schedule on a few NLP tasks. In the first task, we fine-tune the BERT-base model (Devlin et al., 2018) on SQuAD v1.1 (Rajpurkar et al., 2016) with the Adam (Kingma and Ba, 2014b) optimizer (we used the implementation at https://github.com/huggingface/pytorch-transformers). BERT fine-tuning is prone to overfitting because of the huge model size compared to the small fine-tuning dataset, and is typically run for only a few epochs. For the baseline, we use the linear decay schedule mentioned in Devlin et al. (2018). We use a seed learning rate of and train for 2 epochs. For Knee schedule, we train the network with 1 explore epoch with the same seed learning rate of . Table 6 shows our results over 3 runs. We achieve a mean EM score of 81.4, compared to the baseline’s 80.9, a 0.5% absolute improvement. We don’t do a short budget run for this example, as the full budget is just 2 epochs. See Figure 8 in the supplementary material for detailed comparisons.
|LR Schedule||Train Loss (av)||EM (av)||F1 (av)|
|Baseline||1.0003 (0.004)||80.89 (0.15)||88.38 (0.032)|
|Knee schedule||1.003 (0.002)||81.38 (0.02)||88.66 (0.045)|
4.2.4 Machine Translation on Transformer Network with IWSLT
In the second NLP task, we train the Transformer network (Vaswani et al., 2017) on the IWSLT German-to-English (De-En) dataset (Cettolo et al., 2014) with the RAdam (Liu et al., 2019) optimizer (we used the implementation at https://github.com/pytorch/fairseq). For the baseline, we train for 50 epochs using the linear decay learning rate schedule, as mentioned in Liu et al. (2019). With Knee schedule, we trained with the original budget of 50 epochs, as well as a reduced budget of 35 epochs. We used 30 explore epochs for both runs, and use a seed learning rate of for both Knee schedule and the baseline. In all cases, we use the model checkpoint with the least loss on the validation set for computing BLEU scores on the test set. Table 7 shows the training loss and test accuracy averaged over 3 runs. As shown, Knee schedule comfortably beats the test BLEU score of the baseline in the full budget run. In the shorter budget run, Knee schedule matches the test accuracy of the baseline schedule, which takes almost 43% more training time. See Figure 9 in the supplementary material for detailed comparisons of training/validation perplexity, learning rate, etc.
|Baseline (50 epochs)||3.36 (0.001)||4.92 (0.035)||34.97 (0.035)|
|Knee (50 epochs)||2.99 (0.047)||4.87 (0.04)||35.25 (0.093)|
|Knee (35 epochs)||3.58 (0.063)||4.90 (0.049)||35.08 (0.12)|
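Table 8: Test accuracy of all learning rate schedules when trained with the original budget (columns: Experiment, Knee schedule, Training, Baseline, One-Cycle, Cosine Decay, Linear Decay).
Table 9: Test accuracy of all learning rate schedules when trained with a reduced budget (columns: Experiment, Knee schedule, Shortened Training, One-Cycle, Cosine Decay, Linear Decay).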
4.3 Comparison with other learning schedules
We also ran all our experiments with multiple other learning rate schedules – one-cycle (Smith, 2018), cosine decay (Loshchilov and Hutter, 2016) and linear decay. See section B in supplementary material for details of these learning rate schedules, and a detailed performance comparison. For one-cycle, the maximum learning rate was chosen to be 1/10th of the learning rate found via the LR range test of (Smith, 2018, 2017). Table 8 shows the test accuracies of our experiments on all the learning rate schedules, when trained with the original budget; while Table 9 shows the results when trained with a reduced budget. As shown, for the full budget runs, Knee schedule improves on the test accuracies on all experiments. Moreover, Knee schedule is able to achieve the same test accuracies as the baseline’s full budget runs with a much lower training budget. Also note that in some experiments of the full budget runs, the accuracy achieved by cosine decay and linear decay schedules is similar to Knee schedule; however, this is not observed in the reduced budget runs. This observation is consistent with our hypothesis as the cosine and linear decay schedules have an implicit explore portion in the initial part of the run (where the learning rate stays high enough). The duration of this “implicit” explore reduces as we reduce the training budget, hurting test accuracy. The Knee schedule on the other hand prioritizes the essential explore phase even in the reduced budget runs, thus, achieving higher test accuracies.
5 Related Work
Generalization. There has been a lot of work recently on understanding the generalization characteristics of DNNs. Kawaguchi (2016) found that, under certain simplifying assumptions, DNNs have many local minima, but all local minima are also global minima. It has been observed by several authors that wide minima generalize better than narrow minima (Arora et al., 2018; Hochreiter and Schmidhuber, 1997; Keskar et al., 2016; Jastrzebski et al., 2017; Wang et al., 2018), but there has been other work questioning this hypothesis as well (Dinh et al., 2017; Golatkar et al., 2019; Guiroy et al., 2019; Jastrzebski et al., 2019; Yoshida and Miyato, 2017).
Keskar et al. (2016) found that small batch SGD generalizes better and lands in wider minima than large batch SGD. However, recent work has been able to generalize quite well even with very large batch sizes (Goyal et al., 2017; McCandlish et al., 2018; Shallue et al., 2018) by scaling the learning rate linearly as a function of the batch size. Jastrzebski et al. (2019) analyze how batch size and learning rate influence the curvature of not only the SGD endpoint but also the whole trajectory. They found that small batch or large step SGD have similar characteristics, and yield a smaller and earlier peak of the spectral norm as well as a smaller largest eigenvalue. Dinh et al. (2017) show analytically, using model reparameterization, that wide minima can be converted to sharp minima without hurting generalization. Wang et al. (2018) show analytically that the generalization of a model is related to the Hessian, and propose a new metric for the generalization capability of a model that is unaffected by the model reparameterization of Dinh et al. (2017). Yoshida and Miyato (2017) argue that regularizing the spectral norm of the weights of a neural network helps it generalize better. On the other hand, Arora et al. (2018) derive generalization bounds by showing that networks with low stable rank (high spectral norm) generalize better. Guiroy et al. (2019) look at generalization in gradient-based meta-learning and show experimentally that generalization and wide minima are not always correlated. Finally, Golatkar et al. (2019) show that regularization results in higher test accuracy specifically when it is applied during the initial phase of training, similar to the importance of Knee schedule’s explore phase during the initial phase of training.
Lower density of wide minima. Wu et al. (2018) compare the sharpness of minima obtained by batch gradient descent (GD) with different learning rates for small neural networks on the FashionMNIST and Cifar10 datasets. They find that GD with a given learning rate finds the theoretically sharpest feasible minima for that learning rate. Thus, even in the presence of several flatter minima, GD with lower learning rates does not find them, leading to the conjecture that the density of sharper minima is perhaps larger than the density of wider minima.
6 Conclusions and Future work
In this paper, we make an observation that an initial explore phase with a high learning rate is essential for good generalization of DNNs. Further, we find that a minimum explore duration is required even if the training loss stops improving much earlier. We explain this observation via our hypothesis that in the DNN loss landscape, the density of wide minima is significantly lower than that of narrow minima. An explore at high learning rate allows the optimizer to jump over narrow regions and finally get stuck in a wide region. Since the density of wide regions is much lower than that of narrow regions, we need to explore for a long enough duration to land in a wide region with high probability. Motivated by this hypothesis, we present an Explore-Exploit based learning rate schedule, called the Knee schedule. We do extensive evaluation of Knee schedule on multiple models and datasets, including both image and NLP tasks. In all experiments, the Knee schedule outperforms prior hand-tuned baselines on test accuracies when trained with the original training budget, and achieves the same test accuracy as the baseline when trained with a much shorter budget.
Although we have done multiple experiments to validate this hypothesis, an exciting area of further study will be to explore this aspect theoretically. We are also interested in developing techniques to automatically ascertain that the optimizer has landed in a wide region, and then switch to the exploit part of the Knee schedule. This would save precious training cycles, unlike now, where we have to choose a long explore duration to ensure landing in a wide region with high probability. Another area of future work would be to automate the exploit part. Although simple linear decay works well, we believe that this can be improved further via an automated technique, especially given that our wide-minima density hypothesis seems to hold at multiple scales.
- Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296. Cited by: §5, §5.
- Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482. Cited by: §1.
- Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, pp. 57. Cited by: §4.1, §4.2.4.
- Entropy-SGD: biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment 2019 (12), pp. 124018. Cited by: §2.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §4.2.3.
- Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1019–1028. Cited by: §5, §5.
- Time matters in regularizing deep networks: weight decay and data augmentation affect early learning dynamics, matter little near convergence. arXiv preprint arXiv:1905.13277. Cited by: §3, §5, §5.
- Deep learning. Vol. 1, MIT Press. Cited by: §1.
- Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §1, §1, §3, §5.
- Towards understanding generalization in gradient-based meta-learning. arXiv preprint arXiv:1907.07287. Cited by: §5, §5.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §4.2.1, §4.2.2.
- Flat minima. Neural Computation 9 (1), pp. 1–42. Cited by: §1, §2, §5.
- Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623. Cited by: §1, §2, §5.
- On the relation between the sharpest directions of DNN loss and the SGD step length. In International Conference on Learning Representations. Cited by: §5, §5.
- Deep learning without poor local minima. In Advances in neural information processing systems, pp. 586–594. Cited by: §1, §5.
- On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836. Cited by: §1, §1, §2, §2, §5, §5.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2, §4.2.3.
- Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.2.2.
- On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265. Cited by: §3, §4.2.4, §4.2.
- SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §4.3.
- An empirical model of large-batch training. arXiv preprint arXiv:1812.06162. Cited by: §1, §5.
- SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §4.2.3.
- ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §4.2.1.
- Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600. Cited by: §1, §5.
- Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. Cited by: Appendix B, §4.3.
- A disciplined approach to neural network hyper-parameters: part 1–learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820. Cited by: Appendix B, §4.2, §4.3.
- An overview of statistical learning theory. IEEE Transactions on Neural Networks 10 (5), pp. 988–999. Cited by: §1.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.1, §4.2.4.
- Identifying generalization properties in neural networks. arXiv preprint arXiv:1809.07402. Cited by: §1, §2, §5, §5.
- How SGD selects the global minima in over-parameterized learning: a dynamical stability perspective. In Advances in Neural Information Processing Systems, pp. 8279–8288. Cited by: §1, §2, §2, §5.
- A walk with SGD. arXiv preprint arXiv:1802.08770. Cited by: §1.
- Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941. Cited by: §5, §5.
Appendix A Learning Rate Sensitivity
We performed a sensitivity analysis of the starting learning rate, referred to as the seed learning rate, for the Knee schedule. We trained Resnet-18 on the Cifar-10 dataset with the Knee schedule for a shortened budget of 150 epochs, starting at different seed LRs. For each experiment, we do a simple linear search to find the best explore duration. The test accuracies and optimal explore durations for the different seed LR choices are shown in Table 10. As shown, the seed learning rate can impact the final accuracy, but the Knee schedule is not highly sensitive to it. In fact, by tuning the number of explore epochs, we can achieve the target accuracy of 94.8 with seed learning rates of 0.085, 0.075, 0.0625 and 0.05, in addition to the original seed learning rate of 0.1.
Another interesting observation is that the optimal explore duration varies inversely with the seed LR. Since a bigger learning rate has a higher probability of escaping narrow minima than a lower learning rate, it would, on average, require fewer steps to land in a wide minimum. Thus, larger learning rates can explore faster, and spend more time in the exploit phase to go deeper into the wide minimum. This observation is consistent with our hypothesis and further corroborates it.
We also note that by tuning both seed LR and explore duration, we can achieve the twin objectives of achieving a higher accuracy, as well as a shorter training time – e.g. here we are able to achieve an accuracy of 94.9 in 150 epochs (seed LR 0.075), compared to 94.81 achieved by the baseline schedule in 200 epochs.
Appendix B Comparisons with More Baseline Learning Rate Schedules
In this section we compare Knee schedule against more learning rate schedules – one-cycle, linear decay and cosine decay.
One-Cycle: The one-cycle learning rate schedule was proposed in (Smith, 2018) (also see (Smith, 2017)). This schedule first chooses a maximum learning rate based on an LR range test. The LR range test starts from a small learning rate and keeps increasing the learning rate until the loss starts exploding (see figure 5). (Smith, 2018) suggests that the maximum learning rate should be chosen to be a bit before the minimum of the loss, in a region where the loss is still decreasing. There is some subjectivity in making this choice, although some blogs and libraries (see e.g. https://towardsdatascience.com/finding-good-learning-rate-and-the-one-cycle-policy-7159fe1db5d6 and https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html; also see https://docs.fast.ai/callbacks.lr_finder.html and https://docs.fast.ai/callbacks.one_cycle.html) suggest using a learning rate one order of magnitude lower than the one at the minimum. We go with this choice for all our runs.
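The selection rule described above (take the learning rate at which the range-test loss is lowest, then back off by one order of magnitude) can be written as a small helper. This is an illustrative sketch only: the function name `pick_max_lr`, the `backoff` parameter and the sample numbers are hypothetical, and the range test itself (training while increasing the learning rate) is assumed to have been run already.

```python
from typing import Sequence


def pick_max_lr(lrs: Sequence[float], losses: Sequence[float], backoff: float = 10.0) -> float:
    """Return the LR at the minimum range-test loss, divided by `backoff`.

    `lrs[i]` is the learning rate used at range-test step i and `losses[i]`
    is the training loss observed at that step.
    """
    if len(lrs) != len(losses) or len(lrs) == 0:
        raise ValueError("lrs and losses must be non-empty and of equal length")
    # Index of the lowest loss seen during the range test.
    best = min(range(len(losses)), key=lambda i: losses[i])
    # Back off by one order of magnitude (backoff=10) from that learning rate.
    return lrs[best] / backoff


# Made-up range-test trace: the loss bottoms out at lr=0.09, then explodes.
example_lrs = [0.001, 0.01, 0.09, 0.5]
example_losses = [2.3, 1.9, 1.2, 8.0]
max_lr = pick_max_lr(example_lrs, example_losses)
```

In practice the loss curve is noisy, so a smoothed loss is often used before taking the minimum; that step is omitted here for brevity.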
Once the maximum learning rate is chosen, the one-cycle schedule proceeds as follows. The learning rate starts at a specified fraction of the maximum learning rate (see div_factor in https://docs.fast.ai/callbacks.one_cycle.html; we chose the fraction to be 0.1 in our experiments) and is increased linearly to the maximum learning rate over the first 45 percent of the training budget, and then decreased linearly over the next 45 percent. For the final 10 percent, the learning rate is reduced by a large factor (we chose a factor of 10). We used an open-source implementation (https://github.com/nachiket273/One_Cycle_Policy) for our experiments.
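A rough sketch of the one-cycle shape described above follows. It is not the open-source implementation used in the experiments. Two details are assumptions made here because the text leaves them open: the linear decrease is taken to end back at the starting learning rate, and the final 10 percent is taken to anneal linearly from the starting learning rate down to the starting learning rate divided by the chosen factor. The names `one_cycle_lr`, `start_fraction` and `final_factor` are hypothetical.

```python
def one_cycle_lr(step: int, total_steps: int, max_lr: float,
                 start_fraction: float = 0.1, final_factor: float = 10.0) -> float:
    """Learning rate at `step` in [0, total_steps] for a one-cycle style schedule."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    start_lr = max_lr * start_fraction
    ramp = 0.45 * total_steps  # length of the ramp-up and of the ramp-down
    if step < ramp:
        # First 45%: linear increase from start_lr to max_lr.
        return start_lr + (max_lr - start_lr) * step / ramp
    if step < 2 * ramp:
        # Next 45%: linear decrease from max_lr back to start_lr (assumption).
        return max_lr - (max_lr - start_lr) * (step - ramp) / ramp
    # Final 10%: linear anneal from start_lr to start_lr / final_factor (assumption).
    tail = total_steps - 2 * ramp
    progress = min((step - 2 * ramp) / tail, 1.0)
    end_lr = start_lr / final_factor
    return start_lr - (start_lr - end_lr) * progress


# Example: 100 steps with a maximum learning rate of 1.0.
cycle = [one_cycle_lr(s, 100, 1.0) for s in range(101)]
```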
Linear Decay: The linear decay learning rate schedule simply decays the learning rate linearly to zero starting from a seed LR.
Cosine Decay: The cosine decay learning rate schedule decays the learning rate to zero following a cosine curve, starting from a seed LR.
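Both decay baselines have simple closed forms, shown below as a minimal sketch. The function and parameter names are illustrative, and the cosine variant uses the common half-cosine form, which is assumed here to match what the text calls "following a cosine curve".

```python
import math


def linear_decay_lr(step: int, total_steps: int, seed_lr: float) -> float:
    """Linear decay from seed_lr at step 0 to zero at step == total_steps."""
    return seed_lr * (1.0 - step / total_steps)


def cosine_decay_lr(step: int, total_steps: int, seed_lr: float) -> float:
    """Half-cosine decay from seed_lr at step 0 to zero at step == total_steps."""
    return 0.5 * seed_lr * (1.0 + math.cos(math.pi * step / total_steps))


# Both schedules start at the seed LR and end at zero; they differ in between:
# at one quarter of the budget, cosine decay is still above linear decay.
quarter_linear = linear_decay_lr(50, 200, 0.1)
quarter_cosine = cosine_decay_lr(50, 200, 0.1)
```

Compared with linear decay, the cosine curve spends more of the early budget near the seed learning rate and more of the late budget near zero.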
Figure 4(a) shows the LR range test for Cifar-10 with the Resnet-18 network. The minimum occurs around a learning rate of 0.09, and, following the rule described above, we choose a value one order of magnitude lower as the maximum learning rate. For the linear and cosine decay schedules, we start with a seed learning rate of 0.1, as used in the standard baselines. The training loss and test accuracy for the various schedules are shown in Table 11 for the full budget runs (200 epochs), and in Table 12 for the short budget runs (150 epochs).
Table 11: Cifar-10 on Resnet-18, full budget runs (200 epochs).

| LR Schedule | Train Loss | Test Accuracy |
|---|---|---|
| One-Cycle | 0.0041 (6e-5) | 94.08 (0.07) |
| Cosine Decay | 0.0020 (7e-5) | 94.76 (0.21) |
| Linear Decay | 0.0015 (4e-5) | 94.88 (0.12) |
| Knee schedule | 0.0023 (1e-4) | 94.94 (0.09) |

Table 12: Cifar-10 on Resnet-18, short budget runs (150 epochs).

| LR Schedule | Train Loss | Test Accuracy |
|---|---|---|
| One-Cycle | 0.0052 (7e-5) | 93.84 (0.082) |
| Cosine Decay | 0.0022 (4e-5) | 94.66 (0.054) |
| Linear Decay | 0.0016 (7e-5) | 94.58 (0.073) |
| Knee schedule | 0.0044 (1e-4) | 94.80 (0.035) |
B.2 IWSLT’14 De-En
Figure 4(b) shows the LR range test for IWSLT on the transformer networks. The minimum occurs near 2.5e-3. For the maximum learning rate, we choose 2.5e-4 for the default one-cycle policy. For the linear and cosine decay schedules, we start with a seed learning rate of 3e-4, as used in the standard baselines. The training perplexity, validation perplexity and BLEU scores for the various schedules are shown in Table 13 for the full budget runs (50 epochs), and in Table 14 for the short budget runs (35 epochs).
Table 13: IWSLT’14 De-En on the transformer network, full budget runs (50 epochs).

| LR Schedule | Train ppl | Validation ppl | Test BLEU Score |
|---|---|---|---|
| One-Cycle | 3.68 (0.009) | 4.97 (0.010) | 34.77 (0.064) |
| Cosine Decay | 3.08 (0.004) | 4.88 (0.014) | 35.21 (0.063) |
| Linear Decay | 3.36 (0.001) | 4.92 (0.035) | 34.97 (0.035) |
| Knee schedule | 2.99 (0.047) | 4.87 (0.040) | 35.25 (0.093) |

Table 14: IWSLT’14 De-En on the transformer network, short budget runs (35 epochs).

| LR Schedule | Train ppl | Validation ppl | Test BLEU Score |
|---|---|---|---|
| One-Cycle | 3.98 (0.028) | 5.09 (0.017) | 34.43 (0.26) |
| Cosine Decay | 3.86 (0.131) | 5.06 (0.106) | 34.46 (0.33) |
| Linear Decay | 4.11 (0.092) | 5.14 (0.066) | 34.16 (0.28) |
| Knee schedule | 3.58 (0.063) | 4.90 (0.049) | 35.08 (0.12) |
Figure 4(c) shows the LR range test for SQuAD fine-tuning on BERT. The minimum occurs at 1e-4. For the maximum learning rate, we choose 1e-5 for the default one-cycle policy. For the linear and cosine decay schedules, we start with a seed learning rate of 3e-5, as used in the standard baselines. Table 15 shows the average training loss and the average test EM and F1 scores for the various schedules. We did not do a short budget training for this dataset, as the full budget is just 2 epochs.
Table 15: SQuAD fine-tuning on BERT.

| LR Schedule | Train Loss (av) | EM (av) | F1 (av) |
|---|---|---|---|
| One-Cycle | 1.062 (0.003) | 79.9 (0.17) | 87.8 (0.091) |
| Cosine Decay | 0.999 (0.003) | 81.31 (0.07) | 88.61 (0.040) |
| Linear Decay | 1.0003 (0.004) | 80.89 (0.15) | 88.38 (0.042) |
| Knee schedule | 1.003 (0.002) | 81.38 (0.02) | 88.66 (0.045) |
Figure 4(a) shows the LR range test for ImageNet with the Resnet-50 network. The minimum occurs around a learning rate of 2.16, and, following the rule described above, we choose a value one order of magnitude lower as the maximum learning rate. For the linear and cosine decay schedules, we start with a seed learning rate of 0.1, as used in the standard baselines. The training loss and test accuracy for the various schedules are shown in Table 16 for the full budget runs (90 epochs), and in Table 17 for the short budget runs (50 epochs).
Table 16: ImageNet on Resnet-50, full budget runs (90 epochs).

| LR Schedule | Train Loss (av) | Test Top-1 | Test Top-5 |
|---|---|---|---|
| One-Cycle | 0.96 (0.003) | 75.39 (0.137) | 92.56 (0.040) |
| Cosine Decay | 0.80 (0.002) | 76.41 (0.212) | 93.28 (0.066) |
| Linear Decay | 0.75 (0.001) | 76.58 (0.155) | 93.21 (0.051) |
| Knee schedule | 0.79 (0.001) | 76.63 (0.105) | 93.32 (0.034) |

Table 17: ImageNet on Resnet-50, short budget runs (50 epochs).

| LR Schedule | Train Loss (av) | Test Top-1 | Test Top-5 |
|---|---|---|---|
| One-Cycle | 1.033 (0.004) | 75.36 (0.096) | 92.53 (0.079) |
| Cosine Decay | 0.96 (0.002) | 75.79 (0.116) | 92.81 (0.033) |
| Linear Decay | 0.91 (0.002) | 75.82 (0.080) | 92.84 (0.036) |
| Knee schedule | 0.93 (0.001) | 75.77 (0.202) | 92.89 (0.049) |