It is known that the learning rate is the most important hyper-parameter to tune for training deep neural networks. This paper describes a new method for setting the learning rate, named cyclical learning rates, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates. Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations. This paper also describes a simple way to estimate "reasonable bounds" -- linearly increasing the learning rate of the network for a few epochs. In addition, cyclical learning rates are demonstrated on the CIFAR-10 and CIFAR-100 datasets with ResNets, Stochastic Depth networks, and DenseNets, and the ImageNet dataset with the AlexNet and GoogLeNet architectures. These are practical tools for everyone who trains neural networks.
Using the CLR algorithm for training (https://arxiv.org/abs/1506.01186)
Deep neural networks are the basis of state-of-the-art results for image recognition [17, 23, 25], object detection [7, 26], speech recognition [8], machine translation [24], image caption generation [28], and driverless car technology [14]. However, training a deep neural network is a difficult global optimization problem.
A deep neural network is typically updated by stochastic gradient descent and the parameters $\theta$ (weights) are updated by $\theta^{t} = \theta^{t-1} - \epsilon_t \frac{\partial L}{\partial \theta}$, where $L$ is a loss function and $\epsilon_t$ is the learning rate. It is well known that too small a learning rate will make a training algorithm converge slowly while too large a learning rate will make the training algorithm diverge [2]. Hence, one must experiment with a variety of learning rates and schedules.
Conventional wisdom dictates that the learning rate should be a single value that monotonically decreases during training. This paper demonstrates the surprising phenomenon that a varying learning rate during training is beneficial overall and thus proposes to let the global learning rate vary cyclically within a band of values instead of setting it to a fixed value. In addition, this cyclical learning rate (CLR) method practically eliminates the need to tune the learning rate yet achieves near optimal classification accuracy. Furthermore, unlike adaptive learning rates, the CLR methods require essentially no additional computation.
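The sensitivity to the learning rate can be illustrated with a minimal sketch. The toy loss L(theta) = theta^2, the function name `sgd_on_quadratic`, and the three learning rates are assumptions chosen purely for illustration; none of them come from the experiments in this paper.

```python
def sgd_on_quadratic(learning_rate, steps=50, theta0=1.0):
    """Run plain gradient descent on the toy loss L(theta) = theta**2."""
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * theta                    # dL/dtheta for L = theta**2
        theta = theta - learning_rate * grad  # the gradient descent update
    return theta


slow = sgd_on_quadratic(0.001)     # too small: theta barely moves from 1.0
good = sgd_on_quadratic(0.1)       # moderate: theta ends close to the minimum at 0
diverged = sgd_on_quadratic(1.1)   # too large: |theta| grows at every step
```

On this toy problem the small rate leaves the parameter near its starting point, the moderate rate reaches the minimum, and the large rate diverges, which is the trade-off that motivates searching over learning rates.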
The potential benefits of CLR can be seen in Figure 1, which shows the test data classification accuracy of the CIFAR-10 dataset during training (hyper-parameters and architecture were obtained in April 2015 from caffe.berkeleyvision.org/gathered/examples/cifar10.html). The baseline (blue curve) reaches a final accuracy of 81.4% after 70,000 iterations. In contrast, it is possible to fully train the network using the CLR method instead of tuning (red curve) within 25,000 iterations and attain the same accuracy.
The contributions of this paper are:
1. A methodology for setting the global learning rates for training neural networks that eliminates the need to perform numerous experiments to find the best values and schedule with essentially no additional computation.
2. A surprising phenomenon is demonstrated: allowing the learning rate to rise and fall is beneficial overall even though it might temporarily harm the network’s performance.
The book “Neural Networks: Tricks of the Trade” is a terrific source of practical advice. In particular, Yoshua Bengio [2] discusses reasonable ranges for learning rates and stresses the importance of tuning the learning rate. A technical report by Breuel [3] provides guidance on a variety of hyper-parameters. There are also numerous websites giving practical suggestions for setting the learning rates.
Adaptive learning rates: Adaptive learning rates can be considered a competitor to cyclical learning rates because one can rely on local adaptive learning rates in place of global learning rate experimentation, but there is a significant computational cost in doing so. CLR does not incur this computational cost, so it can be used freely.
A review of the early work on adaptive learning rates can be found in George and Powell [6]. Duchi et al. [5] proposed AdaGrad, which is one of the early adaptive methods that estimates the learning rates from the gradients.
RMSProp is discussed in the slides by Geoffrey Hinton [27] (www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf). RMSProp is described there as “Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.” RMSProp is a fundamental adaptive learning rate method that others have built on.
Schaul et al. [22] discuss an adaptive learning rate based on a diagonal estimation of the Hessian of the gradients. One of the features of their method is that they allow their automatic method to decrease or increase the learning rate. However, their paper seems to limit the idea of increasing the learning rate to non-stationary problems. On the other hand, this paper demonstrates that a schedule of increasing the learning rate is more universally valuable.
Zeiler [29] describes his AdaDelta method, which improves on AdaGrad based on two ideas: limiting the sum of squared gradients over all time to a limited window, and making the parameter update rule consistent with a units evaluation on the relationship between the update and the Hessian.
More recently, several papers have appeared on adaptive learning rates. Gulcehre and Bengio [9] propose an adaptive learning rate algorithm, called AdaSecant, that utilizes the root mean square statistics and variance of the gradients. Dauphin et al. [4] show that RMSProp provides a biased estimate and go on to describe another estimator, named ESGD, that is unbiased. Kingma and Ba [16] introduce Adam, which is designed to combine the advantages of AdaGrad and RMSProp. Bache et al. [1] propose exploiting solutions to a multi-armed bandit problem for learning rate selection. A summary and tutorial of adaptive learning rates can be found in a recent paper by Ruder [20].
Adaptive learning rates are fundamentally different from CLR policies, and CLR can be combined with adaptive learning rates, as shown in Section 4.1. In addition, CLR policies are computationally simpler than adaptive learning rates. CLR is likely most similar to the SGDR method [18] that appeared recently.
The essence of this learning rate policy comes from the observation that increasing the learning rate might have a short term negative effect and yet achieve a longer term beneficial effect. This observation leads to the idea of letting the learning rate vary within a range of values rather than adopting a stepwise fixed or exponentially decreasing value. That is, one sets minimum and maximum boundaries and the learning rate cyclically varies between these bounds. Experiments with numerous functional forms, such as a triangular window (linear), a Welch window (parabolic) and a Hann window (sinusoidal), all produced equivalent results. This led to adopting a triangular window (linearly increasing then linearly decreasing), which is illustrated in Figure 2, because it is the simplest function that incorporates this idea. The rest of this paper refers to this as the triangular learning rate policy.
An intuitive understanding of why CLR methods work comes from considering the loss function topology. Dauphin et al. [4] argue that the difficulty in minimizing the loss arises from saddle points rather than poor local minima. Saddle points have small gradients that slow the learning process. However, increasing the learning rate allows more rapid traversal of saddle point plateaus. A more practical reason as to why CLR works is that, by following the methods in Section 3.3, it is likely the optimum learning rate will be between the bounds and near optimal learning rates will be used throughout training.
The red curve in Figure 1 shows the result of the triangular policy on CIFAR-10. The settings used to create the red curve were a minimum learning rate of 0.001 (as in the original parameter file) and the maximum learning rate listed in Table 2. Also, the cycle length (i.e., the number of iterations until the learning rate returns to the initial value) is set to 4,000 iterations (i.e., stepsize = 2,000), and Figure 1 shows that the accuracy peaks at the end of each cycle.
Implementation of the code for a new learning rate policy is straightforward; in the experiments shown in Section 4.1.2, only a few lines were added to Torch 7. Those lines take the specified lower (i.e., base) learning rate and the number of epochs of training, and return the computed learning rate. This policy is named triangular and is as described above, with two new input parameters defined: stepsize (half the period or cycle length) and max_lr (the maximum learning rate boundary). The code varies the learning rate linearly between the minimum (base_lr) and the maximum (max_lr).
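The triangular schedule described above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the Torch 7 listing from the experiments: the function name `triangular_lr` is hypothetical, the parameter names follow the base_lr, max_lr, and stepsize columns of Table 2, and the counter is an iteration count (an epoch count works the same way if stepsize is given in epochs).

```python
import math


def triangular_lr(iteration, base_lr, max_lr, stepsize):
    """Learning rate that rises linearly from base_lr to max_lr over
    `stepsize` iterations, then falls back linearly over the next `stepsize`."""
    # Index of the current cycle (1-based); one cycle is 2 * stepsize long.
    cycle = math.floor(1 + iteration / (2 * stepsize))
    # Distance from the peak of the current cycle: 1 at the ends, 0 at the peak.
    x = abs(iteration / stepsize - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Example with the first row of Table 2 (base_lr=0.001, max_lr=0.005, stepsize=2000).
lr_start = triangular_lr(0, 0.001, 0.005, 2000)      # at the minimum
lr_mid = triangular_lr(1000, 0.001, 0.005, 2000)     # halfway up the ramp
lr_peak = triangular_lr(2000, 0.001, 0.005, 2000)    # at the maximum
lr_end = triangular_lr(4000, 0.001, 0.005, 2000)     # back at the minimum
```

The schedule is stateless: the rate depends only on the current counter, so it can be dropped into any training loop that exposes an iteration or epoch number.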
In addition to the triangular policy, the following CLR policies are discussed in this paper:
triangular2; the same as the triangular policy except the learning rate difference is cut in half at the end of each cycle. This means the learning rate difference drops after each cycle.
exp_range; the learning rate varies between the minimum and maximum boundaries and each boundary value declines by an exponential factor of $\gamma^{\text{iteration}}$.
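The two variants can be sketched as small modifications of the triangular schedule. The function names `triangular2_lr` and `exp_range_lr`, the helper `_cycle_position`, and the parameter `gamma` are labels chosen for this sketch. The exponential variant below scales the whole rate, so that both boundaries decline as the text describes; scaling only the difference between the boundaries is another possible reading, so treat this as one interpretation rather than the reference implementation.

```python
import math


def _cycle_position(iteration, stepsize):
    """Return (cycle index starting at 1, fraction in [0, 1] of the way to the peak)."""
    cycle = math.floor(1 + iteration / (2 * stepsize))
    x = abs(iteration / stepsize - 2 * cycle + 1)
    return cycle, max(0.0, 1.0 - x)


def triangular2_lr(iteration, base_lr, max_lr, stepsize):
    """Triangular schedule whose amplitude is halved after every cycle."""
    cycle, frac = _cycle_position(iteration, stepsize)
    return base_lr + (max_lr - base_lr) * frac / (2 ** (cycle - 1))


def exp_range_lr(iteration, base_lr, max_lr, stepsize, gamma):
    """Triangular schedule whose boundaries both decay by gamma**iteration."""
    _, frac = _cycle_position(iteration, stepsize)
    return (base_lr + (max_lr - base_lr) * frac) * gamma ** iteration


# Illustrative values only: peaks of the first and second cycle.
first_peak = triangular2_lr(10, 0.001, 0.005, 10)
second_peak = triangular2_lr(30, 0.001, 0.005, 10)
decayed_peak = exp_range_lr(10, 0.001, 0.005, 10, 0.9)
```

With these illustrative values the second peak sits halfway between the bounds, because the difference between the minimum and the maximum has been halved once.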
The length of a cycle and the input parameter stepsize can be easily computed from the number of iterations in an epoch. The number of iterations in an epoch is calculated by dividing the number of training images by the batchsize used. For example, CIFAR-10 has 50,000 training images and the batchsize is 100, so an epoch = 50,000/100 = 500 iterations. The final accuracy results are actually quite robust to cycle length, but experiments show that it often is good to set stepsize equal to 2 to 10 times the number of iterations in an epoch. For example, setting stepsize = 8 * epoch with the CIFAR-10 training run (as shown in Figure 1) only gives slightly better results than setting stepsize = 2 * epoch.
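The arithmetic behind the step size can be written out as a small helper. The function names are hypothetical, and the CIFAR-10 figures used in the example (50,000 training images, a batch size of 100) and the 2x to 10x multipliers are assumptions of this sketch.

```python
import math


def iterations_per_epoch(num_training_images, batch_size):
    """Number of mini-batch updates needed to see every training image once."""
    return math.ceil(num_training_images / batch_size)


def stepsize_range(num_training_images, batch_size, low_mult=2, high_mult=10):
    """Smallest and largest step size when stepsize is a multiple of one epoch."""
    epoch = iterations_per_epoch(num_training_images, batch_size)
    return low_mult * epoch, high_mult * epoch


cifar_epoch = iterations_per_epoch(50_000, 100)
cifar_stepsizes = stepsize_range(50_000, 100)
```

Because a full cycle is two step sizes long, the shortest setting in this example corresponds to a cycle of four epochs.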
Furthermore, there is a certain elegance to the rhythm of these cycles and it simplifies the decision of when to drop learning rates and when to stop the current training run. Experiments show that replacing each step of a constant learning rate with at least 3 cycles trains the network weights most of the way and running for 4 or more cycles will achieve even better performance. Also, it is best to stop training at the end of a cycle, which is when the learning rate is at the minimum value and the accuracy peaks.
There is a simple way to estimate reasonable minimum and maximum boundary values with one training run of the network for a few epochs. It is a “LR range test”; run your model for several epochs while letting the learning rate increase linearly between low and high LR values. This test is enormously valuable whenever you are facing a new architecture or dataset.
The triangular learning rate policy provides a simple mechanism to do this. For example, in Caffe, set base_lr to the minimum value and set max_lr to the maximum value. Set both the stepsize and max_iter to the same number of iterations. In this case, the learning rate will increase linearly from the minimum value to the maximum value during this short run. Next, plot the accuracy versus learning rate. Note the learning rate value when the accuracy starts to increase and when the accuracy slows, becomes ragged, or starts to fall. These two learning rates are good choices for bounds; that is, set base_lr to the first value and set max_lr to the latter value. Alternatively, one can use the rule of thumb that the optimum learning rate is usually within a factor of two of the largest one that converges [2] and set base_lr to 1/3 or 1/4 of max_lr.
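The LR range test can be sketched as a linear ramp plus a rough rule for reading bounds off the recorded curve. Everything here is an assumption for illustration: the function names, the `min_gain` threshold, and the synthetic accuracy values are made up, and the bound-picking rule is a crude stand-in for inspecting the plot by eye.

```python
def range_test_lr(iteration, min_lr, max_lr, num_iterations):
    """Learning rate that increases linearly from min_lr to max_lr over the run."""
    frac = min(max(iteration / num_iterations, 0.0), 1.0)
    return min_lr + (max_lr - min_lr) * frac


def suggest_bounds(lrs, accuracies, min_gain=0.01):
    """Pick base_lr where accuracy first rises by min_gain above its starting
    value, and max_lr where accuracy is highest (just before it starts to fall)."""
    start_acc = accuracies[0]
    base_lr = next(
        (lr for lr, acc in zip(lrs, accuracies) if acc - start_acc >= min_gain),
        lrs[0],
    )
    best_index = max(range(len(accuracies)), key=accuracies.__getitem__)
    return base_lr, lrs[best_index]


# Synthetic (made-up) range-test record: accuracy rises, peaks, then drops.
lrs = [0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008]
accs = [0.10, 0.10, 0.20, 0.35, 0.45, 0.50, 0.48, 0.30]
bounds = suggest_bounds(lrs, accs)
```

In practice the accuracy curve is noisy, so smoothing it or checking the plot visually before accepting the suggested bounds is advisable.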
Figure 3 shows an example of making this type of run with the CIFAR-10 dataset, using the architecture and hyper-parameters provided by Caffe. One can see from Figure 3 that the model starts converging right away, so it is reasonable to set base_lr = 0.001. Furthermore, above a certain learning rate the accuracy rise gets rough and eventually begins to drop, so it is reasonable to set max_lr to that learning rate.
Whenever one is starting with a new architecture or dataset, a single LR range test provides both a good LR value and a good range. Then one should compare runs with a fixed LR versus CLR with this range. Whichever wins can be used with confidence for the rest of one’s experiments.
Dataset | LR policy | Iterations | Accuracy (%) |
---|---|---|---|
CIFAR-10 | fixed | 70,000 | 81.4 |
CIFAR-10 | triangular2 | 25,000 | 81.4 |
CIFAR-10 | decay | 25,000 | 78.5 |
CIFAR-10 | exp | 70,000 | 79.1 |
CIFAR-10 | exp_range | 42,000 | |
AlexNet | fixed | 400,000 | 58.0 |
AlexNet | triangular2 | 400,000 | |
AlexNet | exp | 300,000 | 56.0 |
AlexNet | exp | 460,000 | 56.5 |
AlexNet | exp_range | 300,000 | 56.5 |
GoogLeNet | fixed | 420,000 | 63.0 |
GoogLeNet | triangular2 | 420,000 | |
GoogLeNet | exp | 240,000 | 58.2 |
GoogLeNet | exp_range | 240,000 | 60.2 |
The purpose of this section is to demonstrate the effectiveness of the CLR methods on some standard datasets and with a range of architectures. In the subsections below, CLR policies are used for training with the CIFAR-10, CIFAR-100, and ImageNet datasets. These three datasets and a variety of architectures demonstrate the versatility of CLR.
The CIFAR-10 architecture and hyper-parameter settings on the Caffe website are fairly standard and were used here as a baseline. As discussed in Section 3.2, an epoch is equal to 500 iterations and a good setting for stepsize is 2,000. Section 3.3 discussed how to estimate reasonable minimum and maximum boundary values for the learning rate from Figure 3. All that is needed to optimally train the network is to set base_lr and max_lr to these boundary values. For the triangular2 policy run shown in Figure 1, the stepsize and learning rate bounds are shown in Table 2.
base_lr | max_lr | stepsize | start | max_iter |
---|---|---|---|---|
0.001 | 0.005 | 2,000 | 0 | 16,000 |
0.0001 | 0.0005 | 1,000 | 16,000 | 22,000 |
0.00001 | 0.00005 | 500 | 22,000 | 25,000 |
Figure 1 shows the result of running with the triangular2 policy with the parameter settings in Table 2. As shown in Table 1, one obtains the same test classification accuracy of 81.4% after only 25,000 iterations with the triangular2 policy as obtained by running the standard hyper-parameter settings for 70,000 iterations.
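The staged settings in Table 2 can be turned into a single schedule function. The sketch below assumes that each row of Table 2 is active from its `start` iteration up to (but not including) its `max_iter`, and that the triangular cycle restarts at the beginning of each stage; both are assumptions of this sketch, as are the names `STAGES` and `staged_triangular_lr`.

```python
import math

# Rows copied from Table 2: (base_lr, max_lr, stepsize, start, max_iter).
STAGES = [
    (0.001, 0.005, 2000, 0, 16000),
    (0.0001, 0.0005, 1000, 16000, 22000),
    (0.00001, 0.00005, 500, 22000, 25000),
]


def staged_triangular_lr(iteration, stages=STAGES):
    """Triangular learning rate using the bounds of whichever stage is active."""
    for base_lr, max_lr, stepsize, start, max_iter in stages:
        if start <= iteration < max_iter:
            local = iteration - start  # assumption: cycles restart at each stage
            cycle = math.floor(1 + local / (2 * stepsize))
            x = abs(local / stepsize - 2 * cycle + 1)
            return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
    raise ValueError(f"iteration {iteration} is outside the schedule")


# Number of full cycles in each stage: (max_iter - start) / (2 * stepsize).
cycles_per_stage = [(m - s) // (2 * step) for _, _, step, s, m in STAGES]
```

Under these assumptions the stages contain 4, 3, and 3 full cycles, and every stage boundary falls at the end of a cycle, where the learning rate is at its minimum.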
One might speculate that the benefits from the triangular policy derive from reducing the learning rate because this is when the accuracy climbs the most. As a test, a decay policy was implemented where the learning rate starts at the max_lr value and then is linearly reduced to the base_lr value for stepsize number of iterations. After that, the learning rate is fixed to base_lr. The decay run used fixed values of max_lr, base_lr, and stepsize. Table 1 shows that the final accuracy is only 78.5%, providing evidence that both increasing and decreasing the learning rate are essential for the benefits of the CLR method.
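The decay-only control described above is simple to state in code. The function name `decay_lr` and the numeric values in the example are hypothetical and chosen only to exercise the function; they are not the settings used in the experiment.

```python
def decay_lr(iteration, base_lr, max_lr, stepsize):
    """Start at max_lr, decrease linearly to base_lr over `stepsize`
    iterations, then stay fixed at base_lr."""
    if iteration >= stepsize:
        return base_lr
    return max_lr - (max_lr - base_lr) * iteration / stepsize


# Illustrative values only.
decay_start = decay_lr(0, 0.001, 0.005, 4000)
decay_half = decay_lr(2000, 0.001, 0.005, 4000)
decay_after = decay_lr(10000, 0.001, 0.005, 4000)
```

Unlike the cyclical policies, this schedule never increases the learning rate, which is what makes it a useful control for isolating the effect of the rising phase.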
Figure 4 compares the exp learning rate policy in Caffe with the new exp_range policy, using the same value of gamma for both policies. The result is that when using the exp_range policy one can stop training at iteration 42,000 (going to iteration 70,000 does not improve on this result), with a test accuracy that is substantially better than the best test accuracy of 79.1% one obtains from using the exp learning rate policy.
The current Caffe download contains additional architectures and hyper-parameters for CIFAR-10 and in particular there is one with sigmoid non-linearities and batch normalization. Figure 6 compares the training accuracy using the downloaded hyper-parameters with a fixed learning rate (blue curve) to using a cyclical learning rate (red curve). As can be seen in this Figure, the final accuracy for the fixed learning rate (60.8%) is substantially lower than the cyclical learning rate final accuracy (72.2%). There is clear performance improvement when using CLR with this architecture containing sigmoids and batch normalization.
LR type/bounds | LR policy | Iterations | Accuracy (%) |
---|---|---|---|
Nesterov [19] | fixed | 70,000 | 82.1 |
0.001 - 0.006 | triangular | 25,000 | 81.3 |
ADAM [16] | fixed | 70,000 | 81.4 |
0.0005 - 0.002 | triangular | 25,000 | 79.8 |
 | triangular | 70,000 | 81.1 |
RMSprop [27] | fixed | 70,000 | 75.2 |
0.0001 - 0.0003 | triangular | 25,000 | 72.8 |
 | triangular | 70,000 | 75.1 |
AdaGrad [5] | fixed | 70,000 | 74.6 |
0.003 - 0.035 | triangular | 25,000 | 76.0 |
AdaDelta [29] | fixed | 70,000 | 67.3 |
0.01 - 0.1 | triangular | 25,000 | 67.3 |
Experiments were carried out with architectures featuring both adaptive learning rate methods and CLR. Table 3 lists the final accuracy values from various adaptive learning rate methods, run with and without CLR. All of the adaptive methods in Table 3 were run by invoking the respective option in Caffe. The learning rate boundaries, which were determined by using the technique described in Section 3.3, are given in Table 3 (just below the method’s name). Just the lower bound was used for base_lr for the fixed policy.
Table 3 shows that for some adaptive learning rate methods combined with CLR, the final accuracy after only 25,000 iterations is equivalent to the accuracy obtained without CLR after 70,000 iterations. For others, it was necessary (even with CLR) to run until 70,000 iterations to obtain similar results. Figure 5 shows the curves from running the Nesterov method with CLR (reached 81.3% accuracy in only 25,000 iterations) and the Adam method both with and without CLR (both needed 70,000 iterations). When using adaptive learning rate methods, the benefits from CLR are sometimes reduced, but CLR can still be valuable as it sometimes provides benefit at essentially no cost.
Residual networks [10, 11], and the family of variations that have subsequently emerged, achieve state-of-the-art results on a variety of tasks. Here we provide comparison experiments between the original implementations and versions with CLR for three members of this residual network family: the original ResNet [10], Stochastic Depth networks [13], and the recent DenseNets [12]. Our experiments can be readily replicated because the authors of these papers make their Torch code available (https://github.com/facebook/fb.resnet.torch, https://github.com/yueatsprograms/Stochastic_Depth, https://github.com/liuzhuang13/DenseNet). Since all three implementations are available using the Torch 7 framework, the experiments in this section were performed using Torch. As in the experiment in the previous section, these networks also incorporate batch normalization [15] and demonstrate the value of CLR for architectures with batch normalization.
Both CIFAR-10 and the CIFAR-100 datasets were used in these experiments. The CIFAR-100 dataset is similar to the CIFAR-10 data but it has 100 classes instead of 10 and each class has 600 labeled examples.
Architecture | CIFAR-10 (LR) | CIFAR-100 (LR) |
---|---|---|
ResNet | ||
ResNet | ||
ResNet | ||
ResNet+CLR | ||
SD | ||
SD | ||
SD | ||
SD+CLR | ||
DenseNet | ||
DenseNet | ||
DenseNet | ||
DenseNet+CLR |
The results for these two datasets on these three architectures are summarized in Table 4. The left column gives the architecture and whether CLR was used in the experiments. The other two columns give the average final accuracy from five runs and the initial learning rate or range used in parentheses, which are reduced (for both the fixed learning rate and the range) during the training according to the same schedule used in the original implementation. For all three architectures, the original implementation uses an initial LR of 0.1, which we use as a baseline.
The accuracy results in the right two columns of Table 4 are the average final test accuracies of five runs. The Stochastic Depth implementation was slightly different from the ResNet and DenseNet implementations in that the authors split the 50,000 training images into 45,000 training images and 5,000 validation images. However, the reported results in Table 4 for the SD architecture are only test accuracies for the five runs. The learning rate range used by CLR was determined by the LR range test method and the cycle length was chosen as a tenth of the maximum number of epochs that was specified in the original implementation.
In addition to the accuracy results shown in Table 4, similar results were obtained in Caffe for DenseNets [12] on CIFAR-10 using the prototxt files provided by the authors. The average accuracy of five runs with learning rates of was respectively, but running with CLR within the range of 0.1 to 0.3, the average accuracy was .
The results from all of these experiments show similar or better accuracy performance when using CLR versus using a fixed learning rate, even though the performance drops at some of the learning rate values within this range. These experiments confirm that it is beneficial to use CLR for a variety of residual architectures and for both CIFAR-10 and CIFAR-100.
The ImageNet dataset [21] is often used in deep learning literature as a standard for comparison. The ImageNet classification challenge provides about 1,000 training images for each of the 1,000 classes, giving a total of 1,281,167 labeled training images.
The Caffe website provides the architecture and hyper-parameter files for a slightly modified AlexNet [17]. These were downloaded from the website and used as a baseline. In the training results reported in this section, all weights were initialized the same so as to avoid differences due to different random initializations.
Since the batchsize in the architecture file is 256, an epoch is equal to 1,281,167/256 = 5,005 iterations. Hence, a reasonable setting for stepsize is 6 epochs or 30,000 iterations.
Next, one can estimate reasonable minimum and maximum boundaries for the learning rate from Figure 7. It can be seen from this figure that the training doesn’t start converging until the learning rate passes a threshold, so setting base_lr to that threshold is reasonable. However, for a fair comparison to the baseline, it is necessary to set base_lr to the baseline learning rate for the triangular and triangular2 policies, or else the majority of the apparent improvement in the accuracy will be from the smaller learning rate. As for the maximum boundary value, the training peaks and then drops above a certain learning rate, so setting max_lr to that learning rate is reasonable. For comparing the exp_range policy to the exp policy, base_lr and max_lr are set so that one expects the average accuracy of the exp_range policy to be equal to the accuracy from the exp policy.
Figure 9 compares the results of running with the fixed versus the triangular2 policy for the AlexNet architecture. Here, the peaks at iterations that are multiples of 60,000 should produce a classification accuracy that corresponds to the fixed policy. Indeed, the accuracy peaks at the end of a cycle for the triangular2 policy are similar to the accuracies from the standard fixed policy, which implies that the baseline learning rates are set quite well (this is also implied by Figure 7). As shown in Table 1, the final accuracies from the CLR training run are only slightly better than the accuracies from the fixed policy.
Figure 10 compares the results of running with the exp versus the exp_range policy for the AlexNet architecture, with the same gamma for both policies. As expected, Figure 10 shows that the accuracies from the exp_range policy do oscillate around the exp policy accuracies. The advantage of the exp_range policy is that the accuracy of 56.5% is already obtained at iteration 300,000 whereas the exp policy takes until iteration 460,000 to reach 56.5%.
The GoogLeNet architecture was a winning entry to the ImageNet 2014 image classification competition. Szegedy et al. [25] describe the architecture in detail but did not provide the architecture file. The architecture file publicly available from Princeton (vision.princeton.edu/pvt/GoogLeNet/) was used in the following experiments. The GoogLeNet paper does not state the learning rate values and the hyper-parameter solver file is not available for a baseline, but not having these hyper-parameters is a typical situation when one is developing a new architecture or applying a network to a new dataset. This is a situation that CLR readily handles. Instead of running numerous experiments to find optimal learning rates, the base_lr was set to a best guess value.
The first step is to estimate the stepsize setting. Since the architecture's batchsize determines the number of iterations in an epoch, candidate settings for stepsize can be derived from it. The results in this section are based on a single stepsize value.
The next step is to estimate the bounds for the learning rate, which are found with the LR range test by making a run for 4 epochs where the learning rate linearly increases from a low to a high value (Figure 11). This figure shows the range of bounds over which the model still reaches convergence. However, learning rates above a certain value cause the training to converge erratically. For both the triangular2 and the exp_range policies, base_lr and max_lr were set from this test. As above, the accuracy peaks for both these learning rate policies correspond to the same learning rate value as the fixed and exp policies. Hence, the comparisons below will focus on the peak accuracies from the CLR methods.
Figure 12 compares the results of running with the fixed versus the triangular2 policy for this architecture (due to time limitations, each training stage was not run until it fully plateaued). In this case, the peaks at the end of each cycle for the triangular2 policy produce better accuracies than the fixed policy. The final accuracy of the network trained by the triangular2 policy (Table 1) is better than the accuracy from the fixed policy. This demonstrates that the triangular2 policy improves on a “best guess” for a fixed learning rate.
The results presented in this paper demonstrate the benefits of the cyclic learning rate (CLR) methods. A short run of only a few epochs where the learning rate linearly increases is sufficient to estimate boundary learning rates for the CLR policies. Then a policy where the learning rate cyclically varies between these bounds is sufficient to obtain near optimal classification results, often with fewer iterations. This policy is easy to implement and unlike adaptive learning rate methods, incurs essentially no additional computational expense.
This paper shows that use of cyclic functions as a learning rate policy provides substantial improvements in performance for a range of architectures. In addition, the cyclic nature of these methods provides guidance as to when to drop the learning rate values (after 3 to 5 cycles) and when to stop the training. All of these factors reduce the guesswork in setting the learning rates and make these methods practical tools for everyone who trains neural networks.
This work has not explored the full range of applications for cyclic learning rate methods. We plan to determine if equivalent policies work for training different architectures, such as recurrent neural networks. Furthermore, we believe that a theoretical analysis would provide an improved understanding of these methods, which might lead to improvements in the algorithms.
The Journal of Machine Learning Research, 12:2121–2159, 2011.
Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 2012.
Modify SGDSolver<Dtype>::GetLearningRate() which is in sgd_solver.cpp (near line 38):
Modify message SolverParameter which is in caffe.proto (near line 100):
Please see https://github.com/bckenstler/CLR.