Code for paper "L4: Practical loss-based stepsize adaptation for deep learning"
We propose a stepsize adaptation scheme for stochastic gradient descent. It operates directly with the loss function and rescales the gradient in order to make fixed predicted progress on the loss. We demonstrate its capabilities by strongly improving the performance of Adam and Momentum optimizers. The enhanced optimizers with default hyperparameters consistently outperform their constant stepsize counterparts, even the best ones, without a measurable increase in computational cost. The performance is validated on multiple architectures including ResNets and the Differentiable Neural Computer. A prototype implementation as a TensorFlow optimizer is released.
Stochastic gradient methods are the driving force behind the recent boom of deep learning. As a result, the demand for practical efficiency as well as for theoretical understanding has never been stronger. Naturally, this has inspired a lot of research and has given rise to new and currently very popular optimization methods such as Adam, AdaGrad, or RMSProp, which serve as competitive alternatives to classical stochastic gradient descent (SGD).
However, the current situation still causes huge overhead in implementations. In order to extract the best performance, one is expected to choose the right optimizer, finely tune its hyperparameters (sometimes multiple), often also to handcraft a specific stepsize adaptation scheme, and finally combine this with a suitable regularization strategy. All of this is mostly based on intuition and experience.
If we put aside the regularization aspects, the holy grail for resolving the optimization issues would be a widely applicable automatic stepsize adaptation for stochastic gradients. This idea has been floating in the community for years and different strategies were proposed. One line of work casts the learning rate as another parameter one can train with gradient descent. Another approach is to make use of (an approximation of) second-order information. Also, an interesting Bayesian approach for probabilistic line search has been proposed. Finally, another related research branch is based on the “Learning to learn” paradigm (possibly using reinforcement learning).
Although some of the mentioned papers claim to “effectively remove the need for learning rate tuning”, this has not been observed in practice. Whether this is due to conservatism on the implementor’s side or due to a lack of solid experimental evidence, we leave aside. In any case, we also take up the challenge.
Our strategy is performance oriented. Admittedly, this also means that, while our stepsize adaptation scheme makes sense intuitively (and is related to sound methods), we do not provide or claim any theoretical guarantees. Instead, we focus on strong reproducible performance against optimized baselines across multiple different architectures, on a minimum need for tuning, and on releasing a prototype implementation that is easy to use in practice.
Our adaptation method is called Linearized Loss-based optimaL Learning-rate (L4) and it has two main features. First, it operates directly with the (currently observed) value of the loss. This eventually allows for almost independent stepsize computation of consecutive updates and consequently enables very rapid learning rate changes. Second, we separate the two roles a gradient vector typically has. It provides both a local linear approximation as well as an actual vector of the update step. We allow using a different gradient method for each of the two tasks.
The scheme itself is a meta-algorithm and can be combined with any stochastic gradient method. We report our results for the L4 adaptation of Adam and Momentum SGD.
The stochasticity poses a severe challenge for stepsize adaptation methods. Any changes in the learning rate based on one or a few noisy loss estimates are likely to be inaccurate. In a setting where any overestimation of the learning rate can be very punishing, this leaves little room for maneuver.
The approach we take is different. We do not maintain any running value of the stepsize. Instead, at every iteration, we compute it anew with the intention to make maximum possible progress on the (linearized) loss. This is inspired by the classical iterative Newton’s method for finding roots of one-dimensional functions. At every step, this method computes the linearization of the function at the current point and proceeds to the root of this linearization. We use analogous updates to locate the root (minimum) of the loss function.
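The classical Newton iteration that inspires the scheme can be sketched in a few lines of Python. This is only an illustration of the one-dimensional root-finding method mentioned above, not part of the released optimizer; the function name `newton_root` and the example function are chosen here for illustration.

```python
def newton_root(f, df, x0, steps=20, tol=1e-12):
    """Newton iteration: repeatedly jump to the root of the local linearization of f."""
    x = x0
    for _ in range(steps):
        fx = f(x)
        if abs(fx) < tol:
            break
        # The tangent line at x is f(x) + df(x) * (y - x); its root is y = x - f(x) / df(x).
        x = x - fx / df(x)
    return x


# Illustrative example: the positive root of f(x) = x^2 - 2 is sqrt(2).
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

The loss-based stepsize applies the same idea to the loss itself: the step is chosen so that the linearized loss reaches a target value, rather than so that a function reaches zero.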
The idea of using linear approximation for line search is, of course, not novel, as witnessed for example by the Armijo-Wolfe line search. Also, and more notably, our motivation is identical to that of Polyak’s update rule, where the loss-linearization (Eq. 2) is already proposed in a deterministic setting, as well as the idea of approximating the minimum loss.
Therefore, our scheme should be thought of as an adaptation of these classical methods for the practical needs of deep learning. Also, the ideological proximity to provably correct methods is reassuring.
In the following section, we describe how the stepsize is chosen for a gradient update proposed by an underlying optimizer (e. g. SGD, Adam, momentum SGD). We begin with a simplified core version.
Let $L(\theta)$ be the loss function (on the current batch) depending on the parameters $\theta$, and let $v$ be the update step provided by some standard optimizer, e. g. in the case of SGD this would be $v = \nabla_\theta L(\theta)$. Throughout the paper, the loss will be considered to be non-negative.
For now, let us assume the minimum attainable loss is $L_{\min}$ (see Section 2.4 for details). We consider the stepsize $\eta$ needed to reach $L_{\min}$ (under idealized assumptions) by satisfying
$$L(\theta - \eta v) = L_{\min}. \tag{1}$$
We linearize $L$ (around $\theta$) and then, after denoting $g = \nabla_\theta L(\theta)$, we solve for $\eta$:
$$L(\theta) - \eta\, g^\top v = L_{\min} \quad\Longrightarrow\quad \eta = \frac{L(\theta) - L_{\min}}{g^\top v}. \tag{2}$$
First of all, note the clear separation between $g$, the estimator of the gradient of $L$, and $v$, the proposed update step. Moreover, it is easily seen that the final update step $\eta v$ is independent of the magnitude of $v$. In other words, the adaptation method only takes into account the “direction” of the proposed update. This decomposition into the gradient estimate and the update direction is the core principle behind the method and is also vital for its performance.
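A minimal sketch of the linearized stepsize, assuming it is the ratio of the loss gap to the inner product of the gradient estimate and the proposed step (one reading of Eq. 2). The names `linearized_stepsize`, `full_step`, `loss`, `l_min`, `g`, and `v` are illustrative and not taken from the released code. The example also checks numerically that rescaling the proposed step leaves the final update unchanged.

```python
def dot(a, b):
    """Inner product of two equally long lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def linearized_stepsize(loss, l_min, g, v):
    """Stepsize eta that solves the linearized equation loss - eta * <g, v> = l_min."""
    return (loss - l_min) / dot(g, v)


def full_step(loss, l_min, g, v):
    """Parameter change -eta * v for the stepsize computed above."""
    eta = linearized_stepsize(loss, l_min, g, v)
    return [-eta * vi for vi in v]


g = [1.0, -2.0]   # gradient estimate (illustrative numbers)
v = [0.5, -1.0]   # proposed update direction (illustrative numbers)

step_a = full_step(3.0, 0.0, g, v)
# Scaling v by 10 scales <g, v> by 10 as well, so the final step is unchanged.
step_b = full_step(3.0, 0.0, g, [10.0 * vi for vi in v])
```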
The update rule is illustrated in Fig. 2 for a quadratic (or other convex) loss. Here, we see (deceptively) that the proposed stepsize is, in fact, still conservative. However, in the multidimensional case, the minimum will not necessarily lie on the line given by the gradient. That is why in real-world problems, this stepsize is far too aggressive and prone to divergence. In addition there are the following reasons to be more conservative: the problems in deep learning are (often strongly) non-convex, and actually minimizing the currently seen batch loss is very likely to not generalize to the whole dataset.
For these reasons, we introduce a hyperparameter $\alpha$ which captures the fixed fraction of the stepsize (2) we take at each step. Then the update rule becomes:
$$\Delta\theta = -\alpha\,\frac{L(\theta) - L_{\min}}{g^\top v}\, v. \tag{3}$$
Even though a few more hyperparameters will appear later as stability measures and regularizers, $\alpha$ is the main hyperparameter to consider. We observed in experiments that the relevant range of $\alpha$ is consistently the same small interval. In comparison, for SGD the range of stable learning rates varies over multiple orders of magnitude. We chose the slightly conservative value $\alpha = 0.15$ as a default setting. We report its performance on all the tasks in Section 3.
Here, we offer a partial explanation why the value of $\alpha$ stays in the same small relevant range even for very different problems and datasets. Interestingly, the new update equation (3) is invariant to affine loss transformations of the type:
$$L' = aL + b$$
with $a > 0$. Let us briefly verify this. The gradient of $L'$ will be $g' = a g$ and we will assume that the underlying optimizer will offer the same update direction $v$ in both cases (we have already established that its magnitude does not matter). Then we can simply write
$$\Delta\theta' = -\alpha\,\frac{L' - L'_{\min}}{g'^\top v}\, v = -\alpha\,\frac{(aL + b) - (aL_{\min} + b)}{a\, g^\top v}\, v = -\alpha\,\frac{L - L_{\min}}{g^\top v}\, v = \Delta\theta$$
and we see that the updates are the same in both cases. On top of being a good sanity check for any loss-based method, we additionally believe that it simplifies problem-to-problem adaptation (also in terms of hyperparameters).
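The invariance can also be checked numerically. The sketch below assumes the transformation has the form `a * loss + b` with `a > 0`, that the minimum loss transforms the same way, and that the update direction `v` is identical in both cases; the function name `l4_core_update` and all numbers are illustrative.

```python
def l4_core_update(loss, l_min, g, v, alpha=0.15):
    """Core update: -alpha * (loss - l_min) / <g, v> * v (no stability terms)."""
    gv = sum(gi * vi for gi, vi in zip(g, v))
    scale = alpha * (loss - l_min) / gv
    return [-scale * vi for vi in v]


loss, l_min = 2.0, 0.5
g = [0.3, -1.2, 0.8]
v = [0.1, -0.9, 0.4]
a, b = 7.0, 3.0   # affine transformation loss' = a * loss + b, with a > 0

original = l4_core_update(loss, l_min, g, v)
transformed = l4_core_update(
    a * loss + b,           # transformed loss
    a * l_min + b,          # transformed minimum loss
    [a * gi for gi in g],   # the gradient scales by a
    v,                      # same update direction
)
```

Both the numerator and the denominator pick up the factor `a`, and the offset `b` cancels in the difference, so `original` and `transformed` agree up to floating-point error.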
We still owe an explanation of how $L_{\min}$ is maintained during training. We base its value on the minimal loss seen so far. Naturally, some mini-batches will have a lower loss and will be used as a reference for the others. By itself, this comes with some disadvantages. In case of small variance across batches, this estimate would be very pessimistic. Also, the “new best” mini-batches would have zero stepsize.
Therefore, we introduce a factor $\gamma$ which captures the fraction of the lowest seen loss that is still believed to be achievable. Similarly, to correct for possibly strong effects of a few outlier batches, we let $L_{\min}$ slowly increase with a timescale $\tau$. This reactiveness of $L_{\min}$ slightly shifts its interpretation from “globally minimum loss” to “minimum currently achievable loss”. This reflects the fact that in practical settings, it is unrealistic to aim for the global minimum in each update. All in all, when a new value of the loss $L$ comes, we set
$$L_{\min} \leftarrow \min(L_{\min}, L), \tag{4}$$
then we use $\gamma L_{\min}$ for the gradient update and apply the “forgetting”
$$L_{\min} \leftarrow \left(1 + \tfrac{1}{\tau}\right) L_{\min}. \tag{5}$$
The value of $L_{\min}$ gets initialized by a fixed fraction of the first seen loss $L_0$, that is $L_{\min} \leftarrow \gamma_0 L_0$. We set $\gamma$, $\tau$, and $\gamma_0$ to fixed default values and use these values in all our experiments. Even though we cannot exclude that tuning these values could lead to enhanced performance, we have not observed such effects and we do not feel the necessity to modify these values.
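The bookkeeping for the minimum-loss estimate might look as follows. This is a sketch under stated assumptions: the class name `MinLossTracker`, the multiplicative form of the forgetting step, and the numeric defaults (`gamma0=0.75`, `gamma=0.9`, `tau=1000.0`) are placeholders chosen for illustration, not values taken from the text above.

```python
class MinLossTracker:
    """Tracks the lowest loss seen so far, with a slow upward drift ("forgetting")."""

    def __init__(self, first_loss, gamma0=0.75, gamma=0.9, tau=1000.0):
        # Initialize with a fixed fraction of the first observed loss.
        self.l_min = gamma0 * first_loss
        self.gamma = gamma
        self.tau = tau

    def target(self, loss):
        """Return the loss value to aim for on this step, then apply forgetting."""
        # Keep the minimum of the stored value and the newly observed loss.
        self.l_min = min(self.l_min, loss)
        # Only a fraction gamma of the lowest seen loss is assumed achievable.
        target = self.gamma * self.l_min
        # Let the estimate grow slowly so that outlier batches are forgotten.
        self.l_min *= 1.0 + 1.0 / self.tau
        return target


tracker = MinLossTracker(first_loss=2.0)
first = tracker.target(1.0)    # a new best batch still gets a nonzero loss gap
second = tracker.target(5.0)   # a worse batch: the estimate has drifted slightly upwards
```

Because the target is a fraction of the tracked minimum, a "new best" mini-batch still sees a positive gap between its loss and the target, which avoids the zero stepsize mentioned above.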
Another unresolved issue is the division by an inner product in Eq. (3). Our solution to potential numerical instabilities is two-fold. First, we require compatibility of $g$ and $v$ in the sense that the angle between the vectors does not exceed $90^\circ$. In other words, we insist on $g^\top v \ge 0$. For L4Adam and L4Mom this is the case, see Section 2.5, Eq. (7). Second, we add a tiny $\epsilon$ as a regularizer to the denominator. The final form of the update rule then is:
$$\Delta\theta = -\alpha\,\frac{L(\theta) - \gamma L_{\min}}{g^\top v + \epsilon}\, v \tag{6}$$
with a fixed, tiny default value of $\epsilon$.
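Putting the pieces together, one update step can be sketched on a toy convex problem. This is a hedged illustration, not the released TensorFlow optimizer: the function names, the toy loss, the use of the plain gradient for both the gradient estimate and the update direction, and the numeric values of `gamma`, `tau`, `eps`, and the initial fraction `0.75` are all assumptions made for the example.

```python
def loss_and_grad(theta):
    """Toy non-negative loss L(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2) and its gradient."""
    loss = 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)
    grad = [theta[0], 10.0 * theta[1]]
    return loss, grad


def l4_step(theta, loss, g, v, l_min, alpha=0.15, gamma=0.9, tau=1000.0, eps=1e-12):
    """One step: update the minimum-loss estimate, rescale v, apply forgetting."""
    l_min = min(l_min, loss)
    gv = sum(gi * vi for gi, vi in zip(g, v))
    # Fraction alpha of the stepsize that would bring the linearized loss to gamma * l_min.
    eta = alpha * (loss - gamma * l_min) / (gv + eps)
    new_theta = [t - eta * vi for t, vi in zip(theta, v)]
    l_min *= 1.0 + 1.0 / tau
    return new_theta, l_min, eta


theta = [1.0, 1.0]
loss, g = loss_and_grad(theta)
l_min = 0.75 * loss            # initial fraction of the first loss (assumed value)
losses = [loss]
for _ in range(200):
    # Here the plain gradient serves as both the estimate g and the direction v.
    theta, l_min, eta = l4_step(theta, loss, g, g, l_min)
    loss, g = loss_and_grad(theta)
    losses.append(loss)
```

On this toy problem the loss decreases at every step, and the effective learning rate `eta` is recomputed from scratch each time rather than adjusted incrementally.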
The algorithm is called Linearized Loss-based optimaL Learning-rate (L4) and it works on top of a provided gradient estimator (producing $g$) and an update direction algorithm (producing $v$), see Algorithm 1 for the pseudocode. For compactness of presentation, we introduce a notation for exponential moving averages as $\langle \cdot \rangle_\tau$ with timescale $\tau$ using bias correction just as in Adam (see Algorithm 2).
In this paper, we introduce two variants of L4 leading to two optimizers: (1) with momentum gradient descent, denoted by L4Mom, and (2) with Adam, denoted by L4Adam. We choose the update directions for L4Mom and L4Adam, respectively, as
$$v_{\mathrm{Mom}} = \langle \nabla L \rangle_{\tau_m}, \qquad v_{\mathrm{Adam}} = \frac{\langle \nabla L \rangle_{\tau_m}}{\sqrt{\langle (\nabla L)^2 \rangle_{\tau_s}}} \tag{7}$$
with $\tau_m$ and $\tau_s$ being the timescales for momentum and (in case of L4Adam) second moment averaging.
In both cases, the choice of $g = \langle \nabla L \rangle_{\tau_m}$ ensures $g^\top v \ge 0$, as mentioned in Section 2.4. Additional reasoning is that the averaged local gradient is in practice often a more accurate estimator of the gradient on the global loss.
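The moving averages and the two update directions can be sketched as follows. The sketch assumes an exponential moving average with decay `1 - 1/tau` and Adam-style bias correction, and it uses the averaged gradient as the gradient estimate in both variants. The class and function names, the timescales, and the small constant `eps` are illustrative assumptions, not the released implementation.

```python
import math


class BiasCorrectedEMA:
    """Exponential moving average with timescale tau and Adam-style bias correction."""

    def __init__(self, dim, tau):
        self.beta = 1.0 - 1.0 / tau
        self.raw = [0.0] * dim
        self.steps = 0

    def update(self, x):
        self.steps += 1
        self.raw = [self.beta * r + (1.0 - self.beta) * xi for r, xi in zip(self.raw, x)]
        # Divide by (1 - beta^t) to undo the bias caused by the zero initialization.
        correction = 1.0 - self.beta ** self.steps
        return [r / correction for r in self.raw]


def l4mom_direction(avg_grad):
    """Momentum variant: the direction is the averaged gradient itself."""
    return list(avg_grad)


def l4adam_direction(avg_grad, avg_sq_grad, eps=1e-8):
    """Adam variant: averaged gradient divided elementwise by the root of the averaged square."""
    return [m / (math.sqrt(s) + eps) for m, s in zip(avg_grad, avg_sq_grad)]


grad_avg = BiasCorrectedEMA(dim=2, tau=10.0)     # illustrative momentum timescale
sq_avg = BiasCorrectedEMA(dim=2, tau=1000.0)     # illustrative second-moment timescale
for grad in ([1.0, -2.0], [0.5, -1.5], [0.8, -2.5]):
    g = grad_avg.update(grad)
    s = sq_avg.update([x * x for x in grad])

v_mom = l4mom_direction(g)
v_adam = l4adam_direction(g, s)
inner_mom = sum(a * b for a, b in zip(g, v_mom))
inner_adam = sum(a * b for a, b in zip(g, v_adam))
```

Since the Adam direction rescales each coordinate of the averaged gradient by a positive number, both inner products are non-negative, which is the compatibility condition required for the denominator of the update rule.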
We evaluate the proposed method on five different setups, spanning different architectures, datasets, and loss functions. We compare to the de facto standard methods: stochastic gradient descent (SGD), momentum SGD (Mom), and Adam.
For each of the methods, the performance is evaluated for the best setting of the stepsize/learning rate parameter (found via a fine grid search with multiple restarts). All other parameters were kept fixed, both the momentum timescale for momentum SGD and the remaining hyperparameters of Adam. The (non-default) value of Adam's $\epsilon$ was selected in accordance with the TensorFlow documentation to decrease the instability of Adam.
In all experiments, the performance of the standard methods heavily depends on the stepsize parameter. However, in case of the proposed method, the default setting showed remarkable consistency. Across the experiments, it outperforms even the best constant learning rates for the respective gradient-based update rules. In addition, the performance of these default settings is also comparable with handcrafted optimization policies on more complicated architectures. We consider this to be the main strength of the L4 method.
We present results for L4Mom and L4Adam, see Tab. 1 for an overview. In all experiments we strictly followed the out-of-the-box policy. We simply cloned an official repository, changed the optimizer, and left everything else intact. Also, throughout the experiments we have observed neither any runtime increase nor additional memory requirements arising from the adaptation.
|Synthetic||2-Layer MLP||-||0.0005||0.001||0.15 [0.25]||0.15 [0.25]|
|MNIST||3-Layer MLP||64 [8,16,32]||0.05||0.001||0.15 [0.25]||0.15 [0.25]|
|DNC||Recurrent||16 [8, 32, 64]||1.2||0.01||0.15||0.15|
As a general nomenclature, a method is marked with a * if an optimized stepsize was used. Otherwise (in case of L4 optimizers), the default settings are in place.
At the end of this section, we append experiments with varying batch sizes as hinted in Tab. 1 by values in brackets.
Neither of the L4 optimizers slows down network training in practical settings. By inspection of Equations (6) and (7), we can see that the only additional computation (compared to Adam or momentum SGD) is calculating the inner product $g^\top v$. This introduces two additional operations per weight (multiplication and addition). In any realistic scenario, these have negligible runtimes when compared to matrix multiplications (convolutions), which are required both in the forward and backward pass.
The first task we investigate is a linear regression with a badly conditioned input/output relationship. It has recently been brought into the spotlight by Ali Rahimi in his NIPS 2017 talk as an example of a problem “resistant” to standard stochastic gradient optimization methods. For our experiments, we used the corresponding code by Ben Recht.
The network has two weight matrices $W_1$, $W_2$ and the loss function is given by
$$L(W_1, W_2) = \mathbb{E}_x \,\| W_1 W_2 x - A x \|^2$$
where $A$ is a badly conditioned matrix, i. e. $\kappa(A) = \sigma_{\max}(A) / \sigma_{\min}(A) \gg 1$, where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ are the largest and the smallest singular values of $A$, respectively. Note that this is in disguise a (realizable) matrix factorization problem: $A = W_1 W_2$. Also, it is not a stochastic optimization problem but a deterministic one.
Figure 2 shows the results for the default configuration of the problem dimensions and a large condition number of $A$. The statistics are given for 5 independent runs (with randomly generated matrices $A$) and a fixed dataset of 1000 samples. We can confirm that standard optimizers indeed have great difficulty reaching convergence. Only a fine grid search discovered settings behaving reasonably well (divergence or too early plateaus are very common). The proposed stepsize adaptation method apparently overcomes this issue (see Fig. 2).
Table 2: comparison on problem instances with 96 trainable weights and 192 trainable weights.
As an extension of the experiments on badly conditioned regression, we also include a comparison with the classical Levenberg-Marquardt algorithm (LMA), which can be viewed as a Gauss-Newton method with a trust region. In Tab. 2 the speed of the algorithms, both in terms of the number of iterations as well as wall-clock time, is reported. (The experiments were conducted on a machine with an i7-7800X CPU @ 3.50GHz with 8 cores.) The same comparison is also performed on an instance of twice the size (all dimensions doubled).
The results show that the gradients provided by LMA reach convergence in a much smaller number of steps. However, at the same time, LMA is significantly more computationally expensive since each step involves solving a least squares problem. This can be clearly seen from comparing performance on the problem sizes in Tab. 2.
The second task is a classical multilayer neural network trained for digit recognition using the MNIST dataset. We use the standard architecture with two hidden layers followed by an output layer for the 10 digit classes. The batch size in use is 64.
Figure 3: (a) learning curve for a 2-hidden-layer NN on MNIST; (b) effective learning rate (MNIST).
Figure 3 shows the learning curves and the effective learning rates. The effective learning rate is given by the scalar factor multiplying $v$ in (3). Note how late in training the effective learning rate of L4Adam becomes very small and eventually drops to zero. This is simply because by then the loss is 0 (within machine precision) on every batch and thus the numerator in (3) vanishes; a global optimum was found. The very high learning rates that precede can be attributed to a “plateau” character of the obtained minimum. The gradients are so small in magnitude that a very high stepsize is necessary to make any progress. This is, perhaps, unexpected since in optimization theory convergence is typically linked to a decrease in the learning rate, rather than an increase.
Generally, we see that the effective learning rate shows highly nontrivial behavior. We can observe sharp increases as well as sharp decreases. Also, even within a short time period it fully spans 2 or more orders of magnitude as highlighted by the shaded area, see Fig. 3(b). None of this causes instabilities in the training itself.
Even though the ability to generalize and compatibility with various regularization methods are not our main focus in this work, we still report in Tab. 3 the development of test accuracy during the training. We see that the test performance of all optimizers is comparable. This does not come as a surprise as the architecture in use has no regularization. Also, it can be seen that the L4 optimizers reach near-final accuracies faster, already after around 10 epochs.
Table 3: test accuracy in % during training.
A list of papers reporting improved performance over SGD on MNIST is long (examples include [20, 14, 1, 2, 15]). Unfortunately, there are no widely recognized benchmarks to use for comparison. There is a lot of variety in choosing the baseline optimizer (often only the default setting for SGD) and in the number of training steps reported (often fewer than one epoch). In this situation, it is difficult to make any substantiated claims. However, to our knowledge, previous work does not achieve such rapid convergence as can be seen in Fig. 3.
In the next two tasks, we target finely tuned publicly available implementations of well-known architectures and compare their performance to our default setting. We begin with the deep residual network architecture for CIFAR-10 taken from the official TensorFlow repository. Deep residual networks, or ResNets for short, provided the breakthrough idea of identity mappings in order to enable training of very deep convolutional neural networks. The provided architecture has 32 layers and uses batch normalization. The loss is given by cross-entropy with regularization.
The deployed optimization policy is momentum gradient with manually crafted piece-wise constant stepsize adaptation. We simply replace it with default settings of L4Mom and L4Adam.
The first surprise comes when we look at Fig. 4, which compares the effective learning rates. Clearly, the adaptive learning rates are much more conservative in behavior compared to MNIST, possibly signaling a different nature of the datasets. Also, the L4Mom learning rate approximately matches the manually designed schedule (also for momentum gradient) during the decisive first 150 epochs.
Figure 5: (a) training loss; (b) test accuracy.
Comparing performance against optimized constant learning rates is favorable for L4 optimizers both in terms of loss and test accuracy (see Fig. 5). Note also that the two L4 optimizers perform almost indistinguishably. However, competing with the default policy has another surprising outcome. While the default policy is inferior in loss minimization (more strongly at the beginning than at the end), in terms of test accuracy it eventually dominates. By careful inspection of Fig. 5, we see the decisive gain happens right after the first drop in the hardcoded learning rate. This, in itself, is very intriguing since both default and L4Mom use the same type of gradients of similar magnitudes. Also, it explains the original authors’ choice of a piece-wise constant learning rate schedule.
To our knowledge, there is no satisfying answer to why piece-wise constant learning rates lead to good generalization. Yet, practitioners use them frequently, perhaps precisely for this reason.
As the last task, we chose a somewhat exotic one: a recurrent architecture of Google DeepMind’s Differentiable Neural Computer (DNC). Again, we compare with the performance from the official repository. The DNC is a culmination of a line of work developing LSTM-like architectures with differentiable memory management, e. g. [6, 22], and is in itself very complex. The targeted tasks typically have a very structured flavor (e. g. shortest path, question answering).
The task implemented in the repository is to learn a REPEAT-COPY algorithm. In a nutshell, the input specifies a sequence of bits and a number of repeats, while the expected output is a sequence consisting of that many repeats of the input sequence. The loss function is a negative log-probability of outputting the correct sequence.
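To make the task concrete, a single REPEAT-COPY example can be generated as in the sketch below. The function name `repeat_copy_example` and the flat-list encoding are assumptions for illustration; the encoding used in the official repository (for instance, how the repeat count and end markers are fed to the network) is not reproduced here.

```python
import random


def repeat_copy_example(length, repeats, rng):
    """Return (input bits, number of repeats, expected output) for one example."""
    bits = [rng.randint(0, 1) for _ in range(length)]
    # The expected output is the input sequence repeated `repeats` times.
    target = bits * repeats
    return bits, repeats, target


rng = random.Random(0)   # fixed seed so the example is reproducible
bits, repeats, target = repeat_copy_example(length=5, repeats=3, rng=rng)
```

Because the ground truth is produced by this simple rule, fresh examples can be generated for every training step.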
Since the ground truth is a known algorithm, the training data can be generated on the fly, and there is no separate test regime. This time, the optimizer in place is RMSProp with gradient clipping. We found out, however, that the constant learning rate provided in the repository can be further tuned, and we compare our results against the improved value. We also used the best performing constant learning rates for Adam and for momentum SGD (both with the suggested gradient clipping) as baselines. The L4 optimizers did not use gradient clipping.
Figure 6: (a) training loss (DNC); (b) effective learning rate (DNC).
Again, we can see in Fig. 6 that L4Adam and L4Mom performed almost the same on average, even though L4Mom was more prone to instabilities as can be seen from the volume of the orange-shaded regions. More importantly, they both performed better than or on par with the optimized baselines.
We end this experimental section with a short discussion of Fig. 6(b), since it illustrates multiple features of the adaptation all at once. In this figure, we compare the effective learning rates of L4 and plain Adam. We immediately notice the dramatic evolution of the L4 learning rate, jumping across multiple orders of magnitude, until it finally settles. This behavior, however, results in a much more stable optimization process (see again Fig. 6), unlike in the case of the plain Adam optimizer (note the volume of the green-shaded regions).
The intuitive explanation is two-fold. For one, the high gradients only need a small learning rate to make the expected progress. This lowers the danger of divergence and, in this sense, it plays the role of gradient clipping. And second, plateau regions with small gradients will force very high learning rates in order to leave them. This beneficial rapid adaptation is due to almost independent stepsize computation for every batch. Only $L_{\min}$ and possibly (depending on the underlying gradient methods) some gradient history is reused. This is a fundamental difference to methods that at each step make a small update to the previous learning rate. This is in agreement with earlier work, where the phenomenon was discussed in more depth.
The Fashion MNIST dataset is a drop-in replacement for MNIST that is considered to better represent modern computer vision tasks. We ran it on an official TensorFlow implementation of a ConvNet for MNIST. The architecture consists of two convolutional layers followed by two fully connected layers that have dropout in between. By default, the batch size is 100 and the optimizer is Adam.
We see in Fig. 7 that both L4 optimizers work out-of-the-box and, despite the presence of dropout during training, both achieve a loss that is roughly an order of magnitude lower than the losses of optimized baselines. This leads to a mild gain in test accuracy as can be seen in Tab. 4.
Such low training loss of L4 optimizers despite the presence of dropout suggests increasing the dropout rate in hope for better generalization. And indeed, when increasing the dropout rate from its default value, one can see (also in Tab. 4) an additional gain in test accuracy.
Although generalization performance was not our main focus in this paper, we firmly believe that superior performance in optimization is convertible to better results at test time. This case of increasing the dropout rate is one promising example of it.
Table 4: test accuracy on Fashion MNIST for L4Adam and L4Mom (each with the increased and with the default dropout rate), Adam*, and Mom*. Reported are the mean and standard deviation of five independent restarts.
Since L4 recomputes the stepsize individually for each batch, it is natural to investigate how its performance depends on the batch size. For this experiment we chose the MNIST and DNC tasks, since there L4 displayed the most variance in the effective learning rates, and thus behaved “most different” from standard optimizers. Of particular interest is the “high variance” regime of small batch sizes, where the stepsize adaptation can be expected to evolve the learning rate rapidly. The same default setting of both L4Mom and L4Adam was consistently applied.
In both cases we swept over batch sizes 8, 16, 32, and 64. In the experiments from the main text of the paper, the selected batch sizes were 64 and 16 for MNIST and DNC, respectively.
The results for MNIST are plotted in Fig. 9. It turns out that the original setting is the least favorable for both L4 optimizers. In fact, the performance increases with decreasing batch size.
For DNC, we report the performance in Fig. 9. We also observe that L4 favors small batch sizes. Here batch size 8 is probably the limit of what L4 can tolerate. This is not directly visible on the loss curves, but in fact one of the runs of L4Mom diverged (after processing 50000 examples). In all other cases, a good level of convergence was reached.
In conclusion, adapting the stepsize for every batch proves to be gradually more beneficial as we lower the batch sizes (loss estimates increase in variance). This is a further validation for applying learning rates that are highly varying from mini-batch to mini-batch.
Although the batch sizes in both cases range almost over an order of magnitude, no severe deterioration of performance was ever detected.
We propose a stepsize adaptation scheme, L4, compatible with the currently most prominent gradient methods. Two arising optimizers were tested on a multitude of datasets, spanning different batch sizes, loss functions and network structures. The results validate the stepsize adaptation in itself, as the adaptive optimizers consistently outperform their non-adaptive counterparts, even when the adaptive optimizers use the default setting and the non-adaptive ones were finely tuned. This default setting also performs well when compared to hand-tuned optimization policies from official repositories of modern high-performing architectures. Although we cannot give guarantees, this is a promising step towards practical “no-tuning-necessary” stochastic optimization.
The core design feature, the ability to change the stepsize dramatically from batch to batch while occasionally reaching extremely high stepsizes, was also validated. This idea does not seem widespread in the community and we would like to inspire further work.
The ability of the proposed method to actually drive loss to convergence creates an opportunity to better evaluate regularization strategies and develop new ones. This can potentially convert the superiority in training to enhanced test performance as discussed in the Fashion MNIST experiments.
Finally, Ali Rahimi and Benjamin Recht suggested in their NIPS 2017 talk (and the corresponding blog post) [18, 19] that the failure to drive loss to zero within machine precision might be an actual bottleneck of deep learning (using exactly the ill-conditioned regression task). We show on this example and on MNIST that our method can break this “optimization floor”.
We would like to thank Alex Kolesnikov, Friedrich Solowjow, and Anna Levina for helping to improve the manuscript.
Proceedings of the 30th International Conference on Machine Learning, volume 28/3 of Proceedings of Machine Learning Research, pages 343–351, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.