L4: Practical loss-based stepsize adaptation for deep learning

02/14/2018
by Michal Rolinek, et al.
Max Planck Society

We propose a stepsize adaptation scheme for stochastic gradient descent. It operates directly with the loss function and rescales the gradient in order to make fixed predicted progress on the loss. We demonstrate its capabilities by strongly improving the performance of Adam and Momentum optimizers. The enhanced optimizers with default hyperparameters consistently outperform their constant stepsize counterparts, even the best ones, without a measurable increase in computational cost. The performance is validated on multiple architectures including ResNets and the Differentiable Neural Computer. A prototype implementation as a TensorFlow optimizer is released.


Code Repositories

l4-optimizer: Code for the paper "L4: Practical loss-based stepsize adaptation for deep learning"

l4-pytorch: L4: Practical loss-based stepsize adaptation for PyTorch

1 Introduction

Stochastic gradient methods are the driving force behind the recent boom of deep learning. As a result, the demand for practical efficiency as well as for theoretical understanding has never been stronger. Naturally, this has inspired a lot of research and has given rise to new and currently very popular optimization methods such as Adam [9], AdaGrad [5], or RMSProp [25], which serve as competitive alternatives to classical stochastic gradient descent (SGD).

However, the current situation still causes huge overhead in implementations. In order to extract the best performance, one is expected to choose the right optimizer, finely tune its hyperparameters (sometimes multiple), often handcraft a specific stepsize adaptation scheme, and finally combine all of this with a suitable regularization strategy, mostly based on intuition and experience.

If we put aside the regularization aspects, the holy grail for resolving the optimization issues would be a widely applicable automatic stepsize adaptation for stochastic gradients. This idea has been floating around the community for years and different strategies have been proposed. One line of work casts the learning rate as another parameter one can train with gradient descent (see [2], also for a survey). Another approach is to make use of (an approximation of) second order information (see [3] and [20] as examples). An interesting Bayesian approach to probabilistic line search has also been proposed in [14]. Finally, another related research branch is based on the "Learning to learn" paradigm [1] (possibly using reinforcement learning, as in [13]).

Although some of the mentioned papers claim to "effectively remove the need for learning rate tuning", this has not been observed in practice. Whether this is due to conservatism on the implementors' side or due to a lack of solid experimental evidence, we leave aside. In any case, we also take up the challenge.

Our strategy is performance-oriented. Admittedly, this also means that while our stepsize adaptation scheme makes sense intuitively (and is related to sound methods), we do not provide or claim any theoretical guarantees. Instead, we focus on strong, reproducible performance against optimized baselines across multiple different architectures, on a minimal need for tuning, and on releasing a prototype implementation that is easy to use in practice.

Our adaptation method is called Linearized Loss-based optimaL Learning-rate (L4) and it has two main features. First, it operates directly with the (currently observed) value of the loss. This eventually allows for an almost independent stepsize computation for consecutive updates and consequently enables very rapid learning rate changes. Second, we separate the two roles a gradient vector typically has: it provides both a local linear approximation and the actual vector of the update step. We allow using a different gradient method for each of the two tasks.

The scheme itself is a meta-algorithm and can be combined with any stochastic gradient method. We report our results for the L4 adaptation of Adam and Momentum SGD.

2 Method

2.1 Motivation

Stochasticity poses a severe challenge for stepsize adaptation methods. Any change in the learning rate based on one or a few noisy loss estimates is likely to be inaccurate. In a setting where any overestimation of the learning rate can be very punishing, this leaves little maneuvering space.

The approach we take is different. We do not maintain any running value of the stepsize. Instead, at every iteration, we compute it anew with the intention to make maximum possible progress on the (linearized) loss. This is inspired by the classical iterative Newton’s method for finding roots of one-dimensional functions. At every step, this method computes the linearization of the function at the current point and proceeds to the root of this linearization. We use analogous updates to locate the root (minimum) of the loss function.

The idea of using a linear approximation for line search is, of course, not novel, as witnessed for example by the Armijo-Wolfe line search [16]. More notably, our motivation is identical to that of Polyak's update rule [17], where the loss linearization (Eq. 2), as well as the idea of approximating the minimum loss, was already proposed in a deterministic setting.

Therefore, our scheme should be thought of as an adaptation of these classical methods for the practical needs of deep learning. Also, the ideological proximity to provably correct methods is reassuring.

2.2 Algorithm

In the following, we describe how the stepsize is chosen for a gradient update proposed by an underlying optimizer (e.g. SGD, Adam, momentum SGD). We begin with a simplified core version.

Let ℒ(θ) be the loss function (on the current batch) depending on the parameters θ, and let v(θ) be the update step provided by some standard optimizer; e.g. in the case of SGD this would be v(θ) = ∇ℒ(θ). Throughout the paper, the loss is assumed to be non-negative.

For now, let us assume the minimum attainable loss ℒ_min is known (see Section 2.4 for details). We consider the stepsize η needed to reach ℒ_min (under idealized assumptions), i.e. satisfying

    ℒ(θ - η v(θ)) = ℒ_min.        (1)

We linearize ℒ around θ and then, denoting by g(θ) the estimate of the gradient of ℒ at θ, we solve for η:

    η(θ) = (ℒ(θ) - ℒ_min) / ⟨g(θ), v(θ)⟩.        (2)

First of all, note the clear separation between g(θ), the estimator of the gradient of ℒ, and v(θ), the proposed update step. Moreover, it is easily seen that the final update step η(θ) v(θ) is independent of the magnitude of v(θ). In other words, the adaptation method only takes into account the "direction" of the proposed update. This decomposition into a gradient estimate and an update direction is the core principle behind the method and is also vital for its performance.

Figure 1: Illustration of the stepsize calculation for one parameter. Given a minimum loss, the stepsize is chosen such that the linearized loss would be minimal after one step. In practice a fraction of that stepsize is used, see Sec. 2.4.
Figure 2: Training performance on the badly conditioned regression task. The mean (in log-space) training loss over restarts is shown. The areas between minimal and maximal loss (after log-space smoothing) are shaded. For all algorithms the best stepsize was selected (see Tab. 1 for the values for SGD, Adam, and the L4 optimizers), except for the default L4 setting, shown without the "*". Note the logarithmic scale of the loss.

The update rule is illustrated in Fig. 1 for a quadratic (or other convex) loss. Here, we see (deceptively) that the proposed stepsize is, in fact, still conservative. However, in the multidimensional case, the minimum will not necessarily lie on the line given by the gradient. That is why, in real-world problems, this stepsize is far too aggressive and prone to divergence. In addition, there are further reasons to be more conservative: the problems in deep learning are (often strongly) non-convex, and actually minimizing the currently seen batch loss is very likely not to generalize to the whole dataset.

For these reasons, we introduce a hyperparameter α which captures the fixed fraction of the stepsize (2) we take at each step. The update rule then becomes:

    θ ← θ - α (ℒ(θ) - ℒ_min) / ⟨g(θ), v(θ)⟩ · v(θ).        (3)

Even though a few more hyperparameters will appear later as stability measures and regularizers, α is the main hyperparameter to consider. We observed in experiments that the relevant range is consistently small, roughly α between 0.1 and 0.3. In comparison, for SGD the range of stable learning rates varies over multiple orders of magnitude. We chose the slightly conservative value α = 0.15 as the default setting. We report its performance on all the tasks in Section 3.
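To make the rule concrete, the following minimal NumPy sketch implements update (3) for a flat parameter vector. The function name and the flat-array interface are our own illustration, not the released TensorFlow implementation, and the stability measures of Section 2.4 are omitted here.

    # Minimal sketch of update rule (3); names and interface are illustrative only.
    import numpy as np

    def l4_step(theta, loss_value, g, v, L_min, alpha=0.15):
        """One L4-style parameter update.

        theta      -- current parameters (flat array)
        loss_value -- loss on the current batch, assumed non-negative
        g          -- gradient estimate at theta (flat array)
        v          -- update direction proposed by the underlying optimizer
        L_min      -- current estimate of the minimum attainable loss
        alpha      -- fraction of the "ideal" linearized stepsize to take
        """
        eta = alpha * (loss_value - L_min) / np.dot(g, v)  # stepsize from Eq. (2), scaled by alpha
        return theta - eta * v                             # Eq. (3)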

2.3 Invariance to affine transforms of the loss

Here, we offer a partial explanation of why the value of α stays in the same small relevant range even for very different problems and datasets. Interestingly, the new update equation (3) is invariant to affine loss transformations of the type

    ℒ̂(θ) = a ℒ(θ) + b        (4)

with a > 0. Let us briefly verify this. The gradient of ℒ̂ will be a·g(θ), and we will assume that the underlying optimizer offers the same update direction v(θ) in both cases (we have already established that its magnitude does not matter). Then we can simply write

    α (ℒ̂(θ) - ℒ̂_min) / ⟨a g(θ), v(θ)⟩ · v(θ) = α (a ℒ(θ) + b - a ℒ_min - b) / (a ⟨g(θ), v(θ)⟩) · v(θ) = α (ℒ(θ) - ℒ_min) / ⟨g(θ), v(θ)⟩ · v(θ),

and we see that the updates are the same in both cases. On top of being a good sanity check for any loss-based method, we additionally believe that this invariance simplifies problem-to-problem adaptation (also in terms of hyperparameters).

It should be noted, though, that we lose this precise invariance once we introduce the heuristic and regularization steps of Section 2.4.
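The invariance can be checked numerically with a small toy example (our own illustration, not taken from the paper's code): rescaling the loss as a·ℒ + b rescales the gradient by a and the minimum to a·ℒ_min + b, leaving the update unchanged.

    # Quick numerical check of the affine invariance of update (3).
    import numpy as np

    rng = np.random.default_rng(0)
    g = rng.normal(size=5)          # gradient estimate of the original loss
    v = g.copy()                    # update direction (its magnitude is irrelevant)
    L, L_min, alpha = 2.0, 0.5, 0.15
    a, b = 7.3, 1.1                 # affine transform parameters, a > 0

    step_original    = alpha * (L - L_min) / np.dot(g, v) * v
    step_transformed = alpha * ((a * L + b) - (a * L_min + b)) / np.dot(a * g, v) * v

    print(np.allclose(step_original, step_transformed))  # True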

2.4 Stability measures and heuristics

ℒ_min adaptation:

We still owe an explanation of how ℒ_min is maintained during training. We base its value on the minimal loss seen so far. Naturally, some mini-batches will have a lower loss and will be used as a reference for the others. By itself, this comes with some disadvantages. In case of small variance across batches, this estimate would be very pessimistic. Also, the "new best" mini-batches would have zero stepsize.

Therefore, we introduce a factor γ which captures the fraction of the lowest seen loss that is still believed to be achievable. Similarly, to correct for the possibly strong effects of a few outlier batches, we let ℒ_min slowly increase with a timescale τ. This reactiveness slightly shifts the interpretation of ℒ_min from "globally minimum loss" to "minimum currently achievable loss", reflecting the fact that in practical settings it is unrealistic to aim for the global minimum in each update. All in all, when a new value ℒ(θ) of the loss arrives, we set

    ℒ_min ← min(γ ℒ(θ), ℒ_min),

then we use ℒ_min for the gradient update and apply the "forgetting"

    ℒ_min ← (1 + 1/τ) ℒ_min.        (5)

The value of ℒ_min is initialized with a fixed fraction γ0 of the first seen loss ℒ0, that is, ℒ_min = γ0 ℒ0. We keep γ, γ0, and τ fixed at their default values in all our experiments. Even though we cannot exclude that tuning these values could lead to enhanced performance, we have not observed such effects and do not feel the necessity to modify them.
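The bookkeeping above can be summarized in a minimal sketch; the class and variable names are our own, and the three constants are left as explicit arguments rather than restated as defaults.

    # Sketch of the L_min tracking described above (names are illustrative).
    class MinLossTracker:
        def __init__(self, first_loss, gamma0, gamma, tau):
            self.gamma, self.tau = gamma, tau
            self.L_min = gamma0 * first_loss        # initialize with a fraction of the first loss

        def update(self, loss_value):
            # incorporate the new batch loss ...
            self.L_min = min(self.gamma * loss_value, self.L_min)
            L_min_for_step = self.L_min             # value used in the parameter update
            # ... then apply the "forgetting", Eq. (5)
            self.L_min *= (1.0 + 1.0 / self.tau)
            return L_min_for_step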

Numerical stability:

Another unresolved issue is the division by an inner product in Eq. (3). Our solution to potential numerical instabilities is two-fold. First, we require compatibility of g(θ) and v(θ) in the sense that the angle between the two vectors does not exceed 90 degrees; in other words, we insist on ⟨g(θ), v(θ)⟩ ≥ 0. For L4Adam and L4Mom this is the case, see Section 2.5, Eq. (7). Second, we add a tiny ε > 0 as a regularizer to the denominator. The final form of the update rule is then

    θ ← θ - α (ℒ(θ) - ℒ_min) / (⟨g(θ), v(θ)⟩ + ε) · v(θ),        (6)

with ε set to a tiny default value.

2.5 Putting it together: L4Mom and L4Adam

The algorithm is called Linearized Loss-based optimaL Learning-rate (L4) and it works on top of a provided gradient estimator (producing g(θ)) and an update direction algorithm (producing v(θ)); see Algorithm 1 for the pseudocode. For compactness of presentation, we denote the exponential moving average of a quantity with timescale τ by E_τ[·], using bias correction just as in [9] (see Algorithm 2).

  Require: α: stepsize fraction
  Require: γ: optimism / loss improvement fraction
  Require: τ: timescale for forgetting the minimum loss
  Require: ℒ(θ): non-negative stochastic loss function w.r.t. parameters θ (the sample at step t is denoted by ℒ_t)
  Require: θ0: initial parameter vector
  Require: g(θ): gradient direction function
  Require: v(θ): gradient step function
  t ← 0
  ℒ_min ← γ0 ℒ_0(θ0)   (fraction of initial loss)
  while θ_t not converged do
     t ← t + 1
     v_t ← v(θ_{t-1})   (gradient step)
     g_t ← g(θ_{t-1})   (gradient estimator)
     ℒ_min ← min(γ ℒ_t(θ_{t-1}), ℒ_min)   (minimum loss)
     θ_t ← θ_{t-1} - α (ℒ_t(θ_{t-1}) - ℒ_min) / (⟨g_t, v_t⟩ + ε) · v_t   (parameter update)
     ℒ_min ← (1 + 1/τ) ℒ_min   (forgetting minimum loss)
  end while
  return: θ_t  (final parameter vector)
Algorithm 1: L4, a meta-algorithm for stochastic optimization, compatible with momentum SGD, Adam, or other gradient estimators. The default hyperparameters are α = 0.15 together with fixed default values for γ, γ0, τ, and ε (see Sections 2.2 and 2.4).
  Require: τ: timescale
  μ ← 0   (initialize mean with zero vector)
  t ← 0   (initialize step counter)
  update_average(x):   (x: input vector to be averaged)
     t ← t + 1
     μ ← (1 - 1/τ) μ + (1/τ) x   (update average)
     return: μ / (1 - (1 - 1/τ)^t)   (correct bias)
Algorithm 2: Bias-corrected moving average.

In this paper, we introduce two variants of L4, leading to two optimizers: (1) with momentum gradient descent, denoted by L4Mom, and (2) with Adam [9], denoted by L4Adam. We choose the update directions for L4Mom and L4Adam, respectively, as

    v_Mom(θ) = g(θ) = E_τm[∇ℒ(θ)],        v_Adam(θ) = g(θ) / (√(E_τs[(∇ℒ(θ))²]) + ε),        (7)

with τm and τs being the timescales for momentum and (in the case of L4Adam) second moment averaging.

In both cases, the choice of g ensures ⟨g(θ), v(θ)⟩ ≥ 0, as mentioned in Section 2.4. An additional reason for this choice is that the averaged local gradient is in practice often a more accurate estimator of the gradient of the global loss.
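The pieces fit together in the compact NumPy sketch below, which combines Algorithms 1 and 2 with direction choices in the spirit of Eq. (7). The names, the flat-parameter interface, and all numeric keyword defaults other than α = 0.15 are our own placeholder choices for illustration; the released TensorFlow prototype should be preferred in practice.

    # Illustrative L4 meta-optimizer sketch (not the released implementation).
    import numpy as np

    class BiasCorrectedEMA:
        """Exponential moving average with timescale tau and Adam-style bias correction."""
        def __init__(self, tau):
            self.beta = 1.0 - 1.0 / tau   # decay factor corresponding to timescale tau
            self.mean = None
            self.t = 0

        def update(self, x):
            if self.mean is None:
                self.mean = np.zeros_like(x)
            self.t += 1
            self.mean = self.beta * self.mean + (1.0 - self.beta) * x
            return self.mean / (1.0 - self.beta ** self.t)   # bias-corrected estimate

    def l4_train(theta, loss_and_grad, num_steps, alpha=0.15, gamma=0.9, gamma0=0.75,
                 tau=1000.0, tau_m=10.0, tau_s=1000.0, eps=1e-12, adam_direction=True):
        """Run an L4-style loop on a stochastic objective.

        loss_and_grad(theta) must return (loss, grad) for the current mini-batch.
        With adam_direction=True the update direction is an Adam-style normalized
        gradient; otherwise it is the momentum gradient itself (L4Mom-style).
        NOTE: keyword defaults other than alpha are placeholders, not published values.
        """
        grad_avg = BiasCorrectedEMA(tau_m)    # g(theta): momentum-averaged gradient
        sq_avg = BiasCorrectedEMA(tau_s)      # second-moment average for the Adam direction
        L_min = None
        for _ in range(num_steps):
            loss, grad = loss_and_grad(theta)
            if L_min is None:
                L_min = gamma0 * loss                              # fraction of the initial loss
            g = grad_avg.update(grad)                              # gradient estimator
            if adam_direction:
                v = g / (np.sqrt(sq_avg.update(grad ** 2)) + eps)  # Adam-style direction
            else:
                v = g                                              # momentum direction
            L_min = min(gamma * loss, L_min)                       # minimum-loss update
            eta = alpha * (loss - L_min) / (np.dot(g, v) + eps)    # stepsize, Eq. (6)
            theta = theta - eta * v                                # parameter update
            L_min *= (1.0 + 1.0 / tau)                             # forgetting, Eq. (5)
        return theta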

3 Results

We evaluate the proposed method on five different setups, spanning different architectures, datasets, and loss functions. We compare against the de facto standard methods: stochastic gradient descent (SGD), momentum SGD (Mom), and Adam [9].

For each of these methods, the performance is evaluated for the best setting of the stepsize/learning rate parameter (found via a fine grid search with multiple restarts). All other parameters are kept fixed: for momentum SGD we used a fixed momentum timescale, and for Adam fixed values of the two moment timescales together with a non-default value of ε, selected in accordance with the TensorFlow documentation to decrease the instability of Adam.

In all experiments, the performance of the standard methods depends heavily on the stepsize parameter. In the case of the proposed method, however, the default setting showed remarkable consistency: across the experiments, it outperforms even the best constant learning rates for the respective gradient-based update rules. In addition, the performance of this default setting is comparable with handcrafted optimization policies on more complicated architectures. We consider this to be the main strength of the L4 method.

We present results for L4Mom and L4Adam; see Tab. 1 for an overview. In all experiments we strictly followed an out-of-the-box policy: we simply cloned an official repository, changed the optimizer, and left everything else intact. Throughout the experiments we observed neither a runtime increase nor additional memory requirements arising from the adaptation.

Dataset       | Arch        | Batch size   | η* (SGD/mSGD) | η* (Adam) | α (L4Mom)   | α (L4Adam)
Synthetic     | 2-layer MLP | -            | 0.0005        | 0.001     | 0.15 [0.25] | 0.15 [0.25]
MNIST         | 3-layer MLP | 64 [8,16,32] | 0.05          | 0.001     | 0.15 [0.25] | 0.15 [0.25]
CIFAR-10      | ResNet      | 128          | 0.004         | 0.0002    | 0.15        | 0.15
DNC           | Recurrent   | 16 [8,32,64] | 1.2           | 0.01      | 0.15        | 0.15
Fashion MNIST | ConvNet     | 100          | 0.01          | 0.0003    | 0.15        | 0.15
Table 1: Overview of the experiments. The experiments span classical datasets, traditional and modern architectures, as well as different batch sizes. The tested learning rates are denoted by η and marked with * if chosen optimally via grid search. The optimal learning rates for the baselines vary widely, while the L4 optimizers keep a fixed setting and still outperform them in terms of training and test loss.

As a general nomenclature, a method is marked with a * if an optimized stepsize was used. Otherwise (in the case of the L4 optimizers), the default settings are in place.

At the end of this section, we append experiments with varying batch sizes, as hinted at in Tab. 1 by the values in brackets.

Running time:

Neither of the L4 optimizers slows down network training in practical settings. By inspection of Equations (6) and (7), we can see that the only additional computation (compared with Adam or momentum SGD) is the inner product ⟨g, v⟩. This introduces two additional operations per weight (a multiplication and an addition). In any realistic scenario, these have negligible runtime compared with the matrix multiplications (convolutions) required in both the forward and the backward pass.

3.1 Badly conditioned regression

The first task we investigate is a linear regression with a badly conditioned input/output relationship. It was recently brought into the spotlight by Ali Rahimi in his NIPS 2017 talk [19] as an example of a problem "resistant" to standard stochastic gradient optimization methods. For our experiments, we used the corresponding code by Ben Recht [18].

The network has two weight matrices W1 and W2, and the loss function is given by

    ℒ(W1, W2) = E_x [ ‖W2 W1 x - A x‖² ],        (8)

where A is a badly conditioned matrix, i.e. the ratio of its largest to its smallest singular value is large. Note that this is a (realizable) matrix factorization problem in disguise: the global optimum satisfies W2 W1 = A. Also, it is not a stochastic optimization problem but a deterministic one.
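A small sketch of this objective is given below. The dimensions, the target condition number, and the data generation are placeholder choices of ours, not the exact configuration of [18].

    # Sketch of the badly conditioned two-matrix regression objective, Eq. (8).
    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_out, d_hidden, n_samples, cond = 6, 10, 6, 1000, 1e5

    # Build a target matrix A with a prescribed condition number via its SVD.
    U, _ = np.linalg.qr(rng.normal(size=(d_out, d_out)))
    V, _ = np.linalg.qr(rng.normal(size=(d_in, d_in)))
    sing_vals = np.geomspace(cond, 1.0, num=min(d_in, d_out))
    A = U[:, :len(sing_vals)] @ np.diag(sing_vals) @ V[:len(sing_vals), :]

    X = rng.normal(size=(n_samples, d_in))        # fixed dataset

    def loss(W1, W2):
        # Mean squared error between the two-layer linear net and the target map A.
        residual = X @ W1.T @ W2.T - X @ A.T
        return 0.5 * np.mean(np.sum(residual ** 2, axis=1))

    W1 = rng.normal(size=(d_hidden, d_in)) * 0.1
    W2 = rng.normal(size=(d_out, d_hidden)) * 0.1
    print(loss(W1, W2))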

Figure 2 shows the results for the default configuration of [18]. The statistics are given for 5 independent runs (with randomly generated matrices A) and a fixed dataset of 1000 samples. We can confirm that the standard optimizers indeed have great difficulty reaching convergence. Only a fine grid search discovered settings that behave reasonably well (divergence or premature plateaus are very common). The proposed stepsize adaptation method overcomes this issue (see Fig. 2).

Method   | Steps | Time (s)
96 trainable weights
  L4Adam*
  L4Mom*
  LMA*
192 trainable weights
  L4Adam*
  L4Mom*
  LMA*
Table 2: Comparison with the Levenberg-Marquardt algorithm. The time and the number of gradient updates needed to reach convergence are reported; averages are over 5 restarts. Two problem setups are considered: the default one from [18] and its version scaled up by a factor of two. The stepsize for LMA was selected as the best performing one, and the grid-searched value of α (cf. Tab. 1) is chosen for both L4 optimizers.

3.1.1 Comparison with LMA

As an extension of the experiments on badly conditioned regression, we also include a comparison with the classical Levenberg-Marquardt algorithm (LMA) [12], which can be viewed as a Gauss-Newton method with a trust region. Tab. 2 reports the speed of the algorithms, both in terms of the number of iterations and in terms of wall-clock time (measured on a machine with an i7-7800X CPU @ 3.50GHz with 8 cores). The same comparison is also performed on an instance of twice the size (all dimensions doubled).

The results show that the updates provided by LMA reach convergence in a much smaller number of steps. At the same time, however, LMA is significantly more computationally expensive, since each step involves solving a least squares problem. This can be clearly seen by comparing the performance across the two problem sizes in Tab. 2.

3.2 MNIST digit recognition

The second task is a classical multilayer neural network trained for digit recognition on the MNIST dataset [11]. We use a standard architecture with two hidden layers of ReLU units followed by a logistic regression output layer for the 10 digit classes. The batch size in use is 64.

(a) learning curve for 2-hidden layer NN on MNIST (b) effective learning rate (MNIST)
Figure 3: Training progress of multilayer neural networks on MNIST, see Section 3.2 for details. (a) Average (in log-space) training loss with respect to five restarts with a shaded area between minimum and maximum loss (after log-space smoothing). (b) Effective learning rates for a single run. The bold curves are averages taken in log-space.

Figure 3 shows the learning curves and the effective learning rates. The effective learning rate is the quantity α (ℒ(θ) - ℒ_min) / ⟨g(θ), v(θ)⟩ from (3). Note how late in training the effective learning rate of L4Adam becomes very small and eventually drops to zero. This is simply because by then the loss is 0 (within machine precision) on every batch and thus ℒ_min = 0; a global optimum was found. The very high learning rates that precede this can be attributed to the "plateau" character of the obtained minimum: the gradients are so small in magnitude that a very high stepsize is necessary to make any progress. This is perhaps unexpected, since in optimization theory convergence is typically linked to a decrease in the learning rate rather than an increase.

Generally, we see that the effective learning rate shows highly nontrivial behavior: we observe sharp increases as well as sharp decreases, and even within a short time period it fully spans two or more orders of magnitude, as highlighted by the shaded area in Fig. 3(b). None of this causes instabilities in the training itself.

Even though the ability to generalize and the compatibility with various regularization methods are not our main focus in this work, we still report in Tab. 3 the development of the test accuracy during training. We see that the test performance of all optimizers is comparable. This does not come as a surprise, as the architecture used has no regularization. It can also be seen that the L4 optimizers reach near-final accuracies faster, already after around 10 epochs.

Test accuracy in %
          | Adam | mSGD | L4Adam | L4Adam* | L4Mom | L4Mom*
1 epoch
10 epochs
30 epochs
Table 3: Test accuracy after a given number of epochs of (unregularized) MNIST training. The results are reported over 5 restarts.
Comparison to other work:

The list of papers reporting improved performance over SGD on MNIST is long (examples include [20, 14, 1, 2, 15]). Unfortunately, there are no widely recognized benchmarks to use for comparison. There is a lot of variety in the choice of the baseline optimizer (often only the default setting of SGD) and in the number of training steps reported (often fewer than one epoch). In this situation, it is difficult to make any substantiated claims. However, to our knowledge, previous work does not achieve such rapid convergence as seen in Fig. 3.

3.3 ResNets for CIFAR-10 image classification

In the next two tasks, we target finely tuned, publicly available implementations of well-known architectures and compare their performance to our default setting. We begin with the deep residual network architecture for CIFAR-10 [10] taken from the official TensorFlow repository [24]. Deep residual networks [8], or ResNets for short, provided the breakthrough idea of identity mappings in order to enable the training of very deep convolutional neural networks. The provided architecture has 32 layers and uses batch normalization with batches of size 128. The loss is given by cross-entropy with regularization.

The deployed optimization policy is momentum gradient descent with a manually crafted piece-wise constant stepsize schedule. We simply replace it with the default settings of L4Mom and L4Adam; a sketch of the kind of hand-coded schedule being replaced is shown below for illustration.
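The following snippet shows the general shape of such a hand-crafted schedule. The boundary epochs and learning rates are placeholders of ours, not the exact values of the official repository.

    # Illustrative piece-wise constant learning-rate schedule for momentum SGD.
    # Boundaries and rates below are placeholders, not the repository's values.
    def piecewise_constant_lr(epoch, boundaries=(100, 150, 200), rates=(0.1, 0.01, 0.001, 0.0002)):
        for boundary, rate in zip(boundaries, rates):
            if epoch < boundary:
                return rate
        return rates[-1]

    # Example: the rate drops by an order of magnitude at each boundary epoch.
    print([piecewise_constant_lr(e) for e in (50, 120, 180, 250)])  # [0.1, 0.01, 0.001, 0.0002]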

Figure 4: Effective learning rates for CIFAR-10. The adaptive stepsize of L4Mom roughly matches the hand-coded decay schedule (grey line) until 150 epochs. Both use the same gradient type.

The first surprise comes when we look at Fig. 4, which compares the effective learning rates. Clearly, the adaptive learning rates are much more conservative in behavior than on MNIST, possibly signaling a different nature of the datasets. Also, the L4Mom learning rate approximately matches the manually designed schedule (also based on momentum gradients) during the decisive first 150 epochs.

(a) training loss; (b) test accuracy
Figure 5: Training and test performance of the ResNet architecture on CIFAR-10. Mean loss and accuracy are shown with respect to three restarts. The default settings of the L4 optimizers perform better in loss minimization; however, they become inferior in test accuracy after the first drop in the learning rate of the baseline's schedule (see also Fig. 4). For Adam and mSGD, the best performing constant stepsizes (see Tab. 1) were evaluated.

Comparing performance against the optimized constant learning rates is favorable for the L4 optimizers, both in terms of loss and test accuracy (see Fig. 5). Note also that the two L4 optimizers perform almost indistinguishably. However, competing with the default policy has another surprising outcome. While the default policy is inferior in loss minimization (more strongly at the beginning than at the end), in terms of test accuracy it eventually dominates. By careful inspection of Fig. 5, we see that the decisive gain happens right after the first drop in the hardcoded learning rate. This, in itself, is very intriguing, since both the default policy and L4Mom use the same type of gradients of similar magnitudes. It also explains the original authors' choice of a piece-wise constant learning rate schedule.

To our knowledge, there is no satisfying answer to why piece-wise constant learning rates lead to good generalization. Yet, practitioners use them frequently, perhaps precisely for this reason.

3.4 Differentiable neural computer

As the last task, we chose a somewhat exotic one: the recurrent architecture of Google DeepMind's Differentiable Neural Computer (DNC) [7]. Again, we compare with the performance from the official repository [4]. The DNC is the culmination of a line of work developing LSTM-like architectures with differentiable memory management, e.g. [6, 22], and is in itself very complex. The targeted tasks typically have a very structured flavor (e.g. shortest path, question answering).

The task implemented in [4] is to learn the REPEAT-COPY algorithm. In a nutshell, the input specifies a sequence of bits along with a number of repeats, and the expected output is the sequence repeated that many times. The loss function is the negative log-probability of outputting the correct sequence. A toy generator for this task is sketched below.
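The sketch below generates raw REPEAT-COPY examples. It is our own minimal illustration; the official DNC repository [4] defines the exact input/target encoding and the log-probability loss.

    # Toy generator for the REPEAT-COPY task (illustrative only).
    import numpy as np

    def repeat_copy_example(rng, seq_len=4, num_bits=6, max_repeats=3):
        bits = rng.integers(0, 2, size=(seq_len, num_bits))   # the sequence to be copied
        repeats = int(rng.integers(1, max_repeats + 1))       # how many times to repeat it
        target = np.tile(bits, (repeats, 1))                  # expected output sequence
        return bits, repeats, target

    rng = np.random.default_rng(0)
    bits, repeats, target = repeat_copy_example(rng)
    print(bits.shape, repeats, target.shape)  # shapes: (seq_len, num_bits), repeats, (seq_len*repeats, num_bits)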

Since the ground truth is a known algorithm, the training data can be generated on the fly, and there is no separate test regime. This time, the optimizer in place is RMSProp [25] with gradient clipping. We found, however, that the constant learning rate provided in [4] can be further tuned, and we compare our results against this improved value. We also used the best performing constant learning rates for Adam and for momentum SGD (both with the suggested gradient clipping) as baselines; see Tab. 1. The L4 optimizers did not use gradient clipping.

(a) training loss (DNC); (b) effective learning rate (DNC)
Figure 6: Training progress of the DNC. (a) Training loss (which equals the test loss) on the Differentiable Neural Computer architecture. See Fig. 2 for details. The L4 optimizers use default settings, whereas RMSProp and Adam use their best performing learning rates. We see high stochasticity in training, particularly with Adam. Both L4 optimizers match or beat RMSProp in performance. (b) Effective learning rate of L4Adam and plain Adam. L4Adam displays a huge variance in the selected stepsize; this, however, has a stabilizing effect on the training progress.

Again, we can see in Fig. 6 that L4Adam and L4Mom performed almost the same on average, even though L4Mom was more prone to instabilities, as can be seen from the volume of the orange-shaded regions. More importantly, they both performed better than or on par with the optimized baselines.

We end this experimental section with a short discussion of Fig. 6(b), since it illustrates multiple features of the adaptation at once. In this figure, we compare the effective learning rates of L4Adam and plain Adam. We immediately notice the dramatic evolution of the L4 learning rate, which jumps across multiple orders of magnitude before finally settling down. This behavior, however, results in a much more stable optimization process (see again Fig. 6), unlike in the case of the plain Adam optimizer (note the volume of the green-shaded regions).

The intuitive explanation is two-fold. For one, large gradients only need a small learning rate to make the expected progress; this lowers the danger of divergence and, in this sense, plays the role of gradient clipping. Second, plateau regions with small gradients force very high learning rates in order to escape them. This beneficial rapid adaptation is possible because the stepsize is computed almost independently for every batch; only ℒ_min and possibly (depending on the underlying gradient methods) some gradient history are reused. This is a fundamental difference from methods that, at each step, make a small update to the previous learning rate, and it is in agreement with [26], where the phenomenon is discussed in more depth.

3.5 Fashion MNIST

The Fashion MNIST dataset [27] is a drop-in replacement for MNIST that is considered to better represent modern computer vision tasks. We ran it with the official TensorFlow implementation of a ConvNet for MNIST [23]. The architecture consists of two convolutional layers followed by two fully connected layers with dropout [21] in between. By default, the batch size is 100 and the optimizer is Adam.

We see in Fig. 7 that both L4 optimizers work out-of-the-box and, despite the presence of dropout during training, both achieve a loss that is roughly an order of magnitude lower than the losses of the optimized baselines. This leads to a mild gain in test accuracy, as can be seen in Tab. 4.

Such a low training loss of the L4 optimizers despite the presence of dropout suggests increasing the dropout rate in the hope of better generalization. Indeed, when switching from the default dropout rate to a higher value, one can see (also in Tab. 4) an additional gain in test accuracy.

Figure 7: Training performance on Fashion MNIST. The default L4 optimizers reach a lower level of the loss despite the presence of dropout. The optimized learning rates for mSGD and Adam were 0.01 and 0.0003, respectively. Results are reported over five independent restarts.

Although generalization performance was not our main focus in this paper, we firmly believe that superior optimization performance can be converted into better results at test time. The case of increasing the dropout rate is one promising example of this.

L4Adam (higher dropout) | L4Mom (higher dropout) | L4Adam | L4Mom | Adam* | Mom*
Table 4: Test accuracies on Fashion MNIST. Both L4Mom and L4Adam reach (slightly) higher accuracies, and this effect is strengthened by increasing the dropout rate. Reported are the mean and standard deviation over five independent restarts.

3.6 Sweeping over batch sizes

Since L4 recomputes the stepsize individually for each batch, it is natural to investigate how its performance depends on the batch size. For this experiment we chose the MNIST and DNC tasks, since on these L4 displayed the most variance in the effective learning rates and thus behaved "most differently" from standard optimizers. Of particular interest is the "high variance" regime of small batch sizes, where the stepsize adaptation can be expected to evolve the learning rate rapidly. The same default settings of both L4Mom and L4Adam were applied throughout.

Figure 8: Sweeping over batch sizes for MNIST. The plotted curves are averages over five restarts (with smoothing in log-space). All lower batch sizes outperform the batch size of 64 used in the main experiments.
Figure 9: Sweeping over batch sizes for the DNC. Results are averaged over five independent restarts. The x-axis is normalized by the number of data points processed. In the case of L4Mom with batch size 8, one run diverged (after already having reached convergence); the reported curve is the average over the remaining four runs.

In both cases we swept over batch sizes 8, 16, 32, and 64. In the experiments from the main text, the selected batch sizes were 64 and 16 for MNIST and the DNC, respectively.

The results for MNIST are plotted in Fig. 8. It turns out that the original setting is the least favorable for both L4 optimizers; in fact, the performance increases as the batch size decreases.

For the DNC, we report the performance in Fig. 9. We again observe that L4 favors small batch sizes. Here, batch size 8 is probably at the limit of what L4 can tolerate. This is not directly visible in the loss curves, but one of the runs of L4Mom did diverge (after processing 50000 examples and already reaching a low loss). In all other cases, a good level of convergence was reached.

In conclusion, adapting the stepsize for every batch proves to be gradually more beneficial as we lower the batch size (and the loss estimates increase in variance). This further validates the use of learning rates that vary strongly from mini-batch to mini-batch.

Although the batch sizes in both cases range over almost an order of magnitude, no severe deterioration of performance was ever detected.

4 Discussion

We propose a stepsize adaptation scheme, L4, compatible with the currently most prominent gradient methods. The two resulting optimizers were tested on a multitude of datasets, spanning different batch sizes, loss functions, and network structures. The results validate the stepsize adaptation itself, as the adaptive optimizers consistently outperform their non-adaptive counterparts, even when the adaptive optimizers use the default setting and the non-adaptive ones were finely tuned. This default setting also performs well when compared with hand-tuned optimization policies from official repositories of modern high-performing architectures. Although we cannot give guarantees, this is a promising step towards practical "no-tuning-necessary" stochastic optimization.

The core design feature, the ability to change the stepsize dramatically from batch to batch while occasionally reaching extremely high stepsizes, was also validated. This idea does not seem widespread in the community, and we would like to inspire further work.

The ability of the proposed method to actually drive the loss to convergence creates an opportunity to better evaluate regularization strategies and to develop new ones. This can potentially convert superiority in training into enhanced test performance, as discussed in the Fashion MNIST experiments.

Finally, Ali Rahimi and Benjamin Recht suggested in their NIPS 2017 talk (and the corresponding blog post) [18, 19] that the failure to drive loss to zero within machine precision might be an actual bottleneck of deep learning (using exactly the ill-conditioned regression task). We show on this example and on MNIST that our method can break this “optimization floor”.

5 Acknowledgement

We would like to thank Alex Kolesnikov, Friedrich Solowjow, and Anna Levina for helping to improve the manuscript.


References

  • [1] Marcin Andrychowicz, Misha Denil, Sergio Gómez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems 29, pages 3981–3989. Curran Associates, Inc., 2016.
  • [2] Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank D. Wood. Online learning rate adaptation with hypergradient descent. CoRR, abs/1703.04782, 2017.
  • [3] R. H. Byrd, S. L. Hansen, Jorge Nocedal, and Y. Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 1 2016.
  • [4] Google Deepmind. Official implementation of the differential neural computer, 2017. https://github.com/deepmind/dnc Commit a4debae.
  • [5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011.
  • [6] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines, 2014.
  • [7] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, October 2016.
  • [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
  • [9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of ICLR, 2015. arXiv preprint https://arxiv.org/abs/1412.6980.
  • [10] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research), 2009.
  • [11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
  • [12] Kenneth Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 2:164–168, 1944.
  • [13] Ke Li and Jitendra Malik. Learning to optimize. CoRR, abs/1606.01885, 2016.
  • [14] M. Mahsereci and P. Hennig. Probabilistic line searches for stochastic optimization. In Advances in Neural Information Processing Systems 28, pages 181–189. Curran Associates, Inc., 2015.
  • [15] Franziska Meier, Daniel Kappler, and Stefan Schaal. Online learning of a memory for learning rates. arXiv preprint https://arxiv.org/abs/1709.06709, 2017.
  • [16] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.
  • [17] B. T. Polyak. Introduction to optimization. Translations Series in Mathematics and Engineering, Optimization Software, 1987.
  • [18] Benjamin Recht. Gradient descent doesn’t find a local minimum, 2017. https://github.com/benjamin-recht/shallow-linear-net Commit d192d96.
  • [19] Benjamin Recht and Ali Rahimi. Reflections on random kitchen sinks, 2017. http://www.argmin.net/2017/12/05/kitchen-sinks, Dec. 5 2017.
  • [20] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28/3 of Proceedings of Machine Learning Research, pages 343–351, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
  • [21] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.
  • [22] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448. Curran Associates, Inc., 2015.
  • [23] TensorFlow GitHub Repository. Tensorflow implementation of ConvNet for MNIST, 2016. Commit 1f34fcaf.
  • [24] TensorFlow GitHub Repository. Tensorflow implementation of ResNets, 2016. Commit 1f34fcaf.
  • [25] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • [26] Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018.
  • [27] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.