Adaptive Learning Rates with Maximum Variation Averaging

06/21/2020
by   Chen Zhu, et al.
University of Maryland
Microsoft

Adaptive gradient methods such as RMSProp and Adam use an exponential moving average of the squared gradient to compute element-wise adaptive step sizes and handle noisy gradients. However, Adam can exhibit undesirable convergence behavior in some problems due to unstable or extreme adaptive learning rates. Methods such as AMSGrad and AdaBound have been proposed to stabilize the adaptive learning rates of Adam in the later stage of training, but they do not outperform Adam in some practical tasks such as training Transformers. In this paper, we propose an adaptive learning rate rule in which the running mean squared gradient is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate. This yields a worst-case estimate of the local gradient variance, so the optimizer takes smaller steps when large curvatures or noisy gradients are present, resulting in more desirable convergence behavior than Adam. We analyze and demonstrate the improved efficacy of our adaptive averaging approach on image classification, neural machine translation and natural language understanding tasks.



1 Introduction

SGD and its variants are both efficient and effective in training deep neural networks despite their simplicity. In their simplest form, gradient methods train a network by iteratively moving each parameter in the direction of the negative gradient (or a running average of gradients) of the loss function on a randomly sampled mini-batch of training data, scaled by a scalar learning rate that controls the size of the update. In contrast, adaptive stochastic methods use coordinate-specific learning rates inversely proportional to the square root of a running mean of squared gradients Tieleman and Hinton (2012); Duchi et al. (2011); Kingma and Ba (2015). Such methods have been proposed to improve the stability of SGD on non-stationary problems, and have achieved success in training models for various fields including speech, computer vision (CV), and natural language processing (NLP).

Despite the popularity of adaptive methods such as Adam Kingma and Ba (2015), the instability of adaptive learning rates sometimes results in worse generalization than traditional SGD, convergence to sub-optimal solutions on some simple problems, or even non-convergent behavior Wilson et al. (2017); Reddi et al. (2018); Luo et al. (2019). AMSGrad Reddi et al. (2018) was proposed to stabilize Adam by computing the adaptive learning rate with an update rule that guarantees monotonically decaying adaptive learning rates for each coordinate. AdaBound Luo et al. (2019) clips the adaptive learning rate of Adam with a decreasing upper bound and an increasing lower bound, so that it converges to SGD in the final stage of training. However, to our knowledge, neither of the two approaches has been deployed to enhance Adam on recent large-scale problems such as training Transformer-based language models Devlin et al. (2019); Liu et al. (2019); Lan et al. (2020); Raffel et al. (2019); Zhu et al. (2020). The stochastic gradients of Transformer loss functions exhibit heavy-tailed statistics, making SGD unstable unless training begins with a small “warmup” learning rate, and Zhang et al. (2019b) had to use clipping to stabilize SGD for training Transformers, suggesting that the strategy of AdaBound might fail on Transformers since it transitions to SGD. RAdam Liu et al. (2020) was recently proposed to free Adam from the warmup schedule for training Transformers, but its variance rectification term does not depend on the observed gradients during training, and Ma and Yarats (2019) found that using a linear warmup over iterations for Adam achieves almost the same convergence as RAdam.

In this work, we explore a different approach to improve the stability of adaptive learning rates. We propose Maximum Variation Averaging (MVA), which computes the running average of squared gradients using dynamic, rather than constant, coordinate-wise weights. These weights are chosen so that the estimated variance of gradients is maximized. The MVA weights for maximizing this variance have a simple closed-form solution that requires little storage or computational cost, yet is able to improve the test set performance of Adam on a variety of CV and NLP tasks.

2 Background & Definitions

We introduce some preliminary definitions before diving into the proposed method. By default, all vector-vector operators will be element-wise in the following sections. Let

be the parameters of the network to be trained, is the loss of the model with parameters evaluated at . Our goal is to minimize the expected risk on the data distribution defined as:

(1)

In most deep learning problems, only a finite number of potentially noisy samples can be used to approximate Eq. 1, and the gradients are computed on randomly sampled minibatches during training. Stochastic regularization like Dropout Srivastava et al. (2014) further adds to the randomness of the gradients and is commonly used with Transformers Vaswani et al. (2017). Thus, it is important to design optimizers that tolerate noisy gradients.

Adam Kingma and Ba (2015) is an effective optimizer that adapts to such noisy gradients. It keeps exponential moving averages of the gradient and its square, m_t and v_t, defined as:

m_t = β_1 m_{t-1} + (1 − β_1) g_t,    v_t = β_2 v_{t-1} + (1 − β_2) g_t²,   (2)

where g_t is the stochastic gradient at step t, β_1, β_2 ∈ [0, 1), and m_0 = v_0 = 0. Parameters are updated by:

θ_t = θ_{t-1} − α_t m̂_t / (√v̂_t + ε),   (3)

where m̂_t = m_t / (1 − β_1^t) and v̂_t = v_t / (1 − β_2^t) are the bias-corrected estimates and ε is typically a small constant for numerical stability. If we assume that the distribution of the stochastic gradient is constant within the effective horizon of the running average Balles and Hennig (2018), an assumption that is more accurate when the model is closer to convergence, then m̂_t and v̂_t will be estimates of the first and second moment of the gradient g_t. Specifically, we analyze adaptive learning rates through the lens of the following assumption:

Assumption 1

Let σ_t² be the variance of the stochastic gradient g_t. At time t, assume m̂_t = E[g_t] and v̂_t = E[g_t²] = E[g_t]² + σ_t².

Under this assumption, the update step m̂_t / √v̂_t of Adam can be seen as an approximation to the Signal-to-Noise Ratio (SNR) of the gradient, which results in smaller steps when the SNR is low Kingma and Ba (2015). RMSProp Tieleman and Hinton (2012) and other variants that divide the update steps by √v̂_t can also be seen as adapting to the gradient variance under the same assumption Balles and Hennig (2018). These adaptive methods take smaller step sizes when the estimated variance is high. Higher local gradient variance indicates higher local curvature, and in certain quadratic approximations to the loss function, this variance is directly proportional to the curvature Schaul et al. (2013) (Eq. 13 of our paper). Therefore, similar to a diagonal approximation to Newton's method, using smaller learning rates in these directions can improve the convergence of first-order methods.
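
For concreteness, one step of the update in Eq. 2 and Eq. 3 can be written in a few lines of code. The sketch below is a minimal NumPy illustration with our own naming, not the authors' implementation.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on parameters theta given the stochastic gradient g (Eq. 2-3)."""
    m = beta1 * m + (1 - beta1) * g          # running average of the gradient
    v = beta2 * v + (1 - beta2) * g ** 2     # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```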

However, the adaptive learning rates of Adam and RMSProp can take extreme values, making them unable to converge to the desired solution of some simple problems Wilson et al. (2017); Chen et al. (2019). Reddi et al. (2018) gave one such counterexample, where gradients in the correct direction are large but occur at a low frequency, and Adam converges to the solution with maximum regret. They solve this issue by keeping track of the maximum of the second-moment estimate for each coordinate throughout training with a new variable, and by computing the adaptive learning rate from this maximum, which enforces monotonically decreasing learning rates. Extremely small adaptive learning rates can also cause undesirable convergence behavior, as demonstrated by a counterexample in Luo et al. (2019).
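
For comparison with the Adam sketch above, the AMSGrad fix only changes the denominator by tracking the element-wise maximum of the second-moment estimate; a minimal sketch under the same assumed interface (our own naming, variants differ slightly in how they handle bias correction):

```python
import numpy as np

def amsgrad_step(theta, g, m, v, v_max, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step: Adam with a max-tracked, non-increasing adaptive learning rate."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_max = np.maximum(v_max, v / (1 - beta2 ** t))   # monotonically non-decreasing
    theta = theta - lr * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```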

3 Maximizing the Variance of Running Estimations

Algorithm 1 MAdam
1: Input: learning rate α_t, momentum coefficient β_1, bounds β_min < β_max, constant ε
2: Set m_0 = m̃_0 = ṽ_0 = ũ_0 = 0
3: for t = 1 to T do
4:     Draw samples S_t from the training set
5:     Compute g_t = ∇_θ ℓ(θ_{t-1}; S_t)
6:     m_t = β_1 m_{t-1} + (1 − β_1) g_t
7:     β_t = Clip(β*_t, β_min, β_max)    ▷ see Eq. 8
8:     m̃_t = β_t m̃_{t-1} + (1 − β_t) g_t
9:     ṽ_t = β_t ṽ_{t-1} + (1 − β_t) g_t²
10:    ũ_t = β_t ũ_{t-1} + (1 − β_t)
11:    v̂_t = ṽ_t / ũ_t
12:    θ_t = θ_{t-1} − α_t m_t / ((1 − β_1^t)(√v̂_t + ε))

Algorithm 2 LaMAdam
1: Input: learning rate α_t, momentum coefficient β_1, bounds β_min < β_max, constant ε
2: Set m_0 = m̃_0 = ṽ_0 = ũ_0 = 0
3: for t = 1 to T do
4:     Draw samples S_t from the training set
5:     Compute g_t = ∇_θ ℓ(θ_{t-1}; S_t)
6:     β_t = Clip(β*_t, β_min, β_max)    ▷ see Eq. 8
7:     m̃_t = β_t m̃_{t-1} + (1 − β_t) g_t
8:     ṽ_t = β_t ṽ_{t-1} + (1 − β_t) g_t²
9:     ũ_t = β_t ũ_{t-1} + (1 − β_t)
10:    v̂_t = ṽ_t / ũ_t
11:    m_t = β_1 m_{t-1} + (1 − β_1) g_t / (√v̂_t + ε)
12:    θ_t = θ_{t-1} − α_t m_t / (1 − β_1^t)

Motivation. We propose to mitigate the undesirable convergence behavior of Adam by changing the constant running-average coefficient β_2 for the second moment into an adaptive one, β_t. The idea is to let β_t adopt the value β*_t that maximizes the estimated variance of each coordinate of the gradient at each iteration t. Our algorithm can then use β*_t as the adaptive running-average coefficient to take steps that are conservative enough to avoid instability but aggressive enough to make progress.

Maximum Variation Averaging. Formally, we estimate the variance of the gradient at each coordinate by keeping track of the zeroth, first, and second moments of the gradient as functions of the adaptive running-average coefficient β_t, denoted as ũ_t(β_t), m̃_t(β_t) and ṽ_t(β_t), respectively:

ũ_t(β_t) = β_t ũ_{t-1} + (1 − β_t),   (4)
m̃_t(β_t) = β_t m̃_{t-1} + (1 − β_t) g_t,   (5)
ṽ_t(β_t) = β_t ṽ_{t-1} + (1 − β_t) g_t².   (6)

The zeroth moment ũ_t tracks the total weight used for averaging, and is used to normalize m̃_t and ṽ_t to achieve unbiased estimates m̂_t = m̃_t / ũ_t and v̂_t = ṽ_t / ũ_t for the first and second moments Kingma and Ba (2015).

Then, the unbiased local estimate of the gradient variance is σ̂²_t(β_t) = v̂_t(β_t) − m̂_t(β_t)². Through an argmax optimization, we find the β*_t that achieves the worst-case (maximal) variance for each coordinate:

β*_t = argmax_{β_t} [ v̂_t(β_t) − m̂_t(β_t)² ].   (7)

We call our approach of finding the adaptive running-average coefficient Maximum Variation Averaging (MVA). We plug MVA into Adam and its variant LaProp Ziyin et al. (2020), which results in two novel algorithms, MAdam and LaMAdam, listed in Algorithm 1 and Algorithm 2. LaProp uses a local running estimate of the variance to normalize the gradients before taking the running average, which results in higher empirical stability under various hyperparameters. Note that we only use the MVA formula for the second moment used for scaling the learning rate; m_t is still an exponential moving average (with a constant coefficient β_1) of the gradient for MAdam, or of the normalized gradient for LaMAdam.
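
For reference, LaProp's reordering (normalize the gradient by the local second-moment estimate first, then take the momentum) can be sketched as below; this is our own simplified reading of Ziyin et al. (2020), with bias corrections omitted for brevity.

```python
import numpy as np

def laprop_step(theta, g, m, v, lr=4e-4, beta1=0.9, beta2=0.999, eps=1e-15):
    """One simplified LaProp step: momentum is taken over normalized gradients."""
    v = beta2 * v + (1 - beta2) * g ** 2     # local running estimate of the squared gradient
    g_norm = g / (np.sqrt(v) + eps)          # normalize before the momentum average
    m = beta1 * m + (1 - beta1) * g_norm
    theta = theta - lr * m
    return theta, m, v
```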

Finding β*_t via a closed-form solution. The maximization in Eq. 7 is a quadratic-over-quadratic problem in β_t and has a relatively simple closed-form solution that produces the maximal variance for each coordinate. This is given by

β*_t = ( σ̂²_{t-1} + (g_t − m̂_{t-1})² ) / ( (1 − ũ_{t-1}) σ̂²_{t-1} + (1 + ũ_{t-1}) (g_t − m̂_{t-1})² ),   (8)

where we have abbreviated ũ_{t-1}(β_{t-1}), m̂_{t-1}(β_{t-1}) and σ̂²_{t-1}(β_{t-1}) into ũ_{t-1}, m̂_{t-1} and σ̂²_{t-1}, and will use this abbreviation by default in the following sections. We defer the derivation of Eq. 8 to Appendix A.
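
To make the closed form concrete, the sketch below computes β*_t from the running statistics following our reconstruction of Eq. 8, and cross-checks it against a brute-force maximization of the variance objective in Eq. 7 on a single coordinate; all names and example numbers are ours.

```python
import numpy as np

def beta_star(g, m_hat, v_hat, u, delta=1e-12):
    """Closed-form maximizer of the estimated variance (our reading of Eq. 8)."""
    sigma2 = v_hat - m_hat ** 2                      # previous variance estimate
    dev2 = (g - m_hat) ** 2                          # squared deviation of the new gradient
    return (sigma2 + dev2) / ((1 - u) * sigma2 + (1 + u) * dev2 + delta)

def variance_after(beta, g, m_tilde, v_tilde, u_tilde):
    """Variance estimate v_hat - m_hat^2 after one update with coefficient beta (Eq. 4-7)."""
    u = beta * u_tilde + (1 - beta)
    m = beta * m_tilde + (1 - beta) * g
    v = beta * v_tilde + (1 - beta) * g ** 2
    return v / u - (m / u) ** 2

# Brute-force check on one coordinate: previous (unnormalized) statistics and a new gradient.
u_prev, m_prev, v_prev, g_new = 0.7, 0.14, 0.21, 1.5
betas = np.linspace(1e-3, 0.999, 100000)
numeric = betas[np.argmax(variance_after(betas, g_new, m_prev, v_prev, u_prev))]
closed = beta_star(g_new, m_prev / u_prev, v_prev / u_prev, u_prev)
print(numeric, closed)   # the two maximizers agree (up to grid resolution), before clipping
```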

Practical notes. We apply MVA in every step except for the first, where the observed gradient variance is 0. The coefficient β_1 for the first moment m_t is set to a constant that is the same as typical values for Adam, and for Algorithm 1 and Algorithm 2 we define, respectively,

m_t = β_1 m_{t-1} + (1 − β_1) g_t   (MAdam),    m_t = β_1 m_{t-1} + (1 − β_1) g_t / (√v̂_t + ε)   (LaMAdam).   (9)

To obtain a valid running average, we clip β*_t so that β_min ≤ β_t ≤ β_max, where typical values are β_max = 0.98 or 0.999. For convenience, we set β_min = 1/2 by default. For β_t ≤ β_max < 1, ũ_t will monotonically increase from its initial value ũ_0 = 0 toward 1. Before clipping, for any g_t with σ̂²_{t-1} > 0 in Eq. 8, we have β*_t > 1/(1 + ũ_{t-1}) ≥ 1/2. As a result, the lower bound that we use (β_min = 1/2) is tight and does not really change the value of β_t, and as ũ_{t-1} → 1, 1/(1 + ũ_{t-1}) → 1/2. We have a special case at σ̂²_{t-1} = 0, where β*_t is a constant 1/(1 + ũ_{t-1}).

In practice, we also add a small coefficient δ to the denominator of Eq. 8 to prevent division by zero, which has a negligible effect on the value of β_t and does not violate the maximum variation objective (Eq. 7). All the derivations for these conclusions are deferred to Appendix B.
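
Putting the pieces together, here is a minimal single-tensor sketch of one MAdam step as we read Algorithm 1, including the clipping and the small δ described above; this is our own code and naming, not the reference implementation, and bias correction of the momentum follows Adam.

```python
import numpy as np

def madam_step(theta, g, m, m_tilde, v_tilde, u_tilde, t,
               lr=1e-3, beta1=0.9, beta_min=0.5, beta_max=0.999,
               eps=1e-8, delta=1e-12):
    """One MAdam step (a sketch of Algorithm 1, not the reference implementation)."""
    # Momentum on the raw gradient, as in Adam.
    m = beta1 * m + (1 - beta1) * g
    if t == 1:
        beta = beta_max                          # first step: no variance observed yet
    else:
        # Maximum Variation Averaging coefficient (Eq. 8), clipped to [beta_min, beta_max].
        m_hat_prev = m_tilde / u_tilde
        sigma2 = v_tilde / u_tilde - m_hat_prev ** 2
        dev2 = (g - m_hat_prev) ** 2
        beta = (sigma2 + dev2) / ((1 - u_tilde) * sigma2 + (1 + u_tilde) * dev2 + delta)
        beta = np.clip(beta, beta_min, beta_max)
    # Weighted running statistics (Eq. 4-6) and the normalized second moment.
    u_tilde = beta * u_tilde + (1 - beta)
    m_tilde = beta * m_tilde + (1 - beta) * g
    v_tilde = beta * v_tilde + (1 - beta) * g ** 2
    v_hat = v_tilde / u_tilde
    # Parameter update with the bias-corrected momentum, as in Eq. 3.
    theta = theta - lr * (m / (1 - beta1 ** t)) / (np.sqrt(v_hat) + eps)
    return theta, m, m_tilde, v_tilde, u_tilde
```

A LaMAdam step would differ only in that the momentum m is taken over g / (√v̂_t + ε) and the final division by √v̂_t is dropped, as in LaProp.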

The effect of Maximum Variation Averaging. In most cases, σ̂²_{t-1} > 0. If we define a new variable ρ_t = (g_t − m̂_{t-1})² / σ̂²_{t-1}, which represents the degree of deviation of the gradient g_t from the current estimated average m̂_{t-1}, we can rewrite Eq. 8 as

β*_t = (1 + ρ_t) / ( (1 − ũ_{t-1}) + (1 + ũ_{t-1}) ρ_t ).   (10)

From Eq. 10, we can see that β*_t monotonically decreases from 1/(1 − ũ_{t-1}) to 1/(1 + ũ_{t-1}) as ρ_t increases from 0 to ∞, and equals 1 when ρ_t = 1. As a result, for each entry, if ρ_t ≤ 1, i.e., the deviation of the gradient g_t from the current running mean m̂_{t-1} is within the estimated standard deviation σ̂_{t-1}, we will use β_max to update the statistics, which is the slowest change we allow for v̂_t. If g_t deviates much more than σ̂_{t-1}, MVA will find a smaller β_t and therefore put a higher weight on g_t², to adapt to the change faster. This allows a quick response that impedes abnormally large gradients, which enables better handling of the heavy-tailed distribution of gradients in the process of training Transformers Zhang et al. (2019b). As a side effect, v̂_t tends to be larger than that of Adam/LaProp with a constant β_2, but as we will show in the experiments, using a larger learning rate counters this effect and achieves better results.

On the other hand, when the variance decreases in the later phase of training, ρ_t tends to stay within 1, and MVA tends to use the slowest rate β_max for decreasing v̂_t. This allows large values of v̂_t to persist for a longer horizon even compared with setting β_2 to a constant β_max on the same gradient sequence, since we have assigned more mass to large gradients, which can be seen as an adaptive version of AMSGrad.

4 Analysis with Simulations

4.1 Convergence with Stochastic Gradients

Figure 1: Objective value, Term A, Term B and the remaining quantity of Theorem 3.1 of Chen et al. (2019) for Adam, AMSGrad and MAdam on the problem defined in Eq. 11. The constant learning rate is cancelled here since we only care about the ratios.

We analyze the convergence of MAdam through an illustrative example from Chen et al. (2019) which simulates the process of stochastic gradient descent. Theoretical results are left as future work, but we evaluate the terms of the convergence rate obtained in Theorem 3.1 of Chen et al. (2019) to shed some light on the theoretical analysis. Formally, we consider the problem where

(11)

At every step, a random index is sampled uniformly from , and the randomness of this problem only comes from stochastic sampling. The only stationary point where is . We compare Adam, AMSGrad and MAdam on optimizing this objective, where we use constant learning rates in every step, and set for Adam and AMSGrad, and for MAdam. Adam never converged for a variety of we tried within , consistent with Chen et al. (2019). Generally, a larger gives faster convergence for both AMSGrad and MAdam. For reproducibility, we repeat experiments 100 times with the same settings, and choose the for AMSGrad and MAdam where the solution every time. satisfies this requirement for MAdam, but AMSGrad only satisfied it 1% of the times for and 65% of the times for . is the largest we find for AMSGrad to achieve satisfaction. Therefore, we use for both Adam and AMSGrad.

We plot the values of the objective, Term A, Term B and the remaining quantity of Theorem 3.1 of Chen et al. (2019) in Figure 1. We can conclude from the objective value that MAdam fixes the divergence issue of Adam, and converges faster than AMSGrad on this problem. This result is also consistent with Theorem 3.1 of Chen et al. (2019), from the observation that MAdam has a significantly smaller Term A, a Term B that is only slightly larger than AMSGrad's, and a remaining quantity that is only slightly smaller than AMSGrad's. How to combine these terms with the hyperparameters of MVA to prove an improved convergence rate is an interesting future direction.

4.2 Convergence in the Noisy Quadratic Model

Figure 2: Results on the noisy quadratic model. The left figure shows the average loss and its standard error under different learning rates. The figure on the right gives a qualitative example of the trajectories of the two approaches. The best solution of MAdam is better than that of Adam under all learning rates evaluated. Comparing the best solutions of the two methods, MAdam achieves both lower average loss and lower standard error at convergence (2.46e-3 (2.94e-4) vs. 4.05e-3 (4.82e-4) at learning rate 0.005).

We analyze the ability of MAdam to adapt to curvature and noise on the simple but illustrative “noisy quadratic model”, which has been widely adopted for analyzing optimization dynamics Schaul et al. (2013); Wu et al. (2018); Zhang et al. (2019a, c). The loss function is defined as:

L(θ) = (1/2) Σ_i h_i (θ_i − c_i)²,   (12)

where c is a noisy observation of the ground-truth parameter θ*, simulating the gradient noise in stochastic optimization, and h_i represents the curvature of the system in dimension i. In each step, the algorithm takes the following noisy gradient for dimension i as the input:

g_i = h_i (θ_i − c_i) = h_i (θ_i − θ*_i) − h_i ε_i,   (13)

from which we can see the gradient noise h_i ε_i is proportional to the curvature h_i.
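
For completeness, here is a sketch of the noisy gradient oracle of Eq. 13 as we read it; the dimensionality, curvatures h and unit noise scale are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([1.0, 100.0])       # per-dimension curvatures (illustrative, d = 2)
theta_star = np.zeros(2)         # ground-truth parameters

def noisy_gradient(theta):
    """Gradient of 0.5 * sum_i h_i * (theta_i - c_i)^2, where c is a noisy
    observation of theta*; the noise term h_i * eps_i is proportional to h_i (Eq. 13)."""
    c = theta_star + rng.normal(size=theta.shape)    # c_i = theta*_i + eps_i
    return h * (theta - c)
```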

To verify the effectiveness of MVA, we compare MAdam with Adam under a variety of curvatures and noise levels on a 2D problem (d = 2). For each instance of the problem, we test both algorithms with a variety of learning rates. We use the same β_min and β_max for MAdam throughout, and for a fair comparison, we test Adam with several choices of β_2. We repeat the experiments 100 times under each setting, selecting a random initialization of θ each time and running MAdam and Adam with different hyperparameters from this initialization. We take the mean and standard error of the 100 runs for comparison, where MAdam consistently achieves 30% to 40% lower average loss with smaller standard error. Figure 2 shows the results for one of the instances, from which we find that the best result of MAdam is consistently better than that of Adam under all evaluated learning rates, confirming the difference MVA makes by choosing an adaptive β_t. In the qualitative example, MVA also demonstrates smaller variance near convergence, caused by a more aggressive response that impedes the noise with a smaller β_t. We provide more experimental results under other settings in Appendix C.

5 Experiments with Practical Datasets

In this section, we thoroughly evaluate MAdam and LaMAdam on a variety of tasks against well-calibrated baselines: CIFAR10/100 and ImageNet for image classification, IWSLT'14 DE-EN/WMT'16 EN-DE for neural machine translation, and the GLUE benchmark for natural language understanding. The implementations are based on PyTorch, and we run the experiments on Nvidia V100 GPUs. In both the image and language tasks, after tuning the weight decay carefully, we find that decoupled weight decay Loshchilov and Hutter (2018) gives much better results for Adam, MAdam, LaProp and LaMAdam. Therefore, we use this approach in all of our experiments. Across the plots in this section, we define the average step size at time t as the average of α_t / (√v̂_t + ε) over all entries, where this factor scales the momentum for Adam/MAdam and scales the gradient before the momentum is taken for LaProp/LaMAdam.
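
For reference, decoupled weight decay applies the decay to the parameters directly rather than adding it to the gradient that feeds the adaptive statistics; a one-line sketch of the convention we assume (following Loshchilov and Hutter (2018), with our own naming):

```python
def apply_decoupled_weight_decay(theta, lr, weight_decay):
    """Shrink the parameters directly (AdamW-style), so the decay is not rescaled
    by the adaptive per-coordinate step size."""
    return theta * (1.0 - lr * weight_decay)
```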

5.1 Image Classification

Figure 3: Training loss, test accuracy and average step size on CIFAR10.
Model CIFAR-10 CIFAR-100 ImageNet
SGD 95.44 (.04) 79.62 (.07) 70.18
Adam 95.37 (.03) 78.77 (.07) 66.54
LaProp 95.34 (.03) 78.36 (.07) 70.02
MAdam 95.51 (.09) 79.32 (.08) 69.96
LaMAdam 95.38 (.11) 79.21 (.11) 70.16
Table 1: Accuracies on CIFAR10/100 and ImageNet. The Adam result on ImageNet is copied from Liu et al. (2020). CIFAR10/100 results are the median (standard error) over 4 runs.

To evaluate the effectiveness of MVA for image classification, we compare with SGD, Adam and LaProp in training ResNet18 He et al. (2016) on CIFAR10, CIFAR100 and ImageNet. On all datasets, we perform a grid search for the learning rate and weight decay and report the best results for each method in Table 1. For CIFAR10/100, we adopt the ResNet18 from a public repository (https://github.com/kuangliu/pytorch-cifar).

We use a batch size of 128 and train the model for 200 epochs. Instead of the multistep schedule, we find the cosine learning rate schedule gives better results for both SGD and the adaptive methods, and we set a final learning rate of 2e-6 in all cases. We also find AMSGrad Reddi et al. (2018) improves the classification accuracy of all adaptive methods evaluated on CIFAR10/100, and we apply AMSGrad in all experiments with adaptive methods. Further details are in Appendix D. On ImageNet, we use the implementation from torchvision and the default learning rate schedule, multiplying the learning rate by 0.1 every 30 epochs and training for a total of 90 epochs with a batch size of 256. We do not use AMSGrad in this case.

Although we achieve a marginal improvement on CIFAR10, adaptive methods often cannot beat carefully tuned SGD on CIFAR100 and ImageNet when training popular architectures such as ResNet, as confirmed by results such as Wilson et al. (2017); Zhang et al. (2019c); Liu et al. (2020). Nevertheless, with the proposed MVA, we shrink the gap between adaptive methods and carefully tuned SGD for training convolutional networks on these image classification datasets, and achieve a top-1 accuracy very close to SGD on ImageNet.

5.2 Neural Machine Translation

Figure 4: Training loss, validation BLEU and average step size on IWSLT'14 DE-EN, trained with learning rate 5e-4, weight decay 1e-2, β=0.999 for LaProp; learning rate 1.5e-3, weight decay 1e-2, β_min=0.5, β_max=0.999 for LaMAdam; and learning rate 4.375e-4, weight decay 1e-2, β=0.999 for the rescaled LaProp run.

We train Transformers Vaswani et al. (2017) from scratch with LaProp and LaMAdam on the IWSLT'14 German-to-English (DE-EN) translation dataset Cettolo et al. (2014) and the WMT'16 English-to-German (EN-DE) translation dataset, based on the implementation of fairseq (https://github.com/pytorch/fairseq). We do not compare with SGD, since it is unstable for Transformers Zhang et al. (2019b). IWSLT'14 DE-EN has 160k training examples, while WMT'16 EN-DE has 4.5M training examples.

For IWSLT'14 DE-EN, we use 512-dimensional word embeddings and 6 Transformer blocks with 4 attention heads and 1024 FFN dimensions for the encoder/decoder. To demonstrate the full potential of adaptive methods under constant learning rates, we use the tri-stage learning rate schedule Park et al. (2019): we linearly increase the learning rate to the full learning rate in 4k iterations, hold it there for 32k iterations, and exponentially decay it in the final 24k iterations. We train a total of 60k iterations, during which each minibatch has up to 4096 tokens. Results are summarized in Table 2, where the baseline's BLEU score is already 1.22 higher than the best result reported in Liu et al. (2020) using the same model. Figure 4 shows the training dynamics of LaProp and LaMAdam. Despite using a 3× higher learning rate, LaMAdam has a smaller average update size, yet it shows slightly better convergence on the training set and better validation BLEU. This may be explained by the heavy-tailed distribution of the gradient when training Transformers from scratch Zhang et al. (2019b), with smaller step sizes mitigating the effect of extreme gradient values on the model's performance. It is worth mentioning that LaProp diverges when using the large learning rate 1.5e-3.

Method IWSLT’14 DE-EN WMT’16 EN-DE
LaProp 35.98 (0.06) 27.02
LaMAdam 36.09 (0.04) 27.11
Table 2: BLEU score of LaProp and LaMAdam for training transformers on machine translation datasets. We report the median and standard error for IWSLT’14 over 5 runs.

Further, we find that LaProp with learning rate 4.375e-4 produces a step size curve similar to LaMAdam's, but with weaker performance than LaMAdam. LaMAdam uses the maximum variation rule to select the adaptive learning rate for each dimension, creating a benefit that is not achievable by simply scaling the base learning rate α.

For WMT'16, we aim to evaluate our approaches on large-scale datasets/models as in Ott et al. (2018), and use 1024-dimensional word embeddings and 6 Transformer blocks with 16 attention heads and 4096 FFN dimensions for the encoder/decoder. Each minibatch has up to 480k tokens, we train for 32k iterations, and we use the inverse square root learning rate schedule with a 4k-step warmup Vaswani et al. (2017). We evaluate the single-model BLEU on newstest2013, unlike Liu et al. (2020), where models from the last 20 epochs are averaged to get the results. As shown in Table 2, LaMAdam also achieves better results here. Further implementation details are provided in Appendix E.

5.3 General Language Understanding Evaluation (GLUE)

Figure 5: Training loss, validation accuracy and step size of various optimization methods on SST-2. Adam and LaProp use (α, β) = (1e-5, 0.98), MAdam and LaMAdam use (α, β_min, β_max) = (4e-5, 0.5, 0.98), and the rescaled Adam and LaProp runs use (α, β) = (1.6e-5, 0.98).

To evaluate the efficacy of MVA for transfer learning, we finetune pre-trained language models on the GLUE benchmark Wang et al. (2018). GLUE is a collection of 9 natural language understanding tasks, formulated as classification and regression problems. Following prevalent validation settings Devlin et al. (2019); Lan et al. (2020); Raffel et al. (2019), we finetune the RoBERTa-base model Liu et al. (2019) over 4 runs with the same hyperparameters but different random seeds on the dev sets of 8 of the 9 tasks, and report the median and standard error in Table 3. MAdam and LaMAdam give better scores than the corresponding baselines on the 8 tasks. More experimental details are given in Appendix F.

To highlight the differences between the optimizers for transfer learning, we compare the training loss, dev set accuracy and average step size on SST-2, as shown in Figure 5. Unlike the machine translation experiments, where we train the Transformers from scratch, the adaptive step size of MAdam/LaMAdam is higher here (MAdam/LaMAdam use a 4× larger α than Adam/LaProp and the step size is about 1.8× on GLUE, while on IWSLT'14 the two ratios are 2 and approximately 0.875). Because we start from a pre-trained model, the heavy tail of the gradient is alleviated, just as for the BERT model in the later stage of training shown in Zhang et al. (2019b), but MVA still helps in this situation. As in the machine translation experiments, the highest test accuracy of Adam/LaProp cannot reach that of MAdam/LaMAdam by simply scaling the base learning rate to reach step sizes similar to MAdam/LaMAdam's.

Method MNLI QNLI QQP RTE SST-2 MRPC CoLA STS-B
(Acc) (Acc) (Acc) (Acc) (Acc) (Acc) (Mcc) (Pearson)
Reported Liu et al. (2019) 87.6 92.8 91.9 78.7 94.8 90.2 63.6 91.2
Adam 87.70 (.03) 92.85 (.06) 91.80 (.03) 79.25 (.71) 94.75 (.08) 88.50 (.24) 61.92 (1.1) 91.17 (.13)
LaProp 87.80 (.04) 92.85 (.13) 91.80 (.03) 78.00 (.46) 94.65 (.11) 89.20 (.20) 63.01 (.61) 91.17 (.06)
MAdam 87.90 (.08) 92.95 (.07) 91.85 (.03) 79.60 (.66) 94.85 (.12) 89.70 (.17) 63.33 (.60) 91.28 (.03)
LaMAdam 87.80 (.03) 93.05 (.05) 91.85 (.05) 80.15 (.64) 95.15 (.15) 90.20 (.20) 63.84 (.85) 91.36 (.04)
Table 3: Results (median and standard error) on the dev sets of GLUE from finetuning the RoBERTa-base model, over 4 runs with the same hyperparameters but different random seeds.

6 Related Work

Our work is motivated by recent advances in optimizing deep neural networks, particularly Transformers, with adaptive learning rates and warmup heuristics.

Adaptive Learning Rate  Many studies have focused on achieving more reliable convergence with adaptive step sizes Kingma and Ba (2015); Duchi et al. (2011); Tieleman and Hinton (2012); Zeiler (2012). Reddi et al. (2018) proposed to compute the adaptive learning rate with the coordinate-wise maximum value of the running squared gradient. AdaBound Luo et al. (2019) clips the adaptive learning rate of Adam with a decreasing upper bound and an increasing lower bound. Lookahead Zhang et al. (2019c) computes weight updates by looking ahead at the sequence of “fast weights” generated by another optimizer. LaProp Ziyin et al. (2020) uses a local running estimate of the variance to normalize the gradients, resulting in higher empirical stability. You et al. (2017) proposed Layer-wise Adaptive Rate Scaling (LARS) and scaled the batch size to 16,384 for training ResNet50. LAMB You et al. (2020) was proposed to improve LARS for training BERT. Different from these methods, we focus on finding adaptive averaging coefficients.

Learning Rate Warmup  It has been observed that adaptive learning schemes may lead to bad local optima and must be adjusted through linear learning rate scaling and warmup heuristics Goyal et al. (2017); Gotmare et al. (2019). Similar phenomena are observed in other natural language processing tasks Bogoychev et al. (2018); Vaswani et al. (2017). Most theoretical analyses of linear learning rate scaling consider stochastic gradient descent only Hoffer et al. (2017); Smith et al. (2018). Our approach uses a more aggressive averaging coefficient and a smaller learning rate for abnormal gradients, which stabilizes training adaptively.

7 Conclusion

In this paper, we present Maximum Variation Averaging (MVA), a novel adaptive learning rate scheme that replaces the exponential running average of the squared gradient with an adaptive weighted mean. In each step, MVA chooses the weight for each coordinate such that the estimated gradient variance is maximized. This enables MVA to take smaller steps when large curvatures or abnormal gradients are present, which leads to more desirable convergence behavior in the face of gradient noise. We illustrate how our proposed methods improve convergence through better adaptation to variance, and demonstrate strong empirical results on a wide range of tasks including image classification, neural machine translation and natural language understanding.

8 Broader Impacts

Transformer models are quickly becoming a central tool for analyzing text, and have the potential to make web content better moderated, more accessible to readers who speak under-resourced languages, and more available to readers with visual impairments. Unfortunately, the development of Transformer models is slowed by the difficulty and expense of optimizing this relatively new class of models, and by their reliance on adaptive optimizers that are not well understood. As a result, research in this area is largely conducted by large companies with huge computing resources for tuning optimization parameters.

Our goal with this work is to take a step towards making adaptive optimizers more reliable, faster, and easier to use. While research in this direction is fairly technical, it has the broader potential to contribute to a better internet with better translation and moderation tools. It also has the potential to help democratize language modeling, making it more accessible to organizations with modest computing budgets.

References

  • [1] L. Balles and P. Hennig (2018) Dissecting adam: the sign, magnitude and variance of stochastic gradients. In ICML, pp. 404–413. Cited by: §2, §2.
  • [2] N. Bogoychev, K. Heafield, A. F. Aji, and M. Junczys-Dowmunt (2018) Accelerating asynchronous stochastic gradient descent for neural machine translation. In EMNLP, Cited by: §6.
  • [3] M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico (2014) Report on the 11th iwslt evaluation campaign, iwslt 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, Vol. 57. Cited by: §5.2.
  • [4] X. Chen, S. Liu, R. Sun, and M. Hong (2019) On the convergence of a class of adam-type algorithms for non-convex optimization. ICLR. Cited by: §2, Figure 1, §4.1, §4.1.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186. Cited by: §1, §5.3.
  • [6] J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, pp. 2121–2159. Cited by: §1, §6.
  • [7] A. Gotmare, N. S. Keskar, C. Xiong, and R. Socher (2019) A closer look at deep learning heuristics: learning rate restarts, warmup and distillation. In ICLR, Cited by: §6.
  • [8] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv:1706.02677. Cited by: §6.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §5.1.
  • [10] E. Hoffer, I. Hubara, and D. Soudry (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Neurips, pp. 1731–1741. Cited by: §6.
  • [11] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §1, §1, §2, §2, §3, §6.
  • [12] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: a lite BERT for self-supervised learning of language representations. ICLR. Cited by: §1, §5.3.
  • [13] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2020) On the variance of the adaptive learning rate and beyond. ICLR. Cited by: Appendix D, §1, §5.1, §5.2, §5.2, Table 1.
  • [14] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv:1907.11692. Cited by: Appendix F, §1, §5.3, Table 3.
  • [15] I. Loshchilov and F. Hutter (2018) Fixing weight decay regularization in adam. Cited by: §5.
  • [16] L. Luo, Y. Xiong, Y. Liu, and X. Sun (2019) Adaptive gradient methods with dynamic bound of learning rate. ICLR. Cited by: §1, §2, §6.
  • [17] J. Ma and D. Yarats (2019) On the adequacy of untuned warmup for adaptive optimization. arXiv:1910.04209. Cited by: §1.
  • [18] M. Ott, S. Edunov, D. Grangier, and M. Auli (2018) Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation, pp. 1–9. Cited by: §5.2.
  • [19] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. Interspeech, pp. 2613–2617. Cited by: §5.2.
  • [20] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683. Cited by: §1, §5.3.
  • [21] S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of adam and beyond. ICLR. Cited by: §1, §2, §5.1, §6.
  • [22] T. Schaul, S. Zhang, and Y. LeCun (2013) No more pesky learning rates. In ICML, pp. 343–351. Cited by: §2, §4.2.
  • [23] S. L. Smith, P. Kindermans, and Q. V. Le (2018) Don’t decay the learning rate, increase the batch size. In ICLR, Cited by: §6.
  • [24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2.
  • [25] T. Tieleman and G. Hinton (2012) Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. Note: COURSERA: Neural Networks for Machine Learning Cited by: §1, §2, §6.
  • [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: Adaptive Learning Rates with Maximum Variation Averaging, §2, §5.2, §5.2, §6.
  • [27] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. EMNLP, pp. 353. Cited by: §5.3.
  • [28] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht (2017) The marginal value of adaptive gradient methods in machine learning. In Neurips, pp. 4148–4158. Cited by: §1, §2, §5.1.
  • [29] Y. Wu, M. Ren, R. Liao, and R. Grosse (2018) Understanding short-horizon bias in stochastic meta-optimization. arXiv:1803.02021. Cited by: §4.2.
  • [30] Y. You, I. Gitman, and B. Ginsburg (2017) Scaling SGD batch size to 32k for imagenet training. CoRR abs/1708.03888. Cited by: §6.
  • [31] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C. Hsieh (2020) Large batch optimization for deep learning: training bert in 76 minutes. In International Conference on Learning Representations, Cited by: §6.
  • [32] M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701. Cited by: §6.
  • [33] G. Zhang, L. Li, Z. Nado, J. Martens, S. Sachdeva, G. Dahl, C. Shallue, and R. B. Grosse (2019) Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. In NeurIPS, pp. 8194–8205. Cited by: §4.2.
  • [34] J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. J. Reddi, S. Kumar, and S. Sra (2019) Why Adam beats SGD for attention models. arXiv:1912.03194. Cited by: §1, §3, §5.2, §5.2, §5.3.
  • [35] M. R. Zhang, J. Lucas, J. Ba, and G. E. Hinton (2019) Lookahead optimizer: k steps forward, 1 step back. In NeurIPS, pp. 9593–9604. Cited by: §4.2, §5.1, §6.
  • [36] C. Zhu, Y. Cheng, Z. Gan, S. Sun, T. Goldstein, and J. Liu (2020) FreeLB: enhanced adversarial training for natural language understanding. In ICLR, Cited by: §1.
  • [37] L. Ziyin, Z. T. Wang, and M. Ueda (2020) Laprop: a better way to combine momentum with adaptive gradient. arXiv:2002.04839. Cited by: Appendix F, §3, §6.

Adaptive Learning Rates with Maximum Variation Averaging (Appendix)

Appendix A Deriving the closed-form solution Eq. 8

Plugging Eq. 4, 5, 6 and the unbiased estimates m̂_t = m̃_t / ũ_t, v̂_t = ṽ_t / ũ_t into Eq. 7, each coordinate is solving the same problem:

max_{β_t} F(β_t) = [ ṽ_t(β_t) ũ_t(β_t) − m̃_t(β_t)² ] / ũ_t(β_t)².   (14)

Let A = ũ_{t-1} ṽ_{t-1} − m̃²_{t-1} and B = ṽ_{t-1} − 2 m̃_{t-1} g_t + ũ_{t-1} g_t²; we can see the numerator of F can be represented as a quadratic function of β_t. Specifically,

ṽ_t(β_t) ũ_t(β_t) − m̃_t(β_t)² = A β_t² + B β_t (1 − β_t).

Meanwhile, the denominator ũ_t(β_t)² is a monotonic function of β_t. Therefore, F has a unique maximum value.

To find the maximum value, we return to Eq. 14, from which we can find a stationary point

β*_t = B / ( (1 + ũ_{t-1}) B − 2A ),   (15)

which is equivalent to Eq. 8 after dividing the numerator and denominator by ũ_{t-1} and writing A and B in terms of m̂_{t-1}, σ̂²_{t-1} and g_t.
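
As a sanity check on the algebra above (with our abbreviations A and B), the quadratic form of the numerator and the stationary point in Eq. 15 can be verified symbolically; this snippet reflects our reconstruction of the derivation, not code from the paper.

```python
import sympy as sp

beta, u, m, v, g = sp.symbols('beta u m v g', positive=True)
u_new = beta * u + (1 - beta)
m_new = beta * m + (1 - beta) * g
v_new = beta * v + (1 - beta) * g ** 2
variance = v_new / u_new - (m_new / u_new) ** 2      # objective of Eq. 7 / Eq. 14

A = u * v - m ** 2
B = v - 2 * m * g + u * g ** 2
# The numerator is quadratic in beta: A * beta^2 + B * beta * (1 - beta).
assert sp.expand(v_new * u_new - m_new ** 2 - A * beta ** 2 - B * beta * (1 - beta)) == 0
# The candidate stationary point of Eq. 15 makes the derivative vanish.
candidate = B / ((1 + u) * B - 2 * A)
assert sp.simplify(sp.diff(variance, beta).subs(beta, candidate)) == 0
```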

Appendix B Practical notes on β_t

Claims and arguments:

  1. For β_t ≤ β_max < 1, ũ_t will monotonically increase from its initial value ũ_0 = 0 toward 1.

     This is obvious since in every step, ũ_t is an interpolation between ũ_{t-1} and 1, and ũ_{t-1} < 1. We have also set ũ_0 = 0.

  2. For any g_t with σ̂²_{t-1} > 0 in Eq. 8, we have β*_t > 1/(1 + ũ_{t-1}).

     Eq. 10 is monotonic in ρ_t. Since g_t can take any value, ρ_t can take any value from 0 to ∞. If ρ_t = 0, β*_t takes the largest value 1/(1 − ũ_{t-1}). If ρ_t → ∞, β*_t → 1/(1 + ũ_{t-1}).

  3. As ũ_{t-1} → 1, β*_t → (1 + ρ_t)/(2 ρ_t) and 1/(1 + ũ_{t-1}) → 1/2.
     This follows by combining Claims 1 and 2.

  4. Adding a small coefficient δ to the denominator of Eq. 8 has a negligible effect on the value of β_t and does not violate the maximum variation objective (Eq. 7).
     Since δ is small, it has a negligible effect on β*_t when division by zero does not happen. We only need to confirm that adding δ will not affect the solution when division by zero happens. We can re-write the denominator of Eq. 8 as

     (1 − ũ_{t-1}) σ̂²_{t-1} + (1 + ũ_{t-1}) (g_t − m̂_{t-1})².   (16)

     Since 0 ≤ ũ_{t-1} ≤ 1, we can conclude that Eq. 16 is non-negative.

     When ũ_{t-1} < 1, Eq. 16 can be 0 only when σ̂²_{t-1} = 0 and g_t = m̂_{t-1}. In this special case, we can set β_t to any value in [β_min, β_max] without changing m̂_t; we will always have m̂_t = m̂_{t-1} = g_t and σ̂²_t = 0. Only ũ_t is affected by β_t, and it takes a larger value when β_t is smaller. The solution given by adding δ to the denominator is β*_t = 0, and the following clipping will set β_t = β_min, resulting in the largest possible ũ_t. In the next step, if Eq. 16 is not zero, then we have g_{t+1} ≠ m̂_t, and we know ρ_{t+1} = ∞ since σ̂²_t = 0 (otherwise we would still have g_{t+1} = m̂_t and Eq. 16 would again be 0). In this case, the estimated variance at the next step is non-decreasing in ũ_t, so setting ũ_t to its maximum achieves the maximum variance at the next step. Otherwise, if Eq. 16 is zero again, this choice does not change the estimates.

     When ũ_{t-1} = 1, Eq. 16 is 0 if and only if g_t = m̂_{t-1}. As a result, if σ̂²_{t-1} = 0, we have the same conclusion as before. Otherwise, β*_t is unbounded before clipping and β_t = β_max after clipping. Also, any β_t will not change the value of ũ_t, since it remains 1. Since g_t = m̂_{t-1}, to maximize the estimated variance σ̂²_t = β_t σ̂²_{t-1}, we should set β_t to its maximum value, which is consistent with the solution after adding δ to the denominator.

Appendix C Additional Experimental Results on the Noisy Quadratic Model

Figure 6: More results on the Noisy Quadratic Model.

In this section, we give more results comparing Adam and MAdam on the noisy quadratic model. The results are shown in Figure 6. Generally, the best result of MAdam has a more significant margin when the curvatures and noise levels are higher, i.e., the improvement is more significant when the problem is worse conditioned and the noise level is higher. Note that for each trial, we start both algorithms from the same random initialization.

Appendix D Additional Details of Experiments on Image Classification

For Adam and LaProp, we set β_2 to a fixed value, and for MAdam and LaMAdam, we set β_min and β_max to fixed values, in all cases. On CIFAR10 and CIFAR100, we use random cropping (4-pixel zero padding on each side) and random horizontal flips as the data augmentations. On ImageNet, we use random resized crops and random horizontal flips for data augmentation. For each optimizer, we do a grid search over the learning rate and weight decay for the best hyperparameters.

Hyperparameters of CIFAR10.

Except for SGD, we tried learning rates from {5e-4, 1e-3, 2e-3, 3e-3, 4e-3, 6e-3, 8e-3} and weight decay from {0.025, 0.05, 0.1, 0.2, 0.4, 0.8, 1 }. The best learning rate and weight decay for Adam, LaProp, MAdam and LaMAdam are (3e-3, 0.2), (1e-3, 0.4), (8e-3, 0.05) and (6e-3, 0.05) respectively. As to SGD, we tried learning rates from {3e-2, 5e-2, 1e-1, 2e-1, 3e-1} and weight decays from {1e-4, 3e-4, 5e-4, 1e-3, 2e-3}, and the best result was achieved with learning rate 2e-1 and weight decay 3e-4. These hyperparameters that gave the best results are also the hyperparameters we used for plotting Figure 3.

Hyperparameters for CIFAR100.

We use the same grid search configurations as for CIFAR10. The best learning rate and weight decay for Adam, LaProp, MAdam and LaMAdam are (2e-3, 0.4), (5e-4, 1), (4e-3, 0.2) and (3e-3, 0.2) respectively. For SGD, the best learning rate and weight decay are 3e-2 and 2e-3 respectively.

Hyperparameters for ImageNet.

Due to the heavy workload and time limit, we were not able to complete 4 runs for each hyperparameter setting on ImageNet, so we report the best result for each optimizer in Table 1, except for the result of Adam, which is copied from Liu et al. (2020) but uses the same hyperparameters except for the learning rate and weight decay. For LaProp, MAdam and LaMAdam, we choose learning rates from {1e-3, 2e-3, 3e-3, 4e-3, 5e-3, 6e-3, 8e-3} and weight decay from {0.003, 0.006, 0.01, 0.012, 0.02, 0.03}, and found the best combinations for LaProp, MAdam and LaMAdam to be (2e-3, 0.03), (5e-3, 0.012) and (6e-3, 0.012). For SGD, we choose the learning rate from {0.05, 0.1, 0.2} and weight decay from {5e-5, 7e-5, 1e-4}, and found the best combination to be (0.1, 7e-5).

Appendix E Additional Details of Experiments on the Machine Translation

For LaProp, we tried β from {0.98, 0.99, 0.997, 0.999}. We found 0.999 to work the best and used it for all the grid search experiments. For LaMAdam, we set β_min = 0.5 and β_max = 0.999. We use the transformer_iwslt_de_en and transformer_vaswani_wmt_en_de_big architectures defined in fairseq for IWSLT'14 and WMT'16, respectively.

Hyperparameters for IWSLT’14.

We do a grid search over the learning rate and weight decay for both optimizers. We tried learning rates from {2.5e-4, 5e-4, 1e-3, 1.5e-3, 2e-3} and weight decay from {0.0001, 0.001, 0.01, 0.1}. The best combinations for LaProp and LaMAdam are (5e-4, 0.01) and (1.5e-3, 0.01). For the other hyperparameters, we use the default settings of the fairseq example, which sets the dropout probability to 0.3, uses the label-smoothed cross entropy loss with a smoothing coefficient of 0.1, and shares the input and output token embedding parameters.

Hyperparameters for WMT’16.

The default implementation from fairseq does not use weight decay, so we also omit weight decay in all experiments. For LaProp, we use the β that we found to give the best results in all experiments. Each run takes around 8 hours on 16 V100 GPUs. For the grid search, we tried learning rates from {5e-4, 1e-3, 1.5e-3, 2e-3}, and found 1e-3 and 1.5e-3 to work the best for LaProp and LaMAdam, respectively. Other hyperparameters are the defaults of the corresponding fairseq example, which uses a dropout probability of 0.3 and the label-smoothed cross entropy loss with a smoothing coefficient of 0.1, and shares all embedding parameters.

Appendix F Additional Details of Experiments on the GLUE benchmark

It is reported in Liu et al. (2019) that Adam is sensitive to the choice of ε on GLUE. Following their settings, we set ε accordingly for both Adam and MAdam. For LaProp and LaMAdam, however, we always use the same ε as in all other experiments in this paper, which is consistent with the observation in Ziyin et al. (2020) that LaProp is robust to the choice of ε. We set β_2 = 0.98 for Adam and LaProp, and β_max = 0.98 for MAdam and LaMAdam. All other hyperparameters are set to the same values as in the fairseq example (https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md). For each task, we do a grid search over the learning rate and weight decay, chosen from {5e-6, 1e-5, 2e-5, 4e-5, 5e-5, 6e-5} and {0.025, 0.05, 0.1, 0.2} respectively. We list the best combinations for Adam, MAdam, LaProp and LaMAdam on each task below:

  • (1e-5, 0.1), (1e-5, 0.1), (4e-5, 0.025), (4e-5, 0.025).

  • (1e-5, 0.1), (1e-5, 0.1), (4e-5, 0.025), (4e-5, 0.025).

  • (1e-5, 0.1), (1e-5, 0.1), (4e-5, 0.05), (4e-5, 0.05).

  • (1e-5, 0.1), (1e-5, 0.1), (4e-5, 0.1), (4e-5, 0.1).

  • (2e-5, 0.1), (2e-5, 0.1), (6e-5, 0.1), (6e-5, 0.1).

  • (1e-5, 0.1), (1e-5, 0.1), (6e-5, 0.1), (6e-5, 0.1).

  • (2e-5, 0.1), (2e-5, 0.1), (4e-5, 0.5), (4e-5, 0.5).

  • (2e-5, 0.1), (2e-5, 0.1), (6e-5, 0.5), (6e-5, 0.5).