Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

by   Boris Ginsburg, et al.

We propose NovoGrad, a first-order stochastic gradient method with layer-wise gradient normalization via second moment estimators and with decoupled weight decay for better regularization. The method requires half as much memory as Adam/AdamW. We evaluated NovoGrad on a diverse set of problems, including image classification, speech recognition, neural machine translation and language modeling. On these problems, NovoGrad performed equal to or better than SGD and Adam/AdamW. Empirically we show that NovoGrad (1) is very robust during the initial training phase and does not require learning rate warm-up, (2) works well with the same learning rate policy for different problems, and (3) generally performs better than other optimizers for very large batch sizes.



1 Introduction

The most popular algorithms for training Deep Neural Networks (DNNs) are Stochastic Gradient Descent (SGD) with momentum (Polyak, 1964; Sutskever et al., 2013), Adam (Kingma and Ba, 2015), RMSProp (Tieleman and Hinton, 2012), and AdaGrad (Duchi et al., 2011). SGD with momentum is the preferred algorithm for computer vision problems, while Adam is the most commonly used for natural language processing (NLP) and speech problems. Compared to SGD, Adam is perceived as safer and more robust to weight initialization and learning rate policy (footnote 1: A. Karpathy, A Recipe for Training Neural Networks).

However, Adam has certain drawbacks. First, as noted in the original Adam paper (Kingma and Ba, 2015), the second moment can vanish or explode for some variables, which can lead to instability, especially during the initial phase of training. To alleviate this problem, a learning rate (LR) warmup is typically used (Vaswani et al., 2017). Second, Adam often leads to solutions that generalize worse than SGD (Wilson et al., 2017). Finally, it is incompatible with L2-regularization, as shown in Loshchilov and Hutter (2019). To improve Adam regularization, Loshchilov and Hutter (2019) proposed AdamW, a variant of Adam where weight decay is decoupled from the moment computation. This decoupling significantly boosts the validation accuracy of models trained with Adam, especially for very large networks.

NovoGrad builds upon the strengths of SGD and Adam algorithms in the following ways:

  1. Gradient normalization with 2nd moments makes it invariant to weight re-scaling and improves the algorithm robustness.

  2. NovoGrad computes 2nd moments per layer, instead of per individual parameter, resulting in half the memory consumption of Adam (see explanation in Section 3).

  3. NovoGrad uses weight decay decoupling (as in AdamW) for better regularization.
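The memory claim in item 2 can be checked with simple counting. The sketch below is illustrative only: the function name and the example layer sizes are hypothetical, and it counts stored moment values rather than bytes.

```python
def optimizer_state_sizes(layer_sizes):
    """Count stored moment values for Adam-style and NovoGrad-style optimizers.

    layer_sizes: number of parameters in each layer (hypothetical example input).
    """
    n_params = sum(layer_sizes)
    # Adam keeps a 1st and a 2nd moment for every parameter.
    adam = 2 * n_params
    # NovoGrad keeps a 1st moment per parameter and one scalar 2nd moment per layer.
    novograd = n_params + len(layer_sizes)
    return adam, novograd


adam_state, novograd_state = optimizer_state_sizes([1000, 2000, 500])
print(adam_state, novograd_state)  # 7000 3503
```

Because the number of layers is tiny compared to the number of parameters, the per-layer second moment is essentially free, which is where the factor of two comes from.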

We applied NovoGrad to a variety of large scale problems (image classification, neural machine translation, language modeling, and speech recognition) and found that in all cases it performs as well as or better than Adam/AdamW and SGD with momentum.

2 Related work

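SGD-based algorithms take a batch of training samples and compute the stochastic gradient $g_t$ of the loss $L$ with respect to the weights $w_t$ at each time-step $t$:

$$g_t = \nabla L(w_t) \tag{1}$$

SGD with momentum uses the first-order moment $m_t$ to update the weights:

$$m_t = \beta \cdot m_{t-1} + g_t \tag{2}$$

$$w_{t+1} = w_t - \lambda_t \cdot m_t \tag{3}$$

where $\lambda_t$ is the learning rate and $\beta$ is momentum (footnote 2: we moved $\lambda_t$ into the weight update for consistency with the TensorFlow and PyTorch implementations).

Adam is a popular adaptive learning rate method (Kingma and Ba, 2015). It computes the first- and second-order moments, $m_t$ and $v_t$ respectively, using an exponential moving average:

$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t \tag{4}$$

$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 \tag{5}$$

The purpose of the 2nd moment is to "normalize" the 1st moment during the weight update (footnote 3: we skip the bias correction for brevity):

$$w_{t+1} = w_t - \lambda_t \cdot \frac{m_t}{\sqrt{v_t} + \epsilon} \tag{6}$$

where $\epsilon$ is a small constant added for numerical stability.

Note that the Adam algorithm is scale invariant and the magnitude of the weight update in Equation 6 is bounded for typical $\beta_1$ and $\beta_2$. These two properties make Adam relatively robust to weight initialization and exploding gradients.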
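The two update rules discussed in this section can be sketched in plain Python for a single scalar weight. This is a minimal illustration, not a reference implementation: the function names and the default hyper-parameter values are assumptions, and bias correction is skipped as in the text. The last lines probe the scale-invariance claim by multiplying the gradient by 100.

```python
import math


def sgd_momentum_step(w, g, m, lr, beta=0.9):
    """One SGD-with-momentum step; the learning rate is applied in the weight update."""
    m = beta * m + g
    w = w - lr * m
    return w, m


def adam_step(w, g, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step without bias correction."""
    m = beta1 * m + (1.0 - beta1) * g      # 1st moment: moving average of g
    v = beta2 * v + (1.0 - beta2) * g * g  # 2nd moment: moving average of g^2
    w = w - lr * m / (math.sqrt(v) + eps)  # 1st moment normalized by the 2nd
    return w, m, v


# Re-scaling the gradient leaves the Adam update almost unchanged,
# while the SGD update grows in proportion to the gradient.
w_small, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, lr=0.001)
w_large, _, _ = adam_step(1.0, 50.0, 0.0, 0.0, lr=0.001)
s_small, _ = sgd_momentum_step(1.0, 0.5, 0.0, lr=0.001)
s_large, _ = sgd_momentum_step(1.0, 50.0, 0.0, lr=0.001)
print(abs(w_small - w_large), abs(s_small - s_large))
```

The residual difference between the two Adam results comes only from the stabilizing constant `eps`.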

NovoGrad belongs to the family of Stochastic Normalized Gradient Descent (SNGD) methods (Hazan et al., 2015; Nesterov, 1984). SNGD only uses the direction of the stochastic gradient (SG) to update the weights, and the step size does not depend on the magnitude of that gradient. By ignoring the gradient magnitude, SNGD is robust to vanishing and exploding gradients. Hazan et al. (2015) proved that the direction of the gradient is sufficient for convergence. In their experiments, SNGD performed comparably to SGD with momentum on small scale problems like MNIST.

SGD with layer-wise gradient normalization was introduced by Singh et al. (2015) as a remedy against vanishing gradients. Their method scales up small gradients, while keeping large gradients unchanged:

$$\hat{g}_t^l = g_t^l \cdot \left(1 + \log\left(1 + \frac{1}{\|g_t^l\|}\right)\right)$$

where $g_t^l$ is the vector of gradients for the layer $l$ at time-step $t$. A similar approach was proposed by Yu et al. (2018), who used layer-wise gradient normalization to alleviate both vanishing and exploding gradients. They divide the stochastic gradient $g_t^l$ for layer $l$ by its norm $\|g_t^l\|$:

$$\hat{g}_t^l = \frac{g_t^l}{\|g_t^l\|}$$

They showed that gradient normalization can boost both SGD with momentum and Adam.
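Dividing each layer's gradient by its own norm can be sketched as follows. The function name, the list-of-lists gradient layout, and the small `eps` guard against a zero norm are assumptions made for this illustration.

```python
import math


def normalize_per_layer(layer_grads, eps=1e-12):
    """Divide each layer's gradient vector by its L2 norm."""
    normalized = []
    for g in layer_grads:
        norm = math.sqrt(sum(x * x for x in g))
        normalized.append([x / (norm + eps) for x in g])
    return normalized


# One layer with a vanishing gradient and one with an exploding gradient.
grads = [[3e-6, 4e-6], [300.0, -400.0]]
unit_grads = normalize_per_layer(grads)
unit_norms = [math.sqrt(sum(x * x for x in g)) for g in unit_grads]
print(unit_norms)
```

After normalization both layers contribute updates of the same magnitude, so neither the vanishing nor the exploding layer dominates the step.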

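NovoGrad is also closely related to Normalized Direction-preserving Adam (ND-Adam), an algorithm proposed by Zhang et al. (2017). For each layer, ND-Adam first removes the projection of the gradients $g_t^l$ on the current weights $w_t^l$:

$$\bar{g}_t^l = g_t^l - (g_t^l \cdot w_t^l)\, w_t^l$$

Then, $\bar{g}_t^l$ is used to compute the 1st moment and the scalar 2nd moment:

$$m_t^l = \beta_1 \cdot m_{t-1}^l + (1 - \beta_1) \cdot \bar{g}_t^l$$

$$v_t^l = \beta_2 \cdot v_{t-1}^l + (1 - \beta_2) \cdot \|\bar{g}_t^l\|^2$$

Finally, the weights are updated with the 1st moment re-scaled by the 2nd moment, similarly to Adam:

$$\tilde{w}_{t+1}^l = w_t^l - \lambda_t \cdot \frac{m_t^l}{\sqrt{v_t^l} + \epsilon}$$

ND-Adam does not use weight decay or L2-regularization. Instead, layer weights are explicitly re-normalized in the spirit of Path-SGD (Neyshabur et al., 2015):

$$w_{t+1}^l = \frac{\tilde{w}_{t+1}^l}{\|\tilde{w}_{t+1}^l\|}$$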

Wilson et al. (2017) showed that adaptive methods like Adam generalize worse than SGD with momentum. One solution to this problem, proposed by Keskar and Socher (2017), is to use Adam during the initial stage and switch to SGD in the later stage of training. Luo et al. (2019) proposed to improve Adam regularization by limiting the factor $1/\sqrt{v_t}$ to a certain range. They showed that limiting it from above helps decrease the training loss, while limiting it from below helps generalize better.

Loshchilov and Hutter (2019) showed that Adam's weak regularization is due to the fact that the 2nd moment normalization effectively disables L2-regularization. They proposed a new method, AdamW, which decouples the weight decay term $d \cdot w_t$ from the gradient and adds it to the weight update:

$$w_{t+1} = w_t - \lambda_t \cdot \left(\frac{m_t}{\sqrt{v_t} + \epsilon} + d \cdot w_t\right)$$
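The effect described above can be seen in a small scalar experiment. The sketch below compares Adam with an L2 penalty folded into the gradient against an AdamW-style decoupled decay, using a zero loss gradient so that only the regularization acts. The function names, the symbol `d` for the decay coefficient, and all numeric values are assumptions chosen for illustration, and bias correction is omitted.

```python
import math


def adam_l2_step(w, g, m, v, lr, d, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularization: the penalty gradient d*w passes through the moments."""
    g = g + d * w
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    return w - lr * m / (math.sqrt(v) + eps), m, v


def adamw_step(w, g, m, v, lr, d, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW-style step: the decay term d*w bypasses the 2nd-moment normalization."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    return w - lr * (m / (math.sqrt(v) + eps) + d * w), m, v


# With a zero loss gradient, a 10x larger d barely changes the Adam+L2 step,
# because the normalization cancels the scale of d*w. The AdamW step scales with d.
l2_weak, _, _ = adam_l2_step(2.0, 0.0, 0.0, 0.0, lr=0.001, d=0.01)
l2_strong, _, _ = adam_l2_step(2.0, 0.0, 0.0, 0.0, lr=0.001, d=0.1)
aw_weak, _, _ = adamw_step(2.0, 0.0, 0.0, 0.0, lr=0.001, d=0.01)
aw_strong, _, _ = adamw_step(2.0, 0.0, 0.0, 0.0, lr=0.001, d=0.1)
print(abs(l2_weak - l2_strong), abs(aw_weak - aw_strong))
```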


Because the 2nd moment must be stored separately, Adam doubles the memory required by the optimizer compared to SGD with momentum. This especially affects large models like OpenAI's GPT-2 with 1.5 billion parameters.

Shazeer and Stern (2018) proposed the AdaFactor algorithm, which reduces memory usage by replacing the full 2nd moment with moving averages of the row and column sums of the squared gradients. For a layer defined by an $n \times m$ matrix, this would reduce memory from $O(n \times m)$ to $O(n + m)$.

3 Algorithm

Our motivation for this work is to find an algorithm which: (1) performs equally well for image classification, machine translation, and language modeling, and (2) is robust to learning rate choice and weight initialization. We begin with AdamW as a starting design point. To improve its robustness to learning rate choice, we switch to a layer-wise second moment. This improves stability during the initial training phase, allowing us to remove learning rate warm-up and to use the same learning rate policy for a diverse set of tasks. We also use normalized gradients (Hazan et al., 2015) in the first moment for large batch training. The resulting algorithm, NovoGrad, combines SGD's and Adam's strengths without requiring sophisticated learning rate policy tuning and works well with large batch sizes.

Let $g_t^l$ be the stochastic gradient for layer $l$ at step $t$. NovoGrad first computes the second moment $v_t^l$ using the norm $\|g_t^l\|$ (footnote 4: we use the $L_2$-norm for $v_t^l$; it would be interesting to see how the $L_1$ or $L_\infty$ norms perform):

$$v_t^l = \beta_2 \cdot v_{t-1}^l + (1 - \beta_2) \cdot \|g_t^l\|^2$$

where $\beta_2$ controls the exponential decay rate of the moving average of the moment. The moment $v_t^l$ is used to normalize the gradient $g_t^l$ when calculating the first-order moment $m_t^l$:

$$m_t^l = \beta_1 \cdot m_{t-1}^l + \frac{g_t^l}{\sqrt{v_t^l} + \epsilon}$$

where $\beta_1$ is the momentum. The gradient re-scaling at each layer improves robustness to weight initialization and prevents vanishing gradients.

Similarly to AdamW, we decouple the weight decay $d \cdot w_t^l$ from the stochastic gradient for regularization (footnote 5: we move weight decay into the 1st moment, while AdamW (Loshchilov and Hutter, 2019) uses weight decay in the weight update; we do not observe any difference in performance):

$$m_t^l = \beta_1 \cdot m_{t-1}^l + \left(\frac{g_t^l}{\sqrt{v_t^l} + \epsilon} + d \cdot w_t^l\right)$$

Good results are often obtained by setting , and . The first moment can also be computed via an exponential moving average instead of momentum, in an Adam-like style:

$$m_t^l = \beta_1 \cdot m_{t-1}^l + (1 - \beta_1) \cdot \left(\frac{g_t^l}{\sqrt{v_t^l} + \epsilon} + d \cdot w_t^l\right)$$

We use the following moments initialization to remove bias:

$$v_1^l = \|g_1^l\|^2, \qquad m_1^l = \frac{g_1^l}{\sqrt{v_1^l}} + d \cdot w_1^l$$

Weights are updated the same way as in SGD with momentum (footnote 6: to improve robustness for large learning rates, one can optionally apply layer-wise update clipping, similar to LARC; see also Shazeer and Stern, 2018):

$$w_{t+1} = w_t - \lambda_t \cdot m_t$$

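  Parameters: Initial learning rate $\lambda_0$, moments $\beta_1, \beta_2$, weight decay $d$, number of steps $T$
  Weight initialization: Initialize the weights $w$.
  Moment initialization: for each layer $l$ set $v_1^l = \|g_1^l\|^2$, $m_1^l = g_1^l / \sqrt{v_1^l} + d \cdot w_1^l$.
  while $t \le T$ do
      $\lambda_t \leftarrow LR(\lambda_0, t, T)$ (compute the global learning rate)
     for each layer $l$ do
         $g_t^l \leftarrow \nabla_l L(w_t)$
         $v_t^l \leftarrow \beta_2 \cdot v_{t-1}^l + (1 - \beta_2) \cdot \|g_t^l\|^2$
         $m_t^l \leftarrow \beta_1 \cdot m_{t-1}^l + \left(g_t^l / (\sqrt{v_t^l} + \epsilon) + d \cdot w_t^l\right)$
         $w_{t+1}^l \leftarrow w_t^l - \lambda_t \cdot m_t^l$
     end for
  end while
Algorithm 1 NovoGrad with weight decay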
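The algorithm described in this section can be sketched in plain Python as follows. This is an illustrative reading of the text, not the authors' implementation: the class name, the list-of-lists weight layout, the `eps` guard in the first step, the quadratic toy loss, and all hyper-parameter values in the usage lines are assumptions.

```python
import math


class NovoGradSketch:
    """Layer-wise normalized-gradient optimizer for weights stored as lists of floats."""

    def __init__(self, beta1, beta2, weight_decay=0.0, eps=1e-8):
        self.beta1 = beta1
        self.beta2 = beta2
        self.d = weight_decay
        self.eps = eps
        self.m = None  # 1st moment: one list per layer, same shape as the weights
        self.v = None  # 2nd moment: one scalar per layer

    def step(self, weights, grads, lr):
        """Update `weights` in place from per-layer `grads` using learning rate `lr`."""
        first_step = self.v is None
        if first_step:
            self.v = [0.0] * len(weights)
            self.m = [[0.0] * len(w) for w in weights]
        for l, (w, g) in enumerate(zip(weights, grads)):
            sq_norm = sum(x * x for x in g)
            if first_step:
                # Initialize with the first squared norm to avoid a biased estimate.
                self.v[l] = sq_norm
            else:
                self.v[l] = self.beta2 * self.v[l] + (1.0 - self.beta2) * sq_norm
            denom = math.sqrt(self.v[l]) + self.eps
            for i in range(len(w)):
                # Layer-normalized gradient plus decoupled weight decay, accumulated as momentum.
                self.m[l][i] = self.beta1 * self.m[l][i] + (g[i] / denom + self.d * w[i])
                w[i] -= lr * self.m[l][i]


def poly_lr(lr0, t, total, power=2):
    """Polynomial learning rate decay from lr0 to zero (assumed form)."""
    return lr0 * (1.0 - t / total) ** power


def loss(ws):
    """Toy quadratic loss 0.5 * ||w||^2, used only for this demonstration."""
    return 0.5 * sum(x * x for w in ws for x in w)


weights = [[1.0, -2.0], [0.5]]  # two hypothetical "layers"
initial_loss = loss(weights)
opt = NovoGradSketch(beta1=0.9, beta2=0.5)
total_steps = 200
for t in range(total_steps):
    grads = [list(w) for w in weights]  # gradient of the toy loss is w itself
    opt.step(weights, grads, poly_lr(0.02, t, total_steps))
final_loss = loss(weights)
print(initial_loss, final_loss)
```

Note that the optimizer state holds one scalar per layer in `self.v`, in contrast to the per-parameter second moment of Adam.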

To summarize, NovoGrad is a first-order SGD method with gradients normalized per layer. Borrowing from ND-Adam (Zhang et al., 2017), NovoGrad uses the 2nd moment for normalization and, as in AdamW (Loshchilov and Hutter, 2019), decouples weight decay from the stochastic gradient for regularization. NovoGrad has half the memory consumption of Adam (similar to AdaFactor (Shazeer and Stern, 2018), but with a simpler moment computation). Unlike AdaFactor, NovoGrad does not require learning rate warmup.

3.1 Notes on convergence

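Similar to other methods with stochastic gradient normalization by the second moment based on exponential moving average, one can easily construct a counter-example for the stochastic convex one-dimensional problem as shown by Wilson et al. (2017) and Reddi et al. (2018). To guarantee the convergence of NovoGrad for a stochastic convex case, we can apply the "AMS-Grad" fix (Reddi et al., 2018):

$$v_t^l = \beta_2 \cdot v_{t-1}^l + (1 - \beta_2) \cdot \|g_t^l\|^2$$

$$\hat{v}_t^l = \max(\hat{v}_{t-1}^l, v_t^l)$$

$$m_t^l = \beta_1 \cdot m_{t-1}^l + \left(\frac{g_t^l}{\sqrt{\hat{v}_t^l} + \epsilon} + d \cdot w_t^l\right)$$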
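The idea behind the AMS-Grad fix is to normalize by a running maximum of the second moment, so the effective step size cannot grow when gradients shrink. A minimal sketch for one layer follows; the function name and the value of `beta2` are assumptions for illustration.

```python
def amsgrad_second_moment(v_prev, v_hat_prev, grad, beta2):
    """Return the moving-average 2nd moment and its running maximum for one layer."""
    sq_norm = sum(x * x for x in grad)
    v = beta2 * v_prev + (1.0 - beta2) * sq_norm
    v_hat = max(v_hat_prev, v)  # never decreases
    return v, v_hat


v, v_hat = 0.0, 0.0
history = []
# Gradient norms shrink by 10x at every step.
for grad in ([3.0, 4.0], [0.3, 0.4], [0.03, 0.04]):
    v, v_hat = amsgrad_second_moment(v, v_hat, grad, beta2=0.5)
    history.append((v, v_hat))
print(history)
```

In this run the plain moving average decays with the gradients, while the running maximum stays at its first value, which is the quantity used for normalization.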

4 Experiments

We evaluated NovoGrad on the following models:

  • ResNet-50 (He et al., 2016), for ImageNet classification

  • Transformer-big (Vaswani et al., 2017), for WMT 2014 English-to-German translation

  • Jasper (Li et al., 2019), for LibriSpeech speech recognition

  • Transformer-XL (Dai et al., 2019), for WikiText-103 word-level language modeling

and compared it to SGD with momentum, Adam, and AdamW (footnote 7: training was done in the OpenSeq2Seq toolkit (Kuchaiev et al., 2018) using mixed precision (Micikevicius et al., 2017) on a DGX-1 with 8 V100 GPUs). In all the experiments, NovoGrad performed on par with or better than SGD and Adam/AdamW.

4.1 Image classification

We used ResNet-50 v2 (He et al., 2016) for the ImageNet classification task (Russakovsky et al., 2015) (footnote 8: OpenSeq2Seq mixed precision replica of TensorFlow ResNet-50). We trained this model with 3 optimizers: SGD with momentum (SGD), AdamW, and NovoGrad. All models were trained with a batch size of 1024 for 100 epochs. We used polynomial (quadratic) LR decay for SGD with momentum and NovoGrad. We could not find any reference for training ResNet-50 with AdamW on ImageNet, so we report the best accuracy we achieved after an extensive hyper-parameter search with cosine learning rate decay (Loshchilov and Hutter, 2016). We used only standard data augmentation methods (re-size, flip, and random crop) and did not employ any additional training tricks (He et al., 2018). The single-crop validation accuracy for each algorithm is reported in Table 1.

Optimizer  Batch  Epochs  Top-1 (%)  Top-5 (%)  LR policy  Init LR  WD
SGD        1K     100     76.38      93.08      poly (2)   0.400    0.0001
SGD        1K     200     76.33      92.96
AdamW      1K     100     76.36      93.01      cosine     0.002    0.120
AdamW      1K     200     76.48      92.94
NovoGrad   1K     100     77.00      93.37      poly (2)   0.010    0.002
NovoGrad   1K     200     77.47      93.58
NovoGrad   1K     300     77.63      93.73
Table 1: ImageNet classification with ResNet-50 v2, batch 1024, top-1 and top-5 accuracy (%).

NovoGrad outperformed both AdamW and SGD, obtaining a top-1 accuracy of 77% after 100 epochs. SGD and AdamW accuracy remained under 76.5% even when we trained for 200 epochs, while NovoGrad accuracy improved to 77.47%. NovoGrad demonstrated powerful regularization capabilities: training for 100 additional epochs improved top-1 even further to 77.63%. Note that this is "vanilla" ResNet-50, without sophisticated data augmentation or additional model tweaking (He et al., 2018).

4.1.1 Large batch training

Hazan et al. (2015) showed that large batch size is beneficial for SNGD convergence, which motivated us to explore NovoGrad for large batch training. We trained ResNet-50 v2 with batch sizes of 8K and 32K. To compare with the previous methods, we train the model for 90 epochs. To emulate large batch, we used a mini-batch of 128 per GPU and accumulated gradients from several mini-batches before each weight update.
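The large-batch emulation described above amounts to gradient accumulation. The sketch below computes how many mini-batch passes are needed per weight update and averages their gradients. It assumes that 8K and 32K mean 8192 and 32768 samples, that all 8 GPUs mentioned in the training setup are used in data-parallel mode, and that accumulated gradients are averaged rather than summed; the function names are hypothetical.

```python
def accumulation_steps(global_batch, per_gpu_batch=128, num_gpus=8):
    """Number of forward/backward passes to accumulate before one weight update."""
    per_pass = per_gpu_batch * num_gpus
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be a multiple of per_gpu_batch * num_gpus")
    return global_batch // per_pass


def accumulate(mini_batch_grads):
    """Average a list of gradient vectors (one per pass) into a single gradient."""
    n = len(mini_batch_grads)
    return [sum(column) / n for column in zip(*mini_batch_grads)]


steps_8k = accumulation_steps(8 * 1024)
steps_32k = accumulation_steps(32 * 1024)
averaged = accumulate([[1.0, 2.0], [3.0, 4.0]])
print(steps_8k, steps_32k, averaged)  # 8 32 [2.0, 3.0]
```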

Batch top-1,% top-5,% initial LR weight decay
1K 76.86 93.31 0.01 0.0027
8K 76.64 93.12 0.02 0.0060
32K 75.48 92.46 0.03 0.0100
Table 2: Large batch training with NovoGrad — ImageNet, ResNet-50 v2, 90 epochs, accuracy(%).

Instead of scaling the learning rate linearly with the batch size as in Goyal et al. (2017), we increased both the learning rate and the weight decay to improve the regularization (see Table 2).

For comparison, we took 3 other methods, which (1) use a fixed batch size during training and (2) don't modify the original model. All 3 methods employ SGD with momentum (SGD). The first method (Goyal et al., 2017) scales the LR linearly with batch size and uses LR warmup to stabilize the initial training phase. The second method (You et al., 2018) combines LR warmup with Layer-wise Adaptive Rate Scaling (LARS) (You et al., 2017). The last method (Codreanu et al., 2017) uses LR warmup and dynamic weight decay (WD).

Reference               Optimizer  Bag of tricks             Epochs  B=1K   B=8K   B=32K
Goyal et al. (2017)     SGD        LR warmup                 90      76.47  76.26  72.45
You et al. (2018)       SGD        LR warmup                 90      75.30  75.30  75.40
Codreanu et al. (2017)  SGD        LR warmup, multi-step WD  92-100  76.50  76.26  75.31
This work               NovoGrad   -                         90      76.86  76.64  75.48
Table 3: Large batch training comparison for ImageNet, ResNet-50 v2, top-1 accuracy (%).

NovoGrad outperformed all other methods without using any additional techniques like LR warmup (Goyal et al., 2017), dynamic weight decay, special batch normalization initialization, etc.

Jia et al. (2018) and Ying et al. (2018) proposed a few modifications of ResNet-50 model, which significantly improved the accuracy for a large batch. We are planning to experiment on augmenting NovoGrad with these techniques, checkpoint averaging (Ying et al., 2018), and label smoothing (Szegedy et al., 2015).

4.2 Neural machine translation

We trained the Transformer "big" model (Vaswani et al., 2017) for the WMT 2014 English-to-German translation task. We used the OpenSeq2Seq (Kuchaiev et al., 2018) transformer-big, which differs from the original implementation in two ways: (1) we measure batch size in sentence pairs, not tokens, and (2) we use mixed precision training (Micikevicius et al., 2017). For these experiments, the vocabulary is 32K tokens based on joint source and target byte-pair-encoding (Sennrich et al., 2015). Models were trained on the WMT'14 dataset and evaluated on newstest14 with sacreBLEU (Post, 2018) on de-tokenized output (footnote 11: BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.13a+version.1.2.12). For Adam and AdamW we used the "Noam" (Shazeer and Stern, 2018) learning rate policy with a warmup period of 8,000 steps, decreasing thereafter proportionally to the inverse square root of the step number. It was observed in Vaswani et al. (2017), and our experiments confirm this, that learning rate warmup is crucial for training Transformer-big with these algorithms. However, with NovoGrad, we were able to use the same poly decay policy as for ResNet-50 without any warmup.
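The two learning rate policies compared in this section can be sketched as follows. The Noam formula is the commonly used form from the Transformer literature; the `scale` multiplier, the `d_model` default, and the exact form of the polynomial decay (to zero, with power 2) are assumptions for illustration and may differ from the toolkit's implementation.

```python
def noam_lr(step, warmup_steps=8000, scale=1.0, d_model=1024):
    """Linear warmup for `warmup_steps`, then decay as the inverse square root of the step."""
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def poly_lr(step, total_steps, init_lr, power=2.0):
    """Polynomial decay from init_lr to zero, with no warmup."""
    return init_lr * (1.0 - step / total_steps) ** power


# The Noam schedule peaks at the end of warmup; the poly schedule starts at its maximum.
noam_samples = [noam_lr(s) for s in (4000, 8000, 16000, 32000)]
poly_samples = [poly_lr(s, 100, 0.03) for s in (0, 50, 100)]
print(noam_samples)
print(poly_samples)
```

The contrast is that the polynomial policy applies its largest learning rate on the very first step, which is only workable for an optimizer that is stable without warmup.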

Optimizer  Batch  Epochs  BLEU (cased)  BLEU (lowercase)  LR policy  Init LR  Weight decay
Adam       1K     100     27.6          28.1              Noam       1.0      -
Adam       1K     200     27.8          28.3              Noam       2.0      -
AdamW      1K     100     27.8          28.3              Noam       2.0
AdamW      1K     200     27.8          28.2              Noam       2.0
NovoGrad   1K     100     28.1          28.5              poly (2)   0.03
NovoGrad   1K     200     28.5          29.0              poly (2)   0.035
Table 4: WMT'14 English-to-German translation, Transformer-big, batch 1024 (sentence pairs), 100/200 epochs, sacreBLEU (cased/lowercase) on WMT'14 (newstest14). We have not used checkpoint averaging in any of the runs.

NovoGrad performed better than Adam/AdamW, especially for long runs. We observed that NovoGrad is also more stable than Adam to initial LR choice, and it converges without LR warmup.

4.3 Speech recognition

We conducted experiments with Jasper-10x5 (Li et al., 2019), a state-of-the-art deep convolutional neural acoustic model, on the LibriSpeech 960h speech recognition task (Panayotov et al., 2015). Jasper was trained with SGD with momentum (SGD) and NovoGrad for 400 epochs. In both cases, we used a batch size of 256, polynomial LR decay, speed perturbation for data augmentation, and Layer-wise Adaptive Rate Clipping (LARC) for gradient clipping.

LARC clips layer gradients with respect to layer weights.

Optimizer dev-clean dev-other test-clean test-other
Adam 13.20 31.71 13.36 32.71
SGD 3.91 12.77 3.98 12.79
NovoGrad 3.64 11.89 3.86 11.95
Table 5: Speech recognition — Jasper-10x5, LibriSpeech, 400 epochs, WER (%)

We found that NovoGrad yields lower Word Error Rates (WER) compared to SGD with momentum, especially for long runs. Unfortunately, we were unable to get good results using Adam. The details about the model and training parameters are available in Li et al. (2019).

4.4 Language modeling

We trained Transformer-XL (Dai et al., 2019), the state-of-the-art LM architecture, on the word-level WikiText-103 (Merity et al., 2016) benchmark. For all the experiments we used a -layer base model with M parameters (, , , ). All other hyper-parameters were taken from the original Transformer-XL paper and the source code was based on a publicly available implementation. Each configuration was trained for billion tokens which corresponds to approximately epochs and training iterations.

Figure 1 shows that NovoGrad may require more training steps for the model to converge if compared to Adam. However, NovoGrad exhibits a much smaller gap between training and validation perplexity, which results in better generalization and improved performance on the test set.

Optimizer #tokens batch LR policy init LR WD Val PPL Test PPL
Adam B cosine

B cosine
NovoGrad B poly (2) 0.01 -
Table 6: Language modeling — Transformer-XL trained on WikiText-103, perplexity (PPL).
Figure 1: Learning curves for Transformer-XL model trained with Adam and NovoGrad.

4.5 Question answering

Question answering is a popular downstream NLP task which frequently uses a pre-trained language model instead of training the resulting neural net from scratch. We fine-tuned the large BERT model with Adam, AdamW and NovoGrad on the question answering benchmark SQuAD v1.1, which involves predicting the answer text span in a paragraph given a question. For Adam, LR warm-up over 10% of the iterations was used to stabilize the initial training phase. With NovoGrad we did not use LR warm-up. Interestingly, while NovoGrad required 4 epochs to get comparable results, it still had exactly the same number of updates as Adam because of the 2x larger batch size. Table 7 shows the best F1 and Exact Match (EM) scores obtained for the SQuAD benchmark on the evaluation dataset.

Optimizer  Batch  Epochs  EM     F1     LR policy         Init LR  WD
Adam       12     2       84.66  91.28  poly (1)+warmup
AdamW      16     2       84.52  91.19  poly (1)+warmup
NovoGrad   24     4       84.43  91.14  cosine
Table 7: Question answering with large BERT fine-tuned on SQuAD v1.1.

5 Conclusion

We propose NovoGrad – a first-order SGD method with gradients normalized by the second moment computed as moving average of squared norms of layer gradients. Because of the layer-wise second moment, NovoGrad requires half the memory compared to Adam. NovoGrad also decouples gradients and weight decay for better regularization.

We tested NovoGrad on very large models for image classification, translation, language modeling, and speech recognition. In these experiments, NovoGrad performed as well as or better than SGD and Adam/AdamW. We found that NovoGrad is more robust to the initial learning rate and weight initialization. For example, NovoGrad works well with the same learning rate decay schedule without warm-up, while other methods require it. The layer-wise normalized gradient makes training with NovoGrad robust for large batch sizes. NovoGrad outperformed current methods for ResNet-50 large batch training. Strong optimization and regularization qualities allow NovoGrad to train longer without over-fitting. NovoGrad and all models described in this work are open sourced in the OpenSeq2Seq toolkit.


The authors would like to thank Anima Anandkumar, Yaroslav Bulatov, Ilya Loshchilov, and Sebastian Ruder for their valuable feedback.