…, RMSProp (Tieleman and Hinton, 2012), and AdaGrad (Duchi et al., 2011). SGD with momentum is the preferred algorithm for computer vision problems, while Adam is the most commonly used for natural language processing (NLP) and speech problems. Compared to SGD, Adam is perceived as safer and more robust to weight initialization and learning rate policy (A. Karpathy, "A Recipe for Training Neural Networks", http://karpathy.github.io/2019/04/25/recipe/).
However, Adam has certain drawbacks. First, as noted in the original Adam paper (Kingma and Ba, 2015), the second moment can vanish or explode for some variables, which can lead to instability, especially during the initial phase of training. To alleviate this problem, a learning rate (LR) warmup is typically used (Vaswani et al., 2017). Second, Adam often leads to solutions that generalize worse than those found by SGD (Wilson et al., 2017). Finally, Adam does not combine well with L2-regularization, as shown by Loshchilov and Hutter (2019). To improve Adam's regularization, Loshchilov and Hutter (2019) proposed AdamW, a variant of Adam in which weight decay is decoupled from the moment computation. This decoupling significantly boosts the validation accuracy of models trained with Adam, especially for very large networks.
NovoGrad builds upon the strengths of SGD and Adam algorithms in the following ways:
- Gradient normalization with layer-wise 2nd moments makes NovoGrad invariant to weight re-scaling and improves its robustness.
- NovoGrad computes 2nd moments per layer, instead of per individual parameter, resulting in half the memory consumption of Adam (see Section 3).
- NovoGrad uses weight decay decoupling (as in AdamW) for better regularization.
We applied NovoGrad to a variety of large scale problems — image classification, neural machine translation, language modeling, and speech recognition — and found that in all cases, it performs as well or better than Adam/AdamW, and SGD with momentum.
2 Related work
SGD-based algorithms take a batch B of training samples and compute the gradient g_t of the loss L with respect to the weights w_t at each time-step t:

g_t = (1/|B|) Σ_{x ∈ B} ∇L(x, w_t)

SGD with momentum uses the first-order moment m_t to update the weights:

m_t = β m_{t−1} + g_t
w_{t+1} = w_t − λ_t m_t

Adam is a popular adaptive learning rate method (Kingma and Ba, 2015). It computes the first- and second-order moments, m_t and v_t respectively, using exponential moving averages:

m_t = β1 m_{t−1} + (1 − β1) g_t
v_t = β2 v_{t−1} + (1 − β2) g_t²

The purpose of the 2nd moment is to "normalize" the 1st moment during the weight update (we skip the bias correction for brevity):

w_{t+1} = w_t − λ_t m_t / (√v_t + ε)

Note that the Adam update is scale-invariant with respect to the gradient, and the magnitude of the weight update above is approximately bounded by λ_t for typical β1 and β2. These two properties make Adam relatively robust to weight initialization and exploding gradients.
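To make the scale-invariance concrete, here is a minimal NumPy sketch of one Adam step (bias correction omitted, as in the text; the hyper-parameter values are common defaults, not prescribed by this paper):

```python
import numpy as np

# One Adam step, bias correction omitted as in the text.
def adam_step(w, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g       # 1st moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g**2    # 2nd moment: EMA of squared gradients
    w = w - lr * m / (np.sqrt(v) + eps)   # update "normalized" by the 2nd moment
    return w, m, v

# Scale invariance: re-scaling the gradient re-scales m and sqrt(v) by the
# same factor, so the update is (up to eps) unchanged.
w0, m0, v0 = np.zeros(3), np.zeros(3), np.zeros(3)
g = np.array([0.1, -2.0, 5.0])
w1, _, _ = adam_step(w0, g, m0, v0)
w1_scaled, _, _ = adam_step(w0, 100.0 * g, m0, v0)
```

Multiplying the gradient by 100 leaves the update essentially unchanged, which is why Adam tolerates poorly scaled gradients.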
NovoGrad belongs to the family of Stochastic Normalized Gradient Descent (SNGD) methods (Hazan et al., 2015; Nesterov, 1984). SNGD uses only the direction of the stochastic gradient (SG) to update the weights, so the step size does not depend on the magnitude of the gradient. By ignoring the gradient magnitude, SNGD is robust to vanishing and exploding gradients. Hazan et al. (2015) proved that the direction of the gradient is sufficient for convergence. In their experiments, SNGD performed comparably to SGD with momentum on small-scale problems like MNIST.
SGD with layer-wise gradient normalization was introduced by Singh et al. (2015) as a remedy against vanishing gradients: their method scales up small gradients while keeping large gradients unchanged. A similar approach was proposed by Yu et al. (2018), who used layer-wise gradient normalization to alleviate both vanishing and exploding gradients. They divide the stochastic gradient g_t^l for layer l at time-step t by its norm:

ĝ_t^l = g_t^l / ||g_t^l||
They showed that gradient normalization can boost both SGD with Momentum and Adam.
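A minimal sketch of this per-layer normalization (the `eps` guard against all-zero gradients is our own addition):

```python
import numpy as np

# Per-layer gradient normalization: each layer's gradient is divided by its
# own L2 norm, so small and large gradients alike end up with unit norm.
def normalize_per_layer(grads_per_layer, eps=1e-12):
    return [g / (np.linalg.norm(g) + eps) for g in grads_per_layer]

grads = [np.array([3.0, 4.0]),          # "large" layer gradient, norm 5
         np.array([1e-2, 0.0, 0.0])]    # "small" layer gradient, norm 0.01
normed = normalize_per_layer(grads)
```

After normalization, both layers contribute gradients of (nearly) unit norm, regardless of their original magnitude.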
NovoGrad is also closely related to Normalized Direction-preserving Adam (ND-Adam), an algorithm proposed by Zhang et al. (2017). For each layer l, ND-Adam first removes the projection of the gradient g_t^l onto the current weights w_t^l:

ĝ_t^l = g_t^l − (g_t^l · w̄_t^l) w̄_t^l,  where w̄_t^l = w_t^l / ||w_t^l||

Then, ĝ_t^l is used to compute the 1st moment and the 2nd (scalar, per-layer) moment:

m_t^l = β1 m_{t−1}^l + (1 − β1) ĝ_t^l
v_t^l = β2 v_{t−1}^l + (1 − β2) ||ĝ_t^l||²

Finally, the weights are updated with the 1st moment re-scaled by the 2nd moment, similarly to Adam:

w_{t+1}^l = w_t^l − λ_t m_t^l / (√v_t^l + ε)

ND-Adam does not use weight decay or L2-regularization. Instead, layer weights are explicitly re-normalized in the spirit of Path-SGD (Neyshabur et al., 2015):

w_{t+1}^l ← w_{t+1}^l / ||w_{t+1}^l||
Wilson et al. (2017) showed that adaptive methods like Adam generalize worse than SGD with momentum. One solution to this problem, proposed by Keskar and Socher (2017), is to use Adam during the initial stage and switch to SGD in the later stage of training. Luo et al. (2019) proposed to improve Adam's regularization by limiting the adaptive factor λ_t/√v_t to a certain range: they showed that limiting it from above helps decrease the training loss, while limiting it from below helps the model generalize better.
Loshchilov and Hutter (2019) showed that Adam's weak regularization is due to the fact that the 2nd-moment normalization effectively disables L2-regularization. They proposed a new method, AdamW, which decouples the weight decay term d·w_t from the gradient and adds it directly to the weight update:

w_{t+1} = w_t − λ_t (m_t / (√v_t + ε) + d w_t)
Because the 2nd moment must be stored separately for each parameter, Adam doubles the optimizer memory compared to SGD with momentum. This especially affects large models like OpenAI's GPT-2 with 1.5 billion parameters. Shazeer and Stern (2018) proposed the AdaFactor algorithm, which reduces memory usage by replacing the full 2nd moment with moving averages of the row and column sums of the squared gradients. For a layer defined by an n×m matrix, this reduces the 2nd-moment memory from O(n·m) to O(n+m).
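The bookkeeping above can be made concrete with a small sketch of optimizer-state sizes for a single n×m weight matrix (the function and its cases are our own accounting; AdaFactor is counted in its default configuration with factored 2nd moment and no 1st moment):

```python
# Optimizer-state size, in number of stored floats, for one n x m matrix.
def optimizer_state_size(n, m, optimizer):
    if optimizer == "sgd_momentum":
        return n * m              # one momentum buffer
    if optimizer == "adam":
        return 2 * n * m          # per-parameter 1st and 2nd moments
    if optimizer == "adafactor":
        return n + m              # row and column sums of squared gradients
    if optimizer == "novograd":
        return n * m + 1          # per-parameter 1st moment + one scalar 2nd moment
    raise ValueError(optimizer)

# For a 1024 x 1024 layer: Adam stores ~2M floats of state,
# NovoGrad ~1M floats plus a single scalar.
```

This is why computing the 2nd moment per layer rather than per parameter halves the optimizer memory relative to Adam.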
3 Algorithm

Our motivation for this work is to find an algorithm which: (1) performs equally well for image classification, machine translation, and language modeling, and (2) is robust to the learning rate choice and weight initialization. We begin with AdamW as a starting design point. To improve its robustness to the learning rate choice, we switch to a layer-wise second moment. This improves stability during the initial training phase, allowing us to remove learning rate warmup and to use the same learning rate policy for a diverse set of tasks. We also use normalized gradients (Hazan et al., 2015) in the first moment for large batch training. The resulting algorithm, NovoGrad, combines the strengths of SGD and Adam without requiring sophisticated learning rate policy tuning, and works well with large batch sizes.
Let g_t^l be the stochastic gradient for layer l at step t. NovoGrad first computes the layer-wise second moment v_t^l using the norm ||g_t^l|| (we use the L2-norm; other norms could also be considered):

v_t^l = β2 v_{t−1}^l + (1 − β2) ||g_t^l||²

where β2 controls the exponential decay rate of the moving average of the second moment. The moment v_t^l is used to normalize the gradient when calculating the first-order moment m_t^l:

m_t^l = β1 m_{t−1}^l + g_t^l / (√v_t^l + ε)

where β1 is the momentum. The gradient re-scaling at each layer improves robustness to weight initialization and prevents vanishing gradients.
Similarly to AdamW, we decouple the weight decay d from the stochastic gradient for regularization (we move the weight decay into the 1st moment, while AdamW (Loshchilov and Hutter, 2019) uses it in the weight update; we did not observe any difference in performance):

m_t^l = β1 m_{t−1}^l + (g_t^l / (√v_t^l + ε) + d w_t^l)

Good results are often obtained with the same fixed values of β1, β2, and ε across tasks. The first moment can also be computed via an exponential moving average, in an Adam-like style, instead of via momentum:

m_t^l = β1 m_{t−1}^l + (1 − β1) (g_t^l / (√v_t^l + ε) + d w_t^l)

We use the following moment initialization to remove bias:

v_1^l = ||g_1^l||²,  m_1^l = g_1^l / ||g_1^l|| + d w_1^l
Weights are updated the same way as in SGD with momentum:

w_{t+1}^l = w_t^l − λ_t m_t^l

To improve the algorithm's robustness for large learning rates, one can optionally apply layer-wise update clipping (similar to LARC; see also Shazeer and Stern, 2018) to make sure that the norm of each layer's update does not exceed a fixed fraction δ of the norm of its weights: ||λ_t m_t^l|| ≤ δ ||w_t^l||.
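Putting the pieces together, one NovoGrad step over a list of per-layer weights can be sketched as follows (the hyper-parameter values are illustrative, not prescribed by the text; the t == 1 branch implements the bias-free moment initialization above):

```python
import numpy as np

# One NovoGrad step for a list of per-layer weights and gradients,
# following the moment definitions above.
def novograd_step(weights, grads, m, v, t, lr=0.01,
                  beta1=0.95, beta2=0.98, wd=0.001, eps=1e-8):
    new_w, new_m, new_v = [], [], []
    for w, g, m_l, v_l in zip(weights, grads, m, v):
        g_norm_sq = float(np.sum(g * g))  # one scalar 2nd moment per layer
        if t == 1:
            v_l = g_norm_sq                           # v_1 = ||g_1||^2
            m_l = g / (np.sqrt(v_l) + eps) + wd * w   # m_1 = g_1/||g_1|| + d*w_1
        else:
            v_l = beta2 * v_l + (1 - beta2) * g_norm_sq
            m_l = beta1 * m_l + (g / (np.sqrt(v_l) + eps) + wd * w)
        new_w.append(w - lr * m_l)                    # SGD-with-momentum style update
        new_m.append(m_l)
        new_v.append(v_l)
    return new_w, new_m, new_v

# A single layer with ||g|| = 5: after the first step, v = 25 and m = g/5 + d*w.
weights = [np.ones(4)]
grads = [np.array([1.0, 2.0, 2.0, 4.0])]
new_w, new_m, new_v = novograd_step(weights, grads, [np.zeros(4)], [0.0], t=1)
```

Note that `v_l` is a single scalar per layer, which is exactly where the memory saving over Adam comes from.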
To summarize, NovoGrad is a first-order SGD method with gradients normalized per layer. Borrowing from ND-Adam (Zhang et al., 2017), NovoGrad uses a layer-wise 2nd moment for normalization, and it decouples weight decay from the stochastic gradient for regularization as in AdamW (Loshchilov and Hutter, 2019). NovoGrad has half the memory consumption of Adam (similar to AdaFactor (Shazeer and Stern, 2018), but with a simpler moment computation). Unlike AdaFactor, NovoGrad does not require learning rate warmup.
3.1 Notes on convergence
As with other methods that normalize the stochastic gradient by a second moment based on an exponential moving average, one can easily construct a counter-example in the stochastic convex one-dimensional case, as shown by Wilson et al. (2017) and Reddi et al. (2018). To guarantee the convergence of NovoGrad in the stochastic convex case, we can apply the "AMSGrad" fix (Reddi et al., 2018), normalizing by the running maximum of the second moment instead of the moment itself:

v̂_t^l = max(v̂_{t−1}^l, v_t^l)
m_t^l = β1 m_{t−1}^l + (g_t^l / (√v̂_t^l + ε) + d w_t^l)
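The fix keeps a running maximum of the layer-wise 2nd moment, so the normalizer never decreases; a minimal sketch with plain floats (the layer-wise moment is a scalar):

```python
# AMSGrad-style fix: normalize by the running maximum of the layer-wise
# 2nd moment, so the effective step size can only shrink as v grows.
def amsgrad_moment(v_hat_prev, v_t):
    return max(v_hat_prev, v_t)

vs = [1.0, 4.0, 2.0, 0.5]     # raw 2nd-moment values over four steps
v_hat, hats = 0.0, []
for v in vs:
    v_hat = amsgrad_moment(v_hat, v)
    hats.append(v_hat)
# hats is non-decreasing: [1.0, 4.0, 4.0, 4.0]
```

Because the normalizer is monotone, the counter-examples built from oscillating second moments no longer apply.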
4 Experiments

We evaluated NovoGrad on the following models: ResNet-50 for image classification, Transformer-big for machine translation, Jasper for speech recognition, Transformer-XL for language modeling, and BERT fine-tuning for question answering, and compared it to SGD with momentum, Adam, and AdamW. (Training was done with the OpenSeq2Seq toolkit (Kuchaiev et al., 2018) using mixed precision (Micikevicius et al., 2017) on a DGX-1 with 8 V100 GPUs.) In all the experiments, NovoGrad performed on par with or better than SGD and Adam/AdamW.
4.1 Image classification
We used ResNet-50 v2 (He et al., 2016) for the ImageNet classification task (Russakovsky et al., 2015). (We used the OpenSeq2Seq mixed-precision replica of TensorFlow ResNet-50: https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/image2label/resnet-50-v2-mp.py.)
We trained this model with three optimizers: SGD with momentum (SGD), AdamW, and NovoGrad. All models were trained with a batch size of 1024 for 100 epochs. We used polynomial (quadratic) LR decay for SGD with momentum and NovoGrad. We could not find any reference for training ResNet-50 on ImageNet with AdamW, so we report the best accuracy we achieved after an extensive hyper-parameter search with cosine learning rate decay (Loshchilov and Hutter, 2016). We used only standard data augmentation methods: resize, flip, and random crop, and did not employ any additional training tricks (He et al., 2018). The single-crop validation accuracy for each algorithm is reported in Table 1.
| optimizer | batch | epochs | top-1, % | top-5, % | LR policy | init LR | WD |
NovoGrad outperformed both AdamW and SGD, obtaining a top-1 accuracy of 77% after 100 epochs. When trained for 200 epochs instead, SGD and AdamW accuracy remained under 76.5%, while NovoGrad's accuracy improved to 77.47%. NovoGrad demonstrated strong regularization capabilities: training for 100 additional epochs improved top-1 accuracy even further, to 77.63%. Note that this is "vanilla" ResNet-50, without sophisticated data augmentation or additional model tweaking (He et al., 2018).
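The polynomial (quadratic) LR decay used for the SGD and NovoGrad runs above can be sketched as follows (argument names are ours):

```python
# Polynomial (quadratic) LR decay: the learning rate falls from init_lr
# at step 0 to exactly 0 at the last step.
def poly_decay_lr(step, total_steps, init_lr):
    return init_lr * (1.0 - step / total_steps) ** 2
```

Halfway through training the LR is already down to a quarter of its initial value, so most of the fine-grained convergence happens late in the run.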
4.1.1 Large batch training
Hazan et al. (2015) showed that a large batch size is beneficial for SNGD convergence, which motivated us to explore NovoGrad for large batch training. We trained ResNet-50 v2 with batch sizes of 8K and 32K. To compare with previous methods, we trained the model for 90 epochs. To emulate large batches, we used a mini-batch of 128 per GPU and accumulated gradients from several mini-batches before each weight update.
| batch | top-1, % | top-5, % | init LR | weight decay |
For comparison, we took three other methods which (1) use a fixed batch size during training and (2) do not modify the original model. All three methods employ SGD with momentum (SGD). The first method (Goyal et al., 2017) scales the LR linearly with the batch size and uses LR warmup to stabilize the initial training phase. The second method (You et al., 2018) combines LR warmup with Layer-wise Adaptive Rate Scaling (LARS) (You et al., 2017). The last method (Codreanu et al., 2017) uses LR warmup and dynamic weight decay (WD).
| Reference | Optimizer | Bag of Tricks | #epochs | B=1K | B=8K | B=32K |
|---|---|---|---|---|---|---|
| Goyal et al. (2017) | SGD | LR warmup | 90 | 76.47 | 76.26 | 72.45 |
| You et al. (2018) | SGD | LR warmup | 90 | 75.30 | 75.30 | 75.40 |
| Codreanu et al. (2017) | SGD | LR warmup | 92-100 | 76.50 | 76.26 | 75.31 |
NovoGrad outperformed all other methods without using any additional techniques such as LR warmup (Goyal et al., 2017), dynamic weight decay, or special batch normalization initialization. Jia et al. (2018) and Ying et al. (2018) proposed several modifications to the ResNet-50 model which significantly improve accuracy for large batches. We plan to experiment with augmenting NovoGrad with these techniques, as well as with checkpoint averaging (Ying et al., 2018) and label smoothing (Szegedy et al., 2015).
4.2 Neural machine translation
We trained the Transformer "big" model (Vaswani et al., 2017) for the WMT 2014 English-to-German translation task. We used the OpenSeq2Seq (Kuchaiev et al., 2018) transformer-big, which differs from the original implementation (https://github.com/tensorflow/models/tree/master/official/transformer) in two ways: (1) we measure batch size in sentence pairs, not tokens, and (2) we use mixed precision training (Micikevicius et al., 2017). For these experiments, the vocabulary is 32K tokens based on joint source and target byte-pair encoding (Sennrich et al., 2015), built with https://github.com/google/sentencepiece. Models were trained on the WMT'14 dataset and evaluated on newstest14 with sacreBLEU (Post, 2018) on de-tokenized output (signature: BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.13a+version.1.2.12). For Adam and AdamW we used the "Noam" learning rate policy (Shazeer and Stern, 2018): a warmup period of 8,000 steps followed by a decay proportional to the inverse square root of the step number. It was observed in (Vaswani et al., 2017), and our experiments confirm this, that the learning rate warmup is crucial for training Transformer-big with these algorithms. With NovoGrad, however, we were able to use the same poly decay policy as for ResNet-50, without any warmup.
| optimizer | batch | epochs | BLEU (cased) | BLEU (lowercase) | LR policy | init LR | weight decay |
NovoGrad performed better than Adam/AdamW, especially for long runs. We also observed that NovoGrad is more robust than Adam to the initial LR choice, and that it converges without LR warmup.
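For reference, the "Noam" schedule used for Adam/AdamW above can be sketched as follows (d_model = 1024 corresponds to Transformer "big"; the d_model**-0.5 scale follows the original Transformer paper):

```python
# "Noam" schedule: linear warmup for `warmup` steps, then decay
# proportional to 1/sqrt(step). The schedule peaks exactly at step == warmup.
def noam_lr(step, d_model=1024, warmup=8000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

NovoGrad replaces this warmup-plus-decay construction with the plain polynomial decay used for ResNet-50.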
4.3 Speech recognition
We trained Jasper (Li et al., 2019), a deep convolutional neural acoustic model, on the LibriSpeech speech recognition task (Panayotov et al., 2015). Jasper was trained with SGD with momentum (SGD) and with NovoGrad for 400 epochs. In both cases, we used a batch size of 256, polynomial LR decay, speed perturbation for data augmentation, and Layer-wise Adaptive Rate Clipping (LARC) for gradient clipping (see https://github.com/NVIDIA/OpenSeq2Seq/blob/master/open_seq2seq/optimizers and https://github.com/NVIDIA/apex/blob/master/apex/parallel/LARC.py). LARC clips the gradient g_t^l of each layer so that the norm of the layer update does not exceed a small fraction of the norm of the layer weights w_t^l.
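A sketch of LARC in its gradient-clipping mode, following the apex implementation referenced above (the trust coefficient `eta` and the demo values are illustrative):

```python
import numpy as np

# LARC clipping: rescale the layer gradient so that the layer update is
# at most eta * ||w||; gradients are only ever scaled down, never up.
def larc_clip(w, g, lr, eta=0.001, eps=1e-12):
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(g)
    if w_norm == 0.0 or g_norm == 0.0:
        return g
    adaptive_lr = eta * w_norm / (g_norm + eps)   # layer-wise trust ratio
    return g * min(adaptive_lr / lr, 1.0)         # clip only: never scale up

w = np.full(4, 0.5)          # ||w|| = 1
big_g = np.full(4, 50.0)     # ||g|| = 100 -> gets clipped
small_g = np.full(4, 5e-4)   # ||g|| = 1e-3 -> left unchanged
clipped = larc_clip(w, big_g, lr=0.01)
unchanged = larc_clip(w, small_g, lr=0.01)
```

With lr = 0.01 and eta = 0.001, the large gradient is scaled down so its update norm equals eta * ||w|| / lr-scaled bound, while the small gradient passes through untouched.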
We found that NovoGrad yields lower Word Error Rates (WER) compared to SGD with momentum, especially for long runs. Unfortunately, we were unable to get good results with Adam. Details about the model and training parameters are available in (Li et al., 2019).
4.4 Language modeling
We trained Transformer-XL (Dai et al., 2019), the state-of-the-art LM architecture, on the word-level WikiText-103 (Merity et al., 2016) benchmark. For all experiments we used a base model, with all other hyper-parameters taken from the original Transformer-XL paper; the source code was based on a publicly available implementation (https://github.com/cybertronai/transformer-xl). Each configuration was trained for the same number of tokens and, hence, the same number of training iterations.
Figure 1 shows that NovoGrad may require more training steps to converge compared to Adam. However, NovoGrad exhibits a much smaller gap between training and validation perplexity, which results in better generalization and improved performance on the test set.
| optimizer | #tokens | batch | LR policy | init LR | WD | Val PPL | Test PPL |
4.5 Question answering
Question answering is a popular downstream NLP task which frequently starts from a pre-trained language model instead of training the network from scratch. We fine-tuned the large BERT model with Adam, AdamW, and NovoGrad on the question answering benchmark SQuAD v1.1, which involves predicting the answer text span in a paragraph given a question. For Adam, LR warmup over the first 10% of iterations was used to stabilize the initial training phase. With NovoGrad we did not use LR warmup. Interestingly, while NovoGrad required 4 epochs to reach comparable results, it still performed exactly the same number of updates as Adam because of its 2x larger batch size. Table 7 shows the best F1 and Exact Match (EM) scores obtained on the SQuAD evaluation dataset.
| optimizer | batch | epochs | EM | F1 | LR policy | init LR | WD |
5 Conclusion

We propose NovoGrad, a first-order SGD method with gradients normalized by a layer-wise second moment computed as the moving average of the squared norms of the layer gradients. Because of the layer-wise second moment, NovoGrad requires half the memory of Adam. NovoGrad also decouples gradients and weight decay for better regularization.
We tested NovoGrad on very large models for image classification, translation, language modeling, and speech recognition. In these experiments, NovoGrad performed as well as or better than SGD and Adam/AdamW. We found that NovoGrad is more robust to the initial learning rate and to weight initialization; for example, NovoGrad works well with the same learning rate decay schedule, without warmup, while other methods require it. The layer-wise gradient normalization also makes training with NovoGrad robust for large batch sizes: NovoGrad outperformed current methods for ResNet-50 large batch training. Its strong optimization and regularization qualities allow NovoGrad to train longer without over-fitting. NovoGrad and all models described in this work are open-sourced in the OpenSeq2Seq toolkit.
The authors would like to thank Anima Anandkumar, Yaroslav Bulatov, Ilya Loshchilov, and Sebastian Ruder for their valuable feedback.
- Codreanu et al.  Valeriu Codreanu, Damian Podareanu, and Vikram Saletore. Scale out for large minibatch sgd: Residual network training on imagenet-1k with improved accuracy and reduced time to train. arXiv e-prints arXiv:1711.04291, 2017.
- Dai et al.  Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
- Duchi et al.  John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, pages 2121–2159, 2011.
- Goyal et al.  Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv e-prints arXiv:1706.02677, 2017.
- Hazan et al.  E. Hazan, K. Levy, and S. Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Neural Information Processing Systems, pages 1585–1593, 2015.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv e-prints arXiv:1603.05027, 2016.
- He et al.  Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. arXiv e-prints arXiv:1812.01187, 2018.
- Jia et al.  Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, and Xiaowen Chu. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. arXiv e-prints arXiv:1807.11205, 2018.
- Keskar and Socher  Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from adam to SGD. arXiv e-prints arXiv:1712.07628, 2017.
- Kingma and Ba  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- Kuchaiev et al.  Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Carl Case, and Paulius Micikevicius. Openseq2seq: extensible toolkit for distributed and mixed precision training of sequence-to-sequence models. arXiv e-prints arXiv:1805.10387, 2018.
- Li et al.  J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde. Jasper: An end-to-end convolutional neural acoustic model. arXiv e-prints arXiv:1904.03288, 2019.
- Loshchilov and Hutter  Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. ICLR, 2016.
- Loshchilov and Hutter  Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
- Luo et al.  Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations, 2019.
- Merity et al.  Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Micikevicius et al.  Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. ICLR, 2017.
- Nesterov  Y. E. Nesterov. Minimization methods for nonsmooth convex and quasiconvex functions. Matekon, 29:519–531, 1984.
- Neyshabur et al.  Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Neural Information Processing Systems, pages 2422–2430, 2015.
- Panayotov et al.  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE, 2015.
- Polyak  B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964.
- Post  Matt Post. A call for clarity in reporting BLEU scores. arXiv e-prints arXiv:1804.08771, 2018.
- Reddi et al.  Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
- Russakovsky et al.  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
- Sennrich et al.  Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- Shazeer and Stern  Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv e-prints arXiv:1804.04235, 2018.
- Singh et al.  Bharat Singh, Soham De, Yangmuzi Zhang, Thomas Goldstein, and Gavin Taylor. Layer-specific adaptive learning rates for deep networks. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 2015.
- Sutskever et al.  Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 2013.
- Szegedy et al.  Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
- Tieleman and Hinton  T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv: 1706.03762, 2017.
- Wilson et al.  Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Neural Information Processing Systems, pages 4148–4158, 2017.
- Ying et al.  Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. Image classification at supercomputer scale. In Neural Information Processing Systems, 2018.
- You et al.  Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv e-prints arXiv:1708.03888, 2017.
- You et al.  Yang You, Zhao Zhang, Cho-Jui Hsieh, and James Demmel. 100-epoch imagenet training with alexnet in 24 minutes. arXiv e-prints arXiv:1709.05011, 2018.
- Yu et al.  Adams Wei Yu, Qihang Lin, Ruslan Salakhutdinov, and Jaime Carbonell. Block-normalized gradient method: An empirical study for training deep neural network. arXiv e-prints arXiv:1707.04822, 2018.
- Zhang et al.  Zijun Zhang, Lin Ma, Zongpeng Li, and Chuan Wu. Normalized direction-preserving adam. arXiv e-prints arXiv:1709.04546, 2017.