1 Introduction
The most popular algorithms for training Deep Neural Networks (DNNs) are Stochastic Gradient Descent (SGD) with momentum
(Polyak, 1964; Sutskever et al., 2013), Adam (Kingma and Ba, 2015), RMSProp
(Tieleman and Hinton, 2012), and AdaGrad (Duchi et al., 2011). SGD with momentum is the preferred algorithm for computer vision problems, while Adam is the most commonly used for natural language processing (NLP) and speech problems. Compared to SGD, Adam is perceived as safer and more robust to weight initialization and learning rate policy.
(On Adam's perceived robustness, see A. Karpathy, "A Recipe for Training Neural Networks", http://karpathy.github.io/2019/04/25/recipe/.) However, Adam has certain drawbacks. First, as noted in the original Adam paper (Kingma and Ba, 2015), the second moment can vanish or explode for some variables, which can lead to instability, especially during the initial phase of training. To alleviate this problem, a learning rate (LR) warmup is typically used (Vaswani et al., 2017). Second, Adam often leads to solutions that generalize worse than SGD (Wilson et al., 2017). Finally, it is incompatible with L2-regularization, as shown in Loshchilov and Hutter (2019). To improve Adam regularization, Loshchilov and Hutter (2019) proposed AdamW, a variant of Adam in which weight decay is decoupled from the moment computation. This decoupling significantly boosts the validation accuracy of models trained with Adam, especially for very large networks.
NovoGrad builds upon the strengths of SGD and Adam algorithms in the following ways:

- Gradient normalization with second moments makes it invariant to weight rescaling and improves the algorithm's robustness.

- NovoGrad computes second moments per layer, instead of per individual parameter, resulting in half the memory consumption of Adam (see the explanation in Section 3).

- NovoGrad uses decoupled weight decay (as in AdamW) for better regularization.
We applied NovoGrad to a variety of large-scale problems (image classification, neural machine translation, language modeling, and speech recognition) and found that, in all cases, it performs as well as or better than Adam/AdamW and SGD with momentum.
2 Related work
SGD-based algorithms take a batch of $B$ training samples $[x_1, \dots, x_B]$ and compute the stochastic gradient of the loss $L$ with respect to the weights $w_t$ at each step $t$:

(1) $g_t = \frac{1}{B} \sum_{i=1}^{B} \nabla_w L(x_i, w_t)$

SGD with momentum uses the first-order moment $m_t$ to update the weights:

(2) $m_t = \beta \cdot m_{t-1} + g_t$

(3) $w_{t+1} = w_t - \lambda_t \cdot m_t$

where $\lambda_t$ is the learning rate and $\beta$ is the momentum. (We moved $\lambda_t$ into the weight update for consistency with the TensorFlow and PyTorch implementations.)
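Equations (2)-(3) can be illustrated in a few lines (a minimal sketch; the function name and the list-of-floats parameter representation are ours, not from any framework):

```python
def sgd_momentum_step(w, m, grad, lr, beta):
    """One SGD-with-momentum update:
    m_t = beta * m_{t-1} + g_t, then w_{t+1} = w_t - lr * m_t.
    The learning rate sits in the weight update, matching the
    TensorFlow/PyTorch convention mentioned above."""
    m = [beta * mi + gi for mi, gi in zip(m, grad)]
    w = [wi - lr * mi for wi, mi in zip(w, m)]
    return w, m
```

Note that because the learning rate multiplies the accumulated momentum, changing the LR schedule takes effect immediately rather than being smoothed through the momentum buffer.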
Adam is a popular adaptive learning rate method (Kingma and Ba, 2015). It computes the first- and second-order moments, $m_t$ and $v_t$ respectively, using exponential moving averages:

(4) $m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$

(5) $v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2$

The purpose of the second moment is to "normalize" the first moment during the weight update (we skip the bias correction for brevity):

(6) $w_{t+1} = w_t - \lambda_t \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}$

Note that the Adam update is scale-invariant with respect to the gradient, and the weight update in Equation 6 is bounded approximately by $\lambda_t$ for typical $\beta_1$ and $\beta_2$. These two properties make Adam relatively robust to weight initialization and exploding gradients.
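A compact sketch of Equations (4)-(6) follows (here we do include the per-step bias correction that the text skips; the function name and plain-list representation are our own):

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update per Eqs. (4)-(6), with bias correction
    (t starts at 1)."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, grad)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]   # bias-corrected moments
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v
```

On the first step the bias-corrected update has magnitude close to the learning rate itself, illustrating the boundedness property noted above.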
NovoGrad belongs to the family of Stochastic Normalized Gradient Descent (SNGD) methods (Hazan et al., 2015; Nesterov, 1984). SNGD uses only the direction of the stochastic gradient (SG) to update the weights, so the step size does not depend on the gradient's magnitude. By ignoring the gradient magnitude, SNGD is robust to vanishing and exploding gradients. Hazan et al. (2015) proved that the direction of the gradient is sufficient for convergence. In their experiments, SNGD performed comparably to SGD with momentum on small-scale problems like MNIST.
SGD with layer-wise gradient normalization was introduced by Singh et al. (2015) as a remedy against vanishing gradients. Their method scales up small gradients, while keeping large gradients unchanged:

$\hat{g}_t^l = g_t^l \cdot \left(1 + \log\left(1 + \frac{1}{\|g_t^l\|}\right)\right)$

where $g_t^l$ is the vector of gradients for layer $l$ at step $t$. A similar approach was proposed by Yu et al. (2018), who used layer-wise gradient normalization to alleviate both vanishing and exploding gradients. They divide the stochastic gradient for layer $l$ by its norm $\|g_t^l\|$:

$\hat{g}_t^l = \frac{g_t^l}{\|g_t^l\|}$

They showed that gradient normalization can boost both SGD with momentum and Adam.
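The normalization of Yu et al. is straightforward to sketch (a minimal illustration with a hypothetical function name; each layer's gradient is a plain list of floats):

```python
import math

def normalize_layer_grads(grads_per_layer, eps=1e-8):
    """Divide each layer's gradient by its own L2 norm, so every
    layer contributes a (near) unit-norm direction regardless of
    its gradient magnitude."""
    normalized = []
    for g in grads_per_layer:
        norm = math.sqrt(sum(x * x for x in g))
        normalized.append([x / (norm + eps) for x in g])
    return normalized
```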
NovoGrad is also closely related to Normalized Direction-preserving Adam (ND-Adam), an algorithm proposed by Zhang et al. (2017). For each layer $l$, ND-Adam first removes the projection of the gradient $g_t^l$ onto the current weights $w_t^l$:

$\hat{g}_t^l = g_t^l - (g_t^l \cdot w_t^l) \cdot w_t^l$

Then, $\hat{g}_t^l$ is used to compute the first moment and the scalar second moment:

$m_t^l = \beta_1 \cdot m_{t-1}^l + (1-\beta_1) \cdot \hat{g}_t^l$

$v_t^l = \beta_2 \cdot v_{t-1}^l + (1-\beta_2) \cdot \|\hat{g}_t^l\|^2$

Finally, the weights are updated with the first moment rescaled by the second moment, similarly to Adam:

$w_{t+1}^l = w_t^l - \lambda_t \cdot \frac{m_t^l}{\sqrt{v_t^l} + \epsilon}$

ND-Adam does not use weight decay or L2-regularization. Instead, layer weights are explicitly re-normalized in the spirit of Path-SGD (Neyshabur et al., 2015):

$w_{t+1}^l \leftarrow \frac{w_{t+1}^l}{\|w_{t+1}^l\|}$
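The projection-removal step of ND-Adam can be sketched as follows (a minimal illustration; the function name is ours, and the weights are assumed to be unit-normalized, as ND-Adam maintains):

```python
def remove_projection(g, w):
    """Subtract the component of the layer gradient g that points
    along the unit-normalized weights w, so the remaining update is
    tangent to the sphere and preserves the weight direction."""
    dot = sum(gi * wi for gi, wi in zip(g, w))
    return [gi - dot * wi for gi, wi in zip(g, w)]
```

The returned vector is orthogonal to w by construction.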
Wilson et al. (2017) showed that adaptive methods like Adam generalize worse than SGD with momentum. One solution to this problem, proposed by Keskar and Socher (2017), is to use Adam during the initial stage of training and switch to SGD in the later stage. Luo et al. (2019) proposed to improve Adam regularization by limiting the effective per-parameter learning rate $\lambda_t / \sqrt{v_t}$ to a certain range. They showed that limiting this factor from above helps decrease the training loss, while limiting it from below helps the model generalize better.
Loshchilov and Hutter (2019) showed that Adam's weak regularization is due to the fact that the second-moment normalization effectively disables L2-regularization. They proposed a new method, AdamW, which decouples the weight decay $d \cdot w_t$ from the gradient and adds it directly to the weight update:

(7) $w_{t+1} = w_t - \lambda_t \cdot \left(\frac{m_t}{\sqrt{v_t} + \epsilon} + d \cdot w_t\right)$
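Equation 7 differs from ordinary Adam with L2-regularization only in where the decay term enters. A minimal sketch (function name and list representation are ours; bias correction omitted for brevity, as in the text):

```python
import math

def adamw_step(w, m, v, grad, lr=0.001, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update per Eq. (7): the decoupled weight decay
    wd * w_t is added directly to the update instead of being
    mixed into the gradient (and thus into the moments)."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, grad)]
    w = [wi - lr * (mi / (math.sqrt(vi) + eps) + wd * wi)
         for wi, mi, vi in zip(w, m, v)]
    return w, m, v
```

Because the decay term never passes through the second-moment normalization, its strength is not diluted for parameters with large gradients.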
Because the second moment must be stored separately for every parameter, its computation in Adam doubles the memory required by the optimizer compared to SGD with momentum. This especially affects large models like OpenAI's GPT-2 with 1.5 billion parameters.
Shazeer and Stern (2018) proposed the AdaFactor algorithm, which reduces memory usage by replacing the full second moment with moving averages of the row and column sums of the squared gradients. For a layer defined by an $n \times m$ matrix, this reduces memory from $O(n \times m)$ to $O(n + m)$.

3 Algorithm
Our motivation for this work is to find an algorithm which: (1) performs equally well for image classification, machine translation, and language modeling, and (2) is robust to the learning rate choice and weight initialization. We begin with AdamW as the starting design point. To improve robustness to the learning rate choice, we switch to a layer-wise second moment. This improves stability during the initial training phase, allowing us to remove learning rate warmup and to use the same learning rate policy for a diverse set of tasks. We also use normalized gradients (Hazan et al., 2015) in the first moment for large batch training. The resulting algorithm, NovoGrad, combines the strengths of SGD and Adam without requiring sophisticated learning rate policy tuning, and it works well with large batch sizes.
Let $g_t^l$ be the stochastic gradient for layer $l$ at step $t$. NovoGrad first computes the second moment $v_t^l$ using the norm $\|g_t^l\|$ (we use the L2 norm; it would be interesting to see how L1 or L-infinity norms perform):

(8) $v_t^l = \beta_2 \cdot v_{t-1}^l + (1-\beta_2) \cdot \|g_t^l\|^2$

where $0 \le \beta_2 \le 1$ controls the exponential decay rate of the moving average. The moment $v_t^l$ is used to normalize the gradient when calculating the first-order moment $m_t^l$:

(9) $m_t^l = \beta_1 \cdot m_{t-1}^l + \frac{g_t^l}{\sqrt{v_t^l} + \epsilon}$

where $\beta_1$ is the momentum. The gradient rescaling at each layer improves robustness to weight initialization and prevents vanishing gradients.
Similarly to AdamW, we decouple the weight decay $d \cdot w_t$ from the stochastic gradient for regularization (we move weight decay into the first moment, while AdamW (Loshchilov and Hutter, 2019) uses weight decay in the weight update; we do not observe any difference in performance):

(10) $m_t^l = \beta_1 \cdot m_{t-1}^l + \left(\frac{g_t^l}{\sqrt{v_t^l} + \epsilon} + d \cdot w_t^l\right)$

In practice, good results are obtained with fixed default values of $\beta_1$ and $\beta_2$. The first moment can also be computed via an exponential moving average, in an Adam-like style, instead of momentum:

$m_t^l = \beta_1 \cdot m_{t-1}^l + (1-\beta_1) \cdot \left(\frac{g_t^l}{\sqrt{v_t^l} + \epsilon} + d \cdot w_t^l\right)$

We use the following moment initialization to remove bias:

$v_1^l = \|g_1^l\|^2, \qquad m_1^l = \frac{g_1^l}{\|g_1^l\|} + d \cdot w_1^l$
Weights are updated the same way as in SGD with momentum:

$w_{t+1} = w_t - \lambda_t \cdot m_t$

(To improve the algorithm's robustness for large learning rates, one can optionally apply layer-wise update clipping, similar to LARC; see also Shazeer and Stern (2018). The clipping ensures that $\|\lambda_t \cdot m_t^l\| \le \gamma \cdot \|w_t^l\|$, where $\gamma$ is the clipping threshold.)
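Putting Equations (8)-(10) and the bias-free initialization together, one NovoGrad step can be sketched as follows (a minimal sketch; the function name, the list-of-floats layer representation, and the hyperparameter values in the usage example are illustrative, not prescribed by the text):

```python
import math

def novograd_step(weights, grads, state, lr, b1, b2, wd, eps=1e-8):
    """One NovoGrad step over per-layer weight/gradient vectors:
      v^l: scalar second moment of the layer gradient norm (Eq. 8),
      m^l: momentum over the normalized gradient plus decoupled
           weight decay (Eq. 10), initialized bias-free at step 1,
      w:   plain SGD-with-momentum style weight update."""
    new_weights = []
    for l, (w, g) in enumerate(zip(weights, grads)):
        g_sq = sum(x * x for x in g)                 # ||g_t^l||^2
        if l not in state:                           # bias-free init
            v = g_sq
            m = [x / (math.sqrt(v) + eps) + wd * wi
                 for x, wi in zip(g, w)]
        else:
            v_prev, m_prev = state[l]
            v = b2 * v_prev + (1 - b2) * g_sq        # Eq. (8)
            denom = math.sqrt(v) + eps
            m = [b1 * mp + (x / denom + wd * wi)     # Eq. (10)
                 for mp, x, wi in zip(m_prev, g, w)]
        state[l] = (v, m)
        new_weights.append([wi - lr * mi for wi, mi in zip(w, m)])
    return new_weights, state
```

Note that `state` stores one scalar and one vector per layer, which is where the memory saving over Adam's per-parameter second moment comes from.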
To summarize, NovoGrad is a first-order SGD method with gradients normalized per layer. Borrowing from ND-Adam (Zhang et al., 2017), NovoGrad uses a layer-wise second moment for normalization, and it decouples weight decay from the stochastic gradient for regularization as in AdamW (Loshchilov and Hutter, 2019). NovoGrad has half the memory consumption of Adam (similar to AdaFactor (Shazeer and Stern, 2018), but with a simpler moment computation). Unlike AdaFactor, NovoGrad does not require learning rate warmup.
3.1 Notes on convergence
Similar to other methods that normalize the stochastic gradient by a second moment based on an exponential moving average, one can easily construct a counterexample for a one-dimensional stochastic convex problem, as shown by Wilson et al. (2017) and Reddi et al. (2018). To guarantee the convergence of NovoGrad in the stochastic convex case, we can apply the "AMSGrad" fix (Reddi et al., 2018), replacing $v_t^l$ in the gradient normalization with the non-decreasing sequence

$\hat{v}_t^l = \max(\hat{v}_{t-1}^l, v_t^l)$
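The fix amounts to normalizing by the running maximum of the second moment instead of its moving average (a minimal sketch; the function name is ours):

```python
import math

def amsgrad_normalize(g, v_t, v_hat_prev, eps=1e-8):
    """AMSGrad fix applied to a layer-wise second moment: normalize
    the layer gradient g by the running maximum of v, which is
    non-decreasing and restores stochastic convex convergence
    guarantees (Reddi et al., 2018)."""
    v_hat = max(v_hat_prev, v_t)
    denom = math.sqrt(v_hat) + eps
    return [x / denom for x in g], v_hat
```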
4 Experiments
We evaluated NovoGrad on the following models: ResNet-50 for image classification, Transformer-big for neural machine translation, Jasper for speech recognition, Transformer-XL for language modeling, and BERT for question answering,
and compared it to SGD with momentum, Adam, and AdamW. (Training was done with the OpenSeq2Seq toolkit (Kuchaiev et al., 2018) using mixed precision (Micikevicius et al., 2017) on a DGX-1 with 8 V100 GPUs.) In all the experiments, NovoGrad performed on par with or better than SGD and Adam/AdamW.
4.1 Image classification
We used ResNet-50 v2 (He et al., 2016) for the ImageNet classification task (Russakovsky et al., 2015). (Our OpenSeq2Seq mixed-precision replica of the TensorFlow ResNet-50: https://github.com/NVIDIA/OpenSeq2Seq/blob/master/example_configs/image2label/resnet50v2mp.py.)
We trained this model with 3 optimizers: SGD with momentum (SGD), AdamW, and NovoGrad. All models were trained with a batch size of 1024 for 100 epochs. We used polynomial (quadratic) LR decay for SGD with momentum and NovoGrad. We could not find any reference for training ResNet-50 on ImageNet with AdamW, so we report the best accuracy we achieved after an extensive hyperparameter search with cosine learning rate decay (Loshchilov and Hutter, 2016). We used only standard data augmentation methods: resize, flip, and random crop, and did not employ any additional training tricks (He et al., 2018). The single-crop validation accuracy for each algorithm is reported in Table 1.

Table 1: ImageNet: ResNet-50 v2 single-crop validation accuracy.

optimizer | batch | epochs | top-1,% | top-5,% | LR policy | init LR | WD
SGD       | 1K    | 100    | 76.38   | 93.08   | poly (2)  | 0.400   | 0.0001
SGD       | 1K    | 200    | 76.33   | 92.96   |           |         |
AdamW     | 1K    | 100    | 76.36   | 93.01   | cosine    | 0.002   | 0.120
AdamW     | 1K    | 200    | 76.48   | 92.94   |           |         |
NovoGrad  | 1K    | 100    | 77.00   | 93.37   | poly (2)  | 0.010   | 0.002
NovoGrad  | 1K    | 200    | 77.47   | 93.58   |           |         |
NovoGrad  | 1K    | 300    | 77.63   | 93.73   |           |         |
NovoGrad outperformed both AdamW and SGD, reaching a top-1 accuracy of 77% after 100 epochs. SGD and AdamW accuracy remained below 76.5% when trained for 200 epochs, while NovoGrad accuracy improved to 77.47%. NovoGrad demonstrated strong regularization capabilities: training for 100 additional epochs improved top-1 even further, to 77.63%. Note that this is "vanilla" ResNet-50, without sophisticated data augmentation or additional model tweaking (He et al., 2018).
4.1.1 Large batch training
Hazan et al. (2015) showed that a large batch size is beneficial for SNGD convergence, which motivated us to explore NovoGrad for large batch training. We trained ResNet-50 v2 with batch sizes of 8K and 32K. To compare with previous methods, we trained the model for 90 epochs. To emulate a large batch, we used a mini-batch of 128 per GPU and accumulated gradients from several mini-batches before each weight update.
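The accumulation step described above can be sketched as follows (a minimal illustration; the function name and flat-gradient representation are ours):

```python
def accumulate_gradients(minibatch_grads):
    """Average the gradients of several mini-batches before a single
    weight update, emulating one large batch (e.g., 128 samples per
    GPU accumulated over several steps)."""
    n = len(minibatch_grads)
    total = [0.0] * len(minibatch_grads[0])
    for g in minibatch_grads:
        total = [t + x for t, x in zip(total, g)]
    return [t / n for t in total]
```

Averaging (rather than summing) keeps the gradient scale independent of the number of accumulated mini-batches, so the learning rate does not need to be rescaled with the accumulation count.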
Table 2: ImageNet, ResNet-50 v2 trained with NovoGrad for 90 epochs: accuracy vs. batch size.

batch | top-1,% | top-5,% | init LR | weight decay
1K    | 76.86   | 93.31   | 0.01    | 0.0027
8K    | 76.64   | 93.12   | 0.02    | 0.0060
32K   | 75.48   | 92.46   | 0.03    | 0.0100
Instead of scaling the learning rate linearly with the batch size as in Goyal et al. (2017), we increased both the learning rate and the weight decay to improve regularization (see Table 2).
For comparison, we took 3 other methods which (1) use a fixed batch size during training and (2) do not modify the original model. All 3 methods employ SGD with momentum (SGD). The first method (Goyal et al., 2017) scales the LR linearly with batch size and uses LR warmup to stabilize the initial training phase. The second method (You et al., 2018) combines LR warmup with Layer-wise Adaptive Rate Scaling (LARS) (You et al., 2017). The last method (Codreanu et al., 2017) uses LR warmup and dynamic weight decay (WD).
Table 3: ImageNet, ResNet-50: comparison with other large-batch training methods (top-1 accuracy, %).

Reference               | Optimizer | Bag of Tricks            | #epochs | B=1K  | B=8K  | B=32K
Goyal et al. (2017)     | SGD       | LR warmup                | 90      | 76.47 | 76.26 | 72.45
You et al. (2018)       | SGD       | LR warmup, LARS          | 90      | 75.30 | 75.30 | 75.40
Codreanu et al. (2017)  | SGD       | LR warmup, multi-step WD | 92-100  | 76.50 | 76.26 | 75.31
ours                    | NovoGrad  | none                     | 90      | 76.86 | 76.64 | 75.48
NovoGrad outperformed all other methods without using any additional techniques like LR warmup (Goyal et al., 2017), dynamic weight decay, or special batch normalization initialization.
Jia et al. (2018) and Ying et al. (2018) proposed a few modifications to the ResNet-50 model which significantly improve the accuracy for large batch training. We plan to experiment with augmenting NovoGrad with these techniques, checkpoint averaging (Ying et al., 2018), and label smoothing (Szegedy et al., 2015).

4.2 Neural machine translation
We trained the Transformer "big" model (Vaswani et al., 2017) on the WMT 2014 English-to-German translation task. We used the OpenSeq2Seq (Kuchaiev et al., 2018) transformer-big, which differs from the original implementation (https://github.com/tensorflow/models/tree/master/official/transformer) in two ways: (1) we measure batch size in sentence pairs, not tokens, and (2) we use mixed precision training (Micikevicius et al., 2017). For these experiments, the vocabulary is 32K tokens based on joint source and target byte-pair encoding (Sennrich et al., 2015), built with https://github.com/google/sentencepiece. Models were trained on the WMT'14 dataset and evaluated on newstest14 with sacreBLEU (Post, 2018) on detokenized output (signature: BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.13a+version.1.2.12). For Adam and AdamW we used the "Noam" learning rate policy (Shazeer and Stern, 2018): a warmup period of 8,000 steps, with the LR decreasing thereafter proportionally to the inverse square root of the step number. It was observed in Vaswani et al. (2017), and our experiments confirm this, that learning rate warmup is crucial for training Transformer-big with these algorithms. With NovoGrad, however, we were able to use the same poly decay policy as for ResNet-50 without any warmup.
Table 4: WMT'14 English-to-German: sacreBLEU on newstest14, cased (c) and lowercase (lc).

Optimizer | batch | epochs | BLEU (c) | BLEU (lc) | LR policy | init LR | weight decay
Adam      | 1K    | 100    | 27.6     | 28.1      | Noam      | 1.0     |
Adam      | 1K    | 200    | 27.8     | 28.3      | Noam      | 2.0     |
AdamW     | 1K    | 100    | 27.8     | 28.3      | Noam      | 2.0     |
AdamW     | 1K    | 200    | 27.8     | 28.2      | Noam      | 2.0     |
NovoGrad  | 1K    | 100    | 28.1     | 28.5      | poly (2)  | 0.03    |
NovoGrad  | 1K    | 200    | 28.5     | 29.0      | poly (2)  | 0.035   |
NovoGrad performed better than Adam/AdamW, especially for long runs. We observed that NovoGrad is also more stable than Adam with respect to the initial LR choice, and it converges without LR warmup.
4.3 Speech recognition
We conducted experiments with Jasper-10x5 (Li et al., 2019), a state-of-the-art deep convolutional neural acoustic model, on the LibriSpeech 960h speech recognition task (Panayotov et al., 2015). Jasper was trained with SGD with momentum (SGD) and with NovoGrad for 400 epochs. In both cases, we used a batch size of 256, polynomial LR decay, speed perturbation for data augmentation, and Layer-wise Adaptive Rate Clipping (LARC) for gradient clipping. (See https://github.com/NVIDIA/OpenSeq2Seq/blob/master/open_seq2seq/optimizers and https://github.com/NVIDIA/apex/blob/master/apex/parallel/LARC.py. LARC clips the gradient $g^l$ of each layer with respect to the layer weights $w^l$ so that $\|g^l\| \le \gamma \cdot \|w^l\|$, where $\gamma$ is the trust coefficient.)
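The layer-wise clipping can be sketched as follows (a minimal illustration, not the apex implementation; the function name and the trust coefficient value are ours):

```python
import math

def larc_clip(g, w, trust=0.02, eps=1e-8):
    """Layer-wise Adaptive Rate Clipping sketch: rescale the layer
    gradient g so that its norm is at most trust * ||w||; gradients
    already inside the trust region are left unchanged."""
    g_norm = math.sqrt(sum(x * x for x in g))
    w_norm = math.sqrt(sum(x * x for x in w))
    scale = min(1.0, trust * w_norm / (g_norm + eps))
    return [x * scale for x in g]
```

Unlike global-norm clipping, the threshold here adapts to each layer's weight scale, so layers with small weights get proportionally smaller updates.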
Table 5: LibriSpeech: Jasper-10x5 Word Error Rate (WER), %.

Optimizer | dev-clean | dev-other | test-clean | test-other
Adam      | 13.20     | 31.71     | 13.36      | 32.71
SGD       | 3.91      | 12.77     | 3.98       | 12.79
NovoGrad  | 3.64      | 11.89     | 3.86       | 11.95
We found that NovoGrad yields lower Word Error Rates (WER) compared to SGD with momentum, especially for long runs. Unfortunately, we were unable to get good results with Adam. The details about the model and the training parameters are available in Li et al. (2019).
4.4 Language modeling
We trained Transformer-XL (Dai et al., 2019), the state-of-the-art LM architecture, on the word-level WikiText-103 (Merity et al., 2016) benchmark. For all the experiments we used the base model configuration; all other hyperparameters were taken from the original Transformer-XL paper, and the source code was based on a publicly available implementation (https://github.com/cybertronai/transformer-xl). Each configuration was trained on the same total number of tokens, corresponding to the same number of epochs and training iterations for all optimizers.
Figure 1 shows that NovoGrad may require more training steps for the model to converge if compared to Adam. However, NovoGrad exhibits a much smaller gap between training and validation perplexity, which results in better generalization and improved performance on the test set.
Table 6: WikiText-103: Transformer-XL validation and test perplexity.

Optimizer | #tokens | batch | LR policy | init LR | WD | Val PPL | Test PPL
Adam      |         |       | cosine    |         |    |         |
AdamW     |         |       | cosine    |         |    |         |
NovoGrad  |         |       | poly (2)  | 0.01    |    |         |
4.5 Question answering
Question answering is a popular downstream NLP task which frequently uses a pretrained language model instead of training the resulting neural net from scratch. We fine-tuned the large BERT model with Adam, AdamW, and NovoGrad on the question answering benchmark SQuAD v1.1, which involves predicting the answer text span in a paragraph given a question. For Adam, LR warmup over the first 10% of iterations was used to stabilize the initial training phase. With NovoGrad we did not use LR warmup. Interestingly, while NovoGrad required 4 epochs to get comparable results, it still performed exactly the same number of updates as Adam because of its 2x larger batch size. Table 7 shows the best F1 and Exact Match (EM) scores obtained for the SQuAD benchmark on the evaluation dataset.
Table 7: SQuAD v1.1: BERT-large fine-tuning, best EM and F1 on the evaluation set.

Optimizer | batch | epochs | EM    | F1    | LR policy          | init LR | WD
Adam      | 12    | 2      | 84.66 | 91.28 | poly (1) + warmup  |         |
AdamW     | 16    | 2      | 84.52 | 91.19 | poly (1) + warmup  |         |
NovoGrad  | 24    | 4      | 84.43 | 91.14 | cosine             |         |
5 Conclusion
We propose NovoGrad, a first-order SGD method with gradients normalized by a layer-wise second moment computed as a moving average of the squared norms of layer gradients. Because the second moment is layer-wise, NovoGrad requires half the optimizer memory of Adam. NovoGrad also decouples weight decay from the gradient for better regularization.
We tested NovoGrad on very large models for image classification, translation, language modeling, and speech recognition. In these experiments, NovoGrad performed as well as or better than SGD and Adam/AdamW. We found that NovoGrad is more robust to the initial learning rate and weight initialization. For example, NovoGrad works well with the same learning rate decay schedule without warmup, while the other methods require warmup. The layer-wise normalized gradient makes training with NovoGrad robust for large batch sizes, and NovoGrad outperformed current methods for ResNet-50 large batch training. Its strong optimization and regularization qualities allow training longer without overfitting. NovoGrad and all models described in this work are open-sourced in the OpenSeq2Seq toolkit.
Acknowledgments
The authors would like to thank Anima Anandkumar, Yaroslav Bulatov, Ilya Loshchilov, and Sebastian Ruder for their valuable feedback.
References
 Codreanu et al. [2017] Valeriu Codreanu, Damian Podareanu, and Vikram Saletore. Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv e-prints arXiv:1711.04291, 2017.
 Dai et al. [2019] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

 Duchi et al. [2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, pages 2121–2159, 2011.
 Goyal et al. [2017] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv e-prints arXiv:1706.02677, 2017.
 Hazan et al. [2015] E. Hazan, K. Levy, and S. Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Neural Information Processing Systems, pages 1585–1593, 2015.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv e-prints arXiv:1603.05027, 2016.
 He et al. [2018] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. arXiv e-prints arXiv:1812.01187, 2018.
 Jia et al. [2018] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, and Xiaowen Chu. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv e-prints arXiv:1807.11205, 2018.
 Keskar and Socher [2017] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from Adam to SGD. arXiv e-prints arXiv:1712.07628, 2017.
 Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 Kuchaiev et al. [2018] Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Carl Case, and Paulius Micikevicius. OpenSeq2Seq: extensible toolkit for distributed and mixed precision training of sequence-to-sequence models. arXiv e-prints arXiv:1805.10387, 2018.
 Li et al. [2019] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde. Jasper: An end-to-end convolutional neural acoustic model. arXiv e-prints arXiv:1904.03288, 2019.
 Loshchilov and Hutter [2016] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv e-prints arXiv:1608.03983, 2016.
 Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
 Luo et al. [2019] Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations, 2019.
 Merity et al. [2016] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
 Micikevicius et al. [2017] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. ICLR, 2017.
 Nesterov [1984] Y. E. Nesterov. Minimization methods for nonsmooth convex and quasiconvex functions. Matekon, 29:519–531, 1984.
 Neyshabur et al. [2015] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Neural Information Processing Systems, pages 2422–2430, 2015.
 Panayotov et al. [2015] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE, 2015.
 Polyak [1964] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964.
 Post [2018] Matt Post. A call for clarity in reporting BLEU scores. arXiv e-prints arXiv:1804.08771, 2018.
 Reddi et al. [2018] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
 Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li FeiFei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 Sennrich et al. [2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
 Shazeer and Stern [2018] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv e-prints arXiv:1804.04235, 2018.
 Singh et al. [2015] Bharat Singh, Soham De, Yangmuzi Zhang, Thomas Goldstein, and Gavin Taylor. Layer-specific adaptive learning rates for deep networks. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 2015.

 Sutskever et al. [2013] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 2013.
 Szegedy et al. [2015] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
 Tieleman and Hinton [2012] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv: 1706.03762, 2017.
 Wilson et al. [2017] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Neural Information Processing Systems, pages 4148–4158, 2017.
 Ying et al. [2018] Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. Image classification at supercomputer scale. In Neural Information Processing Systems, 2018.
 You et al. [2017] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv e-prints arXiv:1708.03888, 2017.
 You et al. [2018] Yang You, Zhao Zhang, Cho-Jui Hsieh, and James Demmel. 100-epoch ImageNet training with AlexNet in 24 minutes. arXiv e-prints arXiv:1709.05011, 2018.
 Yu et al. [2018] Adams Wei Yu, Qihang Lin, Ruslan Salakhutdinov, and Jaime Carbonell. Block-normalized gradient method: An empirical study for training deep neural network. arXiv e-prints arXiv:1707.04822, 2018.
 Zhang et al. [2017] Zijun Zhang, Lin Ma, Zongpeng Li, and Chuan Wu. Normalized direction-preserving Adam. arXiv e-prints arXiv:1709.04546, 2017.