# Stochastic Normalized Gradient Descent with Momentum for Large Batch Training

Stochastic gradient descent (SGD) and its variants have been the dominant optimization methods in machine learning. Compared with small batch training, SGD with large batch training can better utilize the computational power of current multi-core systems like GPUs and can reduce the number of communication rounds in distributed training. Hence, SGD with large batch training has attracted more and more attention. However, existing empirical results show that large batch training typically leads to a drop of generalization accuracy. As a result, large batch training has also become a challenging topic. In this paper, we propose a novel method, called stochastic normalized gradient descent with momentum (SNGM), for large batch training. We theoretically prove that compared to momentum SGD (MSGD), which is one of the most widely used variants of SGD, SNGM can adopt a larger batch size to converge to the ϵ-stationary point with the same computation complexity (total number of gradient computations). Empirical results on deep learning also show that SNGM can achieve state-of-the-art accuracy with a large batch size.


## 1 Introduction

In machine learning, we often need to solve the following empirical risk minimization problem:

$$\min_{w\in\mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad (1)$$

where $w \in \mathbb{R}^d$ denotes the model parameter, $n$ denotes the number of training samples, and $f_i(w)$ denotes the loss on the $i$th training sample. The problem in (1) can be used to formulate a broad family of machine learning models, such as logistic regression and deep learning models.
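As a concrete instance of (1), the following minimal Python sketch (toy data and a squared loss of our own choosing, purely illustrative) evaluates the empirical risk:

```python
import numpy as np

def empirical_risk(w, X, y):
    """F(w) = (1/n) * sum_i f_i(w), with squared loss f_i(w) = 0.5 * (x_i^T w - y_i)^2."""
    residuals = X @ w - y              # shape (n,)
    return 0.5 * np.mean(residuals ** 2)

# Toy problem: n = 4 training samples, d = 2 model parameters.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
y = np.array([1.0, 2.0, 3.0, -1.0])
w = np.zeros(2)
loss = empirical_risk(w, X, y)         # F(0) = 0.5 * mean(y_i^2) = 1.875
```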

Stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its variants have been the dominant optimization methods for solving (1). SGD and its variants are iterative methods. In the $t$th iteration, these methods randomly choose a subset (also called a mini-batch) $\mathcal{I}_t \subseteq \{1, \dots, n\}$ and compute the stochastic mini-batch gradient $\frac{1}{B}\sum_{i\in\mathcal{I}_t}\nabla f_i(w_t)$ for updating the model parameter, where $B = |\mathcal{I}_t|$ is the batch size. Existing works (Li et al., 2014b; Yu et al., 2019a) have proved that with a batch size of $B$, SGD and its momentum variant, called momentum SGD (MSGD), achieve an $O(1/\sqrt{TB})$ convergence rate for smooth non-convex problems, where $T$ is the total number of model parameter updates.

With the popularity of multi-core systems and the ease of implementing data parallelism, many distributed variants of SGD have been proposed, including parallel SGD (Li et al., 2014a), decentralized SGD (Lian et al., 2017), local SGD (Yu et al., 2019b; Lin et al., 2020), local momentum SGD (Yu et al., 2019a) and so on. Theoretical results show that all these methods can achieve an $O(1/\sqrt{TKb})$ convergence rate for smooth non-convex problems. Here, $b$ is the batch size on each worker and $K$ is the number of workers. By setting $B = Kb$, we can observe that the convergence rate of these distributed methods is consistent with that of the sequential methods. In distributed settings, a small number of model parameter updates $T$ implies low synchronization and communication costs. Hence, a small $T$ can further speed up the training process. Based on the convergence rate, we can find that if we adopt a larger $B$, then $T$ will be smaller. Hence, large batch training can reduce the number of communication rounds in distributed training. Another benefit of adopting large batch training is to better utilize the computational power of current multi-core systems like GPUs (You et al., 2017). Hence, large batch training has recently attracted more and more attention in machine learning.

Unfortunately, empirical results (LeCun et al., 2012; Keskar et al., 2017) show that existing SGD methods with a large batch size lead to a drop of generalization accuracy on deep learning models. Figure 1 shows the comparison of training loss and test accuracy between MSGD with a small batch size and MSGD with a large batch size. We can find that large batch training does degrade both training loss and test accuracy. Many works try to explain this phenomenon (Keskar et al., 2017; Hoffer et al., 2017). They observe that SGD with a small batch size typically makes the model parameter converge to a flat minimum, while SGD with a large batch size typically makes the model parameter fall into the region of a sharp minimum. And usually, a flat minimum achieves better generalization than a sharp minimum. Hence, large batch training has also become a challenging topic.

Recently, many methods have been proposed for improving the performance of SGD with a large batch size. The work in (Goyal et al., 2017) proposes many tricks, like warm-up, momentum correction, and linearly scaling the learning rate, for large batch training. The work in (You et al., 2017) observes that the gradient norms at different layers of deep neural networks can be widely different, and the authors propose the layer-wise adaptive rate scaling method (LARS). The work in (Ginsburg et al., 2019) also proposes a similar method that updates the model parameter in a layer-wise way. However, all these methods lack theoretical evidence to explain why they can adopt a large batch size.

In this paper, we propose a novel method, called stochastic normalized gradient descent with momentum (SNGM), for large batch training. SNGM combines normalized gradient (Nesterov, 2004; Hazan et al., 2015; Wilson et al., 2019) and Polyak’s momentum technique (Polyak, 1964) together. The main contributions of this paper are outlined as follows:

• We theoretically prove that compared to MSGD, which is one of the most widely used variants of SGD, SNGM can adopt a larger batch size to converge to an $\epsilon$-stationary point with the same computation complexity (total number of gradient computations). That is to say, SNGM needs a smaller number of parameter updates, and hence has faster training speed than MSGD.

• For a relaxed smooth objective function (see Definition 2), we theoretically show that SNGM can achieve an $\epsilon$-stationary point with a computation complexity of $O(1/\epsilon^4)$. To the best of our knowledge, this is the first work that analyzes the computation complexity of stochastic optimization methods for a relaxed smooth objective function.

• Empirical results on deep learning also show that SNGM can achieve the state-of-the-art accuracy with a large batch size.

## 2 Preliminaries

In this paper, we use $\|\cdot\|$ to denote the Euclidean norm and use $w^*$ to denote one of the optimal solutions of (1), i.e., $w^* \in \arg\min_{w} F(w)$. We call $w$ an $\epsilon$-stationary point of $F$ if $\|\nabla F(w)\| \le \epsilon$. The computation complexity of an algorithm is the total number of its gradient computations. We also give the following assumption and definitions:

###### Assumption 1

($\sigma$-bounded variance) For any $w$, $\mathbb{E}\|\nabla f_i(w) - \nabla F(w)\|^2 \le \sigma^2$ ($\sigma > 0$).

###### Definition 1

(Smoothness) A function $\phi$ is $L$-smooth ($L > 0$) if for any $u, w$,

$$\phi(u) \le \phi(w) + \nabla\phi(w)^\top(u - w) + \frac{L}{2}\|u - w\|^2.$$

$L$ is called the smoothness constant in this paper.

###### Definition 2

(Relaxed smoothness (Zhang et al., 2020)) A function $\phi$ is $(L, \lambda)$-smooth ($L \ge 0$, $\lambda \ge 0$) if $\phi$ is twice differentiable and for any $w$,

$$\|\nabla^2\phi(w)\| \le L + \lambda\|\nabla\phi(w)\|,$$

where $\nabla^2\phi(w)$ denotes the Hessian matrix of $\phi$.

From the above definition, we can observe that if a function is $(L, 0)$-smooth, then it is a classical $L$-smooth function (Nesterov, 2004). For an $(L, \lambda)$-smooth function, we have the following property (Zhang et al., 2020): if $\phi$ is $(L, \lambda)$-smooth, then for any $u, w$ and $\alpha > 0$ such that $\|u - w\| \le \alpha$, we have

$$\|\nabla\phi(u)\| \le (L\alpha + \|\nabla\phi(w)\|)e^{\lambda\alpha}.$$
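This gradient-growth property can be checked on a one-dimensional example. The sketch below (our own illustration, not from the paper) uses $\phi(w) = e^w$, which is $(0, 1)$-smooth since $\phi''(w) = e^w = 0 + 1 \cdot |\phi'(w)|$:

```python
import math

L, lam = 0.0, 1.0          # phi(w) = exp(w) is (L, lambda)-smooth with L = 0, lambda = 1
grad = math.exp            # phi'(w) = exp(w)

w, alpha = 0.3, 0.5
bound = (L * alpha + abs(grad(w))) * math.exp(lam * alpha)
for u in [w - alpha, w - 0.2, w, w + 0.2, w + alpha]:   # all satisfy |u - w| <= alpha
    # ||grad phi(u)|| <= (L * alpha + ||grad phi(w)||) * e^{lambda * alpha}
    assert abs(grad(u)) <= bound + 1e-12
```

At $u = w + \alpha$ the bound is tight for this $\phi$, so the exponential factor cannot be removed in general.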

## 3 Relationship between Smoothness Constant and Batch Size

In this section, we analyze the convergence property of MSGD in depth to find the relationship between the smoothness constant and the batch size, which provides an insightful hint for designing our new method SNGM.

MSGD can be written as follows:

$$v_{t+1} = \beta v_t + g_t, \qquad (2)$$
$$w_{t+1} = w_t - \eta v_{t+1}, \qquad (3)$$

where $g_t$ is a stochastic mini-batch gradient with a batch size of $B$, and $v_{t+1}$ is the Polyak's momentum term (Polyak, 1964).
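For concreteness, a minimal NumPy sketch of the MSGD update (2)–(3) on a toy quadratic (the quadratic, the noise model, and the hyperparameters are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])                   # F(w) = 0.5 * w^T A w, so grad F(w) = A w
w = np.array([1.0, 1.0])
v = np.zeros(2)                            # Polyak momentum buffer, v_0 = 0
beta, eta, B = 0.9, 0.01, 8                # momentum coefficient, learning rate, batch size

for t in range(200):
    noise = rng.normal(size=(B, 2)).mean(axis=0)  # mini-batch noise with variance sigma^2 / B
    g = A @ w + noise                      # stochastic mini-batch gradient g_t
    v = beta * v + g                       # (2): v_{t+1} = beta * v_t + g_t
    w = w - eta * v                        # (3): w_{t+1} = w_t - eta * v_{t+1}
# w is now close to the minimizer 0, up to a noise floor set by eta, beta, and B
```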

We aim to find how large the batch size can be made without loss of performance. The convergence rate of MSGD with batch size $B$ for $L$-smooth functions can be derived from the work in (Yu et al., 2019a). That is to say, when $\eta \le O(1/L)$, we obtain

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w_t)\|^2 \le \frac{2(1-\beta)[F(w_0)-F(w^*)]}{\eta T} + \frac{L\eta\sigma^2}{(1-\beta)^2 B} + \frac{4L^2\eta^2\sigma^2}{(1-\beta)^2} = O\!\left(\frac{B}{\eta C}\right) + O\!\left(\frac{\eta}{B}\right) + O(\eta^2), \qquad (4)$$

where $C = TB$ denotes the computation complexity (total number of gradient computations). According to Corollary 1 in (Yu et al., 2019a), we set $\eta = O(B/\sqrt{C})$ and obtain that

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w_t)\| \le \sqrt{O\!\left(\frac{1}{\sqrt{C}}\right) + O\!\left(\frac{B^2}{C}\right)}. \qquad (5)$$

Since $\eta \le O(1/L)$ is necessary for (4), we first obtain that $B \le O(\sqrt{C}/L)$. Furthermore, according to the right-hand term of (5), we have to set $B$ such that $B^2/C \le O(1/\sqrt{C})$, i.e., $B \le O(C^{1/4})$, for the computation complexity guarantee. Hence, in MSGD we have to set the batch size satisfying

$$B \le O\!\left(\min\left\{\frac{\sqrt{C}}{L},\, C^{1/4}\right\}\right). \qquad (6)$$

We can observe that a larger $L$ leads to a smaller allowed batch size in MSGD. If $B$ does not satisfy (6), MSGD will have a higher computation complexity.

In fact, to the best of our knowledge, among all the existing convergence analyses of SGD and its variants on both convex and non-convex problems, we can observe three necessary conditions for the computation complexity guarantee (Li et al., 2014b, a; Lian et al., 2017; Yu et al., 2019b, a): (a) the objective function is $L$-smooth; (b) the learning rate $\eta$ is less than $O(1/L)$; (c) the batch size $B$ is proportional to the learning rate $\eta$. One direct corollary is that the batch size is limited by the smoothness constant $L$, i.e., $B \le O(1/L)$. Hence, we cannot increase the batch size casually in these SGD-based methods. Otherwise, it may slow down the convergence rate and we will need to compute more gradients, which is consistent with the observations in (Hoffer et al., 2017).

## 4 Stochastic Normalized Gradient Descent with Momentum

In this section, we propose our novel method, called stochastic normalized gradient descent with momentum (SNGM), which is presented in Algorithm 1. In the $t$-th iteration, SNGM runs the following update:

$$u_{t+1} = \beta u_t + \frac{g_t}{\|g_t\|}, \qquad (7)$$
$$w_{t+1} = w_t - \eta u_{t+1}, \qquad (8)$$

where $g_t$ is a stochastic mini-batch gradient with a batch size of $B$. When $\beta = 0$, SNGM degenerates to stochastic normalized gradient descent (SNGD) (Hazan et al., 2015). The term $u_{t+1}$ is a variant of Polyak's momentum. But different from Polyak's MSGD, which adopts $g_t$ directly for updating $w_t$, SNGM adopts the normalized gradient $g_t/\|g_t\|$ for updating $w_t$. In MSGD, we can observe that if $\|v_t\|$ is large, then $\|v_{t+1}\|$ may be large as well, and this may lead to a bad model parameter. Hence, we have to control the learning rate in MSGD, i.e., $\eta \le O(1/L)$, for an $L$-smooth objective function. The following lemma shows that $\|u_t\|$ in SNGM can be well controlled whether $\|g_t\|$ is large or small.

Let $\{u_t\}$ be the sequence produced by (7); then for any $t \ge 0$, we have

$$\|u_t\| \le \frac{1}{1-\beta}.$$
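A minimal NumPy sketch of the SNGM update (7)–(8), with a numerical check of this bound (the toy gradients are our own choice; the point is that the bound holds no matter how the gradient scale varies):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, 1.0])
u = np.zeros(2)                            # momentum buffer, u_0 = 0
beta, eta = 0.9, 0.01

for t in range(100):
    scale = 10.0 ** rng.integers(-3, 4)    # gradient magnitudes varying over six orders
    g = scale * rng.normal(size=2)
    u = beta * u + g / np.linalg.norm(g)   # (7): each increment has unit norm
    w = w - eta * u                        # (8)
    assert np.linalg.norm(u) <= 1.0 / (1.0 - beta) + 1e-9   # Lemma: ||u_t|| <= 1 / (1 - beta)
```

By contrast, the MSGD buffer $v_t$ in (2) grows with the raw gradient norm, which is why MSGD must restrict its learning rate.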

### 4.1 Smooth Objective Function

For a smooth objective function, we have the following convergence result of SNGM: Let $F$ be an $L$-smooth function ($L > 0$). The sequence $\{w_t\}$ is produced by Algorithm 1. Then for any $\eta > 0$, we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w_t)\| \le \frac{2(1-\beta)[F(w_0)-F(w^*)]}{\eta T} + L\kappa\eta + \frac{2\sigma}{\sqrt{B}}, \qquad (9)$$

where $\kappa = \frac{1+\beta}{(1-\beta)^2}$.

We can observe that, different from (4) which needs $\eta \le O(1/L)$, (9) holds for any positive learning rate. According to Theorem 4.1, we obtain the following computation complexity of SNGM: Let $F$ be an $L$-smooth function ($L > 0$). The sequence $\{w_t\}$ is produced by Algorithm 1. Given any total number of gradient computations $C$, let $T = C/B$,

$$B = \sqrt{\frac{C(1-\beta)\sigma^2}{2L(1+\beta)(F(w_0)-F(w^*))}},$$

and

$$\eta = \sqrt{\frac{2(1-\beta)^3(F(w_0)-F(w^*))B}{(1+\beta)LC}}.$$

Then we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w_t)\| \le 2\sqrt{2}\sqrt[4]{\frac{8L(1+\beta)[F(w_0)-F(w^*)]\sigma^2}{(1-\beta)C}} = O\!\left(\frac{1}{C^{1/4}}\right).$$

Hence, the computation complexity for achieving an $\epsilon$-stationary point is $O(1/\epsilon^4)$.

It is easy to verify that the $B$ and $\eta$ in Corollary 4.1 make the right-hand side of (9) minimal. However, this $B$ and $\eta$ rely on $F(w_0) - F(w^*)$ and $\sigma$, which are usually unknown in practice. The following corollary shows the computation complexity of SNGM with simple settings of the learning rate and batch size. Let $F$ be an $L$-smooth function ($L > 0$). The sequence $\{w_t\}$ is produced by Algorithm 1. Given any total number of gradient computations $C$, let $T = C/B$, $B = \sqrt{C}$, and $\eta = 1/C^{1/4}$. Then we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w_t)\| \le \frac{2(1-\beta)[F(w_0)-F(w^*)]}{C^{1/4}} + \frac{L(1+\beta)}{(1-\beta)^2 C^{1/4}} + \frac{2\sigma}{C^{1/4}} = O\!\left(\frac{1}{C^{1/4}}\right).$$

Hence, the computation complexity for achieving an $\epsilon$-stationary point is $O(1/\epsilon^4)$.

According to Corollary 4.1, the batch size of SNGM can be set as $O(\sqrt{C})$, which does not rely on the smoothness constant $L$, and the $O(1/\epsilon^4)$ computation complexity is still guaranteed (see Table 1). Hence, SNGM can adopt a larger batch size than MSGD, especially when $L$ is large.
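A small numeric sketch (illustrative constants of our own choosing; the $O(\cdot)$ constants are dropped) of how the two admissible batch sizes compare, $B_{\mathrm{MSGD}} \le \min\{\sqrt{C}/L,\, C^{1/4}\}$ from (6) versus $B_{\mathrm{SNGM}} = \sqrt{C}$:

```python
import math

def max_batch_msgd(C, L):
    """Bound (6): B <= min(sqrt(C) / L, C^{1/4}), up to constants."""
    return min(math.sqrt(C) / L, C ** 0.25)

def max_batch_sngm(C):
    """SNGM admits B = sqrt(C), independent of the smoothness constant L."""
    return math.sqrt(C)

C = 10 ** 8                                # total number of gradient computations
for L in [1.0, 1e3, 1e5]:
    b_msgd, b_sngm = max_batch_msgd(C, L), max_batch_sngm(C)
    # MSGD's bound shrinks from 100 to 0.1 as L grows from 1 to 1e5,
    # while SNGM's stays at sqrt(1e8) = 1e4 for every L.
```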

### 4.2 Relaxed Smooth Objective Function

Recently, the authors of (Zhang et al., 2020) observed the relaxed smooth property in deep neural networks. According to Definition 2, the relaxed smooth property is more general than the $L$-smooth property. For a relaxed smooth objective function, we have the following convergence result of SNGM:

Let $F$ be an $(L, \lambda)$-smooth function ($L \ge 0$, $\lambda > 0$). The sequence $\{w_t\}$ is produced by Algorithm 1 with learning rate $\eta$ and batch size $B$. Then we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w_t)\| \le \frac{2(1-\beta)[F(w_0)-F(w^*)]}{\eta T} + 8L\kappa\eta + \frac{4\sigma}{\sqrt{B}}, \qquad (10)$$

where $\kappa = \frac{1+\beta}{(1-\beta)^2}$.

According to Theorem 4.2, we obtain the computation complexity of SNGM:

Let $F$ be an $(L, \lambda)$-smooth function ($L \ge 0$, $\lambda > 0$). The sequence $\{w_t\}$ is produced by Algorithm 1. Given any total number of gradient computations $C$, let $T = C/B$, $B = \sqrt{C}$, and $\eta = 1/C^{1/4}$. Then we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w_t)\| \le \frac{2(1-\beta)[F(w_0)-F(w^*)]}{C^{1/4}} + \frac{8L(1+\beta)}{(1-\beta)^2 C^{1/4}} + \frac{4\sigma}{C^{1/4}} = O\!\left(\frac{1}{C^{1/4}}\right).$$

Hence, the computation complexity for achieving an $\epsilon$-stationary point is $O(1/\epsilon^4)$.

According to Corollary 4.2, SNGM with a batch size of $B = O(\sqrt{C})$ can still guarantee an $O(1/\epsilon^4)$ computation complexity for a relaxed smooth objective function.

## 5 Experiments

All experiments are conducted on the PyTorch platform, on a server with eight NVIDIA Tesla V100 (32G) GPU cards. The datasets for evaluation include CIFAR10 and ImageNet.

### 5.1 On CIFAR10

First, we evaluate SNGM by training ResNet20 and ResNet56 on CIFAR10. CIFAR10 contains 50k training samples and 10k test samples. We compare SNGM with MSGD and with LARS (You et al., 2017), an existing large batch training method. We implement LARS by using the open source code. The standard strategy (He et al., 2016) for training the two models on CIFAR10 is using MSGD with a weight decay of 0.0001, a batch size of 128, an initial learning rate of 0.1, and dividing the learning rate by 10 at the 80th and 120th epochs. We adopt this strategy for MSGD in this experiment. For SNGM and LARS, we set a large batch size of 4096 and also a weight decay of 0.0001. Following (You et al., 2017), we adopt the poly power learning rate strategy and the gradient accumulation technique (Ott et al., 2018) with a batch size of 128 for the two large batch training methods. The momentum coefficient is 0.9 for all methods. Different from existing heuristic methods for large batch training, we do not adopt the warm-up strategy for SNGM.
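Gradient accumulation processes a large batch as a sequence of micro-batches and averages their gradients before a single parameter update. A minimal sketch (hypothetical function names, pure NumPy) of how a batch of 4096 can be handled as 32 micro-batches of 128:

```python
import numpy as np

def accumulate_gradient(grad_fn, samples, micro_batch=128):
    """Average per-micro-batch gradients; this equals the full-batch gradient
    exactly when every micro-batch has the same size."""
    acc, n_chunks = 0.0, 0
    for start in range(0, len(samples), micro_batch):
        acc = acc + grad_fn(samples[start:start + micro_batch])
        n_chunks += 1
    return acc / n_chunks                  # the optimizer takes one step with this gradient

# Toy check: for f(w) = 0.5 * mean((x_i - w)^2), the gradient at w = 0 is -mean(x).
rng = np.random.default_rng(0)
x = rng.normal(size=4096)
grad_fn = lambda chunk: -np.mean(chunk)
g_accum = accumulate_gradient(grad_fn, x)  # 32 micro-batches of 128
g_full = grad_fn(x)                        # one batch of 4096
assert abs(g_accum - g_full) < 1e-12       # identical up to floating-point rounding
```

This is why the large batch methods can use an effective batch size of 4096 while only ever materializing 128 samples at a time.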

The results are presented in Figure 2. As can be seen, SNGM achieves a better convergence rate on training loss than LARS. The detailed information about the final convergence results is presented in Table 2. We can observe that MSGD with a batch size of 4096 leads to a significant drop of test accuracy. SNGM with a batch size of 4096 achieves almost the same test accuracy as MSGD with a batch size of 128, while the other large batch training method, LARS, achieves worse test accuracy than MSGD with a batch size of 128. These results verify the effectiveness of SNGM.

### 5.2 On ImageNet

We also compare SNGM with MSGD by training ResNet18 and ResNet50 on ImageNet. The standard strategy (He et al., 2016) for training the two models on ImageNet is using MSGD with a weight decay of 0.0001, a batch size of 256, an initial learning rate of 0.1, and dividing the learning rate by 10 at the 30th and 60th epochs. We adopt this strategy for MSGD in this experiment. For SNGM, we set a larger batch size of 8192 and a weight decay of 0.0001. We still adopt the poly power learning rate and the gradient accumulation with a batch size of 128 for SNGM. We do not adopt the warm-up strategy for SNGM either. The momentum coefficient is 0.9 for both methods. The results are presented in Figure 3 and Table 3. As can be seen, SNGM with a larger batch size achieves almost the same test accuracy as MSGD with a small batch size.

## 6 Conclusion

In this paper, we propose a novel method called stochastic normalized gradient descent with momentum (SNGM) for large batch training. We theoretically prove that compared to MSGD, which is one of the most widely used variants of SGD, SNGM can adopt a larger batch size to converge to an $\epsilon$-stationary point with the same computation complexity. Empirical results on deep learning also show that SNGM can achieve state-of-the-art accuracy with a large batch size.

## References

• Ginsburg et al. (2019) Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, and Jonathan M. Cohen. Stochastic gradient methods with layer-wise adaptive moments for training of deep networks. CoRR, abs/1905.11286, 2019.
• Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.
• Hazan et al. (2015) Elad Hazan, Kfir Y. Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, 2015.
• He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, 2016.
• Hoffer et al. (2017) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, 2017.
• Keskar et al. (2017) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In Proceedings of the International Conference on Learning Representations, 2017.
• LeCun et al. (2012) Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient BackProp, pages 9–48. Springer, 2012.
• Li et al. (2014a) Mu Li, David G. Andersen, Alexander J. Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, 2014a.
• Li et al. (2014b) Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining, 2014b.
• Lian et al. (2017) Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, 2017.
• Lin et al. (2020) Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large mini-batches, use local SGD. In Proceedings of the International Conference on Learning Representations, 2020.
• Nesterov (2004) Yurii E. Nesterov. Introductory Lectures on Convex Optimization - A Basic Course, volume 87 of Applied Optimization. Springer, 2004.
• Ott et al. (2018) Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Conference on Machine Translation, 2018.
• Polyak (1964) Boris Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4:1–17, 1964.
• Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
• Wilson et al. (2019) Ashia C. Wilson, Lester Mackey, and Andre Wibisono. Accelerating rescaled gradient descent: Fast optimization of smooth functions. In Advances in Neural Information Processing Systems, 2019.
• You et al. (2017) Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. CoRR, abs/1708.03888, 2017.
• Yu et al. (2019a) Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In Proceedings of the 36th International Conference on Machine Learning, 2019a.
• Yu et al. (2019b) Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019b.
• Zhang et al. (2020) Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In Proceedings of the International Conference on Learning Representations, 2020.