# On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks


## 1 Introduction

In this work we consider the following general non-convex stochastic optimization problem:

$$\min_{x\in\mathbb{R}^d} \; f(x) := \mathbb{E}_z\big[F(x,z)\big], \qquad (1)$$

where $\mathbb{E}_z$ denotes the expectation with respect to the random variable $z$. We assume that $f$ is bounded from below, i.e., $f^* := \inf_x f(x) > -\infty$, and that its gradient $\nabla f$ is $L$-Lipschitz continuous.

Problem (1) arises from many statistical learning models (e.g., logistic regression, AUC maximization) and deep learning models goodfellow2016deep ; lecun2015deep , as the expectation in problem (1) often can only be approximated by a finite sum. One of the most popular algorithms for solving problem (1) is Stochastic Gradient Descent (SGD) robbins1985stochastic ; bottou2018optimization :

$$x_{t+1} := x_t - \eta_t g_t, \qquad (2)$$

where $\eta_t$ is the learning rate and $g_t$ is the noisy gradient estimate of $\nabla f(x_t)$ at the $t$-th iteration. Its convergence rates in both the convex and non-convex settings have been established bottou1998online ; ghadimi2013stochastic .

However, vanilla SGD suffers from slow convergence, and its performance is sensitive to the learning rate, which is tricky to tune. Many techniques have been introduced to improve the convergence speed and robustness of SGD, such as variance reduction (e.g., SVRG, SAGA, SARAH) and adaptive learning rates. In particular, AdaGrad duchi2011adaptive accumulates past gradients to set the learning rate as

$$\eta_t = \frac{\eta}{\sqrt{\sum_{i=1}^{t} g_i^2} + \epsilon}, \qquad (3)$$

where $\eta > 0$ and $\epsilon > 0$ are fixed parameters. On the other hand, Heavy Ball (HB) polyak1964some ; ghadimi2015global and Nesterov Accelerated Gradient (NAG) nesterov1983method ; nesterov2013introductory are the two most popular momentum acceleration techniques, whose stochastic variants have been extensively studied ghadimi2016accelerated ; yan2018unified ; levy2018online :

$$\text{(SHB):}\;\begin{cases} m_t = \mu m_{t-1} - \eta_t g_t,\\ x_{t+1} = x_t + m_t,\end{cases} \qquad \text{(SNAG):}\;\begin{cases} y_{t+1} = x_t - \eta_t g_t,\\ x_{t+1} = y_{t+1} + \mu(y_{t+1} - y_t),\end{cases} \qquad (4)$$

where $m_0 = 0$, $y_1 = x_1$, and $\mu \in [0,1)$ is the momentum factor.
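For concreteness, the following minimal sketch (ours, with illustrative names `shb_step`, `snag_step`, `lr`, `mu`) performs one SHB and one SNAG step as in Eq. (4); it is a sketch of the standard updates, not code from the paper.

```python
def shb_step(x, m_prev, grad, lr, mu):
    """One Stochastic Heavy Ball step: m_t = mu*m_{t-1} - lr*g_t, x_{t+1} = x_t + m_t."""
    m = mu * m_prev - lr * grad
    return x + m, m            # returns (x_{t+1}, m_t)

def snag_step(x, y_prev, grad, lr, mu):
    """One Stochastic NAG step: y_{t+1} = x_t - lr*g_t, x_{t+1} = y_{t+1} + mu*(y_{t+1} - y_prev)."""
    y = x - lr * grad
    return y + mu * (y - y_prev), y   # returns (x_{t+1}, y_{t+1}); pass y back in as y_prev next step
```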

Both the adaptive learning rate and momentum techniques have been individually investigated and shown to be effective in practice, and they are independently and widely applied in tasks such as training deep networks krizhevsky2012imagenet ; sutskever2013importance ; kingma2014adam ; reddi2018convergence . A natural question arises: can we effectively incorporate both techniques at the same time, so as to inherit their advantages, and moreover develop a convergence theory for this combination, especially in the more challenging non-convex stochastic setting? To the best of our knowledge, Levy et al. levy2018online were the first to combine the adaptive learning rate with NAG momentum, which yields the AccAdaGrad algorithm. However, its convergence analysis is limited to the convex stochastic setting. Yan et al. yan2018unified unified SHB and SNAG into a three-step iterate without considering the adaptive learning rate in Eq. (3). Our main contributions are summarized as follows:

• We develop a new weighted gradient accumulation technique to estimate the adaptive learning rate, and propose a novel unified stochastic momentum scheme that covers SHB and SNAG. We then integrate the weighted coordinate-wise AdaGrad with this unified momentum mechanism, yielding a novel adaptive stochastic momentum algorithm, dubbed AdaUSM.

• We establish an $O\big(\log(\sum_{t=1}^T a_t)/\sqrt{T}\big)$ non-asymptotic convergence rate for AdaUSM in the general non-convex stochastic setting, under natural and mild assumptions.

• We show that the adaptive learning rates of Adam and RMSProp correspond to taking exponentially growing weights in AdaUSM, thereby providing a new perspective for understanding Adam and RMSProp.

## 2 Preliminaries

#### Notations.

$T$ denotes the maximum number of iterations. The noisy gradient of $\nabla f(x_t)$ at the $t$-th iteration is denoted by $g_t$ for all $1 \le t \le T$. We use $\mathbb{E}$ to denote the total expectation as usual, and $\mathbb{E}_t$ the conditional expectation with respect to $g_t$ conditioned on the past random variables $g_1, \ldots, g_{t-1}$.

In this paper we allow different learning rates across coordinates, so the learning rate $\eta_t$ is a vector in $\mathbb{R}^d$. Given a vector $x \in \mathbb{R}^d$ we denote its $k$-th coordinate by $x_k$. The $k$-th coordinate of the gradient $\nabla f(x)$ is denoted by $\nabla_k f(x)$. Given two vectors $x, y \in \mathbb{R}^d$, their inner product is denoted by $\langle x, y\rangle$. We also heavily use the coordinate-wise product between $x$ and $y$, denoted $x \odot y$, with $(x \odot y)_k = x_k y_k$; division by a vector is defined similarly. Given a vector $w$ with positive entries, we define the weighted norm $\|x\|_w^2 := \sum_{k=1}^d w_k x_k^2$. A norm without any subscript is the Euclidean norm. Finally, we write $\bar{a}_t := \frac{1}{t}\sum_{i=1}^t a_i$ for the average of the weights $a_1, \ldots, a_t$ used in Subsection 3.2.

#### Assumptions.

We assume that the noisy gradients $g_1, g_2, \ldots, g_T$ are independent of each other. Moreover,

• (A1) $g_t$ is an unbiased estimator of $\nabla f(x_t)$, i.e., $\mathbb{E}_t[g_t] = \nabla f(x_t)$;

• (A2) $\mathbb{E}\|g_t\|^2 \le \sigma^2$, i.e., the second-order moment of $g_t$ is bounded.

Notice that condition (A2) is slightly weaker than that of chen2018convergence , which assumes that the stochastic gradient estimate is uniformly bounded, i.e., $\|g_t\| \le \sigma$ almost surely.

## 3 AdaUSM and Its Convergence Analysis

We describe the two main ingredients of AdaUSM: the unified stochastic momentum formulation of SHB and SNAG (Subsection 3.1), and the weighted adaptive learning rate (Subsection 3.2).

### 3.1 Unified Stochastic Momentum (USM)

By introducing $m_t := y_{t+1} - y_t$ with $m_0 = 0$, the iterate of SNAG can be equivalently written as

$$\text{(SNAG):}\quad m_t = \mu m_{t-1} - \eta_t g_t, \qquad x_{t+1} = x_t + m_t + \mu(m_t - m_{t-1}).$$

Comparing SHB with the above form of SNAG, the difference lies in that SNAG puts more weight on the current momentum $m_t$. Hence, we can rewrite SHB and SNAG in the following unified form:

$$\text{(USM):}\quad\begin{cases} m_t = \mu m_{t-1} - \eta_t g_t,\\ x_{t+1} = x_t + m_t + \lambda\mu(m_t - m_{t-1}),\end{cases} \qquad (5)$$

where $\lambda \ge 0$ is a constant. When $\lambda = 0$, USM reduces to SHB; when $\lambda = 1$, it reduces to SNAG. We call $\lambda$ the interpolation factor. For any $\mu \in (0,1)$, $\lambda$ can be chosen from a range containing $[0, 1]$, so both SHB and SNAG are covered.
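A minimal sketch of the USM update in Eq. (5) is given below (ours, with illustrative names); setting `lam = 0` recovers the SHB step and `lam = 1` recovers the SNAG step.

```python
def usm_step(x, m_prev, grad, lr, mu, lam):
    """Unified Stochastic Momentum step, Eq. (5):
       m_t = mu*m_{t-1} - lr*g_t,  x_{t+1} = x_t + m_t + lam*mu*(m_t - m_{t-1})."""
    m = mu * m_prev - lr * grad
    x_next = x + m + lam * mu * (m - m_prev)
    return x_next, m           # lam = 0: SHB, lam = 1: SNAG
```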

###### Remark 1.

Yan et al. yan2018unified unified SHB and SNAG as a three-step iterate scheme as follows:

$$y_{t+1} = x_t - \eta_t g_t, \qquad y^{s}_{t+1} = x_t - s\,\eta_t g_t, \qquad x_{t+1} = y_{t+1} + \mu\big(y^{s}_{t+1} - y^{s}_t\big), \qquad (6)$$

where $s \ge 0$ is a parameter. Its convergence rate has been established in yan2018unified . Notably, USM is slightly simpler than Eq. (6), and the learning rate in USM is determined adaptively.

### 3.2 Weighted Adaptive Learning Rate

We generalize the learning rate in Eq. (3) by assigning different weights to the past stochastic gradients accumulated. It is defined as follows:

$$\eta_{t,k} = \frac{\eta}{\sqrt{\sum_{i=1}^t a_i g_{i,k}^2/\bar{a}_t} + \epsilon} = \frac{\eta/\sqrt{t}}{\sqrt{\sum_{i=1}^t a_i g_{i,k}^2/\big(\sum_{i=1}^t a_i\big)} + \epsilon/\sqrt{t}}, \qquad (7)$$

for $k = 1, \ldots, d$, where $a_i > 0$ are the weights and $\bar{a}_t = \frac{1}{t}\sum_{i=1}^t a_i$. Here, $\eta/\sqrt{t}$ can be understood as the base learning rate. The classical AdaGrad corresponds to taking $a_i = 1$ for all $i$ in Eq. (7), i.e., uniform weights. However, recent gradients tend to carry more information about the local geometry than remote ones. Hence, it is natural to assign larger weights to the recent gradients. A typical choice is $a_i = i^{\alpha}$ for some $\alpha > 0$, which grows at a polynomial rate. For instance, in AccAdaGrad levy2018online the weights are constant for the first few iterations and grow linearly thereafter.
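The following sketch (ours; the function name `weighted_lr` and the default values of `eta` and `eps` are illustrative) computes the coordinate-wise weighted learning rate of Eq. (7) for a given weight sequence. Uniform weights recover the classical AdaGrad rate, while polynomially growing weights emphasize recent gradients.

```python
import numpy as np

def weighted_lr(grads, weights, eta=0.01, eps=1e-8):
    """Weighted AdaGrad learning rate (Eq. 7) at step t = len(grads).
    grads:   list of stochastic gradients g_1, ..., g_t (each a d-vector)
    weights: positive weights a_1, ..., a_t
    Returns the per-coordinate learning rate eta_{t,k}."""
    G = np.stack(grads)                 # shape (t, d)
    a = np.asarray(weights)[:, None]    # shape (t, 1)
    a_bar = a.mean()                    # \bar{a}_t = (1/t) * sum_i a_i
    acc = (a * G**2).sum(axis=0) / a_bar
    return eta / (np.sqrt(acc) + eps)
```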

### 3.3 AdaUSM and Its Convergence Rate

In this subsection, we present the AdaUSM algorithm, which integrates the weighted adaptive learning rate in Eq. (7) with the USM scheme in Eq. (5), and establish its convergence rate.
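Putting the two ingredients together, a minimal sketch of the AdaUSM loop might look as follows; the interface (`grad_fn` returning a stochastic gradient, polynomial weights `a_t = t**alpha`, and the default hyperparameters) is our illustrative assumption rather than the paper's pseudocode.

```python
import numpy as np

def adausm(grad_fn, x0, T, eta=0.01, mu=0.9, lam=1.0, alpha=1.0, eps=1e-8):
    """Sketch of AdaUSM: weighted AdaGrad learning rate (Eq. 7) + USM momentum (Eq. 5).
    grad_fn(x, t) should return a stochastic gradient of f at x."""
    x = x0.copy()
    m = np.zeros_like(x)
    acc = np.zeros_like(x)          # running sum_i a_i * g_i^2 (per coordinate)
    a_sum = 0.0                     # running sum_i a_i
    for t in range(1, T + 1):
        g = grad_fn(x, t)
        a_t = float(t) ** alpha               # polynomially growing weights
        acc += a_t * g**2
        a_sum += a_t
        a_bar = a_sum / t                     # \bar{a}_t
        lr = eta / (np.sqrt(acc / a_bar) + eps)   # coordinate-wise eta_{t,k}, Eq. (7)
        m_prev = m
        m = mu * m_prev - lr * g                  # USM momentum
        x = x + m + lam * mu * (m - m_prev)       # lam = 0: AdaHB, lam = 1: AdaNAG
    return x
```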

###### Theorem 1.

Let $\{x_t\}$ be the sequence generated by AdaUSM. Assume that the noisy gradients satisfy assumptions (A1)-(A2), and that the sequence of weights $\{a_t\}$ is non-decreasing in $t$. Let $\tau$ be uniformly randomly drawn from $\{1, \ldots, T\}$. Then

$$\big(\mathbb{E}\,\|\nabla f(x_\tau)\|^{4/3}\big)^{2/3} \le \mathrm{Bound}(T) = O\!\left(\log\Big(\sum_{t=1}^{T} a_t\Big)\Big/\sqrt{T}\right), \qquad (8)$$

where the precise expression of $\mathrm{Bound}(T)$ and its constants are given in the full proof.

###### Sketch of proof.

Our starting point is the following inequality, which follows from the $L$-Lipschitz continuity of $\nabla f$ and the descent lemma in nesterov2013introductory :

$$f(x_{t+1}) \le f(x_t) + \big\langle\nabla f(x_t),\, m_t + \lambda\mu(m_t - m_{t-1})\big\rangle + \frac{L}{2}\,\big\|m_t + \lambda\mu(m_t - m_{t-1})\big\|^2.$$

The key point is to estimate the term $\langle\nabla f(x_t), m_t\rangle$, which involves the momentum. However, $m_t$ is far from an unbiased estimate of the true gradient $\nabla f(x_t)$. This difficulty is resolved in Lemma 3, where we establish an estimate of this term via the iteration

$$\langle\nabla f(x_t), m_t\rangle \le (1+2\lambda)L\sum_{i=1}^{t-1}\|m_i\|^2\mu^{t-i} - \sum_{i=1}^{t}\big\langle\nabla f(x_i),\, \eta_i\odot g_i\big\rangle\,\mu^{t-i}.$$

Furthermore, each $\|m_i\|^2$ can be bounded in terms of $\|\eta_i\odot g_i\|^2$ (by Lemma 2). For the term $\langle\nabla f(x_i), \eta_i\odot g_i\rangle$, taking expectation is tricky since $\eta_i$ is correlated with $g_i$. We follow the idea of ward2018adagrad and consider the term $\langle\nabla f(x_i), \hat{\eta}_i\odot g_i\rangle$, where $\hat{\eta}_i$ is an approximation of $\eta_i$ that is independent of $g_i$. Hence its conditional expectation gives rise to the desired term $\|\nabla f(x_i)\|^2_{\hat{\eta}_i}$. With a suitable choice of $\hat{\eta}_i$, we establish the corresponding estimate in Lemma 4. Summarizing the estimates above leads to the bound in Lemma 7:

$$\mathbb{E}\Big[\sum_{t=1}^{T} \|\nabla f(x_t)\|^2_{\hat{\eta}_t}\Big] \le C_1 + C_2\,\mathbb{E}\Big[\sum_{t=1}^{T} (a_t/\bar{a}_t)\,\|\eta_t\odot g_t\|^2\Big].$$

With the specific adaptive learning rate in Eq. (7), we can further show that the term $\mathbb{E}\big[\sum_{t=1}^{T}(a_t/\bar{a}_t)\|\eta_t\odot g_t\|^2\big]$ is bounded by $O\big(\log(\sum_{t=1}^{T} a_t)\big)$ (by Lemma 5), while the weighted gradient term on the left-hand side is converted into the bound on $\big(\mathbb{E}\|\nabla f(x_\tau)\|^{4/3}\big)^{2/3}$ via Hölder's inequality (by Lemma 6). This immediately gives rise to our theorem. ∎

###### Remark 2.

When we take $a_t = t^{\alpha}$ for some constant power $\alpha \ge 0$, then $\log\big(\sum_{t=1}^{T} a_t\big) = O(\log T)$ and the conditions of Theorem 1 are satisfied. Hence, AdaUSM with such weights attains the $O(\log T/\sqrt{T})$ convergence rate. In fact, AdaUSM is convergent as long as $\log\big(\sum_{t=1}^{T} a_t\big) = o(\sqrt{T})$.

When the interpolation factor $\lambda = 1$ (i.e., NAG-type momentum) and the weights are chosen as in AccAdaGrad, AdaUSM reduces to a coordinate-wise variant of AccAdaGrad levy2017online . In this case, $\log\big(\sum_{t=1}^{T} a_t\big) = O(\log T)$. Thus, we have the following corollary for the convergence rate of AccAdaGrad in the non-convex stochastic setting.

###### Corollary 1.

Assume the same setting as in Theorem 1. Let $\tau$ be randomly selected from $\{1, \ldots, T\}$ with equal probability $1/T$. Then

$$\big(\mathbb{E}\big[\|\nabla f(x_\tau)\|^{4/3}\big]\big)^{2/3} \le \mathrm{Bound}(T) = O\big(\log T/\sqrt{T}\big), \qquad (9)$$

where the precise expression of $\mathrm{Bound}(T)$ and its constants are given in the full proof.

###### Remark 3.

The non-asymptotic convergence rate of AccAdaGrad measured by the objective value has already been established in levy2018online in the convex stochastic setting. Corollary 1 provides the convergence rate of coordinate-wise AccAdaGrad measured by the gradient residual, which supplements the results of levy2018online in the non-convex stochastic setting.

## 4 Relationships with Adam and RMSProp

In this section, we show that the exponential moving average (EMA) technique used to estimate the adaptive learning rates in Adam kingma2014adam and RMSProp hinton2012lecture is a special case of the weighted adaptive learning rate in Eq. (7): their adaptive learning rates correspond to taking exponentially growing weights in AdaUSM, which provides a new angle for understanding Adam and RMSProp.

### 4.1 Adam

For better comparison, we first write the $t$-th iterate of Adam kingma2014adam as follows:

$$\begin{cases} \hat{m}_{t,k} = \beta_1\hat{m}_{t-1,k} + (1-\beta_1)g_{t,k}, & m_{t,k} = \hat{m}_{t,k}/(1-\beta_1^t),\\ \hat{v}_{t,k} = \beta_2\hat{v}_{t-1,k} + (1-\beta_2)g_{t,k}^2, & v_{t,k} = \hat{v}_{t,k}/(1-\beta_2^t),\\ x_{t+1,k} = x_{t,k} - \eta\, m_{t,k}\big/\big(\sqrt{t\,v_{t,k}} + \sqrt{t}\,\epsilon\big), \end{cases}$$

for $k = 1, \ldots, d$, where $\beta_1, \beta_2 \in [0,1)$ are constants and $\epsilon > 0$ is a sufficiently small constant. Denoting $\eta_{t,k} := \eta/\big(\sqrt{t\,v_{t,k}} + \sqrt{t}\,\epsilon\big)$, we can simplify the iteration of Adam as

$$\begin{cases} \hat{m}_{t,k} = \beta_1\hat{m}_{t-1,k} + (1-\beta_1)g_{t,k}, \quad m_{t,k} = \hat{m}_{t,k}/(1-\beta_1^t), & \text{(EMA step)}\\ x_{t+1,k} = x_{t,k} - \eta_{t,k}\, m_{t,k}. \end{cases}$$

Below, we show that AdaUSM and Adam differ in two aspects: the momentum estimation $m_{t,k}$ and the coordinate-wise adaptive learning rate $\eta_{t,k}$.

Momentum estimation.  The EMA technique is widely used in the momentum estimation step of Adam kingma2014adam and AMSGrad reddi2018convergence . Without loss of generality, we consider the simplified EMA step (without the bias correction $1/(1-\beta_1^t)$)

$$\begin{cases} m_{t,k} = \beta_1 m_{t-1,k} + (1-\beta_1)g_{t,k},\\ x_{t+1,k} = x_{t,k} - \eta_{t,k}\,m_{t,k}. \end{cases} \qquad (10)$$

To show the difference clearly, we compare only the HB momentum with the EMA momentum. Let $\tilde{m}_{t,k} := -\eta_{t,k}m_{t,k}$, so that $x_{t+1,k} = x_{t,k} + \tilde{m}_{t,k}$ matches the HB form. By the first equality in Eq. (10), we have

$$\tilde{m}_{t,k} = -\beta_1\eta_{t,k}m_{t-1,k} - (1-\beta_1)\eta_{t,k}g_{t,k} = \beta_1\tilde{m}_{t-1,k} - (1-\beta_1)\eta_{t,k}g_{t,k} + \beta_1(\eta_{t-1,k}-\eta_{t,k})m_{t-1,k}.$$

Comparing with HB (see Eq. (4)), EMA has an extra error term, which vanishes if $\eta_{t,k} = \eta_{t-1,k}$ for all $t$, i.e., if the step-size is constant (since the learning rates in AdaUSM and Adam are determined adaptively, we do not have $\eta_{t,k} = \eta_{t-1,k}$). In addition, EMA takes a much smaller step along the current stochastic gradient when the momentum factor $\beta_1$ is close to $1$. More precisely, writing the iterates in terms of the stochastic gradients and eliminating $m_{t,k}$, we obtain

$$x_{t+1,k} = x_{t,k} - \eta_{t,k}\sum_{i=1}^{t}(1-\beta_1)\beta_1^{t-i}g_{i,k} \quad\text{(EMA)}, \qquad x_{t+1,k} = x_{t,k} - \sum_{i=1}^{t}\mu^{t-i}\eta_{i,k}\,g_{i,k} \quad\text{(HB)}.$$

One can see that HB (and hence AdaUSM) uses the past step-sizes, whereas EMA uses only the current one in the exponential moving average. Moreover, when the momentum factor $\beta_1$ is very close to $1$, the update of $x_{t,k}$ via EMA could stagnate since $1-\beta_1 \approx 0$. This dilemma does not appear in AdaUSM.
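A tiny numeric illustration of this stagnation point (ours, purely for illustration, using the same momentum factor for both schemes and a constant step-size and scalar gradient): the EMA update of Eq. (10) scales the gradient by $(1-\beta_1)$, so for $\beta_1$ close to $1$ the first iterates barely move, while the HB update in Eq. (4) does not shrink the gradient term.

```python
lr, g = 0.1, 1.0                    # constant step-size and scalar gradient
for beta1 in (0.9, 0.999):
    x_ema, m_ema = 0.0, 0.0
    x_hb,  m_hb  = 0.0, 0.0
    for _ in range(10):
        m_ema = beta1 * m_ema + (1 - beta1) * g   # EMA momentum, Eq. (10)
        x_ema -= lr * m_ema
        m_hb = beta1 * m_hb - lr * g              # HB momentum, Eq. (4)
        x_hb += m_hb
    print(f"beta1={beta1}: |x_ema|={abs(x_ema):.4f}, |x_hb|={abs(x_hb):.4f}")
    # for beta1 = 0.999 the EMA iterate has barely moved after 10 steps
```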

Adaptive learning rate.  Note that $\sum_{i=1}^{t}\beta_2^{t-i} = (1-\beta_2^t)/(1-\beta_2)$. We have $\hat{v}_{t,k} = (1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}g_{i,k}^2 + \beta_2^t\hat{v}_{0,k}$. Without loss of generality, we set $\hat{v}_{0,k} = 0$. Hence, it holds that

$$v_{t,k} = \frac{\hat{v}_{t,k}}{1-\beta_2^t} = \sum_{i=1}^{t}\frac{1-\beta_2}{1-\beta_2^t}\,\beta_2^{t-i}\,g_{i,k}^2,$$

$$\eta_{t,k} = \frac{\eta}{\sqrt{t\,v_{t,k}} + \sqrt{t}\,\epsilon} = \frac{\eta}{\sqrt{t\sum_{i=1}^{t}\frac{1-\beta_2}{1-\beta_2^t}\beta_2^{t-i}g_{i,k}^2} + \sqrt{t}\,\epsilon} = \frac{\eta}{\sqrt{t\sum_{i=1}^{t}\frac{(1-\beta_2)\beta_2^t}{1-\beta_2^t}\beta_2^{-i}g_{i,k}^2} + \sqrt{t}\,\epsilon}. \qquad (11)$$

Let $a_i := \beta_2^{-i}$. Note that $\sum_{i=1}^{t} a_i = \frac{\beta_2^{-t}-1}{1-\beta_2}$, so that $\frac{(1-\beta_2)\beta_2^t}{1-\beta_2^t} = \frac{1}{\sum_{i=1}^{t} a_i}$. Hence, Eq. (11) can be further reformulated as

$$\eta_{t,k} = \frac{\eta}{\sqrt{\frac{t}{\sum_{i=1}^{t} a_i}\sum_{i=1}^{t} a_i\,g_{i,k}^2} + \sqrt{t}\,\epsilon}. \qquad (12)$$

Hence, the adaptive learning rate of Adam is equivalent to that of AdaUSM with the exponentially growing weights $a_i = \beta_2^{-i}$, provided $\epsilon$ is sufficiently small. For the default parameter setting $\beta_2 = 0.999$ in Adam, this corresponds to $a_i = (1/0.999)^i$. Thus, we gain insight into the convergence of Adam from the convergence results for AdaUSM in Theorem 1.
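A quick numerical check (our own, not from the paper) that the bias-corrected second moment of Adam coincides with the weighted accumulation under the exponentially growing weights $a_i = \beta_2^{-i}$, as used in Eq. (12):

```python
import numpy as np

rng = np.random.default_rng(0)
beta2, T = 0.999, 50
g = rng.normal(size=T)                 # a stream of scalar stochastic gradients

v_hat = 0.0
for t in range(1, T + 1):
    v_hat = beta2 * v_hat + (1 - beta2) * g[t - 1] ** 2
v_adam = v_hat / (1 - beta2 ** T)      # bias-corrected second moment v_T

a = beta2 ** (-np.arange(1, T + 1))    # exponentially growing weights a_i = beta2^{-i}
v_weighted = (a * g ** 2).sum() / a.sum()

print(np.isclose(v_adam, v_weighted))  # True: the two accumulations coincide
```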

### 4.2 RMSProp

Coordinate-wise RMSProp is another efficient solver for training DNNs hinton2012lecture ; mukkamala2017variants , whose $t$-th iterate can be written as

$$v_{t,k} = \beta\,v_{t-1,k} + (1-\beta)\,g_{t,k}^2, \qquad x_{t+1,k} = x_{t,k} - \frac{\eta\,g_{t,k}}{\sqrt{t\,v_{t,k}} + \epsilon},$$

with $v_{0,k} = 0$. Define $a_i := \beta^{-i}$. The adaptive learning rate of RMSProp, denoted $\eta^{\mathrm{RMSProp}}_{t,k}$, can be rewritten as

$$\eta^{\mathrm{RMSProp}}_{t,k} = \frac{\eta}{\sqrt{t\,v_{t,k}} + \epsilon} = \frac{\eta}{\sqrt{t\sum_{i=1}^{t}(1-\beta)\beta^{t-i}g_{i,k}^2} + \epsilon} = \frac{\eta/\sqrt{1-\beta^t}}{\sqrt{\frac{t}{\sum_{i=1}^{t} a_i}\sum_{i=1}^{t} a_i\,g_{i,k}^2} + \frac{\epsilon}{\sqrt{1-\beta^t}}}.$$

When $\epsilon$ is a sufficiently small constant and $t$ is sufficiently large (so that $1-\beta^t$ is close to $1$), $\eta^{\mathrm{RMSProp}}_{t,k}$ clearly has a structure similar to $\eta_{t,k}$ in Eq. (12). Based on the above analysis, AdaUSM can be interpreted as a generalized RMSProp with HB and NAG momentum.

## 5 Experiments

In this section, we conduct experiments to validate the efficacy and theory of AdaHB (AdaUSM with $\lambda = 0$) and AdaNAG (AdaUSM with $\lambda = 1$) by applying them to train DNNs, including LeNet lecun1998gradient , GoogLeNet szegedy2015going , ResNet he2016deep , and DenseNet huang2017densely , on various datasets, including MNIST lecun1998gradient , CIFAR10/100 krizhevsky2009learning , and Tiny-ImageNet deng2009imagenet . The efficacy of AdaHB and AdaNAG is evaluated via the training loss, test accuracy, and test loss versus epochs, respectively. In the experiments, the batch size and the weight decay parameter are kept fixed across all methods.

## References

• (1) L. Bottou. Online learning and stochastic approximations. On-line learning in neural networks, 17(9):142, 1998.
• (2) L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
• (3) X. Chen, S. Liu, R. Sun, and M. Hong. On the convergence of a class of adam-type algorithms for non-convex optimization. In International Conference on Learning Representations, 2019.
• (4) A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
• (5) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
• (6) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
• (7) E. Ghadimi, H. R. Feyzmahdavian, and M. Johansson. Global convergence of the heavy-ball method for convex optimization. In Control Conference (ECC), 2015 European, pages 310–315. IEEE, 2015.
• (8) S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
• (9) S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
• (10) I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.
• (11) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
• (12) G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, page 14, 2012.
• (13) G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269. IEEE, 2017.
• (14) R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
• (15) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
• (16) A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, 2009.
• (17) A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
• (18) Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436, 2015.
• (19) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
• (20) K. Levy. Online to offline conversions, universality and adaptive minibatch sizes. In Advances in Neural Information Processing Systems, pages 1613–1622, 2017.
• (21) Y. K. Levy, A. Yurtsever, and V. Cevher. Online adaptive methods, universality and acceleration. In Advances in Neural Information Processing Systems, pages 6500–6509, 2018.
• (22) X. Li and F. Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In International Conference on Artificial Intelligence and Statistics, pages 983–992, 2019.
• (23) H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. COLT 2010, page 244, 2010.
• (24) M. C. Mukkamala and M. Hein. Variants of rmsprop and adagrad with logarithmic regret bounds. In International Conference on Machine Learning, pages 2545–2553, 2017.
• (25) Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
• (26) Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
• (27) L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2613–2621. JMLR. org, 2017.
• (28) B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
• (29) S. J. Reddi, S. Kale, and S. Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
• (30) H. Robbins and S. Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, pages 102–109. Springer, 1985.
• (31) I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
• (32) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
• (33) R. Ward, X. Wu, and L. Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.
• (34) X. Wu, R. Ward, and L. Bottou. Wngrad: Learn the learning rate in gradient descent. arXiv preprint arXiv:1803.02865, 2018.
• (35) Y. Yan, T. Yang, Z. Li, Q. Lin, and Y. Yang. A unified analysis of stochastic momentum methods for deep learning. In IJCAI, pages 2955–2961, 2018.
• (36) F. Zou, L. Shen, Z. Jie, W. Zhang, and W. Liu. A sufficient condition for convergences of adam and rmsprop. arXiv preprint arXiv:1811.09358, 2018.

## Appendix B Preliminary Lemmas

In this section we provide preliminary lemmas that will be used to prove our main theorem. Readers may skip this part on a first reading and come back whenever the lemmas are needed.

###### Lemma 1.

Let $S_t = S_0 + \sum_{i=1}^{t} b_i$, where $\{b_i\}$ is a non-negative sequence and $S_0 > 0$. Then we have

$$\sum_{t=1}^{T}\frac{b_t}{S_t} \le \log(S_T) - \log(S_0).$$

###### Proof.

Since $b_t = S_t - S_{t-1}$, the finite sum can be interpreted as a lower Riemann sum of $\int_{S_0}^{S_T}\frac{1}{x}\,dx$. Since $1/x$ is decreasing on the interval $[S_0, S_T]$, we have

$$\sum_{t=1}^{T}\frac{1}{S_t}\,(S_t - S_{t-1}) \le \int_{S_0}^{S_T}\frac{1}{x}\,dx = \log(S_T) - \log(S_0).$$

The proof is finished. ∎
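As a sanity check of Lemma 1 (our own illustration, using the statement as reconstructed above), the following snippet verifies the inequality on a random non-negative sequence:

```python
import numpy as np

rng = np.random.default_rng(1)
b = rng.random(1000)            # non-negative sequence b_1, ..., b_T
S0 = 1.0
S = S0 + np.cumsum(b)           # S_t = S_0 + sum_{i<=t} b_i

lhs = (b / S).sum()             # sum_t (S_t - S_{t-1}) / S_t
rhs = np.log(S[-1]) - np.log(S0)
print(lhs <= rhs)               # True, as guaranteed by Lemma 1
```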

The following lemma is a direct result of the momentum updating rule.

###### Lemma 2.

Suppose $m_t = \mu m_{t-1} - \eta_t\odot g_t$ with $m_0 = 0$ and $\mu \in [0,1)$. We have the following estimate

$$\sum_{t=1}^{T}\|m_t\|^2 \le \frac{1}{(1-\mu)^2}\sum_{t=1}^{T}\|\eta_t\odot g_t\|^2. \qquad (13)$$
###### Proof.

First, we have the following inequality due to the convexity of $\|\cdot\|^2$:

$$\|m_t\|^2 = \Big\|\mu m_{t-1} + (1-\mu)\Big(\tfrac{-\eta_t\odot g_t}{1-\mu}\Big)\Big\|^2 \le \mu\|m_{t-1}\|^2 + (1-\mu)\Big\|\tfrac{\eta_t\odot g_t}{1-\mu}\Big\|^2 = \mu\|m_{t-1}\|^2 + \frac{\|\eta_t\odot g_t\|^2}{1-\mu}. \qquad (14)$$

Summing Eq. (14) over $t$ from $1$ to $T$ and using $m_0 = 0$, we have

$$\sum_{t=1}^{T}\|m_t\|^2 \le \mu\sum_{t=1}^{T-1}\|m_t\|^2 + \frac{1}{1-\mu}\sum_{t=1}^{T}\|\eta_t\odot g_t\|^2 \le \mu\sum_{t=1}^{T}\|m_t\|^2 + \frac{1}{1-\mu}\sum_{t=1}^{T}\|\eta_t\odot g_t\|^2. \qquad (15)$$

Hence,

$$\sum_{t=1}^{T}\|m_t\|^2 \le \frac{1}{(1-\mu)^2}\sum_{t=1}^{T}\|\eta_t\odot g_t\|^2. \qquad (16)$$

The proof is finished. ∎

The following lemma is a result of the USM formulation for any general adaptive learning rate.

###### Lemma 3.

Let $\{x_t\}$ and $\{m_t\}$ be the sequences generated by the following general SGD with USM momentum: starting from the initial values $x_1$ and $m_0 = 0$, the iterates are updated via the USM scheme in Eq. (5), where $\mu \in [0,1)$ is the momentum factor and $\lambda \ge 0$ is the interpolation factor. Suppose that the function $f$ is $L$-smooth. Then for any $t \ge 1$ we have the following estimate

$$\langle\nabla f(x_t), m_t\rangle \le \mu\,\langle\nabla f(x_{t-1}), m_{t-1}\rangle + \Big(1+\tfrac{3}{2}\lambda\mu\Big)\mu L\,\|m_{t-1}\|^2 + \tfrac{1}{2}\lambda\mu^2 L\,\|m_{t-2}\|^2 - \big\langle\nabla f(x_t),\,\eta_t\odot g_t\big\rangle. \qquad (17)$$

In particular, the following estimate holds

$$\langle\nabla f(x_t), m_t\rangle \le (1+2\lambda)L\sum_{i=1}^{t-1}\|m_i\|^2\mu^{t-i} - \sum_{i=1}^{t}\big\langle\nabla f(x_i),\,\eta_i\odot g_i\big\rangle\,\mu^{t-i}.$$