
## 1 Introduction

Stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its adaptive variants such as AdaGrad (Duchi et al., 2011; McMahan and Streeter, 2010) are widely used for training modern machine learning models. Throughout this paper, we denote by $\sqrt{\mathbf{v}}$ the element-wise square root of the vector $\mathbf{v}$, by $\mathbf{u}/\mathbf{v}$ the element-wise division between $\mathbf{u}$ and $\mathbf{v}$, and by $\max(\mathbf{u},\mathbf{v})$ the element-wise maximum between $\mathbf{u}$ and $\mathbf{v}$. With this notation, the update rule of AMSGrad (Reddi et al., 2018) can be written as

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t\frac{\mathbf{m}_t}{\sqrt{\hat{\mathbf{v}}_t}}, \quad \text{with } \hat{\mathbf{v}}_t = \max(\hat{\mathbf{v}}_{t-1}, \mathbf{v}_t),$$

where $\alpha_t$ is the step size, $\mathbf{x}_t$ is the iterate in the $t$-th iteration, and $\mathbf{m}_t$, $\mathbf{v}_t$ are the exponential moving averages of the gradient and the squared gradient at the $t$-th iteration respectively. More specifically, $\mathbf{m}_t$ and $\mathbf{v}_t$ are defined as follows (here $\mathbf{g}_t^2$ denotes the element-wise square of the vector $\mathbf{g}_t$):

$$\mathbf{m}_t = \beta_1\mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2\mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2,$$

where $\beta_1, \beta_2 \in [0,1)$ are hyperparameters of the algorithm, and $\mathbf{g}_t$ is the stochastic gradient at the $t$-th iteration. Padam (Chen and Gu, 2018) replaces the square root in the AMSGrad update with a partially adaptive power $p$:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t\frac{\mathbf{m}_t}{\hat{\mathbf{v}}_t^{p}}, \quad \text{with } \hat{\mathbf{v}}_t = \max(\hat{\mathbf{v}}_{t-1}, \mathbf{v}_t).$$

Evidently, when $p = 1/2$, Padam reduces to AMSGrad. Padam also reduces to the corrected version of RMSProp (Reddi et al., 2018) when $p = 1/2$ and $\beta_1 = 0$.
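As a concrete illustration of the updates above, the following one-dimensional Python sketch implements a single Padam iteration. The function name and the `eps` stability constant are our own additions, not part of the paper's Algorithm 1; setting `p=0.5` recovers AMSGrad, and `p=0.5` with `beta1=0` recovers the corrected RMSProp.

```python
def padam_step(x, m, v, v_hat, grad, alpha, beta1=0.9, beta2=0.999,
               p=0.125, eps=1e-8):
    """One Padam iteration on a single coordinate (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad          # m_t = beta1*m_{t-1} + (1-beta1)*g_t
    v = beta2 * v + (1 - beta2) * grad * grad   # v_t = beta2*v_{t-1} + (1-beta2)*g_t^2
    v_hat = max(v_hat, v)                       # v_hat_t = max(v_hat_{t-1}, v_t)
    x = x - alpha * m / (v_hat ** p + eps)      # x_{t+1} = x_t - alpha*m_t / v_hat_t^p
    return x, m, v, v_hat

# Minimize f(x) = x^2 / 2 (gradient g_t = x_t) starting from x = 1.0.
x, m, v, v_hat = 1.0, 0.0, 0.0, 0.0
for _ in range(500):
    x, m, v, v_hat = padam_step(x, m, v, v_hat, grad=x, alpha=0.05)
```

Note that `v_hat` is nondecreasing over iterations, which is exactly what makes the effective step size $\alpha_t/\hat{v}_t^p$ non-increasing coordinate-wise.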

In this paper, we pay particular attention to the dependence on both the dimension $d$ and the number of iterations $T$ in the convergence rate. This is motivated by the fact that modern machine learning methods, especially the training of deep neural networks, usually require solving very high-dimensional nonconvex optimization problems. The dimension $d$ is often comparable to, or even larger than, the total number of iterations $T$. Take training the convolutional neural network DenseNet-BC (Huang et al., 2017) on CIFAR-10 (Krizhevsky, 2009) as an example: according to Huang et al. (2017), both the total number of training iterations and the number of parameters in the network are on the order of millions. This example shows that $d$ can indeed be of the same order as $T$ in practice. Therefore, we argue that it is very important to show the precise dependence on both $d$ and $T$ in the convergence analysis of adaptive gradient methods for modern machine learning.

While preparing this manuscript, we noticed a paper (Chen et al., 2018), released on arXiv on August 8, 2018, which analyzes the convergence of a class of Adam-type algorithms, including AMSGrad and AdaGrad, for nonconvex optimization. Our work is independent, and our derived convergence rate for AMSGrad is faster than theirs.

### 1.1 Our Contributions

The main contributions of our work are summarized as follows:

• We prove that the convergence rate of Padam to a stationary point for stochastic nonconvex optimization is

$$O\left(\frac{\big(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\big)^{1/2}}{T^{3/4}} + \frac{d}{T}\right), \tag{1}$$

where $\mathbf{g}_1, \dots, \mathbf{g}_T$ are the stochastic gradients and $\mathbf{g}_{1:T,i} = (g_{1,i}, g_{2,i}, \dots, g_{T,i})^\top$. When the stochastic gradients are $G_\infty$-bounded, (1) matches the convergence rate of vanilla SGD in terms of its dependence on $T$.

• Our result implies the convergence rate for AMSGrad is

$$O\left(\sqrt{\frac{d}{T}} + \frac{d}{T}\right),$$

which has a better dependence on the dimension $d$ and the number of iterations $T$ than the convergence rate proved in Chen et al. (2018), i.e.,

$$O\left(\frac{\log T + d^2}{\sqrt{T}}\right).$$
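The step from the general bound (1) to the $\sqrt{d/T}$-type bound above is a one-line calculation; the following derivation is our own elaboration, using only the $G_\infty$-bound on stochastic gradients from Section 3:

```latex
\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2 \le d\,G_\infty\sqrt{T}
\quad\Longrightarrow\quad
\frac{\big(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\big)^{1/2}}{T^{3/4}}
\le \frac{\big(d\,G_\infty\sqrt{T}\big)^{1/2}}{T^{3/4}}
= \sqrt{\frac{d\,G_\infty}{T}},
```

since each coordinate satisfies $|g_{t,i}| \le G_\infty$ and hence $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty\sqrt{T}$.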

### 1.2 Additional Related Work

Here we briefly review other related work on nonconvex stochastic optimization.

Ghadimi and Lan (2013) proposed a randomized stochastic gradient (RSG) method, and proved its $O(1/\sqrt{T})$ convergence rate to a stationary point. Ghadimi and Lan (2016) proposed a randomized stochastic accelerated gradient (RSAG) method, which achieves an improved convergence rate, where $\sigma^2$ is an upper bound on the variance of the stochastic gradient. Motivated by the success of stochastic momentum methods in deep learning (Sutskever et al., 2013), Yang et al. (2016) provided a unified convergence analysis for both the stochastic heavy-ball method and the stochastic variant of Nesterov's accelerated gradient method, and proved an $O(1/\sqrt{T})$ convergence rate to a stationary point for smooth nonconvex functions. Reddi et al. (2016); Allen-Zhu and Hazan (2016) proposed variants of the stochastic variance-reduced gradient (SVRG) method (Johnson and Zhang, 2013) that are provably faster than gradient descent in the nonconvex finite-sum setting. Lei et al. (2017) proposed the stochastically controlled stochastic gradient (SCSG) method, which further improves the convergence rate of SVRG for finite-sum smooth nonconvex optimization. Very recently, Zhou et al. (2018) proposed a new algorithm called stochastic nested variance-reduced gradient (SNVRG), which achieves strictly better gradient complexity than both SVRG and SCSG for finite-sum and stochastic smooth nonconvex optimization.

There is another line of research in stochastic smooth nonconvex optimization, which makes use of the $\sigma$-nonconvexity of a nonconvex function (i.e., $\nabla^2 f(\mathbf{x}) \succeq -\sigma\mathbf{I}$). More specifically, Natasha 1 (Allen-Zhu, 2017b) and Natasha 1.5 (Allen-Zhu, 2017a) solve a modified regularized problem and achieve faster convergence rates to first-order stationary points than SVRG and SCSG in the finite-sum and stochastic settings respectively. In addition, Allen-Zhu (2018) proposed the SGD4 algorithm, which optimizes a series of regularized problems and achieves a faster convergence rate than SGD.

### 1.3 Organization and Notation

The remainder of this paper is organized as follows: We present the problem setup and review the algorithms in Section 2. We provide the convergence guarantee of Padam for stochastic smooth nonconvex optimization in Section 3. Finally, we conclude our paper in Section 4.

Notation. Scalars are denoted by lower case letters, vectors by lower case bold face letters, and matrices by upper case bold face letters. For a vector $\mathbf{x} \in \mathbb{R}^d$, we denote the $\ell_2$ norm of $\mathbf{x}$ by $\|\mathbf{x}\|_2$ and the $\ell_\infty$ norm of $\mathbf{x}$ by $\|\mathbf{x}\|_\infty$. For a sequence of vectors $\{\mathbf{g}_t\}_{t=1}^T$, we denote by $g_{t,i}$ the $i$-th element of $\mathbf{g}_t$. We also denote $\mathbf{g}_{1:T,i} = (g_{1,i}, g_{2,i}, \dots, g_{T,i})^\top$. With slight abuse of notation, for any two vectors $\mathbf{a}$ and $\mathbf{b}$, we denote $\mathbf{a}^2$ as the element-wise square, $\mathbf{a}^p$ as the element-wise power operation, $\mathbf{a}/\mathbf{b}$ as the element-wise division, and $\max(\mathbf{a}, \mathbf{b})$ as the element-wise maximum. For a matrix $\mathbf{A} = [A_{ij}]$, we define $\|\mathbf{A}\|_{1,1} = \sum_{i,j}|A_{ij}|$. Given two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ if there exists a constant $0 < C < +\infty$ such that $a_n \le C\, b_n$. We use the notation $\widetilde{O}(\cdot)$ to hide logarithmic factors.

## 2 Problem Setup and Algorithms

In this section, we first introduce the preliminary definitions used in this paper, followed by the problem setup of stochastic nonconvex optimization. Then we review the state-of-the-art adaptive gradient method, i.e., Padam (Chen and Gu, 2018), along with AMSGrad (the corrected version of Adam) (Reddi et al., 2018) and the corrected version of RMSProp (Tieleman and Hinton, 2012; Reddi et al., 2018).

### 2.1 Problem Setup

We study the following stochastic nonconvex optimization problem

$$\min_{\mathbf{x}\in\mathbb{R}^d} f(\mathbf{x}) := \mathbb{E}_{\xi}\big[f(\mathbf{x};\xi)\big],$$

where $\xi$ is a random variable following some fixed but unknown distribution, and $f(\mathbf{x})$ is an $L$-smooth nonconvex function. In the stochastic setting, one cannot directly access the full gradient of $f(\mathbf{x})$. Instead, one can only obtain unbiased estimators of the gradient of $f(\mathbf{x})$, namely the stochastic gradients $\nabla f(\mathbf{x};\xi)$. This setting has been studied in Ghadimi and Lan (2013, 2016).

### 2.2 Algorithms

In this section we introduce the algorithms we study in this paper. We mainly consider three algorithms: Padam (Chen and Gu, 2018), AMSGrad (Reddi et al., 2018) and a corrected version of RMSProp (Tieleman and Hinton, 2012; Reddi et al., 2018).

## 3 Main Theory

In this section we present our main theoretical results. We first introduce the following assumptions.

[Bounded Gradient] $f(\mathbf{x}) = \mathbb{E}_\xi f(\mathbf{x};\xi)$ has $G_\infty$-bounded stochastic gradient. That is, for any $\xi$, we assume that

$$\|\nabla f(\mathbf{x};\xi)\|_\infty \le G_\infty.$$

It is worth mentioning that Assumption 3 is slightly weaker than the $G_2$-boundedness assumption $\|\nabla f(\mathbf{x};\xi)\|_2 \le G_2$ used in Reddi et al. (2016); Chen et al. (2018). Since $\|\nabla f(\mathbf{x};\xi)\|_\infty \le \|\nabla f(\mathbf{x};\xi)\|_2$, the $G_2$-boundedness assumption implies Assumption 3 with $G_\infty = G_2$. Meanwhile, $G_\infty$ can be tighter than $G_2$ by a factor of $\sqrt{d}$ when the coordinates of $\nabla f(\mathbf{x};\xi)$ are almost equal to each other.
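The two norm relations in this remark can be sanity-checked numerically; the helper functions below are our own illustration:

```python
import math

def linf(v):
    # ||v||_inf = max_i |v_i|
    return max(abs(c) for c in v)

def l2(v):
    # ||v||_2 = sqrt(sum_i v_i^2)
    return math.sqrt(sum(c * c for c in v))

d = 4
g = [3.0, -1.0, 0.5, 2.0]
# For any g in R^d: ||g||_inf <= ||g||_2 <= sqrt(d) * ||g||_inf.
assert linf(g) <= l2(g) <= math.sqrt(d) * linf(g)

# When all coordinates are equal, the sqrt(d) gap between the two norms is tight.
u = [1.0] * d
ratio = l2(u) / linf(u)  # equals sqrt(d)
```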

[$L$-smooth] $f$ is $L$-smooth: for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, we have

$$\big|f(\mathbf{x}) - f(\mathbf{y}) - \langle\nabla f(\mathbf{y}), \mathbf{x} - \mathbf{y}\rangle\big| \le \frac{L}{2}\|\mathbf{x} - \mathbf{y}\|_2^2.$$

Assumption 3 is a standard assumption frequently used in the analysis of gradient-based algorithms. It is equivalent to the $L$-gradient Lipschitz condition, which is often written as $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2 \le L\|\mathbf{x} - \mathbf{y}\|_2$.
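For the one-dimensional quadratic $f(x) = x^2$, which is $L$-smooth with $L = 2$, the smoothness inequality holds with equality, making it easy to verify numerically (our own illustration):

```python
def f(x):
    return x * x        # L-smooth with L = 2

def grad(x):
    return 2.0 * x

L = 2.0
for x, y in [(0.3, -1.2), (2.0, 0.5), (-0.7, 0.1)]:
    # |f(x) - f(y) - <grad f(y), x - y>| <= (L/2) * |x - y|^2
    gap = abs(f(x) - f(y) - grad(y) * (x - y))
    assert gap <= L / 2 * (x - y) ** 2 + 1e-12
```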

We are now ready to present our main result.

[Padam] In Algorithm 1, suppose that $\beta_1 < \beta_2^{2p}$ and $\alpha_t = \alpha$ for $t = 1, \dots, T$. Then under Assumptions 3 and 3, for any $q \in [0, 1]$, the output $\mathbf{x}_{\text{out}}$ of Algorithm 1 satisfies

$$\mathbb{E}\big[\|\nabla f(\mathbf{x}_{\text{out}})\|_2^2\big] \le \frac{M_1}{T\alpha} + \frac{M_2\, d}{T} + \frac{M_3\, d^q \alpha}{T^{(1-q)/2}}\,\mathbb{E}\left(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\right)^{1-q}, \tag{2}$$

where

$$M_1 = 2G_\infty^{2p}\Delta_f, \qquad M_2 = \frac{4G_\infty^{2+2p}\,\mathbb{E}\big\|\hat{\mathbf{v}}_1^{-p}\big\|_1}{d(1-\beta_1)} + 4G_\infty^2,$$

$$M_3 = \frac{4LG_\infty^{1+q-2p}}{(1-\beta_2)^{2p}} + \frac{8LG_\infty^{1+q-2p}}{(1-\beta_1)(1-\beta_2)^{2p}\big(1-\beta_1/\beta_2^{2p}\big)}\left(\frac{\beta_1}{1-\beta_1}\right)^2,$$

and $\Delta_f = f(\mathbf{x}_1) - \inf_{\mathbf{x}} f(\mathbf{x})$. From Theorem 3, we can see that $M_1$ and $M_3$ are independent of the number of iterations $T$ and the dimension $d$. In addition, if $\mathbb{E}\|\hat{\mathbf{v}}_1^{-p}\|_1 = O(d)$, it is easy to see that $M_2$ also has an upper bound that is independent of $T$ and $d$. The following corollary is a special case of Theorem 3 with $q = 0$. Under the same conditions of Theorem 3, the output of Padam satisfies

$$\mathbb{E}\big[\|\nabla f(\mathbf{x}_{\text{out}})\|_2^2\big] \le \frac{M_1}{T\alpha} + \frac{M_2\, d}{T} + \frac{M_3'\, \alpha}{\sqrt{T}}\,\mathbb{E}\left(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\right), \tag{3}$$

where $M_1$ and $M_2$ are the same as in Theorem 3, and $M_3'$ is defined as follows:

$$M_3' = \frac{4LG_\infty^{1-2p}}{(1-\beta_2)^{2p}} + \frac{8LG_\infty^{1-2p}}{(1-\beta_1)(1-\beta_2)^{2p}\big(1-\beta_1/\beta_2^{2p}\big)}\left(\frac{\beta_1}{1-\beta_1}\right)^2.$$

Corollary 3 simplifies the result of Theorem 3 by choosing $q = 0$. We remark that this choice of $q$ is optimal in an important special case studied in Duchi et al. (2011); Reddi et al. (2018): when the gradient vectors are sparse, we assume that $\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2 \ll d\sqrt{T}$. Then for $q \in (0, 1]$, it follows that

$$\frac{\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2}{T} \ll \frac{d^q\big(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\big)^{1-q}}{T^{1-q/2}}. \tag{4}$$

(4) implies that the upper bound provided by (3) is strictly better than (2) with $q \in (0, 1]$. Therefore, when the gradient vectors are sparse, Padam achieves its fastest convergence with the choice $q = 0$. We now show the convergence rate under different choices of the step size $\alpha$. If

$$\alpha = \Theta\left(\Big[T^{1/4}\Big(\textstyle\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\Big)^{1/2}\Big]^{-1}\right),$$

then by (3), we have

$$\mathbb{E}\big[\|\nabla f(\mathbf{x}_{\text{out}})\|_2^2\big] = O\left(\frac{\mathbb{E}\big(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\big)^{1/2}}{T^{3/4}} + \frac{d}{T}\right). \tag{5}$$

Note that the convergence rate given by (5) depends on the sum of gradient norms $\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2$. As mentioned in Remark 3, when the stochastic gradients $\mathbf{g}_t$, $t = 1, \dots, T$, are sparse, we follow the assumption given by Duchi et al. (2011) that $\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2 \ll d\sqrt{T}$. Under this sparsity assumption, the bound in (5) improves over the general case and matches the rate achieved by nonconvex SGD (Ghadimi and Lan, 2016) in terms of the dependence on $T$.

If we instead set a step size $\alpha$ that does not depend on the gradient norms $\|\mathbf{g}_{1:T,i}\|_2$, then (3) suggests that

 (6)

When the gradients are sparse in the sense of Duchi et al. (2011); Reddi et al. (2018), (6) again yields a bound that matches the convergence result of nonconvex SGD (Ghadimi and Lan, 2016) in terms of the dependence on $T$.

Next we present the convergence analysis of two popular algorithms: AMSGrad and RMSProp. Since AMSGrad and RMSProp can be seen as two specific instances of Padam, we can apply Theorem 3 with the corresponding parameter choices and obtain the following two corollaries.

[AMSGrad] Under the same conditions of Theorem 3, for AMSGrad in Algorithm 2, if $\alpha_t = \alpha$ for $t = 1, \dots, T$, then the output satisfies

 (7)

where $M_1^A$, $M_2^A$ and $M_3^A$ are defined as follows:

$$M_1^A = 2G_\infty\Delta_f, \qquad M_2^A = \frac{4G_\infty^3\,\mathbb{E}\big\|\hat{\mathbf{v}}_1^{-1/2}\big\|_1}{d(1-\beta_1)} + 4G_\infty^2,$$

$$M_3^A = \frac{4LG_\infty}{1-\beta_2} + \frac{8LG_\infty}{(1-\beta_1)(1-\beta_2)\big(1-\beta_1/\beta_2\big)}\left(\frac{\beta_1}{1-\beta_1}\right)^2.$$

As illustrated in Theorem 3, $M_1^A$, $M_2^A$ and $M_3^A$ are independent of $T$ and essentially independent of $d$. Thus, (7) implies that AMSGrad achieves an

$$O\left(\sqrt{\frac{d}{T}} + \frac{d}{T}\right)$$

convergence rate, which matches the convergence rate of nonconvex SGD (Ghadimi and Lan, 2016). Chen et al. (2018) also provided a similar bound for AMSGrad, namely $O\big((\log T + d^2)/\sqrt{T}\big)$. The dependence on $d$ in their bound is quadratic, which is worse than the dependence implied by (7). Moreover, by the corollaries above and (4), it is easy to see that Padam with $p < 1/2$ can be faster than AMSGrad (which corresponds to $p = 1/2$), which backs up the experimental results in Chen and Gu (2018).

[Corrected version of RMSProp] Under the same conditions of Theorem 3, for the corrected version of RMSProp in Algorithm 3, if $\alpha_t = \alpha$ for $t = 1, \dots, T$, then the output satisfies

 (8)

where $M_1^R$, $M_2^R$ and $M_3^R$ are defined as follows:

$$M_1^R = 2G_\infty\Delta_f, \qquad M_2^R = \frac{4G_\infty^3\,\mathbb{E}\big\|\hat{\mathbf{v}}_1^{-1/2}\big\|_1}{d} + 4G_\infty^2, \qquad M_3^R = \frac{4LG_\infty}{1-\beta_2}.$$

$M_1^R$, $M_2^R$ and $M_3^R$ are independent of $T$ and essentially independent of $d$. Thus, (8) implies that the corrected version of RMSProp achieves an $O\big(\sqrt{d/T} + d/T\big)$ convergence rate, which matches the convergence rate of nonconvex SGD given by Ghadimi and Lan (2016).

## 4 Conclusions

In this paper, we provided a sharp analysis of the state-of-the-art adaptive gradient method Padam (Chen and Gu, 2018) and proved its convergence rate for smooth nonconvex optimization. Our results directly imply the convergence rates of AMSGrad and the corrected version of RMSProp for smooth nonconvex optimization. In terms of the number of iterations $T$, the derived convergence rates in this paper match the rate achieved by SGD; in terms of the dimension $d$, our results give a better rate than existing work. Our results also offer some insights into the choice of the partially adaptive parameter $p$ in the Padam algorithm: when the gradients are sparse, a properly chosen $p$ achieves the fastest convergence rate. This theoretically backs up the experimental results in existing work (Chen and Gu, 2018).

## Acknowledgement

We would like to thank Jinghui Chen for discussion on this work.

## Appendix A Proof of the Main Theory

Here we provide the detailed proof of the main theorem.

### A.1 Proof of Theorem 3

Let $\Delta_f = f(\mathbf{x}_1) - \inf_{\mathbf{x}} f(\mathbf{x})$. To prove Theorem 3, we need the following lemmas:

[Restatement of Lemma] Let $\mathbf{m}_t$ and $\hat{\mathbf{v}}_t$ be as defined in Algorithm 1. Then under Assumption 3, we have $\|\mathbf{m}_t\|_\infty \le G_\infty$ and $\|\hat{\mathbf{v}}_t\|_\infty \le G_\infty^2$.

Suppose that $f$ has $G_\infty$-bounded stochastic gradient. Let $\beta_1, \beta_2$ be the weight parameters and $\alpha_t = \alpha$, $t = 1, \dots, T$, be the step sizes in Algorithm 1. We denote $\gamma = \beta_1/\beta_2^{2p}$. Suppose that $\gamma \le 1$ and $q \in [0, 1]$, then under Assumption 3, we have the following two results:

$$\mathbb{E}\left[\sum_{t=1}^T \alpha_t^2\big\|\hat{\mathbf{V}}_t^{-p}\mathbf{m}_t\big\|_2^2\right] \le \frac{T^{(1+q)/2}\, d^q \alpha^2 (1-\beta_1)\, G_\infty^{1+q-4p}}{(1-\beta_2)^{2p}(1-\gamma)}\,\mathbb{E}\left(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\right)^{1-q},$$

and

$$\mathbb{E}\left[\sum_{t=1}^T \alpha_t^2\big\|\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2^2\right] \le \frac{T^{(1+q)/2}\, d^q \alpha^2\, G_\infty^{1+q-4p}}{(1-\beta_2)^{2p}}\,\mathbb{E}\left(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\right)^{1-q}.$$

To deal with the stochastic momentum $\mathbf{m}_t$ and the stochastic weight $\hat{\mathbf{v}}_t$, following Yang et al. (2016), we define an auxiliary sequence $\{\mathbf{z}_t\}$ as follows: let $\mathbf{x}_0 = \mathbf{x}_1$, and for each $t \ge 1$,

$$\mathbf{z}_t = \mathbf{x}_t + \frac{\beta_1}{1-\beta_1}(\mathbf{x}_t - \mathbf{x}_{t-1}) = \frac{1}{1-\beta_1}\mathbf{x}_t - \frac{\beta_1}{1-\beta_1}\mathbf{x}_{t-1}. \tag{9}$$
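The two expressions in (9) agree by the identity $1 + \beta_1/(1-\beta_1) = 1/(1-\beta_1)$, which the following snippet (our own sketch) verifies on a single coordinate:

```python
beta1 = 0.9
c = beta1 / (1.0 - beta1)

x_prev, x_curr = 0.4, -1.3
# First form of (9):  z_t = x_t + beta1/(1-beta1) * (x_t - x_{t-1})
z_a = x_curr + c * (x_curr - x_prev)
# Second form of (9): z_t = x_t/(1-beta1) - beta1/(1-beta1) * x_{t-1}
z_b = x_curr / (1.0 - beta1) - c * x_prev
assert abs(z_a - z_b) < 1e-9
```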

Lemma A.1 shows that $\mathbf{z}_{t+1} - \mathbf{z}_t$ can be represented in two different ways.

Let $\mathbf{z}_t$ be defined in (9). For $t \ge 2$, we have

$$\mathbf{z}_{t+1} - \mathbf{z}_t = \frac{\beta_1}{1-\beta_1}\Big[\mathbf{I} - \big(\alpha_t\hat{\mathbf{V}}_t^{-p}\big)\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\big)^{-1}\Big](\mathbf{x}_{t-1} - \mathbf{x}_t) - \alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t, \tag{10}$$

and

$$\mathbf{z}_{t+1} - \mathbf{z}_t = \frac{\beta_1}{1-\beta_1}\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} - \alpha_t\hat{\mathbf{V}}_t^{-p}\big)\mathbf{m}_{t-1} - \alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t. \tag{11}$$

For $t = 1$, we have

$$\mathbf{z}_2 - \mathbf{z}_1 = -\alpha_1\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1. \tag{12}$$

By Lemma A.1, we connect $\mathbf{z}_{t+1} - \mathbf{z}_t$ with $\mathbf{x}_t - \mathbf{x}_{t-1}$ and $\mathbf{g}_t$. The following two lemmas give bounds on $\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2$ and $\|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\|_2$, which play important roles in our proof.

Let $\mathbf{z}_t$ be defined in (9). For $t \ge 2$, we have

$$\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2 \le \big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2.$$

Let $\mathbf{z}_t$ be defined in (9). For $t \ge 2$, we have

$$\|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\|_2 \le L\left(\frac{\beta_1}{1-\beta_1}\right)\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2.$$

Now we are ready to prove Theorem 3.

###### Proof of Theorem 3.

Since $f$ is $L$-smooth, we have

$$f(\mathbf{z}_{t+1}) \le f(\mathbf{z}_t) + \nabla f(\mathbf{z}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) + \frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2$$

$$= f(\mathbf{z}_t) + \underbrace{\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t)}_{I_1} + \underbrace{\big(\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\big)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t)}_{I_2} + \underbrace{\frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2}_{I_3}. \tag{13}$$

In the following, we bound $I_1$, $I_2$ and $I_3$ separately.

Bounding term $I_1$: When $t = 1$, we have

$$\nabla f(\mathbf{x}_1)^\top(\mathbf{z}_2 - \mathbf{z}_1) = -\nabla f(\mathbf{x}_1)^\top\alpha_1\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1. \tag{14}$$

For $t \ge 2$, we have

$$\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) = \nabla f(\mathbf{x}_t)^\top\Big[\frac{\beta_1}{1-\beta_1}\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} - \alpha_t\hat{\mathbf{V}}_t^{-p}\big)\mathbf{m}_{t-1} - \alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\Big], \tag{15}$$

where the equality holds due to (11) in Lemma A.1. For the first term on the right-hand side of (15), we have

$$\begin{aligned}
\nabla f(\mathbf{x}_t)^\top\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} - \alpha_t\hat{\mathbf{V}}_t^{-p}\big)\mathbf{m}_{t-1} &\le \|\nabla f(\mathbf{x}_t)\|_\infty \cdot \big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} - \alpha_t\hat{\mathbf{V}}_t^{-p}\big\|_{1,1} \cdot \|\mathbf{m}_{t-1}\|_\infty \\
&\le G_\infty^2\Big[\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\big\|_{1,1} - \big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\big\|_{1,1}\Big] \\
&= G_\infty^2\Big[\big\|\alpha_{t-1}\hat{\mathbf{v}}_{t-1}^{-p}\big\|_1 - \big\|\alpha_t\hat{\mathbf{v}}_t^{-p}\big\|_1\Big].
\end{aligned} \tag{16}$$

The first inequality holds because for a positive diagonal matrix $\mathbf{A}$, we have $\mathbf{a}^\top\mathbf{A}\mathbf{b} \le \|\mathbf{a}\|_\infty \cdot \|\mathbf{A}\|_{1,1} \cdot \|\mathbf{b}\|_\infty$. The second inequality holds due to $\|\nabla f(\mathbf{x}_t)\|_\infty \le G_\infty$, $\|\mathbf{m}_{t-1}\|_\infty \le G_\infty$ and the fact that $\alpha_t\hat{\mathbf{v}}_t^{-p} \le \alpha_{t-1}\hat{\mathbf{v}}_{t-1}^{-p}$ element-wise. Next we bound the second term on the right-hand side of (15). We have

$$\begin{aligned}
-\nabla f(\mathbf{x}_t)^\top\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t &= -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\mathbf{g}_t - \nabla f(\mathbf{x}_t)^\top\big(\alpha_t\hat{\mathbf{V}}_t^{-p} - \alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\big)\mathbf{g}_t \\
&\le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\mathbf{g}_t + \|\nabla f(\mathbf{x}_t)\|_\infty \cdot \big\|\alpha_t\hat{\mathbf{V}}_t^{-p} - \alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\big\|_{1,1} \cdot \|\mathbf{g}_t\|_\infty \\
&\le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\mathbf{g}_t + G_\infty^2\Big(\big\|\alpha_{t-1}\hat{\mathbf{v}}_{t-1}^{-p}\big\|_1 - \big\|\alpha_t\hat{\mathbf{v}}_t^{-p}\big\|_1\Big).
\end{aligned} \tag{17}$$

The first inequality holds because for a positive diagonal matrix $\mathbf{A}$, we have $\mathbf{a}^\top\mathbf{A}\mathbf{b} \le \|\mathbf{a}\|_\infty \cdot \|\mathbf{A}\|_{1,1} \cdot \|\mathbf{b}\|_\infty$. The second inequality holds due to $\|\nabla f(\mathbf{x}_t)\|_\infty \le G_\infty$ and $\|\mathbf{g}_t\|_\infty \le G_\infty$. Substituting (16) and (17) into (15), we have

$$\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) \le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\mathbf{g}_t + \frac{G_\infty^2}{1-\beta_1}\Big(\big\|\alpha_{t-1}\hat{\mathbf{v}}_{t-1}^{-p}\big\|_1 - \big\|\alpha_t\hat{\mathbf{v}}_t^{-p}\big\|_1\Big). \tag{18}$$

Bounding term $I_2$: For $t \ge 2$, we have

$$\begin{aligned}
\big(\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\big)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) &\le \frac{L\beta_1}{1-\beta_1}\big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2 \cdot \|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2 + L\left(\frac{\beta_1}{1-\beta_1}\right)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2 \\
&\le L\big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2^2 + 2L\left(\frac{\beta_1}{1-\beta_1}\right)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2,
\end{aligned} \tag{19}$$

where the first inequality holds by the Cauchy–Schwarz inequality together with Lemma A.1 and Lemma A.1, and the second inequality holds due to Young's inequality.

Bounding term $I_3$: For $t \ge 2$, we have

$$\begin{aligned}
\frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2 &\le \frac{L}{2}\Big[\big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2\Big]^2 \\
&\le L\big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2^2 + 2L\left(\frac{\beta_1}{1-\beta_1}\right)^2\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2^2.
\end{aligned} \tag{20}$$

The first inequality is obtained by applying Lemma A.1, and the second follows from $(a+b)^2 \le 2a^2 + 2b^2$.

For $t = 1$, substituting (14), (19) and (20) into (13), taking expectation and rearranging terms, we have

$$\mathbb{E}\big[f(\mathbf{z}_2) - f(\mathbf{z}_1)\big] \le \mathbb{E}\Big[-\nabla f(\mathbf{x}_1)^\top\alpha_1\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1 + 2L\big\|\alpha_1\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1\big\|_2^2\Big]$$