and its variants have been widely used in training deep neural networks. Among those variants, adaptive gradient methods (AdaGrad) (Duchi et al., 2011; McMahan and Streeter, 2010), which scale each coordinate of the gradient by a function of past gradients, can achieve better performance than vanilla SGD in practice when the gradients are sparse. An intuitive explanation for the success of AdaGrad is that it automatically adjusts the learning rate for each feature based on the partial gradient, which accelerates convergence. However, AdaGrad was later found to perform poorly, especially when the loss function is nonconvex or the gradient is dense, because its learning rate decays rapidly. This problem is exacerbated in deep learning by the huge number of optimization variables. To overcome this issue, RMSProp (Tieleman and Hinton, 2012) was proposed, which uses an exponential moving average rather than the arithmetic average to scale the gradient and thereby mitigates the rapid decay of the learning rate. Kingma and Ba (2014) proposed an adaptive momentum estimation method (Adam), which incorporates the idea of momentum (Polyak, 1964; Sutskever et al., 2013) into RMSProp. Other related algorithms include AdaDelta (Zeiler, 2012) and Nadam (Dozat, 2016), which combine the idea of an exponential moving average of the historical gradients, Polyak’s heavy ball (Polyak, 1964) and Nesterov’s accelerated gradient descent (Nesterov, 2013). Recently, by revisiting the original convergence analysis of Adam, Reddi et al. (2018) found that for some handcrafted simple convex optimization problems, Adam does not even converge to the global minimizer. To address this convergence issue, Reddi et al. (2018) proposed a new variant of Adam, namely AMSGrad, which has guaranteed convergence in the convex optimization setting. The update rule of AMSGrad is as follows (with slight abuse of notation, we denote by $\sqrt{\mathbf{v}_t}$ the element-wise square root of the vector $\mathbf{v}_t$, by $\mathbf{m}_t/\sqrt{\mathbf{v}_t}$ the element-wise division between $\mathbf{m}_t$ and $\sqrt{\mathbf{v}_t}$, and by $\max(\mathbf{a}, \mathbf{b})$ the element-wise maximum between $\mathbf{a}$ and $\mathbf{b}$):
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t \frac{\mathbf{m}_t}{\sqrt{\hat{\mathbf{v}}_t}}, \quad \text{with } \hat{\mathbf{v}}_t = \max(\hat{\mathbf{v}}_{t-1}, \mathbf{v}_t),$$
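where $\alpha_t$ is the step size, $\mathbf{x}_t \in \mathbb{R}^d$ is the iterate in the $t$-th iteration, and $\mathbf{m}_t, \mathbf{v}_t \in \mathbb{R}^d$ are the exponential moving averages of the gradient and the squared gradient at the $t$-th iteration, respectively. More specifically, $\mathbf{m}_t$ and $\mathbf{v}_t$ are defined as follows (here we denote by $\mathbf{g}_t^2$ the element-wise square of the vector $\mathbf{g}_t$):
$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\,\mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\,\mathbf{g}_t^2.$$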
Here $\beta_1$ and $\beta_2$ are hyperparameters of the algorithm, and $\mathbf{g}_t$ is the stochastic gradient at the $t$-th iteration. However, Wilson et al. (2017) found that for over-parameterized neural networks, training with Adam or its variants typically generalizes worse than SGD, even when the training performance is better. In particular, they found that carefully tuned SGD with momentum, weight decay and appropriate learning rate decay strategies can significantly outperform adaptive gradient algorithms in terms of test error. This problem is often referred to as the generalization gap for adaptive gradient methods. To close this generalization gap of Adam and AMSGrad, Chen and Gu (2018) proposed a partially adaptive momentum estimation method (Padam). Instead of scaling the gradient by $\sqrt{\hat{\mathbf{v}}_t}$, this method scales the gradient by $\hat{\mathbf{v}}_t^p$, where $p \in [0, 1/2]$ is a hyperparameter. This gives rise to the following update formula (we denote by $\hat{\mathbf{v}}_t^p$ the element-wise $p$-th power of the vector $\hat{\mathbf{v}}_t$):
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t \frac{\mathbf{m}_t}{\hat{\mathbf{v}}_t^{\,p}}, \quad \text{with } \hat{\mathbf{v}}_t = \max(\hat{\mathbf{v}}_{t-1}, \mathbf{v}_t).$$
Evidently, when $p = 1/2$, Padam reduces to AMSGrad. Padam also reduces to the corrected version of RMSProp (Reddi et al., 2018) when $p = 1/2$ and $\beta_1 = 0$.
Despite the successes of adaptive gradient methods for training deep neural networks, the convergence guarantees for these algorithms are mostly restricted to online convex optimization (Duchi et al., 2011; Kingma and Ba, 2014; Reddi et al., 2018; Chen and Gu, 2018). Therefore, there is a large gap between the existing online convex optimization guarantees for adaptive gradient methods and their empirical successes in nonconvex optimization. To bridge this gap, there have been a few recent attempts to prove nonconvex optimization guarantees for adaptive gradient methods. More specifically, Basu et al. (2018) proved the convergence rate of RMSProp and Adam when using the deterministic gradient rather than the stochastic gradient. Li and Orabona (2018) established the convergence rate of AdaGrad, assuming the gradient is $L$-Lipschitz continuous. Ward et al. (2018) proved the convergence rate of a simplified AdaGrad, in which the moving average of the norms of the gradient vectors is used to adjust the gradient vector, in both deterministic and stochastic settings for smooth nonconvex functions. Nevertheless, the convergence guarantees in Basu et al. (2018) and Ward et al. (2018) are still limited to simplified algorithms. Another attempt to obtain a convergence rate in the stochastic setting was made recently by Zou and Shen (2018), who focus only on the case in which the momentum vanishes.
In this paper, we provide a sharp convergence analysis of the adaptive gradient methods. In particular, we analyze the state-of-the-art adaptive gradient method, i.e., Padam (Chen and Gu, 2018), and prove its convergence rate for smooth nonconvex objective functions in the stochastic optimization setting. Our results directly imply the convergence rates for AMSGrad (the corrected version of Adam) and the corrected version of RMSProp (Reddi et al., 2018). Our analyses can be extended to other adaptive gradient methods such as AdaGrad, AdaDelta (Zeiler, 2012) and Nadam (Dozat, 2016) mentioned above, but we omit these extensions in this paper for the sake of conciseness. It is worth noting that our convergence analysis places equal emphasis on the dependence on the number of iterations $T$ and on the dimension $d$ in the convergence rate. This is motivated by the fact that modern machine learning methods, especially the training of deep neural networks, usually require solving a very high-dimensional nonconvex optimization problem. The order of the dimension $d$ is usually comparable to or even larger than the total number of iterations $T$. Take training the latest convolutional neural network DenseNet-BC (Huang et al., 2017) with depth and growth rate on CIFAR-10 (Krizhevsky, 2009) as an example. According to Huang et al. (2017), the network is trained with in total million iterations, however the number of parameters in the network is million. This example shows that $d$ can indeed be of the same order as $T$ in practice. Therefore, we argue that it is very important to show the precise dependence on both $T$ and $d$ in the convergence analysis of adaptive gradient methods for modern machine learning.
When we were preparing this manuscript, we noticed that there was a paper (Chen et al., 2018) released on arXiv on August 8th, 2018, which analyzes the convergence of a class of Adam-type algorithms including AMSGrad and AdaGrad for nonconvex optimization. Our work is an independent work, and our derived convergence rate for AMSGrad is faster than theirs.
1.1 Our Contributions
The main contributions of our work are summarized as follows:
We prove that the convergence rate of Padam to a stationary point for stochastic nonconvex optimization is
where are the stochastic gradients and . When the stochastic gradients are -bounded, (1) matches the convergence rate of vanilla SGD in terms of the rate of .
Our result implies the convergence rate for AMSGrad is
which has a better dependence on the dimension and than the convergence rate proved in Chen et al. (2018), i.e.,
1.2 Additional Related Work
Here we briefly review other related work on nonconvex stochastic optimization.
Ghadimi and Lan (2013) proposed a randomized stochastic gradient (RSG) method, and proved its convergence rate to a stationary point. Ghadimi and Lan (2016) proposed a randomized stochastic accelerated gradient (RSAG) method, whose convergence rate is stated in terms of an upper bound on the variance of the stochastic gradient. Motivated by the success of stochastic momentum methods in deep learning (Sutskever et al., 2013), Yang et al. (2016) provided a unified convergence analysis for both the stochastic heavy-ball method and the stochastic variant of Nesterov’s accelerated gradient method, and proved a convergence rate to a stationary point for smooth nonconvex functions. Reddi et al. (2016) and Allen-Zhu and Hazan (2016) proposed variants of the stochastic variance-reduced gradient (SVRG) method (Johnson and Zhang, 2013) that are provably faster than gradient descent in the nonconvex finite-sum setting. Lei et al. (2017) proposed a stochastically controlled stochastic gradient (SCSG) method, which further improves the convergence rate of SVRG for finite-sum smooth nonconvex optimization. Very recently, Zhou et al. (2018) proposed a new algorithm called stochastic nested variance-reduced gradient (SNVRG), which achieves strictly better gradient complexity than both SVRG and SCSG for finite-sum and stochastic smooth nonconvex optimization.
There is another line of research in stochastic smooth nonconvex optimization, which makes use of the -nonconvexity of a nonconvex function (i.e., ). More specifically, Natasha 1 (Allen-Zhu, 2017b) and Natasha 1.5 (Allen-Zhu, 2017a) have been proposed, which solve a modified regularized problem and achieve faster convergence rate to first-order stationary points than SVRG and SCSG in the finite-sum and stochastic settings respectively. In addition, Allen-Zhu (2018) proposed an SGD4 algorithm, which optimizes a series of regularized problems, and is able to achieve a faster convergence rate than SGD.
1.3 Organization and Notation
The remainder of this paper is organized as follows: We present the problem setup and review the algorithms in Section 2. We provide the convergence guarantee of Padam for stochastic smooth nonconvex optimization in Section 3. Finally, we conclude our paper in Section 4.
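Notation. Scalars are denoted by lower case letters, vectors by lower case bold face letters, and matrices by upper case bold face letters. For a vector $\mathbf{x} \in \mathbb{R}^d$, we denote the $\ell_p$ norm ($p \ge 1$) of $\mathbf{x}$ by $\|\mathbf{x}\|_p = \big(\sum_{i=1}^d |x_i|^p\big)^{1/p}$, and the $\ell_\infty$ norm of $\mathbf{x}$ by $\|\mathbf{x}\|_\infty = \max_{1 \le i \le d} |x_i|$. For a sequence of vectors $\{\mathbf{g}_j\}_{j=1}^t$, we denote by $g_{j,i}$ the $i$-th element in $\mathbf{g}_j$. We also denote $\mathbf{g}_{1:t,i} = [g_{1,i}, g_{2,i}, \ldots, g_{t,i}]^\top$. With slight abuse of notation, for any two vectors $\mathbf{a}$ and $\mathbf{b}$, we denote $\mathbf{a}^2$ as the element-wise square, $\mathbf{a}^p$ as the element-wise power operation, $\mathbf{a}/\mathbf{b}$ as the element-wise division and $\max(\mathbf{a}, \mathbf{b})$ as the element-wise maximum. For a matrix , we define . Given two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ if there exists a constant $0 < C < +\infty$ such that $a_n \le C\, b_n$. We use notation $\tilde{O}(\cdot)$ to hide logarithmic factors.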
2 Problem Setup and Algorithms
In this section, we first introduce the preliminary definitions used in this paper, followed by the problem setup of stochastic nonconvex optimization. Then we review the state-of-the-art adaptive gradient method, i.e., Padam (Chen and Gu, 2018), along with AMSGrad (the corrected version of Adam) (Reddi et al., 2018) and the corrected version of RMSProp (Tieleman and Hinton, 2012; Reddi et al., 2018).
2.1 Problem Setup
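We study the following stochastic nonconvex optimization problem
$$\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x}) := \mathbb{E}_{\xi}\big[f(\mathbf{x}; \xi)\big],$$
where $\xi$ is a random variable satisfying certain distribution, and $f(\mathbf{x}; \xi): \mathbb{R}^d \to \mathbb{R}$ is an $L$-smooth nonconvex function. In the stochastic setting, one cannot directly access the full gradient of $f(\mathbf{x})$. Instead, one can only get unbiased estimators of the gradient of $f(\mathbf{x})$, namely $\nabla f(\mathbf{x}; \xi)$. This setting has been studied in Ghadimi and Lan (2013, 2016).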
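To make the stochastic setting concrete, the following minimal Python sketch checks numerically that a sampled gradient is an unbiased estimator of the full gradient. Everything in it is an illustrative assumption rather than part of our setup: the random variable $\xi$ is taken to be an index drawn uniformly from a small synthetic data set, and the per-sample loss $f(\mathbf{x}; \xi) = 1 - \cos(\mathbf{a}_\xi^\top \mathbf{x} - b_\xi)$ is chosen only because it is smooth and nonconvex.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: xi is an index drawn uniformly from n samples (a_i, b_i).
n, d = 200, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def stochastic_grad(x, i):
    """Gradient of the illustrative loss f(x; xi=i) = 1 - cos(a_i^T x - b_i)."""
    return np.sin(A[i] @ x - b[i]) * A[i]

def full_grad(x):
    """Gradient of f(x) = E_xi[f(x; xi)] = (1/n) * sum_i f(x; i)."""
    residual = A @ x - b
    return (np.sin(residual)[:, None] * A).mean(axis=0)

x = rng.normal(size=d)

# Averaging the stochastic gradient over the whole distribution of xi
# recovers the full gradient exactly, which is what unbiasedness means.
expected_grad = np.mean([stochastic_grad(x, i) for i in range(n)], axis=0)

# A single draw is noisy, but a Monte Carlo average approaches the full gradient.
draws = rng.integers(0, n, size=20000)
mc_grad = np.mean([stochastic_grad(x, i) for i in draws], axis=0)
mc_error = np.max(np.abs(mc_grad - full_grad(x)))
```

An optimizer in this setting only ever calls something like `stochastic_grad`; `full_grad` is available here solely because the toy distribution is finite.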
In this section we introduce the algorithms we study in this paper. We mainly consider three algorithms: Padam (Chen and Gu, 2018), AMSGrad (Reddi et al., 2018) and a corrected version of RMSProp (Tieleman and Hinton, 2012; Reddi et al., 2018).
The Padam algorithm is given in Algorithm 1. It was originally proposed by Chen and Gu (2018) to improve the generalization performance of adaptive gradient methods. As shown in Algorithm 1, the learning rate of Padam is $\alpha_t \hat{\mathbf{V}}_t^{-p}$, where $\hat{\mathbf{V}}_t = \mathrm{diag}(\hat{\mathbf{v}}_t)$ and $p \in [0, 1/2]$ is a partially adaptive parameter. With this parameter $p$, Padam unifies AMSGrad and SGD with momentum, and gives a general framework of algorithms with exponential moving average. Padam reduces to the AMSGrad algorithm when $p = 1/2$. If $p = 1/2$ and $\beta_1 = 0$, Padam reduces to a corrected version of the RMSProp algorithm given by Reddi et al. (2018). As important special cases of Padam, we show AMSGrad and the corrected version of RMSProp in Algorithms 2 and 3 respectively.
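The update described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the function name `padam`, the constant step size `alpha` (in place of a schedule $\alpha_t$), the zero initialization of $\mathbf{m}_0$, $\mathbf{v}_0$, $\hat{\mathbf{v}}_0$, and the small constant `eps` added for numerical safety are all choices made here for illustration.

```python
import numpy as np

def padam(grad_fn, x0, steps, alpha=0.01, beta1=0.9, beta2=0.999, p=0.125, eps=1e-8):
    """Sketch of a Padam-style update with a constant step size.

    grad_fn returns a (stochastic) gradient at x. The zero initialization
    and eps are assumptions of this sketch, not taken from the paper.
    """
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)       # moving average of gradients
    v = np.zeros_like(x)       # moving average of squared gradients
    v_hat = np.zeros_like(x)   # running element-wise maximum of v
    for _ in range(steps):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        v_hat = np.maximum(v_hat, v)
        # Partially adaptive scaling: divide by the element-wise p-th power.
        x = x - alpha * m / (v_hat + eps) ** p
    return x

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x itself.
grad = lambda x: x
x0 = np.array([1.0, -2.0, 3.0])
x_padam = padam(grad, x0, steps=2000, p=0.125)
x_amsgrad = padam(grad, x0, steps=2000, p=0.5)               # p = 1/2
x_rmsprop = padam(grad, x0, steps=2000, p=0.5, beta1=0.0)    # p = 1/2, beta1 = 0
```

Setting `p=0.5` gives the AMSGrad-style scaling and additionally setting `beta1=0.0` gives the corrected RMSProp-style update, while `p=0.0` removes the adaptive scaling altogether and leaves a plain momentum step.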
3 Main Theory
In this section we present our main theoretical results. We first introduce the following assumptions.
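[Bounded Gradient] $f(\mathbf{x}) = \mathbb{E}_{\xi} f(\mathbf{x}; \xi)$ has $G_\infty$-bounded stochastic gradient. That is, for any $\xi$, we assume that
$$\|\nabla f(\mathbf{x}; \xi)\|_\infty \le G_\infty.$$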
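It is worth mentioning that Assumption 3 is slightly weaker than the $\ell_2$-boundedness assumption $\|\nabla f(\mathbf{x}; \xi)\|_2 \le G_2$ used in Reddi et al. (2016); Chen et al. (2018). Since $\|\mathbf{x}\|_\infty \le \|\mathbf{x}\|_2 \le \sqrt{d}\,\|\mathbf{x}\|_\infty$, the $\ell_2$-boundedness assumption implies Assumption 3 with $G_\infty = G_2$. Meanwhile, $G_\infty$ will be tighter than $G_2$ by a factor of $\sqrt{d}$ when the coordinates of $\nabla f(\mathbf{x}; \xi)$ are almost equal to each other.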
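The gap between the two boundedness assumptions can be seen on two extreme gradient vectors. The short sketch below is only an illustration; the dimension `d = 1000` and the two test vectors are arbitrary choices made here. It compares the $\ell_\infty$ and $\ell_2$ norms when all coordinates are equal and when a single coordinate is nonzero.

```python
import numpy as np

d = 1000
g_equal = np.full(d, 0.5)      # every coordinate has the same magnitude
g_sparse = np.zeros(d)
g_sparse[0] = 0.5              # a single nonzero coordinate

def linf_and_l2(g):
    """Return the l_infinity norm and the l_2 norm of g."""
    return np.linalg.norm(g, ord=np.inf), np.linalg.norm(g, ord=2)

linf_eq, l2_eq = linf_and_l2(g_equal)
linf_sp, l2_sp = linf_and_l2(g_sparse)

# Ratio l_2 / l_infinity: sqrt(d) for equal coordinates, 1 for the sparse vector.
ratio_equal = l2_eq / linf_eq
ratio_sparse = l2_sp / linf_sp
```

For the equal-coordinate vector the $\ell_2$ bound is larger than the $\ell_\infty$ bound by the full factor $\sqrt{d}$, while for the one-sparse vector the two norms coincide.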
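[$L$-smooth] $f(\mathbf{x}) = \mathbb{E}_{\xi} f(\mathbf{x}; \xi)$ is $L$-smooth: for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, we have
$$\big| f(\mathbf{x}) - f(\mathbf{y}) - \langle \nabla f(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle \big| \le \frac{L}{2} \|\mathbf{x} - \mathbf{y}\|_2^2.$$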
Assumption 3 is a standard assumption frequently used in the analysis of gradient-based algorithms. It is equivalent to the $L$-gradient Lipschitz condition, which is often written as $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2 \le L \|\mathbf{x} - \mathbf{y}\|_2$.
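Both forms of the smoothness condition can be checked numerically on a simple nonconvex example. The sketch below is an assumption-laden illustration: the test function $f(\mathbf{x}) = \sum_i \cos(x_i)$ is chosen here only because its Hessian is diagonal with entries in $[-1, 1]$, so it is $L$-smooth with $L = 1$; it is not a function studied in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 1.0  # smoothness constant of the illustrative function below

def f(x):
    """Nonconvex test function f(x) = sum_i cos(x_i)."""
    return np.sum(np.cos(x))

def grad_f(x):
    """Gradient of f, computed coordinate-wise."""
    return -np.sin(x)

smooth_gaps = []      # (L/2)||x - y||^2 - |f(x) - f(y) - <grad f(y), x - y>|
lipschitz_gaps = []   # L||x - y|| - ||grad f(x) - grad f(y)||
for _ in range(1000):
    x, y = rng.normal(scale=3.0, size=(2, 10))
    dist = np.linalg.norm(x - y)
    linearization_error = abs(f(x) - f(y) - grad_f(y) @ (x - y))
    smooth_gaps.append(0.5 * L * dist ** 2 - linearization_error)
    lipschitz_gaps.append(L * dist - np.linalg.norm(grad_f(x) - grad_f(y)))

# Both conditions hold when every gap is nonnegative (up to rounding error).
worst_smooth_gap = min(smooth_gaps)
worst_lipschitz_gap = min(lipschitz_gaps)
```

A nonnegative `worst_smooth_gap` corresponds to the quadratic bound in the assumption, and a nonnegative `worst_lipschitz_gap` to the gradient Lipschitz form.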
We are now ready to present our main result.
and . From Theorem 3, we can see that and are independent of the number of iterations $T$ and dimension $d$. In addition, if , it is easy to see that also has an upper bound that is independent of $T$ and $d$. The following corollary is a special case of Theorem 3 when and . Under the same conditions of Theorem 3, if , then the output of Padam satisfies
where and and are the same as in Theorem 3, and is defined as follows:
Corollary 3 simplifies the result of Theorem 3 by choosing under the condition . We remark that this choice of is optimal in an important special case studied in Duchi et al. (2011); Reddi et al. (2018): when the gradient vectors are sparse, we assume that . Then for , it follows that
(4) implies that the upper bound provided by (3) is strictly better than (2) with . Therefore when the gradient vectors are sparse, Padam achieves faster convergence when is located in . We show the convergence rate under different choices of step size . If
then by (3), we have
Note that the convergence rate given by (5) is related to the sum of gradient norms . As is mentioned in Remark 3, when the stochastic gradients , are sparse, we follow the assumption given by Duchi et al. (2011) that . More specifically, suppose for some . We have
When , we have
which matches the rate achieved by nonconvex SGD (Ghadimi and Lan, 2016) in terms of the dependence on $T$.
If we set which is not related to , then (3) suggests that
which matches the convergence result for nonconvex SGD (Ghadimi and Lan, 2016) in terms of the dependence on $T$.
Next we show the convergence analysis of two popular algorithms: AMSGrad and RMSProp. Since AMSGrad and RMSProp can be seen as two specific instances of Padam, we can apply Theorem 3 with specific parameter choice, and obtain the following two corollaries.
where are defined as follows:
It can be seen that the dependence on $d$ in their bound is quadratic, which is worse than the linear dependence implied by (7). Moreover, by Corollary 3, Corollary 3 and (4), it is easy to see that Padam with is faster than AMSGrad where $p = 1/2$, which backs up the experimental results in Chen and Gu (2018).
where are defined in the following:
In this paper, we provided a sharp analysis of the state-of-the-art adaptive gradient method Padam (Chen and Gu, 2018), and proved its convergence rate for smooth nonconvex optimization. Our results directly imply the convergence rates of AMSGrad and the corrected version of RMSProp for smooth nonconvex optimization. In terms of the number of iterations $T$, the derived convergence rates in this paper match the rate achieved by SGD; in terms of the dimension $d$, our results give a better rate than existing work. Our results also offer some insights into the choice of the partially adaptive parameter $p$ in the Padam algorithm: when the gradients are sparse, Padam with achieves the fastest convergence rate. This theoretically backs up the experimental results in existing work (Chen and Gu, 2018).
We would like to thank Jinghui Chen for discussion on this work.
Appendix A Proof of the Main Theory
Here we provide the detailed proof of the main theorem.
A.1 Proof of Theorem 3
Let . To prove Theorem 3, we need the following lemmas:
Suppose that has -bounded stochastic gradient. Let be the weight parameters, , be the step sizes in Algorithm 1 and . We denote . Suppose that and , then under Assumption 3, we have the following two results:
To deal with stochastic momentum and stochastic weight , following Yang et al. (2016), we define an auxiliary sequence as follows: let , and for each ,
Lemma A.1 shows that can be represented in two different ways.
Let be defined in (9). For , we have
For , we have
By Lemma A.1, we connect with and
The following two lemmas give bounds on and , which play important roles in our proof.
Let be defined in (9). For , we have
Let be defined in (9). For , we have
Now we are ready to prove Theorem 3.
Proof of Theorem 3.
Since $f$ is $L$-smooth, we have:
In the following, we bound , and separately.
Bounding term : When , we have
For , we have
The first inequality holds because for a positive diagonal matrix , we have . The second inequality holds due to . Next we bound . We have
Bounding term : For , we have
Bounding term : For , we have
The first inequality is obtained by introducing Lemma A.1.