We study gradient-based optimization algorithms for minimizing a differentiable nonconvex function $f:\mathbb{R}^d \to \mathbb{R}$, where $f$ can potentially be stochastic, i.e., $f(x) = \mathbb{E}_{\xi}[F(x,\xi)]$. Such choices of $f$ cover a wide range of problems in machine learning; as a result, their study motivates a vast body of current optimization literature. Classical approaches for minimizing $f$ include gradient descent (GD) and stochastic gradient descent (SGD). More recently, adaptive gradient methods, e.g., Adagrad (Duchi et al., 2011), ADAM (Kingma and Ba, 2014), and RMSProp (Tieleman and Hinton, 2012), have gained popularity due to their empirical performance, in particular their faster convergence in complex optimization problems such as adversarial training and language modeling. Adaptive methods differ from GD and SGD in that they allow step sizes to depend on past gradients and to vary across coordinates. Previous analysis has shown that adaptive methods are more robust to variation in hyper-parameters (Ward et al., 2018) and adapt to sparse gradients (Duchi et al., 2011). A more detailed review of related literature is provided in Appendix A.
In this paper, we focus on a subclass of adaptive gradient methods and theoretically justify their empirical effectiveness and applicability. Specifically, we show that adaptively scaled gradient methods can converge arbitrarily faster than fixed-step gradient descent. This result holds under a novel smoothness condition that is strictly weaker than the standard Lipschitz-gradient assumption pervasive in the literature; hence it captures many functions that are not globally Lipschitz smooth. More importantly, the proposed smoothness condition is validated in precisely the type of neural network training experiments for which there is empirical evidence that adaptive gradient methods outperform gradient methods.
More specifically, we analyze the convergence properties of a widely used technique, clipped gradient descent. In terms of step size choice, gradient clipping is, up to constant factors, equivalent to normalized gradient descent (NGD), a canonical adaptive method. Instead of using constant step sizes, clipped GD adaptively chooses a step size based on the (stochastic) gradient norm. Even though clipping is standard practice in tasks such as language modeling (e.g., Merity et al., 2018; Gehring et al., 2017; Peters et al., 2018), it lacks a solid theoretical grounding. Goodfellow et al. (2016) and Pascanu et al. (2013, 2012) discuss the gradient explosion problem in recurrent models and consider clipping as an intuitive trick to work around the explosion. We formalize this argument and prove that clipped GD can be arbitrarily faster than ordinary GD.
By examining the smoothness condition and providing new convergence bounds on adaptively scaled methods, we hope this work can help close the following gap between theory and practice. On the one hand, powerful techniques such as Nesterov’s momentum and variance reduction have been proposed to theoretically accelerate convex and nonconvex optimization. However, these techniques, at least for now, seem to have limited applicability in deep learning (Defazio and Bottou, 2018). On the other hand, some widely used empirical techniques (e.g., heavy-ball momentum, adaptivity) do not have theoretical acceleration guarantees. We suspect that one of the many reasons is a misalignment of the problem assumptions. Our work demonstrates that the concept of acceleration critically relies on the problem assumptions and that the standard global Lipschitz-gradient condition may not hold in some applications.
We now summarize the main contributions of this paper as follows:
- We propose a new smoothness condition that allows the local smoothness constant to increase with the gradient norm. This condition is strictly weaker than the standard Lipschitz-gradient assumption, and it is supported by empirical evidence in neural network training.
- We provide a convergence rate for clipped GD under our smoothness assumption (Theorem 3).
- We show that stochastic clipped GD converges at the expected rate (Theorem 8). We explain why our proof does not apply to SGD with fixed step sizes, outlining the key hurdles.
- We support our proposed theory with several experiments. Since gradient clipping is widely used in training recurrent models for natural language processing, we validate our smoothness condition (see Assumption 3) in this setting; we observe that the smoothness grows with gradient norms along the training trajectory (Figure 1(a)). Additional experiments suggest that clipping allows the training trajectory to cross non-smooth regions of the loss, thereby accelerating convergence. Moreover, we show that clipped GD can converge faster (in training) than momentum-SGD, and achieve the same generalization performance as a strong baseline algorithm (e.g., test accuracy in 200 epochs for ResNet20 on the Cifar10 dataset). Please see Section 6 for more details.
2 Problem setup and algorithms
In this section, we set up the problem and introduce our new, relaxed smoothness assumption. Recall that we wish to solve the nonconvex optimization problem
$$\min_{x \in \mathbb{R}^d} f(x).$$
In general this problem is intractable; so, instead of seeking a global optimum, we seek an $\epsilon$-stationary point, i.e., a point $x$ such that $\|\nabla f(x)\| \le \epsilon$. Furthermore, we assume that the following conditions hold in the neighborhood of the sublevel set $\mathcal{S}$ for a given initialization $x_0$, where
$$\mathcal{S} := \{x \mid \exists\, y \text{ such that } f(y) \le f(x_0) \text{ and } \|x - y\| \le 1\}. \tag{1}$$
(Footnote 1: The constant “1” in expression (1) is arbitrary and can be replaced by any fixed positive constant.)
Assumption 1. Function $f$ is lower bounded by $f^* > -\infty$.
Assumption 2. Function $f$ is twice differentiable.
The above assumptions are standard. Below we introduce our new relaxed smoothness assumption.
Assumption 3 ($(L_0, L_1)$-smoothness).
$f$ is $(L_0, L_1)$-smooth if there exist positive constants $L_0$ and $L_1$ such that $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$.
Section 3 will motivate Assumption 3 and discuss how it relaxes the canonical Lipschitz-gradient assumption and enlarges the class of functions considered. We note here a brief point regarding Assumption 2: $(L_0, L_1)$-smoothness can be generalized to once-differentiable functions by replacing Assumption 3 with the following definition:
$$\limsup_{\delta \to 0} \frac{\|\nabla f(x) - \nabla f(x + \delta)\|}{\|\delta\|} \le L_1 \|\nabla f(x)\| + L_0.$$
This condition implies that $\nabla f$ is locally Lipschitz, and hence differentiable almost everywhere. All our results go through if the integrals are handled more carefully. To avoid such complications and simplify the exposition, we assume that the function is twice differentiable.
2.1 Gradient descent algorithms
In this section, we review a few well-known variants of gradient-based algorithms that relate to this work. We start with ordinary gradient descent,
$$x_{k+1} = x_k - h \nabla f(x_k), \tag{2}$$
where $h$ is a fixed step size. This algorithm is the baseline algorithm used in neural network training. Many modifications of it have been proposed to stabilize or accelerate training. One such technique, of particular importance to this work, is clipped gradient descent. The update for clipped GD can be written as
$$x_{k+1} = x_k - h_c \nabla f(x_k), \quad \text{where } h_c := \min\Big\{\eta_c, \frac{\gamma \eta_c}{\|\nabla f(x_k)\|}\Big\}. \tag{3}$$
Another algorithm that is less common in practice but has attracted theoretical interest is normalized gradient descent. The update for NGD can be written as
$$x_{k+1} = x_k - h_n \nabla f(x_k), \quad \text{where } h_n := \frac{\eta_n}{\|\nabla f(x_k)\| + \beta}. \tag{4}$$
Clipped GD and NGD are almost equivalent. Indeed, if we set $\gamma \eta_c = \eta_n$ and $\eta_c = \eta_n / \beta$, then
$$\frac{1}{2} h_c \le h_n \le 2 h_c.$$
Therefore, clipped GD is equivalent to NGD up to a constant factor in the step size choice. Consequently, the convergence rates in Section 4 and Section 5 for clipped GD also apply to NGD, and we omit the repeated analysis for conciseness.
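The relation between the two step-size rules can be checked numerically. The sketch below is illustrative only: it assumes the clipped step size $h_c = \min\{\eta_c, \gamma\eta_c/\|\nabla f(x)\|\}$ and the normalized step size $h_n = \eta_n/(\|\nabla f(x)\| + \beta)$, and the function names and parameter values are hypothetical choices made for this example. With $\gamma\eta_c = \eta_n$ and $\eta_c = \eta_n/\beta$, the ratio $h_n/h_c$ should stay within a factor of two of one for gradients of any magnitude.

```python
import numpy as np

def clipped_step_size(grad, eta_c, gamma):
    """Step size of clipped GD: min(eta_c, gamma * eta_c / ||grad||)."""
    g_norm = np.linalg.norm(grad)
    if g_norm == 0.0:
        return eta_c
    return min(eta_c, gamma * eta_c / g_norm)

def normalized_step_size(grad, eta_n, beta):
    """Step size of normalized GD: eta_n / (||grad|| + beta)."""
    return eta_n / (np.linalg.norm(grad) + beta)

# Hypothetical parameter values, matched so that gamma * eta_c = eta_n
# and eta_c = eta_n / beta.
eta_n, beta = 0.1, 2.0
eta_c, gamma = eta_n / beta, beta

rng = np.random.default_rng(0)
ratios = []
for scale in [1e-3, 1e-1, 1.0, 10.0, 1e3]:
    grad = scale * rng.standard_normal(5)
    h_c = clipped_step_size(grad, eta_c, gamma)
    h_n = normalized_step_size(grad, eta_n, beta)
    ratios.append(h_n / h_c)

print(ratios)  # every ratio lies between 0.5 and 1
```

Under this matching, $h_c = \eta_n / \max\{\beta, \|\nabla f(x)\|\}$, so the ratio equals $\max\{\beta, \|\nabla f(x)\|\} / (\|\nabla f(x)\| + \beta)$, which always lies in $[1/2, 1]$.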
3 Relaxed smoothness condition and motivations
In this section, we discuss and motivate the relaxed smoothness condition in Assumption 3. We start with the traditional definition of smoothness, recalling how it leads to the step size choice in GD.
3.1 Function smoothness (Lipschitz gradients)
Recall that we wish to solve $\min_x f(x)$. The objective $f$ is called $L$-smooth if
$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d. \tag{5}$$
For twice differentiable functions, condition (5) is equivalent to $\|\nabla^2 f(x)\| \le L$ for all $x \in \mathbb{R}^d$. Under this smoothness, one can show the following well-known upper bound:
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2. \tag{6}$$
Suppose we set $y = x - h \nabla f(x)$; then we can pick the step size $h$ to minimize the corresponding upper bound (6) by solving for $h^*$, to obtain
$$h^* = \arg\min_h \Big\{ f(x) - h\|\nabla f(x)\|^2 + \frac{L h^2}{2}\|\nabla f(x)\|^2 \Big\} = \frac{1}{L}. \tag{7}$$
This choice of $h$ leads to GD with a fixed step. Carmon et al. (2017) show that GD with $h = 1/L$ is, up to a constant, optimal for optimizing smooth nonconvex functions. Noting this optimality relation between $L$-smoothness and step size choice, we are led to ask the question: “Is clipped gradient descent optimized for a different smoothness condition?” We answer this question in Section 3.2.
The usual $L$-smoothness assumption (5) enables clean theoretical analysis but has its limitations. Assuming the existence of a global constant $L$ that upper bounds the variation of the gradient is very restrictive. For example, even simple polynomials of degree three or higher break the assumption. One workaround is to assume that $L$ exists in a compact region, and either prove that the iterates do not escape the region or run projection-based algorithms. However, such an assumption can make $L$ very large and slow down the rate. In Section 4, we will show that a slow rate is unavoidable for gradient descent with a fixed step size, whereas clipped gradient descent can greatly improve the dependency on $L$.
Moreover, though the bound (6) is optimal in the worst case, it can be too conservative. It is true that within any compact region, the function smoothness is bounded. However, the local smoothness can vary drastically (see Figure 1 for example). Gradient-based methods can speed up convergence by taking larger steps in flat regions. Intuitively, this is why adaptive gradient methods can be faster.
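The sensitivity of a fixed step size to the local smoothness can be illustrated on a toy problem. The sketch below is not from the paper: it assumes the illustrative objective $f(x) = x^4$ (so that $f''(x) = 12x^2$ is unbounded), and the helper `run_gd` and all numerical values are hypothetical. It shows that a step size which is safe in a flat region diverges from a steeper starting point, while the conservative choice $h = 1/L$, with $L$ taken over the initial sublevel set, remains stable.

```python
def grad(x):
    """Gradient of the illustrative objective f(x) = x**4."""
    return 4.0 * x ** 3

def run_gd(x0, h, steps=200, blowup=1e6):
    """Run fixed-step GD; return |x| at the end, or inf if the iterates blow up."""
    x = x0
    for _ in range(steps):
        x = x - h * grad(x)
        if abs(x) > blowup:
            return float("inf")
    return abs(x)

x0 = 3.0
L_sublevel = 12.0 * x0 ** 2        # sup of |f''| over [-3, 3], i.e. 108

diverged = run_gd(x0, h=0.1)               # step too large for the steep region
stable = run_gd(x0, h=1.0 / L_sublevel)    # conservative 1/L step
flat_region = run_gd(0.5, h=0.1)           # same large step, flat starting point

print(diverged, stable, flat_region)
```

The same step size $h = 0.1$ diverges from $x_0 = 3$ but makes steady progress from $x_0 = 0.5$, which is the kind of variation in local smoothness that an adaptive step size can exploit.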
3.2 Relaxed smoothness assumption
We now return to the question raised in Section 3.1. As step sizes for clipped GD and NGD are related by a constant factor, we answer the question by studying NGD. Inspired by the quadratic upper bound (6), assume that the NGD step size $h_n = \eta_n / (\|\nabla f(x)\| + \beta)$ optimizes the quadratic function (cf. upper bound (6)):
$$h_n = \arg\min_h \Big\{ f(x) - h\|\nabla f(x)\|^2 + \frac{L(x) h^2}{2}\|\nabla f(x)\|^2 \Big\}.$$
Then we can deduce that
$$L(x) = \frac{1}{h_n} = \frac{\|\nabla f(x)\| + \beta}{\eta_n}. \tag{8}$$
Based on the intuition from (8), we propose the following relaxed smoothness condition.
Definition 1. A twice differentiable function $f$ is $(L_0, L_1)$-smooth if
$$\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|.$$
Definition 1 strictly relaxes the usual $L$-smoothness. There are two ways to interpret the relaxation. First, when we focus on a compact region, we can balance the constants $L_0$ and $L_1$ such that $L_0 \ll L$ while $L_1 \ll L$. Second, there exist functions that are $(L_0, L_1)$-smooth globally, but not $L$-smooth. Hence the constant $L$ for $L$-smoothness gets larger as the compact set increases, but $L_0$ and $L_1$ stay fixed. An example is given in Lemma 2.
It is worth noting that we do not need the Hessian operator norm and the gradient norm to satisfy a linear relation. As long as they are positively correlated, clipped gradient descent can be shown to achieve a faster rate than fixed-step-size gradient descent. We use the linear relationship for simplicity of exposition.
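Lemma 2. Let $f$ be the univariate polynomial $f(x) = \sum_{i=1}^{d} a_i x^i$ with $a_d \neq 0$. When $d \ge 3$, $f$ is $(L_0, L_1)$-smooth for some $L_0$ and $L_1$ but not $L$-smooth.
Proof. The first claim follows from $\lim_{x \to \pm\infty} |f'(x) / f''(x)| = \infty$. The second claim follows by the unboundedness of $f''(x)$. ∎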
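A quick numerical check makes the polynomial example concrete. The sketch below is illustrative and uses the specific polynomial $f(x) = x^4$ together with the constants $L_0 = 16$, $L_1 = 1$; both are choices made for this example rather than values from the paper. For this $f$, $|f''(x)| - L_1|f'(x)| = 12x^2 - 4|x|^3$ attains its maximum of 16 at $|x| = 2$, so the relaxed bound holds everywhere, while the bound on $|f''|$ needed for ordinary $L$-smoothness grows with the radius of the region.

```python
import numpy as np

def f_prime(x):
    """First derivative of the illustrative polynomial f(x) = x**4."""
    return 4.0 * x ** 3

def f_second(x):
    """Second derivative of f(x) = x**4."""
    return 12.0 * x ** 2

L0, L1 = 16.0, 1.0  # constants chosen for this example
xs = np.linspace(-100.0, 100.0, 200001)

# Relaxed smoothness: |f''(x)| <= L0 + L1 * |f'(x)| on the whole grid.
gap = np.abs(f_second(xs)) - (L0 + L1 * np.abs(f_prime(xs)))
relaxed_holds = bool(np.all(gap <= 1e-9))

# Ordinary smoothness needs L >= sup |f''| = 12 R^2 on [-R, R],
# which grows without bound as the region grows.
hessian_sup = {R: 12.0 * R ** 2 for R in (1.0, 10.0, 100.0)}

print(relaxed_holds, hessian_sup)
```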
3.3 Smoothness in neural networks
We showed that our proposed smoothness condition relaxes the traditional smoothness assumption and is naturally motivated by normalized gradient descent. In this section, we argue that it captures the structure of neural network training. To justify our claim, in Figure 1(a) we empirically show that a strong linear correlation exists between the gradient norm and the estimated local smoothness for LSTM-based language-model training when gradient clipping is applied. For more details of the experiment, please refer to Section 6.
Below we develop some high-level intuition for this phenomenon. We conjecture that the said positive correlation results from the common components in the expressions of the gradient and the Hessian. We illustrate the reasoning behind this conjecture by considering an $\ell$-layer linear network with quadratic loss; a similar computation also holds for nonlinear networks. The training loss of a deep linear network is $\|Y - W_\ell \cdots W_1 X\|^2$, where $Y$ denotes labels, $X$ denotes the input data matrix, and $W_i$ denotes the weights in the $i$-th layer. By Lemma 4.3 of Kawaguchi (2016), we know that
where $\mathrm{vec}(\cdot)$ flattens a matrix in $\mathbb{R}^{m \times n}$ into a vector in $\mathbb{R}^{mn}$; $\otimes$ denotes the Kronecker product. For constants such that , the second order derivative
When , the second term equals . Based on the above expressions, we notice that the gradient norm and Hessian norm may be positively correlated due to the following two observations. First, the gradient and the Hessian share many components such as the matrix product of weights across layers. Second, if one naively upper bounds the norm using Cauchy-Schwarz, then both upper-bounds would be monotonically increasing with respect to and
4 Convergence in the full batch setting
In this section, we analyze the convergence rates of GD and clipped GD under our proposed conditions. We bound the number of iterations required by the algorithms to find an $\epsilon$-stationary point.
4.1 Clipped gradient descent
We start by analyzing the clipped GD algorithm with update defined in equation (3).
4.2 Gradient descent with fixed step size
Gradient descent with a fixed step size is known to converge to first-order $\epsilon$-stationary points in $O(L(f(x_0) - f^*)\epsilon^{-2})$ iterations for $L$-smooth nonconvex functions. By the following theorem of Carmon et al. (2017), this rate is optimal up to a constant.
Theorem 4 (Thm 1 in (Carmon et al., 2017)).
For any deterministic first-order optimization algorithm using gradient oracles, the iteration complexity to optimize an $L$-smooth function to an $\epsilon$-stationary point is at least
$$\frac{c\, L\, (f(x_0) - f^*)}{\epsilon^2}$$
for some numerical constant $c > 0$.
However, we will show below that gradient descent is suboptimal under our relaxed $(L_0, L_1)$-smoothness condition. In particular, to prove the convergence rate for gradient descent with fixed step size, we need to make an additional assumption on gradient norms.
Given an initialization , we assume that
The next theorem states that this assumption is necessary. Particularly, we show that gradient descent with fixed step size cannot converge faster than when . Therefore, GD can be arbitrarily slower than clipped GD under our relaxed smoothness assumption.
The proof starts with an exponentially growing function and shows that the step size for gradient descent must be small. The small step size leads to very slow convergence for another almost linear function with a small gradient. Details of this construction can be found in Appendix C.
Theorems 4 and 5 together show that gradient descent with a fixed step size cannot converge to an -stationary point faster than . Recall that clipped GD algorithm converges as . This rate shows that clipped GD converges much faster than GD when is large, or in other words, when the problem has a poor initialization.
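The gap between the two methods under a poor initialization can be seen on a one-dimensional toy problem. The sketch below is illustrative and not taken from the paper: it assumes the objective $f(x) = \cosh(x)$, for which $|f''(x)| = \cosh(x) \le 1 + |\sinh(x)| = 1 + |f'(x)|$, so the relaxed condition holds with $L_0 = L_1 = 1$. The fixed step size is set to $1/L$ with $L = \cosh(x_0)$, the smoothness constant over the initial sublevel set, and the clipping parameters $\eta_c = \gamma = 1$ are hypothetical choices for this example rather than the values prescribed by the theory.

```python
import math

def grad(x):
    """Gradient of the illustrative objective f(x) = cosh(x)."""
    return math.sinh(x)

def iterations_to_stationarity(x0, step_size_fn, eps=1e-3, max_iters=1_000_000):
    """Count updates until |f'(x)| <= eps (or return max_iters if never reached)."""
    x = x0
    for k in range(max_iters):
        g = grad(x)
        if abs(g) <= eps:
            return k
        x = x - step_size_fn(g) * g
    return max_iters

x0 = 10.5                          # deliberately poor initialization
h_fixed = 1.0 / math.cosh(x0)      # 1/L with L = sup |f''| on the sublevel set
eta_c, gamma = 1.0, 1.0            # hypothetical clipping parameters

gd_iters = iterations_to_stationarity(x0, lambda g: h_fixed)
clipped_iters = iterations_to_stationarity(
    x0, lambda g: min(eta_c, gamma * eta_c / abs(g))
)

print(gd_iters, clipped_iters)
```

Fixed-step GD is forced to use a step size of order $1/\cosh(x_0)$ to remain stable at the start, and it then crawls through the flat region near the minimizer. Clipped GD moves a distance of at most $\gamma\eta_c$ per step in the steep region and takes full steps once the gradient is small, so it needs far fewer iterations.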
Below, we provide an iteration upper bound for the fixed-step gradient descent update (2).
5 Convergence in the stochastic setting
In the stochastic setting, we assume access to the stochastic gradient instead of the exact gradient . For simplicity, we denote below. We need the following assumptions.
, that is, we have unbiased stochastic gradients.
. This implies that .
Bounded gradient is a strong assumption but it is commonly used in proving convergence for adaptive gradient methods (see (Reddi et al., 2019; Ward et al., 2018; Zhou et al., 2018)). In the analysis below, we only discuss the case when , otherwise the condition is equivalent to smooth. The main result of this section is the following convergence guarantee for stochastic clipped GD (based on the stochastic version of the update (3)).
The convergence proof critically relies on the fact that the update distance in each iteration has a fixed upper bound due to clipping. Fixed-step-size SGD has no such bounded update radius, which causes a problem in its proof: if we only assume bounded second moments of the stochastic gradient oracle, we cannot control the distance between the current point and the updated point. Hence one cannot apply Lemma 9 in the same way as for (10) in Appendix E. Though we cannot prove the convergence of SGD with fixed step size, we also cannot theoretically rule out the possibility that it converges. Nevertheless, if we additionally assume that the noise in the stochastic gradient oracle is sub-Gaussian, then we can show that SGD with fixed step size converges at rate . To avoid digressing from the discussion of adaptive methods, we omit this analysis.
6 Experiments
In this section, we summarize our experimental findings on the positive correlation between gradient norm and local smoothness. We then show that clipping accelerates convergence during neural network training. Our experiments are based on two tasks: language modeling and image classification. We run language modeling on the Penn Treebank (PTB) (Mikolov et al., 2010) dataset with AWD-LSTM models (Merity et al., 2018). For image classification, we train ResNet20 (He et al., 2016) on the Cifar10 dataset (Krizhevsky and Hinton, 2009). Details about the smoothness estimation and experimental setups are explained in Appendix F.
First, our experiments test whether the local smoothness constant increases with the gradient norm, as suggested by the relaxed smoothness conditions defined in Section 3. To do so, we evaluate both quantities at points generated by the optimization procedure. We then scatter the local smoothness constants against the gradient norms in Figure 1 and Figure 2. Note that the plots are on a log scale. A linear-scale plot is shown in Appendix Figure 4. We notice that the correlation exists in the default training procedure for language modeling (see Figure 1(a)) but not in the default training for image classification (see Figure 2(a)). This difference aligns with the fact that gradient clipping is widely used in language modeling but is less popular in ResNet training, offering empirical support for our theoretical findings.
We further investigate the cause of the correlation. The plots in Figures 1 and 2 show that the correlation appears when the models are trained with clipped GD and large learning rates. We propose the following explanation. Clipping enables the training trajectory to stably traverse non-smooth regions. Hence, we can observe that gradient norms and smoothness are positively correlated in Figures 1(a) and 2(c). Without clipping, the optimizer has to adopt a small learning rate and stays in a region where the local smoothness does not vary much; otherwise the sequence diverges and a different learning rate is used. Therefore, in the other plots of Figures 1 and 2, the correlation is much weaker.
As positive correlations are present in both language modeling and image classification experiments with large step sizes, our next set of experiments checks whether clipping helps accelerate convergence as predicted by our theory. From Figure 3, we find that the ability to traverse non-smooth regions indeed accelerates convergence. Because gradient clipping is standard practice in language modeling, the LSTM models trained with clipping achieve the best validation performance and the fastest training loss convergence, as expected. For image classification, surprisingly, clipped GD also achieves the fastest convergence and matches the test performance of SGD+momentum. These plots show that clipping can accelerate convergence and achieve good test performance at the same time. We do not analyze the theory of this generalization capability, as it is beyond the scope of this work.
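The kind of measurement described above can be mimicked on a toy objective. The sketch below is an assumption-laden illustration: the estimator used in the experiments is described in Appendix F and is not reproduced here, so `estimate_local_smoothness` is a hypothetical finite-difference stand-in that measures how fast the gradient changes along the update direction. The objective $f(x) = \sum_i \cosh(x_i)$, the starting point, and the clipping parameters are likewise choices made only for this example.

```python
import numpy as np

def grad(x):
    """Gradient of the toy objective f(x) = sum(cosh(x_i))."""
    return np.sinh(x)

def estimate_local_smoothness(x, x_next, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Largest gradient-difference quotient along the segment from x to x_next."""
    d = x_next - x
    g0 = grad(x)
    return max(
        np.linalg.norm(grad(x + t * d) - g0) / np.linalg.norm(t * d)
        for t in fractions
    )

eta_c, gamma = 0.5, 1.0          # hypothetical clipping parameters
x = np.array([6.0, -4.0, 2.0])   # hypothetical starting point
grad_norms, smoothness = [], []

for _ in range(30):
    g = grad(x)
    g_norm = np.linalg.norm(g)
    if g_norm < 1e-8:
        break
    h_c = min(eta_c, gamma * eta_c / g_norm)   # clipped GD step size
    x_next = x - h_c * g
    grad_norms.append(g_norm)
    smoothness.append(estimate_local_smoothness(x, x_next))
    x = x_next

# Correlation between log gradient norm and log estimated smoothness.
corr = np.corrcoef(np.log(grad_norms), np.log(smoothness))[0, 1]
print(corr)
```

On this toy problem the estimated smoothness is large exactly where the gradient norm is large, so the log-log correlation along the clipped trajectory is positive, which is the qualitative pattern the scatter plots are meant to test.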
7 Discussion
Much progress has been made to close the gap between upper and lower oracle complexities for first-order smooth optimization. The works dedicated to this goal provide important insights and tools for understanding the optimization procedure. However, there is another gap that separates theoretically accelerated algorithms from empirically fast algorithms. This work aims to close this gap. Specifically, we proposed a relaxed smoothness assumption that is supported by empirical evidence. We analyzed a simple but widely used optimization technique known as gradient clipping and provided a theoretical guarantee that clipping can accelerate gradient descent. This phenomenon aligns remarkably well with empirical observations.
There is still much to be explored in this direction. First, though our smoothness condition relaxes the usual Lipschitz assumption, it is unclear whether there is a better condition that matches the experimental observations while also enabling a clean theoretical analysis. Second, we only studied the convergence of clipped gradient descent. Studying the convergence properties of other techniques such as momentum, coordinate-wise learning rates (more generally, preconditioning), and variance reduction is also interesting. Finally, the most important question is: “Can we design fast algorithms based on relaxed conditions and actually achieve faster convergence in neural network training?”
Our experiments also have notable implications. First, though advocating clipped gradient descent in ResNet training is not a main point of this work, it is interesting to note that gradient descent and clipped gradient descent with large step sizes can achieve test performance similar to that of momentum-SGD. Second, we learned that the performance of the baseline algorithm can actually beat some recently proposed algorithms. Therefore, when we design or learn about new algorithms, we need to pay extra attention to whether the baseline algorithms are properly tuned.
References
- Allen-Zhu  Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 18(1):8194–8244, 2017.
- Armijo  L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of mathematics, 16(1):1–3, 1966.
- Bach and Moulines  F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate $O(1/n)$. In Advances in Neural Information Processing Systems, pages 773–781, 2013.
- Beck and Teboulle  A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
- Carmon et al.  Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017.
- Carmon et al.  Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.
- Cho et al.  K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, Oct. 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D14-1179.
- Dai et al.  Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860, 2019. URL http://arxiv.org/abs/1901.02860.
- Defazio and Bottou  A. Defazio and L. Bottou. On the ineffectiveness of variance reduced optimization for deep learning. arXiv preprint arXiv:1812.04529, 2018.
- Defazio et al.  A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
- Duchi et al.  J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Fang et al.  C. Fang, C. J. Li, Z. Lin, and T. Zhang. Spider: Near-optimal non-convex optimization via stochastic path integrated differential estimator. arXiv preprint arXiv:1807.01695, 2018.
- Gehring et al.  J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. ArXiv e-prints, May 2017.
- Ghadimi and Lan  S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
- Ghadimi and Lan  S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
- Gong and Ye  P. Gong and J. Ye. Linear convergence of variance-reduced stochastic gradient without strong convexity. arXiv preprint arXiv:1406.1102, 2014.
- Goodfellow et al.  I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
- Hazan et al.  E. Hazan, K. Levy, and S. Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, pages 1594–1602, 2015.
- He et al.  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
- Hochreiter and Schmidhuber  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Jin et al.  C. Jin, P. Netrapalli, and M. I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference On Learning Theory, pages 1042–1085, 2018.
- Johnson and Zhang  R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
- Kawaguchi  K. Kawaguchi. Deep learning without poor local minima. In Advances in neural information processing systems, pages 586–594, 2016.
- Kingma and Ba  D. P. Kingma and J. Ba. ADAM: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Konečnỳ and Richtárik  J. Konečnỳ and P. Richtárik. Semi-stochastic gradient descent methods. arXiv:1312.1666, 2013.
- Krizhevsky and Hinton  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Levy  K. Y. Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.
- Li and Orabona  X. Li and F. Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. arXiv preprint arXiv:1805.08114, 2018.
- Lin et al.  H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.
- Merity et al.  S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyyGPP0TZ.
- Mikolov et al.  T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048, 2010. URL http://www.isca-speech.org/archive/interspeech_2010/i10_1045.html.
- Nesterov  Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
- Nesterov  Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
- Pascanu et al.  R. Pascanu, T. Mikolov, and Y. Bengio. Understanding the exploding gradient problem. CoRR, abs/1211.5063, 2, 2012.
- Pascanu et al.  R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, 2013.
- Peters et al.  M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
- Polyak  B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
- Polyak  B. T. Polyak. Introduction to Optimization. Optimization Software, Inc., Publications Division, New York, 1987.
- Reddi et al.  S. J. Reddi, S. Kale, and S. Kumar. On the convergence of ADAM and beyond. arXiv preprint arXiv:1904.09237, 2019.
- Santurkar et al.  S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pages 2483–2493, 2018.
- Schmidt et al.  M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162, 2017.
- Shalev-Shwartz and Zhang  S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72, 2014.
- Srivastava et al.  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.
- Staib et al.  M. Staib, S. J. Reddi, S. Kale, S. Kumar, and S. Sra. Escaping saddle points with adaptive gradient methods. arXiv preprint arXiv:1901.09149, 2019.
- Sundermeyer et al.  M. Sundermeyer, R. Schlüter, and H. Ney. LSTM neural networks for language modeling. In INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, September 9-13, 2012, pages 194–197, 2012. URL http://www.isca-speech.org/archive/interspeech_2012/i12_0194.html.
- Sutskever et al.  I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112, 2014. URL http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.
- Tieleman and Hinton  T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
- Wan et al.  L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using DropConnect. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1058–1066, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/wan13.html.
- Ward et al.  R. Ward, X. Wu, and L. Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.
- Xiao and Zhang  L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
- Young et al.  T. Young, D. Hazarika, S. Poria, and E. Cambria. Recent trends in deep learning based natural language processing. CoRR, abs/1708.02709, 2017. URL http://arxiv.org/abs/1708.02709.
- Zhou et al.  D. Zhou, Y. Tang, Z. Yang, Y. Cao, and Q. Gu. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.
- Zou et al.  F. Zou, L. Shen, Z. Jie, W. Zhang, and W. Liu. A sufficient condition for convergences of ADAM and RMSProp. arXiv preprint arXiv:1811.09358, 2018.
Appendix A More related work on accelerating gradient methods
Variance reduction. Many efforts have been made to accelerate gradient-based methods. One elegant approach is variance reduction (e.g., Schmidt et al., 2017; Johnson and Zhang, 2013; Defazio et al., 2014; Bach and Moulines, 2013; Konečnỳ and Richtárik, 2013; Xiao and Zhang, 2014; Gong and Ye, 2014; Fang et al., 2018). This technique aims to solve stochastic and finite-sum problems by averaging out the noise in the stochastic oracle, exploiting the smoothness of the objectives.
Momentum methods. Another line of work focuses on achieving acceleration with momentum. Polyak (1964) showed that momentum can accelerate optimization for quadratic problems; later, Nesterov (1983) designed a variation that provably accelerates any smooth convex problem. Based on Nesterov’s work, much theoretical progress was made to accelerate different variations of the original smooth convex problems (e.g., Ghadimi and Lan, 2016, 2012; Beck and Teboulle, 2009; Shalev-Shwartz and Zhang, 2014; Jin et al., 2018; Carmon et al., 2018; Allen-Zhu, 2017; Lin et al., 2015; Nesterov, 2012).
Adaptive step sizes. The idea of varying the step size in each iteration has long been studied. Armijo (1966) proposed the famous backtracking line search algorithm to choose the step size dynamically. Polyak (1987) proposed a strategy to choose the step size based on function suboptimality and gradient norm. More recently, Duchi et al. (2011) designed the Adagrad algorithm, which can exploit the sparsity in stochastic gradients. Recently, there has been a surge in studying the theoretical properties of adaptive gradient methods. One starting point is Reddi et al. (2019), which pointed out that ADAM can fail to converge and proposed the AMSGrad algorithm to fix the problem. Ward et al. (2018) and Li and Orabona (2018) prove that Adagrad converges to a stationary point for nonconvex stochastic problems. Zhou et al. (2018) generalized the result to a class of algorithms named Padam. Zou et al. (2018) give a sufficient condition for the convergence of ADAM. Staib et al. (2019) show that adaptive methods can escape saddle points faster than SGD under certain conditions. In addition, Levy (2016) showed that normalized gradient descent may have a better convergence rate in the presence of injected noise. However, the rate comparison is in a dimension-dependent setting. Hazan et al. (2015) studied the convergence of normalized gradient descent for quasi-convex functions.
Appendix B Proof of Theorem 3
We start by proving a lemma that is repeatedly used in later proofs. The lemma bounds the gradient in a neighborhood of the current point by Grönwall’s inequality.
Given such that , for any such that , we have .
Let be a curve defined below,
Then we have
By the Cauchy-Schwarz inequality, we get
The second inequality follows by Assumption 3. Then we can apply the integral form of Grönwall’s inequality and get
The Lemma follows by setting . ∎
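For reference, the integral form of Grönwall's inequality invoked in this proof can be stated as follows. This is the standard textbook statement, not the authors' own display; the symbols u, \alpha, \beta and the interval [a, b] are generic placeholders rather than notation from this paper.

```latex
% Integral form of Gr\"onwall's inequality (standard statement).
% Assumptions: u, \alpha, \beta are continuous on [a, b], \beta \ge 0,
% and \alpha is non-decreasing.
\[
  u(t) \;\le\; \alpha(t) + \int_a^t \beta(s)\, u(s)\, ds
  \quad \text{for all } t \in [a, b]
\]
\[
  \Longrightarrow \qquad
  u(t) \;\le\; \alpha(t)\, \exp\!\left( \int_a^t \beta(s)\, ds \right)
  \quad \text{for all } t \in [a, b].
\]
```

In a proof of this type, u would typically play the role of the gradient norm along the curve and \beta the role of a smoothness constant, but that mapping is an assumption about how the inequality is applied here.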
B.1 Proof of the theorem
We parameterize the path between and its updated iterate as follows:
Since , using Taylor’s theorem, the triangle inequality, and Cauchy-Schwarz, we obtain
we know by Lemma 9
Then by Assumption 3, we have
Therefore, as long as (which follows by our choice of ), we have
When , we have
When , we have
Assume that the algorithm does not terminate in iterations. Taking a telescoping sum, we get
Rearranging, we get
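The telescoping step follows a generic pattern, sketched below in LaTeX. The per-iteration decrease \delta is a hypothetical placeholder: the actual proof distinguishes two cases for the gradient norm, each with its own decrease, and those constants are not reproduced here.

```latex
% Generic telescoping argument. \delta > 0 is a hypothetical per-step
% decrease, f^* denotes the infimum of f, and T is the iteration count.
\[
  f(x_{k+1}) \;\le\; f(x_k) - \delta
  \quad \text{for } k = 0, 1, \dots, T-1
\]
\[
  \Longrightarrow \qquad
  T\,\delta \;\le\; \sum_{k=0}^{T-1} \bigl( f(x_k) - f(x_{k+1}) \bigr)
  \;=\; f(x_0) - f(x_T) \;\le\; f(x_0) - f^*
\]
\[
  \Longrightarrow \qquad
  T \;\le\; \frac{f(x_0) - f^*}{\delta}.
\]
```

When the two cases give different decreases, the same argument goes through with \delta taken as the smaller of the two.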
Appendix C Proof of Theorem 5
We prove a lower bound on the convergence rate of GD with a fixed step size. The high-level idea is that if GD is to converge for all functions satisfying the assumptions, then the step size must be small; this small step size, however, leads to very slow convergence on another function. We start with a function that grows exponentially. Let be fixed constants. Pick the initial point . Let the objective be defined as follows,
We notice that the function satisfies the assumptions with constants
When , we would have . By symmetry of the function and the super-linear growth of the gradient norm, we know that the iterates will diverge. Hence, in order for gradient descent with a fixed step size to converge, must be small enough. Formally,
Now consider a different objective that grows slowly.
This function is also twice differentiable and satisfies the assumptions with the constants in (9). If we set for some constant , we know that . With the step size choice , we know that in each step, . Therefore, for ,
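The trade-off behind this lower bound can be illustrated numerically. The sketch below is not the construction used in the proof: it swaps in the one-dimensional objective f(x) = cosh(x), which grows exponentially and satisfies |f''(x)| <= 1 + |f'(x)| (assuming the relaxed smoothness condition takes the form |f''| <= L0 + L1 |f'|, this corresponds to L0 = L1 = 1). The function `run_gd`, its parameters, and the clipping rule `min(step, clip / |g|)` are illustrative assumptions; the clipping rule is one common form of clipped GD.

```python
import math

def grad(x):
    # Derivative of the illustrative objective f(x) = cosh(x).
    return math.sinh(x)

def run_gd(x0, step, iters, clip=None, blowup=100.0):
    """Run (optionally clipped) gradient descent on f(x) = cosh(x).

    Returns (final_x, diverged). `clip` is a hypothetical threshold:
    when set, the move per iteration is capped at length `clip`.
    """
    x = x0
    for _ in range(iters):
        if abs(x) > blowup:  # stop before sinh overflows
            return x, True
        g = grad(x)
        h = step
        if clip is not None and abs(g) > 0.0:
            h = min(step, clip / abs(g))
        x -= h * g
    return x, abs(x) > blowup

# Large fixed step: the first update overshoots and the iterates blow up.
x_big, diverged_big = run_gd(10.0, 0.1, 200)
# Small fixed step: stable, but still far from the minimizer x = 0.
x_small, diverged_small = run_gd(10.0, 0.001, 200)
# Clipped GD with the large step: stable and close to the minimizer.
x_clip, diverged_clip = run_gd(10.0, 0.1, 200, clip=1.0)

print(diverged_big, diverged_small, diverged_clip)
print(abs(x_small), abs(x_clip))
```

In this toy setting, the fixed step has to be chosen for the steepest region near the initial point, which slows progress where the function is flat, whereas clipping shortens the step only where the gradient is large.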
Appendix D Proof of Theorem 7
We start by parametrizing the function value along the update,
Note that with this parametrization, we have . Now we would like to argue that if , then . Assume for contradiction that this is not true. Then there exists such that . Since can be made arbitrarily small below a threshold, we assume . Denote
The value exists by continuity of as a function of . Then we know by Assumption 4 that . However, by Taylor expansion, we know that
The last inequality follows by . Hence we get a contradiction and conclude that for all , . Therefore, following the above inequality and Assumption 3, we get
The conclusion follows by the same argument as in Theorem 3 via a telescoping sum over .
Appendix E Proof of Theorem 8
By the fact that
we know . Hence by Lemma 9, we know
Let be a filtration such that is generated by . Then after taking the expectation we get
Notice that . Inspired by the proof of Ward et al. (2018), we get by ,
We further prove in Lemma 11 that
We also notice that . Rearranging (11), we get
Furthermore, we know that
The last inequality follows by and Markov’s inequality. When , by telescoping the inequality (13), we get
The result follows by setting .
E.1 Technical lemma
The following inequality holds in the context of Theorem 8.
Since , we know that