Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.


## 1 Introduction

Stochastic first-order methods are the algorithms of choice for training deep networks, or more generally optimization problems of the form

. While vanilla stochastic gradient descent (SGD) is still the most popular such algorithm, there has been much recent interest in adaptive methods that change the learning rate of each parameter individually. This is a very old idea, e.g. [Jacobs, 1988]; modern variants such as Adagrad [Duchi et al., 2011; McMahan and Streeter, 2010], Adam [Kingma and Ba, 2014] and RMSProp [Tieleman and Hinton, 2012] are widely used in deep learning due to their good empirical performance.

$$x_{t+1} = x_t - G_t^{-1/2} g_t,$$

where $g_t$ is a noisy stochastic gradient at $x_t$ and $G_t = \sum_{s=1}^{t} g_s g_s^T$. More often, a diagonal version of Adagrad is used due to practical considerations, which effectively yields a per-parameter learning rate. In the convex setting, Adagrad achieves provably good performance, especially when the gradients are sparse. Although Adagrad works well in sparse convex settings, its performance appears to deteriorate in (dense) nonconvex settings. This degradation is often attributed to the rapid decay of the learning rate in Adagrad over time, which is a consequence of the rapid increase in the eigenvalues of the matrix $G_t$.
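To make the per-parameter learning rate concrete, the following minimal sketch implements one step of diagonal Adagrad in plain Python. The function name `adagrad_diag_step`, the explicit stepsize `eta`, and the small constant `eps` guarding against division by zero are assumptions added for illustration; they are not part of the update stated above.

```python
import math

def adagrad_diag_step(x, g, acc, eta=1.0, eps=1e-8):
    """One step of diagonal Adagrad (illustrative sketch).

    acc holds the running sum of squared gradients per coordinate,
    i.e. the diagonal of the accumulated second-moment matrix.
    """
    new_acc = [a + gi * gi for a, gi in zip(acc, g)]
    new_x = [xi - eta * gi / math.sqrt(a + eps)
             for xi, gi, a in zip(x, g, new_acc)]
    return new_x, new_acc

# With a constant gradient, the effective step shrinks like 1/sqrt(t).
x, acc = [0.0], [0.0]
steps = []
for _ in range(4):
    x_next, acc = adagrad_diag_step(x, [2.0], acc, eps=0.0)
    steps.append(x[0] - x_next[0])
    x = x_next
```

The shrinking entries of `steps` illustrate the learning-rate decay discussed above: the accumulated sum only grows, so the step taken for a gradient of fixed size keeps getting smaller.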

##### Contributions:

In addition to the aforementioned novel viewpoint, we also make the following key contributions:

• We develop a new approach for analyzing convergence of adaptive methods leveraging the preconditioner viewpoint and by way of disentangling estimation from the behavior of the idealized preconditioner.

• We provide second-order convergence results for adaptive methods, and as a byproduct, first-order convergence results. To the best of our knowledge, ours is the first work to show second order convergence for any adaptive method.

• We provide theoretical insights on how adaptive methods escape saddle points quickly. In particular, we show that the preconditioner used in adaptive methods leads to isotropic noise near stationary points, which helps escape saddle points faster.

• Our analysis also provides practical suggestions for tuning the exponential moving average parameter $\beta$.

### 1.1 Related work

There is an immense amount of work studying nonconvex optimization for machine learning, too much to discuss here in detail. We therefore only briefly discuss the two lines of work most relevant to our paper. First, recent work, e.g. [Chen et al., 2018; Reddi et al., 2018b; Zou et al., 2018], seeks to understand and give theoretical guarantees for adaptive methods such as Adam and RMSProp. Second, there are technical developments in using first-order algorithms to achieve nonconvex second-order convergence (see Definition 2.1), e.g. [Ge et al., 2015; Allen-Zhu and Li, 2018; Jin et al., 2017; Lee et al., 2016].

##### Nonconvex second order convergence of first order methods.

Starting with Ge et al. [2015], there has been a resurgence of interest in first-order algorithms that find second-order stationary points of nonconvex objectives, where the gradient is small and the Hessian is nearly positive semidefinite. Most other results in this space operate in the deterministic setting where we have exact gradients, with carefully injected isotropic noise to escape saddle points. Levy [2016] shows improved results for normalized gradient descent. Some algorithms rely on Hessian-vector products instead of pure gradient information, e.g. [Agarwal et al., 2017; Carmon et al., 2018]; it is possible to reduce Hessian-vector based algorithms to gradient algorithms [Xu et al., 2018; Allen-Zhu and Li, 2018]. Jin et al. [2017] improve the dependence on dimension to polylogarithmic. Mokhtari et al. [2018] work towards adapting these techniques for constrained optimization. Most relevant to our work is that of Daneshmand et al. [2018], who prove convergence of SGD with better rates than Ge et al. [2015]. Our work differs in that we provide second-order results for preconditioned SGD.

## 2 Notation and definitions

The objective function is $f$, and the gradient and Hessian of $f$ are $\nabla f$ and $\nabla^2 f$, respectively. Denote by $x_t$ the iterate at time $t$, by $g_t$ an unbiased stochastic gradient at $x_t$ and by $\nabla_t$ the expected gradient at $x_t$. The matrix $G_t$ refers to $\mathbb{E}[g_t g_t^T]$. Denote by $\lambda_{\max}(G)$ and $\lambda_{\min}(G)$ the largest and smallest eigenvalues of $G$, and $\kappa(G) = \lambda_{\max}(G)/\lambda_{\min}(G)$ is the condition number of $G$. For a vector $v$, its elementwise $p$-th power is written $v^p$. The objective has global minimizer $x^*$, and we write $f^* = f(x^*)$. The Euclidean norm of a vector $v$ is written as $\|v\|$, while for a matrix $M$, $\|M\|$ refers to the operator norm of $M$. The matrix $I$ is the identity matrix, whose dimension should be clear from context.

###### Definition 2.1 (Second-order stationary point).

A -stationary point of is a point so that and , where .

As is standard (e.g. Nesterov and Polyak [2006]), we will discuss only -stationary points, where is the Lipschitz constant of the Hessian.

## 3 The RMSProp Preconditioner

Recall that methods like Adam and RMSProp replace the running sum used in Adagrad with an exponential moving average (EMA) of the form $\hat{G}_t = \beta \hat{G}_{t-1} + (1-\beta) g_t g_t^T$; full-matrix RMSProp, for example, is described formally in Algorithm 2. One key observation is that $\hat{G}_t \approx G_t := \mathbb{E}[g_t g_t^T]$ if $\beta$ is chosen appropriately; in other words, at time $t$, the accumulated $\hat{G}_t$ can be seen as an approximation of the true second moment matrix $G_t$ at the current iterate. Thus, RMSProp can be viewed as preconditioned SGD (Algorithm 1) with the preconditioner being $A_t = G_t^{-1/2}$. In practice, it is too expensive (or even infeasible) to compute $G_t$ exactly since it requires summing over all training samples. Practical adaptive methods (see Algorithm 2) estimate this preconditioner (or a diagonal approximation) on-the-fly via an EMA.

Before developing our formal results, we will build intuition about the behavior of adaptive methods by studying an idealized adaptive method (IAM) with perfect access to $G_t$. In the rest of this section, we make use of idealized RMSProp to answer some simple questions about adaptive methods that we feel have not yet been addressed satisfactorily.
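As a rough illustration of the preconditioned-SGD viewpoint, the sketch below implements diagonal RMSProp with an EMA estimate of the second moment. It assumes the variant that divides by the square root of the estimate plus a small constant; the names `rmsprop_diag_step` and `v`, and the default hyperparameters, are hypothetical choices for this example and are not taken from Algorithm 2.

```python
import math

def rmsprop_diag_step(x, g, v, eta=0.001, beta=0.99, eps=1e-8):
    """One step of diagonal RMSProp (illustrative sketch).

    v is the EMA estimate of the elementwise second moment E[g^2];
    the step is preconditioned by (v + eps)^(-1/2).
    """
    v_new = [beta * vi + (1.0 - beta) * gi * gi for vi, gi in zip(v, g)]
    x_new = [xi - eta * gi / math.sqrt(vi + eps)
             for xi, gi, vi in zip(x, g, v_new)]
    return x_new, v_new

# With a constant gradient the EMA converges to g**2, so the step tends to eta.
x, v = [0.0], [0.0]
for _ in range(2000):
    x_prev = x
    x, v = rmsprop_diag_step(x, [2.0], v)
last_step = x_prev[0] - x[0]
```

Unlike a running sum, the EMA settles at the second moment of the gradient, so the effective stepsize stabilizes near `eta` instead of decaying to zero.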

### 3.1 What is the purpose of the preconditioner?

Why should preconditioning by $A = G^{-1/2}$ help optimization? The original Adam paper [Kingma and Ba, 2014] argues that Adam is an approximation to natural gradient descent, since if the objective $f$ is a log-likelihood, $G$ approximates the Fisher information matrix, which captures curvature information in the space of distributions. There are multiple issues with comparing adaptive methods to natural gradient descent, which we discuss in Appendix A. Instead, Balles and Hennig [2018] argue that the primary function of adaptive methods is to equalize the stochastic gradient noise in each direction. But it is still not clear why or how equalized noise should help optimization.

Our IAM abstraction makes it easy to explain precisely how rescaling the gradient noise helps. Specifically, we manipulate the update rule for idealized RMSProp:

$$
\begin{aligned}
x_{t+1} &\leftarrow x_t - \eta A_t g_t && (1)\\
&= x_t - \eta A_t \nabla_t - \eta \underbrace{A_t (g_t - \nabla_t)}_{=:\, \xi_t} && (2)
\end{aligned}
$$

The term $A_t \nabla_t$ is deterministic; only $\xi_t$ is stochastic, with mean $\mathbb{E}[\xi_t] = 0$. Take $A_t = G_t^{-1/2}$ and assume $G_t$ is invertible, so that $\xi_t = G_t^{-1/2}(g_t - \nabla_t)$. Now we can be more precise about how RMSProp rescales gradient noise. Specifically, we compute the covariance of the noise $\xi_t$:

$$\mathrm{Cov}(\xi_t) = I - G_t^{-1/2} \nabla_t \nabla_t^T G_t^{-1/2}. \tag{3}$$

The key insight is: near stationary points, $\nabla_t$ will be small, so that the noise covariance $\mathrm{Cov}(\xi_t)$ is approximately the identity matrix $I$. In other words, at stationary points, the gradient noise is approximately isotropic. This observation hints at why adaptive methods are so successful for nonconvex problems, where one of the main challenges is to escape saddle points [Reddi et al., 2018a]. Essentially all first-order approaches for escaping saddle points rely on adding carefully tuned isotropic noise, so that regardless of what the escape direction is, there is enough noise in that direction to escape with high probability.
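Equation (3) can be sanity-checked in one dimension. The sketch below assumes a toy noise model chosen purely for illustration: the stochastic gradient equals `grad + sigma` or `grad - sigma` with equal probability. It computes the variance of the rescaled noise exactly and compares it with the one-dimensional form of (3), namely 1 - grad^2 / G. The function name `noise_variance_1d` is hypothetical.

```python
import math

def noise_variance_1d(grad, sigma):
    """Exact variance of xi = G^{-1/2} (g - grad) under a two-point noise model."""
    samples = [grad + sigma, grad - sigma]            # equally likely gradients
    second_moment = sum(g * g for g in samples) / len(samples)   # G = E[g^2]
    xis = [(g - grad) / math.sqrt(second_moment) for g in samples]
    mean = sum(xis) / len(xis)
    variance = sum((xi - mean) ** 2 for xi in xis) / len(xis)
    predicted = 1.0 - grad * grad / second_moment     # 1-D version of equation (3)
    return variance, predicted

far = noise_variance_1d(3.0, 1.0)    # away from a stationary point
near = noise_variance_1d(0.0, 0.5)   # exactly at a stationary point
```

In this toy model the rescaled noise has unit variance at the stationary point regardless of `sigma`, while away from it the variance is strictly below one.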

### 3.2 [Reddi et al., 2018b] counterexample resolution

Recently, Reddi et al. [2018b] provided a simple convex stochastic counterexample on which RMSProp and Adam do not converge. Their reasoning is that RMSProp and Adam too quickly forget about large gradients from the past, in favor of small (but poor) gradients at the present. In contrast, for RMSProp with the idealized preconditioner (Algorithm 1 with $A = G^{-1/2}$), there is no issue, but the preconditioner cannot be computed in practice. Rather, for this example, the exponential moving average estimation scheme fails to adequately estimate the preconditioner.

The counterexample is an optimization problem of the form

$$\min_{x \in [-1,1]} F(x) = p f_1(x) + (1-p) f_2(x), \tag{4}$$

where the stochastic gradient oracle returns $\nabla f_1$ with probability $p$ and $\nabla f_2$ otherwise. Let $\zeta > 0$ be “small,” and $C > 0$ be “large.” Reddi et al. [2018b] set $f_1(x) = Cx$, $f_2(x) = -x$, and $p = (1+\zeta)/(C+1)$. Overall, then, $F(x) = \zeta x$, which is minimized at $x = -1$; however, Reddi et al. [2018b] show that RMSProp has $\mathbb{E}[F(x_t)] \geq 0$ and so incurs suboptimality gap at least $\zeta$. In contrast, the idealized preconditioner is a function of

$$\mathbb{E}[g^2] = p \left(\frac{\partial f_1}{\partial x}\right)^2 + (1-p) \left(\frac{\partial f_2}{\partial x}\right)^2 = C(1+\zeta) - \zeta,$$

which is a constant independent of $x$. Hence the preconditioner is constant, and, up to the choice of stepsize, idealized RMSProp on this problem is the same as SGD, which of course will converge.

The difficulty for practical adaptive methods (which estimate $\mathbb{E}[g^2]$ via an EMA) is that as $C$ grows, the variance of the estimate of $\mathbb{E}[g^2]$ grows too. Thus Reddi et al. [2018b] break Adam by making estimation of $\mathbb{E}[g^2]$ harder.
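The constant second moment can be checked with exact rational arithmetic. The sketch below assumes the parameter choices of the Reddi et al. [2018b] construction as we read it, namely linear losses with gradients $C$ and $-1$ and $p = (1+\zeta)/(C+1)$; the function name `counterexample_moments` is hypothetical.

```python
from fractions import Fraction

def counterexample_moments(C, zeta):
    """Mean and second moment of the stochastic gradient in the counterexample.

    Assumes f1(x) = C*x with probability p and f2(x) = -x otherwise,
    with p = (1 + zeta) / (C + 1); both gradients are independent of x.
    """
    C, zeta = Fraction(C), Fraction(zeta)
    p = (1 + zeta) / (C + 1)
    grad1, grad2 = C, Fraction(-1)
    mean_grad = p * grad1 + (1 - p) * grad2               # slope of F
    second_moment = p * grad1 ** 2 + (1 - p) * grad2 ** 2  # E[g^2]
    return p, mean_grad, second_moment

C, zeta = Fraction(10), Fraction(1, 10)
p, mean_grad, second_moment = counterexample_moments(C, zeta)
```

Under these assumptions the expected gradient equals $\zeta$ and the second moment equals $C(1+\zeta) - \zeta$ for every $x$, so an exact preconditioner is a fixed rescaling of the stepsize.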

## 4 Main Results: Gluing Estimation and Optimization

The key enabling insight of this paper is to separately study the preconditioner and its estimation via EMA, then combine these to give proofs for practical adaptive methods. We will prove a formal guarantee that the EMA estimate $\hat{G}_t$ is close to the true $G_t$. By combining our estimation results with the underlying behavior of the preconditioner, we will be able to give convergence proofs for practical adaptive methods that are constructed in a novel, modular way.

Separating these two components enables more general results: we actually analyze preconditioned SGD (Algorithm 1) with oracle access to an arbitrary preconditioner $A(x)$. Idealized RMSProp is but one particular instance. Our convergence results depend only on specific properties of the preconditioner $A(x)$, with which we can recover convergence results for many RMSProp variants simply by bounding the appropriate constants. For example, $A = (G^{1/2} + \varepsilon I)^{-1}$ corresponds to full-matrix Adam with $\beta_1 = 0$ or RMSProp as commonly implemented. For cleaner presentation, we instead focus on the variant $A = (G + \varepsilon I)^{-1/2}$, but our proof technique can handle either case or its diagonal approximation.
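The oracle view of preconditioned SGD can be sketched in a few lines. The version below is an illustration under simplifying assumptions, not a transcription of Algorithm 1: the preconditioner is diagonal and supplied as a callable, and the names `preconditioned_sgd`, `grad_oracle`, and `preconditioner` are hypothetical.

```python
def preconditioned_sgd(x0, grad_oracle, preconditioner, eta, num_steps):
    """Generic preconditioned SGD with a diagonal preconditioner (sketch).

    grad_oracle(x) returns a (possibly stochastic) gradient as a list;
    preconditioner(x) returns the diagonal entries of A(x) as a list.
    """
    x = list(x0)
    for _ in range(num_steps):
        g = grad_oracle(x)
        a = preconditioner(x)
        x = [xi - eta * ai * gi for xi, ai, gi in zip(x, a, g)]
    return x

# Toy quadratic f(x) = 0.5 * (x0**2 + 100 * x1**2) with exact gradients.
def quad_grad(x):
    return [x[0], 100.0 * x[1]]

def identity(x):
    return [1.0, 1.0]          # A = I recovers plain SGD

def inverse_curvature(x):
    return [1.0, 0.01]         # rescales the badly conditioned coordinate

sgd_iterate = preconditioned_sgd([1.0, 1.0], quad_grad, identity, 0.01, 1)
scaled_iterate = preconditioned_sgd([1.0, 1.0], quad_grad, inverse_curvature, 1.0, 1)
```

Swapping the `preconditioner` callable is the only change needed to move between SGD and a rescaled method, which mirrors how the analysis treats the preconditioner as an interchangeable component.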

### 4.1 Estimating from Moving Sequences

The above discussion about IAM is helpful for intuition, and as a base algorithm for analyzing convergence. But it remains to understand how well the estimation procedure works, both for intuition’s sake and for later use in a convergence proof. In this section we introduce an abstraction we name “estimation from moving sequences.” This abstraction will allow us to guarantee high quality estimates of the preconditioner, or, for that matter, any similarly constructed preconditioner. Our results will moreover make apparent how to choose the parameter $\beta$ in the exponential moving average: $\beta$ should increase as the stepsize $\eta$ decreases. Increasing $\beta$ over time has been supported both empirically [Shazeer and Stern, 2018] as well as theoretically [Mukkamala and Hein, 2017; Zou et al., 2018; Reddi et al., 2018b], though to our knowledge, the precise pinning of $\beta$ to the stepsize $\eta$ is new.

Suppose there is a sequence of states , e.g. the parameters of our model at each time step. We have access to the states , but more importantly we know the states are not changing too fast: is bounded for all . There is a Lipschitz function , which in our case is the second moment matrix of the stochastic gradients, but could be more general. We would like to estimate for each , but we have only a noisy oracle for , which we assume is unbiased and has bounded variance. Our goal is, given noisy reads of , to estimate at the current point as well as possible.

We consider estimators of the form . For example, setting and all others to zero would yield an unbiased (but high variance) estimate of . We could assign more mass to older samples , but this will introduce bias into the estimate. By optimizing this bias-variance tradeoff, we can get a good estimator. In particular, taking to be an exponential moving average (EMA) of will prioritize more recent and relevant estimates, while placing enough weight on old estimates to reduce the variance. The tradeoff is controlled by the EMA parameter $\beta$; e.g. if the sequence moves slowly (the stepsize $\eta$ is small), we will want large $\beta$ because older iterates are still very relevant.
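The bias side of this tradeoff is easy to see on a noise-free drifting sequence. The sketch below is a scalar illustration under assumptions of our own: it normalizes the EMA weights so they sum to one (a choice made here for clarity), and the name `ema_estimate` is hypothetical.

```python
def ema_estimate(samples, beta):
    """Normalized EMA of a non-empty sequence.

    Equivalent to sum_t w_t * y_t with weights w_t proportional to
    beta**(T - t), so the most recent sample gets the largest weight.
    """
    est, norm = 0.0, 0.0
    for y in samples:
        est = beta * est + (1.0 - beta) * y
        norm = beta * norm + (1.0 - beta)
    return est / norm

# beta = 0 keeps only the latest sample: no bias, no averaging.
last_only = ema_estimate([1.0, 2.0, 3.0], beta=0.0)

# On the drifting sequence y_t = t, a large beta lags behind the current value 99.
drifting = ema_estimate([float(t) for t in range(100)], beta=0.9)
```

With `beta=0.9` the estimate trails the current value by roughly `beta / (1 - beta) = 9`, which is the bias paid for averaging over about ten past samples; a slower-moving sequence would make the same `beta` much less biased.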

In adaptive methods, the underlying function we want to estimate is $G(x) = \mathbb{E}[g g^T]$ (or its diagonal $\mathbb{E}[g^2]$), and every stochastic gradient $g_t$ gives us an unbiased estimate $g_t g_t^T$ (resp. $g_t^2$) of $G(x_t)$. With this application in mind, we formalize our results in terms of matrix estimation. By combining standard matrix concentration inequalities (e.g. from Tropp [2015]) with bounds on how fast the sequence moves, we arrive at the following result, proved in Appendix F:

###### Theorem 4.1.

Assume . The function is matrix-valued and -Lipschitz. For each ,

is a random matrix with

, , and . Set with and assume . Then with probability , the estimation error is bounded by

$$\Phi \le O\!\left(\sigma_{\max} \sqrt{1-\beta}\, \sqrt{\log(d/\delta)} + \frac{M L \eta}{1-\beta}\right).$$

This is optimized by , for which the bound is as long as .
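To see how the bound in Theorem 4.1 ties the EMA parameter to the stepsize, here is a short worked calculation. It treats the constants hidden by the big-O as equal to one and introduces the shorthand $u$, $a$, $b$, which are notational assumptions made only for this sketch:

$$
\begin{aligned}
u &= 1-\beta, \qquad a = \sigma_{\max}\sqrt{\log(d/\delta)}, \qquad b = M L \eta,\\
\phi(u) &= a\sqrt{u} + \frac{b}{u}, \qquad \phi'(u) = \frac{a}{2\sqrt{u}} - \frac{b}{u^2} = 0 \;\Longrightarrow\; u^\star = \left(\frac{2b}{a}\right)^{2/3},\\
\phi(u^\star) &= 3 \cdot 2^{-2/3}\, a^{2/3}\, b^{1/3}.
\end{aligned}
$$

Under these assumptions the minimizing $1-\beta$ scales as $\eta^{2/3}$ and the resulting estimation error scales as $\eta^{1/3}$.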

As long as is sufficiently large, we can get a high quality estimate of . For this, it suffices to start off the underlying optimization algorithm with burn-in iterations where our estimate is updated but the algorithm is not started. This burn-in period will not affect asymptotic runtime as long as . In our non-convex convergence results we will require and , so that which is much smaller than . In practice, one can get away with much shorter (or no) burn-in period.

If is properly tuned, while running an adaptive method like RMSProp, we will get good estimates of from samples . However, we actually require a good estimate of and variants. To treat estimation in a unified way, we introduce estimable matrix sequences:

###### Definition 4.1.

A -estimable matrix sequence is a sequence of matrices generated from with so that with probability , after a burn-in of time , we can achieve an estimate sequence so that simultaneously for all times .

Applying Theorem 4.1 and union bounding over all time , we may state a concise result in terms of Definition 4.1:

###### Proposition 4.1.

Suppose is -Lipschitz as a function of . When applied to a generator sequence with and samples , the matrix sequence is -estimable with , , and .

We are hence guaranteed a good estimate of . What we actually want, though, is a good estimate of the preconditioner . In Appendix G we show how to bound the quality of an estimate of . One simple result is:

###### Proposition 4.2.

Suppose is -Lipschitz as a function of . Further suppose a uniform bound for all , with . When applied to a generator sequence with and samples , the matrix sequence is -estimable with , , and .

### 4.2 Convergence Results

We saw in the last two sections that it is simple to reason about adaptive methods via IAM, and that it is possible to compute a good estimate of the preconditioner. But we still need to glue the two together in order to get a convergence proof for practical adaptive methods.

In this section we will give non-convex convergence results, first for IAM and then for practical realizations thereof. We start with first-order convergence as a warm-up, and then move on to second-order convergence. In each case we give a bound for IAM, study it, and then give the corresponding bound for practical adaptive methods.

#### 4.2.1 Assumptions and notation

We want results for a wide variety of preconditioners , e.g. , the RMSProp preconditioner , and the diagonal version thereof, . To facilitate this and the future extension of our approach to other preconditioners, we give guarantees that hold for general preconditioners . Our bounds depend on via the following properties:

###### Definition 4.2.

We say is a -preconditioner if, for all , the following bounds hold. First, . Second, if is the quadratic approximation of at some point , we assume . Third, . Fourth, . Finally, .

Note that we could bound , but in practice and may be smaller, since they depend on the behavior of $A$ only in specific directions. In particular, if the preconditioner is well-aligned with the Hessian, as may be the case if the natural gradient approximation is valid, then would be very small. If $f$ is exactly quadratic, can be taken as a constant. The constant $\Gamma$ controls the magnitude of (rescaled) gradient noise, which affects stability at a local minimum. Finally, $\nu$ gives a lower bound on the amount of gradient noise in any direction; when $\nu$ is larger it is easier to escape saddle points (see footnote 2). For shorthand, a -preconditioner needs to satisfy only the corresponding inequalities.

Footnote 2: In cases where is rank deficient, e.g. in high-dimensional finite sum problems, lower bounds on should be understood as lower bounds on for escape directions from saddle points, analogous to the “CNC condition” from [Daneshmand et al., 2018].

In Appendix C we provide bounds on these constants for several variants of the second moment preconditioner. Below we highlight the two most relevant cases, corresponding to SGD and RMSProp:

###### Proposition 4.3.

The preconditioner is a -preconditioner, with , , , and .

###### Proposition 4.4.

The preconditioner is a -preconditioner, with

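$$\Lambda_1 = \ldots, \qquad \nu = \frac{\lambda_{\min}(G)}{\lambda_{\min}(G) + \varepsilon}, \qquad \text{and} \qquad \lambda_- = (\lambda_{\max}(G) + \varepsilon)^{-1/2}.$$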

#### 4.2.2 First-order convergence

Proofs are given in Appendix E. For all first-order results, we assume that $A$ is a -preconditioner. The proof technique is essentially standard, with minor changes in order to accommodate general preconditioners. First, suppose we have exact oracle access to the preconditioner:

###### Theorem 4.2.

Run preconditioned SGD with preconditioner and stepsize . For small enough , after iterations,

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right] \le \tau^2. \tag{5}$$

Now we consider an alternate version where instead of the preconditioner , we precondition by a noisy version that is close to , i.e. .

###### Theorem 4.3.

Suppose we have access to an inexact preconditioner , which satisfies for . Run preconditioned SGD with preconditioner and stepsize . For small enough , after iterations, we will have

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right] \le \tau^2. \tag{6}$$

The results are the same up to constants. In other words, as long as we can achieve less than error, we will converge at essentially the same rate as if we had the exact preconditioner. In light of this, for the second-order convergence results, we treat only the noisy version.

Theorem 4.3 gives a convergence bound assuming a good estimate of the preconditioner, and our estimation results guarantee a good estimate. By gluing together Theorem 4.3 with our estimation results for the RMSProp preconditioner, i.e. Proposition 4.2, we can give a convergence result for bona fide RMSProp:

###### Corollary 4.1.

Consider RMSProp with burn-in, as in Algorithm 3, where we estimate . Retain the same choice of and as in Theorem 4.3. For small enough , such a choice of will yield . Choose all other parameters e.g. in accordance with Proposition 4.2. In particular, choose for the burn-in parameter. Then with probability , in overall time , we achieve

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right] \le \tau^2. \tag{7}$$

#### 4.2.3 Second-order convergence

Now we leverage the power of our high-level approach to prove nonconvex second-order convergence for adaptive methods. Like the first-order results, we start by proving convergence bounds for a generic, possibly inexact preconditioner . Our proof is based on that of Daneshmand et al. [2018], though our study of the preconditioner is wholly new. Accordingly, we study the convergence of Algorithm 4, which is the same as Algorithm 1 (generic preconditioned SGD) except that once in a while we take a large stepsize so we may escape saddle points. The proof is given completely in Appendix D. At a high level, we show the algorithm makes progress when the gradient is large and when we are at a saddle point, and does not escape from local minima. Our analysis uses all the constants specified in Definition 4.2, e.g. the speed of escape from saddle points depends on $\nu$, the lower bound on stochastic gradient noise.
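The "occasional large stepsize" idea can be sketched as a small change to a preconditioned SGD loop. This is only an illustration under assumptions of our own, not a transcription of Algorithm 4: the preconditioner is diagonal, the large step is taken at iterations that are multiples of `period` (including the first), and the names `big_step`, `period`, and `preconditioned_sgd_with_kicks` are hypothetical.

```python
def preconditioned_sgd_with_kicks(x0, grad_oracle, preconditioner,
                                  eta, big_step, period, num_steps):
    """Preconditioned SGD that uses a large stepsize every `period` iterations."""
    x = list(x0)
    for t in range(num_steps):
        step = big_step if t % period == 0 else eta   # occasional large step
        g = grad_oracle(x)
        a = preconditioner(x)                         # diagonal of the preconditioner
        x = [xi - step * ai * gi for xi, ai, gi in zip(x, a, g)]
    return x

# Constant unit gradient and identity preconditioner: the displacement is the sum of steps.
def unit_grad(x):
    return [1.0]

def identity_1d(x):
    return [1.0]

final = preconditioned_sgd_with_kicks([0.0], unit_grad, identity_1d,
                                      eta=0.1, big_step=1.0, period=3, num_steps=6)
```

In this toy run the six steps are 1.0, 0.1, 0.1, 1.0, 0.1, 0.1, so the iterate moves by 2.4 in total; in the stochastic setting the large steps amplify the gradient noise that pushes the iterate away from a saddle point.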

Then, as before, we simply fuse our convergence guarantees with our estimation guarantees. The end result is, to our knowledge, the first nonconvex second-order convergence result for any adaptive method.

##### Definitions for second-order results.

Assume further that the Hessian is -Lipschitz and the preconditioner is -Lipschitz. The dependence on these constants is made more precise in the proof, in Appendix D. The usual stepsize is , while is the occasional large stepsize that happens every iterations. The constant is the small probability of failure we tolerate. For all results, we assume is a -preconditioner. For simplicity, we assume the noisy estimate also satisfies the inequality. We will also assume a uniform bound on .

The proofs rely on a few other quantities that we optimally determine as a function of the problem parameters: is a threshold on the function value progress, and is the time-amortized average of . We specify the precise values of all quantities in the proof.

###### Theorem 4.4.

Consider Algorithm 4 with inexact preconditioner and exact preconditioner satisfying the preceding requirements. Suppose that for all , we have . Then for small , with probability , we reach an -stationary point in time

 T=~O(Λ41Λ42Γ4λ10−ν4⋅L3ρδ3⋅τ−5). (8)

The big-O suppresses other constants given in the proof.

To prove a result for bona fide RMSProp, we need to combine Theorem 4.4 with an algorithm that maintains a good estimate of (and consequently ). This is more delicate than the first-order case because now the stepsize varies. Whenever we take a large stepsize, the estimation algorithm will need to hallucinate number of smaller steps in order to keep the estimate accurate. Our overall scheme is formalized in Appendix B, for which the following convergence result holds:

###### Corollary 4.2.

Consider the RMSProp version of Algorithm 4 that is described in Appendix B. Retain the same choice of , , and as in Theorem 4.4. For small enough , such a choice of will yield . Choose for the burn-in parameter. Choose , so that as far as the estimation scheme is concerned, the stepsize is bounded by . Then as before, with probability , we can reach an -stationary point in total time

 W+T=~O(Λ41Λ42Γ4λ10−ν4⋅L3ρδ3⋅τ−5), (9)

where are the constants describing .

Again, as in the first order results, one could substitute in any other estimable preconditioner, such as the more common diagonal version .

## 5 Discussion

Separating the estimation step from the preconditioning enables evaluation of different choices for the preconditioner.

### 5.1 How to set the regularization parameter ε

In the adaptive methods literature, it is still a mystery how to properly set the regularization parameter $\varepsilon$ that ensures invertibility of $G + \varepsilon I$. When the optimality tolerance $\tau$ is small enough, estimating the preconditioner is not the bottleneck. Thus, focusing only on the idealized case, one could just choose $\varepsilon$ to minimize the bound. Our first-order results depend on $\varepsilon$ only through the following term:

 (10)

where we have used the preconditioner bounds from Proposition 4.4. This is minimized by taking $\varepsilon \to \infty$, which suggests using the identity preconditioner, i.e. SGD. In contrast, for second-order convergence, the bound is

$$\frac{\Lambda_1^4 \Lambda_2^4 \Gamma^4}{\lambda_-^{10} \nu^4} \le d^4 \kappa(G)^4 (\lambda_{\max}(G) + \varepsilon), \tag{11}$$

which is instead minimized with $\varepsilon = 0$. So for the best second-order convergence rate, it is desirable to set $\varepsilon$ as small as possible. Note that since our bounds hold only for small enough convergence tolerance $\tau$, it is possible that the optimal $\varepsilon$ should depend in some way on $\tau$.

### 5.2 Comparison to SGD

Another important question we make progress towards is: when are adaptive methods better than SGD? Our second-order result depends on the preconditioner only through . Plugging in Proposition 4.3 for SGD, we may bound

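$$\frac{\Lambda_1^4 \Lambda_2^4 \Gamma^4}{\lambda_-^{10} \nu^4} \le \frac{\mathbb{E}[\|g\|^2]^4}{\lambda_{\min}(G)^4} \le d^4 \kappa(G)^4, \tag{12}$$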

while for full-matrix RMSProp, we have

$$\frac{\Lambda_1^4 \Lambda_2^4 \Gamma^4}{\lambda_-^{10} \nu^4} \le d^4 \kappa(G)^4 (\lambda_{\max}(G) + \varepsilon). \tag{13}$$

Setting for simplicity, we conclude that full-matrix RMSProp converges faster if .

Now suppose that for a given optimization problem, the preconditioner is well-aligned with the Hessian so that (e.g. if the natural gradient approximation holds) and that near saddle points the objective is essentially quadratic so that . In this regime, the preconditioner dependence of idealized full matrix RMSProp is , which yields a better result than SGD when . This will happen whenever is relatively small. Thus, when there is not much noise in the escape direction, and the Hessian and are not poorly aligned, RMSProp will converge faster overall.

### 5.3 Alternative preconditioners

Our analysis inspires the design of other preconditioners: e.g., if at each iteration we sample two independent stochastic gradients and , we have unbiased sample access to , which in expectation yields the covariance instead of the second moment matrix of . It immediately follows that we can prove second-order convergence results for an algorithm that constructs an exponential moving average estimate of and preconditions by , as advocated by Ida et al. [2017].
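The claim that two independent gradients give unbiased access to the covariance can be checked exactly in one dimension. The sketch below assumes the estimator is half the squared difference of the two samples, which is the standard choice but is an assumption here since the expression is not spelled out above; the function names are hypothetical.

```python
from itertools import product

def half_squared_difference_mean(values, probs):
    """E[(g - g')^2 / 2] for independent g, g' from the same discrete distribution."""
    return sum(p * q * (a - b) ** 2 / 2
               for (a, p), (b, q) in product(zip(values, probs), repeat=2))

def variance(values, probs):
    """Variance of a discrete distribution given its support and probabilities."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))

values, probs = [10.0, -1.0], [0.1, 0.9]
estimator_mean = half_squared_difference_mean(values, probs)
true_variance = variance(values, probs)
```

Both quantities come out to 10.89 for this two-point distribution, whereas the second moment is 10.9; the gap is the squared mean, which is exactly what a covariance-based preconditioner removes.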

### 5.4 Tuning the EMA parameter β

Another mystery of adaptive methods is how to set the exponential moving average (EMA) parameter $\beta$. In practice $\beta$ is typically set to a constant, e.g. 0.99, while other parameters such as the stepsize $\eta$ are tuned more carefully and may vary over time. While our estimation guarantee, Theorem 4.1, suggests setting $\beta = 1 - O(\eta^{2/3})$, the specific formula depends on constants that may be unknown, e.g. Lipschitz constants and gradient norms. Instead, one could set $\beta = 1 - C\eta^{2/3}$, and search for a good choice of the hyperparameter $C$. For example, the common initial choice of $\eta = 0.001$ and $\beta = 0.99$ corresponds to $C = 1$.
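A minimal sketch of such a schedule, assuming the recommendation takes the form $\beta = 1 - C\eta^{2/3}$ (consistent with the pairing of stepsize 0.001 and EMA parameter 0.99 mentioned above). The function name `ema_parameter` and the clipping at zero are assumptions added for this example.

```python
def ema_parameter(eta, C=1.0):
    """EMA parameter tied to the stepsize: beta = 1 - C * eta**(2/3), floored at 0."""
    beta = 1.0 - C * eta ** (2.0 / 3.0)
    return max(0.0, beta)

# As the stepsize decays, beta moves toward 1, so older gradients are kept longer.
betas = [ema_parameter(eta) for eta in (1e-2, 1e-3, 1e-4)]
```

With this rule only the single hyperparameter `C` needs tuning, and the EMA parameter then follows whatever stepsize schedule is in use.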

## 6 Experiments

We experimentally test our claims about adaptive methods escaping saddle points, and our suggestion for setting $\beta$.

First, we test our claim that when the gradient noise is ill-conditioned, adaptive methods escape saddle points faster than SGD, and often converge faster to (approximate) local minima. We construct a two-dimensional non-convex problem where (see footnote 3). Here, , so has a saddle point at the origin with objective value zero. The vectors are chosen so that sampling uniformly from yields and . Hence at the origin there is an escape direction but little gradient noise in that direction.

Footnote 3: The same phenomenon still holds in higher dimensions but the presentation is simpler with $d = 2$.

We initialize SGD and (diagonal) RMSProp (with ) at the saddle point and test several stepsizes for each. Results for the first iterations are shown in Figure 1. In order to escape the saddle point as fast as RMSProp, SGD requires a substantially larger stepsize, e.g. SGD needs to escape as fast as RMSProp does with . But with such a large stepsize, SGD cannot converge to a small neighborhood of the local minimum, and instead bounces around due to gradient noise. Since RMSProp can escape with a small stepsize, it can converge to a much smaller neighborhood of the local minimum. Overall, for any fixed final convergence criterion, RMSProp escapes faster and converges faster overall.

##### Setting the EMA parameter β.

Next, we test our recommendations regarding setting the EMA parameter $\beta$. We consider logistic regression on MNIST. We use (diagonal) RMSProp with batch size 100, decreasing stepsize and , and compare different schedules for $\beta$. Specifically we test (so that is spaced roughly logarithmically) as well as our recommendation of for . As shown in Figure 2, all options for $\beta$ have similar performance initially, but as the stepsize decreases, large $\beta$ yields substantially better performance. In particular, our decreasing schedule achieved the best performance, and moreover was insensitive to how $C$ was set.

#### Acknowledgements

This work was supported in part by the DARPA Lagrange grant, and an Amazon Research Award. We thank Nicolas Le Roux for helpful conversations.

## References

• Agarwal et al. [2017] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 1195–1199, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4528-6.
• Agarwal et al. [2018] Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, and Yi Zhang. The case for full-matrix adaptive regularization. arXiv preprint arXiv:1806.02958, 2018.
• Allen-Zhu and Li [2018] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3720–3730. Curran Associates, Inc., 2018.
• Balles and Hennig [2018] Lukas Balles and Philipp Hennig. Dissecting Adam: The sign, magnitude and variance of stochastic gradients. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 404–413, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
• Carmon et al. [2018] Y. Carmon, J. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018. doi: 10.1137/17M1114296.
• Chen et al. [2018] Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941, 2018.
• Daneshmand et al. [2018] Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1155–1164, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
• Duchi et al. [2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. ISSN 1532-4435.
• Ge et al. [2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 797–842, Paris, France, 03–06 Jul 2015. PMLR.
• Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
• Ida et al. [2017] Yasutoshi Ida, Yasuhiro Fujiwara, and Sotetsu Iwamura. Adaptive learning rate via covariance matrix based preconditioning for deep neural networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1923–1929, 2017.
• Jacobs [1988] Robert A Jacobs. Increased rates of convergence through learning rate adaptation. Neural networks, 1(4):295–307, 1988.
• Jin et al. [2017] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1724–1732, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
• Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
• Lee et al. [2016] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1246–1257, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
• Levy [2016] Kfir Y Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.
• McMahan and Streeter [2010] H. Brendan McMahan and Matthew J. Streeter. Adaptive bound optimization for online convex optimization. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 244–256, 2010.
• Mokhtari et al. [2018] Aryan Mokhtari, Asuman Ozdaglar, and Ali Jadbabaie. Escaping saddle points in constrained optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3633–3643. Curran Associates, Inc., 2018.
• Mukkamala and Hein [2017] Mahesh Chandra Mukkamala and Matthias Hein. Variants of RMSProp and Adagrad with logarithmic regret bounds. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2545–2553, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
• Nesterov [2013] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
• Nesterov and Polyak [2006] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
• Reddi et al. [2018a] Sashank Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, and Alex Smola. A generic approach for escaping saddle points. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1233–1242, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR.
• Reddi et al. [2018b] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
• Ruder [2016] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
• Shazeer and Stern [2018] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
• Tieleman and Hinton [2012] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
• Tropp [2015] Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015. ISSN 1935-8237. doi: 10.1561/2200000048.
• Ward et al. [2018] Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.
• Xu et al. [2018] Yi Xu, Jing Rong, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5531–5541. Curran Associates, Inc., 2018.
• Zaheer et al. [2018] Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems 31, 2018.
• Zhou et al. [2018] Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.
• Zou et al. [2018] Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A sufficient condition for convergences of Adam and RMSProp. arXiv preprint arXiv:1811.09358, 2018.

## Appendix A More Insights from Idealized Adaptive Methods (IAM)

Suppose for now that we have oracle access to $\mathbb{E}[g_t g_t^T]$. Why should preconditioning by $\mathbb{E}[g_t g_t^T]^{-1/2}$ help optimization? The original Adam paper [Kingma and Ba, 2014] argues that Adam is an approximation to natural gradient descent, since if the objective is a log-likelihood, $\mathbb{E}[g_t g_t^T]$ approximates the Fisher information matrix, which captures curvature information in the space of distributions. This connection is tenuous at best, since the approximation is only valid near optimality. Moreover, the exponent is wrong: Adam preconditions by $\mathbb{E}[g_t g_t^T]^{-1/2}$, but natural gradient should precondition by $\mathbb{E}[g_t g_t^T]^{-1}$. However, using the exponent $-1$ is reported in the literature as unstable, even for Adagrad: “without the square root operation, the algorithm performs much worse” [Ruder, 2016]. So the exponent is changed to $-1/2$ instead of $-1$.

Both of the above issues with the natural gradient interpretation are also pointed out in Balles and Hennig [2018], who argue that the primary function of adaptive methods is to equalize the stochastic gradient noise in each direction. But it is still not clear precisely why or how equalized noise should help optimization.

By assuming oracle access to $\mathbb{E}[g_t g_t^T]$, we can immediately argue that the exponent cannot be more aggressive than $-1/2$. Suppose we run preconditioned SGD with the preconditioner $\mathbb{E}[g_t g_t^T]^{-1}$ (instead of $\mathbb{E}[g_t g_t^T]^{-1/2}$ as in RMSProp), and apply this to a noiseless problem; that is, $g_t$ always equals the full gradient $\nabla_t$. The preconditioner is then

$$A_t = \left(\mathbb{E}[g_t g_t^T] + \varepsilon I\right)^{-1} = \left(\nabla_t \nabla_t^T + \varepsilon I\right)^{-1}. \tag{14}$$

Taking $\varepsilon \to 0$, the idealized RMSProp update approaches

$$x_{t+1} \leftarrow x_t - \eta \frac{\nabla_t}{\|\nabla_t\|^2}. \tag{15}$$

First, the actual descent direction is not changed, and curvature is totally absent. Second, the resulting algorithm is unstable unless $\eta$ decreases rapidly: as $x_t$ approaches a stationary point, the magnitude of the step grows arbitrarily large, making it impossible to converge without rapidly decreasing the stepsize.

By contrast, using the standard exponent $-1/2$ and taking $\varepsilon \to 0$ in the noiseless case yields normalized gradient descent:

$$x_{t+1} \leftarrow x_t - \eta \frac{\nabla_t}{\|\nabla_t\|}. \tag{16}$$
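Both limits can be checked directly. The following short derivation is a sketch that uses only the noiseless preconditioner in (14) and assumes $\nabla_t \neq 0$. The gradient is an eigenvector of the matrix being inverted, since $(\nabla_t \nabla_t^T + \varepsilon I)\nabla_t = (\|\nabla_t\|^2 + \varepsilon)\nabla_t$. Hence, for any exponent $p$,

$$\left(\nabla_t \nabla_t^T + \varepsilon I\right)^{p} \nabla_t = \left(\|\nabla_t\|^2 + \varepsilon\right)^{p} \nabla_t.$$

Setting $p = -1$ and $p = -1/2$ and letting $\varepsilon \to 0$ gives

$$\left(\nabla_t \nabla_t^T + \varepsilon I\right)^{-1} \nabla_t \to \frac{\nabla_t}{\|\nabla_t\|^2}, \qquad \left(\nabla_t \nabla_t^T + \varepsilon I\right)^{-1/2} \nabla_t \to \frac{\nabla_t}{\|\nabla_t\|},$$

which are the update directions for the exponent $-1$ and the exponent $-1/2$, respectively.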

In neither case do adaptive methods actually change the direction of descent (e.g. via curvature information); only the stepsize is changed.
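The contrast between the two exponents can also be seen numerically. The snippet below is a minimal sketch, not code from the paper: the function name `preconditioned_step`, the example gradient, and the value of `eps` are all illustrative assumptions. It forms the noiseless preconditioner $(\nabla_t \nabla_t^T + \varepsilon I)^{p}$ for a small gradient and applies it with $p = -1$ and $p = -1/2$.

```python
import numpy as np

def preconditioned_step(grad, eps, exponent):
    """Return (grad grad^T + eps * I)^exponent @ grad for a noiseless gradient.

    Hypothetical helper for illustration only.
    """
    d = grad.shape[0]
    second_moment = np.outer(grad, grad) + eps * np.eye(d)
    # The matrix is symmetric positive definite, so a real matrix power
    # can be taken through its eigendecomposition V diag(lambda^p) V^T.
    eigvals, eigvecs = np.linalg.eigh(second_moment)
    precond = (eigvecs * eigvals**exponent) @ eigvecs.T
    return precond @ grad

# Arbitrary small gradient (norm 5e-3), as if close to a stationary point.
grad = np.array([3e-3, -4e-3])
eps = 1e-10

step_inv = preconditioned_step(grad, eps, -1.0)    # exponent -1
step_sqrt = preconditioned_step(grad, eps, -0.5)   # exponent -1/2

norm_inv = np.linalg.norm(step_inv)    # close to 1 / ||grad||
norm_sqrt = np.linalg.norm(step_sqrt)  # close to 1
```

In this example the exponent $-1$ step has norm close to $1/\|\nabla_t\| = 200$ and keeps growing as the gradient shrinks, while the exponent $-1/2$ step stays at roughly unit norm. Both steps remain parallel to the gradient.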

## Appendix B Algorithm Details

Per our estimation results in Section 4.1, we must alter RMSProp to ensure it achieves an accurate estimate of the preconditioner. Namely, before updating the parameter, we need to burn in the estimate for several iterations so that the initial estimate is accurate. This subroutine is given in Algorithm 5.
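To make the burn-in idea concrete, the sketch below shows one way such a subroutine might look. It is not a reproduction of Algorithm 5: the full-matrix exponential moving average, the function names `burn_in_second_moment` and `noisy_quadratic_grad`, the decay `beta`, and the number of steps are all assumptions chosen for illustration. The point held fixed during burn-in is `x`, and only the second-moment estimate is updated.

```python
import numpy as np

def burn_in_second_moment(grad_oracle, x, beta, num_steps, rng):
    """Estimate E[g g^T] at a fixed point x before any parameter update.

    Hypothetical sketch: a full-matrix exponential moving average of
    outer products of stochastic gradients, with x held fixed.
    """
    d = x.shape[0]
    G = np.zeros((d, d))
    for _ in range(num_steps):
        g = grad_oracle(x, rng)  # fresh stochastic gradient at the same x
        G = beta * G + (1.0 - beta) * np.outer(g, g)
    return G

def noisy_quadratic_grad(x, rng):
    """Hypothetical oracle: gradient of 0.5 * ||x||^2 plus anisotropic noise."""
    noise = rng.normal(size=x.shape) * np.array([1.0, 2.0])
    return x + noise

rng = np.random.default_rng(0)
x0 = np.zeros(2)  # at x0 = 0 the true second moment is diag(1, 4)
G_hat = burn_in_second_moment(noisy_quadratic_grad, x0,
                              beta=0.999, num_steps=20000, rng=rng)
```

With enough burn-in steps the bias from the zero initialization becomes negligible, and in this toy setting `G_hat` should be roughly `diag(1, 4)`, the second moment of the noise. A smaller `beta` would track changes faster but give a noisier estimate.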

Later, when we prove second-order convergence, we need to modify RMSProp to occasionally take a large step. However, this complicates estimation: per Theorem 4.1, estimation quality deteriorates as the step size increases. Naively applying Theorem 4.1 to the large stepsize yields an estimate that is not accurate enough. To get around this, every time RMSProp takes a large step, we will hallucinate a number of smaller steps to feed into the estimation procedure. This is formalized in Algorithm 6. Overall, the variant of RMSProp we study is formalized in Algorithm 7.

## Appendix C Curvature and noise constants for different preconditioners

Our analysis for general preconditioners depends on constants , as well as that measure various properties of the preconditioner . For convenience, we reproduce the definition:

###### Definition C.1.

We say is a -preconditioner if, for all in the domain, the following bounds hold. First, . Second, if is the quadratic approximation of at some point , we assume . Third, . Fourth, . Finally, .

As before, we write throughout.

### C.1 Constants for identity preconditioner

In the simplest case, the preconditioner is the identity and we merely run SGD. We reproduce Proposition 4.3:

###### Proposition C.1.

The preconditioner is a -preconditioner, with , ,