AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization

06/05/2018 ∙ by Rachel Ward, et al. ∙ The University of Texas at Austin

Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune parameters such as the stepsize schedule. Yet, the theoretical guarantees to date for AdaGrad are for online and convex optimization, which is quite different from the offline and nonconvex setting where adaptive gradient methods shine in practice. We bridge this gap by providing strong theoretical guarantees, in both the batch and stochastic settings, for the convergence of AdaGrad over smooth, nonconvex landscapes, from any initialization of the stepsize, without knowledge of the Lipschitz constant of the gradient. We show in the stochastic setting that AdaGrad converges to a stationary point at the optimal O(1/√(N)) rate (up to a log(N) factor), and in the batch setting, at the optimal O(1/N) rate. Moreover, in both settings, the constant in the rate matches the constant obtained as if the variance of the gradient noise and the Lipschitz constant of the gradient were known in advance and used to tune the stepsize, up to a logarithmic factor of the mismatch between the optimal stepsize and the stepsize used to initialize AdaGrad. In particular, our results imply that AdaGrad is robust to both the unknown Lipschitz constant and the level of stochastic noise on the gradient, in a near-optimal sense. When there is noise, AdaGrad converges at the rate of O(1/√(N)), as if the stepsize had been well tuned, and when there is no noise, the same algorithm converges at the rate of O(1/N) like well-tuned batch gradient descent.


1 Introduction

Consider the problem of minimizing a differentiable non-convex function $F : \mathbb{R}^d \to \mathbb{R}$ via stochastic gradient descent (SGD); starting from $x_0 \in \mathbb{R}^d$ and a stepsize $\eta_0 > 0$, SGD iterates until convergence

$$x_{j+1} = x_j - \eta_j\, G(x_j), \qquad (1)$$

where $\eta_j$ is the stepsize at the $j$th iteration and $G(x_j)$ is the stochastic gradient, in the form of a random vector satisfying $\mathbb{E}[G(x_j)] = \nabla F(x_j)$ and having bounded variance. SGD is the de facto standard for deep learning optimization problems, or more generally, for the large-scale optimization problems where the loss function $F$ can be approximated by the average of a large number of component functions, $F(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x)$. It is more efficient to measure a single component gradient $\nabla f_i(x)$ (or a subset of component gradients) and move in the noisy direction $G(x)$ than to compute a full gradient $\nabla F(x)$.

For non-convex but smooth loss functions $F$, (noiseless) gradient descent (GD) with constant stepsize converges to a stationary point of $F$ at rate $O(1/N)$, where $N$ is the number of iterations [Nesterov, 1998]. In the same setting, and under the general assumption of bounded gradient noise variance, SGD with constant or decreasing stepsize has been proven to converge to a stationary point of $F$ at rate $O(1/\sqrt{N})$ [Ghadimi and Lan, 2013, Bottou et al., 2018]. The $O(1/N)$ rate for GD is the best possible worst-case dimension-free rate of convergence for any algorithm [Carmon et al., 2017a]; faster convergence rates in the noiseless setting are available under the mild assumption of additional smoothness [Agarwal et al., 2017, Carmon et al., 2017b]. In the noisy setting, rates faster than $O(1/\sqrt{N})$ are also possible using accelerated SGD methods [Ghadimi and Lan, 2016, Allen-Zhu and Yang, 2016, Reddi et al., 2016, Allen-Zhu, 2017, Xu et al., 2017, Carmon et al., 2018].

Instead of focusing on faster convergence rates for SGD, this paper focuses on adaptive stepsizes [Cutkosky and Boahen, 2017, Levy, 2017] that make the optimization algorithm more robust to (generally unknown) parameters of the optimization problem, such as the noise level of the stochastic gradient and the Lipschitz smoothness constant of the loss function, defined as the smallest number $L$ such that $\|\nabla F(x) - \nabla F(y)\| \le L\,\|x - y\|$ for all $x, y$. In particular, the convergence of GD with fixed stepsize $\eta$ is guaranteed only if the stepsize is carefully chosen so that $\eta < 2/L$; choosing a larger stepsize, even just by a factor of 2, can result in oscillation or divergence of the algorithm [Nesterov, 1998]. Because of this sensitivity, GD with fixed stepsize is rarely used in practice; instead, one adaptively chooses the stepsize at each iteration to approximately maximize the decrease of the loss function in the current direction of the negative gradient, via either approximate line search [Wright and Nocedal, 2006] or the Barzilai-Borwein rule [Barzilai and Borwein, 1988] combined with line search.
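To make the deterministic baseline concrete, the following is a minimal sketch of Armijo backtracking line search; it is a generic illustration with our own function names and constants, not an algorithm analyzed in this paper.

```python
import numpy as np

def backtracking_line_search(F, grad_F, x, eta0=1.0, shrink=0.5, c=1e-4):
    """Armijo backtracking: shrink the stepsize until a sufficient-decrease
    condition holds along the negative gradient direction."""
    g = grad_F(x)
    eta = eta0
    # Sufficient decrease: F(x - eta * g) <= F(x) - c * eta * ||g||^2
    while F(x - eta * g) > F(x) - c * eta * np.dot(g, g):
        eta *= shrink
    return eta

# Toy usage on a smooth quadratic F(x) = 0.5 * ||x||^2
F = lambda x: 0.5 * np.dot(x, x)
grad_F = lambda x: x
x = np.array([3.0, -4.0])
eta = backtracking_line_search(F, grad_F, x)
x_next = x - eta * grad_F(x)
```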

Unfortunately, in the noisy setting where one uses SGD for optimization, line search methods are not useful, as in this setting the stepsize should not be overfit to the noisy stochastic gradient direction at each iteration. The classical Robbins/Monro theory [Robbins and Monro, 1951] says that in order for SGD to converge, the stepsize schedule $\{\eta_j\}$ should satisfy

$$\sum_{j=1}^{\infty} \eta_j = \infty \qquad \text{and} \qquad \sum_{j=1}^{\infty} \eta_j^2 < \infty. \qquad (2)$$
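For example, among polynomial schedules $\eta_j = c/j^{\alpha}$ with $c > 0$, both conditions in (2) hold exactly when $1/2 < \alpha \le 1$: the series $\sum_j j^{-\alpha}$ diverges if and only if $\alpha \le 1$, while $\sum_j j^{-2\alpha}$ converges if and only if $\alpha > 1/2$.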

However, these bounds do not tell us much about how to select a good stepsize schedule in practice, where algorithms are run for finitely many iterations and the constants in the rate of convergence matter.

The question of how to choose the stepsize (or learning rate) schedule for SGD is by no means resolved; in practice, a preferred schedule is often chosen manually by testing many different schedules in advance and picking the one leading to the smallest training or generalization error. This process can take days or weeks, and can become prohibitively expensive in terms of the time and computational resources incurred.

1.1 Stepsize Adaptation with AdaGrad-Norm

Adaptive stochastic gradient methods such as AdaGrad (introduced independently by Duchi et al. [2011] and McMahan and Streeter [2010]) have been widely used in the past few years. AdaGrad updates the stepsize on the fly using information from all previous (noisy) gradients observed along the way. The most common variant of AdaGrad updates an entire vector of per-coefficient stepsizes [Lafond et al., 2017]. To be concrete, for optimizing a function $F : \mathbb{R}^d \to \mathbb{R}$, the "coordinate" version of AdaGrad updates $d$ scalar parameters $b_{j,k}$ at the $j$th iteration – one for each coordinate $k$ of $x_j \in \mathbb{R}^d$ – according to $b_{j+1,k}^2 = b_{j,k}^2 + \big(\nabla F(x_j)\big)_k^2$ in the noiseless setting, and $b_{j+1,k}^2 = b_{j,k}^2 + \big(G(x_j)\big)_k^2$ in the noisy gradient setting; each coordinate is then updated with its own stepsize, $x_{j+1,k} = x_{j,k} - \frac{\eta}{b_{j+1,k}}\, G(x_j)_k$. This per-coordinate scaling makes AdaGrad a variable metric method, and it has been the object of recent criticism for machine learning applications [Wilson et al., 2017].
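As a concrete illustration, a minimal sketch of one step of this per-coordinate update in the noisy setting might look as follows (the variable names are ours, not the paper's):

```python
import numpy as np

def adagrad_coordinate_step(x, g, b_sq, eta=0.1):
    """One step of 'coordinate' AdaGrad: each coordinate k keeps its own
    accumulator b_sq[k] of squared (stochastic) partial derivatives and
    therefore its own effective stepsize eta / b_{j+1,k}."""
    b_sq = b_sq + g ** 2              # b_{j+1,k}^2 = b_{j,k}^2 + G(x_j)_k^2
    x = x - eta * g / np.sqrt(b_sq)   # per-coordinate stepsize eta / b_{j+1,k}
    return x, b_sq

# Example with d = 3; start the accumulator at a small positive value.
x = np.zeros(3)
b_sq = np.full(3, 1e-8)
g = np.array([0.5, -1.0, 2.0])  # a (stochastic) gradient at x
x, b_sq = adagrad_coordinate_step(x, g, b_sq)
```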

One can also consider a variant of AdaGrad which updates only a single (scalar) stepsize according to the sum of squared gradient norms observed so far. In this work, we focus instead on this "norm" version of AdaGrad, a single-stepsize adaptation method using gradient norm information, which we call AdaGrad-Norm. The update in the stochastic setting is as follows: initialize a single scalar $b_0 > 0$; at the $j$th iteration, observe the random variable $G(x_j)$ such that $\mathbb{E}[G(x_j)] = \nabla F(x_j)$ and iterate

$$b_{j+1}^2 = b_j^2 + \|G(x_j)\|^2, \qquad x_{j+1} = x_j - \frac{\eta}{b_{j+1}}\, G(x_j),$$

where $\eta > 0$ is fixed to ensure homogeneity and that the units match. It is straightforward that, in expectation, $b_j^2$ grows linearly in $j$; thus, under the assumptions of a uniformly bounded gradient, $\|\nabla F(x)\| \le \gamma$, and uniformly bounded variance $\sigma^2$, the stepsize will eventually decay roughly according to $\frac{\eta}{b_j} \sim \frac{1}{\sqrt{j\,(\gamma^2 + \sigma^2)}}$. This stepsize schedule matches the $O(1/\sqrt{j})$ schedule which leads to optimal rates of convergence for SGD in the case of convex but not necessarily smooth functions, as well as smooth but not necessarily convex functions (see, for instance, Agarwal et al. [2009] and Bubeck et al. [2015]). This observation suggests that AdaGrad-Norm should be able to achieve $O(1/\sqrt{N})$ convergence rates for SGD, but without having to know the Lipschitz smoothness parameter $L$ of $F$ and the noise parameter $\sigma$ a priori to set the stepsize schedule.
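To spell out the heuristic: by unbiasedness and the bias-variance decomposition, $\mathbb{E}\big[\|G(x_j)\|^2\big] = \|\nabla F(x_j)\|^2 + \mathbb{E}\big[\|G(x_j) - \nabla F(x_j)\|^2\big] \le \gamma^2 + \sigma^2$ under the two boundedness assumptions, so that

$$\mathbb{E}\big[b_N^2\big] \le b_0^2 + N(\gamma^2 + \sigma^2), \qquad \text{and, informally,} \qquad \frac{\eta}{b_N} \gtrsim \frac{\eta}{\sqrt{b_0^2 + N(\gamma^2 + \sigma^2)}} \sim \frac{1}{\sqrt{N}}.$$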

Theoretically rigorous convergence results for AdaGrad-Norm were provided recently in the convex setting [Levy, 2017]. Moreover, it is possible to obtain convergence rates in the offline setting by online-to-batch conversion. However, making such observations rigorous for nonconvex functions is difficult because the stepsize $\eta / b_j$ is itself a random variable which is correlated with the current and all previous noisy gradients; thus, the standard proofs for SGD do not straightforwardly extend to proofs for AdaGrad-Norm. This paper provides such a proof for AdaGrad-Norm.

1.2 Main contributions

Our results make rigorous and precise the observed phenomenon that the convergence behavior of AdaGrad-Norm is highly adaptable to the unknown Lipschitz smoothness constant and level of stochastic noise on the gradient: when there is noise, AdaGrad-Norm converges at the rate of $O(\log(N)/\sqrt{N})$, and when there is no noise, the same algorithm converges at the optimal $O(1/N)$ rate like well-tuned batch gradient descent. Moreover, our analysis shows that AdaGrad-Norm converges at these rates for any choice of the algorithm hyperparameters $b_0 > 0$ and $\eta > 0$, in contrast to GD or SGD with fixed stepsize where, if the stepsize is set above a hard upper threshold governed by the (generally unknown) smoothness constant $L$, the algorithm might not converge at all. Finally, we note that the constants in the rates of convergence we provide are explicit in terms of their dependence on the hyperparameters $\eta$ and $b_0$. We list our two main theorems (informally) in the following:

  • For a differentiable non-convex function $F$ with $L$-Lipschitz gradient and $F^* = \inf_x F(x) > -\infty$, Theorem 2.1 implies that AdaGrad-Norm converges to an $\varepsilon$-approximate stationary point with high probability¹ at the rate $\min_{0 \le j \le N-1} \|\nabla F(x_j)\|^2 = O\!\big(\log(N)/\sqrt{N}\big)$.

    ¹It is becoming common to define an $\varepsilon$-approximate stationary point as a point $x$ with $\|\nabla F(x)\| \le \varepsilon$ [Agarwal et al., 2017, Carmon et al., 2018, 2017a, Fang et al., 2018, Zhou et al., 2018, Allen-Zhu, 2018], but we use the convention $\|\nabla F(x)\|^2 \le \varepsilon$ [Lei et al., 2017, Bottou et al., 2018] to most easily compare our results to those from Ghadimi and Lan [2013], Li and Orabona [2019].

    If the optimal value $F^*$ of the loss function is known and one sets $\eta$ accordingly, then the constant in our rate is close to the best-known constant achievable for SGD with fixed stepsize carefully tuned using knowledge of $L$ and $\sigma$, as given in Ghadimi and Lan [2013]. However, our result requires a uniformly bounded gradient, and our rate constant has a worse dependence on the noise level than the linear dependence on $\sigma$ achieved by well-tuned SGD. Nevertheless, our result suggests a good strategy for setting the hyperparameters when implementing AdaGrad-Norm in practice: given knowledge of $F^*$, set $\eta$ accordingly and simply initialize $b_0$ to be very small.

  • When there is no noise ($\sigma = 0$), we can improve this rate to an $O(1/N)$ rate of convergence. In Theorem 2.2, we show that $\min_{0 \le j \le N-1} \|\nabla F(x_j)\|^2 \le \varepsilon$ after at most $N = O(1/\varepsilon)$ iterations, with an explicit constant that takes
    (1) one form if $b_0$ is initialized at or above the order of $\eta L$, and
    (2) a slightly worse form if $b_0$ is initialized below it.
    Note that the constant in the second case, when $b_0$ is small, is not optimal compared to the best known rate constant obtainable by gradient descent with fixed stepsize [Carmon et al., 2017a]; on the other hand, given knowledge of $L$ and $F^*$, the rate constant of AdaGrad-Norm reproduces the optimal constant by setting $b_0$ and $\eta$ appropriately in terms of these quantities.

Practically, our results imply a good strategy for setting the hyperparameters when implementing AdaGrad-Norm in practice: set $\eta$ using knowledge of $F^*$ (assuming $F^*$ is known) and set $b_0$ to be a very small value. If $F^*$ is unknown, then a moderate default value of $\eta$ should work well for a wide range of values of $b_0$, both in the noiseless case and in the noisy case with $\sigma$ strictly greater than zero.

1.3 Previous work

Theoretical guarantees of convergence for AdaGrad were provided in Duchi et al. [2011] in the setting of online convex optimization, where the loss function may change from iteration to iteration and be chosen adversarially. AdaGrad was subsequently observed to be effective for accelerating convergence in the nonconvex setting, and has become a popular algorithm for optimization in deep learning problems. Many modifications of AdaGrad, with or without momentum, have been proposed, namely, RMSprop [Srivastava and Swersky, 2012], AdaDelta [Zeiler, 2012], Adam [Kingma and Ba, 2015], AdaFTRL [Orabona and Pal, 2015], SGD-BB [Tan et al., 2016], AdaBatch [Defossez and Bach, 2017], SC-Adagrad [Mukkamala and Hein, 2017], AMSGRAD [Reddi et al., 2018], Padam [Chen and Gu, 2018], etc. Extending our convergence analysis to these popular alternative adaptive gradient methods remains an interesting problem for future research.

Regarding convergence guarantees for the norm version of adaptive gradient methods in the offline setting, the recent work by Levy [2017] introduces a family of adaptive gradient methods inspired by AdaGrad, and proves convergence rates in the setting of (strongly) convex loss functions without knowing the smoothness parameter in advance. Yet, that analysis still requires a priori knowledge of a convex set with known diameter in which the global minimizer resides. More recently, Wu et al. [2018] provides convergence guarantees in the non-convex setting for a different adaptive gradient algorithm, WNGrad, which is closely related to AdaGrad-Norm and inspired by weight normalization [Salimans and Kingma, 2016]. In fact, the WNGrad stepsize update is similar to AdaGrad-Norm's:

$$b_{j+1} = b_j + \frac{\|\nabla F(x_j)\|^2}{b_j}.$$

However, the guaranteed convergence in Wu et al. [2018] is only for the batch setting, and the constant in the convergence rate is worse than the one provided here for AdaGrad-Norm. Independently, Li and Orabona [2019] also prove an $O(1/\sqrt{N})$ convergence rate for a variant of AdaGrad-Norm in the non-convex stochastic setting, but their analysis requires knowledge of the smoothness constant $L$ and a hard threshold on the stepsize for their convergence. In contrast to Li and Orabona [2019], we do not require knowledge of the Lipschitz smoothness constant $L$, but we do assume that the gradient is uniformly bounded by some (unknown) finite value, while Li and Orabona [2019] only assume bounded variance.

1.4 Future work

This paper provides convergence guarantees for AdaGrad-Norm over smooth, nonconvex functions, in both the stochastic and deterministic settings. Our theorems should shed light on the popularity of AdaGrad as a method for more robust convergence of SGD in nonconvex optimization, in that the convergence guarantees we provide are robust to the initialization $b_0$ and adjust automatically to the level of stochastic noise. Moreover, our results suggest a good strategy for setting the hyperparameters in an AdaGrad-Norm implementation: set $\eta$ using knowledge of $F^*$ (if $F^*$ is known) and set $b_0$ to be a very small value. However, several improvements and extensions should be possible. First, the constant in the convergence rate we present can likely be improved, and it remains open whether we can remove the assumption of a uniformly bounded gradient in the stochastic setting. It would also be interesting to analyze AdaGrad in its coordinate form, where each coordinate of $x_j$ has its own stepsize which is updated according to the accumulated squares of the corresponding partial derivatives. AdaGrad is just one particular adaptive stepsize method, and other updates such as Adam [Kingma and Ba, 2015] are often preferable in practice; it would be nice to have similar theorems for other adaptive gradient methods, and even to use the theory as a guide for determining the "best" method for adapting the stepsize for given problem classes.

1.5 Notation

Throughout, $\|\cdot\|$ denotes the Euclidean norm. We use the notation $F^* := \inf_x F(x)$. A function $F$ has $L$-Lipschitz smooth gradient if

$$\|\nabla F(x) - \nabla F(y)\| \le L\,\|x - y\|, \qquad \forall\, x, y. \qquad (3)$$

We refer to $L$ as the smoothness constant for $F$ if $L$ is the smallest number such that the above is satisfied.

2 AdaGrad-Norm Convergence

To be clear about the adaptive algorithm, we first state in Algorithm 1 the norm version of AdaGrad that we consider throughout the analysis.

1:Input: Initialize $x_0 \in \mathbb{R}^d$, $b_0 > 0$, $\eta > 0$, and the total number of iterations $N$
2:for $j = 0, 1, 2, \ldots, N - 1$ do
3:     Generate $\xi_j$ and $G_j = G(x_j, \xi_j)$
4:     $b_{j+1}^2 \leftarrow b_j^2 + \|G_j\|^2$
5:     $x_{j+1} \leftarrow x_j - \frac{\eta}{b_{j+1}}\, G_j$
6:end for
Algorithm 1 AdaGrad-Norm
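For readers who prefer code, here is a minimal, self-contained sketch of Algorithm 1 applied to a toy stochastic least-squares problem; the toy objective, noise model, and all variable names are ours, chosen purely for illustration.

```python
import numpy as np

def adagrad_norm(grad_fn, x0, b0=1e-2, eta=1.0, num_iters=5000, seed=0):
    """AdaGrad-Norm (Algorithm 1): a single scalar stepsize eta / b_{j+1}, where
    b_{j+1}^2 accumulates the squared norms of all observed stochastic gradients."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    b_sq = b0 ** 2
    for _ in range(num_iters):
        g = grad_fn(x, rng)                 # stochastic gradient G(x_j, xi_j)
        b_sq += np.dot(g, g)                # b_{j+1}^2 = b_j^2 + ||G_j||^2
        x -= (eta / np.sqrt(b_sq)) * g      # x_{j+1} = x_j - (eta / b_{j+1}) G_j
    return x

# Toy problem: F(x) = 0.5 * ||A x - y||^2, observed through noisy gradients.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
y = np.array([1.0, -1.0])

def noisy_grad(x, rng, sigma=0.1):
    return A.T @ (A @ x - y) + sigma * rng.standard_normal(x.shape)

x_hat = adagrad_norm(noisy_grad, x0=[5.0, 5.0], b0=1e-2, eta=1.0)
```

The only tuning knobs are $\eta$ and $b_0$; per the theorems below, the iteration converges for any positive choice of either.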

At the $j$th iteration, we observe a stochastic gradient $G(x_j, \xi_j)$, where the $\xi_j$ are random variables, such that $\mathbb{E}_{\xi_j}[G(x_j, \xi_j)] = \nabla F(x_j)$, i.e., $G(x_j, \xi_j)$ is an unbiased estimator of $\nabla F(x_j)$. (Here $\mathbb{E}_{\xi_j}$ denotes the expectation with respect to $\xi_j$, conditional on the previous $\xi_0, \ldots, \xi_{j-1}$.) We require the following additional assumptions: for each $j$,

  1. the random vectors $\xi_j$, $j = 0, 1, 2, \ldots$, are independent of each other and also of $x_j$;

  2. $\mathbb{E}_{\xi_j}\big[\|G(x_j, \xi_j) - \nabla F(x_j)\|^2\big] \le \sigma^2$;

  3. $\|\nabla F(x)\| \le \gamma$ uniformly in $x$.

The first two assumptions are standard (see e.g. Nemirovski and Yudin [1983], Nemirovski et al. [2009], Bottou et al. [2018]). The third assumption is somewhat restrictive, as it rules out strongly convex objectives, but it is not an unreasonable assumption for AdaGrad-Norm, where the adaptive stepsize is determined by a cumulative sum of all previously observed squared gradient norms.
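For instance, the strongly convex quadratic $F(x) = \frac{\mu}{2}\|x\|^2$ has $\|\nabla F(x)\| = \mu\|x\|$, which grows without bound as $\|x\| \to \infty$, so no finite $\gamma$ can bound its gradient uniformly.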

Because of the variance of the stochastic gradient, the AdaGrad-Norm stepsize $\eta / b_j$ decreases to zero roughly at the rate $1/\sqrt{j}$. It is known that a stepsize decreasing at this rate is optimal, in terms of the resulting convergence theorems, in the setting of smooth but not necessarily convex $F$, as well as convex but not necessarily strongly convex or smooth $F$ [Levy, 2017]. Still, standard convergence theorems for SGD do not extend straightforwardly to AdaGrad-Norm, because the stepsize $\eta / b_{j+1}$ is a random variable that depends on all previous points visited along the way, $x_0, x_1, \ldots, x_j$. From this point on, we use the shorthand $\nabla F_j := \nabla F(x_j)$ and $G_j := G(x_j, \xi_j)$ for simplicity of notation. The following theorem gives the convergence guarantee for Algorithm 1. We give the detailed proof in Section 3.

Theorem 2.1 (AdaGrad-Norm: convergence in stochastic setting)

Suppose that $F$ has $L$-Lipschitz gradient and that $F^* = \inf_x F(x) > -\infty$. Suppose that the random variables $\xi_j$, $j = 0, 1, 2, \ldots$, satisfy the assumptions above. Then, with probability $1 - \delta$,

$$\min_{0 \le j \le N-1} \|\nabla F(x_j)\|^2 = O\!\left(\frac{\log N}{\sqrt{N}}\right),$$

where the implicit constant depends explicitly on $L$, $\sigma$, $\gamma$, $\delta$, $F(x_0) - F^*$, and the hyperparameters $\eta$ and $b_0$.

This result implies that AdaGrad-Norm converges for any $\eta > 0$, starting from any value of $b_0 > 0$. To put this result in context, we can compare to Corollary 2.2 of Ghadimi and Lan [2013], which gives the best-known convergence rate for SGD with fixed stepsize in the same setting (albeit not requiring Assumption (3) of uniformly bounded gradient): if the Lipschitz smoothness constant $L$ and the variance $\sigma^2$ are known a priori, and the fixed stepsize in SGD is set to $\eta = \min\{1/L,\ \tilde{D}/(\sigma\sqrt{N})\}$ for an appropriate constant $\tilde{D}$, then with probability $1 - \delta$,

$$\min_{0 \le j \le N-1} \|\nabla F(x_j)\|^2 = O\!\left(\frac{L\,\big(F(x_0) - F^*\big)}{N} + \frac{\sigma}{\sqrt{N}}\right),$$

with the implicit constant depending on $\delta$.

We match the $O(1/\sqrt{N})$ rate of Ghadimi and Lan [2013], but without a priori knowledge of $L$ and $\sigma$, and with a worse constant in the rate of convergence. In particular, our rate constant has a worse dependence on the noise level (up to logarithmic factors in $N$) than the result for SGD with well-tuned fixed stepsize, which scales linearly with $\sigma$. The additional logarithmic factor (by Lemma 3.2) results from the AdaGrad-Norm update using the squared norm of the gradient (see inequality (11) for details). The extra constant results from the correlation between the stepsize $\eta / b_{j+1}$ and the gradient $G_j$. We note that the recent work of Li and Orabona [2019] derives an $O(1/\sqrt{N})$ rate for a variation of AdaGrad-Norm without the assumption of a uniformly bounded gradient, but at the same time requires a priori knowledge of the smoothness constant $L$ in setting the stepsize in order to establish convergence, similar to SGD with fixed stepsize. Finally, we note that recent works [Allen-Zhu, 2017, Lei et al., 2017, Fang et al., 2018, Zhou et al., 2018] provide modified SGD algorithms with convergence rates faster than $O(1/\sqrt{N})$, albeit again requiring a priori knowledge of both $L$ and $\sigma$ to establish convergence.

We reiterate, however, that the main emphasis of Theorem 2.1 is the robustness of AdaGrad-Norm's convergence to its hyperparameters $\eta$ and $b_0$, compared to plain SGD's sensitive dependence on its stepsize through the problem parameters $L$ and $\sigma$. Although the constant in the rate of our theorem is not as good as the best-known constant for stochastic gradient descent with a well-tuned fixed stepsize, our result suggests that implementing AdaGrad-Norm allows one to vastly reduce the laborious experimentation otherwise needed to find a stepsize schedule with reasonable convergence when implementing SGD in practice.

We note that, for the second bound in Theorem 2.1, in the limit as $\sigma \to 0$ we recover an $O(1/N)$ rate of convergence, matching noiseless gradient descent. We can establish a stronger result in the noiseless setting using a different method of proof, removing both the additional log factor and Assumption 3 of uniformly bounded gradient. We state the theorem below and defer its proof to Section 4.

Theorem 2.2 (AdaGrad-Norm: convergence in deterministic setting)

Suppose that $F$ has $L$-Lipschitz gradient and that $F^* = \inf_x F(x) > -\infty$. Consider AdaGrad-Norm in the deterministic setting, with the update

$$b_{j+1}^2 = b_j^2 + \|\nabla F(x_j)\|^2, \qquad x_{j+1} = x_j - \frac{\eta}{b_{j+1}}\, \nabla F(x_j).$$

Then $\min_{0 \le j \le N-1} \|\nabla F(x_j)\|^2 \le \varepsilon$ after at most $N = O(1/\varepsilon)$ iterations, where the explicit constant takes

  • one form if $b_0$ is initialized at or above the order of $\eta L$;

  • a slightly worse form if $b_0$ is initialized below it. Here the constants depend on $L$, $\eta$, $b_0$, and $F(x_0) - F^*$.

The convergence bound shows that, unlike gradient descent with constant stepsize, which can diverge if the stepsize exceeds $2/L$, AdaGrad-Norm convergence holds for any choice of the parameters $b_0 > 0$ and $\eta > 0$. The critical observation is that if the initial stepsize $\eta / b_0$ is too large, the algorithm has the freedom to diverge initially, until $b_j$ grows past a critical value (not too much larger than $\eta L$), at which point the stepsize $\eta / b_j$ is sufficiently small that the smoothness of $F$ forces $b_j$ to converge to a finite number on the order of $\eta L$, so that the algorithm converges at an $O(1/N)$ rate. To put the result in Theorem 2.2 in context, let us first review a classical result (see, for example, Nesterov [1998]) on the convergence rate of gradient descent with fixed stepsize.

Lemma 2.1

Suppose that $F$ has $L$-Lipschitz gradient and that $F^* = \inf_x F(x) > -\infty$. Consider gradient descent with constant stepsize, $x_{j+1} = x_j - \eta \nabla F(x_j)$. If $\eta \le 1/L$, then $\min_{0 \le j \le N-1} \|\nabla F(x_j)\|^2 \le \varepsilon$ after at most $N = \frac{2\,(F(x_0) - F^*)}{\eta\, \varepsilon}$ steps. Alternatively, if $\eta > 2/L$, then convergence is not guaranteed at all – gradient descent can oscillate or diverge.
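A one-dimensional quadratic makes the threshold explicit: for $F(x) = \frac{L}{2}x^2$, gradient descent with constant stepsize gives

$$x_{j+1} = x_j - \eta L x_j = (1 - \eta L)\, x_j, \qquad |x_j| = |1 - \eta L|^{\,j}\, |x_0|,$$

so the iterates contract whenever $0 < \eta < 2/L$, merely oscillate when $\eta = 2/L$, and diverge geometrically as soon as $\eta > 2/L$.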

Compared to the convergence rate of gradient descent with well-tuned fixed stepsize, AdaGrad-Norm with a large initialization $b_0$ gives a larger constant in the rate. But with a small initialization $b_0$ (that is, a large initial stepsize $\eta / b_0$), gradient descent with the corresponding fixed stepsize can fail to converge as soon as $\eta / b_0 > 2/L$, while AdaGrad-Norm converges for any $b_0 > 0$, and is extremely robust to the choice of $b_0$ in the sense that the resulting convergence rate remains close to the optimal rate of gradient descent with well-tuned fixed stepsize, paying only an additional factor in the constant. Here, the additional constant results from the worst-case analysis using Lemma 4.1, which assumes that $\|\nabla F(x_j)\|^2$ is only on the order of $\varepsilon$ for all $j$, when in reality the gradient should be much larger at first. We believe the bound on the number of iterations can be improved by a refined analysis, or by considering the setting where $x_0$ is drawn from an appropriate random distribution.

3 Proof of Theorem 2.1

We first introduce some important lemmas in Subsection 3.1 and give the proof of the first bound in Theorem 2.1 in Subsection 3.2.

3.1 Ingredients of the Proof

We first introduce several lemmas that are used in the proof of Theorem 2.1. We repeatedly appeal to the following classical Descent Lemma, which is also the main ingredient in Ghadimi and Lan [2013], and which can be proved by considering the Taylor expansion of $F$ around $x$.

Lemma 3.1 (Descent Lemma)

Let $F$ have $L$-Lipschitz gradient. Then, for all $x, y \in \mathbb{R}^d$,

$$F(y) \le F(x) + \langle \nabla F(x),\, y - x \rangle + \frac{L}{2}\,\|y - x\|^2.$$

We will also use the following lemmas concerning sums of non-negative sequences.

Lemma 3.2

For any non-negative $a_1, \ldots, a_N$ and $a_0 > 0$, we have

$$\sum_{\ell=1}^{N} \frac{a_\ell}{a_0 + \sum_{i=1}^{\ell} a_i} \;\le\; \log\!\left(\frac{a_0 + \sum_{i=1}^{N} a_i}{a_0}\right). \qquad (4)$$

The lemma can be proved by induction. That the sum should be proportional to $\log\big(\sum_i a_i\big)$ can be seen by associating to the sequence a differentiable function $g$ satisfying $g(0) = a_0$ and $g(\ell) = a_0 + \sum_{i=1}^{\ell} a_i$ for $\ell = 1, \ldots, N$, and replacing sums with integrals: $\int_0^N \frac{g'(t)}{g(t)}\, dt = \log \frac{g(N)}{g(0)}$.
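As a quick numerical sanity check of (4) (not part of the proof; a toy script of our own), one can test the inequality on random non-negative sequences:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    a0 = rng.uniform(0.1, 10.0)              # a_0 > 0
    a = rng.exponential(scale=2.0, size=50)  # non-negative a_1, ..., a_N
    partial = a0 + np.cumsum(a)              # a_0 + sum_{i <= l} a_i, for l = 1..N
    lhs = np.sum(a / partial)                # left-hand side of (4)
    rhs = np.log(partial[-1] / a0)           # right-hand side of (4)
    assert lhs <= rhs + 1e-12
```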

3.2 Proof of Theorem 2.1

For simplicity, we write $G_j := G(x_j, \xi_j)$ and $\nabla F_j := \nabla F(x_j)$. By Lemma 3.1, for each $j \ge 0$,

$$F(x_{j+1}) \le F(x_j) - \frac{\eta}{b_{j+1}}\,\langle \nabla F_j,\, G_j \rangle + \frac{L \eta^2}{2\, b_{j+1}^2}\,\|G_j\|^2.$$

At this point, we cannot apply the standard method of proof for SGD, since $b_{j+1}$ and $G_j$ are correlated random variables; in particular, the conditional expectation does not factor:

$$\mathbb{E}_{\xi_j}\!\left[\frac{\langle \nabla F_j,\, G_j \rangle}{b_{j+1}}\right] \;\neq\; \frac{\big\langle \nabla F_j,\, \mathbb{E}_{\xi_j}[G_j] \big\rangle}{\mathbb{E}_{\xi_j}[b_{j+1}]} \;=\; \frac{\|\nabla F_j\|^2}{\mathbb{E}_{\xi_j}[b_{j+1}]}.$$

If we had a closed form expression for the conditional expectation $\mathbb{E}_{\xi_j}[b_{j+1}]$, we would proceed by bounding this term as

(5)

Since we do not have a closed form expression for $\mathbb{E}_{\xi_j}[b_{j+1}]$, though, we use a deterministic estimate (measurable with respect to $\xi_0, \ldots, \xi_{j-1}$) as a surrogate for $b_{j+1}$ to proceed. Conditioning on $\xi_0, \ldots, \xi_{j-1}$ and taking expectation with respect to $\xi_j$,

thus,

(6)

Now, observe the term

thus, applying Cauchy-Schwarz,

(7)

By applying the elementary inequality $2ab \le \lambda a^2 + b^2/\lambda$ (valid for any $\lambda > 0$) with appropriate choices of $a$, $b$, and $\lambda$, the first term in (7) can be bounded as

(8)

where the last inequality is due to the fact that

Similarly, applying the same inequality with different choices of $a$, $b$, and $\lambda$, the second term on the right-hand side of (7) is bounded by

(9)

Thus, putting inequalities (8) and (9) back into (7) gives

and, therefore, back to (6),

Rearranging,

Applying the law of total expectation, we take the expectation of each side with respect to $\xi_0, \ldots, \xi_{j-1}$, and arrive at the recursion

Taking total expectations and summing up from $j = 0$ to $N - 1$,

(10)

where in the second inequality we apply Lemma 3.2 and then Jensen's inequality to bound the summation:

(11)

since