
1 Introduction

Consider the problem of minimizing a differentiable non-convex function F : ℝ^d → ℝ via stochastic gradient descent (SGD): starting from x_0 ∈ ℝ^d and with stepsizes η_j > 0, SGD iterates until convergence

$$x_{j+1} \leftarrow x_j - \eta_j\, G(x_j), \qquad (1)$$

where η_j is the stepsize at the j-th iteration and G(x_j) is the stochastic gradient, a random vector satisfying E[G(x_j)] = ∇F(x_j) and having bounded variance. SGD is the de facto standard for deep learning optimization problems, and more generally for large-scale optimization problems where the loss function F can be approximated by the average of a large number of component functions, F(x) = (1/m) ∑_{i=1}^m f_i(x). It is more efficient to measure a single component gradient ∇f_i(x) (or a subset of component gradients) and move in the resulting noisy direction than to compute the full gradient ∇F(x).
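As a minimal illustration (not from the paper), the finite-sum setting above can be sketched in NumPy; the quadratic components f_i are an illustrative choice:

```python
import numpy as np

# Minimal sketch (not from the paper): minimize F(x) = (1/m) sum_i f_i(x) with
# quadratic components f_i(x) = 0.5 * ||x - a_i||^2, so grad f_i(x) = x - a_i
# and grad F(x) = x - mean(a_i). Each step measures one component gradient only.
rng = np.random.default_rng(0)
a = rng.normal(size=(100, 5))   # data points a_i defining the components
x = np.zeros(5)                 # starting point x_0
for j in range(2000):
    i = rng.integers(100)       # sample one component uniformly at random
    g = x - a[i]                # stochastic gradient G(x_j)
    x = x - g / (j + 1)         # decreasing stepsize eta_j = 1/(j+1)
# with eta_j = 1/(j+1), x_N is exactly the running average of the sampled a_i,
# so x approaches the minimizer mean(a_i)
```

With this particular stepsize schedule the iterate is a running average of the sampled data points, which makes the convergence to the minimizer easy to see.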

For non-convex but smooth loss functions F, (noiseless) gradient descent (GD) with constant stepsize converges to a stationary point of F at rate O(1/N) in the number of iterations N [Nesterov, 1998]. In the same setting, and under the general assumption of bounded gradient noise variance, SGD with constant or decreasing stepsize has been proven to converge to a stationary point of F at rate O(1/√N) [Ghadimi and Lan, 2013, Bottou et al., 2018]. The O(1/N) rate for GD is the best possible worst-case dimension-free rate of convergence for any algorithm [Carmon et al., 2017a]; faster convergence rates in the noiseless setting are available under the mild assumption of additional smoothness [Agarwal et al., 2017, Carmon et al., 2017b]. In the noisy setting, faster rates than O(1/√N) are also possible using accelerated SGD methods [Ghadimi and Lan, 2016, Allen-Zhu and Yang, 2016, Reddi et al., 2016, Allen-Zhu, 2017, Xu et al., 2017, Carmon et al., 2018].

Instead of focusing on faster convergence rates for SGD, this paper focuses on adaptive stepsizes [Cutkosky and Boahen, 2017, Levy, 2017] that make the optimization algorithm more robust to (generally unknown) parameters of the optimization problem, such as the noise level of the stochastic gradient and the Lipschitz smoothness constant of the loss function, defined as the smallest number L such that ∥∇F(x) − ∇F(y)∥ ≤ L∥x − y∥ for all x, y. In particular, the convergence of GD with fixed stepsize is guaranteed only if the stepsize is carefully chosen such that η ≤ 1/L; choosing a larger stepsize, even just by a factor of 2, can result in oscillation or divergence of the algorithm [Nesterov, 1998]. Because of this sensitivity, GD with fixed stepsize is rarely used in practice; instead, one adaptively chooses the stepsize at each iteration to approximately maximize the decrease of the loss function in the current direction of −∇F(x), via either approximate line search [Wright and Nocedal, 2006], or according to the Barzilai-Borwein rule [Barzilai and Borwein, 1988] combined with line search.

Unfortunately, in the noisy setting where one uses SGD for optimization, line search methods are not useful, as in this setting the stepsize should not be overfit to the noisy stochastic gradient direction at each iteration. The classical Robbins-Monro theory [Robbins and Monro, 1951] says that in order to ensure convergence, the stepsize schedule (η_k) should satisfy

$$\sum_{k=1}^{\infty}\eta_k = \infty \quad\text{and}\quad \sum_{k=1}^{\infty}\eta_k^2 < \infty. \qquad (2)$$

However, these bounds do not tell us much about how to select a good stepsize schedule in practice, where algorithms are run for finitely many iterations and the constants in the rate of convergence matter.
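The two conditions in (2) can be checked numerically for a concrete schedule; as an illustration (not from the paper), η_k = 1/k satisfies both, since its partial sums grow without bound while the partial sums of η_k² stay below π²/6:

```python
import math

# Illustration (not from the paper): the schedule eta_k = 1/k satisfies both
# Robbins-Monro conditions in (2) -- the partial sums of eta_k grow without
# bound (harmonically), while the partial sums of eta_k**2 stay bounded by
# pi**2 / 6 ~= 1.645. A schedule like eta_k = 1/k**2 would fail the first
# condition, since its sum converges.
partial_eta_10 = sum(1.0 / k for k in range(1, 11))
partial_eta_10k = sum(1.0 / k for k in range(1, 10_001))
partial_eta_sq_10k = sum(1.0 / k ** 2 for k in range(1, 10_001))
```

The first sum keeps growing (logarithmically) as the horizon increases, while the second stays bounded, which is exactly the balance (2) asks for.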

The question of how to choose the stepsize (learning rate) schedule for SGD is by no means resolved; in practice, a preferred schedule is chosen manually by testing many different schedules in advance and choosing the one leading to the smallest training or generalization error. This process can take days or weeks, and can become prohibitively expensive in terms of the time and computational resources incurred.

1.1 AdaGrad-Norm

AdaGrad [Duchi et al., 2011] is most commonly implemented with a vector of per-coordinate adaptive stepsizes in the noisy gradient setting. This common use makes AdaGrad a variable metric method, and it has been the object of recent criticism for machine learning applications [Wilson et al., 2017].

At the j-th iteration, one observes the random variable G(x_j) such that E[G(x_j)] = ∇F(x_j), and iterates

$$x_{j+1} \leftarrow x_j - \frac{\eta}{b_{j+1}}\,G(x_j), \qquad b_{j+1}^2 = b_j^2 + \|G(x_j)\|^2,$$

where η > 0 is included to ensure homogeneity and that the units match. It is straightforward that b_{j+1}² grows linearly in j in expectation; thus, under the assumptions of a uniformly bounded gradient ∥∇F(x)∥ ≤ γ and uniformly bounded variance σ², the stepsize will eventually decay according to η/b_j ∝ 1/√j. This stepsize schedule matches the schedule η_j ∝ 1/√j which leads to optimal rates of convergence for SGD in the case of convex but not necessarily smooth functions, as well as smooth but not necessarily convex functions (see, for instance, Agarwal et al. and Bubeck et al.). This observation suggests that AdaGrad-Norm should be able to achieve O(1/√N) convergence rates for SGD, but without having to know the Lipschitz smoothness parameter L of F and the noise parameter σ a priori in order to set the stepsize schedule.
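The AdaGrad-Norm update above is short enough to sketch directly; the function and oracle names below are illustrative, not from the paper:

```python
import numpy as np

# A sketch of the AdaGrad-Norm update described above (the names `adagrad_norm`
# and `grad_oracle` are illustrative, not from the paper).
def adagrad_norm(grad_oracle, x0, eta=1.0, b0=1e-3, n_iters=1000):
    """x_{j+1} = x_j - (eta / b_{j+1}) G(x_j),  b_{j+1}^2 = b_j^2 + ||G(x_j)||^2."""
    x = np.asarray(x0, dtype=float).copy()
    b_sq = b0 ** 2
    for _ in range(n_iters):
        g = grad_oracle(x)
        b_sq += float(np.dot(g, g))     # accumulate squared gradient norms
        x -= (eta / np.sqrt(b_sq)) * g  # one adaptive stepsize for all coordinates
    return x

# Noiseless sanity check on F(x) = 0.5 ||x||^2 (so grad F(x) = x and L = 1):
# the iterates contract once b_j grows past eta * L, for any tiny b0.
x_final = adagrad_norm(lambda x: 1.0 * x, x0=np.ones(3), eta=1.0, b0=1e-3)
```

Note the single scalar b_j shared by all coordinates: this is what distinguishes the "norm" version from the per-coordinate AdaGrad variant.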

Theoretically rigorous convergence results for AdaGrad-Norm were provided in the convex setting recently [Levy, 2017]. Moreover, it is possible to obtain convergence rates in the offline setting by online-batch conversion. However, making such observations rigorous for nonconvex functions is difficult because b_j is itself a random variable which is correlated with the current and all previous noisy gradients; thus, the standard proofs for SGD do not straightforwardly extend to proofs for AdaGrad-Norm. This paper provides such a proof for AdaGrad-Norm.

1.2 Main contributions

Our results make rigorous and precise the observed phenomenon that the convergence behavior of AdaGrad-Norm is highly adaptable to the unknown Lipschitz smoothness constant and level of stochastic noise on the gradient: when there is noise, AdaGrad-Norm converges at the rate O(1/√N), and when there is no noise, the same algorithm converges at the optimal O(1/N) rate, like well-tuned batch gradient descent. Moreover, our analysis shows that AdaGrad-Norm converges at these rates for any choice of the algorithm hyperparameters η > 0 and b_0 > 0, in contrast to GD or SGD with fixed stepsize, where if the stepsize is set above a hard upper threshold governed by the (generally unknown) smoothness constant L, the algorithm might not converge at all. Finally, we note that the constants in the rates of convergence we provide are explicit in terms of their dependence on the hyperparameters η and b_0. We list our two main theorems (informally) in the following:

• For a differentiable non-convex function F with L-Lipschitz gradient and F* = inf_x F(x) > −∞, Theorem 2.1 implies that AdaGrad-Norm converges to an ε-approximate stationary point with high probability¹ at the rate O(1/ε²).

¹It is becoming common to define an ε-approximate stationary point as a point x where ∥∇F(x)∥ ≤ ε [Agarwal et al., 2017, Carmon et al., 2018, 2017a, Fang et al., 2018, Zhou et al., 2018, Allen-Zhu, 2018], but we use the convention ∥∇F(x)∥² ≤ ε [Lei et al., 2017, Bottou et al., 2018] to most easily compare our results to those from Ghadimi and Lan, Li and Orabona.

If the optimal value F* of the loss function is known and one sets η accordingly, then the constant in our rate is close to the best-known constant achievable for SGD with fixed stepsize carefully tuned to knowledge of L and σ, as given in Ghadimi and Lan. However, our result requires a uniformly bounded gradient, and our rate constant scales with γ + σ instead of linearly in σ. Nevertheless, our result suggests a good strategy for setting the hyperparameters in implementing AdaGrad-Norm practically: given knowledge of F*, set η accordingly and simply initialize b_0 to be very small.

• When there is no noise (σ = 0), we can improve this rate to an O(1/N) rate of convergence. In Theorem 2.2, we show that an ε-approximate stationary point is reached after an explicit number of iterations N that depends on two cases:
(1) if b_0 ≥ ηL;
(2) if b_0 < ηL.
Note that the constant in the second case, when b_0 < ηL, is not optimal compared to the known best rate constant obtainable by gradient descent with fixed stepsize [Carmon et al., 2017a]; on the other hand, given knowledge of L and F*, the rate constant of AdaGrad-Norm reproduces the optimal constant by setting η and b_0 accordingly.

Practically, our results imply a good strategy for setting the hyperparameters when implementing AdaGrad-Norm in practice: set η according to F* (assuming F* is known) and set b_0 to be a very small value. If F* is unknown, then a moderate choice of η should work well for a wide range of values of b_0, both in the noiseless case and in the noisy case with σ strictly greater than zero.

1.3 Previous work

Theoretical guarantees of convergence for AdaGrad were provided in Duchi et al. in the setting of online convex optimization, where the loss function may change from iteration to iteration and be chosen adversarially. AdaGrad was subsequently observed to be effective for accelerating convergence in the nonconvex setting, and has become a popular algorithm for optimization in deep learning problems. Many modifications of AdaGrad, with or without momentum, have been proposed, namely RMSprop, among others.

However, the guaranteed convergence in Wu et al. is only for the batch setting, and the constant in the convergence rate is worse than the one provided here for AdaGrad-Norm. Independently, Li and Orabona also prove a convergence rate for a variant of AdaGrad-Norm in the non-convex stochastic setting, but their analysis requires knowledge of the smoothness constant L and a hard threshold on the stepsize for their convergence. In contrast to Li and Orabona, we do not require knowledge of the Lipschitz smoothness constant L, but we do assume that the gradient is uniformly bounded by some (unknown) finite value γ, while Li and Orabona only assume bounded variance σ².

1.5 Notation

Throughout, ∥·∥ denotes the ℓ₂ norm. A function F has L-Lipschitz smooth gradient if

$$\|\nabla F(x) - \nabla F(y)\| \le L\,\|x-y\|, \qquad \forall\, x, y \in \mathbb{R}^d. \qquad (3)$$

We refer to L as the smoothness constant for F if L is the smallest number such that (3) is satisfied.

To be clear about the adaptive algorithm, we first state in Algorithm 1 the norm version of AdaGrad that we consider throughout the analysis.

At the j-th iteration, we observe a stochastic gradient G(x_j) = ∇F(x_j) + ξ_j, where the ξ_j, j = 0, 1, 2, …, are random variables, such that G(x_j) is an unbiased estimator of ∇F(x_j).² We require the following additional assumptions: for each j,

1. the random vectors ξ_j, j = 0, 1, 2, …, are independent of each other and also of x_j;

2. E_{ξ_j}[∥G(x_j) − ∇F(x_j)∥²] ≤ σ²;

3. ∥∇F(x)∥ ≤ γ uniformly in x.

²E_{ξ_j} denotes the expectation with respect to ξ_j, conditional on the previous ξ_0, …, ξ_{j−1}.

The first two assumptions are standard (see e.g. Nemirovski and Yudin, Nemirovski et al., Bottou et al.). The third assumption is somewhat restrictive, as it rules out strongly convex objectives, but it is not an unreasonable assumption for AdaGrad-Norm, where the adaptive learning rate is a cumulative sum of all previously observed gradient norms.
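A concrete oracle satisfying these assumptions is easy to construct; the following sketch (with an illustrative objective and noise direction, not from the paper) uses symmetric two-point noise, so the unbiasedness and variance conditions can be verified exactly:

```python
import numpy as np

# A sketch of a stochastic gradient oracle satisfying the assumptions above:
# G(x) = grad F(x) + xi with xi uniform on {+v, -v}, so E[G(x)] = grad F(x)
# exactly and E||G(x) - grad F(x)||^2 = ||v||^2 = sigma^2. The objective
# F(x) = ||x||^2 and the noise direction v are illustrative choices.
grad_F = lambda x: 2.0 * x
v = np.array([0.3, -0.4])   # sigma^2 = ||v||^2 = 0.25
x = np.array([1.0, 2.0])
samples = [grad_F(x) + s * v for s in (1.0, -1.0)]
mean_G = sum(samples) / 2.0  # exact expectation over the two-point noise
var_G = sum(float(np.dot(g - grad_F(x), g - grad_F(x))) for g in samples) / 2.0
```

Since the noise takes only two equally likely values, the expectation and variance can be computed exactly rather than estimated by sampling.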

Because of the variance in the gradient, the AdaGrad-Norm stepsize decreases to zero roughly at a rate between η/√(j(γ² + σ²)) and η/(σ√j). It is known that the AdaGrad-Norm stepsize decreases at this rate [Levy, 2017], and that this rate is optimal in terms of the resulting convergence theorems in the setting of smooth but not necessarily convex F, or convex but not necessarily strongly convex or smooth F. Still, standard convergence theorems for SGD do not extend straightforwardly to AdaGrad-Norm because the stepsize η/b_{j+1} is a random variable that depends on all previous points visited along the way, i.e., on x_0, …, x_j and ξ_0, …, ξ_j. From this point on, we use the shorthand F_j = F(x_j), ∇F_j = ∇F(x_j), and G_j = G(x_j) for simplicity of notation. The following theorem gives the convergence guarantee for Algorithm 1. We give the detailed proof in Section 3.

Theorem 2.1

Suppose F has L-Lipschitz smooth gradient and F* = inf_x F(x) > −∞. Suppose that the random variables ξ_j, j = 0, 1, …, N − 1, satisfy the above assumptions. Then, with probability 1 − δ,

$$\min_{\ell\in[N-1]}\|\nabla F(x_\ell)\|^2 \le \min\left\{\left(\frac{2b_0}{N}+\frac{2\sqrt{2}\,(\gamma+\sigma)}{\sqrt{N}}\right)\frac{Q}{\delta^{3/2}},\ \left(\frac{8Q}{\delta}+2b_0\right)\frac{4Q}{N\delta}+\frac{8Q\sigma}{\delta^{3/2}\sqrt{N}}\right\}$$

where

$$Q = \frac{F(x_0)-F^*}{\eta} + \frac{4\sigma+\eta L}{2}\,\log\!\left(\frac{20N(\gamma^2+\sigma^2)}{b_0^2}+10\right).$$

This result implies that AdaGrad-Norm converges for any η > 0 and starting from any value of b_0 > 0. To put this result in context, we can compare to Corollary 2.2 of Ghadimi and Lan, which gives the best-known convergence rate for SGD with fixed stepsize in the same setting (albeit not requiring Assumption 3 of uniformly bounded gradient): if the Lipschitz smoothness constant L and the variance σ² are known a priori, and the fixed stepsize in SGD is set to

$$\eta_j = \min\left\{\frac{1}{L},\ \frac{1}{\sigma\sqrt{N}}\right\}, \qquad j = 0, 1, \dots, N-1,$$

then with probability 1 − δ,

$$\min_{\ell\in[N-1]}\|\nabla F(x_\ell)\|^2 \le \frac{2L\,(F(x_0)-F^*)}{N\delta} + \frac{\big(L+2(F(x_0)-F^*)\big)\,\sigma}{\delta\sqrt{N}}.$$

We match the O(1/√N) rate of Ghadimi and Lan, but without a priori knowledge of L and σ, and with a worse constant in the rate of convergence. In particular, our rate constant has a worse dependence on L (up to logarithmic factors in N), while the result for SGD with well-tuned fixed stepsize scales linearly with L. The additional logarithmic factor (by Lemma 3.2) results from the AdaGrad-Norm update using the square norm of the gradient (see inequality (11) for details). The extra dependence on σ in the constant Q results from the correlation between the stepsize 1/b_{j+1} and the gradient G_j. We note that the recent work of Li and Orabona derives an O(1/√N) rate for a variation of AdaGrad-Norm without the assumption of uniformly bounded gradient, but at the same time requires a priori knowledge of the smoothness constant L in setting the stepsize in order to establish convergence, similar to SGD with fixed stepsize. Finally, we note that recent works [Allen-Zhu, 2017, Lei et al., 2017, Fang et al., 2018, Zhou et al., 2018] provide modified SGD algorithms with convergence rates faster than O(1/√N), albeit again requiring a priori knowledge of both L and σ to establish convergence.

We reiterate, however, that the main emphasis in Theorem 2.1 is on the robustness of the AdaGrad-Norm convergence to its hyperparameters η and b_0, compared to plain SGD’s dependence on the problem parameters L and σ. Although the constant in the rate of our theorem is not as good as the best-known constant for stochastic gradient descent with well-tuned fixed stepsize, our result suggests that implementing AdaGrad-Norm allows one to vastly reduce the laborious experimentation needed to find a stepsize schedule with reasonable convergence when implementing SGD in practice.

We note that for the second bound in Theorem 2.1, in the limit as σ → 0 we recover an O(1/N) rate of convergence for noiseless gradient descent. We can establish a stronger result in the noiseless setting using a different method of proof, removing the additional log factor and Assumption 3 of uniformly bounded gradient. We state the theorem below and defer the proof to Section 4.

Theorem 2.2

Suppose F has L-Lipschitz smooth gradient and F* = inf_x F(x) > −∞. Consider AdaGrad-Norm in the deterministic setting, with the following update:

$$x_j = x_{j-1} - \frac{\eta}{b_j}\,\nabla F(x_{j-1}), \qquad b_j^2 = b_{j-1}^2 + \|\nabla F(x_{j-1})\|^2.$$

Then an ε-approximate stationary point, min_ℓ ∥∇F(x_ℓ)∥² ≤ ε, is reached after an explicit number of iterations N that depends on two cases:

• if b_0 ≥ ηL;

• if b_0 < ηL.

The convergence bound shows that, unlike gradient descent with constant stepsize, which can diverge if the stepsize exceeds 2/L, AdaGrad-Norm convergence holds for any choice of the parameters η > 0 and b_0 > 0. The critical observation is that if the initial stepsize η/b_0 is too large, the algorithm has the freedom to diverge initially, until b_j grows to a critical point (not too much larger than ηL), at which point the stepsize η/b_j is sufficiently small that the smoothness of F forces b_j to converge to a finite number on the order of ηL, so that the algorithm converges at an O(1/N) rate. To describe the result in Theorem 2.2, let us first review a classical result (see, for example, Nesterov) on the convergence rate for gradient descent with fixed stepsize.

Lemma 2.1

Suppose F has L-Lipschitz smooth gradient and F* = inf_x F(x) > −∞. Consider gradient descent with constant stepsize 1/b, i.e., x_{j+1} = x_j − (1/b)∇F(x_j). If b ≥ L, then min_{ℓ∈[N]} ∥∇F(x_ℓ)∥² ≤ ε after at most a number of steps

$$N = \frac{2b\,(F(x_0)-F^*)}{\varepsilon}.$$

Alternatively, if b < L/2, then convergence is not guaranteed at all: gradient descent can oscillate or diverge.

Compared to the convergence rate of gradient descent with fixed stepsize, AdaGrad-Norm in the case b_0 < ηL gives a larger constant in the rate. But in this case, gradient descent can fail to converge as soon as the stepsize exceeds 2/L, while AdaGrad-Norm converges for any b_0 > 0, and is extremely robust to the choice of b_0 in the sense that the resulting convergence rate remains close to the optimal rate of gradient descent with fixed stepsize 1/L, paying only a modest factor depending on η and b_0 in the constant. Here, the constant results from the worst-case analysis using Lemma 4.1, which assumes that the gradient satisfies ∥∇F(x_j)∥² ≤ ε for all j, when in reality the gradient should be much larger at first. We believe the number of iterations can be improved by a refined analysis, or by considering the setting where x_0 is drawn from an appropriate random distribution.
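The robustness contrast above can be seen on a one-dimensional toy problem; this is a sketch under illustrative choices (F(x) = x²/2, η = 3, b_0 = 10⁻³), not the paper's experiment:

```python
# A sketch (not the paper's experiment) of the robustness claim: on
# F(x) = 0.5 * x**2, with smoothness constant L = 1, fixed-stepsize gradient
# descent with eta = 3 > 2/L diverges, while AdaGrad-Norm with the same eta
# and a deliberately bad initialization b_0 << eta * L still converges:
# b_j grows past the critical point, after which the iterates contract.
def gd_fixed(eta, x0=1.0, n=50):
    x = x0
    for _ in range(n):
        x -= eta * x          # grad F(x) = x, so x <- (1 - eta) * x
    return x

def adagrad_norm_1d(eta, b0, x0=1.0, n=500):
    x, b_sq = x0, b0 ** 2
    for _ in range(n):
        g = x
        b_sq += g * g         # b_j^2 accumulates squared gradients
        x -= (eta / b_sq ** 0.5) * g
    return x

x_gd = gd_fixed(eta=3.0)      # |x_j| doubles every step: divergence
x_ada = adagrad_norm_1d(eta=3.0, b0=1e-3)
```

The AdaGrad-Norm run overshoots for the first few iterations (its initial effective stepsize η/b_0 is huge), but the overshoots themselves inflate b_j until the effective stepsize is safe, exactly the self-correcting behavior described above.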

3 Proof of Theorem 2.1

We first introduce some important lemmas in Subsection 3.1 and give the proof of the first bound in Theorem 2.1 in Subsection 3.2.

3.1 Ingredients of the Proof

We first introduce several lemmas that are used in the proof of Theorem 2.1. We repeatedly appeal to the following classical Descent Lemma, which is also the main ingredient in Ghadimi and Lan, and which can be proved by considering the Taylor expansion of F around y.

Lemma 3.1 (Descent Lemma)

Let F have L-Lipschitz smooth gradient and let x, y ∈ ℝ^d. Then,

$$F(x) \le F(y) + \langle\nabla F(y),\, x-y\rangle + \frac{L}{2}\|x-y\|^2.$$

We will also use the following lemmas concerning sums of non-negative sequences.

Lemma 3.2

For any non-negative a_1, …, a_T with a_1 ≥ 1, we have

$$\sum_{\ell=1}^{T}\frac{a_\ell}{\sum_{i=1}^{\ell}a_i} \le \log\left(\sum_{i=1}^{T}a_i\right) + 1. \qquad (4)$$

The lemma can be proved by induction. That the sum should be proportional to log(∑_{i=1}^T a_i) can be seen by associating to the sequence a continuous non-negative function whose integral over [ℓ − 1, ℓ] equals a_ℓ, and replacing the sums with integrals.
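As a numeric sanity check (a sketch, not a proof), the inequality of Lemma 3.2 can be tested on random non-negative sequences with a_1 ≥ 1:

```python
import math
import random

# Numeric sanity check (a sketch, not a proof) of Lemma 3.2: for non-negative
# a_1, ..., a_T with a_1 >= 1,
#   sum_{l=1}^{T} a_l / (a_1 + ... + a_l)  <=  log(a_1 + ... + a_T) + 1.
random.seed(0)
violations = 0
for _ in range(200):
    T = random.randint(1, 60)
    a = [1.0 + random.random()] + [2.0 * random.random() for _ in range(T - 1)]
    lhs, running = 0.0, 0.0
    for a_l in a:
        running += a_l          # running partial sum a_1 + ... + a_l
        lhs += a_l / running    # l-th term of the left-hand side
    if lhs > math.log(sum(a)) + 1.0:
        violations += 1
```

Each term a_ℓ / (a_1 + … + a_ℓ) behaves like the increment log(S_ℓ) − log(S_{ℓ−1}) of the log of the partial sums, which is why the whole sum telescopes up to the logarithmic bound.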

3.2 Proof of Theorem 2.1

For simplicity, we write F_j = F(x_j), ∇F_j = ∇F(x_j), and G_j = G(x_j). By Lemma 3.1, for j ≥ 0,

$$\frac{F_{j+1}-F_j}{\eta} \le -\left\langle\nabla F_j,\, \frac{G_j}{b_{j+1}}\right\rangle + \frac{\eta L}{2 b_{j+1}^2}\|G_j\|^2 = -\frac{\|\nabla F_j\|^2}{b_{j+1}} + \frac{\langle\nabla F_j,\, \nabla F_j - G_j\rangle}{b_{j+1}} + \frac{\eta L\,\|G_j\|^2}{2 b_{j+1}^2}.$$

At this point, we cannot apply the standard method of proof for SGD, since b_{j+1} and G_j are correlated random variables; thus, in particular, for the conditional expectation,

$$\mathbb{E}_{\xi_j}\!\left[\frac{\langle\nabla F_j,\,\nabla F_j-G_j\rangle}{b_{j+1}}\right] \ne \frac{\mathbb{E}_{\xi_j}\!\left[\langle\nabla F_j,\,\nabla F_j-G_j\rangle\right]}{b_{j+1}} = \frac{1}{b_{j+1}}\cdot 0.$$

If we had a closed-form expression for E_{ξ_j}[1/b_{j+1}], we would proceed by bounding this term as

$$\left|\mathbb{E}_{\xi_j}\!\left[\frac{1}{b_{j+1}}\langle\nabla F_j,\,\nabla F_j-G_j\rangle\right]\right| = \left|\mathbb{E}_{\xi_j}\!\left[\left(\frac{1}{b_{j+1}}-\mathbb{E}_{\xi_j}\!\left[\frac{1}{b_{j+1}}\right]\right)\langle\nabla F_j,\,\nabla F_j-G_j\rangle\right]\right| \le \mathbb{E}_{\xi_j}\!\left[\left|\frac{1}{b_{j+1}}-\mathbb{E}_{\xi_j}\!\left[\frac{1}{b_{j+1}}\right]\right|\,\big|\langle\nabla F_j,\,\nabla F_j-G_j\rangle\big|\right]. \qquad (5)$$

Since we do not have a closed-form expression for E_{ξ_j}[1/b_{j+1}], though, we use the estimate √(b_j² + ∥∇F_j∥² + σ²) as a surrogate for b_{j+1} to proceed. Conditioning on ξ_0, …, ξ_{j−1} and taking the expectation with respect to ξ_j,

$$0 = \frac{\mathbb{E}_{\xi_j}\!\left[\langle\nabla F_j,\,\nabla F_j-G_j\rangle\right]}{\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}} = \mathbb{E}_{\xi_j}\!\left[\frac{\langle\nabla F_j,\,\nabla F_j-G_j\rangle}{\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}\right];$$

thus,

$$\begin{aligned}\frac{\mathbb{E}_{\xi_j}[F_{j+1}]-F_j}{\eta} &\le \mathbb{E}_{\xi_j}\!\left[\frac{\langle\nabla F_j,\,\nabla F_j-G_j\rangle}{b_{j+1}} - \frac{\langle\nabla F_j,\,\nabla F_j-G_j\rangle}{\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}\right] - \mathbb{E}_{\xi_j}\!\left[\frac{\|\nabla F_j\|^2}{b_{j+1}}\right] + \mathbb{E}_{\xi_j}\!\left[\frac{L\eta\,\|G_j\|^2}{2 b_{j+1}^2}\right]\\[2pt] &= \mathbb{E}_{\xi_j}\!\left[\left(\frac{1}{\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}-\frac{1}{b_{j+1}}\right)\langle\nabla F_j,\, G_j\rangle\right] - \frac{\|\nabla F_j\|^2}{\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}} + \frac{\eta L}{2}\,\mathbb{E}_{\xi_j}\!\left[\frac{\|G_j\|^2}{b_{j+1}^2}\right]. \qquad (6)\end{aligned}$$

Now, observe that the term

$$\begin{aligned}\frac{1}{\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}-\frac{1}{b_{j+1}} &= \frac{(\|G_j\|-\|\nabla F_j\|)(\|G_j\|+\|\nabla F_j\|)-\sigma^2}{b_{j+1}\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}\,\left(\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}+b_{j+1}\right)}\\[2pt] &\le \frac{\big|\|G_j\|-\|\nabla F_j\|\big|}{b_{j+1}\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}} + \frac{\sigma}{b_{j+1}\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}};\end{aligned}$$

thus, applying Cauchy-Schwarz,

$$\begin{aligned}&\mathbb{E}_{\xi_j}\!\left[\left(\frac{1}{\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}-\frac{1}{b_{j+1}}\right)\langle\nabla F_j,\, G_j\rangle\right]\\ &\qquad\le \mathbb{E}_{\xi_j}\!\left[\frac{\big|\|G_j\|-\|\nabla F_j\|\big|\,\|G_j\|\,\|\nabla F_j\|}{b_{j+1}\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}\right] + \mathbb{E}_{\xi_j}\!\left[\frac{\sigma\,\|G_j\|\,\|\nabla F_j\|}{b_{j+1}\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}\right]. \qquad (7)\end{aligned}$$

By applying the inequality |ab| ≤ (λ/2)a² + (1/(2λ))b² with λ = √(b_j² + ∥∇F_j∥² + σ²)/(2σ²), a = |∥G_j∥ − ∥∇F_j∥| ∥∇F_j∥ / √(b_j² + ∥∇F_j∥² + σ²), and b = ∥G_j∥/b_{j+1}, the first term in (7) can be bounded as

$$\begin{aligned}\mathbb{E}_{\xi_j}\!\left[\frac{\big|\|G_j\|-\|\nabla F_j\|\big|\,\|G_j\|\,\|\nabla F_j\|}{b_{j+1}\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}\right] &\le \frac{\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}{4\sigma^2}\cdot\frac{\|\nabla F_j\|^2\,\mathbb{E}_{\xi_j}\!\left[(\|G_j\|-\|\nabla F_j\|)^2\right]}{b_j^2+\|\nabla F_j\|^2+\sigma^2} + \frac{\sigma^2}{\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}\,\mathbb{E}_{\xi_j}\!\left[\frac{\|G_j\|^2}{b_{j+1}^2}\right]\\[2pt] &\le \frac{\|\nabla F_j\|^2}{4\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}} + \sigma\,\mathbb{E}_{\xi_j}\!\left[\frac{\|G_j\|^2}{b_{j+1}^2}\right], \qquad (8)\end{aligned}$$

where the last inequality is due to the fact that

$$\big|\|G_j\|-\|\nabla F_j\|\big| \le \|G_j-\nabla F_j\|,$$

so that E_{ξ_j}[(∥G_j∥ − ∥∇F_j∥)²] ≤ σ².

Similarly, applying the inequality |ab| ≤ (λ/2)a² + (1/(2λ))b² with λ = 2, a = ∥G_j∥/b_{j+1}, and b = ∥∇F_j∥ / √(b_j² + ∥∇F_j∥² + σ²), the second term on the right-hand side of (7) is bounded by

$$\mathbb{E}_{\xi_j}\!\left[\frac{\sigma\,\|\nabla F_j\|\,\|G_j\|}{b_{j+1}\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}\right] \le \sigma\,\mathbb{E}_{\xi_j}\!\left[\frac{\|G_j\|^2}{b_{j+1}^2}\right] + \frac{\|\nabla F_j\|^2}{4\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}. \qquad (9)$$

Thus, putting inequalities (8) and (9) back into (7) gives

$$\mathbb{E}_{\xi_j}\!\left[\left(\frac{1}{\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}-\frac{1}{b_{j+1}}\right)\langle\nabla F_j,\, G_j\rangle\right] \le 2\sigma\,\mathbb{E}_{\xi_j}\!\left[\frac{\|G_j\|^2}{b_{j+1}^2}\right] + \frac{\|\nabla F_j\|^2}{2\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}},$$

and, therefore, back to (6),

$$\frac{\mathbb{E}_{\xi_j}[F_{j+1}]-F_j}{\eta} \le \frac{\eta L}{2}\,\mathbb{E}_{\xi_j}\!\left[\frac{\|G_j\|^2}{b_{j+1}^2}\right] + 2\sigma\,\mathbb{E}_{\xi_j}\!\left[\frac{\|G_j\|^2}{b_{j+1}^2}\right] - \frac{\|\nabla F_j\|^2}{2\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}.$$

Rearranging,

$$\frac{\|\nabla F_j\|^2}{2\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}} \le \frac{F_j-\mathbb{E}_{\xi_j}[F_{j+1}]}{\eta} + \frac{4\sigma+\eta L}{2}\,\mathbb{E}_{\xi_j}\!\left[\frac{\|G_j\|^2}{b_{j+1}^2}\right].$$

Applying the law of total expectation, we take the expectation of each side with respect to ξ_0, …, ξ_{j−1}, and arrive at the recursion

$$\mathbb{E}\!\left[\frac{\|\nabla F_j\|^2}{2\sqrt{b_j^2+\|\nabla F_j\|^2+\sigma^2}}\right] \le \frac{\mathbb{E}[F_j]-\mathbb{E}[F_{j+1}]}{\eta} + \frac{4\sigma+\eta L}{2}\,\mathbb{E}\!\left[\frac{\|G_j\|^2}{b_{j+1}^2}\right].$$

Using F_N ≥ F* and summing up from j = 0 to j = N − 1,

$$\begin{aligned}\sum_{k=0}^{N-1}\mathbb{E}\!\left[\frac{\|\nabla F_k\|^2}{2\sqrt{b_k^2+\|\nabla F_k\|^2+\sigma^2}}\right] &\le \frac{F_0-F^*}{\eta} + \frac{4\sigma+\eta L}{2}\,\mathbb{E}\sum_{k=0}^{N-1}\left[\frac{\|G_k\|^2}{b_{k+1}^2}\right]\\[2pt] &\le \frac{F_0-F^*}{\eta} + \frac{4\sigma+\eta L}{2}\,\log\!\left(10+\frac{20N(\sigma^2+\gamma^2)}{b_0^2}\right), \qquad (10)\end{aligned}$$

where in the second inequality we apply Lemma 3.2 and then Jensen's inequality to bound the summation:

$$\mathbb{E}\sum_{k=0}^{N-1}\left[\frac{\|G_k\|^2}{b_{k+1}^2}\right] \le \mathbb{E}\!\left[1+\log\!\left(1+\sum_{k=0}^{N-1}\|G_k\|^2/b_0^2\right)\right] \le \log\!\left(10+\frac{20N(\sigma^2+\gamma^2)}{b_0^2}\right).$$