Stochastic gradient descent (SGD) is a popular and efficient method with wide applications in training deep neural nets and other nonconvex models. While the behavior of SGD is well understood in the convex learning setting, the existing theoretical results for SGD applied to nonconvex objective functions are far from mature. For example, existing results impose a nontrivial assumption on the uniform boundedness of gradients for all iterates encountered in the learning process, which is hard to verify in practical implementations. In this paper, we establish a rigorous theoretical foundation for SGD in nonconvex learning by showing that this boundedness assumption can be removed without affecting convergence rates. In particular, we establish sufficient conditions for almost sure convergence as well as optimal convergence rates for SGD applied to both general nonconvex objective functions and gradient-dominated objective functions. Linear convergence is further derived in the case of zero variances.


## I Introduction

Stochastic gradient descent (SGD) is an efficient iterative method well suited to large-scale datasets because of its low computational complexity per iteration and its promising practical behavior. It has found wide application in solving optimization problems in a variety of areas, including machine learning and signal processing. At each iteration, SGD first calculates a gradient based on a randomly selected example and then updates the model parameter along the negative gradient direction at the current iterate. This strategy of processing a single training example per iteration gives SGD a great computational advantage over its batch counterpart and makes it very popular in the big data era.

Theoretical properties of SGD are well understood for optimizing both convex and strongly convex objectives, the latter of which can be relaxed to other assumptions on objective functions, e.g., error bound conditions and Polyak-Łojasiewicz conditions [2, 1]. In comparison, SGD applied to nonconvex objective functions is much less studied. Indeed, there is a huge gap between the theoretical understanding of SGD and its very promising practical behavior in the nonconvex learning setting, as exemplified by the training of highly nonconvex deep neural networks. For example, while theoretical analysis can only guarantee that SGD may get stuck in local minima, in practice it often converges to special ones with good generalization ability even in the absence of early stopping or explicit regularization.

Motivated by the popularity of SGD in training deep neural networks and nonconvex models, as well as the huge gap between its theoretical understanding and its practical success, theoretical analysis of SGD has received increasing attention recently. The first nonasymptotic convergence rates of nonconvex SGD were established in [3], and were extended to stochastic variance reduction [4] and stochastic proximal gradient descent [5]. However, these results impose a nontrivial boundedness assumption on the gradients at all iterates encountered in the learning process, which depends on the realization of the optimization process and is hard to check in practice. It remains unclear whether this assumption holds when learning takes place in an unbounded domain, in which scenario the existing analysis is not rigorous. In this paper, we aim to build a sound theoretical foundation for SGD by showing that the same convergence rates can be achieved without any boundedness assumption on gradients in the nonconvex learning setting. We also relax the standard smoothness assumption to a milder Hölder continuity assumption on gradients. As a further step, we consider objective functions satisfying a Polyak-Łojasiewicz (PL) condition, which is widely adopted in the literature on nonconvex optimization. In this case, we derive convergence rates for SGD with $T$ iterations, which also remove the boundedness assumption on gradients imposed in [1] to derive similar convergence rates. We introduce a zero-variance condition which allows us to derive linear convergence of SGD. Sufficient conditions in terms of step sizes are also established for almost sure convergence measured by both function values and gradient norms.

## II Problem Formulation and Main Results

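Let $\rho$ be a probability measure defined on the sample space $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$, with $\mathcal{X}$ being the input space and $\mathcal{Y}$ being the output space. We are interested in building a prediction rule $h$ based on a sequence of examples $\{z_t=(x_t,y_t)\}_{t\in\mathbb{N}}$ independently drawn from $\rho$. We consider learning in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated to a Mercer kernel $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$. The RKHS $\mathcal{H}_K$ is defined as the completion of the linear span of the function set $\{K_x(\cdot)=K(x,\cdot):x\in\mathcal{X}\}$ satisfying the reproducing property $h(x)=\langle h,K_x\rangle$ for any $x\in\mathcal{X}$ and $h\in\mathcal{H}_K$, where $\langle\cdot,\cdot\rangle$ denotes the inner product. The quality of a prediction rule $h$ at an example $z=(x,y)$ is measured by $\ell(h(x),y)$, where $\ell$ is a differentiable loss function, with which we define the objective function as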

$$\mathcal{E}(h)=\mathbb{E}_z\big[\ell(h(x),y)\big]=\int\ell(h(x),y)\,d\rho. \tag{1}$$

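We consider nonconvex loss functions in this paper. We implement the learning process by SGD to minimize the objective function over $\mathcal{H}_K$. Let $w_1\in\mathcal{H}_K$ be the initial iterate and let $z_t=(x_t,y_t)$ be the example sampled according to $\rho$ at the $t$-th iteration. We update the model sequence $\{w_t\}_{t\in\mathbb{N}}$ in $\mathcal{H}_K$ by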

$$w_{t+1}=w_t-\eta_t\nabla\ell(\langle w_t,K_{x_t}\rangle,y_t)K_{x_t}=w_t-\eta_t\nabla f(w_t,z_t), \tag{2}$$

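where $\nabla\ell$ denotes the gradient of $\ell$ with respect to the first argument, $\{\eta_t\}_{t\in\mathbb{N}}$ is a sequence of positive step sizes and we introduce $f(w,z)=\ell(\langle w,K_x\rangle,y)$ for brevity. We denote by $\|\cdot\|_2$ the RKHS norm in $\mathcal{H}_K$.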
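The update (2) can be sketched in code by storing each iterate as a kernel expansion. The following is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the squared-sigmoid loss, the zero initial iterate, the step-size schedule and all function names (`gaussian_kernel`, `loss_grad`, `kernel_sgd`, `predict`) are hypothetical choices made here for concreteness.

```python
import numpy as np

def gaussian_kernel(x, xp, bandwidth=1.0):
    """Mercer kernel K(x, x') = exp(-||x - x'||^2 / (2 * bandwidth^2)) (assumed choice)."""
    diff = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2)))

def loss_grad(pred, y):
    """Derivative in the first argument of the illustrative nonconvex loss
    l(a, y) = (sigmoid(a) - y)^2 (assumed choice)."""
    s = 1.0 / (1.0 + np.exp(-pred))
    return 2.0 * (s - y) * s * (1.0 - s)

def kernel_sgd(xs, ys, step_size, kernel=gaussian_kernel):
    """Run update (2) from w_1 = 0, storing w_t as sum_i coeffs[i] * K_{centers[i]}."""
    centers, coeffs = [], []
    for t, (x, y) in enumerate(zip(xs, ys), start=1):
        # <w_t, K_{x_t}> = w_t(x_t) by the reproducing property.
        pred = sum(c * kernel(xc, x) for c, xc in zip(coeffs, centers))
        # w_{t+1} = w_t - eta_t * l'(<w_t, K_{x_t}>, y_t) * K_{x_t}
        centers.append(x)
        coeffs.append(-step_size(t) * loss_grad(pred, y))
    return centers, coeffs

def predict(centers, coeffs, x, kernel=gaussian_kernel):
    """Evaluate the learned function at x."""
    return sum(c * kernel(xc, x) for c, xc in zip(coeffs, centers))

rng = np.random.default_rng(0)
xs = rng.normal(size=(50, 2))
ys = (xs[:, 0] > 0).astype(float)
centers, coeffs = kernel_sgd(xs, ys, step_size=lambda t: 0.5 * t ** -0.6)
print("prediction at the first input:", predict(centers, coeffs, xs[0]))
```

Each iteration adds one kernel function to the expansion, so the cost of evaluating $w_t$ grows linearly with $t$ in this naive sketch.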

Our theoretical analysis is based on a fundamental assumption on the regularity of loss functions. Assumption 1 with $\alpha=1$ corresponds to the smoothness assumption standard in nonconvex learning, which is extended here to a general Hölder continuity assumption on the gradient of loss functions.

###### Assumption 1.

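Let $\alpha\in(0,1]$ and $L>0$. We assume that the gradient of $f(\cdot,z)$ is $\alpha$-Hölder continuous in the sense that

$$\|\nabla f(w,z)-\nabla f(\tilde w,z)\|_2\le L\|w-\tilde w\|_2^{\alpha},\quad\forall w,\tilde w\in\mathcal{H}_K,\ z\in\mathcal{Z}.$$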

For any function with Hölder continuous gradients, the following lemma plays an important role in our analysis. Eq. (4) provides a quantitative measure of the accuracy of approximating $\phi$ by its first-order approximation, while (5) provides a self-bounding property, meaning that the norm of gradients can be controlled by function values.

###### Lemma 1.

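Let $\phi:\mathcal{H}_K\to\mathbb{R}$ be a differentiable function. Let $\alpha\in(0,1]$ and $L>0$. If for all $w,\tilde w\in\mathcal{H}_K$

$$\|\nabla\phi(w)-\nabla\phi(\tilde w)\|_2\le L\|w-\tilde w\|_2^{\alpha}, \tag{3}$$

then we have

$$\phi(\tilde w)\le\phi(w)+\langle\tilde w-w,\nabla\phi(w)\rangle+\frac{L}{1+\alpha}\|w-\tilde w\|_2^{1+\alpha},\quad\forall w,\tilde w\in\mathcal{H}_K. \tag{4}$$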

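Furthermore, if $\phi(w)\ge0$ for all $w\in\mathcal{H}_K$, then

$$\|\nabla\phi(w)\|_2^{\frac{1+\alpha}{\alpha}}\le\frac{(1+\alpha)L^{\frac{1}{\alpha}}}{\alpha}\phi(w),\quad\forall w\in\mathcal{H}_K. \tag{5}$$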

Lemma 1, to be proved in Section IV-A, extends Proposition 1 in [6] from univariate functions to multivariate functions. It should be noted that (5) improves Proposition 1 (d) in [6] by removing a factor from the bound.
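The self-bounding property (5) can be checked numerically on a simple example. The sketch below assumes the univariate softplus function $\phi(w)=\log(1+e^{w})$, which is non-negative and has a $(1/4)$-Lipschitz derivative, so that (3) holds with $\alpha=1$ and $L=1/4$; the function and the evaluation grid are illustrative choices that do not come from the paper.

```python
import numpy as np

def phi(w):
    """Softplus: non-negative, with a (1/4)-Lipschitz derivative (alpha = 1)."""
    return np.logaddexp(0.0, w)

def grad_phi(w):
    """Derivative of softplus, i.e. the sigmoid function."""
    return 1.0 / (1.0 + np.exp(-w))

L, alpha = 0.25, 1.0
w = np.linspace(-20.0, 20.0, 4001)

# Left- and right-hand sides of the self-bounding inequality (5).
lhs = np.abs(grad_phi(w)) ** ((1.0 + alpha) / alpha)
rhs = (1.0 + alpha) / alpha * L ** (1.0 / alpha) * phi(w)

print("inequality (5) holds on the grid:", bool(np.all(lhs <= rhs + 1e-12)))
print("largest ratio lhs / rhs on the grid:", float(np.max(lhs / rhs)))
```

The point of (5) is visible here: wherever the function value is small, the gradient must be small too, without any separate bound on the gradient.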

### II-A General nonconvex objective functions

We now present theoretical results for SGD with general nonconvex loss functions. In this case we measure the progress of SGD in terms of gradients. Part (a) gives a nonasymptotic convergence rate in terms of the step sizes, while Parts (b) and (c) provide sufficient conditions for asymptotic convergence measured by function values and gradient norms, respectively.

###### Theorem 2.

Suppose that Assumption 1 holds. Let $\{w_t\}_{t\in\mathbb{N}}$ be produced by (2) with the step sizes satisfying $C_1:=\sum_{t=1}^{\infty}\eta_t^{1+\alpha}<\infty$. Then, the following three statements hold.

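1. There is a constant $C$ independent of $T$ such that

$$\min_{t=1,\ldots,T}\mathbb{E}\big[\|\nabla\mathcal{E}(w_t)\|_2^2\big]\le C\Big(\sum_{t=1}^{T}\eta_t\Big)^{-1}. \tag{6}$$

2. $\{\mathcal{E}(w_t)\}_{t\in\mathbb{N}}$ converges to an almost surely (a.s.) bounded random variable.

3. If Assumption 1 holds with $\alpha=1$ and $\sum_{t=1}^{\infty}\eta_t=\infty$, then $\lim_{t\to\infty}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2]=0$.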

###### Remark 1.

Part (a) was derived in [3] under a boundedness assumption on the gradients at all iterates $w_t$. This boundedness assumption depends on the realization of the optimization process and is therefore difficult to check in practice. It is removed in our analysis. Although Parts (b) and (c) do not give convergence rates, an appealing property is that they consider individual iterates. In comparison, the convergence rate in (6) only holds for the minimum over the first $T$ iterates. The analysis for individual iterates is much more challenging than that for the minimum over all iterates. Indeed, Part (c) is based on a careful analysis by contradiction.
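The quantity controlled in (6), the minimum squared gradient norm over the first $T$ iterates, can be tracked in a small simulation. The sketch below is an illustration under assumptions made here rather than an experiment from the paper: it uses a finite-dimensional linear model in place of a general RKHS, takes $\rho$ to be the empirical measure of a synthetic sample, uses the nonconvex loss $f(w,z)=(\mathrm{sigmoid}(\langle w,x\rangle)-y)^2$, and picks step sizes $\eta_t=\eta_1t^{-\theta}$ with the hypothetical values $\eta_1=0.5$ and $\theta=0.6$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

Y = sigmoid(X @ w_true)

def grad_f(w, x, y):
    """Gradient of the nonconvex loss f(w, z) = (sigmoid(<w, x>) - y)^2."""
    s = sigmoid(x @ w)
    return 2.0 * (s - y) * s * (1.0 - s) * x

def grad_E(w):
    """Full gradient of E(w) = mean_i f(w, z_i), with rho the empirical measure."""
    s = sigmoid(X @ w)
    return (2.0 * (s - Y) * s * (1.0 - s)) @ X / n

def run_sgd(T, eta1=0.5, theta=0.6):
    """Run SGD from w_1 = 0 and record min_{s <= t} ||grad E(w_s)||^2 for each t."""
    w = np.zeros(d)
    running_min, best = [], np.inf
    for t in range(1, T + 1):
        best = min(best, float(np.sum(grad_E(w) ** 2)))
        running_min.append(best)
        i = rng.integers(n)  # draw one example z_t from rho
        w = w - eta1 * t ** (-theta) * grad_f(w, X[i], Y[i])
    return np.array(running_min)

running_min = run_sgd(T=5000)
print("min squared gradient norm after 1 and 5000 iterations:",
      running_min[0], running_min[-1])
```

The recorded sequence is nonincreasing by construction, which is why a bound on the minimum says nothing about the last iterate; that gap is what Parts (b) and (c) address.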

We can derive explicit convergence rates by instantiating the step sizes in Theorem 2. If , the convergence rate in Part (b) becomes which is minimax optimal up to a logarithmic factor.

###### Corollary 3.

Suppose that Assumption 1 holds. Let be the sequence produced by (2). Then,

1. If with , then .

2. If with , then .
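As a hedged worked illustration of how (6) turns into an explicit rate, assume polynomially decaying step sizes $\eta_t=\eta_1t^{-\theta}$ with $\theta\in(1/(1+\alpha),1)$. This parameterization and the symbols $\eta_1$ and $\theta$ are assumptions introduced here and need not coincide with the exact choices in Corollary 3.

```latex
% Step-size condition of Theorem 2: summable because \theta(1+\alpha) > 1.
\sum_{t=1}^{\infty}\eta_t^{1+\alpha}
  = \eta_1^{1+\alpha}\sum_{t=1}^{\infty}t^{-\theta(1+\alpha)} < \infty .

% Lower bound on the partial sums by comparison with an integral.
\sum_{t=1}^{T}\eta_t
  \ge \eta_1\int_{1}^{T+1}s^{-\theta}\,ds
  = \frac{\eta_1}{1-\theta}\bigl[(T+1)^{1-\theta}-1\bigr].

% Plugging into (6).
\min_{t=1,\ldots,T}\mathbb{E}\bigl[\|\nabla\mathcal{E}(w_t)\|_2^2\bigr]
  \le \frac{C(1-\theta)}{\eta_1\bigl[(T+1)^{1-\theta}-1\bigr]}
  = O\bigl(T^{\theta-1}\bigr).
```

Under this assumed schedule, the bound improves as $\theta$ decreases toward $1/(1+\alpha)$, the boundary of the summability condition.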

### II-B Objective functions with Polyak-Łojasiewicz inequality

We now proceed with our convergence analysis by imposing an assumption referred to as the PL inequality, named after Polyak and Łojasiewicz [2]. Intuitively, this inequality means that the suboptimality of iterates measured by function values can be bounded by gradient norms. The PL condition is also referred to as the gradient-dominated condition in the literature [4], and is widely adopted in the analysis of both convex and nonconvex optimization [7, 1, 8]. Examples of functions satisfying the PL condition include neural networks with one hidden layer, ResNets with linear activation and objective functions in matrix factorization [8]. It should be noted that functions satisfying the PL condition are not necessarily convex.

###### Assumption 2.

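We assume that the function $\mathcal{E}$ satisfies the PL inequality with the parameter $\mu>0$, i.e.,

$$\mathcal{E}(w)-\mathcal{E}(w^*)\le(2\mu)^{-1}\|\nabla\mathcal{E}(w)\|_2^2,\quad\forall w\in\mathcal{H}_K,$$

where $w^*\in\arg\min_{w\in\mathcal{H}_K}\mathcal{E}(w)$.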
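To make the point that a PL function need not be convex concrete, the sketch below checks the inequality numerically for the univariate function $x^2+3\sin^2(x)$. This function, the parameter value $\mu=1/32$ and the evaluation grid are illustrative assumptions introduced here; the check is a numerical sanity test on a finite grid, not a proof.

```python
import numpy as np

def E(x):
    """Nonconvex objective x^2 + 3 sin^2(x); its minimum value is E(0) = 0."""
    return x ** 2 + 3.0 * np.sin(x) ** 2

def grad_E(x):
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

def hess_E(x):
    return 2.0 + 6.0 * np.cos(2.0 * x)

mu = 1.0 / 32.0  # assumed PL parameter, checked numerically below
x = np.linspace(-10.0, 10.0, 20001)

suboptimality = E(x) - E(0.0)
pl_bound = grad_E(x) ** 2 / (2.0 * mu)

print("PL inequality holds on the grid:",
      bool(np.all(suboptimality <= pl_bound + 1e-12)))
print("second derivative is negative somewhere (nonconvex):",
      bool(np.any(hess_E(x) < 0.0)))
```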

Under Assumption 2, we can state convergence results measured by the suboptimality of function values. Part (a) provides a sufficient condition for almost sure convergence measured by function values and gradient norms, while Part (b) establishes explicit convergence rates for step sizes reciprocal to the iteration number. If $\alpha=1$, the convergence rates we derive after $T$ iterations are minimax optimal even when the objective function is strongly convex. Part (c) shows that linear convergence can be achieved under a zero-variance condition, which extends the linear convergence of gradient descent [1] to the stochastic setting. This condition means that the variances of the stochastic gradients vanish at $w^*$, since $\nabla\mathcal{E}(w^*)=0$.

###### Theorem 4.

Let Assumptions 1 and 2 hold. Let $\{w_t\}_{t\in\mathbb{N}}$ be produced by (2). Then the following statements hold.

1. If and , then a.s. and .

2. If , then for any we have , where is a constant independent of (explicitly given in the proof).

3. If , Assumption 1 holds with and , then

$$\mathbb{E}[\mathcal{E}(w_{t+1})]-\mathcal{E}(w^*)\le(1-\mu\eta)^{t}\big(\mathcal{E}(w_1)-\mathcal{E}(w^*)\big).$$

###### Remark 2.

Step-size conditions of this type were previously established for almost sure convergence with strongly convex objectives, and are extended here to nonconvex learning under PL conditions. Convergence rates were established for nonconvex optimization under PL conditions, a bounded gradient assumption and smoothness assumptions [1]. We derive the same convergence rates without the bounded gradient assumption and relax the smoothness assumption to Hölder continuity of the gradients.
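The zero-variance regime behind the linear convergence in Part (c) of Theorem 4 can be illustrated with a noiseless least-squares problem, where every stochastic gradient vanishes at $w^*$. The sketch below is an assumption-laden toy example, not taken from the paper: the inputs are drawn uniformly from the sphere of radius $\sqrt{d}$, so that $\mathcal{E}(w)=\frac12\|w-w^*\|_2^2$ and the gradients of $f(\cdot,z)$ are Lipschitz with constant $d$, and the constant step size $\eta=0.05$ is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 5, 2000, 0.05
w_star = rng.normal(size=d)

def sample():
    """Draw z = (x, y) with ||x||^2 = d and y = <w_star, x>, so grad f(w_star, z) = 0."""
    x = rng.normal(size=d)
    x *= np.sqrt(d) / np.linalg.norm(x)
    return x, float(x @ w_star)

def objective(w):
    """E(w) = 0.5 * E[(<w, x> - y)^2] = 0.5 * ||w - w_star||^2 since E[x x^T] = I."""
    return 0.5 * float(np.sum((w - w_star) ** 2))

w = np.zeros(d)
errors = [objective(w)]
for t in range(T):
    x, y = sample()
    w = w - eta * (x @ w - y) * x  # gradient of f(w, z) = 0.5 * (<w, x> - y)^2
    errors.append(objective(w))

print("E(w_1) - E(w*):", errors[0])
print("E(w_{T+1}) - E(w*):", errors[-1])
```

In this toy problem the expected error contracts by the factor $1-2\eta+\eta^2d$ at every step, a geometric decay of the same form as the bound in Part (c); with noisy labels the gradients at $w^*$ would no longer vanish and a constant step size would stall at a noise floor.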

## III Related Work and Discussions

SGD has been comprehensively studied in the literature, mainly in the convex setting. For generally convex objective functions, regret bounds $O(\sqrt{T})$ were established for SGD with $T$ iterates [9], which directly imply convergence rates $O(1/\sqrt{T})$ [10]. For strongly convex objective functions, regret bounds can be improved to $O(\log T)$ [11], which imply convergence rates $O(\log T/T)$. These results were extended to online learning in RKHSs [12, 13, 14] and learning with a mirror map to capture the geometry of problems [15, 16].

Compared with the maturity of the understanding in convex optimization, convergence analysis for SGD in the nonconvex setting is far from satisfactory. Asymptotic convergence of SGD was established under an assumption imposed at all iterates [17]. Nonasymptotic convergence rates similar to (6) were established in [3] under a boundedness assumption on the gradients at all iterates. For objective functions satisfying PL conditions, convergence rates were established for SGD under boundedness assumptions on the gradients at all iterates [1]. This boundedness assumption depends on the realization of the optimization process, which is hard to check in practical implementations. In this paper we show that the same convergence rates can be established without any boundedness assumptions. This establishes a rigorous foundation to safeguard SGD. Existing discussions also impose a smoothness assumption on the loss functions, which is relaxed here to Hölder continuity of the gradients. Neither the PL condition nor the Hölder continuity condition depends on the iterates. Both can be checked from the objective functions themselves, are standard in the literature and are satisfied by many nonconvex models [8, 4, 1].

It should be noted that convergence analysis was also performed without bounded gradient assumptions when the individual functions $f(\cdot,z)$ are convex [18] and nonconvex [19]. Both analyses, however, require the objective $\mathcal{E}$ to be strongly convex and $f(\cdot,z)$ to be smooth. Furthermore, we establish linear convergence of SGD in the case of zero variances, while this linear convergence was previously derived only for batch gradient descent applied to gradient-dominated objective functions [1]. Necessary and sufficient step-size conditions were established for convergence of online mirror descent in a strongly convex setting [18], which are partially extended here to convergence of SGD for gradient-dominated objective functions measured by both function values and gradient norms.

## IV Proofs

### IV-A Proof of Theorem 2

In this section, we present the proofs of Theorem 2 and Corollary 3 on the convergence of SGD applied to general nonconvex loss functions. To this aim, we first prove Lemma 1 and introduce Doob's forward convergence theorem on almost sure convergence (see, e.g., [20], page 195).

###### Proof of Lemma 1.

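Eq. (4) can be proved in the same way as Part (a) of Proposition 1 in [6]. We now prove (5) for non-negative $\phi$. We only need to consider the case $\nabla\phi(w)\neq0$. In this case, set

$$\tilde w=w-L^{-\frac{1}{\alpha}}\|\nabla\phi(w)\|_2^{\frac{1}{\alpha}}\|\nabla\phi(w)\|_2^{-1}\nabla\phi(w)$$

in (4). We derive

$$\begin{aligned}
0\le\phi(\tilde w)&\le\phi(w)-\Big\langle L^{-\frac{1}{\alpha}}\|\nabla\phi(w)\|_2^{\frac{1}{\alpha}}\frac{\nabla\phi(w)}{\|\nabla\phi(w)\|_2},\nabla\phi(w)\Big\rangle+\frac{L}{1+\alpha}L^{-\frac{1+\alpha}{\alpha}}\|\nabla\phi(w)\|_2^{\frac{1+\alpha}{\alpha}}\\
&=\phi(w)-L^{-\frac{1}{\alpha}}\|\nabla\phi(w)\|_2^{\frac{1+\alpha}{\alpha}}+L^{-\frac{1}{\alpha}}(1+\alpha)^{-1}\|\nabla\phi(w)\|_2^{\frac{1+\alpha}{\alpha}}\\
&=\phi(w)-\frac{\alpha L^{-\frac{1}{\alpha}}}{1+\alpha}\|\nabla\phi(w)\|_2^{\frac{1+\alpha}{\alpha}},
\end{aligned}$$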

from which the stated bound (5) follows. ∎

###### Lemma 5.

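Let $\{\tilde X_t\}_{t\in\mathbb{N}}$ be a sequence of non-negative random variables with $\mathbb{E}[\tilde X_1]<\infty$ and let $\{\mathcal{F}_t\}_{t\in\mathbb{N}}$ be a nested sequence of sets of random variables with $\mathcal{F}_t\subset\mathcal{F}_{t+1}$ for all $t\in\mathbb{N}$. If $\mathbb{E}[\tilde X_{t+1}\,|\,\mathcal{F}_t]\le\tilde X_t$ for all $t\in\mathbb{N}$, then $\tilde X_t$ converges to a non-negative random variable $\tilde X$ a.s. and $\tilde X<\infty$ a.s..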

###### Proof of Theorem 2.

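We first prove Part (a). According to Assumption 1, we know

$$\|\nabla\mathcal{E}(w)-\nabla\mathcal{E}(\tilde w)\|_2=\big\|\mathbb{E}[\nabla f(w,z)]-\mathbb{E}[\nabla f(\tilde w,z)]\big\|_2\le\mathbb{E}\big[\|\nabla f(w,z)-\nabla f(\tilde w,z)\|_2\big]\le L\|w-\tilde w\|_2^{\alpha}.$$

Therefore, $\nabla\mathcal{E}$ is $\alpha$-Hölder continuous. According to (4) with $\phi=\mathcal{E}$ and (2), we know

$$\begin{aligned}
\mathcal{E}(w_{t+1})&\le\mathcal{E}(w_t)+\langle w_{t+1}-w_t,\nabla\mathcal{E}(w_t)\rangle+\frac{L\|w_{t+1}-w_t\|_2^{1+\alpha}}{1+\alpha}\\
&=\mathcal{E}(w_t)-\eta_t\langle\nabla f(w_t,z_t),\nabla\mathcal{E}(w_t)\rangle+\frac{L\eta_t^{1+\alpha}}{1+\alpha}\|\nabla f(w_t,z_t)\|_2^{1+\alpha}\\
&\le\mathcal{E}(w_t)-\eta_t\langle\nabla f(w_t,z_t),\nabla\mathcal{E}(w_t)\rangle+\frac{L^2\eta_t^{1+\alpha}}{1+\alpha}\Big(\frac{1+\alpha}{\alpha}\Big)^{\alpha}f^{\alpha}(w_t,z_t),
\end{aligned} \tag{7}$$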

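where the last inequality is due to (5). By Young's inequality, which states that for all $\mu,v\in\mathbb{R}$ and $p,q>1$ with $p^{-1}+q^{-1}=1$

$$\mu v\le p^{-1}|\mu|^{p}+q^{-1}|v|^{q}, \tag{8}$$

we get $\big(\frac{1+\alpha}{\alpha}\big)^{\alpha}f^{\alpha}(w_t,z_t)\le(1+\alpha)f(w_t,z_t)+1-\alpha$. Plugging the above inequality into (7) shows

$$\mathcal{E}(w_{t+1})\le\mathcal{E}(w_t)-\eta_t\langle\nabla f(w_t,z_t),\nabla\mathcal{E}(w_t)\rangle+\frac{L^2\eta_t^{1+\alpha}}{1+\alpha}\big((1+\alpha)f(w_t,z_t)+1-\alpha\big).$$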
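For readers who want the intermediate step, the following worked derivation spells out one way to apply Young's inequality (8) here. The choice of exponents $p=1/\alpha$ and $q=1/(1-\alpha)$ is an assumption made for this illustration; it reproduces the bound used in the display above. We write $f$ for $f(w_t,z_t)\ge0$.

```latex
% Case \alpha \in (0,1): apply (8) with p = 1/\alpha and q = 1/(1-\alpha),
% taking \mu = ((1+\alpha)/\alpha)^{\alpha} f^{\alpha} and v = 1.
\Bigl(\frac{1+\alpha}{\alpha}\Bigr)^{\alpha}f^{\alpha}\cdot 1
  \le \alpha\Bigl[\Bigl(\frac{1+\alpha}{\alpha}\Bigr)^{\alpha}f^{\alpha}\Bigr]^{1/\alpha}
      + (1-\alpha)\cdot 1^{1/(1-\alpha)}
  = (1+\alpha)f + 1-\alpha .

% Case \alpha = 1: the inequality holds with equality, since both sides equal 2f.
```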

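Taking conditional expectation with respect to $z_t$, we derive

$$\begin{aligned}
\mathbb{E}_{z_t}[\mathcal{E}(w_{t+1})]&\le\mathcal{E}(w_t)-\eta_t\|\nabla\mathcal{E}(w_t)\|_2^2+L^2\eta_t^{1+\alpha}\big(\mathcal{E}(w_t)+1-\alpha\big)&&\text{(9)}\\
&\le\big(1+L^2\eta_t^{1+\alpha}\big)\mathcal{E}(w_t)-\eta_t\|\nabla\mathcal{E}(w_t)\|_2^2+L^2(1-\alpha)\eta_t^{1+\alpha}.&&\text{(10)}
\end{aligned}$$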

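It then follows that

$$\mathbb{E}[\mathcal{E}(w_{t+1})]\le\big(1+L^2\eta_t^{1+\alpha}\big)\mathbb{E}[\mathcal{E}(w_t)]+L^2(1-\alpha)\eta_t^{1+\alpha},$$

from which we derive

$$\mathbb{E}[\mathcal{E}(w_{t+1})]+L^2(1-\alpha)\sum_{k=t+1}^{\infty}\eta_k^{1+\alpha}\le\big(1+L^2\eta_t^{1+\alpha}\big)\Big(\mathbb{E}[\mathcal{E}(w_t)]+L^2(1-\alpha)\sum_{k=t}^{\infty}\eta_k^{1+\alpha}\Big).$$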
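The passage from the one-step bound to the last display can be made explicit as follows. The shorthand $S_t=\sum_{k=t}^{\infty}\eta_k^{1+\alpha}$ is notation introduced here for the illustration only.

```latex
% Add L^2(1-\alpha) S_{t+1} to both sides of the one-step bound and use
% \eta_t^{1+\alpha} + S_{t+1} = S_t:
\mathbb{E}[\mathcal{E}(w_{t+1})] + L^2(1-\alpha)S_{t+1}
  \le \bigl(1+L^2\eta_t^{1+\alpha}\bigr)\mathbb{E}[\mathcal{E}(w_t)]
      + L^2(1-\alpha)S_t .

% Since 1 \le 1 + L^2\eta_t^{1+\alpha} and L^2(1-\alpha)S_t \ge 0,
% the last term can be absorbed into the common factor:
\bigl(1+L^2\eta_t^{1+\alpha}\bigr)\mathbb{E}[\mathcal{E}(w_t)] + L^2(1-\alpha)S_t
  \le \bigl(1+L^2\eta_t^{1+\alpha}\bigr)
      \bigl(\mathbb{E}[\mathcal{E}(w_t)] + L^2(1-\alpha)S_t\bigr).
```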

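Introduce $A_t=\mathbb{E}[\mathcal{E}(w_t)]+L^2(1-\alpha)\sum_{k=t}^{\infty}\eta_k^{1+\alpha}$. Then, it follows from the inequality $1+a\le\exp(a)$ that $A_{t+1}\le\exp\big(L^2\eta_t^{1+\alpha}\big)A_t$. Applying this inequality recursively gives

$$A_{t+1}\le\exp\Big(L^2\sum_{k=1}^{t}\eta_k^{1+\alpha}\Big)A_1\le\exp\Big(L^2\sum_{k=1}^{\infty}\eta_k^{1+\alpha}\Big)A_1:=C_2,$$

from which we know $\mathbb{E}[\mathcal{E}(w_t)]\le C_2$ for all $t\in\mathbb{N}$. Plugging the above inequality back into (10) gives

$$\mathbb{E}[\mathcal{E}(w_{t+1})]\le\mathbb{E}[\mathcal{E}(w_t)]-\eta_t\mathbb{E}\big[\|\nabla\mathcal{E}(w_t)\|_2^2\big]+L^2\eta_t^{1+\alpha}(C_2+1-\alpha). \tag{11}$$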

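Summing the above inequality over $t=1,\ldots,T$ then implies

$$\sum_{t=1}^{T}\eta_t\mathbb{E}\big[\|\nabla\mathcal{E}(w_t)\|_2^2\big]\le\sum_{t=1}^{T}\big(\mathbb{E}[\mathcal{E}(w_t)]-\mathbb{E}[\mathcal{E}(w_{t+1})]\big)+L^2(C_2+1-\alpha)\sum_{t=1}^{T}\eta_t^{1+\alpha}\le\mathcal{E}(w_1)+L^2(C_2+1-\alpha)C_1,$$

from which we directly get (6) with $C=\mathcal{E}(w_1)+L^2(C_2+1-\alpha)C_1$. This proves Part (a).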

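We now prove Part (b). Multiplying both sides of (10) by $\prod_{k=t+1}^{\infty}\big(1+L^2\eta_k^{1+\alpha}\big)$, the term $\prod_{k=t+1}^{\infty}\big(1+L^2\eta_k^{1+\alpha}\big)\mathbb{E}_{z_t}[\mathcal{E}(w_{t+1})]$ can be upper bounded by

$$\begin{aligned}
&\prod_{k=t}^{\infty}\big(1+L^2\eta_k^{1+\alpha}\big)\mathcal{E}(w_t)+L^2(1-\alpha)\prod_{k=t+1}^{\infty}\big(1+L^2\eta_k^{1+\alpha}\big)\eta_t^{1+\alpha}\\
&\qquad\le\prod_{k=t}^{\infty}\big(1+L^2\eta_k^{1+\alpha}\big)\mathcal{E}(w_t)+C_3\eta_t^{1+\alpha},
\end{aligned} \tag{12}$$

where we introduce $C_3=L^2(1-\alpha)\prod_{k=1}^{\infty}\big(1+L^2\eta_k^{1+\alpha}\big)$. Introduce the stochastic process

$$\tilde X_t=\prod_{k=t}^{\infty}\big(1+L^2\eta_k^{1+\alpha}\big)\mathcal{E}(w_t)+C_3\sum_{k=t}^{\infty}\eta_k^{1+\alpha}.$$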

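Eq. (12) amounts to saying $\mathbb{E}_{z_t}[\tilde X_{t+1}]\le\tilde X_t$ for all $t\in\mathbb{N}$, which shows that $\{\tilde X_t\}_{t\in\mathbb{N}}$ is a non-negative supermartingale. Furthermore, the assumption $\sum_{t=1}^{\infty}\eta_t^{1+\alpha}<\infty$ implies that $\mathbb{E}[\tilde X_1]<\infty$. We can apply Lemma 5 to show that $\lim_{t\to\infty}\tilde X_t=\tilde X$ a.s. for a non-negative random variable $\tilde X$. This together with the assumption $\sum_{t=1}^{\infty}\eta_t^{1+\alpha}<\infty$ implies $\lim_{t\to\infty}\prod_{k=t}^{\infty}\big(1+L^2\eta_k^{1+\alpha}\big)\mathcal{E}(w_t)=\tilde Y$ a.s. for a non-negative random variable $\tilde Y$ with $\tilde Y<\infty$ a.s.. Furthermore, it is clear a.s. that

$$\big|\mathcal{E}(w_t)-\tilde Y\big|\le\Big|\Big(1-\prod_{k=t}^{\infty}\big(1+L^2\eta_k^{1+\alpha}\big)\Big)\mathcal{E}(w_t)\Big|+\Big|\prod_{k=t}^{\infty}\big(1+L^2\eta_k^{1+\alpha}\big)\mathcal{E}(w_t)-\tilde Y\Big|\xrightarrow[t\to\infty]{}0,$$

where we have used the fact $\lim_{t\to\infty}\prod_{k=t}^{\infty}\big(1+L^2\eta_k^{1+\alpha}\big)=1$ due to $\sum_{t=1}^{\infty}\eta_t^{1+\alpha}<\infty$. That is, $\mathcal{E}(w_t)$ converges to $\tilde Y$ a.s..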

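We now prove Part (c) by contradiction. According to Assumption 1 and Lemma 1, we know

$$\|\nabla f(w_k,z_k)\|_2\le\Big(\frac{(1+\alpha)L^{\frac{1}{\alpha}}f(w_k,z_k)}{\alpha}\Big)^{\frac{\alpha}{1+\alpha}}\le L^{\frac{1}{\alpha}}f(w_k,z_k)+(1+\alpha)^{-1},$$

where we have used Young's inequality (8). Taking expectations on both sides and using $\mathbb{E}[\mathcal{E}(w_k)]\le C_2$, we derive

$$\mathbb{E}\big[\|\nabla f(w_k,z_k)\|_2\big]\le L^{\frac{1}{\alpha}}\mathbb{E}[\mathcal{E}(w_k)]+(1+\alpha)^{-1}\le L^{\frac{1}{\alpha}}C_2+(1+\alpha)^{-1}:=C_4. \tag{13}$$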

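Suppose to the contrary that $\limsup_{t\to\infty}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2]>0$. By Part (a) and the assumption $\sum_{t=1}^{\infty}\eta_t=\infty$, we know

$$\liminf_{t\to\infty}\mathbb{E}\big[\|\nabla\mathcal{E}(w_t)\|_2\big]\le\liminf_{t\to\infty}\sqrt{\mathbb{E}\big[\|\nabla\mathcal{E}(w_t)\|_2^2\big]}=0.$$

Then there exists an $\epsilon>0$ such that $\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2]<\epsilon$ for infinitely many $t$ and $\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2]>2\epsilon$ for infinitely many $t$. Let $\mathcal{T}$ be a subset of integers such that for every $t\in\mathcal{T}$ we can find an integer $k(t)>t$ such that

$$\mathbb{E}\big[\|\nabla\mathcal{E}(w_t)\|_2\big]<\epsilon,\quad\mathbb{E}\big[\|\nabla\mathcal{E}(w_{k(t)})\|_2\big]>2\epsilon\quad\text{and}\quad\epsilon\le\mathbb{E}\big[\|\nabla\mathcal{E}(w_k)\|_2\big]\le2\epsilon\ \text{ for all } t<k<k(t). \tag{14}$$

Furthermore, we can assert that $\eta_t\le\epsilon/(2LC_4)$ for every $t$ larger than the smallest integer in $\mathcal{T}$, since $\lim_{t\to\infty}\eta_t=0$.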

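By (13), (14) and Assumption 1 with $\alpha=1$, we know

$$\begin{aligned}
\epsilon&\le\mathbb{E}\big[\|\nabla\mathcal{E}(w_{k(t)})\|_2\big]-\mathbb{E}\big[\|\nabla\mathcal{E}(w_t)\|_2\big]\le\mathbb{E}\big[\|\nabla\mathcal{E}(w_{k(t)})-\nabla\mathcal{E}(w_t)\|_2\big]\\
&\le\sum_{k=t}^{k(t)-1}\mathbb{E}\big[\|\nabla\mathcal{E}(w_{k+1})-\nabla\mathcal{E}(w_k)\|_2\big]\le L\sum_{k=t}^{k(t)-1}\mathbb{E}\big[\|w_{k+1}-w_k\|_2\big]\\
&=L\sum_{k=t}^{k(t)-1}\eta_k\mathbb{E}\big[\|\nabla f(w_k,z_k)\|_2\big]\le LC_4\sum_{k=t}^{k(t)-1}\eta_k.
\end{aligned} \tag{15}$$

Analogously, one can show

$$\begin{aligned}
\mathbb{E}\big[\|\nabla\mathcal{E}(w_{t+1})\|_2\big]-\mathbb{E}\big[\|\nabla\mathcal{E}(w_t)\|_2\big]&\le\mathbb{E}\big[\|\nabla\mathcal{E}(w_{t+1})-\nabla\mathcal{E}(w_t)\|_2\big]\\
&\le L\,\mathbb{E}\big[\|w_{t+1}-w_t\|_2\big]\le L\eta_t\mathbb{E}\big[\|\nabla f(w_t,z_t)\|_2\big]\le LC_4\eta_t,
\end{aligned}$$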

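from which, together with (14) and $\eta_t\le\epsilon/(2LC_4)$ for any $t$ larger than the smallest integer in $\mathcal{T}$, we get

$$\mathbb{E}\big[\|\nabla\mathcal{E}(w_k)\|_2\big]\ge\epsilon/2\quad\text{for every } k=t,t+1,\ldots,k(t)-1$$

and all $t\in\mathcal{T}$. It then follows that

$$\mathbb{E}\big[\|\nabla\mathcal{E}(w_k)\|_2^2\big]\ge\big(\mathbb{E}\big[\|\nabla\mathcal{E}(w_k)\|_2\big]\big)^2\ge\epsilon^2/4 \tag{16}$$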

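for every $k=t,\ldots,k(t)-1$ and all $t\in\mathcal{T}$. Putting (16) back into (11), $\mathbb{E}[\mathcal{E}(w_{k(t)})]$ can be upper bounded by

$$\mathbb{E}[\mathcal{E}(w_t)]-\sum_{k=t}^{k(t)-1}\eta_k\mathbb{E}\big[\|\nabla\mathcal{E}(w_k)\|_2^2\big]+L^2C_2\sum_{k=t}^{k(t)-1}\eta_k^2\le\mathbb{E}[\mathcal{E}(w_t)]-\frac{\epsilon^2}{4}\sum_{k=t}^{k(t)-1}\eta_k+L^2C_2\sum_{k=t}^{k(t)-1}\eta_k^2.$$

This together with (15) implies that

$$\frac{\epsilon^3}{4LC_4}\le\frac{\epsilon^2}{4}\sum_{k=t}^{k(t)-1}\eta_k\le\mathbb{E}[\mathcal{E}(w_t)]-\mathbb{E}[\mathcal{E}(w_{k(t)})]+L^2C_2\sum_{k=t}^{k(t)-1}\eta_k^2,\quad\forall t\in\mathcal{T}. \tag{17}$$

Part (b) implies that $\{\mathbb{E}[\mathcal{E}(w_t)]\}_{t\in\mathbb{N}}$ converges to a non-negative value, which together with the assumption $\sum_{t=1}^{\infty}\eta_t^2<\infty$ shows that the right-hand side of (17) vanishes as $t\to\infty$, while the left-hand side is a positive number. This leads to a contradiction, and therefore $\lim_{t\to\infty}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2]=0$. ∎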

###### Proof of Corollary 3.

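For the step sizes in Part (a), we know $\sum_{t=1}^{\infty}\eta_t^{1+\alpha}<\infty$, so Theorem 2 applies. Eq. (6) and

$$\frac{1}{1-\gamma}\big[(T+1)^{1-\gamma}-1\big]\le\sum_{t=1}^{T}t^{-\gamma}\le\frac{1}{1-\gamma}T^{1-\gamma},\quad\gamma\in(0,1)$$

immediately imply the stated bound. Part (b) can be proved analogously and we omit the proof. ∎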
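The two-sided bound on the partial sums $\sum_{t=1}^{T}t^{-\gamma}$ used in this proof is easy to sanity-check numerically. The helper name `check_sum_bounds` and the particular values of $\gamma$ and $T$ below are illustrative choices introduced here.

```python
import numpy as np

def check_sum_bounds(gamma, T):
    """Return (lower bound, partial sum, upper bound) for sum_{t=1}^T t^(-gamma)."""
    t = np.arange(1, T + 1, dtype=float)
    partial_sum = float(np.sum(t ** (-gamma)))
    lower = ((T + 1) ** (1.0 - gamma) - 1.0) / (1.0 - gamma)
    upper = T ** (1.0 - gamma) / (1.0 - gamma)
    return lower, partial_sum, upper

results = {(g, T): check_sum_bounds(g, T)
           for g in (0.1, 0.5, 0.9) for T in (1, 10, 1000)}

for (g, T), (lower, s, upper) in results.items():
    print(f"gamma={g}, T={T}: {lower:.4f} <= {s:.4f} <= {upper:.4f}")
```

Both bounds come from comparing the sum with the integral of $s^{-\gamma}$, which is why the partial sums grow like $T^{1-\gamma}$.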

### IV-B Proof of Theorem 4

###### Lemma 6 ([12]).

Let be a sequence of non-negative numbers such that and . Let and such that for any . Then we have .

###### Proof of Theorem 4.

We first prove Part (a). We introduce . By (10) and Assumption 2,

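$$\mathbb{E}[\mathcal{E}(w_{t+1})]\le\big(1+L^2\eta_t^{1+\alpha}\big)\mathbb{E}[\mathcal{E}(w_t)]-2\mu\eta_t\big(\mathbb{E}[\mathcal{E}(w_t)]-\mathcal{E}(w^*)\big)+L^2(1-\alpha)\eta_t^{1+\alpha}.$$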

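Subtracting $\mathcal{E}(w^*)$ from both sides gives

$$\begin{aligned}
\mathbb{E}[\mathcal{E}(w_{t+1})]-\mathcal{E}(w^*)&\le\big(1+L^2\eta_t^{1+\alpha}\big)\big(\mathbb{E}[\mathcal{E}(w_t)]-\mathcal{E}(w^*)\big)+L^2\eta_t^{1+\alpha}\mathcal{E}(w^*)\\
&\qquad-2\mu\eta_t\big(\mathbb{E}[\mathcal{E}(w_t)]-\mathcal{E}(w^*)\big)+L^2(1-\alpha)\eta_t^{1+\alpha}\\
&=\big(1+L^2\eta_t^{1+\alpha}-2\mu\eta_t\big)\big(\mathbb{E}[\mathcal{E}(w_t)]-\mathcal{E}(w^*)\big)+C_5\eta_t^{1+\alpha},
\end{aligned}$$

where we introduce $C_5=L^2\big(\mathcal{E}(w^*)+1-\alpha\big)$. The assumption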