I Introduction
Stochastic gradient descent (SGD) is an efficient iterative method for tackling large-scale datasets, owing to its low computational complexity per iteration and its strong practical performance, and it has found wide application in optimization problems across a variety of areas, including machine learning and signal processing. At each iteration, SGD first computes a stochastic gradient based on a randomly selected example and then updates the model parameters along the negative gradient direction from the current iterate. This strategy of processing a single training example per iteration makes SGD very popular in the big data era, as it enjoys a great computational advantage over its batch counterpart.
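The per-iteration strategy described above can be sketched in a few lines. The following is a minimal illustration on a synthetic least-squares problem (all names, the toy data, and the step-size schedule are our own choices, not from the paper):

```python
import numpy as np

def sgd(grad_f, w0, examples, step_sizes):
    """Plain SGD: at each step, draw one random example and move against its gradient."""
    w = w0
    rng = np.random.default_rng(0)
    for eta in step_sizes:
        z = examples[rng.integers(len(examples))]  # one randomly selected example
        w = w - eta * grad_f(w, z)                 # step along the negative gradient
    return w

# Toy usage: least squares on synthetic data (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=200)
examples = list(zip(X, y))
grad = lambda w, z: 2 * (z[0] @ w - z[1]) * z[0]   # gradient of (x.w - y)^2
w_hat = sgd(grad, np.zeros(3), examples, [0.05 / np.sqrt(t + 1) for t in range(2000)])
```

Note that each iteration touches a single example, which is exactly the source of the computational advantage over batch gradient descent mentioned above.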
Theoretical properties of SGD are well understood for optimizing both convex and strongly convex objectives; the latter assumption can be relaxed to other conditions on objective functions, e.g., error bound conditions and Polyak-Łojasiewicz conditions [2, 1]. In comparison, SGD applied to nonconvex objective functions is much less studied. Indeed, there is a huge gap between the theoretical understanding of SGD and its very promising practical behavior in the nonconvex learning setting, as exemplified by the training of highly nonconvex deep neural networks. For example, while theoretical analysis can only guarantee that SGD may get stuck in local minima, in practice it often converges to special local minima with good generalization ability, even in the absence of early stopping or explicit regularization.
Motivated by the popularity of SGD in training deep neural networks and other nonconvex models, as well as the huge gap between its theoretical understanding and practical success, theoretical analysis of SGD has received increasing attention recently. The first non-asymptotic convergence rates of nonconvex SGD were established in [3], and were extended to stochastic variance reduction [4] and stochastic proximal gradient descent [5]. However, these results impose a nontrivial boundedness assumption on the gradients at all iterates encountered in the learning process, which depends on the realization of the optimization process and is hard to check in practice. It remains unclear whether this assumption holds when learning takes place in an unbounded domain, in which case the existing analysis is not rigorous. In this paper, we aim to build a sound theoretical foundation for SGD by showing that the same convergence rates can be achieved without any boundedness assumption on gradients in the nonconvex learning setting. We also relax the standard smoothness assumption to a milder Hölder continuity assumption on gradients. As a further step, we consider objective functions satisfying a Polyak-Łojasiewicz (PL) condition, which is widely adopted in the literature on nonconvex optimization. In this case, we derive convergence rates for SGD in terms of the number of iterations, again removing the boundedness assumption on gradients imposed in [1] to derive similar rates. We introduce a zero-variance condition which allows us to derive linear convergence of SGD. Sufficient conditions on the step sizes are also established for almost sure convergence measured by both function values and gradient norms.
II Problem Formulation and Main Results
Let be a probability measure defined on the sample space , with being the input space and being the output space. We are interested in building a prediction rule based on a sequence of examples independently drawn from . We consider learning in a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel . The RKHS is defined as the completion of the linear span of the function set , satisfying the reproducing property for any and , where denotes the inner product. The quality of a prediction rule at an example is measured by , where is a differentiable loss function, with which we define the objective function as
(1) 
We consider nonconvex loss functions in this paper. We implement the learning process by SGD to minimize the objective function over . Let and be the example sampled according to at the th iteration. We update the model sequence in by
(2) 
where denotes the gradient of with respect to its first argument, is a sequence of positive step sizes, and we introduce for brevity. We denote the RKHS norm in .
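The functional update (2) can be sketched concretely: since the gradient of the loss at the sampled example, pushed through the reproducing property, is a multiple of the kernel section at the sampled input, each SGD step appends one kernel atom to the iterate. The sketch below assumes a Gaussian kernel and the squared loss; all names and the toy data are our own, not from the paper:

```python
import numpy as np

def rbf(x, xp, gamma=5.0):
    """A Gaussian (Mercer) kernel; any Mercer kernel K would serve here."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xp)) ** 2))

class KernelSGD:
    """Sketch of functional SGD in an RKHS: the iterate is f_t = sum_i a_i K(x_i, .),
    so each step appends one kernel atom centred at the sampled input."""

    def __init__(self, kernel, loss_grad):
        self.kernel, self.loss_grad = kernel, loss_grad
        self.centres, self.coefs = [], []

    def predict(self, x):
        return sum(a * self.kernel(c, x) for c, a in zip(self.centres, self.coefs))

    def step(self, x, y, eta):
        # f <- f - eta * loss'(f(x), y) * K(x, .)
        g = self.loss_grad(self.predict(x), y)
        self.centres.append(x)
        self.coefs.append(-eta * g)

# Toy usage (synthetic): fit y = sin(3x) on 50 points with eta_t = 0.5 / sqrt(t).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=50)
Y = np.sin(3 * X)
model = KernelSGD(rbf, lambda p, y: 2 * (p - y))   # squared-loss derivative
for t in range(400):
    i = int(rng.integers(50))
    model.step(X[i], Y[i], 0.5 / np.sqrt(t + 1))
mse = float(np.mean([(model.predict(x) - y) ** 2 for x, y in zip(X, Y)]))
```

A practical consequence visible in the sketch is that the expansion grows by one term per iteration, so prediction at step t costs O(t) kernel evaluations.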
Our theoretical analysis is based on a fundamental assumption on the regularity of loss functions. Assumption 1 with corresponds to the smoothness assumption standard in nonconvex learning, which we extend here to a general Hölder continuity assumption on the gradient of the loss function.
Assumption 1.
Let and . We assume that the gradient of is Hölder continuous in the sense that
For any function with a Hölder continuous gradient, we have the following lemma, which plays an important role in our analysis. Eq. (4) provides a quantitative measure of the accuracy of approximating with its first-order approximation, while (5) provides a self-bounding property, meaning that the norm of the gradient can be controlled by function values.
Lemma 1.
Let be a differentiable function. Let and . If for all
(3) 
then, we have
(4) 
Furthermore, if for all , then
(5) 
Lemma 1, to be proved in Section IV-A, is an extension of Proposition 1 in [6] from univariate functions to multivariate functions. It should be noted that (5) improves Proposition 1 (d) in [6] by removing a factor.
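To make the Hölder continuity in Assumption 1 concrete, the snippet below numerically checks the Hölder ratio for a toy gradient. The example function is our own: f(x) = |x|^(1+α) has gradient f'(x) = (1+α) sign(x) |x|^α, which is α-Hölder continuous but, for α < 1, not Lipschitz, which is exactly the regime the relaxed assumption covers:

```python
import numpy as np

# Our own toy example (not from the paper): the gradient of f(x) = |x|^(1+alpha).
def grad(x, alpha=0.5):
    return (1 + alpha) * np.sign(x) * np.abs(x) ** alpha

def holder_ratio(g, x, y, alpha):
    """|g(x) - g(y)| / |x - y|^alpha -- bounded over all pairs iff g is alpha-Holder."""
    return abs(g(x) - g(y)) / abs(x - y) ** alpha

rng = np.random.default_rng(0)
pts = rng.uniform(-2, 2, size=(1000, 2))
# With alpha = 0.5 the ratio stays bounded (the empirical Holder constant).
ratios_half = [holder_ratio(grad, x, y, 0.5) for x, y in pts]
# With alpha = 1 (the Lipschitz ratio) it blows up for pairs near the origin.
ratios_one = [holder_ratio(grad, x, y, 1.0) for x, y in pts]
```

One can verify analytically that the 0.5-Hölder constant here is at most 1.5·√2 ≈ 2.12, consistent with the sampled ratios.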
II-A General nonconvex objective functions
We now present theoretical results for SGD with general nonconvex loss functions. In this case we measure the progress of SGD in terms of gradient norms. Part (a) gives a nonasymptotic convergence rate determined by the step sizes, while Parts (b) and (c) provide sufficient conditions for asymptotic convergence measured by function values and gradient norms, respectively.
Theorem 2.
Suppose that Assumption 1 holds. Let be produced by (2) with the step sizes satisfying . Then, the following three statements hold.

There is a constant independent of such that
(6) 
converges to an almost surely (a.s.) bounded random variable.

If Assumption 1 holds with and , then .
Remark 1.
Part (a) was derived in [3] under the boundedness assumption for a constant and all . This boundedness assumption depends on the realization of the optimization process and is therefore difficult to check in practice. We remove it in our analysis. Although Parts (b) and (c) do not give convergence rates, an appealing property is that they consider individual iterates. In comparison, the convergence rates in (6) only hold for the minimum over the first iterates. The analysis for individual iterates is much more challenging than that for the minimum over all iterates. Indeed, Part (c) is based on a careful argument by contradiction.
We can derive explicit convergence rates by instantiating the step sizes in Theorem 2. If , the convergence rate in Part (b) becomes , which is minimax optimal up to a logarithmic factor.
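The convergence metric in (6), the minimum of the squared gradient norm over the first iterates, can be illustrated numerically. The toy run below (a double-well objective and noise scale of our own choosing, not from the paper) uses step sizes proportional to 1/√t and tracks the running minimum:

```python
import numpy as np

# Toy illustration of the min-over-iterates metric in (6).
# f(w) = w^4/4 - w^2/2 is nonconvex, with global minima at w = +-1.
def true_grad(w):
    return w ** 3 - w

rng = np.random.default_rng(0)
w, running_min, best = 3.0, [], np.inf
for t in range(1, 5001):
    g = true_grad(w) + 0.1 * rng.normal()   # unbiased stochastic gradient
    w -= 0.1 / np.sqrt(t) * g               # eta_t proportional to 1 / sqrt(t)
    best = min(best, true_grad(w) ** 2)     # minimum over the first t iterates
    running_min.append(best)
```

The running minimum is nonincreasing by construction, which is why guarantees of the form (6) are weaker than guarantees on individual iterates, matching the discussion in Remark 1.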
II-B Objective functions with the Polyak-Łojasiewicz inequality
We now proceed with our convergence analysis by imposing an assumption referred to as the PL inequality, named after Polyak and Łojasiewicz [2]. Intuitively, this inequality means that the suboptimality of iterates measured by function values can be bounded by gradient norms. The PL condition is also referred to as the gradient domination condition in the literature [4], and is widely adopted in the analysis of both convex and nonconvex optimization [7, 1, 8]. Examples of functions satisfying the PL condition include neural networks with one hidden layer, ResNets with linear activations, and objective functions in matrix factorization [8]. It should be noted that functions satisfying the PL condition are not necessarily convex.
Assumption 2.
We assume that the function satisfies the PL inequality with the parameter , i.e.,
where .
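A classical illustration (our own worked example, with an empirical grid estimate rather than a proved constant) is f(x) = x² + 3 sin²(x): it is nonconvex, since f''(x) = 2 + 6 cos(2x) is negative near x = π/2, yet a positive PL constant exists for the inequality ½ f'(x)² ≥ μ (f(x) − f*) with f* = 0:

```python
import numpy as np

# Numerical check of the PL inequality for the nonconvex f(x) = x^2 + 3 sin^2(x).
f  = lambda x: x ** 2 + 3 * np.sin(x) ** 2
fp = lambda x: 2 * x + 3 * np.sin(2 * x)

xs = np.linspace(-10, 10, 100001)
xs = xs[np.abs(xs) > 1e-8]              # exclude the minimiser to avoid 0/0
# Smallest observed ratio (1/2) f'(x)^2 / (f(x) - f*): an empirical PL constant.
mu = float(np.min(0.5 * fp(xs) ** 2 / f(xs)))
```

Since the ratio stays bounded away from zero on the grid while f'' changes sign, this single function exhibits both properties claimed above: gradient domination without convexity.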
Under Assumption 2, we can state convergence results measured by the suboptimality of function values. Part (a) provides a sufficient condition for almost sure convergence measured by function values and gradient norms, while Part (b) establishes explicit convergence rates for step sizes reciprocal to the iteration number. If , we derive convergence rates after iterations, which are minimax optimal even when the objective function is strongly convex. Part (c) shows that linear convergence can be achieved if , which extends the linear convergence of gradient descent [1] to the stochastic setting. The assumption means that the variance of the stochastic gradients vanishes at , since .
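The zero-variance regime of Part (c) can be illustrated with an interpolation problem of our own construction (not from the paper): in a consistent least-squares system, every per-example gradient vanishes at the minimizer, and constant-step SGD then contracts geometrically:

```python
import numpy as np

# Zero-variance illustration: y_i = x_i . w* exactly, so each stochastic
# gradient is zero at w* and constant-step SGD converges linearly.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_star = rng.normal(size=5)
y = X @ w_star                            # no noise: the model interpolates

w = np.zeros(5)
errs = []
for t in range(5000):
    i = rng.integers(100)
    g = 2 * (X[i] @ w - y[i]) * X[i]      # vanishes at w = w*
    w -= 0.02 * g                         # constant step size
    errs.append(float(np.linalg.norm(w - w_star)))
```

With label noise added, the same constant-step run would stall at a noise floor proportional to the step size, which is why the zero-variance condition is what makes linear convergence possible.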
Theorem 4.
Remark 2.
Conditions such as and were established for almost sure convergence with strongly convex objectives; we extend them here to nonconvex learning under PL conditions. Convergence rates were previously established for nonconvex optimization under PL conditions, a bounded gradient assumption, and smoothness assumptions [1]. We derive the same convergence rates without the bounded gradient assumption and relax the smoothness assumption to Hölder continuity of .
III Related Work and Discussions
SGD has been comprehensively studied in the literature, mainly in the convex setting. For generally convex objective functions, regret bounds were established for SGD with iterates [9], which directly imply convergence rates [10]. For strongly convex objective functions, regret bounds can be improved to [11], which imply convergence rates . These results were extended to online learning in RKHSs [12, 13, 14] and to learning with a mirror map to capture the geometry of the problem [15, 16].
Compared with the maturity of understanding in convex optimization, the convergence analysis of SGD in the nonconvex setting is far from satisfactory. Asymptotic convergence of SGD was established under the assumption for and all [17]. Non-asymptotic convergence rates similar to (6) were established in [3] under the boundedness assumption for all . For objective functions satisfying PL conditions, convergence rates were established for SGD under boundedness assumptions for all [1]. These boundedness assumptions depend on the realization of the optimization process and are hard to check in practical implementations. In this paper we show that the same convergence rates can be established without any boundedness assumptions, which builds a rigorous foundation to safeguard SGD. Existing analyses also require a smoothness assumption on , which we relax to Hölder continuity of . Both the PL condition and the Hölder continuity condition do not depend on the iterates and can be checked on the objective functions themselves; they are standard in the literature and satisfied by many nonconvex models [8, 4, 1]. It should be noted that convergence analysis without bounded gradient assumptions was also performed when is convex [18] and nonconvex [19]; both of these works, however, require to be strongly convex and to be smooth. Furthermore, we establish linear convergence of SGD in the zero-variance case, while such linear convergence was previously derived only for batch gradient descent applied to gradient-dominated objective functions [1]. Necessary and sufficient conditions such as were established for the convergence of online mirror descent in a strongly convex setting [18]; we partially extend them to the convergence of SGD for gradient-dominated objective functions, measured by both function values and gradient norms.
IV Proofs
IV-A Proof of Theorem 2
In this section, we present the proofs of Theorem 2 and Corollary 3 on the convergence of SGD applied to general nonconvex loss functions. To this end, we first prove Lemma 1 and introduce Doob's forward convergence theorem on almost sure convergence (see, e.g., [20], page 195).
Proof of Lemma 1.
Lemma 5.
Let be a sequence of nonnegative random variables with and let be a nested sequence of sets of random variables with for all . If for all , then converges a.s. to a nonnegative random variable, and a.s.
Proof of Theorem 2.
We first prove Part (a). According to Assumption 1, we know
Therefore, is Hölder continuous. According to (4) with and (2), we know
(7) 
where the last inequality is due to (5). With Young's inequality, which holds for all
(8) 
we get an upper bound on the cross term. Plugging this bound into (7) shows
Taking conditional expectation with respect to , we derive
(9)  
(10) 
It then follows that
from which we derive
Introduce . Then, it follows from the inequality that . Applying the above inequality recursively gives
from which we know . Plugging this bound back into (10) gives
(11) 
A summation of the above inequality then implies
from which we directly get (6) with . This proves Part (a).
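For reference, the inequality invoked at (8) is presumably the standard Young inequality (the exponents used in the proof would depend on the Hölder parameter in Assumption 1):

```latex
% Standard Young inequality, presumably the form used in (8):
ab \le \frac{a^p}{p} + \frac{b^q}{q},
\qquad a, b \ge 0,\quad p, q > 1,\quad \tfrac{1}{p} + \tfrac{1}{q} = 1.
```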
We now prove Part (b). Multiplying both sides of (10) by , we can upper bound the term by
(12) 
where we introduce . Define the stochastic process
Eq. (12) amounts to saying that for all , which shows that is a nonnegative supermartingale. Furthermore, the assumption implies that . We can apply Lemma 5 to show that a.s. for a nonnegative random variable . This, together with the assumption , implies for a nonnegative random variable , where for all and a.s. Furthermore, it is clear that, a.s.,
where we have used the fact due to . That is, converges a.s. to .
We now prove Part (c) by contradiction. According to Assumption 1 and Lemma 1, we know
where we have used Young's inequality (8). Taking expectations on both sides and using , we derive
(13) 
Suppose to the contrary that . By Part (a) and the assumption , we know
Then there exists an such that for infinitely many and for infinitely many . Let be a subset of integers such that for every we can find an integer such that
(14) 
Furthermore, we can assert that for every larger than the smallest integer in since .
By (13), (14) and Assumption 1 with , we know
(15) 
Analogously, one can show
from which, together with (14), we get, for any larger than the smallest integer in ,
and all . It then follows that
(16) 
for every and all . Putting (16) back into (11), we can upper bound by
This together with (15) implies that
(17) 
Part (b) implies that converges to a nonnegative value, which, together with the assumption , shows that the right-hand side of (17) vanishes as , while the left-hand side is a positive number. This leads to a contradiction, and hence . ∎
IV-B Proof of Theorem 4
Lemma 6 ([12]).
Let be a sequence of nonnegative numbers such that and . Let and such that for any . Then we have .