Stochastic gradient descent (SGD) is an efficient iterative method for tackling large-scale datasets, owing to its low computational complexity per iteration and its promising practical behavior; it has found wide application in optimization problems across a variety of areas, including machine learning and signal processing. At each iteration, SGD first computes a gradient based on a randomly selected example and then updates the model parameter along the negative gradient direction at the current iterate. This strategy of processing a single training example gives SGD a great computational advantage over its batch counterpart and has made it very popular in the big data era.
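The single-example update just described can be sketched in a few lines. The function below and the toy least-squares problem are illustrative stand-ins, not the setting of this paper:

```python
import numpy as np

def sgd(grad_fn, w0, examples, step_sizes, seed=0):
    """Minimal SGD sketch: one randomly drawn example per update.

    grad_fn(w, z) returns the gradient of the per-example loss at w;
    all names here are illustrative, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for eta in step_sizes:
        z = examples[rng.integers(len(examples))]  # sample a single example
        w = w - eta * grad_fn(w, z)                # step along the negative gradient
    return w

# Toy least-squares problem: loss(w, (x, y)) = 0.5 * (w @ x - y)^2,
# whose minimizer over the two examples below is w = (1, -1).
data = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), -1.0)]
grad = lambda w, z: (w @ z[0] - z[1]) * z[0]
w = sgd(grad, np.zeros(2), data, step_sizes=[0.5] * 200)
```

Unlike its batch counterpart, each update touches only the single sampled example, which is the source of the per-iteration cost advantage mentioned above.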
Theoretical properties of SGD are well understood for optimizing both convex and strongly convex objectives; the strong convexity assumption can be relaxed to other conditions on objective functions, e.g., error bound conditions and Polyak-Łojasiewicz conditions [2, 1]. By comparison, SGD applied to nonconvex objective functions is much less studied. Indeed, there is a huge gap between the theoretical understanding of SGD and its very promising practical behavior in the nonconvex learning setting, as exemplified by the training of highly nonconvex deep neural networks. For example, while theoretical analysis can only guarantee that SGD may get stuck in local minima, in practice it often converges to special local minima with good generalization ability, even in the absence of early stopping or explicit regularization.
Motivated by the popularity of SGD in training deep neural networks and other nonconvex models, as well as the huge gap between its theoretical understanding and its practical success, the theoretical analysis of SGD has received increasing attention recently. The first nonasymptotic convergence rates for nonconvex SGD were established in , and were later extended to stochastic variance reduction  and stochastic proximal gradient descent . However, these results impose a nontrivial boundedness assumption on the gradients at all iterates encountered in the learning process, which depends on the realization of the optimization process and is hard to check in practice. It remains unclear whether this assumption holds when learning takes place in an unbounded domain, in which case the existing analysis is not rigorous. In this paper, we aim to build a sound theoretical foundation for SGD by showing that the same convergence rates can be achieved without any boundedness assumption on gradients in the nonconvex learning setting. We also relax the standard smoothness assumption to a milder Hölder continuity assumption on gradients. As a further step, we consider objective functions satisfying a Polyak-Łojasiewicz (PL) condition, which is widely adopted in the nonconvex optimization literature. In this case, we derive convergence rates for SGD with iterations, again removing the boundedness assumption on gradients imposed in  to derive similar convergence rates. We also introduce a zero-variance condition under which we derive linear convergence of SGD. Finally, sufficient conditions in terms of step sizes are established for almost sure convergence measured by both function values and gradient norms.
II Problem Formulation and Main Results
Let be a probability measure defined on the sample space with being the input space and being the output space. We are interested in building a prediction rule based on a sequence of examples independently drawn from . We consider learning in a reproducing kernel Hilbert space (RKHS) associated to a Mercer kernel . The RKHS is defined as the completion of the linear span of the function set satisfying the reproducing property for any and , where denotes the inner product. The quality of a prediction rule at an example is measured by , where
is a differentiable loss function, with which we define the objective function as
We consider nonconvex loss functions in this paper. We implement the learning process by SGD to minimize the objective function over . Let and be the example sampled according to at the -th iteration. We update the model sequence in by
where denotes the gradient of with respect to the first argument, is a sequence of positive step sizes and we introduce for brevity. We denote the RKHS norm in .
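The RKHS update above can be sketched by storing the iterate through its support points and coefficients, since each step adds one kernel section centered at the sampled input. Everything below (the kernel, the loss, and all names) is an illustrative stand-in for the elided notation:

```python
import numpy as np

def kernel_sgd(K, loss_grad, stream, step_sizes):
    """Sketch of SGD in an RKHS (illustrative, not the paper's exact notation).

    The iterate f_t = sum_i c_i * K(x_i, .) is represented by its support
    points xs and coefficients cs; each step appends one new term:
        f_{t+1} = f_t - eta_t * loss'(f_t(x_t), y_t) * K(x_t, .)
    """
    xs, cs = [], []
    for (x, y), eta in zip(stream, step_sizes):
        fx = sum(c * K(xi, x) for xi, c in zip(xs, cs))  # evaluate f_t(x_t)
        xs.append(x)
        cs.append(-eta * loss_grad(fx, y))               # new coefficient
    return xs, cs

# Toy usage: Gaussian kernel, squared loss (so loss' = prediction - label),
# repeatedly fitting the single labelled point (0, 1).
K = lambda u, v: np.exp(-(u - v) ** 2)
xs, cs = kernel_sgd(K, lambda p, y: p - y, [(0.0, 1.0)] * 50, [0.5] * 50)
f0 = sum(c * K(xi, 0.0) for xi, c in zip(xs, cs))
```

The growing expansion mirrors the representer-style form of the iterates; in this toy run the prediction f(0) contracts toward the label 1 at each step.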
Our theoretical analysis is based on a fundamental assumption on the regularity of loss functions. Assumption 1 with corresponds to the smoothness assumption standard in nonconvex learning, which we extend here to a general Hölder continuity assumption on the gradient of loss functions.
Let and . We assume that the gradient of is -Hölder continuous in the sense that
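As a numerical illustration of Hölder continuity of gradients (the function, exponent, and constant below are a toy example, not the paper's assumption): the function f(x) = (2/3)|x|^{3/2} has gradient f'(x) = sign(x)|x|^{1/2}, which is Hölder continuous with exponent 1/2 and constant sqrt(2).

```python
import numpy as np

# Toy check: f'(x) = sign(x) * |x|^{1/2} satisfies
#     |f'(u) - f'(v)| <= L * |u - v|^{alpha}
# with alpha = 1/2 and L = sqrt(2) (values for this example only).
grad = lambda x: np.sign(x) * np.sqrt(np.abs(x))

rng = np.random.default_rng(0)
u = rng.uniform(-5, 5, 10_000)
v = rng.uniform(-5, 5, 10_000)
ratio = np.abs(grad(u) - grad(v)) / np.abs(u - v) ** 0.5
assert ratio.max() <= np.sqrt(2) + 1e-12
```

Note that this gradient is not Lipschitz near the origin, so the example falls outside the standard smoothness assumption but inside the Hölder class considered here.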
For any function with Hölder continuous gradients, we have the following lemma, which plays an important role in our analysis. Eq. (4) provides a quantitative measure of the accuracy of approximating with its first-order approximation, while (5) provides a self-bounding property, meaning that the norm of gradients can be controlled by function values.
Let be a differentiable function. Let and . If for all
then we have
Furthermore, if for all , then
Lemma 1, to be proved in Section IV-A, is an extension of Proposition 1 in  from univariate functions to multivariate functions. It should be noted that (5) improves Proposition 1 (d) in  by removing a factor of .
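The self-bounding property (5) can be illustrated numerically in the smooth case: a nonnegative function with an L-Lipschitz gradient satisfies ||grad f(w)||^2 <= 2 L f(w). The example below is an assumed toy instance, not the paper's notation, and it attains the bound with equality:

```python
import numpy as np

# Toy check of the self-bounding property in the smooth case (alpha = 1):
# f(x) = 0.5 * (x - 1)^2 is nonnegative with 1-Lipschitz gradient, and
# f'(x)^2 = (x - 1)^2 = 2 * f(x), so the bound holds with equality.
f = lambda x: 0.5 * (x - 1.0) ** 2
grad = lambda x: x - 1.0
L = 1.0

xs = np.linspace(-10, 10, 1001)
assert np.all(grad(xs) ** 2 <= 2 * L * f(xs) + 1e-12)
```

The usefulness of such a bound in the analysis is that gradient norms at the iterates can be controlled by function values, without any boundedness assumption on the gradients themselves.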
II-A General nonconvex objective functions
We now present theoretical results for SGD with general nonconvex loss functions. In this case we measure the progress of SGD in terms of gradient norms. Part (a) gives a nonasymptotic convergence rate determined by the step sizes, while Parts (b) and (c) provide sufficient conditions for asymptotic convergence measured by function values and gradient norms, respectively.
Part (a) was derived in  under the boundedness assumption for a constant and all . This boundedness assumption depends on the realization of the optimization process and is therefore difficult to check in practice; our analysis removes it. Although Parts (b) and (c) do not give convergence rates, an appealing property is that they consider individual iterates. By comparison, the convergence rates in (6) only hold for the minimum over the first iterates. The analysis for individual iterates is much more challenging than that for the minimum over all iterates. Indeed, Part (c) is based on a careful analysis using a contradiction argument.
We can derive explicit convergence rates by instantiating the step sizes in Theorem 2. If , the convergence rate in Part (b) becomes , which is minimax optimal up to a logarithmic factor.
II-B Objective functions with Polyak-Łojasiewicz inequality
We now proceed with our convergence analysis by imposing an assumption referred to as the PL inequality, named after Polyak and Łojasiewicz . Intuitively, this inequality means that the suboptimality of iterates measured by function values can be bounded by gradient norms. The PL condition is also referred to as the gradient dominance condition in the literature , and is widely adopted in the analysis of both convex and nonconvex optimization [7, 1, 8]. Examples of functions satisfying the PL condition include neural networks with one hidden layer, ResNets with linear activations and objective functions in matrix factorization . It should be noted that functions satisfying the PL condition are not necessarily convex.
We assume that the function satisfies the PL inequality with the parameter , i.e.,
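A standard illustration from the PL literature (an assumed example for exposition, not one used in this paper) is f(x) = x^2 + 3 sin^2 x, which is nonconvex yet is known to satisfy the PL inequality with parameter mu = 1/32. This can be checked numerically:

```python
import numpy as np

# Numeric check of the PL inequality on a standard nonconvex example:
#     ||grad f(x)||^2 >= 2 * mu * (f(x) - f*),   f* = 0 attained at x = 0,
# for f(x) = x^2 + 3 sin^2 x with mu = 1/32 (example and constant are
# from the PL literature, not from this paper).
f = lambda x: x ** 2 + 3 * np.sin(x) ** 2
grad = lambda x: 2 * x + 3 * np.sin(2 * x)
mu = 1.0 / 32.0

xs = np.linspace(-20, 20, 100_001)
assert np.all(grad(xs) ** 2 >= 2 * mu * f(xs) - 1e-12)
```

The function is visibly nonconvex (its second derivative 2 + 6 cos 2x changes sign), yet the PL inequality still ties suboptimality in function values to gradient norms, which is exactly what the convergence analysis exploits.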
Under Assumption 2, we can state convergence results measured by the suboptimality of function values. Part (a) provides a sufficient condition for almost sure convergence measured by function values and gradient norms, while Part (b) establishes explicit convergence rates for step sizes reciprocal to the iteration number. If , we derive convergence rates after iterations, which are minimax optimal even when the objective function is strongly convex. Part (c) shows that linear convergence can be achieved if , which extends the linear convergence of gradient descent  to the stochastic setting. This assumption means that the variances of the stochastic gradients vanish at since .
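In the zero-variance case SGD reduces to batch gradient descent, for which the PL inequality yields the classical geometric decay f(w_{t+1}) - f* <= (1 - mu/L)(f(w_t) - f*) with step size 1/L. A toy sketch on the same standard PL example (assumed for illustration, not the paper's proof):

```python
import numpy as np

f = lambda x: x ** 2 + 3 * np.sin(x) ** 2      # PL with mu = 1/32, f* = 0
grad = lambda x: 2 * x + 3 * np.sin(2 * x)
L, mu = 8.0, 1.0 / 32.0                        # f'' = 2 + 6 cos 2x <= 8

w, T = 3.0, 200
for _ in range(T):
    w -= grad(w) / L                           # zero-variance step, size 1/L
# Classical PL guarantee: geometric decay of the suboptimality gap.
assert f(w) <= (1 - mu / L) ** T * f(3.0) + 1e-12
```

In practice the iterates here converge far faster than the worst-case (1 - mu/L)^T bound, but the bound is what the linear convergence statement certifies.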
Conditions such as and were established for almost sure convergence with strongly convex objectives, and are extended here to nonconvex learning under PL conditions. Convergence rates were previously established for nonconvex optimization under PL conditions, a bounded gradient assumption  and smoothness assumptions . We derive the same convergence rates without the bounded gradient assumption and relax the smoothness assumption to Hölder continuity of .
III Related Work and Discussions
SGD has been comprehensively studied in the literature, mainly in the convex setting. For generally convex objective functions, regret bounds were established for SGD with iterates , which directly imply convergence rates . For strongly convex objective functions, the regret bounds can be improved to , which imply convergence rates . These results were extended to online learning in RKHSs [12, 13, 14] and to learning with a mirror map to capture the geometry of problems [15, 16].
As compared to the maturity of understanding in convex optimization, the convergence analysis of SGD in the nonconvex setting is far from satisfactory. Asymptotic convergence of SGD was established under the assumption for and all . Nonasymptotic convergence rates similar to (6) were established in  under the boundedness assumption for all . For objective functions satisfying PL conditions, convergence rates were established for SGD under boundedness assumptions for all . These boundedness assumptions depend on the realization of the optimization process, which is hard to check in practical implementations. In this paper we show that the same convergence rates can be established without any boundedness assumptions, which provides a rigorous foundation to safeguard SGD. Existing analyses also require an assumption on the smoothness of , which we relax to Hölder continuity of . Both the PL condition and the Hölder continuity condition do not depend on the iterates and can be checked on the objective functions themselves; they are standard in the literature and satisfied by many nonconvex models [8, 4, 1]. It should be noted that convergence analysis was also performed, when is convex  and nonconvex , without bounded gradient assumptions; both analyses, however, require to be strongly convex and to be smooth. Furthermore, we establish linear convergence of SGD in the case of zero variances, while such linear convergence was previously derived only for batch gradient descent applied to gradient-dominated objective functions . Necessary and sufficient conditions such as were established for convergence of online mirror descent in a strongly convex setting , and are partially extended here to convergence of SGD for gradient-dominated objective functions, measured by both function values and gradient norms.
IV-A Proof of Theorem 2
In this section, we present the proofs of Theorem 2 and Corollary 3 on the convergence of SGD applied to general nonconvex loss functions. To this aim, we first prove Lemma 1 and introduce Doob's forward convergence theorem on almost sure convergence (see, e.g.,  page 195).
Proof of Lemma 1.
Lemma 5 (Doob's forward convergence theorem). Let be a sequence of non-negative random variables with and let be a nested sequence of sets of random variables with for all . If for all , then converges to a nonnegative random variable a.s. and a.s.
Proof of Theorem 2.
We first prove Part (a). According to Assumption 1, we know
where the last inequality is due to (5). With Young's inequality for all
we get
Plugging the above inequality into (7) shows
Taking conditional expectation with respect to , we derive
It then follows that
from which we derive
Introduce
Then, it follows from the inequality that
Applying the above inequality recursively gives
from which we know
Plugging the above inequality back into (10) gives
A summation of the above inequality then implies
from which we directly get (6) with . This proves Part (a).
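For reference, Young's inequality invoked as (8) in the proof of Part (a) takes the standard form below (the specific exponents used in the proof are elided in this version of the text): for all $a, b \ge 0$ and conjugate exponents $p, q > 1$ with $1/p + 1/q = 1$,
\[
ab \;\le\; \frac{a^p}{p} + \frac{b^q}{q}.
\]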
We now prove Part (b). Multiplying both sides of (10) by , we can upper bound the term by
where we introduce . Define the stochastic process
Eq. (12) amounts to saying for all , which shows that is a non-negative supermartingale. Furthermore, the assumption implies that . We can apply Lemma 5 to show that for a non-negative random variable a.s. This together with the assumption implies for a non-negative random variable , where for all and a.s. Furthermore, it is clear a.s. that
where we have used the fact due to . That is, converges to a.s.
where we have used Young's inequality (8). Taking expectations over both sides and using , we derive
Suppose to the contrary that By Part (a) and the assumption , we know
Then there exists an such that for infinitely many and for infinitely many . Let be a subset of integers such that for every we can find an integer such that
Furthermore, we can assert that for every larger than the smallest integer in since .
Analogously, one can show
from which, together with (14) and for any larger than the smallest integer in , we get
and all . It then follows that
This together with (15) implies that
Part (b) implies that converges to a non-negative value, which, together with the assumption , shows that the right-hand side of (17) vanishes as , while the left-hand side is a positive number. This leads to a contradiction, and hence . ∎
IV-B Proof of Theorem 4
Lemma 6 ().
Let be a sequence of non-negative numbers such that and . Let and such that for any . Then we have .