# Stochastic subgradient method converges on tame functions

This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method with rigorous convergence guarantees for a wide class of problems arising in data science---including all popular deep learning architectures.




## 1 Introduction

In this work, we study the long-term behavior of the stochastic subgradient method on nonsmooth and nonconvex functions. Setting the stage, consider the optimization problem

$$\min_{x\in\mathbb{R}^d}\; f(x),$$

where $f\colon\mathbb{R}^d\to\mathbb{R}$ is a locally Lipschitz continuous function. The stochastic subgradient method simply iterates the steps

$$x_{k+1}=x_k-\alpha_k(y_k+\xi_k)\quad\text{with}\quad y_k\in\partial f(x_k). \tag{1.1}$$

Here $\partial f(x)$ denotes the Clarke subdifferential [9]. Informally, the set $\partial f(x)$ is the convex hull of limits of gradients at nearby differentiable points. In classical circumstances, the subdifferential reduces to more familiar objects. Namely, when $f$ is $C^1$-smooth at $x$, the subdifferential $\partial f(x)$ consists only of the gradient $\nabla f(x)$, while for convex functions, it reduces to the subdifferential in the sense of convex analysis. The positive sequence $\{\alpha_k\}$ is user specified, and it controls the step-sizes of the algorithm. As is typical for stochastic subgradient methods, we will assume that this sequence is square summable but not summable, meaning $\sum_k \alpha_k^2<\infty$ and $\sum_k \alpha_k=\infty$. Finally, the stochasticity is modeled by the random (noise) sequence $\{\xi_k\}$. We make the standard assumption that, conditioned on the past, each random variable $\xi_k$ has mean zero and its second moment grows at a controlled rate.
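The iteration (1.1) can be sketched in a few lines. The sketch below is purely illustrative and not part of the paper's development: it uses the nonsmooth function $f(x)=\|x\|_1$, the selection $\operatorname{sign}(x)\in\partial f(x)$, the hypothetical step-size choice $\alpha_k=1/k$ (square summable but not summable), and Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def subgrad_l1(x):
    # sign(x) selects one element of the Clarke subdifferential of ||x||_1
    # (at a zero coordinate, 0 is a valid element of the interval [-1, 1])
    return np.sign(x)

x = np.array([2.0, -3.0])
for k in range(1, 10001):
    alpha = 1.0 / k                      # square summable, not summable
    y = subgrad_l1(x)                    # y_k in ∂f(x_k)
    xi = rng.normal(scale=0.1, size=2)   # mean-zero noise xi_k
    x = x - alpha * (y + xi)

print(np.linalg.norm(x))
```

With these choices the iterates drift toward the unique minimizer at the origin, consistent with the guarantees developed below.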

Though variants of the stochastic subgradient method (1.1) date back to Robbins-Monro’s pioneering 1951 work [29], their convergence behavior is still largely not understood in nonsmooth and nonconvex settings. In particular, the following question remains open.

Does the (stochastic) subgradient method have any convergence guarantees on locally Lipschitz functions, which may be neither smooth nor convex?

That this question remains unanswered is somewhat concerning, as the stochastic subgradient method forms a core numerical subroutine for several widely used solvers, including Google's TensorFlow [1] and the open source PyTorch [28] library.

Convergence behavior of (1.1) is well understood when applied to convex, smooth, and more generally, weakly convex problems. In these three cases, almost surely, every limit point of the iterate sequence is first-order critical [27], meaning $0\in\partial f(x)$. Moreover, rates of convergence in terms of natural optimality/stationarity measures are available. In summary, the rates are $O(k^{-1/2})$, $O(k^{-1/4})$, and $O(k^{-1/4})$, for functions that are convex [26], smooth [18], and $\rho$-weakly convex [14, 13], respectively. In particular, the convergence guarantee above for $\rho$-weakly convex functions appeared only recently in [14, 13], with the Moreau envelope playing a central role.

Though widely applicable, these previous results on the convergence of the stochastic subgradient method do not apply to even relatively simple non-pathological functions. It is not only toy examples, however, that lack convergence guarantees, but the entire class of deep neural networks with nonsmooth activation functions (e.g., ReLU). Since such networks are routinely trained in practice, it is worthwhile to understand whether the iterates $x_k$ tend to a meaningful limit.

In this paper, we provide a positive answer to this question for a wide class of locally Lipschitz functions; indeed, the function class we consider is virtually exhaustive in data scientific contexts (see Corollary 5.11 for consequences in deep learning). Aside from mild technical conditions, the only meaningful assumption we make is that $f$ strictly decreases along any trajectory of the differential inclusion $\dot z(t)\in-\partial f(z(t))$ emanating from a noncritical point. Under this assumption, a standard Lyapunov-type argument shows that every limit point of the stochastic subgradient method is critical for $f$, almost surely. Techniques of this type can be found for example in the monograph of Kushner-Yin [22, Theorem 5.2.1] and the landmark papers of Benaïm-Hofbauer-Sorin [2, 3]. Here, we provide a self-contained treatment, which facilitates direct extensions to "proximal" variants of the stochastic subgradient method.¹ In particular, our analysis follows closely the recent work of Duchi-Ruan [17, Section 3.4.1] on convex composite minimization.

¹Concurrent to this work, the independent preprint [24] also provides convergence guarantees for the stochastic projected subgradient method, under the assumption that the objective function is "subdifferentially regular" and the constraint set is convex. Subdifferential regularity rules out functions with downward kinks and cusps, such as deep networks with ReLU activation functions. Besides subsuming the subdifferentially regular case, the results of the current paper apply to the broad class of Whitney stratifiable functions, which includes all popular deep network architectures.

The main question that remains, therefore, is which functions decrease along the continuous subgradient curves. Let us look for inspiration at convex functions, which are well-known to satisfy this property [7, 8]. Indeed, if $f$ is convex and $x(\cdot)$ is any absolutely continuous curve, then the "chain rule" holds:

$$\frac{d}{dt}(f\circ x)(t)=\big\langle \partial f(x(t)),\dot x(t)\big\rangle\quad\text{for a.e. } t\ge 0. \tag{1.2}$$

An elementary linear algebraic argument then shows that if $x(\cdot)$ satisfies $\dot x(t)\in-\partial f(x(t))$ a.e., then automatically $\dot x(t)$ is the minimal-norm element of $-\partial f(x(t))$. Therefore, integrating (1.2) yields the desired descent guarantee

$$f(x(0))-f(x(t))=\int_{0}^{t}\mathrm{dist}^2\big(0;\partial f(x(\tau))\big)\,d\tau\quad\text{for all } t\ge 0. \tag{1.3}$$

Evidently, exactly the same argument yields the chain rule (1.2) for subdifferentially regular functions. These are the functions for which each subgradient defines a linear lower-estimator of the function up to first order; see for example [10, Section 2.4] or [31, Definition 7.25]. Nonetheless, subdifferentially regular functions preclude "downward cusps", and therefore still do not capture such simple examples as $f(x)=-|x|$. It is worthwhile to mention that one cannot expect (1.3) to always hold. Indeed, there are pathological locally Lipschitz functions that do not satisfy (1.3); one example is a univariate 1-Lipschitz function whose Clarke subdifferential is the unit interval $[-1,1]$ at every point [30, 6].
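As a sanity check on (1.3), one can verify the identity numerically in the simplest smooth convex case. The snippet below is an illustrative sketch, not part of the paper: for $f(x)=x^2$ the gradient flow $\dot x=-2x$ has the closed form $x(t)=x(0)e^{-2t}$, and both sides of (1.3) can be compared directly.

```python
import numpy as np

x0, t = 1.5, 0.7
x = lambda s: x0 * np.exp(-2.0 * s)   # gradient flow of f(x) = x^2: xdot = -2x

lhs = x(0.0) ** 2 - x(t) ** 2         # f(x(0)) - f(x(t))

# integrate dist^2(0; ∂f(x(tau))) = (2 x(tau))^2 with the trapezoid rule
taus = np.linspace(0.0, t, 100_001)
vals = (2.0 * x(taus)) ** 2
rhs = np.sum((vals[1:] + vals[:-1]) / 2.0 * np.diff(taus))

print(lhs, rhs)
```

The two quantities agree up to quadrature error, as (1.3) predicts.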

In this work, we isolate a different structural property of the function $f$, which guarantees the validity of (1.2) and therefore of the descent condition (1.3). We will assume that the graph of the function admits a partition into finitely many smooth manifolds, which fit together in a regular pattern. Formally, we require the graph of $f$ to admit a so-called Whitney stratification, and we will call such functions Whitney stratifiable. Whitney stratifications have already figured prominently in optimization, beginning with the seminal work [4]. An important subclass of Whitney stratifiable functions consists of semi-algebraic functions [23], meaning those whose graphs can be written as a finite union of sets, each defined by finitely many polynomial inequalities. Semialgebraicity is preserved under all the typical functional operations in optimization (e.g. sums, compositions, inf-projections), and therefore semi-algebraic functions are usually easy to recognize. More generally still, "semianalytic" functions [23] and those that are "definable in an o-minimal structure" are Whitney stratifiable [34]. The latter function class, in particular, shares all the robustness and analytic properties of semi-algebraic functions, while encompassing many more examples. Case in point, Wilkie [36] famously showed that there is an o-minimal structure that contains both the exponential and all semi-algebraic functions.²

²The term "tame" used in the title has a technical meaning. Tame sets are those whose intersection with any ball is definable in some o-minimal structure. The manuscript [20] provides a nice exposition on the role of tame sets and functions in optimization.

The key observation for us, which originates in [16, Section 5.1], is that any locally Lipschitz Whitney stratifiable function necessarily satisfies the chain rule (1.2) along any absolutely continuous curve. Consequently, the descent guarantee (1.3) holds along any subgradient trajectory, and our convergence guarantees for the stochastic subgradient method become applicable. Since the composition of two definable functions is definable, it follows immediately from Wilkie's o-minimal structure that nonsmooth deep neural networks built from definable pieces (such as quadratics, hinge losses, and log-exp functions) are themselves definable. Hence, the results of this paper endow stochastic subgradient methods, applied to definable deep networks, with rigorous convergence guarantees.
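To make the deep learning connection concrete, here is a minimal sketch (not the paper's construction) of a one-hidden-layer ReLU network with a quadratic loss: every piece is definable, and standard backpropagation with the admissible selection $\mathrm{relu}'(0):=0$ returns one element of the Clarke subdifferential.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def loss_and_subgrad(W, x, target):
    # one-hidden-layer ReLU network: a composition of definable pieces
    z = W @ x
    h = relu(z)
    r = h.sum() - target
    loss = 0.5 * r ** 2                  # quadratic piece
    # backprop with the selection relu'(0) := 0, one admissible element
    # of the Clarke subdifferential [0, 1] at the kink
    dz = (z > 0).astype(float) * r
    dW = np.outer(dz, x)
    return loss, dW

W = np.array([[1.0, -1.0], [0.5, 2.0]])
x = np.array([1.0, 1.0])
loss, dW = loss_and_subgrad(W, x, target=1.0)
print(loss, dW)
```

Note that the first hidden unit sits exactly at its kink ($z_1=0$), so the returned matrix is one selection from a genuinely set-valued subdifferential.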

Validity of the chain rule (1.2) for Whitney stratifiable functions is not new. It was already proved in [16, Section 5.1] for semi-algebraic functions, though identical arguments hold more broadly for Whitney stratifiable functions. These results, however, are somewhat hidden in the paper [16], which is possibly why they have thus far been underutilized. In this manuscript, we provide a self-contained review of the material from [16, Section 5.1], highlighting only the most essential ingredients and streamlining some of the arguments.

Though the discussion above is for unconstrained problems, the techniques we develop apply much more broadly to constrained problems of the form

$$\min_{x\in\mathcal{X}}\; f(x)+g(x).$$

Here $f$ and $g$ are locally Lipschitz continuous functions and $\mathcal{X}\subseteq\mathbb{R}^d$ is an arbitrary closed set. The popular proximal stochastic subgradient method simply iterates the steps

$$\left\{\begin{array}{l}\text{Sample an estimator }\zeta_k\text{ of }\partial f(x_k),\\[2pt]\text{Select } x_{k+1}\in\operatorname*{argmin}_{x\in\mathcal{X}}\Big\{\langle\zeta_k,x\rangle+g(x)+\frac{1}{2\alpha_k}\|x-x_k\|^2\Big\}.\end{array}\right. \tag{1.4}$$

Combining our techniques with those in [17] quickly yields subsequential convergence guarantees for this algorithm. Note that we impose no convexity assumptions on $f$, $g$, or $\mathcal{X}$.
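For one concrete instance of the update (1.4): with the illustrative choices $g(x)=\lambda\|x\|_1$ and $\mathcal{X}=\mathbb{R}^d$ (assumptions made for this sketch, not requirements of the paper), the argmin has the familiar soft-thresholding closed form.

```python
import numpy as np

def prox_subgrad_step(xk, zeta, alpha, lam):
    # argmin_x <zeta, x> + lam*||x||_1 + (1/(2*alpha)) * ||x - xk||^2
    # reduces to soft-thresholding of the gradient-style step xk - alpha*zeta
    v = xk - alpha * zeta
    return np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)

xk = np.array([1.0, -0.2, 0.05])
zeta = np.array([0.5, 0.1, 0.0])
x_next = prox_subgrad_step(xk, zeta, alpha=0.1, lam=1.0)
print(x_next)
```

Coordinates whose magnitude falls below the threshold $\alpha\lambda$ are set exactly to zero, which is the hallmark of the proximal step.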

The outline of this paper is as follows. In Section 2, we fix the notation for the rest of the manuscript. Section 3 provides a self-contained treatment of asymptotic consistency for discrete approximations of differential inclusions. In Section 4, we specialize the results of the previous section to the stochastic subgradient method. Finally, in Section 5, we verify the sufficient conditions for subsequential convergence for a broad class of locally Lipschitz functions, including those that are subdifferentially regular and Whitney stratifiable. In particular, we specialize our results to deep learning settings in Corollary 5.11. In the final Section 6, we extend the results of the previous sections to the proximal setting.

## 2 Preliminaries

Throughout, we will mostly use standard notation on differential inclusions, as set out for example in the monographs of Borkar [5], Clarke-Ledyaev-Stern-Wolenski [10], and Smirnov [32]. We will always equip the Euclidean space $\mathbb{R}^d$ with an inner product $\langle\cdot,\cdot\rangle$ and the induced norm $\|x\|=\sqrt{\langle x,x\rangle}$. The distance of a point $x$ to a set $Q\subseteq\mathbb{R}^d$ will be written as $\mathrm{dist}(x;Q):=\inf_{y\in Q}\|x-y\|$. The indicator function of $Q$, denoted $\delta_Q$, is defined to be zero on $Q$ and $+\infty$ off it. The symbol $\mathbb{B}$ will denote the closed unit ball in $\mathbb{R}^d$, while $\mathbb{B}_r(x)$ will stand for the closed ball of radius $r$ around $x$. We will use $\mathbb{R}_+$ to denote the set of nonnegative real numbers.

### 2.1 Absolutely continuous curves

Any continuous function $x\colon\mathbb{R}_+\to\mathbb{R}^d$ is called a curve in $\mathbb{R}^d$. All curves in $\mathbb{R}^d$ comprise the set $C(\mathbb{R}_+,\mathbb{R}^d)$. We will say that a sequence of functions $f_k$ converges to $f$ in $C(\mathbb{R}_+,\mathbb{R}^d)$ if $f_k$ converge to $f$ uniformly on compact intervals, that is, for all $T>0$, we have

$$\lim_{k\to\infty}\,\sup_{t\in[0,T]}\|f_k(t)-f(t)\|=0.$$

Recall that a curve $x(\cdot)$ is absolutely continuous if there exists a map $y\colon\mathbb{R}_+\to\mathbb{R}^d$ that is integrable on any compact interval and satisfies

$$x(t)=x(0)+\int_0^t y(\tau)\,d\tau\quad\text{for all } t\ge 0.$$

Moreover, if this is the case, then the equality $\dot x(t)=y(t)$ holds for a.e. $t\ge0$. Henceforth, for brevity, we will call absolutely continuous curves arcs. We will often use the observation that if $f$ is locally Lipschitz continuous and $x(\cdot)$ is an arc, then the composition $f\circ x$ is absolutely continuous.

### 2.2 Set-valued maps and the Clarke subdifferential

A set-valued map $G\colon X\rightrightarrows\mathbb{R}^d$ is a mapping from a set $X\subseteq\mathbb{R}^d$ to the powerset of $\mathbb{R}^d$. Thus $G(x)$ is a subset of $\mathbb{R}^d$, for each $x\in X$. We will use the notation

$$G^{-1}(v):=\{x\in X: v\in G(x)\}$$

for the preimage of a vector $v$. The map $G$ is outer-semicontinuous at a point $x\in X$ if for any sequences $x_i\to x$ and $v_i\in G(x_i)$ converging to some vector $v$, the inclusion $v\in G(x)$ holds.

The most important set-valued map for our work will be the generalized derivative in the sense of Clarke [9], a notion we now review. Consider a locally Lipschitz continuous function $f\colon\mathbb{R}^d\to\mathbb{R}$. The well-known Rademacher's theorem guarantees that $f$ is differentiable almost everywhere. Taking this into account, the Clarke subdifferential of $f$ at any point $x$ is the set [10, Theorem 8.1]

$$\partial f(x):=\mathrm{conv}\Big\{\lim_{i\to\infty}\nabla f(x_i): x_i\xrightarrow{\Omega} x\Big\},$$

where $\Omega$ is any full-measure subset of $\mathbb{R}^d$ such that $f$ is differentiable at each of its points. It is standard that the map $x\mapsto\partial f(x)$ is outer-semicontinuous and its images are nonempty, compact, convex sets for each $x\in\mathbb{R}^d$; see for example [10, Proposition 1.5 (a,e)].

Analogously to the smooth setting, a point $x$ is called (Clarke) critical for $f$ whenever the inclusion $0\in\partial f(x)$ holds. Equivalently, these are the points at which the Clarke directional derivative of $f$ is nonnegative in every direction [10, Section 2.1]. A real number $r$ is called a critical value of $f$ if there exists a critical point $x$ satisfying $r=f(x)$.
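The definition is easy to instantiate by hand for $f(x)=|x|$: gradients at nearby differentiable points take the values $\pm1$, so the Clarke subdifferential at the origin is their convex hull $[-1,1]$, and the origin is Clarke critical. The sketch below (an illustration, not library code) encodes this example as an interval.

```python
import numpy as np

# For f(x) = |x|, gradients near 0 take values -1 (x < 0) and +1 (x > 0);
# the Clarke subdifferential at 0 is their convex hull, the interval [-1, 1].
def clarke_subdiff_abs(x, tol=1e-12):
    if abs(x) > tol:
        g = float(np.sign(x))
        return (g, g)            # singleton {sign(x)}
    return (-1.0, 1.0)           # interval [-1, 1]

def is_clarke_critical(x):
    lo, hi = clarke_subdiff_abs(x)
    return lo <= 0.0 <= hi       # checks whether 0 ∈ ∂f(x)

print(is_clarke_critical(0.0), is_clarke_critical(0.3))
```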

## 3 Differential inclusions and discrete approximations

In this section, we discuss the asymptotic behavior of discrete approximations of differential inclusions. All the elements of the analysis we present, in varying generality, can be found in the works of Benaïm-Hofbauer-Sorin [2, 3], Borkar [5], and Duchi-Ruan [17]. Out of these, we most closely follow the work of Duchi-Ruan [17].

### 3.1 Functional convergence of discrete approximations

Let $X\subseteq\mathbb{R}^d$ be a closed set and let $G\colon X\rightrightarrows\mathbb{R}^d$ be a set-valued map. Then an arc $x(\cdot)$ is called a trajectory of $G$ if it satisfies the differential inclusion

$$\dot x(t)\in G(x(t))\quad\text{for a.e. } t\ge0. \tag{3.1}$$

Notice that the image of any trajectory is automatically contained in $X$, since arcs are continuous and $X$ is closed. In this work, we will primarily focus on iterative algorithms that aim to asymptotically track a trajectory of the differential inclusion (3.1) using a noisy discretization with vanishing step-sizes. Though our discussion allows for an arbitrary set-valued map $G$, the reader should keep in mind that the most important example for us will be $G=-\partial f$, where $f$ is a locally Lipschitz function.

Throughout, we will consider the following iteration sequence:

$$x_{k+1}=x_k+\alpha_k(y_k+\xi_k). \tag{3.2}$$

Here $\{\alpha_k\}$ is a sequence of positive step-sizes, $y_k$ should be thought of as an approximate evaluation of $G$ at some point near $x_k$, and $\{\xi_k\}$ is a sequence of "errors".

Our immediate goal is to isolate reasonable conditions under which the sequence $\{x_k\}$ asymptotically tracks a trajectory of the differential inclusion (3.1). Following the work of Duchi-Ruan [17] on stochastic approximation, we stipulate the following assumptions.

###### Assumption A (Standing assumptions).
1. All limit points of $\{x_k\}$ lie in $X$.

2. The iterates and approximate evaluations are bounded, i.e., $\sup_k\|x_k\|<\infty$ and $\sup_k\|y_k\|<\infty$.

3. The sequence $\{\alpha_k\}$ is nonnegative, square summable, but not summable:

$$\alpha_k\ge0,\qquad \sum_{k=1}^\infty\alpha_k=\infty,\qquad\text{and}\qquad \sum_{k=1}^\infty\alpha_k^2<\infty.$$

4. The weighted noise sequence is convergent: $\sum_{i=1}^k\alpha_i\xi_i$ converges to some vector as $k\to\infty$.

5. For any unbounded increasing sequence of indices $\{k_j\}\subseteq\mathbb{N}$ such that $x_{k_j}$ converges to some point $\bar x$, it holds:

$$\lim_{n\to\infty}\,\mathrm{dist}\Big(\frac{1}{n}\sum_{j=1}^n y_{k_j},\; G(\bar x)\Big)=0.$$

Some comments are in order. Conditions 1, 2, and 3 are in some sense minimal, though the boundedness condition must be checked for each particular algorithm. Condition 4 guarantees that the noise sequence $\{\xi_k\}$ does not grow too quickly relative to the rate at which the step-sizes $\alpha_k$ decrease. The key Condition 5 summarizes the way in which the values $y_k$ are approximate evaluations of $G$, up to convexification.
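Conditions 3 and 4 can be illustrated numerically. The sketch below makes the hypothetical choices $\alpha_k=1/k$ and i.i.d. standard normal errors (assumptions for this illustration only): the step sums diverge, the squared step sums stay finite, and the weighted-noise partial sums settle down.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 200_000
alphas = 1.0 / np.arange(1, K + 1)   # alpha_k = 1/k: not summable, square summable
xis = rng.normal(size=K)             # mean-zero "errors" xi_k

partial = np.cumsum(alphas * xis)    # partial sums of the weighted noise
tail_spread = partial[K // 2:].max() - partial[K // 2:].min()

print(alphas.sum(), (alphas ** 2).sum(), tail_spread)
```

The tail spread of the partial sums is tiny, reflecting the almost-sure convergence asserted in Condition 4.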

To formalize the idea of asymptotic approximation, let us define the time points $t_0:=0$ and $t_k:=\sum_{i=1}^{k}\alpha_i$, for $k\ge1$. Let $x(\cdot)$ now be the linear interpolation of the discrete path:

$$x(t):=x_k+\frac{t-t_k}{t_{k+1}-t_k}(x_{k+1}-x_k)\quad\text{for } t\in[t_k,t_{k+1}). \tag{3.3}$$

For each $\tau\ge0$, define the time-shifted curve $x^\tau(t):=x(\tau+t)$.
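The interpolation (3.3) is straightforward to implement; the helper below is an illustrative sketch (names and the toy data are invented for the example), with the shifted curve obtained simply as `lambda t: x(tau + t)`.

```python
import numpy as np

def interpolate(iterates, alphas):
    """Return x(t): the linear interpolation (3.3) of the discrete path."""
    t_knots = np.concatenate(([0.0], np.cumsum(alphas)))  # t_0, t_1, ...
    pts = np.asarray(iterates, dtype=float)

    def x(t):
        # locate the interval [t_k, t_{k+1}) containing t
        k = np.searchsorted(t_knots, t, side="right") - 1
        k = min(k, len(pts) - 2)
        w = (t - t_knots[k]) / (t_knots[k + 1] - t_knots[k])
        return pts[k] + w * (pts[k + 1] - pts[k])

    return x

# toy path: iterates 0 -> 1 -> 1.5 with step-sizes 1.0 and 0.5
x = interpolate(iterates=[0.0, 1.0, 1.5], alphas=[1.0, 0.5])
print(x(0.5), x(1.25))
```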

The following result of Duchi-Ruan [17, Theorem 2] shows that, under the above conditions, for any sequence $\tau_j\to\infty$ the shifted curves $x^{\tau_j}$ subsequentially converge in $C(\mathbb{R}_+,\mathbb{R}^d)$ to a trajectory of (3.1). Results of this type, under more stringent assumptions and with similar arguments, have previously appeared for example in Benaïm-Hofbauer-Sorin [2, 3] and Borkar [5].

###### Theorem 3.1 (Functional approximation).

Suppose that Assumption A holds. Then for any sequence $\{\tau_j\}\subseteq\mathbb{R}_+$, the set of curves $\{x^{\tau_j}\}_{j\in\mathbb{N}}$ is relatively compact in $C(\mathbb{R}_+,\mathbb{R}^d)$. If in addition $\tau_j\to\infty$ as $j\to\infty$, then all limit points of $\{x^{\tau_j}\}$ in $C(\mathbb{R}_+,\mathbb{R}^d)$ are trajectories of the differential inclusion (3.1).

### 3.2 Subsequential convergence to equilibrium points

A primary application of the discrete process (3.2) is to solve the inclusion

$$0\in G(z). \tag{3.4}$$

Indeed, one can consider the points satisfying (3.4) as equilibrium (constant) trajectories of the differential inclusion (3.1). Ideally, one would like to find conditions guaranteeing that every limit point of the sequence $\{x_k\}$, produced by the recursion (3.2), satisfies the desired inclusion (3.4). Making such a leap rigorous typically relies on combining the asymptotic convergence guarantee of Theorem 3.1 with the existence of a Lyapunov-like function for the continuous dynamics; see e.g. [2, 3]. Let us therefore introduce the following assumption.

###### Assumption B (Lyapunov condition).

There exists a continuous function $\varphi\colon X\to\mathbb{R}$, which is bounded from below, and such that the following two properties hold.

1. (Weak Sard) For a dense set of values $r\in\mathbb{R}$, the intersection $\varphi^{-1}(r)\cap G^{-1}(0)$ is empty.

2. (Descent) Whenever $z(\cdot)$ is a trajectory of the differential inclusion (3.1) and $0\notin G(z(0))$, there exists a real $T>0$ satisfying

$$\varphi(z(T))<\sup_{t\in[0,T]}\varphi(z(t))\le\varphi(z(0)).$$

The weak Sard property is reminiscent of the celebrated Sard's theorem in real analysis. Indeed, consider the classical setting $G=-\nabla\varphi$ for a smooth function $\varphi$ on $\mathbb{R}^d$. Then the weak Sard property stipulates that the set of noncritical values of $\varphi$ is dense in $\mathbb{R}$. By Sard's theorem, this is indeed the case, as long as $\varphi$ is sufficiently smooth. Indeed, Sard's theorem guarantees the much stronger property that the set of noncritical values has full measure. We will comment more on the weak Sard property in Section 4, once we shift focus to optimization problems. The descent property says that $\varphi$ eventually strictly decreases along the trajectories of the differential inclusion emanating from any non-equilibrium point. This Lyapunov-type condition is standard in the literature, and we will verify that it holds for a large class of optimization problems in Section 5.

As we have alluded to above, the following theorem shows that under Assumptions A and B, every limit point of $\{x_k\}$ indeed satisfies the inclusion $0\in G(x)$. We were unable to find this result stated and proved in this generality. Therefore, we record a complete proof in Section 3.3. The idea of the proof is of course not new, and can already be seen for example in [2, 17, 22]. Upon first reading, the reader can safely skip to Section 4.

###### Theorem 3.2.

Suppose that Assumptions A and B hold. Then every limit point of the sequence $\{x_k\}$ lies in $G^{-1}(0)$, and the function values $\varphi(x_k)$ converge.

### 3.3 Proof of Theorem 3.2

In this section, we will prove Theorem 3.2. The argument we present is rooted in the "non-escape argument" for ODEs, using $\varphi$ as a Lyapunov function for the continuous dynamics. In particular, the proof we present is in the same spirit as that in [22, Theorem 5.2.1] and [17, Section 3.4.1].

Henceforth, we will suppose that Assumptions A and B hold. We first collect two elementary lemmas.

###### Lemma 3.3.

The equality $\lim_{k\to\infty}\|x_{k+1}-x_k\|=0$ holds.

###### Proof.

From the recurrence (3.2), we have $x_{k+1}-x_k=\alpha_k(y_k+\xi_k)$. Assumption A guarantees that $\{y_k\}$ is bounded and $\alpha_k\to0$, and therefore $\alpha_k y_k\to0$. Moreover, since the series $\sum_i\alpha_i\xi_i$ is convergent, we deduce $\alpha_k\xi_k\to0$. The result follows. ∎

###### Lemma 3.4.

Equalities hold:

$$\liminf_{t\to\infty}\varphi(x(t))=\liminf_{k\to\infty}\varphi(x_k)\qquad\text{and}\qquad\limsup_{t\to\infty}\varphi(x(t))=\limsup_{k\to\infty}\varphi(x_k). \tag{3.5}$$
###### Proof.

Clearly, the inequalities "$\le$" and "$\ge$" hold in (3.5), respectively. We will argue that the reverse inequalities are valid. To this end, let $\{\tau_i\}\subseteq\mathbb{R}_+$ be an arbitrary sequence tending to infinity with $x(\tau_i)$ converging to some point $x^*$ as $i\to\infty$.

For each index $i$, define the breakpoint index $k_i$ by the condition $\tau_i\in[t_{k_i},t_{k_i+1})$. Then by the triangle inequality, we have

$$\|x_{k_i}-x^*\|\le\|x_{k_i}-x(\tau_i)\|+\|x(\tau_i)-x^*\|\le\|x_{k_i}-x_{k_i+1}\|+\|x(\tau_i)-x^*\|.$$

Lemma 3.3 implies that the right-hand side tends to zero, and hence $x_{k_i}\to x^*$. Continuity of $\varphi$ then directly yields the guarantee $\varphi(x_{k_i})\to\varphi(x^*)$.

In particular, we may take $\{\tau_i\}$ to be a sequence realizing $\liminf_{t\to\infty}\varphi(x(t))$. Since the curve $x(\cdot)$ is bounded, we may suppose that, up to taking a subsequence, $x(\tau_i)$ converges to some point $x^*$. We therefore deduce

$$\liminf_{k\to\infty}\varphi(x_k)\le\lim_{i\to\infty}\varphi(x_{k_i})=\varphi(x^*)=\liminf_{t\to\infty}\varphi(x(t)),$$

thereby establishing the first equality in (3.5). The second equality follows analogously. ∎

The proof of Theorem 3.2 will follow quickly from the following proposition.

###### Proposition 3.5.

The values $\varphi(x_k)$ have a limit as $k\to\infty$.

###### Proof.

Without loss of generality, suppose $\liminf_{k\to\infty}\varphi(x_k)=0$. For each $r\in\mathbb{R}$, define the sublevel set

$$L_r:=\{x\in\mathbb{R}^d:\varphi(x)\le r\}.$$

Choose any $\epsilon>0$ satisfying $\varphi^{-1}(\epsilon)\cap G^{-1}(0)=\emptyset$. Note that by Assumption B, we can let $\epsilon$ be as small as we wish. By the first equality in (3.5), there are infinitely many indices $k$ such that $x_k$ lies in $L_\epsilon$. The following elementary observation shows that for all large $k$, if $x_k$ lies in $L_\epsilon$ then the next iterate $x_{k+1}$ lies in $L_{2\epsilon}$.

###### Claim 1.

For all sufficiently large indices $k$, the implication holds:

$$x_k\in L_\epsilon\;\Longrightarrow\; x_{k+1}\in L_{2\epsilon}.$$
###### Proof.

Since the sequence $\{x_k\}$ is bounded, it is contained in some compact set $C$. From continuity of $\varphi$, we have

$$\mathrm{cl}\,(\mathbb{R}^d\setminus L_{2\epsilon})=\mathrm{cl}\,\big(\varphi^{-1}(2\epsilon,\infty)\big)\subseteq\varphi^{-1}[2\epsilon,\infty).$$

It follows that the two closed sets, $C\cap L_\epsilon$ and $\mathrm{cl}\,(\mathbb{R}^d\setminus L_{2\epsilon})$, do not intersect. Since $C\cap L_\epsilon$ is compact, we deduce that it is well separated from $\mathrm{cl}\,(\mathbb{R}^d\setminus L_{2\epsilon})$; that is, there exists $\alpha>0$ satisfying:

$$\min\{\|w-v\|: w\in C\cap L_\epsilon,\; v\notin L_{2\epsilon}\}\ge\alpha>0.$$

In particular, $\mathbb{B}_\alpha(x_k)\subseteq L_{2\epsilon}$ whenever $x_k$ lies in $L_\epsilon$. Taking into account Lemma 3.3, we deduce $\|x_{k+1}-x_k\|<\alpha$ for all large $k$, and therefore $x_k\in L_\epsilon$ implies $x_{k+1}\in L_{2\epsilon}$, as claimed. ∎

Let us now define the following sequence of indices. Let $i_1$ be the first index satisfying:

1. $x_{i_1}\in L_\epsilon$,

2. $x_{i_1+1}\notin L_\epsilon$, and

3. defining the exit time $e_1:=\min\{k>i_1: x_k\in L_\epsilon\text{ or }x_k\notin L_{2\epsilon}\}$, the iterate $x_{e_1}$ lies in $\mathbb{R}^d\setminus L_{2\epsilon}$.

Then let $i_2>e_1$ be the next smallest index satisfying the same property, and so on. See Figure 1 for an illustration. The following claim will be key.

###### Claim 2.

This process must terminate; that is, the sequence $\{x_k\}$ exits $L_{2\epsilon}$ only finitely many times.

Before proving the claim, let us see how it immediately yields the validity of the proposition. To this end, observe that Claims 1 and 2 immediately imply $x_k\in L_{2\epsilon}$ for all large $k$. Since $\epsilon$ can be made arbitrarily small, we deduce $\limsup_{k\to\infty}\varphi(x_k)\le0=\liminf_{k\to\infty}\varphi(x_k)$. Equation (3.5) then directly implies that the values $\varphi(x_k)$ converge, as claimed.

###### Proof of Claim 2.

To verify the claim, suppose that the process does not terminate. Thus we obtain an increasing sequence of indices $i_1<i_2<\cdots$ with $i_j\to\infty$ as $j\to\infty$. Set $\tau_j:=t_{i_j}$ and consider the curves $x^{\tau_j}$ in $C(\mathbb{R}_+,\mathbb{R}^d)$. Then up to a subsequence, Theorem 3.1 shows that the curves $x^{\tau_j}$ converge in $C(\mathbb{R}_+,\mathbb{R}^d)$ to some arc $z(\cdot)$ satisfying

$$\dot z(t)\in G(z(t))\quad\text{for a.e. } t\ge0.$$

By construction, we have $\varphi(x_{i_j})\le\epsilon$ and $\varphi(x_{i_j+1})>\epsilon$. We therefore deduce

$$\epsilon\ge\varphi(x_{i_j})\ge\varphi(x_{i_j+1})+\big(\varphi(x_{i_j})-\varphi(x_{i_j+1})\big)\ge\epsilon+\big[\varphi(x_{i_j})-\varphi(z(0))\big]-\big[\varphi(x_{i_j+1})-\varphi(z(0))\big]. \tag{3.6}$$

Recall $x_{i_j}=x^{\tau_j}(0)\to z(0)$ as $j\to\infty$. Lemma 3.3 in turn implies $\|x_{i_j+1}-x_{i_j}\|\to0$, and therefore $x_{i_j+1}\to z(0)$ as well. Continuity of $\varphi$ then guarantees that the right-hand side of (3.6) tends to $\epsilon$, and hence $\varphi(z(0))=\epsilon$. In particular, $z(0)$ is not an equilibrium point of $G$, by our choice of $\epsilon$. Hence, Assumption B yields a real $T>0$ such that

$$\varphi(z(T))<\sup_{t\in[0,T]}\varphi(z(t))\le\varphi(z(0))=\epsilon.$$

In particular, there exists a real $\delta>0$ satisfying $\varphi(z(T))<\epsilon-2\delta$.

Appealing to uniform convergence on $[0,T]$, we conclude

$$\sup_{t\in[0,T]}\big|\varphi(z(t))-\varphi(x^{\tau_j}(t))\big|<\epsilon,$$

for all large $j$, and therefore

$$\sup_{t\in[0,T]}\varphi\big(x^{\tau_j}(t)\big)\le\sup_{t\in[0,T]}\varphi(z(t))+\sup_{t\in[0,T]}\big|\varphi(z(t))-\varphi(x^{\tau_j}(t))\big|\le2\epsilon.$$

Hence, for all large $j$, the curves $x^{\tau_j}$ map $[0,T]$ into $L_{2\epsilon}$. We conclude that the exit time satisfies

$$t_{e_j}>\tau_j+T\quad\text{for all large } j.$$

We will show that the bound $t_{e_j}>\tau_j+T$ forces an iterate to return to $L_\epsilon$ strictly before the exit time, which will lead to a contradiction.

To that end, let

$$\ell_j:=\max\{\ell\in\mathbb{N}: \tau_j\le t_\ell\le\tau_j+T\}$$

be the last discrete time index in the window $[\tau_j,\tau_j+T]$. Because $\alpha_k\to0$ as $k\to\infty$, we have $\ell_j>i_j$ for all large $j$. We will now show that for all large $j$, we have

$$\varphi(x_{\ell_j})<\epsilon-\delta,$$

which implies $x_{\ell_j}\in L_\epsilon$ and therefore $e_j\le\ell_j$, contradicting the bound $t_{e_j}>\tau_j+T\ge t_{\ell_j}$. Indeed, observe

$$\|x_{\ell_j}-x^{\tau_j}(T)\|=\big\|x^{\tau_j}(t_{\ell_j}-\tau_j)-x^{\tau_j}(T)\big\|\le\|x_{\ell_j}-x_{\ell_j+1}\|\to0.$$

Hence $\|x_{\ell_j}-x^{\tau_j}(T)\|\to0$ as $j\to\infty$. Continuity of $\varphi$ then guarantees $\varphi(x_{\ell_j})\to\varphi(z(T))<\epsilon-2\delta$. Consequently, the inequality $\varphi(x_{\ell_j})<\epsilon-\delta$ holds for all large $j$, which is the desired contradiction. ∎

The proof of Proposition 3.5 is now complete. ∎

We can now prove the main convergence theorem.

###### Proof of Theorem 3.2.

Let $x^*$ be a limit point of the sequence $\{x_k\}$ and suppose for the sake of contradiction that $0\notin G(x^*)$. Let $\{i_j\}$ be indices satisfying $x_{i_j}\to x^*$ as $j\to\infty$. Let $z(\cdot)$ be a subsequential limit of the curves $x^{t_{i_j}}$ in $C(\mathbb{R}_+,\mathbb{R}^d)$, guaranteed to exist by Theorem 3.1. Assumption B guarantees that there exists a real $T>0$ satisfying

$$\varphi(z(T))<\varphi(z(0))=\varphi(x^*).$$

On the other hand, we successively deduce

$$\varphi(z(T))=\lim_{j\to\infty}\varphi\big(x^{t_{i_j}}(T)\big)=\lim_{t\to\infty}\varphi(x(t))=\varphi(x^*),$$

where the last two equalities follow from Proposition 3.5 and continuity of . We have thus arrived at a contradiction, and the theorem is proved. ∎

## 4 Subgradient dynamical system

Assumptions A and B, taken together, provide a powerful framework for proving subsequential convergence of algorithms to a zero of the set-valued map $G$. Note that the two assumptions are qualitatively different. Assumption A is a property of both the algorithm (3.2) and the map $G$, while Assumption B is a property of $G$ alone.

For the rest of our discussion, we apply the differential inclusion approach outlined above to optimization problems. Setting the notation, consider the optimization task

$$\min_{x\in\mathbb{R}^d}\; f(x), \tag{4.1}$$

where $f\colon\mathbb{R}^d\to\mathbb{R}$ is a locally Lipschitz continuous function. Seeking to apply the techniques of Section 3, we simply set $X=\mathbb{R}^d$ and $G=-\partial f$ in the notation therein. Thus we will be interested in algorithms that, under reasonable conditions, track solutions of the differential inclusion

$$\dot z(t)\in-\partial f(z(t))\quad\text{for a.e. } t\ge0, \tag{4.2}$$

and subsequentially converge to critical points of $f$. Discrete processes of the type (3.2) for the optimization problem (4.1) are often called stochastic approximation algorithms. Here we study two such prototypical methods: the stochastic subgradient method in this section and the stochastic proximal subgradient method in Section 6. Each fits under the umbrella of Assumption A.

Setting the stage, the stochastic subgradient method simply iterates the steps:

$$x_{k+1}=x_k-\alpha_k(y_k+\xi_k)\quad\text{with}\quad y_k\in\partial f(x_k), \tag{4.3}$$

where $\{\alpha_k\}$ is a step-size sequence and $\{\xi_k\}$ is now a sequence of random variables (the "noise") on some probability space. Let us now isolate the following standard assumptions (e.g. [5, 22]) for the method and see how they immediately imply Assumption A.

###### Assumption C (Standing assumptions for the stochastic subgradient method).
1. The sequence $\{\alpha_k\}$ is nonnegative, square summable, but not summable:

$$\alpha_k\ge0,\qquad\sum_{k=1}^\infty\alpha_k=\infty,\qquad\text{and}\qquad\sum_{k=1}^\infty\alpha_k^2<\infty.$$

2. Almost surely, the stochastic subgradient iterates are bounded: $\sup_k\|x_k\|<\infty$.

3. $\{\xi_k\}$ is a martingale difference sequence w.r.t. the increasing $\sigma$-fields

$$\mathcal{F}_k=\sigma(x_j,y_j,\xi_j: j\le k).$$

That is, there exists a function $p\colon\mathbb{R}^d\to\mathbb{R}$, which is bounded on bounded sets, so that almost surely, for all $k$, we have

$$\mathbb{E}[\xi_k\mid\mathcal{F}_k]=0\qquad\text{and}\qquad\mathbb{E}\big[\|\xi_k\|^2\mid\mathcal{F}_k\big]\le p(x_k).$$

The following is true.

###### Lemma 4.1.

Assumption C guarantees that almost surely Assumption A holds.

###### Proof.

Suppose Assumption C holds. Clearly A.1 and A.3 hold vacuously, while A.2 follows immediately from C.2 and local Lipschitz continuity of $f$. Assumption A.5 follows quickly from the fact that $\partial f$ is outer-semicontinuous with compact convex values; we leave the details to the reader. Thus we must only verify A.4, which follows quickly from standard martingale arguments. Indeed, notice from Assumption C, we have

$$\mathbb{E}[\xi_k\mid\mathcal{F}_k]=0\;\;\forall k\qquad\text{and}\qquad\sum_{i=0}^\infty\alpha_i^2\,\mathbb{E}\big[\|\xi_i\|^2\mid\mathcal{F}_i\big]\le\sum_{i=0}^\infty\alpha_i^2\,p(x_i)<\infty.$$

Define the martingale $X_k:=\sum_{i=1}^k\alpha_i\xi_i$. Thus the limit of the predictable compensator

$$\langle X\rangle_k:=\sum_{i=1}^k\alpha_i^2\,\mathbb{E}\big[\|\xi_i\|^2\mid\mathcal{F}_i\big]$$

exists. Applying [15, Theorem 5.3.33(a)], we deduce that $X_k$ almost surely converges to a finite limit. ∎

Thus applying Theorem 3.1, we deduce that under Assumption C, almost surely, the stochastic subgradient path tracks a trajectory of the differential inclusion (4.2). As we saw in Section 3, proving subsequential convergence to critical points requires the existence of a Lyapunov-type function for the continuous dynamics. Henceforth, let us assume that the Lyapunov function is $f$ itself. Section 5 is devoted entirely to justifying this assumption for two broad classes of functions that are virtually exhaustive in data scientific contexts.

###### Assumption D (Lyapunov condition in unconstrained minimization).
1. (Weak Sard) The set of noncritical values of $f$ is dense in $\mathbb{R}$.

2. (Descent) Whenever $z(\cdot)$ is a trajectory of the differential inclusion (4.2) and $z(0)$ is not a critical point of $f$, there exists a real $T>0$ satisfying

$$f(z(T))<\sup_{t\in[0,T]}f(z(t))\le f(z(0)).$$

Some comments are in order. Recall that the classical Sard's theorem guarantees that the set of critical values of any sufficiently smooth function $f\colon\mathbb{R}^d\to\mathbb{R}$ has measure zero. Thus property 1 in Assumption D asserts a very weak version of a nonsmooth Sard theorem. This is a very mild property, imposed mostly for technical reasons. It can fail, however, even for a $C^1$-smooth function on $\mathbb{R}^2$; see the famous example of Whitney [35]. Property 2 of Assumption D is more meaningful. It essentially asserts that $f$ must eventually strictly decrease along any subgradient trajectory emanating from a noncritical point.

Thus applying Theorem 3.2, we have arrived at the following guarantee for the stochastic subgradient method.

###### Theorem 4.2.

Suppose that Assumptions C and D hold. Then almost surely, every limit point of the stochastic subgradient iterates $\{x_k\}$ is critical for $f$, and the function values $f(x_k)$ converge.
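The guarantee can be observed empirically on a simple nonconvex, nonsmooth, semialgebraic function. The sketch below (an illustration with invented choices: $f(x)=|x^2-1|$, whose critical points are $\{-1,0,1\}$, a step-size offset to temper the first steps, and a fixed noise scale) runs (4.3) and checks that the final iterate sits near a critical point.

```python
import numpy as np

rng = np.random.default_rng(0)

def subgrad(x):
    # one element of the Clarke subdifferential of f(x) = |x^2 - 1|
    if abs(x * x - 1.0) < 1e-12:
        return 0.0                      # 0 lies in ∂f(±1), so any selection works
    return 2.0 * x * float(np.sign(x * x - 1.0))

x = 2.0
for k in range(1, 20001):
    alpha = 1.0 / (k + 10)              # offset tempers the very first steps
    x -= alpha * (subgrad(x) + 0.1 * rng.normal())

dist_to_critical = min(abs(x - c) for c in (-1.0, 0.0, 1.0))
print(x, dist_to_critical)
```

From this starting point the iterates settle near the critical point $x=1$; the unstable critical point $x=0$ (a local maximizer) is avoided by the noise.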

## 5 Verifying the descent condition

In light of Theorems 3.2 and 4.2, it is important to isolate a class of functions that automatically satisfy Assumption D.2. In this section, we do exactly that, focusing on two problem classes: (1) subdifferentially regular functions and (2) those functions whose graphs are Whitney stratifiable. We will see that the latter problem class also satisfies D.1.

The material in this section is not new. In particular, the results of this section have appeared in [16, Section 5.1]. These results, however, are somewhat hidden in the paper [16] and are difficult to parse. Moreover, at the time of writing [16, Section 5.1], there was no clear application of the techniques, in contrast to our current paper. Since we do not expect the readers to be experts in variational analysis and semialgebraic geometry, we provide here a self-contained treatment, highlighting only the most essential ingredients and streamlining some of the arguments.

Let us begin with the following definition, whose importance for verifying Property 2 in Assumption D will become clear shortly.

###### Definition 5.1 (Chain rule).

Consider a locally Lipschitz function $f$ on $\mathbb{R}^d$. We will say that $f$ admits a chain rule if, for any arc $z(\cdot)$, the equality

 $(f\circ z)'(t)=\langle \partial f(z(t)),\,\dot z(t)\rangle$ holds for a.e. $t\geq 0$.
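For orientation, in the smooth case the definition reduces to the classical chain rule. A quick numerical check (an illustration, not part of the development) for $f(x)=x^2$ along the arc $z(t)=\sin t$:

```python
import math

# Illustration: in the smooth case, Definition 5.1 reduces to the classical
# chain rule.  Check it numerically for f(x) = x**2 along the arc
# z(t) = sin(t), where the subdifferential is the singleton {2 z(t)}.

f = lambda x: x * x
z = math.sin
dz = math.cos

for t in [0.3, 1.0, 2.5]:
    h = 1e-6
    lhs = (f(z(t + h)) - f(z(t - h))) / (2 * h)  # numerical (f∘z)'(t)
    rhs = 2.0 * z(t) * dz(t)                     # <∂f(z(t)), ż(t)>
    assert abs(lhs - rhs) < 1e-5
print("chain rule holds numerically for f(x) = x^2 along z(t) = sin(t)")
```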

The importance of the chain rule becomes immediately clear with the following lemma.

###### Lemma 5.2.

Consider a locally Lipschitz function $f$ that admits a chain rule. Let $z(\cdot)$ be any arc satisfying the differential inclusion

 $\dot z(t)\in -\partial f(z(t))$ for a.e. $t\geq 0$.

Then the equality $\|\dot z(t)\|=\mathrm{dist}(0;\partial f(z(t)))$ holds for a.e. $t\geq 0$, and therefore

 $f(z(0))-f(z(t))=\int_0^t \mathrm{dist}^2(0;\partial f(z(\tau)))\,d\tau, \quad \forall t\geq 0.$ (5.1)

In particular, property 2 of Assumption D holds.

###### Proof.

Fix a real $t\geq 0$ satisfying $\dot z(t)\in -\partial f(z(t))$ and at which the chain rule holds. Since every subgradient $v\in\partial f(z(t))$ then yields the same product $\langle v,\dot z(t)\rangle$, we observe the equality

 $0=\langle \partial f(z(t))-\partial f(z(t)),\,\dot z(t)\rangle.$ (5.2)

To simplify the notation, set $y:=-\dot z(t)$, $V:=\partial f(z(t))$, and $W:=V-V$. Appealing to (5.2), we conclude $y\in W^{\perp}$, and therefore, since $y\in V\subseteq y+W$, trivially we have

 $y\in (y+W)\cap W^{\perp}.$

Basic linear algebra implies $\mathrm{dist}(0;y+W)=\|y\|$. Noting $V\subseteq y+W$, we deduce $\mathrm{dist}(0;V)\geq\|y\|$ as claimed. Since the reverse inequality $\mathrm{dist}(0;V)\leq\|y\|$ trivially holds (because $y\in V$), we obtain the claimed equality, $\mathrm{dist}(0;\partial f(z(t)))=\|\dot z(t)\|$.

Since $f$ admits a chain rule, we conclude for a.e. $\tau\geq 0$ the estimate

 $(f\circ z)'(\tau)=\langle \partial f(z(\tau)),\,\dot z(\tau)\rangle=-\|\dot z(\tau)\|^2=-\mathrm{dist}^2(0;\partial f(z(\tau))).$

Since $f$ is locally Lipschitz, the composition $f\circ z$ is absolutely continuous. Hence integrating over the interval $[0,t]$ yields (5.1).

Suppose now that the point $z(0)$ is noncritical. Then by outer semicontinuity of $\partial f$, there exists $T>0$ such that $z(t)$ is noncritical for any $t\in[0,T]$. It follows immediately that the value $\int_0^t \mathrm{dist}^2(0;\partial f(z(\tau)))\,d\tau$ is strictly increasing in $t$ on $[0,T]$, and therefore by (5.1) that $f(z(t))$ is strictly decreasing there. Hence item 2 of Assumption D holds, as claimed. ∎
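The identity (5.1) can be spot-checked on the simplest example $f(x)=|x|$ with $z(0)=2$, where the subgradient trajectory is $z(t)=\max(2-t,0)$ and the squared distance of $0$ to $\partial f(z(\tau))$ equals $1$ until the trajectory reaches the critical point $0$. A sketch (my example, not from the text):

```python
# Spot-check of identity (5.1) for f(x) = |x| with z(0) = 2: the subgradient
# trajectory is z(t) = max(2 - t, 0), which stops at the critical point 0.

z0 = 2.0

def z(t):
    return max(z0 - t, 0.0)

def dist_sq(x):
    # ∂f(x) = {sign(x)} for x != 0 and [-1, 1] at x = 0, so the squared
    # distance of 0 to the subdifferential is 1 off the origin and 0 at it.
    return 1.0 if x != 0.0 else 0.0

for t in [0.5, 1.0, 2.0, 3.0]:
    lhs = abs(z(0.0)) - abs(z(t))  # f(z(0)) - f(z(t))
    n = 20000                      # midpoint-rule approximation of (5.1)
    rhs = sum(dist_sq(z((i + 0.5) * t / n)) for i in range(n)) * t / n
    assert abs(lhs - rhs) < 1e-3, (t, lhs, rhs)
print("identity (5.1) checked for f(x) = |x|")
```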

Thus property 2 of Assumption D is sure to hold as long as $f$ admits a chain rule. In the following two sections, we identify two different function classes that indeed admit the chain rule.

### 5.1 Subdifferentially regular functions

The first function class we consider consists of subdifferentially regular functions. Such functions play a prominent role in variational analysis due to their close connection with convex functions; we refer the reader to the monograph [31] for details. In essence, subdifferential regularity forbids downward-facing cusps in the graph of the function; e.g., the function $x\mapsto -|x|$ is not subdifferentially regular. We now present the formal definition.

###### Definition 5.3 (Subdifferential regularity).

A locally Lipschitz function $f$ is subdifferentially regular at a point $x$ if every subgradient $v\in\partial f(x)$ yields an affine minorant of $f$ up to first order:

 $f(y)\geq f(x)+\langle v,\,y-x\rangle+o(\|y-x\|)$ as $y\to x$.
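For a hand-checkable instance (my example, not from the text), the convex function $f(x)=|x|$ is subdifferentially regular at $0$: every subgradient $v\in[-1,1]$ gives a genuine affine minorant, so the first-order error is in fact nonnegative.

```python
# Illustration: the convex function f(x) = |x| is subdifferentially regular
# at 0.  Every subgradient v in ∂f(0) = [-1, 1] yields a genuine affine
# minorant, so the first-order error term in the definition is nonnegative.

f = abs
for v in [-1.0, -0.5, 0.0, 0.5, 1.0]:  # sample elements of ∂f(0) = [-1, 1]
    for y in [-1.0, -0.01, 0.01, 1.0]:
        assert f(y) >= f(0.0) + v * y  # minorant inequality, with zero o-term
print("every v in [-1, 1] is an affine minorant of |x| at 0")
```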

The following lemma shows that any locally Lipschitz function that is subdifferentially regular indeed admits a chain rule.

###### Lemma 5.4 (Chain rule under subdifferential regularity).

Any locally Lipschitz function that is subdifferentially regular admits a chain rule and therefore item 2 of Assumption D holds.

###### Proof.

Let $f$ be a locally Lipschitz and subdifferentially regular function. Consider an arc $x(\cdot)$. Since $x(\cdot)$ and $f\circ x$ are absolutely continuous, both are differentiable almost everywhere. Then for any such point of differentiability $t$ and any subgradient $v\in\partial f(x(t))$, we conclude

 $(f\circ x)'(t)=\lim_{r\searrow 0}\frac{f(x(t+r))-f(x(t))}{r}\geq\lim_{r\searrow 0}\frac{\langle v,\,x(t+r)-x(t)\rangle+o(\|x(t+r)-x(t)\|)}{r}=\langle v,\,\dot x(t)\rangle.$

Equating $(f\circ x)'(t)$ instead with the left limit of the difference quotient yields the reverse inequality $(f\circ x)'(t)\leq\langle v,\,\dot x(t)\rangle$. Thus $f$ admits a chain rule, and item 2 of Assumption D holds by Lemma 5.2. ∎

Thus we have arrived at the following corollary. For ease of reference, we state subsequential convergence guarantees both for the general process (3.2) and for the specific stochastic subgradient method (4.3).

###### Corollary 5.5.

Let $f$ be a locally Lipschitz function that is subdifferentially regular and such that its set of noncritical values is dense in $\mathbb{R}$.

• (Stochastic approximation) Consider the iterates produced by (3.2) and suppose that Assumption A holds with $G=\partial f$. Then every limit point of the iterates is critical for $f$ and the function values converge.

• (Stochastic subgradient method) Consider the iterates produced by the stochastic subgradient method (4.3) and suppose that Assumption C holds. Then almost surely, every limit point of the iterates is critical for $f$ and the function values converge.

Though subdifferentially regular functions are widespread in applications, they preclude “downward cusps,” and therefore do not capture even such simple nonpathological examples as $x\mapsto -|x|$. The following section concerns a different function class that does capture such examples.
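The failure of regularity at a downward cusp is easy to see numerically (my example, not from the text): for $f(x)=-|x|$ the Clarke subdifferential at $0$ is $[-1,1]$, yet the subgradient $v=1$ is not a first-order affine minorant, since the error $f(y)-f(0)-vy$ equals $-2y$ for $y>0$ and is therefore not $o(|y|)$.

```python
# Illustration: f(x) = -|x| has a downward cusp at 0 and fails subdifferential
# regularity there.  Take the Clarke subgradient v = 1; the first-order error
# f(y) - f(0) - v*y is -2y for y > 0, which is not o(|y|) as y -> 0.

f = lambda x: -abs(x)
v = 1.0  # an element of the Clarke subdifferential of -|x| at 0

for y in [0.1, 0.01, 0.001]:
    ratio = (f(y) - f(0.0) - v * y) / abs(y)
    assert abs(ratio + 2.0) < 1e-9  # the ratio stays at -2 instead of vanishing
print("regularity fails for -|x| at 0: error ratio -> -2, not 0")
```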

### 5.2 Stratifiable functions

As we saw in the previous section, subdifferential regularity is a local property that implies the desired item 2 of Assumption D. In this section, we instead focus on a broad class of functions satisfying a global geometric property, which eliminates pathological examples from consideration.

Before giving a formal definition, let us fix some notation. A set $M\subseteq\mathbb{R}^d$ is a $C^p$-smooth manifold if there is an integer $r$ such that around any point $x\in M$, there is a neighborhood $U$ and a $C^p$-smooth map $F\colon U\to\mathbb{R}^r$ with the Jacobian $\nabla F(x)$ of full rank and satisfying $M\cap U=\{y\in U: F(y)=0\}$. If this is the case, the tangent and normal spaces to $M$ at $x$ are defined to be $T_M(x):=\ker \nabla F(x)$ and $N_M(x):=T_M(x)^{\perp}$, respectively.
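This local description can be checked concretely (my example) on the unit circle $M=\{F=0\}$ with $F(x,y)=x^2+y^2-1$: at $p=(\cos a,\sin a)$ the gradient $\nabla F(p)=2p$ has full rank, the tangent space is its kernel, and the normal space is $\mathrm{span}\{p\}$.

```python
import math

# Sanity check of the tangent/normal space description for the unit circle
# M = {F = 0}, F(x, y) = x**2 + y**2 - 1.  At p = (cos a, sin a), grad F(p) = 2p
# has full rank 1; the tangent space is ker grad F(p) and the normal space is
# its orthogonal complement, i.e. span{p}.

a = 0.7
p = (math.cos(a), math.sin(a))
grad = (2 * p[0], 2 * p[1])
tangent = (-math.sin(a), math.cos(a))  # a basis vector of ker grad F(p)

assert abs(grad[0] * tangent[0] + grad[1] * tangent[1]) < 1e-12  # tangent ⟂ grad
assert abs(p[0] * tangent[0] + p[1] * tangent[1]) < 1e-12        # normal = span{p}
print("tangent = ker grad F(p), normal = span{p} on the unit circle")
```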

###### Definition 5.6 (Whitney stratification).

A Whitney $C^p$-stratification of a set $Q\subseteq\mathbb{R}^d$ is a partition of $Q$ into finitely many nonempty $C^p$ manifolds, called strata, satisfying the following compatibility conditions.

1. Frontier condition: For any two strata $L$ and $M$, the implication

 $L\cap \mathrm{cl}\, M\neq\emptyset \implies L\subset \mathrm{cl}\, M$ holds.
2. Whitney condition (a): For any sequence of points $x_k$ in a stratum $M$ converging to a point $x$ in a stratum $L$, if the corresponding normal vectors $v_k\in N_M(x_k)$ converge to a vector $v$, then the inclusion $v\in N_L(x)$ holds.

A function $f$ is Whitney $C^p$-stratifiable if its graph admits a Whitney $C^p$-stratification.

The definition of the Whitney stratification invokes two conditions, one topological and the other geometric. The frontier condition simply says that if one stratum $L$ intersects the closure of another stratum $M$, then $L$ must be fully contained in the closure $\mathrm{cl}\, M$. In particular, the frontier condition endows the strata with a partial order: $L\preceq M$ whenever $L\subseteq \mathrm{cl}\, M$. The Whitney condition (a) is geometric. In short, it asserts that limits of normal vectors along a convergent sequence $x_k$ in a stratum are themselves normal to the stratum containing the limit of $x_k$.
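As a concrete illustration (our own example, not from [4]), the graph of the absolute value function admits a three-stratum Whitney stratification that can be checked by hand:

```latex
% A Whitney stratification of gph|.| \subset R^2 into three strata.
\[
  \operatorname{gph}|\cdot| = M_1 \cup M_0 \cup M_2, \qquad
  M_1 = \{(x,-x) : x<0\}, \quad M_0 = \{(0,0)\}, \quad M_2 = \{(x,x) : x>0\}.
\]
% Frontier condition: M_0 meets \operatorname{cl} M_1 and \operatorname{cl} M_2,
% and indeed M_0 \subset \operatorname{cl} M_i for i = 1, 2.
% Whitney condition (a): along (x_k, x_k) \in M_2 with x_k \downarrow 0, the
% normal spaces N_{M_2} = \operatorname{span}\{(1,-1)\} are constant, and their
% limit lies in N_{M_0}(0,0) = \mathbb{R}^2, the normal space to the
% zero-dimensional stratum M_0.
```

Projecting these strata onto the horizontal axis recovers the partition of $\mathbb{R}$ into $(-\infty,0)$, $\{0\}$, $(0,\infty)$, on each piece of which $|x|$ is smooth.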

The following discussion of Whitney stratifications follows that in [4]. Consider a Whitney $C^p$-stratification of the graph of a locally Lipschitz function $f$. Let $\{X_i\}$ denote the manifolds obtained by projecting the strata onto $\mathbb{R}^d$. An easy argument using the constant rank theorem shows that the partition of $\mathbb{R}^d$ into the sets $X_i$ is itself a Whitney $C^p$-stratification and that the restriction of $f$ to each stratum $X_i$ is $C^p$-smooth. Whitney condition (a) directly yields the following consequence [4, Proposition 4]. For any stratum $X_i$ and any point $x\in X_i$, we have