In this work, we study the long-term behavior of the stochastic subgradient method on nonsmooth and nonconvex functions. Setting the stage, consider the optimization problem
$$\min_{x \in \mathbb{R}^d} f(x),$$
where $f\colon \mathbb{R}^d \to \mathbb{R}$ is a locally Lipschitz continuous function. The stochastic subgradient method simply iterates the steps
$$x_{k+1} = x_k - \alpha_k (y_k + \xi_k) \quad \text{with} \quad y_k \in \partial f(x_k). \tag{1.1}$$
Here $\partial f(x)$ denotes the Clarke subdifferential. Informally, the set $\partial f(x)$ is the convex hull of limits of gradients at nearby differentiable points. In classical circumstances, the subdifferential reduces to more familiar objects. Namely, when $f$ is $C^1$-smooth at $x$, the subdifferential $\partial f(x)$ consists only of the gradient $\nabla f(x)$, while for convex functions, it reduces to the subdifferential in the sense of convex analysis. The positive sequence $\{\alpha_k\}$ is user specified, and it controls the step-sizes of the algorithm. As is typical for stochastic subgradient methods, we will assume that this sequence is square summable but not summable, meaning $\sum_k \alpha_k^2 < \infty$ and $\sum_k \alpha_k = \infty$. Finally, the stochasticity is modeled by the random (noise) sequence $\{\xi_k\}$. We make the standard assumption that, conditioned on the past, each random variable $\xi_k$ has mean zero and its second moment grows at a controlled rate.
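The iteration (1.1) can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the paper: the test function $f(x) = |x|$ on the real line, the step-sizes $\alpha_k = 1/k$, Gaussian noise, and the hypothetical function names.

```python
import random


def subgradient_abs(x):
    """Return one element of the Clarke subdifferential of f(x) = |x|."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    # At the kink, every value in [-1, 1] is a valid subgradient; 0 is one choice.
    return 0.0


def stochastic_subgradient(x0, num_steps, noise_std=0.1, seed=0):
    """Run x_{k+1} = x_k - alpha_k * (y_k + xi_k) with y_k in the subdifferential.

    Assumed for illustration: alpha_k = 1/k (square summable, not summable)
    and independent zero-mean Gaussian noise xi_k.
    """
    rng = random.Random(seed)
    x = x0
    for k in range(1, num_steps + 1):
        alpha = 1.0 / k                 # step-size alpha_k
        y = subgradient_abs(x)          # y_k in the subdifferential at x_k
        xi = rng.gauss(0.0, noise_std)  # zero-mean noise xi_k
        x = x - alpha * (y + xi)
    return x


if __name__ == "__main__":
    print(stochastic_subgradient(2.0, 5000))
```

In this toy run the iterates settle near the unique critical point $x = 0$, which is the kind of behavior the convergence theory is meant to explain.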
Though variants of the stochastic subgradient method (1.1) date back to Robbins-Monro’s pioneering 1951 work, their convergence behavior is still largely not understood in nonsmooth and nonconvex settings. In particular, the following question remains open.
Does the (stochastic) subgradient method have any convergence guarantees on locally Lipschitz functions, which may be neither smooth nor convex?
That this question remains unanswered is somewhat concerning as the stochastic subgradient method forms a core numerical subroutine for several widely used solvers, including Google’s TensorFlow
and the open source PyTorch library.
Convergence behavior of (1.1) is well understood when applied to convex, smooth, and more generally, weakly convex problems. In these three cases, almost surely, every limit point of the iterate sequence is first-order critical , meaning . Moreover, rates of convergence in terms of natural optimality/stationarity measures are available. In summary, the rates are , , and , for functions that are convex , smooth , and -weakly convex [14, 13], respectively. In particular, the convergence guarantee above for -weakly convex functions appeared only recently in [14, 13], with the Moreau envelope playing a central role.
Though widely applicable, these previous results on the convergence of the stochastic subgradient method do not apply to even relatively simple non-pathological functions, such as and
. It is not only toy examples, however, that lack convergence guarantees, but the entire class of deep neural networks with nonsmooth activation functions (e.g., ReLU). Since such networks are routinely trained in practice, it is worthwhile to understand if indeed the iterates tend to a meaningful limit.
In this paper, we provide a positive answer to this question for a wide class of locally Lipschitz functions; indeed, the function class we consider is virtually exhaustive in data scientific contexts (see Corollary 5.11 for consequences in deep learning). Aside from mild technical conditions, the only meaningful assumption we make is that $f$ strictly decreases along any trajectory of the differential inclusion $\dot{z}(t) \in -\partial f(z(t))$ emanating from a noncritical point. Under this assumption, a standard Lyapunov-type argument shows that every limit point of the stochastic subgradient method is critical for $f$, almost surely. Techniques of this type can be found for example in the monograph of Kushner-Yin [22, Theorem 5.2.1] and the landmark papers of Benaïm-Hofbauer-Sorin [2, 3]. Here, we provide a self-contained treatment, which facilitates direct extensions to “proximal” variants of the stochastic subgradient method. In particular, our analysis follows closely the recent work of Duchi-Ruan [17, Section 3.4.1] on convex composite minimization.

Footnote 1: Concurrent to this work, an independent preprint also provides convergence guarantees for the stochastic projected subgradient method, under the assumption that the objective function is “subdifferentially regular” and the constraint set is convex. Subdifferential regularity rules out functions with downward kinks and cusps, such as deep networks with ReLU activation functions. Besides subsuming the subdifferentially regular case, the results of the current paper apply to the broad class of Whitney stratifiable functions, which includes all popular deep network architectures.
The main question that remains therefore is which functions decrease along the continuous subgradient curves. Let us look for inspiration at convex functions, which are well-known to satisfy this property [7, 8]. Indeed, if is convex and
is any absolutely continuous curve, then the “chain rule” holds:
An elementary linear algebraic argument then shows that if satisfies a.e., then automatically is the minimal norm element of . Therefore, integrating (1.2) yields the desired descent guarantee
Evidently, exactly the same argument yields the chain rule (1.2) for subdifferentially regular functions. These are the functions $f$ such that each subgradient $v \in \partial f(x)$ defines a linear lower-estimator of $f$ up to first-order; see for example [10, Section 2.4] or [31, Definition 7.25]. Nonetheless, subdifferentially regular functions preclude “downwards cusps”, and therefore still do not capture such simple examples as . It is worthwhile to mention that one cannot expect (1.3) to always hold. Indeed, there are pathological locally Lipschitz functions that do not satisfy (1.3); one example is the univariate 1-Lipschitz function whose Clarke subdifferential is the unit interval at every point [30, 6].
In this work, we isolate a different structural property on the function $f$, which guarantees the validity of (1.2) and therefore of the descent condition (1.3). We will assume that the graph of the function admits a partition into finitely many smooth manifolds, which fit together in a regular pattern. Formally, we require the graph of $f$ to admit a so-called Whitney stratification, and we will call such functions Whitney stratifiable. Whitney stratifications have already figured prominently in optimization, beginning with the seminal work . An important subclass of Whitney stratifiable functions consists of semi-algebraic functions, meaning those whose graphs can be written as a finite union of sets each defined by finitely many polynomial inequalities. Semialgebraicity is preserved under all the typical functional operations in optimization (e.g. sums, compositions, inf-projections) and therefore semi-algebraic functions are usually easy to recognize. More generally still, “semianalytic” functions and those that are “definable in an o-minimal structure” are Whitney stratifiable. The latter function class, in particular, shares all the robustness and analytic properties of semi-algebraic functions, while encompassing many more examples. Case in point, Wilkie famously showed that there is an o-minimal structure that contains both the exponential and all semi-algebraic functions.

Footnote 2: The term “tame” used in the title has a technical meaning. Tame sets are those whose intersection with any ball is definable in some o-minimal structure. The manuscript  provides a nice exposition on the role of tame sets and functions in optimization.
The key observation for us, which originates in [16, Section 5.1], is that any locally Lipschitz Whitney stratifiable function necessarily satisfies the chain rule (1.2) along any absolutely continuous curve. Consequently, the descent guarantee (1.3) holds along any subgradient trajectory, and our convergence guarantees for the stochastic subgradient method become applicable. Since the composition of two definable functions is definable, it follows immediately from Wilkie’s o-minimal structure that nonsmooth deep neural networks built from definable pieces—such as quadratics , hinge losses , and log-exp functions—are themselves definable. Hence, the results of this paper endow stochastic subgradient methods, applied to definable deep networks, with rigorous convergence guarantees.
Validity of the chain rule (1.2) for Whitney stratifiable functions is not new. It was already proved in [16, Section 5.1] for semi-algebraic functions, though identical arguments hold more broadly for Whitney stratifiable functions. These results, however, are somewhat hidden in the paper , which is possibly why they have thus far been underutilized. In this manuscript, we provide a self-contained review of the material from [16, Section 5.1], highlighting only the most essential ingredients and streamlining some of the arguments.
Though the discussion above is for unconstrained problems, the techniques we develop apply much more broadly to constrained problems of the form
Here and are locally-Lipschitz continuous functions and is an arbitrary closed set. The popular proximal stochastic subgradient method simply iterates the steps
Combining our techniques with those in  quickly yields subsequential convergence guarantees for this algorithm. Note that we impose no convexity assumptions on , , or .
The outline of this paper is as follows. In Section 2, we fix the notation for the rest of the manuscript. Section 3 provides a self-contained treatment of asymptotic consistency for discrete approximations of differential inclusions. In Section 4, we specialize the results of the previous section to the stochastic subgradient method. Finally, in Section 5, we verify the sufficient conditions for subsequential convergence for a broad class of locally Lipschitz functions, including those that are subdifferentially regular and Whitney stratifiable. In particular, we specialize our results to deep learning settings in Corollary 5.11. In the final Section 6, we extend the results of the previous sections to the proximal setting.
Throughout, we will mostly use standard notation on differential inclusions, as set out for example in the monographs of Borkar, Clarke-Ledyaev-Stern-Wolenski, and Smirnov. We will always equip the Euclidean space $\mathbb{R}^d$ with an inner product $\langle \cdot, \cdot \rangle$ and the induced norm $\|x\| := \sqrt{\langle x, x \rangle}$. The distance of a point $x$ to a set $Q \subset \mathbb{R}^d$ will be written as $\mathrm{dist}(x; Q) := \inf_{y \in Q} \|y - x\|$. The indicator function of $Q$, denoted $\delta_Q$, is defined to be zero on $Q$ and $+\infty$ off it. The symbol $\mathcal{B}$ will denote the closed unit ball in $\mathbb{R}^d$, while $\mathcal{B}_\varepsilon(x)$ will stand for the closed ball of radius $\varepsilon > 0$ around $x$. We will use $\mathbb{R}_+$ to denote the set of nonnegative real numbers.
2.1 Absolutely continuous curves
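Any continuous function $x\colon \mathbb{R}_+ \to \mathbb{R}^d$ is called a curve in $\mathbb{R}^d$. All curves in $\mathbb{R}^d$ comprise the set $C(\mathbb{R}_+, \mathbb{R}^d)$. We will say that a sequence of functions $z_k$ converges to $z$ in $C(\mathbb{R}_+, \mathbb{R}^d)$ if $z_k$ converge to $z$ uniformly on compact intervals, that is, for all $T > 0$, we have
$$\lim_{k \to \infty} \sup_{t \in [0,T]} \|z_k(t) - z(t)\| = 0.$$
Recall that a curve $x\colon \mathbb{R}_+ \to \mathbb{R}^d$ is absolutely continuous if there exists a map $y\colon \mathbb{R}_+ \to \mathbb{R}^d$ that is integrable on any compact interval and satisfies
$$x(t) = x(0) + \int_0^t y(\tau)\, d\tau \quad \text{for all } t \ge 0.$$
Moreover, if this is the case, then equality $y(t) = \dot{x}(t)$ holds for a.e. $t \ge 0$. Henceforth, for brevity, we will call absolutely continuous curves arcs. We will often use the observation that if $f\colon \mathbb{R}^d \to \mathbb{R}$ is locally Lipschitz continuous and $x$ is an arc, then the composition $f \circ x$ is absolutely continuous.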
2.2 Set-valued maps and the Clarke subdifferential
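A set-valued map $G\colon \mathcal{X} \rightrightarrows \mathbb{R}^m$ is a mapping from a set $\mathcal{X} \subseteq \mathbb{R}^d$ to the powerset of $\mathbb{R}^m$. Thus $G(x)$ is a subset of $\mathbb{R}^m$, for each $x \in \mathcal{X}$. We will use the notation
$$G^{-1}(v) := \{x \in \mathcal{X} : v \in G(x)\}$$
for the preimage of a vector $v \in \mathbb{R}^m$. The map $G$ is outer-semicontinuous at a point $x \in \mathcal{X}$ if for any sequences $x_i \to x$ in $\mathcal{X}$ and $v_i \in G(x_i)$ converging to some vector $v \in \mathbb{R}^m$, the inclusion $v \in G(x)$ holds.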
The most important set-valued map for our work will be the generalized derivative in the sense of Clarke, a notion we now review. Consider a locally Lipschitz continuous function $f\colon \mathbb{R}^d \to \mathbb{R}$. The well-known Rademacher’s theorem guarantees that $f$ is differentiable almost everywhere. Taking this into account, the Clarke subdifferential of $f$ at any point $x$ is the set [10, Theorem 8.1]
$$\partial f(x) := \mathrm{conv}\Big\{ \lim_{i \to \infty} \nabla f(x_i) : x_i \xrightarrow{\Omega} x \Big\},$$
where $\Omega$ is any full-measure subset of $\mathbb{R}^d$ such that $f$ is differentiable at each of its points. It is standard that the map $x \mapsto \partial f(x)$ is outer-semicontinuous and its images $\partial f(x)$ are nonempty, compact, convex sets for each $x \in \mathbb{R}^d$; see for example [10, Proposition 1.5 (a,e)].
Analogously to the smooth setting, a point $x \in \mathbb{R}^d$ is called (Clarke) critical for $f$ whenever the inclusion $0 \in \partial f(x)$ holds. Equivalently, these are the points at which the Clarke directional derivative is nonnegative in every direction [10, Section 2.1]. A real number $r \in \mathbb{R}$ is called a critical value of $f$ if there exists a critical point $x$ satisfying $r = f(x)$.
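As a quick illustration of these definitions, consider two univariate functions chosen here purely as examples (they are not taken from the text above). Both are differentiable away from the origin, so the Clarke subdifferential at the origin is the convex hull of the two one-sided limits of the derivative.

```latex
% Example 1: f(x) = |x|.
%   f'(x) = 1 for x > 0 and f'(x) = -1 for x < 0, hence
\partial f(x) =
\begin{cases}
  \{1\}   & x > 0, \\
  [-1, 1] & x = 0, \\
  \{-1\}  & x < 0.
\end{cases}
% Since 0 \in \partial f(0), the origin is Clarke critical (it is the minimizer).

% Example 2: g(x) = -|x|.
%   g'(x) = -1 for x > 0 and g'(x) = 1 for x < 0, hence
\partial g(0) = \mathrm{conv}\{-1, 1\} = [-1, 1].
% Again 0 \in \partial g(0), so the origin is Clarke critical for g,
% even though it is a local maximizer rather than a minimizer.
```

The second example shows that Clarke criticality is a first-order stationarity notion only: it does not distinguish minimizers from downward-facing kinks.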
3 Differential inclusions and discrete approximations
In this section, we discuss the asymptotic behavior of discrete approximations of differential inclusions. All the elements of the analysis we present, in varying generality, can be found in the works of Benaïm-Hofbauer-Sorin [2, 3], Borkar , and Duchi-Ruan . Out of these, we most closely follow the work of Duchi-Ruan .
3.1 Functional convergence of discrete approximations
Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a closed set and let $G\colon \mathcal{X} \rightrightarrows \mathbb{R}^d$ be a set-valued map. Then an arc $z\colon \mathbb{R}_+ \to \mathbb{R}^d$ is called a trajectory of $G$ if it satisfies the differential inclusion
$$\dot{z}(t) \in G(z(t)) \quad \text{for a.e. } t \ge 0. \tag{3.1}$$
Notice that the image of any such arc $z$ is automatically contained in $\mathcal{X}$, since arcs are continuous and $\mathcal{X}$ is closed. In this work, we will primarily focus on iterative algorithms that aim to asymptotically track a trajectory of the differential inclusion (3.1) using a noisy discretization with vanishing step-sizes. Though our discussion allows for an arbitrary set-valued map $G$, the reader should keep in mind that the most important example for us will be $G = -\partial f$, where $f$ is a locally Lipschitz function.
Throughout, we will consider the following iteration sequence:
$$x_{k+1} = x_k + \alpha_k (y_k + \xi_k). \tag{3.2}$$
Here $\{\alpha_k\}$ is a sequence of step-sizes, $y_k$ should be thought of as an approximate evaluation of $G$ at some point near $x_k$, and $\{\xi_k\}$ is a sequence of “errors”.
Our immediate goal is to isolate reasonable conditions, under which the sequence asymptotically tracks a trajectory of the differential inclusion (3.1). Following the work of Duchi-Ruan  on stochastic approximation, we stipulate the following assumptions.
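Assumption A (Standing assumptions).
1. All limit points of $\{x_k\}$ lie in $\mathcal{X}$.
2. The iterates are bounded, i.e., $\sup_k \|x_k\| < \infty$ and $\sup_k \|y_k\| < \infty$.
3. The sequence $\{\alpha_k\}$ is nonnegative, square summable, but not summable:
$$\alpha_k \ge 0, \qquad \sum_k \alpha_k = \infty, \qquad \sum_k \alpha_k^2 < \infty.$$
4. The weighted noise sequence is convergent: $\sum_{k=1}^n \alpha_k \xi_k \to v$ for some $v$ as $n \to \infty$.
5. For any unbounded increasing sequence $\{k_j\} \subset \mathbb{N}$ such that $x_{k_j}$ converges to some point $\bar{x}$, it holds:
$$\lim_{n \to \infty} \mathrm{dist}\Big( \frac{1}{n} \sum_{j=1}^n y_{k_j},\; G(\bar{x}) \Big) = 0.$$
Some comments are in order. Conditions 1, 2, and 3 are in some sense minimal, though the boundedness condition must be checked for each particular algorithm. Condition 4 guarantees that the noise sequence $\{\xi_k\}$ does not grow too quickly relative to the rate at which the step-sizes $\alpha_k$ decrease. The key Condition 5 summarizes the way in which the values $y_k$ are approximate evaluations of $G$, up to convexification.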
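To make the step-size condition concrete, here is a short worked example. The polynomial family $\alpha_k = k^{-p}$ is an illustrative assumption, not a choice made in the text.

```latex
% Polynomial step-sizes \alpha_k = k^{-p}, k \ge 1, with p > 0.
% By the p-series test:
\sum_{k \ge 1} \alpha_k = \sum_{k \ge 1} k^{-p} = \infty
  \iff p \le 1,
\qquad
\sum_{k \ge 1} \alpha_k^2 = \sum_{k \ge 1} k^{-2p} < \infty
  \iff p > \tfrac{1}{2}.
% Hence the sequence is square summable but not summable
% exactly when p \in (1/2, 1].
% For instance, \alpha_k = 1/k qualifies, while \alpha_k = 1/\sqrt{k}
% (not square summable) and \alpha_k = 1/k^2 (summable) do not.
```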
To formalize the idea of asymptotic approximation, let us define the time points and , for . Let
now be the linear interpolation of the discrete path:
For each , define the time-shifted curve .
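The interpolation and time-shift constructions can be sketched in a few lines of code. The indexing convention used below ($t_0 = 0$ and $t_{k+1} = t_k + \alpha_k$, with iterates indexed from zero), the restriction to scalar iterates, the clamping outside the sampled time range, and all function names are assumptions made for illustration.

```python
from bisect import bisect_right


def time_points(alphas):
    """Cumulative times, assuming t_0 = 0 and t_{k+1} = t_k + alpha_k."""
    ts = [0.0]
    for a in alphas:
        ts.append(ts[-1] + a)
    return ts


def interpolate(ts, xs, t):
    """Piecewise-linear interpolation of the discrete path at time t.

    ts and xs must have equal length. Outside [ts[0], ts[-1]] the curve is
    clamped to the first or last iterate (a choice made for this finite sketch).
    """
    if t <= ts[0]:
        return xs[0]
    if t >= ts[-1]:
        return xs[-1]
    k = bisect_right(ts, t) - 1           # index with ts[k] <= t < ts[k + 1]
    w = (t - ts[k]) / (ts[k + 1] - ts[k])  # fraction of the way to ts[k + 1]
    return xs[k] + w * (xs[k + 1] - xs[k])


def shifted_curve(ts, xs, tau):
    """Return the time-shifted curve s -> x(tau + s)."""
    return lambda s: interpolate(ts, xs, tau + s)


if __name__ == "__main__":
    ts = time_points([1.0, 0.5, 0.25])
    xs = [0.0, 2.0, 1.0, 3.0]
    print(interpolate(ts, xs, 1.25))
    print(shifted_curve(ts, xs, 1.0)(0.25))
```

Because the step-sizes vanish, the breakpoints cluster more and more tightly in time, which is why the shifted curves for large shifts can resemble a continuous trajectory.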
The following result of Duchi-Ruan [17, Theorem 2] shows that under the above conditions, for any sequence , the shifted curves subsequentially converge in to a trajectory of (3.1). Results of this type under more stringent assumptions, and with similar arguments, have previously appeared for example in Benaïm-Hofbauer-Sorin [2, 3] and Borkar .
3.2 Subsequential convergence to equilibrium points
A primary application of the discrete process (3.2) is to solve the inclusion
Indeed, one can consider the points satisfying (3.4) as equilibrium (constant) trajectories of the differential inclusion (3.1). Ideally, one would like to find conditions guaranteeing that every limit point of the sequence , produced by the recursion (3.2), satisfies the desired inclusion (3.4). Making such a leap rigorous typically relies on combining the asymptotic convergence guarantee of Theorem 3.1 with existence of a Lyapunov-like function for the continuous dynamics; see e.g. [2, 3]. Let us therefore introduce the following assumption.
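Assumption B (Lyapunov condition).
There exists a continuous function $\varphi\colon \mathbb{R}^d \to \mathbb{R}$, which is bounded from below, and such that the following two properties hold.
1. (Weak Sard) For a dense set of values $r \in \mathbb{R}$, the intersection $\varphi^{-1}(r) \cap G^{-1}(0)$ is empty.
2. (Descent) Whenever $z\colon \mathbb{R}_+ \to \mathbb{R}^d$ is a trajectory of the differential inclusion (3.1) and $0 \notin G(z(0))$, there exists a real $T > 0$ satisfying
$$\varphi(z(T)) < \sup_{t \in [0,T]} \varphi(z(t)) \le \varphi(z(0)).$$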
The weak Sard property is reminiscent of the celebrated Sard’s theorem in real analysis. Indeed, consider the classical setting $G = -\nabla f$ and $\varphi = f$ for a smooth function $f\colon \mathbb{R}^d \to \mathbb{R}$. Then the weak Sard property stipulates that the set of noncritical values of $f$ is dense in $\mathbb{R}$. By Sard’s theorem, this is indeed the case, as long as $f$ is $C^d$ smooth. Indeed, Sard’s theorem guarantees the much stronger property that the set of noncritical values has full measure. We will comment more on the weak Sard property in Section 4, once we shift focus to optimization problems. The descent property says that $\varphi$ eventually strictly decreases along the trajectories of the differential inclusion emanating from any non-equilibrium point. This Lyapunov-type condition is standard in the literature and we will verify that it holds for a large class of optimization problems in Section 5.
As we have alluded to above, the following theorem shows that under Assumptions A and B, every limit point of indeed satisfies the inclusion . We were unable to find this result stated and proved in this generality. Therefore, we record a complete proof in Section 3.3. The idea of the proof is of course not new, and can already be seen for example in [2, 17, 22]. Upon first reading, the reader can safely skip to Section 4.
3.3 Proof of Theorem 3.2
In this section, we will prove Theorem 3.2. The argument we present is rooted in the “non-escape argument” for ODEs, using as a Lyapunov function for the continuous dynamics. In particular, the proof we present is in the same spirit as that in [22, Theorem 5.2.1] and [17, Section 3.4.1].
The equality holds.
Clearly, the inequalities and hold in (3.5), respectively. We will argue that the reverse inequalities are valid. To this end, let be an arbitrary sequence with converging to some point as .
For each index , define the breakpoint . Then by the triangle inequality, we have
Lemma 3.3 implies that the right-hand-side tends to zero, and hence . Continuity of then directly yields the guarantee .
In particular, we may take to be a sequence realizing . Since the curve is bounded, we may suppose that up to taking a subsequence, converges to some point . We therefore deduce
thereby establishing the first equality in (3.5). The second equality follows analogously. ∎
The proof of Theorem 3.2 will follow quickly from the following proposition.
The values have a limit as .
Without loss of generality, suppose . For each , define the sublevel set
Choose any satisfying . Note that by Assumption B, we can let be as small as we wish. By the first equality in (3.5), there are infinitely many indices such that . The following elementary observation shows that for all large , if lies in then the next iterate lies in .
For all sufficiently large indices , the implication holds:
Since the sequence is bounded, it is contained in some compact set . From continuity, we have
It follows that the two closed sets, and , do not intersect. Since is compact, we deduce that it is well separated from ; that is, there exists satisfying:
In particular , whenever lies in . Taking into account Lemma 3.3, we deduce for all large , and therefore implies , as claimed. ∎
Let us define now the following sequence of iterates. Let be the first index satisfying
defining the exit time , the iterate lies in .
Then let be the next smallest index satisfying the same property, and so on. See Figure 1 for an illustration. The following claim will be key.
This process must terminate, that is exits only finitely many times.
Before proving the claim, let us see how it immediately yields the validity of the theorem. To this end, observe that Claims 1 and 2 immediately imply for all large . Since can be made arbitrarily small, we deduce . Equation (3.5) then directly implies , as claimed.
Proof of Claim 2.
To verify the claim, suppose that the process does not terminate. Thus we obtain an increasing sequence of indices with as . Set and consider the curves in . Then up to a subsequence, Theorem 3.1 shows that the curves converge in to some arc satisfying
By construction, we have and . We therefore deduce
Recall as . Lemma 3.3 in turn implies and therefore as well. Continuity of then guarantees that the right-hand-side of (3.6) tends to , and hence . In particular, is not an equilibrium point of . Hence, Assumption B yields a real such that
In particular, there exists a real satisfying
Appealing to uniform convergence on , we conclude
for all large , and therefore
Hence, for all large , all the curves map into . We conclude that the exit time satisfies
We will show that the bound yields the opposite inequality , which will lead to a contradiction.
To that end, let
be the last discrete index before . Because as , we have that for all large . We will now show that for all large , we have
which implies . Indeed, observe
Hence as . Continuity of then guarantees . Consequently, the inequality holds for all large , which is the desired contradiction. ∎
The proof of the lemma is now complete. ∎
We can now prove the main convergence theorem.
Proof of Theorem 3.2.
Let be a limit point of and suppose for the sake of contradiction that . Let be the indices satisfying as . Let be the subsequential limit of the curves in guaranteed to exist by Theorem 3.1. Assumption B guarantees that there exists a real satisfying
On the other hand, we successively deduce
where the last two equalities follow from Proposition 3.5 and continuity of . We have thus arrived at a contradiction, and the theorem is proved. ∎
4 Subgradient dynamical system
Assumptions A and B, taken together, provide a powerful framework for proving subsequential convergence of algorithms to a zero of the set-valued map . Note that the two assumptions are qualitatively different. Assumption A is a property of both the algorithm (3.2) and the map , while Assumption B is a property of alone.
For the rest of our discussion, we apply the differential inclusion approach outlined above to optimization problems. Setting the notation, consider the optimization task
$$\min_{x \in \mathbb{R}^d} f(x), \tag{4.1}$$
where $f\colon \mathbb{R}^d \to \mathbb{R}$ is a locally Lipschitz continuous function. Seeking to apply the techniques of Section 3, we simply set $\mathcal{X} = \mathbb{R}^d$ and $G = -\partial f$ in the notation therein. Thus we will be interested in algorithms that, under reasonable conditions, track solutions of the differential inclusion
$$\dot{z}(t) \in -\partial f(z(t)) \quad \text{for a.e. } t \ge 0, \tag{4.2}$$
and subsequentially converge to critical points of $f$. Discrete processes of the type (3.2) for the optimization problem (4.1) are often called stochastic approximation algorithms. Here we study two such prototypical methods: the stochastic subgradient method in this section and the stochastic proximal subgradient in Section 6. Each fits under the umbrella of Assumption A.
Setting the stage, the stochastic subgradient method simply iterates the steps:
$$x_{k+1} = x_k - \alpha_k (y_k + \xi_k) \quad \text{with} \quad y_k \in \partial f(x_k), \tag{4.3}$$
where $\{\alpha_k\}$ is a step-size sequence and $\{\xi_k\}$ is now a sequence of random variables (the “noise”) on some probability space. Let us now isolate the following standard assumptions (e.g. [5, 22]) for the method and see how they immediately imply Assumption A.
Assumption C (Standing assumptions for the stochastic subgradient method).
The sequence is nonnegative, square summable, but not summable:
Almost surely, the stochastic subgradient iterates are bounded: .
$\{\xi_k\}$ is a martingale difference sequence w.r.t. the increasing $\sigma$-fields
That is, there exists a function , which is bounded on bounded sets, so that almost surely, for all , we have
The following is true.
Suppose Assumption C holds. Clearly A.1 and A.3 hold trivially, while A.2 follows immediately from C.2 and local Lipschitz continuity of $f$. Assumption A.5 follows quickly from the fact that $\partial f$ is outer-semicontinuous and compact-convex valued; we leave the details to the reader. Thus we must only verify A.4, which follows quickly from standard martingale arguments. Indeed, notice from Assumption C, we have
Define the martingale . Thus the limit of the predictable compensator
exists. Applying [15, Theorem 5.3.33(a)], we deduce that almost surely converges to a finite limit. ∎
Thus applying Theorem 3.1, we deduce that under Assumption C, almost surely, the stochastic subgradient path tracks a trajectory of the differential inclusion (4.2). As we saw in Section 3, proving subsequential convergence to critical points requires existence of a Lyapunov-type function for the continuous dynamics. Henceforth, let us assume that the Lyapunov function is itself. Section 5 is devoted entirely to justifying this assumption for two broad classes of functions that are virtually exhaustive in data scientific contexts.
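Assumption D (Lyapunov condition in unconstrained minimization).
1. (Weak Sard) The set of noncritical values of $f$ is dense in $\mathbb{R}$.
2. (Descent) Whenever $z\colon \mathbb{R}_+ \to \mathbb{R}^d$ is a trajectory of the differential inclusion $\dot{z}(t) \in -\partial f(z(t))$ and $z(0)$ is not a critical point of $f$, there exists a real $T > 0$ satisfying
$$f(z(T)) < \sup_{t \in [0,T]} f(z(t)) \le f(z(0)).$$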
Some comments are in order. Recall that the classical Sard’s theorem guarantees that the set of critical values of any $C^d$-smooth function $f\colon \mathbb{R}^d \to \mathbb{R}$ has measure zero. Thus property 1 in Assumption D asserts a very weak version of a nonsmooth Sard theorem. This is a very mild property, there mostly for technical reasons. It can fail, however, even for a $C^1$-smooth function on $\mathbb{R}^2$; see the famous example of Whitney. Property 2 of Assumption D is more meaningful. It essentially asserts that $f$ must locally strictly decrease along any subgradient trajectory emanating from a noncritical point.
Thus applying Theorem 3.2, we have arrived at the following guarantee for the stochastic subgradient method.
5 Verifying the descent condition
In light of Theorems 3.2 and 4.2, it is important to isolate a class of functions that automatically satisfy Assumption D.2. In this section, we do exactly that, focusing on two problem classes: (1) subdifferentially regular functions and (2) those functions whose graphs are Whitney stratifiable. We will see that the latter problem class also satisfies D.1.
The material in this section is not new. In particular, the results of this section have appeared in [16, Section 5.1]. These results, however, are somewhat hidden in the paper  and are difficult to parse. Moreover, at the time of writing [16, Section 5.1], there was no clear application of the techniques, in contrast to our current paper. Since we do not expect the readers to be experts in variational analysis and semialgebraic geometry, we provide here a self-contained treatment, highlighting only the most essential ingredients and streamlining some of the arguments.
Definition 5.1 (Chain rule).
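Consider a locally Lipschitz function $f$ on $\mathbb{R}^d$. We will say that $f$ admits a chain rule if for any arc $z\colon \mathbb{R}_+ \to \mathbb{R}^d$, the equality
$$(f \circ z)'(t) = \langle \partial f(z(t)), \dot{z}(t) \rangle \quad \text{holds for a.e. } t \ge 0,$$
where the right-hand side is understood to mean that $\langle v, \dot{z}(t) \rangle$ takes the same value for every $v \in \partial f(z(t))$.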
The importance of the chain rule becomes immediately clear with the following lemma.
Fix a real satisfying . Observe then the equality
To simplify the notation, set , , and . Appealing to (5.2), we conclude , and therefore trivially we have
Basic linear algebra implies . Noting , we deduce as claimed. Since the reverse inequality trivially holds, we obtain the claimed equality, .
Since admits a chain rule, we conclude for a.e. the estimate
Since is locally Lipschitz, the composition is absolutely continuous. Hence integrating over the interval yields (5.1).
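A one-dimensional sanity check may help fix ideas. The function $f(x) = |x|$ and the initial point $z(0) = 1$ are chosen here for illustration only. The sketch verifies, on this single example, the minimal-norm property of the velocity and the resulting decrease of $f$ along the subgradient trajectory.

```latex
% Subgradient trajectory of f(x) = |x| started at z(0) = 1:
z(t) = \max\{1 - t,\, 0\}, \qquad t \ge 0.

% For t < 1: z(t) > 0, so \partial f(z(t)) = \{1\} and \dot z(t) = -1.
% For t > 1: z(t) = 0, so \partial f(z(t)) = [-1, 1] and \dot z(t) = 0.
% In both regimes -\dot z(t) is the minimal-norm element of \partial f(z(t)):
\|\dot z(t)\| = \mathrm{dist}\bigl(0;\ \partial f(z(t))\bigr)
  = \begin{cases} 1 & t < 1, \\ 0 & t > 1. \end{cases}

% Integrating the squared distance recovers the decrease in function value:
f(z(0)) - f(z(t)) = 1 - \max\{1 - t,\, 0\} = \min\{t,\, 1\}
  = \int_0^t \mathrm{dist}^2\bigl(0;\ \partial f(z(\tau))\bigr)\, d\tau .
```

The trajectory decreases $f$ at unit rate until it reaches the critical point $0$, after which it is stationary.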
5.1 Subdifferentially regular functions
The first function class we consider consists of subdifferentially regular functions. Such functions play a prominent role in variational analysis due to their close connection with convex functions; we refer the reader to the monograph  for details. In essence, subdifferential regularity forbids downward facing cusps in the graph of the function; e.g. is not subdifferentially regular. We now present the formal definition.
Definition 5.3 (Subdifferential regularity).
A locally Lipschitz function is subdifferentially regular at a point if every subgradient yields an affine minorant of up to first-order:
The following lemma shows that any locally Lipschitz function that is subdifferentially regular indeed admits a chain rule.
Lemma 5.4 (Chain rule under subdifferential regularity).
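Let $f$ be a locally Lipschitz and subdifferentially regular function. Consider an arc $z\colon \mathbb{R}_+ \to \mathbb{R}^d$. Since $z$ and $f \circ z$ are absolutely continuous, both are differentiable almost everywhere. Then for any $t \ge 0$ at which both are differentiable and any subgradient $v \in \partial f(z(t))$, we conclude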
Thus we have arrived at the following corollary. For ease of reference, we state subsequential convergence guarantees both for the general process (3.2) and for the specific stochastic subgradient method (4.3).
Let be a locally Lipschitz function that is subdifferentially regular and such that its set of noncritical values is dense in .
Though subdifferentially regular functions are widespread in applications, they preclude “downwards cusps”, and therefore do not capture such simple examples as and . The following section concerns a different function class that does capture these two nonpathological examples.
5.2 Stratifiable functions
As we saw in the previous section, subdifferential regularity is a local property that implies the desired item 2 of Assumption D. In this section, we instead focus on a broad class of functions satisfying a global geometric property, which eliminates pathological examples from consideration.
Before giving a formal definition, let us fix some notation. A set is a smooth manifold if there is an integer such that around any point , there is a neighborhood and a -smooth map with of full rank and satisfying . If this is the case, the tangent and normal spaces to at are defined to be and , respectively.
Definition 5.6 (Whitney stratification).
A Whitney -stratification of a set is a partition of into finitely many nonempty manifolds, called strata, satisfying the following compatibility conditions.
Frontier condition: For any two strata and , the implication
Whitney condition (a): For any sequence of points in a stratum converging to a point in a stratum , if the corresponding normal vectors converge to a vector , then the inclusion holds.
A function is Whitney -stratifiable if its graph admits a Whitney -stratification.
The definition of the Whitney stratification invokes two conditions, one topological and the other geometric. The frontier condition simply says that if one stratum intersects the closure of another , then must be fully contained in the closure . In particular, the frontier condition endows the strata with a partial order . The Whitney condition (a) is geometric. In short, it asserts that limits of normals along a sequence in a stratum are themselves normal to the stratum containing the limit of .
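As a simple illustration of the two conditions, consider the graph of the absolute value function. This example and the names of the strata are chosen here for illustration and do not come from the text.

```latex
% Graph of f(x) = |x| in R^2, partitioned into three strata:
M_0 = \{(0, 0)\}, \qquad
M_+ = \{(x, x) : x > 0\}, \qquad
M_- = \{(x, -x) : x < 0\}.

% Each stratum is a smooth manifold: M_0 has dimension 0, M_+ and M_- dimension 1.

% Frontier condition: M_0 meets the closures of M_+ and of M_-, and indeed
M_0 \subset \mathrm{cl}\, M_+ , \qquad M_0 \subset \mathrm{cl}\, M_- ,
% while M_+ and M_- do not meet each other's closures.

% Whitney condition (a): the tangent space to M_0 at the origin is \{0\},
% so its normal space is all of R^2. Normals to M_+ are multiples of (1, -1),
% normals to M_- are multiples of (1, 1); any limit of such normals along a
% sequence converging to the origin lies in R^2 = N_{M_0}(0, 0),
% so the condition holds trivially.
```

The same partition works for the graph of $-|x|$ after flipping signs, so a downward kink poses no obstacle to stratifiability.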
The following discussion of Whitney stratifications follows that in . Consider a Whitney -stratification of the graph of a locally Lipschitz function . Let be the manifolds obtained by projecting on . An easy argument using the constant rank theorem shows that the partition of is itself a Whitney -stratification and the restriction of to each stratum is -smooth. Whitney condition (a) directly yields the following consequence [4, Proposition 4]. For any stratum and any point , we have