The predominance of huge-scale, complex, nonsmooth, nonconvex problems in the development of certain artificial intelligence methods has brought rudimentary, numerically cheap, robust methods, such as subgradient algorithms, back to the forefront of contemporary numerics, see e.g. [33, 23, 34, 5, 12]. We investigate here some of the properties of the archetypical algorithm within this class, namely, the vanishing stepsize subgradient method of Shor. Given $f\colon\mathbb{R}^n\to\mathbb{R}$ locally Lipschitz, it reads
$$x_{k+1} \in x_k - \varepsilon_k\,\partial f(x_k), \qquad k\in\mathbb{N},$$
where $\partial f$ is the Clarke subgradient, $\varepsilon_k\to 0$, and $\sum_k \varepsilon_k = +\infty$. This dynamics, illustrated in Figure 1, has its roots in Cauchy's gradient method and seems to originate in Shor's thesis . The idea is natural at first sight: one accumulates small subgradient steps to make good progress on average, while hoping that oscillations will be tempered by the vanishing steps. For the convex case, the theory was developed by Ermol'ev , Poljak , and Ermol'ev–Shor . It is by now a mature theory, see e.g. [39, 40], which still enjoys considerable success through the famous mirror descent of Nemirovskii–Yudin [39, 7] and its endless variants. In the nonconvex case, more sophisticated methods were developed, see e.g. [35, 31, 41], yet little was known about the raw method until recently.
The work of Davis et al. , see also , revolving around the fundamental paper of Benaïm–Hofbauer–Sorin , brought the first breakthroughs. It relies on a classical idea going back to Euler: small-step discrete dynamics resemble their continuous-time counterparts. As established by Ljung , this observation can be made rigorous for large times in the presence of good Lyapunov functions. Benaïm–Hofbauer–Sorin  showed further that the transfer of asymptotic properties from continuous differential inclusions to small-step discrete methods is valid under rather weak compactness and dissipativity assumptions. This general result, combined with features specific to the subgradient case, made it possible to establish several optimization results, such as convergence to the set of critical points, convergence in values, and convergence in the long run in the presence of noise [21, 46, 14, 12].
Usual properties expected from an algorithm are diverse: convergence of iterates, convergence in values, rates, quality of optimality, complexity, or prevalence of minimizers. Although in our setting some of these aspects seem hopeless without strong assumptions, most of them remain largely unexplored. Numerical successes suggest, however, that the apparently erratic subgradient dynamics has appealing stability properties beyond the already delicate subsequential convergence to critical points.
In general, as long as the sequence $(x_k)_{k\in\mathbb{N}}$ remains bounded, we always have
$$\lim_{k\to\infty}\frac{1}{t_k}\sum_{i=0}^{k}\varepsilon_i v_i = 0, \tag{1}$$
where $t_k=\sum_{i=0}^{k-1}\varepsilon_i$ denotes the elapsed time (see Section 2).
This fact, which could be called "global oscillation compensation," does not prevent the trajectory from oscillating rapidly around a limit cycle, as illustrated in , and is therefore unsatisfying from the stabilization perspective of minimization. The phenomenon (1) remains true even when $(x_k)$ is not a gradient sequence, as in the case of discrete game-theoretic dynamical systems .
In this work, we adapt the theory of closed measures, which was originally developed in the calculus of variations (see for example [4, 9]), to the study of discrete dynamics. Using it, we establish several local oscillation compensation results for path differentiable functions. Morally, our results in this direction say that for limit points we have
While this does not imply the convergence of the sequence, it does mean that the drift arising from its average velocity vanishes as time elapses. This is made more explicit in the parts of those theorems showing that, given two limit points of the sequence, the time it takes for the sequence to flow from a small ball around one to a small ball around the other must eventually grow infinitely long, so that the overall speed of the sequence, as it traverses its accumulation set, becomes extremely slow.
With these types of results, we evidence new phenomena:
while the sequence may not converge, it spends most of its time oscillating near the critical set of the objective function, and there appear to be persistent accumulation points whose importance is predominant;
under weak Sard assumptions, we recover the convergence results of  and improve them with oscillation compensation results;
the oscillation structures itself orthogonally to the limit set, so that the incremental drift along this set is negligible with respect to the time increments.
These results are made possible by the use of closed measures. These measures capture the accumulation behavior of the sequence along with the "velocities" $v_k$. The simple idea of not throwing away the information carried by the vectors $v_k$ allows one to recover a lot of structure in the limit, which can be interpreted as a portrait of the long-term behavior of the sequence. The theory that we develop in Section 4.1 should also apply to the analysis of more general small-step algorithms. Along the way, for example, we are able to establish a new connection between the discrete and continuous gradient flows (Corollary 22) that complements the point of view of .
Notations and organization of the paper.
Let $n$ be a positive integer, and let $\mathbb{R}^n$ denote $n$-dimensional Euclidean space. The space of couples $(x,v)\in\mathbb{R}^n\times\mathbb{R}^n$ is seen as the phase space consisting of positions $x$ and velocities $v$. For two vectors $x$ and $y$, we let $\langle x,y\rangle$ denote their Euclidean scalar product. The norm $\|x\|=\sqrt{\langle x,x\rangle}$ induces the distance $d(x,y)=\|x-y\|$, and similarly on $\mathbb{R}^n\times\mathbb{R}^n$. The Euclidean gradient of a differentiable function $f$ is denoted by $\nabla f$. The set $\mathbb{N}$ contains all the nonnegative integers.
In Section 2 we give the definitions necessary to state our results, which we do in Section 3. The proofs of our results will be given in Section 5. Before we broach those arguments, we need to develop some preliminaries regarding our main tool, the so-called closed measures; we do this in Section 4.
2 Algorithm and framework
2.1 The vanishing step subgradient method
Consider a locally Lipschitz function $f\colon\mathbb{R}^n\to\mathbb{R}$ and denote by $D$ the set of its differentiability points, which is dense by Rademacher's theorem (see for example [27, Theorem 3.2]). The Clarke subdifferential of $f$ is defined by
$$\partial f(x) = \operatorname{conv}\left\{\lim_{k\to\infty}\nabla f(y_k) : y_k\to x,\ y_k\in D\right\},$$
where $\operatorname{conv} S$ denotes the closed convex envelope of a set $S\subseteq\mathbb{R}^n$; see .
A point $x\in\mathbb{R}^n$ such that $0\in\partial f(x)$ is called critical. The critical set is
$$\operatorname{crit} f = \{x\in\mathbb{R}^n : 0\in\partial f(x)\}.$$
It contains local minima and maxima.
The algorithm of interest in this work is:
Definition 1 (Small step subgradient method).
Let $f\colon\mathbb{R}^n\to\mathbb{R}$ be locally Lipschitz and let $(\varepsilon_k)_{k\in\mathbb{N}}$ be a sequence of positive step sizes such that
$$\lim_{k\to\infty}\varepsilon_k = 0 \quad\text{and}\quad \sum_{k=0}^{\infty}\varepsilon_k = +\infty.$$
Given $x_0\in\mathbb{R}^n$, consider the recursion, for $k\in\mathbb{N}$,
$$x_{k+1} = x_k - \varepsilon_k v_k.$$
Here, $v_k$ is chosen freely among the elements of $\partial f(x_k)$. The sequence $(x_k)_{k\in\mathbb{N}}$ is called a subgradient sequence.
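To fix ideas, the recursion above is straightforward to implement. The following sketch is purely illustrative and not part of the paper's development: it runs the method on the hypothetical nonsmooth objective $f(x_1,x_2)=|x_1|+2|x_2|$ with steps $\varepsilon_k = 1/(k+1)$, tracking the time counter and the accumulated drift so as to anticipate the "global oscillation compensation" phenomenon (1).

```python
import numpy as np

def subgrad(x):
    """One Clarke subgradient of f(x) = |x1| + 2|x2|.

    At a kink (a zero coordinate) any element of the sign interval is
    admissible; sign(0) = 0 is one valid selection, as Definition 1 allows.
    """
    return np.sign(x) * np.array([1.0, 2.0])

def subgradient_method(x0, n_iter=20000):
    """Vanishing-step subgradient method with eps_k = 1/(k+1)."""
    x = np.array(x0, dtype=float)
    drift = np.zeros_like(x)   # accumulates eps_k * v_k = x_0 - x_{k+1}
    t = 0.0                    # time counter t_k = sum of the steps
    for k in range(n_iter):
        eps = 1.0 / (k + 1)
        v = subgrad(x)
        x = x - eps * v
        drift += eps * v
        t += eps
    return x, drift, t

x, drift, t = subgradient_method([3.0, -2.0])
# The iterates end up oscillating around the unique critical point (0, 0)
# with amplitude on the order of the current step, and the averaged drift
# ||sum eps_k v_k|| / t_k tends to 0 since the sequence stays bounded.
print(np.linalg.norm(x), np.linalg.norm(drift) / t)
```

Since $t_k$ grows only logarithmically for harmonic steps, the averaged drift decays slowly; increasing `n_iter` makes the ratio as small as desired.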
In what follows the sequence $(\varepsilon_k)_{k\in\mathbb{N}}$ is interpreted as a sequence of time increments, and it naturally defines a time counter $(t_k)_{k\in\mathbb{N}}$ through the formula
$$t_k = \sum_{i=0}^{k-1}\varepsilon_i,$$
so that $t_k\to+\infty$ as $k\to\infty$. Given a sequence $(x_k)_{k\in\mathbb{N}}$ and a subset $U\subseteq\mathbb{R}^n$, we set
$$\operatorname{time}_k(U) = \sum_{i=0}^{k}\varepsilon_i\,\mathbf{1}_U(x_i),$$
which corresponds to the time spent by the sequence in $U$ up to iteration $k$.
Recall that the accumulation set of the sequence $(x_k)_{k\in\mathbb{N}}$ is the set of points $\bar x$ such that, for every neighborhood $U$ of $\bar x$, the set $\{k\in\mathbb{N} : x_k\in U\}$ is infinite. Its elements are known as limit points.
If the sequence $(x_k)_{k\in\mathbb{N}}$ is bounded and comes from the subgradient method as in Definition 1, then $\|x_{k+1}-x_k\|\to 0$, because $\varepsilon_k\to 0$ and $\partial f$ is locally bounded by the local Lipschitz continuity of $f$; consequently the accumulation set is compact and connected, see e.g., .
Accumulation points are the manifestation of recurrent behaviors of the sequence, but the frequency of the recurrence is ignored. In the presence of a time counter, here $(t_k)_{k\in\mathbb{N}}$, this persistence phenomenon may be measured through the duration of presence in the neighborhood of a recurrent point. This idea is formalized in the following definition:
Definition 2 (Essential accumulation set).
Given a step size sequence $(\varepsilon_k)_{k\in\mathbb{N}}$ and a subgradient sequence $(x_k)_{k\in\mathbb{N}}$ as in Definition 1, the essential accumulation set is the set of points $\bar x$ such that, for every neighborhood $U$ of $\bar x$,
$$\limsup_{k\to\infty}\frac{1}{t_k}\sum_{i=0}^{k}\varepsilon_i\,\mathbf{1}_U(x_i) > 0.$$
Analogously, considering the iterates together with the increments, we say that a point $(\bar x,\bar v)\in\mathbb{R}^n\times\mathbb{R}^n$ is in the essential accumulation set of $((x_k,v_k))_{k\in\mathbb{N}}$ if every neighborhood $U$ of $(\bar x,\bar v)$ satisfies
$$\limsup_{k\to\infty}\frac{1}{t_k}\sum_{i=0}^{k}\varepsilon_i\,\mathbf{1}_U(x_i,v_i) > 0.$$
As explained previously, the essential accumulation set encodes significantly recurrent behavior; it ignores sporadic escapades of the sequence. Essential accumulation points are accumulation points, but the converse is not true. If the sequence is bounded, the essential accumulation set is nonempty and compact, but not necessarily connected.
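The distinction between accumulation and essential accumulation is easy to visualize numerically. In the following sketch (an illustrative toy sequence, not produced by the subgradient method), the sequence sits at $0$ except at indices that are powers of two, where it jumps to $1$; with steps $\varepsilon_k=1/(k+1)$, both $0$ and $1$ are accumulation points, but the fraction of elapsed time spent near $1$ is negligible:

```python
def essential_fraction(points, eps, in_U):
    """Fraction of the elapsed time t_k that the sequence spends in U.

    points : iterates x_i (scalars here, for simplicity)
    eps    : step sizes eps_i, interpreted as time increments
    in_U   : predicate; in_U(x) is True when x lies in the neighborhood U
    """
    total = sum(eps)
    inside = sum(e for x, e in zip(points, eps) if in_U(x))
    return inside / total

n = 200000
eps = [1.0 / (k + 1) for k in range(n)]
# Jump to 1 exactly when k is a power of two (k & (k - 1) == 0 for k >= 1).
xs = [1.0 if k > 0 and (k & (k - 1)) == 0 else 0.0 for k in range(n)]

near0 = essential_fraction(xs, eps, lambda x: abs(x) < 0.1)
near1 = essential_fraction(xs, eps, lambda x: abs(x - 1.0) < 0.1)
# The time spent near 1 stays bounded (a convergent sparse sum of steps)
# while t_k ~ log n diverges, so the fraction near 1 tends to 0:
# the point 1 is an accumulation point but not an essential one.
print(near0, near1)
```

Pushing $n$ higher drives the fraction near $1$ to zero, in accordance with Definition 2.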
2.2 Regularity assumptions on the objective function
Lipschitz continuity and pathologies.
Recall that, given a locally Lipschitz function $f\colon\mathbb{R}^n\to\mathbb{R}$, a subgradient curve is an absolutely continuous curve $\gamma\colon[0,+\infty)\to\mathbb{R}^n$ satisfying
$$\dot\gamma(t)\in-\partial f(\gamma(t)) \quad\text{for almost every } t>0.$$
By general results these curves exist, see e.g.  and references therein. In our context they embody the ideal behavior we could hope for from subgradient sequences.
First let us recall that pathological Lipschitz functions are generic in the Baire sense, as established in [52, 16]. In particular, a generic $1$-Lipschitz function $f$ satisfies $\partial f(x)=\bar B(0,1)$, the closed unit ball, everywhere on $\mathbb{R}^n$. This means that any absolutely continuous curve $\gamma$ with $\|\dot\gamma\|\le 1$ almost everywhere is a subgradient curve of every such function, regardless of its specifics. Note that this implies that a subgradient curve may constantly remain away from the critical set.
The examples of Daniilidis–Drusvyatskiy  make this erratic behaviour even more concrete. For instance, they provide a Lipschitz function and a bounded subgradient curve having the "absurd" roller coaster property
Although not directly matching our framework, these examples show that we cannot hope for satisfying convergence results under the bare general assumption of Lipschitz continuity.
We are thus led to consider functions avoiding such pathologies. We choose to work with the fonctions saines (literally, "healthy functions," as opposed to pathological ones) of Valadier  (1989), rediscovered in several works, see e.g. [17, 21, 14]. We use the terminology of .
Definition 3 (Path differentiable functions).
A locally Lipschitz function $f\colon\mathbb{R}^n\to\mathbb{R}$ is path differentiable if, for each Lipschitz curve $\gamma\colon[0,1]\to\mathbb{R}^n$ and almost every $t\in[0,1]$, the composition $f\circ\gamma$ is differentiable at $t$ and its derivative is given by
$$(f\circ\gamma)'(t) = \langle v,\dot\gamma(t)\rangle \quad\text{for all } v\in\partial f(\gamma(t)).$$
In other words, all vectors in $\partial f(\gamma(t))$ share the same projection onto the subspace generated by $\dot\gamma(t)$. Note that the definition proposed in  is not limited to chain rules involving the Clarke subgradient, but it turns out to be equivalent to a definition very much like the one we give here, with Lipschitz curves replaced by absolutely continuous curves, the equivalence being furnished by [14, Corollary 2]. The current definition is slightly more general than the original one , that is, our class of functions contains the one discussed in , because we require the condition only for Lipschitz curves, which are all absolutely continuous.
The class of path differentiable functions is very large and includes many cases of interest, such as functions that are semialgebraic, tame (definable in an o-minimal structure), or Whitney stratifiable (in particular, the models and loss functions used in machine learning, such as those occurring in neural network training with all the activation functions that have been considered in the literature), as well as functions that are convex or concave; see e.g. [14, 47].
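For a quick sanity check of Definition 3, consider the path differentiable function $f(x_1,x_2)=\max(x_1,x_2)$ (a simple illustrative choice) along the Lipschitz curve $\gamma(t)=(\cos t,\sin t)$. Away from the null set of times where the curve crosses the diagonal, the finite-difference derivative of $f\circ\gamma$ must agree with $\langle v,\dot\gamma(t)\rangle$ for every Clarke subgradient $v$:

```python
import numpy as np

def f(x):
    return max(x[0], x[1])

def clarke_extreme_points(x):
    """Extreme points of the Clarke subdifferential of f = max(x1, x2)."""
    if x[0] > x[1]:
        return [np.array([1.0, 0.0])]
    if x[1] > x[0]:
        return [np.array([0.0, 1.0])]
    # On the diagonal, the subdifferential is conv{(1,0), (0,1)}.
    return [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

gamma  = lambda t: np.array([np.cos(t), np.sin(t)])
dgamma = lambda t: np.array([-np.sin(t), np.cos(t)])

h = 1e-6
rng = np.random.default_rng(0)
for t in rng.uniform(0.0, 2 * np.pi, 1000):
    # The curve meets the diagonal only at t = pi/4 and t = 5*pi/4, a
    # null set; we skip a small window around those times, exactly as
    # the "almost every t" in Definition 3 permits.
    if min(abs(t - np.pi / 4), abs(t - 5 * np.pi / 4)) < 1e-3:
        continue
    num = (f(gamma(t + h)) - f(gamma(t - h))) / (2 * h)  # (f o gamma)'(t)
    for v in clarke_extreme_points(gamma(t)):
        assert abs(num - v @ dgamma(t)) < 1e-4
print("chain rule verified at sampled times")
```

The same experiment fails for pathological (non path differentiable) Lipschitz functions, where distinct subgradients may yield distinct projections along the curve.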
Whitney stratifiable functions.
Due to their ubiquity, we detail here the notion of Whitney stratifiability and illustrate its utility. It was first used in  in the variational analysis context to establish Sard's theorem and the Kurdyka–Łojasiewicz inequality for definable functions, two properties which appear to be essential in the study of many subgradient-related problems, see e.g. [3, 15].
Definition 4 (Whitney stratification).
Let $X$ be a nonempty subset of $\mathbb{R}^n$ and let $p\ge 1$. A $C^p$ stratification of $X$ is a locally finite partition $(X_i)_{i\in I}$ of $X$ into connected submanifolds of $\mathbb{R}^n$ of class $C^p$, called strata, such that for each $i\ne j$,
$$X_i\cap\operatorname{cl} X_j\ne\emptyset \implies X_i\subseteq\operatorname{cl} X_j\setminus X_j.$$
A stratification of $X$ satisfies Whitney's condition (a) if, for each $i\ne j$, each $x\in X_i\cap\operatorname{cl} X_j$, and each sequence $(y_k)_{k\in\mathbb{N}}$ in $X_j$ with $y_k\to x$ as $k\to\infty$, and such that the sequence of tangent spaces $(T_{y_k}X_j)_{k\in\mathbb{N}}$ converges (in the usual metric topology of the Grassmannian) to a subspace $T$, we have that $T_x X_i\subseteq T$. A stratification is Whitney if it satisfies Whitney's condition (a).
Definition 5 (Whitney stratifiable function).
With the same notations as above, a function $f\colon\mathbb{R}^n\to\mathbb{R}$ is Whitney $C^p$-stratifiable if there exists a Whitney $C^p$ stratification of its graph as a subset of $\mathbb{R}^{n+1}$.
Examples of Whitney stratifiable functions are semialgebraic or tame functions, but much less structured functions are covered as well. This class covers most known finite-dimensional optimization problems, for instance those met in the training of neural networks. Let us mention here that the subclass of tame functions has led to many results through the nonsmooth Kurdyka–Łojasiewicz inequality, see e.g. , while mere Whitney stratifiability combined with the Ljung-like theory developed in  has also provided several interesting openings [21, 14, 12].
3 Main results: accumulation, convergence, oscillation compensation
We now present our main results, which rely on three types of increasingly demanding assumptions:
path differentiability (Section 3.1),
path differentiable functions with a weak Sard property (Section 3.2),
Whitney stratifiable functions (Section 3.3).
Section 3.3 also contains a general result pertaining to the structure of the oscillations.
3.1 Asymptotic dynamics for path differentiable functions
Theorem 6 (Asymptotic dynamics for path differentiable functions).
Assume that $f\colon\mathbb{R}^n\to\mathbb{R}$ is locally Lipschitz and path differentiable, and that $(x_k)_{k\in\mathbb{N}}$ is a sequence generated by the subgradient method (Definition 1) that remains bounded. Then we have:
(Lengthy separations) Let and be two distinct points in such that . Let be a subsequence such that as , and for each choose such that . Consider
(Oscillation compensation) Let be a continuous function. Then for every subsequence such that
(Criticality) For all , . In other words, .
3.2 Asymptotic dynamics for path differentiable functions with a weak Sard property
With slightly more stringent hypotheses, which are automatically valid for some important cases such as lower-$C^k$ or upper-$C^k$ functions  (for $k$ sufficiently large) and semialgebraic or tame functions , we have:
Theorem 7 (Asymptotic dynamics for path differentiable functions: weak Sard case).
In the setting of Theorem 6, and if additionally is constant on the connected components of its critical set, then we also have:
(Lengthy separations version 2) Let and be two distinct points in , , and take small enough that the balls and are at a positive distance from each other, that is, . Consider the successive amounts of time it takes for the sequence to go from the ball to the ball , namely,
Then as .
(Long intervals) Let be neighborhoods of such that . Let be the union of the maximal intervals of the form for some , such that and . Then either there is some that is unbounded or
(Oscillation compensation version 2) Let be two open sets as in item 2, and be the corresponding union of maximal intervals. Then
(Criticality) For all in the (traditional) accumulation set , . That is to say, .
(Convergence of the values) The values $f(x_k)$ converge to a real number as $k\to\infty$.
3.3 Oscillation structure and asymptotics for Whitney stratifiable functions
The next two corollaries express that oscillations happen perpendicularly to the singular set of , whenever this makes sense. In particular, they are perpendicular to and , respectively, wherever this is well defined.
Corollary 9 (Perpendicularity of the oscillations).
Stratifiable functions (cf. Definition 5) allow us to gain much more insight into the oscillation compensation phenomenon: we have seen that substantial oscillations, i.e., those generated by nonvanishing subgradients, must be structured orthogonally to the locus of limit points. Whitney rigidity then forces the following intuitive phenomenon: substantial bouncing drives the sequence to have limit points lying in the bed of V-shaped valleys formed by the graph of .
Corollary 10 (Oscillations and -shaped valleys).
Let  be a Whitney stratifiable function, and let  be a point in the accumulation set of a sequence generated by the subgradient method as in Definition 1. Assume that there is a subsequence with
then is contained in a stratum of dimension less than , and if is tangent to at then
This geometrical setting is reminiscent of the partial smoothness assumptions of Lewis: a smooth path lies between the slopes of a sharp valley. While proximal-like methods land on the smooth locus in finite time [32, Theorem 4.1], our result suggests that the explicit subgradient method keeps on bouncing, approaching the smooth part without ever attaining it. This confirms the intuition that finite identification does not occur, although the oscillations eventually provide some information on active sets through their "orthogonality features."
3.4 Further discussion
Theorems 6 and 7 describe the long-term dynamics of the algorithm. While Theorem 6 only describes what happens close to the essential accumulation set, and hence only the most frequent persistent behavior, Theorem 7 covers the whole accumulation set and hence all recurrent behaviors.
While the high-frequency oscillations (i.e., bouncing) will, in many cases, be considerable, they almost cancel out. This is what we refer to as oscillation compensation. The intuitive picture the reader should have in mind is that the oscillations cancel out locally, as in (2). Yet, because of small technical minutiae, we do not obtain exactly (2) but rather very good approximations of it. Let us provide some explanations.
which is indeed a very good approximation of (2).
Similarly, setting, in item 3 of Theorem 7, and the balls centered at with radius , we obtain the following local version of the oscillation cancellation phenomenon: in the setting of Theorem 7, if and if is the union of maximal intervals such that and , then
Note that as we take the limit , we cover almost all in the ball , so we again get a statement very close to (2).
While Theorem 7 tells us that  converges, we conjecture that this is no longer true in the context of Theorem 6; this is a matter for future research. Similarly, in the setting of path differentiable functions, the question of determining whether all limit points of bounded subgradient sequences are critical remains open.
In all cases, including the Whitney stratifiable case, the sequence may fail to converge. A well-known example of such a situation was provided, for smooth , by Palis–de Melo .
However, our results show that the drift causing the divergence of the sequence is very slow in comparison with the local oscillations. This slowness can be appreciated directly in item 1 of Theorem 6 and in items 1 and 2 of Theorem 7. In substance, these results express that even if the sequence diverges, it takes longer and longer to connect disjoint neighborhoods of distinct limit points.
4 A closed measure theoretical approach
Given an open subset $U$ of $\mathbb{R}^n$, denote by $C^0(U)$ the set of continuous functions on $U$, while $C^1(U)$ is the set of continuously differentiable ones. We also consider the space of Lipschitz curves $\gamma\colon\mathbb{R}\to\overline U$; when $U$ is bounded, this space is endowed with the supremum norm $\|\gamma\|_\infty=\sup_{t\in\mathbb{R}}\|\gamma(t)\|$.
4.1 A compendium on closed measures
Given a measure $\mu$ on some set $X$ and a measurable map $h\colon X\to Y$, where $Y$ is another set, the pushforward $h_*\mu$ is defined to be the measure on $Y$ such that, for $A\subseteq Y$ measurable, $h_*\mu(A)=\mu(h^{-1}(A))$.
Recall that the support of a positive Radon measure $\mu$ on $\mathbb{R}^n$, denoted $\operatorname{supp}\mu$, is the set of points $x$ such that $\mu(U)>0$ for every neighborhood $U$ of $x$. It is a closed set.
The origin of the concept of closed measures (sometimes also called holonomic measures or Young measures) can be traced back to the work of L.C. Young [53, 54] in the context of the calculus of variations. It has developed in parallel to the closely related normal currents [29, 28] and varifolds [2, 1], and has found applications in several areas of mathematics, especially Lagrangian and Hamiltonian dynamics [38, 37, 19, 50], and also optimal transport [9, 10].
The definition of closed measures is inspired by the following observations. Given a Lipschitz curve $\gamma\colon[0,T]\to\mathbb{R}^n$, its position-velocity information can be encoded by a measure $\mu_\gamma$ on $\mathbb{R}^n\times\mathbb{R}^n$ that is the pushforward of the normalized Lebesgue measure on the interval $[0,T]$ into $\mathbb{R}^n\times\mathbb{R}^n$ through the mapping $t\mapsto(\gamma(t),\dot\gamma(t))$, that is,
$$\mu_\gamma = (\gamma,\dot\gamma)_*\Big(\tfrac{1}{T}\,\mathcal{L}^1|_{[0,T]}\Big).$$
In other words, if $f\colon\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ is a measurable function, then the integral with respect to $\mu_\gamma$ is given by
$$\int f\,d\mu_\gamma = \frac{1}{T}\int_0^T f(\gamma(t),\dot\gamma(t))\,dt.$$
With this definition of $\mu_\gamma$ it follows that $\gamma$ is closed, that is, $\gamma(0)=\gamma(T)$, if, and only if, for all smooth $\varphi\colon\mathbb{R}^n\to\mathbb{R}$, we have
$$\int \langle\nabla\varphi(x),v\rangle\,d\mu_\gamma(x,v) = 0.$$
In other words, the integral of $(x,v)\mapsto\langle\nabla\varphi(x),v\rangle$ with respect to $\mu_\gamma$ is, up to the factor $1/T$, exactly the circulation of the gradient vector field $\nabla\varphi$ along the curve $\gamma$, and so it vanishes for every $\varphi$ exactly when $\gamma$ is closed. This generalizes into:
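This characterization is easy to test numerically. The sketch below (with an arbitrary illustrative test function $\varphi$) discretizes $\int\langle\nabla\varphi(x),v\rangle\,d\mu_\gamma$ by a midpoint Riemann sum: for the full circle the integral vanishes, while for a quarter circle it equals $(\varphi(\gamma(T))-\varphi(\gamma(0)))/T$.

```python
import numpy as np

def circulation(grad_phi, gamma, dgamma, T, n=20000):
    """Midpoint-rule approximation of the integral of <grad_phi(x), v>
    against mu_gamma, the pushforward of normalized Lebesgue measure on
    [0, T] under t -> (gamma(t), dgamma(t))."""
    ts = (np.arange(n) + 0.5) * (T / n)
    vals = [grad_phi(gamma(t)) @ dgamma(t) for t in ts]
    return sum(vals) / n            # equals (1/T) * integral over [0, T]

# Illustrative test function phi(x, y) = x^2 + x*y and its gradient.
phi      = lambda p: p[0] ** 2 + p[0] * p[1]
grad_phi = lambda p: np.array([2 * p[0] + p[1], p[0]])

gamma  = lambda t: np.array([np.cos(t), np.sin(t)])
dgamma = lambda t: np.array([-np.sin(t), np.cos(t)])

closed = circulation(grad_phi, gamma, dgamma, 2 * np.pi)  # full circle
arc    = circulation(grad_phi, gamma, dgamma, np.pi / 2)  # quarter circle

print(closed)  # vanishes: the full circle is a closed curve
print(arc, (phi(gamma(np.pi / 2)) - phi(gamma(0))) / (np.pi / 2))
```

Replacing the circle by any non-closed arc makes the first integral equal to the (generally nonzero) endpoint difference, so $\mu_\gamma$ fails Definition 11.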
Definition 11 (Closed measure).
A compactly supported, positive, Radon measure $\mu$ on $\mathbb{R}^n\times\mathbb{R}^n$ is closed if, for all functions $\varphi\in C^1(\mathbb{R}^n)$,
$$\int_{\mathbb{R}^n\times\mathbb{R}^n}\langle\nabla\varphi(x),v\rangle\,d\mu(x,v) = 0.$$
Let $\pi\colon\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}^n$ be the projection $\pi(x,v)=x$ onto the first factor. To a measure $\mu$ on $\mathbb{R}^n\times\mathbb{R}^n$ we can associate its projected measure $\pi_*\mu$. As an immediate consequence we have that $\operatorname{supp}\pi_*\mu=\pi(\operatorname{supp}\mu)$.
The disintegration theorem 
implies that there are probability measures $(\mu_x)_{x\in\mathbb{R}^n}$ on $\mathbb{R}^n$ such that
$$\mu = \int_{\mathbb{R}^n}\mu_x\,d\pi_*\mu(x). \tag{4}$$
We shall refer to the couple $(\pi_*\mu,(\mu_x)_x)$ as the disintegration of $\mu$. Thus if $f\colon\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ is measurable, we have
$$\int f\,d\mu = \int_{\mathbb{R}^n}\int_{\mathbb{R}^n}f(x,v)\,d\mu_x(v)\,d\pi_*\mu(x).$$
Definition 12 (Centroid field).
Let $\mu$ be a positive, compactly supported, Radon measure on $\mathbb{R}^n\times\mathbb{R}^n$. The centroid field of $\mu$ is, for $x\in\operatorname{supp}\pi_*\mu$ and with the decomposition (4),
$$\bar v_\mu(x) = \int_{\mathbb{R}^n} v\,d\mu_x(v).$$
The centroid field gives the average velocity, that is, the average of the velocities encoded by the measure at each point. As a consequence of the disintegration theorem , $\bar v_\mu$ is measurable, and for every measurable $f\colon\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ linear in the second variable, we have
$$\int f(x,v)\,d\mu(x,v) = \int_{\mathbb{R}^n} f(x,\bar v_\mu(x))\,d\pi_*\mu(x). \tag{5}$$
It plays a significant role in our work. For later use, we record the following facts, which follow from the definition of the centroid field, the convexity of the norm, and the fact that each $\mu_x$ is a probability measure:
Lemma 13 (Quasi-stationary bundle measures).
If a positive Radon measure $\mu$ has a centroid field that vanishes $\pi_*\mu$-almost everywhere, then $\mu$ is closed.
Indeed, if $\bar v_\mu(x)=0$ for $\pi_*\mu$-almost every $x$, and if $\varphi\in C^1(\mathbb{R}^n)$, then applying (5) to the integrand $(x,v)\mapsto\langle\nabla\varphi(x),v\rangle$, which is linear in $v$, we have
$$\int\langle\nabla\varphi(x),v\rangle\,d\mu(x,v) = \int\langle\nabla\varphi(x),\bar v_\mu(x)\rangle\,d\pi_*\mu(x) = 0,$$
so $\mu$ is closed. ∎
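Lemma 13 is easy to illustrate on a discrete measure (an artificial example, chosen for transparency): place, at each of two positions, equal mass on a pair of opposite velocities. The centroid field then vanishes on the support, and the closedness integral of Definition 11 cancels pairwise for any smooth test function.

```python
import numpy as np

# Atoms (x_i, v_i, w_i) of a discrete position-velocity measure.  At each
# position the velocities come in opposite pairs with equal weight, so the
# centroid field (the mean velocity at each position) vanishes.
atoms = [
    (np.array([0.0, 0.0]), np.array([ 1.0,  2.0]), 0.25),
    (np.array([0.0, 0.0]), np.array([-1.0, -2.0]), 0.25),
    (np.array([1.0, 3.0]), np.array([ 0.5, -1.0]), 0.25),
    (np.array([1.0, 3.0]), np.array([-0.5,  1.0]), 0.25),
]

def centroid(atoms, x):
    """Mean velocity of the atoms sitting at position x."""
    w = sum(wi for xi, vi, wi in atoms if np.allclose(xi, x))
    return sum(wi * vi for xi, vi, wi in atoms if np.allclose(xi, x)) / w

def closedness_integral(atoms, grad_phi):
    """The integral of <grad_phi(x), v> against the discrete measure."""
    return sum(wi * (grad_phi(xi) @ vi) for xi, vi, wi in atoms)

# An arbitrary smooth test gradient.
grad_phi = lambda p: np.array([np.exp(p[0]), np.cos(p[1])])

print(centroid(atoms, np.array([0.0, 0.0])))  # [0. 0.]
print(closedness_integral(atoms, grad_phi))   # cancels pairwise: 0.0
```

The cancellation happens for every choice of test gradient, which is exactly the closedness required by Definition 11.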
Recall that the weak* topology on the space of Radon measures on an open set is the one induced by the family of seminorms
$$\mu\mapsto\left|\int f\,d\mu\right|,$$
where $f$ ranges over the continuous functions with compact support. Thus a sequence of measures $(\mu_k)_{k\in\mathbb{N}}$ converges in this topology to a measure $\mu$ if, and only if, for all such $f$,
$$\int f\,d\mu_k\xrightarrow[k\to\infty]{}\int f\,d\mu.$$
The following result can be regarded as a consequence of the forthcoming Theorem 15. It can also be seen as a special case of the results of , which are very well described in [30]. Specifically, it is shown in [30] that it is possible to approximate, in a weak* sense, objects (namely, currents) intimately related to closed measures by simpler objects (namely, closed polyhedral chains), which in our case correspond to combinations of finitely many piecewise-smooth, closed curves.
Proposition 14 (Weak* density of closed curves).
Consider the set of measures of the form $a\,\mu_\gamma$ for some $a>0$ and a measure $\mu_\gamma$ induced, as above, by some closed, smooth curve $\gamma$ defined on an interval (which is not fixed). In the weak* topology, this set is dense in the set of closed measures.
Since the space of measures is sequential, this proposition means that for any closed measure $\mu$ we can find a sequence of closed curves whose induced measures approximate $\mu$, in the sense that they converge to $\mu$ in the weak* topology.
The following result, known as the Young superposition principle [53, 9] or as the Smirnov solenoidal representation [49, 4], is a strong refinement of the assertion of Proposition 14; see also [45, Example 6]. What this result tells us is basically that, not only can closed measures be approximated by measures induced by curves, but actually the centroidal measure
which captures many of the properties of $\mu$, can be decomposed into a combination of measures induced by Lipschitz curves. This decomposition is very useful theoretically, as there are no limits involved. For completeness, the following is proved in Section A.
Theorem 15 (Young superposition principle/Smirnov solenoidal representation).
Let $U$ be a nonempty bounded open subset of $\mathbb{R}^n$. For $s\in\mathbb{R}$, let $T_s$ be the time-translation $T_s(\gamma)(t)=\gamma(t+s)$. For every closed probability measure $\mu$ supported in $\overline U\times\mathbb{R}^n$ with centroid field $\bar v_\mu$, there is a Borel probability measure $\eta$ on the space of Lipschitz curves $\gamma\colon\mathbb{R}\to\overline U$ that is invariant under $T_s$ for all $s\in\mathbb{R}$ and such that
$$\int_{\mathbb{R}^n} f(x,\bar v_\mu(x))\,d\pi_*\mu(x) = \int f(\gamma(0),\dot\gamma(0))\,d\eta(\gamma) \tag{6}$$
for any measurable $f\colon\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$.
Curves lying in the support of $\eta$ have an appealing property:
Corollary 16 (Centroid representation).
With the notation of the previous theorem, we have, for $\eta$-almost all curves $\gamma$:
$$\dot\gamma(t) = \bar v_\mu(\gamma(t)) \quad\text{for almost all } t\in\mathbb{R}.$$
Take indeed $f$ vanishing only on the measurable set consisting of the points of the form $(x,\bar v_\mu(x))$, $x\in\operatorname{supp}\pi_*\mu$. Then both sides of (6) must vanish, which means that for $\eta$-almost all $\gamma$, the point $(\gamma(0),\dot\gamma(0))$ must be of the form $(x,\bar v_\mu(x))$. The conclusion follows from the $T_s$-invariance of the measure $\eta$. ∎
As an example, take the case in which $\mu$ is the closed measure $\mu_\gamma$ induced, as above, by the parameterization $\gamma(t)=(\cos t,\sin t)$, $t\in[0,2\pi]$, of the unit circle.
In this simple example, the centroid coincides with the derivative, $\bar v_\mu(\gamma(t))=\dot\gamma(t)$. Each time-translate $T_s\gamma$ is still a parameterization of the circle, and the probability measure we obtain in Theorem 15 is
$$\eta = \frac{1}{2\pi}\int_0^{2\pi}\delta_{T_s\gamma}\,ds,$$
where $\delta_{T_s\gamma}$ is the Dirac delta whose mass is concentrated at the curve $T_s\gamma$ in the space of Lipschitz curves.
The measure $\eta$ in Theorem 15 can be understood as a decomposition of the closed measure $\mu$ into a convex superposition of measures induced by Lipschitz curves. Although at first sight each curve $\gamma$ on the right-hand side of (6) only participates through the single time $t=0$, the $T_s$-invariance of $\eta$ means that in fact the entire curve is involved in the integral through its time translates $T_s\gamma$. Observe that another consequence of the $T_s$-invariance is that the integral on the right-hand side of (6) satisfies
$$\int f(\gamma(0),\dot\gamma(0))\,d\eta(\gamma) = \int\frac{1}{|I|}\int_I f(\gamma(t),\dot\gamma(t))\,dt\,d\eta(\gamma),$$
where $I$ is any nontrivial interval. Thus (6) has the more explicit lamination, or superposition, form:
$$\int_{\mathbb{R}^n} f(x,\bar v_\mu(x))\,d\pi_*\mu(x) = \int\frac{1}{|I|}\int_I f(\gamma(t),\dot\gamma(t))\,dt\,d\eta(\gamma)$$
for any interval $I$ with nonempty interior.
Although the left-hand side of (6) does not involve the full measure $\mu$, it will turn out to be similar enough: if the integrand is linear in the second variable $v$, we still have (5), and this will be enough for the applications we have in mind.
We remark that the measure $\eta$ in Theorem 15 is not unique in general. For example, if $\gamma$ is a closed curve intersecting itself once so as to form a figure eight, then the measure decomposing $\mu_\gamma$ could be taken to be supported on all the time-translates of $\gamma$ itself, or it could be taken to be supported on the curves traversing each of the two loops of the eight.
Circulation for a subdifferential field.
We provide here some results related to subdifferentials that will be useful in the study of the vanishing step subgradient method.
Lemma 17. Let $f$ be a locally Lipschitz continuous function and $\mu$ a closed measure with disintegration $(\pi_*\mu,(\mu_x)_x)$ and centroid field $\bar v_\mu$. If for some $x\in\operatorname{supp}\pi_*\mu$ we have $\operatorname{supp}\mu_x\subseteq\partial f(x)$, then $\bar v_\mu(x)\in\partial f(x)$.
Assume $\operatorname{supp}\mu_x\subseteq\partial f(x)$. Let $g$ be the distance function to the set $\partial f(x)$, so that $g(v)=0$ for all $v\in\operatorname{supp}\mu_x$. Since $\partial f(x)$ is a convex set, $g$ is a convex function. Then by Jensen's inequality we have
Proposition 18 (Circulation of subdifferential for path differentiable functions).
If $f$ is a path differentiable function and $\mu$ is a closed probability measure, then for each open set $U\subseteq\mathbb{R}^n$ and each measurable function $g\colon U\to\mathbb{R}^n$ with $g(x)\in\partial f(x)$ for $x\in U$, the integral
$$\int_U\langle g(x),\bar v_\mu(x)\rangle\,d\pi_*\mu(x)$$
is well defined, and its value is independent of the choice of $g$. We define the symbol
$$\int_U\langle\partial f(x),\bar v_\mu(x)\rangle\,d\pi_*\mu(x)$$
to be equal to this value. If $U\supseteq\operatorname{supp}\pi_*\mu$,
$$\int_U\langle\partial f(x),\bar v_\mu(x)\rangle\,d\pi_*\mu(x) = 0.$$
Let $g_1,g_2\colon U\to\mathbb{R}^n$ be two measurable functions such that $g_1(x),g_2(x)\in\partial f(x)$ for each $x\in U$. From Theorem 15 we get a $T_s$-invariant Borel probability measure $\eta$ on the space of Lipschitz curves. Then
$$\int_U\langle g_1(x)-g_2(x),\bar v_\mu(x)\rangle\,d\pi_*\mu(x) = \int \mathbf{1}_U(\gamma(0))\,\langle g_1(\gamma(0))-g_2(\gamma(0)),\dot\gamma(0)\rangle\,d\eta(\gamma).$$
Since $f$ is path differentiable, for each Lipschitz curve $\gamma$ and for almost every $t$ with $\gamma(t)\in U$,
$$\langle g_1(\gamma(t))-g_2(\gamma(t)),\dot\gamma(t)\rangle = (f\circ\gamma)'(t)-(f\circ\gamma)'(t) = 0.$$
From the $T_s$-invariance of $\eta$ it follows then that the integrand above vanishes $\eta$-almost everywhere.
Let us now analyze the case in which $U\supseteq\operatorname{supp}\pi_*\mu$. Let $\rho$ be a mollifier, that is, a compactly-supported, nonnegative, rotationally-invariant, $C^\infty$ function $\rho\colon\mathbb{R}^n\to\mathbb{R}$ such that $\int_{\mathbb{R}^n}\rho=1$, and let $\rho_\delta(x)=\delta^{-n}\rho(x/\delta)$ for $\delta>0$, so that $\rho_\delta$ tends to the Dirac delta at $0$ as $\delta\to 0^+$. Denote by $f_\delta=f*\rho_\delta$ the convolution of $f$ and $\rho_\delta$. Observe that if  and , then
This justifies the following calculation: