Long term dynamics of the subgradient method for Lipschitz path differentiable functions

05/29/2020 ∙ by Jérôme Bolte, et al.

We consider the long-term dynamics of the vanishing stepsize subgradient method in the case when the objective function is neither smooth nor convex. We assume that this function is locally Lipschitz and path differentiable, i.e., admits a chain rule. Our study departs from other works in the sense that we focus on the behavior of the oscillations, and to do this we use closed measures. We recover known convergence results, establish new ones, and show a local principle of oscillation compensation for the velocities. Roughly speaking, the time average of gradients around one limit point vanishes. This allows us to further analyze the structure of oscillations, and establish their perpendicularity to the general drift.


1 Introduction

The predominance of huge-scale, complex, nonsmooth, nonconvex problems in the development of certain artificial intelligence methods has brought rudimentary, numerically cheap, robust methods, such as subgradient algorithms, back to the forefront of contemporary numerics; see e.g. [33, 23, 34, 5, 12]. We investigate here some of the properties of the archetypical algorithm within this class, namely, the vanishing stepsize subgradient method of Shor. Given $f\colon\mathbb{R}^n\to\mathbb{R}$ locally Lipschitz, it reads

$$x_{i+1}\in x_i-\varepsilon_i\,\partial^c f(x_i),$$

where $\partial^c f$ is the Clarke subgradient, $\varepsilon_i\to 0$, and $\sum_{i=0}^{\infty}\varepsilon_i=\infty$. This dynamics, illustrated in Figure 1, has its roots in Cauchy’s gradient method and seems to originate in Shor’s thesis [48]. The idea is natural at first sight: one accumulates small subgradient steps to make good progress on average while hoping that oscillations will be tempered by the vanishing steps. For the convex case, the theory was developed by Ermol’ev [26], Poljak [43], Ermol’ev–Shor [25]. It is a quite mature theory, see e.g. [39, 40], which still enjoys considerable success through the famous mirror descent of Nemirovskii–Yudin [39, 7] and its endless variants. In the nonconvex case, more sophisticated methods were developed, see e.g. [35, 31, 41], yet little was known for the raw method until recently.

The work of Davis et al. [21], see also [11], revolving around the fundamental paper of Benaïm–Hofbauer–Sorin [8], brought the first breakthroughs. It relies on a classical idea of Euler: small-step discrete dynamics resemble their continuous counterparts. As established by Ljung [36], this observation can be made rigorous for large times in the presence of good Lyapunov functions. Benaïm–Hofbauer–Sorin [8] showed further that the transfer of asymptotic properties from continuous differential inclusions to small-step discrete methods is valid under rather weak compactness and dissipativity assumptions. This general result, combined with features specific to the subgradient case, made it possible to establish several optimization results, such as convergence to the set of critical points, convergence in value, and convergence in the long run in the presence of noise [21, 46, 14, 12].

Usual properties expected from an algorithm are diverse: convergence of iterates, convergence in values, rates, quality of optimality, complexity, or prevalence of minimizers. Although in our setting some aspects seem hopeless without strong assumptions, most of them remain largely unexplored. Numerical successes suggest however that the apparently erratic process of subgradient dynamics has appealing stability properties beyond the already delicate subsequential convergence to critical points.

In order to address some of these issues, this paper avoids the use of the theory of [8] and focuses on the delicate question of oscillations, which is illustrated in Figures 1 and 2.

Figure 1: Contour plot of a Lipschitz function with a subgradient sequence. The color reflects the iteration count. The sequence converges to the unique global minimum, but is constantly oscillating.

Figure 2: On the left, the contour plot of a convex polyhedral function with three strata, on each of which the gradient is constant. A subgradient sequence converges to the origin with apparently erratic behavior. On the right, we discover that the behavior is not completely erratic. The oscillation compensation phenomenon contributes some structure: the proportions $\lambda_k$ of time spent in each region where the function has constant gradient $g_k$, $k=1,2,3$, converge so that we have precisely $\lambda_1 g_1+\lambda_2 g_2+\lambda_3 g_3=0$.

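In general, as long as the sequence $\{x_i\}$ remains bounded, we always have, writing $v_i=(x_{i+1}-x_i)/\varepsilon_i\in-\partial^c f(x_i)$,

$$\frac{\sum_{i=0}^{N}\varepsilon_i v_i}{\sum_{i=0}^{N}\varepsilon_i}=\frac{x_{N+1}-x_0}{\sum_{i=0}^{N}\varepsilon_i}\longrightarrow 0\quad\text{as }N\to\infty.\tag{1}$$

This fact, which could be called “global oscillation compensation,” does not prevent the trajectory from oscillating fast around a limit cycle, as illustrated in [20], and is therefore unsatisfying from the stabilization perspective of minimization. The phenomenon (1) remains true even when $\{v_i\}$ is not a gradient sequence, as in the case of discrete game theoretical dynamical systems [8].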
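A minimal numerical sketch of global compensation and of the time proportions described for Figure 2. Everything specific here is an assumption made for illustration and does not come from the paper: the objective $f(x)=\max_k g_k\cdot x$ with three unit gradients at 120 degrees, the step sizes $(i+1)^{-1/2}$, the starting point, and all function names.

```python
import math

# Hypothetical polyhedral objective f(x) = max_k <g_k, x>: convex, piecewise
# linear, minimized at the origin. The gradients sum to zero, so the only
# convex combination of them that vanishes has weights (1/3, 1/3, 1/3).
GRADS = [(1.0, 0.0),
         (-0.5, math.sqrt(3) / 2),
         (-0.5, -math.sqrt(3) / 2)]


def active_piece(x):
    """Index of a linear piece attaining the max (ties go to the lowest index)."""
    values = [g[0] * x[0] + g[1] * x[1] for g in GRADS]
    return values.index(max(values))


def run(x0=(1.0, 0.5), n_steps=20000):
    x = list(x0)
    total_time = 0.0
    time_in_piece = [0.0, 0.0, 0.0]
    drift = [0.0, 0.0]                    # accumulates sum_i eps_i * v_i
    for i in range(n_steps):
        eps = 1.0 / math.sqrt(i + 1)      # vanishing, non-summable steps
        k = active_piece(x)
        v = (-GRADS[k][0], -GRADS[k][1])  # v_i is minus a Clarke subgradient
        time_in_piece[k] += eps
        total_time += eps
        drift[0] += eps * v[0]
        drift[1] += eps * v[1]
        x[0] += eps * v[0]
        x[1] += eps * v[1]
    avg_velocity = (drift[0] / total_time, drift[1] / total_time)
    proportions = [t / total_time for t in time_in_piece]
    return avg_velocity, proportions


avg_velocity, proportions = run()
print("time-averaged velocity:", avg_velocity)
print("proportion of time per region:", proportions)
```

Because the sum of the $\varepsilon_i v_i$ telescopes to the displacement of the sequence, the averaged velocity is small as soon as the elapsed time is large compared with the diameter of the trajectory; the proportions then have to approach the weights that cancel the three gradients.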

In this work, we adapt the theory of closed measures, which was originally developed in the calculus of variations (see for example [4, 9]), to the study of discrete dynamics. Using it, we establish several local oscillation compensation results for path differentiable functions. Morally, our results in this direction say that for limit points $x$ we have

$$\lim_{\delta\to 0}\;\lim_{N\to\infty}\;\frac{\sum_{0\le i\le N,\ \|x_i-x\|\le\delta}\varepsilon_i v_i}{\sum_{0\le i\le N,\ \|x_i-x\|\le\delta}\varepsilon_i}=0.\tag{2}$$

See Theorems 6 and 7 for precise statements, and a discussion in Section 3.4.

While this does not imply the convergence of $\{x_i\}$, it does mean that the drift emanating from the average velocity of the sequence vanishes as time elapses. This is made more explicit in the parts of those theorems that show that, given two limit points $x$ and $y$ of the sequence $\{x_i\}$, the time it takes for the sequence to flow from a small ball around $x$ to a small ball around $y$ must eventually grow infinitely long, so that the overall speed of the sequence as it traverses the accumulation set becomes extremely slow.

With these types of results, we bring to light new phenomena:

  • while the sequence may not converge, it will spend most of the time oscillating near the critical set of the objective function, and it appears that there are persistent accumulation points whose importance is predominant;

  • under weak Sard assumptions, we recover the convergence results of [21] and improve them with oscillation compensation results;

  • the oscillation structures itself orthogonally to the limit set, so that the incremental drift along this set is negligible with respect to the time increment $\varepsilon_i$.

These results are made possible by the use of closed measures. These measures capture the accumulation behavior of the sequence $\{x_i\}$ along with the “velocities” $\{v_i\}$. The simple idea of not throwing away the information of the vectors $v_i$ allows one to recover a lot of structure in the limit, which can be interpreted as a portrait of the long-term behavior of the sequence. The theory that we develop in Section 4.1 should apply to the analysis of the more general case of small-step algorithms. Along the way, for example, we are able to establish a new connection between the discrete and continuous gradient flows (Corollary 22) that complements the point of view of [8].

Notations and organization of the paper.

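Let $n$ be a positive integer, and let $\mathbb{R}^n$ denote $n$-dimensional Euclidean space. The space $\mathbb{R}^n\times\mathbb{R}^n$ of couples $(x,v)$ is seen as the phase space consisting of positions $x\in\mathbb{R}^n$ and velocities $v\in\mathbb{R}^n$. For two vectors $u=(u_1,\dots,u_n)$ and $v=(v_1,\dots,v_n)$, we let $u\cdot v=\sum_{k=1}^{n}u_k v_k$. The norm $\|v\|=\sqrt{v\cdot v}$ induces the distance $\operatorname{dist}(x,y)=\|x-y\|$, and similarly on $\mathbb{R}^n\times\mathbb{R}^n$. The Euclidean gradient of $f$ is denoted by $\nabla f(x)$. The set $\mathbb{N}$ contains all the nonnegative integers.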

In Section 2 we give the definitions necessary to state our results, which we do in Section 3. The proofs of our results will be given in Section 5. Before we broach those arguments, we need to develop some preliminaries regarding our main tool, the so-called closed measures; we do this in Section 4.

2 Algorithm and framework

2.1 The vanishing step subgradient method

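Consider a locally Lipschitz function $f\colon\mathbb{R}^n\to\mathbb{R}$, and denote by $D_f\subset\mathbb{R}^n$ the set of its differentiability points, which is dense by Rademacher’s theorem (see for example [27, Theorem 3.2]). The Clarke subdifferential of $f$ is defined by

$$\partial^c f(x)=\overline{\operatorname{conv}}\,\bigl\{v\in\mathbb{R}^n:\ \text{there is a sequence }\{y_k\}\subset D_f\text{ with }y_k\to x\text{ and }\nabla f(y_k)\to v\bigr\},$$

where $\overline{\operatorname{conv}}\,S$ denotes the closed convex envelope of a set $S\subset\mathbb{R}^n$; see [18].

A point $x$ such that $0\in\partial^c f(x)$ is called critical. The critical set is

$$\operatorname{crit}f=\{x\in\mathbb{R}^n:\ 0\in\partial^c f(x)\}.$$

It contains local minima and maxima.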
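As a small illustration of the definition, here is a sketch that approximates the Clarke subdifferential of a one-dimensional function by the interval spanned by derivatives sampled near the point. The helper names, the sampling radius, and the choice $f(x)=|x|$ are assumptions made for this example only.

```python
def clarke_interval_1d(grad, x, radius=1e-6, samples=201):
    """Approximate the 1-D Clarke subdifferential at x by the interval
    [min, max] of derivatives sampled at nearby points.

    The point x itself is skipped, since f may fail to be differentiable
    there; the sketch assumes the other sample points are differentiability
    points of f.
    """
    points = [x + radius * (2.0 * k / (samples - 1) - 1.0) for k in range(samples)]
    slopes = [grad(p) for p in points if p != x]
    return min(slopes), max(slopes)


def grad_abs(p):
    """Derivative of f(x) = |x| away from the origin."""
    return 1.0 if p > 0 else -1.0


def is_critical(interval):
    low, high = interval
    return low <= 0.0 <= high


at_kink = clarke_interval_1d(grad_abs, 0.0)    # the whole interval [-1, 1]
at_smooth = clarke_interval_1d(grad_abs, 1.0)  # the single slope 1
print(at_kink, is_critical(at_kink))
print(at_smooth, is_critical(at_smooth))
```

At the kink the convex envelope fills the gap between the two one-sided slopes, which is why the origin is critical even though no nearby gradient vanishes.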

The algorithm of interest in this work is:

Definition 1 (Small step subgradient method).

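Let $f\colon\mathbb{R}^n\to\mathbb{R}$ be locally Lipschitz and $\{\varepsilon_i\}_{i\in\mathbb{N}}$ be a sequence of positive step sizes such that

$$\sum_{i=0}^{\infty}\varepsilon_i=\infty\quad\text{and}\quad\varepsilon_i\to 0.\tag{3}$$

Given $x_0\in\mathbb{R}^n$, consider the recursion, for $i\ge 0$,

$$x_{i+1}=x_i+\varepsilon_i v_i,\qquad v_i\in-\partial^c f(x_i).$$

Here, $v_i$ is chosen freely among $-\partial^c f(x_i)$. The sequence $\{x_i\}_{i\in\mathbb{N}}$ is called a subgradient sequence.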
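The recursion can be sketched in a few lines. This is only an illustration under stated assumptions: the objective $f(x)=|x|$, the step sizes $1/(i+1)$, the starting point, the rule used to pick a subgradient at the kink, and the function names are all hypothetical choices, not taken from the paper.

```python
def subgradient_sequence(subgrad, x0, n_steps, step=lambda i: 1.0 / (i + 1)):
    """Run x_{i+1} = x_i + eps_i * v_i with v_i = -subgrad(x_i).

    `subgrad` must return one element of the Clarke subdifferential; which
    element is returned at a kink is left to the caller.
    """
    xs, vs, eps_list = [x0], [], []
    x = x0
    for i in range(n_steps):
        eps = step(i)
        v = -subgrad(x)
        x = x + eps * v
        xs.append(x)
        vs.append(v)
        eps_list.append(eps)
    return xs, vs, eps_list


def subgrad_abs(x):
    """One Clarke subgradient of |x|; at 0 we arbitrarily pick +1 from [-1, 1]."""
    return 1.0 if x >= 0 else -1.0


xs, vs, eps_list = subgradient_sequence(subgrad_abs, x0=0.3, n_steps=2000)
sign_flips = sum(1 for a, b in zip(vs, vs[1:]) if a != b)
print("last iterate:", xs[-1])
print("number of velocity sign changes:", sign_flips)
```

In this toy run the iterates approach the minimizer while the velocities never settle: they keep switching between $-1$ and $+1$, which is the oscillation studied in the sequel.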

In what follows the sequence $\{\varepsilon_i\}$ is interpreted as a sequence of time increments, and it naturally defines a time counter through the formula:

$$t_i=\sum_{j=0}^{i-1}\varepsilon_j,$$

so that $t_i\to\infty$ as $i\to\infty$. Given a sequence $\{x_i\}$ and a subset $U\subset\mathbb{R}^n$, we set

$$t_i(U)=\sum_{0\le j<i,\ x_j\in U}\varepsilon_j,$$

which corresponds to the time spent by the sequence in $U$.

Recall that the accumulation set $\operatorname{acc}\{x_i\}$ of the sequence $\{x_i\}$ is the set of points $x$ such that, for every neighborhood $U$ of $x$, the intersection $U\cap\{x_i\}$ is an infinite set. Its elements are known as limit points.

If the sequence $\{x_i\}$ is bounded and comes from the subgradient method as in Definition 1, then $\|x_{i+1}-x_i\|\to 0$ because $\varepsilon_i\to 0$ and $\partial^c f$ is locally bounded by local Lipschitz continuity of $f$, so $\operatorname{acc}\{x_i\}$ is compact and connected, see e.g., [15].

Accumulation points are the manifestation of recurrent behaviors of the sequence, but the frequency of the recurrence is ignored. In the presence of a time counter, here $\{t_i\}$, this persistence phenomenon may be measured through the time spent in the neighborhood of a recurrent point. This idea is formalized in the following definition:

Definition 2 (Essential accumulation set).

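Given a step size sequence $\{\varepsilon_i\}$ and a subgradient sequence $\{x_i\}$ as in Definition 1, the essential accumulation set $\operatorname{ess\,acc}\{x_i\}$ is the set of points $x$ such that, for every neighborhood $U$ of $x$,

$$\limsup_{N\to\infty}\frac{\sum_{0\le i\le N,\ x_i\in U}\varepsilon_i}{\sum_{0\le i\le N}\varepsilon_i}>0.$$

Analogously, considering the increments $\{v_i\}$, we say that the point $(x,w)$ is in the essential accumulation set $\operatorname{ess\,acc}\{(x_i,v_i)\}$ if every neighborhood $U$ of $(x,w)$ satisfies

$$\limsup_{N\to\infty}\frac{\sum_{0\le i\le N,\ (x_i,v_i)\in U}\varepsilon_i}{\sum_{0\le i\le N}\varepsilon_i}>0.$$

As explained previously, the set $\operatorname{ess\,acc}\{x_i\}$ encodes significantly recurrent behavior; it ignores sporadic escapades of the sequence $\{x_i\}$. Essential accumulation points are accumulation points but the converse is not true. If the sequence $\{x_i\}$ is bounded, $\operatorname{ess\,acc}\{x_i\}$ is nonempty and compact, but not necessarily connected.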
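The ratio appearing in this definition is easy to evaluate on a finite run. The sketch below does so for a hypothetical test sequence (the subgradient method on $f(x)=|x|$ with steps $1/(i+1)$); the helper name, the regions, and the horizon are illustrative assumptions, and a finite horizon can only suggest the value of the limit superior.

```python
def presence_fraction(xs, eps, in_region):
    """Fraction of the elapsed time sum(eps) spent at iterates inside the region.

    xs[i] is the position held during the time increment eps[i].
    """
    total = sum(eps)
    inside = sum(e for x, e in zip(xs, eps) if in_region(x))
    return inside / total


# Hypothetical test sequence: subgradient method on f(x) = |x|.
xs, eps = [], []
x = 0.3
for i in range(20000):
    e = 1.0 / (i + 1)
    xs.append(x)
    eps.append(e)
    x -= e * (1.0 if x >= 0 else -1.0)   # step along minus a Clarke subgradient

near_zero = presence_fraction(xs, eps, lambda p: abs(p) < 0.1)
near_start = presence_fraction(xs, eps, lambda p: 0.2 < p < 0.4)
print("fraction of time within 0.1 of the minimizer:", near_zero)
print("fraction of time near the starting point:", near_start)
```

The neighborhood of the minimizer captures most of the elapsed time, whereas the region around the starting point is visited only briefly, so its share shrinks as the total time grows.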

2.2 Regularity assumptions on the objective function

Lipschitz continuity and pathologies.

Recall that, given a locally Lipschitz function $f$, a subgradient curve is an absolutely continuous curve $\gamma$ satisfying, for almost every $t$,

$$\gamma'(t)\in-\partial^c f(\gamma(t)).$$

By general results these curves exist, see e.g., [8] and references therein. In our context they embody the ideal behavior we could hope for from subgradient sequences.

First let us recall that pathological Lipschitz functions are generic in the Baire sense, as established in [52, 16]. In particular, generic 1-Lipschitz functions $f$ satisfy $\partial^c f\equiv\overline{B}(0,1)$, the closed unit ball, everywhere on $\mathbb{R}^n$. This means that any absolutely continuous curve $\gamma$ with $\|\gamma'\|\le 1$ is a subgradient curve of these functions, regardless of their specifics. Note that this implies that a curve may constantly remain away from the critical set.

The examples by Daniilidis–Drusvyatskiy [20] make this erratic behaviour even more concrete. For instance, they provide a Lipschitz function $f$ and a bounded subgradient curve $\gamma$ having the “absurd” roller coaster property

$$(f\circ\gamma)(t)=\sin t,\qquad t\in\mathbb{R}.$$

Although not directly matching our framework, these examples show that we cannot hope for satisfying convergence results under the bare general assumption of Lipschitz continuity.

Path differentiability.

We are thus led to consider functions avoiding pathologies. We choose to work with the fonctions saines (literally, “healthy functions”, as opposed to pathological, in French) of Valadier [51] (1989), rediscovered in several works, see e.g. [17, 21, 14]. We use the terminology of [14].

Definition 3 (Path differentiable functions).

A locally Lipschitz function $f\colon\mathbb{R}^n\to\mathbb{R}$ is path differentiable if, for each Lipschitz curve $\gamma\colon\mathbb{R}\to\mathbb{R}^n$, for almost every $t\in\mathbb{R}$, the composition $f\circ\gamma$ is differentiable at $t$ and the derivative is given by

$$(f\circ\gamma)'(t)=v\cdot\gamma'(t)$$

for all $v\in\partial^c f(\gamma(t))$.

In other words, all vectors in $\partial^c f(\gamma(t))$ share the same projection onto the subspace generated by $\gamma'(t)$. Note that the definition proposed in [14] is not limited to chain rules involving the Clarke subgradient, but it turns out to be equivalent to a definition very much like the one we give here, with Lipschitz curves replaced by absolutely continuous curves, the equivalence being furnished by [14, Corollary 2]. The current definition is slightly more general than the original one [14], that is, our class of functions contains the one discussed in [14], because we require a condition only for Lipschitz curves, which are all absolutely continuous.
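To make the chain rule concrete, the following sketch checks it on the hypothetical example $f(x,y)=\max(x,y)$, whose Clarke subdifferential on the diagonal is the segment joining $(1,0)$ and $(0,1)$. The function names and the two test curves are assumptions made for illustration.

```python
def clarke_max(x, y, n=11):
    """Finite sample of the Clarke subdifferential of f(x, y) = max(x, y).

    Off the diagonal f is smooth and the set is a single gradient; on the
    diagonal it is the segment between (1, 0) and (0, 1), sampled at n points.
    """
    if x > y:
        return [(1.0, 0.0)]
    if y > x:
        return [(0.0, 1.0)]
    return [(k / (n - 1), 1.0 - k / (n - 1)) for k in range(n)]


def projections(point, velocity):
    """Values v . velocity for the sampled subgradients v at `point`."""
    return [v[0] * velocity[0] + v[1] * velocity[1] for v in clarke_max(*point)]


# Curve gamma(t) = (t, t) stays on the kink for all t, with gamma'(t) = (1, 1).
# Every subgradient gives the same value 1, which is the derivative of
# (f o gamma)(t) = t.
along_kink = projections((0.5, 0.5), (1.0, 1.0))

# Curve gamma(t) = (t, -t) crosses the kink only at t = 0. There the values
# differ, but a single instant has measure zero, so the chain rule still
# holds for almost every t.
across_kink = projections((0.0, 0.0), (1.0, -1.0))

print(along_kink)
print(min(across_kink), max(across_kink))
```

The first curve shows the projection property at work on a set of times of positive measure; the second shows why the definition only asks for the identity at almost every time.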

The class of path differentiable functions is very large and includes many cases of interest, such as functions that are semi-algebraic, tame (definable in an o-minimal structure), or Whitney stratifiable [21] (in particular, models and loss functions used in machine learning, such as those occurring in neural network training with all the activation functions that have been considered in the literature), as well as functions that are convex or concave; see e.g. [14, 47].

Whitney stratifiable functions.

Due to their ubiquity we detail here the properties of Whitney stratifiability and illustrate their utility. They were first used in [13] in the variational analysis context in order to establish Sard’s theorem and the Kurdyka–Łojasiewicz inequality for definable functions, two properties which appear to be essential in the study of many subgradient related problems, see e.g., [3, 15].

Definition 4 (Whitney stratification).

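Let $X$ be a nonempty subset of $\mathbb{R}^m$ and $p\ge 1$. A $C^p$ stratification $\mathcal{X}=\{X_i\}_{i\in I}$ of $X$ is a locally finite partition of $X$ into connected submanifolds $X_i$ of $\mathbb{R}^m$ of class $C^p$ such that for each $i\neq j$

$$\overline{X_i}\cap X_j\neq\emptyset\ \Longrightarrow\ X_j\subset\overline{X_i}\setminus X_i.$$

A $C^p$ stratification $\mathcal{X}$ of $X$ satisfies Whitney’s condition (a) if, for each $x\in\overline{X_i}\cap X_j$, $i\neq j$, and for each sequence $\{x_k\}\subset X_i$ with $x_k\to x$ as $k\to\infty$, and such that the sequence of tangent spaces $T_{x_k}X_i$ converges (in the usual metric topology of the Grassmannian) to a subspace $V\subset T_x\mathbb{R}^m$, we have that $T_xX_j\subset V$. A $C^p$ stratification is Whitney if it satisfies Whitney’s condition (a).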

Definition 5 (Whitney stratifiable function).

With the same notations as above, a function $f\colon\mathbb{R}^n\to\mathbb{R}^k$ is Whitney $C^p$-stratifiable if there exists a Whitney $C^p$ stratification of its graph as a subset of $\mathbb{R}^{n+k}$.

Examples of Whitney stratifiable functions are semialgebraic or tame functions, but much less structured functions are covered. This class covers most known finite dimensional optimization problems, for instance those met in the training of neural networks. Let us mention here that the subclass of tame functions has led to many results through the nonsmooth Kurdyka–Łojasiewicz inequality, see e.g., [3], while mere Whitney stratifiability combined with the Ljung-like theory developed in [8] has also provided several interesting openings [21, 14, 12].

3 Main results: accumulation, convergence, oscillation compensation

We now present our main results, which rely on three types of increasingly demanding assumptions:

  • path differentiability (Section 3.1),

  • path differentiable functions with a weak Sard property (Section 3.2),

  • Whitney stratifiable functions (Section 3.3).

Section 3.3 also contains a general result pertaining to the structure of the oscillations.

The significance of the results is discussed in Section 3.4. The proofs are presented in Section 5.

3.1 Asymptotic dynamics for path differentiable functions

Theorem 6 (Asymptotic dynamics for path differentiable functions).

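Assume that $f\colon\mathbb{R}^n\to\mathbb{R}$ is locally Lipschitz path differentiable, and that $\{x_i\}_{i\in\mathbb{N}}$ is a sequence generated by the subgradient method (Definition 1) that remains bounded. Then we have:

  1. (Lengthy separations) Let $x$ and $y$ be two distinct points in $\operatorname{ess\,acc}\{x_i\}$ such that $f(x)\le f(y)$. Let $\{x_{i_k}\}_k$ be a subsequence such that $x_{i_k}\to x$ as $k\to\infty$, and for each $k$ choose $i'_k>i_k$ such that $x_{i'_k}\to y$. Consider

    $$T_k=\sum_{p=i_k}^{i'_k-1}\varepsilon_p.$$

    Then $T_k\to\infty$.

  2. (Oscillation compensation) Let $\psi\colon\mathbb{R}^n\to[0,1]$ be a continuous function. Then for every subsequence $\{N_k\}_k\subset\mathbb{N}$ such that

    $$\liminf_{k\to\infty}\frac{\sum_{i=0}^{N_k}\varepsilon_i\,\psi(x_i)}{\sum_{i=0}^{N_k}\varepsilon_i}>0,$$

    we have

    $$\lim_{k\to\infty}\frac{\sum_{i=0}^{N_k}\varepsilon_i\,v_i\,\psi(x_i)}{\sum_{i=0}^{N_k}\varepsilon_i\,\psi(x_i)}=0.$$

  3. (Criticality) For all $x\in\operatorname{ess\,acc}\{x_i\}$, $0\in\partial^c f(x)$. In other words, $\operatorname{ess\,acc}\{x_i\}\subseteq\operatorname{crit}f$.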

3.2 Asymptotic dynamics for path differentiable functions with a weak Sard property

With slightly more stringent hypotheses, which are automatically valid for some important cases of lower- or upper-$C^k$ functions [6] (for $k$ sufficiently large), semialgebraic or tame functions [13], we have:

Theorem 7 (Asymptotic dynamics for path differentiable functions: weak Sard case).

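In the setting of Theorem 6, and if additionally $f$ is constant on the connected components of its critical set, then we also have:

  1. (Lengthy separations version 2) Let $x$ and $y$ be two distinct points in $\operatorname{acc}\{x_i\}$, and take $\delta>0$ small enough that the balls $B_\delta(x)$ and $B_\delta(y)$ are at a positive distance from each other, that is, $\operatorname{dist}(B_\delta(x),B_\delta(y))>0$. Consider the successive amounts of time it takes for the sequence to go from the ball $B_\delta(x)$ to the ball $B_\delta(y)$, namely,

    $$T_j=\inf\Bigl\{\sum_{p=i}^{\ell-1}\varepsilon_p:\ j\le i<\ell,\ x_i\in B_\delta(x),\ x_\ell\in B_\delta(y)\Bigr\}.$$

    Then $T_j\to\infty$ as $j\to\infty$.

  2. (Long intervals) Let $U$ and $V$ be open neighborhoods of a point $x\in\operatorname{acc}\{x_i\}$ such that $\overline{U}\subset V$. Let $A\subset\mathbb{N}$ be the union of the maximal intervals $I$ of the form $I=[i_1,i_2]\cap\mathbb{N}$ for some $i_1\le i_2\le\infty$, such that $\{x_i\}_{i\in I}\subset V$ and $\{x_i\}_{i\in I}\cap U\neq\emptyset$. Then either there is some $I$ that is unbounded or

    $$\lim_{\min I\to\infty}\ \sum_{i\in I}\varepsilon_i=\infty.$$

  3. (Oscillation compensation version 2) Let $U\subset V$ be two open sets as in item 2, and $A$ be the corresponding union of maximal intervals. Then

    $$\lim_{N\to\infty}\frac{\sum_{i\in A,\ i\le N}\varepsilon_i v_i}{\sum_{i\in A,\ i\le N}\varepsilon_i}=0.$$

  4. (Criticality) For all $x$ in the (traditional) accumulation set $\operatorname{acc}\{x_i\}$, $0\in\partial^c f(x)$. That is to say, $\operatorname{acc}\{x_i\}\subseteq\operatorname{crit}f$.

  5. (Convergence of the values) The values $f(x_i)$ converge to a real number as $i\to\infty$.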

Remark 8.

Items 4 and 5 of Theorem 7 can also be deduced from [8, Proposition 3.27] using a different approach. To our knowledge, items 1–3 of Theorem 7 as well as Theorem 6 do not have counterparts in the optimization literature.

3.3 Oscillation structure and asymptotics for Whitney stratifiable functions

The next two corollaries express that oscillations happen perpendicularly to the singular set of $f$, whenever it makes sense. In particular, they are perpendicular to $\operatorname{acc}\{x_i\}$ and $\operatorname{ess\,acc}\{x_i\}$, respectively, wherever this is well defined.

Corollary 9 (Perpendicularity of the oscillations).

In the setting of Theorem 7 (resp. Theorem 6), let be in the accumulation set (resp. essential accumulation set) of , and be a Lipschitz curve with the property that , and is differentiable at and for all . Then

for all . In other words .

Stratifiable functions (cf. Definition 5) provide much more insight into the oscillation compensation phenomenon: we have seen that substantial oscillations, i.e., those generated by nonvanishing subgradients, must be structured orthogonally to the limit point locus. Whitney rigidity then forces the following intuitive phenomenon: substantial bouncing drives the sequence to have limit points lying in the bed of V-shaped valleys formed by the graph of $f$.

Corollary 10 (Oscillations and V-shaped valleys).

Let $f$ be a Whitney stratifiable function, and let $x$ be a point in the accumulation set of a sequence $\{x_i\}$ generated by the subgradient method as in Definition 1. Assume that there is a subsequence $\{x_{i_k}\}$ with

then is contained in a stratum of dimension less than , and if is tangent to at then

This geometrical setting is reminiscent of the partial smoothness assumptions of Lewis: a smooth path lies in between the slopes of a sharp valley. While proximal-like methods end up in a finite time on the smooth locus [32, Theorem 4.1], our result suggests that the explicit subgradient method keeps on bouncing, approaching the smooth part without actually attaining it. This confirms the intuition that finite identification does not occur, although oscillations eventually provide some information on active sets by their “orthogonality features.”

3.4 Further discussion

Theorems 6 and 7 describe the long-term dynamics of the algorithm. While Theorem 6 only talks about what happens close to $\operatorname{ess\,acc}\{x_i\}$ and explains only what the most frequent persistent behavior is, Theorem 7 covers all of $\operatorname{acc}\{x_i\}$ and hence all recurrent behaviors.

Oscillation compensation.

While the high-frequency oscillations (i.e., bouncing) will, in many cases, be considerable, they almost cancel out. This is what we refer to as oscillation compensation. The intuitive picture the reader should have in mind is a statement that the oscillations cancel out locally, as in (2). Yet, because of minor technical details, we do not have exactly (2) and obtain instead very good approximations. Let us provide some explanations.

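Letting, in item 2 of Theorem 6, $\psi$ be a continuous cutoff function equal to 1 on a ball of radius $\delta>0$ around a point $x\in\operatorname{ess\,acc}\{x_i\}$ and vanishing outside the ball $B_{2\delta}(x)$, we get, for appropriate subsequences $\{N_k\}$,

$$\lim_{k\to\infty}\frac{\sum_{i=0}^{N_k}\varepsilon_i\,v_i\,\psi(x_i)}{\sum_{i=0}^{N_k}\varepsilon_i\,\psi(x_i)}=0,$$

which is indeed a very good approximation of (2).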
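A numerical sketch of this cutoff-weighted average, under assumptions chosen only for illustration: the objective $f(x)=|x|$, steps $1/(i+1)$, the limit point $0$, radius $\delta=0.1$, a piecewise linear cutoff, and a fixed finite horizon in place of the subsequence.

```python
def cutoff(p, center=0.0, delta=0.1):
    """Continuous cutoff: 1 within delta of center, 0 beyond 2*delta,
    linear in between."""
    d = abs(p - center)
    return max(0.0, min(1.0, 2.0 - d / delta))


weighted_velocity = 0.0   # sum of eps_i * v_i * psi(x_i)
weighted_time = 0.0       # sum of eps_i * psi(x_i)
x = 0.3
for i in range(20000):
    eps = 1.0 / (i + 1)
    v = -1.0 if x >= 0 else 1.0      # minus a Clarke subgradient of |x|
    w = eps * cutoff(x)
    weighted_velocity += w * v
    weighted_time += w
    x += eps * v

local_average = weighted_velocity / weighted_time
print("cutoff-weighted average velocity near 0:", local_average)
```

Each individual velocity has modulus 1, yet the weighted average is close to zero: the bounces to the left and to the right of the minimizer compensate each other once they are averaged over the time spent near the limit point.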

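Similarly, setting, in item 3 of Theorem 7, $U=B_\delta(x)$ and $V=B_{2\delta}(x)$, the balls centered at $x$ with radii $\delta$ and $2\delta$, we obtain this local version of the oscillation cancelation phenomenon: in the setting of Theorem 7, if $x\in\operatorname{acc}\{x_i\}$ and if $A$ is the union of maximal intervals $I$ such that $\{x_i\}_{i\in I}\subset B_{2\delta}(x)$ and $\{x_i\}_{i\in I}\cap B_\delta(x)\neq\emptyset$, then

$$\lim_{N\to\infty}\frac{\sum_{i\in A,\ i\le N}\varepsilon_i v_i}{\sum_{i\in A,\ i\le N}\varepsilon_i}=0.$$

Note that as we take the limit $N\to\infty$, we cover almost all $x_i$ in the ball $B_\delta(x)$, so we again get a statement very close to (2).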

Convergence.

While Theorem 7 tells us that $f(x_i)$ converges, we conjecture that this is no longer true in the context of Theorem 6, which is a matter for future research. Similarly, in the setting of path differentiable functions, the question of determining whether all limit points of bounded sequences are critical remains open.

In all cases, including the Whitney stratifiable case, the sequence $\{x_i\}$ may not converge. A well-known example of such a situation was provided for the case of smooth $f$ by Palis–de Melo [42].

However, our results show that the drift that causes the divergence of $\{x_i\}$ is very slow in comparison with the local oscillations. This slowness can be immediately appreciated in the statement of item 1 of Theorem 6 and items 1 and 2 of Theorem 7. In substance, these results express that even if the sequence diverges, it takes longer and longer to connect disjoint neighborhoods of different limit points.

4 A closed measure theoretical approach

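Given an open subset $U$ of $\mathbb{R}^n$, denote by $C^0(U)$ the set of continuous functions $U\to\mathbb{R}$, while $C^1(U)$ is the set of continuously differentiable functions. The set $\operatorname{Lip}(\mathbb{R},U)$ denotes the space of Lipschitz curves $\gamma\colon\mathbb{R}\to U$. When $U$ is bounded it is endowed with the supremum norm $\|\gamma\|_\infty=\sup_{t\in\mathbb{R}}\|\gamma(t)\|$.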

4.1 A compendium on closed measures

General results.

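Given a measure $\mu$ on some set $X$ and a measurable map $g\colon X\to Y$, where $Y$ is another set, the pushforward $g_*\mu$ is defined to be the measure on $Y$ such that, for $A\subset Y$ measurable, $g_*\mu(A)=\mu(g^{-1}(A))$.

Recall that the support of a positive Radon measure $\mu$ on $\mathbb{R}^n$, $\operatorname{supp}\mu$, is the set of points $x$ such that $\mu(U)>0$ for every neighborhood $U$ of $x$. It is a closed set.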

The origin of the concept of closed measures (sometimes also called holonomic measures or Young measures) can be traced back to the work of L.C. Young [53, 54] in the context of the calculus of variations. It has developed in parallel to the closely related normal currents [29, 28] and varifolds [2, 1], and has found applications in several areas of mathematics, especially Lagrangian and Hamiltonian dynamics [38, 37, 19, 50], and also optimal transport [9, 10].

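The definition of closed measures is inspired by the following observations. Given a curve $\gamma\colon[a,b]\to\mathbb{R}^n$, its position-velocity information can be encoded by a measure $\mu_\gamma$ on $\mathbb{R}^n\times\mathbb{R}^n$ that is the pushforward of the Lebesgue measure on the interval $[a,b]$ into $\mathbb{R}^n\times\mathbb{R}^n$ through the mapping $t\mapsto(\gamma(t),\gamma'(t))$, that is,

$$\mu_\gamma=(\gamma,\gamma')_*\operatorname{Leb}_{[a,b]}.$$

In other words, if $\varphi\colon\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ is a measurable function, then the integral with respect to $\mu_\gamma$ is given by

$$\int_{\mathbb{R}^n\times\mathbb{R}^n}\varphi\,d\mu_\gamma=\int_a^b\varphi(\gamma(t),\gamma'(t))\,dt.$$

With this definition of $\mu_\gamma$ it follows that $\gamma$ is closed, that is, $\gamma(a)=\gamma(b)$, if, and only if, for all smooth $\varphi\colon\mathbb{R}^n\to\mathbb{R}$, we have

$$\int_{\mathbb{R}^n\times\mathbb{R}^n}\nabla\varphi(x)\cdot v\,d\mu_\gamma(x,v)=\int_a^b\nabla\varphi(\gamma(t))\cdot\gamma'(t)\,dt=\varphi(\gamma(b))-\varphi(\gamma(a))=0.$$

In other words, the integral of $\nabla\varphi(x)\cdot v$ with respect to $\mu_\gamma$ is exactly the circulation of the gradient vector field $\nabla\varphi$ along the curve $\gamma$, and so it vanishes for every $\varphi$ exactly when $\gamma$ is closed. This generalizes into: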

Definition 11 (Closed measure).

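A compactly-supported, positive, Radon measure $\mu$ on $\mathbb{R}^n\times\mathbb{R}^n$ is closed if, for all functions $\varphi\in C^\infty(\mathbb{R}^n)$,

$$\int_{\mathbb{R}^n\times\mathbb{R}^n}\nabla\varphi(x)\cdot v\,d\mu(x,v)=0.$$

Let $\pi\colon\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}^n$ be the projection $\pi(x,v)=x$. To a measure $\mu$ in $\mathbb{R}^n\times\mathbb{R}^n$ we can associate its projected measure $\pi_*\mu$. As an immediate consequence we have that $\operatorname{supp}\pi_*\mu=\pi(\operatorname{supp}\mu)$.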
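The closedness condition can be tested numerically on measures induced by curves. The sketch below discretizes such a measure into weighted atoms and evaluates the circulation of one gradient field. The test function $\varphi(x,y)=x+xy+y^2$, the unit circle, the midpoint discretization, and all names are assumptions made for illustration; checking a single $\varphi$ can refute closedness but of course cannot prove it.

```python
import math


def curve_atoms(gamma, dgamma, a, b, n=2000):
    """Midpoint-rule discretization of the measure induced by a curve on
    [a, b]: a list of atoms (weight, position, velocity)."""
    h = (b - a) / n
    atoms = []
    for k in range(n):
        t = a + (k + 0.5) * h
        atoms.append((h, gamma(t), dgamma(t)))
    return atoms


def circulation(atoms, grad_phi):
    """Approximates the integral of grad_phi(x) . v against the measure."""
    total = 0.0
    for weight, x, v in atoms:
        g = grad_phi(x)
        total += weight * (g[0] * v[0] + g[1] * v[1])
    return total


def circle(t):
    return (math.cos(t), math.sin(t))


def circle_velocity(t):
    return (-math.sin(t), math.cos(t))


def grad_phi(p):
    """Gradient of the test function phi(x, y) = x + x*y + y**2."""
    return (1.0 + p[1], p[0] + 2.0 * p[1])


closed_loop = circulation(curve_atoms(circle, circle_velocity, 0.0, 2 * math.pi), grad_phi)
half_loop = circulation(curve_atoms(circle, circle_velocity, 0.0, math.pi), grad_phi)
print("full circle:", closed_loop)   # close to 0
print("half circle:", half_loop)     # close to phi(-1, 0) - phi(1, 0) = -2
```

For the full circle the circulation vanishes up to rounding, while for the half circle it equals the difference of the values of the test function at the two endpoints, so that measure is not closed.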

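The disintegration theorem [22] implies that there are probability measures $\mu_x$, $x\in\mathbb{R}^n$, on $\mathbb{R}^n$ such that

$$\mu=\int_{\mathbb{R}^n}\mu_x\,d(\pi_*\mu)(x).\tag{4}$$

We shall refer to the couple $(\pi_*\mu,\mu_x)$ as the disintegration of $\mu$. Thus if $\varphi\colon\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ is measurable, we have

$$\int_{\mathbb{R}^n\times\mathbb{R}^n}\varphi\,d\mu=\int_{\mathbb{R}^n}\int_{\mathbb{R}^n}\varphi(x,v)\,d\mu_x(v)\,d(\pi_*\mu)(x).$$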

Definition 12 (Centroid field).

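Let $\mu$ be a positive, compactly-supported, Radon measure on $\mathbb{R}^n\times\mathbb{R}^n$. The centroid field $\bar v_x$ of $\mu$ is, for $x\in\mathbb{R}^n$ and with the decomposition (4),

$$\bar v_x=\int_{\mathbb{R}^n}v\,d\mu_x(v).$$

The centroid field gives the average velocity, that is, the average of the velocities encoded by the measure at each point. As a consequence of the disintegration theorem [22], $x\mapsto\bar v_x$ is measurable, and for every measurable $\varphi\colon\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ linear in the second variable, we have

$$\int_{\mathbb{R}^n\times\mathbb{R}^n}\varphi(x,v)\,d\mu(x,v)=\int_{\mathbb{R}^n}\varphi(x,\bar v_x)\,d(\pi_*\mu)(x).\tag{5}$$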

It plays a significant role in our work. For later use, we record the following facts that follow from the definition of the centroid field, the convexity of , and the fact that is a probability:

Lemma 13 (Quasi-stationary bundle measures).

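If a compactly-supported, positive Radon measure $\mu$ has a centroid field $\bar v_x$ that vanishes $\pi_*\mu$-almost everywhere, then $\mu$ is closed.

Proof.

Indeed, if $\bar v_x=0$ for $\pi_*\mu$-almost every $x$, and if $\varphi\in C^\infty(\mathbb{R}^n)$, then applying (5) to the integrand $(x,v)\mapsto\nabla\varphi(x)\cdot v$, which is linear in $v$, we have

$$\int_{\mathbb{R}^n\times\mathbb{R}^n}\nabla\varphi(x)\cdot v\,d\mu(x,v)=\int_{\mathbb{R}^n}\nabla\varphi(x)\cdot\bar v_x\,d(\pi_*\mu)(x)=0,$$

so $\mu$ is closed. ∎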
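For a measure made of finitely many atoms, the centroid field is a weighted average of velocities at each position, and the lemma can be checked directly. The atoms, the test gradient, and the function names below are hypothetical choices made for this sketch.

```python
from collections import defaultdict


def centroid_field(atoms):
    """Centroid (mass-weighted mean velocity) at each position of a discrete
    measure given as atoms (weight, position, velocity)."""
    mass = defaultdict(float)
    moment = defaultdict(lambda: [0.0, 0.0])
    for weight, x, v in atoms:
        mass[x] += weight
        moment[x][0] += weight * v[0]
        moment[x][1] += weight * v[1]
    return {x: (moment[x][0] / mass[x], moment[x][1] / mass[x]) for x in mass}


def circulation(atoms, grad_phi):
    """Integral of grad_phi(x) . v against the discrete measure."""
    total = 0.0
    for weight, x, v in atoms:
        g = grad_phi(x)
        total += weight * (g[0] * v[0] + g[1] * v[1])
    return total


def grad_phi(p):
    """Gradient of an arbitrary smooth test function, phi(x, y) = x**2 + 3*y."""
    return (2.0 * p[0], 3.0)


# Pure "bouncing": at each position the velocities come in opposite pairs.
bouncing = [
    (0.25, (0.0, 0.0), (1.0, 2.0)), (0.25, (0.0, 0.0), (-1.0, -2.0)),
    (0.25, (1.0, 0.0), (0.0, 3.0)), (0.25, (1.0, 0.0), (0.0, -3.0)),
]
# Unbalanced velocities at a single position: a net drift to the right.
drifting = [(0.75, (0.0, 0.0), (1.0, 0.0)), (0.25, (0.0, 0.0), (-1.0, 0.0))]

bouncing_centroids = centroid_field(bouncing)
drifting_centroids = centroid_field(drifting)
bouncing_circulation = circulation(bouncing, grad_phi)
print(bouncing_centroids, bouncing_circulation)
print(drifting_centroids)
```

The bouncing measure has zero centroid everywhere and zero circulation, as the lemma predicts, even though none of its velocities vanish; this is the measure-theoretic picture of oscillations that compensate exactly.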

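Recall that the weak* topology in the space of Radon measures on an open set $U\subset\mathbb{R}^n\times\mathbb{R}^n$ is the one induced by the family of seminorms

$$|\mu|_\varphi=\Bigl|\int_U\varphi\,d\mu\Bigr|,\qquad\varphi\colon U\to\mathbb{R}\ \text{continuous and compactly supported}.$$

Thus a sequence of measures $\{\mu_k\}$ converges in this topology to a measure $\mu$ if, and only if, for all such $\varphi$,

$$\int_U\varphi\,d\mu_k\to\int_U\varphi\,d\mu.$$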

The following result can be regarded as a consequence of the forthcoming Theorem 15. It can also be seen as a special case of the results of [29] that are very well described in [30, Theorem 1.3.4.6]. Specifically it is shown in [30, Theorem 1.3.4.6] that it is possible to approximate, in a weak* sense, objects (namely, currents) intimately related to closed measures, by simpler objects (namely, closed polyhedral chains), which in our case correspond to combinations of finitely-many piecewise-smooth, closed curves.

Proposition 14 (Weak* density of closed curves).

Consider the set of measures of the form $c\,\mu_\gamma$ for some $c>0$ and a measure $\mu_\gamma$ induced by some closed, smooth curve $\gamma\colon[a,b]\to\mathbb{R}^n$, $\gamma(a)=\gamma(b)$, defined on an interval $[a,b]$ (which is not fixed). In the weak* topology, this set is dense in the set of closed measures.

Since the space of measures is sequential, this proposition means that for any closed measure $\mu$, we can find a sequence of closed curves $\{\gamma_k\}$ and constants $c_k>0$ that approximate $\mu$ in the sense that $c_k\,\mu_{\gamma_k}\to\mu$ in the weak* topology.

The following result, known as the Young superposition principle [53, 9] or as the Smirnov solenoidal representation [49, 4], is a strong refinement of the assertion of Proposition 14; see also [45, Example 6]. What this result tells us is basically that, not only can closed measures be approximated by measures induced by curves, but actually the centroidal measure

$$\int_{\mathbb{R}^n}\delta_{(x,\bar v_x)}\,d(\pi_*\mu)(x),$$

which captures many of the properties of $\mu$, can be decomposed into a combination of measures induced by Lipschitz curves. This decomposition is very useful theoretically, as there are no limits involved. For completeness, the following is proved in Section A.

Theorem 15 (Young superposition principle/Smirnov solenoidal representation).

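Let $U$ be a nonempty bounded open subset of $\mathbb{R}^n$ and set $\operatorname{Lip}=\operatorname{Lip}(\mathbb{R},U)$. For $t\in\mathbb{R}$, let $\tau_t\colon\operatorname{Lip}\to\operatorname{Lip}$ be the time-translation $\tau_t(\gamma)(s)=\gamma(s+t)$. For every closed probability measure $\mu$ supported in $U\times\mathbb{R}^n$ with centroid field $\bar v_x$, there is a Borel probability measure $\nu$ on the space $\operatorname{Lip}$ that is invariant under $\tau_t$ for all $t\in\mathbb{R}$ and such that

$$\int_{\mathbb{R}^n}\varphi(x,\bar v_x)\,d(\pi_*\mu)(x)=\int_{\operatorname{Lip}}\varphi(\gamma(0),\gamma'(0))\,d\nu(\gamma)\tag{6}$$

for any measurable $\varphi\colon\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$.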

Curves lying in have an appealing property:

Corollary 16 (Centroid representation).

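With the notation of the previous theorem, we have for $\nu$-almost all $\gamma$ in $\operatorname{Lip}$:

$$\gamma'(t)=\bar v_{\gamma(t)}$$

for almost all $t\in\mathbb{R}$.

Proof.

Take indeed $\varphi\ge 0$ vanishing only on the measurable set consisting of points of the form $(x,\bar v_x)$, $x\in\operatorname{supp}\pi_*\mu$. Then both sides of (6) must vanish, which means that for $\nu$-almost all $\gamma$, the point $(\gamma(0),\gamma'(0))$ must be of the form $(x,\bar v_x)$. The conclusion follows from the $\tau_t$-invariance of the measure $\nu$. ∎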

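As an example, take the case in which $\mu$ is the closed measure

$$\mu=\frac{1}{2\pi}\,\mu_\gamma$$

on $\mathbb{R}^2\times\mathbb{R}^2$ for

$$\gamma(t)=(\cos t,\sin t),\qquad t\in[0,2\pi].$$

In this simple example, the centroid coincides with the derivative, $\bar v_{\gamma(t)}=\gamma'(t)$. Each time-translate $\tau_t\gamma$ is still a parameterization of the circle, and the probability measure we obtain in Theorem 15 is

$$\nu=\frac{1}{2\pi}\int_0^{2\pi}\delta_{\tau_t\gamma}\,dt,$$

where $\delta_{\tau_t\gamma}$ is the Dirac delta function whose mass is concentrated at the curve $\tau_t\gamma$ in the space $\operatorname{Lip}$.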

The measure $\nu$ in Theorem 15 can be understood as a decomposition of the closed measure $\mu$ into a convex superposition of measures induced by Lipschitz curves. Although at first sight each $\gamma$ on the right-hand side of (6) only participates at $t=0$, the $\tau_t$-invariance of $\nu$ means that in fact the entire curve $\gamma$ is involved in the integral through its time translates $\tau_t\gamma$. Observe that another consequence of the $\tau_t$-invariance is that the integral in the right-hand side of (6) satisfies, for all $t\in\mathbb{R}$,

$$\int_{\operatorname{Lip}}\varphi(\gamma(0),\gamma'(0))\,d\nu(\gamma)=\int_{\operatorname{Lip}}\varphi(\gamma(t),\gamma'(t))\,d\nu(\gamma)=\frac{1}{|I|}\int_{\operatorname{Lip}}\int_I\varphi(\gamma(s),\gamma'(s))\,ds\,d\nu(\gamma),\tag{7}$$

where $I\subset\mathbb{R}$ is any nontrivial interval. Thus (6) has the more explicit lamination or superposition form:

$$\int_{\mathbb{R}^n}\varphi(x,\bar v_x)\,d(\pi_*\mu)(x)=\frac{1}{|I|}\int_{\operatorname{Lip}}\int_I\varphi(\gamma(t),\gamma'(t))\,dt\,d\nu(\gamma)\tag{8}$$

for any interval $I$ with nonempty interior.

Although the left-hand side of (6) does not involve the full measure $\mu$, it will turn out to be similar enough: if the integrand $\varphi$ is linear in the second variable $v$, we still have (5) and this will be enough for the applications we have in mind.

We remark that the measure $\nu$ in Theorem 15 is not unique in general. For example, if $\gamma$ is a closed curve intersecting itself once so as to form the figure 8, then the measure $\nu$ decomposing $\mu_\gamma$ could be taken to be supported on all the $\tau_t$-translates of $\gamma$ itself, or it could be taken to be supported on the curves traversing each of the loops of the 8.

Circulation for a subdifferential field.

We provide here some results related to subdifferentials, and that will be useful to the study of the vanishing step subgradient method.

Lemma 17.

Let $f$ be a locally Lipschitz continuous function and $\mu$ a closed measure with disintegration $(\pi_*\mu,\mu_x)$ and centroid field $\bar v_x$. If for some and some we have , then .

Proof.

Assume . Let , so that for all . Since is a convex set, is a convex function. Then by Jensen’s inequality we have

Proposition 18 (Circulation of subdifferential for path differentiable functions).

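If $f$ is a path differentiable function and $\mu$ is a closed probability measure, then for each open set $U\subset\mathbb{R}^n$ and each measurable function $\sigma\colon U\to\mathbb{R}^n$ with $\sigma(x)\in\partial^c f(x)$ for $x\in U$, the integral

$$\int_U\sigma(x)\cdot\bar v_x\,d(\pi_*\mu)(x)$$

is well defined, and its value is independent of the choice of $\sigma$. We define the symbol

$$\int_U\partial^c f(x)\cdot\bar v_x\,d(\pi_*\mu)(x)$$

to be equal to this value. If $U=\mathbb{R}^n$,

$$\int_{\mathbb{R}^n}\partial^c f(x)\cdot\bar v_x\,d(\pi_*\mu)(x)=0.$$

Proof.

Let $\sigma_1,\sigma_2\colon U\to\mathbb{R}^n$ be two measurable functions such that $\sigma_k(x)\in\partial^c f(x)$ for each $x\in U$. From Theorem 15 we get a $\tau_t$-invariant, Borel probability measure $\nu$ on the space of Lipschitz curves. Then, denoting by $\chi_U$ the indicator function of $U$,

$$\int_U(\sigma_1(x)-\sigma_2(x))\cdot\bar v_x\,d(\pi_*\mu)(x)=\int_{\operatorname{Lip}}\chi_U(\gamma(0))\,\bigl(\sigma_1(\gamma(0))-\sigma_2(\gamma(0))\bigr)\cdot\gamma'(0)\,d\nu(\gamma).$$

Since $f$ is path differentiable, for each $\gamma\in\operatorname{Lip}$ and for almost every $t$ with $\gamma(t)\in U$,

$$\sigma_1(\gamma(t))\cdot\gamma'(t)=(f\circ\gamma)'(t)=\sigma_2(\gamma(t))\cdot\gamma'(t).$$

From the $\tau_t$-invariance of $\nu$ it follows then that the integrand above vanishes $\nu$-almost everywhere.

Let us now analyze the case in which $U=\mathbb{R}^n$. Let $\psi\colon\mathbb{R}^n\to\mathbb{R}$ be a mollifier, that is, a compactly-supported, nonnegative, rotationally-invariant, $C^\infty$ function such that $\int_{\mathbb{R}^n}\psi=1$, and let $\psi_r(x)=r^{-n}\psi(x/r)$ for $r>0$, so that $\psi_r$ tends to the Dirac delta at 0 as $r\to 0$. Denote by $f*\psi_r$ the convolution of $f$ and $\psi_r$. Observe that if and , then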

This justifies the following calculation: