A Gradient Sampling Algorithm for Stratified Maps with Applications to Topological Data Analysis

09/01/2021
by   Jacob Leygonie, et al.
University of Oxford
Inria
0

We introduce a novel gradient descent algorithm extending the well-known Gradient Sampling methodology to the class of stratifiably smooth objective functions, which are defined as locally Lipschitz functions that are smooth on some regular pieces-called the strata-of the ambient Euclidean space. For this class of functions, our algorithm achieves a sub-linear convergence rate. We then apply our method to objective functions based on the (extended) persistent homology map computed over lower-star filters, which is a central tool of Topological Data Analysis. For this, we propose an efficient exploration of the corresponding stratification by using the Cayley graph of the permutation group. Finally, we provide benchmark and novel topological optimization problems, in order to demonstrate the utility and applicability of our framework.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

06/06/2019

A Non-Asymptotic Analysis of Network Independence for Distributed Stochastic Gradient Descent

This paper is concerned with minimizing the average of n cost functions ...
12/31/2021

Distributed Random Reshuffling over Networks

In this paper, we consider the distributed optimization problem where n ...
07/14/2021

A Granular Sieving Algorithm for Deterministic Global Optimization

A gradient-free deterministic method is developed to solve global optimi...
10/02/2019

A Framework for Differential Calculus on Persistence Barcodes

We define notions of differentiability for maps from and to the space of...
04/25/2017

A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates

This paper considers the problem of decentralized optimization with a co...
02/06/2020

Achieving the fundamental convergence-communication tradeoff with Differentially Quantized Gradient Descent

The problem of reducing the communication cost in distributed training t...
02/07/2021

Classification based on Topological Data Analysis

Topological Data Analysis (TDA) is an emergent field that aims to discov...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

1.1 Motivation and related work

In its most general instance nonsmooth, non convex, optimization seek to minimize a locally Lipschitz objective or loss function

. Without further regularity assumptions on , most algorithms—such as the usual Gradient Descent with learning rate decay, or the Gradient Sampling method—are only guaranteed to produce iterates whose subsequences are asymptotically stationary, without explicit convergence rates. Meanwhile, when restricted to the class of min-max functions (like the maximum of finitely many smooth maps), stronger guarantees such as convergence rates can be obtained [43]. This illustrates the common paradigm in nonsmooth optimization: the richer the structure in the irregularities of , the better the guarantees we can expect from an optimization algorithm. Note that there are algorithms specifically tailored to deal with min-max functions, e.g. [8].

Another example are bundle methods [37, 38, 42]. They consist, roughly, in constructing successive linear approximations of  as a proxy for minimization. Their convergence guarantees are strong, especially when an additional semi-smoothness property of  [10, 57] can be made. Other types of methods, like the variable metric ones, can also benefit from the semi-smoothness hypothesis [73]. In many cases, convergence properties of the algorithm are not only dependent on the structure on , but also on the amount of information about that can be computed in practice. For instance, the bundle method [55] assumes that the Hessian matrix, when defined locally, can be computed. For accounts of the theory and practice in nonsmooth optimization, we refer the interested reader to [5, 51, 66].

The ability to cut  in well-behaved pieces where  is regular, is another type of important structure. Examples, in increasing order of generality, are semi-algebraic, (sub)analytic, definable, tame (w.r.t. an o-minimal structure), and Whitney stratifiable functions [11]. For such objective functions, the usual gradient descent (GD) algorithm, or a stochastic version of it, converges to stationary points [31]. In order to obtain further theoretical guarantees such as convergence rates, it is necessary to design optimization algorithms specifically tailored for regular maps, since they enjoy stronger properties, e.g., tame maps are semi-smooth [48], and the generalized gradients of Whitney stratifiable maps are closely related to the (restricted) gradients of the map along the strata [11]. Besides, strong convergence guarantees can be obtained under the Kurdyka–Łojasiewicz assumption [4, 61], which includes the class of semi-algebraic maps. Our method is related to this line of work, in that we exploit the strata of  in which  is smooth.

The motivation of this work stems from Topological Data Analysis (TDA), where geometric objects such as graphs are described by means of computable and topological descriptors. Persistent Homology (PH) is one such descriptor, and has been successfully applied in various areas such as neuroscience [30, 7], material sciences [44, 69], signal analysis [63, 72], shape recognition [54]

, or machine learning

[23, 18].

Persistent Homology describes graphs, and more generally simplicial complexes, over  nodes by means of a signature called the barcode, or persistence diagram . Here  is a filter function

, that is a function on the nodes, which we view as a vector in 

. Loosely speaking,  is a finite multi-set of points in the upper half-plane that encodes topological and geometric information about the underlying simplicial complex and the function .

Barcodes form a metric space when equipped with the standard metrics of TDA, the so-called bottleneck and Wasserstein distances, and the persistence map  is locally Lipschitz [27, 28]. However  is not Euclidean nor a smooth manifold, thus hindering the use of these topological descriptors in standard statistical or machine learning pipelines. Still, there exist natural notions of differentiability for maps in and out of  [53]. In particular, the persistence map  restricts to a locally Lipschitz, smooth map on a stratification of  by polyhedra. If we compose the persistence map with a smooth and Lipschitz map , the resulting objective (or loss) function

is itself Lipschitz and smooth on the various strata. From [31], and as recalled in [17]

, classical (Stochastic) Gradient Descent on 

asymptotically converges to stationary points. Similarly, the Gradient Sampling (GS) method asymptotically converges. See [68] for an application of GS to topological optimization.

Nonetheless, it is important to design algorithms that take advantage of the structure in the irregularities of the persistence map , in order to get better theoretical guarantees. For instance, one can locally integrate the gradients of —whenever defined—to stabilize the iterates [68], or add a regularization term to  that acts as a smoothing operator [29]. In this work, we rather exploit the stratification of  induced by , as it turns out to be easy to manipulate. We will show in particular that we can efficiently access points  located in neighboring strata of the current iterate

, as well as estimate the distance to these strata.

For this reason, we believe that persistent homology-based objective functions  form a rich playground for nonsmooth optimization, with many applications in point cloud inference [40], surface reconstruction [13], shape matching [64], graph classification [45, 75], topological regularization for generative models [58, 46, 39], image segmentation [47, 26], or dimensionality reduction [50], to name a few.

Figure 1: A proof-of-concept comparison between different gradient descent techniques. The objective function (blue surface) attains its minimum at and is not smooth along the line . In particular, around , thus the gradient norm cannot be used as a stopping criterion. The traditional GD, for which updates are given by , oscillates around  due to the non-smoothness of  and asymptotically converges toward  because of the decaying learning rate . In the meantime, non-smooth optimization methods that sample points around  in order to produce reliable descent directions converge in finite time. Namely, the classical Gradient Sampling method randomly samples  points and manages to reach an -stationary point of  in  iterations (averaged over 100 experiments), while our stratified approach improves on this by leveraging the fact that we explicitly have access to the two strata  and  where  is differentiable. In particular, we only sample additional points when  is -near the line , and reach an -stationary point in 18 iterations. Right plots showcase the evolution of the objective value  and the distance to the minimum  across iterations. Parameters: , , , .

1.2 Contributions and outline of contents

Our new method, called Stratified Gradient Sampling (SGS), is a variation of the established GS algorithm, whose main steps for updating the current iterate  we recall in Algorithm 1 below.

1:  Sample points in a ball
2:  Compute approximate subgradient
3:  Compute descent direction 
4:  Find step size so that  (hyperparameter)
5:  Ensure that  is differentiable at  by small perturbations
Algorithm 1 An update step with the Gradient Sampling algorithm

Our method is motivated by the closing remarks of a recent overview of the GS methodology [15], in which the authors suggest that the GS theory and practice could be enhanced by assuming some extra structure on top of the non differentiability of .

In this work, we deal with stratifiably smooth maps, for which the non differentiability is organized around smooth submanifolds that together form a stratification of . In Section 2, we review some background material in nonsmooth analysis and define stratifiably smooth maps , a variant of the Whitney stratifiable functions from [11] for which we do not impose any Whitney regularity on the gluing between adjacent strata of , but rather enforce that there exist local -extensions of the restrictions of  to top-dimensional strata.

In order to update the current iterate  when minimizing a stratifiably smooth objective function , we introduce a new descent direction . As in GS, is obtained in our new SGS algorithm by collecting the gradients of samples around  in an approximate subgradient , and then by taking the element with minimal norm in the convex set generated by . A key difference with GS is that we only need to sample as many points around  as there are distinct strata close by, compare with the  samples of Algorithm 1. In Proposition 3, we show that we indeed obtain a descent direction, i.e., that we have the descent criterion  (as in Line 4 of Algorithm 1) for a suitable choice of step size .

Our SGS algorithm is detailed in Section 3.1 and its analysis in Section 3.2. The convergence of the original GS methodology crucially relies on the sample size  in order to apply the Carathéodory Theorem to subgradients. Differently, our convergence analysis relies on the fact that the gradients of , when restricted to neighboring strata, are locally Lipschitz. Hence, our proof of asymptotic convergence to stationary points (Theorem 2) is substantially different. In Theorem 3, we determine a convergence rate of our algorithm that holds for any proper stratifiably smooth map, which is an improvement over the guarantees of GS for general locally Lipschitz maps. Finally, in Section 3.3, we adapt our method and results to the case where only estimated distances to nearby strata are available.

In Section 4, we introduce the persistence map over a simplicial complex , which gives rise to a wide class of stratifiably smooth objectivefunctions with rich applications in TDA. We characterize strata around the current iterate (i.e., filter function)  by means of the permutation group over  elements, where  is the number of vertices in . Then, the Cayley graph associated to the permutation group allows us to use Dijkstra’s algorithm to efficiently explore the set of neighboring strata by increasing order of distances to , that are needed to compute descent directions.

Section 5 is devoted to the implementation of the SGS algorithm for the optimization of persistent homology-based objective functions . In Section 5.1, we provide empirical evidence that SGS behaves better than GD and GS with a simple experiment about minimization of total persistence. In Section 5.3 and Section 5.2, we consider two novel topological optimization problems which we believe are of interest in real-world applications. On the one hand, the Topological Template Registration of a filter function  defined on a complex , is the task of finding a filter function  over a smaller complex  that preserves the barcode of . On the other hand, given a Mapper graph , which is a standard visualization tool for arbitrary data sets [67], we can bootstrap the data set in order to produce multiple bootstrapped graphs . The Topological Mean is then the task of finding a new graph whose barcode is as close as possible to the mean of the barcodes associated to the graphs . As a result we obtain a smoothed version of the Mapper graph  in which spurious and non-relevant graph attributes are removed.

2 A direction of descent for stratifiably smooth maps

In this section, we define the class of stratifiably smooth maps whose optimization is at stake in this work. For such maps, we can define an approximate subgradient and a corresponding descent direction, which is the key ingredient of our algorithm.

2.1 Nonsmooth Analysis

We first recall some useful background in nonsmooth analysis, essentially from [25]. Throughout, is a locally Lipschitz (non necessary smooth, nor convex) and proper (i.e., compact sublevel sets) function, which we aim to minimize.

First-order information of at in a direction is captured by its generalized directional derivative:

(1)

Besides, the generalized gradient is the following set of linear subapproximations:

(2)

Given an arbitrary set  of Lebesgue measure , we have an alternative description of the generalized gradient in terms of limits of surrounding gradients, whenever defined:

(3)

where  is the operation of taking the closure of the convex hull.111Here the equality holds as well (by some compactness argument) when taking the convex hull without closure. As we take closed convex hulls later on, we choose not to introduce this subtlety explicitly. The duality between generalized directional derivatives and gradients is captured by the equality:

(4)

The Goldstein subgradient [41] is an -relaxation of the generalized gradient:

(5)

Given , we say that:

is a stationary point (for ) if .

Any local minimum is stationary, and conversely if is convex. We also have weaker notions. Namely, given ,

is -stationary if ; and

is -stationary if .

2.2 Stratifiably smooth maps

Desirable properties for an optimization algorithm is that it produces iterates  that either converge to an -stationary point in finitely many steps, or whose subsequences (or some of them) converge to an -stationary point. For this, we work in the setting of objective functions that are smooth when restricted to submanifolds, that together partition .

Definition 1.

A stratification of a closed subset is a locally finite partition of  by smooth submanifolds —called strata—such that for :

This makes into a stratified space.

Note that we do not impose any (usually needed) gluing conditions between adjacent strata, as we do not require them in the analysis. In particular, semi-algebraic, subanalytic, or definable subsets of , together with Whitney stratified sets are stratified in the above weak sense. We next define the class of maps  with smooth restrictions  to strata  of some stratification , inspired by the Whitney stratifiable maps of [11] (there  is required to be Whitney) and the stratifiable functions of [34], however we further require that the restrictions  admit local extensions of class .

Definition 2.

The map is stratifiably smooth if there exists a stratification of , such that for each stratum , the restriction  admits an extension  of class  in a neighborhood of .

Remark 1.

The slightly weaker assumption that the extension  is continuously differentiable with locally Lipschitz gradient would have also been sufficient for our purpose.

We denote by  the stratum containing , and by the set of strata containing  in their closures. More generally, for , we let be the set of strata such that the closure of the ball has non-empty intersection with the closure of . Local finiteness in the definition of a stratification implies that (and ) is finite.

If  is stratifiably smooth and is a stratum, there is a well-defined limit gradient , which is the unique limit of the gradients where is any sequence converging to . Indeed, this limit exists and does not depend on the choice of sequence since admits a local extension . The following result states that the generalized gradient at can be retrieved from these finitely many limit gradients along the various adjacent top-dimensional strata.

Proposition 1.

If  is stratifiably smooth, then for any we have:

More generally, for :

Proof.

We show the first equality only, as the second can be proven along the same lines. We use the description of  in terms of limit gradients from Equation 3, which implies the inclusion of the right-hand side in . Conversely, let  be the union of strata in with positive codimension, which is of measure . Let be a sequence avoiding , converging to , such that converges as well. Since is finite, up to extracting a subsequence, we can assume that all are in the same top-dimensional stratum . Consequently,  converges to . ∎

2.3 Direction of descent

Thinking of  as a current position, we look for a direction of (steepest) descent, in the sense that a perturbation of  in this direction produces a (maximal) decrease of . Given , we let  be the projection of the origin on the convex set . Equivalently,  solves the minimization problem:

(6)

Introduced in [41], the direction is a good candidate of direction for descent, as we explain now. Since is the projection of the origin on the convex closed set , we have the classical inequality  that holds for any  in the Goldstein subgradient at . Equivalently,

(7)

Informally, if we think of a small perturbation of along this direction, for  small enough, then . Using Equation 7, since , we deduce that . So  locally decreases at linear rate in the direction . This intuition relies on the fact that is well-defined so as to provide a first order approximation of  around , and that is chosen small enough. In order to make this reasoning completely rigorous, we need the following well-known result (see for instance [25]):

Theorem 1 (Lebourg Mean value Theorem).

Let . Then there exists some and some such that:

Let be lesser than , in order to ensure that is -close to . Then by the mean value theorem (and Proposition 1), we have that

for some . Equation 7 yields

(8)

as desired.

In practical scenarios however, it is unlikely that the exact descent direction could be determined. Indeed, from Equation 6, it would require the knowledge of the set , which consists of infinitely many (limits of) gradients in an -neighborhood of . We now build, provided is stratifiably smooth, a faithful approximation of , by collecting gradient information in the strata that are -close to .

For each top-dimensional stratum , let be an arbitrary point in . Define

(9)

Of course, depends on the choice of each . But this will not matter for the rest of the analysis, as we will only rely on the following approximation result which holds for arbitrary choices of points :

Proposition 2.

Let and . Assume that is stratifiably smooth. Let be a Lipschitz constant of the gradients restricted to , where is some local extension of , and is top dimensional. Then we have:

In particular, .

Note that, since the are of class , their gradients are locally Lipschitz, hence by compactness of , the existence of the Lipschitz constant above is always guaranteed.

Proof.

From Proposition 1, we have

This yields the inclusion . Now, let , , and let be a top-dimensional stratum touching . Based on how is defined in Equation 9, we have that  and both belong to , and they both belong to the stratum . Therefore, , and so . The result follows from the fact that is convex and closed. ∎

Recall from Equation 8 that the (opposite to the) descent direction is built as the projection of the origin onto . Similarly, we define our approximate descent direction as , where is the projection of the origin onto the convex closed set :

(10)

We show that this choice yields a direction of decrease of , in a sense similar to Equation 8.

Proposition 3.

Let  be stratifiably smooth, and let  be a non-stationary point. Let , and . Denote by  a Lipschitz constant for all gradients of the restrictions  to the ball (as in Proposition 2). Then:

  • For small enough we have ; and

  • For such , we have .

Proof.

Saying that  is non-stationary is equivalent to the inequality . We show that the map , which is non-increasing, is continuous at . Let  be small enough such that the sets of strata incident to  are the same that meet with the -ball around , i.e., . Such an  exists since there are finitely many strata, which are closed sets, that meet with a sufficiently small neighborhood of . Of course, all smaller values of  enjoy the same property. By Proposition 1, we then have the nesting

where is a Lipschitz constant for the gradients in neighboring strata. In turn, . In particular,  as  goes to , hence . Besides, the inclusion (Proposition 2) implies that . This yields and so item (i) is proved.

We now assume that  satisfies the inequality of item (i), and let . By the Lebourg mean value theorem, there exists a and some  such that:

Since is at distance no greater than from . In particular, belongs to . From Proposition 2, there exists some at distance no greater than from . We then rewrite:

(11)

On the one hand, by the Cauchy-Schwarz inequality:

(12)

where the last inequality relies on the assumption that . On the other hand, since is the projection of the origin onto , we obtain , or equivalently:

(13)

Plugging the inequalities of Equations 13 and 12 into Equation 11 proves item (ii). ∎

3 Stratified Gradient Sampling (SGS)

In this section we develop a gradient descent algorithm for the optimization of stratifiably smooth functions, and then we detail its convergence properties. We require that the objective function  has the following properties:

  • (Proper): has compact sublevel sets.

  • (Stratifiably smooth): is stratifiably smooth, and for each iterate  and  we have an oracle  that samples one -close element  in each -close top-dimensional stratum .

  • (Differentiability check) We have an oracle  checking whether an iterate  belongs to the set  over which  is differentiable.

That  is a proper map is also needed in the original GS algorithm [16], but is a condition that can be omitted as in [52] to allow the values  to decrease to . In our case we stick to this assumption because we need the gradient of  (whenever defined) to be Lipschitz on sublevel sets.

Similarly, the ability to check that an iterate  belongs to  is standard in the GS methodology. We use it to make sure that  is differentiable at each iterate . For this, we call a subroutine  which slightly perturbs the iterate  to achieve differentiability and to maintain a descent condition. Note that these considerations are mainly theoretical because generically the iterates  are points of differentiability, hence  is unlikely to change anything.

The last requirement that  is stratifiably smooth replaces the classical weaker assumption used in the GS algorithm that  is locally Lipschitz and that the set  where  is differentiable is open and dense. There are many possible ways to design the oracle 

: for instance the sampling could depend upon arbitrary probability measures on each stratum, or it could be set by deterministic rules depending on the input 

as will be the case for the persistence map in Section 4

. However our algorithm and its convergence properties are oblivious to these degrees of freedom, as by

Section 2.3 any sampling allows us to approximate the Goldstein subgradient  using finitely many neighbouring points to compute . In turn we have an approximate descent direction  which can be used to produce the subsequent iterate  as in the classical smooth gradient descent.

3.1 The algorithm

The details of the main algorithm  are given in Algorithm 2.

The algorithm  calls the method  of Algorithm 3 as a subroutine to compute the right descent direction  and the right step size . Essentially, this method progressively reduces the exploration radius  of the ball where we compute the descent direction  until the criteria of Proposition 3 ensuring that the loss sufficiently decreases along  are met.

Given the iterate  and the radius , the calculation of  is done by the subroutine in Algorithm 4: points  in neighboring strata that intersect the ball  are sampled using  to compute the approximate Goldstein gradient and in turn the descent direction .

Much like the classical GS algorithm, our method behaves like the well-known smooth gradient descent where the gradient is replaced with a descent direction computed from gradients in neighboring strata. A key difference however is that, in order to find the right exploration radius  and step size , the needs to maintain a constant  to approximate the ratio  of Proposition 3, as no Lipschitz constant  may be explicitly available.

To this effect, maintains a relative balance between the exploration radius  and the norm of the descent direction , controlled by , i.e., . As we further maintain , we know that the convergence properties of  and  are closely related. Thus, the utility of this controlling constant is mainly theoretical, to ensure convergence of the iterates  towards stationary points in Theorem 2. In practice, we start with a large initial constant , and decrease it on line 11 of Algorithm 3 whenever it violates a property of the target constant  given by Proposition 3.

0:  Loss function , initial iterate , exploration radius , initial constant  controlling exploration radius, critical distance to origin , descent rate , step size decay rate 
1:  
2:  repeat
3:      via Algorithm 3
4:     
5:     
6:     
7:  until 
8:  return  
Algorithm 2
1:   and 
2:  repeat
3:      via Algorithm 4
4:     if   then
5:        Break, return , and                   # Set to reach an -stationary point
6:     end if
7:                                                                                # Candidate of update step
8:     while  and  do
9:                   # Once , this loop never occurs by (ii) of Proposition 3
10:     end while
11:     if  or  then
12:                                               # Reduce  to satisfy criterion (i) of Proposition 3
13:     end if
14:  until  and
15:  return  , and 
Algorithm 3
1:             # Eventually  will be some approximate Goldstein subgradient
2:            # -close samples from -close top dim strata
3:  for  do
4:                                                  # Add gradients from remote strata
5:  end for
6:  Solve the quadratic minimization problem 
7:  return                              # is the approximate steepest descent direction
Algorithm 4
1:  
2:  while  or  do
3:     Replace with a sample in
4:     
5:  end while
6:  return  
Algorithm 5
Remark 2.

Assume that we dispose of a common Lipschitz constant  for all gradients  in the -neighborhood of the current iterate , recall that  is any  extension of the restriction  to the neighboring top-dimensional stratum . Then we can simplify Algorithm 3 by decreasing the exploration radius  progressively until  as done in Algorithm 6: This ensures by Proposition 3 that the resulting update step satisfies the descent criterion . In particular the parameter  is no longer needed, and the theoretical guarantees of the simplified algorithm are unchanged. Note that for objective functions from TDA (see Section 4), the stability theorems (e.g. from [27]) often provide global Lipschitz constants, enabling the use of the simplified update step described in Algorithm 6.

1:   and via Algorithm 4
2:  repeat
3:     if   then
4:        Break, return and                  # Set to reach an -stationary point
5:     end if
6:      and via Algorithm 4
7:  until 
8:  return   and
Algorithm 6
Remark 3.

In the situation of Remark 2, let us further assume that  is non-increasing. This monotonicity property mimics the fact that  is non-increasing, since increasing  grows the Goldstein generalized gradient , of which  is the element with minimal norm. If the initial exploration radius  does not satisfy the termination criterion (Line 8), , then one can set , yielding . This way Algorithm 6 is further simplified: in constant time we find a  that yields the descent criterion , and the parameter  is no longer needed. A careful reading of the proofs provided in the following section yields that the convergence rate (Theorem 3) of the resulting algorithm is unchanged, however the asymptotic convergence (Theorem 2), case , needs to be weakened: converging subsequences converge to -stationary points instead of stationary points. We omit details for the sake of concision.

3.2 Convergence

We show convergence of Algorithm 2 towards stationary points in Theorem 2. Finally, Theorem 3 provides a non-asymptotic sub linear convergence rate, which is by no mean tight yet gives a first estimate of the number of iterations required in order to reach an approximate stationary point.

Theorem 2.

If , then almost surely the algorithm either converges in finitely many iterations to an -stationary point, or produces a bounded sequence of iterates  whose converging subsequences all converge to stationary points.

As an intermediate result, we first show that the update step computed in Algorithm 3 is obtained after finitely many iterations and estimate its magnitude relatively to the norm of the descent direction.

Lemma 1.

terminates in finitely many iterations. In addition, let  be a Lipschitz constant for the restricted gradients  (as in Proposition 2) in the -neighborhood of . Assume that . If  is not an -stationary point, then the returned exploration radius  satisfies:

Moreover the returned controlling constant  satisfies:

Proof.

If  is a stationary point, then  returns a trivial descent direction  because the approximate gradient  contains  (Line 1). In turn, terminates at Line 4.

Henceforth we assume that  is not a stationary point and that the breaking condition of Line 4 in Algorithm 3 is never reached (otherwise the algorithm terminates). Therefore, at each iteration of the main loop, either  is replaced by  (line 9), or  is replaced by  (line 12), until both the following inequalities hold (line 14):

Once  becomes lower than , inequality (A) implies inequality (B) by Proposition 3 . It then takes finitely many replacements  to reach inequality (A), by Proposition 3 (i). At that point (or sooner), Algorithm 3 terminates. This concludes the first part of the statement, namely  terminates in finitely many iterations.

Next we assume that  is not an -stationary point, which ensures that the main loop of cannot break at Line 5. We have the invariant : this is true at initialization () by assumption, and in later iterations  is only replaced by  whenever (A) holds but not (B), which forces  by Proposition 3 (ii).

At the end of the algorithm, by inequality (A), and so we deduce the right inequality .

Besides, if both (A) and (B) hold when entering the main loop (line 11) for the first time, then . Otherwise, let us consider the penultimate iteration of the main loop for which the update step is . Then, either condition (A) does not hold, namely , or condition (B) does not hold, which by the assertion (ii) of Proposition 3 implies . In any case, we deduce that

Proof of Theorem 2.

We assume that alternative  does not happen. By Lemma 1, Algorithm 3 terminates in finitely many iteration and by Line 14 we have the guarantee: