Convergence of the Forward-Backward Algorithm: Beyond the Worst Case with the Help of Geometry

We provide a comprehensive study of the convergence of the forward-backward algorithm under suitable geometric conditions leading to fast rates. We present several new results and collect in a unified view a variety of results scattered in the literature, often providing simplified proofs. Novel contributions include the analysis of infinite dimensional convex minimization problems, allowing for the case where minimizers might not exist. Further, we analyze the relation between different geometric conditions, and discuss novel connections with a priori conditions in linear inverse problems, including source conditions, restricted isometry properties and partial smoothness.

1 Introduction

Splitting algorithms based on first order descent methods are widely used to solve high dimensional convex optimization problems in signal and image processing [26], compressed sensing [30], and machine learning [68]. While an advantage of these methods is their simplicity and a complexity that is independent of the dimension of the problem, a drawback is that their convergence rates are known to be slow in the worst case. For instance, the gradient method applied to a smooth convex function converges in values as O(1/k) [29, 77]. We refer to these results as worst case since no particular assumption is made aside from the existence of a solution. Clearly, this allows for convex functions with wild behaviors around the minimizers [19], behaviors that might hardly appear in practice. It is then natural to ask whether improved rates can be attained under further regularity assumptions.
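For reference, the standard textbook worst-case bound for gradient descent with step size 1/L on a convex function f with L-Lipschitz gradient is (with x* denoting any minimizer; the constants are the usual ones and are recalled here only for orientation):

\[
f(x_k) - \min f \;\le\; \frac{L\,\|x_0 - x^\star\|^{2}}{2k}, \qquad k \ge 1 .
\]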

Strong convexity is one such assumption, and indeed it is known to guarantee linear convergence rates [43, 78]. In practice, strong convexity is too restrictive and one would wish to relax it while retaining fast rates. In this paper, we consider geometric conditions that, roughly speaking, describe convex functions that behave like

(1)

for some and on some subset which is typically a neighborhood of the minimizers and/or a sublevel set. The intuition behind this kind of assumption is clear: the bigger the exponent, the "flatter" the function is around its minimizers, which in turn means that a gradient descent algorithm will converge more slowly. The idea of exploiting geometric conditions to derive convergence rates has a long history dating back to [73, 76]; see the literature review in Section 2.3. Recently, there has been a renewed interest, since in practice splitting methods can have fast convergence rates. Plenty of similar convergence rate results have been derived under different yet related geometrical properties, see Section 3.1.
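For concreteness, a typical growth condition of the kind described by (1), stated here with generic constants (an illustrative form; the precise conditions used later are given in Definition 3.1), is

\[
f(x) - \inf f \;\ge\; \frac{\gamma}{p}\,\operatorname{dist}\big(x, \operatorname{argmin} f\big)^{p} \qquad \text{for all } x \in \Omega,
\]

with exponent p ≥ 1 and constant γ > 0.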

The goal of this paper is to provide a comprehensive study of the convergence rates of the forward-backward algorithm for convex minimization problems under geometric conditions such as (1). We collect in a unified view a variety of results scattered in the literature, and derive several novel results along the way. After reviewing basic (worst-case) convergence results for the forward-backward algorithm, we recall and connect different geometric conditions. In particular, we study the conditioning and Łojasiewicz properties, and provide a sum rule for conditioned functions. The central part of our study is devoted to exploiting the -Łojasiewicz property to study the convergence of the forward-backward algorithm and a broader class of first-order descent methods. We show that the convergence is finite if , superlinear if , linear if , and sublinear if . We further show that -conditioning is essentially equivalent to the linear convergence of the forward-backward algorithm. Then, we consider the case of convex functions that are bounded from below but have no minimizers. We show that in this case the -Łojasiewicz property with provides sharp sublinear rates for the values, from to .

Our setting allows us to consider infinite dimensional problems, and in particular linear inverse problems with convex regularizers. Indeed, an observation that motivated our study is that many practical optimization problems are derived from estimation problems, such as inverse problems, defined by suitable modeling assumptions. It is then natural to ask whether these assumptions have a geometric interpretation in the sense of condition (1). This is indeed the case, and we provide two main examples. First, we show that classical Hölder source conditions in inverse problems [40] correspond to the -Łojasiewicz property, with , on some dense affine subspace. Second, we consider sparse regularized inverse problems, for which we observe that the restricted injectivity condition [24], which is key for exact recovery, induces a -conditioning of the problem over a cone of sparse vectors. More generally, we consider inverse problems with partially smooth regularizing functions [46], and show that the restricted injectivity condition induces a -conditioning of the problem over an identifiable manifold. Studying the above connections required considering geometric conditions like (1) on general subsets, rather than on sublevel sets as is typically done in the literature.

The rest of the paper is organized as follows. We set the notation, introduce the forward-backward algorithm, and discuss worst-case convergence in Section 2, as well as related results in Subsection 2.3. In Section 3, we define and connect the different geometric conditions: -conditioning, -metric subregularity, and the -Łojasiewicz property. This section also contains examples and a new sum rule for -conditioned functions. Convergence rates of the forward-backward algorithm are given in Section 4. Section 5 is devoted to discussing the special case of linear inverse problems.

2 The forward-backward algorithm: notation and background

2.1 Notation and basic definitions

We recall a few classic notions and introduce some notation. Throughout the paper is a Hilbert space. Let and . and denote respectively the open and closed balls of radius centered at . and are used to denote and . The distance of from a set is , and stands for , with the convention that . If is closed and convex, is the projection of onto , and and respectively denote the relative interior and the strong relative interior of [11]. Given a bounded self-adjoint linear operator , we denote by the set of eigenvalues of [40], and and are defined as and , respectively. Let be the class of convex, lower semi-continuous, and proper functions from to . For and , denotes the (Fenchel) subdifferential of at [11]. We also introduce the following notation for the sublevel sets of :

The following assumption will be made throughout this paper.

Assumption 2.1.

is a Hilbert space, , and is differentiable and convex, with -Lipschitz continuous gradient for some and we set

Splitting methods, such as the forward-backward algorithm, are extremely popular for minimizing an objective function as in Assumption 2.1. To obtain an implementable procedure, we implicitly assume that the proximal operator of can be easily computed (see e.g. [26]):

(2)
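For reference, the standard definition of the proximal operator of a function g, with step size γ > 0 (generic notation, used here only for orientation), is

\[
\operatorname{prox}_{\gamma g}(x) \;=\; \operatorname*{argmin}_{u \in \mathcal{H}} \;\Big\{\, g(u) + \frac{1}{2\gamma}\,\|u - x\|^{2} \,\Big\}.
\]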

If Assumption 2.1 is in force, we introduce the Forward-Backward (FB) map for :

(3)

so that the FB algorithm can be simply written as .
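As a minimal illustration (not taken from the paper), the sketch below implements the standard forward-backward iteration x_{k+1} = prox_{γg}(x_k − γ∇f(x_k)) for a composite problem of the form min_x f(x) + g(x), with a least squares smooth term and the ℓ1 norm as nonsmooth term, whose proximal operator is componentwise soft thresholding. Function names, the step-size choice, and the problem data are illustrative.

```python
import numpy as np

def soft_threshold(x, tau):
    # Proximal operator of tau * ||.||_1 (componentwise soft thresholding).
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def forward_backward(A, y, lam, n_iter=500):
    """Forward-backward iterations for 0.5*||Ax - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient of the smooth part
    gamma = 1.0 / L                      # a safe step size in (0, 2/L)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)         # forward (gradient) step on the smooth part
        x = soft_threshold(x - gamma * grad, gamma * lam)  # backward (proximal) step
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 50))
    x_true = np.zeros(50)
    x_true[:3] = [1.0, -2.0, 0.5]
    y = A @ x_true
    x_hat = forward_backward(A, y, lam=0.1)
    obj = 0.5 * np.linalg.norm(A @ x_hat - y) ** 2 + 0.1 * np.abs(x_hat).sum()
    print("final objective value:", obj)
```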

2.2 The Forward-Backward algorithm: worst-case analysis

The following theorem collects known results about the convergence of the FB algorithm. This is a “worst-case” analysis, in the sense that it holds for every satisfying Assumption 2.1. The main goal of Section 4 is to show how these results can be improved by taking into account the geometry of at its infimum.

Theorem 2.2 (Forward-Backward - convex case).

Suppose that Assumption 2.1 is in force, and let be generated by the FB algorithm. Then:

  1. (Descent property) The sequence is decreasing, and converges to .

  2. (Fejér property) For all , the sequence is decreasing.

  3. (Boundedness) The sequence is bounded if and only if is nonempty.

Suppose in addition that is bounded from below. Then

  4. (Subgradient convergence) The sequence converges decreasingly to zero, with

Moreover, if , we have:

  5. (Weak convergence) The sequence converges weakly to a minimizer of .

  6. (Global rates for function values) For all ,

  7. (Asymptotic rates for function values) When ,

Theorem 2.2 collects various convergence results on the FB algorithm. Item 1 appears in [77, Theorem 3.12] (see also [44]). Item 2 is a consequence of the nonexpansiveness of the FB map (see (3)) [56, Lemma 3.2]. Item 3, which is a consequence of Opial’s Lemma [72, Lem. 5.2], can be found in [77, Theorem 3.12]. Item 4 follows from Lemma A.5.ii in the Annex. Item 5 is also a consequence of Opial’s Lemma, see [56, Proposition 3.1]. Items 6 and 7 are proved in [29, Theorem 3] (see also [16, Proposition 2] and [12, Theorem 3.1]).

Remark 2.3 (Optimal results in the worst-case).

The convergence results in Theorem 2.2 are optimal, in the sense that we explain below. First, the iterates may not converge strongly: see [8, 44] for a counterexample in . Even in finite dimension, no sublinear rates should be expected for the iterates. To see this, apply the proximal algorithm to the function , whose unique minimizer is zero. When , then (see e.g. the discussion following [69, Proposition 2.5]):

(4)

The estimate (4) also provides a lower bound on the rates for the objective values:

(5)

Note that the lower bound on the objective values approaches when . This fact was also observed in [29, Theorem 12] via an infinite dimensional counterexample. When is bounded from below but has no minimizers, the values go to zero but no rates can be obtained in general. To see this, consider for any the function defined by

(6)

This function is a differentiable convex function with -Lipschitz gradient. Let be the sequence obtained by applying the gradient algorithm to this function. Then

and the lower bound on the objective values approaches for .

2.3 Beyond the worst case: previous results

To derive better convergence rates for the FB algorithm, additional assumptions are needed. In this paper, we consider several geometric assumptions, namely the -conditioning, the -metric subregularity, and the -Łojasiewicz property; see Section 3.
The first result exploiting geometry to derive fast convergence rates (if we discard the “classic” strong convexity assumption) dates back to Polyak [73, Theorem 4], showing that the gradient method converges linearly (in terms of the values and iterates) when the objective function verifies the -Łojasiewicz inequality. Improved convergence rates for first-order descent methods were then obtained in [76], considering notions slightly stronger than -metric subregularity, and proving finite convergence of the proximal algorithm for , and linear convergence for . These results are improved and extended in [67], which analyzes for the first time convergence rates for the iterates of the proximal algorithm using metric subregularity for general . The results in [67] recover those in [76] (see also [79, 80]), but also derive superlinear rates for , and sublinear rates for . Roughly speaking, the results in [67] show that the bigger the exponent, the slower the algorithm. In the early 90s, some attention was devoted to the study of -conditioned functions, in particular for (some authors call this property superlinear conditioning, sharp growth, or the sharp minima property). In this context, [41, 55, 23] showed that the proximal algorithm terminates after a finite number of iterations. For , Polyak [74, Theorem 7.2.1] obtained finite termination for the projected gradient method. The -conditioning was also used to obtain linear rates for the proximal algorithm in [60].
In [3] it was observed that the -Łojasiewicz property could be used to derive precise rates for the iterates of the proximal algorithm. The authors obtain finite convergence when , linear rates when , and sublinear rates when . Similar results can be found in [4, 69]. Such convergence rates for the iterates have been extended to the forward-backward algorithm (and its alternating versions) in [21], and similar rates also hold for the convergence of the values in [27, 42]. More recently, various papers have focused on conditions equivalent to (or stronger than) the -conditioning in order to derive linear rates [58, 63, 39, 65, 38, 52]. Some effort has also been made to show that the Łojasiewicz property and conditioning are equivalent [19, 20], and to relate them to other error bounds appearing in the literature [52]. See also [70] for a fine analysis of linear rates for the projected gradient algorithm under conditions lying between strong convexity and -conditioning (see also Subsection 4.3).

As is clear from the above discussion, the literature on convergence rates under geometric conditions is vast and somewhat scattered. The study that we develop in this paper provides a unified treatment, complemented with several novel results and connections.

3 Identifying the geometry of a function

3.1 Definitions

In this section we introduce the main geometrical concepts that will be used throughout the paper to derive precise rates for the FB method. Roughly speaking, these notions characterize functions which behave like a power of the distance to their minimizers.

Definition 3.1.

Let , let with , and . We say that:

  1. is -conditioned on if there exists a constant such that:

  2. is -metrically subregular on if there exists a constant such that:

  3. is -Łojasiewicz on if such that the Łojasiewicz inequality holds:

We will refer to these notions as global if , and as local if for some and .
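For orientation, typical formulations of these three properties in the error bound literature take roughly the following form, with exponent p ≥ 1 and constants γ, c > 0 (an illustrative sketch; the precise inequalities of Definition 3.1 may differ in their constants and normalization):

\[
\begin{aligned}
&\text{(conditioning)} && f(x) - \inf f \;\ge\; \tfrac{\gamma}{p}\,\operatorname{dist}(x, \operatorname{argmin} f)^{p}
&& \text{for all } x \in \Omega,\\
&\text{(metric subregularity)} && \operatorname{dist}(x, \operatorname{argmin} f)^{p-1} \;\le\; c\,\operatorname{dist}\big(0, \partial f(x)\big)
&& \text{for all } x \in \Omega,\\
&\text{(Łojasiewicz)} && \big(f(x) - \inf f\big)^{1-\frac{1}{p}} \;\le\; c\,\operatorname{dist}\big(0, \partial f(x)\big)
&& \text{for all } x \in \Omega.
\end{aligned}
\]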

The notion of conditioning, introduced in [81, 88], is a common tool in the optimization and regularization literature [6, 71, 57, 84, 20]. It is also called a growth condition [71], and it is strongly related to the notion of Tikhonov well-posedness [35]. The -metric subregularity is less used, generally defined for or [36, 58], and is also called upper Lipschitz continuity at zero of in [28], or inverse calmness [34]. The Łojasiewicz property goes back to [66], and was initially designed as a tool to guarantee the convergence of trajectories of the gradient flow of analytic functions, before its recent use in convex and nonconvex optimization. It is generally presented with a constant which is equal, in our notation, to [66, 1, 17, 20], or [69, 45, 42]. The main difference between Definition 3.1 and the literature is that we consider an arbitrary set , which will prove to be essential for the analysis of inverse problems (see Section 5 for more details).

The notions introduced in Definition 3.1 are closely related to one another. Indeed, for convex functions, -conditioning implies metric subregularity, which implies the Łojasiewicz property. Under some additional assumptions, it is possible to show that the reverse implications hold. For instance, metric subregularity implies conditioning when , see [85, Theorem 4.3]. Similar results can also be found in [2, 7, 39, 37], and [28, Theorem 5.2] (for ). Also, it is shown in [20, Theorem 5] that the local Łojasiewicz property implies local conditioning. The next result, proved in Annex A.1, extends the aforementioned results, and states the equivalence between conditioning, metric subregularity, and the Łojasiewicz property on -invariant sets (see Definition A.1 in Annex A.1).

Proposition 3.2.

Let , let , and let be such that . Consider the following properties:

  1. is -conditioned on ,

  2. is -metrically subregular on ,

  3. is -Łojasiewicz on .

Then i ⇒ ii ⇒ iii. One can respectively take and . Assuming in addition that is -invariant, we also have iii ⇒ i, with .

The next two propositions show that these geometric notions are stronger when is smaller, and are meaningful only on sets containing minimizers (their proofs follow directly from Definition 3.1 and are left to the reader).

Proposition 3.3.

Let be such that , , and .

  1. If is -conditioned (resp. is -metrically subregular) on , then is -conditioned (resp. is -metrically subregular) on for any .

  2. If is -Łojasiewicz on , then is -Łojasiewicz on for any .

Proposition 3.4.

Let be such that . If is a weakly compact set for which , then is -conditioned on for any .

3.2 Examples

In this section, we collect some relevant examples.

Example 3.5 (Uniformly convex functions).

Suppose that is uniformly convex of order . Then, there exists such that

This implies that is globally -conditioned, with . Moreover, the global -Łojasiewicz inequality is verified with . In the strongly convex case, when , the -Łojasiewicz inequality holds with the constant , which is sharp. Examples of uniformly convex functions of order are [11, Example 10.14].
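In particular, for a differentiable μ-strongly convex function with minimizer x*, the standard quadratic growth and Polyak–Łojasiewicz inequalities hold (a well-known fact, stated here with generic notation):

\[
f(x) - \min f \;\ge\; \frac{\mu}{2}\,\|x - x^\star\|^{2}
\qquad\text{and}\qquad
\|\nabla f(x)\|^{2} \;\ge\; 2\mu\,\big(f(x) - \min f\big)
\qquad \text{for all } x \in \mathcal{H},
\]

which correspond to 2-conditioning and a Łojasiewicz-type inequality, respectively.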

Example 3.6 (Least squares).

Let be a nonzero bounded linear operator between Hilbert spaces, and , for some . Then the conditioning, metric subregularity, and Łojasiewicz properties, with and , amount to verifying on , respectively:

If holds, one can see that the above inequalities hold with

meaning in particular that is globally -conditioned. Since is equivalent to being closed, it is in particular always satisfied when has finite dimension. If instead holds, [45] shows that cannot satisfy any local -Łojasiewicz property, for any . This is for instance the case for infinite dimensional compact operators. Nevertheless, we will show in Section 5 that the least squares functional always satisfies a -Łojasiewicz property on the so-called regularity sets, for any .

Example 3.7 (Convex piecewise polynomials).

A convex continuous function is said to be convex piecewise polynomial if can be partitioned into a finite number of polyhedra such that, for all , the restriction of to is a convex polynomial of degree . The degree of is defined as . Assume . Convex piecewise polynomial functions are conditioned [61, Corollary 3.6]. More precisely, for all , is -conditioned on its sublevel set , with . In general, the constant (which depends on ) cannot be computed explicitly. This result implies that polyhedral functions () are -conditioned (in agreement with [23, Corollary 3.6]), and that convex piecewise quadratic functions () are -conditioned. More generally, convex semi-algebraic functions are locally -conditioned [18].

Example 3.8 (L1 regularized least squares).

Let , for some linear operator , and . As observed in [20, Section 3.2.1], is a convex piecewise polynomial of degree , and thus it is -conditioned on each sublevel set . The computation of the conditioning constant is rather difficult. In [20, Lemma 10] an estimate of is provided by means of Hoffman’s bound [49].

Example 3.9 (Regularized problems).

Let be a Euclidean space, , where is a linear operator, , and is a strongly convex function. Then is -conditioned on any sublevel set , , if

  1. with , (see [87, Corollary 2]),

  2. with , (see [38, Theorem 4.2]),

  3. is the nuclear norm of the matrix , provided the following qualification condition holds (see [86]): such that . (We mention that this result was originally announced in [51, Theorem 3.1] without the qualification condition, but was then corrected in [86, Proposition 12], in which the authors show that this condition is necessary.)

  4. is polyhedral (see [86, Proposition 6]).

Note that in [86, 87], the authors do not prove directly that the functions are -conditioned, but rather that they verify the so-called Luo-Tseng error bound, which is known to be equivalent to -conditioning on sublevel sets [38]. Note also that in items 2-4, the strong convexity and the assumptions on can be weakened (see [86] and [38, Theorem 4.2]).

Example 3.10 (Distance to an intersection).

Let be two closed convex sets in such that and the intersection is sufficiently regular, i.e. . Let . Clearly, , and . Then is -conditioned on bounded sets [10, Theorem 4.3]. Let . From , it follows that the function is -conditioned on bounded sets.

3.3 A sum rule for -conditioned functions

Let , and assume that they are respectively and -conditioned. What can be said about their sum ? We present in Theorem 3.11 a partial answer to this question, under the assumption that they remain conditioned under linear perturbations (see Remark 3.13). This extends [38, Theorem 4.2], which deals with the sum of -conditioned functions.

Theorem 3.11 (Sum rule for conditioning).

Let , where , is a bounded linear operator with closed range, and is of class on . Let . Assume that there exists such that, for and ,

is -conditioned on and is -conditioned on . (7)

Suppose that the following qualification conditions are satisfied:

(8)
(9)

Let . Then, for any , is -conditioned on .

Proof.

Let be defined by . Since , [11, Theorem 15.23] yields that strong duality between and holds, meaning that . Differentiability of on implies strict convexity of [11, Proposition 18.10], therefore is nonempty [11, Corollary 15.23]. Moreover, since , [11, Proposition 19.3] applied to the dual problem, yields with for any . Using now [11, Corollary 19.2], we conclude that

(10)

So, it remains to prove that, for all , there exists such that:

(11)

Fix , , set and . It follows from Proposition 3.3 that and are -conditioned on and , respectively. Moreover, , and . According to (10), and , therefore, for all ,

Summing these last two inequalities gives, for all :

with . Since on , we deduce that

It remains to lower bound the right hand side by the distance to . By Example 3.10, thanks to the qualification condition (8) and the fact that is bounded, we derive from (10) that there exists such that

(12)

Fix , and define , which is well defined since we assumed to be closed. Let be defined by . Since , necessarily , so we deduce from Example 3.6 that

(13)

On the one hand, we have . On the other hand, the definition of implies . Thus, it follows from (13) that

Since this is true for any , we can combine it with (12) to get for all

(14)

with . To end the proof, note that the qualification condition (9) implies that , so we can use again Example 3.10 to get some such that for all ,

The above inequality, combined with (14) and (11), concludes the proof. ∎

Remark 3.12 (On the qualification conditions).

It is worth noting that the conclusion of Theorem 3.11 may not hold if the qualification conditions (8) and (9) are removed, as proved in [86, Section 4.4.4]. Nevertheless, these conditions are automatically satisfied whenever and are uniformly convex functions. Also, if both and have finite dimension and is strictly convex, then (8) and (9) become equivalent to (see [11]).

Remark 3.13 (On tilt-conditioned functions).

Following the terminology in [75, 39, 38], we say that a function is -tilt-conditioned on if, for all , the function is -conditioned on whenever its set of minimizers is not empty. Clearly, tilt-conditioning is a much stronger assumption than conditioning, but such functions have the advantage of verifying (7) without any knowledge of . Tilt-conditioning is strongly related to the metric regularity of (see e.g. [7]). Moreover, many relevant conditioned functions are tilt-conditioned. For instance, the -norm, and more generally polyhedral functions, are -tilt-conditioned on Euclidean spaces, -uniformly convex functions are -tilt-conditioned on , and convex piecewise polynomials of degree 2 are -tilt-conditioned on their sublevel sets. See [86, Proposition 11] for the proof that the nuclear norm is -tilt-conditioned on bounded sets, and [38, Section 4] for more examples of -tilt-conditioned functions.

4 Sharp convergence rates for the Forward-Backward algorithm

In this section, we present sharp convergence results for the forward-backward algorithm applied to -Łojasiewicz functions on a subset , building on the ideas in [5]. We extend the analysis to the case where is an arbitrary set, which will allow us to deal with infinite dimensional inverse problems (see Section 5.1), or structured problems for which all the information is encoded in a manifold (see Section 5.2).

4.1 Refined analysis with -Łojasiewicz functions

Theorem 4.1 (Strong convergence, ).

Suppose that Assumption 2.1 is in force, and that is bounded from below. Let be generated by the FB algorithm. Assume that:

  1. (Localization) for all , ,

  2. (Geometry) is -Łojasiewicz on , for some .

Then the sequence has finite length in , meaning that , and converges strongly to some .

Proof.

We first show that has finite length. Since , , and it follows from Lemma A.5 that

(15)
(16)

If there exists such that , then the algorithm stops after a finite number of iterations (see (15)); therefore it is not restrictive to assume that for all . We set and , so that the Łojasiewicz inequality at can be rewritten as

(17)

Combining (15), (16), and (17), and using the concavity of , we obtain for all :

By taking the square root on both sides, and using Young’s inequality, we obtain

(18)

Summing this inequality and reordering the terms, we finally obtain

We deduce that has finite length and converges strongly to some . Moreover, from (16) and the strong closedness of , we conclude that . ∎

We now provide explicit rates of convergence for both the iterates and the values.

Theorem 4.2 (Rates of convergence, ).

Suppose Assumption 2.1 is in force. Suppose that , and let be generated by the FB algorithm. Assume that:

  1. (Localization) for all , ,

  2. (Geometry) is -Łojasiewicz on , for some .

Then converges strongly to some . Moreover, there exist constants , explicitly computable (see equations (21) and (23)), such that the following convergence rates hold, depending on the values of and :

  1. If , then for every .

  2. If , the convergence is superlinear: for all ,

  3. If , the convergence is linear: for all ,

  4. If , the convergence is sublinear: for all ,
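As a toy illustration of how the growth exponent drives the rate (this is not an experiment from the paper, and the function, step size, and exponents are illustrative), the following sketch runs plain gradient descent, i.e. the forward-backward iteration with a trivial nonsmooth term, on the one-dimensional function f(x) = |x|^p. For p = 2 (quadratic growth) the values decay geometrically, whereas for p = 4 (a flatter minimum) the decay is only polynomial, in line with the linear versus sublinear regimes above.

```python
import numpy as np

def gradient_descent_values(p, x0=1.0, step=0.1, n_iter=2000):
    """Run gradient descent on f(x) = |x|**p and return the sequence of objective values."""
    x = x0
    values = []
    for _ in range(n_iter):
        values.append(abs(x) ** p)
        grad = p * np.sign(x) * abs(x) ** (p - 1)   # f'(x) = p |x|^{p-1} sign(x)
        x = x - step * grad
    return np.array(values)

if __name__ == "__main__":
    for p in (2, 4):
        vals = gradient_descent_values(p)
        print(f"p = {p}: f(x_10) = {vals[10]:.3e}, "
              f"f(x_100) = {vals[100]:.3e}, f(x_1000) = {vals[1000]:.3e}")
```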