 # iPiano: Inertial Proximal Algorithm for Non-Convex Optimization

In this paper we study an algorithm for solving a minimization problem composed of a differentiable (possibly non-convex) and a convex (possibly non-differentiable) function. The algorithm iPiano combines forward-backward splitting with an inertial force. It can be seen as a non-smooth split version of the Heavy-ball method from Polyak. A rigorous analysis of the algorithm for the proposed class of problems yields global convergence of the function values and the arguments. This makes the algorithm robust for usage on non-convex problems. The convergence result is obtained based on the inequality. This is a very weak restriction, which was used to prove convergence for several other gradient methods. First, an abstract convergence theorem for a generic algorithm is proved, and, then iPiano is shown to satisfy the requirements of this theorem. Furthermore, a convergence rate is established for the general problem class. We demonstrate iPiano on computer vision problems: image denoising with learned priors and diffusion based image compression.

## Code Repositories

### ipiano

Implementation of the iPiano algorithm for non-convex and non-smooth optimization as described in .

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The gradient method is certainly one of the most fundamental but also one of the most simple algorithms to solve smooth convex optimization problems. In the last decades, the gradient method has been modified in many ways. One of those improvements is to consider so-called multi-step schemes [38, 35]

. It has been shown that such schemes significantly boost the performance of the plain gradient method. Triggered by practical problems in signal processing, image processing and machine learning, there has been an increased interest in so-called composite objective functions, where the objective function is given by the sum of a smooth function and a non-smooth function with an easy to compute proximal map. This initiated the development of the so-called proximal gradient or forward-backward method

, that combines explicit (forward) gradient steps w.r.t. the smooth part with proximal (backward) steps w.r.t. the non-smooth part.

In this paper, we combine the concepts of multi-step schemes and the proximal gradient method to efficiently solve a certain class of non-convex, non-smooth optimization problems. Although, the transfer of knowledge from convex optimization to non-convex problems is very challenging, it aspires to find efficient algorithms for certain non-convex problems. Therefore, we consider the subclass of non-convex problems

 minx∈RN f(x)+g(x),

where is a convex (possibly non-smooth) and is a smooth (possibly non-convex) function. The sum comprises non-smooth, non-convex functions. Despite the non-convexity, the structure of being smooth and being convex makes the forward-backward splitting algorithm well-defined. Additionally, an inertial force is incorporated into the design of our algorithm, which we termed iPiano. Informally, the update scheme of the algorithm that will be analyzed is

 xn+1=(I+α∂g)−1(xn−α∇f(xn)+β(xn−xn−1)),

where and are the step size parameters. The term is referred as forward step, as inertial term, and as backward or proximal step.

For the proximal step is the identity and the update scheme is usually referred as Heavy-ball method. This reduced iterative scheme is an explicit finite differences discretization of the so-called Heavy-ball with friction dynamical system

 ¨x(t)+γ˙x(t)+∇f(x(t))=0.

It arises when Newton’s law is applied to a point subject to a constant friction (of the velocity ) and a gravity potential . This explains the naming “Heavy-ball method” and the interpretation of as inertial force.

Setting results in the forward-backward splitting algorithm, which has the nice property that in each iteration the function value decreases. Our convergence analysis reveals that the additional inertial term prevents our algorithm from monotonically decreasing the function values. Although this may look like a limitation on first glance, demanding monotonically decreasing function values anyway is too strict as it does not allow for provably optimal schemes. We refer to a statement of Nesterov 

: “In convex optimization the optimal methods never rely on relaxation. Firstly, for some problem classes this property is too expensive. Secondly, the schemes and efficiency estimates of optimal methods are derived from some global topological properties of convex functions”

111Relaxation is to be interpreted as the property of monotonically decreasing function values in this context. Topological properties should be associated with geometrical properties.. The negative side of better efficiency estimates of an algorithm is usually the convergence analysis. This is even true for convex functions. In case of non-convex and non-smooth functions, this problem becomes even more severe.

##### Contributions

Despite this problem, we can establish convergence of the sequence of function values for the general case, where the objective function is only required to be a composition of a convex and a differentiable function. Regarding the sequence of arguments generated by the algorithm, existence of a converging subsequence is shown. Furthermore, we show that each limit point is a critical point of the objective function.

To establish convergence of the whole sequence in the non-convex case is very hard. However, with slightly more assumptions to the objective, namely that it satisfies the Kurdyka-Łojasiewicz inequality [30, 31, 26], several algorithms have been shown to converge [14, 5, 3, 4]. In  an abstract convergence theorem for descent methods with certain properties is proved. It applies to many algorithms. However, it can not be used for our algorithm. Based on their analysis, we prove an abstract convergence theorem for a different class of descent methods, which applies to iPiano. By verifying the requirements of this abstract convergence theorem, we manage to also show such a strong convergence result. From a practical point of view of image processing, computer vision, or machine learning, the Kurdyka-Łojasiewicz inequality is almost always satisfied. For more details about properties of Kurdyka-Łojasiewicz functions and a taxonomy of functions that have this property, we refer to [5, 10, 26].

The last part of the paper is devoted to experiments. We exemplarily present results on computer vision tasks, such as denoising and image compression, and show that entering the staggering world of non-convex functions pays off in practice.

## 2 Related Work

##### Forward-backward splitting

In convex optimization, splitting algorithms usually originate from the proximal point algorithm . It is a very general algorithm, and results on its convergence affect many other algorithms. Practically, however, computing one iteration of the algorithm can be as hard as the original problem. Among the strategies to tackle this problem are splitting approaches like Douglas-Rachford [28, 18], several primal-dual algorithms [12, 37, 23], and forward-backward splitting [28, 16, 7, 35]; see  for a survey.

Especially the forward-backward splitting schemes seem to be appealing to generalize to non-convex problems. This is due to their simplicity and the existence of simpler formulations in some special cases like, for example, the gradient projection method, where the backward-step is the projection onto a set [27, 22]. In  the classical forward-backward algorithm, where the backward step is the solution of a proximal term involving a convex function, is studied for a non-convex problem. In fact, the same class of objective functions as in the present paper is analyzed. The algorithm presented here comprises the algorithm from  as a special case. Also Nesterov  briefly accounts this algorithm in a general setting. Even the reverse setting is generalized in the non-convex setting [5, 11], namely where the backward-step is performed on a non-smooth non-convex function.

As the amount of data to be processed is growing and algorithms are supposed to exploit all the data in each iteration, inexact methods become interesting, though we do not consider erroneous estimates in this paper. Forward-backward splitting schemes also seem to work for non-convex problems with erroneous estimates [44, 43]. A mathematical analysis of inexact methods can be found, e.g., in [14, 5], but with the restriction that the method is explicitly required to decrease the function values in each iteration. The restriction comes with significantly improved results with regard of the convergence of the algorithm. The algorithm proposed in this paper provides strong convergence results, although it does not require the function values to decrease.

##### Optimization with inertial forces

In his seminal work , Polyak investigates multi-step schemes to accelerate the gradient method. It turns out that a particularly interesting case is given by a two-step algorithm, which has been coined the Heavy-ball method. The name of the method is because it can be interpreted as an explicit finite differences discretization of the so-called Heavy-ball with friction dynamical system. It differs from the usual gradient method by adding an inertial term that is computed by the difference of the two preceding iterations. Polyak showed that this method can speed up convergence in comparison to the standard gradient method, while the cost of each iteration stays basically unchanged.

The popular accelerated gradient method of Nesterov  obviously shares some similarities with the Heavy-ball method, but it differs from it in one regard: while the Heavy-ball method uses gradients based on the current iterate, Nesterov’s accelerated gradient method evaluates the gradient at points that are extrapolated by the inertial force. On strongly convex functions, both methods are equally fast (up to constants), but Nesterov’s accelerated gradient method converges much faster on weakly convex functions .

The Heavy-ball method requires knowledge about the function parameters (Lipschitz constant of the gradient and the modulus of strong convexity) to achieve the optimal convergence rate, which can be seen as a disadvantage. Interestingly, the conjugate gradient method for minimizing strictly convex quadratic problems can be expressed as Heavy-ball method. Hence, it can be seen as a special case of the Heavy-ball method for quadratic problems. In this special case, no additional knowledge is required about the function parameters, as the algorithm parameters are computed online.

The Heavy-ball method was originally proposed for minimizing differentiable convex functions, but it has been generalized in different ways. In , it has been generalized to the case of smooth non-convex functions. It is shown that, by considering an appropriate Lyapunov objective function, the iterations are attracted by the connected components of stationary points. In Section 4 it will become evident that the non-convex Heavy-ball method is a special case of our algorithm, and also the convergence analysis of  shows some similarities to ours.

In [2, 1], the Heavy-ball method has been extended to maximal monotone operators, e.g., the subdifferential of a convex function. In a subsequent work , it has been applied to a forward-backward splitting algorithm, again in the general framework of maximal monotone operators.

## 3 An abstract convergence result

### 3.1 Preliminaries

We consider the Euclidean vector space

of dimension and denote the standard inner product by and the induced norm by . Let be a proper lower semi-continuous function. [effective domain, proper] The (effective) domain of is defined by . The function is called proper, if is nonempty.

In order to give a sound description of the first order optimality condition for a non-convex non-smooth optimization problem, we have to introduce the generalization of the subdifferential for convex functions.

[Limiting-subdifferential] The limiting-subdifferential (or simply subdifferential) is defined by (see [40, Def. 8.3])

 ∂F(x)={ξ∈RN|∃yk→x,F(yk)→F(x),ξk→ξ,ξk∈ˆ∂F(yk)}, (1)

which makes use of the Fréchet subdifferential defined by

 ˆ∂F(x)={ξ∈RN|liminfy→xy≠x1∥x−y∥2(F(y)−F(x)−⟨y−x,ξ⟩)≥0},

when and by else.

The domain of the subdifferential is .

In what follows, we will consider the problem of finding a critical point of , which is characterized by the necessary first-order optimality condition .

We state the definition of the Kurdyka-Łojasiewicz property from . [Kurdyka-Łojasiewicz property]

1. The function has the Kurdyka-Łojasiewicz property at , if there exist , a neighborhood of and a continuous concave function such that , , for all it is , and for all the Kurdyka-Łojasiewicz inequality holds, i.e.,

 φ′(F(x)−F(x∗))dist(0,∂F(x))≥1.
2. If the function satisfies the Kurdyka-Łojasiewicz inequality at each point of , it is called KL function.

Roughly speaking, this condition says that we can bound the subgradient of a function from below by a reparametrization of its function values. In the smooth case, we can also say that up to a reparametrization the function is sharp, meaning that any non-zero gradient can be bounded away from . This is sometimes called a desingularization. It has been shown in  that a proper lower semi-continuous extended valued function always satisfies this inequality at each non-stationary point. For more details and other interpretations of this property, also for different formulations, we refer to .

A big class of functions that have the KL-property is given by real semi-algebraic functions . Real semi-algebraic functions are defined as functions whose graph is a real semi-algebraic set. [real semi-algebraic set] A subset of is semi-algebraic, if there exists a finite number of real polynomials such that

 S=p⋃j=1q⋂i=1{x∈RN:Pi,j(x)=0 % and Qi,j<0}.

### 3.2 Inexact descent convergence result for KL functions

In the following, we prove an abstract convergence result for a sequence in , , , satisfying certain basic conditions, . For convenience we use the abbreviation for . We fix two positive constants and and consider a proper lower semi-continuous function . Then, the conditions we require for are

1. For each , it holds

 F(zn+1)+aΔ2n≤F(zn).
2. For each , there exists such that

 ∥wn+1∥2≤b2(Δn+Δn+1).
3. There exists a subsequence such that

 znj→~zandF(znj)→F(~z),as j→∞.

Based on these conditions, we derive the same convergence result as in . The statements and proofs of the subsequent results follow the same ideas as . We modified the involved calculations according to our conditions H1, H2, and H3.

###### Remark 1

These conditions are very similar to the ones in , however, they are not identical. The difference comes from the fact that  does not consider a two-step algorithm.

• In  the corresponding condition to H1 (sufficient decrease condition) is .

• The corresponding condition to H2 (relative error condition) is . In some sense, our condition H2 accepts a larger relative error.

• H3 (continuity condition) in  is the same here, but for .

###### Remark 2

Our proof and the proof in  mainly differ in the calculations that are involved, the outline is the same. There is hope to find an even more general convergence result, which comprises ours and .

Let be a proper lower semi-continuous function which satisfies the Kurdyka-Łojasiewicz property at some point . Denote by , and the objects appearing in Definition 3.1 of the KL property at . Let be such that with , where .

Furthermore, let be a sequence satisfying Conditions H1, H2, and

 ∀n∈N:zn∈B(z∗,ρ)⇒zn+1∈B(z∗,σ) with F(zn+1),F(zn+2)≥F(z∗). (2)

Moreover, the initial point is such that and

 ∥x∗−x0∥2+√F(z0)−F(z∗)a+baφ(F(z0)−F(z∗))<ρ2. (3)

Then, the sequence satisfies

 ∀n∈N:zn∈B(z∗,ρ),∞∑n=0Δn<∞,F(zn)→F(z∗), as n→∞, (4)

converges to a point such that . If, additionally, Condition H3 is satisfied, then and . The key points of the proof are the facts that for all :

 zj∈B(z∗,ρ)and (5) j∑i=1Δi≤12(Δ0−Δj)+ba[φ(F(z1)−F(z∗))−φ(F(zj+1)−F(z∗)))] (6)

Let us first see that is well-defined. By Condition H1, is non-increasing, which shows . Combining this with (2) implies .

As for the set is nonempty (see Condition H2) every belongs to . For notational convenience, we define

 Dφn:=φ(F(zn)−F(z∗))−φ(F(zn+1)−F(z∗)).

Now, we want to show that for holds: if and , then

Obviously, we can assume that (otherwise it is trivial), and therefore H1 and (2) imply . The KL inequality shows and H2 shows . Since , using KL inequality and H2, we obtain

 φ′(F(zn)−F(z∗))≥1∥wn∥2≥2b(Δn−1+Δn).

As is concave and increasing (), Condition H1 and (2) yield

 Dφn≥φ′(F(zn)−F(z∗))(F(zn)−F(zn+1))≥φ′(F(zn)−F(z∗))aΔ2n.

Combining both inequalities results in

which by applying establishes (7).

As (2) does only imply , , we can not use (7) directly for the whole sequence. However, (5) and (6) can be shown by induction on . For , (2) yields and . From Condition H1 with , and , we infer

 Δ1≤√F(z1)−F(z2)a≤√F(z0)−F(z∗)a, (8)

which combined with (3) leads to

 ∥x∗−x1∥2≤∥x0−x∗∥2+Δ1≤∥x0−x∗∥2+√F(z0)−F(z∗)a<ρ2,

and therefore . Direct use of (7) with shows that (6) holds with .

Suppose (5) and (6) are satisfied for . Then, using the triangle inequality and (6), we have

 ∥z∗−zj+1∥2≤∥x∗−xj+1∥2+∥x∗−xj∥2≤2∥x∗−x0∥2+2∑ji=1Δi+Δj+1≤2∥x∗−x0∥2+(Δ0−Δj)+Δj+1 2ba[φ(F(z1)−F(z∗))−φ(F(zj+1)−F(z∗)))]≤2∥x∗−x0∥2+Δ0+Δj+1+2ba[φ(F(z0)−F(z∗))],

which shows, using and (3), that . As a consequence (7), with , can be added to (6) and we can conclude (6) with . This shows the desired induction on .

Now, the finiteness of the length of the sequence , i.e., , is a consequence of the following estimation, which is implied by (6),

 j∑i=1Δi≤12Δ0+baφ(F(z1)−F(z∗))<∞.

Therefore, converges to some as , and converges to . As is concave, is decreasing. Using this and Condition H2 yields and . Suppose we have , then KL-inequality reads for all , which contradicts .

Note that, in general, is not a critical point of , because the limiting subdifferential requires as . When the sequence additionally satisfies Condition H3, then , and is a critical point of , because .

###### Remark 3

The only difference to  with respect to the assumptions is (2). In , implies , whereas we require and . However, as Theorem 3 shows, this does not weaken the convergence result compared to . In fact, Corollary 3.2, which assumes for all and which is also used in , is key in Theorem 3.

The next corollary and the subsequent theorem follow as in  by replacing the calculation with our conditions.

Lemma 3.2 holds true, if we replace (2) by

 η

By Condition H1, for , we have

 Δ2n+1≤F(zn+1)−F(zn+2)a≤ηa<(σ−ρ)2.

Using the triangle inequality on shows that , which implies (2) and concludes the proof.

The work that is done in Lemma 3.2 and Corollary 3.2 allows us to formulate an abstract convergence theorem for sequences satisfying the Conditions H1, H2, and H3. It follows, with a few modifications, as in .

[Convergence to a critical point] Let be a proper lower semi-continuous function and a sequence that satisfies H1, H2, and H3. Moreover, let have the Kurdyka-Łojasiewicz property at the cluster point specified in H3.

Then, the sequence has finite length, i.e., , and converges to as , where is a critical point of . By Condition H3, we have and for a subsequence . This, together with the non-decreasingness of (by Condition H1), imply that and for all . The KL-property around states the existence of quantities , , and as in Definition 3.1. Let be such that and . Shrink such that (if necessary). As is continuous, there exists such that for all and

 ∥x∗−xn0∥2+√F(zn0)−F(z∗)a+baφ(F(zn0)−F(z∗))<ρ2.

Then, the sequence defined by satsifies the conditions in Corollary 3.2, which concludes the proof.

## 4 The proposed algorithm - iPiano

### 4.1 The optimization problem

We consider a structured non-smooth non-convex optimization problem with a proper lower semi-continuous extended valued function , :

 minx∈RNh(x)=minx∈RNf(x)+g(x), (9)

which is composed of a -smooth (possibly non-convex) function with -Lipschitz continuous gradient on , , and a convex (possibly non-smooth) function . Furthermore, we require to be coercive, i.e., implies , and bounded from below by some value .

The proposed algorithm, which is stated in Subsection 4.3, seeks for a critical point of , which is characterized by the necessary first-order optimality condition . In our case, this is equivalent to

 −∇f(x∗)∈∂g(x∗).

This equivalence is explicitly verified in the next subsection, where we collect some details and state some basic properties, which are used in the convergence analysis in Subsection 4.5.

### 4.2 Preliminaries

Consider the function first. It is required to be -smooth with -Lipschitz continuous gradient on , i.e., there exists a constant such that

 ∥∇f(x)−∇f(y)∥2≤L∥x−y∥2,∀x,y∈domg. (10)

This directly implies that is a non-empty convex set, as . This property of plays a crucial role in our convergence analysis due to the following lemma (stated as in ). [descent lemma] Let be a -function with -Lipschitz continuous gradient on . Then for any it holds that

 f(x)≤f(y)+⟨∇f(y),x−y⟩+L2∥x−y∥22. (11)

See for example .

We assume that the function is a proper lower semi-continuous convex function with an efficient to compute proximal map. [proximal map] Let be a proper lower semi-continuous convex function. Then, we define the proximal map

 (I+α∂g)−1(^x):=argminx∈RN∥x−^x∥222+αg(x),

where is a given parameter, is the identity map, and .

An important (basic) property that the convex function contributes to the convergence analysis is the following: Let be a proper lower semi-continuous convex function, then it holds for any , that

 g(y)≥g(x)+⟨s,y−x⟩. (12)

This result follows directly from the convexity of .

Finally, consider the optimality condition more in detail. The following proposition proves the equivalence to . The proof is mainly based on Definition 3.1 of the limiting-subdifferential. Let , , and be like before, i.e., let with continuously differentiable and convex. Sometimes, is then called a -perturbation of a convex function. Then, for holds

 ∂h(x)=∇f(x)+∂g(x).

We first prove “”. Let , i.e., there is a sequence such that , , and , where . We want to show that . As and , we have

 ykk→∞⟶x g(yk)=h(yk)−f(yk)k→∞⟶h(x)−f(x)=g(x) ξgk:=ξhk−∇f(yk)k→∞⟶ξh−∇f(x)=:ξg.

It remains to show that . First, remember that is superadditive, i.e., for two sequences , in it is . However, convergence of implies . This fact and again thanks to , we conclude

where and are over . Therefore, .
The other inclusion “” is trivial.

As a consequence, a critical point can also be characterized by the following definition. [proximal residual] Let and be as afore. Then, we define the proximal residual

 r(x):=x−(I+∂g)−1(x−∇f(x)).

It can be easily seen that is equivalent to and , which is the first-order optimality condition. The proximal residual is defined with respect to a fixed step size of . The rationale behind this becomes obvious when is the indicator function of a convex set. In this case, a small residual could be caused by small step sizes as the reprojection onto the convex set is independent of the step size.

### 4.3 The generic algorithm

In this paper, we propose an algorithm, iPiano, with the generic formulation in Algorithm 1. It is a forward-backward splitting algorithm incorporating an inertial force. In the forward step, determines the step size in the direction of the gradient of the differentiable function . The step in gradient direction is aggregated with the inertial force from the previous iteration weighted by . Then, the backward step is the solution of the proximity operator for the function with the weight .

In order to make the algorithm specific and convergent, the step size parameters must be chosen appropriately. What “appropriately” means, will be specified in Subsection 4.4 and proved in Subsection 4.5.

### 4.4 Rules for choosing the step size

In this subsection, we propose several strategies for choosing the step sizes. This will make it easier to implement the algorithm. One may choose among the following variants of step size rules depending on the knowledge about the objective function.

##### Constant step size scheme

The most simple one, which requires most knowledge about the objective function, is outlined in Algorithm 2. All step size parameters are chosen a priori and are constant.

###### Remark 4

Observe that our law on is equivalent to the law found in  for minimizing a smooth non-convex function. Hence, our result can be seen as an extension of their work to the presence of an additional non-smooth convex function.

##### Backtracking

The case where we have only limited knowledge about the objective function occurs more frequently. It can be very challenging to estimate the Lipschitz constant of beforehand. Using backtracking the Lipschitz constant can be estimated automatically. A sufficient condition that the Lipschitz constant at iteration to must satisfy is

 f(xn+1)≤f(xn)+⟨∇f(xn),xn+1−xn⟩+Ln2∥xn+1−xn∥22. (15)

Although, there are different strategies to determine , the most common one is by defining an increment variable and looking for minimal satisfying (15). Sometimes, it is also feasible to decrease the estimated Lipschitz constant after a few iterations. A possible strategy is as follows: if , then search for the minimal satisfying (15).

In Algorithm 3 we propose an algorithm with variable step sizes. Any strategy for estimating the Lipschitz constant may be used. When changing the Lipschitz constant from one iteration to another, all step size parameters must be adapted. The rules for adapting the step sizes will be justified during the convergence analysis in Subsection 4.5.

##### Lazy backtracking

Algorithm 4 presents another alternative of Algorithm 1. It is related to Algorithm 2 and 3 in the following way. Algorithm 4 makes use of the Lipschitz continuity of in the sense that the Lipschitz constant is always finite. As a consequence, using backtracking with only increasing Lipschitz constants, after a finite number of iterations the estimated Lipschitz constant will not change anymore, and starting from this iteration the constant step size rules as in Algorithm 2 are applied. Using this strategies, the results that will be proved in the convergence analysis are satisfied only as soon as the Lipschitz constant is high enough and does not change anymore.

##### General rule of choosing the step sizes

Algorithm 5 defines the general rules that the step size parameters must satisfy.

It contains the Algorithms 23, and 4 as special instances. This is easily verified for Algorithms 2 and 4. For Algorithm 3 the step size rules are derived from the proof of Lemma 4.5.

As Algorithm 5 is the most general one, now, let us analyze the behavior of this algorithm.

### 4.5 Convergence analysis

In all what follows, let be the sequence generated by Algorithm 5 and with parameters satisfying the algorithm’s requirements. Furthermore, for a more convenient notation we abbreviate , , and . Note, that for it is .

Let us first verify that the algorithm makes sense. We have to show that the requirements to the parameters are not contradictory, i.e., that it is possible to choose a feasible set of parameters. In the following Lemma, we will only show existence of such a parameter set, however, the proof helps us to formulate specific step size rules. For all , there are , , and . Furthermore, given , there exists a choice of parameter and such that additionally is monotonically decreasing.  By the algorithm’s requirements it is

 δn=1αn−Ln2−βn2αn≥1αn