# A convergence frame for inexact nonconvex and nonsmooth algorithms and its applications to several iterations

In this paper, we consider the convergence of an abstract inexact nonconvex and nonsmooth algorithm. We impose a pseudo sufficient descent condition and a pseudo relative error condition, both related to an auxiliary sequence, on the algorithm, and we assume a continuity condition to hold. In fact, a wide range of classical inexact nonconvex and nonsmooth algorithms satisfy these three conditions. Under a finite energy assumption on the auxiliary sequence, we prove that the sequence generated by the general algorithm converges to a critical point of the objective function, provided the function satisfies the Kurdyka-Łojasiewicz property. The core of the proofs lies in building a new Lyapunov function, whose successive difference provides a bound for the successive difference of the points generated by the algorithm. We then apply our findings to several classical nonconvex iterative algorithms and derive corresponding convergence results.


## 1 Introduction

Minimization of the nonconvex and nonsmooth function

 min_x F(x) (1.1)

is a core part of nonlinear programming and applied mathematics. Unlike traditional convergence results on global minimizers in the convex community, convergence for a nonconvex algorithm only guarantees that the iterates fall into a critical point. In most practical cases, the objective functions enjoy the Kurdyka-Łojasiewicz property (see definitions in Sec. 2). In this paper, we consider the convergence analysis under the Kurdyka-Łojasiewicz assumption on the objective function F.

In paper [5], for the sequence {x^k} generated by a very general scheme for problem (1.1), the authors consider three conditions: a sufficient descent condition, a relative error condition, and a continuity condition. Mathematically, these three conditions can be presented as: for some a, b > 0,

 F(x^{k+1}) + a‖x^{k+1} − x^k‖² ≤ F(x^k),  dist(0, ∂F(x^{k+1})) ≤ b‖x^{k+1} − x^k‖,  and there exists a subsequence {x^{k_j}} with x^{k_j} → x̄ and F(x^{k_j}) → F(x̄) as j → ∞, (1.2)

where ∂F means the limiting subdifferential of F (see definition in Sec. 2). Various algorithms actually satisfy these three conditions; the third condition is usually derived from the minimization carried out in each iteration. The proofs in [5] use a local analysis: the authors first prove that the sequence falls into a neighborhood of some point after enough iterations and then employ the Kurdyka-Łojasiewicz property around that point. In the later paper [9], the authors prove a uniformized Kurdyka-Łojasiewicz lemma for a compact set, which simplifies the proofs considerably.

### 1.1 A novel convergence framework

In this paper, we consider the convergence of inexact nonconvex and nonsmooth algorithms. We stress that the inexact algorithms discussed in our paper are different from those in paper [5]. In that paper, an assumption is posed on the noise: the noise should be bounded by the successive difference of the iterates. The "inexact algorithm" in [5] is thus much closer to a "proximal algorithm". For example, if F is differentiable (possibly nonconvex), the nonconvex gradient descent algorithm performs as

 xk+1=xk−h⋅∇F(xk). (1.3)

If the gradient of F is Lipschitz with constant L and 0 < h < 1/L, the sequence generated by (1.3) satisfies condition (1.2). Suppose instead that the iteration is corrupted by some noise e^k in each step, i.e.,

 xk+1=xk−h⋅∇F(xk)+ek. (1.4)

The sequence generated by (1.4) is likely to violate some of the conditions in (1.2) when e^k ≠ 0, so the existing analysis cannot be directly applied to algorithm (1.4). The authors in [5] proposed the following assumption on the noise:

 ∥ek∥≤ℓ⋅∥xk+1−xk∥, (1.5)

where ℓ > 0 is a constant. Under this assumption, they can continue using the sufficient descent condition and relative error condition. In this paper, we get rid of dependent assumptions like (1.5). Although in this case the inexact algorithms always fail to obey the first two of the core conditions (1.2), we find that many of them satisfy an alternative condition:

 F(x^{k+1}) + a‖ω^{k+1} − ω^k‖² ≤ F(x^k) + bη_k²,  dist(0, ∂F(x^{k+1})) ≤ c Σ_{j=k−τ}^{k} ‖ω^{j+1} − ω^j‖ + dη_k, (1.6)

where a, b, c, d > 0 and the integer τ ≥ 0 are constants, {η_k} is a nonnegative sequence, and {ω^k} is a sequence satisfying

 ∥xk−xk+1∥≤e∥ωk−ωk+1∥ (1.7)

for some constant e > 0. The continuity condition is kept here. Obviously, if η_k ≡ 0, ω^k = x^k and τ = 0, the condition reduces to (1.2). Thus, our work can also be regarded as a generalization of paper [5]. Our approach first proves convergence for a general inexact algorithm whose sequence satisfies condition (1.6), under a specific summability assumption on {η_k}. We then prove that several classical inexact algorithms satisfy condition (1.6).
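To make the setting concrete, the following is a minimal numerical sketch of the noisy iteration (1.4) with a summable noise bound η_k = 1/k^1.5; the test objective F(x) = x² + 3 sin(x), the stepsize, and the noise distribution are our own illustrative choices, not taken from the paper:

```python
import math
import random

def grad_F(x):
    # Gradient of the nonconvex test objective F(x) = x^2 + 3*sin(x).
    return 2 * x + 3 * math.cos(x)

random.seed(0)
x, h = 3.0, 0.1
for k in range(1, 2001):
    eta_k = 1.0 / k ** 1.5          # summable noise bound: sum_k eta_k < +inf
    e_k = random.uniform(-eta_k, eta_k)
    x = x - h * grad_F(x) + e_k     # inexact gradient step (1.4)

print(abs(grad_F(x)))  # small: the iterates approach a critical point
```

Despite the noise, the gradient norm at the final iterate is driven close to zero, consistent with the convergence theory developed below.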

The core of the proof lies in using an auxiliary function ξ whose successive difference bounds the successive difference of the sequence {ω^k}. If F is semi-algebraic, the new function is also Kurdyka-Łojasiewicz. We then build a sufficient descent inequality involving the new function and {ω^k}. We denote by z^k the pair (x^k, t_k), a composition of x^k and the tail noise energy, in (3.3). In the k-th iteration, the distance between the subdifferential of the new function and the origin is bounded by a combination of ‖ω^{j+1} − ω^j‖, η_k and t_{k+1}. We then prove the finite length of {x^k} provided {η_k} decays fast enough. In proving the finite length, the key step uses the Kurdyka-Łojasiewicz property of the new Lyapunov function. The proof techniques are motivated by the methodology proposed in [5].

### 1.2 Related work

Recently, convergence analysis in nonconvex optimization has paid increasing attention to using the Kurdyka-Łojasiewicz property in proofs. In paper [3], the authors proved the convergence of the proximal algorithm for minimizing Kurdyka-Łojasiewicz functions; the rates at which the iterates converge to a critical point were also derived there. An alternating proximal algorithm was considered in [4], and its convergence was proved under a Kurdyka-Łojasiewicz assumption on the objective function. Later, a proximal linearized alternating minimization algorithm was proposed and studied in [9]. A convergence framework covering various nonconvex algorithms was given in [5]. In [14], the authors modified the framework for analyzing splitting methods with variable metric, and proved general convergence rates. The nonconvex ADMM was studied under the Kurdyka-Łojasiewicz assumption by [20, 21], and a later paper [32] proposed a nonconvex primal-dual algorithm and proved its convergence. The Kurdyka-Łojasiewicz convergence analysis was applied to the reweighted algorithm by [35], and the extension to the reweighted nuclear norm version was developed in [34]. Recently, the DC algorithm has also employed the Kurdyka-Łojasiewicz property in its convergence analysis [2].

### 1.3 Contribution and organization

In this paper, we focus on inexact nonconvex algorithms. We first propose a new framework (1.6), which is more general than the frameworks proposed in [5] and [14]. Convergence is proved for any sequence satisfying (1.6) and (1.7) with η_k = O(1/k^α), α > 1, provided F is a Kurdyka-Łojasiewicz function. In the analysis, we employ a new Lyapunov function which is a composition of F and the length of the noise. The new framework proposed in this paper covers various kinds of algorithms; we then apply our results to these algorithms. For a specific algorithm, we just need to verify that (1.6) and (1.7) hold.

The rest of the paper is organized as follows. In section 2, we list necessary preliminaries. Section 3 contains the main results. In section 4, we provide the applications. Section 5 concludes the paper.

## 2 Preliminaries

This section presents the mathematical tools which will be used in our proofs and contains two parts: in the first one, we introduce the basic definitions and properties for subdifferentials; in the second one, the KŁ property is introduced.

### 2.1 Subdifferential

More details about the definition of the subdifferential can be found in the textbooks [27, 28]. Given a lower semicontinuous function J: ℝ^N → (−∞, +∞], its domain is defined by

 dom(J):={x∈RN:J(x)<+∞}.

The notion of subdifferential plays a central role in variational analysis.

###### Definition 1 (subdifferential).

Let J: ℝ^N → (−∞, +∞] be a proper and lower semicontinuous function.

1. For a given x ∈ dom(J), the Fréchet subdifferential of J at x, written ∂̂J(x), is the set of all vectors u ∈ ℝ^N which satisfy

 liminf_{y→x, y≠x} [J(y) − J(x) − ⟨u, y − x⟩] / ‖y − x‖ ≥ 0.

When x ∉ dom(J), we set ∂̂J(x) = ∅.

2. The (limiting) subdifferential, or simply the subdifferential, of J at x, written ∂J(x), is defined through the following closure process:

 ∂J(x) := {u ∈ ℝ^N : ∃ x^k → x, J(x^k) → J(x) and u^k ∈ ∂̂J(x^k), u^k → u as k → ∞}.

It is easy to verify that the Fréchet subdifferential is convex and closed, while the limiting subdifferential is closed. When J is convex, the definition agrees with the subgradient of convex analysis:

 ∂J(x):={v:J(y)≥J(x)+⟨v,y−x⟩  for  any  y∈RN}.

The graph of the subdifferential of an extended-real-valued function J is defined by

 graph(∂J):={(x,v)∈RN×RN:v∈∂J(x)}.

And the domain of the subdifferential of J is given as

 dom(∂J):={x∈RN:∂J(x)≠∅}.

Let {(x^k, v^k)} be a sequence in graph(∂J). If (x^k, v^k) converges to (x, v) as k → ∞ and J(x^k) converges to J(x) as k → ∞, then v ∈ ∂J(x). A necessary condition for x to be a minimizer of J is

 0∈∂J(x). (2.1)

When J is convex, (2.1) is also sufficient. A point that satisfies (2.1) is called a (limiting) critical point. The set of critical points of J is denoted by crit(J).

### 2.2 Kurdyka-Łojasiewicz function

With the definition of the subdifferential, we are now prepared to introduce the Kurdyka-Łojasiewicz property and function.

###### Definition 2.

[22, 18, 7] (a) The function J: ℝ^N → (−∞, +∞] is said to have the Kurdyka-Łojasiewicz property at x̄ ∈ dom(∂J) if there exist η ∈ (0, +∞], a neighborhood U of x̄ and a continuous concave function φ: [0, η) → ℝ₊ such that

1. φ(0) = 0.

2. φ is C¹ on (0, η).

3. for all s ∈ (0, η), φ′(s) > 0.

4. for all x in U ∩ {x : J(x̄) < J(x) < J(x̄) + η}, the Kurdyka-Łojasiewicz inequality holds:

 φ′(J(x) − J(x̄)) · dist(0, ∂J(x)) ≥ 1. (2.2)

(b) Proper lower semicontinuous functions which satisfy the Kurdyka-Łojasiewicz inequality at each point of dom(∂J) are called KŁ functions.
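As a quick sanity check of inequality (2.2), consider J(x) = x² at x̄ = 0 with the desingularizing function φ(s) = 2√s, so that φ′(s) = 1/√s; this particular choice of φ is our own standard example, not taken from the paper. The product in (2.2) is then identically 2 ≥ 1:

```python
import math

# KL inequality (2.2) for J(x) = x^2 at xbar = 0 with phi(s) = 2*sqrt(s):
# phi'(J(x) - J(0)) * dist(0, dJ(x)) = (1/|x|) * |2x| = 2 for every x != 0.
def kl_product(x):
    s = x * x                 # J(x) - J(xbar)
    grad_norm = abs(2 * x)    # dist(0, subdifferential of J at x)
    return (1 / math.sqrt(s)) * grad_norm

vals = [kl_product(x) for x in (0.001, 0.1, 0.5, 3.0)]
print(vals)  # each value equals 2.0, so (2.2) holds
```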

It is hard to directly judge whether a function is Kurdyka-Łojasiewicz or not. Fortunately, the concept of semi-algebraicity helps to identify and check a very rich class of Kurdyka-Łojasiewicz functions.

###### Definition 3 (Semi-algebraic sets and functions [7, 8]).

(a) A subset S of ℝ^N is a real semi-algebraic set if there exists a finite number of real polynomial functions g_{ij}, h_{ij} such that

 S = ⋃_{j=1}^{p} ⋂_{i=1}^{q} {u ∈ ℝ^N : g_{ij}(u) = 0 and h_{ij}(u) < 0}.

(b) A function h: ℝ^N → ℝ is called semi-algebraic if its graph

 {(u, t) ∈ ℝ^{N+1} : h(u) = t}

is a semi-algebraic subset of ℝ^{N+1}.

Better yet, semi-algebraicity enjoys many quite nice stability properties [7, 8]. We list a few examples of semi-algebraic functions and sets here:

• Real polynomial functions.

• Indicator functions of semi-algebraic sets.

• Finite sums and product of semi-algebraic functions.

• Composition of semi-algebraic functions.

• Sup/inf-type functions, e.g., x ↦ sup{g(x, y) : y ∈ C} is semi-algebraic when g is a semi-algebraic function and C is a semi-algebraic set.

• Cone of PSD matrices, Stiefel manifolds and constant rank matrices.

Now we present a lemma for the uniformized KŁ property. With this lemma, we can make the proofs much more concise.

###### Lemma 1 ([9]).

Let J be a proper lower semicontinuous function and Ω be a compact set. If J is constant on Ω and satisfies the KŁ property at each point of Ω, then there exist ε > 0, δ > 0 and a concave function φ satisfying the four assumptions in Definition 2 such that for any x̄ ∈ Ω and any x satisfying dist(x, Ω) < ε and J(x̄) < J(x) < J(x̄) + δ, it holds that

 φ′(J(x) − J(x̄)) · dist(0, ∂J(x)) ≥ 1. (2.3)

## 3 Convergence analysis

The sequence {η_k} is assumed to satisfy

 ∑lηl<+∞. (3.1)

It is worth mentioning that assumption (3.1) is necessary to guarantee sequence convergence in the general case. To see this, consider the inexact gradient example (1.4) in the very special case ∇F ≡ 0, which gives x^{k+1} = x^k + e^k. Further, consider the one-dimensional case and set e^k = η_k; then x^k = x^0 + Σ_{l<k} η_l, which diverges if (3.1) fails to hold. However, in our proofs, (3.1) alone barely suffices for sequence convergence; the final assumption used for convergence is a little stronger than (3.1).
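The two regimes in this example can be checked numerically; the schedules η_k = 1/k (not summable) and η_k = 1/k² (summable) below are our own illustrative choices:

```python
# Special case of (1.4) with grad F = 0 and e^k = eta_k: x^{k+1} = x^k + eta_k,
# so x^K - x^0 is just the partial sum of the noise sequence.
def drift(eta, K=100000):
    x = 0.0
    for k in range(1, K + 1):
        x += eta(k)
    return x

partial_harmonic = drift(lambda k: 1.0 / k)       # sum diverges: x grows without bound
partial_square   = drift(lambda k: 1.0 / k ** 2)  # summable: x converges (to pi^2/6)

print(partial_harmonic, partial_square)
```

The first drift keeps growing with K while the second stabilizes near π²/6 ≈ 1.645, matching the necessity of (3.1).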

Now, we introduce the Lyapunov function used in the analysis. Given any fixed θ > 1, we define the new function, which plays an important role in the analysis, as

 ξ(z) := F(x) + t^θ/θ,  z := (x, t) ∈ ℝ^{N+1}. (3.2)

We also need to define the new sequences as

 t_k := (θ · b · Σ_{l=k}^{+∞} η_l²)^{1/θ},  z^k := (x^k, t_k). (3.3)

Since (3.1) implies Σ_l η_l² < +∞, t_k is well-defined for θ > 1. The aim of this part is to prove that {x^k} generated by the algorithm converges to a critical point of F, and to build the relationships between the critical points of ξ and F. The proof contains two main steps:

1. Find a positive constant ρ₁ such that

 ρ₁‖ω^{k+1} − ω^k‖² ≤ ξ(z^k) − ξ(z^{k+1}),  k = 0, 1, ⋯.

2. Find further positive constants ρ₂, ρ₃, ρ₄ such that

 dist(0, ∂ξ(z^{k+1})) ≤ ρ₂ Σ_{j=k−τ}^{k} ‖ω^{j+1} − ω^j‖ + ρ₃η_k + ρ₄(t_{k+1})^{θ−1},  k = 0, 1, ⋯.
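The role of t_k is to store the tail noise energy so that ξ decreases along the iterates. Here is a small numerical sketch of (3.2)-(3.3); the values of θ, b and the noise schedule are chosen arbitrarily for illustration:

```python
# Sketch of (3.2)-(3.3): t_k packs the tail of the noise energy, so that the
# t-part of xi(z^k) = F(x^k) + t_k^theta / theta shrinks by b*eta_k^2 per step.
theta, b = 1.5, 1.0
eta = [1.0 / k ** 1.5 for k in range(1, 2001)]  # illustrative noise schedule

def t(k):
    tail = sum(e * e for e in eta[k:])          # sum_{l >= k} eta_l^2
    return (theta * b * tail) ** (1.0 / theta)

# The t-part alone is nonincreasing and contributes exactly b*eta_k^2 per step:
gap = (t(0) ** theta - t(1) ** theta) / theta
print(gap, b * eta[0] ** 2)  # the two printed numbers agree
```

This per-step budget b·η_k² is exactly the slack that the pseudo sufficient descent condition in (1.6) is allowed to consume, which is why ξ(z^k) still decreases (Lemma 2).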
###### Lemma 2.

Assume that {x^k} is generated by the general inexact algorithm satisfying conditions (1.6) and (1.7), and that condition (3.1) holds. Then, we have the following results.

(1) It holds that

 ξ(z^k) − ξ(z^{k+1}) ≥ a‖ω^k − ω^{k+1}‖². (3.4)

Consequently, {z^k} is bounded if F is coercive.

(2) Σ_k ‖ω^{k+1} − ω^k‖² < +∞, which implies that

 lim_k ‖x^{k+1} − x^k‖ = 0. (3.5)
###### Proof.

(1) Direct algebraic computation together with condition (1.6) gives

 ξ(z^k) − ξ(z^{k+1}) = F(x^k) − F(x^{k+1}) + (t_k^θ − t_{k+1}^θ)/θ = F(x^k) − F(x^{k+1}) + bη_k² ≥ a‖ω^k − ω^{k+1}‖². (3.6)

If F is coercive, then ξ is coercive. Thus, {z^k} is bounded because {ξ(z^k)} is bounded above by ξ(z^0).

(2) From (3.4), {ξ(z^k)} is nonincreasing; since ξ is bounded from below, {ξ(z^k)} is convergent. Hence, we easily have

 Σ_{n=0}^{k} ‖ω^{n+1} − ω^n‖² ≤ (ξ(z^0) − ξ(z^{k+1}))/a < +∞.

Letting k → +∞ and using (1.7), we then prove the result. ∎

###### Lemma 3.

If the conditions of Lemma 2 hold, then

 dist(0, ∂ξ(z^{k+1})) ≤ c Σ_{j=k−τ}^{k} ‖ω^{j+1} − ω^j‖ + dη_k + (t_{k+1})^{θ−1}. (3.7)
###### Proof.

Direct calculation yields

 ∂ξ(z^{k+1}) = ∂F(x^{k+1}) × {(t_{k+1})^{θ−1}}. (3.8)

Thus, we have

 dist(0, ∂ξ(z^{k+1})) ≤ dist(0, ∂F(x^{k+1})) + (t_{k+1})^{θ−1} ≤ c Σ_{j=k−τ}^{k} ‖ω^{j+1} − ω^j‖ + dη_k + (t_{k+1})^{θ−1}, (3.9)

where the second inequality uses the pseudo relative error condition in (1.6). ∎

In the following, we establish some results about the limit points of the sequence generated by the general algorithm. We need a definition of the set of limit points, which is introduced in [5].

###### Definition 4.

For a sequence {d^k}, define

 M(d^0) := {d ∈ ℝ^N : ∃ an increasing sequence of integers {k_j}_{j∈ℕ} such that d^{k_j} → d as j → ∞},

where d^0 is the starting point.

###### Lemma 4.

Suppose that {x^k} is generated by the general algorithm, F is coercive, and the conditions of Lemma 2 hold. Then, we have the following results.

(1) For any x* ∈ crit(F), we have z* := (x*, 0) ∈ crit(ξ) and ξ(z*) = F(x*).

(2) M(z^0) is nonempty and M(z^0) ⊆ crit(ξ).

(2') M(x^0) is nonempty and M(x^0) ⊆ crit(F).

(3) lim_k dist(z^k, M(z^0)) = 0.

(3') lim_k dist(x^k, M(x^0)) = 0.

(4) The function ξ is finite and constant on M(z^0).

(4') The function F is finite and constant on M(x^0).

###### Proof.

(1) Noting that ∂ξ(x*, 0) = ∂F(x*) × {0} for θ > 1, we have 0 ∈ ∂ξ(z*) whenever 0 ∈ ∂F(x*), and

 ξ(z*) = ξ(x*, 0) = F(x*).

(2) It is easy to see the coercivity of ξ. With Lemma 2 and the coercivity of ξ, {z^k} is bounded; thus, M(z^0) is nonempty. Assume z̄ ∈ M(z^0); by definition, there exists a subsequence z^{k_j} → z̄. From Lemmas 2 and 3, dist(0, ∂ξ(z^{k_j})) → 0. The closedness of ∂ξ indicates that 0 ∈ ∂ξ(z̄), i.e., z̄ ∈ crit(ξ).

(2') With the facts t_k → 0 and z^k = (x^k, t_k), we can easily derive the result.

(3)(3') These items follow as a consequence of the definition of the limit point.

(4) Let l be the limit of {ξ(z^k)}, which exists by Lemma 2. Take any z̄ ∈ M(z^0); from the definition, there exists a subsequence z^{k_j} → z̄, and from the continuity condition, ξ(z^{k_j}) → ξ(z̄). And we have

 ξ(z̄) = lim_j ξ(z^{k_j}) = l.

(4') The proof is similar to (4). ∎

###### Lemma 5.

Suppose that F is a closed, coercive semi-algebraic function. Let the sequence {x^k} be generated by the general scheme and let conditions (1.6) and (1.7) hold. Suppose there exists θ > 1 such that the sequence {η_k} satisfies

 Σ_k (Σ_{l=k}^{+∞} η_l²)^{(θ−1)/θ} < +∞. (3.10)

Then, the sequence {x^k} has finite length, i.e.,

 Σ_{k=0}^{+∞} ‖x^{k+1} − x^k‖ < +∞, (3.11)

and {x^k} converges to a critical point of F.

###### Proof.

Obviously, ξ is semi-algebraic, and hence KŁ. Let z* be a cluster point of {z^k}; then x* is also a cluster point of {x^k}. If ξ(z^{k₀}) = ξ(z*) for some k₀, then since {ξ(z^k)} is nonincreasing, ξ(z^k) = ξ(z*) for all k ≥ k₀; using Lemma 2, ω^k stays fixed for k ≥ k₀, and the claim is trivial. In the following, we consider the case ξ(z^k) > ξ(z*) for all k. From Lemmas 1 and 4, there exist ε, δ > 0 such that the uniformized KŁ inequality holds for any z̄ ∈ M(z^0) and any z satisfying dist(z, M(z^0)) < ε and ξ(z*) < ξ(z) < ξ(z*) + δ. From Lemma 4, when k is large enough,

 z^k ∈ {z : dist(z, M(z^0)) < ε} ∩ {z : ξ(z*) < ξ(z) < ξ(z*) + δ}.

Thus, there exists a concave function φ such that

 φ′(ξ(z^{k+1}) − ξ(z*)) · dist(0, ∂ξ(z^{k+1})) ≥ 1. (3.12)

Therefore, we have

 φ(ξ(z^{k+1}) − ξ(z*)) − φ(ξ(z^{k+2}) − ξ(z*))
 a) ≥ φ′(ξ(z^{k+1}) − ξ(z*)) · (ξ(z^{k+1}) − ξ(z^{k+2}))
 b) ≥ a · φ′(ξ(z^{k+1}) − ξ(z*)) · ‖ω^{k+2} − ω^{k+1}‖²
 c) ≥ a‖ω^{k+2} − ω^{k+1}‖² / dist(0, ∂ξ(z^{k+1}))
 d) ≥ a‖ω^{k+2} − ω^{k+1}‖² / (c Σ_{j=k−τ}^{k} ‖ω^{j+1} − ω^j‖ + dη_k + (t_{k+1})^{θ−1}),

where a) is due to the concavity of φ, b) depends on Lemma 2, c) uses the KŁ property (3.12), and d) follows from Lemma 3. Rearranging, taking square roots, applying the Schwarz-type inequality 2√(uv) ≤ su + v/s with a suitable weight s > 0, and multiplying through by 2(τ+1), we have

 2(τ+1)‖ω^{k+2} − ω^{k+1}‖ ≤ (c(τ+1)²/a²)[φ(ξ(z^{k+1}) − ξ(z*)) − φ(ξ(z^{k+2}) − ξ(z*))] + Σ_{j=k−τ}^{k} ‖ω^{j+1} − ω^j‖ + (ad/c)η_k + (a/c)(t_{k+1})^{θ−1}. (3.14)

Summing both sides of (3.14) from k = τ to K and simplifying, then letting K → +∞ and using (3.1) and (3.10), we then derive

 ∑k∥ωk+1−ωk∥<+∞. (3.16)

By using (1.7), we are then led to

 ∑k∥xk+1−xk∥<+∞. (3.17)

Thus, {x^k} is convergent and has a unique limit point x*. From Lemma 4, x* ∈ crit(F). ∎

The requirement (3.10) is complicated and impractical in applications. Thus, we consider sequences {η_k} of polynomial form η_k = C/k^α with C > 0 and α > 0, and try to simplify (3.10) in this case. The task then reduces to the following problem in mathematical analysis: find the minimal α₀ such that for any α > α₀, there exists θ > 1 that makes (3.10) hold. Direct calculations give us

 (Σ_{l=k}^{+∞} η_l²)^{(θ−1)/θ} = (Σ_{l=k}^{+∞} C² l^{−2α})^{(θ−1)/θ} ≤ (Σ_{l=k}^{+∞} ∫_{l−1}^{l} C² t^{−2α} dt)^{(θ−1)/θ} = (C²/(2α−1))^{(θ−1)/θ} · (k−1)^{−(2α−1)(θ−1)/θ}. (3.18)

Thus, we need

 α > 1,  and  (2α−1)(θ−1)/θ > 1. (3.19)

After simplifications, we get

 α > 1,  and  α > (2θ−1)/(2(θ−1)). (3.20)

Then, the problem reduces to

 α₀ = inf_{θ>1} c(θ),  where  c(θ) := max{1, (2θ−1)/(2(θ−1))} = (2θ−1)/(2(θ−1)). (3.21)

Figure 1 shows the values of c(θ) for θ > 1. We can see that c(θ) is decreasing toward 1 as θ → +∞. Therefore, we get α₀ = 1. That is to say, if η_k = O(1/k^α) with any fixed α > 1, there exists θ > 1 such that (3.10) holds, and then the sequence {x^k} converges to some critical point of F. Therefore, we obtain the following result.
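The claim α₀ = 1 can also be checked by evaluating c(θ) from (3.21) on a grid; the sample points below are arbitrary:

```python
# c(theta) = max(1, (2*theta - 1) / (2*(theta - 1))) from (3.21),
# evaluated on a few points to confirm it decreases toward 1.
def c(theta):
    return max(1.0, (2 * theta - 1) / (2 * (theta - 1)))

values = [c(th) for th in (1.5, 2.0, 5.0, 50.0, 5000.0)]
print(values)  # strictly decreasing toward 1, so the infimum alpha_0 is 1
```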

###### Theorem 1 (Convergence result).

Suppose that F is a closed, coercive semi-algebraic function. Let the sequence {x^k} be generated by the general scheme and let conditions (1.6) and (1.7) hold. Assume the sequence {η_k} obeys

 ηk=O(1kα),α>1. (3.22)

Then, the sequence {x^k} has finite length, i.e.,

 Σ_{k=0}^{+∞} ‖x^{k+1} − x^k‖ < +∞, (3.23)

and {x^k} converges to a critical point of F.

## 4 Applications to several nonconvex algorithms

In this part, several classical nonconvex inexact algorithms are considered. We apply our theoretical findings to these algorithms and derive corresponding convergence results. As presented before, we just need to check whether each algorithm satisfies the conditions in (1.6) and (1.7). For a closed (possibly nonconvex) function J, we denote

 prox_J(x) ∈ argmin_y {J(y) + ‖y − x‖²/2}. (4.1)

Unlike the convex case, prox_J is a point-to-set operator: the minimization in (4.1) may have more than one solution. We present a useful lemma which plays a very important role in the analysis.

###### Lemma 6.

For any x, y ∈ ℝ^N, if z ∈ prox_J(x), then

 J(z) + ‖z − x‖²/2 ≤ J(y) + ‖y − x‖²/2. (4.2)

Of course, we also have

 x−z∈∂J(z). (4.3)

In subsections 4.1-4.4, the auxiliary point ω^k is x^k itself, i.e., ω^k = x^k.
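To illustrate the set-valued nature of prox_J in (4.1), the following toy computation, with our own choice J(y) = −|y| and x = 0, recovers two distinct minimizers by a grid search:

```python
# Nonconvex example where prox_J is set-valued: J(y) = -|y| at x = 0.
# The prox objective J(y) + (y - x)^2 / 2 has two global minimizers, y = -1 and y = 1.
def objective(y, x=0.0):
    return -abs(y) + (y - x) ** 2 / 2

grid = [i / 1000.0 for i in range(-3000, 3001)]
best = min(objective(y) for y in grid)
minimizers = [y for y in grid if objective(y) <= best + 1e-9]
print(minimizers)  # [-1.0, 1.0]
```

Lemma 6 below applies to either selection z from this set, which is all the convergence analysis needs.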

### 4.1 Inexact nonconvex gradient and proximal algorithm

The nonconvex proximal gradient algorithm is developed for the nonconvex composite optimization

 min_x {F(x) = f(x) + g(x)}, (4.4)

where f is differentiable with L-Lipschitz gradient ∇f, and g is closed; both f and g may be nonconvex. The nonconvex inexact proximal gradient algorithm can be described as

 xk+1=proxhg(xk−h∇f(xk)+ek), (4.5)

where h > 0 is the stepsize, prox is the proximal operator defined in (4.1), and e^k is the noise. In the convex case, this algorithm is discussed in [38, 29], and its acceleration is studied in [31].
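Here is a minimal sketch of iteration (4.5) on a one-dimensional composite problem; the choices f(x) = (x − 2)²/2, g = λ|·| (whose prox is soft-thresholding) and the noise schedule are ours, for illustration only:

```python
import math
import random

# Inexact proximal gradient step (4.5) for F = f + g with f(x) = (x - 2)^2 / 2
# (so L = 1) and g(x) = lam * |x|; prox of h*g is soft-thresholding.
def soft(v, r):                      # prox_{r|.|}(v)
    return math.copysign(max(abs(v) - r, 0.0), v)

lam, h = 0.5, 0.9                    # stepsize h < 1/L = 1
random.seed(1)
x = 5.0
for k in range(1, 3001):
    eta_k = 1.0 / k ** 1.5           # eta_k = O(1/k^alpha) with alpha = 1.5 > 1
    e_k = random.uniform(-eta_k, eta_k)
    x = soft(x - h * (x - 2.0) + e_k, h * lam)

print(x)  # near the minimizer of F, which is x* = 1.5 here
```

With the summable noise schedule, the iterates settle at the critical point x* = 1.5, as Theorem 1 predicts once (1.6) and (1.7) are verified for this scheme.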

###### Lemma 7.

Let 0 < h < 1/L and let the sequence {x^k} be generated by algorithm (4.5). Then we have

 F(x^k) − F(x^{k+1}) ≥ (1/4)(1/h − L)‖x^{k+1} − x^k‖² − 1/(h(1 − hL)) · ‖e^k‖². (4.6)
###### Proof.

The L-Lipschitz continuity of ∇f gives

 f(x^{k+1}) − f(x^k) ≤ ⟨∇f(x^k), x^{k+1} − x^k⟩ + (L/2)‖x^{k+1} − x^k‖². (4.7)

On the other hand, applying Lemma 6 to the prox step in (4.5), we have

 h·g(x^{k+1}) + ‖x^k − h∇f(x^k) + e^k − x^{k+1}‖²/2 ≤ h·g(x^k) + ‖−h∇f(x^k) + e^k‖²/2, (4.8)

which is equivalent to

 g(x^{k+1}) − g(x^k) ≤ −⟨∇f(x^k), x^{k+1} − x^k⟩ − ‖x^k − x^{k+1}‖²/(2h) + ⟨e^k, x^{k+1} − x^k⟩/h. (4.9)

Summing (4.7) and (4.9),

 F(x^{k+1}) − F(x^k) ≤ (1/2)(L − 1/h)‖x^{k+1} − x^k‖² + ⟨e^k, x^{k+1} − x^k⟩/h. (4.10)

With the Cauchy-Schwarz (Young) inequality, we have

 ⟨e^k, x^{k+1} − x^k⟩/h ≤ (1/4)(1/h − L)‖x^{k+1} − x^k‖² + 1/(h(1 − hL)) · ‖e^k‖². (4.11)

Combining (4.11) and (4.10), we then prove the result. ∎
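Lemma 7 can be checked numerically: with g = 0 the prox is the identity and (4.5) reduces to the inexact gradient step, so we can test (4.6) along a run (the problem data below are our own choices):

```python
import random

# Numerical check of the descent estimate (4.6) for f(x) = (x - 2)^2 / 2
# (L = 1), g = 0 (so the prox is the identity), and stepsize h = 0.5 < 1/L.
L, h = 1.0, 0.5
F = lambda x: (x - 2.0) ** 2 / 2
random.seed(2)
x, ok = 10.0, True
for k in range(1, 501):
    e = random.uniform(-1, 1) / k ** 1.5
    x_next = x - h * (x - 2.0) + e          # inexact step (4.5) with g = 0
    lhs = F(x) - F(x_next)
    rhs = 0.25 * (1 / h - L) * (x_next - x) ** 2 - e * e / (h * (1 - h * L))
    ok = ok and (lhs >= rhs - 1e-12)
    x = x_next

print(ok)  # True: every step obeys the bound of Lemma 7
```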

###### Lemma 8.

Let the sequence {x^k} be generated by algorithm (4.5). Then we have

 dist(0, ∂F(x^{k+1})) ≤ (1/h + L)‖x^k − x^{k+1}‖ + (1/h)‖e^k‖. (4.12)
###### Proof.

From the optimality condition (4.3) applied to the prox step in (4.5), we have

 (x^k − x^{k+1})/h − ∇f(x^k) + e^k/h ∈ ∂g(x^{k+1}). (4.13)

Therefore,

 (x^k − x^{k+1})/h + ∇f(x^{k+1}) − ∇f(x^k) + e^k/h ∈ ∇f(x^{k+1}) + ∂g(x^{k+1}) = ∂F(x^{k+1}). (4.14)

Thus, we have

 dist(0, ∂F(x^{k+1})) ≤ ‖(x^k − x^{k+1})/h + ∇f(x^{k+1}) − ∇f(x^k) + e^k/h‖ (4.15) ≤ (1/h)‖x^k − x^{k+1}‖ + L‖x^k − x^{k+1}‖ + (1/h)‖e^k‖,

which proves (4.12). ∎