On the complexity of convex inertial proximal algorithms

The inertial proximal gradient algorithm is efficient for the composite optimization problem. Recently, the convergence of a special inertial proximal gradient algorithm under strong convexity has also been studied. In this paper, we present novel convergence complexity results, especially on the convergence rates of the function values. The non-ergodic O(1/k) rate is proved for the inertial proximal gradient algorithm with constant stepsize when the objective function is coercive. When the objective function is not guaranteed to be coercive, we prove the sublinear rate with diminishing inertial parameters. When the function satisfies a condition that is much weaker than strong convexity, linear convergence is proved with a much larger and more general stepsize than in the previous literature. We also extend our results to the multi-block version and present the computational complexity. Both cyclic and stochastic index selection strategies are considered.

1 Introduction

In this paper, we study the following composite optimization problem

$$\min_x \{F(x) := f(x) + g(x)\}, \tag{1.1}$$

where $f$ is differentiable, $\nabla f$ is Lipschitz continuous with constant $L$, and $g$ is proximable. The inertial proximal algorithm for this problem (iPiano) [13] can be described as

$$x^{k+1} = \mathrm{prox}_{\gamma_k g}\left[x^k - \gamma_k \nabla f(x^k) + \beta_k (x^k - x^{k-1})\right], \tag{1.2}$$

where $\gamma_k$ is the stepsize and $\beta_k$ is the inertial parameter. iPiano is closely related to two classical algorithms: the forward-backward splitting method [6] (when $\beta_k \equiv 0$) and the heavy-ball method (when $g \equiv 0$) [16]. iPiano is thus a combination of the forward-backward splitting method and the heavy-ball method. However, unlike forward-backward splitting, the sequence generated by iPiano is not Fejér monotone due to the inertial term $\beta_k (x^k - x^{k-1})$. This causes difficulty in proving convergence rates in the convex case. Since the heavy-ball method is a special form of iPiano, the same difficulty arises in analyzing the complexity of the heavy-ball method. In the existing literature, the sublinear convergence rate of the heavy-ball method was established only in the ergodic sense. In this paper, we propose a novel Lyapunov function to address this issue, and prove non-ergodic convergence rates.
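As an illustration of iteration (1.2), here is a minimal sketch in Python. It assumes a specific instance that the text does not prescribe: $f(x) = \frac{1}{2}\|Ax - b\|^2$ and $g(x) = \lambda\|x\|_1$, so that the proximal map is soft-thresholding. The function names, the random data, and the constant values of `beta` and `c` are hypothetical; the stepsize rule $2(1-\beta)c/L$ is the one used in Lemma 3.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau * ||.||_1 (the assumed choice of g in this sketch)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ipiano_lasso(A, b, lam, beta=0.3, c=0.9, iters=500):
    """Run iteration (1.2) for f(x) = 0.5*||Ax - b||^2 and g(x) = lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of grad f
    gamma = 2.0 * (1.0 - beta) * c / L       # constant stepsize
    x_prev = np.zeros(A.shape[1])
    x = x_prev.copy()
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        z = x - gamma * grad + beta * (x - x_prev)      # forward step plus inertia
        x_prev, x = x, soft_threshold(z, gamma * lam)   # backward (prox) step
    return x

def lasso_objective(A, b, lam, x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

# Hypothetical test problem.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
b = rng.standard_normal(40)
lam = 0.1
x_hat = ipiano_lasso(A, b, lam)
F0 = lasso_objective(A, b, lam, np.zeros(20))
F_hat = lasso_objective(A, b, lam, x_hat)
```

Setting `beta=0` in this sketch recovers plain forward-backward splitting, and setting `lam=0` recovers the heavy-ball iteration.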

1.1 Interpretation by dynamical systems

The discretization of the following dynamical system gives the heavy-ball method with [1]:

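$$\ddot{x}(t) + \alpha \dot{x}(t) + \nabla f(x(t)) = 0 \tag{1.3}$$

for some $\alpha > 0$. If, further, the inertial term vanishes, the heavy-ball method reduces to basic gradient descent, which corresponds to the discretization of the following ODE

$$\alpha \dot{x}(t) + \nabla f(x(t)) = 0. \tag{1.4}$$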
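To make the link between (1.3) and the heavy-ball iteration concrete, the following worked step uses one particular finite-difference scheme: a central difference for $\ddot{x}$, a forward difference for $\dot{x}$, and a step $h > 0$. The scheme and the resulting parameter formulas are assumptions of this sketch; other discretizations lead to different constants.

```latex
\begin{aligned}
&\frac{x^{k+1} - 2x^k + x^{k-1}}{h^2} + \alpha\,\frac{x^{k+1} - x^k}{h} + \nabla f(x^k) = 0 \\
\Longleftrightarrow\;& (1 + \alpha h)\,x^{k+1} = (2 + \alpha h)\,x^k - x^{k-1} - h^2 \nabla f(x^k) \\
\Longleftrightarrow\;& x^{k+1} = x^k - \frac{h^2}{1 + \alpha h}\,\nabla f(x^k) + \frac{1}{1 + \alpha h}\,(x^k - x^{k-1}).
\end{aligned}
```

Under this scheme the stepsize is $h^2/(1+\alpha h)$ and the inertial parameter is $1/(1+\alpha h)$, which lies in $(0,1)$ whenever $\alpha > 0$.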

Studying the properties of the above dynamical systems helps us understand the algorithms. More importantly, it motivates the construction of a proper Lyapunov function. We notice that an important relation between $\ddot{x}(t)$ and $\dot{x}(t)$ is missing. It is thus natural to add the missing information back to (1.3). In the discretization, $\ddot{x}(t)$ is replaced by $\frac{x^{k+1} - 2x^k + x^{k-1}}{h^2}$, where $h$ is the stepsize for discretization. Then it holds that

$$\left\| \frac{x^{k+1} - 2x^k + x^{k-1}}{h^2} \right\| \le \frac{1}{h} \cdot \left( \left\| \frac{x^{k+1} - x^k}{h} \right\| + \left\| \frac{x^k - x^{k-1}}{h} \right\| \right). \tag{1.5}$$

Note that both $\frac{x^{k+1} - x^k}{h}$ and $\frac{x^k - x^{k-1}}{h}$ can be viewed as discretizations of $\dot{x}(t)$. Motivated by this observation, we propose to modify (1.3) by adding the following constraint

$$\|\ddot{x}(t)\| \le \theta \|\dot{x}(t)\|, \tag{1.6}$$

where $\theta > 0$. In Section 2, we study the system (1.3)+(1.6). With the extra constraint (1.6), a sublinear asymptotic convergence rate can be established. The analysis leads to the non-ergodic sublinear convergence rate for the heavy-ball (inertial) algorithm.

1.2 Related works

The inertial term was first proposed in the heavy-ball algorithm [16]. When the objective function is twice continuously differentiable and strongly convex (almost quadratic), the heavy-ball method is proved to converge linearly. Under the weaker assumption that the gradient of the objective function is Lipschitz continuous, [21] proved convergence to a critical point, yet without specifying the convergence rate. The smoothness of the objective function is critical for the heavy-ball method to converge. In fact, there is an example in which the heavy-ball method diverges for a strongly convex but nonsmooth function [9]. Unlike the classical gradient methods, the heavy-ball algorithm fails to generate a Fejér monotone sequence. In the general convex and smooth case, the only convergence rate result is ergodic in terms of the function values [7].

iPiano combines the heavy-ball method with the proximal mapping, as in forward-backward splitting. In the nonconvex case, convergence of the algorithm was thoroughly discussed in [13]. The local linear convergence of iPiano and the heavy-ball method was proved in [11]. In the strongly convex case, linear convergence was proved for iPiano with fixed $\beta_k$ [12]. In [15], the inertial Proximal Alternating Linearized Minimization (iPALM) was introduced as a variant of iPiano for solving two-block regularized problems. Without the inertial terms, this algorithm reduces to the Proximal Alternating Linearized Minimization (PALM) [4], which is equivalent to the two-block case of the Coordinate Descent (CD) algorithm [20]. In the convex case, two-block CD methods are also well studied [2, 17, 3, 18]. Recently, there has been growing interest in studying CD methods from the operator viewpoint [5, 14, 19, 8].

1.3 Contribution and organization

In this paper, we present the first non-ergodic convergence rate result for iPiano in general convex case. Compared with results in [7], our convergence is established with a much larger stepsize under the coercive assumption. If the function fails to be coercive, we can choose asymptotic stepsizes. We also present the linear convergence under an error bound condition without assuming strong convexity. Similar to the coercive case, our results hold for relaxed stepsizes. In addition, we extend our result to the coordinate descent version of iPiano. Both cyclic and stochastic index selection strategies are considered. The contributions of this paper are summarized as follows:

1. A novel dynamical interpretation: We propose a modified dynamical system of the inertial algorithm, from which we derive the sublinear asymptotic convergence rate with a proper Lyapunov function.

2. The non-ergodic sublinear convergence rate: We are the first to prove the non-ergodic convergence rates of the inertial proximal gradient algorithm. The linear convergence rate is also proved for the objective function without strong convexity. The brief idea of proof is to bound the Lyapunov function, and connect this bound to the successive difference of the Lyapunov function.

3. Better linear convergence: Stronger linear convergence results are proved for inertial algorithms. Compared with those in the literature, our stepsizes and inertial parameters are relaxed, and the strong convexity assumption can be weakened. More importantly, we show that the stepsize can be chosen independently of the strong convexity constant.

4. Extensions to multi-block version: The convergence of multi-block versions of inertial methods is studied. Both cyclic and stochastic index selection strategies are considered. The sublinear and linear convergence rates are proved for both algorithms.

The rest of the paper is organized as follows. In Section 2, we study the modified dynamical system and present technical lemmas. In Section 3, we show the convergence rates for inertial proximal gradient methods. We extend the results to the multi-block version of iPiano in Section 4, and to the stochastic version in Section 5. Section 6 concludes this article.

2 Dynamical motivation and technical lemmas

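In this part, we first analyze the performance of the modified dynamical system (1.3)+(1.6). The existence of solutions to the system is beyond the scope of this paper and will not be discussed. Then, two lemmas are introduced for the sublinear convergence rate analysis.

2.1 Performance of the modified dynamical system

Let us assume that a solution of system (1.3)+(1.6) exists, and consider the Lyapunov function

$$\xi(t) := f(x(t)) + \frac{1}{2}\|\dot{x}(t)\|^2 - \min f. \tag{2.1}$$

With direct computation, it holds that

$$\dot{\xi}(t) = \langle \nabla f(x(t)), \dot{x}(t) \rangle + \langle \ddot{x}(t), \dot{x}(t) \rangle = -\alpha \|\dot{x}(t)\|^2. \tag{2.2}$$

Assume that $f$ is coercive. Noting that $\xi(t)$ is decreasing and nonnegative, $x(t)$ must be bounded. With the continuity of $\nabla f$, $\nabla f(x(t))$ is also bounded. By (1.3), that means $\ddot{x}(t) + \alpha \dot{x}(t)$ is also bounded. If $\alpha > \theta$, the triangle inequality and (1.6) give

$$\|\ddot{x}(t) + \alpha \dot{x}(t)\| \ge \alpha \|\dot{x}(t)\| - \|\ddot{x}(t)\| \ge (\alpha - \theta) \|\dot{x}(t)\|. \tag{2.3}$$

We then obtain the boundedness of $\dot{x}(t)$ and $\ddot{x}(t)$. Let $x^* \in \arg\min f$; we have

$$\begin{aligned}
f(x(t)) - f(x^*) &\le \langle \nabla f(x(t)), x(t) - x^* \rangle \\
&\le \|\nabla f(x(t))\| \cdot \|x(t) - x^*\| \\
&\le (\alpha + \theta) \|\dot{x}(t)\| \cdot \|x(t) - x^*\|.
\end{aligned} \tag{2.4}$$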

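With the boundedness, denote

$$R := \sup_{t \ge 0} \left[ \max\left\{ (\alpha + \theta) \cdot \|x(t) - x^*\|,\ \frac{\|\dot{x}(t)\|}{2} \right\} \right] < +\infty.$$

Then (2.1) and (2.4) give $\xi(t) \le 2R \|\dot{x}(t)\|$, that is,

$$\xi(t)^2 \le 4R^2 \|\dot{x}(t)\|^2. \tag{2.5}$$

With (2.2) and (2.5),

$$\xi(t)^2 \le -\frac{4R^2}{\alpha} \dot{\xi}(t).$$

That is also

$$\frac{d\xi}{\xi^2} \le -\frac{\alpha}{4R^2}\, dt. \tag{2.6}$$

Integrating both sides over $[0, t]$ gives $\frac{1}{\xi(t)} \ge \frac{1}{\xi(0)} + \frac{\alpha}{4R^2} t$, and hence

$$f(x(t)) - f(x^*) \le \xi(t) \le \frac{1}{\frac{\alpha}{4R^2} t + \frac{1}{\xi(0)}}.$$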
Remark 1.

If we consider (1.3) alone, only convergence can be proved, without a sublinear asymptotic rate. The constraint (1.6) is therefore crucial for the analysis.

2.2 Technical lemmas

This part contains two lemmas on nonnegative sequences. Lemma 1 is used to derive the convergence rate; it can be regarded as the discrete form of (2.6). Lemma 2 is developed to bound the sequence when the inertial parameters are decreasing.

Lemma 1 (Lemma 3.8, [2]).

Let $(\alpha_k)_{k \ge 0}$ be a nonnegative sequence of real numbers satisfying, for some constant $\gamma > 0$,

$$\alpha_k - \alpha_{k+1} \ge \gamma \alpha_{k+1}^2.$$

Then, we have

$$\alpha_k = O\left(\frac{1}{k}\right).$$
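The statement of Lemma 1 can be checked numerically in the extreme case where the recursion holds with equality. The sketch below is only an illustration, not a proof: the values of `gamma` and `alpha0` are hypothetical, and the closed-form bound on $k\,\alpha_k$ is obtained by telescoping $1/\alpha_k$ in the equality case, which is an assumption of this example rather than the argument of [2].

```python
import math

def next_term(alpha_k, gamma):
    """Nonnegative root a of alpha_k - a = gamma * a**2, i.e. a = alpha_{k+1}.

    Written as 2*alpha_k / (1 + sqrt(1 + 4*gamma*alpha_k)) to avoid cancellation.
    """
    return 2.0 * alpha_k / (1.0 + math.sqrt(1.0 + 4.0 * gamma * alpha_k))

gamma, alpha0 = 0.5, 2.0          # hypothetical constants
alphas = [alpha0]
for _ in range(10000):
    alphas.append(next_term(alphas[-1], gamma))

# In the equality case, 1/alpha_{k+1} - 1/alpha_k = gamma / (1 + gamma*alpha_{k+1})
# >= gamma / (1 + gamma*alpha_0), so k * alpha_k <= (1 + gamma*alpha_0) / gamma.
bound = (1.0 + gamma * alpha0) / gamma
scaled = [k * a for k, a in enumerate(alphas)]
```

In this run `scaled` stays below `bound`, which is the $O(1/k)$ behaviour stated in the lemma.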
Lemma 2.

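Let $(t_k)_{k \ge 0}$ be a nonnegative sequence satisfying

$$t_{k+1} \le (1 + \beta_k) t_k + \beta_k t_{k-1}. \tag{2.7}$$

If $(\beta_k)_{k \ge 0}$ is nonnegative, non-increasing, and

$$\sum_k \beta_k < +\infty,$$

then $(t_k)_{k \ge 0}$ is bounded.

Proof.

Adding $\beta_k t_k$ to both sides of (2.7),

$$t_{k+1} + \beta_k t_k \le (1 + \beta_k) t_k + \beta_k t_{k-1} + \beta_k t_k \le (1 + 2\beta_k)(t_k + \beta_k t_{k-1}). \tag{2.8}$$

Since $(\beta_k)_{k \ge 0}$ is non-increasing, (2.8) implies

$$t_{k+1} + \beta_k t_k \le (1 + 2\beta_k)(t_k + \beta_{k-1} t_{k-1}).$$

Letting

$$h_k := t_k + \beta_{k-1} t_{k-1},$$

we then have

$$h_{k+1} \le (1 + 2\beta_k) h_k \le e^{2\beta_k} h_k.$$

Thus, for any $k$,

$$h_{k+1} \le e^{2 \sum_{i=1}^{k} \beta_i} h_1 < +\infty.$$

The boundedness of $(h_k)_{k \ge 1}$ directly yields the boundedness of $(t_k)_{k \ge 0}$. ∎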
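The bound obtained in the proof can be observed numerically. The following sketch takes the worst case of (2.7), equality at every step, with the hypothetical summable weights $\beta_k = 1/(k+1)^2$ and hypothetical initial values $t_0 = t_1 = 1$, and compares the sequence with the bound $e^{2\sum_i \beta_i} h_1$.

```python
import math

def beta(k):
    """Summable, non-increasing weights (a hypothetical choice for this sketch)."""
    return 1.0 / (k + 1) ** 2

# Worst case of (2.7): equality at every step.
t = [1.0, 1.0]                          # t_0, t_1
for k in range(1, 5000):
    t.append((1.0 + beta(k)) * t[k] + beta(k) * t[k - 1])

h1 = t[1] + beta(0) * t[0]              # h_1 = t_1 + beta_0 * t_0
beta_sum = sum(beta(i) for i in range(1, 5000))
bound = math.exp(2.0 * beta_sum) * h1   # bound from the proof of Lemma 2
```

Every entry of `t` stays below `bound`, even though the sequence itself is increasing.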

3 Convergence rates

In this section, we prove convergence rates of iPiano. The core of the proof is to construct a proper Lyapunov function.

Lemma 3.

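Suppose $f$ is a convex function with $L$-Lipschitz gradient, $g$ is convex, and $\min F > -\infty$. Let $(x^k)_{k \ge 0}$ be generated by the inertial proximal gradient algorithm with non-increasing $(\beta_k)_{k \ge 0}$. Choosing the step size

$$\gamma_k = \frac{2(1 - \beta_k)c}{L}$$

for arbitrary fixed $0 < c < 1$, we have

$$\left[ F(x^k) + \frac{\beta_k}{2\gamma_k} \|x^k - x^{k-1}\|^2 \right] - \left[ F(x^{k+1}) + \frac{\beta_{k+1}}{2\gamma_{k+1}} \|x^{k+1} - x^k\|^2 \right] \ge \left( \frac{1 - \beta_k}{\gamma_k} - \frac{L}{2} \right) \|x^{k+1} - x^k\|^2. \tag{3.1}$$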
Proof.

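The update of $x^{k+1}$ directly gives

$$\frac{x^k - x^{k+1}}{\gamma_k} - \nabla f(x^k) + \frac{\beta_k}{\gamma_k}(x^k - x^{k-1}) \in \partial g(x^{k+1}). \tag{3.2}$$

With the convexity of $g$, we have

$$g(x^{k+1}) - g(x^k) \le \left\langle \frac{x^{k+1} - x^k}{\gamma_k} + \nabla f(x^k) + \frac{\beta_k}{\gamma_k}(x^{k-1} - x^k),\ x^k - x^{k+1} \right\rangle. \tag{3.3}$$

With the Lipschitz continuity of $\nabla f$,

$$f(x^{k+1}) - f(x^k) \le \langle -\nabla f(x^k), x^k - x^{k+1} \rangle + \frac{L}{2} \|x^{k+1} - x^k\|^2. \tag{3.4}$$

Combining (3.3) and (3.4),

$$\begin{aligned}
F(x^{k+1}) - F(x^k) &\le \frac{\beta_k}{\gamma_k} \langle x^k - x^{k-1}, x^{k+1} - x^k \rangle + \left( \frac{L}{2} - \frac{1}{\gamma_k} \right) \|x^{k+1} - x^k\|^2 \\
&\overset{a)}{\le} \frac{\beta_k}{2\gamma_k} \|x^k - x^{k-1}\|^2 + \left( \frac{L}{2} - \frac{1}{\gamma_k} + \frac{\beta_k}{2\gamma_k} \right) \|x^{k+1} - x^k\|^2,
\end{aligned} \tag{3.5}$$

where $a)$ uses the Schwarz inequality $\langle x^k - x^{k-1}, x^{k+1} - x^k \rangle \le \frac{1}{2}\|x^k - x^{k-1}\|^2 + \frac{1}{2}\|x^{k+1} - x^k\|^2$. With direct calculations, we then obtain

$$\left[ F(x^k) + \frac{\beta_k}{2\gamma_k} \|x^k - x^{k-1}\|^2 \right] - \left[ F(x^{k+1}) + \frac{\beta_k}{2\gamma_k} \|x^{k+1} - x^k\|^2 \right] \ge \left( \frac{1 - \beta_k}{\gamma_k} - \frac{L}{2} \right) \|x^{k+1} - x^k\|^2. \tag{3.6}$$

Since $(\beta_k)_{k \ge 0}$ is non-increasing, $\left( \frac{\beta_k}{2\gamma_k} \right)_{k \ge 0}$ is also non-increasing. Thus, we obtain (3.1). ∎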
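The descent property of Lemma 3 can be observed numerically. The sketch below assumes a specific instance that the lemma does not prescribe, $f(x) = \frac{1}{2}\|Ax - b\|^2$ and $g(x) = \lambda \|x\|_1$, with constant $\beta_k$; the data and the values of `lam`, `beta` and `c` are hypothetical. It records the quantity $F(x^k) + \frac{\beta_k}{2\gamma_k}\|x^k - x^{k-1}\|^2$ along the iteration and measures its largest increase.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau * ||.||_1 (assumed g for this sketch)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 15))
b = rng.standard_normal(30)
lam, beta, c = 0.2, 0.5, 0.8             # hypothetical test values
L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of grad f
gamma = 2.0 * (1.0 - beta) * c / L       # stepsize rule of Lemma 3

def F(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x_prev = rng.standard_normal(15)
x = x_prev.copy()
energy = []   # F(x^k) + beta/(2*gamma) * ||x^k - x^{k-1}||^2
for _ in range(200):
    energy.append(F(x) + beta / (2.0 * gamma) * np.sum((x - x_prev) ** 2))
    z = x - gamma * (A.T @ (A @ x - b)) + beta * (x - x_prev)
    x_prev, x = x, soft_threshold(z, gamma * lam)

# Largest one-step increase of the recorded quantity (should not be positive
# beyond rounding error if the descent inequality holds).
max_increase = max(e1 - e0 for e0, e1 in zip(energy, energy[1:]))
```

Note that $F(x^k)$ alone need not be monotone for inertial methods; it is the augmented quantity above that decreases.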

We employ the following Lyapunov function

$$\xi_k := F(x^k) + \delta_k \|x^k - x^{k-1}\|^2 - \min F, \tag{3.7}$$

where

$$\delta_k := \frac{\beta_k}{2\gamma_k} + \frac{1}{2} \left( \frac{1 - \beta_k}{\gamma_k} - \frac{L}{2} \right). \tag{3.8}$$

Function (3.7) can be regarded as the discretization of (2.1). We now present a technical lemma that is key to our results.

Lemma 4.

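Suppose the conditions of Lemma 3 hold. Let $\overline{x^k}$ denote the projection of $x^k$ onto $\arg\min F$, assumed to exist, and define

$$\varepsilon_k := \frac{4c\,\delta_{k+1}^2}{(1 - c)L} + \frac{4c}{(1 - c)L\gamma_k^2}. \tag{3.9}$$

Then it holds that

$$(\xi_{k+1})^2 \le \varepsilon_k (\xi_k - \xi_{k+1}) \cdot \left( 2\|x^{k+1} - \overline{x^{k+1}}\|^2 + \|x^{k+1} - x^k\|^2 \right). \tag{3.10}$$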
Proof.

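With direct computation and Lemma 3, we have

$$\begin{aligned}
\xi_k - \xi_{k+1} &\ge \frac{1}{2} \left( \frac{1 - \beta_k}{\gamma_k} - \frac{L}{2} \right) \cdot \left( \|x^{k+1} - x^k\|^2 + \|x^k - x^{k-1}\|^2 \right) \\
&= \frac{L}{4} \left( \frac{1}{c} - 1 \right) \cdot \left( \|x^{k+1} - x^k\|^2 + \|x^k - x^{k-1}\|^2 \right).
\end{aligned} \tag{3.11}$$

The convexity of $g$ yields

$$g(x^{k+1}) - g(\overline{x^{k+1}}) \le \langle \tilde{\nabla} g(x^{k+1}), x^{k+1} - \overline{x^{k+1}} \rangle,$$

where $\tilde{\nabla} g(x^{k+1}) \in \partial g(x^{k+1})$. By (3.2), we then have

$$g(x^{k+1}) - g(\overline{x^{k+1}}) \le \left\langle \frac{x^k - x^{k+1}}{\gamma_k} - \nabla f(x^k) + \frac{\beta_k}{\gamma_k}(x^k - x^{k-1}),\ x^{k+1} - \overline{x^{k+1}} \right\rangle. \tag{3.12}$$

Similarly, we have

$$f(x^{k+1}) - f(\overline{x^{k+1}}) \le \langle \nabla f(x^{k+1}), x^{k+1} - \overline{x^{k+1}} \rangle. \tag{3.13}$$

Summing (3.12) and (3.13) yields

$$\begin{aligned}
F(x^{k+1}) - F(\overline{x^{k+1}}) &\le \frac{\beta_k}{\gamma_k} \langle x^k - x^{k-1}, x^{k+1} - \overline{x^{k+1}} \rangle + \left\langle \frac{x^k - x^{k+1}}{\gamma_k}, x^{k+1} - \overline{x^{k+1}} \right\rangle \\
&\overset{a)}{\le} \frac{\beta_k}{\gamma_k} \|x^k - x^{k-1}\| \cdot \|x^{k+1} - \overline{x^{k+1}}\| + \frac{1}{\gamma_k} \|x^k - x^{k+1}\| \cdot \|x^{k+1} - \overline{x^{k+1}}\| \\
&\overset{b)}{\le} \frac{1}{\gamma_k} \left( \|x^k - x^{k+1}\| + \|x^k - x^{k-1}\| \right) \cdot \|x^{k+1} - \overline{x^{k+1}}\|,
\end{aligned} \tag{3.14}$$

where $a)$ is due to the Schwarz inequality and $b)$ depends on the fact $0 \le \beta_k < 1$. With (3.7) and (3.14), we have

$$\xi_{k+1} \le \frac{1}{\gamma_k} \left( \|x^k - x^{k+1}\| + \|x^k - x^{k-1}\| \right) \cdot \|x^{k+1} - \overline{x^{k+1}}\| + \delta_{k+1} \|x^{k+1} - x^k\|^2.$$

Let

$$a^k := \begin{pmatrix} \frac{1}{\gamma_k} \|x^k - x^{k+1}\| \\ \frac{1}{\gamma_k} \|x^k - x^{k-1}\| \\ \delta_{k+1} \|x^{k+1} - x^k\| \end{pmatrix}, \qquad b^k := \begin{pmatrix} \|x^{k+1} - \overline{x^{k+1}}\| \\ \|x^{k+1} - \overline{x^{k+1}}\| \\ \|x^{k+1} - x^k\| \end{pmatrix}. \tag{3.15}$$

Using this notation, the bound above reads $\xi_{k+1} \le \langle a^k, b^k \rangle$, so that

$$(\xi_{k+1})^2 \le \left| \langle a^k, b^k \rangle \right|^2 \le \|a^k\|^2 \cdot \|b^k\|^2. \tag{3.16}$$

Direct calculation yields

$$\|a^k\|^2 \le \left( \delta_{k+1}^2 + \frac{1}{\gamma_k^2} \right) \cdot \left( \|x^k - x^{k+1}\|^2 + \|x^{k-1} - x^k\|^2 \right)$$

and

$$\|b^k\|^2 \le 2\|x^{k+1} - \overline{x^{k+1}}\|^2 + \|x^{k+1} - x^k\|^2.$$

Thus, we derive

$$(\xi_{k+1})^2 \le \left( \delta_{k+1}^2 + \frac{1}{\gamma_k^2} \right) \cdot \left( \|x^k - x^{k+1}\|^2 + \|x^{k-1} - x^k\|^2 \right) \cdot \left( 2\|x^{k+1} - \overline{x^{k+1}}\|^2 + \|x^{k+1} - x^k\|^2 \right). \tag{3.17}$$

Combining (3.11) and (3.17), we then prove the result. ∎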

3.1 Sublinear convergence rate under weak convexity

In this subsection, we present the sublinear convergence rate of iPiano in the convex case. The coercivity of the function is critical for the analysis. If $F$ is coercive, the parameter $\beta_k$ can be bounded away from $0$; however, if $F$ is not guaranteed to be coercive, $\beta_k$ must decrease to zero. Thus, this subsection is divided into two parts according to coercivity.

3.1.1 F is coercive

First, we present the non-ergodic convergence rate of the function value. The rate can be derived if $\beta_k$ is bounded away from $0$ and $1$.

Theorem 1.

Assume the conditions of Lemma 3 hold, and

 0

Then we have

$$F(x^k) - \min F = O\left(\frac{1}{k}\right). \tag{3.18}$$
Proof.

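By Lemma 4, $\xi_{k+1} \le \xi_k$; thus, $\sup_k \xi_k < +\infty$ and $\sup_k F(x^k) < +\infty$. Noting the coercivity of $F$, the sequences $(x^k)_{k \ge 0}$ and $(\overline{x^k})_{k \ge 0}$ are bounded. With the assumptions on $\gamma_k$ and $\beta_k$, $\sup_k \varepsilon_k < +\infty$. Thus, $\varepsilon_k \left( 2\|x^{k+1} - \overline{x^{k+1}}\|^2 + \|x^{k+1} - x^k\|^2 \right)$ is bounded; we denote the bound by $R$, i.e.,

$$\varepsilon_k \left( 2\|x^{k+1} - \overline{x^{k+1}}\|^2 + \|x^{k+1} - x^k\|^2 \right) \le R.$$

By Lemma 4, we then have

$$\xi_{k+1}^2 \le R (\xi_k - \xi_{k+1}).$$

By Lemma 1,

$$\xi_k = O\left(\frac{1}{k}\right).$$

Using the fact $F(x^k) - \min F \le \xi_k$, we then prove the result. ∎

To the best of our knowledge, this is the first proof of a non-ergodic convergence rate in terms of function values for iPiano and the heavy-ball method in the convex case.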

3.1.2 F fails to be coercive

In this case, to obtain the boundedness of the sequence $(x^k)_{k \ge 0}$, we must employ diminishing $\beta_k$, i.e., $\lim_k \beta_k = 0$. The following lemma provides the needed boundedness.

Lemma 5.

Suppose the conditions of Lemma 3 hold, and

$$\beta_k = \frac{1}{(k+1)^\theta},$$

where $\theta > 1$. Let $(x^k)_{k \ge 0}$ be generated by the inertial proximal gradient algorithm. Then $(x^k)_{k \ge 0}$ is bounded.

Proof.

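First, we prove that $I - \gamma_k \nabla f$ is a nonexpansive operator. For any $x, y$,

$$\begin{aligned}
\|x - \gamma_k \nabla f(x) - y + \gamma_k \nabla f(y)\|^2 &= \|x - y\|^2 - 2\gamma_k \langle \nabla f(x) - \nabla f(y), x - y \rangle + \gamma_k^2 \|\nabla f(x) - \nabla f(y)\|^2 \\
&\le \|x - y\|^2 - \left( \frac{2\gamma_k}{L} - \gamma_k^2 \right) \|\nabla f(x) - \nabla f(y)\|^2 \\
&\le \|x - y\|^2,
\end{aligned}$$

where the first inequality depends on the fact $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|^2$, and the second one is due to $0 < \gamma_k \le \frac{2}{L}$.

Let $x^*$ be a minimizer of $F$. It holds that

$$x^* = \mathrm{prox}_{\gamma_k g}[x^* - \gamma_k \nabla f(x^*)].$$

Noting that $\mathrm{prox}_{\gamma_k g}$ is nonexpansive,

$$\begin{aligned}
\|x^{k+1} - x^*\| &= \left\| \mathrm{prox}_{\gamma_k g}[x^k - \gamma_k \nabla f(x^k) + \beta_k (x^k - x^{k-1})] - \mathrm{prox}_{\gamma_k g}[x^* - \gamma_k \nabla f(x^*)] \right\| \\
&\le \left\| [x^k - \gamma_k \nabla f(x^k) + \beta_k (x^k - x^{k-1})] - [x^* - \gamma_k \nabla f(x^*)] \right\| \\
&\le \left\| [x^k - \gamma_k \nabla f(x^k)] - [x^* - \gamma_k \nabla f(x^*)] \right\| + \left\| \beta_k (x^k - x^* + x^* - x^{k-1}) \right\| \\
&\le \|x^k - x^*\| + \beta_k \|x^k - x^*\| + \beta_k \|x^{k-1} - x^*\|.
\end{aligned}$$

Applying Lemma 2 with $t_k = \|x^k - x^*\|$, we then prove the result. ∎

Now, we are prepared to present the rate of the function values when $F$ is not coercive.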

Theorem 2.

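Suppose the conditions of Lemma 5 hold. Let $(x^k)_{k \ge 0}$ be generated by the inertial proximal gradient algorithm. Then we have

$$F(x^k) - \min F = O\left(\frac{1}{k}\right).$$

Proof.

By Lemma 5, the sequence $(x^k)_{k \ge 0}$ is bounded, and it is easy to verify the boundedness of $(\varepsilon_k)_{k \ge 0}$. Thus, $\varepsilon_k \left( 2\|x^{k+1} - \overline{x^{k+1}}\|^2 + \|x^{k+1} - x^k\|^2 \right)$ is bounded. With almost the same proof as in Theorem 1, we then prove the result. ∎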

3.2 Linear convergence and sublinear convergence under optimal strong convexity condition

We say that the function $F$ satisfies the optimal strong convexity condition if

$$F(x) - \min F \ge \nu \|x - \overline{x}\|^2, \tag{3.19}$$

where $\overline{x}$ is the projection of $x$ onto the set $\arg\min F$, and $\nu > 0$. This condition is much weaker than strong convexity.

Theorem 3.

Suppose the conditions of Theorem 1 hold, and $F$ satisfies (3.19). Then we have

$$F(x^k) - \min F = O(\omega^k),$$

for some $0 < \omega < 1$.

Proof.

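With (3.19), we have

$$2\|x^{k+1} - \overline{x^{k+1}}\|^2 \le \frac{2}{\nu} \left( F(x^{k+1}) - \min F \right) \le \frac{2}{\nu} \xi_{k+1} \le \frac{2}{\nu} \xi_k.$$

On the other hand, from the definition (3.7),

$$\|x^k - x^{k-1}\|^2 \le \frac{1}{\delta_k} \xi_k.$$

With Lemma 4, we then derive

$$\xi_{k+1}^2 \le \varepsilon_k \left( \frac{1}{\delta_k} + \frac{2}{\nu} \right) (\xi_k - \xi_{k+1}) \cdot \xi_k.$$

With the assumption, $\sup_k \varepsilon_k \left( \frac{1}{\delta_k} + \frac{2}{\nu} \right) < +\infty$; we denote the bound by $\ell$. We then have

$$\xi_{k+1}^2 \le \ell (\xi_k - \xi_{k+1}) \cdot \xi_k.$$

If $\xi_k = 0$, we have $\xi_{k+1} = 0$, and the result certainly holds. If $\xi_k \ne 0$,

$$\left( \frac{\xi_{k+1}}{\xi_k} \right)^2 + \ell \left( \frac{\xi_{k+1}}{\xi_k} \right) - \ell \le 0.$$

With basic algebraic computation,

$$\frac{\xi_{k+1}}{\xi_k} \le \frac{2\ell}{\sqrt{\ell^2 + 4\ell} + \ell}.$$

By defining $\omega := \frac{2\ell}{\sqrt{\ell^2 + 4\ell} + \ell}$, we then prove the result. ∎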
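The basic algebraic computation in the proof can be spelled out as follows. Here $r$ denotes the ratio $\xi_{k+1}/\xi_k$ and $\ell > 0$; the symbol $r$ is notation introduced only for this worked step.

```latex
\begin{aligned}
r^2 + \ell r - \ell \le 0,\ r \ge 0
&\;\Longrightarrow\; r \le \frac{-\ell + \sqrt{\ell^2 + 4\ell}}{2}
= \frac{(\ell^2 + 4\ell) - \ell^2}{2\left(\sqrt{\ell^2 + 4\ell} + \ell\right)}
= \frac{2\ell}{\sqrt{\ell^2 + 4\ell} + \ell}, \\
\sqrt{\ell^2 + 4\ell} > \ell
&\;\Longrightarrow\; \frac{2\ell}{\sqrt{\ell^2 + 4\ell} + \ell} < 1 .
\end{aligned}
```

The second line shows that the resulting contraction factor is strictly below one for every $\ell > 0$, which is what makes the rate linear.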

Remark 2.

Compared with the previous linear convergence result presented in [12], our result enjoys three advantages: 1. The strong convexity assumption is weakened to (3.19). 2. More general parameter settings can be used. 3. The stepsizes and inertial parameters are independent of the strong convexity constant.

4 Cyclic coordinate descent inertial algorithm

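This part analyzes the cyclic coordinate inertial proximal algorithm. The two-block version was proposed in [15], which focuses on the nonconvex case. Here, we consider the multi-block version and prove its convergence rate under a convexity assumption. The minimization problem can be described as

$$\min_{x_1, x_2, \ldots, x_m} \left\{ D(x_1, x_2, \ldots, x_m) := H(x_1, x_2, \ldots, x_m) + \sum_{i=1}^{m} g_i(x_i) \right\}, \tag{4.1}$$

where $H$ and $g_i$ ($i \in [1, 2, \ldots, m]$) are all convex. We use the notation

$$\nabla_i^k H := \nabla_i H(x_1^{k+1}, \ldots, x_{i-1}^{k+1}, x_i^k, \ldots, x_m^k), \qquad x^k := (x_1^k, x_2^k, \ldots, x_m^k).$$

The cyclic coordinate descent inertial algorithm runs as follows: for $i$ from $1$ to $m$,

$$x_i^{k+1} = \mathrm{prox}_{\gamma_{k,i} g_i} \left[ x_i^k - \gamma_{k,i} \nabla_i^k H + \beta_{k,i} (x_i^k - x_i^{k-1}) \right], \tag{4.2}$$

where $\gamma_{k,i}$ and $\beta_{k,i}$ are the stepsize and the inertial parameter of the $i$-th block. iPALM can be regarded as the two-block case of this algorithm. The function $H$ is assumed to satisfy

$$\|\nabla_i H(x_1, x_2, \ldots, x_i^1, \ldots, x_m) - \nabla_i H(x_1, x_2, \ldots, x_i^2, \ldots, x_m)\| \le L_i \|x_i^1 - x_i^2\| \tag{4.3}$$

for any $(x_1, x_2, \ldots, x_m)$ and $x_i^1, x_i^2$, and $i \in [1, 2, \ldots, m]$. With (4.3), we can easily obtain

$$H(x_1, x_2, \ldots, x_i^1, \ldots, x_m) \le H(x_1, x_2, \ldots, x_i^2, \ldots, x_m) + \langle \nabla_i H(x_1, x_2, \ldots, x_i^2, \ldots, x_m), x_i^1 - x_i^2 \rangle + \frac{L_i}{2} \|x_i^1 - x_i^2\|^2. \tag{4.4}$$

The proof is similar to [10, Lemma 1.2.3] and will not be reproduced. In the rest of this paper, we use the following assumption.

A1: for any $i \in [1, 2, \ldots, m]$, the sequence $(\beta_{k,i})_{k \ge 0}$ is non-increasing.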
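Scheme (4.2) can be sketched in Python for one concrete instance. The instance is an assumption of this example, not something fixed by the text: $H(x) = \frac{1}{2}\|Ax - b\|^2$, scalar blocks $x_i$, and $g_i(x_i) = \lambda |x_i|$, so that $L_i = \|A_{:,i}\|^2$ and each proximal map is a scalar soft-threshold. The inertial parameter is kept constant, which satisfies A1; the stepsize rule $2(1-\beta)c/L_i$ mirrors the one used for the single-block case, and all names and data are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau * |.| (assumed g_i for this sketch)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def cyclic_inertial_cd(A, b, lam, beta=0.3, c=0.9, sweeps=200):
    """Scheme (4.2) with scalar blocks, H(x) = 0.5*||Ax - b||^2, g_i = lam*|x_i|."""
    m = A.shape[1]
    L_blocks = np.sum(A ** 2, axis=0)               # L_i = ||A[:, i]||^2
    gammas = 2.0 * (1.0 - beta) * c / L_blocks      # gamma_{k,i}, constant in k
    x = np.zeros(m)
    x_prev = np.zeros(m)
    residual = A @ x - b
    for _ in range(sweeps):
        for i in range(m):                          # cyclic index selection
            grad_i = A[:, i] @ residual             # partial gradient at the latest point
            z = x[i] - gammas[i] * grad_i + beta * (x[i] - x_prev[i])
            new_xi = soft_threshold(z, gammas[i] * lam)
            residual += A[:, i] * (new_xi - x[i])   # keep the residual in sync
            x_prev[i], x[i] = x[i], new_xi
    return x

def D(A, b, lam, x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

# Hypothetical test problem.
rng = np.random.default_rng(2)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
lam = 0.1
x_cd = cyclic_inertial_cd(A, b, lam)
D_start = D(A, b, lam, np.zeros(10))
D_end = D(A, b, lam, x_cd)
```

Maintaining the residual incrementally means each block update uses the already-updated blocks $x_1^{k+1}, \ldots, x_{i-1}^{k+1}$, as required by the definition of $\nabla_i^k H$.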

Lemma 6.

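Let $H$ be a convex function satisfying (4.3), let $g_i$ be convex ($i \in [1, 2, \ldots, m]$), and let $\min D$ be finite. Let $(x^k)_{k \ge 0}$ be generated by scheme (4.2), and suppose assumption A1 is satisfied. Choose the step size

$$\gamma_{k,i} = \frac{2(1 - \beta_{k,i})c}{L_i}, \quad i \in [1, 2, \ldots, m]$$

for arbitrary fixed $0 < c < 1$. Then, we can obtain

$$\left[ D(x^k) + \sum_{i=1}^{m} \frac{\beta_{k,i}}{2\gamma_{k,i}} \|x_i^k - x_i^{k-1}\|^2 \right] - \left[ D(x^{k+1}) + \sum_{i=1}^{m} \frac{\beta_{k+1,i}}{2\gamma_{k+1,i}} \|x_i^{k+1} - x_i^k\|^2 \right] \ge \frac{(1 - c)\underline{L}}{2c} \|x^{k+1} - x^k\|^2, \tag{4.5}$$

where $\underline{L} := \min_{i \in [1, 2, \ldots, m]} L_i$.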

Proof.

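For any $i \in [1, 2, \ldots, m]$,

$$\frac{x_i^k - x_i^{k+1}}{\gamma_{k,i}} - \nabla_i^k H + \frac{\beta_{k,i}}{\gamma_{k,i}} (x_i^k - x_i^{k-1}) \in \partial g_i(x_i^{k+1}). \tag{4.6}$$

With the convexity of $g_i$, we have

$$g_i(x_i^{k+1}) - g_i(x_i^k) \le \left\langle \frac{x_i^{k+1} - x_i^k}{\gamma_{k,i}} + \nabla_i^k H + \frac{\beta_{k,i}}{\gamma_{k,i}} (x_i^{k-1} - x_i^k),\ x_i^k - x_i^{k+1} \right\rangle. \tag{4.7}$$

With (4.4), we have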

 H(xk+11,…,xk+1i−1,xk+1i,xki+1…,xkm)−H(xk+11,…,xk+1i−1,xki,xki+1…,xkm) ≤⟨−∇kiH,xki−