Iteratively Linearized Reweighted Alternating Direction Method of Multipliers for a Class of Nonconvex Problems

In this paper, we consider solving a class of nonconvex and nonsmooth problems that frequently appear in signal processing and machine learning research. The traditional alternating direction method of multipliers encounters difficulties, both mathematical and computational, in solving the nonconvex and nonsmooth subproblem. In view of this, we propose a reweighted alternating direction method of multipliers. In this algorithm, all subproblems are convex and easy to solve. We also provide several guarantees for the convergence and prove that the algorithm globally converges to a critical point of an auxiliary function with the help of the Kurdyka-Łojasiewicz property. Several numerical results are presented to demonstrate the efficiency of the proposed algorithm.


1 Introduction

Minimization of composite functions with linear constraints finds various applications in signal and image processing, statistics, and machine learning, to name a few. Mathematically, such a problem can be presented as

 min_{x,y} {f(x) + g(y)  s.t.  Ax + By = c}, (1.1)

where g is usually the regularization function and f is usually the loss function.

The well-known alternating direction method of multipliers (ADMM) [1, 2] is a powerful tool for the problem mentioned above. The ADMM actually focuses on the augmented Lagrangian of (1.1), which reads as

 ˜L_α(x, y, p) := f(x) + g(y) + ⟨p, Ax + By − c⟩ + (α/2)∥Ax + By − c∥₂², (1.2)

where α > 0 is a parameter. The ADMM minimizes one variable and fixes the others in each iteration; the dual variable p is updated by a feedback strategy. Mathematically, the standard ADMM can be presented as

 x^{k+1} = argmin_x ˜L_α(x, y^k, p^k),
 y^{k+1} = argmin_y ˜L_α(x^{k+1}, y, p^k),
 p^{k+1} = p^k + α(Ax^{k+1} + By^{k+1} − c). (1.3)

The ADMM algorithm attracts increasing attention for its efficiency in dealing with sparsity-related problems [3, 4, 5, 6, 7]. Obviously, the ADMM carries a self-explanatory assumption: all the subproblems shall be solved efficiently. In fact, if the proximal maps of f and g are easy to calculate, the linearized ADMM [8] proposes a linearization technique to solve the subproblems efficiently; each subproblem then only needs to compute the proximal map of f or g once. The core part of the linearized ADMM lies in linearizing the quadratic penalty term (α/2)∥Ax + By − c∥₂² in each iteration. The linearized ADMM is also called the preconditioned ADMM in [9]; in fact, it is also a special case of the Chambolle-Pock primal-dual algorithm [10] under a particular parameter setting. In the later paper [11], the linearized ADMM is further generalized as the Bregman ADMM.
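As a concrete illustration of the one-prox-per-iteration idea, the following toy sketch (our own construction, not code from the paper) applies a linearized ADMM to a small one-dimensional total-variation denoising problem min_x (1/2)∥x − b∥₂² + λ∥Tx∥₁, written as f(x) + g(y) with constraint Tx − y = 0. Linearizing the quadratic penalty at x^k makes the x-update explicit, so each iteration costs one soft-thresholding and a few matrix-vector products; all parameter values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, alpha = 30, 0.1, 1.0
T = np.eye(n - 1, n, 1) - np.eye(n - 1, n)      # forward-difference matrix
r = alpha * np.linalg.norm(T, 2) ** 2 + 0.5     # need r >= alpha*||T||^2
b = np.repeat([0.0, 1.0, 0.0], 10) + 0.05 * rng.standard_normal(n)

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x, y, p = b.copy(), T @ b, np.zeros(n - 1)
for _ in range(5000):
    # x-update: quadratic penalty linearized at x^k -> explicit formula
    grad = T.T @ (p + alpha * (T @ x - y))
    x = (b + r * x - grad) / (1.0 + r)
    # y-update: exactly one soft-thresholding (prox of lam*||.||_1)
    y = soft(T @ x + p / alpha, lam / alpha)
    # dual update
    p = p + alpha * (T @ x - y)

obj = lambda z: 0.5 * np.linalg.norm(z - b) ** 2 + lam * np.abs(T @ z).sum()
print(np.linalg.norm(T @ x - y), obj(x))
```

After enough iterations the constraint residual ∥Tx − y∥₂ is driven toward zero and the objective drops below its value at the noisy input b.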

1.1 Motivating example and problem formulation

This subsection contains two parts: the first presents an example and discusses the problems with directly using the ADMM; the second describes the problem considered in this paper.

1.1.1 A motivating example: the problems in directly using ADMM

The methods mentioned above are readily applicable provided the subproblems are relatively easy to solve, i.e., either the proximal maps of f and g or the subproblems themselves are assumed to be easily solvable. However, the nonconvex cases may not always grant such a convenience. We recall the problem [25] which arises in imaging science

 min_u {(1/2)∥f − Ψu∥₂² + σ∥Tu∥_{q,ε}^q}, (1.4)

where T is the total variation operator and 0 < q < 1 (ε > 0 is a smoothing parameter). By denoting v = Tu, the problem then becomes

 min_{u,v} {(1/2)∥f − Ψu∥₂² + σ∥v∥_{q,ε}^q,  s.t.  Tu − v = 0}. (1.5)

The direct ADMM for this problem can be presented as

 v^{k+1} = argmin_v {σ∥v∥_{q,ε}^q + ⟨p^k, v⟩ + (α/2)∥v − Tu^k∥₂²},
 u^{k+1} = argmin_u {(1/2)∥f − Ψu∥₂² − ⟨p^k, Tu⟩ + (α/2)∥v^{k+1} − Tu∥₂²},
 p^{k+1} = p^k + α(v^{k+1} − Tu^{k+1}). (1.6)

The first subproblem in this algorithm requires minimizing a nonconvex and nonsmooth function. For a few special values of q (e.g., q = 1/2 or q = 2/3), the point v^{k+1} can be explicitly calculated, because the proximal map of ∥·∥_q^q then has a closed form. But for other q, the proximal map cannot be easily derived. Thus, we must employ iterative algorithms to compute v^{k+1}. That indicates three drawbacks which cannot be ignored:

1. The stopping criterion is hard to set due to the nonconvexity. (Convex methods usually enjoy a convergence rate, which can guide the stopping rule.)

2. The error may accumulate over the iterations due to the inexact numerical solution of the subproblem.

3. Even if the subproblem can be numerically solved without any error, the numerical solution is always a critical point rather than the "real" argmin, due to the nonconvexity.

In fact, other penalty functions, such as the Logistic function [26], the Exponential-Type Penalty (ETP) [27], the Geman penalty [28], and the Laplace penalty [29], encounter the same problem.
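To make the drawbacks above concrete, the sketch below (the function name and grid-search strategy are ours, not from the paper) approximates the scalar prox min_v σ|v|^q + (α/2)(v − t)², which a direct ADMM would have to solve numerically inside every iteration for general q. For q = 1 the result can be checked against the soft-thresholding closed form.

```python
import numpy as np

def prox_power_q(t, sigma, alpha, q, grid=200001):
    """Approximate argmin_v sigma*|v|^q + (alpha/2)*(v - t)^2 numerically.
    The minimizer shares the sign of t and lies in [0, |t|], so a dense
    grid search over that interval suffices for a 1-D illustration."""
    vs = np.linspace(0.0, abs(t), grid)
    vals = sigma * vs ** q + 0.5 * alpha * (vs - abs(t)) ** 2
    return np.sign(t) * vs[np.argmin(vals)]

# Sanity check against the one easy case, q = 1, where the prox is the
# soft-thresholding operator sign(t)*max(|t| - sigma/alpha, 0).
approx = prox_power_q(2.0, 0.5, 1.0, 1.0)
exact = np.sign(2.0) * max(abs(2.0) - 0.5 / 1.0, 0.0)
print(approx, exact)
```

For q < 1 no such closed form is available in general, and an inexact inner solver of this kind is exactly the source of the accumulated error and stopping-rule difficulties listed above.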

1.1.2 Optimization problem and basic assumptions

In this paper, we consider the following problem

 min_{x,y} f(x) + ∑_{i=1}^N g[h(y_i)]  s.t.  Ax + By = c, (1.7)

where the functions f, g, and h satisfy the following assumptions:

A.1 f is a differentiable convex function with a Lipschitz continuous gradient, i.e.,

 ∥∇f(x) − ∇f(y)∥₂ ≤ L_f ∥x − y∥₂,  ∀ x, y ∈ ℝ^N. (1.8)

Moreover, the function f is strongly convex with constant δ.

A.2 h is convex and proximable, i.e., its proximal map is easy to compute.

A.3 g is a differentiable concave function with a Lipschitz continuous gradient whose Lipschitz continuity modulus is bounded by L_g; that is,

 |g′(s) − g′(t)| ≤ L_g |s − t|, (1.9)

and g′(t) ≥ 0 when t ≥ 0.

It is easy to see that problem (1.5) can be regarded as a special case of (1.7) with suitable choices of g and h (e.g., h(t) = |t| together with a smoothed concave power function g). The augmented Lagrangian of model (1.7) is

 L_α(x, y, p) = f(x) + ∑_{i=1}^N g[h(y_i)] + ⟨p, Ax + By − c⟩ + (α/2)∥Ax + By − c∥₂², (1.10)

where α > 0 is a parameter.

1.2 Linearized ADMM meets the iteratively reweighted strategy: convexifying the subproblems

In this part, we present the algorithm for solving problem (1.7). The term ∑_{i=1}^N g[h(y_i)] has a deep relationship with several iteratively reweighted style algorithms [30, 31, 32, 33, 34]. Although the function h may be nondifferentiable itself, the reweighted style methods still offer an elegant way out: linearization of the outside function g. Precisely, in the k-th iteration of the iteratively reweighted style algorithms, the term g[h(y_i)] is usually replaced by g′[h(y_i^k)]·h(y_i), where y^k was obtained in the previous iteration. The extensions of reweighted style methods to matrix cases are considered and analyzed in [35, 36, 37, 38, 39]. In fact, the iteratively reweighted technique is a special majorization-minimization technique, which has also been adopted in ADMM [40]. Compared with [40], the main difference in our paper is the exploitation of the specific structure of the problem in nonconvex settings. Motivated by the iteratively reweighted strategy, we propose the following scheme for solving (1.7)

 y^{k+1} = argmin_y {∑_{i=1}^N g′[h(y_i^k)] h(y_i) + ⟨p^k + α(Ax^k + By^k − c), By⟩ + (r/2)∥y − y^k∥₂²},
 x^{k+1} = argmin_x {f(x) + ⟨p^k, Ax⟩ + (α/2)∥Ax + By^{k+1} − c∥₂²},
 p^{k+1} = p^k + α(Ax^{k+1} + By^{k+1} − c). (1.11)

We combine both the linearized ADMM and the reweighted algorithm in the new scheme: for the nonconvex part g[h(·)], we linearize the outside function g and keep h, which yields the convexity of the subproblem; for the quadratic part (α/2)∥Ax + By − c∥₂², the linearization enables the use of the proximal map of h. We call this new algorithm the Iteratively Linearized Reweighted Alternating Direction Method of Multipliers (ILR-ADMM). It is easy to see that each subproblem in this scheme is convex. With the expression of proximal maps, updating y can be equivalently presented in the following form

 y_i^{k+1} = prox_{(g′[h(y_i^k)]/r)·h}( y_i^k − B_i^⊤(α(Ax^k + By^k − c) + p^k)/r ), (1.12)

where i = 1, …, N, and B_i denotes the i-th column of the matrix B. In many applications, f is a quadratic function, and then solving the x-subproblem is also very easy. With this form, the algorithm can be programmed in the absence of any inner loop.
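A minimal runnable sketch of the scheme (1.11)/(1.12) on a toy instance. All concrete choices here are our own illustrative assumptions, not prescribed by the paper: f(x) = (1/2)∥x − b∥₂², h = |·| (so the prox is soft-thresholding), g(t) = σ log(1 + t/ε) (concave with g′ > 0), A = I, B = −I, c = 0. The y-update is then exactly one reweighted soft-thresholding and the x-update a diagonal linear solve, so there is no inner loop.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, eps = 20, 0.5, 0.1
alpha, r = 3.0, 4.0     # with A = I, B = -I one has eta = delta = 1, so
                        # alpha > 2*eta/delta = 2 and r > alpha*||B||^2 = 3
x_true = np.zeros(n)
x_true[[3, 8, 15]] = [2.0, -1.5, 1.0]
b = x_true + 0.05 * rng.standard_normal(n)

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x, y, p = b.copy(), b.copy(), np.zeros(n)
for _ in range(1000):
    # weights g'[h(y_i^k)] = sigma/(eps + |y_i^k|) for the y-update (1.12)
    w = sigma / (eps + np.abs(y))
    y = soft(y + (alpha * (x - y) + p) / r, w / r)
    # x-update: strongly convex quadratic, solved exactly in closed form
    x = (b + alpha * y - p) / (1.0 + alpha)
    # dual update
    p = p + alpha * (x - y)

print(np.linalg.norm(x - y))   # constraint residual, here ||Ax + By - c||
```

Large entries of b survive (their weight σ/(ε + |y_i|) is small) while near-zero entries are thresholded to zero, which is precisely the effect of the reweighting.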

1.3 Contribution and Organization

In this paper, we consider a class of nonconvex and nonsmooth problems which are ubiquitous in applications. Direct use of the ADMM leads to troubles in both computation and mathematics due to the nonconvexity of the subproblem. In view of this, we propose the iteratively linearized reweighted alternating direction method of multipliers for these problems. The new algorithm is a combination of the iteratively reweighted strategy and the linearized ADMM. All the subproblems in the proposed algorithm are convex and easy to solve provided the proximal map of h is easy to compute and f is quadratic. Compared with the direct application of ADMM to problem (1.7), we now list the advantages of the new algorithm:

1. Computational perspective: each iteration just needs to compute the proximal map of h once and minimize a quadratic problem; the computational cost of each iteration is low.

2. Practical perspective: without any inner loop, the programming is very easy.

3. Mathematical perspective: all the subproblems are convex and exactly solved. Thus, we get the "real" argmin everywhere, which makes the mathematical convergence analysis solid and meaningful.

With the help of the Kurdyka-Łojasiewicz property, we provide the convergence results of the algorithm with proper selections of the parameters. The applications of the new algorithm to the signal and image processing are presented. The numerical results demonstrate the efficiency of the proposed algorithm.

The rest of this paper is organized as follows. Section 2 introduces the preliminaries, including the definitions of the subdifferential and the Kurdyka-Łojasiewicz property. Section 3 provides the convergence analysis; the core ingredients are an auxiliary Lyapunov function and the boundedness of the generated sequence. Section 4 applies the proposed algorithm to image deblurring, and several comparisons are reported. Finally, Section 5 concludes the paper.

2 Preliminaries

We introduce the basic tools used in the analysis: the subdifferential and the Kurdyka-Łojasiewicz property. These two notions play important roles in variational analysis.

2.1 Subdifferential

Given a lower semicontinuous function J: ℝ^N → (−∞, +∞], its domain is defined by

 dom(J) := {x ∈ ℝ^N : J(x) < +∞}.

The graph of a real extended-valued function J is defined by

 graph(J):={(x,v)∈RN×R:v=J(x)}.

Now, we are prepared to present the definition of subdifferential. More details can be found in [41].

Definition 1.

Let J: ℝ^N → (−∞, +∞] be a proper and lower semicontinuous function.

1. For a given x ∈ dom(J), the Fréchet subdifferential of J at x, written as ∂̂J(x), is the set of all vectors u ∈ ℝ^N satisfying

 liminf_{y→x, y≠x} [J(y) − J(x) − ⟨u, y − x⟩] / ∥y − x∥₂ ≥ 0.

When x ∉ dom(J), we set ∂̂J(x) := ∅.

2. The (limiting) subdifferential, or simply the subdifferential, of J at x, written as ∂J(x), is defined through the following closure process:

 ∂J(x) := {u ∈ ℝ^N : ∃ x^k → x, J(x^k) → J(x) and u^k ∈ ∂̂J(x^k) → u as k → ∞}.

Note that ∂̂J(x) ⊆ ∂J(x) for each x ∈ ℝ^N. When J is convex, this definition agrees with the subgradient in convex analysis [42], which is defined as

 ∂J(x):={v∈RN:J(y)≥J(x)+⟨v,y−x⟩  for  any  y∈RN}.

It is easy to verify that the Fréchet subdifferential is convex and closed while the limiting subdifferential is closed. Denoting

 graph(∂J) := {(x, v) ∈ ℝ^N × ℝ^N : v ∈ ∂J(x)},

graph(∂J) is thus a closed set. Let (x^k, v^k) be a sequence in graph(∂J) such that (x^k, v^k) converges to (x, v) as k → ∞. If, in addition, J(x^k) converges to J(x) as k → ∞, then (x, v) ∈ graph(∂J). This indicates the following simple proposition.

Proposition 1.

If x^k → x, v^k → v, v^k ∈ ∂J(x^k) for all k, and J(x^k) → J(x) (if J is continuous, this last condition certainly holds whenever x^k → x), then we have

 v∈∂J(x). (2.1)

A necessary condition for x ∈ ℝ^N to be a minimizer of J is

 0∈∂J(x). (2.2)

When J is convex, condition (2.2) is also sufficient.

Definition 2.

A point that satisfies (2.2) is called a (limiting) critical point. The set of critical points of J is denoted by crit(J).

Proposition 2.

If (x^*, y^*, p^*) is a critical point of L_α with any α > 0, it must hold that

 −B^⊤p^* ∈ W^* ∂h(y^*),  −A^⊤p^* = ∇f(x^*),  Ax^* + By^* − c = 0,

where L_α is defined in (1.10) and W^* := diag(g′[h(y_1^*)], …, g′[h(y_N^*)]).

Proof.

With [Proposition 10.5, [41]], we have

 ∂(∑_{i=1}^N g[h(x_i)]) = ∂(g[h(x_1)]) × … × ∂(g[h(x_N)]). (2.3)

Noting that g is differentiable and h is convex, a direct computation gives

 ∂̂(g[h(x_i)]) = g′[h(x_i)] · ∂h(x_i). (2.4)

Moreover, by definition, we can obtain

 ∂(g[h(x_i)]) = g′[h(x_i)] · ∂h(x_i). (2.5)

This proves the first equation; the second and third are quite easy. ∎

2.2 Kurdyka-Łojasiewicz function

The domain of the subdifferential is given as

 dom(∂J) := {x ∈ ℝ^N : ∂J(x) ≠ ∅}.
Definition 3.

(a) The function J is said to have the Kurdyka-Łojasiewicz property at x̄ ∈ dom(∂J) if there exist η ∈ (0, +∞], a neighborhood U of x̄, and a continuous concave function φ: [0, η) → ℝ₊ such that

1. φ(0) = 0.

2. φ is C¹ on (0, η).

3. for all s ∈ (0, η), φ′(s) > 0.

4. for all x in U ∩ {x : J(x̄) < J(x) < J(x̄) + η}, it holds that

 φ′(J(x) − J(x̄)) · dist(0, ∂J(x)) ≥ 1. (2.6)

(b) Proper closed functions which satisfy the Kurdyka-Łojasiewicz property at each point of dom(∂J) are called KL functions.

More details can be found in [43, 44, 45]. In the remainder of the paper, we abbreviate Kurdyka-Łojasiewicz as KL. Directly checking whether a function is KL is hard, but proper closed semi-algebraic functions [45] help a great deal.

Definition 4.

(a) A subset S of ℝ^N is a real semi-algebraic set if there exists a finite number of real polynomial functions g_{ij}, h_{ij} such that

 S = ⋃_{j=1}^p ⋂_{i=1}^q {u ∈ ℝ^N : g_{ij}(u) = 0 and h_{ij}(u) < 0}.

(b) A function h: ℝ^N → ℝ ∪ {+∞} is called semi-algebraic if its graph

 {(u, t) ∈ ℝ^{N+1} : h(u) = t}

is a semi-algebraic subset of ℝ^{N+1}.

Better yet, semi-algebraicity enjoys many nice properties, and various kinds of functions are KL [46]. We list just a few of them here:

• Real closed polynomial functions.

• Indicator functions of closed semi-algebraic sets.

• Finite sums and products of closed semi-algebraic functions.

• The composition of closed semi-algebraic functions.

• Sup/Inf type functions; e.g., sup{g(u, v) : v ∈ C} is semi-algebraic when g is a closed semi-algebraic function and C a closed semi-algebraic set.

• The closed cone of PSD matrices, closed Stiefel manifolds, and closed sets of constant-rank matrices.

Lemma 1 ([45]).

Let J be a proper and closed function. If J is semi-algebraic, then it satisfies the KL property at any point of dom(J).

The previous definition and property of KL concern a certain point of dom(J). In fact, the property has been extended to a certain closed set [47], and this extension makes KL-based convergence proofs much easier.

Lemma 2.

Let J be a proper lower semicontinuous function and Ω be a compact set. If J is a constant on Ω and satisfies the KL property at each point of Ω, then there exist ε > 0, η > 0, and a concave function φ satisfying the four properties given in Definition 3 such that for any x̄ ∈ Ω and any x satisfying dist(x, Ω) < ε and J(x̄) < J(x) < J(x̄) + η, it holds that

 φ′(J(x) − J(x̄)) · dist(0, ∂J(x)) ≥ 1. (2.7)

3 Convergence analysis

In this part, the function L_α is defined in (1.10). We provide the convergence guarantees and the convergence analysis of ILR-ADMM (Algorithm 1). We first present a sketch of the proofs, which also gives the big picture of the purpose of each lemma and theorem:

• In the first step, we bound the dual variables by the primal points (Lemma 3).

• In the second step, the sufficient descent condition is derived for a new Lyapunov function (Lemma 4).

• In the third step, we provide several conditions to bound the points (Lemma 5).

• In the fourth step, the relative error condition is proved (Lemma 6).

• In the last step, we prove the convergence under semi-algebraic assumption (Theorem 1).

The proofs in our paper are closely related to the seminal papers [20, 21, 22] in several proof treatments. In fact, some proofs follow their techniques. For example, in Lemma 3, we employ the method used in [Lemma 3, [21]] to bound the dual variable by the primal variables. In Lemma 5, boundedness of the sequence is also proved in a similar way to [Theorem 3, [22]]. Beyond these detailed issues, in the large picture, the keystones are also similar to [20, 21, 22]: we also prove the sufficient descent and subdifferential bound for a Lyapunov function, and the boundedness of the generated points.

However, the proofs in our paper still differ from [20, 21, 22] in various aspects. The novelties mainly lie in deriving the sufficient descent and subdifferential bound based on the specific structure of our problem. Note that in each iteration, we minimize the linearized surrogate L_α^k defined in (3.10) rather than L_α itself. Thus, the previous methods cannot be directly used in our paper. By exploiting the structural properties of the problem, we establish these two conditions.

Lemma 3.

If

 Im(B) ∪ {c} ⊆ Im(A), (3.1)

then we have

 ∥p^k − p^{k+1}∥₂² ≤ η ∥x^{k+1} − x^k∥₂², (3.2)

where η := L_f²/θ² and θ is the smallest strictly positive singular value of A.

Proof.

The second step in each iteration actually gives

 ∇f(x^{k+1}) = −A^⊤(α(Ax^{k+1} + By^{k+1} − c) + p^k). (3.3)

With the expression of p^{k+1} = p^k + α(Ax^{k+1} + By^{k+1} − c), this becomes

 ∇f(x^{k+1}) = −A^⊤p^{k+1}. (3.4)

Replacing k + 1 with k, we can obtain

 ∇f(x^k) = −A^⊤p^k. (3.5)

Under condition (3.1), p^k − p^{k+1} ∈ Im(A); and subtraction of the two equations above gives

 ∥p^k − p^{k+1}∥₂ ≤ (1/θ)∥A^⊤(p^k − p^{k+1})∥₂ ≤ ∥∇f(x^{k+1}) − ∇f(x^k)∥₂/θ ≤ (L_f/θ)∥x^{k+1} − x^k∥₂. (3.6) ∎

Remark 1.

If condition (3.1) holds and p^0 ∈ Im(A), we have p^k ∈ Im(A) for all k. Then, from (3.5), we have that

 ∥p^k∥₂ ≤ (1/θ)∥∇f(x^k)∥₂. (3.7)

We will use this inequality in bounding the sequence.

Remark 2.

The condition (3.1) is satisfied if A is surjective. However, in many applications, the matrix A may fail to be surjective. For example, for a matrix U ∈ ℝ^{N×N}, we consider the operator

 T(U) = (DU, UD^⊤) ∈ ℝ^{(N−1)×N} × ℝ^{N×(N−1)}, (3.8)

where D ∈ ℝ^{(N−1)×N} is the forward difference operator. Noting that N² < 2N(N − 1) when N > 2, T cannot be surjective in this case. However, the current convergence theory of nonconvex ADMM is all based on the surjectivity assumption on A or condition (3.1), which is also used in our analysis. How to remove condition (3.1) in the nonconvex ADMM deserves further research.
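For concrete matrices, condition (3.1) can be checked numerically with two rank comparisons: Im(B) ⊆ Im(A) iff rank([A B]) = rank(A), and c ∈ Im(A) iff rank([A c]) = rank(A). The helper below is a sketch (the function name is ours, for illustration).

```python
import numpy as np

def range_condition_holds(A, B, c, tol=1e-10):
    """Check condition (3.1): Im(B) ∪ {c} ⊆ Im(A), via matrix ranks."""
    rA = np.linalg.matrix_rank(A, tol)
    ok_B = np.linalg.matrix_rank(np.hstack([A, B]), tol) == rA
    ok_c = np.linalg.matrix_rank(np.hstack([A, c.reshape(-1, 1)]), tol) == rA
    return ok_B and ok_c

# A surjective A satisfies (3.1) for every B and c ...
print(range_condition_holds(np.eye(3), np.ones((3, 2)), np.ones(3)))
# ... while a rank-deficient A generally does not.
A_bad = np.array([[1.0, 0.0], [0.0, 0.0]])
print(range_condition_holds(A_bad, np.eye(2), np.zeros(2)))
```

For ill-conditioned problems the tolerance passed to the rank computation matters; a relative tolerance tied to the largest singular value of A is a reasonable choice.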

Now, we introduce some notation for the following lemma. Denote the variable d and the sequences d^k, z^k as

 d := (x, y, p),  d^k := (x^k, y^k, p^k),  z^k := (x^k, y^k). (3.9)

The following auxiliary function is used throughout the proofs:

 L_α^k(x, y, p) := f(x) + ∑_{i=1}^N g′[h(y_i^k)] h(y_i) + ⟨p, Ax + By − c⟩ + (α/2)∥Ax + By − c∥₂². (3.10)
Lemma 4 (Descent).

Let the sequence (x^k, y^k, p^k) be generated by ILR-ADMM. If condition (3.1) and the condition

 α > max{1, 2η/δ},  r > α∥B∥₂² (3.11)

hold, then there exists ν > 0 such that

 L_α(d^k) − L_α(d^{k+1}) ≥ ν ∥z^{k+1} − z^k∥₂², (3.12)

where ν := min{δ/2 − η/α, (r − α∥B∥₂²)/2}.

Proof.

Direct calculation shows that the first step actually minimizes, with respect to y, the function L_α^k(x^k, ·, p^k) with its quadratic term linearized at y^k plus the proximal term (r/2)∥y − y^k∥₂². Thus, we have

 L_α^k(x^k, y^{k+1}, p^k) + ((r − α∥B∥₂²)/2)∥y^{k+1} − y^k∥₂² ≤ L_α^k(x^k, y^{k+1}, p^k) + (y^{k+1} − y^k)^⊤(rI − αB^⊤B)(y^{k+1} − y^k)/2 ≤ L_α^k(x^k, y^k, p^k).

Similarly, x^{k+1} actually minimizes L_α^k(x, y^{k+1}, p^k) with respect to x. Noting that the term (α/2)∥Ax + By^{k+1} − c∥₂² is convex in x, with assumption A.1, the strong convexity constant of L_α^k(·, y^{k+1}, p^k) is at least δ, and hence

 L_α^k(x^{k+1}, y^{k+1}, p^k) + (δ/2)∥x^{k+1} − x^k∥₂² ≤ L_α^k(x^k, y^{k+1}, p^k). (3.13)

Direct calculation yields

 L_α^k(x^{k+1}, y^{k+1}, p^{k+1}) = L_α^k(x^{k+1}, y^{k+1}, p^k) + ⟨p^{k+1} − p^k, Ax^{k+1} + By^{k+1} − c⟩ = L_α^k(x^{k+1}, y^{k+1}, p^k) + (1/α)∥p^{k+1} − p^k∥₂². (3.14)

Combining the equations above, we can have

 L_α^k(x^k, y^k, p^k) ≥ L_α^k(x^{k+1}, y^{k+1}, p^{k+1}) + (δ/2)∥x^{k+1} − x^k∥₂² + ((r − α∥B∥₂²)/2)∥y^{k+1} − y^k∥₂² − (1/α)∥p^{k+1} − p^k∥₂². (3.15)

Noting that g is concave, we have

 ∑_{i=1}^N g[h(y_i^k)] − ∑_{i=1}^N g[h(y_i^{k+1})] = ∑_{i=1}^N {g[h(y_i^k)] − g[h(y_i^{k+1})]} ≥ ∑_{i=1}^N g′[h(y_i^k)][h(y_i^k) − h(y_i^{k+1})] = ∑_{i=1}^N g′[h(y_i^k)] h(y_i^k) − ∑_{i=1}^N g′[h(y_i^k)] h(y_i^{k+1}). (3.16)

Then, we can derive

 L_α(x^k, y^k, p^k) − L_α(x^{k+1}, y^{k+1}, p^{k+1})
 = ∑_{i=1}^N g[h(y_i^k)] − ∑_{i=1}^N g[h(y_i^{k+1})] + f(x^k) + ⟨p^k, Ax^k + By^k − c⟩ + (α/2)∥Ax^k + By^k − c∥₂² − {f(x^{k+1}) + ⟨p^{k+1}, Ax^{k+1} + By^{k+1} − c⟩ + (α/2)∥Ax^{k+1} + By^{k+1} − c∥₂²}
 ≥ ∑_{i=1}^N g′[h(y_i^k)] h(y_i^k) − ∑_{i=1}^N g′[h(y_i^k)] h(y_i^{k+1}) + f(x^k) + ⟨p^k, Ax^k + By^k − c⟩ + (α/2)∥Ax^k + By^k − c∥₂² − {f(x^{k+1}) + ⟨p^{k+1}, Ax^{k+1} + By^{k+1} − c⟩ + (α/2)∥Ax^{k+1} + By^{k+1} − c∥₂²}
 = L_α^k(x^k, y^k, p^k) − L_α^k(x^{k+1}, y^{k+1}, p^{k+1})
 ≥ (δ/2)∥x^{k+1} − x^k∥₂² + ((r − α∥B∥₂²)/2)∥y^{k+1} − y^k∥₂² − (1/α)∥p^{k+1} − p^k∥₂².

With Lemma 3, we then have

 L_α(x^k, y^k, p^k) − L_α(x^{k+1}, y^{k+1}, p^{k+1}) ≥ (δ/2 − η/α)∥x^{k+1} − x^k∥₂² + ((r − α∥B∥₂²)/2)∥y^{k+1} − y^k∥₂². (3.17)

Setting ν := min{δ/2 − η/α, (r − α∥B∥₂²)/2}, we then obtain the result. ∎

In fact, condition (3.11) can always be satisfied in applications because the parameters α and r are both selected by the user. Different from the ADMMs in the convex setting, the parameter α here is not arbitrary; it should be sufficiently large.
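The descent property (3.12), and the dual bound (3.2) it relies on, can be observed numerically. The toy instance below uses our own illustrative choices (f(x) = (1/2)∥x − b∥₂², h = |·|, g(t) = σ log(1 + t/ε), A = I, B = −I, c = 0), for which L_f = δ = θ = 1 and hence η = 1, so α = 3 > max{1, 2η/δ} and r = 4 > α∥B∥₂² satisfy (3.11); the augmented Lagrangian (1.10) should then be nonincreasing along the iterates.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, eps, alpha, r = 15, 0.5, 0.1, 3.0, 4.0
b = rng.standard_normal(n)
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def L_alpha(x, y, p):
    # L_alpha of (1.10) for this instance: A = I, B = -I, c = 0
    return (0.5 * np.linalg.norm(x - b) ** 2
            + sigma * np.log1p(np.abs(y) / eps).sum()
            + p @ (x - y) + 0.5 * alpha * np.linalg.norm(x - y) ** 2)

x, y, p = b.copy(), b.copy(), np.zeros(n)
vals, gaps = [L_alpha(x, y, p)], []
for _ in range(200):
    w = sigma / (eps + np.abs(y))
    y = soft(y + (alpha * (x - y) + p) / r, w / r)
    x_old, p_old = x, p
    x = (b + alpha * y - p) / (1.0 + alpha)
    p = p + alpha * (x - y)
    # Lemma 3 (here L_f = theta = 1): ||p^{k+1}-p^k|| <= ||x^{k+1}-x^k||
    gaps.append(np.linalg.norm(p - p_old) - np.linalg.norm(x - x_old))
    vals.append(L_alpha(x, y, p))

diffs = np.diff(vals)
print(diffs.max(), max(gaps))   # both should be <= 0 up to round-off
```

The monotone decrease of L_α is exactly what Theorem 1 later exploits through the KL machinery.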

Lemma 5 (Boundedness).

If p^0 ∈ Im(A) and conditions (3.1) and (3.11) hold, and there exists σ₀ > 0 such that

 inf_x {f(x) − σ₀∥∇f(x)∥₂²} > −∞, (3.18)

and

 α ≥ 1/(2σ₀θ²). (3.19)

The sequence (x^k, y^k, p^k) is bounded if one of the following conditions holds:

B1. f is coercive, and ∑_{i=1}^N g[h(y_i)] is coercive.

B2. f is coercive, and B is invertible.

B3. c = 0, ∑_{i=1}^N g[h(y_i)] is coercive, and B is invertible.

Proof.

We have

 L_α(d^k) = f(x^k) + g(y^k) + ⟨p^k, Ax^k + By^k − c⟩ + (α/2)∥Ax^k + By^k − c∥₂²
 = f(x^k) + g(y^k) − ∥p^k∥₂²/(2α) + (α/2)∥Ax^k + By^k − c + p^k/α∥₂²
 = f(x^k) + g(y^k) − σ₀θ²∥p^k∥₂² + (σ₀θ² − 1/(2α))∥p^k∥₂² + (α/2)∥Ax^k + By^k − c + p^k/α∥₂²
 ≥ f(x^k) − σ₀∥∇f(x^k)∥₂² + g(y^k) + (σ₀θ² − 1/(2α))∥p^k∥₂² + (α/2)∥Ax^k + By^k − c + p^k/α∥₂², (3.20)
where the last inequality follows from (3.7).

Noting that L_α(d^k) is decreasing by Lemma 4, L_α(d^k) ≤ L_α(d^0). We can then see that {f(x^k) − σ₀∥∇f(x^k)∥₂² + g(y^k)}, {∥p^k∥₂}, and {∥Ax^k + By^k − c + p^k/α∥₂} are all bounded. It is easy to see that if one of the three conditions holds, the sequence (x^k, y^k, p^k) will be bounded. ∎

Remark 3.

The condition (3.18) holds for many quadratic functions [22, 48]. This condition also implies that the function f behaves similarly to a quadratic function, i.e., that its growth is benign.

Remark 4.

Both assumptions B2 and B3 actually imply condition (3.7).

Remark 5.

Combining conditions (3.19) and (3.11), we then have

 α > max{1, 2η/δ, 1/(2σ₀θ²)},  r > α∥B∥₂². (3.21)

To determine α and r, we need to obtain η, δ, σ₀, and θ first. Computing these constants may be hard because they may fail to have explicit forms. Thus, in the experiments, we use an increasing technique introduced in [49].
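The exact procedure of [49] is not reproduced here; the sketch below only illustrates the generic flavor of such a continuation heuristic (all thresholds, factors, and the toy instance are our own assumptions): start from a small α, and whenever the constraint residual stalls, multiply α by a factor ρ while refreshing r so that r > α∥B∥₂² keeps holding.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, eps = 15, 0.5, 0.1
alpha, alpha_max, rho = 0.5, 64.0, 2.0      # start small, cap the growth
b = rng.standard_normal(n)
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x, y, p = b.copy(), b.copy(), np.zeros(n)
res_prev = np.inf
for _ in range(400):
    r = alpha + 1.0                          # keep r > alpha*||B||^2 (= alpha here)
    w = sigma / (eps + np.abs(y))
    y = soft(y + (alpha * (x - y) + p) / r, w / r)
    x = (b + alpha * y - p) / (1.0 + alpha)
    p = p + alpha * (x - y)
    res = np.linalg.norm(x - y)
    if res > 0.9 * res_prev and alpha < alpha_max:
        alpha = min(rho * alpha, alpha_max)  # residual stalled: grow alpha
    res_prev = res

print(alpha, np.linalg.norm(x - y))
```

Growing α only when progress stalls avoids committing to a conservative (large) penalty from the start, while still eventually enforcing the constraint tightly.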

If conditions (