# Sparse Optimization Problem with s-difference Regularization

In this paper, a s-difference type regularization for sparse recovery problem is proposed, which is the difference of the normal penalty function R(x) and its corresponding struncated function R (xs). First, we show the equivalent conditions between the L0 constrained problem and the unconstrained s-difference penalty regularized problem. Next, we choose the forward-backward splitting (FBS) method to solve the nonconvex regularizes function and further derive some closed-form solutions for the proximal mapping of the s-difference regularization with some commonly used R(x), which makes the FBS easy and fast. We also show that any cluster point of the sequence generated by the proposed algorithm converges to a stationary point. Numerical experiments demonstrate the efficiency of the proposed s-difference regularization in comparison with some other existing penalty functions.

## Authors

• 7 publications
• 2 publications
• 73 publications
• 2 publications
• 4 publications
07/03/2014

### Global convergence of splitting methods for nonconvex composite optimization

We consider the problem of minimizing the sum of a smooth function h wit...
05/01/2016

### Further properties of the forward-backward envelope with applications to difference-of-convex programming

In this paper, we further study the forward-backward envelope first intr...
10/23/2020

### A Feasible Level Proximal Point Method for Nonconvex Sparse Constrained Optimization

Nonconvex sparse models have received significant attention in high-dime...
10/16/2017

### A successive difference-of-convex approximation method for a class of nonconvex nonsmooth optimization problems

We consider a class of nonconvex nonsmooth optimization problems whose o...
05/07/2012

### Risk estimation for matrix recovery with spectral regularization

In this paper, we develop an approach to recursively estimate the quadra...
06/28/2018

### Successive Convex Approximation Algorithms for Sparse Signal Estimation with Nonconvex Regularizations

In this paper, we propose a successive convex approximation framework fo...
09/17/2021

### Learning Sparse Graph with Minimax Concave Penalty under Gaussian Markov Random Fields

This paper presents a convex-analytic framework to learn sparse graphs f...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

### I-a Background

In recent years, sparse optimization problems have drawn lots of attention in many applications such as compressive sensing, machine learning, image processing and medical imaging. Signal and image processing problems are usually expressed as

 A(x)+n=b (1)

where is the linear or non-linear operator, is the observation data, and represents the observation noise or error. Since problem (1) is often ill-posed and the error is unknown, solving (1) is difficulty. To overcome this ill-posed problem, we need to make some constraints to narrow the solution space, such as the prior sparsity of the signals. Then the problem can be formulated as

 minxϕ(x)+P(x) (2)

where the loss function

is the data fidelity term related to (1), for example, the least square (LS) loss function or the least-absolute (LA) loss function ; is the regularizes function to penalize the sparsity of . Intuitively, should be selected as the -norm , represents the number of nonzero elements in . However, minimizing the -norm is equivalent to finding the sparsest solution, which is known to be NP-hard problem. A favorite and popular approach is using the -norm convex approximation, i.e., to replace the [1]. This model has been widely used in many different applications, such as radar systems [2-3], communications [4], computed tomography (CT) [5] and magnetic resonant imaging (MRI) [6]. It has been proved that the signal can be recovered by this model under some assumption of the operator , such as the restricted isometry property (RIP) of when the operator is a sensing matrix [1]. However, the -norm regularization tends to underestimate high-amplitude components of as it penalizes the amplitude uniformly, unlike -norm in which all nonzero entries have equal contributions. This may lead to reconstruction failures with the least measurements [7-8], and brings undesirable blocky images [9-10]. It is quite well-known that the when it promotes sparsity, the -norm does not provide a performance close to that of the -norm, and lots of theoretical and experimental results in CS and low-rank matrix recovery suggest that better approximations of the -normand matrix rank give rise to better performances.

Recently, researchers began to investigate various non-convex regularizes to replace the -norm regularization and gain some better reconstructions. In particular, the (quasi)-norm with

[11-16], can be regarded as a interpolation between the

and , and a continuation strategy to approximate the as . The optimization strategies include half thresholding [14, 17-20] and iterative reweighting [11-12, 15]. Other non-convex regularizations and algorithms have also been designed to outperform -norm regularization and seek better reconstruction: capped -norm [21-23], transformed -norm [24-26], sorted -norm [27-28], the difference of the and -norms () [29-31], the log-sum penalty (LSP) [8], smoothly clipped absolute deviation (SCAD) [32-33], minimax-concave penalty (MCP) [34-36].

On the other hand, there are some approaches which do not approximate the -norm, such as the iterative hard thresholding (IHT) algorithm [37-38], which operate directly on the regularized cost function or the -sparse constrained optimization problem. Moreover, there are some acceleration methods for the IHT: accelerated IHT (AIHT) [39], proximal IHT (PIHT) [40], extrapolated proximal IHT (EPIHT) [41] and accelerated proximal IHT [42]. Meanwhile, there are some researchers transformed the -norm problem into an equivalent difference of two convex functions, and then using the difference of convex algorithm (DCA) and the proximal gradient technique to solve the subproblem [43-44].

To address these nonconvex regularization problems, many iterative algorithms are investigated by researchers, such as the DCA [45-48] (or Convex-ConCave Procedure (CCCP) [49], or the Multi-Stage (MS) convex relaxation [22]), and its accelerate versions: Boosted Difference of Convex function Algorithms (BDCA) [50] and proximal Difference-of-Convex Algorithm with extrapolation (pDCAe) [51], the alternating direction method of multipliers (ADMM) [52], split Bregman iteration (SBI) [53], General Iterative Shrinkage and Thresholding (GIST) [54], nonmonotone accelerated proximal gradient (nmAPG) [55], which is an extension of the APG [56].

### I-B Contributions

In many applications, the non-convex -norm based regularization has its advantages over the convex -norm , such as image restoration [41, 53, 57-58], bioluminescence [59], CT [9-10], MRI reconstruction [60-61]. Thus, in this paper, we are interested in the following constrained problem

 minxϕ(x) subject to ∥x∥0≤s (3)

i.e. this -sparse problem tries to find the solution minimizing under the constraint that the number of non-zero coefficients below a certain value, where .

This paper can be viewed as a natural complement and extension of Gotoh et al. framework [43]. First, we rewrite the constrained problem (3) as difference of two functions, one of which is the convex or nonconvex function and the other is the corresponding truncated function . Then, we consider the unconstrained minimization problem by using this -difference type regularizations. Second, we propose fast approaches to deal with this non-convex regularizes function, which is based on a proximal operator corresponding to . Moreover, we derive some cheap closed-form solutions for the proximal mapping of with some commonly used , such as , , , LSP, MCP and so on. Third, we prove the convergence performance of the proposed algorithm, and show that any cluster point of the sequence generated by the proposed algorithm converges to a stationary point. We also show a link between the proposed algorithm with some related regularizations and algorithms. Finally, we evaluate the effectiveness of the proposed algorithm via numerical experiments. The reconstruction results demonstrate that the proposed difference penalty function with closed-form solutions is more accurate than the -norm and other non-convex regularization based methods, and faster than the DCA based algorithms.

### I-C Outline and notation

The rest of this paper is structured as follows. In section 2, we define the constrained sparse optimization. In section 3, we propose the reconstruction algorithm by using the proximal operator with closed-form solutions. In section 4, we provide some theorems to demonstrate the convergence of the proposed algorithm. Section 5 presents the numerical results. In the end, we provide our conclusion in section 6.

Here, we define our notation. For a vector

, it can be written as , and its -norm is defined as . Especially, the -norm of is defined as . Given a matrix , the transpose of is denoted by

, the maximum eigenvalue of

is defined as . Some of the arguments in this paper use sub-vectors. The letters , denote sets of indices that enumerate the elements in the vector . By using this sets as subscripts, represents the vector that setting all elements of to zero except those in the set . The iteration count is given in square bracket, e.g., . denotes the inner product, represents the sign of a quantity with . We also use the notation , and if the function is defined as the composition , then we write .

Given a proper closed function , the subgradient of at is given by

 ∂h(x)={v∈Rn:h(u)−h(x)−⟨v,u−x⟩≥0,∀u∈Rn} (4)

In addition, if is continuously differentiable, then the subdifferential reduces to the gradient of denoted by .

## Ii Penalty representation for s-sparse problem

Inspired by Gotoh et al. work of [43], in which they expressed the -norm constraint as a difference of convex (DC) function:

 ∥x∥0≤s⇔∥x∥1−∥|x|∥s=0 (5)

where and , which named top- norm, denotes the sum of top-elements in absolute value. This notation is also known as the largest- norm (or called CVaR norm in [62-63]). Precisely,

 ∥|x|∥s:=∣∣xπx(1)∣∣+∣∣xπx(2)∣∣+⋯+∣∣xπx(s)∣∣ (6)

where denotes the element whose absolute value is the -th largest among the elements of vector , i.e., . For convenience of description, we define the set , then we have . By using as the set difference, we have .

In this work, we consider a more general -difference function instead of to replace the -norm constraint, where can be convex or nonconvex, separable or non-separable, and is the best term approximation to , that is, any -sparse vectors that minimize . By using the definition of , we have

 xsi={xi,ifi∈Γsx0,ifi∈ΓNx∖Γsx (7)

Let , , we defined a class of penalty functions as follows (without loss of generality, functions and mentioned thought this paper all satisfy Property 1 ).

###### Property 1.

The penalty functions satisfy the following properties.

(a)

(b)

(c) is a continuous function which can be written as the difference of two convex (DC) functions, that is, , where and are convex functions.

###### Proposition 1.

The penalty functions listed on Table 1 all satisfy Property 1.

See appendix A for the Proof of Proposition 1.

Remark 1. For the separable , and is continuous, symmetrical and strictly increasing on , if is convex, then satisfies Property 1; if is nonconvex, while it can be written as the difference of two convex functions as , then also satisfies Property 1.

It is easy to see that the penalty function in Ref. [43] is a special case of .

With the Property 1(b), we consider the following unconstrained minimization problem associated with (3):

 minx∈RN{F(x)=ϕ(x)+ρP(x)} (8)

where is the penalty parameter. We make the following assumptions on the above formulation thought the paper, which are standard in image processing and many CS field.

###### Assumption 1.

is continuously differentiable with Lipschitz continuous gradient, i.e., there exists such that

 ∥∇ϕ(x)−∇ϕ(y)∥2≤L∥x−y∥2,∀x,y∈RN (9)
###### Assumption 2.

is bounded from below.

From (8), we can find that the difference between penalty and other penalty function, such as , , and MCP, is that there is no punishment in model (8) when the sparsity level of is under , since is equal to zero as . Meanwhile, the selection of the weighting parameter has importance influence on the performance of the reconstruction. On the one hand, should be big enough to give a heavy cost for constraint violation: . On the other hand, if is too big, the reconstruction is mostly over regularized. In light of this, we need the adjust the value of iteratively based on the convergence speed. The next Theorem ensures that problem (8) is equivalent to the original -sparse constraint problem (3) as we take the limit of , which can be proved in a similar manner to Theorem 17.1 in [71].

###### Theorem 1.

Let be an increasing sequence with and suppose that is an optimal solution of (8) with . Then, any limit point of is also optimal to (3).

See Appendix B for the proof.

In addition to Theorem 1, we have some stricter conclusions for the parameter under some assumptions of and .

###### Proposition 2.

If is Lipschitz continuous with constant , i.e., , and is an optimal solution of (8) with some . Suppose that there exists a constant such that for any . Then if , is also optimal to (3).

See Appendix C for the proof.

Remark 2. Suppose that is -Lipschitz continuous and the regularization is . Then if , any optimal solution of (8) is also optimal to (3).

Remark 3. Suppose that is -Lipschitz continuous. If we choose as , then any optimal solution of (8) is also optimal to (3) when . This can be proved by using that

 ∥x∥2−∥∥x+xs−xs+1∥∥2 =∥∥xs+1−xs∥∥22∥x∥2+∥x+xs−xs+1∥2 (10) ≤∥∥xs+1−xs∥∥22√s

If we choose , then the condition of is that . Meanwhile, we can obtain similar conclusions for the which are the difference of and MCP, or SCAD functions.

The next proposition, which is similar to Theorem 3 in [43], but with wider scope and stricter conclusion, shows another exact penalty parameters requirement for with Lipschitz continuous gradient .

###### Proposition 3.

If Assumption 1 is satisfied and is an optimal solution of (8) with some . Suppose that there exists a constant such that for any , and there exists a constant such that for any , Then if , is also optimal to (3).

See Appendix D for the proof.

Remark 4. Suppose that and . If we choose as , () and (), then any optimal solution of (8) is also optimal to (3) when , and , respectively.

Remark 5. Similarly to Theorem 4 in [43] by replacing penalty function with ordinary function , we have the following conclusions without proof. If the conditions in Proposition 3 are satisfied, and suppose that , where is symmetric and , then is also optimal to (3) if , where denotes the unit vector in the -th coordinate direction.

## Iii Forward-Backward Splitting for the regularization of difference of two functions

In this section, we use the FBS to solve the unconstrained minimization (8). Moreover, we derive closed-form solutions for the proximal mapping of some special regularization -difference , and this makes FBS more efficient.

### Iii-a Forward-Backward Splitting and proximal operator

Each iteration of forward-backward splitting applies the gradient descent of followed by a proximal operator. That is

 x[k+1]=proxβρP(x[k]−β∇ϕ(x[k])) (11)

where is the step size, and the FBS is sometimes called the proximal gradient (PG) algorithm. The proximal operator is defined as

 proxλP(y)=argminx∈RN∥x−y∥222λ+P(x) (12)

with parameter .

The equation (11) can be broken up into a forward gradient step using the function , and a backward step using the function . The proximal operator plays a central role in the analysis and solution of optimization problems. For example, the soft shrinkage operator, which is a proximal operator for -norm regularizer, has been widely used in CS and rendering many efficient algorithms. The proximal operator also has been successfully used with some nonconvex regularizers, such as , SCAD, LSP [64], and MCP [52, 65]. Usually, the closed-form solution of the proximal operator needs some special properties on , such as convexity or separability (e.g., the -norm, LSP, MCP, and other various separable functions in [66]), Next, we will focus on the solution of (12) with separable and non-separable -difference .

### Iii-B Closed-form solution of the proximal operator

Denote as

 E(x)=∥x−y∥222λ+P(x) (13)

Let be the optimal solution of (12), i.e., , then we have the following Proposition.

if and only if .

###### Proof:

Necessary condition: note that for any , and when , we have . Thus if , the optimal solution is . Sufficient condition: assume by contradiction that , then we select an arbitrary non-zero dimension in , and construct as . Then we have

 E(x∗)=E(0)=12λN∑i=1y2i>12λN∑i=1,i≠jy2i=E(~x) (14)

This contradicts the optimality of . Thus if , must be equal to zero. ∎

###### Proposition 5.

For , if , then we have . If , then we have .

###### Proof:

We prove it by establishing contradiction. If there exits any when , then we select an arbitrary one and we construct as . We have

 (15)

The inequality follows from that has the opposite sign as and . Since we have not changed the absolute value of and , then we have . Combing this and (15), we have . This contradicts the optimality of and proves that when . On the other hand, we can prove that when by using a similar method. This completes the proof. ∎

Next, we focus on the closed-form solutions of with different types of .

###### Proposition 6.

If is separable, i.e., and each is strictly increasing on , we have

 x∗i={yi,if i∈Γsy(IN+λ∂R)−1(y)i,if i∈ΓNy∖Γsy (16)

where denotes the identity operator, and is the index of the -th largest amplitude of , i.e., .

See Appendix E for the proof.

Remark 6. Note that if in (16). Suppose that there exits one or more components of , having the same amplitude of , i.e., , . Then there exits solutions of as there are arrangements of .

Remark 7. If , then the solution of (12) is

 x∗i={yi,if i∈Γsyshrink(yi,λ),if i∈ΓNy∖Γsy (17)

where denotes the soft shrinkage operator given by

 shrink(yi,λ)=sign(yi)max{|yi|−λ,0} (18)

Remark 8. If , then the solution of (12) is

 x∗i={yi,if i∈Γsyyi/yi(2λ+1),(2λ+1),if i∈ΓNy∖Γsy (19)

Remark 9. If is the MCP (A.3), that is (),then the solution is: under the condition of , if or , then ; otherwise . When , if or , then ; otherwise . If is the LSP (A.2), that is , then the solution is: if , then ; otherwise , and , where is a set composed of 3 elements or 1 element. If , then

 Ω={0,max{ξ1,0},max{ξ2,0}} (20)

where and . Otherwise, .

Proposition 6 gives the solution of the (12) under the conditions of with separable and strictly increasing properties. In fact, there are some other commonly used separable and non-convex also have the closed-form solution similar as (16), such as with [14], however, these does not satisfy the Property 1(c), so they are not within the scope of this article. Next, we consider two special non-separable cases as the reference for other non-separable regularizations.

###### Proposition 7.

If , then the solution of (12) is that: when ,

 (21)

when ,

 x∗i=√∥y−ys∥22+(∥ys∥2+λ)2−λ√∥y−ys∥22+(∥ys∥2+λ)2yi (22)

See Appendix F for the proof.

###### Proposition 8.

If , , then the solution of (12) is that:

1) When , for ,

 (23)

for ,

 x∗i=⎛⎜ ⎜⎝1+aλ√∥z−zs∥22+(∥ys∥2−aλ)2⎞⎟ ⎟⎠zi (24)

where for , and for .

2) When , if , , , and suppose that there are components of having the same amplitude of , i.e., . is an optimal solution of (12) if and only if it satisfies , , and