# A projected gradient method for αℓ_1-βℓ_2 sparsity regularization

The non-convex α∥·∥ℓ1−β∥·∥ℓ2 (α ≥ β ≥ 0) regularization has attracted attention in the field of sparse recovery. One way to obtain a minimizer of this regularization is the ST-(αℓ1−βℓ2) algorithm, which is similar to the classical iterative soft thresholding algorithm (ISTA). It is known that ISTA converges quite slowly, and a faster alternative to ISTA is the projected gradient (PG) method. However, the conventional PG method is limited to the classical ℓ1 sparsity regularization. In this paper, we present two accelerated alternatives to the ST-(αℓ1−βℓ2) algorithm by extending the PG method to the non-convex αℓ1−βℓ2 sparsity regularization. Moreover, we discuss a strategy to determine the radius R of the ℓ1-ball constraint by Morozov's discrepancy principle. Numerical results are reported to illustrate the efficiency of the proposed approach.


## 1 Introduction

In this paper, we are interested in solving an ill-posed operator equation of the form

 Ax=y, (1.1)

where x ∈ ℓ2 is sparse and A : ℓ2 → Y is a linear and bounded operator mapping between the sequence space ℓ2 and a Banach space Y with norms ∥·∥ℓ2 and ∥·∥Y, respectively. In practice, the right-hand side y is known only approximately, with an error up to a level δ > 0. Therefore, we assume that we know δ and yδ ∈ Y with ∥yδ−y∥Y ≤ δ. The most commonly adopted technique to solve problem (1.1) is the ℓp-norm sparsity regularization with 1 ≤ p ≤ 2; see the monographs [18, 39] and the special issues [4, 13, 24, 25] for many developments on regularizing properties and minimization schemes. Since the ℓ1-norm regularization does not always provide the ‘sparsest’ solution, the non-convex ℓp-norm sparsity regularization with 0 ≤ p < 1 was proposed as an alternative. For the ℓ0 sparsity regularization, see [6, 7, 9, 20] for the iterative hard thresholding algorithm. We refer the reader to [23, 26, 31] for some other types of alternatives to the ℓ1-norm.

The investigation of the non-convex αℓ1−βℓ2 regularization has attracted attention in the field of sparse recovery over the last five years; see [15, 27, 30, 46, 47] and the references therein. In [15], we investigated the well-posedness and convergence rate of the non-convex sparsity regularization of the form

 minJδα,β(x)=1q∥Ax−yδ∥qY+Rα,β(x) (1.2)

in the ℓ2 space, where

 Rα,β(x):=α∥x∥ℓ1−β∥x∥ℓ2,α≥β≥0, q≥1.

Denoting η = β/α, we can equivalently express the function in (1.2) as

 1q∥Ax−yδ∥qY+αRη(x),

where

 Rη(x):=∥x∥ℓ1−η∥x∥ℓ2,α>0, 1≥η≥0.

For the particular case q = 2, we provided an ST-(αℓ1−βℓ2) algorithm of the form

 xk+1=Sα/λ((1+β/(λ∥xk∥ℓ2))xk−(1/λ)A∗(Axk−yδ)) (1.3)

for (1.2), where 1/λ is the step size and λ > 0. Obviously, the ST-(αℓ1−βℓ2) algorithm is similar to the classical ISTA when the step size 1/λ = 1. In [12], an ISTA of the form

 xk+1=Sα(xk−A∗(Axk−yδ)) (1.4)

was first proposed to solve the classical sparsity regularization of the form

 minJδα(x)=12∥Ax−yδ∥2Y+α∥x∥ℓ1. (1.5)

As an alternative to the ℓp-norm with 0 < p < 1, the function Rη has the desired property that it is a good approximation of a multiple of the ℓ0-norm. Moreover, Rη has a simpler structure than the ℓp-norm from the perspective of computation. The ST-(αℓ1−βℓ2) algorithm can easily be implemented; see [15, 21, 47] for several other algorithms for αℓ1−βℓ2 sparsity regularization. However, the ST-(αℓ1−βℓ2) algorithm, in general, can be arbitrarily slow and is computationally intensive. So it is desirable to develop accelerated versions of the ST-(αℓ1−βℓ2) algorithm, especially for large-scale ill-posed inverse problems.
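For orientation, the classical ISTA iteration (1.4) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the constant step size gamma = 1/∥A∥² is an assumption (equation (1.4) takes a unit step after rescaling A so that ∥A∥ < 1).

```python
import numpy as np

def soft_threshold(x, alpha):
    # Component-wise soft thresholding S_alpha (Definition 2.4).
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def ista(A, y, alpha, n_iter=500):
    # Classical ISTA (1.4) for min (1/2)||Ax - y||^2 + alpha*||x||_1.
    # Step size gamma = 1/||A||_2^2 (an assumption ensuring convergence
    # of the gradient step); the threshold is scaled accordingly.
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - gamma * A.T @ (A @ x - y), gamma * alpha)
    return x
```

On a small overdetermined problem with a 2-sparse signal, a few thousand such iterations recover the support, which already hints at the slow convergence discussed above.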

### 1.1 Some accelerated algorithms for ISTA

Searching for accelerated algorithms for ISTA has become popular, and several faster algorithms have been proposed. In [5, 14, 17, 45], several accelerated projected gradient methods are provided. A comparison among several accelerated algorithms is given in [28], including the “fast ISTA” of [2]. Applying a smoothing technique from Nesterov ([32]), a fast and accurate first-order method is proposed for solving large-scale compressed sensing problems ([3]). In [11], a simple heuristic adaptive restart technique is introduced, which can dramatically improve the convergence rate of accelerated gradient schemes. In [10], convergence of the iterates of the “Fast Iterative Shrinkage/Thresholding Algorithm” is established. In [33], a new iterative regularization procedure for inverse problems based on the use of Bregman distances is studied; numerical results show that the proposed method gives significant improvement over the standard method. An explicit algorithm based on a primal-dual approach for the minimization of an ℓ1-penalized least-squares function with a non-separable term is proposed in [29]. An iteratively reweighted least squares algorithm and the corresponding convergence analysis for the regularization of linear inverse problems with sparsity constraints are investigated in [19]. For a projected gradient method for nonlinear ill-posed problems, see [40].

Unfortunately, the algorithms stated above are limited to the classical ℓ1-norm sparsity regularization. Though there is great potential for accelerated algorithms in sparsity regularization with a non-convex penalty term, to the best of our knowledge, little work can be found in the literature. In [35], the authors treat the problem of minimizing a general continuously differentiable function subject to ∥x∥ℓ0 ≤ s, where s is an integer and ∥x∥ℓ0 is the ℓ0-norm of x, which counts the number of nonzero components in x. In this paper, we extend the projected gradient method to the non-convex αℓ1−βℓ2 sparsity regularization. There are two reasons why we choose the PG method. First, its formulation is simple and it can easily be implemented. Second, it converges quite fast. So it is suitable for solving large-scale ill-posed problems.

The PG method was introduced in [14] to accelerate the ISTA. It is shown there that the ISTA converges relatively fast initially, then it overshoots the ℓ1-norm penalty, and it takes many steps to re-correct back. That is, the algorithm generates a path that is initially fully contained in the ℓ1-ball BR; it then leaves the ball and slowly inches back to it in the limit. To avoid this long “external” detour, the authors of [14] proposed an accelerated algorithm by substituting the projection PR, defined in Definition 2.5, for the soft thresholding operation. This leads to a projected gradient method of the form

 xk+1=PR(xk−γkA∗(Axk−yδ)). (1.6)
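For concreteness, iteration (1.6) can be sketched as follows. The ℓ1-ball projection uses the standard sort-based procedure, which realizes the soft-thresholding characterization recalled in Lemma 2.7; the constant step size γk = 1/∥A∥² is an illustrative assumption, not the paper's step-size rule.

```python
import numpy as np

def project_l1(x, R):
    # Euclidean projection P_R onto the l1-ball of radius R (Definition 2.5).
    # Inside the ball the point is unchanged; otherwise the projection equals
    # a soft thresholding S_mu(x) for a suitable mu (Lemma 2.7), located here
    # by the standard sort-based procedure.
    if np.abs(x).sum() <= R:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, x.size + 1) > css - R)[0][-1]
    mu = (css[k] - R) / (k + 1.0)
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def projected_gradient(A, y, R, n_iter=1000):
    # PG iteration (1.6): x_{k+1} = P_R(x_k - gamma_k A^T(A x_k - y)),
    # with a constant step size gamma_k = 1/||A||_2^2 (an assumption).
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = project_l1(x - gamma * A.T @ (A @ x - y), R)
    return x
```

Every iterate stays inside BR by construction, which is exactly the behavior that avoids the “external” detour of ISTA described above.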

### 1.2 Contribution and organization

Since the ST-(αℓ1−βℓ2) algorithm (1.3) is similar to ISTA (1.4), inspired by [14], we propose two accelerated alternatives to (1.3) by extending the PG method to solve (1.2).

The first accelerated algorithm is based on the generalized conditional gradient method (GCGM). In [15], based on GCGM, we proposed the ST-(αℓ1−βℓ2) algorithm, where the crucial issue is to determine zk by solving the optimization problem

 minz⟨A∗(Axk−yδ)−λxk−βxk∥xk∥ℓ2,z⟩+λ2∥z∥2ℓ2+α∥z∥ℓ1. (1.7)

In this paper, we show that the problem (1.7) can be solved by a PG method of the form

 zk=PR(xk+βxkλ∥xk∥ℓ2−1λA∗(Axk−yδ)). (1.8)

With zk at our disposal, we compute xk+1 by xk+1 = xk + sk(zk − xk), where sk ∈ (0, 1] is the step size.

Theoretically, the radius R of the ℓ1-ball should be chosen as R = ∥xδα,β∥ℓ1 ([14]), where xδα,β is a minimizer of (1.2). However, in general, one cannot obtain this value before starting the iteration (1.8). In this paper, we utilize Morozov’s discrepancy principle to determine R. This method only requires knowledge of the noise level δ and the observed data yδ. Moreover, we investigate the well-posedness of (1.2) under Morozov’s discrepancy principle.

The second accelerated algorithm is based on the surrogate function approach. We investigate this algorithm in the finite dimensional space ℝn. For the case q = 2, (1.2) takes the form

 minJδα,β(x)=12∥Ax−yδ∥2ℓ2+α∥x∥ℓ1−β∥x∥ℓ2, (1.9)

where A : ℝn → ℝm is a linear and bounded operator between finite dimensional spaces equipped with ℓ2 norms. In the following, we replace the penalty in (1.9) by a constraint with a certain radius R. In analogy to the projection techniques in [14, 41], a natural strategy is to consider the constrained optimization problem of the form

 min12∥Ax−yδ∥2ℓ2subject to x∈B′R:={x∈Rn∣∥x∥ℓ1−η∥x∥ℓ2≤R},1≥η≥0. (1.10)

However, since B′R is non-convex, it is challenging to analyze and solve this constrained optimization problem. To utilize the theory of convex constraints, we move the term −β∥x∥ℓ2 from the constraint into the objective and consider instead the following optimization problem of the form

 minDδβ(x)=12∥Ax−yδ∥2ℓ2−β∥x∥ℓ2subject to x∈BR:={x∈Rn∣∥x∥ℓ1≤R} (1.11)

for a suitable R > 0. We propose a projected gradient method of the form

 xk+1=PR(xk+βxk+1λ∥xk+1∥ℓ2−1λA∗(Axk−yδ)) (1.12)

for (1.11), where λ > 0 satisfies some conditions; see Assumption 4.6.

An outline of the rest of this paper is as follows. In the next section we introduce the notation and review results on Tikhonov regularization and the PG method. In Section 3, we investigate an accelerated algorithm via GCGM. Furthermore, we give a strategy to determine the radius R of the ℓ1-ball constraint. In Section 4, we propose another accelerated algorithm via the surrogate function approach. Finally, we present results from numerical experiments on compressive sensing and image deblurring problems in Section 5.

## 2 Preliminaries

Before starting the discussion of the accelerated algorithms, we briefly introduce some notation and review results on Tikhonov regularization and the PG method. Let

 xδα,β=argminx{12∥Ax−yδ∥2Y+Rα,β(x)} (2.1)

be a minimizer of the regularization function in (1.2) with q = 2 for every α ≥ β ≥ 0. We denote by Lδα,β the set of all minimizers xδα,β, and by xR a solution of (1.11). We use the following definition of an Rη-minimum solution ([15]).

###### Definition 2.1

An element x† ∈ ℓ2 is called an Rη-minimum solution of the linear problem (1.1) if

 Ax†=y  and  Rη(x†)=minx{Rη(x)∣Ax=y}.

We recall the definition of sparsity ([12]).

###### Definition 2.2

An element x ∈ ℓ2 is called sparse if supp(x) := {i ∣ xi ≠ 0} is finite, where xi is the i-th component of x. ∥x∥ℓ0 := #supp(x) is the cardinality of supp(x). If ∥x∥ℓ0 ≤ s for some s ∈ ℕ, then x is called s-sparse.

###### Definition 2.3

(Morozov’s discrepancy principle) For 1 < τ1 ≤ τ2, we choose the regularization parameter α = α(δ, yδ) > 0 such that

 τ1δ≤∥Axδα,β−yδ∥Y≤τ2δ (2.2)

holds for some minimizer xδα,β of (1.2).

Next we recall definitions of the soft thresholding and the projection operators ([5, 12]).

###### Definition 2.4

For a given α > 0, the soft thresholding operator Sα is defined as

 Sα(x)=∑iSα(xi)ei,

where {ei} is the canonical basis of ℓ2, xi is the i-th component of x, and

 Sα(t) = t−α if t ≥ α,  0 if |t| < α,  t+α if t ≤ −α.
###### Definition 2.5

The projection PR onto the ℓ1-ball BR is defined by

 PR(^x):={argminx∥x−^x∥ℓ2 subject to ∥x∥ℓ1≤R},

which gives the projection of an element ^x ∈ ℓ2 onto the ℓ1-norm ball with radius R.

Then we review two results from [14] on relations between the soft thresholding operator and the projection operator. For relations between the parameters α and R, see [14, Fig. 2].

###### Lemma 2.6

For some countable index set Λ, denote x = ∑i∈Λ xiei, xi ∈ ℝ. For any fixed x ∈ ℓ1 and for μ ≥ 0, ∥Sμ(x)∥ℓ1 is a piecewise linear, continuous, decreasing function of μ. Moreover, ∥S0(x)∥ℓ1 = ∥x∥ℓ1 and ∥Sμ(x)∥ℓ1 = 0 for μ ≥ maxi|xi|.

###### Lemma 2.7

If ∥x∥ℓ1 > R, then the projection of x on the ℓ1-ball with radius R is given by PR(x) = Sμ(x), where μ (depending on x and R) is chosen such that ∥Sμ(x)∥ℓ1 = R. If ∥x∥ℓ1 ≤ R, then PR(x) = x.
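Lemma 2.7 suggests a simple way to compute the projection numerically: since μ ↦ ∥Sμ(x)∥ℓ1 is continuous and decreasing (Lemma 2.6), the value of μ with ∥Sμ(x)∥ℓ1 = R can be located by bisection. The sketch below is illustrative and not the implementation used in [14]:

```python
import numpy as np

def soft_threshold(x, mu):
    # Component-wise soft thresholding S_mu (Definition 2.4).
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def project_l1_bisection(x, R, tol=1e-12):
    # Lemma 2.7: if ||x||_1 > R, then P_R(x) = S_mu(x) where mu solves
    # ||S_mu(x)||_1 = R. By Lemma 2.6 the map mu -> ||S_mu(x)||_1 decreases
    # continuously from ||x||_1 at mu = 0 to 0 at mu = max_i |x_i|,
    # so mu can be found by bisection.
    if np.abs(x).sum() <= R:
        return x.copy()              # Lemma 2.7: P_R(x) = x inside the ball
    lo, hi = 0.0, np.abs(x).max()
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.abs(soft_threshold(x, mu)).sum() > R:
            lo = mu                  # threshold too small: still outside B_R
        else:
            hi = mu
    return soft_threshold(x, hi)
```

The result can be sanity-checked against the variational inequality of Lemma 2.8: ⟨w − PR(x), x − PR(x)⟩ ≤ 0 for every w in the ball.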

Finally, recall the following properties of PR ([14]).

###### Lemma 2.8

Let H be a Hilbert space with inner product ⟨·,·⟩ and norm ∥·∥H. For any x ∈ H, PR(x) is characterized as the unique vector in BR such that

 ⟨w−PR(x),x−PR(x)⟩≤0∀w∈BR.

Moreover, the projection PR is non-expansive:

 ∥PR(x′)−PR(x′′)∥H≤∥x′−x′′∥H∀x′,x′′∈H.

## 3 The projected gradient method via GCGM

In [15], we proposed an ST-(αℓ1−βℓ2) algorithm for (1.2) based on GCGM. We rewrite Jδα,β in (1.2) as

 Jδα,β(x)=F(x)+Φ(x),

where

 F(x) =12∥Ax−yδ∥2Y−Θ(x),
 Φ(x) =Θ(x)+α∥x∥ℓ1−β∥x∥ℓ2,
 Θ(x) =λ2∥x∥2ℓ2+β∥x∥ℓ2,  λ>0.

The ST-(αℓ1−βℓ2) algorithm is stated in the form of Algorithm 1. Convergence of Algorithm 1 is given in Theorem 3.1; see [15, Theorem 3.5] for its proof.

###### Theorem 3.1

Let {xk} denote the sequence generated by Algorithm 1. Then {xk} contains a convergent subsequence, and every convergent subsequence of {xk} converges to a stationary point of the function Jδα,β in (1.2) with q = 2.

A crucial step in Algorithm 1 is the determination of zk as a solution of

 minCδα,β,λ(z,xk)=⟨A∗(Axk−yδ)−λxk−βxk∥xk∥ℓ2,z⟩+λ2∥z∥2ℓ2+α∥z∥ℓ1. (3.1)

In [15], we solved (3.1) by

 zk=Sα/λ((βλ∥xk∥ℓ2+1)xk−1λA∗(Axk−yδ)). (3.2)

However, the iteration based on (3.2) is known to converge quite slowly. To accelerate the ST-(αℓ1−βℓ2) algorithm, we transform (3.1) into an ℓ1-ball constrained optimization problem of the form

 minDδβ,λ(z,xk)=⟨A∗(Axk−yδ)−λxk−βxk∥xk∥ℓ2,z⟩+λ2∥z∥2ℓ2,  β≥0,
 subject to the ℓ1-ball BR:={z∈ℓ2∣∥z∥ℓ1≤R}. (3.3)

Since Dδβ,λ(z,xk) and ∥z∥ℓ1 are convex with respect to the variable z, problem (3.3) is equivalent to (3.1) for a certain R = R(α) ([36, Theorem 27.4], [48, Theorem 47.E]). In Lemma 3.2, we show that the problem (3.3) can be solved by a PG method of the form

 zk=PR(xk+βxkλ∥xk∥ℓ2−1λA∗(Axk−yδ)). (3.4)
###### Lemma 3.2

An element ^z ∈ BR is a minimizer of (3.3) if and only if

 ^z=PR(xk+βxkλ∥xk∥ℓ2−1λA∗(Axk−yδ)) (3.5)

for any λ > 0, which is equivalent to

 ⟨xk+βxkλ∥xk∥ℓ2−1λA∗(Axk−yδ)−^z,z−^z⟩≤0∀z∈BR. (3.6)

Proof. Note that ^z is a solution of (3.3) if and only if, for any z ∈ BR, the function f(t) := Dδβ,λ(^z+t(z−^z),xk) of t ∈ [0,1] attains its minimum at t = 0. Since f is quadratic and convex, a necessary and sufficient condition for this is f′(0+) ≥ 0. Easily,

 f′(0+)=⟨A∗(Axk−yδ)−λxk−βxk∥xk∥ℓ2+λ^z,z−^z⟩,

and the condition f′(0+) ≥ 0 is equivalent to (3.6).

The PG algorithm for (1.2) based on GCGM is stated in the form of Algorithm 2.
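The resulting iteration can be sketched in code. The sort-based ℓ1 projection, the choice λ = ∥A∥², the nonzero starting point, and the fixed step size sk = 1 (so that xk+1 = zk) are simplifying assumptions for illustration; the paper's Algorithm 2 may select λ and sk differently.

```python
import numpy as np

def project_l1(x, R):
    # Sort-based Euclidean projection onto the l1-ball B_R (Definition 2.5).
    if np.abs(x).sum() <= R:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, x.size + 1) > css - R)[0][-1]
    mu = (css[k] - R) / (k + 1.0)
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def pg_gcgm(A, y, R, beta, lam=None, n_iter=500, step=1.0):
    # Inner PG step (3.4):
    #   z_k = P_R(x_k + beta*x_k/(lam*||x_k||_2) - (1/lam) A^T(A x_k - y)),
    # followed by the GCGM update x_{k+1} = x_k + step*(z_k - x_k).
    # lam = ||A||_2^2 and step = 1 are illustrative assumptions.
    if lam is None:
        lam = np.linalg.norm(A, 2) ** 2
    x = project_l1(A.T @ y, R)   # nonzero start so that ||x_k||_2 > 0
    for _ in range(n_iter):
        nrm = np.linalg.norm(x)
        shift = beta * x / (lam * nrm) if nrm > 0 else 0.0
        z = project_l1(x + shift - A.T @ (A @ x - y) / lam, R)
        x = x + step * (z - x)
    return x
```

With β = 0 this reduces exactly to the plain PG iteration (1.6); the extra shift term βxk/(λ∥xk∥ℓ2) carries the non-convex −β∥x∥ℓ2 part of the penalty.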

### 3.1 Determination of the radius R

From the previous discussion, we know that (3.1) is equivalent to (3.3) for a certain R. Before starting iteration (3.4), we need to choose an appropriate value of R, which is crucial for the computation, especially in practical applications. In this section, we give a strategy to determine the radius R of the ℓ1-ball constraint by Morozov’s discrepancy principle.

By Lemma 2.7, for a given α in (3.1), R in (3.3) should be chosen such that R = ∥zk∥ℓ1, where zk is the minimizer (3.2) of (3.1). However, one does not know this value before starting (3.4). Of course, we can find an approximation of it by the ST-(αℓ1−βℓ2) algorithm (1.3). Nevertheless, this implies that an additional soft thresholding iteration (1.3) is needed in Algorithm 2. Then the resulting algorithm is no longer an accelerated one.

So a crucial issue is how to check whether a value of R is appropriate for (3.3). Recall that there exists a regularization parameter α depending on R such that (3.1) is equivalent to (3.3). So to determine an appropriate R, we need to check whether the corresponding regularization parameter α is appropriate. One criterion is to check whether α is an admissible a priori parameter choice; if it is, then xδα,β is a regularized solution ([15, Theorem 2.13]). However, by Lemmas 2.6 and 2.7, we only know that R is a piecewise linear, continuous, decreasing function of α (see [14, Fig. 2]), and there is no explicit formula relating α and R. We cannot determine the value of α from the value of R directly. So we cannot ensure that a given R is appropriate.

Another criterion is Morozov’s discrepancy principle. For any given R, we should check whether the corresponding regularized solution xδα,β satisfies Morozov’s discrepancy principle (2.2), i.e.

 τ1δ≤∥Axδα,β−yδ∥Y≤τ2δ,1<τ1≤τ2.

For any fixed R, we compute xδα,β by Algorithm 2, where zk is determined by the PG method (3.4). Subsequently, we check whether xδα,β satisfies (2.2). For this strategy, we only need to know the observed data yδ and the noise level δ. By Lemma 3.5, the discrepancy ∥Axδα,β−yδ∥Y is an increasing function of α. A commonly adopted technique is to try αi = α0qi, i = 0, 1, 2, …, for some q < 1; with i increasing, one calculates xδαi,βi until one finds an αi satisfying (2.2) ([42]). Since R is a decreasing function of α, the discrepancy is a decreasing function of R, see Lemma 2.6 and Fig. 1. Hence an increasing sequence of radii Ri is a reasonable choice. We begin with a small R such that xδα,β satisfies Morozov’s discrepancy principle (2.2). Subsequently, we increase the value of R to R1, R2, …, until xδα,β fails to satisfy Morozov’s discrepancy principle. Then we can find a maximal R which satisfies Morozov’s discrepancy principle (2.2). Of course, we can also begin with a large R and gradually reduce its value until xδα,β satisfies Morozov’s discrepancy principle (2.2). Under Morozov’s discrepancy principle, the PG algorithm for (1.2) based on GCGM is stated in the form of Algorithm 3.
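The search over R described above can be sketched generically. Here `solve` stands for any routine returning a solution of the R-constrained problem (e.g. Algorithm 2); the geometric grid Ri = R0·ρ^i, the factor ρ, and the toy demonstration with A = I are all illustrative assumptions.

```python
import numpy as np

def choose_radius(solve, A, y_delta, delta, tau1, tau2, R0, rho=1.05, max_steps=200):
    # Scan R = R0 * rho**i, i = 0, 1, 2, ...  The discrepancy
    # ||A x_R - y_delta|| decreases as R grows, so we remember the largest R
    # whose solution still satisfies tau1*delta <= discrepancy <= tau2*delta
    # (Morozov's discrepancy principle (2.2)), and stop once the discrepancy
    # falls below tau1*delta.
    best = None
    R = R0
    for _ in range(max_steps):
        x = solve(R)
        disc = np.linalg.norm(A @ x - y_delta)
        if tau1 * delta <= disc <= tau2 * delta:
            best = (R, x)          # still admissible; remember it
        elif disc < tau1 * delta:
            break                  # overshot the band; stop
        R *= rho
    return best

# Toy demonstration with A = I in one dimension: the R-constrained minimizer
# of ||x - y_delta||^2 over |x| <= R is simply the clipped datum.
A = np.eye(1)
y_delta = np.array([2.0])
solve = lambda R: np.clip(y_delta, -R, R)
result = choose_radius(solve, A, y_delta, delta=0.1, tau1=1.1, tau2=2.0, R0=1.0)
```

In the toy run the scan stops at the first radius whose discrepancy drops below τ1δ, and `result` holds the largest admissible R found before that.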

A natural question is whether (1.2) combined with Morozov’s discrepancy principle is a regularization method. As is well known, Tikhonov-type functionals combined with Morozov’s discrepancy principle yield a regularization method. However, this result is usually shown only when the regularization term is convex ([1, 8, 34, 38, 42, 43]). If the regularization term is non-convex, some results can be found in [15, 44], where Morozov’s discrepancy principle is applied to derive the convergence rate. However, these results are obtained under additional source conditions on the true solution x†. To the best of our knowledge, no results are available on whether Morozov’s discrepancy principle combined with (1.2) is a regularization method. In this paper, we prove that if the non-convex regularization term satisfies some properties, e.g. coercivity, weak lower semi-continuity and the Radon-Riesz property, the well-posedness of the regularization still holds.

### 3.2 Well-posedness of regularization

In this section, we discuss the well-posedness of (1.2) under Morozov’s discrepancy principle. First, we show that there exists at least one regularization parameter α in (1.2) such that Morozov’s discrepancy principle (2.2) holds. We recall some properties of Rα,β ([15]) needed in analyzing the well-posedness of (1.2).

###### Lemma 3.3

If α > β ≥ 0, the function Rα,β in (1.2) has the following properties:

(i) (Coercivity) For α > β, ∥x∥ℓ2 → ∞ implies Rα,β(x) → ∞.

(ii) (Weak lower semi-continuity) If xn ⇀ x in ℓ2 and {∥xn∥ℓ1} is bounded, then

 liminfnRα,β(xn)≥Rα,β(x).

(iii) (Radon-Riesz property) If xn ⇀ x in ℓ2 and Rα,β(xn) → Rα,β(x), then xn → x in ℓ2.

###### Definition 3.4

For fixed δ > 0 and η ∈ [0, 1], define

 F(xδα,β) =12∥Axδα,β−yδ∥2Y,
 Rη(xδα,β) =∥xδα,β∥ℓ1−η∥xδα,β∥ℓ2,
 m(α) =Jδα,β(xδα,β)=minxJδα,β(x),

where xδα,β ∈ Lδα,β and η = β/α.

In the following we give some properties of F, Rη and m in Lemmas 3.5 and 3.6. Since Rη is weakly lower semi-continuous, the proofs are similar to those in [42, Section 2.6]. Note that δ is fixed, and for given α1 and α2, we write β1 = ηα1 and β2 = ηα2.

###### Lemma 3.5

The function m is continuous and non-decreasing, i.e., α1 ≤ α2 implies m(α1) ≤ m(α2). Moreover, for α1 < α2,

 supxδα1,β1∈Lδα1,β1F(xδα1,β1) ≤ infxδα2,β2∈Lδα2,β2F(xδα2,β2),
 supxδα1,β1∈Lδα1,β1Rη(xδα1,β1) ≥ infxδα2,β2∈Lδα2,β2Rη(xδα2,β2).
###### Lemma 3.6

For each ¯α > 0 there exist x′, x′′ ∈ Lδ¯α,¯β such that

 limα→¯α−(supxδα,β∈Lδα,βF(xδα,β))=F(x′)=infx∈Lδ¯α,¯βF(x),
 limα→¯α+(infxδα,β∈Lδα,βF(xδα,β))=F(x′′)=supx∈Lδ¯α,¯βF(x).

In the following we provide an existence result for the regularization parameter α. The proof is along the lines of Morozov’s discrepancy principle for nonlinear ill-posed problems ([1, 34]).

###### Lemma 3.7

Assume ∥yδ∥Y > τ2δ. Then there exist α1, α2 > 0 such that

 supxδα1,β1∈Lδα1,β1∥Axδα1,β1−yδ∥Y<τ1δ≤τ2δ<infxδα2,β2∈Lδα2,β2∥Axδα2,β2−yδ∥Y.

Proof. First, let αn → 0 and consider a sequence of corresponding minimizers xn := xδαn,βn. By the definition of m(αn) and x†, we have

 ∥Axn−yδ∥qY≤q m(αn)≤q Jδαn,βn(x†)≤δq+q αnRη(x†)→δq<τq1δq.

This implies that there exists a small enough α1 such that supxδα1,β1∈Lδα1,β1∥Axδα1,β1−yδ∥Y<τ1δ.

Next, let αn → ∞. Then

 Rη(xn)≤1αnm(αn)≤1qαn∥A0−yδ∥qY=1qαn∥yδ∥qY→0. (3.7)

From the definition of Rη,

 Rη(x)=(1−η)∥x∥ℓ1+η(∥x∥ℓ1−∥x∥ℓ2). (3.8)

Then a combination of (3.7) and (3.8) implies that {∥xn∥ℓ1} is bounded. Consequently, {xn} has a weakly convergent subsequence, again denoted by {xn}, such that xn ⇀ x∗ for some x∗ ∈ ℓ2. By Lemma 3.3 (ii), it follows from (3.7) that

 0≤Rη(x∗)≤liminfRη(xn)=limRη(xn)=0.

By (3.8), this implies x∗ = 0. Since xn ⇀ 0 and Rη(xn) → 0 = Rη(0), Lemma 3.3 (iii) implies that xn → 0. Then

 ∥Axn−yδ∥Y→∥A0−yδ∥Y=∥yδ∥Y>τ2δ.

This implies that there exists a large enough α2 such that infxδα2,β2∈Lδα2,β2∥Axδα2,β2−yδ∥Y>τ2δ.

Note that we require ∥yδ∥Y > τ2δ in Lemma 3.7, which is a reasonable assumption. Indeed, in applications, it is almost impossible to recover a solution from observed data whose size is of the same order as the noise.

We state an existence result for the regularization parameter, similar to Theorem 3.10 in [1]. The proof makes use of the properties stated in Lemmas 3.6 and 3.7.

###### Theorem 3.8

Assume ∥yδ∥Y > τ2δ and that there is no α > 0 with minimizers x′, x′′ ∈ Lδα,β such that

 ∥Ax′−yδ∥Y<τ1δ≤τ2δ<∥Ax′′−yδ∥Y.
