Two New Approaches to Compressed Sensing Exhibiting Both Robust Sparse Recovery and the Grouping Effect

In this paper we introduce a new optimization formulation for sparse regression and compressed sensing, called CLOT (Combined L-One and Two), wherein the regularizer is a convex combination of the ℓ_1- and ℓ_2-norms. This formulation differs from the Elastic Net (EN) formulation, in which the regularizer is a convex combination of the ℓ_1-norm and the squared ℓ_2-norm. It is shown that, in the context of compressed sensing, the EN formulation does not achieve robust recovery of sparse vectors, whereas the new CLOT formulation achieves robust recovery. Also, like EN but unlike LASSO, the CLOT formulation achieves the grouping effect, wherein coefficients of highly correlated columns of the measurement (or design) matrix are assigned roughly comparable values. It is already known that LASSO does not have the grouping effect. Therefore the CLOT formulation combines the best features of both LASSO (robust sparse recovery) and EN (grouping effect). The CLOT formulation is a special case of another one called SGL (Sparse Group LASSO), which was introduced into the literature previously, but without any analysis of either the grouping effect or robust sparse recovery. It is shown here that SGL achieves robust sparse recovery, and also achieves a version of the grouping effect, in that coefficients of highly correlated columns belonging to the same group of the measurement (or design) matrix are assigned roughly comparable values.


1 Introduction

The LASSO and the Elastic Net (EN) formulations are among the most popular approaches for sparse regression and compressed sensing. In this section, we briefly review these two problems and their current status, so as to provide the background for the remainder of the paper.

1.1 Sparse Regression

In sparse regression, one is given a measurement matrix (also called a design matrix in statistics) A ∈ ℝ^{m×n} where m < n, together with a measurement or measured vector y ∈ ℝ^m. The objective is to choose a vector x̂ ∈ ℝ^n such that x̂ is rather sparse, and Ax̂ is either exactly or approximately equal to y. The problem of finding the sparsest x that satisfies Ax = y is known to be NP-hard [1]; therefore it is necessary to find alternate approaches.

For the sparse regression problem, the general approach is to determine the estimate x̂ by solving the minimization problem

x̂ = argmin_z ‖y − Az‖_2^2 subject to R(z) ≤ ρ,    (1)

or in Lagrangian form,

x̂ = argmin_z ‖y − Az‖_2^2 + λ R(z),    (2)

where R(·) is known as a “regularizer,” and ρ, λ are adjustable parameters. Different choices of the regularizer lead to different approaches. With the choice R(z) = ‖z‖_2^2, the approach is known as ridge regression [2], which builds on earlier work [3]. The LASSO approach [4] results from choosing R(z) = ‖z‖_1, while the Elastic Net (EN) approach [5] results from choosing

R_{EN,μ}(z) = μ ‖z‖_1 + (1 − μ) ‖z‖_2^2,    (3)

where μ ∈ [0, 1] is an adjustable parameter. Note that the EN regularizer function interpolates ridge regression and LASSO, in the sense that EN reduces to LASSO if μ = 1 and to ridge regression if μ = 0. A very general approach to regression using a convex regularizer is given in [6].

The LASSO approach can be shown to return a solution with no more than m nonzero components, under mild regularity conditions; see [7]. There is no such bound on the number of nonzero components of x̂ when EN is used. However, when the columns of the matrix A are highly correlated, then LASSO chooses just one of these columns and ignores the rest. Measurement matrices with highly correlated columns occur in many practical situations, for example, in microarray measurements of messenger RNA, otherwise known as gene expression data. The EN approach was proposed at least in part to overcome this undesirable behavior of the LASSO formulation. It is shown in [5, Theorem 1] that if two columns (say a_i and a_j) of the matrix A are highly correlated, then the corresponding components x̂_i and x̂_j of the EN solution are nearly equal. This is known as the “grouping effect,” and the point is that EN demonstrates the grouping effect whereas LASSO does not.
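To make the role of the regularizer concrete, here is a minimal sketch in Python using the cvxpy modelling package (the paper's own experiments use cvx in Matlab; the function name, the toy data A, y, and the parameter values below are placeholders of ours, not taken from the paper). It sets up the Lagrangian form (2) with the ridge, LASSO, and EN choices of R(·).

```python
import cvxpy as cp
import numpy as np

def regularized_regression(A, y, lam, regularizer, mu=0.5):
    """Solve the Lagrangian form (2): minimize ||y - A z||_2^2 + lam * R(z)."""
    z = cp.Variable(A.shape[1])
    if regularizer == "ridge":        # R(z) = ||z||_2^2
        R = cp.sum_squares(z)
    elif regularizer == "lasso":      # R(z) = ||z||_1
        R = cp.norm(z, 1)
    else:                             # EN, as in (3): mu*||z||_1 + (1-mu)*||z||_2^2
        R = mu * cp.norm(z, 1) + (1 - mu) * cp.sum_squares(z)
    cp.Problem(cp.Minimize(cp.sum_squares(y - A @ z) + lam * R)).solve()
    return z.value

# toy data: a 5-sparse vector observed through a random 30 x 100 matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 100))
x_true = np.concatenate([rng.standard_normal(5), np.zeros(95)])
y = A @ x_true
x_lasso = regularized_regression(A, y, lam=1.0, regularizer="lasso")
```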

1.2 Compressed Sensing

In compressed sensing, the objective is to choose the measurement matrix A (which is part of the problem data in sparse regression), such that whenever the vector x ∈ ℝ^n is nearly sparse, it is possible to nearly recover x from noise-corrupted measurements of the form y = Ax + η. Let us make the problem formulation precise. For this purpose we begin by introducing some notation.

Throughout, the symbol [n] denotes the index set {1, …, n}. The support of a vector x ∈ ℝ^n is denoted by supp(x) and is defined as

supp(x) := { i ∈ [n] : x_i ≠ 0 }.

A vector x is said to be k-sparse if |supp(x)| ≤ k. The set of all k-sparse vectors in ℝ^n is denoted by Σ_k. The k-sparsity index of a vector x ∈ ℝ^n with respect to a given norm ‖·‖ is defined as

σ_k(x, ‖·‖) := min_{z ∈ Σ_k} ‖x − z‖.    (4)

It is obvious that x ∈ Σ_k if and only if σ_k(x, ‖·‖) = 0, for every norm ‖·‖.
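Since the closest k-sparse vector to x in any ℓ_p-norm is obtained by keeping the k largest-magnitude entries of x and zeroing the rest, the sparsity index in (4) is simply the norm of the remaining "tail." The following numpy sketch (our own illustration, with hypothetical names) computes it.

```python
import numpy as np

def sparsity_index(x, k, p=1):
    """k-sparsity index of x in the l_p norm, cf. (4): the l_p norm of all
    but the k largest-magnitude entries of x."""
    idx = np.argsort(np.abs(x))[::-1]     # indices sorted by decreasing magnitude
    tail = x[idx[k:]]                     # entries outside the best k-term support
    return np.linalg.norm(tail, ord=p)

x = np.array([5.0, -0.1, 3.0, 0.02, 0.0])
print(sparsity_index(x, k=2, p=1))        # 0.12: x is nearly 2-sparse
```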

The general formulation of the compressed sensing problem given below is essentially taken from [8]. Suppose that A ∈ ℝ^{m×n} is the “measurement matrix,” and Δ : ℝ^m → ℝ^n is the “decoder map,” where m < n. Suppose x ∈ ℝ^n is an unknown vector that is to be recovered. The input to the decoder consists of y = Ax + η, where η denotes the measurement noise, and a prior upper bound of the form ‖η‖_2 ≤ ε is available; in other words, ε is a known number. In this set-up, the vector x̂ := Δ(y) is the approximation to the original vector x. With these conventions, we can now state the following.

Definition 1.1.

Suppose 1 ≤ p ≤ 2. The pair (A, Δ) is said to achieve robust sparse recovery of order k with respect to ‖·‖_p if there exist constants C and D that might depend on A and Δ but not on x or η, such that

‖Δ(Ax + η) − x‖_p ≤ C σ_k(x, ‖·‖_1) + D ε whenever ‖η‖_2 ≤ ε.    (5)

The restriction that p ≤ 2 is tied up with the fact that the bound on the noise is for the Euclidean norm ‖η‖_2. The usual choices for p in (5) are p = 1 and p = 2.

Among the most popular approaches to compressed sensing is ℓ_1-norm minimization, which was popularized in a series of papers, of which we cite only [9, 10, 11, 12]. The survey paper [13] has an extensive bibliography on the topic, as does the recent book [14]. In this approach, the estimate is defined as

x̂ := Δ(y) := argmin_z ‖z‖_1 subject to ‖y − Az‖_2 ≤ ε.    (6)

Note that the above definition does indeed define a decoder map Δ : ℝ^m → ℝ^n. In order for the above pair (A, Δ) to achieve robust sparse recovery, the matrix A is chosen so as to satisfy a condition defined next.
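The decoder in (6) is a second-order cone program and can be prototyped directly. The sketch below is our own illustration in Python/cvxpy (the paper's experiments use cvx in Matlab), and the function name is a placeholder.

```python
import cvxpy as cp

def l1_decoder(A, y, eps):
    """Decoder of (6): minimize ||z||_1 subject to ||y - A z||_2 <= eps."""
    z = cp.Variable(A.shape[1])
    cp.Problem(cp.Minimize(cp.norm(z, 1)),
               [cp.norm(y - A @ z, 2) <= eps]).solve()
    return z.value  # the estimate x_hat = Delta(y)
```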

Definition 1.2.

A matrix A ∈ ℝ^{m×n} is said to satisfy the Restricted Isometry Property (RIP) of order k with constant δ_k if

(1 − δ_k) ‖z‖_2^2 ≤ ‖Az‖_2^2 ≤ (1 + δ_k) ‖z‖_2^2 for all z ∈ Σ_k.    (7)
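Verifying (7) exactly is combinatorial, since δ_k is a maximum over all k-column submatrices of A. As a rough sanity check one can sample random supports and record the worst deviation of the squared singular values from 1, which yields only a lower estimate of δ_k. A hypothetical sketch (names and parameters are ours):

```python
import numpy as np

def rip_constant_estimate(A, k, trials=2000, seed=0):
    """Monte Carlo lower estimate of the RIP constant delta_k in (7):
    sample random k-column submatrices A_S and record how far their
    squared singular values deviate from 1."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    delta = 0.0
    for _ in range(trials):
        S = rng.choice(n, size=k, replace=False)
        s = np.linalg.svd(A[:, S], compute_uv=False)
        delta = max(delta, abs(s[0] ** 2 - 1), abs(s[-1] ** 2 - 1))
    return delta  # a lower bound; the true delta_k maximizes over all supports

m, n, k = 50, 200, 5
A = np.random.default_rng(1).standard_normal((m, n)) / np.sqrt(m)
print(rip_constant_estimate(A, k))
```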

Starting with [9], several papers have derived sufficient conditions that the RIP constant of the matrix A must satisfy in order for ℓ_1-norm minimization to achieve robust sparse recovery. Recently, the “best possible” bound has been proved in [15]. These results are stated here for the convenience of the reader.

Theorem 1.1.

(See [15, Theorem 2.1].) Suppose A satisfies the RIP of order tk for some number t ≥ 4/3 such that tk is an integer, with δ_{tk} < √((t − 1)/t). Then the recovery procedure in (6) achieves robust sparse recovery of order k.

Theorem 1.2.

(See [15, Theorem 2.2].) Let t ≥ 4/3. For all ε > 0 and all sufficiently large k, there exists a matrix A satisfying the RIP of order tk with constant δ_{tk} < √((t − 1)/t) + ε such that the recovery procedure in (6) fails for some k-sparse vector.

Observe that the Lagrangian formulation of the LASSO approach is

x̂ = argmin_z ‖y − Az‖_2^2 + λ ‖z‖_1,

whereas the Lagrangian formulation of (6) is

x̂ = argmin_z ‖z‖_1 + λ ‖y − Az‖_2,

which is essentially the same as the Lagrangian formulation of

x̂ = argmin_z ‖y − Az‖_2 subject to ‖z‖_1 ≤ ρ.

This last formulation of sparse regression is known as “square-root LASSO” [16]. Therefore the community refers to the approach to compressed sensing given in (6) as the LASSO, though this may not be strictly accurate.
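For concreteness, the two penalized objectives above differ only in whether the residual norm is squared. A minimal cvxpy sketch of both (our own illustration, with hypothetical names and penalty weights):

```python
import cvxpy as cp

def lasso_lagrangian(A, y, lam):
    """Penalized LASSO: minimize ||y - A z||_2^2 + lam * ||z||_1."""
    z = cp.Variable(A.shape[1])
    cp.Problem(cp.Minimize(cp.sum_squares(y - A @ z) + lam * cp.norm(z, 1))).solve()
    return z.value

def sqrt_lasso(A, y, lam):
    """Square-root LASSO: minimize ||y - A z||_2 + lam * ||z||_1 (residual norm not squared)."""
    z = cp.Variable(A.shape[1])
    cp.Problem(cp.Minimize(cp.norm(y - A @ z, 2) + lam * cp.norm(z, 1))).solve()
    return z.value
```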

1.3 Compressed Sensing with Group Sparsity

Over the years some variants of LASSO have been proposed for compressed sensing, such as the Group LASSO (GL) [17] and the Sparse Group LASSO (SGL) [18]. In the GL formulation, the index set [n] is partitioned into g disjoint sets G_1, …, G_g, and the associated norm is defined as

‖x‖_{GL,G} := ∑_{j=1}^{g} ‖x_{G_j}‖_2,    (8)

where x_{G_j} denotes the projection of the vector x onto the components in G_j. The notation is intended to remind us that the norm depends on the specific partitioning G = {G_1, …, G_g}. Some authors divide each term ‖x_{G_j}‖_2 by √|G_j|, but we do not do that. A further refinement of GL is the sparse group LASSO (SGL), in which the group structure is as before, but the norm is now defined as

‖x‖_{SGL,μ} := μ ‖x‖_1 + (1 − μ) ∑_{j=1}^{g} ‖x_{G_j}‖_2 = μ ‖x‖_1 + (1 − μ) ‖x‖_{GL,G},    (9)

where μ ∈ [0, 1] and, as before, x_{G_j} denotes the projection of x onto the components in G_j. If x ∈ ℝ^n is an unknown vector, then recovery of x from the noisy measurement y = Ax + η, with ‖η‖_2 ≤ ε, is attempted via

x̂ = argmin_z ‖z‖_{GL,G} subject to ‖y − Az‖_2 ≤ ε    (10)

in Group LASSO, and via

x̂ = argmin_z ‖z‖_{SGL,μ} subject to ‖y − Az‖_2 ≤ ε    (11)

in Sparse Group LASSO.

The main idea behind GL is that one is less concerned about the number of nonzero components of x̂, and more concerned about the number of distinct groups containing these nonzero components. Therefore GL attempts to choose an estimate x̂ that has nonzero entries in as few distinct sets G_j as possible. In principle, SGL tries to choose an estimate that not only has nonzero components within as few groups as possible, but, within those groups, has as few nonzero components as possible. Note that if μ = 1, then SGL reduces to LASSO, whereas if μ = 0, then SGL reduces to GL. Note too that if g = n and every set G_j is the singleton {j}, then both GL and SGL reduce to LASSO, because ∑_j ‖x_{G_j}‖_2 then equals ‖x‖_1.
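The GL and SGL norms and the recovery programs (10)–(11) can be prototyped in a few lines. The sketch below is our own illustration in Python/cvxpy; the names sgl_norm and sgl_decoder and the encoding of groups as a list of index lists are ours, not from the paper.

```python
import cvxpy as cp

def sgl_norm(z, groups, mu):
    """SGL norm of (9): mu*||z||_1 + (1-mu)*sum_j ||z_{G_j}||_2.
    `groups` is a list of index lists partitioning range(n);
    mu=1 gives the l_1 norm (LASSO) and mu=0 gives the GL norm of (8)."""
    gl = sum(cp.norm(z[g], 2) for g in groups)
    return mu * cp.norm(z, 1) + (1 - mu) * gl

def sgl_decoder(A, y, eps, groups, mu):
    """Recovery program (11): minimize the SGL norm subject to ||y - A z||_2 <= eps."""
    z = cp.Variable(A.shape[1])
    cp.Problem(cp.Minimize(sgl_norm(z, groups, mu)),
               [cp.norm(y - A @ z, 2) <= eps]).solve()
    return z.value
```

Passing a single group containing all of range(n) to sgl_decoder gives the CLOT decoder discussed in Section 1.4, while mu=0 gives the GL decoder of (10).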

1.4 Motivation and Contributions of the Paper

Now we come to the motivation and contributions of the present paper. The LASSO formulation is well-suited for compressed sensing (see Theorem 1.1), but not so well-suited for sparse regression, because it lacks the grouping effect. The EN formulation is well-suited for sparse regression as it exhibits the grouping effect, but it is not known whether it can achieve compressed sensing.

The first result presented in the paper is that if the EN regularizer of (3) is used in place of the ℓ_1-norm in (6), then the resulting approach does not achieve robust sparse recovery unless the number of measurements m grows linearly with the dimension n of the unknown vector. This would not be considered “compressed” sensing. This led us to formulate another regularizer, namely

‖x‖_{CLOT,μ} := μ ‖x‖_1 + (1 − μ) ‖x‖_2,    (12)

where 0 < μ ≤ 1 is an adjustable weight. We refer to this as the CLOT norm, with CLOT standing for Combined L-One and Two. It is shown that the CLOT norm combines the best features of both LASSO and EN, in that

  • When the CLOT norm is used as the regularizer in sparse regression, the resulting solution exhibits the grouping effect.

  • When the ℓ_1-norm is replaced by the CLOT norm in (6), the resulting solution achieves robust sparse recovery if the matrix A satisfies the RIP.

  • Moreover, if the weight (1 − μ) on the ℓ_2-norm in CLOT is set to zero, so that CLOT reduces to LASSO, the bound on the RIP constant reduces to the “best possible” bound in Theorem 1.1.

Clearly the CLOT norm is a special case of the SGL norm, with the entire index set [n] being taken as a single group (though the adjective “sparse” is no longer appropriate). This led us to explore whether the SGL norm achieves either the grouping effect or robust sparse recovery. We are able to show that SGL does indeed achieve both.

Now we place these contributions in perspective. There is empirical evidence to support the belief that both the GL and the SGL formulations work well for compressed sensing. However, until the publication of a companion paper by a subset of the present authors [19], there were no proofs that either of these formulations achieved robust sparse recovery. In [19], it is shown that both the GL and SGL formulations achieve robust sparse recovery provided the group sizes are sufficiently small. This restriction on group sizes is removed in the present paper. Moreover, so far as the authors are aware, there have been no results until now on the grouping effect for either of these formulations. In the present paper, it is shown that if two columns of the measurement matrix that belong to the same group are highly correlated, then the corresponding components of the estimate have nearly equal values. However, if two columns that belong to different groups are highly correlated, then their coefficients need not be nearly equal. From the standpoint of applications, this is a highly desirable property. To illustrate, suppose the groups represent biological pathways. Then one would wish to assign roughly similar weights to genes in the same pathway, but not necessarily to those in disjoint pathways.

Thus the contributions of the present paper are to show that:

  • The EN does not achieve robust sparse recovery.

  • Both the CLOT and SGL formulations achieve both robust sparse recovery as well as the grouping effect.

  • The condition under which CLOT achieves robust sparse recovery reduces to the “best possible” condition in Theorem 1.1.

Taken together, these results might indicate that CLOT and SGL are attractive alternatives to the LASSO and EN formulations.

2 Main Theoretical Results

This section contains the main contributions of the paper. First it is shown in Section 2.1 that the EN approach does not achieve robust sparse recovery, and is therefore not suitable for compressed sensing applications. Next, it is shown in Section 2.2 that the SGL formulation assigns nearly equal weights to highly correlated features within the same group, though not necessarily to highly correlated features from different groups. It follows as a corollary that CLOT assigns nearly equal weights to highly correlated features. Then it is shown in Section 2.3 that the SGL formulation achieves robust sparse recovery. The contents of a companion paper by a subset of the present authors [19] establish that SGL achieves robust sparse recovery of order provided that each group size is smaller than . There is no such restriction here. It follows as a corollary that CLOT also achieves robust sparse recovery.

2.1 Lack of Robust Sparse Recovery of the Elastic Net Formulation

The first result of this section shows that the EN formulation does not achieve robust sparse recovery, and therefore is not suitable for compressed sensing applications.

Theorem 2.1.

Suppose a matrix has the following property. There exist constants and such that, whenever for some and with , the solution

satisfies

(13)

Then

(14)

Proof: Let denote the null space of the matrix , that is, the set of all such that . Let be arbitrary, and let denote the index set of the largest components of by magnitude. Therefore

Next, (13) implies that, if , then for all . In other words,

or equivalently,

(15)

Now observe that, because , we have that

and more generally,

Apply (15) with and . This leads to

Now divide both sides by , and observe that

Therefore

Next

Equivalently

This is Equation (5.2) of Cohen-Dahmen-DeVore (2009) with . As shown in Theorem 5.1 of that paper, this implies that , which is the desired conclusion.

2.2 Grouping Property of the SGL and CLOT Formulations

One advantage of the EN over LASSO is that the former assigns roughly equal weights to highly correlated features, as shown in [5, Theorem 1] and referred to as the grouping effect. In contrast, if LASSO chooses one feature among a set of highly correlated features, then generically it assigns a zero weight to all the rest. To illustrate, if two columns of A are identical, then in principle LASSO could assign nonzero weights to both columns; however, the slightest perturbation in the data would cause one or the other weight to become zero. The drawback of this is that the finally selected feature set is very sensitive to noise in the measurements. In this section we prove an analog of [5, Theorem 1] for the SGL formulation. Our result states that if two highly correlated features within the same group are chosen by SGL, then they will have roughly similar weights. Since CLOT is a special case of SGL with the entire feature set treated as one group, it follows that CLOT assigns roughly similar weights to highly correlated features in the entire set of features. As a result, the final feature sets obtained using SGL or CLOT are less sensitive to noise in the measurements than the ones obtained using LASSO.

Theorem 2.2.

Let be some vector and matrix respectively. Without loss of generality, suppose that is centered, i.e. , where denotes a column vector consisting of ones, and that is standardized, i.e.  where denotes the -th column of . Suppose , and let denote a partition of into disjoint subsets. Define

(16)

where is a Lagrange multiplier. Suppose that, for two indices belonging to the same group , we have that , where denote the components of the vector . By changing the sign of one of the columns of if necessary, it can be assumed that . Define

Then

(17)

where is shorthand for .

Proof: Define

where, as above, denotes . Then is differentiable with respect to whenever . In particular, since both and are nonzero by assumption, it follows that

Expanding the partial derivatives leads to

Subtracting one equation from the other gives

Hence

In the last step, we use the fact that

Rearranging gives

which is the desired conclusion.

Let us illustrate the above result using the CLOT formulation, which corresponds to taking the entire index set [n] as a single group. In this case the inequality in (17) becomes

(18)

where is the solution of the CLOT formulation, and

(19)

Now suppose that the columns corresponding to two indices i and j are highly correlated, so that the right-hand side of the inequality in (18) is almost equal to zero. Combining this with (19), we can conclude that the corresponding components x̂_i and x̂_j are nearly equal, so CLOT assigns similar weights to highly correlated variables.
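As a numerical illustration of this grouping behavior (not one of the paper's experiments), the following Python/cvxpy snippet builds two nearly identical, standardized columns and compares a pure ℓ_1 penalty with a combined ℓ_1 + ℓ_2 penalty in a Lagrangian form analogous to (16); the penalty weights and noise levels are arbitrary choices of ours.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 10
A = rng.standard_normal((m, n))
A[:, 1] = A[:, 0] + 1e-3 * rng.standard_normal(m)   # columns 0 and 1 highly correlated
A /= np.linalg.norm(A, axis=0)                      # standardize the columns
x_true = np.zeros(n)
x_true[0] = x_true[1] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(m)
y -= y.mean()                                       # center y

def solve(l2_weight):
    z = cp.Variable(n)
    penalty = cp.norm(z, 1) + l2_weight * cp.norm(z, 2)
    cp.Problem(cp.Minimize(cp.sum_squares(y - A @ z) + 0.1 * penalty)).solve()
    return z.value

x_lasso = solve(0.0)   # pure l_1 penalty: typically only one of the two coefficients survives
x_clot  = solve(1.0)   # l_1 + l_2 penalty: the two coefficients come out nearly equal
print(np.round(x_lasso[:2], 3), np.round(x_clot[:2], 3))
```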

Though the focus of the present paper is not on the GL formulation, we digress briefly to discuss the implications of Theorem 2.2 for GL. This theorem also implies that the GL formulation exhibits the grouping effect, because GL is a special case of SGL with μ = 0. Indeed, it can be observed from (17) that the bound on the right side is minimized by setting μ = 0, that is, by using GL instead of SGL. This is not surprising, because SGL not only tries to minimize the number of distinct groups containing the support of x̂, but, within each group, also tries to choose as few elements as possible. Thus, within each group, SGL inherits the weaknesses of LASSO, and one would expect that, within each group, the selected feature set becomes more sensitive as the weight (1 − μ) on the group terms is decreased.

2.3 Robust Sparse Recovery of the SGL and CLOT Formulations

In this subsection, we present some sufficient conditions for the SGL and CLOT formulations to achieve robust sparse recovery. When CLOT is specialized to LASSO by setting μ = 1, the sufficient condition reduces to the “tight” bound given in Theorem 1.1.

Recall the definitions. The CLOT norm with parameter μ is given by

‖z‖_{CLOT,μ} = μ ‖z‖_1 + (1 − μ) ‖z‖_2,

while the SGL norm is given by

‖z‖_{SGL,μ} = μ ‖z‖_1 + (1 − μ) ∑_{j=1}^{g} ‖z_{G_j}‖_2.

Recall also the problem set-up. The measurement vector equals y = Ax + η, where ‖η‖_2 ≤ ε, a known upper bound. The recovered vector is defined as

x̂ := argmin_z ‖z‖_{SGL,μ} subject to ‖y − Az‖_2 ≤ ε    (20)

if SGL is used, and as

x̂ := argmin_z ‖z‖_{CLOT,μ} subject to ‖y − Az‖_2 ≤ ε    (21)

if CLOT is used.

Definition 2.1.

A matrix is said to satisfy the robust null space property (RNSP) if there exist constants and such that, for all sets with , we have

(22)

This property was apparently first introduced in [14, Definition 4.21]. Note that the definition in [14] has just in place of . It is a ready consequence of Schwarz’ inequality that, if (22) holds, then

The following result is established in [20] in the context of group sparsity, but is new even for conventional sparsity. The reader is directed to that source for the proof.

Theorem 2.3.

([20, Theorem 5]) Suppose that, for some number , the matrix satisfies the RIP of order with constant . Let be an abbreviation for , and define the constants

(23)
(24)
(25)
(26)

Then satisfies the -robust null space property with

(27)

In Theorem 2.4 below, it is assumed that the matrix A satisfies the RIP of order tk with constant δ_{tk} < √((t − 1)/t), in accordance with Theorem 1.1. With this assumption, we prove bounds on the residual error with SGL; the bounds for CLOT can be obtained simply by treating the entire index set as a single group in the SGL bounds. Note that, once bounds for the ℓ_1- and ℓ_2-norms of x̂ − x are proved, it is possible to extend the bounds to the ℓ_p-norm for all p ∈ [1, 2]; see [14, Theorem 4.22].

Theorem 2.4.

Suppose and that satisfies the GRIP of order with constant , and define constants as in (27). Suppose that

(28)

Define

(29)

With these assumptions,

(30)
(31)

where

(32)
Proof.

Hereafter we write instead of in the interests of brevity.

Define . From the definition of the estimate, we have that

From the definition of the SGL norm, this expands to

This can be rearranged as

(33)

We will work on each of the two terms separately. First, by the triangle inequality, we have that

As a consequence,

From Schwarz’ inequality, we get

for any . Combining everything gives

(34)

for any subset . Second, for any subset , the decomposability of implies that

while the triangle inequality implies that

Therefore

(35)

If we now choose to be the set corresponding to the largest elements of by magnitude, then

With this choice of , (35) becomes

(36)

Substituting the bounds (34) and (35) into (33) gives

Now recall the definition of the constant from (29). Using this definition, the above inequality can be rearranged as

and equivalently as

(37)

This is the first of two equations that we need.

Now we derive the second equation. From Theorem 2.3, we know that the matrix satisfies the -robust null space property, namely (22). An application of Schwarz’ inequality shows that

However, because both and are feasible for the optimization problem in (20), we get

Substituting this bound for gives us the second equation we need, namely

or equivalently

(38)

The two inequalities (37) and (38) can be written compactly as

(39)

where the coefficient matrix is given by

The matrix has positive diagonal elements (recall that ), and negative off-diagonal elements. Therefore, if , then every element of is positive, in which case we can multiply both sides of (39) by . Now

Recall the definition of from (29). Now routine algebra shows that

which is precisely (28). Thus we can multiply both sides of (39) by , which gives

Clearing out the matrix multiplication gives

(40)

Now the triangle inequality states that

Substituting from (40) gives

By substituting that , we get (30) with the constants as defined in (32).

To prove (31), suppose . This part of the proof closely follows that of [14, Theorem 4.22], except that we provide explicit values for the constants. Let denote the index set of the largest components of by magnitude. Then

We will bound each term separately. First, by [14, Theorem 2.5] and (30), we get

(41)

Now we apply in succession Hölder’s inequality, the robust null space property, the fact that , and (30). This gives

(42)

Adding (41) and (42) gives (31).

In the above theorem, we started with the restricted isometry constant and computed an upper bound on in order for SGL and CLOT to achieve robust sparse recovery. As gets closer to the limit (which is known to be the best possible in view of Theorem 1.2), the limit on would approach zero. It is also possible to start with and find an upper bound on , by rearranging the inequalities. As this involves just routine algebra, we simply present the final bound. Given the number , define as in (23), and define

(43)
(44)

Given , define as in (29), and define

If the matrix satisfies the RIP of order with a constant , then SGL achieves robust sparse recovery of order provided

(45)

3 Numerical Examples

In this section we present three simulation studies, to demonstrate the various theoretical results presented thus far.

3.1 Comparison of CLOT with LASSO and EN

In this subsection we present the results of running the same four simulations as in [5, Section 5]. The examples were run using LASSO, EN and CLOT. We carried out all three optimizations in Matlab using the cvx optimization package.
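The evaluation protocol reported in Table 1 (median mean-squared error over 50 replications, with the standard deviation of the median estimated by bootstrapping the 50 errors 500 times) can be reproduced with a simple harness. The sketch below is our own stand-in in Python/cvxpy: it uses a generic synthetic regression task and a Lagrangian CLOT fit with arbitrary parameters, not the actual four examples of [5, Section 5] or the paper's Matlab/cvx code.

```python
import numpy as np
import cvxpy as cp

def clot_fit(A, y, lam=1.0, mu=0.5):
    """CLOT in Lagrangian form: ||y - A z||_2^2 + lam*(mu*||z||_1 + (1-mu)*||z||_2)."""
    z = cp.Variable(A.shape[1])
    pen = mu * cp.norm(z, 1) + (1 - mu) * cp.norm(z, 2)
    cp.Problem(cp.Minimize(cp.sum_squares(y - A @ z) + lam * pen)).solve()
    return z.value

def median_mse(fit, reps=50, boot=500, seed=0):
    """Median coefficient MSE over `reps` replications of a synthetic task,
    plus the bootstrap standard deviation of that median."""
    rng = np.random.default_rng(seed)
    errs = np.empty(reps)
    for r in range(reps):
        A = rng.standard_normal((40, 8))
        x_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])  # illustrative sparse vector
        y = A @ x_true + 3.0 * rng.standard_normal(40)
        errs[r] = np.mean((fit(A, y) - x_true) ** 2)
    boot_medians = [np.median(rng.choice(errs, size=reps)) for _ in range(boot)]
    return np.median(errs), np.std(boot_medians)

print(median_mse(clot_fit))
```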

Method Example 1 Example 2 Example 3 Example 4
CLOT 2.66(0.51) 1.84(0.29) 35.35(1.62) 53.14(4.94)
EN 2.71(0.37) 2.73(0.32) 25.82(1.48) 54.81(5.13)
LASSO 3.46(0.37) 3.60(0.39) 51.61(2.3) 68.74(7.43)
Table 1: Median mean-squared errors for the simulated examples and three methods, based on 50 replications. The numbers in parentheses are the standard deviations of the medians, estimated using the bootstrap method with 500 resamplings of the 50 mean-squared errors.

Method Example 1 Example 2 Example 3 Example 4
CLOT 7 8 30 24
EN 8 8 37 32
LASSO 5 6 21 18
True 3 8 20 15
Table 2: Median number of nonzero coefficients as well as the true values


Figure 1: Box plots for CLOT, EN, and LASSO for the four examples (panels (a)–(d)).

It can be seen from Tables 1 and 2 as well as Figure 1 that CLOT and EN perform comparably, and that both perform better than LASSO. Note that in all the simulations in this example, the true vector is not sparse, which is why EN outperforms LASSO. Moreover, Table 2 shows that CLOT achieves accuracy comparable to EN with a smaller number of nonzero features.

3.2 Grouping Effect of CLOT

To illustrate that CLOT demonstrates the grouping effect as does EN (see Theorem 2.2), we ran the same example as at the end of [5, Section 5]. Specifically, we chose and to be two independent random variables, and the observation as . The six observations were

where the are i.i.d. . The objective was to express as a linear combination of through . In other words, we wished to express , where is the matrix with the