1 Introduction
The LASSO and the Elastic Net (EN) formulations are among the most popular approaches for sparse regression and compressed sensing. In this section, we briefly review these two problems and their current status, so as to provide the background for the remainder of the paper.
1.1 Sparse Regression
In sparse regression, one is given a measurement matrix (also called a design matrix in statistics) $A \in \mathbb{R}^{m \times n}$ where $m < n$, together with a measurement or measured vector $y \in \mathbb{R}^m$. The objective is to choose a vector $x \in \mathbb{R}^n$ such that $x$ is rather sparse, and $Ax$ is either exactly or approximately equal to $y$. The problem of finding the sparsest $x$ that satisfies $Ax = y$ is known to be NP-hard [1]; therefore it is necessary to find alternate approaches.
For the sparse regression problem, the general approach is to determine the estimate $\hat{x}$ by solving the constrained minimization problem

$$\hat{x} := \operatorname*{arg\,min}_{x} \|y - Ax\|_2^2 \ \text{ s.t. } \ R(x) \leq \rho , \qquad (1)$$

or in Lagrangian form,

$$\hat{x} := \operatorname*{arg\,min}_{x} \|y - Ax\|_2^2 + \lambda R(x) , \qquad (2)$$

where $R(\cdot)$ is known as a "regularizer," and $\rho, \lambda$ are adjustable parameters. Different choices of the regularizer lead to different approaches. With the choice $R(x) = \|x\|_2^2$, the approach is known as ridge regression [2], which builds on earlier work [3]. The LASSO approach [4] results from choosing $R(x) = \|x\|_1$, while the Elastic Net (EN) approach [5] results from choosing

$$R_{\mathrm{EN},\mu}(x) := (1 - \mu) \|x\|_1 + \mu \|x\|_2^2 , \qquad (3)$$

where $\mu \in [0, 1]$ is an adjustable parameter. Note that the EN regularizer interpolates between ridge regression and LASSO, in the sense that EN reduces to LASSO if $\mu = 0$ and to ridge regression if $\mu = 1$. A very general approach to regression using a convex regularizer is given in [6]. The LASSO approach can be shown to return a solution with no more than $m$ nonzero components, under mild regularity conditions; see [7]. There is no such bound on the number of nonzero components of $\hat{x}$ when EN is used. However, when the columns of the matrix $A$ are highly correlated, LASSO chooses just one of these columns and ignores the rest. Measurement matrices with highly correlated columns occur in many practical situations, for example, in microarray measurements of messenger RNA, otherwise known as gene expression data. The EN approach was proposed at least in part to overcome this undesirable behavior of the LASSO formulation. It is shown in [5, Theorem 1] that if two columns (say $a^i$ and $a^j$) of the matrix $A$ are highly correlated, then the corresponding components $\hat{x}_i$ and $\hat{x}_j$ of the EN solution are nearly equal. This is known as the "grouping effect," and the point is that EN exhibits the grouping effect whereas LASSO does not.
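To make the Lagrangian formulation (2) concrete, here is a minimal sketch of solving it with the three regularizers above, written in Python with the cvxpy package (our choice for illustration; the paper's own experiments in Section 3 use Matlab and cvx). The data and the parameter values are placeholders.

```python
# Minimal sketch of the Lagrangian formulation (2) with the three regularizers
# discussed above.  cvxpy and the synthetic data are assumptions made for the
# sketch; the paper's own experiments (Section 3) use Matlab/cvx.
import numpy as np
import cvxpy as cp

def penalized_regression(A, y, lam, regularizer):
    """Solve min_x ||y - A x||_2^2 + lam * R(x) for a convex regularizer R."""
    x = cp.Variable(A.shape[1])
    problem = cp.Problem(cp.Minimize(cp.sum_squares(y - A @ x) + lam * regularizer(x)))
    problem.solve()
    return x.value

# The three choices of R discussed in Section 1.1.
ridge = lambda x: cp.sum_squares(x)          # R(x) = ||x||_2^2  (ridge regression)
lasso = lambda x: cp.norm(x, 1)              # R(x) = ||x||_1    (LASSO)
def elastic_net(mu):                         # R(x) as in (3)    (Elastic Net)
    return lambda x: (1 - mu) * cp.norm(x, 1) + mu * cp.sum_squares(x)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 200))
    x_true = np.zeros(200); x_true[:5] = 1.0
    y = A @ x_true + 0.01 * rng.standard_normal(50)
    x_hat = penalized_regression(A, y, lam=0.1, regularizer=elastic_net(0.5))
    print("nonzero entries:", int(np.sum(np.abs(x_hat) > 1e-4)))
```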
1.2 Compressed Sensing
In compressed sensing, the objective is to choose the measurement matrix $A$ (which is part of the given data in sparse regression), such that whenever the vector $x$ is nearly sparse, it is possible to nearly recover $x$ from noise-corrupted measurements of the form $y = Ax + \eta$. Let us make the problem formulation precise. For this purpose we begin by introducing some notation.
Throughout, the symbol $[n]$ denotes the index set $\{1, \dots, n\}$. The support of a vector $x \in \mathbb{R}^n$ is denoted by $\operatorname{supp}(x)$ and is defined as

$$\operatorname{supp}(x) := \{ i \in [n] : x_i \neq 0 \} .$$

A vector $x$ is said to be $k$-sparse if $|\operatorname{supp}(x)| \leq k$. The set of all $k$-sparse vectors in $\mathbb{R}^n$ is denoted by $\Sigma_k$. The $k$-sparsity index of a vector $x \in \mathbb{R}^n$ with respect to a given norm $\|\cdot\|$ is defined as

$$\sigma_k(x, \|\cdot\|) := \min_{z \in \Sigma_k} \|x - z\| . \qquad (4)$$

It is obvious that $\sigma_k(x, \|\cdot\|) = 0$ if and only if $x \in \Sigma_k$, for every norm $\|\cdot\|$.
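For an $\ell_p$ norm, the minimum in (4) is attained by keeping the $k$ largest-magnitude components of $x$ and zeroing out the rest, so the sparsity index is easy to compute. A small illustrative helper (the function name is ours):

```python
# Illustrative computation of the k-sparsity index sigma_k(x, ||.||_p) of (4).
# For an l_p norm the minimizer keeps the k largest-magnitude components of x,
# so sigma_k is the l_p norm of the remaining "tail".
import numpy as np

def sparsity_index(x, k, p=2):
    x = np.asarray(x, dtype=float)
    if k >= x.size:
        return 0.0
    tail = np.sort(np.abs(x))[: x.size - k]   # the n - k smallest magnitudes
    return float(np.linalg.norm(tail, ord=p))

# sigma_k vanishes exactly on k-sparse vectors:
assert sparsity_index([3.0, 0.0, -2.0, 0.0], k=2) == 0.0
print(sparsity_index([3.0, 1.0, -2.0, 0.5], k=2, p=1))   # prints 1.5
```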
The general formulation of the compressed sensing problem given below is essentially taken from [8]. Suppose that $A \in \mathbb{R}^{m \times n}$ is the "measurement matrix," and $\Delta : \mathbb{R}^m \to \mathbb{R}^n$ is the "decoder map," where $m < n$. Suppose $x \in \mathbb{R}^n$ is an unknown vector that is to be recovered. The input to the decoder consists of $y = Ax + \eta$, where $\eta \in \mathbb{R}^m$ denotes the measurement noise, and a prior upper bound of the form $\|\eta\|_2 \leq \epsilon$ is available; in other words, $\epsilon$ is a known number. In this set-up, the vector $\hat{x} = \Delta(y)$ is the approximation to the original vector $x$. With these conventions, we can now state the following.
Definition 1.1.
Suppose $p \in [1, 2]$. The pair $(A, \Delta)$ is said to achieve robust sparse recovery of order $k$ with respect to $\|\cdot\|_p$ if there exist constants $C$ and $D$, which might depend on $A$, $\Delta$ and $k$ but not on $x$ or $\eta$, such that

$$\| \Delta(Ax + \eta) - x \|_p \leq C \, \sigma_k(x, \|\cdot\|_1) + D \, \epsilon , \quad \forall x \in \mathbb{R}^n , \ \|\eta\|_2 \leq \epsilon . \qquad (5)$$

The restriction that $p \leq 2$ is tied up with the fact that the bound on the noise is stated for the Euclidean norm $\|\eta\|_2$. The usual choices for $p$ in (5) are $p = 1$ and $p = 2$.
Among the most popular approaches to compressed sensing is $\ell_1$-norm minimization, which was popularized in a series of papers, of which we cite only [9, 10, 11, 12]. The survey paper [13] has an extensive bibliography on the topic, as does the recent book [14]. In this approach, the estimate $\hat{x}$ is defined as

$$\hat{x} := \operatorname*{arg\,min}_{z \in \mathbb{R}^n} \|z\|_1 \ \text{ s.t. } \ \|y - Az\|_2 \leq \epsilon . \qquad (6)$$

Note that the above definition does indeed define a decoder map $\Delta_1 : y \mapsto \hat{x}$. In order for the pair $(A, \Delta_1)$ to achieve robust sparse recovery, the matrix $A$ is chosen so as to satisfy a condition defined next.
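For concreteness, here is a hedged sketch of the decoder defined by (6), written as a convex program in cvxpy (an illustrative choice of solver; the function name is ours).

```python
# Hedged sketch of the l1-norm minimization decoder of (6); cvxpy is used as a
# stand-in convex solver and the function name is illustrative.
import cvxpy as cp

def l1_decoder(A, y, epsilon):
    """Return argmin_z ||z||_1 subject to ||y - A z||_2 <= epsilon."""
    z = cp.Variable(A.shape[1])
    problem = cp.Problem(cp.Minimize(cp.norm(z, 1)),
                         [cp.norm(y - A @ z, 2) <= epsilon])
    problem.solve()
    return z.value
```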
Definition 1.2.
A matrix $A \in \mathbb{R}^{m \times n}$ is said to satisfy the Restricted Isometry Property (RIP) of order $k$ with constant $\delta_k$ if

$$(1 - \delta_k) \|u\|_2^2 \leq \|Au\|_2^2 \leq (1 + \delta_k) \|u\|_2^2 , \quad \forall u \in \Sigma_k . \qquad (7)$$
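Computing the RIP constant $\delta_k$ exactly is intractable in general, since (7) must hold over all supports of size $k$; the brute-force sketch below (illustrative only, with a hypothetical helper name) makes the definition concrete for a tiny random matrix.

```python
# Computing the RIP constant of (7) exactly requires examining every support of
# size k (exponentially many), so this brute-force check is illustrative only
# and feasible just for tiny matrices; the function name is ours.
import itertools
import numpy as np

def rip_constant(A, k):
    """delta_k = max over all k-column submatrices A_S of
       max( lambda_max(A_S^T A_S) - 1 , 1 - lambda_min(A_S^T A_S) )."""
    n = A.shape[1]
    delta = 0.0
    for S in itertools.combinations(range(n), k):
        G = A[:, list(S)].T @ A[:, list(S)]        # Gram matrix of the submatrix
        eigs = np.linalg.eigvalsh(G)
        delta = max(delta, eigs[-1] - 1.0, 1.0 - eigs[0])
    return delta

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 15)) / np.sqrt(10)    # columns have unit norm on average
print("delta_2 is approximately", round(rip_constant(A, 2), 3))
```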
Starting with [9], several papers have derived sufficient conditions that the RIP constant of the matrix $A$ must satisfy in order for $\ell_1$-norm minimization to achieve robust sparse recovery. Recently, the "best possible" bound has been proved in [15]. These results are stated here for the convenience of the reader.
Theorem 1.1.
([15]) Suppose that $A$ satisfies the RIP of order $2k$ with constant $\delta_{2k} < 1/\sqrt{2}$. Then the pair $(A, \Delta_1)$, where $\Delta_1$ denotes the $\ell_1$-norm minimization decoder of (6), achieves robust sparse recovery of order $k$.
Theorem 1.2.
([15]) The bound of Theorem 1.1 cannot be improved: for every $\epsilon > 0$, there exist a matrix $A$ satisfying the RIP of order $2k$ with constant $\delta_{2k} < 1/\sqrt{2} + \epsilon$ and a $k$-sparse vector $x$ that is not recovered by $\ell_1$-norm minimization. Thus the bound $\delta_{2k} < 1/\sqrt{2}$ is the best possible.
Observe that the Lagrangian formulation of the LASSO approach is

$$\min_{x \in \mathbb{R}^n} \|y - Ax\|_2^2 + \lambda \|x\|_1 ,$$

whereas the Lagrangian formulation of (6) is

$$\min_{x \in \mathbb{R}^n} \|x\|_1 + \lambda \|y - Ax\|_2 ,$$

which, up to a reparametrization of $\lambda$, is essentially the same as

$$\min_{x \in \mathbb{R}^n} \|y - Ax\|_2 + \lambda' \|x\|_1 .$$

This last formulation of sparse regression is known as "square-root LASSO" [16]. Therefore the community refers to the approach to compressed sensing given in (6) as the LASSO, though this may not be strictly accurate.
1.3 Compressed Sensing with Group Sparsity
Over the years some variants of LASSO have been proposed for compressed sensing, such as the Group LASSO (GL) [17] and the Sparse Group LASSO (SGL) [18]. In the GL formulation, the index set $[n]$ is partitioned into $g$ disjoint sets $\mathcal{G} = \{G_1, \dots, G_g\}$, and the associated norm is defined as

$$\|x\|_{\mathrm{GL},\mathcal{G}} := \sum_{i=1}^{g} \|x_{G_i}\|_2 , \qquad (8)$$

where $x_{G_i}$ denotes the projection of the vector $x$ onto the components in $G_i$. The notation is intended to remind us that the norm depends on the specific partitioning $\mathcal{G}$. Some authors divide the term $\|x_{G_i}\|_2$ by $\sqrt{|G_i|}$, but we do not do that. A further refinement of GL is the sparse group LASSO (SGL), in which the group structure is as before, but the norm is now defined as

$$\|x\|_{\mathrm{S},\mu} := (1 - \mu) \|x\|_1 + \mu \sum_{i=1}^{g} \|x_{G_i}\|_2 , \qquad (9)$$
where as before $x_{G_i}$ denotes the projection of $x$ onto the components in $G_i$, and $\mu \in [0, 1]$ is an adjustable weight. If $x \in \mathbb{R}^n$ is an unknown vector, then recovery of $x$ is attempted via

$$\hat{x} := \operatorname*{arg\,min}_{z} \|z\|_{\mathrm{GL},\mathcal{G}} \ \text{ s.t. } \ \|y - Az\|_2 \leq \epsilon \qquad (10)$$

in Group LASSO, and via

$$\hat{x} := \operatorname*{arg\,min}_{z} \|z\|_{\mathrm{S},\mu} \ \text{ s.t. } \ \|y - Az\|_2 \leq \epsilon \qquad (11)$$

in Sparse Group LASSO.
The main idea behind GL is that one is less concerned about the number of nonzero components of $\hat{x}$, and more concerned about the number of distinct groups containing these nonzero components. Therefore GL attempts to choose an estimate $\hat{x}$ that has nonzero entries in as few distinct sets $G_i$ as possible. In principle, SGL tries to choose an estimate that not only has nonzero components within as few groups as possible, but, within those groups, has as few nonzero components as possible. Note that if every group $G_i$ is a singleton, then SGL reduces to LASSO (because of the summability of the $\ell_1$-norm), whereas if $\mu = 1$, then SGL reduces to GL. Note too that if $g = n$ and every set $G_i$ is the singleton $\{i\}$, then GL reduces to LASSO.
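The group norms (8) and (9) can be computed directly from a partition of the index set. The sketch below follows the weight convention used above ($(1-\mu)$ on the $\ell_1$ part, $\mu$ on the group part); the function names are illustrative.

```python
# Sketch of the group norms (8) and (9).  The weight convention follows the
# text ((1 - mu) on the l1 part, mu on the group part); function names are
# illustrative.
import numpy as np

def gl_norm(x, groups):
    """Group LASSO norm (8): sum of the l2 norms of x restricted to each group."""
    x = np.asarray(x, dtype=float)
    return sum(np.linalg.norm(x[list(g)]) for g in groups)

def sgl_norm(x, groups, mu):
    """Sparse Group LASSO norm (9): (1 - mu)*||x||_1 + mu*sum_i ||x_{G_i}||_2."""
    x = np.asarray(x, dtype=float)
    return (1 - mu) * np.linalg.norm(x, 1) + mu * gl_norm(x, groups)

x = np.array([1.0, -2.0, 0.0, 0.5])
groups = [[0, 1], [2, 3]]
print(gl_norm(x, groups), sgl_norm(x, groups, mu=0.5))
# mu = 0 recovers the l1 norm (LASSO); mu = 1 recovers the GL norm; singleton
# groups collapse both to the l1 norm.
```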
1.4 Motivation and Contributions of the Paper
Now we come to the motivation and contributions of the present paper. The LASSO formulation is well-suited for compressed sensing (see Theorem 1.1), but not so well-suited for sparse regression, because it lacks the grouping effect. The EN formulation is well-suited for sparse regression as it exhibits the grouping effect, but it is not known whether it can achieve compressed sensing.
The first result presented in the paper is that if the EN regularizer of (3) is used instead of the $\ell_1$-norm in (6), then the resulting approach does not achieve robust sparse recovery unless the number of measurements $m$ grows linearly with respect to $n$, the size of the vector. This would not be considered "compressed" sensing. This led us to formulate another regularizer, namely

$$\|x\|_{\mathrm{C},\mu} := (1 - \mu) \|x\|_1 + \mu \|x\|_2 , \quad \mu \in [0, 1) . \qquad (12)$$

We refer to this as the CLOT norm, with CLOT standing for Combined L-One and Two. It is shown that the CLOT norm combines the best features of both LASSO and EN, in that
- When the CLOT norm is used as the regularizer in sparse regression, the resulting solution exhibits the grouping effect.
- When the $\ell_1$-norm is replaced by the CLOT norm in (6), the resulting solution achieves robust sparse recovery if the matrix $A$ satisfies the RIP with a suitable bound on the RIP constant.
- Moreover, if $\mu$ in CLOT is set to zero so that CLOT becomes LASSO, the bound on the RIP constant reduces to the "best possible" bound in Theorem 1.1.
Clearly the CLOT norm is a special case of the SGL norm, with the entire index set $[n]$ being taken as a single group (though the adjective "sparse" is then no longer appropriate). This led us to explore whether the SGL norm achieves either the grouping effect or robust sparse recovery. We are able to show that SGL does indeed achieve both.
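Since, under the convention of (12), the CLOT norm is a convex combination of the $\ell_1$ and $\ell_2$ norms, it always lies between them, $\|x\|_2 \leq \|x\|_{\mathrm{C},\mu} \leq \|x\|_1$. The following self-contained snippet (our own helper name, assumed convention) checks this numerically.

```python
# The CLOT norm (12) interpolates between the l2 and l1 norms for mu in [0, 1].
# Self-contained numerical check; the helper name is illustrative.
import numpy as np

def clot_norm(x, mu):
    """CLOT norm (12): (1 - mu)*||x||_1 + mu*||x||_2."""
    x = np.asarray(x, dtype=float)
    return (1 - mu) * np.linalg.norm(x, 1) + mu * np.linalg.norm(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
for mu in (0.0, 0.25, 0.5, 0.75, 1.0):
    value = clot_norm(x, mu)
    assert np.linalg.norm(x) <= value <= np.linalg.norm(x, 1) + 1e-12
    print(f"mu = {mu:.2f}:  ||x||_C = {value:.4f}")
```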
Now we place these contributions in perspective. There is empirical evidence to support the belief that both the GL and the SGL formulations work well for compressed sensing. However, until the publication of a companion paper by a subset of the present authors [19], there were no proofs that either of these formulations achieved robust sparse recovery. In [19], it is shown that both the GL and SGL formulations achieve robust sparse recovery provided the group sizes are sufficiently small. This restriction on group sizes is removed in the present paper. Moreover, so far as the authors are aware, until now there have been no results on the grouping effect for either of these formulations. In the present paper, it is shown that if two columns of the measurement matrix that belong to the same group are highly correlated, then the corresponding components of the estimate have nearly equal values. However, if two highly correlated columns belong to different groups, then their coefficients need not be nearly equal. From the standpoint of applications, this is a highly desirable property. To illustrate, suppose the groups represent biological pathways. Then one would wish to assign roughly similar weights to genes in the same pathway, but not necessarily to those in disjoint pathways.
Thus the contributions of the present paper are to show that:
- The EN formulation does not achieve robust sparse recovery.
- Both the CLOT and SGL formulations achieve robust sparse recovery as well as the grouping effect.
- The condition under which CLOT achieves robust sparse recovery reduces to the "best possible" condition in Theorem 1.1.
Taken together, these results might indicate that CLOT and SGL are attractive alternatives to the LASSO and EN formulations.
2 Main Theoretical Results
This section contains the main contributions of the paper. First it is shown in Section 2.1 that the EN approach does not achieve robust sparse recovery, and is therefore not suitable for compressed sensing applications. Next, it is shown in Section 2.2 that the SGL formulation assigns nearly equal weights to highly correlated features within the same group, though not necessarily to highly correlated features from different groups. It follows as a corollary that CLOT assigns nearly equal weights to highly correlated features. Then it is shown in Section 2.3 that the SGL formulation achieves robust sparse recovery. The companion paper by a subset of the present authors [19] establishes that SGL achieves robust sparse recovery of order $k$ provided that each group size is sufficiently small; there is no such restriction here. It follows as a corollary that CLOT also achieves robust sparse recovery.
2.1 Lack of Robust Sparse Recovery of the Elastic Net Formulation
The first result of this section shows that the EN formulation does not achieve robust sparse recovery, and is therefore not suitable for compressed sensing applications.
Theorem 2.1.
Suppose the matrix $A \in \mathbb{R}^{m \times n}$ has the following property: there exist constants $C$ and $D$ such that, whenever $y = Ax + \eta$ for some $x \in \mathbb{R}^n$ and $\eta \in \mathbb{R}^m$ with $\|\eta\|_2 \leq \epsilon$, the solution

$$\hat{x} := \operatorname*{arg\,min}_{z} R_{\mathrm{EN},\mu}(z) \ \text{ s.t. } \ \|y - Az\|_2 \leq \epsilon$$

satisfies

$$\|\hat{x} - x\|_2 \leq C \, \sigma_k(x, \|\cdot\|_2) + D \, \epsilon . \qquad (13)$$

Then

$$m \geq \gamma n , \qquad (14)$$

where $\gamma > 0$ is a constant that depends only on $C$.
Proof: Let $\mathcal{N}(A)$ denote the null space of the matrix $A$, that is, the set of all $h \in \mathbb{R}^n$ such that $Ah = 0$. For a measurement vector $y$, write $\Delta(y)$ for the corresponding EN solution $\hat{x}$. Let $h \in \mathcal{N}(A)$ be arbitrary, and let $S$ denote the index set of the $k$ largest components of $h$ by magnitude. Therefore

$$\sigma_k(h, \|\cdot\|_2) = \|h_{S^c}\|_2 .$$

Next, (13) implies that, if the noise $\eta$ equals zero, then $\|\Delta(Ax) - x\|_2 \leq C \, \sigma_k(x, \|\cdot\|_2) + D\epsilon$ for all $x$. In other words, for every $x \in \mathbb{R}^n$,

$$\|\Delta(Ax) - x\|_2 \leq C \, \sigma_k(x, \|\cdot\|_2) + D \epsilon . \qquad (15)$$

Now observe that, because $Ah = 0$, we have that $Ah_S = -Ah_{S^c}$, and more generally,

$$A(\alpha h_S) = -A(\alpha h_{S^c}) , \quad \forall \alpha > 0 .$$

Apply (15) with $x = \alpha h_S$ and with $x = -\alpha h_{S^c}$; both choices correspond to the same measurement $y_\alpha := A(\alpha h_S)$. This leads to

$$\|\Delta(y_\alpha) - \alpha h_S\|_2 \leq D\epsilon , \qquad \|\Delta(y_\alpha) + \alpha h_{S^c}\|_2 \leq \alpha C \, \|h_{S^c}\|_2 + D\epsilon ,$$

where the first inequality uses the fact that $h_S$ is $k$-sparse, so that $\sigma_k(\alpha h_S, \|\cdot\|_2) = 0$, and the second uses $\sigma_k(-\alpha h_{S^c}, \|\cdot\|_2) \leq \alpha \|h_{S^c}\|_2$. By the triangle inequality,

$$\alpha \|h\|_2 = \|\alpha h_S + \alpha h_{S^c}\|_2 \leq \|\Delta(y_\alpha) - \alpha h_S\|_2 + \|\Delta(y_\alpha) + \alpha h_{S^c}\|_2 \leq \alpha C \, \|h_{S^c}\|_2 + 2 D \epsilon .$$

Now divide both sides by $\alpha$, and observe that $2D\epsilon / \alpha \to 0$ as $\alpha \to \infty$. Therefore

$$\|h\|_2 \leq C \, \|h_{S^c}\|_2 = C \, \sigma_k(h, \|\cdot\|_2) , \quad \forall h \in \mathcal{N}(A) .$$

This is Equation (5.2) of Cohen-Dahmen-DeVore (2009) with constant $C$. As shown in Theorem 5.1 of that paper, this implies that $m \geq \gamma n$ for a constant $\gamma > 0$ depending only on $C$, which is the desired conclusion.
2.2 Grouping Property of the SGL and CLOT Formulations
One advantage of the EN over LASSO is that the former assigns roughly equal weights to highly correlated features, as shown in [5, Theorem 1] and referred to as the grouping effect. In contrast, if LASSO chooses one feature among a set of highly correlated features, then generically it assigns a zero weight to all the rest. To illustrate, if two columns of $A$ are identical, then in principle LASSO could assign nonzero weights to both columns; however, the slightest perturbation in the data would cause one or the other weight to become zero. The drawback of this is that the finally selected feature set is very sensitive to noise in the measurements. In this section we prove an analog of [5, Theorem 1] for the SGL formulation. Our result states that if two highly correlated features within the same group are chosen by SGL, then they will have roughly similar weights. Since CLOT is a special case of SGL with the entire feature set treated as one group, it follows that CLOT assigns roughly similar weights to highly correlated features in the entire set of features. As a result, the final feature sets obtained using SGL or CLOT are less sensitive to noise in the measurements than the ones obtained using LASSO.
Theorem 2.2.
Let $y \in \mathbb{R}^m$ and $A \in \mathbb{R}^{m \times n}$ be some vector and matrix respectively. Without loss of generality, suppose that $y$ is centered, i.e. $\mathbf{1}_m^\top y = 0$, where $\mathbf{1}_m$ denotes a column vector consisting of $m$ ones, and that $A$ is standardized, i.e. $\mathbf{1}_m^\top a^j = 0$ and $\|a^j\|_2 = 1$ for all $j$, where $a^j$ denotes the $j$-th column of $A$. Suppose $\mu \in (0, 1]$, and let $\mathcal{G} = \{G_1, \dots, G_g\}$ denote a partition of $[n]$ into disjoint subsets. Define

$$\hat{x} := \operatorname*{arg\,min}_{x} \|y - Ax\|_2^2 + \lambda \|x\|_{\mathrm{S},\mu} , \qquad (16)$$

where $\lambda > 0$ is a Lagrange multiplier. Suppose that, for two indices $i, j$ belonging to the same group $G_l$, we have that $\hat{x}_i \hat{x}_j > 0$, where $\hat{x}_i, \hat{x}_j$ denote the corresponding components of the vector $\hat{x}$. By changing the sign of one of the columns of $A$ if necessary, it can be assumed that $\hat{x}_i > 0$ and $\hat{x}_j > 0$. Define

$$\rho := (a^i)^\top a^j .$$

Then

$$| \hat{x}_i - \hat{x}_j | \leq \frac{2 \, \|\hat{x}_{G_l}\|_2 \, \|y\|_2}{\lambda \mu} \, \sqrt{2 (1 - \rho)} , \qquad (17)$$

where $\hat{x}_{G_l}$ is shorthand for the projection of $\hat{x}$ onto the components in $G_l$.
Proof: Define

$$L(x) := \|y - Ax\|_2^2 + \lambda \left[ (1 - \mu) \|x\|_1 + \mu \sum_{t=1}^{g} \|x_{G_t}\|_2 \right] ,$$

where, as above, $x_{G_t}$ denotes the projection of $x$ onto the components in $G_t$. Then $L$ is differentiable with respect to $x_i$ whenever $x_i \neq 0$. In particular, since both $\hat{x}_i$ and $\hat{x}_j$ are nonzero by assumption, it follows that

$$\frac{\partial L}{\partial x_i}\bigg|_{x = \hat{x}} = 0 , \qquad \frac{\partial L}{\partial x_j}\bigg|_{x = \hat{x}} = 0 .$$

Expanding the partial derivatives leads to

$$-2 (a^i)^\top (y - A\hat{x}) + \lambda (1 - \mu)\, \mathrm{sign}(\hat{x}_i) + \lambda \mu \, \frac{\hat{x}_i}{\|\hat{x}_{G_l}\|_2} = 0 ,$$

$$-2 (a^j)^\top (y - A\hat{x}) + \lambda (1 - \mu)\, \mathrm{sign}(\hat{x}_j) + \lambda \mu \, \frac{\hat{x}_j}{\|\hat{x}_{G_l}\|_2} = 0 .$$

Subtracting one equation from the other, and noting that $\mathrm{sign}(\hat{x}_i) = \mathrm{sign}(\hat{x}_j)$, gives

$$\lambda \mu \, \frac{\hat{x}_i - \hat{x}_j}{\|\hat{x}_{G_l}\|_2} = 2 (a^i - a^j)^\top (y - A\hat{x}) .$$

Hence

$$| \hat{x}_i - \hat{x}_j | \leq \frac{2 \|\hat{x}_{G_l}\|_2}{\lambda \mu} \, \|a^i - a^j\|_2 \, \|y - A\hat{x}\|_2 \leq \frac{2 \|\hat{x}_{G_l}\|_2 \, \|y\|_2}{\lambda \mu} \, \sqrt{2 (1 - \rho)} .$$

In the last step, we use the fact that

$$\|a^i - a^j\|_2^2 = \|a^i\|_2^2 + \|a^j\|_2^2 - 2 (a^i)^\top a^j = 2 (1 - \rho) ,$$

together with the observation that $L(\hat{x}) \leq L(0) = \|y\|_2^2$, which implies $\|y - A\hat{x}\|_2 \leq \|y\|_2$. This establishes (17), which is the desired conclusion.
Let us illustrate the above result using the CLOT formulation. In the case of the CLOT formulation we have $g = 1$, $G_1 = [n]$, and $\hat{x}_{G_1} = \hat{x}$, so the inequality in (17) becomes

$$| \hat{x}_i - \hat{x}_j | \leq \frac{2 \, \|\hat{x}\|_2 \, \|y\|_2}{\lambda \mu} \, \sqrt{2 (1 - \rho)} , \qquad (18)$$

where $\hat{x}$ is the solution of the CLOT formulation, and

$$\rho = (a^i)^\top a^j . \qquad (19)$$

Now suppose that two indices $i$ and $j$ correspond to highly correlated columns, so that $\rho$ as defined in (19) is close to one, and the right-hand side of the inequality in (18) is almost equal to zero. We can then conclude that $\hat{x}_i \approx \hat{x}_j$, so CLOT assigns similar weights to highly correlated variables.
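As a numerical companion to this discussion, the following sketch (synthetic data, cvxpy, and arbitrary values of $\lambda$ and $\mu$, all chosen only for illustration) solves the Lagrangian CLOT problem with two nearly identical columns and prints their weights, which should come out nearly equal as the bound (18) suggests.

```python
# Small numerical illustration of the grouping effect (Theorem 2.2) for the
# Lagrangian CLOT formulation.  The data, the solver (cvxpy), and the values of
# lam and mu are arbitrary choices made only for this sketch.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n = 40, 10
A = rng.standard_normal((m, n))
A[:, 1] = A[:, 0] + 1e-3 * rng.standard_normal(m)   # columns 0 and 1 nearly identical
A -= A.mean(axis=0)                                 # center the columns ...
A /= np.linalg.norm(A, axis=0)                      # ... and standardize them
y = A[:, 0] + A[:, 3] + 0.01 * rng.standard_normal(m)
y -= y.mean()                                       # center the observation

lam, mu = 0.5, 0.5
x = cp.Variable(n)
clot = (1 - mu) * cp.norm(x, 1) + mu * cp.norm(x, 2)
cp.Problem(cp.Minimize(cp.sum_squares(y - A @ x) + lam * clot)).solve()

# The weights on the two nearly identical columns should be nearly equal (cf. (18)).
print(x.value[0], x.value[1])
```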
Though the focus of the present paper is not on the GL formulation, we digress briefly to discuss the implications of Theorem 2.2 for GL. This theorem also implies that the GL formulation exhibits the grouping effect, because GL is a special case of SGL with $\mu = 1$. Indeed, it can be observed from (17) that the bound on the right side is minimized by setting $\mu = 1$, that is, by using GL instead of SGL. This is not surprising, because SGL not only tries to minimize the number of distinct groups containing the support of $\hat{x}$, but, within each group, also tries to choose as few elements as possible. Thus, within each group, SGL inherits the weaknesses of LASSO, and one would expect that, within each group, the selected feature set becomes more sensitive to noise as $\mu$ is decreased.
2.3 Robust Sparse Recovery of the SGL and CLOT Formulations
In this subsection, we present some sufficient conditions for the SGL and CLOT formulations to achieve robust sparse recovery. When CLOT is specialized to LASSO by setting $\mu = 0$, the sufficient condition reduces to the "tight" bound given in Theorem 1.1.
Recall the definitions. The CLOT norm with parameter $\mu$ is given by

$$\|x\|_{\mathrm{C},\mu} = (1 - \mu) \|x\|_1 + \mu \|x\|_2 ,$$

while the SGL norm is given by

$$\|x\|_{\mathrm{S},\mu} = (1 - \mu) \|x\|_1 + \mu \sum_{i=1}^{g} \|x_{G_i}\|_2 .$$

Recall also the problem set-up. The measurement vector equals $y = Ax + \eta$, where $\|\eta\|_2 \leq \epsilon$, a known upper bound. The recovered vector is defined as

$$\hat{x} := \operatorname*{arg\,min}_{z} \|z\|_{\mathrm{S},\mu} \ \text{ s.t. } \ \|y - Az\|_2 \leq \epsilon \qquad (20)$$

if SGL is used, and as

$$\hat{x} := \operatorname*{arg\,min}_{z} \|z\|_{\mathrm{C},\mu} \ \text{ s.t. } \ \|y - Az\|_2 \leq \epsilon \qquad (21)$$

if CLOT is used.
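Here is a hedged sketch of the decoders (20) and (21) as convex programs, again using cvxpy as a stand-in solver; the function names and the group structure are illustrative assumptions.

```python
# Hedged sketch of the constrained decoders (20) and (21); cvxpy, the helper
# names, and the group structure are assumptions made for the sketch.
import cvxpy as cp

def sgl_decoder(A, y, epsilon, groups, mu):
    """argmin_z (1-mu)*||z||_1 + mu*sum_i ||z_{G_i}||_2  s.t.  ||y - A z||_2 <= epsilon."""
    z = cp.Variable(A.shape[1])
    sgl = (1 - mu) * cp.norm(z, 1) + mu * sum(cp.norm(z[list(g)], 2) for g in groups)
    cp.Problem(cp.Minimize(sgl), [cp.norm(y - A @ z, 2) <= epsilon]).solve()
    return z.value

def clot_decoder(A, y, epsilon, mu):
    """Problem (21): the special case of (20) with the single group G_1 = [n]."""
    return sgl_decoder(A, y, epsilon, [list(range(A.shape[1]))], mu)
```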
Definition 2.1.
A matrix $A$ is said to satisfy the $\ell_2$-robust null space property (RNSP) of order $k$ if there exist constants $\rho \in (0, 1)$ and $\tau > 0$ such that, for all sets $S \subseteq [n]$ with $|S| \leq k$, we have

$$\|h_S\|_2 \leq \frac{\rho}{\sqrt{k}} \|h_{S^c}\|_1 + \tau \|Ah\|_2 , \quad \forall h \in \mathbb{R}^n . \qquad (22)$$

This property was apparently first introduced in [14, Definition 4.21]. Note that the definition in [14] is stated with an arbitrary norm on $\mathbb{R}^m$ in place of the Euclidean norm $\|Ah\|_2$. It is a ready consequence of Schwarz' inequality that, if (22) holds, then

$$\|h_S\|_1 \leq \rho \|h_{S^c}\|_1 + \tau \sqrt{k} \, \|Ah\|_2 , \quad \forall h \in \mathbb{R}^n .$$
The following result is established in [20] in the context of group sparsity, but is new even for conventional sparsity. The reader is directed to that source for the proof.
Theorem 2.3.
([20, Theorem 5]) Suppose that, for some number $t > 1$, the matrix $A$ satisfies the RIP of order $tk$ with constant $\delta_{tk}$. Let $\delta$ be an abbreviation for $\delta_{tk}$, and define the constants

(23)

(24)

(25)

(26)

Then $A$ satisfies the $\ell_2$-robust null space property of order $k$ with constants $\rho$ and $\tau$ given by

(27)
In Theorem 2.4 below, it is assumed that the matrix $A$ satisfies the RIP of order $2k$ with $\delta_{2k} < 1/\sqrt{2}$, in accordance with Theorem 1.1. With this assumption, we prove bounds on the residual error with SGL; the bounds for CLOT can be obtained simply by treating the entire index set $[n]$ as a single group in the SGL bounds. Note that, once bounds for the $\ell_2$-norm of $\hat{x} - x$ are proved, it is possible to extend the bounds to $\|\cdot\|_p$ for all $p \in [1, 2]$; see [14, Theorem 4.22].
Theorem 2.4.
Suppose that $\mu \in [0, 1)$ and that $A$ satisfies the GRIP of order $2k$ with constant $\delta_{2k} < 1/\sqrt{2}$, and define the constants $\rho$ and $\tau$ as in (27). Suppose that

(28)

Define

(29)

With these assumptions,

(30)

(31)

where

(32)
Proof.
Hereafter we write $\sigma_k$ instead of $\sigma_k(x, \|\cdot\|_1)$ in the interests of brevity.
Define $h := \hat{x} - x$. From the definition of the estimate $\hat{x}$ in (20), together with the fact that $x$ itself is feasible for (20) (because $\|y - Ax\|_2 = \|\eta\|_2 \leq \epsilon$), we have that

$$\|\hat{x}\|_{\mathrm{S},\mu} \leq \|x\|_{\mathrm{S},\mu} .$$

From the definition of the SGL norm, this expands to

$$(1 - \mu) \|\hat{x}\|_1 + \mu \sum_{i=1}^{g} \|\hat{x}_{G_i}\|_2 \leq (1 - \mu) \|x\|_1 + \mu \sum_{i=1}^{g} \|x_{G_i}\|_2 .$$

This can be rearranged as

(33)
We will work on each of the two terms separately. First, by the triangle inequality, we have that
As a consequence,
From Schwarz’ inequality, we get
for any . Combining everything gives
(34)
for any subset $S \subseteq [n]$. Second, for any subset $S$, the decomposability of the $\ell_1$-norm implies that
while the triangle inequality implies that
Therefore
(35)
If we now choose $S$ to be the index set of the $k$ largest elements of $x$ by magnitude, then $\|x_{S^c}\|_1 = \sigma_k$. With this choice of $S$, (35) becomes
(36)
Substituting the bounds (34) and (35) into (33) gives
Now recall the definition of the constant from (29). Using this definition, the above inequality can be rearranged as
and equivalently as
(37)
This is the first of two equations that we need.
Now we derive the second equation. From Theorem 2.3, we know that the matrix $A$ satisfies the $\ell_2$-robust null space property, namely (22). An application of Schwarz' inequality shows that
However, because both $\hat{x}$ and $x$ are feasible for the optimization problem in (20), we get

$$\|Ah\|_2 = \|A(\hat{x} - x)\|_2 \leq \|y - A\hat{x}\|_2 + \|y - Ax\|_2 \leq 2\epsilon .$$

Substituting this bound for $\|Ah\|_2$ gives us the second equation we need, namely
or equivalently
(38)
The two inequalities (37) and (38) can be written compactly as
(39)
where the coefficient matrix is given by
The coefficient matrix has positive diagonal elements and negative off-diagonal elements. Therefore, if its determinant is positive, then every element of its inverse is positive, in which case we can multiply both sides of (39) by this inverse. Now
Recall the definition of from (29). Now routine algebra shows that
which is precisely (28). Thus we can multiply both sides of (39) by , which gives
Clearing out the matrix multiplication gives
(40)
Now the triangle inequality states that
Substituting from (40) gives
By substituting that , we get (30) with the constants as defined in (32).
To prove (31), suppose $p \in [1, 2]$. This part of the proof closely follows that of [14, Theorem 4.22], except that we provide explicit values for the constants. Let $S$ denote the index set of the $k$ largest components of $h$ by magnitude. Then
We will bound each term separately. First, by [14, Theorem 2.5] and (30), we get
(41)
Now we apply in succession Hölder's inequality, the robust null space property, the fact that $\|Ah\|_2 \leq 2\epsilon$, and (30). This gives

(42)
∎
In the above theorem, we started with the restricted isometry constant $\delta_{2k}$ and computed an upper bound on $\mu$ in order for SGL and CLOT to achieve robust sparse recovery. As $\delta_{2k}$ gets closer to the limit $1/\sqrt{2}$ (which is known to be the best possible in view of Theorem 1.2), the limit on $\mu$ would approach zero. It is also possible to start with $\mu$ and find an upper bound on $\delta_{2k}$, by rearranging the inequalities. As this involves just routine algebra, we simply present the final bound. Given the number $\mu$, define the constants as in (23), and define

(43)

(44)

Given these, define the constant as in (29). If the matrix $A$ satisfies the RIP of order $2k$ with a constant $\delta_{2k}$, then SGL achieves robust sparse recovery of order $k$ provided that

(45)
3 Numerical Examples
In this section we present three simulation studies, to demonstrate the various theoretical results presented thus far.
3.1 Comparison of CLOT with LASSO and EN
In this subsection we present the results of running the same four simulations as in [5, Section 5]. The examples were run using LASSO, EN and CLOT. We carried out all three optimizations in Matlab using the cvx optimization package.
| Method | Example 1 | Example 2 | Example 3 | Example 4 |
|---|---|---|---|---|
| CLOT | 2.66 (0.51) | 1.84 (0.29) | 35.35 (1.62) | 53.14 (4.94) |
| EN | 2.71 (0.37) | 2.73 (0.32) | 25.82 (1.48) | 54.81 (5.13) |
| LASSO | 3.46 (0.37) | 3.60 (0.39) | 51.61 (2.3) | 68.74 (7.43) |

Table 1: Median mean-squared errors for the simulated examples and the three methods, based on 50 replications. The numbers in parentheses are the standard deviations of the medians, estimated using the bootstrap method with 500 resamplings of the 50 mean-squared errors.
| Method | Example 1 | Example 2 | Example 3 | Example 4 |
|---|---|---|---|---|
| CLOT | 7 | 8 | 30 | 24 |
| EN | 8 | 8 | 37 | 32 |
| LASSO | 5 | 6 | 21 | 18 |
| True | 3 | 8 | 20 | 15 |

Table 2: Number of nonzero coefficients selected by each method in the four examples, compared with the true number of nonzero coefficients.
Figure 1: Box plots for Examples (a), (b), (c), and (d).
It can be seen from Tables 1 and 2, as well as Figure 1, that CLOT and EN perform comparably, and that both perform better than LASSO. Note that in all the simulations in this example, the true vector $x$ is not sparse, which is why EN outperforms LASSO. Moreover, Table 2 shows that CLOT achieves accuracy comparable to EN with a smaller number of nonzero features.
3.2 Grouping Effect of CLOT
To illustrate that CLOT demonstrates the grouping effect as does EN (see Theorem 2.2), we ran the same example as at the end of [5, Section 5]. Specifically, we chose $Z_1$ and $Z_2$ to be two independent random variables, and generated the observation $y$ as a noisy linear combination of $Z_1$ and $Z_2$. The six observations $x_1, \dots, x_6$ were noisy versions of $Z_1$ and $Z_2$, where the added noise terms $\epsilon_i$ are i.i.d. The objective was to express $y$ as a linear combination of $x_1$ through $x_6$. In other words, we wished to express $y \approx X\beta$, where $X$ is the matrix with the