Recent years have witnessed a flurry of research on the recovery of high-dimensional sparse signals (e.g., compressed sensing [2, 6, 18], graphical model selection [13, 14], and sparse approximation). In all of these settings, the basic problem is to recover information about a high-dimensional signal $\beta^* \in \mathbb{R}^p$, based on a set of $n$ observations. The signal is assumed a priori to be sparse: either exactly $k$-sparse, or lying within some $\ell_q$-ball with $q < 1$. A large body of theory has focused on the behavior of various $\ell_1$-relaxations when applied to measurement matrices drawn from the standard Gaussian ensemble [6, 2], or more general random ensembles satisfying mutual incoherence conditions [13, 20].
These standard random ensembles are dense, in that the number of non-zero entries per measurement vector is of the same order as the ambient signal dimension. Such dense measurement matrices are undesirable for practical applications (e.g., sensor networks), in which it would be preferable to take measurements based on sparse inner products. Sparse measurement matrices require significantly less storage space, and have the potential for reduced algorithmic complexity in signal recovery, since many algorithms for linear programming, and conic programming more generally, can be accelerated by exploiting problem structure. With this motivation, a line of past work (e.g., [4, 8, 16, 23]), inspired by group testing and coding perspectives, has studied compressed sensing methods based on sparse measurement ensembles. However, this body of work has focused on the case of noiseless observations.
In contrast, this paper focuses on observations contaminated by additive noise, which, as we show, exhibit fundamentally different behavior than the noiseless case. Our interest is not in sparse measurement ensembles alone, but rather in understanding the trade-off between the degree of measurement sparsity and its statistical efficiency. We assess measurement sparsity in terms of the fraction of non-zero entries in any particular row of the measurement matrix, and we define statistical efficiency in terms of the minimal number of measurements required to recover the correct support with probability converging to one. Our interest can be viewed in terms of experimental design: more precisely, we ask what degree of measurement sparsity can be permitted without any compromise in statistical efficiency. To bring sharp focus to the issue, we analyze this question for exact subset recovery using $\ell_1$-constrained quadratic programming, also known as the Lasso in the statistics literature [3, 17], for which past work on dense Gaussian measurement ensembles provides a precise characterization of success/failure. We characterize the density of our measurement ensembles with a positive parameter $\gamma \in (0,1]$, corresponding to the fraction of non-zero entries per row. We first show that for any fixed $\gamma > 0$, the statistical efficiency of the Lasso remains the same as with dense measurement matrices. We then prove that it is possible to let $\gamma \to 0$ at some rate, as a function of the sample size $n$, signal length $p$, and signal sparsity $k$, yielding measurement matrices with a vanishing fraction of non-zeros per row while requiring exactly the same number of observations as dense measurement ensembles. In general, in contrast to the noiseless setting, our theory still requires that the average number of non-zeros per column of the measurement matrix (i.e., $\gamma n$) tend to infinity; however, under the loss function considered here (exact signed support recovery), we prove that no method can succeed with probability one if this condition does not hold. The remainder of this paper is organized as follows. In Section 2, we set up the problem more precisely, state our main result, and discuss some of its implications. In Section 3, we provide a high-level outline of the proof.
Work in this paper was presented in part at the International Symposium on Information Theory in Toronto, Canada (July, 2008). We note that in concurrent and complementary work, Wang et al. have analyzed the information-theoretic limitations of sparse measurement matrices for exact support recovery.
Throughout this paper, we use the following standard asymptotic notation: $f(n) = O(g(n))$ if $f(n) \le c\, g(n)$ for some constant $c > 0$; $f(n) = \Omega(g(n))$ if $f(n) \ge c'\, g(n)$ for some constant $c' > 0$; and $f(n) = \Theta(g(n))$ if $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$.
2 Problem set-up and main result
We begin by setting up the problem, stating our main result, and discussing some of its consequences.
2.1 Problem formulation
Let $\beta^* \in \mathbb{R}^p$ be a fixed but unknown vector, with at most $k$ non-zero entries ($k \le p$), and define its support set $S := \{ i \in \{1, \dots, p\} \mid \beta^*_i \neq 0 \}$.
We use $\beta_{\min}$ to denote the minimum absolute value of $\beta^*$ on its support—that is, $\beta_{\min} := \min_{i \in S} |\beta^*_i|$.
Suppose that we make a set of $n$ independent and identically distributed (i.i.d.) observations of the unknown vector $\beta^*$, each of the form

$y_i = x_i^T \beta^* + w_i,$

where $w_i \sim N(0, \sigma^2)$ is observation noise, and $x_i \in \mathbb{R}^p$ is a measurement vector. It is convenient to use $y \in \mathbb{R}^n$ to denote the $n$-vector of measurements, with similar notation for the noise vector $w \in \mathbb{R}^n$, and $X \in \mathbb{R}^{n \times p}$, with $x_i^T$ as its $i$-th row, to denote the measurement matrix. With this notation, the observation model can be written compactly as $y = X \beta^* + w$.
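As a concrete illustration, the observation model above can be simulated in a few lines. This is our own sketch; the dimensions and variable names are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, sigma = 100, 300, 5, 0.5    # illustrative sizes, not the paper's

beta_star = np.zeros(p)
beta_star[:k] = 1.0                  # a k-sparse signal, support S = {0, ..., k-1}

X = rng.standard_normal((n, p))      # measurement matrix (dense Gaussian here)
w = rng.normal(0.0, sigma, size=n)   # i.i.d. observation noise
y = X @ beta_star + w                # compact observation model: y = X beta* + w
```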
Given some estimate $\hat{\beta}$, its error relative to the true $\beta^*$ can be assessed in various ways, depending on the underlying application of interest. For applications in compressed sensing, various types of $\ell_q$ norms of the error are well-motivated, whereas for statistical prediction, it is most natural to study a predictive loss (e.g., the in-sample prediction error). For reasons of scientific interpretation or for model selection purposes, the object of primary interest is the support of $\beta^*$. In this paper, we consider a slightly stronger notion of model selection: in particular, our goal is to recover the signed support of the unknown $\beta^*$, as defined by the $p$-vector $\mathbb{S}_{\pm}(\beta^*)$ with elements $[\mathbb{S}_{\pm}(\beta^*)]_i = \operatorname{sign}(\beta^*_i)$, with the convention $\operatorname{sign}(0) = 0$.
Given some estimate $\hat{\beta}$, we study the probability that it correctly specifies the signed support.
The estimator that we analyze is $\ell_1$-constrained quadratic programming (QP), also known as the Lasso in the statistics literature. The Lasso generates an estimate $\hat{\beta}$ by solving the regularized QP

$\hat{\beta} \in \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \| y - X\beta \|_2^2 + \lambda_n \|\beta\|_1 \right\},$

where $\lambda_n > 0$ is a user-defined regularization parameter. A large body of past work has focused on the behavior of the Lasso for both deterministic and random measurement matrices (e.g., [5, 13, 18, 20]). Most relevant here is the sharp threshold characterizing the success/failure of the Lasso when applied to measurement matrices drawn randomly from the standard Gaussian ensemble (i.e., each element $X_{ij} \sim N(0,1)$ i.i.d.). In particular, the Lasso undergoes a sharp threshold as a function of the control parameter

$\theta(n, p, k) := \frac{n}{2k \log(p-k)}.$
For the standard Gaussian ensemble and sequences $(n, p, k)$ on which the control parameter exceeds one, the probability of Lasso success goes to one, whereas it converges to zero for sequences on which the control parameter is less than one. The main contribution of this paper is to show that the same sharp threshold holds for $\gamma$-sparsified measurement ensembles, including a subset for which $\gamma \to 0$, so that each row of the measurement matrix has a vanishing fraction of non-zero entries.
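Taking the control parameter to have the rescaled-sample-size form $\theta(n,p,k) = n / (2k \log(p-k))$ known from the dense-Gaussian analysis (our reading of the threshold referenced above), the location of the threshold can be explored numerically; the function names here are ours:

```python
import math

def theta(n: int, p: int, k: int) -> float:
    """Control parameter theta(n, p, k) = n / (2 k log(p - k))."""
    return n / (2.0 * k * math.log(p - k))

def min_samples(p: int, k: int) -> int:
    """Smallest sample size n with theta(n, p, k) > 1, i.e. n > 2 k log(p - k)."""
    return math.floor(2.0 * k * math.log(p - k)) + 1

# The number of measurements needed grows only logarithmically in p:
for p, k in [(512, 8), (4096, 8), (10**6, 8)]:
    print(f"p = {p:>7}, k = {k}: Lasso threshold at n = {min_samples(p, k)}")
```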
2.2 Statement of main result
A measurement matrix drawn randomly from a Gaussian ensemble is dense, in that each row has $p$ non-zero entries (almost surely). The main focus of this paper is the observation model (2), using measurement ensembles that are designed to be sparse. To formalize the notion of sparsity, we let $\gamma \in (0,1]$ represent a measurement sparsity parameter, corresponding to the (average) fraction of non-zero entries per row. Our analysis allows the sparsity parameter to be a function of the triple $(n, p, k)$, but we typically suppress this explicit dependence so as to simplify notation. For a given choice of $\gamma$, we consider measurement matrices with i.i.d. entries of the form

$X_{ij} = \begin{cases} N\!\left(0, \tfrac{1}{\gamma}\right) & \text{with probability } \gamma, \\ 0 & \text{with probability } 1 - \gamma. \end{cases}$
By construction, the expected number of non-zero entries in each row of $X$ is $\gamma p$. It is straightforward to verify that for any constant setting of $\gamma$, elements from the ensemble (6) are sub-Gaussian. (A zero-mean random variable $Z$ is sub-Gaussian if there exists some constant $c > 0$ such that $\mathbb{E}[\exp(tZ)] \le \exp(c\, t^2)$ for all $t \in \mathbb{R}$.) For this reason, one would expect such ensembles to obey similar scaling behavior to Gaussian ensembles, although possibly with different constants. In fact, the analysis of this paper establishes exactly the same control-parameter threshold (5) for $\gamma$-sparsified measurement ensembles, for any fixed $\gamma \in (0,1]$, as in the completely dense case ($\gamma = 1$). On the other hand, if $\gamma$ is allowed to tend to zero, elements of the measurement matrix are no longer sub-Gaussian with any fixed constant, since the variance $1/\gamma$ of the Gaussian mixture component scales non-trivially. Nonetheless, our analysis shows that for $\gamma \to 0$ suitably slowly, it is possible to achieve the same statistical efficiency as in the dense case.
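For intuition, sampling from the $\gamma$-sparsified ensemble is straightforward. The sketch below is our own rendering (entries zero with probability $1-\gamma$ and drawn from $N(0, 1/\gamma)$ with probability $\gamma$, so each entry has unit variance); it checks the unit-variance normalization and the $\gamma p$ non-zeros per row:

```python
import numpy as np

def sparsified_ensemble(n, p, gamma, seed=None):
    """Draw X with i.i.d. entries: 0 w.p. 1 - gamma, N(0, 1/gamma) w.p. gamma.

    Each entry then has mean zero and unit variance, matching the standard
    Gaussian ensemble, while each row has roughly gamma * p non-zeros.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random((n, p)) < gamma                         # Bernoulli(gamma) support
    gauss = rng.normal(0.0, 1.0 / np.sqrt(gamma), size=(n, p))
    return mask * gauss

X = sparsified_ensemble(n=200, p=1000, gamma=0.05, seed=0)
print("avg non-zeros per row:", (X != 0).sum(axis=1).mean())  # close to gamma * p = 50
print("entry variance:", X.var())                             # close to 1
```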
In particular, we state the following result on conditions under which
the Lasso applied to sparsified ensembles has the same sample
complexity as when applied to the dense (standard Gaussian) ensemble:
Suppose that the measurement matrix $X$ is drawn with i.i.d. entries according to the $\gamma$-sparsified distribution (6). Then for any $\delta > 0$, if the sample size satisfies
then the Lasso succeeds, with probability converging to one as $n \to \infty$, in recovering the correct signed support, as long as the regularization parameter $\lambda_n$ and the measurement sparsity $\gamma$ satisfy conditions (8b) and (8c).
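The following simulation illustrates the theorem's conclusion on a $\gamma$-sparsified design. It is illustrative only: the parameter choices ($n$, $\gamma$, $\lambda_n$, noise level) are ours and are not the sharp constants of the theorem, and the minimal coordinate-descent Lasso solver is our own sketch:

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=500):
    """Minimal coordinate descent for (1/2n)||y - X b||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    b, r = np.zeros(p), y.copy()           # r maintains the residual y - X b
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ r / n + col_sq[j] * b[j]
            b_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (b[j] - b_new)  # keep the residual in sync
            b[j] = b_new
    return b

rng = np.random.default_rng(1)
n, p, k, gamma, sigma, lam = 150, 200, 5, 0.2, 0.1, 0.1

beta_star = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
beta_star[support] = rng.choice([-1.0, 1.0], size=k)    # beta_min = 1

# gamma-sparsified design: 0 w.p. 1 - gamma, N(0, 1/gamma) w.p. gamma
X = (rng.random((n, p)) < gamma) * rng.normal(0, 1 / np.sqrt(gamma), (n, p))
y = X @ beta_star + rng.normal(0, sigma, n)

beta_hat = lasso_cd(X, y, lam)
print("signs on support match:",
      bool(np.all(np.sign(beta_hat[support]) == np.sign(beta_star[support]))))
```

Even though only a fifth of each row is non-zero, the signs on the true support are recovered and the off-support coefficients are driven to (or very near) zero by the $\ell_1$ penalty.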
(a) To provide intuition for Theorem 1, it is helpful to consider various special cases of the sparsity parameter $\gamma$. First, if $\gamma$ is a constant fixed to some value in $(0,1]$, then it plays no role in the scaling, and condition (8c) is always satisfied. Furthermore, condition (8a) is then exactly the same as that from previous work on dense measurement ensembles ($\gamma = 1$). However, condition (8b) is slightly weaker than the corresponding condition for the dense case, in that $\lambda_n$ must approach zero more slowly. Depending on the exact behavior of $\beta_{\min}$, choosing $\lambda_n$ to decay slightly more slowly than in the dense case is sufficient to guarantee exact recovery with the same number of observations, meaning that we recover exactly the same statistical efficiency as the dense case ($\gamma = 1$) for all constant measurement sparsities $\gamma \in (0,1]$. At least initially, one might think that reducing $\gamma$ should increase the required number of observations, since it effectively reduces the signal-to-noise ratio of each measurement. However, under high-dimensional scaling (with $(n, p, k)$ all tending to infinity), the dominant effect limiting the Lasso performance is the number ($p - k$) of irrelevant factors, as opposed to the signal-to-noise ratio (the scaling of the minimum value $\beta_{\min}$).
(b) However, Theorem 1 also allows for general scalings of the measurement sparsity $\gamma$ along with the triplet $(n, p, k)$. More concretely, let us suppose that the signal sparsity $k$ falls in one of a few standard regimes—say $k = \Theta(p)$, $k = \Theta(p^{\alpha})$ for some $\alpha \in (0,1)$, or $k = \Theta(\log p)$, corresponding respectively to linear sparsity, polynomial sparsity, and exponential sparsity. Over this range of signal sparsities, we can choose a decaying measurement sparsity, for instance
along with a suitable choice of the regularization parameter $\lambda_n$, maintaining the same sample complexity (required number of observations for support recovery) as the Lasso with dense measurement ensembles.
(c) Of course, the conditions of Theorem 1 do not allow the measurement sparsity $\gamma$ to approach zero arbitrarily quickly. Rather, for any $\gamma$ guaranteeing exact recovery, condition (8a) implies that the average number of non-zero entries per column of $X$ (namely, $\gamma n$) must tend to infinity. (Indeed, our specific choice (9) certainly satisfies this constraint.) A natural question is whether exact recovery is possible using measurement matrices, either randomly drawn or deterministically designed, in which the average number of non-zeros per column remains bounded. In fact, under the criterion of exactly recovering the signed support (2.1), no method can succeed with probability one if $\gamma n$ remains bounded.
If $\gamma n$ does not tend to infinity, then no method can recover the signed support with probability one.
We construct a sub-problem that must be solvable by any method capable of performing exact signed support recovery. Suppose that index $1$ belongs to the support, with $\beta^*_1 = \pm \beta_{\min}$, and that the first column of $X$ has $d$ non-zero entries, say without loss of generality those in rows $1, \dots, d$. Now consider the problem of recovering the sign of $\beta^*_1$. Let us extract the $d$ observations that explicitly involve $\beta^*_1$, writing

$y_i = X_{i1} \beta^*_1 + \sum_{j \in T_i} X_{ij} \beta^*_j + w_i, \qquad i = 1, \dots, d,$

where $T_i$ denotes the set of indices $j$ in row $i$ for which $X_{ij}$ is non-zero, excluding index $1$. Even assuming that the interference term $\sum_{j \in T_i} X_{ij} \beta^*_j$ were perfectly known, this observation model (2.2) is at best equivalent to observing $\beta^*_1$ contaminated by constant-variance additive Gaussian noise, and our task is to distinguish whether $\beta^*_1 = +\beta_{\min}$ or $\beta^*_1 = -\beta_{\min}$. The (suitably rescaled) average of these observations is a sufficient statistic, following a Gaussian distribution with mean $\beta^*_1$ and variance of order $1/d$. Unless the effective signal-to-noise ratio, which is of the order of $d$, goes to infinity, there will always be a constant probability of error in distinguishing $+\beta_{\min}$ from $-\beta_{\min}$. Under the $\gamma$-sparsified random ensemble, we have $d = \Theta(\gamma n)$ with high probability, so that no method can succeed unless $\gamma n$ goes to infinity, as claimed. ∎
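The signal-to-noise calculation in this argument can be made concrete. In the sketch below (our own illustration), the sufficient statistic for the sign of $\beta^*_1$ is Gaussian with mean $\pm\beta_{\min}$ and variance $\sigma^2/d$, so the optimal test errs with probability $\Phi(-\beta_{\min}\sqrt{d}/\sigma)$; this stays bounded away from zero whenever $d$ stays bounded, no matter how large $n$ grows:

```python
import math

def sign_error_prob(beta_min, sigma, d):
    """Error probability of the optimal test for sign(beta_1) given d noisy
    looks at +/- beta_min: the (rescaled) average is a sufficient statistic
    distributed as N(beta_1, sigma^2 / d), so the optimal test errs with
    probability Phi(-beta_min * sqrt(d) / sigma)."""
    z = beta_min * math.sqrt(d) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# If the number d of non-zeros per column stays bounded, the error
# probability is a fixed constant; it vanishes only when d -> infinity,
# and under the gamma-sparsified ensemble d = Theta(gamma * n).
for d in (4, 40, 400):
    print(f"d = {d:4d}: error probability = {sign_error_prob(1.0, 1.0, d):.3g}")
```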
3 Proof of Theorem 1
This section is devoted to the proof of Theorem 1. We begin with a high-level outline of the proof; as with previous work on dense Gaussian ensembles, the key is the notion of a primal-dual witness for exact signed support recovery. We then proceed with the proof itself, divided into a sequence of separate lemmas. The analysis of "sparsified" matrices requires results on spectral properties of random matrices not covered by the standard literature. The proofs of some of the more technical results are deferred to the appendices.
3.1 High-level overview of proof
For the purposes of our proof, it is convenient to consider matrices with i.i.d. entries of the form

$X_{ij} = \begin{cases} N(0, 1) & \text{with probability } \gamma, \\ 0 & \text{with probability } 1 - \gamma. \end{cases}$

So as to obtain an equivalent observation model, we also reset the variance of each noise term to be $\gamma \sigma^2$. Finally, we can assume without loss of generality that $\sigma = 1$.
Define the sample covariance matrix $\hat{\Sigma} := \frac{1}{n} X^T X$. Of particular importance to our analysis is the $k \times k$ sub-matrix $\hat{\Sigma}_{SS} := \frac{1}{n} X_S^T X_S$, where $X_S$ denotes the sub-matrix of columns of $X$ indexed by the support $S$. For future reference, we state the following claim, proved in Appendix D:
Under the conditions of Theorem 1, the sub-matrix $\hat{\Sigma}_{SS}$ is invertible with probability converging to one.
The foundation of our proof is the following lemma: it provides sufficient conditions for the Lasso (4) to recover the signed support set.
Lemma 2 (Primal-dual conditions for support recovery).
Suppose that $\hat{\Sigma}_{SS}$ is invertible, and that we can find a primal vector $\hat{\beta} \in \mathbb{R}^p$ and a subgradient vector $\hat{z} \in \mathbb{R}^p$ that satisfy the zero-subgradient condition

$\frac{1}{n} X^T (X \hat{\beta} - y) + \lambda_n \hat{z} = 0,$
and the signed-support-recovery conditions

$\hat{z}_S = \operatorname{sign}(\beta^*_S), \quad \operatorname{sign}(\hat{\beta}_S) = \operatorname{sign}(\beta^*_S), \quad \hat{\beta}_{S^c} = 0, \quad \text{and} \quad |\hat{z}_j| < 1 \ \text{for all } j \in S^c.$
Then $\hat{\beta}$ is the unique optimal solution to the Lasso (4), and recovers the correct signed support.
See Appendix B.1 for the proof of this claim.
Thus, given Lemmas 1 and 2, it suffices to show that under the specified scaling of $(n, p, k, \gamma)$, there exists a primal-dual pair $(\hat{\beta}, \hat{z})$ satisfying the conditions of Lemma 2. We establish the existence of such a pair with the following constructive procedure:
We begin by setting $\hat{\beta}_{S^c} = 0$ and $\hat{z}_S = \operatorname{sign}(\beta^*_S)$.
Next, we determine $\hat{\beta}_S$ by solving the linear system

$\frac{1}{n} X_S^T (X_S \hat{\beta}_S - y) + \lambda_n \hat{z}_S = 0.$
Finally, we determine $\hat{z}_{S^c}$ by solving the linear system:

$\frac{1}{n} X_{S^c}^T (X_S \hat{\beta}_S - y) + \lambda_n \hat{z}_{S^c} = 0.$
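The three-step construction can be sketched numerically. The code below is our own rendering, taking the zero-subgradient equation of the Lasso (4) to be $\frac{1}{n} X^T (X\hat{\beta} - y) + \lambda_n \hat{z} = 0$ and solving its $S$- and $S^c$-blocks in turn; all names are ours:

```python
import numpy as np

def primal_dual_witness(X, y, S, signs, lam):
    """Attempt the primal-dual witness construction for the Lasso.

    (i)   fix beta_hat = 0 off the support S and z_S = sign(beta*_S);
    (ii)  solve the S-block of the zero-subgradient equation for beta_hat_S;
    (iii) solve the S^c-block for the dual variables z_{S^c}.
    Returns (beta_hat, z, success), where success checks the sign-consistency
    and strict dual feasibility conditions of Lemma 2.
    """
    n, p = X.shape
    Sc = np.setdiff1d(np.arange(p), S)
    XS, XSc = X[:, S], X[:, Sc]
    # (ii): (1/n) XS^T (XS bS - y) + lam * signs = 0
    bS = np.linalg.solve(XS.T @ XS / n, XS.T @ y / n - lam * signs)
    # (iii): z_{S^c} = (1 / (lam * n)) XSc^T (y - XS bS)
    zSc = XSc.T @ (y - XS @ bS) / (lam * n)
    beta_hat, z = np.zeros(p), np.zeros(p)
    beta_hat[S], z[S], z[Sc] = bS, signs, zSc
    success = bool(np.all(np.sign(bS) == signs) and np.abs(zSc).max() < 1)
    return beta_hat, z, success

# Small demonstration on a dense Gaussian design
rng = np.random.default_rng(1)
n, p, k, lam = 200, 50, 3, 0.1
S = np.arange(k)
signs = np.array([1.0, -1.0, 1.0])
beta_star = np.zeros(p); beta_star[S] = signs
X = rng.standard_normal((n, p))
y = X @ beta_star + rng.normal(0, 0.1, n)

beta_hat, z, success = primal_dual_witness(X, y, S, signs, lam)
# The pair satisfies the zero-subgradient equation by construction:
residual = X.T @ (X @ beta_hat - y) / n + lam * z
print("max |residual| =", np.abs(residual).max(), "| witness succeeded:", success)
```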
In order to complete these final two steps, it is helpful to define the following random variables:
where $e_j$ is the unit vector with a one in position $j$, and $\vec{1}$ is the all-ones vector.
A little bit of algebra (see Appendix B.2 for details) shows that , and that . Consequently, if we define the events
where the value $\beta_{\min}$ was defined previously as the minimum absolute value of $\beta^*$ on its support, then in order to establish that the Lasso succeeds in recovering the exact signed support, it suffices to show that each of these events holds with probability converging to one.
We decompose the proof of this final claim into the following three lemmas. As in the statement of Theorem 1, suppose that $n > 2(1 + \delta)\, k \log(p - k)$ for some fixed $\delta > 0$.
Lemma 3 (Control of ).
Under the conditions of Theorem 1, we have
Lemma 4 (Control of ).
Under the conditions of Theorem 1, we have
Lemma 5 (Control of ).
Under the conditions of Theorem 1, we have
3.2 Proof of Lemma 3
We assume throughout that $\hat{\Sigma}_{SS}$ is invertible, an event which occurs with high probability under the stated assumptions (see Lemma 1). If we define the $n$-dimensional vector
then the variable can be written compactly as
Note that each term in this sum is distributed as a mixture variable, taking the value $0$ with probability $1 - \gamma$, and distributed as a Gaussian variable with probability $\gamma$. For each $i \in \{1, \dots, n\}$, define the discrete random variable
For each index , let . With these definitions, by construction, we have
To gain some intuition for the behavior of this sum, note that the discrete variables are independent of the Gaussian components. Consequently, we may condition on them without affecting the Gaussian components, and since a sum of independent Gaussians is Gaussian, the conditioned sum is itself Gaussian. Therefore, if we can obtain good control on its conditional variance, then we can use standard Gaussian tail bounds (see Appendix A) to control the maximum over $j \in S^c$. The following lemma is proved in Appendix C:
Under condition (8c), for any fixed $\epsilon > 0$, we have
The primary implication of the above bound is that each variable is (essentially) no larger than a standard Gaussian variable. We can then use standard techniques for bounding the tails of Gaussian variables to obtain good control over their maximum. In particular, by the union bound, we have
For any , define the event . Continuing on, we have
where the last line uses a standard Gaussian tail bound (see Appendix A), and Lemma 6. Finally, it can be verified that under the condition $n > 2(1 + \delta)\, k \log(p - k)$ for some $\delta > 0$, and with $\epsilon$ chosen sufficiently small, the claim follows.
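The union-bound-plus-Gaussian-tail step at the heart of this proof can be seen numerically: since $\mathbb{P}[\max_j Z_j > t] \le m\, e^{-t^2/2}$ for $m$ standard Gaussians, the maximum concentrates just below $\sqrt{2 \log m}$. The snippet below is a standalone illustration, not tied to the constants of the proof:

```python
import numpy as np

rng = np.random.default_rng(0)
maxima = {m: rng.standard_normal(m).max() for m in (10**2, 10**4, 10**6)}
for m, z_max in maxima.items():
    # Union bound: P(max > t) <= m * exp(-t^2 / 2) becomes small once t
    # exceeds sqrt(2 log m), so the observed maximum sits near that level.
    print(f"m = {m:>7}: max = {z_max:.3f}, sqrt(2 log m) = {np.sqrt(2 * np.log(m)):.3f}")
```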
3.3 Proof of Lemma 4
Defining the orthogonal projection matrix $\Pi := I_n - X_S (X_S^T X_S)^{-1} X_S^T$, we then have
Recall from equation (23) the representation $X_{ij} = \delta_{ij}\, g_{ij}$, where $\delta_{ij}$ is Bernoulli with parameter $\gamma$, and $g_{ij}$ is Gaussian. The number of non-zero entries in a given column is then binomial; define the following event
From the Hoeffding bound (see Lemma 7), we have . Using this representation and conditioning on , we have
where we have assumed without loss of generality that the first entries of the relevant column are non-zero. Since $\Pi$ is an orthogonal projection matrix, we have $\|\Pi u\|_2 \le \|u\|_2$ for any vector $u$, so that
Conditioned on , the random variable is zero-mean Gaussian with variance
For some $\epsilon > 0$, define the event
Now, by conditioning on and its complement and using tail bounds on Gaussian variates (see Appendix A), we obtain
The first term goes to zero by the Hoeffding bound. The second term goes to zero as a consequence of Conditions (8a) and (8c). Our choice of $\epsilon$ and Condition (8c) are then enough to ensure that the third term goes to zero as well.
3.4 Proof of Lemma 5
We first observe that, conditioned on the measurement matrix, each of these variables is Gaussian, with mean and variance given by:
Define the upper bounds
and the following event
Conditioning on and its complement, we have
Applying Lemma 10 with and , we have .
We now deal with the first term. Letting , and using as shorthand for the event , we have
Condition (8b) implies that , so that it suffices to upper bound
where , and we have used standard Gaussian tail bounds (see Appendix A).
It remains to verify that this final term converges to zero. Taking logarithms and ignoring constant terms, we have
We would like to show that this quantity diverges to $-\infty$. Condition (8c) implies that