# High-dimensional subset recovery in noise: Sparsified measurements without loss of statistical efficiency

We consider the problem of estimating the support of a vector β^* ∈R^p based on observations contaminated by noise. A significant body of work has studied behavior of ℓ_1-relaxations when applied to measurement matrices drawn from standard dense ensembles (e.g., Gaussian, Bernoulli). In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction γ of non-zero entries, and the statistical efficiency, as measured by the minimal number of observations n required for exact support recovery with probability converging to one. Our main result is to prove that it is possible to let γ→ 0 at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions.


## 1 Introduction

Recent years have witnessed a flurry of research on the recovery of high-dimensional sparse signals (e.g., compressed sensing [2, 6, 18], graphical model selection [13, 14], and sparse approximation [18]). In all of these settings, the basic problem is to recover information about a high-dimensional signal β^* ∈ R^p, based on a set of n observations. The signal is assumed a priori to be sparse: either exactly k-sparse, or lying within some ℓ_q-ball with q ∈ (0, 1]. A large body of theory has focused on the behavior of various ℓ_1-relaxations when applied to measurement matrices drawn from the standard Gaussian ensemble [6, 2], or more general random ensembles satisfying mutual incoherence conditions [13, 20].

These standard random ensembles are dense, in that the number of non-zero entries per measurement vector is of the same order as the ambient signal dimension. Such dense measurement matrices are undesirable for practical applications (e.g., sensor networks), in which it would be preferable to take measurements based on sparse inner products. Sparse measurement matrices require significantly less storage space, and have the potential for reduced algorithmic complexity for signal recovery, since many algorithms for linear programming, and conic programming more generally [1], can be accelerated by exploiting problem structure. With this motivation, a body of past work (e.g., [4, 8, 16, 23]), motivated by group testing or coding perspectives, has studied compressed sensing methods based on sparse measurement ensembles. However, this body of work has focused on the case of noiseless observations.

In contrast, this paper focuses on observations contaminated by additive noise which, as we show, exhibits fundamentally different behavior than the noiseless case. Our interest is not in sparse measurement ensembles alone, but rather in understanding the trade-off between the degree of measurement sparsity and the statistical efficiency. We assess measurement sparsity in terms of the fraction of non-zero entries in any particular row of the measurement matrix, and we define statistical efficiency in terms of the minimal number of measurements required to recover the correct support with probability converging to one. Our interest can be viewed in terms of experimental design: more precisely, we ask what degree of measurement sparsity can be permitted without any compromise in statistical efficiency. To bring sharp focus to the issue, we analyze this question for exact subset recovery using ℓ_1-constrained quadratic programming, also known as the Lasso in the statistics literature [3, 17], where past work on dense Gaussian measurement ensembles [20] provides a precise characterization of its success/failure. We characterize the density of our measurement ensembles with a positive parameter γ ∈ (0, 1], corresponding to the fraction of non-zero entries per row. We first show that for all fixed γ > 0, the statistical efficiency of the Lasso remains the same as with dense measurement matrices. We then prove that it is possible to let γ → 0 at some rate, as a function of the sample size n, signal length p, and signal sparsity k, yielding measurement matrices with a vanishing fraction of non-zeroes per row while requiring exactly the same number of observations as dense measurement ensembles. In general, in contrast to the noiseless setting [23], our theory still requires that the average number of non-zeroes per column of the measurement matrix (i.e., γn) tend to infinity; however, under the loss function considered here (exact signed support recovery), we prove that no method can succeed with probability one if this condition does not hold.

The remainder of this paper is organized as follows. In Section 2, we set up the problem more precisely, state our main result, and discuss some of its implications. In Section 3, we provide a high-level outline of the proof.

Work in this paper was presented in part at the International Symposium on Information Theory in Toronto, Canada (July, 2008). We note that in concurrent and complementary work, Wang et al. [22] have analyzed the information-theoretic limitations of sparse measurement matrices for exact support recovery.

#### Notation:

Throughout this paper, we use the following standard asymptotic notation: f(n) = O(g(n)) if f(n) ≤ C g(n) for some constant C > 0; f(n) = Ω(g(n)) if f(n) ≥ C′ g(n) for some constant C′ > 0; and f(n) = Θ(g(n)) if f(n) = O(g(n)) and f(n) = Ω(g(n)).

## 2 Problem set-up and main result

We begin by setting up the problem, stating our main result, and discussing some of its consequences.

### 2.1 Problem formulation

Let β^* ∈ R^p be a fixed but unknown vector, with at most k non-zero entries (k ≤ p), and define its support set

 S := {i ∈ {1, …, p} ∣ β^*_i ≠ 0}. (1)

We use β_min to denote the minimum absolute value of β^* on its support, that is, β_min := min_{i∈S} |β^*_i|.

Suppose that we make a set of n independent and identically distributed (i.i.d.) observations of the unknown vector β^*, each of the form

 Y_i := x_i^T β^* + W_i, (2)

where W_i ∼ N(0, σ²) is observation noise, and x_i ∈ R^p is a measurement vector. It is convenient to use Y = (Y_1, …, Y_n)^T to denote the n-vector of measurements, with similar notation for the noise vector W = (W_1, …, W_n)^T, and

 X = [x_1^T; x_2^T; …; x_n^T] = [X_1 X_2 … X_p] (3)

to denote the n × p measurement matrix with rows x_i^T and columns X_j. With this notation, the observation model can be written compactly as Y = Xβ^* + W.

Given some estimate β̂, its error relative to the true β^* can be assessed in various ways, depending on the underlying application of interest. For applications in compressed sensing, various types of ℓ_q norms (i.e., ‖β̂ − β^*‖_q) are well-motivated, whereas for statistical prediction, it is most natural to study a predictive loss (e.g., the mean-squared prediction error E[(Y − x^T β̂)²]). For reasons of scientific interpretation or for model selection purposes, the object of primary interest is the support of β^*. In this paper, we consider a slightly stronger notion of model selection: in particular, our goal is to recover the signed support of the unknown β^*, as defined by the p-vector S(β^*) with elements

 [S(β^*)]_i := { sign(β^*_i) if β^*_i ≠ 0;  0 otherwise.

Given some estimate β̂, we study the probability that S(β̂) correctly specifies the signed support of β^*.

The estimator that we analyze is ℓ_1-constrained quadratic programming (QP), also known as the Lasso [17] in the statistics literature. The Lasso generates an estimate by solving the regularized QP

 β̂ = argmin_{β ∈ R^p} { (1/2n) ‖Y − Xβ‖²_2 + ρ_n ‖β‖_1 }, (4)

where ρ_n > 0 is a user-defined regularization parameter. A large body of past work has focused on the behavior of the Lasso for both deterministic and random measurement matrices (e.g., [5, 13, 18, 20]). Most relevant here is the sharp threshold [20] characterizing the success/failure of the Lasso when applied to measurement matrices drawn randomly from the standard Gaussian ensemble (i.e., each element X_ij ∼ N(0, 1), i.i.d.). In particular, the Lasso undergoes a sharp threshold as a function of the control parameter

 θ(n, p, k) := n / (2k log(p − k)). (5)

For the standard Gaussian ensemble and sequences (n, p, k) such that θ(n, p, k) > 1, the probability of Lasso success goes to one, whereas it converges to zero for sequences for which θ(n, p, k) < 1. The main contribution of this paper is to show that the same sharp threshold holds for γ-sparsified measurement ensembles, including a subset of ensembles for which γ → 0, so that each row of the measurement matrix has a vanishing fraction of non-zero entries.
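As an illustration of this threshold in action, the following simulation sketch draws a γ-sparsified design, takes noisy observations, and runs the Lasso. The dimensions, noise level, and regularization value are illustrative choices, not values from the paper; we use scikit-learn's `Lasso`, whose objective matches the program (4) with `alpha` playing the role of ρ_n.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 200, 128, 4      # theta(n, p, k) = n / (2 k log(p - k)) ~ 5, above threshold
gamma, sigma = 0.5, 0.25   # measurement sparsity and noise level

# k-sparse signal with beta_min = 1.
beta_star = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
beta_star[support] = rng.choice([-1.0, 1.0], size=k)

# gamma-sparsified ensemble (6): N(0,1) with probability gamma, zero otherwise.
X = rng.standard_normal((n, p)) * (rng.random((n, p)) < gamma)
Y = X @ beta_star + sigma * rng.standard_normal(n)

# sklearn's Lasso minimizes (1/2n)||Y - X b||_2^2 + alpha ||b||_1,
# i.e. the program (4) with alpha in the role of rho_n.
beta_hat = Lasso(alpha=0.1, fit_intercept=False).fit(X, Y).coef_

print("signed support recovered:",
      np.array_equal(np.sign(beta_hat), np.sign(beta_star)))
```

With this sample size well above the threshold θ = 1, recovery of the signed support is the typical outcome.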

### 2.2 Statement of main result

A measurement matrix drawn randomly from the standard Gaussian ensemble is dense, in that each row has p non-zero entries almost surely. The main focus of this paper is the observation model (2), using measurement ensembles that are designed to be sparse. To formalize the notion of sparsity, we let γ ∈ (0, 1] represent a measurement sparsity parameter, corresponding to the (average) fraction of non-zero entries per row. Our analysis allows the sparsity parameter to be a function γ = γ(n, p, k) of the triple (n, p, k), but we typically suppress this explicit dependence so as to simplify notation. For a given choice of γ, we consider measurement matrices with i.i.d. entries of the form

 X_ij =_d { N(0, 1) with probability γ;  0 with probability 1 − γ. (6)

By construction, the expected number of non-zero entries in each row of X is γp. It is straightforward to verify that for any constant setting of γ > 0, elements from the ensemble (6) are sub-Gaussian. (A zero-mean random variable Z is sub-Gaussian [19] if there exists some constant c > 0 such that E[exp(tZ)] ≤ exp(c t²) for all t ∈ R.) For this reason, one would expect such ensembles to obey similar scaling behavior as Gaussian ensembles, although possibly with different constants. In fact, the analysis of this paper establishes exactly the same control parameter threshold (5) for γ-sparsified measurement ensembles, for any fixed γ ∈ (0, 1], as in the completely dense case (γ = 1). On the other hand, if γ is allowed to tend to zero, elements of the measurement matrix are no longer sub-Gaussian with any fixed constant, since the variance of the Gaussian mixture component scales non-trivially. Nonetheless, our analysis shows that for γ → 0 suitably slowly, it is possible to achieve the same statistical efficiency as in the dense case.
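A quick numerical sanity check of the ensemble (6), with dimensions chosen arbitrarily for illustration: each row has roughly γp non-zero entries, and each entry has variance γ (a mixture of N(0, 1) with weight γ and a point mass at zero).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, gamma = 1000, 500, 0.1   # illustrative dimensions

# Sample the gamma-sparsified ensemble (6).
X = rng.standard_normal((n, p)) * (rng.random((n, p)) < gamma)

# Expected non-zeros per row is gamma * p; entry variance is gamma.
row_nnz = (X != 0).sum(axis=1)
print("mean non-zeros per row:", row_nnz.mean(), "(gamma * p =", gamma * p, ")")
print("entry variance:", X.var(), "(gamma =", gamma, ")")
```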

In particular, we state the following result on conditions under which the Lasso applied to sparsified ensembles has the same sample complexity as when applied to the dense (standard Gaussian) ensemble:

###### Theorem 1.

Suppose that the measurement matrix X is drawn with i.i.d. entries according to the γ-sparsified distribution (6). Then for any ϵ > 0, if the sample size satisfies

 n > (2 + ϵ) k log(p − k), (7)

then the Lasso succeeds, with probability converging to one, in recovering the correct signed support, as long as

 n ρ_n² γ / log(p − k) → ∞, (8a)
 (ρ_n / β_min) · ( 1 + (√k / γ) √( log log(p − k) / log(p − k) ) ) → 0, (8b)
 γ³ · min{ k, log(p − k) / log log(p − k) } → ∞. (8c)

Remarks:

(a) To provide intuition for Theorem 1, it is helpful to consider various special cases of the sparsity parameter γ. First, if γ is a constant fixed to some value in (0, 1], then it plays no role in the scaling, and condition (8c) is always satisfied. Furthermore, condition (8a) is then exactly the same as the corresponding condition from previous work [20] on dense measurement ensembles (γ = 1). However, condition (8b) is slightly weaker than the corresponding condition from [20], in that ρ_n must approach zero more slowly. Depending on the exact behavior of β_min, choosing ρ_n to decay slightly more slowly than in the dense case is sufficient to guarantee exact recovery with a sample size satisfying (7), meaning that we recover exactly the same statistical efficiency as in the dense case (γ = 1) for all constant measurement sparsities γ ∈ (0, 1]. At least initially, one might think that reducing γ should increase the required number of observations, since it effectively reduces the signal-to-noise ratio by a factor of γ. However, under high-dimensional scaling (p ≫ n), the dominant effect limiting the Lasso performance is the number (p − k) of irrelevant factors, as opposed to the signal-to-noise ratio (the scaling of the minimum value β_min).

(b) However, Theorem 1 also allows for general scalings of the measurement sparsity γ along with the triplet (n, p, k). More concretely, let us suppose for simplicity that n = Θ(k log(p − k)). Then over a range of signal sparsities (say k = Θ(p), k = Θ(p^a) for some a ∈ (0, 1), or k = Θ(log p), corresponding respectively to linear sparsity, polynomial sparsity, and exponential sparsity), we can choose a decaying measurement sparsity, for instance

 γ = [ log log(p − k) / log(p − k) ]^{1/6} → 0, (9)

along with a suitable regularization parameter ρ_n, while maintaining the same sample complexity (required number of observations for support recovery) as the Lasso with dense measurement matrices.
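The slow decay in the choice (9) can be seen numerically. The sketch below takes k = √p as an illustrative polynomial-sparsity regime (an assumption for the example, not a prescription from the theorem): γ shrinks while the quantity in condition (8c) grows, albeit very slowly.

```python
import numpy as np

# Evaluate the choice (9) over growing problem sizes, with k = sqrt(p)
# (polynomial sparsity) as an illustrative regime.
gammas, q8c = [], []
for p in [10**3, 10**6, 10**9, 10**12]:
    k = int(p ** 0.5)
    L, LL = np.log(p - k), np.log(np.log(p - k))
    gamma = (LL / L) ** (1.0 / 6.0)
    gammas.append(gamma)
    # Condition (8c): gamma^3 * min(k, L / LL) should tend to infinity.
    q8c.append(gamma ** 3 * min(k, L / LL))
    print(f"p = {p:.0e}: gamma = {gamma:.3f}, (8c) quantity = {q8c[-1]:.2f}")
```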

(c) Of course, the conditions of Theorem 1 do not allow the measurement sparsity γ to approach zero arbitrarily quickly. Rather, for any γ guaranteeing exact recovery, condition (8a) implies that the average number of non-zero entries per column of X (namely, γn) must tend to infinity. (Indeed, with n = Θ(k log(p − k)), our specific choice (9) certainly satisfies this constraint.) A natural question is whether exact recovery is possible using measurement matrices, either randomly drawn or deterministically designed, with the average number of non-zeros per column (namely, γn) remaining bounded. In fact, under the criterion of exactly recovering the signed support (2.1), no method can succeed with probability one if γn remains bounded.

###### Proposition 1.

If γn does not tend to infinity, then no method can recover the signed support with probability one.

###### Proof.

We construct a sub-problem that must be solvable by any method capable of performing exact signed support recovery. Suppose that index 1 belongs to the support S, and that the column X_1 has n_1 non-zero entries, say (without loss of generality) in rows i = 1, …, n_1. Now consider the problem of recovering the sign of β^*_1. Let us extract the n_1 observations that explicitly involve X_1, writing

 Y_i = X_i1 β^*_1 + Σ_{j ∈ T(i)} X_ij β^*_j + W_i,  i = 1, …, n_1,

where T(i) denotes the set of indices j ≠ 1 in row i for which X_ij is non-zero. Even assuming that the interference term Σ_{j ∈ T(i)} X_ij β^*_j were perfectly known, this observation model is at best equivalent to observing β^*_1 contaminated by constant-variance additive Gaussian noise, and our task is to distinguish whether β^*_1 = +β_min or β^*_1 = −β_min. The average of these observations is a sufficient statistic, following the distribution N(β^*_1, σ²/n_1). Unless the effective signal-to-noise ratio, which is of the order n_1 β_min²/σ², goes to infinity, there will always be a constant probability of error in distinguishing β^*_1 = +β_min from β^*_1 = −β_min. Under the γ-sparsified random ensemble, we have n_1 = Θ(γn) with high probability, so that no method can succeed unless γn goes to infinity, as claimed. ∎

Note that the conditions in Theorem 1 imply that γn → ∞. In particular, condition (8b) implies that ρ_n → 0, and condition (8a) then implies that nγ / log(p − k) → ∞; in particular, nγ → ∞, which is the condition of Proposition 1.
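The signal-to-noise argument in the proof of Proposition 1 can be made concrete: with a sufficient statistic distributed as N(β^*_1, σ²/n_1), the optimal sign test errs with probability Q(√n_1 · β_min/σ). A minimal sketch, with illustrative values of β_min and σ:

```python
from math import erf, sqrt

def sign_error_prob(n1, beta_min=1.0, sigma=1.0):
    """Error probability of the optimal sign test given a sufficient
    statistic ~ N(beta_1, sigma^2 / n1): Q(sqrt(n1) * beta_min / sigma)."""
    z = sqrt(n1) * beta_min / sigma
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# If gamma * n (and hence n1) stays bounded, the error probability is
# bounded away from zero no matter how many total observations we take;
# it vanishes only when n1 -> infinity.
print(sign_error_prob(4))    # roughly 0.023, a constant
print(sign_error_prob(400))  # essentially zero
```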

## 3 Proof of Theorem 1

This section is devoted to the proof of Theorem 1. We begin with a high-level outline of the proof; as with previous work on dense Gaussian ensembles [20], the key is the notion of a primal-dual witness for exact signed support recovery. We then proceed with the proof itself, divided into a sequence of separate lemmas. The analysis of “sparsified” matrices requires results on spectral properties of random matrices not covered by the standard literature. The proofs of some of the more technical results are deferred to the appendices.

### 3.1 High-level overview of proof

For the purposes of our proof, it is convenient to consider matrices with i.i.d. entries of the form

 X_ij =_d { N(0, 1/γ) with probability γ;  0 with probability 1 − γ. (10)

So as to obtain an equivalent observation model, we also reset the variance of each noise term W_i to be σ²/γ. Finally, we can assume without loss of generality that the non-zero entries of β^* are positive, so that sign(β^*_S) is the all-ones vector.

Define the sample covariance matrix

 Σ̂ := (1/n) X^T X = (1/n) Σ_{i=1}^n x_i x_i^T. (11)

Of particular importance to our analysis is the k × k sub-matrix Σ̂_SS := (1/n) X_S^T X_S. For future reference, we state the following claim, proved in Appendix D:

###### Lemma 1.

Under the conditions of Theorem 1, the submatrix Σ̂_SS is invertible with probability converging to one.

The foundation of our proof is the following lemma: it provides sufficient conditions for the Lasso (4) to recover the signed support set.

###### Lemma 2 (Primal-dual conditions for support recovery).

Suppose that Σ̂_SS is invertible, and that we can find a primal vector β̂ ∈ R^p and a subgradient vector ẑ ∈ R^p that satisfy the zero-subgradient condition

 Σ̂(β^* − β̂) + (1/n) X^T W + ρ_n ẑ = 0, (12)

and the signed-support-recovery conditions

 ẑ_i = sign(β^*_i) for all i ∈ S, (13a)
 β̂_j = 0 for all j ∈ S^c, (13b)
 |ẑ_j| < 1 for all j ∈ S^c, and (13c)
 sign(β̂_i) = sign(β^*_i) for all i ∈ S. (13d)

Then β̂ is the unique optimal solution to the Lasso (4), and it recovers the correct signed support.

See Appendix B.1 for the proof of this claim.

Thus, given Lemmas 1 and 2, it suffices to show that under the specified scaling of (n, p, k, γ), there exists a primal-dual pair (β̂, ẑ) satisfying the conditions of Lemma 2. We establish the existence of such a pair with the following constructive procedure:

1. We begin by setting β̂_{S^c} = 0, and ẑ_S = sign(β^*_S).

2. Next we determine β̂_S by solving the linear system

 Σ̂_SS(β^*_S − β̂_S) + (1/n) X_S^T W + ρ_n sign(β^*_S) = 0. (14)
3. Finally, we determine ẑ_{S^c} by solving the linear system

 −ρ_n ẑ_{S^c} = Σ̂_{S^c S}(β^*_S − β̂_S) + (1/n) X_{S^c}^T W. (15)

By construction, this procedure satisfies the zero sub-gradient condition (12), as well as auxiliary conditions (13a) and (13b); it remains to verify conditions (13c) and (13d).
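The three-step construction can be sketched numerically. The code below follows the construction, but is written with the sign conventions of the standard Lasso KKT system; the instance, dimensions, and regularization level are illustrative assumptions rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k, gamma, sigma, rho = 400, 64, 4, 0.5, 0.25, 0.1

S = np.arange(k)                       # support: w.l.o.g. the first k indices
beta_star = np.zeros(p)
beta_star[S] = rng.choice([-1.0, 1.0], size=k)

# gamma-sparsified design, rescaled as in (10) so that E[X_ij^2] = 1.
X = (rng.standard_normal((n, p)) / np.sqrt(gamma)) * (rng.random((n, p)) < gamma)
W = sigma * rng.standard_normal(n)

XS, XSc = X[:, :k], X[:, k:]
Sig_SS = XS.T @ XS / n

# Step 1: beta_hat is zero on S^c, and z_hat restricted to S is sign(beta*_S).
zS = np.sign(beta_star[S])
# Step 2: solve the S-block of the zero-subgradient equation for beta_hat_S.
beta_hat_S = beta_star[S] + np.linalg.solve(Sig_SS, XS.T @ W / n - rho * zS)
# Step 3: read off the dual variables on S^c from the remaining equations.
z_Sc = (XSc.T @ W / n - (XSc.T @ XS / n) @ (beta_hat_S - beta_star[S])) / rho

print("strict dual feasibility (13c):", np.abs(z_Sc).max() < 1)
print("sign consistency (13d):", np.array_equal(np.sign(beta_hat_S), zS))
```

When both printed conditions hold, the constructed pair certifies exact signed support recovery for this instance.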

In order to complete these final two steps, it is helpful to define the following random variables:

 V_j^a := ρ_n · (1/n) X_j^T X_S (Σ̂_SS)^{−1} 1⃗, (16a)
 V_j^b := X_j^T [ (1/n) X_S (Σ̂_SS)^{−1} X_S^T − I_{n×n} ] (W/n), (16b)
 U_i := e_i^T (Σ̂_SS)^{−1} [ (1/n) X_S^T W − ρ_n 1⃗ ], (16c)

where e_i is the unit vector with a one in position i, and 1⃗ is the all-ones vector.

A little bit of algebra (see Appendix B.2 for details) shows that ρ_n ẑ_j = V_j^a + V_j^b for each j ∈ S^c, and that β̂_i − β^*_i = U_i for each i ∈ S. Consequently, if we define the events

 E(V) := { max_{j∈S^c} |V_j^a + V_j^b| < ρ_n }, (17a)
 E(U) := { max_{i∈S} |U_i| ≤ β_min }, (17b)

where β_min was defined previously as the minimum absolute value of β^* on its support, then in order to establish that the Lasso succeeds in recovering the exact signed support, it suffices to show that P[E(V) ∩ E(U)] → 1.

We decompose the proof of this final claim into the following three lemmas. As in the statement of Theorem 1, suppose that n > (2 + ϵ) k log(p − k) for some fixed ϵ > 0, and let δ ∈ (0, 1) denote a fixed but arbitrarily small constant.

###### Lemma 3 (Control of Va).

Under the conditions of Theorem 1, we have

 P[ max_{j∈S^c} |V_j^a| ≥ (1 − δ) ρ_n ] → 0. (18)
###### Lemma 4 (Control of Vb).

Under the conditions of Theorem 1, we have

 P[ max_{j∈S^c} |V_j^b| ≥ δ ρ_n ] → 0. (19)
###### Lemma 5 (Control of U).

Under the conditions of Theorem 1, we have

 P[ (E(U))^c ] = P[ max_{i∈S} |U_i| > β_min ] → 0. (20)

### 3.2 Proof of Lemma 3

We assume throughout that Σ̂_SS is invertible, an event which occurs with probability converging to one under the stated assumptions (see Lemma 1). If we define the n-dimensional vector

 h := X_S (Σ̂_SS)^{−1} 1⃗, (21)

then the variable V_j^a can be written compactly as

 V_j^a / ρ_n = X_j^T h = Σ_{ℓ=1}^n h_ℓ X_ℓj. (22)

Note that each term in this sum is distributed as a mixture variable, taking the value 0 with probability 1 − γ, and distributed as a N(0, h_ℓ²/γ) variable with probability γ. For each ℓ = 1, …, n, define the discrete random variable

 H_ℓ =_d { h_ℓ with probability γ;  0 with probability 1 − γ. (23)

For each pair of indices (ℓ, j), let Z_ℓj ∼ N(0, 1/γ) be i.i.d. variates. With these definitions, by construction, we have

 V_j^a / ρ_n =_d Σ_{ℓ=1}^n H_ℓ Z_ℓj.

To gain some intuition for the behavior of this sum, note that the variables {H_ℓ} are independent of the variables {Z_ℓj}. (In particular, each H_ℓ is a function of X_S, whereas Z_ℓj is a function of X_j, with j ∉ S.) Consequently, we may condition on H := (H_1, …, H_n) without affecting {Z_ℓj}, and since each Z_ℓj is Gaussian, we have Σ_{ℓ=1}^n H_ℓ Z_ℓj ∣ H ∼ N(0, ‖H‖²_2 / γ). Therefore, if we can obtain good control on the norm ‖H‖₂, then we can use standard Gaussian tail bounds (see Appendix A) to control the maximum max_{j∈S^c} |V_j^a|. The following lemma is proved in Appendix C:
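A Monte Carlo sanity check of this representation, with an arbitrary fixed weight vector h standing in for X_S(Σ̂_SS)^{−1}1⃗ (an illustrative assumption): unconditionally, the Bernoulli masking (a variance factor of γ) and the 1/γ variance of Z cancel, so Var(Σ_ℓ H_ℓ Z_ℓj) = ‖h‖²₂.

```python
import numpy as np

rng = np.random.default_rng(3)
n, gamma, trials = 200, 0.2, 20000

# A fixed weight vector h, standing in for X_S (Sig_SS)^{-1} 1.
h = rng.standard_normal(n) / np.sqrt(n)

# H_l = h_l with probability gamma (else 0); Z_lj ~ N(0, 1/gamma), independent.
H = h * (rng.random((trials, n)) < gamma)
Z = rng.standard_normal((trials, n)) / np.sqrt(gamma)
sums = (H * Z).sum(axis=1)

# Unconditionally, Var = E||H||^2 / gamma = ||h||^2.
print("empirical variance:", sums.var(), "vs ||h||^2 =", np.linalg.norm(h) ** 2)
```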

###### Lemma 6.

Under condition (8c), for any fixed δ > 0, we have

 P[ ‖H‖²_2 ≤ γ k (1 + δ)/n ] ≥ 1 − O( exp( −min{ 2 log(p − k), n/(2k) } ) ).

The primary implication of the above bound is that each variable V_j^a / ρ_n is (essentially) no larger in magnitude than a N(0, (1 + δ) k/n) variable. We can then use standard techniques for bounding the tails of Gaussian variables to obtain good control over the random variable max_{j∈S^c} |V_j^a|. In particular, by the union bound, we have

 P[ max_{j∈S^c} |V_j^a| ≥ (1 − δ) ρ_n ] ≤ (p − k) · P[ |Σ_{ℓ=1}^n H_ℓ Z_ℓj| ≥ 1 − δ ].

For any δ > 0, define the event T(δ) := { ‖H‖²_2 ≤ γ k (1 + δ)/n }. Continuing on, we have

 P[ max_{j∈S^c} |V_j^a| ≥ (1 − δ) ρ_n ]
  ≤ (p − k) { P[ |Σ_{ℓ=1}^n H_ℓ Z_ℓj| ≥ 1 − δ ∣ T(δ) ] + P[ (T(δ))^c ] }
  ≤ (p − k) { 2 exp( −n(1 − δ)² / (2k(1 + δ)) ) + O( exp( −min{ 2 log(p − k), n/(2k) } ) ) },

where the last line uses a standard Gaussian tail bound (see Appendix A) and Lemma 6. Finally, it can be verified that under the condition n > (2 + ϵ) k log(p − k) for some ϵ > 0, and with δ > 0 chosen sufficiently small, the right-hand side tends to zero, so that P[ max_{j∈S^c} |V_j^a| ≥ (1 − δ) ρ_n ] → 0, as claimed.

### 3.3 Proof of Lemma 4

Defining the orthogonal projection matrix Π_S^⊥ := I_{n×n} − X_S(X_S^T X_S)^{−1} X_S^T, we then have

 P[ max_{j∈S^c} |V_j^b| ≥ δ ρ_n ] = P[ max_{j∈S^c} |X_j^T Π_S^⊥ (W/n)| ≥ δ ρ_n ] ≤ (p − k) · P[ |X_1^T Π_S^⊥ (W/n)| ≥ δ ρ_n ]. (24)

Recall from equation (23) the representation X_ℓj =_d ξ_ℓj Z_ℓj, where ξ_ℓj is Bernoulli with parameter γ, and Z_ℓj ∼ N(0, 1/γ) is Gaussian. The variable Σ_{ℓ=1}^n ξ_ℓj is binomial with parameters (n, γ); define the following event

 T := { Σ_{ℓ=1}^n ξ_ℓj ≤ n(γ + 1/(2√k)) }.

From the Hoeffding bound (see Lemma 7), we have P[T^c] ≤ 2 exp(−n/(2k)). Using this representation and conditioning on T, we have

 P[ |X_1^T Π_S^⊥ (W/n)| ≥ δ ρ_n ]
  ≤ P[ |(1/n) Σ_{ℓ=1}^n ξ_ℓj Z_ℓj Π_S^⊥(W)_ℓ| ≥ δ ρ_n ∣ T ] + P[T^c]
  ≤ P[ |(1/n) Σ_{ℓ=1}^{n(γ + 1/(2√k))} Z_ℓj Π_S^⊥(W)_ℓ| ≥ δ ρ_n ] + 2 exp(−n/(2k)),

where we have assumed, without loss of generality, that the non-zero entries of the column occupy the first n(γ + 1/(2√k)) positions. Since Π_S^⊥ is an orthogonal projection matrix, we have ‖Π_S^⊥ W‖₂ ≤ ‖W‖₂, so that

 P[ |X_1^T Π_S^⊥ (W/n)| ≥ δ ρ_n ] ≤ P[ |(1/n) Σ_{ℓ=1}^{n(γ + 1/(2√k))} Z_ℓj W_ℓ| ≥ δ ρ_n ] + 2 exp(−n/(2k)). (25)

Conditioned on W, the random variable (1/n) Σ_{ℓ=1}^{n(γ + 1/(2√k))} Z_ℓj W_ℓ is zero-mean Gaussian with variance

 ν(W; γ) := (1/(n² γ)) Σ_{ℓ=1}^{n(γ + 1/(2√k))} W_ℓ².

For some δ₁ > 0, define the event

 T₂(δ₁) := { ν(W; γ) ≤ (1 + δ₁) · σ²(γ + 1/(2√k)) / (n γ²) }.

Note that E[ν(W; γ)] = σ²(γ + 1/(2√k)) / (n γ²). Since (γ/σ²) Σ_ℓ W_ℓ² is χ² with n(γ + 1/(2√k)) degrees of freedom, using χ²-tail bounds (see Appendix A), we have

 P[ (T₂(δ₁))^c ] ≤ exp( −(3/16) n (γ + 1/(2√k)) δ₁² ).

Now, by conditioning on T₂(δ₁) and its complement, and using tail bounds on Gaussian variates (see Appendix A), we obtain

 P[ |(1/n) Σ_{ℓ=1}^{n(γ + 1/(2√k))} Z_ℓj W_ℓ| ≥ δ ρ_n ]
  ≤ P[ |(1/n) Σ_{ℓ=1}^{n(γ + 1/(2√k))} Z_ℓj W_ℓ| ≥ δ ρ_n ∣ T₂(δ₁) ] + P[ (T₂(δ₁))^c ] (26)
  ≤ 2 exp( −n γ² δ² ρ_n² / (2σ²(1 + δ₁)(γ + 1/(2√k))) ) + exp( −(3/16) n (γ + 1/(2√k)) δ₁² ).

Finally, putting together the pieces from equations (26), (25), and (24), we find that P[ max_{j∈S^c} |V_j^b| ≥ δ ρ_n ] is upper bounded by

 (p − k) { 2 exp( −n/(2k) ) + 2 exp( −n γ² δ² ρ_n² / (2σ²(1 + δ₁)(γ + 1/(2√k))) ) + exp( −(3/16) n (γ + 1/(2√k)) δ₁² ) }.

The first term goes to zero since n > 2k log(p − k). The second term goes to zero because γ + 1/(2√k) = O(γ) eventually (Condition (8c) implies that γ√k → ∞), and Condition (8a) implies that n γ ρ_n² / log(p − k) → ∞. Our choice of δ₁ and Condition (8c) (which implies that γk → ∞, and hence that n(γ + 1/(2√k)) grows faster than log(p − k)) is enough for the third term to go to zero.
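The binomial concentration step at the start of this proof is easy to check numerically. The sketch below (dimensions illustrative) compares the empirical probability that a column has more than n(γ + 1/(2√k)) non-zeros against the Hoeffding-type bound 2 exp(−n/(2k)) used above.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, gamma, trials = 400, 16, 0.1, 200000

# The number of non-zeros in one column is Binomial(n, gamma); the event T
# requires that it not exceed n * (gamma + 1 / (2 sqrt(k))).
threshold = n * (gamma + 1.0 / (2.0 * np.sqrt(k)))
counts = rng.binomial(n, gamma, size=trials)

empirical = (counts > threshold).mean()
hoeffding = 2.0 * np.exp(-n / (2.0 * k))   # the bound used in the proof
print("empirical exceedance:", empirical, " Hoeffding bound:", hoeffding)
```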

### 3.4 Proof of Lemma 5

We first observe that, conditioned on X_S, each U_i is Gaussian with mean and variance

 m_i := E[U_i ∣ X_S] = e_i^T ( (1/n) X_S^T X_S )^{−1} [ −ρ_n 1⃗ ],
 ψ_i := var[U_i ∣ X_S] = (σ² / (γ n)) · e_i^T ( (1/n) X_S^T X_S )^{−1} e_i.

Define the upper bounds

 m^* := ρ_n ( 1 + √k · O( (1/γ) √( max{ log(k)/(k log(p − k)), log log(p − k)/log(p − k) } ) ) ),
 ψ^* := (σ² / (γ n)) · [ 1 − O( (1/γ) √( max{ log(k)/(k log(p − k)), log log(p − k)/log(p − k) } ) ) ]^{−1},

and the following event

 T(m^*, ψ^*) := { max_{i∈S} |m_i| ≤ m^* and max_{i∈S} |ψ_i| ≤ ψ^* }.

Conditioning on T(m^*, ψ^*) and its complement, we have

 P[ (E(U))^c ] = P[ (1/β_min) max_{i∈S} |U_i| > 1 ]
  ≤ P[ (1/β_min) max_{i∈S} |U_i| > 1 ∣ T(m^*, ψ^*) ] + P[ (T(m^*, ψ^*))^c ].

Applying Lemma 10 with appropriate choices of its parameters, we have P[ (T(m^*, ψ^*))^c ] → 0.

We now deal with the first term. Letting Y_i := U_i − m_i denote the deviation of U_i from its conditional mean, and using T as shorthand for the event T(m^*, ψ^*), we have

 P[ (1/β_min) max_{i∈S} |U_i| > 1 ∣ T ] = E{ P[ max_{i∈S} |U_i| > β_min ∣ X_S, T ] }
  ≤ E{ P[ max_{i∈S} (|m_i| + |Y_i|) > β_min ∣ X_S, T ] }
  ≤ E{ P[ m^* + max_{i∈S} |Y_i| > β_min ∣ X_S, T ] }
  = E{ P[ (1/β_min) max_{i∈S} |Y_i| > 1 − m^*/β_min ∣ X_S, T ] }.

Condition (8b) implies that m^*/β_min → 0, so that 1 − m^*/β_min ≥ 1/2 eventually, and it suffices to upper bound

 E{ P[ (1/β_min) max_{i∈S} |Y_i| > 1/2 ∣ X_S, T ] } ≤ E{ k · P[ |Y^*| ≥ β_min/2 ∣ X_S, T ] } ≤ 2k exp( −β_min² / (8 ψ^*) ),

where Y^* denotes a zero-mean Gaussian variable with variance ψ^*, and we have used standard Gaussian tail bounds (see Appendix A).

It remains to verify that this final term converges to zero. Taking logarithms and ignoring constant terms, we have

 log(k) ( 1 − β_min² / (8 ψ^* log k) )
  = log(k) ( 1 − [ β_min² γ n ( 1 − O( (1/γ) √( max{ log(k)/(k log(p − k)), log log(p − k)/log(p − k) } ) ) ) ] / (8 σ² log k) ).

We would like to show that this quantity diverges to −∞. Condition (8c) implies that

 (1/γ) √( max{ log(k)/(k log(p − k)),