# Sparse Signal Processing with Linear and Nonlinear Observations: A Unified Shannon-Theoretic Approach

We derive fundamental sample complexity bounds for recovering sparse and structured signals for linear and nonlinear observation models including sparse regression, group testing, multivariate regression and problems with missing features. In general, sparse signal processing problems can be characterized in terms of the following Markovian property. We are given a set of N variables X_1, X_2, …, X_N, and there is an unknown subset of variables S ⊂ {1, …, N} that are relevant for predicting outcomes Y. More specifically, when Y is conditioned on {X_n}_{n ∈ S} it is conditionally independent of the other variables, {X_n}_{n ∉ S}. Our goal is to identify the set S from samples of the variables X and the associated outcomes Y. We characterize this problem as a version of the noisy channel coding problem. Using asymptotic information theoretic analyses, we establish mutual information formulas that provide sufficient and necessary conditions on the number of samples required to successfully recover the salient variables. These mutual information expressions unify conditions for both linear and nonlinear observations. We then compute sample complexity bounds for the aforementioned models, based on the mutual information expressions, in order to demonstrate the applicability and flexibility of our results in general sparse signal processing models.

## 1 Introduction

Recent advances in sensing and storage systems have led to the proliferation of high-dimensional data such as images, video or genomic data, which cannot be processed efficiently using conventional signal processing methods due to their dimensionality. However, high-dimensional data often exhibit an inherent low-dimensional structure, so they can often be represented “sparsely” in some basis or domain. The discovery of an underlying sparse structure is important in order to compress the acquired data or to develop more robust and efficient processing algorithms.

In this paper, we are concerned with the asymptotic analysis of the sample complexity in problems where we aim to identify a set of salient variables responsible for producing an outcome. In particular, we assume that among a set of N variables/features X_1, …, X_N, only K variables (indexed by the set S) are directly relevant to the outcome Y. We formulate this with the assumption that given X_S, the outcome Y is independent of the other variables X_{S^c}, i.e.,

 P(Y|X,S) = P(Y|X_S, S). (1)

Abstractly, we consider the following generative model: X is generated from a distribution Q(X) and the set of salient variables S is generated from a distribution over sets of size K. Then an observation Y is generated using the conditional distribution P(Y|X_S, S) conditioned on X_S and S, as in (1).

We assume we are given T sample pairs (X^(t), Y^(t)), t = 1, …, T, and the problem is to identify the set of salient variables, S, from these samples given the knowledge of the observation model P(Y|X_S, S). Our analysis aims to establish sufficient conditions on T in order to recover the set S with an arbitrarily small error probability, in terms of N, K, the observation model and other model parameters such as the signal-to-noise ratio. In this paper, we limit our analysis to the setting with independent and identically distributed (IID) variables for simplicity. It turns out that our methods can be extended to the dependent case at the cost of additional terms in our derived formulas that compensate for dependencies between the variables. Some results derived for the former setting were presented in [1] and more recently in [2].

The analysis of the sample complexity is performed by posing this identification problem as an equivalent channel coding problem, as illustrated in Figure 1. The salient set S corresponds to the message transmitted through a channel. The set S is encoded by X_S^T of length T, which is the collection of codewords X_n^T for n ∈ S, from a codebook X^T. The coded message is transmitted through a channel with output Y^T. As in channel coding, our aim is to identify which message was transmitted given the channel output Y^T and the codebook X^T.

The sufficiency and necessity results we present in this paper are analogous to the channel coding theorem for memoryless channels [3]. Before we present exact statements of our results, it is useful to mention that these results are roughly of the form

 T · I(X_S; Y | S) > log C(N, K), (2)

which can be interpreted as follows: the right side of the inequality, with C(N, K) denoting the binomial coefficient, is the number of bits required to represent all sets S of size K. On the left side, the mutual information term I(X_S; Y | S) represents the uncertainty reduction on the output Y when given the input X_S, in bits per sample. This term essentially quantifies the “capacity” of the observation model P(Y|X_S, S). Then, the total uncertainty reduction through the T samples should exceed the uncertainty of the possible salient sets S, in order to reliably recover the salient set.
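As a rough illustration of (2), the sketch below computes the smallest T satisfying T · I > log C(N, K) for a hypothetical per-sample mutual information value; the numbers are illustrative and not tied to any particular observation model.

```python
import math

def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def min_samples(n, k, info_per_sample):
    """Smallest integer T satisfying T * I(X_S; Y | S) > log C(N, K)."""
    return math.floor(log_binom(n, k) / info_per_sample) + 1

# Illustrative numbers: N = 1000 variables, K = 10 salient variables, and a
# hypothetical observation model providing 0.5 nats of information per sample.
T = min_samples(1000, 10, 0.5)
```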

Sparse signal processing models analyzed in this paper have wide applicability. Below we list some examples of problems which can be formulated in the described framework.

Sparse linear regression [4] is the problem of reconstructing a sparse signal from underdetermined linear systems. It is assumed that the output vector Y^T can be obtained from a K-sparse vector β through some linear transformation with matrix X^T, i.e., in the noisy case with noise W^T,

 Y^T = X^T β + W^T. (3)

Non-linear versions of the regression problem are also investigated, where the channel model also includes a quantization of the output. The sparse linear regression model with an example is illustrated in Figure 2. Note that in our analysis the columns of the matrix X^T correspond to the variables X_1, …, X_N and the support of the sparse vector β corresponds to the set S. It is then easy to see that the Markovianity property (1) holds. In contrast to the typical regression setup, the focus here is on the recovery of the support S, and not the sparse vector β. Hence, the non-zero coefficients β_S are absorbed into our observation model, as we elaborate in Section 3.
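The generative model above can be sketched in a few lines; the dimensions, noise level and Gaussian design below are illustrative assumptions, not requirements of the framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: T samples, N features, K-sparse signal.
T, N, K = 50, 200, 5

X = rng.standard_normal((T, N))           # measurement matrix; rows are samples
S = rng.choice(N, size=K, replace=False)  # salient support set S
beta = np.zeros(N)
beta[S] = rng.standard_normal(K)          # coefficients beta_S on the support
W = 0.1 * rng.standard_normal(T)          # additive noise W^T

Y = X @ beta + W                          # Y^T = X^T beta + W^T, as in (3)

# Markov property (1): Y depends on X only through the columns indexed by S.
assert np.allclose(X @ beta, X[:, S] @ beta[S])
```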

Models with missing features [5]: Our methods are also used to establish sample complexity bounds for sparse signal processing problems with missing features. The problem here is that some of the variables X_i^(t), for some of the measurements t, could be missing. Specifically, we observe a matrix Z^T instead of X^T, with the relation

 Z_i^(t) = X_i^(t) w.p. 1 − ρ,  and  Z_i^(t) = m w.p. ρ,  ∀ i, t,

i.e., we observe a version of the feature matrix which may have missing entries (denoted by m) with probability ρ, independently for each entry. Note that m can take any value as long as there is no ambiguity whether the realization is missing or not; e.g., m = 0 would be valid for continuous variables where the variable taking the value 0 has zero probability. Note that if a problem satisfies assumption (1) with variables X, the same problem with missing features also satisfies the assumption with variables Z. Interestingly, our analysis shows that the sample complexity T_miss for problems with missing features is related to the sample complexity T of the fully observed case with no missing features by the simple inequality:

 T_miss ≥ T / (1 − ρ).
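A minimal sketch of the erasure model, using `np.nan` as the missing symbol m (a valid choice for continuous variables); the dimensions and ρ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.3                                # per-entry missing probability
T, N = 1000, 20
X = rng.standard_normal((T, N))

# Each entry of X is erased independently with probability rho; the
# missing symbol m is represented by np.nan here.
mask = rng.random((T, N)) < rho
Z = np.where(mask, np.nan, X)

observed_frac = 1 - np.isnan(Z).mean()   # concentrates around 1 - rho
# Heuristic reading of T_miss >= T / (1 - rho): each sample carries roughly
# a (1 - rho) fraction of the information of a fully observed sample.
```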

Group testing [6] is a form of sensing with Boolean arithmetic, where the goal is to identify a set of defective items among a larger set of items. As an example, group testing has been used for medical screening to identify a set of individuals who have a certain disease from a large population while reducing the total number of tests. The idea is to pool blood samples from subsets of people and to test them simultaneously rather than conducting a separate blood test for each individual. In an ideal setting, the result of a test is positive if and only if the subset contains a positive sample. A significant part of the existing research is focused on combinatorial pool design to guarantee detection using a small number of tests. Several variants of the problem exist, such as noisy group testing with different types of errors. An interesting variant is the graph-constrained group testing problem, where the salient set is the set of defective links in a graph and each test is a random walk on the graph [7]. The group testing model can be represented graphically as in Figure 3, where X^T is a Boolean testing matrix and Y^T is the outcome vector. Again, the different columns of the testing matrix correspond to the variables X_1, …, X_N, while the defective set corresponds to the set S. Then, a test outcome Y^(t) only depends on X_S^(t), which captures the presence or absence of defective items in the test.
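The ideal (noiseless) group testing model can be simulated as follows; the Bernoulli pool design and all parameter choices are illustrative assumptions, and the last step shows the classical observation that items appearing in a negative test cannot be defective.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, T = 100, 2, 30

defective = set(rng.choice(N, size=K, replace=False))  # salient set S
# Bernoulli pool design: item i is included in test t with probability p.
p = 1 / K
X = rng.random((T, N)) < p

# Ideal model: a test is positive iff it contains a defective item.
Y = np.zeros(T, dtype=bool)
for t in range(T):
    Y[t] = any(X[t, i] for i in defective)

# Items appearing in any negative test are certainly non-defective.
cleared = set(np.flatnonzero(X[~Y].any(axis=0)))
candidates = set(range(N)) - cleared
assert defective <= candidates
```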

Sparse channel estimation [8] is used for the estimation of multi-path channels characterized by sparse impulse responses. The output of the channel depends on the input time instances, which correspond to the non-zero coefficients of the impulse response. In an equivalent channel model, the indices of the non-zero coefficients in the impulse response correspond to the encoded set S, and the coefficients themselves are absorbed into the channel model.

### 1.1 Related work and contributions

A large body of research work studies the sparse recovery problem, particularly from an information-theoretic (IT) perspective. In this section, we only describe work that is closely related to this paper. The dominant stream of research in this area deals with linear models and the mean-squared estimation of the sparse vector β in (3), under sub-Gaussian assumptions on the variables X. Below we list the contributions of our approach and contrast it to some of the related work in the literature.

Unifying framework for linear and non-linear problems: Much of the literature on sparse recovery has focused on particular sparse models, with reconstruction algorithms developed for specific settings. For instance, Lasso was used for linear regression [9, 10], relaxed integer programs for group testing [11], convex programs for 1-bit quantization [12], projected gradient descent for sparse regression with missing data [5], and other general forms of penalization. While all of these problems share an underlying sparse structure, it is conceptually unclear from a purely IT perspective how they come together from an inference standpoint. The approach presented herein unifies the different sparse models based on the conditional independence assumption in (1), and a single mutual information expression (2) is shown to provide an exact characterization of the sample complexity for such models.

Direct support recovery vs. signal estimation: Much of the existing work focuses on limits of sparse recovery in linear models based on sensing matrices drawn from the standard Gaussian ensemble [13, 14, 15, 16, 17, 18]. While only support recovery is contemplated in [13, 14, 18], the support is generally chosen by retaining those elements of the signal estimate that lie above a design threshold. Alternatively, the estimated support is chosen to minimize the error associated with the best estimator [13]. Thus, much of this related literature is focused on estimation, sometimes as a preliminary step towards support recovery, which we avoid in this paper. In sharp contrast to prior work, our analysis makes a clear distinction between signal estimation and support discovery. It is conceivable that if the support is known, then the signal can be reliably estimated using least-squares estimates or other variants. At a conceptual level, IT tools such as Fano’s inequality and the capacity theorems are powerful tools for inference of discrete objects (messages) given continuous observations. Indeed, to exploit such tools, [13, 14, 15, 16, 17, 18] resort to one of the following strategies: (a) use IT tools only to establish necessary conditions for recovery by assuming a discrete β, and derive sufficient conditions using some of the well-known algorithms (Lasso, basis pursuit, etc.); or (b) find an ε-cover for β in some metric space (which requires imposing some extra assumptions) and reduce β to a discrete object. In contrast, our approach lifts these assumptions and focuses on the discrete combinatorial component of the object, namely the support S. Indeed, our results in Section 3 show that the discrete part, namely the uncertainty of the support pattern S, is the dominating factor and not β itself.
Furthermore, prior work relied heavily on the design of sampling matrices with special structures such as Gaussian ensembles and RIP matrices, which is a key difference from the setting we consider herein, as for our purposes we do not always have the freedom to design the matrix X^T. We do not make explicit assumptions about the structure of the sensing matrix, such as the restricted isometry property [19] or incoherence properties [9], or about the distribution of the matrix elements, such as sub-Gaussianity. Also, the existing information-theoretic bounds, which are largely based on Gaussian ensembles, are limited to the linear regression model, and hence are not suitable for the non-linear models we consider herein.

It is worth noting that the authors in [20] adopt a parallel approach to derive sufficiency bounds for direct support recovery, albeit their analysis is focused on a hypothesis testing framework with fixed measurement matrices. In contrast, here we consider a general Bayesian framework with random X and β.

Performance bounds for new sparse recovery problems: Our unifying approach also allows us to study problems that were not previously analyzed, or that are not easily analyzed, using the previous approaches. This includes problems with new observation models, or existing models with different distributions of variables. Using the formulation presented herein, obtaining necessary and sufficient conditions and error bounds only requires the computation of simple mutual information expressions.

The problem of identifying relevant variables was formulated in a channel coding framework in [6] and in the Russian literature in [21, 22, 23, 24, 25] in the context of group testing. Both sufficient and necessary conditions on the number of tests for the group testing problem with IID test assignments were derived. One main difference between the Russian literature and [6] is that, in the former, the number of defective items, K, is held fixed while the number of items, N, approaches infinity. Consequently, the earlier work suggests that the number of tests must scale poly-logarithmically in N for the error probability to approach zero. In contrast, [6] considers the setting wherein both the number of defective items K as well as the number of items N can approach infinity, and characterizes constants related to T precisely for fixed K. The sufficient condition in [6] was derived based on the analysis of a Maximum Likelihood decoder, while the necessary condition was derived using Fano’s inequality [3]. This analysis was further extended to general sparse signal processing models and models with dependent variables in [1].

In this paper, we are concerned with the analysis of the problem with IID variables X_1, …, X_N, which encompasses many important problems, such as the classical group testing or sparse linear regression models, to name a few. While this setup and a similar approach were considered in [1, 26], this paper presents a more thorough and rigorous analysis, including the analysis of problems with latent variable observation models, formally extending the analysis to continuous models, presenting results for scaling models, and bounds for many example applications.

In Section 2, we introduce our notation and provide a formal description of the problem. In Section 3, we state necessary and sufficient conditions on the number of samples required for recovery. Applications are considered in Section 4, including bounds for sparse linear regression, group testing models, and models with missing data. We summarize our results in Section 5. We defer the proofs of theorems and lemmas in Sections 3 and 4 to the Appendix.

## 2 Problem Setup

##### Notation.

We use upper case letters to denote random variables, vectors and matrices, and we use lower case letters to denote realizations of scalars, vectors and matrices. Subscripts are used for column indexing and superscripts with parentheses are used for row indexing in vectors and matrices. Subscripting with a set S implies the selection of columns with indices in S. Table 1 provides a reference and further details on the used notation. The transpose of a vector or matrix is denoted by the symbol ⊤. log is used to denote the natural logarithm and entropic expressions are defined using the natural logarithm; however, results can be converted to other logarithmic bases w.l.o.g., such as base 2 used in [6].

##### Variables.

Let X = (X_1, X_2, …, X_N) denote a set of IID random variables with a joint probability distribution Q(X). To simplify the expressions, we do not use subscript indexing on Q to denote the distributions of subsets of the variables, since the distribution is determined solely by the number of variables indexed.

##### Candidate sets.

We index the different sets of size K as S_ω with index ω, so that S_ω is a set of K indices corresponding to the ω-th set of variables. Since there are N variables in total, there are C(N, K) such sets, therefore ω ∈ I = {1, …, C(N, K)}. For any two sets S_ω and S_ω̄, we define S_{ω,ω̄}, S_{ω,ω̄^c} and S_{ω^c,ω̄} as the overlap set, the set of indices in S_ω but not in S_ω̄, and the set of indices in S_ω̄ but not in S_ω, respectively. Namely, S_{ω,ω̄} = S_ω ∩ S_ω̄, S_{ω,ω̄^c} = S_ω \ S_ω̄ and S_{ω^c,ω̄} = S_ω̄ \ S_ω.
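For concreteness, the candidate sets and the overlap/difference sets can be enumerated as follows (toy values of N and K; the indexing by lexicographic order is an illustrative convention):

```python
from itertools import combinations

N, K = 6, 3

# Enumerate all C(N, K) candidate sets S_w, indexed by w in lexicographic order.
candidate_sets = list(combinations(range(N), K))
assert len(candidate_sets) == 20  # C(6, 3)

# Overlap and difference sets for two candidates, as defined above.
S_w, S_wbar = set(candidate_sets[0]), set(candidate_sets[5])
overlap = S_w & S_wbar        # S_{w,wbar}   = S_w intersect S_wbar
in_w_only = S_w - S_wbar      # S_{w,wbar^c} = S_w \ S_wbar
in_wbar_only = S_wbar - S_w   # S_{w^c,wbar} = S_wbar \ S_w
```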

##### Observations.

We let Y denote an observation or outcome, which depends only on a small subset of variables of known cardinality K, where K ≤ N. In particular, Y is conditionally independent of the variables X_{S^c} given the subset of variables X_S indexed by the index set S, as in (1), i.e.,

 P(Y|X,S) = P(Y|X_S, S),

where X_S is the subset of variables indexed by the set S. We assume the true set S = S_θ for some random variable θ distributed over I.

##### Latent observation parameters.

We consider an observation model which is not completely deterministic and known, but depends on a latent variable β_S. Note that this is a more general model compared to [6] and [1]. We assume β_S is IID across the indices in S, is independent of the variables X, and has a prior distribution p(β_S). We further assume that for discrete β_S the probability is lower bounded by a constant on its support, independent of N, K and T. While this assumption is used to obtain the sufficiency results we present in the paper, it is not essential and it may be possible to remove it using a different analysis.

##### Observation model.

The outcomes depend on both X_S and β_S and are generated according to the model P(Y|X_S, β_S, S). As an example, this latent variable corresponds to the non-zero coefficients of the K-sparse vector β in the sparse linear regression framework in Section 4.1.1, or the impulse response coefficients in the sparse channel estimation framework. Note that (1) still holds in this model, where P(Y|X_S, S) is averaged over β_S conditioned on S.

We use the lower-case notation p(·) as a shorthand for the conditional distribution given the true subset of variables X_S. For instance, with this notation we have p(Y|X_S) = P(Y|X_S, S), p(Y|X_S, β_S) = P(Y|X_S, β_S, S), etc. When we would like to distinguish between the outcome distribution conditioned on different sets of variables, we use the notation p_ω(Y|X_{S_ω}) to emphasize that the conditional distribution is conditioned on the given variables X_{S_ω}, assuming the true set is S.

We observe the realizations of T variable-outcome pairs (X^(t), Y^(t)), t = 1, …, T, with each sample a realization of the pair (X, Y). The variables X^(t) are distributed IID across t. However, the outcomes Y^(t) are independent for different t only when conditioned on β_S. Our goal is to identify the set S from the data samples X^T and the associated outcomes Y^T, with an arbitrarily small average error probability.

##### Decoder and probability of error.

We let Ŝ(X^T, Y^T) denote an estimate of the set S, which is random due to the randomness in S, X^T and Y^T. We further let P(E) denote the average probability of error, averaged over all sets of size K, realizations of variables X^T and outcomes Y^T, i.e.,

 P(E) = Pr[Ŝ(X^T, Y^T) ≠ S] = Σ_{ω ∈ I} P(ω) Pr[Ŝ(X^T, Y^T) ≠ S_ω | S_ω].
##### Scaling variables and asymptotics.

We let K = K(N) be a function of N such that K ≤ N, and let T = T(N, K) be a function of both N and K. Note that K can be a constant function, in which case it does not depend on N. For asymptotic statements, we consider N → ∞ and let K and T scale as the defined functions of N. We formally define sufficient and necessary conditions for recovery as below.

###### Definition 2.1.

For a function f(N, K), we say an inequality T > f(N, K) (or T ≥ f(N, K)) is a sufficient condition for recovery if there exists a sequence of decoders Ŝ such that P(E) → 0 when T > f(N, K) (or T ≥ f(N, K)) for sufficiently large N; i.e., for any δ > 0, there exists N₀ such that for all N ≥ N₀, T > f(N, K) (or T ≥ f(N, K)) implies P(E) ≤ δ. Conversely, we say an inequality T ≥ f(N, K) (or T > f(N, K)) is a necessary condition for recovery if P(E) does not vanish for any sequence of decoders when T < f(N, K) (or T ≤ f(N, K)).

## 3 Conditions for Recovery

In this section, we state and prove sufficient and necessary conditions for the recovery of the salient set with an arbitrarily small average error probability.

Central to our analysis are the following assumptions, which we utilize in order to analyze the probability of error in recovering the salient set and to obtain sufficient and necessary conditions on the sample complexity.

1. Equi-probable support: Any set S_ω with K elements is equally likely a priori to be the salient set. We assume we have no prior knowledge of the salient set among the C(N, K) possible sets.

2. Conditional independence: The observation/outcome Y is conditionally independent of the other variables given X_S, the variables with indices in S, i.e., P(Y|X, S) = P(Y|X_S, S). This assumption follows directly from our formulation of sparse recovery problems. We further assume the observation model does not depend on the identity of S except through X_S, i.e., for any ω, ω̄, P(Y|X_{S_ω} = x, S_ω) = P(Y|X_{S_ω̄} = x, S_ω̄).

3. IID variables: The variables X_1, …, X_N are independent and identically distributed. While the independence assumption is not valid for all sparse recovery problems, many problems of interest can be analyzed within the IID framework, as in Section 4.

4. Observation model symmetry: For any permutation mapping π of the entries of X_S, P(Y|X_S = x_S, S) = P(Y|X_S = π(x_S), S), i.e., the observations are independent of the ordering of the variables. This is not a very restrictive assumption, since asymmetry w.r.t. the indices can usually be incorporated into β_S. In other words, the symmetry is assumed for the observation model when averaged over β_S.

##### Remarks on support versus support coefficients:

In many sparse recovery problems we are concerned with the recovery of an underlying sparse vector β, which has a sparsity support S and coefficients β_S on the support. For instance, a simple example that exhibits such structure is the following linear observation model, where

 Y = ⟨X, β⟩ + W = ⟨X_S, β_S⟩ + W,

with noise W, along with extensions to non-linear models, where Y = f(⟨X_S, β_S⟩, W) for a function f.

In this work, we are specifically concerned with the recovery of the support S and not the recovery of the support coefficients β_S. Instead, we incorporate the effects of the support coefficients into the observation model, assuming a prior density p(β_S), such that

 p(Y|X) = p(Y|X_S) = ∫ p(Y|X_S, β_S) p(β_S) dβ_S,

in order to analyze errors in recovering S. In contrast, other error criteria are also considered for sparse recovery problems, mostly in the compressive sensing literature, such as the distance between the true β and the estimated vector.
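When the marginalization integral above has no closed form, it can be approximated by Monte Carlo. The sketch below assumes, purely for illustration, a linear-Gaussian observation model and a standard normal prior on β_S; none of these choices are prescribed by the framework.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 0.5    # observation noise standard deviation (assumed)
M = 20000      # Monte Carlo draws from the prior p(beta_S)

def gaussian_pdf(y, mean, std):
    """Density of N(mean, std^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def marginal_likelihood(y, x_S):
    """Monte Carlo estimate of p(Y | X_S) = E_{beta_S}[ p(Y | X_S, beta_S) ],
    with an (assumed) standard normal prior on beta_S and the linear model
    Y = <X_S, beta_S> + W, W ~ N(0, sigma^2)."""
    betas = rng.standard_normal((M, len(x_S)))  # draws of beta_S from the prior
    means = betas @ x_S                         # <x_S, beta_S> for each draw
    return gaussian_pdf(y, means, sigma).mean()

val = marginal_likelihood(0.0, np.array([1.0, -1.0]))
```

In this toy case the integral is available exactly (p(Y|X_S) is Gaussian with variance sigma² + ||x_S||²), which makes the estimator easy to sanity-check.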

### 3.1 Sufficiency

In this section, we prove a sufficient condition for the recovery of S. The notation in this section assumes discrete variables and observations; however, simply replacing the related sums with appropriate integrals generalizes the notation to the continuous case. We consider models with non-scaling distributions, i.e., models where the observation model, the variable distributions/densities and the number of relevant variables K do not depend on the scaling variables N or T. Group testing as set up in Section 4.2.2 for fixed K is an example of such a model. We defer the discussion of models with scaling distributions and K to Section 3.3.

To derive the sufficiency bound for the required number of samples, we analyze the error probability of a Maximum Likelihood (ML) decoder [27]. For this analysis, we assume that S_1 is the true set among the S_ω, ω ∈ I. We can assume this w.l.o.g. due to the equi-probable support, IID variables and observation model symmetry assumptions (A1)-(A4); thus we can write

 P(E) = (1 / C(N, K)) Σ_{ω ∈ I} Pr[Ŝ(X^T, Y^T) ≠ S_ω | S_ω] = P(E | S_1).

For this reason, we omit the conditioning on S_1 in the error probability expressions throughout this section.

The ML decoder goes through all possible sets S_ω, ω ∈ I, and chooses the set S_{ω*} such that

 p_{ω*}(Y^T | X^T_{S_{ω*}}) > p_ω(Y^T | X^T_{S_ω}),  ∀ ω ≠ ω*, (4)

and consequently, if any set other than the true set is more likely, an error occurs. This decoder is a minimum probability of error decoder for equi-probable sets, as we assumed in (A1). Note that the ML decoder requires the knowledge of the observation model p(Y|X_S) and the distribution p(β_S).
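As a toy illustration of the decoding rule (4), the sketch below runs an exhaustive ML decoder on a small linear-Gaussian model in which β_S is degenerate (known and equal to all ones), so the likelihood is an explicit Gaussian; all parameter choices are illustrative assumptions, and the exhaustive search over all C(N, K) sets is only feasible at this toy scale.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
N, K, T, sigma = 12, 2, 40, 0.1

# Toy linear model with known coefficients beta_S = 1, so that
# p_w(Y^T | X^T_{S_w}) is an explicit Gaussian likelihood.
X = rng.standard_normal((T, N))
S_true = (2, 7)
Y = X[:, S_true].sum(axis=1) + sigma * rng.standard_normal(T)

def neg_log_lik(S):
    """-log p_w(Y^T | X^T_{S_w}) up to additive constants, Gaussian noise."""
    r = Y - X[:, S].sum(axis=1)
    return (r ** 2).sum()

# Exhaustive ML decoding over all C(N, K) candidate sets, as in (4).
S_hat = min(itertools.combinations(range(N), K), key=neg_log_lik)
```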

Remarks on typicality decoding: It is worth mentioning that a typicality decoder can also be analyzed to obtain a sufficient condition, as was done in early versions of [28]. However, typicality conditions must be defined carefully to obtain a tight bound, since with standard typicality definitions the atypicality probability may dominate the decoding error probability in the typical set. For instance, for the group testing scenario considered in [6], ensuring typicality in the strong sense (as needed to apply results such as the packing lemma [29]) would require an undesirable scaling of the model parameters. Redefining the typical set as in [28] is then necessary, but it is problem-specific and makes the analysis cumbersome compared to the ML decoder adopted herein and in [6]. Furthermore, the case where K scales together with N requires an even more subtle analysis, whereas the analysis of the ML decoder is more straightforward in regards to that scaling. Typicality decoding has also been reported as infeasible for the analysis of similar problems, such as multiple access channels where the number of users scales with the coding block length [30].

We now derive a simple upper bound on the error probability of the ML decoder, which is averaged over all sets, data realizations and observations. Define the error event E_i as the event of mistaking the true set for a set which differs from the true set in exactly i variables; thus we can write

 P(E_i) = Pr[∃ ω ≠ 1 : p_ω(Y^T|X^T_{S_ω}) ≥ p_1(Y^T|X^T_{S_1}), |S_{1^c,ω}| = |S_{1,ω^c}| = i, |S_1| = |S_ω| = K].

Using the union bound, the probability of error can then be upper bounded by

 P(E) ≤ Σ_{i=1}^{K} P(E_i) = Σ_{i=1}^{K} Σ_{X^T_{S_1}} Σ_{Y^T} Q(X^T_{S_1}) p_1(Y^T|X^T_{S_1}) P(E_i | X^T_{S_1}, Y^T, ω = 1), (5)

where P(E_i | X^T_{S_1}, Y^T, ω = 1) is the probability of decoding error in exactly i variables, conditioned on the true index ω = 1, the realization X^T_{S_1} for the set S_1, and on the output sequence Y^T. While we use notation for discrete variables and observations throughout this section, the continuous case follows by replacing sums with appropriate integrals.

Next we state our main result, which complements and generalizes the results in [6]. The following theorem provides a sufficient condition on the number of samples for an arbitrarily small average error probability.

###### Theorem 3.1.

(Sufficiency). Let (S_1, S_2) be any partition of the true set S into i and K − i indices respectively, let ε > 0 be an arbitrary constant, let I(X_{S_1}; Y | X_{S_2}, β_S, S) be the conditional mutual information conditioned on fixed β_S and S, and define I(X_{S_1}; Y | X_{S_2}, β_min, S) to be the worst-case (w.r.t. β_S) conditional mutual information conditioned on fixed β_S, where

 β_min ∈ argmin_{b ∈ B^K} I(X_{S_1}; Y | X_{S_2}, β_S = b, S) = argmin_{b ∈ B^K} E_{S, Y, X_{S_1}, X_{S_2}}[ log ( P(Y | X_{S_1}, X_{S_2}, β_S = b, S) / P(Y | X_{S_2}, β_S = b, S) ) ]. (6)

Then, if assumptions (A1)-(A4) are satisfied,

 T > (1 + ε) · max_{i=1,…,K} log C(N−K, i) / I(X_{S_1}; Y | X_{S_2}, β_min, S), (7)

is a sufficient condition for the average error probability to approach zero asymptotically, i.e., P(E) → 0. (“Sufficient condition” is defined formally in the problem setup; here f(N, K) = max_{i=1,…,K} log C(N−K, i) / I(X_{S_1}; Y | X_{S_2}, β_min, S).)

Note that it is sufficient to compute the mutual information for one value of S (e.g., S = {1, …, K}) instead of averaging over all possible S, since the conditional mutual information expressions are identical due to our symmetry assumptions on the variable distribution and the observation model. Similarly, the bound need only be computed for one partition (S_1, S_2) for each i, since our assumptions ensure that the mutual information is identical for all such partitions. Also, since β_S is IID across the indices in S, β_min can be chosen with identical entries b for some b ∈ B.

###### Remark 3.1.

In certain models the worst-case mutual information can be exactly equal to zero, such as the linear model with β_min = 0, which would lead to a vacuous upper bound. However, such cases can possibly be avoided for models where such β_S occur with zero probability (such as continuous β_S), by considering a typical set of β_S where the worst-case mutual information is non-zero in the set and the atypical set has vanishing probability.

The sufficiency conditions in Theorem 3.1 are derived from an upper bound on the error probability for each i. This upper bound is characterized by the error exponent E_o(ρ), which is described by

 E_o(ρ) = −(1/T) log Σ_{Y^T} Σ_{X^T_{S_2}} [ Σ_{X^T_{S_1}} Q(X^T_{S_1}) p(Y^T, X^T_{S_2} | X^T_{S_1})^{1/(1+ρ)} ]^{1+ρ},   0 ≤ ρ ≤ 1, (8)

and the following lemma provides the upper bound on P(E_i), the probability of decoding error in i variables.

###### Lemma 3.1.

The probability of the error event E_i defined above, that a set which differs from the true set S_1 in exactly i variables is selected by the ML decoder (averaged over all data realizations and outcomes), is bounded from above by

 P(E_i) ≤ exp( −( T E_o(ρ) − ρ log C(N−K, i) − log C(K, i) ) ). (9)

The proof for Lemma 3.1 follows largely along the proof of Lemma III.1 of [6] for discrete variables and observations. We note certain differences in the proof and the result, and further extend it to continuous variables and observations in Section A.6.

The proof of Theorem 3.1 is provided in the Appendix. It follows from lower bounding the error exponent E_o(ρ) using a worst-case analysis for β_S to reduce it to a single-letter expression, and performing a Taylor series analysis of the lower bound around ρ = 0, from which the worst-case mutual information condition is derived. This Taylor series analysis is similar to the analysis of the ML decoder in [27]. While our proof uses a similar methodology to the proof of Theorem III.1 in [6], there are very important conceptual and technical differences, including

• the generalization to discrete alphabets for both X and Y with arbitrary cardinality,

• the generalization to continuous alphabets,

• the handling of latent observation model parameters β_S, which complicates the error exponent and mutual information expressions and induces dependence between the pairs (X^(t), Y^(t)) across t,

• the second order analysis of the error exponent for scaling models.

All of the above are necessary for the analysis of general sparse signal processing problems and represent a significant technical contribution. In contrast, the group testing model considered in [6] can be viewed as a special case, which enabled the use of simpler analysis.

It is also important to highlight the main difference between the analysis of the error probability for the problem considered herein and the channel coding problem. In contrast to channel coding, the codewords of a candidate set and the true set are not independent, since the two sets may overlap. To overcome this difficulty, we separate the error events $E_i$, $i = 1,\dots,K$, of misidentifying the true set in exactly $i$ items. Then, for every $i$ we fix the correctly identified elements of the true set and average over the set of possible codeword realizations for every candidate set with $i$ differing elements.

### 3.2 Necessity

In this section, we derive lower bounds on the required number of measurements using Fano’s inequality [3]. We state the following theorem:

###### Theorem 3.2.

Let $(S_1, S_2)$ be any partition of the true set $S$ into $i$ and $K-i$ indices, respectively, and define $I(X_{S_1}; Y | X_{S_2}, \beta_S, S)$ to be the conditional mutual information between $X_{S_1}$ and $Y$ conditioned on $X_{S_2}$, $\beta_S$ and the true set $S$. (Note that this mutual information is averaged over $\beta_S$, instead of being defined for a fixed value of $\beta_S$.) For $N$ variables and a set of $K$ salient variables, a necessary condition (defined formally in the problem setup) on the number of samples $T$ required for the probability of error to vanish asymptotically is given by

$$T \ge \max_{i=1,\dots,K} \frac{\log\binom{N-K+i}{i}}{I(X_{S_1}; Y | X_{S_2}, \beta_S, S)}. \tag{10}$$
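To illustrate, the necessity bound (10) can be evaluated directly once the conditional mutual information is known for each error size. The sketch below uses a hypothetical model where the mutual information grows linearly in the number $i$ of unknown indices; for a real model this function must be computed from the observation distribution.

```python
import math

def necessity_bound(N, K, mutual_info):
    """Evaluate the bound (10): T >= max_i log C(N-K+i, i) / I_i.

    mutual_info(i): conditional mutual information (in nats) for a
    partition with |S_1| = i (model-dependent; supplied by the caller)."""
    return max(math.log(math.comb(N - K + i, i)) / mutual_info(i)
               for i in range(1, K + 1))

# Hypothetical example: information grows linearly with the number of unknown indices.
T_min = necessity_bound(1000, 10, lambda i: 0.1 * i)
```

As expected from the $\log\binom{N-K+i}{i}$ numerator, increasing $N$ with everything else fixed increases the required number of samples.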

The proof follows along similar lines to the proof of Theorem IV.1 in [6], with some important differences regarding the latent variable $\beta_S$ in the observation model and the explicit conditioning on $\beta_S$ and $S$. We detail the differences in the Appendix.

###### Remark 3.2.

Given that the worst-case mutual information is equal to the average mutual information (or that no random $\beta_S$ exists), the sufficiency bound in Theorem 3.1 is tight, as it matches the lower bound given in Theorem 3.2. For non-scaling models, the bound is always order-wise tight if the worst-case mutual information is strictly positive.

Interpretation. Intuitively, the bounds in (7) and (10) can be explained as follows. For each $i$, the numerator is approximately the number of bits required to represent all sets that differ from the true set $S$ in $i$ elements. The denominator represents the information the outcome $Y$ provides about the remaining indices $S_1$, given the subset $S_2$ of true indices. Hence, the ratio represents the number of samples needed to control support errors in $i$ indices, and the maximization accounts for all possible numbers of support errors.

Support recovery and support coefficients. In the sufficiency and necessity proofs above, we show that $\beta_S$ being unknown with a prior induces a penalty term in the denominator, compared to the case where the support coefficients are fixed and known. We show that this term is always dominated by the mutual information term and therefore does not affect the sample complexity asymptotically. This shows that recovering the support given knowledge of the support coefficients is asymptotically as hard as recovering the support with unknown coefficients, underscoring the importance of recovering the support in sparse recovery problems.

Partial recovery. Since we analyze the error probability separately for each number $i$ of support errors in order to obtain the necessity and sufficiency results, it is straightforward to determine necessary and sufficient conditions for partial support recovery instead of exact support recovery. By changing the maximization from over $i = 1,\dots,K$ to $i = \lceil \alpha K \rceil,\dots,K$ in the two recovery bounds, the conditions to recover at least a fraction $1-\alpha$ of the support indices can be determined.
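The modification for partial recovery amounts to restricting the range of the maximization. The sketch below (an illustration with a hypothetical mutual information function, as before) covers both the exact and the partial case.

```python
import math

def recovery_bound(N, K, mutual_info, alpha=0.0):
    """Necessity-style bound with the maximization over i = ceil(alpha*K)..K.

    alpha = 0 recovers the exact-recovery bound; alpha > 0 allows up to a
    fraction alpha of the support to be misidentified."""
    i_min = max(1, math.ceil(alpha * K))
    return max(math.log(math.comb(N - K + i, i)) / mutual_info(i)
               for i in range(i_min, K + 1))
```

Since the partial-recovery bound maximizes over a subset of the error sizes, it can never exceed the exact-recovery bound.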

### 3.3 Sufficiency for Models with Scaling

In this section we consider models with scaling distributions, i.e., models where the observation model, the variable distributions/densities and the number of relevant variables $K$ may depend on scaling variables such as $N$ and $T$. While Theorem 3.1 characterizes precisely the constants in the sample complexity, it is also important to analyze models where $K$ scales with $N$ or where the distributions depend on scaling variables. Group testing where $K$ scales with $N$ is an example of such a model, as is the normalized sparse linear regression model of Section 4.1.1, where the SNR and the matrix statistics are functions of $N$ and $T$. Therefore, in this section we consider the most general case where the model parameters can be functions of $N$, $K$ and $T$. Note that the necessity result in Theorem 3.2 also holds for scaling models and thus does not need to be generalized.

For this case, in addition to the general assumptions (A1)-(A4), we require additional smoothness properties related to the second derivative of the error exponent defined in (8). We first present a sufficient condition involving multi-letter expressions (the mutual information characterizations in Sections 3.1 and 3.2 were single-letter). This theorem presents the most general result in this section, with the least stringent second-derivative conditions.

###### Theorem 3.3.

(Multi-letter sufficiency condition). Let $(S_1, S_2)$ be any partition of the true set $S$ into $i$ and $K-i$ indices, respectively, and define $I(X^T_{S_1}; Y^T | X^T_{S_2}, S)$ to be the multi-letter conditional mutual information between $X^T_{S_1}$ and $Y^T$ conditioned on $X^T_{S_2}$ and the true set $S$.

Let $\tau_N$ be a sequence of numbers (which can be a function of $N$, $K$, etc.) that bounds the second derivative of the error exponent for all $0 \le \rho \le 1$. Then, if assumptions (A1)-(A4) are satisfied,

$$\min_{i=1,\dots,K} \frac{I(X^T_{S_1}; Y^T | X^T_{S_2}, S)}{\tau_N \log\binom{N-K}{i}} > 1 \tag{11}$$

is a sufficient condition for the average error probability to asymptotically approach zero.

As both the error exponent and the mutual information expression in the above theorem are multi-letter expressions, the second-derivative condition and the sufficiency bound may be difficult to analyze, in contrast to the single-letter characterization of Theorem 3.1. In the theorem below we present a single-letter simplification of Theorem 3.3, which has slightly stronger conditions and may have a looser sufficiency bound in certain cases.

###### Theorem 3.4.

(Single-letter sufficiency condition). Let $(S_1, S_2)$ be any partition of the true set $S$ into $i$ and $K-i$ indices, respectively, let $I(X_{S_1}; Y | X_{S_2}, \beta_S = b, S)$ be the conditional mutual information conditioned on fixed $\beta_S = b$, and let $I(X_{S_1}; Y | X_{S_2}, \beta_{\min}, S)$ be the worst-case (w.r.t. $\beta_S$) conditional mutual information conditioned on fixed $\beta_S$, as defined in (6).

Let $E_o(\rho, \beta_S = b)$ be the single-letter conditional error exponent as defined in (A.4) for any $b$, and let $\tau_N$ be a sequence of numbers (which can be a function of $N$, $K$, etc.) such that

$$|E_o''(\rho, \beta_S = b)| \le \frac{\tau_N}{5}\, I(X_{S_1}; Y | X_{S_2}, \beta_S = b, S), \tag{12}$$

for all $b$, $0 \le \rho \le 1$ and every partition $(S_1, S_2)$. Then, if asymptotically $\frac{K}{T p_{\min}}\cdot\tau_N \to 0$,

$$T > \max_{i=1,\dots,K} \frac{\log\binom{N-K}{i}}{I(X_{S_1}; Y | X_{S_2}, \beta_{\min}, S) - \frac{K}{T p_{\min}}\cdot\tau_N} \tag{13}$$

is a sufficient condition for the average error probability to asymptotically approach zero.

The single-letter theorem above is general and useful for all models, both continuous and discrete, and the second-derivative condition can be checked more easily, indirectly, using the bounds we present in the lemma below.

###### Lemma 3.2.

Let

$$g_\rho \triangleq \Bigg(\sum_{X_{S_1}} Q(X_{S_1})\, p(Y, X_{S_2} | X_{S_1}, \beta_S)^{\frac{1}{1+\rho}}\Bigg)^{1+\rho}, \qquad u_\rho \triangleq \frac{p(Y | X_{S_1}, X_{S_2}, \beta_S)^{\frac{1}{1+\rho}}}{\sum_{X'_{S_1}} Q(X'_{S_1})\, p(Y | X'_{S_1}, X_{S_2}, \beta_S)^{\frac{1}{1+\rho}}}$$

and note that $\mathbb{E}[u_\rho] = 1$, where the expectation is over $X_{S_1} \sim Q$. (While we use notation for discrete variables and observations, the continuous case follows by replacing sums with the appropriate integrals.) Then,

$$|E_o''(\rho, \beta_S)| \le \frac{\sum_{Y, X_{S_2}} g_\rho\, \mathbb{E}[u_\rho \log^2 u_\rho]}{\sum_{Y, X_{S_2}} g_\rho} \le \sup_{Y, X_{S_2}} \mathbb{E}[u_\rho \log^2 u_\rho], \tag{14}$$

for $0 \le \rho \le 1$ and any $\beta_S$.
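The quantities in Lemma 3.2 are easy to compute for a small discrete model. The sketch below uses a toy binary channel with a uniform input distribution $Q$ (and, for simplicity, no $X_{S_2}$ or $\beta_S$); all the specific distributions here are illustrative. It evaluates $u_\rho$ and the bound $\sup_{Y} \mathbb{E}[u_\rho \log^2 u_\rho]$ from (14), and one can check numerically that $\mathbb{E}[u_\rho] = 1$.

```python
import math

# Toy model: X in {0,1} with Q uniform; Y equals X flipped with probability 0.2.
Q = {0: 0.5, 1: 0.5}

def p(y, x, flip=0.2):
    """Observation likelihood p(y|x) for the toy binary channel."""
    return 1 - flip if y == x else flip

def u_rho(y, x, rho):
    """u_rho from Lemma 3.2: tilted likelihood of y given x, normalized over Q."""
    den = sum(Q[xp] * p(y, xp) ** (1 / (1 + rho)) for xp in Q)
    return p(y, x) ** (1 / (1 + rho)) / den

def second_moment(y, rho):
    """E[u_rho * log^2 u_rho] with the expectation over X ~ Q, for fixed y."""
    return sum(Q[x] * u_rho(y, x, rho) * math.log(u_rho(y, x, rho)) ** 2 for x in Q)

# Upper bound on |E_o''(rho)| from (14): supremum over y of the second moment.
bound = max(second_moment(y, 0.5) for y in (0, 1))
```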

###### Remark 3.3.

The sufficiency bound (13) in Theorem 3.4 reduces to (7) in Theorem 3.1 nearly exactly, given that the model is sufficiently sparse and (12) is satisfied, so that $\tau_N$ can be chosen to scale arbitrarily slowly.

The proofs of Theorems 3.3 and 3.4 are provided in the Appendix. They follow from Lemma 3.1 similarly to the proof of Theorem 3.1, but for Theorem 3.4 we further lower bound the error exponent using a worst-case analysis for $\beta_S$ to reduce it to a single-letter expression, and again perform a Taylor series analysis of the lower bound around $\rho = 0$, from which the worst-case mutual information condition is derived. Second-derivative conditions such as (12) are necessary to control the second-order term in the Taylor series.

## 4 Applications

In this section, we establish results for several problems for which our necessity and sufficiency results are applicable. In the first subsection, we look at linear observation models and derive results for sparse linear regression with measurement noise. Then we consider a multivariate regression model, where we deal with vector-valued variables and outcomes. In the second subsection, we analyze probit regression and group testing as examples of non-linear observation models. Finally, we look at a general framework where some of the variables are not observed, i.e., each variable is missing with some probability. Proofs are provided in the Appendix where necessary.

### 4.1 Linear Settings

#### 4.1.1 Sparse Linear Regression

Using the bounds presented in this paper for general sparse models, we derive sufficient and necessary conditions for the sparse linear regression problem with measurement noise [4] and a Gaussian variable matrix with IID entries.

We consider the following normalized model [15],

$$Y^T = X^T\beta + W^T, \tag{15}$$

where $X^T$ is the $T \times N$ variable matrix, $\beta$ is a $K$-sparse vector of length $N$ with support $S$, $W^T$ is the measurement noise of length $T$ and $Y^T$ is the observation vector of length $T$. In particular, we assume the entries of the matrix $X^T$ are Gaussian random variables, independent across rows $t = 1,\dots,T$ and columns $n = 1,\dots,N$. Each element is zero mean and has variance $\mathrm{SNR}/T$. $W^T$ denotes the observation noise of length $T$; we assume each element is IID zero-mean Gaussian with unit variance. The coefficients of the support, $\beta_S$, are IID zero-mean random variables with variance $\sigma^2$ and $|\beta_n| \ge b_{\min}$ for $n \in S$.

In order to analyze this problem using the proposed sparse signal processing framework, it is important to observe how the regression model defined above relates to the general sparse model. The elements in a row of the matrix $X^T$ correspond to the variables $X_1,\dots,X_N$ as defined in Section 2. Each row of the matrix is a realization of $X$, and the rows are generated independently and identically to form $X^T$. It is easy to see that assumption (1) is satisfied in both models, since each measurement depends only on a linear combination of the elements that correspond to the support of $\beta$. The coefficients of this combination are given by $\beta_S$, the values of the non-zero elements of $\beta$. $\beta_S$ corresponds to the latent parameter of the observation model $P(Y|X_S, \beta_S)$, which also accounts for the noise $W$.

Let $\alpha = i/K$ denote the fraction of misidentified elements of the support $S$, where $i = |S_1|$. For the conditions on recovery and the SNR, we will first show that

$$I(X_{S_1}; Y | X_{S_2}, \beta_S, S) \ge \frac{1}{2}\log\left(1 + \frac{i\,\sigma^2\,\mathrm{SNR}}{T}\right), \qquad I(X_{S_1}; Y | X_{S_2}, \beta_{\min}, S) = \frac{1}{2}\log\left(1 + \frac{i\, b_{\min}^2\,\mathrm{SNR}}{T}\right).$$

We then consider all values of $i = 1,\dots,K$ and combine these bounds with the recovery conditions of Section 3 to state the following theorem.

###### Theorem 4.1.

For sparse linear regression with an IID Gaussian matrix, a lower bound on the SNR is a necessary condition for recovery. Furthermore, at this SNR we obtain a necessary condition and an order-wise matching sufficient condition on the number of observations $T$.
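Because the mutual information expressions above involve $T$ inside the logarithm, the recovery bounds implied by them can be found numerically by a fixed-point iteration. The sketch below is illustrative only: it plugs the displayed expression $\frac{1}{2}\log(1 + i\,\sigma^2\,\mathrm{SNR}/T)$ into the necessity-style ratio and iterates. The iteration converges only when the SNR is large enough, mirroring the necessary SNR condition in the theorem.

```python
import math

def regression_bound(N, K, snr, sigma2=1.0, iters=100):
    """Fixed-point iteration for
        T = max_i log C(N-K+i, i) / (0.5 * log(1 + i*sigma2*snr/T)).
    Illustrative sketch; the iterates grow without bound if the SNR is
    too small for recovery."""
    T = 1.0
    for _ in range(iters):
        T = max(math.log(math.comb(N - K + i, i))
                / (0.5 * math.log1p(i * sigma2 * snr / T))
                for i in range(1, K + 1))
    return T
```

Starting from $T = 1$, the iterates increase monotonically toward the fixed point when one exists, since the right-hand side is increasing in $T$.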

###### Remark 4.1.

For the sparse linear regression problem, we showed that our relatively simple mutual information analysis yields a bound that is asymptotically identical to the best-known bound [15] for an independent Gaussian variable matrix in the sublinear sparsity regime, in addition to providing a necessary condition on the SNR that matches the bound in [15].

Figure 5 illustrates the lower bound on the number of measurements and shows that a necessary condition on the SNR has to be satisfied for recovery. Our necessity result holds for all scalings of $K$ and the SNR, while the sufficiency result holds in the fixed sparsity regime considered in Section 3.1. Although we provided results for exact recovery with random $\beta_S$, it is easy to obtain results for partial recovery, as we remark in Section 3.

Another interesting aspect of our analysis is that, in addition to sample complexity bounds, an upper bound on the probability of error in recovery can be explicitly computed using Lemma 3.1 for any finite triplet $(N, K, T)$. Following this line of analysis, such an upper bound is obtained for sparse linear regression and compared to the empirical performance of practical algorithms such as Lasso [9, 10] in [2]. It is then seen that while certain practical recovery algorithms have provably optimal asymptotic sample complexity, there is still a gap between the information-theoretically attainable recovery performance and the empirical performance of such algorithms. We refer the reader to [2] for details.

#### 4.1.2 Multivariate Regression

In this problem, we consider the following linear model [31], where we have a total of $R$ linear regression problems,

$$Y^T_{\{r\}} = X^T_{\{r\}}\beta_{\{r\}} + W^T_{\{r\}}, \qquad r = 1,\dots,R.$$

For each $r$, $\beta_{\{r\}}$ is a $K$-sparse vector of length $N$. The relation between the different tasks is that $\beta_{\{1\}},\dots,\beta_{\{R\}}$ have the joint support $S$. This set-up is also called multiple linear regression or distributed compressive sensing [32] and is useful in applications such as multi-task learning [33].

It is easy to see that this problem can be formulated in our sparse recovery framework with vector-valued outcomes and variables. Namely, let $Y = (Y_{\{1\}},\dots,Y_{\{R\}})$ be a vector-valued outcome, $X$ be the collection of vector-valued variables and $\beta = (\beta_{\{1\}},\dots,\beta_{\{R\}})$ be the collection of sparse vectors sharing support $S$, making $\beta$ block-sparse. This mapping is illustrated in Figure 6. Assuming independence between the variables and the support coefficients across $r$, we have the following observation model:

$$P(Y|X,S) = p(Y|X_S) = \prod_{r=1}^{R} p(Y_{\{r\}} | X_{\{r\},S}) = \prod_{r=1}^{R} \int_{\mathbb{R}^K} p(Y_{\{r\}} | X_{\{r\},S}, \beta_{\{r\},S})\, p(\beta_{\{r\},S})\, d\beta_{\{r\},S}.$$

We state the following theorem for the specific linear model in Section 4.1.1 and IID variables, as a direct result of Theorem 4.1 and the fact that the joint mutual information decomposes into $R$ identical mutual information terms due to the above equality.

###### Theorem 4.2.

The sample complexity per problem for the linear multivariate regression model above is $T^*/R$, where $T^*$ is the sample complexity in Theorem 4.1.

###### Remark 4.2.

We showed that having $R$ problems with independent measurements and sparse vector coefficients decreases the number of measurements per problem by a factor of $R$. While having $R$ such problems increases the total number of measurements $R$-fold, the inherent uncertainty in the problem is the same since the support is shared; it is then reasonable to expect such a decrease in the number of measurements per problem.
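The decomposition argument behind Theorem 4.2 can be checked numerically: with independent tasks, the joint mutual information is the sum of $R$ identical per-task terms, so the denominator of the recovery bounds grows $R$-fold while the numerator is unchanged. The sketch below uses the Gaussian expression from Section 4.1.1, with illustrative parameter values.

```python
import math

def task_info(i, sigma2, snr, T):
    """Per-task conditional mutual information, 0.5*log(1 + i*sigma2*snr/T)."""
    return 0.5 * math.log1p(i * sigma2 * snr / T)

def joint_info(i, sigma2, snr, T, R):
    """Joint mutual information for R independent tasks sharing the support:
    the product observation model makes it R times the per-task term."""
    return R * task_info(i, sigma2, snr, T)
```

Since the $\log\binom{\cdot}{\cdot}$ numerator in the recovery bounds depends only on $N$ and $K$, dividing it by an $R$-fold larger denominator yields the $1/R$ scaling of the per-problem sample complexity.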

### 4.2 Non-linear Settings

#### 4.2.1 Binary Regression

As an example of a non-linear observation model, we look at the following binary regression problem, also called 1-bit compressive sensing [34, 35, 36] or probit regression. Regression with 1-bit measurements is interesting as the extreme case of regression models with quantized measurements, which are of practical importance in many real-world applications. Conditions on the number of measurements have been studied for both noiseless [35] and noisy [36] models, and sufficient conditions have been established for Gaussian variable matrices.

Following the problem setup of [36], we have

$$Y^T = q(X^T\beta + W^T), \tag{16}$$

where