I Introduction
The support recovery problem consists of determining a sparse subset of a set of variables that is relevant in producing a set of observations, and arises frequently in disciplines such as group testing [1, 2], compressive sensing (CS) [3], and subset selection in regression [4]. The observation models can vary significantly among these disciplines, and it is of considerable interest to consider these in a unified fashion. This can be done via probabilistic models relating the sparse vector to a single observation in the following manner:
(1) 
where represents the set of relevant variables, is a measurement vector, (respectively, ) is the subvector of (respectively, ) containing the entries indexed by , and
is a given probability distribution. Given a collection of measurements
and the corresponding measurement matrix (with each row containing a single measurement vector), the goal is to find the conditions under which the support can be recovered either perfectly or partially. In this paper, we study the information-theoretic limits for this problem, characterizing the number of measurements required in terms of the sparsity level and ambient dimension, regardless of the computational complexity. Such studies are useful for assessing the performance of practical techniques and determining to what extent improvements are possible. Before proceeding, we state some important examples of models that are captured by (1).
Linear Model
The linear model is ubiquitous in signal processing, statistics, and machine learning, and in itself covers an extensive range of applications. Each observation takes the form
(2) 
where denotes the inner product, and is additive noise. An important quantity in this setting is the signal-to-noise ratio (SNR), and in the context of support recovery, the smallest nonzero absolute value in has also been shown to play a key role [5, 7, 8].
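As a concrete illustration, the linear model can be simulated as follows. The Gaussian measurement matrix and the unit-valued nonzero entries are assumptions made for this sketch, not requirements of the general framework.

```python
import numpy as np

rng = np.random.default_rng(0)

p, k, n = 100, 5, 200  # ambient dimension, sparsity level, number of measurements

# Support set: uniform over size-k subsets; nonzero entries fixed to 1 here
S = rng.choice(p, size=k, replace=False)
beta = np.zeros(p)
beta[S] = 1.0

X = rng.standard_normal((n, p))  # i.i.d. Gaussian measurement matrix
Z = rng.standard_normal(n)       # additive Gaussian noise
Y = X @ beta + Z                 # each observation: inner product plus noise

# Empirical SNR: signal power divided by noise power
snr = np.mean((X @ beta) ** 2) / np.mean(Z ** 2)
```

With unit noise variance, the empirical SNR concentrates around the squared norm of the nonzero subvector (here k), illustrating why the SNR and the smallest nonzero entry both matter for recovery.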
Quantized Linear Models
Quantized variants of the linear model are of significant interest in applications with hardware limitations. An example that we will consider in this paper is the 1-bit model [9], given by
(3) 
where the function equals 1 if its argument is nonnegative, and -1 if it is negative.
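A minimal sketch of the 1-bit model, assuming a ±1 output convention for the quantizer and Gaussian noise; these specific distributional choices are illustrative rather than prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(1)

def one_bit_measure(X, beta, noise_std=1.0):
    """Quantized linear measurements: the sign of <X_i, beta> + Z_i,
    returning +1 for a nonnegative argument and -1 otherwise."""
    z = rng.normal(scale=noise_std, size=X.shape[0])
    return np.where(X @ beta + z >= 0, 1, -1)

p, k, n = 50, 3, 120
beta = np.zeros(p)
beta[rng.choice(p, size=k, replace=False)] = 1.0
X = rng.standard_normal((n, p))
Y = one_bit_measure(X, beta)
```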
Group Testing
Studies of group testing problems began several decades ago [10, 11], and have recently regained significant attention [2, 12], with applications including medical testing, database systems, computational biology, and fault detection. The goal is to determine a small number of “defective” items within a larger set of items. The items involved in a single test are indicated by , and each observation takes the form
(4) 
with representing the defective items, indicating whether the test contains at least one defective item, and representing possible noise (here denotes modulo-2 addition). In this setting, one can think of as deterministically having entries equaling one on , and zero on .
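The group testing observation model can be sketched as follows; the Bernoulli(1/2) test design and the flip-probability parameter are illustrative assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(2)

def group_test(x_row, defective, flip_prob=0.0):
    """One test: the indicator that the test includes at least one
    defective item, XOR'd (modulo-2 addition) with a Bernoulli noise bit."""
    positive = int(np.any(x_row[defective] == 1))
    noise = int(rng.random() < flip_prob)
    return positive ^ noise

p, k, n = 40, 2, 30
defective = rng.choice(p, size=k, replace=False)
X = (rng.random((n, p)) < 0.5).astype(int)  # which items enter each test
Y = np.array([group_test(X[i], defective) for i in range(n)])
```

In the noiseless case (flip_prob = 0), each outcome is exactly the OR of the defective items included in that test.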
I-A Previous Work and Contributions
Numerous previous works on the information-theoretic limits of support recovery have focused on the linear model [5, 7, 8, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. The main aim of these works, and of the present paper, is to develop necessary and sufficient conditions under which an “error probability” vanishes asymptotically. However, several distinctions can be made among these works.
Perhaps the most widely studied combination of these is that of minimax characterizations for exact support recovery with random measurement matrices. In this setting, within the class of vectors whose nonzero entries have an absolute value exceeding some threshold , necessary and sufficient conditions on are available with matching scaling laws [7, 8]. See also [23, 24] for information-theoretic studies of the linear model with a mean square error criterion.
Compared to the linear model, research on the informationtheoretic limits of support recovery for nonlinear models is relatively scarce. The system model that we have adopted follows those of a line of works seeking mutual information characterizations of sparsity problems [11, 2, 14, 25], though we make use of significantly different analysis techniques. Similarly to these works, we focus on random measurement matrices and random nonzero entries of . Other works considering nonlinear models have used vastly different approaches such as regularized
[26, 27] and approximate message passing [28]. High-level Contributions: We consider an approach using thresholding techniques akin to those used in information-spectrum methods [29], thus providing a new alternative to previous approaches based on maximum-likelihood decoding and Fano’s inequality. Our key contributions and the advantages of our framework are as follows:

Considering both exact and partial support recovery, we provide non-asymptotic performance bounds applying to general probabilistic models, along with a procedure for applying them to specific models (cf. Section III-B).

We explicitly provide the constant factors in our bounds, allowing for more precise characterizations of the performance compared to works focusing on scaling laws (e.g., see [5, 20, 8]). In several cases, the resulting necessary and sufficient conditions on the number of measurements coincide up to a multiplicative 1+o(1) term, thus providing exact asymptotic thresholds (sometimes referred to as phase transitions [30, 24]) on the number of measurements.

As evidenced in our examples outlined below, our framework often leads to such exact or near-exact thresholds for significantly more general scalings of , SNR, etc. compared to previous works.

The majority of previous works have developed converse results using Fano’s inequality, leading to necessary conditions for the error probability to vanish. In contrast, our converse results provide necessary conditions for the error probability to be bounded away from one. The distinction between these two conditions is important from a practical perspective: an algorithm whose error probability merely fails to vanish may still be of some use, whereas one whose error probability tends to one certainly is not.
Table I. Overview of our results for specific models; the precise sufficient and necessary conditions on the number of measurements are given in Section IV.

  Model               Result   Parameters                       Distributions
  Linear              Cor. 1                                    Discrete, Gaussian
                      Cor. 2   Partial recovery of proportion   Gaussian, Gaussian
                      Cor. 3   Low SNR                          Discrete, Gaussian
  1-bit               Cor. 4   High SNR                         Fixed, Gaussian
                      Cor. 5   Partial recovery of proportion   Gaussian, Gaussian
  Group testing       Cor. 6                                    Fixed, Bernoulli
                      Cor. 7   Noisy (crossover probability )   Fixed, Bernoulli
                      Cor. 8   Partial recovery of proportion   Fixed, Bernoulli
  General discrete    Cor. 9   Arbitrary                        Arbitrary
  observations

(For Cor. 4, the necessary number of measurements is compared to that of the linear model.)
Contributions for Specific Models: An overview of our bounds for specific models is given in Table I, where we state the derived bounds with the asymptotically negligible terms omitted. All of the models and their parameters are defined precisely in Section IV; in particular, the functions and the remainder terms are given explicitly, and are easy to evaluate. We proceed by discussing these contributions in more detail, and comparing them to various existing results in the literature:

(Linear model) In the case of exact recovery, we recover the exact thresholds on the required number of measurements given by Jin et al. [17], as well as handling a broader range of scalings of (see Section IV-A for details) and strengthening the converse by considering the more stringent condition . Our results for partial recovery provide near-matching necessary and sufficient conditions under scalings with , thus complementing the extensive study of the scaling by Reeves and Gastpar [15, 16].

(1-bit model) We provide two surprising observations regarding the 1-bit model: Corollary 3 provides a low-SNR setting where the quantization only increases the asymptotic number of measurements by a constant factor, whereas Corollary 4 provides a high-SNR setting where the scaling law is strictly worse than that of the linear model. Similar behavior will be observed for partial recovery (Corollaries 2 and 5) by numerically comparing the bounds for various SNR values.

(Group testing) Asymptotic thresholds for group testing with were given previously by Malyutov [11] and Atia and Saligrama [2]. However, in the case that , the sufficient conditions of [2] introduced additional logarithmic factors. In contrast, we obtain matching scaling laws for any sublinear scaling of the form (). Moreover, for sufficiently small we obtain exact thresholds. In particular, for the noiseless setting we show that measurements are both necessary and sufficient for . This is in fact the same threshold as that for adaptive group testing [31], thus proving that non-adaptive Bernoulli measurement matrices are asymptotically optimal even when adaptivity is allowed; this was previously known only in the limit as [32]. For the noisy case, we prove an analogous claim for sufficiently small . A shortened and simplified version of this paper focusing exclusively on group testing can be found in [33].
I-B Structure of the Paper
In Section II, we introduce our system model. In Section III, we present our main nonasymptotic achievability and converse results for general observation models, and the procedure for applying them to specific problems. Several applications of our results to specific models are presented in Section IV. The proofs of the general bounds are given in Section V, and conclusions are drawn in Section VI.
I-C Notation
We use uppercase letters for random variables, and lowercase letters for their realizations. A non-bold character may be a scalar or a vector, whereas a bold character refers to a collection of scalars (e.g., ) or vectors (e.g., ). We write to denote the subvector of at the columns indexed by , and to denote the submatrix of containing the columns indexed by . The complement with respect to is denoted by . The symbol means “distributed as”. For a given joint distribution , the corresponding marginal distributions are denoted by and , and similarly for conditional marginals (e.g., ). We write for probabilities, for expectations, and for variances. We use the usual notations for the entropy (e.g., ) and mutual information (e.g., ), and their conditional counterparts (e.g., , ). Note that the same notation may also denote the differential entropy for continuous random variables; the distinction will be clear from the context. We define the binary entropy function , and the Q-function (). We make use of the standard asymptotic notations , , , and . We define the function , and write the floor function as . The function log has base e.
II Problem Setup
II-A Model and Assumptions
Recall that denotes the ambient dimension, denotes the sparsity level, and denotes the number of measurements. We let be the set of subsets of having cardinality . The key random variables in our setup are the support set , the data vector , the measurement matrix , and the observation vector .¹

¹Extensions to more general alphabets are straightforward.
The support set is assumed to be equiprobable on the subsets within . Given , the entries of are deterministically set to zero, and the remaining entries are generated according to some distribution . We assume that these nonzero entries follow the same distribution for all of the possible realizations of , and that this distribution is permutationinvariant.
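Under these assumptions, the generation of the support set and data vector can be sketched as follows; the i.i.d. Gaussian choice for the nonzero entries is just one example of a permutation-invariant distribution.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_support_and_vector(p, k):
    """S uniform over the size-k subsets of {0, ..., p-1}; entries off the
    support are deterministically zero, and the k nonzero entries are drawn
    from a permutation-invariant distribution (here i.i.d. standard normal)."""
    S = set(rng.choice(p, size=k, replace=False).tolist())
    beta = np.zeros(p)
    beta[list(S)] = rng.standard_normal(k)
    return S, beta

S, beta = sample_support_and_vector(p=20, k=4)
```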
The measurement matrix is assumed to have i.i.d. entries drawn from some distribution . We write , to denote the corresponding i.i.d. distributions for matrices, and we write as a shorthand for . Given , , and , each entry of the observation vector is generated in a conditionally independent manner, with the th entry distributed according to
(5) 
for some conditional distribution . We again assume symmetry with respect to , namely, that does not depend on the specific realization, and that the distribution is invariant when the columns of and the entries of undergo a common permutation.
Given and , a decoder forms an estimate of . Similarly to previous works studying informationtheoretic limits on support recovery, we assume that the decoder knows the system model. We consider two related performance measures. In the case of exact support recovery, the error probability is given by
(6) 
and is taken with respect to the realizations of , , , and ; the decoder is assumed to be deterministic. We also consider a less stringent performance criterion requiring that only entries of are successfully recovered, for some . Following [15, 16], the error probability is given by
(7) 
Note that if both and have cardinality with probability one, then the two events in the union are identical, and hence either of the two can be removed.
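The partial-recovery criterion can be expressed as a simple membership count; the strict-inequality convention and the function name here are assumptions of this sketch.

```python
def partial_recovery_error(S, S_hat, k, alpha):
    """Declare an error if more than alpha*k true indices are missed, or
    more than alpha*k spurious indices are included (a union of two events)."""
    missed = len(S - S_hat)      # indices in S but not in the estimate
    spurious = len(S_hat - S)    # indices in the estimate but not in S
    return missed > alpha * k or spurious > alpha * k

# When |S| = |S_hat| = k with probability one, the two counts coincide,
# so either event alone suffices.
```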
For clarity, we formally state our main assumptions as follows:

The support set is uniform on the subsets of of size , and the measurement matrix is i.i.d. on some distribution .

The nonzero entries are distributed according to , and this distribution is permutationinvariant and the same for all realizations of .

The observation vector is conditionally i.i.d. according to , and this distribution is the same for all realizations of , and invariant to common permutations of the columns of and entries of .

The decoder is given , and also knows the system model including , , and .
Our main goal is to derive necessary and sufficient conditions on and (as functions of ) such that or vanishes asymptotically. Moreover, when considering converse results, we will not only be interested in conditions under which , but also conditions under which the stronger statement holds.
In particular, we introduce the terminology that the strong converse holds if there exists a sequence of values , indexed by , such that for all , we have when , and when . This is related to the notion of a phase transition [30, 24]. More generally, we will refer to conditions under which as strong impossibility results, not necessarily requiring matching achievability bounds. That is, the strong converse conclusively gives a sharp threshold between failure and success, whereas a strong impossibility result may not.
It will prove convenient to work with random variables that are implicitly conditioned on a fixed value of , say . We write and in place of and to emphasize that . Moreover, we define the corresponding joint distribution
(8) 
and its multipleobservation counterpart
(9) 
where is the fold product of .
Except where stated otherwise, the random variables and appearing throughout this paper are distributed as
(10)  
(11) 
with the remaining entries of the measurement matrix being distributed as , and with deterministically. That is, we condition on a fixed except where stated otherwise.
For notational convenience, the main parts of our analysis are presented with , and representing probability mass functions (PMFs), and with the corresponding averages written using summations. However, except where stated otherwise, our analysis is directly applicable to the case that these distributions instead represent probability density functions (PDFs), with the summations replaced by integrals where necessary. The same applies to mixed discrete-continuous distributions.
II-B Information-Theoretic Definitions
Before introducing the required definitions for support recovery, it is instructive to discuss thresholding techniques in channel coding studies. These commenced in early works such as [34, 35], and have recently been used extensively in information-spectrum methods [36, 29].
II-B.1 Channel Coding
We first recall the mutual information, which is ubiquitous in information theory:
(12) 
In deriving asymptotic and nonasymptotic performance bounds, it is common to work directly with the logarithm,
(13) 
which is commonly known as the information density. The thresholding techniques work by manipulating probabilities of events of the form and . For the former, one can perform a change of measure from the conditional distribution given to the unconditional distribution of , with a multiplicative constant . For the latter, one can similarly perform a change of measure from to . Hence, in both cases, there is a simple relation between the conditional and unconditional probabilities of the output sequences.
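The change-of-measure step can be checked exactly on a toy discrete channel (assumed here to be a binary symmetric channel with uniform input): under the product of the marginals, the event that the information density is at least γ has probability at most e^(-γ).

```python
import math

# Toy channel: uniform binary input, crossover probability 0.1
px = {0: 0.5, 1: 0.5}
pyx = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}  # P(y | x)
py = {y: sum(px[x] * pyx[(x, y)] for x in px) for y in (0, 1)}

def info_density(x, y):
    """i(x; y) = log P(y | x) / P(y), in nats."""
    return math.log(pyx[(x, y)] / py[y])

gamma = 0.3
# Under P_X x P_Y (independent X and Y), change of measure gives
#   P[i(X;Y) >= gamma] = sum P(x) P(y) 1{i >= gamma}
#                      = sum P(x, y) exp(-i(x, y)) 1{i >= gamma} <= exp(-gamma)
prod_prob = sum(px[x] * py[y]
                for x in px for y in (0, 1)
                if info_density(x, y) >= gamma)
```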
Using these methods, one can get upper and lower bounds on the error probability such that the dominant term is
(14) 
for some . Assuming that has some form of i.i.d. structure, one can analyze this expression using tools from probability theory. The law of large numbers yields the channel capacity , and refined characterizations can be obtained using variations of the central limit theorem [37]. Among the channel coding literature, our analysis is most similar to that of mixed channels [29, Sec. 3.3], where the relation between the input and output sequences is not i.i.d., but instead conditionally i.i.d. given another random variable. In our setting, will play the role of this random variable. See Figure 1 for a depiction of this connection.
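The law-of-large-numbers behavior described above can be illustrated by simulation, assuming a binary symmetric channel with uniform input for concreteness: the empirical average of i.i.d. information densities concentrates at the mutual information.

```python
import math
import random

random.seed(0)

eps = 0.1  # crossover probability of the illustrative channel

def info_density(x, y):
    """log P(y|x) / P(y) in nats; a uniform input gives P(y) = 1/2."""
    p_y_given_x = 1 - eps if y == x else eps
    return math.log(p_y_given_x / 0.5)

n = 20000
total = 0.0
for _ in range(n):
    x = random.getrandbits(1)
    y = x ^ int(random.random() < eps)  # pass x through the channel
    total += info_density(x, y)
empirical = total / n

# The average concentrates at the mutual information, which for this
# channel with uniform input is log 2 - H_b(eps) nats
h_b = -eps * math.log(eps) - (1 - eps) * math.log(1 - eps)
mutual_info = math.log(2) - h_b
```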
II-B.2 Support Recovery
As in [2, 14], we will consider partitions of the support set into two sets and . As will be seen in the proofs, will typically correspond to an overlap between and some other set (i.e., ), whereas will correspond to the indices in one set but not the other (e.g., ). There are ways of performing such a partition with .
For fixed and a corresponding pair , we introduce the notation
(15)  
(16) 
where is the marginal distribution of (9). While the lefthand sides of (15)–(16) represent the same quantities for any such , it will still prove convenient to work with these in place of the righthand sides. In particular, this allows us to introduce the marginal distributions
(17)  
(18) 
where . Using the preceding definitions, we introduce two information densities. The first contains probabilities averaged over ,
(19) 
whereas the second conditions on :
(20) 
where the singleletter information density is
(21) 
As mentioned above, we will generally work with discrete random variables for clarity of exposition, in which case the ratio is between two PMFs. In the case of continuous observations the ratio is instead between two PDFs, and more generally it can be replaced by the Radon-Nikodym derivative as in the channel coding setting [37].
III General Achievability and Converse Bounds
In this section, we provide general results holding for arbitrary models satisfying the assumptions given in Section II. Each of the results for exact recovery has a direct counterpart for partial recovery. For clarity, we focus on the former throughout Sections III-A and III-B, and then proceed with the latter in Section III-C.
III-A Initial Non-Asymptotic Bounds
Here we provide our main non-asymptotic upper and lower bounds on the error probability. These bounds bear a strong resemblance to analogous bounds from the channel coding literature [29]; in each case, the dominant term involves tail probabilities of the information density given in (20). The mean of the information density is the mutual information in (22), which thus arises naturally in the subsequent necessary and sufficient conditions on , upon showing that the deviation from the mean is small with high probability. The procedure for doing this given a specific model will be given in Section III-B.
We start with our achievability result. Here and throughout this section, we make use of the random variables defined in (11).
Theorem 1.
For any constants and , there exists a decoder such that
(24) 
where
(25) 
Proof.
See Section VA. ∎
Remark 1.
The probability in the definition of is not an i.i.d. sum, and the techniques for ensuring that it is small vary between different settings. The following approaches will suffice for all of the applications in this paper:

In the case that is discrete, , and it follows that
(26) Moreover, this can be strengthened by noting from the proof of Theorem 1 that may depend on , and choosing accordingly.

Defining
(27) (28) we have for any that
(29) This follows directly from Chebyshev’s inequality.

Defining
(30) we have for any that
(31) This follows directly from Markov’s inequality.
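As a toy numeric check of the Chebyshev-based approach above: for a sum of n i.i.d. terms, the bound n·Var/δ² vanishes whenever the deviation δ grows faster than √n. The n^(2/3) deviation used here is purely illustrative.

```python
def chebyshev_tail_bound(n, var_single, delta):
    """Chebyshev: P(|sum of n i.i.d. terms - its mean| >= delta) is at
    most n * var_single / delta**2 (capped at the trivial bound of 1)."""
    return min(1.0, n * var_single / delta ** 2)

# With delta growing like n^(2/3), the bound decays like n^(-1/3)
bounds = [chebyshev_tail_bound(n, var_single=1.0, delta=n ** (2 / 3))
          for n in (10, 100, 1000, 10000)]
```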
The proof of Theorem 1 is based on a decoder that searches for a unique support set such that
(32) 
for some and all partitions of with . Since the numerator in (19) is the likelihood of given , this decoder can be thought of as a weakened version of the maximum-likelihood (ML) decoder. Like the ML decoder, its direct implementation is computationally intractable.
The following theorem provides a general non-asymptotic converse bound.
Theorem 2.
Fix , and let be an arbitrary partition of (with ) depending on . For any decoder, we have
(33) 
Proof.
See Section VB. ∎
III-B Techniques for Applying Theorems 1 and 2
The bounds presented in the preceding theorems do not directly reveal the number of measurements required to achieve a vanishing error probability. In this subsection, we present the steps that can be used to obtain such conditions. We provide examples in Section IV.
The idea is to use a concentration inequality to bound the first term in (24) (or (33)), which is possible due to the fact that each summation is conditionally i.i.d. given . We proceed by providing the details of these steps separately for the achievability and converse. We start with the former.

Observe that, conditioned on , the mean of is , where is defined in (22).

Fix , and suppose that for a fixed value of , we have for all that
(34) and
(35) for some functions (e.g., these may arise from Chebyshev’s inequality or Bernstein’s inequality [38, Ch. 2]). Combining these conditions with the union bound, we obtain
(36) 
Observe that the condition in (34) can be written as
(37)
We summarize the preceding findings in the following.
Theorem 3.
For any constants , and , and functions (), define the set
(38) 
Then we have
(39) 
Remark 2.
The preceding arguments remain unchanged when also depends on . We leave this possible dependence implicit throughout this section, since a fixed value will suffice for all but one of the models considered in Section IV.
In the case that (35) holds for all (or more generally, within a set whose probability under tends to one) and the final three terms in (39) vanish, the overall upper bound approaches the probability, with respect to , that (37) fails to hold. In many cases, the second logarithm in the numerator therein is dominated by the first. It should be noted that the condition that the second term in (39) vanishes can also impose conditions on . For most of the examples presented in Section IV, the condition in (37) will be the dominant one; however, this need not always be the case, and it depends on the concentration inequality used in (35).
The application of Theorem 2 is done using similar steps, so we provide less detail. Fix , and suppose that, for a fixed value of , the pair is such that
(40) 
and
(41) 
for some function . Combining these conditions, we see that the first probability in (33), with an added conditioning on , is lower bounded by . In the case that is defined for multiple values corresponding to different values of , we can further lower bound this by .
Next, we observe that (40) holds if and only if
(42) 
Recalling that the partition is an arbitrary function of , we can ensure that this coincides with
(43) 
by choosing each pair as a function of to achieve this maximum.
Finally, we note that the maximum over in the abovederived term may be restricted to any set provided that is constrained similarly in (43); one simply chooses the partition so that always lies in this set. Putting everything together, we have the following.
Theorem 4.
For any set , constants and , and functions (), define the set