Limits on Support Recovery with Probabilistic Models: An Information-Theoretic Framework

January 29, 2015 ∙ Jonathan Scarlett et al. ∙ EPFL

The support recovery problem consists of determining a sparse subset of a set of variables that is relevant in generating a set of observations, and arises in a diverse range of settings such as compressive sensing, subset selection in regression, and group testing. In this paper, we take a unified approach to support recovery problems, considering general probabilistic models relating a sparse data vector to an observation vector. We study the information-theoretic limits of both exact and partial support recovery, taking a novel approach motivated by thresholding techniques in channel coding. We provide general achievability and converse bounds characterizing the trade-off between the error probability and number of measurements, and we specialize these to the linear, 1-bit, and group testing models. In several cases, our bounds not only provide matching scaling laws in the necessary and sufficient number of measurements, but also sharp thresholds with matching constant factors. Our approach has several advantages over previous approaches: For the achievability part, we obtain sharp thresholds under broader scalings of the sparsity level and other parameters (e.g., signal-to-noise ratio) compared to several previous works, and for the converse part, we not only provide conditions under which the error probability fails to vanish, but also conditions under which it tends to one.


I Introduction

The support recovery problem consists of determining a sparse subset of a set of variables that is relevant in producing a set of observations, and arises frequently in disciplines such as group testing [1, 2], compressive sensing (CS) [3], and subset selection in regression [4]. The observation models can vary significantly among these disciplines, and it is of considerable interest to consider these in a unified fashion. This can be done via probabilistic models relating the sparse vector $\beta \in \mathbb{R}^p$ to a single observation $Y$ in the following manner:

$$(Y \mid X, \beta) \sim P_{Y|X_S\beta_S}(\,\cdot \mid X_S, \beta_S), \qquad (1)$$

where $S$ represents the set of relevant variables, $X \in \mathbb{R}^p$ is a measurement vector, $X_S$ (respectively, $\beta_S$) is the subvector of $X$ (respectively, $\beta$) containing the entries indexed by $S$, and $P_{Y|X_S\beta_S}$ is a given probability distribution. Given a collection of $n$ measurements $\mathbf{Y} = (Y^{(1)}, \dotsc, Y^{(n)})$ and the corresponding measurement matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ (with each row containing a single measurement vector), the goal is to find the conditions under which the support $S$ can be recovered either perfectly or partially. In this paper, we study the information-theoretic limits for this problem, characterizing the number of measurements $n$ required in terms of the sparsity level $k$ and ambient dimension $p$, regardless of the computational complexity. Such studies are useful for assessing the performance of practical techniques and determining to what extent improvements are possible.

Before proceeding, we state some important examples of models that are captured by (1).

Linear Model

The linear model [5, 6] is ubiquitous in signal processing, statistics, and machine learning, and in itself covers an extensive range of applications. Each observation takes the form

$$Y = \langle X, \beta \rangle + Z, \qquad (2)$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product, and $Z$ is additive noise. An important quantity in this setting is the signal-to-noise ratio (SNR), and in the context of support recovery, the smallest non-zero absolute value in $\beta$ has also been shown to play a key role [5, 7, 8].

Quantized Linear Models

Quantized variants of the linear model are of significant interest in applications with hardware limitations. An example that we will consider in this paper is the 1-bit model [9], given by

$$Y = \mathrm{sign}\big(\langle X, \beta \rangle + Z\big), \qquad (3)$$

where the function $\mathrm{sign}(\cdot)$ equals $1$ if its argument is non-negative, and $-1$ if it is negative.

Group Testing

Studies of group testing problems began several decades ago [10, 11], and have recently regained significant attention [2, 12], with applications including medical testing, database systems, computational biology, and fault detection. The goal is to determine a small number of “defective” items within a larger subset of items. The items involved in a single test are indicated by $X \in \{0,1\}^p$, and each observation takes the form

$$Y = \Big(\bigvee_{i \in S} X_i\Big) \oplus Z, \qquad (4)$$

with $S$ representing the defective items, $\bigvee_{i \in S} X_i$ indicating whether the test contains at least one defective item, and $Z \in \{0,1\}$ representing possible noise (here $\oplus$ denotes modulo-2 addition). In this setting, one can think of $\beta$ as deterministically having entries equaling one on $S$, and zero on $S^c$.

The above examples highlight that (1) captures both discrete and continuous models. Beyond these examples, several other non-linear models are captured by (1), including the logistic, Poisson, and gamma models.
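As a concrete illustration (ours, not from the paper), the following Python sketch generates observations under each of the three models above; the Gaussian and Bernoulli design choices and all parameter values are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 100, 5, 200                      # ambient dimension, sparsity, measurements
S = rng.choice(p, size=k, replace=False)   # support set, uniform over size-k subsets

# Linear and 1-bit models (2)-(3): Gaussian measurements and additive Gaussian noise.
beta = np.zeros(p)
beta[S] = 1.0                              # illustrative non-zero values
X = rng.standard_normal((n, p))            # i.i.d. N(0, 1) measurement matrix
Z = rng.standard_normal(n)                 # additive noise
Y_linear = X @ beta + Z                    # model (2)
Y_1bit = np.where(X @ beta + Z >= 0, 1, -1)  # model (3): +1 if non-negative, -1 otherwise

# Group testing (4), noiseless case: Bernoulli test-inclusion matrix.
q = 1.0 / k                                # illustrative inclusion probability
X_gt = (rng.random((n, p)) < q).astype(int)
Y_gt = X_gt[:, S].any(axis=1).astype(int)  # OR over the defective items

print(Y_linear[:3], Y_1bit[:3], Y_gt[:3])
```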

I-A Previous Work and Contributions

Numerous previous works on the information-theoretic limits of support recovery have focused on the linear model [5, 7, 8, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. The main aim of these works, and of the present paper, is to develop necessary and sufficient conditions under which an “error probability” vanishes as $p \to \infty$. However, there are several distinctions that can be made, including:

  • Random measurement matrices [5, 7, 13, 8] vs. arbitrary measurement matrices [18, 16, 19];

  • Exact support recovery [5, 7, 13, 8] vs. partial support recovery [15, 16, 20];

  • Minimax characterizations for $\beta$ in a given class [5, 7, 13, 8] vs. average performance bounds for random $\beta$ [14, 21, 16].

Perhaps the most widely-studied combination of these is that of minimax characterizations for exact support recovery with random measurement matrices. In this setting, within the class of vectors whose non-zero entries have an absolute value exceeding some threshold $\beta_{\min}$, necessary and sufficient conditions on $n$ are available with matching scaling laws [7, 8]. See also [23, 24] for information-theoretic studies of the linear model with a mean square error criterion.

Compared to the linear model, research on the information-theoretic limits of support recovery for non-linear models is relatively scarce. The system model that we have adopted follows those of a line of works seeking mutual information characterizations of sparsity problems [11, 2, 14, 25], though we make use of significantly different analysis techniques. Similarly to these works, we focus on random measurement matrices and random non-zero entries of $\beta$. Other works considering non-linear models have used vastly different approaches such as regularized $M$-estimators [26, 27] and approximate message passing [28].

High-level Contributions: We consider an approach using thresholding techniques akin to those used in information-spectrum methods [29], thus providing a new alternative to previous approaches based on maximum-likelihood decoding and Fano’s inequality. Our key contributions and the advantages of our framework are as follows:

  1. Considering both exact and partial support recovery, we provide non-asymptotic performance bounds applying to general probabilistic models, along with a procedure for applying them to specific models (cf. Section III-B).

  2. We explicitly provide the constant factors in our bounds, allowing for more precise characterizations of the performance compared to works focusing on scaling laws (e.g., see [5, 20, 8]). In several cases, the resulting necessary and sufficient conditions on the number of measurements coincide up to a multiplicative $1+o(1)$ term, thus providing exact asymptotic thresholds (sometimes referred to as phase transitions [30, 24]) on the number of measurements.

  3. As evidenced in our examples outlined below, our framework often leads to such exact or near-exact thresholds for significantly more general scalings of the sparsity level, SNR, and other parameters compared to previous works.

  4. The majority of previous works have developed converse results using Fano’s inequality, leading to necessary conditions under which the error probability fails to vanish. In contrast, our converse results provide necessary conditions under which the error probability tends to one. The distinction between these two conditions is important from a practical perspective: a condition under which the error probability merely remains bounded away from zero need not be practically significant, whereas a condition under which it tends to one inarguably is.

Model                         | Result | Parameters                          | Distribution of β_S | Distribution of X
Linear                        | Cor. 1 | -                                   | Discrete            | Gaussian
Linear                        | Cor. 2 | Partial recovery (fixed proportion) | Gaussian            | Gaussian
Linear                        | Cor. 3 | Low SNR                             | Discrete            | Gaussian
1-bit                         | Cor. 4 | High SNR                            | Fixed               | Gaussian
1-bit                         | Cor. 5 | Partial recovery (fixed proportion) | Gaussian            | Gaussian
Group testing                 | Cor. 6 | -                                   | Fixed               | Bernoulli
Group testing                 | Cor. 7 | Noisy (fixed crossover probability) | Fixed               | Bernoulli
Group testing                 | Cor. 8 | Partial recovery (fixed proportion) | Fixed               | Bernoulli
General discrete observations | Cor. 9 | Arbitrary                           | Arbitrary           | -
Table I: Overview of main results for exact or partial support recovery under various observation models. The corresponding sufficient and necessary numbers of measurements are stated in the respective corollaries, with asymptotically negligible terms omitted. All quantities are defined precisely in Section IV.

Contributions for Specific Models: An overview of our bounds for specific models is given in Table I, where we state the derived bounds with the asymptotically negligible terms omitted. All of the models and their parameters are defined precisely in Section IV; in particular, the functions and the remainder terms are given explicitly, and are easy to evaluate. We proceed by discussing these contributions in more detail, and comparing them to various existing results in the literature:

  1. (Linear model) In the case of exact recovery, we recover the exact thresholds on the required number of measurements given by Jin et al. [17], as well as handling a broader range of scalings of the sparsity level $k$ (see Section IV-A for details) and strengthening the converse by providing conditions under which the error probability tends to one, rather than merely failing to vanish. Our results for partial recovery provide near-matching necessary and sufficient conditions under sublinear scalings $k = o(p)$, thus complementing the extensive study of the linear scaling $k = \Theta(p)$ by Reeves and Gastpar [15, 16].

  2. (1-bit model) We provide two surprising observations regarding the 1-bit model: Corollary 3 provides a low-SNR setting where the quantization increases the asymptotic number of measurements by only a constant factor, whereas Corollary 4 provides a high-SNR setting where the scaling law is strictly worse than that of the linear model. Similar behavior will be observed for partial recovery (Corollaries 2 and 5) by numerically comparing the bounds for various SNR values.

  3. (Group testing) Asymptotic thresholds for group testing with $k = O(1)$ were given previously by Malyutov [11] and Atia and Saligrama [2]. However, for the case that $k$ grows with $p$, the sufficient conditions of [2] introduced additional logarithmic factors. In contrast, we obtain matching scaling laws for any sublinear scaling of the form $k = \Theta(p^{\theta})$ ($\theta \in (0,1)$). Moreover, for sufficiently small $\theta$ we obtain exact thresholds. In particular, for the noiseless setting we show that $k \log_2\frac{p}{k}$ measurements are both necessary and sufficient in this regime. This is in fact the same threshold as that for adaptive group testing [31], thus proving that non-adaptive Bernoulli measurement matrices are asymptotically optimal even when adaptivity is allowed; this was previously known only for more restrictive (sparser) scaling regimes [32]. For the noisy case, we prove an analogous claim for sufficiently small $\theta$. A shortened and simplified version of this paper focusing exclusively on group testing can be found in [33].

  4. (General discrete observations) Our converse for the case of general discrete observations (Corollary 9) recovers that of Tan and Atia [25] for the case that $\beta_S$ is fixed, strengthens it due to a smaller remainder term, and provides a generalization to the case that $\beta_S$ is random.

I-B Structure of the Paper

In Section II, we introduce our system model. In Section III, we present our main non-asymptotic achievability and converse results for general observation models, and the procedure for applying them to specific problems. Several applications of our results to specific models are presented in Section IV. The proofs of the general bounds are given in Section V, and conclusions are drawn in Section VI.

I-C Notation

We use upper-case letters for random variables, and lower-case letters for their realizations. A non-bold character may be a scalar or a vector, whereas a bold character refers to a collection of $n$ scalars (e.g., $\mathbf{Y} = (Y^{(1)}, \dotsc, Y^{(n)})$) or vectors (e.g., $\mathbf{X} = (X^{(1)}, \dotsc, X^{(n)})$). We write $X_S$ to denote the subvector of $X$ at the columns indexed by $S$, and $\mathbf{X}_S$ to denote the submatrix of $\mathbf{X}$ containing the columns indexed by $S$. The complement with respect to $\{1, \dotsc, p\}$ is denoted by $(\cdot)^c$.

The symbol $\sim$ means “distributed as”. For a given joint distribution $P_{XY}$, the corresponding marginal distributions are denoted by $P_X$ and $P_Y$, and similarly for conditional marginals (e.g., $P_{X|Y}$). We write $\mathbb{P}[\cdot]$ for probabilities, $\mathbb{E}[\cdot]$ for expectations, and $\mathrm{Var}[\cdot]$ for variances. We use the usual notations for the entropy (e.g., $H(X)$) and mutual information (e.g., $I(X;Y)$), and their conditional counterparts (e.g., $H(X|Y)$, $I(X;Y|Z)$). Note that $H(\cdot)$ may also denote the differential entropy for continuous random variables; the distinction will be clear from the context. We define the binary entropy function $H_2(\lambda) := -\lambda\log\lambda - (1-\lambda)\log(1-\lambda)$, and the Q-function $Q(x) := \mathbb{P}[Z > x]$ for $Z \sim N(0,1)$.

We make use of the standard asymptotic notations $O(\cdot)$, $o(\cdot)$, $\Theta(\cdot)$, and $\Omega(\cdot)$. We define the function $[\alpha]^+ := \max\{0, \alpha\}$, and write the floor function as $\lfloor\cdot\rfloor$. The function $\log$ has base $e$.
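For concreteness, the two functions just defined can be evaluated as follows (a minimal sketch of ours; natural logarithms are assumed, consistent with the convention above).

```python
import math

def binary_entropy(lam: float) -> float:
    """H2(lam) = -lam*log(lam) - (1-lam)*log(1-lam), in nats."""
    if lam in (0.0, 1.0):
        return 0.0
    return -lam * math.log(lam) - (1 - lam) * math.log(1 - lam)

def q_function(x: float) -> float:
    """Q(x) = P[N(0,1) > x], expressed via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

print(binary_entropy(0.5), q_function(0.0))  # log 2 ≈ 0.693 and 0.5
```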

II Problem Setup

II-A Model and Assumptions

Recall that $p$ denotes the ambient dimension, $k$ denotes the sparsity level, and $n$ denotes the number of measurements. We let $\mathcal{S}$ be the set of subsets of $\{1, \dotsc, p\}$ having cardinality $k$. The key random variables in our setup are the support set $S \in \mathcal{S}$, the data vector $\beta \in \mathbb{R}^p$, the measurement matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, and the observation vector $\mathbf{Y} \in \mathbb{R}^n$ (extensions to more general alphabets beyond the reals are straightforward).

The support set $S$ is assumed to be equiprobable on the $\binom{p}{k}$ subsets within $\mathcal{S}$. Given $S$, the entries of $\beta_{S^c}$ are deterministically set to zero, and the remaining entries $\beta_S$ are generated according to some distribution $P_{\beta_S}$. We assume that these non-zero entries follow the same distribution for all of the possible realizations of $S$, and that this distribution is permutation-invariant.

The measurement matrix $\mathbf{X}$ is assumed to have i.i.d. entries drawn from some distribution $P_X$. We write $P_{\mathbf{X}}$ to denote the corresponding i.i.d. distribution for matrices, and we write $P_{\mathbf{X}_S}$ as a shorthand for the resulting distribution of the submatrix $\mathbf{X}_S$. Given $S$, $\beta_S$, and $\mathbf{X}$, each entry of the observation vector $\mathbf{Y}$ is generated in a conditionally independent manner, with the $i$-th entry $Y^{(i)}$ distributed according to

$$\big(Y^{(i)} \mid \mathbf{X}, \beta, S\big) \sim P_{Y|X_S\beta_S}\big(\,\cdot \mid X^{(i)}_S, \beta_S\big), \qquad (5)$$

for some conditional distribution $P_{Y|X_S\beta_S}$. We again assume symmetry with respect to $S$, namely, that $P_{Y|X_S\beta_S}$ does not depend on the specific realization of $S$, and that the distribution is invariant when the columns of $\mathbf{X}_S$ and the entries of $\beta_S$ undergo a common permutation.

Given $\mathbf{X}$ and $\mathbf{Y}$, a decoder forms an estimate $\hat{S}$ of $S$. Similarly to previous works studying information-theoretic limits on support recovery, we assume that the decoder knows the system model. We consider two related performance measures. In the case of exact support recovery, the error probability is given by

$$\mathrm{P}_{\mathrm{e}} := \mathbb{P}\big[\hat{S} \ne S\big], \qquad (6)$$

and is taken with respect to the realizations of $S$, $\beta$, $\mathbf{X}$, and $\mathbf{Y}$; the decoder is assumed to be deterministic. We also consider a less stringent performance criterion requiring that only a sufficiently large fraction of the entries of $S$ are successfully recovered, allowing up to $d_{\max}$ errors for some $d_{\max} \in \{0, \dotsc, k-1\}$. Following [15, 16], the error probability is given by

$$\mathrm{P}_{\mathrm{e}}(d_{\max}) := \mathbb{P}\big[\{|S \setminus \hat{S}| > d_{\max}\} \cup \{|\hat{S} \setminus S| > d_{\max}\}\big]. \qquad (7)$$

Note that if both $S$ and $\hat{S}$ have cardinality $k$ with probability one, then the two events in the union are identical, and hence either of the two can be removed.
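To make the two criteria concrete, the following sketch (ours, with d_max denoting the allowed number of errors, consistent with the reconstruction of (7) above) evaluates the corresponding error events for given supports.

```python
def exact_error(S_true: set, S_hat: set) -> bool:
    """Error event for exact support recovery, as in (6)."""
    return S_hat != S_true

def partial_error(S_true: set, S_hat: set, d_max: int) -> bool:
    """Error event for partial recovery, as in (7): too many missed or spurious indices."""
    missed = len(S_true - S_hat)     # |S \ S_hat|
    spurious = len(S_hat - S_true)   # |S_hat \ S|
    return missed > d_max or spurious > d_max

# If |S| = |S_hat| = k, then missed == spurious, so either event alone suffices.
S, S_hat = {1, 2, 3, 4}, {1, 2, 3, 9}
print(exact_error(S, S_hat), partial_error(S, S_hat, d_max=1))  # True, False
```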

For clarity, we formally state our main assumptions as follows:

  1. The support set $S$ is uniform on the subsets of $\{1, \dotsc, p\}$ of size $k$, and the measurement matrix $\mathbf{X}$ is i.i.d. on some distribution $P_X$.

  2. The non-zero entries $\beta_S$ are distributed according to $P_{\beta_S}$, and this distribution is permutation-invariant and the same for all realizations of $S$.

  3. The observation vector $\mathbf{Y}$ is conditionally i.i.d. according to $P_{Y|X_S\beta_S}$, and this distribution is the same for all realizations of $S$, and invariant to common permutations of the columns of $\mathbf{X}_S$ and the entries of $\beta_S$.

  4. The decoder is given $\mathbf{X}$ and $\mathbf{Y}$, and also knows the system model, including $P_{\beta_S}$, $P_X$, and $P_{Y|X_S\beta_S}$.

Our main goal is to derive necessary and sufficient conditions on $n$ and $k$ (as functions of $p$) such that $\mathrm{P}_{\mathrm{e}}$ or $\mathrm{P}_{\mathrm{e}}(d_{\max})$ vanishes as $p \to \infty$. Moreover, when considering converse results, we will not only be interested in conditions under which $\mathrm{P}_{\mathrm{e}} \not\to 0$, but also conditions under which the stronger statement $\mathrm{P}_{\mathrm{e}} \to 1$ holds.

In particular, we introduce the terminology that the strong converse holds if there exists a sequence of values $n^*$, indexed by $p$, such that for all $\delta > 0$, we have $\mathrm{P}_{\mathrm{e}} \to 0$ when $n \ge n^*(1+\delta)$, and $\mathrm{P}_{\mathrm{e}} \to 1$ when $n \le n^*(1-\delta)$. This is related to the notion of a phase transition [30, 24]. More generally, we will refer to conditions under which $\mathrm{P}_{\mathrm{e}} \to 1$ as strong impossibility results, not necessarily requiring matching achievability bounds. That is, the strong converse conclusively gives a sharp threshold between failure and success, whereas a strong impossibility result may not.

It will prove convenient to work with random variables that are implicitly conditioned on a fixed value of $S$, say $s = \{1, \dotsc, k\}$. We write $\beta_s$ and $X_s$ in place of $\beta_S$ and $X_S$ to emphasize that $S = s$. Moreover, we define the corresponding joint distribution

$$P_{\beta_s X_s Y}(b_s, x_s, y) := P_{\beta_s}(b_s)\, P_{X_s}(x_s)\, P_{Y|X_s\beta_s}(y \mid x_s, b_s), \qquad (8)$$

and its multiple-observation counterpart

$$P_{\beta_s \mathbf{X}_s \mathbf{Y}}(b_s, \mathbf{x}_s, \mathbf{y}) := P_{\beta_s}(b_s)\, P_{\mathbf{X}_s}(\mathbf{x}_s)\, P^n_{\mathbf{Y}|\mathbf{X}_s\beta_s}(\mathbf{y} \mid \mathbf{x}_s, b_s), \qquad (9)$$

where $P^n_{\mathbf{Y}|\mathbf{X}_s\beta_s}$ is the $n$-fold product of $P_{Y|X_s\beta_s}$.

Except where stated otherwise, the random variables $(\beta_s, X_s, Y)$ and $(\beta_s, \mathbf{X}_s, \mathbf{Y})$ appearing throughout this paper are distributed as

$$(\beta_s, X_s, Y) \sim P_{\beta_s X_s Y}, \qquad (10)$$
$$(\beta_s, \mathbf{X}_s, \mathbf{Y}) \sim P_{\beta_s \mathbf{X}_s \mathbf{Y}}, \qquad (11)$$

with the remaining entries of the measurement matrix being i.i.d. on $P_X$, and with $\beta_{s^c} = 0$ deterministically. That is, we condition on a fixed $S = s$ except where stated otherwise.

For notational convenience, the main parts of our analysis are presented with $P_{\beta_s}$, $P_X$, and $P_{Y|X_s\beta_s}$ representing probability mass functions (PMFs), and with the corresponding averages written using summations. However, except where stated otherwise, our analysis is directly applicable to the case that these distributions instead represent probability density functions (PDFs), with the summations replaced by integrals where necessary. The same applies to mixed discrete-continuous distributions.

II-B Information-Theoretic Definitions

Before introducing the required definitions for support recovery, it is instructive to discuss thresholding techniques in channel coding studies. These commenced in early works such as [34, 35], and have recently been used extensively in information-spectrum methods [36, 29].

II-B1 Channel Coding

We first recall the mutual information, which is ubiquitous in information theory:

$$I(X;Y) = \mathbb{E}\bigg[\log\frac{P_{Y|X}(Y \mid X)}{P_Y(Y)}\bigg]. \qquad (12)$$

In deriving asymptotic and non-asymptotic performance bounds, it is common to work directly with the logarithm,

$$\imath(x;y) := \log\frac{P_{Y|X}(y \mid x)}{P_Y(y)}, \qquad (13)$$

which is commonly known as the information density. The thresholding techniques work by manipulating probabilities of events of the form $\{\imath(X;Y) > \gamma\}$ and $\{\imath(X;Y) \le \gamma\}$. For the former, one can perform a change of measure from the conditional distribution of $Y$ given $X$ to the unconditional distribution of $Y$, with a multiplicative constant $e^{\gamma}$. For the latter, one can similarly perform a change of measure from the unconditional distribution to the conditional distribution. Hence, in both cases, there is a simple relation between the conditional and unconditional probabilities of the output sequences.

Using these methods, one can get upper and lower bounds on the error probability such that the dominant term is

$$\mathbb{P}\bigg[\sum_{i=1}^{n} \imath\big(X^{(i)}; Y^{(i)}\big) \le \gamma\bigg] \qquad (14)$$

for some threshold $\gamma$. Assuming that the input-output sequence has some form of i.i.d. structure, one can analyze this expression using tools from probability theory. The law of large numbers yields the channel capacity $C = \max_{P_X} I(X;Y)$, and refined characterizations can be obtained using variations of the central limit theorem [37].
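The following sketch (ours, not from the paper) illustrates this concentration for a binary symmetric channel with crossover probability delta and a uniform input: the empirical average of the information density (13) approaches the mutual information, which here equals the capacity $\log 2 - H_2(\delta)$ in nats.

```python
import numpy as np

def bsc_information_density(x, y, delta):
    """i(x;y) = log P(y|x) - log P(y) for a BSC(delta) with uniform input, so P(y) = 1/2."""
    p_y_given_x = np.where(x == y, 1 - delta, delta)
    return np.log(p_y_given_x) - np.log(0.5)

rng = np.random.default_rng(1)
n, delta = 100_000, 0.11
x = rng.integers(0, 2, size=n)
y = x ^ (rng.random(n) < delta).astype(int)   # pass x through the BSC
densities = bsc_information_density(x, y, delta)

capacity = np.log(2) + delta * np.log(delta) + (1 - delta) * np.log(1 - delta)
print(densities.mean(), capacity)  # both ≈ 0.35 nats; the sum in (14) is ≈ n * capacity
```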

Among the channel coding literature, our analysis is most similar to that of mixed channels [29, Sec. 3.3], where the relation between the input and output sequences is not i.i.d., but instead conditionally i.i.d. given another random variable. In our setting, $\beta_s$ will play the role of this random variable. See Figure 1 for a depiction of this connection.

Figure 1: Connection between support recovery and coding over a mixed channel.

II-B2 Support Recovery

As in [2, 14], we will consider partitions of the support set $s$ into two sets $s_{\mathrm{dif}}$ and $s_{\mathrm{eq}}$. As will be seen in the proofs, $s_{\mathrm{eq}}$ will typically correspond to an overlap between $s$ and some other set $\bar{s}$ (i.e., $s \cap \bar{s}$), whereas $s_{\mathrm{dif}}$ will correspond to the indices in one set but not the other (e.g., $s \setminus \bar{s}$). There are $\binom{k}{\ell}$ ways of performing such a partition with $|s_{\mathrm{dif}}| = \ell$.

For fixed $\ell$ and a corresponding pair $(s_{\mathrm{dif}}, s_{\mathrm{eq}})$, we introduce the notation

$$P_{Y|X_{s_{\mathrm{dif}}} X_{s_{\mathrm{eq}}} \beta_s}(y \mid x_{s_{\mathrm{dif}}}, x_{s_{\mathrm{eq}}}, b_s) := P_{Y|X_s\beta_s}(y \mid x_s, b_s), \qquad (15)$$
$$P^n_{\mathbf{Y}|\mathbf{X}_{s_{\mathrm{dif}}} \mathbf{X}_{s_{\mathrm{eq}}} \beta_s}(\mathbf{y} \mid \mathbf{x}_{s_{\mathrm{dif}}}, \mathbf{x}_{s_{\mathrm{eq}}}, b_s) := P^n_{\mathbf{Y}|\mathbf{X}_s\beta_s}(\mathbf{y} \mid \mathbf{x}_s, b_s), \qquad (16)$$

where $P^n_{\mathbf{Y}|\mathbf{X}_s\beta_s}$ is the marginal distribution of (9). While the left-hand sides of (15)–(16) represent the same quantities for any such $(s_{\mathrm{dif}}, s_{\mathrm{eq}})$, it will still prove convenient to work with these in place of the right-hand sides. In particular, this allows us to introduce the marginal distributions

$$P_{Y|X_{s_{\mathrm{eq}}}\beta_s}(y \mid x_{s_{\mathrm{eq}}}, b_s) := \sum_{x_{s_{\mathrm{dif}}}} P_{X_{s_{\mathrm{dif}}}}(x_{s_{\mathrm{dif}}})\, P_{Y|X_{s_{\mathrm{dif}}} X_{s_{\mathrm{eq}}} \beta_s}(y \mid x_{s_{\mathrm{dif}}}, x_{s_{\mathrm{eq}}}, b_s), \qquad (17)$$
$$P^n_{\mathbf{Y}|\mathbf{X}_{s_{\mathrm{eq}}}\beta_s}(\mathbf{y} \mid \mathbf{x}_{s_{\mathrm{eq}}}, b_s) := \prod_{i=1}^{n} P_{Y|X_{s_{\mathrm{eq}}}\beta_s}\big(y^{(i)} \mid x^{(i)}_{s_{\mathrm{eq}}}, b_s\big), \qquad (18)$$

where $P_{X_{s_{\mathrm{dif}}}}$ denotes the i.i.d. distribution of the entries indexed by $s_{\mathrm{dif}}$. Using the preceding definitions, we introduce two information densities. The first contains probabilities averaged over $\beta_s$,

$$\tilde{\imath}^n\big(\mathbf{x}_{s_{\mathrm{dif}}}; \mathbf{y} \mid \mathbf{x}_{s_{\mathrm{eq}}}\big) := \log \frac{P^n_{\mathbf{Y}|\mathbf{X}_{s_{\mathrm{dif}}} \mathbf{X}_{s_{\mathrm{eq}}}}(\mathbf{y} \mid \mathbf{x}_{s_{\mathrm{dif}}}, \mathbf{x}_{s_{\mathrm{eq}}})}{P^n_{\mathbf{Y}|\mathbf{X}_{s_{\mathrm{eq}}}}(\mathbf{y} \mid \mathbf{x}_{s_{\mathrm{eq}}})}, \qquad (19)$$

whereas the second conditions on $\beta_s$:

$$\imath^n\big(\mathbf{x}_{s_{\mathrm{dif}}}; \mathbf{y} \mid \mathbf{x}_{s_{\mathrm{eq}}}, b_s\big) := \sum_{i=1}^{n} \imath\big(x^{(i)}_{s_{\mathrm{dif}}}; y^{(i)} \mid x^{(i)}_{s_{\mathrm{eq}}}, b_s\big), \qquad (20)$$

where the single-letter information density is

$$\imath\big(x_{s_{\mathrm{dif}}}; y \mid x_{s_{\mathrm{eq}}}, b_s\big) := \log \frac{P_{Y|X_{s_{\mathrm{dif}}} X_{s_{\mathrm{eq}}} \beta_s}(y \mid x_{s_{\mathrm{dif}}}, x_{s_{\mathrm{eq}}}, b_s)}{P_{Y|X_{s_{\mathrm{eq}}}\beta_s}(y \mid x_{s_{\mathrm{eq}}}, b_s)}. \qquad (21)$$

As mentioned above, we will generally work with discrete random variables for clarity of exposition, in which case the ratio is between two PMFs. In the case of continuous observations the ratio is instead between two PDFs, and more generally this can be replaced by the Radon-Nikodym derivative as in the channel coding setting [37].

Averaging (21) with respect to the random variables in (10) conditioned on $\beta_s = b_s$ yields a conditional mutual information, which we denote by

$$I_{s_{\mathrm{dif}}, s_{\mathrm{eq}}}(b_s) := I\big(X_{s_{\mathrm{dif}}}; Y \mid X_{s_{\mathrm{eq}}}, \beta_s = b_s\big). \qquad (22)$$

This quantity will play a key role in our bounds, which will typically have the form

$$n \ge \max_{(s_{\mathrm{dif}}, s_{\mathrm{eq}})} \frac{\log\binom{p-k}{\ell}\binom{k}{\ell}}{I_{s_{\mathrm{dif}}, s_{\mathrm{eq}}}(b_s)}\,\big(1+o(1)\big), \qquad (23)$$

as will be made more precise in the subsequent sections.
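As a concrete instance of (22), the sketch below (ours; the Bernoulli(q) design with q = 1/k and the parameter values are illustrative assumptions) computes the conditional mutual information in closed form for the noiseless group testing model (4), where β_s is deterministic, and evaluates a rough measurement-count estimate in the flavor of (23) for a single partition size ℓ.

```python
from math import comb, log

def binary_entropy_nats(lam):
    if lam in (0.0, 1.0):
        return 0.0
    return -lam * log(lam) - (1 - lam) * log(1 - lam)

def gt_conditional_mi(k, ell, q):
    """I(X_sdif; Y | X_seq) for noiseless group testing with i.i.d. Bernoulli(q) entries.

    Y is a deterministic function of X_s, so the quantity equals H(Y | X_seq):
    with probability (1 - q)**(k - ell) all entries of X_seq are zero, in which case
    Y is the OR of ell Bernoulli(q) variables; otherwise Y = 1 with certainty.
    """
    return (1 - q) ** (k - ell) * binary_entropy_nats((1 - q) ** ell)

p, k, ell = 10_000, 20, 5
q = 1.0 / k
mi = gt_conditional_mi(k, ell, q)                     # nats per measurement
n_rough = log(comb(p - k, ell) * comb(k, ell)) / mi   # flavor of the bound (23)
print(mi, round(n_rough))
```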

III General Achievability and Converse Bounds

In this section, we provide general results holding for arbitrary models satisfying the assumptions given in Section II. Each of the results for exact recovery has a direct counterpart for partial recovery. For clarity, we focus on the former throughout Sections III-A and III-B, and then proceed with the latter in Section III-C.

III-A Initial Non-Asymptotic Bounds

Here we provide our main non-asymptotic upper and lower bounds on the error probability. These bounds bear a strong resemblance to analogous bounds from the channel coding literature [29]; in each case, the dominant term involves tail probabilities of the information density given in (20). The mean of the information density is the mutual information in (22), which thus arises naturally in the subsequent necessary and sufficient conditions on $n$ upon showing that the deviation from the mean is small with high probability. The procedure for doing so for a specific model is given in Section III-B.

We start with our achievability result. Here and throughout this section, we make use of the random variables defined in (11).

Theorem 1.

For any constants and , there exists a decoder such that

(24)

where

(25)
Proof.

See Section V-A. ∎

Remark 1.

The probability in the definition of is not an i.i.d. sum, and the techniques for ensuring that vary between different settings. The following approaches will suffice for all of the applications in this paper:

  1. In the case that is discrete, , and it follows that

    (26)

    Moreover, this can be strengthened by noting from the proof of Theorem 1 that may depend on , and choosing accordingly.

  2. Defining

    (27)
    (28)

    we have for any that

    (29)

    This follows directly from Chebyshev’s inequality.

  3. Defining

    (30)

    we have for any that

    (31)

    This follows directly from Markov’s inequality.

The proof of Theorem 1 is based on a decoder that searches for a unique support set $s \in \mathcal{S}$ such that

$$\tilde{\imath}^n\big(\mathbf{x}_{s_{\mathrm{dif}}}; \mathbf{y} \mid \mathbf{x}_{s_{\mathrm{eq}}}\big) > \gamma_{\ell} \qquad (32)$$

for some constants $\{\gamma_\ell\}$ and all partitions $(s_{\mathrm{dif}}, s_{\mathrm{eq}})$ of $s$ with $|s_{\mathrm{dif}}| = \ell$. Since the numerator in (19) is the likelihood of $\mathbf{y}$ given $(\mathbf{x}_{s_{\mathrm{dif}}}, \mathbf{x}_{s_{\mathrm{eq}}})$, this decoder can be thought of as a weakened version of the maximum-likelihood (ML) decoder. As with the ML decoder, computational considerations make its implementation intractable.
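As an illustration of this decision rule (ours, not the paper's implementation, and exponential-time by design), the brute-force sketch below instantiates the partition-wise information density for noisy group testing with crossover probability rho and Bernoulli(q) test inclusions, and declares the unique size-k set exceeding the thresholds gamma[ell] for every partition; here beta_s is deterministic, so the averaged and conditional densities coincide.

```python
import itertools
import numpy as np

def gt_info_density(y, X_dif, X_eq, q, rho):
    """n-letter information density for noisy group testing (crossover rho).

    Numerator:   P^n(y | x_s), with x_s = (x_dif, x_eq).
    Denominator: P^n(y | x_eq), with the s_dif columns averaged out (i.i.d. Bernoulli(q)).
    """
    ell = X_dif.shape[1]
    or_s = (X_dif.any(axis=1) | X_eq.any(axis=1)).astype(int)
    p_num = np.where(y == or_s, 1 - rho, rho)
    p1_eq = np.where(X_eq.any(axis=1),
                     1 - rho,
                     (1 - (1 - q) ** ell) * (1 - rho) + (1 - q) ** ell * rho)
    p_den = np.where(y == 1, p1_eq, 1 - p1_eq)
    return np.sum(np.log(p_num) - np.log(p_den))

def threshold_decoder(X, y, k, q, rho, gamma):
    """Return the unique size-k support passing all partition tests, or None."""
    n, p = X.shape
    passing = []
    for s in itertools.combinations(range(p), k):
        ok = True
        for ell in range(1, k + 1):
            for s_dif in itertools.combinations(s, ell):
                s_eq = [j for j in s if j not in s_dif]
                if gt_info_density(y, X[:, list(s_dif)], X[:, s_eq], q, rho) <= gamma[ell]:
                    ok = False
                    break
            if not ok:
                break
        if ok:
            passing.append(set(s))
    return passing[0] if len(passing) == 1 else None
```

A natural (though heuristic here) choice is to let gamma[ell] grow with the logarithm of the number of candidate sets differing from the true support in ell positions.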

The following theorem provides a general non-asymptotic converse bound.

Theorem 2.

Fix $\ell$, and let $(s_{\mathrm{dif}}, s_{\mathrm{eq}})$ be an arbitrary partition of $s$ (with $|s_{\mathrm{dif}}| = \ell$) depending on $\beta_s$. For any decoder, we have

(33)
Proof.

See Section V-B. ∎

The proof of Theorem 2 is based on Verdú-Han type bounding techniques [36].

III-B Techniques for Applying Theorems 1 and 2

The bounds presented in the preceding theorems do not directly reveal the number of measurements required to achieve a vanishing error probability. In this subsection, we present the steps that can be used to obtain such conditions. We provide examples in Section IV.

The idea is to use a concentration inequality to bound the first term in (24) (or (33)), which is possible due to the fact that each summation is conditionally i.i.d. given $\beta_s$. We proceed by providing the details of these steps separately for the achievability and converse parts. We start with the former.

  1. Observe that, conditioned on $\beta_s = b_s$, the mean of $\imath^n(\mathbf{X}_{s_{\mathrm{dif}}}; \mathbf{Y} \mid \mathbf{X}_{s_{\mathrm{eq}}}, b_s)$ is $n I_{s_{\mathrm{dif}}, s_{\mathrm{eq}}}(b_s)$, where $I_{s_{\mathrm{dif}}, s_{\mathrm{eq}}}(\cdot)$ is defined in (22).

  2. Fix , and suppose that for a fixed value of , we have for all that

    (34)

    and

    (35)

    for some functions (e.g., these may arise from Chebyshev’s inequality or Bernstein’s inequality [38, Ch. 2]; a small numerical illustration of a Chebyshev-based choice is given after this list). Combining these conditions with the union bound, we obtain

    (36)
  3. Observe that the condition in (34) can be written as

    (37)
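For instance (our illustration; the mean I and variance V of the single-letter information density are assumed known or computable for the model at hand), Chebyshev's inequality alone already yields an explicit, if crude, sufficient number of measurements for the concentration step in (35):

```python
import math

def chebyshev_measurements(I, V, delta, eps):
    """Smallest n for which Chebyshev's inequality guarantees
    P[sum of n i.i.d. information densities <= n*I*(1 - delta)] <= eps,
    when each term has mean I and variance V:
        P[|sum - n*I| >= n*delta*I] <= n*V / (n*delta*I)**2 = V / (n*delta**2*I**2).
    """
    return math.ceil(V / (eps * delta**2 * I**2))

print(chebyshev_measurements(I=0.5, V=1.0, delta=0.2, eps=0.05))  # 2000
```

Sharper inequalities (e.g., Bernstein's) improve the dependence on eps and delta.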

We summarize the preceding findings in the following.

Theorem 3.

For any constants , and , and functions (), define the set

(38)

Then we have

(39)
Remark 2.

The preceding arguments remain unchanged when also depends on . We leave this possible dependence implicit throughout this section, since a fixed value will suffice for all but one of the models considered in Section IV.

In the case that (35) holds for all $b_s$ (or more generally, within a set of $b_s$ whose probability under $P_{\beta_s}$ tends to one) and the final three terms in (39) vanish, the overall upper bound approaches the probability, with respect to $\beta_s$, that (37) fails to hold. In many cases, the second logarithm in the numerator therein is dominated by the first. It should be noted that the condition that the second term in (39) vanishes can also impose conditions on $n$. For most of the examples presented in Section IV, the condition in (37) will be the dominant one; however, this need not always be the case, and it depends on the concentration inequality used in (35).

The application of Theorem 2 is done using similar steps, so we provide less detail. Fix , and suppose that, for a fixed value of , the pair is such that

(40)

and

(41)

for some function . Combining these conditions, we see that the first probability in (33), with an added conditioning on , is lower bounded by . In the case that is defined for multiple values corresponding to different values of , we can further lower bound this by .

Next, we observe that (40) holds if and only if

(42)

Recalling that the partition is an arbitrary function of , we can ensure that this coincides with

(43)

by choosing each pair as a function of to achieve this maximum.

Finally, we note that the maximum over in the above-derived term may be restricted to any set provided that is constrained similarly in (43); one simply chooses the partition so that always lies in this set. Putting everything together, we have the following.

Theorem 4.

For any set , constants and , and functions (), define the set