# Info-Greedy sequential adaptive compressed sensing

We present an information-theoretic framework for sequential adaptive compressed sensing, Info-Greedy Sensing, where measurements are chosen to maximize the extracted information conditioned on the previous measurements. We show that the widely used bisection approach is Info-Greedy for a family of k-sparse signals by connecting compressed sensing and blackbox complexity of sequential query algorithms, and present Info-Greedy algorithms for Gaussian and Gaussian Mixture Model (GMM) signals, as well as ways to design sparse Info-Greedy measurements. Numerical examples demonstrate the good performance of the proposed algorithms using simulated and real data: Info-Greedy Sensing shows significant improvement over random projection for signals with sparse and low-rank covariance matrices, and adaptivity brings robustness when there is a mismatch between the assumed and the true distributions.


## I Introduction

Nowadays ubiquitous big data applications (image processing [1], power network monitoring [2], and large scale sensor networks [3]) call for more efficient information sensing techniques. Often these techniques are sequential in that the measurements are taken one after another. Hence information gained in the past can be used to guide an adaptive design of subsequent measurements, which naturally leads to the notion of sequential adaptive sensing. At the same time, a path to efficient sensing of big data is compressive sensing [4, 5, 6], which exploits low-dimensional structures to recover signals from a number of measurements much smaller than the ambient dimension of the signals.

Early compressed sensing works mainly focus on non-adaptive and one-shot measurement schemes. Recently there has also been much interest in sequential adaptive compressed sensing, which measures noisy linear combinations of the entries (this is different from the direct adaptive sensing, which measures signal entries directly [7, 8, 9, 10]). Although in the seminal work of [11], it was shown under fairly general assumptions that “adaptivity does not help much”, i.e., sequential adaptive compressed sensing does not improve the order of the min-max bounds obtained by algorithms, these limitations are restricted to certain performance metrics. It has also been recognized (see, e.g., [12, 13, 14]) that adaptive compressed sensing offers several benefits with respect to other performance metrics, such as the reduction in the signal-to-noise ratio (SNR) to recover the signal. Moreover, larger performance gain can be achieved by adaptive compressed sensing if we aim at recovering a “family” of signals with known statistical prior information (incorporating statistical priors in compressed sensing has been considered in [15] for the non-sequential setting and in [16] for the sequential setting using Bayesian methods).

To harvest the benefits of adaptive compressed sensing, various algorithms have been developed: compressive binary search [17, 18], which considers a problem of determining the location of a single non-zero entry; a variant of the iterative bisection algorithm [19] to adaptively identify the partial support of the signal; random choice of compressed sensing vectors [20], and a collection of independent structured random sensing matrices in each measurement step [21] with some columns “masked” to zero; an experimental design approach [22] that designs measurements adaptive to the mean square error of the estimated signal; exploiting additional graphical structure of the signal [23, 24]; the CASS algorithm [13], which is based on bisection search to locate multiple non-zero entries, and is claimed to be near-optimal in the number of measurements needed sequentially to achieve small recovery errors; an adaptive sensing strategy specifically tailored to tree-sparse signals [25] that significantly outperforms non-adaptive sensing strategies. In optics literature, compressive imaging systems with sequential measurement architectures have been developed [26, 27, 28], which may modify the measurement basis based on specific object information derived from the previous measurements and achieve better performance. In medical imaging literature, [29] uses Bayesian experimental design to optimize $k$-space sampling for nonlinear sparse MRI reconstruction.

The idea of using an information measure for sequential compressed sensing has been spelled out in various places for specific settings or signal models, for example: the seminal Bayesian compressive sensing work [16], which designs a new projection that minimizes the differential entropy of the posterior estimate on a Gaussian signal; [6, Chapter 6.2] and [30], which introduce the so-called “expected information” and outline a general strategy for sequential adaptive sensing; [31], which develops a two-step adaptive statistical compressed sensing scheme for Gaussian mixture model (GMM) signals based on maximizing an information-theoretic objective function; [32], which sequentially senses low-rank GMM signals based on a posterior distribution and provides an empirical performance analysis; [33], which studies the design of linear projection measurements for a vector Poisson signal model; and [34], which designs general nonlinear functions for mapping high-dimensional data into lower-dimensional space using mutual information as a metric.

A general belief, though, is that it is difficult to devise quantitative error bounds for such sequential information maximizing algorithms (see, e.g., [6, Section 6.2.3]).

In this work, we present a unified information-theoretic framework for sequential adaptive compressive sensing, called Info-Greedy Sensing, which greedily picks the measurement with the largest amount of information gain based on the previous measurements. More precisely, we design the next measurement to maximize the conditional mutual information between the measurement and the signal with respect to the previous measurements. This framework enables us to better understand existing algorithms, establish theoretical performance guarantees, and develop new algorithms. The optimization problem associated with Info-Greedy Sensing is often non-convex. In some cases the solutions can be found analytically, and in others we resort to iterative heuristics. In particular, (1) we show that the widely used bisection approach is Info-Greedy for a family of $k$-sparse signals by connecting compressed sensing and blackbox complexity of sequential query algorithms [35]; (2) we present Info-Greedy algorithms for Gaussian and Gaussian Mixture Model (GMM) signals under more general noise models (e.g. “noise-folding” [36]) than those considered in [32], and analyze their performance in terms of the number of measurements needed; (3) we also develop new sensing algorithms, e.g., for sparse sensing vectors. Numerical examples are provided to demonstrate the accuracy of theoretical bounds and good performance of Info-Greedy Sensing algorithms using simulated and real data.

The rest of the paper is organized as follows. Section II sets up the formalism for Info-Greedy Sensing. Section III and Section IV present the Info-Greedy Sensing algorithms for $k$-sparse signals and Gaussian signals (low-rank single Gaussian and GMM), respectively. Section V discusses Info-Greedy Sensing with sparse measurement vectors. Section VI contains numerical examples using simulated and real data. Finally, Section VII concludes the paper. All proofs are deferred to the Appendix.

The notation in this paper is standard. In particular, $\mathcal{N}(\mu, \Sigma)$ denotes the Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$; $[x]_i$ denotes the $i$th coordinate of the vector $x$; we use the shorthand $[n] = \{1, \ldots, n\}$; let $|S|$ denote the cardinality of a set $S$; $\|x\|_0$ is the number of non-zeros in the vector $x$; let $\|\Sigma\|$ be the spectral norm (largest eigenvalue) of a positive definite matrix $\Sigma$; let $\det(X)$ be the determinant of a matrix $X$; let $\mathbb{H}[x]$ denote the entropy of a random variable $x$; and let $\mathbb{I}[x; y]$ denote the mutual information between two random variables $x$ and $y$. Let the column vector $e_i$ have $1$ in the $i$th entry and zero elsewhere, and let $\chi^2_n(\cdot)$ be the quantile function of the chi-squared distribution with $n$ degrees of freedom.

## II Formulation

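A typical compressed sensing setup is as follows. Let $x \in \mathbb{R}^n$ be the unknown $n$-dimensional signal. There are $m$ measurements, and $y \in \mathbb{R}^m$ is the measurement vector depending linearly on the signal $x$ and subject to an additive noise:

$$y = Ax + w, \quad A \triangleq \begin{bmatrix} a_1^\intercal \\ \vdots \\ a_m^\intercal \end{bmatrix} \in \mathbb{R}^{m \times n}, \quad w \triangleq \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix} \in \mathbb{R}^{m \times 1}, \tag{1}$$

where $A$ is the sensing matrix, and $w$ is the noise vector. Here, each coordinate $y_i$ of $y$ is a result of measuring $a_i^\intercal x$ with an additive noise $w_i$. In the setting of sequential compressed sensing, the unknown signal $x$ is measured sequentially

$$y_i = a_i^\intercal x + w_i, \quad i = 1, \ldots, m.$$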

In high-dimensional problems, various low-dimensional signal models for $x$ are in common use: (1) sparse signal models, the canonical one being $x$ having $k$ non-zero entries (in a related model the signal comes from a dictionary with few nonzero coefficients, whose support is unknown; we will not further consider this model here); (2) low-rank Gaussian model (signal in a subspace plus Gaussian noise); and (3) Gaussian mixture model (GMM) (a model for signal lying in a union of multiple subspaces plus Gaussian noise), which has been widely used in image and video analysis among others (a mixture of GMM models has also been used to study sparse signals [37]; there are also other low-dimensional signal models, including the general manifold models, which will not be considered here).

Compressed sensing exploits the low-dimensional structure of the signal to recover the signal with high accuracy using far fewer measurements than the dimension of the signal, i.e., $m \ll n$. Two central and interrelated problems in compressed sensing are signal recovery and the design of the sensing matrix $A$. Early compressed sensing works usually assume $A$ to be random, which does have benefits for universality regardless of the signal distribution. However, when there is prior knowledge about the signal distribution, one can optimize $A$ to minimize the number of measurements subject to a total sensing power constraint

$$\sum_{i=1}^{m} \|a_i\|_2^2 \le P \tag{2}$$

for some constant $P$. In the following, we either vary the power $\beta_i = \|a_i\|_2^2$ for each measurement, or fix the measurements to be unit power, $\|a_i\|_2 = 1$ (for example, due to physical constraints), and use $\beta_i$ repeated measurements in the direction of $a_i$, which is equivalent to measuring using an integer-valued power. Here $\beta_i$ can be viewed as the amount of resource we allocate for that measurement (or direction).

We will consider a methodology where $A$ is chosen to extract the most information about the signal, i.e., to maximize mutual information. In the non-sequential setting this means that $A$ maximizes the mutual information between the signal $x$ and the measurement outcome $y$, i.e., $\mathbb{I}[x; y]$ with $y = Ax + w$. In sequential compressed sensing, the subsequent measurement vectors can be designed using the already acquired measurements, and hence the sensing matrix $A$ can be designed row by row. Optimal sequential design of $A$ can be defined recursively and viewed as dynamic programming [38]. However, this formulation is usually intractable in all but the simplest situations (one such example is the sequential probabilistic bisection algorithm in [30], which locates a single non-zero entry). Instead, the usual approach operates in a greedy fashion. The core idea is that, based on the information that the previous measurements have extracted, the new measurement should probe in the direction that maximizes the conditional information as much as possible. We formalize this idea as Info-Greedy Sensing, which is described in Algorithm 1. The algorithm is initialized with a prior distribution of the signal $x$, and returns the Bayesian posterior mean as an estimator for the signal $x$. Conditional mutual information is a natural metric, as it counts only useful new information between the signal and the potential result of the measurement, disregarding noise and what has already been learned from previous measurements.

Algorithm 1 stops either when the conditional mutual information falls below a prescribed threshold or when the maximum number of iterations is reached. How the threshold relates to the recovery precision depends on the specific signal model employed. For example, for a Gaussian signal, the conditional mutual information is the log determinant of the conditional covariance matrix, and hence the signal is constrained to be in a small ellipsoid with high probability. Also note that in this algorithm, the recovered signal may not reach the desired accuracy if it exhausts the maximum number of iterations. In the theoretical analysis we assume the iteration limit is sufficiently large to avoid this.

Note that the optimization problem in Info-Greedy Sensing is non-convex in general [39]. Hence, we will discuss various heuristics and establish their theoretical performance in terms of the following metric:

###### Definition II.1 (Info-Greedy).

We call an algorithm Info-Greedy if the measurement $a_i$ maximizes $\mathbb{I}[x; y_i \mid y_j, j < i] / \beta_i$ for each $i$, where $x$ is the unknown signal, $y_i$ is the measurement outcome, and $\beta_i$ is the amount of resource for measurement $a_i$.

## III k-sparse signal

In this section, we consider Info-Greedy Sensing for a $k$-sparse signal with arbitrary nonnegative amplitudes in the noiseless case as well as under Gaussian measurement noise. We show that a natural modification of the bisection algorithm corresponds to Info-Greedy Sensing under a certain probabilistic model. We also show that Algorithm 2 is optimal in terms of the number of measurements for $1$-sparse signals as well as optimal up to a $\log k$ factor for $k$-sparse signals in the noiseless case. In the presence of Gaussian measurement noise, it is optimal up to at most an additional factor. Finally, we show Algorithm 2 is Info-Greedy when $k = 1$, and when $k > 1$ it is Info-Greedy up to a $\log k$ factor.

To simplify the problem, we assume the sensing matrix $A$ consists of binary entries: $a_{ij} \in \{0, 1\}$. Consider a signal $x$ with non-negative entries and up to $k$ non-zero entries which are distributed uniformly at random. The following theorem gives an upper bound on the number of measurements for our modified bisection algorithm (see Algorithm 2) to recover such $x$. In the description of Algorithm 2, let

$$[a_S]_i \coloneqq \begin{cases} 1, & i \in S \\ 0, & i \notin S \end{cases}$$

denote the characteristic vector of a set $S$. The basic idea is to recursively estimate a tuple that consists of a set $S$, which contains possible locations of the non-zero elements, and the total signal amplitude in that set. We say that a signal has a given minimum amplitude if every non-zero entry is at least that amplitude.
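Algorithm 2 itself is not reproduced in this text, so the following is only a minimal sketch of the underlying bisection idea for the noiseless $1$-sparse case, not the authors' modified algorithm. The function name `bisection_locate` and the list-based signal representation are assumptions made for illustration. Each measurement is $a_S^\intercal x$ for a characteristic vector $a_S$, and the candidate set $S$ is halved according to where the amplitude is found.

```python
def bisection_locate(x):
    """Locate the single non-zero entry of a non-negative 1-sparse list x
    using noiseless measurements y = a_S^T x with 0/1 characteristic vectors."""
    candidates = list(range(len(x)))   # set S of possible locations
    measurements = 0

    def measure(S):
        # y = a_S^T x, where a_S is the characteristic vector of S
        return sum(x[i] for i in S)

    total = measure(candidates)        # total signal amplitude in S
    measurements += 1
    if total == 0:
        return None, 0.0, measurements  # all-zero signal

    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        y = measure(half)
        measurements += 1
        # keep whichever half carries the amplitude
        candidates = half if y > 0 else candidates[len(half):]
    return candidates[0], total, measurements
```

For a signal of length $n$ this sketch uses one initial measurement plus roughly $\log_2 n$ halving steps; handling $k > 1$ or noise would require the repetition and splitting logic that the paper attributes to Algorithm 2.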

###### Theorem III.1 (Upper bound for k-sparse signal x).

Let $x$ be a $k$-sparse signal.

1. In the noiseless case, Algorithm 2 recovers the signal exactly with at most measurements (using in Line 1).

2. In the noisy case with , Algorithm 2 recovers the signal such that with probability at least using at most measurements.

###### Lemma III.2 (Lower bound for noiseless k-sparse signal x).

Let , be a -sparse signal. Then to recover exactly, the expected number of measurements required for any algorithm is at least .

###### Lemma III.3 (Bisection Algorithm 2 for k=1 is Info-Greedy).

For $k = 1$, Algorithm 2 is Info-Greedy.

In the general case the simple analysis that leads to Lemma III.3 fails. However, using Theorem A.1 in the Appendix we can estimate the average amount of information obtained from a measurement:

###### Lemma III.4 (Bisection Algorithm 2 is Info-Greedy up to a logk factor in the noiseless case).

Let . Then the average information of a measurement in Algorithm 2:

$$\mathbb{I}[X; Y_i \mid Y_1, \ldots, Y_{i-1}] \ge 1 - \frac{\log k}{\log n}.$$

###### Remark III.5.

1. Observe that Lemma III.4 establishes that Algorithm 2 for a sparse signal with acquires at least a fraction of the maximum possible mutual information (which on average is roughly bit per measurement).

2. Here we constrained the entries of the matrix $A$ to be binary valued. This may correspond to applications where, for example, sensors report errors and the measurements count the total number of errors. Note, however, that if we relax this constraint and allow entries of $A$ to be real-valued, in the absence of noise the signal can be recovered from one measurement that projects the signal onto a vector with entries .

3. The setup here with $k$-sparse signals and binary measurement matrix $A$ generalizes the group testing [40] setup.

4. The CASS algorithm [13] is another algorithm that recovers a $k$-sparse signal by iteratively partitioning the signal support into subsets, computing the sum over that subset and keeping the largest . In [13] it was shown that to recover a $k$-sparse signal with non-uniform positive amplitude with high probability, the number of measurements is on the order of with varying power measurement. It is important to note that the CASS algorithm allows for power allocation to mitigate noise, while we repeat measurements. This, however, coincides with the number of unit length measurements of our algorithm in Theorem III.1 after appropriate normalization. For specific regimes of error probability, the overhead in Theorem III.1 can be further reduced. For example, for any constant probability of error , the number of required repetitions per measurement is leading to improved performance. Our algorithm can also be easily modified to incorporate power allocation.

## IV Low-Rank Gaussian Models

In this section, we derive the Info-Greedy Sensing algorithms for the single low-rank Gaussian model as well as the low-rank GMM signal model, and also quantify the algorithm’s performance.

### IV-A Single Gaussian model

Consider a Gaussian signal $x \sim \mathcal{N}(\mu, \Sigma)$ with known parameters $\mu$ and $\Sigma$. The covariance matrix $\Sigma$ has rank $k \le n$. We will consider three noise models:

1. white Gaussian noise added after the measurement (the most common model in compressed sensing):

$$y = Ax + w, \quad w \sim \mathcal{N}(0, \sigma^2 I). \tag{3}$$

Let $\beta_i = \|a_i\|_2^2$ represent the power allocated to the $i$th measurement. In this case, higher power allocated to a measurement increases the SNR of that measurement.

2. white Gaussian noise added prior to the measurement, a model that appears in some applications such as reduced dimension multi-user detection in communication systems [41] and also known as the “noise folding” model [36]:

$$y = A(x + w), \quad w \sim \mathcal{N}(0, \sigma^2 I). \tag{4}$$

In this case, allocating higher power for a measurement cannot increase the SNR of the outcome. Hence, we use the actual number of repeated measurements in the same direction as a proxy for the amount of resource allocated for that direction.

3. colored Gaussian noise with covariance $\Sigma_w$ added either prior to the measurement:

$$y = A(x + w), \quad w \sim \mathcal{N}(0, \Sigma_w), \tag{5}$$

or after the measurement:

$$y = Ax + w, \quad w \sim \mathcal{N}(0, \Sigma_w). \tag{6}$$

In the following, we will establish lower bounds on the amount of resource (either the minimum power or the number of measurements) needed for Info-Greedy Sensing to achieve a recovery error $\|x - \hat{x}\|_2 \le \varepsilon$, where $\hat{x}$ denotes the recovered signal.

#### IV-A1 White noise added prior to measurement or “noise folding”

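We start our discussion with this model; results for the other models can be derived similarly. As $\beta_i$ does not affect the SNR, we set $\|a_i\|_2 = 1$. Note that the conditional distribution of $x$ given $y_1$ is a Gaussian random vector with adjusted parameters

$$x \mid y_1 \sim \mathcal{N}\left(\mu + \Sigma a_1 (a_1^\intercal \Sigma a_1 + \sigma^2)^{-1} (y_1 - a_1^\intercal \mu),\; \Sigma - \Sigma a_1 (a_1^\intercal \Sigma a_1 + \sigma^2)^{-1} a_1^\intercal \Sigma\right). \tag{7}$$

Therefore, to find Info-Greedy Sensing for a single Gaussian signal, it suffices to characterize the first measurement and from there on iterate with adjusted distributional parameters. For a Gaussian signal $x$ and the noisy measurement $y_1$, we have

$$\mathbb{I}[x; y_1] = \mathbb{H}[y_1] - \mathbb{H}[y_1 \mid x] = \frac{1}{2} \ln\left(a_1^\intercal \Sigma a_1 / \sigma^2 + 1\right). \tag{8}$$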

Clearly, with $\|a_1\|_2 = 1$, (8) is maximized when $a_1$ is an eigenvector of $\Sigma$ associated with the largest eigenvalue. From the above argument, the Info-Greedy Sensing algorithm for a single Gaussian signal is to choose $a_1, a_2, \ldots$ as the orthonormal eigenvectors of $\Sigma$ in decreasing order of eigenvalues, as described in Algorithm 3. The following theorem establishes the bound on the number of measurements needed.
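Since Algorithm 3 is not reproduced in this text, the loop below is a hedged sketch of the procedure described above rather than the authors' listing. It assumes the noise-folding model $y = a^\intercal(x + w)$ with unit-norm sensing vectors and $\sigma^2 > 0$, uses a dense eigendecomposition for simplicity, and applies the posterior update (7) after each measurement. The function name and argument names are hypothetical.

```python
import numpy as np

def info_greedy_gaussian(x, mu, Sigma, sigma2, num_measurements, rng):
    """Sequentially measure along the leading eigenvector of the current
    posterior covariance (noise-folding model, unit-norm sensing vectors)."""
    mu = np.array(mu, dtype=float)
    Sigma = np.array(Sigma, dtype=float)
    for _ in range(num_measurements):
        _, eigvecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
        a = eigvecs[:, -1]                        # leading eigenvector, ||a|| = 1
        w = rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
        y = float(a @ (x + w))                    # y = a^T (x + w)
        Sa = Sigma @ a
        s = float(a @ Sa) + sigma2                # a^T Sigma a + sigma^2
        mu = mu + Sa * (y - float(a @ mu)) / s    # posterior mean, eq. (7)
        Sigma = Sigma - np.outer(Sa, Sa) / s      # posterior covariance, eq. (7)
    return mu, Sigma
```

The returned mean plays the role of the estimate $\hat{x}$. Note that the covariance sequence does not depend on the observed values `y`, only on the chosen directions.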

###### Theorem IV.1 (White Gaussian noise added prior to measurement or “noise folding”).

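Let $x \sim \mathcal{N}(\mu, \Sigma)$ and let $\lambda_1, \ldots, \lambda_k$ be the eigenvalues of $\Sigma$ with multiplicities. Further let $\varepsilon > 0$ be the accuracy and $p \in (0, 1)$ the target probability. Then Algorithm 3 recovers $x$ to within $\|x - \hat{x}\|_2 \le \varepsilon$ with probability at least $p$ using at most the following number of measurements by unit vectors $\|a_i\|_2 = 1$: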

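$$m = \sum_{\substack{i=1 \\ \lambda_i \neq 0}}^{k} \max\left\{0, \left\lceil \left(\frac{\chi^2_n(p)}{\varepsilon^2} - \frac{1}{\lambda_i}\right) \sigma^2 \right\rceil\right\} \tag{9a}$$

provided $\sigma > 0$. If $\sigma^2 \le \varepsilon^2 / \chi^2_n(p)$, the number of measurements simplifies to

$$\left|\left\{ i : \lambda_i > \frac{\varepsilon^2}{\chi^2_n(p)} \right\}\right|. \tag{9b}$$

This also holds when $\sigma = 0$.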
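As a small illustration of how the bounds (9a) and (9b) could be evaluated, the sketch below takes the chi-squared quantile $\chi^2_n(p)$ as a precomputed input, so no statistics library is assumed. The function names are hypothetical.

```python
import math

def measurements_noise_folding(eigenvalues, sigma2, eps, chi2_quantile):
    """Evaluate (9a): sum over non-zero eigenvalues of
    max(0, ceil((chi2_quantile / eps**2 - 1 / lam) * sigma2)). Requires sigma2 > 0."""
    total = 0
    for lam in eigenvalues:
        if lam == 0:
            continue                  # zero eigenvalues need no measurement
        term = (chi2_quantile / eps**2 - 1.0 / lam) * sigma2
        total += max(0, math.ceil(term))
    return total

def measurements_small_noise(eigenvalues, eps, chi2_quantile):
    """Evaluate (9b): count eigenvalues above eps**2 / chi2_quantile."""
    threshold = eps**2 / chi2_quantile
    return sum(1 for lam in eigenvalues if lam > threshold)
```

Directions whose eigenvalue is already below $\varepsilon^2/\chi^2_n(p)$ contribute a non-positive term and are skipped by the `max` in (9a).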

#### IV-A2 White noise added after measurement

A key insight in the proof of Theorem IV.1 is that repeated measurements in the same eigenvector direction correspond to a single measurement in that direction with all the power summed together. This can be seen from the following discussion. After measuring in the direction of a unit norm eigenvector $u$ with eigenvalue $\lambda$, and using power $\beta$, the conditional covariance matrix takes the form of

$$\Sigma - \Sigma \sqrt{\beta} u \left(\sqrt{\beta} u^\intercal \Sigma \sqrt{\beta} u + \sigma^2\right)^{-1} \sqrt{\beta} u^\intercal \Sigma = \frac{\lambda \sigma^2}{\beta \lambda + \sigma^2} u u^\intercal + \Sigma^{\perp u}, \tag{10}$$

where $\Sigma^{\perp u}$ is the component of $\Sigma$ in the orthogonal complement of $u$. Thus, the only change in the eigendecomposition of $\Sigma$ is the update of the eigenvalue of $u$ from $\lambda$ to $\lambda \sigma^2 / (\beta \lambda + \sigma^2)$. Informally, measuring with power allocation $\beta$ on a Gaussian signal $x$ reduces the uncertainty in direction $u$, as illustrated in Fig. 1. We have the following performance bound for sensing a Gaussian signal:

###### Theorem IV.2 (White Gaussian noise added after measurement).

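Let $x \sim \mathcal{N}(\mu, \Sigma)$ and let $\lambda_1, \ldots, \lambda_k$ be the eigenvalues of $\Sigma$ with multiplicities. Further let $\varepsilon > 0$ be the accuracy and $p \in (0, 1)$ the target probability. Then Algorithm 3 recovers $x$ to within $\|x - \hat{x}\|_2 \le \varepsilon$ with probability at least $p$ using at most the following power

$$P = \sum_{\substack{i=1 \\ \lambda_i \neq 0}}^{k} \max\left\{0, \left(\frac{\chi^2_n(p)}{\varepsilon^2} - \frac{1}{\lambda_i}\right) \sigma^2\right\} \tag{11}$$

provided $\sigma > 0$.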
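One way to see where the per-direction term in (11) comes from is sketched below. It assumes that the goal in each eigen-direction is to push the eigenvalue down to the threshold $\varepsilon^2/\chi^2_n(p)$ that appears in (9b); the actual proof is in the Appendix and may proceed differently. The key observation is that the update in (10) is additive in the inverse eigenvalue, where $\lambda'$ denotes the eigenvalue after a measurement with power $\beta$.

```latex
\lambda' = \frac{\lambda\sigma^2}{\beta\lambda + \sigma^2}
\quad\Longleftrightarrow\quad
\frac{1}{\lambda'} = \frac{1}{\lambda} + \frac{\beta}{\sigma^2},
\qquad\text{hence}\qquad
\lambda' \le \frac{\varepsilon^2}{\chi^2_n(p)}
\;\Longleftrightarrow\;
\beta \ge \left(\frac{\chi^2_n(p)}{\varepsilon^2} - \frac{1}{\lambda}\right)\sigma^2 .
```

Summing the smallest admissible $\beta$ over the non-zero eigenvalues, and taking zero where the right-hand side is negative, gives an expression of the same form as (11).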

#### IV-A3 Colored noise

When a colored noise $w \sim \mathcal{N}(0, \Sigma_w)$ is added either prior to or after the measurement, similar to the white noise cases, the conditional distribution of $x$ given the first measurement $y_1$ is Gaussian with adjusted parameters. Hence, as before, the measurement vectors can be found iteratively. Algorithm 3 presents Info-Greedy Sensing for this case and the derivation is given in Appendix B. Algorithm 3 also summarizes all the Info-Greedy Sensing algorithms for Gaussian signals under the various noise models.

The following version of Theorem IV.1 is for the required number of measurements for colored noise in the “noise folding” model:

###### Theorem IV.3 (Colored Gaussian noise added prior to measurement or “noise folding”).

Let be a Gaussian signal, and let denote the eigenvalues of with multiplicities. Assume . Furthermore, let be the required accuracy. Then Algorithm 3 recovers satisfying with probability at least using at most the following number of measurements by unit vectors :

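$$m = \sum_{\substack{i=1 \\ \lambda_i \neq 0}}^{n} \max\left\{0, \left\lceil \frac{\chi^2_n(p)}{\varepsilon^2} \|\Sigma_w\| - \frac{1}{\lambda_i} \right\rceil\right\}. \tag{12}$$

###### Remark IV.4.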

(1) Under these noise models, the posterior distribution of the signal is also Gaussian, and the measurement outcome affects only its mean but not the covariance matrix (see (7)). In other words, the outcome does not affect the mutual information of the posterior Gaussian signal. In this sense, for Gaussian signals adaptivity brings no advantage when $\Sigma$ is accurate, as the measurements are pre-determined by the eigenspace of $\Sigma$. However, when knowledge of $\Sigma$ is inaccurate for Gaussian signals, adaptivity brings benefit as demonstrated in Section VI-A1, since a sequential update of the covariance matrix incorporates new information and “corrects” the covariance matrix when we design the next measurement.

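(2) In (10) the eigenvalue $\lambda$ reduces to $\lambda \sigma^2 / (\beta \lambda + \sigma^2)$ after the first measurement. Iterating this, we see by induction that after $q$ measurements in direction $u$ with powers $\beta_1, \ldots, \beta_q$, the eigenvalue reduces to $\lambda \sigma^2 / (\lambda \sum_{j=1}^{q} \beta_j + \sigma^2)$, which is the same as measuring once in direction $u$ with power $\sum_{j=1}^{q} \beta_j$. Hence, measuring several times in the same direction $u$, and thereby splitting the power $\beta$ into $\beta_1, \ldots, \beta_q$ for the measurements, has the same effect as making one measurement with the total power $\beta$.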

(3) Info-Greedy Sensing for Gaussian signals can be implemented efficiently. Note that in the algorithm we only need to compute the leading eigenvector of the covariance matrix; moreover, updates of the covariance matrix and mean are simple and iterative. In particular, for a sparse $\Sigma$, the computation of the largest eigenvalue and associated eigenvector can be implemented with cost proportional to the number of non-zero entries of $\Sigma$ times the number of power iterations, using the sparse power method [42]. In many high-dimensional applications, $\Sigma$ is sparse if the variables (entries of $x$) are not highly correlated. Also note that the sparsity structure of the covariance matrix as well as the correlation structure of the signal entries will not be changed by the update of the covariance matrix. This is because in (10) the update only changes the eigenvalues but not the eigenvectors. To see why this is true, let $\Sigma = \sum_i \lambda_i u_i u_i^\intercal$ be the eigendecomposition of $\Sigma$. By saying that the covariance matrix is sparse, we assume that the $u_i$'s are sparse and, hence, the resulting covariance matrix has few non-zero entries. Therefore, updating the covariance matrix will not significantly change the number of non-zero entries in a covariance matrix. We demonstrate the scalability of Info-Greedy Sensing with larger examples in Section VI-A1.
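The leading-eigenvector computation mentioned in item (3) can be sketched with plain power iteration. This is a generic textbook version, not the specific sparse method of [42]. It uses a dense NumPy array for brevity, but the only operation touching $\Sigma$ is one matrix-vector product per iteration, which is where sparsity would pay off. The function name, tolerance and seed are illustrative assumptions.

```python
import numpy as np

def leading_eigenpair(Sigma, num_iters=200, tol=1e-12, seed=0):
    """Power iteration for the largest eigenvalue and eigenvector of a
    symmetric positive semidefinite matrix Sigma."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=Sigma.shape[0])
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(num_iters):
        w = Sigma @ v                     # one matrix-vector product per iteration
        lam_next = float(v @ w)           # Rayleigh quotient of the unit vector v
        norm_w = np.linalg.norm(w)
        if norm_w == 0.0:
            return 0.0, v                 # v lies in the null space of Sigma
        v = w / norm_w
        converged = abs(lam_next - lam) <= tol * max(1.0, abs(lam_next))
        lam = lam_next
        if converged:
            break
    return lam, v
```

Convergence speed depends on the gap between the two largest eigenvalues, so the iteration count is problem dependent.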

### IV-B Gaussian mixture model (GMM)

The probability density function of GMM is given by

$$p(x) = \sum_{c=1}^{C} \pi_c \mathcal{N}(\mu_c, \Sigma_c), \tag{13}$$

where $C$ is the number of classes, and $\pi_c$ is the probability of samples from class $c$. Unlike the Gaussian case, the mutual information of a GMM cannot be written explicitly. However, for GMM signals a gradient descent approach that works for an arbitrary signal model can be used, as outlined in [32]. The derivation uses the fact that the gradient of the conditional mutual information with respect to the measurement vector is a linear transform of the minimum mean square error (MMSE) matrix [43, 39]. Moreover, the gradient descent approach for GMM signals exhibits structural properties that can be exploited to reduce the computational cost of evaluating the MMSE matrix, as outlined in [37, 32]. For completeness we include the details of the algorithm here, as summarized in Algorithm 6, and the derivations are given in Appendix C. (Another related work is [44], which studies the behavior of the minimum mean square error (MMSE) associated with the reconstruction of a signal drawn from a GMM as a function of the properties of the linear measurement kernel and the Gaussian mixture, i.e., whether the MMSE converges or does not converge to zero as the noise vanishes.)

An alternative heuristic for sensing GMM signals is the so-called greedy heuristic, which is also mentioned in [32]. The heuristic picks the Gaussian component with the highest posterior $\pi_c$ at that moment, and chooses the next measurement $a$ to be its eigenvector associated with the maximum eigenvalue, as summarized in Algorithm 6. The greedy heuristic is not Info-Greedy, but it can be implemented more efficiently compared to the gradient descent approach. The following theorem establishes a simple upper bound on the number of required measurements to recover a GMM signal using the greedy heuristic with small error. The analysis is based on the well-known multiplicative weight update method (see e.g., [45]) and utilizes a simple reduction argument showing that when the variance of every component has been reduced sufficiently to ensure a low error recovery with probability $p$, we can learn (a mix of) the right component(s) with few extra measurements.
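A single step of the greedy heuristic described above might look as follows. This is a hedged sketch, not the paper's Algorithm 6: it assumes white noise added after the measurement ($y = a^\intercal x + w$ with scalar variance `sigma2`), a unit-norm sensing vector, and the standard Bayesian update of component weights by the Gaussian likelihood of $y$. All names are hypothetical.

```python
import numpy as np

def gmm_greedy_step(x, weights, means, covs, sigma2, rng):
    """One greedy-heuristic step: measure along the leading eigenvector of the
    most probable component, then update every component and the weights."""
    c_star = int(np.argmax(weights))                 # component with highest posterior
    _, eigvecs = np.linalg.eigh(covs[c_star])
    a = eigvecs[:, -1]                               # its leading eigenvector
    y = float(a @ x) + rng.normal(0.0, np.sqrt(sigma2))

    new_weights, new_means, new_covs = [], [], []
    for pi_c, mu_c, Sigma_c in zip(weights, means, covs):
        Sa = Sigma_c @ a
        s = float(a @ Sa) + sigma2                   # predictive variance of y
        resid = y - float(a @ mu_c)
        # Gaussian likelihood of y under component c
        lik = np.exp(-0.5 * resid**2 / s) / np.sqrt(2.0 * np.pi * s)
        new_weights.append(pi_c * lik)
        new_means.append(mu_c + Sa * resid / s)      # per-component posterior mean
        new_covs.append(Sigma_c - np.outer(Sa, Sa) / s)
    new_weights = np.array(new_weights)
    new_weights /= new_weights.sum()                 # renormalise posterior weights
    return a, y, new_weights, new_means, new_covs
```

Repeating this step and reporting the weighted posterior mean would give a complete estimator; a production version would compute the likelihoods in log space to avoid underflow.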

###### Theorem IV.5 (Upper bound on m of greedy heuristic algorithm for GMM).

Consider a GMM signal parameterized in (13). Let be the required number of measurements (or power) to ensure with probability for a Gaussian signal corresponding to component for all . Then we need at most

$$\left(\sum_{c \in C} m_c\right) + \Theta\left(\frac{1}{\tilde{\eta}} \ln |C|\right)$$

measurements (or power) to ensure when sampling from the posterior distribution of with probability .

###### Remark IV.6.

In the high noise case, i.e., when SNR is low, Info-Greedy measurements can be approximated easily. Let denote the random variable indicating the class where the signal is sampled from. Then . Hence, the Info-Greedy measurement should be the leading eigenvector of the average covariance matrix with the posterior weights.
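The low-SNR rule in the remark can be sketched in a few lines. The sketch reads "average covariance matrix with the posterior weights" as $\sum_c \pi_c \Sigma_c$, which is an assumption; the remark's own formula is not legible in this text. The function name is hypothetical.

```python
import numpy as np

def low_snr_measurement(weights, covs):
    """Leading eigenvector of the posterior-weighted average covariance,
    taken here as sum_c pi_c * Sigma_c (an assumed reading of the remark)."""
    avg_cov = sum(w * S for w, S in zip(weights, covs))
    _, eigvecs = np.linalg.eigh(avg_cov)   # eigenvalues in ascending order
    return eigvecs[:, -1]
```

As the posterior weights concentrate on one component, this direction approaches that component's own leading eigenvector.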

## V Sparse measurement vector

In various applications, we are interested in finding a sparse measurement vector $a$. With such a requirement, we can add a cardinality constraint on $a$ in the Info-Greedy Sensing formulation: $\|a\|_0 \le k_0$, where $k_0$ is the number of non-zero entries we allow for the vector $a$. This is a non-convex integer program with a non-linear cost function, which can be solved by outer approximation [46, 47]. The idea of outer approximation is to generate a sequence of cutting planes to approximate the cost function via its subgradient and iteratively include these cutting planes as constraints in the original optimization problem. In particular, we initialize by solving the following optimization problem

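$$\begin{array}{ll} \underset{a, r, z}{\text{maximize}} & z \\ \text{subject to} & \sum_{i=1}^{n} r_i \le k_0 \\ & a_i \le r_i, \; -a_i \le r_i \\ & 0 \le z \le c, \\ & r_i \in \{0, 1\}, \; i = 1, \ldots, n \\ & a \in \mathbb{R}^n, \; z \in \mathbb{R}, \end{array} \tag{14}$$

where $r$ and $z$ are introduced auxiliary variables, and $c$ is a user-specified upper bound that bounds the cost function over the feasible region. The constraints of the above optimization problem can be cast into matrix-vector form as follows: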

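$$F_0 \triangleq \begin{bmatrix} 1_{1 \times n} & 0_{1 \times n} & 0 \\ -I_n & I_n & 0_{n \times 1} \\ -I_n & -I_n & 0_{n \times 1} \\ 0_{1 \times n} & 0_{1 \times n} & 1 \\ 0_{1 \times n} & 0_{1 \times n} & -1 \end{bmatrix}, \quad g_0 \triangleq \begin{bmatrix} k_0 \\ 0_{2n \times 1} \\ c \\ 0 \end{bmatrix}$$

such that $F_0 \begin{bmatrix} r \\ a \\ z \end{bmatrix} \le g_0$.

The mixed-integer linear program formulated in (14) can be solved efficiently by standard software such as GUROBI. In the next iteration, the solution $a_*$ to this optimization problem will be used to generate a new cutting plane, which we include in the original problem by appending a row to $F_\ell$ and adding an entry to $g_\ell$ as follows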

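$$F_{\ell+1} = \begin{bmatrix} F_\ell \\ \begin{matrix} 0 & -(\nabla f(a_*))^\intercal & 1 \end{matrix} \end{bmatrix}, \tag{15}$$

$$g_{\ell+1} = \begin{bmatrix} g_\ell \\ f(a_*) - a_*^\intercal \nabla f(a_*) \end{bmatrix}, \tag{16}$$

where $f$ is the non-linear cost function in the original problem. For a Gaussian signal $x$, the cost function and its gradient take the form of: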

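$$f(a) = \frac{1}{2} \log\left(\frac{a^\intercal \Sigma a}{\sigma^2} + 1\right), \quad \nabla f(a) = \frac{1}{a^\intercal \Sigma a + \sigma^2} \Sigma a. \tag{17}$$

By repeating iterations as above, we can find a measurement vector with sparsity $k_0$ which is approximately Info-Greedy.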
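The ingredients of one cutting-plane update can be sketched as below: the cost and gradient from (17), and the pair $(\nabla f(a_*),\, f(a_*) - a_*^\intercal \nabla f(a_*))$ that (15) and (16) append to $F_\ell$ and $g_\ell$. The mixed-integer solve itself is omitted, since it depends on the solver used, and the function names are hypothetical.

```python
import numpy as np

def info_cost(a, Sigma, sigma2):
    """f(a) = 0.5 * log(a^T Sigma a / sigma^2 + 1), as in (17)."""
    return 0.5 * np.log(float(a @ Sigma @ a) / sigma2 + 1.0)

def info_cost_grad(a, Sigma, sigma2):
    """Gradient of f from (17): Sigma a / (a^T Sigma a + sigma^2)."""
    Sa = Sigma @ a
    return Sa / (float(a @ Sa) + sigma2)

def cutting_plane(a_star, Sigma, sigma2):
    """Return (g, b) describing the cut z - g^T a <= b at a_star,
    i.e. the new row [0, -g^T, 1] of F and the new entry b of g."""
    g = info_cost_grad(a_star, Sigma, sigma2)
    b = info_cost(a_star, Sigma, sigma2) - float(a_star @ g)
    return g, b
```

At $a = a_*$ the cut is tight, $b + g^\intercal a_* = f(a_*)$, which is a quick sanity check for an implementation.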

## VI Numerical examples

### VI-A Simulated examples

#### VI-A1 Low-rank Gaussian model

First, we examine the performance of Info-Greedy Sensing for Gaussian signal. The dimension of the signal is , and we set the probability of recovery , the noise standard deviation . The signal mean vector , where the covariance matrix is generated as , has each entry i.i.d. , and the operator thresholds eigenvalues of a matrix that are smaller than 0.7 to be zero. The error tolerance (represented as dashed lines in the figures). For the white noise case, we set , and for the colored noise case, and the noise covariance matrix is generated randomly as

for a random matrix

with entries i.i.d. . The number of measurements is determined from Theorem IV.1 and Theorem IV.2. We run the algorithm over 1000 random instances. Fig. 2 demonstrates the ordered recovery error , as well as the ordered number of measurements calculated from the formulas, for the white and colored noise cases, respectively. Note that in both the white noise and colored noise cases, the errors for Info-Greedy Sensing can be two orders of magnitude lower than the errors obtained from measurement using Gaussian random vectors, and the errors fall below our desired tolerance using the theoretically calculated .

When the assumed covariance matrix for the signal is equal to its true covariance matrix, Info-Greedy Sensing is identical to the batch method [32] (the batch method measures using the largest eigenvectors of the signal covariance matrix). However, when there is a mismatch between the two, Info-Greedy Sensing outperforms the batch method due to adaptivity, as shown in Fig. 3. For Gaussian signals, the complexity of the batch method is (due to eigendecomposition), whereas the complexity of the Info-Greedy Sensing algorithm is on the order of , where is the number of iterations needed to compute the eigenvector associated with the largest eigenvalue (e.g., using the power method), and is the number of measurements, which is typically on the order of .
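The per-measurement cost quoted above comes from one dominant-eigenvector computation followed by a rank-one posterior update. The following sketch illustrates that loop under simplifying assumptions that are not taken from the paper: unit-norm measurement vectors (no power allocation), white noise with standard deviation `sigma`, and the standard Gaussian conditioning formulas. The function names are hypothetical.

```python
import numpy as np

def power_method(Sigma, iters=200, seed=0):
    """Approximate the eigenvector of the largest eigenvalue of a PSD matrix."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(Sigma.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = Sigma @ v
        norm = np.linalg.norm(w)
        if norm == 0.0:          # Sigma annihilates v; nothing left to learn
            break
        v = w / norm
    return v

def info_greedy_gaussian(x, mu, Sigma, sigma, num_meas, rng):
    """Sequentially measure x along the dominant posterior eigenvector."""
    mu, Sigma = mu.copy(), Sigma.copy()
    for _ in range(num_meas):
        a = power_method(Sigma)                      # next measurement vector
        y = a @ x + sigma * rng.standard_normal()    # noisy linear measurement
        Sa = Sigma @ a
        gain = Sa / (a @ Sa + sigma ** 2)
        mu = mu + gain * (y - a @ mu)                # posterior mean
        Sigma = Sigma - np.outer(gain, Sa)           # posterior covariance
    return mu, Sigma
```

Because the posterior covariance is recomputed after every measurement, the next direction adapts to what has already been observed, which is the mechanism behind the robustness to covariance mismatch reported in Fig. 3.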

We also try larger examples. Fig. 4 shows the performance of Info-Greedy Sensing for a signal of dimension 1000 and with dense and low-rank (approximately of non-zero eigenvalues). Another interesting case is shown in Fig. 5, where and is rank 3 and very sparse: only about of the entries of are non-zero. In this case Info-Greedy Sensing is able to recover the signal with high precision using only measurements. This shows the potential value of Info-Greedy Sensing for big data.

#### VI-A2 Low-rank GMM model

In this example we consider a GMM model with components, and each Gaussian component is generated as a single Gaussian component described in the previous example, Section VI-A1 ( and ). The true prior distribution is for the three components (hence each time the signal is drawn from one component with these probabilities), and the assumed prior distribution for the algorithms is uniform: each component has probability . The parameters for the gradient descent approach are: step size and the error tolerance to stop the iteration . Fig. 6 shows the estimated cumulative mutual information and the mutual information in a single step, averaged over 100 Monte Carlo trials; the gradient descent based approach has higher information gain than the greedy heuristic, as expected. Fig. 7 shows the ordered errors for the batch method based on mutual information gradient [32], the greedy heuristic, and the gradient descent approach, when and , respectively. Note that the Info-Greedy Sensing approaches (greedy heuristic and gradient descent) outperform the batch method due to adaptivity, and that the simpler greedy heuristic performs fairly well compared with the gradient descent approach. For GMM signals, the complexity of the batch method is (due to eigendecomposition of components), whereas the complexity of the Info-Greedy Sensing algorithm is on the order of , where is the number of iterations needed to compute the eigenvector associated with the largest eigenvalue (e.g., using the power method), and is the number of measurements, which is typically on the order of .
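To make the role of the assumed prior concrete, the sketch below shows one Bayesian update of a GMM after a scalar measurement $y = a^\intercal x + w$: the component weights are re-scored by the Gaussian likelihood of $y$, and each component is conditioned on $y$. The update itself is standard; `greedy_direction` is only an assumed form of a greedy heuristic (measure along the dominant eigenvector of the currently most likely component) and may differ from the heuristic evaluated here. All function names are hypothetical.

```python
import numpy as np

def update_gmm(weights, means, covs, a, y, sigma):
    """Posterior weights, means and covariances after observing y = a^T x + w."""
    log_w, new_means, new_covs = [], [], []
    for pi_c, mu_c, S_c in zip(weights, means, covs):
        Sa = S_c @ a
        var_y = a @ Sa + sigma ** 2          # predictive variance of y under component c
        resid = y - a @ mu_c
        log_lik = -0.5 * (np.log(2 * np.pi * var_y) + resid ** 2 / var_y)
        log_w.append(np.log(pi_c) + log_lik)
        gain = Sa / var_y
        new_means.append(mu_c + gain * resid)
        new_covs.append(S_c - np.outer(gain, Sa))
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())          # subtract max for numerical stability
    return w / w.sum(), new_means, new_covs

def greedy_direction(weights, covs):
    """Assumed heuristic: dominant eigenvector of the most likely component."""
    c = int(np.argmax(weights))
    _, eigvecs = np.linalg.eigh(covs[c])     # eigenvalues returned in ascending order
    return eigvecs[:, -1]
```

Starting from uniform weights, as in the mismatched setting above, repeated calls to `update_gmm` let the data pull the weights toward the component that actually generated the signal.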

#### VI-A3 Sparse Info-Greedy Sensing

Consider designing a sparse Info-Greedy Sensing vector for a single Gaussian signal with , desired sparsity of measurement vector , and a low-rank covariance matrix generated as before by thresholding eigenvalues. Fig. 8(a) shows the pattern of non-zero entries from measurement 1 to 5. Fig. 8(b) compares the performance with that of randomly selecting 5 non-zero entries. The sparse Info-Greedy Sensing algorithm outperforms the random approach and does not degrade much relative to the non-sparse Info-Greedy Sensing.

### VI-B Real data

#### VI-B1 MNIST handwritten dataset

We examine the performance of GMM Info-Greedy Sensing on the MNIST handwritten digit dataset. In this example, since the true labels of the training data are known, we can use the training data to estimate the true prior distribution , and (there are classes of Gaussian components, each corresponding to one digit) using 10,000 training pictures of handwritten digits of dimension 28 by 28. The images are vectorized and hence , and the digit can be recognized using its highest posterior after sequential measurements. Fig. 9 demonstrates an instance of a recovered image (true label is 2) using sequential measurements, for the greedy heuristic and the gradient descent approach, respectively. In this instance, the greedy heuristic classifies the image erroneously as 6, and the gradient descent approach correctly classifies the image as 2. Table I shows the probability of false classification for the testing data, where the random approach is where are normalized random Gaussian vectors. Again, the greedy heuristic has good performance compared to the gradient descent method.

#### VI-B2 Recovery of power consumption vector

We consider recovery of a power consumption vector for 58 counties in California. Data for power consumption in these counties from year 2006 to year 2012 are available. We first fit a single Gaussian model using data from year 2006 to 2011 (Fig. 10(a); the probability plot demonstrates that a Gaussian is a reasonably good fit to the data), and then test the performance of Info-Greedy Sensing in recovering the data vector of year 2012. Fig. 10(b) shows that even with a coarse estimate of the covariance matrix from limited data (5 samples), Info-Greedy Sensing can perform better than the random algorithm. This example has some practical implications: the compressed measurements here correspond to collecting the total power consumption over a region of the power network. This collection process can be achieved automatically by new technologies such as the wireless sensor network platform using embedded RFID in [2]; hence, Info-Greedy Sensing may be an efficient solution for monitoring the power consumption of each node in a large power network.
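The two ingredients of this experiment, fitting a Gaussian from a handful of yearly vectors and measuring an aggregate over a region, can be sketched as follows. This is an illustration only: the helper names are hypothetical, the region is an arbitrary index set, and any data passed in would stand in for the county-level records, which are not reproduced here.

```python
import numpy as np

def fit_gaussian(samples):
    """Estimate mean and covariance from rows of `samples` (one row per year)."""
    mu = samples.mean(axis=0)
    Sigma = np.cov(samples, rowvar=False)   # unbiased sample covariance
    return mu, Sigma

def region_measurement(x, region_idx, sigma, rng):
    """Measure the total consumption over a set of nodes, plus Gaussian noise."""
    a = np.zeros(x.size)
    a[region_idx] = 1.0                     # indicator vector of the region
    y = a @ x + sigma * rng.standard_normal()
    return a, y
```

With $m$ yearly samples the estimated covariance has rank at most $m - 1$, so the fitted model is low-rank by construction, which is the regime where a few well-chosen measurements carry most of the information.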

## VII Conclusion

We have presented a general framework for sequential adaptive compressed sensing, Info-Greedy Sensing, which is based on maximizing the mutual information between the measurement and the signal model conditioned on previous measurements. Our results demonstrate that adaptivity helps when prior distributional information about the signal is available, and that Info-Greedy Sensing is an efficient tool for exploiting such prior information, as in the case of GMM signals. Adaptivity also brings robustness when there is a mismatch between the assumed and true distributions, and we have demonstrated such benefits for Gaussian signals. Moreover, Info-Greedy Sensing shows significant improvement over random projection for signals with sparse and low-rank covariance matrices, which demonstrates the potential value of Info-Greedy Sensing for big data.

## References

• [1] D. J. Brady, Optical imaging and spectroscopy. Wiley-OSA, April 2009.
• [2] W. Boonsong and W. Ismail, “Wireless monitoring of household electrical power meter using embedded RFID with wireless sensor network platform,” Int. J. Distributed Sensor Networks, Article ID 876914, 10 pages, vol. 2014, 2014.
• [3] B. Zhang, X. Cheng, N. Zhang, Y. Cui, Y. Li, and Q. Liang, “Sparse target counting and localization in sensor networks based on compressive sensing,” in IEEE Int. Conf. Computer Communications (INFOCOM), pp. 2255 – 2258, 2014.
• [4] E. J. Candès and T. Tao, “Near-optimal signal recovery from random projections: Universal encoding strategies?,” IEEE Trans. Info. Theory, vol. 52, pp. 5406–5425, Dec. 2006.
• [5] D. Donoho, “Compressed sensing,” IEEE Trans. Inform. Theory, vol. 52, pp. 1289–1306, Apr. 2006.
• [6] Y. C. Eldar and G. Kutyniok, eds., Compressed sensing: theory and applications. Cambridge University Press Cambridge, UK, 2012.
• [7] J. Haupt, R. M. Castro, and R. Nowak, “Distilled sensing: adaptive sampling for sparse detection and estimation,” IEEE Trans. Info. Theory, vol. 57, pp. 6222–6235, Sept. 2011.
• [8] D. Wei and A. O. Hero, “Multistage adaptive estimation of sparse signals,” IEEE J. Sel. Topics Sig. Proc., vol. 7, pp. 783 – 796, Oct. 2013.
• [9] D. Wei and A. O. Hero, “Performance guarantees for adaptive estimation of sparse signals,” arXiv:1311.6360v1, 2013.
• [10] M. L. Malloy and R. Nowak, “Sequential testing for sparse recovery,” IEEE Trans. Info. Theory, vol. 60, no. 12, pp. 7862 – 7873, 2014.
• [11] E. Arias-Castro, E. J. Candès, and M. A. Davenport, “On the fundamental limits of adaptive sensing,” IEEE Trans. Info. Theory, vol. 59, pp. 472–481, Jan. 2013.
• [12] P. Indyk, E. Price, and D. P. Woodruff, “On the power of adaptivity in sparse recovery,” in IEEE Foundations of Computer Science (FOCS), Oct. 2011.
• [13] M. L. Malloy and R. Nowak, “Near-optimal adaptive compressed sensing,” arXiv:1306.6239v1, 2013.
• [14] C. Aksoylar and V. Saligrama, “Information-theoretic bounds for adaptive sparse recovery,” arXiv:1402.5731v2, 2014.
• [15] G. Yu and G. Sapiro, “Statistical compressed sensing of Gaussian mixture models,” IEEE Trans. Sig. Proc., vol. 59, pp. 5842 – 5858, Dec. 2011.
• [16] S. Ji, Y. Xue, and L. Carin, “Bayesian compressive sensing,” IEEE Trans. Sig. Proc., vol. 56, no. 6, pp. 2346–2356, 2008.
• [17] J. Haupt, R. Nowak, and R. Castro, “Adaptive sensing for sparse signal recovery,” in IEEE 13th Digital Signal Processing Workshop and 5th IEEE Signal Processing Education Workshop (DSP/SPE), pp. 702 – 707, 2009.
• [18] M. A. Davenport and E. Arias-Castro, “Compressive binary search,” arXiv:1202.0937v2, 2012.
• [19] A. Tajer and H. V. Poor, “Quick search for rare events,” arXiv:1210.2406v1, 2012.
• [20] D. Malioutov, S. Sanghavi, and A. Willsky, “Sequential compressed sensing,” IEEE J. Sel. Topics Sig. Proc., vol. 4, pp. 435–444, April 2010.
• [21] J. Haupt, R. Baraniuk, R. Castro, and R. Nowak, “Sequentially designed compressed sensing,” in Proc. IEEE/SP Workshop on Statistical Signal Processing, 2012.
• [22] S. Jain, A. Soni, and J. Haupt, “Compressive measurement designs for estimating structured signals in structured clutter: A Bayesian experimental design approach,” arXiv:1311.5599v1, 2013.
• [23] A. Krishnamurthy, J. Sharpnack, and A. Singh, “Recovering graph-structured activations using adaptive compressive measurements,” in Annual Asilomar Conference on Signals, Systems, and Computers, Sept. 2013.
• [24] E. Tánczos and R. Castro, “Adaptive sensing for estimation of structure sparse signals,” arXiv:1311.7118, 2013.
• [25] A. Soni and J. Haupt, “On the fundamental limits of recovering tree sparse vectors from noisy linear measurements,” IEEE Trans. Info. Theory, vol. 60, no. 1, pp. 133–149, 2014.
• [26] A. Ashok, P. Baheti, and M. A. Neifeld, “Compressive imaging system design using task-specific information,” Applied Optics, vol. 47, no. 25, pp. 4457–4471, 2008.
• [27] J. Ke, A. Ashok, and M. Neifeld, “Object reconstruction from adaptive compressive measurements in feature-specific imaging,” Applied Optics, vol. 49, no. 34, pp. 27–39, 2010.
• [28] A. Ashok and M. A. Neifeld, “Compressive imaging: hybrid measurement basis design,” J. Opt. Soc. Am. A, vol. 28, no. 6, pp. 1041– 1050, 2011.
• [29] M. Seeger, H. Nickisch, R. Pohmann, and B. Schoelkopf, “Optimization of k-space trajectories for compressed sensing by Bayesian experimental design,” Magnetic Resonance in Medicine, 2010.
• [30] R. Waeber, P. Frazier, and S. G. Henderson, “Bisection search with noisy responses,” SIAM J. Control and Optimization, vol. 51, no. 3, pp. 2261–2279, 2013.
• [31] J. M. Duarte-Carvajalino, G. Yu, L. Carin, and G. Sapiro, “Task-driven adaptive statistical compressive sensing of Gaussian mixture models,” IEEE Trans. Sig. Proc., vol. 61, no. 3, pp. 585–600, 2013.
• [32] W. Carson, M. Chen, R. Calderbank, and L. Carin, “Communication inspired projection design with application to compressive sensing,” SIAM J. Imaging Sciences, 2012.
• [33] L. Wang, D. Carlson, M. D. Rodrigues, D. Wilcox, R. Calderbank, and L. Carin, “Designed measurements for vector count data,” in Neural Information Processing Systems Foundation (NIPS), 2013.
• [34] L. Wang, A. Razi, M. Rodrigues, R. Calderbank, and L. Carin, “Nonlinear information-theoretic compressive measurement design,” in Proc. 31st Int. Conf. Machine Learning (ICML), 2014.
• [35] G. Braun, C. Guzmán, and S. Pokutta, “Unifying lower bounds on the oracle complexity of nonsmooth convex optimization,” arXiv:1407.5144, 2014.
• [36] E. Arias-Castro and Y. C. Eldar, “Noise folding in compressed sensing,” IEEE Signal Processing Letters, vol. 18, pp. 478 – 481, June 2011.
• [37] M. Chen, Bayesian and Information-Theoretic Learning of High Dimensional Data. PhD thesis, Duke University, 2013.
• [38] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications. Cambridge Univ Press, 2012.
• [39] M. Payaró and D. P. Palomar, “Hessian and concavity of mutual information, entropy, and entropy power in linear vector Gaussian channels,” IEEE Trans. Info. Theory, pp. 3613–3628, Aug. 2009.
• [40] M. A. Iwen and A. H. Tewfik, “Adaptive group testing strategies for target detection and localization in noisy environment,” Institute for Mathematics and Its Applications (IMA) Preprint Series # 2311, 2010.
• [41] Y. Xie, Y. C. Eldar, and A. Goldsmith, “Reduced-dimension multiuser detection,” IEEE Trans. Info. Theory, vol. 59, pp. 3858 – 3874, June 2013.
• [42] L. Trevisan, Lecture notes for CS359G: Graph Partitioning and Expanders. Stanford University, Stanford, CA, 2011.
• [43] D. Palomar and S. Verdú, “Gradient of mutual information in linear vector Gaussian channels,” IEEE Trans. Info. Theory, pp. 141–154, 2006.
• [44] F. Renna, R. Calderbank, L. Carin, and M. R. D. Rodrigues, “Reconstruction of signals drawn from a Gaussian mixture via noisy compressive measurements,” IEEE Trans. Sig. Proc., arXiv:1307.0861. To appear.
• [45] S. Arora, E. Hazan, and S. Kale, “The multiplicative weights update method: A meta-algorithm and applications,” Theory of Computing, vol. 8, no. 1, pp. 121–164, 2012.
• [46] M. A. Duran and I. E. Grossmann, “An outer-approximation algorithm for a class of mixed-integer nonlinear programs,” Math. Programming, vol. 36, no. 3, pp. 307 – 339, 1986.
• [47] A. Schrijver, Theory of linear and integer programming. Wiley, 1986.
• [48] J. C. Duchi and M. J. Wainwright, “Distance-based and continuum Fano inequalities with applications to statistical estimation,” arXiv:1311.2669v2, 2013.
• [49] J. M. Leiva-Murillo and A. Artes-Rodriguez, “A Gaussian mixture based maximization of mutual information for supervised feature extraction,” Lecture Notes in Computer Science, Independent Component Analysis and Blind Signal Separation, vol. 3195, pp. 271 – 278, 2004.

## Appendix A General performance lower bounds

In the following we establish a general lower bound on the number of sequential measurements needed to obtain a certain small recovery error , similar to the approach in [35]. We consider the following model: measurements are performed sequentially, and performance is measured by the number of measurements required to obtain a reconstruction of the signal with a prescribed accuracy. Assume the sequential measurements are linear and the measurement returns . Formally, let be a finite family of signals of interest, and be a random variable with uniform distribution on . Denote by the sequence of measurements, and the sequence of measurement values: . Let denote the transcript of the measurement operations and a single measurement/value pair. Note that is a random variable of the picked signal . Assume that the accuracy is high enough to ensure a one-to-one correspondence between signal and the -ball it is contained in. Thus we can return the center of such an -ball as the reconstruction of . In this regime, an -recovery of a signal is (information-theoretically) equivalent to learning the -ball that is contained in, and we can invoke the reconstruction principle

$$
\mathbb{I}[F;\Pi] = \mathbb{H}[F] = \log|\mathcal{F}|, \tag{18}
$$

i.e., the transcript has to contain the same information as and in fact uniquely identify it. With this model it was shown in [35] that the total amount of information acquired, , is equal to the sum of the conditional information per iteration:

###### Theorem A.1 ([35]).
$$
\mathbb{I}[F;\Pi] = \sum_{i=1}^{\infty} \underbrace{\mathbb{H}\left[y_i \mid a_i, \Pi^{i-1}, M \ge i\right]}_{\text{information gain by measurement } i}\; \mathbb{P}[M \ge i], \tag{19}
$$

where $\Pi^{i-1}$ is a shorthand for the transcript of the first $i-1$ measurement/value pairs and $M$ is the random variable of required measurements.

We will use Theorem A.1 to establish Lemma III.4, which states that the bisection algorithm is Info-Greedy for $k$-sparse signals. A priori, Theorem A.1 does not give a bound on the expected number of required measurements; it only characterizes how much information the sensing algorithm learns from each measurement. However, if we can upper bound the information acquired in each measurement by some constant, this leads to a lower bound on the expected number of measurements, as well as a high-probability lower bound:

###### Corollary A.2 (Lower bound on number of measurements).

Suppose that for some constant ,

$$
\mathbb{H}\left[y_i \mid a_i, \Pi^{i-1}, M \ge i\right] \le C
$$

for every round where is as above. Then . Moreover, for all we have and .

The information-theoretic approach also lends itself to lower bounds on the number of measurements for Gaussian signals, as, e.g., in [48, Corollary 4].
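The reasoning behind Corollary A.2 can be sketched in one chain of inequalities. Assuming the corollary bounds the expected number of measurements $\mathbb{E}[M]$, combining the reconstruction principle (18), the decomposition (19), and the per-round cap $C$ gives

$$
\log|\mathcal{F}| = \mathbb{I}[F;\Pi] = \sum_{i=1}^{\infty} \mathbb{H}\left[y_i \mid a_i, \Pi^{i-1}, M \ge i\right] \mathbb{P}[M \ge i] \le C \sum_{i=1}^{\infty} \mathbb{P}[M \ge i] = C\,\mathbb{E}[M],
$$

so that $\mathbb{E}[M] \ge \log|\mathcal{F}| / C$. The last equality is the tail-sum formula for a random variable taking positive integer values. As a simple illustration, if $|\mathcal{F}| = 1024$ and each measurement reveals at most one bit ($C = \log 2$), then at least $\log 1024 / \log 2 = 10$ measurements are needed in expectation.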

## Appendix B Derivation of Gaussian signal measured with colored noise

First consider the case when colored noise is added after the measurement: , . In the following, we assume the noise covariance matrix is full rank. Note that we can write . Let the eigendecomposition of the noise covariance matrix be , and define a constant vector . So the variance of is given by . Re-parameterize by introducing a unitary matrix : . Also let the eigendecomposition of be . Then the mutual information of and can be written as

 \mutualInfoxy1=12ln(a⊺1Σa1e⊺1Σwe1+1)=12ln(β1∥b∥22⋅b⊺R⊺ΣRbb⊺b+1)=12ln⎛⎝β1∥b∥22⋅e⊺1U