I Introduction
Ubiquitous big data applications (image processing [1], power network monitoring [2], and large-scale sensor networks [3]) call for more efficient information sensing techniques. Often these techniques are sequential in that the measurements are taken one after another. Hence information gained in the past can be used to guide an adaptive design of subsequent measurements, which naturally leads to the notion of sequential adaptive sensing. At the same time, a path to efficient sensing of big data is compressive sensing [4, 5, 6], which exploits low-dimensional structures to recover signals from a number of measurements much smaller than the ambient dimension of the signals.
Early compressed sensing works mainly focus on non-adaptive and one-shot measurement schemes. Recently there has also been much interest in sequential adaptive compressed sensing, which measures noisy linear combinations of the entries (this is different from direct adaptive sensing, which measures signal entries directly [7, 8, 9, 10]). Although the seminal work of [11] showed under fairly general assumptions that "adaptivity does not help much", i.e., sequential adaptive compressed sensing does not improve the order of the minimax bounds obtained by non-adaptive algorithms, these limitations are restricted to certain performance metrics. It has also been recognized (see, e.g., [12, 13, 14]) that adaptive compressed sensing offers several benefits with respect to other performance metrics, such as a reduction in the signal-to-noise ratio (SNR) needed to recover the signal. Moreover, a larger performance gain can be achieved by adaptive compressed sensing if we aim at recovering a "family" of signals with known statistical prior information (incorporating statistical priors in compressed sensing has been considered in [15] for the non-sequential setting and in [16] for the sequential setting using Bayesian methods).
To harvest the benefits of adaptive compressed sensing, various algorithms have been developed: compressive binary search [17, 18], which considers the problem of determining the location of a single nonzero entry; a variant of the iterative bisection algorithm [19] to adaptively identify the partial support of the signal; random choice of compressed sensing vectors [20], and a collection of independent structured random sensing matrices in each measurement step [21] with some columns "masked" to zero; an experimental design approach [22] that designs measurements adaptive to the mean square error of the estimated signal; exploiting additional graphical structure of the signal [23, 24]; the CASS algorithm [13], which is based on bisection search to locate multiple nonzero entries, and is claimed to be near-optimal in the number of measurements needed sequentially to achieve small recovery errors; and an adaptive sensing strategy specifically tailored to tree-sparse signals [25] that significantly outperforms non-adaptive sensing strategies. In the optics literature, compressive imaging systems with sequential measurement architectures have been developed [26, 27, 28], which may modify the measurement basis based on specific object information derived from the previous measurements and thereby achieve better performance. In the medical imaging literature, [29] uses Bayesian experimental design to optimize k-space sampling for nonlinear sparse MRI reconstruction. The idea of using an information measure for sequential compressed sensing has been spelled out in various places for specific settings or signal models, for example: the seminal Bayesian compressive sensing work [16], which designs a new projection that minimizes the differential entropy of the posterior estimate of a Gaussian signal; [6, Chapter 6.2] and [30], which introduce the so-called "expected information" and outline a general strategy for sequential adaptive sensing; [31], which develops a two-step adaptive statistical compressed sensing scheme for Gaussian mixture model (GMM) signals based on maximizing an information-theoretic objective function; [32], which sequentially senses low-rank GMM signals based on a posterior distribution and provides an empirical performance analysis; [33], which studies the design of linear projection measurements for a vector Poisson signal model; and [34], which designs general nonlinear functions for mapping high-dimensional data into a lower-dimensional space using mutual information as a metric.
A general belief, though, is that it is difficult to devise quantitative error bounds for such sequential information-maximizing algorithms (see, e.g., [6, Section 6.2.3]). In this work, we present a unified information-theoretic framework for sequential adaptive compressive sensing, called Info-Greedy Sensing, which greedily picks the measurement with the largest information gain given the previous measurements. More precisely, we design the next measurement to maximize the conditional mutual information between the measurement and the signal, conditioned on the previous measurements. This framework enables us to better understand existing algorithms, establish theoretical performance guarantees, and develop new algorithms. The optimization problem associated with Info-Greedy Sensing is often non-convex. In some cases the solutions can be found analytically, and in others we resort to iterative heuristics. In particular, (1) we show that the widely used bisection approach is Info-Greedy for a family of sparse signals by connecting compressed sensing and the black-box complexity of sequential query algorithms [35]; (2) we present Info-Greedy algorithms for Gaussian and Gaussian mixture model (GMM) signals under more general noise models (e.g., the "noise-folding" model [36]) than those considered in [32], and analyze their performance in terms of the number of measurements needed; (3) we also develop new sensing algorithms, e.g., for sparse sensing vectors. Numerical examples are provided to demonstrate the accuracy of the theoretical bounds and the good performance of Info-Greedy Sensing algorithms using simulated and real data. The rest of the paper is organized as follows. Section II sets up the formalism for Info-Greedy Sensing. Section III and Section IV present the Info-Greedy Sensing algorithms for sparse signals and Gaussian signals (low-rank single Gaussian and GMM), respectively. Section V discusses Info-Greedy Sensing with sparse measurement vectors. Section VI contains numerical examples using simulated and real data. Finally, Section VII concludes the paper. All proofs are relegated to the Appendix.
The notation in this paper is standard. In particular, $\mathcal{N}(\mu, \Sigma)$ denotes the Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$; $[x]_i$ denotes the $i$th coordinate of the vector $x$; we use the shorthand $[n] = \{1, \ldots, n\}$; $|S|$ denotes the cardinality of a set $S$; $\|x\|_0$ is the number of nonzeros in a vector $x$; $\|\Sigma\|$ is the spectral norm (largest eigenvalue) of a positive definite matrix $\Sigma$; $\det(\Sigma)$ is the determinant of a matrix $\Sigma$; $\mathbb{H}[x]$ denotes the entropy of a random variable $x$; and $\mathbb{I}[x; y]$ denotes the mutual information between two random variables $x$ and $y$. The column vector $e_i$ has a one in the $i$th entry and zeros elsewhere, and we use the quantile function of the chi-squared distribution with $k$ degrees of freedom.

II Formulation
A typical compressed sensing setup is as follows. Let $x \in \mathbb{R}^n$ be the unknown $n$-dimensional signal. There are $m$ measurements, and $y \in \mathbb{R}^m$ is the measurement vector depending linearly on the signal and subject to an additive noise:

$y = Ax + w,$ (1)

where $A \in \mathbb{R}^{m \times n}$ is the sensing matrix and $w \in \mathbb{R}^m$ is the noise vector. Here, each coordinate $y_i = a_i^\top x + w_i$ of $y$ is the result of measuring $x$ with the vector $a_i$ (the $i$th row of $A$) subject to an additive noise $w_i$. In the setting of sequential compressed sensing, the unknown signal $x$ is measured sequentially, so that each measurement vector may depend on the outcomes of the previous measurements.
In high-dimensional problems, various low-dimensional signal models for $x$ are in common use: (1) sparse signal models, the canonical one being $x$ having $k$ nonzero entries (in a related model the signal comes from a dictionary with few nonzero coefficients, whose support is unknown; we will not further consider this model here); (2) the low-rank Gaussian model (signal in a subspace plus Gaussian noise); and (3) the Gaussian mixture model (GMM) (a model for a signal lying in a union of multiple subspaces plus Gaussian noise), which has been widely used in image and video analysis among others (a mixture of GMM models has also been used to study sparse signals [37]; there are also other low-dimensional signal models, including general manifold models, which will not be considered here).
Compressed sensing exploits the low-dimensional structure of the signal to recover it with high accuracy using far fewer measurements than the dimension of the signal, i.e., $m \ll n$. Two central and interrelated problems in compressed sensing are signal recovery and the design of the sensing matrix $A$. Early compressed sensing works usually assume $A$ to be random, which does have benefits for universality regardless of the signal distribution. However, when there is prior knowledge about the signal distribution, one can optimize $A$ to minimize the number of measurements subject to a total sensing power constraint

$\sum_{i} \beta_i \le P,$ (2)

for some constant $P$. In the following, we either vary the power $\beta_i$ of each measurement $a_i$, or fix the measurements to unit power (for example, due to a physical constraint) and use repeated measurements $\beta_i$ times in the direction of $a_i$, which is equivalent to measuring using an integer-valued power. Here $\beta_i$ can be viewed as the amount of resource allocated to that measurement (or direction).
We will consider a methodology where $A$ is chosen to extract the most information about the signal, i.e., to maximize mutual information. In the non-sequential setting this means that $A$ maximizes the mutual information $\mathbb{I}[x; y]$ between the signal and the measurement outcome. In sequential compressed sensing, the subsequent measurement vectors can be designed using the already acquired measurements, and hence the sensing matrix $A$ can be designed row by row. Optimal sequential design of $A$ can be defined recursively and viewed as dynamic programming [38]. However, this formulation is usually intractable in all but the simplest situations (one such example is the sequential probabilistic bisection algorithm in [30], which locates a single nonzero entry). Instead, the usual approach operates in a greedy fashion. The core idea is that, based on the information that the previous measurements have extracted, the new measurement should probe in the direction that maximizes the conditional information. We formalize this idea as Info-Greedy Sensing, described in Algorithm 1. The algorithm is initialized with a prior distribution of the signal $x$, and returns the Bayesian posterior mean as an estimator for $x$. Conditional mutual information is a natural metric, as it counts only the useful new information between the signal and the potential result of the measurement, disregarding noise and what has already been learned from previous measurements.
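To make the greedy loop concrete, the following is a minimal sketch for a Gaussian prior (anticipating Section IV), where the conditional mutual information of a unit-norm measurement has the closed form $\frac{1}{2}\log(1 + \lambda/\sigma^2)$ with $\lambda$ the leading eigenvalue of the current posterior covariance. The function name and the measurement oracle are ours, for illustration only; they are not the paper's Algorithm 1.

```python
import numpy as np

def info_greedy_gaussian(mu, Sigma, measure, sigma2, eta, K):
    """Greedy loop sketch: at each step pick the direction maximizing
    conditional mutual information (for a Gaussian posterior, the leading
    eigenvector of the current covariance), then condition on the outcome."""
    for _ in range(K):
        lam, U = np.linalg.eigh(Sigma)       # ascending eigenvalues
        a, l = U[:, -1], lam[-1]             # leading eigenpair
        if 0.5 * np.log1p(l / sigma2) < eta:  # conditional MI below threshold
            break
        y = measure(a)                       # outcome of y = a^T x + w
        g = Sigma @ a / (a @ Sigma @ a + sigma2)   # Kalman-style gain
        mu = mu + g * (y - a @ mu)           # posterior mean update
        Sigma = Sigma - np.outer(g, a @ Sigma)     # posterior covariance update
    return mu, Sigma
```

A noiseless oracle `measure = lambda a: a @ x` with a small assumed `sigma2` recovers `x` to high precision after roughly rank-many measurements.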
Algorithm 1 stops either when the conditional mutual information falls below a threshold $\eta$, or when we have reached the maximum number of iterations $K$. How $\eta$ relates to the precision $\varepsilon$ depends on the specific signal model employed. For example, for a Gaussian signal the conditional mutual information is the log-determinant of the conditional covariance matrix, and hence the signal is constrained to lie in a small ellipsoid with high probability. Also note that in this algorithm the recovered signal may not reach accuracy $\varepsilon$ if it exhausts the number of iterations $K$; in the theoretical analysis we assume $K$ is sufficiently large to avoid this. Note that the optimization problem in Info-Greedy Sensing is non-convex in general [39]. Hence, we will discuss various heuristics and establish their theoretical performance in terms of the following metric:
Definition II.1 (Info-Greedy).
We call an algorithm Info-Greedy if each measurement maximizes the conditional mutual information per unit resource, $\mathbb{I}[x; y_i \mid y_1, \ldots, y_{i-1}]/\beta_i$, for each $i$, where $x$ is the unknown signal, $y_i$ is the $i$th measurement outcome, and $\beta_i$ is the amount of resource for measurement $i$.
III Sparse Signal
In this section, we consider Info-Greedy Sensing for $k$-sparse signals with arbitrary nonnegative amplitudes, in the noiseless case as well as under Gaussian measurement noise. We show that a natural modification of the bisection algorithm corresponds to Info-Greedy Sensing under a certain probabilistic model. We also show that Algorithm 2 is optimal in terms of the number of measurements for $1$-sparse signals, as well as optimal up to a multiplicative factor for $k$-sparse signals in the noiseless case. In the presence of Gaussian measurement noise, it is optimal up to at most another multiplicative factor. Finally, we show that Algorithm 2 is Info-Greedy for $1$-sparse signals, and Info-Greedy up to a multiplicative factor otherwise.
To simplify the problem, we assume the sensing matrix $A$ consists of binary entries: $a_{ij} \in \{0, 1\}$. Consider a signal $x$ with nonnegative entries and up to $k$ nonzero entries, whose locations are distributed uniformly at random. The following lemma gives an upper bound on the number of measurements for our modified bisection algorithm (see Algorithm 2) to recover such an $x$. In the description of Algorithm 2, let $\mathbf{1}_S$ denote the characteristic vector of a set $S$. The basic idea is to recursively estimate a tuple $(S, v)$ consisting of a set $S$ that contains possible locations of the nonzero elements, and the total signal amplitude $v$ in that set. We say that a signal $x$ has minimum amplitude $\tau$ if $x_i \neq 0$ implies $x_i \ge \tau$ for all $i$.
Theorem III.1 (Upper bound for $k$-sparse signals).
Lemma III.2 (Lower bound for noiseless $k$-sparse signals).
Let $x$ be a $k$-sparse signal. Then, to recover $x$ exactly, the expected number of measurements required by any algorithm is at least .
In the general case, the simple analysis that leads to Lemma III.3 fails. However, using Theorem A.1 in the Appendix we can estimate the average amount of information obtained from a measurement:
Lemma III.4 (Bisection Algorithm 2 is Info-Greedy up to a multiplicative factor in the noiseless case).
Let $x$ be as above. Then the average information of a measurement in Algorithm 2 satisfies:
Remark III.5.

(1) Here we have constrained the entries of the matrix $A$ to be binary-valued. This may correspond to applications where, for example, sensors report errors and the measurements count the total number of errors. Note, however, that if we relax this constraint and allow the entries of $A$ to be real-valued, then in the absence of noise the signal can be recovered from a single measurement that projects the signal onto a vector with appropriately chosen distinct entries.

(2) The setup here, with sparse signals and a binary measurement matrix, generalizes the group testing setup [40].

(3) The CASS algorithm [13] is another algorithm that recovers a $k$-sparse signal by iteratively partitioning the signal support into subsets, measuring the sum over each subset, and keeping the largest ones. In [13] it was shown that, to recover a $k$-sparse signal with non-uniform positive amplitudes with high probability, the number of measurements is on the order of $k\log(n/k)$ with measurements of varying power. It is important to note that the CASS algorithm allows for power allocation to mitigate noise, while we repeat measurements; this, however, coincides with the number of unit-length measurements of our algorithm in Theorem III.1 after appropriate normalization. For specific regimes of the error probability, the overhead in Theorem III.1 can be further reduced. For example, for any constant probability of error, the number of required repetitions per measurement is constant, leading to improved performance. Our algorithm can also easily be modified to incorporate power allocation.
IV Low-Rank Gaussian Models
In this section, we derive the Info-Greedy Sensing algorithms for the single low-rank Gaussian model as well as the low-rank GMM signal model, and quantify the algorithms' performance.
IV-A Single Gaussian model
Consider a Gaussian signal $x \sim \mathcal{N}(\mu, \Sigma)$ with known parameters $\mu$ and $\Sigma$, where the covariance matrix $\Sigma$ has low rank. We will consider three noise models:

(1) White Gaussian noise added after the measurement (the most common model in compressed sensing):

$y_i = \sqrt{\beta_i}\, a_i^\top x + w_i, \quad w_i \sim \mathcal{N}(0, \sigma^2), \quad \|a_i\|_2 = 1.$ (3)

Let $\beta_i$ represent the power allocated to the $i$th measurement. In this case, higher power allocated to a measurement increases the SNR of that measurement.

(2) White Gaussian noise added prior to the measurement, a model that appears in some applications such as reduced-dimension multiuser detection in communication systems [41], also known as the "noise-folding" model [36]:

$y_i = a_i^\top (x + w_i), \quad w_i \sim \mathcal{N}(0, \sigma^2 I).$ (4)

In this case, allocating higher power to a measurement cannot increase the SNR of the outcome. Hence, we use the actual number of repeated measurements in the same direction as a proxy for the amount of resource allocated to that direction.

(3) Colored Gaussian noise with covariance $\Sigma_w$, added either prior to the measurement:

$y = A(x + w), \quad w \sim \mathcal{N}(0, \Sigma_w),$ (5)

or after the measurement:

$y = Ax + w, \quad w \sim \mathcal{N}(0, \Sigma_w).$ (6)
In the following, we will establish bounds on the amount of resource (either the minimum power or the number of measurements) needed for Info-Greedy Sensing to achieve a recovery error $\|\hat{x} - x\|_2 \le \varepsilon$.
IV-A1 White noise added prior to measurement or "noise folding"
We start our discussion with this model; results for the other models can be derived similarly. As the power does not affect the SNR, we set $\beta_i = 1$. Note that the conditional distribution of $x$ given $y_1$ is a Gaussian random vector with adjusted parameters

$\mu \leftarrow \mu + \Sigma a_1 (a_1^\top \Sigma a_1 + \sigma^2)^{-1} (y_1 - a_1^\top \mu), \quad \Sigma \leftarrow \Sigma - \Sigma a_1 (a_1^\top \Sigma a_1 + \sigma^2)^{-1} a_1^\top \Sigma.$ (7)

Therefore, to find Info-Greedy Sensing for a single Gaussian signal, it suffices to characterize the first measurement and from there on iterate with the adjusted distributional parameters. For a Gaussian signal $x \sim \mathcal{N}(\mu, \Sigma)$ and the noisy measurement $y_1$ with $\|a_1\|_2 = 1$, we have

$\mathbb{I}[x; y_1] = \tfrac{1}{2} \log\left(1 + a_1^\top \Sigma a_1 / \sigma^2\right).$ (8)

Clearly, with $\|a_1\|_2 = 1$, (8) is maximized when $a_1$ corresponds to the largest eigenvector of $\Sigma$. From the above argument, the Info-Greedy Sensing algorithm for a single Gaussian signal is to choose the $a_i$ as the orthonormal eigenvectors of $\Sigma$ in decreasing order of the eigenvalues, as described in Algorithm 3. The following theorem establishes a bound on the number of measurements needed.

Theorem IV.1 (White Gaussian noise added prior to measurement or "noise folding").
Let $x \sim \mathcal{N}(\mu, \Sigma)$ and let $\lambda_1 \ge \lambda_2 \ge \cdots$ be the eigenvalues of $\Sigma$ with multiplicities. Further let $\varepsilon$ be the accuracy and $\delta$ the confidence parameter. Then Algorithm 3 recovers $\hat{x}$ satisfying $\|\hat{x} - x\|_2 \le \varepsilon$ with probability at least $1 - \delta$ using at most the following number of measurements by unit vectors $a_i$:
(9a)  
provided . If the number of measurements simplifies to  
(9b)  
This also holds when . 
IV-A2 White noise added after measurement
A key insight in the proof of Theorem IV.1 is that repeated measurements in the same eigenvector direction correspond to a single measurement in that direction with all the power summed together. This can be seen from the following discussion. After measuring in the direction of a unit-norm eigenvector $u$ of $\Sigma$ with eigenvalue $\lambda$, and using power $\beta$, the conditional covariance matrix takes the form

$\Sigma' = \frac{\lambda \sigma^2}{\beta \lambda + \sigma^2}\, u u^\top + \Sigma_\perp,$ (10)

where $\Sigma_\perp$ is the component of $\Sigma$ in the orthogonal complement of $u$. Thus, the only change in the eigendecomposition of $\Sigma$ is the update of the eigenvalue of $u$ from $\lambda$ to $\lambda \sigma^2 / (\beta \lambda + \sigma^2)$. Informally, measuring with power allocation $\beta$ on a Gaussian signal reduces the uncertainty in the direction $u$, as illustrated in Fig. 1. We have the following performance bound for sensing a Gaussian signal:
Theorem IV.2 (White Gaussian noise added after measurement).
Let $x \sim \mathcal{N}(\mu, \Sigma)$ and let $\lambda_1 \ge \lambda_2 \ge \cdots$ be the eigenvalues of $\Sigma$ with multiplicities. Further let $\varepsilon$ be the accuracy and $\delta$ the confidence parameter. Then Algorithm 3 recovers $\hat{x}$ satisfying $\|\hat{x} - x\|_2 \le \varepsilon$ with probability at least $1 - \delta$ using at most the following total power:
(11) 
provided .
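The insight behind these bounds, that repeated measurements along the same eigenvector combine like a single measurement with the summed power, can be checked numerically with the standard Gaussian conditioning update. The following is a small sketch consistent with (10); the helper name is ours.

```python
import numpy as np

def condition(Sigma, a, beta, sigma2):
    """Posterior covariance after observing y = sqrt(beta) * a^T x + w,
    w ~ N(0, sigma2), for x ~ N(mu, Sigma)."""
    Sa = Sigma @ a
    return Sigma - beta * np.outer(Sa, Sa) / (beta * (a @ Sa) + sigma2)

Sigma = np.diag([4.0, 1.0])
u = np.array([1.0, 0.0])   # leading eigenvector, eigenvalue lam = 4
sigma2 = 0.5

# Three unit-power measurements along u ...
S_rep = Sigma
for _ in range(3):
    S_rep = condition(S_rep, u, 1.0, sigma2)

# ... give the same posterior as one measurement with power 3,
# with the eigenvalue along u reduced to lam*sigma2/(3*lam + sigma2).
S_once = condition(Sigma, u, 3.0, sigma2)
```

The other eigenvalues are untouched, mirroring the rank-one structure of the update in (10).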
IV-A3 Colored noise
When a colored noise with covariance $\Sigma_w$ is added either prior to or after the measurement, similar to the white-noise cases, the conditional distribution of $x$ given the first measurement is a Gaussian random variable with adjusted parameters. Hence, as before, the measurement vectors can be found iteratively. Algorithm 3 presents Info-Greedy Sensing for this case, and the derivation is given in Appendix B; Algorithm 3 also summarizes all the Info-Greedy Sensing algorithms for Gaussian signals under the various noise models. The following version of Theorem IV.1 gives the required number of measurements for colored noise in the "noise-folding" model:
Theorem IV.3 (Colored Gaussian noise added prior to measurement or “noise folding”).
Let $x \sim \mathcal{N}(\mu, \Sigma)$ be a Gaussian signal, and let $\lambda_1 \ge \lambda_2 \ge \cdots$ denote the eigenvalues of $\Sigma$ with multiplicities. Furthermore, let $\varepsilon$ be the required accuracy. Then Algorithm 3 recovers $\hat{x}$ satisfying $\|\hat{x} - x\|_2 \le \varepsilon$ with probability at least $1 - \delta$ using at most the following number of measurements by unit vectors $a_i$:
(12) 
Remark IV.4.
(1) Under these noise models, the posterior distribution of the signal is also Gaussian, and the measurement outcome affects only its mean but not its covariance matrix (see (7)). In other words, the outcome does not affect the mutual information of the posterior Gaussian signal. In this sense, for Gaussian signals adaptivity brings no advantage when $\Sigma$ is accurate, as the measurements are predetermined by the eigenspace of $\Sigma$. However, when the knowledge of $\Sigma$ is inaccurate, adaptivity brings a benefit, as demonstrated in Section VI-A1, since a sequential update of the covariance matrix incorporates new information and "corrects" the covariance matrix when we design the next measurement.

(2) In (10) the eigenvalue $\lambda$ reduces to $\lambda \sigma^2 / (\beta \lambda + \sigma^2)$ after the first measurement. Iterating this, we see by induction that after $k$ measurements in the direction $u$ with powers $\beta_1, \ldots, \beta_k$, the eigenvalue reduces to $\lambda \sigma^2 / ((\beta_1 + \cdots + \beta_k)\lambda + \sigma^2)$, which is the same as measuring once in the direction $u$ with power $\beta_1 + \cdots + \beta_k$. Hence, measuring several times in the same direction $u$, thereby splitting the power $\beta$ into $\beta_1 + \cdots + \beta_k = \beta$, has the same effect as making one measurement with the total power $\beta$.

(3) Info-Greedy Sensing for a Gaussian signal can be implemented efficiently. Note that in the algorithm we only need to compute the leading eigenvector of the covariance matrix; moreover, the updates of the covariance matrix and the mean are simple and iterative. In particular, for a sparse $\Sigma$ with $s$ nonzero entries, the computation of the largest eigenvalue and the associated eigenvector can be implemented in $O(sp)$ time using the sparse power method [42], where $p$ is the number of power iterations. In many high-dimensional applications, $\Sigma$ is sparse if the variables (entries of $x$) are not highly correlated. Also note that the sparsity structure of the covariance matrix, as well as the correlation structure of the signal entries, is not changed by the update of the covariance matrix. This is because in (10) the update only changes the eigenvalues but not the eigenvectors. To see why this is true, let $\Sigma = \sum_j \lambda_j u_j u_j^\top$ be the eigendecomposition of $\Sigma$. By saying that the covariance matrix is sparse, we assume that the $u_j$ are sparse and, hence, the resulting covariance matrix has few nonzero entries. Therefore, updating the covariance matrix does not significantly change the number of nonzero entries. We demonstrate the scalability of Info-Greedy Sensing with larger examples in Section VI-A1.
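Since each step of the power method costs one matrix-vector product, a sparse covariance matrix directly reduces the per-measurement cost. A minimal sketch of the leading-eigenpair computation (illustrative only, not the implementation of [42]):

```python
import numpy as np

def leading_eigvec(Sigma, iters=100):
    """Power iteration for the leading eigenpair of a symmetric PSD matrix.
    Each iteration costs one matrix-vector product, i.e. O(s) for a matrix
    with s nonzero entries, so the total cost is O(s * iters)."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(Sigma.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = Sigma @ v          # the only O(s) operation per iteration
        v = w / np.linalg.norm(w)
    return v @ Sigma @ v, v    # Rayleigh quotient and eigenvector estimate
```

With a scipy sparse matrix in place of the dense array, the same loop exploits the sparsity of $\Sigma$ without any change to the logic.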
IV-B Gaussian mixture model (GMM)
The probability density function of a GMM is given by

$p(x) = \sum_{c=1}^{C} \pi_c\, \mathcal{N}(x; \mu_c, \Sigma_c),$ (13)
where $C$ is the number of classes and $\pi_c$ is the probability of samples coming from class $c$. Unlike the Gaussian case, the mutual information of a GMM cannot be written explicitly. However, for GMM signals a gradient descent approach that works for an arbitrary signal model can be used, as outlined in [32]. The derivation uses the fact that the gradient of the conditional mutual information with respect to the measurement is a linear transform of the minimum mean square error (MMSE) matrix [43, 39]. Moreover, the gradient descent approach for GMM signals exhibits structural properties that can be exploited to reduce the computational cost of evaluating the MMSE matrix, as outlined in [37, 32]. For completeness we include the details of the algorithm here, summarized in Algorithm 6, with the derivations given in Appendix C. (Another related work is [44], which studies the behavior of the MMSE associated with the reconstruction of a signal drawn from a GMM as a function of the properties of the linear measurement kernel and the Gaussian mixture, i.e., whether or not the MMSE converges to zero as the noise vanishes.)

An alternative heuristic for sensing GMM signals is the so-called greedy heuristic, also mentioned in [32]. The heuristic picks the Gaussian component with the highest posterior weight at that moment, and chooses the next measurement to be the eigenvector of that component associated with its maximum eigenvalue, as summarized in Algorithm 6. The greedy heuristic is not Info-Greedy, but it can be implemented more efficiently than the gradient descent approach. The following theorem establishes a simple upper bound on the number of measurements required to recover a GMM signal with small error using the greedy heuristic. The analysis is based on the well-known multiplicative weight update method (see, e.g., [45]) and utilizes a simple reduction argument showing that, once the variance of every component has been reduced sufficiently to ensure a low-error recovery with high probability, we can learn (a mix of) the right component(s) with few extra measurements.

Theorem IV.5 (Upper bound for the greedy heuristic algorithm for GMM).
Consider a GMM signal parameterized as in (13). Let $m_c$ be the required number of measurements (or power) to ensure $\|\hat{x} - x\|_2 \le \varepsilon$ with probability $1 - \delta$ for a Gaussian signal corresponding to component $c$, for all $c$. Then the number of measurements (or power) needed to ensure $\|\hat{x} - x\|_2 \le \varepsilon$ when sampling $\hat{x}$ from the posterior distribution of the components, with probability $1 - \delta$, exceeds these quantities by only a modest overhead.
Remark IV.6.
In the high-noise case, i.e., when the SNR is low, Info-Greedy measurements can be approximated easily. Let $c$ denote the random variable indicating the class from which the signal is sampled; at low SNR the mixture behaves approximately like a single Gaussian averaged over $c$. Hence, the Info-Greedy measurement should be the leading eigenvector of the average covariance matrix $\sum_c \pi_c \Sigma_c$ with the posterior weights $\pi_c$.
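One step of the greedy heuristic can be sketched as follows, under the measurement model (3) with unit power: measure along the top eigenvector of the currently most probable component, then condition every component's Gaussian parameters on the scalar outcome and reweight the posteriors by the likelihoods. The function and variable names are ours, for illustration, not the paper's Algorithm 6.

```python
import numpy as np

def gmm_greedy_step(pis, mus, Sigmas, measure, sigma2):
    """One greedy-heuristic step for a GMM prior (sketch)."""
    c = int(np.argmax(pis))                   # most probable component
    _, U = np.linalg.eigh(Sigmas[c])
    a = U[:, -1]                              # its leading eigenvector
    y = measure(a)                            # observe y = a^T x + w
    new_pis = []
    for k, (mu, S) in enumerate(zip(mus, Sigmas)):
        var = a @ S @ a + sigma2              # predictive variance under k
        lik = np.exp(-0.5 * (y - a @ mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        new_pis.append(pis[k] * lik)          # multiplicative weight update
        g = S @ a / var                       # condition component k on y
        mus[k] = mu + g * (y - a @ mu)
        Sigmas[k] = S - np.outer(g, a @ S)
    new_pis = np.array(new_pis)
    return new_pis / new_pis.sum(), mus, Sigmas
```

After a few steps the posterior weight concentrates on the component the signal was drawn from, and that component's mean tracks the signal.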
V Sparse measurement vector
In various applications, we are interested in finding a sparse measurement vector $a$. With such a requirement, we can add a cardinality constraint to the Info-Greedy Sensing formulation: $\|a\|_0 \le s$, where $s$ is the number of nonzero entries allowed in the vector. This is a non-convex integer program with a nonlinear cost function, which can be solved by outer approximation [46, 47]. The idea of outer approximation is to generate a sequence of cutting planes that approximate the cost function via its subgradient, and to iteratively include these cutting planes as constraints in the original optimization problem. In particular, we initialize by solving the following optimization problem

(14)

where auxiliary variables are introduced and a user-specified constant bounds the cost function over the feasible region. The constraints of the above optimization problem can be cast into matrix-vector form. The mixed-integer linear program formulated in (14) can be solved efficiently by standard software such as GUROBI (http://www.gurobi.com). In the next iteration, the solution to this optimization problem is used to generate a new cutting plane, which we include in the original problem by appending a row to the constraint matrix and an entry to the right-hand side as follows

(15)
(16)

where $f$ is the nonlinear cost function of the original problem. For a Gaussian signal, the cost function and its gradient take the form

$f(a) = -\tfrac{1}{2}\log\left(1 + a^\top \Sigma a / \sigma^2\right), \quad \nabla f(a) = -\frac{\Sigma a}{\sigma^2 + a^\top \Sigma a}.$ (17)

By repeating the iterations as above, we can find a measurement vector with sparsity $s$ that is approximately Info-Greedy.
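For the Gaussian case, the cost function and the gradient that generates each cutting plane follow directly from (8); below is a sketch with a finite-difference sanity check, assuming unit-power measurements and the negative of the mutual information as the minimization objective (function names are ours):

```python
import numpy as np

def neg_mi(a, Sigma, sigma2):
    # f(a) = -1/2 * log(1 + a^T Sigma a / sigma^2), the negative of (8)
    return -0.5 * np.log1p(a @ Sigma @ a / sigma2)

def neg_mi_grad(a, Sigma, sigma2):
    # gradient used to build each cutting plane:
    # grad f(a) = -Sigma a / (sigma^2 + a^T Sigma a)
    return -(Sigma @ a) / (sigma2 + a @ Sigma @ a)
```

Each outer-approximation iteration evaluates `neg_mi` and `neg_mi_grad` at the current MILP solution and appends the resulting linear underestimator as a new constraint.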
VI Numerical examples
VI-A Simulated examples
VI-A1 Low-rank Gaussian model
First, we examine the performance of Info-Greedy Sensing for a Gaussian signal. The signal mean vector and covariance matrix are generated randomly: the covariance matrix is obtained from a random matrix with i.i.d. standard Gaussian entries, with an operator that thresholds eigenvalues smaller than 0.7 to zero. The error tolerance $\varepsilon$ is represented as dashed lines in the figures. For the colored-noise case, the noise covariance matrix is likewise generated from a random matrix with i.i.d. standard Gaussian entries. The number of measurements is determined from Theorem IV.1 and Theorem IV.2. We run the algorithm over 1000 random instances. Fig. 2 demonstrates the ordered recovery errors $\|\hat{x} - x\|_2$, as well as the ordered number of measurements calculated from the formulas, for the white- and colored-noise cases, respectively. Note that in both cases the errors for Info-Greedy Sensing can be two orders of magnitude lower than the errors obtained from measurements using Gaussian random vectors, and the errors fall below our desired tolerance using the theoretically calculated number of measurements.

When the assumed covariance matrix of the signal equals its true covariance matrix, Info-Greedy Sensing is identical to the batch method [32] (the batch method measures using the largest eigenvectors of the signal covariance matrix). However, when there is a mismatch between the two, Info-Greedy Sensing outperforms the batch method due to adaptivity, as shown in Fig. 3. For Gaussian signals, the complexity of the batch method is dominated by a full eigendecomposition, whereas Info-Greedy Sensing only requires computing the eigenvector associated with the largest eigenvalue for each measurement (e.g., using the power method), with the number of measurements typically much smaller than $n$.
We also try larger examples. Fig. 4 demonstrates the performance of Info-Greedy Sensing for a signal of dimension 1000 with a dense and low-rank $\Sigma$ (only a small fraction of the eigenvalues are nonzero). Another interesting case is shown in Fig. 5, where $\Sigma$ is rank 3 and very sparse: only a small fraction of the entries of $\Sigma$ are nonzero. In this case Info-Greedy Sensing is able to recover the signal with high precision using only a few measurements. This shows the potential value of Info-Greedy Sensing for big data.
VI-A2 Low-rank GMM model
In this example we consider a GMM model with three components, each generated as the single Gaussian component described in the previous example in Section VI-A1. The true prior distribution over the three components is non-uniform (each time the signal is drawn from one component with these probabilities), while the assumed prior distribution for the algorithms is uniform: each component has probability $1/3$. The gradient descent approach uses a fixed step size and an error tolerance to stop the iterations. Fig. 6 demonstrates the estimated cumulative mutual information and the mutual information in a single step, averaged over 100 Monte Carlo trials; the gradient-descent-based approach has a higher information gain than the greedy heuristic, as expected. Fig. 7 shows the ordered errors for the batch method based on the mutual information gradient [32], the greedy heuristic, and the gradient descent approach, for two settings of the parameters, respectively. Note that the Info-Greedy Sensing approaches (greedy heuristic and gradient descent) outperform the batch method due to adaptivity, and that the simpler greedy heuristic performs fairly well compared with the gradient descent approach. For GMM signals, the complexity of the batch method is dominated by eigendecompositions of all components, whereas Info-Greedy Sensing only requires computing the eigenvector associated with the largest eigenvalue (e.g., using the power method) for each measurement.
VI-A3 Sparse Info-Greedy Sensing
Consider designing a sparse Info-Greedy Sensing vector for a single Gaussian signal, with the desired sparsity of the measurement vector set to 5 and the low-rank covariance matrix generated as before by thresholding eigenvalues. Fig. 8(a) shows the pattern of nonzero entries from measurement 1 to 5. Fig. 8(b) compares with the performance of randomly selecting 5 nonzero entries. The sparse Info-Greedy Sensing algorithm outperforms the random approach and does not degrade much relative to non-sparse Info-Greedy Sensing.
VI-B Real data
VI-B1 MNIST handwritten dataset
We examine the performance of GMM Info-Greedy Sensing on the MNIST handwritten digit dataset (http://yann.lecun.com/exdb/mnist/). In this example, since the true labels of the training data are known, we can use the training data to estimate the true prior distribution, the means, and the covariances of the 10 classes of Gaussian components (each corresponding to one digit) from 10,000 training images of handwritten digits of dimension 28 by 28. The images are vectorized, so the signal dimension is 784, and a digit is recognized as the class with the highest posterior after the sequential measurements. Fig. 9 demonstrates an instance of a recovered image (the true label is 2) using sequential measurements, for the greedy heuristic and the gradient descent approach, respectively. In this instance, the greedy heuristic erroneously classifies the image as 6, while the gradient descent approach correctly classifies it as 2. Table I shows the probability of false classification on the testing data, where the random approach uses normalized random Gaussian measurement vectors. Again, the greedy heuristic has good performance compared to the gradient descent method.

Method                      Random   Greedy   Gradient
prob. false classification  0.192    0.152    0.144
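The rule of classifying by the highest posterior can be sketched as follows, assuming linear measurements contaminated by white Gaussian noise (the function name and argument layout here are ours; under component c the measurement vector is Gaussian with mean A mu_c and covariance A Sigma_c A^T plus the noise covariance):

```python
import numpy as np

def classify(y, A, priors, means, covs, noise_var):
    """Pick the GMM component with the highest posterior given y = A x + w,
    w ~ N(0, noise_var * I).  Under component c, y ~ N(A mu_c,
    A Sigma_c A^T + noise_var * I)."""
    m = A.shape[0]
    log_post = []
    for pi_c, mu_c, Sig_c in zip(priors, means, covs):
        mean = A @ mu_c
        cov = A @ Sig_c @ A.T + noise_var * np.eye(m)
        diff = y - mean
        _, logdet = np.linalg.slogdet(cov)
        log_lik = -0.5 * (diff @ np.linalg.solve(cov, diff)
                          + logdet + m * np.log(2 * np.pi))
        log_post.append(np.log(pi_c) + log_lik)   # log prior + log likelihood
    return int(np.argmax(log_post))
```

Working in log space avoids underflow when many measurements make the likelihoods very peaked.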
VI-B2 Recovery of power consumption vector
We consider recovery of a power consumption vector for 58 counties in California (http://www.ecdms.energy.ca.gov/elecbycounty.aspx). Data for power consumption in these counties from year 2006 to year 2012 are available. We first fit a single Gaussian model using data from 2006 to 2011 (the probability plot in Fig. 10(a) demonstrates that a Gaussian is a reasonably good fit to the data), and then test the performance of Info-Greedy Sensing in recovering the data vector of year 2012. Fig. 10(b) shows that even with a coarse estimate of the covariance matrix from limited data (5 samples), Info-Greedy Sensing can outperform the random algorithm. This example has practical implications: the compressed measurements here correspond to collecting the total power consumption over a region of the power network. This collection process can be automated by new technologies such as the wireless sensor network platform using embedded RFID in [2]; hence, Info-Greedy Sensing may be an efficient solution for monitoring the power consumption of each node in a large power network.
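For a single Gaussian signal, each Info-Greedy measurement is along the leading eigenvector of the current posterior covariance, and the posterior mean and covariance update in closed form. A minimal sketch of this loop (the helper name and interface are ours):

```python
import numpy as np

def info_greedy_gaussian(x, mu, Sigma, noise_var, num_meas, rng=None):
    """Sketch of Info-Greedy Sensing for a single Gaussian signal: each
    unit-norm measurement vector is the leading eigenvector of the current
    posterior covariance; the posterior is updated in closed form."""
    rng = np.random.default_rng(0) if rng is None else rng
    mu, Sigma = mu.astype(float).copy(), Sigma.astype(float).copy()
    for _ in range(num_meas):
        _, eigvecs = np.linalg.eigh(Sigma)
        a = eigvecs[:, -1]                    # most informative direction
        y = a @ x + np.sqrt(noise_var) * rng.standard_normal()
        s = a @ Sigma @ a + noise_var         # innovation variance
        gain = Sigma @ a / s                  # Kalman-style gain
        mu = mu + gain * (y - a @ mu)         # posterior mean update
        Sigma = Sigma - np.outer(gain, a @ Sigma)  # posterior covariance update
    return mu, Sigma
```

Each step is the standard Gaussian conditioning update; the posterior variance along the measured direction shrinks the most, which is what makes the leading-eigenvector choice greedy in mutual information.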
VII Conclusion
We have presented a general framework for sequential adaptive compressed sensing, Info-Greedy Sensing, which maximizes the mutual information between the measurement and the signal model conditioned on previous measurements. Our results demonstrate that adaptivity helps when prior distributional information about the signal is available, and that Info-Greedy Sensing is an efficient tool to exploit such prior information, for instance in the case of GMM signals. Adaptivity also brings robustness when there is a mismatch between the assumed and true distributions, as we have demonstrated for Gaussian signals. Moreover, Info-Greedy Sensing shows significant improvement over random projections for signals with sparse and low-rank covariance matrices, which demonstrates its potential value for big data.
References
 [1] D. J. Brady, Optical imaging and spectroscopy. Wiley-OSA, April 2009.
 [2] W. Boonsong and W. Ismail, “Wireless monitoring of household electrical power meter using embedded RFID with wireless sensor network platform,” Int. J. Distributed Sensor Networks, vol. 2014, Article ID 876914, 10 pages, 2014.
 [3] B. Zhang, X. Cheng, N. Zhang, Y. Cui, Y. Li, and Q. Liang, “Sparse target counting and localization in sensor networks based on compressive sensing,” in IEEE Int. Conf. Computer Communications (INFOCOM), pp. 2255 – 2258, 2014.
 [4] E. J. Candès and T. Tao, “Near-optimal signal recovery from random projections: Universal encoding strategies?,” IEEE Trans. Info. Theory, vol. 52, pp. 5406–5425, Dec. 2006.
 [5] D. Donoho, “Compressed sensing,” IEEE Trans. Inform. Theory, vol. 52, pp. 1289–1306, Apr. 2006.
 [6] Y. C. Eldar and G. Kutyniok, eds., Compressed sensing: theory and applications. Cambridge University Press Cambridge, UK, 2012.
 [7] J. Haupt, R. M. Castro, and R. Nowak, “Distilled sensing: adaptive sampling for sparse detection and estimation,” IEEE Trans. Info. Theory, vol. 57, pp. 6222–6235, Sept. 2011.
 [8] D. Wei and A. O. Hero, “Multistage adaptive estimation of sparse signals,” IEEE J. Sel. Topics Sig. Proc., vol. 7, pp. 783 – 796, Oct. 2013.
 [9] D. Wei and A. O. Hero, “Performance guarantees for adaptive estimation of sparse signals,” arXiv:1311.6360v1, 2013.
 [10] M. L. Malloy and R. Nowak, “Sequential testing for sparse recovery,” IEEE Trans. Info. Theory, vol. 60, no. 12, pp. 7862 – 7873, 2014.
 [11] E. Arias-Castro, E. J. Candès, and M. A. Davenport, “On the fundamental limits of adaptive sensing,” IEEE Trans. Info. Theory, vol. 59, pp. 472–481, Jan. 2013.
 [12] P. Indyk, E. Price, and D. P. Woodruff, “On the power of adaptivity in sparse recovery,” in IEEE Foundations of Computer Science (FOCS), Oct. 2011.
 [13] M. L. Malloy and R. Nowak, “Near-optimal adaptive compressed sensing,” arXiv:1306.6239v1, 2013.
 [14] C. Aksoylar and V. Saligrama, “Information-theoretic bounds for adaptive sparse recovery,” arXiv:1402.5731v2, 2014.
 [15] G. Yu and G. Sapiro, “Statistical compressed sensing of Gaussian mixture models,” IEEE Trans. Sig. Proc., vol. 59, pp. 5842 – 5858, Dec. 2011.
 [16] S. Ji, Y. Xue, and L. Carin, “Bayesian compressive sensing,” IEEE Trans. Sig. Proc., vol. 56, no. 6, pp. 2346–2356, 2008.
 [17] J. Haupt, R. Nowak, and R. Castro, “Adaptive sensing for sparse signal recovery,” in IEEE 13th Digital Signal Processing Workshop and 5th IEEE Signal Processing Education Workshop (DSP/SPE), pp. 702 – 707, 2009.
 [18] M. A. Davenport and E. Arias-Castro, “Compressive binary search,” arXiv:1202.0937v2, 2012.
 [19] A. Tajer and H. V. Poor, “Quick search for rare events,” arXiv:1210.2406v1, 2012.
 [20] D. Malioutov, S. Sanghavi, and A. Willsky, “Sequential compressed sensing,” IEEE J. Sel. Topics Sig. Proc., vol. 4, pp. 435–444, April 2010.
 [21] J. Haupt, R. Baraniuk, R. Castro, and R. Nowak, “Sequentially designed compressed sensing,” in Proc. IEEE/SP Workshop on Statistical Signal Processing, 2012.
 [22] S. Jain, A. Soni, and J. Haupt, “Compressive measurement designs for estimating structured signals in structured clutter: A Bayesian experimental design approach,” arXiv:1311.5599v1, 2013.
 [23] A. Krishnamurthy, J. Sharpnack, and A. Singh, “Recovering graph-structured activations using adaptive compressive measurements,” in Annual Asilomar Conference on Signals, Systems, and Computers, Sept. 2013.
 [24] E. Tánczos and R. Castro, “Adaptive sensing for estimation of structured sparse signals,” arXiv:1311.7118, 2013.
 [25] A. Soni and J. Haupt, “On the fundamental limits of recovering tree sparse vectors from noisy linear measurements,” IEEE Trans. Info. Theory, vol. 60, no. 1, pp. 133–149, 2014.
 [26] A. Ashok, P. Baheti, and M. A. Neifeld, “Compressive imaging system design using task-specific information,” Applied Optics, vol. 47, no. 25, pp. 4457–4471, 2008.
 [27] J. Ke, A. Ashok, and M. Neifeld, “Object reconstruction from adaptive compressive measurements in feature-specific imaging,” Applied Optics, vol. 49, no. 34, pp. 27–39, 2010.
 [28] A. Ashok and M. A. Neifeld, “Compressive imaging: hybrid measurement basis design,” J. Opt. Soc. Am. A, vol. 28, no. 6, pp. 1041– 1050, 2011.
 [29] M. Seeger, H. Nickisch, R. Pohmann, and B. Schoelkopf, “Optimization of k-space trajectories for compressed sensing by Bayesian experimental design,” Magnetic Resonance in Medicine, 2010.
 [30] R. Waeber, P. Frazier, and S. G. Henderson, “Bisection search with noisy responses,” SIAM J. Control and Optimization, vol. 51, no. 3, pp. 2261–2279, 2013.
 [31] J. M. Duarte-Carvajalino, G. Yu, L. Carin, and G. Sapiro, “Task-driven adaptive statistical compressive sensing of Gaussian mixture models,” IEEE Trans. Sig. Proc., vol. 61, no. 3, pp. 585–600, 2013.
 [32] W. Carson, M. Chen, R. Calderbank, and L. Carin, “Communication inspired projection design with application to compressive sensing,” SIAM J. Imaging Sciences, 2012.
 [33] L. Wang, D. Carlson, M. D. Rodrigues, D. Wilcox, R. Calderbank, and L. Carin, “Designed measurements for vector count data,” in Neural Information Processing Systems Foundation (NIPS), 2013.

 [34] L. Wang, A. Razi, M. Rodrigues, R. Calderbank, and L. Carin, “Nonlinear information-theoretic compressive measurement design,” in Proc. 31st Int. Conf. Machine Learning (ICML), 2014.
 [35] G. Braun, C. Guzmán, and S. Pokutta, “Unifying lower bounds on the oracle complexity of nonsmooth convex optimization,” arXiv:1407.5144, 2014.
 [36] E. Arias-Castro and Y. C. Eldar, “Noise folding in compressed sensing,” IEEE Signal Processing Letters, vol. 18, pp. 478–481, June 2011.
 [37] M. Chen, Bayesian and Information-Theoretic Learning of High Dimensional Data. PhD thesis, Duke University, 2013.
 [38] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications. Cambridge Univ Press, 2012.
 [39] M. Payaró and D. P. Palomar, “Hessian and concavity of mutual information, entropy, and entropy power in linear vector Gaussian channels,” IEEE Trans. Info. Theory, pp. 3613–3628, Aug. 2009.
 [40] M. A. Iwen and A. H. Tewfik, “Adaptive group testing strategies for target detection and localization in noisy environment,” Institute for Mathematics and Its Applications (IMA) Preprint Series # 2311, 2010.
 [41] Y. Xie, Y. C. Eldar, and A. Goldsmith, “Reduceddimension multiuser detection,” IEEE Trans. Info. Theory, vol. 59, pp. 3858 – 3874, June 2013.
 [42] L. Trevisan, Lecture notes for CS359G: Graph Partitioning and Expanders. Stanford University, Stanford, CA, 2011.
 [43] D. Palomar and S. Verdú, “Gradient of mutual information in linear vector Gaussian channels,” IEEE Trans. Info. Theory, pp. 141–154, 2006.
 [44] F. Renna, R. Calderbank, L. Carin, and M. R. D. Rodrigues, “Reconstruction of signals drawn from a Gaussian mixture via noisy compressive measurements,” IEEE Trans. Sig. Proc., arXiv:1307.0861. To appear.
 [45] S. Arora, E. Hazan, and S. Kale, “The multiplicative weights update method: A metaalgorithm and applications,” Theory of Computing, vol. 8, no. 1, pp. 121–164, 2012.
 [46] M. A. Duran and I. E. Grossmann, “An outer-approximation algorithm for a class of mixed-integer nonlinear programs,” Math. Programming, vol. 36, no. 3, pp. 307–339, 1986.
 [47] A. Schrijver, Theory of linear and integer programming. Wiley, 1986.
 [48] J. C. Duchi and M. J. Wainwright, “Distance-based and continuum Fano inequalities with applications to statistical estimation,” arXiv:1311.2669v2, 2013.

 [49] J. M. Leiva-Murillo and A. Artes-Rodriguez, “A Gaussian mixture based maximization of mutual information for supervised feature extraction,” Lecture Notes in Computer Science, Independent Component Analysis and Blind Signal Separation, vol. 3195, pp. 271–278, 2004.
Appendix A General performance lower bounds
In the following we establish a general lower bound on the number of sequential measurements needed to attain a small recovery error, similar to the approach in [35]. We consider the following model: measurements are performed sequentially, and performance is measured by the number of measurements required to reconstruct the signal with a prescribed accuracy. Assume the sequential measurements are linear, so each measurement returns a linear function of the signal. Formally, consider a finite family of signals of interest and a signal drawn uniformly at random from this family. Denote the sequence of measurement vectors and the corresponding sequence of measurement values, and let the transcript be the sequence of measurement/value pairs; the transcript is a random variable of the picked signal. Assume that the required accuracy is high enough to ensure a one-to-one correspondence between a signal and the ball containing it, so that we can return the center of such a ball as the reconstruction of the signal. In this regime, recovery of a signal is (information-theoretically) equivalent to learning the ball that the signal is contained in, and we can invoke the reconstruction principle (18): the transcript has to contain the same information as the signal and in fact uniquely identify it. With this model it was shown in [35] that the total amount of information acquired is equal to the sum of the conditional information per iteration:
Theorem A.1 ([35]).
(19) 
where the summands are shorthand for the information acquired in each round conditioned on the preceding transcript, and the number of required measurements is itself a random variable.
We will use Theorem A.1 to establish Lemma III.4, which shows that the bisection algorithm is Info-Greedy for sparse signals. A priori, Theorem A.1 does not give a bound on the expected number of required measurements; it only characterizes how much information the sensing algorithm learns from each measurement. However, if we can upper bound the information acquired in each measurement by some constant, this leads to a lower bound on the expected number of measurements, as well as a high-probability lower bound:
Corollary A.2 (Lower bound on number of measurements).
Suppose that the information acquired in every round, in the sense above, is bounded above by some constant. Then the expected number of measurements is bounded below accordingly; moreover, corresponding high-probability lower bounds hold.
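The step from the per-round cap to the bound on the expected number of measurements can be sketched as follows. The notation here is ours, chosen to match the surrounding discussion: H[X] for the entropy of the signal variable, Π for the transcript with per-round pairs (a_t, y_t), C for the per-round information cap, and τ for the number of measurements.

```latex
H[X] \;=\; I[X;\Pi]
\;=\; \sum_{t \ge 1} I\big[X; (a_t, y_t) \,\big|\, \Pi_{t-1}\big]
\;\le\; C \,\mathbb{E}[\tau]
\qquad\Longrightarrow\qquad
\mathbb{E}[\tau] \;\ge\; \frac{H[X]}{C}.
```

The first equality is the reconstruction principle, the second is the chain rule from Theorem A.1, and the inequality applies the per-round cap to each summand.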
The information-theoretic approach also lends itself to lower bounds on the number of measurements for Gaussian signals; see, e.g., [48, Corollary 4].
Appendix B Derivation of Gaussian signal measured with colored noise
First consider the case where colored noise is added after the measurement. In the following, we assume the noise covariance matrix is full rank. Let the eigendecomposition of the noise covariance matrix be given, and define a constant vector from it, so that the variance of each measurement can be expressed through this decomposition. Reparameterize the measurement vector by introducing a unitary matrix, and let the eigendecomposition of the signal covariance matrix be given as well. Then the mutual information between the signal and the measurement can be written as
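One standard way to handle the full-rank colored-noise setup above is whitening, which reduces it to the white-noise case. In illustrative notation of our own (Σ_w for the noise covariance, A for the measurement matrix):

```latex
y = A x + w,\quad w \sim \mathcal{N}(0, \Sigma_w)
\;\Longrightarrow\;
\Sigma_w^{-1/2} y \;=\; \big(\Sigma_w^{-1/2} A\big) x + \tilde w,
\quad \tilde w \sim \mathcal{N}(0, I),
```

so the mutual information computations for white noise carry over with the effective measurement matrix Σ_w^{-1/2} A; the inverse square root is computed from the eigendecomposition of the noise covariance.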