Statistical Compressed Sensing of Gaussian Mixture Models

01/30/2011 ∙ by Guoshen Yu, et al.

A novel framework of compressed sensing, namely statistical compressed sensing (SCS), is introduced; it aims at efficiently sampling a collection of signals that follow a statistical distribution, and at achieving accurate reconstruction on average. SCS based on Gaussian models is investigated in depth. For signals that follow a single Gaussian model, with Gaussian or Bernoulli sensing matrices of O(k) measurements, considerably smaller than the O(k log(N/k)) required by conventional CS based on sparse models, where N is the signal dimension, and with an optimal decoder implemented via linear filtering, significantly faster than the pursuit decoders applied in conventional CS, the error of SCS is shown to be tightly upper bounded by a constant times the best k-term approximation error, with overwhelming probability. The failure probability is also significantly smaller than that of conventional sparsity-oriented CS. Stronger yet simpler results further show that, for any sensing matrix, the error of Gaussian SCS is upper bounded by a constant times the best k-term approximation error with probability one, and the bound constant can be efficiently calculated. For Gaussian mixture models (GMMs), which assume multiple Gaussian distributions with each signal following one of them with an unknown index, a piecewise linear estimator is introduced to decode SCS. The accuracy of model selection, at the heart of the piecewise linear decoder, is analyzed in terms of the properties of the Gaussian distributions and the number of sensing measurements. A maximum a posteriori expectation-maximization algorithm that iteratively estimates the Gaussian model parameters, the signals' model selection, and decodes the signals, is presented for GMM-based SCS. In real image sensing applications, GMM-based SCS is shown to lead to improved results compared to conventional CS, at a considerably lower computational cost.


I Introduction

Compressed sensing (CS) aims at achieving accurate signal reconstruction while sampling signals at a low sampling rate, typically far smaller than that of Nyquist/Shannon. Let $x \in \mathbb{R}^N$ be a signal of interest, $\Phi$ an $m \times N$ non-adaptive sensing matrix (encoder) consisting of $m \ll N$ measurements, $y = \Phi x$ a measured signal, and $\Delta$ a decoder used to reconstruct $x$ from $y$. CS develops encoder-decoder pairs $(\Phi, \Delta)$ such that a small reconstruction error $\|x - \Delta(\Phi x)\|$ can be achieved.

Reconstructing $x$ from $y$ is an ill-posed problem whose solution requires some prior information on the signal. Instead of the frequency band-limited signal model assumed in classic Shannon sampling theory, conventional CS adopts a sparse signal model, i.e., there exists a dictionary, typically an orthogonal basis $\mathbf{B} \in \mathbb{R}^{N \times N}$, a linear combination of whose columns generates an accurate approximation of the signal, $x \approx \mathbf{B}a$, the coefficients $a[i]$, $1 \le i \le N$, having amplitudes that decay fast after being sorted. For signals following the sparse model, it has been shown that using some random sensing matrices such as Gaussian and Bernoulli matrices with $m = O(k \log(N/k))$ measurements, and an $\ell_1$-minimization or a greedy matching pursuit decoder promoting sparsity, with high probability CS leads to accurate signal reconstruction: the obtained approximation error is tightly upper bounded by a constant times the best $k$-term approximation error, the minimum error that one may achieve by keeping the $k$ largest coefficients in $a$ [candes2006robust, candes2005decoding, cohen2009compressed, donoho2006compressed]. Redundant and signal-adaptive dictionaries that further improve the CS performance with respect to orthogonal bases have been investigated [candes2010compressed, duarte2009learning, peyre2010best]. In addition to sparse models, manifold models have been considered for CS as well [baraniuk2009random, chen2010compressive].

The present paper introduces a novel framework of CS, namely statistical compressed sensing (SCS). As opposed to conventional CS, which deals with one signal at a time, SCS aims at efficiently sampling a collection of signals and having accurate reconstruction on average. Instead of restricting itself to sparse models, SCS works with general Bayesian models. Assuming that the signals $x$ follow a distribution with probability density function (pdf) $p(x)$, SCS designs encoder-decoder pairs $(\Phi, \Delta)$ so that the average error

$E_x\left[\|x - \Delta(\Phi x)\|\right]$,

where $\|\cdot\|$ is a norm, is small. As an important example, SCS with Gaussian models is here shown to have improved performance (bounds) relative to conventional CS, the signal reconstruction calculated with an optimal decoder implemented via fast linear filtering. Moreover, for Gaussian mixture models (GMMs), which better describe most real signals, SCS with a piecewise linear decoder is investigated.

The motivation of SCS with Gaussian models is twofold. First, controlling the average error over a collection of signals is useful in signal acquisition, not only because one is often interested in acquiring a collection of signals in real applications, but also because more effective processing of an individual signal, an image or a sound for example, is usually achieved by dividing the signal into (often overlapping) local subparts, patches (see Figure 10) or short-time windows for instance, so that a signal can be regarded as a collection of subpart signals [aharon2006k, buades2006review, yu2008audio, yu2010PLE]. In addition, Gaussian mixture models (GMMs), which model signals or subpart signals with a collection of Gaussians, assuming each signal is drawn from one of them, have been shown effective in describing real signals, leading to state-of-the-art results in image inverse problems [yu2010PLE] and missing data estimation [leger2010Matrix].

SCS based on a single Gaussian model is first developed in Section II. Following a mathematical approach similar to the one adopted in the performance analysis of conventional CS [cohen2009compressed], it is shown that with the same random matrices as in conventional CS, but with a considerably reduced number of measurements $O(k)$, and with the optimal decoder implemented via linear filtering, significantly faster than the decoders applied in conventional CS, the average error of Gaussian SCS is tightly upper bounded by a constant times the best $k$-term approximation error with overwhelming probability, the failure probability being orders of magnitude smaller than that of conventional CS. Moreover, stronger yet simpler results further show that for any sensing matrix, the average error of Gaussian SCS is upper bounded by a constant times the best $k$-term approximation error with probability one, and the bound constant can be efficiently calculated.

Section III extends SCS to GMMs. A piecewise linear GMM-based SCS decoder, which essentially consists of estimating the signal using each Gaussian model included in the GMM and then selecting the best model, is introduced; a schematic sketch of this decoding scheme is given below. The accuracy of the model selection, at the heart of the scheme, is analyzed in detail in terms of the properties of the Gaussian distributions and the number of sensing measurements. These results are also important for the general problem of model selection from compressed measurements.
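As a preview, here is a minimal numpy sketch of the kind of piecewise linear decoding just described. It assumes noiseless measurements $y = \Phi x$ and zero-mean Gaussian components, and selects the model under which the measurements are most likely; the function names and this particular selection criterion are illustrative assumptions, the paper's exact formulation being developed in Section III.

```python
import numpy as np

def gaussian_decode(y, Phi, Sigma):
    """Linear MAP decoder for one zero-mean Gaussian model, noiseless y = Phi x."""
    G = Phi @ Sigma @ Phi.T
    return Sigma @ Phi.T @ np.linalg.solve(G, y)

def piecewise_linear_decode(y, Phi, Sigmas):
    """Decode y with every Gaussian model in the mixture, keep the most plausible one.
    Selection criterion here: log-likelihood of y under N(0, Phi Sigma Phi^T),
    an illustrative choice standing in for the paper's criterion."""
    best, best_score = None, -np.inf
    for Sigma in Sigmas:
        G = Phi @ Sigma @ Phi.T
        _, logdet = np.linalg.slogdet(G)
        score = -0.5 * (logdet + y @ np.linalg.solve(G, y))  # log N(y; 0, G) up to a constant
        if score > best_score:
            best_score, best = score, gaussian_decode(y, Phi, Sigma)
    return best
```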

Following [yu2010PLE], Section IV presents a maximum a posteriori expectation-maximization (MAP-EM) algorithm that iteratively estimates the Gaussian models and decodes the signals. GMM-based SCS calculated with the MAP-EM algorithm is applied to real image sensing, leading to improved results with respect to conventional CS, at a considerably lower computational cost.

II Performance Bounds for a Single Gaussian Model

This section analyzes the performance bounds of SCS based on a single Gaussian model. Perfect reconstruction of degenerated Gaussian signals is briefly discussed. After reviewing basic properties of linear approximation for Gaussian signals, the rest of the section shows that for Gaussian signals with fast eigenvalue decay, the average error of SCS using $O(k)$ measurements and decoded by a linear estimator is tightly upper bounded by that of the best $k$-term approximation.

Signals $x \in \mathbb{R}^N$ are assumed to follow a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ in this section. Principal Component Analysis (PCA) calculates a basis change $x = \mathbf{B} a$ of the data, with $\mathbf{B}$ the orthonormal PCA basis that diagonalizes the data covariance matrix

$\Sigma = \mathbf{B} \mathbf{S} \mathbf{B}^T, \qquad (1)$

where $\mathbf{S} = \mathrm{diag}(\lambda_1, \dots, \lambda_N)$ is a diagonal matrix whose diagonal elements are the sorted eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_N \ge 0$, and $a = \mathbf{B}^T x$ the PCA coefficient vector [mallat2008wts]. In this section, we will most of the time assume without loss of generality that $\mathbf{B}$ is the canonical basis, i.e., we look at the signals in the PCA domain. For Gaussian and Bernoulli matrices, which are known to be universal, analyzing CS in the canonical basis or in the PCA basis is equivalent [baraniuk2008simple].
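A minimal numpy sketch of the basis change (1): estimate the covariance from data, diagonalize it, and sort the eigenvectors by decreasing eigenvalue (the function name and the empirical-covariance estimate are illustrative assumptions).

```python
import numpy as np

def pca_basis(X):
    """X: n_samples x N data matrix. Returns (B, lam): orthonormal PCA basis whose
    columns are sorted by decreasing eigenvalue, and the sorted eigenvalues."""
    Xc = X - X.mean(axis=0)                  # center the data
    Sigma = Xc.T @ Xc / Xc.shape[0]          # empirical covariance
    lam, B = np.linalg.eigh(Sigma)           # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]
    return B[:, order], lam[order]

# PCA coefficients a = B^T x; working in the PCA domain then amounts to
# assuming a diagonal covariance S = diag(lam).
```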

II-A Degenerated Gaussians

Conventional CS is able to perfectly reconstruct $k$-sparse signals, i.e., signals with at most $k$ non-zero entries (typically $k \ll N$), with $2k$ measurements [cohen2009compressed]. Degenerated Gaussian distributions $\mathcal{N}(0, \Sigma)$, where $\Sigma$ has at most $k$ non-zero eigenvalues, give the counterpart of $k$-sparsity for the Gaussian signal models considered in this paper. Such signals belong to the $k$-dimensional linear subspace $V_k$ spanned by the first $k$ PCA basis vectors. The next lemma gives a condition for perfect reconstruction of signals in $V_k$.

Lemma 1.

If $\Phi$ is any $m \times N$ matrix and $k$ is a positive integer, then there is a decoder $\Delta$ such that $\Delta(\Phi x) = x$, for all $x \in V_k$, if and only if $\mathcal{N}(\Phi) \cap V_k = \{0\}$, where $\mathcal{N}(\Phi)$ denotes the null space of $\Phi$.

Proof.

Suppose there is a decoder $\Delta$ such that $\Delta(\Phi x) = x$, for all $x \in V_k$. Let $\eta \in \mathcal{N}(\Phi) \cap V_k$. We can write $\eta = x_1 - x_2$ where both $x_1, x_2 \in V_k$. Since $\Phi \eta = 0$, $\Phi x_1 = \Phi x_2$. Plugging $\Phi x_1$ and $\Phi x_2$ into the decoder $\Delta$, we obtain $x_1 = \Delta(\Phi x_1) = \Delta(\Phi x_2) = x_2$, and then $\eta = 0$.

Suppose $\mathcal{N}(\Phi) \cap V_k = \{0\}$. If $\Phi x_1 = \Phi x_2$ with $x_1, x_2 \in V_k$, then $x_1 - x_2 \in \mathcal{N}(\Phi) \cap V_k$, so $x_1 = x_2$. $\Phi$ is thus a one-to-one map on $V_k$. Therefore there must exist a decoder $\Delta$ such that $\Delta(\Phi x) = x$ for all $x \in V_k$. ∎

It is possible to construct matrices $\Phi$ of size $k \times N$ which satisfy the requirement of the Lemma. A trivial example is $\Phi = [\mathbf{I}_k, \mathbf{0}]$, where $\mathbf{I}_k$ is the identity matrix of size $k \times k$ and $\mathbf{0}$ is a zero matrix of size $k \times (N-k)$. Comparing with conventional compressed sensing, which requires $2k$ measurements for exact reconstruction of $k$-sparse signals, with only $k$ measurements signals in $V_k$ can be exactly reconstructed. Indeed, in contrast to the $k$-sparse signals, where the positions of the non-zero entries are unknown ($k$-sparse signals reside in multiple $k$-dimensional subspaces), with the degenerated Gaussian model the positions of the non-zero coefficients are known a priori to be the first $k$ ones in the PCA basis. $k$ measurements thus suffice for perfect reconstruction.
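A minimal numerical illustration of this point, with illustrative values $N = 64$ and $k = 8$: a degenerated Gaussian signal is sensed with the trivial $k \times N$ matrix $[\mathbf{I}_k, \mathbf{0}]$ and recovered exactly from $k$ measurements.

```python
import numpy as np

N, k = 64, 8
rng = np.random.default_rng(0)

# Degenerated Gaussian in the PCA domain: only the first k eigenvalues are non-zero.
lam = np.zeros(N)
lam[:k] = 1.0 / (1 + np.arange(k))
a = rng.standard_normal(N) * np.sqrt(lam)            # a[k:] == 0, i.e. a lies in V_k

Phi = np.hstack([np.eye(k), np.zeros((k, N - k))])   # trivial k x N encoder [I_k, 0]
y = Phi @ a

a_hat = np.zeros(N)
a_hat[:k] = y                                        # decoder: copy back the first k entries
assert np.allclose(a_hat, a)                         # exact reconstruction from only k measurements
```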

In the following, we will concentrate on the more general case of non-degenerated Gaussian signals (i.e., Gaussians with full-rank covariance matrices $\Sigma$) with fast eigenvalue decay, in analogy to compressible signals in conventional CS. As mentioned before, and as will be further shown experimentally in this paper, such simple models not only lead to improved theoretical bounds, but also provide state-of-the-art image reconstruction results.

II-B Optimal Decoder

To simplify the notation, we assume without loss of generality that the Gaussian has zero mean, $\mu = 0$, as one can always center the signal with respect to the mean.

It is well-known that the optimal decoders for Gaussian signals are calculated with linear filtering:

Theorem 1.

[kay1998fundamentals] Let $x \sim \mathcal{N}(0, \Sigma)$ be a random vector with Gaussian prior pdf, and let $\Phi$ be an $m \times N$ sensing matrix, $m \le N$. From the measured signal $y = \Phi x$, the optimal decoder $\Delta$ that minimizes the mean square error (MSE) $E[\|x - \Delta(\Phi x)\|_2^2]$, as well as the mean absolute error (MAE) $E[\|x - \Delta(\Phi x)\|_1]$, is obtained with a linear MAP estimator,

$\hat{x} = \Delta(y) = \Sigma \Phi^T (\Phi \Sigma \Phi^T)^{-1} y, \qquad (2)$

and the resulting error $x - \hat{x}$ is Gaussian with mean zero and with covariance matrix

$\Sigma_e = \Sigma - \Sigma \Phi^T (\Phi \Sigma \Phi^T)^{-1} \Phi \Sigma, \qquad (3)$

whose trace yields the MSE of SCS.

In contrast to conventional CS, for which the $\ell_1$-minimization or greedy matching pursuit decoders, calculated with iterative procedures, have been shown to be optimal under certain conditions on $\Phi$ and on the signal sparsity [candes2006robust, donoho2006compressed], Gaussian SCS enjoys the advantage of having an optimal decoder (2) that is calculated quickly via closed-form linear filtering, for any $\Phi$.
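A sketch of the decoder (2) and of the MSE obtained as the trace of the error covariance (3), as reconstructed above; the measurements are assumed noiseless and the function names are illustrative.

```python
import numpy as np

def scs_decode(y, Phi, Sigma):
    """Linear MAP/MMSE decoder x_hat = Sigma Phi^T (Phi Sigma Phi^T)^{-1} y, cf. (2)."""
    G = Phi @ Sigma @ Phi.T
    return Sigma @ Phi.T @ np.linalg.solve(G, y)

def scs_mse(Phi, Sigma):
    """Trace of Sigma - Sigma Phi^T (Phi Sigma Phi^T)^{-1} Phi Sigma, cf. (3)."""
    G = Phi @ Sigma @ Phi.T
    K = Sigma @ Phi.T @ np.linalg.solve(G, Phi @ Sigma)
    return np.trace(Sigma - K)
```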

Corollary 1.

If a random matrix $\Phi$ is drawn independently to sense each signal $x$, with all the other conditions as in Theorem 1, the MSE of SCS is

$\mathrm{MSE} = E_\Phi\left[\mathrm{tr}\left(\Sigma - \Sigma \Phi^T (\Phi \Sigma \Phi^T)^{-1} \Phi \Sigma\right)\right]. \qquad (4)$

Applying an independent random matrix realization to sense each signal has been considered in [cohen2009compressed]. In real applications, these random sensing matrices need not be stored, since the decoder can regenerate them itself given the random seed.
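A small sketch of this seed-based convention (the $1/\sqrt{m}$ normalization and function name are illustrative assumptions):

```python
import numpy as np

def sensing_matrix(seed, m, N):
    """Regenerate the i.i.d. Gaussian sensing matrix associated with a given signal from
    its seed, so that encoder and decoder never need to store or transmit the matrix."""
    return np.random.default_rng(seed).standard_normal((m, N)) / np.sqrt(m)

# encoder side:  y_i = sensing_matrix(seed_i, m, N) @ x_i
# decoder side:  rebuilds the identical matrix from seed_i before applying (2).
```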

Following the PCA basis change (1), it is equivalent to consider the signals $a = \mathbf{B}^T x \sim \mathcal{N}(0, \mathbf{S})$, where $\mathbf{S}$ is a diagonal matrix whose diagonal elements are the sorted eigenvalues. Theorem 1 and Corollary 1 clearly hold for $a$, with (2), (3), and (4) rewritten as

$\hat{a} = \Delta(y) = \mathbf{S} \Phi^T (\Phi \mathbf{S} \Phi^T)^{-1} y, \qquad (5)$

$\mathbf{S}_e = \mathbf{S} - \mathbf{S} \Phi^T (\Phi \mathbf{S} \Phi^T)^{-1} \Phi \mathbf{S}, \qquad (6)$

$\mathrm{MSE} = E_\Phi\left[\mathrm{tr}\left(\mathbf{S} - \mathbf{S} \Phi^T (\Phi \mathbf{S} \Phi^T)^{-1} \Phi \mathbf{S}\right)\right]. \qquad (7)$

Note that PCA bases, as sparsifying dictionaries, have been applied to perform conventional CS based on sparse models [masiero2010data], which is fundamentally different from the Gaussian models and the SCS framework studied here.

II-C Linear vs. Nonlinear Approximation

Before proceeding with the analysis of the SCS performance, let us make some comments on the relationship between linear and non-linear approximations for Gaussian signals. In particular, the following is observed via Monte Carlo simulations:

For Gaussian signals $x \sim \mathcal{N}(0, \Sigma)$ whose eigenvalues decay fast, the best $k$-term linear approximation, which keeps the first $k$ PCA coefficients,

$x_L^k = \sum_{i=1}^{k} a[i]\, \mathbf{b}_i, \qquad (8)$

where $\mathbf{b}_i$ denotes the $i$-th PCA basis vector, and the nonlinear approximation

$x_{NL}^k = \theta_k(x) = \sum_{i \in I_k} a[i]\, \mathbf{b}_i, \qquad (9)$

where $\theta_k$ is a thresholding operator that keeps the $k$ coefficients of largest amplitude (indexed by $I_k$) and sets the others to zero, lead to comparable approximation errors

$E\left[\|x - x_L^k\|_2^2\right] \approx E\left[\|x - x_{NL}^k\|_2^2\right]. \qquad (10)$

Monte Carlo simulations are performed to test this. Assuming a power decay of the eigenvalues [mallat2008wts],

$\lambda_m = C\, m^{-\alpha}, \qquad (11)$

where $\alpha$ is the decay parameter, Figure 1 plots the MSEs

$\mathrm{MSE}_L = E\left[\|x - x_L^k\|_2^2\right] \quad \text{and} \quad \mathrm{MSE}_{NL} = E\left[\|x - x_{NL}^k\|_2^2\right], \qquad (12)$

normalized by the ideal signal energy $E[\|x\|_2^2] = \sum_m \lambda_m$, of the best $k$-term linear and nonlinear approximations as a function of $\alpha$, with $N$ and $k$ fixed at values typical for image patches (for example, $N = 64$ for patches of size $8 \times 8$). Both MSEs decrease as $\alpha$ increases, i.e., as the eigenvalues decay faster. With typical values of $\alpha$ (similar to the eigenvalue decay calculated with typical image patches) and $k = 8$ or 16, both approximations are accurate and generate small and comparable MSEs, their difference being a small fraction of the signal energy and their ratio about 2.

Fig. 1: (a) MSEs (normalized by the ideal signal energy) of the best $k$-term linear and non-linear approximations, as a function of $\alpha$ (signal dimension $N$). (b) and (c): Difference and ratio of the normalized MSEs of the best $k$-term linear and non-linear approximations shown in (a).
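A sketch of the Monte Carlo comparison described above, with illustrative parameter values: Gaussian coefficients with power-decaying variances are drawn, and the normalized MSEs of the best $k$-term linear and nonlinear approximations are compared.

```python
import numpy as np

def approx_mses(N=64, k=8, alpha=2.0, n_trials=5000, seed=0):
    """Monte Carlo comparison of best k-term linear vs. nonlinear approximation
    for a ~ N(0, diag(lam)) with power-decaying eigenvalues lam_m = m^{-alpha}."""
    rng = np.random.default_rng(seed)
    lam = (1.0 + np.arange(N)) ** (-alpha)
    a = rng.standard_normal((n_trials, N)) * np.sqrt(lam)

    err_lin = a.copy()
    err_lin[:, :k] = 0.0                               # residual after keeping the first k coeffs
    idx = np.argsort(np.abs(a), axis=1)[:, -k:]        # k largest-amplitude coeffs per signal
    err_nl = a.copy()
    np.put_along_axis(err_nl, idx, 0.0, axis=1)        # residual after keeping those k coeffs

    energy = lam.sum()                                 # ideal signal energy E||a||^2
    return (err_lin ** 2).sum(1).mean() / energy, (err_nl ** 2).sum(1).mean() / energy

print(approx_mses())   # the two normalized MSEs are typically within a small factor of each other
```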

Following this, the error of Gaussian SCS will be compared with that of the best $k$-term linear approximation, which is comparable to that of the best $k$-term nonlinear approximation. For simplicity, the best $k$-term linear approximation error will be denoted as

$\sigma_k(x)_p = \|x - x_L^k\|_p. \qquad (13)$

Note that $E[\sigma_k(x)_2^2] = \sum_{m > k} \lambda_m$.

II-D Performance of Gaussian SCS – A First Numerical Analysis

This section numerically evaluates the MSE of Gaussian SCS and compares it with the minimal MSE generated by the best $k$-term linear approximation, preceding the theoretical bounds developed later.

As before, a power decay of the eigenvalues (11) is assumed in the Monte Carlo simulations. An independent random Gaussian matrix realization is applied to sense each signal $x$ [cohen2009compressed].

Figures 2 (a) and (c)-top plot the MSE (normalized by the ideal signal energy) of SCS and that of the best $k$-term linear approximation, as well as their ratio, as a function of $\alpha$, with $k$ and the number of measurements $m$ fixed at typical values. As $\alpha$ increases, i.e., as the eigenvalues decay faster, the MSEs of both methods decrease. Their ratio increases almost linearly with $\alpha$. The same quantities are plotted in Figures 2 (b) and (c)-bottom, with the eigenvalue decay parameter $\alpha$ fixed at a typical value, and with $k$ varying. As $k$ increases, both MSEs decrease, their ratio remaining almost constant.

Fig. 2: Comparison of the MSE of SCS with that of the best $k$-term linear approximation for Gaussian signals of dimension $N$. (a) and (c)-top: MSE (normalized by the ideal signal energy) of SCS and of the best $k$-term linear approximation, as well as their ratio, as a function of $\alpha$, with $k$ fixed at typical values. (b) and (c)-bottom: The same values, with the eigenvalue decay parameter $\alpha$ fixed at a typical value and with $k$ varying.
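A sketch of the corresponding SCS experiment, with illustrative parameter values: each signal is sensed with an independent Gaussian matrix, decoded with (5), and the resulting MSE is compared with the best $k$-term linear approximation MSE.

```python
import numpy as np

def scs_vs_linear(N=64, k=8, m=8, alpha=2.0, n_trials=2000, seed=0):
    """Monte Carlo MSE of Gaussian SCS (linear decoder (5), one random Gaussian Phi
    per signal) compared with the best k-term linear approximation MSE."""
    rng = np.random.default_rng(seed)
    lam = (1.0 + np.arange(N)) ** (-alpha)
    S = np.diag(lam)
    mse_scs = 0.0
    for _ in range(n_trials):
        a = rng.standard_normal(N) * np.sqrt(lam)
        Phi = rng.standard_normal((m, N)) / np.sqrt(m)
        y = Phi @ a
        a_hat = S @ Phi.T @ np.linalg.solve(Phi @ S @ Phi.T, y)   # decoder (5)
        mse_scs += np.sum((a - a_hat) ** 2)
    mse_scs /= n_trials
    mse_lin = lam[k:].sum()        # E||a - a_L^k||^2 = sum of the discarded eigenvalues
    return mse_scs / lam.sum(), mse_lin / lam.sum(), mse_scs / mse_lin

print(scs_vs_linear())   # the ratio MSE_SCS / MSE_linear stays a small constant in these settings
```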

These results indicate the good performance of Gaussian SCS: its MSE is only a small number of times larger than that of the best $k$-term linear approximation. [Footnote 1: Simulations using the same coefficient energy power decay model (11) show that the ratio between the MSE of conventional CS based on sparse models, with $O(k \log(N/k))$ measurements, and that of the best $k$-term nonlinear approximation varies as a function of the decay parameter $\alpha$ and of $k$. For typical values, this ratio is typically an order of magnitude larger than that between the MSE of SCS and that of the best $k$-term linear approximation.] The next sections provide a mathematical analysis of this performance.

Let us note that while the best $k$-term linear approximation decoding is feasible for signals following a single Gaussian distribution, it is impractical with GMMs (which assume multiple Gaussians, each signal being generated from one of them with an unknown index), since the Gaussian index of the signal is unknown. SCS with GMMs, which describe real data considerably better than a single Gaussian model [yu2010PLE], will be described in Sections III and IV.

II-E Performance Bounds

Following the analysis techniques in [cohen2009compressed], this section shows that with Gaussian and Bernoulli random matrices of $m = O(k)$ measurements, considerably smaller than the $O(k \log(N/k))$ required by conventional CS, the average error of Gaussian SCS is tightly upper bounded by a constant times the best $k$-term linear approximation error with overwhelming probability, the failure probability being orders of magnitude smaller than that of conventional CS.

We consider only encoder-decoder pairs $(\Phi, \Delta)$ that preserve the measurements, i.e., $\Phi \Delta(\Phi x) = \Phi x$, a property satisfied by the optimal decoder (5) for Gaussian signals $a \sim \mathcal{N}(0, \mathbf{S})$.

II-E1 From Null Space Property to Instance Optimality

The instance optimality in expectation bounds the average error of SCS by a constant times that of the best $k$-term linear approximation (13), defining the desired SCS performance:

Definition 1.

Let $x \in \mathbb{R}^N$ be a random vector that follows a certain distribution. Let $T \subset \{1, \dots, N\}$ be any subset of indices. We say that $(\Phi, \Delta)$ is instance optimal in expectation in $T$ in the $\ell_p$ norm, with a constant $C_0$, if

$E\left[\|x - \Delta(\Phi x)\|_p\right] \le C_0\, E\left[\|x - x_T\|_p\right], \qquad (14)$

where $x_T$ is the signal restricted to $T$ ($x_T[i] = x[i]$ if $i \in T$, and $x_T[i] = 0$ otherwise), the expectation on the left side being taken with respect to $x$, and to $\Phi$ if one random $\Phi$ is drawn independently for each $x$. Similarly, the MSE instance optimality in $T$ is defined as

$E\left[\|x - \Delta(\Phi x)\|_2^2\right] \le C_0\, E\left[\|x - x_T\|_2^2\right]. \qquad (15)$

In particular, if $T = \{1, \dots, k\}$, then we say that $(\Phi, \Delta)$ is instance optimal in expectation of order $k$ in $\ell_p$, with a constant $C_0$, if

$E\left[\|x - \Delta(\Phi x)\|_p\right] \le C_0\, E\left[\sigma_k(x)_p\right], \qquad (16)$

and is instance optimal of order $k$ in MSE, with a constant $C_0$, if

$E\left[\|x - \Delta(\Phi x)\|_2^2\right] \le C_0\, E\left[\sigma_k(x)_2^2\right]. \qquad (17)$

The null space property in expectation defined next will be shown equivalent to the instance optimality in expectation.

Definition 2.

Let $x \in \mathbb{R}^N$ be a random vector that follows a certain distribution, and let $\eta = x - \Delta(\Phi x)$ denote the decoding error. Let $T \subset \{1, \dots, N\}$ be any subset of indices. We say that $(\Phi, \Delta)$ has the null space property in expectation in $T$ in the $\ell_p$ norm, with constant $C$, if

$E\left[\|\eta\|_p\right] \le C\, E\left[\|\eta - \eta_T\|_p\right], \qquad (18)$

where $\eta_T$ is the error restricted to $T$ ($\eta_T[i] = \eta[i]$ if $i \in T$, and $\eta_T[i] = 0$ otherwise), the expectation on the left side being taken with respect to $x$, and to $\Phi$ if one random $\Phi$ is drawn independently for each $x$. Note that $\eta \in \mathcal{N}(\Phi)$, since the pairs considered preserve the measurements. Similarly, the MSE null space property in $T$ is defined as

$E\left[\|\eta\|_2^2\right] \le C\, E\left[\|\eta - \eta_T\|_2^2\right]. \qquad (19)$

In particular, if $T = \{1, \dots, k\}$, with $k < N$, then we say that $(\Phi, \Delta)$ has the null space property in expectation of order $k$ in $\ell_p$, with constant $C$, if

$E\left[\|\eta\|_p\right] \le C\, E\left[\|\eta - \eta_T\|_p\right], \quad T = \{1, \dots, k\}, \qquad (20)$

and has the MSE null space property of order $k$, if

$E\left[\|\eta\|_2^2\right] \le C\, E\left[\|\eta - \eta_T\|_2^2\right], \quad T = \{1, \dots, k\}. \qquad (21)$
Theorem 2.

Let $x$ be a random vector that follows a certain distribution. Given an $m \times N$ matrix $\Phi$, a norm $\|\cdot\|_p$, and a subset of indices $T$, a sufficient condition that there exists a decoder $\Delta$ such that the instance optimality in expectation in $T$ in $\ell_p$ (14) holds with constant $C_0$, is that the null space property in expectation (18) holds with constant $C_0/2$ for this $T$:

$E\left[\|\eta\|_p\right] \le \frac{C_0}{2}\, E\left[\|\eta - \eta_T\|_p\right]. \qquad (22)$

A necessary condition is the null space property in expectation (18) with constant $C_0$:

$E\left[\|\eta\|_p\right] \le C_0\, E\left[\|\eta - \eta_T\|_p\right]. \qquad (23)$

Similar results hold between the MSE instance optimality in $T$ (15) and the MSE null space property (19), with the constant $C_0/4$ in the sufficient condition.

In particular, if $T = \{1, \dots, k\}$, with $k < N$, the same equivalence between the instance optimality in expectation of order $k$ in $\ell_p$ (16) and the null space property in expectation (20), and that between the MSE instance optimality of order $k$ (17) and the MSE null space property (21), hold as well.

Proof.

To prove the sufficiency of (22), we consider the decoder such that for all ,

(24)

By (22), we have

(25)

where the second inequality uses the triangle inequality, and the last inequality follows from the choice of the decoder (24).

To prove the necessity of (23), let be any decoder for which (14) holds. Let and let be the linear approximation of in (, and otherwise). Let be any splitting of into two vectors in the linear space . We can write

with . As the right side of (14) is equal to 0 for , we deduce . Since , we have , so that . We derive

where the inequality follows from (14), and the second and third equalities use the fact that and . Thus we have obtained (23).

A similar proof proceeds for MSE instance optimality and null space property. The second part of the theorem is a direct consequence of the first part. ∎

Compared to conventional CS, which requires the null space property to hold with the best $2k$-term nonlinear approximation error [cohen2009compressed], the requirement for Gaussian SCS is relaxed to order $k$, thanks to the linearity of the best $k$-term linear approximation for Gaussian signals.

Theorem 2 proves the existence of a decoder $\Delta$ for which the instance optimality in expectation holds, given the null space property in expectation. However, it does not explain how such a decoder is implemented. The following corollary, a direct consequence of Theorems 1 and 2, shows that for Gaussian signals the optimal decoder (5) leads to the instance optimality in expectation.

Corollary 2.

For Gaussian signals $a \sim \mathcal{N}(0, \mathbf{S})$, if an $m \times N$ sensing matrix $\Phi$ satisfies the null space property in expectation (20) of order $k$ in $\ell_1$, with constant $C$, or the MSE null space property (21) of order $k$ with constant $C$, then the optimal and linear decoder (5) satisfies the instance optimality in expectation (16) in $\ell_1$, or the MSE instance optimality (17), respectively.

Proof.

It follows from Theorem 1 that the MAP decoder (5) minimizes the MAE and the MSE among all estimators of $a$. Therefore its MAE and MSE are smaller than those generated by the decoder (24) considered in Theorem 2. The latter satisfies the instance optimality, and so does the former. ∎

II-E2 From RIP to Null Space Property

The Restricted Isometry Property (RIP) of a matrix measures its ability to preserve distances, and is related to the null space property in conventional CS [candes2006near, donoho2006compressed]. The new linear RIP of order $k$ restricts the requirement of the conventional RIP of order $k$ to a union of $\lceil N/k \rceil$ $k$-dimensional linear subspaces with consecutive supports:

Definition 3.

Let $k$ be a positive integer. Let $W_1$ denote the linear subspace of vectors supported in the first $k$ indices of $\{1, \dots, N\}$, $W_2$ the linear subspace of vectors supported in the next $k$ indices, and so on. The vectors in the last linear subspace defined this way may have support on fewer than $k$ indices. An $m \times N$ matrix $\Phi$ is said to have linear RIP of order $k$ with constant $\delta_k$ if

$(1 - \delta_k)\,\|x\|_2 \le \|\Phi x\|_2 \le (1 + \delta_k)\,\|x\|_2, \qquad (27)$

for all $x$ belonging to one of the subspaces $W_j$.

The linear RIP is a special case of the block RIP [eldar2009robust], with block sparsity one and blocks having consecutive support of the same size.

The following theorem relates the linear RIP (27) of a matrix to its null space property in expectation (20).

Theorem 3.

Let $x$ be a random vector that follows a certain distribution. Let $\Phi$ be an $m \times N$ matrix that satisfies the linear RIP of order $k$ with constant $\delta_k$, and let $\Delta$ be a decoder. Let $\eta = x - \Delta(\Phi x)$. Assume further that the expected amplitude of $\eta$ decays with the index: $E[|\eta[i]|] \ge E[|\eta[i+1]|]$, $1 \le i < N$. Then $(\Phi, \Delta)$ satisfies the null space property in expectation of order $k$ in $\ell_1$ (20), with a constant that depends only on $\delta_k$. [Footnote 2: As in [cohen2009compressed], the result here is in the $\ell_1$ norm, while in the next section we will consider a natural extension of the RIP for SCS which can be studied in the $\ell_2$ norm, something possible for conventional CS only in a probabilistic setting, with one random sensing matrix independently drawn for each signal [cohen2009compressed].]

Proof.

Let denote the set of first indices of the entries in , the next indices, the next indices, etc. We have

where the second and last inequalities follow from the linear RIP property of $\Phi$, the third inequality follows from the triangle inequality, and the equality holds since $\Phi \eta = 0$. Hence we have

(28)

Since , we derive , so that

(29)

where the first inequality follows from the fact that . Inserting (29) into (28) gives

(30)

By the Cauchy–Schwarz inequality, we therefore obtain

(31)

which verifies the null space property with constant . ∎

For Gaussian signals $a \sim \mathcal{N}(0, \mathbf{S})$, with Gaussian or Bernoulli matrices, one realization drawn independently for each signal, and with the optimal decoder (5), the decay of $E[|\eta[i]|]$ assumed in Theorem 3 is verified through Monte Carlo simulations.

II-E3 From Random Matrices to Linear RIP

The next Theorem shows that Gaussian and Bernoulli matrices satisfy the conventional RIP for one subspace with overwhelming probability. The linear RIP will be addressed after it.

Theorem 4.

[achlioptas2003database, baraniuk2008simple] Let $\Phi$ be a random matrix of size $m \times N$ drawn according to any distribution that satisfies the concentration inequality

$P\left(\left|\,\|\Phi x\|_2^2 - \|x\|_2^2\,\right| \ge \epsilon \|x\|_2^2\right) \le 2 e^{-m c_0(\epsilon)}, \quad 0 < \epsilon < 1, \qquad (32)$

where $x \in \mathbb{R}^N$, and $c_0(\epsilon) > 0$ is a constant depending only on $\epsilon$. Then for any set $T$ with $\#T = k < m$, we have the conventional RIP condition

$(1 - \delta)\,\|x\|_2 \le \|\Phi x\|_2 \le (1 + \delta)\,\|x\|_2, \quad \forall x \in X_T, \qquad (33)$

where $X_T$ is the set of all vectors in $\mathbb{R}^N$ that are zero outside of $T$, with probability greater than or equal to $1 - 2(12/\delta)^k e^{-m c_0(\delta/2)}$. Gaussian and Bernoulli matrices satisfy the concentration inequality (32).

The linear RIP of order $k$ (27) requires that (33) holds for $\lceil N/k \rceil$ subspaces. The next theorem follows from Theorem 4 by simply multiplying the probability that the RIP fails to hold for one subspace by the number of subspaces.

Theorem 5.

Suppose that $m$, $N$ and $0 < \delta < 1$ are given. Let $\Phi$ be a random matrix of size $m \times N$ drawn according to any distribution that satisfies the concentration inequality (32). Then there exist constants $c_1, c_2 > 0$ depending only on $\delta$ such that the linear RIP of order $k$ (27) holds with probability greater than or equal to $1 - 2e^{-c_2 m}$ for $\Phi$ with the prescribed $\delta$ and $k \le c_1 m$.

Proof.

Following Theorem 4, for a -dimensional linear space , the matrix will fail to satisfy (33) with probability .

The linear RIP requires that (33) holds for at most such subspaces. Hence (33) will fail to hold with probability

(34)

Thus for a fixed , whenever , the exponent in the exponential on the right side of (34) is provided that . We can always choose small enough to ensure . This proves that with a probability , the matrix will satisfy the linear RIP (27). ∎

Comparing with conventional CS, where the null space property requires that the RIP (33) holds for $\binom{N}{k}$ subspaces [baraniuk2008simple, candes2006near, donoho2006compressed], the number of subspaces in the linear RIP (27) is sharply reduced to $\lceil N/k \rceil$ for Gaussian SCS, thanks to the pre-ordering of the coefficients and the resulting linear estimation. Therefore, with the same number of measurements $m$, the probability that a Gaussian or Bernoulli matrix satisfies the linear RIP is substantially higher than the probability that it satisfies the conventional RIP. Equivalently, for the same probability that $\Phi$ satisfies the linear RIP or the conventional RIP of order $k$, the required number of measurements for the linear RIP is $O(k)$, substantially smaller than the $O(k \log(N/k))$ required for the conventional RIP. Similar improvements have been obtained with model-based CS, which assumes structured sparsity on the signals [baraniuk2010model].
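A quick arithmetic illustration of this reduction, with illustrative values $N = 64$ and $k = 8$: the linear RIP only involves $\lceil N/k \rceil$ consecutive-support subspaces, whereas the conventional RIP must cover every $k$-sparse support.

```python
from math import comb, ceil

N, k = 64, 8
print(ceil(N / k))   # subspaces for the linear RIP:              8
print(comb(N, k))    # k-sparse supports for the conventional RIP: 4426165368
```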

With the results above, we have shown that for Gaussian signals, with sensing matrices satisfying the linear RIP (27) of order $k$, for example Gaussian or Bernoulli matrices with $m = O(k)$ rows, with overwhelming probability, and with the optimal and linear decoder (5), SCS leads to the instance optimality in expectation of order $k$ in $\ell_1$ (16), with constant $C_0$; the ratio $m/N$ is typically small by the definition of CS.

II-F Performance Bounds with RIP in Expectation

This section shows that with an RIP in expectation, a matrix isometry property more adapted to SCS, the Gaussian SCS MSE instance optimality (17), of order $k$ and constant $C_0$, holds in the $\ell_2$ norm with probability one for any matrix $\Phi$. The constant $C_0$ has a closed form and can be easily computed numerically.

Definition 4.

Let $x$ be a random vector that follows a certain distribution. Let $\Phi$ be an $m \times N$ sensing matrix and let $\Delta$ be a decoder. Let $T$ be a subset of indices. $(\Phi, \Delta)$ is said to have the RIP in expectation in $T$ with constant $\delta_T$ if

(35)

where $x_T$ is the signal restricted to $T$ ($x_T[i] = x[i]$ if $i \in T$, and $x_T[i] = 0$ otherwise), and the expectation is taken with respect to $x$, and to $\Phi$ if one random $\Phi$ is drawn independently for each $x$.

The conventional RIP is known to be satisfied only by some random matrices, Gaussian and Bernoulli matrices for example, with high probability. For a given matrix, checking the RIP property is, however, NP-hard [baraniuk2008simple]. By contrast, the constant of the RIP in expectation (35) can be measured for any matrix via a fast Monte Carlo simulation, the quick convergence being guaranteed by the concentration of measure [talagrand1996new]. The next proposition, which directly follows from (6) and (7), further shows that for Gaussian signals the constant of the RIP in expectation has a closed form.

Proposition 1.

Assume $a \sim \mathcal{N}(0, \mathbf{S})$, $\Phi$ is an $m \times N$ sensing matrix, and $\Delta$ is the optimal and linear decoder (5). Then $(\Phi, \Delta)$ satisfies the RIP in expectation in $T$, with constant

(36)

where $\mathbf{E}_T$ is an extraction matrix giving $a_T = \mathbf{E}_T a$, i.e., $\mathbf{E}_T[i,i] = 1$ for $i \in T$, all the other entries being zero. The expectation with respect to $\Phi$ is calculated if one random $\Phi$ is drawn independently for each $a$.

Proof.

Let $\hat{a} = \mathbf{S} \Phi^T (\Phi \mathbf{S} \Phi^T)^{-1} \Phi a$, which follows from the MAP estimation (5). (36) is derived by calculating the covariance matrices of $a_T - \hat{a}_T$ and of $a_T$, and using the fact that the trace of a covariance matrix yields the average energy of the underlying random vector. ∎
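To illustrate why such constants are easy to evaluate for Gaussian signals, the following sketch compares the closed-form SCS error energy, i.e., the trace of the error covariance (6), with a Monte Carlo estimate for a fixed sensing matrix. It does not reproduce the exact constant of (36); it only shows the kind of computation involved, with illustrative parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, alpha = 64, 8, 2.0
lam = (1.0 + np.arange(N)) ** (-alpha)
S = np.diag(lam)
Phi = rng.standard_normal((m, N)) / np.sqrt(m)

# Closed form: trace of the error covariance S - S Phi^T (Phi S Phi^T)^{-1} Phi S, cf. (6).
K = S @ Phi.T @ np.linalg.solve(Phi @ S @ Phi.T, Phi @ S)
mse_closed = np.trace(S - K)

# Monte Carlo estimate of E||a - a_hat||^2 with the same fixed Phi.
a = rng.standard_normal((5000, N)) * np.sqrt(lam)
M = S @ Phi.T @ np.linalg.solve(Phi @ S @ Phi.T, Phi)   # decoder applied to Phi a
a_hat = a @ M.T
mse_mc = np.mean(np.sum((a - a_hat) ** 2, axis=1))

print(mse_closed, mse_mc)   # the two agree up to Monte Carlo error
```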

The next theorem shows that the RIP in expectation leads to the MSE null space property holding with equality.

Theorem 6.

Let $x$ be a random vector that follows a certain distribution, $\Phi$ an $m \times N$ sensing matrix, and $\Delta$ a decoder. Assume that $(\Phi, \Delta)$ has the RIP in expectation in $T$ with constant $\delta_T$, and in $T^c$ with constant $\delta_{T^c}$:

(37)

where $T^c$ denotes the complement of $T$, and $x_T$ is the signal restricted to $T$ ($x_T[i] = x[i]$ if $i \in T$, and $x_T[i] = 0$ otherwise). Then $(\Phi, \Delta)$ satisfies

(38)

where the constant depends on $\delta_T$ and $\delta_{T^c}$. In particular, if $T = \{1, \dots, k\}$, with $k < N$, then $(\Phi, \Delta)$ satisfies the MSE null space property of order $k$, which holds with equality,

(39)
Proof.

We derive (38) by