## 1 Introduction

Many problems of practical interest in machine learning involve a high dimensional feature space and a relatively small number of observations. Inference is in general difficult for such underdetermined problems due to high variance and therefore regularization is often the key to extracting meaningful information from such problems

(Tibshirani, 1994). The classical approach is Tikhonov regularization (also known as regularization), but during the last few decades sparsity have been an increasingly popular choice of regularization for many problems, giving rise to methods such as the LASSO (Tibshirani, 1994), Sparse Bayesian Learning (Tipping, 2001) and sparsity promoting priors (Mitchell and Beauchamp, 1988).In this work we address the problem of finding sparse solutions to linear inverse problems of the form

(1) |

where is the desired solution,

is an observed measurement vector,

is a known forward model and is additive measurement noise. We are mainly interested in the underdetermined regime, where the number of observations is smaller than the number of unknowns, i.e. . In the sparse recovery literature it has been shown that the sparsity constraint is crucial for recovering from a small set of linear measurements (Candès et al., 2006). Furthermore, the degree of sparsity of , i.e. , dictates the required number of measurements for robust reconstruction of . This relationship between the number of non-zero coefficients and the number of measurements has given rise to so-called phase transition curves (Donoho and Tanner, 2010). A large body of research has been dedicated to improve these phase transition curves and these endeavors have lead to the concepts of

multiple measurement vectors (Cotter et al., 2005) and structured sparsity (Huang et al., 2009).The multiple measurement vector problem (MMV) is a natural extension of eq. (1), where multiple measurement vectors are observed and assumed to be generated from a series of signals , which share a common sparsity pattern. In matrix notation, we can write the problem as

(2) |

where the desired solution is now a matrix and similar for the measurement matrix and the noise term . The assumption of joint sparsity allows one to recover with significantly fewer observations compared to solving each of the inverse problems in eq. (1) separately (Cotter et al., 2005). This MMV approach has also been generalized to problems where the sparsity pattern is evolving slowly in time (Ziniel and Schniter, 2013a). Structured sparsity, on the other hand, is a generalization of simple sparsity and seek to exploit the fact that the sparsity pattern of many natural signals contains a richer structure than simple sparsity, e.g. group sparsity (Jacob et al., 2009b) or cluster structured sparsity (Yu et al., 2012).

In this paper we combine these two approaches and focus on problems, where the sparsity pattern of exhibits a spatio-temporal structure. In particular, we assume that the row and column indices of

can be associated with a set of spatial and temporal coordinates, respectively. This can equivalently be interpreted as a sparse linear regression problem, where the support of the regressors is correlated in both space and time. Applications of such a model include dynamic compressed sensing

(Ziniel and Schniter, 2013a), background subtraction in computer vision

(Cevher et al., 2009) and EEG source localization problem (Baillet et al., 2001).We take a Bayesian approach to modeling this structure since it provides a natural way of incorporating such prior knowledge in a model. In particular, we propose a hierarchical probabilistic model for based on the so-called spike and slab prior (Mitchell and Beauchamp, 1988). We extend the work in (Andersen et al., 2014) introducing a smooth latent variable controlling the spatiotemporal structure of the support of . We aim for full Bayesian inference under the proposed probabilistic model, but inference w.r.t. the exact posterior distribution of interest is intractable. Instead we resort to approximate inference using Expectation Propagation (Minka, 2001; Opper and Winther, 2000), which has been shown to provide accurate inference for spike and slab priors (Hernández-Lobato et al., 2013; Hernandez-Lobato et al., 2010; Jylänki et al., 2014; Peltola et al., 2014). Our model formulation is generic and generalizes easily to other types of observations. In particular, we also combine the proposed prior with a probit observation model to model binary observations in a sparse linear classification setting.

The contribution of this paper is two-fold. First we extend the structured spike and slab prior and the associated EP inference scheme to incorporate both spatial and temporal smoothness of the support. However, the computational complexity of the resulting EP algorithm is prohibitively slow for problems of even moderate sizes of signal dimension and length . To alleviate the computational bottleneck of the EP algorithm we propose three different approximation schemes and evaluate them based on synthetic and real data sets.

### 1.1 Related work

In this section we briefly review some of the most common approaches to simple sparsity and their generalization to structured sparsity. The classical approach to sparsity is the LASSO (Tibshirani, 1994), which operates by optimizing a least squares cost function augmented with a penalty on the regression weights. Several extensions have been proposed in the literature to generalize the LASSO to the structured sparsity setting, examples include group and graph LASSO (Jacob et al., 2009b). From a probabilistic perspective sparsity can be encouraged through the use of sparsity-promoting priors, i.e. prior distributions which favor sparse solutions. A non-exhaustive list of sparsity-promoting priors includes the Laplace distribution, Automatic Relevance Determination priors, the horseshoe prior and the spike and slab priors. All of these were originally designed to enforce simply sparsity, but they have all been generalized to the structured sparsity setting. The general strategy is to extend univariate densities to correlated multivariate densities by augmenting the models with a latent multivariate variable, where the correlation structure can be controlled explicitly, e.g. using Markov Random Fields (Cevher et al., 2009; Hernandez-Lobato et al., 2011)

or multivariate Gaussian distributions. Here we limit ourselves to consider the latter.

From the probabilistic perspective optimizing with an regularization term can be interpreted as maximum a posteriori (MAP) inference under an i.i.d. Laplace prior distribution on the regression weights (Park and Casella, 2008). The univariate Laplace prior has been generalized to a multivariate distribution with coupling between the regression weights through a scale mixture formulation (Gerven et al., 2009).

Another approach is Automatic Relevance Determination (ARD) (Neal, 2012), which works by imposing independent zero mean Gaussian priors with individual precision parameters on the regression weights. These precision parameters are then optimized using a maximum likelihood type II and the idea is then that the precision parameters of irrelevant features will approach zero and thereby force the weights of the irrelevant features to zero as well. Wu et al. extends the ARD framework to promote spatial sparsity by introducing a latent multivariate Gaussian distribution to impose spatial structure onto the precision parameters of ARD giving rise to dependent relevance determination priors (Wu et al., 2014b).

The horseshoe prior is defined as a scale mixture of Gaussians, where a half-Cauchy distribution is used as prior for the standard deviation of the Gaussian density

(Carvalho et al., 2009). The resulting density has two very appealing properties for promoting sparsity, namely heavy tails and an infinitely large spike at zero. A generalization to the multivariate case can be found in (Hernández-Lobato and Hernández-Lobato, 2013).The spike and slab prior is an increasingly popular choice of sparsity promoting prior and is given by a binary mixture of two components: a Dirac delta distribution (spike) at zero and Gaussian distribution (slab) (Mitchell and Beauchamp, 1988). The spike and slab prior has been generalized to the group setting in (Hernández-Lobato et al., 2013), to clustered sparsity setting in (Yu et al., 2012) and spatial structures in (Andersen et al., 2014; Nathoo et al., 2014). In the (Nathoo et al., 2014) the spatial structure is induced using basis functions and in (Andersen et al., 2014) the structured is imposed using a multivariate Gaussian density. The latter is the starting point of this work.

### 1.2 Structure of paper

This paper is organized as follows. In section 2 we review the structured spike and slab prior and in section 3 we discuss different ways of extending the model to include the temporal structure as well. After introducing the models we propose an algorithm for approximate inference based on the expectation propagation (EP) framework. We review the basics of EP and describe the proposed algorithm in section 4. In section 5 we introduce three simple approximation schemes to speed of the inference process and discuss their properties. Finally, in section 6 we demonstrate the proposed method using synthetic and real data sets.

### 1.3 Notation

We use bold uppercase letters to denote matrices and bold lowercase letters to denote vectors. Unless stated otherwise, all vectors are column vectors. Furthermore, we use the notation and for the ’th row and ’th column in the matrix , respectively. denotes the set of integers from to , i.e. . We use the notation to denote the element-wise Hadamard product of and and for the Kronecker product of matrices and . We use to denote a multivariate Gaussian density over with mean vector and covariance matrix and

denotes a Bernoulli distribution on

with probability of .## 2 The structured spike and slab prior

The purpose of this section is to describe the structured spike and slab prior (Andersen et al., 2014), but first we briefly review the conventional spike and slab prior (Mitchell and Beauchamp, 1988; Titsias and Lazaro-Gredilla, 2011; Hernández-Lobato et al., 2013). For , the spike and slab prior distribution is given by

(3) |

where is the Dirac delta function and and

are hyperparameters. In particular,

is the prior probability of a given variable being active, i.e.

, and are the prior mean and variance, respectively, of the active variables. The spike and slab prior in eq. (3) is also known as the Bernoulli-Gaussian prior since the prior can decomposed as(4) |

Thus, the latent binary variable

can interpreted as an indicator variable for the event . We will refer to as the sparsity pattern or the support of . In eq. (3) and (4) we condition explicitly on the hyperparameters , but to ease the notation we will omit this in the remainder of this paper.Due to the product form of the distribution in eq. (3) and (4), the variables and are a priori assumed to be independent for

. This implies that the number of active variables follows a binomial distribution and hence, the marginal probability of

and being jointly active, is given by for all . However, in many applications the variables might a priori have an underlying topographic relationship such as a spatial or temporal structure. Without loss of generality we will assume a spatial relationship, where denotes the spatial coordinates of . For such models, it is often a reasonable assumption that should depend on . For instance, neighboring voxels in fMRI analysis (Penny et al., 2005) are often more likely to be active simultaneously compared to two voxels far apart. Such a priori knowledge is neglected by the conventional spike and slab prior in eq. (3).The structured spike and slab model is capable of modeling such structure and is given in terms of a hierarchical model

(5) | ||||

(6) | ||||

(7) |

where is a latent variable controlling the structure of the sparsity pattern. Using this model prior knowledge of the structure of the sparsity pattern can be encoded using and . The mean value controls the expected degree of sparsity and the covariance matrix determines the prior correlation of the support. The map serves the purpose of squeezing into the unit interval and thereby represents the probability of . Here we choose

to be the standard normal cumulative distribution function (CDF), but other choices, such as the logistic function, are also possible.

Using this formulation, the marginal prior probability of the ’th variable being active is given by

(8) |

From this expression it is seen that when , the prior belief of is unbiased, i.e. , but when the variable is biased toward zero and vice versa. If a subset of features for some subset is a priori more likely to explain the observed data , then this information can be encoded in the prior distribution by choosing the prior mean of such that for all and for all . However, in the remainder of this paper we will assume that the prior mean is constant, i.e. for some .

The prior probability of two variables, and , being jointly active is

(9) |

If is a diagonal matrix, and become independent and we recover the conventional spike and slab prior. On the other hand, if we choose to be a stationary covariance function of the form , we see that the joint activation probabilities indeed depend on the spatial distance as desired. Finally, we emphasize that this parametrization it not limited to nearest neighbors-type structures. In fact, this parametrization supports general structures that can be modeled using generic covariance functions.

## 3 The spatio-temporal spike and slab prior

In the following we will extend the structured spike and slab prior distribution to model temporal smoothness of the sparsity pattern as well. Let be the time index, then , and are the signal coefficients, the sparsity pattern and the latent structure variable at time . Furthermore, we define the corresponding matrix quantities , and . We consider four different type of temporal structure for the sparsity pattern.

### 3.1 Time-independent and stationary sparsity patterns

The simplest model is to treat the vectors as i.i.d vectors., i.e.

(10) | ||||

(11) |

This effectively corresponds to solving each of the regressions problems in eq. (1) independently. Another simple approach is to use the so-called joint sparsity assumption and assume that the sparsity pattern is stationary or constant for all , and thus all vectors share a common binary support vector , i.e.

(12) | ||||

(13) |

The sharing of the binary support variable often yields much more robust inference, when the assumption of joint sparsity is justified (Cotter et al., 2005; Zhang and Rao, 2011; Ziniel and Schniter, 2013b).

### 3.2 First order process prior

The joint sparsity assumption is too strict for some applications. Often a good approximation can be achieved by assuming the sparsity pattern is piecewise constant, but this effectively limits the number of measurement vectors that can be used in the estimation process. We now relax this assumption and assume that the support is slowly changing in time

(Andersen et al., 2015). To model this, we can impose a first order Gauss-Markov process on of the form(14) |

where the hyperparameters and control the temporal correlation and the “innovation” of the process, respectively. Furthermore, we assume that the prior distribution on is given by

(15) |

Under these assumptions the marginal distribution of becomes

(16) |

Therefore, it follows by induction that if and satisfy , then the marginal distribution of is for all . Furthermore, it is seen that when , this model simplifies to the time independent model given in eq. (11). On the other hand, when , this model simplifies to the joint sparsity model in eq. (13). Therefore, this model can be seen as a generalization of these two extreme cases.

### 3.3 Kronecker product formulation

The first order model has the advantage that it factorizes over time, which often makes the resulting inference problem much easier. On the other hand, first order Markovian dynamics is often not sufficient for capturing long range correlations. Imposing a Gaussian process distribution on with arbitrary covariance structure would facilitate modeling of long range correlations in both time and space. Therefore, the hierarchical prior distribution for becomes

(17) | ||||

(18) | ||||

(19) |

where the mean and covariance matrix, i.e. and are now defined for the full -space. This model is more expressive, but the resulting inference problem becomes infeasible for even moderate sizes of and . But if we assume equidistant sampling in the temporal dimension, the covariance matrix simplifies to a Kronecker product, i.e.

(20) |

where and
This decomposition leads to more efficient inference schemes as we will discuss in section 5.

The coefficients are conditionally independent given the support . For some applications it could be desirable to impose either spatial smoothness, temporal smoothness or both on the non-zero coefficients themselves (Wu et al., 2014a; Ziniel and Schniter, 2013a), but in this work we only assume a priori knowledge of the structure of the support. Although temporal smoothness of could easily be incorporated into the models described above.

## 4 Inference using spatiotemporal priors

In the previous sections we have described the structured spike and slab prior and how to extend it to model temporal smoothness as well. We now turn our attention on how to perform inference using these models. We focus our discussion on the most general formulation using as given in eq. (17)-(19). Let be an observation matrix, where is an observation vector for time . We assume that the distribution on factors over time and is given by

(21) |

We consider two different noise models: an isotropic Gaussian noise model and a probit noise model. The Gaussian noise model is suitable for linear inverse problems with forward model or equivalently sparse linear regression problems with design matrix . On the other hand, the probit model is suitable for modeling binary observations and is given by , where is the ’th row of . For both models we further assume that the matrix is constant across time. However, this assumption can be easily relaxed to have depend on .

For both noise models the resulting joint distribution becomes

(22) | ||||

(23) |

We seek the posterior distribution of the parameters and conditioned on the observations , which is obtained by applying Bayes’s Theorem to the joint distribution in eq. (22)

(24) |

where is the marginal likelihood of . Due to the product of mixtures in the distribution , the expression for the marginal likelihood involves a sum over terms. This renders the computation of the normalization constant intractable for even small and . Hence, the desired posterior distribution is also intractable and we have to resort to approximate inference.

In the literature researchers have applied a whole spectrum of approximate inference methods for spike and slab priors, e.g. Monte Carlo-methods (Mitchell and Beauchamp, 1988), mean-field variational inference (Titsias and Lazaro-Gredilla, 2011), approximate message passing (Vila and Schniter, 2013) and expectation propagation (Hernández-Lobato et al., 2013; Andersen et al., 2014). We use the latter since expectation propagation has been shown to provide accurate inference for spike and slab models (Hernández-Lobato et al., 2015).

### 4.1 The Expectation Propagation Framework

In this section we briefly review expectation propagation for completeness. Expectation propagation (EP) (Minka, 2001; Opper and Winther, 2000)

is a deterministic framework for approximating probability distributions. Consider a probability distribution over the variables

that factorizes into components(25) |

where is taken to be a subvector of . EP takes advantage of this factorization and approximates with a function that shares the same factorization

(26) |

EP approximates each site term with a (scaled) distribution from the exponential family. Since the exponential family is closed under products, the approximation will also be in the exponential family. Consider the product of all terms except the ’th term

(27) |

The core of the EP framework is to choose such that . By approximating with in the context of , we ensure that the approximation is most accurate in the region of high density according to the cavity distribution . This scheme is implemented by iteratively minimizing the KL divergence . Since belongs to the exponential family, the unique solution is obtained by matching the expected sufficient statistics (Bishop, 2006). That is, the variational minimization problem

(28) |

is a convex problem and the unique solution is found by matching the expected sufficient statistics. Once the solution is obtained we update the ’th site approximation as

(29) |

The steps in eq. (27), (28) and (29) are repeated sequentially for all until convergence is achieved.

### 4.2 The Expectation Propagation Approximation

The EP framework provides flexibility in the choice of the approximating factors. This choice is a trade-off between analytical tractability and sufficient flexibility for capturing the important characteristics of the true density. Consider the desired posterior density of interest

(30) |

This posterior density is decomposed into four terms for , where the first three terms can be further decomposed. The term is decomposed into terms of the form , whereas the terms and are further decomposed as follows

(31) | ||||

(32) | ||||

(33) |

Each term only depend on , only depend on and and only depend on and . Furthermore, the terms couple the variables and , while couple the variables and . Based on these observations, we choose , and to have the following forms

(34) | ||||

(35) | ||||

(36) |

The exact term is a distribution of conditioned on , whereas the approximate term is a function of that depends on the data through and etc.
Finally, already belongs to the exponential family and does therefore not have to be approximated by EP. That is, .

Define , and and similarly for , and , then the resulting global approximation becomes

(37) |

where the parameters of the global approximation are obtained by summing the natural parameters. In terms of mean and variance, we get

(38) | ||||

(39) | ||||

(40) | ||||

(41) | ||||

(42) |

To compute the global approximation, we need to estimate the parameters , , , , , , and for all using EP. The estimation process of and depends on the observation model being used, whereas the estimation procedure of the remaining parameters are independent on the choice of observation model.

In the conventional EP algorithm, the site approximations are updated in a sequential manner meaning that the global approximation is updated every time a single site approximation (Minka, 2001) is refined. In this work we use the parallel update scheme to reduce the computational complexity of the algorithm. That is, we first update all the site approximations of the form for , , and then we update the global approximation w.r.t. and similarly for the and the global approximation w.r.t. . From a message passing perspective this can be interpreted as a particular scheduling of messages (Minka, 2005).
The proposed algorithm is summarized in Algorithm 1.

### 4.3 Estimating parameters for

The estimation procedure for depends on the choice of observation model. Here we consider two different observation models, namely the isotropic Gaussian and the probit models. Both of these models lead to closed form update rules, but this is not true for all choices of . In general if factorizes over and each term only depends on through

, then the resulting moment integrals are 1-dimensional and can be solved relatively fast using numerical integration procedures

(Jylänki et al., 2011) if no closed form solutions exists.Under the Gaussian noise model, we have

(43) |

Thus, is already in the exponential family for all and does therefore not have to be approximated using EP. In particular, the parameters for are determined by the relations and . For simplicity we also assume that the noise variance is constant for all .

Under the probit likelihood the term decompose to . In this case the update of each site approximation resembles the updates for Gaussian process classification using EP, see (Rasmussen and Williams, 2006; Hernandez-Lobato et al., 2010) for details.

### 4.4 Estimating parameters for

The terms factor over , which implies that we only need the marginal cavity distributions of each pair of and . Consider the update of the ’th term at time , i.e. . The first step is to compute the marginal cavity distributions by removing the contribution of from the marginal of the global approximation using eq. (27)

(44) |

When the approximate distribution belongs to the exponential family, the cavity distribution is simply obtained by computing the differences in natural parameters. Expressed in terms of mean and variance, we get

(45) | ||||

(46) | ||||

(47) |

The cavity parameter for in is simply equal to (and vice versa) since and are the only two terms contributing to the distribution over . Next, we form the tilde distribution and compute the solution to the KL minimization problem in eq. (28) by matching the expected sufficient statistics. This amounts to computing the zeroth, first and second moments w.r.t.

(48) |

and the first moment of