Bayesian Conditional Density Filtering

01/15/2014 ∙ by Shaan Qamar, et al. ∙ Duke University and University of California, Santa Cruz

We propose a Conditional Density Filtering (C-DF) algorithm for efficient online Bayesian inference. C-DF adapts MCMC sampling to the online setting, sampling from approximations to conditional posterior distributions obtained by propagating surrogate conditional sufficient statistics (a function of data and parameter estimates) as new data arrive. These quantities eliminate the need to store or process the entire dataset simultaneously and offer a number of desirable features, often including reduced memory requirements and runtime, improved mixing, and state-of-the-art parameter inference and prediction. These improvements are demonstrated through several illustrative examples, including an application to high dimensional compressed regression. Finally, we show that C-DF samples converge to the target posterior distribution asymptotically as sampling proceeds and more data arrive.




1 Introduction

Modern data are increasingly high dimensional, both in the number of observations and in the number of predictors measured. Statistical methods increasingly make use of low-dimensional structural assumptions (e.g., sparsity) to combat the curse of dimensionality, and efficient model-fitting tools must evolve quickly to keep pace with the rapidly growing dimension and complexity of the data they are applied to. Bayesian methods provide a natural probabilistic characterization of uncertainty in parameters and predictions, but there is a lack of scalable inference algorithms with guarantees on accuracy. Markov chain Monte Carlo (MCMC) methods are routinely used for Bayesian computation due to their ease, generality, and convergence guarantees. When the number of observations is truly massive, however, data processing and computational bottlenecks render many MCMC methods infeasible, as they demand that (1) the entire dataset (or available sufficient statistics) be held in memory, and (2) likelihoods be evaluated to update model parameters at every sampling iteration, which can be costly.

A number of alternative strategies have been proposed in recent years for scaling MCMC to large datasets. One possibility is to parallelize computation within each MCMC iteration using GPUs or multicore architectures to free bottlenecks in updating unknowns specific to each sample and in calculating likelihoods (Medlar et al., 2013). Another possibility is to rely on Hamiltonian Monte Carlo with stochastic gradient methods used to approximate gradients with subsamples of the data (Welling and Teh, 2011; Teh et al., 2014). Korattikara et al. (2013) instead use sequential hypothesis testing to choose a subsample to use in approximating the acceptance ratio in Metropolis-Hastings. More broadly, data subsampling approaches attempt to mitigate the MCMC computational bottleneck by choosing a small set of points at each iteration for which the likelihood is to be evaluated, and doing so in a way that preserves validity of the draws as samples from the desired target posterior (Quiroz et al., 2014; Maclaurin and Adams, 2014). Building on ideas reminiscent of sequential importance resampling and particle filtering (Doucet et al., 2000; Arulampalam et al., 2002), these methods are promising and are an active area of research. Yet another approach assigns different data subsets to different machines, runs MCMC in parallel for each subset, and recombines the resulting samples (Scott et al., 2013; Minsker et al., 2014). However, theoretical guarantees for the latter have not been established in great generality.

Sequential Monte Carlo (SMC) (Chopin, 2002; Arulampalam et al., 2002; Lopes et al., 2011; Doucet et al., 2001) is a popular technique for online Bayesian inference that relies on resampling particles sequentially as new data arrive. Unfortunately, it is difficult to scale SMC to problems with many observations and parameters due to the need to employ very large numbers of particles to obtain adequate approximations and prevent particle degeneracy. The latter is addressed through rejuvenation steps using all the data (or sufficient statistics), which becomes expensive in an online setting. One could potentially rejuvenate particles only at earlier time points, but this may not protect against degeneracy for models involving many parameters. More recent particle learning (PL) algorithms (Carvalho et al., 2010) reduce degeneracy for the dynamic linear model, with satisfactory density estimates for parameters, but they too require propagating a large number of particles, which significantly adds to the per-iteration computational complexity. MCMC can be extended to accommodate data collected over time by adapting the transition kernel and drawing a few samples at each time point, so that the samples converge in distribution to the joint posterior in time (Yang and Dunson, 2013). However, this method requires the full data (or sufficient statistics) to be stored, leading to storage and processing bottlenecks as more data arrive. Indeed, models with large parameter spaces often have sufficient statistics that are also high dimensional (e.g., linear regression). In addition, MCMC often mixes poorly, requiring longer runs with additional storage and computation.

In simple conjugate models, such as Gaussian state-space models, efficient updating equations can be obtained using methods related to the Kalman filter. Assumed density filtering (ADF) was proposed (Lauritzen, 1992; Boyen and Koller, 1998; Opper, 1998) to extend this computational tractability to broader classes of models. ADF approximates the posterior distribution with a simple conjugate family, leading to approximate online posterior tracking. The predominant concern with this approach is the propagation of errors with each additional approximation to the posterior in time. Expectation propagation (EP) (Minka, 2001; Minka et al., 2009) improves on ADF through additional iterative refinements, but the approximation is limited to the class of assumed densities and has no convergence guarantees. Moreover, even in fairly standard settings, approximating the posterior by an arbitrary assumed density severely underestimates parameter and predictive uncertainty. Similarly, variational methods developed in the machine learning literature attempt to address various difficulties facing MCMC by making additional approximations to the joint posterior distribution over model parameters. These procedures often work well in simple low-dimensional settings, but they fail to adequately capture dependence in the joint posterior, severely underestimate uncertainty in more complex or higher-dimensional settings, and generally come with no accuracy guarantees. Additionally, they often require computing expensive gradients for large datasets.

Hoffman et al. (2013) proposed stochastic variational inference (SVI), which uses a stochastic approximation to the full gradient for a subset of parameters, circumventing this computational bottleneck. However, SVI requires retrieving, storing, or having access to the entire dataset at once, which is often infeasible in very large or streaming data settings. A parallel literature on online variational approximations (Hoffman et al., 2010) focuses primarily on improving batch inferences by feeding in data sequentially. In addition, these methods require specifying or tuning a ‘learning rate,’ and generally offer no storage reduction. Broderick et al. (2013) recently proposed a streaming variational Bayes (SVB) algorithm requiring storage of only the most recent batch of data from a stream. However, all variational methods (batch or online) rely on a factorized form of the posterior that typically fails to capture dependence in the joint posterior and severely underestimates uncertainty. Recent attempts to design careful online variational approximations (Luts and Wand, 2013), though successful in accurately estimating marginal densities, are limited to specific models, and no theoretical guarantees on accuracy have been established except in stylized cases.

We propose a new class of Conditional Density Filtering (C-DF) algorithms that extend MCMC sampling to streaming data. Sampling proceeds by drawing from conditional posterior distributions, but instead of conditioning on conditional sufficient statistics (CSS) (Carvalho et al., 2010; Johannes et al., 2010), C-DF conditions on surrogate conditional sufficient statistics (SCSS), computed using sequential point estimates for parameters along with the data observed. This eliminates the need to store the full data history or process the entire dataset at once, and leads to approximations of the conditional distributions that produce samples from the correct target posterior asymptotically. The C-DF algorithm is demonstrated to be highly versatile and efficient across a variety of settings, with SCSS enabling online sampling of parameters, often with dramatic reductions in memory and per-iteration computational requirements. C-DF has also been successfully applied to non-negative tensor factorization models for massive binary and count-valued streaming tensor data, with state-of-the-art performance demonstrated against a wide range of batch and online competitors (Hu et al., 2015a,b).

Section 2 introduces the C-DF algorithm in generality, along with definitions, assumptions, and a description of how to identify the updating quantities of interest. Section 3 demonstrates approximate online inference using C-DF in several illustrative examples: Section 3.1 applies C-DF to linear regression and the one-way Anova model, and Section 3.3 considers more complex settings, namely extensions of the C-DF algorithm to a dynamic linear model and to binary regression via the high dimensional probit model. There, we investigate the performance of C-DF in settings with an increasing parameter space. Along with comparing inferential and predictive performance, we discuss the computational, storage, and mixing advantages of our method over state-of-the-art competitors in each case. Section 4 presents a detailed implementation of the C-DF algorithm for high dimensional compressed regression. We report state-of-the-art inferential and predictive performance across extensive simulation experiments as well as real data studies in Section 5. The C-DF algorithm is also applied to a Poisson mixed effects model to update the parameters of a variational approximation to the posterior distribution in Appendix D. Section 6 presents a finite-sample error bound for approximate MCMC kernels and establishes the asymptotic convergence guarantee for the proposed C-DF algorithm. Section 7 concludes with a discussion of extensions and future work. Proofs and additional figures pertaining to Section 3 appear in Appendices A and C, respectively.

2 Conditional density filtering


Let Θ denote the collection of unknown parameters in the probability model, with each parameter taking values in an arbitrary sample space (e.g., a subset of Euclidean space). D_t denotes the data observed at time t, while D^(t) denotes the collection of data observed through time t. In the parameter partitions introduced below, either set in a pair is allowed to be empty.

2.1 Surrogate conditional sufficient statistics

  • A statistic C_t is defined to be a conditional sufficient statistic (CSS) for a parameter at time t if the full conditional distribution of that parameter, given all other model parameters and the data observed through time t, depends on them only through C_t. When such a statistic exists, it can be written in terms of known functions of the data and the remaining parameters, and updated recursively as new shards arrive.

Because the CSS depends explicitly on the other model parameters, its value changes whenever new samples are drawn for this collection of parameters. This necessitates storing the entire dataset or its sufficient statistics. The following issues inhibit efficient online sampling: (1) sufficient statistics, when they exist, often scale poorly with the size of the parameter space, creating significant storage overhead; (2) updating the CSS at every sampling iteration causes an immense per-iteration computational bottleneck; and (3) updating potentially massive numbers of observation-specific latent variables can lead to significant computational overhead unless conditional independence structures allow parallelized draws. To address these challenges, we propose surrogate conditional sufficient statistics (SCSS) as a means to approximate the full conditional distributions at time t.

  • The surrogate conditional sufficient statistic (SCSS) for a parameter is obtained by substituting, in the recursive CSS update, consistent point estimates for the conditioning parameters at each time. The conditional distribution of the parameter given its SCSS is the C-DF approximation to its full conditional distribution.

If the full conditional for a parameter admits a surrogate quantity, approximate sampling via C-DF proceeds by drawing from the conditional given the SCSS. Crucially, the SCSS depends only on point estimates of the other parameters, which remain fixed while drawing samples from the approximate full conditional distribution. This avoids having to update potentially expensive functionals (i.e., CSS) for each parameter at every MCMC iteration. If the approximating full conditionals are conjugate, sampling proceeds in a Gibbs-like fashion; otherwise, draws may be obtained via a Metropolis step with a suitable proposal distribution and acceptance probability.


Section 3.1 provides examples of the former, and Section 3.3.1 of the latter. In both cases, draws using the C-DF approximation to the conditional distributions have the correct asymptotic stationary distribution (see Section 6).

While this article focuses on cases where the conditional distributions admit surrogate quantities or sufficient statistics, this is often not the case for at least some of the model parameters. In such settings, various distributional approximations (e.g., Laplace approximation, variational Bayes, expectation propagation) are often employed for tractability. Here, it is possible to use C-DF in conjunction with these methods to yield approximate inference in a streaming data setting, albeit without theoretical guarantees on the limiting distribution of the samples obtained. Appendix D considers an application to Poisson regression where the conditional distributions do not admit surrogate quantities and no augmentation scheme is available. There, a variational approximation to the joint posterior admits SCSS, and we demonstrate using the C-DF algorithm when a full conditional is replaced by an approximating kernel at time t. Propagating SCSS associated with the latter allows us to obtain approximate online inference, although the limiting distribution of the sampled draws is not guaranteed to be the correct asymptotic stationary distribution.
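To make the CSS/SCSS distinction concrete, the following toy sketch (ours, not the paper's) runs C-DF on a normal model with unknown mean and variance. The conditional for the variance would ordinarily require recomputing the residual sum of squares around the *current* draw of the mean at every Gibbs iteration; the SCSS instead accumulates it once per shard with the mean frozen at its running estimate. Priors are left vague for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# SCSS propagated online: n, sum_y, and ss = accumulated sum of
# (y - mu_hat)^2 with mu_hat frozen at its estimate from the previous
# shard. Exact CSS for sigma2 would instead re-use the current draw of
# mu at every Gibbs iteration, requiring the full data history.
n, sum_y, ss = 0.0, 0.0, 0.0
mu_hat = 0.0  # running point estimate for mu

for t in range(50):  # 50 data shards of 20 observations each
    shard = rng.normal(loc=2.0, scale=1.5, size=20)
    # step 1: update SCSS using the new shard (mu_hat held fixed)
    n += shard.size
    sum_y += shard.sum()
    ss += ((shard - mu_hat) ** 2).sum()
    # step 2: Gibbs-like draws from the C-DF approximate conditionals
    sigma2 = 1.0
    mus = []
    for _ in range(100):
        # mu | sigma2, SCSS ~ N(sum_y / n, sigma2 / n)  (vague prior)
        mu = rng.normal(sum_y / n, np.sqrt(sigma2 / n))
        # sigma2 | SCSS ~ Inv-Gamma(n/2, ss/2): ss uses mu_hat, not mu
        sigma2 = 1.0 / rng.gamma(n / 2.0, 2.0 / ss)
        mus.append(mu)
    # step 3: refresh the point estimate used by future SCSS updates
    mu_hat = float(np.mean(mus))
```

After all shards, `mu_hat` concentrates near the true mean (2.0) and `sigma2` near the true variance (2.25), with some inflation from the early shards where `mu_hat` was poorly estimated.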

2.2 The C-DF algorithm

Define a fixed grouping of the model parameters into partition sets whose union is the full parameter set and whose pairwise intersections are empty. The model specification and conditional independence assumptions often suggest natural parameter partitions, though alternate partitions may be more suitable for specific tasks. For streaming data, these sets may be chosen to maximize computational and storage gains. Examples of various partitioning schemes are presented in the context of several illustrative examples in Section 3.

For model parameters indexed by each partition set, the C-DF algorithm samples sequentially from the respective approximating conditional distributions. Efficient updating equations for the SCSS, based on the incoming data shard and estimates of the remaining parameters from the previous time point, result in scalable parameter inference with sampling-based approximations to the posterior distribution (see the definition of SCSS in Section 2.1). The approximate samples are in turn used to obtain updated point estimates; in cases where a closed-form expression for the conditional mean is available, this may be used instead of a Monte Carlo estimate. The algorithm continues by iterating over the parameter partitions. An outline of the C-DF algorithm is given in Algorithm 1.

1:(1) Data shard at time ; (2) parameter partition index sets , , with , ; and (3) parameter SCSS .
2:Approximate posterior draws and SCSS .
3:function CDF.SAMPLE()
4:     for  do
5:           // step 1: update SCSS for
6:         for  do
7:                 //
8:              set
9:         end for
11:           // step 2: approximate C-DF sampling
12:         for  do
13:              for  do
14:                     // and
15:                  sample
16:              end for
17:         end for
19:           // step 3: update parameter estimates
20:         for  do
21:              set
22:              store as approximate posterior samples at time
23:         end for
24:     end for
25:end function
Algorithm 1 A sketch of the C-DF algorithm for approximate online MCMC
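Algorithm 1 can be sketched as a generic driver in which the model-specific operations (SCSS update, conditional sampler, point estimator) are supplied as callbacks; the function and argument names below are ours, not the paper's.

```python
def cdf_sample(shards, partitions, update_scss, sample_conditional,
               point_estimate, scss, estimates, n_gibbs=500):
    """Generic sketch of Algorithm 1 (names are illustrative).

    partitions: labels for the parameter blocks I_1, ..., I_k
    scss[j]: surrogate conditional sufficient statistics for block j
    estimates[j]: current point estimate for block j
    """
    draws_per_time = []
    for shard in shards:                      # outer loop over time
        # step 1: update SCSS using the shard and the *fixed* estimates
        for j in partitions:
            scss[j] = update_scss(j, scss[j], shard, estimates)
        # step 2: Gibbs-like draws from the approximate conditionals
        draws = {j: [] for j in partitions}
        state = dict(estimates)
        for _ in range(n_gibbs):
            for j in partitions:
                state[j] = sample_conditional(j, scss[j], state)
                draws[j].append(state[j])
        # step 3: refresh point estimates from this time point's draws
        for j in partitions:
            estimates[j] = point_estimate(draws[j])
        draws_per_time.append(draws)
    return draws_per_time, scss, estimates
```

The key structural point is that step 2 never touches the raw data: within a time point, sampling conditions only on the SCSS, which stay fixed until the next shard arrives.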

3 Illustrating examples using C-DF

The following notation is used for the examples considered in this section as well as those presented in Section 3.3. D_t denotes the data observed at time t, while D^(t) denotes the collection of data observed through time t. Shards of a fixed size arrive sequentially over a time horizon, and 500 MCMC iterations are drawn to approximate the corresponding posterior distribution at every time point. Where applicable, the following quantities are reported to measure estimation and inferential performance for competing methods: (i) mean squared estimation error (MSE) on parameters of interest; (ii) mean squared prediction error (MSPE); and (iii) length and coverage of 95% predictive intervals. Results are averaged over 10 independent replications, with associated standard errors appearing as subscripts in tables. All reported runtimes are based on a non-optimized R implementation run on an x86 Intel(R) Core(TM) i5 machine.

Finally, plots of kernel density estimates for marginal posterior densities of representative model parameters are shown at various time points. At time t, let accuracy(t) be a measure of ‘accuracy’ between the approximating C-DF density and the full conditional distribution obtained using batch MCMC (S-MCMC). As defined, accuracy ranges between 0 and 1 (larger values are better). Accuracy is plotted as a function of time for the examples of this section, with figures appearing in Appendix C.
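One overlap measure consistent with the stated 0-1 range is one minus half the L1 distance between the two estimated densities; the sketch below (our assumption about the precise definition, not taken from the text) computes it with a simple hand-rolled Gaussian kernel density estimate.

```python
import numpy as np

def kde(samples, grid, bw=None):
    """Gaussian kernel density estimate evaluated on a grid."""
    samples = np.asarray(samples, float)
    if bw is None:  # Silverman's rule of thumb
        bw = 1.06 * samples.std() * samples.size ** (-1 / 5)
    z = (grid[:, None] - samples[None, :]) / bw
    return np.exp(-0.5 * z * z).sum(axis=1) / (samples.size * bw * np.sqrt(2 * np.pi))

def accuracy(draws_cdf, draws_mcmc, n_grid=512):
    """Overlap accuracy in [0, 1]: 1 - (1/2) * integral |f_cdf - f_mcmc|."""
    lo = min(np.min(draws_cdf), np.min(draws_mcmc))
    hi = max(np.max(draws_cdf), np.max(draws_mcmc))
    grid = np.linspace(lo, hi, n_grid)
    f = kde(np.asarray(draws_cdf), grid)
    g = kde(np.asarray(draws_mcmc), grid)
    return 1.0 - 0.5 * np.trapz(np.abs(f - g), grid)

rng = np.random.default_rng(1)
same = accuracy(rng.normal(0, 1, 2000), rng.normal(0, 1, 2000))  # near 1
far = accuracy(rng.normal(0, 1, 2000), rng.normal(5, 1, 2000))   # near 0
```

Identical target distributions yield accuracy close to 1, and well-separated distributions yield accuracy close to 0, matching the interpretation in the text.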

3.1 Motivating examples

3.1.1 Linear regression

For the Gaussian error model, a response given an associated -dimensional predictor is modeled in the linear regression setting as


A standard Bayesian analysis proceeds by assigning conjugate priors to the regression coefficients and the error variance, with the associated full conditionals available in closed form. These are parametrized in terms of sufficient statistics of the data, thus enabling efficient inference using S-MCMC.

A C-DF algorithm for approximate online inference in this setting begins by defining a partition over the model parameters; here, we partition into the regression coefficients and the error variance. Sampling then proceeds as:

  1. Observe data at time t. If t = 1, initialize all parameters at default values (e.g., zero coefficients, assuming a centered and scaled response); otherwise set them to their estimates from time t − 1;

  2. Define as SCSS for . Update as and ;

  3. Draw samples from the approximate Gibbs full conditional for the coefficients, and set the corresponding point estimate (or use the analytical expression for the posterior mean);

  4. Define as SCSS for . Update as and ;

  5. Draw samples from the approximate Gibbs full conditional , and set (or use the analytical expression for the posterior mean).
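Steps 1-5 can be sketched as follows. The prior hyperparameters, shard size, and true parameter values below are illustrative defaults, not the paper's settings; the SCSS for the variance accumulates residuals around the frozen coefficient estimate, which is the surrogate step.

```python
import numpy as np

rng = np.random.default_rng(2)
p, shard_size, T = 3, 50, 40
beta_true = np.array([1.0, -2.0, 0.5])

# SCSS: S_xx = sum x x' and S_xy = sum x y are exact sufficient
# statistics for beta; (n, ssr) tracks the residual sum of squares with
# beta frozen at beta_hat from the previous time point (surrogate step).
S_xx = np.zeros((p, p)); S_xy = np.zeros(p)
n, ssr = 0.0, 0.0
beta_hat = np.zeros(p)
sigma2_hat = 1.0
prior_prec = 1e-2 * np.eye(p)   # beta ~ N(0, 100 I), an assumed default

for t in range(T):
    X = rng.normal(size=(shard_size, p))
    yv = X @ beta_true + rng.normal(size=shard_size)   # true sigma = 1
    # steps 2 and 4: update SCSS with beta_hat held fixed
    S_xx += X.T @ X; S_xy += X.T @ yv
    n += shard_size
    ssr += np.sum((yv - X @ beta_hat) ** 2)
    # steps 3 and 5: Gibbs-like draws from approximate conditionals
    sigma2 = sigma2_hat
    betas, sig2s = [], []
    for _ in range(200):
        cov = np.linalg.inv(S_xx / sigma2 + prior_prec)
        beta = rng.multivariate_normal(cov @ (S_xy / sigma2), cov)
        # sigma2 | SCSS ~ Inv-Gamma(1 + n/2, 1 + ssr/2): ssr uses
        # beta_hat, not the current beta -- the C-DF approximation
        sigma2 = 1.0 / rng.gamma(1.0 + n / 2.0, 1.0 / (1.0 + ssr / 2.0))
        betas.append(beta); sig2s.append(sigma2)
    beta_hat = np.mean(betas, axis=0)
    sigma2_hat = float(np.mean(sig2s))
```

Note that for linear regression the coefficient block admits exact sufficient statistics, so the approximation enters only through the frozen residuals in the variance update.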

Data shards of size are generated using predictors drawn from , with true parameters and . Density estimates for model parameters are displayed at , with accuracy comparisons in Figure 7 validating that approximate C-DF draws converge to the true stationary distribution in time. Excellent parameter MSE and coverage using the C-DF algorithm are reported in Table 1.

Avg. coverage Length Time (sec)
C-DF 1.0 0.60 95
S-MCMC 1.0 0.60 119.4
Table 1:

Inferential performance for C-DF and S-MCMC for parameters of interest. Coverage and length are based on 95% credible intervals, averaged over all coefficients, all time points, and 10 replications. We report the time taken to produce 500 MCMC samples with the arrival of each data shard. MSE along with associated standard errors is reported at different time points.
Figure 1: Kernel density estimates for posterior draws using the C-DF algorithm and S-MCMC at . Shown from left to right are plots of model parameters , , and , respectively.

3.1.2 One-way Anova model

Consider the one-way Anova model with fixed ‘treatment’ groups


with default priors , and . As was the case in the linear regression model, here sufficient statistics may be propagated to yield an efficient S-MCMC sampler. With the arrival of a new data shard at time , group-specific sufficient statistics are updated as and . At time , inference for S-MCMC proceeds by drawing from the following full conditional distributions:


For the C-DF algorithm, a natural partition suggested by the hierarchical structure of model (4) is and . For this parameter partition, modified full conditionals are defined in terms of surrogate quantities as well as the previously defined group-specific sufficient statistics. Approximate inference for C-DF then proceeds as

  1. Observe data at time . If , set , , , . Otherwise, set , ;

  2. Update surrogate statistic component-wise as ;

  3. For each group: draw from (a) the modified full conditional for the group mean; and (b) the full conditional given in (5). The C-DF full conditional is defined in terms of the surrogate quantities above;

  4. Set , , and . Update surrogate statistic : and ;

  5. For : draw from modified full conditional distributions (a) and (b) ;

  6. Finally, set and .
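A minimal sketch of this scheme follows, holding the prior variance of the group means fixed for brevity (the full model also updates the hyperparameters). The group-specific counts and sums are exact sufficient statistics; the residual sum of squares for the error variance accumulates with the group means frozen at their estimates, which is the surrogate step.

```python
import numpy as np

rng = np.random.default_rng(3)
G = 5
theta_true = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# per-group sufficient statistics (exact): counts and sums
n_g = np.zeros(G); s_g = np.zeros(G)
# SCSS for sigma2: residual sum of squares with theta_hat frozen
n, ssr = 0.0, 0.0
theta_hat = np.zeros(G)
sigma2_hat = 1.0
tau2 = 10.0  # assumed fixed prior variance for the group means

for t in range(40):
    g = rng.integers(0, G, size=50)           # group labels in shard
    y = theta_true[g] + rng.normal(size=50)   # true sigma = 1
    np.add.at(n_g, g, 1.0); np.add.at(s_g, g, y)
    n += y.size
    ssr += np.sum((y - theta_hat[g]) ** 2)    # surrogate step
    sigma2 = sigma2_hat
    thetas, sig2s = [], []
    for _ in range(200):
        # group means given sigma2 and group sufficient statistics
        post_var = 1.0 / (n_g / sigma2 + 1.0 / tau2)
        theta = rng.normal(post_var * s_g / sigma2, np.sqrt(post_var))
        # sigma2 given the SCSS (frozen residuals), Inv-Gamma draw
        sigma2 = 1.0 / rng.gamma(1.0 + n / 2.0, 1.0 / (1.0 + ssr / 2.0))
        thetas.append(theta); sig2s.append(sigma2)
    theta_hat = np.mean(thetas, axis=0)
    sigma2_hat = float(np.mean(sig2s))
```

As with linear regression, the approximation error concentrates in the variance update and washes out as the frozen estimates converge.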

Data shards of size are generated according to model (4). Figure 2 displays kernel density estimates for posterior draws using S-MCMC and the C-DF algorithm. SMC (SMC-CH; Chopin, 2002) and ADF are included as additional competitors for this example. To initialize ADF, the joint posterior is approximated by a multivariate normal density; to begin, we integrate over hyper-parameters to obtain the marginal prior. Thereafter, the approximate posterior at each time becomes the prior at the next, and parameters are updated using Newton-Raphson steps in the sense of McCormick et al. (2012).

ADF is extremely sensitive to good calibration of its initial approximation, without which parameter estimates remain far from the truth. Optimal performance is obtained by using the first data shard and iterating the ADF approximation at t = 1 until convergence; thereafter, parameter estimates are propagated as described above. Though the resulting point estimates are accurate, ADF severely underestimates uncertainty, as seen in the length and coverage of the resulting 95% credible intervals reported in Table 2. This occurs in part because ADF makes a global approximation to the joint posterior (restricting the propagation of uncertainty in a very specific way), whereas C-DF makes local approximations to the set of full conditional distributions. The C-DF approximation results in steadily increasing accuracy along with good parameter inference, as shown in Table 2.

Avg. coverage Length Time (sec)
C-DF 0.53
S-MCMC 0.52
Table 2: Inferential performance for C-DF, S-MCMC, SMC-CH and ADF for parameter . Coverage is based on 95% credible intervals averaged over all time points, all and over 10 replications. We report the time taken to produce 500 MCMC samples with the arrival of each data shard. MSE along with associated standard errors are reported at different time points.
Figure 2: Row #1 (left to right): Kernel density estimates for posterior draws of using S-MCMC and the C-DF algorithm at ; Row #2 (left to right): Kernel density estimates for model parameters at .

3.3 Advanced data models

The C-DF algorithm introduced in Section 2 modifies the set of full conditional distributions to depend on propagated surrogate quantities which yields an approximate kernel. This enables efficient online MCMC, with guarantees on correctness of C-DF samples as data accrue over time (see Section 6). We present extensions of the C-DF algorithm to models in settings where (a) the model involves an increasing number of parameters growing with the sample size, or (b) some of the conditional distributions are not in closed form.

Section 3.3.1 presents an application to a dynamic linear model for a continuous response centered on a first-order autoregressive process. ‘Forward-filtering, backward-sampling’ Kalman filter updates for the latent process enable online posterior computation, albeit at a computational cost that grows with time. Section 3.3.2 considers an application to binary regression where the conditional posterior distribution over the regression coefficients does not assume a closed form. For logistic regression, a variational approximation to the posterior introduces additional variational parameters for each observation to obtain a lower bound on the likelihood (Jaakkola and Jordan, 1997). Additional non-conjugate message-passing approximations have been considered in Braun and McAuliffe (2010) and Tan et al. (2014). One may also resort to ADF using a Laplace approximation to the posterior over regression coefficients, propagating the associated mean and covariance parameters in time; however, the latter is known to severely underestimate parameter uncertainty. Fortunately, data augmentation via the probit model enables conditionally conjugate sampling. In both illustrations, we discuss how to use the C-DF algorithm to overcome the computational and storage bottlenecks in an increasing-parameter setting.

3.3.1 Dynamic Linear Model

We consider the first-order dynamic linear model (West and Harrison, 2007), namely

where noisy observations are modeled as arising from an underlying stationary AR(1) process with lag parameter. Default priors are chosen to complete the hierarchical model, and the initial state is assigned the stationary distribution of the latent process. The full conditional distributions are given by

Forecasting future trajectories of the response is a common goal, so a good characterization of the predictive distribution is of interest; retrospective sampling of the latent process is only meaningful insofar as it propagates uncertainty to the current time point. Forward-filtering and backward-sampling using Kalman filter updates for the latent process enables online posterior computation, but this scales poorly as the time horizon grows.

To extend the C-DF algorithm to this growing-parameter setting, we propose sampling the latent process over a moving time window of size w. This eliminates the propagation of errors that might otherwise result from poor estimation at earlier time points. As the sampling window shifts forward, trailing latent parameters are fixed at their most recent point estimates, and the parameter partitions for the C-DF algorithm therefore also evolve dynamically. Unlike previous examples, the conditional distribution of the lag parameter is not available in closed form. Nevertheless, propagated surrogate quantities (SCSS) enable approximate MCMC draws from the C-DF full conditional via Metropolis-Hastings. The steps for approximate MCMC using the C-DF algorithm in the context of a Metropolis-within-Gibbs sampler are:

  1. At time observe ;

  2. If t ≤ w: draw samples from the exact Gibbs full conditionals;

  3. If t > w: repeat the following

    1. for , draw sequentially from and finally from , noting that ;

    2. draw , and ;

    3. draw ;

    4. sample via a Metropolis-Hastings step.

  4. Set , and then update surrogate quantities , , , .
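The windowed Metropolis-within-Gibbs scheme above might be sketched as follows, assuming known noise variances for brevity; the proposal scale, window size, and freezing rule (last draw as point estimate) are our choices, and the transition between the frozen block and the window head is dropped from the surrogate sums to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(4)
sv2, sw2 = 0.25, 1.0           # assumed-known observation / innovation variances
phi_true, T, w = 0.9, 200, 25  # true lag parameter, horizon, window size

# simulate theta_t = phi * theta_{t-1} + N(0, sw2), y_t = theta_t + N(0, sv2)
th_true = np.zeros(T)
for s in range(1, T):
    th_true[s] = phi_true * th_true[s - 1] + rng.normal(scale=np.sqrt(sw2))
y = th_true + rng.normal(scale=np.sqrt(sv2), size=T)

A, B = 0.0, 0.0   # SCSS for phi: sums of th[s-1]^2 and th[s]*th[s-1] over frozen pairs
th_hat = []       # frozen point estimates of states that left the window
window = []       # current draws for in-window latent states
phi = 0.0

for t in range(T):
    window.append(y[t])          # initialize the new state at its observation
    if len(window) > w:          # oldest state leaves the window:
        left = window.pop(0)     # freeze it and fold it into the SCSS
        prev = th_hat[-1] if th_hat else 0.0
        A += prev ** 2
        B += left * prev
        th_hat.append(left)
    for _ in range(60):          # Metropolis-within-Gibbs at time t
        m = len(window)
        # Gibbs: each in-window state given neighbours and its observation
        for s in range(m):
            lp = window[s - 1] if s > 0 else (th_hat[-1] if th_hat else 0.0)
            prec = 1.0 / sv2 + 1.0 / sw2
            num = y[t - m + 1 + s] / sv2 + phi * lp / sw2
            if s + 1 < m:        # contribution of the successor state
                prec += phi ** 2 / sw2
                num += phi * window[s + 1] / sw2
            window[s] = rng.normal(num / prec, 1.0 / np.sqrt(prec))
        # MH: random-walk update for phi (uniform prior on (-1, 1))
        Aw = A + sum(window[i - 1] ** 2 for i in range(1, m))
        Bw = B + sum(window[i] * window[i - 1] for i in range(1, m))
        prop = phi + rng.normal(scale=0.05)
        if abs(prop) < 1.0:
            loglik = lambda f: (2.0 * Bw * f - Aw * f * f) / (2.0 * sw2)
            if np.log(rng.uniform()) < loglik(prop) - loglik(phi):
                phi = prop
```

Because the frozen estimates carry residual posterior noise, the recovered lag parameter is mildly attenuated toward zero, which is the over-shrinkage effect the text attributes to small window sizes.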

We vary the signal-to-noise ratio of the data generating process in simulation experiments. A number of high-signal cases were examined, and results are reported for a representative case. A sufficiently large window size is necessary to prevent C-DF from suffering due to poor estimation at early time points, which would propagate error into the surrogate quantities. For the competitor, 100 particles were propagated in time, and hence we fix the window size comparably. Kernel density estimates using particle learning (PL; Lopes et al., 2011) and the C-DF algorithm are shown in Figure 3. With C-DF, estimates for all model parameters are concentrated near their true values, whereas for PL, density estimates at different times are centered correctly, albeit with much higher posterior variance (presumably due to poor estimation of the noise parameters). In addition to being 35% faster than PL, average coverage for the latent AR(1) process using the C-DF algorithm is near 80%, with credible intervals roughly 10 times narrower than PL's (see Table 3). C-DF is substantially more efficient in memory and storage, requiring only 6% of what PL uses in an identical simulation. Finally, C-DF produces accurate estimates for latent parameters (in contrast to PL), although their posterior spread appears somewhat over-shrunk. This may be remedied by using a larger window size (at the expense of increased runtime) depending on the task at hand.

Avg. coverage Length Time (sec) MSE
Table 3: Inferential performance for C-DF and Particle Learning (PL). Coverage and length are based on 95% credible intervals for averaged over all time points and 10 replications. For truth at time , we report . We report the time taken to run C-DF with Gibbs samples at each time for and MH samples for .
Stats Data Sampling Updating Memory (bytes)
C-DF flops flops 128
PL flops flops 3330
Table 4: Computational and storage requirements for the Dynamic Linear Model using C-DF and PL. , is the -th CSS corresponding to the -th particle in PL, , is the number of particles propagated by PL, and is the number of Metropolis samples used by both PL and C-DF. Memory used to store and propagate SCSS and CSS for C-DF and PL is reported. Sampling and update complexities are in terms of big-O.
Figure 3: Row #1 (left to right): Kernel density estimates for posterior draws of using PL and the C-DF algorithm at ; Row #2 (left to right) plots of model parameters and , respectively.

3.3.2 An application to binary response data

We consider an application of the C-DF algorithm to the probit model, in which the success probability is the standard normal distribution function evaluated at a linear predictor. The Albert and Chib (1993) latent variable augmentation enables Gibbs sampling: each binary response is viewed as the indicator that an underlying Gaussian latent variable is positive. Assuming a conditionally conjugate Gaussian prior on the coefficients, the Gibbs sampler alternates between the following steps: (1) conditional on the latent variables, sample the coefficient vector from its p-variate Gaussian full conditional; and (2) conditional on the coefficients and the observed binary responses, impute the latent variables from their truncated normal full conditionals.
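The two-step Albert and Chib sampler can be sketched directly; the rejection-based truncated normal draw and the vague prior below are our choices. Note the O(n) latent-variable imputation at every iteration, which is the bottleneck C-DF targets.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 2
beta_true = np.array([0.8, -0.5])
X = rng.normal(size=(n, p))
ybin = (X @ beta_true + rng.normal(size=n) > 0)   # probit data

def rtruncnorm(mean, positive, rng):
    """z ~ N(mean, 1) truncated to z > 0 (positive=True) or z < 0, by rejection."""
    z = rng.normal(mean)
    while (z > 0) != positive:
        z = rng.normal(mean)
    return z

prior_prec = 1e-2 * np.eye(p)   # beta ~ N(0, 100 I), an assumed default prior
cov = np.linalg.inv(X.T @ X + prior_prec)   # fixed across iterations
beta = np.zeros(p)
draws = []
for it in range(400):
    # step (2): impute z_i | beta, y_i from truncated normals --
    # this O(n) sweep at every iteration grows with the sample size
    mu = X @ beta
    z = np.array([rtruncnorm(m, yi, rng) for m, yi in zip(mu, ybin)])
    # step (1): beta | z is p-variate Gaussian; CSS are X'X and X'z
    beta = rng.multivariate_normal(cov @ (X.T @ z), cov)
    if it >= 100:                # discard burn-in
        draws.append(beta)
beta_hat = np.mean(draws, axis=0)
```

In the C-DF variant described next, only the latent scores for a fixed budget of recent observations are re-imputed, with the remainder folded into surrogate statistics.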

However, imputing the latent variables presents an increasing computational bottleneck as the sample size grows, as does recalculating the conditional sufficient statistics (CSS) for the coefficients given these latent variables at each iteration. C-DF alleviates this problem by imputing latent scores only for the most recent training observations and propagating surrogate quantities for the rest. This ‘budget’ of retained observations is allowed to grow with the predictor dimension. For data shards of fixed size, an index is defined over the final retained observations at time t. Parameter partitions in this setting are dynamic, as in Section 3.3.1. The C-DF algorithm proceeds as follows:

  1. Observe data at time ;

  2. If , set , and draw . If , and draw Gibbs samples for from the full conditionals.

  3. If , set , update sufficient statistic , and compute ;

  4. For