1 Background: Bayesian computation via data partitioning
Various divide-and-conquer algorithms have been proposed for fitting statistical models to large datasets. The basic idea is to partition the data into pieces, , each with likelihood , then analyze each part of the likelihood separately; and finally combine the pieces to perform inference (typically approximately) for
. In addition to the appeal of such algorithms for parallel or online computation, there is a statistical intuition that they should work well from the law of large numbers: if the
pieces are a random sample of the data, or are independent realizations from a single probability model, then we can think of the log likelihood as the sum of
independent random variables.
In Bayesian inference, one must also decide what to do with the prior distribution, . Perhaps the most direct approach is to multiply each factor of the likelihood by and analyze the pieces separately as , such that the product of the
pieces is the full posterior density. A disadvantage of this approach is that it can lead to instability in the calculation of the fractional posterior distributions. Consider the many settings in statistics and machine learning where an informative prior is needed for regularization—that is, where the likelihood alone does not allow good estimation of, with corresponding computational difficulties (for some examples in our own research, see Gelman, Bois, and Jiang, 1996, and Gelman, Jakulin, et al., 2008, and also in the likelihood-free context, Barthelmé and Chopin, 2014). In these problems, might not provide enough regularization when is large. Another approach is to include the full prior distribution in each of the independent inferences, but then this can leads to computational instability when dividing by at the end.
In any case, a challenge in all these algorithms is how to combine the inferences. If the model is Gaussian, one can simply calculate the posterior mean as the precision-weighted average of the separate posterior means, and the posterior precision as the sum of the separate posterior precisions. Minsker et al. (2014) find a median to work well. More generally, though, we will be computing inferences using Monte Carlo simulation. And there is no general way to combine batches of simulations to get a combined posterior. Indeed, one of our early attempts at data splitting was a bit disappointing, as the gains in computation time were small compared to all the effort required to combine the inferences (Huang and Gelman, 2005).
Various approaches have been recently proposed to combine inferences from partitioned data (see Ahn, Korattikara, and Welling, 2012, Gershman, Hoffman, and Blei, 2012, Korattikara, Chen, and Welling, 2013, Hoffman et al., 2013, Wang and Blei, 2013, Scott et al., 2013, Wang and Dunson, 2013, Neiswanger et al., 2013, and Wang, 2014). These different approaches rely on different approximations or assumptions, but all of them are based on analyzing each piece of the data alone, not using any information from the other pieces.
The contribution of the present paper is to suggest the general relevance of the expectation propagation (EP) framework, leveraging the idea of a “cavity distribution,” which approximates the influence of inference from all other pieces of the data, as a prior in the inference step for each partition. Like EP in general, our approach has the disadvantages of being an approximation and requiring a sequence of iterations which are not always stable. The advantage of our approach is that its regularization should induce faster and more stable calculation for each partition. Our approach can be viewed either as a stochastic implementation of EP or, as we prefer, as an application of EP ideas to Bayesian data partitioning.
We present the basic ideas in section 2, but we anticipate that the greatest gains from this approach will come in hierarchical models, as we discuss in section 3. Section 4 considers options in carrying out the algorithm, and in section 5
we demonstrate on a hierarchical logistic regression example. In section6 we consider various generalizations and connections to other methods, and an appendix contains some details of implementation.
2 A general framework for EP-like algorithms based on iterative tilted approximations
Expectation propagation (EP) is a fast and parallelizable method of distributional approximation via data partitioning. As generally presented, EP is an iterative approach to approximately minimizing the Kullback-Leibler divergence between the true posterior distribution, and a distribution from a tractable family, typically the multivariate normal. The goal, as is the goal of all approximate inference, is to construct to approximate well the target, .
As we see it, though, the key idea of EP is not specifically about Kullback-Leibler or the evaluation of expectations, but rather that each step works with a “tilted distribution” that includes the likelihood for the th partition of the data along with the th “cavity distribution,” which represents the approximate posterior for the other pieces. The basic steps are as follows:
Partitioning. Split the data into pieces, , each with likelihood .
Initialization. Choose initial site approximations
from some restricted family (for example, multivariate normal distributions in). Let the initial approximation to the posterior density be .
EP-like iteration. For (in serial or parallel):
Compute the cavity distribution, .
Form the tilted distribution, .
Construct an updated site approximation such that approximates .
If parallel, set to , and a new approximate distribution will be formed and redistributed after the site updates. If serial, update the global approximation to .
Termination. Repeat step 3 until convergence of the approximate posterior distribution .
The benefits of this algorithm arise because each site comes from a restricted family with complexity is determined by the number of parameters in the model, not by the sample size; this is less expensive than carrying around the full likelihood, which in general would require computation time proportional to the size of the data. Furthermore, if the parametric approximation is multivariate normal, many of the above steps become analytical, with steps 3a, 3b, and 3d requiring only simple linear algebra. Accordingly, EP tends to be applied to specific high-dimensional problems where computational cost is an issue, notably for Gaussian processes (Rasmussen, and Williams, 2006, Jylänki, Vanhatalo, and Vehtari, 2011, Cunningham, Hennig, and Lacoste-Julien, 2011, and Vanhatalo et al., 2013), and efforts are made to keep the algorithm stable as well as fast.
Figure 1 illustrates the general idea. Here the data have been divided into five pieces, each of which has a somewhat awkward likelihood function. The most direct parallel partitioning approach would be to analyze each of the pieces separately and then combine these inferences at the end, to yield a posterior distribution in the overlap. In contrast to data-partitioning algorithms, each step of an EP-based approach combines the likelihood of each partition with the cavity distribution representing the rest of the available information across the other pieces (and the prior). In Figure 1, an EP approach should allow the computations to be concentrated in the area of overlap rather than wasting computation in the outskirts of the individual likelihood distributions.
Figure 2 illustrates the construction of the tilted distribution (step 3b of the algorithm) and demonstrates the critically important regularization attained by using the cavity distribution as a prior: because the cavity distribution carries information about the posterior inference from all other data pieces, any computation done to approximate the tilted distribution (step 3c) will focus on areas of greater posterior mass.
The remaining conceptual and computational complexity lies in step 3c, the fitting of the updated local approximation such that approximates the tilted distribution . This step is in most settings the crux of EP, in terms of computation, accuracy, and stability of the iterations. As such, here we consider a variety of choices for forming tilted approximations, beyond the standard choices in the EP-literature. We call any method of this general iterative approach an EP-like algorithm. Thus we can think of the EP-like algorithms described in this paper as a set of rough versions of EP, or perhaps more fruitfully, as a statistical and computational improvement upon partitioning algorithms that analyze each part of the data separately.
3 EP-like algorithm for hierarchical models
The exciting idea for hierarchical models is that the local parameters can be partitioned, and the convergence of the EP-like algorithm only needs to happen on the shared parameters. Suppose a hierarchical model has local parameters and shared parameters
. All these can be vectors, with eachapplying to the model for the data piece , and with including shared parameters (“fixed effects”) of the data model and hyperparameters as well. This structure is displayed in Figure 3.
3.1 An EP-like algorithm on the shared parameters
We will define an EP-like algorithm on the vector of shared parameters, , so that the th approximating step, combines the local information at site with the cavity distribution which approximates the other information in the model relevant to this updating. The EP-like algorithm, taking the steps of the algorithm in the previous section and altering them appropriately, becomes:
Partitioning. Partition the data into pieces, , each with its local model . The goal is to construct an approximation for the marginal posterior distribution of the shared parameters
. From this, we can also get an approximate joint distribution as described in step 4 below.
Initialization. Choose initial approximations from some restricted family (for example, multivariate normal distributions in ). Define .
EP-like iteration. For (in serial or parallel):
Compute the cavity distribution, .
Form the local tilted distribution, , and determine its marginal distribution, ; this can be computed analytically (if is a joint normal distribution) or approximated using the sampled values of if is computed via simulation.
Construct an updated site approximation such that approximates .
If parallel, set to , and a new approximation will be formed and redistributed after the site updates. If serial, update the global approximation to .
Termination. Repeat step 3 until convergence of the approximate posterior distribution . Given this approximation, define an approximate joint distribution .
The computational advantage of this algorithm is that the local parameters are partitioned. For example, suppose we have a model with 100 data points in each of 3000 groups, 2 local parameters per group (a varying slope and intercept) and, say, 20 shared parameters (including fixed effects and hyperparameters). If we then divide the problem into pieces, we have reduced a problem to 3000 parallel problems. To the extent that computation costs are proportional to sample size multiplied by number of parameters, this is a big win.
3.2 Example: a hierarchical nonlinear regression
We illustrate with an example of a hierarchical regression model from a current research project in astronomy.
Researchers are interested in the relation between fluxes of two different bands of electromagnetic radiation, which are measured as and , respectively, at a large number of locations . The sky is divided into grid boxes , and in each grid box we want to fit a nonlinear regression with likelihood , where is the vector of data points in grid box , is a vector of parameters (for example, regression coefficients) corresponding to grid box , and is some set of shared parameters (for example, regression coefficients, scale parameters, and shape parameters that are constant in the model and do not vary spatially). The local parameter vectors are modeled as independent draws from a multivariate normal distribution, . Finally, for computational reasons the problem is partitioned into pieces, so that each piece contains some number of grid boxes. For example, we might have data points and grid boxes, with the problem divided into pieces.
In the notation of section 3.1, the set of shared parameters is and, within each piece , the set of local parameters is . Performing full Bayes (or even computing a Bayesian point estimate) on this model would be costly if the number of data points and the number of grid boxes are large. Our EP formulation breaks down the problem into smaller problems, each with only as many data points and roughly as many parameters.
We assume a multivariate normal approximate distribution on —actually, we would transform the parameters to be unconstrained, thus working on the scale of the logarithms of any scale parameters and an appropriately transformed Cholesky decomposition of the covariance matrix (see Stan Development Team, 2014, section 53.9). We write the approximate distribution as:
The ’s and ’s are updated during the steps of the EP-like algorithm until convergence.
For simplicity we assume a flat prior on the (unconstrained transformed components of) and we correspondingly set to a constant. (Appendix A describes an algorithm with informative priors.) We start with weak approximations such as setting, for each , and , so that the initial setting for
is normal with mean 0 and scale equal to 10 times the identity matrix.
The EP-like iterations (step 3 of the algorithm in section 3.1) then proceed as follows:
For (in serial or parallel):
The cavity distribution is , where
The tilted distribution is
We can feed this distribution into Stan—it is simply the likelihood for the data in piece , the priors for the local parameters within that piece, and a normal prior on the shared parameters determined by the cavity distribution—and then we can directly approximate
by a normal distribution (perhaps using a Laplace approximation, perhaps using some quadrature or importance sampling to better locate the mean and variance of this tilted distribution) or else simply take some number of posterior draws. Or a more efficient computation may be possible using importance weighting of earlier sets of simulations at this node, as we discuss in section4.6. In any case we can get a normal approximation to the marginal tilted distribution of the shared parameters, . Call this approximation, .
The updated approximate site distribution is , set so approximates . It would be most natural to do this by matching moments, but it is possible that the differencing step could result in non-positive-definite covariance matrix , so some more complicated step might be needed, as discussed in section 4.7.
We can then perform the appropriate parallel or serial update.
As this example illustrates, the core computation is the inference on the local tilted distribution in step b, and this is where we are taking advantage of the partitioning, both in reduction of the data and in reduced dimensionality of the computation, as all that is required is inference on the shared parameters and the local parameters , with the other sets of local parameters being irrelevant in this step.
3.3 Posterior inference for the approximate joint distribution
Once convergence has been reached for the approximate distribution , we approximate the joint posterior distribution by . We can work with this expression in one of two ways, making use of the ability to simulate the model in Stan (or otherwise) in each note.
First, if all that is required are the separate (marginal) posterior distributions for each vector, we can take these directly from the simulations or approximations performed in step 3b of the algorithm, using the last iteration which can be assumed to be at approximate convergence. This will give simulations of or an approximation to .
These separate simulations will not be “in phase,” however, in the sense that different nodes will be simulating different values of . To get draws from the approximate joint distribution of all the parameters, one can first take some number of draws of the shared parameters from the EP approximation, , and then, for each draw, run parallel processes of Stan to perform inference for each , conditional on the sampled value of . This computation is potentially expensive—for example, to perform it using 100 random draws of would require 100 separate Stan runs—but, on the plus side, each run should converge fast because it is conditional on the hyperparameters of the model. In addition, it may ultimately be possible to use adiabatic Monte Carlo (Betancourt, 2014) to perform this ensemble of simulations more efficiently.
4 Open issues in implementation
4.1 Algorithmic considerations
Implementing an EP-like algorithm involves several decisions:
Algorithmic blocking. This includes parallelization and partitioning of the likelihood. In the present paper we a assume a model with independent data points conditional on its parameters so the likelihood can be factored. The number of subsets will be driven by computational considerations. If is too high, the Gaussian approximation will not be so accurate. However, if is low, then the computational gains will be small. For large problems it could make sense to choose iteratively, for example setting to a high value and then decreasing it if the approximation seems too poor. Due to the structure of modern computer memory, the computation using small blocks may get additional speed-up if the most of the memory accesses can be made using fast but small cache memory.
The parametric form of the approximating distributions . We will use the multivariate normal. For simplicity we will also assume that the prior distribution also is multivariate normal, as is the case in many practical applications, sometimes after proper reparameterization (such as with constrained parameter spaces). Otherwise, one may treat the prior as an extra site which will also be iteratively approximated by some Gaussian density . In that case, some extra care is required regarding the initialization of .
The starting distributions . We will use a broad but proper distribution factored into equal parts, for example setting each , where is some large value (for example, if the elements of are roughly scaled to be of order 1, we might set ).
The approximation to the tilted distribution. For mode-based approximations, we can calculate the first two moments of the fitted normal or other approximate density; see section 4.4. Or if the tilted distribution is computed by Monte Carlo methods we can compute the first two moments of the simulation, possibly with some regularization if is of high dimension; see section 4.5.
The division by the cavity distribution in step 3c can yield a non-positive-definite variance matrix which would correspond to a meaningless update for
. We can stabilize this by setting negative eigenvalues to small positive values or perhaps using some more clever method. We discuss this in section4.7.
In a hierarchical setting, the model can be fit using the nested EP approach (Riihimäki et al., 2013). Assuming that the likelihood term associated with each data partition is approximated with a local Gaussian site function , the marginal approximations for and all could be estimated without forming the potentially high-dimensional joint approximation of all unknowns. At each iteration, first the local site approximations could be updated in parallel with a fixed marginal approximation . Then could be refreshed by marginalizing over separately using the new site approximations and combining the marginalized site approximations. The parallel EP approximations for each data partition correspond to the inner EP approximations, and inference for corresponds to the outer EP.
4.2 Approximating the tilted distribution
In standard EP, the tilted distribution approximation in step 3c is done by moment matching problem, which when using the multivariate normal family implies that: one chooses the site so that the first and second moments of match those of the intractable tilted distribution . When used for Gaussian processes, this approach has the particular advantage that the tilted distribution can typically be set up as a univariate distribution over only a single dimension in . This implicit dimension reduction implies that the tilted distribution approximation can be performed analytically (e.g., Minka, 2001) or relatively quickly using one-dimensional quadrature (e.g., Zoeter and Heskes, 2005). In higher dimensions, quadrature gets computationally more expensive or, with reduced number of evaluation points, the accuracy of the moment computations gets worse. Seeger (2004) estimated the tilted moments in multiclass classification using multidimensional quadratures. Without the possibility of dimension reduction in the more general case, there is no easy way to compute or approximate the integrals to compute the required moments over . Accordingly, black-box EP would seem to be impossible.
To move towards a black-box EP-like algorithm, we can change this moment matching choice to instead match the mode or use numerical simulations. The resulting EP-like algorithms critically preserve the essential idea that the local pieces of data are analyzed at each step in the context of a full posterior approximation. These alternative choices for a tilted approximation generally make EP less stable, and we consider methods for fast and accurate approximations and stabilizing computations for EP updates.
Smola et al. (2004) proposed Laplace propagation, where moment-matching is replaced with a Laplace approximation, so that the tilted mean is replaced with the mode of the tilted distribution and the tilted covariance with the inverse Hessian of the log density at the tilted mode. The proof presented by Smola et al. (2004) suggests that a fixed point of Laplace propagation corresponds to a local mode of the joint model and hence also one possible Laplace approximation. Therefore, with Laplace approximation, a message passing algorithm based on local approximations corresponds to the global solution. Smola et al. (2004) were able to get useful results with tilted distributions in several hundred dimensions. Riihimaki et al. (2013) presented a nested EP method, where moments of the multivariate tilted distribution are also estimated using EP. For certain model types the inner EP can be computed efficiently.
In this paper, we also consider MCMC computation of the moments, which we suspect will give inaccurate moment estimates, but may work better than, or as a supplement to, Laplace approximation for skewed distributions. We also propose an importance sampling scheme to allow stable estimates based on only a moderate number of simulation draws.
When the moment computations are not accurate, EP may have stability issues, as discussed by Jylänki et al. (2011). Even with one-dimensional tilted distributions, moment computations are more challenging if the tilted distribution is multimodal or has long tails. Fractional EP (Minka, 2004) is an extension of EP which can be used to improve the robustness of the algorithm when the approximation family is not flexible enough (Minka, 2005) or when the propagation of information is difficult due to vague prior information (Seeger, 2008). Fractional EP can be viewed as a method for minimizing of the -divergence, with corresponding to Kullback-Leibler divergence used in EP, corresponding to the reverse Kullback-Leibler divergence usually used in variational Bayes, and corresponding to Hellinger distance. Ideas of fractional EP might help to stabilize EP-like algorithms that use approximative moments, as -divergence with is less sensitive to errors in tails of the approximation.
Section 4.1 details these and other considerations for tilted approximations in our EP-like algorithm framework. While the tilted approximation is the key step in any EP-like algorithm, there are two other canonical issues that must be considered in any EP-like approach: lack of convergence guarantees and the possible computational instability of the iterations themselves. Section 4.1 also considers approaches to handle these instabilities.
4.3 Normal approximation based on deterministic computations
The simplest EP-like algorithms are deterministic and at each step construct an approximation of the tilted distribution around its mode. As Figure 2 illustrates, the tilted distribution can have a well-identified mode even if the factor of the likelihood does not.
The most basic approximation is obtained by, at each step, setting to be the (multivariate) normal distribution centered at the mode of , with covariance matrix equal to the inverse of the negative second derivative matrix of
at the mode. This corresponds to the Laplace approximation (Smola et al., 2004). We assume any parameters with restricted range have been transformed to unrestricted spaces (in Stan, this is done via logarithms of all-positive parameters, logits of interval-constrained parameters, and special transforms for simplex constraints and covariance matrices; see Stan Development Team, 2014, section 53.9), so the gradient ofwill necessarily be zero at any mode.
The presence of the cavity distribution as a prior (as illustrated in Figure 2) gives two computational advantages to this algorithm. First, we can use the prior mean as a starting point for the algorithm; second, the use of the prior ensures that at least one mode of the tilted distribution will exist.
4.4 Split-normal and split-t approximations
After computing the mode and curvature at the mode, we can evaluate the tilted distribution at a finite number of points around the mode and use this to construct a better approximation to capture aspects of asymmetry and long tails in the posterior distribution. Possible approximate families include the multivariate split-normal (Geweke, 1989, Villani and Larsson, 2006), split-, or wedge-gamma (Gelman et al., 2014) distributions.
We are not talking about changing the family of approximate distributions —we still would keep these as multivariate normal. Rather, we would use an adaptively-constructed parametric approximation, possibly further improved by importance sampling (Geweke, 1989) or CCD integration (Rue et al., 2009) to get a better approximation to the mean and covariance matrix of the tilted distribution to use in constructing in step 3c of the algorithm.
4.5 Estimating moments using simulation
A different approach is at each step to use simulations (for example, Hamiltonian Monte Carlo using Stan) to approximate the tilted distribution and then use these to set the mean and covariance of the approximation in step 3c. As above, the advantage of the EP-like algorithm here is that the computation only uses a fraction
of the data, along with a simple multivariate normal prior that comes from the cavity distribution. Similar to other inference methods such as stochastic variational inference (Hoffman et al., 2013) which take steps based on stochastic estimates, EP update steps need unbiased estimates. When working with the normal approximation, we then need unbiased estimates of the mean and precision in ((4.7)). The variance of the estimates can be reduced while preserving unbiasedness by using control variates.
4.6 Reusing simulations with importance weighting
In serial or parallel EP, samples from previous iterations can be reused as starting points for either Markov chains or in importance sampling. We discuss briefly the latter.
Assume we have obtained at iteration for node , a set of posterior simulation draws , from the tilted distribution , possibly with weights ; take for an unweighed sample.
To progress to node , reweight these simulations as: . If the effective sample size (ESS) of the new weights,
is large enough, keep this sample, . Otherwise throw it away, simulate new ’s from , and reset the weights to 1.
This basic approach was used in the EP-ABC algorithm of Barthelmé and Chopin (2014). Elaborations could be considered. Instead of throwing away a sample with too low an ESS, one could move these through several MCMC steps, in the spirit of sequential Monte Carlo (Del Moral et al., 2006).
Another approach, which can be used in serial or parallel EP, is to use adaptive multiple importance sampling (Cornuet et al., 2012), which would make it possible to recycle the simulations from previous iterations.
Even the basic strategy should provide important savings when EP is close to convergence. Then changes in the tilted distribution should become small and as result the variance of the importance weights should be small as well. In practice, this means that the last EP iterations should essentially come for free.
4.7 Keeping the covariance matrix positive definite
When working with the normal approximation, step 3c of the EP-like algorithm can be conveniently written in terms of the natural parameters of the exponential family:
where and are the approximate mean and covariance matrix of the tilted distribution, and these are being used to determine and , the mean and variance of the updated site distribution
. The difficulty is that the difference between two positive-definite matrices is not itself necessarily positive definite. There are various tricks in the literature to handle this problem. One idea is to first perform the subtraction in the second line above, then do an eigendecomposition, keeping the eigenvectors ofbut taking any negative eigenvalues and replacing them with small positive numbers as in the SoftAbs map of Betancourt (2013). It is not clear whether some similar adjustment would be needed for the updating of or whether the top line above would work as written.
Is it possible to take into account the positive-definiteness restriction in a more principled manner? If our measure of the precision matrix of the tilted distribution is noisy (due to Monte Carlo error), it would make sense to include this noise in our inference and estimate and statistically via a measurement-error model. The noisy precision of the tilted distribution is the sum of the cavity precision, the precision of the pseudo-observation, and the noise term. Can we infer from this better estimate of precision of the pseudo-observation? This itself would be a small Bayesian inference problem but of low enough dimensionality that it should not represent a large part of the computation cost compared to the other steps of the algorithm.
An simple and effective alternative is to damp each step so that the resulting cavity covariance remains positive definite after each update. This does not change the fixed points of the EP algorithm, and adjusts only the step sizes during the iterations. If the updates are done in parallel fashion, it is possible to derive a limit on the damping factor (or step size) so that the cavity covariances at all sites remain positive definite. The downside is that this requires determining one eigendecomposition at each site at each parallel update. Based on our experiments, this takes negligible time compared to the tilted distribution moment computations.
4.8 The computational opportunity of parallel EP-like algorithms
We have claimed that EP-like algorithms offer computational gains for large inference problems by splitting the data into pieces and performing inference on each of those pieces in parallel, occasionally sharing information between the pieces. Here we detail those benefits specifically.
Consider the simple, non-hierarchical implementation (section 2) with a multivariate normal approximating family. We assume that we have parallel processing units: one central processor that maintains the global posterior approximation and worker units on which inference can be computed on each of the factors of the likelihood. Furthermore, we assume a network transmission cost of per parameter. Let be the number of data points and let be the number of parameters, that is, the length of the vector . Finally, we define as the computational cost of approximating the tilted distribution (a task which may, for example, be performed by running MCMC to perform simulations) with data points and parameters.
Each step of the algorithm then incurs the following costs:
Partitioning. This loading and caching step will in general have immaterial cost.
Initialization. The central unit initializes the site approximations , which by construction are multivariate normal. In the general case each of the sites will require scalar quantities corresponding to the mean and covariance. Thus the central unit bears the initialization cost of .
EP-like iteration. Let be the number of iterations over all sites. Empirically is typically a manageable quantity; however, numerical instabilities tend to increase this number. In parallel EP damped updates are often used to avoid oscillation (van Gerven, 2009).
Computing the cavity distribution. Owing to our multivariate normal approximating family, this step involves only simple rank matrix operation per site, costing (with a small constant; see Cunningham, Hennig, and Lacoste-Julien, 2011). One key choice is whether to perform this step at the central unit or worker units. If we compute the cavity distributions at each worker unit, the central unit must first transmit the full posterior to all worker units, costing for cost per network operation. In parallel, the cavity computations then incur total cost of . On the other hand, small implies central cavity computations are preferred, requiring to construct cavity distributions centrally, with a smaller subsequent distribution cost of . Accordingly, the total cost per EP iteration is . We presume any computation constant is much smaller than the network transmission constant , and thus in the small regime, this step should be borne by the central unit, a choice strengthened by the presumed burden of step 3c on the worker units.
Forming the tilted distribution. This conceptual step bears no cost.
Fitting an updated local approximation . We must estimate parameters. More critical in the large data setting is the cost of computing the log-likelihoods. In the best case, for example if the likelihoods belong to the same exponential family, we need only calculate a statistic on the data, with cost . In some rare cases the desired moment calculation will be analytically tractable, which results in a minimum cost of . Absent analytical moments, we might choose a modal approximation (e.g., Laplace propagation), which may typically incur a term. More common still, MCMC or another quadrature approach over the -dimensional site parameter space will be more costly still: . Furthermore, a more complicated likelihood than the exponential family—especially a multimodal such as a mixture likelihood—will significantly influence numerical integration. Accordingly, in that common case, this step costs . Critically, our setup parallelizes this computation, and thus the factor is absent.
Return the updated to the central processor. This cost repeats the cost and consideration of Step 3a.
Update the global approximation . In usual parallel EP, is updated once after all site updates. However, if the cost of approximating the posterior distribution is variable across worker units (for example, in an MCMC scheme), the central unit could update whenever possible or according to a schedule.
Considering only the dominating terms, across all these steps and the EP iterations, we have the total cost of our parallel, EP-like algorithm:
This cost contains a term due to Gaussian operations and a term due to parallelized tilted approximations. By comparison, consider first the cost of a non-parallel EP-like algorithm:
Second, consider the cost of full numerical quadrature with no EP-like partitioning:
With these three expressions, we can immediately see the computational benefits of our scheme. In many cases, numerical integration will be by far the most costly operation, and will depend superlinearly on its arguments. Thus, our parallel EP-like scheme will dominate. As the total data size grows large, our scheme becomes essential. When data is particularly big (e.g. ), our scheme will dominate even in the rare case that is its minimal (see step 3c above).
5.1 Implementation in Stan
We code our computations in R and Stan. Whether using point estimation (for Laplace approximations) or HMC (for simulations from the tilted distribution), we write a Stan program that includes one portion of the likelihood and an expression for the cavity distribution. We then run this model repeatedly with different inputs for each subset . This ensures that only one part of the likelihood gets computed at a time by the separate processes, but it does have the cost that separate Stan code is needed to implement the EP computations. In future software development we would like to be able to take an existing Stan model and merely overlay a factorization so that the EP-like algorithm could be applied directly to the model.
We use Stan to compute step 3a of the EP-like algorithm, the inference for the tilted distribution for each process. Currently we perform the other steps in Matlab, Python, or R (code will be made available at https://github.com/gelman/ep-stan). In parallel EP, we pass the normal approximations back and forth between the master node and the separate nodes.
Currently Stan performs adaptation each time it runs, but the future version should allow restarting from the previous state, which will speed up computation substantially when the EP algorithm start to converge. We should also be able to approximate the expectations more efficiently using importance sampling.
5.2 Hierarchical logistic regression
We test the algorithm with a simple hierarchical logistic regression,
where is a vector of common coefficients and each group has own intercept with a hierarchical prior. The data were simulated using , and .
We first perform a simulation experiment with 50 regression predictors, the number of groups , the number of data in each group , and thus a total sample size of . For simplicity we partition the data as one piece per group; that is, . The Stan runs had 50 warmup and 200 iterations per chain.
If we were to use a simple scheme of data splitting and separate inferences (without using the cavity distribution as an effective prior distribution at each step), the computation would problematic: with only 50 data points per group, each of the local posterior distributions would be very wide (as sketched in Figure 1). The EP-like framework, in which at each step the cavity distribution is used as a prior, keeps computations more stable and focused.
Figure 4 shows the results: convergene is essentially immediate, and the algorithm ran fast. We have not yet performed runtime comparisons because our implementation has many places where we can increase its speed. But it is reassuring that the computations work well in this very simple and direct MCMC implementation of our algorithm.
In addition to checking the algorithm’s convergence (left plots in Figure 4
), we also check the coverage of posterior inferences from the simulations (right plot). If we were performing full Bayesian inference this coverage check would not be necessary (except possibly for catching bugs), but because the EP-like algorithm is approximate, it makes sense to check coverage in this fake-data simulation. As the graph shows, the coverage is fine, with all 50 estimates within 3 standard errors of the true values, the vast majority being within 2 standard errors, and about two-thirds being within 1 standard error of the truth. For a fuller evaluation of the algorithm it would be a good idea to study this sort of coverage more thoroughly; here we just wanted to get a rough sense that the results made sense and that the approximating distribution was not clearly too narrow or two broad.
Figure 5 shows results for a new set of simulations, this time again with 50 regression predictors and partitions, but this time with groups and observations per group, so that the data are partitioned via cluster sampling into 50 partitions with 20 groups and 1000 data points in each partition. By increasing the size of the problem, we are improving the normal approximation but we are beginning to test the scalability of the algorithm. We plan to do more systematic scalability testing once we have increased the efficiency of some of our computational steps.
The experiments show that the algorithm works and is stable even with the simplest moment computation using MCMC. All the above experiments were made on a desktop computer without any actual parallelization. Additional modifications suggested in section 4 should further speed up the moment computations and the convergence.
In this paper we have presented a framework for using the EP idea of cavity and tilted distributions to perform Bayesian inference on partitioned datasets. Particularly in the case of hierarchical models, this structure can allow efficient dispersed computation for big models fit to big data.
6.1 Marginal likelihood
Although not the focus of this work, we mention in passing that EP also offers as no extra cost an approximation of the marginal likelihood, . This quantity is often used in model choice.
To this end, associate to each approximating site a constant , and write the global approximation as:
Consider the Gaussian case, for the sake of simplicity, so that , under natural parameterization, and denote by the corresponding normalizing constant:
Simple calculations (Seeger, 2005) then lead to following formula for the update of at site ,
where is the normalizing constant of the tilted distribution , is the natural parameter of , , , , and . In the deterministic approaches we have discussed for approximating moments of , it is straightforward to obtain an approximation of the normalizing constant; when simulation is used, some extra efforts may be required, as in Chib (1995).
Finally, after completion of EP one should return the following quantity,
as the EP approximation of , where is the natural parameter of the prior.
6.2 Different families of approximate distributions
We can place the EP approximation, the tilted distributions, and the target distribution on different rungs of a ladder:
, the EP approximation;
For any , , the tilted distribution;
For any , , which we might call the tilted distribution;
For any , , the tilted distribution;
, the target distribution, which is also the tilted distribution.
Instead of independent groups, also tree structures could be used (Minka, 2001). As presented, the goal of an EP-like algorithm is to determine . But at each step we get simulations from all the ’s, so we could try to combine these inferences in some way (for example, following the ideas of Wang and Dunson, 2013). Even something as simple as mixing the simulation draws from the tilted distribution could be a reasonable improvement on the EP approximation. One could then go further, for example at convergence computing simulations from some of the tilted distributions.
Another direction is to compare the EP approximation with the tilted distribution, for example by computing a Kullback-Leibler distance or looking at the distribution of importance weights. Again, we can compute all these densities analytically, we have simulations from the tilted distributions, and we can trivially draw simulations from the EP approximation, so all these things are possible.
6.3 Connections to other approaches
The EP-like algorithm can be combined with other approaches to data partitioning. In the present paper we have focused on the construction of the approximate densities with the goal of simply multiplying them together to get the final approximation , but one could instead think of the cavity distributions at the final iteration as separate priors, and then follow the ideas of Wang and Dunson (2013).
Data partitioning is an extremely active research area right now, with several different black-box algorithms being proposed by various research groups. We are sure that different methods will be more effective in different problems. The present paper has two roles: we are presenting a particular black-box algorithm for distributional approximation, and we are suggesting a general approach to approaching data partitioning problems. We anticipate that great progress could be made by using EP ideas to regularize existing algorithms.
Ahn, S., Korattikara, A., and Welling, M. (2012). Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning.
Barthelmé, S., and Chopin, N. (2014). Expectation propagation for likelihood-free inference. Journal of the American Statistical Association 109, 315–333.
Betancourt, M. (2013). A general metric for Riemannian manifold Hamiltonian Monte Carlo. http://arxiv.org/abs/1212.4693
Betancourt, M. (2014). Adiabatic Monte Carlo. http://arxiv.org/pdf/1405.3489.pdf
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association 90, 1313–1321.
Cornuet, J. M., Marin, J. M., Mira, A., and Robert, C. P. (2012). Adaptive multiple importance sampling. Scandinavian Journal of Statistics 39, 798–812.
Cseke, B., and Heskes, T. (2011). Approximate marginals in latent Gaussian models. Journal of Machine Learning Research 12, 417–454.
Cunningham, J. P., Hennig, P., and Lacoste-Julien, S. (2011). Gaussian probabilities and expectation propagation. http://arxiv.org/abs/1111.6832
Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society B 68, 411–436.
Gelman, A., Bois, F. Y., and Jiang, J. (1996). Physiological pharmacokinetic analysis using population modeling and informative prior distributions. Journal of the American Statistical Association 91, 1400–1412.
Gelman, A., Carpenter, B., Betancourt, M., Brubaker, M., and Vehtari, A. (2014). Computationally efficient maximum likelihood, penalized maximum likelihood, and hierarchical modeling using Stan. Technical report, Department of Statistics, Columbia University.
Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics 2, 1360–1383.
Gershman, S., Hoffman, M., and Blei, D. (2012). Nonparametric variational inference. In Proceedings of the 29th International Conference on Machine Learning.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica 57, 1317–1339.
Girolami, M., and Zhong, M. (2007). Data integration for classification problems employing Gaussian process priors. In Advances in Neural Information Processing Systems 19, ed. B. Scholkopf, J. Platt, and T. Hoffman, 465–472.
Hoffman, M., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(1), 1303-1347.
Huang, Z., and Gelman, A. (2005). Sampling for Bayesian computation with large datasets. Technical report, Department of Statistics, Columbia University.
Jylänki, P., Vanhatalo, J., and Vehtari, A. (2011). Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research 12, 3227-3257.
Korattikara, A., Chen, Y., and Welling, M. (2013). Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on Machine Learning.
Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence
Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, ed. J. Breese and D. Koller, 362–369.
Minka, T. (2004). Power EP. Technical report, Microsoft Research, Cambridge.
Minka, T. (2005). Divergence measures and message passing. Technical report, Microsoft Research, Cambridge.
Minkser, S., Srivastava, S., Lin, L., and Dunson, D. B. (2014). Robust and scalable Bayes via a median of subset posterior measures. Technical report, Department of Statistical Science, Duke University. http://arxiv.org/pdf/1403.2660.pdf
Neiswanger, W., Wang, C., and Xing, E. (2013). Asymptotically exact, embarrassingly parallel MCMC. arXiv:1311.4780.
Rasmussen, C. E., and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, Mass.: MIT Press.
Riihimaki, J., Jylänki, P., and Vehtari, A. (2013). Nested expectation propagation for Gaussian process classification with a multinomial probit likelihood. Journal of Machine Learning Research 14, 75–109.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society B 71, 319–392.
Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2013). Bayes and big data: The consensus Monte Carlo algorithm. In Bayes 250. http://research.google.com/pubs/pub41849.html
Seeger, M. (2005). Expectation propagation for exponential families. Technical report, Max Planck Institute for Biological Cybernetics, Tubingen, Germany.
Seeger, M. (2008). Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research 9, 759–813.
Seeger, M., and Jordan, M. (2004). Sparse Gaussian process classification with multiple classes. Technical report, University of California, Berkeley.
Smola, A., Vishwanathan, V., and Eskin, E. (2004). Laplace propagation. In Advances in Neural Information Processing Systems 16, ed. S. Thrun, L. Saul, and B. Scholkopf.
Stan Development Team (2014). Stan modeling language: User’s gude and reference manual, version 2.5.0. http://mc-stan.org/
van Gerven, M., Cseke, B., Oostenveld, R., and Heskes, T. (2009). Bayesian source localization with the multivariate Laplace prior. In Advances in Neural Information Processing Systems 22, ed. Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, 1901–1909.
Vanhatalo, J., Riihimäki, J., Hartikainen, J., Jylänki, P., Tolvanen, V., and Vehtari, A. (2013). GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research 14, 1005–1009.
Villani, M., and Larsson, R. (2006). The multivariate split normal distribution and asymmetric principal components analysis.Communications in Statistics: Theory and Methods 35, 1123–1140.
Wang, C., and Blei, D. M. (2013). Variational inference in nonconjugate models. Journal of Machine Learning Research 14, 899–925.
Wang, X. (2014). Parallel MCMC via Weirstrass sampler. Xi’an’s Og, 3 Jan. http://xianblog.wordpress.com/2014/01/03/parallel-mcmc-via-weirstrass-sampler-a-reply-by-
Wang, X., and Dunson, D. B. (2013). Parallel MCMC via Weierstrass sampler. http://arxiv.org/pdf/1312.4605v1.pdf
Zoeter, O., and Heskes, T. (2005). Gaussian quadrature based expectation propagation. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, ed. Z. Ghahramani and R. Cowell, 445–452.
Appendix A Distributed parallel algorithms
We have pitched this article at a fairly abstract level because we are interested in the general idea of using EP-like algorithms to perform parallel Bayesian computation using data splitting. Ultimately, though, these algorithms are motivated by speed and scalability concerns, so we need efficient implementation. In this appendix we go through some of the steps that we went through to efficiently implement the simulations described in section 5.
a.1 Distributed parallel EP-like algorithm
In this appendix we give a practical algorithm description that is suitable for implementing the general EP-like algorithms of sections 2 and 3 in a numerically stable manner. We assume that the posterior distribution,
is approximated by
where the site approximations and the prior are written using the following definitions:
The natural parameters and the normalization of the prior are given by , , and . The natural exponential parameters of the posterior approximation are obtained by multiplying the prior and the site approximations together which gives
The approximate posterior mean and covariance can be computed using a Cholesky or eigenvalue decomposition of . One possibility to initialize the site approximations is to set them to by choosing and for , which is equivalent to initializing to the prior, that is, and .
We propose to distribute the cavity and tilted moment computations and the site parameter updates to different computing units. The posterior update is done in the central computing node in a parallel fashion. First, the site updates are initialized to zero as and then the following steps are repeated until convergence:
In parallel at each node: Compute the updated site parameters with damping level :
At the central node: Compute the natural parameters of as
In parallel at each node: Determine the cavity distributions for all :
In parallel at each node: If for any , go back to step 1 and decrease . Otherwise, accept the new state by setting , , and and continue to step 5.
In parallel at each node: determine the natural parameters and of the tilted distribution using either MCMC or Laplace’s method. The tilted distribution is given by
which can be efficiently sampled and differentiated using
Key properties of the different approximation methods:
MCMC: It is easy to compute and from a set of samples, and should be symmetric and positive definite if enough samples are used. However, computing the precision matrix requires a Cholesky or eigenvalue decomposition. Could this be done simultaneously within the SoftAbs stabilization step? In addition, there should be literature on estimating (sparse) precision matrices from samples, which could be beneficial with large .
Laplace’s method: Gradient-based methods can be used to determine the mode of the tilted distribution efficiently. Once a local mode is found, the natural parameters can be computed as
If is a local mode, should be symmetric and positive definite.
In parallel at each node: If , compute the undamped site parameter updates resulting from the moment consistency conditions and :
If , there are at least two options: discard the update by setting and , or use the SoftAbs map to improve the conditioning of and compute the parameter updates with the modified .
In the latter approach the natural location parameter of the tilted distribution can be recomputed as using the original tilted mean and the modified covariance matrix , which preserves the tilted mean but changes tilted covariance estimate .
Steps 1–6 are repeated until all the tilted distributions are consistent with the approximate posterior, that is, and for . Steps 4 is done to ensure that the posterior approximation and the cavity distributions remain well defined at all times. Step 4 is potentially time consuming because it involves checking the conditioning of all the cavity precision matrices . A cheaper alternative could be to skip the step and apply more damping which we expect should work well if the tilted distribution related to the different data pieces are not very different or multimodal.
Advantages of the approach include:
The central node does not need to compute matrix decompositions in step 2. It only needs to sum together the natural site parameters to obtain the posterior approximation in exponential form, and pass this to the individual computing nodes that can make the subtractions to form the cavity parameters.
The tilted moments can be determined by sampling directly from the unnormalized tilted distributions or by using the Laplace’s method. This requires only cheap function and gradient evaluations and can be applied to wide variety of models.
After convergence the final posterior approximation could be formed by mixing the draws from the different tilted distributions because these should be consistent with each other and . This sample-based approximation could capture also potential skewness in because it resembles the EP-based marginal improvements described by Cseke and Heskes (2011).
Limitations of the approach include:
The tilted covariance matrices can be easily computed from samples, but these have to be inverted to obtain the site precision matrices. This inversion could be done by each node after sampling, but if there are more efficient ways to approximate (sparse) precision matrices directly from samples, this could be a potential scheme even for large number of unknowns? The algorithm would not require matrix decompositions and all the required posterior marginals could be summarized using samples.
Estimating the marginal likelihood is more challenging, because determining the normalization constants requires multivariate integrations. For example, annealed importance sampling type of approaches could be used if marginal likelihood estimates are required.
With the Laplace’s method, approximating is straightforward by the quality of the marginal likelihood approximation is not likely to very good with skewed posterior distributions. The Laplace marginal likelihood estimate is not generally well calibrated with the approximate predictive distributions in terms of hyperparameter estimation. Therefore, it would make sense to integrate over the hyperparameters within the EP framework.
a.2 Efficient algorithms when dimension reduction is possible
Here we summarize a version of the EP algorithm of section A.1 for the special case in which the non-Gaussian likelihood terms depend on
only through low-dimensional linearly transformed random variables,
that is, for each partition . The posterior distribution is now given by
and we approximate it by
The natural parameters of are obtained by multiplying the site approximations and the prior which gives
The approximate posterior mean and covariance can be computed using a Cholesky or eigenvalue decomposition of , or a series of rank- updates. One possibility is to initialize the site approximations to by setting and for , which is equivalent to initializing to the prior, that is, and .
If is a matrix, then the cavity computations and the site parameter updates require only rank- matrix computations, and determining the moments of the th tilted distribution requires only -dimensional numerical integrations. In the following we outline how this algorithm can be parallelized using computing units.
We propose to distribute the cavity and tilted moment computations into different computing units by dividing the model terms into non-intersecting subsets so that . The posterior updates are done in the central computing node in a parallel fashion. First, the site updates are initialized to zero, , and then the following steps are repeated until convergence:
Distribute parameters and the site parameter updates to each computing node and compute intermediate natural parameters with damping level :
Compute the updated site parameters for :
Compute the natural parameters of the th batch:
At the central node, compute the natural parameters of as