Markov Chain Monte Carlo (MCMC) algorithms have cemented themselves as a cornerstone of practical Bayesian analysis. Nonetheless, accommodating large distributed datasets is still a challenge. For this purpose, methods have been proposed to speed up inference either using mini-batches (e.g. Ma et al., 2015; Quiroz et al., 2018) or exploiting parallel computing (e.g. Ahn et al., 2014; Johnson et al., 2013), or combinations thereof. For a comprehensive review about scaling up Bayesian inference, we refer to Angelino et al. (2016).
A particularly efficient class of parallel algorithms are embarrassingly parallel MCMC methods, which employ a divide-and-conquer strategy to obtain samples from the posterior
where is a prior, is a likelihood function and the data are partitioned into disjoint subsets . The general idea is to break the global inference into smaller tasks and combine their results, requiring coordination only in the final aggregation stage. More specifically, the target posterior is factorized as
and the right-hand-side factors, referred to as subposteriors, are independently sampled from—in parallel—using an MCMC algorithm of choice. The results are then centralized in a coordinating server and aggregated. The core challenge lies in devising strategies which are both accurate and computationally convenient to combine subposterior samples.
proposed parametric, semi-parametric and non-parametric strategies, the two former being based on fitting kernel density estimators to the subposterior samples.Wang et al. (2015) used random partition trees to learn a discrete approximation to the posterior. Nemeth and Sherlock (2018) placed Gaussian process priors on the log-densities of the subposteriors, and approximated the full log-posterior as a sum of Gaussian processes. Except for the parametric method, which imposes overly simplistic local approximations that generally result in poor approximations of the target posterior, all of the aforementioned approaches require the subposterior samples to be centralized, incurring extensive communication costs. In fact, communication costs have been altogether ignored in the literature so far. Furthermore, sampling from the approximate posterior can become difficult, requiring expensive additional MCMC steps to obtain samples from the combined posterior.
In this work, we propose a novel embarrassingly parallel MCMC strategy termed non-volume-preserving aggregation product (NAP), which addresses the aforementioned issues while providing accurate posterior samples. Our work builds on the insight that subposteriors of arbitrary complexity can be mapped to densities of tractable form, making use of real non-volume preserving trasformations
(real NVP), a recently developed class of neural-network based invertible transformations(Dinh et al., 2017). This enables us to accurately evaluate the subposterior densities and sample from the combined posterior using importance sampling. We prove that, under mild assumptions, our importance sampling scheme is stable, i.e., estimates for a test function
have finite variance.
Experimental results show that NAP outperforms state-of-the art methods in several situations, including heterogeneous subposteriors and intricate-shaped, multi-modal or high-dimensional posteriors. Finally, the proposed strategy results in communication costs which are constant in the number of subposterior samples, which is an appealing feature when communication between machines holding data shards and the server is expensive or limited.
In this work, we employ real NVP transformations to approximate subposteriors using samples obtained from independent MCMC runs. In the following subsections, we 1) review the basics of real NVP transformations; 2) discuss how to combine them using importance sampling and 3) how to obtain samples from the approximate posterior using sampling/importance resampling.
2.1 Real Nvp Density Estimation
Real NVP (Dinh et al., 2017) is a class of deep generative models in which a -dimensional real-valued quantity of interest is modeled as a composition of bijective transformations from a base latent variable , with known density function , i.e.:
such that for all . The density is then obtained using the change-of-variable formula
To make (2) tractable, it is composed as follows. Let be a pre-defined proper subset of indices with cardinality , and denote its complement by . Then, each transformation is computed as:
where is an element-wise product. The functions are deep neural networks, which perform scale and translation, respectively. In particular, the Jacobian of , has the form
which avoids explicit computation of the Jacobian of the functions and . For observed data , the weights of the networks and that implicitly parameterize are estimated via maximum likelihood.
2.2 Combining Local Inferences
Consider now a factorization of a target posterior density into a product of subposteriors according to Equation (1). In embarrassingly parallel MCMC, each worker runs MCMC independently on its respective subposterior,
to obtain a set of draws from . The goal is then to produce draws from an approximate target posterior , using the sets of subposterior samples as input. This requires estimating the densities from the subposterior samples, and sampling from the distribution induced by the product of approximations
typically resulting in a trade-off between accuracy and computational efficiency.
In this work, we make use of the fact that bijective transformations using real NVP offers both accurate density estimation and computationally efficient sampling for arbitrarily complex distributions. To this end, we first fit a real NVP network to estimate each as . The networks are then sent to a server that approximates the global posterior as in Equation (5).
In a typical scenario, one would ultimately be interested in using to compute the expectation of some function , such as a predictive density or a utility function. In our case, straightforward importance sampling can be used to weight samples drawn from any of the subposteriors. Thus, given a set of samples drawn from any , we obtain the estimate:
where the importance weights are normalized to sum to one, and given by
This strategy capitalizes on the key properties of real NVP transformations—ease of evaluation and sampling—and avoids the burden of running still more MCMC chains to sample from the aggregated posterior , which might be a complicated target due to the underlying neural networks.
While importance sampling estimates can be unreliable if their variance is very high or infinite, we can provide guarantees that has finite variance. Geweke (1989) showed that, for a broad class of test functions, it suffices to prove that , i.e., the importance weights are bounded. We first note that the denominator of the weight in Equation (6) is included as a factor in the numerator, so that . The remaining thing to check is that is bounded for all and all .
We begin by making the following assumption on the structure of the neural networks, which define the real NVP transformations.
The neural networks , …, associated with the real NVP estimate are equipped with bounded activation functions in their individual output layers.
are equipped with bounded activation functions in their individual output layers.
Note that Assumption 1 is satisfied, for example, when the activation function in question is the hyperbolic tangent or the logistic function. We place no further assumption on the structure of the remaining layers of or in the overall structure of the translation networks .
With the additional condition that we choose an appropriate density for the base variable of the NVP network, we can prove that itself is bounded.
Given a bounded base density , the distribution resulting from transformations is bounded.
As is bounded, there exists such that:
Let . Applying the change-of-variable formula we get:
Let be a constant bounding . Using Assumption 1, since all of the outputs of the neural networks are bounded, their sum is bounded by some . Then, it follows:
i.e., is bounded. Repeating the argument we get a proof by induction on the number of transformations .
As a direct application of Lemma 1, we get the desired bound the importance weights.
For any , there exists such that for all , .
Using Lemma 2.1, let be the upper bound for and let , from which the statement follows. ∎
This provides the sufficient conditions underlined by Geweke (1989), so that we achieve the following result regarding the overall stability of the importance sampling estimates.
Suppose are samples from for some . Let , and . Then, the importance sampling estimate has finite variance.
2.3 Sampling From the Approximate Posterior
We can also use the samples from and their associated importance weights to obtain approximate samples from using sampling/importance resampling (SIR). With this, can be directly estimated as a Monte Carlo integral over the new samples. This procedure easily is done by choosing
with probability proportional to. The required steps are detailed in Algorithm 1.
Note that Algorithm 1 provides, for any single , a valid sampler for the approximate posterior . However, in practice it is beneficial to apply the algorithm for many or all to provide better exploration of the parameter space.
2.4 Time Complexity
We now analyze the time complexity of the proposed method with respect to the number of subposteriors , the number of samples drawn from each of the subposteriors , and the number of samples which we wish to obtain from the aggregated posterior.
Obtaining samples from the approximate posterior using NAP consists of a single pass of the two following steps:
- Step 1.
In parallel, for , fit a real NVP transformation to the samples drawn from the th subposterior at worker .
- Step 2.
Gather the subposterior approximations. Choose a , choose and use Algorithm 1 to draw samples from .
Step 1 involves the usual costs of learning real NVP networks, which can be done using gradient-based methods, such as ADAM (Kingma and Ba, 2014). Assuming the number of layers and weights per layer in each network is fixed, evaluating takes linear time in . Further taking for some constant , we conclude Step 2 can be executed in .
2.5 Communication Costs
It is important to note that typically , i.e., a worker ouputs a much larger number of subposterior samples than the size of the data subset it processes. Even if the dataset is split among workers to improve computational efficiency through parallel inference, sending subposterior samples back to the server for aggregation can amount to considerable communication costs. Therefore, we also examine communication cost of NAP, and contrast it to currently available methods.
The communication cost of the proposed NAP amounts to , corresponding to the cost of communicating the NVP networks to the server, which does not depend on . On the other hand, current methods have their accuracy intrinsically tied to the number of subposterior samples communicated to the server, resulting in communication costs. For example, the partition tree based method of Wang et al. (2015) requires recursive pair-wise aggregation of subposteriors, which calls for centralization of subposterior samples. The non-parametric and semi-parametric methods proposed by Neiswanger et al. (2014) require computing kernel-density estimates defined on each subposterior individually and centralizing them to subsequently execute an MCMC step to sample from their product. Similar costs are implied by the strategy of Nemeth and Sherlock (2018), which fits the Gaussian process approximations and centralizes them to use MCMC to sample from the product of the expected values of their exponentiated predictives.
3 Experimental Results
We evaluated the performance of the proposed method in four different experiments, comparing it against several aggregation methods:
Parametric (PARAM): approximates the posterior as a product of multivariate normal densities fitted to each subposterior (Neiswanger et al., 2014).
Non-parametric (NP): uses kernel density estimates to approximate the subposteriors, takes their product and samples from it using Gibbs sampling (Neiswanger et al., 2014).
Semi-parametric (SP): a hybrid between the two former approaches (Neiswanger et al., 2014).
Consensus (CON): takes weighted averages of subposterior samples to obtain approximate samples from the target posterior (Scott et al., 2016).
Parallel aggregation using random partition trees (PART): uses partition trees to fit hyper-histograms to the target posterior using subposterior samples (Wang et al., 2015).
In the first experiment, we target a uni-modal distribution of an intricate shape. In the second, we approximate a bi-variate multi-modal distribution. In the third, we evaluate the performance of our method when approximating logistic regression posteriors in high dimensions. Finally, in the last one we analyze the performance of our method when there is a clear discrepancy among the subposteriors being merged.
All MCMC simulations were carried out using the python interface of the Stan probabilistic programming language(Carpenter et al., 2017)
, which implements the no-U-turn sampler. For each subposterior, we draw 4000 samples using 16 chains with an equal number of samples as warm-up. The same holds for the target (ground truth) posterior, computed on centralized data. The real NVP networks were implemented with PyTorch111https://pytorch.org using three transformations (
) and Gaussian spherical base densities. The scale and translation networks were all implemented as multi-layer perceptrons with two hidden-layers comprising
nodes each. The layers of these networks were all equipped with rectified linear units except for the the last layer of each scale network, which was equipped with the hyperbolic tangent activation function. The network parameters were optimized using ADAM(Kingma and Ba, 2014) over 1000 iterations with learning rate .
3.1 Warped Gaussian
We first consider inference in a warped Gaussian model which exhibits a banana-shaped posterior and is described by the generative model:
where the true value of the parameters and are and , respectively. The variance is set to and treated as a known constant. We draw observations from the model and distribute them in disjoint sets. Gaussian priors with zero mean and variance 25 were placed both on and .
For NAP, we used Algorithm 1 to draw samples from the approximate posterior, using samples drawn from the individual subposterior approximation. To avoid possible underflow from normalizing a large number of importance weights, we do this in installments, in each of which a suposterior approximation is used as a proposal. The same number of samples was drawn using each of the competing methods.
Figure 1 shows222The experiment was repeated with multiple random seeds, yielding similar results. the samples from the approximate posterior obtained with different aggregation methods, plotted against the posterior obtained using the entire sample set. Of all the methods, only NAP and PART were flexible enough to mimic the banana shape of the posterior. PART, however, is overly concentrated when compared to the ground truth, while NAP more faithfully spreads the mass of the distribution.
3.2 Mixture of Gammas
We now consider performing inference in the shape parameters and of the following two-component mixture of Gammas:
where the true value of the parameters of interest are and . Furthermore, are known constants, which makes the model clearly bi-modal. Independent Gamma priors with shape and scale were placed in and .
As before, we draw observations from the model and distribute them in disjoint sets for parallel inference. Samples from the approximate posterior for each aggregation algorithm were drawn in the same fashion as in the previous experiment.
Figure 2 shows333The experiment was repeated with multiple random seeds, yielding similar results. the samples from the approximate posteriors obtained with each aggregation method plotted against the target posterior, obtained using the entire sample set. The proposed method and PART clearly are the only ones that capture the multi-modality of the posterior. NAP, however, presented a better fit to the true posterior while PART placed more mass in low-density regions.
3.3 Bayesian Logistic Regression
We now explore how our method behaves in higher dimensions in comparison to its alternatives. For this purpose we consider inference on the simple logistic regression model with likelihood:
where denotes the dot product, is the logistic function, and receive independent priors. The true value of is held at and the remaining
are independently from a normal distribution with zero-mean and variance.
To generate a sample pair , we first draw from , where the covariance matrix is such that:
Then, is computed by rounding to one if it is at least , and to zero otherwise.
For each value of , we draw sample pairs using the scheme described above and distribute them in disjoint sets for parallel inference. As in previous experiments, we use NAP and its counterparts to merge the subposteriors and draw samples from the approximate posterior.
Table 1 presents the results for each of the aggregation methods in terms of the following performance measures:
Root mean squared error (RMSE) between the mean of the approximate posterior samples and the mean of samples from the ground truth posterior;
Posterior concentration ratio (), computed as:
comparing the concentration of the two posteriors around the ground truth mean (values close to one are desirable);
KL divergence; () between a multivariate normal approximation of the aggregated posterior and a multivariante normal approximation of the true one, both computed from samples.
Experiments were repeated ten times for each value of , in each of which a new was drawn.
When compared to the other methods, for all values of , NAP presents a mean closer to the one obtained using centralized inference (smaller RMSE) and has a more accurate spread around it ( closer to one). In terms of KL divergence, only at , PARAM outperforms NAP by a relatively small margin. Besides this case, NAP performs orders of magnitude better than the other methods, with increasing disparity as grows.
3.4 Rare Categorical Events
In the scenarios explored in the previous experiments, there is no specific reason to believe that the subposteriors differ drastically from each other. We now consider parallel inference on the parameters of a categorical model, with respective outcomes and , where the probability of observing one outcome is much higher than the others, i.e. .
We simulate data points from with . We then partition the data into disjoint subsets, run MCMC in parallel and apply different aggregation methods to obtain samples from the approximate posterior. The aggregated posteriors are then compared with the one obtained using the complete data.
Since the expected number of and per partition is 2, it often occurs than some partitions have only . Figure 3 illustrates how disparate the subposteriors can be depending on the specific partitioning of data.
To compensate for the variability in experiment results due to the random partitioning of the subsets, we repeated the experiments one hundred times with different random seeds, and report average results in Table 2. NAP clearly outperforms its competitors, with results that are orders of magnitude better.
We have proposed an embarrassingly parallel MCMC scheme in which each subposterior density is mapped to a tractable form using a deep invertible generative model. We capitalized on the ease of sampling from the mapped subposteriors and evaluating their log density values to build an efficient importance sampling scheme to merge the subposteriors. Imposing mild assumptions on the structure of the network, we proved that our importance sampling scheme is stable. While in this work we gave special attention to the use of real NVP networks, our approach could potentially employ other invertible models, such as the Glow transform (Kingma and Dhariwal, 2018), without loosing theoretical guarantees, as long as the log densities remain bounded. If the bounds are difficult to verify, one could still resort to truncated forms of importance sampling (Ionides, 2008; Vehtari et al., 2015) to control the variance of importance sampling estimates.
Our experimental results demonstrated that NAP is capable of capturing intricate posteriors and cope with heteregenous subposteriors. In particular, we observed that it significantly outperformed current methods in high-dimensional settings. A possible explanation for this is that, unlike the density estimation techniques underlying the competing methods, the real NVP transformations used in our method, are specifically designed for high-dimensional data, such as images. Finally, the generative models we use serve as a intermediate representation to the subposterior, the size of which does not depend on the number of subposterior samples. Thus, workers can produce arbitrarily accurate subposterior estimates by drawing additional samples, without affecting the cost of communicating the subposteriors to the server, or the computational cost of aggregating them into a final posterior estimate.
DM, PB and SK were funded by the Academy of Finland, grants 319264 and 294238. The authors gratefully acknowledge the computational resources provided by the Aalto Science-IT project and support from the Finnish Center for Artificial Intelligence (FCAI).
Ahn et al. (2014)
Sungjin Ahn, Babak Shahbaba, and Max Welling.
Distributed stochastic gradient MCMC.
Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML’14, pages II–1044–II–1052. JMLR.org, 2014.
- Angelino et al. (2016) Elaine Angelino, Matthew James Johnson, and Ryan P. Adams. Patterns of scalable Bayesian inference. Foundations and Trends in Machine Learning, 9(2-3):119–247, 2016. doi: 10.1561/2200000052.
- Carpenter et al. (2017) Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.
- Dinh et al. (2017) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In International Conference on Learning Representations, 2017.
- Geweke (1989) John Geweke. Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6):1317–1339, 1989.
- Ionides (2008) Edward L Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311, 2008. doi: 10.1198/106186008X320456.
- Johnson et al. (2013) Matthew Johnson, James Saunderson, and Alan Willsky. Analyzing hogwild parallel Gaussian Gibbs sampling. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2715–2723. Curran Associates, Inc., 2013.
- Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- Kingma and Dhariwal (2018) Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative Flow with Invertible 1x1 Convolutions. arXiv e-prints, art. arXiv:1807.03039, Jul 2018.
- Ma et al. (2015) Yi-An Ma, Tianqi Chen, and Emily B. Fox. A complete recipe for stochastic gradient MCMC. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 2917–2925, Cambridge, MA, USA, 2015. MIT Press.
- Neiswanger et al. (2014) Willie Neiswanger, Chong Wang, and Eric P. Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14, pages 623–632, Arlington, Virginia, United States, 2014. AUAI Press.
- Nemeth and Sherlock (2018) Christopher Nemeth and Chris Sherlock. Merging MCMC subposteriors through Gaussian-process approximations. Bayesian Anal., 13(2):507–530, 06 2018. doi: 10.1214/17-BA1063.
- Quiroz et al. (2018) Matias Quiroz, Robert Kohn, Mattias Villani, and Minh-Ngoc Tran. Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, 2018. doi: 10.1080/01621459.2018.1448827.
- Scott et al. (2016) Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11:78–88, 2016.
- Vehtari et al. (2015) Aki Vehtari, Andrew Gelman, and Jonah Gabry. Pareto Smoothed Importance Sampling. arXiv e-prints, art. arXiv:1507.02646, Jul 2015.
- Wang et al. (2015) Xiangyu Wang, Fangjian Guo, Katherine A. Heller, and David B. Dunson. Parallelizing MCMC with random partition trees. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 451–459, Cambridge, MA, USA, 2015. MIT Press.