# Transport Monte Carlo

In Bayesian inference, the transport map is a promising alternative to the canonical Markov chain Monte Carlo for posterior estimation: it uses optimization to find a deterministic map from an easy-to-sample reference distribution to the posterior. However, an invertible map between the two distributions often does not exist, and can be difficult to parameterize with sufficient flexibility. Motivated to address these issues and substantially simplify its use, we propose Transport Monte Carlo. Instead of relying on a single deterministic map, we consider a coupling joint distribution modeled by a non-parametric mixture of several maps. Such a coupling is guaranteed to exist between the reference and posterior distributions. To automate the map parameterization and estimation, we use invertible neural networks in place of the manual design procedure. Once the coupling is estimated, one can rapidly generate a large number of completely independent samples. With a carefully chosen reference distribution, the difference between the generated samples and the exact posterior is negligibly small. Both theoretical and empirical results demonstrate its advantages for solving common but challenging sampling problems.

02/05/2021


## 1 Introduction

Bayesian paradigms have been routinely used for uncertainty quantification. Markov chain Monte Carlo, particularly Gibbs sampling, has been the mainstream tool for posterior estimation; a primary challenge is that the samples tend to be auto-correlated when updating via the conditional given the current state, leading to a low effective sample size. To obtain a high-accuracy estimate, one has to run a long simulation of the Markov chain. For a recent discussion of this issue, see Robert et al. (2018). A large literature of new Markov chain methods has been proposed, aiming to make each Markov state less dependent on the previous state. Methods include the Metropolis-adjusted Langevin algorithm (Roberts et al., 1996), Hamiltonian Monte Carlo (Neal et al., 2011), piecewise deterministic Markov processes (Bierkens et al., 2019), and continuous-time Markov chain Monte Carlo (Fearnhead et al., 2018). At the same time, there is some literature focusing on alternatives to Markov chain methods. For example, among others, Approximate Bayesian Computation (Beaumont et al., 2009) avoids the evaluation of the likelihood and instead relies on a divergence between simulated and observed data in rejection sampling; variational Bayes (Blei et al., 2017) replaces the complicated posterior with a mean-field approximation that is easier to sample. Despite the popularity of those approximations, there are critical issues, such as ignoring the correlation among the parameters and the difficulty of quantifying the approximation error for diagnostics. For a discussion of and remedy for those issues, see Giordano et al. (2018). The transport map is an interesting new solution (El Moselhy and Marzouk, 2012). To briefly review, let θ be a continuous parameter of interest, Π0(θ) the prior probability density, Y the data and L(Y;θ) the likelihood. Slightly abusing notation, denote the random variable from the posterior as

 θ ∼ Π(θ∣Y) = L(Y;θ)Π0(θ)/m(Y),

where m(Y) = ∫L(Y;θ)Π0(θ)dθ is the normalizing constant. Consider another random variable β from another proper measure (with β having the same number of elements as θ), referred to as the reference distribution

 β∼Πr(β).

Suppose there exists an invertible map f that pushes forward the reference measure to the posterior Π(θ∣Y). Using the standard variable transformation θ = f(β), the pullback density from the posterior should equal the reference one:

 [L{Y;f(β)}Π0{f(β)}/m(Y)] |det∇f(β)| = Πr(β), (1)

with ∇f(β) the Jacobian matrix. With f flexibly parameterized and satisfying invertibility, one minimizes the difference between the left (often without the constant m(Y)) and right sides, using samples β_1,…,β_n ∼ Πr. When the optimum is reached, the f(β_i)'s are exact posterior samples, and they are completely independent, a major advantage compared to the Markov chain methods. Despite its potential, there are two limitations. First, and most importantly, there is no guarantee that an invertible map exists between Πr and Π(·∣Y) in general. This is also known as the ill-posed Monge problem (Santambrogio, 2015). As an example, if one variable is a bivariate Gaussian but the other has one element fixed to a constant (a Dirac measure), there is no such map between the two due to the different dimensions. Even without the dimensionality issue, f is often parameterized with limited flexibility, which can lead to thinner distribution tails of f(β) compared to the posterior. As shown in our later example, when Π(θ∣Y) is multi-modal, the estimated f tends to converge to only one of the modes. As a remedy, existing methods instead use the transport map to optimize a Markov transition kernel from one state to the next (Parno and Marzouk, 2018). Despite some improvement, this arguably weakens the strength of having independence in the samples. We address this issue via a much more general coupling in the form of a mixture of multiple transport maps, borrowing strength from non-parametric Bayes for both flexibility and computational tractability. Second, the typical implementation relies on the manual design of f, as one has to choose a composite of several invertible functions. These procedures are cumbersome for users. We are not the first to recognize this issue, as some automated solutions have been recently proposed (such as Levy et al. (2018)).
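To make identity (1) concrete, here is a minimal numeric check in Python; this is a hypothetical one-dimensional example of our own (not from the experiments in this paper), where an exact map does exist: the reference N(0,1) is pushed forward to N(μ, σ²) by the affine map f(β) = μ + σβ, and the pullback density recovers the reference density exactly.

```python
import numpy as np

def normal_pdf(x, loc=0.0, scale=1.0):
    # Density of N(loc, scale^2)
    return np.exp(-0.5 * ((x - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))

# Reference N(0,1) pushed to a "posterior" N(mu, sigma^2) by f(beta) = mu + sigma*beta;
# mu and sigma are illustrative values.
mu, sigma = 2.0, 0.5
f = lambda b: mu + sigma * b

rng = np.random.default_rng(0)
betas = rng.standard_normal(5)

# Left side of (1): posterior density at f(beta) times |det grad f| = sigma.
pullback = normal_pdf(f(betas), loc=mu, scale=sigma) * sigma
reference = normal_pdf(betas)

assert np.allclose(pullback, reference)  # identity (1) holds
```

For a non-Gaussian posterior, the optimization described above searches for an f making the two sides of (1) as close as possible.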
Our uniqueness is that we explore a clear alternative to the unscalable optimization otherwise needed for every sample of θ: instead, we first estimate the coupling on a smaller training set of β's, then generate a large number of samples with the coupling fixed. We provide a theoretical examination of why the associated error is negligible, producing a motivated choice for the reference distribution.

## 2 Method

### 2.1 Random Transport via Non-parametric Coupling

For a concise exposition, we focus on θ and β as continuous random variables. Extensions to other cases will be discussed at the end of the article. Consider a coupling, defined by the joint density P(θ,β), between θ and β:

 (θ,β) ∼ P(θ,β),
 ∫P(θ,β)dβ = Π(θ∣Y), (2)
 ∫P(θ,β)dθ = Πr(β). (3)

The coupling always exists: one trivial case would be the independent coupling P(θ,β) = Π(θ∣Y)Πr(β). In fact, the transport map (1) is another case, a deterministic coupling P(θ,β) = Π(θ∣Y)δ{β − f⁻¹(θ)} (with δ the Dirac delta), which relies on the strong assumption that (3) can be met with a single invertible map f. To relax this assumption, we instead consider a random transport, modeling the conditional distribution of β given θ as a random output from one of several invertible maps, parameterized by a mixture distribution

 P(θ,β) = P(θ)P(β∣θ), P(θ) = Π(θ∣Y), P(β∣θ) = ∑_{k=1}^K w_k(θ) δ{β − f_k⁻¹(θ)}, (4)

with the mixture weights w_k(θ) ≥ 0 and ∑_{k=1}^K w_k(θ) = 1, for any θ. We allow w_k to potentially change with θ for more flexibility (this includes the simpler case with w_k invariant to θ). This mixture is inspired by the Bayesian non-parametric method of modeling an unknown conditional density (Dunson et al., 2007). The randomness in P(β∣θ) has two benefits: (i) we relax the stringent equal-dimensionality requirement, as β can have a lower measure dimension than θ; (ii) we can now use simpler f_k's with less worry about their flexibility, as it can be compensated with multiple copies of f_k's. The following theorem formalizes the high flexibility of random transport as K → ∞.

###### Theorem 1.

Under (4), assuming θ has an equal or higher dimension than β, as K → ∞ there exist {(w_k, f_k) : k = 1, 2, …} such that (2) and (3) are satisfied almost everywhere.

With this generalization, the random transport still enjoys an equality connecting Πr and Π(·∣Y), providing a path for using optimization to estimate the w_k's and f_k's.

###### Theorem 2.

Under (3) and (4),

 Πr(β) = ∑_{k=1}^K w_k{f_k(β)} Π{f_k(β)∣Y} |det∇f_k(β)|. (5)
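Identity (5) can be checked numerically in a toy setting of our own construction (not one of this paper's examples): a one-dimensional bimodal target, a N(0,1) reference, two affine maps with unit Jacobian, and θ-dependent weights chosen as the component responsibilities.

```python
import numpy as np

def phi(x):  # standard normal density
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def target(t):  # bimodal "posterior": 0.5 N(-3,1) + 0.5 N(3,1)
    return 0.5 * phi(t + 3) + 0.5 * phi(t - 3)

# Two affine maps from the N(0,1) reference; |det grad f_k| = 1 for both.
f1 = lambda b: b - 3.0
f2 = lambda b: b + 3.0

# theta-dependent weights: the responsibility of each mixture component.
w1 = lambda t: 0.5 * phi(t + 3) / target(t)
w2 = lambda t: 0.5 * phi(t - 3) / target(t)

beta = np.linspace(-4, 4, 101)
# Right side of (5): sum over maps of w_k{f_k(beta)} * target{f_k(beta)} * 1
rhs = w1(f1(beta)) * target(f1(beta)) + w2(f2(beta)) * target(f2(beta))
assert np.allclose(rhs, phi(beta))  # identity (5) holds exactly here
```

Here two simple maps and input-dependent weights reproduce the reference density exactly, whereas a single affine map could not.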

### 2.2 Transport Monte Carlo

With the above results, we can now develop a Transport Monte Carlo algorithm, which consists of two stages: (i) optimizing the w_k's and f_k's using a set of training samples β_1,…,β_n ∼ Πr; (ii) generating new samples using the estimated ŵ_k's and f̂_k's. In the first stage, we minimize

 1/n ∑_{i=1}^n log g(β_i), g(β) = 1/K ∑_{k=1}^K Πr(β) / [w_k{f_k(β)} L{Y;f_k(β)} Π0{f_k(β)} |det∇f_k(β)|], (6)

which is related to the empirical Kullback-Leibler divergence as 1/n ∑_{i=1}^n log g(β_i) + log m(Y). After the optimization converges, in the second stage, we sample new β and θ via

 β ∼ Πr, θ = f_k(β) with probability v_k, (7)
 v_k ∝ w_k{f_k(β)} Π{f_k(β)∣Y} |det∇f_k(β)|.

We separate these two stages because the optimization is the time-consuming step, and using a smaller set of training samples leads to much better scalability; the sampling, in contrast, is embarrassingly parallel, so we can rapidly obtain a large number of samples. In a later section, we will show that the associated error is very small with a well-chosen Πr and a sufficiently large training size n.
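The second stage (7) is straightforward to implement once the maps and weights are fitted. Below is a hypothetical numpy sketch that, in place of a trained coupling, reuses hand-constructed maps and responsibilities for a one-dimensional bimodal target; each draw is generated independently.

```python
import numpy as np

# Hand-constructed (not trained) ingredients for the 1-D bimodal target
# 0.5*N(-3,1) + 0.5*N(3,1) with a N(0,1) reference; |det grad f_k| = 1.
phi = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
target = lambda t: 0.5 * phi(t + 3) + 0.5 * phi(t - 3)
maps = [lambda b: b - 3.0, lambda b: b + 3.0]
weights = [lambda t: 0.5 * phi(t + 3) / target(t),
           lambda t: 0.5 * phi(t - 3) / target(t)]

rng = np.random.default_rng(1)

def tmc_sample(n):
    """Stage 2 of Transport Monte Carlo: independent draws via (7)."""
    thetas = np.empty(n)
    for i in range(n):
        b = rng.standard_normal()               # beta ~ reference
        cand = np.array([f(b) for f in maps])   # candidate f_k(beta)
        v = np.array([w(c) * target(c) for w, c in zip(weights, cand)])
        v /= v.sum()                            # v_k as in (7)
        thetas[i] = cand[rng.choice(len(maps), p=v)]
    return thetas

draws = tmc_sample(4000)
# Roughly half of the mass should land near each mode.
assert abs(np.mean(draws > 0) - 0.5) < 0.05
```

Since the draws are independent, the loop parallelizes trivially across samples.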

### 2.3 Automated Computation for Invertible Maps

A major complexity of transport map methods lies in the parameterization and estimation of the invertible maps. To simplify and automate this process, we take advantage of the popularity of neural networks and their computational toolbox. Following Dinh et al. (2017), we parameterize each f_k as an M-layer invertible neural network. Without loss of generality, assume β is a p-element vector. We set each layer as a p-element to p-element transform:

 f_k(β) = σ_{k,1} ∘ σ_{k,2} ∘ ⋯ ∘ σ_{k,M}(β), (8)
 σ_{k,m}(x) = [x_A ∗ s_{k,A}(x_B) + t_{k,A}(x_B), x_B]′ if m is even,
 σ_{k,m}(x) = [x_A, x_B ∗ s_{k,B}(x_A) + t_{k,B}(x_A)]′ if m is odd,

where ∗ denotes the Hadamard elementwise product; (A, B) is a partition of the indices {1,…,p}; the s's and t's are nested neural networks with flexible forms (not necessarily invertible), with the only requirement that all outputs of the s's are positive. Each σ_{k,m} resembles a location-scale transform, except that the magnitudes of scaling and shifting also depend on the input, making the change non-linear. Although the nested s and t have flexible forms, the enclosing σ_{k,m} is invertible: for even m, x_B = y_B and x_A = {y_A − t_{k,A}(y_B)}/s_{k,A}(y_B); similarly for odd m. Therefore, the entire f_k is invertible. In addition, |det∇f_k(β)| is simply the product, over layers m and coordinates j, of the output elements of s_{k,A} and s_{k,B} in σ_{k,m}. See Dinh et al. (2017) for the detailed derivation. For w_k, we parameterize it similarly to s and t, except using an extra layer of softmax activation, so that the output lies in the probability simplex. We parametrize each s and t with a simple neural network containing one hidden layer, connected by the leaky ReLU activation function; the last layer of s is exponentiated so that each element is positive; we use this architecture in all the examples. Compared to a manually designed f_k, the neural network is much easier to use for approximating a non-linear transform. In addition, it allows us to exploit popular computing toolboxes such as PyTorch (Paszke et al., 2017). The optimization utilizes stochastic gradient descent, randomly generating a new batch of β's in each iteration for the gradient calculation. For tractable computation, we approximate the infinite K with a truncated finite K in most examples, and found empirically no difference compared to using a larger K.
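A minimal numpy sketch of one coupling layer in (8) may help; the hidden width, random weights, and even split of the coordinates are illustrative assumptions of ours, and a practical implementation would use PyTorch with trained parameters. The point is that the layer is invertible by construction and its log-determinant is a simple sum.

```python
import numpy as np

rng = np.random.default_rng(0)
p, half = 4, 2  # beta has p elements; indices A = first half, B = second half

def mlp(params, x):
    """One-hidden-layer net with leaky ReLU, a stand-in for s and t."""
    W1, b1, W2, b2 = params
    h = x @ W1 + b1
    h = np.where(h > 0, h, 0.01 * h)
    return h @ W2 + b2

def init(din, dout, width=8, scale=0.3):
    # Random untrained weights, purely for illustration.
    return (scale * rng.standard_normal((din, width)), np.zeros(width),
            scale * rng.standard_normal((width, dout)), np.zeros(dout))

s_params, t_params = init(half, half), init(half, half)

def coupling_forward(x):
    """sigma(x) = [x_A * s(x_B) + t(x_B), x_B]; exp keeps the scale positive."""
    xA, xB = x[:half], x[half:]
    s = np.exp(mlp(s_params, xB))
    y = np.concatenate([xA * s + mlp(t_params, xB), xB])
    logdet = np.sum(np.log(s))  # triangular Jacobian: product of the scales
    return y, logdet

def coupling_inverse(y):
    yA, yB = y[:half], y[half:]
    s = np.exp(mlp(s_params, yB))  # x_B = y_B, so s is recomputed exactly
    return np.concatenate([(yA - mlp(t_params, yB)) / s, yB])

x = rng.standard_normal(p)
y, logdet = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)  # invertible by construction
```

Stacking several such layers, alternating which half is transformed, gives the full map f_k, and the layer log-determinants add up.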

### 2.4 Generalization Error and Choosing the Reference Distribution

As described above, the samples of θ are generated using ŵ_k and f̂_k, estimated from a finite training set of β_i's. There can be a small difference between the sampled distribution and the exact posterior, which we will refer to as the generalization error (based on some resemblance to the common training-and-prediction task). Such an error can be quantified theoretically, providing a guide on choosing the reference Πr. To distinguish from the training samples, in this section, we will use θ∗ and β∗ to denote the newly generated samples. When the equality (5) holds, each training sample has a common minimized value g(β_i) = 1/m(Y), a constant. In practice, this also holds approximately, as exhibited by the low sample variance of the g(β_i)'s. Intuitively, if the β_i's are dense enough to cover most of the high-probability region of Πr, a new sample β∗ will be near a certain training β_i with high probability, hence the associated loss should be near the minimized loss as well. This concept can be formalized as a high-probability ε-net, defined as

 β_{ε,δ} = {(β_1,…,β_n) : ∫_A Πr(β)dβ ≥ 1 − δ, A = {β : inf_{i∈{1,…,n}} ∥β − β_i∥ ≤ ε}}.

Using this concept, we now formally state the accuracy guarantee.

###### Theorem 3.

For given ε and δ, suppose the training samples (β_1,…,β_n) form a high-probability ε-net as defined above, and further suppose 1. Πr and Π(·∣Y) are absolutely continuous with respect to the Lebesgue measure, 2. each s and t in (8) is Lipschitz continuous and bounded; then

 inf_{i∈{1,…,n}} ∥log{g(β∗)} − log{g(β_i)}∥ = O(ε),

with probability at least 1 − δ.

For this theorem to be practically useful, we need a high-probability ε-net that achieves a small ε under a moderate size n. Therefore, an important task is to choose a good Πr whose maximal spacing decreases rapidly in n, a question associated with the classical problem of bounding the maximal spacing (Janson, 1987). The multivariate normal, commonly used in transport-related methods, is in fact sub-optimal for this purpose (Deheuvels et al., 1986); the uniform distribution has a much faster rate (Devroye et al., 1982). For a tighter but more involved bound, see Janson (1987). For diagnostics, we have a tractable error estimate using the empirical Kullback-Leibler divergence

 KL{Π(θ∗)∥Π(θ∣Y)} = 1/n∗ ∑_{i=1}^{n∗} log g(β∗_i) + log m(Y).

For intractable m(Y), we replace it with an estimator m̂(Y).

## 3 Numeric Illustrations

### 3.1 Multi-modal Posterior

We first sample from a posterior given by a mixture of multivariate Gaussians. Consider θ ∈ ℝ² from the two-component mixture with unequal component means and unequal covariances

 Π(θ∣Y) = 0.5 No(θ; (1, 2)′, [1, 0.5; 0.5, 1]) + 0.5 No(θ; (6, 2)′, [1, −0.9; −0.9, 1]).

We choose this example for two reasons: (i) to demonstrate that the deterministic transport map can fail even in such a simple case, and how the multiple maps in Transport Monte Carlo solve this issue; (ii) we intentionally leave out the normalizing constant m(Y) when implementing the loss function (6), in order to test the accuracy of the estimator m̂(Y). As shown in Figure 1(b), the deterministic transport map fails to discover the second component, even though the two modes are not far apart. This is due to the different covariances in the two components, particularly the opposite correlation signs, which disrupt the log-concavity and cause the single transport map to collapse to only one component. On the other hand, as shown in Figure 1(c), Transport Monte Carlo correctly identifies both modes as it exploits multiple transport maps. The estimated m̂(Y) is very close to the truth. To assess the error of the new samples, we generate additional samples; the resulting empirical Kullback-Leibler divergence is close to the kernel-based Kullback-Leibler estimate (Boltz et al., 2009) between two exact samples from Π(θ∣Y), which serves as a comparison baseline. This suggests that Transport Monte Carlo is highly accurate.

### 3.2 High-Dimensional Parameter

To illustrate the performance with a high-dimensional parameter, we consider sparse linear regression with the regularized horseshoe prior (Piironen and Vehtari, 2017). For data index j and covariate index k,

 y_j ∼ No(x_j′b, σ0²), b_k ∼ No(0, λ̃_k²τ²), λ̃_k² = c²λ_k² / (c² + τ²λ_k²),
 λ_k ∼ C⁺(0, 1), c² ∼ Ga⁻¹(ν/2, νs²/2),

where x_j is the covariate vector; C⁺ denotes the half-Cauchy distribution and Ga⁻¹ the inverse-gamma. Compared to the original horseshoe (Carvalho et al., 2010), the large signals approximately follow a normal prior No(0, c²) instead of a Cauchy, and hence have a finite prior mean. As a result of this modification, Gibbs sampling is no longer applicable, and Hamiltonian Monte Carlo is suggested instead. When simulating the data, we set most of the coefficients to zero, with a few large signals. For both σ0² and τ, we use informative priors to induce low noise and a small global shrinkage scale; for c², we set ν and s² as suggested by Piironen and Vehtari (2017). The results are compared with Hamiltonian Monte Carlo, using the No-U-Turn sampler provided by the PyMC3 package. As most of the parameters are close to degenerate at zero, Hamiltonian Monte Carlo can only use a small leap-frog step, resulting in extremely slow mixing (Figure 2). We run it for 100,000 iterations and thin the chain, using the retained draws as the posterior sample. This takes approximately 8 hours. In contrast, Transport Monte Carlo takes only a few minutes for the optimization and can generate new samples almost instantaneously. Due to the independence, no thinning is needed. The results are almost identical to those obtained by the costly Hamiltonian Monte Carlo.
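The shrinkage mechanism of λ̃_k can be verified numerically: small signals are shrunk like the original horseshoe, while large signals are capped near the slab scale c. The values of τ, c, and λ_k below are illustrative only, not the simulation settings of this section.

```python
import numpy as np

# Regularized-horseshoe shrinkage: lambda_tilde^2 = c^2 lam^2 / (c^2 + tau^2 lam^2).
# Illustrative values, not the paper's simulation settings.
tau, c = 0.1, 2.0
lam = np.logspace(-2, 4, 7)  # local scales from tiny to huge

lam_tilde2 = c**2 * lam**2 / (c**2 + tau**2 * lam**2)
prior_sd = np.sqrt(lam_tilde2) * tau  # conditional prior sd of b_k

# Small signals: behaves like the original horseshoe, sd ~ tau * lam.
assert np.allclose(prior_sd[0], tau * lam[0], rtol=1e-3)
# Large signals: sd is capped near c instead of growing like a Cauchy tail.
assert abs(prior_sd[-1] - c) / c < 1e-3
```

This bounded tail is exactly what gives the large signals a finite prior mean, at the cost of losing conjugacy for Gibbs sampling.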

## 4 Discussion

We show that Transport Monte Carlo can serve as an appealing alternative to conventional Markov chain Monte Carlo methods, and it enjoys theoretical justification via a coupling. On the other hand, several drawbacks remain and merit future work. Although we show the error is very small, unlike Markov chain Monte Carlo, there is no asymptotic exactness guarantee as the number of samples goes to infinity. A possible correction amenable to the high-dimensional parameter is using the obtained samples as initial values for parallel Markov chains (Hoffman et al., 2019, arXiv:1903.03704). Another extension is to consider θ in a discrete or constrained space. Our random transport can be combined with the transport that changes one discrete variable to another (Tran et al., 2019); nevertheless, a careful measure-theoretic study remains an interesting open problem. Lastly, for conciseness, we leave out the cases where one has a degenerate mass mixed with a continuous random variable, for example, in the spike-and-slab prior (Ishwaran and Rao, 2005). To accommodate this, the reference Πr needs to be adaptive as well, and it is worth exploring as a case study.

## References

• Beaumont et al. (2009) Beaumont, M. A., J.-M. Cornuet, J.-M. Marin, and C. P. Robert (2009). Adaptive Approximate Bayesian Computation. Biometrika 96(4), 983–990.
• Bierkens et al. (2019) Bierkens, J., P. Fearnhead, G. Roberts, et al. (2019). The Zig-Zag Process and Super-efficient Sampling for Bayesian Analysis of Big Data. The Annals of Statistics 47(3), 1288–1320.
• Blei et al. (2017) Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). Variational Inference: a Review for Statisticians. Journal of the American Statistical Association 112(518), 859–877.
• Boltz et al. (2009) Boltz, S., E. Debreuve, and M. Barlaud (2009). High-dimensional Statistical Measure for Region-of-interest Tracking. IEEE Transactions on Image Processing 18(6), 1266–1283.
• Carvalho et al. (2010) Carvalho, C. M., N. G. Polson, and J. G. Scott (2010). The Horseshoe Estimator for Sparse Signals. Biometrika 97(2), 465–480.
• Deheuvels et al. (1986) Deheuvels, P. et al. (1986). On the Influence of the Extremes of an IID Sequence on the Maximal Spacings. The Annals of Probability 14(1), 194–208.
• Devroye et al. (1982) Devroye, L. et al. (1982). A Log Log Law for Maximal Uniform Spacings. The Annals of Probability 10(3), 863–868.
• Dinh et al. (2017) Dinh, L., J. Sohl-Dickstein, and S. Bengio (2017). Density Estimation using Real NVP. In International Conference on Learning Representations.
• Dunson et al. (2007) Dunson, D. B., N. Pillai, and J.-H. Park (2007). Bayesian Density Regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(2), 163–183.
• El Moselhy and Marzouk (2012) El Moselhy, T. A. and Y. M. Marzouk (2012). Bayesian Inference with Optimal Maps. Journal of Computational Physics 231(23), 7815–7850.
• Fearnhead et al. (2018) Fearnhead, P., J. Bierkens, M. Pollock, and G. O. Roberts (2018). Piecewise Deterministic Markov Processes for Continuous-time Monte Carlo. Statistical Science 33(3), 386–412.
• Giordano et al. (2018) Giordano, R., T. Broderick, and M. I. Jordan (2018). Covariances, Robustness and Variational Bayes. The Journal of Machine Learning Research 19(1), 1981–2029.
• Ishwaran and Rao (2005) Ishwaran, H. and J. S. Rao (2005). Spike and Slab Variable Selection: Frequentist and Bayesian Strategies. The Annals of Statistics 33(2), 730–773.
• Janson (1987) Janson, S. (1987). Maximal Spacings in Several Dimensions. The Annals of Probability 15(1), 274–280.
• Levy et al. (2018) Levy, D., M. D. Hoffman, and J. Sohl-Dickstein (2018). Generalizing Hamiltonian Monte Carlo with Neural Networks. In International Conference on Learning Representations.
• Neal et al. (2011) Neal, R. M. et al. (2011). MCMC using Hamiltonian Dynamics. Handbook of Markov Chain Monte Carlo 2(11), 2.
• Parno and Marzouk (2018) Parno, M. D. and Y. M. Marzouk (2018). Transport Map Accelerated Markov Chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification 6(2), 645–682.
• Paszke et al. (2017) Paszke, A., S. Gross, S. Chintala, and G. Chanan (2017). PyTorch: Tensors and Dynamic Neural Networks in Python with Strong GPU Acceleration.
• Piironen and Vehtari (2017) Piironen, J. and A. Vehtari (2017). Sparsity Information and Regularization in the Horseshoe and Other Shrinkage Priors. Electronic Journal of Statistics 11(2), 5018–5051.
• Robert et al. (2018) Robert, C. P., V. Elvira, N. Tawn, and C. Wu (2018). Accelerating MCMC algorithms. Wiley Interdisciplinary Reviews: Computational Statistics 10(5), e1435.
• Roberts et al. (1996) Roberts, G. O., R. L. Tweedie, et al. (1996). Exponential Convergence of Langevin Distributions and Their Discrete Approximations. Bernoulli 2(4), 341–363.
• Santambrogio (2015) Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Birkhäuser, NY 55, 58–63.
• Tran et al. (2019) Tran, D., K. Vafa, K. K. Agrawal, L. Dinh, and B. Poole (2019). Discrete Flows: Invertible Generative Models of Discrete Data. In International Conference on Learning Representations DeepGenStruct Workshop.

## Appendix

### Proof of theorem 1

It is trivial to see that (2) holds. For (3), we show its existence via one (among many) construction. For any measurable set A, denote the conditional probability

 ∫_A [P(θ,β)/Π(θ∣Y)] dβ = μ(A) ∈ (0,1].

As any measurable function is the pointwise limit of simple functions, for any measurable h, there exists the limit

 h(β) = lim_{n→∞} ∑_{i=1}^n g_i^{[n]} 1(β ∈ E_i^{[n]}),

where the E_i^{[n]}'s form a disjoint partition and each g_i^{[n]} is a constant depending on n. Integrating over μ,

 ∫h(β)dμ(β) = lim_{n→∞} ∑_{i=1}^n g_i^{[n]} μ(E_i^{[n]}).

As μ(A) ∈ (0,1], we can choose the partition sets small enough such that each retained set has positive measure under μ; therefore g_i^{[n]} can be set to zero unless μ(E_i^{[n]}) > 0. Taking h to be the density of Πr determines the left-hand side, and we choose the right-hand side to match it. Given θ, the integral using (4) has

 ∫_A P(β∣θ)dβ = ∑_{k=1}^∞ w_k(θ) 1{f_k⁻¹(θ) ∈ A}.

For each k, let w_k(θ) and f_k⁻¹ satisfy w_k(θ) = g_k^{[n]}μ(E_k^{[n]}) and f_k⁻¹(θ) ∈ E_k^{[n]}, so that the two sides agree in the limit; this yields the result.

### Proof of theorem 2

 Πr(β) = ∫ ∑_{k=1}^K w_k(θ) Π(θ∣Y) δ{β − f_k⁻¹(θ)} dθ
 = ∑_{k=1}^K ∫ w_k(θ) Π(θ∣Y) δ{θ − f_k(β)} |det∇f_k(β)| dθ
 = ∑_{k=1}^K w_k{f_k(β)} Π{f_k(β)∣Y} |det∇f_k(β)|,

where the Jacobian emerges due to the identity δ{β − f_k⁻¹(θ)} = δ{θ − f_k(β)} |det∇f_k(β)| for invertible f_k.

### Proof of theorem 3

The task is equivalent to showing that log{g(β)} has a bounded derivative almost everywhere. Rewriting

 log{g(β)} = log ∑_{k=1}^K exp h_k(β),
 h_k(β) = −log K + log Πr(β) − log w_k{f_k(β)} − log[L{Y;f_k(β)} Π0{f_k(β)}] − log|det∇f_k(β)|,

and taking the derivative with respect to the m-th sub-coordinate of β, denoted by β[m], its magnitude satisfies

 |∑_{k=1}^K [exp h_k(β) / ∑_{l=1}^K exp h_l(β)] ∂h_k(β)/∂β[m]|
 ≤ ∑_{k=1}^K [exp h_k(β) / ∑_{l=1}^K exp h_l(β)] |∂h_k(β)/∂β[m]|
 ≤ max_{k∈{1,…,K}} |∂h_k(β)/∂β[m]|.

Examining the derivative yields

 |∂h_k(β)/∂β[m]| ≤ |∂logΠr(β)/∂β[m]| + |∂log[w_k(θ)L{Y;θ}Π0(θ)]/∂θ ∣_{θ=f_k(β)}| |∂f_k(β)/∂β[m]| + ∑_{all s_k} |∂log|s_k|/∂β[m]|.

By absolute continuity, the first two absolute values are finite almost everywhere. For the third term, ∂f_k(β)/∂β[m] is bounded, since f_k is a finite composition of Lipschitz functions. For the last term, each s_k is a finite composition of Lipschitz functions, therefore it is Lipschitz with respect to β; as each s_k is bounded, every such derivative is finite. Denote the index that achieves the minimum distance by i0; then

 inf_{i∈{1,…,n}} ∥log{g(β∗)} − log{g(β_i)}∥ ≤ ∥log{g(β∗)} − log{g(β_{i0})}∥ = O(∥β∗ − β_{i0}∥).

Combining the above yields the result.