1 Introduction
Bayesian paradigms have been routinely used for uncertainty quantification. Markov chain Monte Carlo, particularly Gibbs sampling, has been the mainstream tool for posterior estimation; a primary challenge is that the samples tend to be autocorrelated when updating via the conditional given the current state, leading to a low effective sample size. To obtain a high-accuracy estimate, one has to run a long simulation of the Markov chain. For a recent discussion on this issue, see Robert et al. (2018). A large literature of new Markov chain methods has been proposed, aiming to make each Markov state less dependent on the previous state. Methods include the Metropolis-adjusted Langevin algorithm (Roberts et al., 1996), Hamiltonian Monte Carlo (Neal et al., 2011), piecewise deterministic Markov processes (Bierkens et al., 2019), and continuous-time Markov chain Monte Carlo (Fearnhead et al., 2018). At the same time, there is some literature focusing on alternatives to Markov chain methods. For example, among others, Approximate Bayesian Computation (Beaumont et al., 2009) avoids the evaluation of the likelihood and instead relies on a divergence between simulated and observed data in rejection sampling; variational Bayes (Blei et al., 2017) replaces the complicated posterior with a mean-field approximation that is easier to sample. Despite the popularity of those approximations, there are critical issues, such as ignoring the correlation among the parameters, and the difficulty of quantifying the approximation error for diagnostics. For a discussion of and remedies for those issues, see Giordano et al. (2018). The transport map is an interesting new solution (El Moselhy and Marzouk, 2012). To briefly review, let θ be a continuous parameter of interest,
π(θ) the prior probability density, y the data, and L(y; θ) the likelihood. Slightly abusing notation, denote the random variable from the posterior as θ ~ π(θ | y) = L(y; θ) π(θ) / Z, where Z is the normalizing constant. Consider another random variable z from another proper measure (with z having the same number of elements as θ), referred to as the reference distribution z ~ q.
Suppose there exists an invertible map T that pushes forward the reference measure to the posterior. Using the standard variable transformation θ = T(z), the pullback density from the posterior should equal the reference one:

(1)  π(T(z) | y) |det ∇_z T(z)| = q(z),
with ∇_z T(z) the Jacobian matrix. With T flexibly parameterized and satisfying invertibility, one minimizes the difference between the left side (often computed without the constant Z) and the right side, using samples z_1, …, z_N from q. When the optimal T is reached, the T(z_i)'s are exact posterior samples, and they are completely independent, a major advantage compared to the Markov chain methods. Despite its potential, there are two limitations. First, most importantly, there is no guarantee that an invertible map exists between q and π(· | y) in general. This is also known as the ill-posed Monge problem (Santambrogio, 2015). As an example, if z is a bivariate Gaussian but θ has one element fixed to a constant (a Dirac measure), there is no map between the two due to the different dimensions. Even without the dimensionality issue, often T is parameterized with limited flexibility, which can lead to thinner distribution tails of the pushforward, compared to the posterior. As shown in our later example, when π(θ | y) is multimodal, the estimated T tends to converge to only one of the modes. As a remedy, existing methods instead use the transport map to optimize a Markov transition kernel from one state to the next (Parno and Marzouk, 2018). Despite some improvement, this arguably weakens the strength of having independence in the samples. We address this issue via a much more general coupling in the form of a mixture of multiple transport maps, borrowing strength from nonparametric Bayes for both flexibility and computational tractability. Second, the typical implementation relies on the manual design of T, as one has to choose a composite of several invertible functions. These procedures are cumbersome for users. We are not the first to recognize this issue, as some automated solutions have been recently proposed (such as Levy et al. (2018)). Our uniqueness is that we explore a clear alternative to the unscalable optimization needed for every sample of θ: instead, first estimate T on a smaller training set of z's, then generate a large number of samples with T fixed.
We provide a theoretical examination of why the associated error is negligible, producing a motivated choice for the reference distribution.
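To make the pullback identity (1) concrete before developing the method, here is a minimal sketch in Python for a one-dimensional affine map; the names `q`, `T_inv`, `mu`, and `sigma` are our illustrative choices, not notation from the paper.

```python
import math

# Sketch: verify the pullback identity (1) for a simple invertible map.
# Reference q = N(0, 1); the map T(z) = mu + sigma*z pushes q forward
# to N(mu, sigma^2), so q(T^{-1}(theta)) |dT^{-1}/dtheta| recovers the target.
mu, sigma = 1.5, 2.0

def q(z):
    # standard normal density (the reference distribution)
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def T_inv(theta):
    # inverse of the affine map T
    return (theta - mu) / sigma

def pullback_density(theta):
    # q(T^{-1}(theta)) * |d T^{-1} / d theta|, one side of (1)
    return q(T_inv(theta)) / sigma

def target(theta):
    # N(mu, sigma^2) density, playing the role of the posterior
    return math.exp(-0.5 * ((theta - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

for theta in (-1.0, 0.0, 2.5):
    assert abs(pullback_density(theta) - target(theta)) < 1e-12
```

For a Gaussian target the map is affine and the identity holds exactly; the point of the method is that a flexible T makes the same identity approximately attainable for general posteriors.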
2 Method
2.1 Random Transport via Nonparametric Coupling
For a concise exposition, we focus on θ and z as continuous random variables. Extensions to other cases will be discussed at the end of the article. Consider a coupling, defined by the joint density p(θ, z) between θ and z, with the two marginal constraints

(2)  ∫ p(θ, z) dθ = q(z),

(3)  ∫ p(θ, z) dz = π(θ | y).
The coupling always exists; one trivial case would be the independent coupling p(θ, z) = π(θ | y) q(z). In fact, the transport map (1) is another case, a deterministic coupling (with the conditional a Dirac delta), which relies on the strong assumption that (3) can be met for any posterior. To relax this assumption, we instead consider a random transport, modeling the conditional distribution of θ given z as a random output from one of several invertible maps, parameterized by a mixture distribution

(4)  p(θ | z) = Σ_{k=1}^∞ w_k(z) δ_{T_k(z)}(θ),

with the mixture weights w_k(z) ≥ 0 and Σ_k w_k(z) = 1, for any z. We allow w_k to potentially change with z for more flexibility (this includes the simpler case with w_k invariant to z). This mixture is inspired by the Bayesian nonparametric method to model an unknown conditional density (Dunson et al., 2007). The randomness in p(θ | z) has two benefits: (i) we relax the stringent equal-dimensionality requirement, as θ can have a lower measure dimension than z; (ii) we can now use a simpler form for each T_k with less worry about its flexibility, as it can be compensated by multiple copies of the maps. The following theorem formalizes the high flexibility of the random transport as the number of maps grows.
Theorem 1.
With this generalization, the random transport still enjoys an equality connecting the posterior π(θ | y) and the reference q, providing a path for using optimization to estimate the maps T_k and weights w_k.
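The sampling mechanism implied by the random transport (4) can be sketched in a few lines of Python; the two affine maps and the constant weights below are toy stand-ins of ours for the learned maps T_k and weights w_k.

```python
import random

# Sketch of the random transport (4): draw z from the reference q,
# pick one of K invertible maps T_k with probability w_k(z), and set
# theta = T_k(z). Here K = 2, the maps are affine, and the weights
# are constant in z (the simpler case mentioned in the text).
random.seed(0)

maps = [lambda z: -3.0 + 0.5 * z,   # T_1: sends mass near -3
        lambda z: +3.0 + 0.5 * z]   # T_2: sends mass near +3
weights = [0.5, 0.5]                # mixture weights w_1, w_2

def sample_theta():
    z = random.gauss(0.0, 1.0)                    # z ~ q (reference)
    k = 0 if random.random() < weights[0] else 1  # k ~ Categorical(w)
    return maps[k](z)

draws = [sample_theta() for _ in range(2000)]
# A single smooth deterministic map from a unimodal reference could not
# reach both regions; the mixture of maps covers each one.
assert any(d < -1.5 for d in draws) and any(d > 1.5 for d in draws)
```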
2.2 Transport Monte Carlo
With the above results, we can now develop a Transport Monte Carlo algorithm, which consists of two stages: (i) optimizing the weights w_k and maps T_k using a set of training samples z_1, …, z_N from q; (ii) generating new samples using the estimated ŵ_k and T̂_k. In the first stage, we minimize

(6)  −(1/N) Σ_{i=1}^N log { Σ_k w_k(z_i) π(T_k(z_i) | y) |det ∇_z T_k(z_i)| },

which is related to the empirical Kullback-Leibler divergence, up to the constant (1/N) Σ_i log q(z_i). After the optimization converges, in the second stage, we sample new k and θ via

(7)  z ~ q,  k | z ~ Categorical{ŵ_1(z), ŵ_2(z), …},  θ = T̂_k(z).
We separate these two stages because the optimization is the time-consuming step, and using a smaller set of training samples leads to much better scalability, while the sampling is embarrassingly parallel and we can rapidly obtain a large number of samples. In a later section, we will show the associated error is very small with a well-chosen q and a sufficiently large N.
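The two stages can be mimicked on a toy problem; the single affine map and the grid search below are our simplifications, standing in for the neural-network parameterization and stochastic gradient descent.

```python
import math
import random

# Stage 1: fit a single affine map T(z) = mu + s*z to the target N(2, 1)
# by minimizing an empirical version of the loss on a small training set.
# Stage 2: with the fitted map held fixed, draw many cheap new samples.
random.seed(1)
z_train = [random.gauss(0.0, 1.0) for _ in range(500)]  # small training set

def log_target(theta):
    # unnormalized log posterior; here a N(2, 1) target
    return -0.5 * (theta - 2.0) ** 2

def loss(mu, s):
    # empirical loss: -(1/N) sum_i [log pi(T(z_i)) + log |T'(z_i)|]
    return -sum(log_target(mu + s * z) + math.log(abs(s)) for z in z_train) / len(z_train)

# crude grid optimization over (mu, s), standing in for SGD
best = min(((mu, s) for mu in (1.8, 1.9, 2.0, 2.1, 2.2)
                    for s in (0.8, 0.9, 1.0, 1.1, 1.2)),
           key=lambda p: loss(*p))

# Stage 2: sampling is embarrassingly parallel once (mu, s) is fixed.
samples = [best[0] + best[1] * random.gauss(0.0, 1.0) for _ in range(10000)]
assert abs(best[0] - 2.0) <= 0.2 and abs(best[1] - 1.0) <= 0.2
```

Once the expensive stage is done, generating ten thousand further samples is nearly free, which is the scalability argument made above.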
2.3 Automated Computation for Invertible Maps
A major complexity of the transport map methods is in the parameterization and estimation of the invertible maps. To simplify and automate this process, we take advantage of the popularity of neural networks and their computational toolbox. Following Dinh et al. (2017), we parameterize each T_k as an M-layer invertible neural network. Without loss of generality, assume θ is a p-element vector. We set each layer as a p-element to p-element transform:

(8)  y_A = x_A,  y_B = x_B ∘ exp{s(x_A)} + t(x_A),

where ∘ denotes the Hadamard element-wise product; (A, B) form a partition of the indices; and s and t are nested neural networks with flexible forms (not necessarily invertible), with the only requirement that all scales exp{s(·)} be positive. Each layer resembles a location-scale transform, except that the magnitudes of scaling and shifting also depend on the input, making the change nonlinear. Although the nested s and t have flexible forms, the enclosing layer is invertible, with the roles of A and B alternating between even and odd layers. Therefore, the entire T_k is invertible. In addition, the log-determinant of the Jacobian is simply the sum of the output elements of s in the neural network. See Dinh et al. (2017) for the detailed derivation. For w, we parameterize it similarly, except using an extra layer of Softmax activation, so that the output lies in the probability simplex. We parametrize s and t with a simple neural network containing one hidden layer, connected by the leaky ReLU activation function; the last layer of s is exponentiated so that each element of the scale is positive; we use the same architecture in all the examples. Compared to a manually designed map, the neural network is much easier to use for approximating a nonlinear transform. In addition, it allows us to exploit popular computing toolboxes such as PyTorch
(Paszke et al., 2017). The optimization utilizes stochastic gradient descent, randomly generating a new batch of z's in each iteration for the gradient calculation. For tractable computation, we approximate the infinite mixture with a truncated one in most examples, and found empirically no difference compared to using a larger truncation level.

2.4 Generalization Error and Choosing the Reference Distribution
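The coupling transform (8) of Section 2.3 can be sketched with NumPy; the fixed functions `s` and `t` below are toy stand-ins for the nested neural networks, and the index partition is our arbitrary choice.

```python
import numpy as np

# One affine coupling layer (after Dinh et al., 2017): the first block
# passes through unchanged, the second is scaled and shifted by amounts
# that depend on the first block, so the layer is invertible in closed form.
rng = np.random.default_rng(0)

def s(xa):  # "log-scale network" (toy stand-in)
    return np.tanh(xa)

def t(xa):  # "shift network" (toy stand-in)
    return 0.5 * xa

def coupling_forward(x):
    xa, xb = x[:2], x[2:]              # fixed partition (A, B) of the indices
    yb = xb * np.exp(s(xa)) + t(xa)    # Hadamard scale-and-shift
    logdet = float(np.sum(s(xa)))      # log|det Jacobian| = sum of log-scales
    return np.concatenate([xa, yb]), logdet

def coupling_inverse(y):
    ya, yb = y[:2], y[2:]
    xb = (yb - t(ya)) * np.exp(-s(ya)) # exact inverse, no matrix solve needed
    return np.concatenate([ya, xb])

x = rng.normal(size=4)
y, logdet = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)  # invertible by construction
```

Stacking several such layers, with the roles of the two blocks alternating, gives a flexible yet invertible map whose log-Jacobian is a cheap sum.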
As described above, the samples of θ are generated using the ŵ_k and T̂_k estimated from a finite training set of z's. There can be a small difference between the sampled distribution and the exact posterior, which we will refer to as the generalization error (based on some resemblance to the common training-and-prediction task). Such an error can be quantified theoretically, providing a guide on choosing the reference q. To distinguish them from the training samples, in this section we will use z* and θ* to denote the newly generated samples. When the equality (5) holds, each training sample attains a common minimized value of the loss, a constant. In practice, this also holds approximately, as exhibited by the low sample variance of the per-sample losses. Intuitively, if the training z_i's are dense enough to cover most of the high-probability region of q, the new sample z* will be near a certain training z_i with high probability, hence the associated loss should be near the minimized loss as well. This concept can be formalized as a high-probability net. Using this concept, we now formally state the accuracy guarantee.
Theorem 3.
Suppose the training samples z_1, …, z_N form a high-probability net, and further suppose that 1. q and π(· | y) are absolutely continuous with respect to the Lebesgue measure, and 2. each s and t in (8) is Lipschitz continuous and bounded; then the loss at a newly generated sample is close to the minimized loss, with high probability.
For this theorem to be practically useful, we need a high-probability net that achieves a small covering radius under a moderate training size N. Therefore, an important task is to choose a good reference q whose spacing decreases rapidly in N, associated with the classical problem of bounding the maximal spacing (Janson, 1987). The multivariate normal, commonly used in transport-related methods, is in fact suboptimal for this purpose (Deheuvels et al., 1986). Instead, in this article we choose the uniform distribution, which has a much faster rate (Devroye et al., 1982). For a tighter but more involved bound, see Janson (1987). For diagnostics, we have a tractable error estimate using the empirical Kullback-Leibler divergence; for an intractable normalizing constant Z, we replace it with an estimator Ẑ.
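The rate comparison can be illustrated empirically in one dimension; the sample sizes below are arbitrary choices of ours.

```python
import random

# Illustration of the spacing argument: for a uniform reference on [0, 1],
# the maximal gap between sorted samples shrinks roughly like log(n)/n,
# so a moderate training set already forms a fine covering net.
random.seed(2)

def max_spacing(n):
    pts = sorted(random.random() for _ in range(n))
    gaps = [b - a for a, b in zip([0.0] + pts, pts + [1.0])]
    return max(gaps)

small_n, large_n = max_spacing(50), max_spacing(5000)
assert large_n < small_n   # denser training sets give a finer net
```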
3 Numeric Illustrations
3.1 Multimodal Posterior
We first sample from a posterior formed by a mixture of multivariate Gaussians. Consider θ from the two-component mixture with unequal component means and unequal covariances.
We choose this example for two reasons: (i) to demonstrate that the deterministic transport map can fail even in such a simple case, and how multiple maps in Transport Monte Carlo solve this issue; (ii) we intentionally leave out the normalizing constant Z when implementing the loss function (6), in order to test the accuracy of the estimator Ẑ. As shown in Figure 1(b), the deterministic transport map fails to discover the second component, even though the two modes are not too far apart. This is due to the different covariances in the two components, particularly the different correlation signs, which disrupt the log-concavity and cause one transport map to collapse to only one component. On the other hand, as shown in Figure 1(c), Transport Monte Carlo correctly identifies both modes as it exploits multiple transport maps. The estimated Ẑ is very close to the truth. To assess the new-sample error, we generate another batch of samples; the resulting empirical Kullback-Leibler divergence is small and comparable to the kernel-based Kullback-Leibler estimate (Boltz et al., 2009) between two exact samples from the target. This suggests that Transport Monte Carlo is highly accurate.

Figure 1: Comparing the performance of transporting from a uniform distribution (panel a) to the two-component mixture of correlated Gaussian distributions. The transport map, with a single deterministic invertible map, fails to discover the second mode (panel b), while Transport Monte Carlo correctly estimates the posterior (panel c).
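For reference, a target of this kind can be written down directly; the particular means, covariances, and weights below are our illustrative choices rather than the paper's exact settings.

```python
import numpy as np

# Two-component bivariate Gaussian mixture with opposite correlation
# signs: the feature that breaks log-concavity and defeats a single
# deterministic transport map.
m1, m2 = np.array([-2.0, -2.0]), np.array([2.0, 2.0])
C1 = np.array([[1.0, 0.7], [0.7, 1.0]])    # positive correlation
C2 = np.array([[1.0, -0.7], [-0.7, 1.0]])  # negative correlation

def gauss2(x, m, C):
    # bivariate normal density evaluated at x
    d = x - m
    quad = d @ np.linalg.solve(C, d)
    return float(np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(C))))

def target(x):
    # equal-weight mixture density
    return 0.5 * gauss2(x, m1, C1) + 0.5 * gauss2(x, m2, C2)

# Both modes carry substantial mass, separated by a low-density region.
origin = np.array([0.0, 0.0])
assert target(m1) > target(origin) and target(m2) > target(origin)
```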
3.2 High-Dimensional Parameter
To illustrate the performance with a high-dimensional parameter, we consider a sparse linear regression with the regularized horseshoe prior (Piironen and Vehtari, 2017), where the local scales follow a half-Cauchy distribution and the slab scale an inverse-gamma. Compared to the original horseshoe (Carvalho et al., 2010), the large signals approximately follow a normal prior instead of a Cauchy, hence have a finite prior mean. As a result of this modification, Gibbs sampling is no longer applicable, and Hamiltonian Monte Carlo is suggested instead. When simulating the data, we set most of the regression coefficients to zero and a few to large values; we use informative priors to induce low noise and a shrinking global scale, with the remaining hyperparameters as suggested by Piironen and Vehtari (2017). The results are compared with Hamiltonian Monte Carlo, using the No-U-Turn sampler provided by the PyMC3 package. As most of the parameters are close to the degenerate zero, Hamiltonian Monte Carlo can only use a small leapfrog step, resulting in extremely slow mixing (Figure 2). We run it for 100,000 iterations, thin the chain, and use the retained draws as the posterior sample. This takes approximately 8 hours. In contrast, Transport Monte Carlo takes only a few minutes in optimization and can generate new samples almost instantaneously. Due to the independence, no thinning is needed. The results are almost identical to those obtained by the costly Hamiltonian Monte Carlo.
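A data-generating sketch in the spirit of this experiment is below; the dimensions, noise level, and signal values are our illustrative choices, not the settings used in the paper.

```python
import numpy as np

# Sparse linear regression setup: p >> n, most coefficients exactly
# zero, a few large signals. A horseshoe-type prior shrinks the many
# near-zero coordinates aggressively while leaving the large signals
# nearly unshrunk.
rng = np.random.default_rng(3)
n, p = 100, 400
beta = np.zeros(p)
beta[:5] = 4.0                           # a few large signals
X = rng.normal(size=(n, p))              # covariates
y = X @ beta + 0.5 * rng.normal(size=n)  # low-noise responses

assert np.count_nonzero(beta) == 5 and y.shape == (n,)
```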
4 Discussion
We show that Transport Monte Carlo can serve as an appealing alternative to conventional Markov chain Monte Carlo methods, and enjoys theoretical justification via a coupling. On the other hand, several drawbacks remain and merit future work. Although we show the error is very small, unlike Markov chain Monte Carlo, there is no asymptotic exactness guarantee as the number of samples goes to infinity. A possible correction, amenable to high-dimensional parameters, is using the obtained samples as initial values for parallel Markov chains (Hoffman et al., arXiv preprint 1903.03704). Another extension is to consider θ in a discrete or constrained space. Our random transport can be combined with the transport that changes one discrete variable to another (Tran et al., 2019); nevertheless, a careful measure-theoretic study is still an interesting open problem. Lastly, for conciseness, we leave out the cases when one has a degenerate mass mixed with a continuous random variable, for example, in the spike-and-slab prior (Ishwaran and Rao, 2005). To accommodate this, the reference needs to be adaptive as well, which is worth exploring as a case study.
References
 Beaumont et al. (2009) Beaumont, M. A., J.-M. Cornuet, J.-M. Marin, and C. P. Robert (2009). Adaptive Approximate Bayesian Computation. Biometrika 96(4), 983–990.
 Bierkens et al. (2019) Bierkens, J., P. Fearnhead, G. Roberts, et al. (2019). The Zig-Zag Process and Super-efficient Sampling for Bayesian Analysis of Big Data. The Annals of Statistics 47(3), 1288–1320.
 Blei et al. (2017) Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). Variational Inference: a Review for Statisticians. Journal of the American Statistical Association 112(518), 859–877.
 Boltz et al. (2009) Boltz, S., E. Debreuve, and M. Barlaud (2009). High-dimensional Statistical Measure for Region-of-interest Tracking. IEEE Transactions on Image Processing 18(6), 1266–1283.
 Carvalho et al. (2010) Carvalho, C. M., N. G. Polson, and J. G. Scott (2010). The Horseshoe Estimator for Sparse Signals. Biometrika 97(2), 465–480.
 Deheuvels et al. (1986) Deheuvels, P. et al. (1986). On the Influence of the Extremes of an IID Sequence on the Maximal Spacings. The Annals of Probability 14(1), 194–208.
 Devroye et al. (1982) Devroye, L. et al. (1982). A Log Log Law for Maximal Uniform Spacings. The Annals of Probability 10(3), 863–868.
 Dinh et al. (2017) Dinh, L., J. Sohl-Dickstein, and S. Bengio (2017). Density Estimation using Real NVP. In International Conference on Learning Representations.
 Dunson et al. (2007) Dunson, D. B., N. Pillai, and J.H. Park (2007). Bayesian Density Regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(2), 163–183.
 El Moselhy and Marzouk (2012) El Moselhy, T. A. and Y. M. Marzouk (2012). Bayesian Inference with Optimal Maps. Journal of Computational Physics 231(23), 7815–7850.
 Fearnhead et al. (2018) Fearnhead, P., J. Bierkens, M. Pollock, and G. O. Roberts (2018). Piecewise Deterministic Markov Processes for Continuous-time Monte Carlo. Statistical Science 33(3), 386–412.

 Giordano et al. (2018) Giordano, R., T. Broderick, and M. I. Jordan (2018). Covariances, Robustness and Variational Bayes. The Journal of Machine Learning Research 19(1), 1981–2029.
 Ishwaran and Rao (2005) Ishwaran, H. and J. S. Rao (2005). Spike and Slab Variable Selection: Frequentist and Bayesian Strategies. The Annals of Statistics 33(2), 730–773.
 Janson (1987) Janson, S. (1987). Maximal Spacings in Several Dimensions. The Annals of Probability 15(1), 274–280.
 Levy et al. (2018) Levy, D., M. D. Hoffman, and J. Sohl-Dickstein (2018). Generalizing Hamiltonian Monte Carlo with Neural Networks. In International Conference on Learning Representations.
 Neal et al. (2011) Neal, R. M. et al. (2011). MCMC using Hamiltonian Dynamics. Handbook of Markov Chain Monte Carlo 2(11), 2.
 Parno and Marzouk (2018) Parno, M. D. and Y. M. Marzouk (2018). Transport Map Accelerated Markov Chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification 6(2), 645–682.

 Paszke et al. (2017) Paszke, A., S. Gross, S. Chintala, and G. Chanan (2017). PyTorch: Tensors and Dynamic Neural Networks in Python with Strong GPU Acceleration.
 Piironen and Vehtari (2017) Piironen, J. and A. Vehtari (2017). Sparsity Information and Regularization in the Horseshoe and Other Shrinkage Priors. Electronic Journal of Statistics 11(2), 5018–5051.
 Robert et al. (2018) Robert, C. P., V. Elvira, N. Tawn, and C. Wu (2018). Accelerating MCMC algorithms. Wiley Interdisciplinary Reviews: Computational Statistics 10(5), e1435.
 Roberts et al. (1996) Roberts, G. O., R. L. Tweedie, et al. (1996). Exponential Convergence of Langevin Distributions and Their Discrete Approximations. Bernoulli 2(4), 341–363.
 Santambrogio (2015) Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Birkhäuser, NY 55, 58–63.
 Tran et al. (2019) Tran, D., K. Vafa, K. K. Agrawal, L. Dinh, and B. Poole (2019). Discrete Flows: Invertible Generative Models of Discrete Data. In International Conference on Learning Representations DeepGenStruct Workshop.
Appendix
Proof of theorem 1
It is trivial to see that (2) holds. For (3), we show its existence via one construction (among many). For any measurable set, denote the conditional probability
As any measurable function is the pointwise limit of simple function, for any measurable, there exists limit
where the disjoint partition of and a constant depending on . Integrate over ,
As , we can choose small enough such that with some set of . Therefore unless . Take to be , the left hand side is , choose the right hand side . Given , the integral using (4) has
For each , let and satisfy , so that , this yields the result.
Proof of theorem 2
where the Jacobian emerges due to the change of variables for invertible maps.
Proof of theorem 3
The task is equivalent to showing the loss has a bounded derivative almost everywhere. Rewriting
and taking derivative with respect to the th subcoordinate of , denoted by , its magnitude satisfies
Examining the derivative yields
By absolute continuity, the first two absolute values are finite almost everywhere. For the third term, . For the last term, each is a finite composition of Lipschitz functions, therefore it is Lipschitz with respect to ; as , each derivative is equal to . Denote the index that achieves the minimum distance as , then
Combining the above yields the result.