# Greedy inference with layers of lazy maps

We propose a framework for the greedy approximation of high-dimensional Bayesian inference problems, through the composition of multiple low-dimensional transport maps or flows. Our framework operates recursively on a sequence of “residual” distributions, given by pulling back the posterior through the previously computed transport maps. The action of each map is confined to a low-dimensional subspace that we identify by minimizing an error bound. At each step, our approach thus identifies (i) a relevant subspace of the residual distribution, and (ii) a low-dimensional transformation between a restriction of the residual onto this subspace and a standard Gaussian. We prove weak convergence of the approach to the posterior distribution, and we demonstrate the algorithm on a range of challenging inference problems in differential equations and spatial statistics.

• 4 publications
• 9 publications
• 6 publications
• 27 publications
02/12/2021

### Stochastic Convergence Rates and Applications of Adaptive Quadrature in Bayesian Inference

We provide the first stochastic convergence rates for adaptive Gauss–Her...
03/17/2017

### Inference via low-dimensional couplings

We investigate the low-dimensional structure of deterministic transforma...
08/09/2021

### Scalable Bayesian transport maps for high-dimensional non-Gaussian spatial fields

A multivariate distribution can be described by a triangular transport m...
11/30/2020

### Derivative-Informed Projected Neural Networks for High-Dimensional Parametric Maps Governed by PDEs

Many-query problems, arising from uncertainty quantification, Bayesian i...
12/14/2021

### Adaptive Projected Residual Networks for Learning Parametric Maps from Sparse Data

We present a parsimonious surrogate framework for learning high dimensio...
05/12/2021

### Multiscale Invertible Generative Networks for High-Dimensional Bayesian Inference

We propose a Multiscale Invertible Generative Network (MsIGN) and associ...
10/22/2020

### Measure Transport with Kernel Stein Discrepancy

Measure transport underpins several recent algorithms for posterior appr...

## 1 Introduction

Inference in the Bayesian setting typically requires the computation of integrals over an intractable posterior distribution, whose density111In this paper, we only consider distributions that are absolutely continuous with respect to the Lebesgue measure on , and thus will use the notation to denote both the distribution and its associated density. is known up to a normalizing constant. A useful approach to this problem involves constructing a deterministic nonlinear transformation, i.e., a transport map villani2008optimal , that induces a coupling of with a tractable distribution (e.g., a standard Gaussian). More formally, we seek a map that pushes forward to , written as , such that the change of variables makes integration tractable.

Many constructions for such maps have been developed in recent years. Triangular maps bogachev2005triangular ; rosenblatt1952remarks ; knothe1957contributions are sufficiently general to couple any absolutely continuous pair of distributions , and their numerical approximations have been investigated in el2012bayesian ; marzouk2016introduction ; spantini2018inference ; Jaini2019 . Other constructions write as a composition of simpler parameterized functions: these include normalizing flows rezende2015variational ; tabak2013family ; dinh2016density , autoregressive flows Kingma2016

, and neural ordinary differential equations

Chen2018 ; Dupont2019 . Stein variational methods liu2016stein ; liu2017stein ; Detommaso2018 provide a nonparametric way of constructing

as a composition of functions lying in a chosen RKHS. Some of these methods have actually been proposed for density estimation (Gaussianizing a distribution from which one has drawn samples, via the action of a map

laparra2011iterative ; tabak2010density ), while others have been developed for variational inference (approximating by , typically in the sense of Kullback–Leibler (KL) divergence), but these two problems are tightly linked as long as is invertible.

Many of the methods mentioned above (with exceptions, e.g., Stein methods) are all-at-once: given some parameterized ansatz for the map, they solve a large and potentially challenging optimization problem for all the parameters simultaneously. More generally, it is difficult to represent expressive maps in high dimensions: parameterized triangular maps on , for example, must describe

-variate functions and thus immediately encounter the curse of dimensionality. Other flows involve particular ansatzes that limit their generality. Similarly, kernel-based methods lose expressiveness in high dimensions

Detommaso2018 ; Chen2019 . Yet one of the key advantages of transport is the potential for iterative approximation using the pullback distribution: any invertible/bijective map preserves information about the target distribution , and hence recursive—even greedy—approaches, applied to the pullback of through imperfect maps, could be effective. In this paper we exploit the interplay between such approximations and low-dimensional structure.

We propose a new greedy framework for variational inference, based on compositions of low-dimensional maps. The maps are low-dimensional in that each acts non-trivially only on a low-dimensional subspace of the parameter space; thus they are termed “lazy.” We identify the maps and subspaces sequentially, by greedily minimizing a bound on the KL divergence between and its approximation, given constraints on the computational effort at each step. The bound itself follows from logarithmic Sobolev inequalites (see zahm2018certified ) relating the KL divergence to gradients of the target density. We prove weak convergence of this approach to the target distribution, given even suboptimal choices of the subspace. Within this framework, there is considerable flexibility. The complexity of the map, and hence the computational effort, can be tailored at each iteration. Moreover, each lazy map can be approximated by any invertible transformation; while here we do so using lazy triangular maps, one could instead use lazy maps obtained via Stein variational methods or normalizing flows.

We also emphasize that our overall approach does not require any global low dimensionality in the likelihood function (as in cui2014likelihood ; Chen2019 , for example); rather, it can converge even for maps that act on a single direction at a time, despite and differing in all directions.

#### Preliminaries.

We will consider target distributions with densities on that are differentiable almost everywhere and that can be evaluated up to a normalizing constant. Such a target will often be the posterior of a Bayesian inference problem, e.g.,

 π(x)\coloneqqp(x|y)∝Ly(x)π0(x), (1)

where is the likelihood function and is the prior. will denote the standard Gaussian density on . We will consider maps that are diffeomorphisms,222In general does not need to be a diffeomorphism, but only a particular invertible map; see Appendix B for more details. The distributions we will consider in this paper, however, fulfill the necessary conditions for to be differentiable almost everywhere. and with some abuse of notation, we will write the pushforward density of under as . We will frequently also use the notion of a pullback distribution or density, written as .

Our goal is to find a map that satisfies for some prescribed tolerance . In Section 2 we show how to build a single map in the “lazy” format described above. However, the map complexity needed to satisfy the prescribed tolerance might be large if is too small. We thus show in Section 3 how to improve our approximations by greedily composing many lazy maps.

## 2 Lazy maps

Given a unitary matrix

and an integer , let be the set that contains all the maps of the form

 T(z)=Urτ(z1,…,zr)+U⊥z⊥ (2)

for some diffeomorphism . Here and are the matrices containing respectively the first and the last columns of , and . Any map is called a lazy map with rank bounded by , as it is nonlinear only with respect to the first inputs variables and the nonlinearity is contained in the low-dimensional subspace . The next proposition gives a characterization of all the densities when .

###### Proposition 1 (Characterization of lazy maps).

Let be a unitary matrix and let . Then for any lazy map , there exists a stricly positive function such that

 T♯ρ(x)=f(U⊤rx)ρ(x), (3)

for all where

is the density of the standard normal distribution. Conversely, any probability density function of the form

for some admits a representation as in (3) for some .

The proof is given in Appendix A.1. In particular, applying Proposition 1 to (1): if the prior distribution is and if the likelihood function is of the form of , then there exists a lazy map that exactly recovers the posterior density . Following (zahm2018certified, , Section 2.1), we have

 DKL(π||T♯ρ)=DKL(π||T⋆♯ρ)+DKL(T⋆♯ρ||T♯ρ), (4)

where is such that and is defined by the conditional expectation with . Because , Equation (4) shows that is a minimizer of over . Following huber1985projection , one can show

 DKL(π||T⋆♯ρ)=DKL(π||ρ)−DKL((U⊤r)♯π||ρr), (5)

where is the density of the standard normal distribution on and is the marginal density of with respect to , meaning the density of with . Equation (5) shows that, for fixed , minimizing over is the same as finding the most non-Gaussian marginal . Such an optimal can be difficult to find in practice. The next proposition gives a useful bound on , which we will use to construct a suboptimal . Proof is given in Appendix A.2.

###### Proposition 2.

Let be the

-th eigenpair of the eigenvalue problem

where . Let

be the matrix containing the eigenvectors of

. Then for any we have

 DKL(π||T⋆♯ρ)≤12(λr+1+…+λd). (6)

Proposition 2 suggests constructing as the matrix containing the eigenvectors of and choosing to be the smallest integer such that the left-hand side of (6) is below . A fast decay in the spectrum of ensures reaching the desired tolerance with . In practice, however, this might not be the case either because is too small or because the spectrum of does not decay quickly enough. Since the complexity for constructing the lazy map strongly depends on , we bound by some that corresponds to the maximal computational budget one can afford in constructing . This procedure is summarized in Algorithm 1.

The practical implementation of Algorithm 1 relies on the discretization of steps 2 (i.e., the computation of ) and 5 (i.e., minimizing the KL divergence). In step 2, direct Monte Carlo estimation of requires generating samples from , which is not feasible in practice. Following zahm2018certified , one could use the importance sampling estimate , where are i.i.d. samples from and where

are self-normalized weights. Such an estimate might have a huge variance when

is a poor approximation to the target . In this case it is preferable to impose , which yields a more biased estimate but with lower variance. As shown in Section 4 (see also the error bounds in (zahm2018certified, , Sec. 3.3.2)), such an estimate still provides relevant information regarding the construction of , and improves via the iterations described in Section 3.

At step 5 of Algorithm 1, we need to find the map such that minimizes . Following the ideas in transportmaps ; bigoni2016nips ; spantini2018inference ; Jaini2019 , we seek in a parametric family of monotone triangular maps; details on this construction are given in Appendix B. As noted in spantini2018inference , minimizing over this class of parametric maps can be challenging because it is difficult to build good estimators for when we cannot sample directly from . In our setting we can only evaluate up to a normalizing constant, and thus it is expedient to minimize the reverse KL divergence , as is typical in variational Bayesian methods—which here can be approximated using Monte Carlo sampling or quadrature from the tractable . Details on the numerical implementation of Algorithm 1 are given in Appendix C. Though reversing the KL divergence in this step changes the approximation, we will show numerically that this computational strategy yields good results, ultimately controlling the KL divergence in both directions.

## 3 Layers of lazy maps

The restriction in Algorithm 1 allows control over the computational cost of constructing the lazy map, but might not be satisfied. To compensate for this lack of accuracy while preserving low complexity, we propose to build a map as a composition of lazy maps

 Tℓ=T1∘…∘Tℓ,Tk∈Tr(Uk),

where each is a lazy map associated with a different unitary matrix . For simplicity we fix the same rank for each .

In general, the composition of lazy maps is not itself a lazy map. For example, there exists such that might depend nonlinearly on each input variable and so it cannot be written as in (2). Notice also that there is no chance in general to write as a product ; otherwise, by Proposition 1, would be a lazy map with rank bounded by . As a consequence, approximations of the form of are different than those constructed by projection pursuit huber1985projection .

We build the lazy maps in a greedy way. After iterations, the composition of maps have been constructed. We seek a unitary matrix and a lazy map such that is a better approximation to than . Let

 πℓ−1=(Tℓ−1)♯π,

be the pullback density of through . Notice that so we can build using Algorithm 1 by replacing by and by . Once this is done, we update the map and the residual density and we move to the next iteration . We stop the iteration process when the error falls below . Proposition 2 applied with and allows writing , where

 Hℓ=∫(∇logπℓρ)(∇logπℓρ)⊤dπ.

Hence, we can use to stop the iteration. This greedy algorithm is summarized in Algorithm 2. Details on the numerical implementation of this algorithm are given in Appendix C.

The next proposition gives a sufficient condition on to guarantee the convergence of this greedy algorithm. The proof is given in Appendix A.3.

###### Proposition 3.

Let be a sequence of unitary matrices. For any , we let be a lazy map that minimizes , where . If there exists such that for any

 DKL((Uℓ⊤r)♯πℓ−1||ρr)≥tsupU∈Rd×ds.t.UU⊤=IdDKL((U⊤r)♯πℓ−1||ρr), (7)

then converges weakly to .

Let us comment on the condition (7). Recall that the unitary matrix that maximizes is optimal; see (5). By (7), the case means that is optimal at each iteration. This corresponds to an ideal greedy algorithm. The case allows suboptimal choices for without losing the convergence property of the algorithm. Such a weak greedy algorithm converges even with a potentially crude selection of that corresponds to a close to zero. This is why a crude approximation to is expected to be sufficient; see Section 4. Following Temlyakov2011 , one can futher relax the condition (7) by replacing with a non-constant parameter which can go to zero “not too quickly” with respect to . This development is left for future work. Finally let us note that Proposition 3 does not require any constraints on , so we have convergence even with . Of course, larger yields faster convergence; see Section 4.

## 4 Numerical examples

We apply Algorithm 2 on a set of increasingly challenging inference problems. We monitor the convergence of the algorithm in two ways: (1) via the upper bound on the error , where is replaced with its estimate at step ; and (2) via the variance diagnostic which is an asymptotic approximation of as (see el2012bayesian ). This latter quantity is easily computed, needing only samples from , but is an approximation of the KL divergence in the reverse direction. Nevertheless, for the proplems tackled here, we will show empirically that this quantity is also bounded by in the spirit of Proposition 2. As a qualitative measure of performance, we plot 2D conditionals (“slices”) of which, at convergence, should be close to . Finally, we will quantify the remaining bias in by reporting the acceptance rate and the effective sample size Wolff2004 of correlated samples from generated with MCMC algorithms—Metropolis-Hastings with a independence proposal brooks2011handbook ; robert2013monte (MH-ind) or with a preconditioned Crank-Nicolson proposal Cotter2013 (MH-pCN). A chain generated with this procedure can be used to obtain a chain with stationary distribution , by applying the transformation . This can be understood as a way of (asymptotically) de-biasing the variational approximation, or conversely as a way of preconditioning MCMC parno2014transport .

The algorithm is implemented333 The code can be retrieved at http://bit.ly/2QlelXF. Data for Sec. 4.2 is at http://bit.ly/2HytQc0. Data for Sec. 4.3 is at http://bit.ly/2X09Ns8. Data for Sec. 4.4 is at http://bit.ly/2Eug5ZR. in the TransportMaps framework transportmaps . The examples in Sections 4.1 and 4.2 have analytical likelihoods, while those in Sections 4.3 and 4.4

have likelihoods that depend on the solutions of partial differential equations (PDEs). The PDEs are discretized and solved using the packages

FEniCS LoggWells2010a and dolfin-adjoint Farrell2013 . The results in Sections 4.1, 4.2 and 4.4 are produced using one Intel® Core i7-5930K CPU @ 3.50GHz. The results in Section 4.3 are obtained using 10 cores of the same type.

### 4.1 Lazy map construction step-by-step

We first apply the algorithm on the standard problem of approximating the rotated banana distribution defined by and , and where is the linear map associated with a random rotation. Despite this being a two-dimensional problem that could be characterized by a degree-two lower triangular (full) map in a single step, we will restrict ourselves to the use of a composition of rank-1 lazy maps of degree 3 (i.e., diagonal maps with nonlinear first component). We use Gauss quadrature rules of order 10 for the discretization of the KL divergence and the approximation of ( in Algorithm 3 and 5).

Figure 0(b) shows the target distribution . Figure 0(a) shows the convergence rate of the algorithm both in terms of residual power and in terms of variance diagnostic. After two iterations the algorithm has explored all the direction of , leading to a fast improvement. The convergence stagnates once the trade-off between the complexity of and and the accuracy of the quadrature has been saturated. Figures 0(c)–fig:ex:banana:pullback-8 show the progressive Gaussianization of the “residual” distributions for different iterations . To further confirm the quality of the approximation we generate an MCMC chain of length , using MH-ind. The reported acceptance rate is with the worst effective sample size being of the total chain.

### 4.2 Log-Gaussian Cox process with sparse observations

We consider an inference problem in spatial statistics for a log-Gaussian Cox point process on a square domain . This type of stochastic process is frequently used to model spatially aggregated point patterns moller1998 ; Christensen2005 ; rue2009approximate ; girolami2011riemann . Following a configuration similar to Christensen2005 ; moller1998 , we discretize into a uniform grid, and denote by the center of the th cell, for , with . We consider a discrete stochastic process , where denotes the number of occurrences/points in the th cell. Each

is modeled as a Poisson random variable with mean

, where is a Gaussian process with covariance and mean , for all . We consider the following values for the parameters: , , and . The are assumed conditionally independent given the (latent) Gaussian field. For interpretability reasons, we also define the intensity process as , for .

The goal of this problem is to infer the posterior distribution of the latent process given few sparse realizations of at spatial locations shown in Figure 1(a). We denote by a realization of obtained by sampling the latent Gaussian field according to its marginal distribution. Our target distribution is then: .

Since the posterior is nearly Gaussian, we will run three experiements with linear lazy maps and ranks . For the three experiments, the KL-divergence minimized for each lazy layer and the estimators of are discretized with Monte Carlo samples respectively.

Figures 1(b)–fig:ex:LogCox:posterior-realizations show the expectation and few realizations of the posterior, confirming the data provides some valuable information to recover the field . Figures 1(d)–fig:ex:LogCox:convergence-cost show the convergence rate and the cost of the algorithm as new layers of lazy maps are added to . As we expect, the use of maps with higher ranks leads to faster convergence. On the other hand the computational cost per step increases—also due to the fact that we increase the sample size as the rank increases. Figure 1(f) reveals the spirit of the algorithm: each lazy map trims away power from the top of the spectrum of , which slowly flattens and decreases. To additionally confirm the quality of for lazy maps with rank , we generate an MCMC chain of length using MH-ind. The reported acceptance rate is with the worst effective sample size being of the total chain.

### 4.3 Identification of the diffusion coefficient in an elliptic equation

We consider the problem of estimating the diffusion coefficient of an elliptic partial differential equation from sparse observations of the field solving

 {∇⋅(eκ(x)∇u(x))=0,  for x∈D\coloneqq[0,1]2,u(x)=0 for x1=0, u(x)=1 for x1=1, ∂u(x)∂n=0 for x2∈{0,1}. (8)

This PDE is discretized using finite elements over a uniform mesh of nodes, leading to degrees of freedom. We denote by the discretized versions of the log-diffusion coefficient over this mesh.

Let be the map from the parameter to uniformily distributetd values of collected at the locations shown in Figure 2(a). Observations follow the model , where and . The coefficient is endowed with a Gaussian prior where is the covariance of an Ornstein–Uhlenbeck process. For the observations associated to the parameter shown in Figure 2(a), our target distribution is , where .

We discretize expectations appearing in the algorithm with Monte Carlo samples. In order not to waste work in the early iterations we use linear maps of rank for iterations . Then we switch to maps of degree and rank for the remaining iterations. The algorithm is terminated when it stagnates due the lack of precision provided by the samples; see Figure 2(b). Randomly drawn realizations of in Figure 2(c) resemble the generating field. We confirm our results generating an MCMC chain of length with MH-pCN and stepsize parameter . The reported acceptance rate is with the worst, best and average effective sample sizes being , and of the complete chain. For comparison, a direct application of MH-pCN with same leads to an acceptance rate . More details are provided in Appendix D.1.

### 4.4 Estimation of the Young’s modulus of a cantilever beam

Here we consider the problem of estimating the Young’s modulus of an inhomogeneous cantilever beam, i.e., a beam clamped on one side () and free on the other (). The beam has a length of , a width of and a thickness of . Using Timoshenko’s beam theory, the displacement of the beam under the load is modeled by the coupled PDEs

 ⎧⎪⎨⎪⎩ddx[E(x)2(1+ν)(φ(x)−ddxw(x))]=q(x)κA,ddx(E(x)Iddxφ(x))=κAE(x)2(1+ν)(φ(x)−ddxw(x)), (9)

where , , , and are parameters described in Appendix D.2. We consider a beam composed of segments each of length made of different kinds of steel, with Young’s moduli respectively, and we run the virtual experiment of applying a point mass of at the tip of the beam. Observations of the displacement are gathered at the locations shown in Figure 3(a) with a measurement noise of . We endow with the prior and our goal is to characterize the posterior distribution .

The algorithm is run with lazy maps of order and rank . The expectations appearing in the algorithms are approximated using samples of size from . Figure 4 summarizes the results. We further confirm these results generating an MCMC chain of length using MH-ind. The reported acceptance rate is with the worst, best, and average effective sample sizes being , , and of the complete chain. Further details regarding the problem setting and additional results are reported in Appendix D.2.

### Acknowledgments

This work was supported in part by the US Department of Energy, Office of Advanced Scientific Computing Research, AEOLUS (Advances in Experimental Design, Optimal Control, and Learning for Uncertain Complex Systems) project. The authors also gratefully acknowledge support from the Inria associate team UNQUESTIONABLE.

## References

• [1] D. Bigoni. TransportMaps. http://transportmaps.mit.edu/, 2016–2019.
• [2] D. Bigoni, A. Spantini, and Y. Marzouk. Adaptive construction of measure transports for Bayesian inference. NIPS workshop on Approximate Inference, 2016.
• [3] V. I. Bogachev, A. V. Kolesnikov, and K. V. Medvedev. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309, 2005.
• [4] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng.

Handbook of Markov Chain Monte Carlo

.
CRC press, 2011.
• [5] C. G. Broyden. The Convergence of a Class of Double Rank Minimization Algorithms. Part {II}. J. Inst. Math. Appl., 6:222, 1970.
• [6] G. Carlier, A. Galichon, and F. Santambrogio. From Knothe’s transport to Brenier’s map and a continuation method for optimal transport. SIAM Journal on Mathematical Analysis, 41(6):2554–2576, 2010.
• [7] P. Chen, K. Wu, J. Chen, T. O’Leary-Roseberry, and O. Ghattas. Projected Stein Variational Newton: A Fast and Scalable Bayesian Inference Method in High Dimensions. arXiv e-prints, Jan. 2019.
• [8] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural Ordinary Differential Equations. NeurIPS, 2018.
• [9] O. F. Christensen, G. O. Roberts, and J. S. Rosenthal. Scaling limits for the transient phase of local Metropolis-Hastings algorithms. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 67(2):253–268, 2005.
• [10] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC Methods for Functions: Modifying Old Algorithms to Make Them Faster. Statistical Science, 28(3):424–446, 2013.
• [11] T. Cui, J. Martin, Y. Marzouk, A. Solonen, and A. Spantini. Likelihood-informed dimension reduction for nonlinear inverse problems. Inverse Problems, 30(11):114015, 2014.
• [12] G. Detommaso, T. Cui, A. Spantini, Y. Marzouk, and R. Scheichl. A Stein variational Newton method. NeurIPS, 2018.
• [13] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real NVP. arXiv:1605.08803, 2016.
• [14] E. Dupont, A. Doucet, and Y. W. Teh. Augmented Neural ODEs. arXiv preprint arXiv:1904.01681, 2019.
• [15] P. E. Farrell, D. A. Ham, S. W. Funke, and M. E. Rognes. Automated Derivation of the Adjoint of High-Level Transient Finite Element Programs. SIAM Journal on Scientific Computing, 35(4):C369–C393, jan 2013.
• [16] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
• [17] G. H. Golub and J. H. Welsch. Calculation of Gauss quadrature rules. Mathematics of Computation, 23(106):221–221, may 1969.
• [18] P. J. Huber. Projection pursuit. The Annals of Statistics, pages 435–475, 1985.
• [19] P. Jaini, K. A. Selby, and Y. Yu. Sum-of-Squares Polynomial Flow. arXiv:1905.02325, 2019.
• [20] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4743–4751. Curran Associates, Inc., 2016.
• [21] H. Knothe et al. Contributions to the theory of convex bodies. The Michigan Mathematical Journal, 4(1):39–52, 1957.
• [22] V. Laparra, G. Camps-Valls, and J. Malo. Iterative Gaussianization: from ICA to random rotations.

IEEE Transactions on Neural Networks

, 22(4):537–549, 2011.
• [23] Q. Liu. Stein variational gradient descent as gradient flow. In Advances in neural information processing systems, pages 3115–3123, 2017.
• [24] Q. Liu and D. Wang. Stein Variational Gradient Descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2370–2378, 2016.
• [25] A. Logg and G. N. Wells. Dolfin: Automated finite element computing. ACM Transactions on Mathematical Software, 37(2), 2010.
• [26] Y. Marzouk, T. Moselhy, M. Parno, and A. Spantini. Sampling via measure transport: An introduction. In Handbook of Uncertainty Quantification, R. Ghanem, D. Higdon, and H. Owhadi, editors. Springer, 2016.
• [27] J. Møller, A. R. Syversveen, and R. Waagepetersen. Log Gaussian Cox Processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.
• [28] T. Moselhy and Y. Marzouk. Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815–7850, 2012.
• [29] M. Parno and Y. M. Marzouk. Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682, 2018.
• [30] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv:1505.05770, 2015.
• [31] C. Robert and G. Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2013.
• [32] M. Rosenblatt. Remarks on a multivariate transformation. The Annals of Mathematical Statistics, pages 470–472, 1952.
• [33] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B, 71(2):319–392, 2009.
• [34] F. Santambrogio. Optimal Transport for Applied Mathematicians, volume 87. Springer, 2015.
• [35] S. Smolyak.

Quadrature and interpolation formulas for tensor products of certain classes of functions.

• [36] A. Spantini, D. Bigoni, and Y. Marzouk. Inference via low-dimensional couplings.

The Journal of Machine Learning Research

, 19(1):2639–2709, 2018.
• [37] E. Tabak and C. V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.
• [38] E. G. Tabak, E. Vanden-Eijnden, et al. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.
• [39] V. N. Temlyakov. Greedy approximation. Acta Numerica, 17:235–409, 2008.
• [40] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
• [41] U. Wolff. Monte Carlo errors with less errors. Computer Physics Communications, 156(2):143–153, 2004.
• [42] O. Zahm, T. Cui, K. Law, A. Spantini, and Y. Marzouk. Certified dimension reduction in nonlinear Bayesian inverse problems. arXiv preprint arXiv:1807.03712, 2018.

## Appendix A Proofs

### a.1 Proof of Proposition 1

We first show that for any there exists a such that (3) holds. Let . Because is a diffeomorphism we have . The inverse of is given by

 T−1(x)=(τ−1(UTrx)UT⊥x),

so that and, because , we have . This yields (3) with .

Now we show that for any function there exists a lazy map such that (3) holds. Let . Denote by (resp. ) the density of the standard normal distribution on (resp. ). Let be a map that pushes forward to , where is the probability density on defined by . Such a map always exists because the support of (and of ) is , see [40]. Consider the map defined by

 Q(z)=(τ(z1,…,zr)z⊥).

Because , we have . Finally, the lazy map satisfies . This concludes the proof.

### a.2 Proof of Proposition 2

Corollary 1 in [42] allows us to write with and , see [42, Example 1]. Because is the matrix containing the first eigenvectors of we have .

### a.3 Proof of Proposition 3

Let . Replacing by in (5) allows writing so that

 DKL(πℓ||ρ)=DKL(π||ρ)−ℓ−1∑k=1Rk.

In particular converges to 0 and, because of (7), we have

 supU∈Rd×ds.t.UUT=IdDKL((UTr)♯πℓ−1||ρr)⟶ℓ→∞0.

By Proposition 14.2 in [18], converges weakly to . Then converges weakly to .

## Appendix B Triangular maps

There may exist many (indeed an infinity) of transport maps able to couple two arbitrary probability distributions. In optimal transport

[40], a map is identified by minimizing some expected transportation cost, which represents the effort required to move mass from one location to another. In this work, we take an alternative approach and consider lower triangular transports,

 T(x)=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣T(1)(x1)T(2)(x1,x2)⋮T(d)(x1,…,xd)⎤⎥ ⎥ ⎥ ⎥ ⎥⎦ (10)

where each component is monotonically increasing with respect to . We will identify these transports with the set . For any two distributions and on that admit densities with respect to Lebesgue measure (also denoted by and , respectively) there exists a unique transport such that . This transport is known as the Knothe–Rosenblatt (KR) rearrangement [21, 32, 6, 3]. Because is invertible, the density of the pullback measure is given by , where is defined by . We stress here that is a definition, and does not need to be differentiable. In fact, inherits the same regularity as and , but no more [3, 34].

We consider semi-parametric approximations to maps in , i.e., we will consider the set of maps defined by

 T(i)[ci,hi](x):=c[ci](x1:i−1)+∫xi0(h[hi](x1:i−1,t))2dt, (11)

where , and are in the span of a polynomial basis. The KR rearrangement between and can be obtained as the unique minimizer of . An approximation to it can be obtained by restricting the minimization over the set [28, 26, 2].

## Appendix C Numerical algorithms

Algorithm 3 assembles the numerical approximation of . In the following we will chose to be the reference distribution , leading to being a biased approximation of . In the examples presented the number of samples used for the approximation of is chosen to be the same used for Algorithm 5, even though the type of quadrature may differ—e.g., Monte Carlo, Gauss quadrature[17], sparse grids [35], etc. This way the extra cost of computing is equivalent to one gradient evaluation in the quasi-Newton method.

Algorithm 4 computes the eigenvectors satisfying Proposition 2 and discerns between the subspace of relevant directions and its orthogonal complement .

Algorithm 5 outlines the numerical solution of the variational problem

In the sake of simplicity we fix the complexity of the semi-parametric map described in Appendix B and the sample size used in the discretization of the KL divergence. Alternatively one could adaptively increase the complexity and the sample size to match a prescribed tolerance, following the procedure described in [2]. For all the examples presented in this work, the minimization problem is solved via the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton method [5]. One could switch to a full Newton method if the Hessian of

or its action on a vector are available..

Algorithms 6 and 7 are the numerical counterparts of Algorithms 1 and 2 respectively.