1 Introduction
Inference in the Bayesian setting typically requires the computation of integrals over an intractable posterior distribution, whose density^{1}^{1}1In this paper, we only consider distributions that are absolutely continuous with respect to the Lebesgue measure on , and thus will use the notation to denote both the distribution and its associated density. is known up to a normalizing constant. A useful approach to this problem involves constructing a deterministic nonlinear transformation, i.e., a transport map villani2008optimal , that induces a coupling of with a tractable distribution (e.g., a standard Gaussian). More formally, we seek a map that pushes forward to , written as , such that the change of variables makes integration tractable.
Many constructions for such maps have been developed in recent years. Triangular maps bogachev2005triangular ; rosenblatt1952remarks ; knothe1957contributions are sufficiently general to couple any absolutely continuous pair of distributions , and their numerical approximations have been investigated in el2012bayesian ; marzouk2016introduction ; spantini2018inference ; Jaini2019 . Other constructions write as a composition of simpler parameterized functions: these include normalizing flows rezende2015variational ; tabak2013family ; dinh2016density , autoregressive flows Kingma2016
, and neural ordinary differential equations
Chen2018 ; Dupont2019 . Stein variational methods liu2016stein ; liu2017stein ; Detommaso2018 provide a nonparametric way of constructingas a composition of functions lying in a chosen RKHS. Some of these methods have actually been proposed for density estimation (Gaussianizing a distribution from which one has drawn samples, via the action of a map
laparra2011iterative ; tabak2010density ), while others have been developed for variational inference (approximating by , typically in the sense of Kullback–Leibler (KL) divergence), but these two problems are tightly linked as long as is invertible.Many of the methods mentioned above (with exceptions, e.g., Stein methods) are allatonce: given some parameterized ansatz for the map, they solve a large and potentially challenging optimization problem for all the parameters simultaneously. More generally, it is difficult to represent expressive maps in high dimensions: parameterized triangular maps on , for example, must describe
variate functions and thus immediately encounter the curse of dimensionality. Other flows involve particular ansatzes that limit their generality. Similarly, kernelbased methods lose expressiveness in high dimensions
Detommaso2018 ; Chen2019 . Yet one of the key advantages of transport is the potential for iterative approximation using the pullback distribution: any invertible/bijective map preserves information about the target distribution , and hence recursive—even greedy—approaches, applied to the pullback of through imperfect maps, could be effective. In this paper we exploit the interplay between such approximations and lowdimensional structure.We propose a new greedy framework for variational inference, based on compositions of lowdimensional maps. The maps are lowdimensional in that each acts nontrivially only on a lowdimensional subspace of the parameter space; thus they are termed “lazy.” We identify the maps and subspaces sequentially, by greedily minimizing a bound on the KL divergence between and its approximation, given constraints on the computational effort at each step. The bound itself follows from logarithmic Sobolev inequalites (see zahm2018certified ) relating the KL divergence to gradients of the target density. We prove weak convergence of this approach to the target distribution, given even suboptimal choices of the subspace. Within this framework, there is considerable flexibility. The complexity of the map, and hence the computational effort, can be tailored at each iteration. Moreover, each lazy map can be approximated by any invertible transformation; while here we do so using lazy triangular maps, one could instead use lazy maps obtained via Stein variational methods or normalizing flows.
We also emphasize that our overall approach does not require any global low dimensionality in the likelihood function (as in cui2014likelihood ; Chen2019 , for example); rather, it can converge even for maps that act on a single direction at a time, despite and differing in all directions.
Preliminaries.
We will consider target distributions with densities on that are differentiable almost everywhere and that can be evaluated up to a normalizing constant. Such a target will often be the posterior of a Bayesian inference problem, e.g.,
(1) 
where is the likelihood function and is the prior. will denote the standard Gaussian density on . We will consider maps that are diffeomorphisms,^{2}^{2}2In general does not need to be a diffeomorphism, but only a particular invertible map; see Appendix B for more details. The distributions we will consider in this paper, however, fulfill the necessary conditions for to be differentiable almost everywhere. and with some abuse of notation, we will write the pushforward density of under as . We will frequently also use the notion of a pullback distribution or density, written as .
Our goal is to find a map that satisfies for some prescribed tolerance . In Section 2 we show how to build a single map in the “lazy” format described above. However, the map complexity needed to satisfy the prescribed tolerance might be large if is too small. We thus show in Section 3 how to improve our approximations by greedily composing many lazy maps.
2 Lazy maps
Given a unitary matrix
and an integer , let be the set that contains all the maps of the form(2) 
for some diffeomorphism . Here and are the matrices containing respectively the first and the last columns of , and . Any map is called a lazy map with rank bounded by , as it is nonlinear only with respect to the first inputs variables and the nonlinearity is contained in the lowdimensional subspace . The next proposition gives a characterization of all the densities when .
Proposition 1 (Characterization of lazy maps).
Let be a unitary matrix and let . Then for any lazy map , there exists a stricly positive function such that
(3) 
for all where
is the density of the standard normal distribution. Conversely, any probability density function of the form
for some admits a representation as in (3) for some .The proof is given in Appendix A.1. In particular, applying Proposition 1 to (1): if the prior distribution is and if the likelihood function is of the form of , then there exists a lazy map that exactly recovers the posterior density . Following (zahm2018certified, , Section 2.1), we have
(4) 
where is such that and is defined by the conditional expectation with . Because , Equation (4) shows that is a minimizer of over . Following huber1985projection , one can show
(5) 
where is the density of the standard normal distribution on and is the marginal density of with respect to , meaning the density of with . Equation (5) shows that, for fixed , minimizing over is the same as finding the most nonGaussian marginal . Such an optimal can be difficult to find in practice. The next proposition gives a useful bound on , which we will use to construct a suboptimal . Proof is given in Appendix A.2.
Proposition 2.
Let be the
th eigenpair of the eigenvalue problem
where . Letbe the matrix containing the eigenvectors of
. Then for any we have(6) 
Proposition 2 suggests constructing as the matrix containing the eigenvectors of and choosing to be the smallest integer such that the lefthand side of (6) is below . A fast decay in the spectrum of ensures reaching the desired tolerance with . In practice, however, this might not be the case either because is too small or because the spectrum of does not decay quickly enough. Since the complexity for constructing the lazy map strongly depends on , we bound by some that corresponds to the maximal computational budget one can afford in constructing . This procedure is summarized in Algorithm 1.
The practical implementation of Algorithm 1 relies on the discretization of steps 2 (i.e., the computation of ) and 5 (i.e., minimizing the KL divergence). In step 2, direct Monte Carlo estimation of requires generating samples from , which is not feasible in practice. Following zahm2018certified , one could use the importance sampling estimate , where are i.i.d. samples from and where
are selfnormalized weights. Such an estimate might have a huge variance when
is a poor approximation to the target . In this case it is preferable to impose , which yields a more biased estimate but with lower variance. As shown in Section 4 (see also the error bounds in (zahm2018certified, , Sec. 3.3.2)), such an estimate still provides relevant information regarding the construction of , and improves via the iterations described in Section 3.At step 5 of Algorithm 1, we need to find the map such that minimizes . Following the ideas in transportmaps ; bigoni2016nips ; spantini2018inference ; Jaini2019 , we seek in a parametric family of monotone triangular maps; details on this construction are given in Appendix B. As noted in spantini2018inference , minimizing over this class of parametric maps can be challenging because it is difficult to build good estimators for when we cannot sample directly from . In our setting we can only evaluate up to a normalizing constant, and thus it is expedient to minimize the reverse KL divergence , as is typical in variational Bayesian methods—which here can be approximated using Monte Carlo sampling or quadrature from the tractable . Details on the numerical implementation of Algorithm 1 are given in Appendix C. Though reversing the KL divergence in this step changes the approximation, we will show numerically that this computational strategy yields good results, ultimately controlling the KL divergence in both directions.
3 Layers of lazy maps
The restriction in Algorithm 1 allows control over the computational cost of constructing the lazy map, but might not be satisfied. To compensate for this lack of accuracy while preserving low complexity, we propose to build a map as a composition of lazy maps
where each is a lazy map associated with a different unitary matrix . For simplicity we fix the same rank for each .
In general, the composition of lazy maps is not itself a lazy map. For example, there exists such that might depend nonlinearly on each input variable and so it cannot be written as in (2). Notice also that there is no chance in general to write as a product ; otherwise, by Proposition 1, would be a lazy map with rank bounded by . As a consequence, approximations of the form of are different than those constructed by projection pursuit huber1985projection .
We build the lazy maps in a greedy way. After iterations, the composition of maps have been constructed. We seek a unitary matrix and a lazy map such that is a better approximation to than . Let
be the pullback density of through . Notice that so we can build using Algorithm 1 by replacing by and by . Once this is done, we update the map and the residual density and we move to the next iteration . We stop the iteration process when the error falls below . Proposition 2 applied with and allows writing , where
Hence, we can use to stop the iteration. This greedy algorithm is summarized in Algorithm 2. Details on the numerical implementation of this algorithm are given in Appendix C.
The next proposition gives a sufficient condition on to guarantee the convergence of this greedy algorithm. The proof is given in Appendix A.3.
Proposition 3.
Let be a sequence of unitary matrices. For any , we let be a lazy map that minimizes , where . If there exists such that for any
(7) 
then converges weakly to .
Let us comment on the condition (7). Recall that the unitary matrix that maximizes is optimal; see (5). By (7), the case means that is optimal at each iteration. This corresponds to an ideal greedy algorithm. The case allows suboptimal choices for without losing the convergence property of the algorithm. Such a weak greedy algorithm converges even with a potentially crude selection of that corresponds to a close to zero. This is why a crude approximation to is expected to be sufficient; see Section 4. Following Temlyakov2011 , one can futher relax the condition (7) by replacing with a nonconstant parameter which can go to zero “not too quickly” with respect to . This development is left for future work. Finally let us note that Proposition 3 does not require any constraints on , so we have convergence even with . Of course, larger yields faster convergence; see Section 4.
4 Numerical examples
We apply Algorithm 2 on a set of increasingly challenging inference problems. We monitor the convergence of the algorithm in two ways: (1) via the upper bound on the error , where is replaced with its estimate at step ; and (2) via the variance diagnostic which is an asymptotic approximation of as (see el2012bayesian ). This latter quantity is easily computed, needing only samples from , but is an approximation of the KL divergence in the reverse direction. Nevertheless, for the proplems tackled here, we will show empirically that this quantity is also bounded by in the spirit of Proposition 2. As a qualitative measure of performance, we plot 2D conditionals (“slices”) of which, at convergence, should be close to . Finally, we will quantify the remaining bias in by reporting the acceptance rate and the effective sample size Wolff2004 of correlated samples from generated with MCMC algorithms—MetropolisHastings with a independence proposal brooks2011handbook ; robert2013monte (MHind) or with a preconditioned CrankNicolson proposal Cotter2013 (MHpCN). A chain generated with this procedure can be used to obtain a chain with stationary distribution , by applying the transformation . This can be understood as a way of (asymptotically) debiasing the variational approximation, or conversely as a way of preconditioning MCMC parno2014transport .
The algorithm is implemented^{3}^{3}3 The code can be retrieved at http://bit.ly/2QlelXF. Data for Sec. 4.2 is at http://bit.ly/2HytQc0. Data for Sec. 4.3 is at http://bit.ly/2X09Ns8. Data for Sec. 4.4 is at http://bit.ly/2Eug5ZR. in the TransportMaps framework transportmaps . The examples in Sections 4.1 and 4.2 have analytical likelihoods, while those in Sections 4.3 and 4.4
have likelihoods that depend on the solutions of partial differential equations (PDEs). The PDEs are discretized and solved using the packages
FEniCS LoggWells2010a and dolfinadjoint Farrell2013 . The results in Sections 4.1, 4.2 and 4.4 are produced using one Intel® Core™ i75930K CPU @ 3.50GHz. The results in Section 4.3 are obtained using 10 cores of the same type.4.1 Lazy map construction stepbystep
We first apply the algorithm on the standard problem of approximating the rotated banana distribution defined by and , and where is the linear map associated with a random rotation. Despite this being a twodimensional problem that could be characterized by a degreetwo lower triangular (full) map in a single step, we will restrict ourselves to the use of a composition of rank1 lazy maps of degree 3 (i.e., diagonal maps with nonlinear first component). We use Gauss quadrature rules of order 10 for the discretization of the KL divergence and the approximation of ( in Algorithm 3 and 5).
Figure 0(b) shows the target distribution . Figure 0(a) shows the convergence rate of the algorithm both in terms of residual power and in terms of variance diagnostic. After two iterations the algorithm has explored all the direction of , leading to a fast improvement. The convergence stagnates once the tradeoff between the complexity of and and the accuracy of the quadrature has been saturated. Figures 0(c)–fig:ex:banana:pullback8 show the progressive Gaussianization of the “residual” distributions for different iterations . To further confirm the quality of the approximation we generate an MCMC chain of length , using MHind. The reported acceptance rate is with the worst effective sample size being of the total chain.






4.2 LogGaussian Cox process with sparse observations
We consider an inference problem in spatial statistics for a logGaussian Cox point process on a square domain . This type of stochastic process is frequently used to model spatially aggregated point patterns moller1998 ; Christensen2005 ; rue2009approximate ; girolami2011riemann . Following a configuration similar to Christensen2005 ; moller1998 , we discretize into a uniform grid, and denote by the center of the th cell, for , with . We consider a discrete stochastic process , where denotes the number of occurrences/points in the th cell. Each
is modeled as a Poisson random variable with mean
, where is a Gaussian process with covariance and mean , for all . We consider the following values for the parameters: , , and . The are assumed conditionally independent given the (latent) Gaussian field. For interpretability reasons, we also define the intensity process as , for .The goal of this problem is to infer the posterior distribution of the latent process given few sparse realizations of at spatial locations shown in Figure 1(a). We denote by a realization of obtained by sampling the latent Gaussian field according to its marginal distribution. Our target distribution is then: .
Since the posterior is nearly Gaussian, we will run three experiements with linear lazy maps and ranks . For the three experiments, the KLdivergence minimized for each lazy layer and the estimators of are discretized with Monte Carlo samples respectively.
Figures 1(b)–fig:ex:LogCox:posteriorrealizations show the expectation and few realizations of the posterior, confirming the data provides some valuable information to recover the field . Figures 1(d)–fig:ex:LogCox:convergencecost show the convergence rate and the cost of the algorithm as new layers of lazy maps are added to . As we expect, the use of maps with higher ranks leads to faster convergence. On the other hand the computational cost per step increases—also due to the fact that we increase the sample size as the rank increases. Figure 1(f) reveals the spirit of the algorithm: each lazy map trims away power from the top of the spectrum of , which slowly flattens and decreases. To additionally confirm the quality of for lazy maps with rank , we generate an MCMC chain of length using MHind. The reported acceptance rate is with the worst effective sample size being of the total chain.
4.3 Identification of the diffusion coefficient in an elliptic equation



We consider the problem of estimating the diffusion coefficient of an elliptic partial differential equation from sparse observations of the field solving
(8) 
This PDE is discretized using finite elements over a uniform mesh of nodes, leading to degrees of freedom. We denote by the discretized versions of the logdiffusion coefficient over this mesh.
Let be the map from the parameter to uniformily distributetd values of collected at the locations shown in Figure 2(a). Observations follow the model , where and . The coefficient is endowed with a Gaussian prior where is the covariance of an Ornstein–Uhlenbeck process. For the observations associated to the parameter shown in Figure 2(a), our target distribution is , where .
We discretize expectations appearing in the algorithm with Monte Carlo samples. In order not to waste work in the early iterations we use linear maps of rank for iterations . Then we switch to maps of degree and rank for the remaining iterations. The algorithm is terminated when it stagnates due the lack of precision provided by the samples; see Figure 2(b). Randomly drawn realizations of in Figure 2(c) resemble the generating field. We confirm our results generating an MCMC chain of length with MHpCN and stepsize parameter . The reported acceptance rate is with the worst, best and average effective sample sizes being , and of the complete chain. For comparison, a direct application of MHpCN with same leads to an acceptance rate . More details are provided in Appendix D.1.
4.4 Estimation of the Young’s modulus of a cantilever beam



Here we consider the problem of estimating the Young’s modulus of an inhomogeneous cantilever beam, i.e., a beam clamped on one side () and free on the other (). The beam has a length of , a width of and a thickness of . Using Timoshenko’s beam theory, the displacement of the beam under the load is modeled by the coupled PDEs
(9) 
where , , , and are parameters described in Appendix D.2. We consider a beam composed of segments each of length made of different kinds of steel, with Young’s moduli respectively, and we run the virtual experiment of applying a point mass of at the tip of the beam. Observations of the displacement are gathered at the locations shown in Figure 3(a) with a measurement noise of . We endow with the prior and our goal is to characterize the posterior distribution .
The algorithm is run with lazy maps of order and rank . The expectations appearing in the algorithms are approximated using samples of size from . Figure 4 summarizes the results. We further confirm these results generating an MCMC chain of length using MHind. The reported acceptance rate is with the worst, best, and average effective sample sizes being , , and of the complete chain. Further details regarding the problem setting and additional results are reported in Appendix D.2.
Acknowledgments
This work was supported in part by the US Department of Energy, Office of Advanced Scientific Computing Research, AEOLUS (Advances in Experimental Design, Optimal Control, and Learning for Uncertain Complex Systems) project. The authors also gratefully acknowledge support from the Inria associate team UNQUESTIONABLE.
References
 [1] D. Bigoni. TransportMaps. http://transportmaps.mit.edu/, 2016–2019.
 [2] D. Bigoni, A. Spantini, and Y. Marzouk. Adaptive construction of measure transports for Bayesian inference. NIPS workshop on Approximate Inference, 2016.
 [3] V. I. Bogachev, A. V. Kolesnikov, and K. V. Medvedev. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309, 2005.

[4]
S. Brooks, A. Gelman, G. Jones, and X.L. Meng.
Handbook of Markov Chain Monte Carlo
. CRC press, 2011.  [5] C. G. Broyden. The Convergence of a Class of Double Rank Minimization Algorithms. Part {II}. J. Inst. Math. Appl., 6:222, 1970.
 [6] G. Carlier, A. Galichon, and F. Santambrogio. From Knothe’s transport to Brenier’s map and a continuation method for optimal transport. SIAM Journal on Mathematical Analysis, 41(6):2554–2576, 2010.
 [7] P. Chen, K. Wu, J. Chen, T. O’LearyRoseberry, and O. Ghattas. Projected Stein Variational Newton: A Fast and Scalable Bayesian Inference Method in High Dimensions. arXiv eprints, Jan. 2019.
 [8] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural Ordinary Differential Equations. NeurIPS, 2018.
 [9] O. F. Christensen, G. O. Roberts, and J. S. Rosenthal. Scaling limits for the transient phase of local MetropolisHastings algorithms. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 67(2):253–268, 2005.
 [10] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC Methods for Functions: Modifying Old Algorithms to Make Them Faster. Statistical Science, 28(3):424–446, 2013.
 [11] T. Cui, J. Martin, Y. Marzouk, A. Solonen, and A. Spantini. Likelihoodinformed dimension reduction for nonlinear inverse problems. Inverse Problems, 30(11):114015, 2014.
 [12] G. Detommaso, T. Cui, A. Spantini, Y. Marzouk, and R. Scheichl. A Stein variational Newton method. NeurIPS, 2018.
 [13] L. Dinh, J. SohlDickstein, and S. Bengio. Density estimation using real NVP. arXiv:1605.08803, 2016.
 [14] E. Dupont, A. Doucet, and Y. W. Teh. Augmented Neural ODEs. arXiv preprint arXiv:1904.01681, 2019.
 [15] P. E. Farrell, D. A. Ham, S. W. Funke, and M. E. Rognes. Automated Derivation of the Adjoint of HighLevel Transient Finite Element Programs. SIAM Journal on Scientific Computing, 35(4):C369–C393, jan 2013.
 [16] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
 [17] G. H. Golub and J. H. Welsch. Calculation of Gauss quadrature rules. Mathematics of Computation, 23(106):221–221, may 1969.
 [18] P. J. Huber. Projection pursuit. The Annals of Statistics, pages 435–475, 1985.
 [19] P. Jaini, K. A. Selby, and Y. Yu. SumofSquares Polynomial Flow. arXiv:1905.02325, 2019.
 [20] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4743–4751. Curran Associates, Inc., 2016.
 [21] H. Knothe et al. Contributions to the theory of convex bodies. The Michigan Mathematical Journal, 4(1):39–52, 1957.

[22]
V. Laparra, G. CampsValls, and J. Malo.
Iterative Gaussianization: from ICA to random rotations.
IEEE Transactions on Neural Networks
, 22(4):537–549, 2011.  [23] Q. Liu. Stein variational gradient descent as gradient flow. In Advances in neural information processing systems, pages 3115–3123, 2017.
 [24] Q. Liu and D. Wang. Stein Variational Gradient Descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2370–2378, 2016.
 [25] A. Logg and G. N. Wells. Dolfin: Automated finite element computing. ACM Transactions on Mathematical Software, 37(2), 2010.
 [26] Y. Marzouk, T. Moselhy, M. Parno, and A. Spantini. Sampling via measure transport: An introduction. In Handbook of Uncertainty Quantification, R. Ghanem, D. Higdon, and H. Owhadi, editors. Springer, 2016.
 [27] J. Møller, A. R. Syversveen, and R. Waagepetersen. Log Gaussian Cox Processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.
 [28] T. Moselhy and Y. Marzouk. Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815–7850, 2012.
 [29] M. Parno and Y. M. Marzouk. Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682, 2018.
 [30] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv:1505.05770, 2015.
 [31] C. Robert and G. Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2013.
 [32] M. Rosenblatt. Remarks on a multivariate transformation. The Annals of Mathematical Statistics, pages 470–472, 1952.
 [33] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B, 71(2):319–392, 2009.
 [34] F. Santambrogio. Optimal Transport for Applied Mathematicians, volume 87. Springer, 2015.

[35]
S. Smolyak.
Quadrature and interpolation formulas for tensor products of certain classes of functions.
Dokl. Akad. Nauk SSSR, 1963. 
[36]
A. Spantini, D. Bigoni, and Y. Marzouk.
Inference via lowdimensional couplings.
The Journal of Machine Learning Research
, 19(1):2639–2709, 2018.  [37] E. Tabak and C. V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.
 [38] E. G. Tabak, E. VandenEijnden, et al. Density estimation by dual ascent of the loglikelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.
 [39] V. N. Temlyakov. Greedy approximation. Acta Numerica, 17:235–409, 2008.
 [40] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
 [41] U. Wolff. Monte Carlo errors with less errors. Computer Physics Communications, 156(2):143–153, 2004.
 [42] O. Zahm, T. Cui, K. Law, A. Spantini, and Y. Marzouk. Certified dimension reduction in nonlinear Bayesian inverse problems. arXiv preprint arXiv:1807.03712, 2018.
Appendix A Proofs
a.1 Proof of Proposition 1
We first show that for any there exists a such that (3) holds. Let . Because is a diffeomorphism we have . The inverse of is given by
so that and, because , we have . This yields (3) with .
Now we show that for any function there exists a lazy map such that (3) holds. Let . Denote by (resp. ) the density of the standard normal distribution on (resp. ). Let be a map that pushes forward to , where is the probability density on defined by . Such a map always exists because the support of (and of ) is , see [40]. Consider the map defined by
Because , we have . Finally, the lazy map satisfies . This concludes the proof.
a.2 Proof of Proposition 2
a.3 Proof of Proposition 3
Appendix B Triangular maps
There may exist many (indeed an infinity) of transport maps able to couple two arbitrary probability distributions. In optimal transport
[40], a map is identified by minimizing some expected transportation cost, which represents the effort required to move mass from one location to another. In this work, we take an alternative approach and consider lower triangular transports,(10) 
where each component is monotonically increasing with respect to . We will identify these transports with the set . For any two distributions and on that admit densities with respect to Lebesgue measure (also denoted by and , respectively) there exists a unique transport such that . This transport is known as the Knothe–Rosenblatt (KR) rearrangement [21, 32, 6, 3]. Because is invertible, the density of the pullback measure is given by , where is defined by . We stress here that is a definition, and does not need to be differentiable. In fact, inherits the same regularity as and , but no more [3, 34].
We consider semiparametric approximations to maps in , i.e., we will consider the set of maps defined by
(11) 
where , and are in the span of a polynomial basis. The KR rearrangement between and can be obtained as the unique minimizer of . An approximation to it can be obtained by restricting the minimization over the set [28, 26, 2].
Appendix C Numerical algorithms
Algorithm 3 assembles the numerical approximation of . In the following we will chose to be the reference distribution , leading to being a biased approximation of . In the examples presented the number of samples used for the approximation of is chosen to be the same used for Algorithm 5, even though the type of quadrature may differ—e.g., Monte Carlo, Gauss quadrature[17], sparse grids [35], etc. This way the extra cost of computing is equivalent to one gradient evaluation in the quasiNewton method.
Algorithm 4 computes the eigenvectors satisfying Proposition 2 and discerns between the subspace of relevant directions and its orthogonal complement .
Algorithm 5 outlines the numerical solution of the variational problem
(12) 
In the sake of simplicity we fix the complexity of the semiparametric map described in Appendix B and the sample size used in the discretization of the KL divergence. Alternatively one could adaptively increase the complexity and the sample size to match a prescribed tolerance, following the procedure described in [2]. For all the examples presented in this work, the minimization problem is solved via the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasiNewton method [5]. One could switch to a full Newton method if the Hessian of
or its action on a vector are available..