# A Distributed Framework for the Construction of Transport Maps

The need to reason about uncertainty in large, complex, and multi-modal datasets has become increasingly common across modern scientific environments. The ability to transform samples from one distribution P to another distribution Q enables the solution to many problems in machine learning (e.g. Bayesian inference, generative modeling) and has been actively pursued from theoretical, computational, and application perspectives across the fields of information theory, computer science, and biology. Performing such transformations, in general, still poses computational difficulties, especially in high dimensions. Here, we consider the problem of computing such "measure transport maps" with efficient and parallelizable methods. Under the mild assumptions that P need not be known but can be sampled from, that the density of Q is known up to a proportionality constant, and that Q is log-concave, we provide a convex optimization problem pertaining to relative entropy minimization. We show how an empirical minimization formulation and polynomial chaos map parameterization can allow for learning a transport map between P and Q with distributed and scalable methods. We also leverage findings from nonequilibrium thermodynamics to represent the transport map as a composition of simpler maps, each of which is learned sequentially with a transport-cost regularized version of the aforementioned problem formulation. We provide examples of our framework within the context of Bayesian inference for the Boston housing dataset, active learning for optimizing human-computer interfaces, density estimation for probabilistic sleep staging with EEG, and generative modeling for handwritten digit images from the MNIST dataset.


## 1 Introduction

While scientific problems of interest continue to grow in size and complexity, managing uncertainty is increasingly paramount. As a result, the development and use of theoretical and numerical methods to reason in the face of uncertainty, in a manner that can accommodate large datasets, has been the focus of sustained research efforts in statistics, machine learning, information theory and computer science. The ability to construct a mapping which transforms samples from one distribution to another distribution enables the solution to many problems in machine learning.

One such problem is Bayesian inference (Gelman et al., 2014; Bernardo & Smith, 2001; Sivia & Skilling, 2006), where a latent signal of interest is observed through noisy observations. Fully characterizing the posterior distribution is in general notoriously challenging, due to the need to calculate the normalization constant pertaining to the posterior density. Traditionally, point estimation procedures are used, which obviate the need for this calculation, despite their inability to quantify uncertainty. Generating samples from the posterior distribution enables approximation of any conditional expectation, but this is typically performed with Markov chain Monte Carlo (MCMC) methods (Gilks, 2005; Andrieu et al., 2003; Hastings, 1970; Geman & Geman, 1984; Liu, 2008), despite the following drawbacks: (a) the convergence rates and mixing times of the Markov chain are generally unknown, thus leading to practical shortcomings like "sample burn-in" periods; and (b) the samples generated are necessarily correlated, lowering effective sample sizes and propagating errors throughout estimates (Robert & Casella, 2004). If we let P be the prior distribution and Q the posterior distribution for Bayesian inference, then an algorithm which can transform independent samples from P to Q, without knowledge of the normalization constant in the density of Q, enables calculation of any conditional expectation with fast convergence.

As another example, generative modeling problems entail observing a large dataset with samples from an unknown distribution P (in high dimensions) and attempting to learn a representation or model so that new independent samples from P can be generated. Emerging approaches to generative modeling rely on the use of deep neural networks and include variational autoencoders (Kingma & Welling, 2013), generative adversarial networks (Goodfellow et al., 2014) and their derivatives (Li et al., 2015), and auto-regressive neural networks (Larochelle & Murray, 2011). These models have led to impressive results in a number of applications, but their tractability and theory are still not fully developed. If P can be transformed into a known and well-structured distribution Q (e.g. a multivariate standard Gaussian), then the inverse of the transformation can be used to transform new independent samples from Q into new samples from P.

While these issues relate to the functional attractiveness of the ability to characterize and sample from non-trivial distributions, there is also the issue of computational efficiency. There continues to be an ongoing upward trend of the availability of distributed and hardware-accelerated computational resources. As such, it would be especially valuable to develop solutions to these problems that are not only satisfactory in a functional sense, but are also capable of taking advantage of the ever-increasing scalability of parallelized computational capability.

### 1.1 Main Contribution

The main contribution of this work is to extend our previous results on finding transport maps to provide a more general transport-based push-forward theorem for pushing independent samples from a distribution P to independent samples from a distribution Q. Moreover, we show that, given only independent samples from P, knowledge of Q up to a normalization constant, and under the traditionally mild assumption of the log-concavity of Q, this can be carried out in a distributed and scalable manner, leveraging the alternating direction method of multipliers (ADMM) (Boyd et al., 2011). We also leverage variational principles from nonequilibrium thermodynamics (Jordan et al., 1998) to represent a transport map as an aggregate composition of simpler maps, each of which minimizes a relative entropy along with a transport-cost-based regularization term. Each map can be constructed with a complementary, ADMM-based formulation, resulting in the construction of a measure transport map smoothly and sequentially, with applicability in high-dimensional settings.

Expanding on previous work on the real-world applicability of these general-purpose algorithms, we showcase the implementation of a Bayesian LASSO-based analysis of the Boston Housing dataset (Harrison & Rubinfeld, 1978) and a high-dimensional example of using transport maps for generative modeling on the MNIST handwritten digits dataset (LeCun et al., 1998).

### 1.2 Previous Work

A methodology for finding transport maps based on ideas from optimal transport within the context of Bayesian inference was first proposed in (El Moselhy & Marzouk, 2012) and expanded upon in conjunction with more traditional MCMC-based sampling schemes in (Marzouk et al., 2016; Parno & Marzouk, 2014; Parno et al., 2016; Spantini et al., 2016).

Our previous work used ideas from optimal transport theory to generalize the posterior matching scheme, a mutual-information maximizing scheme for feedback signaling of a message point in arbitrary dimension (Ma & Coleman, 2014; Ma & Coleman, 2011; Tantiongloc et al., 2017). Building upon this, we considered a relative entropy minimization formulation, as compared to what was developed in (El Moselhy & Marzouk, 2012), and showed that for the class of log-concave distributions, this is a convex problem (Kim et al., 2013). We also previously described a distributed framework (Mesa et al., 2015) that we expand upon here.

In the more traditional optimal transportation literature, convex optimization has been used with varying success in specialized cases (Papadakis et al., 2014), as have gradient-based optimization methods (Rezende & Mohamed, 2015; Benamou et al., 2015a,b). The use of stochastic optimization techniques in optimal transport is also of current interest (Genevay et al., 2016). In contrast, our work below presents a specific distributed framework; extensions to stochastic updating have been previously developed in a general case, and incorporating them into this framework remains to be explored.

Additionally, there is much recent interest in the efficient and robust calculation of Wasserstein barycenters (centers of mass) across partial empirical distributions calculated over batches of samples (Cuturi & Doucet, 2014; Claici et al., 2018). Wasserstein barycenters have also been applied to Bayesian inference (Srivastava et al., 2015). While related, our work focuses instead on calculating the full empirical distribution through various efficient parameterizations discussed below.

Building on much of this, there is growing interest in specific applications of these transport problems in various areas (Arjovsky et al., 2017; Tolstikhin et al., 2017). These derived transport problems are proving to be a fruitful alternative approach and are the subject of intense research. The framework presented below is general purpose and could benefit many of the derived transport problems.

Excellent introductions and references to the field can be found in (Villani, 2008; Santambrogio, 2015).

The rest of this paper is organized as follows: in Section 2, we provide some necessary definitions and background information; in Section 3, we describe the distributed general push-forward framework and provide several details on its construction and use; in Section 4, we formulate a specialized version of the objective specifically tailored for sequential composition; in Section 5, we discuss applications and examples of our framework; and we provide concluding remarks in Section 6.

## 2 Preliminaries

In this section we make some preliminary definitions and provide background information for the rest of this paper.

### 2.1 Definitions and Assumptions

Assume the space for sampling is given by W, a convex subset of D-dimensional Euclidean space. Define the space of all probability measures on W (endowed with the Borel sigma-algebra) as P(W). If P ∈ P(W) admits a density with respect to the Lebesgue measure, we denote it as p.

###### Assumption 1.

We assume that P and Q admit densities with respect to the Lebesgue measure.

This work is fundamentally concerned with finding an appropriate push-forward between two probability measures, P and Q:

###### Definition 2.1 (Push-forward).

Given P, Q ∈ P(W), we say that a map S : W → W pushes P forward to Q (denoted as S#P = Q) if a random variable X with distribution P results in S(X) having distribution Q.

Of interest to us is the class of invertible and “smooth” push-forwards:

###### Definition 2.2 (Diffeomorphism).

A mapping S : W → W is a diffeomorphism on W if it is invertible, and both S and S⁻¹ are differentiable. Let D be the space of all diffeomorphisms on W.

A subclass of these are those that are "orientation preserving":

###### Definition 2.3 (Monotonic Diffeomorphism).

A mapping S ∈ D is orientation preserving, or monotonic, if its Jacobian J_S is positive-definite:

 J_S(u) ⪰ 0, ∀ u ∈ W

Let D+ be the set of all monotonic diffeomorphisms on W.

The Jacobian can be thought of as how the map “warps” space to facilitate the desired mapping. Any monotonic diffeomorphism necessarily satisfies the following Jacobian equation:

###### Lemma 2.4 (Monotonic Jacobian Equation).

Let P, Q ∈ P(W) and assume they have densities p and q. Any map S ∈ D+ for which S#P = Q satisfies the following Jacobian equation:

 p(u) = q(S(u)) det(J_S(u)), ∀ u ∈ W (1)
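As a one-dimensional sanity check of Lemma 2.4, the sketch below verifies the Jacobian equation (1) pointwise; the specific choices P = N(0, 1), Q = N(1, 4), and S(u) = 2u + 1 are illustrative assumptions, not part of the framework:

```python
import numpy as np

def p(u):  # density of P = N(0, 1)
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def q(v):  # density of Q = N(1, 4)
    return np.exp(-(v - 1)**2 / 8) / np.sqrt(8 * np.pi)

def S(u):  # monotone (affine) map pushing P forward to Q
    return 2 * u + 1

def JS(u):  # Jacobian; here a scalar derivative, positive everywhere
    return 2.0

u = np.linspace(-4, 4, 101)
# equation (1): p(u) = q(S(u)) det(J_S(u)) for all u
assert np.allclose(p(u), q(S(u)) * JS(u))
```

In one dimension, monotonicity of S reduces to a strictly positive derivative, which is exactly what makes the change-of-variables identity hold without absolute values.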

We will now concern ourselves with two different notions of “distance” between probability measures.

###### Definition 2.5 (KL Divergence).

Let P, Q ∈ P(W) and assume they have densities p and q. The Kullback-Leibler (KL) divergence, or relative entropy, between P and Q is given by

 D(P∥Q) = E_P[log (p(X)/q(X))]

The KL divergence is non-negative and is zero if and only if p(x) = q(x) for almost all x ∈ W.

###### Definition 2.6 (Wasserstein Distance).

For P, Q ∈ P(W) with densities p and q, the Wasserstein distance of order two between P and Q can be described as

 d(P,Q)² ≜ inf { E[∥X−Y∥²] : X ∼ P, Y ∼ Q } (2)

where the infimum is over all joint distributions of (X, Y) with the indicated marginals.

The following theorem will be useful throughout:

###### Theorem 2.7 ((Brenier, 1987; Villani, 2003)).

Under Assumption 1, d(P,Q)² can be equivalently expressed as

 d(P,Q)² ≜ inf { E_P[∥X−S(X)∥²] : S#P = Q } (3)

and there is a unique minimizer S*, which satisfies S* ∈ D+.

Note that this implies the following corollary:

###### Corollary 2.8.

For any P, Q satisfying Assumption 1, there exists an S ∈ D+ for which S#P = Q, or equivalently, for which (1) holds.

## 3 KL Divergence-based Push-Forward

In this section, we present the distributed push-forward framework that relies on our previously published relative entropy-based formulation of the measure transport problem, and discuss several issues related to its construction.

### 3.1 General Push-Forward

According to Lemma 2.4, a monotonic diffeomorphism pushing P to Q will necessarily satisfy the Jacobian equation (1). Note that although we think of a map as pushing from P to Q, we have written (1) so that p appears by itself on the left-hand side, while q is being acted on by S on the right-hand side. This notation is suggestive of the following interpretation: if we think of the destination density q as an anchor point, then for any arbitrary mapping S ∈ D+, we can describe an induced density for P according to Eq. 1 as:

 ~p(u; S) = q(S(u)) det(J_S(u)) for all u ∈ W (4)

With this notation, we can interpret {~p(·; S) : S ∈ D+} as a parametric family of densities, and for any fixed S, ~p(·; S) is a density which integrates to 1. We note that by construction, any S ∈ D+ necessarily pushes ~P(·; S) to Q: S#~P(·; S) = Q. We can then cast the transport problem as finding the mapping S that minimizes the relative entropy between P and the induced ~P(·; S):

 S* = argmin_{S∈D+} D(P∥~P(·; S)) (5)

This perspective is represented visually in Fig. 1.

If we now make another natural assumption:

###### Assumption 2.

 E[|log p(X)|] < ∞

We can expand Eq. 5 and combine it with (4) to write:

 S* = argmin_{S∈D+} D(P∥~P(·;S))
    = argmin_{S∈D+} E_P[log (p(X)/~p(X;S))]
    = argmin_{S∈D+} −h(p) − E_P[log ~p(X;S)] (6)
    = argmin_{S∈D+} −E_P[log ~p(X;S)] (7)
    = argmin_{S∈D+} −E_P[log q(S(X)) + log det J_S(X)] (8)

where in (6), h(p) is the Shannon differential entropy of p, which is fixed with respect to S; (7) follows because Assumption 2, together with Jensen's inequality and the non-negativity of the KL divergence, implies that h(p) is finite, so it does not affect the minimizer; and (8) follows by combining with (4).

We now make another assumption, under which we can guarantee efficient methods for solving (5).

###### Assumption 3.

The density q is log-concave.

We can now state the main theorem of this section (Kim et al., 2015; Mesa et al., 2015):

###### Theorem 3.1 (General Push-Forward).

Under Assumptions 1–3,

 min_{S∈D+} D(P∥~P(·;S)) (GP)

is a convex optimization problem.

###### Proof.

For any u ∈ W, the maps S ↦ S(u) and S ↦ J_S(u) are linear in S. Since log det(·) is strictly concave over the space of positive definite matrices (Boyd & Vandenberghe, 2004), and log q is concave by assumption, −log q(S(u)) − log det J_S(u) is a convex function of S on D+. Existence of an S ∈ D+ for which S#P = Q is given by Corollary 2.8. ∎

An important remark on this theorem:

###### Remark 1.

Theorem 3.1 does not place any structural assumptions on P. It need not be log-concave, for example.

Beginning with Eq. 8 above, we see that Problem (GP) can be solved through a Monte Carlo approximation of the expectation, arriving at the following sample-based version of the formulation:

 S* = argmin_{S∈D+} (1/N) Σ_{i=1}^N [−log q(S(X_i)) − log det(J_S(X_i))] (9)

where X_1, …, X_N are i.i.d. samples from P.
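To make (9) concrete, the following minimal sketch fits a one-dimensional affine map S(x) = ax + b by minimizing the empirical objective with a generic optimizer; the affine parameterization and the choice of an assumed P are illustrative stand-ins (the paper's polynomial parameterization appears in Section 3.3), with q taken as the standard normal density:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(3.0, 2.0, size=5000)  # samples from an "unknown" P

def objective(theta):
    a, b = np.exp(theta[0]), theta[1]      # exp() keeps the Jacobian a > 0
    SX = a * X + b
    neg_log_q = 0.5 * SX**2                # -log q up to an additive constant
    return np.mean(neg_log_q - np.log(a))  # -log det J_S = -log a in 1-D

res = minimize(objective, x0=np.zeros(2))
a, b = np.exp(res.x[0]), res.x[1]
Y = a * X + b
# The pushed-forward samples should be approximately standard normal.
assert abs(Y.mean()) < 0.05 and abs(Y.std() - 1.0) < 0.05
```

Reparameterizing a = exp(θ) is one simple way to keep the (scalar) Jacobian constraint satisfied during unconstrained optimization; the ADMM machinery below handles the matrix version of this constraint explicitly.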

### 3.2 Consensus Formulation

The stochastic optimization problem in (9) takes the general form of:

 min_S Σ_{i=1}^N f_i(S)

From this perspective, S can be thought of as a complicating variable. That is, this optimization problem would be entirely separable across the sum were it not for S. This can be instantiated as a global consensus problem:

 min Σ_{i=1}^N f_i(S_i) s.t. S_i − S = 0, i = 1, …, N

where the optimization is now separable across the summation, but we must achieve global consensus over S. With this in mind, we can now write a global consensus version of (9) as:

 min_{S_i∈D+} −(1/N) Σ_{i=1}^N [log q(S_i(X_i)) + log det(J_{S_i}(X_i))]
 s.t. S_i = S, i = 1, …, N (10)

In this problem, we can think of each (batch of) sample X_i as independently inducing some random ~P_i through a function S_i. The method proposed below can then be thought of as iteratively reducing the distance between each ~P_i and the true P by reducing the distance between each S_i and the consensus map S.

This problem is still posed over an infinite-dimensional space of functions, however.

### 3.3 Transport Map Parameterization

To address the infinite-dimensional space of functions mentioned above, as in (Mesa et al., 2015; Kim et al., 2013, 2015; Marzouk et al., 2016), we parameterize the transport map over a space of multivariate polynomial basis functions, each formed as the product of D univariate polynomials of varying degree. That is, given some multi-index j = (j_1, …, j_D), we form a basis function φ_j of multi-index degree j using univariate polynomials ψ_{j_a} of degree j_a as:

 φ_j(x) = Π_{a=1}^D ψ_{j_a}(x_a)

This allows us to represent one component of the map as a weighted linear combination of basis functions with weights w_{d,j} as:

 S_d(x) = Σ_{j∈J} w_{d,j} φ_j(x)

where J is a set of multi-indices specifying the orders of the polynomials in the associated expansion, and S_d denotes the d-th component of the mapping. In order to make this problem finite-dimensional, we must truncate the expansion to some fixed maximum total order O:

 J = { j ∈ N^D : Σ_{i=1}^D j_i ≤ O }

We can now approximate any nonlinear function S as:

 S(x) = W Φ(x)

where K = |J| is the size of the index set, Φ(x) is the K-vector of basis functions, and W is a D × K matrix of weights.
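The truncated parameterization can be sketched as follows; the use of probabilists' Hermite polynomials (ψ₀(t) = 1, ψ₁(t) = t, ψ₂(t) = t² − 1) and the small dimensions are illustrative assumptions:

```python
import numpy as np
from itertools import product
from math import comb

D, O = 2, 2  # data dimension and maximum total order

# Total-order multi-index set J = { j in N^D : sum_i j_i <= O }
J = [j for j in product(range(O + 1), repeat=D) if sum(j) <= O]
K = len(J)
assert K == comb(D + O, O)  # K = (D+O choose O) for a total-order truncation

def psi(deg, t):  # univariate Hermite polynomials up to degree 2
    return [np.ones_like(t), t, t**2 - 1][deg]

def Phi(x):  # Phi(x) in R^K, with phi_j(x) = prod_a psi_{j_a}(x_a)
    return np.array([np.prod([psi(j[a], x[a]) for a in range(D)], axis=0)
                     for j in J])

W = np.zeros((D, K))          # weight matrix; here S is the identity map
W[0, J.index((1, 0))] = 1.0   # S_1(x) = x_1
W[1, J.index((0, 1))] = 1.0   # S_2(x) = x_2

x = np.array([0.3, -1.2])
assert np.allclose(W @ Phi(x), x)  # S(x) = W Phi(x)
```

The binomial count K = (D+O choose O) makes the exponential growth discussed in Section 3.5 explicit: the dense total-order basis becomes expensive quickly as D or O grows.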

In order to avoid confusion, and in the spirit of consensus ADMM as shown in Boyd et al. (2011), we introduce a consensus variable B. With this, we can now give a finite-dimensional version of (10) as:

 min_{W_i∈R^{D×K}} −(1/N) Σ_{i=1}^N [log q(W_i Φ(X_i)) + log det(W_i J_Φ(X_i))] (11)
 s.t. W_i = B, W_i J_Φ(X_i) ⪰ 0, i = 1, …, N

with:

 W_i = [w_1, …, w_K] (D × K)
 Φ(·) = [φ_{j_1}(·), …, φ_{j_K}(·)]^T (K × 1)
 J_Φ(·) = [∂φ_{j_i}/∂x_j (·)]_{i,j} (K × D)

where we have made explicit the implicit constraint that S_i ∈ D+ by ensuring that W_i J_Φ(X_i) ⪰ 0. We now provide two important remarks:

###### Remark 2.

In principle, any basis of polynomials whose finite-dimensional approximations are sufficiently dense over W will suffice. In applications where P is assumed known, the basis functions are chosen to be orthogonal with respect to the reference measure P:

 ∫_W φ_j(x) φ_i(x) p(x) dx = 1{i = j}

Within the context of Bayesian inference, for instance, this greatly simplifies computing conditional expectations, corresponding conditional moments, etc. (Schoutens, 2000).

###### Remark 3.

When it is important to ensure that the approximation satisfies the properties of a diffeomorphism, we can project S onto D+ by solving a quadratic optimization problem, as discussed in Ensuring Diffeomorphism Properties of Parameterized Maps.

We also note that the polynomial representation presented above is chosen to best approximate a transport map, independent of a specific application or representation of the data (Fourier, wavelet, etc.). As mentioned in Remark 2 above, in principle any dense basis will suffice.

### 3.4 Distributed Push-Forward with Consensus ADMM

In this section we will reformulate (11) within the framework of the alternating direction method of multipliers (ADMM), and provide our main result, Corollary 3.2.

#### 3.4.1 Distributed Algorithm

Using ADMM, we can reformulate (11) as a global consensus problem to accommodate a parallelizable implementation. For notational clarity, we write Φ_i ≜ Φ(X_i) and J_i ≜ J_Φ(X_i). We then introduce the following auxiliary variables:

 B Φ_i ≜ p_i,  B J_i ≜ Z_i

We can now write (11) as:

 min_{{W_i,Z_i,p_i},B} (1/N) Σ_{i=1}^N [ −log q(p_i) − log det Z_i + (ρ/2)∥W_i − B∥²₂
   + (ρ/2)∥B Φ_i − p_i∥²₂ + (ρ/2)∥B J_i − Z_i∥²₂ ]
 s.t. B Φ_i = p_i : γ_i (D × 1)
      B J_i = Z_i : λ_i (D × D)
      W_i − B = 0 : α_i (D × K)
      Z_i ⪰ 0

where, to the right of each constraint, we have denoted the Lagrange multiplier that will be associated with it, along with its dimensions.

Although coordinate descent algorithms solve for one variable at a time while fixing the others and can be extremely efficient, they are not always guaranteed to find the globally optimal solution (Wright, 2015). Using the consensus formulation of ADMM above, we consider a problem formulation with the same global optimum, which contains quadratic penalties associated with the equality constraints in the objective function, with the constraints still imposed. The consensus formulation has the key property that its Lagrangian, termed the "augmented Lagrangian" (Boyd et al., 2011), can be globally minimized with coordinate descent algorithms for any ρ > 0. Note that when ρ = 0, the augmented Lagrangian is equivalent to the standard (unaugmented) Lagrangian associated with (11).

We can now raise the constraints to form the fully-penalized augmented Lagrangian as:

 L_ρ(W, Z, p, B; γ, λ, α) = (1/N) Σ_{i=1}^N [ −log q(p_i) − log det Z_i
   + (ρ/2)∥W_i − B∥²₂ + (ρ/2)∥B Φ_i − p_i∥²₂ + (ρ/2)∥B J_i − Z_i∥²₂
   + γ_i^T (p_i − B Φ_i) + tr(λ_i^T (Z_i − B J_i)) + tr(α_i^T (W_i − B)) ]

The key property we leverage from the ADMM framework is the ability to minimize this Lagrangian across each optimization variable sequentially, using only the most recently updated estimates. After simplification (details can be found in the Appendix), the final ADMM update equations are:

 B^{k+1} = B_i · B_s (12a)
 W_i^{k+1} = −(1/ρ) α_i^k + B^{k+1} (12b)
 Z_i^{k+1} = Q ~Z_i Q^T (12c)
 γ_i^{k+1} = γ_i^k + ρ (p_i^{k+1} − B^{k+1} Φ_i) (12d)
 λ_i^{k+1} = λ_i^k + ρ (Z_i^{k+1} − B^{k+1} J_i) (12e)
 α_i^{k+1} = α_i^k + ρ (W_i^{k+1} − B^{k+1}) (12f)
 p_i^{k+1} = argmin_{p_i} −log q(p_i) + pen(p_i) (12g)

We look first at the consensus variable B. We can separate its update into two pieces: an iterative component B_i and a static component B_s:

 B_i = (1/N) Σ_{i=1}^N [ ρ (W_i^k + p_i^k Φ_i^T + Z_i^k J_i^T) + γ_i^k Φ_i^T + λ_i^k J_i^T + α_i^k ] (13a)
 B_s = [ ρ (I + (1/N) Σ_{i=1}^N (Φ_i Φ_i^T + J_i J_i^T)) ]^{−1} (13b)

The consensus variable can then be thought of as averaging the effect of all other auxiliary variables, and forming the current best estimate for consensus among the distributed computational nodes.

The p-update is the only remaining minimization step that cannot necessarily be solved in closed form, as it completely contains the structure of the density q. In its penalization, all other optimization variables are fixed:

 pen(p_i) = (ρ/2)∥B^{k+1} Φ_i − p_i∥²₂ + (γ_i^k)^T (p_i − B^{k+1} Φ_i)

The formulation of (12) has the following desirable properties:

• Eqs. 12a, 12b, 12c, 12d, 12e and 12f admit closed-form solutions. In particular, Eqs. 12b, 12d, 12e and 12f are simple arithmetic updates;

• Eq. 12g is a penalized D-dimensional-vector convex optimization problem that entirely captures the structure of q. In particular, any changes to the problem specifying a different structure for q will be entirely confined to this update; furthermore, algorithm designers can utilize any optimization procedure/library of their choosing to perform this update.
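The update structure of (12), local primal updates, a consensus (averaging) update, and dual ascent, can be illustrated on a deliberately simple consensus problem; the quadratic f_i(s) = (s − a_i)² below is a hypothetical stand-in for the per-sample terms, chosen so every update is closed-form:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=8)           # one "data block" per worker
N, rho = len(a), 1.0
s = np.zeros(N)                  # local variables (analogues of W_i)
B = 0.0                          # consensus variable
alpha = np.zeros(N)              # scaled dual variables

for _ in range(200):
    # Local updates (parallelizable across workers): closed-form argmin of
    # (s_i - a_i)^2 + (rho/2)(s_i - B + alpha_i)^2.
    s = (2 * a + rho * (B - alpha)) / (2 + rho)
    B = np.mean(s + alpha)       # consensus update: average of local estimates
    alpha = alpha + s - B        # dual ascent (scaled form)

# The consensus optimum of sum_i (s - a_i)^2 is the mean of the a_i.
assert abs(B - a.mean()) < 1e-6
```

Only the local s-update depends on the f_i, mirroring how, in (12), only the p-update depends on the structure of q; the consensus and dual updates are generic arithmetic.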

With this, we can now give an efficient, distributed version of the general push-forward theorem:

###### Corollary 3.2 (Distributed Push-Forward).

Under Assumptions 1 and 3,

 min_{W_i∈R^{D×K}} −(1/N) Σ_{i=1}^N [log q(W_i Φ_i) + log det(W_i J_i)] (14)
 s.t. W_i = W, W_i J_i ⪰ 0, i = 1, …, N

is a convex optimization problem.

###### Remark 4.

ADMM's convergence properties are robust to inaccuracies in the initial stages of the iterative solving process (Boyd et al., 2011). Additionally, several key concentration results provide very strong bounds for averages of random samples from log-concave distributions, showing that the approximation is indeed robust (Bobkov et al., 2011, Thms. 1.1, 1.2).

The above framework, under natural assumptions, facilitates the efficient, distributed, and scalable calculation of an optimal map that pushes forward some P to some Q.

### 3.5 Structure of the Transport Map

An important consideration in ensuring that the construction of transport maps is efficient is their underlying structure. In Section 3.3 we described a parameterization of the transport map through the multi-index set J, the indices of polynomial orders involved in the expansion. However, this parameterization tends to be infeasible in high dimension or with high-order polynomials, due to the exponential rate at which the number of polynomials increases with respect to these two properties.

In (Marzouk et al., 2016), two less expressive but more computationally feasible map structures that can be used to generate the transport map were discussed, which we briefly reproduce here, along with some useful properties. For more specific details and examples of multi-index sets pertaining to each mode for implementation purposes, see Transport Map Multi-Indices Details.

The first alternative to the fully expressive dense mapping is the Knothe-Rosenblatt map (Bonnotte, 2013), which our group also previously used within the context of generating transport maps for optimal message point feedback communication (Ma & Coleman, 2011). Here, each component of the output, S_d, is only a function of the first d components of the input, resulting in a mapping that is lower-triangular. Both the Knothe-Rosenblatt and the dense mapping described above perform the transport from one density to another, but with different geometric transformations. An example of these differences can be found in Figures 3 and 4 of (Ma & Coleman, 2011).

A Knothe-Rosenblatt arrangement gives the following multi-index set (note that the index set is now sub-scripted according to the data component, denoting the dependence of each output component on the input):

 J_d^KR = { j ∈ N^D : Σ_{i=1}^D j_i ≤ O ∧ j_i = 0 ∀ i > d },  d = 1, …, D

An especially useful property of this parameterization is the following identity for the Jacobian of the map:

 log det(J_S(X_i)) = Σ_{d=1}^D log ∂_d S_d(X_i) (15)

where ∂_d S_d represents the partial derivative of the d-th component of the mapping with respect to the d-th component of the data, evaluated at X_i.

Furthermore, the positive-definiteness of the Jacobian can equivalently be enforced for a lower-triangular mapping by ensuring the following:

 ∂_d S_d > 0, 1 ≤ d ≤ D (16)
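A small numerical sketch of identities (15) and (16), using a hypothetical lower-triangular map on R², compares the sum of log diagonal partials against a full log-determinant:

```python
import numpy as np

# Hypothetical triangular map S(x) = (x_1^3 + x_1, x_1 + e^{x_2}).
def S(x):
    return np.array([x[0]**3 + x[0], x[0] + np.exp(x[1])])

def J_S(x):  # full Jacobian; lower-triangular by construction
    return np.array([[3 * x[0]**2 + 1, 0.0],
                     [1.0, np.exp(x[1])]])

x = np.array([0.7, -0.4])
# (15): log det J_S(x) = sum_d log( dS_d / dx_d )
diag_sum = np.log(3 * x[0]**2 + 1) + x[1]          # log e^{x_2} = x_2
sign, logdet = np.linalg.slogdet(J_S(x))
assert sign > 0 and np.isclose(diag_sum, logdet)   # (16): diagonal partials > 0
```

Avoiding the full determinant is what makes the triangular structure attractive at scale: the log-determinant term in the objective reduces to D scalar logarithms.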

We can then write a Knothe-Rosenblatt special-case version of Eq. 10 as:

 min_{S_i∈D+^KR} −(1/N) Σ_{i=1}^N [log q(S_i(X_i)) + Σ_{d=1}^D log ∂_d S_{i,d}(X_i)]
 s.t. S_i = S, i = 1, …, N (17)

Indeed, we use this to our advantage in Section 4.

Finally, in the event that the Knothe-Rosenblatt mapping also proves to have too high a model complexity, an even less expressive option is a Knothe-Rosenblatt mapping that ignores all multivariate polynomials involving more than one data component of the input at a time, resulting in the following multi-index set:

 J_d^KRSV = { j ∈ N^D : Σ_{i=1}^D j_i ≤ O ∧ j_i j_l = 0 ∀ i ≠ l ∧ j_i = 0 ∀ i > d },  d = 1, …, D

Although less expressive and less precise than the total-order Knothe-Rosenblatt map, these maps can often still perform at an acceptable level of accuracy on many problems.

### 3.6 Algorithm for Inverse Mapping with Knothe-Rosenblatt Transport

It may be desirable to compute the inverse mapping of a given sample from Q, that is, S⁻¹(Y). When the forward mapping is constrained to have Knothe-Rosenblatt structure and a polynomial basis is used to parameterize the mapping, the process of inverting a sample from Q reduces to solving a sequential series of polynomial root-finding problems (Marzouk et al., 2016). We give a more detailed implementation-based explanation of this process, alongside a discussion of implementation details for the Knothe-Rosenblatt maps, in Inverse Map Details.
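A sketch of this sequential inversion, using a hypothetical two-dimensional triangular monotone map and a generic bracketing root-finder in place of a polynomial-specific solver: solve the first component for x₁, substitute, then solve the second for x₂, and so on.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical triangular monotone map on R^2.
def S(x):
    return np.array([x[0]**3 + x[0],          # S_1 depends on x_1 only
                     x[0] + np.exp(x[1])])    # S_2 depends on x_1, x_2

def S_inverse(y, lo=-10.0, hi=10.0):
    x = np.zeros_like(y)
    for d in range(len(y)):
        # Monotonicity in x_d (condition (16)) guarantees a unique root
        # in the bracket [lo, hi].
        def resid(t, d=d):
            x[d] = t
            return S(x)[d] - y[d]
        x[d] = brentq(resid, lo, hi)
    return x

x_true = np.array([0.5, -1.0])
y = S(x_true)
assert np.allclose(S_inverse(y), x_true, atol=1e-8)
```

Each coordinate is a one-dimensional problem given the previously solved coordinates, which is exactly why the triangular structure makes inversion cheap relative to inverting a dense map.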

## 4 Sequential Composition of Optimal Transportation Maps

In this section, we introduce a scheme for using many individually computed maps in sequential composition to achieve the overall effect of a single large mapping from P to Q. By using a sequence of maps to transform P to Q instead of a single one-shot map, one can theoretically rely on models of lower complexity to represent each map in the sequence: although each map is, on its own, "weak" in the sense of its ability to induce large changes in the distribution space, the combined action of many such maps can successfully transform samples as desired. This is especially attractive for model structures whose complexity increases exponentially with problem size, such as the dense polynomial chaos structure discussed in the previous section. This sequential composition process is visually represented in Figure 2.

Moving forward, we first take a brief look at a non-equilibrium thermodynamics interpretation of this methodology to further justify the use of such a scheme, and then derive a slightly different ADMM problem to implement it.

### 4.1 Non-Equilibrium Thermodynamics and Sequential Evolution of Distributions

One approach to interpreting sequential composition of maps is to borrow ideas from statistical physics, where we can interpret q as the equilibrium density ρ∞ of particles in a system, which at time t = 0 is out of equilibrium with density p (also termed ρ₀). Since ρ∞ is an equilibrium density, it can be written as a Gibbs distribution (with temperature equal to 1 for simplicity): ρ∞(u) ∝ e^{−Ψ(u)}. For instance, if ρ∞ pertains to a standard Gaussian, then Ψ(u) = ∥u∥²/2. Assuming the particles obey the Langevin equation, it is well known that the evolution of the particle density as a function of time obeys the Fokker-Planck equation. It was shown in (Jordan et al., 1998) that the trajectory of ρ_t can be interpreted from variational principles. Specifically,

###### Theorem 4.1 ((Jordan et al., 1998) Thm 5.1).

Define ρ₀ ≜ p and ρ∞ ≜ q, and assume that D(ρ₀∥ρ∞) < ∞. For any h > 0, consider the following minimization problem:

 A(ρ) ≜ (1/2) d(ρ_{k−1}, ρ)² + h D(ρ∥ρ∞) (18)
 ρ_k ≜ argmin_{ρ∈P(W)} A(ρ) (19)

Then as h → 0, the piecewise constant interpolation which equals ρ_k for t ∈ [kh, (k+1)h) converges weakly in L¹ to ρ_t, the solution to the Fokker-Planck equation.

The log-concave structure of q that we have exploited previously also has implications for exponential convergence to equilibrium from this statistical physics perspective:

###### Theorem 4.2 ((Bakry & Émery, 1985)).

If ρ∞ is uniformly log-concave, namely

 ∇²Ψ(u) ⪰ λ I_D, ∀ u ∈ W

for some λ > 0, with I_D the D × D identity matrix, then:

 D(ρ_t∥ρ∞) ≤ e^{−2λt} D(ρ₀∥ρ∞).

Note that if ρ∞ is the density of a standard Gaussian, this inequality holds with λ = 1.
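This bound can be checked numerically in a case where everything is closed-form: assuming ρ∞ is a standard Gaussian (so λ = 1) and the dynamics are the Ornstein-Uhlenbeck process, the marginals ρ_t remain Gaussian with mean m_t = m₀e^{−t} and variance v_t = 1 + (v₀ − 1)e^{−2t}.

```python
import numpy as np

def kl_gauss(m, v):  # D( N(m, v) || N(0, 1) ), closed form
    return 0.5 * (v + m**2 - 1.0 - np.log(v))

m0, v0 = 3.0, 4.0
kl0 = kl_gauss(m0, v0)
for t in [0.5, 1.0, 2.0, 4.0]:
    mt = m0 * np.exp(-t)                   # OU mean decay
    vt = 1.0 + (v0 - 1.0) * np.exp(-2 * t) # OU variance relaxation
    # Bakry-Emery bound with lambda = 1:
    assert kl_gauss(mt, vt) <= np.exp(-2 * t) * kl0
```

The mean term of the KL divergence decays at exactly the rate e^{−2t} here, while the variance term decays faster, so the bound holds with room to spare.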

### 4.2 Sequential Construction of Transport Maps

We now note that for any h > 0, (19) encodes a sequence of densities {ρ_k} which evolve towards ρ∞. For notational conciseness in this section, we will use the subscript k on S to denote the position of the map in a sequence of maps. As such, from Corollary 2.8, there exists an S₁ ∈ D+ for which S₁#ρ₀ = ρ₁, and more generally, for any k, there exists an S_k ∈ D+ for which S_k#ρ_{k−1} = ρ_k.

###### Lemma 4.3.

Define B(S) as

 B(S) ≜ (1/2) E_{ρ_{k−1}}[∥X − S(X)∥²] + h D(ρ_{k−1}∥~p(·;S))
 S_k ≜ argmin_{S∈D+} B(S) (20)

Then S_k#ρ_{k−1} = ρ_k and B(S_k) = A(ρ_k).

###### Proof.

From the definition of ~p(·;S) in (4) and the invariance of relative entropy under an invertible transformation, any S ∈ D+ satisfies

 D(ρ_{k−1}∥~p(·;S)) = D(ρ_{k−1}∥S⁻¹#ρ∞) = D(S#ρ_{k−1}∥ρ∞).

As such, moving forward with the proof, we will exploit the fact that B(S) = ~B(S), where

 ~B(S) ≜ (1/2) E_{ρ_{k−1}}[∥X − S(X)∥²] + h D(S#ρ_{k−1}∥ρ∞).

From Theorem 2.7, E_{ρ_{k−1}}[∥X − S(X)∥²] ≥ d(ρ_{k−1}, S#ρ_{k−1})² for any S ∈ D+. Also, since the relative entropy terms of ~B(S) and A(S#ρ_{k−1}) are equal, it follows that ~B(S) ≥ A(S#ρ_{k−1}) ≥ A(ρ_k) for any S ∈ D+. Moreover, from Corollary 2.8, we have that there exists an S_k ∈ D+ for which S_k#ρ_{k−1} = ρ_k and

 E_{ρ_{k−1}}[∥X − S_k(X)∥²] = d(ρ_{k−1}, ρ_k)².

Thus B(S_k) = ~B(S_k) = A(ρ_k). ∎

As such, a natural composition of maps underlies how a sample from ρ₀ gives rise to a sample from ρ_k:

 ρ_k = S_k#ρ_{k−1} = S_k∘S_{k−1}#ρ_{k−2} = ⋯ = S_k∘⋯∘S₁#ρ₀ (21)

Moreover, since as , and so approaches the identity map. Thus for small , each should be estimated with reasonable accuracy using lower-order maps. That is, can be described as the composition of maps as

 S(x)=ST∘…∘S2∘S1(x) (22)

for all x, such that each Sk is of relatively low order in the polynomial chaos expansion.
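To make the composition (21)-(22) concrete, the following sketch (ours, with arbitrarily chosen Gaussian endpoints) pushes samples toward a standard-Gaussian target through a sequence of near-identity affine maps, each matching consecutive Gaussian marginals of the associated flow at spacing h:

```python
import numpy as np

# Sketch (ours): realize (21)-(22) with affine (lowest-order) maps.
# Take the intermediate densities to be the Ornstein-Uhlenbeck Gaussians at
# times t_k = k*h:  m_k = m0*exp(-t_k),  s_k^2 = 1 + (s0^2 - 1)*exp(-2*t_k).
# The monotone map between consecutive Gaussians is affine and close to the
# identity for small h:  S_k(x) = m_k + (s_k / s_{k-1}) * (x - m_{k-1})

rng = np.random.default_rng(0)
m0, s0, h, T = 4.0, 3.0, 0.1, 60

def moments(k):
    t = k * h
    return m0 * np.exp(-t), np.sqrt(1 + (s0**2 - 1) * np.exp(-2 * t))

x = m0 + s0 * rng.standard_normal(200_000)   # samples from rho_0
for k in range(1, T + 1):
    m_prev, s_prev = moments(k - 1)
    m_k, s_k = moments(k)
    x = m_k + (s_k / s_prev) * (x - m_prev)  # apply S_k to S_{k-1} o ... o S_1 samples

print(x.mean(), x.std())  # close to the N(0, 1) target moments
```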

Note that B as written above involves expectations with respect to ρk−1. Since our scheme operates sequentially, we have already estimated S1,…,Sk−1 and can generate approximate i.i.d. samples from ρk−1 by first generating X1,…,XN i.i.d. from ρ0 and constructing

 Zi=Sk−1∘⋯∘S1(Xi),i≥1.

Below, we demonstrate efficient ways to solve the following convex optimization problem, which replaces the expectation with respect to ρk−1 by the empirical expectation over the samples Zi:

 min_{S∈D+} (1/N) ∑_{i=1}^{N} [(1/2)∥Zi−S(Zi)∥² − h log ~p(Zi;S)]
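As an illustration of this empirical problem (ours, with a one-dimensional standard-Gaussian target and the map restricted to the affine family S(z) = az + b, a > 0, so that −log ~p(z;S) = (1/2)S(z)² − log a up to a constant), the objective becomes a two-variable convex program that plain gradient descent already handles:

```python
import numpy as np

# Sketch (ours): the empirical objective above, restricted to affine maps
# S(z) = a*z + b with a > 0 and target rho_inf = N(0, 1).
# The problem is convex in (a, b); we solve it with plain gradient descent.

rng = np.random.default_rng(1)
Z = 2.0 + rng.standard_normal(5000)   # stand-in samples Z_i from rho_{k-1}
h = 0.5

def grads(a, b):
    r = a * Z + b                     # S(Z_i)
    ga = np.mean((r - Z) * Z + h * r * Z) - h / a   # d/da of the objective
    gb = np.mean((r - Z) + h * r)                   # d/db of the objective
    return ga, gb

a, b = 1.0, 0.0                       # start at the identity map
for _ in range(3000):
    ga, gb = grads(a, b)
    a -= 0.02 * ga
    b -= 0.02 * gb

print(a, b)  # a near-identity map that shifts mass toward the target
```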

To reiterate, we consider a distribution formed by the sequential composition of previously estimated mappings as

 ρk−1 = S*k−1∘⋯∘S*1#ρ0,

where each S*j is obtained by solving the optimization problem (20), which we term SOT. We then find a map Sk that pushes ρk−1 forward closer to ρ∞. As the number of compositions T in (22) increases, ST∘⋯∘S1#ρ0 approaches ρ∞. When Ψ is uniformly log-concave, this greedy, sequential approach still guarantees exponential convergence.

In the context of Knothe-Rosenblatt maps, for every map in the sequence we can solve the following optimization problem. (In the equation below, we drop the subscript indicating the sequential map index, since the formulation does not depend on the position in the map sequence, and we once again use the subscript i to indicate the distributed variables of the consensus problem.)

 min_{Si∈DKR+} (1/N) ∑_{i=1}^{N} [ θ∥Si(Xi)−Xi∥²₂ − log q(Si(Xi)) − ∑_{d=1}^{D} log ∂dSdi(Xi) ] (23)
 s.t. Si = S, ∀ 0 ≤ i ≤ N

where θ can be interpreted as an inverse “step-size” parameter.

Though each map in the sequence must be calculated after the previous one, each individual map can still be calculated in the distributed framework described above. This implies that at each round, one could adaptively choose the parameters for the next round’s solve.

### 4.3 ADMM Formulation for Learning Sequential Maps

We now showcase an ADMM formulation for the optimal transportation-based objective function, similar in spirit to that of Eq. 12.

We first introduce the following conventions:

• ∂dS represents the partial derivative of S taken with respect to the dth component, and Sd is the dth component of S.

• 1d represents a one-hot vector of length D with the one in the dth position.

We can then introduce a finite-dimensional representation of the transport map, as well as auxiliary variables and a consensus variable, into Eq. 23 and rewrite the problem as:

 min_{{Wi,pi},{Ydi,Zdi},B} (1/N) ∑_{i=1}^{N} [ θ∥BΦi−xi∥²₂ − log q(pi) + (ρ/2)∥Wi−B∥²₂ + (ρ/2)∥BΦi−pi∥²₂ + ∑_{d=1}^{D} ( −log Zdi + (ρ/2)(Ydi1d−Zdi)² + (ρ/2)∥BΦdi−Ydi∥²₂ ) ] (24)
 s.t. BΦi = pi    γi (D×1)
      Wi − B = 0    αi (D×K)
      Ydi1d = Zdi    βdi (1×1)
      BΦdi = Ydi    λdi (D×1)
      Zdi > 0

where we have once again denoted the corresponding Lagrange multipliers (and their dimensions) to the right of each constraint. The superscript d reflects the fact that, in this formulation, in addition to having separable variables for each data sample, some variables are now indexed over dimension as well. For example, there are N×D-many Ydi variables that must be solved for. We can now raise the constraints to form the fully-penalized Lagrangian as:

 Lρ,θ(W,Z,Y,p,B; γ,α,β,λ) = (1/N) ∑_{i=1}^{N} [ θ∥BΦi−xi∥²₂ − log q(pi) + (ρ/2)∥Wi−B∥²₂ + (ρ/2)∥BΦi−pi∥²₂ + γiT(pi−BΦi) + tr(αiT(Wi−B)) + ∑_{d=1}^{D} ( −log Zdi + (ρ/2)(Ydi1d−Zdi)² + (ρ/2)∥BΦdi−Ydi∥²₂ + βdi(Zdi−Ydi1d) + λdiT(Ydi−BΦdi) ) ] (25)

The final ADMM update equations for each variable are once again all closed-form, with the exception of the optimization over p. For the sake of brevity, we refer the reader to the section Derivation of Knothe-Rosenblatt ADMM Formulation and Final Updates of the Appendix for the exact update equations.

However, one notable difference between this formulation and that of Section 3.1 is that the update for Z has been simplified from requiring an eigenvalue decomposition to requiring a simple scalar computation, significantly reducing computation time, especially in higher dimensions.
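That scalar computation can be sketched as follows (our notation: v stands for the combination of primal and dual terms the Zdi-update sees, and ρ/2 for the quadratic penalty). The subproblem is a one-dimensional strongly convex program with a closed-form positive root:

```python
import numpy as np

# Sketch (ours): the scalar Z-subproblem arising in the ADMM updates,
#   min_{z > 0}  -log z + (rho/2) * (z - v)^2
# where v stands for the combined primal/dual terms the update sees.
# Setting the derivative  -1/z + rho*(z - v)  to zero gives a quadratic with
# exactly one positive root -- no eigendecomposition needed.

def z_update(v, rho):
    return 0.5 * (v + np.sqrt(v**2 + 4.0 / rho))

v, rho = -1.3, 2.0
z = z_update(v, rho)
print(z, -1.0 / z + rho * (z - v))  # z > 0 and the stationarity residual ~ 0
```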

### 4.4 Scaling Parallelization with GPU Hardware

Given the parallelized formulations above, we implemented our algorithm using the Nvidia CUDA API to extract as much performance as possible from our formulation and to maximize the problem sizes we could reasonably handle while keeping computation time short. To test the algorithm’s parallelizability, we ran our implementation on a single Nvidia GTX 1080 Ti GPU, as well as on a single p3.16xlarge instance available on Amazon Web Services, which itself contains 8 on-board Tesla V100 GPUs.

For this test, we sampled synthetic data from a bimodal distribution specified as a mixture of two Gaussian distributions, for a wide range of problem dimensions, with a constant number of samples N. We then find a transport map pushing P to Q, composed of a sequence of 10 individual Knothe-Rosenblatt maps with no mixed multivariate terms, and we monitor the convergence of the dual variables for proper termination of the algorithm.

Figure 3 shows the result of this analysis. The 1-GPU curve corresponds to performance using the single GTX 1080 Ti, and the AWS curve corresponds to performance using the 8-GPU system on Amazon Web Services. The trends show that, as expected, as the problem dimension increases, a multi-GPU system maintains reasonable computation times relative to a single-GPU system, whereas fewer GPUs accumulate increasingly high computational costs. In addition, the parallelizability of our algorithm has the subtle benefit of easing memory usage: since we can distribute samples across multiple devices, we can subsequently distribute all corresponding ADMM variables as well. Indeed, the single GTX 1080 Ti ran out of on-board memory beyond a moderate problem dimension, whereas the 8-GPU system can go well beyond that.

## 5 Applications

The framework presented above is general-purpose and pushes forward a distribution P to a log-concave distribution Q. Below we discuss some interesting applications, namely Bayesian inference and generative modeling, and show results on real-world datasets.

### 5.1 Bayesian Inference

A very important instantiation of this framework arises when we consider P to represent a prior distribution and Q a Bayesian posterior:

 fX|Y=y(x) = fY|X(y|x)fX(x)/βy

where βy is a constant that does not vary with x, given by:

 βy = ∫_{v∈X} fY|X(y|v)fX(v)dv

Using Eq. 1 and combining with Bayes’ rule above we can write:

 fX(x) = fX|Y=y(S*(y)(x)) det(JS*(y)(x)) = fY|X(y|S*(y)(x)) fX(S*(y)(x)) det(JS*(y)(x)) / βy

where the notation S*(y) indicates that the optimal map is found with respect to the observations y. We note that since βy does not depend on x, log-concavity of fX|Y=y is equivalent to log-concavity of the prior density together with log-concavity of the likelihood in x: the same criterion required for an MAP estimation procedure to be convex. Thus Corollary 3.2 extends to the special case of Bayesian inference; i.e., we can generate i.i.d. samples from the posterior distribution by solving a convex optimization problem in a distributed fashion.
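As a one-dimensional sanity check of the change-of-variables identity above (ours, using conjugate Gaussians so that the posterior, and hence the optimal map, is available in closed form):

```python
import numpy as np

# Sketch (ours): verify f_X(x) = f_{X|Y=y}(S*(x)) * det(J_{S*}(x)) in 1-D with
# conjugate Gaussians, where the monotone map pushing the (standard Gaussian)
# prior onto the Gaussian posterior is affine and known in closed form.
# Prior: X ~ N(0, 1).  Likelihood: Y | X = x ~ N(x, sig2).  Observed y = 1.

sig2, y = 0.25, 1.0
var_post = 1.0 / (1.0 + 1.0 / sig2)          # posterior variance
mu_post = var_post * (y / sig2)              # posterior mean

def gauss_pdf(x, m, v):
    return np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2.0 * np.pi * v)

def S_star(x):                               # pushes the prior onto the posterior
    return mu_post + np.sqrt(var_post) * x

xs = np.linspace(-3, 3, 101)
lhs = gauss_pdf(xs, 0.0, 1.0)                # prior density f_X
rhs = gauss_pdf(S_star(xs), mu_post, var_post) * np.sqrt(var_post)  # pulled back
print(np.max(np.abs(lhs - rhs)))  # ~ 0: the change of variables holds exactly
```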

Due to the unique way the ADMM steps were structured, this special case only requires specifying a particular instance of Eq. 12g: