# Optimal transport mapping via input convex neural networks

In this paper, we present a novel and principled approach to learn the optimal transport between two distributions from samples. Guided by optimal transport theory, we learn the optimal Kantorovich potential that induces the optimal transport map. This involves learning two convex functions, by solving a novel minimax optimization. Building upon recent advances in input convex neural networks, we propose a new framework where the gradient of one convex function represents the optimal transport mapping. Numerical experiments confirm that we learn the optimal transport mapping. This approach ensures that the transport mapping we find is optimal independent of how we initialize the neural networks. Further, target distributions with discontinuous support can be easily captured, as the gradient of a convex function naturally models a discontinuous transport mapping.


## 1 Introduction

Finding a mapping that transports mass from one distribution to another has several applications. Training deep generative models to sample from a realistic distribution relies on finding a mapping from a predefined distribution to the target distribution. This includes generative adversarial networks goodfellow2014generative , generative moment matching networks li2015generative , minimizing maximum mean discrepancy li2017mmd , and variational autoencoders 2013auto . Domain adaptation requires mapping the source domain to the target domain, to align the patterns discovered in the former to the latter. Finding a natural alignment that preserves similarities is critical gopalan2011domain ; ben2010theory .

There are many mappings $T$ that can map a random variable $X$ drawn from $P$ such that $T(X)$ is distributed as $Q$. Optimal transport addresses the problem of finding one that minimizes the total cost of transporting mass from $P$ to $Q$, as originally formulated in monge1781memoire . Optimal mappings are useful in several applications including color transfer ferradans2014regularized , shape matching su2015optimal , and data assimilation reich2013nonparametric ; el2012bayesian . For discrete variables, the optimal transport can be found as a solution to a linear program. Hence, typical approaches to learn the optimal transport are based on quantization and cannot be extended to high-dimensional variables evans1999differential ; benamou2000computational ; papadakis2014optimal .
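For concreteness, the discrete problem can be posed directly as a linear program over the transport plan. The sketch below (an illustration of the classical formulation, not the method of this paper) computes the optimal plan between two 2-point distributions on the line under the quadratic cost, using `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Source P uniform on {0, 1}; target Q uniform on {2, 3}.
x = np.array([0.0, 1.0])
y = np.array([2.0, 3.0])
p = np.array([0.5, 0.5])
q = np.array([0.5, 0.5])

# Quadratic cost c(x, y) = 0.5 * |x - y|^2, flattened row-major.
C = 0.5 * (x[:, None] - y[None, :]) ** 2

# Marginal constraints: rows of the plan sum to p, columns sum to q.
n, m = len(p), len(q)
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0  # row-sum constraints
for j in range(m):
    A_eq[n + j, j::m] = 1.0           # column-sum constraints
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
plan = res.x.reshape(n, m)
print(plan)       # monotone coupling: 0 -> 2 and 1 -> 3
print(res.fun)    # transport cost 2.0
```

Here the monotone coupling is optimal with cost $\tfrac{1}{2}(\tfrac{1}{2}\cdot 4+\tfrac{1}{2}\cdot 4)=2$. The number of variables grows as $n\cdot m$, which is exactly why such quantization-based solvers do not scale to high dimensions.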

We propose a novel minimax optimization to learn the optimal transport under the quadratic cost, or Wasserstein-2 metric. The key innovation is to depart from the common practice of adding regularizers to eliminate the challenging constraints in the Kantorovich dual formulation (3). Such regularization creates a bias that hinders learning the true optimal transport. Instead, we eliminate the constraints by restricting our search to the set of all convex functions, building upon the fundamental connection from Theorem 3.2. This leads to a novel minimax formulation in (6). Leveraging recent advances in input convex neural networks, we propose an algorithm for solving this minimax optimization, and experimentally showcase its performance. From the perspective of training generative models, our approach can be viewed as a novel framework to train a generator that is modeled as the gradient of a convex function. We provide a principled training rule based on optimal transport theory. This ensures that the generator converges to the optimal transport, independent of how we initialize the neural network. Further, the gradient of a neural network naturally represents discontinuous functions, which is critical whenever the target distribution has a disconnected support.

To model convex functions, we leverage Input Convex Neural Networks (ICNNs), a class of scalar-valued neural networks whose input-output mapping is convex. These neural networks were introduced by amos2016input to provide efficient inference and optimization procedures for structured prediction, data imputation and reinforcement learning tasks. In this paper, we show that ICNNs can be efficiently trained to learn the optimal transport map between two distributions $P$ and $Q$. To the best of our knowledge, this is the first instance where ICNNs are leveraged for the well-known task of learning optimal transport maps in a scalable fashion. This framework opens up a new realm for understanding problems in optimal transport theory using parametric convex neural networks, both in theory and practice. Figure 1 provides an example where the optimal transport map has been learned via the proposed Algorithm 1 from the orange distribution to the green distribution.

Notation. $\delta_x$ denotes the Dirac distribution at $x\in\mathbb{R}^d$. $\mathrm{id}$ denotes the identity function, i.e. $\mathrm{id}(x)=x$. For any function $f:\mathbb{R}^d\to\mathbb{R}$, $f^*$ denotes its convex conjugate, i.e. $f^*(y)=\sup_{x}\ \langle x,y\rangle-f(x)$. For any two measures $P,Q$ on $\mathbb{R}^d$, $P\otimes Q$ denotes their product measure on $\mathbb{R}^d\times\mathbb{R}^d$. We denote the set of all convex functions on $\mathbb{R}^d$ by $\mathrm{CVX}(\mathbb{R}^d)$. The $\ell_2$-Euclidean norm is denoted by $\|\cdot\|$. $f\in C^1$ denotes that the function $f$ is differentiable. For any measure $P$, let $L^1(P)\triangleq\{f:\int|f|\,dP<\infty\}$.

## 2 Related work and background on optimal transport

For any Polish space $\mathcal{X}$, let $\mathcal{P}(\mathcal{X})$ denote the set of all probability distributions on $\mathcal{X}$, and $\mathcal{B}(\mathcal{X})$ denote the set of all Borel-measurable subsets of $\mathcal{X}$. Let $P$ and $Q$ be two probability distributions on $\mathbb{R}^d$ with finite second order moments. For any measurable mapping $T:\mathbb{R}^d\to\mathbb{R}^d$, let $T\#P$ denote the push-forward of the measure $P$ under $T$, i.e. $T\#P(B)=P(T^{-1}(B))$ for all $B\in\mathcal{B}(\mathbb{R}^d)$. Let $c(x,y)=\frac{1}{2}\|x-y\|^2$ be the standard quadratic cost. Then Monge's optimal transportation problem is to transport the probability mass under $P$ to $Q$ with the least quadratic cost, i.e.

$$\inf_{T:\,T\#P=Q}\ \mathbb{E}_{X\sim P}\Big[\tfrac{1}{2}\|X-T(X)\|^2\Big]. \qquad (1)$$

Any transport map $T$ achieving the minimum in (1) is called the optimal transport map. An optimal transport map may not exist. In fact, the feasible set in the above optimization problem may itself be empty, for example when $P$ is a Dirac distribution and $Q$ is any non-Dirac distribution.

To extend this notion of distance between $P$ and $Q$ in (1), Kantorovich considered a relaxed version in kantorovitch1958translocation ; kantorovich2006problem . When an optimal transport map exists, the following second Wasserstein distance recovers (1). Further, $W_2$ is well-defined even when an optimal transport map does not exist. In particular, it is defined as

$$W_2^2(P,Q)\ \triangleq\ \inf_{\pi\in\Pi(P,Q)}\ \tfrac{1}{2}\,\mathbb{E}_{(X,Y)\sim\pi}\|X-Y\|^2, \qquad (2)$$

where $\Pi(P,Q)$ denotes the set of all joint probability distributions (or equivalently, couplings) whose first and second marginals are $P$ and $Q$, respectively. Any coupling achieving the infimum is called the optimal coupling. Eq. (2) is also referred to as the primal formulation for the Wasserstein-2 distance. Kantorovich also provided a dual formulation for (2), well-known as the Kantorovich duality theorem (villani2003topics, , Theorem 1.3), given by

$$W_2^2(P,Q)\ =\ \sup_{(f,g)\in\Phi_c}\ \mathbb{E}_P[f(X)]+\mathbb{E}_Q[g(Y)], \qquad (3)$$

where $\Phi_c$ denotes the constrained space of functions, defined as $\Phi_c\triangleq\big\{(f,g)\in L^1(P)\times L^1(Q)\,:\, f(x)+g(y)\le\tfrac{1}{2}\|x-y\|^2\ \text{for all }x,y\big\}$.
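In one dimension, the primal problem (2) has a closed-form solution: the monotone (quantile) coupling, obtained by matching sorted samples, is optimal for the quadratic cost. The snippet below (an illustration, not part of the paper's method) estimates $W_2^2$ between two unit-variance Gaussians from samples and recovers the closed-form value $\tfrac{1}{2}(m_1-m_2)^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 1.0, n)   # samples from P = N(0, 1)
y = rng.normal(2.0, 1.0, n)   # samples from Q = N(2, 1)

# Monotone coupling: sort both sample sets and pair them up.
# For the quadratic cost this coupling is optimal in 1-D.
w2_sq = 0.5 * np.mean((np.sort(x) - np.sort(y)) ** 2)
print(w2_sq)  # close to 0.5 * (2 - 0)^2 = 2.0
```

This monotone-rearrangement shortcut is specific to one dimension; in higher dimensions no such ordering exists, which is what makes the problem hard.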

As there is no easy way to ensure feasibility of the constraints along the gradient updates, the common approach is to translate the optimization into a tractable form, while sacrificing the original goal of finding the optimal transport cuturi2013sinkhorn . Concretely, an entropic or a quadratic regularizer is added to (2). This makes the dual an unconstrained problem, which can be numerically solved using the Sinkhorn algorithm cuturi2013sinkhorn or stochastic gradient methods genevay2016stochastic ; seguy2017large . The optimal transport can then be obtained from the optimal dual solutions $f$ and $g$, using the first-order optimality conditions of the Fenchel-Rockafellar duality theorem.
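As a point of reference, the entropic-regularized discrete problem can be solved by Sinkhorn's alternating scaling iterations cuturi2013sinkhorn . The sketch below (illustrative only; this paper deliberately avoids such regularization) recovers an approximately optimal plan between two 2-point distributions, at the cost of an $O(\varepsilon)$ bias:

```python
import numpy as np

# Source P uniform on {0, 1}; target Q uniform on {2, 3}; cost 0.5*|x-y|^2.
x = np.array([0.0, 1.0])
y = np.array([2.0, 3.0])
p = np.array([0.5, 0.5])
q = np.array([0.5, 0.5])
C = 0.5 * (x[:, None] - y[None, :]) ** 2

eps = 0.1                      # entropic regularization strength
K = np.exp(-C / eps)           # Gibbs kernel
u = np.ones_like(p)
v = np.ones_like(q)
for _ in range(2000):          # alternate marginal-matching scalings
    u = p / (K @ v)
    v = q / (K.T @ u)

plan = u[:, None] * K * v[None, :]
cost = np.sum(plan * C)
print(cost)  # close to the unregularized optimum 2.0, biased by O(eps)
```

Decreasing `eps` reduces the bias but requires log-domain stabilization to avoid underflow in `exp(-C/eps)`; this bias is exactly what motivates the regularization-free formulation pursued in this paper.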

In this paper, we take a different approach and aim to solve the dual problem without introducing a regularization. This idea was considered classically in chartrand2009gradient and more recently in guo2019mode and taghvaei20192 . The classical approach relies on exact knowledge of the density, which is not available in practice. The approach in guo2019mode relies on the discrete Brenier theory, which is computationally expensive and not scalable. The most closely related work to ours is taghvaei20192 , with which we provide a formal comparison in Section 3.

## 3 A novel minimax formulation to learn optimal transport

Our goal is to learn the optimal transport map between $P$ and $Q$, from samples drawn from each. We use the fundamental connection between optimal transport and the Kantorovich dual in Theorem 3.2 to formulate learning the transport map as a problem of estimating $W_2^2(P,Q)$. However, $W_2^2(P,Q)$ is notoriously hard to estimate. The standard Kantorovich dual formulation in Eq. (3) involves a supremum over a set with infinitely many constraints, onto which it is challenging to even approximately project. To this end, we derive an alternative optimization formulation in Eq. (6), inspired by the convexification trick (villani2003topics, , Section 2.1.2). This allows us to eliminate the distance constraint of (3), and instead constrain our search to all convex functions. This constrained optimization can now be seamlessly integrated with recent advances in designing deep neural architectures with convexity guarantees. This leads to a novel minimax optimization to learn the optimal transport.

We exploit the fundamental properties of $W_2^2$ and the corresponding optimal transport to reparametrize the optimization formulation. Note that for any $(f,g)$, we have

$$f(x)+g(y)\le\tfrac{1}{2}\|x-y\|_2^2\ \iff\ \Big[\tfrac{1}{2}\|x\|_2^2-f(x)\Big]+\Big[\tfrac{1}{2}\|y\|_2^2-g(y)\Big]\ \ge\ \langle x,y\rangle.$$

Hence, reparametrizing $f$ and $g$ by $\tfrac{1}{2}\|x\|_2^2-f(x)$ and $\tfrac{1}{2}\|y\|_2^2-g(y)$ respectively, and substituting them in (3), the dual form modifies to

$$W_2^2(P,Q)\ =\ \tfrac{1}{2}\mathbb{E}_P\|X\|_2^2+\tfrac{1}{2}\mathbb{E}_Q\|Y\|_2^2\ -\ \inf_{(f,g)\in\tilde{\Phi}_c}\ \mathbb{E}_P[f(X)]+\mathbb{E}_Q[g(Y)], \qquad (4)$$

where $\tilde{\Phi}_c\triangleq\big\{(f,g)\in L^1(P)\times L^1(Q)\,:\, f(x)+g(y)\ge\langle x,y\rangle\ \text{for all }x,y\big\}$. While the above constrained optimization problem involves a pair of functions $(f,g)$, it can be transformed into the following form involving only a single convex function $f$, thanks to (villani2003topics, , Theorem 2.9):

###### Theorem 3.1.

For any two probability distributions $P$ and $Q$ on $\mathbb{R}^d$ with finite second order moments, we have that

$$\inf_{(f,g)\in\tilde{\Phi}_c}\ \mathbb{E}_P[f(X)]+\mathbb{E}_Q[g(Y)]\ =\ \inf_{f\in\mathrm{CVX}(\mathbb{R}^d)}\ \mathbb{E}_P[f(X)]+\mathbb{E}_Q[f^*(Y)],$$

where $f^*(y)\triangleq\sup_{x}\ \langle x,y\rangle-f(x)$ is the convex conjugate of $f$.
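As a quick sanity check on this definition (our illustration, not from the paper): the quadratic $f(x)=\tfrac{1}{2}x^2$ is its own conjugate, which a brute-force maximization over a fine grid confirms:

```python
import numpy as np

def conjugate(f, y, grid):
    """Brute-force convex conjugate f*(y) = sup_x <x, y> - f(x) over a 1-D grid."""
    return np.max(grid * y - f(grid))

grid = np.linspace(-10.0, 10.0, 200_001)   # fine grid on [-10, 10]
f = lambda x: 0.5 * x ** 2

# f(x) = x^2/2 is self-conjugate: f*(y) = y^2/2.
for y in [0.0, 1.5, -2.0]:
    print(y, conjugate(f, y, grid))  # matches 0.5 * y**2
```

This brute-force evaluation is exactly the inner optimization the paper seeks to avoid: in high dimensions, computing $f^*$ pointwise requires solving an optimization problem per query.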

This single-function representation does not make the problem any less challenging, as the definition of $f^*$ itself involves solving an optimization. However, the fact that we only need to search over convex $f$ is surprising, and it leads to the novel representation of the optimal transport as the gradient of a convex function in the following. In view of Theorem 3.1, we have from (4) that

$$W_2^2(P,Q)\ =\ \tfrac{1}{2}\mathbb{E}_P\|X\|_2^2+\tfrac{1}{2}\mathbb{E}_Q\|Y\|_2^2\ -\ \inf_{f\in\mathrm{CVX}(\mathbb{R}^d)}\Big\{\mathbb{E}_P[f(X)]+\mathbb{E}_Q[f^*(Y)]\Big\}. \qquad (5)$$

The crucial tools behind our formulation are the following celebrated results due to Knott-Smith and Brenier villani2003topics , which relate the optimal solutions of the dual form (5) and the primal form (2).

###### Theorem 3.2 ((villani2003topics, , Theorem 2.12)).

Let $P,Q$ be any two probability distributions on $\mathbb{R}^d$ with finite second order moments. Then,

1. (Knott-Smith optimality criterion) A coupling $\pi\in\Pi(P,Q)$ is optimal for the primal (2) if and only if there exists a convex function $f\in\mathrm{CVX}(\mathbb{R}^d)$ such that the support of $\pi$ is contained in the graph of the subdifferential $\partial f$. Equivalently, for $\pi$-almost all $(x,y)$, we have $y\in\partial f(x)$. Moreover, the pair $(f,f^*)$ achieves the minimum in the dual form (5).

2. (Brenier’s theorem) If $Q$ admits a density with respect to the Lebesgue measure on $\mathbb{R}^d$, then there is a unique optimal coupling $\pi$ for the primal problem. In particular, the optimal coupling satisfies

$$d\pi(x,y)\ =\ dQ(y)\,\delta_{x=\nabla f^*(y)},$$

where the convex pair $(f,f^*)$ achieves the minimum in the dual problem (5). Equivalently, $\nabla f^*\#Q=P$.

3. Under the above assumptions of Brenier’s theorem, $\nabla f^*$ is the unique solution to the Monge transportation problem from $Q$ to $P$, i.e.

$$\mathbb{E}_Q\|\nabla f^*(Y)-Y\|^2\ =\ \inf_{T:\,T\#Q=P}\ \mathbb{E}_Q\|T(Y)-Y\|^2.$$
###### Remark 3.3.

Whenever $Q$ admits a density, we refer to $\nabla f^*$ as the optimal transport map.

Henceforth, throughout the paper we assume that the distribution $Q$ admits a density, and discuss algorithmic implications of this assumption in Section 4.5. In view of Theorem 3.2, any optimal pair $(f,f^*)$ from the dual formulation (5) provides us an optimal transport map $\nabla f^*$ pushing forward $Q$ onto $P$. We show in the following theorem that the conjugate function $f^*$ can be replaced by a minimax optimization involving two convex functions.

###### Theorem 3.4.

Whenever $Q$ admits a density, we have that

$$W_2^2(P,Q)\ =\ C_{P,Q}\ -\ \inf_{\substack{f\in\mathrm{CVX}(\mathbb{R}^d)\cap L^1(P),\\ f^*\in L^1(Q)}}\ \sup_{g\in\mathrm{CVX}(\mathbb{R}^d)\cap L^1(Q)}\ \mathbb{E}_P[f(X)]+\mathbb{E}_Q\big[\langle Y,\nabla g(Y)\rangle-f(\nabla g(Y))\big],$$

where $C_{P,Q}\triangleq\tfrac{1}{2}\mathbb{E}_P\|X\|^2+\tfrac{1}{2}\mathbb{E}_Q\|Y\|^2$ is a constant independent of $(f,g)$. In addition, there exists an optimal pair $(\bar f,\bar g)$ achieving the infimum and supremum respectively, where $\nabla\bar g$ is the optimal transport map pushing forward $Q$ onto $P$.

Rearranging the terms in Theorem 3.4, we get

$$W_2^2(P,Q)\ =\ \sup_{\substack{f\in\mathrm{CVX}(\mathbb{R}^d)\cap L^1(P),\\ f^*\in L^1(Q)}}\ \inf_{g\in\mathrm{CVX}(\mathbb{R}^d)\cap L^1(Q)}\ \mathbb{E}\big[f(\nabla g(Y))-f(X)\big]-\mathbb{E}\big[\langle Y,\nabla g(Y)\rangle\big]+C_{P,Q}, \qquad (6)$$

where $C_{P,Q}=\tfrac{1}{2}\mathbb{E}_P\|X\|^2+\tfrac{1}{2}\mathbb{E}_Q\|Y\|^2$. This provides a principled approach to learn the optimal transport map $\nabla\bar g$ as a solution of a minimax optimization.
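The minimax formulation can be verified in closed form on a toy example (our illustration, not from the paper). For 1-D centered Gaussians $P=\mathcal{N}(0,s^2)$ and $Q=\mathcal{N}(0,1)$, restricting to quadratic potentials $f_a(x)=ax^2/2$ and $g_b(y)=by^2/2$ gives the population objective $J(a,b)=ab^2/2-as^2/2-b+(s^2+1)/2$, whose max-min value is exactly $W_2^2=\tfrac{1}{2}(s-1)^2$, attained at the saddle $a=1/s$, $b=s$ (so that $\nabla g_b(y)=sy$ pushes $Q$ onto $P$):

```python
import numpy as np

s = 2.0   # P = N(0, s^2), Q = N(0, 1)

def J(a, b):
    # Objective (6) under quadratic potentials f_a, g_b, with the Gaussian
    # expectations evaluated in closed form:
    # E[f_a(grad g_b(Y))] = a*b^2/2,  E[f_a(X)] = a*s^2/2,
    # E[<Y, grad g_b(Y)>] = b,  and  C_{P,Q} = (s^2 + 1)/2.
    return a * b**2 / 2 - a * s**2 / 2 - b + (s**2 + 1) / 2

a_grid = np.linspace(0.05, 2.0, 1951)
b_grid = np.linspace(0.05, 4.0, 3951)

# Grid search for the saddle value: max over a of (min over b).
inner = np.array([J(a, b_grid).min() for a in a_grid])
value = inner.max()
a_star = a_grid[inner.argmax()]

print(value)   # close to 0.5 * (s - 1)^2 = 0.5
print(a_star)  # close to 1/s = 0.5
```

The inner minimizer is $b=1/a$ and the outer maximizer is $a=1/s$, so the grid search lands on the saddle predicted by the theory.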

Since the optimization formulation in (6) involves a search over the space of convex functions, we utilize recent advances in input convex neural networks to parametrize them in Section 3.1.

### 3.1 Minimax optimization over input convex neural networks

We propose using parametric models based on deep neural networks to approximate the set of convex functions. These are known as input convex neural networks amos2016input , denoted by $\mathrm{ICNN}(\mathbb{R}^d)$. We propose estimating the following approximate Wasserstein-2 distance, from samples:

$$\widetilde{W}_2^2(P,Q)\ =\ \sup_{f\in\mathrm{ICNN}(\mathbb{R}^d)}\ \inf_{g\in\mathrm{ICNN}(\mathbb{R}^d)}\ \mathbb{E}\big[f(\nabla g(Y))-f(X)\big]-\mathbb{E}\big[\langle Y,\nabla g(Y)\rangle\big]+C_{P,Q}. \qquad (7)$$

Input convex neural networks (ICNNs) are a class of scalar-valued neural networks whose input-output mapping is convex. The neural network architecture for an ICNN is as follows. Given an input $x\in\mathbb{R}^d$, the mapping $x\mapsto f(x;\theta)$ is given by an $L$-layer feed-forward network using the following equations for $l=0,1,\ldots,L-1$:

$$z_{l+1}\ =\ \sigma_l\big(W_l z_l + A_l x + b_l\big),\qquad f(x;\theta)=z_L,$$

where $\{W_l\}$ and $\{A_l\}$ are weight matrices (with the convention that $W_0=0$), and $\{b_l\}$ are the bias terms. $\sigma_l$ denotes the entry-wise activation function at layer $l$. This is illustrated in Figure 2. We denote the total set of parameters by $\theta$. It follows from (amos2016input, , Proposition 1) that $f(x;\theta)$ is convex in $x$ provided that

1. all entries of the weights $\{W_l\}$ are non-negative,

2. the activation function $\sigma_0$ is convex, and

3. the activation function $\sigma_l$ is convex and non-decreasing, for $l\ge 1$.

While ICNNs are a specific parametric class of convex functions, it is important to understand whether this class is representationally rich enough. This is answered positively by (chen2018optimal, , Theorem 1). In particular, they show that any convex function over a compact domain can be approximated in sup norm by an ICNN to the desired accuracy. This justifies the choice of ICNNs as a suitable approximating class for convex functions.
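A minimal NumPy sketch of the ICNN recurrence above (an illustration with arbitrary random weights and hypothetical sizes, not the trained networks of this paper): non-negative $W_l$, a convex first activation, and convex non-decreasing later activations make $x\mapsto f(x;\theta)$ convex, which we spot-check with the midpoint inequality along random pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.2  # leak parameter (an arbitrary choice for this sketch)

def leaky_relu(u):             # convex and non-decreasing
    return np.maximum(u, alpha * u)

def sq_leaky_relu(u):          # convex (but not monotone); first layer only
    return leaky_relu(u) ** 2

d, h, L = 2, 8, 3
A = [rng.normal(size=(h, d)) for _ in range(L)]              # unconstrained
W = [np.abs(rng.normal(size=(h, h))) for _ in range(L - 1)]  # non-negative
b = [rng.normal(size=h) for _ in range(L)]
w_out = np.abs(rng.normal(size=h))                           # non-negative

def icnn(x):
    z = sq_leaky_relu(A[0] @ x + b[0])        # sigma_0: convex (W_0 = 0)
    for l in range(1, L):
        z = leaky_relu(W[l - 1] @ z + A[l] @ x + b[l])
    return w_out @ z                          # scalar output f(x)

# Midpoint convexity check: f((x+y)/2) <= (f(x) + f(y)) / 2.
ok = all(
    icnn((x + y) / 2) <= (icnn(x) + icnn(y)) / 2 + 1e-9
    for x, y in (rng.normal(size=(2, d)) for _ in range(1000))
)
print(ok)  # True for any non-negative W and these activations
```

The non-negativity of `W` and `w_out` is what guarantees convexity; removing `np.abs` breaks the midpoint check for generic weights.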

Remark 1. The proposed framework for learning the optimal transport provides a novel training method for deep generative models, where the generator is modeled as the gradient of a convex function and the minimax optimization in (7) (and more concretely the update rule in Algorithm 1) provides the training rule. On the surface, Eq. (7) resembles the minimax optimization of generative adversarial networks based on the Wasserstein-1 distance arjovsky2017wasserstein , called WGAN. However, there are several critical differences making our approach attractive.

First, because WGANs use the optimal transport distance only as a measure of distance, the learned transport map from the latent source to the target is arbitrary (see Figure 4) jacob2018w2gan . Hence, it is sensitive to the initialization. On the other hand, our method finds the optimal transport map and converges to the same mapping regardless of the initialization (see Figure 1). Further, this makes the training stable.

Secondly, in a WGAN architecture arjovsky2017wasserstein ; petzka2017regularization , the transport map (which is the generator) is represented by a neural network, which is a continuous mapping. Although a discontinuous map can be approximated arbitrarily closely by continuous neural networks, such a construction requires large weights, making training unstable. On the other hand, by representing the transport map with the gradient of a neural network (equipped with ReLU-type activation functions), our proposed method obtains a naturally discontinuous map. As a consequence, we have sharp transitions from one part of the support to the other, whereas GANs (including WGANs) suffer from spurious probability masses that are not present in the target. This is illustrated in Section 4.4. The same holds for regularization-based methods for learning optimal transport genevay2016stochastic ; seguy2017large , where the transport map is parametrized by neural networks.

Remark 2. In a recent work taghvaei20192 , ICNNs have been proposed for learning the Kantorovich dual. The main difference is that taghvaei20192 proposes solving a single maximization over a convex function $f$, where the transport map is given by the gradient of the conjugate $f^*$. The conjugate needs to be computed by solving a convex optimization problem for all samples at each iteration, which is challenging to scale to large datasets. Instead, we propose a min-max formulation to learn the conjugate function in a scalable fashion.

## 4 Experiments

We use the following synthetic datasets: (i) checker-board in Figure 1, (ii) eight Gaussians in Figure 3, and (iii) circle in Figure 3, to show that the proposed approach converges to a global optimum and finds an optimal transport map (Section 4.2), whereas other Wasserstein distance based methods converge to local minima and find arbitrary transport maps (Section 4.3). Further, the proposed method can naturally represent the desired discontinuous transport maps, whereas competing approaches are restricted to continuous maps, resulting in spurious probability masses (Section 4.4). We further introduce a heuristic to train our model on data supported on lower-dimensional manifolds, to learn an optimal stochastic transport mapping (Section 4.5).

### 4.1 Experimental Setup

Neural Network Architecture: We use ICNNs to parametrize both convex functions $f$ and $g$, as shown in Figure 2. Both networks have an equal number of nodes in all the hidden layers, followed by a final output layer. We choose the square of a leaky ReLU, i.e. $\sigma_0(u)=\max(u,\alpha u)^2$ with a small positive constant $\alpha$, as the convex activation function for the first layer. For the remaining layers, we use the leaky ReLU function, i.e. $\sigma_l(u)=\max(u,\alpha u)$ for $l\ge 1$, as the monotonically non-decreasing and convex activation function. Note that this choice of activations naturally satisfies assumptions (ii)-(iii) of the ICNN. In all of our experiments, we use the same value of the parameter $\alpha$.

The intuition behind choosing $\sigma_0$ to be the square of the leaky ReLU function is as follows. Since the learned optimal transport map is given by $\nabla g$, where $g$ is an ICNN, $\nabla g$ is a piecewise constant map when every activation is the plain leaky ReLU. In the case of a piecewise constant map, the gradient maps the domain of each piece to a single point. Clearly, this is not desirable, as it results in generalization issues. On the other hand, using the squared version mitigates this problem, since the resulting gradient is piecewise affine.

Optimization Procedure: The weights $\{W_l\}$ and $\{A_l\}$ are initialized from a uniform distribution, and the bias terms are initialized at zero. To ensure convexity, we need to restrict all weights $\{W_l\}$ to be non-negative (assumption (i) of the ICNN). We enforce this strictly for $f$, as a non-convex $f$ can cause the maximization over $\theta_f$ to diverge, making the optimization unstable. However, for $g$, we relax this constraint and instead introduce a regularization term

$$R(\theta_g)\ =\ \lambda\sum_{w\in\{W_l\}\subset\theta_g}\max(-w,0)^2,$$

where $\lambda$ is the regularization constant and the summation is over all matrix entries of the weight matrices of the network $g$. This relaxation makes the optimization algorithm converge faster.

For both the maximization and minimization updates in (7), we use Adam kingma2014adam . At each iteration, we draw a batch of $M$ samples from $P$ and $Q$, denoted by $\{X_i\}_{i=1}^M$ and $\{Y_i\}_{i=1}^M$ respectively. The objective (7) thus translates to

$$\max_{\theta_f:\,W_l\ge 0,\ \forall\, l\in[L-1]}\ \min_{\theta_g}\ J(\theta_f,\theta_g)\ =\ \frac{1}{M}\sum_{i=1}^{M}\Big[f(\nabla g(Y_i))-\langle Y_i,\nabla g(Y_i)\rangle-f(X_i)\Big]+R(\theta_g), \qquad (8)$$

where $\theta_f,\theta_g$ are the parameters of $f$ and $g$ respectively, and $W_l\ge 0$ is an entry-wise constraint. This is summarized in Algorithm 1. For more experimental details, please refer to Appendix B.
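To illustrate the flavor of the alternating updates (a toy sketch, not the paper's Algorithm 1): for 1-D Gaussians $P=\mathcal{N}(0,s^2)$, $Q=\mathcal{N}(0,1)$ and quadratic potentials $f_a(x)=ax^2/2$, $g_b(y)=by^2/2$, the population objective is $J(a,b)=ab^2/2-as^2/2-b+(s^2+1)/2$, and simple inner-descent/outer-ascent steps converge to the saddle $a=1/s$, $b=s$, i.e. the optimal map $\nabla g_b(y)=sy$:

```python
# Alternating gradient descent-ascent on J(a, b), mimicking the structure of
# the minimax update: several inner minimization steps on g's parameter b,
# then one outer maximization step on f's parameter a.
s = 2.0          # P = N(0, s^2), Q = N(0, 1); optimal map is y -> s*y
a, b = 1.0, 1.0  # initialization
lr_a, lr_b = 0.05, 0.1

for _ in range(2000):
    for _ in range(10):              # inner "generator" iterations on g
        b -= lr_b * (a * b - 1.0)    # dJ/db = a*b - 1
    a += lr_a * (b * b - s * s) / 2  # dJ/da = b^2/2 - s^2/2
    a = max(a, 1e-3)                 # crude analogue of keeping f convex

print(a, b)  # converges to (1/s, s) = (0.5, 2.0)
```

At the saddle, $J(1/s,s)=\tfrac{1}{2}(s-1)^2$ recovers $W_2^2(P,Q)$; in the full algorithm, the scalars $a,b$ are replaced by ICNN parameters, the expectations by mini-batch averages, and the crude clipping by the entry-wise constraint $W_l\ge 0$ plus $R(\theta_g)$.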

### 4.2 Learning the optimal transport map

We now present experimental results suggesting that our proposed procedure indeed learns the optimal transport map. First, we consider the checker-board example. The source distribution (orange) and the target distribution (green) are illustrated in Figure 1. The learned optimal transport map, along with the pairs of generated and transported samples, is depicted in Figure 1. Each line connects a sample to its transported version. In order to illustrate that the learned convex function captures the essential features of the problem, the displacement vector field and the level sets of $g$ are also depicted in Figure 1.

In the second experiment, we consider a mixture of eight Gaussians, as depicted in Figure 3. The simulation parameters are the same as those of the checker-board example, except for the batch size. The learned optimal transport map, along with the generated samples, is depicted in Figure 3.

### 4.3 Comparisons to Wasserstein-1 GANs

We numerically show that W1-GANs find arbitrary transport maps, and hence are sensitive to initialization, as discussed in Section 3. This is in stark contrast with the proposed approach, which finds the optimal transport independent of how we initialize it. We consider the checker-board example (Figure 1) and run the state-of-the-art Wasserstein GAN introduced in petzka2017regularization , named W1-LP. The resulting transport maps for three different random initializations are depicted in Figure 4. In the second experiment, we consider the eight Gaussians example and run the W1-LP algorithm petzka2017regularization . Figure 4 includes the resulting transport map.

### 4.4 Regularized OT methods and GANs cannot learn discontinuous transport maps

The power to represent a discontinuous transport mapping is what fundamentally sets our proposed method apart from existing approaches, as discussed in Section 3. Two prominent approaches for learning transport maps are generative adversarial networks arjovsky2017wasserstein ; petzka2017regularization and regularized optimal transport learning methods genevay2016stochastic ; seguy2017large . In both cases, the transport map is modeled by a standard neural network with finite depth and width, which is a continuous function. As a consequence, such continuous transport maps suffer from unintended and undesired spurious probability mass that connects the disjoint components of the support of the target probability distribution.

First, standard GANs, including the original GAN goodfellow2014generative and variants of WGAN arjovsky2017wasserstein ; gulrajani2017improved ; wei2018improving , all suffer from spurious probability masses. Even those designed to tackle such spurious probability masses, like PacGAN lin2018pacgan , cannot overcome the barrier of continuous neural networks. This suggests that a fundamental change in the architecture, like the one we propose, is necessary. Figure 5 illustrates the same scenario for the transport map learned through the WGAN framework discussed in Section 4.3. We can observe the trailing dots of spurious probability masses, resulting from the undesired continuity of the learned transport maps.

Similarly, regularization methods to approximate optimal transport maps, explained in Section 2, suffer from the same phenomenon. Representing a transport map with an inherently continuous function class results in spurious probability masses connecting disjoint supports. Figure 5 illustrates those trailing dots of spurious masses for the transport maps learned by the algorithm introduced in seguy2017large with the default hyper-parameter choices, with quadratic and entropic regularization respectively.

On the other hand, we represent the transport map with the gradient of a neural network (equipped with non-smooth ReLU-type activation functions). The resulting transport map can naturally represent discontinuous transport maps, as illustrated in Figures 1 and 5. The vector field of the learned transport map in Figure 1 clearly shows the discontinuity of the learned optimal transport.

### 4.5 Supporting low-dimensional manifolds in high-dimensions

In typical training of deep generative models, we desire to capture the low inherent dimensionality of natural distributions in high dimensions (such as the manifold of faces in high-resolution images). This is enforced by restricting the source distribution to a dimension much smaller than the ambient dimension of the samples. We cannot directly apply our approach in this setting, as it critically relies on the fact that the source distribution admits a density, both in theory (Remark 3.3) and in practice.

For example, consider a one-dimensional Gaussian distribution as the source distribution and the uniform distribution on a circle as the target distribution, as illustrated in Figure 3. According to Brenier's theorem, a (deterministic) optimal transport map exists when at least one of the distributions admits a density. However, in this example, neither distribution admits a density in the ambient space.

Inspired by the Kantorovich relaxation, we propose to search for stochastic mappings instead. Recall that $X$ denotes the source random variable and $Y$ denotes the target random variable. We augment the source variable with an independent standard Gaussian random variable $Z$, so that the augmented variable $(X,Z)$ admits a density in the ambient space. Then, we consider the optimal transport problem between the augmented source distribution and the target distribution, which now admits a deterministic optimal transport map $T$. We propose the coupling induced by the stochastic map $X\mapsto T(X,Z)$ as an approximation for the optimal coupling. The resulting coupling is depicted in Figure 3, where we learned an optimal stochastic transport map.
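The augmentation step can be checked numerically (an illustration with hypothetical dimensions, not the paper's experiment): a one-dimensional Gaussian embedded in the plane is supported on a line, so its covariance is singular, while appending an independent Gaussian coordinate yields a full-rank covariance, i.e. a distribution that admits a density:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

x = rng.normal(size=n)                    # 1-D Gaussian source X
embedded = np.stack([x, np.zeros(n)], 1)  # embedded in R^2: supported on a line
augmented = np.stack([x, rng.normal(size=n)], 1)  # (X, Z) with independent Z

rank_embedded = np.linalg.matrix_rank(np.cov(embedded.T))
rank_augmented = np.linalg.matrix_rank(np.cov(augmented.T))
print(rank_embedded, rank_augmented)  # 1 2
```

The degenerate covariance is exactly why Brenier's theorem fails to apply before augmentation.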

## 5 Further discussion of related work

The idea of solving the semi-dual optimization problem (5) was classically considered in chartrand2009gradient , where the authors derive a formula for the functional derivative of the objective with respect to $f$ and propose to solve the optimization problem with the gradient descent method. Their approach is based on a discretization of the space and knowledge of the explicit form of the probability density functions, which is not applicable to real-world high-dimensional problems.

More recently, the authors in lei2017geometric ; guo2019mode propose to learn the function $f$ in a semi-discrete setting, where one of the marginals is assumed to be a discrete distribution supported on a set of points $\{y_i\}_{i=1}^n$, and the other marginal is assumed to have a continuous density with compact convex support $\Omega$. They show that the problem of learning the function $f$ is similar to the variational formulation of the Alexandrov problem: constructing a convex polytope with prescribed face normals and volumes. Moreover, they show that, in the semi-discrete setting, the optimal $f$ is a max-affine function of the form $f(x)=\max_i\,[\langle x,y_i\rangle+h_i]$, which simplifies learning $f$ to learning the $n$ real numbers $\{h_i\}_{i=1}^n$. However, the objective function involves computing the polygonal partition of $\Omega$ into convex cells induced by the function $f$, which is computationally challenging. Moreover, the learned optimal transport map $\nabla f$ transports the probability mass of each convex cell to a single point $y_i$, which results in generalization issues. Additionally, the proposed approach is semi-discrete, and as a result, does not scale with the number of samples.

Statistical analysis of learning the optimal transport map through the semi-dual optimization problem (5) is studied in hutter2019minimax ; rigollet2018uncoupled , where the authors establish a minimax convergence rate with respect to the number of samples for certain classes of regular probability distributions. They also propose a procedure that achieves the optimal convergence rate, which involves representing the function $f$ with a span of wavelet basis functions up to a certain order, while also requiring $f$ to be convex. However, they do not provide a computational algorithm to implement the procedure.

The approach proposed in this paper builds upon the recent work taghvaei20192 , where the proposal to solve the semi-dual optimization problem (5) by representing the function $f$ with an ICNN appeared for the first time. The procedure in taghvaei20192 involves solving a convex optimization problem to compute the convex conjugate $f^*$ for each sample in the batch, at each optimization iteration. This becomes computationally challenging to scale to large datasets. In this paper, we instead propose the minimax formulation (6) to learn the convex conjugate function in a scalable fashion.

There are also alternative approaches to approximate the optimal transport map that are not based on solving the semi-dual optimization problem (5). In leygonie2019adversarial , the authors propose to approximate the optimal transport map through an adversarial computational procedure, by considering the dual optimization problem (3) and replacing the constraint with a quadratic penalty term. In contrast to other regularization-based approaches such as seguy2017large , they consider a GAN architecture and propose to take the generator, after training is finished, as the optimal transport map. They also provide a theoretical justification for their proposal; however, it is valid only in an ideal setting where the generator has infinite capacity, the discriminator is optimal at each update step, and the cost equals the exact Wasserstein distance. These ideal conditions are far from true in a practical setting.

Another approach, proposed in xie2019scalable , is based on a generative learning framework to approximate the optimal coupling, instead of the optimal transport map. The approach involves a low-dimensional latent random variable, two generators that take the latent variable as input and map it to the high-dimensional space where the real data resides, and two discriminators that respectively take as inputs the real data and the outputs of the generators. Although the proposed approach is attractive when an optimal transport map does not exist, it is computationally expensive, as it involves learning four deep neural networks, and it suffers from the unused-capacity issues that the WGAN architecture suffers from gulrajani2017improved .

Finally, a procedure was recently proposed to approximate a transport map that is optimal only on a subspace projection instead of the entire space muzellec2019subspace . This approach is inspired by the sliced Wasserstein distance method for approximating the Wasserstein distance rabin2011wasserstein ; deshpande2018generative . However, selecting the subspace to project onto is a non-trivial task, and optimally selecting the projection is an optimization over the Grassmann manifold, which is computationally challenging.

## Appendix A Proof of Theorem 3.4

Define $V(f,g)\triangleq\mathbb{E}_P[f(X)]+\mathbb{E}_Q[\langle Y,\nabla g(Y)\rangle-f(\nabla g(Y))]$. The main step of the proof is to show that $\sup_{g}\ \mathbb{E}_Q[\langle Y,\nabla g(Y)\rangle-f(\nabla g(Y))]=\mathbb{E}_Q[f^*(Y)]$. Then the conclusion follows from Theorem 3.1. To prove this, note that for all $g\in\mathrm{CVX}(\mathbb{R}^d)\cap L^1(Q)$, we have

$$\langle y,\nabla g(y)\rangle-f(\nabla g(y))\ \le\ \langle y,\nabla f^*(y)\rangle-f(\nabla f^*(y))\ =\ f^*(y),$$

for all $y$ such that $f^*$ and $g$ are differentiable at $y$. We now claim that both $f^*$ and $g$ are differentiable $Q$-almost everywhere (a.e.). If the claim is true, then upon taking the expectation w.r.t. $Q$, we obtain $\mathbb{E}_Q[\langle Y,\nabla g(Y)\rangle-f(\nabla g(Y))]\le\mathbb{E}_Q[f^*(Y)]$, and the inequality is achieved with $g=f^*$. Hence, the supremum equals $\mathbb{E}_Q[f^*(Y)]$. Now we prove the claim as follows. Since $f^*\in L^1(Q)$, we have $f^*<\infty$ $Q$-a.e. Thus $Q(D)=1$, where $D$ is the domain of the function $f^*$. Moreover, $Q(\mathrm{int}(D))=1$, where $\mathrm{int}$ denotes the interior, because the boundary of $D$ has $Q$-measure zero ($Q$ has a density). Since $f^*$ is convex, it is differentiable on $\mathrm{int}(D)$ except at points of Lebesgue measure zero, which have $Q$-measure zero too. Therefore, $f^*$ is $Q$-a.e. differentiable. Similar arguments hold for $g$.

## Appendix B Experimental set-up

For reproducibility, we provide the details of the numerical experiments.

In Section 4.2 and Figures 1 and 3, we run Algorithm 1, specifying the hidden layer size, number of layers, regularization constant, batch size, learning rate, number of inner (generator) iterations, and the total number of iterations.

In Section 4.3 and Figure 4, we use the state-of-the-art Wasserstein GAN introduced in [PFL17], named W1-LP, with fully connected architectures for both the generator and the discriminator. The simulation parameters are the hidden size, number of layers, learning rate, batch size, generator iterations, and total number of iterations; the algorithm in [PFL17] also includes a regularization constant. The resulting transport maps for three different random initializations are depicted in Figure 4.

In Section 4.4 and Figure 5, we run the algorithm introduced in [SDF17] with its parameters (number of samples, batch size, entropic regularization value, quadratic regularization value, first and second step learning rates, first and second step epochs, and neural network size) set to the default values reported in [SDF17]. We implemented step one in the discrete setting, because the continuous setting led to numerical instabilities. Also, decreasing the regularization parameter results in the same numerical issues.

In Section 4.5 and Figure 3, we carry out the proposed procedure for the circle example. The simulation parameters are the hidden layer size, number of layers, batch size, learning rate, regularization constant, generator iterations, and total iterations. The resulting coupling is depicted in Figure 3.