1 Introduction
Mapping one distribution to another
Given two random variables
and taking values in and respectively, the problem of finding a map such that and have the same distribution, denoted henceforth, finds applications in many areas. For instance, in domain adaptation, given a source dataset and a target dataset with different distributions, the use of a mapping to align the source and target distributions is a natural formulation (Gopalan et al., 2011) since theory has shown that generalization depends on the similarity between the two distributions (BenDavid et al., 2010). Current stateoftheart methods for computing generative models such as generative adversarial networks (Goodfellow et al., 2014), generative moments matching networks
(Li et al., 2015) or variational auto encoders (Kingma & Welling, 2013) also rely on finding such that . In this setting, the latent variableis often chosen as a continuous random variable, such as a Gaussian distribution, and
is a discrete distribution of real data, e.g. the ImageNet dataset. By learning a map
, sampling from the generative model boils down to simply drawing a sample from and then applying to that sample.Mapping with optimality Among the potentially many maps verifying , it may be of interest to find a map which satisfies some optimality criterion. Given a cost of moving mass from one point to another, one would naturally look for a map which minimizes the total cost of transporting the mass from to . This is the original formulation of Monge (1781), which initiated the development of the optimal transport (OT) theory. Such optimal maps can be useful in numerous applications such as color transfer (Ferradans et al., 2014), shape matching (Su et al., 2015), data assimilation (Reich, 2011, 2013)
, or Bayesian inference
(Moselhy & Marzouk, 2012). In small dimension and for some specific costs, multiscale approaches (Mérigot, 2011) or dynamic formulations (Evans & Gangbo, 1999; Benamou & Brenier, 2000; Papadakis et al., 2014; Solomon et al., 2014) can be used to compute optimal maps, but these approaches become intractable in higher dimension as they are based on space discretization. Furthermore, maps veryfiying might not exist, for instance when is a constant but not . Still, one would like to find optimal maps between distributions at least approximately. The modern approach to OT relaxes the Monge problem by optimizing over plans, i.e. distributions over the product space, rather than maps, casting the OT problem as a linear program which is always feasible and easier to solve. However, even with specialized algorithms such as the network simplex, solving that linear program takes
time, where is the size of the discrete distribution (measure) support.Largescale OT Recently, Cuturi (2013) showed that introducing entropic regularization into the OT problem turns its dual into an easier optimization problem which can be solved using the Sinkhorn algorithm. However, the Sinkhorn algorithm does not scale well to measures supported on a large number of samples, since each of its iterations has an
complexity. In addition, the Sinkhorn algorithm cannot handle continuous probability measures. To address these issues, two recent works proposed to optimize variations of the dual OT problem through stochastic gradient methods.
Genevay et al. (2016) proposed to optimize a “semidual” objective function. However, their approach still requires operations per iteration and hence only scales moderately w.r.t. the size of the input measures. Arjovsky et al. (2017) proposed a formulation that is specific to the socalled Wasserstein distance (unregularized OT using the Euclidean distance as a cost function). This formulation has a simpler dual form with a single variable which can be parameterized as a neural network. This approach scales better to very large datasets and handles continuous measures, enabling the use of OT as a loss for learning a generative model. However, a drawback of that formulation is that the dual variable has to satisfy the nontrivial constraint of being a Lipschitz function. As a workaround, Arjovsky et al. (2017) proposed to use weight clipping between updates of the neural network parameters. However, this makes unclear whether the learned generative model is truly optimized in an OT sense. Besides these limitations, these works only focus on the computation of the OT objective and do not address the problem of finding an optimal map between two distributions.Contributions We present a novel twostep approach for learning an optimal map that satisfies . First, we compute an optimal transport plan, which can be thought as a onetomany map between the two distributions. To that end, we propose a new simple dual stochastic gradient algorithm for solving regularized OT which scales well with the size of the input measures. We provide numerical evidence that our approach converges faster than semidual approaches considered in (Genevay et al., 2016). Second, we learn an optimal map (also referred to as a Monge map) as a neural network by approximating the barycentric projection of the OT plan obtained in the first step. Parameterization of this map with a neural network allows efficient learning and provides generalization outside the support of the input measure. Fig. 1 provides a 2D example showing the computed map between a Gaussian measure and a discrete measure and the resulting density estimation. On the theoretical side, we prove the convergence of regularized optimal plans (resp. barycentric projections of regularized optimal plans) to the optimal plan (resp. Monge map) between the underlying continuous measures from which data are sampled. We demonstrate our approach on domain adaptation and generative modeling.
Notations: We denote and some complete metric spaces. In most applications, these are Euclidean spaces. We denote random variables such as or as capital letters. We use to say that and have the same distribution, and also to say that is distributed according to the probability measure . refers to the support of , a subset of , which is also the set of values which can take. Given and a map defined on ,
is the probability distribution of
. We say that a measure is continuous when it admits a density w.r.t. the Lebesgues measure. We denote id the identity map.2 Background on Optimal Transport
The Monge Problem Consider a cost function , and two random variables and taking values in and respectively. The Monge problem (Monge, 1781) consists in finding a map which transports the mass from to while minimizing the mass transportation cost,
(1) 
Monge originally considered the cost , but in the present article we refer to the Monge problem as Problem (1) for any cost . When is a discrete measure, a map satisfying the constraint may not exist: if is supported on a single point, no such map exists as soon as is not supported on a single point. In that case, the Monge problem is not feasible. However, when , admits a density and is the squared Euclidean distance, an important result by Brenier (1991) states that the Monge problem is feasible and that the infinum of Problem (1) is attained. The existence and uniqueness of Monge maps, also referred to as optimal maps, was later generalized to more general costs (e.g. strictly convex and superlinear) by several authors. With the notable exception of the Gaussian to Gaussian case which has a close form affine solution, computation of Monge maps remains an open problem for measures supported on highdimensional spaces.
Kantorovich Relaxation In order to make Problem (1) always feasible, Kantorovich (1942) relaxed the Monge problem by casting Problem (1) into a minimization over couplings rather than the set of maps, where should have marginals equals to and ,
(2) 
Concretely, this relaxation allows mass at a given point to be transported to several locations , while the Monge problem would send the whole mass at to a unique location . This relaxed formulation is a linear program, which can be solved by specialized algorithms such as the network simplex when considering discrete measures. However, current implementations of this algorithm have a supercubic complexity in the size of the support of and , preventing wider use of OT in largescale settings.
Regularized OT OT regularization was introduced by Cuturi (2013) in order to speed up the computation of OT. Regularization is achieved by adding a negativeentropy penalty (defined in Eq. (5)) to the primal variable of Problem (2),
(3) 
Besides efficient computation through the Sinkhorn algorithm, regularization also makes the OT distance differentiable everywhere w.r.t. the weights of the input measures (Blondel et al., 2018), whereas OT is differentiable only almost everywhere. We also consider the regularization introduced by Dessein et al. (2016), whose computation is found to be more stable since there is no exponential term causing overflow. As highlighted by Blondel et al. (2018), adding an entropy or squared norm regularization term to the primal problem (3) makes the dual problem an unconstrained maximization problem. We use this dual formulation in the next section to propose an efficient stochastic gradient algorithm.
3 LargeScale Optimal Transport
By considering the dual of the regularized OT problem, we first show that stochastic gradient ascent can be used to maximize the resulting concave objective. A close form for the primal solution of Problem (3) can then be obtained by using firstorder optimality conditions.
3.1 Dual stochastic approach
OT dual Let and . The Kantorovich duality provides the following dual of the OT problem (2),
(4) 
This dual formulation suggests that stochastic gradient methods can be used to maximize the objective of Problem (4) by sampling batches from the independant coupling . However there is no easy way to fulfill the constraint on and along gradient iterations. This motivates considering regularized optimal transport.
Regularized OT dual The hard constraint in Eq. (4) can be relaxed by regularizing the primal problem (2) with a strictly convex regularizer as detailed in (Blondel et al., 2018). In the present paper, we consider both entropy regularization used in (Cuturi, 2013; Genevay et al., 2016) and regularization ,
(5) 
where is the density, i.e. the RadonNikodym derivative, of w.r.t. . When and are discrete, and so is , the integrals are replaced by sums. The dual of the regularized OT problems can be obtained through the FenchelRockafellar’s duality theorem,
(6) 
(7) 
Compared to Problem (4), the constraint has been relaxed and is now enforced smoothly through a penalty term which is concave w.r.t. . Although we derive formula and perform experiments w.r.t. entropy and regularizations, any strictly convex regularizer which is decomposable, i.e. which can be written (in the discrete case), gives rise to a dual problem of the form Eq. (6), and the proposed algorithms can be adapted.
PrimalDual relationship In order to recover the solution of the regularized primal problem (3), we can use the firstorder optimality conditions of the FenchelRockafellar’s duality theorem,
(8) 
Algorithm The relaxed dual (6) is an unconstrained concave problem which can be maximized through stochastic gradient methods by sampling batches from . When is discrete, i.e. , the dual variable is a
dimensional vector over which we carry the optimization, where
. When has a density, is a function on which has to be parameterized in order to carry optimization. We thus consider deep neural networks for their ability to approximate general functions. Genevay et al. (2016) used the same stochastic dual maximization approach to compute the regularized OT objective in the continuouscontinuous setting. The difference lies in their pamaterization of the dual variables as kernel expansions, while we decide to use deep neural networks. Using a neural network for parameterizing a continuous dual variable was done also by Arjovsky et al. (2017). The same discussion also stands for the second dual variable . Our stochastic gradient algorithm is detailed in Alg. 1.Convergence rates and computational cost comparison. We first discuss convergence rates in the discretediscrete setting (i.e. both measures are discrete), where the problem is convex, while parameterization of dual variables as neural networks in the semidiscrete or continuouscontinuous settings make the problem nonconvex. Because the dual (6) is not strongly convex, fullgradient descent converges at a rate of , where is the iteration number. SGD with a decreasing step size converges at the inferior rate of (Nemirovski et al., 2009), but with a
cost per iteration. The two rates can be interpolated when using minibatches, at the cost of
per iteration, where is the minibatch size. In contrast, Genevay et al. (2016) considered a semidual objective of the form , with a cost per iteration which is now due to the computation of the gradient of . Because that objective is not strongly convex either, SGD converges at the same rate, up to problemspecific constants. As noted by Genevay et al. (2016), this rate can be improved to while maintaining the same iteration cost, by using stochastic average gradient (SAG) method (Schmidt et al., 2017). However, SAG requires to store past stochastic gradients, which can be problematic in a largescale setting.In the semidiscrete setting (i.e. one measure is discrete and the other is continuous), SGD on the semidual objective proposed by Genevay et al. (2016) also converges at a rate of , whereas we only know that Alg. 1 converges to a stationary point in this nonconvex case.
In the continuouscontinuous setting (i.e. both measures are continuous), Genevay et al. (2016) proposed to represent the dual variables as kernel expansions. A disadvantage of their approach, however, is the cost per iteration. In contrast, our approach represents dual variables as neural networks. While nonconvex, our approach preserves a cost per iteration. This parameterization with neural networks was also used by Arjovsky et al. (2017) who maximized the 1Wasserstein dualobjective function . Their algorithm is hence very similar to ours, with the same complexity per iteration. The main difference is that they had to constrain to be a Lipschitz function and hence relied of weight clipping inbetween gradient updates.
The proposed algorithm is capable of computing the regularized OT objective and optimal plans between empirical measures supported on arbitrary large numbers of samples. In statistical machine learning, one aims at estimating the underlying continuous distribution from which empirical observations have been sampled. In the context of optimal transport, one would like to approximate the true (nonregularized) optimal plan between the underlying measures. The next section states theoretical guarantees regarding this problem.
3.2 Convergence of regularized OT plans
Consider discrete probability measures and . Analysis of entropyregularized linear programs (Cominetti & San Martín, 1994) shows that the solution of the entropyregularized problem (3) converges exponentially fast to a solution of the nonregularized OT problem (2). Also, a result about stability of optimal transport (Villani, 2008)[Theorem 5.20] states that, if and weakly, then a sequence of optimal transport plans between and converges weakly to a solution of the OT problem between and . We can thus write,
(9) 
A more refined result consists in establishing the weak convergence of to when jointly converge to . This is the result of the following theorem which states a stability property of entropyregularized plans (proof in the Appendix).
Theorem 1.
Let and where and are complete metric spaces. Let and be discrete probability measures which converge weakly to and respectively, and let a sequence of nonnegative real numbers converging to sufficiently fast. Assume the cost is continuous on and finite. Let the solution of the entropyregularized OT problem (3) between and . Then, up to extraction of a subsequence, converges weakly to the solution of the OT problem (2) between and ,
(10) 
Keeping the analogy with statistical machine learning, this result is an analog to the universal consistency property of a learning method. In most applications, we consider empirical measures and is fixed, so that regularization, besides enabling dual stochastic approach, may also help learn the optimal plan between the underlying continuous measures.
So far, we have derived an algorithm for computing the regularized OT objective and regularized optimal plans regardless of and being discrete or continuous. The OT objective has been used successfully as a loss in machine learning (Montavon et al., 2015; Frogner et al., 2015; Rolet et al., 2016, 2018; Arjovsky et al., 2017; Courty et al., 2017a), whereas the use of optimal plans has straightforward applications in logistics, as well as economy (Kantorovich, 1942; Carlier, 2012) or computer graphics (Bonneel et al., 2011)
. In numerous applications however, we often need mappings rather than joint distributions. This is all the more motivated since
Brenier (1991) proved that when the source measure is continuous, the optimal transport plan is actually induced by a map. Assuming that available data samples are sampled from some underlying continuous distributions, finding the Monge map between these continuous measures rather than a discrete optimal plan between discrete measures is essential in machine learning applications. Hence in the next section, we investigate how to recover an optimal map, i.e. find an approximate solution to the Monge problem (1), from regularized optimal plans.4 Optimal Mapping Estimations
A map can be obtained from a solution to the OT problem (2) or regularized OT problem (3) through the computation of its barycentric projection. Indeed, a solution of Problem (2) or (3) between a source measure and a target measure is, identifying the plan with its density w.r.t. a reference measure, a function which can be seen as a weighted onetomany map, i.e. sends to each location where . A map can then be obtained by simply averaging over these according to the weights .
Definition 1.
In the special case , Eq. (11) has the close form solution , which is equal to in a discrete setting with and the weights of . Moreover, for the specific squared Euclidean cost , the barycentric projection is an optimal map (Ambrosio et al., 2006)[Theorem 12.4.4], i.e. is a solution to the Monge problem (1) between the source measure and the target measure . Hence the barycentric projection w.r.t. the squared Euclidean cost is often used as a simple way to recover optimal maps from optimal transport plans (Reich, 2013; Wang et al., 2013; Ferradans et al., 2014; Seguy & Cuturi, 2015).
Formula (11) provides a pointwise value of the barycentric projection. When is discrete, this means that we only have mapping estimations for a finite number of points. In order to define a map which is defined everywhere, we parameterize the barycentric projection as a deep neural network. We show in the next paragraph how to efficiently learn its parameters.
Optimal map learning An estimation of the barycentric projection of a regularized plan which generalizes outside the support of can be obtained by learning a deep neural network which minimizes the following objective w.r.t. the parameters ,
(12) 
When , the last term in Eq. (12) is simply a weighted sum of squared errors, with possibly an infinite number of terms whenever or are continuous. We propose to minimize the objective (12
) by stochastic gradient descent, which provides the simple Algorithm
2. The OT problem being symmetric, we can also compute the opposite barycentric projection w.r.t. a cost by minimizing .However, unless the plan is induced by a map, the averaging process results in having the image of the source measure by only approximately equal to the target measure . Still, when the size of discrete measure is large and the regularization is small, we show in the next paragraph that 1) the barycentric projection of a regularized OT plan is close to the Monge map between the underlying continuous measures (Theorem 2) and 2) the image of the source measure by this barycentric projection should be close to the target measure (Corollary 1).
Theoretical guarantees As stated earlier, when and , Brenier (1991) proved that when the source measure is continuous, there exists a solution to the Monge problem (1). This result was generalized to more general cost functions, see (Villani, 2008)[Corollary 9.3] for details. In that case, the plan between and is written as where is the Monge map. Now considering discrete measures and which converge to (continuous) and respectively, we have proved in Theorem 1 that converges weakly to when . The next theorem, proved in the Appendix, shows that the barycentric projection also converges weakly to the true Monge map between and , justifying our approach.
Theorem 2.
Let be a continuous probability measure on , and an arbitrary probability measure on and a cost function satisfying (Villani, 2008)[Corollary 9.3]. Let and converging weakly to and respectively. Assume that the OT solution of Problem (2) between and is unique for all . Let a sequence of nonnegative real numbers converging sufficiently fast to and the barycentric projection w.r.t. the convex cost of the solution of the entropyregularized OT problem (3). Then, up to extraction of a subsequence,
(13) 
where is the solution of the Monge problem (1) between and .
This theorem shows that our estimated barycentric projection is close to an optimal map between the underlying continuous measures for big and small. The following corollary confirms the intuition that the image of the source measure by this map converges to the underlying target measure.
Corollary 1.
With the same assumptions as above,
In terms of random variables, the last equation states that if and , then converges in distribution to .
These theoretical results show that our estimated Monge map can thus be used to perform domain adaptation by mapping a source dataset to a target dataset, as well as perform generative modeling by mapping a continuous measure to a target discrete dataset. We demontrate this in the following section.
5 Numerical Experiments
5.1 Dual vs SemiDual speed comparisons
We start by evaluating the training time of our dual stochastic algorithm 1 against a stochastic semidual approach similar to (Genevay et al., 2016). In the semidual approach, one of the dual variable is eliminated and is computed in close form. However, this computation has complexity where is the size of the target measure . We compute the regularized OT objective with both methods on a spectral transfer problem, which is related to the color transfer problem (Reinhard et al., 2001; Pitié et al., 2007), but where images are multispectral, i.e. they share a finer sampling of the light wavelength. We take two images from the CAVE dataset (Yasuma et al., 2010) that have spectral bands. As such, the optimal transport problem is computed on two empirical distributions of samples in on which we consider the squared Euclidean ground cost . The timing evolution of train losses are reported in Figure 2 for three different regularization values . In the three cases, one can observe that convergence of our proposed dual algorithm is much faster.
5.2 Large scale domain adaptation
We apply here our computation framework on an unsupervised domain adaptation (DA) task, for which optimal transport has shown to perform well on small scale datasets (Courty et al., 2017b; Perrot et al., 2016; Courty et al., 2014). This restriction is mainly due to the fact that those works only consider the primal formulation of the OT problem. Our goal here is not to compete with the stateoftheart methods in domain adaptation but to assess that our formulation allows to scale optimal transport based domain adaptation (OTDA) to large datasets. OTDA is illustrated in Fig. 3
and follows two steps: 1) learn an optimal map between the source and target distribution, 2) map the source samples and train a classifier on them in the target domain. Our formulation also allows to use any differentiable ground cost
while (Courty et al., 2017b) was limited to the squared Euclidean distance.Datasets We consider the three crossdomain digit image datasets MNIST (Lecun et al., 1998), USPS, and SVHN (Netzer et al., 2011), which have 10 classes each. For the adaptation between MNIST and USPS, we use samples in the MNIST domain and samples in USPS domain. MNIST images are resized to the same resolution as USPS ones (). For the adaptation between SVHN and MNIST, we use samples in the SVHN domain and
samples in the MNIST domain. MNIST images are zeropadded to reach the same resolution as SVHN (
) and extended to three channels to match SVHN image sizes. The labels in the target domain are withheld during the adaptation. In the experiment, we consider the adaptation in three directions: MNIST USPS, USPS MNIST, and SVHN MNIST.Methods and experimental setup Our goal is to demonstrate the potential of the proposed method in largescale settings. Adaptation performance is evaluated using a 1nearest neighbor (1NN) classifier, since it has the advantage of being parameter free and allows better assessment of the quality of the adapted representation, as discussed in (Courty et al., 2017b). In all experiments, we consider the 1NN classification as a baseline, where labeled neighbors are searched in the source domain and the accuracy is computed on target data. We compare our approach to previous OTDA methods where an optimal map is obtained through the discrete barycentric projection of either an optimal plan (computed with the network simplex algorithm^{1}^{1}1http://liris.cnrs.fr/~nbonneel/FastTransport/) or an entropyregularized optimal plan (computed with the Sinkhorn algorithm (Cuturi, 2013)), whenever their computation is tractable. Note that these methods do not provide outofsample mapping. In all experiments, the ground cost
is the squared Euclidean distance and the barycentric projection is computed w.r.t. that cost. We learn the Monge map of our proposed approach with either entropy or L2 regularizations. Regarding the adaptation between SVHN and MNIST, we extract deep features by learning a modified LeNet architecture on the source data and extracting the
dimensional features output by the top hidden layer. Adaptation is performed on those features. We report for all the methods the best accuracy over the hyperparameters on the target dataset. While this setting is unrealistic in a practical DA application, it is widely used in the DA community
(Long et al., 2013) and our goal is here to investigate the relative performances of largescale OTDA in a fair setting.Hyperparameters and learning rate The value for the regularization parameter is set in . Adam optimizer with batch size is used to optimize the network. The learning rate is varied in . The learned Monge map in Alg. 2 is parameterized as a neural network with two fullyconnected hidden layers (
) and ReLU activations, and the weights are optimized using the Adam optimizer with learning rate equal to
and batch size equal to . For the Sinkhorn algorithm, regularization value is chosen from .Method  MNIST USPS  USPS MNIST  SVHN MNIST 

Source only  73.47  36.97  54.33 
Bar. proj. OT  57.75  52.46  intractable 
Bar. proj. OT with  68.75  57.35  intractable 
Bar. proj. Alg. 1 with  68.84  57.55  58.87 
Bar. proj. Alg. 1 with  67.8  57.47  60.56 
Monge map Alg. 1+2 with  77.92  60.02  61.11 
Monge map Alg. 1+2 with  72.61  60.50  62.88 
Results Results are reported in Table 1. In all cases, our proposed approach outperforms previous OTDA algorithms. On MNISTUSPS, previous OTDA methods perform worse than using directly source labels, whereas our method leads to successful adaptation results with 20% and 10% accuracy points over OT and regularized OT methods respectively. On USPSMNIST, all three algorithms lead to successful adaptation results, but our method achieves the highest adaptation results. Finally, on the challenging largescale adaptation task SVHNMNIST, only our method is able to handle the whole datasets, and outperforms the source only results.
Comparing the results between the barycentric projection and estimated Monge map illustrates that learning a parametric mapping provides some kind of regularization, and improves the performance.
5.3 Generative Optimal Transport (GOT)
Approach Corollary 1 shows that when the support of the discrete measures and is large and the regularization is small, then we have approximately . This observation motivates the use of our Monge map estimation as a generator between an arbitrary continuous measure and a discrete measure representing the discrete distribution of some dataset. We can thus obtain a generative model by first computing regularized OT through Alg. 1 between a Gaussian measure and a discrete dataset and then compute our generator with Alg. 2. This requires to have a cost function between the latent variable and the discrete variable . The property we gain compared to other generative models is that our generator is, at least approximately, an optimal map w.r.t. this cost. In our case, the Gaussian is taken with the same dimensionality as the discrete data and the squared Euclidean distance is used as ground cost .
Permutationinvariant MNIST We preprocess MNIST data by rescaling grayscale values in . We run Alg. 1 and Alg. 2 where is a Gaussian whose mean and covariance are taken equal to the empirical mean and covariance of the preprocessed MNIST dataset; we have observed that this makes the learning easier. The target discrete measure is the preprocessed MNIST dataset. Permutation invariance means that we consider each grayscale images as a dimensional vector and do not rely on convolutional architectures. In Alg. 1 the dual potential is parameterized as a fullyconnected NN with ReLU activations for each hidden layer, and the regularization is considered as it produced experimentally less blurring. The barycentric projection of Alg. 2 is parameterized as a fullyconnected NN with ReLU activation for each hidden layer and a activation on the output layer. We display some generated samples in Fig. 4.
6 Conclusion
We proposed two original algorithms that allow for i) largescale computation of regularized optimal transport ii) learning an optimal map that moves one probability distribution onto another (the socalled Monge map). To our knowledge, our approach introduces the first tractable algorithms for computing both the regularized OT objective and optimal maps in largescale or continuous settings. We believe that these two contributions enable a wider use of optimal transport strategies in machine learning applications. Notably, we have shown how it can be used in an unsupervised domain adaptation setting, or in generative modeling, where a Monge map acts directly as a generator. Our consistency results show that our approach is theoretically wellgrounded. An interesting direction for future work is to investigate the corresponding convergence rates of the empirical regularized optimal plans. We believe this is a very complex problem since technical proofs regarding convergence rates of the empirical OT objective used e.g. in (Sriperumbudur et al., 2012; Boissard et al., 2014; Fournier & Guillin, 2015) do not extend simply to the optimal transport plans.
Acknowledgments
This work benefited from the support of the project OATMIL ANR17CE230012 of the French National Research Agency (ANR). We thank the anonymous reviewers and Arthur Mensh for the careful reading and helpful comments regarding the present article.
References
 Ambrosio et al. (2006) Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer, 2006.
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223, 2017.
 BenDavid et al. (2010) Shai BenDavid, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010.
 Benamou & Brenier (2000) JeanDavid Benamou and Yann Brenier. A computational fluid mechanics solution to the mongekantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
 Birkhoff (1946) Garrett Birkhoff. Three observations on linear algebra. Univ. Nac. Tucumán. Revista A, 5:147–151, 1946.
 Blondel et al. (2018) Mathieu Blondel, Vivien Seguy, and Antoine Rolet. Smooth and sparse optimal transport. In Proc. of AISTATS, 2018.
 Boissard et al. (2014) Emmanuel Boissard, Thibaut Le Gouic, et al. On the mean speed of convergence of empirical and occupation measures in wasserstein distance. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 50, pp. 539–563. Institut Henri Poincaré, 2014.
 Bonneel et al. (2011) Nicolas Bonneel, Michiel Van De Panne, Sylvain Paris, and Wolfgang Heidrich. Displacement interpolation using lagrangian mass transport. In ACM Transactions on Graphics (TOG), volume 30, pp. 158. ACM, 2011.
 Brenier (1991) Yann Brenier. Polar factorization and monotone rearrangement of vectorvalued functions. Communications on pure and applied mathematics, 44(4):375–417, 1991.
 Carlier (2012) Guillaume Carlier. Optimal transportation and economic applications. Lecture Notes.(Cited on page 2.), 2012.
 Cominetti & San Martín (1994) Roberto Cominetti and Jaime San Martín. Asymptotic analysis of the exponential penalty trajectory in linear programming. Mathematical Programming, 67(13):169–187, 1994.
 Courty et al. (2014) Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 274–289. Springer, 2014.
 Courty et al. (2017a) Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. arXiv preprint arXiv:1705.08848, 2017a.
 Courty et al. (2017b) Nicolas Courty, Remi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal Transport for Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, sep 2017b.
 Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.
 Dessein et al. (2016) Arnaud Dessein, Nicolas Papadakis, and JeanLuc Rouas. Regularized optimal transport and the rot mover’s distance. arXiv preprint arXiv:1610.06447, 2016.
 Evans & Gangbo (1999) Lawrence C Evans and Wilfrid Gangbo. Differential equations methods for the MongeKantorovich mass transfer problem, volume 653. American Mathematical Soc., 1999.
 Ferradans et al. (2014) Sira Ferradans, Nicolas Papadakis, Gabriel Peyré, and JeanFrançois Aujol. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853–1882, 2014.
 Fournier & Guillin (2015) Nicolas Fournier and Arnaud Guillin. On the rate of convergence in wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(34):707–738, 2015.
 Frogner et al. (2015) Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a wasserstein loss. In Advances in Neural Information Processing Systems, pp. 2053–2061, 2015.
 Genevay et al. (2016) Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for largescale optimal transport. In Advances in Neural Information Processing Systems, pp. 3432–3440, 2016.
 Goodfellow et al. (2014) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014.
 Gopalan et al. (2011) Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 999–1006. IEEE, 2011.
 Kantorovich (1942) Leonid Vitalievich Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk SSSR, volume 37, pp. 199–201, 1942.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Lecun et al. (1998) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. ISSN 00189219. doi: 10.1109/5.726791.
 Li et al. (2015) Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pp. 1718–1727, 2015.
 Long et al. (2013) Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE international conference on computer vision, pp. 2200–2207, 2013.
 Mérigot (2011) Quentin Mérigot. A multiscale approach to optimal transport. In Computer Graphics Forum, volume 30, pp. 1583–1592. Wiley Online Library, 2011.
 Monge (1781) G. Monge. Mémoire sur la theorie des deblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris, 1781.
 Montavon et al. (2015) Grégoire Montavon, KlausRobert Müller, and Marco Cuturi. Wasserstein training of boltzmann machines. arXiv preprint arXiv:1507.01972, 2015.
 Moselhy & Marzouk (2012) Tarek A. El Moselhy and Youssef M. Marzouk. Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815 – 7850, 2012. ISSN 00219991. doi: https://doi.org/10.1016/j.jcp.2012.07.022. URL http://www.sciencedirect.com/science/article/pii/S0021999112003956.
 Nemirovski et al. (2009) Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.

Netzer et al. (2011)
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew
Y. Ng.
Reading Digits in Natural Images with Unsupervised Feature
Learning.
In
NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011
, 2011.  Papadakis et al. (2014) Nicolas Papadakis, Gabriel Peyré, and Edouard Oudet. Optimal transport with proximal splitting. SIAM Journal on Imaging Sciences, 7(1):212–238, 2014.
 Perrot et al. (2016) Michaël Perrot, Nicolas Courty, Rémi Flamary, and Amaury Habrard. Mapping Estimation for Discrete Optimal Transport. In Neural Information Processing System, Barcelone, Spain, December 2016.
 Pitié et al. (2007) François Pitié, Anil C Kokaram, and Rozenn Dahyot. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1):123–137, 2007.
 Reich (2011) Sebastian Reich. A dynamical systems framework for intermittent data assimilation. BIT Numerical Mathematics, 51(1):235–249, 2011.
 Reich (2013) Sebastian Reich. A nonparametric ensemble transform method for bayesian inference. SIAM Journal on Scientific Computing, 35(4):A2013–A2024, 2013.
 Reinhard et al. (2001) Erik Reinhard, Michael Adhikhmin, Bruce Gooch, and Peter Shirley. Color transfer between images. IEEE Computer graphics and applications, 21(5):34–41, 2001.

Rolet et al. (2016)
Antoine Rolet, Marco Cuturi, and Gabriel Peyré.
Fast dictionary learning with a smoothed wasserstein loss.
In
Proceedings of the 19th International Conference on Artificial Intelligence and Statistics
, pp. 630–638, 2016.  Rolet et al. (2018) Antoine Rolet, Vivien Seguy, Mathieu Blondel, and Hiroshi Sawada. Blind source separation with optimal transport nonnegative matrix factorization. arXiv preprint arXiv:1802.05429, 2018.
 Schmidt et al. (2017) Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(12):83–112, 2017.
 Seguy & Cuturi (2015) Vivien Seguy and Marco Cuturi. Principal geodesic analysis for probability measures under the optimal transport metric. In Advances in Neural Information Processing Systems, pp. 3312–3320, 2015.
 Solomon et al. (2014) Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Earth mover’s distances on discrete surfaces. ACM Transactions on Graphics (TOG), 33(4):67, 2014.
 Sriperumbudur et al. (2012) Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Gert RG Lanckriet, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
 Su et al. (2015) Zhengyu Su, Yalin Wang, Rui Shi, Wei Zeng, Jian Sun, Feng Luo, and Xianfeng Gu. Optimal mass transport for shape matching and comparison. IEEE transactions on pattern analysis and machine intelligence, 37(11):2246–2259, 2015.
 Villani (2008) Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2008.
 Wang et al. (2013) Wei Wang, Dejan Slepčev, Saurav Basu, John A Ozolek, and Gustavo K Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International journal of computer vision, 101(2):254–269, 2013.
 Yasuma et al. (2010) Fumihito Yasuma, Tomoo Mitsunaga, Daisuke Iso, and Shree K Nayar. Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum. IEEE Transactions on Image Processing, 19(9):2241–2253, 2010.
Appendix A Proofs
Proof of Theorem 1.
Proof.
Let be the solution of the OT problem (2) between and which has maximum entropy. A result about stability of optimal transport (Villani, 2008)[Theorem 5.20] states that, up to extraction of a subsequence, converges weakly to a solution of the OT problem between and (regardless of being the solution with maximum entropy or not). We still write this subsequence, as well as the corresponding subsequence.
Let a bounded continuous function on . We have,
(14) 
The second term in the righthand side converges to as a result of the previously mentioned stability of optimal transport (Villani, 2008)[Theorem 5.20]. We now show the convergence of the first term to when sufficiently fast. We have,
(15) 
where is an upperbound of . A convergence result by Cominetti & San Martín (1994) shows that there exists positive constants (w.r.t. ) and such that,
(16) 
where . The subscript indices indicate the dependences of each constant. Hence, we see that choosing any such that the righthand side of Eq. (16) tends to provides the results. In particular, we can take
(17) 
which suffices to have the convergence of (15) to for any bounded continuous function . This proves the weak convergence of to . ∎
Proof of Theorem 2.
Proof.
First, note that the existence of a Monge map between and
follows from the absolute continuity of and the assumptions on the cost functions (Villani, 2008)[Corollary 9.3].
Let a Lipschitz function on . Let be the unique (by assumption) solution of the OT problem between and . We have,
(18) 
Since and are uniform discrete probability measures supported on the same number of points, we know by (Birkhoff, 1946) that the optimal transport is actually an optimal assignment , so that we have . This also implies so that . Hence, the second term in the righthand side of (18) converges to as a result of the stability of optimal transport (Villani, 2008)[Theorem 5.20]. Now, we show that the first term also converges to for converging sufficiently fast to . By definition of the pushforward operator,
(19) 
and we can bound,
(20) 
where and is the Lipschitz constant of . The first inequality follows from being Lipschitz. The next equality follows from the discrete close form of the barycentric projection. The last inequality is obtained through CauchySchwartz. We can now use the same arguments as in the previous proof. A convergence result by Cominetti & San Martín (1994) shows that there exists positive constants (w.r.t. ) and such that,
(21) 
where . The subscript indices indicate the dependences of each constant. Hence, we see that choosing any such that (21) tends to provides the results. In particular, we can take
(22) 
which suffices to have the convergence of (15) to for Lipschitz function . This proves the weak convergence of to . ∎
Proof of Corollary 1.
Proof.
Let a bounded continuous function. Let defined as . We have,
(23) 
which converges to by Theorem (2). Since , this proves the corollary. ∎