 # Stochastic Optimization for Regularized Wasserstein Estimators

Optimal transport is a foundational problem in optimization, that allows to compare probability distributions while taking into account geometric aspects. Its optimal objective value, the Wasserstein distance, provides an important loss between distributions that has been used in many applications throughout machine learning and statistics. Recent algorithmic progress on this problem and its regularized versions have made these tools increasingly popular. However, existing techniques require solving an optimization problem to obtain a single gradient of the loss, thus slowing down first-order methods to minimize the sum of losses, that require many such gradient computations. In this work, we introduce an algorithm to solve a regularized version of this problem of Wasserstein estimators, with a time per step which is sublinear in the natural dimensions of the problem. We introduce a dual formulation, and optimize it with stochastic gradient steps that can be computed directly from samples, without solving additional optimization problems at each step. Doing so, the estimation and computation tasks are performed jointly. We show that this algorithm can be extended to other tasks, including estimation of Wasserstein barycenters. We provide theoretical guarantees and illustrate the performance of our algorithm with experiments on synthetic data.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Optimal transport is one of the foundational problems of optimisation (Monge, 1781; Kantorovich, 2006), and an important topic in analysis (Villani, 2008). It asks how one can transport mass with distribution measure to another distribution measure

, with minimal global transport cost. It can also be written with a probabilistic interpretation, known as the Monge-Kantorovich formulation, of finding a joint distribution

in the set of those with marginals and , minimizing an expected cost between variables and . The minimum value gives rise to a natural statistical tool to compare distributions, known as the Wasserstein (or earth-mover’s) distance,

 Wc(μ,ν)=OT(μ,ν)=minπ∈Π(μ,ν)E(X,Y)∼π[c(X,Y)].

In the case of finitely supported measures, taken with same support size

for ease of notation, such as two empirical measures from samples, it is written as a linear program (on the right). It can be solved by the Hungarian algorithm

(Kuhn, 1955), which runs in time . While tractable, this is still relatively expensive for extremely large-scale applications in modern machine learning, where one hopes for running times that are linear in the size of the input (here ).

Attention to this problem has been recently renewed in machine learning, in particular due to recent advances to efficiently solve an entropic-regularized version (Cuturi, 2013), and its uses in many applications (see e.g. Peyré et al., 2019, for a survey), as it allows to capture the geometric aspects of the data. This problem has a strongly convex objective, and its solution converges to that of the optimal transport problem whenthe regularization parameter goes to 0. It can be easily solved with the Sinkhorn algorithm (Sinkhorn, 1964; Altschuler et al., 2017), or by other methods in time (Dvurechensky et al., 2018).

These tools have been applied in a wide variety of fields, from machine learning (Alvarez-Melis et al., 2018; Arjovsky et al., 2017; Gordaliza et al., 2019; Flamary et al., 2018)

(Grave et al., 2019; Alaux et al., 2019; Alvarez-Melis et al., 2018), computer graphics (Feydy et al., 2017; Lavenant et al., 2018; Solomon et al., 2015), the natural sciences (del Barrio et al., 2019; Schiebinger et al., 2019), and learning under privacy (Boursier and Perchet, 2019).

Of particular interests to statistics and machine learning are analyses of this problem with only sample access to the distributions. There have been growing efforts to estimate either the objective value of this problem, or the unknown distribution, with this metric or associated regularized metrics (see below) (Weed et al., 2019; Genevay et al., 2019; Uppal et al., 2019). One of the motivations are variational Wasserstein problems, where the objective value of an optimal transport problem is used as a loss, and one seeks to minimize in a parameter an objective that depends on a known distribution

 minθ∈ΘOT(νθ,μ),

where is only accessible through samples. This method for estimation, referred to as minimum Kantorovich estimators (Bassetti et al., 2006), mirrors the interpretation of likelihood maximization as the minimization of

, with the Kullback-Leibler divergence.

The value of the entropic-regularized problem, or of the related Sinkhorn divergence, can also be used as a loss in learning tasks (Alvarez-Melis et al., 2018; Genevay et al., 2017; Luise et al., 2018), and compared to other metrics such as maximum mean discrepency (Gretton et al., 2012; Feydy et al., 2019; Arbel et al., 2019). One of the advantages of the regularized problem is the existence of gradients in the parameters of the problem (cost matrix, target measures).

The problem of minimizing this loss for the cost over has been shown to be equivalent to maximum likelihood Gaussian deconvolution (Rigollet and Weed, 2018). We show here that this result can be generalized for all cost functions to maximum likelihood estimation for a kernel inversion problem. It is not only the solution of a stochastic optimization problem, but also an estimator, referred to here as the regularized Wasserstein estimator.

In this work, we propose a new stochastic optimization scheme to minimize the between an unknown discrete measure and another discrete measure , with an additional regularization term on . There are many connections between this problem and stochastic optimization: by a dual formulation, the value can be written as the optimum of an expectation in , allowing simple computations with only sample access (Genevay et al., 2016). Here, we take this one step further and design an algorithm to optimize in , not just evaluate this loss. A direct approach is to optimize by first-order methods, by the use of stochastic gradients in at each step (Genevay et al., 2017). However, these gradient estimates are based on dual solutions of the regularized problem, so obtaining them requires to solve an optimization problem, with running time scaling quadratically in the intrinsic dimension of the problem (the size of the supports of ). For the dual formulation that we introduce, stochastic gradients can be directly computed from samples. Algorithmic techniques exploiting the particular structure of the dual formulation for this regularization allow us to compute these gradients in constant time. We follow here the recent developments in sublinear algorithms based on stochastic methods (Clarkson et al., 2012).

We provide theoretical guarantees on the convergence of the final iterate to the true minimizer , and demonstrate these results on simulated experiments.

## 2 Problem Description

#### Definitions.

Let be a probability measure on with finite support and a family of probability measures. The measures in should all be absolutely continuous with respect to a known measure  supported in the finite set . We consider the following minimization problem:

 minν∈MOTε(μ,ν)+ηKL(ν,β). (1)

In this expression, is the regularised optimal transport cost defined by the following expression

 OTε(μ,ν)=minπ∈Π(μ,ν)E(X,Y)∼π[c(X,Y)]+εKL(π,μ⊗ν), (2)

where the minimum is taken over the set

 Π(μ,ν)={π∈P(X×Y):πX=μ, πY=ν}

of couplings of and , and is a cost function in . The operator is the Kullback-Leibler divergence, defined as

 KL(μ1,μ2)=EZ∼μ2[dμ1dμ2(Z)log(dμ1dμ2(Z))],

for two measures and such that . We assume that is convex for the problem to be a convex optimization problem, and compact to guarantee that the minimum is attained.

###### Remark 1.

If is a distance and if , then is a Wasserstein distance and our problem can be seen as computing a projection of onto . In the discrete case, the solution to the unregularized problem is the distribution such that , where is the nearest neighbour in of .

#### Learning problem.

Our objective is to solve the optimization problem in Equation (1), given observations independent and identically distributed (i.i.d.) from that is unknown, and sample access to . These can be assumed to be simulated by the user if

is known, as part of the regularization. This problem can be either be interpreted as an unsupervised learning problem or as estimation in an inverse problem, and we refer to it as

regularized Wasserstein estimation. The term in Kullback-Leibler (or entropy, up to an offset) are classical manners in which a probability can be regularized.

#### Maximum likelihood interpretation.

While the unregularized problem has a trivial solution, there is in general no closed form for positive . When , and is the set of all probability measures on

, then our problem is equivalent to the maximum likelihood estimator for a kernel inversion problem. This corresponds to estimating the unknown initial distribution of a random variable

, but only by observing it after the action of a specific transition kernel (see, e.g., Berthet and Kanade, 2019, for the statistical complexity of estimating initial dustributions under general Markov kernels).

###### Proposition 2.1 (MLE interpretation).

Let be the set of all probability measures on , let be a measure on , and let be a transition kernel of the form

 κ(x,y)=exp(−c(x,y)ε)∑x′∈Xexp(−c(x′,y)ε),

the observed measure is , which can be written as

 μ(x)=∫Yκ(x,y)dν∗(y).

The maximum likelihood estimation of for this observation is

 ^ν:=argmaxν∈M∑ilog(κν)(Xi).

This estimator also verifies

 ^ν=argminν∈MOTε(μ,ν). (3)
###### Remark 2.

If , then is a Gaussian convolution kernel and the sample measure is a convolution, so the solution of (3) is the MLE of the Gaussian deconvolution problem, as already presented by Rigollet and Weed (2018).

As in the Gaussian case, these optimization problems share an optimum, but are not equal in value. Therefore, in our regularized setting, it is not possible to substitute one for the other.

## 3 Dual formulations

As noted above, first-order optimization methods to solve directly in the regularized problem require at every step to solve an optimization problem. We explore instead another approach, through a dual formulation of our problem. Such a formulation allows to change the minimisation problem in (2) into a maximisation problem.

###### Proposition 3.1 (Dual formulation).

If , then the problem (1) is equivalent to the following problem:

 minf∈Fmaxa∈L1(μ),b∈L1(ν)E[a(X)+b(Y)f(Y)−εexp(a(X)+b(Y)−c(X,Y)ε)+(η−ε)f(Y)logf(Y)], (4)

the expectation being over the variables , with and .

If is constant -almost everywhere, with value , then the maximization problem for and in (4) is the dual of the regularized optimal transport problem 2, for which a block coordinate descent corresponds to Sinkhorn algorithm (Cuturi, 2013).

This dual formulation is a saddle point problem, and it is convex-concave if , so the Von Neumann minimax theorem applies: we can swap the minimum and the maximum.

###### Proposition 3.2.

If then the problem (1) is equivalent to the following maximization problem:

 maxa∈L1(μ),b∈L1(ν)F(a,b), (5)

with

 F(a,b)=E[a(X)−εea(X)+b(Y)−c(X,Y)ε]−(η−ε)H∗β(−bη−ε), (6)

by writing

 H∗β(α)=maxf∈FE[α(Y)f(Y)−f(Y)logf(Y)],

with the variables .

In its discrete formulation, the problem is written with the following notations: for the cost matrix, and

for the dual vectors, and

for the remaining primal variable.

The problem (5) is hence given by

 max(a,b)∈RI×RJF(a,b), (7)

with

 F(a,b)=E[ai−εexp(ai+bj−Ci,jε)]−(η−ε)H∗β,M(−bη−ε). (8)

The indices are here independent random variables such that and . The function is the Legendre transform of the relative entropy to on the set :

 H∗β,M(α) =maxf∈FE[fj(αj−logfj)], (9)

with a random index such that .

If the maximum is attained on the relative interior of at the point , then we have Moreover the optimum for the dual problem (4) is the optimal for our general problem (1).

###### Proposition 3.3.

The function has the following properties.

1. The set of solutions to the problem (7) is a nonempty affine space spanned by the vector .

2. Every solution of (7) verifies

 ∀ i,j, |a∗i+b∗j−Ci,j|≤B, (10)

with , where is the range of the matrix given by , and with .

3. The function is -strongly convex on the slice with

 λ:=mini,j{μi,βj}εe−(m+2RC/ε).
4. For and independent random variables as for (8), we have the gradients of are written as simple expectations

 ∇aF =E[(1−Di,j)ei], (11) ∇bF =E[(fj−Di,j)ej], (12)

with .

## 4 Stochastic Optimization Methods

The formulas (11) and (12) suggest that our problem can be solved using a stochastic optimization approach. For random indices drawn from and drawn from , we obtain the following stochastic gradients

 Ga =(1−Di,j)ei=(1−exp(ai+bj−Ci,jε))ei Gb =(fj−Di,j)ej=(ν∗jβj−exp(ai+bj−Ci,jε))ej.

By Proposition 3.3

, these are unbiased estimates of the gradients of

. The algorithm then proceeds with an averaged gradient ascent that uses these stochastic gradients updates at each step. The obtained iterates are averaged, producing the sequence of iterates defined by

 ¯¯¯¯bt:=1t∑1≤t′≤tbt′.

The computation of can be done in , however necessitates the value in (9) to be computed. The complexity of this computation depends on the set , and we will present here two cases where it can be done with low complexity.

#### Initialization.

To guarantee that the gradients will not get exponentially big, we choose the initial value of the dual variables so that it verifies

 ∀i,j, ai+bj−Ci,j≤−εm,

with being defined in (10). We define

 ini(C,ε,m):=(minCi,j−εm)/2,

and we initialize

 ai=bj=¯¯¯¯bj=ini(C,ε,m). (13)

Usually,

is unknown and should be determined by heuristics.

#### Simple case.

We analyze the case where is the family of all probability measures supported in the finite set , with the assumption that . Then, if the max is attained on the interior of the simplex, we have the optimum

 ν∗j=βje−bj/(η−ε)∑kβke−bk/(η−ε). (14)

The algorithm needs complexity for each time step. If the values of are accessible without having the whole matrix stored (such as a simple function of and ), the storage is only in this algorithm, because we do not need to store any . The complexity at each step of the algorithm is better than with the non regularized form, where is taken as , instead of randomly. This enhancement in complexity mostly comes from the storage of the sum with

 gj(btj):=βje−btj/(η−ε).

Indeed, instead of computing the entire sum at each iterates, which costs operations, the algorithm simply updates the part of the sum that was modified:

 St+1=St+gj(bt+1j)−gj(btj).

This method assures updates in . In a context focused entirely on optimization, where and are known in advance, we could also pick and uniformly, and add and as factors in the formulas. This would not reduce the complexity.

#### Mixture models.

We also consider a set of measures supported in supported in the set , and take to be their convex hull. We define the matrix . Then , and Equation (9) becomes

 H∗β,M(α)=maxθ∈ΔK(α−log(Mθ)+log(β))TMθ, (15)

with the being taken component-wise.

###### Proposition 4.1.

The maximization problem (15) has a solution

 θ∗=M†exp(PIm(M)(−b/(η−ε)−1−log(β)))1TM†exp(PIm(M)(−b/(η−ε)−1−log(β))),

which gives the measure

 ν∗=exp(PIm(M)(−b/(η−ε)−1−log(β))1Texp(PIm(M)(−b/(η−ε)−1−log(β)).

We can replace it in equation (12) to get the stochastic gradients. However at each new computed step, every coefficient changes, and there is a need to do computations for each step. The solution computed here is also valid for the case when it is not unique.

We can, however, consider another regularization to the entropy of to improve the algorithm. The problem is the following:

 minθ∈ΔKOTε(ν,μ)+ηKL(θ,M†β),

with being the Moore-Penrose inverse of the matrix . The other computations are unchanged, apart from Equation (9), replaced by

 H∗β,M(α) =maxθ∈ΔKαTMθ−(η−ε)KL(θ,M†β) =maxθ∈ΔK(MTα−log(θ)+log(M†β))Tθ. (16)
###### Proposition 4.2.

The maximization problem (4) has a solution

 θ∗=exp(MT(−b/(η−ε)−1)+log(M†β))1Texp(MT(−b/(η−ε)−1)+log(M†β)).

Both regularizations and are minimal when , and can therefore be used as a suitable proxy. The solution to the regularized problem is similar to the solution to the unregularized one. For this modified problem, the computations are accessible, and they can be done in time , a great improvement if . The algorithm is the following:

#### Wasserstein barycenters.

Algorithm 1 can be used to compute an approximation of the Wasserstein barycenter of measures . If the cost funtion in the optimal transport problem is of the form with being a distance and , then the transport cost defines the -Wasserstein distance. In these conditions, the Wasserstein barycenter of the measures with weights is the solution of the problem

 minνK∑k=1θkOT(μk,ν). (17)

This optimization and the barycenter that it defines was introduced by Agueh and Carlier (2011), these objects and their regularized versions have attracted a lot of attention, for their statistical and algorithmic aspects (Zemel et al., 2019; Cuturi and Doucet, 2014; Claici et al., 2018; Luise et al., 2019).

As an analogy with our original problem (1), we consider an entropic regularization of the Wasserstein barycenter problem (17):

 minν∈MK∑k=1θkOTε(μk,ν)+ηKL(ν,β).

Our approach can be translated to this setting, as well as the theoretical results found for (1). We have the equivalent dual formulation

 maxa∈L1(μ),b∈L1(ν)~F(a1,…,aK,b),

with

 ~F(a1,…,aK,b):=K∑k=1θkFk(ak,b).

Here is defined like the function in (6) by replacing by . The only difference in the algorithm is that there should be dual variables that play the role of the variable for each measure while one variable is used to obtain the target measure. The complexity of the algorithm is for each stochastic gradient step, which gains a factor compared to the state-of-the-art stochastic Wasserstein barycenter (Staib et al., 2017), that solves the minimisation problem

 minν∈MK∑k=1θkOTε(μk,ν).

The complexity of a gradient step could be further reduced to at the cost of more randomization, by sampling randomly at each step with probability , and updating and as in algorithm 1 with playing the role of . If , the approximation error of this estimated Wasserstein Barycenter is of the same order as by Staib et al. (2017).

## 5 Results

### 5.1 Convergence bounds

The following convergence bounds are valid for both algorithms presented in the previous section. They come from general convergence bounds averaged stochastic gradient descent with decreasing stepsize

(Shamir and Zhang, 2012). For be the optimal Wasserstein estimator, let be the estimator obtained by stopping the algorithm at step . We consider the Kullback-Leibler divergence to express how close the estimated measure is to . As is obtained with the dual variable , the estimation error of can translate to an entropic error in the following two bounds. The first result uses the stepsize for SGD associated to strongly convex functions and the second one uses the stepsize for SGD associated to convex functions. Both results are presented here: even though the theoretical bound of the second one is asymptotically worse, its stepsize can yield better performance in practice.

###### Theorem 5.1.

With stepsize , the estimator verifies the following bound:

 E[KL(ν∗,νt)]≤34e2m(η−ε)λ21+logtt.
###### Theorem 5.2.

With stepsize , , the estimator verifies the following bound:

 E[KL(ν∗,νt)]≤2B2emc0ε(η−ε)λ2+logt√t.

In order to prove both theorems, we present two lemmas whose proofs are provided in the appendix.

###### Lemma 5.3.

Let be the iterations of the stochastic gradient descent, seen as random variables. If the initialization is done as in (13

), then the second order moments of the stochastic gradients are bounded:

 E[∥∇aFi,j(at,bt)∥2+∥∇bFi,j(at,bt)∥2]≤2e2m.
###### Lemma 5.4.

The convergence of the primal variable is linked to the convergence of the objective by the following bound:

 KL(ν(b∗),ν(b))≤F(a∗,b∗)−F(a,b)(η−ε)λ.
###### Proof of Theorem 5.1.

The result from Shamir and Zhang (2012) on strongly convex functions gives the bound

 E[F(a∗,b∗)−F(at,bt)]≤17G2λ1+logtt,

with being a bound on the second order moments of the stochastic gradients. The lemma 5.3 provides . We conclude with lemma 5.4. ∎

###### Proof of theorem 5.2.

With stepsize , the result from Shamir and Zhang (2012) on convex functions gives the bound

 E[F(a∗,b∗)−F(at,bt)]≤2(BG)2+logt√t,

with being a bound on the second order moments of the stochastic gradients. The lemma 5.3 provides , here we choose where we assume . We conclude with Lemma 5.4. ∎

###### Remark 3.

The term in can be removed by using adaptive averaging schemes: by averaging only the past iterates, the the term can be replaced by .

###### Remark 4.

The strong convexity coefficient

 λ=mini,j{μi,βj}εe−B/ε

is negligible when , thus the stepsize of the first theorem is large: it can lead to growth of the dual variables grow out of their normal range and produces an exponential overflow in experiments. One solution is to cap the dual variables to the range provided by (10), but the algorithm would then not provide any useful solution until a high number of steps is performed, i.e. . Instead, we recommend using the stepsize that provides a quick convergence at the earlier steps, then gives a better asymptotic convergence rate.

### 5.2 Simulations

We demonstrate the performance of the algorithm on simulated experiments.

#### Regularization term.

In order to exhibit clearly the impact of regularization parameters, We analyze a simple case, where , and . In this case the solution is given by for , with a diagonal transportation matrix. The introduction of the positive regularization in noticeably spreads the transportation matrix, and provides a solution that is closer to the uniform law on . We use the learning rate provided by Theorem 5.2. Figure 1: Effect of the regularization. Upper plots, from left to right: cost matrix used, transportation matrix for ε=η−ε=0.1 after 105 iterations, and for ε=η−ε=0.01. Lower plots, from left to right: Target measure μ in blue, estimator in orange, ε=η−ε=0.1, then for ε=η−ε=0.01.

The regularization term should be greater than , and brings the estimated measure closer to the uniform measure. We choose to take to conserve a similar degree of regularization as in the case , while guaranteeing that the exponentials in (14) do not overflow.

#### Sensibility to dimension.

We consider the relationship between the convergence rate and the dimensions of the problem. The theoretical results 5.1 and 5.2 depend on , which scales with if and are uniform on their support. We generate and as two samples of and independent Gaussian vectors, is the uniform measure on , and is the distance matrix between and . We compute the gradient norm of the objective function at the averaged iterates .

The gradient norm here converges at rate , with as would be predicted from the theorem 5.1, except in the case where , which matches better with the bound in Theorem 5.2. An increase of the sample size for the input measure seems to decrease performance while an increase of the support size of the target increases performance. It means that a finer grid of points in will provide a faster convergence to the optimal estimator.

#### Choice of the learning rate.

As noted above, a choice of learning rate that is large compared to can lead to a divergence of the dual variables. This is due to the exponential dependency of the gradients in and . Experiments suggest the learning rate

 γt=min{1λt,c0ε√t}.

The following graphs show the convergence to the target with different choices of . Here , , with the same problem is the same as in the experiments on the regularization term.

A regression on the curves shows that the empirical convergence rate is of order with , which matches with theorem 5.1. We remark that the greater is, the better the algorithm converges, until it becomes unstable and does not converge anymore for . This instability was observed consistently for a large range of values of and . The choice appears to be reasonable for both stability and convergence.

## 6 Conclusion

We consider the problem of minimizing a doubly regularized optimal transport cost over a set of finitely supported measures with fixed support. Using an entropic regularization on the target measure, we derive a stochastic gradient descent on the dual formulation with sublinear (even constant in the simplest case) complexity at each step of the optimization. The algorithm is thus highly paralellizable, and can be used to compute a regularized solution to the Wasserstein barycenter problem. We also provide convergence bounds for the estimator that this algorithm yields after steps, and demonstrate its performs on randomly generated data.

#### Aknowledgements.

This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). We also acknowledge support from the European Research Council (grant SEQUOIA 724063).

## References

• Agueh and Carlier (2011) Martial Agueh and Guillaume Carlier. Barycenters in the wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
• Alaux et al. (2019) Jean Alaux, Edouard Grave, Marco Cuturi, and Armand Joulin. Unsupervised hyper-alignment for multilingual word embeddings. 2019.
• Altschuler et al. (2017) Jason Altschuler, Jonathan Niles-Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1964–1974, 2017.
• Alvarez-Melis et al. (2018) David Alvarez-Melis, Tommi Jaakkola, and Stefanie Jegelka. Structured optimal transport. In

International Conference on Artificial Intelligence and Statistics

, pages 1771–1780, 2018.
• Arbel et al. (2019) Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient flow. In Advances in Neural Information Processing Systems, pages 6481–6491, 2019.
• Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
• Bassetti et al. (2006) Federico Bassetti, Antonella Bodini, and Eugenio Regazzini. On minimum kantorovich distance estimators. Statistics & probability letters, 76(12):1298–1302, 2006.

Statistical windows in testing for the initial distribution of a reversible markov chain.

pages 246–255, 2019.
• Boursier and Perchet (2019) Etienne Boursier and Vianney Perchet. Private learning and regularized optimal transport. arXiv preprint arXiv:1905.11148, 2019.
• Claici et al. (2018) Sebastian Claici, Edward Chien, and Justin Solomon. Stochastic wasserstein barycenters. In International Conference on Machine Learning, pages 999–1008, 2018.
• Clarkson et al. (2012) Kenneth L. Clarkson, Elad Hazan, and David P Woodruff. Sublinear optimization for machine learning. Journal of the ACM (JACM), 59(5):1–49, 2012.
• Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
• Cuturi and Doucet (2014) Marco Cuturi and Arnaud Doucet. Fast computation of wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.
• del Barrio et al. (2019) Eustasio del Barrio, Hristo Inouzhe, Jean-Michel Loubes, Carlos Matrán, and Agustín Mayo-Íscar. optimalflow: Optimal-transport approach to flow cytometry gating and population matching. arXiv preprint arXiv:1907.08006, 2019.
• Dvurechensky et al. (2018) Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by sinkhorn’s algorithm. In International Conference on Machine Learning, pages 1367–1376, 2018.
• Feydy et al. (2017) Jean Feydy, Benjamin Charlier, François-Xavier Vialard, and Gabriel Peyré. Optimal transport for diffeomorphic registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 291–299. Springer, 2017.
• Feydy et al. (2019) Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouve, and Gabriel Peyré. Interpolating between optimal transport and mmd using sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681–2690, 2019.
• Flamary et al. (2018) Rémi Flamary, Marco Cuturi, Nicolas Courty, and Alain Rakotomamonjy. Wasserstein discriminant analysis. Machine Learning, 107(12):1923–1945, 2018.
• Genevay et al. (2016) Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for large-scale optimal transport. In Advances in neural information processing systems, pages 3440–3448, 2016.
• Genevay et al. (2017) Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with sinkhorn divergences. arXiv preprint arXiv:1706.00292, 2017.
• Genevay et al. (2019) Aude Genevay, Lénaïc Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1574–1583, 2019.
• Gordaliza et al. (2019) Paula Gordaliza, Eustasio Del Barrio, Gamboa Fabrice, and Jean-Michel Loubes. Obtaining fairness using optimal transport theory. In International Conference on Machine Learning, pages 2357–2365, 2019.
• Grave et al. (2019) Edouard Grave, Armand Joulin, and Quentin Berthet. Unsupervised alignment of embeddings with wasserstein procrustes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1880–1890, 2019.
• Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
• Kantorovich (2006) Leonid V Kantorovich. On the translocation of masses. Journal of Mathematical Sciences, 133(4):1381–1382, 2006.
• Kuhn (1955) Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
• Lavenant et al. (2018) Hugo Lavenant, Sebastian Claici, Edward Chien, and Justin Solomon. Dynamical optimal transport on discrete surfaces. ACM Transactions on Graphics (TOG), 37(6):1–16, 2018.
• Luise et al. (2018) Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. Differential properties of sinkhorn approximation for learning with wasserstein distance. In Advances in Neural Information Processing Systems, pages 5859–5870, 2018.
• Luise et al. (2019) Giulia Luise, Saverio Salzo, Massimiliano Pontil, and Carlo Ciliberto. Sinkhorn barycenters with free support via frank-wolfe algorithm. In Advances in Neural Information Processing Systems, pages 9318–9329, 2019.
• Monge (1781) Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris, 1781.
• Peyré et al. (2019) Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
• Rigollet and Weed (2018) Philippe Rigollet and Jonathan Weed. Entropic optimal transport is maximum-likelihood deconvolution. Comptes Rendus Mathematique, 356(11-12):1228–1235, 2018.
• Schiebinger et al. (2019) Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh Solomon, Joshua Gould, Siyan Liu, Stacie Lin, Peter Berube, et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928–943, 2019.
• Shamir and Zhang (2012) Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. CoRR, abs/1212.1824, 2012.
• Sinkhorn (1964) Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The annals of mathematical statistics, 35(2):876–879, 1964.
• Solomon et al. (2015) Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):1–11, 2015.
• Staib et al. (2017) Matthew Staib, Sebastian Claici, Justin M Solomon, and Stefanie Jegelka. Parallel streaming wasserstein barycenters. In Advances in Neural Information Processing Systems, pages 2647–2658, 2017.
• Uppal et al. (2019) Ananya Uppal, Shashank Singh, and Barnabas Poczos. Nonparametric density estimation & convergence rates for gans under besov ipm losses. In Advances in Neural Information Processing Systems, pages 9086–9097, 2019.
• Villani (2008) Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
• Weed et al. (2019) Jonathan Weed, Francis Bach, et al. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.
• Zemel et al. (2019) Yoav Zemel, Victor M Panaretos, et al. Fréchet means and procrustes analysis in wasserstein space. Bernoulli, 25(2):932–976, 2019.

## Appendix A Proofs of technical results

###### Proof of proposition 2.1.

This proof follows the reasoning in Rigollet and Weed (2018). Let be the empirical measure of the sample . We first remark that the log-likelihood of defined by

 ℓν(Xi):=log∫κ(Xi,y)dν(y)

verifies

 ℓν(Xi)=logEY∼ν[κ(Xi,Y)].

With the Legendre transform of the relative entropy, we obtain

 ℓν(Xi)=supγiEY∼γi[logκ(Xi,Y)]−KL(γi,ν)

with the minimum being over every probability measures on . The MLE maximizes

 1I∑iℓν(Xi)=EX∼μ[ℓν(X)]

over , it can be written as

 maxπ∈Π(μ,ν)E(X,Y)∼π[logκ(X,Y)]−EX∼μ[KL(π(X,⋅),ν)],

with being the conditional probability of , defined by . We have

 EX∼μ[KL(π(X,⋅),ν)]= 1I∑iEY∼ν[logdπ(Xi,⋅)dν(Y)], = 1I∑iEY∼ν[logdπdμ⊗ν(Xi,Y)]−logI, = KL(π,μ⊗ν)−logI.

Thus the MLE minimizes

 minπ∈Π(μ,ν)E[c(X,Y)]+εKL(π,μ⊗ν),

which is the regularized optimal transport cost between and . ∎

###### Proof of Proposition 3.3.
1. The function is a Legendre transform, so it is convex, and thus is convex as a sum of convex functions. Moreover is bounded from above:

 F(a,b)≤ C1E[ai+bj]−C2E[eai+bjε], ≤ C3,

where does not depend on or . Thus the set of solutions is nonempty. F is invariant by the translation , so each solution generates an affine set of solutions spanned by the vector . We can conclude using the strong convexity on the slice , which implies that there exists only one solution on this slice.

2. The solution solves the following system

 {∇aF(a∗,b∗)=0,∇bF(a∗,b∗)=0.

With notations , , , the two equations can be written as

 {∀ 1≤i≤I,1−Ai∑jβjBjΓi,j=0,∀ 1≤j≤J,fj−Bj∑iμiAiΓi,j=0. (18)

Thus

 ⎧⎪⎨⎪⎩∀ 1≤i≤I,Ai=1∑jβjBjΓi,j,∀ 1≤j≤J,Bj=fj∑iμiAiΓi,j. (19)

We also remark that by multiplying the second term of (18) by and summing over we get

 ∑i,jμiAiβjBjΓi,j=1. (20)

By multiplying the equations in (19) we have for all :

 AiBjΓi,j=fjΓi,j∑k,lμkAkΓk,jβlBlΓi,l

thus using (20):

 fjmink,lΓi,jΓk,lΓk,jΓi,l≤AiBjΓi,j≤fjmaxk,lΓi,jΓk,lΓ