# The statistical effect of entropic regularization in optimal transportation

We tackle the problem of understanding the effect of entropic regularization in Sinkhorn algorithms. In the case of Gaussian distributions we provide a closed form for the regularized optimal transport cost, which enables a better understanding of the effect of the regularization from a statistical standpoint.


## 1 Introduction.

Statistical methods based on optimal transportation have received a considerable amount of attention in recent times. While the topic has a long history, computational limitations (and also the lack of a well-developed distributional theory) hampered its applicability for years. Some recent advances (see Cuturi, Peyré, Schmitzer, Rigollet, …) have completely changed the scene, and now statistical methods based on optimal transportation are everywhere (see, e.g., [32]; for kernel-based methods, [24, 4]; in fair machine learning, [20]).

Monge–Kantorovich distances are defined using a cost function $c$ as

$$W_c(P,Q)=\min_{\pi\in\Pi(P,Q)}\int c(x,y)\,d\pi(x,y),$$

where $\Pi(P,Q)$ denotes the set of distributions with marginals $P$ and $Q$.

Computing such distances requires, in the discrete case, solving a linear program. Actually, solving the original discrete optimal transport problem between two discrete distributions $P$ and $Q$ with cost matrix $C$ amounts to the minimization, with respect to the transportation plan $\pi$,

$$\min_{\pi\in\Pi(P,Q)}\langle C,\pi\rangle, \tag{1}$$

where $\langle C,\pi\rangle=\sum_{i,j}C_{i,j}\pi_{i,j}$. This minimization is a linear program (see [22]) but it turns out to be computationally demanding. Different algorithms have been proposed, such as the Hungarian algorithm [25], the simplex algorithm [28], or other versions using interior point algorithms [31]. The complexity of these methods grows rapidly with the (common) support size of the two discrete distributions. To overcome this issue, regularization methods have been proposed to approximate the optimal transport problem by adding a penalty. The seminal paper [11] describes the Sinkhorn algorithm, which regularizes optimal transport using the entropy of the transportation plan, changing the initial optimization program (1) into the strictly convex one

$$\min_{\pi\in\Pi(P,Q)}\big\{\langle C,\pi\rangle+\varepsilon H(\pi)\big\}. \tag{2}$$

The minimization of this criterion is achieved using the Sinkhorn algorithm. We refer to [32] and the references therein for more details. The introduction of the Sinkhorn divergence enables one to obtain an approximation of the optimal transport distance which can be computed, as pointed out in [2], much faster than the original optimal transport problem. Several toolboxes have been developed to compute regularized OT, among others [16] for Python and [23] for R.

Other algorithms can be used to minimize (2). In [19] stochastic gradient descent is applied to solve the entropy-regularized OT problem, while in [15] an accelerated gradient descent is proposed, improving the complexity. The influence of the penalty is controlled by the parameter $\varepsilon$, which balances the approximation of the optimal transport distance against its computational feasibility. Note also that other regularizing penalties have been proposed, for instance the entropy with respect to the product of the marginals. Beyond computational convenience, regularization has a statistical impact, and few results exist in the literature. An appealing property of regularized optimal transport is that its empirical version converges faster than that of standard optimal transport. Actually, for empirical versions of distributions $P$ and $Q$, Monge–Kantorovich distances suffer from the curse of dimensionality and converge, under some assumptions, at a rate that deteriorates with the dimension; this rate may be improved under further assumptions, as pointed out in [33] for instance. As shown first in [18] for distributions defined on a bounded domain, and sharpened for sub-Gaussian distributions in [30], the rate of convergence of regularized OT divergences does not suffer from this curse of dimensionality.

In recent years, optimal transport theory has been extensively used in unsupervised learning in order to characterize the mean of observations, giving rise to the notion of Wasserstein barycenters. This point of view is closely related to the notion of Fréchet means, which has been used in statistics in preliminary works such as [14]. The problem of existence and uniqueness of the Wasserstein barycenter of distributions, where at least one of these distributions has a density, has been tackled in [1]. The asymptotic properties of Wasserstein barycenters have been studied in [8] or [27]. However, their computation is a difficult issue apart from the location-scatter family case, in which a fixed point method can be derived to compute the barycenter, as explained in [3]. Hence, some authors have replaced the Monge–Kantorovich distance by the Sinkhorn divergence and have thus considered the notion of Sinkhorn barycenter, as in [12] or [6]. In this setting, the distributions are discretized and the usual Sinkhorn algorithm for discrete distributions is applied. Results proving the consistency of empirical Sinkhorn barycenters towards population Sinkhorn barycenters can be derived, and the rate of convergence can be upper bounded by a quantity depending on the number of observations, the discretization scheme and the trade-off parameter $\varepsilon$. Here again, little is known about the statistical properties of the Sinkhorn barycenter and its relationship to the original Wasserstein barycenter. Hence, for both computational and statistical purposes, the influence of $\varepsilon$ is crucial, and results dealing with the approximation properties of regularized OT with regard to standard OT are scarce. Very recently, several papers have independently proved similar expressions for the closed form of regularized optimal transport between Gaussian distributions: [21], which also covers the case of unbalanced transport, and [29], where similar formulations are derived from the solution of the Schrödinger system that can be written to compute the entropic transport plan. All three points of view are complementary and provide new insights on entropic optimal transport. Our contributions are the following.
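As an illustration, the Sinkhorn iterations that solve (2) can be sketched in a few lines of NumPy. This is a minimal sketch, not code from the paper or the toolboxes cited above; the grid, the two distributions and the value of the regularization parameter are arbitrary choices for illustration.

```python
import numpy as np

def sinkhorn_plan(p, q, C, eps, n_iter=5000):
    """Alternate scaling iterations for the entropy-regularized problem (2)."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)                 # scale columns to match marginal q
        u = p / (K @ v)                   # scale rows to match marginal p
    return u[:, None] * K * v[None, :]    # transportation plan

# two discrete distributions on a regular grid of [0, 1]
x = np.linspace(0.0, 1.0, 20)
p = np.full(20, 1 / 20)                              # uniform weights
q = np.exp(-((x - 0.5) ** 2) / 0.02); q /= q.sum()   # Gaussian-like bump
C = (x[:, None] - x[None, :]) ** 2                   # quadratic cost matrix

plan = sinkhorn_plan(p, q, C, eps=0.05)
# the computed plan (approximately) has the prescribed marginals
assert np.allclose(plan.sum(axis=1), p, atol=1e-8)
assert np.allclose(plan.sum(axis=0), q, atol=1e-4)
```

Each iteration only involves matrix-vector products with the Gibbs kernel, which is the source of the speed-up over linear programming solvers.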

• We investigate in this paper the statistical impact of this regularization.

• We find that the optimal regularized coupling of Gaussian measures is Gaussian, and we compute the regularized transportation cost between Gaussians (Theorem 2.2).

• The Gaussian case is not just an interesting benchmark. In fact, just as in the classical (unregularized) optimal transportation problem, for probabilities with given means and covariance matrices the entropic transportation cost is minimized by Gaussian distributions. This generalizes the Gelbrich lower bound to the entropic setup (Theorem 2.3).

• Also as in the classical case, the entropic barycenter of Gaussian probabilities is Gaussian (Theorem 3.2).

• The entropic variation around the barycenter is lower bounded by an explicit expression derived from the Gaussian case.

• We see that entropic regularization basically amounts to smoothing via convolution with a Gaussian kernel, which results in added variance. The regularization parameter controls the increase in variance.
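The last point can be illustrated directly: convolving a distribution with the centered Gaussian kernel $N(0,\frac{\varepsilon}{2}\mathrm{Id})$, the smoothing that appears in Theorem 2.4 below, adds $\frac{\varepsilon}{2}$ to the variance. A minimal NumPy sketch, with an arbitrary non-Gaussian choice of distribution and sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.5
x = rng.exponential(scale=1.0, size=1_000_000)          # samples from a non-Gaussian P
y = x + np.sqrt(eps / 2) * rng.standard_normal(x.size)  # samples from P * N(0, eps/2)

# smoothing by the Gaussian kernel increases the variance by eps / 2
assert abs(np.var(y) - (np.var(x) + eps / 2)) < 0.02
```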

## 2 Regularized optimal transport.

We consider the entropic regularization of the transportation cost, namely, for probabilities $P$, $Q$ on $\mathbb{R}^d$,

$$W_{2,\varepsilon}^2(P,Q)=\min_{\pi\in\Pi(P,Q)}I_\varepsilon[\pi]$$

with

$$I_\varepsilon[\pi]=\int_{\mathbb{R}^d\times\mathbb{R}^d}\|x-y\|^2\,d\pi(x,y)+\varepsilon H(\pi). \tag{3}$$

Here $H$ stands for the negative of the differential or Boltzmann–Shannon entropy, that is, if $\pi$ has density $r$ with respect to Lebesgue measure on $\mathbb{R}^d\times\mathbb{R}^d$, then

$$H(\pi)=\int_{\mathbb{R}^d\times\mathbb{R}^d}r(x,y)\log r(x,y)\,dx\,dy,$$

while $H(\pi)=+\infty$ if $\pi$ does not have a density.

The entropy term modifies the linear term in classical optimal transportation (the quadratic transportation cost) to produce a strictly convex functional. This is not the only possible choice. Alternatively, we could fix two reference probability measures on $\mathbb{R}^d$, say $\mu$ and $\nu$, and consider

$$W_{2,\varepsilon,\mu,\nu}^2(P,Q)=\min_{\pi\in\Pi(P,Q)}I_{\varepsilon,\mu,\nu}[\pi]$$

where

$$I_{\varepsilon,\mu,\nu}[\pi]=\int_{\mathbb{R}^d\times\mathbb{R}^d}\|x-y\|^2\,d\pi(x,y)+\varepsilon K(\pi\,|\,\mu\otimes\nu) \tag{4}$$

and $K(\cdot\,|\,\cdot)$ denotes the Kullback–Leibler divergence, namely, for probability measures $\alpha$, $\beta$, $K(\alpha\,|\,\beta)=\int\log\frac{d\alpha}{d\beta}\,d\alpha$ if $\alpha$ is absolutely continuous with respect to $\beta$, and $K(\alpha\,|\,\beta)=+\infty$ otherwise. In the case when $\mu=\nu$ is the centered normal distribution on $\mathbb{R}^d$ with covariance matrix $\lambda\,\mathrm{Id}$, for some $\lambda>0$, we will simply write $W_{2,\varepsilon,\lambda}^2$ and $I_{\varepsilon,\lambda}$.

In our definitions of the regularized transportation cost we have written $\min$ instead of $\inf$. The existence of the minimizer follows easily. In the case of $W_{2,\varepsilon}^2$, for instance, let us assume that $(\pi_n)_n$ is a minimizing sequence, that is, $I_\varepsilon[\pi_n]\to W_{2,\varepsilon}^2(P,Q)$. Since the $\pi_n$ have fixed marginals, $(\pi_n)_n$ is a tight sequence and we can extract a weakly convergent subsequence, that we keep denoting $(\pi_n)_n$, say $\pi_n\to\pi_0$ weakly. Obviously $\pi_0\in\Pi(P,Q)$. By Fatou's Lemma and by lower semicontinuity of the relative entropy (see, e.g., Lemma 1.4.3 in [13]), $I_\varepsilon[\pi_0]\le\liminf_n I_\varepsilon[\pi_n]$. But this shows that $I_\varepsilon[\pi_0]=W_{2,\varepsilon}^2(P,Q)$, hence $\pi_0$ is a minimizer. The case of $W_{2,\varepsilon,\mu,\nu}^2$ follows similarly. Furthermore, if the transportation cost is finite then the minimizer is unique, since the relative entropy is strictly convex in its domain.

The choice of the reference measures is arbitrary. However, their influence on the regularized optimal transport is limited. In fact, if we replace $\mu$, $\nu$ with equivalent measures $\mu'$, $\nu'$ (in the sense of $\mu$ and $\mu'$ being mutually absolutely continuous with respect to each other, and similarly for $\nu$ and $\nu'$) then $K(\pi\,|\,\mu\otimes\nu)<\infty$ if and only if $K(\pi\,|\,\mu'\otimes\nu')<\infty$. Hence, for any $\pi\in\Pi(P,Q)$ with finite relative entropy we have

$$K(\pi\,\|\,\mu\otimes\nu)-K(\pi\,\|\,\mu'\otimes\nu')=\int_{\mathbb{R}^d}\log\Big(\frac{d\mu'}{d\mu}(x)\Big)dP(x)+\int_{\mathbb{R}^d}\log\Big(\frac{d\nu'}{d\nu}(y)\Big)dQ(y) \tag{5}$$

and we see that the difference does not depend on $\pi$. In particular the minimizer, if it exists, does not depend on the choice of $\mu$, $\nu$. Furthermore, if $\mu$ and $\nu$ have a positive density on $\mathbb{R}^d$ then $K(\pi\,|\,\mu\otimes\nu)$ and $H(\pi)$ differ only by a constant and, again, the minimizer does not depend on the choice of $\mu$, $\nu$. The minimal value, however, does depend on the choice of the regularization term and this has an impact, for instance, in the barycenter problem, as we will see later.
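In the discrete case, identity (5) is easy to check numerically: the difference of the two Kullback–Leibler terms is the same for every coupling with the given marginals. A minimal NumPy sketch; all distributions below are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

def rand_prob():
    v = rng.random(n) + 0.5          # bounded away from zero: equivalent measures
    return v / v.sum()

P, Q, mu, nu, mu2, nu2 = (rand_prob() for _ in range(6))

def kl(pi, a, b):
    """Kullback-Leibler divergence K(pi | a x b) for discrete measures."""
    return np.sum(pi * np.log(pi / np.outer(a, b)))

# two different couplings with the same marginals P and Q
pi1 = np.outer(P, Q)
pert = np.zeros((n, n)); pert[:2, :2] = [[1.0, -1.0], [-1.0, 1.0]]
pi2 = pi1 + 1e-3 * pert              # rows / columns still sum to P and Q

diff1 = kl(pi1, mu, nu) - kl(pi1, mu2, nu2)
diff2 = kl(pi2, mu, nu) - kl(pi2, mu2, nu2)
rhs = np.sum(P * np.log(mu2 / mu)) + np.sum(Q * np.log(nu2 / nu))
# the difference does not depend on the coupling and matches (5)
assert np.isclose(diff1, diff2) and np.isclose(diff1, rhs)
```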

We prove in this section that the entropic regularization of the transportation problem between nondegenerate Gaussian laws admits a (unique) minimizer which is also Gaussian (on the product space). We provide an explicit expression for the mean and covariance of this minimizer. Our proof is self-contained in the sense that we prove the existence of a minimizer in this setup. This existence could be obtained from more general results (see, e.g., Theorem 3.2 in [9] or Remark 4.19 in [32]) based on duality. We obtain the minimizer, instead, from the analysis of a particular type of matrix equation: the so-called algebraic Riccati equation. This equation has been extensively studied (see [26]) and efficient numerical methods for the computation of solutions are available (see, e.g., [7]). However, the particular Riccati equation which is of interest for the entropic transportation problem (see (6)) has a particularly simple structure and its unique positive definite solution admits an explicit expression. This is shown in our next result.

###### Proposition 2.1

If $\Sigma_1$, $\Sigma_2$ are real, symmetric, positive definite matrices and $\varepsilon>0$ then the unique symmetric, positive definite solution of the matrix equation

$$X\Sigma_1X+\frac{\varepsilon}{2}X=\Sigma_2 \tag{6}$$

is

$$X_\varepsilon=\Sigma_1^{-1/2}\Big(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}+\big(\tfrac{\varepsilon}{4}\big)^2\mathrm{Id}\Big)^{1/2}\Sigma_1^{-1/2}-\frac{\varepsilon}{4}\Sigma_1^{-1}. \tag{7}$$

Furthermore, if

$$\Sigma_\varepsilon=\begin{bmatrix}\Sigma_1&\Sigma_1X_\varepsilon\\X_\varepsilon\Sigma_1&\Sigma_2\end{bmatrix}$$

then $\Sigma_\varepsilon$ is a real, symmetric, positive definite matrix.

Proof. The fact that $X_\varepsilon$ solves (6) can be checked by simple inspection. $X_\varepsilon$ is obviously symmetric. Hence, it suffices to show that it is positive definite or, equivalently, that $\Sigma_1^{1/2}X_\varepsilon\Sigma_1^{1/2}$ is positive definite. This, in turn, will follow if we prove that every eigenvalue, say $\gamma$, of $\big(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}+(\tfrac{\varepsilon}{4})^2\mathrm{Id}\big)^{1/2}$ satisfies $\gamma>\tfrac{\varepsilon}{4}$. But this is a consequence of the fact that the eigenvalues of $\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}+(\tfrac{\varepsilon}{4})^2\mathrm{Id}$ are $\lambda+(\tfrac{\varepsilon}{4})^2$ with $\lambda$ ranging in the set of eigenvalues of $\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}$, which is positive definite. Consequently, $X_\varepsilon$ is a positive definite solution of (6). To prove uniqueness we set $Z=\Sigma_1X+\frac{\varepsilon}{4}\mathrm{Id}$ and note that if $X$ is a symmetric positive definite solution to (6) then

$$XZ=\Sigma_2-\frac{\varepsilon}{4}X. \tag{8}$$

But then $X=\Sigma_1^{-1}\big(Z-\frac{\varepsilon}{4}\mathrm{Id}\big)$ and substitution in (8) yields

$$\Sigma_2+\big(\tfrac{\varepsilon}{4}\big)^2\Sigma_1^{-1}=\Sigma_1^{-1}Z^2$$

or, equivalently,

$$Z^2=\Sigma_1\Sigma_2+\big(\tfrac{\varepsilon}{4}\big)^2\mathrm{Id}.$$

Observe now that $A=\Sigma_1^{-1/2}Z\Sigma_1^{1/2}$ is a symmetric, positive definite matrix. From the last identity we see that

$$A^2=\Sigma_1^{-1/2}Z^2\Sigma_1^{1/2}=\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}+\big(\tfrac{\varepsilon}{4}\big)^2\mathrm{Id}.$$

Therefore, $A=\big(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}+(\tfrac{\varepsilon}{4})^2\mathrm{Id}\big)^{1/2}$. We conclude that, necessarily, $X=X_\varepsilon$.

We show next that $\Sigma_\varepsilon$ is positive definite. In fact (see, e.g., Theorem 1.3.3 in [5]), it suffices to show that the Schur complement $\Sigma_2-X_\varepsilon\Sigma_1\Sigma_1^{-1}\Sigma_1X_\varepsilon=\Sigma_2-X_\varepsilon\Sigma_1X_\varepsilon$ is positive definite. Since $X_\varepsilon$ solves (6), we have that $\Sigma_2-X_\varepsilon\Sigma_1X_\varepsilon=\frac{\varepsilon}{2}X_\varepsilon$ and the last condition becomes that $X_\varepsilon$ has to be positive definite. Equivalently, since $\Sigma_\varepsilon=\mathrm{diag}(\mathrm{Id},X_\varepsilon)\,U\,\mathrm{diag}(\mathrm{Id},X_\varepsilon)$, positive definiteness of $\Sigma_\varepsilon$ holds if and only if

$$U=\begin{bmatrix}\Sigma_1&\Sigma_1\\\Sigma_1&\Sigma_1+\frac{\varepsilon}{2}X_\varepsilon^{-1}\end{bmatrix}$$

is positive definite. Since the Schur complement of $\Sigma_1$ in $U$ equals $\frac{\varepsilon}{2}X_\varepsilon^{-1}$, which is positive definite, we conclude that $\Sigma_\varepsilon$ is indeed positive definite.

To complete the proof we note the well-known identity for the inverse of block partitioned matrices,

$$\Sigma_\varepsilon^{-1}=\begin{bmatrix}(\Sigma_1-\Sigma_1X_\varepsilon\Sigma_2^{-1}X_\varepsilon\Sigma_1)^{-1}&-X_\varepsilon(\Sigma_2-X_\varepsilon\Sigma_1X_\varepsilon)^{-1}\\-(\Sigma_2-X_\varepsilon\Sigma_1X_\varepsilon)^{-1}X_\varepsilon&(\Sigma_2-X_\varepsilon\Sigma_1X_\varepsilon)^{-1}\end{bmatrix}.$$

Since $X_\varepsilon$ solves (6) we have that $\Sigma_2-X_\varepsilon\Sigma_1X_\varepsilon=\frac{\varepsilon}{2}X_\varepsilon$. We similarly check that $\Sigma_1-\Sigma_1X_\varepsilon\Sigma_2^{-1}X_\varepsilon\Sigma_1=\frac{\varepsilon}{2}\Sigma_1X_\varepsilon\Sigma_2^{-1}$. This completes the proof.
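The closed form (7) is easy to verify numerically. A minimal NumPy sketch; the matrices below are randomly generated for illustration.

```python
import numpy as np

def spd_sqrt(M):
    """Square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def riccati_solution(S1, S2, eps):
    """Closed form (7) for the solution of X S1 X + (eps/2) X = S2."""
    d = S1.shape[0]
    R = spd_sqrt(S1); Ri = np.linalg.inv(R)
    inner = R @ S2 @ R + (eps / 4) ** 2 * np.eye(d)
    return Ri @ spd_sqrt(inner) @ Ri - (eps / 4) * np.linalg.inv(S1)

rng = np.random.default_rng(0)
d, eps = 4, 0.5
A = rng.standard_normal((d, d)); S1 = A @ A.T + d * np.eye(d)
B = rng.standard_normal((d, d)); S2 = B @ B.T + d * np.eye(d)

X = riccati_solution(S1, S2, eps)
# X solves the Riccati equation (6) and is symmetric positive definite
assert np.allclose(X @ S1 @ X + (eps / 2) * X, S2, atol=1e-8)
assert np.allclose(X, X.T) and np.all(np.linalg.eigvalsh(X) > 0)
# the block matrix Sigma_eps of Proposition 2.1 is positive definite
Sig = np.block([[S1, S1 @ X], [X @ S1, S2]])
assert np.all(np.linalg.eigvalsh(Sig) > 0)
```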

###### Remark 2.1.1

The inverse of the solution $X_\varepsilon$ of equation (6) can be expressed in terms of $Y_\varepsilon$, the unique symmetric positive definite solution of the alternative Riccati equation

$$Y\Sigma_2Y+\frac{\varepsilon}{2}Y=\Sigma_1. \tag{9}$$

In fact, if we write $Z_\varepsilon=\Sigma_2^{-1}X_\varepsilon\Sigma_1$ then

$$Z_\varepsilon=\Sigma_2^{-1}\big(\Sigma_2-\tfrac{\varepsilon}{2}X_\varepsilon\big)X_\varepsilon^{-1}=\big(\mathrm{Id}-\tfrac{\varepsilon}{2}\Sigma_2^{-1}X_\varepsilon\big)X_\varepsilon^{-1}=X_\varepsilon^{-1}-\tfrac{\varepsilon}{2}\Sigma_2^{-1}.$$

This shows that $Z_\varepsilon$ is symmetric. Also, since $\Sigma_2-\frac{\varepsilon}{2}X_\varepsilon=X_\varepsilon\Sigma_1X_\varepsilon$ is positive definite, we see that $Z_\varepsilon$ is positive definite. Since $Z_\varepsilon$ solves (9) we conclude $Z_\varepsilon=Y_\varepsilon$ or, equivalently, $Y_\varepsilon=X_\varepsilon^{-1}-\frac{\varepsilon}{2}\Sigma_2^{-1}$. From this we obtain $\Sigma_2^{-1}X_\varepsilon\Sigma_1=Y_\varepsilon$, which implies

$$\mathrm{Id}=\Sigma_2^{-1}\big(\Sigma_2-X_\varepsilon\Sigma_1X_\varepsilon\big)+Y_\varepsilon X_\varepsilon=\frac{\varepsilon}{2}\Sigma_2^{-1}X_\varepsilon+Y_\varepsilon X_\varepsilon=\Big(\Sigma_2^{-1}+\frac{2}{\varepsilon}Y_\varepsilon\Big)\frac{\varepsilon}{2}X_\varepsilon.$$

Thus, we conclude

$$\frac{2}{\varepsilon}X_\varepsilon^{-1}=\Sigma_2^{-1}+\frac{2}{\varepsilon}Y_\varepsilon. \tag{10}$$
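Identity (10) can likewise be checked numerically, solving both Riccati equations (6) and (9) via the closed form (7) with the roles of $\Sigma_1$ and $\Sigma_2$ exchanged. Again a sketch with illustrative random matrices:

```python
import numpy as np

def spd_sqrt(M):
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def riccati_solution(Sa, Sb, eps):
    # closed form (7) for X solving  X Sa X + (eps/2) X = Sb
    d = Sa.shape[0]
    R = spd_sqrt(Sa); Ri = np.linalg.inv(R)
    return Ri @ spd_sqrt(R @ Sb @ R + (eps / 4) ** 2 * np.eye(d)) @ Ri \
        - (eps / 4) * np.linalg.inv(Sa)

rng = np.random.default_rng(1)
d, eps = 3, 0.8
A = rng.standard_normal((d, d)); S1 = A @ A.T + d * np.eye(d)
B = rng.standard_normal((d, d)); S2 = B @ B.T + d * np.eye(d)

X = riccati_solution(S1, S2, eps)   # solves (6)
Y = riccati_solution(S2, S1, eps)   # solves (9): roles of S1 and S2 swapped
# identity (10): (2/eps) X^{-1} = S2^{-1} + (2/eps) Y
assert np.allclose((2 / eps) * np.linalg.inv(X),
                   np.linalg.inv(S2) + (2 / eps) * Y, atol=1e-8)
```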

Before stating the announced result, we observe that the analysis of the entropic regularization of transportation problems can focus on the case of centered probabilities $P$ and $Q$. In fact, writing $\mu_P$ and $\mu_Q$ for the means of $P$ and $Q$, and $\tilde P$, $\tilde Q$ for the corresponding centered probabilities, the map $(x,y)\mapsto(x-\mu_P,y-\mu_Q)$ defines a bijection $\pi\mapsto\tilde\pi$ between $\Pi(P,Q)$ and $\Pi(\tilde P,\tilde Q)$ which preserves the entropy, $H(\pi)=H(\tilde\pi)$; the reference measures $\mu$, $\nu$ are mapped similarly to shifted versions $\tilde\mu$, $\tilde\nu$. If $\mu$, $\tilde\mu$ and $\nu$, $\tilde\nu$ are equivalent, we see, using (5), that

$$I_{\varepsilon,\mu,\nu}[\pi]=I_{\varepsilon,\mu,\nu}[\tilde\pi]+\|\mu_P-\mu_Q\|^2-\varepsilon\Big(\int_{\mathbb{R}^d}\log\Big(\frac{d\tilde\mu}{d\mu}(x)\Big)d\tilde P(x)+\int_{\mathbb{R}^d}\log\Big(\frac{d\tilde\nu}{d\nu}(y)\Big)d\tilde Q(y)\Big).$$

With the choice of reference measures $\mu=\nu=N(0,\lambda\,\mathrm{Id})$ we have $\int_{\mathbb{R}^d}\log\big(\frac{d\tilde\mu}{d\mu}(x)\big)d\tilde P(x)=-\frac{1}{2\lambda}\|\mu_P\|^2$ and $\int_{\mathbb{R}^d}\log\big(\frac{d\tilde\nu}{d\nu}(y)\big)d\tilde Q(y)=-\frac{1}{2\lambda}\|\mu_Q\|^2$. Hence we conclude that

$$I_\varepsilon[\pi]=I_\varepsilon[\tilde\pi]+\|\mu_P-\mu_Q\|^2,\qquad I_{\varepsilon,\lambda}[\pi]=I_{\varepsilon,\lambda}[\tilde\pi]+\|\mu_P-\mu_Q\|^2+\frac{\varepsilon}{2\lambda}\big(\|\mu_P\|^2+\|\mu_Q\|^2\big). \tag{11}$$

###### Theorem 2.2

If $P$ and $Q$ are Gaussian probabilities on $\mathbb{R}^d$ with means $\mu_1$ and $\mu_2$ and positive definite covariance matrices $\Sigma_1$ and $\Sigma_2$, respectively, and $\pi_0$ denotes the Gaussian probability on $\mathbb{R}^d\times\mathbb{R}^d$ with mean $(\mu_1,\mu_2)$ and covariance matrix $\Sigma_\varepsilon$ as in Proposition 2.1, then

$$W_{2,\varepsilon}^2(P,Q)=I_\varepsilon[\pi_0]=\|\mu_1-\mu_2\|^2+\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\,\mathrm{Tr}(\Sigma_1X_\varepsilon)-\frac{\varepsilon}{2}\log\Big((2\pi e)^{2d}\big(\tfrac{\varepsilon}{2}\big)^d|\Sigma_1X_\varepsilon|\Big). \tag{12}$$

Proof. We write $p$ and $q$ for the densities of $P$ and $Q$, respectively. From (11) and the comments above we see that we only have to consider the case $\mu_1=\mu_2=0$. Also, since $I_\varepsilon[\pi]$ can only be finite if $\pi$ has a density, we can rewrite (3) as

$$W_{2,\varepsilon}^2(P,Q)=\inf_{r\in\mathcal{R}(P,Q)}\int_{\mathbb{R}^d\times\mathbb{R}^d}\big[\|x-y\|^2+\varepsilon\log r(x,y)\big]r(x,y)\,dx\,dy$$

with $\mathcal{R}(P,Q)$ denoting the set of densities $r$ on $\mathbb{R}^d\times\mathbb{R}^d$ satisfying the marginal conditions $\int r(x,y)\,dy=p(x)$ for almost every $x$ and $\int r(x,y)\,dx=q(y)$ for almost every $y$. Consider now $f\in L^1(P)$, $g\in L^1(Q)$. Then for any $r\in\mathcal{R}(P,Q)$,

$$\int\big[\|x-y\|^2+\varepsilon\log r(x,y)\big]r(x,y)\,dx\,dy-\int f(x)\,dP(x)-\int g(y)\,dQ(y)=\varepsilon\int r(x,y)\log\Big(\frac{r(x,y)}{e^{\frac{f(x)+g(y)-\|x-y\|^2}{\varepsilon}}}\Big)dx\,dy\ge\varepsilon-\varepsilon\int e^{\frac{f(x)+g(y)-\|x-y\|^2}{\varepsilon}}dx\,dy,$$

with equality if and only if $r(x,y)=e^{\frac{f(x)+g(y)-\|x-y\|^2}{\varepsilon}}$ for almost every $(x,y)$ (observe that this follows from the elementary fact that $t\log\frac{t}{s}\ge t-s$ for $t,s>0$, with equality if and only if $t=s$). This shows that

$$W_{2,\varepsilon}^2(P,Q)\ge\varepsilon+\sup_{f\in L^1(P),\,g\in L^1(Q)}\Big[\int f(x)\,dP(x)+\int g(y)\,dQ(y)-\varepsilon\int e^{\frac{f(x)+g(y)-\|x-y\|^2}{\varepsilon}}dx\,dy\Big].$$

It shows also that if some $\bar r\in\mathcal{R}(P,Q)$ can be written as $\bar r(x,y)=e^{\frac{f(x)+g(y)-\|x-y\|^2}{\varepsilon}}$ then $\bar r$ is a minimizer for the entropy-regularized transportation problem (indeed, by the strict convexity of $t\mapsto t\log t$, the unique minimizer).

Now, if $\pi_0$ denotes the centered Gaussian distribution on $\mathbb{R}^d\times\mathbb{R}^d$ with covariance matrix $\Sigma_\varepsilon$ as in Proposition 2.1, then, obviously, $\pi_0\in\Pi(P,Q)$. From the expression for $\Sigma_\varepsilon^{-1}$ and denoting $A_\varepsilon=(\Sigma_1-\Sigma_1X_\varepsilon\Sigma_2^{-1}X_\varepsilon\Sigma_1)^{-1}$ and $B_\varepsilon=(\Sigma_2-X_\varepsilon\Sigma_1X_\varepsilon)^{-1}$ we see that the density $r_0$ of $\pi_0$ equals

$$r_0(x,y)=\frac{1}{(2\pi)^d|\Sigma_\varepsilon|^{\frac12}}\exp\Big[-\tfrac12\big(x^TA_\varepsilon x+y^TB_\varepsilon y-\tfrac{4}{\varepsilon}x^Ty\big)\Big]=\frac{1}{(2\pi)^d|\Sigma_\varepsilon|^{\frac12}}\exp\Big[-\tfrac{1}{\varepsilon}\big(\|x-y\|^2+x^T(\tfrac{\varepsilon}{2}A_\varepsilon-\mathrm{Id})x+y^T(\tfrac{\varepsilon}{2}B_\varepsilon-\mathrm{Id})y\big)\Big].$$

Consequently, $r_0(x,y)=e^{\frac{f_0(x)+g_0(y)-\|x-y\|^2}{\varepsilon}}$ with

$$f_0(x)=x^T\big(\mathrm{Id}-X_\varepsilon-\tfrac{\varepsilon}{2}\Sigma_1^{-1}\big)x-\varepsilon\log\big((2\pi)^d|\Sigma_\varepsilon|^{\frac12}\big),\qquad g_0(y)=y^T\big(\mathrm{Id}-X_\varepsilon^{-1}\big)y. \tag{13}$$

This proves that $\pi_0$ minimizes the regularized transportation cost between $P$ and $Q$.

Finally, to prove (12) we note first that

$$\int_{\mathbb{R}^d\times\mathbb{R}^d}\|x-y\|^2\,d\pi_0(x,y)=\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\,\mathrm{Tr}(\Sigma_1X_\varepsilon). \tag{14}$$

A simple computation shows that $H(\pi_0)=-\frac12\log\big((2\pi e)^{2d}|\Sigma_\varepsilon|\big)$. On the other hand,

$$\det(\Sigma_\varepsilon)=\det(\Sigma_1)\det\big(\Sigma_2-X_\varepsilon\Sigma_1\Sigma_1^{-1}\Sigma_1X_\varepsilon\big)=\big(\tfrac{\varepsilon}{2}\big)^d\det(\Sigma_1X_\varepsilon)$$

(here we have used that $\Sigma_2-X_\varepsilon\Sigma_1X_\varepsilon=\frac{\varepsilon}{2}X_\varepsilon$). Combining these last computations with (14) we obtain (12).
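Formula (14) for the quadratic cost under the optimal coupling $\pi_0$ can be checked by simulation: sampling from the centered Gaussian with covariance $\Sigma_\varepsilon$ and averaging $\|x-y\|^2$ recovers $\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\,\mathrm{Tr}(\Sigma_1X_\varepsilon)$. A NumPy sketch with illustrative matrices; the 5% tolerance accounts for Monte Carlo error.

```python
import numpy as np

def spd_sqrt(M):
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

rng = np.random.default_rng(2)
d, eps = 2, 0.5
A = rng.standard_normal((d, d)); S1 = A @ A.T + np.eye(d)
B = rng.standard_normal((d, d)); S2 = B @ B.T + np.eye(d)

# closed form (7) and the coupling covariance of Proposition 2.1
R = spd_sqrt(S1); Ri = np.linalg.inv(R)
X = Ri @ spd_sqrt(R @ S2 @ R + (eps / 4) ** 2 * np.eye(d)) @ Ri \
    - (eps / 4) * np.linalg.inv(S1)
Sig = np.block([[S1, S1 @ X], [X @ S1, S2]])

# sample (x, y) from the optimal coupling pi_0 and average ||x - y||^2
z = rng.multivariate_normal(np.zeros(2 * d), Sig, size=400_000)
mc = np.mean(np.sum((z[:, :d] - z[:, d:]) ** 2, axis=1))
exact = np.trace(S1) + np.trace(S2) - 2 * np.trace(S1 @ X)
assert abs(mc - exact) / exact < 0.05
```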

The proof of Theorem 2.2 can be easily adapted to other entropic regularizations. In particular, we can check that $\pi_0$ is also the minimizer of $I_{\varepsilon,\lambda}$ and also that

$$W_{2,\varepsilon,\lambda}^2(P,Q)=I_{\varepsilon,\lambda}[\pi_0]=\|\mu_1-\mu_2\|^2+\frac{\varepsilon}{2\lambda}\big(\|\mu_1\|^2+\|\mu_2\|^2\big)+\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\,\mathrm{Tr}(\Sigma_1X_\varepsilon)-\frac{\varepsilon}{2}\Big[\log(|\Sigma_1X_\varepsilon|)-\frac{1}{\lambda}\big(\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)\big)-d\Big(2\log\lambda-\log\frac{\varepsilon}{2}-1\Big)\Big]. \tag{15}$$

Theorem 2.2 shows that the entropic transportation cost between normal laws is, as in the case of the classical transportation cost, a sum of two contributions. One accounts for the deviation in mean between the two laws; this part remains unchanged by the regularization with the negative differential entropy (but not with relative entropies). The other contribution, which accounts for deviations between the covariance matrices, behaves differently, but this behavior is easier to understand in the one-dimensional case, where we see that

$$W_{2,\varepsilon}^2\big(N(0,\sigma_1^2),N(0,\sigma_2^2)\big)=\sigma_1^2+\sigma_2^2-2\sqrt{\sigma_1^2\sigma_2^2+\big(\tfrac{\varepsilon}{4}\big)^2}-\frac{\varepsilon}{2}\log\Big(\sqrt{\sigma_1^2\sigma_2^2+\big(\tfrac{\varepsilon}{4}\big)^2}-\frac{\varepsilon}{4}\Big)-\frac{\varepsilon}{2}\log\big(2\pi^2e\,\varepsilon\big).$$

In particular, the behavior of the regularized cost can be described through the function

$$h(x)=2\big(1-\sqrt{1+x^2}\big)-2x\log\big(\sqrt{1+x^2}-x\big)-2x\log\big((2\pi)^2e\,x\big).$$

It is easy to see that $h$ is continuous on $(0,\infty)$ and that $h(x)\to0$ as $x\to0^+$.
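The one-dimensional closed form above also makes the small-$\varepsilon$ behavior easy to inspect numerically: as $\varepsilon\to0$ the regularized cost approaches the classical cost $(\sigma_1-\sigma_2)^2$. A sketch assuming the displayed formula; the variances are illustrative.

```python
import numpy as np

def w22_eps_1d(s1sq, s2sq, eps):
    """One-dimensional closed form for W^2_{2,eps}(N(0, s1^2), N(0, s2^2))."""
    root = np.sqrt(s1sq * s2sq + (eps / 4) ** 2)
    return (s1sq + s2sq - 2 * root
            - (eps / 2) * np.log(root - eps / 4)
            - (eps / 2) * np.log(2 * np.pi ** 2 * np.e * eps))

s1sq, s2sq = 1.0, 4.0
classical = (np.sqrt(s1sq) - np.sqrt(s2sq)) ** 2   # classical W_2^2 = (s1 - s2)^2
errs = [abs(w22_eps_1d(s1sq, s2sq, e) - classical)
        for e in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5)]
# the regularized cost approaches the classical cost as eps -> 0
assert errs[0] > errs[-1] and errs[-1] < 1e-3
```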

While Theorem 2.2 is limited to Gaussian probabilities, its scope goes beyond that case. In classical optimal transportation the Gaussian case provides a lower bound for the quadratic transportation cost through Gelbrich's bound (see [10], which improves the bound in [17]). We show next that this carries over to entropic regularizations of the transportation cost.

###### Theorem 2.3

If $P$ and $Q$ are probabilities on $\mathbb{R}^d$ with means $\mu_1$, $\mu_2$ and positive definite covariance matrices $\Sigma_1$, $\Sigma_2$, respectively, then

$$W_{2,\varepsilon}^2(P,Q)\ge\|\mu_1-\mu_2\|^2+\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\,\mathrm{Tr}(\Sigma_1X_\varepsilon)-\frac{\varepsilon}{2}\log\Big((2\pi e)^{2d}\big(\tfrac{\varepsilon}{2}\big)^d|\Sigma_1X_\varepsilon|\Big), \tag{16}$$

where $X_\varepsilon$ is as in (7). Equality in (16) holds if and only if $P$ and $Q$ are Gaussian.

Proof. As in the proof of Theorem 2.2, it suffices to consider the case of centered $P$ and $Q$. If $P$ (or $Q$) does not have a density then $\Pi(P,Q)$ does not contain any probability with a density and, consequently, $I_\varepsilon[\pi]=+\infty$ for every $\pi\in\Pi(P,Q)$ and the result is trivial. We assume, therefore, that $P$ and $Q$ are absolutely continuous w.r.t. Lebesgue measure. We consider $\pi\in\Pi(P,Q)$ with density $r$ and denote by $r_0$ the density of $\pi_0$, as defined in Theorem 2.2. Then (recall (13))

$$\begin{aligned}I_\varepsilon[\pi]&=\varepsilon\int_{\mathbb{R}^d\times\mathbb{R}^d}\log\Big(\frac{r(x,y)}{e^{-\|x-y\|^2/\varepsilon}}\Big)r(x,y)\,dx\,dy\\&=\varepsilon\int_{\mathbb{R}^d\times\mathbb{R}^d}\log\Big(\frac{r(x,y)}{r_0(x,y)}\Big)r(x,y)\,dx\,dy+\int_{\mathbb{R}^d\times\mathbb{R}^d}x^T\big(\mathrm{Id}-X_\varepsilon-\tfrac{\varepsilon}{2}\Sigma_1^{-1}\big)x\,r(x,y)\,dx\,dy\\&\qquad+\int_{\mathbb{R}^d\times\mathbb{R}^d}y^T\big(\mathrm{Id}-X_\varepsilon^{-1}\big)y\,r(x,y)\,dx\,dy-\frac{\varepsilon}{2}\log\Big((2\pi)^{2d}\big(\tfrac{\varepsilon}{2}\big)^d|\Sigma_1X_\varepsilon|\Big)\\&=\mathrm{Tr}\big((\mathrm{Id}-X_\varepsilon-\tfrac{\varepsilon}{2}\Sigma_1^{-1})\Sigma_1\big)+\mathrm{Tr}\big((\mathrm{Id}-X_\varepsilon^{-1})\Sigma_2\big)-\frac{\varepsilon}{2}\log\Big((2\pi)^{2d}\big(\tfrac{\varepsilon}{2}\big)^d|\Sigma_1X_\varepsilon|\Big)+\varepsilon K(\pi\,|\,\pi_0)\\&=\mathrm{Tr}(\Sigma_1)+\mathrm{Tr}(\Sigma_2)-2\,\mathrm{Tr}(\Sigma_1X_\varepsilon)-\frac{\varepsilon}{2}\log\Big((2\pi e)^{2d}\big(\tfrac{\varepsilon}{2}\big)^d|\Sigma_1X_\varepsilon|\Big)+\varepsilon K(\pi\,|\,\pi_0).\end{aligned}$$

Now (16) follows from the fact that $K(\pi\,|\,\pi_0)\ge0$. Equality holds if and only if $\pi=\pi_0$, whose marginals are Gaussian; hence equality in (16) holds if and only if $P$ and $Q$ are Gaussian. This completes the proof.

To conclude this section we present a simple result on best approximation with respect to the entropic transportation cost. In the case $\varepsilon=0$ (classical optimal transportation) $W_2$ is a metric and for any $P$ with finite second moment we have

$$W_2^2(P,Q)\ge W_2^2(P,P)=0,\qquad Q\in\mathcal{F}_2(\mathbb{R}^d).$$

The fact that $W_{2,\varepsilon}$ is no longer a metric for $\varepsilon>0$ changes the nature of the problem and we may wonder which $Q$ is closest to $P$ in the sense of minimizing $W_{2,\varepsilon}^2(P,Q)$. We show next that the problem admits a simple solution.

###### Theorem 2.4

Assume that $P$ is a probability on $\mathbb{R}^d$ with a density satisfying a suitable integrability condition. Then

$$P*N_d\big(0,\tfrac{\varepsilon}{2}\mathrm{Id}\big)=\operatorname*{argmin}_{Q}\,W_{2,\varepsilon}^2(P,Q),$$

with the minimization extended to the set of all probabilities on $\mathbb{R}^d$. Furthermore, $P*N_d(0,\frac{\varepsilon}{2}\mathrm{Id})$ is the unique minimizer.

Proof. We consider a probability $\pi$ on $\mathbb{R}^d\times\mathbb{R}^d$ with first marginal $P$ and $f\in L^1(P)$. Arguing as in the proof of Theorem 2.2 (take $g=0$) we see that

$$I_\varepsilon[\pi]\ge\varepsilon+\int f(x)\,dP(x)-\varepsilon\int e^{\frac{f(x)-\|x-y\|^2}{\varepsilon}}dx\,dy,$$

with equality if and only if $\pi$ has a density, $r$, that can be written as $r(x,y)=e^{\frac{f(x)-\|x-y\|^2}{\varepsilon}}$. Now, if $f(x)=\varepsilon\log p(x)-\frac{\varepsilon d}{2}\log(\pi\varepsilon)$, then $r(x,y)=p(x)(\pi\varepsilon)^{-d/2}e^{-\|x-y\|^2/\varepsilon}$ is a density on $\mathbb{R}^d\times\mathbb{R}^d$ with first marginal $P$, second marginal $P*N_d(0,\frac{\varepsilon}{2}\mathrm{Id})$, and we can write