# On Unbalanced Optimal Transport: An Analysis of Sinkhorn Algorithm

We provide a computational complexity analysis for the Sinkhorn algorithm that solves the entropic regularized Unbalanced Optimal Transport (UOT) problem between two measures of possibly different masses with at most n components. We show that the complexity of the Sinkhorn algorithm for finding an ε-approximate solution to the UOT problem is of order O(n²/ε), which is near-linear time. To the best of our knowledge, this complexity is better than the complexity of the Sinkhorn algorithm for solving the Optimal Transport (OT) problem, which is of order O(n²/ε²). Our proof technique is based on the geometric convergence of the Sinkhorn updates to the optimal dual solution of the entropic regularized UOT problem and on some properties of the primal solution. It also differs from the proof of the complexity of the Sinkhorn algorithm for approximating the OT problem, since the UOT solution does not have to meet the marginal constraints.


## 1 Introduction

The Optimal Transport (OT) problem has a long history in mathematics and operations research, originally used to find the optimal cost to transport masses from one distribution to another (Villani, 2003). Over the last decade, OT has emerged as one of the most important tools for solving practical problems in statistics and machine learning (Peyré & Cuturi, 2019). Recently, the Unbalanced Optimal Transport (UOT) problem between two measures of possibly different masses has been used in several applications in computational biology (Schiebinger et al., 2019), computational imaging (Lee et al., 2019; Yang & Uhler, 2019), and machine learning and statistics (Frogner et al., 2015; Janati et al., 2019).

The UOT problem is a regularized version of the Kantorovich formulation which places penalty functions on the marginal distributions based on some divergence (Liero et al., 2018). When the two measures are probability distributions, the standard OT is a limiting case of the UOT. Under the discrete setting of the OT problem, where each probability distribution has at most n components, the OT problem can be recast as a linear programming problem. The benchmark methods for solving the OT problem are interior-point methods, of which the most practical complexity is O(n³), developed by (Pele & Werman, 2009). Recently, (Lee & Sidford, 2014) used Laplacian linear system algorithms to improve the complexity of interior-point methods to Õ(n^{2.5}). However, the interior-point methods are not scalable when n is large.

To deal with the scalability of computing the OT, (Cuturi, 2013) proposed to regularize its objective function by the entropy of the transportation plan, which results in the entropic regularized OT. One of the most popular algorithms for solving the entropic regularized OT is the Sinkhorn algorithm (Sinkhorn, 1974), which was shown by (Altschuler et al., 2017) to have a complexity of Õ(n²/ε³) when used to approximate the OT within an ε-accuracy. In the same article, (Altschuler et al., 2017) developed a greedy version of the Sinkhorn algorithm, named the Greenkhorn algorithm, that has a better practical performance than the Sinkhorn algorithm. Later, the complexity of the Greenkhorn algorithm was improved to Õ(n²/ε²) by a deeper analysis in (Lin et al., 2019b). To accelerate the Sinkhorn and Greenkhorn algorithms, (Lin et al., 2019a) introduced the Randkhorn and Gandkhorn algorithms, which have complexity upper bounds of Õ(n^{7/3}/ε). These complexities are better than those of the Sinkhorn and Greenkhorn algorithms in terms of the desired accuracy ε. A different line of algorithms for solving the OT problem is based on primal-dual algorithms. These algorithms include the accelerated primal-dual gradient descent algorithm (Dvurechensky et al., 2018), the accelerated primal-dual mirror descent algorithm (Lin et al., 2019b), and the accelerated primal-dual coordinate descent algorithm (Guo et al., 2019). These primal-dual algorithms all have complexity upper bounds of Õ(n^{2.5}/ε), which are better than those of the Sinkhorn and Greenkhorn algorithms in terms of ε. Recently, (Jambulapati et al., 2019; Blanchet et al., 2018) developed algorithms with complexity upper bounds of Õ(n²/ε), which are believed to be optimal, based on either a dual extrapolation framework with area-convex mirror mapping or on black-box and specialized graph algorithms. However, these algorithms are quite difficult to implement. Therefore, they are less competitive than the Sinkhorn and Greenkhorn algorithms in practice.

Our Contribution. While the complexity theory for OT is rather well understood, that for UOT is still nascent. In this paper, we establish the complexity of approximating the UOT between two discrete measures with at most n components. We focus on the setting where the penalty functions are Kullback-Leibler divergences. Similar to the entropic regularized OT, in order to account for the scalability of computing the UOT, we also consider an entropic version of the UOT, which we refer to as entropic regularized UOT. The Sinkhorn algorithm is widely used to solve the entropic regularized UOT (Chizat et al., 2016); however, its complexity for approximating the UOT has not been studied. Our contribution is to prove that the Sinkhorn algorithm has a complexity of

$$\mathcal{O}\Big(\frac{n^2}{\varepsilon}\log(n)\Big[\log(\|C\|_\infty)+\log(\log(n))+3\log\Big(\frac{1}{\varepsilon}\Big)\Big]\Big).$$

This complexity is within logarithmic factors of n and 1/ε of the presumably optimal one, O(n²/ε).

The main difference between finding an ε-approximate solution for OT and for UOT by the Sinkhorn algorithm is that for OT the algorithm knows when it is close to the solution, because of the constraints on the marginals, while for UOT it does not have that advantage. Despite lacking that useful property, the UOT enjoys more freedom, resulting in some interesting equations that relate the optimal value of the primal objective to the masses of the two measures (see Lemma 4). Those equations, together with the geometric convergence of the dual solution, prove the almost linear time convergence to an ε-approximate solution of the UOT.

Organization. The remainder of the paper is organized as follows. In Section 2, we provide a setup for the regularized UOT in primal and dual forms, respectively. Based on the dual form, we show that the dual solution has a geometric convergence rate in Section 3. We also show in Section 3 that the Sinkhorn algorithm for the UOT has a complexity of order Õ(n²/ε). Section 4 presents some empirical results confirming the complexity of the Sinkhorn algorithm. Finally, we conclude with Section 5.

Notation. We let [n] stand for the set {1, 2, …, n}, while ℝⁿ₊ stands for the set of all vectors in ℝⁿ with nonnegative components. For a vector x ∈ ℝⁿ and 1 ≤ p ≤ ∞, we denote ‖x‖_p as its ℓ_p-norm and diag(x) as the diagonal matrix with x on the diagonal. 1_n stands for a vector of length n with all of its components equal to 1. ∂_x f refers to a partial gradient of f with respect to x. Lastly, given the dimension n and accuracy ε, the notation a = O(b(n, ε)) stands for the upper bound a ≤ C·b(n, ε), where C is a constant independent of n and ε. Similarly, the notation a = Õ(b(n, ε)) indicates that the previous inequality may additionally depend on logarithmic factors of n and 1/ε.

## 2 Unbalanced Optimal Transport with entropic regularization

In this section, we present the primal and dual forms of the entropic regularized UOT problem and define an ε-approximate solution of the unregularized UOT.

For any two positive vectors a ∈ ℝⁿ₊ and b ∈ ℝⁿ₊, the UOT problem takes the form $\min_{X\in\mathbb{R}^{n\times n}_+} f(X)$, where

$$f(X)=\langle C,X\rangle+\tau\,\mathrm{KL}(X\mathbf{1}_n\|a)+\tau\,\mathrm{KL}(X^\top\mathbf{1}_n\|b),\tag{1}$$

C ∈ ℝⁿˣⁿ₊ is a cost matrix, τ > 0 is a given regularization parameter, and the KL divergence between two positive vectors x and y is defined as

$$\mathrm{KL}(x\|y)=\sum_{i=1}^n\Big[x_i\log\Big(\frac{x_i}{y_i}\Big)-x_i+y_i\Big],$$

and X ∈ ℝⁿˣⁿ₊ is called the transportation plan. When ‖a‖₁ = ‖b‖₁ = 1 and τ → ∞, the UOT problem becomes the standard OT problem. Similar to the original OT problem, the exact computation of UOT is expensive and not scalable in terms of the dimension n. Inspired by the recent success of the entropic regularized OT problem as an efficient approximation of the OT problem, we also consider the entropic version of the UOT problem (Frogner et al., 2015) of finding $\min_{X\in\mathbb{R}^{n\times n}_+} g(X)$, where

$$g(X):=\langle C,X\rangle-\eta H(X)+\tau\,\mathrm{KL}(X\mathbf{1}_n\|a)+\tau\,\mathrm{KL}(X^\top\mathbf{1}_n\|b),\tag{2}$$

η > 0 is the regularization parameter, and H(X) is the entropic regularization defined by

$$H(X):=-\sum_{i,j=1}^n X_{ij}\big(\log(X_{ij})-1\big).\tag{3}$$

For each η > 0, the entropic regularized UOT problem is strongly convex.
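For concreteness, the objectives above are straightforward to evaluate numerically. The following NumPy sketch (an illustration, not code from the paper) computes the KL penalty and the objectives f in (1) and g in (2) for a given plan X:

```python
import numpy as np

def kl(x, y):
    """Generalized KL divergence KL(x || y) = sum_i x_i log(x_i / y_i) - x_i + y_i."""
    return float(np.sum(x * np.log(x / y) - x + y))

def f(X, C, a, b, tau):
    """Unregularized UOT objective (1)."""
    return float(np.sum(C * X)) + tau * kl(X.sum(axis=1), a) + tau * kl(X.sum(axis=0), b)

def entropy(X):
    """Entropic regularization H(X) = -sum_ij X_ij (log X_ij - 1), as in (3)."""
    return -float(np.sum(X * (np.log(X) - 1)))

def g(X, C, a, b, tau, eta):
    """Entropic regularized UOT objective (2)."""
    return f(X, C, a, b, tau) - eta * entropy(X)
```

When X matches both marginals exactly, the two KL penalties vanish and f(X) reduces to the transport cost ⟨C, X⟩.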

###### Definition 1.

For any ε > 0, we call X ∈ ℝⁿˣⁿ₊ an ε-approximate transportation plan if the following holds:

$$\langle C,X\rangle+\tau\,\mathrm{KL}(X\mathbf{1}_n\|a)+\tau\,\mathrm{KL}(X^\top\mathbf{1}_n\|b)\le\langle C,\widehat{X}\rangle+\tau\,\mathrm{KL}(\widehat{X}\mathbf{1}_n\|a)+\tau\,\mathrm{KL}(\widehat{X}^\top\mathbf{1}_n\|b)+\varepsilon,$$

where X̂ is an optimal transportation plan for the UOT problem (1).

We aim to develop an algorithm to obtain an ε-approximate transportation plan for the UOT problem (1). In order to do that, we consider the Fenchel-Legendre dual form of the entropic regularized UOT, which is built from the conjugates of the two KL penalty terms:

$$\begin{aligned} F^*(u)&=\max_{z\in\mathbb{R}^n_+} z^\top u-\tau\,\mathrm{KL}(z\|a)=\tau\langle e^{u/\tau},a\rangle-\tau\, a^\top\mathbf{1}_n,\\ G^*(v)&=\max_{x\in\mathbb{R}^n_+} x^\top v-\tau\,\mathrm{KL}(x\|b)=\tau\langle e^{v/\tau},b\rangle-\tau\, b^\top\mathbf{1}_n.\end{aligned}$$

Since a and b are given non-negative vectors, finding the optimal solution of the dual objective is equivalent to finding the optimal solution of the following objective:

$$\min_{u,v\in\mathbb{R}^n} h(u,v):=\eta\sum_{i,j=1}^n\exp\Big(\frac{u_i+v_j-C_{ij}}{\eta}\Big)+\tau\langle e^{-u/\tau},a\rangle+\tau\langle e^{-v/\tau},b\rangle.\tag{4}$$

Problem (4) is referred to as dual entropic regularized UOT.
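As a sanity check, the dual objective (4) is also a one-liner to evaluate; the sketch below (illustrative, with variable names mirroring the notation above) makes the structure explicit, and the optimal (u, v) recovers the primal plan via X_ij = exp((u_i + v_j − C_ij)/η):

```python
import numpy as np

def dual_h(u, v, C, a, b, tau, eta):
    """Dual entropic regularized UOT objective h(u, v) from (4)."""
    # eta * sum_ij exp((u_i + v_j - C_ij) / eta)
    first = eta * np.exp((u[:, None] + v[None, :] - C) / eta).sum()
    # tau * <e^{-u/tau}, a> + tau * <e^{-v/tau}, b>
    return first + tau * np.exp(-u / tau) @ a + tau * np.exp(-v / tau) @ b
```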

## 3 Complexity analysis of approximating unbalanced optimal transport

In this section, we provide a complexity analysis of the Sinkhorn algorithm for approximating UOT solution. We start with some notations and useful quantities followed by the lemmas and main theorems.

### 3.1 Notations and assumptions

We first denote α = ‖a‖₁ and β = ‖b‖₁. For each η > 0, the optimal solution of the dual form (4) is denoted by (u*, v*), and the corresponding solution of (2) is denoted by X*, where X*_{ij} = exp((u*_i + v*_j − C_{ij})/η). Let a* = X*1_n, b* = (X*)ᵀ1_n and x* = Σ_{i,j} X*_{ij}.
Let (u^k, v^k) be the solution returned at the k-th iteration of the Sinkhorn algorithm, with X^k the corresponding transportation plan. Following the above scheme, we also define a^k, b^k and x^k, correspondingly. Additionally, we define X̂ to be the optimal solution of the unregularized objective (1) and x̂ = Σ_{i,j} X̂_{ij}.

Different from the balanced OT, the optimal solutions of the entropic regularized UOT, and hence our complexity analysis overall, also depend on the masses α, β and the regularization parameter τ. We will assume the following simple regularity conditions throughout the paper.

Regularity Conditions:

• (A1) α and β are positive constants.

• (A2) C is a matrix of non-negative entries.

Before presenting the main theorem and analysis, for convenience, we define some quantities that will be used in our analysis and quantify their magnitudes under the regularity conditions.

List of quantities:

$$\begin{aligned}
R&=\max\{\|\log(a)\|_\infty,\|\log(b)\|_\infty\}+\max\Big\{\log(n),\frac{1}{\eta}\|C\|_\infty-\log(n)\Big\},&(5)\\
\Delta_k&=\max\{\|u^k-u^*\|_\infty,\|v^k-v^*\|_\infty\},&(6)\\
\Lambda_k&=\Big(\frac{\tau}{\tau+\eta}\Big)^k\times\tau\times R,&(7)\\
S&=\frac{\alpha+\beta}{2}+\frac{1}{2}+\frac{1}{4\log(n)},&(8)\\
T&=\Big(\frac{\alpha+\beta}{2}\Big)\Big[\log\Big(\frac{\alpha+\beta}{2}\Big)+2\log(n)-1\Big]+\log(n)+\frac{5}{2},&(9)\\
U&=\max\Big\{S+T,\;2\varepsilon,\;\frac{4\varepsilon\log(n)}{\tau},\;\frac{4\varepsilon(\alpha+\beta)\log(n)}{\tau}\Big\}.&(10)
\end{aligned}$$

As we shall see, the quantities Δ_k and Λ_k are used to establish the convergence rate of the dual iterates (u^k, v^k). We now consider the order of R, S, T and U. Since the order of the penalty terms is τ and η should be small for a good approximation, η is often chosen such that η/τ is sufficiently small. Hence, we can assume the dominant factor in the second term of R is ‖C‖∞/η. If α is a positive constant, then we can expect the smallest entry of a to be as small as c/n for a constant c > 0. In this case, ‖log(a)‖∞ = O(log(n)). Overall, we can assume that R = O(‖C‖∞/η + log(n)), and if α and β are positive constants, then S = O(1) and T = O(log(n)).
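To get a feel for the magnitudes involved, the quantities (5)-(10) can be evaluated directly for a concrete instance. The helper below is purely illustrative (it transcribes the formulas above); in particular it shows that U = S + T = O(log(n)) once ε is small and the masses are of constant order:

```python
import numpy as np

def quantities(a, b, C, tau, eta, eps):
    """Evaluate the quantities R, S, T, U of (5)-(10) for a problem instance."""
    n = a.shape[0]
    alpha, beta = a.sum(), b.sum()
    R = max(np.abs(np.log(a)).max(), np.abs(np.log(b)).max()) \
        + max(np.log(n), np.abs(C).max() / eta - np.log(n))
    S = (alpha + beta) / 2 + 0.5 + 1 / (4 * np.log(n))
    T = ((alpha + beta) / 2) * (np.log((alpha + beta) / 2) + 2 * np.log(n) - 1) \
        + np.log(n) + 2.5
    U = max(S + T, 2 * eps, 4 * eps * np.log(n) / tau,
            4 * eps * (alpha + beta) * np.log(n) / tau)
    return R, S, T, U
```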

### 3.2 Sinkhorn algorithm

The Sinkhorn algorithm (Chizat et al., 2016) alternately minimizes the dual function h in (4) with respect to u and v. Suppose we are at iteration k with k even; by setting the gradient of h with respect to u to 0, we can see that, given fixed v^k, the update u^{k+1} that minimizes the function in (4) satisfies

$$\sum_{j=1}^n\exp\Big(\frac{u^{k+1}_i+v^k_j-C_{ij}}{\eta}\Big)=\exp\Big(-\frac{u^{k+1}_i}{\tau}\Big)a_i.$$

Multiplying both sides by exp((u^k_i − u^{k+1}_i)/η), we get:

$$\exp\Big(\frac{u^{k+1}_i}{\eta}\Big)a^k_i=\exp\Big(\frac{u^k_i}{\eta}\Big)\exp\Big(-\frac{u^{k+1}_i}{\tau}\Big)a_i.$$

Similarly, with u^{k+1} fixed and k odd:

$$\exp\Big(\frac{v^{k+1}_j}{\eta}\Big)b^k_j=\exp\Big(\frac{v^k_j}{\eta}\Big)\exp\Big(-\frac{v^{k+1}_j}{\tau}\Big)b_j.$$

These equations translate to the pseudocode of Algorithm  1, in which we have included our stopping condition stated in Theorem 2.
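The alternating updates above can be sketched in log-domain NumPy as follows. This is a minimal illustration, not the paper's Algorithm 1: it uses the closed-form block minimizer u_i = (ητ/(τ+η))[log(a_i) − logsumexp_j((v_j − C_ij)/η)] of each one-variable subproblem, and it runs a fixed number of iterations in place of the stopping condition of Theorem 2:

```python
import numpy as np

def logsumexp(s, axis):
    """Numerically stable log(sum(exp(s))) along an axis."""
    m = s.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(s - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_uot(C, a, b, tau, eta, n_iters=1000):
    """Log-domain Sinkhorn iterations for the dual (4) of entropic UOT.

    Each half-step exactly minimizes h(u, v) in one block; the damping
    factor tau / (tau + eta) comes from the KL marginal penalties.
    """
    u = np.zeros_like(a)
    v = np.zeros_like(b)
    scale = tau / (tau + eta)
    for _ in range(n_iters):
        u = eta * scale * (np.log(a) - logsumexp((v[None, :] - C) / eta, axis=1))
        v = eta * scale * (np.log(b) - logsumexp((u[:, None] - C) / eta, axis=0))
    # Recover the primal transportation plan X_ij = exp((u_i + v_j - C_ij) / eta).
    X = np.exp((u[:, None] + v[None, :] - C) / eta)
    return X, u, v
```

At a fixed point, the relations of Lemma 1 below hold exactly, which gives a simple convergence check on the returned plan.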

We now present the main theorems.

###### Theorem 1.

For any k ≥ 0, the update (u^{k+1}, v^{k+1}) from Algorithm 1 satisfies the following bound:

$$\Delta_{k+1}\le\Lambda_k,\tag{11}$$

where Δ_{k+1} and Λ_k are defined in (6) and (7), respectively.

###### Remark 1.

Theorem 1 establishes a convergence rate for the dual iterates (u^k, v^k). It is a geometric convergence similar to the work of (Séjourné et al.). However, we obtain a specific upper bound on the convergence rate which depends explicitly on the number of components n and on all other parameters of the masses and the penalty function. The convergence rate of Theorem 1 plays an important role in the complexity analysis of the next theorem.

###### Theorem 2.

Under the regularity conditions (A1-A2), with R and U defined in (5) and (10) respectively, for η = ε/U and

$$k\ge 1+\frac{\tau+\eta}{\eta}\log\Big(\frac{8\tau R(\tau+1)}{\eta^2}\Big),$$

the update X^k from Algorithm 1 is an ε-approximate transportation plan for the UOT problem (1).

The next corollary sums up the complexity of Algorithm 1.

###### Corollary 1.

Under conditions (A1-A2), assume additionally that τ = O(1) and ε ∈ (0, 1). Then the complexity of Algorithm 1 is

$$\mathcal{O}\Big(\frac{n^2}{\varepsilon}\log(n)\Big[\log(\|C\|_\infty)+\log(\log(n))+\log\Big(\frac{1}{\varepsilon}\Big)\Big]\Big),$$

which is also Õ(n²/ε).

###### Proof of Corollary 1.

By the assumptions on the orders of S and T and the definition of U in (10), we have

$$U=O(S)+O(T)+\varepsilon\, O(\log(n))=O(\log(n)).$$

Overall, we obtain

$$k=O\Big(\frac{1}{\varepsilon}\log(n)\Big[\log(\|C\|_\infty)+\log(\log(n))+\log\Big(\frac{1}{\varepsilon}\Big)\Big]\Big).$$

Multiplying by the O(n²) arithmetic operations per iteration, we obtain the final complexity. ∎

###### Remark 2.

In comparison to the best well-known OT complexity of the similar order n², i.e. Õ(n²/ε²) of (Dvurechensky et al., 2018), our complexity for the UOT is better by a factor of 1/ε. Meanwhile, among the practical algorithms for OT which have a similar order of 1/ε, i.e. the Gandkhorn and Randkhorn algorithms with Õ(n^{7/3}/ε), our bound is better by a factor of n^{1/3}.

### 3.3 Analysis of the Sinkhorn algorithm

The analysis for Unbalanced Optimal Transport is different from that of Optimal Transport, since the marginals X1_n and Xᵀ1_n are no longer fixed probability measures. The proof of Theorem 1 requires the convergence rate of the dual iterates and an upper bound on the supremum norm of the optimal dual solution (u*, v*), the latter of which is presented in Lemma 3.

###### Lemma 1.

The optimal solution (u*, v*) of (4) satisfies the following equations:

$$\frac{u^*}{\tau}=\log(a)-\log(a^*),\quad\text{and}\quad \frac{v^*}{\tau}=\log(b)-\log(b^*).$$
###### Proof.

Since (u*, v*) is a fixed point of the update in Algorithm 1, we get

$$u^*=\Big[\frac{u^*}{\eta}+\log(a)-\log(a^*)\Big]\frac{\eta\tau}{\eta+\tau}.$$

This directly leads to the stated equality for u*, and that for v* can be obtained similarly. ∎

###### Lemma 2.

Assume the regularity conditions (A1-A2) hold. Then the following are true:

$$\text{(a)}\;\Big|\log\Big(\frac{a^*_i}{a^k_i}\Big)-\frac{u^*_i-u^k_i}{\eta}\Big|\le\frac{\max_{1\le j\le n}|v^*_j-v^k_j|}{\eta},\qquad \text{(b)}\;\Big|\log\Big(\frac{b^*_j}{b^k_j}\Big)-\frac{v^*_j-v^k_j}{\eta}\Big|\le\frac{\max_{1\le i\le n}|u^*_i-u^k_i|}{\eta}.$$

The proof is given in the appendix.

###### Lemma 3.

The sup norms of the optimal solutions u* and v* are bounded as:

$$\max\{\|u^*\|_\infty,\|v^*\|_\infty\}\le\tau R,$$

where R is defined in (5).

###### Proof.

We start with the equation for the optimal solution in Lemma 1, i.e.

$$\frac{u^*_i}{\tau}=\log(a_i)-\log\Big(\sum_{j=1}^n\exp\Big(\frac{u^*_i+v^*_j-C_{ij}}{\eta}\Big)\Big),$$

which can be rewritten as

$$u^*_i\Big(\frac{1}{\tau}+\frac{1}{\eta}\Big)=\log(a_i)-\log\Big[\sum_{j=1}^n\exp\Big(\frac{v^*_j-C_{ij}}{\eta}\Big)\Big].$$

The second term can be bounded as follows

$$\log\Big[\sum_{j=1}^n\exp\Big(\frac{v^*_j-C_{ij}}{\eta}\Big)\Big]\ge\log(n)+\min_{1\le j\le n}\Big\{\frac{v^*_j-C_{ij}}{\eta}\Big\}\ge\log(n)-\frac{\|v^*\|_\infty}{\eta}-\frac{\|C\|_\infty}{\eta}$$

and

$$\log\Big[\sum_{j=1}^n\exp\Big(\frac{v^*_j-C_{ij}}{\eta}\Big)\Big]\le\log(n)+\max_{1\le j\le n}\Big\{\frac{v^*_j-C_{ij}}{\eta}\Big\}\le\log(n)+\frac{\|v^*\|_\infty}{\eta},$$

so that

$$\Big|\log\Big[\sum_{j=1}^n\exp\Big(\frac{v^*_j-C_{ij}}{\eta}\Big)\Big]\Big|\le\frac{\|v^*\|_\infty}{\eta}+\max\Big\{\log(n),\frac{\|C\|_\infty}{\eta}-\log(n)\Big\}.$$

Hence,

$$|u^*_i|\Big(\frac{1}{\tau}+\frac{1}{\eta}\Big)\le|\log(a_i)|+\frac{\|v^*\|_\infty}{\eta}+\max\Big\{\log(n),\frac{\|C\|_\infty}{\eta}-\log(n)\Big\}.$$

Choosing i such that |u*_i| = ‖u*‖∞ and combining with the fact that |log(a_i)| ≤ max{‖log(a)‖∞, ‖log(b)‖∞}, we have

$$\|u^*\|_\infty\Big(\frac{1}{\tau}+\frac{1}{\eta}\Big)\le\frac{\|v^*\|_\infty}{\eta}+R.$$

Assuming WLOG that ‖v*‖∞ ≤ ‖u*‖∞, we can easily obtain the stated bound. ∎

###### Proof of Theorem 1.

We first consider the case when k is even. From the update of u^{k+1} in Algorithm 1, we have:

$$u^{k+1}_i=\Big(\frac{u^k_i}{\eta}+\log(a_i)-\log(a^k_i)\Big)\frac{\eta\tau}{\tau+\eta}=\Big\{\frac{u^k_i}{\eta}+\big[\log(a_i)-\log(a^*_i)\big]+\big[\log(a^*_i)-\log(a^k_i)\big]\Big\}\frac{\eta\tau}{\tau+\eta}.$$

Using Lemma 1, the above equality is equivalent to

$$u^{k+1}_i-u^*_i=\Big[\eta\log\Big(\frac{a^*_i}{a^k_i}\Big)-(u^*_i-u^k_i)\Big]\frac{\tau}{\tau+\eta}.$$

Using Lemma 2, we get

$$|u^{k+1}_i-u^*_i|\le\frac{\tau}{\tau+\eta}\max_{1\le j\le n}|v^*_j-v^k_j|.$$

This leads to ‖u^{k+1} − u*‖∞ ≤ (τ/(τ+η))‖v^k − v*‖∞. Similarly, we obtain ‖v^k − v*‖∞ ≤ (τ/(τ+η))‖u^{k−1} − u*‖∞. Combining the two inequalities yields

$$\|u^{k+1}-u^*\|_\infty\le\Big(\frac{\tau}{\tau+\eta}\Big)^2\|u^{k-1}-u^*\|_\infty.$$

Repeating the above arguments alternately, we have

$$\|u^{k+1}-u^*\|_\infty\le\Big(\frac{\tau}{\tau+\eta}\Big)^{k+1}\|v^0-v^*\|_\infty=\Big(\frac{\tau}{\tau+\eta}\Big)^{k+1}\|v^*\|_\infty.$$

Note that v^{k+1} = v^k for k even; then

$$\|v^{k+1}-v^*\|_\infty=\|v^k-v^*\|_\infty\le\Big(\frac{\tau}{\tau+\eta}\Big)^{k}\|u^*\|_\infty.$$

These two results lead to Δ_{k+1} ≤ (τ/(τ+η))^k max{‖u*‖∞, ‖v*‖∞}.

Similarly, for k odd we obtain Δ_{k+1} ≤ (τ/(τ+η))^k max{‖u*‖∞, ‖v*‖∞}.

Thus the above inequality is true for all k ≥ 0. Using the fact that max{‖u*‖∞, ‖v*‖∞} ≤ τR from Lemma 3, we obtain the conclusion. ∎

### 3.4 Proof of the main theorem

The proof is based on the upper bound for the convergence rate in Theorem 1 and on upper bounds for the total masses x* and x̂ of the solutions X* and X̂ of (2) and (1), respectively, which are direct consequences of the following lemma.

###### Lemma 4.

Assume that the function g(X) attains its minimum at X*, and recall x* = Σ_{i,j} X*_{ij}; then

$$g(X^*)+(2\tau+\eta)x^*=\tau(\alpha+\beta).\tag{12}$$

Similarly, assume that f(X) attains its minimum at X̂, and recall x̂ = Σ_{i,j} X̂_{ij}; then

$$f(\widehat{X})+2\tau\widehat{x}=\tau(\alpha+\beta).\tag{13}$$

Both equations in Lemma 4 establish relationships between the optimal solutions of (1) and (2) and the other parameters. Those relationships are very useful for analysing the behaviour of the optimal solution of UOT, because the UOT does not impose any conditions on the marginals as the OT does. Consequences of Lemma 4 include Corollary 2, which provides upper bounds for x* and x̂, as well as bounds for the entropic terms in the proof of Theorem 2. The key idea of the proof, perhaps surprisingly, comes from the fact that the UOT solution does not have to meet the marginal constraints. We now present the proof of Lemma 4 and defer the proof of Corollary 2 to the Appendix.

###### Proof.

Consider the function g(tX*) for t > 0:

$$g(tX^*)=\langle C,tX^*\rangle+\tau\,\mathrm{KL}(tX^*\mathbf{1}_n\|a)+\tau\,\mathrm{KL}((tX^*)^\top\mathbf{1}_n\|b)-\eta H(tX^*).$$

For the first KL term of g(tX*), we have:

$$\begin{aligned}\mathrm{KL}(tX^*\mathbf{1}_n\|a)&=\sum_{i=1}^n(ta^*_i)\log\Big(\frac{ta^*_i}{a_i}\Big)-\sum_{i=1}^n(ta^*_i)+\sum_{i=1}^n a_i\\&=\sum_{i=1}^n(ta^*_i)\Big(\log\Big(\frac{a^*_i}{a_i}\Big)+\log(t)\Big)-tx^*+\alpha\\&=t\,\mathrm{KL}(X^*\mathbf{1}_n\|a)+(1-t)\alpha+x^*t\log(t).\end{aligned}$$

Similarly, we get

$$\mathrm{KL}((tX^*)^\top\mathbf{1}_n\|b)=t\,\mathrm{KL}((X^*)^\top\mathbf{1}_n\|b)+(1-t)\beta+x^*t\log(t).$$

For the entropic penalty term,

$$\begin{aligned}-H(tX^*)&=\sum_{i,j=1}^n tX^*_{ij}\big(\log(tX^*_{ij})-1\big)\\&=\sum_{i,j}tX^*_{ij}\big(\log(X^*_{ij})-1\big)+x^*t\log(t)\\&=-tH(X^*)+x^*t\log(t).\end{aligned}$$

Putting all results together, we obtain

$$g(tX^*)=tg(X^*)+\tau(1-t)(\alpha+\beta)+(2\tau+\eta)x^*t\log(t).$$

Taking the derivative of g(tX*) with respect to t,

$$\frac{\partial g(tX^*)}{\partial t}=g(X^*)-\tau(\alpha+\beta)+(2\tau+\eta)x^*(1+\log(t)).$$

The function g(tX*) is well-defined for all t > 0 and, by optimality of X*, attains its minimum at t = 1. Setting the derivative at t = 1 to zero, we obtain

$$g(X^*)-\tau(\alpha+\beta)+(2\tau+\eta)x^*=0\quad\Longleftrightarrow\quad g(X^*)+(2\tau+\eta)x^*=\tau(\alpha+\beta).$$

The second claim is proved in the same way. ∎
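The scaling identity at the heart of this proof, g(tX) = t·g(X) + τ(1−t)(α+β) + (2τ+η)·x·t·log(t), in fact holds for any positive matrix X, not only the minimizer; Lemma 4 then follows by setting the derivative at t = 1 to zero. A quick numerical check of that identity (an illustration with arbitrary made-up data):

```python
import numpy as np

def kl(x, y):
    return float(np.sum(x * np.log(x / y) - x + y))

def g(X, C, a, b, tau, eta):
    """Entropic regularized UOT objective (2)."""
    H = -float(np.sum(X * (np.log(X) - 1)))
    return (float(np.sum(C * X)) + tau * kl(X.sum(axis=1), a)
            + tau * kl(X.sum(axis=0), b) - eta * H)

rng = np.random.default_rng(1)
n = 4
C = rng.random((n, n))
a = rng.random(n) + 0.1
b = rng.random(n) + 0.1
X = rng.random((n, n)) + 0.1      # an arbitrary positive plan, not the minimizer
tau, eta, t = 2.0, 0.5, 0.7
alpha, beta, x_mass = a.sum(), b.sum(), X.sum()
lhs = g(t * X, C, a, b, tau, eta)
rhs = (t * g(X, C, a, b, tau, eta) + tau * (1 - t) * (alpha + beta)
       + (2 * tau + eta) * x_mass * t * np.log(t))
assert abs(lhs - rhs) < 1e-9
```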

###### Corollary 2.

Assume that conditions (A1-A2) hold and that η log(n)/τ is sufficiently small. We have the following bounds on x* and x̂:

$$\text{(a)}\;x^*\le\Big(\frac12+\frac{\eta\log(n)}{2\tau-2\eta\log(n)}\Big)(\alpha+\beta)+\frac{1}{6\log(n)},\qquad\text{(b)}\;\widehat{x}\le\frac{\alpha+\beta}{2}.$$

Next, we use the condition on k in Theorem 2 to bound some relevant quantities at the k-th iteration of the Sinkhorn algorithm.

###### Lemma 5.

Assume that the regularity conditions (A1-A2) hold and that k satisfies the inequality in Theorem 2. The following are true:

$$\text{(a)}\;\Lambda_{k-1}\le\frac{\eta^2}{8(\tau+1)},\qquad\text{(b)}\;|x^k-x^*|\le\frac{3\Delta_k}{\eta}\min\{x^*,x^k\},\qquad\text{(c)}\;x^k\le S,$$

where S is defined in (8).

We are now ready to construct a proof for Theorem 2.

###### Proof of Theorem 2.

From the definitions of f and g, we have

$$\begin{aligned}f(X^k)-f(\widehat{X})&=g(X^k)+\eta H(X^k)-g(\widehat{X})-\eta H(\widehat{X})\\&=\big[g(X^k)-g(X^*)\big]+\big[g(X^*)-g(\widehat{X})\big]+\eta\big[H(X^k)-H(\widehat{X})\big]\\&\le\big[g(X^k)-g(X^*)\big]+\eta\big[H(X^k)-H(\widehat{X})\big],\end{aligned}\tag{14}$$

since g(X*) ≤ g(X̂), as X* is the optimal solution of (2). The above two terms can be bounded separately as follows:

#### Upper bound of H(X^k) − H(X̂).

We first show the following inequalities:

$$x-x\log(x)\le H(X)\le 2x\log(n)+x-x\log(x)\tag{15}$$

for any X ∈ ℝⁿˣⁿ₊ such that Σ_{i,j} X_{ij} = x and x > 0.

Indeed, rewriting −H(X) as

$$-H(X)=x\Big[\sum_{i,j=1}^n\frac{X_{ij}}{x}\log\Big(\frac{X_{ij}}{x}\Big)-1\Big]+x\log(x),$$

and using the fact that X/x is a probability distribution on n² elements, so that −2log(n) ≤ Σ_{i,j}(X_{ij}/x)log(X_{ij}/x) ≤ 0, we thus obtain (15).

Now apply the lower bound of (15) to H(X̂):

$$-H(\widehat{X})\le\widehat{x}\log(\widehat{x})-\widehat{x}\le\max\Big\{0,\frac{\alpha+\beta}{2}\Big[\log\Big(\frac{\alpha+\beta}{2}\Big)-1\Big]\Big\}\le\frac{\alpha+\beta}{2}\log\Big(\frac{\alpha+\beta}{2}\Big)-\frac{\alpha+\beta}{2}+1,$$

where the second inequality is due to x log(x) − x being convex in x and 0 ≤ x̂ ≤ (α+β)/2 by Corollary 2, and the third inequality is due to y log(y) − y + 1 ≥ 0 for all y > 0.

Similarly, apply the upper bound of (15) to H(X^k):

$$H(X^k)\le 2x^k\log(n)+x^k-x^k\log(x^k)\le 2x^k\log(n)+1\le\Big(\alpha+\beta+1+\frac{1}{2\log(n)}\Big)\log(n)+1,$$

where the second inequality uses x − x log(x) ≤ 1 and the last inequality uses x^k ≤ S from Lemma 5.

By combining the two results, we have

$$H(X^k)-H(\widehat{X})\le T,\tag{16}$$

where T is defined in (9).

#### Upper bound of g(X^k) − g(X*).

WLOG we assume that k is odd. At step k of Algorithm 1, we find u^k by minimizing the dual function (4) with v^{k−1} fixed, and simply keep v^k = v^{k−1}. Hence, X^k is the optimal solution of

$$\min_{X\in\mathbb{R}^{n\times n}_+} g_k(X):=\langle C,X\rangle-\eta H(X)+\tau\,\mathrm{KL}(X\mathbf{1}_n\|a)+\tau\,\mathrm{KL}(X^\top\mathbf{1}_n\|b^k),$$

where b^k = ((X^k)ᵀ1_n) ∘ exp(v^k/τ), with ∘ denoting element-wise multiplication.

Denote β^k = Σ_{j=1}^n b^k_j. By Lemma 4,

$$g_k(X^k)=\tau(\alpha+\beta^k)-(2\tau+\eta)x^k,\qquad g(X^*)=\tau(\alpha+\beta)-(2\tau+\eta)x^*.$$

Writing g(X^k) = g_k(X^k) + τKL((X^k)ᵀ1_n‖b) − τKL((X^k)ᵀ1_n‖b^k) and following some derivations using the above equations for g_k(X^k) and g(X*) and the definitions of β and β^k, we get

$$g(X^k)-g(X^*)=\big[-(2\tau+\eta)(x^k-x^*)\big]+\tau\Big[\sum_{j=1}^n b^k_j\log\Big(\frac{b^k_j}{b_j}\Big)\Big].\tag{17}$$

By part (b) of Lemma 5, the first term is bounded by 3(2τ+η)x^kΔ_k/η.

Note that, by Lemma 1, log(b*_j/b_j) = −v*_j/τ. Using part (b) of Lemma 2,

$$\log\Big(\frac{b^k_j}{b_j}\Big)=\Big[-\log\Big(\frac{b^*_j}{b^k_j}\Big)\Big]+\frac{1}{\tau}(v^k_j-v^*_j),\qquad\Big|\log\Big(\frac{b^k_j}{b_j}\Big)\Big|\le\frac{2\Delta_k}{\eta}+\frac{\Delta_k}{\tau}=\Big(\frac{2}{\eta}+\frac{1}{\tau}\Big)\Delta_k.$$

Note that b^k_j > 0 for all j and Σ_{j=1}^n b^k_j = x^k. The above inequality leads to

$$\Big|\sum_{j=1}^n b^k_j\log\Big(\frac{b^k_j}{b_j}\Big)\Big|\le x^k\Big(\frac{2}{\eta}+\frac{1}{\tau}\Big)\Delta_k.$$

We have

$$g(X^k)-g(X^*)\le\big[(2\tau+\eta)+3(2\tau+\eta)\big]\times x^k\times\frac{\Delta_k}{\eta}\le 8(\tau+1)\times S\times\frac{\Lambda_{k-1}}{\eta},$$

where the first inequality is obtained by combining the bounds for the two terms of (17), while the second inequality results from the facts that x^k ≤ S and Δ_k ≤ Λ_{k−1}, by Lemma 5 and Theorem 1, together with 4(2τ+η) ≤ 8(τ+1) for η ≤ 2.