Computing distances between probability measures on metric spaces, or more generally between point clouds, plays an increasingly important role in machine learning [SL11, MJ15, LG15, JSCG16, ACB17], statistics [FCCR16, PZ16, SR04, BGKL17], and computer vision [RTG00, BvdPPH11, SdGP15]. A prominent example of such a distance is the earth mover's distance introduced in [WPR85] (see also [RTG00]), which is a special case of the Wasserstein distance, or optimal transport (OT) distance [Vil09].
While OT distances exhibit a unique ability to capture geometric features of the objects at hand, they suffer from a heavy computational cost that had been prohibitive in large scale applications until the recent introduction to the machine learning community of Sinkhorn Distances by Cuturi [Cut13]. Combined with other numerical tricks, these recent advances have enabled the treatment of large point clouds in computer graphics such as triangle meshes [SdGP15] and high-resolution neuroimaging data [GPC15]. Sinkhorn Distances rely on the idea of entropic penalization, which has been implemented in similar problems at least since Schrödinger [Sch31, Leo14]. This powerful idea has been successfully applied to a variety of contexts not only as a statistical tool for model selection [JRT08, RT11, RT12] and online learning [CBL06], but also as an optimization gadget in first-order optimization methods such as mirror descent and proximal methods [Bub15].
Related work. Computing an OT distance amounts to solving the following linear program:
\[ \min_{P \in \mathcal{U}_{r,c}} \langle P, C \rangle, \qquad \mathcal{U}_{r,c} := \{ P \in \mathbb{R}_+^{n \times n} : P\mathbf{1} = r, \ P^\top \mathbf{1} = c \}, \tag{1} \]
where $\mathbf{1}$ is the all-ones vector in $\mathbb{R}^n$, $C \in \mathbb{R}_+^{n \times n}$ is a given cost matrix, and $r, c \in \mathbb{R}^n$ are given vectors with positive entries that sum to one. Typically $C$ is a matrix containing pairwise distances (and is thus dense), but in this paper we allow $C$ to be an arbitrary non-negative dense matrix with bounded entries, since our results are more general. For brevity, this paper focuses on square matrices $C$ and $P$, since extensions to the rectangular case are straightforward.
This paper is at the intersection of two lines of research: a theoretical one that aims at finding (near) linear-time approximation algorithms for simple problems that are already known to run in polynomial time, and a practical one that pursues fast algorithms for solving optimal transport approximately on large datasets.
Noticing that (1) is a linear program with $O(n)$ linear constraints and certain graphical structure, one can use the recent Lee-Sidford linear solver to find a solution in time $\tilde O(n^{2.5})$ [LS14], improving over the previous standard of $\tilde O(n^{3.5})$ [Ren88]. While no practical implementation of the Lee-Sidford algorithm is known, it provides a theoretical benchmark for our methods. Their result is part of a long line of work initiated by the seminal paper of Spielman and Teng [ST04] on solving linear systems of equations, which has provided a building block for near-linear time approximation algorithms for a variety of combinatorially structured linear problems. A separate line of work has focused on obtaining faster algorithms for (1) by imposing additional assumptions. For instance, [AS14] obtain approximations to (1) when the cost matrix $C$ arises from a metric, but their running times are not truly near-linear. [SA12, ANOY14] develop even faster algorithms for (1), but require $C$ to arise from a low-dimensional metric.
Practical algorithms for computing OT distances include Orlin's algorithm for the Uncapacitated Minimum Cost Flow problem, via a standard reduction. Like interior point methods, it has a provable complexity of $O(n^3 \log n)$. This superquadratic dependence on the dimension is also observed in practice, thereby preventing large-scale applications. To overcome the limitations of such general solvers, various ideas ranging from graph sparsification [PW09] to metric embedding [IT03, GD04, SJ08] have been proposed over the years to deal with particular cases of OT distance.
Our work complements both lines of work, theoretical and practical, by providing the first near-linear time guarantee to approximate (1) for general non-negative cost matrices. Moreover we show that this performance is achieved by algorithms that are also very efficient in practice. Central to our contribution are recent developments of scalable methods for general OT that leverage the idea of entropic regularization [Cut13, BCC15, GCPB16]. However, the apparent practical efficacy of these approaches came without theoretical guarantees. In particular, showing that this regularization yields an algorithm to compute or approximate general OT distances in time nearly linear in the input size was an open question before this work.
Our contribution. The contribution of this paper is twofold. First, we demonstrate that, with an appropriate choice of parameters, the algorithm for Sinkhorn Distances introduced in [Cut13] is in fact a near-linear time approximation algorithm for computing OT distances between discrete measures. This is the first proof that such near-linear time results are achievable for optimal transport. We also provide previously unavailable guidance for parameter tuning in this algorithm. Core to our work is a new and arguably more natural analysis of the Sinkhorn iteration algorithm, which we show converges in a number of iterations independent of the dimension of the matrix to balance. In particular, this analysis directly suggests a greedy variant of Sinkhorn iteration that also provably runs in near-linear time and significantly outperforms the classical algorithm in practice. Finally, while most approximation algorithms output only an approximation of the optimum value of the linear program (1), we also describe a simple, parallelizable rounding algorithm that provably outputs a feasible solution to (1). Specifically, for any $\varepsilon > 0$ and any bounded, non-negative cost matrix $C$, we describe an algorithm that runs in time $\tilde O(n^2/\varepsilon^3)$ and outputs $\hat P \in \mathcal{U}_{r,c}$ such that
\[ \langle \hat P, C \rangle \le \min_{P \in \mathcal{U}_{r,c}} \langle P, C \rangle + \varepsilon. \]
We emphasize that our analysis does not require the cost matrix $C$ to come from an underlying metric; we only require $C$ to be non-negative. This implies that our results also give, for example, near-linear time approximation algorithms for Wasserstein $p$-distances between discrete measures.
Notation. We denote non-negative real numbers by $\mathbb{R}_+$, the set of integers $\{1, \dots, n\}$ by $[n]$, and the $n$-dimensional simplex by $\Delta_n := \{x \in \mathbb{R}_+^n : \sum_{i=1}^n x_i = 1\}$. For two probability distributions $p, q \in \Delta_n$ such that $p$ is absolutely continuous w.r.t. $q$, we define the entropy $H(p)$ of $p$ and the Kullback-Leibler divergence $K(p \| q)$ between $p$ and $q$ respectively by
\[ H(p) = \sum_{i=1}^n p_i \log \frac{1}{p_i}, \qquad K(p \| q) = \sum_{i=1}^n p_i \log \frac{p_i}{q_i}. \]
Similarly, for a matrix $P \in \mathbb{R}_+^{n \times n}$, we define the entropy entrywise as $H(P) = \sum_{i,j} P_{ij} \log \frac{1}{P_{ij}}$. We use $\mathbf{1}$ and $\mathbf{0}$ to denote the all-ones and all-zeroes vectors in $\mathbb{R}^n$. For a matrix $A = (A_{ij})$, we denote by $\exp(A)$ the matrix with entries $(e^{A_{ij}})$. For $A \in \mathbb{R}_+^{n \times n}$, we denote its row and column sums by $r(A) := A \mathbf{1}$ and $c(A) := A^\top \mathbf{1}$, respectively. The coordinates $r_i(A)$ and $c_j(A)$ denote the $i$th row sum and $j$th column sum of $A$, respectively. We write $\|A\|_\infty := \max_{ij} |A_{ij}|$ and $\|A\|_1 := \sum_{ij} |A_{ij}|$. For two matrices of the same dimension, we denote the Frobenius inner product of $A$ and $B$ by $\langle A, B \rangle := \sum_{ij} A_{ij} B_{ij}$. For a vector $x \in \mathbb{R}^n$, we write $D(x)$ to denote the diagonal matrix with entries $(D(x))_{ii} = x_i$. For any two nonnegative sequences $(u_n), (v_n)$, we write $u_n = \tilde O(v_n)$ if there exist positive constants $C$ and $k$ such that $u_n \le C v_n \log^k(v_n)$. For any two real numbers $a, b$, we write $a \wedge b := \min(a, b)$.
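For concreteness, the quantities above can be computed directly; the following helpers (illustrative, not part of the formal development) implement $H$, $K(\cdot \| \cdot)$, and the marginal maps $r(\cdot)$, $c(\cdot)$ with the convention $0 \log 0 = 0$:

```python
import numpy as np

def entropy(p):
    """Entropy H(p) = sum_i p_i log(1/p_i), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_div(p, q):
    """Kullback-Leibler divergence K(p || q) = sum_i p_i log(p_i / q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def marginals(P):
    """Row and column sums r(P) = P 1 and c(P) = P^T 1."""
    return P.sum(axis=1), P.sum(axis=0)
```

These conventions (masking zero entries rather than relying on `0 * log 0`) avoid spurious NaNs in floating point.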
2 Optimal Transport in near-linear time
In this section, we describe the main algorithm studied in this paper. Pseudocode appears in Algorithm 1.
The core of our algorithm is the computation of an approximate Sinkhorn projection of the matrix $A = \exp(-\eta C)$ (Step 1), details for which are given in Section 3. Since our approximate Sinkhorn projection is not guaranteed to lie in the feasible set, we round our approximation to ensure that it lies in $\mathcal{U}_{r,c}$ (Step 2). Pseudocode for a simple, parallelizable rounding procedure is given in Algorithm 2.
Algorithm 1 hinges on two subroutines: Proj and round. We give two algorithms for Proj: Sinkhorn and Greenkhorn. We devote Section 3 to their analysis, which is of independent interest. On the other hand, round is fairly simple. Its analysis is postponed to Section 4.
The time complexity in the above theorem counts only elementary arithmetic operations. In the interest of clarity, we ignore questions of bit complexity that may arise from taking exponentials. The effect of this simplification is marginal, since it can easily be shown [KLRS08] that the maximum bit complexity throughout the iterations of our algorithm is logarithmic in $n/\varepsilon$. As a result, factoring in bit complexity inflates the runtime by only a logarithmic factor, which is still truly near-linear.
3 Linear-time approximate Sinkhorn projection
The core of our OT algorithm is the entropic penalty proposed by Cuturi [Cut13]:
\[ P_\eta := \operatorname*{argmin}_{P \in \mathcal{U}_{r,c}} \ \langle P, C \rangle - \eta^{-1} H(P). \tag{2} \]
The solution to (2) can be characterized explicitly by analyzing its first-order conditions for optimality.
We call the matrix appearing in Lemma 1 the Sinkhorn projection of $A = \exp(-\eta C)$, denoted $\Pi_{\mathcal{S}}(A)$, after Sinkhorn, who proved uniqueness in [Sin67]. Computing $\Pi_{\mathcal{S}}(A)$ exactly is impractical, so we implement instead an approximate version $\mathrm{Proj}(A, \mathcal{U}_{r,c}, \varepsilon')$, which outputs a matrix $B$ that may not lie in $\mathcal{U}_{r,c}$ but satisfies the condition $\|r(B) - r\|_1 + \|c(B) - c\|_1 \le \varepsilon'$. We stress that this condition is very natural from a statistical standpoint, since it requires that $r(B)$ and $c(B)$ are close to the target marginals $r$ and $c$ in total variation distance.
3.1 The classical Sinkhorn algorithm
Given a matrix $A$, Sinkhorn proposed a simple iterative algorithm to approximate the Sinkhorn projection $\Pi_{\mathcal{S}}(A)$, which is now known as the Sinkhorn-Knopp algorithm or the RAS method. Despite the simplicity of this algorithm and its good performance in practice, it has been difficult to analyze. As a result, recent work showing that $\Pi_{\mathcal{S}}(A)$ can be approximated in near-linear time [AZLOW17, CMTV17] has bypassed the Sinkhorn-Knopp algorithm entirely.¹ In our work, we obtain a new analysis of the simple and practical Sinkhorn-Knopp algorithm, showing that it also approximates $\Pi_{\mathcal{S}}(A)$ in near-linear time.

¹Replacing the Proj step in Algorithm 1 with the matrix-scaling algorithm developed in [CMTV17] results in a runtime with a better dependence on $\varepsilon$ than what we present in Theorem 1. The benefit of our approach is that it is extremely easy to implement, whereas the matrix-scaling algorithm of [CMTV17] relies heavily on near-linear time Laplacian solver subroutines, which are not implementable in practice.
Pseudocode for the Sinkhorn-Knopp algorithm appears in Algorithm 3. In brief, it is an alternating projection procedure that renormalizes the rows and columns of $A$ in turn so that they match the desired row and column marginals $r$ and $c$. At each step, it prescribes either to modify all the rows, by multiplying row $i$ by $r_i / r_i(P)$ for $i \in [n]$, or to do the analogous operation on the columns. (We interpret the quantity $0/0$ as $1$ in this algorithm if it ever occurs.) The algorithm terminates when the matrix $P$ is sufficiently close to the polytope $\mathcal{U}_{r,c}$.
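A minimal sketch of this alternating procedure, assuming dense numpy arrays; the total-variation stopping rule matches the criterion used throughout this paper, but the parameter names and defaults are our illustrative choices:

```python
import numpy as np

def sinkhorn(A, r, c, num_iters=1000, tol=1e-8):
    """Sinkhorn-Knopp: alternately rescale rows and columns of a positive
    matrix A so that its marginals approach the targets r and c."""
    P = A / A.sum()  # work on a normalized copy
    for k in range(num_iters):
        if k % 2 == 0:
            P = P * (r / P.sum(axis=1))[:, None]   # match row sums
        else:
            P = P * (c / P.sum(axis=0))[None, :]   # match column sums
        # stop once both marginals are close in total-variation distance
        err = np.abs(P.sum(axis=1) - r).sum() + np.abs(P.sum(axis=0) - c).sum()
        if err <= tol:
            break
    return P
```

Each iteration costs $O(n^2)$ arithmetic operations, since it touches every entry of the matrix once.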
3.2 Prior work
Before this work, the best analysis of Algorithm 3 bounded the number of iterations sufficient to obtain a matrix close to $\mathcal{U}_{r,c}$ in $\ell_2$ distance:
Unfortunately, this analysis is not strong enough to obtain a true near-linear time guarantee. Indeed, the $\ell_2$ norm is not an appropriate measure of closeness between probability vectors, since very different distributions on large alphabets can nevertheless have small $\ell_2$ distance: for example, the distributions $(1/n, \dots, 1/n, 0, \dots, 0)$ and $(0, \dots, 0, 1/n, \dots, 1/n)$ in $\Delta_{2n}$ have $\ell_2$ distance $\sqrt{2/n}$ even though they have disjoint support. As noted above, for statistical problems, including computation of the OT distance, it is more natural to measure distance in $\ell_1$ norm.
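This phenomenon is easy to verify numerically; the snippet below uses the two disjointly supported distributions just described (the choice $n = 10{,}000$ is illustrative):

```python
import numpy as np

n = 10_000
# p is uniform on the first n coordinates of R^{2n}, q on the last n,
# so their supports are disjoint: statistically they are maximally far apart.
p = np.concatenate([np.full(n, 1 / n), np.zeros(n)])
q = np.concatenate([np.zeros(n), np.full(n, 1 / n)])

l2 = np.linalg.norm(p - q)   # sqrt(2/n): vanishes as n grows
l1 = np.abs(p - q).sum()     # 2.0: the maximum possible l1 distance
```

As $n$ grows, the $\ell_2$ distance tends to zero while the $\ell_1$ (total-variation) distance stays at its maximum.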
The following Corollary gives the best guarantee available from Proposition 1.
Algorithm 3 with outputs a matrix satisfying in iterations.
3.3 New analysis of the Sinkhorn algorithm
Our new analysis allows us to obtain a dimension-independent bound on the number of iterations beyond the uniform case.
Algorithm 3 outputs a matrix $P$ satisfying $\|r(P) - r\|_1 + \|c(P) - c\|_1 \le \varepsilon'$ in $O(\varepsilon'^{-2} \log(s/\ell))$ iterations, where $s$ is the sum of the entries of $A$ and $\ell$ is the smallest entry of $A$.
Comparing our result with Corollary 1, we see that our bound is always stronger, by up to a factor of $n$. Moreover, our analysis is extremely short. Our improved results and simplified proof follow directly from the fact that we carry out the analysis entirely with respect to the Kullback-Leibler divergence, a common measure of statistical distance. This measure possesses a close connection to the total-variation distance via Pinsker's inequality (Lemma 4, below), from which we obtain the desired bound. Similar ideas can be traced back at least to [GY98], where an analysis of Sinkhorn iterations for bistochastic targets is sketched in the context of a different problem: detecting the existence of a perfect matching in a bipartite graph.
We first define some notation. Given a matrix $A$ and desired row and column sums $r$ and $c$, we define the potential (Lyapunov) function $f : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ by
\[ f(x, y) = \sum_{i,j} A_{ij} e^{x_i + y_j} - \langle r, x \rangle - \langle c, y \rangle. \]
This auxiliary function has appeared in much of the literature on Sinkhorn projections [KLRS08, CMTV17, KK96, KK93]. We call the vectors $x$ and $y$ scaling vectors. It is easy to check that a minimizer $(x^*, y^*)$ of $f$ yields the Sinkhorn projection of $A$: writing $P := D(e^{x^*}) A D(e^{y^*})$, first-order optimality conditions imply that $P$ lies in $\mathcal{U}_{r,c}$, and therefore $P = \Pi_{\mathcal{S}}(A)$.
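As a sanity check on this characterization, the sketch below evaluates the potential and verifies that a single row-rescaling step, which is exact coordinate minimization in $x$, never increases it (the matrix and marginals are arbitrary illustrative choices):

```python
import numpy as np

def potential(A, x, y, r, c):
    """f(x, y) = sum_ij A_ij e^{x_i + y_j} - <r, x> - <c, y>."""
    B = np.exp(x)[:, None] * A * np.exp(y)[None, :]
    return B.sum() - r @ x - c @ y

rng = np.random.default_rng(1)
A = rng.random((6, 6)) + 0.1
r = np.full(6, 1 / 6)
c = np.full(6, 1 / 6)
x = np.zeros(6)
y = np.zeros(6)

before = potential(A, x, y, r, c)
# a Sinkhorn row step is exact coordinate minimization of f over x:
B = np.exp(x)[:, None] * A * np.exp(y)[None, :]
x = x + np.log(r / B.sum(axis=1))
after = potential(A, x, y, r, c)
assert after <= before  # rescaling rows can only decrease the potential
```

Setting the gradient in $x_i$ to zero gives $e^{x_i} \sum_j A_{ij} e^{y_j} = r_i$, which is exactly the multiplicative row update of Algorithm 3.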
The following lemma exactly characterizes the improvement in the potential function from an iteration of Sinkhorn, in terms of our current divergence to the target marginals.
If , then
Assume without loss of generality that is odd, so that and . (If is even, interchange the roles of and .) By definition,
where we have used that: and ; for all , ; and since . ∎
The next lemma has already appeared in the literature and we defer its proof to the Appendix.
If is a positive matrix with and smallest entry , then
Lemma 4 (Pinsker’s Inequality).
For any probability measures $p$ and $q$, $\|p - q\|_1 \le \sqrt{2\, K(p \| q)}$.
3.4 Greedy Sinkhorn
In addition to a new analysis of Sinkhorn, we propose a new algorithm Greenkhorn, which enjoys the same convergence guarantee but performs better in practice. Instead of performing alternating updates of all rows and columns of $A$, the Greenkhorn algorithm updates only a single row or column at each step. Thus Greenkhorn updates only $O(n)$ entries of $A$ per iteration, rather than $O(n^2)$.
In this respect, Greenkhorn is similar to the stochastic algorithm for Sinkhorn projection proposed by [GCPB16]. There is a natural interpretation of both algorithms as coordinate descent algorithms in the dual space corresponding to row/column violations. Nevertheless, our algorithm differs from theirs in several key ways. Instead of choosing a row or column to update randomly, Greenkhorn chooses the best row or column to update greedily. Additionally, Greenkhorn does an exact line search on the coordinate in question since there is a simple closed form for the optimum, whereas the algorithm proposed by [GCPB16] updates in the direction of the average gradient. Our experiments establish that Greenkhorn performs better in practice; more details appear in the Appendix.
We emphasize that our algorithm is an extremely natural modification of Sinkhorn, and greedy algorithms for the scaling problem have been proposed before, though these do not come with explicit near-linear time guarantees [PL82]. However, whereas previous analyses of Sinkhorn cannot be modified to extract any meaningful rates of convergence for greedy algorithms, our new analysis of Sinkhorn from Section 3.3 applies to Greenkhorn with only trivial modifications.
Pseudocode for Greenkhorn appears in Algorithm 4. We define the distance function $\rho : \mathbb{R}_+ \times \mathbb{R}_+ \to [0, +\infty]$ by
\[ \rho(a, b) = b - a + a \log \frac{a}{b}. \]
The choice of $\rho$ is justified by its appearance in Lemma 5, below. While $\rho$ is not a metric, it is easy to see that $\rho$ is nonnegative and satisfies $\rho(a, b) = 0$ iff $a = b$.
We note that after $r(A)$ and $c(A)$ are computed once at the beginning of the algorithm, Greenkhorn can easily be implemented such that each iteration runs in only $O(n)$ time.
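A sketch of the greedy scheme, maintaining row and column sums incrementally so that each iteration costs $O(n)$; the function names, iteration cap, and termination rule are our illustrative choices:

```python
import numpy as np

def rho(a, b):
    """rho(a, b) = b - a + a log(a/b), for positive a and b."""
    return b - a + a * np.log(a / b)

def greenkhorn(A, r, c, num_iters=10_000, tol=1e-8):
    """Greedy Sinkhorn: rescale only the single row or column whose
    marginal violation, measured by rho, is currently largest."""
    P = A / A.sum()
    row_sums, col_sums = P.sum(axis=1), P.sum(axis=0)
    for _ in range(num_iters):
        if np.abs(row_sums - r).sum() + np.abs(col_sums - c).sum() <= tol:
            break
        row_gain, col_gain = rho(r, row_sums), rho(c, col_sums)
        i, j = int(np.argmax(row_gain)), int(np.argmax(col_gain))
        if row_gain[i] >= col_gain[j]:
            scale = r[i] / row_sums[i]
            col_sums += (scale - 1.0) * P[i, :]   # maintain marginals in O(n)
            P[i, :] *= scale
            row_sums[i] = r[i]
        else:
            scale = c[j] / col_sums[j]
            row_sums += (scale - 1.0) * P[:, j]
            P[:, j] *= scale
            col_sums[j] = c[j]
    return P
```

The greedy choice and the line search are both $O(n)$, since only one row or column of $P$ and the two cached marginal vectors are touched per iteration.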
The algorithm Greenkhorn outputs a matrix $P$ satisfying $\|r(P) - r\|_1 + \|c(P) - c\|_1 \le \varepsilon'$ in $O(n \varepsilon'^{-2} \log(s/\ell))$ iterations, where $s$ is the sum of the entries of $A$ and $\ell$ is the smallest entry of $A$. Since each iteration takes $O(n)$ time, such a matrix can be found in $O(n^2 \varepsilon'^{-2} \log(s/\ell))$ time.
The analysis requires the following lemma, which is an easy modification of Lemma 2.
Let and be successive iterates of Greenkhorn, with corresponding scaling vectors and . If was obtained from by updating row , then
and if it was obtained by updating column , then
We also require the following extension of Pinsker’s inequality (proof in Appendix).
For any , define . If , then
Proof of Theorem 3.
We follow the proof of Theorem 2. Since the row or column update is chosen greedily, at each step we make progress of at least . If and are both at most , then under the assumption that , our progress is at least
Likewise, if either or is larger than , our progress is at least . Therefore, we terminate in at most iterations. ∎
4 Proof of Theorem 1
First, we present a simple guarantee about the rounding Algorithm 2. The following lemma shows that the distance between the input matrix and rounded matrix is controlled by the total-variation distance between the input matrix’s marginals and and the desired marginals and .
If $r, c \in \Delta_n$ and $F \in \mathbb{R}_+^{n \times n}$, then Algorithm 2 takes $O(n^2)$ time to output a matrix $G \in \mathcal{U}_{r,c}$ satisfying
\[ \|G - F\|_1 \le 2 \left[ \|r(F) - r\|_1 + \|c(F) - c\|_1 \right]. \]
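One simple rounding scheme consistent with this guarantee, and with the description of Algorithm 2 as simple and parallelizable, scales down overweight rows and columns and then repairs the remaining deficit with a rank-one update; this is our reading, with illustrative names:

```python
import numpy as np

def round_to_feasible(F, r, c):
    """Round a nonnegative matrix F onto the transport polytope U_{r,c}:
    scale down rows and columns that overshoot their marginals, then restore
    the missing mass with a rank-one update that fixes both marginals exactly."""
    G = F * np.minimum(1.0, r / F.sum(axis=1))[:, None]   # row sums now <= r
    G = G * np.minimum(1.0, c / G.sum(axis=0))[None, :]   # col sums now <= c
    err_r = r - G.sum(axis=1)   # entrywise nonnegative deficits
    err_c = c - G.sum(axis=0)
    total = err_r.sum()         # equals err_c.sum(): total missing mass
    if total > 0:
        G = G + np.outer(err_r, err_c) / total
    return G
```

The rank-one correction adds exactly `err_r[i]` to row `i` and `err_c[j]` to column `j`, so the output marginals equal $r$ and $c$ up to floating-point error.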
The proof of Lemma 7 is simple and left to the Appendix. (We also describe in the Appendix a randomized variant of Algorithm 2 that achieves a slightly better bound than Lemma 7). We are now ready to prove Theorem 1.
Proof of Theorem 1.
Error analysis. Let $P$ be the output of $\mathrm{Proj}(A, \mathcal{U}_{r,c}, \varepsilon')$, and let $P^*$ be an optimal solution to the original OT program (1).
We first show that is not much larger than . To that end, write and . Since for positive diagonal matrices and , Lemma 1 implies is the optimal solution to
By Lemma 7, there exists a matrix such that
Moreover, since is an optimal solution of (3), we have
Thus, by Hölder’s inequality
where we have used the fact that .
Applying the guarantee of , we obtain
Plugging in the values of $\eta$ and $\varepsilon'$ prescribed in Algorithm 1 finishes the error analysis.
Runtime analysis. Lemma 7 shows that Step 2 of Algorithm 1 takes $O(n^2)$ time. The runtime of Step 1 is dominated by the $\mathrm{Proj}$ subroutine. Theorems 2 and 3 imply that both the Sinkhorn and Greenkhorn algorithms accomplish this in $O(n^2 \varepsilon'^{-2} \log(s/\ell))$ time, where $s$ is the sum of the entries of $A$ and $\ell$ is the smallest entry of $A$. Since the matrix $C$ is nonnegative, the entries of $A = \exp(-\eta C)$ are bounded above by $1$, thus $s \le n^2$. The smallest entry of $A$ is $e^{-\eta \|C\|_\infty}$, so $\log(s/\ell) \le 2 \log n + \eta \|C\|_\infty$. We obtain a total runtime of $O(n^2 \varepsilon'^{-2} (\log n + \eta \|C\|_\infty))$. The proof is finished by plugging in the values of $\eta$ and $\varepsilon'$ prescribed in Algorithm 1. ∎
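Putting the pieces together, the whole pipeline (regularize, approximately project, round) can be sketched as follows; the constants in the choices of $\eta$ and $\varepsilon'$ are illustrative stand-ins for the values prescribed in Algorithm 1:

```python
import numpy as np

def approx_ot(C, r, c, eps, max_iters=20_000):
    """Sketch of the ApproxOT pipeline: entropically regularize, run a
    Sinkhorn projection until the marginals are eps'-accurate in l1, then
    round onto U_{r,c}. Constants in eta and eps_prime are illustrative."""
    n = C.shape[0]
    eta = 4 * np.log(n) / eps
    eps_prime = eps / (8 * C.max())
    P = np.exp(-eta * (C - C.min()))   # shift avoids underflow; scalings absorb it
    P /= P.sum()
    for k in range(max_iters):         # approximate Sinkhorn projection
        if k % 2 == 0:
            P *= (r / P.sum(axis=1))[:, None]
        else:
            P *= (c / P.sum(axis=0))[None, :]
        err = np.abs(P.sum(axis=1) - r).sum() + np.abs(P.sum(axis=0) - c).sum()
        if err <= eps_prime:
            break
    # rounding step: scale down overweight marginals, add rank-one correction
    G = P * np.minimum(1.0, r / P.sum(axis=1))[:, None]
    G *= np.minimum(1.0, c / G.sum(axis=0))[None, :]
    err_r, err_c = r - G.sum(axis=1), c - G.sum(axis=0)
    if err_r.sum() > 0:
        G += np.outer(err_r, err_c) / err_r.sum()
    return G
```

The output is always a feasible coupling, and its cost approaches the OT value as `eps` shrinks (at the price of more iterations).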
5 Empirical results
Cuturi [Cut13] already gave experimental evidence that using Sinkhorn to solve (2) outperforms state-of-the-art techniques for optimal transport. In this section, we provide strong empirical evidence that our proposed Greenkhorn algorithm significantly outperforms Sinkhorn.
We consider transportation between pairs of grayscale images, normalized to have unit total mass. The target marginals $r$ and $c$ represent the two images in a pair, and $C$ is the matrix of distances between pixel locations. Therefore, we aim to compute the earth mover's distance.
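For image experiments like these, the inputs to (1) can be assembled as follows; the $\ell_1$ ground distance between pixel coordinates is one natural choice (the exact ground metric used in the experiments is not assumed here):

```python
import numpy as np

def image_ot_inputs(img_a, img_b):
    """Turn two grayscale images into (C, r, c) for the OT program:
    r and c are the normalized pixel masses, and C holds pairwise
    distances between pixel locations (l1 ground metric, as one choice)."""
    h, w = img_a.shape
    r = img_a.ravel() / img_a.sum()
    c = img_b.ravel() / img_b.sum()
    ys, xs = np.divmod(np.arange(h * w), w)   # 2-D coordinates of each pixel
    C = (np.abs(ys[:, None] - ys[None, :])
         + np.abs(xs[:, None] - xs[None, :])).astype(float)
    return C, r, c
```

Note that $C$ is dense and of size $(hw) \times (hw)$, which is exactly the regime where near-linear dependence on the input size $n^2 = (hw)^2$ matters.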
We run experiments on two datasets: real images from MNIST, and synthetic images, as in Figure 1.
5.1 Real images

We first compare the behavior of Greenkhorn and Sinkhorn on real images. To that end, we choose random pairs of images from the MNIST dataset, and for each one analyze the performance of ApproxOT when using both Greenkhorn and Sinkhorn for the approximate projection step. We add noise of negligible intensity to each background pixel. Figure 2 paints a clear picture: Greenkhorn significantly outperforms Sinkhorn both in the short and long term.
5.2 Random images
To better understand the empirical behavior of both algorithms in a number of different regimes, we devised a synthetic and tunable framework in which we generate images by choosing a randomly positioned "foreground" square in an otherwise black background. The size of this square is a tunable parameter, varied between 20%, 50%, and 80% of the total image area. Intensities of background pixels are drawn uniformly from a low-intensity range, and those of foreground pixels from a high-intensity range. Such an image is depicted in Figure 1, and results appear in Figure 2.
We perform two other experiments with random images in Figure 3. In the first, we vary the number of background pixels and show that Greenkhorn performs better when the number of background pixels is larger. We conjecture that this is related to the fact that Greenkhorn only updates salient rows and columns at each step, whereas Sinkhorn wastes time updating rows and columns corresponding to background pixels, which have negligible impact. This demonstrates that Greenkhorn is a better choice especially when data is sparse, which is often the case in practice.
In the second, we consider the role of the regularization parameter $\eta$. Our analysis requires taking $\eta$ of order $\log n / \varepsilon$, but Cuturi [Cut13] observed that in practice $\eta$ can be much smaller. Cuturi showed that Sinkhorn outperforms state-of-the-art techniques for computing OT distances even when $\eta$ is a small constant, and Figure 3 shows that Greenkhorn runs faster than Sinkhorn in this regime with no loss in accuracy.
Appendix A Omitted proofs
a.1 Proof of Lemma 3
The proof of the first inequality is similar to the proof of Lemma 2:
where denotes the divergence between and viewed as elements of .
We now prove the second claim. Note that satisfies and has smallest entry . Since is positive, [Sin67] shows that exists and is unique. Let be corresponding scaling factors. Then
for all . Thus because and are both probability vectors,
a.2 Proof of Lemma 5
We prove only the case where a row was updated, since the column case is exactly the same.
Observe that and differ only in the th row, and and differ only in the th entry, and . Hence
where we have used the fact that and . ∎
a.3 Proof of Lemma 6
Let , and write . The definition of implies
Note that both and are nonnegative. If , then in particular , and it can be seen that in this range. Applying Lemma 4 (Pinsker’s inequality) yields
By the triangle inequality and convexity,
The claim follows from the above two displays. ∎
a.4 Proof of Lemma 7
Let be the output of . The entries of are nonnegative, and at the end of the algorithm and are both nonnegative, with . Therefore the entries of are nonnegative and