
Scalable Nearest Neighbor Search for Optimal Transport

by   Yihe Dong, et al.

The Optimal Transport (a.k.a. Wasserstein) distance is an increasingly popular similarity measure for rich data domains, such as images or text documents. This raises the necessity for fast nearest neighbor search with respect to this distance, a problem that poses a substantial computational bottleneck for various tasks on massive datasets. In this work, we study fast tree-based approximation algorithms for searching nearest neighbors w.r.t. the Wasserstein-1 distance. A standard tree-based technique, known as Quadtree, has been previously shown to obtain good results. We introduce a variant of this algorithm, called Flowtree, and formally prove it achieves asymptotically better accuracy. Our extensive experiments, on real-world text and image datasets, show that Flowtree improves over various baselines and existing methods in either running time or accuracy. In particular, its quality of approximation is in line with previous high-accuracy methods, while its running time is much faster.





1 Introduction

Given a finite metric space (X, d_X) and two distributions μ and ν on X, the Wasserstein-1 distance (a.k.a. Earth Mover's Distance or Optimal Transport) between μ and ν is defined as

    W_1(μ, ν) = min_τ Σ_{x,y ∈ X} τ(x, y) · d_X(x, y),    (1)

where the minimum is taken over all distributions τ on X × X whose marginals are equal to μ and ν. (For mathematical foundations of Wasserstein distances, see [Vil03].) The Wasserstein-1 distance and its variants are heavily used in applications to measure similarity in structured data domains, such as images [RTG00] and natural language text [KSKW15]. In particular, Kusner et al. [KSKW15] recently put forth the Word Mover Distance (WMD) as a similarity measure for text documents. Each document is seen as a uniform distribution over the words it contains, and the underlying metric between words is given by high-dimensional word embeddings such as word2vec [MSC13] or GloVe [PSM14]. It is shown in [KSKW15] (see also [LYFC19, YCC19]) that the Wasserstein-1 distance between the two distributions is a good similarity measure between the associated documents.
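For intuition, the definition can be checked directly on small instances. The sketch below (our own illustration, not code from the paper) computes W_1 exactly between two uniform distributions of equal support size: for uniform marginals of equal size, some optimal transport plan is a one-to-one matching, so brute force over permutations suffices for small supports.

```python
import itertools
import numpy as np

def w1_uniform_exact(a_pts, b_pts):
    """Exact W1 between uniform distributions on two equal-size point sets.

    For uniform marginals of equal support size, an optimal transport plan
    can be taken to be a perfect matching (a permutation), so minimizing
    over permutations gives the exact distance for small supports.
    """
    assert len(a_pts) == len(b_pts)
    s = len(a_pts)
    # Pairwise Euclidean costs d(x, y).
    cost = np.linalg.norm(a_pts[:, None, :] - b_pts[None, :, :], axis=2)
    best = min(sum(cost[i, perm[i]] for i in range(s))
               for perm in itertools.permutations(range(s)))
    return best / s  # each point carries mass 1/s
```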

To leverage the Wasserstein-1 distance for classification tasks, the above line of work uses the k-nearest neighbor classifier. This poses a notorious bottleneck for large datasets, necessitating the use of fast approximate similarity search algorithms. While such algorithms are widely studied for ℓ_p distances (chiefly ℓ_1 and ℓ_2; see [AIR18] for a survey), much less is known for Wasserstein distances, and a comprehensive study appears to be lacking. In particular, two properties of the W_1 distance make the nearest neighbor search problem very challenging. First, the distance is fairly difficult to compute (the most common approaches are combinatorial flow algorithms [Kuh55] or approximate iterative methods [Cut13]). Second, the distance is strongly incompatible with the Euclidean (and more generally, with ℓ_p) geometries [Bou86, KN06, NS07, AIK08, ANN15, AKR18], which renders many of the existing techniques for nearest neighbor search inadequate (e.g., random projections).

In this work, we systematically study the k-nearest neighbor search (k-NNS) problem with respect to the W_1 distance. In accordance with the above applications, we focus on the case where the ground set X is a finite subset of ℝ^d, endowed with the Euclidean distance, where d is a possibly high dimension, and each distribution over X has finite support of size at most s. (In the application to [KSKW15], X is the set of word embeddings of, say, all terms in the English language, and s is the maximum number of terms per text document.) Given a dataset of distributions μ_1, …, μ_n, the goal is to preprocess it, such that given a query distribution ν (also supported on X), we can quickly find the data points closest to ν in the W_1 distance. To speed up search, the algorithms we consider rely on efficient estimates of the distances W_1(ν, μ_i). This may lead to retrieving approximate nearest neighbors rather than the exact ones, which is often sufficient for practical applications.

1.1 Prior work

Kusner et al. [KSKW15] sped up k-NNS for WMD by designing two approximations of W_1(μ, ν). The first algorithm estimates W_1(μ, ν) as the Euclidean distance between the respective means of μ and ν. The second algorithm, called “Relaxed WMD” (abbrev. R-WMD), assigns every point in the support of μ to its closest point in the support of ν, and vice versa, and returns the maximum cost of the two assignments. Both of these methods produce an estimate no larger than the true distance W_1(μ, ν). The former is much faster to compute, while the latter has a much better empirical quality of approximation. The overall k-NNS pipeline in [KSKW15] consists of the combination of both algorithms, together with exact distance computation.
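Both estimates can be sketched in a few lines of NumPy (our own illustrative implementation; the function and variable names are ours):

```python
import numpy as np

def mean_estimate(a_pts, a_mass, b_pts, b_mass):
    # Euclidean distance between the (weighted) means of the two distributions.
    return np.linalg.norm(a_mass @ a_pts - b_mass @ b_pts)

def rwmd_estimate(a_pts, a_mass, b_pts, b_mass):
    # Relaxed WMD: each point sends all its mass to the closest point on the
    # other side; return the larger of the two directed (relaxed) costs.
    cost = np.linalg.norm(a_pts[:, None, :] - b_pts[None, :, :], axis=2)
    a_to_b = a_mass @ cost.min(axis=1)   # relax the demand constraints
    b_to_a = b_mass @ cost.min(axis=0)   # relax the supply constraints
    return max(a_to_b, b_to_a)
```

Both estimates are lower bounds on W_1(μ, ν): the mean distance by convexity of the norm, and R-WMD because each directed cost solves a relaxation of (1) with one set of marginal constraints dropped.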

Indyk and Thaper [IT03] studied the approximate NNS problem for the Wasserstein-1 distance in the context of image retrieval. Their approach capitalizes on a long line of work on tree-based methods, in which the given metric space is embedded at random into a tree metric. This is a famously fruitful approach for many algorithmic and structural statements [Bar96, Bar98, CCG98, Ind01, GKL03, FRT04, CKR05, MN06, IW17, IRW17, IW18, BIO19]. It is useful in particular for Wasserstein distances, since the optimal flow (τ in (1)) on a tree can be computed in linear time, and since a tree embedding of the underlying metric yields an ℓ_1-embedding of the Wasserstein-1 distance, as shown by [KT02, Cha02]. This allowed [IT03] to design an efficient NNS algorithm for W_1 based on classical locality-sensitive hashing (LSH). Recently, [LYFC19] introduced a kernel similarity measure based on the same approach, and showed promising empirical results for additional application domains.

1.2 Our results


The tree-based method used in [IT03, LYFC19] is a classical algorithm called Quadtree, described in Section 2. We suggest a variant of Quadtree for estimating W_1 distances, which we call Flowtree. As in Quadtree, we embed the ground metric into a tree and compute a flow on the tree. The difference is that the cost of the flow is measured w.r.t. the original Euclidean distances on X instead of the tree metric. This still yields a linear-time algorithm, since the optimal flow itself (and not just its cost) can be computed on a tree in linear time. At the same time, intuitively, using the original distances should render Flowtree more accurate than Quadtree. We substantiate this intuition both theoretically and empirically.

Theoretical results.

We analyze Quadtree and Flowtree, and show that Flowtree is qualitatively better than Quadtree for nearest neighbor search with respect to the W_1 distance. The key difference is that the quality of Flowtree is independent of the dataset size, i.e., of the number n of distributions. Quadtree, on the other hand, degrades in quality as n grows.

We expose this phenomenon in two regimes. First, we provide worst-case guarantees if the underlying metric on X is ℓ_1. Namely, we observe that the analysis of a related algorithm from [AIK08] can be used to show that Quadtree returns an approximate nearest neighbor with high probability, with an approximation factor that depends on the dataset size n, provided that all the dataset and query distributions have supports of size at most s. (A distribution μ_i is a c-approximate nearest neighbor of the query ν if W_1(ν, μ_i) ≤ c · min_j W_1(ν, μ_j).) This analysis can be further modified to show an approximation factor for Flowtree that is independent of n, which confirms the above intuition that Flowtree does not degrade as the dataset size grows. Furthermore, we show that the dependence on n for Quadtree is necessary. While these results hold for an underlying ℓ_1 metric, they are readily adapted to ℓ_2 by a random rotation of the dataset (which in practice is not necessary).

Second, we consider a random setting. This is motivated by the observation that the above worst-case bounds are too pessimistic for real-world data. In practice, we observe that Quadtree and especially Flowtree recover the exact nearest neighbor with noticeable probability. To explain this, we introduce and analyze a simple random model, which we believe captures the relevant aspects of real-world instances. For this model, we show that both Flowtree and Quadtree recover the nearest neighbor with high probability. However, as was the case for the worst-case bounds, Quadtree's success rate degrades as the support size s increases, while Flowtree's does not.

Empirical results.

We evaluate the performance of Flowtree, and of several baselines and prior methods [KSKW15, Cut13], on nearest neighbor search with respect to the W_1 distance. The upshot is that Flowtree achieves accuracy on par with the most accurate existing methods, while being markedly faster; in some settings the speedup comes with no loss in accuracy at all.

Let us provide more context on these results. Generally, existing algorithms can be grouped into two classes: “fast” linear-time methods, which produce an efficient but coarse approximation of W_1(ν, μ_i), and quadratic-time methods, which produce a closer but slower approximation (though still considerably faster than exact W_1 computation). The terms “linear” and “quadratic” time refer to the dependence on s of estimating a single distance between distributions whose support size is at most s. A complete k-NNS pipeline would typically rely on a combination of these methods: initial pruning by a fast method, followed by intermediate pruning by a slower and more accurate method, followed by exact computation on the few surviving points. This scheme has been employed, for example, in [KSKW15] (termed “prefetch and prune”).

Flowtree forms a new intermediate category, being a “slow” linear-time method. Specifically, its accuracy closely matches the quadratic-time methods, while its running time is only linear in s, rendering it much faster in practice. The difference between Flowtree and the existing “fast” linear-time methods is that the latter possess certain additional properties (namely, embeddings into ℓ_1 or ℓ_2 spaces, as explained below) that allow for considerable implementational speedup in practice. Flowtree lacks these properties and does not match their running time, but is dramatically more accurate. Thus, it offers a substantial improvement in the intermediate pruning stage of the k-NNS pipeline outlined above.
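The “prefetch and prune” pipeline described above can be sketched abstractly as follows (a hypothetical skeleton of our own; the estimators are passed in as functions, and the candidate counts k1 and k2 are tuning parameters not specified in the text):

```python
def nns_pipeline(query, dataset, fast_est, slow_est, exact, k1, k2):
    """Three-stage nearest neighbor pipeline: coarse pruning, finer pruning,
    then exact computation on the few surviving candidates."""
    # Stage 1: keep the k1 candidates ranked best by a "fast" estimate.
    c1 = sorted(dataset, key=lambda mu: fast_est(query, mu))[:k1]
    # Stage 2: re-rank with a slower, more accurate estimate (e.g. Flowtree).
    c2 = sorted(c1, key=lambda mu: slow_est(query, mu))[:k2]
    # Stage 3: exact W1 on the survivors.
    return min(c2, key=lambda mu: exact(query, mu))
```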

2 Preliminaries: Quadtree

In this section we describe the classical Quadtree algorithm. Its name derives from its original use in two dimensions (cf. [Sam84]), but it extends to—and has been successfully used in—various high-dimensional settings (e.g. [Ind01, IT03, IRW17, BIO19]). It enjoys a combination of appealing theoretical properties and amenability to fast implementation. As it forms the basis for our work, we now describe it in detail.

Generic Quadtree.

Let X ⊂ ℝ^d be a finite set of points. Our goal is to embed X into a random tree metric, so as to approximately preserve each pairwise distance in X. To simplify the description, suppose that the minimum pairwise distance in X is exactly 1, and that all points in X have coordinates in [0, Φ]. (This is without loss of generality: we can set the minimum distance to 1 by scaling, with Φ determined accordingly, and we can shift all points to have non-negative coordinates without changing internal distances.)

The first step is to obtain a randomly shifted hypercube that encloses all points in X. To this end, let H be the hypercube with side length 2Φ centered at the origin. Let σ be a random vector with i.i.d. coordinates uniformly distributed in [0, Φ]. We randomly shift H by σ, obtaining the hypercube H' = H + σ. Observe that H' has side length 2Φ and encloses X. The random shift is needed in order to obtain formal guarantees for arbitrary X.

Now, we construct a tree of hypercubes by letting H' be the root, halving H' along each dimension, and recursing on the resulting sub-hypercubes. We add to the tree only those hypercubes that are non-empty (i.e., contain at least one point from X). Furthermore, we do not partition hypercubes that contain exactly one point from X; they become leaves. The resulting tree has O(log(dΦ)) levels and exactly |X| leaves, one per point in X. (This is since the diameter of the root hypercube is 2Φ√d, while by scaling the minimal distance in X to 1 we have ensured that a hypercube of diameter less than 1 contains a single point and thus becomes a leaf. Since the diameter is halved in each level, there are at most log(2Φ√d) = O(log(dΦ)) levels.) We number the root level as 0, and the rest of the levels are numbered downward accordingly (1, 2, …). We set the weight of each tree edge between a node at level i and its children at level i+1 to 2Φ · 2^{-(i+1)}, the side length of the child hypercubes.

It is straightforward to build a quadtree in time O(|X| · d · log(dΦ)). Let us remark that the size of the resulting tree is O(|X| · log(dΦ)) nodes, since each point of X is contained in at most one hypercube per level; in particular, there is no exponential dependence on the dimension d.
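A minimal sketch of the construction (our own simplified implementation, not the paper's: it assumes distinct points and represents each tree node as a (level, point_ids, children) triple):

```python
import numpy as np

def build_quadtree(points, rng=None):
    """Build a randomly shifted quadtree over a set of distinct points.

    Returns (root, nodes), where each node is a (level, point_ids, children)
    triple and `nodes` lists all nodes created.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    d = points.shape[1]
    side = points.max() - points.min() + 1e-9   # enclosing side length
    # Random shift: the root hypercube has side 2*side, and its lower corner
    # is offset from the bounding box by a uniform shift in [0, side)^d.
    origin = points.min(axis=0) - rng.uniform(0, side, size=d)
    nodes = []

    def recurse(ids, corner, length, level):
        node = (level, list(ids), [])
        nodes.append(node)
        if len(ids) <= 1:
            return node                         # singleton cells become leaves
        # Halve the cell along every dimension and recurse on the
        # non-empty sub-hypercubes only.
        half = length / 2
        cell = ((points[ids] - corner) // half).astype(int)
        for key in np.unique(cell, axis=0):
            sub = [i for i, c in zip(ids, cell) if np.array_equal(c, key)]
            node[2].append(recurse(sub, corner + key * half, half, level + 1))
        return node

    root = recurse(list(range(len(points))), origin, 2 * side, 0)
    return root, nodes
```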

Wasserstein- on Quadtree.

The tree distance between each pair x, x' ∈ X is defined as the total edge weight on the unique path between their corresponding leaves in the quadtree. Given two distributions μ, ν on X, the Wasserstein-1 distance with this underlying tree metric (as a proxy for the Euclidean metric on X) admits the closed form Σ_v w_v · |μ(v) − ν(v)|, where v ranges over all nodes in the tree, w_v = 2Φ · 2^{-l(v)} is the weight of the edge connecting v to its parent with l(v) the level of v, μ(v) is the total μ-mass of points enclosed in the hypercube associated with v, and ν(v) is defined similarly for the ν-mass. If μ, ν have supports of size at most s, then this quantity can be computed in time O(s · d · log(dΦ)).

The above closed form implies, in particular, that W_1 on the quadtree metric (and indeed, on any tree metric) embeds isometrically into ℓ_1, as originally observed by [Cha02] following [KT02]. Namely, the ℓ_1 space has a coordinate associated with each tree node v, and a distribution μ is embedded in that space by setting the value of coordinate v to w_v · μ(v), where μ(v) is defined as above. Furthermore, observe that if μ has support size at most s, then its corresponding embedding w.r.t. the tree metric has at most O(s · log(dΦ)) non-zero entries, since the tree has O(log(dΦ)) levels. Thus, computing W_1 on the tree metric amounts to computing the ℓ_1 distance between sparse vectors, which further facilitates fast implementation in practice.
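The closed form can be sketched without materializing the tree, by hashing points into shifted grid cells level by level and accumulating the weighted ℓ_1 difference of cell masses (an illustrative simplification of our own; the shift and scale parameters here stand in for the random shift and the enclosing side length):

```python
import numpy as np

def quadtree_w1(a_pts, a_mass, b_pts, b_mass, levels=10, shift=0.5, scale=1.0):
    """W1 under a shifted-grid (quadtree-style) metric, via the closed form:
    a weighted L1 distance between the cell-mass vectors of the two
    distributions, summed over all levels of the grid hierarchy.
    Assumes all coordinates of pts + shift are non-negative."""
    total = 0.0
    for lvl in range(levels):
        cell_len = scale / 2 ** lvl            # side length of level-lvl cells
        masses = {}                            # cell id -> signed mass difference
        for pts, mass, sign in ((a_pts, a_mass, 1.0), (b_pts, b_mass, -1.0)):
            cells = np.floor((pts + shift) / cell_len).astype(int)
            for c, m in zip(map(tuple, cells), mass):
                masses[c] = masses.get(c, 0.0) + sign * m
        # The edge weight at this level is proportional to the cell side.
        total += cell_len * sum(abs(v) for v in masses.values())
    return total
```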

3 Flowtree

The Flowtree algorithm for k-NNS w.r.t. the W_1 distance is as follows. In the preprocessing stage, we build a quadtree on the ground set X, as described in Section 2. Let t(x, x') denote the quadtree distance between every pair x, x' ∈ X. In the query stage, in order to estimate W_1(μ, ν) between two distributions μ, ν, we compute the optimal flow w.r.t. the tree metric, that is,

    τ* = argmin_τ Σ_{x, x'} τ(x, x') · t(x, x'),

where the argmin is taken over all distributions τ on X × X with marginals μ and ν. Then, the estimate of the W_1 distance between μ and ν is given by

    Σ_{x, x'} τ*(x, x') · ‖x − x'‖_2.

Note that if the support sizes of μ and ν are upper-bounded by s, then the Flowtree estimate of their distance can be computed in time linear in s.

Lemma 3.1.

The Flowtree estimate of W_1(μ, ν) can be computed in time O(s · (d + log(dΦ))).
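A compact sketch of the estimate (ours, not the paper's optimized implementation): an optimal tree flow is obtained by greedily matching surplus mass within cells from the finest grid level upward, and the matched pairs are then priced with true Euclidean distances. Any within-cell pairing at the deepest shared level has the same tree cost, so the greedy matching is indeed an optimal flow on the tree.

```python
import numpy as np

def flowtree_estimate(a_pts, a_mass, b_pts, b_mass, levels=12, shift=0.5, scale=4.0):
    """Flowtree sketch: compute an optimal flow w.r.t. a shifted-grid tree
    metric by matching remaining mass within shared cells, finest level
    first, and price the matched pairs with true Euclidean distances.
    Assumes all coordinates of pts + shift lie in [0, scale)."""
    a_rem = np.array(a_mass, dtype=float)      # unmatched mass per support point
    b_rem = np.array(b_mass, dtype=float)
    est = 0.0
    for lvl in range(levels, -1, -1):          # finest cells first
        cell_len = scale / 2 ** lvl
        cells = {}                             # cell id -> (a ids, b ids)
        for i, p in enumerate(a_pts):
            key = tuple(np.floor((p + shift) / cell_len).astype(int))
            cells.setdefault(key, ([], []))[0].append(i)
        for j, q in enumerate(b_pts):
            key = tuple(np.floor((q + shift) / cell_len).astype(int))
            cells.setdefault(key, ([], []))[1].append(j)
        for a_ids, b_ids in cells.values():
            for i in a_ids:                    # greedily match remaining mass
                for j in b_ids:
                    f = min(a_rem[i], b_rem[j])
                    if f > 0:
                        est += f * np.linalg.norm(a_pts[i] - b_pts[j])
                        a_rem[i] -= f
                        b_rem[j] -= f
    return est
```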

On the other hand, Flowtree has the notable property discussed above: its NNS approximation factor is independent of the dataset size n. In comparison, the classical Quadtree does not possess this property, and its accuracy deteriorates as the dataset becomes larger. We will now formally establish this distinction in two senses: first by analyzing worst-case bounds, and then by analyzing a random model of synthetic data.

3.1 Worst-case bounds

For convenience, the results in this section are stated with the underlying metric on X being the ℓ_1 distance; as mentioned earlier, the same results hold for ℓ_2 by applying a random rotation to the dataset, which embeds ℓ_2 into ℓ_1 with constant distortion and constant blowup in the dimension [DIIM04]. All proofs are deferred to the appendix.

We start with an analytic worst-case bound on the performance of Quadtree. For convenience, let us recall the parameters: X is a finite subset of ℝ^d, and Φ is such that X can be enclosed in a hypercube of side length at most Φ. We have a dataset of n distributions μ_1, …, μ_n, and a query distribution ν, where each of these distributions is supported on a subset of X of size at most s. We look for the nearest neighbor of ν among μ_1, …, μ_n. The following theorem is an adaptation of a result of [AIK08] (where it is proven for a somewhat different algorithm, with a similar analysis).

Theorem 3.1 (Quadtree upper bound).

With constant probability, the nearest neighbor of ν among μ_1, …, μ_n in the Quadtree distance is a c-approximate nearest neighbor in the W_1 distance, for an approximation factor c that grows with the dataset size n (as well as with s, d and Φ).

Next, we show that the factor in the above upper bound is necessary for Quadtree.

Theorem 3.2 (Quadtree lower bound).

Suppose c is such that Quadtree is guaranteed to return a c-approximate nearest neighbor, for any dataset, with probability more than (say) 1/2. Then c must grow with the dataset size n.

In contrast, Flowtree attains an approximation factor that does not depend on n.

Theorem 3.3 (Flowtree upper bound).

With constant probability, the nearest neighbor of ν among μ_1, …, μ_n in the Flowtree distance is a c-approximate nearest neighbor in the W_1 distance, for an approximation factor c that is independent of the dataset size n (depending only on s, d and Φ).

3.2 Random model

In this section we consider a simple model of random data, which is canonical in studying nearest neighbor search. We choose a ground set X of N points i.i.d. uniformly at random on the unit sphere in d dimensions. For each subset of X of size s, we form a uniform distribution supported on that subset. These distributions make up the dataset. To generate a query, we take an arbitrary dataset distribution μ* as the “planted” nearest neighbor. Let x_1, …, x_s be the points in its support. For each x_i, we pick a uniformly random point y_i among the points of X which are at distance at most ε from x_i, where ε > 0 is a parameter. The query distribution ν is the uniform distribution over y_1, …, y_s. By known concentration of measure results, the distance from y_i to every point in X other than x_i is close to √2 (and in particular much larger than ε) with high probability. Therefore, the optimal flow from ν to μ* is the perfect matching y_i → x_i, and μ* is the nearest neighbor of ν.
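The planted instances can be generated as follows (our own sketch with our parameter names: N ground-set points, support size s, noise radius eps; as a simplification it samples n random supports rather than enumerating all subsets of size s):

```python
import numpy as np

def planted_instance(N, n, s, eps, d, rng=None):
    """Ground set of N random unit vectors in R^d, n dataset supports of
    size s, and a query planted within distance eps of dataset support 0."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = rng.normal(size=(N, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # uniform on the sphere
    supports = [rng.choice(N, size=s, replace=False) for _ in range(n)]
    planted = supports[0]
    query = []
    for i in planted:
        # Uniformly random ground-set point within distance eps of x_i
        # (x_i itself always qualifies, so `near` is never empty).
        near = np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]
        query.append(rng.choice(near))
    return X, supports, np.array(query)
```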

Our goal now is to show that in this model, the success probability of Quadtree in recovering the planted nearest neighbor decays exponentially with the support size s, while the success probability of Flowtree is independent of s.


For every i = 1, …, s, let C_i be the smallest hypercube in the quadtree that contains both x_i and y_i. (Note that C_i is a random variable, determined by the initial random shift in the Quadtree construction.) In order for Quadtree to correctly identify μ* as the nearest neighbor of ν, every C_i must not contain any additional points from X. Otherwise, if say C_1 contains a point z ∈ X, the distance on the quadtree from ν to μ* is equal to its distance to the uniform distribution over {z, x_2, …, x_s}, which is also a dataset distribution, so Quadtree cannot distinguish between the two. Since the points in X are chosen uniformly i.i.d. on the sphere, the probability of the above event, and thus the success probability of Quadtree, is upper bounded by ρ^s, where ρ < 1 is the probability that a single cell C_i contains no point of X other than x_i and y_i; its value depends only on ε, d and N, and is independent of s. Thus the success probability decays exponentially with s.


On the other hand, suppose that each C_i contains no points from the support of μ* other than x_i (but is allowed to contain any other points from X). This event guarantees that the optimal flow on the tree between ν and μ* is the planted perfect matching, i.e., the true optimal flow, and thus the Flowtree estimate of their distance equals W_1(ν, μ*). This guarantees that Flowtree recovers the planted nearest neighbor, and the probability of this event depends only on ε, d and N, and is independent of s.

4 Experiments

We empirically evaluate Flowtree and compare it with various methods.

4.1 Synthetic data

We implemented the random model described in Section 3.2. The results are given in Figure 1. The x-axis is the support size s, and the y-axis is the fraction of successes over independent repetitions of planting a query and recovering its nearest neighbor. As predicted, the success ratio of Quadtree degrades as s increases, while the success ratio of Flowtree does not.

Figure 1: Results on random data, for two settings of the model parameters (left and right).

4.2 Real data


We use three datasets from two application domains:

  • Text documents: We use a dataset of Amazon reviews split evenly over product categories, and the 20news dataset of news-related online discussion groups. Both have been used in [KSKW15] to evaluate the Word Mover Distance. Each document is interpreted as a uniform distribution supported on the terms it contains (after stopword removal). As the underlying metric, we use GloVe word embeddings [PSM14].

  • Image recognition: We use the MNIST dataset of handwritten digits. The Wasserstein distance has been applied to MNIST in [Cut13]. Each image is interpreted as a distribution over pixels, with mass proportional to the greyscale intensity of the pixel (normalized so that the total mass sums to 1). Note that the distribution is supported only on the nonzero pixels in the image. The underlying metric is the 2-dimensional Euclidean distance between the pixels, identified with points on the plane.

Full properties of the datasets are listed in Table 1.

Name     Underlying metric
Amazon   GloVe word embeddings
20news   GloVe word embeddings
MNIST    2D Euclidean
Table 1: Datasets


We evaluate the following methods:

  • Mean: W_1(μ, ν) is estimated as the Euclidean distance between the means of μ and ν. This method has been suggested and used in [KSKW15].

  • Overlap: A simple baseline that estimates the similarity of μ and ν by the size of the intersection of their supports.

  • TF-IDF: A well-known similarity measure for text documents. We remark that it is closely related to Overlap. (Namely, it is a weighted variant of Overlap, where terms are weighted according to their frequency in the dataset.) For MNIST we omit this baseline, since it is not a text dataset.

  • Quadtree: See Section 2.

  • Flowtree: See Section 3.

  • R-WMD: The Relaxed WMD method of [KSKW15], described in Section 1.1. We recall that it is a greedy method that assigns every point in the support of μ to its closest point in the support of ν, then repeats the process with the roles of μ and ν reversed, and returns the maximum of the two estimates. Note that this method does not produce a proper flow (i.e., it does not adhere to the capacity and demand constraints of (1)).

  • Sinkhorn with few iterations: The iterative Sinkhorn method of [Cut13] is designed to converge to a near-perfect approximation of W_1(μ, ν). Nonetheless, it can be adapted into a fast approximation algorithm by invoking it with a fixed small number of iterations. We use 1 and 3 iterations, referred to as Sinkhorn-1 and Sinkhorn-3 respectively. Since the Sinkhorn method requires tuning certain parameters (the number of iterations as well as the regularization parameter), the experiments in this section evaluate the method at its optimal setting, and the appendix includes additional experiments with more parameter settings.
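Few-iteration Sinkhorn can be sketched as follows (a standard textbook implementation of the scaling iterations, in our own notation; it returns the transport cost of the current, possibly unconverged, plan):

```python
import numpy as np

def sinkhorn_estimate(a_mass, b_mass, cost, reg=0.1, iters=3):
    """Entropy-regularized OT estimate after a fixed, small number of
    Sinkhorn iterations (as in Sinkhorn-1 / Sinkhorn-3)."""
    K = np.exp(-cost / reg)                  # Gibbs kernel
    u = np.ones_like(a_mass)
    for _ in range(iters):
        v = b_mass / (K.T @ u)               # alternate marginal scalings
        u = a_mass / (K @ v)
    P = u[:, None] * K * v[None, :]          # current transport plan
    return float((P * cost).sum())
```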

As mentioned in Section 1.2, these methods can be grouped as follows, in terms of their running time dependence on s:

  • “Fast” linear-time: Mean, Overlap, TF-IDF, Quadtree

  • “Slow” linear-time: Flowtree

  • Quadratic time: R-WMD, Sinkhorn

The difference between “fast” and “slow” linear time is that the former algorithms are more cache-efficient and benefit from SIMD vectorization. Notably, each of them reduces to either computing a single Euclidean distance in the ground metric (in the case of Mean), or a single distance between sparse vectors (in the case of Overlap, TF-IDF and Quadtree). This renders them an order of magnitude faster than the other methods, as our empirical results will show.

Evaluation metrics.

We measure the accuracy and the running time of each method. The results are depicted in Figures 2, 3 and 4 and in Table 2. Accuracy is measured as follows: for each query ν, we sort the dataset by the distances from ν as estimated by the evaluated method; then, for each number of “candidates” m, a query is considered successful if the true nearest neighbor (according to the actual W_1 distance) is among the top m points according to the estimated distances. The x-axis lists the number of candidates m, and the y-axis lists the fraction of successful queries.
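This accuracy metric can be sketched as follows (our own helper; the inputs are matrices of estimated and exact distances, one row per query):

```python
import numpy as np

def recall_at_m(estimated, exact, m):
    """Fraction of queries whose true nearest neighbor (by `exact` distances)
    appears among the top-m candidates ranked by `estimated` distances.
    Both arrays have shape (num_queries, dataset_size)."""
    true_nn = np.argmin(exact, axis=1)
    ranked = np.argsort(estimated, axis=1)[:, :m]
    hits = [t in r for t, r in zip(true_nn, ranked)]
    return float(np.mean(hits))
```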

For each dataset, we plot on the left the accuracy of all methods for large numbers of candidates. On the right we plot the accuracy of the high-accuracy methods for small numbers of candidates (since they cannot be discerned in the left plots). The high-accuracy methods are Flowtree, R-WMD and Sinkhorn, and on MNIST also Quadtree. For Quadtree and Flowtree, which are randomized methods, we report the mean and standard deviation (shown as error bars) over independent executions. The other methods are deterministic.

The legend of each plot is annotated with the running time of each method (also summarized in Table 2). The running time is measured as the average time, over the queries, to estimate all distances from the query to the whole dataset. Running times are measured on a “Standard F72s_v2” Microsoft Azure instance equipped with an Intel Xeon Platinum 8168 CPU. In our implementations, we use NumPy linked against OpenBLAS, running in single-threaded mode. On the MNIST dataset, the accuracy of Sinkhorn is evaluated on a random subset of the queries, and the running times of R-WMD, Sinkhorn and Exact are evaluated on a random subset of the queries, due to their computational cost.


The results show that Flowtree, R-WMD and Sinkhorn achieve dramatically higher accuracy than the other baselines. Among these, Flowtree is the fastest by a margin. In particular, its accuracy is either comparable or superior to R-WMD, while it is substantially faster. Compared to Sinkhorn-1, Flowtree achieves somewhat lower accuracy, but is faster by an even larger margin.

Figure 2: Results for Amazon dataset
Figure 3: Results for 20news dataset
Figure 4: Results for MNIST dataset
Dataset  Mean  Overlap  TF-IDF  Quadtree  Flowtree  R-WMD  Sinkhorn-1  Sinkhorn-3  Exact
Amazon   ms    ms       ms      ms        s         s      s           s           s
20news   ms    ms       ms      ms        s         s      s           s           s
MNIST    ms    ms       –       ms        s         s      s           s           s
Table 2: Running times (ms = milliseconds, s = seconds)

5 Future directions

We view this work as a first step, leaving a number of exciting open problems for future work.

  • We suspect that the worst-case bounds for Quadtree and Flowtree might not be tight. The challenge is either to improve them, or to prove matching lower bounds.

  • Note that given a nearest neighbor query, we need to estimate the distance from a single distribution (the query) to many distributions (the dataset) at once. Could this be leveraged to speed up Flowtree?

  • For an actual nearest neighbor search system, one likely needs to combine several algorithms, as outlined above. It would be useful to conduct a systematic investigation on this front, and obtain a “recipe” for setting up such a pipeline (of increasingly slower but more accurate approximation methods, gradually pruning more and more candidates) for a given dataset.


  • [AIK08] Alexandr Andoni, Piotr Indyk, and Robert Krauthgamer. Earth mover distance over high-dimensional spaces. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 343–352. Society for Industrial and Applied Mathematics, 2008.
  • [AIR18] Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn. Approximate nearest neighbor search in high dimensions. arXiv preprint arXiv:1806.09823, 2018.
  • [AKR18] Alexandr Andoni, Robert Krauthgamer, and Ilya Razenshteyn. Sketching and embedding are equivalent for norms. SIAM Journal on Computing, 47(3):890–916, 2018.
  • [ANN15] Alexandr Andoni, Assaf Naor, and Ofer Neiman. Snowflake universality of wasserstein spaces. arXiv preprint arXiv:1509.08677, 2015.
  • [Bar96] Yair Bartal. Probabilistic approximation of metric spaces and its algorithmic applications. In Proceedings of 37th Conference on Foundations of Computer Science, pages 184–193. IEEE, 1996.
  • [Bar98] Yair Bartal. On approximating arbitrary metrices by tree metrics. In STOC, volume 98, pages 161–168, 1998.
  • [BIO19] Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. In International Conference on Machine Learning, pages 405–413, 2019.
  • [Bou86] Jean Bourgain. The metrical interpretation of superreflexivity in banach spaces. Israel Journal of Mathematics, 56(2):222–230, 1986.
  • [CCG98] Moses Charikar, Chandra Chekuri, Ashish Goel, Sudipto Guha, and Serge Plotkin. Approximating a finite metric by a small number of tree metrics. In Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No. 98CB36280), pages 379–388. IEEE, 1998.
  • [Cha02] Moses S Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388. ACM, 2002.
  • [CKR05] Gruia Calinescu, Howard Karloff, and Yuval Rabani. Approximation algorithms for the 0-extension problem. SIAM Journal on Computing, 34(2):358–372, 2005.
  • [Cut13] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
  • [DIIM04] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253–262. ACM, 2004.

Appendix A Proofs

Proof of Theorem 3.1.

Let . Let be the probability that fall into the same cell (hypercube) in level of the quadtree. It satisfies,

Let be the tree metric induced on by the quadtree. Note that for to be at most , must fall into the same hypercube in level . For any , we can round to its nearest power of and obtain such that . It satisfies,

By letting , we can take a union bound either over all pairwise distances in (of which there are ), or over all distances between the support of the query and the union of supports of the dataset (of which there are at most , if every support has size at most ). Then, with probability say , all those distances are contracted by at most , i.e.,


On the other hand,

Let be the true nearest neighbor of in . Let be the optimal flow between them. Then by the above,

By Markov's inequality, with probability say ,


Let be the nearest neighbor of in the dataset according to the quadtree distance. Let be the optimal flow between them in the true underlying metric ( on ), and let be the optimal flow in the quadtree. Finally let denote the Wasserstein- distance on the quadtree. Then,

where the successive steps hold, respectively, because is optimal for ; by the definition of ; because is the nearest neighbor in ; again by the definition of ; and because is optimal for ,

so is a -approximate nearest neighbor. ∎
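To make the quadtree estimate analyzed above concrete, here is a minimal sketch of the tree-Wasserstein computation, assuming randomly shifted nested grids with cell side 2^i at level i and the standard closed form for W_1 on a tree (sum over tree nodes of the edge weight times the absolute subtree mass difference). The function name, edge-weight constants, and shift range are our illustrative choices, not the paper's exact ones:

```python
import numpy as np
from collections import defaultdict

def quadtree_w1(x, mu, y, nu, num_levels, seed=0):
    """Quadtree (tree-Wasserstein) estimate between two discrete
    distributions: support points x with weights mu, and y with nu.
    Each level-i grid has cell side 2**i; all levels share one random
    shift, and the edge weight at level i is taken to be 2**i."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # one random shift shared by all levels of the nested grids
    shift = rng.uniform(0, 2.0 ** num_levels, size=d)
    est = 0.0
    for i in range(num_levels):
        side = 2.0 ** i
        mass = defaultdict(float)   # signed mass per grid cell at this level
        for p, w in zip(x, mu):
            mass[tuple(np.floor((p + shift) / side).astype(int))] += w
        for p, w in zip(y, nu):
            mass[tuple(np.floor((p + shift) / side).astype(int))] -= w
        est += side * sum(abs(v) for v in mass.values())
    return est
```

Because the estimate is a level-wise sum of per-cell mass differences, it is computable in time linear in the support sizes times the number of levels, which is what makes the quadtree baseline fast.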

Proof of Theorem 3.2.

It suffices to prove the claim for (i.e., the standard -distance). Let be an even integer. Consider the -dimensional hypercube. Our query point is the origin. The true nearest neighbor is (a standard basis vector). The other data points are the hypercube vertices whose Hamming weight is exactly . The number of such points is , and this is our .

Consider imposing the grid with cell side on the hypercube. The probability that and are uncut in a given axis is exactly , and since the shifts in different axes are independent, the number of uncut axes is distributed as . Thus with probability there are at least uncut dimensions. If this happens, some data point is hashed into the same grid cell as the origin (to obtain such a point, put in any uncut dimensions and in the rest), so its quadtree distance from the origin is . On the other hand, the distance of the origin to its true nearest neighbor is at least , since they are necessarily separated in the next level (when the grid cells have side ). Thus the quadtree cannot distinguish the true nearest neighbor from the point at distance , and we get the lower bound . Since , we have as desired. ∎
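The binomial claim in this proof can be checked numerically. The sketch below assumes cell side 2, so that coordinates 0 and 1 of an axis are uncut with probability exactly 1/2; this constant is our assumption, since the exact values are elided above:

```python
import numpy as np

def uncut_axes(d, side=2.0, trials=20000, seed=0):
    """Empirical check: under a uniformly random shift, coordinates 0
    and 1 land in the same grid cell (an 'uncut' axis) with probability
    1/2 per axis, independently across axes."""
    rng = np.random.default_rng(seed)
    shifts = rng.uniform(0, side, size=(trials, d))
    uncut = np.floor(shifts / side) == np.floor((1.0 + shifts) / side)
    return uncut.sum(axis=1)  # one Binomial(d, 1/2) sample per trial
```

Averaging `uncut_axes(d)` over many trials and dividing by d concentrates around 1/2, matching the per-axis cut probability used in the lower bound.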

Proof of Theorem 3.3.

The proof is the same as for Theorem 3.1, except that in eq. 2, we take a union bound only over the distances between the supports of and (the query and its true nearest neighbor). Thus each distance between and is contracted by at most .

Let denote the Flowtree distance estimate of . Let be the nearest neighbor of in the Flowtree distance. With the same notation as in the proof of Theorem 3.1,

where the successive steps hold, respectively, because is optimal for ; by the Flowtree definition; because is nearest in the Flowtree distance; again by the Flowtree definition; and because is optimal for ,

as needed. Note that the difference from the proof of Theorem 3.1 is that we only needed the contraction bound (eq. 2) for distances between and . ∎
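The Flowtree estimate analyzed here can be sketched as follows: an optimal flow for the tree metric is obtained by greedily matching mass bottom-up inside shared quadtree cells (any such matching is tree-optimal), and that flow is then priced with the true ground distances. The function name, mass tolerance, and grid constants are our illustrative assumptions:

```python
import numpy as np
from collections import defaultdict

def flowtree_w1(x, mu, y, nu, num_levels, seed=0):
    """Flowtree sketch: greedy bottom-up matching on a randomly shifted
    quadtree, with the resulting flow evaluated under TRUE Euclidean
    distances rather than tree distances."""
    rng = np.random.default_rng(seed)
    shift = rng.uniform(0, 2.0 ** num_levels, size=x.shape[1])
    mu = np.asarray(mu, dtype=float).copy()
    nu = np.asarray(nu, dtype=float).copy()
    cost = 0.0

    def match(side_a, side_b):
        # route as much remaining mass as possible between the two sides,
        # pricing each unit by the true distance between support points
        nonlocal cost
        for a in side_a:
            for b in side_b:
                f = min(mu[a], nu[b])
                if f > 0:
                    cost += f * np.linalg.norm(x[a] - y[b])
                    mu[a] -= f
                    nu[b] -= f

    for i in range(num_levels):            # finest grid first
        side = 2.0 ** i
        cells = defaultdict(lambda: ([], []))
        for a in range(len(mu)):
            if mu[a] > 1e-12:
                cells[tuple(np.floor((x[a] + shift) / side).astype(int))][0].append(a)
        for b in range(len(nu)):
            if nu[b] > 1e-12:
                cells[tuple(np.floor((y[b] + shift) / side).astype(int))][1].append(b)
        for side_a, side_b in cells.values():
            match(side_a, side_b)
    # residual mass from points that never shared a cell meets at the root
    match(range(len(mu)), range(len(nu)))
    return cost
```

Pricing the tree-optimal flow with true distances involves only the supports of the two compared distributions, which is what lets this proof take a union bound only over the distances between the supports of the query and its nearest neighbor.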

Appendix B Additional Sinkhorn experiments

We include additional experimental results for the few-iterations Sinkhorn baseline, depicted in Figure 5.

Number of iterations.

Figures 5(a)–(c) depict the accuracy and running time of Sinkhorn-5, showing that it achieves accuracy comparable to Sinkhorn-3 while being slower. Therefore we limit the experiments in Section 4 to 3 iterations.

Regularization parameter.

Sinkhorn has a regularization parameter that must be tuned per dataset. We set , where is the maximum value in the cost matrix (of the currently evaluated pair of distributions), and tune . On all three of our datasets the optimal setting is , which is the setting we use in Section 4. As an example, Figure 5(d) depicts the -NN accuracy (y-axis) of Sinkhorn-1 as a function of (x-axis).
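A minimal sketch of the few-iteration Sinkhorn baseline, assuming the regularization is set inversely proportional to the largest cost-matrix entry, which makes the setting scale-free per pair of distributions; the exact form `lam = eta / C.max()` and the default `eta` below are our assumptions, not the tuned values:

```python
import numpy as np

def sinkhorn_few(C, mu, nu, eta=30.0, iters=3):
    """Few-iteration entropic OT estimate between histograms mu, nu
    with ground-cost matrix C (assumed to have a positive maximum)."""
    lam = eta / C.max()               # regularization tied to the cost scale
    K = np.exp(-lam * C)              # Gibbs kernel
    u = np.ones_like(mu, dtype=float)
    for _ in range(iters):            # alternating marginal projections
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    P = u[:, None] * K * v[None, :]   # approximate transport plan
    return float((P * C).sum())       # its transport cost
```

Running only a handful of projection rounds yields a rough but very fast estimate, which is the trade-off the Sinkhorn-1/3/5 baselines explore.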

Figure 5: Additional Sinkhorn experiments. (a) Amazon dataset; (b) 20news dataset; (c) MNIST dataset; (d) -NN accuracy with varying regularization.