1 Introduction
Given a finite metric space $(X, d_X)$ and two distributions $\mu$ and $\nu$ on $X$, the Wasserstein-1 distance (a.k.a. Earth Mover's Distance or Optimal Transport) between $\mu$ and $\nu$ is defined as

(1)  $W_1(\mu, \nu) = \min_{\tau} \sum_{x, y \in X} \tau(x, y) \, d_X(x, y),$
where the minimum is taken over all distributions $\tau$ on $X \times X$ whose marginals are equal to $\mu$ and $\nu$. (For the mathematical foundations of Wasserstein distances, see [Vil03].) The Wasserstein distance and its variants are heavily used in applications to measure similarity in structured data domains, such as images [RTG00] and natural language text [KSKW15]. In particular, Kusner et al. [KSKW15] recently put forth the Word Mover's Distance (WMD)
as a similarity measure for text documents. Each document is seen as a uniform distribution over the words it contains, and the underlying metric between words is given by high-dimensional word embeddings such as word2vec
[MSC13] or GloVe [PSM14]. It is shown in [KSKW15] (see also [LYFC19, YCC19]) that the Wasserstein distance between the two distributions is a good similarity measure for the associated documents.

To leverage the Wasserstein distance for classification tasks, the above line of work uses the
nearest neighbor classifier. This poses a notorious bottleneck for large datasets, necessitating the use of fast approximate similarity search algorithms. While such algorithms are widely studied for
$\ell_p$ distances (chiefly $\ell_1$ and $\ell_2$; see [AIR18] for a survey), much less is known for Wasserstein distances, and a comprehensive study appears to be lacking. In particular, two properties of the $W_1$ distance make the nearest neighbor search problem very challenging. First, the distance is fairly difficult to compute (the most common approaches are combinatorial flow algorithms [Kuh55] or approximate iterative methods [Cut13]). Second, the distance is strongly incompatible with the Euclidean (and more generally, with $\ell_p$) geometries [Bou86, KN06, NS07, AIK08, ANN15, AKR18], which renders many of the existing techniques for nearest neighbor search inadequate (e.g., random projections).

In this work, we systematically study the nearest neighbor search (NNS) problem with respect to the $W_1$ distance. In accordance with the above applications, we focus on the case where the ground set $X$ is a finite subset of $\mathbb{R}^d$, endowed with the Euclidean distance, where $d$ is a possibly high dimension, and each distribution over $X$ has finite support of size at most $s$. (In the application to [KSKW15], $X$ is the set of word embeddings of, say, all terms in the English language, and $s$ is the maximum number of terms per text document.) Given a dataset of distributions $\mu_1, \dots, \mu_n$, the goal is to preprocess it, such that given a query distribution $\nu$ (also supported on $X$), we can quickly find the data points closest to $\nu$ in the
$W_1$ distance. To speed up search, the algorithms we consider rely on efficient estimates of the distances $W_1(\mu_i, \nu)$. This may lead to retrieving approximate nearest neighbors rather than the exact ones, which is often sufficient for practical applications.

1.1 Prior work
Kusner et al. [KSKW15] sped up NNS for WMD by designing two approximations of $W_1$. The first algorithm estimates $W_1(\mu, \nu)$ as the Euclidean distance between the means of $\mu$ and $\nu$. The second algorithm, called “Relaxed WMD” (abbrev. RWMD), assigns every point in the support of $\mu$ to its closest point in the support of $\nu$, and vice versa, and returns the maximum of the two assignments. Both of these methods produce an estimate no larger than the true distance $W_1(\mu, \nu)$. The former is much faster to compute, while the latter has a much better empirical quality of approximation. The overall NNS pipeline in [KSKW15] consists of a combination of both algorithms, together with exact distance computations.
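For concreteness, both estimates admit short NumPy implementations. The sketch below assumes each distribution is given as an array of support points plus a weight vector; the function names are ours, not the paper's.

```python
import numpy as np

def mean_lb(X, mu, Y, nu):
    """Estimate W_1 by the Euclidean distance between the two centroids.
    This is a lower bound on W_1 (by convexity of the norm)."""
    return float(np.linalg.norm(mu @ X - nu @ Y))

def rwmd(X, mu, Y, nu):
    """Relaxed WMD: drop one side's marginal constraints, ship each point's
    mass to its closest point on the other side, and take the max of the
    two directions. Each relaxed problem also lower-bounds W_1."""
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # pairwise costs
    return float(max(mu @ C.min(axis=1), nu @ C.min(axis=0)))
```

On a toy instance with supports {0, 1} and {1, 2} on the line (uniform weights), the true $W_1$ is 1, the mean estimate is 1, and RWMD gives 0.5; both are indeed lower bounds.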
Indyk and Thaper [IT03]
studied the approximate NNS problem for the Wasserstein distance in the context of image retrieval. Their approach capitalizes on a long line of work of
tree-based methods, in which the given metric space is embedded at random into a tree metric. This is a famously fruitful approach for many algorithmic and structural results [Bar96, Bar98, CCG98, Ind01, GKL03, FRT04, CKR05, MN06, IW17, IRW17, IW18, BIO19]. It is useful in particular for Wasserstein distances, since the optimal flow ($\tau$ in (1)) on a tree can be computed in linear time, and since a tree embedding of the underlying metric yields an $\ell_1$ embedding of the Wasserstein distance, as shown by [KT02, Cha02]. This allowed [IT03] to design an efficient NNS algorithm for $W_1$ based on classical locality-sensitive hashing (LSH). Recently, [LYFC19] introduced a kernel similarity measure based on the same approach, and showed promising empirical results for additional application domains.

1.2 Our results
Flowtree.
The tree-based method used in [IT03, LYFC19] is a classical algorithm called Quadtree, described in Section 2. We suggest a variant of Quadtree for estimating $W_1$ distances, which we call Flowtree. As in Quadtree, we embed the ground metric into a tree and compute a flow on the tree. The difference is that the cost of the flow is measured w.r.t. the original Euclidean distances on $X$ instead of the tree metric. This still yields a linear-time algorithm, since the optimal flow itself (and not just its cost) can be computed on a tree in linear time. At the same time, intuitively, using the original distances should render Flowtree more accurate than Quadtree. We substantiate this intuition both theoretically and empirically.
Theoretical results.
We analyze Quadtree and Flowtree, and show that Flowtree is qualitatively better than Quadtree for nearest neighbor search with respect to the $W_1$ distance. The key difference is that the approximation quality of Flowtree is independent of the dataset size, i.e., of the number of distributions $n$. Quadtree, on the other hand, degrades in quality as $n$ grows.
We expose this phenomenon in two regimes. First, we provide worst-case guarantees when the underlying metric on $X$ is $\ell_1$. Namely, we observe that the analysis of a related algorithm from [AIK08] can be used to show that Quadtree returns an $O(\log(sn) \cdot \log(d\Phi))$-approximate nearest neighbor with high probability, if all the dataset and query distributions have supports of size at most $s$; here $\Phi$ denotes the side length of a hypercube enclosing $X$, after scaling the minimal distance to $1$. (A distribution $\mu_i$ is a $c$-approximate nearest neighbor of $\nu$ if $W_1(\mu_i, \nu) \leq c \cdot \min_j W_1(\mu_j, \nu)$.) This analysis can be further modified to show an $O(\log s \cdot \log(d\Phi))$ approximation for Flowtree, which confirms the above intuition that Flowtree does not degrade as the dataset size grows. Furthermore, we show that the dependence on $n$ for Quadtree is necessary. While these results hold for an $\ell_1$ underlying metric, they are readily adapted to $\ell_2$ by a random rotation of the dataset (which in practice is not necessary).

Second, we consider a random setting. This is motivated by the observation that the above worst-case bounds are too pessimistic for real-world data. In practice, we observe that Quadtree and especially Flowtree recover the exact nearest neighbor with noticeable probability. To explain this, we introduce and analyze a simple random model, which we believe captures the relevant aspects of real-world instances. For this model, we show that both Flowtree and Quadtree recover the nearest neighbor with high probability. However, as was the case for the worst-case bounds, Quadtree's success rate degrades as the ground set grows, while Flowtree's does not.
Empirical results.
We evaluate the performance of Flowtree, and of several baselines and prior methods [KSKW15, Cut13], on nearest neighbor search with respect to the $W_1$ distance. The upshot is that Flowtree achieves accuracy on par with the most accurate existing methods, while being substantially faster.
Let us provide more context on these results. Generally, existing algorithms can be grouped into two classes: “fast” linear-time methods, which produce an efficient but coarse approximation of $W_1$, and quadratic-time methods, which produce a closer but slower approximation (though still considerably faster than the exact $W_1$ computation). The terms “linear” and “quadratic” refer to the dependence on $s$ of the time to estimate a single distance between distributions whose support size is at most $s$. A complete NNS pipeline would typically rely on a combination of these methods: initial pruning by a fast method, followed by intermediate pruning by a slower and more accurate method, followed by exact computation on the few surviving points. This scheme has been employed, for example, in [KSKW15] (termed “prefetch and prune”).
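For reference, the exact $W_1$ computation that terminates such a pipeline can be posed as a small transportation linear program, per the definition in (1). The sketch below encodes it with SciPy; the encoding and names are ours, and combinatorial flow solvers would be used in practice.

```python
import numpy as np
from scipy.optimize import linprog

def exact_w1(X, mu, Y, nu):
    """Exact W_1 between discrete distributions (support X, weights mu) and
    (support Y, weights nu), as a min-cost transportation LP over flows
    tau[i, j] >= 0 with marginals mu and nu."""
    n, m = len(X), len(Y)
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # ground costs
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # row marginal: sum_j tau[i,j] = mu[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # column marginal: sum_i tau[i,j] = nu[j]
    b_eq = np.concatenate([mu, nu])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return float(res.fun)
```

This runs in time polynomial in the support sizes, which is exactly why it is reserved for the final few candidates.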
Flowtree forms a new intermediate category, being a “slow” linear-time method. Specifically, its accuracy closely matches that of the quadratic-time methods, while its running time is only linear in $s$, rendering it much faster in practice. The difference between Flowtree and the existing “fast” linear-time methods is that the latter possess certain additional properties (namely, embeddings into $\ell_1$ or $\ell_2$ spaces, as explained below) that allow for considerable implementation speedups in practice. Flowtree lacks these properties and does not match their running time, but is dramatically more accurate. Thus, it offers a substantial improvement in the intermediate pruning stage of the NNS pipeline scheme outlined above.
2 Preliminaries: Quadtree
In this section we describe the classical Quadtree algorithm. Its name derives from its original use in two dimensions (cf. [Sam84]), but it extends to (and has been successfully used in) various high-dimensional settings (e.g., [Ind01, IT03, IRW17, BIO19]). It enjoys a combination of appealing theoretical properties and amenability to fast implementation. As it forms the basis for our work, we now describe it in detail.
Generic Quadtree.
Let $X \subset \mathbb{R}^d$ be a finite set of points. Our goal is to embed $X$ into a random tree metric, so as to approximately preserve each pairwise distance in $X$. To simplify the description, suppose that the minimum pairwise distance in $X$ is exactly $1$, and that all points in $X$ have coordinates in $[0, \Phi]$. (This is without loss of generality, as we can set the minimum distance to $1$ by scaling, with $\Phi$ determined accordingly, and we can shift all the points to have nonnegative coordinates without changing internal distances.)
The first step is to obtain a randomly shifted hypercube that encloses all points in $X$. To this end, let $H$ be the hypercube with side length $2\Phi$ centered at the origin. Let $\sigma$ be a random vector with i.i.d. coordinates uniformly distributed in $[0, \Phi]$. We randomly shift $H$ by $\sigma$, obtaining the hypercube $H' = H + \sigma$. Observe that $H'$ has side length $2\Phi$ and encloses $X$. The random shift is needed in order to obtain formal guarantees for arbitrary $X$.

Now, we construct a tree of hypercubes by letting $H'$ be the root, halving $H'$ along each dimension, and recursing on the resulting sub-hypercubes. We add to the tree only those hypercubes that are nonempty (i.e., contain at least one point from $X$). Furthermore, we do not partition hypercubes that contain exactly one point from $X$; they become leaves. The resulting tree has $O(\log(d\Phi))$ levels and exactly $|X|$ leaves, one per point in $X$. (This is since the diameter of the root hypercube is $2\Phi\sqrt{d}$, while the diameter of a leaf is bounded below by a constant: having scaled the minimal distance in $X$ to $1$, a hypercube of sufficiently small diameter contains a single point and thus becomes a leaf. Since the diameter is halved in each level, there are $O(\log(d\Phi))$ levels.) We number the root level as $0$, and the rest of the levels are numbered downward accordingly ($1, 2, \dots$). We set the weight of each tree edge between level $i$ and level $i+1$ to be proportional to the side length of the level-$i$ hypercubes, namely $\Phi \cdot 2^{-i}$.
It is straightforward to build a quadtree in time linear in $|X|$, $d$, and the number of levels. Let us remark that the size of the resulting tree is $O(|X| \cdot \log(d\Phi))$; in particular, there is no exponential dependence on the dimension $d$.
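A minimal sketch of the construction follows. It uses a flattened representation (per-level grid cell ids instead of an explicit tree object) after rescaling the input into the unit cube; all names, and the choice to make edge weights simply $2^{-l}$ at level $l$, are our simplifications of the scheme above.

```python
import numpy as np

def build_grid_levels(points, depth, shift=None):
    """Randomly shifted grid hierarchy underlying the quadtree: at level l,
    space is partitioned into cells of side 2^{-l}. Returns the integer
    cell id of every point at every level."""
    pts = np.asarray(points, dtype=float)
    span = pts.max() - pts.min() + 1e-9
    pts = (pts - pts.min()) / (2 * span)           # fit into [0, 1/2]^d
    if shift is None:
        shift = np.random.rand(pts.shape[1]) / 2   # i.i.d. uniform shift
    pts = pts + shift                              # now inside [0, 1)^d
    return [np.floor(pts * 2 ** l).astype(int) for l in range(depth)]

def quadtree_distance(levels, i, j):
    """Tree distance between points i and j: once they fall into different
    cells at some level, they stay separated below it, and each separated
    level l contributes edge weight proportional to the cell side 2^{-l}."""
    return sum(2 * 2.0 ** (-l)
               for l, cells in enumerate(levels)
               if not np.array_equal(cells[i], cells[j]))
```

With a fixed shift the construction is deterministic, which is convenient for testing; in the algorithm the shift is random, and the guarantees hold in expectation over it.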
Wasserstein on Quadtree.
The tree distance between each pair $x, y \in X$ is defined as the total edge weight on the unique path between their corresponding leaves in the quadtree. Given two distributions $\mu, \nu$ on $X$, the Wasserstein distance with this underlying tree metric (as a proxy for the Euclidean metric on $X$) admits the closed form $\sum_v w_v \cdot |\mu_v - \nu_v|$, where $v$ ranges over all nodes in the tree, $w_v$ is the weight of the edge from $v$ to its parent, $\mu_v$ is the total $\mu$-mass of the points enclosed in the hypercube associated with $v$, and $\nu_v$ is defined similarly for the $\nu$-mass. If $\mu, \nu$ have supports of size at most $s$, then this quantity can be computed in time linear in $s$ times the tree depth.
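The closed form can be sketched as follows, again in the flattened per-level-cell-id representation, with the level-$l$ edge weight taken as $2^{-l}$ (our simplification; the actual weights also carry the root side length as a scale factor).

```python
import numpy as np
from collections import defaultdict

def tree_wasserstein(levels, mu, nu):
    """Closed form of W_1 on the quadtree metric: sum over tree nodes
    (here, nonempty grid cells) of the node's edge weight times the
    absolute difference of mu-mass and nu-mass inside it. `levels[l][i]`
    is the level-l cell id of ground point i; mu, nu are mass vectors
    over the same ground set."""
    total = 0.0
    for l, cells in enumerate(levels):
        net = defaultdict(float)                  # net mass difference per cell
        for i, c in enumerate(map(tuple, cells)):
            net[c] += mu[i] - nu[i]
        total += 2.0 ** (-l) * sum(abs(m) for m in net.values())
    return total
```

Note that only cells touching the supports contribute, which is the source of the sparse-vector view discussed next.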
The above closed form implies, in particular, that $W_1$ on the quadtree metric (and indeed, on any tree metric) embeds isometrically into $\ell_1$, as originally observed by [Cha02] following [KT02]. Namely, the $\ell_1$ space has a coordinate associated with each tree node $v$, and a distribution $\mu$ is embedded in that space by setting the value of the coordinate of $v$ to $w_v \cdot \mu_v$, where $w_v$ and $\mu_v$ are defined as above. Furthermore, observe that if $\mu$ has support size at most $s$, then its corresponding embedding w.r.t. the tree metric has at most $s \cdot h$ nonzero entries, where $h$ is the height of the tree. Thus, computing $W_1$ on the tree metric amounts to computing the $\ell_1$ distance between sparse vectors, which further facilitates fast implementation in practice.
3 Flowtree
The Flowtree algorithm for NNS w.r.t. the $W_1$ distance is as follows. In the preprocessing stage, we build a quadtree on the ground set $X$, as described in Section 2. Let $d_T(x, y)$ denote the quadtree distance between a pair $x, y \in X$. In the query stage, in order to estimate $W_1(\mu, \nu)$ for two distributions $\mu, \nu$, we compute the optimal flow w.r.t. the tree metric, that is,

$\tau_T = \arg\min_{\tau} \sum_{x, y \in X} \tau(x, y) \, d_T(x, y),$

where the argmin is taken over all distributions $\tau$ on $X \times X$ with marginals $\mu$ and $\nu$. Then, the estimate of the $W_1$ distance between $\mu$ and $\nu$ is given by

$\sum_{x, y \in X} \tau_T(x, y) \, \|x - y\|_2.$

Note that if the support sizes of $\mu$ and $\nu$ are upper-bounded by $s$, then the Flowtree estimate of their distance can be computed in time linear in $s$.
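The query stage can be sketched as below, in the flattened representation where `levels[l][i]` is the grid cell of ground point `i` at level `l` (names and representation are ours). It exploits the fact that on a tree, greedily matching surplus $\mu$-mass to $\nu$-mass within cells, deepest level first, yields a tree-optimal flow, since tree costs depend only on the level at which two points separate.

```python
import numpy as np
from collections import defaultdict

def flowtree_estimate(levels, points, mu, nu):
    """Flowtree: resolve an optimal flow w.r.t. the tree metric bottom-up,
    then charge that flow with the ORIGINAL Euclidean costs."""
    pts = np.asarray(points, dtype=float)
    mu_res = np.array(mu, dtype=float)            # unmatched mass, mu side
    nu_res = np.array(nu, dtype=float)            # unmatched mass, nu side
    root = [np.zeros((len(pts), 1), dtype=int)]   # implicit root cell
    cost = 0.0
    for cells in list(levels)[::-1] + root:       # deepest level first
        buckets = defaultdict(lambda: ([], []))
        for i, c in enumerate(map(tuple, cells)):
            if mu_res[i] > 0: buckets[c][0].append(i)
            if nu_res[i] > 0: buckets[c][1].append(i)
        for srcs, dsts in buckets.values():       # match within each cell
            for i in srcs:
                for j in dsts:
                    f = min(mu_res[i], nu_res[j])
                    if f > 0:
                        cost += f * np.linalg.norm(pts[i] - pts[j])
                        mu_res[i] -= f
                        nu_res[j] -= f
    return cost
```

Replacing the Euclidean cost in the accumulation line by the tree cost would recover the plain Quadtree estimate; the flow itself is identical in both cases.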
Lemma.
The Flowtree estimate of $W_1(\mu, \nu)$ can be computed in time $O(s \cdot h)$, where $h$ is the height of the quadtree.
On the other hand, Flowtree has the notable property discussed above: its NNS approximation factor is independent of the dataset size $n$. In comparison, the classical Quadtree does not possess this property, and its accuracy deteriorates as the dataset becomes larger. We will now formally establish this distinction in two ways: first by proving worst-case bounds, and then by analyzing a random model of synthetic data.
3.1 Worstcase bounds
For convenience, the results in this section are stated with the underlying metric on $X$ being the $\ell_1$ distance; as mentioned earlier, the same results hold for $\ell_2$ by applying a random rotation to the dataset, which embeds $\ell_2$ into $\ell_1$ with constant distortion and constant blowup in the dimension [DIIM04]. All proofs are deferred to the appendix.
We start with an analytic worst-case bound on the performance of Quadtree. For convenience, let us recall the parameters: $X$ is a finite subset of $\mathbb{R}^d$, and $\Phi$ is such that $X$ can be enclosed in a hypercube of side length at most $\Phi$. We have a dataset of distributions $\mu_1, \dots, \mu_n$ and a query distribution $\nu$, where each of these distributions is supported on a subset of $X$ of size at most $s$. We look for the nearest neighbor of $\nu$ among $\mu_1, \dots, \mu_n$. The following theorem is an adaptation of a result of [AIK08] (where it is proven for a somewhat different algorithm, with a similar analysis).
Theorem 3.1 (Quadtree upper bound).
With high probability, the nearest neighbor of $\nu$ among $\mu_1, \dots, \mu_n$ in the Quadtree distance is an $O(\log(sn) \cdot \log(d\Phi))$-approximate nearest neighbor in the $W_1$ distance.
Next, we show that the dependence on $n$ in the above upper bound is necessary for Quadtree.
Theorem 3.2 (Quadtree lower bound).
Suppose $c$ is such that Quadtree is guaranteed to return a $c$-approximate nearest neighbor, for any dataset, with probability more than (say) $1/2$. Then $c = \Omega(\log n)$.
In contrast, Flowtree attains an approximation factor that does not depend on $n$.
Theorem 3.3 (Flowtree upper bound).
With high probability, the nearest neighbor of $\nu$ among $\mu_1, \dots, \mu_n$ in the Flowtree distance is an $O(\log s \cdot \log(d\Phi))$-approximate nearest neighbor in the $W_1$ distance.
3.2 Random model
In this section we consider a simple model of random data, which is canonical in studying nearest neighbor search. We choose a ground set $X$ of $N$ points i.i.d. uniformly at random on the unit sphere in $d$ dimensions. For each subset of $X$ of size $s$, we form the uniform distribution supported on that subset. These distributions make up the dataset (so $n = \binom{N}{s}$). To generate a query, we take an arbitrary dataset distribution $\mu$ as the “planted” nearest neighbor. Let $x_1, \dots, x_s$ be the points in its support. For each $x_j$, we pick a uniformly random point $y_j$ among the points on the sphere which are at distance at most $\epsilon$ from $x_j$, where $\epsilon > 0$ is a parameter. The query distribution $\nu$ is the uniform distribution over $y_1, \dots, y_s$. By known concentration of measure results, the distance from $y_j$ to every point in $X$ other than $x_j$ is $\Omega(1)$ with high probability. Therefore, the optimal flow from $\nu$ to $\mu$ is the perfect matching $y_j \mapsto x_j$, and $\mu$ is the nearest neighbor of $\nu$.
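A generator for this model can be sketched as follows. For simplicity we perturb each support point and re-project to the sphere instead of sampling the spherical cap exactly; this, and all names, are our assumptions rather than the paper's exact procedure.

```python
import numpy as np

def planted_instance(N, s, d, eps, rng):
    """Ground set of N uniform unit vectors, a planted distribution
    supported on s of them, and a query whose j-th point is a random unit
    vector within distance at most 2*eps of the j-th support point."""
    ground = rng.standard_normal((N, d))
    ground /= np.linalg.norm(ground, axis=1, keepdims=True)   # unit sphere
    support = rng.choice(N, size=s, replace=False)            # planted mu
    noise = rng.standard_normal((s, d))
    noise /= np.linalg.norm(noise, axis=1, keepdims=True)
    query = ground[support] + eps * noise                     # perturb
    query /= np.linalg.norm(query, axis=1, keepdims=True)     # re-project
    return ground, support, query
```

All distributions involved are uniform over their supports, so only the point sets need to be materialized.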
Our goal now is to show that in this model, the success probability of Quadtree in recovering the planted nearest neighbor decays exponentially with the ground set size $N$, while the success probability of Flowtree is independent of $N$.
Quadtree.
For every $j$, let $C_j$ be the smallest hypercube in the quadtree that contains both $x_j$ and $y_j$. (Note that $C_j$ is a random variable, determined by the initial random shift in the Quadtree construction.) In order for Quadtree to correctly identify $\mu$ as the nearest neighbor of $\nu$, every $C_j$ must not contain any additional points from $X$. Otherwise, if say $C_1$ contains a point $x' \neq x_1$, the distance on the quadtree from $\nu$ to $\mu$ is equal to its distance to the uniform distribution over $\{x', x_2, \dots, x_s\}$. Since the points in $X$ are chosen uniformly i.i.d. over the sphere, the probability of the above event, and thus the success probability of Quadtree, is upper bounded by $\mathbb{E}[(1 - p)^{N - s}]$, where $p$ is the measure of $\bigcup_j C_j$ on the sphere. This is a random variable whose distribution depends only on $s$, $d$ and $\epsilon$, and is independent of $N$. Thus the success probability decays exponentially with $N$.

Flowtree.
On the other hand, suppose that each $C_j$ contains no points from the support of $\mu$ other than $x_j$ (but is allowed to contain any other points from $X$). This event guarantees that the optimal flow on the tree between $\nu$ and $\mu$ is the planted perfect matching, i.e., the true optimal flow, and thus the estimated Flowtree distance between them equals $W_1(\nu, \mu)$. This guarantees that Flowtree recovers the planted nearest neighbor, and the event depends only on $x_1, \dots, x_s$ and $y_1, \dots, y_s$, and is independent of the ground set size $N$.
4 Experiments
We empirically evaluate Flowtree and compare it with various methods.
4.1 Synthetic data
We implemented the random model described in Section 3.2. The results are given in Figure 1. The x-axis is the ground set size $N$, and the y-axis is the fraction of successes over independent repetitions of planting a query and recovering its nearest neighbor. As predicted, the success ratio of Quadtree degrades as $N$ increases, while the success ratio of Flowtree does not.
4.2 Real data
Datasets.
We use three datasets from two application domains:

Text documents: We use a dataset of Amazon reviews split evenly over product categories, and the 20news dataset of news-related online discussion groups. Both have been used in [KSKW15] to evaluate the Word Mover's Distance. Each document is interpreted as a uniform distribution supported on the terms it contains (after stop-word removal). As the underlying metric, we use GloVe word embeddings [PSM14].

Image recognition: We use the MNIST dataset of handwritten digits. The Wasserstein distance has been applied to MNIST in [Cut13]. Each image is interpreted as a distribution over pixels, with mass proportional to the grayscale intensity of the pixel (normalized so that the total mass sums to $1$). Note that the distribution is supported on the nonzero pixels of the image. The underlying metric is the $2$-dimensional Euclidean distance between the pixels, identified with points in the plane.
Full properties of the datasets are listed in Table 1.
Name  Dataset size  Queries  Underlying metric  Avg. support size 

Amazon  GloVe word embeddings  
20news  GloVe word embeddings  
MNIST  2D Euclidean 
Methods.
We evaluate the following methods:

Mean: $W_1(\mu, \nu)$ is estimated as the Euclidean distance between the means of $\mu$ and $\nu$. This method has been suggested and used in [KSKW15].

Overlap: A simple baseline that estimates the similarity of $\mu$ and $\nu$ by the size of the intersection of their supports.

TF-IDF: A well-known similarity measure for text documents. We remark that it is closely related to Overlap. (Namely, it is a weighted variant of Overlap, where terms are weighted according to their frequency in the dataset.) For MNIST we omit this baseline since it is not a text dataset.

Quadtree: See Section 2.

Flowtree: See Section 3.

RWMD: The Relaxed WMD method of [KSKW15], described in Section 1.1. We recall that it is a greedy method that assigns every point in the support of $\mu$ to its closest point in the support of $\nu$, then repeats the process with the roles of $\mu$ and $\nu$ reversed, and returns the maximum of the two estimates. Note that this method does not produce a proper flow (i.e., it does not adhere to the capacity and demand constraints of $W_1$).

Sinkhorn with few iterations: The iterative Sinkhorn method of [Cut13] is designed to converge to a near-perfect approximation of $W_1$. Nonetheless, it can be adapted into a fast approximation algorithm by invoking it with a fixed small number of iterations. We use 1 and 3 iterations, referred to as Sinkhorn1 and Sinkhorn3 respectively. Since the Sinkhorn method requires tuning certain parameters (the number of iterations as well as the regularization parameter), the experiments in this section evaluate the method at its optimal setting, and the appendix includes additional experiments with more parameter settings.
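The few-iterations baseline can be sketched as below, in the standard scaling-iteration form of Sinkhorn; the parameter names, and scaling the regularization by the maximum cost, are our assumptions (the latter mirrors the normalization described in the appendix).

```python
import numpy as np

def sinkhorn_estimate(C, mu, nu, iters=1, reg=10.0):
    """Entropy-regularized OT with a fixed, small number of Sinkhorn
    scaling iterations, used as a fast W_1 estimate in the spirit of
    [Cut13]. C is the cost matrix between the two supports."""
    K = np.exp(-reg * C / C.max())       # Gibbs kernel
    u = np.ones(len(mu))
    v = np.ones(len(nu))
    for _ in range(iters):               # alternate marginal scalings
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    P = u[:, None] * K * v[None, :]      # approximate transport plan
    return float((P * C).sum())
```

With one or three iterations the plan's marginals are only approximately correct, which is exactly the accuracy/speed trade-off evaluated here.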
As mentioned in Section 1.2, these methods can be grouped as follows, in terms of the dependence of their running time on $s$:

“Fast” linear-time: Mean, Overlap, TF-IDF, Quadtree

“Slow” linear-time: Flowtree

Quadratic-time: RWMD, Sinkhorn
The difference between “fast” and “slow” linear time is that the former algorithms are more cache-efficient and benefit from SIMD vectorization. Notably, each of them reduces to either computing a single Euclidean distance in the ground metric (in the case of Mean), or a single distance between sparse vectors (in the case of Overlap, TF-IDF and Quadtree). This renders them an order of magnitude faster than the other methods, as our empirical results will show.
Evaluation metrics.
We measure the accuracy and the running time of each method. The results are depicted in Figures 2–4 and in Table 2. Accuracy is measured as follows: For each query, we sort the dataset by the distances as estimated by the evaluated method; then, for each number of “candidates” $k$, a query is considered successful if the true nearest neighbor (according to the exact $W_1$ distance) is among the $k$ nearest neighbors according to the estimated distances. The x-axis lists the number of candidates $k$, and the y-axis lists the fraction of successful queries.
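This accuracy metric can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def recall_at_k(true_dists, est_dists, ks):
    """A query succeeds at candidate count k if its true nearest neighbor
    appears among the top-k candidates ranked by the estimated distances.
    Inputs are (num_queries, dataset_size) arrays of true and estimated
    distances; returns the success fraction for each k."""
    true_nn = np.argmin(true_dists, axis=1)
    order = np.argsort(est_dists, axis=1)            # candidates, best first
    # 0-based rank of the true nearest neighbor for each query.
    ranks = np.argmax(order == true_nn[:, None], axis=1)
    return {k: float(np.mean(ranks < k)) for k in ks}
```

In the pipeline view, recall at $k$ candidates is exactly the probability that pruning down to $k$ points with the given estimator does not discard the true nearest neighbor.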
For each dataset, we plot on the left the accuracy of all methods for large numbers of candidates. On the right we plot the accuracy of the high-accuracy methods for small numbers of candidates (since they cannot be discerned in the left plots). The high-accuracy methods are Flowtree, RWMD and Sinkhorn, and on MNIST also Quadtree. For Quadtree and Flowtree, which are randomized methods, we report the mean and standard deviation (shown as error bars) over independent executions. The other methods are deterministic.

The legend of each plot is annotated with the running time of each method (also summarized in Table 2). The running time is measured as the average time, over the queries, to estimate all distances from the query to the whole dataset. Running times are measured on a “Standard F72s_v2” Microsoft Azure instance equipped with an Intel Xeon Platinum 8168 CPU. In our implementations, we use NumPy linked with OpenBLAS, which is used in single-threaded mode. On the MNIST dataset, the accuracy of Sinkhorn is evaluated on a random subset of the queries, and the running time of RWMD, Sinkhorn and Exact is evaluated on a random subset of the queries, due to their computational cost.
Results.
The results show that Flowtree, RWMD and Sinkhorn achieve dramatically higher accuracy than the other baselines. Among these, Flowtree is the fastest by a margin. In particular, its accuracy is comparable or superior to RWMD, while being considerably faster. Compared to Sinkhorn1, Flowtree achieves somewhat lower accuracy, but is markedly faster.
Dataset  Mean  Overlap  TFIDF  Quadtree  Flowtree  RWMD  Sinkhorn1  Sinkhorn3  Exact 

Amazon  ms  ms  ms  ms  s  s  s  s  s 
20news  ms  ms  ms  ms  s  s  s  s  s 
MNIST  ms  ms  —  ms  s  s  s  s  s 
5 Future directions
We view this work as a first step, leaving a number of exciting open problems for future work.

We suspect that the worst-case bounds for Quadtree and Flowtree might not be tight. The challenge is either to improve them, or to prove matching lower bounds.

Note that given a nearest neighbor query, we need to estimate the distances from a single distribution (the query) to many distributions (the dataset) at once. Could this be leveraged to speed up Flowtree?

For an actual nearest neighbor search system, one likely needs to combine several algorithms, as outlined above. It would be useful to conduct a systematic investigation on this front, and obtain a “recipe” for setting up such a pipeline (of increasingly slow but more accurate approximation methods, gradually pruning more and more candidates) for a given dataset.
References
 [AIK08] Alexandr Andoni, Piotr Indyk, and Robert Krauthgamer. Earth mover distance over high-dimensional spaces. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 343–352. Society for Industrial and Applied Mathematics, 2008.
 [AIR18] Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn. Approximate nearest neighbor search in high dimensions. arXiv preprint arXiv:1806.09823, 2018.
 [AKR18] Alexandr Andoni, Robert Krauthgamer, and Ilya Razenshteyn. Sketching and embedding are equivalent for norms. SIAM Journal on Computing, 47(3):890–916, 2018.
 [ANN15] Alexandr Andoni, Assaf Naor, and Ofer Neiman. Snowflake universality of Wasserstein spaces. arXiv preprint arXiv:1509.08677, 2015.
 [Bar96] Yair Bartal. Probabilistic approximation of metric spaces and its algorithmic applications. In Proceedings of 37th Conference on Foundations of Computer Science, pages 184–193. IEEE, 1996.
 [Bar98] Yair Bartal. On approximating arbitrary metrices by tree metrics. In STOC, volume 98, pages 161–168, 1998.

[BIO19]
Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and
Tal Wagner.
Scalable fair clustering.
In
International Conference on Machine Learning
, pages 405–413, 2019.  [Bou86] Jean Bourgain. The metrical interpretation of superreflexivity in banach spaces. Israel Journal of Mathematics, 56(2):222–230, 1986.
 [CCG98] Moses Charikar, Chandra Chekuri, Ashish Goel, Sudipto Guha, and Serge Plotkin. Approximating a finite metric by a small number of tree metrics. In Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No. 98CB36280), pages 379–388. IEEE, 1998.

[Cha02]
Moses S Charikar.
Similarity estimation techniques from rounding algorithms.
In
Proceedings of the thiryfourth annual ACM symposium on Theory of computing
, pages 380–388. ACM, 2002.  [CKR05] Gruia Calinescu, Howard Karloff, and Yuval Rabani. Approximation algorithms for the 0extension problem. SIAM Journal on Computing, 34(2):358–372, 2005.
 [Cut13] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
 [DIIM04] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253–262. ACM, 2004.
 [FRT04] Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. A tight bound on approximating arbitrary metrics by tree metrics. Journal of Computer and System Sciences, 69(3):485–497, 2004.
 [GKL03] Anupam Gupta, Robert Krauthgamer, and James R Lee. Bounded geometries, fractals, and low-distortion embeddings. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pages 534–543. IEEE, 2003.
 [Ind01] Piotr Indyk. Algorithmic applications of lowdistortion geometric embeddings. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pages 10–33. IEEE, 2001.
 [IRW17] Piotr Indyk, Ilya Razenshteyn, and Tal Wagner. Practical data-dependent metric compression with provable guarantees. In Advances in Neural Information Processing Systems, pages 2617–2626, 2017.
 [IT03] Piotr Indyk and Nitin Thaper. Fast image retrieval via embeddings. In 3rd international workshop on statistical and computational theories of vision, volume 2, page 5, 2003.
 [IW17] Piotr Indyk and Tal Wagner. Near-optimal (Euclidean) metric compression. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 710–723. SIAM, 2017.
 [IW18] Piotr Indyk and Tal Wagner. Approximate nearest neighbors in limited space. In Conference On Learning Theory, pages 2012–2036, 2018.
 [KN06] Subhash Khot and Assaf Naor. Nonembeddability theorems via Fourier analysis. Mathematische Annalen, 334(4):821–852, 2006.
 [KSKW15] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In International conference on machine learning, pages 957–966, 2015.
 [KT02] Jon Kleinberg and Eva Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and markov random fields. Journal of the ACM (JACM), 49(5):616–639, 2002.
 [Kuh55] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2):83–97, 1955.
 [LYFC19] Tam Le, Makoto Yamada, Kenji Fukumizu, and Marco Cuturi. Tree-sliced approximation of Wasserstein distances. arXiv preprint arXiv:1902.00342, 2019.
 [MN06] Manor Mendel and Assaf Naor. Ramsey partitions and proximity data structures. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 109–118. IEEE, 2006.
 [MSC13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 [NS07] Assaf Naor and Gideon Schechtman. Planar earthmover is not in $L_1$. SIAM Journal on Computing, 37(3):804–826, 2007.

 [PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
 [RTG00] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
 [Sam84] Hanan Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys (CSUR), 16(2):187–260, 1984.
 [Vil03] Cédric Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.
 [YCC19] Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin Solomon. Hierarchical optimal transport for document representation. arXiv preprint arXiv:1906.10827, 2019.
Appendix A Proofs
Proof of Theorem 3.1.
Let $x, y \in X$. Let $p_i$ be the probability that $x, y$ fall into the same cell (hypercube) in level $i$ of the quadtree. Since the cells in level $i$ have side length $\Phi \cdot 2^{-i}$, and each axis is cut between $x$ and $y$ independently with probability governed by their distance along that axis, it satisfies
$$p_i \leq \exp\left(-\Omega\left(\frac{\|x - y\|_1}{\Phi \cdot 2^{-i}}\right)\right).$$
Let $d_T$ be the tree metric induced on $X$ by the quadtree. Note that for $d_T(x, y)$ to be at most (a constant times) $\Phi \cdot 2^{-i}$, $x$ and $y$ must fall into the same hypercube in level $i$. For any $t \geq 1$, we can round $t$ to its nearest power of $2$ and obtain $t'$ such that $t' \leq t \leq 2t'$. It satisfies,
$$\Pr\left[d_T(x, y) \leq \frac{\|x - y\|_1}{t}\right] \leq \exp(-\Omega(t)).$$
By letting $t = \Theta(\log(s^2 n))$, we can take a union bound either over all pairwise distances in $X$ (of which there are $|X|^2$), or over all distances between the support of the query and the union of supports of the dataset (of which there are at most $s^2 n$, if every support has size at most $s$). Then, with probability say $0.95$, all those distances are contracted by at most a factor of $O(\log(sn))$, i.e.,
(2) $\quad d_T(x, y) \geq \dfrac{\|x - y\|_1}{O(\log(sn))}.$
On the other hand, summing the separation probabilities over the $O(\log(d\Phi))$ levels,
$$\mathbb{E}[d_T(x, y)] \leq O(\log(d\Phi)) \cdot \|x - y\|_1.$$
Let $\mu^*$ be the true nearest neighbor of $\nu$ in $\mu_1, \dots, \mu_n$. Let $\tau^*$ be the optimal flow between them. Then by the above,
$$\mathbb{E}\left[\sum_{x, y} \tau^*(x, y) \, d_T(x, y)\right] \leq O(\log(d\Phi)) \cdot W_1(\nu, \mu^*).$$
By Markov, with probability say $0.95$,
(3) $\quad \sum_{x, y} \tau^*(x, y) \, d_T(x, y) \leq O(\log(d\Phi)) \cdot W_1(\nu, \mu^*).$
Let $\hat\mu$ be the nearest neighbor of $\nu$ in the dataset according to the quadtree distance. Let $\hat\tau$ be the optimal flow between them in the true underlying metric ($\ell_1$ on $X$), and let $\tau_T$ be the optimal flow between them in the quadtree. Finally, let $W_T$ denote the Wasserstein distance on the quadtree. Then,
$$W_1(\nu, \hat\mu) = \sum_{x, y} \hat\tau(x, y) \, \|x - y\|_1 \qquad\qquad \hat\tau \text{ is optimal for } W_1(\nu, \hat\mu)$$
$$\leq \sum_{x, y} \tau_T(x, y) \, \|x - y\|_1 \leq O(\log(sn)) \cdot W_T(\nu, \hat\mu) \qquad\qquad \text{eq. (2) and definition of } W_T$$
$$\leq O(\log(sn)) \cdot W_T(\nu, \mu^*) \qquad\qquad \hat\mu \text{ is the nearest neighbor in } W_T$$
$$\leq O(\log(sn)) \cdot \sum_{x, y} \tau^*(x, y) \, d_T(x, y) \qquad\qquad \text{definition of } W_T$$
$$\leq O(\log(sn) \cdot \log(d\Phi)) \cdot W_1(\nu, \mu^*) \qquad\qquad \text{eq. (3), } \tau^* \text{ is optimal for } W_1(\nu, \mu^*)$$
so $\hat\mu$ is an $O(\log(sn) \cdot \log(d\Phi))$-approximate nearest neighbor. ∎
Proof of Theorem 3.2.
It suffices to prove the claim for $s = 1$ (i.e., for the standard $\ell_1$ distance between points). Let $d$ be an even integer. Consider the $d$-dimensional hypercube $\{0, 1\}^d$. Our query point is the origin. The true nearest neighbor is $e_1$ (a standard basis vector). The other data points are the hypercube nodes whose Hamming weight is exactly $d/2$. The number of such points is $\binom{d}{d/2}$, and this is our $n$.
Consider imposing the grid with cell side $2$ on the hypercube. The probability that $0$ and $1$ are uncut in a given axis is exactly $1/2$, and since the shifts in different axes are independent, the number of uncut axes is distributed as $\mathrm{Bin}(d, 1/2)$. Thus with probability at least $1/2$ there are at least $d/2$ uncut dimensions. If this happens, we have a data point hashed into the same grid cell as the origin (to get such a data point, put $1$ in any $d/2$ uncut dimensions and $0$ in the rest), so its quadtree distance from the origin is small. On the other hand, the quadtree distance of the origin to its true nearest neighbor is of the same order, since they will necessarily be separated in the next level (when the grid cells have side $1$). Thus the quadtree cannot tell the true nearest neighbor (at $\ell_1$ distance $1$) apart from a point at $\ell_1$ distance $d/2$, and we get the lower bound $c = \Omega(d)$. Since $n = \binom{d}{d/2} \leq 2^d$, we have $d \geq \log_2 n$, and thus $c = \Omega(\log n)$, as desired. ∎
Proof of Theorem 3.3.
The proof is the same as for Theorem 3.1, except that in eq. (2), we take a union bound only over the (at most $s^2$) distances between the supports of $\nu$ and $\mu^*$ (the query and its true nearest neighbor). Thus each distance between the supports of $\nu$ and $\mu^*$ is contracted by at most $O(\log s)$.
Let $W_F$ denote the Flowtree distance estimate of $W_1$. Let $\hat\mu$ be the nearest neighbor of $\nu$ in the Flowtree distance, and let $\tau_T$ and $\tau_T^*$ be the optimal flows w.r.t. the tree metric between $\nu$ and $\hat\mu$, and between $\nu$ and $\mu^*$, respectively. With the same notation as in the proof of Theorem 3.1,
$$W_1(\nu, \hat\mu) \leq \sum_{x, y} \tau_T(x, y) \, \|x - y\|_1 \qquad\qquad \hat\tau \text{ is optimal for } W_1(\nu, \hat\mu)$$
$$= W_F(\nu, \hat\mu) \qquad\qquad \text{Flowtree definition}$$
$$\leq W_F(\nu, \mu^*) \qquad\qquad \hat\mu \text{ is nearest in Flowtree distance}$$
$$= \sum_{x, y} \tau_T^*(x, y) \, \|x - y\|_1 \qquad\qquad \text{Flowtree definition}$$
$$\leq O(\log s) \cdot \sum_{x, y} \tau_T^*(x, y) \, d_T(x, y) \leq O(\log s \cdot \log(d\Phi)) \cdot W_1(\nu, \mu^*) \qquad\qquad \text{eqs. (2), (3); } \tau_T^* \text{ is optimal for } d_T$$
as needed. Note that the difference from the proof of Theorem 3.1 is that we only needed the contraction bound (eq. 2) for distances between the supports of $\nu$ and $\mu^*$. ∎
Appendix B Additional Sinkhorn experiments
We include additional experimental results for the few-iterations Sinkhorn baseline, depicted in Figure 5.
Number of iterations.
Regularization parameter.
Sinkhorn has a regularization parameter that needs to be tuned per dataset. We set it to $\lambda / C_{\max}$, where $C_{\max}$ is the maximum value in the cost matrix (of the currently evaluated pair of distributions), and tune $\lambda$. In all three of our datasets the same setting of $\lambda$ is optimal, and it is the setting we use in Section 4. As an example, Figure 5(d) depicts the NN accuracy (y-axis) of Sinkhorn1 per value of $\lambda$ (x-axis).