The nearest neighbor search problem is defined as follows: given a set $X$ of $n$ points in a $d$-dimensional space, build a data structure that, given any query point $q$, returns the point in $X$ closest to $q$. For efficiency reasons, the problem is often relaxed to approximate nearest neighbor search, where the goal is to find a point whose distance to $q$ is at most $c$ times the distance from $q$ to its nearest neighbor, for some approximation factor $c > 1$. Both problems have found numerous applications in machine learning, computer vision, information retrieval and other areas. In machine learning in particular, nearest neighbor classifiers are popular baseline methods whose classification error often comes close to that of the best known techniques ([Efr17]).
Developing fast approximate nearest neighbor search algorithms has been the subject of extensive research efforts over the last two decades; see e.g. [SDI06, AI17] for an overview. More recently, there has been increased focus on designing nearest neighbor methods that use a limited amount of space. This is motivated by the need to fit the data set in main memory ([JDJ17a, JDJ17b]) or on an Internet of Things device ([GSG17]). Furthermore, even a simple linear scan over the data is more time-efficient if the data is compressed. Compression is most often achieved by developing compact representations of the data that approximately preserve the distances between the points (see [WLKC16] for a survey): such representations are smaller than the original (uncompressed) data set, while still supporting approximate distance computations.
Most of the approaches in the literature are only validated empirically. The currently best known theoretical tradeoffs between the representation size and the approximation quality are summarized in Table 1, together with their functionalities and constraints.
| Reference | Bits per point | Comments |
|---|---|---|
| [JL84] | | Estimates distances between any query point and all points of $X$ |
| [KOR00] | | Estimates distances between any query point and all points of $X$, within a fixed distance range |
| [IW17] | | Estimates distances between all pairs of points in $X$; does not provably support out-of-sample queries |
| This paper | | Returns an approximate nearest neighbor of the query point in $X$ |
Unfortunately, in the context of approximate nearest neighbor search, the above representations lead to sub-optimal results. The result of [IW17] cannot be used to obtain provable bounds for nearest neighbor search, because its distance preservation guarantees hold only for pairs of points in the pointset $X$. (We note, however, that a simplified version of this method, described in [IRW17], was shown to have good empirical performance for nearest neighbor search.) The result of [KOR00] only estimates distances in a certain range; extending this approach to all distances would multiply the storage by an additional factor. Finally, the representations obtained via a direct application of randomized dimensionality reduction ([JL84]) are also significantly larger than the bound from [IW17].
In this paper we show that it is possible to overcome the limitations of the previous results and design a compact representation that supports $(1+\epsilon)$-approximate nearest neighbor search, with a space bound essentially matching that of [IW17]. This constitutes the first reduction in the space complexity of approximate nearest neighbor search below the “Johnson-Lindenstrauss bound”. Specifically, we show the following. Suppose that we want the data structure to answer approximate nearest neighbor queries for a $d$-dimensional dataset of size $n$, in which coordinates are represented by $B$ bits each. All queries must be answered correctly with probability $1-\delta$. (See Section 2 for the formal problem definition.)
Theorem 1.1. For the all-nearest-neighbors problem, there is a sketch of size
Interestingly, the representation by itself does not return the (approximate) distance between the query point and the returned neighbor. Thus, we also consider the problem of estimating distances from a query point to all data points. In this setting, a result of [MWY13] shows that the Johnson-Lindenstrauss space bound is optimal when the number of queries is equal to the number of data points. However, in many settings, the number of queries is substantially smaller than the dataset size. We give nearly tight upper and lower bounds (up to a lower-order factor) for this problem, showing that it is possible to smoothly interpolate between [IW17], which does not support out-of-sample distance queries, and the Johnson-Lindenstrauss bound.
Specifically, we show the following. Suppose that we want the data structure to estimate all cross-distances between a set of $m$ query points and all points in $X$, all of which must be estimated correctly with probability $1-\delta$ (see Section 2 for the formal problem definition).
Theorem 1.2. For the all-cross-distances problem, there is a sketch of size
Note that the dependence per point on $n$ is logarithmic, as opposed to doubly logarithmic in Theorem 1.1. We show this dependence is necessary, as per the following theorem.
Theorem 1.3. Suppose that the parameters satisfy appropriate conditions, for some constants and a sufficiently small constant. Then, for the all-cross-distances problem, any sketch must use at least
[IRW17] presented a simplified version of [IW17], which has slightly weaker size guarantees, but on the other hand is practical to implement and was shown to work well empirically. However, it did not provably support out-of-sample queries. Our techniques in this paper can be adapted to their algorithm and endow it with such provable guarantees, while retaining its simplicity and practicality. We elaborate on this in Appendix D.
The starting point of our representation is the compressed tree data structure from [IW17]. The structure is obtained by constructing a hierarchical clustering of the data set, forming a tree of clusters. The position of each point corresponding to a node in the tree is then represented by storing a (quantized) displacement vector between the point and its “ancestor” in the tree. The resulting tree is further compressed by identifying and post-processing “long” paths in the tree. The intuition is that a subtree at the bottom of such a path corresponds to a cluster of points that is “sufficiently separated” from the rest of the points (see Figure 1). This means that the data structure does not need to know the exact position of this cluster in order to estimate the distances between the points in the cluster and the rest of the data set. Thus the data structure replaces each long path by a quantized displacement vector, where the quantization error does not depend on the length of the path. This ensures that the tree does not have long paths, which bounds its total size.
Unfortunately, this reasoning breaks down if one of the points is not known in advance, as is the case for the approximate nearest neighbor problem. In particular, if the query point $q$ lies in the vicinity of the separated cluster, then small perturbations to the cluster position can dramatically affect which points in the cluster are closest to $q$ (see Figure 2 for an illustration).
In this paper we overcome this issue by maintaining extra information about the geometry of the point set. First, for each long path, we store not only the quantized displacement vector (which preserves the “global” position of the subtree with respect to the rest of the tree) but also the suffix of the path. Intuitively, this allows us to recover both the most significant bits and the least significant bits of points in the subtree corresponding to the “separated” clusters, which allows us to avoid cases as depicted in Figure 2. However, this intuition breaks down when the diameter of the cluster is much larger than the amount of “separation”. Thus we also need to store extra information about the position of the subtree points. This is accomplished by storing a hashed representation of a representative point of the subtree (called “the center”). We note that this modification makes our data structure inherently randomized; in contrast, the data structure of [IW17] was deterministic.
Given the above information, the approximate nearest neighbor search is performed top-down, as follows. In each step, we recover and enumerate points in the current subtree, some of which could be centers of “separated” clusters as described above. The “correct” center, guaranteed to contain an approximate nearest neighbor of the query point, is identified by its hashed value (if no hash match is found, then any center is equally good). Note that our data structure does not allow us to compute all distances from the query point to all points in $X$ (in fact, as mentioned earlier, this task is not possible to achieve within the desired space bound). Instead, it stores just enough information to ensure that the procedure never selects a “wrong” subtree to iterate on.
Lastly, suppose we also wish to estimate all distances from $q$ to $X$. To this end, we augment each subtree with the distance sketches due to [KOR00] and [JL84]. The former allows us to identify the cluster of all approximate nearest neighbors of $q$ (whereas the above algorithm was only guaranteed to return one approximate nearest neighbor). The latter stores the approximate distance to that cluster. These are the smallest distances from $q$ to $X$, which are the most challenging to estimate; the remaining distances can be estimated based on the hierarchical partition into well-separated clusters, which is already present in the sketch.
2 Formal Problem Statements
We formalize the problems in terms of one-way communication complexity. The setting is as follows. Alice has $n$ data points, $x_1, \dots, x_n$, while Bob has $m$ query points, $q_1, \dots, q_m$, where $m \le n$. Distances are Euclidean, and we can assume w.l.o.g. that $d \le n$, since any $n$-point Euclidean metric can be embedded into $n$ dimensions. Let $\epsilon, \delta > 0$ be given parameters. In the one-way communication model, Alice computes a compact representation (called a sketch) of her data points and sends it to Bob, who then needs to report the output. We define two problems in this model (with private randomness), each parameterized by $\epsilon$ and $\delta$. (Throughout we use $[k]$ to denote $\{1, \dots, k\}$, for an integer $k$.)
Problem 1 – All-nearest-neighbors:
Bob needs to report a $(1+\epsilon)$-approximate nearest neighbor in $X$ for all his points simultaneously, with probability $1-\delta$. That is, for every $j \in [m]$, Bob reports an index $i_j \in [n]$ such that $\|q_j - x_{i_j}\| \le (1+\epsilon) \cdot \min_{i \in [n]} \|q_j - x_i\|$.
Our upper bound for this problem is stated in Theorem 1.1.
Problem 2 – All-cross-distances:
Bob needs to report an estimate of $\|q_j - x_i\|$ up to a factor of $1 \pm \epsilon$, for every $i \in [n]$ and $j \in [m]$ simultaneously, with probability $1-\delta$. Our upper bound for this problem is stated in Theorem 1.2.
3 Basic Sketch
In this section we describe the basic data structure (generated by Alice) used for all of our results. The data structure augments the representation from [IW17], which we will now reproduce. For the sake of readability, the notions from the latter paper (tree construction via hierarchical clustering, centers, ingresses and surrogates) are interleaved with the new ideas introduced in this paper (top-out compression, grid quantization and surrogate hashing). Proofs in this section are deferred to Appendix A.
3.1 Hierarchical Clustering Tree
The sketch consists of an annotated hierarchical clustering tree, which we now describe with our modified “top-out compression” step.
We construct the inter-link hierarchical clustering tree of $X$: In the bottom level (numbered $0$) every point is a singleton cluster, and level $\ell$ is formed from level $\ell-1$ by recursively merging any two clusters whose distance is at most $2^\ell$, until no two such clusters are present. We repeat this up to the top level, even if all points in $X$ are already joined in one cluster at a lower level. The following observation is immediate.
If $x, y \in X$ are in different clusters at level $\ell$, then $\|x - y\| > 2^\ell$.
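To make the construction concrete, the level-by-level merging can be sketched as follows. This is a naive quadratic-time illustration, assuming a merge threshold of $2^\ell$ at level $\ell$ as above; the function names and data layout are ours, not the paper's exact interface.

```python
import itertools
import math

def linkage_tree(points, num_levels):
    """Build the hierarchy of partitions described above, one per level.
    Each cluster is a frozenset of point indices; level 0 is singletons,
    and level L merges any two clusters at distance <= 2**L (assumed
    threshold) until no close pair remains."""
    def cluster_dist(a, b):
        # minimum inter-point distance between two clusters
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    level_partitions = [[frozenset([i]) for i in range(len(points))]]
    for level in range(1, num_levels + 1):
        clusters = list(level_partitions[-1])
        merged = True
        while merged:              # keep merging until no pair is close enough
            merged = False
            for a, b in itertools.combinations(clusters, 2):
                if cluster_dist(a, b) <= 2 ** level:
                    clusters.remove(a)
                    clusters.remove(b)
                    clusters.append(a | b)
                    merged = True
                    break
        level_partitions.append(clusters)
    return level_partitions
```

On three collinear points at 0, 0.5 and 10, the two close points join at level 1 and the far point only joins once the threshold exceeds their separation.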
Let $T$ denote the tree. For every tree node $v$, we denote its level by $\ell(v)$, its associated cluster by $C(v)$, and its cluster diameter by $\Delta(v)$. For a point $x \in X$, let $\mathrm{leaf}(x)$ denote the tree leaf whose associated cluster is $\{x\}$.
The degree of a node in $T$ is its number of children. A $1$-path with $k$ edges in $T$ is a downward path $v_0, v_1, \dots, v_k$, such that (i) each of the nodes $v_0, \dots, v_{k-1}$ has degree $1$, (ii) $v_k$ has degree either $0$ or more than $1$, (iii) if $v_0$ is not the root of $T$, then its ancestor has degree more than $1$.
If a node is the bottom of a $1$-path with sufficiently many edges, we apply top-out compression: all but a fixed number of the bottom edges of the path are replaced with a single long edge, connecting the top node of the path directly to the retained bottom portion. The interior nodes of the replaced portion are removed from the tree, and the long edge is annotated with the length (number of levels) of the path it represents.
The compressed tree has nodes.
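The top-out compression step can be sketched as follows. The `keep` parameter stands in for the paper's (elided) choice of how many bottom edges of a long $1$-path to retain, and the tree representation is an illustrative assumption.

```python
def compress_paths(children, root, keep):
    """Replace every maximal chain of degree-1 nodes having more than
    `keep` edges by a single long edge, retaining the bottom `keep`
    edges; the long edge is annotated with the number of levels it
    spans. `children` maps node -> list of children."""
    long_edges = {}      # bottom endpoint of each long edge -> levels spanned
    new_children = {}

    def walk(u):
        kids = children.get(u, [])
        if len(kids) == 1:
            # collect the maximal degree-1 chain starting at u
            chain = [u]
            v = kids[0]
            while len(children.get(v, [])) == 1:
                chain.append(v)
                v = children[v][0]
            chain.append(v)                  # v: first non-degree-1 node
            edges = len(chain) - 1
            if edges > keep:
                bottom = chain[-(keep + 1):]         # keep bottom `keep` edges
                new_children[u] = [bottom[0]]
                long_edges[bottom[0]] = edges - keep  # annotated length
                for a, b in zip(bottom, bottom[1:]):
                    new_children[a] = [b]
            else:
                for a, b in zip(chain, chain[1:]):   # short chain: copy as-is
                    new_children[a] = [b]
            walk(v)
            return
        new_children[u] = list(kids)
        for k in kids:
            walk(k)

    walk(root)
    return new_children, long_edges
```

For a pure chain of five edges with `keep=2`, the top three edges collapse into one long edge of annotated length 3.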
We henceforth refer only to the compressed tree, and denote it by $T^*$. However, for every node $v$ in $T^*$, $\ell(v)$ continues to denote its level before compression (i.e., the level where the long edges are counted according to their lengths). We partition $T^*$ into subtrees by removing the long edges. Let $\mathcal{T}$ denote the set of resulting subtrees.
Let be the bottom node of a long edge, and . Then .
Let be a leaf of a subtree in , and . Then .
The purpose of annotating the tree is to be able to recover a list of surrogates for every point in $X$. A surrogate of a point is a point whose location approximates it. Since we will need to compare surrogates to a new query point, which is unknown during sketching, we define the surrogates to encompass a certain amount of information about the absolute point location, by hashing a coarsened grid quantization of a representative point in each subtree.
With every tree node $v$ we associate an index $c(v) \in [n]$ such that $x_{c(v)} \in C(v)$, and we call $x_{c(v)}$ the center of $v$. The centers are chosen bottom-up in $T^*$ as follows. For a leaf $v$, $C(v)$ contains a single point $x_i$, and we set $c(v) = i$. For a non-leaf $v$ with children $v_1, \dots, v_k$, we set $c(v) = c(v_1)$.
Fix a subtree $T \in \mathcal{T}$. To every node $v$ in $T$, except the root, we will now assign an ingress node, denoted $\mathrm{in}(v)$. Intuitively this is a node in the same subtree whose center is close to $C(v)$, and the purpose is to store the location of the center of $v$ by its quantized displacement from that nearby center (whose location will have been already stored, by induction).
We will now assign ingresses to all children of a given node $v$. (Doing this for every $v$ in $T$ defines ingresses for all nodes in $T$ except its root.) Let $v_1, \dots, v_k$ be the children of $v$, and w.l.o.g. $c(v) = c(v_1)$. Consider the graph $G$ whose nodes are $v_1, \dots, v_k$, in which $v_i$ and $v_j$ are neighbors if there are points $x \in C(v_i)$ and $y \in C(v_j)$ such that $\|x - y\| \le 2^{\ell(v)}$. By the tree construction, $G$ is connected. We fix an arbitrary spanning tree $G'$ of $G$ which is rooted at $v_1$.
For $v_1$ we set $\mathrm{in}(v_1) = v$. For $v_i$ with $i > 1$, let $v_j$ be its (unique) direct ancestor in the spanning tree $G'$. Let $y$ be the closest point to $C(v_i)$ in $C(v_j)$. Note that in $T^*$ there is a downward path from $v_j$ to $\mathrm{leaf}(y)$. Let $u$ be the bottom node in that path that belongs to $T$. (Equivalently, $u$ is the bottom node on that downward path that is reachable from $v_j$ without traversing a long edge.) We set $\mathrm{in}(v_i) = u$.
Grid net quantization
Assume w.l.o.g. that the precision parameters below are powers of $2$. We define a hierarchy of grids as follows. We begin with the single hypercube whose corners are $\{-1, 1\}^d$. We generate the next grid by halving it along each dimension, and so on. For every $\gamma > 0$, let $\Gamma_\gamma$ be the coarsest grid generated whose cell side is at most $\gamma$. Note that every cell in $\Gamma_\gamma$ has diameter at most $\gamma \sqrt{d}$. For a point $x$, we denote by $\Gamma_\gamma(x)$ the closest corner of the grid cell containing it.
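A minimal sketch of this quantization, assuming for simplicity a grid aligned at the origin (the construction above aligns it with the bounding hypercube, which only shifts the corners); function names are illustrative:

```python
def coarsest_side(gamma):
    """Coarsest power-of-2 cell side that is at most gamma, starting
    from side 1 and halving, matching the hierarchy described above."""
    side = 1.0
    while side > gamma:
        side /= 2
    return side

def grid_snap(point, cell_side):
    """Round each coordinate to the nearest multiple of cell_side,
    i.e. snap the point to the nearest corner of its grid cell."""
    return tuple(round(x / cell_side) * cell_side for x in point)
```

For example, with target precision 0.3 the chosen cell side is 0.25, and the point (0.3, 0.74) snaps to the corner (0.25, 0.75).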
We will rely on the following fact about the intersection size of a grid and a ball; see, for example, [HPIM12].
For every , the number of points in at distance at most from any given point, is at most .
Fix a subtree $T \in \mathcal{T}$. With every node $v$ in $T$ we will now associate a surrogate $s(v)$, whose location approximates that of the center $x_{c(v)}$.
The surrogates are defined by induction on the ingresses.
Induction base: For the root $r$ of $T$ we set $s(r)$ to be the grid quantization of its center $x_{c(r)}$, at the appropriate precision.
Induction step: For a non-root $v$ we denote by $\eta(v)$ the quantized displacement of the center $x_{c(v)}$ from its ingress's surrogate, and set $s(v) = s(\mathrm{in}(v)) + \eta(v)$.
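The inductive definition can be sketched as follows; the quantizer, centers and ingresses are passed in as assumed stand-ins for the corresponding sketch components:

```python
def assign_surrogates(nodes, root, center, ingress, snap):
    """Assign a surrogate to every node: the root gets its snapped
    center, and every other node gets its ingress's surrogate plus the
    snapped displacement of its own center from that surrogate.
    `nodes` must be in top-down order so the ingress is always set."""
    s = {root: snap(center[root])}
    for v in nodes:
        if v == root:
            continue
        u = ingress[v]
        # quantized displacement of v's center from the ingress surrogate
        disp = tuple(c - w for c, w in zip(center[v], s[u]))
        s[v] = tuple(w + d for w, d in zip(s[u], snap(disp)))
    return s
```

With a quantizer of cell side 0.25, a root center at 0.3 gets surrogate 0.25, and a child center at 1.1 gets surrogate 0.25 + 0.75 = 1.0.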
For every node , . Furthermore if is a leaf of a subtree in , then .
3.3 Sketch Size
The sketch contains the tree $T^*$, with each node $v$ annotated by its center $c(v)$, ingress $\mathrm{in}(v)$, precision, and quantized displacement $\eta(v)$ (if applicable). For subtree roots we store their hashed surrogate, and for long edges we store their length. We also store the hash functions.
The total sketch size is
As a preprocessing step, Alice can reduce the dimension of her points by a Johnson-Lindenstrauss projection. She then augments the sketch with the projection, in order for Bob to be able to project his points as well. By [KMN11], the projection can be stored with a small number of bits. This yields the sketch size stated in Theorem 1.1.
Both the hash functions and the projection map can be sampled using public randomness. If one is only interested in the communication complexity, one can use the general reduction from public to private randomness due to [New91], which replaces the public coins by augmenting the sketch with a small number of additional bits (logarithmic in the size of Alice's input). The bounds in Theorems 1.1 and 1.2 then improve accordingly. However, that reduction is non-constructive; we state our bounds so as to describe explicit sketches.
4 Approximate Nearest Neighbor Search
We now describe our approximate nearest neighbor search query procedure, and prove Theorem 1.1. Suppose Bob wants to report a $(1+\epsilon)$-approximate nearest neighbor in $X$ for a query point $q$.
Algorithm Report Nearest Neighbor:
Start at the subtree $T$ that contains the root of $T^*$.
Recover all surrogates $s(v)$ for nodes $v$ in $T$, by the subroutine below.
Let $v$ be the leaf of $T$ that minimizes $\|q - s(v)\|$.
If $v$ is the head of a long edge, recurse on the subtree under that long edge. Otherwise $v$ is a leaf in $T^*$, and in that case return $x_{c(v)}$.
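The top-down loop can be outlined as follows, with the surrogate-recovery subroutine abstracted as a callback; all interfaces here are illustrative stand-ins, not the paper's exact data layout:

```python
import math

def nn_query(root_subtree, q, recover_surrogates, long_edge_child, center):
    """Descend through the subtrees of the compressed tree: in each
    subtree, recover the surrogates, pick the leaf whose surrogate is
    closest to q, and either descend through its long edge or report
    its stored center index."""
    T = root_subtree
    while True:
        surr = recover_surrogates(T, q)          # leaf -> recovered surrogate
        best = min(surr, key=lambda leaf: math.dist(q, surr[leaf]))
        if best in long_edge_child:              # head of a long edge: descend
            T = long_edge_child[best]
        else:                                    # true leaf of the whole tree
            return center[best]
```

A toy run with two subtrees: the query first selects the long-edge head, descends, and then returns the index of the closest leaf below it.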
Subroutine Recover Surrogates:
This is a subroutine that attempts to recover all surrogates in a given subtree $T$, using both Alice's sketch and Bob's point $q$.
Observe that to this end, the only information missing from the sketch is the root surrogate $s(r)$, which served as the induction base for defining the rest of the surrogates. The induction steps are fully defined by the centers, ingresses, precisions and quantized displacements, which are stored in the sketch for every node in the subtree. The missing root surrogate was defined as the grid quantization of the root's center. Instead, the sketch stores its hashed value and the hash function. (Note that fully storing the root surrogates is prohibitive: the grid has many cells, hence storing a cell ID takes many bits, and since there can be many subtree roots, this would blow up the total sketch size.)
The subroutine attempts to reverse the hash. It enumerates over all candidate grid points within the appropriate distance of $q$. For each candidate $z$ it computes its hash value. If it matches the stored hash, then it sets $s(r) = z$ and recovers all surrogates accordingly. If either no candidate, or more than one candidate, matches the stored hash, then it proceeds with $s(r)$ set to an arbitrary point (say, the origin in $\mathbb{R}^d$).
Let $r_1, \dots, r_k$ be the roots of the subtrees traversed by the algorithm. Note that they reside on a downward path in $T^*$.
Since , we have . ∎
Let be the smallest such that satisfies . (The algorithm does not identify , but we will use it for the analysis.)
With probability $1-\delta$, for every relevant traversed subtree simultaneously, the subroutine recovers the root surrogate correctly. (Consequently, all surrogates in those subtrees are also recovered correctly.)
Fix a subtree rooted in , that satisfies . Since (by Lemma 3.2), we have . Hence the surrogate recovery subroutine tries as one of the hash pre-image candidates, and will identify that matches the hash stored in the sketch. Furthermore, by Section 3.2, the number of candidates is at most . Since the range of has size , then with probability there are no collisions, and is recovered correctly. The lemma follows by taking a union bound over the first subtrees traversed by the algorithm, i.e. those rooted by for . Noting that is upper-bounded by the number of levels in the tree, , we get that all the ’s are recovered correctly simultaneously with probability . ∎
From now on we assume that the event in Lemma 4 succeeds, meaning that the algorithm recovers all surrogates in the traversed subtrees correctly. We henceforth prove that under this event, the algorithm returns a $(1+\epsilon)$-approximate nearest neighbor of $q$. In what follows, let $x^*$ be a fixed true nearest neighbor of $q$ in $X$.
Let $T$ be a traversed subtree whose root cluster contains $x^*$. Let $v$ be a leaf of $T$ that minimizes $\|q - s(v)\|$. Then either $x^* \in C(v)$, or every point of $C(v)$ is a $(1+\epsilon)$-approximate nearest neighbor of $q$.
Suppose w.l.o.g. by scaling that . If then we are done. Assume now that for a leaf of . Let . We start by showing that . Assume by contradiction this is not the case. Since is a subtree leaf and , we have by Lemma 3.1. We also have by Lemma 3.2. Together, . On the other hand, by the triangle inequality, . Noting that (by Lemma 3.1, since and are separated at level ), (by the contradiction hypothesis) and (by Lemma 3.2), we get . This contradicts the choice of .
Proof of Theorem 1.1.
We may assume w.l.o.g. that $\epsilon$ is smaller than a sufficiently small constant. Suppose that the event in Lemma 4 holds, hence all surrogates in the traversed subtrees are recovered correctly. We consider two cases. In the first case, . Let be the smallest such that . By applying Lemma 4 on , we have that every point in is a -approximate nearest neighbor of . After reaching , the algorithm would return the center of some leaf reachable from , and it would be a correct output.
In the second case, . We will show that every point in is a -approximate nearest neighbor of , so once again, once the algorithm arrives at it can return anything. By Lemma 3.1, every satisfies
In particular, . By definition of we have . Combining the two yields . Combining this with eq. 6, we find that every satisfies , and hence (for ). Hence is a -nearest neighbor of .
The proof assumes the event in Lemma 4, which occurs with probability . By a union bound, the simultaneous success probability of the query points of Bob is as required. ∎
5 Distance Estimation
We now prove Theorem 1.2. To this end, we augment the basic sketch from Section 3 with additional information, relying on the following distance sketches due to [Ach01] (following [JL84]) and [KOR00].
Let $\epsilon, \delta > 0$. Let $d' = C \epsilon^{-2} \log(1/\delta)$ for a sufficiently large constant $C$. Let $A$ be a random $d' \times d$ matrix in which every entry is chosen independently and uniformly at random from $\{-1, +1\}$. Then for every $x, y \in \mathbb{R}^d$, with probability $1-\delta$, $\frac{1}{\sqrt{d'}} \|Ax - Ay\| = (1 \pm \epsilon) \|x - y\|$.
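A minimal sketch of such a discretized projection (i.i.d. $\pm 1$ entries, scaled by $1/\sqrt{d'}$; the dimensions below are illustrative, and the constants are not tuned to the lemma):

```python
import math
import random

def sign_projection(d_in, d_out, seed=0):
    """A d_out x d_in matrix of i.i.d. +/-1 entries scaled by
    1/sqrt(d_out), which approximately preserves Euclidean norms
    in the sense of the lemma above."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) / math.sqrt(d_out) for _ in range(d_in)]
            for _ in range(d_out)]

def apply_projection(A, x):
    # ordinary matrix-vector product
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]
```

Projecting a unit vector from 1000 to 200 dimensions should leave its squared norm close to 1, up to the concentration error.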
Let $\epsilon, \delta > 0$ be fixed and let $r > 0$. There is a randomized map $sk$ of vectors in $\mathbb{R}^d$ into short bit strings, with the following guarantee. For every $x, y \in \mathbb{R}^d$, given $sk(x)$ and $sk(y)$, one can output the following with probability $1-\delta$:
If $\|x - y\|$ is within the designated range around $r$, output a $(1 \pm \epsilon)$-estimate of $\|x - y\|$.
If $\|x - y\|$ is below that range, output “Small”.
If $\|x - y\|$ is above that range, output “Large”.
We augment the basic sketch from Section 3 as follows. We sample a matrix from Lemma 5, with . In addition, for every level in the tree , we sample a map from Lemma 5, with . For every subtree root in , we store and in the sketch. Let us calculate the added size to the sketch:
Since has coordinates of magnitude each, has coordinates of magnitude each. Since there are subtree roots (cf. Lemma 3.3), storing for every adds bits to the sketch. In addition we store the matrix , which takes bits to store, which is dominated by the previous term.
By Lemma 5, each adds bits to the sketch, and as above there are of these. In addition we store the map for every . Each map takes bits to store.
In total, we get the sketch size stated in Theorem 1.2. Next we show how to compute all distances from a new query point $q$.
Given the sketch, an index $i$ of a point in $X$, and a new query point $q$, the algorithm needs to estimate $\|q - x_i\|$ up to $1 \pm \epsilon$ distortion. It proceeds as follows.
Perform the approximate nearest neighbor query algorithm from Section 4. Let $r_1, \dots, r_k$ be the downward sequence of subtree roots traversed by it.
For each traversed root, estimate from the sketch whether $q$ is close to its center at the scale of its level. This can be done by Lemma 5, since the sketch stores the [KOR00] sketch of the center and also the map, with which we can compute the corresponding sketch of $q$.
Let $r_j$ be the deepest traversed root whose cluster contains $x_i$.
(In words, $r_j$ is the root of the subtree in which $q$ and $x_i$ “part ways”.)
If the closeness test at $r_j$ fails, return the [JL84] estimate of the distance from $q$ to the center of $r_j$. Note that the projection and the projected centers are stored in the sketch.
Otherwise, let $u$ be the bottom node on the downward path from $r_j$ to $\mathrm{leaf}(x_i)$ that does not traverse a long edge. Return $\|q - s(u)\|$.
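The two-case estimate can be outlined as follows; every argument is an assumed stand-in for a sketch component rather than the paper's exact interface:

```python
import math

def estimate_distance(q, i, roots, close_to_q, proj_q, proj_center, surrogate_for):
    """Walk the traversed subtree roots (each a dict with a 'cluster'
    index set) and locate the deepest one whose cluster still contains
    point i. If q parted ways while far (closeness test fails), the
    projected center distance already approximates d(q, x_i); otherwise
    use the surrogate of i recovered inside that subtree."""
    j = max(k for k, u in enumerate(roots) if i in u['cluster'])
    u = roots[j]
    if not close_to_q(u):
        # Case I: far split -> Johnson-Lindenstrauss estimate to the center
        return math.dist(proj_q, proj_center(u))
    # Case II: nearby split -> distance to the recovered surrogate
    return math.dist(q, surrogate_for(u, i))
```

A toy run: a point that parts ways at the (far) top root is estimated via the projected center, while a point that stays with the query until a nearby root is estimated via its surrogate.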
Fix a query point $q$. Define the “good event” as the intersection of the following:
For every subtree root traversed by the query algorithm above, the invocation of Lemma 5 on and succeeds in deciding whether . Specifically, this ensures that for every , and . Recalling that we invoked the lemma with , we can take a union bound and succeed in all levels simultaneously with probability .
. By Lemma 5 this holds with probability .
Altogether, occurs with probability .
Conditioned on occurring, with probability , Lemma 4 holds. Namely, the query algorithm correctly recovers all surrogates in the subtrees rooted by for .
Proof of Theorem 1.2.
Let denote the event in which both occurs and the conclusion of Lemma 4 occurs. By the above lemma, happens with probability . From now on we will assume that occurs, and conditioned on this, we will show that the distance from to any data point can be deterministically estimated correctly. To this end, fix and suppose our goal is to estimate . Let and be as defined by the distance query algorithm above. We handle the two cases of the algorithm separately.
Case I: . This means . By Lemma 3.1 we have . By the occurrence of we have . Together, . This means that is a good estimate for . Since occurs, it holds that , hence is also a good estimate for , and this is what the algorithm returns.
Case II: . Let be the subtree rooted by . By the occurrence of , all surrogates in are recovered correctly, and in particular is recovered correctly. By Lemma 3.2 we have , and by Lemma 3.1 (noting that by choice of ) we have . Together, .
Let be the leaf in that minimizes (over all leaves of ). Equivalently, is the top node of the long edge whose bottom node is . Let . By choice of we have , hence the centers of these two leaves are separated already at level , hence by Lemma 3.1. By two applications of Lemma 3.2 we have and . Together, . Since is closer to than to (by choice of ), we have . Combining this with , which was shown above, yields . Therefore, , which means is a good estimate for , and this is what the algorithm returns.
Conclusion: Combining both cases, we have shown that for any query point , all distances from to can be estimated correctly with probability . Taking a union bound over queries, and scaling and appropriately by a constant, yields the theorem. ∎
This work was supported by grants from the MITEI-Shell program, an Amazon Research Award, and a Simons Investigator Award.
- [Ach01] Dimitris Achlioptas, Database-friendly random projections, Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, ACM, 2001, pp. 274–281.
- [AI17] Alexandr Andoni and Piotr Indyk, Nearest neighbors in high-dimensional spaces, CRC Handbook of Discrete and Computational Geometry (2017).
- [CW79] J Lawrence Carter and Mark N Wegman, Universal classes of hash functions, Journal of computer and system sciences 18 (1979), no. 2, 143–154.
- [Efr17] A. Efros, How to stop worrying and learn to love nearest neighbors, https://nn2017.mit.edu/wp-content/uploads/sites/5/2017/12/Efros-NIPS-NN-17.pdf (2017).
- [GSG17] Chirag Gupta, Arun Sai Suggala, Ankit Goyal, Harsha Vardhan Simhadri, Bhargavi Paranjape, Ashish Kumar, Saurabh Goyal, Raghavendra Udupa, Manik Varma, and Prateek Jain, Protonn: Compressed and accurate knn for resource-scarce devices, International Conference on Machine Learning, 2017, pp. 1331–1340.
- [HPIM12] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani, Approximate nearest neighbor: towards removing the curse of dimensionality, Theory of Computing 8 (2012), no. 1, 321–350.
- [IRW17] Piotr Indyk, Ilya Razenshteyn, and Tal Wagner, Practical data-dependent metric compression with provable guarantees, Advances in Neural Information Processing Systems, 2017, pp. 2614–2623.
- [IW17] Piotr Indyk and Tal Wagner, Near-optimal (euclidean) metric compression, Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, 2017, pp. 710–723.
- [JDJ17a] Jeff Johnson, Matthijs Douze, and Hervé Jégou, Billion-scale similarity search with gpus, arXiv preprint arXiv:1702.08734 (2017).
- [JDJ17b] , Faiss: A library for efficient similarity search, https://code.facebook.com/posts/1373769912645926/faiss-a-library-for-efficient-similarity-search/ (2017).
- [JL84] William B Johnson and Joram Lindenstrauss, Extensions of lipschitz mappings into a hilbert space, Contemporary mathematics 26 (1984), no. 189-206, 1.
- [JW13] Thathachar S Jayram and David P Woodruff, Optimal bounds for johnson-lindenstrauss transforms and streaming problems with subconstant error, ACM Transactions on Algorithms (TALG) 9 (2013), no. 3, 26.
- [KMN11] Daniel Kane, Raghu Meka, and Jelani Nelson, Almost optimal explicit Johnson-Lindenstrauss families, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, Springer, 2011, pp. 628–639.
- [KOR00] Eyal Kushilevitz, Rafail Ostrovsky, and Yuval Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, SIAM Journal on Computing 30 (2000), no. 2, 457–474.
- [MWY13] Marco Molinaro, David P Woodruff, and Grigory Yaroslavtsev, Beating the direct sum theorem in communication complexity with implications for sketching, Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied Mathematics, 2013, pp. 1738–1756.
- [New91] Ilan Newman, Private vs. common random bits in communication complexity, Information processing letters 39 (1991), no. 2, 67–71.
- [SDI06] Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk, Nearest-neighbor methods in learning and vision: theory and practice (neural information processing), The MIT press, 2006.
- [WLKC16] Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang, Learning to hash for indexing big data: a survey, Proceedings of the IEEE 104 (2016), no. 1, 34–57.
Appendix A Deferred Proofs from Section 3
Proof of Lemma 3.1.
Charging the degree-$1$ nodes along every maximal $1$-path to its bottom (non-degree-$1$) node, the total number of nodes after top-out compression is bounded by
[IW17] show this is at most . The difference is that their compression replaces summands larger than by zero, while our (top-out) compression trims them to . ∎
Proof of Lemma 3.1.
By top-out compression, is the top of a downward $1$-path of length whose bottom node is . Since no clusters are joined along a $1$-path, we have , hence and hence . Noting that and rearranging, we find , which yields the claim. ∎
Proof of Lemma 3.1.
If is a leaf in then is a singleton cluster, hence . Otherwise is the top node of a long edge, and the claim follows by Lemma 3.1 on the bottom node of that long edge. ∎
Proof of Lemma 3.2.
The first part of the lemma (where is any node, not necessarily a subtree leaf) is proved by induction on the ingresses. In the base case we use that