On the I/O complexity of the k-nearest neighbor problem

We consider static, external memory indexes for exact and approximate versions of the k-nearest neighbor (k-NN) problem, and show new lower bounds under a standard indivisibility assumption:

- Polynomial space indexing schemes for high-dimensional k-NN in Hamming space cannot take advantage of block transfers: Ω(k) block reads are needed to answer a query.

- For the ℓ_∞ metric the lower bound holds even if we allow c-approximate nearest neighbors to be returned, for c ∈ (1, 3).

- The restriction to c < 3 is necessary: for every metric there exists an indexing scheme in the indexability model of Hellerstein et al. using space O(kn), where n is the number of points, that can retrieve k 3-approximate nearest neighbors using k/B I/Os, which is optimal.

- For specific metrics, data structures with better approximation factors are possible. For k-NN in Hamming space and every approximation factor c > 1 there exists a polynomial space data structure that returns k c-approximate nearest neighbors in k/B I/Os.

To show these lower bounds we develop two new techniques: First, to handle the fact that approximation algorithms have more freedom in deciding which result set to return, we develop a relaxed version of the λ-set workload technique of Hellerstein et al. This technique allows us to show lower bounds that hold in d ≥ n dimensions. To extend the lower bounds down to d = O(k log(n/k)) dimensions, we develop a new deterministic dimension reduction technique that may be of independent interest.


1. Introduction

The DEEP1B data set (Babenko and Lempitsky, 2016) is among the largest image data sets that have been examined in the similarity search literature. From each of 10^9 images, a 96-dimensional vector has been extracted from an intermediate layer of a pre-trained deep neural network, a state-of-the-art method for obtaining semantically meaningful feature vectors (Babenko et al., 2014). Such feature vectors can be thought of as compressed representations of the images that, for example, can be used to estimate the similarity of two images. In many use cases, though, it is not enough to substitute the images with their feature vectors; we also need to be able to access the corresponding images. Though the size of the raw image data behind DEEP1B is not stated in (Babenko and Lempitsky, 2016), an estimate would be 1 MB per image on average, or 1000 TB in total. Clearly, retrieving similar images in a data set of this size is beyond what is possible on a single machine, and even just indexing the set of feature vectors would require an amount of internal memory larger than what is present in most servers.

k-nearest neighbors

In the k-nearest neighbors (k-NN) problem we want to construct a data structure on a set P of n points in some metric space that, given an integer k and a query point q, finds the k closest points to q in P. We will be focusing on data structures that can be constructed in polynomial time and space. The k-NN problem is believed to be hard in high dimensions even for k = 1, and the brute-force algorithm that considers all data points is essentially optimal. In particular, Williams (see (Alman and Williams, 2015)) proved that for every constant δ > 0, no data structure with query time O(n^{1−δ}) is possible for ω(log n)-dimensional Hamming space, assuming the Strong Exponential Time Hypothesis.

Because of the hardness of the problem, most research has revolved around approximate solutions. The c-approximate k-NN (c-k-NN) problem asks to return k points from P with distance at most c·r_k to q, where r_k is the distance from q to its kth nearest neighbor. It is known that c-approximate nearest neighbor search is equivalent, up to polylogarithmic factors, to the simpler near neighbor problem: given an upper bound r on the distance to the nearest neighbor, return a point within distance cr (Har-Peled et al., 2012). We refer to (Andoni et al., 2018) for more background on recent developments in approximate near neighbor search.
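To fix notation, the following minimal Python sketch (ours, not from the literature) states the exact problem and the acceptance predicate for c-k-NN; brute force stands in for an actual data structure.

```python
# Minimal sketch of exact k-NN and the validity predicate for c-k-NN.
# All names are illustrative; brute force stands in for an index.
from typing import Callable, List, Sequence

def knn(P: Sequence, q, k: int, dist: Callable) -> List:
    """The k points of P closest to q, by brute force."""
    return sorted(P, key=lambda x: dist(q, x))[:k]

def is_valid_ck_nn_answer(P, q, k: int, c: float, dist: Callable, answer) -> bool:
    """A valid c-k-NN answer: k points, each within c times the distance
    from q to its k-th nearest neighbor."""
    r_k = dist(q, knn(P, q, k, dist)[-1])
    return len(answer) == k and all(dist(q, p) <= c * r_k for p in answer)

hamming = lambda x, y: sum(a != b for a, b in zip(x, y))  # example metric
```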

Models of computation

Motivated by large-scale similarity search applications we consider models of hardware aimed at massive data sets. The external memory model (Aggarwal and Vitter, 1988) abstracts modern, block-oriented storage: memory consists of blocks, each capable of holding B data items. The cost of an algorithm or data structure is measured in terms of the number of block accesses, referred to as I/Os. When considering the k-NN problem we let B denote the number of vectors that fit in a block.

Distributing similarity search onto many machines has also been considered as a way of scaling to large data sets (Bahmani et al., 2012; Muja and Lowe, 2014; Hu et al., 2019). We can interpret a static, external memory data structure as an abstraction of a large, distributed system in which each server holds B data items (and information associated with them). In this context the parameter B may be relatively large. For example, to store the 10^9 DEEP1B vectors of dimension 96, and associated raw data, we could imagine, say, 10^6 servers each holding 10^4 data items (so that on average each item is replicated 10 times to achieve redundancy). The number of I/Os needed to answer a query then equals the number of servers that need to be involved when answering a k-NN query.

Most lower bounds for I/O-efficient algorithms are shown under the assumption that data items are indivisible, in the sense that they are treated as atomic units and that a block contains a well-defined set of at most B data items. The indexability model (Hellerstein et al., 2002, 1997), introduced at PODS ’97, formalized external memory data structures for queries that return a set of data items under the indivisibility assumption. For a given data structure, the complexity of a query q is the smallest number of blocks that must be read to answer q, in the worst case over all queries. There does not need to be a constructive procedure for identifying the correct blocks. In particular, the nearest neighbor problem (k = 1) is trivial in the indexability model since the block containing the answer to the query is given for free; the search aspect is completely removed from consideration, and the algorithm would return the block in one I/O. Though the original indexability model does not accommodate notions of approximation, it can be naturally extended to the setting where there is a set of at least k elements that are valid answers to a query and we are required to return any k of them.

1.1. Our results

The complexity of general k-NN queries in the I/O model lies between two extremes:

  • There exist ⌈k/B⌉ blocks (or servers) that contain a set of valid answers to the query, and

  • No block (or server) contains more than one valid query answer, so k block reads are needed.

Since we do not care about constant factors we can, for simplicity, assume that n > 2k, since otherwise a trivial brute-force algorithm that reads all points is optimal within a factor of 2. We give several upper and lower bounds for c-k-NN that suggest a dichotomy for polynomial space data structures: depending on the metric space and the approximation factor, either it is possible to achieve O(k/B) I/Os, or Ω(k) I/Os are required.

Our results are summarized in Table 1. For our lower bounds, the dimension column states the minimum (asymptotic) number of dimensions required for the bounds to hold. Since it is possible to decrease the I/O complexity in lower dimensions (Streppel and Yi, 2011), a condition on the number of dimensions is needed. Our upper bounds do not depend on the number of dimensions, except indirectly through the definition of B as the number of vectors that fit in a block. We stress that I/O upper bounds for an indexing scheme do not imply the existence of a data structure with the same guarantees; for a data structure in the I/O model we would expect an additional search cost. Our main theorems are:

Metric | Dimension | Approximation factor | I/O bound

any | any | c = 3 | O(k/B) (upper bound)

ℓ_∞ | Ω(k log(n/k)) | c ∈ (1, 3) | Ω(k) (lower bound)

Hamming | Ω(k log(n/k)) | exact (c = 1) | Ω(k) (lower bound)

Hamming | any | c = 1 + ε | O(k/B) (upper bound)

Table 1. Our I/O lower and upper bounds on the c-k-NN problem for data structures using polynomial space, where ε > 0 can be any constant. The lower bound for the ℓ_∞ metric holds assuming c ∈ (1, 3).
Theorem 1.1 (L-infinity metric lower bound).

Any indexing scheme for c-k-NN in d-dimensional ℓ_∞ space with c ∈ (1, 3) and d = Ω(k log(n/k)), and with worst case query time of tk/B I/Os, where tk² ≤ n^{1−δ} for a constant δ > 0, must use n^{Ω(B/t)} blocks of space.

Theorem 1.2 (General metric indexing scheme).

Given a set of n points in any metric space, there exists a 3-approximate indexing scheme that uses O(kn/B) blocks of space (where B is the block size) and returns k 3-approximate nearest neighbors in optimal ⌈k/B⌉ I/Os.

Theorem 1.3 (Hamming metric lower bound).

Any indexing scheme for exact k-NN in d-dimensional Hamming space with d = Ω(k log(n/k)) and with worst case query time of tk/B I/Os, where tk² ≤ n^{1−δ} for a constant δ > 0, must use n^{Ω(B/t)} blocks of space.

Theorem 1.4 (Hamming indexing scheme).

Given a set of n points in d-dimensional Hamming space and a constant c > 1, there exists a c-approximate indexing scheme that uses a number of blocks polynomial in n and d and returns k c-approximate nearest neighbors in optimal ⌈k/B⌉ I/Os.

1.2. Related work

Lower bounds on nearest neighbors in restricted models

A well-known work of Berchtold et al. (Berchtold et al., 1997) analyzes the performance of certain types of nearest neighbor indexes on random data sets. More recently, Pestov and Stojmirović (Pestov and Stojmirović, 2006) and Pestov (Pestov, 2013) showed lower bounds for high-dimensional similarity search in a model of data structures that encompasses many tree-based indexing methods. These results do not consider approximation, and their algorithmic models do not encompass modern algorithmic approaches to approximate similarity search such as locality-sensitive hashing.

Data structure lower bounds based on indivisibility

There is a rich literature, starting with the seminal paper of Aggarwal and Vitter (Aggarwal and Vitter, 1988), giving lower bounds on I/O-efficiency under an indivisibility assumption. Such results in the context of data structures are known for dynamic dictionaries (Wei et al., 2009), planar point enclosure (Arge et al., 2009), range sampling (Hu et al., 2014), and many variants of orthogonal range reporting (Afshani, 2012; Afshani et al., 2009; Hellerstein et al., 2002; Larsen and Van Walderveen, 2013; Yi, 2009). Below we elaborate on the most closely related works on orthogonal range queries.

To the best of our knowledge the high-dimensional k-NN problem has not been explicitly studied in the indexability model (Hellerstein et al., 2002). However, in n-dimensional Hamming space it is straightforward to use the k-set workload technique of (Hellerstein et al., 2002) to show that even obtaining k/2 I/Os is not possible (for B = 2) unless the indexing scheme uses quadratic space. Our lower bound technique is a generalization of the k-set workload that allows us to deal with approximation as well as space usage larger than quadratic. It also allows us to show lower bounds all the way down to d = O(k log(n/k)) dimensions, as opposed to d = n dimensions.

Orthogonal range queries

Orthogonal range reporting in d dimensions asks to report all points in P lying inside a query range that is a cross product of d intervals. Note that k-NN in the ℓ_∞ metric is the special case of orthogonal range reporting where all intervals have the same length. Hellerstein et al. (Hellerstein et al., 2002) showed that in order to answer d-dimensional orthogonal range reporting queries in O(k/B) I/Os, the data structure needs to use space Ω(n (log n / log log n)^{d−1}). In particular, in dimension d = ω(log n / log log n) a polynomial space data structure cannot achieve O(k/B) I/Os.

Approximate d-dimensional range reporting has been studied in the I/O model: Streppel and Yi (Streppel and Yi, 2011) show that for a query rectangle R and constant ε > 0, allowing the data structure to also report points at distance up to ε · diam(R) from R makes it possible to report the points in O(k/B) I/Os (plus a logarithmic search cost, which does not apply to the indexability model). To the best of our knowledge, no lower bounds were known for approximate range queries before our work.

Lower bounds based on computational assumptions

Unconditional lower bounds for k-NN in the cell probe model (Barkol and Rabani, 2002; Borodin et al., 1999; Chakrabarti and Regev, 2004; Panigrahy et al., 2010) only match upper bounds in the regime where the space usage is very large. To better understand the complexity for, say, sub-quadratic space usage, a possibility is to base lower bounds on computational assumptions such as the Strong Exponential Time Hypothesis (SETH), or the weaker Orthogonal Vectors Hypothesis (OVH). Recently, Rubinstein (Rubinstein, 2018) showed that under either of these hypotheses, for each constant δ > 0 there is a constant ε > 0 such that achieving an approximation factor of 1 + ε is not possible for a data structure with polynomial space and construction time unless the query time exceeds n^{1−δ}. Already in 2001 Indyk showed that in the ℓ_∞ metric, nearest neighbor search with approximation factor c < 3 is similarly hard (Indyk, 2001). (Though Indyk links the lower bound to a different problem, it can be easily checked that the same conclusion follows from the more recent SETH and OVH assumptions.)

Upper bounds

The I/O complexity of the near neighbor problem was studied by Gionis et al. (Gionis et al., 1999), focusing on Hamming space. (Their approach extends to other spaces that have good locality-sensitive hash functions.) For every approximation factor c > 1, they show that O(n^{1/c}) I/Os suffice to retrieve one near neighbor, using a data structure of size subquadratic in n. It seems that the same algorithm can be adapted to return k near neighbors at an additional cost of O(k/B) I/Os. Tao et al. (Tao et al., 2010) extended these results to handle nearest neighbor queries, but they do not consider the case of k nearest neighbors.

For the Euclidean and Hamming metrics with constant approximation factor it is known how to get query time n^ρ, for a constant ρ < 1 depending on the approximation factor, for the near neighbor problem (k = 1) with a polynomial space data structure (see (Andoni et al., 2017) and its references). For sufficiently large approximation factors Kapralov (Kapralov, 2015) even showed how to achieve this by a single probe to the data structure, returning a pointer to the result.

Organization:

The rest of this paper is organized as follows. In Section 2 we develop notation and proceed in Section 2.1 to extend the indexability framework of (Hellerstein et al., 1997) to approximate indexing schemes. Section 3 describes the indexing scheme promised in Theorem 1.2, followed by Section 4, which proves lower bounds in the ℓ_∞ metric (Theorem 1.1). Section 5 contains our results on the Hamming metric (Theorem 1.3 and Theorem 1.4), followed by conclusion and open problems in Section 6.

2. Preliminaries

The external memory model of computation (due to Aggarwal and Vitter (Aggarwal and Vitter, 1988)) has a main memory of size M and an infinite external memory, both realized as arrays. Data is stored in external memory, and is transferred to/from main memory (where computation happens) in I/Os, or block transfers, where a block holds B data items. Computation in main memory is free; the cost of an algorithm is the number of I/Os it performs.

We will be using the following definitions from Hellerstein et al. (Hellerstein et al., 2002, 1997). For brevity we will refer to a subset of I of size B as a B-subset.

Definition 2.1 (Workload).

A workload consists of a non-empty set D (the domain), a non-empty finite set I ⊆ D (the instance, whose size we denote by n), and a set Q of subsets of I (the query set).

Definition 2.2 (Indexing Scheme).

An indexing scheme consists of a workload and a set of B-subsets of I (the blocks), where B is the block size of the indexing scheme.

Definition 2.3 (Cover set).

A cover set C(q) for a query q ∈ Q is a minimum-size subset of the blocks such that q is contained in the union of the blocks in C(q).

We will assume that the blocks are chosen such that C(q) exists for every query q ∈ Q.

Definition 2.4 (k-Set Workload).

The k-set workload is a workload with instance I = {1, …, n} whose query set is the set of all k-subsets of the instance.

While (Hellerstein et al., 2002, 1997) measure performance in terms of redundancy and access overhead, we find it more natural to define performance in the same way as in the I/O model. The space usage of an indexing scheme is the number of blocks, m. The query cost of q is the size of a cover set, |C(q)|. Observe that the cost of a query of size k can range from ⌈k/B⌉ I/Os to k I/Os, depending on how many blocks are needed to cover it.
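For concreteness, the following Python sketch (ours, not from (Hellerstein et al., 2002)) models blocks as sets and computes the query cost, i.e., the size of a minimum cover set, by brute force:

```python
# Query cost in the indexability model: the size of a minimum cover set,
# found by brute force over subsets of blocks. Illustrative sketch only.
from itertools import combinations

def query_cost(blocks, q):
    """Smallest number of blocks whose union contains the query set q."""
    q = set(q)
    for t in range(1, len(blocks) + 1):
        for cover in combinations(blocks, t):
            if q <= set().union(*cover):
                return t  # |C(q)| = t I/Os
    raise ValueError("no cover set exists for this query")

blocks = [{0, 1}, {2, 3}, {4, 5}, {1, 2}]   # B = 2, instance {0,...,5}
assert query_cost(blocks, {1, 2}) == 1      # covered by a single block
assert query_cost(blocks, {0, 3}) == 2      # needs two blocks: cost k = 2
```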

All static data structures we know of in the external memory model with a given space usage and query cost translate directly to an indexing scheme with the same, or better, performance. This is because these data structures store, or can be adapted to store, vectors explicitly in each block, such that the set of result vectors is explicitly present in the blocks that are read when answering a query. This means that any lower bounds in the indexability model strongly suggest lower bounds also in the external memory model.

2.1. Approximate indexing schemes

Definition 2.5 (c-approximate k-nearest neighbors problem, c-k-NN).

Let (X, dist) be a metric space with distance function dist, and let c ≥ 1 be a constant. Given P ⊆ X with |P| = n and a positive integer k ≤ n, construct a data structure that upon receiving a query q ∈ X returns k points p_1, …, p_k ∈ P such that dist(q, p_i) ≤ c · dist(q, x_k) for all i, where x_k is the kth nearest neighbor of q in P.

We will consider indexing schemes for c-k-NN that can depend on the parameters k and c, i.e., the parameters are known when the data structure is constructed. The set of points to be stored is denoted by P. Observe that for exact k-NN, there can be at most (n choose k) distinct query answers (as in the k-set workload above). Intuitively, the more queries there are in a workload, the higher is the space needed by the data structure. But in the c-k-NN problem, for a query q there is no one "right" answer, as any set of k approximate near neighbors forms a valid output. Thus we find that the definitions in the original indexability model need to be extended to capture approximate data structures.

Definition 2.6 (Relaxed k-set workload).

The relaxed k-set workload is a workload whose instance is I = {1, …, n}. The query set is the set of all subsets of I of size k. Given a query corresponding to such a k-subset S, the indexing scheme must report k elements, at least half of which must come from S.
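In code, the relaxed requirement is just the following predicate (a trivial sketch, ours):

```python
# A reported k-set A is a valid answer to the relaxed query S if at least
# half of the reported elements come from S.
def is_valid_relaxed_answer(S: set, A: set, k: int) -> bool:
    return len(A) == k and 2 * len(A & S) >= k
```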

We next show a space-I/O tradeoff for the relaxed workload.

Lemma 2.7 (Relaxed set workload tradeoff).

Any indexing scheme for the relaxed k-set workload with a space usage of m blocks and a query cost of t block accesses must have

m ≥ (n / (2ektB))^{k/(2t)}.

Proof.

Recall that by the relaxation, the indexing scheme only needs to find a subset of S of size k/2 in the index. In other words, with each k/2-subset that can be retrieved from the index, the algorithm can add any k/2 elements and arrive at a valid answer to a query S.

We upper bound the total number of distinct k-sets reported by an indexing scheme using m blocks of space and t query I/Os as follows:

  • choose the set of t blocks to retrieve from the index ((m choose t) choices),

  • choose the k/2 elements to use from these at most tB distinct elements (at most (tB choose k/2) choices),

  • choose k/2 arbitrary other elements (at most n^{k/2} choices).

The total number of such combinations should be at least (n choose k), which is the number of possible queries, or k-subsets. Using the inequalities (a/b)^b ≤ (a choose b) ≤ (ea/b)^b this gives us:

(1)   m^t · (2etB/k)^{k/2} · n^{k/2} ≥ (n choose k) ≥ (n/k)^k,

which rearranges to m^t ≥ (n/(2ektB))^{k/2}, and hence m ≥ (n/(2ektB))^{k/(2t)}. ∎
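For intuition, the following Python fragment (parameters ours; the bound is the one derived above) evaluates the tradeoff for concrete values:

```python
# Evaluate the space lower bound m >= (n / (2*e*k*t*B))**(k/(2*t)) from
# Lemma 2.7: super-polynomial for small t, trivial as t approaches k.
from math import e

def space_lower_bound(n, k, B, t):
    return (n / (2 * e * k * t * B)) ** (k / (2 * t))

n, k, B = 10**9, 100, 100
for t in [1, 10, 100]:   # from a single I/O (= ceil(k/B) here) up to t = k
    print(f"t = {t:4d}: m >= {space_lower_bound(n, k, B, t):.3g}")
```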

One may be interested in a query time of tk/B I/Os, where t ≥ 1 measures the overhead relative to the optimal ⌈k/B⌉ I/Os. Substituting this query time in Lemma 2.7 gives:

Corollary 2.8.

Any indexing scheme for the relaxed k-set workload with worst case query time of tk/B I/Os, where t ≥ 1, must use (n/(2etk²))^{B/(2t)} = n^{Ω(B/t)} blocks of space (the last bound assuming tk² ≤ n^{1−δ} for a constant δ > 0).

This lower bound is essentially tight: achieving a query time of ⌈k/B⌉ I/Os using n^{O(k)} blocks of space is easy by just preprocessing the answers to all (n choose k) possible queries. By doing an analogous calculation for the exact workload, we are able to give the following tradeoff for the standard k-set workload.

Lemma 2.9 (k-set workload tradeoff).

Any indexing scheme for the k-set workload with worst case query time of tk/B I/Os, where t ≥ 1, must use (n/(etk))^{B/t} blocks of space.

Lemma 2.9 generalizes Theorem 7.2 in (Hellerstein et al., 2002), which considered the case t = 1.

3. A 3-approximation indexing scheme for general metrics

We prove Theorem 1.2 in this section, which asserts that a relatively simple indexing scheme provides a 3-approximation for the k-NN problem in any metric space. Note that we are only presenting an indexing scheme as opposed to a data structure; i.e., we assume that once the query is given, an oracle provides the smallest set of blocks that contain a valid answer to the query.

Let P be a set of n points in a metric space. Consider the indexing scheme that consists of, for each x ∈ P, the ⌈k/B⌉ blocks where we store x and the set kNN(x) of the k nearest neighbors of the point x (including x itself). This requires ⌈k/B⌉ blocks per element of P, i.e., O(kn/B) blocks in total, as claimed. For a query point q let x* be the nearest neighbor of q in P. The oracle then returns the set kNN(x*), i.e., the k nearest neighbors of q's nearest neighbor, using ⌈k/B⌉ I/Os.
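The scheme is short enough to sketch in code; in the following Python fragment (names ours) brute-force search stands in for the oracle's free block lookup:

```python
# Sketch of the 3-approximate indexing scheme: each point x stores block(s)
# holding kNN(x); the oracle answers q with kNN(x*), where x* is q's
# nearest neighbor in P. Points must be hashable (e.g., tuples).
def build_index(P, k, dist):
    """For each x in P, the set kNN(x) of its k nearest neighbors
    (including x itself), occupying ceil(k/B) blocks per point."""
    return {x: tuple(sorted(P, key=lambda y: dist(x, y))[:k]) for x in P}

def oracle_answer(index, P, q, dist):
    x_star = min(P, key=lambda x: dist(q, x))  # q's nearest neighbor
    return index[x_star]                       # read in ceil(k/B) I/Os
```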

Theorem 3.1.

Let N_q be the set of the k exact nearest neighbors of a query q, and let kNN(x*) be the set of points returned by the indexing scheme described above. Then for any p ∈ kNN(x*), dist(q, p) ≤ 3 · max_{z ∈ N_q} dist(q, z). That is, all the returned points are within a factor 3 of the distance of the query to its kth nearest neighbor.

Proof.

The proof is a case analysis. Let D_q be the smallest ball centered at q that contains N_q (the "k nearest neighbor disk" of q), let r be its radius, and let p ∈ kNN(x*) be any returned point. Note that dist(q, x*) ≤ r since x* ∈ N_q. The cases are:

  1. p is inside D_q. Then dist(q, p) ≤ r and there is nothing to prove.

  2. All points of kNN(x*) are inside D_q. Then all k reported points are at distance at most r from q, i.e., we have reported exact nearest neighbors of q (up to ties at distance r).

  3. p is outside D_q. Every z ∈ N_q satisfies dist(x*, z) ≤ dist(x*, q) + dist(q, z) ≤ 2r, so x* has at least k points of P within distance 2r, and hence dist(x*, p) ≤ 2r for every p ∈ kNN(x*). By the triangle inequality, dist(q, p) ≤ dist(q, x*) + dist(x*, p) ≤ r + 2r = 3r. ∎

Tightness of the analysis:

Next, we give an example of points where the indexing scheme above does not achieve a better approximation factor than 3, for arbitrary k ≥ 2. Consider the scenario when all points, including the query, lie on a line: let q = 0, let x* = 1, place k − 1 points at −1 − δ, and place k − 1 points at 3 − δ, for a small δ > 0.

The disk D_q has radius 1 + δ and contains all points but those at 3 − δ. The nearest neighbor of q is x*, and the k nearest neighbors of x* are x* itself and the points at 3 − δ (at distance 2 − δ from x*, compared to 2 + δ for the points at −1 − δ). Hence kNN(x*) will be reported, and dist(q, p)/r = (3 − δ)/(1 + δ) → 3 as δ → 0.

4. Lower bound for L-infinity metric

In this section we prove Theorem 1.1. We warm up with a lower bound in d = n dimensions, using a slight variant of a reduction of Indyk (Indyk, 2001).

Lemma 4.1.

Consider the c-approximate k-NN problem in R^n with the ℓ_∞ metric, for c < 3. There exists a set P of n points in d = n dimensions that is a k-set workload, i.e., for every k-subset S ⊆ P there exists a query point y_S such that the ℓ_∞-distance is 1/2 to the points of S and 3/2 to all other points.

Hence, c-k-NN for c < 3 in n-dimensional ℓ_∞ space leads to a k-set workload.

Proof.

The set P consists of the n unit vectors e_1, …, e_n (where only the ith entry of e_i is 1 and all other entries are 0). For a set S the query vector y_S is defined by (y_S)_i = 1/2 for i ∈ S and (y_S)_i = −1/2 for i ∉ S. For i ∈ S every coordinate of y_S − e_i has absolute value 1/2, so the distance is 1/2; for i ∉ S the ith coordinate of y_S − e_i has absolute value 3/2. Since c/2 < 3/2, only the points of S are valid answers. ∎
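The construction can be checked mechanically; the following snippet (ours) verifies both distances on a small instance:

```python
# Verify Lemma 4.1 on a small instance: unit vectors e_1..e_n and queries
# with coordinates +1/2 on S and -1/2 elsewhere give l_inf distances 1/2
# (points in S) and 3/2 (points outside S).
from itertools import combinations

n, k = 8, 3
P = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
linf = lambda x, y: max(abs(a - b) for a, b in zip(x, y))

for S in combinations(range(n), k):
    y_S = [0.5 if i in S else -0.5 for i in range(n)]
    assert all(linf(P[i], y_S) == (0.5 if i in S else 1.5) for i in range(n))
```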

To deterministically reduce the dimensionality of the space we use an expander and switch to relaxed k-set workloads. Expanders were previously used for deterministic embeddings of Euclidean space into ℓ_1 by Indyk (Indyk, 2007). There is a vast literature on expanders, and the results we are using are standard by now. For the sake of concreteness, we take the definitions and precise results almost literally from (Östlin and Pagh, 2002). We define (s, ε)-expander graphs and state some results concerning these graphs. For the rest of this paper we will assume k to divide n, as this makes statements and proofs simpler. This will be without loss of generality, as the statements we show do not change when rounding n down to the nearest such value. Let G = (L, R, E) be a bipartite graph with left vertex set L, right vertex set R, and edge set E. We denote the set of neighbors of a set S ⊆ L by Γ(S), and use Γ(i) as a shorthand for Γ({i}), i ∈ L.

Definition 4.2 (Definition 3 of (Östlin and Pagh, 2002)).

A bipartite graph is d-regular if the degree of all nodes in L is d. A bipartite d-regular graph is an (s, ε)-expander if for each S ⊆ L with |S| ≤ s it holds that |Γ(S)| ≥ (1 − ε)d|S|.

Lemma 4.3 (Corollary 5 in (Östlin and Pagh, 2002)).

For every constant ε > 0 and every s ≤ n there exists an (s, ε)-expander with left vertex set of size n, degree d = O(log(n/s)), and right vertex set of size O(sd).

Next, we discuss how to give an analogue of the hard query set in Lemma 4.1 with O(k log(n/k)) dimensions.

Lemma 4.4.

Let k ≤ n be arbitrary integer parameters. Consider c-k-NN with c < 3. There exists a set P of n points in dimension O(k log(n/k)) such that for any S ⊆ P with |S| = k there exists a query point y_S such that the set A of potential answer points (points within distance c/2 of y_S) has S ⊆ A and |A| ≤ 3k/2.

Proof.

Fix k and c < 3. Let ε = 1/3 and let G be a (2k, ε)-expander with degree d and right vertex set of size D, which exists by Lemma 4.3. For concreteness we take d = O(log(n/k)) and D = O(kd) = O(k log(n/k)).

Construct the set of points P = {x_1, …, x_n} ⊆ {0, 1}^D where x_i is the characteristic vector of Γ(i).

Define the query point y_S for a set S with |S| = k by (y_S)_v = 1/2 if v ∈ Γ(S), and (y_S)_v = −1/2 otherwise.

It is easy to see that dist_∞(y_S, x_i) = 1/2 for all i ∈ S, so y_S has at least k neighbors at distance 1/2. It remains to show that this leads to a relaxed k-set workload, i.e., that for any S any set of 1/2-near points in the set P has at least k/2 points in common with S. Fix a subset S and consider the query point y_S. Let A be the points of the set P with distance at most 1/2 to y_S. Observe that S ⊆ A by construction. Let U = A \ S be the set of "unintended near points". Observe that every point x_i that does not fulfill Γ(i) ⊆ Γ(S) has dist_∞(y_S, x_i) = 3/2 and cannot be reported, since c/2 < 3/2. Hence we consider U = {i ∉ S : Γ(i) ⊆ Γ(S)}.

Observe Γ(S ∪ U) = Γ(S) and |Γ(S)| ≤ dk, by the definition of U and by the d-regularity of G. Combining this with the expansion property (note that |S ∪ U| ≤ 2k, since otherwise already a 2k-size subset of S ∪ U would violate expansion), we get (1 − ε) · d · |S ∪ U| ≤ |Γ(S ∪ U)| = |Γ(S)| ≤ dk, or |S ∪ U| ≤ k/(1 − ε) = 3k/2. That is, |A| ≤ 3k/2, as desired. Hence we have a relaxed k-set workload. ∎
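The construction translates directly into code. In the following sketch (ours) a random d-regular bipartite graph stands in for the expander of Lemma 4.3; with a random graph the unintended set is typically small or empty, whereas the expander makes the bound |U| ≤ k/2 deterministic:

```python
# Lemma 4.4 construction with a random bipartite graph as stand-in expander:
# x_i is the indicator vector of Gamma(i); the query for S has +1/2 on
# Gamma(S) and -1/2 elsewhere. A point i is 1/2-near iff Gamma(i) is a
# subset of Gamma(S); for i not in S these are the unintended near points U.
import random

n, k, d, D = 200, 8, 10, 400      # D = O(k log(n/k)) in the actual proof
Gamma = [frozenset(random.sample(range(D), d)) for _ in range(n)]

S = set(random.sample(range(n), k))
Gamma_S = frozenset().union(*(Gamma[i] for i in S))

A = [i for i in range(n) if Gamma[i] <= Gamma_S]   # all 1/2-near points
U = [i for i in A if i not in S]                   # unintended near points
assert S <= set(A)
print("unintended near points:", len(U))           # expansion forces <= k/2
```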

This means that we can apply Lemma 2.7 and Corollary 2.8:

Corollary 4.5 (Theorem 1.1).

Any indexing scheme for c-k-NN in O(k log(n/k))-dimensional ℓ_∞ space with c < 3 and with worst case query time of tk/B I/Os, where t ≥ 1, must use (n/(2etk²))^{B/(2t)} = n^{Ω(B/t)} blocks of space (the last bound assuming tk² ≤ n^{1−δ} for a constant δ > 0).

5. Bounds in the Hamming metric

5.1. Lower Bound for Hamming metric

We now prove Theorem 1.3, giving a lower bound on indexes on sets of vectors in Hamming space with approximation factor c = 1 + Ω(1/k). This directly implies a lower bound in the ℓ_1 metric, as well as a lower bound for the ℓ_2 metric with approximation factor √c.

Lemma 5.1.

Consider the c-approximate k-NN problem in Hamming space of dimension n. There exists a set of n points, of dimension n, that is a k-set workload, i.e., for every k-subset S there exists a query point y_S such that the Hamming (or ℓ_1) distance is k − 1 to the points of S and k + 1 to all other points.

Hence, c-k-NN for c < 1 + 2/(k − 1) in n-dimensional Hamming space leads to a k-set workload.

Proof.

The set P consists of the n unit vectors e_1, …, e_n. The query vector y_S for a set S is the characteristic vector of S. It is easy to verify that the distance from the query to the vectors in S is k − 1, and to those not in S is k + 1. Note that (k + 1)/(k − 1) = 1 + 2/(k − 1), which gives the bound on c. ∎
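As with Lemma 4.1, this is simple to check mechanically (snippet ours):

```python
# Verify Lemma 5.1 on a small instance: the characteristic vector of S has
# Hamming distance k-1 to unit vectors e_i with i in S, and k+1 otherwise.
from itertools import combinations

n, k = 8, 3
hamming = lambda x, y: sum(a != b for a, b in zip(x, y))

for S in combinations(range(n), k):
    y_S = [1 if i in S else 0 for i in range(n)]
    for i in range(n):
        e_i = [1 if j == i else 0 for j in range(n)]
        assert hamming(e_i, y_S) == (k - 1 if i in S else k + 1)
```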

Lemma 5.2.

Consider the c-approximate k-NN problem in Hamming space of dimension D = O(k log(n/k)). There exists a set of n points x_1, …, x_n ∈ {0, 1}^D, and a c = 1 + Ω(1/k), that is a relaxed k-set workload, i.e., for every k-subset S there exists a query point y_S such that the Hamming (ℓ_1) distance is

dist(y_S, x_i) = r for all i ∈ S, and dist(y_S, x_i) > cr for all but at most k/2 of the i ∉ S,

for an r with r ≤ d(k − 1). Hence, this leads to a relaxed k-set workload for c-k-NN in O(k log(n/k))-dimensional Hamming space.

Proof.

Fix k and S. Let G be a (2k, ε)-expander with degree d = O(log(n/k)) and right vertex set of size D = O(kd), which exists by Lemma 4.3. For concreteness we take ε = 1/4.

Construct the set of points in the same way as earlier: x_i ∈ {0, 1}^D is the characteristic vector of Γ(i).

We define the query point y_S for each set S with |S| = k as the characteristic vector of Γ(S). In other words, the vectors are the characteristic vectors of the neighbor-sets in G.

For i ∈ S we have Γ(i) ⊆ Γ(S). Observe that

dist(y_S, x_i) = |Γ(S)| − d =: r.

For any i ∉ S we have dist(y_S, x_i) = |Γ(S)| + d − 2|Γ(i) ∩ Γ(S)| = r + 2|Γ(i) \ Γ(S)|, leading to the definition of the candidate set of unintended near neighbors

U = {i ∉ S : |Γ(i) \ Γ(S)| ≤ d/4}.

We can set the distance threshold ratio c such that the set of unintended near neighbors is contained in U: since r ≤ d(k − 1), every c ≤ 1 + 1/(2(k − 1)) satisfies cr ≤ r + d/2, while every i ∉ S ∪ U has dist(y_S, x_i) > r + d/2 and hence cannot be reported.

To calculate the number of unintended neighbors, observe that each i ∈ U adds at most d/4 new right vertices, so

|Γ(S ∪ U)| ≤ |Γ(S)| + |U| · d/4 ≤ dk + |U| · d/4,

leading, together with the expansion property |Γ(S ∪ U)| ≥ (1 − ε) d (k + |U|) = (3/4) d (k + |U|) (note |S ∪ U| ≤ 2k, since otherwise already a 2k-size subset would violate expansion), to

(3/4)(k + |U|) ≤ k + |U|/4.

Hence

|U| ≤ k/2.

This means that the described workload is a relaxed k-set workload. Applying Lemma 2.7 and Corollary 2.8 now completes the proof of Theorem 1.3. ∎

5.2. Upper bounds (indexing scheme) for the Hamming metric

In this section we prove Theorem 1.4. For any given approximation factor c > 1 we wish to construct an indexing scheme that answers c-k-NN queries in d-dimensional Hamming space using polynomial space and ⌈k/B⌉ I/Os. Our construction is an application of the dimension reduction technique of Kushilevitz et al. (Kushilevitz et al., 2000).

For each r ∈ {1, …, d} we create a data structure that handles the case where the kth closest point to q is at distance r. The data structure must report k points that have distance at most cr from q. The central idea of (Kushilevitz et al., 2000) is to use a randomized mapping

φ_r : {0, 1}^d → {0, 1}^D,

where D = O(log n) for fixed c, such that for each q ∈ {0, 1}^d, with high probability, for all x ∈ P:

(2)   dist(q, x) ≤ r ⟹ dist(φ_r(q), φ_r(x)) < τD, and dist(q, x) > cr ⟹ dist(φ_r(q), φ_r(x)) ≥ τD,

for a suitable threshold τ = τ(c) ∈ (0, 1). (We note that the required dimension D grows as c approaches 1, hence we need to keep c fixed.) Consider the mapped multiset φ_r(P) and create a data structure that for each y ∈ {0, 1}^D lists, for the k nearest neighbors of y in φ_r(P), the corresponding vectors in P (breaking ties arbitrarily), using ⌈k/B⌉ blocks per list. If (2) holds then the list of y = φ_r(q) contains only c-approximate k-nearest neighbors of q. To eliminate the error probability, choose O(d) such random mappings and construct corresponding data structures: With high probability there will be no query that does not have at least one data structure that returns a correct result. If this fails for some q, start over from the beginning and choose new mappings.
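One standard way to realize such a mapping, sketched below in Python (parameters and names ours, not the exact construction of (Kushilevitz et al., 2000)), lets each output bit be the parity of a random subset of coordinates, each coordinate included with probability about 1/(2r); pairs at distance at most r then disagree on noticeably fewer output bits than pairs at distance more than cr, and D = O(log n) output bits concentrate the gap:

```python
# Sketch of a KOR-style randomized mapping phi_r: {0,1}^d -> {0,1}^D.
# Each output bit is the parity of a random coordinate subset, with each
# coordinate kept with probability ~ 1/(2r); a threshold on the mapped
# Hamming distance then separates distances <= r from distances > c*r, whp.
import random

def make_phi(d, r, D, seed=0):
    rng = random.Random(seed)
    subsets = [[j for j in range(d) if rng.random() < 1.0 / (2 * r)]
               for _ in range(D)]
    def phi(x):
        return tuple(sum(x[j] for j in sub) % 2 for sub in subsets)
    return phi
```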

The total space usage is O(d² · 2^D · k/B) blocks (d distance scales times O(d) mappings times 2^D lists), which is polynomial in n and d, as desired. Queries can be answered in ⌈k/B⌉ I/Os since we are taking full advantage of the power of the indexability model: To answer a query it is necessary to know which mapping can be used to answer it correctly, and where in storage the blocks with the relevant list reside.

Of course, we can also get an algorithm in the standard I/O model with a multiplicative query time overhead proportional to the number of data structures, by querying all repetitions and returning the k closest points seen.

6. Conclusion and open problems

We have shown that nontrivial lower bounds can be shown in the indexability model, even under approximation. The main open problem that we leave is whether our hardness result for Hamming distance can be extended to an approximation factor c > 1 that is independent of k. This would give an unconditional analogue of the recent conditional lower bound of Rubinstein (Rubinstein, 2018).

References

  • Afshani (2012) Peyman Afshani. 2012. Improved pointer machine and I/O lower bounds for simplex range reporting and related problems. In Proceedings of 28th Symposium on Computational Geometry (SoCG). ACM, 339–346.
  • Afshani et al. (2009) Peyman Afshani, Lars Arge, and Kasper Dalgaard Larsen. 2009. Orthogonal range reporting in three and higher dimensions. In 50th IEEE Symposium on Foundations of Computer Science (FOCS). IEEE, 149–158.
  • Aggarwal and Vitter (1988) Alok Aggarwal and Jeffrey S Vitter. 1988. The input/output complexity of sorting and related problems. Commun. ACM 31 (1988), 1116–1127. Issue 9.
  • Alman and Williams (2015) Josh Alman and Ryan Williams. 2015. Probabilistic Polynomials and Hamming Nearest Neighbors. In Proceedings of 56th symposium on Foundations of Computer Science (FOCS). 136–150. https://doi.org/10.1109/FOCS.2015.18
  • Andoni et al. (2018) Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn. 2018. Approximate nearest neighbor search in high dimensions. arXiv preprint 1806.09823 (2018). Also appears in proceedings of ICM 2018.
  • Andoni et al. (2017) Alexandr Andoni, Thijs Laarhoven, Ilya P. Razenshteyn, and Erik Waingarten. 2017. Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors. In Proceedings of 28th ACM-SIAM Symposium on Discrete Algorithms (SODA). 47–66. https://doi.org/10.1137/1.9781611974782.4
  • Arge et al. (2009) Lars Arge, Vasilis Samoladas, and Ke Yi. 2009. Optimal external memory planar point enclosure. Algorithmica 54, 3 (2009), 337–352.
  • Babenko and Lempitsky (2016) Artem Babenko and Victor Lempitsky. 2016. Efficient indexing of billion-scale datasets of deep descriptors. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2055–2063.
  • Babenko et al. (2014) Artem Babenko, Anton Slesarev, Alexandr Chigorin, and Victor Lempitsky. 2014. Neural codes for image retrieval. In European Conference on Computer Vision (ECCV). Springer, 584–599.
  • Bahmani et al. (2012) Bahman Bahmani, Ashish Goel, and Rajendra Shinde. 2012. Efficient distributed locality sensitive hashing. In Proceedings of 21st ACM international conference on information and knowledge management (CIKM). ACM, 2174–2178.
  • Barkol and Rabani (2002) Omer Barkol and Yuval Rabani. 2002. Tighter Lower Bounds for Nearest Neighbor Search and Related Problems in the Cell Probe Model. J. Comput. Syst. Sci. 64, 4 (June 2002), 873–896. https://doi.org/10.1006/jcss.2002.1831
  • Berchtold et al. (1997) Stefan Berchtold, Christian Böhm, Daniel A. Keim, and Hans-Peter Kriegel. 1997. A Cost Model For Nearest Neighbor Search in High-Dimensional Data Space. In Proceedings of 16th ACM Symposium on Principles of Database Systems (PODS).
  • Borodin et al. (1999) Allan Borodin, Rafail Ostrovsky, and Yuval Rabani. 1999. Lower bounds for high dimensional nearest neighbor search and related problems. In Proceedings of 31st Symposium on Theory of Computing (STOC). 312–321.
  • Chakrabarti and Regev (2004) Amit Chakrabarti and Oded Regev. 2004. An Optimal Randomised Cell Probe Lower Bound for Approximate Nearest Neighbour Searching. In Proceedings of 45th IEEE Symposium on Foundations of Computer Science (FOCS ’04). 473–482. https://doi.org/10.1109/FOCS.2004.12
  • Gionis et al. (1999) Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 1999. Similarity Search in High Dimensions via Hashing. In Proceedings of 25th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann, 518–529. http://www.vldb.org/conf/1999/P49.pdf
  • Har-Peled et al. (2012) Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. 2012. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. Theory of Computing 8, 1 (2012), 321–350. https://doi.org/10.4086/toc.2012.v008a014
  • Hellerstein et al. (2002) Joseph Hellerstein, Elias Koutsoupias, Daniel Miranker, Christos Papadimitriou, and Vasilis Samoladas. 2002. On a Model of Indexability and Its Bounds for Range Queries. J. ACM 49, 1 (2002), 35–55.
  • Hellerstein et al. (1997) Joseph Hellerstein, Elias Koutsoupias, and Christos Papadimitriou. 1997. On the Analysis of Indexing Schemes. In Proceedings of 16th ACM Symposium on Principles of Database Systems (PODS). ACM, 249–256.
  • Hellerstein et al. (1995) Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer. 1995. Generalized Search Trees for Database Systems. In Proceedings of 21st International Conference on Very Large Data Bases (VLDB). 562–573.
  • Hu et al. (2014) Xiaocheng Hu, Miao Qiao, and Yufei Tao. 2014. Independent range sampling. In Proceedings of 33rd ACM Symposium on Principles of database systems (PODS). ACM, 246–255.
  • Hu et al. (2019) Xiao Hu, Ke Yi, and Yufei Tao. 2019. Output-Optimal Massively Parallel Algorithms for Similarity Joins. ACM Transactions on Database Systems 44, 2 (April 2019), 1–36. https://doi.org/10.1145/3311967
  • Indyk (2001) Piotr Indyk. 2001. On approximate nearest neighbors under ℓ_∞ norm. J. Comput. System Sci. 63, 4 (2001), 627–638.
  • Indyk (2007) Piotr Indyk. 2007. Uncertainty principles, extractors, and explicit embeddings of L2 into L1. In Proceedings of 39th ACM Symposium on Theory of Computing (STOC). 615–620. https://doi.org/10.1145/1250790.1250881
  • Kapralov (2015) Michael Kapralov. 2015. Smooth tradeoffs between insert and query complexity in nearest neighbor search. In Proceedings of 34th ACM Symposium on Principles of Database Systems (PODS). ACM, 329–342.
  • Kushilevitz et al. (2000) Eyal Kushilevitz, Rafail Ostrovsky, and Yuval Rabani. 2000. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput. 30, 2 (2000), 457–474.
  • Larsen and Van Walderveen (2013) Kasper Green Larsen and Freek Van Walderveen. 2013. Near-optimal range reporting structures for categorical data. In Proceedings of 24th ACM-SIAM symposium on discrete algorithms (SODA). SIAM, 265–276.
  • Muja and Lowe (2014) Marius Muja and David G Lowe. 2014. Scalable nearest neighbor algorithms for high dimensional data. IEEE transactions on pattern analysis and machine intelligence 36, 11 (2014), 2227–2240.
  • Östlin and Pagh (2002) Anna Östlin and Rasmus Pagh. 2002. One-Probe Search. In Proceedings of 29th international colloquium on automata, languages and programming (ICALP). 439–450.
  • Panigrahy et al. (2010) Rina Panigrahy, Kunal Talwar, and Udi Wieder. 2010. Lower bounds on near neighbor search via metric expansion. In 2010 IEEE 51st Symposium on Foundations of Computer Science. IEEE, 805–814.
  • Pestov (2013) Vladimir Pestov. 2013. Lower bounds on performance of metric tree indexing schemes for exact similarity search in high dimensions. Algorithmica 66, 2 (2013), 310–328.
  • Pestov and Stojmirović (2006) Vladimir Pestov and Aleksandar Stojmirović. 2006. Indexing schemes for similarity search: An illustrated paradigm. Fundamenta Informaticae 70, 4 (2006), 367–385.
  • Rubinstein (2018) Aviad Rubinstein. 2018. Hardness of approximate nearest neighbor search. In Proceedings of 50th ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, Ilias Diakonikolas, David Kempe, and Monika Henzinger (Eds.). ACM, 1260–1268. https://doi.org/10.1145/3188745.3188916
  • Streppel and Yi (2011) Micha Streppel and Ke Yi. 2011. Approximate range searching in external memory. Algorithmica 59, 2 (2011), 115–128.
  • Tao et al. (2010) Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. 2010. Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst. 35, 3 (2010), 20:1–20:46. https://doi.org/10.1145/1806907.1806912
  • Wei et al. (2009) Zhewei Wei, Ke Yi, and Qin Zhang. 2009. Dynamic external hashing: The limit of buffering. In Proceedings of 21st symposium on Parallelism in algorithms and architectures. ACM, 253–259.
  • Yi (2009) Ke Yi. 2009. Dynamic indexability and lower bounds for dynamic one-dimensional range query indexes. In Proceedings of 28th ACM Symposium on Principles of Database Systems (PODS). ACM, 187–196.