1 Introduction
The Nearest Neighbor Search (NNS) problem is defined as follows. Given an $n$-point dataset $P$ in a $d$-dimensional Euclidean space $\mathbb{R}^d$, we would like to preprocess $P$ to answer $k$-nearest neighbor queries quickly. That is, given a query point $q \in \mathbb{R}^d$, we want to find the $k$ data points from $P$ that are closest to $q$. NNS is a cornerstone of modern data analysis and, at the same time, a fundamental geometric data structure problem that has led to many exciting theoretical developments over the past decades. See, e.g., [WLKC16, AIR18] for an overview.
The two main approaches to constructing efficient NNS data structures are indexing and sketching. The goal of indexing is to construct a data structure that, given a query point, produces a small subset of $P$ (called the candidate set) that includes the desired neighbors. Such a data structure can be stored on a single machine, or (if the dataset is very large) distributed among multiple machines. In contrast, the goal of sketching is to compute compressed representations of points that enable computing approximate distances quickly (e.g., compact binary hash codes with the Hamming distance used as an estimator [WSSJ14, WLKC16]). Indexing and sketching can be (and often are) combined to maximize the overall performance [WGS17, JDJ17].
Both indexing and sketching have been the topic of a vast amount of theoretical and empirical literature. In this work, we consider the indexing problem. In particular, we focus on indexing based on space partitions. The overarching idea is to build a partition of the ambient space $\mathbb{R}^d$ and split the dataset $P$ accordingly. Given a query point $q$, we identify the bin containing $q$ and form the resulting list of candidates from the data points residing in the same bin (or, to boost the accuracy, nearby bins as well). Some of the popular space partitioning methods include locality-sensitive hashing (LSH) [LJW07, AIL15, DSN17]; quantization-based approaches, where partitions are obtained via $k$-means clustering of the dataset [JDS11, BL12]; and tree-based methods such as random-projection trees or PCA trees [Spr91, BCG05, DS13, KS18].
Compared to other indexing methods, space partitions have multiple benefits. First, they are naturally applicable in distributed settings, as different bins can be stored on different machines [BGS12, NCB17, LCY17, BW18]. Moreover, the computational efficiency of search can be further improved by using any nearest neighbor search algorithm locally on each machine. Second, partition-based indexing is particularly suitable for GPUs due to the simple and predictable memory access pattern [JDJ17]. Finally, partitions can be combined with cryptographic techniques to yield efficient secure similarity search algorithms [CCD19]. Thus, in this paper we focus on designing space partitions that optimize the trade-off between their key metrics: the number of reported candidates, the fraction of the true nearest neighbors among the candidates, the number of bins, and the computational efficiency of the point location.
Recently, there has been a large body of work that studies how modern machine learning techniques (such as neural networks) can help tackle various classic algorithmic problems (a partial list includes [MPB15, BLS16, BJPD17, DKZ17, MMB17, KBC18, BDSV18, LV18, Mit18, PSK18]). Similar methods, under the name “learn to hash”, have been used to improve the sketching approach to NNS [WLKC16]. However, when it comes to indexing, while some unsupervised techniques such as PCA or $k$-means have been successfully applied, the full power of modern tools like neural networks has not yet been harnessed. This state of affairs naturally leads to the following general question: Can we employ modern (supervised) machine learning techniques to find good space partitions for nearest neighbor search?
1.1 Our contribution
In this paper we address the aforementioned challenge and present a new framework for finding high-quality space partitions of $\mathbb{R}^d$. Our approach consists of three major steps:

Build the $k$-NN graph of the dataset by connecting each data point to its $k$ nearest neighbors;

Find a balanced partition of the graph into $m$ parts of nearly equal size such that the number of edges between different parts is as small as possible;

Obtain a partition of $\mathbb{R}^d$ by training a classifier on the data points with labels being the parts of the partition found in the second step.
See Figure 1 for an illustration. The new algorithm directly optimizes the performance of the partition-based nearest neighbor data structure. Indeed, if a query is chosen as a uniformly random data point, then the average $k$-NN accuracy is exactly equal to the fraction of edges of the $k$-NN graph whose endpoints are not separated by the partition found in the second step. This generalizes to out-of-sample queries provided that the query and dataset distributions are close, and the test accuracy of the trained classifier is high.
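The relation between $k$-NN accuracy and the edges kept inside a bin can be checked numerically. The following is a minimal sketch in plain Python, assuming brute-force neighbor search, directed $k$-NN edges, single-probe search (only the query's own bin is scanned), and a toy two-bin partition; the function names and the random data are illustrative only and are not part of our implementation.

```python
import math
import random


def knn_graph(points, k):
    """Brute-force k-NN lists: neighbors[i] holds the k closest other points to i."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(others[:k])
    return neighbors


def uncut_edge_fraction(neighbors, labels):
    """Fraction of directed k-NN edges (i, j) with both endpoints in one bin."""
    total = sum(len(nbrs) for nbrs in neighbors)
    same = sum(labels[i] == labels[j]
               for i, nbrs in enumerate(neighbors) for j in nbrs)
    return same / total


def average_knn_accuracy(neighbors, labels):
    """Mean, over data-point queries, of the share of true neighbors in the query's bin."""
    per_query = [
        sum(labels[j] == labels[i] for j in nbrs) / len(nbrs)
        for i, nbrs in enumerate(neighbors)
    ]
    return sum(per_query) / len(per_query)


random.seed(0)
points = [(random.random(), random.random()) for _ in range(40)]
labels = [0 if x < 0.5 else 1 for x, _ in points]  # toy two-bin partition (assumption)
nbrs = knn_graph(points, k=5)
print(uncut_edge_fraction(nbrs, labels), average_knn_accuracy(nbrs, labels))
```

Both printed numbers coincide because every vertex has exactly $k$ outgoing edges, so averaging per query and averaging per edge are the same computation.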
At the same time, our approach is directly related to and inspired by recent theoretical work [ANN18a, ANN18b] on NNS for general metric spaces. The two relevant contributions in these papers are as follows. First, the following structural result is shown for a large class of metric spaces (which includes Euclidean space, and, more generally, all normed spaces). Any graph embeddable into such a space in a way that (a) all edges are short, yet (b) there are no low-radius balls that contain a large fraction of vertices, must contain a sparse cut. It is natural to expect that the $k$-NN graph of a well-behaved dataset would have these properties, which implies the existence of a desired balanced partition. The second relevant result from [ANN18a, ANN18b] shows that, under additional assumptions on a metric space, any such sparse cut in an embedded graph can be assumed to have a certain nice form, which makes it efficient to store and query. This result has strong parallels with our learning step, where we extend a graph partition to a partition of the ambient space induced by an (algorithmically nice) classifier. Unlike [ANN18a, ANN18b], where the whole space is discretized into a graph, we build a graph supported only on the dataset points and learn the extension to the ambient space using supervised learning.
The new framework is very flexible and uses partitioning and learning in a black-box way. This allows us to plug in various models (linear models, neural networks, etc.) and explore the trade-off between the quality and the algorithmic efficiency of the resulting partitions. We emphasize the importance of balanced partitions for the indexing problem, where all bins contain roughly the same number of data points. This property is crucial in the distributed setting, since we naturally would like to assign a similar number of points to each machine. Furthermore, balanced partitions allow tighter control of the number of candidates simply by varying the number of retrieved parts. Note that a priori, it is unclear how to partition $\mathbb{R}^d$ so as to induce balanced bins of a given dataset. Here the combinatorial portion of our approach is particularly useful, as balanced graph partitioning is a well-studied problem, and our supervised extension to $\mathbb{R}^d$ naturally preserves the balance by virtue of attaining high training accuracy.
We speculate that the new method might be potentially useful for solving the NNS problem for non-Euclidean metrics, such as the edit distance [ZZ17] or optimal transport distance [KSKW15]. Indeed, for any metric space, one can compute the $k$-NN graph and then partition it. The only step that needs to be adjusted to the specific metric at hand is the learning step.
Let us finally put forward the challenge of scaling our method up to billion-sized or even larger datasets. For such a scale, one needs to build an approximate $k$-NN graph as well as to use graph partitioning algorithms that are faster than KaHIP. We leave this exciting direction to future work.
Evaluation. We instantiate our framework with the KaHIP algorithm [SS13] for the graph partitioning step, and either linear models or small-size neural networks for the learning step. We evaluate it on several standard benchmarks for NNS [ABF17] and conclude that in terms of quality of the resulting partitions, it consistently outperforms quantization-based and tree-based partitioning procedures, while maintaining comparable algorithmic efficiency. In the high accuracy regime, our framework yields partitions that lead to processing up to fewer candidates than alternative approaches.
As a baseline method we use $k$-means clustering [JDS11]. It produces a partition of the dataset into $m$ bins, in a way that naturally extends to all of $\mathbb{R}^d$, by assigning a query point $q$ to its nearest centroid. (More generally, for multi-probe querying, we can rank the bins by the distance of their centroids to $q$.) This simple scheme produces very high-quality results for indexing.
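The query side of this baseline is easy to state in code. The sketch below assumes the centroids have already been computed by some clustering routine; the centroid coordinates, point ids, and function names are hypothetical and serve only to show how bins are ranked by centroid distance.

```python
import math


def rank_bins(query, centroids):
    """Return bin indices ordered by increasing centroid distance to the query."""
    return sorted(range(len(centroids)),
                  key=lambda b: math.dist(query, centroids[b]))


def candidates(query, centroids, bins, num_probes=1):
    """Collect the data-point ids stored in the num_probes closest bins."""
    result = []
    for b in rank_bins(query, centroids)[:num_probes]:
        result.extend(bins[b])
    return result


centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]  # hypothetical centroids
bins = [[0, 1], [2, 3, 4], [5]]                     # hypothetical point ids per bin
print(rank_bins((1.0, 2.0), centroids))
print(candidates((1.0, 2.0), centroids, bins, num_probes=2))
```

With `num_probes=1` this reduces to assigning the query to its nearest centroid; larger values give the multi-probe variant.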
1.2 Related work
On the empirical side, currently the fastest indexing techniques for the NNS problem are graph-based [MY18]. The high-level idea is to construct a graph on the dataset (it can be the $k$-NN graph, but other constructions are also possible), and then for each query perform a walk, which eventually converges to the nearest neighbor. Although very fast, graph-based approaches have suboptimal “locality of reference”, which makes them less suitable for several modern architectures. For instance, this is the case when the algorithm is run on a GPU [JDJ17], or when the data is stored in external memory [SWQ14] or in a distributed manner [BGS12, NCB17]. Moreover, graph-based indexing requires many rounds of adaptive access to the dataset, whereas partition-based indexing accesses the dataset in one shot. This is crucial, for example, for nearest neighbor search over encrypted data [CCD19]. These benefits justify further study of partition-based methods.
Machine learning techniques are particularly useful for the sketching approach, leading to a vast body of research under the label “learning to hash” [WSSJ14, WLKC16]. In particular, several recent works employed neural networks to obtain high-quality sketches [LLW15, SDSJ19]. The fundamental difference from our work is that sketching is designed to speed up linear scans over the dataset, by reducing the cost of distance evaluation, while indexing is designed for sublinear time searches, by reducing the number of distance evaluations. We highlight the work [SDSJ19], which uses neural networks to learn a mapping that improves the geometry of the dataset and the queries to facilitate subsequent sketching. It is natural to apply the same family of maps for partitions; however, as our experiments show, in the high accuracy regime the maps learned using the algorithm of [SDSJ19] consistently degrade the quality of partitions.
Prior work [CD07] has used learning to tune the parameters of certain structured classes of partitions, such as KD-trees or rectilinear LSH. This is substantially different from our method, which learns a much more general class of partitions, whose only structural constraint stems from the chosen learning component (say, the class of space partitions that can be learned by an SVM, a neural network, and so on).
2 Our method
Training. Given a dataset $P \subseteq \mathbb{R}^d$ of $n$ points, and a number of bins $m > 0$, our goal is to find a partition $\mathcal{R}$ of $\mathbb{R}^d$ into $m$ bins with the following properties:

Balanced: The number of data points in each bin is not much larger than $n/m$.

Locality sensitive: For a typical query point $q \in \mathbb{R}^d$, most of its nearest neighbors belong to the same bin of $\mathcal{R}$. We assume that queries and data points come from similar distributions.

Simple: The partition should admit a compact description and, moreover, the point location process should be computationally efficient. For example, we might look for a space partition induced by hyperplanes.
First, suppose that the query is chosen as a uniformly random data point, $q \sim P$. Let $G$ be the $k$-NN graph of $P$, whose vertices are the data points, and each vertex is connected to its $k$ nearest neighbors. Then the above problem boils down to partitioning the vertices of the graph $G$ into $m$ bins such that each bin contains roughly $n/m$ vertices, and the number of edges crossing between different bins is as small as possible (see Figure 1(b)). This balanced graph partitioning problem is extremely well-studied, and there are available combinatorial partitioning solvers that produce very high-quality solutions. In our implementation, we use the open-source solver KaHIP [SS13], which is based on a sophisticated local search.
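To make the objective concrete, the sketch below implements a deliberately naive pairwise-swap local search for a balanced two-way cut. It is a toy stand-in chosen for brevity and is not the algorithm used by KaHIP; the graph (two 4-cliques joined by one edge) and all names are illustrative assumptions.

```python
def cut_size(edges, labels):
    """Number of edges whose endpoints lie in different bins."""
    return sum(labels[u] != labels[v] for u, v in edges)


def swap_local_search(edges, labels):
    """Swap pairs of vertices from different bins while that lowers the cut.

    Swapping two labels keeps every bin size unchanged, so balance is preserved.
    """
    labels = list(labels)
    best = cut_size(edges, labels)
    improved = True
    while improved:
        improved = False
        for u in range(len(labels)):
            for v in range(u + 1, len(labels)):
                if labels[u] == labels[v]:
                    continue
                labels[u], labels[v] = labels[v], labels[u]
                new = cut_size(edges, labels)
                if new < best:
                    best, improved = new, True
                else:  # undo a swap that does not help
                    labels[u], labels[v] = labels[v], labels[u]
    return labels


def clique(vertices):
    """All edges of a complete graph on the given vertices."""
    return [(a, b) for i, a in enumerate(vertices) for b in vertices[i + 1:]]


# Two 4-cliques joined by the single edge (3, 4); start from an alternating split.
edges = clique([0, 1, 2, 3]) + clique([4, 5, 6, 7]) + [(3, 4)]
start = [0, 1, 0, 1, 0, 1, 0, 1]
final = swap_local_search(edges, start)
print(cut_size(edges, start), "->", cut_size(edges, final))
```

Such a search only reaches a local optimum and scales poorly, which is one reason a dedicated solver is used in practice.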
More generally, we need to handle out-of-sample queries, i.e., queries that are not contained in $P$. Let $\tilde{\mathcal{R}}$ denote the partition of $G$ (equivalently, of the dataset $P$) found by the graph partitioner. To convert $\tilde{\mathcal{R}}$ into a solution to our problem, we need to extend it to a partition $\mathcal{R}$ of the whole space $\mathbb{R}^d$ that would work well for query points. In order to accomplish this, we train a model that, given a query point $q \in \mathbb{R}^d$, predicts which of the $m$ bins of $\tilde{\mathcal{R}}$ the point $q$ belongs to (see Figure 1(c)). We use the dataset $P$ as a training set, and the partition $\tilde{\mathcal{R}}$ as the labels; i.e., each data point $p \in P$ is labeled with the ID of the bin of $\tilde{\mathcal{R}}$ containing it. The geometric intuition for this learning step is that, even though the partition $\tilde{\mathcal{R}}$ is obtained by combinatorial means, and in principle might consist of ill-behaved subsets of $\mathbb{R}^d$, in most practical scenarios we actually expect it to be close to being induced by a simple partition of the ambient space. For example, if the dataset is fairly well-distributed on the unit sphere, and the number of bins is $m = 2$, a balanced cut of $G$ should be close to a hyperplane.
The choice of model to train depends on the desired properties of the output partition $\mathcal{R}$. For instance, if we are interested in a hyperplane partition, we can train a linear model using SVM or regression. In this paper, we instantiate the learning step with both linear models and small-sized neural networks. Here, there is a natural tension between the size of the model we train and the accuracy of the resulting classifier, and hence the quality of the partition we produce. A larger model yields better NNS accuracy, at the expense of computational efficiency. We discuss this more in Section 3.
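As an illustration of the learning step with a linear model, the sketch below fits a two-bin logistic regression by plain gradient descent and then uses the learned hyperplane for point location. It assumes the bin labels were already produced by a graph partitioner; the toy grid of points, the hyperparameters, and the function names are assumptions made for this example only.

```python
import math


def train_logistic(points, labels, epochs=200, lr=0.5):
    """Fit w, b for P(bin = 1 | x) = sigmoid(w . x + b) by gradient descent."""
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(points, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # derivative of the log-loss in z
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_bin(w, b, x):
    """Point location: which side of the learned hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical labels from a two-bin graph partition: bin 0 on the left, bin 1 on the right.
points = [(float(x), float(y)) for x in (-3, -2, -1, 1, 2, 3) for y in (-1, 0, 1)]
labels = [0 if x < 0 else 1 for x, _ in points]
w, b = train_logistic(points, labels)
train_accuracy = sum(predict_bin(w, b, x) == y
                     for x, y in zip(points, labels)) / len(points)
print(train_accuracy, predict_bin(w, b, (2.5, 0.3)), predict_bin(w, b, (-1.5, -0.7)))
```

The last two calls are out-of-sample queries: they were never labeled by the partitioner, yet the hyperplane assigns them a bin.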
Multi-probe querying. Given a query point $q$, the trained model can be used to assign it to a bin of the partition $\mathcal{R}$, and search for nearest neighbors within the data points in that bin. In order to achieve high search accuracy, we actually train the model to predict several bins for a given query point, which are likely to contain nearest neighbors. For neural networks, this can be done naturally by taking several largest outputs of the last layer. By searching through more bins (in the order of preference predicted by the model) we can achieve better accuracy, allowing for a trade-off between computational resources and accuracy.
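The trade-off can be seen on a small made-up example. In the sketch below, `scores` stands for the last-layer outputs of a trained model for one query; the scores, bin contents, and ground-truth neighbors are all hypothetical values chosen for illustration.

```python
def top_bins(scores, num_probes):
    """Indices of the num_probes largest model outputs, best first."""
    order = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return order[:num_probes]


def probe(scores, bins, num_probes):
    """Candidate ids gathered from the num_probes highest-scoring bins."""
    return [i for b in top_bins(scores, num_probes) for i in bins[b]]


scores = [0.10, 0.60, 0.05, 0.25]              # hypothetical softmax outputs
bins = [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9, 10]]
true_neighbors = {0, 3, 4, 8}                  # hypothetical ground truth
for t in range(1, 5):
    cands = probe(scores, bins, t)
    accuracy = len(true_neighbors & set(cands)) / len(true_neighbors)
    print(t, len(cands), accuracy)
```

Each additional probed bin adds candidates (more work per query) and can only keep or increase the fraction of true neighbors found.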
Hierarchical partitions. When the required number of bins $m$ is large, in order to improve the efficiency of the resulting partition, it pays off to produce it in a hierarchical manner. Namely, we first find a partition of $\mathbb{R}^d$ into $m_1$ bins, then recursively partition each of the bins into $m_2$ bins, and so on, repeating the partitioning for $\ell$ levels. The total number of bins in the overall partition is $m = m_1 \cdot m_2 \cdots m_\ell$. See Figure 7 in the appendix for an illustration. The advantage of such a hierarchical partition is that it is much simpler to navigate than a one-shot partition with $m$ bins.
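A two-level lookup can be sketched as follows. The branching factors and the threshold functions standing in for trained classifiers are hypothetical; the point is only that the query evaluates one top-level model and one second-level model, while the number of addressable bins is the product of the per-level bin counts.

```python
from math import prod


def flat_bin_id(path, branching):
    """Map per-level choices (b_1, ..., b_l) to one id in [0, prod(branching))."""
    bin_id = 0
    for choice, width in zip(path, branching):
        bin_id = bin_id * width + choice
    return bin_id


def locate(query, top_model, sub_models):
    """Two-level point location: the top model picks a bin, its sub-model refines it."""
    top = top_model(query)
    return (top, sub_models[top](query))


branching = [2, 2]                               # hypothetical bins per level
top_model = lambda q: int(q[0] >= 0)             # stand-in for a trained classifier
sub_models = [lambda q: int(q[1] >= 0)] * 2      # one stand-in model per top-level bin
path = locate((1.0, -2.0), top_model, sub_models)
print(prod(branching), path, flat_bin_id(path, branching))
```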
2.1 Neural LSH
In one instantiation of the supervised learning component, we use neural networks with a small number of layers and constrained hidden dimensions. The exact parameters depend on the size of the training set, and are specified in the next section.
Soft labels. In order to support effective multi-probe querying, we need to infer not just the bin that contains the query point, but rather a distribution over bins that are likely to contain this point and its neighbors. A $T$-probe candidate list is then formed from all data points in the $T$ most likely bins.
In order to accomplish this, we use soft labels for data points generated as follows. For $S \geq 1$ and a data point $p \in P$, the soft label $\mathcal{P} = (p_1, p_2, \ldots, p_m)$ is a distribution over the bin containing a point chosen uniformly at random among the $S$ nearest neighbors of $p$ (including $p$ itself). Now, for a predicted distribution $\mathcal{Q} = (q_1, q_2, \ldots, q_m)$, we seek to minimize the KL divergence between $\mathcal{P}$ and $\mathcal{Q}$: $\sum_{i=1}^{m} p_i \log \frac{p_i}{q_i}$. Intuitively, soft labels help guide the neural network with information about the ranking of multiple bins.
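The construction of a soft label and the loss it is compared with can be sketched in a few lines. The example assumes the bins of a point and its nearest neighbors are already known; the neighbor bins, the number of bins, and the predicted distribution are made-up values.

```python
import math


def soft_label(neighbor_bins, num_bins):
    """Empirical distribution of bins among a point's nearest neighbors (point included)."""
    counts = [0] * num_bins
    for b in neighbor_bins:
        counts[b] += 1
    return [c / len(neighbor_bins) for c in counts]


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), skipping terms with p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Bins of a point and its neighbors, five entries in total (hypothetical values).
p = soft_label([2, 2, 2, 0, 1], num_bins=4)
q = [0.2, 0.2, 0.5, 0.1]   # hypothetical softmax output of the model
print(p, kl_divergence(p, q))
```

A hard label would put all mass on bin 2 here; the soft label also records that bins 0 and 1 hold some neighbors, which is the information multi-probe ranking needs.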
The number of neighbors $S$ used to form the soft labels is a hyperparameter that needs to be tuned. We study its setting in Section 3.4.
3 Experiments
Datasets. For the experimental evaluation, we use three standard ANN benchmarks [ABF17]: SIFT (image descriptors, 1M 128-dimensional points), GloVe (word embeddings [PSM14], approximately 1.2M 100-dimensional points, normalized), and MNIST (images of digits, 60K 784-dimensional points). All three datasets come with query points, which we use for evaluation. We include the results for SIFT and GloVe in the main text, and MNIST in Appendix A.
Evaluation metrics. We mainly investigate the trade-off between the number of candidates generated for a query point, and the $k$-NN accuracy, defined as the fraction of its $k$ nearest neighbors that are among those candidates. The number of candidates determines the processing time of an individual query. Over the entire query set, we report both the average as well as the th quantile of the number of candidates. The former measures the throughput (number of queries per second) of the data structure, while the latter measures its latency (maximum time per query, modulo a small fraction of outliers).
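These metrics can be computed as in the sketch below. The candidate counts are invented numbers, and the quantile level 0.9 is an assumption made only so that the example runs; it is not the level used in our experiments.

```python
import math


def knn_accuracy(candidates, true_neighbors):
    """Fraction of the true nearest neighbors that appear among the candidates."""
    hits = sum(1 for j in true_neighbors if j in candidates)
    return hits / len(true_neighbors)


def quantile(values, fraction):
    """Smallest value v such that at least `fraction` of the values are <= v."""
    ordered = sorted(values)
    index = max(0, math.ceil(fraction * len(ordered)) - 1)
    return ordered[index]


# Hypothetical number of candidates for each of ten queries.
candidate_counts = [120, 95, 130, 110, 900, 105, 98, 115, 101, 125]
average = sum(candidate_counts) / len(candidate_counts)
tail = quantile(candidate_counts, 0.9)   # 0.9 is an assumed quantile level
print(average, tail, knn_accuracy({1, 2, 3, 4}, [1, 2, 9, 10]))
```

In this toy data a single slow query inflates the average, while the quantile reflects what most queries experience; the two numbers being close indicates low variance across queries.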
We mostly focus on parameter regimes that lead to $k$-NN accuracy of at least . In virtually all of our experiments, .
Methods evaluated. We evaluate two variants of our method, corresponding to two different choices of the supervised learning component in our framework.

Neural LSH: In this variant we use small neural networks. Their exact architecture is detailed in the next section. We compare Neural LSH to partitions obtained by $k$-means clustering. As mentioned in Section 1, this method produces high-quality partitions of the dataset that naturally extend to all of $\mathbb{R}^d$, and other existing methods we have tried (such as LSH) did not match its performance. We evaluate partitions into bins and bins. We test both one-level (non-hierarchical) and two-level (hierarchical) partitions. Queries are multi-probe.

Regression LSH: This variant uses logistic regression as the supervised learning component and, as a result, produces very simple partitions induced by hyperplanes. We compare this method with PCA trees [Spr91, KZN08, AAKK14], random projection trees [DS13], and recursive bisections using $k$-means clustering. We build trees of hierarchical bisections of depth up to (thus, the total number of leaves is up to ). The query procedure descends a single root-to-leaf path and returns the candidates in that leaf.
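The query procedure for such hyperplane trees can be sketched as below. The tree is hand-built with made-up hyperplanes and point ids; in the methods compared here the hyperplanes would instead come from logistic regression, PCA, random projections, or a 2-means bisection. The `Node` class and its field names are assumptions for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    normal: Optional[Sequence[float]] = None   # hyperplane normal (internal nodes)
    offset: float = 0.0                        # hyperplane threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    bucket: Optional[List[int]] = None         # data-point ids (leaves only)


def descend(node, query):
    """Follow one root-to-leaf path and return the candidates stored in the leaf."""
    while node.bucket is None:
        side = sum(a * x for a, x in zip(node.normal, query)) - node.offset
        node = node.right if side > 0 else node.left
    return node.bucket


def leaf(ids):
    """Convenience constructor for a leaf holding the given point ids."""
    return Node(bucket=ids)


# A depth-2 tree over the plane with hypothetical splits and buckets.
tree = Node(normal=(1.0, 0.0), offset=0.0,
            left=Node(normal=(0.0, 1.0), offset=0.0, left=leaf([0, 1]), right=leaf([2])),
            right=Node(normal=(0.0, 1.0), offset=0.5, left=leaf([3, 4]), right=leaf([5])))
print(descend(tree, (1.0, -0.5)))
```

Each query costs one inner product per level, so a tree of depth $t$ addresses up to $2^t$ leaves with $t$ hyperplane evaluations.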
3.1 Implementation details
Neural LSH uses a fixed neural network architecture for the top-level partition, and a fixed architecture for all second-level partitions. Both architectures consist of several blocks, where each block is a fully-connected layer + batch normalization [IS15] + ReLU activations. The final block is followed by a fully-connected layer and a softmax layer. The resulting network predicts a distribution over the bins of the partition. The only differences between the top-level and the second-level network architectures are their number of blocks and the size of their hidden layers. In the top-level network we use and . In the second-level networks we use and . To reduce overfitting, we use dropout with probability during training. The networks are trained using the Adam optimizer [KB15] for under epochs on both levels. We reduce the learning rate multiplicatively at regular intervals. We use the Glorot initialization to generate the initial weights. To tune soft labels, we try different values of between and .
We evaluate two settings for the number of bins in each level, and (the total numbers of bins in the two-level experiments are and , respectively). In the two-level setting with the bottom level of Neural LSH uses $k$-means instead of a neural network, to avoid overfitting when the number of points per bin is tiny. In the other configurations (two levels with and one level with either or ) we use Neural LSH at all levels.
3.2 Comparison with $k$-means
Figure 2 shows the empirical comparison of Neural LSH with $k$-means. The points listed are those that attained an accuracy . We note that two-level partitioning with is the best performing configuration of $k$-means, for both SIFT and GloVe (in terms of the minimum number of candidates that attains accuracy). Thus we evaluate the baseline at its optimal performance. However, if one wishes to use partitions to split points across machines to build a distributed NNS data structure, then a single-level setting seems to be more suitable.
In all settings considered, Neural LSH yields consistently better partitions than $k$-means. Depending on the setting, $k$-means requires significantly more candidates to achieve the same accuracy:

Up to more for the average number of candidates for GloVe;

Up to more for the quantiles of candidates for GloVe;

Up to more for the average number of candidates for SIFT;

Up to more for the quantiles of candidates for SIFT;
Figure 8 in the appendix lists the largest multiplicative advantage in the number of candidates of Neural LSH compared to $k$-means, for accuracy values of at least . Specifically, for every configuration of $k$-means, we compute the ratio between the number of candidates in that configuration and the number of candidates of Neural LSH in its optimal configuration, among those that attained at least the same accuracy as that $k$-means configuration.
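The ratio described above can be computed as in the following sketch. Every run is an (accuracy, number of candidates) pair; all numbers, the accuracy threshold, and the function name are invented for illustration and are not measurements.

```python
def largest_advantage(baseline_runs, method_runs, min_accuracy):
    """Largest ratio baseline_candidates / best_method_candidates at matched accuracy.

    For each baseline run with accuracy >= min_accuracy, the method's best run is
    the one with the fewest candidates among runs at least as accurate.
    """
    best_ratio = None
    for acc, cands in baseline_runs:
        if acc < min_accuracy:
            continue
        matching = [c for a, c in method_runs if a >= acc]
        if not matching:
            continue  # the method never reached this accuracy
        ratio = cands / min(matching)
        if best_ratio is None or ratio > best_ratio:
            best_ratio = ratio
    return best_ratio


# Made-up (accuracy, candidates) pairs, not experimental results.
baseline_runs = [(0.70, 1500), (0.80, 3000), (0.90, 7000)]
method_runs = [(0.75, 1200), (0.82, 2000), (0.91, 4000)]
print(largest_advantage(baseline_runs, method_runs, min_accuracy=0.75))
```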
We also note that in all settings except two-level partitioning with , Neural LSH produces partitions for which the quantiles for the number of candidates are very close to the average number of candidates, which indicates very little variance between query times over different query points. (As mentioned earlier, in the excepted setting Neural LSH uses $k$-means at the second level, due to the large overall number of bins compared to the size of the datasets. This explains why the gap between the average and the quantile number of candidates of Neural LSH is larger for that setting.) In contrast, the respective gap in the partitions produced by $k$-means is much larger, since unlike Neural LSH, it does not directly favor balanced partitions. This implies that Neural LSH might be particularly suitable for latency-critical NNS applications.
Model sizes. The largest model size learned by Neural LSH is equivalent to storing about points for SIFT, or points for GloVe. This is considerably larger than $k$-means with , which stores at most points. Nonetheless, we believe the larger model size is acceptable for Neural LSH, for the following reasons. First, in most NNS applications, especially in the distributed setting, the bottleneck in the high accuracy regime is the memory accesses needed to retrieve candidates and the further processing (such as distance computations, exact or approximate). The model size is not a hindrance as long as it does not exceed certain reasonable limits (e.g., it should fit into a CPU cache). Neural LSH significantly reduces the memory access cost, while increasing the model size by an acceptable amount. Second, we have observed that the quality of the Neural LSH partitions is not too sensitive to decreasing the sizes of the hidden layers. The model sizes we report are, for the sake of concreteness, the largest ones that still lead to improved performance. Larger models do not increase the accuracy, and sometimes decrease it due to overfitting.
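To relate a network to "an equivalent number of stored points", one can count its learned parameters and divide by the data dimension. The sketch below does this for the block structure described in Section 3.1 (fully-connected layer, batch normalization, ReLU, followed by a final fully-connected layer). The hidden size, number of blocks, and number of bins in the example are placeholders rather than the values used in our experiments, and batch-normalization running statistics are not counted.

```python
def mlp_parameter_count(input_dim, hidden_dim, num_blocks, num_bins):
    """Learned parameters of num_blocks (FC + batch norm + ReLU) blocks plus a final FC."""
    total, width = 0, input_dim
    for _ in range(num_blocks):
        total += width * hidden_dim + hidden_dim   # fully-connected weights and biases
        total += 2 * hidden_dim                    # batch-norm scale and shift
        width = hidden_dim
    total += width * num_bins + num_bins           # final fully-connected layer
    return total


def equivalent_points(parameter_count, dim):
    """Number of dim-dimensional points that occupy the same number of scalars."""
    return parameter_count / dim


# Placeholder architecture on 128-dimensional inputs (the SIFT dimension).
params = mlp_parameter_count(input_dim=128, hidden_dim=64, num_blocks=2, num_bins=16)
print(params, equivalent_points(params, dim=128))
```

For comparison, a centroid-based partition with $m$ bins stores exactly $m$ points, which is why the model-size comparison above is phrased in units of points.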
3.3 Comparison with tree-based methods
Next we compare binary decision trees, where in each tree node a hyperplane is used to determine which of the two subtrees to descend into. We generate hyperplanes via multiple methods: Regression LSH, cutting the dataset into two equal halves along the top PCA direction [Spr91, KZN08], $k$-means clustering, and random projections of the centered dataset [DS13, KS18]. We build trees of depth up to , which corresponds to hierarchical partitions with the total number of bins up to . We summarize the results for GloVe and SIFT datasets in Figure 9 (see appendix). For random projections, we run each configuration times and average the results.
For GloVe, Regression LSH significantly outperforms $k$-means, while for SIFT, Regression LSH essentially matches $k$-means in terms of the average number of candidates, but shows a noticeable advantage in terms of the percentiles. In both instances, Regression LSH significantly outperforms PCA tree, and all of the above methods dramatically improve upon random projections.
Note, however, that random projections have an additional benefit: in order to boost search accuracy, one can simply repeat the sampling process several times and generate an ensemble of decision trees instead of a single tree. This allows making each individual tree relatively deep, which decreases the overall number of candidates, trading space for query time. The other approaches considered (Regression LSH, $k$-means, PCA tree) are inherently deterministic, and boosting their accuracy requires more care: for instance, one can use partitioning into blocks as in [JDS11], or alternative approaches like [KS18]. Since we focus on individual partitions and not ensembles, we leave this issue out of scope.
3.4 Additional experiments
We perform several additional experiments that we describe in greater detail in the appendix.

We evaluate the $k$-NN accuracy of Neural LSH when the partitioning step is run on either the NN or the NN graph. (Neural LSH can solve $k$-NNS by partitioning the $k'$-NN graph, for any $k, k'$; they do not have to be equal.) Both settings outperform $k$-means, and the gap between using the NN and NN graphs is negligible, which indicates the robustness of Neural LSH.

We show the effect of tuning the size of soft labels . We show that setting to be at least is immensely beneficial compared to , but beyond that we start observing diminishing returns.

We evaluate the effect of Neural Catalyzer [SDSJ19] on the partitions produced by $k$-means.
4 Conclusions and future directions
We presented a new technique for finding partitions of $\mathbb{R}^d$ which support high-performance indexing for sublinear-time NNS. It proceeds in two major steps: (1) We perform a combinatorial balanced partitioning of the $k$-NN graph of the dataset; (2) We extend the resulting partition to the whole ambient space by using supervised classification (such as logistic regression, neural networks, etc.). Our experiments show that the new approach consistently outperforms quantization-based and tree-based partitions. There are a number of exciting open problems we would like to highlight:

Can we jointly optimize a graph partition and a classifier at the same time? By making the two components aware of each other, we expect the quality of the resulting partition of $\mathbb{R}^d$ to improve.

Can our approach be extended to learning several high-quality partitions that complement each other? Such an ensemble might be useful to trade query time for memory usage [ALRW17].

Can we use machine learning techniques to improve graph-based indexing techniques [MY18] for NNS? (This is in contrast to partition-based indexing, as done in this work.)

Our framework is an example of combinatorial tools aiding “continuous” learning techniques. A more open-ended question is whether other problems can benefit from such symbiosis.
References
 [AAKK14] Amirali Abdullah, Alexandr Andoni, Ravindran Kannan, and Robert Krauthgamer, Spectral approaches to nearest neighbor search, arXiv preprint arXiv:1408.0751 (2014).
 [ABF17] Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull, ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms, International Conference on Similarity Search and Applications, Springer, 2017, pp. 34–49.
 [AIL15] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt, Practical and optimal LSH for angular distance, Advances in Neural Information Processing Systems, 2015, pp. 1225–1233.
 [AIR18] Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn, Approximate nearest neighbor search in high dimensions, arXiv preprint arXiv:1806.09823 (2018).
 [ALRW17] Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, and Erik Waingarten, Optimal hashing-based time-space trade-offs for approximate near neighbors, Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2017, pp. 47–66.
 [ANN18a] Alexandr Andoni, Assaf Naor, Aleksandar Nikolov, Ilya Razenshteyn, and Erik Waingarten, Data-dependent hashing via nonlinear spectral gaps, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (2018), 787–800.
 [ANN18b] Alexandr Andoni, Assaf Naor, Aleksandar Nikolov, Ilya Razenshteyn, and Erik Waingarten, Hölder homeomorphisms and approximate nearest neighbors, 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2018, pp. 159–169.
 [BCG05] Mayank Bawa, Tyson Condie, and Prasanna Ganesan, LSH forest: self-tuning indexes for similarity search, Proceedings of the 14th international conference on World Wide Web, ACM, 2005, pp. 651–660.
 [BDSV18] Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik, Learning to branch, International Conference on Machine Learning, 2018.
 [BGS12] Bahman Bahmani, Ashish Goel, and Rajendra Shinde, Efficient distributed locality sensitive hashing, Proceedings of the 21st ACM international conference on Information and knowledge management, ACM, 2012, pp. 2174–2178.
 [BJPD17] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis, Compressed sensing using generative models, International Conference on Machine Learning, 2017, pp. 537–546.
 [BL12] Artem Babenko and Victor Lempitsky, The inverted multi-index, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 3069–3076.
 [BLS16] Luca Baldassarre, Yen-Huan Li, Jonathan Scarlett, Baran Gözcü, Ilija Bogunovic, and Volkan Cevher, Learning-based compressive subsampling, IEEE Journal of Selected Topics in Signal Processing 10 (2016), no. 4, 809–822.
 [BW18] Aditya Bhaskara and Maheshakya Wijewardena, Distributed clustering via LSH based data partitioning, International Conference on Machine Learning, 2018, pp. 569–578.
 [CCD19] Hao Chen, Ilaria Chillotti, Yihe Dong, Oxana Poburinnaya, Ilya Razenshteyn, and M Sadegh Riazi, SANNS: Scaling up secure approximate k-nearest neighbors search, arXiv preprint arXiv:1904.02033 (2019).
 [CD07] Lawrence Cayton and Sanjoy Dasgupta, A learning framework for nearest neighbor search, Advances in Neural Information Processing Systems, 2007, pp. 233–240.
 [DKZ17] Hanjun Dai, Elias Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song, Learning combinatorial optimization algorithms over graphs, Advances in Neural Information Processing Systems, 2017, pp. 6351–6361.
 [DS13] Sanjoy Dasgupta and Kaushik Sinha, Randomized partition trees for exact nearest neighbor search, Conference on Learning Theory, 2013, pp. 317–337.
 [DSN17] Sanjoy Dasgupta, Charles F Stevens, and Saket Navlakha, A neural algorithm for a fundamental computing problem, Science 358 (2017), no. 6364, 793–796.
 [IS15] Sergey Ioffe and Christian Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167 (2015).
 [JDJ17] Jeff Johnson, Matthijs Douze, and Hervé Jégou, Billion-scale similarity search with GPUs, arXiv preprint arXiv:1702.08734 (2017).
 [JDS11] Hervé Jégou, Matthijs Douze, and Cordelia Schmid, Product quantization for nearest neighbor search, IEEE transactions on pattern analysis and machine intelligence 33 (2011), no. 1, 117–128.
 [KB15] Diederik Kingma and Jimmy Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, 2015.
 [KBC18] Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis, The case for learned index structures, Proceedings of the 2018 International Conference on Management of Data, ACM, 2018, pp. 489–504.
 [KS18] Omid Keivani and Kaushik Sinha, Improved nearest neighbor search using auxiliary information and priority functions, International Conference on Machine Learning, 2018, pp. 2578–2586.
 [KSKW15] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger, From word embeddings to document distances, International Conference on Machine Learning, 2015, pp. 957–966.
 [KZN08] Neeraj Kumar, Li Zhang, and Shree Nayar, What is a good nearest neighbors algorithm for finding similar patches in images?, European conference on computer vision, Springer, 2008, pp. 364–378.
 [LCY17] Jinfeng Li, James Cheng, Fan Yang, Yuzhen Huang, Yunjian Zhao, Xiao Yan, and Ruihao Zhao, Losha: A general framework for scalable locality sensitive hashing, Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2017, pp. 635–644.
 [LJW07] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li, Multi-probe LSH: efficient indexing for high-dimensional similarity search, Proceedings of the 33rd international conference on Very large data bases, VLDB Endowment, 2007, pp. 950–961.
 [LLW15] Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou, Deep hashing for compact binary codes learning, Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2475–2483.
 [LV18] Thodoris Lykouris and Sergei Vassilvitskii, Competitive caching with machine learned advice, International Conference on Machine Learning, 2018.
 [Mit18] Michael Mitzenmacher, A model for learned bloom filters and optimizing by sandwiching, Advances in Neural Information Processing Systems, 2018.
 [MMB17] Chris Metzler, Ali Mousavi, and Richard Baraniuk, Learned D-AMP: Principled neural network based compressive image recovery, Advances in Neural Information Processing Systems, 2017, pp. 1772–1783.
 [MPB15] Ali Mousavi, Ankit B Patel, and Richard G Baraniuk, A deep learning approach to structured signal recovery, Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, IEEE, 2015, pp. 1336–1343.
 [MY18] Yury A Malkov and Dmitry A Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE transactions on pattern analysis and machine intelligence (2018).
 [NCB17] Y Ni, K Chu, and J Bradley, Detecting abuse at scale: Locality sensitive hashing at uber engineering, 2017.
 [PSK18] Manish Purohit, Zoya Svitkina, and Ravi Kumar, Improving online algorithms via ML predictions, Advances in Neural Information Processing Systems, 2018, pp. 9661–9670.
 [PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning, GloVe: Global vectors for word representation, Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
 [SDSJ19] Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou, Spreading vectors for similarity search, International Conference on Learning Representations, 2019.
 [Spr91] Robert F Sproull, Refinements to nearest-neighbor searching in k-dimensional trees, Algorithmica 6 (1991), no. 1–6, 579–589.
 [SS13] Peter Sanders and Christian Schulz, Think Locally, Act Globally: Highly Balanced Graph Partitioning, Proceedings of the 12th International Symposium on Experimental Algorithms (SEA’13), LNCS, vol. 7933, Springer, 2013, pp. 164–175.
 [SWQ14] Yifang Sun, Wei Wang, Jianbin Qin, Ying Zhang, and Xuemin Lin, SRS: solving c-approximate nearest neighbor queries in high dimensional Euclidean space with a tiny index, Proceedings of the VLDB Endowment 8 (2014), no. 1, 1–12.
 [WGS17] Xiang Wu, Ruiqi Guo, Ananda Theertha Suresh, Sanjiv Kumar, Daniel N Holtmann-Rice, David Simcha, and Felix Yu, Multiscale quantization for fast similarity search, Advances in Neural Information Processing Systems, 2017, pp. 5745–5755.
 [WLKC16] Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang, Learning to hash for indexing big data - a survey, Proceedings of the IEEE 104 (2016), no. 1, 34–57.
 [WSSJ14] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji, Hashing for similarity search: A survey, arXiv preprint arXiv:1408.2927 (2014).
 [ZZ17] Haoyu Zhang and Qin Zhang, Embedjoin: Efficient edit similarity joins via embeddings, Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2017, pp. 585–594.
Appendix A Results for MNIST
We include experimental results for the MNIST dataset, where all experiments are performed in exactly the same way as for SIFT and GloVe. Consistent with the trend observed for SIFT and GloVe, Neural LSH outperforms k-means (see Figure 3) both in terms of the average number of candidates and especially in terms of the th quantiles. We also compare Regression LSH with recursive k-means, as well as PCA tree and random projections (see Figure 4), where Regression LSH consistently outperforms the other methods.
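To make the two reported metrics concrete, the following minimal sketch computes the per-query number of candidates and summarizes it by its average and a quantile. It is an illustration only, not the evaluation code used for the figures: the function names, the toy data, and the nearest-rank quantile convention are all assumptions introduced here.

```python
import math
from collections import Counter

def candidate_counts(point_bins, probed_bins):
    """Per-query candidate count: total size of the bins the query probes.

    point_bins[i] is the bin of data point i; probed_bins[j] lists the bins
    probed for query j (hypothetical inputs, for illustration only).
    """
    bin_sizes = Counter(point_bins)
    return [sum(bin_sizes[b] for b in set(bins)) for bins in probed_bins]

def nearest_rank_quantile(values, q):
    """q-th quantile under the nearest-rank convention (an assumed choice)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Toy example: 8 data points in 3 bins, 3 queries probing different bins.
point_bins = [0, 0, 0, 1, 1, 2, 2, 2]
probed_bins = [[0], [1, 2], [2]]

counts = candidate_counts(point_bins, probed_bins)
average = sum(counts) / len(counts)
upper = nearest_rank_quantile(counts, 0.9)
```

The quantile captures how large the candidate set gets for the hardest queries, which the average alone can hide when bin sizes are unbalanced.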
Appendix B Additional experiments
Here we describe in greater detail the three additional experiments referred to in Section 3.4.
First, we compare Neural LSH and k-means for (instead of the default setting of ). Moreover, we consider two variants of Neural LSH: one uses the NN graph for partitioning, while the other uses merely the NN graph. Figure 5(a) compares these three algorithms on GloVe for bins, reporting the average number of candidates. From this plot, we can see that for , Neural LSH convincingly outperforms k-means, and whether we use the NN or NN graph matters very little.
Second, we study the effect of varying (the soft labels parameter) for Neural LSH on GloVe for bins. See Figure 5(b), where we report the average number of candidates. As the plot shows, the setting yields much better results than the vanilla case of . However, increasing beyond has little effect on the overall accuracy.
Finally, we compare vanilla k-means with k-means run after applying a Neural Catalyzer map [SDSJ19]. The goal is to check whether the Neural Catalyzer, which was designed to boost the performance of sketching methods for NNS by adjusting the input geometry, could also improve the quality of space partitions for NNS. See Figure 6 for the comparison on GloVe and SIFT with bins. On both datasets (especially SIFT), Neural Catalyzer in fact degrades the quality of the partitions. We observed a similar trend for numbers of bins other than the setting reported here. These findings support our observation that while both indexing and sketching for NNS can benefit from learning-based enhancements, they are fundamentally different approaches and require different specialized techniques.
Appendix C Additional implementation details
We slightly modify the KaHIP partitioner to make it more efficient on the k-NN graphs. Namely, we introduce a hard threshold of on the number of iterations for the local search part of the algorithm, which speeds up the partitioning dramatically while barely affecting the quality of the resulting partitions.
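The idea of capping local search can be illustrated with a generic sketch. The code below is not KaHIP's algorithm or API: it is a simple greedy refinement of a two-way partition, written here as an assumption-laden toy, in which the hypothetical parameter max_iters plays the role of the hard threshold on the number of passes.

```python
def cut_size(adj, part):
    """Number of edges whose endpoints lie in different bins."""
    return sum(1 for u in adj for v in adj[u] if u < v and part[u] != part[v])

def capped_local_search(adj, part, max_iters, max_bin_size):
    """Greedy two-way refinement that stops after at most max_iters passes.

    A vertex moves to the other bin when that strictly reduces the cut and
    the destination bin stays within max_bin_size (the balance constraint).
    Returns the refined partition and the number of passes performed.
    """
    part = dict(part)
    sizes = {0: 0, 1: 0}
    for side in part.values():
        sizes[side] += 1
    iters = 0
    for iters in range(1, max_iters + 1):  # hard cap on the number of passes
        moved = False
        for u in adj:
            src, dst = part[u], 1 - part[u]
            if sizes[dst] + 1 > max_bin_size:
                continue
            external = sum(1 for v in adj[u] if part[v] == dst)
            internal = len(adj[u]) - external
            if external > internal:  # moving u strictly reduces the cut
                part[u] = dst
                sizes[src] -= 1
                sizes[dst] += 1
                moved = True
        if not moved:  # converged before hitting the cap
            break
    return part, iters

# Toy graph: two triangles joined by the edge (2, 3), badly partitioned.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
start = {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 1}

full, full_iters = capped_local_search(adj, start, max_iters=10, max_bin_size=4)
capped, capped_iters = capped_local_search(adj, start, max_iters=1, max_bin_size=4)
```

On this toy graph the first pass already does all the useful work, and the uncapped run spends one more pass only to confirm convergence. This mirrors, in miniature, why a cap can save time at little cost in cut quality, although the toy carries no evidence about KaHIP itself.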
                     GloVe                    SIFT
                     Averages   quantiles     Averages   quantiles
One level    bins    1.745      2.125         1.031      1.240
             bins    1.491      1.752         1.047      1.348
Two levels   bins    2.176      2.308         1.113      1.306
             bins    1.241      1.154         1.182      1.192