Learning Space Partitions for Nearest Neighbor Search

01/24/2019
by   Yihe Dong, et al.

Space partitions of R^d underlie a vast and important class of fast nearest neighbor search (NNS) algorithms. Inspired by recent theoretical work on NNS for general metric spaces [Andoni, Naor, Nikolov, Razenshteyn, Waingarten STOC 2018, FOCS 2018], we develop a new framework for building space partitions reducing the problem to balanced graph partitioning followed by supervised classification. We instantiate this general approach with the KaHIP graph partitioner [Sanders, Schulz SEA 2013] and neural networks, respectively, to obtain a new partitioning procedure called Neural Locality-Sensitive Hashing (Neural LSH). On several standard benchmarks for NNS, our experiments show that the partitions obtained by Neural LSH consistently outperform partitions found by quantization-based and tree-based methods.

1 Introduction

The Nearest Neighbor Search (NNS) problem is defined as follows. Given an n-point dataset in a d-dimensional Euclidean space R^d, we would like to preprocess it to answer k-nearest neighbor queries quickly. That is, given a query point q, we want to find the k data points in the dataset that are closest to q. NNS is a cornerstone of modern data analysis and, at the same time, a fundamental geometric data structure problem that has led to many exciting theoretical developments over the past decades. See, e.g., [WLKC16, AIR18] for an overview.

The main two approaches to constructing efficient NNS data structures are indexing and sketching. The goal of indexing is to construct a data structure that, given a query point, produces a small subset of the dataset (called the candidate set) that includes the desired neighbors. Such a data structure can be stored on a single machine, or (if the dataset is very large) distributed among multiple machines. In contrast, the goal of sketching is to compute compressed representations of points to enable computing approximate distances quickly (e.g., compact binary hash codes with the Hamming distance used as an estimator [WSSJ14, WLKC16]). Indexing and sketching can be (and often are) combined to maximize the overall performance [WGS17, JDJ17].

Both indexing and sketching have been the topic of a vast amount of theoretical and empirical literature. In this work, we consider the indexing problem. In particular, we focus on indexing based on space partitions. The overarching idea is to build a partition of the ambient space R^d and split the dataset accordingly. Given a query point q, we identify the bin containing q and form the resulting list of candidates from the data points residing in the same bin (or, to boost the accuracy, nearby bins as well). Some of the popular space partitioning methods include locality-sensitive hashing (LSH) [LJW07, AIL15, DSN17]; quantization-based approaches, where partitions are obtained via k-means clustering of the dataset [JDS11, BL12]; and tree-based methods such as random-projection trees or PCA trees [Spr91, BCG05, DS13, KS18].

Compared to other indexing methods, space partitions have multiple benefits. First, they are naturally applicable in distributed settings, as different bins can be stored on different machines [BGS12, NCB17, LCY17, BW18]. Moreover, the computational efficiency of search can be further improved by using any nearest neighbor search algorithm locally on each machine. Second, partition-based indexing is particularly suitable for GPUs due to the simple and predictable memory access pattern [JDJ17]. Finally, partitions can be combined with cryptographic techniques to yield efficient secure similarity search algorithms [CCD19]. Thus, in this paper we focus on designing space partitions that optimize the trade-off between their key metrics: the number of reported candidates, the fraction of the true nearest neighbors among the candidates, the number of bins, and the computational efficiency of the point location.

Recently, there has been a large body of work that studies how modern machine learning techniques (such as neural networks) can help tackle various classic algorithmic problems (a partial list includes [MPB15, BLS16, BJPD17, DKZ17, MMB17, KBC18, BDSV18, LV18, Mit18, PSK18]). Similar methods—under the name “learn to hash”—have been used to improve the sketching approach to NNS [WLKC16]. However, when it comes to indexing, while some unsupervised techniques such as PCA or k-means have been successfully applied, the full power of modern tools like neural networks has not yet been harnessed. This state of affairs naturally leads to the following general question: Can we employ modern (supervised) machine learning techniques to find good space partitions for nearest neighbor search?

1.1 Our contribution

In this paper we address the aforementioned challenge and present a new framework for finding high-quality space partitions of R^d. Our approach consists of three major steps:

  • Build the k-NN graph of the dataset by connecting each data point to its k nearest neighbors;

  • Find a balanced partition of the graph into parts of nearly-equal size such that the number of edges between different parts is as small as possible;

  • Obtain a partition of R^d by training a classifier on the data points, with labels given by the parts of the partition found in the second step.

See Figure 1 for an illustration. The new algorithm directly optimizes the performance of the partition-based nearest neighbor data structure. Indeed, if a query is chosen as a uniformly random data point, then the average k-NN accuracy is exactly the fraction of edges of the k-NN graph that are not cut by the partition. This generalizes to out-of-sample queries provided that the query and dataset distributions are close, and the test accuracy of the trained classifier is high.
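Below is a minimal sketch of this pipeline in Python. It is illustrative rather than the exact implementation: the k-NN graph and the classifier use scikit-learn, while the balanced graph partitioner is left as a placeholder (the paper instantiates it with KaHIP), and all function names are hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors


def build_knn_graph(X, k=10):
    """Step 1: k-NN graph as an (n, k) array of neighbor indices."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    return idx[:, 1:]  # drop the self-neighbor in the first column


def balanced_partition(knn_idx, num_bins):
    """Step 2 (placeholder): return one bin id per point, with bins of
    near-equal size and few k-NN edges crossing between bins.
    The paper instantiates this step with the KaHIP solver."""
    raise NotImplementedError


def learn_partition(X, bin_ids):
    """Step 3: extend the combinatorial partition to all of R^d by
    training a classifier on (point, bin id) pairs."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, bin_ids)
    return clf
```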

At the same time, our approach is directly related to and inspired by recent theoretical work [ANN18a, ANN18b] on NNS for general metric spaces. The two relevant contributions in these papers are as follows. First, the following structural result is shown for a large class of metric spaces (which includes Euclidean space and, more generally, all normed spaces): any graph embeddable into such a space in a way that (a) all edges are short, yet (b) no low-radius ball contains a large fraction of the vertices, must contain a sparse cut. It is natural to expect that the k-NN graph of a well-behaved dataset has these properties, which implies the existence of the desired balanced partition. The second relevant result from [ANN18a, ANN18b] shows that, under additional assumptions on the metric space, any such sparse cut in an embedded graph can be assumed to have a certain nice form, which makes it efficient to store and query. This result has strong parallels with our learning step, where we extend a graph partition to a partition of the ambient space induced by an (algorithmically nice) classifier. Unlike [ANN18a, ANN18b], where the whole space is discretized into a graph, we build a graph supported only on the dataset points and learn the extension to the ambient space using supervised learning.

The new framework is very flexible and uses partitioning and learning in a black-box way. This allows us to plug in various models (linear models, neural networks, etc.) and explore the trade-off between the quality and the algorithmic efficiency of the resulting partitions. We emphasize the importance of balanced partitions for the indexing problem, where all bins contain roughly the same number of data points. This property is crucial in the distributed setting, since we naturally would like to assign a similar number of points to each machine. Furthermore, balanced partitions allow tighter control of the number of candidates simply by varying the number of retrieved parts. Note that, a priori, it is unclear how to partition R^d so as to induce balanced bins of a given dataset. Here the combinatorial portion of our approach is particularly useful, since balanced graph partitioning is a well-studied problem, and our supervised extension to R^d naturally preserves the balance by virtue of attaining high training accuracy.

We speculate that the new method might be potentially useful for solving the NNS problem for non-Euclidean metrics, such as the edit distance [ZZ17] or the optimal transport distance [KSKW15]. Indeed, for any metric space, one can compute the k-NN graph and then partition it. The only step that needs to be adjusted to the specific metric at hand is the learning step.

Let us finally put forward the challenge of scaling our method up to billion-sized or even larger datasets. At such a scale, one needs to build an approximate k-NN graph as well as to use graph partitioning algorithms that are faster than KaHIP. We leave this exciting direction to future work.

Evaluation We instantiate our framework with the KaHIP algorithm [SS13] for the graph partitioning step, and either linear models or small-size neural networks for the learning step. We evaluate it on several standard benchmarks for NNS [ABF17] and conclude that, in terms of the quality of the resulting partitions, it consistently outperforms quantization-based and tree-based partitioning procedures, while maintaining comparable algorithmic efficiency. In the high-accuracy regime, our framework yields partitions that lead to processing up to 2.3× fewer candidates than alternative approaches.

As a baseline method we use k-means clustering [JDS11]. It produces a partition of the dataset into bins, in a way that naturally extends to all of R^d, by assigning a query point to its nearest centroid. (More generally, for multi-probe querying, we can rank the bins by the distance of their centroids to the query.) This simple scheme produces very high-quality results for indexing.
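For concreteness, a sketch of this baseline using scikit-learn's KMeans is shown below; the parameter values are illustrative defaults, not the ones used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans


class KMeansIndex:
    def __init__(self, X, num_bins=16, seed=0):
        # Partition the dataset into bins via k-means.
        self.km = KMeans(n_clusters=num_bins, n_init=10, random_state=seed).fit(X)
        labels = self.km.labels_
        self.bins = [np.flatnonzero(labels == b) for b in range(num_bins)]

    def query(self, q, num_probes=2):
        # Rank bins by the distance from the query to their centroids
        # and return the candidate ids from the closest bins.
        dists = np.linalg.norm(self.km.cluster_centers_ - q, axis=1)
        order = np.argsort(dists)[:num_probes]
        return np.concatenate([self.bins[b] for b in order])
```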

1.2 Related work

On the empirical side, currently the fastest indexing techniques for the NNS problem are graph-based [MY18]. The high-level idea is to construct a graph on the dataset (it can be the k-NN graph, but other constructions are also possible), and then for each query perform a walk that eventually converges to the nearest neighbor. Although very fast, graph-based approaches have suboptimal “locality of reference”, which makes them less suitable for several modern architectures. For instance, this is the case when the algorithm is run on a GPU [JDJ17], or when the data is stored in external memory [SWQ14] or in a distributed manner [BGS12, NCB17]. Moreover, graph-based indexing requires many rounds of adaptive access to the dataset, whereas partition-based indexing accesses the dataset in one shot. This is crucial, for example, for nearest neighbor search over encrypted data [CCD19]. These benefits justify further study of partition-based methods.

Machine learning techniques are particularly useful for the sketching approach, leading to a vast body of research under the label “learning to hash” [WSSJ14, WLKC16]. In particular, several recent works employed neural networks to obtain high-quality sketches [LLW15, SDSJ19]. The fundamental difference from our work is that sketching is designed to speed up linear scans over the dataset, by reducing the cost of distance evaluation, while indexing is designed for sublinear time searches, by reducing the number of distance evaluations. We highlight the work [SDSJ19], which uses neural networks to learn a mapping that improves the geometry of the dataset and the queries to facilitate subsequent sketching. It is natural to apply the same family of maps for partitions; however, as our experiments show, in the high accuracy regime the maps learned using the algorithm of [SDSJ19] consistently degrade the quality of partitions.

Prior work [CD07] has used learning to tune the parameters of certain structured classes of partitions, such as KD-trees or rectilinear LSH. This is substantially different from our method, which learns a much more general class of partitions, whose only structural constraint stems from the chosen learning component—say, the class of space partitions that can be learned by SVM, a neural network, and so on.

(a) Dataset
(b) k-NN graph together with a balanced partition
(c) Learned partition
Figure 1: Stages of our framework

2 Our method

Training Given a dataset of n points and a number of bins m, our goal is to find a partition of R^d into m bins with the following properties:

  • Balanced: The number of data points in each bin is not much larger than n/m (the average bin size).

  • Locality sensitive: For a typical query point q, most of its k nearest neighbors belong to the same bin of the partition. We assume that queries and data points come from similar distributions.

  • Simple: The partition should admit a compact description and, moreover, the point location process should be computationally efficient. For example, we might look for a space partition induced by hyperplanes.

First, suppose that the query is chosen as a uniformly random data point. Let G be the k-NN graph of the dataset, whose vertices are the data points and in which each vertex is connected to its k nearest neighbors. Then the above problem boils down to partitioning the vertices of G into m bins such that each bin contains roughly n/m vertices, and the number of edges crossing between different bins is as small as possible (see Figure 1(b)). This balanced graph partitioning problem is extremely well-studied, and there are available combinatorial partitioning solvers that produce very high-quality solutions. In our implementation, we use the open-source solver KaHIP [SS13], which is based on a sophisticated local search.
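As an illustration of the graph-construction step, the sketch below builds the symmetrized k-NN graph with scikit-learn and writes it in the METIS graph format, which KaHIP's command-line tools read; the exact KaHIP invocation and its options are not shown here, and the helper names are hypothetical.

```python
from sklearn.neighbors import NearestNeighbors


def knn_edges(X, k=10):
    """Undirected edge set of the k-NN graph (each edge stored once)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    edges = set()
    for u, nbrs in enumerate(idx[:, 1:]):         # skip the self-neighbor
        for v in nbrs:
            v = int(v)
            edges.add((min(u, v), max(u, v)))
    return edges


def write_metis(path, n, edges):
    """Write the graph in METIS format: header, then 1-indexed adjacency lists."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    with open(path, "w") as f:
        f.write(f"{n} {len(edges)}\n")            # header: #vertices #edges
        for nbrs in adj:
            f.write(" ".join(str(w + 1) for w in sorted(nbrs)) + "\n")
```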

More generally, we need to handle out-of-sample queries, i.e., queries that are not contained in the dataset. Consider the partition of the dataset found by the graph partitioner. To convert it into a solution to our problem, we need to extend it to a partition of the whole space R^d that works well for query points. In order to accomplish this, we train a model that, given a query point q, predicts which bin of the partition the point belongs to (see Figure 1(c)). We use the dataset as a training set, and the bins as the labels – i.e., each data point is labeled with the ID of the bin containing it. The geometric intuition for this learning step is that – even though the partition is obtained by combinatorial means, and in principle might consist of ill-behaved subsets of the dataset – in most practical scenarios we actually expect it to be close to being induced by a simple partition of the ambient space. For example, if the dataset is fairly well-distributed on the unit sphere and the number of bins is two, a balanced cut of the k-NN graph should be close to a hyperplane.

The choice of model to train depends on the desired properties of the output partition. For instance, if we are interested in a hyperplane partition, we can train a linear model using SVM or regression. In this paper, we instantiate the learning step with both linear models and small-sized neural networks. Here, there is a natural tension between the size of the model we train and the accuracy of the resulting classifier, and hence the quality of the partition we produce. A larger model yields better NNS accuracy, at the expense of computational efficiency. We discuss this more in Section 3.

Multi-probe querying Given a query point q, the trained model can be used to assign it to a bin of the partition, and we then search for nearest neighbors within the data points in that bin. In order to achieve high search accuracy, we actually train the model to predict several bins for a given query point, which are likely to contain its nearest neighbors. For neural networks, this can be done naturally by taking the several largest outputs of the last layer. By searching through more bins (in the order of preference predicted by the model) we can achieve better accuracy, allowing for a trade-off between computational resources and accuracy.
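A sketch of this multi-probe step, assuming the model outputs one score per bin and the contents of each bin are stored as an array of point ids (names are illustrative):

```python
import numpy as np


def multi_probe_candidates(bin_scores, bins, num_probes):
    """bin_scores: the model's output for the query (one score per bin).
    bins: list of arrays holding the ids of the data points in each bin."""
    top = np.argsort(-bin_scores)[:num_probes]   # bins in order of preference
    return np.concatenate([bins[b] for b in top])
```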

Hierarchical partitions When the required number of bins is large, it pays off to produce the partition in a hierarchical manner in order to improve its efficiency. Namely, we first find a partition of R^d into m bins, then recursively partition each of the bins into m bins, and so on, repeating the partitioning for several levels. The total number of bins is m raised to the number of levels. See Figure 7 in the appendix for an illustration. The advantage of such a hierarchical partition is that it is much simpler to navigate than a one-shot partition with the same total number of bins.
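A sketch of routing a query through such a two-level hierarchy, assuming one trained classifier per node with a scikit-learn-style predict() method (names are illustrative):

```python
def route(query, top_model, second_level_models):
    """Descend a two-level hierarchy: pick a top-level bin, then a bin
    inside it. For multi-probe querying one would instead keep several
    top-scoring bins at each level."""
    top_bin = top_model.predict([query])[0]
    sub_bin = second_level_models[top_bin].predict([query])[0]
    return top_bin, sub_bin
```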

2.1 Neural LSH

In one instantiation of the supervised learning component, we use neural networks with a small number of layers and constrained hidden dimensions. The exact parameters depend on the size of the training set, and are specified in the next section.

Soft labels In order to support effective multi-probe querying, we need to infer not just the bin that contains the query point, but rather a distribution over bins that are likely to contain this point and its neighbors. A multi-probe candidate list is then formed from all data points in the most likely bins.

In order to accomplish this, we use soft labels for data points, generated as follows. For a parameter λ and a data point, the soft label is the distribution of the bin containing a point chosen uniformly at random among the λ nearest neighbors of that data point (including the point itself). Now, for a predicted distribution, we seek to minimize the KL divergence between the soft label and the prediction. Intuitively, soft labels help guide the neural network with information about the ranking of multiple bins.

The parameter λ is a hyperparameter that needs to be tuned. We study its setting in Section 3.4.
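A sketch of the soft-label construction and the corresponding KL-divergence objective, assuming a precomputed array of k-NN indices and using PyTorch for the loss; the helper names and the parameter name lam are illustrative.

```python
import numpy as np
import torch
import torch.nn.functional as F


def soft_labels(bin_ids, knn_idx, num_bins, lam):
    """For each point, the empirical distribution of bins among its lam
    nearest neighbors (the point itself plus its lam - 1 closest neighbors)."""
    n = len(bin_ids)
    y = np.zeros((n, num_bins))
    for i in range(n):
        group = np.concatenate(([i], knn_idx[i, :lam - 1]))
        for j in group:
            y[i, bin_ids[j]] += 1.0 / len(group)
    return y


def kl_loss(logits, soft_y):
    """KL divergence between the soft labels and the predicted distribution."""
    target = torch.as_tensor(soft_y, dtype=logits.dtype)
    log_pred = F.log_softmax(logits, dim=1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```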

3 Experiments

Datasets For the experimental evaluation, we use three standard ANN benchmarks [ABF17]: SIFT (image descriptors, 1M 128-dimensional points), GloVe (word embeddings [PSM14], approximately 1.2M 100-dimensional points, normalized), and MNIST (images of digits, 60K 784-dimensional points). All three datasets come with query points, which we use for evaluation. We include the results for SIFT and GloVe in the main text, and MNIST in Appendix A.

Evaluation metrics We mainly investigate the trade-off between the number of candidates generated for a query point and the k-NN accuracy, defined as the fraction of its k nearest neighbors that are among those candidates. The number of candidates determines the processing time of an individual query. Over the entire query set, we report both the average and a high (tail) quantile of the number of candidates. The former measures the throughput of the data structure (the number of queries per second), while the latter measures its latency (the maximum time per query, modulo a small fraction of outliers). We mostly focus on parameter regimes that lead to high k-NN accuracy. In virtually all of our experiments, k = 10.
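These metrics can be computed as in the following sketch; the quantile level and the variable names are illustrative.

```python
import numpy as np


def evaluate(candidates_per_query, true_knn_per_query, quantile=0.95):
    """candidates_per_query / true_knn_per_query: lists of id arrays."""
    sizes, accuracies = [], []
    for cand, true_knn in zip(candidates_per_query, true_knn_per_query):
        sizes.append(len(cand))
        accuracies.append(len(set(cand) & set(true_knn)) / len(true_knn))
    return {
        "avg_candidates": float(np.mean(sizes)),                 # throughput proxy
        "tail_candidates": float(np.quantile(sizes, quantile)),  # latency proxy
        "knn_accuracy": float(np.mean(accuracies)),
    }
```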

Methods evaluated We evaluate two variants of our method, corresponding to two different choices of the supervised learning component in our framework.

  • Neural LSH: In this variant we use small neural networks. Their exact architecture is detailed in the next section. We compare Neural LSH to partitions obtained by k-means clustering. As mentioned in Section 1, this method produces high-quality partitions of the dataset that naturally extend to all of R^d, and other existing methods we have tried (such as LSH) did not match its performance. We evaluate partitions with two different numbers of bins (see Section 3.1). We test both one-level (non-hierarchical) and two-level (hierarchical) partitions. Queries are multi-probe.

  • Regression LSH: This variant uses logistic regression as the supervised learning component and, as a result, produces very simple partitions induced by hyperplanes. We compare this method with PCA trees [Spr91, KZN08, AAKK14], random projection trees [DS13], and recursive bisections using 2-means clustering. We build trees of hierarchical bisections; the total number of leaves is 2 raised to the tree depth. The query procedure descends a single root-to-leaf path and returns the candidates in that leaf.

3.1 Implementation details

Neural LSH uses a fixed neural network architecture for the top-level partition, and a fixed architecture for all second-level partitions. Both architectures consist of several blocks, where each block is a fully-connected layer + batch normalization [IS15] + ReLU activations. The final block is followed by a fully-connected layer and a softmax layer. The resulting network predicts a distribution over the bins of the partition. The only difference between the top-level and the second-level network architectures is the number of blocks and the size of the hidden layers. To reduce overfitting, we use dropout during training. The networks are trained using the Adam optimizer [KB15] for a bounded number of epochs on both levels. We reduce the learning rate multiplicatively at regular intervals. We use the Glorot initialization to generate the initial weights. To tune soft labels, we try several values of λ.
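A sketch of such an architecture in PyTorch is shown below; the number of blocks, hidden width, and dropout rate are illustrative placeholders rather than the values used in the experiments.

```python
import torch
import torch.nn as nn


class BinClassifier(nn.Module):
    def __init__(self, dim, num_bins, hidden=512, num_blocks=3, dropout=0.1):
        super().__init__()
        layers, d = [], dim
        for _ in range(num_blocks):
            # One block: fully-connected layer + batch norm + ReLU.
            layers += [nn.Linear(d, hidden), nn.BatchNorm1d(hidden), nn.ReLU()]
            d = hidden
        layers += [nn.Dropout(dropout), nn.Linear(d, num_bins)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Log-probabilities over bins; take the top entries for multi-probe.
        return torch.log_softmax(self.net(x), dim=1)
```

In training, such a model would be paired with the Adam optimizer and the soft-label KL objective sketched in Section 2.1.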

We evaluate two settings for the number of bins in each level; in the two-level experiments, the total number of bins is the square of the per-level number. In the two-level setting with the larger number of bins per level, the bottom level of Neural LSH uses k-means instead of a neural network, to avoid overfitting when the number of points per bin is tiny. In the other configurations (two levels with the smaller number of bins, and one level with either setting) we use Neural LSH at all levels.

3.2 Comparison with k-means

Figure 2 shows the empirical comparison of Neural LSH with k-means. The points shown are those that attained the target accuracy. We note that two-level partitioning is the best-performing configuration of k-means for both SIFT and GloVe, in terms of the minimum number of candidates that attains the target accuracy; thus we evaluate the baseline at its optimal performance. However, if one wishes to use partitions to split points across machines to build a distributed NNS data structure, then a single-level setting seems more suitable.

In all settings considered, Neural LSH yields consistently better partitions than k-means. Depending on the setting, k-means requires significantly more candidates to achieve the same accuracy:

  • Up to 2.2× more for the average number of candidates for GloVe;

  • Up to 2.3× more for the tail quantiles of the number of candidates for GloVe;

  • Up to 1.2× more for the average number of candidates for SIFT;

  • Up to 1.35× more for the tail quantiles of the number of candidates for SIFT.

Figure 8 in the appendix lists the largest multiplicative advantage in the number of candidates of Neural LSH compared to k-means in the high-accuracy regime. Specifically, for every configuration of k-means, we compute the ratio between the number of candidates in that configuration and the number of candidates of Neural LSH in its optimal configuration, among those that attained at least the same accuracy as that k-means configuration.

We also note that in all settings except the two-level partitioning with the larger number of bins per level (as mentioned earlier, in this setting Neural LSH uses k-means at the second level due to the large overall number of bins compared to the size of the datasets, which explains why the gap between the average and the tail-quantile number of candidates is larger for this setting), Neural LSH produces partitions for which the tail quantiles of the number of candidates are very close to the average number of candidates, indicating very little variance of query time across different query points. In contrast, the respective gap for the partitions produced by k-means is much larger, since, unlike Neural LSH, k-means does not directly favor balanced partitions. This implies that Neural LSH might be particularly suitable for latency-critical NNS applications.

Figure 2: Comparison of Neural LSH with k-means on GloVe and SIFT, for one-level and two-level partitions with both bin settings (in the two-level setting with the larger number of bins, k-means is used at the second level); x-axis is the number of candidates, y-axis is the k-NN accuracy

Model sizes. The largest model learned by Neural LSH is equivalent to storing a moderate number of additional data points for SIFT or GloVe. This is considerably larger than k-means, which stores only the centroids. Nonetheless, we believe the larger model size is acceptable for Neural LSH, for the following reasons. First, in most NNS applications, especially in the distributed setting, the bottleneck in the high-accuracy regime is the memory accesses needed to retrieve candidates and the further processing (such as distance computations, exact or approximate). The model size is not a hindrance as long as it does not exceed certain reasonable limits (e.g., it should fit into a CPU cache). Neural LSH significantly reduces the memory access cost, while increasing the model size by an acceptable amount. Second, we have observed that the quality of the Neural LSH partitions is not too sensitive to decreasing the sizes of the hidden layers. The model sizes we report are, for the sake of concreteness, the largest ones that still lead to improved performance. Larger models do not increase the accuracy, and sometimes decrease it due to overfitting.

3.3 Comparison with tree-based methods

Next we compare binary decision trees, where in each tree node a hyperplane is used to determine which of the two subtrees to descend into. We generate hyperplanes via multiple methods: Regression LSH, cutting the dataset into two equal halves along the top PCA direction [Spr91, KZN08], 2-means clustering, and random projections of the centered dataset [DS13, KS18]. We build trees of bounded depth, which corresponds to hierarchical partitions whose total number of bins is 2 raised to the depth. We summarize the results for the GloVe and SIFT datasets in Figure 9 (see appendix). For random projections, we run each configuration several times and average the results.
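The sketch below shows the shared structure of these tree baselines, instantiated with the Regression LSH variant: each node stores a logistic-regression hyperplane fit to a 0/1 labeling of the node's points, where the labeling comes from whichever bisection method is used (2-means, a balanced graph cut, etc.); ids is assumed to be a NumPy array of point indices, and the helper names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def build_tree(X, ids, bisect_labels, depth):
    """bisect_labels(X_subset) -> 0/1 labels giving a (roughly) balanced
    bisection of the given points."""
    if depth == 0 or len(ids) <= 1:
        return {"leaf": ids}
    y = bisect_labels(X[ids])
    if len(np.unique(y)) < 2:
        return {"leaf": ids}
    clf = LogisticRegression(max_iter=1000).fit(X[ids], y)   # the hyperplane
    return {
        "clf": clf,
        "left": build_tree(X, ids[y == 0], bisect_labels, depth - 1),
        "right": build_tree(X, ids[y == 1], bisect_labels, depth - 1),
    }


def query(tree, q):
    while "leaf" not in tree:
        side = tree["clf"].predict(q.reshape(1, -1))[0]
        tree = tree["left"] if side == 0 else tree["right"]
    return tree["leaf"]   # candidate ids in the reached leaf
```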

For GloVe, Regression LSH significantly outperforms 2-means, while for SIFT, Regression LSH essentially matches 2-means in terms of the average number of candidates, but shows a noticeable advantage in terms of the tail quantiles. In both instances, Regression LSH significantly outperforms the PCA tree, and all of the above methods dramatically improve upon random projections.

Note, however, that random projections have an additional benefit: in order to boost search accuracy, one can simply repeat the sampling process several times and generate an ensemble of decision trees instead of a single tree. This allows making each individual tree relatively deep, which decreases the overall number of candidates, trading space for query time. The other considered approaches (Regression LSH, 2-means, PCA tree) are inherently deterministic, and boosting their accuracy requires more care: for instance, one can use partitioning into blocks as in [JDS11], or alternative approaches like [KS18]. Since we focus on individual partitions and not ensembles, we leave this issue out of scope.

3.4 Additional experiments

We perform several additional experiments that we describe in greater detail in the appendix.

  • We evaluate the k-NN accuracy of Neural LSH for a larger value of k when the partitioning step is run on either the k-NN graph for that larger k or the default, smaller k-NN graph. (Neural LSH can solve k-NNS by partitioning a k'-NN graph for any k'; the two values do not have to be equal.) Both settings outperform k-means, and the gap between the two graph choices is negligible, which indicates the robustness of Neural LSH.

  • We study the effect of tuning the soft-label parameter λ. Setting λ larger than 1 (i.e., using genuinely soft labels) is immensely beneficial compared to hard labels, but beyond a certain point we start observing diminishing returns.

  • We evaluate the effect of the Neural Catalyzer [SDSJ19] on the partitions produced by k-means.

4 Conclusions and future directions

We presented a new technique for finding partitions of R^d that support high-performance indexing for sublinear-time NNS. It proceeds in two major steps: (1) we perform a combinatorial balanced partitioning of the k-NN graph of the dataset; (2) we extend the resulting partition to the whole ambient space by using supervised classification (logistic regression, neural networks, etc.). Our experiments show that the new approach consistently outperforms quantization-based and tree-based partitions. There are a number of exciting open problems we would like to highlight:

  • Can we use our approach for NNS over non-Euclidean geometries, such as the edit distance [ZZ17] or the optimal transport distance [KSKW15]? The graph partitioning step directly carries through, but the learning step may need to be adjusted.

  • Can we jointly optimize a graph partition and a classifier at the same time? By making the two components aware of each other, we expect the quality of the resulting partition of R^d to improve.

  • Can our approach be extended to learning several high-quality partitions that complement each other? Such an ensemble might be useful to trade query time for memory usage [ALRW17].

  • Can we use machine learning techniques to improve graph-based indexing techniques [MY18] for NNS? (This is in contrast to partition-based indexing, as done in this work).

  • Our framework is an example of combinatorial tools aiding “continuous” learning techniques. A more open-ended question is whether other problems can benefit from such symbiosis.

References

  • [AAKK14] Amirali Abdullah, Alexandr Andoni, Ravindran Kannan, and Robert Krauthgamer, Spectral approaches to nearest neighbor search, arXiv preprint arXiv:1408.0751 (2014).
  • [ABF17] Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull, Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms, International Conference on Similarity Search and Applications, Springer, 2017, pp. 34–49.
  • [AIL15] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt, Practical and optimal lsh for angular distance, Advances in Neural Information Processing Systems, 2015, pp. 1225–1233.
  • [AIR18] Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn, Approximate nearest neighbor search in high dimensions, arXiv preprint arXiv:1806.09823 (2018).
  • [ALRW17] Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, and Erik Waingarten, Optimal hashing-based time-space trade-offs for approximate near neighbors, Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2017, pp. 47–66.
  • [ANN18a] Alexandr Andoni, Assaf Naor, Aleksandar Nikolov, Ilya Razenshteyn, and Erik Waingarten, Data-dependent hashing via nonlinear spectral gaps, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (2018), pp. 787–800.

  • [ANN18b] Alexandr Andoni, Assaf Naor, Aleksandar Nikolov, Ilya Razenshteyn, and Erik Waingarten, Hölder homeomorphisms and approximate nearest neighbors, 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2018, pp. 159–169.
  • [BCG05] Mayank Bawa, Tyson Condie, and Prasanna Ganesan, Lsh forest: self-tuning indexes for similarity search, Proceedings of the 14th international conference on World Wide Web, ACM, 2005, pp. 651–660.
  • [BDSV18] Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik, Learning to branch, International Conference on Machine Learning, 2018.
  • [BGS12] Bahman Bahmani, Ashish Goel, and Rajendra Shinde, Efficient distributed locality sensitive hashing, Proceedings of the 21st ACM international conference on Information and knowledge management, ACM, 2012, pp. 2174–2178.
  • [BJPD17] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis, Compressed sensing using generative models, International Conference on Machine Learning, 2017, pp. 537–546.
  • [BL12] Artem Babenko and Victor Lempitsky, The inverted multi-index, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 3069–3076.

  • [BLS16] Luca Baldassarre, Yen-Huan Li, Jonathan Scarlett, Baran Gözcü, Ilija Bogunovic, and Volkan Cevher, Learning-based compressive subsampling, IEEE Journal of Selected Topics in Signal Processing 10 (2016), no. 4, 809–822.
  • [BW18] Aditya Bhaskara and Maheshakya Wijewardena, Distributed clustering via lsh based data partitioning, International Conference on Machine Learning, 2018, pp. 569–578.
  • [CCD19] Hao Chen, Ilaria Chillotti, Yihe Dong, Oxana Poburinnaya, Ilya Razenshteyn, and M Sadegh Riazi, Sanns: Scaling up secure approximate k-nearest neighbors search, arXiv preprint arXiv:1904.02033 (2019).
  • [CD07] Lawrence Cayton and Sanjoy Dasgupta, A learning framework for nearest neighbor search, Advances in Neural Information Processing Systems, 2007, pp. 233–240.
  • [DKZ17] Hanjun Dai, Elias Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song, Learning combinatorial optimization algorithms over graphs, Advances in Neural Information Processing Systems, 2017, pp. 6351–6361.
  • [DS13] Sanjoy Dasgupta and Kaushik Sinha, Randomized partition trees for exact nearest neighbor search, Conference on Learning Theory, 2013, pp. 317–337.
  • [DSN17] Sanjoy Dasgupta, Charles F Stevens, and Saket Navlakha, A neural algorithm for a fundamental computing problem, Science 358 (2017), no. 6364, 793–796.
  • [IS15] Sergey Ioffe and Christian Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167 (2015).
  • [JDJ17] Jeff Johnson, Matthijs Douze, and Hervé Jégou, Billion-scale similarity search with gpus, arXiv preprint arXiv:1702.08734 (2017).
  • [JDS11] Herve Jégou, Matthijs Douze, and Cordelia Schmid, Product quantization for nearest neighbor search, IEEE transactions on pattern analysis and machine intelligence 33 (2011), no. 1, 117–128.
  • [KB15] Diederik Kingma and Jimmy Ba, Adam: A method for stochastic optimization, International Conference for Learning Representations, 2015.
  • [KBC18] Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis, The case for learned index structures, Proceedings of the 2018 International Conference on Management of Data, ACM, 2018, pp. 489–504.
  • [KS18] Omid Keivani and Kaushik Sinha, Improved nearest neighbor search using auxiliary information and priority functions, International Conference on Machine Learning, 2018, pp. 2578–2586.
  • [KSKW15] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger, From word embeddings to document distances, International Conference on Machine Learning, 2015, pp. 957–966.
  • [KZN08] Neeraj Kumar, Li Zhang, and Shree Nayar, What is a good nearest neighbors algorithm for finding similar patches in images?, European conference on computer vision, Springer, 2008, pp. 364–378.
  • [LCY17] Jinfeng Li, James Cheng, Fan Yang, Yuzhen Huang, Yunjian Zhao, Xiao Yan, and Ruihao Zhao, Losha: A general framework for scalable locality sensitive hashing, Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2017, pp. 635–644.
  • [LJW07] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li, Multi-probe lsh: efficient indexing for high-dimensional similarity search, Proceedings of the 33rd international conference on Very large data bases, VLDB Endowment, 2007, pp. 950–961.
  • [LLW15] Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou, Deep hashing for compact binary codes learning, Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2475–2483.
  • [LV18] Thodoris Lykouris and Sergei Vassilvitskii, Competitive caching with machine learned advice, International Conference on Machine Learning, 2018.
  • [Mit18] Michael Mitzenmacher, A model for learned bloom filters and optimizing by sandwiching, Advances in Neural Information Processing Systems, 2018.
  • [MMB17] Chris Metzler, Ali Mousavi, and Richard Baraniuk, Learned d-amp: Principled neural network based compressive image recovery, Advances in Neural Information Processing Systems, 2017, pp. 1772–1783.
  • [MPB15] Ali Mousavi, Ankit B Patel, and Richard G Baraniuk, A deep learning approach to structured signal recovery, Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, IEEE, 2015, pp. 1336–1343.
  • [MY18] Yury A Malkov and Dmitry A Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE transactions on pattern analysis and machine intelligence (2018).
  • [NCB17] Y Ni, K Chu, and J Bradley, Detecting abuse at scale: Locality sensitive hashing at uber engineering, 2017.
  • [PSK18] Manish Purohit, Zoya Svitkina, and Ravi Kumar, Improving online algorithms via ml predictions, Advances in Neural Information Processing Systems, 2018, pp. 9661–9670.
  • [PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning, Glove: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

  • [SDSJ19] Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Herve Jégou, Spreading vectors for similarity search, International Conference on Learning Representations, 2019.
  • [Spr91] Robert F Sproull, Refinements to nearest-neighbor searching in k-dimensional trees, Algorithmica 6 (1991), no. 1-6, 579–589.
  • [SS13] Peter Sanders and Christian Schulz, Think Locally, Act Globally: Highly Balanced Graph Partitioning, Proceedings of the 12th International Symposium on Experimental Algorithms (SEA’13), LNCS, vol. 7933, Springer, 2013, pp. 164–175.
  • [SWQ14] Yifang Sun, Wei Wang, Jianbin Qin, Ying Zhang, and Xuemin Lin, Srs: solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index, Proceedings of the VLDB Endowment 8 (2014), no. 1, 1–12.
  • [WGS17] Xiang Wu, Ruiqi Guo, Ananda Theertha Suresh, Sanjiv Kumar, Daniel N Holtmann-Rice, David Simcha, and Felix Yu, Multiscale quantization for fast similarity search, Advances in Neural Information Processing Systems, 2017, pp. 5745–5755.
  • [WLKC16] Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang, Learning to hash for indexing big data - a survey, Proceedings of the IEEE 104 (2016), no. 1, 34–57.
  • [WSSJ14] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji, Hashing for similarity search: A survey, arXiv preprint arXiv:1408.2927 (2014).
  • [ZZ17] Haoyu Zhang and Qin Zhang, Embedjoin: Efficient edit similarity joins via embeddings, Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2017, pp. 585–594.

Appendix A Results for MNIST

We include experimental results for the MNIST dataset, where all the experiments are performed exactly in the same way as for SIFT and GloVe. Consistent with the trend we observed for SIFT and GloVe, Neural LSH consistently outperforms k-means (see Figure 3), both in terms of the average number of candidates and especially in terms of the tail quantiles. We also compare Regression LSH with recursive 2-means, as well as the PCA tree and random projections (see Figure 4), where Regression LSH consistently outperforms the other methods.

Figure 3: MNIST, comparison of Neural LSH with k-means for one-level and two-level partitions; x-axis is the number of candidates, y-axis is the k-NN accuracy.
Figure 4: MNIST, comparison of decision trees built from hyperplanes; x-axis is the number of candidates, y-axis is the k-NN accuracy

Appendix B Additional experiments

Here we describe in greater detail the three additional experiments that we referred to in Section 3.4.

First, we compare Neural LSH and k-means for a larger value of k than the default. Moreover, we consider two variants of Neural LSH: one uses the k-NN graph for that larger k for partitioning, while the other uses merely the default, smaller k-NN graph. Figure 5(a) compares these three algorithms on GloVe, reporting average numbers of candidates. From this plot, we can see that Neural LSH convincingly outperforms k-means for the larger k as well, and whether we use the larger or the smaller k-NN graph matters very little.

Second, we study the effect of varying the soft-label parameter λ for Neural LSH on GloVe. See Figure 5(b), where we report the average number of candidates. As we can see from the plot, using genuinely soft labels (λ > 1) yields much better results compared to the vanilla case of hard labels. However, increasing λ beyond a certain point has little effect on the overall accuracy.

Finally, we compare vanilla k-means with k-means run after applying a Neural Catalyzer map [SDSJ19]. The goal is to check whether the Neural Catalyzer, which has been designed to boost the performance of sketching methods for NNS by adjusting the input geometry, can also improve the quality of space partitions for NNS. See Figure 6 for the comparison on GloVe and SIFT. On both datasets (especially SIFT) the Neural Catalyzer in fact degrades the quality of the partitions. We observed a similar trend for other numbers of bins than the setting reported here. These findings support our observation that while both indexing and sketching for NNS can benefit from learning-based enhancements, they are fundamentally different approaches and require different specialized techniques.

Figure 5: Effect of various hyperparameters on GloVe, one level: (a) k-NN accuracy when partitioning either the larger or the default k-NN graph; (b) varying the soft-label parameter λ
Figure 6: Comparison of k-means and Catalyzer + k-means: (a) GloVe, one level; (b) SIFT, one level

Appendix C Additional implementation details

We slightly modify the KaHIP partitioner to make it more efficient on k-NN graphs. Namely, we introduce a hard threshold on the number of iterations for the local search part of the algorithm, which speeds up the partitioning dramatically while barely affecting the quality of the resulting partitions.

Figure 7: A hierarchical partition. Internal nodes correspond to partitions and leaves correspond to the bins of the dataset. The multi-probe query procedure, which descends into several bins, may visit the bins marked in bold.
                                    GloVe                       SIFT
                            Averages  Tail quantiles    Averages  Tail quantiles
One level, bin setting 1      1.745       2.125           1.031       1.240
One level, bin setting 2      1.491       1.752           1.047       1.348
Two levels, bin setting 1     2.176       2.308           1.113       1.306
Two levels, bin setting 2     1.241       1.154           1.182       1.192
Figure 8: Largest ratio between the number of candidates of k-means and of Neural LSH over the settings where both attain the same target k-NN accuracy, in the high-accuracy regime. See details in Section 3.2.
Figure 9: Comparison of decision trees built from hyperplanes: the left plot is GloVe, the right plot is SIFT; x-axis is the number of candidates, y-axis is the k-NN accuracy