Lucene for Approximate Nearest-Neighbors Search on Arbitrary Dense Vectors

by Tommaso Teofili, et al.

We demonstrate three approaches for adapting the open-source Lucene search library to perform approximate nearest-neighbor search on arbitrary dense vectors, using similarity search on word embeddings as a case study. At its core, Lucene is built around inverted indexes of a document collection's (sparse) term-document matrix, which is incompatible with the lower-dimensional dense vectors that are common in deep learning applications. We evaluate three techniques to overcome these challenges that can all be natively integrated into Lucene: the creation of documents populated with fake words, LSH applied to lexical realizations of dense vectors, and k-d trees coupled with dimensionality reduction. Experiments show that the "fake words" approach represents the best balance between effectiveness and efficiency. These techniques are integrated into the Anserini open-source toolkit and made available to the community.




1 Introduction

The open-source Lucene search library is the most widely adopted solution for developers seeking to build production search applications. Although commercial search engine companies such as Google and Bing deploy custom infrastructure, most organizations today, including Apple, Bloomberg, Reddit, Twitter, and Wikipedia, use Lucene, typically via Solr or Elasticsearch. There is, however, one important missing feature in Lucene: the ability to perform nearest-neighbor search on arbitrary vectors. Our work addresses this gap.

With the advent of deep learning and neural approaches to both natural language processing and information retrieval, this is a major shortcoming of Lucene. Such a feature is needed, for example, to look up similar words based on word embeddings. Additionally, researchers have been developing neural models that directly attempt to minimize some simple metric (e.g., cosine distance) between “queries” and “documents” for retrieval tasks [4, 12, 5], which require fast nearest-neighbor search on collections of arbitrary vectors.

At its core, Lucene is built around inverted indexes of a document collection’s term–document matrix. Since the feature space comprises the vocabulary, the vectors are very sparse. In contrast, deep learning applications mostly use dense vectors, typically only a few hundred dimensions (e.g., word embeddings), which are not directly compatible with inverted indexes. Similarity search typically requires an entirely different set of techniques, most often based on some variant of locality-sensitive hashing [3, 2]. As a result of these fundamental differences, systems that require both capabilities—for example, a ranking architecture that uses inverted indexes for candidate generation followed by a model exploiting vector similarity—typically cobble together heterogeneous components.

What if this wasn’t necessary? Our demonstration explores techniques for performing approximate nearest-neighbor search on dense vectors directly in Lucene. We examine three approaches to the specific problem of retrieving similar word embedding vectors. Experimental results show that the “fake words” approach provides reasonable effectiveness and efficiency. Although, admittedly, our solutions lack elegance, they can be directly implemented in Lucene without any external dependencies.

2 Methods

We examine three techniques for implementing approximate nearest-neighbor search on dense vectors within Lucene, outlined below:

“Fake words”. We implement the approach described in Amato et al. [1], which encodes the features of a vector as a number of “fake” terms proportional to the feature value according to the following scheme: Given a vector $w$, each feature $w_i$ is associated with a unique alphanumeric term $\tau_i$, so that the document corresponding to the vector $w$ is represented by fake words generated by repeating each term $\tau_i$ $\lfloor Q \cdot w_i \rfloor$ times, where $Q$ is a quantization factor. Thus, the fake words encoding maintains direct proportionality between the float value of a feature and the term frequency of the corresponding fake index term. Feature-level matching for retrieval is achieved by matching on these fake words with scores computed by Lucene’s ClassicSimilarity (a tf-idf variant). Finally, for this approach to be effective, vector inner products have to be equivalent to cosine similarity, which can be achieved by normalizing the vectors to unit length.
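The encoding can be illustrated with a minimal Python sketch. It is not the Anserini implementation: the term names ("f1", "f2", ...), the function names, and the choice to drop components with a non-positive count are assumptions made for illustration, and the score is a plain term-frequency dot product rather than Lucene's ClassicSimilarity.

```python
import math
from collections import Counter


def fake_words(vector, q):
    """Encode a dense vector as a list of repeated fake terms.

    The vector is first normalized to unit length. Feature i is then mapped
    to the hypothetical term "f<i>", repeated floor(q * w_i) times.
    Assumption of this sketch: components whose count is not positive
    (including all negative components) contribute no terms.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return []
    tokens = []
    for i, x in enumerate(vector, start=1):
        count = math.floor(q * x / norm)
        if count > 0:
            tokens.extend([f"f{i}"] * count)
    return tokens


def tf_dot(tokens_a, tokens_b, q):
    """Dot product of the two term-frequency vectors, rescaled by q**2.

    For unit-length inputs this approximates cosine similarity, up to the
    rounding introduced by floor().
    """
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    return sum(tf_a[t] * tf_b[t] for t in tf_a) / (q * q)


doc = fake_words([3.0, 4.0], q=10)
query = fake_words([4.0, 3.0], q=10)
print(Counter(doc))
print(tf_dot(doc, query, q=10))
```

A larger quantization factor reduces the rounding error of the approximation but produces more terms per document, which is the effectiveness-efficiency tradeoff the factor controls.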

“Lexical LSH”. We implement an approach that lexically quantizes vector components for easy indexing and search in Lucene using LSH. Given a vector $w$, each feature $w_i$ is rounded to the first decimal place and tagged with its feature index $i$. For example, a first feature that rounds to 0.1 is realized as the token 1_0.1, and the remaining features are realized analogously. Optionally, tokens are aggregated into $n$-grams and finally passed to an LSH function (already implemented in Lucene as MinHashFilter) to hash the tokens (or $n$-grams) into a configurable number of buckets; see Gionis et al. [3]. Thus, the vector $w$ is represented as a set of LSH-generated text signatures for tagged and quantized feature $n$-grams.
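The pipeline of tagging, optional n-gram aggregation, and hashing can be sketched in Python as follows. This is a simplified, generic MinHash written for illustration, not a port of Lucene's MinHashFilter; the function names, the SHA-256 based hash, the signature term format, and the example vector values are all assumptions.

```python
import hashlib


def lexical_tokens(vector):
    """Round each feature to one decimal place and tag it with its 1-based index."""
    # Adding 0.0 turns a rounded -0.0 into 0.0 so both render as "0.0".
    return [f"{i}_{round(x, 1) + 0.0:.1f}" for i, x in enumerate(vector, start=1)]


def shingles(tokens, n):
    """Join every run of n consecutive tokens into one n-gram string."""
    return [" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))]


def _hash(text, seed):
    """Seeded 64-bit hash; the seed selects one member of the hash family."""
    digest = hashlib.sha256(f"{seed}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes, num_buckets):
    """Emit one signature term per hash function: the bucket of the minimum hash."""
    signature = []
    for seed in range(num_hashes):
        smallest = min(_hash(item, seed) for item in items)
        signature.append(f"{seed}_{smallest % num_buckets}")
    return signature


# Illustrative values only: two vectors that agree after rounding.
tokens_a = lexical_tokens([0.12, 0.43, 0.74])
tokens_b = lexical_tokens([0.14, 0.41, 0.69])
sig_a = minhash_signature(shingles(tokens_a, 2), num_hashes=4, num_buckets=512)
sig_b = minhash_signature(shingles(tokens_b, 2), num_hashes=4, num_buckets=512)
print(tokens_a)
print(sig_a == sig_b)
```

Vectors that share many quantized tokens tend to share signature terms, so an ordinary term match on the signatures serves as a proxy for vector similarity.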

k–d trees. We leverage Lucene’s existing capability to index multidimensional points composed of floating-point values, which is based on k–d trees, to perform nearest-neighbor search. The Lucene implementation is currently limited to at most eight dimensions, and therefore k–d trees can only be used after dimensionality reduction. To accomplish this, we use either PCA [10] or post-processing from Mu et al. [7] combined with PCA, as in Raunak [9].
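For readers unfamiliar with the data structure, the following is a textbook k-d tree with exact nearest-neighbor search in Python. It is a sketch under stated assumptions, not Lucene's point index: the dictionary-based node layout and the function names are hypothetical, and the dimensionality reduction step (PCA) is assumed to have been applied to the points beforehand.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Recursively split the points on the median of one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return (squared distance, point) of the stored point closest to target."""
    if node is None:
        return best
    dist = squared_distance(node["point"], target)
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if diff * diff < best[0]:
        best = nearest(far, target, best)
    return best


tree = build_kdtree([(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (-1.0, 3.0)])
print(nearest(tree, (0.9, 0.8)))
```

The pruning test becomes less effective as the number of dimensions grows, which is one reason such structures are normally applied to low-dimensional points.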

3 Experiments

We choose nearest-neighbor search on dense word embedding vectors as our representative task for evaluation. Specifically, we consider word2vec [6], trained on a GoogleNews corpus, and GloVe [8], trained on a Twitter corpus, both with 300-dimensional vectors.

All techniques discussed in the previous section are implemented in the Anserini toolkit [11] and are released along with this demonstration. For the “fake words” and “lexical LSH” approaches, there are a number of parameters that control effectiveness–efficiency tradeoffs, which we tune specifically for word2vec and GloVe. We argue that the usual practice of separating training and test sets does not apply here, because the word embeddings are static and provided in advance; there is thus no reason why a researcher would not tune parameters optimally for the specific corpus.

One more implementation detail is worth mentioning: for the “fake words” and “lexical LSH” approaches, a large number of terms are generated at indexing time, which significantly reduces search performance. To combat this, we filter high-frequency terms at search time. Once again, this filtering threshold is tuned per collection, and we observe that this technique yields both efficiency and effectiveness gains.
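One plausible form of this filtering is sketched below in Python. The document-frequency-ratio criterion, the threshold parameter `max_df_ratio`, and the function names are assumptions for illustration; the text above does not specify how the filter is implemented in Anserini.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def filter_frequent_terms(query_tokens, df, num_docs, max_df_ratio):
    """Drop query terms that occur in more than max_df_ratio of all documents."""
    return [t for t in query_tokens if df[t] / num_docs <= max_df_ratio]


# Toy index of four documents made of hypothetical fake terms.
docs = [["f1", "f1", "f2"], ["f1", "f3"], ["f1", "f2"], ["f1"]]
df = document_frequencies(docs)
print(filter_frequent_terms(["f1", "f1", "f2", "f3"], df, len(docs), max_df_ratio=0.5))
```

Terms that appear in nearly every document have long postings lists and little discriminative power, so removing them from the query shortens traversal while changing the ranking very little.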

Our techniques are evaluated in terms of top-$k$ recall at retrieval depth $d$, which we abbreviate as R@$(k, d)$. For a given query vector $q$, the goal is to retrieve its top $k$ most similar vectors (in terms of cosine similarity), where we determine the “ground truth” by brute force (in the case of k–d trees, on the original vectors). For each technique, Lucene can retrieve ranked results to any arbitrary depth $d$. As is common, setting $d > k$ allows a refinement step (which we did not implement) where the actual similarity of all retrieved vectors can be computed and the vectors reranked to produce a final top-$k$ ranked list.
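The metric and the optional refinement step can be written down in a few lines of Python. This is a sketch with hypothetical function names; vectors are identified by their position in a list, and the retrieved ranking is assumed to come from whichever approximate technique is being evaluated.

```python
import math


def cosine(a, b):
    """Cosine similarity; zero vectors are given similarity 0.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brute_force_top_k(query, vectors, k):
    """Ground truth: indices of the k vectors most similar to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]


def recall_at(truth_top_k, retrieved, d):
    """Fraction of the true top-k found in the first d retrieved results."""
    return len(set(truth_top_k) & set(retrieved[:d])) / len(truth_top_k)


def refine(query, vectors, retrieved, k):
    """Rerank the retrieved candidates by exact cosine similarity and keep k."""
    return sorted(retrieved, key=lambda i: cosine(query, vectors[i]), reverse=True)[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
truth = brute_force_top_k([1.0, 0.0], vectors, k=2)
approximate_ranking = [2, 1, 0, 3]  # made-up output of an approximate method
print(recall_at(truth, approximate_ranking, d=2))
print(refine([1.0, 0.0], vectors, approximate_ranking[:3], k=2))
```

Because refinement only reorders the candidates already retrieved, the recall at depth $d$ bounds the recall of the final top-$k$ list from above.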

In our experiments, we set $k = 10$, matching the application of using word embeddings in document ranking from Zuccon et al. [13], and examined four settings of the retrieval depth $d$. The query terms used for evaluation were taken from the titles of topics used in the TREC 2004 Robust Track, to match our retrieval application. Latency measurements were performed on a common laptop (2.6 GHz Intel Core i7 CPU, 16GB of RAM; macOS 10.14.6; Oracle JDK 11) at the largest $d$. Note that query latency for top-$d$ retrieval in Lucene grows with $d$, and thus we are measuring the worst-case setting.

Model        Configuration  R@(10,d) at four depths d   latency  index size
fake words                  0.64   0.82   0.94   0.97   234ms    175MB
fake words                  0.63   0.81   0.93   0.96   221ms    190MB
fake words                  0.62   0.81   0.92   0.96   209ms    122MB
fake words                  0.61   0.78   0.90   0.95   105ms    96MB
fake words                  0.57   0.74   0.87   0.93   97ms     69MB
lexical LSH                 0.51   0.65   0.79   0.85   193ms    194MB
lexical LSH                 0.55   0.72   0.84   0.91   245ms    130MB
lexical LSH                 0.51   0.65   0.79   0.85   196ms    194MB
lexical LSH                 0.55   0.72   0.84   0.91   276ms    130MB
k–d tree     ppa-pca-ppa    0      0.004  0.008  0.01   9ms      14MB
k–d tree     pca            0.008  0.01   0.02   0.03   11ms     25MB

fake words                  0.64   0.83   0.95   0.98   220ms    238MB
fake words                  0.63   0.82   0.94   0.97   193ms    202MB
fake words                  0.62   0.81   0.93   0.97   225ms    166MB
fake words                  0.61   0.79   0.92   0.97   195ms    167MB
fake words                  0.57   0.75   0.89   0.94   132ms    94MB
lexical LSH                 0.50   0.65   0.80   0.87   176ms    169MB
lexical LSH                 0.51   0.70   0.85   0.91   196ms    176MB
lexical LSH                 0.50   0.65   0.80   0.87   203ms    269MB
lexical LSH                 0.52   0.72   0.84   0.91   278ms    176MB
k–d tree     ppa-pca-ppa    0.001  0.004  0.006  0.01   13ms     19MB
k–d tree     pca            0.002  0.002  0.006  0.01   16ms     34MB
Table 1: R@$(10, d)$ for different values of $d$, query latency, and index size. Parameters for the various models are as follows: the configuration of fake words is the quantization factor; the configuration of lexical LSH comprises the number of buckets, the length of the $n$-grams, and the number of hashes; ppa-pca-ppa refers to Raunak [9] and pca refers to Wold et al. [10].

Our experimental results are presented in Table 1, where we show, for different approaches and settings, R@$(10, d)$ for different values of $d$, as well as average query latency (at the largest $d$) and index size. Of the three approaches, fake words is clearly the most effective: it achieves the highest recall at every depth, with query latency and index size comparable to or lower than those of lexical LSH. The k–d tree, while fast, yields terrible recall: the dimensionality reduction techniques discard too much information for the data structure to be useful.

4 Conclusions

It bears emphasis that Lucene was fundamentally not designed to support approximate nearest-neighbor search on dense vectors, and thus we are appropriating its indexing and retrieval pipeline for an unintended use. As a result, our solutions lack elegance, but they do accomplish our goal of bringing approximate nearest-neighbor search into Lucene without any external dependencies. For the builder of “real-world” search applications, the choice is between a single system (Lucene) that excels at retrieval with inverted indexes and imperfectly performs approximate nearest-neighbor search, or integrating a separate purpose-built system. Ultimately, the decision needs to be considered within a broader set of tradeoffs, but the fake words approach might be compelling in certain scenarios.