The open-source Lucene search library is the most widely adopted solution for developers seeking to build production search applications. While commercial search engine companies such as Google and Bing deploy custom infrastructure, most organizations today (including Apple, Bloomberg, Reddit, Twitter, and Wikipedia) use Lucene, typically via Solr or Elasticsearch. There is, however, one important missing feature in Lucene: the ability to perform nearest-neighbor search on arbitrary vectors. Our work addresses this gap.
With the advent of deep learning and neural approaches to both natural language processing and information retrieval, this is a major shortcoming of Lucene. Such a feature is needed, for example, to look up similar words based on word embeddings. Additionally, researchers have been developing neural models that directly attempt to minimize some simple metric (e.g., cosine distance) between “queries” and “documents” for retrieval tasks [4, 12, 5], which require fast nearest-neighbor search on collections of arbitrary vectors.
At its core, Lucene is built around inverted indexes of a document collection’s term–document matrix. Since the feature space comprises the vocabulary, the vectors are very sparse. In contrast, deep learning applications mostly use dense vectors, typically only a few hundred dimensions (e.g., word embeddings), which are not directly compatible with inverted indexes. Similarity search typically requires an entirely different set of techniques, most often based on some variant of locality-sensitive hashing [3, 2]. As a result of these fundamental differences, systems that require both capabilities—for example, a ranking architecture that uses inverted indexes for candidate generation followed by a model exploiting vector similarity—typically cobble together heterogeneous components.
What if this wasn’t necessary? Our demonstration explores techniques for performing approximate nearest-neighbor search on dense vectors directly in Lucene. We examine three approaches to the specific problem of retrieving similar word embedding vectors. Experimental results show that the “fake words” approach provides reasonable effectiveness and efficiency. Although, admittedly, our solutions lack elegance, they can be directly implemented in Lucene without any external dependencies.
We examine three techniques for implementing approximate nearest-neighbor search on dense vectors within Lucene, outlined below:
“Fake words”. We implement the approach described in Amato et al. [1], which encodes the features of a vector as a number of “fake” terms proportional to the feature value according to the following scheme: Given a vector $\vec{w} = (w_1, \ldots, w_n)$, each feature $w_i$ is associated with a unique alphanumeric term $\tau_i$ so that the document corresponding to the vector $\vec{w}$ is represented by fake words generated by $\bigcup_{i=1}^{n} \bigcup_{j=1}^{\lfloor Q \cdot w_i \rfloor} \tau_i$, where $Q$ is a quantization factor. Thus, the fake words encoding maintains direct proportionality between the float value of a feature and the term frequency of the corresponding fake index term. Feature-level matching for retrieval is achieved by matching on these fake words with scores computed by Lucene’s ClassicSimilarity (a tf-idf variant). Finally, for this approach to be effective, vector inner products have to be equivalent to cosine similarity, which can be achieved by normalizing the vectors to unit length.
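The encoding can be illustrated with a minimal Python sketch. It is not the Anserini or Lucene code: the term naming scheme (`f1`, `f2`, ...), the function names, the default value of `q`, and the choice to emit nothing for non-positive features are all assumptions made for illustration. The `tf_dot` helper only mimics the raw term-frequency overlap and ignores the idf and length normalization that ClassicSimilarity would apply.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a vector to unit L2 length so inner products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fake_words(vec, q=50):
    """Encode a vector as a list of fake terms.

    Feature i contributes floor(q * w_i) copies of the hypothetical term "f<i>",
    where w_i is the unit-normalized feature value. Assumption of this sketch:
    features whose count is not positive contribute no terms.
    """
    tokens = []
    for i, w in enumerate(normalize(vec), start=1):
        count = math.floor(q * w)
        if count > 0:
            tokens.extend([f"f{i}"] * count)
    return tokens


def tf_dot(tokens_a, tokens_b):
    """Inner product of raw term-frequency vectors (no idf, no length norm)."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    return sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())


if __name__ == "__main__":
    a = fake_words([3.0, 4.0, 0.0], q=16)
    b = fake_words([4.0, 3.0, 0.0], q=16)
    c = fake_words([0.0, 0.0, 1.0], q=16)
    print(Counter(a))
    print(tf_dot(a, b), tf_dot(a, c))
```

Because term frequencies are proportional to the quantized feature values, a document whose vector points in a similar direction to the query shares more fake-term mass with it; a larger `q` gives a finer quantization at the cost of longer documents.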
“Lexical LSH”. We implement an approach that lexically quantizes vector components for easy indexing and search in Lucene using LSH. Given a vector $\vec{w}$, each feature $w_i$ is rounded to the first decimal place and tagged with its feature index $i$. For example, a vector whose first feature rounds to 0.1 yields the token 1_0.1, with analogous tokens for the remaining features. Optionally, tokens are aggregated into n-grams and finally passed to an LSH function (already implemented in Lucene as MinHashFilter) to hash the tokens (or n-grams) into a configurable number of buckets; see Gionis et al. [3]. Thus, the vector $\vec{w}$ is represented as a set of LSH-generated text signatures for tagged and quantized feature n-grams.
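The tokenization and hashing steps might be sketched as follows in Python. This is an illustration only: the min-hash routine below is a generic textbook variant built on SHA-1 and is not Lucene’s MinHashFilter, and the function names, the signature format (`h<seed>_<bucket>`), the parameter defaults, and the example vector are all hypothetical.

```python
import hashlib


def lexical_tokens(vec):
    """Round each feature to one decimal place and prefix it with its 1-based index."""
    tokens = []
    for i, w in enumerate(vec, start=1):
        value = round(w, 1) + 0.0  # adding 0.0 turns -0.0 into 0.0
        tokens.append(f"{i}_{value:.1f}")
    return tokens


def ngrams(tokens, n):
    """Join each run of n consecutive tokens into one string; n <= 1 keeps tokens as-is."""
    if n <= 1:
        return list(tokens)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def minhash_signature(items, num_hashes=4, num_buckets=512):
    """Generic min-hash: for each seeded hash function keep the smallest hash
    value over all items, then map it to one of num_buckets buckets."""
    if not items:
        return []
    signature = []
    for seed in range(num_hashes):
        min_val = min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        )
        signature.append(f"h{seed}_{min_val % num_buckets}")
    return signature


if __name__ == "__main__":
    tokens = lexical_tokens([0.12, 0.43, 0.74])  # illustrative vector
    print(tokens)             # ['1_0.1', '2_0.4', '3_0.7']
    print(ngrams(tokens, 2))  # ['1_0.1 2_0.4', '2_0.4 3_0.7']
    print(minhash_signature(ngrams(tokens, 2)))
```

Two vectors that share many quantized tokens (or n-grams) are likely to agree on some signature terms, so ordinary term matching over the signatures acts as the approximate similarity test.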
k–d trees. We leverage Lucene’s existing capability to index multidimensional points composed of floating-point values, which is based on k–d trees, to perform nearest-neighbor search. The Lucene implementation is currently limited to at most eight dimensions, and therefore k–d trees can only be used after dimensionality reduction. To accomplish this, we use either PCA [10] or the post-processing of Mu et al. [7] combined with PCA, as in Raunak [9].
We choose nearest-neighbor search on dense word embedding vectors as our representative task for evaluation. Specifically, we consider word2vec [6], trained on a GoogleNews corpus, and GloVe [8], trained on a Twitter corpus, both with 300-dimensional vectors.
All techniques discussed in the previous section are implemented in the Anserini toolkit [11] (http://anserini.io/) and are released along with this demonstration. For the “fake words” and “lexical LSH” approaches, there are a number of parameters that control effectiveness–efficiency tradeoffs, which we tune specifically for word2vec and GloVe. We argue that usual notions of segregating training and test sets are not applicable here, because the word embeddings are static and provided in advance, so there is no reason why a researcher would not optimally tune parameters for the specific corpus.
One more implementation detail is worth mentioning: for the “fake words” and “lexical LSH” approaches, a large number of terms are generated at indexing time, which significantly reduces search performance. To combat this, we filter highly frequent terms at search time. Once again, this filtering threshold is tuned per collection, and we observe that this technique gives us both efficiency and effectiveness gains.
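A minimal Python sketch of such search-time filtering is given below. The text does not specify how the threshold is defined, so this sketch assumes a cut-off on the fraction of documents containing a term; the function names and the `max_df_ratio` parameter are hypothetical and do not correspond to an Anserini or Lucene API.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def filter_query_terms(query_tokens, df, num_docs, max_df_ratio=0.5):
    """Drop query terms that occur in more than max_df_ratio of the documents.

    Assumption of this sketch: the threshold is a document-frequency ratio.
    """
    limit = max_df_ratio * num_docs
    return [t for t in query_tokens if df[t] <= limit]


if __name__ == "__main__":
    docs = [["a", "b"], ["a", "c"], ["a", "d"], ["b", "e"]]
    df = document_frequencies(docs)
    print(filter_query_terms(["a", "b", "c"], df, len(docs)))  # ['b', 'c']
```

Very frequent terms match long postings lists while contributing little to discriminating between vectors, which is one plausible reason why dropping them can help both latency and recall.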
Our techniques are evaluated in terms of top $k$ recall at retrieval depth $d$, which we abbreviate as R@($k$, $d$). For a given query vector $\vec{q}$, the goal is to retrieve its top $k$ most similar vectors (in terms of cosine similarity), where we determine the “ground truth” by brute force (in the case of k–d trees, on the original vectors). For each technique, Lucene can retrieve ranked results to any arbitrary depth $d$. As is common, setting $d > k$ allows a refinement step (which we did not implement) where the actual similarity of all $d$ vectors can be computed and reranked to produce a final top-$k$ ranked list.
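The metric can be made concrete with a short Python sketch, under the reading that R@($k$, $d$) is the fraction of the true top-$k$ neighbors that appear among the first $d$ retrieved results. The function names and the toy data are hypothetical, and the `rerank` helper only illustrates the optional refinement step, which was not part of the experiments.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def brute_force_top_k(query, vectors, k):
    """Ground truth: ids of the k vectors most similar to the query."""
    ranked = sorted(vectors, key=lambda key: cosine(query, vectors[key]), reverse=True)
    return ranked[:k]


def recall_at(retrieved, truth, k, d):
    """R@(k, d): share of the true top-k found in the first d retrieved ids."""
    top = set(truth[:k])
    if not top:
        return 0.0
    return len(top & set(retrieved[:d])) / len(top)


def rerank(query, retrieved, vectors, k, d):
    """Optional refinement: rescore the first d candidates exactly, keep the best k."""
    candidates = sorted(
        retrieved[:d], key=lambda key: cosine(query, vectors[key]), reverse=True
    )
    return candidates[:k]


if __name__ == "__main__":
    vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
    query = [1.0, 0.0]
    truth = brute_force_top_k(query, vectors, k=2)  # ['a', 'b']
    retrieved = ["b", "c", "a", "d"]                # output of some approximate index
    print(recall_at(retrieved, truth, k=2, d=2))    # 0.5
    print(recall_at(retrieved, truth, k=2, d=3))    # 1.0
    print(rerank(query, retrieved, vectors, k=2, d=3))  # ['a', 'b']
```

In the toy example, retrieving one extra candidate lifts recall from 0.5 to 1.0, which is the effect that a deeper retrieval depth followed by reranking is meant to exploit.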
In our experiments, we set $k = 10$, matching the application of using word embeddings in document ranking from Zuccon et al. [13]. Specifically, we examined several settings of the retrieval depth $d$. The query terms used to perform evaluation were taken from the titles of topics used in the TREC 2004 Robust Track, to match our retrieval application. Latency measurements were performed on a common laptop (2.6 GHz Intel Core i7 CPU, 16 GB of RAM; macOS 10.14.6; Oracle JDK 11) with $d$ set to the largest value examined. Note that query latency for top-$d$ retrieval in Lucene grows with $d$, and thus we are measuring the worst-case setting.
Our experimental results are presented in Table 1, where we show, for different approaches and settings, R@(10, $d$) for different values of $d$, as well as average query latency (at the largest $d$) and index size. It is clear that of the three approaches, fake words is the most effective as well as the most efficient. The k–d tree, while fast, yields terrible recall: the dimensionality reduction techniques discard too much information for the data structure to be useful.
It bears emphasis that Lucene was fundamentally not designed to support approximate nearest-neighbor search on dense vectors, and thus we are appropriating its indexing and retrieval pipeline for an unintended use. As a result, our solutions lack elegance, but they do accomplish our goal of bringing approximate nearest-neighbor search into Lucene without any external dependencies. For the builder of “real-world” search applications, the choice is between a single system (Lucene) that excels at retrieval with inverted indexes and imperfectly performs approximate nearest-neighbor search, or integrating a separate purpose-built system. Ultimately, the decision needs to be considered within a broader set of tradeoffs, but the fake words approach might be compelling in certain scenarios.
[1] Amato, G., Debole, F., Falchi, F., Gennaro, C., Rabitti, F.: Large scale indexing and searching deep convolutional neural network features. In: Big Data Analytics and Knowledge Discovery (DaWaK 2016). pp. 213–224 (2016)
[2] Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. CACM 51(1), 117–122 (2008)
[3] Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases (VLDB 1999). pp. 518–529. Edinburgh, Scotland (1999)
[4] Henderson, M., Al-Rfou, R., Strope, B., Sung, Y., Lukacs, L., Guo, R., Kumar, S., Miklos, B., Kurzweil, R.: Efficient natural language response suggestion for Smart Reply. arXiv:1705.00652 (2017)
[5] Ji, S., Shao, J., Yang, T.: Efficient interaction-based neural ranking with locality sensitive hashing. In: Proceedings of the 2019 International World Wide Web Conference. pp. 2858–2864. San Francisco, California (2019)
[6] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
[7] Mu, J., Bhat, S., Viswanath, P.: All-but-the-top: Simple and effective postprocessing for word representations. arXiv:1702.01417 (2017)
[8] Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). pp. 1532–1543. Doha, Qatar (2014)
[9] Raunak, V.: Simple and effective dimensionality reduction for word embeddings. arXiv:1708.03629 (2017)
[10] Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2(1–3), 37–52 (1987)
[11] Yang, P., Fang, H., Lin, J.: Anserini: Reproducible ranking baselines using Lucene. Journal of Data and Information Quality 10(4), Article 16 (2018)
[12] Zamani, H., Dehghani, M., Croft, W.B., Learned-Miller, E., Kamps, J.: From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM 2018). pp. 497–506. Torino, Italy (2018)
[13] Zuccon, G., Koopman, B., Bruza, P., Azzopardi, L.: Integrating and evaluating neural word embeddings in information retrieval. In: Proceedings of the 20th Australasian Document Computing Symposium. Parramatta, Australia (2015)