Anserini Gets Dense Retrieval: Integration of Lucene's HNSW Indexes

04/24/2023
by Xueguang Ma, et al.

Anserini is a Lucene-based toolkit for reproducible information retrieval research in Java that has been gaining traction in the community. It provides retrieval capabilities for both "traditional" bag-of-words retrieval models such as BM25 and learned sparse representations such as SPLADE. With Pyserini, which provides a Python interface to Anserini, users gain access to both sparse and dense retrieval models, as Pyserini implements bindings to the Faiss vector search library alongside Lucene inverted indexes in a uniform, consistent interface. Even so, hybrid fusion techniques that integrate sparse and dense retrieval models need to stitch together results from two completely different "software stacks", which creates unnecessary complexities and inefficiencies. The introduction of HNSW indexes for dense vector search in Lucene promises to bring both dense and sparse retrieval into a single software framework. We explore exactly this integration in the context of Anserini. Experiments on the MS MARCO passage and BEIR datasets show that our Anserini HNSW integration supports (reasonably) effective and (reasonably) efficient approximate nearest neighbor search for dense retrieval models, using only Lucene.
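
To make the idea of hybrid fusion concrete, here is a minimal sketch of how two ranked lists, one from a sparse retriever and one from a dense retriever, might be merged. It uses reciprocal rank fusion, which is one common fusion technique and not necessarily the one the authors use. The function name, the constant `k=60`, and the document ids are assumptions made for illustration; the sketch calls no Anserini, Pyserini, Lucene, or Faiss API.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse ranked lists of document ids into one list of (docid, score).

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. The smoothing constant k is an assumed default.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, docid in enumerate(ranking, start=1):
            fused[docid] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by docid for determinism.
    ordered = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


# Hypothetical results: in practice these would come from a sparse
# (e.g. BM25) search and a dense (e.g. HNSW) search over the same corpus.
sparse_hits = ["d3", "d1", "d7"]
dense_hits = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
for docid, score in fused:
    print(f"{docid}\t{score:.5f}")
```

Because this kind of fusion only needs the two ranked lists, it works the same whether they come from separate software stacks or from a single framework; the complexity the abstract refers to lies in producing and aligning those lists, not in the merge step itself.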

research
02/19/2021

Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations

Pyserini is an easy-to-use Python toolkit that supports replicable IR re...
research
03/31/2023

Dense Sparse Retrieval: Using Sparse Language Models for Inference Efficient Dense Retrieval

Vector-based retrieval systems have become a common staple for academic ...
research
04/26/2023

A Personalized Dense Retrieval Framework for Unified Information Access

Developing a universal model that can efficiently and effectively respon...
research
10/28/2020

Flexible retrieval with NMSLIB and FlexNeuART

Our objective is to introduce to the NLP community an existing k-NN sear...
research
06/28/2021

A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques

Recent developments in representational learning for information retriev...
research
03/11/2022

Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval

Recent rapid advancements in deep pre-trained language models and the in...
research
10/22/2019

Lucene for Approximate Nearest-Neighbors Search on Arbitrary Dense Vectors

We demonstrate three approaches for adapting the open-source Lucene sear...
