On Approximate Nearest Neighbour Selection for Multi-Stage Dense Retrieval

08/25/2021
by Craig Macdonald, et al.

Dense retrieval, which describes the use of contextualised language models such as BERT to identify documents from a collection by leveraging approximate nearest neighbour (ANN) techniques, has been increasing in popularity. Two families of approaches have emerged, depending on whether documents and queries are represented by single or multiple embeddings. ColBERT, the exemplar of the latter, uses an ANN index and approximate scores to identify a set of candidate documents for each query embedding, which are then re-ranked using accurate document representations. In this manner, a large number of documents can be retrieved for each query, hindering the efficiency of the approach. In this work, we investigate the use of ANN scores for ranking the candidate documents, in order to decrease the number of candidate documents being fully scored. Experiments conducted on the MSMARCO passage ranking corpus demonstrate that, by using the approximate scores to cut the candidate set off at only 200 documents, we can still obtain an effective ranking, with no statistically significant differences in effectiveness and a 2x speedup.
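
The two-stage idea in the abstract can be sketched as follows: rank the candidates by an aggregate of their approximate ANN scores, keep only the top few, and compute the exact multi-embedding score for that shortlist alone. This is a minimal illustration and not the authors' implementation. The function names, the toy data layout, and in particular the way approximate scores are aggregated (maximum per query embedding, summed across query embeddings) are assumptions made here for illustration; the abstract does not specify them.

```python
from collections import defaultdict


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_score(query_embs, doc_embs):
    """Late-interaction (MaxSim-style) score: for each query embedding,
    take the best-matching document embedding, then sum over the query."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


def approx_candidate_scores(ann_hits):
    """Aggregate approximate scores per document.

    ann_hits holds, for each query embedding, a list of
    (doc_id, approx_score) pairs as an ANN index might return them.
    Assumption: keep the maximum approximate score per document for each
    query embedding and sum across query embeddings; a document that is
    not returned for a query embedding contributes nothing for it.
    """
    totals = defaultdict(float)
    for hits in ann_hits:
        best = {}
        for doc_id, score in hits:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
        for doc_id, score in best.items():
            totals[doc_id] += score
    return dict(totals)


def two_stage_rank(query_embs, ann_hits, doc_store, cutoff=200):
    """Shortlist candidates by approximate score, then score only the
    shortlist exactly. doc_store maps doc_id -> list of embeddings
    (a hypothetical in-memory stand-in for the accurate representations)."""
    approx = approx_candidate_scores(ann_hits)
    # Sort by descending approximate score; break ties by doc_id.
    shortlist = sorted(approx, key=lambda d: (-approx[d], d))[:cutoff]
    exact = {d: exact_score(query_embs, doc_store[d]) for d in shortlist}
    return sorted(exact.items(), key=lambda kv: (-kv[1], kv[0]))
```

The efficiency gain comes from `cutoff`: exact scoring touches at most that many documents, however many candidates the ANN stage returned. The default of 200 mirrors the cutoff reported in the abstract.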

research
08/23/2021

Query Embedding Pruning for Dense Retrieval

Recent advances in dense retrieval techniques have offered the promise o...
research
08/18/2022

Adaptive Re-Ranking with a Corpus Graph

Search systems often employ a re-ranking pipeline, wherein documents (or...
research
08/13/2021

On Single and Multiple Representations in Dense Passage Retrieval

The advent of contextualised language models has brought gains in search...
research
04/14/2022

Composite Code Sparse Autoencoders for first stage retrieval

We propose a Composite Code Sparse Autoencoder (CCSA) approach for Appro...
research
05/06/2021

Learning Early Exit Strategies for Additive Ranking Ensembles

Modern search engine ranking pipelines are commonly based on large machi...
research
04/30/2020

Query-level Early Exit for Additive Learning-to-Rank Ensembles

Search engine ranking pipelines are commonly based on large ensembles of...
research
04/01/2022

Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings

Vector quantization (VQ) based ANN indexes, such as Inverted File System...
