Composite Code Sparse Autoencoders for first stage retrieval

04/14/2022
by   Carlos Lassance, et al.
0

We propose a Composite Code Sparse Autoencoder (CCSA) approach for Approximate Nearest Neighbor (ANN) search of document representations based on Siamese-BERT models. In Information Retrieval (IR), the ranking pipeline is generally decomposed in two stages: the first stage focus on retrieving a candidate set from the whole collection. The second stage re-ranks the candidate set by relying on more complex models. Recently, Siamese-BERT models have been used as first stage ranker to replace or complement the traditional bag-of-word models. However, indexing and searching a large document collection require efficient similarity search on dense vectors and this is why ANN techniques come into play. Since composite codes are naturally sparse, we first show how CCSA can learn efficient parallel inverted index thanks to an uniformity regularizer. Second, CCSA can be used as a binary quantization method and we propose to combine it with the recent graph based ANN techniques. Our experiments on MSMARCO dataset reveal that CCSA outperforms IVF with product quantization. Furthermore, CCSA binary quantization is beneficial for the index size, and memory usage for the graph-based HNSW method, while maintaining a good level of recall and MRR. Third, we compare with recent supervised quantization methods for image retrieval and find that CCSA is able to outperform them.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/15/2016

Scalable Image Retrieval by Sparse Product Quantization

Fast Approximate Nearest Neighbor (ANN) search technique for high-dimens...
research
08/25/2021

On Approximate Nearest Neighbour Selection for Multi-Stage Dense Retrieval

Dense retrieval, which describes the use of contextualised language mode...
research
05/09/2021

Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index

Embedding index that enables fast approximate nearest neighbor(ANN) sear...
research
01/14/2022

Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

Ad-hoc search calls for the selection of appropriate answers from a mass...
research
08/10/2016

Approximate search with quantized sparse representations

This paper tackles the task of storing a large collection of vectors, su...
research
06/13/2023

Practice with Graph-based ANN Algorithms on Sparse Data: Chi-square Two-tower model, HNSW, Sign Cauchy Projections

Sparse data are common. The traditional “handcrafted” features are often...
research
05/30/2023

AdANNS: A Framework for Adaptive Semantic Search

Web-scale search systems learn an encoder to embed a given query which i...

Please sign up or login with your details

Forgot password? Click here to reset