Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

01/14/2022
by Shitao Xiao, et al.

Ad-hoc search calls for the selection of appropriate answers from a massive-scale corpus. Embedding-based retrieval (EBR) has become a promising solution for this task, where deep-learning-based document representations and ANN search techniques are combined. However, a major challenge is that the ANN index can be too large to fit into memory, given the considerable size of the answer corpus. In this work, we tackle this problem with a Bi-Granular Document Representation: lightweight sparse embeddings are indexed and kept in memory for coarse-grained candidate search, while heavyweight dense embeddings are hosted on disk for fine-grained post-verification. To maximize retrieval accuracy, a Progressive Optimization framework is designed. The sparse embeddings are learned first, for high-quality candidate search. Conditioned on the candidate distribution induced by the sparse embeddings, the dense embeddings are then learned to optimize the discrimination of the ground truth from the shortlisted candidates. In addition, two techniques, contrastive quantization and locality-centric sampling, are introduced for the learning of the sparse and dense embeddings, and substantially contribute to their performance. Thanks to the above features, our method effectively handles massive-scale EBR with strong advantages in accuracy: up to +4.3 recall gain on a million-scale corpus, and up to +17.5 on a billion-scale corpus. Moreover, our method is applied to a major sponsored search platform with substantial gains on revenue (+1.95) and CTR (+0.49).
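The two-stage retrieval described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the corpus embeddings, dimensions, and the `bi_granular_search` function are all hypothetical, the in-memory "sparse" index is simulated as a small low-dimensional matrix, and the on-disk dense store is simulated as an in-memory array from which only the shortlisted rows are read.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus: lightweight embeddings kept in memory for coarse
# search, and heavyweight dense embeddings that would reside on disk.
N, D_SPARSE, D_DENSE = 10_000, 16, 128
sparse_index = rng.standard_normal((N, D_SPARSE)).astype(np.float32)
dense_store = rng.standard_normal((N, D_DENSE)).astype(np.float32)

def bi_granular_search(q_sparse, q_dense, k_coarse=100, k_final=10):
    """Coarse candidate search over the in-memory sparse index,
    followed by fine-grained verification on dense embeddings."""
    # Stage 1: coarse retrieval by inner product over the sparse index.
    coarse_scores = sparse_index @ q_sparse
    candidates = np.argpartition(-coarse_scores, k_coarse)[:k_coarse]
    # Stage 2: fetch only the shortlisted dense embeddings (the disk
    # read in the real system) and re-rank against the dense query.
    fine_scores = dense_store[candidates] @ q_dense
    order = np.argsort(-fine_scores)[:k_final]
    return candidates[order]

top = bi_granular_search(
    rng.standard_normal(D_SPARSE).astype(np.float32),
    rng.standard_normal(D_DENSE).astype(np.float32),
)
print(len(top))  # 10
```

The design point the paper targets shows up in stage 2: only `k_coarse` dense vectors are touched per query, so the dense store never needs to fit into memory.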


