A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval

08/21/2022
by Xinyu Ma, et al.

Dense retrieval (DR) has shown promising results in information retrieval. In essence, DR requires high-quality text representations to support effective search in the representation space. Recent studies have shown that pre-trained autoencoder-based language models with a weak decoder can provide high-quality text representations, boosting the effectiveness and few-shot ability of DR models. However, even a weak autoregressive decoder suffers from the bypass effect: it can rely on previously decoded tokens instead of the encoder's representation, weakening the training signal on the encoder. More importantly, the discriminative ability of the learned representations may be limited, since every token is treated as equally important when decoding the input texts. To address these problems, we propose a contrastive pre-training approach that learns a discriminative autoencoder with a lightweight multi-layer perceptron (MLP) decoder. The basic idea is to generate the word distribution of an input text in a non-autoregressive fashion, pulling the word distributions of two masked versions of the same text close together while pushing them away from those of other texts. We theoretically show that this contrastive strategy suppresses common words and highlights representative words in decoding, leading to discriminative representations. Empirical results show that our method significantly outperforms state-of-the-art autoencoder-based language models and other pre-trained models for dense retrieval.
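The abstract describes the objective only at a high level, so below is a minimal PyTorch sketch of what such a contrastive loss over non-autoregressively decoded word distributions could look like. The names encoder, mlp_decoder, and temperature, the negative cross-entropy similarity, and the InfoNCE form are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: `encoder`, `mlp_decoder`, and `temperature`
# are hypothetical names, and the similarity/loss choices are
# assumptions rather than the paper's exact formulation.
import torch
import torch.nn.functional as F

def contrastive_autoencoder_loss(encoder, mlp_decoder, view_a, view_b,
                                 temperature=0.05):
    """view_a, view_b: token-id tensors of shape (batch, seq_len) holding
    two independently masked versions of the same batch of texts."""
    # Encode each masked view into a single text representation,
    # e.g. the [CLS] vector, shape (batch, hidden).
    z_a = encoder(view_a)
    z_b = encoder(view_b)

    # Non-autoregressive decoding: one MLP pass maps the text
    # representation to logits over the whole vocabulary at once,
    # shape (batch, vocab_size).
    prob_a = F.softmax(mlp_decoder(z_a), dim=-1)
    logp_b = F.log_softmax(mlp_decoder(z_b), dim=-1)

    # Pairwise similarity between word distributions: negative
    # cross-entropy between distribution i (view A) and j (view B).
    # Diagonal pairs (two views of the same text) are positives;
    # every other text in the batch is a negative.
    sim = (prob_a @ logp_b.t()) / temperature      # (batch, batch)
    labels = torch.arange(sim.size(0), device=sim.device)

    # InfoNCE objective: pull the paired word distributions close,
    # push distributions of different texts apart.
    return F.cross_entropy(sim, labels)
```

In practice, a reconstruction term (for instance, a bag-of-words cross-entropy between the decoded distribution and the input tokens) would presumably be trained alongside this contrastive term; it is omitted here for brevity.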


Related research

04/22/2022
Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction
Dense retrieval has shown promising results in many information retrieva...

05/22/2023
Challenging Decoder helps in Masked Auto-Encoder Pre-training for Dense Passage Retrieval
Recently, various studies have been directed towards exploring dense pas...

08/31/2022
LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval
In large-scale retrieval, the lexicon-weighting paradigm, learning weigh...

09/13/2022
HEARTS: Multi-task Fusion of Dense Retrieval and Non-autoregressive Generation for Sponsored Search
Matching user search queries with relevant keywords bid by advertisers i...

02/18/2021
Less is More: Pre-training a Strong Siamese Encoder Using a Weak Decoder
Many real-world applications use Siamese networks to efficiently match t...

04/17/2023
Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval
Current dense retrievers (DRs) are limited in their ability to effective...

07/17/2023
Soft Prompt Tuning for Augmenting Dense Retrieval with Large Language Models
Dense retrieval (DR) converts queries and documents into dense embedding...
