RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models

11/16/2022
by Shitao Xiao, et al.

To better support retrieval applications such as web search and question answering, a growing effort has been made to develop retrieval-oriented language models. Most existing works focus on improving the semantic representation capability of the contextualized embedding of the [CLS] token. However, recent studies show that the ordinary tokens besides [CLS] may provide extra information, which helps to produce a better representation. It is therefore desirable to extend the current methods so that all contextualized embeddings can be jointly pre-trained for retrieval tasks. With this motivation, we propose a new pre-training method: the duplex masked auto-encoder, a.k.a. DupMAE, which aims to improve the semantic representation capacity of the contextualized embeddings of both the [CLS] token and the ordinary tokens. It introduces two decoding tasks: one is to reconstruct the original input sentence from the [CLS] embedding; the other is to minimize the bag-of-words (BoW) loss for the input sentence based on the embeddings of all ordinary tokens. The two decoding losses are added up to train a unified encoding model. The embeddings of the [CLS] token and the ordinary tokens, after dimension reduction and aggregation, are concatenated into one unified semantic representation of the input. DupMAE is simple but empirically competitive: with a small decoding cost, it substantially improves the model's representation capability and transferability, achieving remarkable improvements on the MS MARCO and BEIR benchmarks.
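As a rough illustration of how the two decoding objectives described above could fit together, below is a minimal PyTorch-style sketch. The encoder and decoder depths, the max-pooling aggregation, and the names DupMAESketch, cls_proj, bow_head, and lm_head are assumptions made for illustration only; they are not taken from the paper or its released code.

# Hypothetical sketch of DupMAE's two decoding objectives (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DupMAESketch(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, reduced=384, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=12)
        # Task 1: a shallow decoder reconstructs the sentence from the [CLS] embedding.
        dec_layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)
        self.lm_head = nn.Linear(hidden, vocab_size)
        # Task 2: project ordinary-token embeddings into vocabulary space for the BoW loss.
        self.bow_head = nn.Linear(hidden, vocab_size)
        # Dimension reduction for the [CLS] part of the unified representation.
        self.cls_proj = nn.Linear(hidden, reduced)

    def embed(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        return self.tok_emb(ids) + self.pos_emb(pos)

    def forward(self, enc_ids, dec_ids, dec_labels, orig_ids):
        # enc_ids: masked encoder input; dec_ids / dec_labels: masked decoder input
        # and its reconstruction targets; orig_ids: original, unmasked token ids
        # used as the bag-of-words target.
        h = self.encoder(self.embed(enc_ids))             # [B, L, H]
        cls, ordinary = h[:, 0], h[:, 1:]                 # [CLS] vs. ordinary tokens

        # Decoding task 1: recover the input sentence from the [CLS] embedding.
        dec_in = self.embed(dec_ids)
        dec_in = torch.cat([cls.unsqueeze(1), dec_in[:, 1:]], dim=1)
        logits = self.lm_head(self.decoder(dec_in))
        loss_dec = F.cross_entropy(
            logits.flatten(0, 1), dec_labels.flatten(), ignore_index=-100)

        # Decoding task 2: bag-of-words loss from the ordinary tokens' embeddings.
        # Each token embedding is projected to vocabulary space and max-pooled over
        # positions; the pooled distribution should cover the words of the input.
        vocab_scores = self.bow_head(ordinary).max(dim=1).values    # [B, V]
        log_probs = F.log_softmax(vocab_scores, dim=-1)
        bow_target = torch.zeros_like(log_probs).scatter_(1, orig_ids, 1.0)
        bow_target = bow_target / bow_target.sum(dim=-1, keepdim=True)
        loss_bow = -(bow_target * log_probs).sum(dim=-1).mean()

        # The two decoding losses are simply added to train one unified encoder.
        return loss_dec + loss_bow

    @torch.no_grad()
    def encode(self, ids):
        # Unified representation: reduced [CLS] embedding concatenated with an
        # aggregated vocabulary-space vector from the ordinary tokens; in practice
        # the latter would also be compressed (e.g. keeping only its largest entries).
        h = self.encoder(self.embed(ids))
        cls_vec = self.cls_proj(h[:, 0])
        term_vec = self.bow_head(h[:, 1:]).max(dim=1).values
        return torch.cat([cls_vec, term_vec], dim=-1)

The sketch simply sums the two losses, mirroring the abstract's statement that the decoding losses are added up to train one encoding model, and encode() illustrates how a reduced [CLS] embedding and an aggregated term-level vector could be concatenated into one representation.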


Related research:

05/04/2023 · RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models
To better support information retrieval tasks such as web search and ope...

05/24/2022 · RetroMAE: Pre-training Retrieval-oriented Transformers via Masked Auto-Encoder
Pre-trained models have demonstrated superior power on many important ta...

08/16/2022 · ConTextual Mask Auto-Encoder for Dense Passage Retrieval
Dense passage retrieval aims to retrieve the relevant passages of a quer...

12/08/2021 · Improving language models by retrieving from trillions of tokens
We enhance auto-regressive language models by conditioning on document c...

04/05/2023 · CoT-MAE v2: Contextual Masked Auto-Encoder with Multi-view Modeling for Passage Retrieval
Growing techniques have been emerging to improve the performance of pass...

04/20/2023 · CoT-MoTE: Exploring ConTextual Masked Auto-Encoder Pre-training with Mixture-of-Textual-Experts for Passage Retrieval
Passage retrieval aims to retrieve relevant passages from large collecti...

04/25/2023 · Semantic Compression With Large Language Models
The rise of large language models (LLMs) is revolutionizing information ...
