MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers

by Kun Zhou, et al.

Dense retrieval aims to map queries and passages into a low-dimensional vector space for efficient similarity measurement, and it has shown promising effectiveness in various large-scale retrieval tasks. Since most existing methods adopt pre-trained Transformers (e.g., BERT) for parameter initialization, some work focuses on proposing new pre-training tasks that compress the useful semantic information from passages into dense vectors, achieving remarkable performance. However, it is still challenging to effectively capture the rich semantic information and relations among passages in dense vectors via a single pre-training task. In this work, we propose a multi-task pre-trained model, MASTER, that unifies and integrates multiple pre-training tasks with different learning objectives under the bottlenecked masked autoencoder architecture. Concretely, MASTER utilizes a multi-decoder architecture to integrate three types of pre-training tasks: corrupted passages recovering, related passages recovering, and PLMs outputs recovering. By incorporating a shared deep encoder, we construct a representation bottleneck in our architecture, compressing the abundant semantic information across tasks into dense vectors. The first two types of tasks concentrate on capturing the semantic information of passages and the relationships among them within the pre-training corpus. The third captures knowledge beyond the corpus from external PLMs (e.g., GPT-2). Extensive experiments on several large-scale passage retrieval datasets show that our approach outperforms previous state-of-the-art dense retrieval methods. Our code and data are publicly released in
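
To make the retrieval setting concrete, the following is a minimal sketch of the scoring step that dense retrieval relies on: once queries and passages are mapped to vectors, passages are ranked by a similarity score (here, a dot product). The function names, passage ids, and 3-dimensional embeddings are hypothetical and chosen only for illustration; they are not taken from MASTER, whose encoder would produce the vectors in practice.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(query_vec, passage_vecs):
    """Return passage ids sorted by descending dot-product similarity.

    `passage_vecs` maps a passage id to its dense vector. In a real system
    these vectors would come from a trained encoder; here they are toy values.
    """
    scores = {pid: dot(query_vec, vec) for pid, vec in passage_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings (assumption: 3 dimensions, hand-picked values).
passages = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.5, 0.5, 0.0],
}
query = [1.0, 0.0, 0.0]

ranking = rank_passages(query, passages)
print(ranking)
```

In this toy setup the passage whose vector points most in the direction of the query ranks first. Large-scale systems typically replace the exhaustive loop with an approximate nearest-neighbor index, which this sketch does not attempt to model.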




