Domain-matched Pre-training Tasks for Dense Retrieval

by Barlas Oguz, et al.

Pre-training on larger datasets with ever-increasing model size is now a proven recipe for improved performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a preexisting dataset of Reddit conversations. We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.
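The bi-encoder setup the abstract refers to embeds queries and passages independently and scores them with a dot product. Below is a minimal sketch of that retrieval step, with a toy hashed bag-of-words encoder standing in for the pre-trained transformer encoders the paper actually trains; the function names and the hashing scheme are illustrative assumptions, not the paper's implementation.

```python
import hashlib
import numpy as np

def encode(text, dim=256):
    # Toy deterministic bag-of-words hash encoder. A real bi-encoder
    # would produce a learned dense vector from a transformer; this
    # stand-in only illustrates the interface: text -> fixed-size vector.
    vec = np.zeros(dim)
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve(query, passages):
    # Bi-encoder retrieval: embed the query and each passage separately,
    # score every pair by dot product, and return the best passage.
    q = encode(query)
    scores = [float(q @ encode(p)) for p in passages]
    return passages[int(np.argmax(scores))], scores
```

Because passages are encoded independently of the query, their embeddings can be precomputed and indexed, which is what makes bi-encoders practical for large-scale retrieval.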




Semantic-based Pre-training for Dialogue Understanding

Pre-trained language models have made great progress on dialogue tasks. ...

Cross-Thought for Sentence Encoder Pre-training

In this paper, we propose Cross-Thought, a novel approach to pre-trainin...

Evaluating Token-Level and Passage-Level Dense Retrieval Models for Math Information Retrieval

With the recent success of dense retrieval methods based on bi-encoders,...

Towards Robust Neural Retrieval Models with Synthetic Pre-Training

Recent work has shown that commonly available machine reading comprehens...

Pre-training Methods in Information Retrieval

The core of information retrieval (IR) is to identify relevant informati...

Pre-training Tasks for Embedding-based Large-scale Retrieval

We consider the large-scale query-document retrieval problem: given a qu...

Effective Sequence-to-Sequence Dialogue State Tracking

Sequence-to-sequence models have been applied to a wide variety of NLP t...