Domain-matched Pre-training Tasks for Dense Retrieval

07/28/2021
by Barlas Oguz, et al.

Pre-training on larger datasets with ever-increasing model size is now a proven recipe for improved performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a pre-existing dataset of Reddit conversations made available by pushshift.io. We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.
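The bi-encoder setup referenced in the abstract follows the standard dense retrieval recipe: a question encoder and a passage encoder each produce a single vector, relevance is scored by dot product, and training uses in-batch negatives. The sketch below is a minimal, hypothetical illustration of that objective, not the paper's implementation; TinyEncoder is a stand-in for the large transformer encoders the authors pre-train, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for a transformer encoder: maps token ids to one vector."""
    def __init__(self, vocab_size=30522, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):
        # Mean-pool token embeddings into a single fixed-size representation.
        return self.emb(token_ids).mean(dim=1)

def in_batch_contrastive_loss(q_vecs, p_vecs):
    """Bi-encoder objective with in-batch negatives: each question's positive
    passage is the matching row; every other passage in the batch is a negative."""
    scores = q_vecs @ p_vecs.T                 # (B, B) dot-product relevance scores
    targets = torch.arange(scores.size(0))     # positives lie on the diagonal
    return F.cross_entropy(scores, targets)

# Toy usage: a batch of 4 (question, passage) pairs, 16 tokens each.
q_enc, p_enc = TinyEncoder(), TinyEncoder()
questions = torch.randint(0, 30522, (4, 16))
passages = torch.randint(0, 30522, (4, 16))
loss = in_batch_contrastive_loss(q_enc(questions), p_enc(passages))
loss.backward()
```

In the domain-matched pre-training setting described above, the same objective would presumably be run over (synthetic question, passage) or (Reddit post, comment) pairs before fine-tuning on the downstream retrieval benchmarks.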

