PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval

10/20/2020
by Xinyu Ma, et al.

Recently, pre-trained language representation models such as BERT have shown great success when fine-tuned on downstream tasks, including information retrieval (IR). However, pre-training objectives tailored for ad-hoc retrieval have not been well explored. In this paper, we propose Pre-training with Representative wOrds Prediction (PROP) for ad-hoc retrieval. PROP is inspired by the classical statistical language model for IR, specifically the query likelihood model, which assumes that the query is generated as a piece of text representative of the "ideal" document. Based on this idea, we construct the representative words prediction (ROP) task for pre-training. Given an input document, we sample a pair of word sets according to the document language model, where the set with the higher likelihood is deemed more representative of the document. We then pre-train the Transformer model to predict the pairwise preference between the two word sets, jointly with the Masked Language Model (MLM) objective. When further fine-tuned on a variety of representative downstream ad-hoc retrieval tasks, PROP achieves significant improvements over baselines without pre-training or with other pre-training methods. We also show that PROP achieves strong performance under both zero- and low-resource IR settings. The code and pre-trained models are available at https://github.com/Albert-Ma/PROP.
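To make the ROP construction concrete, below is a minimal Python sketch of the data-side step described in the abstract: sample two word sets from a document language model and label the higher-likelihood set as the more representative one. The Dirichlet-smoothed unigram language model, the fixed set size, sampling with replacement, and the hinge form of the pairwise loss are simplifying assumptions for illustration, not the authors' exact recipe; see https://github.com/Albert-Ma/PROP for the released implementation.

# Sketch of ROP instance construction (assumptions noted above).
import math
import random
from collections import Counter

def doc_lm(doc_tokens, corpus_tf, corpus_len, mu=2000):
    """Dirichlet-smoothed unigram language model of one document (assumed form)."""
    tf, dl = Counter(doc_tokens), len(doc_tokens)
    return lambda w: (tf[w] + mu * corpus_tf.get(w, 0) / corpus_len) / (dl + mu)

def sample_word_set(vocab, prob, size=5):
    """Draw a word set from the document LM (with replacement, for simplicity)."""
    return random.choices(vocab, weights=[prob(w) for w in vocab], k=size)

def log_likelihood(word_set, prob):
    """Log-likelihood of a word set under the document LM."""
    return sum(math.log(prob(w)) for w in word_set)

def rop_pair(doc_tokens, corpus_tf, corpus_len, size=5):
    """Return (more_representative_set, less_representative_set) for one document."""
    prob = doc_lm(doc_tokens, corpus_tf, corpus_len)
    vocab = sorted(set(doc_tokens) | set(corpus_tf))
    s1, s2 = (sample_word_set(vocab, prob, size) for _ in range(2))
    return (s1, s2) if log_likelihood(s1, prob) >= log_likelihood(s2, prob) else (s2, s1)

def rop_loss(score_pos, score_neg, margin=1.0):
    """Pairwise hinge loss on the model's scores for the two word sets; during
    pre-training this would be optimized jointly with the MLM loss."""
    return max(0.0, margin - score_pos + score_neg)

In a pre-training loop, each document would yield one or more such pairs, and the Transformer would score each (word set, document) pair so that rop_loss pushes the more representative set above the less representative one; the sketch above covers only the sampling and loss form, not the model itself.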


Related research

04/20/2021 · B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval
Pre-training and fine-tuning have achieved remarkable success in many do...

08/20/2021 · Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need
Designing pre-training objectives that more closely resemble the downstr...

08/16/2022 · CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks
Knowledge-intensive language tasks (KILT) usually require a large body o...

09/14/2022 · Pre-training for Information Retrieval: Are Hyperlinks Fully Explored?
Recent years have witnessed great progress on applying pre-trained langu...

04/14/2021 · Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little
A possible explanation for the impressive performance of masked language...

02/27/2023 · Pretraining De-Biased Language Model with Large-scale Click Logs for Document Ranking
Pre-trained language models have achieved great success in various large...

06/05/2023 · Benchmarking Middle-Trained Language Models for Neural Search
Middle training methods aim to bridge the gap between the Masked Languag...
