B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval

04/20/2021
by Xinyu Ma, et al.

Pre-training and fine-tuning have achieved remarkable success in many downstream natural language processing (NLP) tasks. Recently, pre-training methods tailored for information retrieval (IR) have also been explored, and the latest success is PROP, which has achieved new state-of-the-art (SOTA) results on a variety of ad-hoc retrieval benchmarks. The basic idea of PROP is to construct a representative words prediction (ROP) task for pre-training, inspired by the query likelihood model. Despite its strong performance, the effectiveness of PROP might be bounded by the classical unigram language model adopted in the ROP task construction. To tackle this problem, we propose a bootstrapped pre-training method, B-PROP, based on BERT for ad-hoc retrieval. The key idea is to replace the classical unigram language model with the powerful contextual language model BERT for ROP task construction, and to re-train BERT itself towards this IR-tailored objective. Specifically, we introduce a novel contrastive method, inspired by the divergence-from-randomness idea, that leverages BERT's self-attention mechanism to sample representative words from a document. After fine-tuning on downstream ad-hoc retrieval tasks, our method achieves significant improvements over baselines without pre-training or with other pre-training methods, and further pushes forward the SOTA on a variety of ad-hoc retrieval tasks.

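As a rough illustration of the ROP construction step described in the abstract, the sketch below scores document terms by the attention BERT's [CLS] token pays to them, contrasts those scores against a uniform "random" background distribution (in the spirit of divergence-from-randomness), and samples word sets from the resulting distribution. This is a minimal sketch under stated assumptions (Hugging Face transformers BERT, last-layer head-averaged attention, a uniform background distribution, no sub-word merging), not the authors' implementation.

```python
# Minimal sketch of representative-word sampling with BERT attention.
# Assumptions (not from the paper): Hugging Face `transformers`, last-layer
# attention averaged over heads, a uniform background distribution as the
# "random" reference, and no sub-word merging.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

def sample_representative_words(document, set_size=5, num_sets=2):
    enc = tokenizer(document, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc)

    # Last-layer attention, averaged over heads: shape (seq_len, seq_len).
    attn = out.attentions[-1].mean(dim=1)[0]
    # Attention the [CLS] token (position 0) pays to each document token,
    # dropping [CLS] and [SEP] themselves.
    cls_attn = attn[0, 1:-1]
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])[1:-1]

    # Contrastive, divergence-from-randomness-style term scores:
    # how much each token's attention mass exceeds a uniform expectation.
    p_attn = cls_attn / cls_attn.sum()
    p_rand = torch.full_like(p_attn, 1.0 / len(tokens))
    scores = p_attn * torch.log(p_attn / p_rand + 1e-12)

    # Turn the scores into a sampling distribution over tokens.
    probs = torch.softmax(scores, dim=0)

    # Draw word sets; each set is a candidate "representative words" sample,
    # together with its log-likelihood under the distribution.
    word_sets = []
    for _ in range(num_sets):
        idx = torch.multinomial(probs, num_samples=min(set_size, len(tokens)))
        words = [tokens[i] for i in idx.tolist()]
        log_lik = torch.log(probs[idx]).sum().item()
        word_sets.append((words, log_lik))
    return word_sets

# Example: pairs of sampled word sets with different likelihoods could then
# serve as ROP pre-training pairs (the higher-likelihood set as the positive).
print(sample_representative_words(
    "Information retrieval studies how to rank documents for a query."))
```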