Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need

08/20/2021
by Zhengyi Ma et al.

Designing pre-training objectives that more closely resemble downstream tasks can lead to better performance for pre-trained language models at the fine-tuning stage, especially in ad-hoc retrieval. Existing pre-training approaches tailored for IR try to incorporate weakly supervised signals, such as query-likelihood-based sampling, to construct pseudo query-document pairs from a raw textual corpus. However, these signals rely heavily on the sampling method; for example, a query-likelihood model may introduce considerable noise into the constructed pre-training data. In this paper, we propose to leverage large-scale hyperlinks and anchor texts to pre-train the language model for ad-hoc retrieval. Since anchor texts are created by webmasters and usually summarize the target document, they help build more accurate and reliable pre-training samples than a specific sampling algorithm. Considering different views of the downstream ad-hoc retrieval task, we devise four pre-training tasks based on hyperlinks. We then pre-train a Transformer model to predict pair-wise preference, jointly with the Masked Language Model objective. Experimental results on two large-scale ad-hoc retrieval datasets show that our model significantly outperforms existing methods.

† This work was done during an internship at Huawei.
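The abstract's core idea can be illustrated with a minimal sketch (not the authors' code): treat an anchor text as a pseudo-query for the page it links to, build pairwise preference samples, and optimize a ranking loss jointly with masked language modeling. The corpus record format, the random-negative sampling, the margin ranking loss, and the loss weighting below are illustrative assumptions; the paper's four hyperlink-based pre-training tasks are not reproduced here.

import random
import torch
import torch.nn as nn

# --- 1. Build pseudo query-document pairs from hyperlinks (assumed record format) ---
def build_pairs(pages):
    """pages: list of dicts like {"id": ..., "text": ..., "links": [(anchor_text, target_id), ...]}."""
    by_id = {p["id"]: p for p in pages}
    pairs = []
    for page in pages:
        for anchor, target_id in page["links"]:
            pos = by_id[target_id]["text"]       # linked page: treated as relevant to the anchor
            neg = random.choice(pages)["text"]   # random page: pseudo-negative (a real pipeline would exclude the target)
            pairs.append((anchor, pos, neg))
    return pairs

# --- 2. Joint objective: pairwise preference + masked language modeling ---
class JointLoss(nn.Module):
    def __init__(self, mlm_weight=1.0):
        super().__init__()
        self.rank_loss = nn.MarginRankingLoss(margin=1.0)
        self.mlm_loss = nn.CrossEntropyLoss(ignore_index=-100)
        self.mlm_weight = mlm_weight

    def forward(self, pos_score, neg_score, mlm_logits, mlm_labels):
        # Prefer the anchor's linked document over the sampled negative.
        target = torch.ones_like(pos_score)
        ranking = self.rank_loss(pos_score, neg_score, target)
        # Standard MLM term over masked token positions (labels are -100 elsewhere).
        mlm = self.mlm_loss(mlm_logits.view(-1, mlm_logits.size(-1)),
                            mlm_labels.view(-1))
        return ranking + self.mlm_weight * mlm

In practice, pos_score and neg_score would come from a Transformer encoder scoring (anchor, document) pairs; the sketch only shows how the two objectives are combined.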

Related research

10/20/2020 - PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval
Recently pre-trained language representation models such as BERT have sh...

09/14/2022 - Pre-training for Information Retrieval: Are Hyperlinks Fully Explored?
Recent years have witnessed great progress on applying pre-trained langu...

04/20/2021 - B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval
Pre-training and fine-tuning have achieved remarkable success in many do...

04/25/2022 - C3: Continued Pretraining with Contrastive Weak Supervision for Cross Language Ad-Hoc Retrieval
Pretrained language models have improved effectiveness on numerous tasks...

08/16/2022 - Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models
State-of-the-art neural language models can now be used to solve ad-hoc ...

04/27/2021 - One Backward from Ten Forward, Subsampling for Large-Scale Deep Learning
Deep learning models in large-scale machine learning systems are often c...

12/01/2022 - Language Model Pre-training on True Negatives
Discriminative pre-trained language models (PLMs) learn to predict origi...
