Pre-training for Information Retrieval: Are Hyperlinks Fully Explored?

09/14/2022
by   Jiawen Wu, et al.

Recent years have witnessed great progress in applying pre-trained language models, e.g., BERT, to information retrieval (IR) tasks. Hyperlinks, which are commonly used in Web pages, have been leveraged for designing pre-training objectives. For example, anchor texts of hyperlinks have been used to simulate queries, thus constructing a tremendous number of query-document pairs for pre-training. However, as a bridge between two web pages, the potential of hyperlinks has not been fully explored. In this work, we focus on modeling the relationship between two documents that are connected by hyperlinks, and we design a new pre-training objective for ad-hoc retrieval. Specifically, we categorize the relationships between documents into four groups: no link, unidirectional link, symmetric link, and the most relevant symmetric link. By comparing two documents sampled from adjacent groups, the model can gradually improve its capability of capturing matching signals. We propose a progressive hyperlink prediction (PHP) framework to explore the utilization of hyperlinks in pre-training. Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.
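The four-level taxonomy and the "compare documents from adjacent groups" idea can be sketched in a few lines. The sketch below is illustrative only: the link-graph representation (a dict of outgoing links) and the precomputed set of "most relevant" symmetric pairs (`top_pairs`) are assumptions, not the paper's actual implementation.

```python
# Sketch of the four-level link taxonomy and adjacent-group pair sampling.
# `outlinks` maps a document id to the set of documents it links to;
# `top_pairs` is an assumed, precomputed set of the highest-scoring
# symmetric pairs (e.g., ranked by some relevance model).

def link_level(a, b, outlinks, top_pairs):
    """Return the relationship level between documents a and b:
    0 = no link, 1 = unidirectional link, 2 = symmetric link,
    3 = most relevant symmetric link."""
    a_to_b = b in outlinks.get(a, set())
    b_to_a = a in outlinks.get(b, set())
    if a_to_b and b_to_a:
        return 3 if frozenset((a, b)) in top_pairs else 2
    if a_to_b or b_to_a:
        return 1
    return 0

def adjacent_pairs(anchor, candidates, outlinks, top_pairs):
    """For one anchor document, emit (preferred, other) candidate pairs
    whose levels differ by exactly one, so each comparison teaches the
    model a progressively finer matching distinction."""
    levels = {c: link_level(anchor, c, outlinks, top_pairs) for c in candidates}
    pairs = []
    for pos in candidates:
        for neg in candidates:
            if levels[pos] == levels[neg] + 1:
                pairs.append((pos, neg))
    return pairs
```

A pairwise ranking loss over these `(preferred, other)` pairs would then push the model to score the higher-level document above the lower-level one, level by level.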


Related research

Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need (08/20/2021)
PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval (10/20/2020)
Pre-training Tasks for Embedding-based Large-scale Retrieval (02/10/2020)
B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval (04/20/2021)
WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset (12/04/2019)
Adversarial Sampling and Training for Semi-Supervised Information Retrieval (11/09/2018)
Forging Multiple Training Objectives for Pre-trained Language Models via Meta-Learning (10/19/2022)
