Retrieval Oriented Masking Pre-training Language Model for Dense Passage Retrieval

10/27/2022
by Dingkun Long, et al.

Pre-trained language models (PTMs) have been shown to yield powerful text representations for the dense passage retrieval task. Masked Language Modeling (MLM) is a major sub-task of the pre-training process. However, we find that the conventional random masking strategy tends to select a large number of tokens that have little effect on passage retrieval (e.g., stop-words and punctuation). Noticing that term importance weights provide valuable signals for passage retrieval, we propose an alternative retrieval-oriented masking strategy (dubbed ROM) in which more important tokens have a higher probability of being masked out, so that this straightforward yet essential information is captured during language model pre-training. Notably, the proposed masking method changes neither the architecture nor the learning objective of the original PTM. Our experiments verify that ROM enables term importance information to benefit language model pre-training, achieving better performance on multiple passage retrieval benchmarks.
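As a concrete illustration, below is a minimal sketch of one retrieval-oriented masking step. The abstract only states that more important tokens receive a higher masking probability; using IDF as the importance weight, the `idf_weights` and `rom_mask` helpers, and the 15% masking budget are illustrative assumptions here, not the paper's actual implementation.

```python
import math
import random
from collections import Counter

def idf_weights(corpus_tokens):
    """Toy IDF per token over a list of tokenized passages."""
    n_docs = len(corpus_tokens)
    df = Counter(tok for doc in corpus_tokens for tok in set(doc))
    return {tok: math.log(n_docs / (1 + cnt)) for tok, cnt in df.items()}

def rom_mask(tokens, weights, mask_rate=0.15, mask_token="[MASK]"):
    """Mask positions with probability proportional to term importance,
    instead of uniformly at random as in conventional MLM."""
    budget = max(1, round(mask_rate * len(tokens)))
    # Floor the weight so ubiquitous or unseen tokens keep a tiny chance.
    scores = [max(weights.get(t, 0.0), 1e-3) for t in tokens]
    positions, candidates = set(), list(range(len(tokens)))
    while len(positions) < budget and candidates:
        # Sample one position, biased toward high-importance tokens.
        pos = random.choices(candidates,
                             weights=[scores[i] for i in candidates])[0]
        positions.add(pos)
        candidates.remove(pos)
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels

# Stop-words such as "the" get low IDF, so they are rarely selected.
corpus = [["the", "passage", "retrieval", "task"],
          ["the", "dense", "retrieval", "model"],
          ["the", "masking", "strategy", "matters"]]
weights = idf_weights(corpus)
print(rom_mask(["the", "dense", "passage", "retrieval", "model"], weights))
```

Because only the sampling distribution over positions changes, the rest of the MLM pipeline (the encoder and the cross-entropy loss over masked positions) can stay untouched, consistent with the abstract's claim that ROM changes neither the architecture nor the learning objective.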

Related research

08/16/2023 - Pre-training with Large Language Model-based Document Expansion for Dense Passage Retrieval
In this paper, we systematically study the potential of pre-training wit...

08/22/2023 - Pre-training with Aspect-Content Text Mutual Prediction for Multi-Aspect Dense Retrieval
Grounded on pre-trained language models (PLMs), dense retrieval has been...

10/13/2021 - Maximizing Efficiency of Language Model Pre-training for Learning Representation
Pre-trained language models in the past years have shown exponential gro...

04/17/2023 - Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval
Current dense retrievers (DRs) are limited in their ability to effective...

08/23/2022 - Learning Better Masking for Better Language Model Pre-training
Masked Language Modeling (MLM) has been widely used as the denoising obj...

08/29/2023 - Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability
How do language models learn to make predictions during pre-training? To...

10/29/2021 - Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model
Understanding protein sequences is vital and urgent for biology, healthc...
