Benchmarking Middle-Trained Language Models for Neural Search

06/05/2023
by Hervé Déjean, et al.

Middle-training methods aim to bridge the gap between Masked Language Model (MLM) pre-training and the final finetuning for retrieval. Recent models such as CoCondenser, RetroMAE, and LexMAE argue that the MLM task alone is not sufficient to pre-train a transformer network for retrieval and therefore propose various additional tasks. Intrigued by these novel methods, we noticed that they all use different finetuning protocols, which makes it hard to assess the benefits of middle training. In this paper, we propose a benchmark of CoCondenser, RetroMAE, and LexMAE under the same finetuning conditions. We compare both dense and sparse approaches under various finetuning protocols, with middle training performed on different collections (MS MARCO, Wikipedia, or TripClick). We also include additional middle-training baselines, such as standard MLM finetuning on the retrieval collection, optionally augmented with a CLS head predicting the passage term frequencies. For the sparse approach, our study reveals almost no statistical difference between these methods: the more effective the finetuning procedure, the smaller the difference between the models. For the dense approach, RetroMAE middle-trained on MS MARCO shows excellent results in almost all settings. Finally, we show that middle training on the retrieval collection, thus adapting the language model to it, is a critical factor. Overall, a better experimental setup should be adopted to evaluate middle-training methods. Code available at https://github.com/naver/splade/tree/benchmarch-SIGIR23
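As a rough illustration of the "MLM + CLS term-frequency" middle-training baseline mentioned in the abstract, the sketch below continues MLM training on passages from the retrieval collection while an auxiliary head on the [CLS] token predicts the passage's term-frequency distribution. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (tf_head, middle_training_step), the loss weighting alpha, and the exact form of the term-frequency loss are all illustrative choices; the actual code lives in the linked SPLADE repository.

```python
# Minimal sketch (not the authors' code) of a middle-training step combining
# MLM on the retrieval collection with an auxiliary [CLS] head that predicts
# the passage term-frequency distribution. Names and loss weighting (alpha)
# are illustrative assumptions.
import torch
from torch import nn
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
mlm = AutoModelForMaskedLM.from_pretrained(model_name)

# Auxiliary head: maps the [CLS] vector to logits over the vocabulary.
tf_head = nn.Linear(mlm.config.hidden_size, mlm.config.vocab_size)
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.3)


def middle_training_step(batch, alpha=1.0):
    """batch: dict with 'input_ids' / 'attention_mask' tensors for one passage batch."""
    # Dynamic masking of passage tokens for the MLM objective.
    masked = collator([{"input_ids": ids} for ids in batch["input_ids"]])
    out = mlm(input_ids=masked["input_ids"],
              attention_mask=batch["attention_mask"],
              labels=masked["labels"],
              output_hidden_states=True)
    mlm_loss = out.loss

    # Auxiliary objective: [CLS] predicts the passage's (normalised) term frequencies.
    # (In practice special tokens such as [CLS]/[SEP] would be excluded from the target.)
    cls_vec = out.hidden_states[-1][:, 0]
    tf_target = torch.zeros(batch["input_ids"].size(0), mlm.config.vocab_size)
    tf_target.scatter_add_(1, batch["input_ids"], batch["attention_mask"].float())
    tf_target = tf_target / tf_target.sum(dim=1, keepdim=True).clamp(min=1.0)
    tf_loss = nn.functional.cross_entropy(tf_head(cls_vec), tf_target)

    return mlm_loss + alpha * tf_loss
```

In this setup, the dense or sparse retrieval finetuning studied in the paper would then start from such a middle-trained checkpoint rather than from the vanilla MLM checkpoint.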

