An Experimental Study on Pretraining Transformers from Scratch for IR

01/25/2023
by Carlos Lassance, et al.

Finetuning pretrained language models (PLMs) for IR has been the de facto standard practice since their breakthrough in effectiveness a few years ago. But is this approach well understood? In this paper, we study the impact of the pretraining collection on final IR effectiveness. In particular, we challenge the current assumption that PLMs should be pretrained on a large, generic collection, and we show that pretraining from scratch on the collection of interest is surprisingly competitive with the current approach. We benchmark first-stage rankers and cross-encoder rerankers on general passage retrieval on MSMARCO, on Mr-Tydi for Arabic, Japanese and Russian, and on TripClick for domain-specific retrieval. Contrary to popular belief, we show that, for finetuning first-stage rankers, models pretrained solely on their target collection have equivalent or better effectiveness than more general models. However, rerankers pretrained only on the target collection show a slight drop in effectiveness. Overall, our study sheds new light on the role of the pretraining collection and should make our community reconsider building specialized models by pretraining from scratch. Last but not least, doing so could enable better control of efficiency, data bias and replicability, which are key research questions for the IR community.
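To make the setup concrete, below is a minimal sketch of the "pretraining from scratch on the collection of interest" step, assuming Hugging Face Transformers and Datasets. The corpus file name, model size and hyperparameters are illustrative assumptions rather than the authors' exact configuration; the resulting encoder would subsequently be finetuned as a first-stage ranker or cross-encoder.

    from datasets import load_dataset
    from transformers import (
        BertConfig,
        BertForMaskedLM,
        BertTokenizerFast,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    # Hypothetical corpus dump: one passage per line from the target
    # collection (e.g. MSMARCO passages).
    corpus = load_dataset("text", data_files={"train": "collection_passages.txt"})

    # For simplicity we reuse the standard BERT wordpiece vocabulary; a
    # tokenizer trained on the target collection is another option.
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=256)

    train_set = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

    # Randomly initialized BERT-base encoder: no weights are transferred
    # from a generic pretrained model.
    model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

    # Standard masked language modeling objective (15% masking).
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="bert-from-scratch-target-collection",
        per_device_train_batch_size=32,
        num_train_epochs=3,
        learning_rate=1e-4,
    )

    Trainer(
        model=model,
        args=args,
        train_dataset=train_set,
        data_collator=collator,
    ).train()

The key design choice illustrated here is that the model weights are randomly initialized, so the only text the encoder ever sees before IR finetuning is the target collection itself.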


