Annotating Data for Fine-Tuning a Neural Ranker? Current Active Learning Strategies are not Better than Random Selection

09/12/2023
by Sophia Althammer, et al.

Search methods based on Pretrained Language Models (PLMs) have demonstrated large effectiveness gains over statistical and early neural ranking models. However, fine-tuning PLM-based rankers requires a large amount of annotated training data. Annotating data involves substantial manual effort and is therefore expensive, especially for domain-specific tasks. In this paper we investigate fine-tuning PLM-based rankers under limited training data and budget. We consider two scenarios: fine-tuning a ranker from scratch, and domain adaptation, where a ranker already fine-tuned on general data is further fine-tuned on a target dataset. We observe considerable variability in effectiveness when fine-tuning on different randomly selected subsets of training data. This suggests that effectiveness gains are possible by actively selecting the subset of training data that has the most positive effect on the ranker, which would make it possible to fine-tune effective PLM rankers at a reduced annotation budget. To investigate this, we adapt existing Active Learning (AL) strategies to the task of fine-tuning PLM rankers and study their effectiveness, also accounting for annotation and computational costs. Our extensive analysis shows that AL strategies do not significantly outperform random selection of training subsets in terms of effectiveness. We further find that the gains provided by AL strategies come at the expense of more assessments (and thus higher annotation costs), and that AL strategies underperform random selection when effectiveness is compared at a fixed annotation cost. Our results highlight that “optimal” subsets of training data that provide high effectiveness at low annotation cost do exist, but current mainstream AL strategies applied to PLM rankers are not capable of identifying them.
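For intuition, the sketch below shows a generic pool-based active learning loop with uncertainty sampling, one of the mainstream acquisition strategies typically compared against random selection. It is not the paper's implementation: the `score_pairs`, `fine_tune`, and `annotate` callables are hypothetical placeholders standing in for a PLM ranker's scoring function, its training step, and the human relevance assessments.

```python
# Minimal sketch of a pool-based active learning loop for ranker fine-tuning,
# assuming uncertainty sampling as the acquisition function.
# `score_pairs`, `fine_tune`, and `annotate` are hypothetical placeholders.
import numpy as np


def select_batch_by_uncertainty(scores: np.ndarray, batch_size: int) -> np.ndarray:
    """Pick the query-document pairs whose predicted relevance is closest to 0.5,
    i.e. where the current ranker is least certain."""
    uncertainty = -np.abs(scores - 0.5)            # higher value = less certain
    return np.argsort(uncertainty)[-batch_size:]   # positions of most uncertain pairs


def active_learning_loop(pool, score_pairs, fine_tune, annotate,
                         rounds: int = 5, batch_size: int = 100):
    """pool: list of unlabelled (query, document) pairs.
    score_pairs: returns a relevance probability for each pair.
    annotate: obtains a relevance label for one pair (the annotation cost).
    fine_tune: updates the ranker on the labelled set so far."""
    labelled = []
    remaining = list(range(len(pool)))
    for _ in range(rounds):
        scores = np.asarray(score_pairs([pool[i] for i in remaining]))
        picked = select_batch_by_uncertainty(scores, batch_size)
        chosen = [remaining[i] for i in picked]
        labelled.extend((pool[i], annotate(pool[i])) for i in chosen)
        remaining = [i for i in remaining if i not in set(chosen)]
        fine_tune(labelled)                        # retrain / continue fine-tuning
    return labelled
```

Under this framing, the paper's random-selection baseline simply replaces `select_batch_by_uncertainty` with a uniform draw from the remaining pool, and annotation cost is the total number of `annotate` calls.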


