Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

05/11/2023
by Lukáš Mikula, et al.

While Large Language Models (LLMs) dominate a majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations in training datasets. Authors commonly assess model robustness by evaluating their models on out-of-distribution (OOD) datasets of the same task, but these datasets might share the biases of the training dataset. We propose a simple method for measuring the scale of a model's reliance on any identified spurious feature, and we assess robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA). We find that the reported OOD gains of debiasing methods cannot be explained by mitigated reliance on biased features, suggesting that biases are shared among QA datasets. We further evidence this by measuring that the performance of models evaluated OOD depends on biased features comparably to that of the in-distribution (ID) model, motivating future work to refine reports of LLMs' robustness to the level of known spurious features.
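The abstract only summarizes the measurement; as a loose illustration (not the authors' actual procedure), the sketch below shows one common way to quantify reliance on a single identified bias feature: partition the evaluation set by whether a bias heuristic alone recovers the gold answer, then compare the model's accuracy across the two partitions. All names here (`bias_reliance_gap`, `predict`, `bias_heuristic`) are hypothetical.

```python
from typing import Callable, Dict, List

def bias_reliance_gap(
    examples: List[dict],
    predict: Callable[[dict], str],         # hypothetical: model's answer for a QA example
    bias_heuristic: Callable[[dict], str],  # hypothetical: answer chosen by the bias feature alone
) -> Dict[str, float]:
    """Partition examples by whether the bias heuristic alone recovers the
    gold answer, then compare the model's accuracy on the two partitions.
    A large gap suggests the model exploits the same shortcut."""
    aligned_correct = aligned_total = 0
    misaligned_correct = misaligned_total = 0
    for ex in examples:
        gold = ex["answer"]
        model_correct = predict(ex) == gold
        if bias_heuristic(ex) == gold:
            # bias-aligned example: the shortcut points at the gold answer
            aligned_total += 1
            aligned_correct += model_correct
        else:
            # bias-misaligned example: the shortcut fails here
            misaligned_total += 1
            misaligned_correct += model_correct
    acc_aligned = aligned_correct / max(aligned_total, 1)
    acc_misaligned = misaligned_correct / max(misaligned_total, 1)
    return {
        "acc_on_bias_aligned": acc_aligned,
        "acc_on_bias_misaligned": acc_misaligned,
        "reliance_gap": acc_aligned - acc_misaligned,
    }
```

For example, with a word-overlap heuristic that picks the passage span most similar to the question, a model whose accuracy collapses on the bias-misaligned partition is plausibly exploiting that shortcut rather than reading the context.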


Related research

07/07/2020 · What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets
Question answering biases in video QA datasets can mislead multimodal mo...

09/09/2018 · Transforming Question Answering Datasets Into Natural Language Inference Datasets
Existing datasets for natural language inference (NLI) have propelled re...

04/05/2022 · Improved and Efficient Conversational Slot Labeling through Question Answering
Transformer-based pretrained language models (PLMs) offer unmatched perf...

11/01/2021 · Introspective Distillation for Robust Question Answering
Question answering (QA) models are well-known to exploit data bias, e.g....

04/21/2023 · Inducing anxiety in large language models increases exploration and bias
Large language models are transforming research on machine learning whil...

10/06/2020 · UNQOVERing Stereotyping Biases via Underspecified Questions
While language embeddings have been shown to have stereotyping biases, h...

10/07/2020 · Improving QA Generalization by Concurrent Modeling of Multiple Biases
Existing NLP datasets contain various biases that models can easily expl...
