The Effect of Natural Distribution Shift on Question Answering Models

04/29/2020
by John Miller, et al.

We build four new test sets for the Stanford Question Answering Dataset (SQuAD) and evaluate the ability of question-answering systems to generalize to new data. Our first test set is from the original Wikipedia domain and measures the extent to which existing systems overfit the original test set. Despite several years of heavy test set re-use, we find no evidence of adaptive overfitting. The remaining three test sets are constructed from New York Times articles, Reddit posts, and Amazon product reviews and measure robustness to natural distribution shifts. Across a broad range of models, we observe average performance drops of 3.8, 14.0, and 17.4 F1 points, respectively. In contrast, a strong human baseline matches or exceeds the performance of SQuAD models on the original domain and exhibits little to no drop in new domains. Taken together, our results confirm the surprising resilience of the holdout method and emphasize the need to move towards evaluation metrics that incorporate robustness to natural distribution shifts.
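The drops quoted above are measured in F1 points. As a rough illustration, and not the authors' evaluation code, the sketch below assumes the commonly used SQuAD-style token-overlap F1 (lowercasing, stripping punctuation and English articles) and averages the per-model difference between an original and a shifted test set. The function names `normalize_answer`, `token_f1`, and `average_f1_drop` are hypothetical, and the sketch scores against a single reference answer, whereas SQuAD-style evaluation typically takes the maximum over several references.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer (0.0 to 1.0)."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most min(count) times.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def average_f1_drop(original_f1, shifted_f1):
    """Mean per-model drop in F1 points.

    Both arguments are dicts mapping a model name to its F1 score
    (in points, 0 to 100) on the original and the shifted test set.
    """
    drops = [original_f1[model] - shifted_f1[model] for model in original_f1]
    return sum(drops) / len(drops)
```

Averaging per-model differences, rather than comparing the two averages computed separately, keeps the calculation tied to the same set of models on both test sets; the two give the same number only when every model is scored on both.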


