Evaluating NLP Models via Contrast Sets

04/06/2020
by Matt Gardner, et al.

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets—up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
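To make the evaluation concrete, here is a minimal sketch (not code from the paper) of how a contrast set could be scored. It assumes a hypothetical `predict` function standing in for any sentiment model and two toy IMDb-style examples invented for illustration; it reports accuracy on the original instances, accuracy on the perturbed instances, and a consistency measure that counts a contrast set as correct only if the model gets the original and every perturbation right.

```python
# Minimal sketch of contrast-set evaluation for a sentiment classifier.
# `predict` is a stand-in for any model; the examples are illustrative,
# not drawn from the released contrast sets.

def predict(text):
    # Hypothetical model: a trivial keyword rule, used only so the
    # sketch runs end to end.
    lowered = text.lower()
    return "negative" if "not" in lowered or "hardly" in lowered else "positive"

# Each contrast set pairs an original test instance with small,
# manually written perturbations that flip the gold label.
contrast_sets = [
    {
        "original": ("A triumph, hard-hitting and quietly devastating.", "positive"),
        "perturbations": [
            ("Hardly a triumph, and quietly devastating in its dullness.", "negative"),
        ],
    },
    {
        "original": ("The plot is engaging from start to finish.", "positive"),
        "perturbations": [
            ("The plot is not engaging from start to finish.", "negative"),
        ],
    },
]

def evaluate(sets):
    orig_correct = contrast_correct = consistent = 0
    n_perturbed = 0
    for s in sets:
        text, label = s["original"]
        ok_orig = predict(text) == label
        orig_correct += ok_orig

        ok_all = ok_orig
        for text, label in s["perturbations"]:
            ok = predict(text) == label
            contrast_correct += ok
            n_perturbed += 1
            ok_all = ok_all and ok
        # Consistency: the original and every perturbation in the set
        # must all be predicted correctly.
        consistent += ok_all

    return {
        "original_accuracy": orig_correct / len(sets),
        "contrast_accuracy": contrast_correct / n_perturbed,
        "contrast_consistency": consistent / len(sets),
    }

print(evaluate(contrast_sets))
```

The gap between `original_accuracy` and `contrast_accuracy` (or the stricter consistency score) is the kind of drop the paper reports when models rely on shallow decision rules rather than the intended capability.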

