Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions

05/01/2022
by Mihir Parmar, et al.

In recent years, progress in natural language understanding (NLU) has been driven by benchmarks. These benchmarks are typically collected by crowdsourcing, where annotators write examples based on annotation instructions crafted by dataset creators. In this work, we hypothesize that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write similar examples that are then over-represented in the collected data. We study this form of bias, termed instruction bias, in 14 recent NLU benchmarks, showing that instruction examples often exhibit concrete patterns, which are propagated by crowdworkers to the collected data. This extends previous work (Geva et al., 2019) and raises a new concern of whether we are modeling the dataset creator's instructions, rather than the task. Through a series of experiments, we show that, indeed, instruction bias can lead to overestimation of model performance, and that models struggle to generalize beyond biases originating in the crowdsourcing instructions. We further analyze the influence of instruction bias in terms of pattern frequency and model size, and derive concrete recommendations for creating future NLU benchmarks.
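The abstract describes measuring how often concrete patterns from instruction examples recur in the collected data. Below is a minimal illustrative sketch (not the paper's actual code) of that kind of pattern-frequency measurement: it reports the fraction of collected questions matching a few lexical patterns. The pattern names, regular expressions, and example questions are invented for illustration, not taken from the benchmarks studied.

import re
from collections import Counter

# Hypothetical lexical patterns drawn from instruction example questions
# (illustrative only; the paper identifies recurring templates per benchmark).
INSTRUCTION_PATTERNS = {
    "how_many_prefix": re.compile(r"^how many\b", re.IGNORECASE),
    "what_is_difference_prefix": re.compile(r"^what is the difference\b", re.IGNORECASE),
}

def pattern_frequencies(questions):
    """Return, for each instruction pattern, the fraction of collected
    questions that match it."""
    counts = Counter()
    for q in questions:
        for name, pattern in INSTRUCTION_PATTERNS.items():
            if pattern.search(q):
                counts[name] += 1
    total = max(len(questions), 1)
    return {name: counts[name] / total for name in INSTRUCTION_PATTERNS}

if __name__ == "__main__":
    # Toy "collected" crowdsourced questions.
    collected = [
        "How many touchdowns were scored in the first half?",
        "What is the difference in population between the two towns?",
        "Which team won the game?",
    ]
    print(pattern_frequencies(collected))

A high frequency for a pattern that also appears in the instruction examples would be one signal of instruction bias in the sense used by the paper.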


Related research

- Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor (12/19/2022)
- How Many Data Samples is an Additional Instruction Worth? (03/17/2022)
- A Computational Analysis of Vagueness in Revisions of Instructional Texts (09/21/2023)
- Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models (05/24/2023)
- Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets (08/21/2019)
- A new dataset and model for learning to understand navigational instructions (05/21/2018)
- When do you need Chain-of-Thought Prompting for ChatGPT? (04/06/2023)
