DeepAI AI Chat
Log In Sign Up

BBQ: A Hand-Built Bias Benchmark for Question Answering

by   Alicia Parrish, et al.

It is well documented that NLP models learn social biases present in the world, but little work has been done to show how these biases manifest in actual model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset consisting of question-sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine different social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two distinct levels: (i) given an under-informative context, test how strongly model answers reflect social biases, and (ii) given an adequately informative context, test whether the model's biases still override a correct answer choice. We find that models strongly rely on stereotypes when the context is ambiguous, meaning that the model's outputs consistently reproduce harmful biases in this setting. Though models are much more accurate when the context provides an unambiguous answer, they still rely on stereotyped information and achieve an accuracy 2.5 percentage points higher on examples where the correct answer aligns with a social bias, with this accuracy difference widening to 5 points for examples targeting gender.


page 2

page 9

page 15

page 16


What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets

Question answering biases in video QA datasets can mislead multimodal mo...

Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management

Recent advances in Natural Language Processing (NLP), and specifically a...

TASA: Deceiving Question Answering Models by Twin Answer Sentences Attack

We present Twin Answer Sentences Attack (TASA), an adversarial attack me...

UNQOVERing Stereotyping Biases via Underspecified Questions

While language embeddings have been shown to have stereotyping biases, h...

Explicit Bias Discovery in Visual Question Answering Models

Researchers have observed that Visual Question Answering (VQA) models te...

SODAPOP: Open-Ended Discovery of Social Biases in Social Commonsense Reasoning Models

A common limitation of diagnostic tests for detecting social biases in N...

On Measuring Social Biases in Prompt-Based Multi-Task Learning

Large language models trained on a mixture of NLP tasks that are convert...