WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale

07/24/2019
by Keisuke Sakaguchi, et al.

The Winograd Schema Challenge (WSC), proposed by Levesque et al. (2011) as an alternative to the Turing Test, was originally designed as a pronoun resolution problem that cannot be solved based on statistical patterns in large text corpora. However, recent studies suggest that current WSC datasets, even when composed carefully by experts, are still prone to biases that statistical methods can exploit. We introduce WINOGRANDE, a new collection of WSC problems that are adversarially constructed to be robust against spurious statistical biases. While the original WSC dataset provided only 273 instances, WINOGRANDE includes 43,985 instances, half of which are determined to be adversarial. Key to our approach is a novel adversarial filtering algorithm, AFLITE, for systematic bias reduction, combined with a careful crowdsourcing design. Despite the significant increase in training data, the performance of existing state-of-the-art methods remains modest (61.6%) and contrasts with high human performance (90.8%). We also show how to use transfer learning to achieve new state-of-the-art results on the original WSC and related datasets. Finally, we discuss how biases lead to overestimating the true capabilities of machine commonsense.

research
11/13/2016

Commonsense Knowledge Enhanced Embeddings for Solving Pronoun Disambiguation Problems in Winograd Schema Challenge

In this paper, we propose commonsense knowledge enhanced embeddings (KEE...
research
05/30/2023

Fighting Bias with Bias: Promoting Model Robustness by Amplifying Dataset Biases

NLP models often rely on superficial cues known as dataset biases to ach...
research
04/23/2020

A Review of Winograd Schema Challenge Datasets and Approaches

The Winograd Schema Challenge is both a commonsense reasoning and natura...
research
01/08/2018

Winograd Schema - Knowledge Extraction Using Narrative Chains

The Winograd Schema Challenge (WSC) is a test of machine intelligence, d...
research
08/16/2018

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

Given a partial description like "she opened the hood of the car," human...
research
05/19/2019

HellaSwag: Can a Machine Really Finish Your Sentence?

Recent work by Zellers et al. (2018) introduced a new task of commonsens...
research
01/06/2020

Stance Detection Benchmark: How Robust Is Your Stance Detection?

Stance Detection (StD) aims to detect an author's stance towards a certa...
