HellaSwag: Can a Machine Really Finish Your Sentence?

05/19/2019
by Rowan Zellers, et al.

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human-level commonsense inference? In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models. Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.
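The Adversarial Filtering loop described above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the authors' implementation: `generate_endings` and `score_ending` stand in for a real text generator and a trained discriminator, and the real AF procedure retrains the discriminator on the updated pool each round, which is simplified away here.

```python
import random

def adversarial_filtering(contexts, generate_endings, score_ending,
                          n_rounds=3, pool_size=4):
    """Toy AF sketch: each round, discard the wrong endings a discriminator
    detects most easily and replace them with fresh generations, so only
    adversarial (hard-to-detect) endings survive."""
    pools = {c: generate_endings(c, pool_size) for c in contexts}
    for _ in range(n_rounds):
        for c, endings in pools.items():
            # Lower score = discriminator is less sure the ending is fake,
            # i.e. the ending is harder and should be kept.
            ranked = sorted(endings, key=lambda e: score_ending(c, e))
            keep = ranked[:pool_size // 2]            # hardest endings survive
            fresh = generate_endings(c, pool_size - len(keep))
            pools[c] = keep + fresh
    return pools

# Hypothetical stand-ins for a real generator and discriminator.
def generate_endings(context, k):
    return [f"{context} ending #{random.randint(0, 999)}" for _ in range(k)]

def score_ending(context, ending):
    return random.random()  # a real discriminator would score "fakeness"

random.seed(0)
pools = adversarial_filtering(["A woman sits at a piano."],
                              generate_endings, score_ending)
print(len(pools["A woman sits at a piano."]))  # pool size stays constant: 4
```

The design point the sketch captures is that AF is a loop over *data*, not over model weights: the dataset itself is optimized to be hard for the current discriminator, which is why scaling example length and complexity into the "Goldilocks" zone matters.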


Related research:

- SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference (08/16/2018)
- Commonsense knowledge adversarial dataset that challenges ELECTRA (10/25/2020)
- Commonsense Reasoning for Conversational AI: A Survey of the State of the Art (02/15/2023)
- From Recognition to Cognition: Visual Commonsense Reasoning (11/27/2018)
- A Surprisingly Robust Trick for Winograd Schema Challenge (05/15/2019)
- Adversarial NLI: A New Benchmark for Natural Language Understanding (10/31/2019)
- WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale (07/24/2019)
