Precise Task Formalization Matters in Winograd Schema Evaluations

10/08/2020
by Haokun Liu et al.

Performance on the Winograd Schema Challenge (WSC), a respected English commonsense reasoning benchmark, recently rocketed from chance accuracy to 89% on the SuperGLUE leaderboard, with relatively little corroborating evidence of a correspondingly large improvement in reasoning ability. We hypothesize that much of this improvement comes from recent changes in task formalization—the combination of input specification, loss function, and reuse of pretrained parameters—by users of the dataset, rather than improvements in the pretrained model's reasoning ability. We perform an ablation on two Winograd Schema datasets that interpolates between the formalizations used before and after this surge, and find (i) framing the task as multiple choice improves performance by 2–6 points and (ii) several additional techniques, including the reuse of a pretrained language modeling head, can mitigate the model's extreme sensitivity to hyperparameters. We urge future benchmark creators to impose additional structure to minimize the impact of formalization decisions on reported results.
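To make the "multiple choice" framing concrete: instead of classifying whether a marked span is the pronoun's referent, the model scores each candidate referent substituted in place of the pronoun and selects the higher-scoring one. Below is a minimal sketch of this idea that reuses a pretrained masked-LM head as the scorer. The model choice (roberta-base), the example schema, and the pseudo-log-likelihood scoring rule are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): the multiple-choice formalization
# of a Winograd schema, scored by a reused pretrained masked-LM head.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def pseudo_log_likelihood(text: str) -> float:
    """Sum of log-probs of each token, masking one position at a time."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(input_ids) - 1):  # skip BOS/EOS special tokens
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return total

# Substitute each candidate referent for the pronoun and compare scores.
# Caveat: candidates with different token counts are not strictly
# comparable; real systems often length-normalize the scores.
schema = "The trophy didn't fit in the suitcase because [PRONOUN] was too big."
candidates = ["the trophy", "the suitcase"]
scores = {c: pseudo_log_likelihood(schema.replace("[PRONOUN]", c))
          for c in candidates}
print(max(scores, key=scores.get))  # expected answer: "the trophy"
```

Because the scorer is the pretrained LM head itself, no new classification head needs to be trained from scratch; the abstract identifies this kind of parameter reuse as one way to mitigate the model's extreme sensitivity to hyperparameters.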


