The Sensitivity of Language Models and Humans to Winograd Schema Perturbations

05/04/2020
by Mostafa Abdou, et al.

Large-scale pretrained language models are the major driving force behind recent improvements in performance on the Winograd Schema Challenge, a widely employed test of common sense reasoning ability. We show, however, with a new diagnostic dataset, that these models are sensitive to linguistic perturbations of the Winograd examples that minimally affect human understanding. Our results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than associative ones. Overall, humans are correct more often than out-of-the-box models, and the models are sometimes right for the wrong reasons. Finally, we show that fine-tuning on a large, task-specific dataset can offer a solution to these issues.

05/15/2019

A Surprisingly Robust Trick for Winograd Schema Challenge

The Winograd Schema Challenge (WSC) dataset WSC273 and its inference cou...
09/02/2021

So Cloze yet so Far: N400 Amplitude is Better Predicted by Distributional Information than Human Predictability Judgements

More predictable words are easier to process - they are read faster and ...
05/28/2020

Language Models are Few-Shot Learners

Recent work has demonstrated substantial gains on many NLP tasks and ben...
04/18/2021

GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation

Large-scale language models such as GPT-3 are excellent few-shot learner...
09/21/2021

Stepmothers are mean and academics are pretentious: What do pretrained language models learn about you?

In this paper, we investigate what types of stereotypical information ar...
10/08/2020

Precise Task Formalization Matters in Winograd Schema Evaluations

Performance on the Winograd Schema Challenge (WSC), a respected English ...
04/07/2022

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

We present a novel task and dataset for evaluating the ability of vision...