The Sensitivity of Language Models and Humans to Winograd Schema Perturbations

by Mostafa Abdou, et al.

Large-scale pretrained language models are the major driving force behind recent improvements in performance on the Winograd Schema Challenge, a widely employed test of common sense reasoning ability. We show, however, with a new diagnostic dataset, that these models are sensitive to linguistic perturbations of the Winograd examples that minimally affect human understanding. Our results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than models. Overall, humans are correct more often than out-of-the-box models, and the models are sometimes right for the wrong reasons. Finally, we show that fine-tuning on a large, task-specific dataset can offer a solution to these issues.


