Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark

05/24/2019
by Nikita Nangia, et al.

The GLUE benchmark (Wang et al., 2019b) is a suite of language understanding tasks which has seen dramatic progress in the past year, with average performance moving from 70.0 at launch to 83.9, state of the art at the time of writing (May 24, 2019). Here, we measure human performance on the benchmark, in order to learn whether significant headroom remains for further progress. We provide a conservative estimate of human performance on the benchmark through crowdsourcing: our annotators are non-experts who must learn each task from a brief set of instructions and 20 examples. In spite of limited training, these annotators robustly outperform the state of the art on six of the nine GLUE tasks and achieve an average score of 87.1. Given the fast pace of progress, however, the headroom we observe is quite limited. To reproduce the data-poor setting that our annotators must learn in, we also train the BERT model (Devlin et al., 2019) in limited-data regimes, and conclude that low-resource sentence classification remains a challenge for modern neural network approaches to text understanding.
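The headline numbers above (70.0, 83.9, 87.1) are GLUE's single aggregate score, an unweighted macro-average over per-task scores. A minimal sketch of that aggregation, using the real GLUE task names but placeholder scores (the values below are illustrative, not the paper's reported results; the actual leaderboard also averages the two metrics for tasks that report a pair, which this sketch glosses over):

```python
# Hypothetical per-task scores for illustration only -- NOT the paper's
# reported human or model results.
TASK_SCORES = {
    "CoLA": 70.0, "SST-2": 90.0, "MRPC": 85.0,
    "STS-B": 90.0, "QQP": 80.0, "MNLI": 90.0,
    "QNLI": 90.0, "RTE": 90.0, "WNLI": 85.0,
}

def glue_macro_average(scores):
    """Unweighted mean of per-task scores, as on the GLUE leaderboard."""
    return sum(scores.values()) / len(scores)

print(round(glue_macro_average(TASK_SCORES), 1))
```

Because every task is weighted equally, a few points gained on a small task like WNLI (634 training examples) moves the aggregate as much as the same gain on MNLI (393k examples), which is part of why the low-resource tasks matter for closing the remaining headroom.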


