Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark

by Nikita Nangia, et al.
NYU college

The GLUE benchmark (Wang et al., 2019b) is a suite of language understanding tasks that has seen dramatic progress in the past year, with average performance moving from 70.0 at launch to 83.9, state of the art at the time of writing (May 24, 2019). Here, we measure human performance on the benchmark, in order to learn whether significant headroom remains for further progress. We provide a conservative estimate of human performance on the benchmark through crowdsourcing: Our annotators are non-experts who must learn each task from a brief set of instructions and 20 examples. In spite of limited training, these annotators robustly outperform the state of the art on six of the nine GLUE tasks and achieve an average score of 87.1. Given the fast pace of progress, however, the headroom we observe is quite limited. To reproduce the data-poor setting that our annotators must learn in, we also train the BERT model (Devlin et al., 2019) in limited-data regimes, and conclude that low-resource sentence classification remains a challenge for modern neural network approaches to text understanding.




1 Introduction

This past year has seen tremendous progress in building general-purpose models that can learn good language representations across a range of tasks and domains (McCann et al., 2017; Peters et al., 2018; Devlin et al., 2019; Howard and Ruder, 2018; Liu et al., 2019). Reusable models like these can be readily adapted to different language understanding tasks and genres. The General Language Understanding Evaluation (GLUE; Wang et al., 2019b) benchmark is designed to evaluate such models. GLUE is built around nine sentence-level natural language understanding (NLU) tasks and datasets, including instances of natural language inference, sentiment analysis, acceptability judgment, sentence similarity, and common sense reasoning.

The recent BigBird model (Liu et al., 2019), a fine-tuned variant of the BERT model (Devlin et al., 2019), is state-of-the-art on GLUE at the time of writing, with the original BERT right at its heels. Both models perform impressively enough on GLUE to prompt some increasingly urgent questions: How much better are humans at these NLP tasks? Do standard benchmarks have enough headroom to meaningfully measure further progress? In the case of one prominent language understanding task with a known human performance number, SQuAD 2.0 (Rajpurkar et al., 2018), models built on BERT come extremely close to human performance. [1] In this work, we estimate human performance on the GLUE test set to determine which tasks see substantial remaining headroom between human and machine performance.

[1] On the recent Situations With Adversarial Generations (SWAG; Zellers et al., 2018) dataset, BERT outperforms individual expert human annotators.

While human performance or interannotator agreement numbers have been reported on some GLUE tasks, the data collection methods used to establish those baselines vary substantially. To maintain consistency in our reported baseline numbers, and to ensure that our results are at least roughly comparable to numbers for submitted machine learning models, we collect annotations using a uniform method for all nine tasks.

We hire crowdworker annotators: For each of the nine tasks, we give the workers a brief training exercise on the task, ask them to annotate a random subset of the test data, and then collect majority vote labels from five annotators for each example in the subset. Comparing these labels with the ground-truth test labels yields an overall GLUE score of 87.1—well above BERT’s 80.5 and BigBird’s 82.9—and yields single-task scores that are substantially better than both on six of nine tasks. However, in light of the pace of recent progress made on GLUE, the gap in most tasks is relatively small. The one striking exception is the data-poor Winograd Schema NLI Corpus (WNLI; based on Levesque et al., 2012), in which humans outperform machines by over 30 percentage points.

To reproduce the data-poor training regime of our annotators, and of WNLI, we investigate BERT’s performance on data-poor versions of the other GLUE tasks and find that it suffers considerably in these low-resource settings. Ultimately, however, BERT’s performance seems genuinely close to human performance and leaves limited headroom in GLUE.

2 Background and Related Work


GLUE (Wang et al., 2019b) is composed of nine sentence or sentence-pair classification or regression tasks: MultiNLI (Williams et al., 2018), RTE (competition releases 1–3 and 5, merged and treated as a single binary classification task; Dagan et al. 2006, Bar Haim et al. 2006, Giampiccolo et al. 2007, Bentivogli et al. 2009), QNLI (an answer sentence selection task based on SQuAD; Rajpurkar et al. 2016) [2], and WNLI test natural language inference. WNLI is derived from private data created for the Winograd Schema Challenge (Levesque et al., 2012), which specifically tests for common sense reasoning. The Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005), the Semantic Textual Similarity Benchmark (STS-B; Cer et al., 2017), and Quora Question Pairs (QQP) test paraphrase and sentence similarity evaluation. The Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018) tests grammatical acceptability judgment. Finally, the Stanford Sentiment Treebank (SST; Socher et al., 2013) tests sentiment analysis.

[2] Our human performance numbers for QNLI are on the original test set since we collected data before the release of the slightly revised second test set. BERT-large’s performance went up by 1.6 percentage points on the new test set, and BERT-base’s performance saw a 0.5 point increase. This suggests that our human performance number represents a reasonable, if very conservative, approximation of human performance on QNLI.

Human Evaluations on GLUE Tasks

Warstadt et al. (2018) report human performance numbers on CoLA. Using the majority decision from five expert annotators on 200 examples, they get a Matthews correlation coefficient (MCC) of 71.3. Bender (2015) also estimates human performance on the original public Winograd Schema Challenge (WSC) data, using crowdworkers and reporting an average accuracy of 92.1%. The RTE corpus papers report inter-annotator agreement numbers on their test sets: 80% on RTE-1, 89.2% on RTE-2, 87.8% on RTE-3, and 97.02% on RTE-5. Wang et al. (2019b) report human performance numbers on GLUE’s manually curated diagnostic test set. The examples in this test set are natural language inference sentence pairs that are tagged for a set of linguistic phenomena. They use expert annotators and report an average coefficient of 0.8.
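For readers unfamiliar with the metric, the following is a minimal sketch of how a Matthews correlation coefficient can be computed for binary acceptability labels. The function name and the 0/1 label encoding are assumptions made for illustration; the value returned lies in [-1, 1], so a reported score such as 71.3 corresponds to the coefficient multiplied by 100.

```python
import math


def matthews_corrcoef(gold, pred):
    """Matthews correlation coefficient for binary labels (1 or 0).

    Hypothetical helper for illustration; returns 0.0 when the
    denominator is zero (e.g. when only one class is ever predicted).
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike plain accuracy, this coefficient stays near zero for a classifier that always predicts the majority class, which makes it a natural fit for unbalanced label distributions.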

3 Data Collection Method

To establish human performance on GLUE, we hire annotators through the Hybrid data collection platform, which is similar to Amazon’s Mechanical Turk. Each worker first completes a short training procedure, then moves on to the main annotation task. For the annotation phase, we tune the pay rate for each task, with an average rate of $17/hr. The training phase has a lower, standard pay rate, with an average pay of $7.6/hr.


In the training phase for each GLUE task, each worker answers 20 random examples from the task development set. Each training page links to instructions that are tailored to the task, and shows five examples. [5] The answers to these examples can be revealed by clicking on a “Show” button at the bottom of the page. We ask the workers to label each set of examples and check their work so they can familiarize themselves with the task. Workers who get less than 65% of the examples correct during training do not qualify for the main task. This is an intentionally low threshold meant only to encourage a reasonable effort. Our platform cannot fully prevent workers from changing their answers after viewing the correct labels, so we cannot use the training phase as a substantial filter. (See Appendix A.1 for details on the training phase.)

[5] All the task specific instructions and FAQ pages used can be found at


We randomly sample 500 examples from each task’s test set for annotation, with the exception of WNLI where we sample 145 of the 147 available test examples (the two missing examples are the result of a data preparation error). For each of these sampled data points, we collect five annotations from five different workers (see Appendix A.2). We use the test set since the test and development sets are qualitatively different for some tasks, and we wish to compare our results directly with those on the GLUE leaderboard.
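The sampling step can be sketched as follows. This is an illustrative reconstruction, not the script used for the study: the function name, the fixed seed, and the choice to return every example when a test set has fewer than 500 items are all assumptions.

```python
import random


def sample_for_annotation(test_ids, k=500, seed=0):
    """Draw up to k distinct test example ids, without replacement.

    Hypothetical helper: if the test set has k or fewer examples
    (as with WNLI), every example is returned.
    """
    ids = list(test_ids)
    if len(ids) <= k:
        return ids
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(ids, k)
```

Each sampled id would then be shown to five different workers, giving five labels per example.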

4 Results and Discussions

To calculate the human performance baseline, we take the majority vote across the five crowd-sourced annotations. In the case of MultiNLI, since there are three possible labels (entailment, neutral, and contradiction), about 2% of examples see a tie between two labels. For these ties, we take the label that is more frequent in the development set. In the case of STS-B, we take an average of the scalar annotator labels. Since we only collect annotations for a subset of the data, we cannot access the test set through the GLUE leaderboard interface, so we instead submit our predictions to the GLUE organizers privately.
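The aggregation rule described above can be sketched in a few lines. The helper names and the development-set counts in the usage note are hypothetical; only the rule itself (majority vote, ties broken by development-set frequency, mean for STS-B) comes from the text.

```python
from collections import Counter
from statistics import mean


def majority_label(annotations, dev_label_counts):
    """Majority vote over one example's annotations.

    Ties are broken in favour of the label that is more frequent in
    the development set, given as a dict mapping label -> count.
    """
    counts = Counter(annotations)
    top = max(counts.values())
    tied = [label for label, c in counts.items() if c == top]
    if len(tied) == 1:
        return tied[0]
    return max(tied, key=lambda label: dev_label_counts.get(label, 0))


def sts_score(annotations):
    """STS-B aggregate: the mean of the scalar annotator labels."""
    return mean(annotations)
```

With five annotators and three labels, a tie can only be a 2-2-1 split. For example, with made-up development counts `{"entailment": 30, "neutral": 20, "contradiction": 25}`, the votes `["neutral", "neutral", "entailment", "entailment", "contradiction"]` resolve to `"entailment"`.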

We compare human performance to BERT and BigBird. The human performance numbers in Table 1 show that overall our annotators stick it to the Muppets on GLUE. However, on MRPC, QQP, and QNLI, BigBird and BERT outperform our annotators. The results on QQP are particularly surprising: BERT and BigBird score over 12 F1 points better than our annotators. Our annotators, however, are only given 20 examples and a short set of instructions for training, while BERT and BigBird are fine-tuned on the 364k-example QQP training set. In addition, we find it difficult to compose concise instructions for QQP that actually match the supplied labels. We do not have access to the material used to create the dataset, and we find it difficult to infer simple instructions from the data (sample provided in Appendix B). If given more training data, it is possible that our annotators could better learn relatively subtle label definitions that better fit the corpus.

Unanimous Vote

To investigate the possible effect of ambiguous label definitions, we look at human performance when there is 5-way annotator agreement. Using unanimous agreement, rather than majority agreement, has the potential effect of filtering out examples of two kinds: those for which our supplied annotation guidelines don’t provide clear advice and those for which humans understand the expectations of the task but find the example genuinely difficult or uncertain. To disentangle the two effects, we also look at BERT results on this subset of the test set, as BERT’s use of large training sets means that it should only suffer in the latter cases. We get consent from the authors of BERT to work in cooperation with the GLUE team to measure BERT’s performance on this subset, which we show in Table 1. Overall, we see the gap widen between the human baseline and BERT by 3.1 points. The largest shifts in performance are on CoLA, MRPC, QQP, and WNLI. The relative jumps in performance on MRPC and QQP support the claim that human performance is hurt by imprecise guidelines and that the use of substantially more training data gives BERT an edge on our annotators.
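As a rough illustration of this analysis, the sketch below keeps only the examples on which all annotators agree and scores both the human labels and a model's predictions on that same subset. The function names and the toy data are hypothetical, and accuracy stands in for whichever metric a given task uses.

```python
def unanimous_subset(annotations):
    """Map example id -> label for examples where all annotators agree."""
    return {
        ex_id: labels[0]
        for ex_id, labels in annotations.items()
        if labels and len(set(labels)) == 1
    }


def accuracy(predictions, gold, ids):
    """Fraction of the given ids on which predictions match gold."""
    ids = list(ids)
    if not ids:
        return 0.0
    return sum(predictions[i] == gold[i] for i in ids) / len(ids)


# Toy data, invented for illustration only.
annotations = {
    "a": [1, 1, 1, 1, 1],
    "b": [1, 0, 1, 1, 0],  # not unanimous, so it is filtered out
    "c": [0, 0, 0, 0, 0],
}
gold = {"a": 1, "b": 0, "c": 0}
model = {"a": 1, "b": 0, "c": 1}

human = unanimous_subset(annotations)
human_acc = accuracy(human, gold, human)
model_acc = accuracy(model, gold, human)
```

Scoring the model on the same filtered ids is what allows the two effects to be separated: the subset is defined by human agreement alone, so any change in the model's score reflects the examples removed rather than a change in the model.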

In general, BERT needs large datasets to fine-tune on. This is further evidenced by its performance discrepancy between MultiNLI and RTE: human performance is similar for the two, whereas BERT shows a 16.2 percentage point gap between the two datasets. Both MultiNLI and RTE are textual entailment datasets, but MultiNLI’s training set is quite large at 393k examples, while the GLUE version of RTE has only 2.5k examples. However, BigBird does not show as large a gap, which may be because it employs a multi-task learning approach which fine-tunes the model for all sentence-pair tasks jointly. Their RTE classifier, for example, benefits from the large training dataset for the closely related MultiNLI task.

Low-Resource BERT Baseline

To understand the impact of abundant target-task training data on the limited headroom that we observe, we train several additional baselines. In these, we fine-tune BERT on 5k, 1k, and 500 examples for each GLUE task (or fewer for tasks with fewer training examples). We use BERT for this analysis because the authors have released their code and have provided pretrained weights for the model. We use their publicly available implementation of BERT-large, their pretrained weights as the initialization for fine-tuning on the GLUE tasks, and the hyperparameters they report. We see a precipitous drop in performance on most tasks with large datasets, with the exception of QNLI. A possible partial explanation is that both QNLI and the BERT training data come from English Wikipedia. On MRPC and QQP, however, BERT’s performance drops below human performance in the 1k- and 500-example settings. On the whole, we find that BERT suffers in low-resource settings. These results are in agreement with the findings in Phang et al. (2019), who conduct essentially the same experiment.
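The construction of the limited-data training sets can be sketched as below. This is an assumption-laden illustration rather than the procedure actually used: the function name, the fixed seed, and the choice to make the smaller subsets nested inside the larger ones are not specified in the text.

```python
import random


def low_resource_subsets(train_examples, sizes=(5000, 1000, 500), seed=0):
    """Build one training subset per target size.

    Hypothetical helper: the data is shuffled once and each subset is
    a prefix of that shuffle, so smaller subsets are nested in larger
    ones. Tasks with fewer examples than a target size keep all of
    their data for that setting.
    """
    shuffled = list(train_examples)
    random.Random(seed).shuffle(shuffled)
    return {size: shuffled[:min(size, len(shuffled))] for size in sizes}
```

For a task with only 634 training pairs, the 5k and 1k settings would both fall back to the full 634 examples, while the 500-example setting would still be a proper subset.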


Our human performance number on CoLA is 4.9 points below what was reported in Warstadt et al. (2018). We believe this discrepancy is because they use linguistics PhD students as expert annotators while we use crowdworkers. This further supports our belief that our human performance baseline is a conservative estimate, and that higher performance is possible, particularly with more training.


No system on the GLUE leaderboard has managed to exceed the performance of the most-frequent-class baseline on WNLI, and several papers that propose methods for GLUE justify their poor performance by asserting that the task must be somehow broken. [6] WNLI’s source Winograd Schema data was constructed so as not to include any statistical cues that a simple machine learning system can exploit, which can make it quite difficult. The WNLI test set shows one of the highest human performance scores of the nine GLUE tasks, reflecting its status as a corpus constructed and vetted by artificial intelligence experts. This affirms that tasks like WNLI with small training sets (634 sentence pairs) and no simple cues remain a serious (and sometimes unacknowledged) blind spot for modern neural network sentence understanding methods.

[6] Devlin et al. (2019), for example, mention that they avoid “the problematic WNLI set”.

5 Conclusion

This paper presents a conservative estimate of human performance to serve as a target for the GLUE sentence understanding benchmark. We obtain this baseline with the help of crowdworker annotators. We find that state-of-the-art models like BERT are not far behind human performance on most GLUE tasks. But we also note that, when trained in low-resource settings, BERT’s performance falls considerably. Given these results, and the continued difficulty neural methods have with the Winograd Schema Challenge, we argue that future work on GLUE-style sentence understanding tasks might benefit from a focus on learning from smaller training sets. In work subsequent to the main results of this paper, we have prepared such a benchmark in the GLUE follow-up SuperGLUE (Wang et al., 2019a).


This work was made possible in part by a donation to NYU from Eric and Wendy Schmidt made by recommendation of the Schmidt Futures program and by funding from Samsung Research. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan V GPU used at NYU for this research. We thank Alex Wang and Amanpreet Singh for their help with conducting GLUE evaluations, and we thank Jason Phang for his help with training the BERT model.


Appendix A Crowd-Sourced Data Collection

A.1 Training Phase

During training, we provide a link to task-specific instructions. As an example, the instructions for CoLA are shown in Table 2. The instructions for each task follow the same format: briefly describing the annotator’s job, explaining the labels, and providing at least one example.

In addition to the task-specific instructions, we provide general instructions about the training phase. An example is given in Table 3. Lastly, we provide a link to an FAQ page. The FAQ page addresses the balance of the data. If the labels are balanced, we tell the annotators so. If the labels are not balanced, we assure the annotators that they need not worry about assigning one label more frequently. For most tasks we also describe where the data comes from, e.g. news articles. All of the task specific instructions and FAQ pages can be found at

On each training page, each annotator is given five examples to annotate. At the bottom of the page, there is a “Show” button which reveals the ground truth labels. If their submitted answer is incorrect, the label is shown in red, otherwise it is shown in black. In the instructions, the annotator is asked to check their work with this button. Given this procedure, we cannot prevent the annotators from changing their answer after viewing the ground truth labels.

New York University’s Center for Data Science is collecting your answers for use in research on computer understanding of English. Thank you for your help!
We will present you with a sentence someone spoke. Your job is to figure out, based on this sentence, if the speaker is a native speaker of English. You should ignore the general topic of the sentence and focus on the fluency of the sentence.
  • Choose correct if you think the sentence sounds fluent and you think it was spoken by a native-English speaker. Examples:

    • “A hundred men surrounded the fort.”

    • “Everybody who attended last week’s huge rally, whoever they were, signed the petition.”

    • “Where did you go and who ate what?”

  • Choose incorrect if you think the sentence does not sound completely fluent and may have been spoken by a non-native English speaker. Examples:

    • “Sue gave to Bill a book.”

    • “Mary came to be introduced by the bartender and I also came to be.”

    • “The problem perceives easily.”

More questions? See the FAQ page.
Table 2: The instructions given to crowd-sourced workers for the CoLA task. While the instructions were tailored for each task in GLUE, they all followed a similar format.
This project is a training task that needs to be completed before working on the main project on Hybrid named Human Performance: CoLA. For this CoLA task, we have the true label and we want to get information on how well people do on the task. This training is short but is designed to help you get a sense of the questions and the expected labels.
Please note that the pay per HIT for this training task is also lower than it is for the main project Human Performance: CoLA. Once you are done with the training, please proceed to the main task!
In this training, you must answer all the questions on the page and then, to see how you did, click the Show button at the bottom of the page before moving onto the next HIT. The Show button will reveal the true labels. If you answered correctly, the revealed label will be in black, otherwise it will be in red. Please use this training and the provided answers to build an understanding of what the answers to these questions looks like (the main project, Human Performance: CoLA, does not have the answers on the page).
Table 3: Instructions about the training phase provided to workers. This example is for CoLA training. The only change in instructions for other tasks is the name of the task.

A.2 Annotation Phase

In the main data collection phase we provide the annotators with a link to the same task-specific instructions (Table 2) and FAQ page used during the training phase. We enforce the training phase as a qualification for annotation, so crowdworkers cannot participate in annotation without first completing the associated training.

Appendix B QQP Example

The 25 randomly sampled examples from the QQP development set are given in Tables 4, 5, and 6.

Question 1 | Question 2 | Label
What are the best resources for learning Ukrainian? | What are the best resources for learning Turkish? | 0
How much time will it take to charge a 10,000 mAh power bank? | How much time does it takes to charge the power bank 13000mAh for full charge? | 0
How do you know if you’re in love? | How can you know if you’re in love or just attracted to someone? | 1
Which are the best and affordable resorts in Goa? | What are some affordable and safe beach resorts in Goa? | 1
How winning money from YouTube? | How do I make money from a YouTube channel? | 1
Table 4: Five randomly sampled examples from QQP’s development set. Pairs of sentences with a label of 1 are marked as paraphrases in QQP.
Question 1 | Question 2 | Label
What is actual meaning of life? Indeen, it depend on perception of people or other thing? | What is the meaning of my life? | 1
What is the difference between CC and 2S classes of travel in Jan Shatabdi express? | What is TQWL in IRCTC wait list? | 0
What would have happened if Hitler hadn’t declared war on the United States after Pearl Harbor? | What would have happened if the United States split in two after the revolutionary war? | 0
Will it be a problem if a friend deposits 4 lakhs in my savings bank account and I don’t have a source of income to show? | I am 25.5 year old boy with a B.Com in a sales job having a package of 4 LPA. I will be married in less than a year. I want to quit my job and start my own business with the savings I have of 2 Lakh. Is this an ideal situation to take a risk? | 0
What should you do if you meet an alien? | What could be the possible conversation between humans and aliens on their first meeting? | 0
Why can’t I ask any questions on Quora? | Can you ask any question on Quora? | 0
Should I move from the USA to India? | Moving from usA to India? | 1
Which European countries provide mostly free university education to Indian citizen? | What countries provide free education to Indian students? | 0
I got 112 rank in CDAC (A+B+C). My subject of interest is VLSI. Is there any chance that I would get CDAC Pune, Noida for VLSI? | Suggest some good indian youtube channels for studying Aptitude? | 0
What are the positives and negatives of restorative justice? | Is Vengence and Justice opposite? | 0
Table 5: Another ten randomly sampled examples from QQP’s development set. Pairs of sentences with a label of 1 are marked as paraphrases in QQP.
Question 1 | Question 2 | Label
What’s a good way to make money through effort? | How do I make money without much effort? | 0
What is the meaning of life? Whats our purpose on Earth? | What actually is the purpose of life? | 1
Which among five seasons (summer, winter, autumn, spring, rainy) is most favourable for farming and cultivating of crops? | Which among the five seasons (summer, winter, rainy, spring, autumn) is better for farming and cultivating of crops? | 1
How can I find the real true purpose of my life? | What should one do to find purpose of one’s life? | 1
Is Donald Trump likely to win the 2016 election (late 2015 / early 2016)? | What will Donald Trump’s response be if he doesn’t win the 2016 presidential election? | 0
What is the easiest and cheapest way to lose weight fast? | What are the easiest and the fastest ways to lose weight? | 1
Why are basically all of my questions on Quora marked as ’needing improvement’? Am I that bad? | Why do questions get marked for ’needing improvment’ when they clearly don’t? | 1
What are some of the most visually stunning apps? | What are the most visually stunning foods? | 0
What are some of the good hotels near chennai central railway station? | Best places to eat in Chennai? | 0
How do you prepare for a job interview? | How do I prepare for my first job interview? | 1
Table 6: Another ten randomly sampled examples from QQP’s development set. Pairs of sentences with a label of 1 are marked as paraphrases in QQP.