1 Introduction
This past year has seen tremendous progress in building general-purpose models that can learn good language representations across a range of tasks and domains (McCann et al., 2017; Peters et al., 2018; Devlin et al., 2019; Howard and Ruder, 2018; Liu et al., 2019). Reusable models like these can be readily adapted to different language understanding tasks and genres. The General Language Understanding Evaluation (GLUE; Wang et al., 2019b)
benchmark is designed to evaluate such models. GLUE is built around nine sentence-level natural language understanding (NLU) tasks and datasets, including instances of natural language inference, sentiment analysis, acceptability judgment, sentence similarity, and common sense reasoning.
The recent BigBird model (Liu et al., 2019)—a fine-tuned variant of the BERT model (Devlin et al., 2019)—is state-of-the-art on GLUE at the time of writing, with the original BERT right at its heels. Both models perform impressively enough on GLUE to prompt some increasingly urgent questions: How much better are humans at these NLP tasks? Do standard benchmarks have enough headroom to meaningfully measure further progress? In the case of one prominent language understanding task with a known human performance number, SQuAD 2.0 (Rajpurkar et al., 2018), models built on BERT come extremely close to human performance (see https://rajpurkar.github.io/SQuAD-explorer/). On the recent Situations With Adversarial Generations (SWAG; Zellers et al., 2018) dataset, BERT outperforms individual expert human annotators. In this work, we estimate human performance on the GLUE test set to determine which tasks see substantial remaining headroom between human and machine performance.
While human performance or interannotator agreement numbers have been reported on some GLUE tasks, the data collection methods used to establish those baselines vary substantially. To maintain consistency in our reported baseline numbers, and to ensure that our results are at least roughly comparable to numbers for submitted machine learning models, we collect annotations using a uniform method for all nine tasks.
We hire crowdworker annotators: for each of the nine tasks, we give the workers a brief training exercise on the task, ask them to annotate a random subset of the test data, and then collect majority-vote labels from five annotators for each example in the subset. Comparing these labels with the ground-truth test labels yields an overall GLUE score of 87.1—well above BERT’s 80.5 and BigBird’s 82.9—and single-task scores that are substantially better than both models on six of the nine tasks. However, in light of the pace of recent progress on GLUE, the gap on most tasks is relatively small. The one striking exception is the data-poor Winograd Schema NLI Corpus (WNLI; based on Levesque et al., 2012), on which humans outperform machines by over 30 percentage points.
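For readers unfamiliar with how a single GLUE number is produced, the overall score is conventionally a macro-average of the nine task scores, with a task’s multiple metrics averaged first. The sketch below illustrates that convention only; the function name and the scores passed to it are placeholders, not the values reported in this paper.

```python
# Minimal sketch of collapsing per-task GLUE scores into one overall score,
# assuming the standard convention: tasks that report two metrics are
# averaged internally first, then the task scores are macro-averaged.
# All numbers below are placeholders.
from statistics import mean

def glue_overall(task_scores):
    """task_scores: task name -> a single score or a tuple of metric scores."""
    collapsed = [mean(s) if isinstance(s, (list, tuple)) else s
                 for s in task_scores.values()]
    return mean(collapsed)

print(glue_overall({"CoLA": 65.0, "SST": 96.0, "MRPC": (86.0, 81.0)}))  # placeholder scores
```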
To reproduce the data-poor training regime of our annotators, and of WNLI, we investigate BERT’s performance on data-poor versions of the other GLUE tasks and find that it suffers considerably in these low-resource settings. Ultimately, however, BERT’s performance seems genuinely close to human performance and leaves limited headroom in GLUE.
2 Background and Related Work
GLUE
GLUE (Wang et al., 2019b) is composed of nine sentence- or sentence-pair classification or regression tasks. MultiNLI (Williams et al., 2018), RTE (competition releases 1–3 and 5, merged and treated as a single binary classification task; Dagan et al. 2006; Bar Haim et al. 2006; Giampiccolo et al. 2007; Bentivogli et al. 2009), QNLI (an answer sentence selection task based on SQuAD; Rajpurkar et al. 2016), and WNLI test natural language inference. (Our human performance numbers for QNLI are on the original test set, since we collected data before the release of the slightly revised second test set. BERT-large’s performance went up by 1.6 percentage points on the new test set, and BERT-base’s by 0.5 points, which suggests that our human performance number remains a reasonable, if conservative, approximation for QNLI.) WNLI is derived from private data created for the Winograd Schema Challenge (Levesque et al., 2012), which specifically tests for common sense reasoning. The Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005), the Semantic Textual Similarity Benchmark (STS-B; Cer et al., 2017), and Quora Question Pairs (QQP; https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) test paraphrase detection and sentence similarity. The Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2018) tests grammatical acceptability judgment. Finally, the Stanford Sentiment Treebank (SST; Socher et al., 2013) tests sentiment analysis.
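For reference, the sketch below lists the metric conventionally used for each task (accuracy, F1, Matthews correlation, or Pearson/Spearman correlation). This reflects the standard GLUE evaluation setup as we understand it and is not the benchmark’s own evaluation code.

```python
# Rough task -> metric mapping for GLUE, assuming the standard conventions;
# scoring functions are taken from scikit-learn and SciPy rather than the
# official benchmark code.
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

GLUE_METRICS = {
    "CoLA":     matthews_corrcoef,                                   # Matthews correlation
    "SST":      accuracy_score,
    "MRPC":     lambda y, p: (f1_score(y, p), accuracy_score(y, p)),
    "STS-B":    lambda y, p: (pearsonr(y, p)[0], spearmanr(y, p)[0]),
    "QQP":      lambda y, p: (f1_score(y, p), accuracy_score(y, p)),
    "MultiNLI": accuracy_score,                                      # matched and mismatched sections
    "QNLI":     accuracy_score,
    "RTE":      accuracy_score,
    "WNLI":     accuracy_score,
}
```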
Human Evaluations on GLUE Tasks
Warstadt et al. (2018) report human performance numbers on CoLA as well. Using the majority decision from five expert annotators on 200 examples, they get a Matthews correlation coefficient (MCC) of 71.3. Bender (2015) also estimates human performance on the original public Winograd Schema Challenge (WSC) data. They use crowdworkers and report an average accuracy of 92.1%. The RTE corpus papers report inter-annotator agreement numbers on their test sets: 80% on RTE-1, 89.2% on RTE-2, 87.8% on RTE-3, and 97.02% on RTE-5. Wang et al. (2019b) report human performance numbers on GLUE’s manually curated diagnostic test set. The examples in this test set are natural language inference sentence pairs that are tagged for a set of linguistic phenomena. They use expert annotators and report an average coefficient of 0.8.
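As a point of reference for the CoLA numbers above, the Matthews correlation coefficient can be computed with standard tooling; the labels in this sketch are invented placeholders, not data from any of the studies cited.

```python
# Minimal sketch: computing the Matthews correlation coefficient (MCC),
# the metric used for CoLA. MCC ranges from -1 (total disagreement)
# through 0 (chance level) to +1 (perfect agreement). Labels are invented.
from sklearn.metrics import matthews_corrcoef

gold        = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth acceptability labels
predictions = [1, 0, 1, 0, 0, 1, 1, 0]   # e.g. majority-vote annotator labels

print(matthews_corrcoef(gold, predictions))
```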
3 Data Collection Method
To establish human performance on GLUE, we hire annotators through the Hybrid data collection platform (http://www.gethybrid.io), which is similar to Amazon’s Mechanical Turk. Each worker first completes a short training procedure and then moves on to the main annotation task. For the annotation phase, we tune the pay rate for each task, with an average rate of $17/hr. The training phase has a lower, standard pay rate, with an average of $7.60/hr.
Training
In the training phase for each GLUE task, each worker answers 20 random examples from the task development set. Each training page links to instructions that are tailored to the task, and shows five examples (all task-specific instructions and FAQ pages can be found at https://nyu-mll.github.io/GLUE-human-performance/). The answers to these examples can be revealed by clicking on a “Show” button at the bottom of the page. We ask the workers to label each set of examples and check their work so they can familiarize themselves with the task. Workers who get less than 65% of the examples correct during training do not qualify for the main task. This is an intentionally low threshold meant only to encourage a reasonable effort. Our platform cannot fully prevent workers from changing their answers after viewing the correct labels, so we cannot use the training phase as a substantial filter. (See Appendix A.1 for details on the training phase.)
Annotation
We randomly sample 500 examples from each task’s test set for annotation, with the exception of WNLI, for which we sample 145 of the 147 available test examples (the two missing examples are the result of a data preparation error). For each of these sampled data points, we collect five annotations from five different workers (see Appendix A.2). We use the test set because the test and development sets are qualitatively different for some tasks, and because we wish to compare our results directly with those on the GLUE leaderboard.
4 Results and Discussion
To calculate the human performance baseline, we take the majority vote across the five crowd-sourced annotations for each example. In the case of MultiNLI, which has three possible labels (entailment, neutral, and contradiction), about 2% of examples see a tie between two labels. For these ties, we take the label that is more frequent in the development set. In the case of STS-B, we take the average of the five scalar annotator labels. Since we only collect annotations for a subset of the data, we cannot use the standard GLUE leaderboard interface; instead, we submit our predictions to the GLUE organizers privately.
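A minimal sketch of this aggregation scheme is given below, assuming exactly five annotations per example; the label names, development-set counts, and scores are placeholders rather than our actual data.

```python
# Sketch of the label aggregation described above: majority vote for
# classification tasks, with two-way ties broken in favor of the label
# that is more frequent in the development set, and a simple average of
# the five scalar annotations for STS-B. All values are placeholders.
from collections import Counter
from statistics import mean

def aggregate_classification(annotations, dev_label_counts):
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # Tie between the most frequent labels: fall back on dev-set frequency.
        tied = [label for label, c in counts if c == counts[0][1]]
        return max(tied, key=lambda label: dev_label_counts[label])
    return counts[0][0]

def aggregate_stsb(annotations):
    return mean(annotations)

dev_counts = {"entailment": 3400, "neutral": 3200, "contradiction": 3300}  # placeholder counts
print(aggregate_classification(
    ["entailment", "neutral", "entailment", "neutral", "contradiction"], dev_counts))
print(aggregate_stsb([3.0, 3.5, 4.0, 2.5, 3.0]))
```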
We compare human performance to BERT and BigBird. The human performance numbers in Table 1 show that, overall, our annotators outperform both models on GLUE. However, on MRPC, QQP, and QNLI, BigBird and BERT outperform our annotators. The results on QQP are particularly surprising: BERT and BigBird score over 12 F1 points better than our annotators. Our annotators, however, are given only 20 examples and a short set of instructions for training, while BERT and BigBird are fine-tuned on the 364k-example QQP training set. In addition, we find it difficult to compose concise instructions for QQP that actually match the supplied labels: we do not have access to the material used to create the dataset, and we find it hard to infer simple guidelines from the data (a sample is provided in Appendix B). Given more training data, our annotators might better learn the relatively subtle label definitions that fit the corpus.
Unanimous Vote
To investigate the possible effect of ambiguous label definitions, we look at human performance when there is 5-way annotator agreement. Using unanimous agreement, rather than majority agreement, can filter out examples of two kinds: those for which our supplied annotation guidelines do not provide clear advice, and those for which humans understand the expectations of the task but find the example genuinely difficult or uncertain. To disentangle the two effects, we also look at BERT results on this subset of the test set, since BERT’s use of large training sets means it should only suffer in the latter case. With the consent of the BERT authors, we work with the GLUE team to measure BERT’s performance on this subset; the results are shown in Table 1. Overall, the gap between the human baseline and BERT widens by 3.1 points. The largest shifts in performance are on CoLA, MRPC, QQP, and WNLI. The relative jumps in performance on MRPC and QQP support the claim that human performance is hurt by imprecise guidelines and that the use of substantially more training data gives BERT an edge over our annotators.
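The unanimous-agreement filter itself is straightforward; a small sketch follows, with invented example IDs and labels.

```python
# Sketch: keeping only examples on which all five annotators agree, so that
# both the human baseline and BERT can be rescored on the same subset.
# Example IDs and labels are invented placeholders.
annotations = {
    "ex1": ["paraphrase", "paraphrase", "paraphrase", "paraphrase", "paraphrase"],
    "ex2": ["paraphrase", "not_paraphrase", "paraphrase", "paraphrase", "paraphrase"],
    "ex3": ["not_paraphrase"] * 5,
}

unanimous_ids = [ex_id for ex_id, labels in annotations.items() if len(set(labels)) == 1]
print(unanimous_ids)  # -> ['ex1', 'ex3']
```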
In general, BERT needs large datasets for fine-tuning. This is further evidenced by its performance discrepancy between MultiNLI and RTE: human performance is similar on the two, whereas BERT shows a 16.2 percentage point gap between them. Both MultiNLI and RTE are textual entailment datasets, but MultiNLI’s training set is quite large at 393k examples, while the GLUE version of RTE has only 2.5k examples. However, BigBird does not show as large a gap, which may be because it employs a multi-task learning approach that fine-tunes the model on all sentence-pair tasks jointly. BigBird’s RTE classifier, for example, benefits from the large training set of the closely related MultiNLI task.
Low-Resource BERT Baseline
To understand the impact of abundant target-task training data on the limited headroom that we observe, we train several additional baselines in which we fine-tune BERT on 5k, 1k, and 500 examples for each GLUE task (or fewer for tasks with smaller training sets). We use BERT for this analysis because its authors have released their code and pretrained weights. We use the publicly available implementation of BERT-large, the pretrained weights as the initialization for fine-tuning on the GLUE tasks, and the hyperparameters the authors report. We see a precipitous drop in performance on most tasks with large datasets, with the exception of QNLI; a possible partial explanation is that both QNLI and the BERT training data come from English Wikipedia. On MRPC and QQP, however, BERT’s performance drops below human performance in the 1k- and 500-example settings. On the whole, we find that BERT suffers in low-resource settings. These results agree with the findings of Phang et al. (2019), who conduct essentially the same experiment.
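The subsampling step can be done with a short script along the following lines; the file names and tab-separated layout are assumptions for illustration, since the actual experiments used the released BERT-large code and its standard GLUE data handling.

```python
# Sketch of building the 5k/1k/500-example training subsets used for the
# low-resource baselines. Paths and column layout are assumptions; tasks
# with fewer training examples simply keep everything they have.
import csv
import random

def subsample_training_set(in_path, out_path, n_examples, seed=42):
    with open(in_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter="\t"))
    header, body = rows[0], rows[1:]
    random.Random(seed).shuffle(body)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(body[:n_examples])

for n in (5000, 1000, 500):
    subsample_training_set("QQP/train.tsv", f"QQP/train_{n}.tsv", n)  # hypothetical paths
```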
CoLA
Our human performance number on CoLA is 4.9 points below what was reported in Warstadt et al. (2018). We believe this discrepancy is because they use linguistics PhD students as expert annotators while we use crowdworkers. This further supports our belief that our human performance baseline is a conservative estimate, and that higher performance is possible, particularly with more training.
WNLI
No system on the GLUE leaderboard has managed to exceed the performance of the most-frequent-class baseline on WNLI, and several papers that propose methods for GLUE justify their poor performance by asserting that the task must be somehow broken (Devlin et al. (2019), for example, mention that they avoid “the problematic WNLI set”). WNLI’s source Winograd Schema data was constructed so as not to include any statistical cues that a simple machine learning system could exploit, which can make it quite difficult. Yet the WNLI test set shows one of the highest human performance scores of the nine GLUE tasks, reflecting its status as a corpus constructed and vetted by artificial intelligence experts. This affirms that tasks like WNLI, with small training sets (634 sentence pairs) and no simple cues, remain a serious (and sometimes unacknowledged) blind spot for modern neural network sentence understanding methods.
5 Conclusion
This paper presents a conservative estimate of human performance to serve as a target for the GLUE sentence understanding benchmark. We obtain this baseline with the help of crowdworker annotators. We find that state-of-the-art models like BERT are not far behind human performance on most GLUE tasks. But we also note that, when trained in low-resource settings, BERT’s performance falls considerably. Given these results, and the continued difficulty neural methods have with the Winograd Schema Challenge, we argue that future work on GLUE-style sentence understanding tasks might benefit from a focus on learning from smaller training sets. In work subsequent to the main results of this paper, we have prepared such a benchmark in the GLUE follow-up SuperGLUE (Wang et al., 2019a).
Acknowledgments
This work was made possible in part by a donation to NYU from Eric and Wendy Schmidt made by recommendation of the Schmidt Futures program and by funding from Samsung Research. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan V GPU used at NYU for this research. We thank Alex Wang and Amanpreet Singh for their help with conducting GLUE evaluations, and we thank Jason Phang for his help with training the BERT model.
References
- Bar Haim et al. (2006) Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.
- Bender (2015) David Bender. 2015. Establishing a human baseline for the Winograd Schema Challenge. In MAICS.
- Bentivogli et al. (2009) Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Proceedings of the Text Analysis Conference.
- Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14. Association for Computational Linguistics.
- Dagan et al. (2006) Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges, pages 177–190. Springer.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Dolan and Brockett (2005) William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of IWP.
- Giampiccolo et al. (2007) Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9. Association for Computational Linguistics.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339. Association for Computational Linguistics.
- Levesque et al. (2012) Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd Schema Challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR’12, pages 552–561. AAAI Press.
- Liu et al. (2019) Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. arXiv preprint 1901.11504.
- McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6294–6305. Curran Associates, Inc.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.
- Phang et al. (2019) Jason Phang, Thibault Févry, and Samuel R. Bowman. 2019. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint 1811.01088.
- Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789. Association for Computational Linguistics.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. Association for Computational Linguistics.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Wang et al. (2019a) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint 1905.00537.
- Wang et al. (2019b) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
- Warstadt et al. (2018) Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint 1805.12471.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1112–1122. Association for Computational Linguistics.
- Zellers et al. (2018) Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Appendix A Crowd-Sourced Data Collection
A.1 Training Phase
During training, we provide a link to task-specific instructions. As an example, the instructions for CoLA are shown in Table 2. The instructions for each task follow the same format: they briefly describe the annotator’s job, explain the labels, and provide at least one example.
In addition to the task-specific instructions, we provide general instructions about the training phase; an example is given in Table 3. Lastly, we provide a link to an FAQ page. The FAQ page addresses the balance of the data: if the labels are balanced, we tell the annotators so; if they are not, we assure the annotators that they need not worry about assigning one label more frequently than another. For most tasks we also describe where the data comes from, e.g., news articles. All of the task-specific instructions and FAQ pages can be found at nyu-mll.github.io/GLUE-human-performance/.
On each training page, the annotator is given five examples to annotate. At the bottom of the page there is a “Show” button which reveals the ground-truth labels. If a submitted answer is incorrect, the revealed label is shown in red; otherwise it is shown in black. In the instructions, the annotator is asked to check their work with this button. Given this procedure, we cannot prevent annotators from changing their answers after viewing the ground-truth labels.
Table 2: Task-specific instructions shown to annotators for CoLA.

New York University’s Center for Data Science is collecting your answers for use in research on computer understanding of English. Thank you for your help!

We will present you with a sentence someone spoke. Your job is to figure out, based on this sentence, if the speaker is a native speaker of English. You should ignore the general topic of the sentence and focus on the fluency of the sentence.
Table 3: General instructions shown to annotators during the training phase (CoLA example).

This project is a training task that needs to be completed before working on the main project on Hybrid named Human Performance: CoLA. For this CoLA task, we have the true label and we want to get information on how well people do on the task. This training is short but is designed to help you get a sense of the questions and the expected labels.

Please note that the pay per HIT for this training task is also lower than it is for the main project Human Performance: CoLA. Once you are done with the training, please proceed to the main task!

In this training, you must answer all the questions on the page and then, to see how you did, click the Show button at the bottom of the page before moving onto the next HIT. The Show button will reveal the true labels. If you answered correctly, the revealed label will be in black, otherwise it will be in red. Please use this training and the provided answers to build an understanding of what the answers to these questions looks like (the main project, Human Performance: CoLA, does not have the answers on the page).
A.2 Annotation Phase
In the main data collection phase we provide the annotators with a link to the same task-specific instructions (Table 2) and FAQ page used during the training phase. We enforce the training phase as a qualification for annotation, so crowdworkers cannot participate in annotation without first completing the associated training.
Appendix B QQP Example
Question 1 | Question 2 | Label |
---|---|---|
What are the best resources for learning Ukrainian? | What are the best resources for learning Turkish? | 0 |
How much time will it take to charge a 10,000 mAh power bank? | How much time does it takes to charge the power bank 13000mAh for full charge? | 0 |
How do you know if you’re in love? | How can you know if you’re in love or just attracted to someone? | 1 |
Which are the best and affordable resorts in Goa? | What are some affordable and safe beach resorts in Goa? | 1 |
How winning money from YouTube? | How do I make money from a YouTube channel? | 1 |
What is actual meaning of life? Indeen, it depend on perception of people or other thing? | What is the meaning of my life? | 1 |
What is the difference between CC and 2S classes of travel in Jan Shatabdi express? | What is TQWL in IRCTC wait list? | 0 |
What would have happened if Hitler hadn’t declared war on the United States after Pearl Harbor? | What would have happened if the United States split in two after the revolutionary war? | 0 |
Will it be a problem if a friend deposits 4 lakhs in my savings bank account and I don’t have a source of income to show? | I am 25.5 year old boy with a B.Com in a sales job having a package of 4 LPA. I will be married in less than a year. I want to quit my job and start my own business with the savings I have of 2 Lakh. Is this an ideal situation to take a risk? | 0 |
What should you do if you meet an alien? | What could be the possible conversation between humans and aliens on their first meeting? | 0 |
Why can’t I ask any questions on Quora? | Can you ask any question on Quora? | 0 |
Should I move from the USA to India? | Moving from usA to India? | 1 |
Which European countries provide mostly free university education to Indian citizen? | What countries provide free education to Indian students? | 0 |
I got 112 rank in CDAC (A+B+C). My subject of interest is VLSI. Is there any chance that I would get CDAC Pune, Noida for VLSI? | Suggest some good indian youtube channels for studying Aptitude? | 0 |
What are the positives and negatives of restorative justice? | Is Vengence and Justice opposite? | 0 |
What’s a good way to make money through effort? | How do I make money without much effort? | 0 |
What is the meaning of life? Whats our purpose on Earth? | What actually is the purpose of life? | 1 |
Which among five seasons (summer, winter, autumn, spring, rainy) is most favourable for farming and cultivating of crops? | Which among the five seasons (summer, winter, rainy, spring, autumn) is better for farming and cultivating of crops? | 1 |
How can I find the real true purpose of my life? | What should one do to find purpose of one’s life? | 1 |
Is Donald Trump likely to win the 2016 election (late 2015 / early 2016)? | What will Donald Trump’s response be if he doesn’t win the 2016 presidential election? | 0 |
What is the easiest and cheapest way to lose weight fast? | What are the easiest and the fastest ways to lose weight? | 1 |
Why are basically all of my questions on Quora marked as ’needing improvement’? Am I that bad? | Why do questions get marked for ’needing improvment’ when they clearly don’t? | 1 |
What are some of the most visually stunning apps? | What are the most visually stunning foods? | 0 |
What are some of the good hotels near chennai central railway station? | Best places to eat in Chennai? | 0 |
How do you prepare for a job interview? | How do I prepare for my first job interview? | 1 |