We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.READ FULL TEXT VIEW PDF
We introduce jiant, an open source toolkit for conducting multitask and
The CLEF 2019 ProtestNews Lab tasks participants to identify text relati...
For natural language understanding (NLU) technology to be maximally usef...
In the last year, new models and methods for pretraining and transfer
Speech translation (ST) aims to learn transformations from speech in the...
In this article, we shall describe some of the most interesting topics i...
The group structure on the rational points of elliptic curves plays seve...
Natural Language Processing (NLP) models have achieved superhuman performance on a number of recently proposed benchmarks. However, these models are still well below human level performance for language understanding as a whole, suggesting a disconnect between our benchmarks and the actual capabilities of these models. The General Language Understanding Evaluation benchmark (GLUE) (wang2018glue) was introduced in 2018 to evaluate performance on a wide range of NLP tasks, and top models achieved superhuman performance within a year. To address the shortcomings of GLUE, researchers designed the SuperGLUE benchmark with more difficult tasks (wang2019superglue). About a year since the release of SuperGLUE, performance is again essentially human-level (raffel2019exploringT5). While these benchmarks evaluate linguistic skills more than overall language understanding, an array of commonsense benchmarks have been proposed to measure basic reasoning and everyday knowledge (zellers2019hellaswag; huang2019cosmosqa; bisk2019physicaliqa). However, these recent benchmarks have similarly seen rapid progress (khashabi2020unifiedqa). Overall, the near human-level performance on these benchmarks suggests that they are not capturing important facets of language understanding.
Transformer models have driven this recent progress by pretraining on massive text corpora, including all of Wikipedia, thousands of books, and numerous websites. These models consequently see extensive information about specialized topics, most of which is not assessed by existing NLP benchmarks. It consequently remains an open question just how capable current language models are at learning and applying knowledge from many domains.
To bridge the gap between the wide-ranging knowledge that models see during pretraining and the existing measures of success, we introduce a new benchmark for assessing models across a diverse set of subjects that humans learn. We design the benchmark to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots.
We find that meaningful progress on our benchmark has only become possible in recent months. In particular, models up to billion parameters (brown2020gpt3) achieve random chance performance of accuracy, but the billion parameter GPT-3 model reaches a much higher accuracy (see Figure 0(b)). On the other hand, unlike human professionals GPT-3 does not excel at any single subject. Instead, we find that performance is lopsided, with GPT-3 having almost accuracy for its best subject but near-random performance for several other subjects.
Our results indicate that while recent advances have been impressive, state-of-the-art models still struggle at learning and applying knowledge from pretraining. The tasks with near-random accuracy include calculation-heavy subjects such as physics and mathematics and subjects related to human values such as law and morality. This second weakness is particularly concerning because it will be important for future models to have a strong understanding of what is legal and what is ethical. Worryingly, we also find that GPT-3 does not have an accurate sense of what it does or does not know since its average confidence can be up to off from its actual accuracy. We comprehensively evaluate the breadth and depth of a model’s text understanding by covering numerous topics that humans are incentivized to learn. Since our test consists in tasks, it can be used to analyze aggregate properties of models across tasks and to track important shortcomings. The test and code is available at github.com/hendrycks/test.
The dominant paradigm in NLP is to pretrain large models on massive text corpora including educational books and websites. In the process, these models are exposed to information about a wide range of topics. petroni2019languagemodelsasknowledgebase found that recent models learn enough information from pretraining that they can serve as knowledge bases. However, no prior work has comprehensively measured the knowledge models have across many real-world domains.
Until recently, researchers primarily used fine-tuned models on downstream tasks (BERTDevlin2019). However, larger pretrained models like GPT-3 (brown2020gpt3) have made it possible to achieve competitive performance without fine-tuning by using few-shot learning, which removes the need for a large fine-tuning set. With the advent of strong zero-shot and few-shot learning, it is now possible to curate a diverse set of tasks for evaluation and remove the possibility of models on “spurious cues” (geirhos2020shortcut; Hendrycks2019NaturalAE) in a dataset to achieve high performance.
Many recent benchmarks aim to assess a model’s general world knowledge and basic reasoning ability by testing its “commonsense.” A number of commonsense benchmarks have been proposed in the past year, but recent models are already nearing human-level performance on several of these, including HellaSwag (zellers2019hellaswag), Physical IQA (bisk2019physicaliqa), and CosmosQA (huang2019cosmosqa). By design, these datasets assess abilities that almost every child has. In contrast, we include harder specialized subjects that people must study to learn.
Some researchers have suggested that the future of NLP evaluation should focus on Natural Language Generation (NLG)(zellers2020turingadvice), an idea that reaches back to the Turing Test (Turing1990TuringTest). However, NLG is notoriously difficult to evaluate and lacks a standard metric (Sai2020NLGSurvey). Consequently, we instead create a simple-to-evaluate test that measures classification accuracy on multiple choice questions.
While several question answering benchmarks exist, they are comparatively limited in scope. Most either cover easy topics like grade school subjects for which models can already achieve strong performance (Clark2018ARCAI2; khot2019qasc; OpenBookQA2018; Clark2019RegentsScienceExams), or are focused on linguistic understanding in the form of reading comprehension (lai2017race; richardson-etal-2013-mctest). In contrast, we include a wide range of difficult subjects that go far beyond linguistic understanding.
We create a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. There are tasks in total, which is also the number of Atari games (Bellemare2013Atari), all of which are listed in Appendix B. The questions in the dataset were manually collected by graduate and undergraduate students from freely available sources online. These include practice questions for tests such as the Graduate Record Examination and the United States Medical Licensing Examination. It also includes questions designed for undergraduate courses and questions designed for readers of Oxford University Press books. Some tasks cover a subject, like psychology, but at a specific level of difficulty, such as “Elementary,” “High School,” “College,” or “Professional.” For example, the “Professional Psychology” task draws on questions from freely available practice questions for the Examination for Professional Practice in Psychology, while the “High School Psychology” task has questions like those from Advanced Placement Psychology examinations.
We collected questions in total, which we split into a few-shot development set, a validation set, and a test set. The few-shot development set has
questions per subject, the validation set may be used for selecting hyperparameters and is made ofquestions, and the test set has questions. Each subject contains test examples at the minimum, which is longer than most exams designed to assess people.
Since our test aggregates different subjects and several levels of difficulty, we measure more than straightforward commonsense or narrow linguistic understanding. Instead, we measure arbitrary real-world text understanding. Since models are pretrained on the Internet, this enables us to test how well they can extract useful knowledge from massive corpora. To succeed at our test, future models should be well-rounded, possess extensive world knowledge, and develop expert-level problem solving ability. These properties make the test likely to be an enduring and informative goalpost.
The humanities is a group of disciplines that make use of qualitative analysis and analytic methods rather than scientific empirical methods. Branches of the humanities include law, philosophy, history, and so on (Appendix B). Mastering these subjects requires a variety of skills. For example, legal understanding requires knowledge of how to apply rules and standards to complex scenarios, and also provide answers with stipulations and explanations. We illustrate this in Figure 1
. Legal understanding is also necessary for understanding and following rules and regulations, a necessary capability to constrain open-world machine learning models. For philosophy, our questions cover concepts like logical fallacies, formal logic, and famous philosophical arguments. It also covers moral scenarios, including questions from the ETHICS dataset(hendrycks2020ethicsdataset) that test a model’s understanding of normative statements through predicting widespread moral intuitions about diverse everyday scenarios. Finally, our history questions cover a wide range of time periods and geographical locations, including prehistory and other advanced subjects.
Social science includes branches of knowledge that examine human behavior and society. Subject areas include economics, sociology, politics, geography, psychology, and so on. See Figure 2 for example questions. Our economics questions include microeconomics, macroeconomics, and econometrics, and cover different types of problems, including questions that require a mixture of world knowledge, qualitative reasoning, or quantitative reasoning. We also include important but more esoteric topics such as security studies in order to test the boundaries of what is experienced and learned during pretraining. Social science also includes psychology, a field that may be especially important for attaining a nuanced understanding of humans.
STEM subjects include physics, computer science, mathematics, and more. Two examples are shown in Figure 3. Conceptual physics tests understanding of simple physics principles and may be thought of as a harder version of the physical commonsense benchmark Physical IQA (bisk2019physicaliqa). We also test mathematical problem solving ability at various levels of difficulty, from the elementary to the college level. College mathematics questions, like those found on the GRE mathematics subject test, often require chains of reasoning and abstract knowledge. To encode mathematics expressions, we use LaTeX or symbols such as * and ^ for multiplication and exponentiation respectively. STEM subjects require knowledge of empirical methods, fluid intelligence, and procedural knowledge.
There is a long tail of subjects that either do not neatly fit into any of the three preceding categories or for which there are not thousands of freely available questions. We put these subjects into Other. This section includes the Professional Medicine task, which has difficult questions that require humans many years of study to master. An example is depicted in Figure 4. This section also contains business topics like finance, accounting, and marketing, as well as knowledge of global facts. The latter includes statistics about poverty in different countries over time, which may be necessary for having an accurate model of the world internationally.
To measure performance on our multitask benchmark, we compute the classification accuracy across all examples and tasks. We evaluate GPT-3 (brown2020gpt3) and UnifiedQA (khashabi2020unifiedqa). For GPT-3 we use the OpenAI API, which provides access to four model variants, “Ada,” “Babbage,” “Curie,” and “Davinci,” which we refer to as “Small” ( billion parameters), “Medium” ( billion), “Large” ( billion) and “X-Large” ( billion). UnifiedQA uses the T5 (raffel2019exploringT5) text-to-text backbone and is fine-tuned on previously proposed question answering datasets (lai2017race), where the prediction is the class with the highest token overlap with UnifiedQA’s text output. Since UnifiedQA is fine-tuned on other datasets, we evaluate it without any further tuning to assess its transfer accuracy.
We feed GPT-3 prompts like that shown in Figure 0(a). We begin each prompt with “The following are multiple choice questions (with answers) about [subject].” For zero-shot evaluation, we append the question to the prompt. For few-shot evaluation, we add up to
demonstration examples with answers to the prompt before appending the question. All prompts end with “Answer: ”. The model then produces probabilities for the tokens “A,” “B,” “C,” and “D,” and we treat the highest probability option as the prediction. For consistent evaluation, we create a dev set withfixed few-shot examples for each subject.
We compare the few-shot accuracy of each GPT-3 size in Table 1. We find that the three smaller GPT-3 models have near random accuracy (around ). We also assess the billion parameter T5 model in a few-shot setting and confirmed that it likewise has random chance accuracy. In contrast, we find that the X-Large billion parameter GPT-3 model performs substantially better than random, with an accuracy of . We also find qualitatively similar results in the zero-shot setting. While the smaller models have around zero-shot accuracy, Figure 5(c) in Appendix A shows that the largest GPT-3 model has a much higher zero-shot accuracy of about . In Figure 0(b) we show that non-random accuracy on the multitask test emerged with recent large few-shot models compared to datasets that assess commonsense and linguistic understanding.
To test the importance of model size for other methods, we also evaluate UnifiedQA models. UnifiedQA has the advantage of being fine-tuned on other question answering datasets, and we assess it by evaluating its transfer performance without any additional fine-tuning. The largest UnifiedQA model we test has billion parameters, which is slightly larger than GPT-3 Small. Nevertheless, we show in Table 1 that it attains accuracy. This is worse than few-shot GPT-3 X-Large accuracy but higher than zero-shot GPT-3 X-Large, despite UnifiedQA having two orders of magnitude fewer parameters. We also find that even the smallest UnifiedQA variant, with just million parameters, has approximately accuracy. These results suggest that while model size is a key component for achieving strong performance, it is not the only important factor.
Comparing Disciplines. Using our test, we discover that GPT-3 has lopsided performance and several substantial knowledge gaps. Figure 5 shows the few-shot accuracy of GPT-3 for all tasks. It shows the GPT-3 is below expert-level performance for all tasks, with accuracy ranging from for US Foreign Policy to for College Chemistry.
Overall, GPT-3 does poorly on highly procedural problems. Figure 5 shows that calculation-heavy STEM subjects tend to have low accuracy compared to verbal subjects. In fact, out of the lowest-accuracy tasks are STEM subjects that emphasize mathematics or calculations. We speculate that is in part because GPT-3 acquires declarative knowledge more readily than procedural knowlege. For example, many questions in Elementary Mathematics require applying the order of operations for arithmetic, which is described by the acronym PEMDAS (Parentheses Exponents Multiplication Division Addition Subtraction). In Figure 5(a), we confirm that GPT-3 is aware of the acronymn PEMDAS. However, it does not consistently apply PEMDAS to actual problems. On the other hand, procedural understanding is not its only weak point. We find that some verbal tasks such as Moral Scenarios (hendrycks2020ethicsdataset) and Professional Law also have especially low accuracy.
Our test also shows that GPT-3 acquires knowledge quite unlike humans. For example, GPT-3 learns about topics in a pedagogically unusual order. GPT-3 does better on College Medicine () and College Mathematics () than calculation-heavy Elementary Mathematics (). GPT-3 demonstrates unusual breadth, but it does not master a single subject. In this way, our test shows that GPT-3 has many knowledge blindspots and has capabilities that are lopsided.
We should not trust a model’s prediction unless the model is calibrated, meaning that its confidence is a good estimate of the actual probability the prediction is correct. However, large neural networks are often miscalibrated(kilian2017calibration), especially under distribution shift (ovadia2019can). We evaluate the calibration of GPT-3 by testing how well its average confidence estimates its actual accuracy for each subject. We show the results in Figure 5(b), which demonstrates that GPT-3 is uncalibrated. In fact, its confidence is only weakly related to its actual accuracy in the zero-shot setting, with the difference between its accuracy and confidence reaching up to for some subjects. Another calibration measure is the Root Mean Squared (RMS) calibration error (hendrycks2019oe; kumar2019verifiedcalibration). Many tasks have miscalibrated predictions, such as Elementary Mathematics which has a zero-shot RMS calibration error of 19.4%. These results suggest that model calibration has wide room for improvement.
Multimodal Understanding. While text is capable of conveying an enormous number of concepts about the world, many important concepts are conveyed mainly through other modalities, such as images, audio, and physical interaction (bisk2020experiencegroundslang). Existing large-scale NLP models, such as GPT-3, do not incorporate multimodal information, so we design our benchmark to capture a diverse array of tasks in a text-only format. However, as models gain the ability to process multimodal inputs, benchmarks should be designed to reflect this change. One such benchmark could be a “Turk Test,” consisting of Amazon Mechanical Turk Human Intelligence Tasks. These are well-defined tasks that require models to interact with flexible formats and demonstrate multimodal understanding.
The Internet as a Training Set. A major distinction between our benchmark and previous multitask NLP benchmarks is that we do not require large training sets. Instead, we assume that models have acquired the requisite knowledge from reading vast quantities of diverse text from the Internet. This process is typically called pretraining, but it can be thought of as training in its own right, where the downstream evaluation is demonstrating whatever knowledge we would expect a human to pick up from reading the same text.
This motivates us to propose a methodological change so that models are trained more like how humans learn. While most previous machine learning benchmarks have models learn from a large question bank, humans primarily learn new subjects by reading books and listening to others talk about the topic. For specialized subjects such as Professional Law, massive legal corpora are available, such as the 164-volume legal encyclopedia Corpus Juris Secundum, but there are fewer than 5,000 multistate bar exam questions available. Learning the entire law exclusively through a small number of practice tests is implausible, so future models must learn more during pretraining.
For this reason we assess pretrained models in a zero-shot or few-shot setting and we provide a dev, val, and test set for each task. The dev set is used for few-shot prompts, the val set could be used for hyperparameter tuning, and the test set is used to compute the final accuracy. Importantly, the format of our evaluation is not identical to the format in which information is acquired during pretraining. This has the benefit of obviating concerns about spurious training set annotation artifacts (geirhos2020shortcut; Hendrycks2019NaturalAE) and is in stark contrast to the previous paradigm of identically distributed training and test sets. This change also enables collecting a much more extensive and diverse set of tasks for evaluation. We anticipate our methodology becoming more widespread as models improve at extracting information from diverse online sources.
Model Limitations. We find that current large-scale Transformers have wide room for improvement. They are notably poor at modeling human (dis)approval, as evident by the low performance on the Professional Law and Moral Scenarios tasks. For future systems to be aligned with human values, high performance on these tasks is crucial (hendrycks2020ethicsdataset), so future research should especially aim to increase accuracy on these tasks. Models also have difficulty performing calculations, so much so that they exhibit poor performance on Elementary Mathematics and many other STEM subjects with “plug and chug” problems. Additionally, they do not match expert-level performance on any subject, so for all subjects it is subhuman. On average, models are only now starting to move beyond random-chance accuracy levels.
Addressing these shortcomings may be challenging. To illustrate this, we attempted to create a better Professional Law model by pretraining on specialized data but achieved only limited success. We collected approximately 2,000 additional Professional Law training examples. After fine-tuning a RoBERTa-base model (RobertaLiu2019AR) using this custom training set, our model attained test accuracy. To test the impact of additional specialized training data, we also had RoBERTa continue pretraining on approximately 1.6 million legal case summaries using Harvard’s Law Library case law corpus case.law, but after fine-tuning it only attained accuracy. This suggests that while additional pretraining on relevant high quality text can help, it may not be enough to substantially increase the performance of current models.
It is unclear whether simply scaling up existing language models will solve the test. Current understanding indicates that a increase in model size must be accompanied by an approximate increase in data (kaplan2020scalinglaws). Aside from the tremendous expense in creating multi-trillion parameter language models, data may also become a bottleneck, as there is far less written about esoteric branches of knowledge than about everyday text.
We introduced a new test that measures how well text models can learn and apply knowledge encountered during pretraining. By covering 57 subjects at varying levels of difficulty, the test assesses language understanding in greater breadth and depth than previous benchmarks. We found that it has recently become possible for models to make meaningful progress on the test, but that state-of-the-art models have lopsided performance and still do not excel at any individual task. We also showed that current models are uncalibrated and have difficulty with tasks that require calculations. Worryingly, models also perform especially poorly on socially relevant subjects including morality and law. Our expansive test can help researchers pinpoint important shortcomings of models, making it easier to gain a clearer picture of state-of-the-art capabilities.
We would like to thank the following for their helpful comments: Jan Leike, David Krueger, Alex Tamkin, Girish Sastry, and Henry Zhu. DH is supported by the NSF GRFP Fellowship and an Open Philanthropy Project Fellowship. This research was also supported by the NSF Frontier Award 1804794.
We qualitatively analyze when GPT-3 makes high confidence mistakes. We find that while many of these mistakes were clearly wrong, many were mistakes that a human might make. For example, one question it got wrong was “How many chromosomes do all human somatic cells contain?” The correct answer is , while few-shot GPT-3 predicted with confidence . This answer would have been correct if the question asked about the number of pairs of chromosomes. Similarly, many of its other high confidence mistakes were also correct answers to slightly different questions.
Rather than testing models in a few-shot setting, we now determine the impact of fine-tuning with thousands of examples. For this section, we fine-tune RoBERTa-base [RobertaLiu2019AR] and ALBERT-xxlarge [AlbertLan2020], which have fewer than million parameters and have at least fewer parameters than GPT-3 Small. Models are fine-tuned to predict one of four classes and are fine-tuned using the dev+val set, and we test on the test set. We observe that these smaller models can attain better-than-random accuracy. RoBERTa-base attains an overall accuracy of , with accuracy for the humanities, for social sciences, for STEM, and for other. ALBERT-xxlarge attains an accuracy of , with accuracy for the humanities, for the social sciences, for STEM, and for other. Consequently smaller models that are not designed for QA are able to exceed random chance, though barely.
Since language models train on vast text corpora, there is some chance that they have seen the exact question and answer during pretraining. If they memorized the exact question and answer, then they would attain higher accuracy than their true ability. Likewise, a question’s entropy would be especially low if it were memorized. Memorized questions and answers should have low entropy and high accuracy. However, in Figure 6, we see that accuracy and question entropy are not positively correlated, suggesting that the test’s low-entropy questions do not correspond to memorized (and thereby correctly predicted) answers. This suggests that our exact questions were not memorized. However, during pretraining models encountered text related to our questions through processing Wikipedia. We also note that most of our questions came from PDFs or websites where questions and answers are on separate pages.
See brown2020gpt3 for a previous discussion of contamination showing that the phenomena hardly affects performance. To reduce the probability that future models encounter exact questions during test-time, we provide a list of question sources on our github.