AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

04/13/2023
by   Wanjun Zhong, et al.

Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining 95% accuracy on the SAT Math test and 92.5% accuracy on the English test of the Chinese national college entrance exam. This demonstrates the extraordinary performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks that require complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal these models' strengths and limitations, providing valuable insights into future directions for enhancing their general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a more meaningful and robust evaluation of foundation models' performance in real-world scenarios. The data, code, and all model outputs are released at https://github.com/microsoft/AGIEval.

research
05/15/2023

C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models

New NLP benchmarks are urgently needed to align with the rapid developme...
research
07/20/2023

Decoding the Enigma: Benchmarking Humans and AIs on the Many Facets of Working Memory

Working memory (WM), a fundamental cognitive process facilitating the te...
research
11/17/2022

InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges

In this report, we present our champion solutions to five tracks at Ego4...
research
06/08/2023

M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models

Despite the existence of various benchmarks for evaluating natural langu...
research
03/08/2023

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

ChatGPT is attracting a cross-field interest as it provides a language i...
research
09/13/2023

TrafficGPT: Viewing, Processing and Interacting with Traffic Foundation Models

With the promotion of ChatGPT to the public, large language models indee...
research
09/08/2022

FETA: Towards Specializing Foundation Models for Expert Task Applications

Foundation Models (FMs) have demonstrated unprecedented capabilities inc...