CMMLU: Measuring massive multitask language understanding in Chinese

06/15/2023
by   Haonan Li, et al.

As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. This paper aims to bridge this gap by introducing CMMLU, a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities. We conduct a thorough evaluation of 18 advanced multilingual- and Chinese-oriented LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to achieve an average accuracy of 50%, even with in-context examples and chain-of-thought prompts, whereas the random baseline stands at 25%. Additionally, we conduct extensive experiments to identify factors impacting the models' performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.

Related research

05/15/2023
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
New NLP benchmarks are urgently needed to align with the rapid developme...

08/09/2023
Evaluating the Generation Capabilities of Large Chinese Language Models
This paper presents CG-Eval, the first comprehensive evaluation of the g...

08/09/2023
CLEVA: Chinese Language Models EVAluation Platform
With the continuous emergence of Chinese Large Language Models (LLMs), h...

05/26/2023
Large Language Models Can be Lazy Learners: Analyze Shortcuts in In-Context Learning
Large language models (LLMs) have recently shown great potential for in-...

09/07/2020
Measuring Massive Multitask Language Understanding
We propose a new test to measure a text model's multitask accuracy. The ...

07/27/2023
SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark
Large language models (LLMs) have shown the potential to be integrated i...

09/18/2023
Proposition from the Perspective of Chinese Language: A Chinese Proposition Classification Evaluation Benchmark
Existing propositions often rely on logical constants for classification...
