CLEVA: Chinese Language Models EVAluation Platform

08/09/2023
by Yanyang Li, et al.

With the continuous emergence of Chinese Large Language Models (LLMs), how to evaluate a model's capabilities has become an increasingly significant issue. The absence of a comprehensive Chinese benchmark that thoroughly assesses a model's performance, the unstandardized and incomparable prompting procedure, and the prevalent risk of contamination pose major challenges in the current evaluation of Chinese LLMs. We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs. Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard. To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round. Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding. Large-scale experiments featuring 23 influential Chinese LLMs have validated CLEVA's efficacy.
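The abstract does not spell out how the contamination-mitigating sampling works. As a minimal sketch only (the function name sample_round_subset, the example-ID pool, and all parameters below are hypothetical, not CLEVA's actual implementation), one way to guarantee a unique subset per leaderboard round is to fix a single seeded permutation of the example pool and hand each round its own disjoint slice:

```python
import random


def sample_round_subset(example_ids, round_id, subset_size, seed=0):
    """Return the evaluation subset for one leaderboard round.

    Hypothetical sketch: fix one seeded permutation of the example pool and
    give each round its own disjoint slice, so no two rounds share test items
    and previously released subsets cannot contaminate later rounds.
    """
    rng = random.Random(seed)
    shuffled = list(example_ids)
    rng.shuffle(shuffled)              # one fixed, reproducible permutation
    start = round_id * subset_size
    end = start + subset_size
    if end > len(shuffled):
        raise ValueError("example pool exhausted; add new data before this round")
    return shuffled[start:end]         # disjoint slice for this round


# Usage: rounds 0 and 1 draw non-overlapping 500-example subsets.
pool = [f"ex-{i}" for i in range(10_000)]
round_0 = sample_round_subset(pool, round_id=0, subset_size=500)
round_1 = sample_round_subset(pool, round_id=1, subset_size=500)
assert not set(round_0) & set(round_1)
```

This kind of scheme is reproducible (the permutation is seeded) while still keeping round-level test sets disjoint; once the pool runs low, newly curated data would need to be appended before the next round, consistent with the paper's emphasis on regularly adding new data.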


