ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models

08/28/2023
by   Baoli Zhang, et al.

The unprecedented performance of large language models (LLMs) requires comprehensive and accurate evaluation. We argue that benchmarks for LLM evaluation need to be comprehensive and systematic. To this end, we propose the ZhuJiu benchmark, which has the following strengths: (1) Multi-dimensional ability coverage: We comprehensively evaluate LLMs across 7 ability dimensions covering 51 tasks. In particular, we also propose a new benchmark that focuses on the knowledge ability of LLMs. (2) Multi-faceted evaluation methods collaboration: We use 3 different yet complementary evaluation methods to comprehensively evaluate LLMs, ensuring the authority and accuracy of the evaluation results. (3) Comprehensive Chinese benchmark: ZhuJiu is the pioneering benchmark that fully assesses LLMs in Chinese, while also providing equally robust evaluation in English. (4) Avoiding potential data leakage: To avoid data leakage, we construct evaluation data specifically for 37 tasks. We evaluate 10 current mainstream LLMs and conduct an in-depth discussion and analysis of their results. The ZhuJiu benchmark and open-participation leaderboard are publicly released at http://www.zhujiu-benchmark.com/ and we also provide a demo video at https://youtu.be/qypkJ89L1Ic.
