SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark

07/27/2023
by Liang Xu, et al.

Large language models (LLMs) have shown the potential to be integrated into people's daily lives. User preference is therefore the most critical criterion for assessing LLMs' performance in real-world scenarios. However, existing benchmarks mainly measure model accuracy with multiple-choice questions, which limits our understanding of model capabilities in real applications. We fill this gap by proposing SuperCLUE, a comprehensive Chinese benchmark named after another popular Chinese LLM benchmark, CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single- and multi-turn dialogues (OPEN), and closed-ended questions with the same stems as the open-ended single-turn ones (CLOSE). Our study shows that accuracy on closed-ended questions is insufficient to reflect the human preferences achieved on open-ended ones. At the same time, the two can complement each other in predicting actual user preferences. We also demonstrate that GPT-4 is a reliable judge for automatically evaluating human preferences on open-ended questions in a Chinese context. Our benchmark will be released at https://www.CLUEbenchmarks.com
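
The abstract does not say how the reliability of GPT-4 as a judge was measured. One simple way to quantify it is the agreement rate between an automatic judge's pairwise verdicts and human verdicts on the same comparisons. The sketch below illustrates that idea only. The function name `agreement_rate`, the verdict labels ("A", "B", "tie"), and the toy data are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Return the fraction of comparisons where judge and human verdicts match."""
    if len(judge) != len(human):
        raise ValueError("verdict lists must have the same length")
    if not judge:
        raise ValueError("need at least one comparison")
    # Count positions where both raters gave the same verdict.
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Hypothetical verdicts for four pairwise comparisons (illustrative only).
judge_verdicts = ["A", "B", "tie", "A"]
human_verdicts = ["A", "B", "A", "A"]

print(agreement_rate(judge_verdicts, human_verdicts))  # 0.75
```

Raw agreement is easy to read but does not correct for agreement that would occur by chance, so a chance-corrected statistic may be preferable when the verdict classes are imbalanced.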

research
08/28/2023

ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models

The unprecedented performance of large language models (LLMs) requires c...
research
07/19/2023

CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility

With the rapid evolution of large language models (LLMs), there is a gro...
research
06/15/2023

CMMLU: Measuring massive multitask language understanding in Chinese

As the capabilities of large language models (LLMs) continue to advance,...
research
06/26/2023

Automatic Assessment of Divergent Thinking in Chinese Language with TransDis: A Transformer-Based Language Model Approach

Language models have been increasingly popular for automatic creativity ...
research
08/09/2023

CLEVA: Chinese Language Models EVAluation Platform

With the continuous emergence of Chinese Large Language Models (LLMs), h...
research
08/01/2023

JIANG: Chinese Open Foundation Language Model

With the advancements in large language model technology, it has showcas...
research
07/04/2023

CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care

The recent advances in NLP have led to a new trend of applying LLMs to ...
