Evaluating the Performance of Large Language Models on GAOKAO Benchmark

05/21/2023
by   Xiaotian Zhang, et al.

Large language models have demonstrated remarkable performance across various natural language processing tasks; however, their efficacy on more challenging and domain-specific tasks remains less explored. This paper introduces the GAOKAO-Benchmark (GAOKAO-Bench), an intuitive benchmark that employs questions from the Chinese Gaokao examination as test samples for evaluating large language models. To align the evaluation results with human judgment as closely as possible, we designed a zero-shot prompting method that analyzes the model's accuracy and scoring rate by dividing the questions into subjective and objective types. We evaluated the ChatGPT model on the GAOKAO-Benchmark. Our findings reveal that ChatGPT excels at objective questions, while also shedding light on its shortcomings and areas for improvement. To further scrutinize the model's responses, we incorporate human evaluations. In conclusion, this research contributes a robust evaluation benchmark for future large-scale language models and offers valuable insights into the limitations of such models.
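The abstract's two-track scoring can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes objective questions are graded by exact-match accuracy against a reference answer, and subjective questions by a scoring rate, i.e. the fraction of available marks awarded (e.g. by human graders). All function and variable names here are hypothetical.

```python
# Hypothetical sketch of the GAOKAO-Bench scoring scheme described in the abstract.
# Objective questions: accuracy = correct answers / total questions.
# Subjective questions: scoring rate = points awarded / total points available.

def objective_accuracy(items):
    """items: list of (model_answer, reference_answer) pairs."""
    correct = sum(1 for pred, ref in items if pred == ref)
    return correct / len(items)

def subjective_scoring_rate(items):
    """items: list of (awarded_points, max_points) pairs, e.g. from human graders."""
    awarded = sum(a for a, _ in items)
    total = sum(m for _, m in items)
    return awarded / total

if __name__ == "__main__":
    # Toy data: four multiple-choice answers, three free-response scores.
    obj = [("A", "A"), ("C", "B"), ("D", "D"), ("B", "B")]
    subj = [(8, 10), (12, 15), (3, 5)]
    print(f"objective accuracy: {objective_accuracy(obj):.2f}")        # 3/4 = 0.75
    print(f"subjective scoring rate: {subjective_scoring_rate(subj):.2f}")  # 23/30
```

Separating the two metrics matters because objective items can be graded automatically by string comparison, while subjective items require point-based grading, which is why the paper supplements them with human evaluation.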
