Benchmarking Large Language Models on CMExam – A Comprehensive Chinese Medical Exam Dataset

06/05/2023
by Junling Liu, et al.

Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluation, as well as solution explanations for evaluating model reasoning in an open-ended manner. For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations: disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we conducted thorough experiments with representative LLMs and QA algorithms on CMExam. The results show that GPT-4 achieved the best accuracy at 61.6%, highlighting a great disparity when compared to human accuracy, which stood at 71.6%. Although LLMs demonstrate improved performance after fine-tuning, they still fall short of the desired standard, indicating ample room for improvement. To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations. The experiments and findings of the LLM evaluation also provide valuable insights into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines. The dataset and relevant code are available at https://github.com/williamliujl/CMExam.
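As a rough illustration of the standardized, objective evaluation the abstract describes, the sketch below scores multiple-choice predictions against gold answer letters. The record layout (`answer` field, option letters A–E) is an assumption for illustration, not CMExam's documented schema.

```python
# Minimal sketch of multiple-choice accuracy scoring for exam-style QA.
# Field names and answer format are illustrative assumptions, not the
# official CMExam schema.

def mc_accuracy(examples, predictions):
    """Fraction of questions where the predicted option letter
    matches the gold answer letter (e.g. 'A'..'E')."""
    if not examples:
        return 0.0
    correct = sum(
        1 for ex, pred in zip(examples, predictions)
        if pred.strip().upper() == ex["answer"].strip().upper()
    )
    return correct / len(examples)

# Toy demonstration with two questions.
examples = [
    {"question": "...", "answer": "A"},
    {"question": "...", "answer": "C"},
]
predictions = ["A", "B"]
print(mc_accuracy(examples, predictions))  # 0.5
```

A metric of this shape (accuracy over gold option letters) is what allows a single headline number, such as the 61.6% vs. 71.6% model/human comparison above, to be reported per model.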


