LLM-Mini-CEX: Automatic Evaluation of Large Language Model for Diagnostic Conversation

08/15/2023
by Xiaoming Shi, et al.

Interest in developing LLMs for medical diagnosis to improve diagnostic efficiency is growing. Despite their technological potential, there is no unified and comprehensive evaluation criterion, which makes it impossible to assess the quality and potential risks of medical LLMs and in turn hinders their application in medical treatment scenarios. Moreover, current evaluations rely heavily on labor-intensive interaction with LLMs to obtain diagnostic dialogues and on human evaluation of dialogue quality. To address the lack of a unified and comprehensive evaluation criterion, we first establish one, termed LLM-specific Mini-CEX and based on the original Mini-CEX, to assess the diagnostic capabilities of LLMs effectively. To address the labor-intensive interaction problem, we develop a patient simulator that converses automatically with LLMs, and we use ChatGPT to evaluate the resulting diagnostic dialogues automatically. Experimental results show that the LLM-specific Mini-CEX is adequate and necessary for evaluating medical diagnostic dialogue. In addition, ChatGPT can replace manual evaluation on the metrics of humanistic qualities and provides reproducible, automated comparisons between different LLMs.
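The automated pipeline described above (a patient simulator conversing with a candidate LLM, followed by an LLM judge scoring the dialogue) might be organized roughly as in the following minimal sketch. It is not the authors' implementation: the function names `run_dialogue` and `evaluate_dialogue`, the turn limit, the scripted patient lines, and the constant score are all hypothetical placeholders, and the model calls are replaced by local stub functions so that the example runs without any external API. Only the "humanistic qualities" criterion name comes from the abstract.

```python
from typing import Callable, Dict, List

Turn = Dict[str, str]                      # {"role": "patient" | "doctor", "text": ...}
Responder = Callable[[List[Turn]], str]    # maps dialogue history to the next utterance
Judge = Callable[[str, str], int]          # maps (transcript, criterion) to a score


def run_dialogue(patient: Responder, doctor: Responder, max_turns: int = 5) -> List[Turn]:
    """Alternate patient and doctor turns, with the patient speaking first."""
    history: List[Turn] = []
    for _ in range(max_turns):
        history.append({"role": "patient", "text": patient(history)})
        history.append({"role": "doctor", "text": doctor(history)})
    return history


def evaluate_dialogue(history: List[Turn], judge: Judge, criteria: List[str]) -> Dict[str, int]:
    """Ask the judge for one score per criterion on the full transcript."""
    transcript = "\n".join(f"{t['role']}: {t['text']}" for t in history)
    return {criterion: judge(transcript, criterion) for criterion in criteria}


# --- Hypothetical stand-ins for the real models (assumptions, not from the paper) ---

def scripted_patient(history: List[Turn]) -> str:
    """Stub patient simulator: replays fixed complaints in order."""
    lines = ["I have had a cough for three days.", "No fever, but my throat hurts."]
    spoken = sum(1 for t in history if t["role"] == "patient")
    return lines[min(spoken, len(lines) - 1)]


def stub_doctor(history: List[Turn]) -> str:
    """Stub for the LLM under evaluation: asks a follow-up about the last message."""
    return "Could you tell me more about: " + history[-1]["text"]


def stub_judge(transcript: str, criterion: str) -> int:
    """Stub for the LLM judge: returns a constant placeholder score."""
    return 3


if __name__ == "__main__":
    dialogue = run_dialogue(scripted_patient, stub_doctor, max_turns=2)
    print(evaluate_dialogue(dialogue, stub_judge, ["humanistic qualities"]))
```

In a real setup, each stub would presumably be replaced by a call to the corresponding model, and the criteria list would hold the full LLM-specific Mini-CEX items, which the abstract does not enumerate.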

research
05/19/2023

Plug-and-Play Medical Dialogue System

Medical dialogue systems aim to provide accurate answers to patients, ne...
research
08/03/2021

How to Evaluate Your Dialogue Models: A Review of Approaches

Evaluating the quality of a dialogue system is an understudied problem. ...
research
05/24/2022

D4: a Chinese Dialogue Dataset for Depression-Diagnosis-Oriented Chat

In a depression-diagnosis-directed clinical session, doctors initiate a ...
research
07/20/2023

IvyGPT: InteractiVe Chinese pathwaY language model in medical domain

General large language models (LLMs) such as ChatGPT have shown remarkab...
research
06/13/2023

HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation

Similes play an imperative role in creative writing such as story and di...
research
01/17/2022

Improving Clinical Diagnosis Performance with Automated X-ray Scan Quality Enhancement Algorithms

In clinical diagnosis, diagnostic images that are obtained from the scan...
research
06/07/2016

Sorting out symptoms: design and evaluation of the 'babylon check' automated triage system

Prior to seeking professional medical care it is increasingly common for...
