Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation

11/17/2022
by Aleksandar Savkov, et al.

Evaluating automatically generated text is generally hard due to the inherently subjective nature of many aspects of the output quality. This difficulty is compounded in automatic consultation note generation by differing opinions between medical experts both about which patient statements should be included in generated notes and about their respective importance in arriving at a diagnosis. Previous real-world evaluations of note-generation systems saw substantial disagreement between expert evaluators. In this paper we propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists, which are created in a preliminary step and then used as a common point of reference during quality assessment. We observed good levels of inter-annotator agreement in a first evaluation study using the protocol; further, using Consultation Checklists produced in the study as reference for automatic metrics such as ROUGE or BERTScore improves their correlation with human judgements compared to using the original human note.
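The abstract's final point is that scoring a generated note against the Consultation Checklist, rather than against the original human-written note, makes metrics such as ROUGE correlate better with expert judgements. A minimal sketch of that comparison is below, using a simplified unigram-overlap (ROUGE-1-style) F1 score; this is not the authors' evaluation code, and the checklist, note, and generated texts are invented for illustration.

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between a reference and a candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: score the same generated note against two references.
checklist = "patient reports chest pain for two days worse on exertion"
human_note = "c/o chest pain x2 days exacerbated by exertion no sob"
generated = "patient has chest pain for two days worse when exercising"

print(rouge1_f(checklist, generated))   # checklist as reference
print(rouge1_f(human_note, generated))  # original human note as reference
```

Because the checklist records patient statements in a normalised form, a faithful generated note tends to overlap with it more consistently than with any one clinician's idiosyncratic note; in practice one would use the full ROUGE or BERTScore implementations rather than this toy scorer.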

Related research:
04/01/2022

Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation

In recent years, machine learning models have rapidly become better at g...
05/27/2023

An Investigation of Evaluation Metrics for Automated Medical Note Generation

Recent studies on automatic note generation have shown that doctors can ...
05/03/2023

Clinical Note Generation from Doctor-Patient Conversations using Large Language Models: Insights from MEDIQA-Chat

This paper describes our submission to the MEDIQA-Chat 2023 shared task ...
05/05/2022

User-Driven Research of Medical Note Generation Software

A growing body of work uses Natural Language Processing (NLP) methods to...
04/09/2021

A preliminary study on evaluating Consultation Notes with Post-Editing

Automatic summarisation has the potential to aid physicians in streamlin...
10/11/2018

Semantic Structural Evaluation for Text Simplification

Current measures for evaluating text simplification systems focus on eva...
10/07/2020

What Can We Learn from Collective Human Opinions on Natural Language Inference Data?

Despite the subjective nature of many NLP tasks, most NLU evaluations ha...
