MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

08/27/2023
by Scott L. Fleming, et al.

The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), as well as an 8.3% drop in accuracy moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. We make MedAlign available under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences.
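
To illustrate the last point, here is a minimal sketch of how one might quantify agreement between clinician rankings and an automated natural language generation metric using Kendall's tau. The model names and scores are hypothetical placeholders, not MedAlign results, and the metric could be any automated score (e.g., BERTScore or COMET); this is not the paper's exact procedure.

    # Minimal sketch (not from the paper): measure how well an automated NLG
    # metric reproduces clinician preferences over LLMs via rank correlation.
    # Model names and scores below are hypothetical placeholders.
    from scipy.stats import kendalltau

    clinician_rank = {"model_a": 1, "model_b": 2, "model_c": 3}         # 1 = best, per clinician review
    metric_score = {"model_a": 0.41, "model_b": 0.35, "model_c": 0.28}  # higher = better, e.g. BERTScore

    models = sorted(clinician_rank)
    # Negate metric scores so both sequences order models from best to worst.
    tau, p_value = kendalltau(
        [clinician_rank[m] for m in models],
        [-metric_score[m] for m in models],
    )
    print(f"Kendall's tau between clinician ranking and metric ranking: {tau:.2f} (p = {p_value:.2f})")

A tau close to 1 would suggest the automated metric could stand in for human review when ranking LLMs on this benchmark.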
