Capabilities of GPT-4 on Medical Challenge Problems

03/20/2023
by Harsha Nori, et al.

Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation across various domains, including medicine. We present a comprehensive evaluation of GPT-4, a state-of-the-art LLM, on medical competency examinations and benchmark datasets. GPT-4 is a general-purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks. Our analysis covers two sets of official practice materials for the USMLE, a three-step examination program used to assess clinical competency and grant licensure in the United States. We also evaluate performance on the MultiMedQA suite of benchmark datasets. Beyond measuring model performance, experiments were conducted to investigate the influence of test questions containing both text and images on model performance, probe for memorization of content during training, and study probability calibration, which is of critical importance in high-stakes applications like medicine. Our results show that GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models (GPT-3.5) as well as models specifically fine-tuned on medical knowledge (Med-PaLM, a prompt-tuned version of Flan-PaLM 540B). In addition, GPT-4 is significantly better calibrated than GPT-3.5, demonstrating a much-improved ability to predict the likelihood that its answers are correct. We also explore the behavior of the model qualitatively through a case study that shows the ability of GPT-4 to explain medical reasoning, personalize explanations to students, and interactively craft new counterfactual scenarios around a medical case. Implications of the findings are discussed for potential uses of GPT-4 in medical education, assessment, and clinical practice, with appropriate attention to challenges of accuracy and safety.
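To make the notion of probability calibration mentioned above concrete, here is a minimal sketch of one common way to quantify it, the expected calibration error (ECE). This is an illustration under stated assumptions, not the evaluation procedure used by the authors: the function name, the equal-width binning scheme, and the toy inputs are all hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin.

    confidences: predicted probabilities (0.0 to 1.0) that each answer is right.
    correct: booleans saying whether each answer actually was right.
    Illustrative only; the binning choice is an assumption.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, c))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, c in members if c) / len(members)
        # Each bin contributes in proportion to the share of predictions it holds.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data: the model claims 90% confidence but is right 3 times out of 4.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A well-calibrated model yields a value near zero, meaning that answers given with, say, 90% stated confidence are correct about 90% of the time.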


