LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

06/01/2023
by Chunyuan Li, et al.

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as-is, then learns to master open-ended conversational semantics using GPT-4-generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (on eight A100 GPUs). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instructions to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms the previous supervised state of the art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
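As a concrete illustration of the caption-based self-instruct step described above, here is a minimal sketch of prompting GPT-4 to turn a single PubMed Central figure caption into multi-turn instruction-following data. The prompt wording, the caption_to_conversation helper, and the JSON turn schema are illustrative assumptions for this sketch, not the paper's exact pipeline; the call uses the official openai Python client.

```python
# Hypothetical sketch: generate instruction-following data from a figure
# caption with GPT-4, in the spirit of the self-instruct step described
# above. Prompt text and output schema are assumptions, not the paper's.
import json

from openai import OpenAI  # official openai-python client (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You write a multi-turn conversation between a user and an assistant "
    "about a biomedical figure. You only see the figure caption, but write "
    "questions and answers as if the image itself were visible. Respond "
    'with a JSON list of {"from": "user" | "assistant", "value": "..."} turns.'
)

def caption_to_conversation(caption: str) -> list[dict]:
    """Turn one figure caption into conversational instruction data."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Figure caption: {caption}"},
        ],
        temperature=0.7,
    )
    # A production pipeline would validate the model output; this sketch
    # simply assumes well-formed JSON came back.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    caption = ("Chest X-ray showing bilateral patchy infiltrates "
               "consistent with pneumonia.")
    for turn in caption_to_conversation(caption):
        print(f'{turn["from"]}: {turn["value"]}')
```

In the approach the abstract describes, this generation step would be run over the full figure-caption corpus to build the instruction-following training set for the second curriculum stage, while the captions themselves serve the earlier vocabulary-alignment stage unchanged.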

