RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training

03/01/2023
by Zheng Yuan, et al.

Vision-and-language multi-modal pretraining and fine-tuning have shown great success in visual question answering (VQA). Compared to general-domain VQA, the performance of biomedical VQA suffers from limited data. In this paper, we propose a retrieval-augmented pretrain-and-finetune paradigm named RAMM for biomedical VQA to overcome this data limitation. Specifically, we collect a new biomedical dataset named PMCPM, which offers patient-based image-text pairs from PubMed covering diverse patient situations. We then pretrain a biomedical multi-modal model to learn visual and textual representations of image-text pairs and align these representations with an image-text contrastive (ITC) objective. Finally, we propose a retrieval-augmented method to make better use of the limited data: we retrieve similar image-text pairs from the pretraining datasets based on ITC, and introduce a novel retrieval-attention module that fuses the representation of the image and the question with the retrieved images and texts. Experiments demonstrate that our retrieval-augmented pretrain-and-finetune paradigm obtains state-of-the-art performance on the Med-VQA2019, Med-VQA2021, VQA-RAD, and SLAKE datasets. Further analysis shows that the proposed RAMM and PMCPM enhance biomedical VQA performance compared with previous resources and methods. We will open-source our dataset, code, and pretrained model.
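To make the two ITC-related steps in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes image and text embeddings have already been produced by some encoders, uses a symmetric InfoNCE-style loss as one common form of an image-text contrastive objective, and retrieves neighbours by cosine similarity. The function names (`itc_loss`, `retrieve_top_k`), the temperature value, and the toy data are all hypothetical; the retrieval-attention fusion module is not shown because the abstract does not describe its structure.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    The temperature is an illustrative default, not a value from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # (batch, batch) similarities
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def retrieve_top_k(query_emb, corpus_emb, k=3):
    """Return indices and cosine scores of the k most similar corpus entries."""
    scores = l2_normalize(corpus_emb) @ l2_normalize(query_emb)
    top = np.argsort(-scores)[:k]
    return top, scores[top]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))   # stand-in image embeddings
    texts = rng.normal(size=(4, 8))    # stand-in text embeddings
    print("ITC loss on random pairs:", itc_loss(images, texts))

    corpus = rng.normal(size=(10, 8))  # stand-in pretraining-set embeddings
    idx, sims = retrieve_top_k(images[0], corpus, k=3)
    print("retrieved indices:", idx, "scores:", sims)
```

In this sketch the same embedding space serves both purposes: the contrastive loss pulls paired images and texts together during pretraining, and the resulting similarity scores are then reused to look up related pairs for a given query.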


