Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?

12/27/2021
by Sedigheh Eslami, et al.

Contrastive Language–Image Pre-training (CLIP) has shown remarkable success in learning with cross-modal supervision from extensive amounts of image–text pairs collected online. Thus far, the effectiveness of CLIP has been investigated primarily in general-domain multimodal problems. This work evaluates the effectiveness of CLIP for the task of Medical Visual Question Answering (MedVQA). To this end, we present PubMedCLIP, a version of CLIP fine-tuned for the medical domain on image–text pairs from PubMed articles. Our experiments are conducted on two MedVQA benchmark datasets and investigate two MedVQA methods, MEVF (Mixture of Enhanced Visual Features) and QCR (Question answering via Conditional Reasoning). For each of these, we assess the merits of visual representation learning using PubMedCLIP, the original CLIP, and state-of-the-art MAML (Model-Agnostic Meta-Learning) networks pre-trained only on visual data. We open-source the code for our MedVQA pipeline and for pre-training PubMedCLIP. Both CLIP and PubMedCLIP improve upon MAML's visual encoder, and PubMedCLIP achieves the best results, with gains in overall accuracy of up to 3% over the previously widely used MAML networks. Visual representation learning with language supervision in PubMedCLIP thus leads to noticeable improvements for MedVQA. Our experiments also reveal distributional differences between the two MedVQA benchmark datasets that have not been reported in previous work and that cause different back-end visual encoders in PubMedCLIP to behave differently on these datasets. Moreover, we observe fundamental performance differences between VQA in the general domain and in the medical domain.
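At its core, the PubMedCLIP recipe is to continue CLIP's contrastive pre-training on medical image–caption pairs drawn from PubMed articles. The sketch below illustrates that recipe with the openai/CLIP package; it is a minimal illustration, not the authors' released code, and the `PubMedPairs` loader, the placeholder data, and the hyperparameters are assumptions made here for clarity.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torch.utils.data import DataLoader, Dataset

import clip  # pip install git+https://github.com/openai/CLIP.git


class PubMedPairs(Dataset):
    """Hypothetical loader over (image path, caption) pairs from PubMed figures."""

    def __init__(self, pairs, preprocess):
        self.pairs = pairs            # list of (image_path, caption_text)
        self.preprocess = preprocess  # CLIP's image transform

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        path, caption = self.pairs[i]
        return self.preprocess(Image.open(path).convert("RGB")), caption


device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
model.float()  # train in fp32; the CUDA checkpoint loads in fp16

# Placeholder data; in practice this would be the full medical caption corpus.
pairs = [("figure_0001.png", "Chest X-ray showing left lower lobe consolidation.")]
loader = DataLoader(PubMedPairs(pairs, preprocess), batch_size=64, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed hyperparameters

for images, captions in loader:
    images = images.to(device)
    tokens = clip.tokenize(captions, truncate=True).to(device)

    image_features = F.normalize(model.encode_image(images), dim=-1)
    text_features = F.normalize(model.encode_text(tokens), dim=-1)

    # Symmetric InfoNCE objective over the in-batch image-text similarity
    # matrix, i.e., the same contrastive loss CLIP was pre-trained with.
    logits = model.logit_scale.exp() * image_features @ text_features.t()
    targets = torch.arange(images.size(0), device=device)
    loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After fine-tuning, the image encoder (`model.visual` in this package) can stand in for the MAML-pre-trained visual backbone inside a MedVQA pipeline such as MEVF or QCR, which is the comparison the paper carries out.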



research · 09/20/2023
Visual Question Answering in the Medical Domain
Medical visual question answering (Med-VQA) is a machine learning task t...

research · 11/02/2019
How to Pre-Train Your Model? Comparison of Different Pre-Training Models for Biomedical Question Answering
Using deep learning models on small-scale datasets would result in overf...

research · 12/21/2022
UnICLAM: Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering
Medical Visual Question Answering (Medical-VQA) aims to answer clinic...

research · 04/12/2023
CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes
Training models to apply linguistic knowledge and visual concepts from 2...

research · 11/17/2021
Achieving Human Parity on Visual Question Answering
The Visual Question Answering (VQA) task utilizes both visual image and ...

research · 11/11/2022
MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering
There is a key problem in the medical visual question answering task tha...

research · 07/27/2023
Med-Flamingo: a Multimodal Medical Few-shot Learner
Medicine, by its nature, is a multifaceted domain that requires the synt...
