MMBERT: Multimodal BERT Pretraining for Improved Medical VQA

04/03/2021
by Yash Khare, et al.

Images in the medical domain are fundamentally different from general-domain images, so general-domain Visual Question Answering (VQA) models cannot be directly applied to medical VQA. In addition, medical image annotation is a costly and time-consuming process. To overcome these limitations, we propose a solution inspired by the self-supervised pretraining of Transformer-style architectures for NLP, vision, and language tasks. Our method learns richer medical image and text semantic representations via Masked Language Modeling (MLM) with image features as the pretext task on a large medical image+caption dataset. The proposed solution achieves new state-of-the-art performance on two VQA datasets for radiology images, VQA-Med 2019 and VQA-RAD, outperforming even the ensemble models of the previous best solutions. Moreover, our solution provides attention maps that aid model interpretability. The code is available at https://github.com/VirajBagal/MMBERT
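To make the pretext task concrete, below is a minimal sketch of MLM pretraining over a joint image+text sequence in the spirit of the approach described above. This is not the authors' implementation (see the GitHub link for that); the module names, feature dimensions, visual-token count, and 15% masking rate are illustrative assumptions.

```python
# Minimal sketch of masked-language-modeling pretraining with image
# features. All names here (MultimodalMLM, num_visual_tokens, etc.) are
# hypothetical illustrations, not the authors' released code.
import torch
import torch.nn as nn

class MultimodalMLM(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, n_heads=12,
                 n_layers=4, num_visual_tokens=5):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Project CNN image features (e.g. pooled ResNet feature maps)
        # into the same embedding space as the text tokens.
        self.img_proj = nn.Linear(2048, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, img_feats, token_ids):
        # img_feats: (B, V, 2048) visual features; token_ids: (B, T)
        vis = self.img_proj(img_feats)        # (B, V, d_model)
        txt = self.tok_emb(token_ids)         # (B, T, d_model)
        x = torch.cat([vis, txt], dim=1)      # joint multimodal sequence
        h = self.encoder(x)
        # Predict vocabulary logits only for the text positions.
        return self.mlm_head(h[:, vis.size(1):])

# Pretext task: mask ~15% of caption tokens and predict them, so the
# model must attend to the image features to recover masked terms.
def mlm_loss(model, img_feats, token_ids, mask_id=103, pad_id=0):
    labels = token_ids.clone()
    rand = torch.rand_like(token_ids, dtype=torch.float)
    masked = (rand < 0.15) & (token_ids != pad_id)
    inputs = token_ids.masked_fill(masked, mask_id)
    labels[~masked] = -100                    # ignore unmasked positions
    logits = model(img_feats, inputs)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
        ignore_index=-100)
```

The design intuition, as the abstract suggests, is that a masked medical term in a caption can often only be recovered by attending to the image, which forces the shared Transformer representation to become genuinely multimodal before it is fine-tuned for VQA.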


Related research

Visual Question Answering in the Medical Domain (09/20/2023)
Medical visual question answering (Med-VQA) is a machine learning task t...

Rad-ReStruct: A Novel VQA Benchmark and Method for Structured Radiology Reporting (07/11/2023)
Radiology reporting is a crucial part of the communication between radio...

An Efficient Modern Baseline for FloodNet VQA (05/30/2022)
Designing efficient and reliable VQA systems remains a challenging probl...

Medical visual question answering using joint self-supervised learning (02/25/2023)
Visual Question Answering (VQA) becomes one of the most active research ...

Med-Flamingo: a Multimodal Medical Few-shot Learner (07/27/2023)
Medicine, by its nature, is a multifaceted domain that requires the synt...

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks (01/15/2022)
Contrastive language-image pretraining (CLIP) links vision and language ...

RepsNet: Combining Vision with Language for Automated Medical Reports (09/27/2022)
Writing reports by analyzing medical images is error-prone for inexperie...
