Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering

05/01/2021
by Haifan Gong, et al.

Because labeled data are severely limited, existing methods for medical visual question answering usually rely on transfer learning to obtain effective image feature representations, and then use cross-modal fusion of visual and linguistic features to predict question-related answers. These two phases are performed independently, without considering whether the pre-trained features are compatible with, and applicable to, cross-modal fusion. We therefore reformulate image feature pre-training as a multi-task learning paradigm, which forces the pre-training to take into account the applicability of the features to the specific image comprehension task, and we observe its clear superiority. Furthermore, we introduce a cross-modal self-attention (CMSA) module that selectively captures long-range contextual relevance for more effective fusion of visual and linguistic features. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods. Our code and models are available at https://github.com/haifangong/CMSA-MTPT-4-MedicalVQA.
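To make the idea of attending across modalities concrete, the following is a minimal NumPy sketch of self-attention over a joint sequence of visual and linguistic tokens. It is a generic illustration and not the authors' CMSA module, whose details are given only in the full text and the linked repository. The function name `cross_modal_self_attention`, the single-head scaled dot-product formulation, the residual connection, the random weights, and all shapes (49 visual tokens, 12 question tokens, width 8) are assumptions chosen for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_self_attention(visual, text, w_q, w_k, w_v):
    """Hypothetical single-head self-attention over both modalities.

    visual: (n_v, d) image-region features
    text:   (n_t, d) question-token features
    w_q, w_k, w_v: (d, d) projection matrices
    Returns the fused tokens (n_v + n_t, d) and the attention weights.
    """
    # Concatenate the two modalities into one sequence so that every
    # token can attend to every other token, regardless of modality.
    tokens = np.concatenate([visual, text], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)
    # Residual connection (an assumption of this sketch).
    return tokens + attn @ v, attn


rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=(49, d))  # e.g. a flattened 7x7 feature map
text = rng.normal(size=(12, d))    # e.g. 12 embedded question tokens
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

fused, attn = cross_modal_self_attention(visual, text, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Because the attention matrix spans the concatenated sequence, its off-diagonal blocks hold the image-to-question and question-to-image weights, which is where the cross-modal interaction appears in this sketch.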
