Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training

05/13/2023
by   Ke Zhang, et al.

In recent years, the growing demand for medical imaging diagnosis has placed a significant burden on radiologists. As a solution, Medical Vision-Language Pre-training (Med-VLP) methods have been proposed to learn universal representations from medical images and reports, benefiting downstream tasks without requiring fine-grained annotations. However, existing methods overlook the importance of cross-modal alignment in joint image-text reconstruction, which leaves cross-modal interaction insufficient. To address this limitation, we propose a unified Med-VLP framework based on Multi-task Paired Masking with Alignment (MPMA), which integrates the cross-modal alignment task into the joint image-text reconstruction framework to achieve more comprehensive cross-modal interaction. Within this framework, a Global and Local Alignment (GLA) module is designed to help the self-supervised paradigm obtain semantic representations with rich domain knowledge. Furthermore, we introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module that fully integrates visual information to assist report reconstruction and adequately fuses the multi-modal representations. Experimental results demonstrate that the proposed unified approach outperforms previous methods in all downstream tasks, including uni-modal, cross-modal, and multi-modal tasks.
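The abstract does not spell out the training objective, so the following is only a minimal sketch of how an alignment term might sit alongside two masked-reconstruction terms in a single multi-task loss. Everything here is an assumption made for illustration: the function names, the 0.07 temperature, the equal loss weights, and the choice of a symmetric InfoNCE loss as a stand-in for global alignment. It is not the paper's GLA or MA-CMF module.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def contrastive_alignment_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, report) embeddings.

    Hypothetical stand-in for a global alignment objective: pair i in the
    batch is the positive, every other pairing is a negative.
    """
    imgs = [l2_normalize(v) for v in img_embs]
    txts = [l2_normalize(v) for v in txt_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity of image i and report j.
    sim = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*sim)]  # report-to-image direction
    return 0.5 * (cross_entropy_on_diagonal(sim) + cross_entropy_on_diagonal(cols))


def sample_mask(length, ratio, rng):
    """Choose which positions (image patches or report tokens) to hide."""
    k = max(1, int(round(length * ratio)))
    return sorted(rng.sample(range(length), k))


def masked_mse(pred, target, masked_idx):
    """Reconstruction error measured only on the masked positions."""
    return sum((pred[i] - target[i]) ** 2 for i in masked_idx) / len(masked_idx)


def multi_task_loss(recon_img, recon_txt, align, w_img=1.0, w_txt=1.0, w_align=1.0):
    """Weighted sum of the three objectives (weights are assumed, not from the paper)."""
    return w_img * recon_img + w_txt * recon_txt + w_align * align


if __name__ == "__main__":
    rng = random.Random(0)
    patch_mask = sample_mask(16, 0.75, rng)  # hide most image patches
    token_mask = sample_mask(12, 0.25, rng)  # hide a smaller share of report tokens
    align = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]])
    print(len(patch_mask), len(token_mask), multi_task_loss(0.5, 0.3, align))
```

In this sketch the alignment term is computed on the same paired batch whose patches and tokens are masked, which is one plausible reading of "paired masking with alignment"; the actual masking ratios and loss weighting would need to come from the full text.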


