Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts

02/17/2023
by Zhihong Chen et al.

Medical vision-and-language pre-training (Med-VLP) has shown promising improvements on many downstream medical tasks owing to its ability to extract generic representations from medical images and texts. In practice, there are two typical architectures, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used. The former is superior on multi-modal tasks owing to the sufficient interaction between modalities; the latter is better suited to uni-modal and cross-modal tasks thanks to its single-modality encoding ability. To take advantage of both, we propose an effective yet straightforward scheme named PTUnifier that unifies the two types. We first unify the input format by introducing visual and textual prompts, which serve as a feature bank storing the most representative images/texts. In this way, a single model can serve as a foundation model that handles tasks with different input formats (i.e., image-only, text-only, and image-text pair). Furthermore, we construct a prompt pool (instead of a static set of prompts) to improve diversity and scalability. Experimental results show that our approach achieves state-of-the-art results on a broad range of tasks, spanning uni-modal tasks (i.e., image/text classification and text summarization), cross-modal tasks (i.e., image-to-text generation and image-text/text-image retrieval), and multi-modal tasks (i.e., visual question answering), demonstrating its effectiveness. Note that the adoption of prompts is orthogonal to most existing Med-VLP approaches and could serve as a beneficial and complementary extension to them.
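The abstract describes prompts that stand in for a missing modality so that one encoder can handle image-only, text-only, and paired inputs. As a rough illustration of that mechanism, the PyTorch sketch below substitutes soft prompts retrieved from a learnable pool for whichever modality is absent. This is a minimal sketch under stated assumptions, not the authors' implementation: all names (PromptPool, UnifiedEncoderStub) and hyperparameters (pool size, top-k) are hypothetical.

```python
# Minimal sketch of the prompt-as-missing-modality idea described above.
# NOT the authors' released code; names and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class PromptPool(nn.Module):
    """Pool of learnable soft prompts acting as a feature bank.

    Given a pooled feature of the modality that IS present, retrieve the
    top-k most similar prompts to stand in for the modality that is missing.
    """

    def __init__(self, pool_size: int = 256, dim: int = 768, top_k: int = 16):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(pool_size, dim) * 0.02)
        self.top_k = top_k

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim); score every prompt in the pool by similarity
        scores = query @ self.prompts.t()              # (batch, pool_size)
        idx = scores.topk(self.top_k, dim=-1).indices  # (batch, top_k)
        return self.prompts[idx]                       # (batch, top_k, dim)


class UnifiedEncoderStub(nn.Module):
    """One fusion encoder for image-only, text-only, and paired inputs;
    an absent modality is replaced by prompts retrieved from its pool."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.visual_pool = PromptPool(dim=dim)
        self.textual_pool = PromptPool(dim=dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img_tokens=None, txt_tokens=None):
        assert img_tokens is not None or txt_tokens is not None
        if txt_tokens is None:  # image-only input: borrow textual prompts
            txt_tokens = self.textual_pool(img_tokens.mean(dim=1))
        if img_tokens is None:  # text-only input: borrow visual prompts
            img_tokens = self.visual_pool(txt_tokens.mean(dim=1))
        # paired (or prompt-completed) sequence through a single encoder
        return self.fusion(torch.cat([img_tokens, txt_tokens], dim=1))


# usage: an image-only batch, textual side filled from the prompt pool
model = UnifiedEncoderStub()
feats = model(img_tokens=torch.randn(2, 49, 768))  # -> (2, 49 + 16, 768)
```

Retrieving prompts by similarity to the available modality (rather than using a single static prompt) is what gives the pool its diversity: different inputs pull different stand-in features from the bank.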



