Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

05/24/2021
by   Jong Hak Moon, et al.

Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the BERT architecture with multi-modal pre-training objectives. In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and their unstructured reports. We propose Medical Vision Language Learner (MedViLL), which adopts a Transformer-based architecture combined with a novel multimodal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (image-report retrieval, disease classification, medical visual question answering) and a vision-language generation task (report generation). By rigorously evaluating the proposed model on four downstream tasks with two chest X-ray image datasets (MIMIC-CXR and Open-I), we empirically demonstrate the superior downstream task performance of MedViLL against various baselines, including task-specific architectures.
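As an illustration of the kind of multimodal attention masking the abstract refers to, the sketch below builds two masks over a concatenated image-plus-report token sequence: a fully bidirectional mask for understanding tasks, and a seq2seq-style mask for report generation in which report tokens attend to all image tokens but only to earlier report tokens. This is a hedged, hypothetical simplification for intuition only; the function name, parameters, and the exact masking scheme used in MedViLL are assumptions, not taken from the paper.

```python
import numpy as np

def multimodal_attention_mask(n_img, n_txt, mode="bidirectional"):
    """Build an (n_img+n_txt) x (n_img+n_txt) attention mask.

    Illustrative sketch only; the actual MedViLL scheme may differ.
    mode="bidirectional": every token attends to every token
        (suited to understanding tasks such as retrieval or VQA).
    mode="seq2seq": image tokens attend only among themselves;
        report tokens attend to all image tokens and to earlier
        report tokens (suited to autoregressive report generation).
    Entries: 1 = attention allowed, 0 = attention blocked.
    """
    n = n_img + n_txt
    if mode == "bidirectional":
        return np.ones((n, n), dtype=int)
    mask = np.zeros((n, n), dtype=int)
    mask[:n_img, :n_img] = 1                        # image -> image
    mask[n_img:, :n_img] = 1                        # text  -> image
    causal = np.tril(np.ones((n_txt, n_txt), dtype=int))
    mask[n_img:, n_img:] = causal                   # text  -> earlier text
    return mask
```

In practice, such a mask would be added (as large negative values for blocked positions) to the attention logits of each Transformer layer, letting one shared model serve both understanding and generation objectives.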


