Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge

09/15/2022
by Zhihong Chen, et al.

Medical vision-and-language pre-training (Med-VLP) has received considerable attention owing to its applicability to extracting generic vision-and-language representations from medical images and texts. Most existing methods comprise three elements: uni-modal encoders (i.e., a vision encoder and a language encoder), a multi-modal fusion module, and pretext tasks, yet few studies consider the importance of medical domain expert knowledge or explicitly exploit such knowledge to facilitate Med-VLP. Although knowledge-enhanced vision-and-language pre-training (VLP) methods exist in the general domain, most require off-the-shelf toolkits (e.g., object detectors and scene graph parsers) that are unavailable in the medical domain. In this paper, we propose a systematic and effective approach that enhances Med-VLP with structured medical knowledge from three perspectives. First, since knowledge can be regarded as an intermediate medium between vision and language, we align the representations of the vision encoder and the language encoder through knowledge. Second, we inject knowledge into the multi-modal fusion model so that it can reason with knowledge as a supplement to the input image and text. Third, we guide the model to emphasize the most critical information in images and texts by designing knowledge-induced pretext tasks. To enable a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark comprising three tasks. Experimental results illustrate the effectiveness of our approach, which achieves state-of-the-art performance on all downstream tasks. Further analyses explore the effects of different components of our approach and various pre-training settings.
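To make the first perspective concrete, the sketch below shows one way knowledge could mediate alignment between the two uni-modal encoders: entity embeddings from a medical knowledge graph act as shared anchors, and each modality is contrastively pulled toward them rather than directly toward the other modality. This is a minimal sketch under our own assumptions, not the paper's implementation: the class name KnowledgeAnchoredAlignment, the projection dimensions, and the InfoNCE-style loss are all illustrative stand-ins.

import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAnchoredAlignment(nn.Module):
    """Illustrative knowledge-mediated alignment objective (not the paper's API)."""

    def __init__(self, vision_dim=768, text_dim=768, know_dim=256, temperature=0.07):
        super().__init__()
        # Project each modality into the knowledge embedding space.
        self.vision_proj = nn.Linear(vision_dim, know_dim)
        self.text_proj = nn.Linear(text_dim, know_dim)
        self.temperature = temperature

    def forward(self, vision_feats, text_feats, knowledge_embeds):
        # vision_feats:     (B, vision_dim) pooled image features
        # text_feats:       (B, text_dim)   pooled report features
        # knowledge_embeds: (B, know_dim)   embeddings of the knowledge-graph
        #                   entities linked to each image-report pair
        v = F.normalize(self.vision_proj(vision_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        k = F.normalize(knowledge_embeds, dim=-1)

        # Contrast each modality against the knowledge anchors, so both
        # encoders are drawn toward a shared, knowledge-grounded space.
        logits_vk = v @ k.t() / self.temperature
        logits_tk = t @ k.t() / self.temperature
        targets = torch.arange(v.size(0), device=v.device)
        loss = (F.cross_entropy(logits_vk, targets) +
                F.cross_entropy(logits_tk, targets)) / 2
        return loss

# Usage with random tensors standing in for encoder outputs:
align = KnowledgeAnchoredAlignment()
loss = align(torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 256))
loss.backward()

Because both modalities are contrasted against the same knowledge anchors instead of against each other, the knowledge representation plays the "intermediate medium" role described above; the injection and pretext-task perspectives would build on the fusion module and the pre-training objectives, respectively.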

Related research:

- 02/17/2023 · Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts
  Medical vision-and-language pre-training (Med-VLP) has shown promising i...
- 01/11/2022 · Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
  Vision-language pre-training has been an emerging and fast-developing re...
- 10/17/2022 · Contrastive Language-Image Pre-Training with Knowledge Graphs
  Recent years have witnessed the fast development of large-scale pre-trai...
- 06/10/2023 · Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
  With the availability of large-scale, comprehensive, and general-purpose...
- 06/11/2022 · A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training
  Multi-modal pre-training and knowledge discovery are two important resea...
- 08/23/2022 · Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks
  Is more data always better to train vision-and-language models? We study...
- 08/04/2023 · Towards Generalist Foundation Model for Radiology
  In this study, we aim to initiate the development of Radiology Foundatio...
