A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports

09/03/2020
by   Yikuan Li, et al.

Joint image-text embeddings extracted from medical images and their associated reports are the bedrock of most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval, and clinical report auto-generation. In this study, we adopt four pre-trained V+L models: LXMERT, VisualBERT, UNITER, and PixelBERT, to learn multimodal representations from MIMIC-CXR radiographs and their associated reports. Extrinsic evaluation on the OpenI dataset shows that, compared with a pioneering CNN-RNN model, the joint embeddings learned by the pre-trained V+L models yield improved performance on the thoracic findings classification task. We conduct an ablation study to analyze the contribution of individual model components and to validate the advantage of joint embeddings over text-only embeddings. We also visualize attention maps to illustrate the attention mechanisms of the V+L models.
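The single-stream fusion used by models such as VisualBERT and UNITER can be sketched in a few lines of PyTorch: text token embeddings and projected visual region features are concatenated into one sequence, encoded jointly by a transformer, and the pooled joint embedding feeds a multi-label head for thoracic findings. This is a minimal illustrative sketch under assumed dimensions (hidden size, region-feature size, mean pooling, toy inputs), not the implementation evaluated in the paper.

```python
import torch
import torch.nn as nn

class JointEmbeddingClassifier(nn.Module):
    """Toy single-stream V+L fusion: report tokens and image region
    features share one transformer encoder (VisualBERT/UNITER style)."""
    def __init__(self, vocab_size=30522, hidden=256, visual_dim=1024,
                 num_layers=2, num_heads=4, num_findings=14):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)  # map CNN region features to hidden size
        self.type_embed = nn.Embedding(2, hidden)         # 0 = text token, 1 = visual token
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(hidden, num_findings)  # multi-label findings head

    def forward(self, token_ids, visual_feats):
        t = self.text_embed(token_ids) + self.type_embed(torch.zeros_like(token_ids))
        v = self.visual_proj(visual_feats)
        v = v + self.type_embed(torch.ones(v.shape[:2], dtype=torch.long,
                                           device=v.device))
        seq = torch.cat([t, v], dim=1)   # one joint text+image sequence
        enc = self.encoder(seq)          # cross-modal self-attention
        pooled = enc.mean(dim=1)         # simple mean pooling over the joint sequence
        return self.classifier(pooled)   # logits for multi-label thoracic findings

model = JointEmbeddingClassifier()
tokens = torch.randint(0, 30522, (2, 16))   # toy report token ids
regions = torch.randn(2, 36, 1024)          # toy image region features
logits = model(tokens, regions)
print(logits.shape)  # torch.Size([2, 14])
```

In practice the text side would be initialized from a pre-trained BERT and the visual features taken from a detector or CNN backbone, with task-specific fine-tuning on MIMIC-CXR; the sketch only shows where the two modalities meet in a single-stream architecture.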

Related research

11/21/2018 · Unsupervised Multimodal Representation Learning across Medical Images and Reports
Joint embeddings between medical imaging modalities and associated radio...

03/02/2023 · ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax
Clinical imaging databases contain not only medical images but also text...

08/10/2021 · BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis
Vision-and-language (V+L) models take image and text as input and learn...

09/27/2022 · RepsNet: Combining Vision with Language for Automated Medical Reports
Writing reports by analyzing medical images is error-prone for inexperie...

08/10/2017 · TandemNet: Distilling Knowledge from Medical Images Using Diagnostic Reports as Optional Semantic References
In this paper, we introduce the semantic knowledge of medical images fro...

07/08/2017 · MDNet: A Semantically and Visually Interpretable Medical Image Diagnosis Network
The inability to interpret the model prediction in semantically and visu...

09/05/2018 · Bimodal network architectures for automatic generation of image annotation from text
Medical image analysis practitioners have embraced big data methodologie...
