BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis

08/10/2021
by   Masoud Monajatipoor, et al.

Vision-and-language (V&L) models take image and text as input and learn to capture the associations between them. Prior studies show that pre-trained V&L models can significantly improve performance on downstream tasks such as Visual Question Answering (VQA). However, V&L models are less effective when applied in the medical domain (e.g., on X-ray images and clinical notes) due to the domain gap. In this paper, we investigate the challenges of applying pre-trained V&L models in medical applications. In particular, we identify that the visual representation in general V&L models is not suitable for processing medical data. To overcome this limitation, we propose BERTHop, a transformer-based model built on PixelHop++ and VisualBERT, for better capturing the associations between the two modalities. Experiments on the OpenI dataset, a commonly used thoracic disease diagnosis benchmark, show that BERTHop achieves an average Area Under the Curve (AUC) of 98.12, higher than the state of the art (SOTA), while being trained on a dataset 9 times smaller.
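For readers unfamiliar with the metric, the following is a minimal sketch of how an average AUC over several disease labels could be computed. It assumes the average is an unweighted mean of per-label AUCs, which the abstract does not state; the function names and the toy data are hypothetical, and the result is on a 0 to 1 scale, whereas the reported figure appears to be on a 0 to 100 scale.

```python
from typing import List, Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def mean_auc(label_matrix: List[List[int]], score_matrix: List[List[float]]) -> float:
    """Unweighted mean of per-label AUCs (rows are studies, columns are labels)."""
    n_labels = len(label_matrix[0])
    aucs = []
    for j in range(n_labels):
        col_labels = [row[j] for row in label_matrix]
        col_scores = [row[j] for row in score_matrix]
        aucs.append(binary_auc(col_labels, col_scores))
    return sum(aucs) / n_labels


# Toy data (not from the paper): 4 studies, 2 hypothetical disease labels.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.4, 0.5]]
print(mean_auc(y_true, y_score))
```

The pairwise comparison is quadratic in the number of examples, which is fine for illustration; a rank-based implementation from an established library would be the usual choice for a real evaluation.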


