Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime

03/30/2023
by Rhydian Windsor et al.

This paper explores training medical vision-language models (VLMs) – models in which visual and language inputs are embedded into a common space – with a particular focus on scenarios where training data is limited, as is often the case for clinical datasets. We explore several candidate methods for improving low-data performance, including: (i) adapting generic pre-trained models to novel image and text domains (i.e. medical imaging and reports) via unimodal self-supervision; (ii) using local (e.g. GLoRIA) and global (e.g. InfoNCE) contrastive loss functions, as well as a combination of the two; and (iii) extra supervision during VLM training via (a) image- and text-only self-supervision and (b) creating additional positive image-text pairs through augmentation and nearest-neighbour search. Using text-to-image retrieval as a benchmark, we evaluate these methods on training datasets of paired chest X-rays and radiological reports of varying size. Combined, they significantly improve retrieval compared to fine-tuning CLIP, an improvement roughly equivalent to training on a substantially larger dataset. A similar pattern holds for the downstream task of classifying CXR-related conditions, where our method outperforms both CLIP and BioVIL, a strong CXR VLM baseline, in the zero-shot and linear-probing settings. We conclude with a set of recommendations for researchers aiming to train vision-language models on other medical imaging modalities when training data is scarce. To facilitate further research, we will make our code and models publicly available.
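The global contrastive objective mentioned in the abstract is the standard symmetric InfoNCE loss used in CLIP-style training: within a batch of matched image-report pairs, each image treats its own report as the positive and every other report as a negative, and vice versa. The abstract does not include code, so the following is a minimal PyTorch sketch under that assumption; the function name `info_nce_loss` and the `temperature` default are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(image_emb: torch.Tensor,
                  text_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    Matched pairs share the same row index; all other rows in the
    batch act as negatives. (Illustrative sketch, not the paper's code.)
    """
    # Project onto the unit sphere so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; entry [i, j] compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The positive pair for row i sits on the diagonal, i.e. column i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Local losses such as GLoRIA follow the same contrastive principle but match image sub-regions against individual report words rather than whole-image and whole-report embeddings.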
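The paper's benchmark is text-to-image retrieval: given a report, rank all test images by embedding similarity and check whether the paired X-ray is retrieved. A common metric for this is Recall@K. The sketch below shows one plausible way to compute it; `recall_at_k` is an illustrative helper, not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F


def recall_at_k(image_emb: torch.Tensor,
                text_emb: torch.Tensor,
                k: int = 10) -> float:
    """Text-to-image Recall@K: the fraction of reports whose paired
    image appears among the k most similar images.

    image_emb, text_emb: (n, dim), row i of each belonging to pair i.
    (Illustrative sketch, not the paper's evaluation code.)
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (n_texts, n_images) cosine similarities; each row ranks all images.
    sims = text_emb @ image_emb.t()
    topk = sims.topk(k, dim=-1).indices  # indices of the k nearest images

    # A query succeeds if its own index appears in its top-k retrievals.
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    return (topk == targets).any(dim=-1).float().mean().item()
```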
