MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

10/18/2022
by Zifeng Wang, et al.

Existing vision-text contrastive learning frameworks such as CLIP aim to match paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude smaller than the general images and captions available on the internet. Moreover, previous methods encounter many false negatives: images and reports from separate patients may carry the same semantics yet are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, thus scaling the usable training data combinatorially at low cost. We also propose to replace the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We show that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Notably, we observe that with only 20K pre-training examples, MedCLIP outperforms the state-of-the-art method trained on around 200K examples. Our code is available at https://github.com/RyanWangZf/MedCLIP.
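To make the two ideas in the abstract concrete, the sketch below shows one way a knowledge-driven semantic matching loss can replace InfoNCE: instead of a one-hot target that marks only the paired caption as positive, the target distribution comes from the similarity between clinical-entity labels of the image and of the text. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation (see the linked repository for that); the function name `semantic_matching_loss`, the multi-hot entity labels, and the temperature `tau` are illustrative.

```python
import torch
import torch.nn.functional as F


def semantic_matching_loss(img_emb, txt_emb, img_labels, txt_labels, tau=0.07):
    """Contrastive loss with knowledge-driven soft targets instead of InfoNCE.

    img_emb:    (N, d) image embeddings
    txt_emb:    (M, d) text embeddings; texts need not be paired with the images
    img_labels: (N, k) multi-hot clinical-entity labels for each image
    txt_labels: (M, k) multi-hot entity labels extracted from each report
    """
    # Cosine similarity between every image and every text in the batch.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau                        # (N, M)

    # Soft targets from label-space similarity: an image and a report from
    # different patients that share findings are no longer forced apart.
    sim = F.normalize(img_labels.float(), dim=-1) @ \
          F.normalize(txt_labels.float(), dim=-1).t()           # (N, M)
    tgt_i2t = sim / sim.sum(dim=1, keepdim=True).clamp(min=1e-8)
    tgt_t2i = sim.t() / sim.t().sum(dim=1, keepdim=True).clamp(min=1e-8)

    # Cross-entropy against the soft targets, averaged over both directions.
    loss_i2t = -(tgt_i2t * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(tgt_t2i * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Because the targets come from label similarity rather than strict pairing, the image batch and the text batch can be drawn independently, which is what allows decoupled, unpaired images and texts to combine combinatorially.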


