Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images

06/13/2023
by Ming Y. Lu, et al.

Contrastive visual language pretraining has emerged as a powerful method for either training new language-aware image encoders or augmenting existing pretrained models with zero-shot visual recognition capabilities. However, existing works typically train on large datasets of image-text pairs and are designed for downstream tasks involving only small- to medium-sized images, neither of which holds in the emerging field of computational pathology, where publicly available paired image-text datasets are limited and each image can span up to 100,000 x 100,000 pixels. In this paper we present MI-Zero, a simple and intuitive framework for unleashing the zero-shot transfer capabilities of contrastively aligned image and text models on gigapixel histopathology whole slide images, enabling multiple downstream diagnostic tasks to be carried out by pretrained encoders without requiring any additional labels. MI-Zero reformulates zero-shot transfer under the framework of multiple instance learning to overcome the computational challenge of inference on extremely large images. We used over 550k pathology reports and other available in-domain text corpora to pretrain our text encoder. By effectively leveraging strong pretrained encoders, our best model pretrained on over 33k histopathology image-caption pairs achieves an average median zero-shot accuracy of 70.2% on cancer subtyping tasks. Our code is available at: https://github.com/mahmoodlab/MI-Zero.
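The multiple-instance reformulation described above can be illustrated with a minimal sketch: a whole slide image is tiled into patches, each patch embedding is scored against the text embedding of each class prompt, and the patch-level similarity scores are pooled into a slide-level prediction. The top-K mean pooling shown here is one of the aggregation operators the abstract's framing admits; the function name, the pooling choice, and the use of plain cosine similarity are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each vector to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def mi_zero_predict(patch_embs, class_text_embs, topk=5):
    """Zero-shot slide classification via multiple-instance pooling (illustrative sketch).

    patch_embs:      (N, D) array of N patch embeddings from one slide,
                     produced by a pretrained image encoder.
    class_text_embs: (C, D) array of C class-prompt embeddings,
                     produced by the aligned text encoder.
    Returns the index of the predicted class.
    """
    patches = l2_normalize(patch_embs)
    texts = l2_normalize(class_text_embs)
    sims = patches @ texts.T                 # (N, C) patch-vs-prompt cosine similarities
    k = min(topk, sims.shape[0])
    # Top-K pooling: average the K highest-scoring patches for each class,
    # so a few strongly diagnostic patches can dominate the slide-level score.
    slide_scores = np.sort(sims, axis=0)[-k:].mean(axis=0)  # (C,)
    return int(np.argmax(slide_scores))
```

In practice the patch and prompt embeddings would come from the contrastively aligned encoders; pooling over patch scores is what lets inference scale to gigapixel slides without ever forming a single full-resolution input.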


