Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning

06/06/2022
by   Richard J. Chen, et al.

Vision Transformers (ViTs) and their multi-scale and hierarchical variants have been successful at capturing image representations, but their use has generally been studied for low-resolution images (e.g., 256x256, 384x384). For gigapixel whole-slide imaging (WSI) in computational pathology, WSIs can be as large as 150000x150000 pixels at 20X magnification and exhibit a hierarchical structure of visual tokens across varying resolutions: from 16x16 images capturing spatial patterns among cells to 4096x4096 images characterizing interactions within the tissue microenvironment. We introduce a new ViT architecture called the Hierarchical Image Pyramid Transformer (HIPT), which leverages the natural hierarchical structure inherent in WSIs, using two levels of self-supervised learning to learn high-resolution image representations. HIPT is pretrained across 33 cancer types using 10,678 gigapixel WSIs, 408,218 4096x4096 images, and 104M 256x256 images. We benchmark HIPT representations on 9 slide-level tasks and demonstrate that: 1) HIPT with hierarchical pretraining outperforms current state-of-the-art methods for cancer subtyping and survival prediction, and 2) self-supervised ViTs are able to model important inductive biases about the hierarchical structure of phenotypes in the tumor microenvironment.
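The two-level aggregation described in the abstract can be sketched roughly as follows. This is an illustrative toy example, not the authors' implementation: `TinyEncoder` is a hypothetical stand-in for a pretrained patch-level ViT, and a 1024x1024 region is used for brevity in place of HIPT's 4096x4096 regions.

```python
import torch
import torch.nn as nn

# Illustrative constants (HIPT uses 4096x4096 regions; 1024 keeps this demo small).
PATCH = 256          # patch size in pixels
REGION = 1024        # region size in pixels
EMBED_DIM = 384      # embedding dimension (stand-in for a small ViT)

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a self-supervised ViT: one embedding per image."""
    def __init__(self, dim):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(8)        # shrink spatial dims to 8x8
        self.proj = nn.Linear(3 * 8 * 8, dim)

    def forward(self, x):                          # x: (B, 3, H, W)
        z = self.pool(x).flatten(1)                # (B, 3*8*8)
        return self.proj(z)                        # (B, dim)

def tile_region(region, patch=PATCH):
    """Split a (3, H, W) region into non-overlapping (3, patch, patch) tiles."""
    c, _, _ = region.shape
    tiles = region.unfold(1, patch, patch).unfold(2, patch, patch)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)

# Level 1: embed each 256x256 patch independently.
patch_encoder = TinyEncoder(EMBED_DIM)
# Level 2: treat patch embeddings as tokens of a region-level Transformer.
region_encoder = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=6,
                                            batch_first=True)

region = torch.randn(3, REGION, REGION)            # one sub-region of a WSI
tokens = patch_encoder(tile_region(region))        # (16, EMBED_DIM): 4x4 grid
region_repr = region_encoder(tokens.unsqueeze(0))  # (1, 16, EMBED_DIM)
slide_feature = region_repr.mean(dim=1)            # pooled region embedding
```

In HIPT each level is pretrained with self-supervision (the paper uses DINO-style distillation) before the levels are stacked; the sketch above only shows the hierarchical forward pass, not the pretraining objectives.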


research
03/01/2022

Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology

Tissue phenotyping is a fundamental task in learning objective character...
research
08/29/2023

A General-Purpose Self-Supervised Model for Computational Pathology

Tissue phenotyping is a fundamental computational pathology (CPath) task...
research
03/02/2023

Hierarchical discriminative learning improves visual representations of biomedical microscopy

Learning high-quality, self-supervised, visual representations is essent...
research
09/14/2023

HIGT: Hierarchical Interaction Graph-Transformer for Whole Slide Image Analysis

In computational pathology, the pyramid structure of gigapixel Whole Slide...
research
06/06/2023

DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency

In this paper, we propose a simple yet effective transformer framework f...
research
08/09/2023

A degree of image identification at sub-human scales could be possible with more advanced clusters

The purpose of the research is to determine if currently available self-...
research
08/07/2023

Scaling may be all you need for achieving human-level object recognition capacity with human-like visual experience

This paper asks whether current self-supervised learning methods, if suf...