Contextual Vision Transformers for Robust Representation Learning

05/30/2023
by Yujia Bao, et al.

We present Contextual Vision Transformers (ContextViT), a method for producing robust feature representations for images that exhibit grouped structure, such as covariates. ContextViT introduces an extra context token to encode group-specific information, allowing the model to explain away group-specific covariate structures while keeping core visual features shared across groups. Specifically, given an input image, ContextViT maps the images that share its covariate into this context token, which is appended to the input image tokens to capture the effect of conditioning the model on group membership. We furthermore introduce a context inference network that predicts such tokens on the fly from a few samples of a group distribution, enabling ContextViT to generalize to new testing distributions at inference time. We illustrate the performance of ContextViT through a diverse range of applications. In supervised fine-tuning, we demonstrate that augmenting pre-trained ViTs with additional context conditioning leads to significant improvements in out-of-distribution generalization on iWildCam and FMoW. We also explore self-supervised representation learning with ContextViT. Our experiments on the Camelyon17 pathology imaging benchmark and the cpg-0000 microscopy imaging benchmark demonstrate that ContextViT excels at learning stable image featurizations under covariate shift, consistently outperforming its ViT counterpart.
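The context-token idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the architecture of the context inference network, so the version below assumes the simplest possible stand-in (mean pooling of patch embeddings over the images of a group, followed by a linear projection). The function names `infer_context_token` and `add_context`, the identity projection, and the random embeddings are all hypothetical and chosen only for illustration.

```python
import random

def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors elementwise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def linear(weight, vec):
    """Apply a weight matrix (given as a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def infer_context_token(group_images, weight):
    """Hypothetical context inference: pool over a group, then project.

    group_images: list of images, each a list of patch-embedding vectors.
    This mean-pool-plus-linear form is an assumption, not the paper's network.
    """
    per_image = [mean_pool(patches) for patches in group_images]
    return linear(weight, mean_pool(per_image))

def add_context(patch_tokens, context_token):
    """Return a new sequence: the image's patch tokens plus the context token."""
    return patch_tokens + [context_token]

random.seed(0)
dim, num_patches = 4, 3
# An identity projection stands in for a learned layer (assumption).
weight = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
# Five images sharing one covariate value; random embeddings as placeholders.
group = [
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
    for _ in range(5)
]
context = infer_context_token(group, weight)
tokens = add_context(group[0], context)
print(len(tokens), len(tokens[0]))  # sequence length and embedding width
```

Because the token is computed from the group's samples rather than looked up from a learned table, the same function can be applied to a few images from a group never seen during training, which is the property the abstract attributes to the context inference network.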

