
HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling

by Xiaosong Zhang, et al.

Recently, masked image modeling (MIM) has offered a new methodology of self-supervised pre-training of vision transformers. A key idea of efficient implementation is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), even though hierarchical vision transformers (e.g., Swin Transformer) have potentially better properties in formulating vision inputs. In this paper, we offer a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) that enjoys both high efficiency and good performance in MIM. The key is to remove the unnecessary "local inter-unit operations", deriving structurally simple hierarchical vision transformers in which mask-units can be serialized like in plain vision transformers. For this purpose, we start with Swin Transformer and (i) set the masking unit size to be the token size in the main stage of Swin Transformer, (ii) switch off inter-unit self-attentions before the main stage, and (iii) eliminate all operations after the main stage. Empirical studies demonstrate the advantageous performance of HiViT in terms of fully-supervised, self-supervised, and transfer learning. In particular, in running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a nearly 2x speed-up over Swin-B, and the performance gain generalizes to downstream tasks of detection and segmentation. Code will be made publicly available.




1 Introduction

Deep neural networks have been the foundation of deep learning and have advanced the research fields of computer vision, natural language processing, etc., in the past decade. Recently, the computer vision community has witnessed the emergence of vision transformers ViT2021 ; Swin2021 ; wang2021pvtv2 ; zhou2021deepvit ; dai2021coatnet ; li2021efficient , transplanted from language models Attention2017 ; devlin2019bert , which have replaced the dominance of convolutional neural networks AlexNet2012 ; Resnet2016 ; EfficientNet2019 . They have the ability to formulate long-range feature dependencies, which naturally benefits visual recognition, especially when long-range relationships are important.

There are mainly two families of vision transformers, namely, the plain vision transformers ViT2021 ; DeiT2021 and the hierarchical vision transformers Swin2021 ; wang2021pvtv2 ; dong2021cswin ; chen2021crossvit , differing from each other in whether multi-resolution feature maps are used. The latter are believed to better capture the nature of vision signals (most convolution-based models have used the hierarchical configuration), but they rely on spatially local operations (e.g., early-stage self-attentions with shifted windows). These models can encounter difficulties when the tokens need to be flexibly manipulated. A typical example lies in masked image modeling (MIM), a recent methodology of pre-training vision transformers bao2021beit ; MAE2021 ; xie2021simmim – a random part of image patches is hidden from the input, and it is difficult for the hierarchical models, unlike the plain models, to determine whether each pair of tokens needs to communicate. Essentially, this is because hierarchical vision transformers use non-global operations (e.g., window attentions) between the masking units (the minimum size of the masked pixels when executing MIM is defined as the masking unit; for example, when the input image size is 224x224, the masking unit size for MAE MAE2021 is 16x16). Hence, unlike the plain vision transformers that can serialize all tokens for acceleration, the hierarchical vision transformers must maintain the two-dimensional structure, keeping the dummy (masked) tokens throughout the encoder. Consequently, as shown in xie2021simmim , training hierarchical transformers for MIM is slower than training plain transformers, and very few works chose to follow this direction.

In this paper, we start with categorizing the operations in hierarchical vision transformers into ‘intra-unit operations’, ‘global inter-unit operations’, and ‘local inter-unit operations’. We note that plain vision transformers only contain ‘intra-unit operations’ (i.e., patch embedding, layer normalization, MLP) and ‘global inter-unit operations’ (i.e., global self-attentions), hence the units’ spatial coordinates can be discarded and the units can be serialized for efficient computation, as in MAE MAE2021 . For hierarchical vision transformers, by contrast, it is the ‘local inter-unit operations’ (i.e., shifting-window self-attentions, patch merging) that call for extra judgment based on the units’ spatial coordinates, obstructing both the serialization and the removal of the masked units.
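The serialization idea can be illustrated with a minimal NumPy sketch (variable names are ours, not the paper's): since intra-unit and global inter-unit operations never consult spatial coordinates, the visible units can be packed into a dense list and processed as-is.

```python
import numpy as np

# A 14x14 grid of masking units, each already embedded as a 32-dim vector.
rng = np.random.default_rng(0)
grid = rng.standard_normal((14, 14, 32))

# Pick 25% of the units as visible and serialize them into a dense list.
flat = grid.reshape(-1, 32)                      # drop the 2-D layout
visible_ids = rng.permutation(flat.shape[0])[:49]
serialized = flat[visible_ids]                   # (49, 32): what the encoder sees

# Intra-unit ops (LayerNorm, MLP) and global self-attention act on this
# (49, 32) array directly; a window attention, in contrast, would need the
# original (row, col) of every unit to decide which pairs may interact.
```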

A key observation of this paper is that ‘local inter-unit operations’ do not contribute much to recognition performance – what really matters is the hierarchical design (i.e., multi-scale feature maps) itself. Hence, to fit hierarchical vision transformers to MIM, we remove the ‘local inter-unit operations’, resulting in a simple hierarchical vision transformer that absorbs both the flexibility of ViT ViT2021 and the superiority of Swin Transformer Swin2021 . There are usually 4 stages of different resolutions in hierarchical vision transformers; the 3rd stage has the largest number of layers and we call it the main stage. We remove the last stage of Swin and switch off all local inter-unit window attentions, only keeping the global attention between tokens in the main stage. In practice, the last stage is merged into the main stage (to keep the model FLOPs unchanged) and the local window attentions in the early stages are replaced by intra-unit multi-layer perceptrons with the same FLOPs. With these minimal modifications, we remove all redundant ‘local inter-unit operations’ in hierarchical vision transformers, keeping only the simplest hierarchical structure. Compared to the plain ViTs, our model only adds several spatial merge operations and MLP layers before the main stage.

The resulting architecture is named HiViT (short for Hierarchical ViT), which has the ability of modeling hierarchical visual signals while all tokens remain maximally individual and flexible for manipulation. Meanwhile, HiViT maintains the ViT paradigm, which is very simple to implement compared to other hierarchical vision transformers.

Figure 1: Self-supervised pre-training of HiViT is significantly faster than Swin with SimMIM xie2021simmim and the result is better than ViT trained with MAE MAE2021 and BEiT bao2021beit . Circle size denotes memory requirement. All the models are in base scale.

We perform fully-supervised classification experiments on ImageNet-1K to validate the superiority of HiViT. Lying between ViT and Swin Transformer, HiViT enjoys consistent accuracy gains over both competitors, e.g., HiViT-B reports an 83.8% top-1 accuracy, which is 2.0% over ViT-B and 0.3% over Swin-B. With extensive ablation studies, we find that removing the ‘local inter-unit operations’ does not harm the recognition performance, while the hierarchical structure and relative positional encoding (not ‘local inter-unit operations’) slightly but consistently improve the classification accuracy. This makes HiViT applicable to a wide range of visual recognition scenarios.

Continuing to MIM, the advantages of HiViT become clearer. With 800 epochs of MIM-based pre-training and 100 epochs of fine-tuning, HiViT-B reports an 84.2% top-1 accuracy on ImageNet-1K, which is 0.6% over ViT-B (using MAE MAE2021 , pre-training for 1,600 epochs) and 0.2% over Swin-B (using SimMIM xie2021simmim ). More importantly, HiViT enjoys the efficient implementation that discards all masked patches (or tokens) at the input stage, and hence the training speed is nearly as fast as that of MAE and much faster than that of SimMIM, since the original Swin Transformer must forward-propagate the full token set (Fig. 1). The advantages persist in other visual recognition tasks, including linear probing (71.3% top-1 accuracy) on ImageNet-1K, semantic segmentation (48.3 mIoU) on the ADE20K dataset zhou2017scene , and object detection (49.5 box AP) and instance segmentation (43.8 mask AP) on the COCO dataset lin2014microsoft (1x training schedule). These results validate that removing ‘local inter-unit operations’ does not harm generic visual recognition.

The core contribution of this paper is HiViT, a hierarchical vision transformer architecture that is off-the-shelf for a wide range of vision applications. In particular, with masked image modeling being a popular self-supervised learning paradigm, HiViT has the potential of being directly plugged into many existing algorithms to improve their effectiveness and efficiency in learning visual representations from large-scale, unlabeled data.

2 Related Work

2.1 Vision Transformers

Vision transformers ViT2021 were adapted from the natural language processing (NLP) transformers Attention2017 ; devlin2019bert , opening a new direction of designing visual recognition models with weak inductive biases mlp-mixer . Early vision transformers ViT2021 mainly adopted the plain configuration, for which efficient training methods were strongly required DeiT2021 . To cater to vision-friendly priors, Swin Transformer Swin2021 proposed a hierarchical architecture that contains multi-level feature maps and validated good performance on many vision problems. Since then, various efforts have emerged in improving hierarchical vision transformers, including borrowing design experience from convolutional neural networks wang2021pvtv2 ; wu2021cvt ; vaswani2021scaling , adjusting the geometry of self-attention dong2021cswin ; yang2021focal , and designing hybrid architectures that integrate convolution and transformer modules srinivas2021bottleneck ; container ; dai2021coatnet ; Conformer2021 .

Essentially, there is a tradeoff between plain and hierarchical vision transformers – in terms of whether a strong inductive bias is to be introduced. As we shall see later, increasing the inductive bias may weaken the flexibility, and thus the efficiency, of applying vision transformers to particular scenarios (e.g., masked image modeling). In this paper, we design a hierarchical vision transformer that maximally discards inductive bias, achieving both high efficiency and good performance.

2.2 Self-Supervised Learning and Masked Image Modeling

In the context of computer vision, self-supervised learning aims to learn compact visual representations from unlabeled data. The key to this goal is to design a pretext task that sets a natural constraint for the target model to achieve by tuning its weights. The existing pretext tasks are roughly partitioned into three categories, namely, geometry-based proxies that are built upon the spatial relationship of image contents wei2019iterative ; jigsaw ; rotation , contrast-based proxies that assume that different views of an image shall produce related visual features he2020momentum ; chen2020simple ; grill2020bootstrap ; caron2021emerging ; swav ; pixpro ; sage , and generation-based proxies that require visual representations to be capable of recovering the original image contents colorization ; inpainting ; MAE2021 ; bao2021beit ; beyond . After the self-supervised learning (a.k.a. pre-training) stage, the target model is often evaluated by fine-tuning on a few downstream recognition tasks – popular examples include image classification on ImageNet-1K ImageNet2009 , semantic segmentation on ADE20K zhou2017scene , and object detection and instance segmentation on COCO COCO2014 .

We are interested in a particular generation-based method named masked image modeling (MIM) bao2021beit ; MAE2021 . The flowchart is straightforward: some image patches (corresponding to tokens) are discarded, the target model receives the incomplete input, and the goal is to recover the original image contents. MIM is strongly related to the masked language modeling (MLM) task in NLP. BEiT bao2021beit transferred the task to the computer vision community by masking image patches and recovering the tokens produced by a pre-trained model (known as the tokenizer). MAE MAE2021 improved the MIM framework by only taking the visible tokens as input and computing the loss at the pixel level – the former change largely accelerated the training procedure as the computational costs of the encoder went down. The follow-up works explored different recovery targets wei2021masked , more complicated model designs CIM ; chen2020simple , and other pretext tasks.

It is worth noting that MIM matches plain vision transformers very well because each token is an individual unit and only the unmasked tokens are necessary during the pre-training process. These properties do not hold for hierarchical vision transformers, making it difficult for them to inherit the good properties (e.g., training efficiency). Although SimMIM xie2021simmim tried to combine Swin Transformer with MIM, it requires all tokens, including those corresponding to the masked patches, to be preserved throughout the encoder, incurring much heavier computational costs. In this paper, we present a hierarchical vision transformer that is free of such burden.

3 Hierarchical Vision Transformer for Masked Image Modeling

3.1 Preliminaries

Masked image modeling (MIM) is an emerging paradigm of self-supervised visual representation learning. The flowchart involves feeding a partially masked image to the target model and training the model to recover it. Mathematically, let the target model be f(·; θ), where θ denotes the learnable parameters. Given a training image, x, it is first partitioned into a few patches, {x_1, x_2, …, x_N}, where N is the number of patches. Then, MIM randomly chooses a subset S ⊂ {1, 2, …, N}, feeds the patches with IDs in S (denoted as x_S) into the target model (a.k.a., the encoder), and appends a decoder to it, aiming at recovering the original image contents, either tokenized features bao2021beit or pixels MAE2021 , at the end of the decoder. If f(·; θ) is able to solve the problem, it is believed that the parameters θ have been well trained to extract compact visual features.
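The flowchart above can be sketched in a few lines of NumPy; the helper names (`patchify`, `random_masking`) are our own, and the 224x224 input with 16x16 patches follows the MAE setting referenced in the text.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into N = (H/patch)*(W/patch) flattened patches."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

def random_masking(patches, mask_ratio, rng):
    """Keep a random subset S of patch IDs; return the visible patches x_S and S."""
    N = patches.shape[0]
    n_keep = int(N * (1 - mask_ratio))
    ids = rng.permutation(N)[:n_keep]
    return patches[ids], ids

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
p = patchify(img, 16)                     # 196 patches, each 16*16*3 = 768-dim
vis, ids = random_masking(p, 0.75, rng)   # 49 visible patches at 75% masking
```

The encoder then sees only `vis`, and the decoder is trained to reconstruct the dropped patches.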

An efficient vision model that fits MIM is the vanilla vision transformer, abbreviated as ViT ViT2021 . In ViT, each image patch is transformed into a token (i.e., a feature vector), and the tokens are propagated through a few transformer blocks for visual feature extraction. Let there be L blocks; the l-th block takes the token set U_{l-1} as input and outputs U_l, for l = 1, 2, …, L. The main part of each block is self-attention, for which three intermediate features are computed upon U_{l-1}, namely the query, key, and value, denoted as Q, K, and V, respectively. Based on these quantities, the self-attention output is computed by softmax(QK^T / sqrt(d)) · V, where sqrt(d) is a scaling factor. Auxiliary operations, including layer normalization, multi-layer perceptron, and skip-layer connections, are applied after the self-attention computation. ViT has been applied to a series of vision problems, but we emphasize its particular efficiency on MIM, which lies in that the tokens not in S can be discarded at the beginning of the encoder, decreasing the complexity of the pre-training process by a large factor (e.g., only 25% of the tokens are kept in the regular setting of MAE MAE2021 ).
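The self-attention computation above can be written directly over a serialized token list; this NumPy sketch (single head, illustrative weight shapes) shows that nothing in it depends on which tokens were dropped.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(U, Wq, Wk, Wv):
    """Single-head self-attention over a token set U of shape (N, d).
    Masked tokens can simply be absent from U: no spatial coordinates
    are consulted anywhere in the computation."""
    Q, K, V = U @ Wq, U @ Wk, U @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))   # (N, N) attention weights
    return A @ V

rng = np.random.default_rng(0)
W = [rng.standard_normal((64, 64)) / 8 for _ in range(3)]
full = self_attention(rng.standard_normal((196, 64)), *W)  # all 196 tokens
kept = self_attention(rng.standard_normal((49, 64)), *W)   # 25% visible subset
```

Running on the 49-token subset costs roughly (49/196)^2 of the attention FLOPs, which is the source of MAE-style acceleration.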

Intuitively, hierarchical vision transformers (e.g., Swin Transformer) are better at capturing multi-level visual features. Such a model has three major differences from ViT: (i) the architecture is partitioned into a few stages, and the spatial resolution, rather than being fixed, gradually shrinks throughout the forward propagation; (ii) to handle relatively large token maps, the self-attention computation is constrained within a grid of windows, and the window partition is shifted across layers; (iii) global positional encoding is replaced by relative positional encoding – this is to fit the window attention mechanism. Although hierarchical vision transformers report higher visual recognition accuracy, these models are not as efficient as ViT in terms of MIM, for reasons revealed in the next part. Consequently, few prior works have tried the combination – as an example, SimMIM xie2021simmim fed the entire image (with the masked patches replaced by learnable mask tokens) into the encoder, resulting in heavier computational costs in time and memory.
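Difference (ii) can be sketched as follows (illustrative NumPy, not Swin's actual implementation): both the window partition and the cyclic shift index tokens by their 2-D coordinates, which is exactly why masked tokens cannot simply be dropped from the list.

```python
import numpy as np

def window_partition(tokens_2d, win):
    """Split an (H, W, d) token map into (H//win * W//win) windows of win*win tokens."""
    H, W, d = tokens_2d.shape
    x = tokens_2d.reshape(H // win, win, W // win, win, d)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, d)

def shift(tokens_2d, s):
    """Cyclic shift of the token map, as applied between consecutive Swin layers."""
    return np.roll(tokens_2d, shift=(-s, -s), axis=(0, 1))

tmap = np.arange(8 * 8 * 1.0).reshape(8, 8, 1)     # an 8x8 token map
wins = window_partition(tmap, 4)                    # 4 windows of 16 tokens
shifted = window_partition(shift(tmap, 2), 4)       # shifted-window partition
```

Self-attention is then computed independently inside each window, so which tokens may interact depends on their (row, col) positions.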

Figure 2: Comparison of the architectures of Swin Transformer, ViT, and the proposed HiViT.

3.2 HiViT: Efficient Hierarchical Transformer for MIM

We pursue the efficient implementation of MAE MAE2021 , i.e., only the active (unmasked) tokens are fed into the encoder – mathematically, we are always dealing with a squeezed list of tokens. The major difficulty of integrating it with hierarchical vision transformers (e.g., Swin Transformers) lies in the ‘local inter-unit operations’, which make it difficult to serialize the tokens and abandon the inactive (masked) ones. To remove them, we first set the masking unit size to be the token size at the main stage – for Swin Transformers, this is the 3rd stage, which forms the major part of the model (e.g., for Swin-B, the 3rd stage has 18 blocks while the entire architecture has 24 blocks). The masking unit is thus 16x16 pixels, which aligns with the constant token size of ViT. Then, we adjust the model as follows:

  • For the operations after the main stage, we do not allow the patch merging that mixes active and inactive patches. For simplicity, we directly remove the last (4th) stage of the Swin Transformer, which has only two blocks, and append the same number of blocks to the 3rd stage. Since the 3rd stage has a smaller token dimensionality, this operation saves trainable parameters, while changing the ‘local inter-unit operations’ in the 3rd stage to ‘global inter-unit operations’. As a result, the model’s recognition performance becomes better (Fig. 2).

  • For the operations prior to the main stage, we do not allow window attention in the first 2 stages. That said, we remove the shifted windows of Swin and do not introduce any other ‘local inter-unit operations’ such as window attentions or convolutions. As an alternative, we only employ MLP blocks (replacing the self-attention with another MLP layer) in the first 2 stages. Surprisingly, as demonstrated in Fig. 2, this modification brings a performance improvement without bells and whistles. Compared to the plain ViT shown in Fig. 2, the derived architecture possesses the hierarchical property and only requires two extra MLP blocks, yet enjoys much better performance on both self- and fully-supervised learning.

The above procedure produces an architecture between Swin Transformer (hierarchical) and ViT (plain). We illustrate the procedure in Figure 2. The resulting architecture, named HiViT (short for Hierarchical ViT), is structurally simple and admits an efficient implementation for MIM. Specifically, HiViT abandons all the ‘local inter-unit operations’ in the entire architecture, therefore the masked image patches can be discarded at the input layer and their computation can be eliminated in all stages. As a result, HiViT enjoys both the effectiveness of hierarchical vision transformers in capturing visual representations (i.e., the recognition accuracy is much higher than ViT) and the efficiency of plain vision transformers in the masked image modeling task (i.e., the efficient implementation of MAE MAE2021 can be directly transplanted, making HiViT nearly 2x faster than Swin Transformer in MIM). Detailed results are provided in the experimental part.
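Under the standard strides (a 4x patch embedding followed by two 2x mergings, so each main-stage token covers 4·2·2 = 16 pixels per side), the forward pass can be sketched at the shape level in NumPy. Random weights stand in for learned parameters, and all names are illustrative rather than taken from any released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, d_out):
    # Stand-in for a learned MLP block (no attention, hence intra-unit only).
    W = rng.standard_normal((x.shape[-1], d_out)) / np.sqrt(x.shape[-1])
    return np.maximum(x @ W, 0.0)

def merge_2x2(x):
    """Patch merging inside each masking unit: (n, h, w, d) -> (n, h/2, w/2, 4d)."""
    n, h, w, d = x.shape
    x = x.reshape(n, h // 2, 2, w // 2, 2, d)
    return x.transpose(0, 1, 3, 2, 4, 5).reshape(n, h // 2, w // 2, 4 * d)

def hivit_encoder(units, dims=(128, 256, 512)):
    """units: (n_visible, 4, 4, d0) - each 16x16-pixel masking unit holds a
    4x4 grid of stage-1 tokens. Stages 1-2 are intra-unit MLP stages; after
    two mergings, each unit becomes a single main-stage token, ready for
    plain-ViT-style global self-attention over the serialized list."""
    x = mlp(units, dims[0])            # stage 1: (n, 4, 4, 128)
    x = mlp(merge_2x2(x), dims[1])     # stage 2: (n, 2, 2, 256)
    x = mlp(merge_2x2(x), dims[2])     # entering main stage: (n, 1, 1, 512)
    return x.reshape(x.shape[0], dims[2])

tokens = hivit_encoder(rng.standard_normal((49, 4, 4, 96)))  # (49, 512)
```

Because every operation above stays inside a masking unit, the visible units can be fed as a squeezed list of any length, exactly as in MAE.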

Model | Depth (stages 1/2/3) | Dim | Heads | Params (M) | FLOPs (G)
HiViT-T (Tiny) | 1 / 1 / 10 | 96 / 192 / 384 | - / - / 6 | 19.2 | 4.6
HiViT-S (Small) | 2 / 2 / 20 | 96 / 192 / 384 | - / - / 6 | 37.5 | 9.1
HiViT-B (Base) | 2 / 2 / 20 | 128 / 256 / 512 | - / - / 8 | 66.4 | 15.9
Table 1: Configurations for HiViT variants.
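Transcribed into code, Table 1 reads as follows (the dict layout and key names are our own):

```python
# HiViT variants from Table 1: per-stage depths, dims, and main-stage heads.
# Stages 1-2 use MLP blocks, so attention heads exist only in the 3rd stage.
HIVIT_CONFIGS = {
    "HiViT-T": {"depth": (1, 1, 10), "dim": (96, 192, 384),  "heads": 6},
    "HiViT-S": {"depth": (2, 2, 20), "dim": (96, 192, 384),  "heads": 6},
    "HiViT-B": {"depth": (2, 2, 20), "dim": (128, 256, 512), "heads": 8},
}
```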

4 Experiments

We first conduct fully supervised experiments with the proposed HiViT on the ImageNet dataset ImageNet2009 . Then, the HiViT models are tested with masked image modeling (MIM) self-supervised pre-training MAE2021 . The self-supervised pre-trained models are also transferred to downstream tasks, including object detection on the COCO dataset COCO2014 and semantic segmentation on ADE20K zhou2017scene . We also provide ablation studies of our method on both fully- and self-supervised learning.

4.1 ImageNet Classification with Labels

Training Settings

We first evaluate our models with fully supervised learning on ImageNet-1K ImageNet2009 , which contains 1.28M training images and 50K validation images divided into 1,000 categories. We follow Swin Swin2021 and use the same training settings without other tricks. Specifically, we use the AdamW optimizer adamw with an initial learning rate of 0.001, a weight decay of 0.05, a batch size of 1024, a cosine decay learning rate scheduler, and a linear warm-up for 20 epochs. All the models are trained for 300 epochs with the augmentation and regularization strategies of Swin2021 and the exponential moving average (EMA) technique. The input size is 224x224 by default. The output feature of the 3rd stage is followed by an average pooling layer and then a classifier layer. We adopt increasing drop path rates for HiViT-T/S/B.

Model Params FLOPs Top-1
(M) (G) (%)
DeiT-S/16 DeiT2021 22.1 4.5 79.8
PVT-S wang2021pyramid 24.5 3.8 79.8
Swin-T liu2021swin 28.3 4.5 81.2
CvT-13 wu2021cvt 20.0 4.5 81.6
CaiT-XS-24 touvron2021going 26.6 5.4 81.8
HiViT-T (Ours) 19.2 4.6 82.1
CvT-21 wu2021cvt 32.0 7.1 82.5
UFO-ViT-M song2021ufo 37.0 7.0 82.8
Swin-S liu2021swin 49.6 8.7 83.1
ViL-M zhang2021multi 39.7 9.1 83.3
CaiT-S36 touvron2021going 68.0 13.9 83.3
HiViT-S (Ours) 37.5 9.1 83.5
Model Params FLOPs Top-1
(M) (G) (%)
ResNet-152 Resnet2016 60.0 11.0 78.3
PVT-L wang2021pvtv2 61.4 9.8 81.7
DeiT-B/16 DeiT2021 86.7 17.4 81.8
CrossViT-B chen2021crossvit 104.7 21.2 82.2
T2T-ViT-24 yuan2021tokens 64.1 14.1 82.3
CPVT-B chu2021conditional 88.0 17.6 82.3
TNT-B han2021transformer 65.6 14.1 82.8
ViL-B zhang2021multi 55.7 13.4 83.2
UFO-ViT-B song2021ufo 64.0 11.9 83.3
CaiT-M24 touvron2021going 185.9 36.0 83.4
Swin-B liu2021swin 87.8 15.4 83.5
HiViT-B (Ours) 66.4 15.9 83.8
Table 2: Comparison of image classification on ImageNet-1K for different models. All models are trained and evaluated at 224x224 resolution on ImageNet-1K by default, unless otherwise noted.

Model Configurations

Three models, HiViT-T/S/B, are tested with fully supervised learning, and their configurations are shown in Tab. 1. "Depth" represents the number of blocks in the 1st, 2nd, and 3rd stages, respectively. "Dim" and "Heads" represent the dimension and the number of attention heads of the 3 stages. We align our models with competitors on FLOPs, and our parameter counts are lower than those of other works. We report training efficiency in Tab. 5, measured on V100 GPUs with the same script.

ImageNet Results

The fully supervised training results are shown in Tab. 2. Compared to vanilla ViT models, all the HiViT models report dominant results: HiViT-T/B surpass the DeiT-S/B models by 2.3% and 2.0% respectively, with similar FLOPs and fewer parameters. Compared to the follow-ups, our models still show competitive results. In particular, HiViT-T/S/B beat Swin-T/S/B by 0.9%, 0.4%, and 0.3% respectively, with similar complexities and fewer parameters. All the results are obtained without any tricks beyond those of Swin Swin2021 . In addition, all our models are parameter-friendly; for example, compared to the Swin-T/S/B models, HiViT-T/S/B enjoy 32%, 24%, and 24% fewer parameters. We note that our models are structurally simple, which offers a promising baseline for future research.

Ablation Studies

We conduct fully supervised ablation studies to show the advantages of our method. In this part, we inherit the training settings above, and the results are shown in Tab. 3. 'Setting' lists the modules we remove, from top to bottom. 'Dim' is the token dimension at the main-stage resolution. 'Depth' lists the block numbers of the corresponding stages. 'RPE' is the relative position embedding, and 'Win. Att.' denotes whether there are window attentions in the model. As shown in Tab. 3, removing the last stage (Stage4) from Swin-B (while using global attention for stage 3 simultaneously) brings a performance improvement of 0.1%, which implies that the last stage is unnecessary. Replacing window attention with MLP blocks in the first 2 stages (Win. Att.) boosts the performance to 83.8%, which demonstrates that window attention is unnecessary in the early stages. The RPE is important, and getting rid of it (RPE) harms the performance by about 0.3%. If we abandon the first 2 stages and down-sample using a patch embedding like plain ViT while increasing the block number to 24 (Hierarchical), the performance decreases from 83.8% to 82.9%. However, that is still higher than the 81.8% of plain ViT (Deep), which implies that the hierarchical input module is important and a deeper architecture is much better than a shallow one.

Model Setting Dim Depth RPE Win. Att. Params FLOPs Top-1
(M) (G) (%)
Swin-B - 512 2 2 18 2 88.0 15.4 83.5
Stage4 512 2 2 20 - 66.3 16.0 83.6
HiViT-B Win. Att. 512 2 2 20 - 66.4 15.9 83.8
RPE 512 2 2 20 - 66.3 15.9 83.5
Hierarchical 512 - - 24 - 76.7 15.8 82.9
ViT-B Deep 768 - - 12 - 86.6 17.5 81.8
Table 3: Ablations of fully-supervised training on ImageNet-1K.

4.2 Self-Supervised Learning Results

Experimental Details

For self-supervised pre-training, we use the ImageNet-1K training set without labels, and evaluate the pre-trained models with fine-tuning and linear probing metrics on the validation set. We inherit the pre-training settings of MAE MAE2021 . Specifically, we set the mask ratio to 75% by default, and the normalized-target trick is also adopted. We use the AdamW optimizer adamw with a weight decay of 0.05, and the learning rate follows a cosine decay schedule with a linear warm-up. The batch size is set to 4096 and the input size is 224x224. The overall pipeline is an encoder-decoder framework; the decoder has 6 transformer layers followed by a reshape operation that casts the features back to the image resolution. As for data augmentation, we only employ random cropping and random horizontal flipping. We test the HiViT-B model in this part; the model is pre-trained for 300 and 800 epochs and then evaluated with fine-tuning and linear-probing metrics. For fine-tuning, we inherit the training settings from MAE2021 : all models are trained for 100 epochs using the AdamW optimizer with a warm-up, a weight decay of 0.05, and an input size of 224x224. We use a layer-wise learning rate decay of 0.65 and a batch size of 1024. For linear probing, we train all models for 100 epochs using the LARS huo2021large optimizer with a batch size of 16,384 and a learning rate of 0.1.
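The cosine decay schedule with linear warm-up mentioned above can be written as a small helper; this is a generic sketch of the MAE-style schedule, and the base rate and warm-up length passed in below are placeholders, not values from the paper.

```python
import math

def lr_at(step, total, warmup, base_lr):
    """Linear warm-up for `warmup` steps, then cosine decay to zero at `total`."""
    if step < warmup:
        return base_lr * step / max(warmup, 1)
    t = (step - warmup) / max(total - warmup, 1)   # progress in [0, 1]
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

For example, `lr_at(step, total=1000, warmup=40, base_lr=1e-3)` rises linearly to 1e-3 over the first 40 steps and then decays smoothly to zero.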


Fine-tuning

The fine-tuning and linear probing results are provided in Tab. 4; only the encoder part is used for evaluation. As shown in Tab. 4, the HiViT-B (300e) version achieves a 0.2% improvement over the MAE model pre-trained for 1,600 epochs. Our longer training schedule (800e) attains the dominant result of 84.2%, which outperforms MAE (1600e) by 0.6% and SimMIM (Swin-B) by 0.2%. Compared to other methods including CAE, BEiT, and MaskFeat, the HiViT-B model also shows superior results: +1.0% over BEiT, +0.6% over CAE, and +0.2% over MaskFeat.

Linear Probing

We evaluate the pre-trained models with the linear probing metric, where all parameters of the encoder are frozen except for a learnable classifier layer. From Tab. 4, we can see that the HiViT-B model achieves a good result of 71.3%, the best among all the MIM-based methods. For example, the 800e model surpasses MAE (1600e) by 3.3% with fewer pre-training epochs and the same training settings. The result also outperforms SimMIM (ViT-B) by 2.6% and CAE (ViT-B) by 3.0%.

Method Network Params Supervision Encoder Epochs FT (%) LIN (%)
BEiT bao2021beit ViT-B 86 DALLE 100% 400 83.2 -
CAE chen2022context ViT-B 86 DALLE 100% 800 83.6 68.3
MaskFeat wei2021masked ViT-B 86 HOG 100% 800 84.0 -
SimMIM xie2021simmim ViT-B 86 Pixel 100% 800 83.8 68.7
SimMIM xie2021simmim Swin-B 86 Pixel 100% 800 84.0 -
MAE MAE2021 ViT-B 86 Pixel 25% 1600 83.6 68.0
Ours HiViT-B 66 Pixel 25% 300 83.8 -
Ours HiViT-B 66 Pixel 25% 800 84.2 71.3
Table 4: Self-supervised learning results: fine-tuning (FT) and linear probing (LIN) top-1 accuracy on ImageNet-1K.
Input Size ViT-B Swin-B HiViT-B
(MAE) (SimMIM) (Ours)
7.2 14.2 7.4
9.5 18.4 9.7
Table 5: Pre-training efficiency comparison with different input sizes.

Training Efficiency

HiViT only requires the active tokens as inputs, so our method enjoys high efficiency during MIM pre-training. As shown in Tab. 5, we report the pre-training speed of MAE (ViT-B), SimMIM (Swin-B), and our HiViT-B with different input sizes. All the results represent the pre-training time (in minutes) of 1 epoch on V100 GPUs. With the smaller input size, HiViT-B only takes 7.4 minutes per epoch, which is about 1.9x faster than SimMIM and comparable with MAE. With the larger input size, HiViT-B takes about 9.7 minutes, again about 1.9x faster than SimMIM and comparable with MAE. We note that we use 6 decoder blocks with a dimension of 512 by default. Decreasing the decoder block number or dimension further accelerates pre-training without affecting the results; for example, setting the decoder block number to 4 and the dimension to 384 only requires about 8 minutes per epoch and achieves comparable performance.

ID Depth Params FLOPs FT
(M) (G) (%)
#0 2 2 20 66.4 15.9 83.8
#1 1 1 22 71.8 15.9 83.9
#2 2 0 22 71.1 15.9 83.7
#3 0 2 22 72.3 15.9 83.6
#4 0 0 24 77.1 16.0 83.6
Table 6: Ablations by pre-training for 300 epochs and fine-tuning for 100 epochs (FT). The three Depth columns denote the block numbers of the 1st, 2nd, and 3rd stages, respectively. ID #0 is the default setting.

Ablation Studies

We perform experiments to ablate our method (see Tab. 6); all results are attained by pre-training for 300 epochs. As shown, we test self-supervised pre-training with different block numbers for stages 1, 2, and 3. The default setting (#0) achieves 83.8% with the 2-2-20 block configuration. Decreasing the block numbers of stages 1 and 2 while increasing that of stage 3 (#1) brings more parameters and a better performance of 83.9%, which is comparable to the SimMIM result (84.0%) obtained by pre-training Swin-B for 800 epochs. We note that the 71.8M parameters are still much lower than the 88.0M of Swin-B. Removing stage 1 (#3) or stage 2 (#2) harms the performance, to 83.6% and 83.7% respectively, which demonstrates that the hierarchical architecture before the main stage is important and brings a performance improvement. In addition, the fact that #3 is lower than #2 suggests that the first stage is more important than the second. Removing both of the first 2 stages (#4) attains a result of 83.6%, which further verifies the importance of the hierarchical architecture.

4.3 Transfer to Dense Prediction Tasks

Experimental Details

We transfer the self-supervised pre-trained models above to object detection on COCO and semantic segmentation on ADE20K, following the common practice for both tasks. For the COCO experiments, we use the Mask R-CNN MaskRCNN2017 head implemented in the MMDetection library MMdet2019 . We use the AdamW optimizer adamw with a learning rate that decays by 10x after the 9th and 11th epochs. The layer-wise decay rate is set to 0.75 and the 1x training schedule (12 epochs) is adopted. We also apply the multi-scale training strategy with single-scale testing. For ADE20K, we use the UperNet xiao2018unified head following BEiT bao2021beit . We also choose the AdamW optimizer. We train the model for 160K iterations with a batch size of 16. The input resolution is 512x512, without multi-scale testing.

Object Detection on COCO.

We adopt the same settings as CAE chen2022context to evaluate our model on MS-COCO, choosing the 5th, 9th, 13th, and 19th blocks as inputs to the subsequent FPN. In Tab. 7, we compare our performance with state-of-the-art methods. Compared to BEiT bao2021beit (results borrowed from CAE chen2022context ), HiViT-B shows superior box AP and mask AP. MAE MAE2021 (1600e) falls below our results on both metrics, and CAE chen2022context , despite improving upon them by pre-training for 800 epochs and using the image tokenizer described in DALL-E ramesh2021zero , still remains below our results.

Semantic Segmentation on ADE20K.

The results on ADE20K are shown in Tab. 7. We do not introduce any extra tricks and test all models under the same settings, reporting the mean intersection over union (mIoU). As shown, MoCo-v3 reports a lower mIoU than ours, and BEiT bao2021beit and MAE MAE2021 also fall behind, even when MAE is pre-trained for 1600 epochs. Compared to these state-of-the-art methods, HiViT-B, pre-trained for 800 epochs, reports a higher result than all of them apart from CAE chen2022context , which uses the tokenizer of DALL-E ramesh2021zero .

Method Network Params (M) Pre-train data COCO (AP^box / AP^mask) ADE20K (mIoU)
Supervised MAE2021 ViT-B 86 IN1K w/ labels 47.9 42.9 47.0
MoCo v3 chen2021empirical ViT-B 86 IN1K 45.5 40.5 47.3
BEiT bao2021beit ViT-B 86 IN1K+DALLE 42.1 37.8 47.1
CAE chen2022context ViT-B 86 IN1K+DALLE 49.2 43.3 48.8
MAE MAE2021 (1600e) ViT-B 86 IN1K 48.4 42.6 48.1
Ours (800e) HiViT-B 66 IN1K 49.5 43.8 48.3
Table 7: Downstream task fine-tuning results transferred from self-supervised pre-training.

5 Conclusions

This paper presents a hierarchical vision transformer named HiViT. Starting from Swin Transformer, we remove redundant operations that cross the borders of tokens in the main stage, and show that such modifications do not harm, but slightly improve, the model's performance in both fully-supervised and self-supervised visual representation learning. HiViT shows a clear advantage in integrating with masked image modeling, to which the efficient implementation designed for ViT can be directly transplanted, substantially accelerating training. We expect HiViT to become an off-the-shelf replacement for ViT and Swin Transformer in future research.


Despite the improvements observed in the experiments, our method has some limitations. The most important one is that the masking unit size is fixed, which implies that we need to choose a single 'main stage'. Fortunately, the third stage of Swin Transformer contributes most of the parameters and computation, so it is the natural choice; however, the method may encounter difficulties in scenarios where no dominant stage exists. In addition, we look forward to more flexible architecture designs that go beyond this constraint: a possible solution lies in modifying low-level code (e.g., CUDA) to support arbitrary and variable grouping of tokens.

Societal Impacts

Our research focuses on (i) designing efficient architectures for deep neural networks and (ii) self-supervised learning. Both topics have been widely studied in the community, and our work does not have societal impacts beyond those of prior work in these areas.


Appendix A Architecture Comparison of Fully-Supervised Learning on ImageNet-1K

We compare different transformer architectures, including plain, hierarchical, and hybrid ones, on fully-supervised learning using ImageNet-1K; the results are shown in Tab. 8. Plain transformers usually do not contain local inter-unit operations, which makes them structurally simple, but they generally perform worse than hierarchical ones. To equip transformers with hierarchical properties, researchers introduce complex local inter-unit operations (such as spatial-reduction attention and window attention, as shown in the table) into plain transformers, so that the models better cater to visual priors and attain stronger results. Our HiViT, as shown, enjoys the hierarchical attributes without local inter-unit operations, which keeps the model structurally simple and MIM-friendly; in addition, HiViT obtains better results without bells and whistles. Hybrid transformers add convolutional layers (beyond patch embedding) to transformers and usually attain more dominant results than the former two types, but they are unfriendly to MIM self-supervision because of the sliding-window prior of convolutional layers. Moreover, our self-supervised HiViT-B still achieves a competitive result.

Model type Model Local inter-unit operations Params (M) FLOPs (G) Top-1 (%)
ViT-B/16 DeiT2021 - 86.7 17.4 81.8
CrossViT-B chen2021crossvit - 104.7 21.2 82.2
CaiT-M24 touvron2021going - 185.9 36.0 83.4
T2T-ViT-24 yuan2021tokens - 64.1 14.1 82.3
TNT-B han2021transformer - 65.6 14.1 82.8
PVT-L wang2021pvtv2 spatial-reduction attention 61.4 9.8 81.7
ViL-B zhang2021multi window attention 55.7 13.4 83.2
Swin-B liu2021swin shifted window attention 87.8 15.4 83.5
HiViT-B (Ours) - 66.4 15.9 83.8
ConformerConformer2021 CNN branch 83.3 23.3 84.1
CoAtNet-2dai2021coatnet depth-wise convolution 75.0 15.7 84.1
CSwin-B dong2021cswin cross-shaped window attention 78.0 15.0 84.2
Self-supervised HiViT-B (Ours) - 66.4 15.9 84.2
Table 8: ImageNet-1K results of different transformer architectures.

Appendix B Improved Results on COCO and ADE20K

In dense prediction tasks such as object detection and semantic segmentation, the feature pyramid is a key component. Our experiments in Tab. 7 use the same settings as the self-supervised pre-trained ViT models in MAE2021 ; chen2022context : intermediate features are extracted and up-sampled/down-sampled by deconvolution/convolution to generate the feature pyramid, which works for plain transformers but does not take advantage of our hierarchical structure.

In order to fully exploit the capabilities of our hierarchical vision transformer, we adopt an improved experimental setting following liu2021swin . We use the three feature resolutions (with strides of 4, 8, and 16) generated by stages 1/2/3, and add a stride-32 feature down-sampled from the stage-3 output to align with the standard feature pyramid generation. With this change, HiViT achieves improved results on COCO and ADE20K, as shown in Tab. 9. In particular, it achieves 51.2% box AP and 44.2% mask AP on COCO and 51.2% mIoU on ADE20K.
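The pyramid construction above can be illustrated with a small shape calculation. This is a sketch under the assumption of a square input; the helper name `pyramid_shapes` is invented for illustration and only tracks spatial sizes, not actual features.

```python
# Sketch of the improved feature-pyramid setting: stages 1/2/3 give
# stride-4/8/16 features, and an extra stride-32 level is down-sampled
# from the stage-3 output. Only spatial side lengths are computed here;
# the helper name is hypothetical.
def pyramid_shapes(img_size):
    strides = [4, 8, 16]                      # strides of stages 1, 2, 3
    sides = [img_size // s for s in strides]  # side length of each feature map
    sides.append(sides[-1] // 2)              # extra stride-32 level
    return sides

print(pyramid_shapes(224))  # a 224x224 input gives side lengths [56, 28, 14, 7]
```

This yields the standard four-level pyramid expected by FPN-style detection and segmentation heads.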

Method Network Params (M) Pre-train data COCO (AP^box / AP^mask) ADE20K (mIoU)
Supervised MAE2021 ViT-B 86 IN1K w/ labels 47.9 42.9 47.0
MoCo v3 chen2021empirical ViT-B 86 IN1K 45.5 40.5 47.3
BEiT bao2021beit ViT-B 86 IN1K+DALLE 42.1 37.8 47.1
CAE chen2022context ViT-B 86 IN1K+DALLE 49.2 43.3 48.8
MAE MAE2021 (1600e) ViT-B 86 IN1K 48.4 42.6 48.1
Ours HiViT-B 66 IN1K 49.5 43.8 48.3
Ours (improved) HiViT-B 66 IN1K 51.2 44.2 51.2
Table 9: Improved downstream task fine-tuning results on COCO and ADE20K.

Appendix C The Processing Pipeline of HiViT in MIM Pre-training

In MIM pre-training, HiViT uses the encoder to process only the visible tokens of the input image, as shown in Fig. 3. Since we have multi-resolution image patches, we call the smallest unit that may be masked a mask-unit. For an input image, we first apply patch embedding to obtain a feature tensor of shape N×4×4×128, where N is the total number of mask-units. Then, we randomly select a certain proportion of mask-units to obtain a tensor of shape n×4×4×128, where n is the number of visible mask-units, which is often much smaller than N. HiViT processes the features of the visible mask-units stage by stage, and finally produces 512-dimensional vectors as the output of the encoder, which are fed to the decoder for pixel restoration.
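The random selection of visible mask-units can be sketched as below. The function name, the mask ratio of 0.75, and the unit count of 196 are illustrative assumptions (typical MAE-style values), not values fixed by this paper.

```python
import random

# Sketch of the MIM masking step: from N serialized mask-units, randomly
# keep a fraction (1 - mask_ratio) as the visible units fed to the encoder.
# The function name and the example values are hypothetical.
def select_visible_units(num_units, mask_ratio, seed=0):
    rng = random.Random(seed)
    num_visible = int(num_units * (1 - mask_ratio))
    indices = list(range(num_units))
    rng.shuffle(indices)
    return sorted(indices[:num_visible])  # indices of visible mask-units

visible = select_visible_units(num_units=196, mask_ratio=0.75)
# With a 75% mask ratio, only 49 of 196 mask-units enter the encoder.
```

The encoder then gathers only these indices from the patch-embedded features, which is what keeps pre-training cost proportional to the visible fraction.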

During the pre-training process, all three stages only need to process visible mask-units – this is why our model preserves the hierarchical structure while keeping self-supervised pre-training efficient.

Figure 3: Pipeline of HiViT in MIM pre-training.