Efficient Self-supervised Vision Transformers for Representation Learning

06/17/2021
by   Chunyuan Li, et al.

This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attention can significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching, which allows the model to capture fine-grained region dependencies and, as a result, significantly improves the quality of the learned vision representations. Our results show that by combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and models will be publicly available.
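To make the region-matching idea concrete, below is a minimal NumPy sketch of a region-level matching loss, not the paper's exact formulation: for each region feature of one (student) view, the best-matching region of the other (teacher) view, found by cosine similarity, supplies a soft target distribution, and the loss is the average cross-entropy between the two. All function names, temperatures, and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, temp=1.0):
    # Temperature-scaled softmax along the last axis (numerically stable).
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def region_matching_loss(student_regions, teacher_regions,
                         temp_s=0.1, temp_t=0.04):
    """Hypothetical sketch of a region-level matching loss.

    student_regions: (n, d) projected features for n regions of one view.
    teacher_regions: (m, d) projected features for m regions of another view.
    For each student region, the most similar teacher region (by cosine
    similarity) provides the target distribution; the loss averages the
    cross-entropy between teacher targets and student predictions.
    """
    s_norm = student_regions / np.linalg.norm(student_regions, axis=1, keepdims=True)
    t_norm = teacher_regions / np.linalg.norm(teacher_regions, axis=1, keepdims=True)
    sim = s_norm @ t_norm.T                # (n, m) cosine similarities
    match = sim.argmax(axis=1)             # best teacher region per student region
    p_teacher = softmax(teacher_regions[match], temp_t)  # sharper targets
    p_student = softmax(student_regions, temp_s)
    return float(-(p_teacher * np.log(p_student + 1e-8)).sum(axis=1).mean())

rng = np.random.default_rng(0)
loss = region_matching_loss(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)))
print(f"region-matching loss = {loss:.3f}")
```

The lower teacher temperature mimics the common self-distillation trick of sharpening targets so the student is pushed toward confident region assignments rather than uniform ones.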


Related research

05/28/2022 · A Closer Look at Self-supervised Lightweight Vision Transformers
Self-supervised learning on large-scale Vision Transformers (ViTs) as pr...

01/18/2022 · RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training
Recently, self-supervised vision transformers have attracted unprecedent...

09/26/2019 · Joint-task Self-supervised Learning for Temporal Correspondence
This paper proposes to learn reliable dense correspondence from videos i...

05/30/2022 · HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling
Recently, masked image modeling (MIM) has offered a new methodology of s...

06/29/2023 · Learning Nuclei Representations with Masked Image Modelling
Masked image modelling (MIM) is a powerful self-supervised representatio...

05/30/2023 · Contextual Vision Transformers for Robust Representation Learning
We present Contextual Vision Transformers (ContextViT), a method for pro...

06/13/2021 · InfoBehavior: Self-supervised Representation Learning for Ultra-long Behavior Sequence via Hierarchical Grouping
E-commerce companies have to face abnormal sellers who sell potentially-...
