TinyViT: Fast Pretraining Distillation for Small Vision Transformers

07/21/2022
by Kan Wu, et al.

The vision transformer (ViT) has recently drawn great attention in computer vision due to its remarkable model capability. However, most prevailing ViT models suffer from a huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of tiny and efficient vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling small models to reap the dividends of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored on disk in advance to save memory cost and computation overhead. The tiny student transformers are automatically scaled down from a large pretrained model under computation and parameter constraints. Comprehensive experiments demonstrate the efficacy of TinyViT. It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, comparable to Swin-B pretrained on ImageNet-21k while using 4.2 times fewer parameters. Moreover, by increasing image resolution, TinyViT can reach 86.5% accuracy, slightly better than Swin-L while using only 11% of the parameters. Last but not least, we demonstrate the good transfer ability of TinyViT on various downstream tasks. Code and models are available at https://github.com/microsoft/Cream/tree/main/TinyViT.
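The logit-caching step is straightforward to sketch: the teacher is run once over the pretraining data, only the top-k logits per image are kept, and the student later densifies them into soft labels. The PyTorch sketch below is illustrative only; the value of k, the fp16/int16 storage format, and the soft cross-entropy loss are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

K = 100            # logits kept per image (assumed value)
TEMPERATURE = 1.0  # distillation temperature (assumed)

@torch.no_grad()
def save_sparse_teacher_logits(teacher, loader, path):
    """Run the teacher once over the dataset and keep only the top-K
    logit values and their class indices, stored compactly on disk."""
    teacher.eval()
    values, indices = [], []
    for images, _ in loader:
        logits = teacher(images)              # (B, num_classes)
        topv, topi = logits.topk(K, dim=-1)   # largest K logits per image
        values.append(topv.half().cpu())      # fp16 shrinks the file
        indices.append(topi.to(torch.int16).cpu())
    torch.save({"values": torch.cat(values),
                "indices": torch.cat(indices)}, path)

def distill_loss(student_logits, saved_values, saved_indices):
    """Soft cross-entropy against densified sparse teacher logits.
    Classes outside the stored top-K receive zero probability."""
    dense = torch.full_like(student_logits, float("-inf"))
    dense.scatter_(-1, saved_indices.long(),
                   saved_values.to(student_logits.dtype))
    teacher_prob = F.softmax(dense / TEMPERATURE, dim=-1)
    student_logprob = F.log_softmax(student_logits / TEMPERATURE, dim=-1)
    return -(teacher_prob * student_logprob).sum(dim=-1).mean()
```

One practical subtlety: the cached logits must correspond to the exact augmented views the student later sees, so the data augmentation has to be made reproducible, e.g. by also storing its random parameters alongside the logits.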
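The abstract's other ingredient, automatically scaling the student down from a large model under computation and parameter constraints, could look schematically like the sketch below. The search-space values, the crude parameter estimator, and the random-sampling strategy are all assumptions for illustration; the paper's actual contraction procedure may differ.

```python
import random

# Illustrative scaling knobs for a 4-stage hierarchical transformer.
SEARCH_SPACE = {
    "embed_dim": [48, 64, 80, 96],              # base channel width
    "depths":    [(2, 2, 6, 2), (2, 2, 4, 2)],  # blocks per stage
    "mlp_ratio": [3.0, 4.0],
}

def estimate_params(cfg):
    """Crude analytic count: per block, roughly 4*d^2 for the attention
    projections plus 2*mlp_ratio*d^2 for the MLP, with the width
    doubling at each of the four stages."""
    total = 0
    for stage, depth in enumerate(cfg["depths"]):
        d = cfg["embed_dim"] * 2 ** stage
        total += depth * (4 * d * d + 2 * cfg["mlp_ratio"] * d * d)
    return total

def scale_down(param_budget, trials=1000, seed=0):
    """Randomly sample configurations and keep the largest one that
    fits the budget -- a stand-in for a more principled search."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        p = estimate_params(cfg)
        if p <= param_budget and (best is None or p > best[0]):
            best = (p, cfg)
    return best

print(scale_down(param_budget=21e6))  # e.g. target a ~21M-parameter student
```

In practice such a search would also check a FLOPs budget and would inherit weights from the large pretrained model rather than train each candidate from scratch.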
