Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet

04/22/2021
by Zihang Jiang, et al.

This paper provides a strong baseline for vision transformers on the ImageNet classification task. While recent vision transformers have demonstrated promising results in ImageNet classification, their performance still lags behind powerful convolutional neural networks (CNNs) with approximately the same model size. In this work, instead of describing a novel transformer architecture, we explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques. We show that by slightly tuning the structure of vision transformers and introducing token labeling – a new training objective – our models are able to achieve better results than the CNN counterparts and other transformer-based classification models with a similar amount of training parameters and computation. Taking a vision transformer with 26M learnable parameters as an example, we can achieve 84.4% top-1 accuracy on ImageNet. When the model size is scaled up to 56M/150M, the result can be further increased to 85.4%/86.2% without extra data. We hope this study could provide researchers with useful techniques to train powerful vision transformers. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling.
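The abstract names token labeling only as a new training objective, so the sketch below is just one plausible PyTorch rendering of a token-level auxiliary loss, assuming each patch token is supervised by its own pre-computed soft label in addition to the usual image-level label. The TokenLabelingLoss class, the beta weight, and the tensor shapes are illustrative assumptions rather than the authors' implementation, which is available in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenLabelingLoss(nn.Module):
    """Sketch of a token-labeling-style objective (assumed form, not the official one).

    Combines the usual classification loss on the class token with an auxiliary
    cross-entropy applied to every patch token, each supervised by its own
    pre-computed soft label (e.g. produced by a dense machine annotator).
    """

    def __init__(self, beta: float = 0.5):
        super().__init__()
        self.beta = beta  # weight of the per-token auxiliary term (assumed value)

    def forward(self, cls_logits, token_logits, cls_target, token_targets):
        # cls_logits:    (B, C)     logits from the class-token head
        # token_logits:  (B, N, C)  logits from each of the N patch tokens
        # cls_target:    (B,)       image-level hard labels
        # token_targets: (B, N, C)  per-token soft labels
        cls_loss = F.cross_entropy(cls_logits, cls_target)

        # Soft cross-entropy between each patch token and its own label.
        log_probs = F.log_softmax(token_logits, dim=-1)
        token_loss = -(token_targets * log_probs).sum(dim=-1).mean()

        return cls_loss + self.beta * token_loss


if __name__ == "__main__":
    B, N, C = 2, 196, 1000  # batch, patch tokens (14x14 grid), classes
    criterion = TokenLabelingLoss(beta=0.5)
    loss = criterion(
        torch.randn(B, C),
        torch.randn(B, N, C),
        torch.randint(0, C, (B,)),
        torch.softmax(torch.randn(B, N, C), dim=-1),
    )
    print(loss.item())
```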

Related research

05/21/2022
Vision Transformers in 2022: An Update on Tiny ImageNet
The recent advances in image transformers have shown impressive results ...

03/29/2021
CvT: Introducing Convolutions to Vision Transformers
We present in this paper a new architecture, named Convolutional vision ...

02/22/2021
Do We Really Need Explicit Position Encodings for Vision Transformers?
Almost all visual transformers such as ViT or DeiT rely on predefined po...

06/01/2023
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Modern hierarchical vision transformers have added several vision-specif...

05/23/2022
Super Vision Transformer
We attempt to reduce the computational costs in vision transformers (ViT...

04/01/2023
Vision Transformers with Mixed-Resolution Tokenization
Vision Transformer models process input images by dividing them into a s...

03/18/2021
Danish Fungi 2020 – Not Just Another Image Recognition Dataset
We introduce a novel fine-grained dataset and benchmark, the Danish Fung...
