How to Train Vision Transformer on Small-scale Datasets?

10/13/2022
by Hanan Gani, et al.

Vision Transformer (ViT), an architecture radically different from convolutional neural networks, offers multiple advantages including design simplicity, robustness, and state-of-the-art performance on many vision tasks. However, in contrast to convolutional neural networks, the Vision Transformer lacks inherent inductive biases. Successful training of such models is therefore mainly attributed to pre-training on large-scale datasets such as ImageNet with 1.2M images or JFT with 300M images. This hinders the direct adoption of Vision Transformers for small-scale datasets. In this work, we show that self-supervised inductive biases can be learned directly from small-scale datasets and serve as an effective weight initialization scheme for fine-tuning. This allows these models to be trained without large-scale pre-training, changes to the model architecture, or changes to the loss functions. We present thorough experiments to successfully train monolithic and non-monolithic Vision Transformers on five small datasets, including CIFAR10/100, CINIC10, SVHN, and Tiny-ImageNet, and on two fine-grained datasets: Aircraft and Cars. Our approach consistently improves the performance of Vision Transformers while retaining their properties, such as attention to salient regions and higher robustness. Our code and pre-trained models are available at: https://github.com/hananshafi/vits-for-small-scale-datasets.
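The two-stage scheme the abstract describes can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the `TinyViT` module, the rotation-prediction pretext task, and the random stand-in data are all assumptions chosen to keep the example self-contained; the paper's actual self-supervised objective and training recipe are given in the full text.

```python
import torch
import torch.nn as nn

# A tiny patch-embedding transformer standing in for a ViT (hypothetical).
class TinyViT(nn.Module):
    def __init__(self, img_size=32, patch=8, dim=64, depth=2, heads=4):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # (B, 3, H, W) -> (B, n_patches, dim) -> pooled (B, dim)
        x = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos
        return self.norm(self.encoder(x)).mean(dim=1)

# Stage 1: self-supervised pretraining on the small dataset itself.
# Rotation prediction is used here only as a stand-in pretext task.
def pretrain(model, images, steps=5):
    head = nn.Linear(64, 4)  # predict one of 4 rotations
    opt = torch.optim.AdamW(
        list(model.parameters()) + list(head.parameters()), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        k = torch.randint(0, 4, (images.size(0),))
        rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                               for img, r in zip(images, k)])
        loss = loss_fn(head(model(rotated)), k)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

# Stage 2: supervised fine-tuning from the self-supervised weights.
def finetune(model, images, labels, num_classes=10, steps=5):
    clf = nn.Linear(64, num_classes)
    opt = torch.optim.AdamW(
        list(model.parameters()) + list(clf.parameters()), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        loss = loss_fn(clf(model(images)), labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return model, clf

torch.manual_seed(0)
images = torch.randn(16, 3, 32, 32)   # stand-in for a small labeled dataset
labels = torch.randint(0, 10, (16,))
vit = TinyViT()
vit = pretrain(vit, images)           # learn inductive biases self-supervised
vit, clf = finetune(vit, images, labels)
logits = clf(vit(images))
print(logits.shape)  # torch.Size([16, 10])
```

The point of the sketch is the weight flow: the same `vit` object carries its self-supervised weights into fine-tuning, rather than being re-initialized, which is the initialization scheme the abstract proposes.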


