Vision Transformer for Small-Size Datasets

12/27/2021
by Seung Hoon Lee, et al.

Recently, the Vision Transformer (ViT), which applies the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of the ViT results from pre-training on a large dataset such as JFT-300M, and its dependence on large datasets is attributed to a weak locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively address the lack of locality inductive bias and enable the ViT to learn from scratch even on small-size datasets. Moreover, SPT and LSA are generic and effective add-on modules that are easily applicable to various ViTs. Experimental results show that when both SPT and LSA were applied to the ViTs, performance improved by an average of 2.96% on a representative small-size dataset. In particular, Swin Transformer achieved a performance improvement of 4.08% with SPT and LSA.
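
The abstract names SPT and LSA without describing their mechanics. As a rough illustration only, the NumPy sketch below assumes the mechanisms as we understand them from the full paper: SPT concatenates the image with four diagonally shifted copies (shift of half a patch, zero padding) before patch partitioning, and LSA masks the diagonal of the attention scores and divides by a temperature that would be learnable in training. The function names `shifted_patch_tokens` and `locality_self_attention` are hypothetical, and the layer normalization and linear projection that would follow tokenization are omitted. It is not the authors' implementation.

```python
import numpy as np

def shifted_patch_tokens(image, patch):
    """Sketch of SPT. image: (H, W, C); returns (num_patches, patch*patch*5*C)."""
    h, w, c = image.shape
    s = patch // 2  # assumed shift: half the patch size
    padded = np.pad(image, ((s, s), (s, s), (0, 0)))  # zero padding on H and W
    views = [image]
    # Four diagonal shifts: up-left, up-right, down-left, down-right.
    for dy, dx in [(-s, -s), (-s, s), (s, -s), (s, s)]:
        views.append(padded[s + dy : s + dy + h, s + dx : s + dx + w])
    stacked = np.concatenate(views, axis=-1)  # (H, W, 5C)
    gh, gw = h // patch, w // patch  # patch grid size
    x = stacked[: gh * patch, : gw * patch]
    x = x.reshape(gh, patch, gw, patch, 5 * c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * 5 * c)

def locality_self_attention(q, k, v, tau):
    """Sketch of LSA for one head. q, k, v: (N, d) with N >= 2; tau: scalar."""
    scores = (q @ k.T) / tau  # temperature scaling (tau is learned in training)
    np.fill_diagonal(scores, -np.inf)  # diagonal masking: no token attends to itself
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Under these assumptions, each token carries five times as many channels as a standard patch token, and each attention row distributes its weight over the other tokens only.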

research
03/19/2021

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Convolutional architectures have proven extremely successful for vision ...
research
09/04/2023

ExMobileViT: Lightweight Classifier Extension for Mobile Vision Transformer

The paper proposes an efficient structure for enhancing the performance ...
research
05/15/2023

Enhancing Performance of Vision Transformers on Small Datasets through Local Inductive Bias Incorporation

Vision transformers (ViTs) achieve remarkable performance on large datas...
research
12/27/2021

ViR:the Vision Reservoir

The most recent year has witnessed the success of applying the Vision Tr...
research
10/13/2022

Vision Transformers provably learn spatial structure

Vision Transformers (ViTs) have achieved comparable or superior performa...
research
10/31/2022

Studying inductive biases in image classification task

Recently, self-attention (SA) structures became popular in computer visi...
research
07/05/2022

CNN-based Local Vision Transformer for COVID-19 Diagnosis

Deep learning technology can be used as an assistive technology to help ...
