Enhancing Performance of Vision Transformers on Small Datasets through Local Inductive Bias Incorporation

05/15/2023
by   Ibrahim Batuhan Akkaya, et al.
5

Vision transformers (ViTs) achieve remarkable performance on large datasets, but tend to perform worse than convolutional neural networks (CNNs) when trained from scratch on smaller datasets, possibly due to a lack of local inductive bias in the architecture. Recent studies have therefore added locality to the architecture and demonstrated that it can help ViTs achieve performance comparable to CNNs in the small-size dataset regime. Existing methods, however, are architecture-specific or have higher computational and memory costs. Thus, we propose a module called Local InFormation Enhancer (LIFE) that extracts patch-level local information and incorporates it into the embeddings used in the self-attention block of ViTs. Our proposed module is memory and computation efficient, as well as flexible enough to process auxiliary tokens such as the classification and distillation tokens. Empirical results show that the addition of the LIFE module improves the performance of ViTs on small image classification datasets. We further demonstrate how the effect can be extended to downstream tasks, such as object detection and semantic segmentation. In addition, we introduce a new visualization method, Dense Attention Roll-Out, specifically designed for dense prediction tasks, allowing the generation of class-specific attention maps utilizing the attention maps of all tokens.

READ FULL TEXT

page 1

page 18

page 19

page 20

research
01/21/2022

A Comprehensive Study of Vision Transformers on Dense Prediction Tasks

Convolutional Neural Networks (CNNs), architectures consisting of convol...
research
06/04/2021

RegionViT: Regional-to-Local Attention for Vision Transformers

Vision transformer (ViT) has recently showed its strong capability in ac...
research
07/12/2022

LightViT: Towards Light-Weight Convolution-Free Vision Transformers

Vision transformers (ViTs) are usually considered to be less light-weigh...
research
12/27/2021

Vision Transformer for Small-Size Datasets

Recently, the Vision Transformer (ViT), which applied the transformer st...
research
01/03/2023

Explainability and Robustness of Deep Visual Classification Models

In the computer vision community, Convolutional Neural Networks (CNNs), ...
research
07/17/2023

Cumulative Spatial Knowledge Distillation for Vision Transformers

Distilling knowledge from convolutional neural networks (CNNs) is a doub...
research
11/25/2022

Adaptive Attention Link-based Regularization for Vision Transformers

Although transformer networks are recently employed in various vision ta...

Please sign up or login with your details

Forgot password? Click here to reset