CNN or ViT? Revisiting Vision Transformers Through the Lens of Convolution

09/11/2023
by Chenghao Li, et al.

The success of the Vision Transformer (ViT) has been widely reported on a broad range of image recognition tasks. The merit of ViT over CNNs has been largely attributed to large training datasets or auxiliary pre-training. Without pre-training, the performance of ViT on small datasets is limited because global self-attention has limited capacity for local modeling. To boost ViT on small datasets without pre-training, this work improves its local modeling by applying a weight mask to the original self-attention matrix. A straightforward way to locally adapt the self-attention matrix is an element-wise learnable weight mask (ELM), for which our preliminary experiments show promising results. However, the element-wise learnable mask not only induces a non-trivial parameter overhead but also increases the optimization complexity. To this end, this work proposes a novel Gaussian mixture mask (GMM), in which each mask has only two learnable parameters and can be conveniently used in any ViT variant whose attention mechanism allows the use of masks. Experimental results on multiple small datasets demonstrate the effectiveness of the proposed Gaussian mixture mask for boosting ViTs essentially for free (almost zero additional parameter or computation cost). Our code is publicly available at https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention.
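The abstract describes the mechanism only at a high level, so the following is a minimal PyTorch sketch of one plausible realization: a mixture of isotropic Gaussians over patch-grid distances, added as a bias to the pre-softmax attention scores. The class name GaussianMixtureMask, the number of components num_gaussians, the additive application, and the omission of the class token are all assumptions rather than details confirmed by the paper; only the two-learnable-parameters-per-mask budget (here, an amplitude and a bandwidth per Gaussian) comes from the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianMixtureMask(nn.Module):
    """Hypothetical sketch of a Gaussian mixture attention mask.

    Each Gaussian is parameterized by two learnable scalars (an
    amplitude and a bandwidth), matching the abstract's claim that
    one mask has only two learnable parameters. Whether the mask is
    added to or multiplied with the attention matrix is not stated
    in the abstract; an additive pre-softmax bias is assumed here.
    """

    def __init__(self, grid_size: int, num_gaussians: int = 3):
        super().__init__()
        # Two learnable parameters per Gaussian component.
        self.amplitude = nn.Parameter(torch.ones(num_gaussians))
        self.bandwidth = nn.Parameter(
            torch.linspace(1.0, float(grid_size), num_gaussians))
        # Precompute squared distances between patch centers on the
        # grid_size x grid_size token grid (N = grid_size**2 tokens).
        ys, xs = torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size),
            indexing="ij")
        coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()
        self.register_buffer("dist2", torch.cdist(coords, coords).pow(2))

    def forward(self) -> torch.Tensor:
        # Mixture of isotropic Gaussians over token-pair distances;
        # broadcasting gives (K, N, N), then components are summed.
        sigma2 = self.bandwidth.pow(2).clamp_min(1e-6).view(-1, 1, 1)
        amp = self.amplitude.view(-1, 1, 1)
        return (amp * torch.exp(-self.dist2 / (2.0 * sigma2))).sum(dim=0)

# Illustrative use on dummy pre-softmax attention scores for an
# 8x8 patch grid (the class token, if any, is ignored here).
mask = GaussianMixtureMask(grid_size=8)
scores = torch.randn(2, 4, 64, 64)         # (batch, heads, N, N)
attn = F.softmax(scores + mask(), dim=-1)  # locality-biased attention
```

Under these assumptions the mask costs 2K scalars per attention layer (K Gaussians), versus the N-squared entries of an element-wise learnable mask, which is exactly the parameter overhead the abstract argues against.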


