SimViT: Exploring a Simple Vision Transformer with sliding windows

12/24/2021
by Gang Li, et al.

Although vision Transformers have achieved excellent performance as backbone models in many vision tasks, most of them aim to capture the global relations of all tokens in an image or a window, which disrupts the inherent spatial and local correlations between patches in the 2D structure. In this paper, we introduce a simple vision Transformer named SimViT to incorporate spatial structure and local information into vision Transformers. Specifically, we introduce Multi-head Central Self-Attention (MCSA) in place of conventional Multi-head Self-Attention to capture highly local relations, and the use of sliding windows facilitates the capture of spatial structure. Meanwhile, SimViT extracts multi-scale hierarchical features from different layers for dense prediction tasks. Extensive experiments show that SimViT is effective and efficient as a general-purpose backbone for various image processing tasks. In particular, our SimViT-Micro needs only 3.3M parameters to achieve 71.1% top-1 accuracy on the ImageNet-1k dataset, making it the smallest vision Transformer model to date. Our code will be available at https://github.com/ucasligang/SimViT.
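The abstract only describes the idea in words, so below is a minimal PyTorch sketch of what a multi-head central self-attention layer with sliding windows might look like: each query token attends only to the keys and values in its own k x k neighbourhood, so local 2D structure between patches is preserved. The class name, the 3 x 3 default window, the unfold-based neighbourhood gathering, and the zero padding at image borders are assumptions for illustration, not the authors' reference implementation (see the linked repository for that).

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadCentralSelfAttention(nn.Module):
    """Sketch of multi-head central self-attention with a sliding local window."""

    def __init__(self, dim, num_heads=4, window=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.window = window
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) patch tokens with N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim)

        # Project keys/values, then gather the k x k neighbourhood around every
        # position with a sliding window (zero padding at the borders, a simplification).
        kv = self.kv(x).reshape(B, H, W, 2 * C).permute(0, 3, 1, 2)              # (B, 2C, H, W)
        kv = F.unfold(kv, kernel_size=self.window, padding=self.window // 2)     # (B, 2C*k*k, N)
        kv = kv.reshape(B, 2, self.num_heads, self.head_dim, self.window ** 2, N)
        k, v = kv[:, 0], kv[:, 1]                                                # (B, heads, hd, k*k, N)

        q = q.permute(0, 2, 1, 3)              # (B, heads, N, hd)
        k = k.permute(0, 1, 4, 2, 3)           # (B, heads, N, hd, k*k)
        v = v.permute(0, 1, 4, 3, 2)           # (B, heads, N, k*k, hd)

        # Each query attends only to its own local window of keys.
        attn = (q.unsqueeze(3) @ k) * self.scale                                 # (B, heads, N, 1, k*k)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).squeeze(3)                                              # (B, heads, N, hd)
        out = out.permute(0, 2, 1, 3).reshape(B, N, C)
        return self.proj(out)

# Usage: 196 tokens from a 14 x 14 grid of patch embeddings.
x = torch.randn(2, 14 * 14, 64)
attn = MultiHeadCentralSelfAttention(dim=64, num_heads=4, window=3)
print(attn(x, H=14, W=14).shape)  # torch.Size([2, 196, 64])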

Related research

07/10/2021 · Local-to-Global Self-Attention in Vision Transformers
03/24/2022 · Beyond Fixation: Dynamic Window Visual Transformer
07/23/2020 · Spatially Aware Multimodal Transformers for TextVQA
05/26/2022 · Fast Vision Transformers with HiLo Attention
06/07/2021 · ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
11/26/2021 · SWAT: Spatial Structure Within and Among Tokens
09/15/2023 · Cure the headache of Transformers via Collinear Constrained Attention
