Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight

06/08/2021
by Qi Han, et al.

Vision Transformer (ViT) attains state-of-the-art performance in visual recognition, and the variant Local Vision Transformer makes further improvements. The major component in Local Vision Transformer, local attention, performs attention separately over small local windows. We rephrase local attention as a channel-wise locally-connected layer and analyze it from two network regularization forms, sparse connectivity and weight sharing, as well as weight computation. Sparse connectivity: there is no connection across channels, and each position is connected only to the positions within a small local window. Weight sharing: the connection weights for one position are shared across channels or within each group of channels. Dynamic weight: the connection weights are dynamically predicted from each image instance. We point out that local attention resembles depth-wise convolution and its dynamic version in sparse connectivity. The main difference lies in weight sharing: depth-wise convolution shares connection weights (kernel weights) across spatial positions, whereas local attention shares them across channels. We empirically observe that models based on depth-wise convolution and its dynamic variant, with lower computation complexity, perform on par with or sometimes slightly better than Swin Transformer, an instance of Local Vision Transformer, for ImageNet classification, COCO object detection, and ADE20K semantic segmentation. These observations suggest that Local Vision Transformer takes advantage of the two regularization forms and dynamic weight to increase network capacity.
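To make the comparison concrete, the following PyTorch sketch (not the authors' released code) contrasts the three weight regimes the abstract discusses: static depth-wise convolution, a dynamic depth-wise variant whose kernel is predicted per image, and single-head local window attention. The names local_window_attention and DynamicDepthwiseConv are illustrative; Q/K/V projections, multiple heads, and relative position bias are omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F


def local_window_attention(x, window=7):
    """Single-head local attention over non-overlapping windows.

    Sparse connectivity: each position attends only within its window.
    Weight sharing: one attention map is shared by all channels.
    Dynamic weight: the attention map is computed from the input itself.
    """
    B, C, H, W = x.shape  # assumes H and W are divisible by `window`
    x = x.unfold(2, window, window).unfold(3, window, window)  # B,C,H/w,W/w,w,w
    x = x.permute(0, 2, 3, 4, 5, 1).reshape(-1, window * window, C)
    attn = torch.softmax(x @ x.transpose(1, 2) / C ** 0.5, dim=-1)
    return attn @ x  # (B * num_windows, window * window, C)


# Static counterpart: depth-wise convolution. Same sparse connectivity
# (one small window per output, no cross-channel connections), but the
# kernel weights are shared across spatial positions and fixed per model.
depthwise = nn.Conv2d(64, 64, kernel_size=7, padding=3, groups=64)


class DynamicDepthwiseConv(nn.Module):
    """Depth-wise convolution whose kernel is predicted per image instance:
    weights are still shared across spatial positions, but conditioned on
    the input via global average pooling (a dynamic-weight sketch)."""

    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.k = kernel_size
        self.predict = nn.Linear(channels, channels * kernel_size ** 2)

    def forward(self, x):
        B, C, H, W = x.shape
        context = x.mean(dim=(2, 3))                  # (B, C) global context
        kernels = self.predict(context).view(B * C, 1, self.k, self.k)
        x = x.reshape(1, B * C, H, W)                 # fold batch into groups
        out = F.conv2d(x, kernels, padding=self.k // 2, groups=B * C)
        return out.view(B, C, H, W)


x = torch.randn(2, 64, 56, 56)
print(local_window_attention(x).shape)    # torch.Size([128, 49, 64])
print(DynamicDepthwiseConv(64)(x).shape)  # torch.Size([2, 64, 56, 56])

All three operators connect each output position only to a small window and keep channels separate; they differ only in whether the window weights are fixed (depth-wise convolution), predicted per image (dynamic variant), or computed per position from the features themselves (local attention).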


Related research

06/07/2021 · Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
Very recently, Window-based Transformers, which computed self-attention ...

05/28/2021 · An Attention Free Transformer
We introduce Attention Free Transformer (AFT), an efficient variant of T...

04/13/2023 · Dynamic Mobile-Former: Strengthening Dynamic Convolution with Attention and Residual Connection in Kernel Space
We introduce Dynamic Mobile-Former (DMF), maximizes the capabilities of d...

06/23/2023 · Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window
Transformer models have shown great potential in computer vision, follow...

03/08/2022 · Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention
Recently, Transformers have shown promising performance in various visio...

03/31/2023 · Rethinking Local Perception in Lightweight Vision Transformer
Vision Transformers (ViTs) have been shown to be effective in various vi...

10/17/2022 · Deformably-Scaled Transposed Convolution
Transposed convolution is crucial for generating high-resolution outputs...
