PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift

04/07/2023
by Gaojie Wu, et al.

Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependencies. However, ViT requires a large amount of computing resources to compute the global self-attention. In this work, we propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone that requires fewer computing resources (e.g. a relatively small number of parameters and FLOPs), termed Progressive Shift Ladder Transformer (PSLT). First, the ladder self-attention block reduces the computational cost by modelling local self-attention in each branch. Meanwhile, the progressive shift mechanism is proposed to enlarge the receptive field of the ladder self-attention block by modelling diverse local self-attention for each branch and enabling interaction among these branches. Second, the input feature of the ladder self-attention block is split equally along the channel dimension among the branches, which considerably reduces the computational cost of the block (to nearly 1/3 of the parameters and FLOPs), and the outputs of these branches are then combined by pixel-adaptive fusion. As a result, the ladder self-attention block is capable of modelling long-range interactions with a relatively small number of parameters and FLOPs. Based on the ladder self-attention block, PSLT performs well on several vision tasks, including image classification, object detection and person re-identification. On the ImageNet-1k dataset, PSLT achieves a top-1 accuracy of 79.9%, which is comparable to that of several existing models with more than 20M parameters and 4G FLOPs. Code is available at https://isee-ai.cn/wugaojie/PSLT.html.
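To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch of the ladder self-attention idea: channels split equally across branches, windowed self-attention per branch, a progressive shift that grows with the branch index, sequential branch-to-branch interaction, and per-pixel adaptive fusion. This is an illustration, not the authors' implementation; the class name, the shift schedule, the use of nn.MultiheadAttention for windowed attention, and the 1x1-conv fusion are all assumptions (see the released code at the URL above for the exact design).

```python
import torch
import torch.nn as nn


class LadderSelfAttention(nn.Module):
    """Sketch of a ladder self-attention block: channels are split across
    branches, each branch runs windowed self-attention on a progressively
    shifted feature map and receives the previous branch's output, and the
    branch outputs are fused with per-pixel adaptive weights."""

    def __init__(self, dim, num_branches=3, window=7, heads=1):
        super().__init__()
        assert dim % num_branches == 0
        self.num_branches = num_branches
        self.window = window
        branch_dim = dim // num_branches
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(branch_dim, heads, batch_first=True)
            for _ in range(num_branches)
        )
        # Pixel-adaptive fusion (illustrative guess): a 1x1 conv predicts a
        # per-pixel weight for each branch, applied before re-concatenation.
        self.fuse = nn.Conv2d(dim, num_branches, kernel_size=1)

    def _window_attention(self, x, attn):
        # Self-attention within non-overlapping windows; assumes H and W
        # are divisible by the window size, for brevity.
        B, c, H, W = x.shape
        w = self.window
        x = x.reshape(B, c, H // w, w, W // w, w)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, c)
        y, _ = attn(x, x, x)
        y = y.reshape(B, H // w, W // w, w, w, c)
        return y.permute(0, 5, 1, 3, 2, 4).reshape(B, c, H, W)

    def forward(self, x):  # x: (B, C, H, W)
        parts = x.chunk(self.num_branches, dim=1)
        outs, prev = [], None
        for i, (xi, attn) in enumerate(zip(parts, self.attn)):
            if prev is not None:
                # Branch interaction: feed the previous branch's output in.
                xi = xi + prev
            # Progressive shift (assumed schedule): each branch attends over
            # windows displaced further, enlarging the joint receptive field.
            s = i * self.window // self.num_branches
            xi = torch.roll(xi, shifts=(-s, -s), dims=(2, 3))
            yi = self._window_attention(xi, attn)
            prev = torch.roll(yi, shifts=(s, s), dims=(2, 3))
            outs.append(prev)
        y = torch.cat(outs, dim=1)       # (B, C, H, W)
        w = self.fuse(y).softmax(dim=1)  # (B, num_branches, H, W)
        return torch.cat(
            [o * w[:, i:i + 1] for i, o in enumerate(outs)], dim=1
        )


# Usage: feature-map sizes here are arbitrary; output shape matches input.
x = torch.randn(2, 96, 28, 28)
block = LadderSelfAttention(dim=96, num_branches=3, window=7)
print(block(x).shape)  # torch.Size([2, 96, 28, 28])
```

Because each branch attends only within local windows over one third of the channels, its attention cost is roughly a third of a full-channel block's, which mirrors the parameter/FLOP reduction claimed in the abstract.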


