Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

12/28/2021
by Sitong Wu, et al.

Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computation complexity caused by global self-attention, various methods constrain the range of attention to a local region to improve efficiency. Consequently, their receptive fields in a single attention layer are not large enough, resulting in insufficient context modeling. To address this issue, we propose Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region. Compared to global self-attention, PS-Attention reduces the computation and memory costs significantly. Meanwhile, it captures richer contextual information at a computation complexity similar to that of previous local self-attention mechanisms. Based on PS-Attention, we develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with model sizes of 22M, 48M, and 85M, respectively, on 224x224 ImageNet-1K classification, outperforming previous Vision Transformer backbones. For downstream tasks, our Pale Transformer backbone outperforms the recent state-of-the-art CSWin Transformer by a large margin on ADE20K semantic segmentation and COCO object detection and instance segmentation. The code will be released at https://github.com/BR-IDL/PaddleViT.
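To make the pale-shaped region concrete, here is a minimal, single-head NumPy sketch of PS-Attention as masked attention. It assumes each token's pale is the union of s interlaced rows and s interlaced columns, spaced H/s and W/s apart; the names (pale_attention_mask, ps_attention, s) are illustrative, and the dense-mask formulation is a simplification of the paper's method, which computes row-wise and column-wise attention in parallel rather than materializing a full mask.

```python
import numpy as np

def pale_attention_mask(H, W, s):
    # Assumes H and W are divisible by s.
    # Interlaced grouping: rows i and k fall in the same pale when
    # i % (H // s) == k % (H // s), i.e. the s rows of a pale are spaced
    # H // s apart (columns are grouped analogously).
    rows, cols = np.arange(H), np.arange(W)
    same_rows = (rows[:, None] % (H // s)) == (rows[None, :] % (H // s))  # (H, H)
    same_cols = (cols[:, None] % (W // s)) == (cols[None, :] % (W // s))  # (W, W)
    # Token (i, j) attends to token (k, l) if k is one of its pale rows
    # OR l is one of its pale columns: the union gives the pale shape.
    mask = same_rows[:, None, :, None] | same_cols[None, :, None, :]      # (H, W, H, W)
    return mask.reshape(H * W, H * W)

def ps_attention(x, s=2):
    # x: (H, W, C). Single head with identity Q/K/V projections for brevity.
    H, W, C = x.shape
    q = k = v = x.reshape(H * W, C)
    logits = q @ k.T / np.sqrt(C)
    logits = np.where(pale_attention_mask(H, W, s), logits, -np.inf)
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ v).reshape(H, W, C)

out = ps_attention(np.random.randn(8, 8, 16), s=2)
print(out.shape)  # (8, 8, 16)
```

As written, the dense mask keeps the O((HW)^2) cost of global attention and only illustrates the receptive-field shape. The efficiency claim comes from restricting each token's computation to the roughly s(H + W) tokens inside its pale, which is how a pale covers entire rows and columns at a cost comparable to window-based local attention.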

Related research

07/01/2021 · CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
We present CSWin Transformer, an efficient and effective Transformer-bas...

06/20/2022 · Global Context Vision Transformers
We propose global context vision transformer (GC ViT), a novel architect...

09/29/2022 · Dilated Neighborhood Attention Transformer
Transformers are quickly becoming one of the most heavily applied deep l...

05/02/2023 · AxWin Transformer: A Context-Aware Vision Transformer Backbone with Axial Windows
Recently Transformer has shown good performance in several vision tasks ...

03/08/2022 · Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention
Recently, Transformers have shown promising performance in various visio...

06/22/2021 · P2T: Pyramid Pooling Transformer for Scene Understanding
This paper jointly resolves two problems in vision transformer: i) the c...

05/26/2022 · Green Hierarchical Vision Transformer for Masked Image Modeling
We present an efficient approach for Masked Image Modeling (MIM) with hi...
