Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer

06/07/2021
by Zilong Huang, et al.

Very recently, window-based Transformers, which compute self-attention within non-overlapping local windows, have demonstrated promising results on image classification, semantic segmentation, and object detection. However, less attention has been paid to cross-window connections, which are key to improving representation ability. In this work, we revisit spatial shuffle as an efficient way to build connections among windows. Based on it, we propose a new vision transformer, named Shuffle Transformer, which is highly efficient and easy to implement by modifying two lines of code. Furthermore, depth-wise convolution is introduced to complement spatial shuffle and strengthen neighbor-window connections. The proposed architectures achieve excellent performance on a wide range of visual tasks including image-level classification, object detection, and semantic segmentation. Code will be released for reproduction.

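As a rough sketch of how such a shuffle can be realized, the snippet below contrasts a standard window partition with a spatially shuffled one obtained by changing only the view/permute pattern, in the spirit of the abstract's claim of a two-line modification. The PyTorch code, the (B, H, W, C) tensor layout, and the function names are illustrative assumptions, not the authors' released implementation.

```python
import torch

def window_partition(x, ws):
    """Standard (Swin-style) partition: each window is a contiguous ws x ws patch."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def shuffled_window_partition(x, ws):
    """Spatially shuffled partition: tokens inside one window are H//ws (and W//ws)
    positions apart, so a single window spans the whole feature map.
    Only the two view/permute lines differ from window_partition."""
    B, H, W, C = x.shape
    x = x.view(B, ws, H // ws, ws, W // ws, C)                    # changed line 1
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, ws * ws, C)    # changed line 2

if __name__ == "__main__":
    # Toy 8x8 single-channel map whose values encode their own position.
    x = torch.arange(8 * 8, dtype=torch.float32).reshape(1, 8, 8, 1)
    print(window_partition(x, ws=4)[0, :, 0])           # contiguous top-left 4x4 patch
    print(shuffled_window_partition(x, ws=4)[0, :, 0])  # tokens sampled with stride 2
```

In the toy example, each standard window covers one contiguous 4x4 patch, whereas each shuffled window gathers tokens spread across the whole 8x8 map, so window attention over shuffled windows mixes information between distant windows; per the abstract, depth-wise convolution is added separately to preserve neighbor-window connections.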

Related research:

11/25/2022 - Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations
The formidable accomplishment of Transformers in natural language proces...

04/28/2021 - Twins: Revisiting Spatial Attention Design in Vision Transformers
Very recently, a variety of vision transformer architectures for dense p...

06/08/2021 - Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight
Vision Transformer (ViT) attains state-of-the-art performance in visual ...

06/23/2023 - Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window
Transformer models have shown great potential in computer vision, follow...

08/24/2022 - gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window
Following the success in language domain, the self-attention mechanism (...

07/06/2023 - Art Authentication with Vision Transformers
In recent years, Transformers, initially developed for language, have be...

04/06/2022 - MixFormer: Mixing Features across Windows and Dimensions
While local-window self-attention performs notably in vision tasks, it s...
