MixFormer: Mixing Features across Windows and Dimensions

04/06/2022
by   Qiang Chen, et al.

While local-window self-attention performs notably well in vision tasks, it suffers from a limited receptive field and weak modeling capability. This is mainly because it performs self-attention within non-overlapped windows and shares weights across the channel dimension. We propose MixFormer to address these issues. First, we combine local-window self-attention with depth-wise convolution in a parallel design, modeling cross-window connections to enlarge the receptive field. Second, we propose bi-directional interactions across the two branches to provide complementary clues in the channel and spatial dimensions. The two designs are integrated to achieve efficient feature mixing among windows and dimensions. MixFormer achieves image-classification results competitive with EfficientNet and better than RegNet and Swin Transformer, and it outperforms its alternatives by significant margins, at lower computational cost, on five dense prediction tasks across MS COCO, ADE20K, and LVIS. Code is available at <https://github.com/PaddlePaddle/PaddleClas>.
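
The parallel design can be sketched compactly. Below is a minimal, hypothetical PyTorch rendering of the idea described above: a depth-wise convolution branch runs alongside windowed self-attention, the convolution branch gates the attention input along the channel dimension, and the attention branch gates the convolution output spatially. The class name `MixingBlock`, the helper functions, and the exact gating placement are illustrative assumptions, not the official PaddleClas implementation.

```python
import torch
import torch.nn as nn


def window_partition(x, ws):
    """(B, H, W, C) -> (num_windows * B, ws * ws, C); assumes H, W divisible by ws."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


def window_reverse(windows, ws, H, W):
    """Inverse of window_partition: (num_windows * B, ws * ws, C) -> (B, H, W, C)."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.reshape(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class MixingBlock(nn.Module):
    """Hypothetical sketch of the parallel design: local-window attention
    alongside depth-wise convolution, with bi-directional interactions."""

    def __init__(self, dim, num_heads=4, window_size=7):
        super().__init__()
        self.ws = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        # Channel interaction (conv branch -> attention branch): an SE-style
        # gate that breaks the attention branch's weight sharing over channels.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        # Spatial interaction (attention branch -> conv branch): a per-pixel gate.
        self.spatial_gate = nn.Sequential(nn.Conv2d(dim, 1, 1), nn.Sigmoid())

    def forward(self, x):  # x: (B, C, H, W), H and W divisible by window_size
        conv_out = self.dwconv(x)

        # Attention branch, modulated along channels by the conv branch.
        attn_in = x * self.channel_gate(conv_out)
        wins = window_partition(attn_in.permute(0, 2, 3, 1), self.ws)
        wins, _ = self.attn(wins, wins, wins)
        attn_out = window_reverse(wins, self.ws, x.shape[2], x.shape[3])
        attn_out = attn_out.permute(0, 3, 1, 2)

        # Conv branch, modulated spatially by the attention branch; the conv
        # path also carries cross-window connections between neighboring windows.
        conv_out = conv_out * self.spatial_gate(attn_out)
        return attn_out + conv_out
```

As a quick shape check under these assumptions, `MixingBlock(dim=96)(torch.randn(2, 96, 14, 14))` returns a tensor of the same `(2, 96, 14, 14)` shape, since both branches preserve spatial resolution and channel count.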


Related research

07/01/2021 · CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
We present CSWin Transformer, an efficient and effective Transformer-bas...

04/13/2023 · RSIR Transformer: Hierarchical Vision Transformer using Random Sampling Windows and Important Region Windows
Recently, Transformers have shown promising performance in various visio...

06/07/2021 · Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
Very recently, Window-based Transformers, which computed self-attention ...

06/23/2023 · Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window
Transformer models have shown great potential in computer vision, follow...

03/22/2023 · Spherical Transformer for LiDAR-based 3D Recognition
LiDAR-based 3D point cloud recognition has benefited various application...

02/14/2022 · Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs
Token-mixing multi-layer perceptron (MLP) models have shown competitive ...

03/21/2023 · The Multiscale Surface Vision Transformer
Surface meshes are a favoured domain for representing structural and fun...
