Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs

02/14/2022
by Huangjie Zheng, et al.

Token-mixing multi-layer perceptron (MLP) models have shown competitive performance in computer vision tasks with a simple architecture and relatively small computational cost. Their success in maintaining computational efficiency is mainly attributed to avoiding self-attention, which is often computationally heavy, yet this comes at the expense of not being able to mix tokens both globally and locally. In this paper, to exploit both global and local dependencies without self-attention, we present Mix-Shift-MLP (MS-MLP), which makes the size of the local receptive field used for mixing increase with respect to the amount of spatial shifting. In addition to conventional mixing and shifting techniques, MS-MLP mixes both neighboring and distant tokens from fine- to coarse-grained levels and then gathers them via a shifting operation. This directly contributes to the interactions between global and local tokens. Being simple to implement, MS-MLP achieves competitive performance on multiple vision benchmarks. For example, an MS-MLP with 85 million parameters achieves 83.8% top-1 classification accuracy on ImageNet-1K. When combining the mixing and shifting operation with state-of-the-art Vision Transformers such as the Swin Transformer, we show that MS-MLP achieves further improvements on three different model scales, e.g., by 0.5% top-1 accuracy. The code is available at: https://github.com/JegZheng/MS-MLP.
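
To make the mix-and-shift idea above concrete, below is a minimal PyTorch sketch of a token-mixing step in which channel groups are shifted by progressively larger spatial offsets and then mixed across channels, so the effective receptive field grows with the amount of shifting. This is an illustrative approximation under stated assumptions, not the authors' implementation (see the linked repository for the official code); the class name MixShiftSketch and the parameters num_groups and shift_step are hypothetical.

# Minimal sketch of a mix-then-shift token-mixing block (illustrative only).
# Channel groups receive progressively larger spatial shifts before channel
# mixing, so distant tokens contribute to coarser-grained groups.
import torch
import torch.nn as nn


class MixShiftSketch(nn.Module):
    def __init__(self, dim: int, num_groups: int = 4, shift_step: int = 1):
        super().__init__()
        assert dim % num_groups == 0
        self.num_groups = num_groups
        self.shift_step = shift_step
        # Channel-mixing projections before and after the shift (1x1 convs).
        self.proj_in = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) feature map of tokens.
        x = self.proj_in(x)
        groups = torch.chunk(x, self.num_groups, dim=1)
        shifted = []
        for i, g in enumerate(groups):
            # Larger group index -> larger spatial shift -> more distant
            # context gathered at the current token position.
            offset = i * self.shift_step
            shifted.append(torch.roll(g, shifts=(offset, offset), dims=(2, 3)))
        x = torch.cat(shifted, dim=1)
        return self.proj_out(x)


if __name__ == "__main__":
    block = MixShiftSketch(dim=64, num_groups=4, shift_step=2)
    tokens = torch.randn(1, 64, 14, 14)
    print(block(tokens).shape)  # torch.Size([1, 64, 14, 14])

In this sketch the group with zero offset keeps purely local mixing, while the most-shifted group pulls in tokens several positions away, which is one simple way to realize the paper's fine- to coarse-grained mixing before the channel projection gathers the groups together.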

Related research

11/30/2021 · Shunted Self-Attention via Multi-Scale Token Aggregation
Recent Vision Transformer (ViT) models have demonstrated encouraging res...

08/23/2023 · SG-Former: Self-guided Transformer with Evolving Token Reallocation
Vision Transformer has demonstrated impressive success across various vi...

07/18/2021 · AS-MLP: An Axial Shifted MLP Architecture for Vision
An Axial Shifted MLP architecture (AS-MLP) is proposed in this paper. Di...

04/07/2022 · DaViT: Dual Attention Vision Transformers
In this work, we introduce Dual Attention Vision Transformers (DaViT), a...

04/06/2022 · MixFormer: Mixing Features across Windows and Dimensions
While local-window self-attention performs notably in vision tasks, it s...

08/25/2023 · CS-Mixer: A Cross-Scale Vision MLP Model with Spatial-Channel Mixing
Despite their simpler information fusion designs compared with Vision Tr...

11/26/2021 · SWAT: Spatial Structure Within and Among Tokens
Modeling visual data as tokens (i.e., image patches), and applying atten...
