S^2-MLP: Spatial-Shift MLP Architecture for Vision

06/14/2021
by Tan Yu, et al.

Recently, the Vision Transformer (ViT) and its follow-up works abandon convolution and exploit the self-attention operation, attaining accuracy comparable to or even higher than CNNs. More recently, MLP-Mixer abandons both convolution and self-attention, proposing an architecture containing only MLP layers. To achieve cross-patch communication, it devises an additional token-mixing MLP besides the channel-mixing MLP. MLP-Mixer achieves promising results when trained on an extremely large-scale dataset, but it cannot match its CNN and ViT counterparts when trained on medium-scale datasets such as ImageNet-1K and ImageNet-21K. This performance drop motivates us to rethink the token-mixing MLP. We discover that the token-mixing operation in MLP-Mixer is a variant of depthwise convolution with a global receptive field and a spatial-specific configuration, and that these two properties make the token-mixing MLP prone to over-fitting. In this paper, we propose a novel pure-MLP architecture, spatial-shift MLP (S^2-MLP). Different from MLP-Mixer, our S^2-MLP contains only channel-mixing MLPs. We devise a spatial-shift operation to achieve communication between patches. It has a local receptive field and is spatial-agnostic; meanwhile, it is parameter-free and computationally efficient. The proposed S^2-MLP attains higher recognition accuracy than MLP-Mixer when trained on the ImageNet-1K dataset, and matches the performance of ViT on ImageNet-1K with a considerably simpler architecture and fewer FLOPs and parameters.
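
The spatial-shift operation can be written in a few lines. Below is a minimal sketch in PyTorch, following the abstract's description: the channels of the patch grid are split into four groups, and each group is shifted by one patch along one of the four directions, so every patch mixes features with its neighbors. The group-to-direction assignment and the border handling (border patches keep their original features) are assumptions here, since the abstract does not spell them out.

import torch

def spatial_shift(x: torch.Tensor) -> torch.Tensor:
    """Parameter-free spatial shift across patch embeddings.

    x: (B, H, W, C) grid of patch features. Channels are split into
    four equal groups; each group is shifted by one patch along one of
    the four directions. Border patches keep their original features
    (an assumption of this sketch).
    """
    B, H, W, C = x.shape
    g = C // 4  # channels per shift group
    out = x.clone()
    out[:, 1:, :, 0 * g:1 * g] = x[:, :-1, :, 0 * g:1 * g]  # shift down
    out[:, :-1, :, 1 * g:2 * g] = x[:, 1:, :, 1 * g:2 * g]  # shift up
    out[:, :, 1:, 2 * g:3 * g] = x[:, :, :-1, 2 * g:3 * g]  # shift right
    out[:, :, :-1, 3 * g:4 * g] = x[:, :, 1:, 3 * g:4 * g]  # shift left
    return out

# Example: a 14x14 patch grid with 384-dim embeddings.
x = torch.randn(2, 14, 14, 384)
y = spatial_shift(x)
assert y.shape == x.shape  # no parameters added, shape preserved

Because the shift involves no learned weights, cross-patch communication comes for free; the channel-mixing MLP that follows then fuses the shifted groups.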


Related research:

06/28/2021  Rethinking Token-Mixing MLP for MLP-based Vision Backbone
In the past decade, we have witnessed rapid progress in the machine visi...

08/02/2021  S^2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision
Recently, MLP-based vision backbones emerge. MLP-based vision architectu...

08/25/2023  CS-Mixer: A Cross-Scale Vision MLP Model with Spatial-Channel Mixing
Despite their simpler information fusion designs compared with Vision Tr...

08/30/2022  MRL: Learning to Mix with Attention and Convolutions
In this paper, we present a new neural architectural block for the visio...

06/13/2022  MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing
Convolutional Neural Networks (CNNs) have been regarded as the go-to mod...

08/23/2022  Efficient Attention-free Video Shift Transformers
This paper tackles the problem of efficient video recognition. In this a...

03/07/2022  WaveMix: Resource-efficient Token Mixing for Images
Although certain vision transformer (ViT) and CNN architectures generali...
