S^2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision

08/02/2021
by   Tan Yu, et al.

Recently, MLP-based vision backbones have emerged. Carrying less inductive bias, MLP-based vision architectures achieve competitive performance in image recognition compared with CNNs and vision Transformers. Among them, the spatial-shift MLP (S^2-MLP), which adopts a straightforward spatial-shift operation, achieves better performance than pioneering works including MLP-Mixer and ResMLP. More recently, using smaller patches with a pyramid structure, Vision Permutator (ViP) and Global Filter Network (GFNet) achieve better performance than S^2-MLP. In this paper, we improve the S^2-MLP vision backbone. We expand the feature map along the channel dimension and split the expanded feature map into several parts, conducting a different spatial-shift operation on each part. Meanwhile, we exploit the split-attention operation to fuse these split parts. Moreover, like its counterparts, the improved backbone adopts smaller-scale patches and a pyramid structure to boost image recognition accuracy. We term the improved spatial-shift MLP vision backbone S^2-MLPv2. Using 55M parameters, our medium-scale model, S^2-MLPv2-Medium, achieves 83.6% top-1 accuracy on the ImageNet-1K benchmark with 224×224 images, without self-attention and without external training data.
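To make the expand-split-shift-fuse pipeline concrete, here is a minimal PyTorch sketch of a single mixing step as described in the abstract. The shift-direction orders, the 3× channel expansion, the GELU split-attention MLP, and the names (spatial_shift, SplitAttention) are illustrative assumptions based on the abstract, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def spatial_shift(x, dirs):
    """Shift len(dirs) channel groups of x (B, H, W, C) by one pixel each,
    in the given directions; border rows/columns keep their original values."""
    B, H, W, C = x.shape
    g = C // len(dirs)
    out = x.clone()
    for i, d in enumerate(dirs):
        c0, c1 = i * g, (i + 1) * g
        if d == "down":
            out[:, 1:, :, c0:c1] = x[:, :-1, :, c0:c1]
        elif d == "up":
            out[:, :-1, :, c0:c1] = x[:, 1:, :, c0:c1]
        elif d == "right":
            out[:, :, 1:, c0:c1] = x[:, :, :-1, c0:c1]
        elif d == "left":
            out[:, :, :-1, c0:c1] = x[:, :, 1:, c0:c1]
    return out

class SplitAttention(nn.Module):
    """Fuse k split parts: pool a shared global descriptor, map it to
    per-part channel weights, and softmax over the parts (assumed design)."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels, bias=False)
        self.fc2 = nn.Linear(channels, channels * k)
        self.act = nn.GELU()

    def forward(self, parts):                         # parts: (B, k, N, C)
        B, k, N, C = parts.shape
        ctx = parts.sum(dim=1).mean(dim=1)            # (B, C) global context
        attn = self.fc2(self.act(self.fc1(ctx)))      # (B, k*C)
        attn = attn.reshape(B, k, 1, C).softmax(dim=1)
        return (parts * attn).sum(dim=1)              # (B, N, C)

# One mixing step on a toy feature map.
B, H, W, C = 2, 14, 14, 64
x = torch.randn(B, H, W, C)
expand, proj = nn.Linear(C, 3 * C), nn.Linear(C, C)
x1, x2, x3 = expand(x).chunk(3, dim=-1)               # three C-dim parts
x1 = spatial_shift(x1, ("down", "up", "right", "left"))
x2 = spatial_shift(x2, ("right", "left", "down", "up"))
parts = torch.stack([x1, x2, x3], dim=1).reshape(B, 3, H * W, C)
y = proj(SplitAttention(C, k=3)(parts)).reshape(B, H, W, C)
print(y.shape)  # torch.Size([2, 14, 14, 64])
```

The softmax over the k parts lets the network reweight the differently shifted branches per channel before the output projection, which is what distinguishes this fusion from a plain sum or concatenation.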

