Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

09/12/2021
by Chuanxin Tang, et al.

Transformers have sprung up in the field of computer vision. In this work, we explore whether the core self-attention module in the Transformer is the key to achieving excellent performance in image recognition. To this end, we build an attention-free network called sMLPNet based on the existing MLP-based vision models. Specifically, we replace the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module. For 2D image tokens, sMLP applies a 1D MLP along each axial direction, and the parameters are shared among rows or columns. Through sparse connections and weight sharing, the sMLP module significantly reduces the number of model parameters and the computational complexity, avoiding the common over-fitting problem that plagues the performance of MLP-like models. When trained only on the ImageNet-1K dataset, the proposed sMLPNet achieves 81.9% top-1 accuracy with only 24M parameters, which is much better than most CNNs and vision Transformers under the same model size constraint. When scaled up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer. The success of sMLPNet suggests that the self-attention mechanism is not necessarily a silver bullet in computer vision. Code will be made publicly available.
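To make the axial token mixing and weight sharing concrete, the following is a minimal pure-Python sketch, not the authors' implementation. The function names, the omission of biases, and the choice to combine the identity, row, and column branches by simple summation are all assumptions made for illustration; the abstract only states that a 1D MLP is applied along each axis with parameters shared among rows or columns.

```python
# Illustrative sketch only. Function names, the lack of biases, and the
# summation of branches are assumptions, not details taken from the paper.

def mix_rows(x, w_w):
    """Mix tokens along the width axis.

    x has shape [H][W][C]; w_w is a W x W matrix shared by every row
    and every channel (weight sharing).
    """
    H, W, C = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w_w[j][k] * x[h][k][c] for k in range(W))
              for c in range(C)]
             for j in range(W)]
            for h in range(H)]


def mix_cols(x, w_h):
    """Mix tokens along the height axis with an H x H matrix shared by every column."""
    H, W, C = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w_h[i][k] * x[k][w][c] for k in range(H))
              for c in range(C)]
             for w in range(W)]
            for i in range(H)]


def smlp_token_mixing(x, w_h, w_w):
    """Combine identity, column, and row branches by summation (an assumption)."""
    rows, cols = mix_rows(x, w_w), mix_cols(x, w_h)
    H, W, C = len(x), len(x[0]), len(x[0][0])
    return [[[x[h][w][c] + rows[h][w][c] + cols[h][w][c]
              for c in range(C)]
             for w in range(W)]
            for h in range(H)]


def token_mixing_weights(H, W):
    """Weight counts (biases ignored): dense mixing over all H*W tokens vs. axial mixing."""
    dense = (H * W) ** 2      # every token connected to every other token
    axial = H * H + W * W     # one shared matrix per axis
    return dense, axial


print(token_mixing_weights(14, 14))  # (38416, 392)
```

For a 14 x 14 token grid, a dense token-mixing layer would need 196 x 196 = 38,416 weights, whereas the two shared axial matrices need only 196 + 196 = 392, which illustrates why sparse connections plus weight sharing shrink the token-mixing step so sharply.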
