MLP-Mixer: An all-MLP Architecture for Vision

05/04/2021 ∙ by Ilya Tolstikhin, et al.

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well-established CNNs and Transformers.
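The two layer types described in the abstract can be sketched in a few lines. The following is an illustrative NumPy sketch of a single Mixer block, with made-up dimensions and initialization; it is not the paper's code, and names like `mixer_block` and the widths `DS`/`DC` are assumptions for the example only.

```python
import numpy as np

# Hypothetical dimensions for illustration (not the paper's exact configs):
NUM_PATCHES, CHANNELS = 16, 32   # S patches per image, C channels per patch
DS, DC = 64, 128                 # hidden widths of the token-/channel-mixing MLPs

rng = np.random.default_rng(0)

def init_mlp(d_in, d_hidden):
    """Weights for a two-layer MLP mapping d_in -> d_hidden -> d_in."""
    return (rng.normal(0, 0.02, (d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(0, 0.02, (d_hidden, d_in)), np.zeros(d_in))

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron applied along the last axis of x.
    return gelu(x @ w1 + b1) @ w2 + b2

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def mixer_block(x, token_params, channel_params):
    """One Mixer layer on a (patches, channels) table of patch embeddings."""
    # Token-mixing MLP: transpose so the MLP acts across patches,
    # i.e. it "mixes" spatial information separately at each channel.
    x = x + mlp(layer_norm(x).T, *token_params).T
    # Channel-mixing MLP: acts across channels independently at each patch,
    # i.e. it "mixes" the per-location features.
    x = x + mlp(layer_norm(x), *channel_params)
    return x

tokens = rng.normal(size=(NUM_PATCHES, CHANNELS))
out = mixer_block(tokens, init_mlp(NUM_PATCHES, DS), init_mlp(CHANNELS, DC))
print(out.shape)  # (16, 32)
```

Both sub-layers preserve the `(patches, channels)` shape, so blocks can be stacked to any depth; the transpose is the only thing that distinguishes token mixing from channel mixing.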


Code Repositories

do-you-even-need-attention: Exploring whether attention is necessary for vision transformers

mlp-mixer-pytorch: An all-MLP solution for vision, from Google AI

MLP-Mixer-pytorch: Unofficial implementation of "MLP-Mixer: An all-MLP Architecture for Vision"

mlp-mixer-pytorch: PyTorch implementation of "MLP-Mixer: An all-MLP Architecture for Vision", Tolstikhin et al. (2021)

mlp_mixer.pytorch: PyTorch implementation of "MLP-Mixer: An all-MLP Architecture for Vision"