RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?

08/09/2021
by Yuki Tatsunami, et al.

For the past ten years, CNNs have reigned supreme in computer vision, but recently the Transformer has been on the rise. However, the quadratic computational cost of self-attention has become a severe practical problem. In this context, there has been much research on architectures that use neither CNNs nor self-attention. In particular, MLP-Mixer is a simple architecture built only from MLPs, yet it achieves accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the token embedding. There is thus still room to build a non-convolutional inductive bias into the architecture itself, and we do so using two simple ideas. One is to divide the token-mixing block into vertical and horizontal mixing. The other is to make spatial correlations denser among some channels during token mixing. With this approach, we improve the accuracy of MLP-Mixer while reducing its parameter count and computational complexity. Compared to other MLP-based models, the proposed model, named RaftMLP, strikes a good balance among computational complexity, number of parameters, and actual memory usage. In addition, our work indicates that MLP-based models have the potential to replace CNNs by adopting an inductive bias. The source code, in PyTorch, is available at <https://github.com/okojoalg/raft-mlp>.
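To make the first idea concrete, here is a minimal NumPy sketch of vertical-then-horizontal token mixing as the abstract describes it. This is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, the MLP uses ReLU with no expansion factor (the real model uses GELU, expansion, normalization, and residual connections), and the channel-grouping idea is omitted.

```python
import numpy as np

def mix(tokens, w1, w2):
    # Hypothetical 2-layer token-mixing MLP along the last axis.
    # ReLU stands in for GELU purely for brevity.
    hidden = np.maximum(tokens @ w1, 0.0)
    return hidden @ w2

def separable_token_mixing(x, wv1, wv2, wh1, wh2):
    """Sketch of vertical-then-horizontal token mixing.

    x: (H, W, C) grid of patch embeddings. Instead of one MLP over all
    H*W tokens (as in MLP-Mixer), we mix each column of H tokens, then
    each row of W tokens.
    """
    # Vertical mixing: move H to the last axis -> (W, C, H), mix, restore.
    v = np.transpose(x, (1, 2, 0))
    v = mix(v, wv1, wv2)
    x = np.transpose(v, (2, 0, 1))
    # Horizontal mixing: move W to the last axis -> (H, C, W), mix, restore.
    h = np.transpose(x, (0, 2, 1))
    h = mix(h, wh1, wh2)
    return np.transpose(h, (0, 2, 1))
```

This separation is where the parameter savings come from: mixing a 14x14 grid jointly needs weight matrices over 196 tokens (on the order of 196^2 weights per layer), while vertical and horizontal mixing only need matrices over 14 tokens each (on the order of 14^2 per layer).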

