Strip-MLP: Efficient Token Interaction for Vision MLP

07/21/2023
by Guiping Cao, et al.

Token interaction is one of the core operations in MLP-based models for exchanging and aggregating information between different spatial locations. However, the power of token interaction along the spatial dimension depends heavily on the spatial resolution of the feature maps, which limits the model's expressive ability, especially in deep layers where the features are down-sampled to a small spatial size. To address this issue, we present a novel method called Strip-MLP to enrich token interaction power in three ways. First, we introduce a new MLP paradigm called the Strip MLP layer that allows tokens to interact with other tokens in a cross-strip manner, enabling the tokens in a row (or column) to contribute to information aggregation in adjacent but different strips of rows (or columns). Second, we propose a Cascade Group Strip Mixing Module (CGSMM) to overcome the performance degradation caused by small spatial feature sizes. The module allows tokens to interact more effectively in within-patch and cross-patch manners, independent of the feature's spatial size. Finally, based on the Strip MLP layer, we propose a novel Local Strip Mixing Module (LSMM) to boost token interaction power in local regions. Extensive experiments demonstrate that Strip-MLP significantly improves the performance of MLP-based models on small datasets and obtains comparable or even better results on ImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44% on Caltech-101 and +2.16% on CIFAR-100. The source code is available at https://github.com/Med-Process/Strip_MLP.
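To make the cross-strip idea concrete, the following is a minimal PyTorch sketch of a row-wise strip-mixing layer, based only on the description in the abstract rather than the authors' implementation (see the linked repository for that). The class name StripMLPLayer, the strip-size parameter, and the choice to mix along the width axis are assumptions for illustration.

```python
import torch
import torch.nn as nn


class StripMLPLayer(nn.Module):
    """Hypothetical sketch of cross-strip token mixing (not the official code).

    Tokens are mixed along the width axis, but the linear projection spans a
    small strip of adjacent rows, so each row contributes to the information
    aggregation of its neighbouring rows as well.
    """

    def __init__(self, width: int, strip: int = 2):
        super().__init__()
        self.strip = strip
        # One linear layer jointly mixes the strip * width tokens of a strip.
        self.mix = nn.Linear(strip * width, strip * width)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); H must be divisible by the strip size.
        B, C, H, W = x.shape
        s = self.strip
        assert H % s == 0, "feature height must be divisible by the strip size"
        x = x.reshape(B, C, H // s, s * W)  # group s adjacent rows into one strip
        x = self.mix(x)                     # cross-strip token mixing
        return x.reshape(B, C, H, W)


# Usage: mix tokens of a 14x14 feature map in strips of 2 rows.
y = StripMLPLayer(width=14, strip=2)(torch.randn(2, 64, 14, 14))
print(y.shape)  # torch.Size([2, 64, 14, 14])
```

A column-wise variant would follow by transposing the height and width axes before and after the layer; the paper applies the idea along both rows and columns.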


Related research

04/21/2023 · Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers
04/07/2022 · DaViT: Dual Attention Vision Transformers
07/04/2022 · Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks
03/02/2023 · Token Contrast for Weakly-Supervised Semantic Segmentation
09/05/2023 · Extract-and-Adaptation Network for 3D Interacting Hand Mesh Recovery
12/10/2022 · Position Embedding Needs an Independent Layer Normalization
07/24/2023 · Less is More: Focus Attention for Efficient DETR
