MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video

11/24/2021
by   David Junhao Zhang, et al.

Self-attention has become an integral component of recent network architectures, e.g., the Transformer, that dominate major image and video benchmarks, because self-attention can flexibly model long-range information. For the same reason, researchers have recently attempted to revive the Multi-Layer Perceptron (MLP) and have proposed several MLP-like architectures that show great potential. However, current MLP-like architectures are not good at capturing local details and lack a progressive understanding of core features in images and videos. To overcome this issue, we propose a novel MorphMLP architecture that focuses on capturing local details at the low-level layers while gradually shifting to long-term modeling at the high-level layers. Specifically, we design a Fully-Connected-like layer, dubbed MorphFC, with two morphable filters that gradually grow their receptive fields along the height and width dimensions. More interestingly, we show how to flexibly adapt the MorphFC layer to the video domain. To the best of our knowledge, we are the first to create an MLP-like backbone for learning video representations. Finally, we conduct extensive experiments on image classification, semantic segmentation, and video classification. MorphMLP, a self-attention-free backbone, can be as powerful as, and can even outperform, self-attention-based models.
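To make the core idea concrete, here is a minimal NumPy sketch of the kind of chunk-wise fully-connected mixing the abstract describes: a feature map is split along one spatial dimension into chunks, and a shared fully-connected weight mixes each flattened chunk. The function name `morph_fc_w`, the `chunk_len` parameter, and the single-weight formulation are hypothetical simplifications for illustration, not the paper's exact MorphFC layer (which pairs two such morphable filters, one per spatial axis, and varies the chunk length with depth).

```python
import numpy as np

def morph_fc_w(x, chunk_len, weight):
    """Hypothetical sketch of MorphFC-style mixing along the width axis.

    x:         feature map of shape (H, W, C)
    chunk_len: receptive-field length along width; in the paper's design
               this grows from small (local detail) in low-level layers
               to large (long-range modeling) in high-level layers
    weight:    (chunk_len * C, chunk_len * C) fully-connected weight
               applied to every flattened chunk
    """
    H, W, C = x.shape
    assert W % chunk_len == 0, "width must be divisible by chunk_len"
    out = np.empty_like(x)
    for h in range(H):
        for s in range(0, W, chunk_len):
            # Flatten one (chunk_len, C) chunk into a vector, mix it with
            # the shared FC weight, and write the result back in place.
            chunk = x[h, s:s + chunk_len, :].reshape(-1)
            out[h, s:s + chunk_len, :] = (weight @ chunk).reshape(chunk_len, C)
    return out
```

A height-axis counterpart would iterate over columns instead of rows; growing `chunk_len` across layers is what morphs the filter from a local operator into a near-global one without any self-attention.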


