Dynamic Spectrum Mixer for Visual Recognition

09/13/2023
by Zhiqiang Hu, et al.

Recently, MLP-based vision backbones have achieved promising performance on several visual recognition tasks. However, existing MLP-based methods aggregate tokens with static weights, leaving their adaptability to different images unexplored. Moreover, recent research demonstrates that MLP- and Transformer-based architectures excel at modeling long-range dependencies but are ineffective at capturing high-frequency components that mainly convey local information, which limits their applicability to downstream dense prediction tasks such as semantic segmentation. To address these challenges, we propose a content-adaptive yet computationally efficient structure, dubbed Dynamic Spectrum Mixer (DSM). DSM represents token interactions in the frequency domain by employing the Discrete Cosine Transform, which allows it to learn long-range spatial dependencies with log-linear complexity. Furthermore, a dynamic spectrum weight generation layer is proposed as a spectrum-band selector, which emphasizes informative frequency bands while diminishing others. As a result, DSM can efficiently learn detailed features from visual inputs that contain both high- and low-frequency information. Extensive experiments show that DSM is a powerful and adaptable backbone for a range of visual recognition tasks. In particular, DSM outperforms previous Transformer-based and MLP-based models on image classification, object detection, and semantic segmentation, achieving 83.8% top-1 accuracy on ImageNet and 49.9% mIoU on ADE20K.
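As a rough illustration of the mechanism described in the abstract, the sketch below mixes a grid of tokens in the DCT domain and gates each frequency band with content-derived weights. Everything beyond what the abstract states is an assumption made for illustration: the function name `dynamic_spectrum_mix`, the global-average context vector, and the single linear layer plus sigmoid used as the weight generator are not the authors' actual design.

```python
# Minimal sketch of a DSM-style token mixer (NumPy/SciPy), based only on the abstract.
import numpy as np
from scipy.fft import dctn, idctn


def dynamic_spectrum_mix(tokens, w_gen, b_gen):
    """Mix a grid of tokens in the DCT (frequency) domain with content-adaptive weights.

    tokens : (H, W, C) array of patch embeddings.
    w_gen  : (C, H * W) weight matrix of the (assumed) spectrum-weight generator.
    b_gen  : (H * W,) bias of the generator.
    """
    H, W, C = tokens.shape

    # 1) Move token interactions to the frequency domain with a 2-D DCT
    #    over the spatial axes (type-II, orthonormal).
    spectrum = dctn(tokens, type=2, norm="ortho", axes=(0, 1))      # (H, W, C)

    # 2) Dynamic spectrum-band selector: derive one gate per frequency band
    #    from the image content (here: a global average of the tokens).
    context = tokens.mean(axis=(0, 1))                              # (C,)
    gates = 1.0 / (1.0 + np.exp(-(context @ w_gen + b_gen)))        # (H*W,), in (0, 1)
    gates = gates.reshape(H, W, 1)

    # 3) Emphasize informative bands, diminish others, and return to the
    #    spatial domain with the inverse DCT.
    return idctn(spectrum * gates, type=2, norm="ortho", axes=(0, 1))


# Toy usage: a 14x14 grid of 96-dim tokens with randomly initialized generator weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 96))
w = rng.standard_normal((96, 14 * 14)) * 0.02
b = np.zeros(14 * 14)
y = dynamic_spectrum_mix(x, w, b)
print(y.shape)  # (14, 14, 96)
```

Because each DCT coefficient aggregates information from every spatial position, a single gated multiplication in the spectrum already couples all tokens, which is what gives this kind of mixer long-range behavior at low cost.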

