Scale-Aware Modulation Meet Transformer

07/17/2023
by Weifeng Lin, et al.

This paper presents a new vision Transformer, Scale-Aware Modulation Transformer (SMT), that can handle various downstream tasks efficiently by combining the convolutional network and vision Transformer. The proposed Scale-Aware Modulation (SAM) in SMT includes two primary novel designs. First, we introduce the Multi-Head Mixed Convolution (MHMC) module, which can capture multi-scale features and expand the receptive field. Second, we propose the Scale-Aware Aggregation (SAA) module, which is lightweight but effective, enabling information fusion across different heads. By leveraging these two modules, convolutional modulation is further enhanced. Furthermore, in contrast to prior works that applied modulation throughout all stages to build an attention-free network, we propose an Evolutionary Hybrid Network (EHN), which can effectively simulate the shift from capturing local to global dependencies as the network becomes deeper, resulting in superior performance. Extensive experiments demonstrate that SMT significantly outperforms existing state-of-the-art models across a wide range of visual tasks. Specifically, SMT with 11.5M / 2.4GFLOPs and 32M / 7.7GFLOPs achieves 82.2% and 84.3% top-1 accuracy on ImageNet-1K, respectively. After being pretrained on ImageNet-22K at 224^2 resolution, it attains 87.1% and 88.1% top-1 accuracy when finetuned at resolutions 224^2 and 384^2, respectively. For object detection with Mask R-CNN, the SMT base trained with the 1x and 3x schedules outperforms the Swin Transformer counterpart by 4.2 and 1.3 mAP on COCO, respectively. For semantic segmentation with UPerNet, the SMT base tested at single and multi scale surpasses Swin by 2.0 and 1.1 mIoU respectively on ADE20K.
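The abstract describes SAM's two components only at a high level. Below is a minimal PyTorch sketch of how MHMC and SAA could compose into a modulation block, assuming a per-head depthwise kernel schedule of 3/5/7/9 and a grouped 1x1 convolution with a channel shuffle for cross-head fusion; the class names and these design details are illustrative assumptions inferred from the abstract, not the authors' released implementation.

```python
# Minimal sketch of Scale-Aware Modulation (SAM) as described in the abstract.
# Kernel sizes, head count, and the grouped-conv + shuffle fusion are assumptions.
import torch
import torch.nn as nn

class MultiHeadMixedConv(nn.Module):
    """MHMC: split channels into heads and apply a depthwise conv with a
    different kernel size per head to capture multi-scale features."""
    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        chunk = dim // heads
        # assumed kernel schedule 3, 5, 7, 9: each head widens the receptive field
        self.convs = nn.ModuleList([
            nn.Conv2d(chunk, chunk, kernel_size=3 + 2 * i,
                      padding=(3 + 2 * i) // 2, groups=chunk)
            for i in range(heads)
        ])

    def forward(self, x):
        chunks = torch.chunk(x, self.heads, dim=1)
        return torch.cat([conv(c) for conv, c in zip(self.convs, chunks)], dim=1)

class ScaleAwareAggregation(nn.Module):
    """SAA: lightweight fusion across heads, here via a channel shuffle
    followed by grouped and pointwise 1x1 convolutions (an assumed design)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.mix = nn.Conv2d(dim, dim, 1, groups=heads)  # fuse within each group
        self.proj = nn.Conv2d(dim, dim, 1)               # fuse across groups

    def forward(self, x):
        b, c, h, w = x.shape
        # channel shuffle so every group sees one slice from each head
        x = x.view(b, self.heads, c // self.heads, h, w)
        x = x.transpose(1, 2).reshape(b, c, h, w)
        return self.proj(self.mix(x))

class ScaleAwareModulation(nn.Module):
    """SAM: gate a value branch with multi-scale context (MHMC -> SAA)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.v = nn.Conv2d(dim, dim, 1)
        self.context = nn.Sequential(MultiHeadMixedConv(dim, heads),
                                     ScaleAwareAggregation(dim, heads))
        self.out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        return self.out(self.v(x) * self.context(x))  # elementwise modulation

x = torch.randn(1, 64, 56, 56)
print(ScaleAwareModulation(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

The elementwise product at the end is the modulation itself: the aggregated multi-scale context acts as a gate on the value branch, which is what stands in for attention in the early, local-dependency stages of the hybrid network.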

Related research

03/22/2022 | Focal Modulation Networks
In this work, we propose focal modulation network (FocalNet in short), w...

06/19/2022 | EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm
Motivated by biological evolution, this paper explains the rationality o...

05/08/2022 | ConvMAE: Masked Convolution Meets Masked Autoencoders
Vision Transformers (ViT) become widely-adopted architectures for variou...

11/22/2022 | Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition
This paper does not attempt to design a state-of-the-art method for visu...

07/17/2022 | Defect Transformer: An Efficient Hybrid Transformer Architecture for Surface Defect Detection
Surface defect detection is an extremely crucial step to ensure the qual...

10/04/2022 | MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models
This paper presents MOAT, a family of neural networks that build on top ...

04/15/2022 | ResT V2: Simpler, Faster and Stronger
This paper proposes ResTv2, a simpler, faster, and stronger multi-scale ...
