MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition

08/31/2022
by   Yunhao Wang, et al.
0

Vision Transformer and its variants have demonstrated great potential in various computer vision tasks. But conventional vision transformers often focus on global dependency at a coarse level, which suffer from a learning challenge on global relationships and fine-grained representation at a token level. In this paper, we introduce Multi-scale Attention Fusion into transformer (MAFormer), which explores local aggregation and global feature extraction in a dual-stream framework for visual recognition. We develop a simple but effective module to explore the full potential of transformers for visual representation by learning fine-grained and coarse-grained features at a token level and dynamically fusing them. Our Multi-scale Attention Fusion (MAF) block consists of: i) a local window attention branch that learns short-range interactions within windows, aggregating fine-grained local features; ii) global feature extraction through a novel Global Learning with Down-sampling (GLD) operation to efficiently capture long-range context information within the whole image; iii) a fusion module that self-explores the integration of both features via attention. Our MAFormer achieves state-of-the-art performance on common vision tasks. In particular, MAFormer-L achieves 85.9% Top-1 accuracy on ImageNet, surpassing CSWin-B and LV-ViT-L by 1.7% and 0.6% respectively. On MSCOCO, MAFormer outperforms the prior art CSWin by 1.7% mAPs on object detection and 1.4% on instance segmentation with similar-sized parameters, demonstrating the potential to be a general backbone network.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/01/2021

Focal Self-attention for Local-Global Interactions in Vision Transformers

Recently, Vision Transformer and its variants have shown great promise o...
research
05/31/2021

Dual-stream Network for Visual Recognition

Transformers with remarkable global representation capacities achieve co...
research
03/29/2023

PMAA: A Progressive Multi-scale Attention Autoencoder Model for High-Performance Cloud Removal from Multi-temporal Satellite Imagery

Satellite imagery analysis plays a vital role in remote sensing, but the...
research
12/09/2021

DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition

While transformers have shown great potential on video recognition tasks...
research
10/23/2021

An attention-driven hierarchical multi-scale representation for visual recognition

Convolutional Neural Networks (CNNs) have revolutionized the understandi...
research
11/16/2016

On the Exploration of Convolutional Fusion Networks for Visual Recognition

Despite recent advances in multi-scale deep representations, their limit...
research
03/15/2023

BiFormer: Vision Transformer with Bi-Level Routing Attention

As the core building block of vision transformers, attention is a powerf...

Please sign up or login with your details

Forgot password? Click here to reset