Conformer: Local Features Coupling Global Representations for Visual Recognition

05/09/2021
by Zhiliang Peng, et al.

In convolutional neural networks (CNNs), convolution operations are good at extracting local features but have difficulty capturing global representations. In visual transformers, cascaded self-attention modules can capture long-distance feature dependencies but tend to deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, that takes advantage of both convolution operations and self-attention mechanisms for enhanced representation learning. Conformer is built on the Feature Coupling Unit (FCU), which fuses local features and global representations at different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAP for object detection and instance segmentation, respectively, demonstrating its potential as a general backbone network. Code is available at https://github.com/pengzhiliang/Conformer.
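
To make the coupling idea concrete, the following is a minimal NumPy sketch of how a convolutional feature map and a sequence of transformer patch tokens at different resolutions might be exchanged in both directions. It is an illustration under stated assumptions, not the authors' implementation: the function names `cnn_to_tokens` and `tokens_to_cnn`, the use of a 1x1 projection with average pooling and nearest-neighbour upsampling, and all tensor sizes are hypothetical, and any normalization or class-token handling is omitted.

```python
import numpy as np

def cnn_to_tokens(feat, w_proj, stride):
    """Local -> global (assumed design): map a (C, H, W) feature map to (N, D) tokens."""
    c, h, w = feat.shape
    # A 1x1 convolution is a per-pixel linear map over channels: (D, C) applied to (C, H, W).
    proj = np.einsum("dc,chw->dhw", w_proj, feat)
    d = proj.shape[0]
    # Average-pool non-overlapping stride x stride windows down to the token grid.
    pooled = proj.reshape(d, h // stride, stride, w // stride, stride).mean(axis=(2, 4))
    # Flatten the grid row by row into a token sequence.
    return pooled.reshape(d, -1).T

def tokens_to_cnn(tokens, w_proj, grid, stride):
    """Global -> local (assumed design): map (N, D) tokens back to a (C, H, W) feature map."""
    gh, gw = grid
    d = tokens.shape[1]
    # Restore the 2D token grid, then align channels with a 1x1 projection (C, D).
    fmap = tokens.T.reshape(d, gh, gw)
    proj = np.einsum("cd,dhw->chw", w_proj, fmap)
    # Nearest-neighbour upsampling back to the feature-map resolution.
    return proj.repeat(stride, axis=1).repeat(stride, axis=2)

# Hypothetical sizes: 8-channel 16x16 feature map, 4x4 grid of 12-dimensional tokens.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))
tokens = rng.standard_normal((16, 12))
w_down = rng.standard_normal((12, 8))   # CNN channels -> token dimension
w_up = rng.standard_normal((8, 12))     # token dimension -> CNN channels

# Each branch keeps its own representation and adds the aligned features of the other.
fused_tokens = tokens + cnn_to_tokens(feat, w_down, stride=4)
fused_feat = feat + tokens_to_cnn(fused_tokens, w_up, grid=(4, 4), stride=4)
```

Adding the aligned features residually, instead of replacing one branch with the other, is one way to read the abstract's claim that both local features and global representations are retained.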

