ConvFormer: Closing the Gap Between CNN and Vision Transformers

09/16/2022
by Zimian Wei, et al.

Vision transformers have shown excellent performance in computer vision tasks. However, their (local) self-attention mechanism is computationally expensive. By comparison, CNNs are more efficient thanks to their built-in inductive biases. Recent work shows that CNNs can compete with vision transformers by borrowing their architecture designs and training protocols. Nevertheless, existing methods either ignore multi-level features or lack dynamic properties, leading to sub-optimal performance. In this paper, we propose a novel attention mechanism named MCA, which captures different patterns of the input image with multiple kernel sizes and produces input-adaptive weights through a gating mechanism. Based on MCA, we present a neural network named ConvFormer. ConvFormer adopts the general architecture of vision transformers while replacing the (local) self-attention mechanism with the proposed MCA. Extensive experiments demonstrate that ConvFormer achieves state-of-the-art performance on ImageNet classification, outperforming similar-sized vision transformers (ViTs) and convolutional neural networks (CNNs). Moreover, on COCO object detection and ADE20K semantic segmentation, ConvFormer also performs strongly compared with recent advanced methods. Code and models will be available.
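
For intuition, here is a minimal PyTorch sketch of what the abstract describes: depthwise convolutions with several kernel sizes gather multi-scale context, a sigmoid gate turns that context into input-adaptive weights, and the result replaces self-attention inside a standard ViT-style block. The specific choices below (kernel sizes 3/5/7, BatchNorm, the 1x1 gate and projection layers, and all class names) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MCA(nn.Module):
    """Multi-scale convolutional attention (sketch, not the official code).

    Depthwise convolutions with several kernel sizes capture patterns at
    different scales; a sigmoid-gated branch turns the aggregated context
    into input-adaptive weights that modulate the input.
    """
    def __init__(self, dim, kernel_sizes=(3, 5, 7)):  # kernel sizes are assumed
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)
            for k in kernel_sizes
        )
        self.gate = nn.Conv2d(dim, dim, 1)  # produces per-location gate logits
        self.proj = nn.Conv2d(dim, dim, 1)  # fuses the modulated features

    def forward(self, x):
        ctx = sum(branch(x) for branch in self.branches)      # multi-scale context
        return self.proj(torch.sigmoid(self.gate(ctx)) * x)  # gated modulation

class ConvFormerBlock(nn.Module):
    """A ViT-style block with the (local) self-attention swapped for MCA."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(dim)  # norm choice is an assumption
        self.mixer = MCA(dim)
        self.norm2 = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(         # standard channel MLP, as in ViTs
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))   # token mixing with residual
        return x + self.mlp(self.norm2(x))  # channel MLP with residual

x = torch.randn(2, 64, 56, 56)            # (batch, channels, height, width)
print(ConvFormerBlock(64)(x).shape)       # torch.Size([2, 64, 56, 56])
```

The appeal of this design is that the gate depends on the input, so the block keeps the dynamic, content-adaptive behavior of attention while paying only the cost of a few depthwise and pointwise convolutions.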

Related research

07/15/2022 | Lightweight Vision Transformer with Cross Feature Attention
Recent advances in vision transformers (ViTs) have achieved great perfor...

05/20/2021 | Content-Augmented Feature Pyramid Network with Light Linear Transformers
Recently, plenty of work has tried to introduce transformers into comput...

07/09/2020 | DCANet: Learning Connected Attentions for Convolutional Neural Networks
While self-attention mechanism has shown promising results for many visi...

05/24/2023 | ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers
Recently, plain vision Transformers (ViTs) have shown impressive perform...

08/07/2021 | Vision Transformers for femur fracture classification
Objectives: In recent years, the scientific community has focused on the...

11/10/2021 | Learning to ignore: rethinking attention in CNNs
Recently, there has been an increasing interest in applying attention me...

10/13/2022 | Vision Transformers provably learn spatial structure
Vision Transformers (ViTs) have achieved comparable or superior performa...
