Demystify Transformers & Convolutions in Modern Image Deep Networks

11/10/2022
by Jifeng Dai, et al.

The recent success of vision transformers has inspired a series of vision backbones with novel feature transformation paradigms that report steady performance gains. Although the novel feature transformation designs are often claimed as the source of these gains, some backbones may also benefit from advanced engineering techniques, which makes it hard to attribute the real gain to the key feature transformation operators. In this paper, we aim to identify the real gains of popular convolution and attention operators and make an in-depth study of them. We observe that the main difference among these feature transformation modules, e.g., attention or convolution, lies in the way they aggregate spatial features, i.e., the so-called "spatial token mixer" (STM). Hence, we first elaborate a unified architecture to eliminate the unfair impact of different engineering techniques, and then fit the STMs into this architecture for comparison. Based on various experiments on upstream/downstream tasks and an analysis of inductive bias, we find that the engineering techniques boost performance significantly, but a performance gap still exists among different STMs. The detailed analysis also reveals some interesting properties of different STMs, such as their effective receptive fields and behavior under invariance tests. The code and trained models will be publicly available at https://github.com/OpenGVLab/STM-Evaluation
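
To make the STM abstraction concrete, below is a minimal PyTorch sketch of a unified block in which only the spatial token mixer is swapped while the rest of the architecture stays fixed. The class names (UnifiedBlock, AttentionSTM, DepthwiseConvSTM) and the specific layer choices are hypothetical illustrations, not the paper's actual implementation (see the STM-Evaluation repository for that):

```python
import torch
import torch.nn as nn

class UnifiedBlock(nn.Module):
    """One block of a unified backbone: only the spatial token mixer (STM) varies."""
    def __init__(self, dim, stm: nn.Module, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.stm = stm  # swappable: attention, depth-wise convolution, ...
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(          # channel mixing, shared by all STMs
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                  # x: (batch, tokens, dim)
        x = x + self.stm(self.norm1(x))    # spatial aggregation differs per STM
        x = x + self.mlp(self.norm2(x))    # everything else is identical
        return x

class AttentionSTM(nn.Module):
    """Global self-attention as the spatial token mixer."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class DepthwiseConvSTM(nn.Module):
    """Local depth-wise convolution as the spatial token mixer (square token grid assumed)."""
    def __init__(self, dim, kernel=7):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)
    def forward(self, x):
        b, n, c = x.shape
        s = int(n ** 0.5)                              # assumes n is a perfect square
        x = x.transpose(1, 2).reshape(b, c, s, s)      # tokens -> 2D feature map
        x = self.conv(x)
        return x.reshape(b, c, n).transpose(1, 2)      # back to token sequence

# Identical surrounding architecture, different STMs:
x = torch.randn(2, 196, 64)                            # 14x14 token grid
for stm in (AttentionSTM(64), DepthwiseConvSTM(64)):
    print(UnifiedBlock(64, stm)(x).shape)              # torch.Size([2, 196, 64])
```

Because the normalization, residual connections, and channel MLP are held fixed, any performance difference between the two runs can be attributed to the STM itself rather than to surrounding engineering choices, which is the comparison the paper sets out to make.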

