A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

08/30/2021
by Yucheng Zhao, et al.

Convolutional neural networks (CNNs) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer- and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, have started to set new trends as they show promising results on the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH, which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy, already on par with SOTA models that use sophisticated designs. The code and models will be made publicly available.
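The fair-comparison idea in the abstract, a shared macro-structure in which only the spatial-mixing module changes between the CNN, Transformer, and MLP variants while channel mixing stays fixed, can be illustrated with a short sketch. The following is a minimal PyTorch-style sketch under my own assumptions: the class names (SpachBlock, ChannelMLP, SpatialConv, SpatialAttention, SpatialMLP), the pre-norm residual layout, and all dimensions are illustrative and are not taken from the released SPACH code.

# Minimal sketch of a SPACH-style block with a pluggable spatial-mixing module.
# All names and sizes are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class ChannelMLP(nn.Module):
    """Point-wise MLP applied independently at every spatial location."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * expansion),
            nn.GELU(),
            nn.Linear(dim * expansion, dim),
        )

    def forward(self, x):  # x: (B, N, C)
        return self.net(x)


class SpatialAttention(nn.Module):
    """Self-attention as the spatial-mixing choice (Transformer variant)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, N, C)
        out, _ = self.attn(x, x, x)
        return out


class SpatialMLP(nn.Module):
    """Token-mixing MLP over the N spatial positions (MLP variant)."""
    def __init__(self, num_tokens):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_tokens, num_tokens),
            nn.GELU(),
            nn.Linear(num_tokens, num_tokens),
        )

    def forward(self, x):  # x: (B, N, C)
        return self.net(x.transpose(1, 2)).transpose(1, 2)


class SpatialConv(nn.Module):
    """Depth-wise 3x3 convolution as the spatial-mixing choice (CNN variant)."""
    def __init__(self, dim, grid):  # grid: (H, W) with H * W == N
        super().__init__()
        self.grid = grid
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):  # x: (B, N, C)
        B, N, C = x.shape
        H, W = self.grid
        y = self.conv(x.transpose(1, 2).reshape(B, C, H, W))
        return y.reshape(B, C, N).transpose(1, 2)


class SpachBlock(nn.Module):
    """One block: spatial mixing, then channel mixing, both pre-norm residual."""
    def __init__(self, dim, spatial_module):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.spatial = spatial_module
        self.channel = ChannelMLP(dim)

    def forward(self, x):  # x: (B, N, C) patch tokens
        x = x + self.spatial(self.norm1(x))
        x = x + self.channel(self.norm2(x))
        return x


# Usage: the same block skeleton hosts any of the three structures under study.
tokens = torch.randn(2, 196, 384)                       # 14x14 patches, 384 channels
blk = SpachBlock(384, SpatialConv(384, grid=(14, 14)))  # or SpatialAttention(384) / SpatialMLP(196)
print(blk(tokens).shape)                                # torch.Size([2, 196, 384])

The point of the sketch is that swapping the spatial_module argument switches the block between convolution, self-attention, and spatial MLP while everything else stays fixed, which mirrors the controlled comparison the abstract describes.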


