Rethinking Spatial Dimensions of Vision Transformers

03/30/2021
by   Byeongho Heo, et al.

Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks, serving as an alternative architecture to existing convolutional neural networks (CNN). Because transformer-based architectures are a recent innovation in computer vision modeling, design conventions for building them effectively have been studied less. Drawing on the successful design principles of CNNs, we investigate the role of spatial dimension conversion and its effectiveness in transformer-based architectures. We particularly attend to the dimension reduction principle of CNNs: as depth increases, a conventional CNN increases the channel dimension and decreases the spatial dimensions. We empirically show that such spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) built upon the original ViT model. We show that PiT achieves improved model capability and generalization performance compared to ViT. Through extensive experiments, we further show that PiT outperforms the baseline on several tasks such as image classification, object detection, and robustness evaluation. Source code and ImageNet models are available at https://github.com/naver-ai/pit
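The dimension reduction principle described above can be sketched in a few lines. The snippet below is a hypothetical simplification, not the actual PiT code: PiT uses a strided depthwise convolution for the spatial tokens and a linear projection for the class token, whereas here a plain 2x2 average pool stands in for the spatial reduction and a fixed random projection stands in for the learned channel-doubling layer. All names (`pit_style_pool`, `stride`) are illustrative assumptions.

```python
import numpy as np

def pit_style_pool(tokens, h, w, stride=2):
    """Illustrative PiT-style pooling stage (NOT the official implementation):
    reshape the token sequence back to a 2D grid, halve each spatial
    dimension by average pooling, then double the channel dimension."""
    n, c = tokens.shape
    assert n == h * w, "token count must match the spatial grid"
    grid = tokens.reshape(h, w, c)
    # 2x2 average pooling halves both spatial dimensions
    pooled = grid.reshape(h // stride, stride,
                          w // stride, stride, c).mean(axis=(1, 3))
    # channel increase: a fixed random projection stands in for a
    # learned linear layer that maps c -> 2c channels
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((c, 2 * c)) / np.sqrt(c)
    return pooled.reshape(-1, c) @ proj  # shape: (h*w / stride^2, 2*c)
```

For a 16x16 token grid with 64 channels, one such stage yields a 8x8 grid (64 tokens) with 128 channels, mirroring the CNN convention of trading spatial resolution for channel capacity as depth grows.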


Related research

- Rethinking the Design Principles of Robust Vision Transformer (05/17/2021)
- Searching Intrinsic Dimensions of Vision Transformers (04/16/2022)
- Transformers for 1D Signals in Parkinson's Disease Detection from Gait (04/01/2022)
- DeSTNet: Densely Fused Spatial Transformer Networks (07/11/2018)
- A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (08/30/2021)
- PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture (01/04/2022)
- SepTr: Separable Transformer for Audio Spectrogram Processing (03/17/2022)
