A survey of the Vision Transformers and its CNN-Transformer based Variants

05/17/2023
by Asifullah Khan, et al.

Vision transformers have recently become popular as a possible alternative to convolutional neural networks (CNNs) for a variety of computer vision applications. Because they can model global relationships in images, vision transformers have large capacity, but they may generalize poorly compared to CNNs. More recently, the hybridization of convolution and self-attention mechanisms in vision transformers has gained popularity, because the combination can exploit both local and global image representations. These CNN-Transformer architectures, also known as hybrid vision transformers, have shown remarkable results for vision applications. Given the rapidly growing number of hybrid vision transformers, there is a need for a taxonomy and explanation of these architectures. This survey presents a taxonomy of recent vision transformer architectures, and more specifically of hybrid vision transformers. It also discusses the key features of each architecture, such as attention mechanisms, positional embeddings, multi-scale processing, and convolution. The survey highlights the potential of hybrid vision transformers to achieve outstanding performance on a variety of computer vision tasks and points towards future directions for this rapidly evolving field.
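To make the idea of combining local convolution with global self-attention concrete, the following is a minimal sketch in plain Python. It is not taken from the survey or from any specific architecture it covers. The function names (`local_conv`, `self_attention`, `hybrid_block`), the fixed smoothing kernel, and the use of identity query/key/value projections are all illustrative assumptions; real hybrid vision transformers use learned weights and 2D image patches.

```python
import math

def local_conv(tokens, kernel=(0.25, 0.5, 0.25)):
    """Depthwise 1D convolution over the token axis with zero padding.

    Each output token mixes only its immediate neighbours (local context).
    The fixed kernel is an assumption; a real model would learn it.
    """
    n, half, dim = len(tokens), len(kernel) // 2, len(tokens[0])
    out = []
    for i in range(n):
        mixed = [0.0] * dim
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:  # positions outside the sequence count as zeros
                for d in range(dim):
                    mixed[d] += w * tokens[j][d]
        out.append(mixed)
    return out

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(tokens):
    """Scaled dot-product attention weights with identity projections (Q = K = x)."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in tokens])
        for q in tokens
    ]

def self_attention(tokens):
    """Each output token is a weighted sum over ALL tokens (global context)."""
    dim = len(tokens[0])
    return [
        [sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
        for row in attention_weights(tokens)
    ]

def add(a, b):
    """Element-wise sum of two token sequences (residual connection)."""
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]

def hybrid_block(tokens):
    """Local mixing by convolution, then global mixing by self-attention."""
    x = add(tokens, local_conv(tokens))
    return add(x, self_attention(x))

if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    for token in hybrid_block(patches):
        print(token)
```

In this sketch the convolution only ever sees a three-token window, while every attention row spans the whole sequence, which is the local/global split the abstract refers to. Placing the convolution before the attention is one design choice among several.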


research
11/02/2021

Can Vision Transformers Perform Convolution?

Several recent studies have demonstrated that attention-based networks, ...
research
10/11/2021

Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block

In recent developments in the field of Computer Vision, a rise is seen i...
research
03/02/2023

Self-attention in Vision Transformers Performs Perceptual Grouping, Not Attention

Recently, a considerable number of studies in computer vision involves d...
research
08/23/2021

Discovering Spatial Relationships by Transformers for Domain Generalization

Due to the rapid increase in the diversity of image data, the problem of...
research
09/05/2023

A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking

Vision Transformer (ViT) architectures are becoming increasingly popular...
research
01/25/2023

Out of Distribution Performance of State of Art Vision Model

The vision transformer (ViT) has advanced to the cutting edge in the vis...
research
06/13/2023

Reviving Shift Equivariance in Vision Transformers

Shift equivariance is a fundamental principle that governs how we percei...
