A Comprehensive Study of Vision Transformers on Dense Prediction Tasks

01/21/2022
by Kishaan Jeeveswaran, et al.

Convolutional Neural Networks (CNNs), architectures consisting of convolutional layers, have been the standard choice in vision tasks. Recent studies have shown that Vision Transformers (VTs), architectures based on self-attention modules, achieve comparable performance in challenging tasks such as object detection and semantic segmentation. However, the image processing mechanism of VTs is different from that of conventional CNNs. This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks. To address these questions, we study and compare VT and CNN architectures as feature extractors in object detection and semantic segmentation. Our extensive empirical results show that the features generated by VTs are more robust to distribution shifts, natural corruptions, and adversarial attacks in both tasks, whereas CNNs perform better at higher image resolutions in object detection. Furthermore, our results demonstrate that VTs in dense prediction tasks produce more reliable and less texture-biased predictions.
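For readers who want a concrete picture of the feature-extractor comparison described in the abstract, the sketch below (PyTorch, assuming a recent torch/torchvision install) pulls dense features from a ResNet-50 CNN backbone and from a minimal patch-token self-attention encoder. The specific backbone, embedding size, and layer counts are illustrative assumptions, not the exact VT and CNN architectures evaluated in the paper.

import torch
import torchvision

# Dummy batch standing in for detection/segmentation inputs.
images = torch.randn(2, 3, 224, 224)

# CNN feature extractor: ResNet-50 truncated before its pooling and
# classification layers, leaving a spatial feature map.
resnet = torchvision.models.resnet50(weights=None)
cnn_backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
cnn_features = cnn_backbone(images)  # shape: (2, 2048, 7, 7)

# Minimal self-attention feature extractor: non-overlapping 16x16 patch
# embedding followed by a standard TransformerEncoder. This is an
# illustrative stand-in (a real VT would also add positional embeddings),
# not one of the VT backbones studied in the paper.
patch_embed = torch.nn.Conv2d(3, 384, kernel_size=16, stride=16)
encoder_layer = torch.nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=4)
tokens = patch_embed(images).flatten(2).transpose(1, 2)  # (2, 196, 384)
vt_features = encoder(tokens)                            # (2, 196, 384)

print(cnn_features.shape, vt_features.shape)

In the full setup, either feature map would feed a detection or segmentation head, and the paper's robustness comparisons (distribution shift, natural corruptions, adversarial attacks) are made on the downstream predictions.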


