Transformers in Vision: A Survey

01/04/2021
by Salman Khan, et al.

Astounding results from transformer models on natural language tasks have motivated the vision community to study their application to computer vision problems. This has led to exciting progress on a number of tasks while requiring minimal inductive biases in the model design. This survey aims to provide a comprehensive overview of transformer models in the computer vision discipline and assumes little to no prior background in the field. We start with an introduction to the fundamental concepts behind the success of transformer models, i.e., self-supervision and self-attention. Transformer architectures leverage self-attention mechanisms to encode long-range dependencies in the input domain, which makes them highly expressive. Since they assume minimal prior knowledge about the structure of the problem, self-supervision using pretext tasks is applied to pre-train transformer models on large-scale (unlabelled) datasets. The learned representations are then fine-tuned on downstream tasks, typically leading to excellent performance due to the generalization and expressivity of the encoded features. We cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering and visual reasoning), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis of open research directions and possible future work.
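As a rough illustration of the self-attention mechanism the abstract refers to, the following is a minimal NumPy sketch of single-head scaled dot-product self-attention. The shapes, variable names, and random inputs are illustrative assumptions, not details taken from the survey:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token embeddings.

    x            : (seq_len, d_model) input tokens (e.g., image patch embeddings)
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices
    Returns (seq_len, d_k): each output is a weighted mix of ALL tokens,
    which is how self-attention encodes long-range dependencies.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ v

# Toy example: 4 "patch" tokens with an 8-dimensional embedding.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(tokens, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

Note that, unlike a convolution, every output token here attends to every input token regardless of distance, which is the expressivity property the abstract highlights.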

Related research:

- A Survey on Visual Transformer (12/23/2020)
  Transformer is a type of deep neural network mainly based on self-attent...
- A Comprehensive Survey of Transformers for Computer Vision (11/11/2022)
  As a special type of transformer, Vision Transformers (ViTs) are used to...
- Are we ready for a new paradigm shift? A Survey on Visual Deep MLP (11/07/2021)
  Multilayer perceptron (MLP), as the first neural network structure to ap...
- Transformers in 3D Point Clouds: A Survey (05/16/2022)
  In recent years, Transformer models have been proven to have the remarka...
- Transformer-Based Visual Segmentation: A Survey (04/19/2023)
  Visual segmentation seeks to partition images, video frames, or point cl...
- Behavior Transformers: Cloning k modes with one stone (06/22/2022)
  While behavior learning has made impressive progress in recent times, it...
- Sequencer: Deep LSTM for Image Classification (05/04/2022)
  In recent computer vision research, the advent of the Vision Transformer...
