Visual Transformers: Token-based Image Representation and Processing for Computer Vision

06/05/2020
by Bichen Wu, et al.

Computer vision has achieved great success using standardized image representations – pixel arrays – and the corresponding deep learning operators – convolutions. In this work, we challenge this paradigm: we instead (a) represent images as a set of visual tokens and (b) apply visual transformers to find relationships between visual semantic concepts. Given an input image, we dynamically extract a set of visual tokens from the image to obtain a compact representation of its high-level semantics. We then use visual transformers to operate over the visual tokens and densely model relationships between them. We find that this paradigm of token-based image representation and processing drastically outperforms its convolutional counterparts on image classification and semantic segmentation. To demonstrate the power of this approach on ImageNet classification, we use ResNet as a convenient baseline and replace the last stage of convolutions with visual transformers. This reduces the stage's MACs by up to 6.9x while attaining up to 4.53 points higher top-1 accuracy. For semantic segmentation, we use a visual-transformer-based FPN (VT-FPN) module to replace a convolution-based FPN, requiring 6.5x fewer MACs while achieving up to 0.35 points higher mIoU on LIP and COCO-Stuff.
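The token-extraction step described in the abstract can be illustrated with a minimal sketch: a learned filter maps each spatial position of a feature map to per-token attention logits, a softmax over the spatial dimension normalizes each token's attention map, and tokens are the attention-weighted sums of pixel features. The names, sizes (14x14 feature map, 256 channels, 16 tokens), and the plain-numpy implementation below are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extract_visual_tokens(X, W_a):
    """Sketch of a filter-based tokenizer.

    X   : (HW, C) flattened spatial feature map (e.g. from a CNN stage)
    W_a : (C, L)  learned filters producing L token-attention logits per pixel
    Returns T : (L, C) visual tokens.
    """
    A = softmax(X @ W_a, axis=0)  # (HW, L): each column is a spatial attention map
    T = A.T @ X                   # (L, C): tokens = attention-weighted pixel features
    return T

rng = np.random.default_rng(0)
X = rng.standard_normal((14 * 14, 256))  # hypothetical last-stage features
W_a = rng.standard_normal((256, 16))     # hypothetical filters for 16 tokens
tokens = extract_visual_tokens(X, W_a)
print(tokens.shape)  # (16, 256)
```

The compactness claim follows directly: a transformer then attends over 16 tokens instead of 196 spatial positions, which is where the reported MAC savings in the final stage come from.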


