Vision Transformers for Dense Prediction

03/24/2021
by René Ranftl et al.

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context, where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
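The token-reassembly step the abstract describes can be sketched roughly as follows. This is an illustrative NumPy sketch, not the authors' code: the function name and the "ignore the readout token" choice are assumptions, and DPT's actual Reassemble blocks additionally project channels and resample each stage to a different resolution before the convolutional decoder fuses them.

```python
import numpy as np

def reassemble(tokens, image_hw, patch=16):
    """Turn a stage's ViT patch tokens back into an image-like feature map.

    tokens: array of shape (1 + H/p * W/p, D); index 0 is the readout
    (class) token. Returns a (D, H/p, W/p) channels-first feature map.
    Readout handling here is the simple 'ignore' variant.
    """
    h, w = image_hw[0] // patch, image_hw[1] // patch
    spatial = tokens[1:]              # drop the readout token ('ignore')
    fmap = spatial.reshape(h, w, -1)  # lay tokens back on the patch grid
    return fmap.transpose(2, 0, 1)    # channels-first, image-like

# Example: a 384x384 input with 16x16 patches gives a 24x24 grid of tokens
tokens = np.random.rand(1 + 24 * 24, 768)
fmap = reassemble(tokens, (384, 384))
print(fmap.shape)  # (768, 24, 24)
```

Because the patch grid keeps a constant, relatively high resolution at every transformer stage, each reassembled map retains fine spatial detail while every token has attended globally.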


Related Research

05/10/2021

Self-Supervised Learning with Swin Transformers

We are witnessing a modeling shift from CNN to Transformers in computer ...
12/31/2020

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Most recent semantic segmentation methods adopt a fully-convolutional ne...
05/31/2021

MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens

Transformers have offered a new methodology of designing neural networks...
07/21/2021

CycleMLP: A MLP-like Architecture for Dense Prediction

This paper presents a simple MLP-like architecture, CycleMLP, which is a...
03/29/2020

Superpixel Segmentation with Fully Convolutional Networks

In computer vision, superpixels have been widely used as an effective wa...
12/20/2021

StyleSwin: Transformer-based GAN for High-resolution Image Generation

Despite the tantalizing success in a broad range of vision tasks, transformers...
06/01/2021

Multi-task fully convolutional network for tree species mapping in dense forests using small training hyperspectral data

This work proposes a multi-task fully convolutional architecture for tre...

Code Repositories

MiDaS

Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2020"


view repo

DPT

Dense Prediction Transformers


view repo


MonoDepthAttacks

Adversarial attacks on state of the art monocular depth estimation networks


view repo