Vision Transformers for Dense Prediction

03/24/2021
by René Ranftl, et al.

We introduce dense vision transformers, an architecture that uses vision transformers in place of convolutional networks as the backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions than fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context, where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
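
The token-to-image step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `reassemble` and `fuse`, the 4x4 patch grid, the three stages, and the scale factors are all hypothetical, and nearest-neighbour upsampling plus plain addition stand in for the learned convolutional decoder. It only shows how a flat token sequence can be reshaped into image-like maps at several resolutions and merged coarse to fine.

```python
import numpy as np


def reassemble(tokens, grid_hw, scale):
    """Turn a (1 + gh*gw, D) token sequence into a (D, gh*scale, gw*scale) map.

    Assumption: the first token is a readout/class token and is simply dropped.
    `scale` is an integer nearest-neighbour upsampling factor.
    """
    gh, gw = grid_hw
    patch_tokens = tokens[1:]
    if patch_tokens.shape[0] != gh * gw:
        raise ValueError("token count does not match the patch grid")
    # Tokens are assumed to be in row-major patch order.
    fmap = patch_tokens.reshape(gh, gw, -1).transpose(2, 0, 1)
    return fmap.repeat(scale, axis=1).repeat(scale, axis=2)


def fuse(maps):
    """Combine maps ordered coarse to fine: upsample the running result, then add."""
    fused = maps[0]
    for finer in maps[1:]:
        factor = finer.shape[1] // fused.shape[1]
        fused = fused.repeat(factor, axis=1).repeat(factor, axis=2) + finer
    return fused


rng = np.random.default_rng(0)
grid_hw, dim = (4, 4), 8  # e.g. a 64x64 image cut into 16x16 patches
# Hypothetical tokens taken from three transformer stages (deepest first).
stage_tokens = [rng.standard_normal((1 + 16, dim)) for _ in range(3)]
maps = [reassemble(t, grid_hw, s) for t, s in zip(stage_tokens, (1, 2, 4))]
features = fuse(maps)
print(features.shape)
```

Every stage yields the same number of tokens, so the different resolutions come only from the resampling applied after reshaping. That matches the abstract's point that the backbone itself works at a constant resolution.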

research
02/06/2022

GLPanoDepth: Global-to-Local Panoramic Depth Estimation

In this paper, we propose a learning-based method for predicting dense d...
research
07/10/2022

Depthformer : Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

Attention-based models such as transformers have shown outstanding perfo...
research
11/02/2022

WITT: A Wireless Image Transmission Transformer for Semantic Communications

In this paper, we aim to redesign the vision Transformer (ViT) as a new ...
research
01/22/2022

Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation

It is critical to obtain high resolution features with long range depend...
research
03/29/2020

Superpixel Segmentation with Fully Convolutional Networks

In computer vision, superpixels have been widely used as an effective wa...
research
07/26/2023

MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation

We release MiDaS v3.1 for monocular depth estimation, offering a variety...
research
08/31/2023

Towards Optimal Patch Size in Vision Transformers for Tumor Segmentation

Detection of tumors in metastatic colorectal cancer (mCRC) plays an esse...
