All in Tokens: Unifying Output Space of Visual Tasks via Soft Token

01/05/2023
by   Jia Ning, et al.

Unlike language tasks, where the output space is usually limited to a set of tokens, the output space of visual tasks is more complicated, which makes it difficult to build a unified model for various visual tasks. In this paper, we seek to unify the output space of visual tasks so that a unified model can be built for them as well. To this end, we demonstrate a single unified model that simultaneously handles two typical visual tasks, instance segmentation and depth estimation, which have discrete/fixed-length and continuous/varied-length outputs, respectively. We propose several new techniques that take into account the particularities of visual tasks: 1) Soft token. We employ soft tokens to represent the task output. Unlike the hard tokens of the common VQ-VAE, each of which is assigned one-hot to a single entry of a discrete codebook/vocabulary, a soft token is assigned softly across the codebook embeddings. Soft tokens can improve the accuracy of both next-token inference and the decoding of the task output; 2) Mask augmentation. Many visual tasks have corrupted, undefined, or invalid values in their label annotations, e.g., occluded areas of depth maps. We show that a mask augmentation technique can greatly benefit these tasks. With these new techniques and other designs, we show that the proposed general-purpose task-solver performs well on both instance segmentation and depth estimation. In particular, we achieve 0.279 RMSE on the specific task of NYUv2 depth estimation, setting a new record on this benchmark. The general-purpose task-solver, dubbed AiT, is available at <https://github.com/SwinTransformer/AiT>.
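The difference between hard and soft tokens described in the abstract can be illustrated with a small NumPy sketch. This is not the AiT implementation: the function names, the codebook size, and the use of a plain softmax over per-position logits are all assumptions made for illustration. It shows only the general idea that a hard token selects one codebook embedding, while a soft token is a probability-weighted mixture of all of them.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def hard_token_embedding(logits, codebook):
    """Hard token: pick the single most likely codebook entry (one-hot)."""
    indices = logits.argmax(axis=-1)
    return codebook[indices]


def soft_token_embedding(logits, codebook):
    """Soft token: mix all codebook entries, weighted by their probabilities."""
    weights = softmax(logits, axis=-1)
    return weights @ codebook


# Hypothetical sizes, chosen only for this example.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # 8 codebook entries, embedding dim 4
logits = rng.normal(size=(3, 8))    # predicted logits for 3 token positions

hard = hard_token_embedding(logits, codebook)  # shape (3, 4)
soft = soft_token_embedding(logits, codebook)  # shape (3, 4)
```

When the predicted distribution is sharply peaked, the two embeddings nearly coincide. They differ when probability mass is spread over several codebook entries, in which case the soft token interpolates between those embeddings instead of committing to one.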


research
10/06/2020

Parallax Motion Effect Generation Through Instance Segmentation And Depth Estimation

Stereo vision is a growing topic in computer vision due to the innumerab...

research
06/01/2022

PanopticDepth: A Unified Framework for Depth-aware Panoptic Segmentation

This paper presents a unified framework for depth-aware panoptic segment...

research
05/17/2023

Incorporating Attribution Importance for Improving Faithfulness Metrics

Feature attribution methods (FAs) are popular approaches for providing i...

research
06/17/2022

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

We propose Unified-IO, a model that performs a large variety of AI tasks...

research
03/17/2022

PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation

Existing panoramic depth estimation methods based on convolutional neura...

research
09/25/2020

Towards General Purpose and Geometry Preserving Single-View Depth Estimation

Single-view depth estimation plays a crucial role in scene understanding...

research
06/09/2022

Extreme Masking for Learning Instance and Distributed Visual Representations

The paper presents a scalable approach for learning distributed represen...
