A Unified Sequence Interface for Vision Tasks

06/15/2022
by   Ting Chen, et al.

While language tasks are naturally expressed in a single, unified modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely object detection, instance segmentation, keypoint detection, and image captioning, whose outputs take very different forms, e.g., bounding boxes or dense masks. Despite this diversity, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as the task description, and the output sequence adapts to the prompt so that it is task-specific. We show that such a model can achieve competitive performance compared to well-established task-specific models.
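To make the idea of a shared token interface concrete, here is a minimal sketch of how a detection target might be turned into a discrete token sequence that starts with a task prompt. It is an illustration under stated assumptions, not the authors' implementation: the bin count `NUM_BINS`, the id layout (`COORD_OFFSET`), the prompt ids in `TASK_PROMPTS`, and all function names are hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

# All constants below are assumptions for illustration only.
NUM_BINS = 1000      # number of discrete bins per coordinate
COORD_OFFSET = 10    # ids below this are reserved for prompt/special tokens
TASK_PROMPTS = {     # hypothetical prompt token ids, one per task
    "detection": 1,
    "segmentation": 2,
    "keypoint": 3,
    "captioning": 4,
}

Box = Tuple[float, float, float, float]  # (ymin, xmin, ymax, xmax) in pixels


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a coordinate in [0, size] to a bin index in [0, num_bins - 1]."""
    frac = min(max(value / size, 0.0), 1.0)  # clamp to the image extent
    return min(int(frac * num_bins), num_bins - 1)


def dequantize(bin_index: int, size: float, num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to the pixel coordinate at the bin center."""
    return (bin_index + 0.5) / num_bins * size


def box_to_tokens(box: Box, label: int, image_hw: Tuple[int, int]) -> List[int]:
    """Encode one box as four coordinate tokens followed by one class token."""
    ymin, xmin, ymax, xmax = box
    height, width = image_hw
    bins = [
        quantize(ymin, height),
        quantize(xmin, width),
        quantize(ymax, height),
        quantize(xmax, width),
    ]
    # Class tokens are placed after the coordinate range in the vocabulary.
    class_token = COORD_OFFSET + NUM_BINS + label
    return [COORD_OFFSET + b for b in bins] + [class_token]


def build_sequence(
    task: str,
    boxes: Sequence[Box],
    labels: Sequence[int],
    image_hw: Tuple[int, int],
) -> List[int]:
    """Prefix the task prompt token, then append the tokens of every object."""
    sequence = [TASK_PROMPTS[task]]
    for box, label in zip(boxes, labels):
        sequence.extend(box_to_tokens(box, label, image_hw))
    return sequence


if __name__ == "__main__":
    tokens = build_sequence("detection", [(0, 0, 50, 100)], [3], (100, 200))
    print(tokens)
```

With these assumed constants, a single box `(0, 0, 50, 100)` with label `3` in a 100x200 image yields `[1, 10, 10, 510, 510, 1013]`: one prompt token, four coordinate tokens, and one class token. Because every task's output becomes a flat list of integer ids, the same sequence model and a single token-level loss can in principle be applied to all of them, which is the point the abstract makes.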


research
06/17/2022

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

We propose Unified-IO, a model that performs a large variety of AI tasks...
research
02/16/2023

Tuning computer vision models with task rewards

Misalignment between model predictions and intended usage can be detrime...
research
05/20/2022

UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes

We introduce UViM, a unified approach capable of modeling a wide range o...
research
09/07/2023

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

We present InstructDiffusion, a unifying and generic framework for align...
research
12/05/2022

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

In-context learning, as a new paradigm in NLP, allows the model to rapid...
research
09/22/2021

Pix2seq: A Language Modeling Framework for Object Detection

This paper presents Pix2Seq, a simple and generic framework for object d...
research
08/17/2022

UniLayout: Taming Unified Sequence-to-Sequence Transformers for Graphic Layout Generation

To satisfy various user needs, different subtasks of graphic layout gene...
