Generalized Decoding for Pixel, Image, and Language

12/21/2022
by Xueyan Zou, et al.

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) finetuned performance that is better than or competitive with other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
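
To make the two-query interface in the abstract concrete, the following is a toy sketch only, not the authors' implementation. It assumes that generic (non-semantic) queries and text-derived queries are concatenated, updated by a single cross-attention step over pixel features, and then read out twice: as per-query mask logits (pixel-level) and as per-query embeddings (token-level) that live in the same vector space. All names (`toy_decode`, `cross_attend`, `latent_queries`, `text_queries`, `pixel_features`) and the absence of learned weights are illustrative assumptions.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head cross-attention with no learned projections (a simplification)."""
    dim = len(features[0])
    updated = []
    for q in queries:
        weights = softmax([dot(q, f) / math.sqrt(dim) for f in features])
        updated.append(
            [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
        )
    return updated


def toy_decode(latent_queries, text_queries, pixel_features):
    """Hypothetical decoder step: both query types share one attention pass."""
    queries = latent_queries + text_queries  # concatenate the two query types
    decoded = cross_attend(queries, pixel_features)
    # Pixel-level output: one mask logit per (query, pixel) pair.
    mask_logits = [[dot(q, p) for p in pixel_features] for q in decoded]
    # Token-level output: the decoded query vectors themselves, so masks and
    # semantics are expressed in the same embedding space.
    semantic_embeddings = decoded
    return mask_logits, semantic_embeddings


random.seed(0)
DIM, NUM_PIXELS = 8, 16


def rand_vectors(count):
    """Random stand-ins for features that a real model would learn or encode."""
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


pixel_features = rand_vectors(NUM_PIXELS)  # flattened image feature map
latent_queries = rand_vectors(3)           # generic non-semantic queries
text_queries = rand_vectors(2)             # queries induced from text input

mask_logits, semantic_embeddings = toy_decode(
    latent_queries, text_queries, pixel_features
)
print(len(mask_logits), len(mask_logits[0]), len(semantic_embeddings[0]))
```

In this sketch every query, whether generic or text-derived, yields both a mask and an embedding; which outputs are actually used for a given task (segmentation, retrieval, captioning) is a design decision the abstract does not spell out.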

research
08/11/2023

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

Current deep networks are very data-hungry and benefit from training on ...
research
11/30/2021

CRIS: CLIP-Driven Referring Image Segmentation

Referring image segmentation aims to segment a referent via a natural li...
research
12/18/2021

Prompt-Based Multi-Modal Image Segmentation

Image segmentation is usually addressed by training a model for a fixed ...
research
02/14/2023

PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

In this work, instead of directly predicting the pixel-level segmentatio...
research
09/20/2022

Towards Robust Referring Image Segmentation

Referring Image Segmentation (RIS) aims to connect image and language vi...
research
04/15/2023

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

Recent success of Contrastive Language-Image Pre-training (CLIP) has sho...
research
03/21/2023

CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

Existing works on open-vocabulary semantic segmentation have utilized la...
