Decoupling Zero-Shot Semantic Segmentation

12/15/2021
by   Jian Ding, et al.

Zero-shot semantic segmentation (ZS3) aims to segment novel categories that were not seen during training. Existing works formulate ZS3 as a pixel-level zero-shot classification problem and transfer semantic knowledge from seen classes to unseen ones with the help of language models pre-trained only on text. While simple, the pixel-level ZS3 formulation has limited capability to integrate vision-language models, which are often pre-trained on image-text pairs and currently demonstrate great potential for vision tasks. Inspired by the observation that humans often perform segment-level semantic labeling, we propose to decouple ZS3 into two sub-tasks: 1) a class-agnostic grouping task that groups pixels into segments, and 2) a zero-shot classification task on segments. The former sub-task does not involve category information and can be directly transferred to group pixels for unseen classes. The latter sub-task operates at the segment level and provides a natural way to leverage large-scale vision-language models pre-trained on image-text pairs (e.g., CLIP) for ZS3. Based on the decoupling formulation, we propose a simple and effective zero-shot semantic segmentation model, called ZegFormer, which outperforms previous methods on standard ZS3 benchmarks by large margins, e.g., 35 points on PASCAL VOC and 3 points on COCO-Stuff in terms of mIoU for unseen classes. Code will be released at https://github.com/dingjiansw101/ZegFormer.
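
The decoupled formulation in the abstract can be illustrated with a minimal NumPy sketch. This is not the ZegFormer implementation: the function names `classify_segments` and `segments_to_label_map` are hypothetical, the segment masks are assumed to come from some class-agnostic grouping step, and the small hand-written vectors stand in for the segment and class-name embeddings that a vision-language model such as CLIP would provide.

```python
import numpy as np


def classify_segments(segment_embeddings, class_embeddings):
    """Sub-task 2: assign each segment to the class whose embedding has
    the highest cosine similarity (hypothetical helper, not from the paper)."""
    seg = segment_embeddings / np.linalg.norm(segment_embeddings, axis=1, keepdims=True)
    cls = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    similarity = seg @ cls.T  # shape: (num_segments, num_classes)
    return similarity.argmax(axis=1)


def segments_to_label_map(segment_masks, segment_labels, ignore_label=-1):
    """Paint each segment's predicted class onto the pixels it covers."""
    _, height, width = segment_masks.shape
    label_map = np.full((height, width), ignore_label, dtype=np.int64)
    for mask, label in zip(segment_masks, segment_labels):
        label_map[mask] = label
    return label_map


# Sub-task 1 (assumed already done): two class-agnostic segments on a 2x4 image.
segment_masks = np.zeros((2, 2, 4), dtype=bool)
segment_masks[0, :, :2] = True  # left half of the image
segment_masks[1, :, 2:] = True  # right half of the image

# Toy stand-ins for class-name embeddings (3 classes) and segment embeddings.
class_embeddings = np.array([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 0.0],
                             [0.0, 0.0, 1.0]])
segment_embeddings = np.array([[0.1, 0.9, 0.0],
                               [0.2, 0.1, 0.8]])

segment_labels = classify_segments(segment_embeddings, class_embeddings)
label_map = segments_to_label_map(segment_masks, segment_labels)
print(segment_labels)
print(label_map)
```

Because the grouping step never sees class names, adding an unseen class in this sketch only requires appending one more row to `class_embeddings`; the masks are left unchanged.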


research
06/03/2019

Zero-Shot Semantic Segmentation

Semantic segmentation models are limited in their ability to scale to la...
research
12/29/2021

A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model

Recently, zero-shot image classification by vision-language pre-training...
research
10/27/2022

Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models

When trained at a sufficient scale, self-supervised learning has exhibit...
research
07/13/2023

AvatarFusion: Zero-shot Generation of Clothing-Decoupled 3D Avatars Using 2D Diffusion

Large-scale pre-trained vision-language models allow for the zero-shot t...
research
03/23/2023

Zero-guidance Segmentation Using Zero Segment Labels

CLIP has enabled new and exciting joint vision-language applications, on...
research
06/27/2023

What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation

While semantic segmentation has seen tremendous improvements in the past...
research
06/14/2022

ReCo: Retrieve and Co-segment for Zero-shot Transfer

Semantic segmentation has a broad range of applications, but its real-wo...
