[CLS] Token is All You Need for Zero-Shot Semantic Segmentation

04/13/2023
by   Letian Wu, et al.

In this paper, we propose an embarrassingly simple yet highly effective zero-shot semantic segmentation (ZS3) method based on the pre-trained vision-language model CLIP. Our study first provides two key discoveries: (i) the global tokens (a.k.a. [CLS] tokens in Transformers) of the text branch in CLIP provide a powerful representation of semantic information, and (ii) these text-side [CLS] tokens can be regarded as category priors that guide the CLIP visual encoder to pay more attention to the corresponding regions of interest. Building on these findings, we use the CLIP model as a backbone and extend it with a one-way [CLS] token navigation from the text branch to the visual branch that enables zero-shot dense prediction, dubbed ClsCLIP. Specifically, we use the [CLS] token output by the text branch as an auxiliary semantic prompt to replace the [CLS] token in the shallow layers of the ViT-based visual encoder. This one-way navigation embeds the global category prior early and thus promotes semantic segmentation. Furthermore, to better segment tiny objects in ZS3, we enhance ClsCLIP with a local zoom-in strategy that employs region-proposal pre-processing, yielding ClsCLIP+. Extensive experiments demonstrate that our proposed ZS3 method achieves state-of-the-art performance and is even comparable with few-shot semantic segmentation methods.
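The core mechanism described above can be sketched as follows. This is a minimal illustrative PyTorch implementation, not the authors' actual code: the module name, layer sizes, and the `text_proj` projection are all assumptions, and generic Transformer encoder layers stand in for the CLIP ViT blocks.

```python
import torch
import torch.nn as nn

class OneWayClsNavigation(nn.Module):
    """Sketch of one-way [CLS] token navigation (illustrative, not the
    authors' API): the text-branch [CLS] embedding replaces the visual
    [CLS] token in the first `num_shallow` transformer layers, injecting
    the category prior early in the visual encoder."""

    def __init__(self, dim=64, depth=4, num_shallow=2):
        super().__init__()
        self.num_shallow = num_shallow
        # Stand-in for the ViT blocks of the CLIP visual encoder.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )
        # Hypothetical projection of the text [CLS] token into the
        # visual token space (dimension match is assumed here).
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, patch_tokens, visual_cls, text_cls):
        # patch_tokens: (B, N, D); visual_cls, text_cls: (B, D)
        x = torch.cat([visual_cls.unsqueeze(1), patch_tokens], dim=1)
        prompt = self.text_proj(text_cls).unsqueeze(1)  # (B, 1, D)
        for i, blk in enumerate(self.blocks):
            if i < self.num_shallow:
                # One-way navigation: overwrite the visual [CLS] slot
                # with the text-side semantic prompt in shallow layers.
                x = torch.cat([prompt, x[:, 1:]], dim=1)
            x = blk(x)
        # Drop the [CLS] slot; per-patch features feed dense prediction.
        return x[:, 1:]

model = OneWayClsNavigation()
patches = torch.randn(2, 16, 64)  # B=2, 16 patches, dim 64
v_cls = torch.randn(2, 64)        # visual [CLS] token
t_cls = torch.randn(2, 64)        # text-branch [CLS] for the target class
out = model(patches, v_cls, t_cls)
print(out.shape)  # per-patch features, one per input patch
```

Because the navigation is one-way (text to vision only), the text branch is unchanged and can be precomputed per category, which is what makes the prompt act as a fixed category prior rather than a jointly learned token.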


