CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

03/21/2023
by Seokju Cho, et al.

Existing works on open-vocabulary semantic segmentation have utilized large-scale vision-language models, such as CLIP, to leverage their exceptional open-vocabulary recognition capabilities. However, transferring these capabilities, learned from image-level supervision, to the pixel-level task of segmentation while handling arbitrary unseen categories at inference remains challenging. To address these issues, we aim to attentively relate objects within an image to given categories by leveraging relational information among class categories and visual semantics through aggregation, while also adapting the CLIP representations to the pixel-level task. However, we observe that directly optimizing the CLIP embeddings can harm CLIP's open-vocabulary capabilities. We therefore propose an alternative approach that optimizes the image-text similarity map, i.e., the cost map, using a novel cost aggregation-based method. Our framework, namely CAT-Seg, achieves state-of-the-art performance across all benchmarks. We provide extensive ablation studies to validate our choices. Project page: https://ku-cvlab.github.io/CAT-Seg/.
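
To make the notion of an image-text similarity map concrete, the following is a minimal sketch of how such a cost map could be computed as the cosine similarity between per-pixel image embeddings and per-category text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `cost_map`, the array shapes, and the random arrays standing in for CLIP features are all hypothetical, and the aggregation step that CAT-Seg applies to the cost map is not shown.

```python
import numpy as np

def cost_map(image_feats, text_feats, eps=1e-8):
    """Cosine similarity between every pixel embedding and every category embedding.

    image_feats: (H, W, D) dense image embeddings (assumed stand-in for CLIP image features)
    text_feats:  (N, D) one embedding per category name (assumed stand-in for CLIP text features)
    returns:     (H, W, N) image-text similarity (cost) map
    """
    # L2-normalize both sides so the dot product becomes a cosine similarity.
    img = image_feats / (np.linalg.norm(image_feats, axis=-1, keepdims=True) + eps)
    txt = text_feats / (np.linalg.norm(text_feats, axis=-1, keepdims=True) + eps)
    # Contract over the embedding dimension D.
    return np.einsum("hwd,nd->hwn", img, txt)

# Random placeholders instead of real CLIP outputs (hypothetical sizes).
rng = np.random.default_rng(0)
image_feats = rng.standard_normal((4, 6, 16))  # H=4, W=6, D=16
text_feats = rng.standard_normal((3, 16))      # N=3 categories

cost = cost_map(image_feats, text_feats)

# Naive baseline: label each pixel with its most similar category.
# A cost aggregation method would refine `cost` before this step.
labels = cost.argmax(axis=-1)
```

Taking the per-pixel argmax of the raw cost map is only a baseline; the abstract's point is that the cost map itself, rather than the CLIP embeddings, is what gets optimized.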
