DenseCLIP: Extract Free Dense Labels from CLIP

12/02/2021
by Chong Zhou, et al.

Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition, and many recent studies leverage pre-trained CLIP models for image-level classification and manipulation. In this paper, we further explore the potential of CLIP for pixel-level dense prediction, specifically semantic segmentation. Without any annotations or fine-tuning, our method, DenseCLIP, yields reasonable segmentation results on open concepts across various datasets. With the addition of pseudo labeling and self-training, DenseCLIP+ surpasses state-of-the-art transductive zero-shot semantic segmentation methods by large margins: the mIoU on unseen classes improves from 35.6 to 86.1 on PASCAL VOC, from 20.7 to 66.7 on PASCAL Context, and from 30.3 to 54.7 on COCO Stuff. We also test the robustness of DenseCLIP under input corruption and evaluate its ability to discriminate fine-grained objects and novel concepts. Our findings suggest that DenseCLIP can serve as a new, reliable source of supervision for dense prediction tasks, enabling annotation-free segmentation.
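To make the core idea concrete, below is a minimal sketch of zero-shot dense prediction with CLIP-style features. It assumes spatially resolved image features have already been extracted from CLIP's image encoder (e.g., by bypassing the final attention pooling, in the spirit of DenseCLIP); the tensors `dense_feats` and `text_embeds` are hypothetical placeholders standing in for real encoder outputs, not the paper's actual pipeline.

```python
# Sketch: per-pixel zero-shot classification in CLIP's joint embedding space.
# Assumption: dense_feats are CLIP image features kept at spatial resolution,
# and text_embeds are CLIP text embeddings of class prompts such as
# "a photo of a {class}". Both are placeholders here (random tensors).

import torch
import torch.nn.functional as F

def dense_zero_shot_logits(dense_feats: torch.Tensor,
                           text_embeds: torch.Tensor,
                           temperature: float = 100.0) -> torch.Tensor:
    """dense_feats: [C, H, W] image features; text_embeds: [K, C] class
    embeddings. Returns [K, H, W] cosine-similarity logits per pixel."""
    c, h, w = dense_feats.shape
    feats = F.normalize(dense_feats.reshape(c, h * w), dim=0)  # [C, HW], unit norm per pixel
    texts = F.normalize(text_embeds, dim=1)                    # [K, C], unit norm per class
    logits = temperature * (texts @ feats)                     # [K, HW] scaled cosine similarity
    return logits.reshape(-1, h, w)                            # [K, H, W]

# Toy usage with random tensors standing in for real CLIP outputs.
dense_feats = torch.randn(512, 28, 28)   # hypothetical dense image features
text_embeds = torch.randn(20, 512)       # hypothetical embeddings of 20 class prompts
seg_map = dense_zero_shot_logits(dense_feats, text_embeds).argmax(dim=0)  # [28, 28] label map
```

The scaled cosine similarity mirrors how CLIP scores image-text pairs at the image level (its learned logit scale is roughly 100); applying the same comparison at every spatial location is what turns the pre-trained model into a dense, annotation-free labeler.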


Related research

06/03/2019 - Zero-Shot Semantic Segmentation
Semantic segmentation models are limited in their ability to scale to la...

08/09/2023 - MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation
Recently, semantic segmentation models trained with image-level text sup...

03/08/2023 - CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP
Training a 3D scene understanding model requires complicated human annot...

12/22/2021 - Open-Vocabulary Image Segmentation
We design an open-vocabulary image segmentation model to organize an ima...

03/16/2018 - Dynamic-structured Semantic Propagation Network
Semantic concept hierarchy is still under-explored for semantic segmenta...

02/26/2021 - Recursive Training for Zero-Shot Semantic Segmentation
General purpose semantic segmentation relies on a backbone CNN network t...

11/02/2022 - P^3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection
Inspired by the success of vision-language models (VLMs) in zero-shot c...
