Diffusion Models for Zero-Shot Open-Vocabulary Segmentation

06/15/2023
by Laurynas Karazija, et al.

The variety of objects in the real world is nearly unlimited and is thus impossible to capture using models trained on a fixed set of categories. As a result, in recent years, open-vocabulary methods have attracted the interest of the community. This paper proposes a new method for zero-shot open-vocabulary segmentation. Prior work largely relies on contrastive training using image-text pairs, leveraging grouping mechanisms to learn image features that are both aligned with language and well-localised. This, however, can introduce ambiguity, as the visual appearance of images with similar captions often varies. Instead, we leverage the generative properties of large-scale text-to-image diffusion models to sample a set of support images for a given textual category. This provides a distribution of appearances for a given text, circumventing the ambiguity problem. We further propose a mechanism that considers the contextual background of the sampled images to better localise objects and segment the background directly. We show that our method can be used to ground several existing pre-trained self-supervised feature extractors in natural language and provide explainable predictions by mapping back to regions in the support set. Our proposal is training-free, relying on pre-trained components only, yet shows strong performance on a range of open-vocabulary segmentation benchmarks, obtaining a lead of more than 10% on the Pascal VOC benchmark.
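The abstract outlines the core recipe: sample a support set of images for each textual category with a pre-trained diffusion model, extract features from the support set and the query image with a frozen self-supervised backbone, and label query regions by their similarity to per-category feature prototypes. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's actual pipeline: the checkpoints (Stable Diffusion v1.5, DINO ViT-B/16), the prompt template, the patch-level prototype averaging, and the nearest-prototype assignment are all assumptions, and the paper's contextual-background mechanism is omitted.

```python
# Hypothetical sketch of the idea described in the abstract:
#   1) sample support images for each textual category with a pre-trained
#      text-to-image diffusion model,
#   2) extract dense patch features from the support set and the query image
#      with a frozen self-supervised backbone,
#   3) label each query patch by its most similar category prototype.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline
from transformers import AutoImageProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pre-trained, frozen components only; nothing is fine-tuned.
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
processor = AutoImageProcessor.from_pretrained("facebook/dino-vitb16")
backbone = AutoModel.from_pretrained("facebook/dino-vitb16").to(device).eval()

def patch_features(images):
    """L2-normalised per-patch features from the frozen backbone."""
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        tokens = backbone(**inputs).last_hidden_state[:, 1:]  # drop the CLS token
    return F.normalize(tokens, dim=-1)                          # (B, N, D)

def category_prototypes(categories, n_support=4):
    """Sample a support set per category and average its patch features."""
    protos = []
    for name in categories:
        support = sd(prompt=f"a photo of a {name}",
                     num_images_per_prompt=n_support).images
        protos.append(patch_features(support).mean(dim=(0, 1)))  # (D,)
    return F.normalize(torch.stack(protos), dim=-1)              # (C, D)

def segment(query_image, prototypes):
    """Assign each query patch to the nearest category prototype."""
    feats = patch_features([query_image])[0]   # (N, D)
    sims = feats @ prototypes.T                # cosine similarity, (N, C)
    return sims.argmax(dim=-1)                 # per-patch category indices
```

Given, say, categories = ["cat", "dog", "background"] and a PIL query image, segment(img, category_prototypes(categories)) returns a coarse per-patch labelling that could be upsampled to pixel resolution. Here "background" is treated as just another prompt; the dedicated background handling and the grounding of other feature extractors described in the paper are not modelled.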

