Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models

10/27/2022
by   Chaofan Ma, et al.
8

When trained at a sufficient scale, self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks. In this paper, we investigate simple, yet effective approaches for adapting the pre-trained foundation models to the downstream task of interest, namely, open-vocabulary semantic segmentation. To this end, we make the following contributions: (i) we introduce Fusioner, with a lightweight, transformer-based fusion module, that pairs the frozen visual representation with language concept through a handful of image segmentation data. As a consequence, the model gains the capability of zero-shot transfer to segment novel categories; (ii) without loss of generality, we experiment on a broad range of self-supervised models that have been pre-trained with different schemes, e.g. visual-only models (MoCo v3, DINO), language-only models (BERT), visual-language model (CLIP), and show that, the proposed fusion approach is effective to any pair of visual and language models, even those pre-trained on a corpus of uni-modal data; (iii) we conduct thorough ablation studies to analyze the critical components in our proposed Fusioner, while evaluating on standard benchmarks, e.g. PASCAL-5i and COCO-20i , it surpasses existing state-of-the-art models by a large margin, despite only being trained on frozen visual and language features; (iv) to measure the model's robustness on learning visual-language correspondence, we further evaluate on synthetic dataset, named Mosaic-4, where images are constructed by mosaicking the samples from FSS-1000. Fusioner demonstrates superior performance over previous models.

READ FULL TEXT

page 1

page 4

page 8

page 10

page 20

page 21

research
12/15/2021

Decoupling Zero-Shot Semantic Segmentation

Zero-shot semantic segmentation (ZS3) aims to segment the novel categori...
research
05/03/2023

Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime

Large-scale visual language models are widely used as pre-trained models...
research
09/07/2021

Self-supervised Tumor Segmentation through Layer Decomposition

In this paper, we propose a self-supervised approach for tumor segmentat...
research
06/15/2023

Diffusion Models for Zero-Shot Open-Vocabulary Segmentation

The variety of objects in the real world is nearly unlimited and is thus...
research
03/21/2023

Multi-modal Prompting for Low-Shot Temporal Action Localization

In this paper, we consider the problem of temporal action localization u...
research
06/15/2023

Robustness Analysis on Foundational Segmentation Models

Due to the increase in computational resources and accessibility of data...
research
05/06/2023

Prompt What You Need: Enhancing Segmentation in Rainy Scenes with Anchor-based Prompting

Semantic segmentation in rainy scenes is a challenging task due to the c...

Please sign up or login with your details

Forgot password? Click here to reset