ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts

01/28/2023
by Kwanyoung Kim, et al.

The recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has shown great promise for zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning of the CLIP module. Here, we present a cost-effective strategy based on text-prompt learning that keeps the entire CLIP module frozen while fully leveraging its rich information. Specifically, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport, which allows each text prompt to efficiently focus on specific semantic attributes. Additionally, we propose Deep Local Feature Alignment (DLFA), which deeply aligns the text prompts with intermediate local features from the frozen image encoder layers and significantly boosts zero-shot segmentation performance. Through extensive experiments on benchmark datasets, we show that our method achieves state-of-the-art (SOTA) performance while using 7x fewer parameters than previous SOTA approaches.
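To make the matching step concrete, the sketch below shows one generic way to compute an entropy-regularized optimal transport plan (Sinkhorn iterations) between pixel embeddings and several text-prompt embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names `cosine_cost` and `sinkhorn`, the uniform marginals, the cosine-distance cost, and the values of `eps` and `n_iters` are all hypothetical choices made here, and the abstract does not specify how ZegOT builds or uses the plan.

```python
import math


def cosine_cost(pixels, prompts):
    """Cost matrix C[i][j] = 1 - cosine similarity(pixel i, prompt j).

    Assumes every embedding is a non-zero vector (hypothetical helper).
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    px = [unit(p) for p in pixels]
    tx = [unit(t) for t in prompts]
    return [[1.0 - sum(a * b for a, b in zip(p, t)) for t in tx] for p in px]


def sinkhorn(cost, eps=0.1, n_iters=100):
    """Entropy-regularized OT plan between uniform marginals.

    cost: M x N list of lists (M pixels, N prompts).
    eps and n_iters are illustrative defaults, not values from the paper.
    """
    m, n = len(cost), len(cost[0])
    mu = [1.0 / m] * m          # assumed uniform mass over pixels
    nu = [1.0 / n] * n          # assumed uniform mass over prompts
    # Gibbs kernel: low cost -> large kernel value
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals
        u = [mu[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [nu[j] / sum(K[i][j] * u[i] for i in range(m)) for j in range(n)]
    # Transport plan T = diag(u) K diag(v)
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]


# Toy example: four 2-D "pixel" embeddings and two "prompt" embeddings
pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
prompts = [[1.0, 0.0], [0.0, 1.0]]
plan = sinkhorn(cosine_cost(pixels, prompts))
for row in plan:
    print([round(t, 4) for t in row])
```

Each row of `plan` says how one pixel's mass is split across the prompts, so a prompt ends up receiving most of its mass from the pixels it resembles. In this toy setup the column constraint forces every prompt to account for an equal share of the image, which is one simple way to keep multiple prompts from collapsing onto the same region.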


research
08/16/2020

Context-aware Feature Generation for Zero-shot Semantic Segmentation

Existing semantic segmentation models heavily rely on dense pixel-wise a...
research
01/10/2022

Language-driven Semantic Segmentation

We present LSeg, a novel model for language-driven semantic image segmen...
research
10/20/2019

An Optimal Transport Framework for Zero-Shot Learning

We present an optimal transport (OT) framework for generalized zero-shot...
research
08/27/2021

SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation

Unlike conventional zero-shot classification, zero-shot semantic segment...
research
12/17/2021

Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation

Traditional computer vision models are trained to predict a fixed set of...
research
08/19/2021

Few-shot Segmentation with Optimal Transport Matching and Message Flow

We address the challenging task of few-shot segmentation in this work. I...
research
01/13/2022

CLIP-Event: Connecting Text and Images with Event Structures

Vision-language (V+L) pretraining models have achieved great success in ...
