Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision

01/22/2023
by Jilan Xu, et al.

In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes rather than a pre-defined, closed set of categories. The main contributions are as follows. First, we propose a transformer-based model for OVS, termed OVSegmentor, which exploits only web-crawled image-text pairs for pre-training and uses no mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention based binding module, and aligns the group tokens to the corresponding caption embedding. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given the group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities. The latter enforces consistent mask predictions between images that contain shared entities, which encourages the model to learn visual invariance. Third, we construct the CC4M dataset for pre-training by filtering CC12M for frequently appearing entities, which significantly improves training efficiency. Fourth, we perform zero-shot transfer on three benchmark datasets: PASCAL VOC 2012, PASCAL Context, and COCO Object. Our model achieves superior segmentation results over the state-of-the-art method while using only 3% of the data (4M vs 134M) for pre-training. Code and pre-trained models will be released for future research.
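The grouping step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it is a simplified slot-attention-style update, assuming random features in place of real patch and text embeddings, and it omits the learned projections and recurrent update that a full slot-attention module would use. The function names `bind_patches_to_groups` and `label_patches` are hypothetical, introduced only for this illustration.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bind_patches_to_groups(patches, groups, iters=3, eps=1e-8):
    """Simplified slot-attention-style binding (hypothetical helper).

    patches: (N, D) patch features; groups: (K, D) initial group tokens.
    Returns the updated group tokens (K, D) and the attention map (K, N).
    """
    d = patches.shape[1]
    for _ in range(iters):
        logits = groups @ patches.T / np.sqrt(d)  # (K, N) similarities
        # Softmax over the group axis: groups compete for each patch.
        attn = softmax(logits, axis=0)
        # Normalize per group so the update is a weighted mean of patches.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        groups = weights @ patches
    return groups, attn


def label_patches(groups, attn, class_embeds, eps=1e-8):
    """Zero-shot labeling sketch (hypothetical helper).

    Each group takes the class whose embedding has the highest cosine
    similarity; each patch inherits the class of its strongest group.
    """
    g = groups / (np.linalg.norm(groups, axis=1, keepdims=True) + eps)
    t = class_embeds / (np.linalg.norm(class_embeds, axis=1, keepdims=True) + eps)
    group_to_class = (g @ t.T).argmax(axis=1)  # (K,)
    patch_to_group = attn.argmax(axis=0)       # (N,)
    return group_to_class[patch_to_group]      # (N,)


# Toy inputs (assumed sizes): 196 patches, 8 groups, 3 class names, 64 dims.
rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))
init_groups = rng.normal(size=(8, 64))
class_embeds = rng.normal(size=(3, 64))

groups, attn = bind_patches_to_groups(patches, init_groups)
labels = label_patches(groups, attn, class_embeds)
```

Reshaping `labels` to the patch grid (14 x 14 here) would give a coarse segmentation map. In the setting the abstract describes, the class embeddings would come from a text encoder applied to the category names, and the group tokens would be learned during pre-training rather than sampled at random.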


research
11/27/2022

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Recently, the contrastive language-image pre-training, e.g., CLIP, has d...
research
08/09/2023

MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation

Recently, semantic segmentation models trained with image-level text sup...
research
08/18/2022

Open-Vocabulary Panoptic Segmentation with MaskCLIP

In this paper, we tackle a new computer vision task, open-vocabulary pan...
research
10/09/2022

Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

Open-vocabulary semantic segmentation aims to segment an image into sema...
research
03/25/2023

IFSeg: Image-free Semantic Segmentation via Vision-Language Model

Vision-language (VL) pre-training has recently gained much attention for...
research
04/16/2022

Language-Grounded Indoor 3D Semantic Segmentation in the Wild

Recent advances in 3D semantic segmentation with deep neural networks ha...
research
12/16/2022

Attentive Mask CLIP

Image token removal is an efficient augmentation strategy for reducing t...
