Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation

01/02/2023
by Jianzong Wu, et al.

In this work, we focus on instance-level open vocabulary segmentation, aiming to extend a segmenter to novel categories at the instance level without mask annotations. We investigate a simple yet effective framework that uses image captions, exploiting the thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution built on two components: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset under two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8 on novel classes without extra caption data. Our method also achieves improvements of over 15 PQ for novel classes on the OSPS benchmark under various settings.

research
11/12/2019

Equalization Loss for Large Vocabulary Instance Segmentation

Recent object detection and instance segmentation tasks mainly focus on ...
research
07/12/2023

OG: Equip vision occupancy with instance segmentation and visual grounding

Occupancy prediction tasks focus on the inference of both geometry and s...
research
03/29/2023

Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations

Existing instance segmentation models learn task-specific information us...
research
08/01/2023

Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

Open-world instance-level scene understanding aims to locate and recogni...
research
01/03/2023

Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation

Few Shot Instance Segmentation (FSIS) requires models to detect and segm...
research
04/01/2021

The surprising impact of mask-head architecture on novel class segmentation

Instance segmentation models today are very accurate when trained on lar...
research
03/20/2023

Open-vocabulary Panoptic Segmentation with Embedding Modulation

Open-vocabulary image segmentation is attracting increasing attention du...
