MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation

by   Kaixin Cai, et al.

Recently, semantic segmentation models trained with image-level text supervision have shown promising results in challenging open-world scenarios. However, these models still face difficulties in learning fine-grained semantic alignment at the pixel level and predicting accurate object masks. To address this issue, we propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation that enhances a model's ability to reorganize patches mixed across images, exploring both local visual relevance and global semantic coherence. Our approach involves generating fine-grained patch-text pairs data by mixing image patches while preserving the correspondence between patches and text. The model is then trained to minimize the segmentation loss of the mixed images and the two contrastive losses of the original and restored features. With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability, which is crucial for open-world segmentation. After training with large-scale image-text data, MixReorg models can be applied directly to segment visual objects of arbitrary categories, without the need for further fine-tuning. Our proposed framework demonstrates strong performance on popular zero-shot semantic segmentation benchmarks, outperforming GroupViT by significant margins of 5.0 PASCAL Context, MS COCO, and ADE20K, respectively.


page 2

page 4

page 8


DenseCLIP: Extract Free Dense Labels from CLIP

Contrastive Language-Image Pre-training (CLIP) has made a remarkable bre...

Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding

To bridge the gap between supervised semantic segmentation and real-worl...

Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs

We tackle open-world semantic segmentation, which aims at learning to se...

Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision

In this paper, we consider the problem of open-vocabulary semantic segme...

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Recently, the contrastive language-image pre-training, e.g., CLIP, has d...

Print Defect Mapping with Semantic Segmentation

Efficient automated print defect mapping is valuable to the printing ind...

FreeSeg: Free Mask from Interpretable Contrastive Language-Image Pretraining for Semantic Segmentation

Fully supervised semantic segmentation learns from dense masks, which re...

Please sign up or login with your details

Forgot password? Click here to reset