Towards Label-free Scene Understanding by Vision Foundation Models

06/06/2023
by   Runnan Chen, et al.

Vision foundation models such as Contrastive Vision-Language Pre-training (CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot performance on image classification and segmentation tasks. However, combining CLIP and SAM for label-free scene understanding has yet to be explored. In this paper, we investigate the potential of vision foundation models for enabling networks to comprehend the 2D and 3D worlds without labelled data. The primary challenge lies in effectively supervising networks under extremely noisy pseudo labels, which are generated by CLIP and whose noise is further exacerbated during propagation from the 2D to the 3D domain. To tackle these challenges, we propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously. In particular, we introduce a prediction consistency regularization to co-train the 2D and 3D networks, and further impose latent-space consistency between the networks using SAM's robust feature representation. Experiments conducted on diverse indoor and outdoor datasets demonstrate the superior performance of our method in understanding 2D and 3D open environments. Our 2D and 3D networks achieve label-free semantic segmentation with 28.4% mIoU; on the nuScenes dataset, our performance is 26.8% mIoU, an improvement of 6% (https://github.com/runnanchen/Label-Free-Scene-Understanding).
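
To make the two consistency terms in the abstract concrete, the following is a minimal Python/PyTorch sketch, not the authors' released implementation. All names, tensor shapes, loss choices (symmetric KL for prediction consistency, cosine distance to frozen SAM features for latent consistency), and weights are assumptions made purely for illustration; the paper's exact formulation may differ.

import torch
import torch.nn.functional as F

def prediction_consistency(logits_2d: torch.Tensor, logits_3d: torch.Tensor) -> torch.Tensor:
    # Hypothetical co-training term: symmetric KL divergence between the 2D and 3D
    # networks' class distributions on paired pixel-point correspondences.
    p2d = F.log_softmax(logits_2d, dim=-1)
    p3d = F.log_softmax(logits_3d, dim=-1)
    kl_2d_3d = F.kl_div(p2d, p3d.exp(), reduction="batchmean")
    kl_3d_2d = F.kl_div(p3d, p2d.exp(), reduction="batchmean")
    return 0.5 * (kl_2d_3d + kl_3d_2d)

def latent_consistency(feat_2d: torch.Tensor, feat_3d: torch.Tensor,
                       sam_feat: torch.Tensor) -> torch.Tensor:
    # Hypothetical latent-space term: pull both networks' embeddings toward
    # pre-extracted SAM features (treated as a frozen teacher) via cosine distance.
    sam_feat = sam_feat.detach()
    loss_2d = 1.0 - F.cosine_similarity(feat_2d, sam_feat, dim=-1).mean()
    loss_3d = 1.0 - F.cosine_similarity(feat_3d, sam_feat, dim=-1).mean()
    return loss_2d + loss_3d

# Example combined objective; the 0.1 weight is a placeholder, not a reported value:
# loss = prediction_consistency(logits_2d, logits_3d) + 0.1 * latent_consistency(f2d, f3d, f_sam)

The sketch assumes pseudo labels from CLIP drive the main segmentation losses elsewhere, while these two terms regularize the networks against each other and against SAM's representation.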
