Language-driven Open-Vocabulary 3D Scene Understanding

11/29/2022
by Runyu Ding, et al.

Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. The recent breakthrough in 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios because large-scale 3D-text pairs are not available. To this end, we propose to distill the knowledge encoded in pre-trained vision-language (VL) foundation models by captioning multi-view images of 3D scenes, which allows 3D data to be explicitly associated with semantically rich captions. Further, to facilitate coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs, leveraging geometric constraints between 3D scenes and multi-view images. Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks. Our method not only outperforms baseline methods by 25.8% and 14.5% or more on open-vocabulary semantic and instance segmentation, respectively, but also shows robust transferability on challenging zero-shot domain transfer tasks. Code will be available at https://github.com/CVMI-Lab/PLA.
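
To make the contrastive step concrete, here is a minimal sketch in plain Python of how pooled 3D features could be aligned with caption embeddings, and how the aligned features could then be labeled with arbitrary category names. It is an illustration under assumptions, not the authors' implementation: the symmetric InfoNCE-style loss, the temperature of 0.07, and the helper names (`contrastive_loss`, `classify_open_vocab`) are all hypothetical choices, and the feature vectors stand in for the outputs of a 3D backbone and a text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(scene_feats, caption_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss over matched (3D feature, caption) pairs.

    Assumption: scene_feats[i] is the pooled 3D feature paired with
    caption_feats[i]; every other caption in the batch is a negative.
    """
    if not scene_feats or len(scene_feats) != len(caption_feats):
        raise ValueError("need the same non-zero number of 3D and caption features")
    s = [l2_normalize(v) for v in scene_feats]
    c = [l2_normalize(v) for v in caption_feats]
    n = len(s)
    # Cosine similarities scaled by the temperature.
    logits = [[dot(s[i], c[j]) / temperature for j in range(n)] for i in range(n)]
    # 3D -> text direction: each 3D feature should pick its own caption.
    loss_3d_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> 3D direction: each caption should pick its own 3D feature.
    transposed = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_text_to_3d = -sum(log_softmax(transposed[j])[j] for j in range(n)) / n
    return 0.5 * (loss_3d_to_text + loss_text_to_3d)


def classify_open_vocab(point_feat, category_embeddings):
    """Return the category whose text embedding is most similar to the feature."""
    p = l2_normalize(point_feat)
    scores = {
        name: dot(p, l2_normalize(emb))
        for name, emb in category_embeddings.items()
    }
    return max(scores, key=scores.get)


# Toy usage with 2-D stand-in embeddings.
scene = [[1.0, 0.0], [0.0, 1.0]]
aligned_captions = [[1.0, 0.0], [0.0, 1.0]]
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = contrastive_loss(scene, aligned_captions)
swapped_loss = contrastive_loss(scene, swapped_captions)
label = classify_open_vocab([0.9, 0.1], {"chair": [1.0, 0.0], "sofa": [0.0, 1.0]})
```

Because inference only compares a 3D feature against text embeddings, the category list passed to `classify_open_vocab` can contain names that never appeared as annotations during training, which is the property the open-vocabulary setting relies on.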


