ConceptFusion: Open-set Multimodal 3D Mapping

Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels or, in recent work, text prompts. We address both these issues with ConceptFusion, a scene representation that is (i) fundamentally open-set, enabling reasoning beyond a closed set of concepts, and (ii) inherently multimodal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today's foundation models, pre-trained on internet-scale data, to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning without any additional training or finetuning, and retains long-tailed concepts better than supervised approaches, outperforming them by more than 40%. We evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping. For more information, visit our project page https://concept-fusion.github.io or watch our 5-minute explainer video https://www.youtube.com/watch?v=rkXgws8fiDs
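To make the two core ideas in the abstract concrete, here is a minimal sketch of (a) fusing pixel-aligned open-set features into a 3D point map by multi-view averaging, and (b) querying the fused map with an embedding from any modality via cosine similarity. This is not the authors' implementation: the function names (`fuse_frame_features`, `query_map`), the running-average fusion rule, and the assumption that per-pixel features and pixel-to-point data association (e.g. from SLAM depth back-projection) are already available are illustrative simplifications.

```python
import numpy as np

def fuse_frame_features(point_feats, point_counts, pixel_feats, pixel_to_point):
    """Fuse one frame's pixel-aligned features into the 3D map (sketch).

    point_feats    : (N, D) accumulated feature per map point
    point_counts   : (N,)   number of observations per map point
    pixel_feats    : (P, D) open-set features for pixels in the current frame
                     (e.g. pixel-aligned CLIP embeddings, assumed precomputed)
    pixel_to_point : (P,)   index of the map point each pixel projects to
                     (data association assumed given by the SLAM pipeline)
    """
    for feat, idx in zip(pixel_feats, pixel_to_point):
        point_counts[idx] += 1
        # incremental mean: constant memory regardless of how many views are fused
        point_feats[idx] += (feat - point_feats[idx]) / point_counts[idx]
    return point_feats, point_counts

def query_map(point_feats, query_embedding):
    """Score every map point against a query embedding (text, image, or audio,
    as long as it lives in the same embedding space) via cosine similarity."""
    pf = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + 1e-8)
    q = query_embedding / (np.linalg.norm(query_embedding) + 1e-8)
    return pf @ q

# Toy usage with random data: 1000 map points, 512-D features, one 200-pixel frame.
N, D, P = 1000, 512, 200
feats, counts = np.zeros((N, D), np.float32), np.zeros(N, np.float32)
frame_feats = np.random.randn(P, D).astype(np.float32)
assoc = np.random.randint(0, N, size=P)
feats, counts = fuse_frame_features(feats, counts, frame_feats, assoc)
scores = query_map(feats, np.random.randn(D).astype(np.float32))
top_points = np.argsort(-scores)[:10]  # map points most relevant to the query
```

Because the map stores modality-agnostic embeddings rather than class labels, the same `query_map` call serves text, image, audio, or click-based queries, provided the query is embedded into the same feature space.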


