LanguageRefer: Spatial-Language Model for 3D Visual Grounding

07/07/2021
by Junha Roh, et al.

For robots to understand human instructions and perform meaningful tasks in the near future, they need learned models that comprehend referential language well enough to identify common objects in real-world 3D scenes. In this paper, we develop a spatial-language model for the 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of a point cloud with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model identifies the target object from the set of candidates. Our spatial-language model uses a transformer-based architecture that combines spatial embeddings from bounding boxes with fine-tuned language embeddings from DistilBERT, and reasons among the objects in the 3D scene to find the target object. We show that our model performs competitively on the visio-linguistic datasets proposed by ReferIt3D. We also analyze performance on spatial reasoning tasks decoupled from perception noise, the effect of view-dependent utterances on accuracy, and viewpoint annotations for potential robotics applications.
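The input/output interface described above (candidate boxes plus an utterance in, one selected candidate out) can be illustrated with a small sketch. This is not the paper's model: the transformer and the DistilBERT embedding are replaced by a hand-written rule for a single utterance pattern, "the <target> closest to the <anchor>", and all names here (`Candidate`, `spatial_feature`, `ground`) are hypothetical. It assumes class labels for the candidates are already available.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str     # class name of the candidate (assumed to be given)
    center: tuple  # (x, y, z) of the bounding-box center
    size: tuple    # (width, length, height) of the bounding box


def spatial_feature(c):
    """Raw 6-D box feature (center + size) that a learned spatial
    embedding layer could consume; unused by the toy rule below."""
    return list(c.center) + list(c.size)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ground(target_label, anchor_label, candidates):
    """Toy stand-in for the learned model (an assumption, not the
    paper's method): score each candidate for the utterance
    'the <target> closest to the <anchor>' and return the index of
    the best candidate together with the per-candidate probabilities."""
    anchor = next(c for c in candidates if c.label == anchor_label)
    scores = []
    for c in candidates:
        if c.label != target_label:
            scores.append(-1e9)  # effectively rules out wrong classes
        else:
            scores.append(-math.dist(c.center, anchor.center))
    probs = softmax(scores)
    best = max(range(len(candidates)), key=probs.__getitem__)
    return best, probs


# A made-up scene: one table and two chairs.
scene = [
    Candidate("table", (0.0, 0.0, 0.4), (1.2, 0.8, 0.8)),
    Candidate("chair", (0.5, 0.0, 0.45), (0.5, 0.5, 0.9)),
    Candidate("chair", (3.0, 1.0, 0.45), (0.5, 0.5, 0.9)),
]

# "the chair closest to the table"
idx, probs = ground("chair", "table", scene)
```

In the model described in the abstract, the hand-written distance rule would be replaced by learned reasoning over all objects and the utterance; the final step, choosing the highest-scoring candidate from a fixed set, stays the same.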


