Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

09/08/2023
by   Ozan Unal, et al.

3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description. With applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen in popularity. A common formulation for 3D visual grounding is grounding-by-detection, where localization is done via bounding boxes. However, for real-life applications that require physical interaction, a bounding box insufficiently describes the geometry of an object. We therefore tackle the problem of dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose a dense 3D grounding network, ConcreteNet, featuring three novel stand-alone modules that aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that disambiguates inter-instance relational cues; next, we construct a contrastive training scheme to induce separation in the latent space; and finally, we resolve view-dependent utterances via a learned global camera token. ConcreteNet ranks 1st on the challenging ScanRefer online benchmark by a considerable margin (+9.43) and has won the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge.
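The abstract does not detail the contrastive training scheme, only that it induces separation in the latent space so that the referred instance can be told apart from same-class distractors. As a minimal illustrative sketch (not the paper's actual loss), a hypothetical margin-based pairwise objective over instance embeddings could look like this; the function name, the margin value, and the pairwise formulation are all assumptions:

```python
import numpy as np

def contrastive_separation_loss(embeddings, labels, margin=0.5):
    """Hypothetical pairwise contrastive loss: pull embeddings with the
    same label (same instance) together, and push embeddings with
    different labels (e.g. same-class distractors) at least `margin`
    apart. `margin=0.5` is an illustrative choice, not from the paper."""
    n = len(embeddings)
    # L2-normalize so pairwise distances are bounded in [0, 2]
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(emb[i] - emb[j])
            if labels[i] == labels[j]:
                loss += d ** 2                     # attract positive pairs
            else:
                loss += max(0.0, margin - d) ** 2  # repel pairs inside the margin
            pairs += 1
    return loss / pairs
```

Under this sketch, embeddings of distinct instances that sit closer than the margin incur a penalty, which is one common way to encourage the latent separation the abstract describes.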


