ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding

03/23/2023
by   Ziyang Lu, et al.

Aiming to link natural language descriptions to specific regions in a 3D scene represented as 3D point clouds, 3D visual grounding is a fundamental task for human-robot interaction. Recognition errors can significantly reduce overall accuracy and degrade the operation of AI systems. Despite their effectiveness, existing methods suffer from low recognition accuracy when multiple adjacent objects have similar appearances. To address this issue, this work introduces human-robot interaction as a cue to facilitate the development of 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this purpose. A new dataset called ScanERU is then constructed to evaluate the effectiveness of this idea. Unlike existing datasets, ScanERU is the first to cover semi-synthetic scene integration with textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to motivate further research on ERU. Experimental results demonstrate the superiority of the proposed method, especially in the recognition of multiple identical objects. Our code and dataset will be made publicly available.
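The abstract describes an attention-based framework that grounds a textual/gestural cue to one of several candidate objects in a point-cloud scene. The paper's actual architecture is not given here, so the following is only a minimal sketch of the general idea, assuming a single fused cue embedding attending over per-object proposal features via scaled dot-product cross-attention; all names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries attend over keys/values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)       # rows sum to 1
    return weights @ values, weights

# Hypothetical inputs: 8 object proposals from the scene, 64-dim features
rng = np.random.default_rng(0)
obj_feats = rng.normal(size=(8, 64))   # candidate object embeddings
cue = rng.normal(size=(1, 64))         # fused text + gesture embedding (assumed)

fused, attn = cross_attention(cue, obj_feats, obj_feats)
pred = int(attn.argmax())              # index of the most likely referent
```

In this sketch, the gestural information (e.g., a pointing direction) would bias the cue embedding toward the intended object, which is how embodied reference could disambiguate multiple identical objects; the real model presumably learns these embeddings end to end.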

