Multi-View Transformer for 3D Visual Grounding

04/05/2022
by Shijia Huang, et al.

The 3D visual grounding task aims to ground a natural language description to the targeted object in a 3D scene, which is usually represented in 3D point clouds. Previous works studied visual grounding under specific views, and the vision-language correspondence learned in this way can easily fail once the view changes. In this paper, we propose a Multi-View Transformer (MVT) for 3D visual grounding. We project the 3D scene to a multi-view space, in which the position information of the 3D scene under different views is modeled simultaneously and aggregated together. The multi-view space enables the network to learn a more robust multi-modal representation for 3D visual grounding and eliminates the dependence on specific views. Extensive experiments show that our approach significantly outperforms all state-of-the-art methods. Specifically, on the Nr3D and Sr3D datasets, our method outperforms the best competitor by 11.2% and outperforms recent work with extra 2D assistance by 5.9%. Our code is available at https://github.com/sega-hsj/MVT-3DVG.
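The projection-and-aggregate idea described in the abstract can be sketched compactly. The PyTorch snippet below is a minimal illustration under stated assumptions, not the authors' implementation (the released repo differs in detail): object center coordinates are rotated around the vertical axis into several views, each view's positions are encoded and added to the object features, and the per-view results are averaged so the fused representation no longer depends on any single view. All module and variable names here are hypothetical.

```python
import math
import torch
import torch.nn as nn

class MultiViewPositionFusion(nn.Module):
    """Minimal sketch of multi-view position aggregation (hypothetical
    module; see the official repo for the authors' actual model)."""

    def __init__(self, feat_dim: int, num_views: int = 4):
        super().__init__()
        self.num_views = num_views
        # Per-view position encoder: maps 3D coordinates into feature space.
        self.pos_encoder = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )

    def forward(self, obj_feats: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # obj_feats: (B, K, D) features of K candidate objects
        # centers:   (B, K, 3) object center coordinates in the scene
        fused = []
        for i in range(self.num_views):
            angle = 2.0 * math.pi * i / self.num_views
            c, s = math.cos(angle), math.sin(angle)
            # Rotate object centers around the vertical (z) axis into view i.
            rot = torch.tensor([[c, -s, 0.0],
                                [s,  c, 0.0],
                                [0.0, 0.0, 1.0]],
                               dtype=centers.dtype, device=centers.device)
            rotated = centers @ rot.T
            # Inject this view's position information into the object features.
            fused.append(obj_feats + self.pos_encoder(rotated))
        # Aggregate over views; averaging removes dependence on any single view.
        return torch.stack(fused, dim=0).mean(dim=0)

# Example: 2 scenes, 8 candidate objects, 256-d features, 4 views.
model = MultiViewPositionFusion(feat_dim=256, num_views=4)
out = model(torch.randn(2, 8, 256), torch.randn(2, 8, 3))
print(out.shape)  # torch.Size([2, 8, 256])
```

In the full model, a multi-modal transformer fuses the view-aware object features with the language features before the cross-view aggregation; the sketch collapses that step to keep the projection-and-aggregate idea visible.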


Related research

03/29/2023 · ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance
Understanding 3D scenes from multi-view inputs has been proven to allevi...

11/25/2022 · Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding
The 3D visual grounding task has been explored with visual and language ...

10/07/2019 · Adversarial reconstruction for Multi-modal Machine Translation
Even with the growing interest in problems at the intersection of Comput...

03/01/2021 · InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring
Compared with the visual grounding in 2D images, the natural-language-gu...

10/22/2022 · HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding
This paper tackles an emerging and challenging vision-language task, 3D ...

04/25/2023 · MMRDN: Consistent Representation for Multi-View Manipulation Relationship Detection in Object-Stacked Scenes
Manipulation relationship detection (MRD) aims to guide the robot to gra...

03/14/2021 · Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images
Grounding referring expressions in RGBD image has been an emerging field...
