ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance

03/29/2023
by   Ziyu Guo, et al.

Understanding 3D scenes from multi-view inputs has been proven to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods typically neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding that explores how to grasp view knowledge from both the text and 3D modalities. In the text branch, ViewRefer leverages the diverse linguistic knowledge of large-scale language models, e.g., GPT, to expand a single grounding text into multiple geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views. On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views and enhance the framework from two perspectives: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With our designed paradigm, ViewRefer achieves superior performance on three benchmarks, surpassing the second-best method by +2.8 and +0.73 points. Code is available at https://github.com/ZiyuGuo99/ViewRefer3D.
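The view-guided scoring strategy described above can be illustrated with a minimal sketch: per-view grounding logits are aggregated with weights derived from how well the text feature matches each learnable view prototype. This is our own simplified NumPy reconstruction; the function name, tensor shapes, and dot-product matching are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def view_guided_scores(per_view_logits, text_feat, view_prototypes):
    """Aggregate per-view grounding logits using prototype-based view weights.

    per_view_logits: (V, K) logits for K candidate objects under V views
    text_feat:       (D,)   pooled feature of the grounding text
    view_prototypes: (V, D) learnable scene-agnostic view prototypes

    All shapes and the dot-product matching are illustrative assumptions.
    """
    # weight each view by the affinity between the text and its prototype
    view_weights = softmax(view_prototypes @ text_feat)       # (V,)
    # weighted sum of per-view logits, then normalize over candidates
    fused = (view_weights[:, None] * per_view_logits).sum(0)  # (K,)
    return softmax(fused)                                     # probs over K objects

# usage: 4 views, 3 candidate objects, 8-dim features
rng = np.random.default_rng(0)
probs = view_guided_scores(
    rng.normal(size=(4, 3)), rng.normal(size=8), rng.normal(size=(4, 8))
)
```

Views whose prototypes align poorly with the text contribute little to the final prediction, which is the intuition behind weighting views rather than averaging them uniformly.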


Related research:

research · 04/05/2022 · Multi-View Transformer for 3D Visual Grounding
The 3D visual grounding task aims to ground a natural language descripti...

research · 10/12/2022 · MFFN: Multi-view Feature Fusion Network for Camouflaged Object Detection
Recent research about camouflaged object detection (COD) aims to segment...

research · 12/12/2022 · CTT-Net: A Multi-view Cross-token Transformer for Cataract Postoperative Visual Acuity Prediction
Surgery is the only viable treatment for cataract patients with visual a...

research · 04/08/2023 · POEM: Reconstructing Hand in a Point Embedded Multi-view Stereo
Enable neural networks to capture 3D geometrical-aware features is essen...

research · 08/06/2023 · Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
In recent years, 3D representation learning has turned to 2D vision-lang...

research · 04/29/2023 · ViewFormer: View Set Attention for Multi-view 3D Shape Understanding
This paper presents ViewFormer, a simple yet effective model for multi-v...

research · 09/28/2022 · Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Multimodal transformer exhibits high capacity and flexibility to align i...
