HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding

10/22/2022
by Jiaming Chen, et al.

This paper tackles an emerging and challenging vision-language task: 3D visual grounding on point clouds. Many recent works benefit from the Transformer and its well-known attention mechanism, leading to a tremendous breakthrough on this task. However, we find that they achieve these results through various pre-training schemes or multi-stage processing. To simplify the pipeline, we carefully investigate 3D visual grounding and pose three fundamental questions about how to develop an end-to-end model with high performance for this task. To address these questions, we introduce a novel Hierarchical Attention Model (HAM), which offers multi-granularity representation and efficient augmentation for both the given texts and the multi-modal visual inputs. More importantly, HAM ranks first on the large-scale ScanRefer challenge, outperforming all existing methods by a significant margin. Code will be released after acceptance.

