HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding

10/22/2022
by Jiaming Chen, et al.

This paper tackles an emerging and challenging vision-language task: 3D visual grounding on point clouds. Many recent works benefit from the Transformer's well-known attention mechanism, leading to tremendous breakthroughs on this task. However, we find that they achieve this by relying on various pre-training schemes or multi-stage processing. To simplify the pipeline, we carefully investigate 3D visual grounding and pose three fundamental questions about how to develop a high-performance end-to-end model for this task. To address them, we introduce a novel Hierarchical Attention Model (HAM), which offers multi-granularity representations and efficient augmentation for both the given texts and the multi-modal visual inputs. Most importantly, HAM ranks first on the large-scale ScanRefer challenge, outperforming all existing methods by a significant margin. Code will be released after acceptance.
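To give a rough sense of what multi-granularity attention between text and point-cloud features might look like, here is a minimal PyTorch sketch. This is not the HAM architecture from the paper; the module names, feature shapes, and the concatenation-based fusion are illustrative assumptions only.

```python
# Hypothetical sketch of multi-granularity cross-attention between text
# tokens and point-cloud features. NOT the paper's actual HAM design;
# all names, shapes, and the fusion strategy are assumptions.
import torch
import torch.nn as nn


class MultiGranularityCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One cross-attention block per granularity of the visual input.
        self.fine_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.coarse_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text, fine_pts, coarse_pts):
        # text:       (B, T, dim)  language token features
        # fine_pts:   (B, Nf, dim) per-point (fine-grained) features
        # coarse_pts: (B, Nc, dim) pooled region (coarse) features
        fine_out, _ = self.fine_attn(text, fine_pts, fine_pts)
        coarse_out, _ = self.coarse_attn(text, coarse_pts, coarse_pts)
        # Concatenate the two granularities and project back to dim.
        return self.fuse(torch.cat([fine_out, coarse_out], dim=-1))


if __name__ == "__main__":
    model = MultiGranularityCrossAttention()
    text = torch.randn(2, 20, 256)      # 20 text tokens
    fine = torch.randn(2, 1024, 256)    # 1024 point features
    coarse = torch.randn(2, 32, 256)    # 32 region proposals
    print(model(text, fine, coarse).shape)  # torch.Size([2, 20, 256])
```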
