TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

08/05/2021
by   Dailan He, et al.

The recently proposed fine-grained 3D visual grounding task is essential and challenging: its goal is to identify the 3D object referred to by a natural language sentence among distractor objects of the same category. Existing works usually adopt dynamic graph networks to indirectly model intra-/inter-modal interactions, making it difficult for the model to distinguish the referred object from distractors due to the monolithic representations of visual and linguistic content. In this work, we exploit the Transformer for its natural suitability to permutation-invariant 3D point cloud data and propose a TransRefer3D network that extracts entity-and-relation-aware multimodal context among objects for more discriminative feature learning. Concretely, we devise an Entity-aware Attention (EA) module and a Relation-aware Attention (RA) module to conduct fine-grained cross-modal feature matching. Facilitated by co-attention, the EA module matches visual entity features with linguistic entity features, while the RA module matches pair-wise visual relation features with linguistic relation features. We further integrate the EA and RA modules into an Entity-and-Relation aware Contextual Block (ERCB) and stack several ERCBs to form TransRefer3D for hierarchical multimodal context modeling. Extensive experiments on both the Nr3D and Sr3D datasets demonstrate that our proposed model significantly outperforms existing approaches by up to 10.6%. To the best of our knowledge, this is the first work investigating the Transformer architecture for the fine-grained 3D visual grounding task.
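The abstract does not include code, but the cross-modal matching it describes for the EA module can be sketched as scaled dot-product attention in which visual entity features act as queries over linguistic token features. The function name, feature dimensions, and the use of plain numpy below are illustrative assumptions for a minimal sketch, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, linguistic, d_k):
    # visual: (num_objects, d) queries; linguistic: (num_tokens, d) keys/values.
    # Each visual entity feature is refined by attending over linguistic
    # features -- the basic co-attention step behind an EA-style module.
    scores = visual @ linguistic.T / np.sqrt(d_k)  # (num_objects, num_tokens)
    weights = softmax(scores, axis=-1)             # rows sum to 1
    return weights @ linguistic                    # (num_objects, d)

# Toy example: 4 candidate objects, 6 sentence tokens, 8-dim features.
rng = np.random.default_rng(0)
vis = rng.standard_normal((4, 8))
lang = rng.standard_normal((6, 8))
out = cross_attention(vis, lang, d_k=8)
print(out.shape)  # (4, 8)
```

An RA-style module would apply the same operation with pair-wise relation features (one vector per object pair) in place of the per-object entity features.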


