Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases

07/05/2022
by   Zhihao Yuan, et al.
0

Recent progress on 3D scene understanding has explored visual grounding (3DVG) to localize a target object through a language description. However, existing methods only consider the dependency between the entire sentence and the target object, thus ignoring fine-grained relationships between contexts and non-target ones. In this paper, we extend 3DVG to a more reliable and explainable task, called 3D Phrase Aware Grounding (3DPAG). The 3DPAG task aims to localize the target object in the 3D scenes by explicitly identifying all phrase-related objects and then conducting reasoning according to contextual phrases. To tackle this problem, we label about 400K phrase-level annotations from 170K sentences in available 3DVG datasets, i.e., Nr3D, Sr3D and ScanRefer. By tapping on these developed datasets, we propose a novel framework, i.e., PhraseRefer, which conducts phrase-aware and object-level representation learning through phrase-object alignment optimization as well as phrase-specific pre-training. In our setting, we extend previous 3DVG methods to the phrase-aware scenario and provide metrics to measure the explainability of the 3DPAG task. Extensive results confirm that 3DPAG effectively boosts the 3DVG, and PhraseRefer achieves state-of-the-arts across three datasets, i.e., 63.0 respectively.

READ FULL TEXT

page 2

page 6

page 12

page 14

research
10/23/2022

Extending Phrase Grounding with Pronouns in Visual Dialogues

Conventional phrase grounding aims to localize noun phrases mentioned in...
research
12/06/2021

Learning to Reason from General Concepts to Fine-grained Tokens for Discriminative Phrase Detection

Phrase detection requires methods to identify if a phrase is relevant to...
research
10/12/2020

MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding

Phrase localization is a task that studies the mapping from textual phra...
research
10/06/2022

Video Referring Expression Comprehension via Transformer with Content-aware Query

Video Referring Expression Comprehension (REC) aims to localize a target...
research
01/09/2017

Task-Specific Attentive Pooling of Phrase Alignments Contributes to Sentence Matching

This work studies comparatively two typical sentence matching tasks: tex...
research
04/12/2022

Position-aware Location Regression Network for Temporal Video Grounding

The key to successful grounding for video surveillance is to understand ...
research
11/05/2020

Utilizing Every Image Object for Semi-supervised Phrase Grounding

Phrase grounding models localize an object in the image given a referrin...

Please sign up or login with your details

Forgot password? Click here to reset