Disentangled Motif-aware Graph Learning for Phrase Grounding

04/13/2021
by   Zongshen Mu, et al.

In this paper, we propose a novel graph learning framework for phrase grounding in images. Existing works, which have evolved from sequential to dense graph models, capture coarse-grained context but fail to distinguish the diversity of context among phrases and image regions. In contrast, we pay special attention to the different motifs implied in the context of the scene graph and devise a disentangled graph network to integrate motif-aware contextual information into representations. In addition, we adopt interventional strategies at the feature and structure levels to consolidate and generalize representations. Finally, a cross-modal attention network fuses intra-modal features, and the similarity between each phrase and the candidate regions is computed to select the best-grounded region. We validate the efficiency of the disentangled and interventional graph network (DIGN) through a series of ablation studies, and our model achieves state-of-the-art performance on the Flickr30K Entities and ReferIt Game benchmarks.
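The final selection step, where each phrase is scored against the regions and the best match is kept, can be illustrated with a minimal sketch. This is not the DIGN model: the abstract describes a learned cross-modal attention network, whereas the sketch below substitutes plain cosine similarity followed by an argmax. The function names `cosine` and `ground_phrases` and the toy feature values are hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ground_phrases(phrase_feats, region_feats):
    """For each phrase, return the index of its most similar region.

    Assumption: similarity is plain cosine similarity, a stand-in for
    the learned cross-modal attention described in the abstract.
    """
    return [
        max(range(len(region_feats)), key=lambda j: cosine(p, region_feats[j]))
        for p in phrase_feats
    ]


# Toy features (hypothetical values, not taken from the paper).
phrases = [[1.0, 0.0], [0.0, 1.0]]
regions = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
best = ground_phrases(phrases, regions)
print(best)
```

In a real system the phrase and region features would come from the graph networks applied to each modality, and the similarity function would be learned rather than fixed.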


