Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing

03/03/2019
by Xihui Liu, et al.

Referring expression grounding aims at locating specific objects or persons in an image given a referring expression, where the key challenge is to comprehend and align various types of information from the visual and textual domains, such as visual attributes, location, and interactions with surrounding regions. Although the attention mechanism has been successfully applied for cross-modal alignment, previous attention models focus only on the most dominant features of both modalities and neglect the fact that there could be multiple comprehensive textual-visual correspondences between images and referring expressions. To tackle this issue, we design a novel cross-modal attention-guided erasing approach, which discards the most dominant information from either the textual or visual domain to generate difficult training samples online and drive the model to discover complementary textual-visual correspondences. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance on three referring expression grounding datasets.
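The erasing idea described above can be sketched in a few lines. The following is a minimal illustration, assuming PyTorch-style tensors; the function name, tensor shapes, and hard-masking scheme are illustrative assumptions, not the authors' released implementation.

import torch

def erase_most_attended(features: torch.Tensor, attention: torch.Tensor) -> torch.Tensor:
    """Zero out the most-attended element to create a harder training sample.

    features:  (batch, num_elements, feat_dim)  word or region features
    attention: (batch, num_elements)            cross-modal attention weights
    """
    top_idx = attention.argmax(dim=1)                    # most dominant word/region per sample
    mask = torch.ones_like(features)
    mask[torch.arange(features.size(0)), top_idx] = 0.0  # erase that element's features
    return features * mask

# Hypothetical usage: 4 expressions, 12 words, 512-d embeddings.
words = torch.randn(4, 12, 512)
attn = torch.softmax(torch.randn(4, 12), dim=1)
hard_words = erase_most_attended(words, attn)

The erased sample would then be fed back as an additional, more difficult training example, so the model is pushed to exploit complementary textual-visual cues rather than only the single most dominant one.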


Related research

04/19/2020 - Relationship-Embedded Representation Learning for Grounding Referring Expressions
Grounding referring expressions in images aims to locate the object inst...

01/18/2022 - Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching
Referring expression grounding is an important and challenging task in c...

01/09/2023 - Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network
Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding ...

06/11/2019 - Cross-Modal Relationship Inference for Grounding Referring Expressions
Grounding referring expressions is a fundamental yet challenging task fa...

08/07/2017 - Identity-Aware Textual-Visual Matching with Latent Co-attention
Textual-visual matching aims at measuring similarities between sentence ...

03/12/2022 - Differentiated Relevances Embedding for Group-based Referring Expression Comprehension
Referring expression comprehension (REC) aims to locate a certain object...

06/06/2023 - Language Adaptive Weight Generation for Multi-task Visual Grounding
Despite the impressive performance in visual grounding, the prevailing ...
