Relationship-Embedded Representation Learning for Grounding Referring Expressions

04/19/2020
by   Sibei Yang, et al.

Grounding referring expressions in images aims to locate the object instance in an image described by a referring expression. It requires a joint understanding of natural language and image content, and is essential for a range of visual tasks involving human-computer interaction. As a language-to-vision matching task, its core is not only to extract all the necessary information (i.e., objects and the relationships among them) from both the image and the referring expression, but also to make full use of context information to align cross-modal semantic concepts in the extracted information. Unfortunately, existing work on grounding referring expressions fails to accurately extract multi-order relationships from the referring expression and associate them with the objects and their related contexts in the image. In this paper, we propose a Cross-Modal Relationship Extractor (CMRE) that uses a cross-modal attention mechanism to adaptively highlight the objects and relationships (both spatial and semantic) relevant to the given expression, and represents the extracted information as a language-guided visual relation graph. In addition, we propose a Gated Graph Convolutional Network (GGCN) that computes multimodal semantic contexts by fusing information from the two modalities and propagating it through the structured relation graph. Experimental results on three common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of CMRE and GGCN, significantly surpasses all existing state-of-the-art methods. Code is available at https://github.com/sibeiyang/sgmn/tree/master/lib/cmrin_models
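The GGCN described above propagates multimodal information along the edges of the relation graph, with learned gates deciding how much each relationship contributes to a node's updated context. The following is a minimal NumPy sketch of one such gated propagation step; the function and parameter names are illustrative placeholders, not taken from the paper's released code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_graph_conv(node_feats, adj, gate_w, update_w):
    """One gated graph-convolution step (illustrative sketch).

    node_feats: (N, D) per-object visual features (graph nodes)
    adj:        (N, N) relation-graph adjacency, 1 where an edge exists
    gate_w:     (2D,)  weights scoring each edge from its endpoint features
    update_w:   (D, D) linear transform applied to aggregated messages
    """
    n, d = node_feats.shape
    # Build all (sender, receiver) feature pairs: row i*n + j pairs node i with node j.
    pair = np.concatenate([
        np.repeat(node_feats, n, axis=0),   # sender features
        np.tile(node_feats, (n, 1)),        # receiver features
    ], axis=1)                              # (N*N, 2D)
    # Edge gates: a scalar in (0, 1) per edge, masked by the adjacency matrix.
    gates = sigmoid(pair @ gate_w).reshape(n, n) * adj
    # Normalize gates per node, then aggregate gated messages from neighbors.
    norm = gates.sum(axis=1, keepdims=True) + 1e-8
    messages = (gates / norm) @ node_feats  # (N, D)
    # Residual update: each node keeps its own features plus the gated context.
    return node_feats + np.tanh(messages @ update_w)
```

In the actual model the gates would also be conditioned on the language features of the expression, so that only expression-relevant relationships carry information; this sketch conditions them on node features alone for brevity.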


