Cross-Modal Relationship Inference for Grounding Referring Expressions

06/11/2019
by Sibei Yang, et al.

Grounding referring expressions is a fundamental yet challenging task facilitating human-machine communication in the physical world. It locates the target object in an image by comprehending the relationships between a referring natural language expression and the image. A feasible solution must not only extract all the necessary information (i.e., objects and the relationships among them) from both the image and the referring expression, but also compute and represent multimodal contexts from the extracted information. Unfortunately, existing work cannot accurately extract multi-order relationships from referring expressions, and the contexts it obtains differ from the contexts those expressions describe. In this paper, we propose a Cross-Modal Relationship Extractor (CMRE) that uses a cross-modal attention mechanism to adaptively highlight the objects and relationships connected with a given expression, and represents the extracted information as a language-guided visual relation graph. In addition, we propose a Gated Graph Convolutional Network (GGCN) to compute multimodal semantic contexts by fusing information from different modalities and propagating multimodal information in the structured relation graph. Experiments on various common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of CMRE and GGCN, outperforms all existing state-of-the-art methods.
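
To make the two ideas in the abstract more concrete, here is a minimal NumPy sketch of one language-gated propagation step over an object graph. It is an illustration under stated assumptions, not the paper's formulation: the function name `gated_graph_step`, the dot-product attention, the sigmoid edge gate, and the additive fusion are all hypothetical choices made for the example.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_graph_step(node_feats, adjacency, expr_vec):
    """One illustrative language-gated propagation step (assumed design).

    node_feats: (N, D) visual features, one row per object proposal.
    adjacency:  (N, N) 0/1 matrix; adjacency[i, j] = 1 if j is a neighbor of i.
    expr_vec:   (D,) embedding of the referring expression.
    """
    # Cross-modal relevance of each object to the expression (assumption:
    # plain dot-product scoring).
    scores = node_feats @ expr_vec                      # (N,)
    node_attn = softmax(scores)                         # sums to 1

    # Edge gate in (0, 1) from the relevance of both endpoints (assumption),
    # masked so that non-edges carry no information.
    edge_gate = sigmoid(scores[:, None] + scores[None, :]) * adjacency

    # Fuse the two modalities: add the expression embedding to each node,
    # weighted by that node's attention.
    fused = node_feats + node_attn[:, None] * expr_vec[None, :]

    # Propagate fused features along the gated edges.
    context = edge_gate @ fused                         # (N, D)
    return fused + context, node_attn, edge_gate
```

Calling `gated_graph_step` with, say, three object features of dimension four returns updated node features of the same shape, plus the node attention and the edge gates, which can be inspected to see which objects and relations the expression highlighted.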

research
04/19/2020

Relationship-Embedded Representation Learning for Grounding Referring Expressions

Grounding referring expressions in images aims to locate the object inst...
research
03/03/2019

Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing

Referring expression grounding aims at locating certain objects or perso...
research
01/18/2022

Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching

Referring expression grounding is an important and challenging task in c...
research
07/08/2019

Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions

We focus on grounding (i.e., localizing or linking) referring expression...
research
12/05/2017

Grounding Referring Expressions in Images by Variational Context

We focus on grounding (i.e., localizing or linking) referring expression...
research
03/12/2022

Differentiated Relevances Embedding for Group-based Referring Expression Comprehension

Referring expression comprehension (REC) aims to locate a certain object...
research
06/09/2019

Referring Expression Grounding by Marginalizing Scene Graph Likelihood

We focus on the task of grounding referring expressions in images, e.g.,...
