ParNet: Position-aware Aggregated Relation Network for Image-Text Matching

06/17/2019
by Yaxian Xia, et al.

Exploring fine-grained relationships between entities (e.g., objects in an image or words in a sentence) contributes greatly to a precise understanding of multimedia content. Previous attention mechanisms employed in image-text matching either take multiple self-attention steps to gather correspondences or use image objects (or words) as context to infer image-text similarity. However, they exploit only semantic information and overlook the fact that objects' relative positions also contribute to image understanding. To this end, we introduce a novel position-aware relation module that models semantic and spatial relationships simultaneously for image-text matching. Given an image, our method utilizes the locations of different objects to capture their spatial relationships. Combining semantic and spatial relationships makes it easier to understand the content of the two modalities (images and sentences) and to capture fine-grained latent correspondences of image-text pairs. In addition, we employ a two-step aggregated relation module to capture interpretable alignments of image-text pairs. The first step, which we call the intra-modal relation mechanism, computes responses between different objects in an image or between different words in a sentence separately; the second step, which we call the inter-modal relation mechanism, uses the query as textual context to refine the relationships among object proposals in an image. In this way, our position-aware aggregated relation network (ParNet) not only knows which entities are relevant by attending to different objects (words) adaptively, but also adjusts the inter-modal correspondence according to the latent alignments induced by the query's content. Our approach achieves state-of-the-art results on the MS-COCO dataset.
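Since this page carries only the abstract, the following PyTorch sketch is purely illustrative of the two mechanisms it describes: log-scale relative-position features between region boxes fused into an intra-modal self-attention step, followed by an inter-modal step in which the sentence acts as the query over image regions. Every name and dimension below (relative_position_features, intra_modal_relation, inter_modal_relation, the toy sizes n, m, d) is an assumption made for illustration, not the authors' released code.

```python
# Illustrative sketch only (assumed names and shapes), not ParNet's implementation.
import torch
import torch.nn.functional as F

def relative_position_features(boxes):
    """Pairwise spatial features from object bounding boxes.

    boxes: (n, 4) tensor of (x1, y1, x2, y2) region proposals.
    Returns (n, n, 4) log-scale relative geometry (center offsets
    normalized by box size, plus width/height ratios).
    """
    x_c = (boxes[:, 0] + boxes[:, 2]) / 2            # box centers
    y_c = (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]).clamp(min=1e-3)  # box widths
    h = (boxes[:, 3] - boxes[:, 1]).clamp(min=1e-3)  # box heights
    dx = torch.log((x_c[:, None] - x_c[None, :]).abs().clamp(min=1e-3) / w[:, None])
    dy = torch.log((y_c[:, None] - y_c[None, :]).abs().clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[:, None] / w[None, :])
    dh = torch.log(h[:, None] / h[None, :])
    return torch.stack([dx, dy, dw, dh], dim=-1)     # (n, n, 4)

def intra_modal_relation(feats, spatial_bias=None):
    """Step 1: self-attention among entities of a single modality.

    feats: (n, d) region (or word) features; spatial_bias: optional
    (n, n) additive bias derived from relative positions (images only).
    """
    d = feats.size(-1)
    logits = feats @ feats.t() / d ** 0.5            # semantic affinities
    if spatial_bias is not None:
        logits = logits + spatial_bias               # fuse spatial relation
    attn = F.softmax(logits, dim=-1)
    return attn @ feats                              # relation-enhanced features

def inter_modal_relation(regions, words):
    """Step 2: the sentence is the textual context (query) that refines
    correspondences among the image's object proposals.

    regions: (n, d); words: (m, d). Returns one attended image
    context vector per query word, shape (m, d).
    """
    d = regions.size(-1)
    logits = words @ regions.t() / d ** 0.5          # word-to-region affinities
    attn = F.softmax(logits, dim=-1)
    return attn @ regions

# Toy usage: 5 region proposals, a 7-word sentence, feature size 16.
n, m, d = 5, 7, 16
regions, words = torch.randn(n, d), torch.randn(m, d)
xy1 = torch.rand(n, 2)
boxes = torch.cat([xy1, xy1 + 0.05 + 0.5 * torch.rand(n, 2)], dim=-1)  # valid boxes
bias = relative_position_features(boxes).sum(-1)     # crude scalar bias for the sketch
regions = intra_modal_relation(regions, spatial_bias=bias)
words = intra_modal_relation(words)
context = inter_modal_relation(regions, words)       # (m, d)
score = F.cosine_similarity(context, words).mean()   # scalar image-text similarity
```

The log-scale offset/ratio encoding is a standard way to make pairwise spatial features roughly translation- and scale-invariant; ParNet's actual fusion of the semantic and spatial terms, and its similarity aggregation, may differ from this sketch.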


research · 03/21/2018 · Stacked Cross Attention for Image-Text Matching
In this paper, we study the problem of image-text matching. Inferring th...

research · 09/15/2023 · Uncertainty-Aware Multi-View Visual Semantic Embedding
The key challenge in image-text retrieval is effectively leveraging sema...

research · 10/17/2022 · Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval
Image-sentence retrieval has attracted extensive research attention in m...

research · 02/04/2023 · FGSI: Distant Supervision for Relation Extraction method based on Fine-Grained Semantic Information
The main purpose of relation extraction is to extract the semantic relat...

research · 03/08/2020 · IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
Enabling bi-directional retrieval of images and texts is important for u...

research · 07/23/2019 · Position Focused Attention Network for Image-Text Matching
Image-text matching tasks have recently attracted a lot of attention in ...

research · 05/02/2020 · Robust and Interpretable Grounding of Spatial References with Relation Networks
Handling spatial references in natural language is a key challenge in ta...
