Stacked Cross Attention for Image-Text Matching

03/21/2018
by Kuang-Huei Lee, et al.

In this paper, we study the problem of image-text matching. Inferring the latent semantic alignment between objects or other salient stuff (e.g. snow, sky, lawn) and the corresponding words in sentences allows the fine-grained interplay between vision and language to be captured, and makes image-text matching more interpretable. Prior work either simply aggregates the similarity of all possible pairs of regions and words without attending differentially to more and less important words or regions, or uses a multi-step attentional process to capture a limited number of semantic alignments, which is less interpretable. In this paper, we present Stacked Cross Attention to discover the full latent alignments, using both image regions and words in a sentence as context, and infer image-text similarity. Our approach achieves state-of-the-art results on the MS-COCO and Flickr30K datasets. On Flickr30K, our approach outperforms the current best methods by 22.1% relatively in text retrieval from image query, and 18.2% relatively in image retrieval with text query (based on Recall@1). On MS-COCO, our approach improves sentence retrieval by 17.8% relatively and image retrieval by 16.6% relatively (based on Recall@1).
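The core idea of cross attention as described above can be illustrated with a minimal sketch in the image-to-text direction: each image region attends over all words in the sentence, and the region/attended-sentence similarities are pooled into one image-text score. This is a simplified illustration, not the paper's exact formulation; the feature inputs, the smoothing temperature `smooth`, and the averaging pool are assumptions for the sketch.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def stacked_cross_attention(regions, words, smooth=9.0):
    """Image-to-text sketch: each region attends over all words,
    then region/attended-sentence similarities are averaged.

    regions: (n_regions, dim) precomputed region features (assumed given)
    words:   (n_words, dim)   precomputed word features (assumed given)
    """
    s = cosine_sim(regions, words)                  # (n_regions, n_words)
    s = np.maximum(s, 0.0)                          # threshold at zero
    s = s / (np.linalg.norm(s, axis=1, keepdims=True) + 1e-8)
    attn = np.exp(smooth * s)                       # softmax over words,
    attn = attn / attn.sum(axis=1, keepdims=True)   # sharpened by `smooth`
    attended = attn @ words                         # attended sentence vector
                                                    # per region, (n_regions, dim)
    r = np.sum(regions * attended, axis=1) / (
        np.linalg.norm(regions, axis=1) * np.linalg.norm(attended, axis=1) + 1e-8)
    return float(r.mean())                          # average pooling of
                                                    # region-level similarities
```

A matched pair (regions that align well with the words) yields attention sharply peaked on the corresponding word, so the pooled score is high; for mismatched pairs the attended vectors correlate weakly with the regions and the score drops.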


