CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval

09/12/2019
by Zihao Wang, et al.

Text-image cross-modal retrieval is a challenging task in the field of language and vision. Most previous approaches embed images and sentences independently into a joint embedding space and compare their similarities there. However, these approaches rarely explore the interactions between images and sentences before computing similarities in the joint space. Intuitively, when matching an image to a sentence, humans alternately attend to regions in the image and words in the sentence, selecting the most salient information by considering the interaction between the two modalities. In this paper, we propose Cross-modal Adaptive Message Passing (CAMP), which adaptively controls the information flow for message passing across modalities. Our approach not only accounts for comprehensive, fine-grained cross-modal interactions, but also properly handles negative pairs and irrelevant information through an adaptive gating scheme. Moreover, instead of conventional joint embedding for text-image matching, we infer the matching score from the fused features and propose a hardest-negative binary cross-entropy loss for training. Results on COCO and Flickr30k significantly surpass state-of-the-art methods, demonstrating the effectiveness of our approach.
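To make the three ideas in the abstract concrete, here is a minimal PyTorch-style sketch of gated cross-modal message passing, a matching score predicted from fused features, and a hardest-negative binary cross-entropy loss. All module names, dimensions, the pooling step, and the fusion details are illustrative assumptions for exposition, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMessagePassing(nn.Module):
    # One direction of cross-modal message passing: words send messages to
    # image regions through attention, and an adaptive gate decides how much
    # of each message is fused into the region feature.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, regions, words):
        # regions: (n_regions, dim), words: (n_words, dim)
        attn = F.softmax(regions @ words.t() / regions.size(-1) ** 0.5, dim=-1)
        messages = attn @ words                                   # textual message per region
        gate = torch.sigmoid(self.gate(torch.cat([regions, messages], dim=-1)))
        fused = self.fuse(torch.cat([regions, gate * messages], dim=-1))
        return F.relu(fused)

class MatchingHead(nn.Module):
    # Pools the fused features and predicts a scalar matching logit,
    # instead of comparing two independently embedded vectors in a joint space.
    def __init__(self, dim):
        super().__init__()
        self.mp = AdaptiveMessagePassing(dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, regions, words):
        fused = self.mp(regions, words).mean(dim=0)               # simple mean pooling
        return self.score(fused).squeeze(-1)                      # raw matching logit

def hardest_negative_bce(model, images, texts):
    # Scores every image-text pair in the batch, treats the diagonal as
    # positives, and applies binary cross-entropy to the positives and to
    # the single hardest negative per image and per text.
    n = len(images)
    scores = torch.stack([
        torch.stack([model(images[i], texts[j]) for j in range(n)])
        for i in range(n)
    ])                                                            # (n, n) logits
    pos = scores.diagonal()
    diag = torch.eye(n, dtype=torch.bool)
    hardest_txt = scores.masked_fill(diag, float("-inf")).max(dim=1).values
    hardest_img = scores.masked_fill(diag, float("-inf")).max(dim=0).values
    ones, zeros = torch.ones_like(pos), torch.zeros_like(pos)
    return (F.binary_cross_entropy_with_logits(pos, ones)
            + F.binary_cross_entropy_with_logits(hardest_txt, zeros)
            + F.binary_cross_entropy_with_logits(hardest_img, zeros))

In the full model, message passing would run in both directions (regions to words as well as words to regions), and the pairwise score matrix would be computed in a vectorized way rather than with Python loops; the sketch keeps only enough structure to show where the adaptive gate and the hardest-negative loss sit.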


Related research

05/20/2020 · FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval
In this paper, we address the text and image matching in cross-modal ret...

07/29/2022 · ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
Image-text matching is gaining a leading role among tasks involving the ...

08/12/2020 · Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders
Despite the evolution of deep-learning-based visual-textual processing s...

04/30/2019 · Cross-Modal Message Passing for Two-stream Fusion
Processing and fusing information among multi-modal is a very useful tec...

01/10/2023 · Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images
Self-driving vehicles rely on urban street maps for autonomous navigatio...

02/23/2020 · Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval
This paper considers the task of matching images and sentences by learni...

03/21/2018 · Stacked Cross Attention for Image-Text Matching
In this paper, we study the problem of image-text matching. Inferring th...
