CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval

Zihao Wang et al.
SenseTime Corporation
Beihang University
The Chinese University of Hong Kong

Text-image cross-modal retrieval is a challenging task at the intersection of language and vision. Most previous approaches embed images and sentences independently into a joint embedding space and compare their similarities there, rarely exploring interactions between the two modalities before computing similarity. Intuitively, when matching an image to a sentence, a human would alternately attend to regions in the image and words in the sentence, selecting the most salient information in light of how the two modalities interact. In this paper, we propose Cross-modal Adaptive Message Passing (CAMP), which adaptively controls the information flow of message passing across modalities. Our approach not only accounts for comprehensive, fine-grained cross-modal interactions, but also properly handles negative pairs and irrelevant information through an adaptive gating scheme. Moreover, instead of following the conventional joint-embedding paradigm for text-image matching, we infer the matching score from the fused features and propose a hardest-negative binary cross-entropy loss for training. Results on COCO and Flickr30k significantly surpass state-of-the-art methods, demonstrating the effectiveness of our approach.
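The two ingredients the abstract names, gated cross-modal message passing and a hardest-negative binary cross-entropy loss, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the scaled dot-product attention, the single gating matrix `Wg`, and all dimensions are simplifying assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_message_passing(regions, words, Wg):
    """One direction of gated cross-modal message passing:
    words send messages to image regions (illustrative sketch)."""
    # Cross-modal attention: each region attends over all words.
    attn = softmax(regions @ words.T / np.sqrt(regions.shape[1]), axis=1)
    messages = attn @ words  # aggregated textual message per region
    # Adaptive gate decides how much of the message to let through,
    # so irrelevant or negative-pair information can be suppressed.
    gate = sigmoid(np.concatenate([regions, messages], axis=1) @ Wg)
    fused = gate * messages + (1.0 - gate) * regions
    return fused

def hardest_negative_bce(scores):
    """Hardest-negative binary cross-entropy on an n x n score matrix
    where scores[i, i] are the positive (matching) pairs."""
    n = scores.shape[0]
    pos = sigmoid(np.diag(scores))
    masked = np.where(np.eye(n, dtype=bool), -1e9, scores)
    hard_i2t = sigmoid(masked.max(axis=1))  # hardest sentence per image
    hard_t2i = sigmoid(masked.max(axis=0))  # hardest image per sentence
    eps = 1e-12
    return (-np.log(pos + eps)
            - np.log(1.0 - hard_i2t + eps)
            - np.log(1.0 - hard_t2i + eps)).mean()

d = 8
regions = rng.standard_normal((5, d))       # toy region features
words = rng.standard_normal((7, d))         # toy word features
Wg = rng.standard_normal((2 * d, d)) * 0.1  # hypothetical gate weights
fused = adaptive_message_passing(regions, words, Wg)
print(fused.shape)  # per-region fused features, shape (5, 8)
```

Note how the gate interpolates between the original region feature and the incoming textual message, which is the mechanism that lets a mismatched pair pass little information across modalities.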




Related papers:

- FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval
- ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
- Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders
- Cross-Modal Message Passing for Two-stream Fusion
- Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images
- Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval
- Stacked Cross Attention for Image-Text Matching
