
Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval

by Christopher Thomas, et al.

The abundance of multimodal data (e.g., social media posts) has inspired interest in cross-modal retrieval methods. Popular approaches rely on a variety of metric learning losses, which prescribe how close image and text should be in the learned space. However, most prior methods focus on the case where image and text convey redundant information; in contrast, real-world image-text pairs convey complementary information with little overlap. Further, images in news articles and media portray topics in visually diverse ways, so special care is needed to ensure a meaningful image representation. We propose novel within-modality losses that encourage semantic coherency in both the text and image subspaces; semantic coherency does not necessarily align with visual coherency. Our method ensures not only that paired images and texts are close, but also that the expected image-image and text-text relationships are observed. Our approach improves cross-modal retrieval results on four datasets compared to five baselines.
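To make the idea concrete, here is a minimal sketch of how a cross-modal triplet objective can be combined with within-modality terms that keep semantically related images (and texts) close to each other. This is an illustration under assumed inputs, not the paper's actual loss: the function names (`triplet_loss`, `combined_loss`), the margin value, the weight `w`, and the choice of squared-Euclidean distance are all placeholders for exposition.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: the anchor should be closer to the
    positive than to the negative by at least `margin` (squared L2)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def combined_loss(img, txt, img_neg, txt_neg,
                  img_sem_pos, txt_sem_pos, w=0.5):
    """Cross-modal loss plus within-modality terms (illustrative only).

    img / txt           : embeddings of a ground-truth image-text pair
    img_neg / txt_neg   : embeddings of unrelated samples
    img_sem_pos         : an image judged semantically similar to `img`
    txt_sem_pos         : a text judged semantically similar to `txt`
    """
    # Cross-modal terms: the paired image and text should be closer to
    # each other than to non-matching samples from the other modality.
    cross = triplet_loss(img, txt, txt_neg) + triplet_loss(txt, img, img_neg)
    # Within-modality terms: semantic neighbors in the same modality
    # should also stay close, preserving image-image and text-text
    # relationships in the learned space.
    within = (triplet_loss(img, img_sem_pos, img_neg)
              + triplet_loss(txt, txt_sem_pos, txt_neg))
    return cross + w * within
```

In a real system the embeddings would come from trained image and text encoders, and semantic neighbors would be mined per batch; here plain vectors suffice to show how the within-modality terms enter the objective.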



