Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval

02/23/2020
by Hadi Abdi Khojasteh, et al.

This paper considers the task of matching images and sentences by learning a visual-textual embedding space for cross-modal retrieval. Finding such a space is challenging because the features and representations of text and images are not directly comparable. In this work, we introduce an end-to-end deep multimodal convolutional-recurrent network that learns vision and language representations simultaneously to infer image-text similarity. The model learns which pairs are a match (positive) and which are a mismatch (negative) using a hinge-based triplet ranking loss. To learn the joint representations, we leverage our newly extracted collection of tweets from Twitter. The main characteristic of our dataset is that its images and tweets are not standardized in the way benchmark datasets are. Furthermore, there can be a higher semantic correlation between the pictures and the tweets, in contrast to benchmarks in which the descriptions are well-organized. Experimental results on the MS-COCO benchmark dataset show that our model outperforms certain previously presented methods and achieves performance competitive with the state-of-the-art. The code and dataset have been made publicly available.
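To make the setup concrete, below is a minimal PyTorch sketch of a two-branch embedding model trained with a hinge-based triplet ranking loss over matched and mismatched image-text pairs, as described above. The class and function names (ImageTextEmbedder, hinge_triplet_loss), the use of pre-extracted 2048-dimensional CNN features, the GRU text encoder, and the margin value are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: joint image-text embedding with a hinge-based triplet
# ranking loss. Encoder details and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageTextEmbedder(nn.Module):
    """Maps images and sentences into a shared, L2-normalized embedding space."""

    def __init__(self, vocab_size, word_dim=300, embed_dim=1024, image_dim=2048):
        super().__init__()
        # Visual branch: project pre-extracted CNN features (e.g. a ResNet pool layer).
        self.img_fc = nn.Linear(image_dim, embed_dim)
        # Textual branch: word embeddings followed by a recurrent (GRU) encoder.
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, embed_dim, batch_first=True)

    def encode_image(self, img_feats):
        return F.normalize(self.img_fc(img_feats), dim=-1)

    def encode_text(self, tokens):
        _, h = self.gru(self.embed(tokens))   # h: (1, batch, embed_dim)
        return F.normalize(h.squeeze(0), dim=-1)


def hinge_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based triplet ranking loss: matched (positive) pairs lie on the
    diagonal of the similarity matrix; all other in-batch pairs serve as
    mismatched (negative) samples."""
    scores = img_emb @ txt_emb.t()            # cosine similarities (batch x batch)
    diag = scores.diag().view(-1, 1)
    # Image as anchor: penalize captions scoring higher than the true caption.
    cost_txt = (margin + scores - diag).clamp(min=0)
    # Caption as anchor: penalize images scoring higher than the true image.
    cost_img = (margin + scores - diag.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()
```

In this sketch the negatives are simply the other image-text pairs in the same minibatch, which is a common way to instantiate the positive/mismatch contrast the abstract describes without mining negatives explicitly.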

