Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval

Cross-modal retrieval between visual data and natural language description remains a long-standing challenge in multimedia. While recent image-text retrieval methods offer great promise by learning deep representations aligned across modalities, most of these methods are plagued by the issue of training with small-scale datasets covering a limited number of images with ground-truth sentences. Moreover, it is extremely expensive to create a larger dataset by annotating millions of images with sentences and may lead to a biased model. Inspired by the recent success of webly supervised learning in deep neural networks, we capitalize on readily-available web images with noisy annotations to learn robust image-text joint representation. Specifically, our main idea is to leverage web images and corresponding tags, along with fully annotated datasets, in training for learning the visual-semantic joint embedding. We propose a two-stage approach for the task that can augment a typical supervised pair-wise ranking loss based formulation with weakly-annotated web images to learn a more robust visual-semantic embedding. Experiments on two standard benchmark datasets demonstrate that our method achieves a significant performance gain in image-text retrieval compared to state-of-the-art approaches.

READ FULL TEXT

page 1

page 2

page 7

research
01/31/2019

Self-Supervised Visual Representations for Cross-Modal Retrieval

Cross-modal retrieval methods have been significantly improved in last y...
research
12/08/2020

StacMR: Scene-Text Aware Cross-Modal Retrieval

Recent models for cross-modal retrieval have benefited from an increasin...
research
08/08/2017

Deep Binaries: Encoding Semantic-Rich Cues for Efficient Textual-Visual Cross Retrieval

Cross-modal hashing is usually regarded as an effective technique for la...
research
06/03/2017

See, Hear, and Read: Deep Aligned Representations

We capitalize on large amounts of readily-available, synchronous data to...
research
03/17/2017

Learning Robust Visual-Semantic Embeddings

Many of the existing methods for learning joint embedding of images and ...
research
11/19/2015

Learning Deep Structure-Preserving Image-Text Embeddings

This paper proposes a method for learning joint embeddings of images and...
research
02/28/2023

Joint Representations of Text and Knowledge Graphs for Retrieval and Evaluation

A key feature of neural models is that they can produce semantic vector ...

Please sign up or login with your details

Forgot password? Click here to reset