Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

06/22/2014
by   Andrej Karpathy, et al.

We introduce a model for bidirectional retrieval of images and sentences through a multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. In addition to a ranking objective seen in previous work, this allows us to add a new fragment alignment objective that learns to directly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments significantly improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions since the inferred inter-modal fragment alignment is explicit.
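
The ranking objective mentioned above can be illustrated with a minimal sketch. The abstract does not give the scoring function or the loss, so everything below is an assumption made for illustration: fragment similarity is taken to be an inner product in the common space, an image-sentence score is taken to be the sum of the positive fragment similarities, and the ranking objective is taken to be a standard max-margin loss over a batch of matching pairs. The names `image_sentence_score`, `ranking_loss`, and `margin` are hypothetical, and the fragment alignment objective is not covered here.

```python
import numpy as np


def image_sentence_score(img_frags, sent_frags):
    """Score one image against one sentence from their fragment embeddings.

    img_frags:  (num_image_fragments, dim) array
    sent_frags: (num_sentence_fragments, dim) array
    Assumption: the score is the sum of positive fragment inner products.
    """
    sims = img_frags @ sent_frags.T          # all pairwise fragment similarities
    return float(np.maximum(sims, 0.0).sum())


def ranking_loss(images, sentences, margin=1.0):
    """Max-margin ranking loss over a batch where images[k] matches sentences[k].

    Assumption: a generic bidirectional hinge loss, not necessarily the
    exact objective used in the paper.
    """
    n = len(images)
    scores = np.array([
        [image_sentence_score(images[k], sentences[l]) for l in range(n)]
        for k in range(n)
    ])
    correct = np.diag(scores)

    # Each image should score its own sentence above every other sentence.
    row_cost = np.maximum(0.0, scores - correct[:, None] + margin)
    # Each sentence should score its own image above every other image.
    col_cost = np.maximum(0.0, scores - correct[None, :] + margin)

    # The matching pair itself contributes no cost.
    np.fill_diagonal(row_cost, 0.0)
    np.fill_diagonal(col_cost, 0.0)
    return float(row_cost.sum() + col_cost.sum())


# Toy batch: two images and two sentences, one 2-d fragment each.
images = [np.array([[2.0, 0.0]]), np.array([[0.0, 2.0]])]
sentences = [np.array([[2.0, 0.0]]), np.array([[0.0, 2.0]])]

aligned_loss = ranking_loss(images, sentences)
misaligned_loss = ranking_loss(images, sentences[::-1])
```

In this toy batch the correctly paired data incurs no loss, while swapping the sentences makes every mismatched pair outscore its true pair and the loss becomes positive. Because the score is built from fragment-level similarities, the same similarity matrix could in principle be inspected to see which fragments drive a match, which is the kind of interpretability the abstract refers to.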

research
08/05/2021

Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval

The current state-of-the-art image-sentence retrieval methods implicitly...
research
12/07/2014

Deep Visual-Semantic Alignments for Generating Image Descriptions

We present a model that generates natural language descriptions of image...
research
04/23/2015

Multimodal Convolutional Neural Networks for Matching Image and Sentence

In this paper, we propose multimodal convolutional neural networks (m-CN...
research
06/01/2021

Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features

Cross-modal retrieval is an important functionality in modern search eng...
research
09/14/2015

Deep Learning Applied to Image and Text Matching

The ability to describe images with natural language sentences is the ha...
research
08/03/2015

Maintaining prediction quality under the condition of a growing knowledge space

Intelligence can be understood as an agent's ability to predict its envi...
