Deep Visual-Semantic Alignments for Generating Image Descriptions

12/07/2014
by   Andrej Karpathy, et al.
0

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.

READ FULL TEXT

page 1

page 6

page 7

page 8

page 14

page 15

page 16

page 17

research
10/04/2014

Explain Images with Multimodal Recurrent Neural Networks

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) ...
research
11/20/2014

Learning a Recurrent Visual Representation for Image Caption Generation

In this paper we explore the bi-directional mapping between images and t...
research
06/03/2017

Order embeddings and character-level convolutions for multimodal alignment

With the novel and fast advances in the area of deep neural networks, se...
research
06/22/2014

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

We introduce a model for bidirectional retrieval of images and sentences...
research
11/15/2017

Dual-Path Convolutional Image-Text Embedding

This paper considers the task of matching images and sentences. The chal...
research
04/12/2016

Video Description using Bidirectional Recurrent Neural Networks

Although traditionally used in the machine translation field, the encode...
research
09/14/2015

Deep Learning Applied to Image and Text Matching

The ability to describe images with natural language sentences is the ha...

Please sign up or login with your details

Forgot password? Click here to reset