Learning a Recurrent Visual Representation for Image Caption Generation

by Xinlei Chen, et al.
Carnegie Mellon University

In this paper we explore the bi-directional mapping between images and their sentence-based descriptions. We propose learning this mapping using a recurrent neural network. Unlike previous approaches that map both sentences and images to a common embedding, we enable the generation of novel sentences given an image. Using the same model, we can also reconstruct the visual features associated with an image given its visual description. We use a novel recurrent visual memory that automatically learns to remember long-term visual concepts to aid in both sentence generation and visual feature reconstruction. We evaluate our approach on several tasks, including sentence generation, sentence retrieval, and image retrieval. State-of-the-art results are shown for the task of generating novel image descriptions. When compared to human-generated captions, our automatically generated captions are preferred by humans over 19.8% of the time. Results are better than or comparable to state-of-the-art results on the image and sentence retrieval tasks for methods using similar visual features.




Deep Visual-Semantic Alignments for Generating Image Descriptions

We present a model that generates natural language descriptions of image...

Learning Joint Representations of Videos and Sentences with Web Image Search

Our objective is video retrieval based on natural language queries. In a...

Learning to Describe Differences Between Pairs of Similar Images

In this paper, we introduce the task of automatically generating text to...

Predicting Visual Features from Text for Image and Video Caption Retrieval

This paper strives to find amidst a set of sentences the one best descri...

Generating Multi-Sentence Lingual Descriptions of Indoor Scenes

This paper proposes a novel framework for generating lingual description...

Knowledge driven Description Synthesis for Floor Plan Interpretation

Image captioning is a widely known problem in the area of AI. Caption ge...

On Architectures for Including Visual Information in Neural Language Models for Image Description

A neural language model can be conditioned into generating descriptions ...
