Jointly Modeling Embedding and Translation to Bridge Video and Language

05/07/2015
by   Yingwei Pan, et al.
0

Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNN), which models sequence dynamics, has attracted increasing attention on visual interpretation. However, most existing approaches generate a word locally with given previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the semantics (e.g., subjects, verbs or objects) are not true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given previous words and visual content, while the latter is to create a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and visual content. Our proposed LSTM-E consists of three components: a 2-D and/or 3-D deep convolutional neural networks for learning powerful video representation, a deep RNN for generating sentences, and a joint embedding model for exploring the relationships between visual content and sentence semantics. The experiments on YouTube2Text dataset show that our proposed LSTM-E achieves to-date the best reported performance in generating natural sentences: 45.3 demonstrate that LSTM-E is superior in predicting Subject-Verb-Object (SVO) triplets to several state-of-the-art techniques.

READ FULL TEXT
research
11/23/2016

Video Captioning with Transferred Semantic Attributes

Automatically generating natural language descriptions of videos plays a...
research
02/24/2015

Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval

This paper develops a model that addresses sentence embedding, a hot top...
research
06/08/2017

Image Captioning with Object Detection and Localization

Automatically generating a natural language description of an image is a...
research
03/01/2018

Matching Natural Language Sentences with Hierarchical Sentence Factorization

Semantic matching of natural language sentences or identifying the relat...
research
06/12/2015

Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences

We propose a neural sequence-to-sequence model for direction following, ...
research
12/11/2017

A Novel Way of Identifying Cyber Predators

Recurrent Neural Networks with Long Short-Term Memory cell (LSTM-RNN) ha...
research
06/13/2018

Generating Sentences Using a Dynamic Canvas

We introduce the Attentive Unsupervised Text (W)riter (AUTR), which is a...

Please sign up or login with your details

Forgot password? Click here to reset