Translating Videos to Natural Language Using Deep Recurrent Neural Networks

12/15/2014
by Subhashini Venugopalan, et al.

Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics, subject, verb, and object prediction accuracy, and a human evaluation.
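To make the "convolutional and recurrent" pipeline concrete, here is a minimal sketch of the data flow only. The abstract does not say how per-frame features are combined or which recurrent cell is used, so this sketch assumes mean pooling over frames and a plain tanh recurrent cell. Every name in it (`mean_pool`, `TinyRecurrentDecoder`, the toy vocabulary, the hand-written frame vectors standing in for CNN activations) is hypothetical, and the weights are random and untrained, so the emitted words are meaningless.

```python
import math
import random


def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector (assumed pooling)."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[i] for f in frame_features) / n for i in range(dim)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyRecurrentDecoder:
    """Hypothetical tanh recurrent decoder with untrained weights (illustration only)."""

    def __init__(self, feat_dim, hidden_dim, vocab, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab
        self.hidden_dim = hidden_dim
        self.w_feat = rand_matrix(hidden_dim, feat_dim, rng)
        self.w_word = rand_matrix(hidden_dim, len(vocab), rng)
        self.w_hid = rand_matrix(hidden_dim, hidden_dim, rng)
        self.w_out = rand_matrix(len(vocab), hidden_dim, rng)

    def describe(self, video_vec, max_len=5):
        """Greedy decoding: feed the video vector and the previous word at every step."""
        h = [0.0] * self.hidden_dim
        feat_in = matvec(self.w_feat, video_vec)
        prev = self.vocab.index("<bos>")
        words = []
        for _ in range(max_len):
            one_hot = [1.0 if i == prev else 0.0 for i in range(len(self.vocab))]
            word_in = matvec(self.w_word, one_hot)
            hid_in = matvec(self.w_hid, h)
            h = [math.tanh(a + b + c) for a, b, c in zip(feat_in, word_in, hid_in)]
            scores = matvec(self.w_out, h)
            prev = max(range(len(scores)), key=scores.__getitem__)
            if self.vocab[prev] == "<eos>":
                break  # stop at the end-of-sentence token
            words.append(self.vocab[prev])
        return words


# Two made-up 3-dimensional frame vectors standing in for CNN activations.
frames = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
video_vec = mean_pool(frames)

vocab = ["<bos>", "<eos>", "a", "man", "plays", "guitar"]
decoder = TinyRecurrentDecoder(feat_dim=3, hidden_dim=4, vocab=vocab)
sentence = decoder.describe(video_vec)
print(video_vec, sentence)
```

In a trained system the frame vectors would come from a convolutional network pretrained on labeled images, and the decoder weights would be learned from captioned data, which is the knowledge transfer the abstract describes.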


research
02/25/2021

IMAGETOTEXT: IMAGE CAPTION GENERATION USING HYBRID RECURRENT NEURAL NETWORK

Generating a natural language description from images is an important pr...
research
12/01/2021

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

The recent and increasing interest in video-language research has driven...
research
10/11/2022

Visual Language Maps for Robot Navigation

Grounding language to the visual observations of a navigating agent can ...
research
02/27/2015

Describing Videos by Exploiting Temporal Structure

Recent progress in using recurrent neural networks (RNNs) for image desc...
research
05/18/2017

Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks

Linking human whole-body motion and natural language is of great interes...
research
07/29/2021

Video Generation from Text Employing Latent Path Construction for Temporal Modeling

Video generation is one of the most challenging tasks in Machine Learnin...
research
09/09/2021

Reconstructing and grounding narrated instructional videos in 3D

Narrated instructional videos often show and describe manipulations of s...
