Describing Videos by Exploiting Temporal Structure

02/27/2015
by Li Yao, et al.

Recent progress in using recurrent neural networks (RNNs) for image description has motivated the exploration of their application to video description. However, while images are static, working with videos requires modeling their dynamic temporal structure and then properly integrating that information into a natural language description. In this context, we propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions. First, our approach incorporates a spatio-temporal 3-D convolutional neural network (3-D CNN) representation of short temporal dynamics. The 3-D CNN representation is trained on video action recognition tasks, so as to produce a representation that is tuned to human motion and behavior. Second, we propose a temporal attention mechanism that allows the model to go beyond local temporal modeling and learn to automatically select the most relevant temporal segments given the text-generating RNN. Our approach exceeds the current state of the art on both the BLEU and METEOR metrics on the Youtube2Text dataset. We also present results on a new, larger, and more challenging dataset of paired videos and natural language descriptions.
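To make the temporal attention mechanism described above more concrete, the following is a minimal sketch (not the authors' implementation) of soft attention over temporal segments: at each decoding step, the text-generating RNN's hidden state scores every segment-level video feature, and the softmax-weighted average of the segment features is returned as the context for the next word. PyTorch, the module name, and all dimensions are illustrative assumptions.

```python
# Illustrative sketch of soft temporal attention for video description.
# Not the authors' code; names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)    # projects segment features
        self.state_proj = nn.Linear(hidden_dim, attn_dim) # projects decoder state
        self.score = nn.Linear(attn_dim, 1)               # scalar relevance score

    def forward(self, segment_feats: torch.Tensor, decoder_state: torch.Tensor):
        # segment_feats: (batch, n_segments, feat_dim) per-segment video features
        # decoder_state: (batch, hidden_dim) current hidden state of the text RNN
        energies = self.score(torch.tanh(
            self.feat_proj(segment_feats)
            + self.state_proj(decoder_state).unsqueeze(1)
        )).squeeze(-1)                          # (batch, n_segments)
        weights = F.softmax(energies, dim=-1)   # attention weights over time
        context = (weights.unsqueeze(-1) * segment_feats).sum(dim=1)  # (batch, feat_dim)
        return context, weights

# Toy usage: 26 temporal segments of 1024-d features, 512-d decoder state.
attn = TemporalAttention(feat_dim=1024, hidden_dim=512)
feats = torch.randn(2, 26, 1024)
state = torch.randn(2, 512)
context, weights = attn(feats, state)
print(context.shape, weights.shape)  # torch.Size([2, 1024]) torch.Size([2, 26])
```

The context vector would then be fed, together with the previous word, into the description-generating RNN at each step, so the model can focus on different temporal segments while emitting different words.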

Related research

- StNet: Local and Global Spatial-Temporal Modeling for Action Recognition (11/05/2018)
- Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions (08/27/2018)
- Localizing Moments in Video with Natural Language (08/04/2017)
- Sequential anatomy localization in fetal echocardiography videos (10/28/2018)
- Translating Videos to Natural Language Using Deep Recurrent Neural Networks (12/15/2014)
- Exploiting long-term temporal dynamics for video captioning (02/22/2022)
- Temporal Distinct Representation Learning for Action Recognition (07/15/2020)
