Exploiting long-term temporal dynamics for video captioning

02/22/2022
by Yuyu Guo, et al.

Automatically describing videos with natural language is a fundamental challenge for both computer vision and natural language processing. Recent progress on this problem follows two steps: 1) employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) (e.g., VGG, ResNet, or C3D) to extract spatial and/or temporal features that encode video content; and 2) applying Recurrent Neural Networks (RNNs) to generate sentences describing the events in the video. Temporal attention-based models have made considerable progress by weighing the importance of each video frame. However, for a long video, especially one that consists of several sub-events, we should discover and leverage the importance of each sub-shot rather than each frame. In this paper, we propose a novel approach, temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences. In TS-LSTM, a temporal pooling LSTM (TP-LSTM) incorporates both spatial and temporal information to extract long-term temporal dynamics within video sub-shots, and a stacked LSTM generates the sequence of words that describes the video. Experimental results on two public video captioning benchmarks show that TS-LSTM outperforms state-of-the-art methods.
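To make the pipeline concrete, below is a minimal PyTorch-style sketch of the encoder-decoder flow the abstract describes: pre-extracted CNN frame features are split into sub-shots, a temporal pooling LSTM summarizes the dynamics within each sub-shot, and a stacked LSTM decoder emits the caption word by word. The fixed-length sub-shot split, the mean pooling, the layer sizes, and the class name `TSLSTMSketch` are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class TSLSTMSketch(nn.Module):
    """Sketch of the TS-LSTM pipeline: per-frame CNN features -> TP-LSTM over
    sub-shots -> stacked LSTM decoder. Sizes and the sub-shot split are
    assumptions for illustration only."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, shot_len=16):
        super().__init__()
        self.shot_len = shot_len
        # TP-LSTM: runs over the frames of one sub-shot; its pooled output
        # summarizes the long-term dynamics within that sub-shot.
        self.tp_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Stacked LSTM decoder: consumes the previous word embedding together
        # with the video summary to generate the next word.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(2 * hidden_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim) pre-extracted 2-D/3-D CNN features
        # captions:    (B, L) ground-truth word ids (teacher forcing)
        B, T, D = frame_feats.shape
        # Split frames into fixed-length sub-shots (assumes T >= shot_len;
        # the paper may segment sub-shots differently).
        shots = frame_feats[:, : T - T % self.shot_len].reshape(B, -1, self.shot_len, D)
        shot_summaries = []
        for s in range(shots.size(1)):
            out, _ = self.tp_lstm(shots[:, s])       # (B, shot_len, H)
            shot_summaries.append(out.mean(dim=1))   # temporal pooling within the sub-shot
        video_code = torch.stack(shot_summaries, dim=1).mean(dim=1)  # (B, H)

        # Decode: at each step feed [previous word embedding ; video summary].
        words = self.embed(captions[:, :-1])                          # (B, L-1, H)
        ctx = video_code.unsqueeze(1).expand(-1, words.size(1), -1)   # (B, L-1, H)
        dec_out, _ = self.decoder(torch.cat([words, ctx], dim=-1))
        return self.out(dec_out)                                      # (B, L-1, vocab)
```

In practice the sub-shot summaries would likely be attended over (per the paper's emphasis on sub-shot importance) rather than mean-pooled as done here; the averaging simply keeps the sketch short.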
