Bidirectional Long-Short Term Memory for Video Description

06/15/2016
by Yi Bin, et al.

Video captioning has attracted broad research attention in the multimedia community. However, most existing approaches either ignore temporal information among video frames or exploit only local contextual temporal knowledge. In this work, we propose a novel video captioning framework, termed Bidirectional Long-Short Term Memory (BiLSTM), which deeply captures bidirectional global temporal structure in video. Specifically, we first devise a joint visual modelling approach to encode video data by combining a forward LSTM pass, a backward LSTM pass, and visual features from Convolutional Neural Networks (CNNs). We then inject the derived video representation into the subsequent language model for initialization. The benefits are two-fold: 1) comprehensively preserving sequential and visual information; and 2) adaptively learning dense visual features and sparse semantic representations for videos and sentences, respectively. We verify the effectiveness of the proposed framework on a commonly used benchmark, the Microsoft Video Description (MSVD) corpus, and the experimental results demonstrate its superiority over several state-of-the-art methods.
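The joint visual encoding described above — a forward LSTM pass, a backward LSTM pass, and pooled CNN features combined into one video representation that initializes the language model — can be sketched in miniature. The code below is an illustrative toy, not the authors' implementation: it uses a single shared scalar weight in place of learned weight matrices, tiny two-dimensional "CNN features", and mean pooling as the visual branch, all of which are simplifying assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w):
    """One LSTM step with a shared scalar weight w (toy stand-in for
    learned weight matrices). x, h, c are equal-length vectors."""
    new_h, new_c = [], []
    for xi, hi, ci in zip(x, h, c):
        pre = w * xi + w * hi          # same pre-activation feeds each gate
        i = sigmoid(pre)               # input gate
        f = sigmoid(pre)               # forget gate
        o = sigmoid(pre)               # output gate
        g = math.tanh(pre)             # candidate cell update
        c_next = f * ci + i * g
        new_c.append(c_next)
        new_h.append(o * math.tanh(c_next))
    return new_h, new_c

def encode(frames, w=0.5):
    """Run one LSTM pass over a frame sequence; return the final hidden state."""
    dim = len(frames[0])
    h, c = [0.0] * dim, [0.0] * dim
    for x in frames:
        h, c = lstm_step(x, h, c, w)
    return h

def bilstm_video_representation(frames):
    """Concatenate the forward pass, the backward pass, and mean-pooled
    frame features into one video vector (the language-model initializer)."""
    fwd = encode(frames)                      # forward temporal pass
    bwd = encode(list(reversed(frames)))      # backward temporal pass
    dim = len(frames[0])
    pooled = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return fwd + bwd + pooled

# Toy "CNN features": three frames, two dimensions each.
frames = [[0.2, 0.4], [0.6, 0.1], [0.3, 0.9]]
video_vec = bilstm_video_representation(frames)
print(len(video_vec))  # 6 = forward (2) + backward (2) + pooled (2)
```

In the paper's full model the two LSTM passes use learned gate matrices over high-dimensional CNN frame features, but the structure is the same: order-sensitive summaries from both temporal directions are fused with the raw visual features before decoding begins.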


