Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning

05/03/2019
by Jingwen Chen, et al.

Video captioning is widely regarded as a fundamental but challenging task in both computer vision and artificial intelligence. The prevalent approach maps an input video to a variable-length output sentence in a sequence-to-sequence manner via a Recurrent Neural Network (RNN). Nevertheless, RNN training still suffers to some degree from the vanishing/exploding gradient problem, making optimization difficult. Moreover, the inherently recurrent dependency in an RNN prevents parallelization within a sequence during training and therefore limits computation. In this paper, we present a novel design, Temporal Deformable Convolutional Encoder-Decoder Networks (dubbed TDConvED), that fully employs convolutions in both the encoder and decoder networks for video captioning. Technically, we exploit convolutional block structures that compute intermediate states over a fixed number of inputs and stack several blocks to capture long-term relationships. The encoder is further equipped with temporal deformable convolution to enable free-form deformation of temporal sampling. Our model also capitalizes on a temporal attention mechanism for sentence generation. Extensive experiments are conducted on both the MSVD and MSR-VTT video captioning datasets, and superior results are reported compared to conventional RNN-based encoder-decoder techniques. More remarkably, TDConvED increases CIDEr-D performance from 58.8 to 67.2.
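To make the temporal deformable sampling idea concrete, here is a minimal numpy sketch, not the authors' implementation: for each output time step, a regular kernel tap position is shifted by a learned fractional offset, and the input frame feature at that fractional position is recovered by linear interpolation before being weighted by the kernel. The function name, the depthwise-style `(K, D)` kernel shape, and the explicit `offsets` argument are illustrative assumptions; in the paper the offsets would be predicted by a learned sub-network.

```python
import numpy as np

def temporal_deformable_conv1d(x, weights, offsets):
    """Sketch of a temporal deformable 1D convolution (illustrative, not the paper's code).

    x:       (T, D) sequence of T frame features of dimension D
    weights: (K, D) depthwise kernel weights for K temporal taps (assumed shape)
    offsets: (T, K) fractional temporal offsets, one per output step and tap
    Returns: (T, D) output sequence.
    """
    T, D = x.shape
    K = weights.shape[0]
    half = (K - 1) // 2
    out = np.zeros((T, D))
    for t in range(T):
        for k in range(K):
            # regular sampling position plus a learned fractional offset
            p = np.clip(t + (k - half) + offsets[t, k], 0, T - 1)
            lo, hi = int(np.floor(p)), int(np.ceil(p))
            frac = p - lo
            # linear interpolation between the two neighbouring frames
            sample = (1 - frac) * x[lo] + frac * x[hi]
            out[t] += weights[k] * sample
    return out
```

With all offsets set to zero the operation reduces to an ordinary (depthwise) temporal convolution; non-zero offsets let each output step sample frames at irregular, input-dependent temporal positions, which is the "free-form deformation of temporal sampling" the encoder exploits.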
