Video Captioning with Text-based Dynamic Attention and Step-by-Step Learning

11/05/2019
by Huanhou Xiao, et al.

Automatically describing video content with natural language has attracted much attention in the computer vision (CV) and natural language processing (NLP) communities. Most existing methods predict one word at a time, feeding only the last generated word back as input at the next time step, so the other previously generated words are not fully exploited. Furthermore, traditional methods optimize the model on all training samples in every epoch without considering how well each sample has already been learned, which wastes training effort and fails to target the difficult samples. To address these issues, we propose a text-based dynamic attention model named TDAM, which applies a dynamic attention mechanism to all the generated words in order to enrich the contextual semantic information and strengthen control over the whole sentence. Moreover, the text-based dynamic attention mechanism and the visual attention mechanism are linked so that both focus on the important words, and they benefit from each other during training. Accordingly, the model is trained in two steps: "starting from scratch" and "checking for gaps". The former optimizes the model on all samples, while the latter trains only on samples that remain poorly captioned. Experimental results on the popular MSVD and MSR-VTT datasets demonstrate that our non-ensemble model outperforms state-of-the-art video captioning methods.
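
To make the two ideas in the abstract concrete, here is a minimal PyTorch sketch of a text-based dynamic attention step that attends over the embeddings of all previously generated words rather than only the last one. This is an illustrative reconstruction under assumptions, not the authors' released code; the module name TextDynamicAttention and all parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextDynamicAttention(nn.Module):
    """Attend over the embeddings of all previously generated words,
    not just the last one, when building the next decoder input."""

    def __init__(self, embed_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_h = nn.Linear(hidden_dim, attn_dim)  # projects the decoder state
        self.w_e = nn.Linear(embed_dim, attn_dim)   # projects each word embedding
        self.v = nn.Linear(attn_dim, 1)             # reduces to a scalar score

    def forward(self, hidden, word_embeds):
        # hidden:      (batch, hidden_dim)    current decoder hidden state
        # word_embeds: (batch, t, embed_dim)  embeddings of words generated so far
        scores = self.v(torch.tanh(self.w_h(hidden).unsqueeze(1)
                                   + self.w_e(word_embeds)))   # (batch, t, 1)
        alpha = F.softmax(scores, dim=1)            # weights over generated words
        context = (alpha * word_embeds).sum(dim=1)  # (batch, embed_dim) text context
        return context, alpha.squeeze(-1)
```

The two-step schedule can likewise be sketched as one pass over all samples ("starting from scratch") followed by a pass that retrains only poorly fit samples ("checking for gaps"). The helper caption_loss (assumed to return one loss value per sample) and the loss-threshold selection criterion are assumptions for illustration; the paper's actual criterion may differ.

```python
def two_step_training(model, loader, optimizer, caption_loss, threshold=1.0):
    # Step 1: "starting from scratch" -- optimize on every training sample.
    for videos, captions in loader:
        loss = caption_loss(model, videos, captions).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Step 2: "checking for gaps" -- revisit only the samples the model
    # still captions poorly, selected here by an assumed per-sample
    # loss threshold.
    for videos, captions in loader:
        with torch.no_grad():
            per_sample = caption_loss(model, videos, captions)  # shape (batch,)
        hard = per_sample > threshold
        if hard.any():
            loss = caption_loss(model, videos[hard], captions[hard]).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```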

Related Research

06/05/2017
Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
Recent progress has been made in using attention based encoder-decoder f...

12/26/2018
Hierarchical LSTMs with Adaptive Attention for Visual Captioning
Recent progress has been made in using attention based encoder-decoder f...

12/01/2016
Video Captioning with Multi-Faceted Attention
Recently, video captioning has been attracting an increasing amount of i...

01/01/2019
Not All Words are Equal: Video-specific Information Loss for Video Captioning
An ideal description for a given video should fix its gaze on salient an...

10/16/2021
Visual-aware Attention Dual-stream Decoder for Video Captioning
Video captioning is a challenging task that captures different visual pa...

05/10/2019
Memory-Attended Recurrent Network for Video Captioning
Typical techniques for video captioning follow the encoder-decoder frame...

03/13/2021
Approximating How Single Head Attention Learns
Why do models often attend to salient words, and how does this evolve th...
