Log In Sign Up

Delving Deeper into the Decoder for Video Captioning

by   Haoran Chen, et al.

Video captioning is an advanced multi-modal task which aims to describe a video clip using a natural language sentence. The encoder-decoder framework is the most popular paradigm for this task in recent years. However, there still exist some non-negligible problems in the decoder of a video captioning model. We make a thorough investigation into the decoder and adopt three techniques to improve the performance of the model. First of all, a combination of variational dropout and layer normalization is embedded into a recurrent unit to alleviate the problem of overfitting. Secondly, a new method is proposed to evaluate the performance of a model on a validation set so as to select the best checkpoint for testing. Finally, a new training strategy called professional learning is proposed which develops the strong points of a captioning model and bypasses its weaknesses. It is demonstrated in the experiments on Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets that our model has achieved the best results evaluated by BLEU, CIDEr, METEOR and ROUGE-L metrics with significant gains of up to 11.7 state-of-the-art models.


page 4

page 6


Guidance Module Network for Video Captioning

Video captioning has been a challenging and significant task that descri...

Reconstruction Network for Video Captioning

In this paper, the problem of describing visual contents of a video sequ...

Image to Language Understanding: Captioning approach

Extracting context from visual representations is of utmost importance i...

Empirical Autopsy of Deep Video Captioning Frameworks

Contemporary deep learning based video captioning follows encoder-decode...

CLIP4Caption ++: Multi-CLIP for Video Caption

This report describes our solution to the VALUE Challenge 2021 in the ca...

Video Captioning: a comparative review of where we are and which could be the route

Video captioning is the process of describing the content of a sequence ...

From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning

Video captioning in essential is a complex natural process, which is aff...

Code Repositories


Source code for Delving Deeper into the Decoder for Video Captioning

view repo