Boosting Video Captioning with Dynamic Loss Network

07/25/2021 ∙ by Nasibullah, et al. ∙ 0

Video captioning is one of the challenging problems at the intersection of vision and language, with many real-life applications in video retrieval, video surveillance, assisting visually challenged people, human-machine interfaces, and many more. Recent deep learning-based methods have shown promising results but still lag behind the results achieved on other vision tasks (such as image classification and object detection). A significant drawback of existing video captioning methods is that they are optimized over the cross-entropy loss function, which is uncorrelated with the de facto evaluation metrics (BLEU, METEOR, CIDER, ROUGE). In other words, cross-entropy is not a proper surrogate of the true loss function for video captioning. This paper addresses the drawback by introducing a dynamic loss network (DLN), which provides an additional feedback signal that directly reflects the evaluation metrics. Our results on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSRVTT) datasets outperform previous methods.




1 Introduction

Video captioning is the task of describing the content of a video in natural language. With the explosion of sensors and the internet as a data carrier, automatic video understanding and captioning have become essential. Video captioning has many applications, such as video surveillance, assisting visually challenged people, and video retrieval. Despite these applications, the need to jointly model spatial appearance and temporal dynamics makes it a difficult task.

Motivated by machine translation [4] and image captioning [5, 6], the encoder-decoder architecture has recently been adapted for the video captioning task [7, 8, 2, 9, 1]. With the limited amount of spatio-temporal data, pre-trained ConvNets have been successful at the encoder side. The decoder exploits a variant of recurrent neural networks (LSTM, GRU) to generate more relevant captions. Some recent approaches [9] force the decoder to capture better semantics about the video by introducing a reconstruction loss, while others have used multi-modal features [29] and memory-based modeling [2].

However, a potential drawback of these methods is that the training signal does not align with the standard evaluation metrics such as BLEU [19], METEOR [30], ROUGE-L [31], and CIDER [32]. As a result, even low training and validation loss can lead to poor metric scores and vice versa, as shown in Fig.1(b). Direct optimization over the metric function is not possible due to the non-differentiable nature of the network. To this end, we propose a dynamic loss network (DLN), a transformer-based model that approximates the metric function and is pre-trained on external data using a self-supervised setup. Although the proposed DLN can be utilized to approximate any metric function, in our case, we approximate the BLEU score. Once trained, the DLN can be used with the video captioning model in an end-to-end manner, as shown in Fig.1(a).

Figure 1: (a) The proposed Dynamic loss network (DLN) with an encoder-decoder architecture. The encoder-decoder relies on the standard cross-entropy training signal whereas the DLN introduces additional training signal aligned with the evaluation metrics. (b) Training signal and evaluation metric curve for a standard encoder-decoder architecture. (c) Training signal and evaluation metric curve for our proposed architecture.

Finally, we demonstrate that the feedback signals from our proposed model align with the evaluation metric, as shown in Fig.1(c).

2 Related Work

2.0.1 Video Captioning.

The main breakthrough in video captioning happened with the inception of encoder-decoder based seq2seq models. The encoder-decoder framework for video captioning was first introduced by MP-LSTM [7], which uses mean pooling over frame features and then decodes the caption with an LSTM. The temporal nature of video was first modeled by S2VT [1] and SA-LSTM [8]. The former shares a single LSTM for both the encoder and the decoder, while the latter uses attention over frame features along with 3D HOG features. RecNet [9] uses backward flow and a reconstruction loss to capture better semantics, whereas MARN [2] uses memory to capture the correspondence between a word and its various similar visual contexts. More recently, ORG-TRL [3] has modeled object interaction via a graph convolutional network (GCN). As these methods suffer from an improper training signal, some effort has already been made to mitigate this issue.
Training on evaluation metric function. Ranzato et al. [10] use the REINFORCE algorithm [11] to train an image captioning model directly on the BLEU score, whereas Rennie et al. [12] use the actor-critic method [13]. Both methods use a reward signal, but they are not applicable to video captioning due to the sparse nature of the reward. Zhukov et al. [14] propose a differentiable lower bound of the expected BLEU score, whereas Casas et al. [15] reported a poor training signal from their formulation of a differentiable BLEU score [19]. Unlike previous works, we leverage successful Transformer-based pre-trained models to approximate the evaluation metrics.

3 Method

Our proposed method follows a two-stage training process. In the first stage, the DLN is trained in a self-supervised setup, whereas in the second stage, the trained DLN is used along with the existing video captioning model. The entire process flow is shown in Fig.2. During the second stage, the loss from the DLN back-propagates through the encoder-decoder model and forces it to capture a better representation. Moreover, the proposed loss network can be combined with different encoder-decoder architectures for video captioning. In this paper, we have employed the DLN on top of MP-LSTM [7], SA-LSTM [8], S2VT [1], RecNet [9], and MARN [2].

3.1 Encoder-Decoder

The video captioning task aims at generating a sentence $S = \{w_1, w_2, \ldots, w_T\}$ that describes the content of a given video $V = \{v_1, v_2, \ldots, v_N\}$, where $w_i$ is a word in the caption and $v_i$ is a frame in video V. The standard encoder-decoder architecture models the conditional word distribution given the previous tokens and the video feature:

$$P(S \mid V; \theta) = \prod_{t=1}^{T} P(w_t \mid w_{<t}, V; \theta) \tag{1}$$

where $\theta$ denotes the learnable parameters and $w_t$ is the $t$-th word in the sentence of length T.

To generate a proper caption, semantically rich visual features need to be extracted. An encoder is responsible for the video feature extraction, and its output is fed to the downstream decoder. A typical encoding method is to extract video frame features using pre-trained 2D ConvNets (Inception-v4 [25], ResNet [26], etc.) and fuse them to get a fixed-length video feature. In our implementation, we follow the same encoding strategy as MP-LSTM [7], SA-LSTM [8], RecNet [9], and MARN [2].
Decoder. The decoder generates the caption word by word based on the visual features obtained from the encoder. A recurrent neural network is utilized as the backbone of the decoder because of its superior temporal modeling capability. In the proposed system, the decoder is designed using LSTM [7] with attention over frame features [8].
The encoder-decoder model is jointly trained by maximizing the log-likelihood of the sentence given the video feature:

$$\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{T} \log P(w_t \mid w_{<t}, V; \theta) \tag{2}$$
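As a rough illustration of this training objective, the sketch below sums the log-probabilities that a decoder assigns to the ground-truth words of one caption. The per-step distributions are toy values, and the function name `sentence_log_likelihood` is hypothetical; a real decoder would produce these distributions from the video feature and the previous tokens.

```python
import math

def sentence_log_likelihood(step_probs, target_ids):
    """Sum of log P(w_t | w_<t, V) over the words of one caption.

    step_probs[t] is the decoder's distribution over the vocabulary at step t,
    target_ids[t] is the index of the ground-truth word at step t.
    """
    total = 0.0
    for probs, target in zip(step_probs, target_ids):
        total += math.log(probs[target])
    return total

# Toy example (assumed values): vocabulary of 3 words, caption of 2 words.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
target_ids = [0, 1]

log_likelihood = sentence_log_likelihood(step_probs, target_ids)
# Maximizing the log-likelihood is equivalent to minimizing this
# per-token cross-entropy.
cross_entropy = -log_likelihood / len(target_ids)
```

Nothing in this quantity depends on n-gram overlap with the reference caption, which is one way to see why a low cross-entropy does not by itself guarantee a high BLEU score.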


3.2 Dynamic Loss Network (DLN)

As shown in Fig.1(a), the proposed DLN is built on top of the encoder-decoder. The DLN provides an additional training signal aligned with the evaluation metric. The proposed DLN approximates the BLEU [19] score, which involves a mapping from a pair of sentences to a numerical value. Motivated by its tremendous success in vision and natural language processing (NLP), a pre-trained transformer network [28] is used as the backbone for the proposed DLN. In the following, we briefly describe the training of the DLN and how video captioning is achieved with the trained DLN.

Figure 2: (a) Training of Dynamic Loss Network in self-supervised setup. (b) End-to-end training of video captioning model along with DLN. (c) Video captioning model at test time.

Training of Dynamic Loss Network. The DLN is trained in a self-supervised manner. As external data, the caption corpus of MSCOCO [18] is utilized for the training of the DLN, specifically, for the modeling of the BLEU-1 to BLEU-4 scores. The training data is generated following two strategies: (i) we perturb each sentence randomly with a p% probability to generate a (candidate, reference) pair. For the perturbation, deletion and swapping are done over the word(s). (ii) we use the predicted and ground truth captions as the (candidate, reference) pair during the training of an image captioning model on MSCOCO [18] data. In both cases, the ground truth (BLEU-1 to BLEU-4) is generated using the NLTK [21] library. Note that the BLEU score is computed based on the modified n-gram precision between the candidate and reference sentences:

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} \alpha_n \log p_n\right) \tag{3}$$

where $p_n$ is the modified precision between n-grams, $\alpha_n$ is a scalar weight corresponding to $p_n$, and $BP$ is the brevity penalty.
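The data-generation strategy (i) and the BLEU target can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' pipeline: the paper computes the targets with NLTK, whereas here BLEU is re-implemented for a single reference with uniform weights 1/N and no smoothing. The helper names (`perturb`, `bleu`, `modified_precision`) and the per-word application of the probability `p` (given as a fraction, not a percentage) are assumptions.

```python
import math
import random
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / total

def bleu(candidate, reference, max_n=4):
    """BLEU with uniform weights 1/max_n, one reference, no smoothing."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

def perturb(tokens, p, rng):
    """Delete each word with probability p, then swap adjacent words with probability p."""
    out = [t for t in tokens if rng.random() >= p]
    for i in range(len(out) - 1):
        if rng.random() < p:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

rng = random.Random(0)
reference = "a man is playing a guitar on the stage".split()
candidate = perturb(reference, p=0.2, rng=rng)
target = bleu(candidate, reference)  # regression target for this (candidate, reference) pair
```

Each (candidate, reference, target) triple produced this way is one self-supervised training example; no human labels are needed because the target is computed from the pair itself.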
The self-attention layer in the transformer network (more specifically, a transformer network with words as input) calculates the attention score between words. This characteristic makes the transformer network [28] a natural choice to model the BLEU score function. Although BERT [22] and GPT [23] are state-of-the-art pre-trained transformer architectures, they are not suitable for modeling the BLEU score due to their subword input tokenization. Instead, the BLEU score is modeled using the TransformerXL [24] architecture, which works with standard word input (similar to the LSTM decoder). A regression head is added on top of the standard TransformerXL network, and the model is trained by maximizing the log-likelihood of the BLEU scores given the reference and candidate sentences:

$$\phi^{*} = \arg\max_{\phi} \log P(\mathrm{BLEU} \mid R, C; \phi) \tag{4}$$

where $\phi$ denotes the learnable parameters and $R$, $C$ are the reference and candidate sentences, respectively.

Video Captioning with Dynamic Loss Network. Once trained, the DLN is combined with the standard encoder-decoder network at the second stage of training. The proposed DLN is applied only at the training stage, so there is no run-time overhead during inference. As shown in Fig.2(b), the DLN takes the output of the decoder and the ground truth caption as inputs. During training, the output value of the DLN is added to the cross-entropy loss, and the model is trained on the combined loss function:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \, \mathcal{L}_{DLN} \tag{5}$$

where $\mathcal{L}_{CE}$ is the cross-entropy loss and $\mathcal{L}_{DLN}$ is the loss from the DLN. In the early phase of training, cross-entropy acts as a better training signal, so we rely more on the cross-entropy loss. In the later phase of training, we rely more on the loss from the proposed loss network. To this end, the hyper-parameter $\lambda$ is introduced, and its value increases along with the iterations.
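The weighting described above can be sketched as a schedule for the trade-off parameter plus a weighted sum of the two losses. The linear shape of the schedule and the function names `trade_off` and `combined_loss` are assumptions made for illustration; the text only states that the parameter grows with the iterations, and the experiments report a range of 0.1 to 0.5.

```python
def trade_off(step, total_steps, start=0.1, end=0.5):
    """Linearly increase the trade-off parameter from `start` to `end` (assumed schedule)."""
    frac = step / max(total_steps - 1, 1)
    frac = min(max(frac, 0.0), 1.0)
    return start + frac * (end - start)

def combined_loss(ce_loss, dln_loss, lam):
    """Cross-entropy loss plus the DLN loss weighted by the trade-off parameter."""
    return ce_loss + lam * dln_loss

# Early in training the DLN term is weighted lightly, later more heavily.
early = combined_loss(ce_loss=2.0, dln_loss=1.0, lam=trade_off(0, 100))
late = combined_loss(ce_loss=2.0, dln_loss=1.0, lam=trade_off(99, 100))
```

With the same raw loss values, the combined loss gives the DLN term five times more weight at the end of this assumed schedule than at the start.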

4 Experiments and Results

We have conducted experiments to evaluate the performance of the proposed DLN-based video captioning on two benchmark datasets: Microsoft Research-Video to Text (MSRVTT) [16] and Microsoft Research Video Description Corpus (MSVD) [17]. We have investigated the effect of DLN on the performance of the state-of-the-art methods for video captioning.

4.1 Datasets

MSVD. MSVD contains 1970 open-domain YouTube videos with approximately 40 sentences per clip. Each clip shows a single activity and lasts 10 to 25 seconds. We have followed the standard split [7, 8, 2] of 1200 videos for training, 100 for validation, and 670 for testing.

MSRVTT. MSRVTT is the largest open domain video captioning dataset with 10k videos and 20 categories. Each video clip is annotated with 20 sentences, resulting in 200k video-sentence pairs. We have followed the public benchmark splits, i.e., 6513 for training, 497 for validation, and 2990 for testing.

4.2 Implementation Details

We have uniformly sampled 28 frames per video and extracted 1536-dimensional appearance features from InceptionV4 [25], pre-trained on ImageNet [20]. At the decoder end, the hidden layer contains 512 units, and the size of the word embedding is set to 468. In the attention-based recurrent decoder, the dimension of the attention module is set to 256. All sentences longer than 30 words are truncated. Apart from these, we have followed the standard settings mentioned in the models [7, 1, 8, 9, 2] while adding the DLN on top of them. For the DLN, we use a TransformerXL [24] with 16 attention heads and 18 layers, pre-trained on WikiText-103. A regression head composed of three fully connected (FC) layers is added on top of the TransformerXL [24]. BLEU-1 to BLEU-4 are modeled through the training of the DLN. During both stages of training, the learning rate for the DLN and the end-to-end video captioning model is set to 1e-4. Adam [27] is employed for optimization. The model selection is made using the validation set performance. At test time, beam search with a beam size of 5 is used for the final caption generation.
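The uniform frame sampling mentioned above can be sketched as picking evenly spaced frame indices between the first and last frame of a clip. The function name `uniform_frame_indices` and the choice to include both endpoints are assumptions; the text does not specify the exact sampling rule.

```python
def uniform_frame_indices(num_frames, num_samples=28):
    """Return `num_samples` evenly spaced frame indices in [0, num_frames - 1]."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    # Integer arithmetic keeps the endpoints exact; clips shorter than
    # num_samples simply repeat some indices.
    return [(i * (num_frames - 1)) // (num_samples - 1) for i in range(num_samples)]

# Example: a 300-frame clip reduced to 28 frames.
indices = uniform_frame_indices(300)
```

Each selected frame would then be passed through the pre-trained ConvNet to obtain one 1536-dimensional feature vector, giving a 28 x 1536 feature matrix per video.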

4.3 Experimental Results

We have compared our proposed model with the existing video captioning models on MSVD and MSRVTT datasets, as shown in Table.1 and Table.2. All four popular evaluation metrics, including BLEU, METEOR, ROUGE, and CIDER are reported.

                       without DLN                       with DLN (Ours)
Model                  BLEU   METEOR  ROUGE  CIDER       BLEU   METEOR  ROUGE  CIDER
MP-LSTM [7]            33.3   29.1    -      -           43.6   32.5    69     79
S2VT [1]               39.6   31.2    67.5   66.7        42.3   32.7    68.1   73.7
SA-LSTM [8]            45.3   31.9    64.2   76.2        43.9   33.1    68.7   77.2
RecNet (global) [9]    51.1   34.0    69.4   79.7        50     34.1    71.1   78.5
RecNet (local) [9]     52.3   34.1    69.8   80.3        51.8   34.4    69.9   81
MARN [2]               48.6   35.1    71.9   92.2        52.3   35      72.3   89.7

Table 1: Performance of different video captioning models on MSVD dataset with and without DLN in terms of four metrics.

It is apparent that adding the DLN yields gains in captioning performance on most metrics for the majority of the models. In our experiments, the training is performed with and without the DLN under the same settings as described by the existing methods.

                       without DLN                       with DLN (Ours)
Model                  BLEU   METEOR  ROUGE  CIDER       BLEU   METEOR  ROUGE  CIDER
MP-LSTM [7]            32.3   23.4    -      -           35.3   26.0    57.9   37.7
SA-LSTM [8]            34.8   23.8    -      -           35.1   26.3    58.1   38.1
RecNet (global) [9]    38.3   26.2    59.1   41.7        38.7   26.5    58.7   41.9
RecNet (local) [9]     39.1   26.6    59.3   42.7        38.9   26.7    58.9   42.7
MARN [2]               40.4   28.1    60.7   47.1        40.5   27.9    60.8   45.1

Table 2: Performance of different video captioning models on MSRVTT dataset with and without DLN in terms of four evaluation metrics.

Significance of BLEU as a prediction function for DLN. Among the different evaluation metrics, BLEU is the metric least correlated with the training signal, as apparent from Fig.1(b). Moreover, there is a significant similarity between the mechanism of self-attention in the transformer network and the BLEU score in Eq.3. The efficacy of the DLN model has been demonstrated in Table.1 and Table.2.
Study on the training of the DLN. The DLN is trained to predict BLEU-1 to BLEU-4. Since the idea of the DLN is newly proposed in this paper, no benchmark results are available for this task. Hence, a qualitative analysis is performed by comparing histograms of the ground truth and the predicted values on the test set, as shown in Fig.3.

Figure 3: Comparison of BLEU-4 Histograms: ground truth vs model prediction.

Study on the Trade-off Parameter $\lambda$. The trade-off parameter $\lambda$ plays a crucial role in controlling the combined loss (Eq.5) during training. Experimentally, it is observed that the augmentation with the DLN has improved the performance of the majority of the state-of-the-art video captioning models. It is also observed that the quality of the captions deteriorates with higher values of $\lambda$. We empirically set the initial $\lambda$ to 0.1 and increase it up to 0.5 along with the iterations.

5 Conclusion

This work addresses the mismatch between the training signal and the evaluation metrics in existing video captioning models and proposes a dynamic loss network, which models the evaluation metric under consideration (BLEU in our case). The training is performed in two stages, and the experimental results on the benchmark datasets support the superiority of the proposed DLN over the existing encoder-decoder based video captioning models.


  • [1] Subhashini Venugopalan and Marcus Rohrbach and Jeffrey Donahue and Raymond Mooney and Trevor Darrell and Kate Saenko. Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pp. 4534–4542. 2015.
  • [2] Wenjie Pei and Jiyuan Zhang and Xiangrong Wang and Lei Ke and Xiaoyong Shen and Yu-Wing Tai. Memory-attended recurrent network for video captioning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, pp. 8347–8356, 2019.
  • [3] Ziqi Zhang and Yaya Shi and Chunfeng Yuan and Bing Li and Peijin Wang and Weiming Hu and Zhengjun Zha. Object Relational Graph with Teacher-Recommended Learning for Video Captioning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2020.
  • [4] Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, September 2014.
  • [5] O. Vinyals, A. Toshev, S. Bengio and D. Erhan, ”Show and tell: A neural image caption generator,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015, pp. 3156-3164, doi: 10.1109/CVPR.2015.7298935.
  • [6] J. Lu, C. Xiong, D. Parikh and R. Socher, ”Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 3242-3250, doi: 10.1109/CVPR.2017.345.
  • [7] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACLHLT, 2015.
  • [8] Yao, Li and Torabi, Atousa and Cho, Kyunghyun and Ballas, Nicolas and Pal, Christopher and Larochelle, Hugo and Courville, Aaron. (2015). Describing Videos by Exploiting Temporal Structure. 10.1109/ICCV.2015.512.
  • [9] Bairui Wang, Lin Ma, Wei Zhang, and Wei Liu. Reconstruction network for video captioning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 7622–7631, 2018.
  • [10] Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. ICLR, 2015
  • [11] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine Learning, pages 229–256, 1992.
  • [12] Steven J. Rennie and Etienne Marcheret and Youssef Mroueh and Jerret Ross and Vaibhava Goel. Self-critical Sequence Training for Image Captioning. CoRR. abs/1612.00563, 2016.
  • [13] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.
  • [14] Vlad Zhukov and Maksim Kretov. Differentiable lower bound for expected BLEU score. CoRR. abs/1712.04708, 2017.
  • [15] Noe Casas and José A.R. Fonollosa and Marta R. Costa-jussà. A differentiable BLEU loss. Analysis and first results. ICLR, 2018.
  • [16] J. Xu, T. Mei, T. Yao, and Y. Rui. Msr-vtt: A large video description dataset for bridging video and language. CVPR, June 2016
  • [17] D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 190–200. Association for Computational Linguistics, 2011.
  • [18] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. ECCV, 2014.
  • [19] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
  • [20] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [21] Bird S, Klein E, Loper E. Natural language processing with Python: Analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
  • [22] Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics.
  • [23] Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya. Language Models are Unsupervised Multitask Learners. 2019.
  • [24] Dai, Zihang and Yang, Zhilin and Yang, Yiming and Carbonell, Jaime and Le, Quoc and Salakhutdinov, Ruslan. (2019). Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. 2978-2988. 10.18653/v1/P19-1285.
  • [25] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, 2017.
  • [26] He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian. (2016). Deep Residual Learning for Image Recognition. 770-778. 10.1109/CVPR.2016.90.
  • [27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [28] Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan and Kaiser, Lukasz and Polosukhin, Illia. (2017). Attention Is All You Need.
  • [29] V. Ramanishka, A. Das, D. H. Park, S. Venugopalan, L. A. Hendricks, M. Rohrbach, and K. Saenko. Multimodal video description. In ACM Multimedia Conference, 2016.
  • [30] S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, volume 29, pages 65–72, 2005.
  • [31] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop, volume 8. Barcelona, Spain, 2004.
  • [32] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015.