Multi-modal Feature Fusion with Feature Attention for VATEX Captioning Challenge 2020

by Ke Lin, et al.
Peking University

This report describes our model for the VATEX Captioning Challenge 2020. First, to gather information from multiple domains, we extract motion, appearance, semantic and audio features. We then design a feature attention module that attends to different features during decoding. We apply two types of decoders, top-down and X-LAN, and ensemble these models to obtain the final result. The proposed method outperforms the official baseline by a significant margin. We achieve 76.0 CIDEr on the English private test set and 50.0 CIDEr on the Chinese private test set, ranking 2nd on both private test leaderboards.



1 Introduction

Video captioning is the task of automatically generating a natural language description of a given video, and it has drawn increasing attention from both the academic and industrial communities. The encoder-decoder framework is the most commonly used structure in video captioning, where a CNN encoder gathers multi-modal information and an RNN decoder generates captions [10]. Generally, a 2-D and/or a 3-D CNN is used as the encoder to extract visual features [1]. Regional features [14], semantic features [3] and audio features [11] have also been used in prior video captioning work to boost performance. During training, the ground truth is fed to the model, while at inference, the model's own predicted words are fed back in; this mismatch in the previous-word distribution between training and inference is the so-called "exposure bias" problem [13]. To overcome it, scheduled sampling [9] and reward optimization [8] are the most frequently used approaches.
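Scheduled sampling can be pictured as a per-step coin flip between the ground-truth token and the model's previous prediction. The helper names and the linear decay schedule below are illustrative assumptions, not the exact setup of any cited paper:

```python
import random

def choose_input_token(gold_token, predicted_token, teacher_forcing_prob):
    # With probability `teacher_forcing_prob`, feed the ground-truth token
    # (teacher forcing); otherwise feed the model's own previous prediction.
    return gold_token if random.random() < teacher_forcing_prob else predicted_token

def linear_decay(epoch, start=1.0, end=0.25, total_epochs=20):
    # Linearly decay the teacher-forcing probability so the model is
    # gradually exposed to its own predictions as training progresses.
    frac = min(epoch / total_epochs, 1.0)
    return start + (end - start) * frac
```

Decaying the probability toward the model's own predictions is what narrows the train/inference gap that causes exposure bias.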

In this paper, to gather information from multiple domains, we use two 3-D CNNs to extract motion features, a 2-D CNN to extract appearance features, an ECO network [16] to extract 2-D & 3-D fusion features, a semantic model to extract semantic features and an audio network to extract audio features. To fuse these features, we propose a feature attention module that assigns different weights to different features. We use two recent decoders, the top-down model [2] and the X-LAN model [7], separately. We apply a multi-stage training strategy that trains the model with cross-entropy loss, word-level oracle and self-critical training in turn. We ensemble the top-down and X-LAN models to obtain the final captioning result.

2 Methods

Figure 1: The proposed video captioning framework.

2.1 Multi-modal Feature

To enlarge the representation ability of the encoder, we extract multi-modal features from the motion, appearance, semantic and audio domains.

Visual Feature. To obtain motion features of a given video, we use two types of 3-D CNN. First, we use the I3D features provided by the VATEX challenge organizers. The second 3-D CNN is a SlowFast [4] network pretrained on the Kinetics-600 dataset, which has a data distribution identical to VATEX. To obtain appearance features, we apply a PNASNet [6] pretrained on ImageNet to every extracted frame. Inspired by bottom-up attention [2], local features are also important for caption generation, so we additionally extract regional features for each extracted frame using a Faster R-CNN model pretrained on the Visual Genome dataset. Features from different proposals are averaged to form the bottom-up feature for one frame. We also use an ECO [16] network pretrained on the Kinetics-400 dataset to extract 2-D & 3-D fusion features. Together, the above features compose the visual feature of a video.

Semantic Feature. Inspired by prior work [3, 1], semantic priors are also helpful for video captioning. Following Chen's work [3], we manually select the most frequent words from the training and validation sets as attributes. The semantic encoder consists of a multi-layer perceptron on top of the visual features. We also extract topic ground truth from the training and validation sets. Attribute and topic detection is treated as a multi-label classification task. We concatenate the predicted probability distribution of attributes/topics with the probability distributions of ECO, PNASNet and SlowFast to form the semantic feature. Note that the semantic feature is a one-dimensional vector; we duplicate it T times to align it with the dimensions of the other visual features, where T is the number of frames.
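The duplication step can be sketched with NumPy; `tile_semantic_feature` is a hypothetical helper illustrating how a 1-D semantic vector is repeated along the time axis to match frame-level features:

```python
import numpy as np

def tile_semantic_feature(semantic_vec, num_frames):
    # Repeat the 1-D semantic vector along a new time axis so it can be
    # concatenated with frame-level features of shape (num_frames, dim).
    semantic_vec = np.asarray(semantic_vec)
    return np.tile(semantic_vec[None, :], (num_frames, 1))

sem = np.array([0.1, 0.9, 0.3])        # toy semantic probability vector
tiled = tile_semantic_feature(sem, 32)  # shape (32, 3); every row equals `sem`
```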

Audio Feature. Inspired by Wang et al.'s work [11], we also extract audio features, since the audio track of a video provides powerful additional information. Our audio feature extractor is a VGGish [5] network pretrained on the AudioSet dataset. First, we extract MEL-spectrogram patches for each audio track. The audio sample rate is 16 kHz, the number of Mel filters is 64, the STFT window length is 25 ms and the hop length is 10 ms.
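The framing implied by these STFT parameters (a 25 ms window and 10 ms hop at 16 kHz gives 400-sample windows with a 160-sample hop) can be sketched as follows. The full VGGish front end would additionally window, FFT and Mel-filter each frame; this sketch covers only the framing step:

```python
import numpy as np

def frame_audio(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    # Slice a mono waveform into overlapping analysis frames.
    win = int(sample_rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])

one_second = np.zeros(16000, dtype=np.float32)
frames = frame_audio(one_second)  # shape (98, 400)
```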

All the above features are embedded into the same dimension using fully connected layers.

2.2 Decoder and feature attention

We apply two types of decoders in this work, the top-down model [2] and the X-LAN model [7]. Top-down consists of a two-layer GRU and an attention module. X-LAN employs a novel X-Linear attention block which fully leverages bilinear pooling to selectively focus on different visual information. We use a feature attention module to assign a different weight to each feature. Denote the features by f_1, ..., f_K, where K is the number of feature types, and let h_t be the decoder hidden state at time t:

    α_{t,i} = softmax_i( score(f_i, h_t) ),    f̂_t = Σ_{i=1}^{K} α_{t,i} f_i

By the above equations, the multi-modal features are fused into a single feature f̂_t. For the top-down model, h_t is the hidden state of the first GRU; for the X-LAN model, h_t is the hidden state of the language LSTM.
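A minimal NumPy sketch of such a feature attention step, assuming an additive (tanh) scoring function — one common choice; the report does not pin down the exact score form, so the parameters W_f, W_h and w below are illustrative:

```python
import numpy as np

def feature_attention(features, hidden, W_f, W_h, w):
    # features: (K, d) — one row per modality feature
    # hidden:   (h,)   — decoder hidden state h_t
    # Additive scoring, softmax over the K features, then a weighted sum.
    scores = np.tanh(features @ W_f.T + hidden @ W_h.T) @ w   # (K,)
    scores = scores - scores.max()                            # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()             # attention weights
    fused = alpha @ features                                  # fused feature (d,)
    return fused, alpha

rng = np.random.default_rng(0)
K, d, h, a = 6, 512, 512, 256   # e.g. six feature types, toy dimensions
fused, alpha = feature_attention(
    rng.normal(size=(K, d)), rng.normal(size=h),
    rng.normal(size=(a, d)), rng.normal(size=(a, h)), rng.normal(size=a))
```

The softmax guarantees the modality weights are non-negative and sum to one, so the fused feature stays in the convex hull of the individual features.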

                          Validation                         Test
             BLEU-4  ROUGE-L  METEOR  CIDEr    BLEU-4  ROUGE-L  METEOR  CIDEr
Top-down      0.377   0.525    0.255  0.882     0.337   0.502    0.237  0.716
X-LAN         0.405   0.541    0.267  0.885       -       -        -      -
Ensemble      0.417   0.543    0.265  0.908     0.392   0.527    0.250  0.760
Table 1: Comparison of captioning performance on the VATEX English Captioning task.
                          Validation                         Test
             BLEU-4  ROUGE-L  METEOR  CIDEr    BLEU-4  ROUGE-L  METEOR  CIDEr
Top-down      0.341   0.501    0.312  0.656     0.331   0.495    0.301  0.479
X-LAN         0.321   0.495    0.314  0.662       -       -        -      -
Ensemble      0.341   0.503    0.315  0.676     0.330   0.497    0.303  0.504
Table 2: Comparison of captioning performance on the VATEX Chinese Captioning task.

2.3 Training strategy

We train the video captioning model in three stages. In the first stage, we use the traditional cross-entropy loss for 5 epochs with a fixed learning rate. In the second stage, we leverage the word-level oracle [15]. When the CIDEr score on the validation set has not improved for 2 epochs, we begin the third stage, self-critical policy-gradient training, in which CIDEr and BLEU-4 are equally optimized. When the improvement in CIDEr saturates, we decrease the learning rate and train the model until convergence.
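The third-stage objective can be sketched as follows; the equal CIDEr/BLEU-4 weighting follows the text, while the function shape (a sampled caption scored against a greedy-decode baseline, as in self-critical sequence training [8]) and the names are illustrative:

```python
def self_critical_loss(sampled_log_prob, sampled_scores, greedy_scores):
    # sampled_scores / greedy_scores: dicts with 'cider' and 'bleu4' entries
    # for the sampled caption and the greedy-decoded caption, respectively.
    # The greedy decode acts as the reward baseline (self-critical training);
    # CIDEr and BLEU-4 are weighted equally, per the stage described above.
    reward = 0.5 * sampled_scores["cider"] + 0.5 * sampled_scores["bleu4"]
    baseline = 0.5 * greedy_scores["cider"] + 0.5 * greedy_scores["bleu4"]
    advantage = reward - baseline
    # REINFORCE-style loss: increase log-probability when advantage > 0.
    return -advantage * sampled_log_prob

loss = self_critical_loss(-2.0,
                          {"cider": 1.0, "bleu4": 0.4},
                          {"cider": 0.8, "bleu4": 0.4})
```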

3 Results

3.1 Dataset and Preprocessing

VATEX [12]. VATEX contains over 41,250 10-second video clips, and each clip depicts a single activity. Each video clip has 10 English descriptions and 10 Chinese descriptions. We use the official 25,991 training examples as training data and the 3,000 validation examples for validation.

We follow the standard caption pre-processing procedure, including converting all words to lower case, tokenizing on white space, clipping sentences to 30 words and keeping only words that occur at least five times. We use the open-source Jieba toolbox to segment the Chinese text. The final vocabulary sizes are 10,260 for the VATEX English task and 12,776 for the VATEX Chinese task. We use the standard automatic evaluation metrics BLEU, METEOR, ROUGE-L and CIDEr.
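The pre-processing pipeline above can be sketched in a few lines of stdlib Python; the special-token names are conventional choices, not specified by the report:

```python
from collections import Counter

def build_vocab(captions, min_count=5, max_len=30):
    # Lowercase, tokenize on white space, clip each sentence to `max_len`
    # words, and keep only words occurring at least `min_count` times.
    tokenized = [c.lower().split()[:max_len] for c in captions]
    counts = Counter(w for sent in tokenized for w in sent)
    vocab = {"<pad>", "<bos>", "<eos>", "<unk>"}
    vocab.update(w for w, n in counts.items() if n >= min_count)
    return tokenized, sorted(vocab)

caps = ["A man runs down the street"] * 5 + ["one rare caption"]
tokenized, vocab = build_vocab(caps)
# 'man' occurs 5 times and is kept; 'rare' occurs once and maps to <unk>
```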

We uniformly sample 32 frames from each video clip. All features are embedded to a common dimension, shared with the hidden sizes of the top-down model and the model dimension of the X-LAN decoder. We train the captioning model using the Adam optimizer.

3.2 Quantitative Result

Tables 1 and 2 show the quantitative results for the English and Chinese captioning tasks. For both tasks, X-LAN outperforms top-down on the validation set. The ensemble achieves 0.760 CIDEr on the English test set and 0.504 on the Chinese test set, ranking 2nd on both the English and Chinese private test leaderboards.

4 Conclusion

In this challenge report, we propose a multi-modal feature fusion method. We gather features from the spatial, temporal, semantic and audio domains, and propose a feature attention module that attends to different features during decoding. We use two recent captioning models, top-down and X-LAN, and a multi-stage training strategy. We rank 2nd on the official private test leaderboard for both the English and Chinese captioning challenges.


  • [1] N. Aafaq, N. Akhtar, W. Liu, S. Z. Gilani, and A. Mian (2019) Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In CVPR, Cited by: §1, §2.1.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Cited by: §1, §2.1, §2.2.
  • [3] H. Chen, K. Lin, A. Maye, J. Li, and X. Hu (2019) A semantics-assisted video captioning model trained with scheduled sampling. arXiv preprint arXiv:1909.00121. Cited by: §1, §2.1.
  • [4] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) SlowFast networks for video recognition. In ICCV, Cited by: §2.1.
  • [5] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. W. Wilson (2016) CNN architectures for large-scale audio classification. In CoRR, Cited by: §2.1.
  • [6] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In ECCV, Cited by: §2.1.
  • [7] Y. Pan, T. Yao, Y. Li, and T. Mei (2020) X-linear attention networks for image captioning. In CVPR, Cited by: §1, §2.2.
  • [8] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In CVPR, Cited by: §1.
  • [9] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In NeurIPS, Cited by: §1.
  • [10] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko (2015) Sequence to sequence - video to text. In ICCV, Cited by: §1.
  • [11] X. Wang, Y. Wang, and W. Y. Wang (2018) Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning. arXiv preprint arXiv:1804.05448. Cited by: §1, §2.1.
  • [12] X. Wang, J. Wu, J. Chen, L. Li, Y. Wang, and W. Y. Wang (2019) VATEX: a large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, Cited by: §3.1.
  • [13] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2016) Sequence level training with recurrent neural networks. In ICLR, Cited by: §1.
  • [14] J. Zhang and Y. Peng (2019) Object-aware aggregation with bidirectional temporal graph for video captioning. In CVPR, Cited by: §1.
  • [15] W. Zhang, Y. Feng, F. Meng, D. You, and Q. Liu (2019) Bridging the gap between training and inference for neural machine translation. In ACL, Cited by: §2.3.
  • [16] M. Zolfaghari, K. Singh, and T. Brox (2018) ECO: efficient convolutional network for online video understanding. In ECCV, Cited by: §1, §2.1.