Video captioning is the task of automatically generating a natural language description of a given video, and it has drawn increasing attention from both the academic and industrial communities. The encoder-decoder framework is the most commonly used structure in video captioning, where a CNN serves as the encoder to gather multi-modal information and an RNN serves as the decoder to generate captions. Generally, a 2-D and/or a 3-D CNN is used as the encoder to extract visual features. Regional features, semantic features and audio features have also been used in prior video captioning work to boost captioning performance. During training, ground-truth words are fed to the model, whereas during inference the model's own predicted words are fed back. The distribution of the previous word therefore differs between training and inference, which is the so-called "exposure bias" problem. To overcome this problem, scheduled sampling and reward optimization are the most frequently used approaches.
In this paper, to gather information from multiple domains, we use two 3-D CNNs to extract motion features, a 2-D CNN to extract appearance features, an ECO network to extract 2-D & 3-D fusion features, a semantic model to extract semantic features and an audio network to extract audio features. To fuse these features, we propose a feature attention module that assigns different weights to different features. We use two recent decoders separately, the top-down model and the X-LAN model. We use a multi-stage training strategy that trains the model with cross-entropy loss, word-level oracle and self-critical training in turn. Finally, we ensemble the top-down and X-LAN models to obtain the final captioning result.
2.1 Multi-modal Feature
To enlarge the representation ability of the encoder, we extract multi-modal features from the motion, appearance, semantic and audio domains.
Visual Feature. To obtain the motion features of a given video, we use two types of 3-D CNN. First, we use the I3D feature provided by the VATEX challenge organizers. The second is a SlowFast network (https://github.com/facebookresearch/SlowFast) pretrained on the Kinetics-600 dataset, which has the same data distribution as VATEX. To obtain the appearance feature, we apply PNASNet (https://github.com/Cadene/pretrained-models.pytorch) pretrained on ImageNet to every extracted frame. Inspired by bottom-up attention, we consider local features important for caption generation as well, so we also extract regional features for each extracted frame using a Faster R-CNN model pretrained on the Visual Genome dataset. The features of the different proposals are averaged to form the bottom-up feature of a frame. We also use an ECO network pretrained on the Kinetics-400 dataset to extract the 2-D & 3-D fusion feature. Together, the above features compose the visual feature of a video.
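The proposal averaging described above can be illustrated with a minimal NumPy sketch. The function name `bottom_up_frame_features`, the number of proposals (36) and the feature size (2048) are assumptions chosen for illustration; only the averaging over proposals and the 32 frames per clip come from the text.

```python
import numpy as np


def bottom_up_frame_features(region_feats: np.ndarray) -> np.ndarray:
    """Average the proposal features of each frame.

    region_feats: (num_frames, num_proposals, feat_dim)
    returns:      (num_frames, feat_dim)
    """
    if region_feats.ndim != 3:
        raise ValueError("expected shape (num_frames, num_proposals, feat_dim)")
    return region_feats.mean(axis=1)


# Toy input: 32 frames (as in the experiments); 36 proposals and 2048-d
# features are assumed values for illustration only.
rng = np.random.default_rng(0)
regions = rng.standard_normal((32, 36, 2048)).astype(np.float32)
frame_feats = bottom_up_frame_features(regions)
print(frame_feats.shape)  # (32, 2048)
```

Averaging discards which proposal a feature came from, but it yields one fixed-size vector per frame, which keeps the regional feature aligned with the other frame-level features.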
Semantic Feature. We select the most frequent words from the training and validation sets as attributes. The semantic encoder consists of a multi-layer perceptron on top of the visual features. We also extract topic ground truth from the training and validation sets. Attribute and topic detection is treated as a multi-label classification task. Following prior work, we concatenate the predicted probability distributions of the attributes and topics with the probability distributions of ECO, PNASNet and SlowFast to form the semantic feature. Since the semantic feature is a one-dimensional vector, we duplicate it once per frame to align it with the dimensions of the other visual features.
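As an illustration of the concatenation and per-frame duplication, the following sketch builds a frame-aligned semantic feature from several probability vectors. The helper name `build_semantic_feature` and all vector sizes are hypothetical; the real attribute, topic and class counts are not reproduced here.

```python
import numpy as np


def build_semantic_feature(prob_vectors, num_frames):
    """Concatenate 1-D probability vectors and repeat the result per frame.

    prob_vectors: list of 1-D arrays (attribute, topic and classifier probabilities)
    returns:      (num_frames, total_dim)
    """
    semantic = np.concatenate(
        [np.asarray(p, dtype=np.float32).ravel() for p in prob_vectors]
    )
    # Repeat the single video-level vector so that every frame carries a copy.
    return np.tile(semantic, (num_frames, 1))


# Toy sizes, hypothetical: the real attribute, topic and class counts differ.
attr_p = np.array([0.9, 0.1, 0.4])
topic_p = np.array([0.2, 0.7])
eco_p, pnas_p, slowfast_p = np.full(4, 0.25), np.full(5, 0.2), np.full(6, 1 / 6)

semantic_seq = build_semantic_feature(
    [attr_p, topic_p, eco_p, pnas_p, slowfast_p], num_frames=32
)
print(semantic_seq.shape)  # (32, 20)
```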
Audio Feature. Inspired by Wang et al.'s work, we also extract audio features, because the audio track of a video provides useful additional information. Our audio feature extractor is a VGGish network pretrained on the AudioSet dataset. First, we extract mel-spectrogram patches from each audio track. The sample rate of the audio is 16 kHz, the number of mel filters is 64, the STFT window length is 25 ms and the hop length is 10 ms.
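To make the spectrogram settings concrete, the sketch below computes a log-mel spectrogram with NumPy using the stated sample rate, window length, hop length and number of mel filters. The FFT size, the Hann window, the full-band mel range and the log offset are assumptions, so this is not necessarily identical to the preprocessing shipped with VGGish.

```python
import numpy as np

SAMPLE_RATE = 16000                    # 16 kHz (from the text)
WIN_LEN = SAMPLE_RATE * 25 // 1000     # 25 ms window -> 400 samples
HOP_LEN = SAMPLE_RATE * 10 // 1000     # 10 ms hop    -> 160 samples
N_MELS = 64                            # number of mel filters (from the text)
N_FFT = 512                            # assumption: power of two >= WIN_LEN


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sample_rate=SAMPLE_RATE):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    edges_hz = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    )
    fft_hz = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    bank = np.zeros((n_mels, fft_hz.size))
    for m in range(n_mels):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (fft_hz - lo) / (mid - lo)
        falling = (hi - fft_hz) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_spectrogram(wave):
    """wave: 1-D float array at SAMPLE_RATE with at least WIN_LEN samples.

    returns: (num_frames, N_MELS)
    """
    num_frames = 1 + (len(wave) - WIN_LEN) // HOP_LEN
    idx = HOP_LEN * np.arange(num_frames)[:, None] + np.arange(WIN_LEN)[None, :]
    frames = wave[idx] * np.hanning(WIN_LEN)          # windowed frames
    power = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1)) ** 2
    return np.log(power @ mel_filterbank().T + 1e-6)  # small offset avoids log(0)


wave = np.sin(2 * np.pi * 440.0 * np.arange(SAMPLE_RATE) / SAMPLE_RATE)  # 1 s test tone
spec = log_mel_spectrogram(wave)
print(WIN_LEN, HOP_LEN, spec.shape)  # 400 160 (98, 64)
```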
All of the above features are embedded into the same dimension using a fully connected layer.
2.2 Decoder and feature attention
We apply two types of decoders in this work, the top-down model and the X-LAN model. The top-down model consists of a two-layer GRU and an attention module. X-LAN employs an X-Linear attention block that leverages bilinear pooling to selectively focus on different visual information. We use a feature attention module to assign a different weight to each feature type. Its input is the set of embedded features, one per feature type.
The feature attention module fuses the multi-modal features into a single feature at each decoding step, conditioned on the decoder hidden state at that time step. For the top-down model, this is the hidden state of the first GRU. For the X-LAN model, it is the hidden state of the language LSTM.
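Since the exact equations of the feature attention module are not reproduced in this text, the following is only a sketch of one common formulation: additive attention over feature types, conditioned on the decoder hidden state. The mean pooling over frames, the parameter names `W_f`, `W_h` and `w`, and all sizes are assumptions rather than the formulation used in the report.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def feature_attention(feats, hidden, W_f, W_h, w):
    """Weight each feature type by an attention score and sum the results.

    feats:  (M, N, D)  M feature types, N frames, D embedding dimension
    hidden: (H,)       decoder hidden state at the current step
    W_f: (D, A), W_h: (H, A), w: (A,)  hypothetical attention parameters
    returns: fused feature (N, D) and weights alpha (M,)
    """
    pooled = feats.mean(axis=1)                         # (M, D), one vector per type
    scores = np.tanh(pooled @ W_f + hidden @ W_h) @ w   # (M,)
    alpha = softmax(scores)                             # weights over feature types
    fused = np.tensordot(alpha, feats, axes=1)          # (N, D) weighted sum
    return fused, alpha


rng = np.random.default_rng(0)
M, N, D, H, A = 4, 32, 16, 8, 12                        # toy sizes (assumptions)
feats = rng.standard_normal((M, N, D))
hidden = rng.standard_normal(H)
W_f = rng.standard_normal((D, A)) * 0.1
W_h = rng.standard_normal((H, A)) * 0.1
w = rng.standard_normal(A)

fused, alpha = feature_attention(feats, hidden, W_f, W_h, w)
print(fused.shape, round(float(alpha.sum()), 6))  # (32, 16) 1.0
```

Because the weights depend on the hidden state, the decoder can emphasize different feature types at different words, for example motion features for verbs and appearance features for nouns.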
2.3 Training strategy
We train the video captioning model in three stages. In the first stage, we use the traditional cross-entropy loss for 5 epochs with a fixed learning rate. We then apply the word-level oracle. When the CIDEr score on the validation set has not improved for 2 epochs, we begin the third stage, self-critical policy gradient training, in which CIDEr and BLEU-4 are optimized equally. In this stage we start from an initial learning rate and, once the gain in CIDEr saturates, decrease the learning rate and train the model until convergence.
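The stage switch and the self-critical objective can be sketched in plain Python. This is a minimal illustration under stated assumptions: the equal 0.5/0.5 weighting is one reading of "equally optimized", the helper names `has_plateaued`, `mixed_reward` and `self_critical_loss` are hypothetical, and in a real system the log-probabilities would be tensors in an autograd framework.

```python
from typing import Sequence


def has_plateaued(scores: Sequence[float], patience: int = 2) -> bool:
    """True if the validation score did not improve in the last `patience` epochs."""
    if len(scores) <= patience:
        return False
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) <= best_before


def mixed_reward(cider: float, bleu4: float,
                 w_cider: float = 0.5, w_bleu: float = 0.5) -> float:
    """Combine CIDEr and BLEU-4 (equal weights are an assumption)."""
    return w_cider * cider + w_bleu * bleu4


def self_critical_loss(sample_logprobs: Sequence[float],
                       sample_reward: float, greedy_reward: float) -> float:
    """Self-critical loss: the greedy caption's reward acts as the baseline."""
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)


# Toy numbers for one caption: the sampled caption beats the greedy one.
r_sample = mixed_reward(cider=0.8, bleu4=0.3)
r_greedy = mixed_reward(cider=0.7, bleu4=0.3)
loss = self_critical_loss([-1.0, -0.5, -1.5], r_sample, r_greedy)
print(round(loss, 4))                            # 0.15
print(has_plateaued([0.50, 0.60, 0.59, 0.60]))   # True
```

A positive advantage makes minimizing the loss raise the probability of the sampled caption, while a sampled caption that scores below the greedy baseline is pushed down.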
3.1 Dataset and Preprocessing
VATEX. VATEX contains over 41250 video clips, each about 10 seconds long and depicting a single activity. Each video clip has 10 English descriptions and 10 Chinese descriptions. We use the official 25991 training examples as training data and the 3000 validation examples for validation.
We follow the standard caption pre-processing procedure, which includes converting all words to lower case, tokenizing on white space, clipping sentences longer than 30 words and keeping only words that occur at least five times. We use the open-source Jieba toolbox (https://github.com/fxsjy/jieba) to segment the Chinese words. The final vocabulary sizes are 10260 for the VATEX English task and 12776 for the VATEX Chinese task. We use standard automatic evaluation metrics, including BLEU, METEOR, ROUGE-L and CIDEr.
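The English pre-processing steps can be sketched with the standard library alone. Two points are assumptions: "clipping" is read as truncating a caption to its first 30 words, and rare words are mapped to a hypothetical `<unk>` token. For Chinese, the whitespace split would be replaced by Jieba segmentation, which is omitted here.

```python
from collections import Counter

MAX_LEN = 30     # maximum caption length in words (from the text)
MIN_COUNT = 5    # minimum word frequency to stay in the vocabulary (from the text)
UNK = "<unk>"    # hypothetical placeholder for out-of-vocabulary words


def tokenize(caption: str) -> list:
    """Lower-case, split on white space, and truncate to MAX_LEN words."""
    return caption.lower().split()[:MAX_LEN]


def build_vocab(captions) -> set:
    """Keep words that occur at least MIN_COUNT times in the corpus."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap))
    return {word for word, n in counts.items() if n >= MIN_COUNT}


def encode(caption: str, vocab: set) -> list:
    """Replace out-of-vocabulary words with UNK."""
    return [tok if tok in vocab else UNK for tok in tokenize(caption)]


corpus = ["A man is running"] * 5 + ["A dog"]
vocab = build_vocab(corpus)
print(sorted(vocab))                       # ['a', 'is', 'man', 'running']
print(encode("A dog is running", vocab))   # ['a', '<unk>', 'is', 'running']
```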
We uniformly sample 32 frames from each video clip. All features share a common embedding dimension. For the top-down model, the model size and all hidden sizes are set to the same value; the X-LAN decoder has its own model dimension. We train the captioning model with the Adam optimizer.
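One simple way to realize uniform sampling is to split the clip into 32 equal segments and take the centre frame of each. The text only states that 32 frames are sampled uniformly, so the segment-centre rule and the function name `uniform_frame_indices` are assumptions.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int = 32) -> List[int]:
    """Return the centre frame index of each of `num_samples` equal segments."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    return [
        min(total_frames - 1, int((i + 0.5) * total_frames / num_samples))
        for i in range(num_samples)
    ]


indices = uniform_frame_indices(300)
print(len(indices), indices[0], indices[-1])  # 32 4 295
```

If a clip has fewer frames than requested, some indices repeat, which keeps the output length fixed at 32.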
3.2 Quantitative Result
Tables 1 and 2 show the quantitative results for the English and Chinese captioning tasks. For both English and Chinese captioning, X-LAN outperforms the top-down model on the validation set. The ensemble achieves a CIDEr score of 0.76 on the English test set and 0.504 on the Chinese test set. This result ranks 2nd on both the English and Chinese captioning private test leaderboards.
In this challenge report, we propose a multi-modal feature fusion method. We gather features from the spatial, temporal, semantic and audio domains, and we propose a feature attention module to attend to the different features during decoding. We use two recent captioning models, top-down and X-LAN, and train them with a multi-stage training strategy. We rank 2nd on the official private test leaderboard for both the English and Chinese captioning challenges.
- (2019) Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In CVPR.
- Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
- (2019) A semantics-assisted video captioning model trained with scheduled sampling. arXiv preprint arXiv:1909.00121.
- (2019) SlowFast networks for video recognition. In ICCV.
- (2016) CNN architectures for large-scale audio classification. In CoRR.
- (2018) Progressive neural architecture search. In ECCV.
- (2020) X-linear attention networks for image captioning. In CVPR.
- (2017) Self-critical sequence training for image captioning. In CVPR.
- Scheduled sampling for sequence prediction with recurrent neural networks. In NeurIPS.
- (2015) Sequence to sequence - video to text. In ICCV.
- (2018) Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning. arXiv preprint arXiv:1804.05448.
- (2019) Vatex: a large-scale, high-quality multilingual dataset for video-and-language research. In ICCV.
- (2016) Sequence level training with recurrent neural networks. In ICLR.
- (2019) Object-aware aggregation with bidirectional temporal graph for video captioning. In CVPR.
- Bridging the gap between training and inference for neural machine translation. In ACL.
- Eco: efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV).