Video captioning is a complicated task that requires recognizing multiple semantic aspects of a video, such as scenes, objects and actions, and generating a sentence to describe these semantic contents. It is therefore important to capture multi-level details, from global events to local objects, in order to generate an accurate and comprehensive video description. However, state-of-the-art video captioning models [1, 2] mainly utilize overall video representations or temporal attention over short video segments to generate descriptions, which lack details at the spatial object level and are thus prone to missing or inaccurately predicting details in the video.
In this work, we propose to integrate spatial object-level information and temporal-level information for video captioning. Temporal attention is employed to aggregate action movements in the video, while spatial attention enables the model to ground caption generation on fine-grained objects. Since the temporal and spatial information are complementary, we utilize a late fusion strategy to combine captions generated by the two types of attention models. Our proposed model achieves consistent improvements over baselines, with a CIDEr of 88.4 on the VaTeX validation set and 73.4 on the testing set, winning second place on the VaTeX Challenge 2019 leaderboard.
2 Video Captioning System
Our video captioning system consists of three modules: a video encoder that extracts global, temporal and spatial features from the video; a language decoder that employs spatial and temporal attention respectively to generate sentences; and a temporal-spatial fusion module that integrates the sentences generated by the different attentive captioning models.
Table 1 (fragment): the fused temporal + spatial model with RL fine-tuning (✓). Validation set: BLEU-4 42.2, METEOR 27.4, ROUGE-L 55.2, CIDEr 88.8; testing set: BLEU-4 39.1, METEOR 25.8, ROUGE-L 53.3, CIDEr 73.4.
2.1 Video Encoding
Global Video Representation.
In order to comprehensively encode videos into a global representation, we extract multi-modal features from three modalities: image, motion and audio. For the image modality, we utilize a ResNeXt-101 pretrained on ImageNet to extract global image features every 32 frames and apply average pooling over the temporal dimension; for the motion modality, we utilize an ir-CSN model pretrained on Kinetics-400 to extract video segment features every 32 frames, followed by average pooling; for the audio modality, we utilize a VGGish network pretrained on YouTube-8M to extract acoustic features. We concatenate the three features and employ a linear transformation to obtain the global multi-modal video representation.
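The concatenate-then-project step can be sketched as follows. The feature dimensions (2048-d image, 2048-d motion, 128-d audio, 512-d output) are illustrative assumptions not stated in the paper, and the projection weights, which would be learned, are random here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality features after temporal average pooling.
img_feat = rng.standard_normal(2048)     # ResNeXt-101 image feature
motion_feat = rng.standard_normal(2048)  # ir-CSN motion feature
audio_feat = rng.standard_normal(128)    # VGGish audio feature

# Linear transformation to the model dimension (learned in practice).
d_model = 512
W = rng.standard_normal((d_model, 2048 + 2048 + 128)) * 0.01
b = np.zeros(d_model)

global_repr = W @ np.concatenate([img_feat, motion_feat, audio_feat]) + b
```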
Temporal Video Representation.
In order to capture action movements, we represent the video as a sequence of consecutive segment-level features in the temporal branch. Since the image and motion modality features are well aligned in the temporal dimension, we concatenate the image feature and the motion feature every 32 frames as the segment-level feature, and then apply a linear transformation to obtain the fused embedding for each segment.
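The per-segment fusion differs from the global branch in that no temporal pooling is applied, so a sequence of embeddings is preserved. A sketch with hypothetical dimensions and a random stand-in for the learned projection:

```python
import numpy as np

rng = np.random.default_rng(0)
num_segments = 8  # hypothetical number of 32-frame segments in a clip

img_seq = rng.standard_normal((num_segments, 2048))     # per-segment image features
motion_seq = rng.standard_normal((num_segments, 2048))  # per-segment motion features

# Concatenate modalities per segment, then project linearly (learned in practice).
W_t = rng.standard_normal((512, 2048 + 2048)) * 0.01
segment_embs = np.concatenate([img_seq, motion_seq], axis=1) @ W_t.T
```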
Spatial Video Representation.
In order to capture fine-grained object details, we employ a Mask R-CNN pretrained on MSCOCO to detect objects in the video every 32 frames. At most 10 objects are kept per frame after NMS. We utilize RoI Align to extract object-level features from the feature maps of the aforementioned image and motion networks and encode them with a linear transformation to generate the spatial embeddings.
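The "at most 10 objects after NMS" selection can be illustrated with a minimal greedy NMS; the IoU threshold of 0.5 is a hypothetical value, and real pipelines would use the detector's built-in suppression:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5, max_keep=10):
    """Greedy non-maximum suppression keeping at most max_keep boxes.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) detection scores.
    """
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size and len(keep) < max_keep:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter + 1e-8)
        order = rest[iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```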
2.2 Language Decoding
We utilize a two-layer LSTM as the language decoder. The first layer is an attention LSTM, which generates a query vector for attention by collecting the necessary contexts from the global video representation $\bar{v}$, the previous word embedding, and the previous output of the second-layer LSTM:

$$h_t^1 = \mathrm{LSTM}_1([h_{t-1}^2; \bar{v}; W_e y_{t-1}], h_{t-1}^1)$$

Then, given the attended memories $V = \{v_1, \dots, v_N\}$ (the temporal or spatial embeddings), the query vector $h_t^1$ dynamically fuses relevant features in $V$ into a context vector $c_t$ via the attention mechanism:

$$\alpha_{t,i} = \mathrm{softmax}_i\big(w_a^\top \tanh(W_v v_i + W_h h_t^1)\big), \quad c_t = \sum_i \alpha_{t,i} v_i$$

The contextual feature $c_t$ is then fed into the second-layer LSTM to predict the next word, where $y_t$ is the target word at the $t$-th step:

$$h_t^2 = \mathrm{LSTM}_2([c_t; h_t^1], h_{t-1}^2), \quad p(y_t \mid y_{<t}) = \mathrm{softmax}(W_p h_t^2)$$
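One step of the additive attention described above can be sketched in NumPy; the dimensions are hypothetical, and the weights, which would be learned jointly with the decoder, are random placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, memories, W_v, W_h, w):
    """Score each memory against the query, normalize, and pool.

    query: (d,) attention-LSTM state; memories: (n, d) temporal or
    spatial embeddings. Returns the context vector and the weights.
    """
    scores = np.tanh(memories @ W_v.T + query @ W_h.T) @ w  # (n,)
    alpha = softmax(scores)
    context = alpha @ memories  # weighted sum of memories, (d,)
    return context, alpha

rng = np.random.default_rng(0)
d, n, d_att = 512, 8, 256
query = rng.standard_normal(d)           # query from the attention LSTM
memories = rng.standard_normal((n, d))   # segment or object embeddings
W_v = rng.standard_normal((d_att, d)) * 0.01
W_h = rng.standard_normal((d_att, d)) * 0.01
w = rng.standard_normal(d_att) * 0.01
context, alpha = attend(query, memories, W_v, W_h, w)
```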
We employ a cross entropy loss to train the video encoder and language decoder. In order to further boost captioning performance on the evaluation metrics, we also fine-tune the model with reinforcement learning (RL). Specifically, we utilize CIDEr and BLEU scores as rewards in RL and combine the RL objective with the cross entropy loss for training.
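A common way to realize this, following the self-critical training the paper cites, is to weight the sampled caption's log-probability by its reward advantage over a greedy-decoded baseline and mix that with the cross entropy term. A scalar sketch with a hypothetical mixing weight:

```python
def combined_loss(ce_loss, sampled_reward, greedy_reward,
                  sampled_logprob, rl_weight=0.5):
    """Mix cross entropy with a self-critical REINFORCE term.

    sampled_reward / greedy_reward: e.g. CIDEr of the sampled vs.
    greedy caption; sampled_logprob: log p of the sampled caption.
    rl_weight is a hypothetical mixing coefficient.
    """
    advantage = sampled_reward - greedy_reward  # baseline-subtracted reward
    rl_loss = -advantage * sampled_logprob      # policy-gradient surrogate
    return (1 - rl_weight) * ce_loss + rl_weight * rl_loss

loss = combined_loss(ce_loss=2.0, sampled_reward=0.9,
                     greedy_reward=0.7, sampled_logprob=-1.5)
```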
2.3 Temporal-Spatial Fusion
The temporal attentive model and spatial attentive model are complementary to each other since they focus on different aspects of the video. Therefore, we utilize a late fusion strategy to fuse results from the two types of attentive models. We train a video-semantic embedding model on the VaTeX dataset and use it to select the best video description among the captions generated by the two models according to their relevance to the video.
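Assuming the embedding model maps videos and captions into a shared space, as in VSE-style models, the selection step reduces to a cosine-similarity ranking; a minimal sketch:

```python
import numpy as np

def select_best_caption(video_emb, caption_embs):
    """Return the index of the candidate caption most relevant to the video.

    video_emb: (d,) video embedding; caption_embs: (k, d) embeddings of
    the k candidate captions, all from a shared video-semantic space.
    """
    v = video_emb / np.linalg.norm(video_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    return int(np.argmax(c @ v))  # highest cosine similarity wins

# Toy 2-d example: the second candidate points along the video embedding.
best = select_best_caption(np.array([1.0, 0.0]),
                           np.array([[0.0, 1.0], [1.0, 0.1]]))
```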
3 Experiments

Following the challenge policy, we only employ the VaTeX dataset for training. Since some videos are unavailable for download, our training, validation and testing sets contain 25,442, 2,933 and 5,781 videos respectively. To submit results to the server, which requires predictions on the full testing set, we further train models using the provided I3D features to generate captions for the unavailable videos.
Table 1 presents the captioning performance of different models on the VaTeX dataset. The vanilla model only utilizes the global video representation without any attention mechanism and is inferior to the temporal and spatial attentive models. The temporal and spatial attentions are comparable on the validation set, but spatial attention achieves slightly better performance on the testing set. Since the testing videos might contain zero-/few-shot actions, spatial attention over objects may generalize better on those videos than temporal attention, which focuses on action movements. In Figure 1, we show captions generated by the temporal and spatial models for videos in the validation set. The captions from the two models are diverse and focus on different aspects of the videos: for example, the spatial attentive model can describe small objects in the video, while the temporal attentive model tends to emphasize the global event. It is therefore beneficial to combine the two types of models. After fusing the temporal and spatial attentive models via our late fusion strategy, we achieve the best performance, as shown in Table 1.
4 Conclusion

In the VaTeX Challenge 2019, we mainly focused on integrating temporal and spatial attentions for video captioning, which are complementary for comprehensive video description generation. In the future, we will explore more effective methods for spatial-temporal reasoning and fusion. We will also improve the generalization of our models to reduce the performance gap between frequent and few-/zero-shot action videos, for example by using stronger video features pretrained on larger datasets such as Kinetics-600 and by ensembling different video captioning models.
-  Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, and Alexander Hauptmann. Activitynet 2019 task 3: Exploring contexts for dense captioning events in videos. arXiv preprint arXiv:1907.05092, 2019.
-  Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, and Bohyung Han. Streamlined dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6588–6597, 2019.
-  Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. In Proceedings of the IEEE international conference on computer vision, pages 4507–4515, 2015.
-  Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
-  Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. arXiv preprint arXiv:1904.02811, 2019.
-  Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2017.
-  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
-  Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
-  Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, volume 1, page 3, 2017.
-  Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612, 2(7):8, 2017.
-  Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. arXiv preprint arXiv:1904.03493, 2019.