1 Introduction
Video understanding is one of the most fundamental research problems in computer vision and machine learning. Ubiquitous video acquisition devices (e.g., smartphones and surveillance cameras) produce far more video than people can possibly watch, so there is a pressing need for automatic video understanding and analysis algorithms for a wide range of applications.
To recognize actions and events in videos, recent approaches based on deep convolutional neural networks (CNNs) [9, 13, 3, 17, 4] and/or recurrent networks [7, 15, 1] have achieved state-of-the-art results. However, due to the lack of publicly available datasets, existing video recognition approaches have been restricted to small-scale data, and large-scale video understanding remains an under-addressed problem. To remedy this, Google DeepMind released the large-scale Kinetics dataset [10], which contains around 300K video clips covering 400 human action classes.
To address this challenge, our solution follows the strategy of the DevNet framework [3]. Specifically, we first train basic RGB, flow and audio neural network models on the videos. We then extract multimodal features with these models and feed them into different off-the-shelf temporal models. We also design four novel temporal modeling approaches: the Multi-group Shifting Attention Network, the Temporal Xception Network, the Multi-stream Sequence Model and the Fast-Forward Sequence Model. Experimental results verify the effectiveness of the four models over traditional temporal modeling approaches. We also find that these four temporal modeling approaches are complementary to each other and lead to state-of-the-art performance after ensembling.
2 Multimodal Feature Extraction
Videos are naturally multimodal: a video can be decomposed into visual and acoustic components, and the visual component can be further divided into spatial and temporal parts. We therefore extract multimodal features to best represent videos.
2.1 Visual Feature
As in [13, 19], we use RGB images for spatial feature extraction and stacked optical flow fields for temporal feature extraction. We tried different ConvNet architectures and found that Inception-ResNet-v2 [16] outperforms the others for both the spatial and temporal components. The RGB model is initialized with an ImageNet pre-trained model and fine-tuned on the Kinetics dataset, while the flow model is initialized from the fine-tuned RGB model. Inspired by [19], the temporal segment network framework is used, and three segments are sampled from each trimmed video for video-level training. During testing, we densely extract features for each frame of the video.
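For concreteness, the snippet below sketches the kind of segment sampling used by the temporal segment network framework (one frame per segment, random during training, central during testing). It is an illustration of the general scheme under these assumptions, not the exact sampling code used here.

```python
import random

def sample_tsn_segments(num_frames, num_segments=3, training=True):
    """TSN-style sampling: one frame index per temporal segment.

    The video is split into `num_segments` equal-length segments; during
    training a frame is drawn at random from each segment, while during
    testing the central frame of each segment is taken (dense testing
    would instead iterate over all frames).
    """
    seg_len = num_frames / float(num_segments)
    indices = []
    for k in range(num_segments):
        start = int(round(k * seg_len))
        end = max(start, int(round((k + 1) * seg_len)) - 1)
        if training:
            indices.append(random.randint(start, end))
        else:
            indices.append((start + end) // 2)
    return indices

# Example: a 90-frame clip split into 3 segments.
print(sample_tsn_segments(90, training=False))  # [14, 44, 74]
```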
2.2 Acoustic Feature
We use a ConvNet-based audio classification system [6] to extract acoustic features. The audio is divided into 960 ms frames, and each frame is processed with a Fourier transform, mel-histogram integration and a logarithm transform. The resulting frame can be viewed as an image, which forms the input of a VGG16 [14] image classification model. As with the visual features, the acoustic model is trained within the temporal segment network framework.
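As an illustration of this audio pipeline, the following Python sketch converts a waveform into 960 ms log-mel patches. The use of librosa, and the 25 ms window, 10 ms hop and 64 mel bands (the common configuration of [6]), are assumptions rather than details stated above.

```python
import numpy as np
import librosa  # assumed; any STFT / mel-filterbank implementation works

def audio_to_logmel_patches(wav_path, sr=16000, patch_sec=0.96,
                            win_sec=0.025, hop_sec=0.010, n_mels=64):
    """Turn a waveform into 960 ms log-mel "images" for a VGG-style classifier.

    Pipeline as described above: short-time Fourier transform,
    mel-filterbank integration, log compression, then slicing into
    non-overlapping 960 ms patches (here 96 frames x 64 mel bands each).
    """
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(win_sec * sr), hop_length=int(hop_sec * sr), n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)                         # log compression
    frames_per_patch = int(round(patch_sec / hop_sec))   # 96 frames per patch
    n_patches = log_mel.shape[1] // frames_per_patch
    patches = [log_mel[:, i * frames_per_patch:(i + 1) * frames_per_patch].T
               for i in range(n_patches)]
    if not patches:
        return np.empty((0, frames_per_patch, n_mels))
    return np.stack(patches)                             # (n_patches, 96, 64)
```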
3 Off-the-shelf Temporal Modeling Approaches
In this section, we briefly introduce our proposed Shifting Attention Network and Temporal Xception Network. More implementation details and analysis will be provided in a follow-up technical report. We refer readers to [11] for details of the Multi-stream Sequence Model and the Fast-Forward Sequence Model.
3.1 Shifting Attention Network
Attention models [12, 18] have been proposed and have achieved promising results on natural language processing problems. To explore the capability of attention models for action recognition, we propose a shifting attention network architecture, which is efficient, elegant and solely based on attention.
3.1.1 Shifting Attention
An attention function can be considered as mapping a set of input features $X = (x_1, \dots, x_N)$ to a single output, where the input is the matrix formed by concatenating the feature vectors. The output of the shifting attention $\mathrm{SATT}(X)$ is calculated through a shifting operation based on a weighted sum of the features:
$$\mathrm{SATT}(X) = \lambda \sum_{i=1}^{N} a_i x_i + \mu ,$$
where $a = (a_1, \dots, a_N)$ is a weight vector calculated as
$$a_i = \frac{\exp\!\left(w^{\top} x_i / \tau\right)}{\sum_{j=1}^{N} \exp\!\left(w^{\top} x_j / \tau\right)} .$$
Here $w$ is a learnable vector, $\lambda$ and $\mu$ are learnable scalars, and $\tau$ is a hyper-parameter that controls the sharpness of the distribution. The shifting operation shifts and rescales the weighted sum while ensuring scale-invariance, which efficiently enables different attention components to diverge from each other and to have different distributions. This lays the foundation for Multi-SATT, which we describe next.
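To make the formulation above concrete, here is a minimal PyTorch sketch of a single shifting-attention component based on the reconstructed equations; parameter names and initialization are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftingAttention(nn.Module):
    """One shifting-attention component, following the equations above.

    For features X of shape (batch, N, d) it computes temperature-scaled
    softmax weights a_i from a learnable vector w, forms the weighted sum
    of the features, and applies a learnable scale (lambda) and shift (mu).
    """
    def __init__(self, feat_dim, tau=1.0):
        super().__init__()
        self.w = nn.Parameter(torch.randn(feat_dim) * 0.01)  # learnable vector w
        self.lam = nn.Parameter(torch.ones(1))                # learnable scalar lambda
        self.mu = nn.Parameter(torch.zeros(1))                # learnable scalar mu
        self.tau = tau                                        # sharpness hyper-parameter

    def forward(self, x):                          # x: (batch, N, feat_dim)
        logits = x.matmul(self.w) / self.tau       # (batch, N)
        a = F.softmax(logits, dim=1)               # attention weights a_i
        pooled = (a.unsqueeze(-1) * x).sum(dim=1)  # weighted sum over the N features
        return self.lam * pooled + self.mu         # scale and shift
```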
3.1.2 Multi-Group Shifting Attention Network
In order to collect multimodal information from videos, we extract a variety of features, such as appearance (RGB), motion (optical flow) and audio signals. Although an attention model focuses on specific features and effectively filters out irrelevant noise, it is unrealistic to merge all multimodal feature sets within one attention model, because features of different modalities have different values, dimensions and scales. Instead, we propose the Multi-Group Shifting Attention Network (Multi-SATT) for training multiple groups of attentions simultaneously. The architecture of the proposed Multi-SATT is illustrated in Figure 1.
First, we extract multiple feature sets from the video. For each feature set, we apply several shifting attentions, which together form one attention group, and concatenate their outputs. Next, the outputs of the different attention groups are normalized separately and concatenated to form a global representation vector for the video. Finally, this representation vector is used for classification through a fully-connected layer.
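Building on the shifting-attention sketch above, a Multi-SATT-style model might be organized as follows; the uniform number of attentions per group, the L2 normalization, and the example feature dimensions are simplifying assumptions, since Figure 1 is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGroupShiftingAttention(nn.Module):
    """Multi-SATT sketch: one group of shifting attentions per feature set.

    Reuses the ShiftingAttention module from the previous sketch. Outputs of
    each group are concatenated, normalized per group, concatenated across
    groups, and classified with a fully-connected layer.
    """
    def __init__(self, feat_dims, atts_per_group, num_classes):
        super().__init__()
        self.groups = nn.ModuleList([
            nn.ModuleList([ShiftingAttention(d) for _ in range(atts_per_group)])
            for d in feat_dims])
        rep_dim = sum(d * atts_per_group for d in feat_dims)
        self.fc = nn.Linear(rep_dim, num_classes)

    def forward(self, feats):  # feats: list of (batch, N_m, d_m) tensors, one per modality
        group_reps = []
        for group, x in zip(self.groups, feats):
            out = torch.cat([att(x) for att in group], dim=1)  # concat within a group
            group_reps.append(F.normalize(out, dim=1))          # per-group normalization
        video_rep = torch.cat(group_reps, dim=1)                 # global video representation
        return self.fc(video_rep)

# Hypothetical usage with RGB / flow / audio features of dimensions 1536 / 1536 / 128:
# model = MultiGroupShiftingAttention([1536, 1536, 128], atts_per_group=4, num_classes=400)
```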
3.2 Temporal Xception Network
Depthwise separable convolution architectures [2, 20] have shown their power in image classification by reducing the number of parameters while increasing classification accuracy. Recently, convolutional sequence-to-sequence networks have also been successfully applied to machine translation [5, 8]. In this competition, we adopt a temporal Xception network for action recognition, which applies depthwise separable convolutions along the temporal dimension and achieves promising performance. The proposed temporal Xception network architecture is shown in Figure 2. We sample a fixed number of segments for each video and feed the video segment features into a temporal convolutional block, which consists of a stack of two separable convolutional layers, each followed by batch normalization and an activation, with a shortcut connection. Finally, the outputs of the three stream features are concatenated and fed into a fully-connected layer for classification.
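The temporal convolutional block described above could be sketched in PyTorch as follows; kernel sizes, channel counts and the exact placement of batch normalization and activations are assumptions, since Figure 2 is not reproduced here.

```python
import torch.nn as nn

class TemporalSeparableConv(nn.Module):
    """Depthwise-separable 1-D convolution applied along the temporal axis."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                 # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

class TemporalXceptionBlock(nn.Module):
    """Temporal convolutional block: two separable temporal convolutions,
    each followed by batch norm and ReLU, with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            TemporalSeparableConv(channels), nn.BatchNorm1d(channels), nn.ReLU(inplace=True),
            TemporalSeparableConv(channels), nn.BatchNorm1d(channels), nn.ReLU(inplace=True))

    def forward(self, x):                 # x: (batch, channels, time)
        return self.body(x) + x           # shortcut connection
```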
Table 1: Results of the individual temporal models and their ensemble on the Kinetics validation set.

| Model | Modality | Top-1 Accuracy (%) | Top-5 Accuracy (%) |
| --- | --- | --- | --- |
| Late fusion | RGB + Flow + Audio | 74.9 | 91.6 |
| Multi-stream Sequence Model | RGB + Flow + Audio | 77.0 | 93.2 |
| Fast-forward LSTM | RGB + Flow + Audio | 77.1 | 93.2 |
| Temporal Xception Network | RGB + Flow + Audio | 77.2 | 93.4 |
| Shifting Attention Network | RGB + Flow + Audio | 77.7 | 93.2 |
| Ensemble | RGB + Flow + Audio | 81.5 | 95.6 |
4 Experimental Results
We conduct experiments on the challenging Kinetics dataset [10]. The dataset contains 246,535 training videos, 19,907 validation videos and 38,685 testing videos, each belonging to one of 400 action categories.
Table 1 summarizes our results on the Kinetics validation set. From Table 1, we make three key observations. (1) Temporal modeling with multimodal features is more effective than naively combining the classification scores of the different modality networks. (2) The proposed Shifting Attention Network and Temporal Xception Network achieve results comparable to or better than traditional sequence models (e.g., LSTM), which indicates that they may serve as alternative temporal modeling approaches in the future. (3) The different temporal modeling approaches are complementary to each other.
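As a simple illustration of observation (3), the ensemble row in Table 1 is obtained by fusing the class scores of the individual temporal models. The exact fusion scheme is not specified here; the sketch below uses a plain (optionally weighted) average of per-model class probabilities, which is one common choice.

```python
import numpy as np

def ensemble_scores(score_list, weights=None):
    """Fuse per-model class probabilities by (weighted) averaging.

    score_list: list of arrays of shape (num_videos, num_classes), one per model.
    weights:    optional per-model weights; uniform averaging if omitted.
    """
    scores = np.stack(score_list)                        # (num_models, V, C)
    if weights is None:
        weights = np.full(len(score_list), 1.0 / len(score_list))
    weights = np.asarray(weights, dtype=scores.dtype).reshape(-1, 1, 1)
    return (weights * scores).sum(axis=0)

# Hypothetical usage with the probability matrices of the individual models:
# fused = ensemble_scores([satt_probs, txn_probs, ff_lstm_probs, mstream_probs])
# top1_prediction = fused.argmax(axis=1)
```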
5 Conclusion
In this work, we have proposed four temporal modeling approaches to address the challenging large-scale video recognition task. Experimental results verify that our approaches achieve significantly better results than traditional temporal pooling approaches. Ensembling our individual models further improves performance, enabling our method to rank first in the challenge. All code and models will be released soon.
References
[1] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[2] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[3] C. Gan, N. Wang, Y. Yang, D.-Y. Yeung, and A. G. Hauptmann. DevNet: A deep event network for multimedia event detection and evidence recounting. In CVPR, pages 2568–2577, 2015.
[4] C. Gan, T. Yao, K. Yang, Y. Yang, and T. Mei. You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images. In CVPR, 2016.
[5] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
[6] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson. CNN architectures for large-scale audio classification. arXiv preprint arXiv:1609.09430, 2017.
[7] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[8] L. Kaiser, A. N. Gomez, and F. Chollet. Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059, 2017.
[9] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[10] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[11] F. Li, C. Gan, X. Liu, Y. Bian, X. Long, Y. Li, Z. Li, J. Zhou, and S. Wen. Temporal modeling approaches for large-scale YouTube-8M video understanding. arXiv preprint arXiv:1707.04555, 2017.
[12] Z. Lin, M. Feng, C. Nogueira dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive sentence embedding. arXiv e-prints, March 2017.
[13] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[14] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[15] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
[16] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
[17] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3D: Generic features for video analysis. In ICCV, 2015.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv e-prints, June 2017.
[19] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[20] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.