1 Introduction
Table 1: Number of parameters in the convolution layers of the NL module (two embedding sizes) and the HGNL module (with and without parameters shared across groups).

Layer  | NL      | NL      | HGNL     | HGNL (shared parameters)
theta  | 0.5248M | 0.1312M | 8.32K    | 0.52K
phi    | 0.5248M | 0.1312M | 8.32K    | 0.52K
g      | 1.0496M | 1.0496M | 132.096K | 16.512K
Others | –       | –       | –        | –
All    | 2.0992M | 1.312M  | 148.736K | 17.552K

Table 2: MAdds required in the convolution layers of the NL and HGNL modules.
Video classification is one of the most challenging tasks in computer vision. Public challenges and openly available video datasets accelerate research progress, especially the ActivityNet series of challenges and their related datasets. In recent years, deep convolutional neural networks (CNNs) have brought remarkable improvements in the accuracy of video classification [7, 1, 5, 3]. This report describes the method behind the iQIYI submission to the trimmed activity recognition (Kinetics) task of the ActivityNet Large Scale Activity Recognition Challenge 2019. The Kinetics-700 dataset covers 700 human action classes and consists of approximately 650,000 video clips, each lasting around 10 seconds.
In our model-ensemble stage, three models are involved: TSN [7], HGNL, and StNet [4]. We propose the Hierarchical Group-wise Non-Local (HGNL) module for frame-level feature aggregation in video classification.
Frequently used aggregation methods include taking the maximum, evenly averaging, and weighted averaging. The non-local (NL) module in [8] can also be used to aggregate frame-level features. However, the NL module in [8] has low parameter efficiency and high computational cost, as discussed in detail later in this report.
We address the problem of building a highly efficient self-attention based module for frame-level feature aggregation, and propose the Hierarchical Group-wise Non-Local (HGNL) module. Compared with the NL module in [8], HGNL has fewer parameters and a smaller computational cost. The proposed module involves a hierarchical group-wise structure, consisting of primary grouped convolutions and a secondary grouped matrix multiplication. Moreover, HGNL generates multiple attention maps: one for each feature group in the entire feature matrix, which lets it mine the non-local information in the features in finer detail.
2 Method
2.1 HGNL
In this section, the HGNL is presented in detail.
2.1.1 Formulation of Frame-level Feature Aggregation
Consider a video $V$. A sequence of frames $\{I_1, I_2, \dots, I_T\}$ ($T$ is the length of the sequence) is extracted from the entire video via some specific rules.
The feature information of a single frame is obtained via a pretrained convolutional network:

$f_t = \mathcal{F}(I_t)$  (1)

where $I_t$ denotes the $t$-th frame, $f_t$ is the feature information of $I_t$, and $\mathcal{F}$ denotes the ConvNet operation.
The compact video-level features can be obtained by aggregating the features from multiple frames:

$v = \mathcal{A}(f_1, f_2, \dots, f_T)$  (2)

where $\mathcal{A}$ is the aggregating function, $T$ is the length of the sequence of frames, and $v$ denotes the compact video-level features.
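As a minimal sketch of Eqs. (1) and (2), the snippet below stands in for the pretrained ConvNet with a fixed random projection and uses evenly averaging as the aggregating function; the toy frame size, feature length, and projection are illustrative assumptions, not the actual network:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8            # number of frames and feature length (toy values)
h, w = 16, 16          # toy frame size

# Stand-in for the pretrained ConvNet in Eq. (1): a fixed linear
# projection of the flattened frame (a real system would use a CNN).
proj = rng.standard_normal((h * w, d))

def extract(frame):
    return frame.reshape(-1) @ proj           # f_t, Eq. (1)

def aggregate(features):
    return np.mean(features, axis=0)          # evenly-averaging aggregator, Eq. (2)

frames = rng.standard_normal((T, h, w))
feats = np.stack([extract(f) for f in frames])   # (T, d) frame-level features
video_feat = aggregate(feats)                    # (d,) compact video-level feature
```

The remainder of this section replaces the simple average here with attention-based aggregators.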
2.1.2 Self-Attention (Non-Local) Based Frame-level Feature Aggregation
In a self-attention module, the response at a position is computed as a weighted average over all positions in an embedding space. As a representative attention mechanism, the NL module in [8] is adopted here to aggregate frame-level features; it is able to capture long-range dependencies across the frames.
Let $f_t \in \mathbb{R}^{1 \times d}$, where $d$ is the length of each frame's feature vector.
$X \in \mathbb{R}^{T \times d}$, which denotes the feature information of the $T$ frames, can be obtained by reshaping $f_1, \dots, f_T$ into a $T \times d$ matrix (row $t$ corresponding to $f_t$). Then, an attention map $A$ of size $T \times T$, containing the relationships between every pair of frames, can be obtained:

$A = \mathrm{softmax}\big(X W_\theta (X W_\phi)^{\top}\big)$  (3)

where $W_\theta \in \mathbb{R}^{d \times d_e}$ and $W_\phi \in \mathbb{R}^{d \times d_e}$ are learned weight matrices. Commonly, $W_\theta$ and $W_\phi$ are implemented as $1 \times 1$ convolutions.
The output based on the attention map is

$Y = A \, (X W_g)$  (4)

where $W_g \in \mathbb{R}^{d \times d}$ is also implemented as a $1 \times 1$ convolution.
After this, $Z$ can be obtained:

$Z = \gamma Y + X$  (5)

In the above formulation, $\gamma$ is a scale parameter and the output $Z$ has the same size as the input signal $X$.
The video-level feature is obtained by evenly averaging the rows of $Z$:

$v = \frac{1}{T} \sum_{t=1}^{T} Z_t$  (6)
Figure 1 shows the schema of the NL module for frame-level feature aggregation.
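The NL aggregation of Eqs. (3)-(6) can be sketched in NumPy as follows; the toy shapes are assumptions for illustration, and the $1 \times 1$ convolutions are written as plain matrix products:

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nl_aggregate(X, W_theta, W_phi, W_g, gamma=1.0):
    # X: (T, d) frame features; W_theta, W_phi: (d, d_e); W_g: (d, d).
    A = softmax(X @ W_theta @ (X @ W_phi).T)   # (T, T) attention map, Eq. (3)
    Y = A @ (X @ W_g)                          # weighted combination, Eq. (4)
    Z = gamma * Y + X                          # scaled residual, Eq. (5)
    return Z.mean(axis=0)                      # evenly averaging, Eq. (6)

rng = np.random.default_rng(0)
T, d = 5, 16
X = rng.standard_normal((T, d))
v = nl_aggregate(X,
                 rng.standard_normal((d, d // 2)),
                 rng.standard_normal((d, d // 2)),
                 rng.standard_normal((d, d)))
```

Note that a single $T \times T$ map weights every pair of frames over the full feature vector at once.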
Analysis of the NL module. The NL module is effective for aggregating frame-level features, but it has low parameter efficiency and high computational cost. The number of parameters in the NL module is computed as follows. The convolution layers corresponding to $W_\theta$, $W_\phi$, and $W_g$ contain $d \cdot d_e + d_e$, $d \cdot d_e + d_e$, and $d \cdot d + d$ parameters respectively (weights plus biases). The resulting counts for $d = 1024$ are shown in Table 1: with the smaller embedding size the NL module has about 1.31M parameters, and with the larger one about 2M. This is quite large for practical use; in contrast, many entire backbone networks have a very small number of parameters, such as MobileNetV2 [6] (3.4M), MobileNetV2 1.4 (6.9M), MFNet2D (5.8M), MFNet3D [2] (8.0M), and I3D-RGB [1] (12.1M). As for computational complexity, the total number of multiply-adds (MAdds) required by the convolution layers of NL grows in proportion to $T$ times their parameter count (Table 2). Therefore, it makes sense to reduce the parameter redundancy and computational cost of the NL module.
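The NL columns of Table 1 can be reproduced in a few lines; the choices $d = 1024$ and embedding sizes $d/2$ and $d/8$ are assumptions inferred from the table values:

```python
def conv1x1_params(d_in, d_out):
    # Weights plus bias of a 1x1 convolution (equivalently a linear layer).
    return d_in * d_out + d_out

d = 1024
g = conv1x1_params(d, d)              # 1,049,600 -> 1.0496M (g row)
half = conv1x1_params(d, d // 2)      # 524,800   -> 0.5248M (theta/phi, first NL column)
eighth = conv1x1_params(d, d // 8)    # 131,200   -> 0.1312M (theta/phi, second NL column)
total_half = 2 * half + g             # 2,099,200 -> about 2M
total_eighth = 2 * eighth + g         # 1,312,000 -> about 1.31M
```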
2.1.3 Hierarchical Group-wise Non-Local Module
In order to reduce parameter redundancy and computational cost, the Hierarchical Group-wise Non-Local (HGNL) module for frame-level feature aggregation is proposed. HGNL has a hierarchical group-wise structure and generates several attention maps. The HGNL module performs frame-level feature aggregation as follows.
Firstly, in HGNL, the weight matrices $W_\theta$ and $W_\phi$ are implemented as grouped convolutions with the number of groups being $G_1$. The grouped convolutions largely reduce the parameters and the number of operations measured in MAdds.
After this, the attention map is computed:

$A = \mathrm{ReLU}\big(X W_\theta \otimes (X W_\phi)^{\top}\big)$  (7)

where $\otimes$ denotes the grouped matrix multiplication with the number of groups being $G_2$ (the feature dimension is split into $G_2$ groups and each group is multiplied independently); $A$ comprises $G_2$ attention maps, each of size $T \times T$.
As shown in Figure 2, the grouped matrix multiplication in Eq. (7) produces one attention map for each feature group, so the number of attention maps reaches $G_2$. This mines the non-local information in the features in finer detail and more effectively; the NL module, in contrast, has only a single attention map. Besides, the softmax is removed in HGNL: the ReLU in Eq. (7) is lightweight to compute and provides the nonlinearity for the HGNL module.
Then, keeping the same number of groups $G_2$ as in the grouped matrix multiplication of Eq. (7), the weight matrix $W_g$ is implemented as a grouped convolution with $G_2$ groups, and $Y$ is computed via the grouped matrix multiplication with $G_2$ groups:

$Y = A \otimes (X W_g)$  (8)
Figure 2 shows the schema of the HGNL module. In general, let $G_1 = r G_2$, where $r$ is a ratio. The relationship between $G_1$ (primary grouped convolutions) and $G_2$ (secondary grouped matrix multiplication) forms the hierarchical group-wise structure. Consider the values of $G_1$ and $G_2$. Even though multiple attention maps can mine the non-local information in finer detail, each attention map will cover too narrow a slice of the feature information if $G_2$ is too large. On the other hand, a larger $G_1$ means fewer parameters and MAdds. Therefore, $G_1$ and $G_2$ are in general set to different values. As a special case, when $G_1$ equals $G_2$, HGNL is equivalent to processing each feature group of $X$ with an individual NL module.
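A NumPy sketch of the hierarchical group-wise attention of Eqs. (7) and (8), with toy sizes; the grouped $1 \times 1$ convolutions are written as block-wise matrix products, and the residual and averaging steps are assumed to follow Eqs. (5) and (6):

```python
import numpy as np

def grouped_linear(X, Ws):
    # Grouped 1x1 convolution: split the feature dimension into len(Ws)
    # groups and apply each group's own weight matrix.
    chunks = np.split(X, len(Ws), axis=1)
    return np.concatenate([c @ W for c, W in zip(chunks, Ws)], axis=1)

def grouped_matmul_attention(Q, K, V, G2):
    # Grouped matrix multiplication: one ReLU attention map per feature
    # group (Eq. (7)), then grouped application to V (Eq. (8)).
    outs = []
    for q, k, v in zip(np.split(Q, G2, axis=1),
                       np.split(K, G2, axis=1),
                       np.split(V, G2, axis=1)):
        A = np.maximum(q @ k.T, 0.0)           # (T, T) map for this group
        outs.append(A @ v)
    return np.concatenate(outs, axis=1)

def hgnl_aggregate(X, Wt, Wp, Wg, G2, gamma=1.0):
    Y = grouped_matmul_attention(grouped_linear(X, Wt),
                                 grouped_linear(X, Wp),
                                 grouped_linear(X, Wg), G2)
    return (gamma * Y + X).mean(axis=0)        # residual + averaging, Eqs. (5), (6)

rng = np.random.default_rng(0)
T, d, G1, G2 = 6, 32, 8, 4
make = lambda G, d_out: [rng.standard_normal((d // G, d_out // G)) for _ in range(G)]
X = rng.standard_normal((T, d))
v = hgnl_aggregate(X, make(G1, d // 2), make(G1, d // 2), make(G2, d), G2)
```

Here $G_1 = 8$ groups are used for the primary convolutions and $G_2 = 4$ attention maps are produced by the secondary grouped matrix multiplication.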
Analysis of the HGNL module. The grouped convolution layers corresponding to $W_\theta$, $W_\phi$, and $W_g$ contain $d \cdot d_e / G_1 + d_e$, $d \cdot d_e / G_1 + d_e$, and $d \cdot d / G_2 + d$ parameters respectively. As shown in Table 1, HGNL requires about 8 to 14 times fewer parameters than the NL module, which has roughly 1.31M to 2.1M parameters. If parameters are shared across the groups of a grouped convolutional layer in HGNL, the number of that layer's parameters is further reduced by a factor of $G_1$ or $G_2$. Besides, as shown in Table 2, the MAdds required by the convolution layers of HGNL are several times fewer than those of the convolution layers of NL, while the MAdds of the other, non-convolution layers remain roughly unchanged. Thus, HGNL reduces model redundancy and computational cost while achieving accuracy competitive with NL.
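The HGNL columns of Table 1 can likewise be reproduced; $d = 1024$, $G_1 = 16$ and $G_2 = 8$, output size $d/8$ for the $\theta$ and $\phi$ layers, and a per-group slice of the bias in the shared-parameter variant are assumptions inferred from the table values:

```python
def grouped_conv_params(d_in, d_out, G, shared=False):
    # Weights plus bias of a grouped 1x1 convolution with G groups; when
    # parameters are shared across groups, one group's weights (and,
    # assumed here, a per-group slice of the bias) are stored once.
    weights = (d_in // G) * (d_out // G) * (1 if shared else G)
    bias = d_out // G if shared else d_out
    return weights + bias

d = 1024
theta = grouped_conv_params(d, d // 8, 16)                # 8,320   -> 8.32K
g = grouped_conv_params(d, d, 8)                          # 132,096 -> 132.096K
total = 2 * theta + g                                     # 148,736 -> 148.736K
theta_s = grouped_conv_params(d, d // 8, 16, shared=True) # 520     -> 0.52K
g_s = grouped_conv_params(d, d, 8, shared=True)           # 16,512  -> 16.512K
total_s = 2 * theta_s + g_s                               # 17,552  -> 17.552K
```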
2.1.4 Implementation of HGNL for Video Classification
Because the network architecture of HGNL contains no fully-connected layers, $T$ (the number of frames) can be adjusted arbitrarily. Thus, in the evaluation phase of the proposed HGNL module, the number of frames selected from a video for predicting the label need not be fixed to the value used in the training phase, and can be adjusted.
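A small illustration of this property, assuming a single ungrouped attention layer with fixed weights: the same weights accept any number of frames, and only the $T \times T$ attention map changes size.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_theta, W_phi, W_g = (rng.standard_normal((d, d)) for _ in range(3))

def aggregate(X):
    # The learned weights touch only the feature dimension, so the
    # number of frames T is free; the attention map is (T, T).
    A = np.maximum(X @ W_theta @ (X @ W_phi).T, 0.0)
    return (A @ (X @ W_g) + X).mean(axis=0)

v3 = aggregate(rng.standard_normal((3, d)))    # training-phase frame count
v25 = aggregate(rng.standard_normal((25, d)))  # test-phase frame count
```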
Table 3: Top-1 validation accuracy (%) on Kinetics-700.

Model | Train-phase setting (3 segments) | Test-phase setting (25 segments)
TSN   | 57.38  | 61.83
HGNL  | 57.713 | 62.12
StNet | 55.7   | –
Table 4: Averaged top-1 and top-5 error on the Kinetics-700 test set.

Models         | avg. error
Model Ensemble | 0.28444
2.2 Model Ensemble
3 Experiments
In this section, we report experimental results of our method on the Kinetics-700 dataset. All models are pretrained on the Kinetics-600 training set and fine-tuned on the Kinetics-700 training set. SE-ResNeXt101 is adopted as the backbone network. Due to the limited time, we exploit only RGB information. In our experiments, the full-length video is divided into several equal segments, and some frames are randomly selected from each segment. During training, the number of segments is set to 3 and one frame is randomly selected from each segment. During evaluation, we follow the same testing setup as in TSN [7].
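The segment-wise sampling described above might be sketched as follows; the helper name and the uniform in-segment sampling are assumptions:

```python
import random

def sample_frames(num_frames, num_segments, rng=random):
    # Divide the frame indices [0, num_frames) into num_segments equal
    # segments and draw one uniformly random index from each segment
    # (assumes num_frames >= num_segments; remainder frames are dropped).
    seg_len = num_frames // num_segments
    return [s * seg_len + rng.randrange(seg_len) for s in range(num_segments)]

indices = sample_frames(300, 3)   # one frame from each third of a 300-frame video
```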
3.1 TSN
In the TSN experiments, the initial learning rate is set to 0.001 and decayed by a factor of 10 at epochs 20 and 30. Training runs for a maximum of 40 epochs.
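The step schedule above can be written as a small helper (a hypothetical function, reproducing the stated milestones):

```python
def learning_rate(epoch, base_lr=0.001, milestones=(20, 30), factor=0.1):
    # Step schedule: multiply the learning rate by `factor` at each
    # milestone epoch over the 40-epoch run.
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr
```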
3.2 HGNL
In the HGNL experiments, due to the limited time, we fine-tuned HGNL on the Kinetics-700 training set for only 8 epochs, starting from the model pretrained with TSN in Section 3.1. The initial learning rate is set to 0.001 and decayed by a factor of 10 at epochs 4 and 6; training runs for a maximum of 8 epochs. As shown in Table 3, HGNL obtains a top-1 accuracy of 62.12% on the Kinetics-700 validation set, compared with 61.83% for TSN.
3.3 StNet
For StNet [4], the Temporal Modeling Block and the Temporal Xception Block are used in our network, and we adopt the same input as for TSN. Because of the time limits, we trained the network for only 20 epochs on the Kinetics-700 dataset. On the Kinetics-700 validation set, StNet reaches 55.7% top-1 and 78.3% top-5 accuracy in the train-phase setting (3-frame test).
3.4 Model Ensemble
4 Conclusion
In this report, our team's solution to the Kinetics-700 task of the ActivityNet 2019 challenge is described. The experimental results evidence the effectiveness of the proposed HGNL module, which achieves better accuracy than the TSN baseline. With the help of the hierarchical group-wise structure, the HGNL module has 8 to 70 times fewer parameters and several times smaller computational complexity than the NL module. After model ensembling, our team finally obtains an averaged top-1 and top-5 error of 28.444% on the Kinetics-700 test set.
References

[1] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[2] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng. Multi-fiber networks for video recognition. In The European Conference on Computer Vision (ECCV), September 2018.
[3] C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018.
[4] D. He, Z. Zhou, C. Gan, F. Li, X. Liu, Y. Li, L. Wang, and S. Wen. StNet: Local and global spatial-temporal modeling for action recognition. arXiv preprint arXiv:1811.01549, 2018.
[5] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017.
[6] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[7] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
[8] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.