iqiyi Submission to ActivityNet Challenge 2019 Kinetics-700 challenge: Hierarchical Group-wise Attention

02/07/2020, by Qian Liu, et al.

In this report, the method for the iqiyi submission to the task of the ActivityNet 2019 Kinetics-700 challenge is described. Three models are involved in the model ensemble stage: TSN, HG-NL and StNet. We propose the hierarchical group-wise non-local (HG-NL) module for frame-level features aggregation for video classification. The standard non-local (NL) module is effective in aggregating frame-level features for video classification, but presents low parameter efficiency and high computational cost. The HG-NL method involves a hierarchical group-wise structure and generates multiple attention maps to enhance performance. Based on this hierarchical group-wise structure, the proposed method has competitive accuracy, fewer parameters and a smaller computational cost than the standard NL. For the ActivityNet 2019 Kinetics-700 challenge, after model ensemble, we finally obtain an averaged top-1 and top-5 error percentage of 28.444%.



1 Introduction

Figure 1: Self-Attention (Non-local) Based Frame-level Features Aggregation (⊕ denotes element-wise sum and ⊗ denotes matrix multiplication).
         NL (C̃=C/2)   NL (C̃=C/8)   HG-NL (g1=16, g2=8)   HG-NL (shared parameters, g1=16, g2=8)
W_θ      0.5248M       0.1312M       8.32K                 0.52K
W_φ      0.5248M       0.1312M       8.32K                 0.52K
W_g      1.0496M       1.0496M       132.096K              16.512K
Others   –             –             –                     –
All      2.0992M       1.312M        148.736K              17.552K
Table 1: The number of parameters in NL and HG-NL (C = 1024; C̃ = C/8 in HG-NL). The HG-NL has about 8 - 14 times fewer parameters than the NL method, and roughly 70 times fewer if parameters are shared across the groups in a grouped convolutional layer in HG-NL.
            NL (C̃=C/2)   NL (C̃=C/8)   HG-NL (C̃=C/8, g1=16, g2=8)
W_θ / W_φ   T·C·C̃        T·C·C̃        T·C·C̃/g1
W_g         T·C²          T·C²          T·C²/g2
Table 2: MAdds (multiply-adds) of the convolution layers in NL and HG-NL. Each convolution layer in HG-NL has g1 (=16) or g2 (=8) times fewer MAdds than the corresponding layer in NL. The MAdds of the other non-convolution layers remain roughly unchanged.

Video classification is one of the challenging tasks in computer vision. Public challenges and openly available video datasets accelerate research progress, especially the ActivityNet series of challenges and the related datasets. In recent years, deep convolutional neural networks (CNNs) have brought remarkable improvements in the accuracy of video classification [7, 1, 5, 3].

In this report, the method for the iqiyi submission to the trimmed activity recognition (Kinetics) task of the ActivityNet Large Scale Activity Recognition Challenge 2019 is described. The Kinetics-700 dataset covers 700 human action classes and consists of approximately 650,000 video clips; each clip lasts around 10 seconds.

In our model ensemble stage, three models are involved: TSN[7], HG-NL and StNet[4]. We propose the hierarchical group-wise non-local (HG-NL) module for frame-level features aggregation for video classification.

Frequently-used aggregation methods include taking the maximum, evenly averaging and weighted averaging. The NL module in [8] can also be used to aggregate frame-level features. However, it presents low parameter efficiency and high computational cost, as discussed in detail later in this report.

We address the problem of building a highly efficient self-attention based frame-level features aggregation module, and propose the Hierarchical Group-wise Non-Local (HG-NL) module for frame-level features aggregation. Compared with the NL in [8], the HG-NL module has fewer parameters and a smaller computational cost. The proposed module involves a hierarchical group-wise structure, which consists of primary grouped convolutions and a secondary grouped matrix multiplication. Moreover, HG-NL generates multiple attention maps: it produces one attention map for each feature group in the entire feature matrix, and can thus mine the non-local information in the features in more detail.

2 Method

2.1 HG-NL

In this section, the HG-NL is presented in detail.

2.1.1 Formulation of Frame-level Features Aggregation

Considering a video V, a sequence of frames {I_1, I_2, ..., I_T} (T is the length of the sequence of frames) is extracted from the entire video via some specific rules.

The feature information of a single frame is obtained via a pre-trained convolutional network:

x_t = ConvNet(I_t),   (1)

where I_t denotes the t-th frame, x_t is the feature information of I_t, and ConvNet(·) denotes the ConvNet operation.

The compact video-level features can be obtained via aggregating the features from multiple frames:

v = G(x_1, x_2, ..., x_T),   (2)

where G(·) is the aggregating function, T is the length of the sequence of frames, and v denotes the compact video-level features.
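As a concrete sketch of Eqs. (1)-(2), the snippet below uses a random linear projection as a hypothetical stand-in for the pre-trained ConvNet and evenly averaging as the aggregating function G; all sizes are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, H, W = 4, 8, 6, 6                          # illustrative sizes
proj = rng.standard_normal((H * W * 3, C))       # stand-in for a ConvNet

def convnet(frame):
    # Eq. (1): x_t = ConvNet(I_t); here just a random linear projection.
    return frame.ravel() @ proj

frames = rng.standard_normal((T, H, W, 3))       # the frame sequence I_1..I_T
feats = np.stack([convnet(f) for f in frames])   # (T, C) frame-level features
v = feats.mean(axis=0)                           # Eq. (2) with G = averaging
assert feats.shape == (T, C) and v.shape == (C,)
```

The rest of the paper is about replacing the trivial averaging G with an attention-based aggregation.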

Figure 2: Hierarchical Group-wise Non-local module (g1 = 16, g2 = 8). θ(X), φ(X) and g(X) are obtained via grouped convolutions. A and Y are obtained using grouped matrix multiplication.

2.1.2 Self-Attention (Non-local) Based Frame-level Features Aggregation

In a self-attention module, the response at a position is computed as a weighted average over all positions in an embedding space. As a representative attention mechanism, the NL module in [8] is adopted here to aggregate frame-level features; it is able to capture long-range dependencies across the frames.

Let x_t ∈ R^C, where C is the length of each frame's feature vector.

X ∈ R^{T×C}, which denotes the feature information of the T frames, can be obtained via reshaping {x_1, x_2, ..., x_T} to the size T×C (corresponding to Eq. (1)). Then, an attention map A, having the size T×T and containing the relationships between every pair of frames, can be obtained:

A = softmax(θ(X) φ(X)^T),   (3)

where θ(X) = X W_θ, φ(X) = X W_φ, and the weight matrices W_θ ∈ R^{C×C̃} and W_φ ∈ R^{C×C̃} are learned parameters. Commonly, the weight matrices W_θ, W_φ are implemented as 1×1 convolutions.

The output Y based on the attention map A is

Y = A g(X),   (4)

where g(X) = X W_g, and the weight matrix W_g ∈ R^{C×C} is also operated as a 1×1 convolution.

After this, Z can be obtained:

Z = γ Y + X.   (5)

In the above formulation, γ is a scale parameter and the output Z has the same size as the input signal X.

The video-level feature v is obtained via evenly averaging the rows z_t of Z:

v = (1/T) Σ_{t=1}^{T} z_t.   (6)

Figure 1 shows the schema of the NL module for frame-level features aggregation.
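The NL-based aggregation of Eqs. (3)-(6) can be sketched in a few lines of NumPy; the weight matrices below are random stand-ins for the learned 1×1 convolutions, and T, C, C̃ are small illustrative values, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, Ct = 6, 16, 8                     # frames, feature length, embedding length

X    = rng.standard_normal((T, C))      # stacked frame-level features
W_th = rng.standard_normal((C, Ct))     # stand-in for the learned W_theta
W_ph = rng.standard_normal((C, Ct))     # stand-in for the learned W_phi
W_g  = rng.standard_normal((C, C))      # stand-in for the learned W_g
gamma = 1.0                             # scale parameter

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

A = softmax((X @ W_th) @ (X @ W_ph).T)  # Eq. (3): (T, T) attention map
Y = A @ (X @ W_g)                       # Eq. (4)
Z = gamma * Y + X                       # Eq. (5): residual connection
v = Z.mean(axis=0)                      # Eq. (6): video-level feature

assert A.shape == (T, T) and np.allclose(A.sum(axis=1), 1.0)
assert v.shape == (C,)
```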

Analysis of NL module The NL module is effective for aggregating frame-level features. However, it presents low parameter efficiency and high computational cost. The number of parameters in the NL module is computed as follows. For the convolution layers corresponding to W_θ, W_φ and W_g, the numbers of their parameters (weights plus biases) are CC̃ + C̃, CC̃ + C̃ and C² + C, respectively. When C = 1024, the number of parameters in the NL module is as shown in Table 1. As shown there, if C̃ = C/8, the number of parameters is about 1.31M; if C̃ = C/2, it is about 2.1M. This number is quite large for practical use. In contrast, many backbone networks have a very small number of parameters, such as MobileNetV2 [6] (3.4M), MobileNetV2-1.4 (6.9M), MF-Net-2D (5.8M), MF-Net-3D [2] (8.0M) and I3D-RGB [1] (12.1M). As for the computational complexity, each 1×1 convolution applied to T frames costs roughly as many multiply-adds (MAdds) per frame as it has weights, so the convolution layers in NL require about T·(2CC̃ + C²) MAdds in total, i.e. roughly 2.1M·T (when C̃ = C/2) and 1.31M·T (when C̃ = C/8) for C = 1024. Therefore, it makes sense to reduce the parameter redundancy and computational cost of the NL module.
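The parameter figures quoted above can be reproduced with a short check; it assumes, as the Table 1 entries suggest, that each 1×1 convolution is counted as weights plus bias.

```python
# Parameter count (weights + bias) of a 1x1 convolution; `groups` > 1 would
# model a grouped convolution as used later in HG-NL.
def conv1x1_params(c_in, c_out, groups=1):
    return c_in * c_out // groups + c_out

C = 1024
# NL uses three such layers: W_theta, W_phi (C -> C~) and W_g (C -> C).
nl_half   = 2 * conv1x1_params(C, C // 2) + conv1x1_params(C, C)
nl_eighth = 2 * conv1x1_params(C, C // 8) + conv1x1_params(C, C)

assert nl_half   == 2_099_200   # ~2.1M  when C~ = C/2 (Table 1)
assert nl_eighth == 1_312_000   # ~1.31M when C~ = C/8 (Table 1)
```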

2.1.3 Hierarchical Group-wise Non-local Module

In order to reduce parameter redundancy and computational cost, the Hierarchical Group-wise Non-local (HG-NL) module for frame-level features aggregation is proposed. HG-NL has a hierarchical group-wise structure and generates several attention maps. The HG-NL module for frame-level features aggregation is performed as follows.

Firstly, in HG-NL, the weight matrices W_θ, W_φ are implemented as grouped convolutions with the number of groups being g1. The grouped convolutions largely reduce the parameters and the number of operations measured by MAdds.

After this, the attention map is computed:

A = ReLU(θ(X) ⊛_{g2} φ(X)^T),   (7)

where ⊛_{g2} denotes the grouped matrix multiplication with the number of groups being g2, A includes g2 attention maps, and each attention map has the size T×T.

As shown in Figure 2, the grouped matrix multiplication in Eq. (7) brings one attention map for each feature group in X, and the number of attention maps reaches g2. This mines the non-local information in the features in more detail and more effectively, whereas the NL produces only one attention map. Besides, the softmax is removed in HG-NL: the ReLU in Eq. (7) is lightweight to compute and provides the non-linearity for the HG-NL module.

Then, keeping the same groups as the grouped matrix multiplication in Eq. (7), the weight matrix W_g is operated as a grouped convolution with the number of groups being g2, and Y is computed via the grouped matrix multiplication with the number of groups being g2:

Y = A ⊛_{g2} g(X).   (8)

At last, the video-level feature v can be obtained from Y via Eq. (5) and Eq. (6) in Section 2.1.2.

Figure 2 shows the schema of the HG-NL module (g1 = 16, g2 = 8). In general, let g1 = r·g2, where r is a ratio; the relationship between g1 (primary grouped convolutions) and g2 (secondary grouped matrix multiplication) forms the hierarchical group-wise structure. Consider the values of g1 and g2. Even though multiple attention maps are able to mine the non-local information in more detail, each attention map will cover too narrow a slice of the feature information if g2 is too big. On the other hand, the bigger the number of groups, the smaller the related parameters and MAdds. Therefore, the values of g1 and g2 are commonly set to different values. As a special case, when g1 equals g2, the effect of HG-NL is the same as processing each feature group of X via an NL module individually.

Analysis of HG-NL module For the convolution layers corresponding to W_θ, W_φ and W_g, the numbers of their parameters are CC̃/g1 + C̃, CC̃/g1 + C̃ and C²/g2 + C, respectively. As shown in Table 1, when C = 1024 and C̃ = C/8, the HG-NL requires about 8 - 14 times fewer parameters than the NL, which has roughly 1.31M (C̃ = C/8) to 2.1M (C̃ = C/2) parameters. If parameters are shared across the groups in a grouped convolutional layer in HG-NL, the number of each convolution layer's parameters is further reduced g1 or g2 times. Besides, as shown in Table 2, each convolution layer in HG-NL requires g1 or g2 times fewer MAdds than the corresponding layer in NL, while the MAdds of the other non-convolution layers remain roughly unchanged. Thus, the HG-NL is able to reduce the model redundancy and computational cost, while achieving accuracy competitive with NL.
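A minimal NumPy sketch of the HG-NL forward pass (Eqs. (7)-(8)) is given below, with per-group weight lists standing in for grouped 1×1 convolutions. The sizes are illustrative, not the paper's; the final check reproduces the Table 1 parameter counts for C = 1024, C̃ = C/8.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, Ct, g1, g2 = 6, 64, 32, 16, 8   # illustrative sizes, not the paper's

def grouped_linear(X, Ws):
    # Grouped 1x1 convolution: split the channel axis into len(Ws) groups
    # and apply one weight matrix per group.
    parts = np.split(X, len(Ws), axis=-1)
    return np.concatenate([p @ W for p, W in zip(parts, Ws)], axis=-1)

# Per-group stand-in weights for W_theta, W_phi (g1 groups) and W_g (g2 groups).
W_th = [rng.standard_normal((C // g1, Ct // g1)) for _ in range(g1)]
W_ph = [rng.standard_normal((C // g1, Ct // g1)) for _ in range(g1)]
W_g  = [rng.standard_normal((C // g2, C // g2)) for _ in range(g2)]

X = rng.standard_normal((T, C))
theta, phi = grouped_linear(X, W_th), grouped_linear(X, W_ph)

# Eq. (7): grouped matrix multiplication with g2 groups -> g2 attention maps,
# with ReLU replacing the softmax of the standard NL.
th_g, ph_g = np.split(theta, g2, axis=-1), np.split(phi, g2, axis=-1)
A = np.stack([np.maximum(th_g[i] @ ph_g[i].T, 0.0) for i in range(g2)])

# Eq. (8): grouped matrix multiplication between the g2 maps and g(X).
V = np.split(grouped_linear(X, W_g), g2, axis=-1)
Y = np.concatenate([A[i] @ V[i] for i in range(g2)], axis=-1)

assert A.shape == (g2, T, T) and Y.shape == (T, C)

# Reproduce the Table 1 counts (weights + bias) for C = 1024, C~ = C/8:
p = lambda cin, cout, g: cin * cout // g + cout
assert p(1024, 128, 16) == 8_320                           # W_theta (and W_phi)
assert p(1024, 1024, 8) == 132_096                         # W_g
assert 2 * p(1024, 128, 16) + p(1024, 1024, 8) == 148_736  # total
```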

2.1.4 Implementation of HG-NL for Video Classification

Because no fully-connected layers are included in the network architecture of HG-NL, T (the number of frames) can be adjusted arbitrarily. Thus, in the evaluation phase of the proposed HG-NL module, the number of frames selected from a video for predicting the label does not need to be fixed to the same value as in the training phase and can be adjusted.
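This length-independence is easy to check: the sketch below reuses one set of random stand-in aggregation weights for both T = 3 and T = 25, mirroring the train-phase and test-phase settings; the softmax is omitted for brevity, and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
C, Ct = 16, 8                           # illustrative sizes
W_th = rng.standard_normal((C, Ct))
W_ph = rng.standard_normal((C, Ct))
W_g  = rng.standard_normal((C, C))

def aggregate(X):
    # Attention-based aggregation over T frames (softmax omitted for brevity);
    # no weight shape depends on T, so any sequence length works.
    A = (X @ W_th) @ (X @ W_ph).T       # (T, T)
    Z = A @ (X @ W_g) + X               # (T, C)
    return Z.mean(axis=0)               # (C,) video-level feature

for T in (3, 25):                       # train-phase vs. test-phase lengths
    assert aggregate(rng.standard_normal((T, C))).shape == (C,)
```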

Model    Val top-1 accuracy (%), train-phase setting (3 segments)    Val top-1 accuracy (%), test-phase setting (25 segments)
TSN      57.38      61.83
HG-NL    57.713     62.12
StNet    55.7       –
Table 3: Results of the models on the Kinetics-700 val set.

Models            avg. error
Model Ensemble    0.28444
Table 4: Results on the Kinetics-700 test set. The avg. error is the averaged top-1 and top-5 error.

2.2 Model Ensemble

In the model ensemble stage, three models are involved: TSN[7], HG-NL and StNet[4].

3 Experiments

In this section, we report experimental results of our method on the Kinetics-700 dataset. All models are pre-trained on the Kinetics-600 training set and finetuned on the Kinetics-700 training set. SE-ResNeXt101 is adopted as the backbone network. Due to the limited time, we exploit only RGB information. In our experiments, the full-length video is divided into several equal segments, and some frames are randomly selected from each segment. During training, the number of segments is set to 3 and one frame is randomly selected from each segment. During evaluation, we follow the same testing setup as in TSN [7].
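The segment-wise sampling described above can be sketched as follows; `sample_frames` is a hypothetical helper, and the snippet assumes the frame count divides evenly into the segments.

```python
import random

# Divide the frame indices of a video into `num_segments` equal parts and
# draw one random frame index per part (frame count assumed divisible).
def sample_frames(num_frames, num_segments=3, rng=random.Random(0)):
    seg_len = num_frames // num_segments
    return [s * seg_len + rng.randrange(seg_len) for s in range(num_segments)]

idx = sample_frames(300, num_segments=3)
assert len(idx) == 3
assert all(s * 100 <= i < (s + 1) * 100 for s, i in enumerate(idx))
```

At evaluation time, the same helper with a larger `num_segments` (e.g. 25) yields the denser test-phase sampling.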

3.1 TSN

In the TSN experiments, the initial learning rate is set to 0.001 and decayed by a factor of 10 at 20 and 30 epochs. The maximum iteration is set to 40 epochs.

3.2 HG-NL

In the HG-NL experiments, the module settings follow the analysis in Section 2.1 (g1 = 16, g2 = 8). Due to time limits, we finetuned the HG-NL on the Kinetics-700 training set for only 8 epochs, starting from the model pre-trained by TSN in Section 3.1. The initial learning rate is set to 0.001 and decayed by a factor of 10 at 4 and 6 epochs. The maximum iteration is set to 8 epochs. As shown in Table 3, HG-NL obtains a top-1 accuracy of 62.12% on the Kinetics-700 validation set, compared with 61.83% for TSN.

3.3 StNet

For StNet[4], the Temporal Modeling Block and Temporal Xception Block are used in our network. We adopt the same input as TSN for StNet. Because of the time limits, we only trained the network for 20 epochs on the Kinetics-700 dataset. The StNet result on the Kinetics-700 validation set is 55.7% top-1 and 78.3% top-5 under the train-phase setting (3-frame test).

3.4 Model Ensemble

Three models are involved in the model ensemble stage: TSN[7], HG-NL and StNet[4]. Our team finally obtains an averaged top-1 and top-5 error percentage of 28.444% on the Kinetics-700 test set.

4 Conclusion

In this report, our team's solution to the ActivityNet 2019 Kinetics-700 challenge is described. Experimental results have evidenced the effectiveness of the proposed HG-NL method: HG-NL achieves better accuracy than the TSN baseline. With the help of the hierarchical group-wise structure, the HG-NL module has 8 - 70 times fewer parameters and several times smaller computational complexity than the NL module. After model ensemble, our team finally obtains an averaged top-1 and top-5 error of 28.444% on the Kinetics-700 test set.

References