Deep learning has brought great progress to various recognition tasks in the image domain, such as image classification [17, 9], object detection, and instance segmentation. The key to this success is devising flexible and efficient architectures that can learn powerful visual representations from large-scale image datasets. However, deep learning research in video understanding has progressed more slowly, partially due to the high complexity of video data. The core technical problem in video understanding is to design an effective temporal module that can capture complex temporal structure with high flexibility while keeping the computational cost low enough to process high-dimensional video data efficiently.
3D Convolutional Neural Networks (3D CNNs) [26, 12] have become the mainstream architectures for video modeling [1, 4, 28, 21]. The 3D convolution is a natural extension of its 2D counterpart and provides a learnable operator for video recognition. However, this simple extension lacks specific consideration of the temporal properties of video data and may also lead to high computational cost. Therefore, recent methods aim to improve 3D CNNs in two different ways: combining a lightweight temporal module with 2D CNNs to improve efficiency (e.g., TSN, TSM), or designing a dedicated temporal module to better capture temporal relations (e.g., Non-local Net, ARTNet, STM). However, how to devise a temporal module with both high efficiency and strong flexibility remains an unsolved problem in video recognition. Consequently, we aim to advance current video architectures in this direction.
In this paper, we focus on devising a principled adaptive module to capture temporal information in a more flexible way. In general, we observe that video data exhibits extremely complex dynamics along the temporal dimension due to factors such as camera motion and speed variations. Thus, 3D convolutions (temporal convolutions) might lack the representation power to describe this motion diversity, since they simply employ a fixed number of video-invariant kernels. To deal with such complex temporal variations, we argue that adaptive temporal kernels for each video are effective and perhaps necessary to describe motion patterns. To this end, as shown in Figure 1, we present a two-level adaptive modeling scheme that decomposes this video-specific temporal kernel into a location-sensitive importance map and a location-invariant aggregation kernel. This design allows the location-sensitive importance map to focus on enhancing discriminative temporal information from a local view, and enables the location-invariant aggregation weights to capture temporal dependencies guided by a global view of the input video sequence.
Specifically, the design of our temporal adaptive module (TAM) strictly follows two principles: high efficiency and strong flexibility. To keep the computational cost of TAM low, we first squeeze the feature map with global spatial pooling, and then build TAM in a channel-wise manner. Our TAM is composed of two branches: a local branch and a global branch. As shown in Fig. 2, TAM is implemented in an efficient way. The local branch employs temporal convolutions to produce a location-sensitive importance map that attends to discriminative features, while the global branch uses temporal fully connected layers to produce the location-invariant aggregation weights. The importance map, generated from a local temporal window, focuses on short-term motion modeling, while the aggregation weights, generated from a global view, pay more attention to long-term temporal information. Furthermore, TAM can be flexibly plugged into existing 2D CNNs to yield an efficient video recognition architecture, termed TANet.
We evaluate the proposed TANet on the task of action classification in video recognition. In particular, we first study the performance of TANet on the Kinetics-400 dataset. We demonstrate that our TAM is better at capturing temporal information than several other counterparts, such as temporal pooling, temporal convolution, TSM, and the Non-local block. Our TANet yields very competitive accuracy with FLOPs similar to those of 2D CNNs. We also test TANet on the motion-dominated Something-Something dataset, where state-of-the-art performance is achieved.
2 Related Work
Video understanding is a core topic in computer vision. In the early stage, many traditional methods [18, 16, 23, 34] designed various hand-crafted features to encode video data. Among these hand-designed methods, iDT, which uses dense trajectory features, achieved strong performance. Unfortunately, hand-crafted methods are too inflexible to generalize to other video tasks. Recently, video understanding, and video recognition in particular, has benefited considerably from deep learning methods [17, 25, 9], and a series of deep learning methods have been proposed to learn spatiotemporal representations. These are generally divided into two categories: (1) 2D CNN methods and (2) 3D CNN methods. Our work also relates to attention in CNNs.
2D CNNs Methods for Action Recognition. Since deep learning has been widely used in still-image tasks, there have been many attempts [14, 24, 31, 36, 7, 19] to model video clips with 2D CNNs. As the vanilla 2D convolution is incapable of dealing with temporal relations, two-stream methods [24, 5] leveraged optical flow as motion features to learn temporal cues. In particular, TSN used frames sparsely sampled from the whole video to learn long-range dependencies by aggregating the scores after the last fully connected layer. TSM shifted channels along the temporal dimension in an efficient way, which yields a good performance gain while relying entirely on 2D CNNs. Differing from all the aforementioned methods, our method, also based on 2D CNNs, uses a two-level adaptive modeling scheme that decomposes the video-specific kernel into a location-sensitive excitation and a location-invariant aggregation. As extracting optical flow is too expensive, TANet only uses RGB as the input modality in our experiments.
3D CNNs Methods for Action Recognition. As a simple extension from the spatial domain to the spatiotemporal domain, 3D convolution [12, 26] was proposed to capture the motion information encoded in video clips. Following the release of the large-scale Kinetics dataset, 3D CNNs were widely used in action recognition. Their variants [21, 28, 35] decomposed the 3D convolution into a spatial 2D convolution and a temporal 1D convolution to learn spatiotemporal features. ARTNet and SlowFast designed networks with dual paths to learn spatiotemporal features. Unlike P3D or R(2+1)D, our video-specific aggregation kernel performs a channel-wise 1D convolution along the temporal dimension, and each video clip has its own unique kernel.
Attention. The local branch in TAM is most closely related to SENet. However, SENet learns modulation weights for each channel of the feature maps to perform feature recalibration in image recognition. The STC block was proposed to study spatiotemporal channel correlation in action recognition. Different from these methods, our local branch squeezes the spatial features but keeps the temporal information in order to learn location-sensitive importance. The Non-local network was designed with a non-local mean operation, which can be seen as self-attention, to capture long-range dependencies. Our TANet captures long-range dependencies by simply stacking more TAMs, while keeping the efficiency of the backbone networks.
This section first describes the temporal adaptive module (TAM) in detail, and then introduces TANet as an exemplar for video recognition.
3.1 The Overview of Temporal Adaptive Module
As discussed in Sec. 1, video data typically exhibit complex temporal dynamics caused by factors such as camera motion and speed variations. Therefore, we aim to tackle this issue by introducing a temporal adaptive module (TAM) with video-specific kernels, unlike the shared convolutional kernels in 3D CNNs. The proposed TAM generates a dynamic temporal kernel based on the video features in a flexible and efficient way, and is thus able to aggregate temporal information adaptively according to the motion content. TAM can be easily integrated into existing 2D CNNs (e.g., ResNet) to yield a video network architecture, as shown in Figure 2. We first give an overview of TAM and then describe its technical details. We also discuss the relation of TAM to existing works.
Formally, let $X \in \mathbb{R}^{C \times T \times H \times W}$ denote a feature map, where $C$ represents the number of channels, and $T$, $H$, $W$ are its spatiotemporal dimensions. For inference efficiency, our TAM only focuses on temporal modeling, and the spatial pattern is expected to be captured by 2D convolutions. Therefore, we first employ global spatial average pooling to squeeze the feature map as follows:
$$\hat{X}_{c,t} = \phi(X)_{c,t} = \frac{1}{H \times W} \sum_{i,j} X_{c,t,j,i},$$
where $c$, $t$, $j$, $i$ are the indices of the different dimensions (channel, time, height and width), and $\hat{X} \in \mathbb{R}^{C \times T}$ aggregates the spatial information of $X$. For simplicity, we use $\phi$ to denote the function that aggregates the spatial information. Our proposed temporal adaptive module is built on this compressed 1D temporal signal for high efficiency.
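The spatial squeeze described above can be illustrated with a minimal pure-Python sketch. It assumes the feature map is stored as a nested list indexed as `x[c][t][h][w]`; the function name `spatial_squeeze` and the toy values are illustrative and not part of the original method description.

```python
def spatial_squeeze(x):
    """Average a feature map x[c][t][h][w] over its spatial dimensions.

    Returns a nested list of shape [C][T] (one value per channel and frame).
    """
    squeezed = []
    for channel in x:
        row = []
        for frame in channel:
            values = [v for line in frame for v in line]
            row.append(sum(values) / len(values))
        squeezed.append(row)
    return squeezed


# Toy input (assumed values): C=1 channel, T=2 frames, H=W=2.
x = [[[[1.0, 2.0], [3.0, 4.0]],
      [[0.0, 0.0], [0.0, 8.0]]]]
x_hat = spatial_squeeze(x)
print(x_hat)  # [[2.5, 2.0]]
```

After this step each channel is a 1D temporal signal of length T, which is what keeps the later branches cheap.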
Our TAM is composed of two branches, a local branch and a global branch, which aim to learn a location-sensitive importance map to enhance discriminative features and then produce location-invariant weights to aggregate temporal information in a convolutional manner. More specifically, TAM is formulated as follows:
$$Y = \mathcal{G}(\hat{X}) \otimes \big(\mathcal{L}(\hat{X}) \odot X\big),$$
where $X$ is the input feature map, $\hat{X}$ is its spatially squeezed version, $\otimes$ denotes the convolution operator and $\odot$ denotes element-wise multiplication. $\mathcal{G}$ represents the global branch and $\mathcal{L}$ denotes the local branch. Both branches operate on the squeezed feature map $\hat{X}$, and the output sizes of $\mathcal{L}$ and $\mathcal{G}$ are $C \times T$ and $C \times K$, respectively, where $K$ is the adaptive kernel size. It is worth noting that the two branches focus on different aspects of temporal information: the local branch captures short-term information to attend to important features by using a temporal convolution, while the global branch incorporates long-range temporal structure to guide adaptive temporal aggregation with fully connected layers. Disentangling the kernel learning procedure into local and global branches turns out to be effective, as demonstrated in the experiments. We describe these two branches in detail in the following sections.
3.2 Local Branch in TAM
As discussed above, the local branch aims to leverage short-term temporal dynamics to help produce video-specific kernels. We observe that this short-term information varies along the temporal dimension, so a position-sensitive importance map is required to capture local temporal structure.
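More formally, as shown in Figure 2, we build the local branch of TAM from a sequence of temporal convolutional layers with ReLU non-linearity as follows:
$$s = \mathcal{L}(\hat{X}) = \mathrm{Sigmoid}\Big(\mathrm{Conv1D}\big(\delta(\mathrm{BN}(\mathrm{Conv1D}(\hat{X}, B, C/\beta))), B, C\big)\Big),$$
where $s$ is the learned importance map, $C$ is the number of channels of the input tensor, and $\delta$ is the ReLU function. $\mathrm{Conv1D}$ is a temporal convolution parametrized by an input tensor, a kernel size $B$, and an output channel number. As the goal of the local branch is to capture short-term information, we set the kernel size $B$ to 3 so that the importance map is learned solely from a local temporal window. To control the model complexity, the first $\mathrm{Conv1D}$, followed by BN, reduces the number of channels from $C$ to $C/\beta$. Then, the second $\mathrm{Conv1D}$ with a sigmoid activation yields the importance weights $s \in \mathbb{R}^{C \times T}$. To match the size of $X$, we rescale $s$ to $\hat{s} \in \mathbb{R}^{C \times T \times H \times W}$ by replicating it along the spatial dimensions:
$$\hat{s}_{c,t,j,i} = s_{c,t},$$
where $c$, $t$, $j$, $i$ are the indices of the channel, time, height and width dimensions. Finally, the temporal excitation is formulated as follows:
$$Z = \hat{s} \odot X,$$
where $Z \in \mathbb{R}^{C \times T \times H \times W}$ denotes the activated feature maps and $\odot$ denotes element-wise multiplication.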
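To make the local branch concrete, the following is a minimal pure-Python sketch of the pipeline described above: a temporal convolution that reduces the channel count, a ReLU, a second temporal convolution, a sigmoid, and the spatially replicated excitation. Batch normalization is omitted for brevity, zero padding at the clip boundaries is an assumption, and the function names, weight layouts and toy values are illustrative only.

```python
import math


def conv1d_same(x, weight, bias):
    """Temporal convolution with zero padding (an assumption of this sketch).

    x: [C_in][T], weight: [C_out][C_in][K] with K odd, bias: [C_out].
    Returns [C_out][T].
    """
    c_in, t_len = len(x), len(x[0])
    k = len(weight[0][0])
    pad = k // 2
    out = []
    for w_out, b in zip(weight, bias):
        row = []
        for t in range(t_len):
            acc = b
            for ci in range(c_in):
                for j in range(k):
                    src = t + j - pad
                    if 0 <= src < t_len:
                        acc += w_out[ci][j] * x[ci][src]
            row.append(acc)
        out.append(row)
    return out


def local_branch(x_hat, w1, b1, w2, b2):
    """Importance map s[c][t] with values in (0, 1); BN is omitted here."""
    hidden = conv1d_same(x_hat, w1, b1)
    hidden = [[max(0.0, v) for v in row] for row in hidden]  # ReLU
    logits = conv1d_same(hidden, w2, b2)
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits]


def excite(x, s):
    """Replicate s[c][t] over H and W and multiply it with x[c][t][h][w]."""
    return [[[[s[c][t] * v for v in line] for line in frame]
             for t, frame in enumerate(channel)]
            for c, channel in enumerate(x)]


# Toy example (assumed values): C=2 reduced to 1 hidden channel, T=4, H=W=1.
x_hat = [[2.0, 4.0, 6.0, 8.0], [1.0, 1.0, 1.0, 1.0]]
w1 = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]    # 2 -> 1 channels, kernel size 3
w2 = [[[0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]]  # 1 -> 2 channels, kernel size 3
s = local_branch(x_hat, w1, [0.0], w2, [0.0, 0.0])
x = [[[[2.0]], [[4.0]], [[6.0]], [[8.0]]],
     [[[1.0]], [[1.0]], [[1.0]], [[1.0]]]]
z = excite(x, s)
```

With all-zero weights the sigmoid outputs 0.5 everywhere, so `z` is simply `x` scaled by one half; trained weights would instead give a different importance value at each temporal position.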
3.3 Global Branch in TAM
Concerning the global branch of TAM, we focus on generating an adaptive kernel based on long-term temporal information. It incorporates global context information into TAM and learns the location sharing weights for aggregation. It is required to have a global view to produce a video specific convolution kernel.
Learning the Adaptive Kernels. In the global branch, we opt to generate a dynamic kernel for each video clip and aggregate temporal information in a convolutional manner. To simplify the dynamic kernel generation while preserving high efficiency, we propose to learn the adaptive kernel in a channel-wise manner. In this sense, the learned adaptive kernel only models temporal relations, without taking channel correlation into account. Thus, TAM does not change the number of channels of the input features, and the learned adaptive kernel is convolved with the input feature maps channel by channel. More formally, for the $c$-th channel, the adaptive kernel is learned as follows:
$$\Theta_c = \mathcal{G}(\hat{X})_c = \mathrm{softmax}\big(W_2\, \delta(W_1 \hat{X}_c)\big),$$
where $\Theta_c \in \mathbb{R}^{K}$ is the generated adaptive kernel (aggregation weights) for the $c$-th channel, $K$ is the adaptive kernel size, $W_1$ and $W_2$ are the weights of the two fully connected layers, and $\delta$ denotes the ReLU activation function. Similar to the importance map learning in the local branch, the adaptive kernel is learned from the squeezed feature map $\hat{X}$ without considering the spatial structure, for modeling efficiency. Unlike the local branch, however, we use fully connected (fc) layers to learn the adaptive kernel by leveraging long-term information. Complementary to the importance map in the local branch, we expect the learned adaptive kernel to have a global receptive field and thus to aggregate temporal features guided by the global context. To increase the modeling power of the global branch, we stack two fc layers, and the learned kernel is normalized with a softmax function to yield positive aggregation weights. The learned aggregation weights are applied in a convolutional manner to capture the temporal interaction of features.
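As an illustration of the kernel generation just described, here is a minimal pure-Python sketch of two stacked fully connected layers followed by a softmax, applied to each channel's temporal signal. Sharing the two weight matrices across channels, the bias-free layers, the hidden width and all toy values are assumptions of this sketch rather than details stated in the text.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vec, weight):
    """Bias-free fully connected layer; weight has shape [out][in]."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def global_branch(x_hat, w1, w2):
    """Return one K-tap aggregation kernel per channel from x_hat[c][t].

    w1: [hidden][T], w2: [K][hidden]; both are shared across channels
    (an assumption of this sketch).
    """
    kernels = []
    for row in x_hat:
        hidden = [max(0.0, h) for h in linear(row, w1)]  # ReLU
        kernels.append(softmax(linear(hidden, w2)))
    return kernels


# Toy example (assumed values): C=2, T=4, hidden width 2, K=3.
x_hat = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
w1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
kernels = global_branch(x_hat, w1, w2)
```

Because each kernel is computed from the whole temporal signal of the current input, two different inputs (here the two rows of `x_hat`) receive different kernels, while the softmax keeps every kernel positive and summing to one.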
Temporal Adaptive Aggregation. Having introduced the architecture of the two branches, we are ready to describe how to aggregate temporal information with the learned adaptive kernel. As formulated in Equation 3, the learned location-sensitive importance map is used for feature excitation, and the location-shared aggregation weights are used for temporal convolution as follows:
$$Y_{c,t,j,i} = \big(\mathcal{G}(\hat{X}) \otimes Z\big)_{c,t,j,i} = \sum_{k} \Theta_{c,k} \cdot Z_{c,t+k,j,i},$$
where $\cdot$ denotes scalar multiplication, $k$ runs over the $K$ temporal offsets covered by the kernel $\Theta_c$, $Y$ is the feature map after temporal convolution, and $Z$ is the output of the local branch.
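The aggregation step can be sketched as a channel-wise temporal convolution in which every channel uses its own kernel. The pure-Python example below assumes features stored as `z[c][t][h][w]`, an odd kernel size centered on the current frame, and zero padding at the clip boundaries; these choices and the function name are illustrative assumptions.

```python
def adaptive_aggregate(z, kernels):
    """Channel-wise temporal convolution with per-channel kernels.

    z: [C][T][H][W] excited features, kernels: [C][K] with K odd.
    Zero padding (an assumption) keeps the temporal length T unchanged.
    """
    out = []
    for channel, theta in zip(z, kernels):
        t_len = len(channel)
        pad = len(theta) // 2
        frames = []
        for t in range(t_len):
            h, w = len(channel[t]), len(channel[t][0])
            frame = [[0.0] * w for _ in range(h)]
            for k, weight in enumerate(theta):
                src = t + k - pad
                if 0 <= src < t_len:
                    for j in range(h):
                        for i in range(w):
                            frame[j][i] += weight * channel[src][j][i]
            frames.append(frame)
        out.append(frames)
    return out


# Toy example (assumed values): C=1, T=3, H=W=1, K=3.
z = [[[[1.0]], [[2.0]], [[4.0]]]]
kernels = [[0.25, 0.5, 0.25]]
y = adaptive_aggregate(z, kernels)
print(y)  # [[[[1.0]], [[2.25]], [[2.5]]]]
```

The same kernel is applied at every temporal and spatial position of a channel, which is what makes the aggregation weights location invariant, while the kernel values themselves change from video to video.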
In summary, TAM is a principled adaptive module with a unique two-step aggregation scheme, in which the local excitation and the global aggregation are both derived from the current feature map but focus on capturing different temporal structures (i.e., short-term and long-term temporal structure). As demonstrated in the experiments, TAM is an efficient yet effective temporal adaptive scheme.
3.4 Exemplar: TANet
We now describe how to instantiate TANet. As a novel temporal modeling method, the temporal adaptive module endows existing 2D CNNs with a strong ability to model different temporal structures in video clips. In practice, TAM adds only limited computational overhead, but clearly improves performance on different types of datasets.
ResNets are employed as backbones to verify the effectiveness of TAM. As illustrated in Fig. 2, TAM is embedded into the ResNet-Block after the first Conv2D, which turns the vanilla ResNet-Block into a TA-Block. This design does not excessively alter the topology of the network and can reuse the weights of the ResNet-Block. Supposing we sample T frames as an input clip, the scores of the T frames after the fc layer are aggregated by average pooling to yield the clip-level scores. No temporal downsampling is performed before the fc layer. In fact, our method places few constraints on the number of TA-Blocks and their insertion positions; these points are discussed later. Such properties fully exhibit the flexibility and efficiency of our method. Extensive experiments are conducted in Sec. 4.2 to demonstrate the effectiveness of TANet.
We note that the structure of the local branch is similar to SENet. The first obvious difference is that our local branch does not squeeze the temporal information. We thus use a temporal 1D convolution as the basic layer instead of a fully connected layer. The two-layer design only seeks to obtain a more powerful non-linearity to model the short-term variations in videos. Furthermore, the local branch mainly aims to learn temporal location-sensitive importance maps, and cooperates with the global branch to learn more discriminative features.
TSN, TSM, etc. only aggregate temporal features with a fixed scheme, whereas our temporal adaptive module yields video-specific weights to adaptively aggregate temporal features in early stages. As for 3D convolution, all input samples share the same convolution kernel, which likewise does not account for the temporal diversity in videos. In addition, our global branch essentially performs a channel-wise temporal convolution whose filter has size $1 \times k \times 1 \times 1$, while each filter in a normal 3D convolution has size $C \times k \times k \times k$, where C is the number of channels and k denotes the receptive field. Thus our method is more efficient than 3D CNNs.
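The efficiency argument above can be checked with a small arithmetic sketch. It assumes that a channel-wise temporal filter has k taps per channel, that a standard 3D filter spans all C input channels and a k x k x k window, and that the layer keeps C output channels; the function names and the example values C=256, k=3 are illustrative. The count covers only the filter weights applied during aggregation, not the fully connected layers that generate the adaptive kernels.

```python
def channelwise_temporal_weights(channels, k):
    """Total weights in C channel-wise temporal filters with k taps each."""
    return channels * k


def conv3d_weights(channels, k):
    """Total weights in C full 3D filters, each spanning C x k x k x k."""
    return channels * channels * k ** 3


c, k = 256, 3  # assumed example sizes
temporal = channelwise_temporal_weights(c, k)
full_3d = conv3d_weights(c, k)
print(temporal, full_3d, full_3d // temporal)  # 768 1769472 2304
```

Under these assumptions the gap grows as C times k squared, which is why the channel-wise design stays cheap even for wide layers.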
In this section, we study the effectiveness of TANet in detail on standard benchmarks. First, we describe the implementation details of TANet. Then, comprehensive ablation studies are performed on Kinetics-400 to investigate its optimal setting. After that, we compare our TAM with other temporal modeling counterparts. Finally, we compare with previous state-of-the-art methods on Kinetics-400 and Sth-Sth V1&V2. We also visualize the learned kernels to provide some insight into TANet.
4.1 Implementation Details
Datasets. Our experiments are conducted on three large-scale datasets, namely Kinetics-400 and Something-Something (Sth-Sth) V1&V2. Kinetics-400 contains 300k video clips covering 400 human action categories. The videos in Kinetics-400, trimmed from raw YouTube videos, are around 10s long. We train models on the training set (240k video clips) and test them on the validation set (20k video clips). The Sth-Sth datasets focus on fine-grained actions and contain a series of pre-defined basic actions involving interaction with everyday objects. Sth-Sth V1 comprises 86k video clips in the training set and 12k video clips in the validation set. Sth-Sth V2 is an updated version of Sth-Sth V1, containing 169k video clips in the training set and 25k video clips in the validation set. Both have 174 action categories.
Training. In our experiments, we only train models using 8 frames and 16 frames. On Kinetics-400, following the practice in , the frames are sampled from 64 consecutive frames in the video. On Sth-Sth V1&V2, we employ the uniform sampling strategy of TSN to train TANet. We first resize the shorter side of frames to , and apply multi-scale cropping and random horizontal flipping as data augmentation. The cropped frames are resized to for training the networks. The batch size is set to 64. Our models are initialized with ImageNet pre-trained weights to reduce the training time. Specifically, on Kinetics-400, we train for 100 epochs. The initial learning rate is set to 0.01 and divided by 10 at epochs 50, 75, and 90. We use SGD with a momentum of 0.9 and a weight decay of 1e-4 to train TANet. On Sth-Sth V1&V2, we train models for 50 epochs. The learning rate starts at 0.01 and is divided by 10 at epochs 30, 40, and 45. We use a momentum of 0.9 and a weight decay of 1e-3 to mitigate overfitting.
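The two learning-rate schedules described above are plain step decays, which can be written as a short sketch. The function name `step_lr` is illustrative, and treating epochs as 0-indexed, with the drop taking effect at the listed epoch, is an assumption of this sketch.

```python
def step_lr(epoch, base_lr=0.01, milestones=(50, 75, 90), gamma=0.1):
    """Learning rate at a 0-indexed epoch under step decay.

    The rate is multiplied by gamma once for every milestone already reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Kinetics-400 schedule: 100 epochs, drops at epochs 50, 75 and 90.
kinetics_lrs = [step_lr(e) for e in range(100)]

# Sth-Sth V1&V2 schedule: 50 epochs, drops at epochs 30, 40 and 45.
sth_lrs = [step_lr(e, milestones=(30, 40, 45)) for e in range(50)]
```

Both schedules start at 0.01 and end three decades lower; only the number of epochs and the milestone positions differ between the datasets.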
Testing. We apply different inference schemes to compare fairly with other state-of-the-art models. On Kinetics-400, the shorter side is scaled to 256 and we take 3 crops of to cover the spatial dimensions. In the temporal dimension, we uniformly sample 10 clips for 8-frame models and 4 clips for 16-frame models. The final video-level prediction is obtained by averaging the scores of all spatio-temporal views. On Sth-Sth V1, we scale the shorter side of frames to 256 and use a center crop of for evaluation. On Sth-Sth V2, we employ an evaluation protocol similar to that of Kinetics, but only uniformly sample 2 clips.
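The multi-view testing protocol can be sketched in two small helpers: one that places clips uniformly along a video and one that averages class scores over all views. The helper names, the 300-frame video length, the 64-frame clip span and the example scores are assumptions made for illustration; the exact clip placement used in the experiments is not specified in the text.

```python
def uniform_clip_starts(num_frames, clip_span, num_clips):
    """Evenly spaced start indices for num_clips clips of clip_span frames."""
    last_start = max(num_frames - clip_span, 0)
    if num_clips == 1:
        return [last_start // 2]
    return [round(i * last_start / (num_clips - 1)) for i in range(num_clips)]


def average_views(view_scores):
    """Average per-class scores over all spatio-temporal views."""
    n = len(view_scores)
    return [sum(col) / n for col in zip(*view_scores)]


# Assumed example: a 300-frame video, 10 clips spanning 64 frames each.
starts = uniform_clip_starts(300, 64, 10)
num_views = len(starts) * 3  # 10 clips x 3 crops

# Two hypothetical views with scores for two classes.
video_scores = average_views([[0.2, 0.8], [0.4, 0.6]])
```

With 10 clips and 3 crops an 8-frame model is evaluated on 30 views per video, whereas the 16-frame setting with 4 clips uses 12.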
4.2 Ablation Studies on Kinetics-400
The ablation studies are performed on Kinetics-400 to investigate different aspects of TANet. The ResNet architecture we use is the same as the original one. Our TANet replaces all ResNet-Blocks with TA-Blocks by default.
Parameter choices. We use different combinations of and to find the optimal hyper-parameters of our proposed module. TANet is instantiated as in Fig. 2. Our method with and achieves the highest performance, as shown in Table 1(a), and this setting is used in the following experiments.
Inserted position. Table 1(b) compares inserting TAM at different positions. TANet-a, TANet-b, TANet-c and TANet-d denote that TAM is inserted before the first convolution, after the first convolution, after the second convolution and after the last convolution in the block, respectively. The style in Fig. 2 is TANet-b, which has a slight advantage over the other styles, as shown in Table 1(b). TANet-b is abbreviated as TANet by default in the following sections.
The number of TA-Blocks. To trade off performance against efficiency, we gradually add more TA-Blocks to ResNet. As shown in Table 1(c), the performance is nearly saturated when more than 9 TA-Blocks are added to the network. The best-performing configuration in Table 1(c) is used in the following experiments.
Temporal receptive fields. We also try increasing the temporal receptive field of the learned kernel in the global branch. From Table 1(d), a larger kernel size appears beneficial when TANet takes more sampled frames as input. On the other hand, it even degrades the performance of TANet when 8 frames are sampled. In our following experiments, the kernel size is set to 3 for convenience.
| Model | FLOPs (of single view) | Params | Top-1 | Top-5 |
| NL C2D | 64.49G | 31.7M | 74.4% | 91.5% |
| Global branch + SE | 43.02G | 24.7M | 75.4% | 92.0% |
Table 2: Study on the effectiveness of TAM. All models use ResNet50 as the backbone and take 8 frames with a sampling stride of 8 as input. To be consistent with testing, the FLOPs are calculated with spatial size.
4.3 Comparison with Other Temporal Modules
To understand the effect of our TAM on action recognition, we describe several competitive temporal modules and compare them with TANet. The optimal configurations studied above are applied in the following experiments. The training settings of the other methods are kept consistent with those of TANet.
2D ConvNet (C2D). We use ResNet50 as the backbone to build the 2D ConvNet. The 2D ConvNet focuses on learning spatial cues and operates on each frame independently, without any temporal interaction before the last layer.
2D ConvNet with temporal pooling (C2D-Pool). To probe the impact of temporal fusion, C2D-Pool uses an average pooling layer whose kernel size is to perform temporal fusion; it can be built by simply replacing all temporal adaptive modules in TANet with average pooling layers. This naive approach gives C2D the ability to aggregate temporal information in a simple way. Since C2D-Pool is insensitive to the order of frames, it is incapable of dealing with complicated temporal relations. Another method based on 2D ConvNets, TSM, learns temporal relations without extra cost by shifting part of the channels along the temporal dimension.
Inflated 3D ConvNet (I3D). I3D is one of the most frequently used models in action recognition. In our implementation, we inflate the first kernel in the ResNet-Block to , which provides a fairer comparison with our TANet. Following the , we use I3D to denote this variant.
The aforementioned methods share a common design: modeling video clips with a fixed and general scheme. However, as shown in Table 2, our method yields superior performance, outperforming C2D by 5.9% in accuracy and even surpassing I3D (76.1% vs. 74.3%), which suggests that fixed schemes for modeling videos may be insufficient to learn temporal cues. More importantly, TANet adds only a small amount of extra FLOPs and parameters.
Non-local C2D (NL C2D). The non-local block, which can be seen as a type of self-attention, was proposed to capture long-range dependencies in videos. Our method instead uses a temporal adaptive scheme to capture temporal dependencies in an efficient way. The preferred setting with 5 non-local blocks mentioned in is used for comparison with TANet. As seen in Table 2, our method achieves higher accuracy than NL C2D (76.1% vs. 74.4%). In addition, TANet is more efficient than I3D and NL C2D: it requires only 43G FLOPs per view and has 25.6M parameters.
To validate the effectiveness of each part of our module described in Sec. 3.1, we propose three variants of TANet. Global branch, guided only by global temporal information, performs adaptive fusion on feature maps without resorting to local temporal excitation. Local branch uses local temporal information to help C2D learn more discriminative features. Global branch+SE replaces the local branch in TANet with an SE module. The SE module uses the optimal configuration mentioned in the paper . TANet achieves the highest performance among these baselines, which strongly suggests that a local branch with a local temporal receptive field is more beneficial to our adaptive scheme.
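| Method | Backbone | Frames × size | GFLOPs × views | Top-1 | Top-5 |
| ARTNet | ResNet18 | 16×112×112 | 23.5×250 | 70.7% | 89.3% |
| NL I3D | ResNet50 | 32×224×224 | N/A | 74.9% | 91.6% |
| NL I3D | ResNet50 | 128×224×224 | 282×30 | 76.5% | 92.6% |
| NL I3D | ResNet101 | 128×224×224 | 359×30 | 77.7% | 93.3% |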
4.4 Comparison with the State of the Art
Comparison on Kinetics-400. Table 3 shows the state-of-the-art results on Kinetics-400. Our method, as an adaptive modeling scheme, achieves competitive performance compared with other models. The 8-frame TANet-50 also outperforms SlowFast by 0.5% when using similar FLOPs per view. The 16-frame TANet uses only 4 clips and 3 crops for evaluation, which provides higher inference efficiency. It is worth noting that our 16-frame TANet-50 is still more accurate than the 32-frame NL I3D by 2.2%. Furthermore, our method is compatible with existing video frameworks such as SlowFast. Specifically, TANet can easily replace the Slow path in SlowFast. Our TANet is more lightweight than SlowOnly when taking the same number of frames as input, yet yields higher accuracy. In general, the proposed TANet is a good practice for adaptively modeling temporal relations in videos.
| Method | Backbone | Pre-train | FLOPs | Top-1 | Top-5 |
| NL I3D | ResNet50 | ImgNet+K400 | 334G | 44.4% | 76.0% |
| NL I3D+GCN | ResNet50+GCN | ImgNet+K400 | 606G | 46.1% | 76.8% |
Comparison on Sth-Sth V1 & V2. As shown in Table 4, our method achieves state-of-the-art accuracy compared with other models on Sth-Sth V1. For fair comparison, Table 4 only reports results that take a single clip with a center crop as input. TANet is more accurate than TSM equipped with the same backbone (Top-1: 50.6% vs. 49.7%). We also conduct experiments on Sth-Sth V2. V2 has more video clips than V1, which further unleashes the capability of TANet without suffering from overfitting. Following the common practice in , TANet uses 2 clips with 3 crops to evaluate accuracy. As shown in Table 5, our models achieve state-of-the-art performance on Sth-Sth V2, with a top-1 accuracy of 65.5%. The performance on the Sth-Sth datasets demonstrates that our method is also good at modeling fine-grained and temporally related actions.
4.5 Visualizations of Learned Kernel
To understand the behavior of TANet, we visualize the distribution of the kernels generated by the global branch in the last block of stage4 and stage5. For clear contrast, the kernel weights of I3D at the same stages are also visualized. As depicted in Fig. 3, the learned kernels have an evident characteristic: the shapes and scales of their distributions are more diverse than those of I3D. Since all video clips share the same kernels in I3D, its kernel weights cluster together excessively. In contrast, even when modeling the same action in different videos, TAM generates kernels with slightly different distributions. Taking driving car as an example, the shapes of the distributions shown in Fig. 3 are similar to each other, but the medians of the distributions are not equal. For different actions such as drinking beer and skydiving, the shapes and medians of the distributions vary greatly. Even the same action can take different forms in different videos. Since the motion patterns in different videos may differ inherently, it is necessary to employ an adaptive scheme when modeling video sequences.
In this paper, we have presented a novel temporal adaptive module (TAM) to capture complex motion information in videos, and built a powerful video architecture (TANet) on it. TAM yields video-specific kernels by combining a local importance map with global aggregation weights. The local and global branches in TAM help capture temporal structure from different views and contribute to making temporal modeling more effective and robust. As demonstrated on Kinetics-400, TANet equipped with TAM outperforms existing temporal modules in action recognition, which confirms the effectiveness of TAM in temporal modeling. TANet also achieves state-of-the-art performance on the motion-dominated Sth-Sth V1&V2 datasets.
References
[1] Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 4724–4733 (2017)
[2] Deng, J., Dong, W., Socher, R., Li, L., Li, K., Li, F.: Imagenet: A large-scale hierarchical image database. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR. pp. 248–255 (2009)
[3] Diba, A., Fayyaz, M., Sharma, V., Arzani, M.M., Yousefzadeh, R., Gall, J., Gool, L.V.: Spatio-temporal channel correlation networks for action classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision - ECCV. Lecture Notes in Computer Science, vol. 11208, pp. 299–315 (2018)
[4] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: IEEE/CVF International Conference on Computer Vision, ICCV. pp. 6201–6210 (2019)
[5] Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 1933–1941 (2016)
[6] Goyal, R., Kahou, S.E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fründ, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., Memisevic, R.: The "something something" video database for learning and evaluating visual common sense. In: IEEE International Conference on Computer Vision, ICCV. pp. 5843–5851 (2017)
[7] He, D., Zhou, Z., Gan, C., Li, F., Liu, X., Li, Y., Wang, L., Wen, S.: Stnet: Local and global spatial-temporal modeling for action recognition. In: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI. pp. 8401–8408 (2019)
[8] He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: IEEE International Conference on Computer Vision, ICCV. pp. 2980–2988. IEEE Computer Society (2017)
[9] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 770–778 (2016)
[10] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 7132–7141 (2018)
[11] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Bach, F.R., Blei, D.M. (eds.) Proceedings of the 32nd International Conference on Machine Learning, ICML 2015. JMLR Workshop and Conference Proceedings, vol. 37, pp. 448–456. JMLR.org (2015)
[12] Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. In: Fürnkranz, J., Joachims, T. (eds.) Proceedings of the 27th International Conference on Machine Learning (ICML). pp. 495–502 (2010)
[13] Jiang, B., Wang, M., Gan, W., Wu, W., Yan, J.: STM: spatiotemporal and motion encoding for action recognition. CoRR abs/1908.02486 (2019)
[14] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Li, F.: Large-scale video classification with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 1725–1732 (2014)
[15] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The kinetics human action video dataset. CoRR abs/1705.06950 (2017)
[16] Kläser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3d-gradients. In: Everingham, M., Needham, C.J., Fraile, R. (eds.) Proceedings of the British Machine Vision Conference 2008. pp. 1–10. British Machine Vision Association (2008)
[17] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Bartlett, P.L., Pereira, F.C.N., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. pp. 1106–1114 (2012)
[18] Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 3361–3368. IEEE Computer Society (2011)
[19] Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: IEEE International Conference on Computer Vision, ICCV 2019. pp. 7082–7092 (2019)
[20] Liu, X., Lee, J., Jin, H.: Learning video representations from correspondence proposals. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 4273–4281. Computer Vision Foundation / IEEE (2019)
-  Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3d residual networks. In: IEEE International Conference on Computer Vision, ICCV. pp. 5534–5542 (2017)
-  Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
-  Sadanand, S., Corso, J.J.: Action bank: A high-level representation of activity in video. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012. pp. 1234–1241. IEEE Computer Society (2012)
-  Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014. pp. 568–576 (2014)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015 (2015)
-  Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: IEEE International Conference on Computer Vision, ICCV. pp. 4489–4497 (2015)
-  Tran, D., Wang, H., Feiszli, M., Torresani, L.: Video classification with channel-separated convolutional networks. In: IEEE International Conference on Computer Vision, ICCV 2019. pp. 5551–5560 (2019)
-  Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 6450–6459 (2018)
-  Wang, H., Schmid, C.: Action recognition with improved trajectories. In: IEEE International Conference on Computer Vision, ICCV. pp. 3551–3558 (2013)
-  Wang, L., Li, W., Li, W., Gool, L.V.: Appearance-and-relation networks for video classification. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 1430–1439 (2018)
-  Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.V.: Temporal segment networks: Towards good practices for deep action recognition. In: Computer Vision - ECCV. pp. 20–36 (2016)
-  Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 7794–7803 (2018)
-  Wang, X., Gupta, A.: Videos as space-time region graphs. In: Computer Vision - ECCV. pp. 413–431 (2018)
-  Willems, G., Tuytelaars, T., Gool, L.V.: An efficient dense and scale-invariant spatio-temporal interest point detector. In: Forsyth, D.A., Torr, P.H.S., Zisserman, A. (eds.) Computer Vision - ECCV. Lecture Notes in Computer Science, vol. 5303, pp. 650–663 (2008)
-  Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In: Computer Vision - ECCV. pp. 318–335 (2018)
-  Zhou, B., Andonian, A., Oliva, A., Torralba, A.: Temporal relational reasoning in videos. In: Computer Vision - ECCV. pp. 831–846 (2018)
-  Zolfaghari, M., Singh, K., Brox, T.: ECO: efficient convolutional network for online video understanding. In: Computer Vision - ECCV. pp. 713–730 (2018)
Appendix A
In the supplementary material, we provide additional visualizations of the distributions of the importance map in the local branch and of the video adaptive kernel in the global branch. The 3×1×1 convolution kernels in I3D are also visualized to study their behavior. To probe how kernel learning differs across stages, the visualized kernels are taken from stage4_6b and stage5_3b, respectively. Videos are randomly selected from Kinetics-400 and Sth-Sth V2 to show the diversity across video datasets.
First, as depicted in Fig. 4 and Fig. 5, the distributions of the importance map in the local branch are smoother than those of the kernel in the global branch, and the local branch attends differently to each video when modeling temporal relations. The kernel in the global branch then performs adaptive aggregation to capture the temporal diversity in videos. The visualized I3D kernels allow a direct comparison with the video adaptive kernel in the global branch: the distributions of the I3D kernels are extremely narrow on both Kinetics-400 and Sth-Sth V2. Finally, our learned kernels exhibit clear differences between the two datasets (Kinetics-400 vs. Sth-Sth V2). This is in line with our prior knowledge that there is an obvious domain shift between them: Kinetics-400 mainly focuses on appearance, whereas Sth-Sth V2 is a motion-dominated dataset. This difference cannot be easily inferred from the I3D kernels, because their overall distributions differ only slightly between the two datasets.
The diversity of our learned kernels indicates that videos do exhibit diverse temporal dynamics, and that it is reasonable to learn spatiotemporal representations with an adaptive scheme.
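The contrast between a kernel shared by all videos and a kernel generated per video can also be expressed numerically. The following is a minimal sketch, not the procedure used to produce the figures: the function name `across_video_variation`, the array shapes, and the randomly generated kernels are all assumptions made for illustration. It measures how much kernel weights vary from one video to another.

```python
import numpy as np

def across_video_variation(kernels):
    """Mean, over (channel, tap), of the standard deviation across videos.

    kernels: array of shape (num_videos, channels, taps).
    A value of zero means every video uses the same kernel.
    """
    kernels = np.asarray(kernels, dtype=np.float64)
    return float(kernels.std(axis=0).mean())

# Hypothetical sizes, chosen only for illustration.
num_videos, channels, taps = 8, 16, 3
rng = np.random.default_rng(0)

# Stand-in for video adaptive kernels: a different kernel for each video.
adaptive = rng.normal(size=(num_videos, channels, taps))

# Stand-in for a fixed temporal kernel: one kernel repeated for every video.
shared = rng.normal(size=(channels, taps))
fixed = np.broadcast_to(shared, (num_videos, channels, taps))

print("adaptive:", across_video_variation(adaptive))
print("fixed:   ", across_video_variation(fixed))
```

With real models, `adaptive` would be replaced by kernels collected from the global branch over a set of input videos, while a video invariant kernel would yield a variation of zero by construction.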