Action recognition is a fundamental problem in video understanding, and it is in increasing demand in video-based applications such as intelligent surveillance, autonomous driving, personal recommendation, and entertainment. Although visual appearance (and its context) is important for action recognition, it is also important to model the temporal structure. Temporal modeling is normally considered at two scales: 1) short-range motion between adjacent frames and 2) long-range temporal aggregation over large time spans. Many works consider one or both of these aspects, especially in the current era of deep CNNs [23, 33, 49, 6, 38, 48, 36, 2, 1, 41, 52, 51, 29, 43, 31, 26, 22]. Nevertheless, they still leave some gaps, and the problem is far from solved: it remains unclear how to model temporal structure with significant variations and complexities both effectively and efficiently.
For short-range motion encoding, most existing methods [33, 44] first extract hand-crafted optical flow, which is then fed into a 2D CNN-based two-stream framework for action recognition. Such a two-stream architecture processes RGB images and optical flow in separate streams. Computing optical flow is time-consuming and storage-demanding. Moreover, the spatial and temporal features are learned in isolation, and fusion is performed only at the late layers. To address these issues, we propose a motion excitation (ME) module. Instead of adopting pixel-level optical flow as an additional input modality and training the temporal stream separately from the spatial stream, our module integrates motion modeling into the whole spatiotemporal feature learning approach. Concretely, feature-level motion representations are first calculated between adjacent frames. These motion features are then used to produce modulation weights. Finally, the motion-sensitive information in the original frame features is excited with these weights. In this way, the network is forced to discover and enhance the informative temporal features that capture differentiated information.
For long-range temporal aggregation, existing methods follow one of two strategies. 1) They adopt 2D CNN backbones to extract frame-wise features and then apply a simple temporal max/average pooling to obtain the whole-video representation [44, 11]. Such a simple summarization strategy, however, results in temporal information loss or confusion. 2) They adopt local 3D/(2+1)D convolutions to process a local temporal window [38, 3], so that the long-range temporal relationship is modeled only indirectly by repeatedly stacking local convolutions in deep networks. However, repeating a large number of local operations leads to optimization difficulty, as the message needs to be propagated through a long path between distant frames. To tackle this problem, we introduce a multiple temporal aggregation (MTA) module. The MTA module also adopts (2+1)D convolutions, but in MTA the 1D temporal convolution is replaced by a group of sub-convolutions. The sub-convolutions form a hierarchical structure with residual connections between adjacent subsets. When spatiotemporal features pass through the module, they exchange information with neighboring frames multiple times, and the equivalent temporal receptive field is thus enlarged multiple times to model long-range temporal dynamics.
The proposed ME module and MTA module are inserted into a standard ResNet block [15, 16] to build the Temporal Excitation and Aggregation (TEA) block, and the entire network is constructed by stacking multiple blocks. The resulting model is efficient: thanks to the light-weight configuration, the FLOPs of the TEA network are kept at a low level (only 1.06× those of 2D ResNet). The proposed model is also effective: the two components of TEA are complementary and cooperate in endowing the network with both short- and long-range temporal modeling abilities. To summarize, the main contributions of our method are three-fold:
1. The motion excitation (ME) module to integrate the short-range motion modeling with the whole spatiotemporal feature learning approach.
2. The multiple temporal aggregation (MTA) module to efficiently enlarge the temporal receptive field for long-range temporal modeling.
3. Both proposed modules are simple and light-weight, and can be easily integrated into the standard ResNet block to cooperate for effective and efficient temporal modeling.
2 Related Works
With the tremendous success of deep learning methods on image-based recognition tasks [24, 34, 37, 15, 16], researchers started to explore the application of deep networks to the video action recognition task [23, 33, 38, 49, 6, 48]. Among them, Karpathy et al. proposed to apply a single 2D CNN model on each frame of a video independently and explored several strategies to fuse temporal information. However, the method does not consider the motion change between frames, and the final performance is inferior to hand-crafted feature-based algorithms. Donahue et al. used LSTM to model the temporal relation by aggregating 2D CNN features. In this approach, the feature extraction of each frame is isolated, and only high-level 2D CNN features are considered for temporal relation learning.
Existing methods usually follow one of two approaches to improve temporal modeling ability. The first is based on the two-stream architecture proposed by Simonyan and Zisserman. The architecture contains a spatial 2D CNN that learns still features from frames and a temporal 2D CNN that models motion information in the form of optical flow. The two streams are trained separately, and the final predictions for videos are averaged over the two streams. Many subsequent works have extended this framework. [9, 8] explored different mid-level combination strategies to fuse the features of the two streams. TSN proposed a sparse sampling strategy to capture long-range video clips. All these methods require additional computation and storage to deal with optical flow. Moreover, the interactions between different frames and between the two modalities are limited and usually occur only at late layers. In contrast, our proposed method discards optical flow extraction and learns approximate feature-level motion representations by calculating temporal differences. The motion encoding can be integrated with the learning of spatiotemporal features and used to discover and enhance their motion-sensitive ingredients.
The most recent work, STM, also attempts to model feature-level motion and inserts motion modeling into spatiotemporal feature learning. Our method differs from STM in that STM directly adds the spatiotemporal features and the motion encoding together, whereas our method uses motion features to recalibrate the features and enhance the motion pattern.
Another typical video action recognition approach is based on 3D CNNs and their (2+1)D CNN variants [38, 36, 3, 40, 46]. The first work in this line was C3D, which performed 3D convolutions on adjacent frames to jointly model spatial and temporal features in a unified approach. To make use of pre-trained 2D CNNs, Carreira and Zisserman proposed I3D, which inflates pre-trained 2D convolutions into 3D ones. To reduce the heavy computation of 3D CNNs, some works decompose the 3D convolution into a 2D spatial convolution and a 1D temporal convolution [36, 5, 27, 14, 31, 39] or use a mixture of 2D and 3D CNNs [40, 47, 54]. In these methods, long-range temporal connections can theoretically be established by stacking multiple local temporal convolutions. However, after a large number of local convolution operations, the useful features from distant frames have already been weakened and cannot be captured well. To address this issue, T3D adopted a densely connected structure and combined different temporal windows. The non-local module and StNet applied self-attention mechanisms to model long-range temporal relationships. These attempts come with either additional parameters or time-consuming operations. In contrast, our proposed multiple temporal aggregation module is simple and efficient and introduces no extra operators.
3 Our Method
The framework of the proposed method is illustrated in Figure 1. Input videos of variable length are sampled using the sparse temporal sampling strategy proposed by TSN. First, each video is evenly divided into $T$ segments. Then one frame is randomly selected from each segment to form an input sequence of $T$ frames. For spatiotemporal modeling, our model is based on the 2D CNN ResNet and is constructed by stacking multiple Temporal Excitation and Aggregation (TEA) blocks. The TEA block contains a motion excitation (ME) module to excite motion patterns and a multiple temporal aggregation (MTA) module to establish long-range temporal relationships. Following previous methods [44, 27], a simple temporal average pooling is used at the end of the model to average the predictions of all frames.
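The sparse sampling step described above can be illustrated with a minimal sketch. The function name `sparse_sample` and its arguments are hypothetical and chosen only for illustration; the sketch assumes the video has at least as many frames as segments and that segment boundaries are obtained by integer division.

```python
import random


def sparse_sample(num_frames, num_segments, rng=None):
    """Pick one random frame index from each of `num_segments` equal segments."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    # Segment i covers the half-open index range [bounds[i], bounds[i + 1]).
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


# Example: sample 8 frame indices from a 100-frame video.
indices = sparse_sample(num_frames=100, num_segments=8, rng=random.Random(0))
print(len(indices))  # 8
```

Because exactly one index is drawn per segment, the sampled sequence always spans the whole video regardless of its length.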
3.1 Motion Excitation (ME) Module
Motion measures the content displacement between two successive frames and mainly reflects the actual actions. Many previous works use motion representations for action recognition [44, 3]. Still, most of them only consider pixel-level motion patterns in the form of optical flow and separate the learning of motion from that of spatiotemporal features. In contrast, in the proposed motion excitation (ME) module, motion modeling is extended from the raw pixel level to a broader feature level, such that motion modeling and spatiotemporal feature learning are incorporated into a unified framework.
The architecture of the ME module is shown in the left panel of Figure 2. The shape of the input spatiotemporal feature $\mathbf{X}$ is $[N, T, C, H, W]$, where $N$ is the batch size, $T$ and $C$ denote the temporal dimension and the number of feature channels, respectively, and $H$ and $W$ correspond to the spatial shape. The intuition of the proposed ME module is that, among all feature channels, different channels capture distinct information. A portion of the channels tends to model static information related to background scenes; other channels mainly focus on dynamic motion patterns describing the temporal difference. For action recognition, it is beneficial to enable the model to discover and then enhance these motion-sensitive channels.
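Given an input feature $\mathbf{X}$, a 1×1 2D convolution layer is first adopted to reduce the feature channels for efficiency:

$$\mathbf{X}^r = \mathrm{conv}_{red} * \mathbf{X}, \quad \mathbf{X}^r \in \mathbb{R}^{N \times T \times C/r \times H \times W}$$

where $\mathbf{X}^r$ denotes the channel-reduced feature, $*$ indicates the convolution operation, and $r$ is the reduction ratio.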
The feature-level motion representation at time step $t$ is approximated as the difference between the two adjacent frames, $\mathbf{X}^r(t)$ and $\mathbf{X}^r(t+1)$. Instead of directly subtracting the original features, we first perform a channel-wise transformation on the features and then use the transformed feature to calculate motion. Formally,

$$\mathbf{M}(t) = \mathrm{conv}_{trans} * \mathbf{X}^r(t+1) - \mathbf{X}^r(t), \quad 1 \le t \le T-1$$

where $\mathbf{M}(t) \in \mathbb{R}^{N \times C/r \times H \times W}$ is the motion feature at time $t$, and $\mathrm{conv}_{trans}$ is a 3×3 2D channel-wise convolution layer performing a transformation for each channel.
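We set the motion feature at the last time step to zero, i.e., $\mathbf{M}(T) = 0$, and construct the final motion matrix $\mathbf{M}$ by concatenating all the motion features $[\mathbf{M}(1), \ldots, \mathbf{M}(T)]$. Then a global average pooling layer is used to summarize the spatial information, since our goal is to excite the motion-sensitive channels, for which the detailed spatial layout is of no great importance:

$$\mathbf{M}^s = \mathrm{Pool}(\mathbf{M}), \quad \mathbf{M}^s \in \mathbb{R}^{N \times T \times C/r \times 1 \times 1}$$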
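Another 1×1 2D convolution layer $\mathrm{conv}_{exp}$ is used to expand the channel dimension of the motion features back to the original channel dimension $C$, and the motion-attentive weights $\mathbf{A}$ are obtained by using the sigmoid function:

$$\mathbf{A} = 2\delta(\mathrm{conv}_{exp} * \mathbf{M}^s) - 1, \quad \mathbf{A} \in \mathbb{R}^{N \times T \times C \times 1 \times 1}$$

where $\delta$ indicates the sigmoid function.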
Finally, the goal of the module is to excite the motion-sensitive channels; thus, a simple way is to conduct channel-wise multiplication between the input feature $\mathbf{X}$ and the attentive weights $\mathbf{A}$. However, such an approach suppresses the static background scene information, which is also beneficial for action recognition. To address this issue, in the proposed motion-based excitation module, we adopt a residual connection to enhance motion information while preserving scene information:

$$\mathbf{X}^o = \mathbf{X} + \mathbf{X} \odot \mathbf{A}, \quad \mathbf{X}^o \in \mathbb{R}^{N \times T \times C \times H \times W}$$

where $\mathbf{X}^o$ is the output of the proposed module, in which the motion pattern has been excited and enhanced, and $\odot$ indicates channel-wise multiplication.
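The steps of the ME module can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and weight names (`motion_excitation`, `w_red`, `w_trans`, `w_exp`) are hypothetical, biases and normalization layers are omitted, the reduction ratio of 4 is chosen only for the example, and the sigmoid output is rescaled to the range (-1, 1), which should be treated as an assumption.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def depthwise_conv3x3(x, kernels):
    """Channel-wise 3x3 convolution with zero padding.

    x: [B, C, H, W]; kernels: [C, 3, 3], one filter per channel.
    """
    _, _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            weight = kernels[:, dy, dx].reshape(1, -1, 1, 1)
            out += padded[:, :, dy:dy + h, dx:dx + w] * weight
    return out


def motion_excitation(x, w_red, w_trans, w_exp):
    """x: [N, T, C, H, W]; w_red: [C/r, C]; w_trans: [C/r, 3, 3]; w_exp: [C, C/r]."""
    n, t, _, h, w = x.shape
    # A 1x1 convolution is the same linear map over channels at every pixel.
    x_r = np.einsum("dc,ntchw->ntdhw", w_red, x)
    c_r = x_r.shape[2]
    # Transform frames 2..T channel-wise, then subtract frames 1..T-1.
    nxt = depthwise_conv3x3(x_r[:, 1:].reshape(-1, c_r, h, w), w_trans)
    motion = nxt.reshape(n, t - 1, c_r, h, w) - x_r[:, :-1]
    # The motion feature of the last time step is defined as zero.
    motion = np.concatenate([motion, np.zeros((n, 1, c_r, h, w))], axis=1)
    # Global average pooling over the spatial dimensions.
    pooled = motion.mean(axis=(3, 4), keepdims=True)
    # Expand channels back to C and squash to (-1, 1) (assumed rescaling).
    attn = 2.0 * sigmoid(np.einsum("cd,ntdhw->ntchw", w_exp, pooled)) - 1.0
    # Residual excitation keeps the static (background) information.
    return x + x * attn


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 16, 7, 7))      # N=2, T=8, C=16, H=W=7
w_red = rng.standard_normal((4, 16)) * 0.1     # reduction ratio r=4 (illustrative)
w_trans = rng.standard_normal((4, 3, 3)) * 0.1
w_exp = rng.standard_normal((16, 4)) * 0.1
out = motion_excitation(x, w_red, w_trans, w_exp)
print(out.shape)  # (2, 8, 16, 7, 7)
```

With this formulation, a perfectly static clip combined with an identity transformation produces zero motion features, so the attention weights vanish and the input passes through unchanged.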
3.1.1 Discussion with SENet
The excitation scheme was first proposed by SENet [19, 18] for image recognition tasks. We highlight three differences from SENet. 1) SENet is designed for image-based tasks. When SENet is applied to spatiotemporal features, it processes each frame of a video independently without considering temporal information. 2) SENet is a kind of self-gating mechanism, and the obtained modulation weights are used to enhance the informative channels of the feature $\mathbf{X}$, whereas our module aims to enhance the motion-sensitive ingredients of the feature. 3) Useless channels are completely suppressed in SENet, but the static background information can be preserved in our module through the residual connection.
3.2 Multiple Temporal Aggregation (MTA) Module
Previous action recognition methods [38, 36] typically adopt a local temporal convolution to process a few neighboring frames at a time, so the long-range temporal structure can be modeled only in deep networks with a large number of stacked local operations. This is ineffective, since the optimization message delivered from distant frames is dramatically weakened and cannot be handled well. To address this issue, we propose the multiple temporal aggregation (MTA) module for effective long-range temporal modeling. The MTA module is inspired by Res2Net, in which the spatiotemporal features and the corresponding local convolution layers are split into a group of subsets. This approach is efficient since it introduces neither additional parameters nor time-consuming operations. In the module, the subsets are organized as a hierarchical residual architecture such that a series of sub-convolutions is successively applied to the features, which accordingly enlarges the equivalent receptive field of the temporal dimension.
As shown in the upper-right corner of Figure 2, given an input feature $\mathbf{X}$, a typical approach is to process it with a single local temporal convolution followed by a spatial convolution. In contrast, we split the feature into four fragments along the channel dimension, so the shape of each fragment becomes $[N, T, C/4, H, W]$. The local convolutions are likewise divided into multiple sub-convolutions. The last three fragments are sequentially processed with one channel-wise temporal sub-convolution layer and one spatial sub-convolution layer, each of which has only 1/4 of the parameters of the original layer. Moreover, a residual connection is added between each two adjacent fragments, which transforms the module from a parallel architecture into a hierarchical cascade. Formally (the necessary reshape and permutation operations are omitted for simplicity; in fact, to conduct the 1D temporal convolution on an input feature $\mathbf{X}$, it needs to be reshaped from $[N, T, C, H, W]$ to $[NHW, C, T]$),

$$\mathbf{X}_i^o = \begin{cases} \mathbf{X}_i, & i = 1 \\ \mathrm{conv}_{spa} * (\mathrm{conv}_{temp} * \mathbf{X}_i), & i = 2 \\ \mathrm{conv}_{spa} * (\mathrm{conv}_{temp} * (\mathbf{X}_i + \mathbf{X}_{i-1}^o)), & i = 3, 4 \end{cases}$$

where $\mathbf{X}_i^o \in \mathbb{R}^{N \times T \times C/4 \times H \times W}$ is the output of the $i$-th fragment, $\mathrm{conv}_{temp}$ denotes the 1D channel-wise temporal sub-convolution whose kernel size is 3, and $\mathrm{conv}_{spa}$ indicates the 3×3 2D spatial sub-convolution.
In this module, different fragments have different receptive fields. For example, the output of the first fragment $\mathbf{X}_1^o$ is the same as the input fragment $\mathbf{X}_1$; thus, its receptive field is 1×1×1. By aggregating information from the former fragments in series, the equivalent receptive field of the last fragment is enlarged three times. Finally, a simple concatenation strategy is adopted to combine the multiple outputs:

$$\mathbf{X}^o = [\mathbf{X}_1^o; \mathbf{X}_2^o; \mathbf{X}_3^o; \mathbf{X}_4^o], \quad \mathbf{X}^o \in \mathbb{R}^{N \times T \times C \times H \times W}$$
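The growth of the temporal receptive field across fragments can be worked out with a short calculation. The sketch below assumes stride 1 and no dilation, so each additional kernel-size-3 temporal sub-convolution in the cascade adds two frames to the receptive field; the function name is hypothetical.

```python
def temporal_receptive_fields(num_fragments=4, kernel_size=3):
    """Temporal receptive field (in frames) of each fragment's output."""
    fields = [1]  # the first fragment is passed through unchanged
    for _ in range(num_fragments - 1):
        # Each later fragment applies one more temporal convolution on top
        # of the previous fragment's output (stride 1, no dilation assumed).
        fields.append(fields[-1] + (kernel_size - 1))
    return fields


print(temporal_receptive_fields())  # [1, 3, 5, 7]
```

Under these assumptions the last fragment sees 7 frames, compared with 3 frames for a single local temporal convolution.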
The obtained output feature $\mathbf{X}^o$ involves spatiotemporal representations capturing different temporal ranges. It is superior to the local temporal representations obtained by using a single local convolution in typical approaches.
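A minimal NumPy sketch of the MTA forward pass is given below. It is an illustration under assumptions rather than the authors' code: all names are hypothetical, biases, normalization, and activations are omitted, zero padding is used, and the second fragment is processed without a residual input (following the Res2Net pattern), with the residual sum starting from the third fragment.

```python
import numpy as np


def temporal_conv3(x, kernels):
    """Channel-wise 1D temporal convolution, kernel size 3, zero padding.

    x: [N, T, C, H, W]; kernels: [C, 3].
    """
    t = x.shape[1]
    padded = np.pad(x, ((0, 0), (1, 1), (0, 0), (0, 0), (0, 0)))
    out = np.zeros_like(x)
    for k in range(3):
        out += padded[:, k:k + t] * kernels[:, k].reshape(1, 1, -1, 1, 1)
    return out


def spatial_conv3x3(x, weight):
    """3x3 spatial convolution applied to every frame, zero padding.

    x: [N, T, C_in, H, W]; weight: [C_out, C_in, 3, 3].
    """
    h, w = x.shape[3:]
    padded = np.pad(x, ((0, 0), (0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros(x.shape[:2] + (weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, :, :, dy:dy + h, dx:dx + w]
            out += np.einsum("oc,ntchw->ntohw", weight[:, :, dy, dx], patch)
    return out


def multiple_temporal_aggregation(x, temp_kernels, spa_weights):
    """x: [N, T, C, H, W]; parameter lists hold one entry for each of fragments 2-4."""
    frags = np.split(x, 4, axis=2)      # four fragments of C/4 channels
    outs = [frags[0]]                   # fragment 1 is passed through
    prev = None
    for i in range(1, 4):
        # Fragments 3 and 4 also receive the previous fragment's output.
        inp = frags[i] if prev is None else frags[i] + prev
        prev = spatial_conv3x3(temporal_conv3(inp, temp_kernels[i - 1]),
                               spa_weights[i - 1])
        outs.append(prev)
    return np.concatenate(outs, axis=2)


rng = np.random.default_rng(0)
x = rng.standard_normal((1, 12, 16, 5, 5))   # C=16, so each fragment has 4 channels
temp_kernels = [rng.standard_normal((4, 3)) for _ in range(3)]
spa_weights = [rng.standard_normal((4, 4, 3, 3)) for _ in range(3)]
out = multiple_temporal_aggregation(x, temp_kernels, spa_weights)
print(out.shape)  # (1, 12, 16, 5, 5)
```

Perturbing a single input frame and comparing outputs is a simple way to observe that later fragments respond over a wider range of frames than earlier ones.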
3.3 Integration with ResNet Block
Finally, we describe how to integrate the proposed modules into the standard ResNet block to construct our temporal excitation and aggregation (TEA) block. The approach is illustrated in Figure 3. For computational efficiency, the motion excitation (ME) module is integrated into the residual path after the bottleneck layer (the first 1×1 Conv layer). The multiple temporal aggregation (MTA) module is used to replace the original 3×3 Conv layer in the residual path. The action recognition network can be constructed by stacking the TEA blocks.
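The ordering of operations inside a TEA block can be sketched as a simple composition. This is a schematic illustration only: the function name and its callable arguments are hypothetical placeholders, normalization and activation layers are omitted, and the final 1×1 convolution and identity shortcut are assumed from the standard ResNet bottleneck design.

```python
def tea_block(x, conv_reduce, motion_excitation, mta, conv_expand):
    """Compose the residual path of a TEA block and add the identity shortcut."""
    y = conv_reduce(x)          # first 1x1 convolution (bottleneck)
    y = motion_excitation(y)    # ME module inserted after the bottleneck
    y = mta(y)                  # MTA module in place of the 3x3 convolution
    y = conv_expand(y)          # last 1x1 convolution restores the channels
    return x + y                # identity shortcut of the ResNet block


# Toy usage with scalar stand-ins for the four sub-modules.
identity = lambda v: v
print(tea_block(1.0, identity, identity, identity, identity))  # 2.0
```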
4 Experiments
4.1 Datasets
The proposed approach is evaluated on two large-scale action recognition datasets, Something-Something V1 and Kinetics400, and two small-scale datasets, HMDB51 and UCF101. As pointed out in [47, 53], most of the categories in Kinetics, HMDB, and UCF can be recognized by considering the background scene information only, and temporal understanding is not very important in most cases. In contrast, the categories of Something-Something focus on human interactions with daily-life objects, for example, “pull something” and “push something”. Classifying these interactions requires more consideration of temporal information. Thus the proposed method is mainly evaluated on Something-Something, since our goal is to improve temporal modeling ability.
Kinetics contains 400 categories and provides download URL links for 240k training videos and 20k validation videos. In our experiments, we successfully collect 223,127 training videos and 18,153 validation videos, because a small fraction of the URLs (around 10%) is no longer valid. For the Kinetics dataset, the methods are learned on the training set and evaluated on the validation set. HMDB contains 51 classes and 6,766 videos, while UCF includes 101 categories with 13,320 videos. For these two datasets, we follow TSN  to utilize three different training/testing splits for evaluation, and the average results are reported.
Something-Something V1 includes 174 categories with 86,017 training videos, 11,522 validation videos, and 10,960 test videos. All of them have been split into individual frames at the same rate, and the extracted frames are also publicly available. The methods are learned on the training set and measured on the validation set and test set.
4.2 Implementation Details
We use 2D ResNet-50 as the backbone and replace each ResNet block with the TEA block from conv2 to conv5. The sparse sampling strategy is used to extract $T$ frames from the video clips ($T = 8$ or 16 in our experiments). During training, random scaling and corner cropping are used for data augmentation, and the cropped region is resized to 224×224 for each frame (more training details can be found in the supplementary materials).
At test time, two evaluation protocols are considered to trade off accuracy and speed. 1) Efficient protocol (center crop × 1 clip), in which 1 clip with $T$ frames is sampled from the video. Each frame is resized to 256×256, and a central region of size 224×224 is cropped for action prediction. 2) Accuracy protocol (full resolution × 10 clips), in which 10 different clips are randomly sampled from the video, and the final prediction is obtained by averaging the scores of all clips. For each frame in a video clip, following prior work, we resize the shorter side to 256 while maintaining the aspect ratio. Then 3 crops of 256×256 that cover the full frame are sampled for action prediction.
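The cropping and score-averaging steps of the two protocols can be sketched as follows. The helper names are hypothetical, boxes are given as (left, top, right, bottom) in pixels, and the three crops are assumed to be spread evenly along the longer side of a frame whose shorter side has already been resized to the crop size.

```python
def center_crop_box(width, height, size=224):
    """Box of a centered size x size crop."""
    left = (width - size) // 2
    top = (height - size) // 2
    return (left, top, left + size, top + size)


def three_crop_boxes(width, height, size=256):
    """Three size x size crops spread along the longer side of the frame."""
    if min(width, height) != size:
        raise ValueError("shorter side must already equal the crop size")
    span = max(width, height) - size
    offsets = [0, span // 2, span]
    if width >= height:
        return [(o, 0, o + size, size) for o in offsets]
    return [(0, o, size, o + size) for o in offsets]


def average_scores(score_lists):
    """Average per-class scores over clips (or crops)."""
    count = len(score_lists)
    return [sum(column) / count for column in zip(*score_lists)]


print(center_crop_box(256, 256))   # (16, 16, 240, 240)
print(three_crop_boxes(340, 256))  # [(0, 0, 256, 256), (42, 0, 298, 256), (84, 0, 340, 256)]
```

Note that three crops cover the full frame only when the longer side is at most three times the crop size, which holds for common video aspect ratios.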
4.3 Experimental Results
4.3.1 Ablation Study
In this section, we first conduct several ablation experiments to verify the effectiveness of the different components of our proposed TEA block. Without loss of generality, the models are trained with 8 frames on the Something-Something V1 training set and evaluated on the validation set. Six baseline networks are considered for comparison, and their corresponding blocks are illustrated in Figure 4. The comparison results, including the classification accuracies and inference protocols, are shown in Table 1.
(2+1)D ResNet. In the residual branch of the standard ResNet block, a 1D channel-wise temporal convolution is inserted after the first 2D spatial convolution.
(2+1)D Res2Net. The channel-wise temporal convolution is integrated into the Res2Net block. In Res2Net, the 3×3 spatial convolution of the ResNet block is reformulated as a group of sub-convolutions.
Multiple Temporal Aggregation (MTA). The motion excitation module is removed from the proposed TEA network.
Motion Excitation (ME). Compared with the (2+1)D ResNet baseline, the proposed motion excitation module is added to the residual path.
ME w/o Residual. The residual connection is removed from the ME baseline. Thus the output feature is obtained by directly multiplying the input feature with the motion-sensitive weights, i.e., $\mathbf{X}^o = \mathbf{X} \odot \mathbf{A}$.
Effect of Multiple Temporal Aggregation.
First, it can be seen from the first compartment of Table 1 that the MTA baseline outperforms the (2+1)D ResNet baseline by a large margin (47.5% vs. 46.0%). Compared with the (2+1)D ResNet baseline, the MTA module builds a stronger long-range temporal aggregation by using the hierarchical structure to enlarge the equivalent receptive field of the temporal dimension in each block, which results in the performance improvement.
Moreover, since the proposed MTA module enlarges both the spatial and temporal receptive fields, it is necessary to ascertain the independent impact of the two aspects. To this end, we compare the (2+1)D ResNet baseline with the (2+1)D Res2Net baseline. In (2+1)D Res2Net, the group of sub-convolutions is applied to the spatial dimension only, and the equivalent receptive field of the temporal dimension is unchanged. We can see that the accuracies of the two baselines are similar and both inferior to that of MTA (46.0%/46.2% vs. 47.5%). This suggests that exploring complicated spatial structures and sophisticated spatial representations has, to some extent, limited impact on the action recognition task. The key to improving action recognition performance is capable and reliable temporal modeling.
Table 1: Comparison results on Something-Something V1.
|Method||Frames×Crops×Clips||Top-1 (%)||Top-5 (%)|
|(2+1)D ResNet (a)||8×1×1||46.0||75.3|
|(2+1)D Res2Net (b)||8×1×1||46.2||75.5|
|(2+1)D SENet (e)||8×1×1||46.5||75.6|
|ME w/o Residual (f)||8×1×1||47.2||76.1|
1. XX (y). XX indicates the XX baseline, and y represents that the architecture of the corresponding block is the y-th one in Figure 4.
2. The result of STM using efficient inference protocol is cited from Table 9 in .
Table 2: Comparison with the state-of-the-art methods on Something-Something V1.
|Method||Backbone||Frames×Crops×Clips||FLOPs||Pre-train||Val Top-1 (%)||Val Top-5 (%)||Test Top-1 (%)|
|I3D-RGB||3D ResNet50||32×3×2||153G×3×2||ImgNet + K400||41.6||72.2||-|
|NL I3D-RGB||3D ResNet50|| ||168G×3×2|| ||44.4||76.0||-|
|NL I3D+GCN-RGB||3D ResNet50+GCN|| ||303G×3×2|| ||46.1||76.8||45.0|
|ECO-RGB||BNIncep+3D Res18||8×1×1||32G×1×1||K400||39.6||-||-|
| || ||92 + 92||N/A|| ||49.5||-||43.9|
|TSM-RGB||ResNet50||8×1×1||33G×1×1||ImgNet + K400||43.4||73.2||-|
|TSM-RGB|| ||8 + 16||33G + 65G|| ||46.8||76.1||-|
|TSM-(RGB+Flow)|| ||16 + 16||N/A|| ||50.2||79.5||47.0|
2. “N/A” represents that the FLOPs cannot be accurately measured because of extracting optical flow.
Effect of Motion Modeling.
To verify the effectiveness of motion modeling for action recognition, we compare the ME baseline with the (2+1)D ResNet baseline. In the second compartment of Table 1, we can see that action recognition performance is significantly increased by considering the motion encoding (48.1% vs. 46.0%). The discovery of motion-sensitive features forces the network to focus on dynamic information that reflects the actual actions.
To prove that such improvement is not brought by introducing extra parameters and soft attention mechanisms, we then compare the (2+1)D SENet baseline with the (2+1)D ResNet baseline. (2+1)D SENet adds the SE block at the start of the trunk path, aiming to excite the informative feature channels. However, the SE block is applied to each frame of videos independently, and the temporal information is not considered in this approach. Thus, the performance of the (2+1)D SENet baseline is similar to the (2+1)D ResNet baseline (46.5% vs. 46.0%). The improvement is quite limited.
Finally, we explore several designs for motion modeling. We first compare the ME baseline with the ME w/o Residual baseline. It can be seen that the performance decreases from 48.1% to 47.2% without the residual connection, since the static information related to background scenes is eliminated in ME w/o Residual. This shows that scene information is also beneficial for action recognition and that the residual connection is necessary for the motion excitation module. Then we compare the ME baseline with STM. We can see that ME attains higher accuracy than STM (48.4% vs. 47.5%), which verifies that the excitation mechanism used in the proposed method is superior to the simple addition used in STM. When the long-range temporal relationship is additionally considered by introducing the MTA module, the accuracy of our method (TEA) is further improved to 48.9%.
4.3.2 Comparisons with the State-of-the-arts
In this section, we first compare TEA with the existing state-of-the-art action recognition methods on Something-Something V1 and Kinetics400. The comprehensive statistics, including the classification results, inference protocols, and the corresponding FLOPs, are shown in Tables 2 and 3.
In both tables, the first compartment contains the methods based on 3D CNNs or a mixture of 2D and 3D CNNs, and the methods in the second compartment are all based on 2D or (2+1)D CNNs. Due to the high computation cost of 3D CNNs, the FLOPs of the methods in the first compartment are typically higher than those of the others. Among all existing methods, the most efficient ones are TSN and TSM, with only 33G FLOPs. Compared with these methods, the FLOPs of our proposed TEA network increase slightly to 35G (1.06×), but the performance is increased by a large margin, an absolute improvement of 5.4 percentage points (48.8% vs. 43.4%).
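The arithmetic behind this comparison can be checked directly. The short sketch below uses only the numbers quoted in the paragraph above; the variable names are illustrative.

```python
tea_flops, tsm_flops = 35.0, 33.0   # GFLOPs, efficient protocol with 8 frames
tea_top1, tsm_top1 = 48.8, 43.4     # Top-1 accuracy (%) on Something-Something V1

flops_ratio = tea_flops / tsm_flops                # relative compute cost
absolute_gain = tea_top1 - tsm_top1                # in percentage points
relative_gain = absolute_gain / tsm_top1 * 100.0   # in percent of the TSM accuracy

print(round(flops_ratio, 2))    # 1.06
print(round(absolute_gain, 1))  # 5.4
print(round(relative_gain, 1))  # 12.4
```

The 5.4 figure is therefore a difference in percentage points; expressed relative to the TSM accuracy, the gain is about 12.4%.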
Table 3: Comparison with the state-of-the-art methods on Kinetics400.
|Method||Backbone||Frames×Crops×Clips||FLOPs||Pre-train||Top-1 (%)||Top-5 (%)|
|I3D-RGB||Inception V1||64×N/A×N/A||108G×N/A×N/A||ImgNet||72.1||90.3|
|NL I3D-RGB|| || || || || || |
|NL SlowFast|| || || || || || |
|TSN-RGB||Inception v3|| ||80G×10×1|| ||72.5||90.2|
2. “N/A” represents that the authors do not report the inference protocol in their paper.
The superiority of TEA on Something-Something is quite impressive and confirms its remarkable ability for temporal modeling. Using the efficient inference protocol (center crop × 1 clip) and 8 input frames, the proposed TEA obtains 48.8%, which significantly outperforms TSN and TSM with similar FLOPs (19.7%/43.4%). This result even exceeds the ensemble result of TSM, which combines two models using 8 and 16 frames, respectively (TSM, 46.8%). When using 16 frames as input and applying the more laborious accuracy evaluation protocol (full resolution × 10 clips), the FLOPs of our method increase to 2000G, which is similar to NL I3D+GCN. However, the proposed method significantly surpasses NL I3D+GCN and all other existing methods (52.3% vs. 46.1%) on the validation set. Our performance on the test set (46.6%) also outperforms most of the existing methods. Moreover, we do not require additional COCO images to pre-train an object detector. When compared with the methods using both RGB and optical flow modalities, i.e., ECO-(RGB+Flow) (49.5%) and TSM-(RGB+Flow) (50.2%), the obtained result (52.3%) also shows substantial improvements.
On Kinetics400, the performance of our method (76.1%) is inferior to that of SlowFast (79.8%). However, the SlowFast networks adopt deeper networks (ResNet101) based on 3D CNNs and use time-consuming non-local operations. When comparing methods with similar efficiency, such as TSM and STM, TEA obtains better performance. With 8 input frames, TEA is about 1 percentage point more accurate than TSM (75.0% vs. 74.1%). With 16 input frames, TEA outperforms both TSM and STM by a large margin (76.1% vs. 74.7%/73.7%).
Finally, we report comparison results on HMDB51 and UCF101 in Table 4. Our method achieves 73.3% on HMDB51 and 96.9% on UCF101 with the accuracy inference protocol. Our model (TEA) outperforms most of the existing methods except for I3D. I3D is based on 3D CNNs and an additional input modality; thus, its computational cost in FLOPs is far higher than ours.
Table 4: Comparison results on HMDB51 and UCF101.
|Method||Backbone||HMDB51 MCA (%)||UCF101 MCA (%)|
|I3D-(RGB+Flow)||3D Inception||80.7||98.0|
1. MCA denotes mean class accuracy.
2. TSM does not report MCA results, and the listed results are cited from STM.
5 Conclusion
In this paper, we propose the Temporal Excitation and Aggregation (TEA) block, which includes the motion excitation (ME) module and the multiple temporal aggregation (MTA) module for both short- and long-range temporal modeling. Specifically, the ME module inserts motion encoding into the spatiotemporal feature learning approach and enhances the motion pattern in spatiotemporal features. In the MTA module, a reliable long-range temporal relationship is established by reformulating the local convolutions as a group of sub-convolutions to enlarge the equivalent temporal receptive field. The two proposed modules are integrated into the standard ResNet block and cooperate for capable temporal modeling.
Acknowledgments
This work is supported by the Video Understanding Middle Platform of the Platform and Content Group (PCG) at Tencent. The authors would like to thank Wei Shen for his helpful suggestions.
-  Hakan Bilen, Basura Fernando, Efstratios Gavves, and Andrea Vedaldi. Action recognition with dynamic image networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2799–2813, 2017.
-  Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, and Stephen Gould. Dynamic image networks for action recognition. In CVPR, pages 3034–3042, 2016.
-  Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 6299–6308, 2017.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
-  Ali Diba, Mohsen Fayyaz, Vivek Sharma, Amir Hossein Karami, Mohammad Mahdi Arzani, Rahman Yousefzadeh, and Luc Van Gool. Temporal 3d convnets: New architecture and transfer learning for video classification. arXiv preprint arXiv:1711.08200, 2017.
-  Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
-  Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, pages 6202–6211, 2019.
-  Christoph Feichtenhofer, Axel Pinz, and Richard P Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, pages 4768–4777, 2017.
-  Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, pages 1933–1941, 2016.
-  Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2net: A new multi-scale backbone architecture. arXiv preprint arXiv:1904.01169, 2019.
-  Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. Actionvlad: Learning spatio-temporal aggregation for action classification. In CVPR, pages 971–980, 2017.
-  Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
-  Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The “something something” video database for learning and evaluating visual common sense. In ICCV, pages 5843–5851, 2017.
-  Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Limin Wang, and Shilei Wen. Stnet: Local and global spatial-temporal modeling for action recognition. In AAAI, pages 8401–8408, 2019.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645, 2016.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
-  J Hu, L Shen, S Albanie, G Sun, and E Wu. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
-  Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.
-  Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  Boyuan Jiang, MengMeng Wang, Weihao Gan, Wei Wu, and Junjie Yan. Stm: Spatiotemporal and motion encoding for action recognition. In ICCV, pages 2000–2009, 2019.
-  Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
-  Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In ICCV, pages 2556–2563, 2011.
-  Yanghao Li, Sijie Song, Yuqi Li, and Jiaying Liu. Temporal bilinear networks for video action recognition. In AAAI, pages 8674–8681, 2019.
-  Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In ICCV, pages 7083–7093, 2019.
-  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
-  Joe Yue-Hei Ng and Larry S Davis. Temporal difference networks for video action recognition. In WACV, pages 1587–1596, 2018.
-  Oleksandra Poquet, Lisa Lim, Negin Mirriahi, and Shane Dawson. Video and learning: a systematic review (2007–2017). In ICLAK, pages 151–160. ACM, 2018.
-  Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xinmei Tian, and Tao Mei. Learning spatio-temporal representation with local and global diffusion. In CVPR, pages 12056–12065, 2019.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
-  Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, pages 4597–4605, 2015.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
-  Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015.
-  Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. arXiv preprint arXiv:1904.02811, 2019.
-  Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, pages 6450–6459, 2018.
-  Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1510–1517, 2017.
-  Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In CVPR, pages 3156–3164, 2017.
-  Limin Wang, Wei Li, Wen Li, and Luc Van Gool. Appearance-and-relation networks for video classification. In CVPR, pages 1430–1439, 2018.
-  Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, pages 20–36, 2016.
-  Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.
-  Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In ECCV, pages 399–417, 2018.
-  Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, pages 305–321, 2018.
-  Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. In ICCV, pages 4507–4515, 2015.
-  Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, pages 4694–4702, 2015.
-  Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime tv-l1 optical flow. In Joint Pattern Recognition Symposium, pages 214–223, 2007.
-  Yue Zhao, Yuanjun Xiong, and Dahua Lin. Recognize actions by disentangling components of dynamics. In CVPR, pages 6566–6575, 2018.
-  Yue Zhao, Yuanjun Xiong, and Dahua Lin. Trajectory convolution for action recognition. In NIPS, pages 2204–2215, 2018.
-  Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In ECCV, pages 803–818, 2018.
-  Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas Brox. Eco: Efficient convolutional network for online video understanding. In ECCV, pages 695–712, 2018.
Appendix A Temporal Convolutions in TEA
For video action recognition tasks, previous works typically adopt 3D convolutions to model spatial and temporal features simultaneously, or utilize (2+1)D convolutions to decouple temporal representation learning from spatial feature modeling. 3D convolutions bring tremendous computational cost and preclude the benefits of ImageNet pre-training. Moreover, blending spatial and temporal modeling makes the model harder to optimize. Thus, as shown in Figure 2 of the main text, in our proposed TEA module we choose the (2+1)D architecture and adopt separate 1D temporal convolutions to process temporal information. To train (2+1)D models, a straightforward approach is to fine-tune the 2D spatial convolutions from ImageNet pre-trained networks while initializing the parameters of the 1D temporal convolutions with random noise.
However, according to our observations, simultaneously optimizing the temporal and spatial convolutions in a unified framework leads to a conflict, because the temporal information exchange between frames introduced by the temporal convolutions might harm the spatial modeling ability. In this section, we describe a useful tactic for dealing with this problem and effectively optimizing temporal convolutions in video recognition models.
Before introducing the tactic, we first revisit the recently proposed action recognition method TSM. Different from previous works adopting temporal convolutions [40, 47], TSM utilizes an ingenious shift operator to endow the model with temporal modeling ability without introducing any parameters into 2D CNN backbones. Concretely, given an input feature $\mathbf{X} \in \mathbb{R}^{N \times T \times C \times H \times W}$, the shift operations shift the feature channels along the temporal dimension. Supposing the input feature is a five-dimensional tensor, example pseudo-code for the left and right shift operators is as follows:
$$\mathbf{X}[n, t, c, h, w] = \mathbf{X}[n, t+1, c, h, w], \qquad \mathbf{X}[n, t, c, h, w] = \mathbf{X}[n, t-1, c, h, w]. \tag{8}$$
By utilizing such a shift operation, the spatial features at time step $t$, $\mathbf{X}[:, t, :, :, :]$, achieve temporal information exchange with the neighboring time steps $t-1$ and $t+1$. In practice, the shift operation can be conducted on all or some of the feature channels, and the authors explore several shift options. They find that if all or most of the feature channels are shifted, the performance decreases due to worse spatial modeling ability.
The possible reason for this observation is that TSM utilizes 2D CNN backbones pre-trained on ImageNet as initialization when fine-tuning on new video datasets. The benefit of such pre-training is that the features of pre-trained ImageNet models contain useful spatial representations. But after shifting a part of the feature channels to neighboring frames, the useful spatial representations modeled by the shifted channels are no longer accessible to the current frame. The newly obtained representations become a combination of three successive frames, i.e., $t-1$, $t$, and $t+1$, which is disordered and might be “meaningless” for the current frame $t$.
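To balance these two aspects, TSM experimentally chooses to shift a small part of the feature channels. More specifically, the first 1/8 of the channels are shifted left, the second 1/8 are shifted right, and the last 3/4 are fixed. Formally,
$$\begin{aligned} \mathbf{X}[:, :, :C/8, :, :] &\leftarrow \text{shift\_left}(\mathbf{X}[:, :, :C/8, :, :]), \\ \mathbf{X}[:, :, C/8:C/4, :, :] &\leftarrow \text{shift\_right}(\mathbf{X}[:, :, C/8:C/4, :, :]), \\ \mathbf{X}[:, :, C/4:, :, :] &\leftarrow \mathbf{X}[:, :, C/4:, :, :], \end{aligned} \tag{9}$$
where $C$ is the number of feature channels. This part shift operation has proved effective in TSM and obtains impressive action recognition accuracy on several benchmarks.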
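The part shift described above can be sketched in a few lines of NumPy. This is a minimal illustration, not TSM's released code: the [N, T, C, H, W] layout, the zero padding at the temporal boundaries, and the names `shift_left`, `shift_right`, `part_shift`, and `fold_div` are assumptions made for the example.

```python
import numpy as np

def shift_left(x):
    """x[:, t] <- x[:, t+1]; the last time step is zero-padded (assumption)."""
    out = np.zeros_like(x)
    out[:, :-1] = x[:, 1:]
    return out

def shift_right(x):
    """x[:, t] <- x[:, t-1]; the first time step is zero-padded (assumption)."""
    out = np.zeros_like(x)
    out[:, 1:] = x[:, :-1]
    return out

def part_shift(x, fold_div=8):
    """Shift the first C/fold_div channels left, the next C/fold_div right,
    and leave the remaining channels untouched. x has shape [N, T, C, H, W]."""
    fold = x.shape[2] // fold_div
    out = x.copy()
    out[:, :, :fold] = shift_left(x[:, :, :fold])
    out[:, :, fold:2 * fold] = shift_right(x[:, :, fold:2 * fold])
    return out

# Toy input: 2 videos, 4 time steps, 8 channels, 1x1 spatial size.
x = np.arange(2 * 4 * 8, dtype=np.float32).reshape(2, 4, 8, 1, 1)
y = part_shift(x)
```

With 8 channels and `fold_div=8`, channel 0 is shifted left, channel 1 is shifted right, and channels 2 to 7 are copied unchanged.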
When we think over the shift operation proposed by TSM, we find that it is actually a special case of general 1D temporal convolutions, and we give an example to illustrate this. Instead of conducting the shift operation on the input feature $\mathbf{X}$ as in Equation 9, we aim to utilize a 1D channel-wise temporal convolution to achieve the same control over $\mathbf{X}$ (we utilize channel-wise convolution for simplicity; the formulas in Equation 10 can be extended to general convolutions). Concretely, a 1D channel-wise temporal convolution with kernel size 3 is applied to the input features; with the kernel taps ordered as $t-1$, $t$, $t+1$, the kernel weights at the shifted channels are fixed to $[0, 0, 1]$ or $[1, 0, 0]$, and the kernel weights at the unchanged channels are fixed to $[0, 1, 0]$. Formally,
$$\begin{aligned} \text{shift\_left}(\mathbf{X}) &= \mathbf{K} * \mathbf{X}, & \mathbf{K} &= [0, 0, 1], \\ \text{shift\_right}(\mathbf{X}) &= \mathbf{K} * \mathbf{X}, & \mathbf{K} &= [1, 0, 0], \\ \mathbf{X} &= \mathbf{K} * \mathbf{X}, & \mathbf{K} &= [0, 1, 0], \end{aligned} \tag{10}$$
where $\mathbf{K}$ denotes the convolutional kernels and $*$ indicates the convolution operation. It is not hard to see that Equation 10 is equivalent to Equation 9. The shift operation proposed in TSM can therefore be considered a 1D temporal convolution with fixed, pre-designed kernel weights. Thus, one natural question is whether the performance of video action recognition can be improved by relaxing the fixed kernel weights to learnable kernel weights. We conduct an experiment to answer this question.
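The equivalence between Equation 9 and Equation 10 can be checked numerically. The sketch below is an illustration under stated assumptions: the convolution is implemented as cross-correlation (the convention of common deep learning frameworks) with taps ordered $t-1$, $t$, $t+1$, boundaries are zero-padded, and the function names and tensor sizes are hypothetical.

```python
import numpy as np

def part_shift_kernels(channels, fold_div=8):
    """Per-channel kernels of size 3 that reproduce the part shift."""
    fold = channels // fold_div
    k = np.zeros((channels, 3), dtype=np.float32)
    k[:fold] = [0, 0, 1]          # shift left: read from t+1
    k[fold:2 * fold] = [1, 0, 0]  # shift right: read from t-1
    k[2 * fold:] = [0, 1, 0]      # unchanged channels: read from t
    return k

def channelwise_temporal_conv(x, k):
    """Channel-wise 1D cross-correlation over T with zero padding.
    x: [N, T, C, H, W], k: [C, 3]."""
    t = x.shape[1]
    pad = np.pad(x, ((0, 0), (1, 1), (0, 0), (0, 0), (0, 0)))
    kb = k[:, :, None, None]  # [C, 3, 1, 1], broadcasts over H and W
    return sum(kb[:, i] * pad[:, i:i + t] for i in range(3))

def part_shift_reference(x, fold_div=8):
    """Part shift written directly with slicing, for comparison."""
    fold = x.shape[2] // fold_div
    out = x.copy()
    out[:, :-1, :fold] = x[:, 1:, :fold]
    out[:, -1, :fold] = 0
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]
    out[:, 0, fold:2 * fold] = 0
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 16, 4, 4)).astype(np.float32)
conv_out = channelwise_temporal_conv(x, part_shift_kernels(16))
same = np.allclose(conv_out, part_shift_reference(x))
```

Because the kernels contain only zeros and ones, the two outputs agree exactly; making `k` a learnable parameter turns the fixed shift into the learnable channel-wise convolution studied below.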
|Method|Top-1 (%)|Top-5 (%)|
|---|---|---|
|(2+1)D ResNet-Conv (Ours)|23.5|45.8|
|(2+1)D ResNet-CW (Ours)|43.6|73.4|
|(2+1)D ResNet-Shift (Ours)|46.0|75.3|
The experiment is conducted based on the (2+1)D ResNet baseline. The detailed descriptions of the baseline are introduced in Section 4.3.1 of the main text. We design several variants of (2+1)D ResNet, and the only difference between these variants is the type of utilized 1D temporal convolution.
-  (2+1)D ResNet-Conv adopts general 1D temporal convolutions. The parameters of the temporal convolutions are randomly initialized.
-  (2+1)D ResNet-CW utilizes channel-wise temporal convolutions. The parameters are also randomly initialized.
-  (2+1)D ResNet-Shift also utilizes channel-wise temporal convolutions, but their parameters are initialized as in Equation 10, so that they behave like part shift operators at the beginning of training.
During training, the parameters of the temporal convolutions in all three variants are learnable, and the final models are evaluated on Something-Something V1 with 8 frames as input and the efficient inference protocol.
The comparison results are shown in Table 5. We first notice that (2+1)D ResNet-Conv fails to obtain acceptable performance compared with (2+1)D ResNet-CW (23.5% vs. 43.6% top-1). As we mentioned in the main text, different channels of spatial features capture different information; thus, the temporal combination of each channel should be different and learned independently. Moreover, the general temporal convolution introduces many more parameters and makes the model harder to optimize.
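To make the parameter gap concrete, the following sketch counts the weights of a general and a channel-wise 1D temporal convolution with kernel size 3. The channel widths 64 and 256 are hypothetical values chosen for illustration, and biases are ignored.

```python
def temporal_conv_params(channels, kernel_size=3, channel_wise=False):
    """Number of weights in a bias-free 1D temporal convolution layer
    whose input and output widths are both `channels`."""
    if channel_wise:
        # One kernel of length kernel_size per channel.
        return channels * kernel_size
    # Every output channel mixes all input channels.
    return channels * channels * kernel_size

for c in (64, 256):  # hypothetical channel widths
    general = temporal_conv_params(c)
    cw = temporal_conv_params(c, channel_wise=True)
    print(f"C={c}: general={general}, channel-wise={cw}, ratio={general // cw}")
```

The ratio between the two counts equals the number of channels, so the general convolution grows quadratically in width while the channel-wise variant grows only linearly.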
The second observation is that the performance of (2+1)D ResNet-CW is only slightly higher than that of TSM (43.6% vs. 43.4%). Although the learnable kernel weights endow the model with the ability to learn dynamic temporal information exchange patterns, all the feature channels are disarranged by the randomly initialized convolutions. This damages the spatial feature learning capacity and counteracts the benefits of effective temporal representation learning.
Inspired by the part shift strategy utilized in TSM, (2+1)D ResNet-Shift initializes the temporal convolutions to perform as part shift, which guarantees that the spatial feature learning ability is inherited from the pre-trained ImageNet 2D CNN models. Meanwhile, as the model is optimized, the temporal convolutions can gradually explore more effective temporal information aggregation strategies with learnable kernel weights. Finally, this part shift initialization strategy obtains 46.0% top-1 accuracy, which is substantially higher than TSM.
According to the experiment, we can see that by drawing lessons from TSM and designing a part shift initialization strategy, the performance of action recognition can be improved by using 1D temporal convolutions with learnable kernel weights. This strategy is thus applied to each of the temporal convolutions in the proposed TEA module.
Appendix B Training Details
In this section, we elaborate on the detailed configurations for training the TEA network on different datasets. The code and related experimental logs will be made publicly available soon.
|Method|Frames×Crops×Clips|Inference Time (ms/v)|Val Top-1 (%)|
|---|---|---|---|
|TSN (2D ResNet)|8×1×1|0.0163|19.7|
|I3D (3D ResNet)|32×3×2|4.4642|41.6|
|ME (d in Figure 4)| | | |
|MTA (c in Figure 4)|8×1×1|0.0256|47.5|
B.1 Model Initializations
Following previous action recognition works [44, 27, 22], we utilize 2D CNNs pre-trained on the ImageNet dataset as the initialization for our network. Notice that the proposed multiple temporal aggregation (MTA) module is based on Res2Net, whose architecture is different from the standard ResNet. We thus select the released Res2Net50 model (https://shanghuagao.oss-cn-beijing.aliyuncs.com/res2net/res2net50_26w_4s-06e79181.pth) pre-trained on ImageNet to initialize the proposed network.
Although Res2Net has proved to be a stronger backbone than ResNet on various image-based tasks, e.g., image classification and object detection, it does not bring much improvement for the video action recognition task. As we have discussed in the ablation study section (Section 4.3.1) of the main text, temporal modeling ability is the key factor for video-based tasks, rather than complicated spatial representations. The experimental results in Table 1 of the main text also verify this. With a more powerful 2D backbone, (2+1)D Res2Net obtains only a slight improvement over (2+1)D ResNet (46.2% vs. 46.0%).
For experiments on Kinetics and Something-Something V1 & V2, the networks are fine-tuned from ImageNet pre-trained models. All the batch normalization layers are enabled during training. The learning rate and weight decay of the classification layer (a fully connected layer) are set to 5× those of the other layers. For Kinetics, the batch size, initial learning rate, weight decay, and dropout rate are set to 64, 0.01, 1e-4, and 0.5, respectively; for Something-Something, these hyperparameters are set to 64, 0.02, 5e-4, and 0.5, respectively. For these two datasets, the networks are trained for 50 epochs using stochastic gradient descent (SGD), and the learning rate is decreased by a factor of 10 at epochs 30, 40, and 45.
When fine-tuning Kinetics models on the smaller datasets, i.e., HMDB51 and UCF101, the batch normalization layers are frozen except the first one, following TSN. The batch size, initial learning rate, weight decay, and dropout rate are set to 64, 0.001, 5e-4, and 0.8 for both datasets. The learning rate and weight decay of the classification layer are set to 5× those of the other layers. The learning rate is decreased by a factor of 10 at epochs 10 and 20, and training stops at 25 epochs.
Finally, the learning rate should match the batch size, as suggested by Goyal et al. For example, the learning rate should be doubled if the batch size scales up from 64 to 128.
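The schedules above can be summarized in a small configuration sketch. The values are taken from the text; the class and function names are hypothetical, and treating "decreased at epoch 30" as "epochs 30 and later, counting from 0" is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    batch_size: int
    base_lr: float
    weight_decay: float
    dropout: float
    epochs: int
    milestones: tuple  # epochs at which the learning rate is divided by 10

CONFIGS = {
    "kinetics": TrainConfig(64, 0.01, 1e-4, 0.5, 50, (30, 40, 45)),
    "something": TrainConfig(64, 0.02, 5e-4, 0.5, 50, (30, 40, 45)),
    "hmdb51_ucf101": TrainConfig(64, 0.001, 5e-4, 0.8, 25, (10, 20)),
}

def learning_rate(cfg, epoch, batch_size=None, fc_layer=False):
    """Step schedule with linear batch-size scaling.
    The classification (fc) layer uses a 5x larger learning rate."""
    bs = cfg.batch_size if batch_size is None else batch_size
    lr = cfg.base_lr * bs / cfg.batch_size  # linear scaling rule
    lr *= 0.1 ** sum(epoch >= m for m in cfg.milestones)
    return lr * 5 if fc_layer else lr

# Doubling the batch size from 64 to 128 doubles the learning rate.
lr_64 = learning_rate(CONFIGS["something"], epoch=0)
lr_128 = learning_rate(CONFIGS["something"], epoch=0, batch_size=128)
```

The same 5× factor would apply to the weight decay of the classification layer; it is omitted here to keep the sketch short.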
Appendix C The Effect of the Transformation Convolutions in the ME Module
When calculating feature-level motion representations in the ME module, we first apply a channel-wise transformation convolution to the features at time step $t+1$. The reason is that motion causes spatial displacement of the same object between two frames, so directly computing differences between displaced features results in mismatched motion representations. To address this issue, we add a 3×3 convolution at time step $t+1$, attempting to capture the matched regions of the same object from its context. According to our verification, this operation leads to a further improvement of TEA on Something-Something V1 (from 48.4% to 48.9%). Moreover, we found that conducting the transformation at both the $t$ and $t+1$ time steps does not improve the performance but introduces more operations.
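A minimal NumPy sketch of this computation is given below. It is an illustration under stated assumptions, not the authors' implementation: the [N, T, C, H, W] layout, the zero padding, and the names `depthwise_conv3x3` and `motion_features` are hypothetical, and the kernel `w` would be learned in practice.

```python
import numpy as np

def depthwise_conv3x3(x, w):
    """Channel-wise 3x3 cross-correlation with zero padding.
    x: [N, T, C, H, W], w: [C, 3, 3]."""
    h, wd = x.shape[-2:]
    pad = np.pad(x, ((0, 0), (0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            # One kernel tap per channel, broadcast over H and W.
            out += w[:, dy, dx][:, None, None] * pad[..., dy:dy + h, dx:dx + wd]
    return out

def motion_features(x, w):
    """Transform the features at t+1, then subtract the features at t."""
    return depthwise_conv3x3(x[:, 1:], w) - x[:, :-1]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4, 5, 5)).astype(np.float32)

# With an identity kernel the result reduces to a plain frame difference.
w_identity = np.zeros((4, 3, 3), dtype=np.float32)
w_identity[:, 1, 1] = 1.0
m = motion_features(x, w_identity)
```

Only the features at $t+1$ pass through the convolution, which matches the observation that transforming both time steps adds operations without improving accuracy.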
Appendix D Runtime Analysis
We show the accuracies and inference times of TEA and other methods in Table 6. All tests are conducted on one P40 GPU with a batch size of 16. The time for data loading is excluded from the evaluation. Compared with STM, TEA achieves higher accuracy with similar efficiency. Compared with TSM and I3D, TEA is superior in both effectiveness and efficiency. The 2D ResNet baseline (TSN) runs nearly 1.8x faster than TEA, but its accuracy is far behind ours (19.7% vs. 48.9%).
We further analyze the efficiency of each component in TEA by comparing TEA with MTA and ME, respectively. The hierarchical stages in MTA cause an increase in inference time (TEA − ME), as the multiple stages need to be processed sequentially. The additional time introduced by ME is the difference TEA − MTA. Please note that for an input feature with $T$ timestamps, it is not required to subtract adjacent features timestamp by timestamp and then concatenate the $T-1$ differences. We only need to slice along the temporal dimension twice to obtain the features of time steps $1$ to $T-1$ and $2$ to $T$, respectively. A single subtraction then yields the final feature differences, so the time cost of this approach is small.
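The slicing trick can be sketched in NumPy as follows. The [N, T, C, H, W] layout and the function names are assumptions made for the example; the loop version is included only to show that both approaches give the same result.

```python
import numpy as np

def diff_loop(x):
    """Subtract adjacent time steps one by one, then stack the T-1 differences."""
    t = x.shape[1]
    return np.stack([x[:, i + 1] - x[:, i] for i in range(t - 1)], axis=1)

def diff_sliced(x):
    """Slice twice along the temporal dimension and subtract once."""
    later = x[:, 1:]     # time steps 2..T
    earlier = x[:, :-1]  # time steps 1..T-1
    return later - earlier

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 16, 7, 7)).astype(np.float32)
d = diff_sliced(x)
identical = np.array_equal(d, diff_loop(x))
```

Both slices are views of the same array, so the sliced version performs a single vectorized subtraction with no per-timestep loop and no concatenation.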
Appendix E The Location of the TEA Block
As described in Section 4.2 of the main text, the TEA blocks are utilized to replace all the ResNet blocks of the ResNet-50 backbone from conv2 to conv5. In this section, we conduct an ablation study to explore the impact of inserting the TEA blocks into ResNet at different locations. Specifically, we replace all the ResNet blocks with TEA blocks at one particular stage, e.g., conv2, and leave all other stages, e.g., conv3–conv5, unchanged. The networks are trained on the training set of Something-Something V1 and measured on its validation set. During testing, the efficient protocol (center crop × 1 clip) is adopted, and the comparison results are shown in Table 7. It can be seen that, in general, inserting TEA blocks into the later stages (i.e., conv4/conv5, 47.1%/46.7%) is superior to inserting them into the early stages (i.e., conv2/conv3, 43.5%/45.3%). The spatiotemporal features at the later stages capture temporal information from a larger range and enable more capable temporal aggregation. Thus, the TEA blocks at the later stages have a more effective and decisive impact on temporal modeling ability, which finally results in higher action recognition performance. When inserting the TEA blocks into all stages of the ResNet backbone, the performance of our method further increases and achieves the best result (48.9%).
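The ablation settings can be written down as a small layout sketch. The per-stage block counts (3, 4, 6, 3) are those of the standard ResNet-50; the function names and the string labels are hypothetical.

```python
# Number of residual blocks per stage in the standard ResNet-50.
RESNET50_BLOCKS = {"conv2": 3, "conv3": 4, "conv4": 6, "conv5": 3}

def block_layout(tea_stages):
    """Return the block type used at every position of the backbone."""
    return {
        stage: ["TEA" if stage in tea_stages else "ResNet"] * n
        for stage, n in RESNET50_BLOCKS.items()
    }

def num_tea_blocks(tea_stages):
    """Count how many blocks are replaced by TEA blocks."""
    return sum(RESNET50_BLOCKS[s] for s in tea_stages)

single_stage = block_layout({"conv4"})       # one ablation setting
full_model = block_layout(RESNET50_BLOCKS)   # TEA blocks at all stages
```

Note that the single-stage settings replace different numbers of blocks (for example, 6 at conv4 versus 3 at conv2 or conv5), while the full model replaces all 16.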
Appendix F The Verification for the Assumption of ME
To verify the assumption of ME, we give a visualization example in Figure 5. We can see that different feature channels capture different information. For example, on channels 11 and 25, the features model the moving swimmers, and the ME module enhances this motion information by assigning large attention weights (0.90/0.58). In contrast, on channels 14 and 42, the background information is simply preserved with much lower attention weights (0.08/0.05).
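The effect of these weights can be illustrated with a short sketch. It assumes a residual form of excitation, in which the output is the input plus the attention-weighted input; this form and the unit-valued toy features are assumptions made for the example, while the four weights are the ones quoted above.

```python
import numpy as np

def excite(x, a):
    """Residual excitation (assumed form): keep x and add its weighted copy."""
    return x + x * a

channels = [11, 25, 14, 42]
weights = np.array([0.90, 0.58, 0.08, 0.05])  # attention weights from the text
features = np.ones(4)                          # toy features of unit magnitude

out = excite(features, weights)
gain = dict(zip(channels, out.tolist()))
```

Under this assumed form, the motion-sensitive channels 11 and 25 are amplified by factors of about 1.90 and 1.58, while the background channels 14 and 42 stay close to their original magnitude, i.e., they are preserved rather than suppressed.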
Appendix G Experimental Results on Something-Something V2
In this section, we compare the proposed TEA network with other state-of-the-art methods on Something-Something V2. Something-Something V2 is a newer release of the Something-Something dataset. It contains 168,913 training videos, 24,777 validation videos, and 27,157 test videos. It is roughly twice the size of Something-Something V1 (108,499 videos in total). The TEA network is trained on the training set and evaluated on the validation set and the test set. The accuracy inference protocol (full resolution × 10 clips) is utilized for evaluation, and the results are shown in Table 8. We can see that on the validation set, our result (65.1%) outperforms those of the existing state-of-the-art methods. On the test set, the obtained number is also comparable to the state-of-the-art result (63.2% vs. 63.5%). These results verify the effectiveness of the proposed TEA network on Something-Something V2.
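The size comparison between the two dataset versions follows from simple arithmetic on the split sizes quoted above. The variable names are hypothetical, and the third split is taken to be the test set.

```python
# Split sizes of Something-Something V2 as quoted in the text.
v2_splits = {"train": 168_913, "val": 24_777, "test": 27_157}
v1_total = 108_499  # total number of videos in Something-Something V1

v2_total = sum(v2_splits.values())
ratio = v2_total / v1_total
print(f"V2 total: {v2_total}, V2/V1 ratio: {ratio:.2f}")
```

The three splits sum to 220,847 videos, which is slightly more than twice the size of V1.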