Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition

10/22/2020 ∙ by Chun-Fu Chen, et al. ∙ MIT ibm 0

In recent years, a number of approaches based on 2D CNNs and 3D CNNs have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out an in-depth comparative analysis to better understand the differences between these approaches and the progress made by them. To this end, we develop a unified framework for both 2D-CNN and 3D-CNN action models, which enables us to remove bells and whistles and provides a common ground for a fair comparison. We then conduct an effort towards a large-scale analysis involving over 300 action recognition models. Our comprehensive analysis reveals that a) a significant leap is made in efficiency for action recognition, but not in accuracy; b) 2D-CNN and 3D-CNN models behave similarly in terms of spatio-temporal representation abilities and transferability. Our analysis also shows that recent action models seem to be able to learn data-dependent temporality flexibly as needed. Our codes and models are available on




1 Introduction

With the recent advances in convolutional neural networks (CNNs) [InceptionV1:Szegedy_2015_CVPR, He_2016_CVPR] and the availability of large-scale video datasets [Kinetics:kay2017kinetics, Moments:monfort2019moments], deep learning approaches have dominated the field of video action recognition by using 2D CNNs [TSN:wang2016temporal, TSM:lin2018temporal, bLVNetTAM] or 3D CNNs [I3D:carreira2017quo, ResNet3D_2:hara2018can, SlowFast:feichtenhofer2018slowfast] or both [luo2019grouped, sudhakaran2020gate]. The 2D CNNs perform temporal modeling independently of 2D spatial convolutions, while their 3D counterparts learn space and time information jointly through 3D convolution. These methods have achieved state-of-the-art performance on multiple large-scale benchmarks such as Kinetics [Kinetics:kay2017kinetics] and Something-Something [Something:goyal2017something].

Figure 1: Recent progress of action recognition on Kinetics-400 (only models based on InceptionV1 and ResNet50 are included). The marked models are re-trained and evaluated under the same setting (see Section 5.2 for details), while the others are from the literature. The size of a circle indicates the 1-clip FLOPs of a model. With temporal pooling turned off, I3D performs on par with the state-of-the-art approaches. Best viewed in color.

Although CNN-based approaches have made impressive progress on action recognition, there seems to be no clear winner in terms of accuracy. For example, 3D models report better performance than 2D models on Kinetics, while the latter are superior on Something-Something. Given this, there is a great need to better understand the differences between these two types of spatio-temporal representations and, further, what contributes to their accuracy improvements. Unlike image recognition, action recognition lacks a fair performance comparison among existing approaches. The current comparison in the literature mainly focuses on accuracy and efficiency, and tends to neglect other important factors that may affect performance, such as the backbone networks. As shown in Fig. 1, I3D [I3D:carreira2017quo], a very popular baseline method based on Inception-V1, is often compared in the literature with other approaches using stronger backbones such as ResNet50 [He_2016_CVPR]. As a result, it is hard to determine whether the improved results of an approach come from a better backbone or from the algorithm itself. In addition, variations in training and evaluation protocols, model inputs and pretrained models from approach to approach further confound the comparison.

The lack of fairness in performance evaluation also leads to confusion about the significance of temporal modeling for action recognition. It is generally believed that temporal modeling is the crux of the matter for action recognition and that state-of-the-art approaches capture better temporal information. However, it has also been demonstrated on datasets such as Kinetics and Moments-in-Time (MiT) [Moments:monfort2019moments] that approaches purely based on spatial modeling [TSN:wang2016temporal, Moments:monfort2019moments] can achieve very competitive results compared to more sophisticated spatio-temporal models. More recently, a paper [hutchinson2020accuracy] shows that 2D models outperform their 3D counterparts on the MiT benchmark, concluding that “model depth, rather than input feature scale, is the critical component to an architecture’s ability to extract a video’s semantic action information”. All these findings seem to imply that more complex temporal modeling is not necessary for “static” datasets such as Kinetics and MiT.

In light of the need for a deep analysis of action recognition works, in this paper we provide a common ground for comparative analysis of 2D-CNN and 3D-CNN models without any bells and whistles. We conduct consistent and comprehensive experiments to compare several representative 2D-CNN and 3D-CNN methods on three large-scale benchmark datasets. Our main goal is to deliver clear understanding of a) how differently 2D-CNN and 3D-CNN methods behave with regard to spatial-temporal modeling of video data; b) whether the state-of-the-art approaches enable more effective learning of spatio-temporal representations of video, as claimed in the papers; and c) the significance of temporal modeling for action recognition.

To this end, we first unify 2D-CNN and 3D-CNN approaches into a general framework, which views a model as a sequence of stacked spatio-temporal modules. This limits the main difference between 2D and 3D approaches to how they model temporal information only (see Fig. 2). We then re-implemented six representative approaches to action recognition, including I3D [I3D:carreira2017quo], ResNet3D [ResNet3D_2:hara2018can], S3D [S3D], R(2+1)D [R2plus1D:Tran_2018_CVPR], TSN [TSN:wang2016temporal] and TAM [bLVNetTAM], in this unified framework. We trained about 300 action recognition models on three popular benchmark datasets with different backbone networks (InceptionV1, ResNet18 and ResNet50) and input frames, using the same initialization and training protocol. We also develop methods to perform detailed analysis of the spatio-temporal effects of different models across backbones and network architectures. We further analyze data temporality (i.e., the temporal information needed for recognition) and observe that the temporal information perceived by humans as useful for recognition might not be the same as what an action model attempts to learn. However, advanced spatio-temporal models seem to be able to learn data-dependent temporality flexibly as needed. Our systematic analysis will provide insights that help researchers understand the spatio-temporal effects of different action models, and will broadly stimulate discussion in the community regarding a very important but largely neglected issue: fair comparison in video action recognition.

The main contributions of our work are as follows:

  • A unified framework for Action Recognition. We present a unified framework for 2D-CNN and 3D-CNN approaches and implement several representative methods for comparative analysis on three standard benchmark datasets.

  • Spatio-Temporal Analysis. We systematically compare 2D-CNN and 3D-CNN models to better understand the differences and spatio-temporal behavior of these models. Our analysis leads to some interesting findings: a) the advance in action recognition is mostly on the efficiency side, not on accuracy (Fig. 1); b) once non-structural differences between 2D-CNN and 3D-CNN models are removed, they behave similarly in terms of spatio-temporal representation abilities and transferability; and c) effective temporal modeling is essential for achieving SOTA results even for datasets such as Kinetics.

  • Analysis on Data Temporality. We analyze the temporality of action data from both the human and the machine point of view. Our analysis shows that temporality is not treated as intrinsic to action data by recent spatio-temporal models, which seem to be able to learn temporality as needed in a data-driven way.

2 Related Work

Video understanding is a challenging problem with great application potential. Over the last years video understanding has made rapid progress with the introduction of a number of large-scale video datasets such as Kinetics [Kinetics:kay2017kinetics], Sports1M [karpathy2014large], Moments-In-Time [Moments:monfort2019moments], and YouTube-8M [abu2016youtube]. A number of models introduced recently have emphasized the need to efficiently model spatio-temporal information for action recognition. Most successful deep architectures for action recognition are based on the two-stream model [Simonyan14TwoStream], processing RGB frames and optical flow in two separate Convolutional Neural Networks (CNNs) with a late fusion in the upper layers [karpathy2014large]. Over the last few years, two-stream approaches have been used in different action recognition methods [cheron2015p, lstm:donahue2015longterm, gkioxari2015finding, yue2015beyond, srivastava2015unsupervised, venugopalan2015sequence, weinzaepfel2015learning, wang2015action, feichtenhofer2016spatiotemporal, feichtenhofer2017spatiotemporal]. Another straightforward but popular approach is to use a 2D-CNN to extract frame-level features and then model the temporal causality. For example, TSN [TSN:wang2016temporal] proposed the consensus module to aggregate the features, whereas TRN [TRN:zhou2018temporal] used a bag-of-features idea to model the relationship between frames. While TSM [TSM:lin2018temporal] shifts part of the channels along the temporal dimension, thereby allowing information to be exchanged among neighboring frames, TAM [bLVNetTAM] is based on depthwise convolutions to capture temporal dependencies across frames effectively. Different methods for temporal aggregation of feature descriptors have also been proposed [fernando2015modeling, lev2016rnn, xu2015discriminative, wang2015action, peng2014action, girdhar2017actionvlad, girdhar2019video]. More complex approaches have also been investigated for capturing long-range dependencies, e.g. in the context of non-local neural networks [Wang2018NonLocal].

Another approach is to use 3D-CNN, which extends the success of 2D models in image recognition [3DCNNHuamn:Ji2013] to recognize actions in videos. For example, C3D [C3D:Tran2015learning] learns 3D ConvNets which outperform 2D CNNs through the use of large-scale video datasets. Many variants of 3D-CNNs have been introduced for learning spatio-temporal features, such as I3D [I3D:carreira2017quo] and ResNet3D [ResNet3D_2:hara2018can]. 3D-CNN features were also demonstrated to generalize well to other vision tasks, such as action detection [shou2016temporal], video captioning [pan2016jointly], action localization [paul2018w], and video summarization [panda2017collaborative]. Nonetheless, as 3D convolution leads to a high computational load, a few works aim to reduce the complexity by decomposing the 3D convolution into a 2D spatial convolution and a 1D temporal convolution, e.g. P3D [P3D:Qiu_2017_ICCV], S3D [S3D], R(2+1)D [R2plus1D:Tran_2018_CVPR]; or by incorporating group convolution [CSN:Tran_2019_ICCV]; or by using a combination of 2D-CNN and 3D-CNN [ECO:zolfaghari2018eco]. Furthermore, the SlowFast network employs two pathways to capture short-term and long-term temporal information [SlowFast:feichtenhofer2018slowfast] by processing a video at both slow and fast frame rates. Beyond that, Timeception applies the Inception concept in the temporal domain for capturing long-range temporal dependencies [Timeception]. Feichtenhofer [X3D_Feichtenhofer_2020_CVPR] finds efficient networks by extending 2D architectures through a stepwise expansion approach over key variables such as temporal duration, frame rate, spatial resolution, network width, etc. Leveraging weak supervision [ghadiyaram2019large, wang2017untrimmednets, kuehne2017weakly] or distillation [girdhar2019distinit] is another recent trend in action recognition.

Recently, a few works have assessed the importance of temporal information in a video. For example, Sigurdsson et al. analyzed recognition performance per action category based on different levels of object complexity, verb complexity, and motion [sigurdsson2017actions]. They state that to differentiate temporally similar but semantically different videos, it is important for models to develop temporal understanding. Huang et al. analyzed the effect of motion via an ablation analysis on the C3D model [huang2018makes]. Nonetheless, those works only study a limited set of backbones and temporal modeling methods.

3 2D-CNN and 3D-CNN Approaches

In this work, we focus on several popular 2D-CNN and 3D-CNN action recognition models in our analysis, including I3D [I3D:carreira2017quo], ResNet3D [ResNet3D:hara2017learning], S3D [S3D], R(2+1)D [R2plus1D:Tran_2018_CVPR], TSN [TSN:wang2016temporal] and TAM [bLVNetTAM]. These representative approaches not only yield competitive results on popular large-scale datasets, but also widely serve as fundamental building blocks for many subsequent approaches such as SlowFast [SlowFast:feichtenhofer2018slowfast] and CSN [CSN:Tran_2019_ICCV]. Since SlowFast is arguably one of the best approaches on Kinetics, we use it as a reference for the SOTA results. Among the approaches in our study, I3D and ResNet3D are pure 3D-CNN models, differing only in backbones. S3D and R(2+1)D factorize a 3D convolutional filter into a 2D spatial filter followed by a 1D temporal filter. In that sense, they are architecturally similar to 2D models. However, we consider them 3D-CNN approaches as their implementations are based on 3D convolutions. On the other hand, TSN relies only on 2D convolution to learn spatio-temporal representations and achieves competitive results on the popular Kinetics dataset without any temporal modeling. Finally, we consider TAM, a recently proposed approach that adds efficient depthwise temporal aggregation on top of TSN and demonstrates strong temporal modeling capability on the Something-Something dataset [bLVNetTAM]. Apart from using different types of convolutional kernels, 2D and 3D models differ in a number of other aspects, including model input, temporal pooling, and temporal aggregation, as briefly highlighted in Table 1. More information on the models can be found in the Supplemental.

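Approach | Model Input | Input Sampling | Backbone | Temporal Pooling | Spatial Module | Temporal Aggregation | Initial Weights
I3D [I3D:carreira2017quo] | 4D | Dense | InceptionV1 | Y | 3D Conv. | 3D Conv. | Inflation
R3D [ResNet3D:hara2017learning] | " | " | ResNet | " | " | " | "
S3D [S3D] | " | " | InceptionV1 | " | 2D Conv. | 1D Conv. | Inflation
R(2+1)D [R2plus1D:Tran_2018_CVPR] | " | " | ResNet | " | " | " | Scratch
TAM [bLVNetTAM] | 3D | Uniform | bLResNet | N | 2D Conv. | 1D dw Conv. | ImageNet
TSN [TSN:wang2016temporal] | " | " | InceptionV1 | " | " | None | ImageNet
(" marks a cell that carries no value of its own in the source layout and appears to share the value of the row above.)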

Table 1: 2D-CNN and 3D-CNN approaches in our study.
Figure 2: A general framework for 2D-CNN and 3D-CNN approaches of video action recognition. A video action recognition model can be viewed as a sequence of stacked spatio-temporal modules. The input frames are formed as a 3D tensor for 2D models and a 4D tensor for 3D models.

The differences between 2D-CNN and 3D-CNN approaches make it a challenge to compare these approaches. To remove the bells and whistles and ensure a fair comparison, we show in Fig. 2 that 2D and 3D models can be represented by a general framework. Under such a framework, an action recognition model is viewed as a sequence of stacked spatio-temporal modules with temporal pooling optionally applied. Thus what differentiates a model from another boils down to only its spatio-temporal module. We re-implemented all the approaches used in our comparison using this framework, which allows us to test an approach flexibly under different configurations such as backbone, temporal pooling and temporal aggregation.
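The framework view above can be made concrete with a toy sketch. The code below is only an illustration under strong simplifying assumptions, not the authors' implementation: each frame is reduced to a single number, and the names `SpatioTemporalModule`, `neighbor_average`, `temporal_max_pool` and `run_model` are hypothetical. It shows a model as a stack of modules, each applying a per-frame spatial operation, an optional temporal aggregation, and optional temporal pooling.

```python
from typing import Callable, List, Optional

Clip = List[float]  # one scalar "feature" per frame, for illustration only


def temporal_max_pool(clip: Clip, stride: int = 2) -> Clip:
    """Shrink the temporal length by taking the max over non-overlapping windows."""
    return [max(clip[i:i + stride]) for i in range(0, len(clip), stride)]


def neighbor_average(clip: Clip) -> Clip:
    """Toy temporal aggregation: average each frame with its existing neighbours."""
    out = []
    for t in range(len(clip)):
        window = clip[max(0, t - 1):t + 2]
        out.append(sum(window) / len(window))
    return out


class SpatioTemporalModule:
    """Hypothetical module: spatial op per frame, then optional temporal steps."""

    def __init__(
        self,
        spatial: Callable[[float], float],
        temporal: Optional[Callable[[Clip], Clip]] = None,
        temporal_pooling: bool = False,
    ) -> None:
        self.spatial = spatial
        self.temporal = temporal
        self.temporal_pooling = temporal_pooling

    def __call__(self, clip: Clip) -> Clip:
        out = [self.spatial(frame) for frame in clip]  # spatial modeling
        if self.temporal is not None:                   # temporal aggregation
            out = self.temporal(out)
        if self.temporal_pooling:                       # optional temporal pooling
            out = temporal_max_pool(out)
        return out


def run_model(modules: List[SpatioTemporalModule], clip: Clip) -> Clip:
    """Apply the stacked modules in order."""
    for module in modules:
        clip = module(clip)
    return clip
```

In this picture, a TSN-like model would leave `temporal` unset in every module, while the other architectures would differ only in the function plugged into `temporal`.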

4 Datasets, Training, Evaluation Protocols

To ensure fair comparison and facilitate reproducibility, we train all the models using the same data preprocessing, training protocol, and evaluation protocol.

4.1 Datasets

Table 2 illustrates the characteristics of the datasets used in the paper. The SSV2 dataset contains a total of 192K videos of 174 human-object interactions, captured in a simple setup without much background information. It has been shown that temporal reasoning is essential for recognition on this dataset [TRN:zhou2018temporal]. Kinetics has been the most popular benchmark for deep-learning-based action approaches. It consists of 240K training videos and 20K validation videos of 400 action categories, with each video lasting 6-10 seconds. Interestingly, approaches without temporal modeling such as TSN [TSN:wang2016temporal] achieve strong results on this dataset, implying that modeling temporal information is not that important on it. MiT is a recent collection of one million labeled videos, involving actions from people, animals, objects or natural phenomena. It has 339 classes and each clip is trimmed to 3 seconds. These datasets cover a wide range of video types and hence are suitable for studying various spatio-temporal representations.

Data preprocessing and augmentation. We extract frame images from videos via the FFMPEG package and then resize the shorter side of each image to 256 while keeping its aspect ratio. Following the practice in TSN [TSN:wang2016temporal], we apply multi-scale augmentation and randomly crop the same 224×224 region from all input images of a clip for training. Meanwhile, temporal jittering is used to sample different frames from a video. Afterward, the input is normalized by the mean and standard deviation used in the original ImageNet-pretrained model.
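The geometry of this preprocessing can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, multi-scale augmentation is omitted, and the mean and standard deviation values are the ones commonly used with ImageNet-pretrained models, which is an assumption since the text does not list them.

```python
import random
from typing import Sequence, Tuple

# Assumed normalization constants (commonly used for ImageNet-pretrained models).
ASSUMED_MEAN = (0.485, 0.456, 0.406)
ASSUMED_STD = (0.229, 0.224, 0.225)


def resize_shorter_side(width: int, height: int, target: int = 256) -> Tuple[int, int]:
    """New (width, height) with the shorter side equal to target, aspect ratio kept."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def random_crop_box(width: int, height: int, crop: int = 224, rng=random) -> Tuple[int, int, int, int]:
    """Sample one (left, top, right, bottom) box; reuse it for every frame of a clip."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


def normalize_pixel(
    rgb: Sequence[int],
    mean: Sequence[float] = ASSUMED_MEAN,
    std: Sequence[float] = ASSUMED_STD,
) -> Tuple[float, ...]:
    """Scale 8-bit RGB to [0, 1], then subtract the mean and divide by the std."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))
```

Sampling the crop box once per clip, rather than once per frame, is what keeps the cropped region identical across all frames of the same input.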

4.2 Training

Table 3 illustrates the training protocol we use for all the models and datasets in our experiments. We train most of our models using a single compute node with 6 V100 GPUs and a total of 96G of GPU memory, with a batch size of 72 or the maximum allowed for a single node (a multiple of 6). For some large models (for example, I3D-ResNet50) using 32 or 64 frames, we limit the number of nodes to no more than 3, i.e. 18 GPUs, and apply synchronized batch normalization in training at a batch size of 36. It is known that batch size has a significant impact on model performance [GroupNormalization]. However, we observe that such a setup generally leads to model accuracy comparable to the approaches studied in this work.

Dataset | # of Images (Train) | # of Images (Val) | # of Classes | Duration
SSV2 [Something:goyal2017something] | 168k | 24k | 174 | 3-5s@12fps
Mini-SSV2 | 81k | 12k | 87 | "
Kinetics [Kinetics:kay2017kinetics] | 240k | 19k | 400 | 6-10s@30fps
Mini-Kinetics | 121k | 10k | 200 | "
MiT [Moments:monfort2019moments] | 802k | 34k | 339 | 3s@30fps
Mini-MiT | 100k | 10k | 200 | "
(" marks a cell that carries no value of its own in the source layout and appears to share the value of the row above.)
Mini-Something-Something and Mini-Kinetics400 are created by randomly sampling half of the classes.

Table 2: Overview of datasets.

Setting | 8-frame | 16-frame | 32-frame | 64-frame
Weight Init. | ImageNet | 8-frame | 16-frame | 32-frame
Epochs (a) | 75 (100) | 35 (45) | 35 (45) | 35 (45)
Learning rate | 0.01 (all settings)
LR scheduler (b) | cosine | multisteps | multisteps | multisteps
Weight decay | 0.0005 (all settings)
Optimizer | Synchronized SGD with momentum 0.9 (all settings)
(a) Mini-Kinetics uses the epoch number in brackets since it has more data.
(b) When the epoch number is 35, the learning rate drops by 10× at the 10th, 20th and 30th epochs; when 45 epochs are used, it drops by 10× at the 15th, 30th and 40th epochs.
Table 3: Training protocol
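The multistep schedule described under Table 3 can be written as a small function. This is a sketch under stated assumptions: the drop is read as a division by 10, epochs are counted from 0, the drop takes effect at the start of the listed epoch, and the function name `multistep_lr` is hypothetical.

```python
def multistep_lr(epoch: int, total_epochs: int, base_lr: float = 0.01) -> float:
    """Learning rate for a given epoch under the 35- or 45-epoch multistep schedule."""
    if total_epochs == 35:
        milestones = (10, 20, 30)
    elif total_epochs == 45:
        milestones = (15, 30, 40)
    else:
        raise ValueError("only the 35- and 45-epoch schedules are described")
    # Count how many milestones have been reached; each one divides the rate by 10.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * (0.1 ** drops)
```

For example, under the 35-epoch schedule the rate stays at 0.01 for epochs 0 to 9 and is 0.001 from epoch 10 until the next milestone.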

4.3 Evaluation

In the clip-level accuracy setting, we sample frames with either uniform sampling or dense sampling, resize the shorter side of each image to 224, and then crop a 224×224 region centered on each image. For uniform sampling, we choose the middle frame of each segment to form a clip, while for dense sampling the first clip is used. In the video-level accuracy setting, multiple clips need to be prepared. For dense sampling, we uniformly select starting points and then take consecutive frames starting at each point. In the case of uniform sampling, we apply an offset from the middle frame to shift the sampling location within each segment. A fixed number of clips is used to compute video-level accuracy.

Datasets. We choose Something-Something V2 (SSV2), Kinetics-400 (Kinetics) and Moments-in-time (MiT) for our experiments. We also create a mini version of each dataset - Mini-SSV2 and Mini-Kinetics account for half of their full datasets by randomly selecting half of the categories of SSV2 and Kinetics, respectively. Mini-MiT is provided on the official MiT website, consisting of 1/8 of the full dataset.

Figure 3: Top-1 accuracy of all models without temporal pooling on three mini-datasets. The video architectures are separated by color while the backbones by symbol.

Training. Following [bLVNetTAM], we progressively train the models using different numbers of input frames (8, 16, 32 and 64). We first train a starter model using 8 frames. The model is either inflated from (e.g. I3D) or initialized from (e.g. TAM) its corresponding ImageNet pre-trained model. We then fine-tune each model that uses more frames from the model trained with the preceding, smaller number of frames (e.g. the 16-frame model from the 8-frame model), as listed in Table 3.
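The progressive scheme can be summarized as a chain of initializations, following the "Weight Init." row of Table 3. The sketch below only encodes that chain; the names `FRAME_SCHEDULE`, `init_source` and `training_plan` are hypothetical.

```python
from typing import List, Tuple

FRAME_SCHEDULE = (8, 16, 32, 64)  # input lengths trained in order


def init_source(num_frames: int) -> str:
    """Where the weights of the model with num_frames input frames come from."""
    idx = FRAME_SCHEDULE.index(num_frames)
    if idx == 0:
        return "ImageNet"  # starter model: inflated or initialized from ImageNet
    return f"{FRAME_SCHEDULE[idx - 1]}-frame model"


def training_plan() -> List[Tuple[int, str]]:
    """Ordered (input length, initialization source) pairs."""
    return [(n, init_source(n)) for n in FRAME_SCHEDULE]
```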


There are two major evaluation metrics for action recognition: clip-level accuracy and video-level accuracy. Clip-level accuracy is based on the prediction obtained by feeding a single clip into the network, whereas video-level accuracy is based on the combined predictions of multiple clips; thus, the video-level accuracy is usually higher than the clip-level accuracy. By default, we report the clip-level accuracy.
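The difference between the two metrics comes down to how a video's prediction is formed. The sketch below assumes that the predictions of multiple clips are combined by averaging their class scores, which is one common choice; the function names are hypothetical.

```python
from typing import List, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda k: scores[k])


def clip_level_prediction(clip_scores: Sequence[float]) -> int:
    """Predicted class from the scores of a single clip."""
    return argmax(clip_scores)


def video_level_prediction(all_clip_scores: List[Sequence[float]]) -> int:
    """Predicted class after averaging the scores of several clips (assumed rule)."""
    num_clips = len(all_clip_scores)
    num_classes = len(all_clip_scores[0])
    averaged = [
        sum(clip[k] for clip in all_clip_scores) / num_clips
        for k in range(num_classes)
    ]
    return argmax(averaged)
```

A single unrepresentative clip can yield a wrong clip-level prediction that averaging over several clips corrects, which is why video-level accuracy tends to be higher.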

5 Experimental Results and Analysis

In this section, we provide detailed analysis of the performance of 2D and 3D models (Sec. 5.1), their SOTA results and transferability (Sec. 5.2) and their spatio-temporal effects (Sec. 5.3) as well as the temporal dynamics of datasets (Sec. 5.4). For clarity, from now on, we refer to each of I3D, S3D and TAM as one type of general video architectures illustrated in Fig. 2. We name a specific model by architecture-backbone[-tp] where tp indicates that temporal pooling is applied. For example, I3D-ResNet18-tp is a 3D model based on ResNet18 with temporal pooling.

5.1 Performance Analysis of 2D and 3D Models

For each architecture, we experiment with 3 backbones (InceptionV1, ResNet18 and ResNet50) and two scenarios (w/ and w/o temporal pooling) on three datasets. In each case, 8, 16, 32 and 64 frames are considered as input. This results in a large number of models to train, many of which have not been explored in the original papers. We report clip-level top-1 accuracies w/o temporal pooling in Fig. 3. Based on these models, we study the effects of several factors on 2D and 3D models, including i) input sampling, ii) backbone network, iii) input length, iv) temporal pooling, and v) temporal aggregation. Due to space limits, we mainly focus on iv) and v) while briefly summarizing the results of i) to iii) below. Complete results of all the models are included in the Supplemental.

Input Sampling. Two sampling strategies are widely adopted in action recognition to create model inputs. The first one, Uniform sampling, which is often seen in 2D models, divides a video into multiple equal-length segments and then randomly selects one frame from each segment. The other method used by 3D models, dense sampling, instead directly takes a set of continuous frames as the input.
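The two strategies can be sketched as functions that return frame indices. This is a minimal illustration with hypothetical function names; choosing the start of the dense clip at random and clamping indices for very short videos are assumptions, since the text only says that a set of continuous frames is taken.

```python
import random
from typing import List


def uniform_sample(num_video_frames: int, num_segments: int, rng=random) -> List[int]:
    """Split the video into equal-length segments and pick one random frame from each."""
    seg_len = num_video_frames / num_segments
    indices = []
    for s in range(num_segments):
        start = int(s * seg_len)
        end = max(start + 1, int((s + 1) * seg_len))  # keep the range non-empty
        indices.append(rng.randrange(start, end))
    return indices


def dense_sample(num_video_frames: int, clip_len: int, rng=random) -> List[int]:
    """Take clip_len consecutive frames from a random start (assumed)."""
    start = rng.randint(0, max(0, num_video_frames - clip_len))
    # Clamp so that videos shorter than clip_len repeat their last frame.
    return [min(start + i, num_video_frames - 1) for i in range(clip_len)]
```

With 8 input frames on an 80-frame video, `uniform_sample` spans the whole video with one frame per 10-frame segment, whereas `dense_sample` covers only 8 consecutive frames of it.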

To better understand how input sampling affects model performance, we trained I3D-ResNet18 (3D) and TAM-ResNet18 (2D) on Mini-Kinetics and Mini-SSV2 using both dense and uniform sampling. The clip-level and video-level accuracies of I3D-ResNet18 w/ and w/o temporal pooling are reported in Fig. 4 (a), and the results of TAM-ResNet18 are shown in Fig. 4 (b).

Figure 4: Performance comparison between Uniform Sampling (U) and Dense Sampling (D). (a) The tested model is I3D-ResNet18. (b) The tested model is TAM-ResNet18. Solid bars show the clip-level accuracy, while transparent bars indicate the improvement from video-level (multi-clip) evaluation. Best viewed in color.

Fig. 4 shows that uniform sampling (blue) yields better clip-level accuracies than dense sampling (orange) under all circumstances. This is not surprising, as dense sampling only uses a part of the test video in the clip-level evaluation. On the other hand, when multiple clips are used for inference, the performance of models trained with dense sampling is significantly boosted, by 6%-15% on Mini-Kinetics and 5%-20% on Mini-SSV2. This suggests that dense sampling can learn spatiotemporal features effectively, but requires more inference time to achieve competitive results. Unlike dense sampling, uniform sampling gains limited benefit from video-level evaluation, especially when the number of input frames is large.

Table 4 further shows that uniform sampling in general works better than dense sampling. The only exception is 3D models (I3D) on Mini-Kinetics, where dense sampling is 1%-2% better than uniform sampling. While dense sampling performs well on Kinetics, its high computational evaluation cost makes it inappropriate for large-scale analysis. Thus, all analysis in this paper is based on uniform sampling and clip-level evaluation unless otherwise stated.

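Dataset | Approach | Backbone | Dense Top-1 | Dense Top-5 | Uniform Top-1 | Uniform Top-5
Mini-Kinetics400 | I3D | ResNet18 | 69.3 | 88.3 | 68.0 | 87.6
Mini-Kinetics400 | I3D | InceptionV1 | 70.8 | 89.6 | 69.1 | 88.3
Mini-Kinetics400 | TAM | ResNet18 | 67.7 | 87.5 | 69.8 | 88.9
Mini-Kinetics400 | TAM | ResNet50 | 74.2 | 91.0 | 75.2 | 91.0
Mini-SSV2 | I3D | ResNet18 | 42.1 | 72.1 | 57.1 | 82.8
Mini-SSV2 | I3D | InceptionV1 | 46.1 | 75.4 | 58.0 | 83.8
Mini-SSV2 | TAM | ResNet18 | 47.2 | 77.1 | 60.6 | 86.2
Mini-SSV2 | TAM | ResNet50 | 52.4 | 80.9 | 67.2 | 89.8
(Dense and Uniform columns report video-level accuracy.)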

Table 4: Video-level model accuracies on Mini-Kinetics and Mini-SSV2.

Backbone Network. If we look at the overall spatiotemporal representation capability of the three backbones in Fig. 5, we observe a clear pattern that ResNet50 > InceptionV1 > ResNet18, regardless of the spatiotemporal modules used. This is aligned with what is observed in other works, namely that stronger backbones lead to better results for action recognition [I3D:carreira2017quo]. The overall accuracy of a model, however, does not necessarily mean that the model captures temporal information. In Section 5.3 of the main paper, we present a method to disentangle the temporal component of the model from its spatial counterpart.

Figure 5: Backbone effects. Each mark represents the performance gain of a model with regard to the baseline model using ResNet18 as backbone. For clarity here, we do not separate models using different number of input frames. As opposed to ImageNet performance, the results indicate that InceptionV1 is a stronger backbone than ResNet18. Best viewed in color.

Input Length. We generally found that longer input frames lead to better results, which is more pronounced for those models with temporal pooling. However, for those models without temporal pooling, the performance improvement after 32 frames is small on all the datasets.

Temporal Pooling. Temporal pooling is applied to 3D models such as I3D to reduce computational complexity, but it is usually skipped by more efficient 2D models. Fig. 6 shows the performance gaps between models w/ and w/o temporal pooling across different backbones and architectures. As can be seen, temporal pooling in general counters the effectiveness of temporal modeling and hurts the performance of action models, just like what spatial pooling does to object recognition and detection. For this reason, more recent 3D-CNN approaches such as SlowFast and X3D drop temporal pooling and rely on other techniques for reducing computation. Similarly, one important reason for the prior finding that 3D models are inferior to C2D (pure spatial models) on Kinetics and MiT is that those comparisons neglect the negative impact of temporal pooling on 3D models. As shown in Sec. 5.2, I3D w/o temporal pooling is competitive with the SOTA approaches.

Interestingly, TSN is the only architecture benefiting from temporal pooling, demonstrating a large boost in performance on Mini-SSV2 (20%) and Mini-MiT (3%-5%). Also, as the number of input frames increases, the improvement is more pronounced. Even though TSN is also negatively impacted by temporal pooling on Mini-Kinetics, it suffers the least and starts seeing positive gains after 32 frames. To further confirm this, we trained a 32-frame TSN model with temporal pooling on Kinetics. This model (TSN-R50 in Fig. 1) achieves a top-1 accuracy of 74.9%, 5.1% higher than the version w/o temporal pooling and only about 2.0% shy of the SOTA results. In summary, temporal pooling provides TSN with the simplest form of exchanging information across frames. The consistent improvements from temporal pooling across all datasets provide strong evidence that temporal modeling is always helpful for action recognition.

Figure 6: Accuracy gain after adding temporal pooling. Temporal pooling significantly hurts the performance of all models except TSNs. Best viewed in color.

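Dataset | Frames | ResNet18 None | ResNet18 Avg. | ResNet18 Max. | ResNet18 Conv. | ResNet18 TAM | InceptionV1 None | InceptionV1 Conv. | InceptionV1 TAM | ResNet50 None | ResNet50 Conv. | ResNet50 TAM | ResNet50 TSM | ResNet50 NLN
Mini-SSV2 | 8 | 29.6 | 39.5 | 43.9 | 58.1 | 59.1 | 33.1 | 58.2 | 59.7 | 33.9 | 61.6 | 65.4 | 64.1 | 53.0
Mini-SSV2 | 16 | 30.9 | 43.5 | 48.0 | 62.6 | 62.1 | 34.7 | 63.7 | 63.9 | 35.3 | 65.7 | 68.6 | 67.4 | 55.0
Mini-Kinetics | 8 | 67.9 | 64.1 | 65.2 | 67.8 | 69.1 | 70.4 | 68.3 | 68.8 | 72.1 | 71.5 | 74.1 | 74.1 | 73.7
Mini-Kinetics | 16 | 68.5 | 66.0 | 67.4 | 70.8 | 71.3 | 70.5 | 70.7 | 70.0 | 72.5 | 73.4 | 76.4 | 75.6 | 74.5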

Table 5: Performance of different temporal aggregation strategies w/o temporal pooling.

Temporal Aggregation. The essence of temporal modeling is how it aggregates temporal information. The 2D architecture offers great flexibility in temporal modeling. For example, TSM [TSM:lin2018temporal] and TAM [bLVNetTAM] can be easily inserted into a CNN for learning spatio-temporal features. Here we analyze several basic temporal aggregations on top of the 2D architecture, including 1D convolution (Conv, i.e. S3D [S3D]), 1D depthwise convolution (dw Conv, i.e. TAM), TSM, max (Max) and average (Avg) pooling. We also consider the non-local network module (NLN) [Wang2018NonLocal] for its ability to capture long-range temporal video dependencies. We add 3 NLN modules and 2 NLN modules at stage 2 and stage 3 of TSN-ResNet50, respectively, as in [Wang2018NonLocal].
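To make the aggregation choices concrete, the sketch below implements three of them on a toy feature layout `x[t][c]` (frame `t`, channel `c`). It is an illustration only, not the code of TSM or TAM: the function names are hypothetical, and the window of three frames, the zero padding at clip boundaries, and the number of shifted channels (`fold`) are assumptions.

```python
from typing import List

Features = List[List[float]]  # x[t][c]: frame index t, channel index c


def temporal_shift(x: Features, fold: int) -> Features:
    """TSM-style shift: move `fold` channels forward in time and `fold` backward."""
    num_frames, num_channels = len(x), len(x[0])
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t - 1  # these channels receive the previous frame
            elif c < 2 * fold:
                src = t + 1  # these channels receive the next frame
            else:
                src = t      # remaining channels stay in place
            if 0 <= src < num_frames:  # out-of-range positions keep the zero pad
                out[t][c] = x[src][c]
    return out


def depthwise_temporal_conv(x: Features, kernels: List[List[float]]) -> Features:
    """1D depthwise conv over time: channel c uses only kernels[c] = [w_prev, w_cur, w_next]."""
    num_frames, num_channels = len(x), len(x[0])
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            for k, weight in enumerate(kernels[c]):
                src = t + k - 1
                if 0 <= src < num_frames:
                    out[t][c] += weight * x[src][c]
    return out


def temporal_pool(x: Features, mode: str = "avg") -> Features:
    """Average or max over a 3-frame window (stride 1), using existing neighbours only."""
    num_frames, num_channels = len(x), len(x[0])
    out = []
    for t in range(num_frames):
        first, last = max(0, t - 1), min(num_frames, t + 2)
        row = []
        for c in range(num_channels):
            window = [x[s][c] for s in range(first, last)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out
```

A regular 1D temporal convolution would additionally mix information across channels, which is where its extra parameters and FLOPs relative to the depthwise variant come from.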

Table 5 shows the results of using different temporal aggregations as well as those of TSN (i.e. w/o any temporal aggregation). As can be seen from the results, average and max pooling are consistently worse than the other methods, suggesting that effective temporal modeling is required for achieving competitive results, even on datasets such as Kinetics where temporal information is thought to be non-essential for recognition. On the other hand, TAM and TSM, while being simple and efficient, demonstrate better performance than the regular 1D convolution and the NLN module, which have more parameters and FLOPs. Interestingly, the NLN module does not perform as well as expected on Mini-SSV2. This is possibly because NLN models temporal dependencies by matching spatial features between frames, which are weak in Mini-SSV2 data.

5.2 Analysis of SOTA Results

5.2.1 Benchmarking of SOTA Approaches

Model | Pretrain dataset | FLOPs | Kinetics | SSV2
I3D-ResNet50 | ImageNet | 335.3G | 76.61 | 62.84
TAM-ResNet50 | ImageNet | 171.5G | 76.18 | 63.83
SlowFast-ResNet50-8×8 | None | 65.7G | 76.40 | 60.10
SlowFast-ResNet50-8×8 | None | 65.7G | 77.00 | -
SlowFast-ResNet50-16×8 | Kinetics | 65.7G | - | 63.0
CorrNet-ResNet50 | None | 115G | 77.20 | -
I3D-ResNet101 | ImageNet | 654.7G | 77.80 | 64.29
TAM-ResNet101 | ImageNet | 327.1G | 77.61 | 65.32
SlowFast-ResNet101-8×8 | None | 125.9G | 76.72 | -
SlowFast-ResNet101-8×8 | None | 125.9G | 78.00 | -
SlowFast-ResNet101-16×8 | None | 213G | 78.90 | -
CSN-ResNet101 | None | 83G | 76.70 | -
CorrNet-ResNet101 | None | 224G | 79.20 | -
X3D-L | None | 24.8G | 77.50 | -
X3D-XL | None | 48.4G | 79.10 | -
Notes: some networks cannot be initialized from ImageNet due to their structure; some results are retrained by ourselves, while others are reported by the authors of the respective papers.

Table 6: Performance of SOTA models.

To understand the progress in action recognition more precisely, we further conduct a more rigorous benchmarking effort including I3D, TAM and SlowFast on the full datasets. I3D represents the prior SOTA approach for action recognition, while SlowFast and TAM are arguably the existing SOTA methods on Kinetics and Something-Something, respectively. To ensure an apples-to-apples comparison, we follow the practice of SlowFast to train all the models and select ResNet50 as the backbone. During training, we take 64 consecutive frames from a video and sample every other frame as the input, i.e., 32 frames are fed to the model. The shorter side of a video is randomly resized to the range of [256, 320] while keeping the aspect ratio, and then we randomly crop a 224×224 spatial region as the training input. We trained all models for 196 epochs, using a total batch size of 1024 with 128 GPUs, i.e. 8 samples per GPU. Batch normalization is computed on those 8 samples. We linearly warm up the learning rate from 0.01 to 1.6 over the first 34 epochs and then apply a half-period cosine annealing schedule for the remaining epochs. We use synchronized SGD with momentum 0.9 and weight decay 0.0001. For SSV2, we switch to uniform sampling since it achieves better accuracy for all models. We also follow TSN [TSN:wang2016temporal] to augment the data and change the weight decay to 0.0005. During evaluation, we uniformly sample 10 clips from a video, resize the shorter side of each clip to 256, and take 3 crops of 256×256 from each clip. The prediction for a video is obtained by averaging over the resulting 30 predictions. For SSV2, we only sample 2 clips for testing since SSV2 videos are shorter.
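The learning rate schedule described above can be sketched as follows. This is an illustrative reading, not the authors' training code: it assumes the rate is updated once per epoch, that the cosine phase decays from the peak rate to zero, and the function and parameter names are hypothetical.

```python
import math


def learning_rate(epoch, total_epochs=196, warmup_epochs=34,
                  warmup_start=0.01, peak=1.6):
    """Linear warm-up followed by half-period cosine annealing.

    Assumptions: per-epoch granularity and a final learning rate of 0.
    """
    if epoch < warmup_epochs:
        # Linear interpolation from warmup_start to peak.
        return warmup_start + (peak - warmup_start) * epoch / warmup_epochs
    # Fraction of the annealing phase that has elapsed, in [0, 1].
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(197)]
```

With these settings the rate rises to 1.6 at epoch 34 and then falls smoothly along half a cosine period over the remaining 162 epochs.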

As can be seen from Table 6, using a stronger backbone (ResNet50) and removing temporal pooling brings I3D on par with the state-of-the-art approaches in accuracy on both benchmark datasets. Our results show that I3D remains one of the most competitive approaches for action recognition, and that the progress in accuracy on action recognition is largely due to the use of more powerful backbone networks. Nevertheless, we do observe that recent approaches have made significant progress on computational efficiency (FLOPs). The comparable performance of I3D and TAM on both datasets also implies that the two types of models, though structurally different, may behave similarly in spatio-temporal modeling.

5.2.2 Model Transferability

We further compare the transferability of the three models trained above on four small-scale datasets: UCF101 [ucf101:Soomro2012], HMDB51 [HMDB:Kuehne2011], Jester [Materzynska_2019_ICCV], and Mini-SSV2. We follow the same training setting as in Section 4 and finetune for 45 epochs with a cosine annealing learning rate schedule starting at 0.01; furthermore, since those are 32-frame models, we train them with a batch size of 48 and synchronized batch normalization.

Table 7 shows the results, indicating that all three models have very similar performance on the downstream tasks (differences within 3 percentage points). In particular, I3D performs on par with SOTA approaches like TAM and SlowFast in transfer learning (e.g., I3D obtains the best accuracy of 97.12% on UCF101), which once again corroborates that a significant leap has been made in the efficiency of action recognition, but not in accuracy.

| Model | UCF101 | HMDB51 | Jester | Mini-SSV2 |
|---|---|---|---|---|
| I3D-ResNet50 | 97.12 | 72.32 | 96.39 | 65.86 |
| TAM-ResNet50 | 95.05 | 71.67 | 96.35 | 66.91 |
| SlowFast-ResNet50-8×8 | 95.67 | 74.61 | 96.75 | 63.93 |

Table 7: Top-1 Acc. of Transferability study from Kinetics

5.3 Analysis of Spatio-temporal Effects

So far we have only looked at the overall spatio-temporal effects of a model (i.e. top-1 accuracy) in our analysis. Here we further disentangle the spatial and temporal contributions of a model to understand its ability of spatio-temporal modeling. Doing so provides great insights into which information, spatial or temporal, is more essential to recognition. We treat TSN w/o temporal pooling as the baseline spatial model as it does not model temporal information. As shown in Fig. 7, TSN can evolve into different types of spatio-temporal models by adding temporal modules on top of it. E.g., TSN-ResNet50 can get to TAM-ResNet50-tp by applying temporal pooling first and then TAM or going the other way around. With this, we compute the spatial and temporal contributions of a model as follows.

Figure 7: Evolving TSN to different spatiotemporal models by adding temporal modules (e.g. temporal pooling or aggregation) on Mini-SSV2. The numbers in parenthesis are the model accuracy while the bold numbers are the performance gain and temporal improvement (see Eq. 1) when evolving to another model.

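Let $S_a^b(k)$ be the accuracy of a model of some architecture $a$ that is based on a backbone $b$ and takes $k$ frames as input. For instance, $S_{\mathrm{I3D}}^{\mathrm{ResNet50}}(16)$ is the accuracy of a 16-frame I3D-ResNet50 model. Then the spatial contribution $\Phi_a^b$ and temporal improvement $\Psi_a^b$ of a model ($k$ is omitted here for clarity) are given by

$$\Phi_a^b = \frac{S_{\mathrm{TSN}}^b}{\max\left(S_a^b,\, S_{\mathrm{TSN}}^b\right)}, \qquad \Psi_a^b = \frac{S_a^b - S_{\mathrm{TSN}}^b}{100 - S_{\mathrm{TSN}}^b}. \tag{1}$$

Note that $\Phi_a^b$ is between 0 and 1. When $\Psi_a^b < 0$, it indicates that temporal modeling is harmful to model performance. For example, in Fig. 7, the temporal improvement of TAM-ResNet50 is obtained from its accuracy and that of the TSN-ResNet50 baseline via Eq. 1, and its spatial contribution (not shown in Fig. 7) follows in the same way. We further combine $\Phi_a^b$ and $\Psi_a^b$ across all the models with different backbone networks to obtain the average spatial and temporal contributions of a network architecture, as shown below.

$$\bar{\Phi}_a = \frac{1}{Z_\Phi} \sum_{b \in B} \sum_{k \in K} \Phi_a^b(k), \qquad \bar{\Psi}_a = \frac{1}{Z_\Psi} \sum_{b \in B} \sum_{k \in K} \Psi_a^b(k), \tag{2}$$

where $B$ = {InceptionV1, ResNet18, ResNet50}, $K$ = {8, 16, 32, 64}, and $Z_\Phi$ and $Z_\Psi$ are the normalization factors.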
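As a worked illustration of these two measures, the sketch below assumes that the spatial contribution is the TSN accuracy divided by the larger of the TSN and model accuracies, and that the temporal improvement is the accuracy gain over TSN divided by the remaining headroom (100 minus the TSN accuracy). These definitions are consistent with the magnitudes in Table 8 but should be read as an assumption. The accuracies in the example are hypothetical, and the averaging assumes the normalization factor is simply the number of (backbone, frame count) combinations.

```python
def spatial_contribution(acc_model, acc_tsn):
    """Share of the model's accuracy already reached by the spatial baseline."""
    return acc_tsn / max(acc_model, acc_tsn)


def temporal_improvement(acc_model, acc_tsn):
    """Gain over the spatial baseline, relative to the remaining headroom.
    Negative when temporal modeling hurts."""
    return (acc_model - acc_tsn) / (100.0 - acc_tsn)


def average(values):
    """Plain mean; assumes the normalization factor is the number of terms."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical top-1 accuracies (in percent) for one backbone and frame count.
acc_tsn, acc_model = 30.0, 65.0
phi = spatial_contribution(acc_model, acc_tsn)   # 30 / 65
psi = temporal_improvement(acc_model, acc_tsn)   # 35 / 70

# Averaging over several hypothetical (backbone, frames) pairs of accuracies.
pairs = [(65.0, 30.0), (60.0, 28.0), (68.0, 32.0)]
phi_bar = average(spatial_contribution(m, t) for m, t in pairs)
psi_bar = average(temporal_improvement(m, t) for m, t in pairs)
```

Normalizing the gain by the headroom keeps the temporal improvement comparable across datasets whose baselines differ widely in accuracy.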

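| Dataset | Metric | I3D | S3D | TAM |
|---|---|---|---|---|
| Mini-SSV2 | $\bar{\Phi}_a$ | 0.53 | 0.53 | 0.52 |
| Mini-SSV2 | $\bar{\Psi}_a$ | 0.46 | 0.45 | 0.47 |
| Mini-SSV2 | $\bar{\Psi}_a^{tp}$ | 0.38 | 0.38 | 0.37 |
| Mini-Kinetics | $\bar{\Phi}_a$ | 0.97 | 0.97 | 0.96 |
| Mini-Kinetics | $\bar{\Psi}_a$ | 0.06 | 0.08 | 0.09 |
| Mini-Kinetics | $\bar{\Psi}_a^{tp}$ | -0.08 | -0.10 | -0.12 |
| Mini-MiT | $\bar{\Phi}_a$ | 0.89 | 0.91 | 0.87 |
| Mini-MiT | $\bar{\Psi}_a$ | 0.04 | 0.03 | 0.04 |
| Mini-MiT | $\bar{\Psi}_a^{tp}$ | 0.02 | 0.02 | 0.04 |

$\bar{\Phi}_a$: the average spatial contribution. $\bar{\Psi}_a$: the improvement from temporal aggregation only. $\bar{\Psi}_a^{tp}$: the improvement from combining temporal aggregation and pooling.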

Table 8: Effects of spatiotemporal modeling.

Table 8 shows the results of the average spatial contribution $\bar{\Phi}_a$ and the average temporal improvement $\bar{\Psi}_a$ for the three spatio-temporal representations: I3D, S3D and TAM. All three representations behave similarly: their spatial modeling contributes slightly more than temporal modeling on Mini-SSV2, much more on Mini-MiT, and dominantly on Mini-Kinetics. This explains why a model lacking temporal modeling like TSN can perform well on Mini-Kinetics, but fails badly on Mini-SSV2. Note that similar observations have been made in the literature, but not in a quantitative way like ours. Furthermore, while all the approaches indicate the utmost importance of spatial modeling on Mini-Kinetics, the results of $\bar{\Psi}_a$ suggest that temporal modeling is more effective on Mini-Kinetics than on Mini-MiT for both 2D and 3D approaches.

We also observe that temporal pooling reduces the effectiveness of temporal modeling for all the approaches: the temporal improvement with pooling, $\bar{\Psi}_a^{tp}$, is lower than $\bar{\Psi}_a$ in nearly all cases. Such damage is especially substantial on Mini-Kinetics, as indicated by the negative values of $\bar{\Psi}_a^{tp}$.

We further plot the temporal gains (i.e. the gap of top-1 accuracy between a model and the corresponding TSN baseline) of I3D, TAM and SlowFast using SOTA models in Section 5.2. As can be seen from Fig. 8, I3D correlates well with TAM with a coefficient of 84%, indicating that 2D and 3D models learn similar temporal representations.

Figure 8: Temporal gains of I3D, TAM and SlowFast w.r.t TSN based on 100 categories randomly selected from Kinetics. The numbers in parentheses indicate the correlation coefficients with I3D computed from all categories. Best viewed in color.
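The per-class comparison behind Fig. 8 can be sketched as follows. The text does not state which correlation coefficient is used, so Pearson's is assumed here; the per-class accuracies are hypothetical and the function names are illustrative.

```python
import math


def temporal_gains(acc_model, acc_tsn):
    """Per-class gap in top-1 accuracy between a model and the TSN baseline."""
    return [m - t for m, t in zip(acc_model, acc_tsn)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical per-class accuracies (percent) for four classes.
tsn = [80.0, 40.0, 55.0, 70.0]
i3d = [82.0, 60.0, 65.0, 69.0]
tam = [81.0, 63.0, 62.0, 70.0]

r = pearson(temporal_gains(i3d, tsn), temporal_gains(tam, tsn))
```

A coefficient close to 1 means that the classes which benefit most from temporal modeling in one architecture are largely the same ones that benefit in the other.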

5.4 Analysis of Data Temporality

5.4.1 Human vs. Machine

Recently, a temporal dataset and a static dataset, selected by human annotators from Kinetics and SSV2, were constructed for temporality analysis [temporal_dataset]. The temporal dataset consists of classes where temporal information matters, while the static dataset includes classes where temporal information is redundant. We use a similar methodology to identify temporal and static classes in Kinetics and SSV2, but based on machine perception rather than human perception. Let $s_c^{\mathrm{TAM}}$ and $s_c^{\mathrm{TSN}}$ be the prediction scores of a class $c$ from models TAM-ResNet50 and TSN-ResNet50, respectively. Here TSN is the baseline spatial model. We first define the temporal gain of class $c$ by $g_c = s_c^{\mathrm{TAM}} - s_c^{\mathrm{TSN}}$. The temporal gain measures the improvement in accuracy of a class due to temporal modeling. We then sort all the action classes of a dataset by $g_c$ and select the top-$k$ classes as temporal classes. For static classes, we simply pick the top-$k$ classes based on the accuracy of TSN. To match the dataset size in [temporal_dataset], $k$ is set to 32 for Kinetics and 18 for SSV2.
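The selection procedure above amounts to two rankings, sketched below. The class names and accuracies are hypothetical, and the function names are illustrative; note that nothing in this sketch prevents a class from appearing in both lists, a case the text does not discuss.

```python
def temporal_gains(acc_tam, acc_tsn):
    """Temporal gain per class: TAM accuracy minus TSN accuracy."""
    return {c: acc_tam[c] - acc_tsn[c] for c in acc_tam}


def select_classes(acc_tam, acc_tsn, k):
    """Return (temporal, static) class lists of length k.

    Temporal classes: the k largest temporal gains.
    Static classes: the k highest accuracies of the TSN baseline.
    """
    gains = temporal_gains(acc_tam, acc_tsn)
    temporal = sorted(gains, key=gains.get, reverse=True)[:k]
    static = sorted(acc_tsn, key=acc_tsn.get, reverse=True)[:k]
    return temporal, static


# Hypothetical per-class accuracies in percent.
acc_tsn = {"a": 90.0, "b": 20.0, "c": 50.0, "d": 70.0}
acc_tam = {"a": 91.0, "b": 60.0, "c": 75.0, "d": 68.0}

temporal, static = select_classes(acc_tam, acc_tsn, k=2)
```

In the paper's setting, `k` would be 32 for Kinetics and 18 for SSV2.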

Figure 9: Temporal gains of all the classes in Kinetics before (blue) and after (orange) the temporal classes are removed from the original dataset. Note that the first 40 classes are excluded from training in the first round. The results suggest that temporality is not intrinsic to action classes, but mostly data-dependent. Best viewed in color.

| | SSV2 Temporal | SSV2 Static | Kinetics Temporal | Kinetics Static |
|---|---|---|---|---|
| Class overlap ratio | 38.89% | 11.11% | 21.88% | 3.12% |
| Human [temporal_dataset] | 79.7% (+47.8%) | 71.1% (+31.7%) | 75.0% (+8.4%) | 76.4% (+8.2%) |
| Machine (Ours) | 80.6% (+68.9%) | 83.4% (+24.7%) | 73.5% (+22.1%) | 92.7% (-1.7%) |

Table 9: The class overlap ratio, recognition accuracies and average temporal gains (in parenthesis) of the temporal and static datasets identified by human and machine.

Table 9 shows the overlap percentages of the temporal and static datasets identified by human and machine. It is clear that they do not agree well with each other, especially on the Kinetics dataset. We further compare the average temporal gains of the temporal and static datasets in Table 9. As can be observed, the temporal classes gain more from temporal modeling than the static classes, suggesting that temporal information plays an important role in the recognition of temporal classes. While the accuracy on the temporal classes is similar for Human and Machine, the accuracy on the static classes identified by Machine is significantly higher than on those identified by Human. This suggests that the models are more highly optimized to capture spatial information than temporal information. Overall, the large discrepancies on both datasets imply that the temporal information perceived by humans as useful for recognition might not be the same as what an action model attempts to learn.
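A minimal sketch of the class overlap ratio follows, assuming it is the number of classes shared by the human-selected and machine-selected sets divided by the set size. The percentages in Table 9 are consistent with 7 and 2 shared classes out of 18 for SSV2 and 7 and 1 out of 32 for Kinetics, but these counts are inferred from the percentages rather than stated in the text. The class identifiers below are placeholders.

```python
def overlap_ratio(human_classes, machine_classes):
    """Fraction of the human-selected classes also selected by the machine."""
    human, machine = set(human_classes), set(machine_classes)
    return len(human & machine) / len(human)


# Placeholder class ids: two sets of 18 classes sharing 7 members.
human_temporal = list(range(0, 18))
machine_temporal = list(range(11, 29))

ratio = overlap_ratio(human_temporal, machine_temporal)
percent = round(100 * ratio, 2)
```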

5.4.2 Is Temporality Intrinsic of Data?

The disagreement between machine and human perceptions of temporality raises an interesting question: is temporality an intrinsic property of action data that is learnable? If the answer is yes, then we could make a dataset static by taking out the temporal classes from it. In other words, we would not expect a spatio-temporal approach like TAM to yield significant temporal gains on such a reduced dataset. To verify this hypothesis, we first identify the top 40 temporal classes from Kinetics, i.e. those with the largest temporal gains. We then remove these temporal classes and re-train TSN and TAM on the smaller dataset. We repeat this process twice, and report the results in Table 10, which includes the Average Temporal Gain (ATG) of each round for all the classes (ATG-all) and for the temporal classes (ATG-tc).
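The removal experiment can be outlined as a short loop. This is only a sketch: `train_and_eval` is a hypothetical callable standing in for re-training TSN and TAM and returning their per-class accuracies, and ATG-tc is assumed to be the mean gain over the classes ranked as temporal in the current round.

```python
def removal_rounds(classes, train_and_eval, n_remove=40, rounds=2):
    """Repeatedly drop the most temporal classes and re-measure gains.

    train_and_eval(classes) -> (acc_tam, acc_tsn), two dicts mapping each
    class to its accuracy. Returns one (num_classes, atg_all, atg_tc)
    tuple for the original dataset and for each subsequent round.
    """
    remaining = list(classes)
    history = []
    for _ in range(rounds + 1):
        acc_tam, acc_tsn = train_and_eval(remaining)
        gains = {c: acc_tam[c] - acc_tsn[c] for c in remaining}
        # Classes with the largest temporal gains in this round.
        top = sorted(remaining, key=gains.get, reverse=True)[:n_remove]
        atg_all = sum(gains.values()) / len(gains)
        atg_tc = sum(gains[c] for c in top) / len(top)
        history.append((len(remaining), atg_all, atg_tc))
        removed = set(top)
        remaining = [c for c in remaining if c not in removed]
    return history


def fake_train_and_eval(classes):
    """Stand-in for training: class i gains exactly i points from TAM."""
    acc_tsn = {c: 50.0 for c in classes}
    acc_tam = {c: 50.0 + c for c in classes}
    return acc_tam, acc_tsn


history = removal_rounds(range(10), fake_train_and_eval, n_remove=2, rounds=2)
```

In the stand-in above the gains are fixed per class, so both averages necessarily shrink after each removal; the experiment asks whether real models behave this way once they are re-trained on the reduced data.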

Firstly, we observe that neither ATG-all nor ATG-tc decreases significantly at each round as hypothesized, hinting that the dataset is not becoming more static even though temporal classes are repeatedly removed from it. Secondly, as shown in Fig. 9, it is surprising that the majority of the classes with more temporality in the original dataset (i.e. classes between 41 and 80) present lower temporal dynamics in the reduced dataset. Instead, many classes with little or no temporality now turn out to be substantially more temporal. This suggests that the temporality of an action is not something viewed as inherent by the machine, and it may not be consistently learned by a model. Nevertheless, advanced spatio-temporal models seem to be able to learn data-dependent temporality flexibly as needed.

| | # of classes | TAM Acc. | ATG-all | ATG-tc |
|---|---|---|---|---|
| Original | 400 | 74.24% | 5.07% | 16.93% |
| Round 1 | 360 | 74.91% | 4.16% | 13.84% |
| Round 2 | 320 | 76.39% | 3.79% | 12.68% |

Table 10: Results of temporality analysis on Kinetics by removing temporal classes.

6 Conclusion

In this paper, we conducted a comprehensive comparative analysis of several representative CNN-based video action recognition approaches with different backbones and temporal aggregations. Our extensive analysis enables better understanding of the differences and spatio-temporal effects of 2D-CNN and 3D-CNN approaches. It also provides significant insights with regard to the efficacy of spatio-temporal representations for action recognition.

7 Acknowledgments

This work is supported by IARPA via DOI/IBC contract number D17PC00341. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.


Appendix A Supplementary Materials

In Section B, we provide more details about our implementation of all the approaches in the paper. Figure 10 shows the top-1 accuracy of all models (three backbones and four video architectures) with and without temporal pooling on three mini-datasets.

Appendix B Implementation Details

To unify the framework, we slightly modify each method; the differences are described below.

We follow the original published papers as much as we can to implement the approaches in our analysis. However, due to the differences in backbones, some modifications are necessary to ensure a fair comparison under a common experimental framework. Here we describe how we build the networks including three backbones (InceptionV1, ResNet18 and ResNet50), four video architectures (I3D, S3D, TAM and TSN), and where to perform temporal pooling.

For the three backbones, we used the 2D models available in the torchvision repository (googlenet, resnet18, resnet50), and then used the weights in the model zoo to initialize the models, either through inflation (I3D and S3D) or by directly loading them (TAM and TSN). Note that, for inflation, we simply copy the weights along the time dimension. Moreover, we always perform the same number of temporal poolings at similar locations across all backbones. For each backbone, there are five positions at which spatial pooling is performed; we add max temporal pooling (kernel size 3) along with the last three spatial poolings.
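The inflation step can be illustrated on a single 2D kernel slice represented as nested lists. The sketch follows the description above and copies the weights unchanged along the time dimension by default; the optional `rescale` flag, which divides the copies by the temporal kernel size as done in the original I3D paper, and the function name are additions for illustration.

```python
def inflate_2d_kernel(kernel_2d, kt=3, rescale=False):
    """Inflate a [kh][kw] kernel to [kt][kh][kw] by repeating it in time.

    rescale=False copies the weights as they are; rescale=True divides
    each copy by kt so that the summed response on a static clip matches
    the 2D response.
    """
    scale = 1.0 / kt if rescale else 1.0
    return [[[w * scale for w in row] for row in kernel_2d] for _ in range(kt)]


kernel = [[1.0, 2.0],
          [3.0, 4.0]]
copied = inflate_2d_kernel(kernel, kt=3)
rescaled = inflate_2d_kernel(kernel, kt=3, rescale=True)
```

For TAM and TSN no inflation is needed, since their convolutions stay two-dimensional and the pretrained weights can be loaded directly.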

I3D. We follow the original I3D paper to re-implement the network [I3D:carreira2017quo]. We convert all 2D convolutional layers into 3D convolutions and set the kernel size in the temporal domain to 3 while using the same spatial kernel size. For I3D-ResNet-50, we convert the 3×3 convolution in the bottleneck block into a 3×3×3 convolution.

S3D. We follow the idea of the original S3D and R(2+1)D papers to factorize the 3D convolutions in the re-implemented models [S3D, R2plus1D:Tran_2018_CVPR]; thus, each 3D convolution in I3D becomes one 2D spatial convolution and one 1D temporal convolution. Nonetheless, the first convolution of the network is not factorized, as in the original papers. For the InceptionV1 backbone, the difference from the original paper [S3D] is the location of the temporal pooling in the backbone. More specifically, in our implementation, we remove the temporal stride in the first convolutional layer and then add a temporal pooling layer to keep the same temporal downsampling ratio over the model. On the other hand, for the ResNet backbone, we do not follow the R(2+1)D paper [R2plus1D:Tran_2018_CVPR] in expanding the channels to obtain a number of parameters similar to the corresponding I3D models; we simply set the output channels to the original output channel size, which allows us to directly load the ImageNet-pretrained weights into the model.
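To see why this factorization without channel expansion yields fewer parameters than the corresponding 3D convolution, consider the following parameter count. It is a back-of-the-envelope sketch that ignores biases and normalization layers and assumes the spatial convolution maps the input channels to the output channels while the temporal convolution keeps the channel count fixed; the channel sizes in the example are arbitrary.

```python
def conv3d_params(c_in, c_out, kt=3, kh=3, kw=3):
    """Weights in a full kt x kh x kw 3D convolution (no bias)."""
    return c_in * c_out * kt * kh * kw


def factorized_params(c_in, c_out, kt=3, kh=3, kw=3):
    """Weights in a 1 x kh x kw spatial convolution followed by a
    kt x 1 x 1 temporal convolution, with c_out intermediate channels."""
    spatial = c_in * c_out * kh * kw
    temporal = c_out * c_out * kt
    return spatial + temporal


full = conv3d_params(64, 64)
factorized = factorized_params(64, 64)
```

Because the spatial part has exactly the shape of the original 2D convolution, its ImageNet-pretrained weights can be loaded without modification.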

TAM. We follow the original paper to build TAM-ResNet [bLVNetTAM]: the TAM module is inserted on the non-identity path of every residual block. For TAM-InceptionV1, we add a TAM module after every inception module.

Figure 10: Top-1 accuracy of all models with and without temporal pooling on three mini-datasets. The video architectures are separated by color while the backbones by symbol. Best viewed in color.

TSN. It does not have any temporal modeling, so it directly uses 2D models.