[CVPR 2020] Temporal Pyramid Network for Action Recognition
Visual tempo characterizes the dynamics and the temporal scale of an action. Modeling such visual tempos of different actions facilitates their recognition. Previous works often capture the visual tempo through sampling raw videos at multiple rates and constructing an input-level frame pyramid, which usually requires a costly multi-branch network to handle. In this work we propose a generic Temporal Pyramid Network (TPN) at the feature level, which can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner. Two essential components of TPN, the source of features and the fusion of features, form a feature hierarchy for the backbone so that it can capture action instances at various tempos. TPN also shows consistent improvements over other challenging baselines on several action recognition datasets. Specifically, when equipped with TPN, the 3D ResNet-50 with dense sampling obtains a 2% gain on the validation set of Kinetics-400. A further analysis also reveals that TPN gains most of its improvements on action classes that have large variances in their visual tempos, validating the effectiveness of TPN.
While great progress has been made by deep neural networks to improve the accuracy of video action recognition [slowfast, nonlocal, spn, trajconv, trajpool], an important aspect of characterizing different actions is often missed in the design of these recognition networks: the visual tempos of action instances. Visual tempo describes how fast an action goes, which tends to determine the effective duration at the temporal scale for recognition. As shown at the bottom of Figure 1, action classes naturally have different visual tempos (e.g. hand clapping and walking). In some cases the key to distinguishing different action classes is their visual tempos, as they might share high similarities in visual appearance, such as walking, jogging and running. Moreover, as shown at the top of Figure 1, when performing the same action, each performer may act at his/her own visual tempo, due to various factors such as age, mood, and energy level. For example, an elderly man tends to move slower than a younger man, as does a man with a heavier weight. Precise modeling of such intra- and inter-class variances in visual tempos of action instances can potentially bring a significant improvement to action recognition.
Previous attempts [slowfast, dtpn, spn] at extracting the dynamic visual tempos of action instances mainly rely on constructing a frame pyramid, where each pyramid level samples the input frames at a different temporal rate. For example, we can sample the frames of a video instance at two different intervals to construct a two-level frame pyramid whose levels contain different numbers of frames. Subsequently, frames at each level are fed into different backbone subnetworks, and their output features are further combined to make the final prediction. By sampling frames at different rates as input, backbone networks in [slowfast, dtpn] are able to extract features of different receptive fields and represent the input action instance at different visual tempos. These backbone subnetworks thus jointly aggregate temporal information of both fast and slow tempos, handling action instances at different temporal scales.
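To make the input-level frame pyramid concrete, the sketch below selects frame indices at several intervals. It is only an illustration: the function name `frame_pyramid`, the 64-frame clip length and the intervals 16 and 2 are assumptions chosen for the example, not values taken from this text.

```python
def frame_pyramid(num_frames, intervals):
    """Return one list of frame indices per pyramid level.

    Each level keeps every `interval`-th frame, so a larger interval
    gives a shorter, coarser view of the same clip.
    """
    return [list(range(0, num_frames, interval)) for interval in intervals]


# Illustrative values only (assumed, not from the text).
intervals = [16, 2]
levels = frame_pyramid(64, intervals)
for interval, indices in zip(intervals, levels):
    print(f"interval {interval}: {len(indices)} frames, starting {indices[:4]}")
```

In an input-level pyramid each of these index lists would be fed to its own backbone branch, which is the cost that a feature-level pyramid tries to avoid.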
Previous methods [slowfast, dtpn, spn] have obtained noticeable improvements for action recognition; however, it remains computationally expensive to deal with the dynamic visual tempos of action instances at the input frame level. It is not scalable to pre-define the tempos in the input frame pyramid and then feed the frames into multiple network branches, especially when we use a large number of sampling rates. On the other hand, many commonly used models in video recognition, such as C3D and I3D [c3d, kinetics], often stack a series of temporal convolutions. In these networks, as the depth of a layer increases, its temporal receptive field increases as well. As a result, the features at different depths in a single model already capture information of both fast and slow tempos. Therefore, we propose to build a temporal pyramid network (TPN) to aggregate the information of various visual tempos at the feature level. By leveraging the feature hierarchy formed inside the network, the proposed TPN is able to work with input frames fed at a single rate. As an auxiliary module, TPN can be applied in a plug-and-play manner to various existing action recognition models to bring consistent improvements.
In this work we first provide a general formulation of the proposed TPN, where several components are introduced to better capture the information at multiple visual tempos. We then evaluate TPNs on three benchmarks: Kinetics-400 [kinetics], Something-Something V1 & V2 [sthv1] and Epic-Kitchen [epic], with comprehensive ablation studies. Without any bells and whistles, TPNs bring consistent improvements when combined with both 2D and 3D networks. Besides, the ablation study shows that TPN obtains most of its improvements from the action classes that have significant variances in visual tempos. This result verifies our assumption that aggregating features in a single model is sufficient to capture the visual tempos of action instances for video recognition.
Attempts at video action recognition can be divided into two categories. Methods in the first category often adopt a 2D + 1D paradigm, where 2D CNNs are applied over per-frame inputs, followed by a 1D module that aggregates per-frame features. Specifically, two-stream networks in [twostream, tsfusion, cts, cnnforar] utilize two separate CNNs on per-frame visual appearances and optical flows respectively, and an average pooling operation for temporal aggregation. Among its variants, TSN [tsn] proposes to represent video clips by sampling from evenly divided segments. TRN [trn] and TSM [tsm] replace the average pooling operation with an interpretable relational module and a shift module respectively, in order to better capture information along the temporal dimension. However, due to the deployment of 2D CNNs in these methods, semantics of the input frames cannot interact with each other in the early stage, which limits their ability to capture the dynamics of visual tempos. Methods [c3d, 3dforar] in the second category alternatively apply 3D CNNs that stack 3D convolutions to jointly model temporal and spatial semantics. Along this line of research, Non-local Network [nonlocal] introduces a special non-local operation to better exploit the long-range temporal dependencies between video frames. Besides Non-local Network, different modifications of 3D CNNs, including inflating 2D convolution kernels [kinetics] and decomposing 3D convolution kernels [p3d, r21d, r21d_v2], can also boost the performance of 3D CNNs. Other efforts [trajpool, trajconv, improvedtraj, shao2020finegym, shao2020tapos] focus on irregular convolution/pooling for better feature alignment, or study action instances in a fine-grained way. Although the aforementioned methods can better handle temporal information, the large variation of visual tempos remains neglected.
The complex temporal structure of action instances, particularly in terms of the various visual tempos, raises a challenge for action recognition. In recent years, researchers have started exploring this direction. SlowFast [slowfast] hard-codes the variance of visual tempos using an input-level frame pyramid that has level-wise frames sampled at different rates. Each level of the pyramid is also separately processed by a network, where mid-level features of these networks are interactively combined. With the assistance of both the frame pyramid and the level-specific networks, SlowFast can robustly handle the variance of visual tempos. DTPN [dtpn] also samples frames at different frames per second (FPS) to construct a natural pyramidal representation for arbitrary-length input videos. However, such a hard-coding scheme tends to require many input frames, especially when the pyramid scales up. Inspired by feature-level pyramid networks [Hypercolumns, fpn, panet, li2018feature] that deal with the large variance of scales in object detection, we instead leverage the feature hierarchy of a backbone network, handling the variance of visual tempos at the feature level. In this way we can hide the concern about visual tempos inside a single network, and we only need frames sampled at a single rate at the input level.
The visual tempo of an action instance is one of the key factors for recognizing it, especially when other factors are ambiguous. For example, we cannot tell whether an action instance belongs to walking, jogging or running based on its visual appearance alone. However, it is difficult to capture the visual tempos due to their inter- and intra-class variance across different videos. Previous works [slowfast, dtpn, spn] address this issue at the input level. They utilize a frame pyramid that contains frames sampled at pre-defined rates to represent the input video instance at various visual tempos. Since each level of the frame pyramid requires a separate backbone network to handle, such an approach may be computationally expensive, especially when the number of pyramid levels scales up.
Inspired by the observation that features at multiple depths in a single network already cover various visual tempos, we propose a feature-level temporal pyramid network (TPN) for modeling the visual tempo. TPN operates on only a single network no matter how many levels are included in it. Moreover, TPN can be applied to different architectures in a plug-and-play manner. To fully implement TPN, two essential components must be designed properly, namely 1) the feature source and 2) the feature aggregation. We propose the spatial semantic modulation and the temporal rate modulation to control the relative differences of the feature source in Sec.3.1, and construct multiple types of information flows for feature aggregation in Sec.3.2. Finally we show how to adopt TPN for action recognition in Sec.3.3, taking [slowfast] as an exemplar backbone network.
While TPN is built upon a set of $M$ hierarchical features that have increasing temporal receptive fields from bottom to top, there are two alternative ways to collect these features from a backbone network. 1) Single-depth pyramid: a simple way is to choose a feature $F_{base}$ of size $C \times T \times W \times H$ at some depth, and to sample it along the temporal dimension with $M$ different rates $\{r_1, \dots, r_M\}$. We refer to such a TPN as a single-depth pyramid consisting of $\{F_{base}^{(1)}, \dots, F_{base}^{(M)}\}$ of sizes $\{C \times \frac{T}{r_1} \times W \times H, \dots, C \times \frac{T}{r_M} \times W \times H\}$. Features collected in this way lighten the workload of fusion, as they have identical shapes except in the temporal dimension. However, they may be limited in effectiveness, as they represent video semantics only at a single spatial granularity. 2) Multi-depth pyramid: a better way is to collect a set of $M$ features with increasing depths, resulting in a TPN made of $\{F_1, \dots, F_M\}$ of sizes $\{C_1 \times T_1 \times W_1 \times H_1, \dots, C_M \times T_M \times W_M \times H_M\}$, where the dimensions generally differ from level to level. Such a multi-depth pyramid contains richer semantics in the spatial dimensions, yet it requires careful treatment in feature fusion in order to ensure correct information flows between features.
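The single-depth pyramid can be illustrated with a toy temporal sequence. This is a minimal sketch under stated assumptions: each list element stands in for one time step of a feature map, the rates 1, 2 and 4 are made-up values, and `single_depth_pyramid` is a hypothetical helper name.

```python
def single_depth_pyramid(feature, rates):
    """Subsample one temporal sequence of features at several rates.

    `feature` holds one entry per time step; every level keeps the same
    per-step content and differs only in its temporal length.
    """
    return [feature[::rate] for rate in rates]


# Toy feature with T = 8 time steps (placeholder strings instead of tensors).
feature = [f"t{t}" for t in range(8)]
pyramid = single_depth_pyramid(feature, rates=[1, 2, 4])
for level in pyramid:
    print(len(level), level)
```

All levels share the same spatial content here, which mirrors the limitation noted above: only the temporal length varies, while a multi-depth pyramid would also vary the spatial granularity.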
To align the spatial semantics of features in the multi-depth pyramid, a spatial semantic modulation is utilized for TPN. The spatial semantic modulation works in two complementary ways. For each feature except the top-level one, a stack of convolutions with level-specific stride is applied to it, matching its spatial shape and receptive field with the top one. Moreover, an auxiliary classification head is also appended to it to receive stronger supervision, leading to enhanced semantics. The overall objective for a backbone network with our proposed $M$-level TPN thus becomes:
$$\mathcal{L}_{total} = \mathcal{L}_{CE,o} + \sum_{i=1}^{M-1} \lambda_i \mathcal{L}_{CE,i},$$
where $\mathcal{L}_{CE,o}$ is the original Cross-Entropy loss, $\mathcal{L}_{CE,i}$ is the loss for the $i$-th auxiliary head, and $\{\lambda_i\}$ are balancing coefficients. After spatial semantic modulation, features have aligned shapes and consistent semantics in the spatial dimensions. However, they remain uncalibrated in the temporal dimension, for which we introduce the proposed temporal rate modulation.
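The objective described here is a weighted sum of the main cross-entropy loss and one auxiliary loss per non-top level. The following sketch shows only that arithmetic; `total_loss` is a hypothetical helper, and the loss values and the coefficient 0.5 are placeholders, not settings from the text.

```python
def total_loss(main_loss, aux_losses, coefficients):
    """Combine the main loss with weighted auxiliary-head losses."""
    if len(aux_losses) != len(coefficients):
        raise ValueError("need exactly one balancing coefficient per auxiliary head")
    return main_loss + sum(c * l for c, l in zip(coefficients, aux_losses))


# Placeholder numbers: one auxiliary head with an assumed coefficient of 0.5.
loss = total_loss(main_loss=1.0, aux_losses=[2.0], coefficients=[0.5])
print(loss)
```

With no auxiliary heads the function reduces to the original loss, which matches the case of a pyramid that has only a top level.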
Recall that in the input-level frame pyramid used in [slowfast], the sampling rates of frames can be adjusted dynamically to increase its applicability. In contrast, TPN is limited in flexibility, as it operates on features of a backbone network, so that the visual tempos of these features are controlled only by their depths in the original network. To equip TPN with flexibility similar to that of the input-level frame pyramid, a set of hyper-parameters $\{\alpha_i\}_{i=1}^{M}$ is further introduced to TPN for temporal rate modulation. Specifically, $\alpha_i$ denotes that after spatial semantic modulation, the updated feature at the $i$-th level will be temporally downsampled by a factor of $\alpha_i$, using a parametric sub-net. The inclusion of such hyper-parameters enables us to better control the relative differences of features in terms of temporal scales, so that feature aggregation can be conducted more effectively. With some abuse of notation, in the following we refer to $F_i$ of size $C_i \times T_i \times W_i \times H_i$ as the $i$-th feature after both the spatial semantic modulation and the temporal rate modulation.
After collecting and pre-processing the hierarchical features as in Sec.3.1, so that they are dynamic in visual tempos and consistent in spatial semantics, we are ready for the second step of TPN construction: how to aggregate these features. Let $F'_i$ be the aggregated feature at the $i$-th level. Generally there are three basic options:
$$F'_i = \begin{cases} F_i & \text{Isolation Flow} \\ F_i \oplus g(F_{i-1}, T_i / T_{i-1}) & \text{Bottom-up Flow} \\ F_i \oplus g(F_{i+1}, T_i / T_{i+1}) & \text{Top-down Flow} \end{cases}$$
where $\oplus$ denotes element-wise addition. To ensure the compatibility of the addition between consecutive features, during aggregation a down/up-sampling operation $g(F, \delta)$, with $F$ as the feature and $\delta$ as the factor, is applied along the temporal dimension. Besides the above basic flows to aggregate features in TPN, we can also combine them to obtain two additional options, namely the Cascade Flow and the Parallel Flow. Applying a bottom-up flow after a top-down flow leads to the cascade flow, while applying them simultaneously results in the parallel flow. See Fig. 3 for an illustration of all the possible flows. It is worth noting that more complicated flows (e.g. path aggregation in [panet]) could be built on top of these flows. However, our attempts in this line of research have not shown further improvement. Finally, following Fig. 2, all aggregated features in TPN are rescaled and concatenated for succeeding predictions.
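The flows can be sketched on toy one-dimensional temporal features. Everything below is an assumption made for illustration: features are plain lists of numbers (one value per time step), resampling uses repetition for upsampling and max-pooling for downsampling, and the parallel flow is read as adding both neighbours at once. It is not the reference implementation.

```python
def resample(feature, target_len):
    """Match temporal length: repeat steps to upsample, max-pool to downsample.

    Assumes the longer length is an integer multiple of the shorter one.
    """
    n = len(feature)
    if target_len >= n:
        factor = target_len // n
        return [x for x in feature for _ in range(factor)]
    factor = n // target_len
    return [max(feature[i * factor:(i + 1) * factor]) for i in range(target_len)]


def aggregate(features, use_lower, use_upper):
    """Add the resampled lower and/or upper neighbour to every level."""
    out = []
    for i, feat in enumerate(features):
        agg = list(feat)
        if use_lower and i > 0:
            lower = resample(features[i - 1], len(feat))
            agg = [a + b for a, b in zip(agg, lower)]
        if use_upper and i + 1 < len(features):
            upper = resample(features[i + 1], len(feat))
            agg = [a + b for a, b in zip(agg, upper)]
        out.append(agg)
    return out


def isolation(features):
    return aggregate(features, use_lower=False, use_upper=False)

def bottom_up(features):
    return aggregate(features, use_lower=True, use_upper=False)

def top_down(features):
    return aggregate(features, use_lower=False, use_upper=True)

def cascade(features):
    # Bottom-up flow applied after a top-down flow.
    return bottom_up(top_down(features))

def parallel(features):
    # Both directions applied simultaneously (one possible reading).
    return aggregate(features, use_lower=True, use_upper=True)


# Two toy levels: a lower level with 4 time steps and a top level with 2.
features = [[1, 2, 3, 4], [10, 20]]
for flow in (isolation, bottom_up, top_down, cascade, parallel):
    print(flow.__name__, flow(features))
```

Keeping the neighbour selection in a single `aggregate` helper makes the relation between the flows explicit: the three basic flows differ only in which neighbour is added, and the two combined flows reuse them.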
Here we introduce the implementation of TPN for action recognition. Following [slowfast], we use inflated ResNet [slowfast] as the 3D backbone network, for its promising performance on various datasets [kinetics]. Meanwhile, the original ResNet [resnet] serves as our 2D backbone. We use the output features of res2, res3, res4 and res5 to build TPN, which are spatially downsampled by 4, 8, 16 and 32 times respectively, compared to the input frames. We provide the structure of 3D ResNet-50 in Tab.1 for reference. In the spatial semantic modulation, the feature at the $i$-th level of an $M$-level TPN is processed by a stack of strided convolutions, and its feature dimension is decreased or increased to 1024. Besides, the temporal rate modulation for each feature is achieved by a convolutional layer and a max-pooling layer. Finally, after feature aggregation through one of the five flows mentioned in Sec.3.2, the features of TPN are separately rescaled by max-pooling operations, and their concatenation is fed into a fully-connected layer to make the final predictions. TPN can also be jointly trained with the backbone network in an end-to-end manner.
|Stage||Layer||Output size|
|raw||-||8×224×224|
|conv||1×7×7, 64, stride 1, 2, 2||8×112×112|
|pool||1×3×3 max, stride 1, 2, 2||8×56×56|
|global average pool, fc||1×1×1|
We evaluate the proposed TPN on various action recognition datasets, including Kinetics-400 [kinetics], Something-Something V1 & V2 [sthv1], and Epic-Kitchen [epic]. The consistent improvements show the effectiveness and generality of TPN. Ablation studies on the components of TPN are also included. Moreover, we present several empirical analyses to verify our motivation for TPN, i.e. that a feature-level temporal pyramid on a single backbone is beneficial for capturing the variance of visual tempos. All experiments are conducted with a single modality (i.e. RGB frames) on MMAction [mmaction2019] and evaluated on the validation set unless specified.
Kinetics-400 [kinetics] contains around 240k training videos and 19k validation videos that last for 10 seconds. It includes 400 action categories in total. Something-Something V1 [sthv1] consists of 86k training videos and 11k validation videos belonging to 174 action categories, whose durations vary from 2 to 6 seconds. The second release (V2) of Something-Something increases the number of videos to 220k. Epic-Kitchen [epic] includes around 125 verb and 352 noun categories. Following [anticipation], we randomly select 232 videos (23439 segments) for training and 40 videos (4979 segments) for validation.
Unless specified otherwise, our models are initialized by default from models pre-trained on ImageNet [imagenet]. Following the setting in [slowfast], the input frames are sampled from a set of 64 consecutive frames at a specific interval. Each frame is randomly cropped so that its short side falls within a fixed range of pixels, as in [nonlocal, slowfast, vggnet]. Horizontal flip augmentation and a dropout [dropout] of 0.5 are adopted to reduce overfitting, and BatchNorm (BN) [bn] is not frozen. We use a momentum of 0.9, a weight decay of 0.0001 and synchronized SGD training over 8 GPUs [sgd1hour]. Each GPU has a batch size of 8, resulting in a mini-batch of 64 in total. For Kinetics-400, the learning rate is 0.01 and is reduced by a factor of 10 at epochs 100 and 125 (150 epochs in total). For Something-Something V1 & V2 [sthv1] and Epic-Kitchen [epic], our model is trained for 150 and 55 epochs respectively.
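The Kinetics-400 step schedule stated above can be written as a small function. This is a sketch, assuming zero-based epoch indices and that each drop takes effect from the milestone epoch onward; the function name `learning_rate` is a hypothetical choice.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(100, 125), factor=10.0):
    """Step schedule: divide the rate by `factor` at every milestone reached."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr / (factor ** drops)


# Print the rate at a few representative epochs of a 150-epoch run.
for epoch in (0, 99, 100, 124, 125, 149):
    print(epoch, learning_rate(epoch))
```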
There exist two ways for inference: three-crop and ten-crop testing. a) Three-crop testing takes three random crops from the original frames, which are first resized to have 256 pixels on their shorter sides. Three-crop testing is used as an approximation of spatially fully-convolutional testing as in [vggnet, nonlocal, slowfast]. b) Ten-crop testing basically follows the procedure of [tsn], which extracts 5 crops and their flipped versions. Specifically, we conduct three-crop testing on Kinetics-400. We also uniformly sample 10 clips from the whole video and average the softmax probabilities of all clips as the final prediction. For the other two datasets, ten-crop testing and TSN-like methods with 8 segments are adopted.
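Averaging the softmax probabilities of several clips can be sketched as follows. The logits are made-up numbers for a two-class toy problem, and `video_prediction` is a hypothetical helper; a real pipeline would obtain the logits from the network for each sampled clip and crop.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def video_prediction(clip_logits):
    """Average per-clip softmax probabilities and return (label, mean probs)."""
    probs = [softmax(logits) for logits in clip_logits]
    num_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return mean.index(max(mean)), mean


# Three toy clips, two classes (values are illustrative only).
label, mean_probs = video_prediction([[2.0, 0.0], [0.0, 1.0], [3.0, 0.0]])
print(label, mean_probs)
```

Averaging probabilities rather than raw logits keeps every clip's contribution on the same scale, so a single clip with unusually large scores cannot dominate the final prediction.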
We evaluate TPN on both 2D and 3D backbone networks. Specifically, the slow-only branch of SlowFast [slowfast] is applied as our backbone network (denoted as I3D) due to its promising performance on various datasets. The architecture of I3D is shown in Table 1, which turns the 2D ResNet [resnet] into a 3D version via inflating kernels [nonlocal, kinetics]. Specifically, a 2D kernel of size $k \times k$ is inflated to the size $t \times k \times k$, with its original weights copied $t$ times and rescaled by $1/t$. Note that there are no temporal downsampling operations in the slow-only backbone. ResNet-50 [resnet] is used as the 2D backbone to show that TPN can be combined with various backbones. The final prediction follows the standard protocol of TSN [tsn] unless specified.
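The kernel inflation rule (copy the 2D weights along time and rescale by the number of copies) can be checked on a tiny example. This is a sketch with nested lists standing in for weight tensors; the kernel values, the temporal extent of 4 and the name `inflate_kernel` are assumptions for illustration.

```python
def inflate_kernel(kernel_2d, t):
    """Inflate a k x k kernel to t x k x k: copy it t times, scale by 1/t."""
    return [[[w / t for w in row] for row in kernel_2d] for _ in range(t)]


kernel = [[1.0, 2.0], [3.0, 4.0]]
inflated = inflate_kernel(kernel, t=4)

# Summing the inflated kernel over time gives back the original 2D weights.
collapsed = [
    [sum(inflated[s][i][j] for s in range(4)) for j in range(2)]
    for i in range(2)
]
print(collapsed)
```

Because the temporal slices sum to the original kernel, the inflated filter gives the same response as the 2D filter on a clip of identical frames, which is why pre-trained 2D weights remain a sensible starting point.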
|Two-Stream I3D [kinetics]||64||✓||75.7||92.0|
We compare our TPN with other state-of-the-art methods on Kinetics-400. The multi-depth pyramid and the parallel flow are used as the default setting for TPN. In detail, the multi-depth pyramid is built on the outputs of res4 and res5, with fixed temporal rate hyper-parameters. As discussed in the spatial semantic modulation, an additional auxiliary head is applied on the output of res4, weighted by a balancing coefficient. Several sampling intervals of input frames are compared.
The performance of I3D-R50 + TPN (i.e. TPN-R50) is included in Table 2. It is worth noting that in Table 2 the backbones of methods with the same depth are slightly different, which also affects their final accuracies. TPN-R50 achieves 77.7% top-1 accuracy, better than others with the same depth. TPN-R101 is also evaluated and obtains an accuracy of 78.9%, surpassing other methods with the same number of input frames.
|TSN-50 from [tsm]||8||ten-crop||69.9|
|TSM-50 from [tsm]||8||ten-crop||72.8|
|TSN-50 + TPN||8||ten-crop||73.5|
Being a general module, TPN could be combined with 2D networks. To show this, we add TPN to the ResNet-50 [resnet] in TSN (TSN-50 + TPN), and train such a combination with 8 segments (uniform sampling) in an end-to-end manner. Different from the original TSN [tsn] which takes 25 segments for validation, we utilize only 8 segments and the ten-crop testing, comparing apples to apples. As shown in Table 3, adding TPN to TSN-50 could boost the top-1 accuracy by 3.6%.
|TSN-50 + TPN||8||40.6||55.2|
|TSM-50 + TPN||8||49.0||62.0|
Results of different baselines with and without TPN on Something-Something are also included in Table 4. For a fair comparison, we use the center crop in all 8 segments, following the protocol used in TSM [tsm]. Both TSN and TSM receive a significant performance boost after being combined with the proposed TPN. While TSM has a relatively larger capacity compared to TSN, such consistent improvements on both backbones clearly demonstrate the generality of TPN. Besides, on the leaderboard (dated 03/20/2020), TPN with a TSM-101 backbone achieves 67.5% Top-1 accuracy, following the standard protocol, i.e. full resolution of 2 clips.
|TSN (RGB) [epic]||25||36.8||45.7|
|TSN (Flow) [epic]||25||17.4||42.8|
|TSN (Fusion) [epic]||25||36.7||48.2|
|TSN (our impl.)||8||39.7||48.2|
|TSN + TPN||8||41.3||61.1|
As shown in Table 5, we compare TSN + TPN to two baselines on Epic-Kitchen, following the settings in [epic]. An improvement similar to that on other datasets is observed, especially on verb classification.
Ablation studies for the components of TPN are conducted on Kinetics-400. Specifically, the I3D-50 backbone and the sparse sampling strategy (i.e. 8 frames sampled with a stride of 8) are adopted unless specified otherwise.
As mentioned in Sec.3.1, there exist two alternative ways to collect features from the backbone network, namely single-depth and multi-depth. For the single-depth pyramid, the output of res5 is sampled along the temporal dimension at four different intervals to construct a four-level feature pyramid. For the multi-depth pyramid, we choose three possible combinations as shown in Table (a). The parallel flow is adopted as the default option for feature aggregation. Hyper-parameters for the multi-depth pyramid are chosen to match its shape with the single-depth pyramid, for example when res4 and res5 are selected as feature sources.
The results of using different feature sources are included in Table (a), which suggests that the performance of TPN drops when we take features from relatively shallow sources, e.g. res2 or res3. Intuitively there are two related factors: 1) Different from object detection, where the low-level features contribute to the position regression, action recognition mainly relies on high-level semantics. 2) The I3D backbone [slowfast] only inflates the convolutions in the blocks of res4 and res5, so that both res2 and res3 are unable to capture useful temporal information. Unfortunately, inflating all 2D convolutions in the backbone would increase the computational complexity significantly and damage the performance, as reported in [slowfast]. Compared to the multi-depth pyramid, the single-depth pyramid extracts various tempo representations by directly sampling from a single source. Although improvement is also observed, representing video semantics only at a single spatial granularity may be insufficient.
In Sec.3.2, several information flows are introduced for feature aggregation. Table (c) lists the performances of different information flows, keeping other components of TPN unmodified. Surprisingly, TPN with the Isolation Flow also boosts the performance, indicating that under proper modulations, the features with different temporal receptive fields can indeed help action recognition, even though they come from a single backbone network. TPN with the Parallel Flow obtains the best result. The success of the parallel flow suggests that lower-level features can be enhanced by higher-level features via the top-down flow, since the latter have larger temporal receptive fields. The semantics of higher-level features can also be enriched by lower-level features via the bottom-up flow. More importantly, these two opposing information flows are not contradictory but complementary to each other.
The spatial semantic modulation and the temporal rate modulation are respectively introduced to overcome the semantic inconsistency in spatial dimensions and to adjust the relative rates of different levels in the temporal dimension. The effects of these two modulations are studied in Table (b), from which we observe: 1) TPN with all the components leads to the best result. 2) If the spatial semantic modulation contains no spatial convolutions, we have to up/down-sample the features of TPN simultaneously in the spatial and temporal dimensions, which is ineffective for temporal feature aggregation.
While we use 8 frames sampled at a stride of 8 as the default input in our ablation experiments, we have also investigated different sampling schemes. We denote by $T \times \tau$ a clip of $T$ frames sampled with a stride of $\tau$. In Table (d), we include results of both I3D-50 and I3D-101 with inputs obtained by different sampling schemes. Compared to the sparser sampling scheme ($8 \times 8$), the denser sampling scheme tends to bring both rich and redundant information, leading to a slight over-fitting of I3D-50. I3D-50 + TPN, however, does not encounter such over-fitting and obtains an increase in accuracy. Moreover, consistent improvements are observed for the stronger backbone I3D-101.
To verify whether TPN has captured the variance of visual tempos, several empirical analyses are conducted on TPN.
First, we have to measure the variance of visual tempos for a set of action instances. Unlike the concept of scale in object detection, it is non-trivial to precisely compute the visual tempo of an action instance. Therefore, we propose a model-based measurement that utilizes the Full Width at Half Maximum (FWHM) of the frame-wise classification probability curve. FWHM is defined as the difference between the two points of a variable at which its value is equal to half of its maximum value. We use a trained 2D TSN to collect per-frame classification probabilities for action instances in the validation set, and compute the FWHM for each instance as a measurement of its visual tempo: when the sampling fps is fixed, a large FWHM intuitively means the action proceeds at a slow tempo, and vice versa. We can thus compute the variance of visual tempos for each action category. The bottom of Figure 1 shows the variances of visual tempos of all action categories, which reveals not only that the variance of visual tempos is large for some categories, but also that different categories have significantly different variances of visual tempos.
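The FWHM-based measurement can be approximated on a discrete per-frame probability curve. The sketch below is one possible discretization under stated assumptions: it measures the index span around the peak where the curve stays at or above half of its maximum, uses the population variance, and runs on made-up curves; the names `fwhm` and `tempo_variance` are hypothetical.

```python
from statistics import pvariance


def fwhm(probs):
    """Span (in frames) around the peak where probs stay >= half the maximum."""
    peak = max(probs)
    half = peak / 2
    left = right = probs.index(peak)
    while left > 0 and probs[left - 1] >= half:
        left -= 1
    while right < len(probs) - 1 and probs[right + 1] >= half:
        right += 1
    return right - left


def tempo_variance(curves):
    """Variance of the FWHM over all instances of one action category."""
    return pvariance([fwhm(curve) for curve in curves])


# Made-up per-frame probabilities for two instances of the same class.
fast = [0.1, 0.2, 0.6, 0.9, 0.7, 0.3, 0.1]       # narrow peak: fast tempo
slow = [0.1, 0.5, 0.8, 0.9, 0.9, 0.8, 0.6, 0.2]  # wide peak: slow tempo
print(fwhm(fast), fwhm(slow), tempo_variance([fast, slow]))
```

A wider span means the classifier stays confident over more frames, which is the sense in which a large FWHM indicates a slow tempo at a fixed sampling rate.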
Subsequently, we estimate the correlation between per-class performance gains when adopting a TPN module and per-class variances of visual tempos. We first smooth the bar chart in Figure 1 by dividing the categories into bins with an interval of 10. We then calculate the mean of performance gains in each bin. Finally, the statistics of all bins are shown in Figure 4, where performance gain is positively correlated with the variance of visual tempos. This study strongly supports our motivation that TPN can bring a significant improvement for actions with large variances of visual tempo.
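The binning step can be sketched as grouping classes by their tempo variance in intervals of 10 and averaging the gains in each bin. The per-class numbers below are invented for illustration, and `mean_gain_per_bin` is a hypothetical helper name.

```python
from collections import defaultdict


def mean_gain_per_bin(variances, gains, bin_width=10):
    """Group classes by variance bin and return the mean gain of each bin.

    Keys of the result are the lower edges of the non-empty bins.
    """
    bins = defaultdict(list)
    for variance, gain in zip(variances, gains):
        bins[int(variance // bin_width)].append(gain)
    return {
        index * bin_width: sum(values) / len(values)
        for index, values in sorted(bins.items())
    }


# Invented per-class statistics: tempo variance and accuracy gain.
variances = [3, 8, 12, 27]
gains = [0.5, 1.5, 2.0, 4.0]
binned = mean_gain_per_bin(variances, gains)
print(binned)
```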
Humans recognize actions easily in spite of the large variance of visual tempos. Does the proposed TPN module also possess such robustness? To study this, we first train an I3D-50 + TPN on Kinetics-400 [kinetics] with a fixed sampling scheme as the input. We then re-scale the original input by re-sampling the frames with different strides, so that we are adjusting the visual tempo of a given action instance. For instance, when feeding frames sampled with a larger or smaller stride into the trained I3D-50 + TPN, we are essentially speeding up or slowing down the original action instance, since the temporal scope increases or decreases accordingly. Figure 5 includes the accuracy curves of varying visual tempos for I3D-50 and I3D-50 + TPN, from which we can see that TPN helps improve the robustness of I3D-50, resulting in a curve with more moderate fluctuations. Moreover, the robustness to visual tempo variation becomes clearer as we vary the visual tempo more strongly, as TPN can adapt itself dynamically according to the need.
In this paper, a generic module called Temporal Pyramid Network is proposed to capture the visual tempos of action instances. Our TPN, as a feature-level pyramid, can be applied to existing 2D/3D architectures in a plug-and-play manner, bringing consistent improvements. Empirical analyses reveal the effectiveness of TPN, supporting our motivation and design. We will extend TPN to other video understanding tasks in future work.
Acknowledgments. This work is supported in part by CUHK Direct Grant and SenseTime Group Limited. We also thank Yue Zhao for the wonderful codebase and insightful discussion.