Learn to cycle: Time-consistent feature discovery for action recognition

by   Alexandros Stergiou, et al.
Utrecht University

Temporal motion has been one of the essential components for effectively recognizing actions in videos. Both time information and features are primarily extracted hierarchically through small sequences of few frames, with the use of 3D convolutions. In this paper, we propose a method that can learn general feature changes across time, making activations unbounded to a temporal locality, by additionally including a general notion of their learned features. Through this recalibration of temporal feature cues across multiple frames, 3D-CNN models are capable of using features that are prevalent over different time segments, while being less constrained by their temporal receptive fields. We present improvements on both high and low capacity models, with the largest benefits being observed in low-memory models, as most of their current drawbacks stem from their poor generalization capabilities because of the low number and complexity of their features. We present average improvements, over both corresponding and state-of-the-art models, in the range of 3.67 Kinetics-700 (K-700), 2.75 Clips and Segments (HACS), 3.195






1 Introduction

Action recognition in videos is an active field of research. A major challenge comes from dealing with the vast variation in the temporal display of the action (Herath et al., 2017; Stergiou and Poppe, 2019a). With the introduction of deep learning models, temporal motion has primarily been modeled either through the inclusion of optical flow as a separate input stream (Simonyan and Zisserman, 2014) or using 3D convolutions (Ji et al., 2013). The latter have shown consistent improvements in state-of-the-art models (Carreira and Zisserman, 2017; Chen et al., 2018; Feichtenhofer et al., 2019; Feichtenhofer, 2020).

3D convolution kernels in convolutional neural networks (3D-CNNs) take into account fixed-sized temporal regions. Kernels in early layers have small receptive fields that primarily focus on simple patterns such as texture and linear movement. Later layers have significantly greater receptive fields that are capable of modeling complex spatio-temporal patterns. Through this hierarchical dependency, the relations between discriminative short-term motions within the larger motion patterns are only established in the very last network layers. Consequently, when training a 3D-CNN, the learned features might include incidental correlations instead of consistent temporal patterns. There appears to be room for improvement in the discovery of discriminative spatio-temporal features.

Figure 1: A. Original 3D convolution block. Activation maps consider a fixed-size temporal window. Features are specific to the local neighborhood. B. SRTG convolution block. Activation maps take global time information into account.

To improve this process, we propose a method named Squeeze and Recursion Temporal Gates (SRTG), which aims at extracting features that are consistent in the temporal domain. Instead of relying on a fixed-size window, our approach relates specific short-term activations to the overall motion in the video, as shown in Figure 1. We introduce a novel block that uses an LSTM (Hochreiter and Schmidhuber, 1997) to encapsulate modeled feature dynamics, and a temporal gate to decide whether these discovered dynamics are consistent with the modeled features. The novel block can be used at various places in a wide range of CNN architectures with minimal computational overhead.

Our contributions are as follows:

  • We implement a novel block, Squeeze and Recursion Temporal Gates (SRTG), that favors inputs that are temporally consistent with the modeled features.

  • The SRTG block can be applied to a wide range of 3D-CNNs, including those with residual connections, with minimal computational overhead.

  • We demonstrate state-of-the-art performance on five action recognition datasets when the SRTG block is used. Networks with SRTG consistently outperform their vanilla counterparts, independent of the network depth, the convolution block type and dataset.

We discuss the advancements made in action recognition in Section 2. A detailed description of the main methodology is provided in Section 3. Experimental setup and results are presented in Section 4. We conclude in Section 5.

2 Related Work

We discuss how temporal information is represented in CNNs and in particular using 3D convolutions.

Time representation in CNNs. Apart from the hand-coded calculation of optical flow (Simonyan and Zisserman, 2014), the predominant method for representing spatio-temporal information in CNNs is the use of 3D convolutions. These convolutions process motion information jointly with spatial information (Ji et al., 2013). Because the spatial and temporal dimensions of videos are strongly connected, this has led to great improvements especially for deeper 3D-CNN models (Carreira and Zisserman, 2017; Hara et al., 2018). Recent work additionally targets the efficient incorporation of temporal information at different time scales through the use of separate pathways (Chen et al., 2018; Feichtenhofer et al., 2019).

3D convolution variants. A large body of work has further focused on implementation variants of 3D convolutions that reduce their large computational requirements. Most of these attempts target the decoupling of temporal information as a standalone process with the use of pseudo-3D and (2+1)D convolutions (Qiu et al., 2017; Tran et al., 2018). Others have also proposed a decoupling of horizontal and vertical motions (Stergiou and Poppe, 2019b).

Information fusion on spatio-temporal activations. Self-attention approaches in image-based models, such as Squeeze and Excitation (Hu et al., 2018b), Gather and Excite (Hu et al., 2018a) and point-wise spatial attention (Zhao et al., 2018), combine attention with convolutional blocks. Building on these, work on attention in videos includes Long et al. (2018), who use clustering to integrate local patterns with different attention units. Others have studied the use of non-local operations that capture long-range temporal dependencies of spatio-temporal position pairs across different distances (Wang et al., 2018b). Wang et al. (2018a) proposed filtering feature responses with activations decoupled into branches for appearance and spatial relations. Qiu et al. (2019) have further proposed the idea of creating separate pathways for general features that can be updated through per-network-block activations.

However, these methods do not address the exploration of features across large time sequences rather than a small neighborhood of frames. As activations are constrained by the spatio-temporal locality of their receptive fields, they cannot effectively consider extended temporal variations of actions based on their general motion and time of execution. Instead of attempting to map the locality of features to each of the frame-wise activations, our work combines the locally-learned spatio-temporal features with their temporal variations across the entire duration of the video sequence.

Figure 2: a. Block architectural overview and gate states. The three SRTG states include the inactive state of the Cyclic Gates, where no soft nearest neighbors are calculated and both paths are fused, and the active states, where the paths are fused based on the value returned from the Temporal Cyclic Gate, which denotes whether they can be fused (open state) or only the main stream is returned (closed state). b. Main SRTG variant configurations with b.(i) Start, b.(ii) Mid, b.(iii) End, b.(iv) Res and b.(v) Final. Detailed descriptions are given in Section 3.3. We note that the same approach is followed for both SimpleBlock and Bottleneck blocks in Residual Networks.

3 Squeeze and Recursion Temporal Gates

In this section we discuss the structure of the proposed SRTG blocks and the main operations performed, alongside their possible configurations. We denote the layer input (x) as a stack of frames with a size of C × T × H × W, where C is the number of channels, T is the number of frames used by the volume and H and W are the spatial dimensions of the video. The backbone blocks that SRTG is applied to also include residual connections, where the final accumulated activations are the sum of the activations of the previous block and the current block.

3.1 Squeeze and Recursion

Squeeze and Recursion blocks can be built on top of any spatio-temporal activation map, for any activation function applied over a volume of features, similarly to Squeeze and Excitation (Hu et al., 2018b), as shown in Figure 2(a). For the block, the activation maps are sub-sampled on both of their spatial dimensions to create a vectorized representation of the volume’s features across time.

The created temporal vector holds a sequence of features represented based on their per-frame average intensity. Through this sub-sampling method, the cardinality of the original height and width is reduced to a single value, as they are sub-sampled based on their average values, encapsulating the averaged temporal attention through the discovered features.
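The spatial sub-sampling described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `squeeze_spatial`, the time-major `(T, C)` output layout and the toy tensor sizes are all hypothetical choices made for the example.

```python
import numpy as np

def squeeze_spatial(activations: np.ndarray) -> np.ndarray:
    """Average a (C, T, H, W) activation volume over H and W.

    Returns a (T, C) array: one C-dimensional feature vector per frame.
    """
    if activations.ndim != 4:
        raise ValueError("expected a (C, T, H, W) array")
    pooled = activations.mean(axis=(2, 3))  # per-frame average intensity, (C, T)
    return pooled.T  # time-major layout, convenient for a recurrent layer

# Toy volume (hypothetical sizes): 8 channels, 4 frames, 7x7 spatial resolution.
volume = np.random.default_rng(0).normal(size=(8, 4, 7, 7))
sequence = squeeze_spatial(volume)
print(sequence.shape)  # (4, 8)
```

Each row of `sequence` summarizes one frame, so the temporal order of the video is preserved while the spatial extent is collapsed.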

Recurrent cells. The importance of each feature in the temporal attention feature vector is decided by an LSTM sub-network. Through the sequential chain structure of recurrent cells, the overall features that are generally informative for entire video sequences can be discovered. We briefly describe the inner workings of the LSTM sub-network used (Hochreiter and Schmidhuber, 1997) and how the importance of each feature for the entire video is learned, as presented in Figure 3.

Figure 3: Overview of the LSTM chained cells used for the discovery of globally informative local features. Each cell corresponds to a temporal activation map and produces a feature vector of the same size as the input.

Low intensity frame feature activations are discarded in the very first operation of the recurrent cell, at the forget gate layer, where a decision ($f_t$) is created given the input ($x_t$) and the informative features of the previous frame ($h_{t-1}$). The features that are to be stored further are decided by the product of the sigmoidal ($\sigma$) input gate layer $i_t$ and the vector of candidate values $\tilde{C}_t$, as computed in Equation 1, where $W$ and $b$ denote the learned weights and biases of each layer.

$$
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
\end{aligned}
\tag{1}
$$

The previous cell state $C_{t-1}$, as in the top left corner of Figure 3, is then updated based on the two calculated gates in order to forget features that are not consistent across time and to determine the magnitude of the updates. This is done by weighting the previous state ($C_{t-1}$) by the forget gate features $f_t$ and adding the product of the input gate ($i_t$) and the vector of candidates ($\tilde{C}_t$):

$$
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
\tag{2}
$$

The final output of the recurrent cell ($h_t$) is given by the currently calculated cell state ($C_t$), the previous hidden state ($h_{t-1}$) and the current input ($x_t$) as:

$$
h_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \odot \tanh(C_t)
\tag{3}
$$

The produced hidden states are again grouped together to re-create a coherent sequence of filtered spatio-temporal feature intensities. The new attention vector considers previous cell states, creating a generalized vector based on the feature intensity across different temporal points.
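The recurrent pass over the pooled per-frame vectors can be sketched in NumPy as follows. This is a minimal illustration of a standard single-layer LSTM (Hochreiter and Schmidhuber, 1997), not the authors' code. The function names, the weight layout (one matrix per gate, applied to the concatenation of the previous hidden state and the current input) and the random initialization are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(sequence, weights, biases):
    """Run a single-layer LSTM over a (T, C) sequence and return (T, C) hidden states.

    weights: dict with keys "f", "i", "c", "o", each of shape (C, 2C)  (assumed layout)
    biases:  dict with the same keys, each of shape (C,)
    """
    _, channels = sequence.shape
    h = np.zeros(channels)  # previous hidden state
    c = np.zeros(channels)  # previous cell state
    outputs = []
    for x_t in sequence:
        hx = np.concatenate([h, x_t])
        f = sigmoid(weights["f"] @ hx + biases["f"])     # forget gate
        i = sigmoid(weights["i"] @ hx + biases["i"])     # input gate
        cand = np.tanh(weights["c"] @ hx + biases["c"])  # candidate values
        c = f * c + i * cand                             # cell state update
        o = sigmoid(weights["o"] @ hx + biases["o"])     # output gate
        h = o * np.tanh(c)                               # new hidden state
        outputs.append(h)
    return np.stack(outputs)

# Toy example (hypothetical sizes): 4 frames, 8 channels.
rng = np.random.default_rng(0)
T, C = 4, 8
pooled = rng.normal(size=(T, C))
weights = {k: rng.normal(scale=0.1, size=(C, 2 * C)) for k in "fico"}
biases = {k: np.zeros(C) for k in "fico"}
hidden = lstm_forward(pooled, weights, biases)
print(hidden.shape)  # (4, 8)
```

The output has one vector per frame with the same size as the input, which is what allows it to be compared against, and fused with, the pooled activations.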

3.2 Temporal Gates for cyclic consistency

Cyclic consistency. For validating the similarity between two temporal volumes, cyclic consistency has been a widely used technique (Dwibedi et al., 2019; Wang et al., 2019). The focus is the one-to-one mapping of frames between two temporal sequences, shown in Figure 4. Each of the two feature spaces can be considered an embedding space that is cyclic consistent if and only if each point at a given time instance in embedding space A has a minimum distance point in embedding space B that is at the exact same time instance. Equivalently, each temporal point in embedding space B should also have a minimum distance point in embedding space A at the same temporal location. As shown in Figure 4, cases where the points do not cycle back to the same temporal location do not exhibit cyclic consistency.

Figure 4: Temporal Cyclic Error. Soft nearest neighbor is used to match points between two embeddings. Cyclic consistent points cycle back to their original points. Otherwise, points are not cyclic consistent. Corresponding salient areas below are visualized with CFP (Stergiou et al., 2019).

By having points that can cycle back to themselves through intermediate embeddings, a similarity baseline between embedding spaces can be established. Although individual features of the two spaces may be different, they should demonstrate an overall similarity as long as their alignment in terms of cyclic consistency is the same. Comparing volumes by their cyclic consistency is therefore a reasonable measure.

Figure 5: Temporal Cyclic Gates. For the two embedding spaces A and B, their soft nearest neighbor cyclic consistency is calculated. Each frame activation in encoding space B is compared with every frame activation in encoding space A through their pair-wise Euclidean distance. This calculates, for each frame-wise activation map, the corresponding soft nearest neighbor in encoding space A, that is, the observation with the lowest value in the distribution of all distances from activations in A. The second part calculates the distances for every activation in the corresponding embedding space B and selects the minimum. For the gate to take an open state, the nearest neighbors found in both directions must be exactly and sequentially equal to the original temporal locations.

Soft nearest neighbor distance. The main problem in creating a coherent similarity measure between two embeddings is the vast feature space that examples are represented in (based on the number of channels per activation), as well as the challenge of creating distance models to discover the "nearest" point in an adjacent high-dimensional embedding. The idea of soft matches for projected points in embeddings (Goldberger et al., 2005) is based on finding the closest point in an embedding space through the weighted sum of all possible matches and selecting the closest actual observation.

To find the soft nearest neighbor of an activation at a given temporal location in embedding space A within an embedding space B, the Euclidean distances between that activation and all points in B are calculated, as in Figure 5. Each frame is considered a separate instance, based on which we want to select the minimum point in the adjacent embedding space. We weight the similarity of each frame in embedding space B to the activation by using a softmax activation that exploits the exponential difference between activation pairs.

The softmax activation produces a normalized probability distribution of similarities, centered on the frame with the minimum distance from the activation. Based on the discovered soft nearest neighbor, the distance to the nearest frames in B can then be computed. This allows the discovery of frames that are closely related to the initially considered frame, achieved through minimizing the L2 distance from the found soft match.

We define a point as consistent if and only if its initial temporal location precisely matches the temporal location of the computed point in embedding space B. After establishing this consistency check for frames in embedding space A, the same procedure is repeated in reverse for every frame in embedding space B, calculating the soft nearest neighbor in embedding space A. The two embeddings are considered cyclic consistent if and only if every point at every temporal location in each embedding is mapped directly to the point at the same temporal location in the adjacent embedding space.
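The consistency check described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, squared Euclidean distances are assumed inside the softmax (following the formulation of Dwibedi et al., 2019), and the toy embeddings are chosen so that frames are well separated.

```python
import numpy as np

def soft_nearest_indices(src, dst):
    """For each frame in src (T, C), return the index of the frame in dst (T, C)
    that lies closest to its soft nearest neighbor."""
    # Pairwise squared Euclidean distances, shape (T_src, T_dst).
    dists = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
    # Softmax over negative distances: closer frames receive larger weights.
    logits = -dists
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Soft match: weighted sum of all candidate frames in dst.
    soft_match = weights @ dst
    # Closest actual observation to each soft match (L2 distance).
    back = ((soft_match[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
    return back.argmin(axis=1)

def is_cyclic_consistent(a, b):
    """True only if every frame maps to the same temporal index in both directions."""
    expected = np.arange(a.shape[0])
    return bool(
        a.shape == b.shape
        and np.array_equal(soft_nearest_indices(a, b), expected)
        and np.array_equal(soft_nearest_indices(b, a), expected)
    )

# Toy embeddings (hypothetical): four well-separated frames.
emb_a = 10.0 * np.eye(4)
emb_b = emb_a + 0.01       # slightly perturbed copy, same temporal order
emb_rev = emb_a[::-1]      # same frames in reversed temporal order

print(is_cyclic_consistent(emb_a, emb_b))    # True
print(is_cyclic_consistent(emb_a, emb_rev))  # False
```

The strict index equality mirrors the "if and only if" condition in the text: a single frame that maps to a different temporal location is enough to make the two volumes inconsistent.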

Temporal gates. The temporal activation vector encapsulates average feature attention spanning across time. However, it does not ensure a precise similarity to the local spatio-temporal activations. Thus, we compute cyclic consistency between the pooled activations and the outputs of the recurrent cells. In this context, cyclic consistency is used as a gating mechanism that fuses the recurrent cell hidden states with the un-pooled activations only if the two volumes are temporally cyclic consistent. This condition further ensures that only relevant information is added back to the network, as shown in the active states in Figure 2(a).
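The gating behavior can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. The function name `temporal_gate` is hypothetical, the gate decision is passed in as a boolean that is assumed to come from a cyclic-consistency check, and the fusion operator is assumed to be an element-wise scaling in the style of Squeeze and Excitation, since the text does not state the operator explicitly.

```python
import numpy as np

def temporal_gate(activations, hidden_states, consistent):
    """Fuse LSTM hidden states into a (C, T, H, W) volume when the gate is open.

    hidden_states has shape (T, C); `consistent` is the outcome of the
    cyclic-consistency check between pooled activations and hidden states.
    """
    if not consistent:
        # Closed gate: only the main stream is returned.
        return activations
    # Open gate: broadcast each per-frame, per-channel value over H and W.
    scale = hidden_states.T[:, :, None, None]  # (C, T, 1, 1)
    # ASSUMPTION: multiplicative fusion; other operators (e.g. addition) are possible.
    return activations * scale

# Toy example (hypothetical sizes): 2 channels, 3 frames, 4x4 spatial resolution.
volume = np.ones((2, 3, 4, 4))
hidden = np.arange(6, dtype=float).reshape(3, 2)  # (T, C)
fused = temporal_gate(volume, hidden, consistent=True)
skipped = temporal_gate(volume, hidden, consistent=False)
print(fused.shape)  # (2, 3, 4, 4)
```

Because the closed state returns the input unchanged, the block degrades to the plain backbone block whenever the global and local representations disagree.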

3.3 SRTG block variants

As cyclic consistency can be considered in different parts of a convolution block, we investigate five different approaches to constructing an SRTG block. In each case, the principle of global and local information fusion remains, with changes made only to the in-block location as well as the LSTM input. All configurations are shown in Figure 2(b).

Start. SRTG is added at the very top of the block ensuring that all operations performed will be based on both global and local information.

Mid. Activations of the first convolution are used by the LSTM, with the fused features being used by the final convolution.

End. Local and global features are fused at the end of the final convolution, before the residual connection concatenation.

Res. The SRTG block can also be applied to the residual connection. This transforms the residual connection to further include global space time features and equivalently combining those features with the convolutional activations.

Final. SRTG blocks are added at the end of the residual block allowing for the activations calculated to be conjoint with their representations across time on the entire video.

4 Experiments and Results

We evaluate our approach on five action recognition benchmark datasets (Section 4.1). We perform experiments with various backbones (ResNet-34/50/101), each of which has been implemented with both 3D and (2+1)D convolutions (r3d/r(2+1)d).

4.1 Datasets

For our experiments we make use of five different action recognition benchmark datasets:

Human Action Clips and Segments (HACS, Zhao et al. (2019)) includes approximately 500K clips of 200 classes. Clips are 60-frame segments extracted from 50K unique videos.

Kinetics-700 (K-700, Carreira et al. (2019)) is the extension of Kinetics-400/600 to 700 classes. It contains approximately 600K clips of varying duration.

Moments in Time (MiT, Monfort et al. (2019)) is one of the largest video datasets of human actions and activities. It includes 339 classes with a total of approximately 800K 3-second clips.

UCF-101 (Soomro et al., 2012) includes 101 classes and 13K clips that vary between 2 and 14 seconds in duration.

HMDB-51 (Kuehne et al., 2011) contains 7K clips divided over 51 classes with at least 101 clips per class.

4.2 Experimental settings

Training was performed with a random sub-sampling of 16 frames per clip, which were spatially resized. We adopted a multigrid training scheme (Wu et al., 2020) with an initial learning rate of 0.1, halved at each cycle. We used an SGD optimizer with weight decay and a step-wise learning rate reduction. For HACS, K-700 and MiT, we use the train/test splits suggested by the authors, and report on split1 for UCF-101 and HMDB-51.

4.3 Comparison of SRTG block configurations

We experiment with two different backbone architectures (r3d-34/r(2+1)d-34) and each possible block configuration. In Table 1, we use 34-layer networks as backbones and train from scratch on the HACS dataset. All SRTG blocks perform better than the baseline, demonstrating that, regardless of the chosen configuration, SRTG modules improve performance over baseline models that do not include SRTG, with an average accuracy improvement of 1.7% top-1 and 0.5% top-5. The best performing configuration is Final SRTG, which is in line with the aim of creating general features. With the Final configuration we note a top-1 accuracy improvement of 3.781% for 3D and 4.686% for (2+1)D.

| Config | 3D top-1 (%) | (2+1)D top-1 (%) | 3D top-5 (%) | (2+1)D top-5 (%) |
|---|---|---|---|---|
| No SRTG | 74.818 | 75.703 | 92.839 | 93.571 |
| Start | 75.705 | 76.438 | 93.230 | 93.781 |
| Mid | 75.489 | 76.685 | 93.224 | 93.746 |
| Res | 76.703 | 77.094 | 93.307 | 93.856 |
| Final | 78.599 | 80.389 | 93.569 | 94.267 |

Table 1: Comparison of different SRTG configurations with 34-layer backbones on HACS.

4.4 Comparison of network architectures

| Model | HACS top-1 (%) | HACS top-5 (%) | K-700 top-1 (%) | K-700 top-5 (%) | MiT top-1 (%) | MiT top-5 (%) | UCF-101 top-1 (%) | UCF-101 top-5 (%) | HMDB-51 top-1 (%) | HMDB-51 top-5 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| I3D (Carreira and Zisserman, 2017) | 79.948 | 94.482 | 53.015 | 69.193 | 28.143 | 54.570 | 92.453 | 97.619 | 71.768 | 94.128 |
| TSM (Lin et al., 2019) | N/A | N/A | 54.032 | 72.216 | N/A | N/A | 92.336 | 97.961 | 72.391 | 94.158 |
| ir-CSN-101 (Tran et al., 2019) | N/A | N/A | 54.665 | 73.784 | N/A | N/A | 94.708 | 98.681 | 73.554 | 95.394 |
| MF-Net (Chen et al., 2018) | N/A | N/A | 54.249 | 73.378 | 27.286 | 48.237 | 93.863 | 98.372 | 72.654 | 94.896 |
| SF r3d-50 (Feichtenhofer et al., 2019) | N/A | N/A | 56.167 | 75.569 | N/A | N/A | 94.619 | 98.756 | 73.291 | 95.410 |
| SF r3d-101 (Feichtenhofer et al., 2019) | N/A | N/A | 57.326 | 77.194 | N/A | N/A | 95.756 | 99.138 | 74.205 | 95.974 |
| r3d-34 | 74.818 | 92.839 | 46.138 | 67.108 | 24.876 | 50.104 | 89.405 | 96.883 | 69.583 | 91.833 |
| r3d-50 | 78.361 | 93.763 | 49.083 | 72.541 | 28.165 | 53.492 | 93.126 | 96.293 | 72.192 | 94.562 |
| r3d-101 | 80.492 | 95.179 | 52.583 | 74.631 | 31.466 | 57.382 | 95.756 | 98.423 | 75.650 | 95.917 |
| r(2+1)d-34 | 75.703 | 93.571 | 46.625 | 68.229 | 25.614 | 52.731 | 88.956 | 96.972 | 69.205 | 90.750 |
| r(2+1)d-50 | 81.340 | 94.514 | 49.927 | 73.396 | 29.359 | 55.241 | 93.923 | 97.843 | 73.056 | 94.381 |
| r(2+1)d-101 | 82.957 | 95.683 | 52.536 | 75.177 | N/A | N/A | 95.503 | 98.705 | 75.837 | 95.512 |
| SRTG r3d-34 | 78.599 | 93.569 | 49.153 | 72.682 | 28.549 | 52.347 | 94.799 | 98.064 | 74.319 | 94.784 |
| SRTG r3d-50 | 80.362 | 95.548 | 53.522 | 74.171 | 30.717 | 55.650 | 95.756 | 98.550 | 75.650 | 95.674 |
| SRTG r3d-101 | 81.659 | 96.326 | 56.462 | 76.819 | 33.564 | 58.491 | 97.325 | 99.557 | 77.536 | 96.253 |
| SRTG r(2+1)d-34 | 80.389 | 94.267 | 49.427 | 73.233 | 28.972 | 54.176 | 94.149 | 97.814 | 72.861 | 92.667 |
| SRTG r(2+1)d-50 | 83.774 | 96.560 | 54.174 | 74.620 | 31.603 | 56.796 | 95.675 | 98.842 | 75.297 | 95.141 |
| SRTG r(2+1)d-101 | 84.326 | 96.852 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |

Table 2: Action recognition accuracy (top-1 and top-5) for a range of different network architectures on all five benchmark datasets.
Figure 6: Accuracy w.r.t. compute overhead. Top-1 accuracy and operations (in GMACs) of r3d/r(2+1)d with and without SRTG on HACS, K-700 and MiT. Panels: (a) HACS (r3d), (b) HACS (r(2+1)d), (c) K-700 (r3d), (d) MiT (r3d).

To better understand the merits of our method, we compare a number of network architectures with and without SRTG. We summarize the performance on all five benchmark datasets in Table 2. The top part of the table contains the results for state-of-the-art networks. We have used the trained networks from the respective authors’ repositories. Missing values are due to the lack of a trained model. Any deviations from previously reported performances are due to the use of multi-grid (Wu et al., 2020) with a base cycle batch size of 32. The second and third part of Table 2 summarize the performances of ResNets with various depths, 3D or (2+1)D convolutions with and without SRTG, respectively.

In state-of-the-art architectures, the use of larger and deeper models provides significant improvements in accuracy, as is clear from the steady increase in performance for deeper models. This is in line with the general trend for action recognition using CNNs, where architectures that are either deeper (101+ layer networks) or of higher complexity perform better. Models implemented with (2+1)D convolution blocks perform slightly better than their counterparts with 3D convolutions. These differences are modest and not consistent across datasets, however.

As shown in Table 2, adding SRTG blocks to any architecture consistently improves performance. Table 3 shows pair-wise comparisons of the performance on the three largest benchmark datasets for networks with and without SRTG. On average, these improvements are approximately 4.13% for Kinetics-700, 2.6% for MiT and 2.57% for HACS, and are independent of the convolution block type. For smaller networks, the performance gains are somewhat higher, with average improvements of 4.12% for r3d-34, 3.94% for r(2+1)d-34, 3.016% for r3d-50 and 2.59% for r(2+1)d-50 over all five datasets. Clearly, the use of time-consistent features obtained through our method improves the generalization ability of 3D-CNNs.

The r3d/r(2+1)d networks with SRTG perform on par with the current state-of-the-art architectures, as shown in Table 2. The SRTG r3d-101 outperforms the current state-of-the-art on HACS, MiT, UCF-101 and HMDB-51 with only an additional 1.06 GFLOPs over the baseline model. The (2+1)D variant further outperforms current architectures on HACS with 84.326% top-1 accuracy. We additionally note performance on par with current top performing models on K-700, trailing by only a small margin. The performance gains of both networks are remarkable given the significantly lower complexity of r3d/r(2+1)d architectures with SRTG. In comparison, SlowFast is built on a dual-network configuration, with its sub-parts being responsible for long-temporal and short-temporal movements, and therefore includes a significantly larger number of operations than the proposed plug-and-play SRTG. We analyze the additional computation cost of the SRTG block in Section 4.5.

Finally, we observe that the performance gain when applying SRTG is also substantial for the two smaller datasets, UCF-101 and HMDB-51. Especially for UCF-101, the action recognition accuracy is more saturated. Still, the already competitive performance of the ResNet-101 models on UCF-101 increases by 1.57% and 3.33% for the 3D and (2+1)D convolution variants, respectively. This further demonstrates that our SRTG method can improve the selection of features that contain less noise and generalize better, even when less training data is available.

4.5 Analysis of computational overhead

The SRTG block can be added to a large range of 3D-CNN architectures. It leverages the small computational cost of LSTMs compared to 3D convolutions, which enables us to increase the number of parameters without significantly increasing the number of GFLOPs. This also corresponds to small additional memory usage compared to baseline models on both forward and backward passes. We present the number of multiply-accumulate operations (MACs) used for the r3d and r(2+1)d architectures with and without SRTG in Figure 6, with respect to the corresponding accuracies achieved. MACs (Ludgate, 1982) are based on the product of two numbers added to an accumulator; they relate to the accumulated sum of the dot products between the weights and input regions in convolutions. The additional computation overhead for models that include the proposed block is approximately equivalent to 0.15% of the total number of operations in the vanilla networks. This constitutes a negligible increase compared to the gain in performance, making SRTG a lightweight block that can easily be used on top of existing networks.
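To give a rough sense of why a recurrent layer over pooled per-frame vectors is cheap relative to a 3D convolution, the following sketch counts MACs for one layer of each kind. The function names and the layer sizes (64 channels, a 3×3×3 kernel, a 16×56×56 output volume) are hypothetical and chosen only for illustration; they are not taken from the evaluated networks, and bias additions and non-linearities are ignored.

```python
def conv3d_macs(c_in, c_out, kernel, output_size):
    """MACs of a dense 3D convolution: one dot product per output element."""
    kt, kh, kw = kernel
    t, h, w = output_size
    return c_in * c_out * kt * kh * kw * t * h * w

def lstm_macs(features, steps):
    """MACs of a single-layer LSTM whose hidden size equals its input size.

    Four gates, each a (features x 2*features) matrix-vector product per step.
    """
    return 4 * features * (2 * features) * steps

# Hypothetical layer sizes, for illustration only.
conv = conv3d_macs(c_in=64, c_out=64, kernel=(3, 3, 3), output_size=(16, 56, 56))
lstm = lstm_macs(features=64, steps=16)
print(f"conv3d: {conv:,} MACs, LSTM: {lstm:,} MACs, ratio: {lstm / conv:.6f}")
```

Because the LSTM operates on spatially pooled vectors, its cost does not scale with the spatial resolution of the activation maps, which is where most of the convolutional cost originates.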

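| Dataset | r3d-50 | SRTG r3d-50 | r(2+1)d-50 | SRTG r(2+1)d-50 | r3d-101 | SRTG r3d-101 |
|---|---|---|---|---|---|---|
| HACS | 78.361 | 80.362 (+2.0) | 81.340 | 83.474 (+2.1) | 80.492 | 81.659 (+1.1) |
| K-700 | 49.083 | 53.522 (+4.4) | 49.927 | 54.174 (+4.2) | 52.583 | 56.462 (+3.8) |
| MiT | 28.165 | 30.717 (+2.5) | 29.359 | 31.603 (+3.3) | 31.466 | 33.564 (+2.0) |

Table 3: Per-convolution block comparisons of top-1 accuracy (%) on HACS, K-700 and MiT.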

4.6 Evaluating feature transferability

In order to assess the feature transferability of the proposed SRTG block, we additionally include tests in which we collate transfer learning results for different pre-training datasets, with UCF-101 and HMDB-51 as the fine-tuning datasets. Through this, individual dataset biases can be alleviated, as all three datasets can be used for pre-training models given their increased size. Overall, these tests give a significantly clearer overview of the true improvements and allow us to additionally study the overall generalization capabilities of the features learned.

As shown in Table 4, the accuracy rates remain consistent across the datasets that were used for pre-training. This consistency is likely due to both the large sizes of the pre-training datasets and the overall robustness of the proposed method. In all, the average offset between the pre-trained models is 0.49% for UCF-101 and 0.62% for HMDB-51, which corresponds to only minor changes in accuracy based on the pre-training datasets, further reinforcing that the improvements observed are due to the inclusion of SRTG blocks in the network.

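| Model | Pre-training | GFLOPs | UCF-101 top-1 (%) | HMDB-51 top-1 (%) |
|---|---|---|---|---|
| SRTG r3d-34 | HACS | 110.48 | 94.799 | 74.319 |
| SRTG r3d-34 | HACS+K-700 | 110.48 | 95.842 | 74.183 |
| SRTG r3d-34 | HACS+MiT | 110.48 | 95.166 | 74.235 |
| SRTG r(2+1)d-34 | HACS | 110.8 | 94.149 | 72.861 |
| SRTG r(2+1)d-34 | HACS+K-700 | 110.8 | 94.569 | 73.217 |
| SRTG r(2+1)d-34 | HACS+MiT | 110.8 | 95.648 | 74.473 |
| SRTG r3d-50 | HACS | 150.98 | 95.756 | 75.650 |
| SRTG r3d-50 | HACS+K-700 | 150.98 | 96.853 | 75.972 |
| SRTG r3d-50 | HACS+MiT | 150.98 | 96.533 | 76.014 |
| SRTG r(2+1)d-50 | HACS | 151.6 | 95.675 | 75.297 |
| SRTG r(2+1)d-50 | HACS+K-700 | 151.6 | 95.993 | 75.743 |
| SRTG r(2+1)d-50 | HACS+MiT | 151.6 | 96.278 | 75.988 |
| SRTG r3d-101 | HACS | 171.02 | 97.325 | 77.536 |
| SRTG r3d-101 | HACS+K-700 | 171.02 | 97.404 | 78.026 |
| SRTG r3d-101 | HACS+MiT | 171.02 | 97.568 | 78.419 |

Table 4: Results on UCF-101 and HMDB-51 based on the transfer learning dataset.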

5 Conclusions

We implement a novel SRTG block for creating time-consistent features that utilizes an LSTM (SR) for multi-frame feature dynamics, and a temporal gate (TG) to evaluate the temporal cyclic consistency of the discovered dynamics and the modeled features. The proposed SRTG blocks add only a minuscule computational overhead to the overall network, making them very efficient to compute on both forward and backward passes. Using common 3D/(2+1)D ResNets as backbone architectures, we show a consistent improvement of 3.53% over the vanilla networks, across five architectures and five datasets, and we obtain results that are on par with, and in most cases outperform, the current state-of-the-art. This shows how multi-frame temporal attention can further benefit local temporal neighborhoods and enhance their features.

6 Acknowledgments

This publication is supported by the Netherlands Organization for Scientific Research (NWO) with a TOP-C2 grant for “Automatic recognition of bodily interactions” (ARBITER).

References


  • J. Carreira, E. Noland, C. Hillier, and A. Zisserman (2019) A short note on the Kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987. Cited by: §4.1.
  • J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In Computer Vision and Pattern Recognition (CVPR), pp. 4724–4733. Cited by: §1, §2, Table 2.
  • Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng (2018) Multi-fiber networks for video recognition. In European Conference on Computer Vision (ECCV), pp. 352–367. Cited by: §1, §2, Table 2.
  • D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman (2019) Temporal cycle-consistency learning. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1801–1810. Cited by: §3.2.
  • C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) SlowFast networks for video recognition. In International Conference on Computer Vision (ICCV), pp. 6202–6211. Cited by: §1, §2, Table 2.
  • C. Feichtenhofer (2020) X3D: expanding architectures for efficient video recognition. arXiv preprint arXiv:2004.04730. Cited by: §1.
  • J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov (2005) Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), pp. 513–520. Cited by: §3.2.
  • K. Hara, H. Kataoka, and Y. Satoh (2018) Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and imagenet? In Computer Vision and Pattern Recognition (CVPR), pp. 18–22. Cited by: §2.
  • S. Herath, M. Harandi, and F. Porikli (2017) Going deeper into action recognition: a survey. Image and Vision Computing 60, pp. 4–21. Cited by: §1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §1, §3.1.
  • J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi (2018a) Gather-excite: exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 9401–9411. Cited by: §2.
  • J. Hu, L. Shen, and G. Sun (2018b) Squeeze-and-excitation networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141. Cited by: §2, §3.1.
  • S. Ji, W. Xu, M. Yang, and K. Yu (2013) 3D convolutional neural networks for human action recognition. Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 221–231. Cited by: §1, §2.
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: A large video database for human motion recognition. In International Conference on Computer Vision (ICCV), pp. 2556–2563. Cited by: §4.1.
  • J. Lin, C. Gan, and S. Han (2019) TSM: temporal shift module for efficient video understanding. In International Conference on Computer Vision (ICCV), pp. 7083–7093. Cited by: Table 2.
  • X. Long, C. Gan, G. De Melo, J. Wu, X. Liu, and S. Wen (2018) Attention clusters: purely attention based local feature integration for video classification. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7834–7843. Cited by: §2.
  • P. E. Ludgate (1982) On a proposed analytical machine. In The Origins of Digital Computers, pp. 73–87. Cited by: footnote 4.
  • M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal, T. Yan, L. Brown, Q. Fan, D. Gutfreund, C. Vondrick, et al. (2019) Moments in time dataset: One million videos for event understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2), pp. 502–508. Cited by: §4.1.
  • Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In International Conference on Computer Vision (ICCV), pp. 5534–5542. Cited by: §2.
  • Z. Qiu, T. Yao, C. Ngo, X. Tian, and T. Mei (2019) Learning spatio-temporal representation with local and global diffusion. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12056–12065. Cited by: §2.
  • K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), pp. 568–576. Cited by: §1, §2.
  • K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §4.1.
  • A. Stergiou, G. Kapidis, G. Kalliatakis, C. Chrysoulas, R. Poppe, and R. Veltkamp (2019) Class feature pyramids for video explanation. In International Conference on Computer Vision Workshop (ICCVW), pp. 4255–4264. Cited by: Figure 4.
  • A. Stergiou and R. Poppe (2019a) Analyzing human-human interactions: a survey. Computer Vision and Image Understanding 188, pp. 102799. Cited by: §1.
  • A. Stergiou and R. Poppe (2019b) Spatio-temporal FAST 3D convolutions for human action recognition. In International Conference on Machine Learning Applications (ICMLA), pp. 1830–1834. Cited by: §2.
  • D. Tran, H. Wang, L. Torresani, and M. Feiszli (2019) Video classification with channel-separated convolutional networks. In International Conference on Computer Vision (ICCV), pp. 5552–5561. Cited by: Table 2.
  • D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6450–6459. Cited by: §2.
  • L. Wang, W. Li, W. Li, and L. Van Gool (2018a) Appearance-and-relation networks for video classification. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1430–1439. Cited by: §2.
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018b) Non-local neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803. Cited by: §2.
  • X. Wang, A. Jabri, and A. A. Efros (2019) Learning correspondence from the cycle-consistency of time. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2566–2576. Cited by: §3.2.
  • C. Wu, R. Girshick, K. He, C. Feichtenhofer, and P. Krähenbühl (2020) A multigrid method for efficiently training video models. In Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.2, §4.4.
  • H. Zhao, A. Torralba, L. Torresani, and Z. Yan (2019) HACS: human action clips and segments dataset for recognition and temporal localization. In International Conference on Computer Vision (ICCV), pp. 8668–8678. Cited by: §4.1.
  • H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia (2018) Psanet: point-wise spatial attention network for scene parsing. In European Conference on Computer Vision (ECCV), pp. 267–283. Cited by: §2.