Squeeze-and-Recursion-Temporal-Gates
Implementation of Squeeze and Recursion Temporal Gates blocks for action recognition
Temporal motion has been one of the essential components for effectively recognizing actions in videos. Both time information and features are primarily extracted hierarchically, over short sequences of a few frames, with the use of 3D convolutions. In this paper, we propose a method that can learn general feature changes across time, making activations less bound to their temporal locality, by additionally including a general notion of their learned features. Through this recalibration of temporal feature cues across multiple frames, 3D-CNN models are capable of using features that are prevalent over different time segments, while being less constrained by their temporal receptive fields. We present improvements on both high- and low-capacity models, with the largest benefits observed in low-memory models, as most of their current drawbacks stem from poor generalization capabilities due to the low number and complexity of their features. We present average improvements, over both corresponding and state-of-the-art models, in the range of 3.67% on Kinetics-700 (K-700), 2.75% on Human Action Clips and Segments (HACS), 3.195…
Action recognition in videos is an active field of research. A major challenge comes from dealing with the vast variation in the temporal display of the action (Herath et al., 2017; Stergiou and Poppe, 2019a)
. With the introduction of deep learning models, temporal motion has primarily been modeled either through the inclusion of optical flow as a separate input stream
Simonyan and Zisserman (2014) or using 3D convolutions Ji et al. (2013). The latter have shown consistent improvements in state-of-the-art models (Carreira and Zisserman, 2017; Chen et al., 2018; Feichtenhofer et al., 2019; Feichtenhofer, 2020).

3D convolution kernels in convolutional neural networks (3D-CNNs) take into account fixed-sized temporal regions. Kernels in early layers have small receptive fields that primarily focus on simple patterns such as texture and linear movement. Later layers have significantly larger receptive fields that are capable of modeling complex spatio-temporal patterns. Through this hierarchical dependency, the relations between discriminative short-term motions and the larger motion patterns are only established in the very last network layers. Consequently, when training a 3D-CNN, the learned features might include incidental correlations instead of consistent temporal patterns. There thus appears to be room for improvement in the discovery of discriminative spatio-temporal features.
To improve this process, we propose a method named Squeeze and Recursion Temporal Gates (SRTG), which aims at extracting features that are consistent in the temporal domain. Instead of relying on a fixed-size window, our approach relates specific short-term activations to the overall motion in the video, as shown in Figure 1. We introduce a novel block that uses an LSTM (Hochreiter and Schmidhuber, 1997) to encapsulate modeled feature dynamics, and a temporal gate to decide whether these discovered dynamics are consistent with the modeled features. The novel block can be used at various places in a wide range of CNN architectures with minimal computational overhead.
Our contributions are as follows:
We implement a novel block, Squeeze and Recursion Temporal Gates (SRTG), that favors inputs that are temporally consistent with the modeled features.
The SRTG block can be applied to a wide range of 3D-CNNs, including those with residual connections, with minimal computational overhead.
We demonstrate state-of-the-art performance on five action recognition datasets when the SRTG block is used. Networks with SRTG consistently outperform their vanilla counterparts, independent of the network depth, the convolution block type and the dataset.
We discuss how temporal information is represented in CNNs, in particular through the use of 3D convolutions.
Time representation in CNNs. Apart from the hand-coded calculation of optical flow (Simonyan and Zisserman, 2014), the predominant method for representing spatio-temporal information in CNNs is the use of 3D convolutions. These convolutions process motion information jointly with spatial information (Ji et al., 2013). Because the spatial and temporal dimensions of videos are strongly connected, this has led to great improvements especially for deeper 3D-CNN models (Carreira and Zisserman, 2017; Hara et al., 2018). Recent work additionally targets the efficient incorporation of temporal information at different time scales through the use of separate pathways (Chen et al., 2018; Feichtenhofer et al., 2019).
3D convolution variants. A large body of work has further focused on implementation variants of 3D convolutions that improve upon their large computational requirements. Most of these attempts target the decoupling of temporal information as a standalone process, with the use of pseudo-3D and (2+1)D convolutions (Qiu et al., 2017; Tran et al., 2018). Others have proposed a decoupling of horizontal and vertical motions (Stergiou and Poppe, 2019b).
Information fusion on spatio-temporal activations. Building on self-attention approaches in image-based models, such as Squeeze and Excitation (Hu et al., 2018b), Gather and Excite (Hu et al., 2018a) and point-wise spatial attention (Zhao et al., 2018), which combine attention with convolutional blocks, several works have applied attention to videos. Long et al. (2018) use clustering to integrate local patterns with different attention units. Others have studied non-local operations that capture long-range temporal dependencies between spatio-temporal position pairs at different distances (Wang et al., 2018b). Wang et al. (2018a) proposed filtering feature responses with activations decoupled into branches for appearance and spatial relations. Qiu et al. (2019) further proposed creating separate pathways for general features that are updated with per-block activations.
However, these methods do not explore features across long time sequences, but only within a small neighborhood of frames. As activations are constrained by the spatio-temporal locality of their receptive fields, they cannot effectively account for extended temporal variations of actions in terms of their general motion and time of execution. Instead of attempting to map the locality of features to each of the frame-wise activations, our work combines the locally-learned spatio-temporal features with their temporal variations across the entire duration of the video.
In this section, we discuss the structure of the proposed SRTG blocks, the main operations performed, and their possible configurations. We denote the layer input as a stack of frames $x$ of size $C \times T \times H \times W$, where $C$ is the number of channels, $T$ is the number of frames in the volume, and $H$ and $W$ are the spatial dimensions of the video. The backbone blocks that SRTG is applied to also include residual connections, where the final accumulated activations are the sum of the previous block's output and the current block's output, denoted as $x_{i+1} = x_i + \mathcal{F}_i(x_i)$, with $i$ used as a block index.
Squeeze and Recursion blocks can be built on top of any spatio-temporal activation map, for any activation function applied over a volume of features, similarly to Squeeze and Excitation (Hu et al., 2018b), as shown in Figure 2(a). For the block, the activation maps are sub-sampled over both spatial dimensions to create a vectorized representation of the volume's features across time.
The created temporal vector holds a sequence of features represented by their per-frame average intensity. Through this sub-sampling, the cardinality of the original height ($H$) and width ($W$) is reduced to a single value per channel and frame, as they are sub-sampled by their average values, encapsulating the averaged temporal attention over the discovered features.
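As a concrete illustration of this squeeze step, the sketch below pools a PyTorch activation volume of shape (N, C, T, H, W) over its spatial dimensions only; the function name and tensor layout are our own choices and not part of the paper.

```python
import torch
import torch.nn.functional as F

def spatial_squeeze(x: torch.Tensor) -> torch.Tensor:
    """Average-pool the spatial dimensions of a (N, C, T, H, W) activation
    volume, keeping one value per channel and frame.

    Returns a tensor of shape (N, T, C): a per-frame vector of average
    feature intensities, ready to be fed to a recurrent sub-network.
    """
    n, c, t, h, w = x.shape
    # Global average pooling over H and W only; the temporal dimension is preserved.
    pooled = F.adaptive_avg_pool3d(x, output_size=(t, 1, 1))   # (N, C, T, 1, 1)
    # Re-arrange into a temporal sequence of C-dimensional descriptors.
    return pooled.view(n, c, t).permute(0, 2, 1).contiguous()  # (N, T, C)

# Example: a batch of 2 clips, 64 channels, 16 frames, 56x56 spatial size.
z = spatial_squeeze(torch.randn(2, 64, 16, 56, 56))
print(z.shape)  # torch.Size([2, 16, 64])
```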
Recurrent cells. The importance of each feature in the temporal attention feature vector is decided by an LSTM sub-network. Through the sequential chain structure of recurrent cells, the overall features that are generally informative for entire video sequences can be discovered. We briefly describe the inner workings of the LSTM sub-network used (Hochreiter and Schmidhuber, 1997) and how the importance of each feature for the entire video is learned, as presented in Figure 3.
Low-intensity frame feature activations are discarded in the very first operation of the recurrent cell, at the forget gate layer, where a decision ($f_t$) is made given the current input ($x_t$) and the previous frame's informative features ($h_{t-1}$). The features to be stored are decided by the product of the sigmoidal ($\sigma$) input gate layer $i_t$ and the vector of candidate values $\tilde{C}_t$, as computed in Equation 1.

$$ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{1} $$
The previous cell state $C_{t-1}$, as in the top left corner of Figure 3, is then updated based on the two calculated gates, in order to forget features that are not consistent across time and to determine the magnitude of the update. This is done by weighting the previous state ($C_{t-1}$) by the forget gate features $f_t$, and accumulating the result with the product of the input gate ($i_t$) and the vector of candidates ($\tilde{C}_t$):

$$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{2} $$
The final output of the recurrent cell ($h_t$) is given by the currently calculated cell state ($C_t$), the previous hidden state ($h_{t-1}$) and the current input ($x_t$) as:

$$ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(C_t) \tag{3} $$
The produced hidden states are then grouped together again to re-create a coherent sequence of filtered spatio-temporal feature intensities. The new attention vector considers previous cell states, creating a generalized vector based on the feature intensity across different temporal points.
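A minimal sketch of this squeeze-and-recursion step follows, assuming a PyTorch nn.LSTM whose hidden size equals the channel count so that the hidden states can later be fused back over the channels; the class and variable names are illustrative and the exact recurrent configuration used by the authors may differ.

```python
import torch
import torch.nn as nn

class SqueezeRecursion(nn.Module):
    """Illustrative squeeze-and-recursion step: spatially pooled per-frame
    descriptors are passed through an LSTM (Eqs. 1-3) so that each time
    step of the output reflects feature intensities across the whole clip."""

    def __init__(self, channels: int):
        super().__init__()
        # Hidden size equal to the channel count is an assumption that keeps
        # the filtered sequence compatible with the original channels.
        self.lstm = nn.LSTM(input_size=channels, hidden_size=channels,
                            batch_first=True)

    def forward(self, x: torch.Tensor):
        # x: (N, C, T, H, W); the spatial mean is equivalent to the
        # spatial_squeeze sketch shown earlier.
        z = x.mean(dim=(3, 4)).permute(0, 2, 1)   # (N, T, C) pooled descriptors
        h, _ = self.lstm(z)                       # (N, T, C) filtered sequence
        return z, h                               # both are used by the temporal gate

sr = SqueezeRecursion(channels=64)
z, h = sr(torch.randn(2, 64, 16, 56, 56))
print(z.shape, h.shape)  # torch.Size([2, 16, 64]) torch.Size([2, 16, 64])
```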
Cyclic consistency. Cyclic consistency has been a widely used technique for validating the similarity between two temporal volumes (Dwibedi et al., 2019; Wang et al., 2019). The focus is a one-to-one mapping of frames between two temporal sequences, shown in Figure 4. Each of the two feature spaces can be considered an embedding space that is cyclic consistent if and only if each point at time instance $t$ in embedding space A has a minimum-distance point in embedding space B at the exact same time instance $t$. Equivalently, each temporal point at $t$ in embedding space B should also have a minimum-distance point in embedding space A at location $t$. As shown in Figure 4, cases where points do not cycle back to the same temporal location do not exhibit cyclic consistency.
By having points that can cycle back to themselves through intermediate embeddings, a similarity baseline between embedding spaces can be established. Although individual features of the two spaces may differ, they demonstrate an overall similarity as long as their alignment in terms of cyclic consistency is the same. Comparing volumes by their cyclic consistency is therefore a reasonable measure.
Soft nearest neighbor distance. The main problem in creating a coherent similarity measure between two embeddings is the vast feature space in which examples are represented (determined by the number of channels per activation), as well as the challenge of creating distance models to discover the "nearest" point in an adjacent high-dimensional embedding. The idea of soft matches for projected points in embeddings (Goldberger et al., 2005) is based on finding the closest point in an embedding space through the weighted sum of all possible matches, and then selecting the closest actual observation.
To find the soft nearest neighbor, in embedding space B, of an activation $a_t$ at temporal location $t$ in embedding space A, the Euclidean distances between $a_t$ and all points in B are calculated, as in Figure 5. Each frame is considered a separate instance for which we want to select the minimum-distance point in the adjacent embedding space. We weight the similarity of each frame $b_k$ in embedding space B to activation $a_t$ with a softmax over the exponentiated negative squared distances between activation pairs:

$$ \alpha_k = \frac{e^{-\lVert a_t - b_k \rVert^2}}{\sum_{j=1}^{T} e^{-\lVert a_t - b_j \rVert^2}}, \qquad \tilde{b}_t = \sum_{k=1}^{T} \alpha_k\, b_k \tag{4} $$
The softmax produces a probability distribution of similarities ($\alpha$), centered on the frame with the minimum distance from activation $a_t$. Based on the resulting soft nearest neighbor ($\tilde{b}_t$), the distances to the actual frames in B can then be computed. This allows the discovery of the frame that is most closely related to the initially considered frame, achieved by minimizing the L2 distance from the found soft match:

$$ t^{*} = \arg\min_{k} \lVert \tilde{b}_t - b_k \rVert_2 \tag{5} $$
We define a point as consistent if and only if its initial temporal location $t$ precisely matches the temporal location $t^{*}$ of the computed point in embedding space B. After establishing this consistency check for frames in embedding space A, the same procedure is repeated in reverse for every frame in embedding space B, calculating the soft nearest neighbor in embedding space A. The two embeddings are considered cyclic consistent if and only if every point, at each temporal location in each of the embeddings, is directly mapped to the point at the same temporal location in the adjacent embedding space.
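The check described above can be sketched as follows for two temporally aligned embeddings of shape (T, C); the function names are ours, and the strict equality test is one possible reading of the consistency condition in Equations 4 and 5.

```python
import torch

def pairwise_sq_dists(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Squared Euclidean distances between all frame pairs of two temporal
    embeddings a (T_a, C) and b (T_b, C), returned as (T_a, T_b)."""
    return (a.unsqueeze(1) - b.unsqueeze(0)).pow(2).sum(dim=-1)

def soft_nearest_neighbours(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Eq. 4: the softmax over negative squared distances gives, for every
    frame of a, a similarity distribution alpha over the frames of b; the
    soft nearest neighbour is the alpha-weighted sum of those frames."""
    alpha = torch.softmax(-pairwise_sq_dists(a, b), dim=1)   # (T, T)
    return alpha @ b                                          # (T, C)

def cycles_back(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Eq. 5: map each frame a_t to its soft match in b, then to the real
    frame of b closest to that match; consistent frames land on index t."""
    b_tilde = soft_nearest_neighbours(a, b)
    nearest = pairwise_sq_dists(b_tilde, b).argmin(dim=1)
    return nearest == torch.arange(a.shape[0])

def is_cyclic_consistent(a: torch.Tensor, b: torch.Tensor) -> bool:
    """Both directions (A -> B and B -> A) must map every temporal location
    back onto itself for the two volumes to be considered consistent."""
    return bool(cycles_back(a, b).all() and cycles_back(b, a).all())

# Example: a near-identical pair of embeddings is (almost surely) consistent.
a = torch.randn(16, 64)
print(is_cyclic_consistent(a, a + 0.01 * torch.randn_like(a)))
```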
Temporal gates. The temporal activation vector encapsulates average feature attention spanning across time. However, it does not ensure a precise similarity to the local spatio-temporal activations. Thus, we compute cyclic consistency between the pooled activations and the output of the recurrent cells. In this context, cyclic consistency is used as a gating mechanism: the recurrent cell hidden states are fused with the un-pooled versions of the activations only if the two volumes are temporally cyclic consistent. This condition ensures that only relevant information is added back to the network, as shown in the active states in Figure 2(a).
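Putting the pieces together, a hedged sketch of this gating step is shown below. It reuses the SqueezeRecursion and is_cyclic_consistent sketches from above, and the sigmoid-weighted multiplicative fusion is an assumption made for illustration rather than the authors' exact fusion operator.

```python
import torch
import torch.nn as nn

class SRTGGate(nn.Module):
    """Illustrative temporal gate: the recurrent hidden states are fused with
    the un-pooled activations only when the pooled sequence z and the LSTM
    output h are cyclic consistent; otherwise the activations pass through
    unchanged. Assumes SqueezeRecursion and is_cyclic_consistent are in scope."""

    def __init__(self, channels: int):
        super().__init__()
        self.sr = SqueezeRecursion(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, _, _ = x.shape
        z, h = self.sr(x)                          # (N, T, C) each
        out = []
        for i in range(n):                         # gate each clip independently
            if is_cyclic_consistent(z[i], h[i]):   # gate open: fuse global and local
                # Broadcast per-frame, per-channel weights over H and W.
                w = torch.sigmoid(h[i]).permute(1, 0).reshape(1, c, t, 1, 1)
                out.append(x[i:i + 1] * w)
            else:                                  # gate closed: keep local activations
                out.append(x[i:i + 1])
        return torch.cat(out, dim=0)
```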
As cyclic consistency can be considered in different parts of a convolution block, we investigate several configurations for constructing an SRTG block. In each case, the principle of fusing global and local information remains the same; only the in-block location and the LSTM input change. All configurations are shown in Figure 2(b).
Start. SRTG is added at the very top of the block ensuring that all operations performed will be based on both global and local information.
Mid. Activations of the first convolution are used by the LSTM, with the fused features being used by the final convolution.
End. Local and global features are fused at the end of the final convolution, before the residual connection is added.
Res. The SRTG block can also be applied to the residual connection. This transforms the residual connection to additionally include global space-time features, equivalently combining those features with the convolutional activations.
Final. SRTG blocks are added at the end of the residual block, allowing the calculated activations to be combined with their representations across the entire video (a placement sketched in the code below).
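As a sketch of the Final placement referenced above, the block below appends an SRTG gate after the residual sum of a plain 3D residual block; the convolution layout is illustrative and not the exact r3d/r(2+1)d block used in the paper.

```python
import torch
import torch.nn as nn

class ResidualBlockWithSRTG(nn.Module):
    """Sketch of the 'Final' configuration: a simple 3D residual block whose
    output, after the skip connection, is passed through an SRTG gate.
    Assumes the SRTGGate sketch above is in scope."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.srtg = SRTGGate(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.relu(out + x)     # residual connection
        return self.srtg(out)        # 'Final': gate applied after the residual sum

block = ResidualBlockWithSRTG(channels=64)
y = block(torch.randn(1, 64, 16, 28, 28))
print(y.shape)  # torch.Size([1, 64, 16, 28, 28])
```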
We evaluate our approach on five action recognition benchmark datasets (Section 4.1). We perform experiments with various backbones (ResNet-34/50/101), each of which has been implemented with both 3D and (2+1)D convolutions (r3d/r(2+1)d).
For our experiments we make use of five different action recognition benchmark datasets:
Human Action Clips and Segments (HACS, Zhao et al. (2019)) includes approximately 500K clips of 200 classes. Clips are 60-frame segments extracted from 50k unique videos.
Kinetics-700 (K-700, Carreira et al. (2019)) is the extension of Kinetics-400/600 to 700 classes. It contains approximately 600k clips of varying duration.
Moments in Time (MiT, Monfort et al. (2019)) is one of the largest video datasets of human actions and activities. It includes 339 classes with a total of approximately 800K, 3-second clips.
UCF-101 (Soomro et al., 2012) includes 101 classes and 13k clips that vary between 2 and 14 seconds in duration.
HMDB-51 (Kuehne et al., 2011) contains 7K clips divided over 51 classes with at least 101 clips per class.
Training was performed with random sub-sampling of 16 frames, resized to a fixed spatial resolution. We adopted a multigrid training scheme (Wu et al., 2020) with an initial learning rate of 0.1, halved at each cycle. We used an SGD optimizer with weight decay and a step-wise learning rate reduction. For HACS, K-700 and MiT, we use the train/test splits suggested by the authors, and report on split 1 for UCF-101 and HMDB-51.
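For reference, a minimal optimizer and schedule setup matching this description might look as follows; the momentum and weight decay values are common defaults rather than values reported in the text, and the plain step schedule stands in for the multigrid cycle-based halving.

```python
import torch
import torch.nn as nn

# Placeholder model; in practice this would be an SRTG-equipped r3d/r(2+1)d network.
model = nn.Conv3d(3, 64, kernel_size=3, padding=1)

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,              # initial learning rate stated in the text
                            momentum=0.9,        # assumption: common default, not stated
                            weight_decay=1e-4)   # assumption: typical value, not stated

# The text halves the learning rate at each multigrid cycle; a step schedule
# is used here as a simple stand-in for that cycle-based halving.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... iterate over 16-frame clips, forward pass, loss.backward(), optimizer.step() ...
    scheduler.step()
```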
We experiment with two different backbone architectures (r3d-34 / r(2+1)d-34) for each possible block configuration. In Table 1, we use 34-layer networks as backbones and train from scratch on the HACS dataset. All SRTG configurations perform better than the baseline, demonstrating that regardless of the chosen configuration, SRTG modules improve performance over baseline models without SRTG, with an average accuracy improvement of 1.7% top-1 and 0.5% top-5. The best performing configuration is Final SRTG, which is in line with the aim of creating general features. With the Final configuration we note a top-1 accuracy improvement of 3.781% for 3D and 4.686% for (2+1)D convolutions.
Table 1: Accuracy (%) of SRTG block configurations with 34-layer 3D and (2+1)D backbones, trained from scratch on HACS.

| Config | Gates | top-1 3D | top-1 (2+1)D | top-5 3D | top-5 (2+1)D |
|---|---|---|---|---|---|
| No SRTG | ✗ | 74.818 | 75.703 | 92.839 | 93.571 |
| Start | ✓ | 75.705 | 76.438 | 93.230 | 93.781 |
| Mid | ✓ | 75.489 | 76.685 | 93.224 | 93.746 |
| Res | ✓ | 76.703 | 77.094 | 93.307 | 93.856 |
| Final | ✓ | 78.599 | 80.389 | 93.569 | 94.267 |
Table 2: Top-1 and top-5 accuracy (%) of state-of-the-art models and of r3d/r(2+1)d networks with and without SRTG, on the five benchmark datasets.

| Model | HACS top-1 | HACS top-5 | K-700 top-1 | K-700 top-5 | MiT top-1 | MiT top-5 | UCF-101 top-1 | UCF-101 top-5 | HMDB-51 top-1 | HMDB-51 top-5 |
|---|---|---|---|---|---|---|---|---|---|---|
| I3D (Carreira and Zisserman, 2017) | 79.948 | 94.482 | 53.015 | 69.193 | 28.143 | 54.570 | 92.453 | 97.619 | 71.768 | 94.128 |
| TSM (Lin et al., 2019) | N/A | N/A | 54.032 | 72.216 | N/A | N/A | 92.336 | 97.961 | 72.391 | 94.158 |
| ir-CSN-101 (Tran et al., 2019) | N/A | N/A | 54.665 | 73.784 | N/A | N/A | 94.708 | 98.681 | 73.554 | 95.394 |
| MF-Net (Chen et al., 2018) | N/A | N/A | 54.249 | 73.378 | 27.286 | 48.237 | 93.863 | 98.372 | 72.654 | 94.896 |
| SF r3d-50 (Feichtenhofer et al., 2019) | N/A | N/A | 56.167 | 75.569 | N/A | N/A | 94.619 | 98.756 | 73.291 | 95.410 |
| SF r3d-101 (Feichtenhofer et al., 2019) | N/A | N/A | 57.326 | 77.194 | N/A | N/A | 95.756 | 99.138 | 74.205 | 95.974 |
| r3d-34 | 74.818 | 92.839 | 46.138 | 67.108 | 24.876 | 50.104 | 89.405 | 96.883 | 69.583 | 91.833 |
| r3d-50 | 78.361 | 93.763 | 49.083 | 72.541 | 28.165 | 53.492 | 93.126 | 96.293 | 72.192 | 94.562 |
| r3d-101 | 80.492 | 95.179 | 52.583 | 74.631 | 31.466 | 57.382 | 95.756 | 98.423 | 75.650 | 95.917 |
| r(2+1)d-34 | 75.703 | 93.571 | 46.625 | 68.229 | 25.614 | 52.731 | 88.956 | 96.972 | 69.205 | 90.750 |
| r(2+1)d-50 | 81.340 | 94.514 | 49.927 | 73.396 | 29.359 | 55.241 | 93.923 | 97.843 | 73.056 | 94.381 |
| r(2+1)d-101 | 82.957 | 95.683 | 52.536 | 75.177 | N/A | N/A | 95.503 | 98.705 | 75.837 | 95.512 |
| SRTG r3d-34 | 78.599 | 93.569 | 49.153 | 72.682 | 28.549 | 52.347 | 94.799 | 98.064 | 74.319 | 94.784 |
| SRTG r3d-50 | 80.362 | 95.548 | 53.522 | 74.171 | 30.717 | 55.650 | 95.756 | 98.550 | 75.650 | 95.674 |
| SRTG r3d-101 | 81.659 | 96.326 | 56.462 | 76.819 | 33.564 | 58.491 | 97.325 | 99.557 | 77.536 | 96.253 |
| SRTG r(2+1)d-34 | 80.389 | 94.267 | 49.427 | 73.233 | 28.972 | 54.176 | 94.149 | 97.814 | 72.861 | 92.667 |
| SRTG r(2+1)d-50 | 83.774 | 96.560 | 54.174 | 74.620 | 31.603 | 56.796 | 95.675 | 98.842 | 75.297 | 95.141 |
| SRTG r(2+1)d-101 | 84.326 | 96.852 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
To better understand the merits of our method, we compare a number of network architectures with and without SRTG. We summarize the performance on all five benchmark datasets in Table 2. The top part of the table contains the results for state-of-the-art networks. We have used the trained networks from the respective authors' repositories. Missing values are due to the lack of a trained model. Any deviations from previously reported performances are due to the use of multigrid training (Wu et al., 2020) with a base cycle batch size of 32. The second and third parts of Table 2 summarize the performances of ResNets of various depths, with 3D or (2+1)D convolutions, without and with SRTG, respectively.
For the state-of-the-art architectures, the use of larger and deeper models provides significant improvements in accuracy, as becomes clear from the steady increase in performance for deeper models. This is in line with the general trend for action recognition with CNNs, where architectures are either deeper (101+ layer networks) or of higher complexity. Models implemented with (2+1)D convolution blocks perform slightly better than their counterparts with 3D convolutions. These differences are modest, however, and not consistent across datasets.
As shown in Table 2, adding SRTG blocks to any architecture consistently improves performance. Table 3 shows pair-wise comparisons of the performance on the three largest benchmark datasets for networks with and without SRTG. On average, these improvements are approximately 4.13% for Kinetics-700, 2.6% for MiT and 2.57% for HACS, independent of the convolution block type. For smaller networks, the performance gains are somewhat higher, with average improvements over all five datasets of 4.12% for r3d-34, 3.94% for r(2+1)d-34, 3.016% for r3d-50 and 2.59% for r(2+1)d-50. Clearly, the use of time-consistent features obtained through our method improves the generalization ability of 3D-CNNs.
The r3d/r(2+1)d networks with SRTG perform on par with the current state-of-the-art architectures, as shown in Table 2. The r3d-101 outperforms the current state-of-the-art on HACS, MiT, UCF-101 and HMDB-51 with only an additional 1.06 GFLOPs over the baseline model. The (2+1)D variant further outperforms current architectures on HACS with 84.326% top-1 accuracy. We additionally note performance on par with the current top-performing models on K-700, within only a small margin. The performance gains of both networks are remarkable given the significantly lower complexity of the r3d/r(2+1)d architectures with SRTG. In comparison, SlowFast is built on a dual-network configuration whose sub-networks are responsible for long-term and short-term temporal information, and therefore includes a significantly larger number of operations than the proposed plug-and-play SRTG. We analyze the additional computation cost of the SRTG block in Section 4.5.
Finally, we observe that the performance gain of SRTG is also substantial for the two smaller datasets, UCF-101 and HMDB-51, even though the action recognition accuracy, especially on UCF-101, is already rather saturated. Still, the already competitive performance of the ResNet-101 models on UCF-101 increases by 1.57% and 3.33% for the 3D and (2+1)D convolution variants, respectively. This further demonstrates that our SRTG method helps select features that contain less noise and generalize better, even when less training data is available.
The SRTG block can be added to a large range of 3D-CNN architectures. It leverages the small computational cost of LSTMs compared to 3D convolutions, which enables us to increase the number of parameters without significantly increasing the number of GFLOPs. This also corresponds to a small additional memory usage compared to baseline models, on both forward and backward passes. We present the number of multiply-accumulate operations (MACs) used by the r3d and r(2+1)d architectures, with and without SRTG, in Figure 6, together with the corresponding accuracies. (Multiply-accumulate operations (Ludgate, 1982) are based on the product of two numbers added to an accumulator; they correspond to the accumulated sum of convolutions between the dot product of the weights and the input region.) The additional computation overhead for models that include the proposed block is approximately 0.15% of the total number of operations in the vanilla networks. This is a negligible increase compared to the gain in performance, making SRTG a lightweight block that can easily be used on top of existing networks.
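A simple way to reproduce such an operation count, assuming the thop (pytorch-OpCounter) package and a torchvision 3D ResNet as a stand-in model, is sketched below; neither the tool nor the input resolution is prescribed by the paper.

```python
import torch
from thop import profile                      # pytorch-OpCounter; the tool choice is ours, not the paper's
from torchvision.models.video import r3d_18   # stand-in 3D ResNet; the paper uses 34/50/101-layer variants

model = r3d_18()
clip = torch.randn(1, 3, 16, 112, 112)        # (N, C, T, H, W); the spatial size here is an assumption

macs, params = profile(model, inputs=(clip,))
print(f"{macs / 1e9:.2f} GMACs, {params / 1e6:.2f}M parameters")
```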
Table 3: Pair-wise top-1 accuracy (%) comparison on the three largest datasets for networks without (None) and with SRTG.

| Dataset | r3d-50 None | r3d-50 SRTG | r(2+1)d-50 None | r(2+1)d-50 SRTG | r3d-101 None | r3d-101 SRTG |
|---|---|---|---|---|---|---|
| HACS | 78.361 | 80.362 (+2.0) | 81.340 | 83.474 (+2.1) | 80.492 | 81.659 (+1.1) |
| K-700 | 49.083 | 53.522 (+4.4) | 49.927 | 54.174 (+4.2) | 52.583 | 56.462 (+3.8) |
| MiT | 28.165 | 30.717 (+2.5) | 29.359 | 31.603 (+3.3) | 31.466 | 33.564 (+2.0) |
To assess the transferability of the features learned with the proposed SRTG block, we additionally include transfer learning experiments with different pre-training datasets, fine-tuning on both UCF-101 and HMDB-51. In this way, individual dataset biases can be alleviated, as all three large-scale datasets can be used for pre-training given their size. Overall, these tests give a significantly clearer overview of the true improvements and additionally probe the generalization capabilities of the learned features.
As shown in Table 4, the accuracy rates remain consistent across the datasets used for pre-training. This consistency is likely due to both the large size of the pre-training datasets and the overall robustness of the proposed method. In all, the average offset between the pre-trained models is 0.49% for UCF-101 and 0.62% for HMDB-51, which corresponds to only minor changes in accuracy depending on the pre-training dataset, further reinforcing that the observed improvements are due to the inclusion of SRTG blocks in the network.
Table 4: Transfer learning results: top-1 accuracy (%) on UCF-101 and HMDB-51 after fine-tuning, for different pre-training datasets.

| Model | Pre-training | GFLOPs | UCF-101 top-1 (%) | HMDB-51 top-1 (%) |
|---|---|---|---|---|
| SRTG r3d-34 | HACS | 110.48 | 94.799 | 74.319 |
| | HACS+K-700 | | 95.842 | 74.183 |
| | HACS+MiT | | 95.166 | 74.235 |
| SRTG r(2+1)d-34 | HACS | 110.8 | 94.149 | 72.861 |
| | HACS+K-700 | | 94.569 | 73.217 |
| | HACS+MiT | | 95.648 | 74.473 |
| SRTG r3d-50 | HACS | 150.98 | 95.756 | 75.650 |
| | HACS+K-700 | | 96.853 | 75.972 |
| | HACS+MiT | | 96.533 | 76.014 |
| SRTG r(2+1)d-50 | HACS | 151.6 | 95.675 | 75.297 |
| | HACS+K-700 | | 95.993 | 75.743 |
| | HACS+MiT | | 96.278 | 75.988 |
| SRTG r3d-101 | HACS | 171.02 | 97.325 | 77.536 |
| | HACS+K-700 | | 97.404 | 78.026 |
| | HACS+MiT | | 97.568 | 78.419 |
We implement a novel SRTG block for creating time-consistent features, which uses an LSTM (SR) to model multi-frame feature dynamics, and a temporal gate (TG) to evaluate the temporal cyclic consistency between the discovered dynamics and the modeled features. The proposed SRTG blocks add only a minuscule computational overhead to the overall network, making them very efficient on both forward and backward passes. Using common 3D/(2+1)D ResNets as backbone architectures, we show a consistent improvement of 3.53% over the vanilla networks, across five architectures and five datasets, and we obtain results that are on par with, and in most cases outperform, the current state-of-the-art. This shows how multi-frame temporal attention can further benefit local temporal neighborhoods and enhance their features.
This publication is supported by the Netherlands Organization for Scientific Research (NWO) with a TOP-C2 grant for “Automatic recognition of bodily interactions” (ARBITER).
Hara, K., Kataoka, H., and Satoh, Y. (2018). Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Computer Vision and Pattern Recognition (CVPR).
Stergiou, A. and Poppe, R. (2019). Spatio-temporal FAST 3D convolutions for human action recognition. In International Conference on Machine Learning Applications (ICMLA), pp. 1830–1834.