Learning spatio-temporal dynamics is key to video understanding. While extending standard convolution to space and time has been actively investigated for this purpose in recent years [45, 1, 47], the empirical results so far indicate that spatio-temporal convolution alone is not sufficient for grasping the whole picture; it often learns irrelevant context bias rather than motion information, and thus the additional use of optical flow turns out to boost performance in most cases [1, 31]. Motivated by this, recent action recognition methods learn to extract explicit motion, i.e., flow or correspondence, between feature maps of adjacent frames to improve performance [29, 24]. But is it essential to extract such an explicit form of flow or correspondence? How can we learn a richer and more robust form of motion information for videos in the wild?
In this paper, we propose to learn spatio-temporal self-similarity (STSS) representation for video understanding. Self-similarity is a relational descriptor for an image that effectively captures intra-structures by representing each local region as similarities to its spatial neighbors. As illustrated in Fig. 1, given a sequence of frames, i.e., a video, it extends along time and thus represents each local region as similarities to its neighbors in space and time. By converting appearance features into relational values, STSS enables a learner to better recognize structural patterns in space and time. For neighbors in the same frame it computes a spatial self-similarity map, while for neighbors in a different frame it extracts a motion likelihood map. Note that if we fix our attention to the similarity map to the very next frame within STSS and attempt to extract a single displacement vector to the most likely position in that frame, the problem reduces to optical flow, which is a limited type of motion information. In contrast, we leverage the whole volume of STSS and let our model learn to extract a generalized motion representation from it in an end-to-end manner. With a sufficient volume of neighborhood in space and time, it effectively captures long-term interaction and fast motion in the video, leading to robust action recognition.
We introduce a neural block for STSS representation, dubbed SELFY, that can be easily inserted into neural architectures and learned end-to-end without additional supervision. Our experimental analysis demonstrates its superiority over previous methods for motion modeling as well as its complementarity to spatio-temporal features from direct convolutions. On the standard benchmarks for action recognition, Something-Something V1&V2, Diving-48, and FineGym, the proposed method achieves state-of-the-art results.
2 Related Work
Video action recognition. Video action recognition aims to categorize videos into pre-defined action classes, and one of the main issues in action recognition is capturing temporal dynamics in videos. With modern neural networks, previous methods attempt to learn temporal dynamics in different ways: two-stream networks with external optical flow [41, 49], recurrent networks, temporal pooling methods [9, 25], and 3D CNNs [45, 1]. Recent methods have introduced advanced 3D CNNs [47, 46, 8, 31, 6] and shown their effectiveness in capturing spatio-temporal features, so that 3D CNNs have become the de facto approach to learning temporal dynamics in video. However, spatio-temporal convolution is vulnerable unless relevant features are well aligned across frames within the fixed-size kernel. To address this issue, a few methods adaptively translate the kernel offsets with deformable convolutions [56, 27], while several methods [7, 28] adjust other hyper-parameters, e.g., using higher frame rates or larger spatial receptive fields. Unlike these methods, we address the limitation of spatio-temporal convolution with a sufficient volume of STSS, capturing far-sighted spatio-temporal relations.
Learning motion features. Since using external optical flow benefits 3D CNNs in action recognition accuracy [1, 58, 47], several methods propose to learn frame-by-frame motion features from RGB sequences inside neural architectures. Some methods [5, 36] internalize TV-L1 optical flow into the CNN. Frame-wise feature differences [44, 26, 16, 29] are also utilized as motion features. Recent correlation-based methods [48, 24] adopt a correlation operator [4, 43, 53] to learn motion features between adjacent frames. However, these methods compute frame-by-frame motion features between two adjacent frames only and then rely on stacked spatio-temporal convolutions for capturing long-range motion dynamics. In contrast, we propose to learn STSS features, as generalized motion features, that enable capturing both short-term and long-term interactions in the video.
Self-similarity. Self-similarity describes an internal geometric layout of images and has been used in many computer vision tasks, such as object detection [14] and semantic correspondence matching. Shechtman and Irani introduce the concept of self-similarity for images and videos and use it as a hand-crafted local descriptor for action detection. Inspired by this work, early methods adopt self-similarities for capturing view-invariant temporal patterns [18, 17, 23], but they use only temporal self-similarities due to computational costs. In recent years, STSS has rarely been explored, although the non-local approach [50, 32] to video representation can be viewed as using STSS for re-weighting or aligning visual features. Different from these, we advocate using STSS directly as relational features. Our method leverages the full STSS as generalized motion information and learns an effective representation for action recognition within a video-processing architecture. To the best of our knowledge, our work is the first to learn STSS representation using modern neural networks.
The contribution of our paper can be summarized as follows. First, we revisit the notion of self-similarity and propose to learn a generalized, far-sighted motion representation from STSS. Second, we implement STSS representation learning as a neural block, dubbed SELFY, that can be integrated into existing neural architectures. Third, we provide comprehensive evaluations of SELFY, achieving the state of the art on benchmarks: Something-Something V1&V2, Diving-48, and FineGym.
3 Our approach
In this section, we first revisit the notion of self-similarity and discuss its relation to motion. We then introduce our method for learning an effective spatio-temporal self-similarity representation, which can be easily integrated into video-processing architectures and learned end-to-end.
3.1 Self-Similarity Revisited
Self-similarity is a relational descriptor that suppresses variations in appearance and reveals structural patterns.
Given an image feature map $\mathbf{F} \in \mathbb{R}^{X \times Y \times C}$, the self-similarity transformation of $\mathbf{F}$ results in a 4D tensor $\mathbf{S} \in \mathbb{R}^{X \times Y \times U \times V}$, whose elements are defined as

$$\mathbf{S}_{x,y,u,v} = \mathrm{sim}(\mathbf{F}_{x,y}, \mathbf{F}_{x+u,\,y+v}),$$

where $\mathrm{sim}(\cdot, \cdot)$ is a similarity function, e.g., cosine similarity. Here, $(x, y)$ is a query coordinate while $(u, v)$ is a spatial offset from it. To impose locality, the offset is restricted to a neighborhood: $u \in [-d_U, d_U]$ and $v \in [-d_V, d_V]$, so that $U = 2 d_U + 1$ and $V = 2 d_V + 1$, respectively. By converting the $C$-dimensional appearance feature $\mathbf{F}_{x,y}$ into the $UV$-dimensional relational feature $\mathbf{S}_{x,y}$, the transformation suppresses variations in appearance and reveals spatial structures in the image. Note that the self-similarity transformation closely relates to conventional cross-similarity (or correlation) across two different feature maps $(\mathbf{F}, \mathbf{F}')$, which can be defined as

$$\mathbf{S}_{x,y,u,v} = \mathrm{sim}(\mathbf{F}_{x,y}, \mathbf{F}'_{x+u,\,y+v}).$$
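To make the definition concrete, the transformation above can be sketched in a few lines of NumPy. The choice of cosine similarity, the zero padding at image borders, and the function name are our own illustrative assumptions, not prescribed details.

```python
import numpy as np

def self_similarity(F, dU=1, dV=1):
    """Spatial self-similarity of a feature map F of shape (X, Y, C).

    Returns a 4D tensor S of shape (X, Y, U, V) with U = 2*dU + 1 and
    V = 2*dV + 1, where S[x, y, u, v] is the cosine similarity between
    the feature at (x, y) and its neighbor at (x + u - dU, y + v - dV).
    Out-of-bounds neighbors are handled by zero padding.
    """
    X, Y, C = F.shape
    # Unit-normalize channel vectors so dot products become cosine similarities.
    Fn = F / (np.linalg.norm(F, axis=-1, keepdims=True) + 1e-8)
    P = np.zeros((X + 2 * dU, Y + 2 * dV, C))
    P[dU:dU + X, dV:dV + Y] = Fn
    S = np.empty((X, Y, 2 * dU + 1, 2 * dV + 1))
    for u in range(2 * dU + 1):
        for v in range(2 * dV + 1):
            # Shifted window of the padded map aligns neighbor (u-dU, v-dV).
            S[:, :, u, v] = np.sum(Fn * P[u:u + X, v:v + Y], axis=-1)
    return S
```

The zero-offset slice `S[:, :, dU, dV]` compares each position with itself, so it is identically one under cosine similarity.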
Given two images of a moving object, the cross-similarity transformation effectively captures motion information and thus is commonly used in optical flow and correspondence estimation [4, 43, 53].
For a sequence of frames, i.e., a video, one can naturally extend the spatial self-similarity along the temporal axis. Let $\mathbf{V} \in \mathbb{R}^{T \times X \times Y \times C}$ denote a feature map of the video with $T$ frames. The spatio-temporal self-similarity (STSS) transformation of $\mathbf{V}$ results in a 6D tensor $\mathbf{S} \in \mathbb{R}^{T \times X \times Y \times L \times U \times V}$, whose elements are defined as

$$\mathbf{S}_{t,x,y,l,u,v} = \mathrm{sim}(\mathbf{V}_{t,x,y}, \mathbf{V}_{t+l,\,x+u,\,y+v}),$$

where $(t, x, y)$ is a query coordinate and $(l, u, v)$ is a spatio-temporal offset from the query. In addition to the locality of spatial offsets above, the temporal offset is also restricted to a temporal neighborhood: $l \in [-d_L, d_L]$, so that $L = 2 d_L + 1$.
What types of information does STSS describe? Interestingly, for each time $t$, the STSS tensor can be decomposed along the temporal offset $l$ into a single spatial self-similarity tensor (when $l = 0$) and $L - 1$ spatial cross-similarity tensors (when $l \neq 0$); the partial tensors with a small offset (e.g., $l = \pm 1$) collect motion information from adjacent frames, while those with larger offsets capture it from frames further away, both forward and backward in time. Unlike previous approaches to learning motion [4, 48, 24], which rely on cross-similarity between adjacent frames, STSS allows us to take a generalized, far-sighted view on motion, i.e., both short-term and long-term, both forward and backward, as well as spatial self-motion.
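The spatio-temporal extension can be sketched the same way as the spatial case. As before, cosine similarity and zero padding outside the clip volume are our own illustrative assumptions.

```python
import numpy as np

def stss(V, dL=1, dU=1, dV_=1):
    """Spatio-temporal self-similarity of video features V of shape (T, X, Y, C).

    Returns a 6D tensor S of shape (T, X, Y, L, U, W) with L = 2*dL + 1,
    U = 2*dU + 1, W = 2*dV_ + 1, where S[t, x, y, l, u, v] compares the
    feature at (t, x, y) with its neighbor at
    (t + l - dL, x + u - dU, y + v - dV_). Cosine similarity; zero
    padding outside the video volume.
    """
    T, X, Y, C = V.shape
    Vn = V / (np.linalg.norm(V, axis=-1, keepdims=True) + 1e-8)
    P = np.zeros((T + 2 * dL, X + 2 * dU, Y + 2 * dV_, C))
    P[dL:dL + T, dU:dU + X, dV_:dV_ + Y] = Vn
    L, U, W = 2 * dL + 1, 2 * dU + 1, 2 * dV_ + 1
    S = np.empty((T, X, Y, L, U, W))
    for l in range(L):
        for u in range(U):
            for v in range(W):
                # Shifted window aligns the neighbor at offset (l-dL, u-dU, v-dV_).
                S[:, :, :, l, u, v] = np.sum(
                    Vn * P[l:l + T, u:u + X, v:v + Y], axis=-1)
    return S
```

Slicing at $l = 0$ (index `dL`) recovers spatial self-similarity, and the remaining slices are the cross-similarity maps to past and future frames.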
3.2 Spatio-temporal self-similarity representation learning
By leveraging the rich information in STSS, we propose to learn a generalized motion representation for video understanding. To achieve this goal without additional supervision, we design a neural block, dubbed SELFY, which can be inserted into video-processing architectures and learned end-to-end. The overall structure is illustrated in Fig. 2. It consists of three steps: self-similarity transformation, feature extraction, and feature integration.
Given the input video feature tensor, the self-similarity transformation step converts it into the STSS tensor as described in Sec. 3.1. In the following, we describe the feature extraction and integration steps.
3.2.1 Feature extraction
From the STSS tensor $\mathbf{S}$, we extract a $C_F$-dimensional feature for each spatio-temporal position $(t, x, y)$ and temporal offset $l$, so that the resultant tensor is $\mathbf{Z} \in \mathbb{R}^{T \times X \times Y \times L \times C_F}$, which is equivariant to translation in space, time, and temporal offset. The temporal-offset dimension $L$ is preserved to extract motion information across different temporal offsets in a consistent manner. While there exist many design choices, we introduce three methods for feature extraction in this work.
Soft-argmax. The first method is to compute explicit displacement fields from $\mathbf{S}$, as previous motion learning methods do using spatial cross-similarity [4, 43, 53]. One may extract the displacement field by indexing the position with the highest similarity value via argmax, but that is not differentiable. We instead use soft-argmax, which aggregates displacement vectors with softmax weighting (Fig. 2(a)). The soft-argmax feature extraction can be formulated as

$$\mathbf{Z}_{t,x,y,l} = \sum_{u,v} \frac{\exp(\mathbf{S}_{t,x,y,l,u,v} / \tau)}{\sum_{u',v'} \exp(\mathbf{S}_{t,x,y,l,u',v'} / \tau)} \begin{bmatrix} u \\ v \end{bmatrix},$$

which results in a feature tensor $\mathbf{Z} \in \mathbb{R}^{T \times X \times Y \times L \times 2}$. The temperature factor $\tau$ adjusts the softmax distribution; we fix its value in our experiments.
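A minimal NumPy sketch of this soft-argmax extraction follows; the centering of the offset grid and the max-subtraction for numerical stability are implementation choices of ours.

```python
import numpy as np

def soft_argmax(S, tau=0.01):
    """Soft-argmax feature extraction from an STSS tensor.

    S: (T, X, Y, L, U, V) similarity tensor. For each (t, x, y, l), the
    displacement vectors (u, v) are averaged with softmax weights over
    the (U, V) similarity map, giving a tensor of shape (T, X, Y, L, 2).
    Offsets are centered so that (0, 0) is the query position.
    """
    T, X, Y, L, U, V = S.shape
    logits = S / tau
    logits -= logits.max(axis=(-2, -1), keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=(-2, -1), keepdims=True)             # softmax over (U, V)
    u = np.arange(U) - (U - 1) // 2                       # centered u-offsets
    v = np.arange(V) - (V - 1) // 2                       # centered v-offsets
    du = (w * u[:, None]).sum(axis=(-2, -1))              # expected u-offset
    dv = (w * v[None, :]).sum(axis=(-2, -1))              # expected v-offset
    return np.stack([du, dv], axis=-1)
```

With a small temperature, the softmax is nearly one-hot, so the output approaches the hard-argmax displacement while remaining differentiable.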
Multi-layer perceptron (MLP). The second method is to learn an MLP that converts self-similarity values into a feature. For this, we flatten the $(U, V)$ volume into $UV$-dimensional vectors and apply an MLP to them (Fig. 2(b)). For the reshaped tensor $\mathbf{S} \in \mathbb{R}^{T \times X \times Y \times L \times UV}$, a perceptron can be expressed as

$$f_{\mathrm{perceptron}}(\mathbf{S}) = \sigma(\mathbf{S} \times_5 \mathbf{W}),$$

where $\times_n$ denotes the $n$-mode tensor product, $\mathbf{W} \in \mathbb{R}^{C_F \times UV}$ is the perceptron parameters, and $\sigma$ is a non-linear activation. The MLP feature extraction, a composition of such perceptrons, can thus be formulated as

$$\mathbf{Z} = f_{\mathrm{MLP}}(\mathbf{S}) = (f_{\mathrm{perceptron}}^{(n)} \circ \cdots \circ f_{\mathrm{perceptron}}^{(1)})(\mathbf{S}),$$

which produces a feature tensor $\mathbf{Z} \in \mathbb{R}^{T \times X \times Y \times L \times C_F}$. This method is more flexible and may also be more effective than the soft-argmax, because not only can it encode displacement information, but it can also directly access the similarity values, which may be helpful for learning motion distributions.
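The $n$-mode product over the flattened similarity map reduces to an ordinary matrix multiplication applied at every position. A sketch, assuming ReLU activations between layers (the activation choice is ours):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_features(S, weights):
    """MLP feature extraction: flatten the (U, V) similarity map of each
    (t, x, y, l) position into a UV-dimensional vector and pass it
    through an MLP, sharing weights across all positions and offsets.

    S: (T, X, Y, L, U, V); weights: list of (in_dim, out_dim) matrices,
    with weights[0].shape[0] == U * V. Returns a tensor of shape
    (T, X, Y, L, C_F) where C_F is the last layer's output dimension.
    """
    T, X, Y, L, U, V = S.shape
    Z = S.reshape(T, X, Y, L, U * V)     # n-mode product == matmul here
    for i, W in enumerate(weights):
        Z = Z @ W                        # position-wise linear map
        if i < len(weights) - 1:
            Z = relu(Z)                  # non-linearity between layers
    return Z
```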
Convolution. The third method is to learn convolution kernels over the $(U, V)$ volume of $\mathbf{S}$ (Fig. 2(c)). When we regard $\mathbf{S}$ as a 7D tensor in $\mathbb{R}^{T \times X \times Y \times L \times U \times V \times 1}$ with a trailing channel dimension, the convolution layer can be expressed as

$$f_{\mathrm{conv}}(\mathbf{S}) = \sigma(\mathbf{S} * \mathbf{K}),$$

where $\mathbf{K}$ is a multi-channel convolution kernel and $\sigma$ is a non-linear activation. Starting from the raw similarity values, we gradually downsample $(U, V)$ and expand channels via multiple convolutions with strides, finally collapsing the $(U, V)$ dimensions into a $C_F$-dimensional channel; we preserve the $L$ dimension, since maintaining fine temporal resolution is shown to be effective for capturing detailed motion information [31, 7]. In practice, we reshape $\mathbf{S}$ so that $(T, X, Y)$ acts as a batch dimension and then apply regular 3D convolutions along the $(L, U, V)$ dimensions. The convolutional feature extraction with $n$ layers can thus be formulated as

$$\mathbf{Z} = f_{\mathrm{ConvNet}}(\mathbf{S}) = (f_{\mathrm{conv}}^{(n)} \circ \cdots \circ f_{\mathrm{conv}}^{(1)})(\mathbf{S}),$$

which results in a feature tensor $\mathbf{Z} \in \mathbb{R}^{T \times X \times Y \times L \times C_F}$. This method is effective in learning structural patterns with its convolution kernels, thus outperforming the former methods, as will be seen in our experiments.
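One strided layer of this convolution over the $(U, V)$ dimensions can be sketched as follows; the kernel size, 'valid' padding, and stride here are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def conv_uv(S, K, stride=2):
    """One multi-channel convolution over the (U, V) dims of an STSS
    feature tensor -- the downsampling step of the convolutional
    extractor (a sketch; the method stacks several such layers).

    S: (T, X, Y, L, U, V, C_in), with C_in = 1 for the raw tensor;
    K: (k, k, C_in, C_out) kernel.
    Returns (T, X, Y, L, U', V', C_out) using 'valid' strided windows.
    """
    T, X, Y, L, U, V, Ci = S.shape
    k, _, _, Co = K.shape
    Uo = (U - k) // stride + 1
    Vo = (V - k) // stride + 1
    out = np.zeros((T, X, Y, L, Uo, Vo, Co))
    for i in range(Uo):
        for j in range(Vo):
            # (..., k, k, Ci) window at this output position
            patch = S[..., i * stride:i * stride + k,
                           j * stride:j * stride + k, :]
            # Contract the window and input channels against the kernel.
            out[..., i, j, :] = np.tensordot(
                patch, K, axes=([-3, -2, -1], [0, 1, 2]))
    return out
```

Repeating such layers with strides shrinks $(U, V)$ while growing the channel dimension, eventually yielding the $C_F$-dimensional feature per $(t, x, y, l)$.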
3.2.2 Feature integration
In this step, we integrate the extracted STSS features $\mathbf{Z}$ to feed them back into the original input stream of $(T, X, Y)$ volume.

We first use spatio-temporal convolution kernels along the $(X, Y, L)$ dimensions of $\mathbf{Z}$. The convolution layer can be expressed as

$$g_{\mathrm{conv}}(\mathbf{Z}) = \sigma(\mathbf{Z} * \mathbf{K}'),$$

where $\mathbf{K}'$ is a multi-channel convolution kernel. This type of convolution integrates the extracted STSS features by extending receptive fields along the $(X, Y, L)$ dimensions. In practice, we reshape $\mathbf{Z}$ so that $T$ acts as a batch dimension and then apply regular 3D convolutions along the $(X, Y, L)$ dimensions. The resultant feature tensor is $\mathbf{Z}' \in \mathbb{R}^{T \times X \times Y \times L \times C_F}$.

We then flatten the $(L, C_F)$ volume into $LC_F$-dimensional vectors to obtain $\mathbf{Z}' \in \mathbb{R}^{T \times X \times Y \times LC_F}$, and apply a $1 \times 1 \times 1$ convolution layer to obtain the final output. This convolution layer integrates features from different temporal offsets and also adjusts the channel dimension to fit that of the original input. The final output tensor is expressed as

$$\mathbf{Z}^{\mathrm{out}} = \mathbf{Z}' \times_4 \mathbf{W}' \in \mathbb{R}^{T \times X \times Y \times C},$$

where $\times_4$ is the 4-mode tensor product and $\mathbf{W}' \in \mathbb{R}^{C \times LC_F}$ is the weights of the $1 \times 1 \times 1$ convolution layer.
Finally, we combine the resultant STSS representation with the input feature by element-wise addition, making SELFY act as a residual block.
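The last two steps above, the $1 \times 1 \times 1$ convolution over the flattened offsets and the residual fusion, reduce to a position-wise linear map followed by an addition. A sketch (the preceding spatio-temporal convolutions along $(X, Y, L)$ are omitted for brevity):

```python
import numpy as np

def integrate_and_fuse(Z, V, W):
    """Feature integration sketch: flatten the (L, C_F) dims of the
    extracted STSS features, mix them with a 1x1x1 convolution (a
    position-wise linear map W), and fuse the result into the input
    stream by element-wise addition, as in a residual block.

    Z: (T, X, Y, L, C_F) extracted STSS features;
    V: (T, X, Y, C) original input features;
    W: (L * C_F, C) weights of the 1x1x1 convolution.
    """
    T, X, Y, L, Cf = Z.shape
    Zf = Z.reshape(T, X, Y, L * Cf)   # flatten temporal offsets & channels
    return V + Zf @ W                  # project to C channels, residual add
```

Because the STSS branch is additive, initializing `W` near zero lets the block start as an identity mapping and learn its contribution gradually.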
4 Experiments

Table 1. Performance comparison on Something-Something V1 & V2.

| Model | Flow | Frames | FLOPs × clips | SS-V1 top-1 | SS-V1 top-5 | SS-V2 top-1 | SS-V2 top-5 |
|---|---|---|---|---|---|---|---|
| TSN-R50 from | | 8 | 33G × 1 | 19.7 | 46.6 | 30.0 | 60.5 |
| TRN-BNIncep | | 8 | 16G × N/A | 34.4 | - | 48.8 | - |
| TRN-BNIncep Two-Stream | ✓ | 8+8 | 16G × N/A | 42.0 | - | 55.5 | - |
| SELFYNet-R50 (ours) | | 8 | 37G × 1 | 50.7 | 79.3 | 62.7 | 88.0 |
| I3D from | | 32 | 153G × 2 | 41.6 | 72.2 | - | - |
| NL-I3D from | | 32 | 168G × 2 | 44.4 | 76.0 | - | - |
| TSM-R50 | | 16 | 65G × 1 | 47.3 | 77.1 | 61.2 | 86.9 |
| TSM-R50 Two-Stream from | ✓ | 16+16 | 129G × 1 | 52.6 | 81.9 | 65.0 | 89.4 |
| CorrNet-R101 | | 32 | 187G × 10 | 50.9 | - | - | - |
| STM-R50 | | 16 | 67G × 30 | 50.7 | 80.4 | 64.2 | 89.8 |
| TEA-R50 | | 16 | 70G × 30 | 52.3 | 81.9 | - | - |
| MSNet-TSM-R50 | | 16 | 67G × 1 | 52.1 | 82.3 | 64.7 | 89.4 |
| MSNet-TSM-R50 | | 16+8 | 101G × 10 | 55.1 | 84.0 | 67.1 | 91.0 |
| SELFYNet-TSM-R50 (ours) | | 8 | 37G × 1 | 52.5 | 80.8 | 64.5 | 89.4 |
| SELFYNet-TSM-R50 (ours) | | 16 | 77G × 1 | 54.3 | 82.9 | 65.7 | 89.8 |
| SELFYNet-TSM-R50 (ours) | | 8+16 | 114G × 1 | 55.8 | 83.9 | 67.4 | 91.0 |
| SELFYNet-TSM-R50 (ours) | | 8+16 | 114G × 2 | 56.6 | 84.4 | 67.7 | 91.1 |
4.1 Implementation details
We use TSN and TSM with ResNet as our CNN backbones. TSM obtains the effect of spatio-temporal convolutions from spatial convolutions by shifting a part of the input channels along the temporal axis before each convolution operation; it is added to each residual block of the ResNet. We adopt ImageNet pre-trained weights for our backbones. To transform a backbone into the self-similarity network (SELFYNet), we insert a single SELFY block after the third stage of the backbone with additive fusion. For the SELFY block, we use convolution as the default feature extraction method and use multi-channel convolution kernels. For more details, please refer to supplementary materials A and B.
Training & testing. For training, we sample a clip of 8 or 16 frames from each video using segment-based sampling. The spatio-temporal matching region of the SELFY block is set according to whether 8 or 16 frames are used. For testing, we sample one or two clips from a video, crop their centers, and evaluate the averaged prediction of the sampled clips. For more details, please refer to the supplementary material A.
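Segment-based sampling can be sketched as follows; the random-within-segment choice during training and the segment-center choice at test time follow the usual TSN-style recipe, while the exact tie-breaking for very short videos is our own assumption.

```python
import random

def segment_sample(num_frames, num_segments=8, training=True):
    """Segment-based frame sampling (TSN-style sketch): split the video
    into num_segments equal chunks and pick one frame per chunk --
    uniformly at random during training, the chunk center at test time.
    Returns frame indices in temporal order.
    """
    seg = num_frames / num_segments
    idx = []
    for i in range(num_segments):
        lo = int(i * seg)
        hi = max(int((i + 1) * seg) - 1, lo)   # guard against empty segments
        idx.append(random.randint(lo, hi) if training else (lo + hi) // 2)
    return idx
```

This keeps the sampled frames spread over the whole video, so even an 8-frame clip covers long-range temporal context.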
4.2 Datasets

For evaluation, we use benchmarks that contain fine-grained spatio-temporal dynamics in videos.

Something-Something V1 & V2 (SS-V1 & V2), both large-scale action recognition datasets, contain 108k and 220k video clips, respectively. The two datasets share the same 174 action classes, labeled with phrases such as 'pretending to put something next to something.'

Diving-48, which contains 18k videos with 48 different diving action classes, is an action recognition dataset designed to minimize contextual biases, i.e., scenes or objects.

FineGym is a fine-grained action dataset built on top of gymnastics videos. We adopt the Gym288 and Gym99 sets for experiments, which contain 288 and 99 classes, respectively.
Table 2. Performance comparison on Diving-48.

| Model | Frames | FLOPs × clips | Top-1 |
|---|---|---|---|
| TSN from | - | - | 16.8 |
| TRN from | - | - | 22.8 |
| P3D from | 16 | N/A × 1 | 32.4 |
| C3D from | 16 | N/A × 1 | 34.5 |
| GST-R50 | 16 | 59G × 1 | 38.8 |
| CorrNet-R101 | 32 | 187G × 10 | 38.2 |
| GSM-IncV3 | 16 | 54G × 2 | 40.3 |
| TSM-R50 (our impl.) | 16 | 65G × 2 | 38.8 |
| SELFYNet-TSM-R50 (ours) | 16 | 77G × 2 | 41.6 |
Table 3. Performance comparison on FineGym.

| Model | Frames | Gym288 top-1 | Gym99 top-1 |
|---|---|---|---|
| NL I3D | 8 | 27.1 | 62.1 |
| TSM Two-Stream | N/A | 46.5 | 81.2 |
| TSM-R50 (our impl.) | 3 | 35.3 | 73.7 |
| TSM-R50 (our impl.) | 8 | 47.9 | 86.6 |
4.3 Comparison with the state-of-the-art methods
Table 1 summarizes the results on SS-V1&V2. The first and second compartments of the table show the results of 2D CNN and (pseudo-)3D CNN models, respectively, and the last part of each compartment shows the results of SELFYNet. SELFYNet with TSN-ResNet (SELFYNet-TSN-R50) achieves 50.7% and 62.7% top-1 accuracy on SS-V1 and SS-V2, respectively, outperforming the other 2D models while using only 8 frames. When we adopt TSM-ResNet (TSM-R50) as our backbone and use 16 frames, our method (SELFYNet-TSM-R50) achieves 54.3% and 65.7% top-1 accuracy, respectively, which is the best among the single models. Compared to TSM-R50, a single SELFY block brings significant gains of 7.0%p and 4.5%p in top-1 accuracy, respectively; our method is even more accurate than the TSM-R50 two-stream model on both datasets. Finally, our ensemble model (SELFYNet-TSM-R50, 8+16 frames) with 2-clip evaluation sets a new state of the art on both datasets, achieving 56.6% and 67.7% top-1 accuracy, respectively.
Tables 2 and 3 summarize the results on Diving-48 and FineGym. On Diving-48, TSM-R50 using 16 frames achieves 38.8% top-1 accuracy in our implementation. SELFYNet-TSM-R50 outperforms TSM-R50 by 2.8%p, setting a new state-of-the-art top-1 accuracy of 41.6% on Diving-48. On FineGym, SELFYNet-TSM-R50 achieves 49.5% and 87.7% top-1 accuracy on Gym288 and Gym99, respectively, surpassing all the other reported models.
4.4 Ablation studies
We conduct ablation experiments to demonstrate the effectiveness of the proposed method. All experiments are performed on SS-V1 using 8 frames. Unless specified otherwise, we use ImageNet pre-trained TSM ResNet-18 (TSM-R18) with a single SELFY block as our default SELFYNet.
Types of similarity. In Table 3(a), we investigate the effect of different types of similarity by varying the set of temporal offsets on both TSN-ResNet-18 (TSN-R18) and TSM-R18. Interestingly, learning spatial self-similarity alone improves accuracy on both backbones, which implies that self-similarity features help capture structural patterns of visual features. Learning cross-similarity with a short temporal range shows a noticeable gain in accuracy on both backbones, indicating the significance of motion features. Learning STSS outperforms the other types of similarity, and the accuracy of SELFYNet increases as the temporal range becomes longer. By taking a far-sighted view on motion, STSS learns both short-term and long-term interactions in videos, as well as spatial self-similarity.
Feature extraction and integration methods. In Table 3(b), we compare the performance of different combinations of feature extraction and integration methods. From the second to the fourth row, different feature extraction methods are compared, fixing the feature integration method to a single fully-connected (FC) layer. Compared to the baseline, the use of soft-argmax, which extracts spatial displacement features, improves the top-1 accuracy by 1.0%p. Replacing soft-argmax with the MLP provides an additional gain of 1.9%p in top-1 accuracy, showing the effectiveness of directly using similarity values. When using the convolution method for feature extraction, we achieve 46.7% top-1 accuracy; the multi-channel convolution kernel is more effective in capturing structural patterns over the similarity map than the MLP. From the fourth to the sixth row, different feature integration methods are compared, fixing the feature extraction method to convolution. Replacing the single FC layer with an MLP improves the top-1 accuracy by 0.6%p, and replacing the MLP with convolution layers further improves it to 48.4% top-1 accuracy. These results demonstrate that our design choice of using convolutions for both feature extraction and integration is the most effective in learning geometry-aware STSS representation.
Comparison with correlation-based methods. We also investigate the difference between our method and correlation-based methods [24, 48]. While correlation-based methods extract motion features only from the spatial cross-similarity tensor between two adjacent frames, and are thus limited to short-term motion, our method effectively captures bi-directional and long-term motion information by learning from a sufficient volume of STSS. Our method can also exploit richer information from the self-similarity values than the other methods: the MS module focuses only on the maximal similarity values to extract flow information, and the Correlation block uses a convolution layer for extracting motion features from the similarity values. In contrast to these two methods, we introduce a generalized motion learning framework based on the self-similarity tensor in Section 3.2.
We have also conducted experiments to compare our method with MSNet, one of the correlation-based methods. For an apples-to-apples comparison, we apply the kernel soft-argmax and max pooling operation (KS+CM) to our feature extraction method, following their official code (https://github.com/arunos728/MotionSqueeze). Please note that, when we restrict the temporal offset to adjacent frames only, the SELFY block using KS+CM is equivalent to the MS module whose feature transformation layers are standard 2D convolution layers. Table 3(c) summarizes the results. The KS+CM method achieves 46.1% top-1 accuracy. As we enlarge the temporal window to 5, we obtain an additional gain of 1.3%p. The learnable convolution layers improve the top-1 accuracy by 1.0%p in both cases. These results demonstrate the effectiveness of learning geometric patterns within a sufficient volume of STSS tensors for learning motion features.
4.5 Complementarity of STSS features
We conduct experiments to analyze the different roles of spatio-temporal features and STSS features. We organize two basic blocks for representing the two types of features: a spatio-temporal convolution block (STCB) that consists of several spatio-temporal convolutions, and SELFY-s, a lightweight version of the SELFY block obtained by removing spatial convolution layers (Fig. 4). Both blocks have the same receptive fields and a similar number of parameters for a fair comparison. Different combinations of the basic blocks are inserted after the third stage of TSN-ResNet-18. Table 5 summarizes the results on SS-V1. STSS features are more effective than spatio-temporal features in top-1 and top-5 accuracy when the same number of blocks are inserted. Interestingly, the combination of the two different features shows better results in top-1 and top-5 accuracy than either feature alone, which demonstrates that the two features complement each other. We conjecture that this complementarity comes from their different characteristics: while spatio-temporal features are obtained by directly encoding appearance features, STSS features are obtained by suppressing variations in appearance and focusing on relational features in space and time.
4.6 Improving robustness with STSS
In this experiment, we demonstrate that STSS representation helps video-processing models become more robust to video corruptions. We test two types of corruption that are likely to occur in real-world videos: occlusion and motion blur. To induce the corruptions, we either cut out a rectangular patch from a particular frame or apply motion blur to it. We corrupt a single center frame of every clip of SS-V1 at test time and gradually increase the severity of the corruption. We compare the results of TSM-R18 and the SELFYNet variants of Table 3(a). Figures 4(a) and 4(b) summarize the results of the two corruptions, respectively. The top-1 accuracy of TSM-R18 and the SELFYNets with short temporal ranges drops significantly as the severity of the corruption increases. We conjecture that the features of the corrupted frame propagate through the stacked TSMs, confusing the entire network. In contrast, the SELFYNets with long temporal ranges are more robust than the other models. As shown in Figs. 4(a) and 4(b), the accuracy gap between the SELFYNets with long temporal ranges and the others widens as the severity of the corruption increases, indicating that a larger volume of STSS features can improve robustness in action recognition. We also present qualitative results (Fig. 4(c)) where two SELFYNets with different temporal ranges both answer correctly without corruption, while the one with the shorter range fails on the corrupted input.
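The occlusion corruption can be sketched as below; the square patch shape, its placement at the frame center, and zero-filling (rather than mean-filling) are our own illustrative choices for the protocol described above.

```python
import numpy as np

def occlude_center_frame(clip, patch_size):
    """Occlusion corruption used in the robustness test (sketch): cut
    out a square patch (set to zero) at the center of the middle frame
    of a clip. Severity grows with patch_size.

    clip: (T, H, W, C) float array; returns a corrupted copy, leaving
    the input untouched.
    """
    out = clip.copy()
    T, H, W, _ = clip.shape
    t = T // 2                               # middle frame of the clip
    y0 = max(H // 2 - patch_size // 2, 0)
    x0 = max(W // 2 - patch_size // 2, 0)
    out[t, y0:y0 + patch_size, x0:x0 + patch_size] = 0.0
    return out
```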
5 Conclusion

We have proposed to learn a generalized, far-sighted motion representation from STSS for video understanding. Our comprehensive analyses demonstrate that STSS features effectively capture both short-term and long-term interactions, complement spatio-temporal features, and improve the robustness of video-processing models. Our method outperforms other state-of-the-art methods on standard benchmarks for video action recognition.
References

-  (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §2, Table 3.
-  (2010) Gradient descent optimization of smoothed information retrieval metrics. Information retrieval 13 (3), pp. 216–235. Cited by: §3.2.1.
-  (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2015) Flownet: learning optical flow with convolutional networks. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §2, §3.1, §3.1, §3.2.1.
-  (2018) End-to-end learning of motion representation for video understanding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2020) RubiksNet: learnable 3d-shift for efficient video action recognition. In Proc. European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2019) Slowfast networks for video recognition. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §2, §3.2.1.
-  (2020) X3D: expanding architectures for efficient video recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2017) Attentional pooling for action recognition. arXiv preprint arXiv:1711.01467. Cited by: §2.
-  (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §A.
-  (2017) The” something something” video database for learning and evaluating visual common sense.. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §A, §2, §4.2.
-  (2016) Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.2.2.
-  (2019) Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261. Cited by: §4.6.
-  (2008) Deep networks for image retrieval on large-scale databases. In Proceedings of the 16th ACM international conference on Multimedia, pp. 643–646. Cited by: §2.
-  (2019) Local relation networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3464–3473. Cited by: 6(f), §B.
-  (2019) STM: spatiotemporal and motion encoding for action recognition. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §2, Table 1.
-  (2010) View-independent action recognition from temporal self-similarities. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Cited by: §2.
-  (2008) Cross-view action recognition from temporal self-similarities. In Proc. European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2019) Attentive spatio-temporal representation learning for diving classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Cited by: Table 2.
-  (2014) Large-scale video classification with convolutional neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.3.
-  (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §4.3.
-  (2015) DASC: dense adaptive self-correlation descriptor for multi-modal and multi-spectral correspondence. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2013) Temporal self-similarity for appearance-based action recognition in multi-view setups. In International Conference on Computer Analysis of Images and Patterns, pp. 163–171. Cited by: §2.
-  (2020) MotionSqueeze: neural motion feature learning for video understanding. arXiv preprint arXiv:2007.09933. Cited by: §1, §2, §3.1, §4.4, §4.4, Table 1, 3(c).
-  (2018) First person action recognition via two-stream convnet with long-term fusion pooling. Pattern Recognition Letters 112, pp. 161–167. Cited by: §2.
-  (2018) Motion feature network: fixed motion filter for action recognition. In Proc. European Conference on Computer Vision (ECCV), Cited by: §2, Table 1.
-  (2020) Spatio-temporal deformable 3d convnets with attention for action recognition. Pattern Recognition 98, pp. 107037. Cited by: §2.
-  (2020) SmallBigNet: integrating core and contextual views for video classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2020) TEA: temporal excitation and aggregation for action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, Table 1.
-  (2018) Resound: towards action recognition without representation bias. In Proc. European Conference on Computer Vision (ECCV), Cited by: §1, §A, §2, §4.2, Table 2.
-  (2019) Tsm: temporal shift module for efficient video understanding. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, §3.2.1, §4.1, Table 1, Table 3.
-  (2019) Learning video representations from correspondence proposals. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: 6(e), §2, §B, §B, Table 1.
-  (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §A.
-  (2019) Grouped spatial-temporal aggregation for efficient action recognition. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: Table 2.
-  (2020) Something-else: compositional action recognition with spatial-temporal interaction networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2019) Representation flow for action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2019) Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909. Cited by: 6(f), §B, §B.
-  (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §C.
-  (2020) Finegym: a hierarchical video dataset for fine-grained action understanding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §A, §2, §4.2, §4.3, Table 3.
-  (2007) Matching local self-similarities across images and videos. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.1.
-  (2014) Two-stream convolutional networks for action recognition in videos. In Proc. Neural Information Processing Systems (NeurIPS), Cited by: §2.
-  (2020) Gate-shift networks for video action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 2.
-  (2018) PWC-net: cnns for optical flow using pyramid, warping, and cost volume. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.1, §3.2.1.
-  (2018) Optical flow guided feature: a fast and robust motion representation for video action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2015) Learning spatiotemporal features with 3d convolutional networks. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
-  (2019) Video classification with channel-separated convolutional networks. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
-  (2018) A closer look at spatiotemporal convolutions for action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §2.
-  (2020) Video modeling with correlation networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.1, §4.4, Table 1, Table 2.
-  (2016) Temporal segment networks: towards good practices for deep action recognition. In Proc. European Conference on Computer Vision (ECCV), Cited by: §A, §2, §4.1, §4.1, Table 3.
-  (2018) Non-local neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: 6(e), §2, §B, Table 3.
-  (2018) Videos as space-time region graphs. In Proc. European Conference on Computer Vision (ECCV), pp. 399–417. Cited by: Table 1.
-  (2020) Temporal pyramid network for action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 1.
-  (2019) Volumetric correspondence networks for optical flow. In Proc. Neural Information Processing Systems (NeurIPS), Cited by: §2, §3.1, §3.2.1.
-  (2007) A duality based approach for realtime tv-l 1 optical flow. Pattern Recognition, pp. 214–223. Cited by: §2.
-  (2020) Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10076–10085. Cited by: §B.
-  (2018) Trajectory convolution for action recognition. In Proc. Neural Information Processing Systems (NeurIPS), Cited by: §2.
-  (2018) Temporal relational reasoning in videos. In Proc. European Conference on Computer Vision (ECCV), Cited by: Table 1, Table 3.
-  (2018) Eco: efficient convolutional network for online video understanding. In Proc. European Conference on Computer Vision (ECCV), Cited by: §2.
A Implementation details
Architecture details. We use TSN-ResNet and TSM-ResNet as our backbones (see Table 6) and initialize them with ImageNet pre-trained weights. We insert a single SELFY block into the backbone and use the convolution method as the default feature extraction method. We set the spatio-temporal matching region of the SELFY block according to the number of input frames (8 or 16). We stack four convolution layers for the feature extraction method and use four convolution layers for the feature enhancement. For computational efficiency, we reduce the spatial resolution of the video feature tensor to 14×14 before the self-similarity transformation. After the feature enhancement, we upsample the enhanced feature tensor to 28×28 for the residual connection.
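To make the transformation above concrete, the following is a minimal, unoptimized NumPy sketch of a spatio-temporal self-similarity computation. The window radii `L` and `U`, the zero values at frame boundaries, and the use of cosine similarity on pre-normalized features are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def stss(feats, L=2, U=2):
    """Spatio-temporal self-similarity (STSS) tensor sketch.

    feats: (T, H, W, C) array of L2-normalized frame features.
    For each position (t, h, w), similarities to neighbors within a
    temporal window of +/-L frames and a spatial window of +/-U positions
    are collected, giving a tensor of shape
    (T, H, W, 2L+1, 2U+1, 2U+1); out-of-bound neighbors stay zero.
    """
    T, H, W, C = feats.shape
    out = np.zeros((T, H, W, 2 * L + 1, 2 * U + 1, 2 * U + 1),
                   dtype=feats.dtype)
    for t in range(T):
        for h in range(H):
            for w in range(W):
                q = feats[t, h, w]
                for dt in range(-L, L + 1):
                    for dh in range(-U, U + 1):
                        for dw in range(-U, U + 1):
                            tt, hh, ww = t + dt, h + dh, w + dw
                            if 0 <= tt < T and 0 <= hh < H and 0 <= ww < W:
                                # cosine similarity of normalized features
                                out[t, h, w, dt + L, dh + U, dw + U] = \
                                    q @ feats[tt, hh, ww]
    return out
```

The similarity at the zero offset is always 1 for normalized features, so the informative signal lies in the off-center entries of each local similarity map.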
Training. We sample a clip of 8 or 16 frames from each video using segment-based sampling. We resize the sampled clips into 240×320 images and apply random scaling and horizontal flipping for data augmentation. When applying horizontal flipping on SS-V1&V2, we do not flip clips whose class labels include the word 'left' or 'right', e.g., 'pushing something from left to right.' We fit the augmented clips into a spatial resolution of 224×224. For SS-V1&V2, we set the initial learning rate to 0.01 and the number of training epochs to 50; the learning rate is decayed by 1/10 at two scheduled epochs. For Diving-48 and FineGym, we use a cosine learning rate schedule with the first 10 epochs for gradual warm-up. We set the initial learning rate to 0.01 and the number of training epochs to 30 and 40, respectively.
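As a sketch of the Diving-48/FineGym schedule described above (cosine decay with a base learning rate of 0.01 and 10 warm-up epochs), assuming a linear warm-up shape:

```python
import math

def lr_at_epoch(epoch, total_epochs=30, base_lr=0.01, warmup_epochs=10):
    """Cosine learning-rate schedule with gradual warm-up (sketch).

    Values follow the text (base LR 0.01, 10 warm-up epochs); the linear
    warm-up shape is an assumption.
    """
    if epoch < warmup_epochs:
        # linear warm-up from base_lr/warmup_epochs up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    # cosine decay over the remaining epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

The SS-V1&V2 step schedule (decay by 1/10 at fixed epochs) would replace the cosine branch with a piecewise-constant lookup.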
Testing. Given a video, we sample 1 or 2 clips, resize them into 240×320 images, and crop their centers to 224×224. We average the predictions of the sampled clips. We report top-1 and top-5 accuracy for SS-V1&V2 and Diving-48, and mean class accuracy for FineGym.
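A minimal sketch of this testing pipeline (center cropping plus clip-level prediction averaging); averaging softmax probabilities rather than raw logits is an assumption:

```python
import numpy as np

def center_crop(img, size=224):
    """Crop a (H, W, C) image at the center to (size, size, C)."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def video_prediction(clip_logits):
    """Average per-clip softmax predictions (1 or 2 clips per video)."""
    probs = []
    for logits in clip_logits:
        e = np.exp(logits - logits.max())   # numerically stable softmax
        probs.append(e / e.sum())
    return np.mean(probs, axis=0)
```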
Frame corruption details. We adopt two corruption types, occlusion and motion blur, to test the robustness of SELFYNet. We corrupt only a single center frame among the 8 input frames for every validation clip of SS-V1. For the occlusion, we cut out a rectangular region from the center of the frame. For the motion blur, we adopt the ImageNet-C implementation, which is available online (https://github.com/hendrycks/robustness). We set 6 levels of severity for each corruption. For the occlusion, we set the side length of the occluded region to 40px, 80px, 120px, 160px, 200px, and 224px from level 1 to level 6. For the motion blur, we increase the (radius, sigma) arguments of the implementation with the severity level.
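The occlusion corruption can be sketched as follows; the side lengths per severity level are those given above, while the zero fill value of the cutout is an assumption.

```python
import numpy as np

def occlude_center(frame, severity):
    """Cut out a square region from the center of a (H, W, C) frame.

    Side lengths per severity level 1-6 follow the text:
    40, 80, 120, 160, 200, 224 px (for 224x224 frames).
    The cutout is filled with zeros (an assumption).
    """
    sides = {1: 40, 2: 80, 3: 120, 4: 160, 5: 200, 6: 224}
    side = sides[severity]
    out = frame.copy()
    h, w = frame.shape[:2]
    top, left = max(0, (h - side) // 2), max(0, (w - side) // 2)
    out[top:top + side, left:left + side] = 0
    return out
```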
| Layers | TSN ResNet-50 / TSM ResNet-50 | Output size |
|---|---|---|
| conv | 1×7×7, 64, stride 1,2,2 | T×112×112 |
| pool | 1×3×3 max pool, stride 1,2,2 | T×56×56 |
| | global average pool, FC | # of classes |
B Additional experiments
We conduct additional experiments to further analyze the behavior of the proposed method. All experiments are performed on SS-V1 using 8 frames. Unless otherwise specified, we use ImageNet pre-trained TSM ResNet-18 (TSM-R18) with a single SELFY block as our default SELFYNet.
Spatial matching region. In Table 6(a), we compare a single SELFY block with different spatial matching regions. As expected, a larger spatial matching region leads to better accuracy. Considering the accuracy-computation trade-off, we set the default spatial matching region accordingly.
Block position. From the 2nd to the 6th row of Table 6(b), we examine the effect of different positions of the SELFY block in the backbone. We resize the spatial resolution of the video tensor to 14×14 and fix the matching region for all cases, maintaining a similar computational cost. The best-performing position shows the best trade-off, achieving the highest accuracy among the cases. The last row of Table 6(b) shows that multiple SELFY blocks improve accuracy compared to a single block.
Multi-channel kernel for feature extraction. We investigate the effect of the convolution method for STSS feature extraction when using multi-channel kernels. For the experiment, we stack four convolution layers followed by the feature integration step, the same as in Section 3.2.2. Table 6(d) summarizes the results; we do not report models whose temporal kernel window is incompatible with the matching region. As shown in the table, a longer temporal range indeed gives higher accuracy. However, the effect of the multi-channel kernel is comparable to that of the kernel in Table 3(a) in terms of accuracy. Considering the accuracy-computation trade-off, we fix the kernel size for the STSS feature extraction.
Fusing STSS features with visual features. We evaluate SELFYNet purely based on STSS features to see how much the ordinary visual features contribute to the final prediction. That is, we pass the STSS features into the downstream layers without the additive fusion. Table 6(c) compares the results of the different output tensors on SS-V1. Interestingly, SELFYNet using only the STSS features achieves 45.5% top-1 accuracy, which is 2.5%p higher than the baseline. As we add the visual features back, we obtain an additional gain of 2.9%p. This indicates that the STSS features and the visual features are complementary to each other.
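The additive fusion discussed above, together with the 14×14 → 28×28 upsampling mentioned in Section A, can be sketched as follows; nearest-neighbor interpolation is an assumption for the upsampling mode.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x spatial upsampling of a (T, H, W, C) tensor
    (the interpolation mode is an assumption)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def additive_fusion(visual, stss_feat):
    """Residual fusion: upsample the enhanced STSS feature back to the
    visual feature's resolution and add the two tensors elementwise."""
    return visual + upsample2x(stss_feat)
```

The STSS-only ablation above corresponds to passing `upsample2x(stss_feat)` downstream without the `visual` term.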
Comparison with non-local methods. We compare our method with popular non-local methods [50, 32], which capture long-range interactions in videos. While both methods compute global self-similarity tensors, they use them as attention weights for feature aggregation, either by multiplying them with the visual features or by aligning the top-k corresponding features; neither uses STSS itself as a relational representation. In contrast, our method does exactly that and learns a more powerful relational feature from STSS. Table 6(e) summarizes the results. We re-implement the non-local block and the CP module in PyTorch based on their official code. For a fair comparison, we insert a single block or module at the same position of ResNet-18. Compared to the non-local block and the CP module, the SELFY block improves top-1 accuracy by 4.4%p and 1.5%p, respectively, while requiring fewer floating-point operations than either (7.5 GFLOPs and 8.3 GFLOPs, respectively). This demonstrates that the direct integration of STSS features is more effective for action recognition than the indirect ways of using STSS, e.g., re-weighting visual-semantic features or learning correspondences.
Relation to local self-attention mechanisms. Local self-attention [15, 37, 55] and our method share the use of a self-similarity tensor, but they use it in very different ways and for different purposes. The local self-attention mechanism aims to aggregate local context features, and thus uses the self-similarity values as attention weights for feature aggregation. In contrast, our method aims to learn a generalized motion representation from the local STSS, so the final STSS representation is directly fed into the neural network rather than being multiplied with local context features.
For an empirical comparison, we conduct the following ablation experiment. We extend the local self-attention layer to the temporal dimension and add the spatio-temporal local self-attention layer, followed by feature integration layers, to the backbone. All experimental details are the same as those in Section A, except that we reduce the channel dimension of the appearance feature to 32. Table 6(f) summarizes the results on SS-V1. The spatio-temporal local self-attention layer achieves 43.8% top-1 accuracy, and both SELFY blocks, using the embedded Gaussian and the cosine similarity, outperform it with 47.6% and 47.8% top-1 accuracy, respectively. These results align with prior work, which reveals that the self-attention mechanism hardly captures motion features in video.
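The contrast between the two uses of self-similarity can be illustrated with a minimal NumPy sketch; the single linear projection standing in for our learned feature-extraction layers is a simplification, and `sim` denotes a flattened local similarity map of `N` positions over `K` neighbors.

```python
import numpy as np

def attention_aggregate(sim, values):
    """Local self-attention: softmax-normalized similarities re-weight
    neighbor features (indirect use of self-similarity).

    sim: (N, K) similarities; values: (K, C) neighbor features.
    """
    w = np.exp(sim - sim.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # rows sum to 1
    return w @ values                    # (N, C) aggregated context

def stss_as_feature(sim, proj):
    """Direct use: the similarity values themselves are the input to
    learned layers (a single linear map here as a stand-in).

    sim: (N, K) similarities; proj: (K, D) learned projection.
    """
    return sim @ proj                    # (N, D) relational feature
```

In the attention path the similarity values only gate appearance features, whereas in the STSS path they are the representation being learned from.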
In Fig. 6, we visualize qualitative results of two different SELFYNet-TSM-R18 models with different matching regions on SS-V1. We show the differing predictions of the two models given 8 input frames. We also overlay Grad-CAMs on the input frames to see whether a larger volume of STSS helps capture long-term interactions in videos. We take Grad-CAMs of the features right before the global average pooling layer. As shown in the figure, STSS with a sufficient volume helps learn a richer context of temporal dynamics in the video; in Fig. 5(a), for example, SELFYNet with the larger temporal range focuses not only on regions where the action occurs but also on the white stain after the action, verifying whether the stain is wiped off or not.