Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition

02/14/2021 · by Heeseung Kwon, et al. · POSTECH

Spatio-temporal convolution often fails to learn motion dynamics in videos, and thus an effective motion representation is required for video understanding in the wild. In this paper, we propose a rich and robust motion representation based on spatio-temporal self-similarity (STSS). Given a sequence of frames, STSS represents each local region as similarities to its neighbors in space and time. By converting appearance features into relational values, it enables the learner to better recognize structural patterns in space and time. We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it. The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision. With a sufficient volume of the neighborhood in space and time, it effectively captures long-term interaction and fast motion in the video, leading to robust action recognition. Our experimental analysis demonstrates its superiority over previous methods for motion modeling as well as its complementarity to spatio-temporal features from direct convolution. On the standard action recognition benchmarks, Something-Something V1 & V2, Diving-48, and FineGym, the proposed method achieves state-of-the-art results.

1 Introduction

Learning spatio-temporal dynamics is the key to video understanding. While extending standard convolution to space and time has been actively investigated for this purpose in recent years [45, 1, 47], the empirical results so far indicate that spatio-temporal convolution alone is not sufficient for grasping the whole picture; it often learns irrelevant context bias rather than motion information [35], and thus the additional use of optical flow turns out to boost performance in most cases [1, 31]. Motivated by this, recent action recognition methods learn to extract explicit motion, i.e., flow or correspondence, between feature maps of adjacent frames to improve performance [29, 24]. But is it essential to extract such an explicit form of flow or correspondence? How can we learn a richer and more robust form of motion information for videos in the wild?

In this paper, we propose to learn a spatio-temporal self-similarity (STSS) representation for video understanding. Self-similarity is a relational descriptor for an image that effectively captures intra-structures by representing each local region as similarities to its spatial neighbors [40]. As illustrated in Fig. 1, given a sequence of frames, i.e., a video, it extends along time and thus represents each local region as similarities to its neighbors in space and time. By converting appearance features into relational values, STSS enables a learner to better recognize structural patterns in space and time. For neighbors in the same frame it computes a spatial self-similarity map, while for neighbors in a different frame it computes a motion likelihood map. Note that if we restrict attention to the similarity map with the very next frame within STSS and attempt to extract a single displacement vector to the most likely position in that frame, the problem reduces to optical flow, which is a limited form of motion information. In contrast, we leverage the whole volume of STSS and let our model learn to extract a generalized motion representation from it in an end-to-end manner. With a sufficient volume of neighborhood in space and time, it effectively captures long-term interaction and fast motion in the video, leading to robust action recognition.

We introduce a neural block for STSS representation, dubbed SELFY, that can be easily inserted into neural architectures and learned end-to-end without additional supervision. Our experimental analysis demonstrates its superiority over previous methods for motion modeling as well as its complementarity to spatio-temporal features from direct convolutions. On the standard benchmarks for action recognition, Something-Something V1&V2 [11], Diving-48 [30], and FineGym [39], the proposed method achieves the state-of-the-art results.

2 Related Work

Video action recognition. Video action recognition aims to categorize videos into pre-defined action classes, and one of the main issues in action recognition is capturing temporal dynamics in videos. Previous methods attempt to learn temporal dynamics with modern neural networks in different ways: two-stream networks with external optical flow [41, 49], recurrent networks [3], temporal pooling methods [9, 25], and 3D CNNs [45, 1]. Recent methods have introduced advanced 3D CNNs [47, 46, 8, 31, 6] and shown their effectiveness in capturing spatio-temporal features, so that 3D CNNs have become a de facto approach to learning temporal dynamics in video. However, spatio-temporal convolution is vulnerable unless relevant features are well aligned across frames within the fixed-sized kernel. To address this issue, a few methods adaptively translate the kernel offsets with deformable convolutions [56, 27], while several methods [7, 28] modulate other hyper-parameters, e.g., a higher frame rate or larger spatial receptive fields. Unlike these methods, we address the problem of spatio-temporal convolution with a sufficient volume of STSS, capturing far-sighted spatio-temporal relations.

Learning motion features. Since external optical flow benefits 3D CNNs in improving action recognition accuracy [1, 58, 47], several methods propose to learn frame-by-frame motion features from RGB sequences inside neural architectures. Some methods [5, 36] internalize TV-L1 [54] optical flow into the CNN. Frame-wise feature differences [44, 26, 16, 29] are also utilized as motion features. Recent correlation-based methods [48, 24] adopt a correlation operator [4, 43, 53] to learn motion features between adjacent frames. However, these methods compute frame-by-frame motion features between two adjacent frames and then rely on stacked spatio-temporal convolutions to capture long-range motion dynamics. In contrast, we propose to learn STSS features as generalized motion features, which capture both short-term and long-term interactions in the video.

Self-similarity. Self-similarity describes the internal geometric layout of an image and has been used in many computer vision tasks, such as object detection [40], image retrieval [14], and semantic correspondence matching [22]. Shechtman and Irani [40] introduce the concept of self-similarity for images and videos and use it as a hand-crafted local descriptor for action detection. Inspired by this work, early methods adopt self-similarities for capturing view-invariant temporal patterns [18, 17, 23], but they use only temporal self-similarities due to computational cost. In recent years, STSS has rarely been explored, although the non-local approaches [50, 32] to video representation can be viewed as using STSS for re-weighting or aligning visual features. Different from these, we advocate using STSS directly as a relational feature. Our method leverages the full STSS as generalized motion information and learns an effective representation for action recognition within a video-processing architecture. To the best of our knowledge, our work is the first to learn an STSS representation using modern neural networks.

The contributions of our paper can be summarized as follows. First, we revisit the notion of self-similarity and propose to learn a generalized, far-sighted motion representation from STSS. Second, we implement STSS representation learning as a neural block, dubbed SELFY, that can be integrated into existing neural architectures. Third, we provide a comprehensive evaluation of SELFY, achieving the state of the art on the benchmarks Something-Something V1 & V2 [11], Diving-48 [30], and FineGym [39].

Figure 2: Overview of our self-similarity representation block (SELFY). The SELFY block takes an input video feature tensor, transforms it into an STSS tensor, and extracts a feature tensor from it. It then produces the final STSS representation via feature integration, which has the same size as the input. The resultant representation is fused into the input feature by element-wise addition, making SELFY act as a residual block. See text for details.

3 Our approach

In this section, we first revisit the notion of self-similarity and discuss its relation to motion. We then introduce our method for learning an effective spatio-temporal self-similarity representation, which can be easily integrated into video-processing architectures and learned end-to-end.

3.1 Self-Similarity Revisited

Self-similarity is a relational descriptor that suppresses variations in appearance and reveals structural patterns [40].

Given an image feature map $\mathbf{I} \in \mathbb{R}^{X \times Y \times C}$, the self-similarity transformation of $\mathbf{I}$ results in a 4D tensor $\mathbf{S} \in \mathbb{R}^{X \times Y \times U \times V}$, whose elements are defined as

$$\mathbf{S}_{x,y,u,v} = \mathrm{sim}\left(\mathbf{I}_{x,y},\, \mathbf{I}_{x+u,\,y+v}\right),$$

where $\mathrm{sim}(\cdot,\cdot)$ is a similarity function, e.g., cosine similarity. Here, $(x, y)$ is a query coordinate while $(u, v)$ is a spatial offset from it. To impose locality, the offset is restricted to a neighborhood: $(u, v) \in [-d_U, d_U] \times [-d_V, d_V]$, so that $U = 2 d_U + 1$ and $V = 2 d_V + 1$, respectively. By converting the $C$-dimensional appearance feature $\mathbf{I}_{x,y}$ into the $U \times V$-dimensional relational feature $\mathbf{S}_{x,y}$, it suppresses variations in appearance and reveals spatial structures in the image. Note that the self-similarity transformation closely relates to conventional cross-similarity (or correlation) across two different feature maps $(\mathbf{I}, \mathbf{I}')$, which can be defined as

$$\mathbf{S}^{\mathrm{cross}}_{x,y,u,v} = \mathrm{sim}\left(\mathbf{I}_{x,y},\, \mathbf{I}'_{x+u,\,y+v}\right).$$

Given two images of a moving object, the cross-similarity transformation effectively captures motion information and thus is commonly used in optical flow and correspondence estimation [4, 43, 53].

For a sequence of frames, i.e., a video, one can naturally extend the spatial self-similarity along the temporal axis. Let $\mathbf{V} \in \mathbb{R}^{T \times X \times Y \times C}$ denote a feature map of the video with $T$ frames. The spatio-temporal self-similarity (STSS) transformation of $\mathbf{V}$ results in a 6D tensor $\mathbf{S} \in \mathbb{R}^{T \times X \times Y \times L \times U \times V}$, whose elements are defined as

$$\mathbf{S}_{t,x,y,l,u,v} = \mathrm{sim}\left(\mathbf{V}_{t,x,y},\, \mathbf{V}_{t+l,\,x+u,\,y+v}\right), \qquad (1)$$

where $(t, x, y)$ is a query coordinate and $(l, u, v)$ is a spatio-temporal offset from the query. In addition to the locality of spatial offsets above, the temporal offset is also restricted to a temporal neighborhood: $l \in [-d_L, d_L]$, so that $L = 2 d_L + 1$.

What types of information does STSS describe? Interestingly, for each time $t$, the STSS tensor can be decomposed along the temporal offset $l$ into a single spatial self-similarity tensor (when $l = 0$) and $L - 1$ spatial cross-similarity tensors (when $l \neq 0$); the partial tensors with a small offset (e.g., $l = -1$ or $+1$) collect motion information from adjacent frames, and those with larger offsets capture it from frames further away, both forward and backward in time. Unlike previous approaches to learning motion [4, 48, 24], which rely on cross-similarity between adjacent frames, STSS allows us to take a generalized, far-sighted view on motion, i.e., both short-term and long-term, both forward and backward, as well as spatial self-motion.
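To make the transformation concrete, the following is a minimal PyTorch sketch of Eq. (1) for a single clip, using cosine similarity and a small $(l, u, v)$ neighborhood. The function name, the loop-based implementation, and the boundary handling (frame replication in time, zero padding in space) are our own illustrative choices, not the authors' code.

```python
import torch
import torch.nn.functional as F

def stss(feats, dl=1, du=4, dv=4):
    """Spatio-temporal self-similarity (Eq. 1) for one clip; a sketch.

    feats: (T, C, H, W) video feature map.
    Returns a tensor of shape (T, H, W, L, U, V) with L = 2*dl + 1,
    U = 2*du + 1, V = 2*dv + 1, holding cosine similarities between each
    position and its spatio-temporal neighbours.
    """
    T, C, H, W = feats.shape
    f = F.normalize(feats, dim=1)                      # unit-norm channels: dot product = cosine
    # boundary handling (assumption): replicate frames in time, zero-pad in space
    f_t = torch.cat([f[:1].expand(dl, -1, -1, -1), f,
                     f[-1:].expand(dl, -1, -1, -1)], dim=0)
    f_pad = F.pad(f_t, (dv, dv, du, du))               # (T + 2*dl, C, H + 2*du, W + 2*dv)
    sims = []
    for l in range(2 * dl + 1):                        # temporal offset index
        for u in range(2 * du + 1):                    # vertical offset index
            for v in range(2 * dv + 1):                # horizontal offset index
                shifted = f_pad[l:l + T, :, u:u + H, v:v + W]
                sims.append((f * shifted).sum(dim=1))  # (T, H, W) similarity map
    S = torch.stack(sims, dim=-1)                      # (T, H, W, L*U*V)
    return S.view(T, H, W, 2 * dl + 1, 2 * du + 1, 2 * dv + 1)
```

Setting dl = 0 recovers the spatial self-similarity tensor, and keeping only the slice for the next-frame offset recovers the cross-similarity (correlation) between adjacent frames.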

(a) soft-argmax
(b) MLP
(c) convolution
Figure 3: Feature extraction from STSS. For each spatio-temporal position, each method transforms the local similarity volume of the STSS tensor into a feature vector. See text for details.

3.2 Spatio-temporal self-similarity representation learning

By leveraging the rich information in STSS, we propose to learn a generalized motion representation for video understanding. To achieve this goal without additional supervision, we design a neural block, dubbed SELFY, which can be inserted into a video-processing architecture and learned end-to-end. The overall structure is illustrated in Fig. 2. It consists of three steps: self-similarity transformation, feature extraction, and feature integration.

Given the input video feature tensor $\mathbf{V}$, the self-similarity transformation step converts it to the STSS tensor $\mathbf{S}$ as in Eq. 1. In the following, we describe the feature extraction and feature integration steps.

3.2.1 Feature extraction

From the STSS tensor $\mathbf{S}$, we extract a $C_F$-dimensional feature for each spatio-temporal position $(t, x, y)$ and temporal offset $l$, so that the resultant tensor is $\mathbf{F} \in \mathbb{R}^{T \times X \times Y \times L \times C_F}$, which is equivariant to translation in space, time, and temporal offset. The temporal-offset dimension $L$ is preserved to extract motion information across different temporal offsets in a consistent manner. While there exist many design choices, we introduce three methods for feature extraction in this work.

Soft-argmax. The first method is to compute explicit displacement fields from $\mathbf{S}$, which previous motion learning methods adopt using spatial cross-similarity [4, 43, 53]. One may extract the displacement field by indexing the position with the highest similarity value via argmax, but it is not differentiable. We instead use soft-argmax [2], which aggregates displacement vectors with softmax weighting (Fig. 3(a)). The soft-argmax feature extraction can be formulated as

$$\mathbf{F}_{t,x,y,l} = \sum_{u,v} \frac{\exp\left(\mathbf{S}_{t,x,y,l,u,v}/\tau\right)}{\sum_{u',v'} \exp\left(\mathbf{S}_{t,x,y,l,u',v'}/\tau\right)} \begin{bmatrix} u \\ v \end{bmatrix}, \qquad (2)$$

which results in a feature tensor $\mathbf{F} \in \mathbb{R}^{T \times X \times Y \times L \times 2}$. The temperature factor $\tau$ adjusts the sharpness of the softmax distribution and is kept fixed in our experiments.
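A minimal sketch of the soft-argmax extraction, continuing the stss sketch above; the temperature default below is only illustrative and not necessarily the value used in our experiments.

```python
import torch

def soft_argmax(S, tau=0.01):
    """Soft-argmax feature extraction (Eq. 2); a sketch.

    S: STSS tensor of shape (T, H, W, L, U, V).
    Returns expected displacements of shape (T, H, W, L, 2), i.e. the
    softmax-weighted average of the candidate spatial offsets.
    """
    T, H, W, L, U, V = S.shape
    prob = torch.softmax(S.reshape(T, H, W, L, U * V) / tau, dim=-1)
    # candidate offsets centred at zero: u in [-dU, dU], v in [-dV, dV]
    us = torch.arange(U, dtype=S.dtype, device=S.device) - (U - 1) / 2
    vs = torch.arange(V, dtype=S.dtype, device=S.device) - (V - 1) / 2
    grid = torch.stack(torch.meshgrid(us, vs, indexing='ij'), dim=-1)  # (U, V, 2)
    return prob @ grid.reshape(U * V, 2)               # (T, H, W, L, 2)
```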

Multi-layer perceptron (MLP). The second method is to learn an MLP that converts self-similarity values into a feature. For this, we flatten the $U \times V$ volume into $UV$-dimensional vectors and apply an MLP to them (Fig. 3(b)). For the reshaped tensor $\mathbf{S} \in \mathbb{R}^{T \times X \times Y \times L \times UV}$, a perceptron $\phi$ can be expressed as

$$\phi(\mathbf{S}) = \mathbf{S} \times_{5} \mathbf{W}_{\phi}, \qquad (3)$$

where $\times_{n}$ denotes the $n$-mode tensor product and $\mathbf{W}_{\phi}$ is the perceptron parameter matrix. The MLP feature extraction, a stack of perceptrons interleaved with non-linearities, can thus be formulated as

$$\mathbf{F} = \phi_{n} \circ \cdots \circ \phi_{1}(\mathbf{S}), \qquad (4)$$

which produces a feature tensor $\mathbf{F} \in \mathbb{R}^{T \times X \times Y \times L \times C_F}$. This method is more flexible and may also be more effective than the soft-argmax because it can not only encode displacement information but also directly access the similarity values, which may be helpful for learning the motion distribution.
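A sketch of the MLP variant under the same tensor layout; the four-layer depth follows the ablation setting noted in Table 4(b), while the class name, hidden width, and output dimension here are illustrative.

```python
import torch.nn as nn

class MLPExtraction(nn.Module):
    """MLP feature extraction (Eqs. 3-4); a sketch.

    Flattens the (U, V) similarity volume into a UV-dimensional vector per
    (t, x, y, l) position and maps it to a C_F-dimensional feature.
    """
    def __init__(self, U, V, c_out=64, hidden=64, depth=4):
        super().__init__()
        dims = [U * V] + [hidden] * (depth - 1) + [c_out]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU(inplace=True)]
        self.mlp = nn.Sequential(*layers[:-1])         # no activation after the last layer

    def forward(self, S):                              # S: (T, H, W, L, U, V)
        T, H, W, L, U, V = S.shape
        return self.mlp(S.reshape(T, H, W, L, U * V))  # (T, H, W, L, C_F)
```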

Convolution. The third method is to learn convolution kernels over the $(L, U, V)$ volume of $\mathbf{S}$ (Fig. 3(c)). When we regard $\mathbf{S}$ as a 7D tensor in $\mathbb{R}^{T \times X \times Y \times L \times U \times V \times 1}$ with a channel dimension of 1, a convolution layer can be expressed as

$$g(\mathbf{S}) = \mathbf{S} * \mathbf{K}, \qquad (5)$$

where $\mathbf{K}$ is a multi-channel convolution kernel over the $(L, U, V)$ dimensions. Starting from $\mathbf{S} \in \mathbb{R}^{T \times X \times Y \times L \times U \times V \times 1}$, we gradually downsample $(U, V)$ and expand channels via multiple convolutions with strides, finally resulting in $\mathbf{F} \in \mathbb{R}^{T \times X \times Y \times L \times 1 \times 1 \times C_F}$; we preserve the $L$ dimension, since maintaining a fine temporal resolution is shown to be effective for capturing detailed motion information [31, 7]. In practice, we reshape $\mathbf{S}$ so that the similarity volume forms a single-channel 3D input and then apply regular 3D convolutions along the $(L, U, V)$ dimensions. The convolutional feature extraction with $n$ layers can thus be formulated as

$$\mathbf{F} = g_{n} \circ \cdots \circ g_{1}(\mathbf{S}), \qquad (6)$$

which results in a feature tensor $\mathbf{F} \in \mathbb{R}^{T \times X \times Y \times L \times C_F}$. This method is effective in learning structural patterns with its convolution kernels, thus outperforming the former methods as will be seen in our experiments.
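A sketch of the convolutional variant: the $(L, U, V)$ volume at each position is treated as a single-channel 3D input, and strided 3D convolutions downsample $(U, V)$ while preserving $L$. The kernel sizes, strides, channel widths, and the final pooling are illustrative choices, not the paper's exact configuration.

```python
import torch.nn as nn

class ConvExtraction(nn.Module):
    """Convolutional feature extraction (Eqs. 5-6); a sketch."""
    def __init__(self, c_out=64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 32, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, c_out, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),        # collapse the remaining (U, V) extent
        )

    def forward(self, S):                              # S: (T, H, W, L, U, V)
        T, H, W, L, U, V = S.shape
        x = S.reshape(T * H * W, 1, L, U, V)           # single-channel 3D input per position
        x = self.convs(x).squeeze(-1).squeeze(-1)      # (T*H*W, C_F, L)
        return x.permute(0, 2, 1).reshape(T, H, W, L, -1)  # (T, H, W, L, C_F)
```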

3.2.2 Feature integration

In this step, we integrate the extracted STSS features and feed them back to the original input stream with a $T \times X \times Y$ volume.

We first apply spatio-temporal convolution kernels along the $(X, Y, L)$ dimensions of $\mathbf{F}$. The convolution layer can be expressed as

$$h(\mathbf{F}) = \mathbf{F} * \mathbf{K}, \qquad (7)$$

where $\mathbf{K}$ is a multi-channel convolution kernel over the $(X, Y, L)$ dimensions. This type of convolution integrates the extracted STSS features by extending their receptive fields along the $(X, Y, L)$ dimensions. In practice, we reshape $\mathbf{F}$ so that $C_F$ becomes the channel dimension and then apply regular 3D convolutions along the $(X, Y, L)$ dimensions. The resultant feature tensor is defined as

$$\mathbf{F}^{*} = h_{n} \circ \cdots \circ h_{1}(\mathbf{F}) \in \mathbb{R}^{T \times X \times Y \times L \times C_F}. \qquad (8)$$

We then flatten the $(L, C_F)$ volume into $L C_F$-dimensional vectors to obtain $\mathbf{F}^{*} \in \mathbb{R}^{T \times X \times Y \times L C_F}$, and apply a $1 \times 1 \times 1$ convolution layer to obtain the final output. This convolution layer integrates features from different temporal offsets and also adjusts the channel dimension to fit that of the original input $\mathbf{V}$. The final output tensor $\mathbf{Z} \in \mathbb{R}^{T \times X \times Y \times C}$ is expressed as

$$\mathbf{Z} = \mathbf{F}^{*} \times_{4} \mathbf{W}, \qquad (9)$$

where $\times_{4}$ is the $4$-mode tensor product and $\mathbf{W} \in \mathbb{R}^{C \times L C_F}$ is the weight matrix of the $1 \times 1 \times 1$ convolution layer.

Finally, we fuse the resultant STSS representation $\mathbf{Z}$ into the input feature $\mathbf{V}$ by element-wise addition, thus making SELFY act as a residual block [12].
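Putting the integration step and the residual fusion together, the sketch below maps the extracted STSS features back to the backbone width and adds them to the input. Using a single $(X, Y, L)$ convolution layer, a fixed kernel size, and the clip dimension as the batch dimension are simplifications for illustration, not the paper's exact configuration.

```python
import torch.nn as nn

class FeatureIntegration(nn.Module):
    """Feature integration (Eqs. 7-9) with the residual fusion; a sketch."""
    def __init__(self, c_feat, L, c_backbone):
        super().__init__()
        # Eq. 7: convolution along the (X, Y, L) dimensions of the STSS features
        self.conv_xyl = nn.Sequential(
            nn.Conv3d(c_feat, c_feat, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Eq. 9: 1x1x1 convolution over the flattened (L, C_F) vectors
        self.proj = nn.Conv3d(c_feat * L, c_backbone, kernel_size=1)

    def forward(self, feat, x_in):
        # feat: (T, H, W, L, C_F) extracted STSS features
        # x_in: (T, C, H, W)     original backbone feature (residual path)
        T, H, W, L, C_F = feat.shape
        y = feat.permute(0, 4, 1, 2, 3)                # (T, C_F, H, W, L)
        y = self.conv_xyl(y)                           # receptive field grows along (X, Y, L)
        y = y.permute(1, 4, 0, 2, 3).reshape(1, C_F * L, T, H, W)
        z = self.proj(y).squeeze(0).permute(1, 0, 2, 3)  # (T, C, H, W)
        return x_in + z                                # element-wise addition: residual block
```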

4 Experiments

model | flow | #frames | FLOPs × clips | SS-V1 top-1 | SS-V1 top-5 | SS-V2 top-1 | SS-V2 top-5
TSN-R50 from [31] | | 8 | 33G × 1 | 19.7 | 46.6 | 30.0 | 60.5
TRN-BNIncep [57] | | 8 | 16G × N/A | 34.4 | - | 48.8 | -
TRN-BNIncep Two-Stream [57] | ✓ | 8+8 | 16G × N/A | 42.0 | - | 55.5 | -
MFNet-R50 [26] | | 10 | N/A × 10 | 40.3 | 70.9 | - | -
CPNet-R34 [32] | | 24 | N/A × 96 | - | - | 57.7 | 84.0
TPN-R50 [52] | | 8 | N/A × 10 | 40.6 | - | 59.1 | -
SELFYNet-R50 (ours) | | 8 | 37G × 1 | 50.7 | 79.3 | 62.7 | 88.0
I3D from [51] | | 32 | 153G × 2 | 41.6 | 72.2 | - | -
NL-I3D from [51] | | 32 | 168G × 2 | 44.4 | 76.0 | - | -
TSM-R50 [31] | | 16 | 65G × 1 | 47.3 | 77.1 | 61.2 | 86.9
TSM-R50 Two-Stream from [24] | ✓ | 16+16 | 129G × 1 | 52.6 | 81.9 | 65.0 | 89.4
CorrNet-R101 [48] | | 32 | 187G × 10 | 50.9 | - | - | -
STM-R50 [16] | | 16 | 67G × 30 | 50.7 | 80.4 | 64.2 | 89.8
TEA-R50 [29] | | 16 | 70G × 30 | 52.3 | 81.9 | - | -
MSNet-TSM-R50 [24] | | 16 | 67G × 1 | 52.1 | 82.3 | 64.7 | 89.4
MSNet-TSM-R50 [24] | | 16+8 | 101G × 10 | 55.1 | 84.0 | 67.1 | 91.0
SELFYNet-TSM-R50 (ours) | | 8 | 37G × 1 | 52.5 | 80.8 | 64.5 | 89.4
SELFYNet-TSM-R50 (ours) | | 16 | 77G × 1 | 54.3 | 82.9 | 65.7 | 89.8
SELFYNet-TSM-R50 (ours) | | 8+16 | 114G × 1 | 55.8 | 83.9 | 67.4 | 91.0
SELFYNet-TSM-R50 (ours) | | 8+16 | 114G × 2 | 56.6 | 84.4 | 67.7 | 91.1
Table 1: Performance comparison on SS-V1 & V2. Top-1 and top-5 accuracy (%) and FLOPs (G) × number of clips are shown.

4.1 Implementation details

Action recognition architecture. We employ TSN ResNets [49] as 2D CNN backbones and TSM ResNets [31] as 3D CNN backbones. TSM obtains the effect of spatio-temporal convolution with spatial convolutions by shifting a part of the input channels along the temporal axis before each convolution operation; it is added to each residual block of the ResNet. We adopt ImageNet pre-trained weights for our backbones. To transform a backbone into the self-similarity network (SELFYNet), we insert a single SELFY block after the third stage of the backbone with additive fusion. For the SELFY block, we use the convolution method as the default feature extraction method with multi-channel convolution kernels. For more details, please refer to supplementary materials A and B.

Training & testing. For training, we sample a clip of 8 or 16 frames from each video using segment-based sampling [49]. The spatio-temporal matching region of the SELFY block is set according to the number of input frames (8 or 16). For testing, we sample one or two clips from a video, crop their centers, and evaluate the averaged prediction of the sampled clips. For more details, please refer to supplementary material A.

4.2 Datasets

For evaluation, we use benchmarks that contain fine-grained spatio-temporal dynamics in videos.

Something-Something V1 & V2 (SS-V1 & V2) [11], which are both large-scale action recognition datasets, contain 108k and 220k video clips, respectively. Both datasets share the same 174 action classes that are labeled, e.g., ‘pretending to put something next to something.’

Diving-48 [30], which contains 18k videos with 48 different diving action classes, is an action recognition dataset that minimizes contextual biases, i.e., scenes or objects.

FineGym [39] is a fine-grained action dataset built on top of gymnastics videos. We adopt the Gym288 and Gym99 sets for our experiments, which contain 288 and 99 classes, respectively.

model | #frames | FLOPs × clips | top-1
TSN from [30] | - | - | 16.8
TRN from [19] | - | - | 22.8
Att-LSTM [19] | 64 | N/A × 1 | 35.6
P3D from [34] | 16 | N/A × 1 | 32.4
C3D from [34] | 16 | N/A × 1 | 34.5
GST-R50 [34] | 16 | 59G × 1 | 38.8
CorrNet-R101 [48] | 32 | 187G × 10 | 38.2
GSM-IncV3 [42] | 16 | 54G × 2 | 40.3
TSM-R50 (our impl.) | 16 | 65G × 2 | 38.8
SELFYNet-TSM-R50 (ours) | 16 | 77G × 2 | 41.6
Table 2: Performance comparison on Diving-48. Top-1 accuracy (%) and FLOPs (G) × number of clips are shown.
model | #frames | Gym288 Mean | Gym99 Mean
TSN [49] | 3 | 26.5 | 61.4
TRN [57] | 3 | 33.1 | 68.7
I3D [1] | 8 | 27.9 | 63.2
NL I3D [50] | 8 | 27.1 | 62.1
TSM [31] | 3 | 34.8 | 70.6
TSM Two-Stream [31] | N/A | 46.5 | 81.2
TSM-R50 (our impl.) | 3 | 35.3 | 73.7
TSM-R50 (our impl.) | 8 | 47.9 | 86.6
SELFYNet-TSM-R50 (ours) | 8 | 49.5 | 87.7
Table 3: Performance comparison on FineGym. The mean per-class accuracy (%) is shown. All results in the upper part are from the FineGym paper [39].

4.3 Comparison with the state-of-the-art methods

In the following experiments, for a fair comparison, we compare our model with models that are not pre-trained on additional large-scale video datasets, e.g., Kinetics [21] or Sports-1M [20].

Table 1 summarizes the results on SS-V1 & V2. The first and second compartments of the table show the results of other 2D CNN and (pseudo-)3D CNN models, respectively, and the last part of each compartment shows the results of SELFYNet. SELFYNet with TSN-ResNet (SELFYNet-TSN-R50) achieves 50.7% and 62.7% top-1 accuracy, respectively, outperforming other 2D models using only 8 frames. When we adopt TSM ResNet (TSM-R50) as our backbone and use 16 frames, our method (SELFYNet-TSM-R50) achieves 54.3% and 65.7% top-1 accuracy, respectively, which is the best among the single models. Compared to TSM-R50, a single SELFY block obtains significant gains of 7.0%p and 4.5%p top-1 accuracy, respectively; our method is more accurate than TSM-R50 two-stream on both datasets. Finally, our ensemble model (SELFYNet-TSM-R50, 8+16 frames) with 2-clip evaluation sets a new state of the art on both datasets by achieving 56.6% and 67.7% top-1 accuracy, respectively.

Tables 2 and 3 summarize the results on Diving-48 and FineGym. On Diving-48, TSM-R50 using 16 frames reaches 38.8% top-1 accuracy in our implementation. SELFYNet-TSM-R50 outperforms TSM-R50 by 2.8%p, setting a new state-of-the-art top-1 accuracy of 41.6% on Diving-48. On FineGym, SELFYNet-TSM-R50 achieves 49.5% and 87.7% mean per-class accuracy on Gym288 and Gym99, respectively, surpassing all the other models reported in [39].

4.4 Ablation studies

We conduct ablation experiments to demonstrate the effectiveness of the proposed method. All experiments are performed on SS-V1 using 8 frames. Unless specified otherwise, we use ImageNet pre-trained TSM ResNet-18 (TSM-R18) with a single SELFY block as our default SELFYNet.

Types of similarity. In Table 4(a), we investigate the effect of different types of similarity by varying the set of temporal offsets on both TSN-ResNet-18 (TSN-R18) and TSM-R18. Interestingly, learning spatial self-similarity alone improves accuracy on both backbones, which implies that self-similarity features help capture structural patterns of visual features. Learning cross-similarity with a short temporal range shows a noticeable gain in accuracy on both backbones, indicating the significance of motion features. Learning the full STSS outperforms the other types of similarity, and the accuracy of SELFYNet increases as the temporal range becomes longer. With a far-sighted view on motion, STSS learns both short-term and long-term interactions in videos, as well as spatial self-similarity.

Feature extraction and integration methods. In Table 4(b), we compare different combinations of feature extraction and integration methods. From the second to the fourth row, different feature extraction methods are compared while fixing the feature integration method to a single fully-connected (FC) layer. Compared to the baseline, soft-argmax, which extracts spatial displacement features, improves top-1 accuracy by 1.0%p. Replacing soft-argmax with an MLP provides an additional gain of 1.9%p top-1 accuracy, showing the effectiveness of directly using similarity values. When using the convolution method for feature extraction, we achieve 46.7% top-1 accuracy; the multi-channel convolution kernel is more effective than the MLP in capturing structural patterns along the $(L, U, V)$ dimensions. From the fourth to the sixth row, different feature integration methods are compared while fixing the feature extraction method to convolution. Replacing the single FC layer with an MLP improves top-1 accuracy by 0.6%p. Replacing the MLP with convolution layers further improves the results, achieving 48.4% top-1 accuracy. These results demonstrate that our design choice of using convolutions along the $(L, U, V)$ and $(X, Y, L)$ dimensions is the most effective in learning a geometry-aware STSS representation.

Comparison with correlation-based methods. We also discuss the difference between our method and correlation-based methods [24, 48]. While correlation-based methods extract motion features only from the spatial cross-similarity tensor between two adjacent frames, and are thus limited to short-term motion, our method effectively captures bi-directional and long-term motion information by learning from a sufficient volume of STSS. Our method can also exploit richer information from the self-similarity values than these methods. The MS module [24] focuses only on the maximal similarity value along the spatial offset dimensions to extract flow information, and the Correlation block [48] uses a convolution layer for extracting motion features from the similarity values. In contrast to these two methods, we introduce a generalized motion learning framework using the self-similarity tensor in Section 3.2.

We have also conducted experiments to compare our method with MSNet [24], one of the correlation-based methods. For an apples-to-apples comparison, we apply the kernel soft-argmax and max pooling operation (KS + CM in [24]) to our feature extraction method by following their official code (https://github.com/arunos728/MotionSqueeze). Note that, when we restrict the temporal offset to adjacent frames, the SELFY block using KS + CM is equivalent to the MS module whose feature transformation layers are standard 2D convolution layers. Table 4(c) summarizes the results. The KS + CM method achieves 46.1% top-1 accuracy. As we enlarge the temporal window to 5, we obtain an additional gain of 1.3%p. The learnable convolution layers improve top-1 accuracy by 1.0%p in both cases. These results demonstrate the effectiveness of learning geometric patterns within a sufficient volume of the STSS tensor for learning motion features.

model | range of l | FLOPs | top-1 | top-5
TSN-R18 | - | 14.6 G | 16.2 | 40.8
SELFYNet | | 15.3 G | 16.8 | 42.2
SELFYNet | | 15.3 G | 39.7 | 68.9
SELFYNet | | 16.3 G | 44.7 | 73.9
SELFYNet | | 17.3 G | 46.9 | 75.9
SELFYNet | | 18.3 G | 46.9 | 76.2
TSM-R18 | - | 14.6 G | 43.0 | 72.3
SELFYNet | | 15.3 G | 45.0 | 73.4
SELFYNet | | 15.3 G | 47.1 | 76.3
SELFYNet | | 16.3 G | 47.8 | 76.7
SELFYNet | | 17.3 G | 48.4 | 77.6
SELFYNet | | 18.3 G | 48.6 | 77.7
(a) Types of similarity. Performance comparison with different sets of temporal offsets l in the SELFY block.
model | extraction | integration | top-1 | top-5
TSM-R18 | - | - | 43.0 | 72.3
SELFYNet | Smax | FC | 44.0 | 72.3
SELFYNet | MLP | FC | 45.9 | 75.1
SELFYNet | Conv | FC | 46.7 | 75.8
SELFYNet | Conv | MLP | 47.2 | 75.9
SELFYNet | Conv | Conv | 48.4 | 77.6
(b) Feature extraction and integration methods. Smax denotes the soft-argmax operation; the MLPs consist of four FC layers. The final 1×1×1 convolution layer of the feature integration stage is omitted from this table.
model | extraction | top-1 | top-5
TSM-R18 | - | 43.0 | 72.3
SELFYNet | KS + CM | 46.1 | 75.3
SELFYNet | KS + CM | 47.4 | 76.8
SELFYNet | Conv | 47.1 | 76.3
SELFYNet | Conv | 48.4 | 77.6
(c) Performance comparison with MSNet [24]. KS and CM denote the kernel soft-argmax and confidence map, respectively. The second row of each pair uses the larger temporal window (see text).
Table 4: Ablations on SS-V1. Top-1 and top-5 accuracy (%) are shown.
Figure 4: Basic blocks and their combinations. (a) spatio-temporal convolution block (STCB), (b) SELFY-s block, and (c-f) their different combinations.
model (TSN-R18) | top-1 | top-5
baseline | 16.2 | 40.8
(a) STCB | 42.4 | 71.7
(b) SELFY-s | 46.3 | 75.1
(c) STCB + STCB | 44.4 | 73.7
(d) SELFY-s + SELFY-s | 46.8 | 75.9
(e) SELFY-s + STCB (parallel) | 46.9 | 76.5
(f) SELFY-s + STCB (sequential) | 47.6 | 76.6
Table 5: Spatio-temporal features vs. STSS features. The basic blocks and their different combinations in Fig. 4 are compared on SS-V1.
(a) corruption: occlusion
(b) corruption: motion blur
(c) qualitative results on corrupted videos
Figure 5: Robustness experiments. (a) and (b) show the top-1 accuracy of the SELFYNet variants of Table 4(a) when different degrees of occlusion and motion blur, respectively, are added to the input. (c) shows qualitative examples where the SELFYNet with the longer temporal range succeeds while the one with the shorter range fails.

4.5 Complementarity of STSS features

We conduct experiments to analyze the different roles of spatio-temporal features and STSS features. We organize two basic blocks for representing the two kinds of features: a spatio-temporal convolution block (STCB) that consists of several spatio-temporal convolutions (Fig. 4(a)) and the SELFY-s block, a light-weight version of the SELFY block obtained by removing spatial convolution layers (Fig. 4(b)). Both blocks have the same receptive fields and a similar number of parameters for a fair comparison. Different combinations of the basic blocks are inserted after the third stage of TSN-ResNet-18. Table 5 summarizes the results on SS-V1. STSS features (Figs. 4(b) and 4(d)) are more effective than spatio-temporal features (Figs. 4(a) and 4(c)) in top-1 and top-5 accuracy when the same number of blocks is inserted. Interestingly, the combination of the two different features (Figs. 4(e) and 4(f)) shows better results in top-1 and top-5 accuracy than the single-feature cases (Figs. 4(c) and 4(d)), which demonstrates that the two features complement each other. We conjecture that this complementarity comes from the different characteristics of the two features: while spatio-temporal features are obtained by directly encoding appearance features, STSS features are obtained by suppressing variations in appearance and focusing on relational structure in space and time.

4.6 Improving robustness with STSS

In this experiment we demonstrate that the STSS representation helps video-processing models become more robust to video corruptions. We test two types of corruption that are likely to occur in real-world videos: occlusion and motion blur. To induce the corruptions, we either cut out a rectangular patch from a particular frame or generate motion blur [13]. We corrupt a single center frame of every clip of SS-V1 at the testing phase and gradually increase the severity of the corruption. We compare the results of TSM-R18 and the SELFYNet variants of Table 4(a). Figures 5(a) and 5(b) summarize the results of the two corruptions, respectively. The top-1 accuracy of TSM-R18 and of the SELFYNets with short temporal ranges drops significantly as the corruption becomes more severe. We conjecture that the features of the corrupted frame propagate through the stacked TSMs, confusing the entire network. The SELFYNets with long temporal ranges, however, are more robust than the other models. As shown in Figs. 5(a) and 5(b), the accuracy gap between the SELFYNets with long temporal ranges and the others increases with the severity of the corruption, indicating that a larger volume of STSS features can improve the robustness of action recognition. We also present qualitative results (Fig. 5(c)) where two SELFYNets with different temporal ranges both answer correctly on the clean input, while the one with the shorter range fails on the corrupted input.

5 Conclusion

We have proposed to learn a generalized, far-sighted motion representation from STSS for video understanding. Our comprehensive analyses demonstrate that STSS features effectively capture both short-term and long-term interactions, complement spatio-temporal features, and improve the robustness of video-processing models. Our method outperforms other state-of-the-art methods on standard benchmarks for video action recognition.

References

  • [1] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §2, Table 3.
  • [2] O. Chapelle and M. Wu (2010) Gradient descent optimization of smoothed information retrieval metrics. Information retrieval 13 (3), pp. 216–235. Cited by: §3.2.1.
  • [3] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [4] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §2, §3.1, §3.1, §3.2.1.
  • [5] L. Fan, W. Huang, C. Gan, S. Ermon, B. Gong, and J. Huang (2018) End-to-end learning of motion representation for video understanding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [6] L. Fan, S. Buch, G. Wang, R. Cao, Y. Zhu, J. C. Niebles, and L. Fei-Fei (2020) RubiksNet: learnable 3d-shift for efficient video action recognition. In Proc. European Conference on Computer Vision (ECCV), Cited by: §2.
  • [7] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) Slowfast networks for video recognition. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §2, §3.2.1.
  • [8] C. Feichtenhofer (2020) X3D: expanding architectures for efficient video recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [9] R. Girdhar and D. Ramanan (2017) Attentional pooling for action recognition. arXiv preprint arXiv:1711.01467. Cited by: §2.
  • [10] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §A.
  • [11] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. (2017) The "something something" video database for learning and evaluating visual common sense. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §A, §2, §4.2.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.2.2.
  • [13] D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261. Cited by: §4.6.
  • [14] E. Hörster and R. Lienhart (2008) Deep networks for image retrieval on large-scale databases. In Proceedings of the 16th ACM international conference on Multimedia, pp. 643–646. Cited by: §2.
  • [15] H. Hu, Z. Zhang, Z. Xie, and S. Lin (2019) Local relation networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3464–3473. Cited by: 6(f), §B.
  • [16] B. Jiang, M. Wang, W. Gan, W. Wu, and J. Yan (2019) STM: spatiotemporal and motion encoding for action recognition. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §2, Table 1.
  • [17] I. N. Junejo, E. Dexter, I. Laptev, and P. Perez (2010) View-independent action recognition from temporal self-similarities. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Cited by: §2.
  • [18] I. N. Junejo, E. Dexter, I. Laptev, and P. Pérez (2008) Cross-view action recognition from temporal self-similarities. In Proc. European Conference on Computer Vision (ECCV), Cited by: §2.
  • [19] G. Kanojia, S. Kumawat, and S. Raman (2019) Attentive spatio-temporal representation learning for diving classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Cited by: Table 2.
  • [20] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.3.
  • [21] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §4.3.
  • [22] S. Kim, D. Min, B. Ham, S. Ryu, M. N. Do, and K. Sohn (2015) DASC: dense adaptive self-correlation descriptor for multi-modal and multi-spectral correspondence. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [23] M. Körner and J. Denzler (2013) Temporal self-similarity for appearance-based action recognition in multi-view setups. In International Conference on Computer Analysis of Images and Patterns, pp. 163–171. Cited by: §2.
  • [24] H. Kwon, M. Kim, S. Kwak, and M. Cho (2020) MotionSqueeze: neural motion feature learning for video understanding. arXiv preprint arXiv:2007.09933. Cited by: §1, §2, §3.1, §4.4, §4.4, Table 1, 3(c).
  • [25] H. Kwon, Y. Kim, J. S. Lee, and M. Cho (2018) First person action recognition via two-stream convnet with long-term fusion pooling. Pattern Recognition Letters 112, pp. 161–167. Cited by: §2.
  • [26] M. Lee, S. Lee, S. Son, G. Park, and N. Kwak (2018) Motion feature network: fixed motion filter for action recognition. In Proc. European Conference on Computer Vision (ECCV), Cited by: §2, Table 1.
  • [27] J. Li, X. Liu, M. Zhang, and D. Wang (2020) Spatio-temporal deformable 3d convnets with attention for action recognition. Pattern Recognition 98, pp. 107037. Cited by: §2.
  • [28] X. Li, Y. Wang, Z. Zhou, and Y. Qiao (2020) SmallBigNet: integrating core and contextual views for video classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [29] Y. Li, B. Ji, X. Shi, J. Zhang, B. Kang, and L. Wang (2020) TEA: temporal excitation and aggregation for action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, Table 1.
  • [30] Y. Li, Y. Li, and N. Vasconcelos (2018) Resound: towards action recognition without representation bias. In Proc. European Conference on Computer Vision (ECCV), Cited by: §1, §A, §2, §4.2, Table 2.
  • [31] J. Lin, C. Gan, and S. Han (2019) Tsm: temporal shift module for efficient video understanding. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, §3.2.1, §4.1, Table 1, Table 3.
  • [32] X. Liu, J. Lee, and H. Jin (2019) Learning video representations from correspondence proposals. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: 6(e), §2, §B, §B, Table 1.
  • [33] I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §A.
  • [34] C. Luo and A. L. Yuille (2019) Grouped spatial-temporal aggregation for efficient action recognition. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: Table 2.
  • [35] J. Materzynska, T. Xiao, R. Herzig, H. Xu, X. Wang, and T. Darrell (2020) Something-else: compositional action recognition with spatial-temporal interaction networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [36] A. Piergiovanni and M. S. Ryoo (2019) Representation flow for action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [37] P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens (2019) Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909. Cited by: 6(f), §B, §B.
  • [38] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §C.
  • [39] D. Shao, Y. Zhao, B. Dai, and D. Lin (2020) Finegym: a hierarchical video dataset for fine-grained action understanding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §A, §2, §4.2, §4.3, Table 3.
  • [40] E. Shechtman and M. Irani (2007) Matching local self-similarities across images and videos. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.1.
  • [41] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Proc. Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [42] S. Sudhakaran, S. Escalera, and O. Lanz (2020) Gate-shift networks for video action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 2.
  • [43] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) PWC-net: cnns for optical flow using pyramid, warping, and cost volume. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.1, §3.2.1.
  • [44] S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang (2018) Optical flow guided feature: a fast and robust motion representation for video action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [45] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
  • [46] D. Tran, H. Wang, L. Torresani, and M. Feiszli (2019) Video classification with channel-separated convolutional networks. In Proc. IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [47] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §2.
  • [48] H. Wang, D. Tran, L. Torresani, and M. Feiszli (2020) Video modeling with correlation networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.1, §4.4, Table 1, Table 2.
  • [49] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In Proc. European Conference on Computer Vision (ECCV), Cited by: §A, §2, §4.1, §4.1, Table 3.
  • [50] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: 6(e), §2, §B, Table 3.
  • [51] X. Wang and A. Gupta (2018) Videos as space-time region graphs. In Proc. European Conference on Computer Vision (ECCV), pp. 399–417. Cited by: Table 1.
  • [52] C. Yang, Y. Xu, J. Shi, B. Dai, and B. Zhou (2020) Temporal pyramid network for action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 1.
  • [53] G. Yang and D. Ramanan (2019) Volumetric correspondence networks for optical flow. In Proc. Neural Information Processing Systems (NeurIPS), Cited by: §2, §3.1, §3.2.1.
  • [54] C. Zach, T. Pock, and H. Bischof (2007) A duality based approach for realtime tv-l 1 optical flow. Pattern Recognition, pp. 214–223. Cited by: §2.
  • [55] H. Zhao, J. Jia, and V. Koltun (2020) Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10076–10085. Cited by: §B.
  • [56] Y. Zhao, Y. Xiong, and D. Lin (2018) Trajectory convolution for action recognition. In Proc. Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [57] B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In Proc. European Conference on Computer Vision (ECCV), Cited by: Table 1, Table 3.
  • [58] M. Zolfaghari, K. Singh, and T. Brox (2018) Eco: efficient convolutional network for online video understanding. In Proc. European Conference on Computer Vision (ECCV), Cited by: §2.

A Implementation details

Architecture details. We use TSN-ResNet and TSM-ResNet as our backbones (see Table 6) and initialize them with ImageNet pre-trained weights. We insert a single SELFY block right after the third residual stage (res3 in Table 6) and use the convolution method as the default feature extraction method. The spatio-temporal matching region of the SELFY block is set according to whether 8 or 16 input frames are used. We stack four convolution layers along the $(L, U, V)$ dimensions for the feature extraction method, and use four convolution layers along the $(X, Y, L)$ dimensions for the feature enhancement. We reduce the spatial resolution of the video feature tensor to 14×14 for computational efficiency before the self-similarity transformation. After the feature enhancement, we upsample the enhanced feature tensor to 28×28 for the residual connection.
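The sketch below illustrates how the block is attached around the backbone stage as described above: the stage output is reduced to 14×14 before the self-similarity transformation and the result is upsampled back to 28×28 for the residual addition. Average pooling and bilinear upsampling are our own assumptions (the text only states that the resolution is reduced and then upsampled), and selfy_body stands for the transformation, extraction, and enhancement steps of Section 3.2.

```python
import torch.nn as nn
import torch.nn.functional as F

class SELFYWrapper(nn.Module):
    """Resolution handling around the SELFY block; a sketch."""
    def __init__(self, selfy_body):
        super().__init__()
        self.selfy_body = selfy_body   # maps (T, C, 14, 14) features to the same shape

    def forward(self, x):              # x: (T, C, 28, 28) output of the third stage
        small = F.adaptive_avg_pool2d(x, (14, 14))        # reduce resolution (assumed pooling)
        z = self.selfy_body(small)                        # STSS -> extraction -> enhancement
        z = F.interpolate(z, size=x.shape[-2:], mode='bilinear', align_corners=False)
        return x + z                                      # residual connection at 28x28
```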

Training. We sample a clip of 8 or 16 frames from each video using segment-based sampling [49]. We resize the sampled clips into 240×320 images and apply random scaling and horizontal flipping for data augmentation. When applying horizontal flipping on SS-V1 & V2 [11], we do not flip clips whose class labels include the words 'left' or 'right', e.g., 'pushing something from left to right.' We crop the augmented clips to a spatial resolution of 224×224. For SS-V1 & V2, we set the initial learning rate to 0.01 and the number of training epochs to 50; the learning rate is decayed by 1/10 twice during training. For Diving-48 [30] and FineGym [39], we use a cosine learning rate schedule [33] with the first 10 epochs for gradual warm-up [10]. We set the initial learning rate to 0.01 and the number of training epochs to 30 and 40, respectively.

Testing. Given a video, we sample 1 or 2 clips, resize them into 240×320 images, and crop their 224×224 centers. We evaluate the average prediction of the sampled clips. We report top-1 and top-5 accuracy for SS-V1 & V2 and Diving-48, and mean per-class accuracy for FineGym.

Frame corruption details. We adopt two corruptions, occlusion and motion blur, to test the robustness of SELFYNet. We corrupt only a single center frame of every validation clip of SS-V1, i.e., the middle frame of the 8 input frames. For the occlusion, we cut out a rectangular region from the center of the frame. For the motion blur, we adopt the ImageNet-C implementation, which is available online (https://github.com/hendrycks/robustness). We set 6 levels of severity for each corruption. For the occlusion, we set the side length of the occluded region to 40px, 80px, 120px, 160px, 200px, and 224px for severity levels 1 to 6, respectively. For the motion blur, we increase the (radius, sigma) arguments of the blur kernel with the severity level.
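A minimal sketch of the center-frame occlusion described above; the function name and the zero fill value are our own choices.

```python
import torch

def occlude_center_frame(clip, side):
    """Cut a side x side square out of the centre of the middle frame.

    clip: (T, C, H, W) tensor of one validation clip;
    side: occluded side length in pixels, e.g. 40-224 for severity levels 1-6.
    """
    T, C, H, W = clip.shape
    t = T // 2                                  # the single centre frame
    y0 = max(0, (H - side) // 2)
    x0 = max(0, (W - side) // 2)
    out = clip.clone()
    out[t, :, y0:y0 + side, x0:x0 + side] = 0   # fill value (zero) is an assumption
    return out
```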

Layers | TSN ResNet-50 / TSM ResNet-50 | Output size
conv1 | 1×7×7, 64, stride (1, 2, 2) | T×112×112
pool1 | 1×3×3 max pool, stride (1, 2, 2) | T×56×56
res2 | residual blocks ×3 | T×56×56
res3 | residual blocks ×4 | T×28×28
res4 | residual blocks ×6 | T×14×14
res5 | residual blocks ×3 | T×7×7
 | global average pool, FC | # of classes
Table 6: TSN & TSM ResNet-50 backbones.

B Additional experiments

We conduct additional experiments to identify the behaviors of the proposed method. All experiments are performed on SS-V1 using 8 frames. Unless otherwise specified, we use ImageNet pre-trained TSM ResNet-18 (TSM-R18) with a single SELFY block as our default SELFYNet.

model | FLOPs | top-1 | top-5
TSM-R18 | 14.6 G | 43.0 | 72.3
SELFYNet | 17.1 G | 47.8 | 77.1
SELFYNet | 17.3 G | 48.4 | 77.6
SELFYNet | 18.4 G | 48.4 | 77.8
SELFYNet | 19.8 G | 48.6 | 78.3
(a) Spatial matching region. Performance comparison with different spatial matching regions (U, V).
model | position | top-1 | top-5
TSM-R18 | - | 43.0 | 72.3
SELFYNet | | 45.7 | 77.6
SELFYNet | | 47.2 | 76.6
SELFYNet | | 48.4 | 77.6
SELFYNet | | 46.6 | 76.0
SELFYNet | | 42.8 | 72.6
SELFYNet | | 48.6 | 77.9
(b) Position. Performance comparison with different positions of the SELFY block. For the last row, 3 SELFY blocks are used in total.
model | features | top-1 | top-5
TSM-R18 | visual | 43.0 | 72.3
SELFYNet | STSS only | 45.5 | 75.9
SELFYNet | visual + STSS | 48.4 | 77.6
(c) STSS features with visual features.
model | range of l | top-1 | top-5
TSM-R18 | - | 43.0 | 72.3
SELFYNet | | 47.4 | 77.0
SELFYNet | | 48.3 | 77.2
SELFYNet | | 48.5 | 77.4
(d) Multi-channel kernel for feature extraction. Four convolution layers are used for extracting STSS features.
model | FLOPs | top-1 | top-5
TSM-R18 | 14.6 G | 43.0 | 72.3
TSM-R18 + NL [50] | 24.8 G | 43.8 | 73.1
TSM-R18 + CP [32] | 25.6 G | 46.7 | 75.7
SELFYNet | 17.3 G | 48.4 | 77.6
(e) Performance comparison with non-local methods [32, 50]. NL and CP denote a non-local block and a CP module, respectively.
model | similarity | extraction | top-1 | top-5
TSM-R18 | - | - | 43.0 | 72.3
SELFYNet | embedded Gaussian | mult. w/ features | 43.8 | 72.3
SELFYNet | embedded Gaussian | Conv | 47.6 | 76.8
SELFYNet | cosine | Conv | 47.8 | 77.1
(f) Performance comparison with the local self-attention mechanisms [15, 37]. We implement the local self-attention by following Ramachandran et al. [37].
Table 7: Additional experiments on SS-V1. Top-1 and top-5 accuracy (%) are shown.

Spatial matching region. In Table 7(a), we compare a single SELFY block with different spatial matching regions (U, V). Indeed, a larger spatial matching region leads to better accuracy. Considering the accuracy-computation trade-off, we choose a moderately sized matching region as our default.

Block position. From the second to the sixth row of Table 7(b), we examine the effect of the position of the SELFY block in the backbone. We resize the spatial resolution of the video feature tensor to 14×14 and fix the matching region for all cases, keeping the computational cost similar. Inserting SELFY after the third stage (res3) shows the best trade-off, achieving the highest accuracy among these cases. The last row in Table 7(b) shows that multiple SELFY blocks improve accuracy compared to the single block.

Multi-channel kernel for feature extraction. We investigate the effect of the convolution method for STSS feature extraction when using multi-channel kernels with a larger temporal extent. For this experiment, we stack four convolution layers followed by the feature integration step, the same as in Section 3.2.2. Table 7(d) summarizes the results. Note that we do not report models whose temporal window L is smaller than the temporal extent of the kernel. As shown in the table, the longer temporal range again gives higher accuracy. However, the effect of this kernel is comparable to that of the default kernel in Table 4(a) in terms of accuracy. Considering the accuracy-computation trade-off, we keep the default kernel size for STSS feature extraction.

Fusing STSS features with visual features. We evaluate a SELFYNet based purely on STSS features to see how much the ordinary visual features contribute to the final prediction. That is, we pass the STSS features into the downstream layers without additive fusion. Table 7(c) compares the results of using the visual features, the STSS features, and their combination on SS-V1. Interestingly, the SELFYNet using only STSS features achieves 45.5% top-1 accuracy, which is 2.5%p higher than the baseline. As we add the visual features back, we obtain an additional gain of 2.9%p. This indicates that the STSS features and the visual features are complementary to each other.

Comparison with non-local methods. We compare our method with popular non-local methods [50, 32], which capture long-range interactions in videos. While both methods compute global self-similarity tensors, they use them as attention weights for feature aggregation, either by multiplying them with the visual features [50] or by aligning top-k corresponding features [32]; neither uses STSS itself as a relational representation. In contrast, our method does exactly that and learns a more powerful relational feature from STSS. Table 7(e) summarizes the results. We re-implement the non-local block and the CP module in PyTorch based on their official code. For a fair comparison, we insert a single block or module at the same position (after the third stage of ResNet-18). Compared to the non-local block and the CP module, the SELFY block improves top-1 accuracy by 4.4%p and 1.5%p, respectively, while requiring 7.5 and 8.3 fewer GFLOPs. This demonstrates that the direct use of STSS features is more effective for action recognition than the indirect ways of using STSS, e.g., re-weighting visual-semantic features or learning correspondences.

Relation to local self-attention mechanisms. Local self-attention [15, 37, 55] and our method share the use of a self-similarity tensor but use it in very different ways and for different purposes. The local self-attention mechanism aims to aggregate local context features and thus uses the self-similarity values as attention weights for feature aggregation. In contrast, our method aims to learn a generalized motion representation from the local STSS, so the final STSS representation is fed directly into the network rather than being multiplied with local context features.

For an empirical comparison, we conduct the following ablation experiment. We extend the local self-attention layer [37] to the temporal dimension and add this spatio-temporal local self-attention layer, followed by feature integration layers, after the third stage. All experimental details are the same as those in supplementary material A, except that we reduce the channel dimension C of the appearance feature to 32. Table 7(f) summarizes the results on SS-V1. The spatio-temporal local self-attention layer achieves 43.8% top-1 accuracy, and the SELFY blocks using the embedded Gaussian and the cosine similarity both outperform it, achieving 47.6% and 47.8% top-1 accuracy, respectively. These results are in line with prior work [32], which finds that the self-attention mechanism hardly captures motion features in video.

C Visualizations

In Fig. 6, we visualize qualitative results of two SELFYNet-TSM-R18 models with different temporal ranges on SS-V1. We show the different predictions of the two models given 8 input frames. We also overlay Grad-CAM [38] on the input frames to see whether a larger volume of STSS helps capture long-term interactions in videos. We take Grad-CAMs of the features right before the global average pooling layer. As shown in the figure, STSS with a sufficient volume helps to learn a richer context of temporal dynamics in the video; in Fig. 6(a), for example, the SELFYNet with the longer temporal range focuses not only on the region where the action occurs but also on the white stain after the action, verifying whether the stain is wiped off or not.

(a)
(b)
(c)
(d)
Figure 6: Qualitative results of two SELFYNets on SS-V1. Each subfigure visualizes the prediction results of the two models with Grad-CAM-overlaid RGB frames. Correct and wrong predictions are colored green and red, respectively.