
Video Is Graph: Structured Graph Module for Video Action Recognition

In the field of action recognition, video clips are always treated as ordered frames for subsequent processing. To achieve spatio-temporal perception, existing approaches propose to embed adjacent temporal interaction in the convolutional layer. The global semantic information can therefore be obtained by stacking multiple local layers hierarchically. However, such global temporal accumulation can only reflect the high-level semantics in deep layers, neglecting the potential low-level holistic clues in shallow layers. In this paper, we first propose to transform a video sequence into a graph to obtain direct long-term dependencies among temporal frames. To preserve sequential information during transformation, we devise a structured graph module (SGM), achieving fine-grained temporal interactions throughout the entire network. In particular, SGM divides the neighbors of each node into several temporal regions so as to extract global structural information with diverse sequential flows. Extensive experiments are performed on standard benchmark datasets, i.e., Something-Something V1 & V2, Diving48, Kinetics-400, UCF101, and HMDB51. The reported performance and analysis demonstrate that SGM can achieve outstanding precision with lower computational complexity.





I Introduction

Video action recognition aims to identify the action category of a video clip and plays a significant role in applications such as video surveillance, human-computer interaction, and autonomous driving. However, video data poses more challenges than still images because of its joint spatial and temporal visual clues. Despite the recent rapid development in the community using deep learning techniques

[10, 17, 8, 28, 40, 33, 38, 37], extracting discriminative spatiotemporal features from raw video data is still an unsolved issue. Without considering supplementary modalities such as optical flow, current end-to-end video recognition methods can be mainly divided into two categories: spatiotemporal joint learning methods based on 3D convolution [30, 10, 8, 24, 31, 3, 12, 39] and spatiotemporal separation learning methods based on additional hand-designed temporal feature learning modules [43, 17, 28, 37, 20, 19, 44].

Fig. 1: We use a specific frame (in the red box) to compare the three different allocation forms of a video clip. (a) is the widely used sequence representation, in which each frame can only communicate with adjacent frames. (b) is a graph representation, in which it can communicate with all frames, but the sequential information is destroyed. (c) is a structured graph representation, enabling interaction with different temporal regions, so that the global structural information and sequential features can be obtained simultaneously.

Existing approaches employ a similar strategy to achieve temporal modelling, i.e., focusing on local temporal feature extraction and increasing network depth or stacking temporal modules to obtain long-term dependencies. Though some hidden long-term patterns can be captured by such local aggregation, global spatial saliency and temporal perception could conflict with each other. In particular, a specific temporal module is designed to extract local information at its current stage, so spatial information is easily neglected. Besides, at different network depths, spatial details are continuously suppressed by convolution and pooling operations; as the temporal receptive field expands with depth, it is uncertain whether meaningful spatial information is lost. It should also be noted that this local aggregation only considers the temporal cues within local windows at each step, thus lacking a perception of the global temporal structure in the shallow layers.

In this paper, we reorganize the frames in a video clip to form an interconnected frame graph. Based on the above discussion, we assume all frames are interlinked by learnable edges, so that the video sequence is transformed into a complete graph. We adopt the graph convolution method to share the information of each frame along the edges, realizing the extraction of long-term dependencies in every single step. However, after being converted to a graph, each frame in the video clip treats its neighbor frames equally, so the natural sequential information of the video clip is seriously damaged. To tackle this problem, we devise the Structured Graph Module (SGM). SGM is designed to group neighbor nodes according to temporal interval and temporal direction, thus dividing these nodes into different temporal regions. In Fig. 1, we compare the three organization forms of a video clip.

By grouping, the original complete graph is divided into several disjoint sub-graphs, with each subgraph containing the relationships between each node and its specific temporal region. Since subgraphs are established according to the original sequence attributes of the video clip, inference clues from different subgraphs contain their corresponding temporal patterns. Therefore, SGM can simultaneously extract global structural information and temporal sequential features. In this way, SGM can force the otherwise blind information transmission process to be more concentrated without superfluous operations.

To obtain powerful spatiotemporal features, we insert SGM within the InceptionV3 [29] network to construct an SGN network. We evaluate it on short-interval motion-focused datasets (Something-Something V1 & V2 [11, 22]), a long-interval motion-focused dataset (Diving48 [18]), and scene-focused datasets (Kinetics-400 [3], UCF101 [27], HMDB51 [15]). Our SGN achieves on par with or better than the latest approaches on these datasets, with a slight increase in computation (1.08× as many GFLOPs as InceptionV3).

The main contributions of this paper are summarized as follows:

  • We propose to reorganize the frames in a video clip and transform it into graph structures to capture long-term dependencies.

  • We propose a novel Structured Graph Module (SGM), which groups the neighbors of nodes according to the temporal prior information, so that sufficient sequential information can be preserved in the transformation process from sequence representation to graph representation.

  • We construct the action recognition network SGN by inserting SGM into InceptionV3 standard blocks. Due to the innovative SGM, the SGN can extract various spatiotemporal features and perceive global structural information, achieving SOTA performance on various datasets with a marginal increase in computation.

II Related Work

Video action recognition. Advances in video action recognition have largely mirrored progress in spatiotemporal feature extraction. Early works [26, 9] extract appearance and motion information by feeding RGB frames or optical flow to 2D-CNNs. However, calculating and storing optical flow is costly. Therefore, efficiently extracting temporal information from raw RGB frames has become an attractive research topic.

3D-CNNs [30, 12] are a direct extension of 2D-CNNs that jointly learn spatiotemporal features with 3D convolutions. To alleviate the optimization complexity, I3D [3] inflated pretrained 2D kernels to 3D. P3D [24] and R(2+1)D [31] decompose 3D convolution into independent temporal (1D) and spatial (2D) convolutions. To reduce the computation overhead, S3D [39] and ECO [45] used different convolution types in different stages of the network. To develop more capabilities, SlowFast [8] adopted two branches that respectively focus on appearance or motion information. TPN [40] utilized the output at each stage to capture action instances at various tempos. However, 3D-CNNs simply expand the temporal dimension, thus lacking consideration of the inherent data distribution across spatial and temporal dimensions. As a result, they need a dense temporal sampling rate and longer input sequences to gain satisfying performance.

Fig. 2: The group process of the structured graph. Figure (a) is the original complete graph and its corresponding adjacency matrix. Each row of the adjacency matrix represents the relationship between the node at that index position and other nodes. The colored ellipses in Figure (a) represent regions divided by different grouping principles. The grouping principle, subregions divided and the corresponding adjacency matrix are shown in Figure (b).

To disentangle the spatiotemporal transformation process, another type is the spatiotemporal separation learning manner. The key is to add spatiotemporal modeling capabilities to the original 2D network. TSM [19] introduced a shift operation to achieve interaction between neighbor frames. TEI [20] parameterized the shift operation and used neighbor differences to excite channels. TEA [17] proposed two sequential modules to excite motion information and aggregate multiple temporal cues. GSM [28] proposed a fine-grained gate to control the shift operation between adjacent frames. These methods share the same schema of first modeling short-term temporal information and then stacking layers (or operations) to expand the receptive field, which is less efficient and risks losing long-term patterns.

Long-term temporal modeling. Early attempts tried to capture long-term dependencies in the deep convolutional layers, such as using RNNs [23, 5] or multi-scale MLPs [43]. But capturing dependencies in the high-level semantic space can easily miss useful details. To this end, the non-local approach [35] proposed to build direct dependencies between spatiotemporal pixels, delivering effective performance while sacrificing efficiency with dense and heavy connections. StNet [13] designed a hierarchical manner to learn local and global information. V4D [42] added clip-level convolutions in the later stages to aggregate long-term information. These methods still adopt an inefficient stacking mechanism to gain long-term patterns. Recently, transformer-based methods [2, 1] were also proposed to directly model long-term patterns, but they are hard to train and computationally heavy.

Graph methods for video action recognition. While there are considerable works using graph modeling in downstream video understanding tasks and skeleton-based action recognition, less attention has been paid to video action recognition. Wang et al. [36] first introduced graphs to video action recognition, but their method relied on spatiotemporal features extracted by I3D. TRG [41], on the other hand, tried to insert GAT [32] modules into different stages, neglecting the discrepancy between video sequences and traditional node data. DyReG [6] used an RNN to generate spatiotemporal nodes, an MLP to send messages along edges, and finally another GRU [4] unit to update each node. However, it filtered out much of the scene information with a complicated and heavy design.

Method backbone Frames GFLOPs V1 Top-1(%) V1 Top-5(%) V2 Top-1(%) V2 Top-5(%)
TSN-RGB [34] BNInception 8 16 19.5 - - -
TRN-Multiscale [43] BNInception 8 33 34.4 - 48.3 77.6
S3D-G [39] Inception 64 71.38 48.2 78.7 - -
GSM [28] InceptionV3 16 53.7 50.6 - - -
TSM [19] ResNet-50 8 33 45.6 74.2 - -
TSM [19] ResNet-50 16 65 47.2 77.1 63.4 88.5
TSM [19] ResNet-50 8+16 98 49.7 78.5 - -
TEINet [20] ResNet-50 8 33 47.4 - 61.3 -
TEINet [20] ResNet-50 16 66 49.9 - 62.1 -
TEINet [20] ResNet-50 (8+16)×30 99×30 52.5 - 66.5 -
TEA [17] ResNet-50 16 70 51.9 80.3 - -
TAM [7] ResNet-50 16×2 47.7 48.4 78.8 61.7 88.1
ECO [45] BNIncep+R18 92 267 46.4 - - -
I3D [3] ResNet-50 32×2 306 41.6 72.2 - -
GST [21] ResNet-50 16 59 48.6 77.9 62.6 87.9
STM [14] ResNet-50 16×30 67×30 50.7 80.4 64.2 89.8
V4D [42] ResNet-50 8×4 167.6 50.4 - - -
SmallBigNet [16] ResNet-50 8+16 157 50.4 80.5 63.3 88.8
SGN(ours) InceptionV3 8 25.4 48.9 77.2 61.6 87.5
SGN(ours) InceptionV3 16 50.8 51.2 78.9 63.1 88.5
SGN(ours) InceptionV3 (16+8)×2×3 76.2×6 54.9 82.4 67.1 90.9
TABLE I: Comparison with state-of-the-art methods on Something-Something V1 & V2. Our proposed SGN achieves better performance with less computation.

III Approach

In this section, we first explain how to represent a video clip with a graph (Sec. III-A). Then we introduce the structured graph module (SGM), which divides neighbors into different temporal regions according to temporal prior information (Sec. III-B). Finally, we discuss how to integrate the proposed module into an existing 2D network to gain multi-scale spatiotemporal modeling ability (Sec. III-C).

III-A General graph representation of video clips

Here we present a general graph structure representation for video clips. Given a video, we first uniformly divide it into $T$ segments and then select one frame from each segment to form the video's sparse representation $X = \{x_1, x_2, \dots, x_T\}$. Here $X$ denotes the set of input frames and $x_t$ is the $t$-th frame (the index $t$ also implies the temporal order). Next, we express the corresponding graph representation of $X$ as $G = (V, E)$, where $V$ and $E$ denote the node set and edge set of the graph. Different from traditional graph nodes, which are separate individuals, a frame clip is essentially a sparse sampling representation of a single video. From this perspective, when we build the graph between frames of a video, we are finding the relations between its different temporal components. In this way, we can assume that there is a universal graph that describes how the different temporal components rely on each other. As discussed above, we designate the frames as nodes of the graph, i.e. $V = X$, and set the edges $E$ as learnable weights.
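The segment-based sparse sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the train/test offset policy (random offset per segment during training, center frame otherwise) are our assumptions, following common TSN-style practice:

```python
import random

def sparse_sample(num_video_frames, T, training=True):
    """Uniformly split a video into T segments and pick one frame index
    from each segment: a random offset when training, the center otherwise."""
    seg_len = num_video_frames / T
    indices = []
    for t in range(T):
        start, end = int(t * seg_len), int((t + 1) * seg_len)
        end = max(end, start + 1)  # guard against empty segments
        if training:
            indices.append(random.randrange(start, end))
        else:
            indices.append((start + end - 1) // 2)
    return indices

# e.g. sparse_sample(80, 8, training=False) -> [4, 14, 24, 34, 44, 54, 64, 74]
```

Because the segments never overlap, the sampled indices are always in temporal order, which is what lets the index $t$ double as the temporal order of node $x_t$.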

Once the weighted edge set $E$ is obtained, we can directly derive the corresponding adjacency matrix $A \in \mathbb{R}^{T \times T}$. The process to aggregate and update each node can then be formulated as:

$$y_t = \sum_{j=1}^{T} A_{tj} \, W x_j,$$

where $A_{tj}$ is the element in the $t$-th row and $j$-th column of the adjacency matrix $A$, $W$ is originally a linear transformation's weight matrix (here replaced by a convolution operation), and $y_t$ is the reasoning result at the $t$-th temporal point.
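A dependency-free sketch of this aggregation step, with a plain matrix multiply standing in for the convolutional transform (our simplification, not the authors' implementation):

```python
def graph_aggregate(A, X, W):
    """One aggregation step: y_t = sum_j A[t][j] * (W @ x_j).
    A: T x T adjacency (nested lists), X: T x d node features,
    W: d x d transform (stand-in for the paper's convolution)."""
    T, d = len(X), len(X[0])
    # transform every node feature first: W @ x_j
    WX = [[sum(W[r][c] * x[c] for c in range(d)) for r in range(d)] for x in X]
    # then mix the transformed features along the learned edges
    return [[sum(A[t][j] * WX[j][r] for j in range(T)) for r in range(d)]
            for t in range(T)]
```

With an identity adjacency each node keeps its own transformed feature; any non-zero off-diagonal entry lets a frame draw information from an arbitrarily distant frame in a single step, which is the long-term dependency the graph view provides.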

III-B Structured graph module (SGM)

After we transform a video clip into a graph structure, the connections between the different components of this video clip are established, so long-range dependencies can be captured directly. However, the graph structure breaks the original sequential arrangement of frames, resulting in the loss of sequence information. To solve this problem, we attempt to utilize the prior information of the ordered sequence to guide the information flow in the established graph structure. In practice, we divide the neighbors of each node into several groups by principles related to temporal attributes, and decompose the initial complete graph into the corresponding sub-graphs. We consider two physical properties of video sequences. 1) Temporal direction. A video sequence consists of frames occurring in chronological order, so the temporal direction is of much importance for capturing temporal clues such as 'from left to right' or 'from right to left'. 2) Temporal interval. A human can perceive instantaneous motion and displacement from a small temporal window, or global semantic change from connections with large temporal intervals. That means connections with different temporal spans contain information at different temporal scales.

As described above, we have transformed a video frame sequence $X$ into a graph representation $G = (V, E)$. To make the grouping concise, we divide the edge set $E$ into several disjoint edge sets:

$$E = E_1 \cup E_2 \cup \dots \cup E_K, \qquad E_m \cap E_n = \varnothing \ \ (m \neq n).$$
Here $K$ is the number of grouping principles. We adopt two sequence attributes, temporal direction and temporal interval, as the basis for grouping. First, we divide the neighbors of each node into local and global nodes according to the temporal interval:

$$E_{loc} = \{\, e_{ij} \mid |i - j| \le \lambda T \,\}, \qquad E_{glo} = \{\, e_{ij} \mid |i - j| > \lambda T \,\},$$

where $\lambda$ is a threshold and $T$ is the number of frames. Intuitively, $E_{loc}$ contains edges of small temporal span while $E_{glo}$ contains edges of large temporal span, so these two sets respectively carry local or global information. Then, we utilize the temporal direction to further sub-classify each set:

$$E_{loc}^{f} = \{\, e_{ij} \in E_{loc} \mid i < j \,\}, \qquad E_{loc}^{b} = \{\, e_{ij} \in E_{loc} \mid i > j \,\},$$

and analogously $E_{glo}^{f}$ and $E_{glo}^{b}$.
Since $i$ and $j$ represent the frame indexes connected by edge $e_{ij}$, their values also denote the temporal order. Thus the local and global regions are each divided into two sub-sets (forward or backward) according to the relative temporal order. The adjacency matrix $A^{(k)}$ corresponding to each edge set $E_k$ is constructed as:

$$A^{(k)}_{ij} = \begin{cases} A_{ij}, & e_{ij} \in E_k, \\ 0, & \text{otherwise}, \end{cases}$$

where $A^{(k)}_{ij}$ is the element in the $i$-th row and $j$-th column of adjacency matrix $A^{(k)}$, and $A_{ij}$ is the corresponding element of the original adjacency matrix $A$.
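The grouping by interval and direction amounts to masking the original adjacency matrix four ways. A minimal sketch under that reading (the dictionary keys and helper names are ours, not the paper's):

```python
def structured_adjacency(A, lam):
    """Split a T x T adjacency matrix A into four masked copies
    (local/global x forward/backward). lam is the interval threshold
    (the lambda of the ablation tables)."""
    T = len(A)
    span = lam * T

    def mask(local, forward):
        out = [[0.0] * T for _ in range(T)]
        for i in range(T):
            for j in range(T):
                in_local = abs(i - j) <= span
                right_dir = (i < j) if forward else (i > j)  # drops self-loops
                if in_local == local and right_dir:
                    out[i][j] = A[i][j]
        return out

    return {"local_fwd": mask(True, True), "local_bwd": mask(True, False),
            "global_fwd": mask(False, True), "global_bwd": mask(False, False)}
```

Every off-diagonal edge of the complete graph lands in exactly one of the four masks, so the decomposition loses no connections; it only routes them through direction- and scale-specific sub-graphs.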

As shown in Fig. 2 (a), the relationship between nodes (or frames) in the established graph structure is reflected in the adjacency matrix. Each row of the adjacency matrix represents the relationships between the node at the corresponding index position and the nodes at other positions. Therefore, grouping the neighbors of each node corresponds to dividing each row of the adjacency matrix. Since different neighbor grouping policies extract different subgraphs from the original graph structure, there are one-to-one correspondences among the grouping policies, the subgraphs, and the adjacency matrices. Fig. 2 (b) reflects this correspondence.

After obtaining the divided sub-graphs, we first reason on each sub-graph to obtain specific temporal features such as local-forward, global-backward, etc. Then all the sub-reasoning results are gathered by a fusion operation. The whole process is formulated as:

$$y_t = \mathcal{F}\big(y^{(1)}_t, y^{(2)}_t, \dots, y^{(K)}_t\big),$$

where $y^{(k)}_t$ represents the reasoning result of the $k$-th sub-graph at the $t$-th time point, and $\mathcal{F}$ can be either concatenation followed by convolution, or direct addition. Analogous to the aggregation above, each sub-graph reasoning result is formulated as:

$$y^{(k)}_t = \sum_{j=1}^{T} A^{(k)}_{tj} \, W_k x_j,$$

where $A^{(k)}_{tj}$ is the element in the $t$-th row and $j$-th column of adjacency matrix $A^{(k)}$, and $W_k$ is the corresponding convolution operation. Fig. 3 shows this process.
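The per-subgraph reasoning followed by additive fusion can be sketched together as below. This is a toy sketch with plain matrices standing in for convolutions; `sgm_forward` and its argument layout are our naming, not the paper's:

```python
def sgm_forward(subgraphs, X, Ws):
    """Additive fusion of per-subgraph reasoning:
    y_t = sum_k sum_j A_k[t][j] * (W_k @ x_j).
    subgraphs: list of T x T masked adjacencies; Ws: matching d x d weights."""
    T, d = len(X), len(X[0])
    y = [[0.0] * d for _ in range(T)]
    for A_k, W_k in zip(subgraphs, Ws):
        # per-subgraph transform: W_k @ x_j for every node
        WX = [[sum(W_k[r][c] * x[c] for c in range(d)) for r in range(d)]
              for x in X]
        # aggregate along that subgraph's edges and accumulate into y
        for t in range(T):
            for r in range(d):
                y[t][r] += sum(A_k[t][j] * WX[j][r] for j in range(T))
    return y
```

Because each subgraph carries its own transform $W_k$, local-forward, global-backward, etc. contribute differently parameterized messages even when they connect the same pair of frames.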

Fig. 3: Overall network architecture of SGN. SGM and the 2D convolutions from InceptionV3 together constitute a module capable of extracting various spatiotemporal features. The forward propagation process of SGM is on the right.

III-C Network architecture

The SGM module can be easily combined with current popular 2D networks. Considering that SGM is designed to capture various temporal cues, we choose the InceptionV3 network as our backbone, for it can well model multi-scale spatial patterns. Following the practical experience of GST [21] and GSM [28] in designing networks, we insert the SGM module into the Inception standard modules. In this way, the various temporal features from SGM and the multi-scale spatial features from the other Inception branches are cascaded together to form rich spatiotemporal features. The overall network architecture SGN is shown in Fig. 3.

IV Experiments

We evaluate our methods on 6 challenging and typical datasets. We first introduce these datasets and implementation details. Then we show the ablation study and compare our model with other SOTA methods.

IV-A Datasets and implementation details

Motion-focused datasets, including Something-Something V1 [11] & V2 [22] and Diving48 [18]. Something-Something is a large-scale dataset created by a large number of crowd workers. As it was collected by performing the same actions with different objects in different scenes, Something-Something places greater temporal modeling demands on action recognition methods. Something-Something has two versions. The first consists of 86,017 training videos and 11,522 validation videos belonging to 174 action categories, while the second contains more training videos (168,913) as well as validation videos (24,777) with the same categories. Besides, samples in Something-Something are about 2-4 seconds long, so this dataset focuses on short-term actions. Diving48 is a fine-grained video dataset of competitive diving, consisting of 18k trimmed video clips of 48 unambiguous dive sequences. As it was designed with no significant biases towards static or short-term motion representations, Diving48 is suitable for assessing the ability to model long-term and fine-grained dynamic information. In addition, samples in Diving48 usually contain several distinct stages of a diving action, so it is suitable for examining the ability of a model to perceive the global structure of video clips.

Scene-focused datasets, including UCF101, HMDB51, and Kinetics-400. All three datasets are scene-focused, and even a single frame often contains enough information to guess the category. Kinetics-400 consists of approximately 240k training and 20k validation videos trimmed to 10 seconds, covering 400 human actions. UCF101 and HMDB51 are smaller datasets: UCF101 includes 13,320 videos with 101 action classes, and HMDB51 contains 6,766 videos with 51 categories. These two small datasets are well suited to verifying the transferability of a model.

Implementation details

We adopt InceptionV3 pre-trained on ImageNet as our backbone. We randomly sample $T$ frames from a video as the input sequence. The short spatial side is then resized to 256 and the final spatial size is cropped to 229×229 (to match the input size of InceptionV3). During training, we apply random cropping and flipping as data augmentation. The network is trained using SGD with an initial learning rate of 0.01 and momentum of 0.9 on two GPUs. We use a cosine learning rate schedule to update the learning rate at each epoch. The total number of training epochs is set to 60, with the first 10 epochs used for gradual warm-up. The batch size is 32 for $T=8$ and 16 for $T=16$. During inference, for efficient comparison, we use a single clip and a center crop of size 229×229 for evaluation. For accuracy comparison, we adopt 2 clips and 3 crops of size 261×261 to get the final average prediction.
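As a concrete illustration of the schedule described above, here is a minimal sketch. The exact warm-up curve is our assumption (a linear ramp); the text only states cosine decay over 60 epochs with the first 10 for warm-up:

```python
import math

def lr_at_epoch(epoch, total=60, warmup=10, base_lr=0.01):
    """Cosine learning-rate schedule with linear warm-up, matching the
    recipe above: 60 epochs total, first 10 for warm-up, base lr 0.01."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup  # linear ramp up to base_lr
    progress = (epoch - warmup) / (total - warmup)  # 0 -> 1 after warm-up
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The warm-up keeps early updates small while the randomly initialized SGM edges stabilize; afterwards the cosine term decays the rate smoothly toward zero by the final epoch.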

IV-B Ablation Study

We report the ablation experiment results on the Something-Something V1 dataset. All the results are reported with the efficient setting, i.e., a single clip with a center crop.

Study of SGM. We first set up two learning paradigms to determine the weights of graph edges in the proposed SGM. In the first paradigm, we take frames as nodes and suppose that each video sample has a unique graph structure, so the specific edge weights between node pairs can be determined by attention-based methods [32]. In the other paradigm, we suppose there is a universal graph that describes how the different temporal components of a video clip rely on each other, so the frames are connected with a fixed weighted graph. According to whether the train and test datasets share the same graph structure, we call the first the inductive paradigm and the second the transductive paradigm.

For each learning paradigm, we gradually add temporal prior information to decompose the graph structure, so there are three different graph structures: full (without decomposition), l&g (decomposed into local and global subgraphs), and l&g (directional) (decomposed into four subgraphs: local-forward, local-backward, global-forward, and global-backward). Table II shows the results under different settings. Firstly, the performance of the transductive paradigm is better in every situation, which indicates that relying on the matching degree of spatial semantics may mislead the information flow in the graph. We believe the main reason is that the attention-based method only considers the spatial semantic similarity of two nodes when determining the weight of the edge connecting them, neglecting other information such as the direction and position of edges in the overall structure. It also illustrates that there is still a migration gap between the video field and traditional graph models.

As Table II shows, information of temporal interval and temporal direction provides 3.3% and 5.92% improvement for the inductive paradigm, and 0.41% and 1.06% improvement for the transductive paradigm. In total, temporal priors provide a striking improvement of almost 9.22% and 1.7% for the two paradigms, respectively. We also compare the result of fusing four complete graphs and find that simply increasing the number of graphs does not provide gains, which fully illustrates the important role of our proposed SGM.

Paradigm graph structure Top-1(%)
2D backbone - 18.47
1⃝ Inductive full 38.32
2⃝ Transductive full 47.47
3⃝ Inductive l&g 41.62
4⃝ Transductive l&g 47.88
5⃝ Inductive l&g (directional) 47.54
6⃝ Transductive l&g (directional) 48.94
7⃝ Transductive full ×4 47.44
TABLE II: Ablation study results of paradigms and graph structures. All the results show the effectiveness of graph methods. 1⃝ vs 2⃝, 3⃝ vs 4⃝, and 5⃝ vs 6⃝ give the specific comparison between inductive and transductive paradigms. 1⃝ vs 3⃝ vs 5⃝ and 2⃝ vs 4⃝ vs 6⃝ give the specific comparison between different graph structures. These results firmly confirm that the transductive paradigm is much better and that structurally decomposing the graph is very effective.

Threshold separating local and global regions. As defined in Sec. III-B, the threshold $\lambda$ divides the nodes into a temporal local range and a temporal global region. We explore how different values of $\lambda$ influence the performance. Table III shows the results of the ablation experiments. We found that performance degrades when the threshold is too large or too small, because improper partitioning can jumble different discriminative temporal features. Following the results in Table III, we finally set $\lambda$ to 1/8.

λ=1/16 λ=1/8 λ=1/4 λ=1/2
Top-1 (T=8) - 48.94% 48.16% 47.87%
Top-1 (T=16) 50.46% 51.21% - -
TABLE III: Ablation study results of the temporal threshold λ. Here T is the total number of frames in the input video clip.

Fusion method. As for the strategy of fusing the inference results of the subgraphs, we compare direct addition against convolution after concatenation (cascade). Table IV shows the computational overhead, parameters, and performance under the different fusion strategies. We finally adopt the addition strategy due to its fewer parameters, lower computation, and slightly better results.

Fusion strategy Param. Flops Top-1(%)
cascade 27.3M 28.4G 48.88
addition 24.0M 25.4G 48.94
TABLE IV: Ablation study results of fusion strategies.
Method Top-1(%)
SlowFast [8] 77.6
TimeSformer [2] 74.9
TimeSformer-HR [2] 78.0
TimeSformer-L [2] 81.0
GSM(our impl.) [28] 80.7
SGM(ours) 84.7
SGM(double-clips)(ours) 86.9
TABLE V: Comparison with state-of-the-art methods on Diving48.
Method Top-1(%)
3rd place 35.9
2nd place 37.0
SGM 37.9
SGM (ensemble) (1st) 45.4
TABLE VI: Results of the MMVRAC Fisheye Video-based Action Recognition competition (ICCV 2021).
Method backbone Frames GFLOPs Top-1 Top-5
TSN-RGB [34] InceptionV3 25×1×10 3.2×250 72.5 90.2
S3D-G [39] InceptionV1 64×10×3 71.4×30 74.7 93.4
TSM [19] ResNet-50 16×10×3 65×30 74.7 91.4
TEINet [20] ResNet-50 16×10×3 66×30 76.2 92.5
TEA [17] ResNet-50 16×10×3 70×30 76.1 92.5
TAM [7] ResNet-50 48×3×3 93.4×9 73.5 91.2
R(2+1)D [31] ResNet-34 32×10×1 152×10 74.3 91.4
NL I3D [35] ResNet-50 128×10×3 282×30 76.5 92.6
SlowFast [8] ResNet-50 (4+32)×10×1 36.1×10 75.6 92.1
SGM(ours) InceptionV3 8×3×3 25.4×9 73.6 91.2
SGM(ours) InceptionV3 16×3×3 50.8×9 75.4 92.1
SGM(ours) InceptionV3 (8+16)×3×3 76.2×9 76.2 92.6
SGM(ours) InceptionV3 (8+16+24)×3×3 152.4×9 77.0 93.0
TABLE VII: Comparison with state-of-the-art methods on Kinetics-400.
Method pretrained backbone UCF101 HMDB51
TSN [34] ImageNet InceptionV2 86.4% 53.7%
P3D [24] ImageNet ResNet-50 88.6% -
C3D [30] Sports-1M ResNet-18 85.8% 54.9%
I3D [3] ImageNet+Kinetics InceptionV2 95.6% 74.8%
S3D [39] ImageNet+Kinetics InceptionV2 96.8% 75.9%
TSM [19] Kinetics ResNet-50 96.0% 73.2%
STM [14] ImageNet+Kinetics ResNet-50 96.2% 72.2%
TEA [17] ImageNet+Kinetics ResNet-50 96.9% 73.3%
SGM(ours) ImageNet+Kinetics InceptionV3 95.6% 78.1%
TABLE VIII: Comparison with state-of-the-art methods on UCF101 and HMDB51.
Fig. 4: Visualization of activation maps with Grad-CAM. The first row is the input video clip.
Fig. 5: Adjacency matrices in different SGN layers on Kinetics-400 and Something-Something V1. In a matrix, the vertical dimension represents the temporal position and the horizontal dimension represents relationships, so row $i$, column $j$ represents the weight of the edge from frame $i$ to frame $j$.

IV-C State-of-the-art comparison

Something-Something. In the Something-Something datasets, different categories of samples may share common scenes and objects, so these datasets are widely used to evaluate temporal modeling capability. Table I gives a summary of the results on the Something-Something V1 and V2 datasets. Among the compared methods, TSN and TRN use a 2D network with late fusion; TSM, TEI, TEA, TAM, and GSM adopt a 2D network combined with a temporal module; and the remaining methods are recent methods using 3D modules. Results on both datasets consistently prove that our SGN achieves nearly the best performance with the lowest computational overhead. Among the comparative approaches, our network is closest to the framework that adds temporal modules to 2D backbones. TSM and GSM use temporal channel shift operations to simulate temporal convolution; TEI and TAM use depthwise temporal convolution to parameterize the shift operation and add excitation or a multi-branch structure to enhance temporal information; TEA utilizes temporal difference excitation mechanisms and Res2Net-like structures to extend the receptive field of temporal convolution. In contrast, our proposed SGM does not involve complicated manual intervention but lets the model optimize automatically from the perspective of graph structure, exceeding the previous methods at the lowest cost.

Diving48. Different from the Something-Something datasets, Diving48 is a diving action dataset with longer video durations and distinct action stages. We use the latest version of the annotations and report the results of 16 frames with a single clip or double clips. Table V shows the results. In the single-clip case, we reproduce the results of GSM, which is closest to our network architecture; we are 4% more accurate than GSM. With double clips, we are 5.9% better than TimeSformer-L, whose input clip contains 96 frames, achieving the best result so far. The experiments on Diving48 fully demonstrate the advantages of SGN in capturing long-term dependencies and modeling the global structure of video clips.

Kinetics-400. Note that Kinetics-400 is collected from YouTube and releases only video links, so sample loss is widespread. In the version we adopted, approximately 10,000 samples were missing, causing approximately 0.5% damage to the final performance. In general, for the Kinetics-400 dataset, scene information alone can already yield considerable performance. We compare SGN to methods of the same type and present the results in Table VII. We can still achieve competitive results on Kinetics-400: under the framework of a 2D network with a temporal module, we come very close to the best performance.

UCF101 and HMDB51. UCF101 and HMDB51 are two small datasets, and we transfer the model pre-trained on Kinetics to them to test the generalization of the proposed SGN. We report the average result in Table. VIII over three splits with 16 frames as input. Since HMDB51 relies more on temporal cues, we achieve the best results on HMDB51 and acceptable results on UCF101.

IV-D Visualization

We first use Grad-CAM [25] to visualize the class activation maps; Fig. 4 shows the results. The model with a complete graph module ignores the keyframes of the action. When the graph is decomposed into local and global subgraphs, attention to the keyframes increases, but some noisy frames also attract attention. Finally, with the full SGM, the model attends only to the keyframes. These visualization results show that splitting the complete graph into multiple subgraphs gradually separates the mixed temporal cues.
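For reference, the Grad-CAM map visualized above reduces to weighting each feature map by its spatially averaged gradient and applying a ReLU. The following is a minimal numpy sketch of that computation, with toy arrays standing in for a real network's activations and gradients:

```python
import numpy as np

def grad_cam_map(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Compute a Grad-CAM heatmap from one layer's activations and gradients.

    activations: (C, H, W) feature maps of the chosen conv layer.
    gradients:   (C, H, W) gradients of the class score w.r.t. those maps.
    Returns an (H, W) map normalized to [0, 1].
    """
    # Channel weights: global-average-pool the gradients (alpha_k in Grad-CAM).
    weights = gradients.mean(axis=(1, 2))  # shape (C,)
    # Weighted sum of the feature maps, then ReLU to keep positive evidence.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

In practice the activations and gradients are captured with forward/backward hooks on the chosen layer, and the resulting map is upsampled to the input resolution before overlaying on the frame.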

We also visualize the adjacency matrices learned in different layers of SGN. As shown in Fig. 5, different layers attend to different temporal patterns: as depth increases, global features are increasingly emphasized. The adjacency matrices of the same layer also differ across datasets. On the Something-Something V1 dataset, whose samples are shorter, local features are emphasized, while on the longer Kinetics-400 clips, global information receives more attention.

IV-E MMVRAC Fisheye Video-based Action Recognition Competition (ICCV 2021)

The SGN network was used in the first-place solution of the MMVRAC Fisheye Video-based Action Recognition competition (ICCV 2021). In Table VI we report the performance of our SGN and of the top-3 solutions. The final first-place solution is an ensemble of multiple models trained with several training strategies.

V Conclusion

In this paper, we abandoned the popular view of video clips as sequences and proposed to treat them as graphs with interconnected components. When converting clips to graph representations, we noticed the loss of sequential information during the transformation and proposed a structural decomposition, the structured graph module (SGM), to alleviate this problem. Guided by temporal prior attributes, the decomposition allows SGM to capture global structural information and sequential features simultaneously. We designed extensive ablation experiments to demonstrate the effectiveness of SGM. Finally, the promising results on six popular action recognition datasets strongly suggest that our method obtains state-of-the-art performance at a very low computational cost. Overall, the most important contribution of our method is to treat the temporal distribution of video clips from a new perspective and to put forward an effective, practical scheme. We hope the idea of viewing the temporal dimension of videos from a graph perspective will receive more attention.


This work was supported by the National Natural Science Foundation of China (U1836218, 62020106012), the 111 Project of the Ministry of Education of China (B12018), the UK Engineering and Physical Sciences Research Council (EPSRC) under grant EP/R013616/1, and the EPSRC Programme Grant (FACER2VM) EP/N007743/1.


  • [1] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid (2021) ViViT: a video vision transformer. Cited by: §II.
  • [2] G. Bertasius, H. Wang, and L. Torresani (2021) Is space-time attention all you need for video understanding?. Cited by: §II, TABLE V.
  • [3] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §I, §I, TABLE I, §II, TABLE VIII.
  • [4] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §II.
  • [5] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell (2017) Long-term recurrent convolutional networks for visual recognition and description. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 677–691. Cited by: §II.
  • [6] I. Duta and A. Nicolicioiu (2020) Dynamic regions graph neural networks for spatio-temporal reasoning. arXiv e-prints. Cited by: §II.
  • [7] Q. Fan, C. Chen, H. Kuehne, M. Pistoia, and D. Cox (2019) More is less: learning efficient video representations by big-little network and depthwise temporal aggregation. arXiv preprint arXiv:1912.00869. Cited by: TABLE I, TABLE VII.
  • [8] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) Slowfast networks for video recognition. In Proceedings of the IEEE international conference on computer vision, pp. 6202–6211. Cited by: §I, §II, TABLE V, TABLE VII.
  • [9] C. Feichtenhofer, A. Pinz, and A. Zisserman (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1933–1941. Cited by: §II.
  • [10] C. Feichtenhofer (2020) X3D: expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 203–213. Cited by: §I.
  • [11] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. (2017) The "something something" video database for learning and evaluating visual common sense. In ICCV, Vol. 1, pp. 5. Cited by: §I, §IV-A.
  • [12] K. Hara, H. Kataoka, and Y. Satoh (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6546–6555. Cited by: §I, §II.
  • [13] D. He, Z. Zhou, C. Gan, F. Li, and S. Wen (2019) StNet: local and global spatial-temporal modeling for action recognition. Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 8401–8408. Cited by: §II.
  • [14] B. Jiang, M. Wang, W. Gan, W. Wu, and J. Yan (2019) Stm: spatiotemporal and motion encoding for action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2000–2009. Cited by: TABLE I, TABLE VIII.
  • [15] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In IEEE International Conference on Computer Vision, Cited by: §I.
  • [16] X. Li, Y. Wang, Z. Zhou, and Y. Qiao (2020) Smallbignet: integrating core and contextual views for video classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1092–1101. Cited by: TABLE I.
  • [17] Y. Li, B. Ji, X. Shi, J. Zhang, B. Kang, and L. Wang (2020) TEA: temporal excitation and aggregation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 909–918. Cited by: §I, TABLE I, §II, TABLE VII, TABLE VIII.
  • [18] Y. Li, Y. Li, and N. Vasconcelos (2018) Resound: towards action recognition without representation bias. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 513–528. Cited by: §I, §IV-A.
  • [19] J. Lin, C. Gan, and S. Han (2019) Tsm: temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7083–7093. Cited by: §I, TABLE I, §II, TABLE VII, TABLE VIII.
  • [20] Z. Liu, D. Luo, Y. Wang, L. Wang, Y. Tai, C. Wang, J. Li, F. Huang, and T. Lu (2020) TEINet: towards an efficient architecture for video recognition.. In AAAI, pp. 11669–11676. Cited by: §I, TABLE I, §II, TABLE VII.
  • [21] C. Luo and A. L. Yuille (2019) Grouped spatial-temporal aggregation for efficient action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5512–5521. Cited by: TABLE I, §III-C.
  • [22] F. Mahdisoltani, G. Berger, W. Gharbieh, D. Fleet, and R. Memisevic (2018) On the effectiveness of task granularity for transfer learning. arXiv preprint arXiv:1804.09235. Cited by: §I, §IV-A.
  • [23] Y. H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, and G. Toderici (2015) Beyond short snippets: deep networks for video classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II.
  • [24] Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541. Cited by: §I, §II, TABLE VIII.
  • [25] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §IV-D.
  • [26] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pp. 568–576. Cited by: §II.
  • [27] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §I.
  • [28] S. Sudhakaran, S. Escalera, and O. Lanz (2020) Gate-shift networks for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1102–1111. Cited by: §I, TABLE I, §II, §III-C, TABLE V.
  • [29] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §I.
  • [30] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: §I, §II, TABLE VIII.
  • [31] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In CVPR, pp. 6450–6459. Cited by: §I, §II, TABLE VII.
  • [32] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2017) Graph attention networks. Cited by: §II, §IV-B.
  • [33] H. Wang, D. Tran, L. Torresani, and M. Feiszli (2020) Video modeling with correlation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 352–361. Cited by: §I.
  • [34] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In European conference on computer vision, pp. 20–36. Cited by: TABLE I, TABLE VII, TABLE VIII.
  • [35] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803. Cited by: §II, TABLE VII.
  • [36] X. Wang and A. Gupta (2018) Videos as space-time region graphs. In Proceedings of the European conference on computer vision (ECCV), pp. 399–417. Cited by: §II.
  • [37] J. Weng, D. Luo, Y. Wang, Y. Tai, C. Wang, J. Li, F. Huang, X. Jiang, and J. Yuan (2020) Temporal distinct representation learning for action recognition. In European Conference on Computer Vision, pp. 363–378. Cited by: §I.
  • [38] C. Wu, X. Wu, and J. Kittler (2019) Spatial residual layer and dense connection block enhanced spatial temporal graph convolutional network for skeleton-based action recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §I.
  • [39] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 305–321. Cited by: §I, TABLE I, §II, TABLE VII, TABLE VIII.
  • [40] C. Yang, Y. Xu, J. Shi, B. Dai, and B. Zhou (2020) Temporal pyramid network for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 591–600. Cited by: §I, §II.
  • [41] J. Zhang, F. Shen, X. Xu, and H. T. Shen (2020) Temporal reasoning graph for activity recognition. IEEE Transactions on Image Processing 29, pp. 5491–5506. Cited by: §II.
  • [42] S. Zhang, S. Guo, W. Huang, M. R. Scott, and L. Wang (2020) V4D: 4d convolutional neural networks for video-level representation learning. In ICLR 2020. Cited by: TABLE I, §II.
  • [43] B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 803–818. Cited by: §I, TABLE I, §II.
  • [44] X. Zhu, C. Xu, L. Hui, C. Lu, and D. Tao (2019) Approximated bilinear modules for temporal modeling. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3494–3503. Cited by: §I.
  • [45] M. Zolfaghari, K. Singh, and T. Brox (2018) ECO: efficient convolutional network for online video understanding. European Conference on Computer Vision. Cited by: TABLE I, §II.