As one of the important video analysis tasks, activity recognition has attracted significant attention from the computer vision community. Thanks to the effectiveness of deep learning techniques and larger datasets, action recognition performance has improved remarkably. Two kinds of architectures are widely adopted: (1) two-dimensional convolutional neural networks (2D ConvNets, 2D CNNs) for capturing frame-level features; (2) three-dimensional convolutional neural networks (3D ConvNets, 3D CNNs) [1, 19] or recurrent neural networks (RNNs) for modeling temporal context information. A natural way to further boost recognition performance on top of these architectures is to apply a two-stream framework with heterogeneous inputs. However, those approaches cannot capture the underlying structure of an activity, which is characterized by transformations and temporal relations rather than by the appearance of certain entities. Therefore, they are inefficient for activity recognition, because an activity is characterized by the temporal evolution of appearance governed by motion. Consider the three kinds of frequently occurring daily activities shown in Figure 1. Those activities cannot be recognized without reasoning about short-term or long-term temporal relations. An ordinary activity typically consists of several temporal relations at multiple time scales. As shown in Figure 1 (a) and (b), the activity “throwing” contains short-term relations such as “pillow throwing” and “falling”, and long-term relations such as “keeping it falling” or “catching it”; that is, an activity is often associated with specific spatial patterns as well as a multi-scale temporal structure.
Hence, the ability to accurately capture the relevant relations within a temporal sequence and to perform temporal reasoning is crucial for understanding activity in video. A model is expected to discover the temporal semantics of a video beyond the appearance of objects in individual frames, over both short-range and long-range temporal dependencies. Exploiting such relations therefore requires a deep model with the capacity to handle temporal semantics for activity understanding. Several works [41, 53] have attempted to explore the relations within video sequences. They apply a multi-layer perceptron (MLP), which may have limited capacity for learning the similarity of sequence objects, to investigate the relations between them. Additionally, they can only explore short-range temporal dependencies and cannot capture the transformable temporal states over time.
In this paper, we propose a temporal reasoning graph (TRG) network that directly performs temporal state relation reasoning in a graph, in order to capture the transformable temporal states and the action dependencies at different time scales. Notably, graph convolution networks (GCNs) [48, 55] are powerful tools for analyzing relations, but they typically assume that the adjacency relations between objects are already identified. Our proposed TRG approach tackles the key problem of how to represent the connection structure of the temporal graph. Inspired by the temporal relation network, which defines the pairwise temporal relation as a composite function and uses an MLP to capture the semantic dependency, the proposed TRG approach represents each pairwise relation as an element of a learnable adjacency matrix to resolve the ambiguous temporal state relations of a video sequence. Additionally, our approach constructs multi-head temporal relation graphs, which discover temporal relations over multiple scales based on the adjacency matrix. Specifically, rather than convolving solely over the temporal sequence in consecutive order, TRG defines specific relation graphs to explore the temporal relations of an activity over both short-term and long-term dependencies. Furthermore, the temporal instances in the graph can be related to each other in multiple ways, such as causality or adjacency. Therefore, a multi-head temporal adjacency matrix is devised to investigate such relations. Subsequently, temporal relation reasoning can be performed to deduce multiple kinds of relations for an activity. To further exploit the temporal relations in the reasoning graph and investigate their semantic meaning, TRG uses a relation aggregator to fuse these multiple kinds of relations. The whole activity recognition procedure of our proposed TRG approach is illustrated in Figure 2. Specifically, it comprises ConvNets for spatial feature extraction from the sampled video frames, the proposed TRG for temporal relation reasoning, and a final fully connected layer for activity classification. Note that our later experiments show that TRG can be flexibly plugged into any stage of off-the-shelf architectures, e.g., ResNet and Inception.
To further enhance spatial-temporal feature learning for activity recognition, we concatenate the spatial features extracted by ConvNets with the semantic temporal relation features. In this way, we can extract the spatial features of each frame and capture long-range temporal relations in videos. We conduct extensive experiments on three large-scale datasets: Something-Something V1, V2 and Charades. All three datasets are extremely challenging, even for a human, because the activity cannot be inferred from only the objects or background in a frame. The experiments on these datasets demonstrate significant improvements over state-of-the-art approaches and the importance of temporal reasoning in activity recognition.
Our major contributions are summarized as follows:
We propose a novel temporal reasoning graph module for activity recognition, which is, to the best of our knowledge, the first attempt at temporal relation reasoning with a graph.
We construct a multi-head graph representation for reasoning about multiple kinds of temporal relations in a video sequence, with varying span and scale between the sequence elements of a long-range video.
A semantic aggregator is developed to learn the importance of each sequence state in the different graphs and to fuse the features of the multiple kinds of temporal relations.
The rest of this paper is organized as follows. Section 2 briefly reviews related work on activity recognition, visual relation reasoning, and graph convolution networks. Section 3 describes the proposed temporal reasoning graph (TRG) in detail. Section 4 presents the experimental results and further analysis of the model. Finally, Section 5 concludes the paper.
2 Related Work
Activity Recognition. Activity recognition has been widely studied in recent years. An active line of research devoted to deep networks for video representation learning has been trying to devise effective CNN architectures [14, 35, 34, 35, 6]. Karpathy et al. attempted to design a deep network that stacks CNN-based frame-level features of a fixed size and then applies spatiotemporal convolutions to learn video-level features. However, the results were not satisfying, which implies that CNNs have difficulty capturing the motion information of a video. Later, many works in this genre leveraged CNNs trained on frames to extract low-level features and then performed high-level temporal integration of those features using pooling [40, 39], high-dimensional feature encoding [7, 5], or recurrent neural networks [6, 46, 35, 50]. To explore the long-term temporal relationships of a video and learn a more robust representation, convolutional neural network and long short-term memory (CNN-LSTM) frameworks [6, 46] were recently applied, stacking an LSTM network to connect frame-level representations. They have yielded an improvement in modeling the temporal dynamics of convolutional features in videos. However, this genre, which uses CNNs as the encoder and an RNN as the decoder of the video, loses the low-level temporal context that is essential for action recognition. These works imply the importance of temporal information for action recognition and the inability of CNNs to capture such information. To exploit temporal information, some studies resorted to 3D convolution kernels. Another efficient way to extract temporal features is to precompute the optical flow using traditional optical flow estimation methods and train a separate CNN to encode it, which sidesteps temporal modeling but is effective for motion feature extraction. Several important issues remain with existing CNNs for action recognition: 1) CNNs have limited capacity for learning long temporal dependencies, and 2) it is difficult for CNNs to capture temporal transformations with complex physical properties. To address these issues, we propose an efficient unit that applies graph convolution networks to learn temporal relations, which is much more efficient than convolving over dense frames. Meanwhile, the model can construct a temporal graph to represent temporal relations. It flexibly incorporates temporal reasoning and spatial transformation into existing architectures.
Visual Relation Reasoning. Reasoning about the relations between instances over time in a video is critical for activity recognition. In addition, modeling relations between visual objects has become a popular problem in computer vision [51, 18, 47, 45]. The most straightforward visual reasoning task is object relation reasoning on the CLEVR benchmark, and significant effort has been devoted to a variety of traditional visual tasks with pairwise relationship reasoning. In the face clustering problem, several works attempted to model pairwise relationships for face graph generation [45, 49]. To reason between different instances in visual question answering, Xiong et al. proposed a graph matching module for investigating such relations. In skeleton-based action recognition, several works showed that modeling the interactions of skeleton joints can achieve excellent performance [26, 25, 48]. Here, we show that explicitly exploiting the various multi-scale temporal relations in videos can boost activity recognition accuracy.
Graph Convolution Networks (GCNs). GCNs are a powerful tool for modeling the relations between graph instances. Spatial-based GCNs, which generalize CNNs to graphs and perform manually defined convolutions on the graph, are good at dealing with graph-structured data. Owing to their convincing performance and high interpretability in modeling object relationships, GCNs have been widely applied in computer vision tasks that need to explore the relations between visual instances. In terms of applications, existing works have achieved considerable performance improvements by using GCNs in traditional computer vision tasks, for example, skeleton-based action recognition [48, 25], link prediction [45, 49], semi-supervised classification, hashing [56, 22, 23], person re-identification, and multi-label image recognition.
In our work, we exploit GCNs built by stacking multiple graph convolution layers with a multi-head adjacency matrix to capture temporal relations at multiple time scales. Moreover, our proposed method explicitly models temporal interactions by building a temporal reasoning graph, which can be flexibly injected into an existing backbone. We show that the temporal reasoning graph can efficiently reason over temporal semantic instances to improve activity classification accuracy.
3 Proposed Method
In this section, we describe the framework of our proposed architecture shown in Figure 2, with a detailed account of how we build the temporal graph to investigate temporal relations. First, the problem is defined. Second, the construction of the temporal relation graph is described. Third, we detail how convolution is implemented over the temporal graph. Fourth, we present how the semantics of the multi-head temporal graph constructed in Section 3.2 are aggregated. Finally, we introduce how the spatiotemporal features are modeled and the activity is classified.
3.1 Problem Definition
Formally, we extract a sequence of features from a video, whose length is the number of sampled time steps. Unless otherwise specified, each element of the sequence is a frame feature map with a channel, a height, and a width dimension. We apply a graph convolution neural network to explore the relations within the temporal sequence for activity recognition. We define a graph over the temporal sequence whose nodes are the video frame objects and whose edges are the temporal relations between them. Each node has a neighbor set, and nodes at adjacent time steps are connected by a temporal edge. The time series of a video can be easily obtained, but the semantic meaning of the edges between nodes remains ambiguous. Attention mechanisms have proven to be a powerful tool in deep learning. We exploit the attention mechanism to investigate the semantic meaning of the edges. More details on how we mine temporal relations are given in Section 3.2.
3.2 Temporal Relation Graph Construction
In our work, we apply the temporal relation graph to form a hierarchical representation of the video sequence. To construct it, we define a pairwise temporal relation that measures the similarity of temporal features. The temporal relation function applies a similarity function to a pair of frame features. This similarity function can take several forms, typically (1) sum, (2) dot product, or (3) bilinear. More specifically, two sequence objects, one of which can be predicted from the other, have a close connection and a high-confidence edge.
In the temporal relation graph, we connect pairs of semantically related frame features. To obtain sufficient expressive power to transform the input frame features into higher-level features, at least one transformation function that projects the feature sequence into a space suitable for similarity measurement is required. Thus, the new temporal relation function applies the similarity function to the transformed frame features, where the parameter of the feature transformation function is learned via backpropagation. That is, a shared transformation, parameterized by a weight variable, is applied to every frame feature. With this transformation, we can learn the adjacency matrix, which represents the correlations between the temporal features of different frames in the temporal graph, at each feedforward step.
After computing the correlation coefficients of the adjacency matrix, we normalize each row so that the coefficients are easily comparable across different temporal features. We adopt the softmax function for this normalization, and the normalized coefficients form the learnable adjacency matrix. To update the similarity measure of the temporal relations at each step, we implement the adjacency matrix learning block as a single-layer feedforward neural network. An example of how an element is calculated in the adjacency matrix learning block is illustrated in Figure 3.
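As a rough illustration of this construction, the sketch below builds a row-normalized relation matrix from a handful of toy frame features. It assumes that each frame feature has already been flattened to a vector, that the shared transformation is a plain linear map, and that the similarity is the dot product (one of the forms listed above); the function names `transform` and `temporal_adjacency` are hypothetical and not taken from the paper.

```python
import math


def transform(x, w):
    """Apply a shared linear map w (rows = output dims) to one frame feature x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def dot(u, v):
    """Dot-product similarity between two transformed frame features."""
    return sum(a * b for a, b in zip(u, v))


def softmax(row):
    """Normalize one row of relation scores so that it sums to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_adjacency(frames, w):
    """Return a T x T row-normalized relation matrix over T frame features."""
    projected = [transform(x, w) for x in frames]
    scores = [[dot(p, q) for q in projected] for p in projected]
    return [softmax(row) for row in scores]


# Toy data (assumed): three frames with 2-dimensional features.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]  # identity transform, only for the demo
adjacency = temporal_adjacency(frames, w)
```

In this toy case the first frame is linked more strongly to the similar second frame than to the dissimilar third one, which is the behavior the normalized coefficients are meant to express. In a trained model, `w` would be learned rather than fixed.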
Note that the temporal neighbors of each node may be related in several ways, and these relations differ in importance when learning temporal features. To obtain sufficient representational power to capture the underlying relations of the temporal sequence, we design a mechanism, named the multi-head adjacency matrix, to explore different semantic attributes.
Multi-head adjacency matrix. Inspired by the transformer and graph attention networks, we construct the multi-head adjacency matrix to strengthen temporal relation modeling and to stabilize the learning of the correlation coefficients between sequence objects. Specifically, several independent coefficient-updating procedures each execute the temporal relation function of Equation 3, one per head. The resulting matrices together form the set of temporal graph adjacency matrices of the video.
3.3 Temporal Graph Convolution Neural Network
Once the temporal graphs are built, we can exploit graph convolution for temporal relation reasoning. Temporal graph convolution takes a video sequence as input, performs computations over the sequence, and returns a new sequence; that is, graph convolution allows us to compute the response of a node based on its neighbors as defined by our multi-head graph. For a specific target object in the video sequence, it aggregates features from all neighboring objects according to the edge weights defined above.
To explore temporal reasoning on the temporal relation graph, we apply graph convolution networks to process the frame features. The outputs of the GCNs are the updated features of each frame node, which can be aggregated for video classification. With the multi-head adjacency matrix updating mechanism, the next-state sequence feature of each head is the output semantic feature map obtained by reasoning through the graph: the head's adjacency matrix is applied to the input feature sequence, followed by a transformation matrix, implemented as a standard convolution in the feature domain, and a nonlinear activation function. Note that the filter weights in each graph are shared everywhere on the feature sequence, because they are independent of the location within the sequence. After constructing the multi-head temporal relation adjacency matrix and applying graph convolution, we obtain one kind of sequence state feature per head.
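The per-head update described here can be sketched as follows. The sketch assumes the common graph convolution form, activation(adjacency × features × weight), with ReLU as the activation and with each frame feature flattened to a vector; the paper implements the transformation as a convolution, so the dense matrices and the names `graph_conv` and `multi_head_graph_conv` are simplifying assumptions.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(a))]


def relu(m):
    """Element-wise ReLU, used here as the nonlinear activation."""
    return [[max(0.0, v) for v in row] for row in m]


def graph_conv(adjacency, features, weight):
    """One head: mix frames with the adjacency, transform channels, activate."""
    return relu(matmul(matmul(adjacency, features), weight))


def multi_head_graph_conv(adjacencies, features, weights):
    """Apply one graph convolution per head and return all head outputs."""
    return [graph_conv(a, features, w) for a, w in zip(adjacencies, weights)]


# Toy data (assumed): T = 2 frames, C = 2 channels, K = 2 heads.
features = [[1.0, 2.0], [3.0, 4.0]]
adj_heads = [
    [[1.0, 0.0], [0.0, 1.0]],  # head 1: every frame attends only to itself
    [[0.5, 0.5], [0.5, 0.5]],  # head 2: every frame averages both frames
]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # identity transforms, only for the demo
    [[1.0, 0.0], [0.0, 1.0]],
]
head_states = multi_head_graph_conv(adj_heads, features, weights)
```

With identity weights, the first head returns the input unchanged while the second returns the frame average in every row, which shows how different adjacency matrices yield different sequence state features from the same input.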
3.4 Multi-head Temporal Relation Aggregator
The multi-head adjacency matrix is exploited to discover multiple kinds of relations between the sequence nodes. The main reason for designing the multi-head temporal relation aggregator is to reflect the different importance of these semantic relations. In short, the aggregator exploits a mechanism similar to self-attention for temporal graph pooling.
Generally, every sequence instance in a heterogeneous graph contains multiple types of semantic information; that is, the semantic-specific sequence features extracted from one graph reflect temporal relations from only one aspect. To learn a more comprehensive video feature, we need to fuse the multiple semantics revealed by the multi-head adjacency matrix. To address this semantic fusion problem in a heterogeneous graph, we propose a multi-head temporal relation aggregator that automatically learns the importance of each sequence state in the different graphs and fuses them. The operation of the temporal relation aggregator is illustrated in Figure 4. To investigate the importance of the sequence states updated under the different temporal graph adjacency matrices, we define an aggregator function that learns this importance automatically.
To implement the aggregator function, inspired by squeeze-and-excitation networks, we first use global pooling to extract the global semantic meaning of each state in each graph. The importance of each sequence state is then computed from its pooled descriptor. The weight of each state is obtained by normalizing these importance coefficients across all temporal graphs with the softmax function. With the learned weights as coefficients, we fuse the semantic-specific temporal features to obtain the final feature.
Finally, the video sequence features output by the temporal graph convolution are aggregated. With the designed aggregator, the processed video sequence features carry the semantic meaning of the multi-head temporal graph. For simplicity, the temporal graph convolution and the multi-head temporal relation aggregator can be summarized as a single operation that takes one of the multi-head adjacency graphs, the input feature map of all the sequence elements in the graph, and the spatial transformation matrix. For computational simplicity, the transformation is implemented with a convolution kernel, so it is the weight matrix of that layer. Note that the transformation matrix in each graph is shared everywhere on the feature sequence because it is independent of the location within the sequence. The final output feature map is still a tensor of the same shape as the input feature map. The temporal graph convolution operation and the temporal relation aggregator can be stacked into multiple layers.
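The pooling, weighting, and fusion steps of the aggregator can be sketched as below. This is only a simplified stand-in: the importance score is taken to be the pooled value itself, whereas the paper learns it, and the fusion is assumed to be a weighted sum across heads. The function name `aggregate_heads` and the nested-list layout are hypothetical.

```python
import math


def aggregate_heads(head_features):
    """Fuse K head outputs into one sequence.

    head_features[k][t] is the feature vector of frame t under head k.
    """
    num_heads = len(head_features)
    num_frames = len(head_features[0])
    fused = []
    for t in range(num_frames):
        # Global pooling: reduce each head's feature for frame t to one scalar.
        pooled = [sum(head_features[k][t]) / len(head_features[k][t])
                  for k in range(num_heads)]
        # Assumption: the pooled scalar is used directly as the importance score.
        m = max(pooled)
        exps = [math.exp(p - m) for p in pooled]
        total = sum(exps)
        head_weights = [e / total for e in exps]  # softmax across heads
        # Weighted sum of the head features for this frame.
        dim = len(head_features[0][t])
        fused.append([sum(head_weights[k] * head_features[k][t][c]
                          for k in range(num_heads))
                      for c in range(dim)])
    return fused


# Toy data (assumed): K = 2 heads, T = 1 frame, C = 2 channels.
toy_heads = [[[1.0, 1.0]], [[3.0, 3.0]]]
fused_sequence = aggregate_heads(toy_heads)
```

Because the weights come from a softmax, the fused feature is a convex combination of the head features, so its shape matches that of a single head and the block can be stacked.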
3.5 Spatial and Temporal Modeling
ConvNets are effective at extracting spatial features, but they have a poor capacity for understanding the relations and importance of those spatial units in an image or a sequence for classification. As described above, the temporal graph convolution is designed to extract such sequence relations from the spatial features. We therefore apply a residual connection to fuse the spatial features extracted by ConvNets with the temporal relation features extracted by temporal graph convolution for activity recognition.
Once the spatial features of the video sequence have been extracted by ConvNets and the temporal relation features by temporal graph convolution, we model the spatial-temporal features by connecting the two. They are combined with a residual concatenation operation to give the concatenated spatial-temporal features, which are followed by a non-linearity, typically ReLU. The spatial and temporal modeling procedure is summarized in Algorithm 1.
3.6 Activity Classification with Temporal Reasoning Graph
The temporal reasoning graph provides a solid unit that works with existing architectures. As illustrated in Figure 2, the input is processed by spatial transformation and temporal graph reasoning to form the video representation, which is then fed into a classifier to generate the activity label. The whole framework can thus be trained in an end-to-end manner. For single-label activity recognition tasks, we apply the cross-entropy loss during training, while the binary sigmoid loss is used for multi-label activity recognition tasks. Both losses are computed from the confidence score of each class, the ground-truth label, and the number of classes. In the multi-label case, the sigmoid function is applied to each class score, and the target for a class is 1 when that class is a ground-truth label of the sample and 0 otherwise.
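For reference, the two losses named here can be written in their standard textbook forms, as in the sketch below. It assumes that the classifier outputs raw scores (logits) and that the multi-label loss is summed over classes; the exact normalization used in the paper is not recoverable from the text, and the function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Single-label loss: negative log-softmax of the ground-truth class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def binary_sigmoid_loss(logits, targets):
    """Multi-label loss: binary cross-entropy on sigmoid scores, summed over classes."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid of the class score
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Toy scores (assumed): three classes.
single_label_loss = cross_entropy([2.0, 0.5, -1.0], target=0)
multi_label_loss = binary_sigmoid_loss([2.0, 0.5, -1.0], targets=[1, 0, 1])
```

The single-label loss pushes one class score above all the others, while the multi-label loss treats every class as an independent yes/no decision, which is why it suits Charades, where a clip can carry several action labels.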
4 Experiments
To evaluate the effectiveness of our temporal reasoning graph, we perform extensive experiments on three benchmark datasets for activity recognition: Something-Something V1, V2 and Charades. We first introduce the datasets and implementation details. Then, we compare our method with existing methods. Afterward, we conduct ablation studies to examine the effect of each component and factor. In addition, we further analyze the temporal scaling strategy, where to place our temporal reasoning graph, and the recognition performance with different backbones. Finally, we visualize the class activation maps (CAM) of the intermediate features and apply t-SNE to visualize the distribution of the feature representations learned by our model.
4.1 Datasets Description
Something-Something-V1: The dataset is a large collection of densely labeled video clips that allow machine learning models to develop a fine-grained understanding of basic actions. It contains 108,499 short video clips across 174 labels, which are simple template-based textual descriptions; 86,017 are training videos, 11,522 are validation videos, and 10,960 are testing videos. Each video has a duration ranging from 2 to 6 seconds. Inference on the videos in this dataset requires features that can represent the physical relations between objects.
Something-Something-V2: The dataset has twice as many videos as V1, collected from workers performing pre-defined basic actions with everyday objects. Each video in the dataset comes with object annotations in addition to the video label and has a duration ranging from 2 to 6 seconds. In total, it has 318,572 annotations involving 30,408 unique objects and contains 220,847 videos across 174 classes, of which 168,913 are training videos, 24,777 are validation videos, and 27,157 are testing videos; that is, it is split into training, validation, and test sets in a ratio of roughly 8:1:1.
Charades: The dataset is composed of 9,848 videos of daily indoor activities with an average length of 30 seconds, involving interactions with 46 object classes in 15 types of indoor scenes and a vocabulary of 30 verbs, leading to 157 action classes. Each video in this dataset is annotated with multiple free-text descriptions, action labels, action intervals, and the classes of the interacting objects. Each of 267 different users was presented with a sentence that includes objects and actions from a fixed vocabulary and recorded a video acting out the sentence. In total, the dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. Following the standard split, it has 7,985 training videos and 1,863 validation videos.
These datasets provide us with a large number of samples to investigate our model in video activity understanding and commonsense reasoning about daily human activities.
4.2 Implementation Detail
Training. We adopt the training strategy proposed in TSN, including data augmentation and partial batch normalization. The frames sampled from a video are first input to 2D or 3D ConvNets for spatial feature extraction. Then we apply the temporal reasoning graph to learn high-level semantics from the sequence of spatial features. For a fair comparison, all the 2D and 3D spatial feature extraction processes follow [42, 43, 44].
We build the networks with the PyTorch framework, and all networks are trained on two GeForce GTX Titan X GPUs with a total of 24 GB of memory. All input images are resized to 224 × 224 for the Inception and ResNet backbones and to 299 × 299 for the Inception-V3 backbone, following an existing dataset processing strategy. The network is trained with a mini-batch stochastic gradient descent optimizer. The initial learning rate is 0.001 and is reduced by a factor of 10 after 50 epochs. The optimizer uses weight decay and a momentum of 0.9 to update the network parameters. The whole training procedure takes 100 epochs. For Something-Something V2, the number of epochs is halved because the duration of its videos is shorter. All the learnable parameters in the temporal graph are implemented by a convolution layer followed by a batch normalization layer and a ReLU layer.
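The learning-rate schedule stated above can be sketched as a small function. It assumes 0-indexed epochs and a step decay that divides the rate by 10 every 50 epochs, which matches the description over the 100-epoch run; the function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, step=50, factor=10.0):
    """Step decay: divide the base rate by `factor` once per `step` epochs."""
    return base_lr / (factor ** (epoch // step))


# Rates for the first and second half of a 100-epoch run (epochs are 0-indexed).
first_half_lr = learning_rate(0)
second_half_lr = learning_rate(50)
```

In a PyTorch training loop, a step scheduler with the same step size and decay factor would play this role; the plain function is shown only to make the schedule explicit.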
Test. When comparing with previous state-of-the-art models, we use the same number of frames for both training and testing to make a direct comparison. Following common practice, inference is conducted by sampling 10 clips for Charades and 2 clips for Something-Something from each video along its temporal axis, and the prediction scores are averaged over all clips. We also rescale the shorter side of each frame to 256 pixels while maintaining the aspect ratio.
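The clip-level averaging used at test time amounts to a per-class mean over the sampled clips, as in this minimal sketch. The function name `average_clip_scores` and the toy scores are assumptions for illustration.

```python
def average_clip_scores(clip_scores):
    """Average per-class prediction scores over all clips of one video.

    clip_scores is a list of clips, each a list with one score per class.
    """
    num_clips = len(clip_scores)
    return [sum(class_scores) / num_clips for class_scores in zip(*clip_scores)]


# Toy scores (assumed): 2 clips of one video, 2 classes.
video_scores = average_clip_scores([[0.2, 0.8], [0.4, 0.6]])
predicted_class = max(range(len(video_scores)), key=lambda c: video_scores[c])
```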
4.3 Evaluation Protocols
Every video in the Something-Something-V1 and V2 datasets is assigned to a single class, and the distribution of classes over the test set is almost uniform. Each video in the test set has one ground-truth label. Following [9, 44, 53], we apply top-1 and top-5 precision to evaluate the performance of the methods on these datasets. For each video, a method produces a list of at most 5 action class labels in descending order of confidence. The top-k precision is the fraction of test videos whose ground-truth label appears among the first k predicted labels. We report this precision with k set to 1 and to 5.
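A minimal sketch of this metric follows: a video counts as correct when its ground-truth label appears among the first k entries of its ranked prediction list. The function name `top_k_precision` and the toy labels are assumptions for illustration.

```python
def top_k_precision(ranked_predictions, labels, k):
    """Fraction of videos whose true label is among the first k predictions.

    ranked_predictions[i] lists class labels for video i, most confident first.
    """
    hits = sum(1 for preds, label in zip(ranked_predictions, labels)
               if label in preds[:k])
    return hits / len(labels)


# Toy predictions (assumed): two videos, three candidate classes.
ranked = [["a", "b", "c"], ["b", "a", "c"]]
truth = ["a", "c"]
top1 = top_k_precision(ranked, truth, k=1)
top3 = top_k_precision(ranked, truth, k=3)
```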
As Charades is a multi-label action classification dataset, following [29, 44, 53], we use the standard mean average precision (mAP) to measure the performance of the different methods. Here, mAP is defined as the mean of the average precision (AP) over all classes. For one class, AP sums, over every rank in the ranked list of test samples, the precision of the top-ranked samples up to that rank multiplied by an indicator function that equals 1 if the item at that rank is a positive sample and 0 otherwise, and divides the sum by the number of positive samples in the test set.
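The AP and mAP definitions can be made concrete with the short sketch below. It assumes the standard non-interpolated form of AP and returns 0 for a class with no positive samples, a convention chosen here for the demo; the function names and toy scores are hypothetical.

```python
def average_precision(scores, positives):
    """Non-interpolated AP for one class.

    scores[i] is the confidence for sample i; positives[i] is 1 or 0.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    num_positive = sum(positives)
    if num_positive == 0:
        return 0.0  # convention assumed for classes without positives
    hits = 0
    total = 0.0
    for rank, index in enumerate(order, start=1):
        if positives[index]:
            hits += 1
            total += hits / rank  # precision of the top `rank` samples
    return total / num_positive


def mean_average_precision(per_class):
    """Mean of AP over classes; per_class is a list of (scores, positives)."""
    return sum(average_precision(s, p) for s, p in per_class) / len(per_class)


# Toy data (assumed): two classes scored over three test samples.
ap_first = average_precision([0.9, 0.8, 0.7], [1, 0, 1])
map_value = mean_average_precision([
    ([0.9, 0.8, 0.7], [1, 0, 1]),
    ([0.1, 0.6, 0.3], [0, 1, 0]),
])
```

In the first toy class the positives sit at ranks 1 and 3, so AP is the mean of 1/1 and 2/3; the second class ranks its only positive first and reaches an AP of 1.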
4.4 Compared Methods
We compare the performance of our TRG with the following state-of-the-art methods:
MultiScale TRN (ECCV 2018): An improved temporal aggregation method that applies multiple fully connected layers to reason about the relations between temporal features extracted by 2D CNNs.
I3D (CVPR 2017): The method inflates 2D convolution filters into 3D convolution filters, which can be transferred to many existing architectures.
NL (CVPR 2018): The non-local block can be plugged into an existing backbone to capture long-term dependencies.
GCNs (ECCV 2018): The model first applies Faster R-CNN to extract object features in the video and then uses GCNs to reason about the relations between those semantic objects.
ECO (ECCV 2018): An end-to-end architecture that places 3D CNNs on top of a 2D CNN architecture.
TrajectoryNet (NIPS 2018): Trajectory convolution, a replacement for temporal convolution, is proposed to integrate features along the temporal dimension.
2-stream (NIPS 2014): The method designs a spatial stream and a temporal stream for action recognition. The spatial stream extracts spatial features from a single frame of a video, whilst the temporal stream extracts temporal features from the pre-computed dense optical flow.
Asyn-TF (CVPR 2017): The model attempts to reason over various aspects of an activity, including objects, actions, and intentions.
4.5 Results and Discussion
Table I: Results on the Something-Something V1 and V2 validation sets.
|Methods||Backbone||# Frames||# Params||FLOPs||V1 Top-1 (%)||V1 Top-5 (%)||V2 Top-1 (%)||V2 Top-5 (%)|
|MultiScale TRN||Inception||8||18.3M||16.4G||34.4||-||48.8||-|
|I3D + GCNs||ResNet-50||32||55.1M||158G||43.3||75.1||-||-|
|NL I3D||ResNet-50||32||35.1M||168G||44.3||75.1||-||-|
|NL I3D + GCNs||ResNet-50||32||62.2M||303G||46.1||76.8||-||-|
Table II: Results on Charades.
|Methods||Backbone||# Frames||mAP (%)|
|2-Stream + LSTM||VGG16||1/20||17.8|
|MultiScale TRN||Inception||8||25.2|
|NL I3D||ResNet-101||32||37.5|
|NL I3D + GCNs||ResNet-50||32||37.5|
|I3D + GCNs||ResNet-101||32||39.1|
|NL I3D + GCNs||ResNet-101||32||39.7|
The results on Something-Something and Charades [28, 29] are shown in Tables I and II, respectively. “# Frames” denotes the number of frames used as input by the method, “# Params” denotes the number of model parameters, and “FLOPs” (floating-point operations) measures the computational complexity of the model. For a fair comparison, we conduct a series of experiments with different numbers of input frames and different backbones to evaluate our model. For all of these datasets, we use the standard evaluation protocol provided by the authors. Note that we only use video frames as input, without complementary information such as the hand-crafted IDT features or the optical flow field.
Table I shows all the results on the Something-Something V1 and V2 validation sets. Because these are fine-grained activity understanding datasets, in which deformation or motion serves as a crucial cue, recent methods design their models from different perspectives, but they do not take long-term temporal relations into account. Compared with MultiScale TRN, which also attempts to reason about the temporal states of a video sequence, our work is better by a large margin of 4.1% on Something-Something V1 and 2.5% on Something-Something V2 under the same experimental settings. The comparison with the work that also applies graph convolution, reasoning over the object proposals extracted by Faster R-CNN, is notable: we gain a 3.7% accuracy improvement over it. Since this approach considers object-level relations across frames, it first uses Faster R-CNN to extract object-level features, which may harm the model if the detector extracts irrelevant objects or misses some key objects. Additionally, its computation cost is higher than that of our method. Our margin may come from the multi-head adjacency matrix and the semantic relation aggregator. With the Inception-V3 architecture, our temporal reasoning graph widens the advantage over previous models considerably, bringing overall performance to 49.8% on Something-Something V1 and 61.3% on Something-Something V2. Compared with the most accurate method, TrajectoryNet, which convolves along trajectory paths, our method still yields an improvement. Although TrajectoryNet can achieve high performance with a relatively small backbone, it has to apply MotionNet to pre-compute trajectories, and its performance depends heavily on the quality of those trajectories. These results indicate that temporal reasoning with our graph model can capture semantic features for activity recognition.
In addition, the evaluation results on the Charades dataset, a multi-label activity recognition case, are shown in Table II. We report the mAP values of the compared methods on this dataset. Our temporal reasoning graph achieves recognition performance competitive with the non-local inflated 3D CNN (I3D) plus graph convolution model. Not only does our model achieve higher performance, but its complexity is also very low. Since we do not need to detect objects in the frames, our model has a faster inference speed, taking only 0.3 seconds per video (32 frames).
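To make the mAP metric concrete, the following is a minimal sketch of mean average precision for multi-label classification. It is a generic illustration, not the official Charades evaluation script; the function names and the convention of skipping classes without positives are assumptions made here.

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision measured at each positive sample,
    with samples ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """score_matrix[v][c] is the score of class c for video v and
    label_matrix[v][c] is 1 if class c occurs in video v.
    Classes with no positive video are skipped (an assumption of this sketch)."""
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        scores = [row[c] for row in score_matrix]
        labels = [row[c] for row in label_matrix]
        if any(labels):
            aps.append(average_precision(scores, labels))
    return sum(aps) / len(aps) if aps else 0.0
```

Because each video can carry several labels at once, the ranking is computed per class over all videos and the per-class scores are then averaged.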
Tables I and II also compare the influence of the architecture and the number of sampled frames. As the tables show, accuracy degrades with fewer frames, partly because important parts of the action may be missed with fewer samples. However, training with more frames has a higher computation cost. The performance of TRG, even with just 8 frames, is still much better than that of most approaches in the literature, since our model takes into account the relationships between these 8 instants in the video, even if they are far apart.
4.6 Further Analysis
In this subsection, we mainly explore the effectiveness of the temporal reasoning graph block. To analyze the TRG in depth, we evaluate the different factors that matter for action representation or are essential to our TRG.
Component Analysis of The Model. To study the contribution of different model parts, we also train ablated versions of our model separately, i.e., without temporal reasoning in the video context. We first evaluate the effect of using the multi-head adjacent matrix for graph convolution in the Inception, Inception-V3, and ResNet-50 (with the I3D technique) architectures, as demonstrated in Table I. In the table, “w/o temporal graph” means that the temporal graph is removed, whilst “temporal graph with” means that the temporal relation aggregator is replaced with a specific operation. First, without temporal graphs, we apply temporal average pooling over the feature maps to obtain three baselines. In addition, for the 2D CNN case, we also test plugging a 3D convolution layer at the end of each branch (i.e., inception3a to inception5b and Mixed5b to Mixed7c). Finally, to explore the effect of the multi-head temporal relation aggregator, we replace the aggregator with an element-wise average operation and with a feature concatenation operation along the sequence depth dimension. All the experiments here are conducted with only 8 input frames.
We observe that the reported baselines consistently underperform the proposed model. The improvement obtained with our model demonstrates that temporal reasoning over multiple kinds and multiple scales of relations between sequence instances helps to capture fine-grained actions and boosts recognition performance. We speculate that this is because the temporal graph thoroughly explores the semantic information of a video sequence and the temporal relation aggregator is able to select more informative relation features for activity recognition.
|Inception||AvgPool w/o temporal graph||37.3||68.9|
|Inception||3DConv w/o temporal graph||42.4||73.2|
|Inception||temporal graph with concatenation||48.5||77.4|
|Inception||temporal graph with element-wise Avg||49.3||78.2|
|Inception||TRG (full model)||51.3||78.8|
|Inception-V3||AvgPool w/o temporal graph||39.5||70.1|
|Inception-V3||3DConv w/o temporal graph||45.1||76.5|
|Inception-V3||temporal graph with concatenation||50.4||78.4|
|Inception-V3||temporal graph with element-wise Avg||51.2||79.2|
|Inception-V3||TRG (full model)||52.5||80.6|
|ResNet-50||AvgPool w/o temporal graph||46.7||74.1|
|ResNet-50||temporal graph with concatenation||50.6||77.6|
|ResNet-50||temporal graph with element-wise Avg||51.9||79.5|
|ResNet-50||TRG (full model)||53.8||81.2|
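The three ways of fusing the per-head graph outputs compared in the ablation can be sketched as follows. This is only an illustration under stated assumptions: each head is taken to produce a T x C feature list, concatenation is taken along the channel axis, and `head_logits` is a hypothetical stand-in for the importance scores that the learned aggregator would produce; none of these names come from the original implementation.

```python
import math


def elementwise_average(head_outputs):
    """Average the H head outputs position by position, giving T x C."""
    num_heads = len(head_outputs)
    steps = len(head_outputs[0])
    channels = len(head_outputs[0][0])
    return [
        [sum(head[t][c] for head in head_outputs) / num_heads for c in range(channels)]
        for t in range(steps)
    ]


def concatenate(head_outputs):
    """Concatenate the H head outputs along the channel axis, giving T x (H*C)."""
    steps = len(head_outputs[0])
    return [[value for head in head_outputs for value in head[t]] for t in range(steps)]


def weighted_aggregate(head_outputs, head_logits):
    """Softmax-weighted sum over heads. head_logits is a hypothetical placeholder
    for learned per-head importance scores."""
    peak = max(head_logits)
    exps = [math.exp(logit - peak) for logit in head_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    steps = len(head_outputs[0])
    channels = len(head_outputs[0][0])
    return [
        [sum(w * head[t][c] for w, head in zip(weights, head_outputs)) for c in range(channels)]
        for t in range(steps)
    ]
```

With equal logits the weighted variant reduces to the element-wise average, which is one way to see why a learned weighting can only match or refine the averaging baseline in this simplified view.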
Head Number of The Graph. To analyze the effect of the number of heads in the multi-head adjacent matrix on recognition performance, we perform a series of experiments on the Something-Something V2 and Charades datasets with different head numbers. Here, we sample only 8 frames per video. As shown in Figure 5, building a multi-head temporal adjacent matrix leads to a conspicuous gain compared with a single-head adjacent matrix. The model further boosts accuracy from 49.7% to 53.3% with the Inception backbone and from 50.2% to 54.7% with the Inception-V3 backbone on Something-Something V2, and improves the mAP from 27.2% to 32.3% with the Inception backbone and from 30.3% to 35.2% with the Inception-V3 backbone on the Charades dataset. The intuition behind this is that two temporal objects can have different kinds of relations. When we deduce an action from a specific video, some kinds of relations may be enhanced and others suppressed by the relation aggregator. Although the relation weights learned by different heads encode different semantic information, we also observe that recognition performance saturates as the number of heads keeps increasing. This indicates that accumulating ever more encoded semantic information contributes little and may even confuse relational reasoning.
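The idea of several heads, each learning its own relation between frames, can be sketched as below. This is a simplified illustration, not the authors' implementation: the head-specific projection matrices in `head_weights`, the dot-product similarity, and the row-wise softmax normalization are all assumptions chosen to keep the example small.

```python
import math


def project(features, weight):
    """Project each of the T feature rows (length C) with a C x D weight matrix."""
    return [
        [sum(x * weight[c][d] for c, x in enumerate(row)) for d in range(len(weight[0]))]
        for row in features
    ]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def head_adjacency(features, weight):
    """T x T adjacent matrix for one head: row-normalized pairwise similarity
    of the projected frame features."""
    projected = project(features, weight)
    sims = [[sum(a * b for a, b in zip(pi, pj)) for pj in projected] for pi in projected]
    return [softmax(row) for row in sims]


def multi_head_adjacency(features, head_weights):
    """One adjacent matrix per head; head_weights holds hypothetical learned projections."""
    return [head_adjacency(features, weight) for weight in head_weights]


def propagate(adjacency, features):
    """One message-passing step: every frame becomes a weighted mix of all frames."""
    return [
        [sum(adjacency[i][j] * features[j][c] for j in range(len(features)))
         for c in range(len(features[0]))]
        for i in range(len(adjacency))
    ]
```

Because every head uses a different projection, the same pair of frames can receive a strong edge in one head and a weak one in another, which matches the intuition that two temporal objects may be related in more than one way.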
Place to Inject Temporal Reasoning Graph. To figure out how the temporal reasoning graph helps video context interaction, we conduct experiments that plug temporal reasoning graphs into different places of the CNN backbone. In this part, we also demonstrate how to plug the temporal reasoning graph into a 2D or 3D CNN backbone.
Here, we use vanilla Inception-V3 as the backbone for the 2D CNN case and ResNet-50 with the I3D technique as the backbone for the 3D CNN case. Theoretically, the temporal reasoning graph described above can be plugged in at any level of the network to enhance temporal sequence interaction. In detail, we study the effect of the place at which the temporal reasoning graph is built. We push three graphs into the top or the bottom of the multi-branch convolution network. More specifically, for Inception-V3 we consider injecting the temporal reasoning graph into the output layers from mixed_5b to mixed_5d or from mixed_6a to mixed_6c at the bottom level. As a comparison, we consider processing the outputs from mixed_7a to mixed_7c with our temporal reasoning graph at the top level. Similar processing is applied to ResNet-50 from res2_2 to res2_4, res3_2 to res3_4, or res5_2 to res5_4.
Results for the 8-frame input case are shown in Figure 6; performance is higher with top-level processing. We speculate that the output feature maps of late ConvNet layers contain abundant semantic information about the spatial objects. With temporal reasoning over that semantic information, the global context that benefits activity representation is well modeled for recognition.
Study on Temporal Scale. Videos have a variable number of frames, and the activity in a video lasts from seconds to minutes in different datasets. To cope with action contextual information that ranges from seconds to minutes, we analyze the impact of different temporal strides and temporal spans when sampling from the original video. All the experiments were conducted with the Inception and ResNet-50 architectures, the latter being a 3D model processed in the same way as non-local networks.
The uniformly sparse, global sampling strategy is used to select frames that cover the entire video. Temporal pooling models, except in the 3D CNN case, process frames independently and aggregate their scores only at the end. Consequently, their performance stays almost the same when the number of samples changes, which demonstrates that a sparse sampling strategy alone does not really help to learn long-range temporal context. In contrast, our model pays more attention to the temporal interactions within the video sequence, and its performance indicates that it really benefits from long-term temporal reasoning.
By contrast, we also apply a dense sampling strategy with a fixed temporal stride of 4, which means that this strategy covers a temporal range of at most 5 seconds in our experiments on the Charades dataset (24 FPS). Uniformly sparse sampling with a large temporal stride in fact hurts the performance of the 3D CNN model. Temporal convolution kernels might not be suitable for long-range patterns, because long-range contextual patterns are more diverse and changeable, and include challenging scene cuts. On the other hand, a large temporal stride combined with the plugged-in temporal reasoning graphs steadily improves performance.
Temporal reasoning thoroughly exploits the information from the entire video or from a clip of several seconds, since our approach models the relations among changing patterns in a video sequence over both short and long ranges.
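The difference in temporal coverage between the two sampling strategies can be checked with a small sketch. The helper names are hypothetical, taking the middle frame of each segment is an assumption (sparse samplers often pick a random frame per segment during training), and the 32-frame clip length is taken from the input size mentioned earlier rather than from this experiment's description.

```python
def sparse_uniform_indices(num_video_frames, num_samples):
    """Split the video into equal segments and take the middle frame of each."""
    segment = num_video_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def dense_indices(start, num_samples, stride=4):
    """Take num_samples frames spaced by a fixed temporal stride."""
    return [start + i * stride for i in range(num_samples)]


def covered_seconds(indices, fps):
    """Time span between the first and last sampled frame, inclusive."""
    return (indices[-1] - indices[0] + 1) / fps


# A hypothetical 30-second Charades clip at 24 FPS has 720 frames.
sparse = sparse_uniform_indices(720, 8)
dense = dense_indices(0, 32, stride=4)
sparse_span = covered_seconds(sparse, 24)  # spans most of the video
dense_span = covered_seconds(dense, 24)    # 125 frames, a little over 5 seconds
```

Under these assumptions, 32 densely sampled frames with stride 4 span 125 frames, or roughly five seconds at 24 FPS, while 8 sparsely sampled frames already span most of the 30-second video.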
|Sampling strategy||Methods||Backbone||# Frames||mAP|
Effect of Different Backbone. To measure the importance of context interaction among the temporal features in different backbones, we plug our temporal reasoning graph into existing 2D CNN and 3D CNN architectures and inspect the resulting activity recognition performance. For this, three temporal reasoning graphs were injected at the top of each backbone with almost the same settings. For a fair comparison with existing methods, the 2D architectures, BnInception and InceptionV3, are taken from TSN, and the 3D architectures, ResNet-50 and ResNet-101, are processed in the same way as in non-local networks using the I3D technique, which inflates the 2D convolution weights into 3D convolutions. All the results are listed in Table I.
Model Complexity. Our approach has multi-head temporal graphs for temporal reasoning and a temporal semantic aggregator for feature fusion, and the model can be deployed on top of an existing backbone. It scales well and provides a balance between performance and complexity. To quantify the computational cost of our model, we report the number of floating point operations (FLOPs) for a single clip (shown in Table I). The additional cost and parameters induced by our module are mainly contained in the convolution kernels of the temporal similarity calculation and the graph convolution parameters, followed by the temporal semantic relation aggregation. More precisely, the number of parameters introduced by our temporal reasoning graph is a sum of three terms that depend on the stage at which we plug our block into the backbone, the dimension of the output channels, and the number of graphs in that stage. The first term counts the parameters of the temporal similarity measure, the second term counts the parameters of the spatial transformation in the graph convolution, and the last term counts the parameters of the temporal relation aggregator.
CAM Visualization. To gain further insight into what our network learns, we visualize the CAM on the Something-Something V2 dataset. CAM visualizes the most discriminative parts of the input by highlighting the regions that matter when classifying the activity in a clip. We randomly sample three video clips from the validation set and evaluate them with the trained model. To understand the primitives our model uses for representing actions and to visualize class-relevant information, we show the output of CAM on the selected samples. The results are shown in Figure 7. The highlighted regions, which correspond to the receptive field, give us some insight into what the model attends to and indicate that spatio-temporal features are learned effectively.
In Figure 7, we also list the top-3 prediction probability scores of the three examples. We observe that the model correctly recognizes the first two cases but fails on the last example. For this kind of fine-grained action, the model still has trouble with classification. The top prediction scores for this case, shown in the figure, are almost equal. However, from the prediction alone, we cannot definitively declare that the result is wrong.
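For readers unfamiliar with CAM, the core computation is a class-weighted sum of the last convolutional feature maps. The sketch below illustrates it for a single frame; the function names are hypothetical, and applying it frame by frame to a video clip is an assumption rather than a detail given in the text.

```python
def class_activation_map(feature_maps, class_weights):
    """feature_maps: K maps of size H x W from the last convolutional layer.
    class_weights: the K weights that connect the pooled features to the target
    class in the final linear layer. Returns an H x W activation map."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [
        [sum(w * fmap[y][x] for w, fmap in zip(class_weights, feature_maps))
         for x in range(width)]
        for y in range(height)
    ]


def normalize(cam):
    """Rescale the map to [0, 1] so it can be overlaid on the frame as a heat map."""
    flat = [value for row in cam for value in row]
    low, high = min(flat), max(flat)
    span = high - low
    if span == 0:
        return [[0.0 for _ in row] for row in cam]
    return [[(value - low) / span for value in row] for row in cam]
```

Locations where channels with large class weights fire strongly receive high values, which is why the highlighted regions indicate the evidence the classifier relies on.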
Visualization of The Learned Representation Distribution. To visualize the distribution of the learned features, we apply t-SNE (with Principal Component Analysis for dimension reduction and 5000 iterations) to embed the sequence representations extracted from our model with different numbers of adjacent-matrix heads. The experiment was conducted on the Something-Something V2 dataset with 20 randomly selected classes, amounting to around 3K samples. The difference in distribution between Figure 8 (a) and Figure 8 (b) and (c) reveals that the model learns a more discriminative embedding with the temporal reasoning graph. Moreover, as shown in Figure 8 (b) and (c), through the multi-head adjacent matrix, the proposed model can separate the sample points into several semantically discriminative clusters.
5 Conclusion and Future Work
In this work, we presented a novel temporal reasoning graph module that captures the temporal relations of a video sequence to address long-range temporal dependencies in activity recognition. We also proposed the multi-head adjacent matrix to investigate the multiple kinds of relations in a video sequence. Additionally, a multi-head temporal relation aggregator was designed to automatically learn the importance of different sequence states in different graphs for comprehensive video-level feature learning. Benefiting from these two novel module designs, the proposed model can capture multiple kinds of temporal relations at different scales and temporal spans. We evaluated the proposed model on the Something-Something and Charades datasets and established competitive results compared with existing methods.
We hope the proposed temporal reasoning graph module will boost performance on various video understanding tasks. Furthermore, we plan to investigate the power of GCNs for learning video context correspondence in a self-supervised manner.
This work was supported in part by the National Natural Science Foundation of China under grants No. 61502081, 61602089, 61632007 and the Sichuan Science and Technology Program 2018GZDZX0032, 2019ZDZX0008 and 2019YFG0003.
References
-  J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
-  Y. Chen, M. Rohrbach, Z. Yan, S. Yan, J. Feng, and Y. Kalantidis. Graph-based global reasoning networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 433–442, 2019.
-  Z. Chen, X. Wei, P. Wang, and Y. Guo. Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5177–5186, 2019.
-  L. V. Der Maaten and G. E. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579–2605, 2008.
-  A. Diba, V. Sharma, and L. Van Gool. Deep temporal linear encoding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2329–2338, 2017.
-  J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
-  R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell. Actionvlad: Learning spatio-temporal aggregation for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 971–980, 2017.
-  X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
-  R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 3–12, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017.
-  A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
-  T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations, 2017.
-  F. Mahdisoltani, G. Berger, W. Gharbieh, D. Fleet, and R. Memisevic. Fine-grained video classification and captioning. arXiv preprint arXiv:1804.09235, 2018.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
-  M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo. Attentive relational networks for mapping images to scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3957–3966, 2019.
-  Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5534–5542, 2017.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in neural information processing systems, pages 91–99, 2015.
-  S. Ren, K. He, R. B. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
-  F. Shen, X. Gao, L. Liu, Y. Yang, and H. T. Shen. Deep asymmetric pairwise hashing. In Proceedings of the 25th ACM international conference on Multimedia, pages 1522–1530. ACM, 2017.
-  F. Shen, C. Shen, W. Liu, and H. Tao Shen. Supervised discrete hashing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 37–45, 2015.
-  Y. Shen, H. Li, S. Yi, D. Chen, and X. Wang. Person re-identification with deep similarity-guided graph neural network. In Proceedings of the European Conference on Computer Vision, pages 486–504, 2018.
-  L. Shi, Y. Zhang, J. Cheng, and H. Lu. Non-local graph convolutional networks for skeleton-based action recognition. arXiv preprint arXiv:1805.07694, 2018.
-  C. Si, W. Chen, W. Wang, L. Wang, and T. Tan. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1227–1236, 2019.
-  G. A. Sigurdsson, S. K. Divvala, A. Farhadi, and A. Gupta. Asynchronous temporal fields for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5650–5659, 2017.
-  G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari. Actor and observer: Joint modeling of first and third-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of the European Conference on Computer Vision, pages 510–526, 2016.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in neural information processing systems, pages 568–576, 2014.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
-  D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri. Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.
-  G. Varol, I. Laptev, and C. Schmid. Long-term temporal convolutions for action recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1510–1517, 2018.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Proceedings of the neural information processing systems, pages 5998–6008, 2017.
-  P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. In Proceedings of the International Conference on Learning Representations, 2018.
-  H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision, pages 3551–3558, 2013.
-  J. Wang and A. Cherian. Learning discriminative video representations using adversarial perturbations. In Proceedings of the European Conference on Computer Vision, pages 685–701, 2018.
-  J. Wang, A. Cherian, F. Porikli, and S. Gould. Video representation learning using discriminative pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1149–1158, 2018.
-  L. Wang, W. Li, and L. Van Gool. Appearance-and-relation networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1430–1439, 2018.
-  L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, pages 20–36. Springer, 2016.
-  X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
-  X. Wang and A. Gupta. Videos as space-time region graphs. In Proceedings of the european conference on computer vision, pages 413–431, 2018.
-  Z. Wang, L. Zheng, Y. Li, and S. Wang. Linkage based face clustering via graph convolution network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1117–1125, 2019.
-  Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM international conference on Multimedia, pages 461–470. ACM, 2015.
-  P. Xiong, H. Zhan, X. Wang, B. Sinha, and Y. Wu. Visual query answering by entity-attribute graph matching and reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8357–8366, 2019.
-  S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 7444–7452, 2018.
-  L. Yang, X. Zhan, D. Chen, J. Yan, C. C. Loy, and D. Lin. Learning to cluster faces on an affinity graph. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2298–2306, 2019.
-  J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.
-  H. Zhang, Z. Kyaw, S. Chang, and T. Chua. Visual translation embedding network for visual relation detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3107–3115, 2017.
-  Y. Zhao, Y. Xiong, and D. Lin. Trajectory convolution for action recognition. In Proceedings of the neural information processing systems, pages 2204–2215, 2018.
-  B. Zhou, A. Andonian, A. Oliva, and A. Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision, pages 803–818, 2018.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
-  J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun. Graph neural networks: A review of methods and applications. arXiv: Learning, 2018.
-  X. Zhou, F. Shen, L. Liu, W. Liu, L. Nie, Y. Yang, and H. T. Shen. Graph convolutional network hashing. IEEE Transactions on Cybernetics, 2018.
-  Y. Zhu, Z. Lan, S. Newsam, and A. Hauptmann. Hidden two-stream convolutional networks for action recognition. In Proceedings of the Asian Conference on Computer Vision, pages 363–378, 2018.
-  M. Zolfaghari, K. Singh, and T. Brox. Eco: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision, pages 713–730, 2018.