Video understanding has gained much attention from both academia and industry in recent years, given the rapid growth of videos published on online platforms. Temporal action detection is one of the interesting but challenging tasks in this area. It involves detecting the start and end frames of action instances, as well as predicting their class labels. This is especially onerous in long untrimmed videos.
Video context is an important cue for detecting actions effectively. Here, we refer to context as frames that lie outside the target action but carry valuable indicative information about it. Using video context to infer potential actions is natural for human beings. In fact, empirical evidence shows that humans can reliably guess or predict the occurrence of a certain type of action by only looking at short video snippets where the action does not happen [1, 2]. Therefore, incorporating context into temporal action detection has become an important strategy to boost detection accuracy in the recent literature [11, 15, 9, 32, 45, 59, 29]. Researchers have proposed various ways to take advantage of video context, such as extending temporal action boundaries by a pre-defined ratio [11, 15, 45, 59, 29], using dilated convolution to encode context into features, and aggregating context features implicitly by way of a Gaussian curve. All these methods only utilize temporal context, which precedes or follows an action instance in its immediate temporal neighborhood. However, real-world videos vary dramatically in temporal extent, action content, and even editing preferences. The use of such temporal context does not fully exploit the rich merits of video context, and it may even impair detection accuracy if not properly designed for the underlying videos.
So, what properties characterize desirable video context for the purpose of accurate action detection? First, context should be semantically correlated to the target action rather than merely temporally located in its vicinity. Imagine that we manually stitch an action clip into some irrelevant frames: the abrupt scene change surrounding the action would not benefit action detection. On the other hand, snippets located at a distance from an action but containing similar semantic content might provide indicative hints for detecting the action. Second, context should be content-adaptive rather than manually pre-defined. Considering the vast variation of videos, the context that helps to detect different action instances could differ in length and location depending on the video content. Third, context should be based on multiple semantic levels, since using only one form or level of context is unlikely to generalize well.
In this paper, we endow video context with all the above properties by casting action detection as a sub-graph localization problem based on a graph convolutional network (GCN). We represent each video sequence as a graph, each snippet as a node, each snippet-snippet correlation as an edge, and target actions associated with context as sub-graphs, as shown in Fig. 1. The context of a snippet is considered to be all snippets connected to it by an edge in a video graph. We define two types of edges, temporal edges and semantic edges, corresponding to temporal context and semantic context, respectively. Temporal edges exist between each pair of neighboring snippets, whereas semantic edges are dynamically learned from the video features at each GCN layer. Hence, multi-level context of each snippet is gradually aggregated into the features of the snippet throughout the entire GCN. The structure of each GCN block is inspired by ResNeXt, so we name this GCN-based feature extractor GCNeXt.
The pipeline of our proposed Graph-Temporal Action Detection method, dubbed G-TAD, is analogous to Faster R-CNN [17, 35] in object detection. There are two critical designs in G-TAD. First, GCNeXt, which generates context-enriched features, corresponds to the backbone network, analogous to a series of CNN layers in Faster R-CNN. Second, to mimic region of interest (RoI) alignment in Faster R-CNN, we design a sub-graph alignment (SGAlign) layer to generate a fixed-size representation for each sub-graph and embed all sub-graphs into the same Euclidean space. Finally, we apply a classifier on the features of each sub-graph to obtain detection results. We summarize our contributions as follows.
(1) We present a novel GCN-based video model to fully exploit video context for effective temporal action detection. Using this video GCN representation, we are able to adaptively incorporate multi-level semantic context into the features of each snippet.
(2) We propose G-TAD, a new sub-graph detection framework to localize actions in video graphs. G-TAD includes two main modules: GCNeXt and SGAlign. GCNeXt performs graph convolutions on video graphs, leveraging both temporal and semantic context. SGAlign re-arranges sub-graph features in an embedded space suitable for detection.
(3) G-TAD achieves state-of-the-art performance on two popular action detection benchmarks, ActivityNet-1.3 and THUMOS-14. On THUMOS-14, its mAP@0.5 beats that of all contemporary one-stage methods.
2 Related Work
2.1 Video Representation
Action Recognition. Many CNN-based methods have been proposed to address the action recognition task. Two-stream networks [14, 39, 47] use 2D CNNs to extract frame features from RGB and optical flow sequences. These 2D CNNs can be designed from scratch [20, 40] or adapted from image recognition tasks. Other methods [42, 8, 34, 55] use 3D CNNs to encode spatio-temporal information from the original video. In our work, we use the pre-trained action recognition models in [54, 46] to extract video snippet features as G-TAD input, and use graph convolution as an analogue of 2D or 3D CNNs.
Action Detection. The goal of temporal action detection is to predict the boundaries of action instances and their categories in untrimmed videos. Most methods [38, 41, 59, 58, 9, 29] divide the task into two stages: temporal proposal generation and classification/regression of proposals. For proposal generation, they either predict action proposals using handcrafted anchors [5, 6, 13, 15, 38] or classify starting/ending snippets [59, 29]. Others [28, 5, 27] tackle the problem using a single-stage model, where actions are detected directly. G-TAD is a single-stage model that scores pre-defined anchors with action confidences. We introduce a starting/ending snippet classification loss as a regularizer.
2.2 GCN in Videos
Graphs in Video Understanding. Graphs have been widely used for data representation in various video understanding tasks, such as video feature representation, video classification [49, 10], and action localization. In action recognition, Liu et al. view a video tensor as a 3D point cloud in the spatial-temporal space. Wang et al. represent a video as a space-time region graph, in which the graph nodes are defined by object region proposals. In action detection, Zeng et al. consider temporal action proposals as nodes in a graph, and refine their boundaries and classification scores based on the established proposal-proposal dependencies. Unlike previous works, G-TAD takes video snippets as nodes in a graph and forms dependencies between them based on both their temporal ordering and semantic similarity.
Graph Convolutions. Graph Convolutional Networks (GCNs) are widely used for non-Euclidean structures. Recent years have also seen their successful application in computer vision tasks, such as 3D object detection and point cloud segmentation [50, 53], due to their versatility and effectiveness. Meanwhile, various GCN architectures have been proposed for more effective and flexible modelling. Veličković et al. propose graph attention networks that assign different weights to neighboring nodes based on local structures. Jiang et al. dynamically infer a graph, and apply graph convolution on the dynamic graph. Li et al. propose DeepGCNs to enable GCNs to go as deep as 100 layers by using residual/dense graph connections and dilated graph convolutions. G-TAD uses a DeepGCN-like structure to apply graph convolutions on a dynamic semantic graph as well as a fixed temporal graph.
3 Proposed Method
3.1 Problem Formulation
The input to our pipeline is a video sequence of frames. Following recent video action proposal generation methods [5, 13, 15, 29], we construct our G-TAD model using feature sequences extracted from raw video frames. We uniformly sample frames at a fixed sampling rate and refer to each sampled frame as a snippet. Our input visual feature sequence consists of one extracted feature vector of fixed dimension per snippet. Each video sequence has a set of annotations, where each annotation represents an action instance described by its starting time, ending time, and action class.
The temporal action detection task is to predict the possible actions in the input video. Each prediction consists of the predicted temporal boundaries of an action, together with its predicted action class and confidence score.
3.2 G-TAD Architecture
Our action detection framework is illustrated in Fig. 2. We feed snippet features into a stack of GCNeXt blocks, whose design is inspired by ResNeXt, to obtain context-aware features. Each GCNeXt block contains two graph convolution streams. One stream operates on fixed temporal neighbors, and the other adaptively aggregates semantic context into snippet features. Each block follows a split-transform-merge strategy with multiple convolution paths. Based on a set of pre-defined temporal anchors (see Section 4.2), we define a sub-graph alignment layer named SGAlign to transform the aggregated feature of each sub-graph into a feature vector. Multiple fully connected layers are used to predict the intersection over union (IoU) of every anchor with the ground truth action instances. We provide a detailed description of GCNeXt and SGAlign in Sections 3.3 and 3.4, respectively.
3.3 GCNeXt for Context Feature Encoding
Our basic graph convolution block, GCNeXt, operates on a graph representation of the video sequence. It encodes snippets using their temporal and semantic neighbors. Fig. 3 illustrates the architecture of GCNeXt.
To build the video graph, we take as input the feature vectors of the snippets, one fixed-dimensional vector per snippet. We build a graph whose vertex set contains the snippets and whose edge set contains the dependencies between them. Each vertex is a snippet (represented by its feature), and each edge represents a dependency between a pair of snippets. We define two types of edges, temporal edges and semantic edges, and accordingly two graphs, the temporal graph and the semantic graph. We describe each type of edge as well as the graph convolution process in the following.
Temporal Edges. Temporal edges encode the temporal order of the video. For each node, there is one unique forward edge to the next node in temporal order and one backward edge to the previous node. The temporal edge set is therefore the union of a forward and a backward temporal edge set, each containing one edge per pair of temporally adjacent snippets in the video.
Semantic Edges. We define semantic edges using the notion of dynamic edge convolutions. The goal of these edges is to collect information from semantically correlated snippets. For each node in the input graph, we define a set of semantic edges that connect it to its nearest neighbors in the node feature space. This neighbor set is constructed dynamically at every layer, which enables us to find the dynamic neighbors that intrinsically carry semantic context information. Since we recompute the nearest neighbors at each layer, the semantic edge set changes adaptively to represent new levels of semantic context.
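The two edge types can be illustrated with a minimal sketch. It assumes Euclidean distance in feature space and a hypothetical neighbor count `k`; the function names and the toy features are ours, not part of G-TAD.

```python
import math

def temporal_edges(num_snippets):
    """Forward and backward temporal edges as (source, target) index pairs."""
    forward = [(i, i + 1) for i in range(num_snippets - 1)]
    backward = [(i, i - 1) for i in range(1, num_snippets)]
    return forward, backward

def semantic_edges(features, k):
    """Connect each node to its k nearest neighbors in feature space.

    `features` holds one equal-length vector per snippet. Distances are
    Euclidean (an assumption) and self-loops are excluded.
    """
    edges = []
    for i, x_i in enumerate(features):
        dists = [(math.dist(x_i, x_j), j)
                 for j, x_j in enumerate(features) if j != i]
        dists.sort()
        edges.extend((i, j) for _, j in dists[:k])
    return edges

# Toy features: snippets 0, 1, 3 look alike, and so do snippets 2 and 4.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.2], [5.1, 5.0]]
fwd, bwd = temporal_edges(len(features))
sem = semantic_edges(features, k=1)
print(sem)  # [(0, 1), (1, 0), (2, 4), (3, 0), (4, 2)]
```

Note that the semantic edges link snippets 2 and 4 even though they are not temporal neighbors. In a network, the neighbor search would be rerun on the features of every layer.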
Graph Convolution. A graph convolution transforms the vertex features of the graph through a general graph convolution operation, in which an aggregation function with trainable weights is applied to the vertex features and the adjacency matrix. The adjacency matrix has no self-loops, and each of its entries is given by an indicator function of whether the corresponding edge exists. There are several choices for the aggregation function in the literature. We use a single-layer edge convolution as our aggregation function. Different trainable weights are distinguished by different subscripts, and the edge convolution concatenates matrices along their columns.
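The following sketch illustrates one plausible form of a single-layer edge convolution. It assumes the form common in dynamic graph CNNs: each node's feature is concatenated with the difference between its summed neighbor features and itself, and a shared linear map is then applied. The function name, the edge direction convention, and the toy weights are hypothetical.

```python
def edge_conv(features, edges, weight):
    """Single-layer edge convolution sketch.

    features: list of N vectors of length C.
    edges:    (i, j) pairs, read here as "node i aggregates from node j".
    weight:   nested list of shape (2 * C) x C_out, shared by all nodes.
    """
    n, c = len(features), len(features[0])
    c_out = len(weight[0])
    # Adjacency matrix times features: sum the neighbor features per node.
    agg = [[0.0] * c for _ in range(n)]
    for i, j in edges:
        if i == j:
            continue  # no self-loops
        for d in range(c):
            agg[i][d] += features[j][d]
    out = []
    for i in range(n):
        # Column-wise concatenation [x_i, agg_i - x_i], then the linear map.
        row = list(features[i]) + [agg[i][d] - features[i][d] for d in range(c)]
        out.append([sum(row[p] * weight[p][q] for p in range(2 * c))
                    for q in range(c_out)])
    return out

features = [[1.0], [2.0], [4.0]]
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]  # a temporal chain of three nodes
weight = [[1.0], [0.5]]
out = edge_conv(features, edges, weight)
print(out)  # [[1.5], [3.5], [3.0]]
```

The same function can be applied to the temporal edges and to the semantic edges with separate weights.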
Residual Connection and Cardinality. We require two more graph operations. First, we use the residual connection proposed in DeepGCN to improve model convergence. Under this setup, our graph convolution block combines the input features with graph convolutions over the forward temporal, backward temporal, and semantic edge sets, each with its own adjacency matrix and trainable weights.
The derivation of this formulation is given in the supplementary material, where we also prove that it can be efficiently computed by zero-padded 1D/edge convolutions.
Second, following ResNeXt, GCNeXt adopts a split-transform-merge strategy and explores group convolution by changing the cardinality, in addition to going deeper or wider.
3.4 Sub-Graph Alignment and Localization
Sub-Graph of Interest Alignment (SGAlign). Most previous action detectors perform rescaling to extract a fixed-size proposal feature vector for each action anchor. Given an action anchor, they sample the video feature sequence within the anchor's temporal extent through linear interpolation with a fixed number of points. Given our graph formulation, a sub-graph feature is instead extracted by a Sub-Graph of Interest Alignment (SGAlign) layer that aggregates the context features in an adaptive way and does not rely on human priors. Fig. 4 illustrates our new graph alignment algorithm, and we present its technical details next.
Given an input sequence of snippet feature vectors and an anchor, we sample a fixed number of vectors from each of the temporal and semantic graphs. We repeat this process for all anchors. The alignment is done in four steps. (1) Each snippet is projected back to the temporal order given by the temporal graph. (2) We run an interpolation and rescaling algorithm (Alg. 1) to get the sampled vectors from the temporal graph. (3) Every node's feature is replaced with the mean feature of its dynamic neighbors, and then we repeat (1) and (2) to extract the vectors for the semantic context. (4) The temporal and semantic vectors are concatenated as the output of the SGAlign layer. In Alg. 1, the output for an anchor is the weighted average of all the nodes in the sub-graph defined by that anchor. In the backward pass, this weighted sum means that gradients always flow to these nodes.
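Step (2) can be sketched as follows. This is not Alg. 1 itself, which is not reproduced here, but one plausible reading of "interpolation and rescaling": sample evenly spaced positions inside the anchor, linearly interpolate the features, and average groups of consecutive samples. The names `num_bins` and `samples_per_bin` are hypothetical.

```python
import math

def align_anchor(features, start, end, num_bins, samples_per_bin):
    """Resample the features inside [start, end] into `num_bins` vectors."""
    total = num_bins * samples_per_bin
    last = len(features) - 1
    dim = len(features[0])
    samples = []
    for s in range(total):
        # Center of the s-th sub-interval of the anchor, clamped to the sequence.
        pos = start + (end - start) * (s + 0.5) / total
        pos = min(max(pos, 0.0), float(last))
        lo = int(math.floor(pos))
        hi = min(lo + 1, last)
        w = pos - lo
        # Linear interpolation between the two nearest snippets.
        samples.append([(1.0 - w) * features[lo][d] + w * features[hi][d]
                        for d in range(dim)])
    bins = []
    for b in range(num_bins):
        chunk = samples[b * samples_per_bin:(b + 1) * samples_per_bin]
        bins.append([sum(v[d] for v in chunk) / samples_per_bin
                     for d in range(dim)])
    return bins

# Toy sequence whose feature equals the snippet index.
features = [[0.0], [1.0], [2.0], [3.0], [4.0]]
bins = align_anchor(features, start=1.0, end=3.0, num_bins=2, samples_per_bin=2)
print(bins)  # [[1.5], [2.5]]
```

Each output vector is a weighted average of the snippet features inside the anchor, so every contributing snippet receives a gradient. For the semantic branch, the same routine would run on features that were first replaced by the mean of each node's dynamic neighbors.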
Sub-Graph Localization. For each anchor, we calculate its Intersection-over-Union (IoU) with all ground truth actions and take the maximum IoU as its training label. We compute this label for every anchor. Once we get the sub-graph feature from the SGAlign layer, we use three fully connected (FC) layers to regress it to this label. The last FC layer produces a two-dimensional vector whose entries are the classification and regression scores.
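The label assignment can be sketched in a few lines. Anchors and actions are assumed to be `(start, end)` pairs on the same time scale; the function names are ours.

```python
def temporal_iou(anchor, action):
    """Intersection-over-Union of two temporal segments (start, end)."""
    (a_s, a_e), (g_s, g_e) = anchor, action
    inter = max(0.0, min(a_e, g_e) - max(a_s, g_s))
    union = (a_e - a_s) + (g_e - g_s) - inter
    return inter / union if union > 0 else 0.0

def anchor_label(anchor, ground_truth):
    """Maximum IoU between one anchor and all ground truth actions."""
    return max((temporal_iou(anchor, g) for g in ground_truth), default=0.0)

ground_truth = [(4.0, 8.0), (0.0, 3.0)]
label = anchor_label((2.0, 6.0), ground_truth)
print(round(label, 4))  # 0.3333
```

An anchor that overlaps no action gets the label 0, and an anchor that coincides with an action gets the label 1.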
3.5 Training G-TAD
In G-TAD, the sub-graph localization is used to determine the confidence scores of anchors which are regressed for final temporal action detection. We do not need to specifically classify starting and ending nodes of actions since they are predefined by the anchors. However, we noticed that adding a node classifier during training can drastically improve the model’s convergence. This classification module is ignored at test time.
Sub-Graph Localization Loss. Sub-graph localization predicts classification and regression scores for each anchor position. With the maximum IoU of each anchor as the training target, the sub-graph loss is the sum of two terms balanced by a tradeoff coefficient, one of which is a weighted cross entropy loss. In our experiments, we choose the tradeoff coefficient to account for the fact that the second loss term tends to be smaller than the first.
Node Classification Regularizer. During training, we label a node as a start or end point if it is temporally close to the start or end of a ground truth action, while all other nodes belong to a third class containing action and background nodes. We add a separate branch with one FC layer after the first GCNeXt block to classify nodes into these labels, producing start/end probabilities. The node regularizer is a weighted binary cross entropy loss on these probabilities.
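A minimal sketch of the node labelling follows. How close counts as "temporally close" is not specified above, so the tolerance `radius` (in snippets) is a hypothetical parameter.

```python
def node_labels(num_snippets, actions, radius):
    """Binary start/end targets per snippet.

    actions: (start, end) pairs in snippet units.
    radius:  hypothetical tolerance for "temporally close".
    """
    starts = [0] * num_snippets
    ends = [0] * num_snippets
    for t in range(num_snippets):
        for t_s, t_e in actions:
            if abs(t - t_s) <= radius:
                starts[t] = 1
            if abs(t - t_e) <= radius:
                ends[t] = 1
    return starts, ends

starts, ends = node_labels(num_snippets=8, actions=[(2, 5)], radius=1)
print(starts)  # [0, 1, 1, 1, 0, 0, 0, 0]
print(ends)    # [0, 0, 0, 0, 1, 1, 1, 0]
```

Snippets that are neither start nor end points keep the label 0 in both lists, which corresponds to the third class of action and background nodes.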
We train G-TAD with a multi-task loss function that combines the sub-graph loss, the node regularizer loss, and a regularization term over all trainable parameters, whose weight is fixed in our experiments.
3.6 Inference and Post-processing
At inference time, G-TAD predicts classification and regression scores for each anchor. From the anchors, we construct predicted actions, each consisting of the action boundaries at the video scale, the action class, and a confidence score that fuses the classification and regression scores of the prediction. In our experiments, we search for the optimal fusion parameter in each setup. We apply Soft-NMS and select the top-scoring predictions.
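The post-processing can be sketched as below. Two assumptions are ours: the scores are fused with a weighted geometric mean controlled by a hypothetical `alpha`, and Soft-NMS uses the Gaussian decay variant with a hypothetical `sigma`.

```python
import math

def fuse(p_cls, p_reg, alpha):
    """Weighted geometric mean of the two scores (assumed fusion rule)."""
    return (p_cls ** alpha) * (p_reg ** (1.0 - alpha))

def tiou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(segments, scores, sigma=0.5, top_k=100):
    """Gaussian Soft-NMS: decay overlapping scores instead of discarding them."""
    remaining = list(zip(segments, scores))
    kept = []
    while remaining and len(kept) < top_k:
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        seg, score = remaining.pop(best)
        kept.append((seg, score))
        # Down-weight every remaining segment by its overlap with the pick.
        remaining = [(s, sc * math.exp(-(tiou(seg, s) ** 2) / sigma))
                     for s, sc in remaining]
    return kept

segments = [(0.0, 10.0), (1.0, 10.0), (20.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(segments, scores, sigma=0.5, top_k=3)
print([seg for seg, _ in kept])  # [(0.0, 10.0), (20.0, 30.0), (1.0, 10.0)]
```

The near-duplicate segment `(1.0, 10.0)` is kept but falls behind the non-overlapping one, because its score is decayed by its high overlap with the top prediction.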
4.1 Datasets and Metrics
ActivityNet-1.3 is a large-scale action understanding dataset for action recognition, temporal detection, proposal generation, and dense captioning tasks. It contains 19,994 temporally annotated untrimmed videos with 200 action categories, which are divided into training, validation, and testing sets with a ratio of 2:1:1.
The THUMOS-14 dataset contains 413 temporally annotated untrimmed videos with 20 action categories. We merge the 200 videos of the validation set into the training set and evaluate on the 213 annotated videos from the testing set.
We take mean Average Precision (mAP) at certain IoU thresholds as the main evaluation metric. Following the official evaluation API, THUMOS-14 and ActivityNet-1.3 each use their own set of IoU thresholds. Following standard practice, we also report average mAP over 10 different IoU thresholds on ActivityNet-1.3.
4.2 Implementation Details
Features and Anchors. We use pre-extracted features for both datasets. For ActivityNet-1.3, we adopt the pre-trained two-stream network by Xiong et al., applied with a fixed down-sampling ratio. Each video feature sequence is rescaled to a fixed number of snippets using linear interpolation. For THUMOS-14, the video features are extracted from a TSN model pre-trained on Kinetics. We crop each video feature sequence into windows of a fixed size, with neighbouring windows overlapping by a fixed number of snippets. In training, we do not use any crops void of actions.
For ActivityNet-1.3 and THUMOS-14, we enumerate all possible anchors subject to dataset-specific restrictions. The sampling settings of SGAlign are likewise chosen separately for ActivityNet-1.3 and THUMOS-14.
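Anchor enumeration can be sketched as follows. The exact restriction is not recoverable from the text above, so we assume a cap on anchor duration via a hypothetical `max_len`, with anchors given as `(start, end)` snippet indices.

```python
def enumerate_anchors(num_snippets, max_len):
    """All (start, end) pairs with 0 <= start < end <= num_snippets
    and end - start <= max_len (assumed restriction)."""
    return [(s, e)
            for s in range(num_snippets)
            for e in range(s + 1, min(s + max_len, num_snippets) + 1)]

anchors = enumerate_anchors(num_snippets=4, max_len=2)
print(anchors)
# [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
```

Without the duration cap the number of anchors grows quadratically with the number of snippets, which is one reason to restrict it.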
Training and Inference. We implement our framework using PyTorch 1.1, Python 3.7, and CUDA 10.0. We use a stack of GCNeXt blocks and train our model end-to-end with a batch size of 16. Separate learning rates are used on ActivityNet-1.3 and THUMOS-14, for 5/5 epochs. In inference, we take video classification scores from prior work and multiply them with the predicted confidence scores for evaluation. For post-processing, we apply Soft-NMS and pick the top M confident predictions, where M is 100 for ActivityNet-1.3 and 200 for THUMOS-14.
More details can be found in the supplementary material. To encourage reproducibility, the code and trained models will be made publicly available.
4.3 Comparison with State-of-the-Art
ActivityNet-1.3: Tab. 1 compares G-TAD with state-of-the-art detectors. We report mAP at different tIoU thresholds, as well as average mAP. G-TAD reports the highest average mAP results on this large-scale and diverse dataset.
|Method|0.5|0.75|0.95|Average|
|Wang et al.|43.65|-|-|-|
|Singh et al.|34.47|-|-|-|
|Chao et al.|38.23|18.30|1.30|20.22|
|Method|0.3|0.4|0.5|0.6|0.7|
|Two-stage Temporal Action Detection|
|One-stage Temporal Action Detection|
|Richard et al.|30.0|23.2|15.2|-|-|
|Yeung et al.|36.0|26.4|17.1|-|-|
|Yuan et al.|36.5|27.8|17.8|-|-|
|Hou et al.|43.7|-|22.0|-|-|
THUMOS-14: Tab. 2 compares the action localization results of G-TAD and various state-of-the-art methods on the THUMOS-14 dataset. At 0.7 IoU, we compare the mAP of G-TAD with the current best result, which comes from TAL-Net. At 0.5 IoU, G-TAD outperforms all one-stage detection methods, such as SS-TAD and BMN. Comparing G-TAD with two-stage methods puts our method at an inherent disadvantage. For example, P-GCN only rescores BSN proposals by mining proposal-proposal relationships and, in doing so, increases their mAP. Our model achieves its results solely by capturing more information about context.
4.4 Ablation Study
GCNeXt Module: We ablate the three main components of GCNeXt, namely GCN on temporal edges, GCN on semantic edges, and cardinality increase. Tab. 3 reports the performance obtained on ActivityNet-1.3 when each component is separately enabled or disabled. Each of these components contributes to the performance of the final G-TAD model. We highlight the gains from the semantic graph, which show the benefit of integrating adaptive context from semantic neighbors.
|GCNeXt block||tIoU on Validation Set|
SGAlign Module: This layer extracts sub-graph features by densely sampling and rescaling the underlying snippet features. The sampling density is defined by a sampling factor in Alg. 1. Tab. 4 shows the effect of the sampling density and of concatenating features from both the temporal and semantic graphs on ActivityNet-1.3. While sampling densely gives us minor improvements, we obtain a larger gain by including context information from the semantic graph.
|SGAlign||tIoU on Validation Set|
Sensitivity to Video Length: We report the sensitivity of G-TAD to different window sizes on THUMOS-14 in Tab. 5. G-TAD benefits from larger window sizes, since a larger window lets G-TAD aggregate more context snippets from the semantic graph. Performance degrades at the window size where GPU memory limited us to a batch size of only 2.
|Window||tIoU on Validation|
4.5 Discussion of Action Context
In the ablation experiments, graph convolutions on the semantic graph improve G-TAD performance in both the GCNeXt block and the SGAlign layer. Semantic edges connecting background snippets to action snippets can adaptively pass action context information to each possible action. In this section, we design two additional experiments to show how semantic edges encode meaningful context information.
Zero-Context Video. We show visually that an absence of context between action and background leads to semantic graphs with no action-background edges, by comparing the semantic graphs of natural videos with that of a synthetically compiled one. In Fig. 5 (left and right), we present two natural videos that include the actions "wrestling" and "playing darts", respectively. Semantic edges connecting action snippets with background snippets do exist in their resulting graphs, which exemplifies the use of context in the detection process. Then, we compile a synthetic video that stacks action frames from the wrestling video and background frames from the darts video, feed it to G-TAD, and again visualize the semantic graph (middle). As expected, the semantic graph does not include any action-background semantic edges.
Correlation to Context Amount. We also show the correlation between context edges and context as defined by human annotators. We define the video context amount as the average number of background snippets which can be used to predict the video action class. Following DETAD , we collect context amount for all videos in ActivityNet validation set from Amazon Mechanical Turk. The scatter plot in Fig. 7 shows the relation between Context Amount and the ratio of action-background semantic edges over all the semantic edges. From the plot, we observe that if a video has a higher amount of context (from human annotations), it is more likely to have more action-background semantic edges in its semantic graph. We further average context edge ratios in five context amount ranges, and plot them in green. The strong positive correlation between Context Amount and action-background semantic edge ratio indicates that our G-TAD model can effectively find related context snippets in the semantic graph.
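The edge ratio used in this analysis can be computed as in the sketch below. It assumes a per-snippet boolean action mask and semantic edges given as index pairs; the names are ours.

```python
def context_edge_ratio(edges, is_action):
    """Fraction of semantic edges that link an action snippet
    with a background snippet."""
    if not edges:
        return 0.0
    mixed = sum(1 for i, j in edges if is_action[i] != is_action[j])
    return mixed / len(edges)

is_action = [False, False, True, True, False]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
ratio = context_edge_ratio(edges, is_action)
print(ratio)  # 0.4
```

A zero-context video such as the synthetic one described above would yield a ratio of 0, since none of its semantic edges cross the action-background boundary.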
We show a few qualitative detection results in Fig. 6 on both ActivityNet-1.3 and THUMOS-14. In Fig. 8, we visualize the evolution of semantic graphs during the training process across GCNeXt layers. Specifically, we feed a video into G-TAD and visualize the semantic graphs emerging at the first, middle, and last layers at epochs 0, 3, 6, and 9 of training. The semantic graphs at the first layer are the same, since they are built on the same input features. As we progress to different layers and epochs, semantic graphs adaptively update their edges. Interestingly, we observe the presence of more context edges as training advances. This indicates that G-TAD progressively learns to incorporate multiple levels of context in the detection process.
In this paper, we cast the temporal action detection task as a sub-graph localization problem by formulating videos as graphs. We take video snippets as graph nodes and snippet-snippet correlations as edges, and apply graph convolution as the basic operation. We propose a new architecture, G-TAD, to localize sub-graphs. G-TAD includes GCNeXt blocks to aggregate context-enriched snippet features and an SGAlign layer to transform sub-graph features into vector representations. G-TAD can learn enriched multi-level semantic context in an adaptive way by looking at snippet features. Extensive experiments show that G-TAD can find global video context without extra supervision and achieve state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3 under different metrics.
-  (2018) Diagnosing error in temporal action detectors. In European Conference on Computer Vision (ECCV), Cited by: §1, §4.5.
-  (2017) Action search: spotting actions in videos and its application to temporal action localization. In European Conference on Computer Vision (ECCV), Cited by: §1.
-  (2017) Soft-nms – improving object detection with one line of code. In International Conference on Computer Vision (ICCV), Cited by: §3.6.
-  (2017) End-to-end, single-stream temporal action detection in untrimmed videos. In the British Machine Vision Conference (BMVC), Cited by: §4.3, Table 2.
-  SST: single-stream temporal action proposals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1, §3.1, Table 2.
-  (2016) Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.1.
-  (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1.
-  (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
-  (2018) Rethinking the faster r-cnn architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1, Table 1, Table 2.
-  (2019) Graph-based global reasoning networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
-  (2017) Temporal context network for activity localization in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, Table 1, Table 2.
-  (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.1.
-  (2016) Daps: deep action proposals for action understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.1, §3.1.
-  (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
-  (2017) Turn tap: temporal unit regression network for temporal action proposals. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.1, §3.1, Table 2.
-  (2017) Cascaded boundary regression for temporal action detection. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: Table 2.
-  (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: §1.
-  (2019) Mesh r-cnn. arXiv preprint arXiv:1906.02739. Cited by: §2.2.
-  (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: §1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.1.
-  (2017) Scc: semantic context cascade for efficient action detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 1.
-  (2017) Real-time temporal action localization in untrimmed videos by sub-action discovery. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: Table 2.
-  (2014) THUMOS challenge: action recognition with a large number of classes. Cited by: §4.1.
-  (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2.2.
-  (2019) DeepGCNs: can gcns go as deep as cnns?. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2, §3.3.
-  (2019) Fast learning of temporal action proposal via dense boundary generator. Cited by: Table 2.
-  (2019) BMN: boundary-matching network for temporal action proposal generation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1, §4.3, Table 1, Table 2.
-  (2017) Single shot temporal action detection. In Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, Cited by: §2.1.
-  (2018) BSN: boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §2.1, §3.1, Table 1, Table 2.
-  (2019) Learning video representations from correspondence proposals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
-  (2018) Multi-granularity generator for temporal action proposal. Computing Research Repository (CoRR). Cited by: Table 2.
-  (2019) Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  Learning convolutional neural networks for graphs. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §2.2.
-  (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, Cited by: §1.
-  (2016) Temporal action detection using a statistical language model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 2.
-  (2017) CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 1, Table 2.
-  (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
-  (2014) Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, Cited by: §2.1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.1.
-  (2016) Untrimmed video classification for activity detection: submission to activitynet challenge. arXiv preprint arXiv:1607.01979. Cited by: §2.1, Table 1.
-  (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: §2.1.
-  (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2.2.
-  (2017) UntrimmedNets for weakly supervised action recognition and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
-  (2016) Temporal segment networks: towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §4.2.
-  (2016) Temporal segment networks: towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Cited by: §2.1.
-  (2015) Towards good practices for very deep two-stream convnets. arXiv preprint arXiv:1507.02159. Cited by: §2.1.
-  (2016) UTS at activitynet 2016. ActivityNet Large Scale Activity Recognition Challenge. Cited by: Table 1.
-  (2018) Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.2.
-  (2018) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics. Cited by: §2.2, §3.3.
-  (2018) Dynamic graph CNN for learning on point clouds. Computing Research Repository (CoRR). Cited by: §3.3.
-  (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §1, §3.2, §3.3.
-  (2019) Point clouds learning with attention-based graph convolution networks. arXiv preprint arXiv:1905.13445. Cited by: §2.2.
-  (2016) CUHK & ethz & siat submission to activitynet challenge 2016. Cited by: §2.1, §4.2, §4.2.
-  (2017) R-c3d: region convolutional 3d network for temporal activity detection. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: §2.1, Table 1.
-  (2016) End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 2.
-  (2017) Temporal action localization by structured maximal sums. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 2.
-  (2019) Graph convolutional networks for temporal action localization. arXiv preprint arXiv:1909.03252. Cited by: §2.1, §2.2, Table 1, Table 2.
-  (2017) Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.1, Table 2.
-  (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §4.2.
6.1 Derivation and Efficient Implementation of Eq. 4
In this section, we derive Eq. 4 of the paper (restated below). We also show that Eq. 4 can be implemented efficiently with zero-padded 1D convolutions and edge convolutions.
6.2 Derivation of Eq. 4
a) Temporal Graph Convolution. We first provide the derivation for temporal graph convolution.
The temporal forward edges and backward edges are formulated as
The corresponding adjacency matrices can be represented by vectors, as shown in Eq. 6.2. We use to denote the vector in which the -th element is one and all other elements are zero.
Given the input , after temporal graph convolution, the output, , becomes
Here denotes the trainable weights in the neural network.
b) Semantic Graph Convolution. It is straightforward to obtain Eq. 4 for semantic graph convolution.
6.3 Efficient Implementation of Eq. 4
In our implementation of Eq. 4, we use an efficient zero-padded 1D convolution for temporal graph convolution and an edge convolution for semantic graph convolution. In the following, we prove that this efficient implementation is equivalent to Eq. 4.
a) Temporal Graph Convolution.
If a 1D convolution has kernel size 3, the weight matrix is a 3D tensor in . We denote the matrix as . Given the same input , we zero-pad the input, .
The output of 1D convolution can be written as
We can prove that by multiplying on both sides of Eq. 6.3. Note that is the trainable weights in the neural network. We can assume
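The equivalence argued above can be checked numerically on a toy input. The following is a minimal sketch, not the authors' implementation: it assumes the input is stored as one row per time step, splits the kernel-size-3 weight into three hypothetical slices (`w_prev`, `w_self`, `w_next`), and compares a zero-padded 1D convolution against a formulation built from two shift (adjacency) matrices. The function names are illustrative, and any activation or residual terms of Eq. 4 are omitted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(*mats):
    """Element-wise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]


def shift_matrix(length, offset):
    """length x length matrix whose row i has a single 1 in column i + offset.

    Rows whose target column falls outside the sequence stay all-zero,
    which plays the role of the zero padding.
    """
    return [[1 if j == i + offset else 0 for j in range(length)]
            for i in range(length)]


def graph_form(x, w_prev, w_self, w_next):
    """Temporal aggregation written with adjacency (shift) matrices."""
    length = len(x)
    a_prev = shift_matrix(length, -1)  # row i selects time step i - 1
    a_next = shift_matrix(length, +1)  # row i selects time step i + 1
    return add(matmul(matmul(a_prev, x), w_prev),
               matmul(x, w_self),
               matmul(matmul(a_next, x), w_next))


def conv1d_zero_padded(x, w_prev, w_self, w_next):
    """Kernel-size-3 1D convolution over time with one zero row of padding per side."""
    zero = [0] * len(x[0])
    padded = [zero] + x + [zero]
    out = []
    for i in range(len(x)):
        window = (padded[i], padded[i + 1], padded[i + 2])
        terms = [matmul([row], w)
                 for row, w in zip(window, (w_prev, w_self, w_next))]
        out.append(add(*terms)[0])
    return out


# Toy input: 4 time steps, 2 channels (all values are illustrative).
x = [[1, 2], [3, 4], [5, 6], [7, 8]]
w_prev = [[1, 0], [0, 1]]
w_self = [[2, 1], [0, 1]]
w_next = [[0, 1], [1, 0]]
print(graph_form(x, w_prev, w_self, w_next) ==
      conv1d_zero_padded(x, w_prev, w_self, w_next))
```

Under these assumptions the two functions agree for any input, because multiplying by a shift matrix is the same as reading the neighbouring time step and treating out-of-range steps as zero.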
b) Semantic Graph Convolution. For the semantic graph, edge convolution is used directly, so no further proof is needed.
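For readers unfamiliar with edge convolution, the sketch below illustrates the general idea in the style of the dynamic graph CNN work cited in the references. It is an assumption-laden illustration rather than the G-TAD code: neighbours are found by k-nearest-neighbour search in feature space, each edge feature is the concatenation of the node feature and the neighbour difference, a single linear map (`weight`, hypothetical) is applied, and messages are aggregated with a maximum. The aggregation function and the absence of a nonlinearity are choices made here for brevity.

```python
def knn_indices(x, k):
    """For each row of x, return the indices of its k nearest other rows
    under squared Euclidean distance (ties broken by lower index)."""
    neighbours = []
    for i, xi in enumerate(x):
        dists = [(sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
                 for j, xj in enumerate(x) if j != i]
        neighbours.append([j for _, j in sorted(dists)[:k]])
    return neighbours


def edge_conv(x, weight, k):
    """Edge convolution over a k-NN graph built in feature space.

    x:      list of L node features, each a list of C numbers.
    weight: (2 * C) x C_out matrix given as a list of rows (hypothetical).
    Returns a list of L output features, each of length C_out.
    """
    out = []
    for i, nbrs in enumerate(knn_indices(x, k)):
        messages = []
        for j in nbrs:
            # Edge feature: the node itself, then the neighbour difference.
            edge_feat = x[i] + [b - a for a, b in zip(x[i], x[j])]
            messages.append([sum(f * w for f, w in zip(edge_feat, col))
                             for col in zip(*weight)])
        # Channel-wise maximum over all neighbours of node i.
        out.append([max(vals) for vals in zip(*messages)])
    return out


# Toy example: two well-separated pairs of nodes, one output channel.
x = [[0, 0], [1, 0], [10, 0], [11, 0]]
weight = [[1], [0], [1], [0]]
print(edge_conv(x, weight, k=1))
```

Because the neighbours are recomputed from the features themselves, nodes that are far apart in time can still exchange information when their content is similar, which is the property the semantic graph relies on.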
6.4 Training Details
Semantic Edges from Multiple Levels. In G-TAD, we use multiple GCNeXt blocks to adaptively incorporate multi-level semantic context into video features. After that, the SGAlign layer embeds each sub-graph by concatenating aligned features from the temporal and semantic graphs. However, it is not necessary to use only the semantic graph of the last GCNeXt block to align the semantic features. The last row of Tab. 6 presents one more experiment, which takes the union of semantic edges from all GCNeXt blocks to aggregate the semantic features. We find that the semantic context also helps to improve model performance under this setup.
|SGAlign||tIoU on Validation Set|
2D Conv. for Sub-Graph Localization. Once we get the sub-graph features from the SGAlign layer, instead of using three fully connected (FC) layers to regress to , we can arrange the anchors in a 2D map based on their start/end times, and set the map to zero where there are no pre-designed anchors (e.g. ). In doing so, we can use 2D CNNs to regress to a map that is arranged in the same order. We call the predicted matrix the IoU map.
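The arrangement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: anchors are indexed by integer (start, end) positions, the map is square, and `build_anchor_map` and `conv2d_same` are hypothetical helper names. A plain zero-padded 2D convolution stands in for the 2D CNN layers.

```python
def build_anchor_map(num_steps, anchors):
    """Place anchor values on a num_steps x num_steps grid indexed by
    [start][end]; cells without a pre-designed anchor stay zero."""
    grid = [[0.0] * num_steps for _ in range(num_steps)]
    for (start, end), value in anchors.items():
        grid[start][end] = value
    return grid


def conv2d_same(grid, kernel):
    """Zero-padded 2D convolution (cross-correlation form) on a square grid
    with a square, odd-sized kernel; the output has the same shape."""
    size = len(kernel)
    pad = size // 2
    n = len(grid)
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            acc = 0.0
            for dr in range(size):
                for dc in range(size):
                    rr, cc = r + dr - pad, c + dc - pad
                    # Positions outside the map contribute zero.
                    if 0 <= rr < n and 0 <= cc < n:
                        acc += kernel[dr][dc] * grid[rr][cc]
            out[r][c] = acc
    return out


# Toy example: three anchors on a 3-step timeline (values are illustrative).
anchors = {(0, 1): 0.2, (0, 2): 0.5, (1, 2): 0.9}
anchor_map = build_anchor_map(3, anchors)
identity_kernel = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(conv2d_same(anchor_map, identity_kernel) == anchor_map)
```

With a kernel larger than 1, each output cell mixes the values of anchors whose start and end indices differ by at most the kernel radius, which is how neighbouring proposals can share information in this layout.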
Neighbouring anchors in the 2D IoU map have similar boundary locations, so 2D convolutions can exploit proposal-proposal relationships. We set the kernel size to 1, 3, and 5; the results are shown in Tab. 7. We do not observe any significant benefit from 2D convolutions.
|Conv. on IoU map||mAP on Validation Set|