G-TAD: Sub-Graph Localization for Temporal Action Detection

11/26/2019, by Mengmeng Xu et al., King Abdullah University of Science and Technology

Temporal action detection is a fundamental yet challenging task in video understanding. Video context is a critical cue to effectively detect actions, but current works mainly focus on temporal context, while neglecting semantic context as well as other important context properties. In this work, we propose a graph convolutional network (GCN) model to adaptively incorporate multi-level semantic context into video features and cast temporal action detection as a sub-graph localization problem. Specifically, we formulate video snippets as graph nodes, snippet-snippet correlations as edges, and actions associated with context as target sub-graphs. With graph convolution as the basic operation, we design a GCN block called GCNeXt, which learns the features of each node by aggregating its context and dynamically updates the edges in the graph. To localize each sub-graph, we also design an SGAlign layer to embed each sub-graph into the Euclidean space. Extensive experiments show that G-TAD is capable of finding effective video context without extra supervision and achieves state-of-the-art performance on two detection benchmarks. On ActivityNet-1.3, it obtains an average mAP of 34.09%; on THUMOS-14, it reaches 40.16% mAP@0.5.


1 Introduction

Video understanding has gained much attention from both academia and industry over recent years, given the rapid growth of videos published on online platforms. Temporal action detection is one of the interesting yet challenging tasks in this area. It involves detecting the start and end frames of action instances, as well as predicting their class labels. This is especially onerous in long untrimmed videos.

Figure 1: Graph formulation of a video. Nodes: video snippets (a video snippet is defined as consecutive frames within a short time period). Edges: snippet-snippet correlations. Sub-graphs: actions. There are 4 types of nodes: action, start, end, and background, shown as colored dots. There are 2 types of edges: (1) temporal edges, which are pre-defined according to the snippets’ temporal order; (2) semantic edges, which are learned from node features.

Video context is an important cue to effectively detect actions. Here, we refer to context as frames that are outside the target action but carry valuable indicative information of it. Using video context to infer potential actions is natural for human beings. In fact, empirical evidence shows that humans can reliably guess or predict the occurrence of a certain type of action by only looking at short video snippets where the action does not happen [1, 2]. Therefore, incorporating context into temporal action detection has become an important strategy to boost detection accuracy in the recent literature [11, 15, 9, 32, 45, 59, 29]. Researchers have proposed various ways to take advantage of video context, such as extending temporal action boundaries by a pre-defined ratio [11, 15, 45, 59, 29], using dilated convolution to encode context into features [9], and aggregating context features implicitly by way of a Gaussian curve [32]. All these methods only utilize temporal context, which precedes or follows an action instance in its immediate temporal neighborhood. However, real-world videos vary dramatically in temporal extent, action content, and even editing preferences. The use of such temporal context does not fully exploit the rich merits of video context, and it may even impair detection accuracy if not properly designed for underlying videos.

So, what properties characterize desirable video context for the purpose of accurate action detection? First, context should be semantically correlated to the target action rather than merely temporally located in its vicinity. Imagine the case where we manually stitch an action clip into some irrelevant frames: the abrupt scene change surrounding the action would certainly not benefit its detection. On the other hand, snippets located at a distance from an action but containing similar semantic content might provide indicative hints for detecting it. Second, context should be content-adaptive rather than manually pre-defined. Considering the vast variation of videos, the context that helps to detect different action instances can differ in length and location based on the video content. Third, context should be based on multiple semantic levels, since using only one form/level of context is unlikely to generalize well.

In this paper, we endow video context with all the above properties by casting action detection as a sub-graph localization problem based on a graph convolutional network (GCN) [24]. We represent each video sequence as a graph, each snippet as a node, each snippet-snippet correlation as an edge, and target actions associated with context as sub-graphs, as shown in Fig. 1. The context of a snippet is considered to be all snippets connected to it by an edge in a video graph. We define two types of edges — temporal edges and semantic edges, each corresponding to temporal context and semantic context, respectively. Temporal edges exist between each pair of neighboring snippets, whereas semantic edges are dynamically learned from the video features at each GCN layer. Hence, multi-level context of each snippet is gradually aggregated into the features of the snippet throughout the entire GCN. The structure of each GCN block is inspired by ResNeXt [52], so we name this GCN-based feature extractor GCNeXt.

The pipeline of our proposed Graph-Temporal Action Detection method, dubbed G-TAD, is analogous to faster R-CNN [17, 35] in object detection. There are two critical designs in G-TAD. First, GCNeXt, which generates context-enriched features, corresponds to the backbone network, analogous to the series of CNN layers in faster R-CNN. Second, to mimic region of interest (RoI) alignment [19] in faster R-CNN, we design a sub-graph alignment (SGAlign) layer to generate a fixed-size representation for each sub-graph and embed all sub-graphs into the same Euclidean space. Finally, we apply a classifier on the features of each sub-graph to obtain detection results. We summarize our contributions as follows.

(1) We present a novel GCN-based video model to fully exploit video context for effective temporal action detection. Using this video GCN representation, we are able to adaptively incorporate multi-level semantic context into the features of each snippet.

(2) We propose G-TAD, a new sub-graph detection framework to localize actions in video graphs. G-TAD includes two main modules: GCNeXt and SGAlign. GCNeXt performs graph convolutions on video graphs, leveraging both temporal and semantic context. SGAlign re-arranges sub-graph features in an embedded space suitable for detection.

(3) G-TAD achieves state-of-the-art performance on two popular action detection benchmarks. On ActivityNet-1.3, it achieves an average mAP of 34.09%. On THUMOS-14, it reaches 40.16% mAP@0.5, beating all contemporary one-stage methods.

2 Related Work

2.1 Video Representation

Figure 2: Overview of G-TAD architecture. The input of G-TAD is a sequence of snippet features. We first extract features using GCNeXt blocks, which gradually aggregate both temporal and multi-level semantic context. Semantic context, encoded in semantic edges, is dynamically learned from features at each GCNeXt layer. Then we feed the extracted features into the SGAlign layer, where sub-graphs defined by a set of anchors are transformed to a fixed-size representation in the Euclidean space. Finally, the localization module scores and ranks the sub-graphs for detection.

Action Recognition. Many CNN based methods have been proposed to address the action recognition task. Two-stream networks [14, 39, 47] use 2D CNNs to extract frame features from RGB and optical flow sequences. These 2D CNNs can be designed from scratch [20, 40] or adapted from image recognition tasks [12]. Other methods [42, 8, 34, 55] use 3D CNNs to encode spatio-temporal information from the original video. In our work, we use the pre-trained action recognition model in [54, 46] to extract video snippet features as G-TAD input, and use graph convolution as an analogue for 2D or 3D CNNs.

Action Detection. The goal of temporal action detection is to predict the boundaries of action instances and their categories in untrimmed videos. Most methods [38, 41, 59, 58, 9, 29] divide the task into two stages: temporal proposal generation and classification/regression of proposals. For proposal generation, they either predict action proposals using handcrafted anchors [5, 6, 13, 15, 38], or generate them by classifying starting/ending snippets [59, 29]. Others [28, 5, 27] tackle the problem using a single-stage model, where actions are detected directly. G-TAD is a single-stage model that scores pre-defined anchors with action confidences. We introduce a starting/ending snippet classification loss as a regularizer.

2.2 GCN in Videos

Graphs in Video Understanding. Graphs have been widely used for data representation in various video understanding tasks, such as video feature representation [30], video classification [49, 10], and action localization [58]. In action recognition, Liu et al. [30] view a video tensor as a 3D point cloud in the spatial-temporal space. Wang et al. [49] represent a video as a space-time region graph, in which the graph nodes are defined by object region proposals. In action detection, Zeng et al. [58] consider temporal action proposals as nodes in a graph, and refine their boundaries and classification scores based on the established proposal-proposal dependencies. Differently from previous works, G-TAD takes video snippets as nodes in a graph and forms dependencies between them based on both their temporal ordering and semantic similarity.

Graph Convolutions. Graph Convolutional Networks (GCNs) [24] are widely used for non-Euclidean data structures. Recent years have also seen their successful application to computer vision tasks, such as 3D object detection [18] and point cloud segmentation [50, 53], owing to their versatility and effectiveness. Meanwhile, various GCN architectures have been proposed for more effective and flexible modelling. Veličković et al. [43] propose graph attention networks that assign different weights to neighboring nodes based on local structures. Niepert et al. [33] dynamically infer a graph structure and apply graph convolution on it. Li et al. [25] propose DeepGCNs to enable GCNs to go as deep as 100 layers by using residual/dense graph connections and dilated graph convolutions. G-TAD uses a DeepGCN-like structure to apply graph convolutions on a dynamic semantic graph as well as a fixed temporal graph.

3 Proposed Method

3.1 Problem Formulation

The input to our pipeline is a video sequence of $l_v$ frames. Following recent video action proposal generation methods [5, 13, 15, 29], we construct our G-TAD model using feature sequences extracted from the raw video frames. We uniformly sample frames with a sampling rate $\sigma$ and refer to each sampled frame, together with its surrounding frames, as a snippet, yielding $L$ snippets per video. Our input visual feature sequence is represented by $X \in \mathbb{R}^{C \times L}$, where $x_l \in \mathbb{R}^{C}$ is the extracted snippet feature at the $l$-th sampled frame and $C$ is its dimension. Each video sequence has a set of annotations $\Psi = \{ \psi_n = (t_{s,n}, t_{e,n}, c_n) \}_{n=1}^{N}$, where $\psi_n$ represents an action instance, and $t_{s,n}$, $t_{e,n}$, and $c_n$ are its starting time, ending time, and action class, respectively.

The temporal action detection task is to predict $M$ possible actions $\Phi = \{ \phi_m = (\hat{t}_{s,m}, \hat{t}_{e,m}, \hat{c}_m, p_m) \}_{m=1}^{M}$ from $X$. Here, $(\hat{t}_{s,m}, \hat{t}_{e,m})$ represents the predicted temporal boundaries of the $m$-th predicted action, and $\hat{c}_m$ and $p_m$ are its predicted action class and confidence score, respectively.
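For concreteness, the following is a minimal PyTorch sketch of this input/output interface. The container names, the feature dimension, and the subsampling call are illustrative assumptions, not the released implementation.

import torch
from typing import NamedTuple

class GroundTruth(NamedTuple):
    t_start: float   # starting time (seconds)
    t_end: float     # ending time (seconds)
    label: int       # action class

class Prediction(NamedTuple):
    t_start: float
    t_end: float
    label: int
    score: float     # confidence score

def sample_snippet_features(frame_features: torch.Tensor, sigma: int) -> torch.Tensor:
    """Uniformly subsample per-frame features (C, l_v) into snippet features (C, L)."""
    return frame_features[:, ::sigma]

# Example: 256-dim features for a 4000-frame video, sampled every sigma = 16 frames.
frames = torch.randn(256, 4000)
snippets = sample_snippet_features(frames, sigma=16)   # shape (256, 250)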

3.2 G-TAD Architecture

Our action detection framework is illustrated in Fig. 2. We feed snippet features into a stack of GCNeXt blocks, whose design is inspired by ResNeXt [52], to obtain context-aware features. Each GCNeXt block contains two graph convolution streams. One stream operates on fixed temporal neighbors, and the other adaptively aggregates semantic context into snippet features. Each block follows a split-transform-merge strategy with multiple convolution paths. Based on a set of pre-defined temporal anchors (see Section 4.2), we define a sub-graph alignment layer named SGAlign to transform the aggregated feature of each sub-graph into a fixed-size feature vector. Multiple fully connected layers are then used to predict the intersection over union (IoU) between every anchor and the ground truth action instances. We provide a detailed description of GCNeXt and SGAlign in Sections 3.3 and 3.4, respectively.

Figure 3: GCNeXt block. The input feature is processed by temporal and semantic graph streams with the same cardinality. Black and purple boxes represent edge convolutions and 1D convolutions, respectively. We display (input channels, output channels) in each box. Both convolution streams follow a split-transform-merge strategy with 32 paths, designed to increase the diversity of transformations. The module output is the summation of both streams and the input.

3.3 GCNeXt for Context Feature Encoding

Our basic graph convolution block, GCNeXt, operates on a graph representation of the video sequence. It encodes snippets using their temporal and semantic neighbors. Fig. 3 illustrates the architecture of GCNeXt.

To build the video graph, we take as input the $C$-dimensional feature vectors of the $L$ snippets, denoted as $X \in \mathbb{R}^{C \times L}$. We build a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ and $\mathcal{E}$ denote the vertex and edge sets, respectively. In this case, each vertex is a snippet (represented by its feature) and each edge represents a dependency between a pair of snippets. We define two types of edges, temporal edges $\mathcal{E}_t$ and semantic edges $\mathcal{E}_s$; accordingly, we define two graphs, the temporal graph and the semantic graph. We describe each type of edge as well as the graph convolution process in the following.

Temporal Edges ($\mathcal{E}_t$). Temporal edges encode the temporal order of the video. For each node $v_l$, there is one unique forward edge to node $v_{l+1}$, and one backward edge to node $v_{l-1}$. In this case, we have $\mathcal{E}_t = \mathcal{E}_t^{f} \cup \mathcal{E}_t^{b}$, where $\mathcal{E}_t^{f}$ and $\mathcal{E}_t^{b}$ are the forward and backward temporal edge sets defined as follows:

$$\mathcal{E}_t^{f} = \{ (v_l, v_{l+1}) \mid l = 1, 2, \dots, L-1 \}, \quad (1)$$
$$\mathcal{E}_t^{b} = \{ (v_l, v_{l-1}) \mid l = 2, 3, \dots, L \}, \quad (2)$$

where $L$ is the number of snippets in the video.

Semantic Edges ($\mathcal{E}_s$). We define $\mathcal{E}_s$ using the notion of dynamic edge convolutions [50]. The goal of these edges is to collect information from semantically correlated snippets. For each node $v_i$ in the input graph $\mathcal{G}$, we define its semantic edges as follows:

$$\mathcal{E}_s = \{ (v_i, v_{n_i(k)}) \mid i = 1, \dots, L;\ k = 1, \dots, K \}.$$

Here, $n_i(k)$ is the index of the $k$-th nearest neighbour of the $i$-th node. $\mathcal{E}_s$ is constructed dynamically at every layer in the node feature space, which enables us to find the dynamic neighbors that intrinsically carry semantic context information. Since we recompute $\mathcal{E}_s$ at each layer, it is adaptively changed to represent new levels of semantic context.
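A small sketch of how the two edge sets can be built is given below. The value of $K$ and the use of Euclidean distances for the nearest-neighbour search are assumptions made for illustration.

import torch

def temporal_edges(L: int) -> torch.Tensor:
    """Forward and backward temporal edges as an (E, 2) index tensor (0-based)."""
    fwd = torch.stack([torch.arange(0, L - 1), torch.arange(1, L)], dim=1)
    bwd = torch.stack([torch.arange(1, L), torch.arange(0, L - 1)], dim=1)
    return torch.cat([fwd, bwd], dim=0)

def semantic_edges(x: torch.Tensor, k: int = 3) -> torch.Tensor:
    """k-nearest-neighbour edges in the current feature space.
    x: (L, C) node features at one layer; returns (L*k, 2) edge indices."""
    dist = torch.cdist(x, x)                        # (L, L) pairwise distances
    dist.fill_diagonal_(float("inf"))               # exclude self-loops
    knn = dist.topk(k, largest=False).indices       # (L, k) neighbour indices
    src = torch.arange(x.size(0)).repeat_interleave(k)
    return torch.stack([src, knn.reshape(-1)], dim=1)

x = torch.randn(100, 256)                           # L = 100 snippets, C = 256
edges = torch.cat([temporal_edges(100), semantic_edges(x, k=3)], dim=0)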

Graph Convolution. A graph convolution transforms the graph vertices, represented as $X$, through a general graph convolution operation formulated as

$$H = \mathrm{ReLU}\big( \mathcal{F}_{agg}(X, A, W) \big).$$

Here, $W$ denotes the trainable weights of the aggregation function $\mathcal{F}_{agg}$, $A$ is the adjacency matrix without self-loops (i.e. edges between a node and itself), and ReLU is the rectified linear unit used as activation function. We formally define the adjacency element as $A_{ij} = \mathbb{1}\big[(v_i, v_j) \in \mathcal{E}\big]$, where $\mathbb{1}[\cdot]$ is the indicator function. There are several choices for $\mathcal{F}_{agg}$ in the literature. We use a single-layer edge convolution [51] as our aggregation function in Eq. 3:

$$\mathcal{F}_{agg}(X, A, W) = \big[\, X^{\top},\ A X^{\top} - \mathrm{diag}(A \mathbf{1})\, X^{\top} \,\big]\, W, \quad (3)$$

whose second block aggregates the neighbor differences $x_j - x_i$ at each node. We use $W$ with different subscripts to denote different trainable weights, and $[\cdot\,, \cdot]$ represents matrix concatenation in columns.

Residual Connection and Cardinality. We require two more graph operations. First, we use the residual connection proposed in DeepGCN [25] to improve model convergence. Under this setup, our graph convolution block can be formulated as:

$$H = \mathrm{ReLU}\big( \mathcal{F}_{agg}(X, A_t^{f}, W_t^{f}) + \mathcal{F}_{agg}(X, A_t^{b}, W_t^{b}) + \mathcal{F}_{agg}(X, A_s, W_s) \big) + X^{\top}, \quad (4)$$

where $A_t^{f}$, $A_t^{b}$, and $A_s$ are the adjacency matrices for $\mathcal{E}_t^{f}$, $\mathcal{E}_t^{b}$, and $\mathcal{E}_s$, respectively.

The derivation of Eq. 4 is given in the supplementary material, where we also show that it can be efficiently computed by zero-padded 1D/edge convolutions.

Following ResNeXt [52], GCNeXt adopts a split-transform-merge strategy and exploits group convolutions, increasing cardinality as an alternative to going deeper or wider.
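The following is a simplified GCNeXt-style block that mirrors this description: two streams, grouped convolutions for cardinality, a residual sum, and dynamic k-nearest-neighbour edges recomputed at every forward pass. Layer widths, the bottleneck design, and the max aggregation over neighbours are assumptions; it is a sketch, not the released implementation.

import torch
import torch.nn as nn

class GCNeXtSketch(nn.Module):
    def __init__(self, channels: int = 256, cardinality: int = 32, k: int = 3):
        super().__init__()
        self.k = k
        # temporal stream: kernel-3 1D convolution standing in for the temporal graph stream
        self.temporal = nn.Sequential(
            nn.Conv1d(channels, channels, 1), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=1, groups=cardinality), nn.ReLU(),
            nn.Conv1d(channels, channels, 1),
        )
        # semantic stream: 1x1 convolutions on [x_i, x_j - x_i] edge features
        self.semantic = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1), nn.ReLU(),
            nn.Conv2d(channels, channels, 1, groups=cardinality), nn.ReLU(),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, L) snippet features
        t = self.temporal(x)
        # dynamic k nearest neighbours, recomputed from the current features
        xt = x.transpose(1, 2)                                                        # (B, L, C)
        idx = torch.cdist(xt, xt).topk(self.k + 1, largest=False).indices[..., 1:]    # (B, L, k)
        batch = torch.arange(x.size(0)).view(-1, 1, 1)
        nbrs = x[batch, :, idx].permute(0, 3, 1, 2)                                   # (B, C, L, k)
        center = x.unsqueeze(3).expand(-1, -1, -1, self.k)                            # (B, C, L, k)
        edge_feat = torch.cat([center, nbrs - center], dim=1)                         # (B, 2C, L, k)
        s = self.semantic(edge_feat).max(dim=3).values                                # aggregate neighbours
        return torch.relu(t + s) + x                                                  # residual connection

block = GCNeXtSketch()
out = block(torch.randn(2, 256, 100))   # (2, 256, 100)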

3.4 Sub-Graph Alignment and Localization

Figure 4: SGAlign layer. SGAlign extracts sub-graph features using a set of anchors. In the graphs above, the colored dots represent node features, grey arrows are semantic edges, and the orange highlighted arc is the anchor. Both circles represent the same graph. SGAlign arranges node features along the temporal/semantic graphs, and concatenates both features as output. When using the temporal graph, the order of nodes is preserved in the final representation (black lines). This is not always true for the semantic graph, since node features are represented by their feature-space neighbors (purple lines).

Sub-Graph of Interest Alignment (SGAlign). Most previous action detectors rescale features to extract a fixed-size proposal feature vector for each action anchor: given an action anchor $(t_s, t_e)$, they sample the video feature sequence within $[t_s, t_e]$ through linear interpolation with a fixed number of points. Given our graph formulation, we instead extract a sub-graph feature with a Sub-Graph of Interest Alignment (SGAlign) layer, which aggregates the context feature in an adaptive way and does not rely on human priors. Fig. 4 illustrates our new graph alignment algorithm, and we present its technical details next.

0:  ordered feature $X \in \mathbb{R}^{C \times L}$, anchors $\mathcal{A}$, resolution $\tau$, sampling density $s$;
1:  Init $Y = \varnothing$
2:  for each anchor $a$ in $\mathcal{A}$ do
3:     $t_s$ = $a$[0]; $t_e$ = $a$[1];
4:     $d$ = $t_e - t_s$; $n$ = $\tau \cdot s$; $\delta$ = $d / n$;
5:     $idx$ = [$t_s + k \cdot \delta$ for $k$ in range($n$)]
6:     $f$ = [interp($X$, $i$) for $i$ in $idx$]
7:     $y_a$ = [mean($f$[$j \cdot s$ : $(j{+}1) \cdot s$]) for $j$ in range($\tau$)]; $Y$ = $Y \cup \{y_a\}$
8:  end for
9:  return $Y$.
Algorithm 1 Interpolation and Rescaling in Alignment

Given an input of $L$ feature vectors and an anchor $a$, we expect to sample $\tau$ vectors from the temporal graph and $\tau$ vectors from the semantic graph. We repeat this process for all anchors. The alignment is done in four steps. (1) Each snippet is projected back to the temporal order given by the temporal graph. (2) We run an interpolation and rescaling algorithm (Alg. 1) to get $\tau$ vectors from the temporal graph. (3) Every node's feature is replaced with the mean feature of its dynamic neighbors, and then we repeat (1) and (2) to further extract $\tau$ vectors for the semantic context. (4) The temporal and semantic vectors are concatenated as the output of the SGAlign layer. In Alg. 1, the output $y_a$ for anchor $a$ is a weighted average of the nodes in the sub-graph defined by $a$. In the backward pass, this weighted sum means gradients always flow to these nodes.
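A sketch of the interpolation-and-rescaling step (Alg. 1) applied to one graph is shown below. Anchor boundaries are assumed to be given in snippet coordinates, the semantic input is assumed to be the neighbour-averaged features from step (3), and the values of $\tau$ and $s$ are placeholders.

import torch

def sgalign_sample(feat: torch.Tensor, anchor, tau: int = 8, s: int = 4) -> torch.Tensor:
    """feat: (C, L) node features arranged in temporal order.
    anchor: (t_start, t_end) in snippet coordinates. Returns (C, tau)."""
    t_s, t_e = float(anchor[0]), float(anchor[1])
    pos = torch.linspace(t_s, t_e, tau * s)                   # tau*s evenly spaced positions
    lo = pos.floor().long().clamp(0, feat.size(1) - 1)
    hi = (lo + 1).clamp(0, feat.size(1) - 1)
    w = (pos - lo.float()).clamp(0, 1)                        # linear-interpolation weights
    sampled = feat[:, lo] * (1 - w) + feat[:, hi] * w         # (C, tau*s)
    # rescale: average every s consecutive samples into tau output vectors
    return sampled.view(feat.size(0), tau, s).mean(dim=2)

feat_t = torch.randn(256, 100)     # temporal-graph features
feat_s = torch.randn(256, 100)     # features averaged over dynamic semantic neighbours
y = torch.cat([sgalign_sample(feat_t, (12, 47)), sgalign_sample(feat_s, (12, 47))], dim=1)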

Sub-Graph Localization. For an anchor $a = (t_s, t_e)$, we calculate its Intersection-over-Union (IoU) with all ground truth actions in $\Psi$, and denote the maximum IoU, $g_a$, as its label. We compute $g_a$ for all anchors. Once we get the sub-graph feature $y_a$ from the SGAlign layer, we use three fully connected (FC) layers to regress it to $g_a$. The last FC layer produces a two-dimensional vector $(p_a^{c}, p_a^{r})$, whose entries are the classification and regression scores, respectively.

3.5 Training G-TAD

In G-TAD, the sub-graph localization is used to determine the confidence scores of anchors which are regressed for final temporal action detection. We do not need to specifically classify starting and ending nodes of actions since they are predefined by the anchors. However, we noticed that adding a node classifier during training can drastically improve the model’s convergence. This classification module is ignored at test time.

Sub-Graph Localization Loss. Sub-graph localization predicts $(p_a^{c}, p_a^{r})$ for each anchor position $a$. With the training target being $g_a$, the sub-graph loss is defined as follows:

$$\mathcal{L}_{sub} = \mathcal{L}_{wce}(p^{c}, g) + \lambda_1 \cdot \mathcal{L}_{reg}(p^{r}, g), \quad (5)$$

where $\mathcal{L}_{wce}$ is the weighted cross entropy loss and $\mathcal{L}_{reg}$ is a regression loss on the predicted IoU. In our experiments, we set the tradeoff coefficient $\lambda_1$ larger than one, since the second loss term tends to be smaller than the first.
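A hedged sketch of this loss is given below. The positive-IoU threshold, the use of a mean-squared-error term for the regression part, and the value of the tradeoff coefficient are assumptions made for illustration, not values taken from the paper.

import torch
import torch.nn.functional as F

def weighted_bce(pred: torch.Tensor, target: torch.Tensor, pos_thr: float = 0.7):
    """Cross entropy re-weighted so positive and negative anchors contribute comparably."""
    pos = (target > pos_thr).float()
    n = float(pred.numel())
    w_pos = n / pos.sum().clamp(min=1)
    w_neg = n / (n - pos.sum()).clamp(min=1)
    eps = 1e-6
    loss = -(w_pos * pos * torch.log(pred + eps) + w_neg * (1 - pos) * torch.log(1 - pred + eps))
    return loss.mean()

def subgraph_loss(p_cls, p_reg, gious, lam1: float = 10.0):
    # lam1 is a placeholder weight balancing the (smaller) regression term
    return weighted_bce(p_cls, gious) + lam1 * F.mse_loss(p_reg, gious)

p_cls = torch.rand(100)     # classification scores per anchor
p_reg = torch.rand(100)     # regression scores per anchor
gious = torch.rand(100)     # maximum IoU of each anchor with the ground truth
loss = subgraph_loss(p_cls, p_reg, gious)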

Node Classification Regularizer. In the training process, we label a node as a start or end node if it is temporally close to a ground truth starting time $t_{s,n}$ or ending time $t_{e,n}$, while all the other nodes belong to a third class containing action and background nodes. We add a separate branch with one FC layer after the first GCNeXt block to classify nodes into these labels, producing start/end probabilities $(p^{s}, p^{e})$ for every node. We add a node regularizer $\mathcal{L}_{node} = \mathcal{L}_{wbce}(p^{s}, g^{s}) + \mathcal{L}_{wbce}(p^{e}, g^{e})$, where $\mathcal{L}_{wbce}$ stands for the weighted binary cross entropy loss and $g^{s}$, $g^{e}$ are the node labels.

We train G-TAD in the form of a multi-task loss function, including the sub-graph loss $\mathcal{L}_{sub}$, the node regularizer loss $\mathcal{L}_{node}$, and an $L_2$ regularization of all the trainable parameters $\Theta$:

$$\mathcal{L} = \mathcal{L}_{sub} + \lambda_2 \cdot \mathcal{L}_{node} + \lambda_3 \cdot \lVert \Theta \rVert_2^{2}. \quad (6)$$

In our experiments, $\lambda_2$ and $\lambda_3$ are fixed tradeoff coefficients.

3.6 Inference and Post-processing

At inference time, G-TAD predicts classification and regression scores $(p_a^{c}, p_a^{r})$ for each anchor $a$. From the anchors, we construct the predicted actions $\phi_m = (\hat{t}_{s,m}, \hat{t}_{e,m}, \hat{c}_m, p_m)$, where $(\hat{t}_{s,m}, \hat{t}_{e,m})$ is the action boundary rescaled to the video duration, $\hat{c}_m$ is the action class, and $p_m$ is the confidence score obtained by fusing the classification and regression scores of this prediction. In our experiments, we search for the optimal fusion weight in each setup. We apply Soft-NMS [3] and select the top-$M$ predictions.
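The post-processing can be sketched as follows. The multiplicative score fusion and the Gaussian decay inside Soft-NMS are common choices and are assumptions here; only the overall pipeline (fuse scores, suppress overlaps, keep the top-$M$) follows the description above.

import torch

def soft_nms(segments: torch.Tensor, scores: torch.Tensor, sigma: float = 0.5, top_m: int = 100):
    """segments: (N, 2) start/end times; scores: (N,). Returns indices of kept predictions."""
    order, keep = scores.clone(), []
    segs = segments.clone()
    for _ in range(min(top_m, segs.size(0))):
        i = int(order.argmax())
        if order[i] <= 0:
            break
        keep.append(i)
        # temporal IoU of the selected segment with all others
        inter = (torch.min(segs[:, 1], segs[i, 1]) - torch.max(segs[:, 0], segs[i, 0])).clamp(min=0)
        union = (segs[:, 1] - segs[:, 0]) + (segs[i, 1] - segs[i, 0]) - inter
        iou = inter / union.clamp(min=1e-6)
        order *= torch.exp(-iou ** 2 / sigma)     # Gaussian decay of overlapping scores
        order[i] = -1                             # never reselect this segment
    return keep

p_cls, p_reg = torch.rand(500), torch.rand(500)            # per-anchor scores
segments = torch.sort(torch.rand(500, 2), dim=1).values    # (start, end) per anchor
fused = p_cls * p_reg                                      # assumed fusion rule
kept = soft_nms(segments, fused)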

4 Experiment

4.1 Datasets and Metrics

ActivityNet-1.3 [7]

is a large-scale action understanding dataset for action recognition, temporal detection, proposal generation and dense captioning tasks. It contains 19,994 temporally annotated untrimmed videos with 200 action categories, which are divided into training, validation and testing sets by 2:1:1.

THUMOS-14 [23] dataset contains 413 temporally annotated untrimmed videos with 20 action categories. We merge the 200 videos in validation to the training set and evaluate on the 213 annotated videos from the testing set.

Detection Metric. We take mean Average Precision (mAP) at certain IoU thresholds as the main evaluation metric. Following the official evaluation API, the IoU thresholds are chosen from {0.3, 0.4, 0.5, 0.6, 0.7} on THUMOS-14 and {0.5, 0.75, 0.95} on ActivityNet-1.3. Following standard practice, we also report the average mAP over 10 different IoU thresholds [0.5:0.05:0.95] on ActivityNet-1.3.

4.2 Implementation Details

Features and Anchors. We use pre-extracted features for both datasets. For ActivityNet-1.3, we adopt the two-stream network pre-trained by Xiong et al. [54], with down-sampling ratio $\sigma$. Each video feature sequence is rescaled to a fixed number of $L$ snippets using linear interpolation. For THUMOS-14, the video features are extracted from a Kinetics [60] pre-trained TSN model [45]. We crop each video feature sequence with a fixed window size (256 snippets in the main experiments, cf. Tab. 5) and overlap neighbouring windows. In training, we do not use any crops void of actions.

For ActivityNet-1.3 and THUMOS-14, we enumerate all possible anchors $(t_s, t_e)$ with $t_s < t_e$, subject to a maximum-duration restriction whose value is set per dataset. In SGAlign, we use dataset-specific values of the resolution $\tau$ and sampling density $s$ for ActivityNet-1.3 and THUMOS-14.
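A sketch of the anchor enumeration over snippet indices, with a placeholder maximum duration, is shown below.

import torch

def enumerate_anchors(num_snippets: int, d_max: int = 100) -> torch.Tensor:
    """Every (start, end) pair with end > start and duration at most d_max (d_max is a placeholder)."""
    anchors = [(s, e) for s in range(num_snippets)
               for e in range(s + 1, min(s + d_max, num_snippets) + 1)]
    return torch.tensor(anchors)          # (num_anchors, 2)

anchors = enumerate_anchors(100)          # e.g. L = 100 snippets
print(anchors.shape)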

Training and Inference. We implement and compile our framework using PyTorch 1.1, Python 3.7, and CUDA 10.0. We stack multiple GCNeXt blocks and train our model end-to-end, with a batch size of 16. The learning rates on ActivityNet and THUMOS-14 are set separately, each following a 5/5-epoch schedule. In inference, we take video classification scores from [44] and [54], and multiply them with the fused confidence $p_m$ for evaluation. For post-processing, Soft-NMS with a fixed threshold is used to pick the top-$M$ confident predictions, where $M$ is 100 for ActivityNet and 200 for THUMOS.

More details can be found in the supplementary material. To encourage reproducibility, the code and trained models will be made publicly available.

4.3 Comparison with State-of-the-Art

ActivityNet-1.3: Tab. 1 compares G-TAD with state-of-the-art detectors. We report mAP at different tIoU thresholds, as well as average mAP. G-TAD reports the highest average mAP results on this large-scale and diverse dataset.

Method 0.5 0.75 0.95 Average
Wang et al. [48] 43.65 - - -
Singh et al. [41] 34.47 - - -
SCC [21] 40.00 17.90 4.70 21.70
CDC [37] 45.30 26.00 0.20 23.80
TCN [11] 37.49 23.47 4.47 23.58
R-C3D [55] 26.80 - - -
BSN [29] 46.45 29.96 8.02 30.03
Chao et al. [9] 38.23 18.30 1.30 20.22
P-GCN [58] 48.26 33.16 3.27 31.11
BMN [27] 50.07 34.78 8.29 33.85
G-TAD (ours) 50.36 34.60 9.02 34.09
Table 1: Action detection results on the validation set of ActivityNet-1.3, measured by mAP (%) at different tIoU thresholds and the average mAP. G-TAD achieves better average mAP than all other methods, including the recent BMN and P-GCN shown in the second-to-last block.
Method 0.3 0.4 0.5 0.6 0.7
Two-stage Temporal Action Detection
SST [5] - - 23.0 - -
CDC [37] 40.1 29.4 23.3 13.1 7.9
TURN-TAP[15] 44.1 34.9 25.6 - -
CBR [16] 50.1 41.3 31.0 19.1 9.9
SSN [59] 51.9 41.0 29.8 - -
BSN [29] 53.5 45.0 36.9 28.4 20.0
TCN [11] - 33.3 25.6 15.9 9.0
TAL-Net [9] 53.2 48.5 42.8 33.8 20.8
MGG [31] 53.9 46.8 37.4 29.5 21.3
DBG [26] 57.8 49.4 39.8 30.2 21.7
P-GCN [58] 63.6 57.8 49.1 - -
One-stage Temporal Action Detection
Richard et al. [36] 30.0 23.2 15.2 - -
Yeung et al. [56] 36.0 26.4 17.1 - -
Yuan et al. [57] 36.5 27.8 17.8 - -
Hou et al. [22] 43.7 - 22.0 - -
SS-TAD [4] 45.7 - 29.2 - 9.6
BMN [27] 56.0 47.4 38.8 29.7 20.5
G-TAD (ours) 54.5 47.6 40.2 30.8 23.4
Table 2: Action detection results on the testing set of THUMOS-14, measured by mAP (%) at different tIoU thresholds. G-TAD achieves the best performance among all one-stage methods, and its mAP@0.7 is even higher than that of the strong two-stage TAL-Net [9].

THUMOS14: Tab. 2 compares the action localization results of G-TAD and various state-of-the-art methods on the THUMOS-14 dataset. At 0.7 IoU, G-TAD reaches an mAP of 23.4%, compared to the current best of 20.8% from TAL-Net. At 0.5 IoU, G-TAD outperforms all one-stage detection methods, such as SS-TAD [4] and BMN [27]. Comparing G-TAD with two-stage methods puts our method at an inherent disadvantage. For example, P-GCN rescores BSN proposals by mining proposal-proposal relationships and, in doing so, increases mAP@0.5 from 36.9% to 49.1%. In contrast, our single-stage model obtains its results purely by capturing more context information.

4.4 Ablation Study

GCNeXt Module: We ablate the three main components of GCNeXt, namely graph convolutions on temporal edges, graph convolutions on semantic edges, and increased cardinality. Tab. 3 reports the performance obtained on ActivityNet-1.3 when each component is separately enabled/disabled. We see how each of these components contributes to the performance of the final G-TAD model. We highlight the gains from the semantic graph, showing the benefit of integrating adaptive context from semantic neighbors.

GCNeXt block tIoU on Validation Set
Temp. Sem. Card. 0.5 0.75 0.95 Avg.
1 48.12 32.16 6.41 31.65
1 50.20 34.80 7.35 33.88
32 50.13 34.17 8.70 33.67
32 49.09 33.32 8.02 32.63
32 50.36 34.60 9.02 34.09
Table 3: Ablating GCNeXt Components. We disable temporal/semantic graph convolutions and set different cardinalities for detection on ActivityNet-1.3.

SGAlign Module: This layer extracts sub-graph features by densely sampling and rescaling underlying snippet features. The sampling density is defined by the factor $s$ in Alg. 1. Tab. 4 shows the effect of the sampling and of the feature concatenation from both temporal and semantic graphs on ActivityNet-1.3. While sampling densely gives us minor improvements, we obtain a larger gain by including context information from the semantic graph.

SGAlign tIoU on Validation Set
Samp. Concat. 0.5 0.75 0.95 Avg.
49.84 34.58 8.17 33.78
49.86 34.60 9.56 33.89
50.36 34.60 9.02 34.09
Table 4: Ablating SGAlign Components. We disable the sample-rescale process and the feature concatenation from the semantic graph for detection on ActivityNet-1.3. The rescaling strategy leads to a slight improvement, while the main gain arises from the use of context information (semantic graph).

Sensitivity to Video Length: We report the sensitivity of G-TAD to different window sizes on THUMOS-14 in Tab. 5. G-TAD benefits from larger window sizes (256 vs. 128), since larger windows allow G-TAD to aggregate more context snippets from the semantic graph. Performance degrades at a window size of 512, where GPU memory limited us to a batch size of only 2.

Window tIoU on Validation
Length 0.3 0.4 0.5 0.6 0.7
128 51.75 44.90 38.70 29.03 21.32
256 54.50 47.61 40.16 30.83 23.42
512 48.32 41.71 34.38 26.85 19.29
Table 5: Effect of Video Size. We vary the input window length and observe that G-TAD performance improves with a larger window (256 vs. 128). Degradation occurs at a length of 512, since GPU memory forces a significantly reduced batch size, leading to a noticeable performance drop.
Figure 5: Semantic graphs and Context. Given two untrimmed videos (left and right), we combine action frames from one video with context frames from the second (middle). We therefore create a synthetic video with no action context. As expected, the semantic graph of the synthetic video contains no edges between action and background snippets.
Figure 6: Qualitative results. We show qualitative detection results on ActivityNet-1.3 (top) and THUMOS-14 (bottom).

4.5 Discussion of Action Context

In the ablation experiments, graph convolutions on the semantic graph improve G-TAD performance in both the GCNeXt block and the SGAlign layer. Semantic edges connecting background to action snippets can adaptively pass the action context information to each possible action. In this section, we present two additional experiments to show how semantic edges encode meaningful context information.

Zero-Context Video. We visually show that zero context between action and background leads to semantic graphs with no action-background edges by comparing the semantic graphs of natural videos with those of synthetically compiled ones. In Fig. 5 (left and right), we present two natural videos that include the actions "wrestling" and "playing darts", respectively. Semantic edges connecting action with background snippets do exist in their resulting graphs, exemplifying the usage of context in the detection process. Then, we compile a synthetic video that stacks action frames from the wrestling video and background frames from the darts video, feed it to G-TAD, and again visualize the semantic graph (middle). As expected, the semantic graph does not include any action-background semantic edges.

Figure 7: Action-Background Semantic Edge Ratio vs. Context Amount. In the scatter plot, each purple dot corresponds to a different video graph. Strong positive correlation is observed between context amount and action-background semantic edge ratio, which means we predict on average more semantic edges in the presence of large video context.
Figure 8: Semantic graph evolution during G-TAD training. We visualize the semantic graphs at first, middle, and last layers during training epoch 0, 3, 6, and 9. The semantic edges at the first layer are always the same, while the semantic graphs at the middle and last layers evolve to incorporate more context.

Correlation to Context Amount. We also show the correlation between context edges and context as defined by human annotators. We define the video context amount as the average number of background snippets that can be used to predict the video action class. Following DETAD [1], we collect the context amount for all videos in the ActivityNet validation set from Amazon Mechanical Turk. The scatter plot in Fig. 7 shows the relation between the context amount and the ratio of action-background semantic edges over all semantic edges. From the plot, we observe that if a video has a higher amount of context (from human annotations), it is more likely to have more action-background semantic edges in its semantic graph. We further average the context edge ratios in five context-amount ranges, and plot them in green. The strong positive correlation between context amount and action-background semantic edge ratio indicates that our G-TAD model can effectively find related context snippets in the semantic graph.

4.6 Visualization

We show a few qualitative detection results in Fig. 6 on both ActivityNet-1.3 and THUMOS-14. In Fig. 8, we visualize the evolution of semantic graphs during the training process across GCNeXt layers. Specifically, we feed a video into G-TAD and visualize the semantic graphs emerging at the first, middle, and last layers at epochs 0, 3, 6, and 9 of training. The semantic graphs at the first layer are the same, since they are built on the same input features. As we progress to different layers and epochs, semantic graphs adaptively update their edges. Interestingly, we observe the presence of more context edges as training advances. This indicates that G-TAD progressively learns to incorporate multiple levels of context in the detection process.

5 Conclusion

In this paper, we cast the temporal action detection task as a sub-graph localization problem by formulating videos as graphs. We take video snippets as graph nodes, snippet-snippet correlations as edges, and apply graph convolution as the basic operation. We propose a new architecture, G-TAD, to localize sub-graphs. G-TAD includes GCNeXt blocks to aggregate context-enriched snippet features and an SGAlign layer to transform sub-graph features into vector representations. G-TAD can learn enriched multi-level semantic context in an adaptive way by looking at snippet features. Extensive experiments show that G-TAD can find global video context without extra supervision and achieve state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3 under different metrics.

References

  • [1] H. Alwassel, F. Caba Heilbron, V. Escorcia, and B. Ghanem (2018) Diagnosing error in temporal action detectors. In European Conference on Computer Vision (ECCV), Cited by: §1, §4.5.
  • [2] H. Alwassel, F. C. Heilbron, and B. Ghanem (2017) Action search: spotting actions in videos and its application to temporal action localization. In European Conference on Computer Vision (ECCV), Cited by: §1.
  • [3] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis (2017) Soft-nms – improving object detection with one line of code. In International Conference on Computer Vision (ICCV), Cited by: §3.6.
  • [4] S. Buch, V. Escorcia, B. Ghanem, L. Fei-Fei, and J. C. Niebles (2017) End-to-end, single-stream temporal action detection in untrimmed videos. In the British Machine Vision Conference (BMVC), Cited by: §4.3, Table 2.
  • [5] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. Carlos Niebles (2017) SST: single-stream temporal action proposals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1, §3.1, Table 2.
  • [6] F. Caba Heilbron, J. Carlos Niebles, and B. Ghanem (2016) Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.1.
  • [7] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1.
  • [8] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [9] Y. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar (2018) Rethinking the faster r-cnn architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1, Table 1, Table 2.
  • [10] Y. Chen, M. Rohrbach, Z. Yan, Y. Shuicheng, J. Feng, and Y. Kalantidis (2019) Graph-based global reasoning networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [11] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Qiu Chen (2017) Temporal context network for activity localization in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, Table 1, Table 2.
  • [12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.1.
  • [13] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem (2016) Daps: deep action proposals for action understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.1, §3.1.
  • [14] C. Feichtenhofer, A. Pinz, and A. Zisserman (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [15] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia (2017) Turn tap: temporal unit regression network for temporal action proposals. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.1, §3.1, Table 2.
  • [16] J. Gao, Z. Yang, and R. Nevatia (2017) Cascaded boundary regression for temporal action detection. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: Table 2.
  • [17] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: §1.
  • [18] G. Gkioxari, J. Malik, and J. Johnson (2019) Mesh r-cnn. arXiv preprint arXiv:1906.02739. Cited by: §2.2.
  • [19] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: §1.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.1.
  • [21] F. C. Heilbron, W. Barrios, V. Escorcia, and B. Ghanem (2017) Scc: semantic context cascade for efficient action detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 1.
  • [22] R. Hou, R. Sukthankar, and M. Shah (2017) Real-time temporal action localization in untrimmed videos by sub-action discovery. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: Table 2.
  • [23] Y. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar (2014) THUMOS challenge: action recognition with a large number of classes. Cited by: §4.1.
  • [24] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2.2.
  • [25] G. Li, M. Muller, A. Thabet, and B. Ghanem (2019) DeepGCNs: can gcns go as deep as cnns?. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2, §3.3.
  • [26] C. Lin, J. Li, Y. Wang, Y. Tai, D. Luo, Z. Cui, C. Wang, J. Li, F. Huang, and R. Ji (2019) Fast learning of temporal action proposal via dense boundary generator. Cited by: Table 2.
  • [27] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen (2019) BMN: boundary-matching network for temporal action proposal generation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1, §4.3, Table 1, Table 2.
  • [28] T. Lin, X. Zhao, and Z. Shou (2017) Single shot temporal action detection. In Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, Cited by: §2.1.
  • [29] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang (2018) BSN: boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §2.1, §3.1, Table 1, Table 2.
  • [30] X. Liu, J. Lee, and H. Jin (2019) Learning video representations from correspondence proposals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [31] Y. Liu, L. Ma, Y. Zhang, W. Liu, and S. Chang (2018) Multi-granularity generator for temporal action proposal. Computing Research Repository (CoRR). Cited by: Table 2.
  • [32] F. Long, T. Yao, Z. Qiu, X. Tian, J. Luo, and T. Mei (2019) Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [33] M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §2.2.
  • [34] Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.
  • [35] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, Cited by: §1.
  • [36] A. Richard and J. Gall (2016) Temporal action detection using a statistical language model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 2.
  • [37] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S. Chang (2017) CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 1, Table 2.
  • [38] Z. Shou, D. Wang, and S. Chang (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [39] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, Cited by: §2.1.
  • [40] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.1.
  • [41] G. Singh and F. Cuzzolin (2016) Untrimmed video classification for activity detection: submission to activitynet challenge. arXiv preprint arXiv:1607.01979. Cited by: §2.1, Table 1.
  • [42] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: §2.1.
  • [43] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2.2.
  • [44] L. Wang, Y. Xiong, D. Lin, and L. Van Gool (2017) UntrimmedNets for weakly supervised action recognition and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
  • [45] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Val Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §4.2.
  • [46] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Cited by: §2.1.
  • [47] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao (2015) Towards good practices for very deep two-stream convnets. arXiv preprint arXiv:1507.02159. Cited by: §2.1.
  • [48] R. Wang and D. Tao (2016) UTS at activitynet 2016. ActivityNet Large Scale Activity Recognition Challenge. Cited by: Table 1.
  • [49] X. Wang and A. Gupta (2018) Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.2.
  • [50] Y. Wang, Y. Sun, Z. Liu, S. Sarma, M. Bronstein, and J. Solomon (2018) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics. Cited by: §2.2, §3.3.
  • [51] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2018) Dynamic graph CNN for learning on point clouds. Computing Research Repository (CoRR). Cited by: §3.3.
  • [52] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §1, §3.2, §3.3.
  • [53] Z. Xie, J. Chen, and B. Peng (2019) Point clouds learning with attention-based graph convolution networks. arXiv preprint arXiv:1905.13445. Cited by: §2.2.
  • [54] Y. Xiong, L. Wang, Z. Wang, B. Zhang, H. Song, W. Li, D. Lin, Y. Qiao, L. Van Gool, and X. Tang (2016) CUHK & ethz & siat submission to activitynet challenge 2016. Cited by: §2.1, §4.2, §4.2.
  • [55] H. Xu, A. Das, and K. Saenko (2017) R-c3d: region convolutional 3d network for temporal activity detection. In Proceedings of the IEEE international conference on computer vision (ICCV), Cited by: §2.1, Table 1.
  • [56] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei (2016) End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 2.
  • [57] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng (2017) Temporal action localization by structured maximal sums. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 2.
  • [58] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan (2019) Graph convolutional networks for temporal action localization. arXiv preprint arXiv:1909.03252. Cited by: §2.1, §2.2, Table 1, Table 2.
  • [59] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin (2017) Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.1, Table 2.
  • [60] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §4.2.

6 Appendix

6.1 Derivation and Efficient Implementation of Eq. 4

In this section, we provide the derivation of Eq. 4 of the paper (restated below). We also show that Eq. 4 can be efficiently implemented by zero-padded 1D/edge convolutions.

6.2 Derivation of Eq. 4

a) Temporal Graph Convolution. We first provide the derivation for the temporal graph convolution.

The temporal forward edges $\mathcal{E}_t^{f}$ and backward edges $\mathcal{E}_t^{b}$ are formulated as

$$\mathcal{E}_t^{f} = \{ (v_l, v_{l+1}) \mid l = 1, \dots, L-1 \}, \qquad \mathcal{E}_t^{b} = \{ (v_l, v_{l-1}) \mid l = 2, \dots, L \}. \quad (7)$$

The corresponding adjacency matrices can be represented by one-hot vectors, as shown in Eq. 8. We use $e_i$ to denote the vector whose $i$-th element is one while all others are zero:

$$A_t^{f} = \sum_{l=1}^{L-1} e_l\, e_{l+1}^{\top}, \qquad A_t^{b} = \sum_{l=2}^{L} e_l\, e_{l-1}^{\top}. \quad (8)$$

Given the input $X$, after the temporal graph convolution of Eq. 4, the output $H_t$ becomes

$$H_t = \mathrm{ReLU}\big( X^{\top} W_0 + A_t^{f} X^{\top} W_f + A_t^{b} X^{\top} W_b \big), \quad (9)$$

i.e., for node $l$, $h_l = \mathrm{ReLU}\big( x_l^{\top} W_0 + x_{l+1}^{\top} W_f + x_{l-1}^{\top} W_b \big)$ with zero-padded boundaries. Here, $W_0$, $W_f$, and $W_b$ are trainable weights of the network, obtained by re-grouping the blocks of $W_t^{f}$ and $W_t^{b}$ in Eq. 4.

b) Semantic Graph Convolution. It is straightforward to obtain Eq.  4 for semantic graph convolution.

6.3 Efficient Implementation of Eq. 4

In the implementation of Eq. 4, we use an efficient zero-padded 1D convolution and an edge convolution for the temporal and semantic graph convolutions, respectively. In the following, we show that this efficient implementation is equivalent to Eq. 4.

a) Temporal Graph Convolution. A 1D convolution with kernel size 3 has a weight tensor in $\mathbb{R}^{3 \times C \times C'}$, which we denote as $\{W^{(1)}, W^{(2)}, W^{(3)}\}$. Given the same input $X$, we zero-pad it in time, $\tilde{X} = [\, \mathbf{0}, x_1, \dots, x_L, \mathbf{0} \,]$.

The output of the 1D convolution at node $l$ can be written as

$$h^{conv}_{l} = \mathrm{ReLU}\big( \tilde{x}_{l-1}^{\top} W^{(1)} + \tilde{x}_{l}^{\top} W^{(2)} + \tilde{x}_{l+1}^{\top} W^{(3)} \big). \quad (10)$$

Comparing Eq. 10 with Eq. 9 row by row, the two outputs coincide, $H_{conv} = H_t$, if we assume

$$W^{(1)} = W_b, \qquad W^{(2)} = W_0, \qquad W^{(3)} = W_f, \quad (11)$$

which is always possible since all of them are trainable weights of the network.

b) Semantic Graph Convolution. In the semantic graph, the edge convolution is used directly, so no further proof is needed.
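The equivalence can be checked numerically with a few lines of PyTorch. The sketch below assumes the plain per-node form $W_b x_{l-1} + W_0 x_l + W_f x_{l+1}$ of Eq. 9 (activation omitted); it is a sanity check under that assumption, not the authors' code.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, L = 8, 20
X = torch.randn(C, L)
W_b, W_0, W_f = (torch.randn(C, C) for _ in range(3))

# graph-convolution view: aggregate over backward/self/forward temporal edges
Xp = F.pad(X, (1, 1))                                         # zero-pad in time
H_graph = W_b @ Xp[:, :-2] + W_0 @ Xp[:, 1:-1] + W_f @ Xp[:, 2:]

# 1D-convolution view: kernel size 3, weights stacked as (out, in, kernel)
conv = torch.nn.Conv1d(C, C, kernel_size=3, padding=1, bias=False)
conv.weight.data = torch.stack([W_b, W_0, W_f], dim=2)        # (C, C, 3)
H_conv = conv(X.unsqueeze(0)).squeeze(0)

print(torch.allclose(H_graph, H_conv, atol=1e-5))             # True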

6.4 Training Details

Semantic Edges from Multiple Levels. In G-TAD, we use multiple GCNeXt blocks to adaptively incorporate multi-level semantic context into video features. After that, the SGAlign layer embeds each sub-graph by concatenating aligned features from the temporal and semantic graphs. However, it is not necessary to consider only the semantic graph of the last GCNeXt block to align the semantic feature. The last row in Tab. 6 presents one more experiment that takes the union of semantic edges from all GCNeXt blocks to aggregate the semantic feature. We find that the semantic context also helps to improve model performance under this setup.

SGAlign tIoU on Validation Set
Samp. Concat. 0.5 0.75 0.95 Avg.
49.84 34.58 8.17 33.78
49.86 34.60 9.56 33.89
50.36 34.60 9.02 34.09
all 50.26 34.70 8.52 33.95
Table 6: Ablating SGAlign components with semantic edges from multiple levels, for detection on ActivityNet-1.3. The first three rows correspond to Tab. 4; the last row ("all") takes the union of semantic edges from all GCNeXt blocks.

2D Conv. for Sub-Graph Localization. Once we get the sub-graph feature from the SGAlign layer, instead of using three fully connected (FC) layers to regress it to $g_a$, we can arrange the anchors in a 2D map based on their start/end times, and set the map to zero wherever there is no pre-designed anchor. In doing so, we can use 2D CNNs to regress to a target map arranged in the same order. We call the predicted matrix the IoU map.

Neighbouring anchors in the 2D IoU map have similar boundary locations, so 2D convolutions can exploit proposal-proposal relationships. We set the kernel size to 1, 3, and 5; the results are shown in Tab. 7. We do not observe any significant benefit from 2D convolutions.

Conv. on IoU map mAP on Validation Set
Kernel Size Padding 0.5 0.75 0.95 Avg.
(1,1) (0,0) 50.25 34.66 9.29 34.08
(3,3) (1,1) 50.25 34.94 7.74 34.10
(5,5) (2,2) 49.88 34.39 8.96 33.77
Table 7: Model performance when we use three 2D convolution layers to predict the IoU map. We set the kernel size to 1, 3, and 5, and report results on ActivityNet-1.3. We do not observe any significant benefit from 2D convolutions.
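A sketch of this 2D IoU-map variant is given below. The grid layout (start index by duration), channel widths, and kernel sizes are illustrative assumptions.

import torch
import torch.nn as nn

L = 100                                              # number of snippet positions
feat_dim = 256
anchors = [(s, e) for s in range(L) for e in range(s + 1, L + 1)]
anchor_feats = torch.randn(len(anchors), feat_dim)   # e.g. SGAlign outputs, one per anchor

# scatter anchor features into a (C, start, duration) map; cells without an anchor stay zero
iou_input = torch.zeros(feat_dim, L, L)
for (s, e), f in zip(anchors, anchor_feats):
    iou_input[:, s, e - s - 1] = f

head = nn.Sequential(
    nn.Conv2d(feat_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 2, kernel_size=1), nn.Sigmoid(),   # classification / regression maps
)
iou_map = head(iou_input.unsqueeze(0))                # (1, 2, L, L)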