Gaussian Temporal Awareness Networks for Action Localization

09/09/2019 ∙ by Fuchen Long, et al. ∙ University of Rochester ∙ USTC

Temporally localizing actions in a video is a fundamental challenge in video understanding. Most existing approaches draw inspiration from image object detection and extend its advances, e.g., SSD and Faster R-CNN, to produce temporal locations of an action in a 1D sequence. Nevertheless, the results can suffer from robustness problems due to the design of predetermined temporal scales, which overlooks the temporal structure of an action and limits the utility on detecting actions with complex variations. In this paper, we propose to address the problem by introducing Gaussian kernels to dynamically optimize the temporal scale of each action proposal. Specifically, we present Gaussian Temporal Awareness Networks (GTAN) --- a new architecture that integrates the exploitation of temporal structure into a one-stage action localization framework. Technically, GTAN models temporal structure by learning a set of Gaussian kernels, one for each cell in the feature maps. Each Gaussian kernel corresponds to a particular interval of an action proposal, and a mixture of Gaussian kernels can further characterize action proposals with various lengths. Moreover, the values of each Gaussian curve reflect the contextual contributions to the localization of an action proposal. Extensive experiments are conducted on both the THUMOS14 and ActivityNet v1.3 datasets, and superior results are reported in comparison to state-of-the-art approaches. More remarkably, GTAN achieves 1.9% and 1.1% improvements in mAP over the best competitors on the two datasets.







1 Introduction

With the tremendous increase of online and personal media archives, people are generating, storing and consuming large collections of videos. This trend encourages the development of effective and efficient algorithms to intelligently parse video data. One fundamental challenge that underlies the success of these advances is action detection in videos, from both the temporal aspect [6, 9, 17, 32, 41, 45] and the spatio-temporal aspect [11, 18]. In this work, the main focus is temporal action detection/localization, which is to locate the exact timestamps of the start and end of an action and to recognize the action from a set of categories.

Figure 1: The intuition of a typical one-stage action localization method (upper) and our GTAN (lower). The typical method fixes the temporal scale in each feature map and seldom explores the temporal structure of an action. In contrast, temporal structure is taken into account in our GTAN through learning a set of Gaussian kernels.

One natural way of temporal action localization is to extend image object detection frameworks, e.g., SSD [23] or Faster R-CNN [29], from producing spatial bounding boxes in a 2D image to temporal localization of an action in a 1D sequence [4, 19]. The upper part of Figure 1 conceptualizes a typical process of one-stage action localization. In general, the frame-level or clip-level features in the video sequence are first aggregated into one feature map, and then multiple 1D temporal convolutional layers are devised to increase the size of temporal receptive fields and predict action proposals. However, the temporal scale corresponding to the cell in each feature map is fixed, making such methods unable to capture the inherent temporal structure of an action. As such, the one ground-truth action proposal in the green box is detected as three separate proposals in this case. Instead, we propose to alleviate the problem by exploring the temporal structure of an action through learning a Gaussian kernel for each cell, which dynamically indicates a particular interval of an action proposal. A mixture of Gaussian kernels can further be grouped to describe an action, which is more flexible for localizing action proposals with various lengths, as illustrated in the bottom part of Figure 1. More importantly, contextual information is naturally involved through feature pooling based on the weights of the Gaussian curve.

By delving into the temporal structure of an action, we present a novel Gaussian Temporal Awareness Networks (GTAN) architecture for one-stage action localization. Given a video, a 3D ConvNet is utilized as the backbone to extract clip-level features, which are sequentially concatenated into a feature map. A couple of convolutional layers plus a max-pooling layer are first employed to shorten the feature map and increase the temporal size of receptive fields. Then, a cascade of 1D temporal convolutional layers (anchor layers) continuously shortens the feature map and outputs anchor feature maps, each consisting of the features of the cells (anchors). On top of each anchor layer, a Gaussian kernel is learnt for each cell to dynamically predict a particular interval of the action proposal corresponding to that cell. Multiple Gaussian kernels can further be mixed to capture action proposals with arbitrary length. Through Gaussian pooling, the features of each cell are augmented by aggregating the features of contextual cells weighted by the values of the Gaussian curve for final action proposal prediction. The whole architecture is optimized end-to-end by minimizing one classification loss plus two regression losses, i.e., a localization loss and an overlap loss.

The main contribution of this work is the design of a one-stage architecture, GTAN, for addressing the issue of temporal action localization in videos. The solution also leads to an elegant view of how the temporal structure of an action should be leveraged for detecting actions with various lengths and how contextual information should be utilized for boosting temporal localization, which are problems not yet fully understood in the literature.

Figure 2: An overview of our Gaussian Temporal Awareness Networks (GTAN) architecture. The input video is encoded into a series of clip-level features via a 3D ConvNet, which are sequentially concatenated as a feature map. Two 1D convolutional layers plus one max-pooling layer follow to increase the temporal size of receptive fields. Eight 1D convolutional layers are cascaded to generate multiple feature maps at different temporal resolutions. On top of each feature map, a Gaussian kernel is learnt on each cell to predict a particular interval of an action proposal. Moreover, multiple Gaussian kernels with high overlap are merged into a larger one for detecting long actions with various lengths. Through Gaussian pooling, the action proposal is generated by aggregating the features of contextual cells weighted by the values of the Gaussian curve. GTAN is jointly optimized with an action classification loss plus two regression losses, i.e., a localization loss and an overlap loss for each proposal. Best viewed in the original color PDF.

2 Related Work

We briefly group the related works into two categories: temporal action proposal and temporal action detection. The former focuses on investigating how to precisely localize video segments which contain actions, while the latter further classifies these actions into known classes.

We summarize the approaches on temporal action proposal into two directions: content-independent and content-dependent proposal. The main stream of content-independent algorithms samples segments uniformly or via sliding windows [24, 35, 43], which leads to huge computation for further classification. In contrast, content-dependent methods, e.g., [3, 5, 7, 8, 21], utilize the labels of action proposals during training. For instance, Escorcia et al. [5] leverage Long Short-Term Memory cells to learn an appropriate encoding of a video sequence as a set of discriminative states to indicate proposal scores. Though the method avoids running sliding windows at multiple scales, it still requires executing an overlapping sliding window, which is inapplicable when the video duration is long. To address this problem, Single Stream Temporal proposal (SST) [3] generates proposals in a single pass with a recurrent GRU-based model, and Temporal Unit Regression Network (TURN) [8] builds video units in a pyramid manner to avoid window overlapping. Different from the above methods, which generate proposals in a fixed multi-scale manner, Boundary Sensitive Network (BSN) [21] localizes action boundaries based on three actionness curves in a more flexible way. Nevertheless, such actionness-based methods may fail to locate dense and short actions because of the difficulty of discriminating between very close starting and ending peaks in the curves.

Once the localization of action proposals completes, the natural way for temporal action detection is to further classify the proposals into known action classes, making the process two-stage [4, 12, 31, 32, 40, 45]. However, the separation of proposal generation and classification may result in sub-optimal solutions. To further facilitate temporal action detection, several one-stage techniques [2, 19, 42] have been proposed recently. For example, Single Stream Temporal Action Detection (SS-TAD) [2] utilizes a Recurrent Neural Network (RNN) based architecture to jointly learn action proposal and classification. Inspired by SSD [23], Lin et al. [19] devise 1D temporal convolutions to generate multiple temporal action anchors for action proposal and detection. Moreover, with the development of reinforcement learning, Yeung et al. [42] explore an RNN to learn a glimpse policy for predicting the starting and ending points of actions in an end-to-end manner. Nevertheless, most one-stage methods still face challenges in localizing all the action proposals due to predetermined temporal scales.

In short, our approach belongs to one-stage temporal action detection techniques. Unlike the aforementioned one-stage methods, which often predetermine the temporal scales of action proposals, our GTAN contributes by studying not only how to learn temporal structure through Gaussian kernels, but also how contextual information can be better leveraged for action localization.

3 Gaussian Temporal Awareness Networks

In this section we present the proposed Gaussian Temporal Awareness Networks (GTAN) in detail. Figure 2 illustrates an overview of our architecture for action localization. It consists of two main components: a base feature network and a cascade of 1D temporal convolutional layers with Gaussian kernels. The base feature network extracts a feature map from sequential video clips, which is fed into the cascaded 1D convolutional layers to generate multiple feature maps at different temporal resolutions. For each cell in a feature map, a Gaussian kernel is learnt to control the temporal scale of the action proposal corresponding to that cell as training proceeds. Furthermore, a Gaussian Kernel Grouping algorithm is devised to merge multiple Gaussian kernels with high overlap into a larger one for capturing long actions with arbitrary length. Each action proposal is generated by aggregating the features of contextual cells weighted by the values of the Gaussian curve. The whole network is jointly optimized with an action classification loss plus two regression losses, i.e., a localization loss and an overlap loss, which are utilized to learn the action category label, the adjustment of the default temporal boundary, and the overlap confidence score of each action proposal, respectively.

3.1 Base Feature Network

The ultimate target of action localization is to detect action instances in the temporal dimension. Given an input video, we first extract clip-level features from continuous clips via a 3D ConvNet, which captures both the appearance and motion information of the video. Specifically, a sequence of features $\{f_t\}_{t=1}^{T}$ is extracted from the 3D ConvNet, where $T$ is the temporal length. We concatenate all the features into one feature map and then feed the map into two 1D convolutional layers (“conv1” and “conv2”, with temporal kernel size 3, stride 1) plus one max-pooling layer (“pool1”, with temporal kernel size 3, stride 2) to increase the temporal size of the receptive fields. The base feature network is thus composed of the 3D ConvNet, the two 1D convolutional layers and the max-pooling layer. Its outputs are further exploited for action proposal generation.

3.2 Gaussian Kernel Learning

Given the feature map output from the base feature network, a natural way for one-stage action localization is to stack 1D temporal convolutional layers (anchor layers) to generate proposals (anchors) for classification and boundary regression. This kind of structure with a predetermined temporal scale in each anchor layer can capture action proposals whose temporal intervals are well aligned with the size of the receptive fields; however, it poses difficulty for detecting proposals with various lengths. This design limits the utility in localizing actions with complex variations.

To address this issue, we introduce temporal Gaussian kernels to dynamically control the temporal scales of proposals in each feature map. In the literature, there has been evidence on the use of Gaussian kernels for event detection in videos [26, 27]. In particular, as shown in Figure 2, eight 1D temporal convolutional layers (anchor layers) are first cascaded for action proposal generation at different temporal resolutions. For each cell in the feature map of an anchor layer, a Gaussian kernel is learnt to predict a particular interval of the action proposal corresponding to that cell. Formally, we denote the feature map of the $l$-th convolutional layer as $F_l \in \mathbb{R}^{T_l \times D_l}$, where $T_l$ and $D_l$ are the temporal length and feature dimension of the feature map. For a proposal whose center location is $c$, we leverage its temporal scale by a Gaussian kernel $G_c$. The standard deviation $\sigma$ of $G_c$ is learnt via a 1D convolutional layer on the feature map cell, and its value is constrained within the range $(0, 1)$ through a sigmoid operation. The weights of the Gaussian kernel are defined as

$$w_c(i) = \frac{1}{Z}\,\exp\!\left(-\frac{(i - c)^2}{2\sigma^2}\right), \quad i = 1, \dots, T_l,$$

where $Z$ is the normalizing constant. Taking the spirit from the theory that $\sigma$ can be considered a measure of width (Root Mean Square width, RMS) of a Gaussian kernel, we utilize $\sigma$ as the interval measure of the action proposal. Specifically, $\sigma$ can be multiplied with a certain ratio $\alpha$ to represent the default temporal boundary:

$$c_d = c, \qquad w_d = \alpha \cdot \sigma,$$

where $c_d$ and $w_d$ are the center location and width of the default temporal boundary and $\alpha$ represents the temporal scale ratio. The Gaussian kernel is also utilized for feature aggregation with a pooling mechanism to generate action proposals, which will be elaborated in Section 3.4.
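The kernel parameterization above can be sketched in a few lines of Python. This is a minimal sketch with our own function names and discrete cell indexing; the paper's exact normalization may differ:

```python
import math

def gaussian_weights(center, sigma, length):
    """Discrete Gaussian weights over `length` feature-map cells,
    centered at `center` with standard deviation `sigma`."""
    raw = [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
           for i in range(length)]
    z = sum(raw)  # normalizing constant
    return [r / z for r in raw]

def default_boundary(center, sigma, ratio):
    """Default temporal boundary of a proposal: the center is the cell
    location and the width is the RMS width scaled by `ratio`."""
    return center, ratio * sigma
```

A larger predicted `sigma` widens both the default boundary and the pooling support, which is how a single cell can cover a longer action.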

Figure 3: Visualization of Gaussian Kernel Grouping.

Compared to the conventional 1D convolutional anchor layer, which fixes the temporal scale in the $l$-th layer, ours employs dynamic temporal scales by leveraging the learned Gaussian kernel of each proposal to explore action instances with complex variations.

3.3 Gaussian Kernel Grouping

0:    Input: original Gaussian kernel set $G = \{g_1, \dots, g_T\}$, sorted by center location; Intersection over Union (IoU) threshold $\theta$;
0:    Output: mixed Gaussian kernel set $G_m$;
1:  Choose the beginning grouping position $s = 1$;
2:  Initialize the mixed Gaussian kernel set $G_m = \emptyset$;
3:  Initialize the base Gaussian kernel $g_b = g_s$ and the ending grouping position $e = s + 1$;
4:  while $e \le T$ do
5:     Compute the IoU value $v$ between kernel $g_b$ and $g_e$;
6:     if $v \ge \theta$ then
7:         Group $g_b$ and $g_e$ into $g'$ according to Eq.(3), and replace $g_b$ with the new mixed kernel $g'$;
8:     else
9:         Add kernel $g_b$ to the mixed kernel set $G_m$;
10:         $s = e$,   $g_b = g_e$;
11:     end if
12:     $e = e + 1$;
13:  end while
14:  Add the last base kernel $g_b$ to $G_m$ and return $G_m$
Algorithm 1 Gaussian Kernel Grouping

Through learning temporal Gaussian kernels, the temporal scales of most action instances can be characterized by the predicted standard deviation. However, if the learned Gaussian kernels span and overlap with each other, they may implicitly indicate a long action centered at a flexible position among these kernels. In other words, utilizing the center locations of these original Gaussian kernels to represent such a long proposal may not be appropriate. To alleviate this issue, we attempt to generate a set of new Gaussian kernels to predict the center locations and temporal scales of proposals for long actions. Inspired by the idea of temporal actionness grouping in [45], we propose a novel Gaussian Kernel Grouping algorithm for this purpose.

Figure 3 illustrates the process of temporal Gaussian Kernel Grouping. Given two adjacent Gaussian kernels $g_1$ and $g_2$ whose center locations and standard deviations are $(c_1, \sigma_1)$ and $(c_2, \sigma_2)$ with $c_1 < c_2$, we compute the temporal intersection and union between the two kernels by using the widths of the default temporal boundaries defined in Section 3.2, i.e., the intervals $[c_i - \frac{w_i}{2}, c_i + \frac{w_i}{2}]$ with $w_i = \alpha \sigma_i$. In the upper part of Figure 3, the length of the temporal intersection between the two kernels is $(c_1 + \frac{w_1}{2}) - (c_2 - \frac{w_2}{2})$, while the length of the union is $(c_2 + \frac{w_2}{2}) - (c_1 - \frac{w_1}{2})$. If the Intersection over Union (IoU) between the two kernels exceeds a certain threshold $\theta$, we merge them into one Gaussian kernel (bottom part of Figure 3). The new mixed Gaussian kernel is formulated as follows:

$$c_m = \frac{\left(c_1 - \frac{w_1}{2}\right) + \left(c_2 + \frac{w_2}{2}\right)}{2}, \qquad \sigma_m = \frac{1}{\alpha}\left[\left(c_2 + \frac{w_2}{2}\right) - \left(c_1 - \frac{w_1}{2}\right)\right], \quad (3)$$

i.e., the mixed kernel is centered at the midpoint of the union of the two default boundaries and its default width covers that union.
In each feature map, Algorithm 1 details the grouping steps to generate merged kernels.
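The grouping steps can be sketched as below. This is a hypothetical rendering: we merge by making the new default boundary cover the union of the two old ones, which is one plausible reading of Eq.(3):

```python
def temporal_iou(b1, b2):
    """IoU of two 1D intervals given as (start, end)."""
    inter = max(0.0, min(b1[1], b2[1]) - max(b1[0], b2[0]))
    union = max(b1[1], b2[1]) - min(b1[0], b2[0])
    return inter / union if union > 0 else 0.0

def group_kernels(kernels, ratio, theta):
    """Algorithm 1 sketch: `kernels` is a list of (center, sigma) sorted
    by center; adjacent kernels whose default boundaries overlap with
    IoU >= theta are merged into one wider kernel."""
    if not kernels:
        return []

    def boundary(c, s):
        w = ratio * s
        return (c - w / 2.0, c + w / 2.0)

    merged = []
    base = kernels[0]
    for cur in kernels[1:]:
        if temporal_iou(boundary(*base), boundary(*cur)) >= theta:
            # merge: the new default boundary covers the union of the two
            lo = min(boundary(*base)[0], boundary(*cur)[0])
            hi = max(boundary(*base)[1], boundary(*cur)[1])
            base = ((lo + hi) / 2.0, (hi - lo) / ratio)
        else:
            merged.append(base)
            base = cur
    merged.append(base)
    return merged
```

Because kernels are visited in order of center location and each one is compared only against the current base kernel, grouping is linear in the number of cells.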

3.4 Gaussian Pooling

Figure 4: Comparisons of manual extension plus average-pooling strategy (left) and Gaussian pooling strategy (right) for involving temporal contextual information of action proposals.

With the learned and mixed Gaussian kernels, we calculate the weighted sum of the feature map based on the values of the Gaussian curve and obtain the aggregated feature $f_c$. Specifically, given the weighting coefficients $\{w_c(i)\}_{i=1}^{T_l}$ of the Gaussian kernel at center location $c$ in the $l$-th layer, the aggregated feature for the proposal is formulated as

$$f_c = \sum_{i=1}^{T_l} w_c(i) \cdot F_l(i),$$

where the representation $f_c$ is further exploited for action classification and temporal boundary regression.

The above Gaussian pooling mechanism inherently takes the contextual contributions around each action proposal into account. In contrast to the manual extension plus average-pooling strategy for capturing video context (left part of Figure 4), ours provides an elegant alternative that adaptively learns the weighted representation (right part of Figure 4) based on the learned importance of each cell.
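Gaussian pooling itself is just a weighted sum over the cells of the feature map; average pooling is the special case of uniform weights. A minimal sketch (list-of-lists features for clarity):

```python
def gaussian_pool(features, weights):
    """Weighted sum of per-cell features: `features` is a T x D list of
    feature vectors, `weights` the Gaussian weights of one kernel over
    the same T cells."""
    dim = len(features[0])
    pooled = [0.0] * dim
    for w, f in zip(weights, features):
        for d in range(dim):
            pooled[d] += w * f[d]
    return pooled
```

With uniform weights this reduces to plain average pooling, which makes the comparison in Figure 4 concrete: the Gaussian variant simply replaces the uniform weights with learned ones.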

3.5 Network Optimization

Given the representation of each proposal from Gaussian pooling, three 1D convolutional layers are utilized in parallel to predict action classification scores, localization parameters and an overlap parameter, respectively. The action classification scores $p = (p^0, p^1, \dots, p^K)$ indicate the probabilities of belonging to $K$ action classes plus one “background” class. The localization parameters $(\Delta c, \Delta w)$ denote temporal offsets relative to the default center location $c_d$ and width $w_d$, which are leveraged to adjust the temporal coordinates:

$$\hat{c} = c_d + \lambda_1 w_d \Delta c, \qquad \hat{w} = w_d \exp(\lambda_2 \Delta w),$$

where $\hat{c}$, $\hat{w}$ are the refined center location and width of the proposal. The $\lambda_1$, $\lambda_2$ are utilized to control the impact of the temporal offsets. In particular, we define an overlap parameter $p_{ov}$ to represent the precise IoU prediction of the proposal, which benefits proposal re-ranking in prediction.
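The boundary adjustment follows the usual anchor-decoding pattern: the center is shifted by a fraction of the default width and the width is scaled exponentially. A sketch, with `lam_c`/`lam_w` standing in for the two control parameters (their names and default values are our assumption):

```python
import math

def decode_boundary(c_d, w_d, dc, dw, lam_c=0.1, lam_w=0.1):
    """SSD-style decoding of predicted offsets (dc, dw) against the
    default boundary (c_d, w_d); lam_c / lam_w damp the offsets."""
    c = c_d + lam_c * w_d * dc      # shift center by a fraction of width
    w = w_d * math.exp(lam_w * dw)  # scale width multiplicatively
    return c, w
```

The exponential on the width keeps the decoded width strictly positive regardless of the regressed offset.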

In the training stage, we accumulate all the proposals from Gaussian pooling and produce action instances through the prediction layer. The overall training objective of GTAN is formulated as a multi-task loss integrating the action classification loss ($L_{cls}$) and two regression losses, i.e., the localization loss ($L_{loc}$) and the overlap loss ($L_{ov}$):

$$L = L_{cls} + \beta_1 L_{loc} + \beta_2 L_{ov},$$

where $\beta_1$ and $\beta_2$ are trade-off parameters. Specifically, we measure the classification loss via the softmax loss:

$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k} \mathbb{1}(k = y_i)\,\log p_i^k,$$

where the indicator function $\mathbb{1}(k = y_i) = 1$ if $k$ equals the ground-truth action label $y_i$, and $0$ otherwise. We denote $v_i$ as the IoU between the default temporal boundary of a proposal and its corresponding closest ground truth. A proposal is treated as a foreground sample if $v_i$ exceeds a foreground threshold, and as a background sample if $v_i$ falls below a background threshold. The ratio between foreground and background samples is set as 1.0 during training. The localization loss is devised as the Smooth L1 loss [10] ($SL_1$) between each predicted foreground proposal and its closest ground-truth instance:

$$L_{loc} = \frac{1}{N_f}\sum_{i \in \mathcal{F}} \left[ SL_1(\hat{c}_i - c^g_i) + SL_1(\hat{w}_i - w^g_i) \right],$$

where $c^g_i$ and $w^g_i$ represent the center location and width of the proposal's closest ground-truth instance, respectively, and $\mathcal{F}$ is the set of the $N_f$ foreground proposals. For the overlap loss, we adopt the mean square error (MSE):

$$L_{ov} = \frac{1}{N}\sum_{i=1}^{N} \left( p_{ov,i} - v_i \right)^2.$$

Eventually, the whole network is trained in an end-to-end manner by penalizing the three losses.
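The three terms can be combined as below for a single proposal (a self-contained sketch; batch averaging and the foreground/background sampling are simplified away, and `lam1`/`lam2` stand in for the trade-off weights):

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear beyond |x| = 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def softmax_ce(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    z = sum(math.exp(v - m) for v in logits)
    return -(logits[label] - m - math.log(z))

def gtan_loss(logits, label, pred_box, gt_box, pred_iou, gt_iou,
              lam1=1.0, lam2=1.0):
    """Multi-task objective sketch: classification + localization
    (center, width) + overlap terms."""
    l_cls = softmax_ce(logits, label)
    l_loc = (smooth_l1(pred_box[0] - gt_box[0])
             + smooth_l1(pred_box[1] - gt_box[1]))
    l_ov = (pred_iou - gt_iou) ** 2
    return l_cls + lam1 * l_loc + lam2 * l_ov
```

With a perfect boundary and overlap prediction, only the classification term remains, which makes the decomposition easy to sanity-check.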

3.6 Prediction and Post-processing

During prediction, the final ranking score of each candidate action proposal depends on both the action classification scores $p$ and the overlap parameter $p_{ov}$:

$$s_{rank} = \max_k\, p^k \cdot p_{ov}.$$

Given a predicted action instance with refined boundary $(\hat{c}, \hat{w})$, predicted action label, and ranking score $s_{rank}$, we employ soft non-maximum suppression (soft-NMS) [1] for post-processing. In each iteration of soft-NMS, we take the action instance with the maximum ranking score as the reference $m$. The ranking score $s_i$ of every other instance is decayed or kept unchanged, according to the IoU computed with $m$:

$$s_i = \begin{cases} s_i, & \mathrm{IoU}(m, i) < N_t \\ s_i\, e^{-\mathrm{IoU}(m, i)^2 / \varepsilon}, & \mathrm{IoU}(m, i) \ge N_t \end{cases}$$

where $\varepsilon$ is the decay parameter and $N_t$ is the NMS threshold.
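The re-scoring loop can be sketched as follows, assuming the Gaussian decay variant of soft-NMS applied only above the threshold (the exact decay function and parameter names are our assumption):

```python
import math

def soft_nms(proposals, nt=0.5, eps=0.5):
    """Soft-NMS sketch: `proposals` are (start, end, score) tuples; the
    top-scoring proposal decays the scores of strongly overlapping ones
    (IoU >= nt) instead of discarding them."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    remaining = list(proposals)
    kept = []
    while remaining:
        m = max(remaining, key=lambda p: p[2])
        remaining.remove(m)
        kept.append(m)
        # decay overlapping scores; leave low-overlap ones untouched
        remaining = [
            (s, e, sc * math.exp(-iou(m, (s, e)) ** 2 / eps)
             if iou(m, (s, e)) >= nt else sc)
            for (s, e, sc) in remaining
        ]
    return kept
```

Unlike hard NMS, a near-duplicate proposal survives with a reduced score, which helps when two true actions genuinely overlap in time.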

4 Experiments

We empirically verify the merit of our GTAN by conducting the experiments of temporal action localization on two popular video recognition benchmarks, i.e., ActivityNet v1.3 [13] and THUMOS14 [16].

4.1 Datasets

The ActivityNet v1.3 dataset contains 19,994 videos in 200 classes collected from YouTube. The dataset is divided into three disjoint subsets: training, validation and testing, by 2:1:1. All the videos in the dataset have temporal annotations. The labels of testing set are not publicly available and the performances of action localization on ActivityNet dataset are reported on validation set. The THUMOS14 dataset has 1,010 videos for validation and 1,574 videos for testing from 20 classes. Among all the videos, there are 220 and 212 videos with temporal annotations in validation and testing set, respectively. Following [45], we train the model on validation set and perform evaluation on testing set.

4.2 Experimental Settings

Implementations. We utilize the Pseudo-3D [28] network as our 3D backbone. The network input is a 16-frame clip sampled at a fixed frame rate. The 2,048-way outputs from the pool5 layer are extracted as clip-level features. Table 1 summarizes the structures of the 1D anchor layers. Moreover, we choose three temporal scale ratios derived from [22]. The IoU threshold in Gaussian grouping is chosen by cross validation; the balancing parameters of the multi-task loss, the soft-NMS decay parameter and threshold, and the offset control parameters are likewise determined on a validation set. We implement GTAN on the Caffe platform. In all experiments, our networks are trained by stochastic gradient descent (SGD) with momentum. The initial learning rate is decreased step-wise after a fixed number of iterations on each dataset; the mini-batch size and weight decay are fixed across experiments.

id type kernel size #channels #stride   RF
1 conv_a1 3 512 2 11
2 conv_a2 3 512 2 19
3 conv_a3 3 1024 2 35
4 conv_a4 3 1024 2 67
5 conv_a5 3 2048 2 131
6 conv_a6 3 2048 2 259
7 conv_a7 3 4096 2 515
8 conv_a8 3 4096 2 1027
Table 1: The details of 1D temporal convolutional (anchor) layers. RF represents the size of receptive fields.
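The RF column of Table 1 follows from the standard recursion RF ← RF + (k − 1) · jump, where the jump is the product of all preceding strides. Starting from the base network (conv1/conv2 with stride 1 and pool1 with stride 2 give RF 7 and jump 2), the eight (kernel 3, stride 2) anchor layers reproduce the table exactly:

```python
def receptive_fields(base_rf, base_jump, layers):
    """Receptive field of each stacked 1D conv layer, given (kernel,
    stride) pairs and the base network's RF and cumulative stride."""
    rf, jump, out = base_rf, base_jump, []
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens RF by (k-1) * jump
        jump *= s             # stride compounds the jump for later layers
        out.append(rf)
    return out
```

Running it on the spec of Table 1 yields the listed RF sizes 11, 19, 35, ..., 1027, which is a quick way to sanity-check anchor-layer configurations.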

Evaluation Metrics.

We follow the official evaluation metrics of each dataset for the action detection task. On ActivityNet v1.3, the mean average precision (mAP) values with IoU thresholds between 0.5 and 0.95 (inclusive) with a step size of 0.05 are exploited for comparison. On THUMOS14, mAP at IoU thresholds from 0.1 to 0.5 is measured. We evaluate performances on the top-100 and top-200 returned proposals on ActivityNet v1.3 and THUMOS14, respectively.

4.3 Evaluation on Temporal Action Proposal

We first examine the performances on temporal action proposal task, which is to only assess the boundary quality of action proposals, regardless of action classes. We compare the following advanced approaches: (1) Structure Segment Network (SSN) [45] generates action proposals by temporal actionness grouping. (2) Single Shot Action Detection (SSAD) [19] is the 1D variant version of Single Shot Detection [23], which generates action proposals by multiple temporal anchor layers. (3) Convolution-De-Convolution Network (CDC) [31] builds a 3D Conv-Deconv network to precisely localize the boundary of action instances at frame level. (4) Boundary Sensitive Network (BSN) [21] locates temporal boundaries with three actionness curves and reranks proposals with neural networks. (5) Single Stream Temporal action proposal (SST) [3] builds a RNN-based action proposal network, which could be implemented in a single stream over long video sequences to produce action proposals. (6) Complementary Temporal Action Proposal (CTAP) [7] balances the advantages and disadvantages between sliding window and actionness grouping approaches for final action proposal.

Figure 5: (a) Recall-IoU and (b) AR-AN curve on ActivityNet.
Approach    AR (THUMOS14)   AR (ActivityNet)   AUC (ActivityNet)   AUC (test server)
SST [3]     37.9            -                  -                   -
CTAP [7]    50.1            73.2               65.7                -
BSN [21]    53.2            74.2               66.2                66.3
GTAN        54.3            74.8               67.1                67.4
Table 2: AR and AUC values on action proposal. IoU threshold: [0.5:0.05:1.0] for THUMOS14, [0.5:0.05:0.95] for ActivityNet.
Figure 6: Visualization of action localization on a video example from ActivityNet by GTAN. The Gaussian kernels are learnt on the outputs of “conv_a5” layer. The second and third kernels are mixed into a larger one. The default boxes (DB) are predicted by Gaussian kernels.

We adopt the standard metric of Average Recall at different IoUs (AR) for action proposal on both datasets. Moreover, following the official evaluations of ActivityNet, we plot both the Recall-IoU curve and the Average Recall vs. Average Number of proposals per video (AR-AN) curve in Figure 5. In addition to the AR metric, the area under the AR-AN curve (AUC) is also reported in Table 2, as AUC is the measure used on the ActivityNet test server. Overall, the performances across different metrics and the two datasets consistently indicate that our GTAN leads to a performance boost against the baselines. In particular, the AR of GTAN achieves 54.3% and 74.8% on THUMOS14 and ActivityNet respectively, an absolute improvement over the best competitor BSN of 1.1% and 0.6%. GTAN surpasses BSN by 1.1% in AUC when evaluated on the online ActivityNet test server. The results demonstrate the advantages of exploiting temporal structure for localizing actions. Furthermore, as shown in Figure 5, the improvements are consistently attained across different IoUs. In terms of the AR-AN curve, GTAN also exhibits better performance for different numbers of top returned proposals. Even when fewer than 10 proposals are returned, GTAN still shows apparent improvements, indicating that GTAN benefits from the mechanism of dynamically optimizing the temporal scale of each proposal and that the correct proposals are ranked at the top.

4.4 Evaluation on Gaussian Kernel and Grouping

Approach            THUMOS14           ActivityNet v1.3
Fixed Scale         ✓    ✓    ✓        ✓    ✓    ✓
Gaussian Kernel          ✓    ✓             ✓    ✓
Gaussian Grouping             ✓                  ✓
mAP                 33.5 37.1 38.2     29.8 31.6 34.3
Table 3: Performance contribution of each design in GTAN.
Approach    THUMOS14          ActivityNet v1.3
            Long     All      Long     All
GTAN⁻       22.1     37.1     49.4     31.6
GTAN        25.9     38.2     54.2     34.3
Table 4: The evaluations of Gaussian grouping on actions with different lengths. GTAN⁻ excludes Gaussian grouping from GTAN.

Next, we study how each design in GTAN influences the overall performance on temporal action localization task. Fixed Scale simply employs a fixed temporal interval for each cell or anchor in an anchor layer and such way is adopted in SSAD. Gaussian Kernel leverages the idea of learning one Gaussian kernel for each anchor to model temporal structure of an action and dynamically predict temporal scale of each action proposal. Gaussian Grouping further mixes multiple Gaussian kernels to characterize action proposals with various length. In the latter two cases, Gaussian pooling is utilized to augment the features of each anchor with contextual information.

Table 3 details the mAP performances when considering one more factor at a time in GTAN on both datasets. Gaussian Kernel boosts the mAP from 33.5% to 37.1% and from 29.8% to 31.6% on THUMOS14 and ActivityNet v1.3, respectively. This reveals the weakness of Fixed Scale, where the temporal scale of each anchor is independent of the temporal property of the action proposal. Gaussian Kernel, in comparison, models temporal structure and predicts a particular interval for each anchor on the fly. As such, the temporal localization or boundary of each action proposal is more accurate. Moreover, the features of each action proposal are simultaneously enhanced by contextual aggregation through Gaussian pooling, leading to better action classification. Gaussian grouping further contributes an mAP increase of 1.1% and 2.7%, respectively. The results verify the effectiveness and flexibility of mixing multiple Gaussian kernels to capture action proposals with arbitrary length. To better validate the impact of Gaussian grouping, we additionally evaluate GTAN on long action proposals. Here, we consider actions longer than 128 frames in THUMOS14 and 2,048 frames in ActivityNet v1.3 as long actions, since the average duration of action instances in THUMOS14 is 4 seconds, which is much shorter than that (50 seconds) of ActivityNet. Table 4 shows the mAP comparisons between GTAN and its variant GTAN⁻, which excludes Gaussian grouping. As expected, a larger degree of improvement is attained on long action proposals by involving Gaussian grouping.

Figure 7: (a) AUC and (b) Average mAP performances of SSAD and GTAN with different number of anchor layers on temporal action proposal and localization tasks in ActivityNet.

4.5 Evaluation on the Number of Anchor Layers

In existing one-stage methods, e.g., SSAD, the temporal scale is fixed in each anchor layer and the expansion to multiple temporal scales is implemented by increasing the number of anchor layers. Instead, our GTAN learns one Gaussian kernel for each anchor in every anchor layer and dynamically predicts the temporal scale of the action proposal corresponding to each anchor. The grouping of multiple Gaussian kernels makes the temporal scale even more flexible. Even with a small number of anchor layers, our GTAN should in theory be more capable of localizing action proposals with various lengths. Figure 7 empirically compares the performances between SSAD and our GTAN on ActivityNet v1.3 when capitalizing on different numbers of anchor layers. As indicated by the results, GTAN consistently outperforms SSAD across anchor-layer depths from 4 to 8 on both the temporal action proposal and localization tasks. In general, more anchor layers provide better AUC and mAP performances. As expected, the performance of SSAD decreases more sharply than that of GTAN when the number of anchor layers is reduced. In the extreme case of 4 layers, GTAN still achieves 26.77% in average mAP while SSAD only reaches 5.12%, which again confirms the advantage of exploring temporal structure and predicting the temporal scale of action proposals.

4.6 Comparisons with State-of-the-Art

Approach 0.1 0.2 0.3 0.4 0.5
Two-stage Action Localization
Wang [37] 18.2 17.0 14.0 11.7 8.3
FTP [14] - - - - 13.5
DAP [5] - - - - 13.9
Oneata [25] 36.6 33.6 27.0 20.8 14.4
Yuan [43] 51.4 42.6 33.6 26.1 18.8
S-CNN [32] 47.7 43.5 36.3 28.7 19.0
SST [3] - - 37.8 - 23.0
CDC [31] - - 40.1 29.4 23.3
TURN [8] 54.0 50.9 44.1 34.9 25.6
R-C3D [40] 54.5 51.5 44.8 35.6 28.9
SSN [45] 66.0 59.4 51.9 41.0 29.8
CTAP [7] - - - - 29.9
BSN [21] - - 53.5 45.0 36.9
One-stage Action Localization
Richard [30] 39.7 35.7 30.0 23.2 15.2
Yeung [42] 48.9 44.0 36.0 26.4 17.1
SMS [44] 51.0 45.2 36.5 27.8 17.8
SSAD [19] 50.1 47.8 43.0 35.0 24.6
SS-TAD [2] - - 45.7 - 29.2
GTAN (C3D) 67.2 61.1 56.9 46.5 37.9
GTAN 69.1 63.7 57.8 47.2 38.8
Table 5: Performance comparisons of temporal action detection on THUMOS14, measured by mAP at different IoU thresholds.

We compare with several state-of-the-art techniques on the THUMOS14 and ActivityNet v1.3 datasets. Table 5 lists the mAP performances at different IoU thresholds on THUMOS14. For fair comparison, we additionally implement GTAN using C3D [36] as the 3D ConvNet backbone. The results across IoU values consistently indicate that GTAN exhibits better performance than the others. In particular, the mAP@0.5 of GTAN achieves 37.9% with the C3D backbone, an improvement of 13.3% and 8.7% over the one-stage approaches SSAD and SS-TAD, respectively, which also employ C3D. Compared to the strongest two-stage method, BSN, our GTAN yields 1.0% and 1.9% performance gains with the C3D and P3D backbones, respectively. The superior results of GTAN demonstrate the advantage of modeling the temporal structure of actions through Gaussian kernels.
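The mAP numbers in Table 5 hinge on the temporal IoU between a predicted segment and a ground-truth segment: a prediction counts as correct at threshold 0.5 only if this overlap reaches 0.5. A minimal sketch of that 1D overlap measure (function name is ours):

```python
def temporal_iou(seg_a, seg_b):
    """IoU between two temporal segments given as [start, end] in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. segments [0, 10] and [5, 15] overlap for 5 s out of a 15 s union,
# so their temporal IoU is 1/3: a hit at threshold 0.3, a miss at 0.5.
```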

On ActivityNet v1.3, we summarize the performance comparisons on both the validation and testing sets in Table 6. For the testing set, we submitted the results of GTAN to the online ActivityNet test server and evaluated the performance on the localization task. GTAN surpasses the best competitor, BSN, by 0.6% and 1.1% on the validation and testing sets, respectively. Moreover, our one-stage GTAN is potentially simpler and faster than two-stage solutions, and thus tends to be more applicable to action localization in videos.

Figure 6 showcases the temporal localization results of one video from ActivityNet, together with the Gaussian kernels and grouping learnt on the outputs of the “conv_a5” layer. As shown in the figure, the Gaussian kernels nicely capture the temporal structure of each action proposal and predict accurate default boxes for the final regression and classification.

ActivityNet v1.3, mAP (%)
Approach | validation: 0.5, 0.75, 0.95, Average | testing: Average
Wang [38] 45.11 4.11 0.05 16.41 14.62
Singh [33] 26.01 15.22 2.61 14.62 17.68
Singh [34] 22.71 10.82 0.33 11.31 17.83
CDC [31] 45.30 26.00 0.20 23.80 22.90
TAG-D [39] 39.12 23.48 5.49 23.98 26.05
SSN [45] - - - - 28.28
Lin [20] 48.99 32.91 7.87 32.26 33.40
BSN [21] 52.50 33.53 8.85 33.72 34.42
GTAN 52.61 34.14 8.91 34.31 35.54
Table 6: Comparisons of temporal action detection on ActivityNet v1.3.

5 Conclusions

We have presented Gaussian Temporal Awareness Networks (GTAN), which explore the temporal structure of actions for temporal action localization. In particular, we study the problem of modeling temporal structure through learning a set of Gaussian kernels that dynamically predict the temporal scale of each action proposal. To verify our claim, we have devised a one-stage action localization framework which learns one Gaussian kernel for each cell in every anchor layer. Multiple Gaussian kernels can further be mixed to represent action proposals of various lengths. Another advantage of using Gaussian kernels is to enhance the features of action proposals by leveraging contextual information through Gaussian pooling, which benefits the final regression and classification. Experiments conducted on two video datasets, i.e., THUMOS14 and ActivityNet v1.3, validate our proposal and analysis. Performance improvements are also observed over both one-stage and two-stage state-of-the-art techniques.

Acknowledgments This work was supported in part by the National Key R&D Program of China under contract No. 2017YFB1002203 and NSFC No. 61872329.


  • [1] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis (2017) Soft-NMS – Improving Object Detection With One Line of Code. In ICCV, Cited by: §3.6.
  • [2] S. Buch, V. Escorcia, B. Ghanem, L. Fei-Fei, and J. C. Niebles (2017) End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos. In BMVC, Cited by: §2, Table 5.
  • [3] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles (2017) SST: Single-Stream Temporal Action Proposals. In CVPR, Cited by: §2, §4.3, Table 2, Table 5.
  • [4] Y. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar (2018) Rethinking the Faster R-CNN Architecture for Temporal Action Localization. In CVPR, Cited by: §1, §2.
  • [5] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem (2016) DAPs: Deep Action Proposals for Action Understanding. In ECCV, Cited by: §2, Table 5.
  • [6] A. Gaidon, Z. Harchaoui, and C. Schmid (2013) Temporal Localization of Actions with Actoms. IEEE Trans. on PAMI 35 (11), pp. 2782–2795. Cited by: §1.
  • [7] J. Gao, K. Chen, and R. Nevatia (2018) CTAP: Complementary Temporal Action Proposal Generation. In ECCV, Cited by: §2, §4.3, Table 2, Table 5.
  • [8] J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia (2017) TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals. In ICCV, Cited by: §2, Table 5.
  • [9] R. D. Geest, E. Gavves, A. Ghodrati, Z. Li, C. Snoek, and T. Tuytelaars (2016) Online Action Detection. In ECCV, Cited by: §1.
  • [10] R. Girshick (2015) Fast R-CNN. In ICCV, Cited by: §3.5.
  • [11] G. Gkioxari and J. Malik (2015) Finding Action Tubes. In CVPR, Cited by: §1.
  • [12] F. C. Heilbron, W. Barrios, V. Escorica, and B. Ghanem (2017) SCC: Semantic Context Cascade for Efficient Action Detection. In CVPR, Cited by: §2.
  • [13] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles (2015) ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In CVPR, Cited by: §4.
  • [14] F. C. Heilbron, J. C. Niebles, and B. Ghanem (2016) Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos. In CVPR, Cited by: Table 5.
  • [15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093. Cited by: §4.2.
  • [16] Y. Jiang, J. Liu, A. R. Zamir, and G. Toderici (2014) THUMOS challenge: Action recognition with a large number of classes. Cited by: §4.
  • [17] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager (2017) Temporal Convolutional Networks for Action Segmentation and Detection. In CVPR, Cited by: §1.
  • [18] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei (2018) Recurrent Tubelet Proposal and Recognition Networks for Action Detection. In ECCV, Cited by: §1.
  • [19] T. Lin, X. Zhao, and Z. Shou (2017) Single Shot Temporal Action Detection. In ACM MM, Cited by: §1, §2, §4.3, Table 5.
  • [20] T. Lin, X. Zhao, and Z. Shou (2017) Temporal convolution based action proposal: Submission to activitynet 2017. arXiv preprint arXiv:1707.06750. Cited by: Table 6.
  • [21] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang (2018) BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. In ECCV, Cited by: §2, §4.3, Table 2, Table 5, Table 6.
  • [22] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar (2017) Focal Loss for Dense Object Detection. In ICCV, Cited by: §4.2.
  • [23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: Single Shot MultiBox Detector. In ECCV, Cited by: §1, §2, §4.3.
  • [24] D. Oneata, J. Verbeek, and C. Schmid (2014) Action and Event Recognition with Fisher Vectors on a Compact Feature Set. In ICCV, Cited by: §2.
  • [25] D. Oneata, J. Verbeek, and C. Schmid (2014) The LEAR submission at Thumos 2014. In ECCV THUMOS Challenge Workshop, Cited by: Table 5.
  • [26] A. Piergiovanni, C. Fan, and M. S. Ryoo (2017) Learning Latent Subevents in Activity Videos Using Temporal Attention Filters. In AAAI, Cited by: §3.2.
  • [27] A. Piergiovanni and M. S. Ryoo (2018) Learning Latent Super-Events to Detect Multiple Activities in Videos. In CVPR, Cited by: §3.2.
  • [28] Z. Qiu, T. Yao, and T. Mei (2017) Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. In ICCV, Cited by: §4.2.
  • [29] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, Cited by: §1.
  • [30] A. Richard and J. Gall (2016) Temporal Action Detection using a Statistical Language Model. In CVPR, Cited by: Table 5.
  • [31] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S. Chang (2017) CDC: Convolutional-De-Convolutional Network for Precise Temporal Action Localization in Untrimmed Videos. In CVPR, Cited by: §2, §4.3, Table 5, Table 6.
  • [32] Z. Shou, D. Wang, and S. Chang (2016) Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs. In CVPR, Cited by: §1, §2, Table 5.
  • [33] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao (2016) A Multi-Stream Bi-Directional Recurrent Neural Network for Fine-Grained Action Detection. In CVPR, Cited by: Table 6.
  • [34] G. Singh and F. Cuzzolin (2016) Untrimmed Video Classification for Activity Detection: submission to ActivityNet Challenge. arXiv preprint arXiv:1607.01979. Cited by: Table 6.
  • [35] K. Tang, B. Yao, L. Fei-Fei, and D. Koller (2013) Combining the Right Features for Complex Event Recognition. In ICCV, Cited by: §2.
  • [36] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning Spatiotemporal Features with 3D Convolutional Networks. In ICCV, Cited by: §4.6.
  • [37] L. Wang, Y. Qiao, and X. Tang (2014) Action Recognition and Detection by Combining Motion and Appearance Feature. In ECCV THUMOS Challenge Workshop, Cited by: Table 5.
  • [38] R. Wang and D. Tao (2016) UTS at activitynet 2016. In CVPR ActivityNet Challenge Workshop, Cited by: Table 6.
  • [39] Y. Xiong, Y. Zhao, L. Wang, D. Lin, and X. Tang (2017) A Pursuit of Temporal Accuracy in General Activity Detection. arXiv preprint arXiv:1703.02716. Cited by: Table 6.
  • [40] H. Xu, A. Das, and K. Saenko (2017) R-C3D: Region Convolutional 3D Network for Temporal Activity Detection. In ICCV, Cited by: §2, Table 5.
  • [41] T. Yao, Y. Li, Z. Qiu, F. Long, Y. Pan, D. Li, and T. Mei (2017) MSR Asia MSM at ActivityNet Challenge 2017: Trimmed Action Recognition, Temporal Action Proposals and Dense-Captioning Events in Videos. In CVPR ActivityNet Challenge Workshop, Cited by: §1.
  • [42] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei (2016) End-to-end Learning of Action Detection from Frame Glimpses in Videos. In CVPR, Cited by: §2, Table 5.
  • [43] J. Yuan, B. Ni, X. Yang, and A. A. Kassim (2016) Temporal Action Localization With Pyramid of Score Distribution Features. In CVPR, Cited by: §2, Table 5.
  • [44] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng (2017) Temporal Action Localization by Structured Maximal Sums. In CVPR, Cited by: Table 5.
  • [45] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin (2017) Temporal Action Detection with Structured Segment Networks. In ICCV, Cited by: §1, §2, §3.3, §4.1, §4.3, Table 5, Table 6.