Precise Temporal Action Localization by Evolving Temporal Proposals

04/13/2018 ∙ by Haonan Qiu, et al. ∙ East China Normal University, University of Washington

Locating actions in long untrimmed videos has been a challenging problem in video content analysis. Existing action localization approaches remain unsatisfactory at precisely determining the beginning and the end of an action. Imitating the human perception procedure of observation and refinement, we propose a novel three-phase action localization framework. Our framework embeds an Actionness Network that generates initial proposals through frame-wise similarity grouping, followed by a Refinement Network that adjusts the boundaries of these proposals. Finally, the refined proposals are sent to a Localization Network for further fine-grained location regression. The whole process can be regarded as multi-stage refinement using a novel non-local pyramid feature under various temporal granularities. We evaluate our framework on the THUMOS14 benchmark and obtain a significant improvement over state-of-the-art approaches. Specifically, the performance gain is remarkable under precise localization with high IoU thresholds. Our proposed framework achieves an mAP@IoU=0.5 of 34.2%.




1. Introduction

With ubiquitous video capture devices, a huge number of videos are produced, which creates a great demand for automatic video content analysis (Lu et al., 2016; Zheng et al., 2010; Simonyan and Zisserman, 2014; Ye et al., 2015; Gao et al., 2017b; Yang et al., 2017; Shou et al., 2016; Yeung et al., 2016; Shou et al., 2017; Yuan et al., 2016; Zhao et al., 2017; Jiang et al., 2013). Action localization and recognition answer where an action is in the video and what it is. Since most collected videos are untrimmed, action localization is the first and foremost step in video action analysis. In image object detection, the “Proposal + Classification” framework has demonstrated its capability (Girshick et al., 2014; Ren et al., 2015; Liu et al., 2016; Redmon et al., 2016), and it transfers easily to other detection tasks, such as detecting faces (Jiang and Learned-Miller, 2017), text (Ma et al., 2018) and vehicles (Wang et al., 2017b). For video action localization under this paradigm, high-quality temporal proposals are equally crucial. A promising temporal proposal candidate should contain the action of interest and have a high Intersection-over-Union (IoU) overlap with the groundtruth.

Recently, training deep neural networks to extract the spatial-temporal features of proposals has been widely used in action localization. Both the regressor and the classifier are trained to determine the fidelity of the boundary and the completeness of an action based on these proposal features. Nevertheless, regressing the boundary and judging the completeness of an action are nontrivial tasks due to two difficulties. First, compared to object proposals, which have solid internal consistency, the boundary between an action and the background is vague, because the variations between consecutive video frames are subtle. This may lead to an unsteady or even incorrect boundary regression, especially at frame-level granularity. Second, judging the completeness of an action against the background is very subjective. Actions are usually complex and diverse, which makes it hard to discern action snippets from the background. Figure 1 illustrates an example of the unclear boundary between the action and the background in some action videos. These two issues impede the accuracy of action localization in long untrimmed videos.

To precisely detect actions in untrimmed videos, many efforts have been made to generate well-anchored temporal proposals and to judge the actionness of the proposals. Grouping sampled snippets of various granularities in a bottom-up manner can generate proposals with vastly varying lengths (e.g., in (Yuan et al., 2016; Xiong et al., 2017)). However, snippet grouping depends heavily on the choice of the similarity metric and the threshold, which determines how clear the action boundaries are. In this paper, we propose an evolving framework in which the boundary of temporal proposals is precisely predicted with a three-phase algorithm. The first phase generates temporal proposals with an Actionness Network (AN) that employs frame-level features. The goal is to discern the actionness of the proposals. The second phase refines the boundaries of these proposals through a Refinement Network (RN) to improve the fidelity of the proposal boundaries. We find that coarse-grained temporal boundary regression is more effective and stable: RN segments a proposal into smaller units and adjusts the boundary based on these unit-level features. Finally, the adjusted proposals are sent to the third phase, namely the Localization Network (LN), to precisely refine the action boundary. With our evolving framework, we demonstrate a significant improvement over boundaries regressed directly from the original proposals.

Our contributions are summarized as follows.

  • We propose a novel three-phase evolving temporal proposal framework for action localization using multi-stage temporal coordinate regression under various temporal granularities.

  • We exploit unit-based temporal coordinate regression to the boundaries of the proposals to precisely locate the action.

  • We use non-local pyramid features to effectively model an action, which is capable of discriminating between completeness and incompleteness of a proposal.

  • Our proposed framework outperforms state-of-the-art approaches, especially in precise action localization.

The remainder of this paper is organized as follows. Section 2 briefly reviews the literature on temporal action localization and related topics. Section 3 presents the details of the evolving architecture of our framework. We empirically evaluate our proposed framework in Section 4. Finally, Section 5 concludes this paper.

2. Related Works

2.1. Action Recognition

Action recognition has been widely studied during the last few years. Early works focus on hand-crafted feature engineering. Various features have been invented, for instance, space-time interest points (STIP) (Laptev, 2005) and histogram of optical flow (HOF) (Laptev et al., 2008). With the development of deep neural networks, more effective features can be extracted. Multi-stream features are fused for action recognition and show remarkable performance (Simonyan and Zisserman, 2014; Ye et al., 2015; Wu et al., 2016).

2.2. Temporal Action Proposal

Temporal action proposal is the elementary factor in the “Proposal + Classification” paradigm. Escorcia et al. use LSTM networks to encode a video stream and produce the proposals inside the video stream (Escorcia et al., 2016). Buch et al. develop a new effective deep architecture to generate single-stream temporal proposals without dividing the video into short snippets (Buch et al., 2017). Gao et al. use unit-level temporal coordinate regression to predict the temporal proposals and refine the proposal boundaries (Gao et al., 2017b). Gao et al. exploit cascaded boundary regression to refine the sliding windows as class-agnostic proposals (Gao et al., 2017a). Xiong et al. propose a learning-based bottom-up proposal generation scheme called temporal actionness grouping (TAG) (Xiong et al., 2017).

Figure 2. The evolving architecture of our framework. Given a video, we use the Actionness Network to generate initial temporal action proposals. For each proposal, a few video units are extracted and represented by the unit-level features. The features are input to the GRU-based sequence encoder, which produces a refined temporal proposal. The predictions of all proposals are computed in the Localization Network with the non-local pyramid features.

2.3. Temporal Action Localization

Action localization aims to predict where an action begins and ends in untrimmed videos. Shou et al. propose a Segment-CNN (S-CNN) proposal network and address temporal action localization using 3D ConvNets (C3D) features, which involves two stages, namely a proposal network and a localization network (Shou et al., 2016). Xu et al. introduce the regional C3D model, which uses a three-dimensional fully convolutional network to generate the temporal regions for activity detection (Xu et al., 2017). Yeung et al. propose an end-to-end learning framework using LSTM and CNN features separately, and reinforcement learning for action localization. However, each module is sub-optimal and the action boundaries are not predicted accurately (Yeung et al., 2016). Singh et al. extend the two-stream framework to a multi-stream network and use a bi-directional LSTM network to encode temporal information for fine-grained action detection (Singh et al., 2016). Ma et al. address the problem of early action detection by training an LSTM network with a ranking loss and merging the detection spans based on the frame-wise prediction scores generated by the LSTM (Ma et al., 2016). Yang et al. propose the temporal preservation convolutional network (TPC), which preserves temporal resolution while down-sampling the spatial resolution to detect activities (Yang et al., 2017). Zhao et al. propose the structured segment network to model activities via a structured temporal pyramid (Zhao et al., 2017). Yuan et al. introduce a pyramid of score distribution feature to capture the motion information for action detection (Yuan et al., 2016).

3. Evolving Temporal Proposals

We illustrate the evolving architecture and networks of our deep learning framework in Figure 2, which is composed of the following components.

  • Actionness Network (AN). Each frame of the video is first fed to the network to compute frame-level scores. These actionness scores are used to generate initial proposals.

  • Refinement Network (RN). This network takes the proposals from AN as input. For a single proposal, several video units are cropped with a fixed stride and length. The unit-level features are extracted and fed into a GRU-based sequence encoder. After connecting to a fully connected layer with the target of proposal position, the corresponding refined temporal proposals are produced.

  • Localization Network (LN). The LN takes the refined proposals from the previous network as input. Non-local pyramid features are extracted from the proposals and are sent to the classifier with a multi-task loss. The final outputs are the position of evolved temporal proposals and the appropriate scores.

We describe below each individual component in detail.

3.1. Actionness Network


1:Input: Action predict score vectors, minimum (maximum) length, threshold
2:Output: Candidate Proposals Set
3:function Proposal()
4:     Average score vector
5:     for  1: do
6:         ConnComponent()
7:         Smooth()
8:         ConnComponent()
9:         Merge , into
10:     end for
11:      NMS()
12:     return
13:end function
14:function ConnComponent()
15:     Candidate Set
16:     Proposal Set
17:     while  do
18:         for  do
20:         end for
21:         for each pair  do
22:              if  then
23:                  Remove from
24:                  Expand to
25:              end if
26:         end for
27:         for  do
28:              Move to , if
29:         end for
30:     end while
31:     return
32:end function
Algorithm 1 Proposal generation in AN

Initial temporal action proposals are produced by this network. Actionness is introduced and discussed in (Xiong et al., 2017) as a measure of the possibility that a video snippet resides in any activity instance. The concept of actionness is similar to the saliency measurement that indicates the salient visual stimuli for a given image (Itti and Koch, 2001; Harel et al., 2006; Alexe et al., 2010; Lu et al., 2011; Cheng et al., 2014; Lu et al., 2012; Lu and Shapiro, 2017). Features used for learning a model of saliency include low-level features (e.g., orientation and color), mid-level features (e.g., surface and horizon line), and high-level features (e.g., face and objects). Saliency can be divided into generic saliency (the degree of uniqueness of the neighborhood w.r.t. the entire image) and class-specific saliency (the visual characteristics that best distinguish a particular object class from others). For actionness, previous work (Xiong et al., 2017) handles the generic case by learning a binary classifier. Given the limited size of the video dataset relative to the complexity of the actionness concept, a binary classifier alone may not achieve ideal classification results. Therefore, in this paper we focus on frame-level class-specific actionness. A classification network is trained on the frames from the UCF101 dataset (Soomro et al., 2012). This step accepts arbitrary networks designed for the image classification task, and actionness results may benefit from an effective structure. Here we leverage the ResNet architecture (He et al., 2016) by first loading the weights pre-trained on the ImageNet dataset (Russakovsky et al., 2015) and then fine-tuning on the UCF101 dataset (Soomro et al., 2012). We denote a video as $V = \{f_t\}_{t=1}^{T}$, where $T$ is the number of frames and $f_t$ is the $t$-th frame. Each $f_t$ is fed to the classifier to get the action prediction score. The predicted scores from all the frames are stacked to form the action score vectors $S = \{s_t\}_{t=1}^{T}$, where $s_t = (s_t^1, \dots, s_t^C)$ and $s_t^c$ is the score of action $c$ for frame $t$.

To generate initial proposals from these actionness vectors, Algorithm 1 is used. Our basic assumption is that a snippet containing an action should consist of frames whose actionness scores are higher than a threshold. Meanwhile, we observe that action snippets usually have limited durations, so minimum and maximum frame lengths are imposed. Given the scores, a connected component scheme is devised to merge neighboring regions with high scores (Lines 14-32). Besides the original curve of the action scores, a curve smoothed with Gaussian kernel density estimation is also used (Lines 7-8). After that, the proposal set is generated by non-maximum suppression (NMS) of the candidate snippets.
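The grouping step can be illustrated with a simplified sketch for a single score curve. It is not the authors' implementation: it keeps only the thresholding, the length limits, and the use of both the raw and a smoothed curve, and it omits the merge/expand logic of ConnComponent and the final NMS. The function names, the Gaussian width `sigma`, and all default values are assumptions made for illustration.

```python
import math

def gaussian_smooth(scores, sigma=2.0):
    """Smooth a 1-D score curve with a truncated, renormalized Gaussian kernel."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed = []
    for t in range(len(scores)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            if 0 <= t + k < len(scores):
                num += w * scores[t + k]
                den += w
        smoothed.append(num / den)
    return smoothed

def connected_segments(scores, threshold, min_len, max_len):
    """Return [start, end) frame runs whose scores all exceed the threshold."""
    segments, start = [], None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for t, s in enumerate(list(scores) + [float("-inf")]):
        if s > threshold and start is None:
            start = t
        elif s <= threshold and start is not None:
            if min_len <= t - start <= max_len:
                segments.append((start, t))
            start = None
    return segments

def initial_proposals(scores, threshold=0.5, min_len=3, max_len=100, sigma=2.0):
    """Union of the runs found on the raw curve and on the smoothed curve."""
    raw = connected_segments(scores, threshold, min_len, max_len)
    smooth = connected_segments(gaussian_smooth(scores, sigma), threshold, min_len, max_len)
    return sorted(set(raw) | set(smooth))
```

Running both curves through the same grouping routine mirrors the idea that the smoothed curve bridges short dips in the scores while the raw curve preserves sharp boundaries.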

3.2. Refinement Network

Refining the proposals is important for building an effective action localization system. While the Actionness Network uses frame-level information, the Refinement Network considers the context of short video units. A unit is represented as $u = (V, t_s, l_u)$, where $V$ represents the video, $t_s$ is the starting frame and $l_u$ is the unit length. The borders of actions are usually vague, so it is hard to train a model for precise boundary regression directly. For each video, we generate units by cropping the proposal with a fixed length and stride. Any spatial or temporal features can be used to represent the unit. Here we employ the non-local pyramid features, which will be described in Section 3.3.
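As a rough illustration of the cropping step, the sketch below splits a proposal into fixed-length units. The unit length of 64 frames and the stride of half the unit length follow the settings reported in the experiments; the function name and the choice to keep every unit that starts inside the proposal (even if it runs past the end) are assumptions.

```python
def crop_units(start, end, unit_len=64, stride=None):
    """Split the proposal [start, end) into (starting frame, unit length) pairs."""
    if stride is None:
        stride = unit_len // 2  # stride of half the unit length
    units = []
    t = start
    while t < end:  # assumption: a unit only needs to start inside the proposal
        units.append((t, unit_len))
        t += stride
    return units
```

With overlapping units, each frame inside the proposal is covered by up to two units, which gives the sequence encoder some redundancy around the boundaries.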

Context information helps the network decide where an action starts and where it ends. To keep context information, we follow the pipeline in (Zhao et al., 2017) and extend the temporal range of each unit beyond its original extent. As actions are usually composed of a set of motions, modeling the unit sequence is essential to the success of the Refinement Network. To this end, we leverage an RNN-based sequence encoder. Specifically, we use the Bi-directional Gated Recurrent Unit (BiGRU) (Schuster and Paliwal, 1997) as the RNN unit to encode the context information.

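GRU is a recurrent unit architecture that learns to encode a variable-length sequence into a fixed-length code. Figure 3 illustrates the architecture of RN based on GRU. Rather than using the three gates of an LSTM, GRU has only two gates to control the updates of the hidden units, i.e., the reset gate $r$ and the update gate $z$.

The $j$-th hidden unit of the reset gate is computed by:

$$r_j = \sigma\left([W_r x]_j + [U_r h^{t-1}]_j\right), \tag{1}$$

where $\sigma$ is the sigmoid function and $[\cdot]_j$ denotes the $j$-th element of a vector. The inputs are $x$ and the previous state $h^{t-1}$, and $W_r$ and $U_r$ are the reset gate weights.

The computation of the update gate is similar:

$$z_j = \sigma\left([W_z x]_j + [U_z h^{t-1}]_j\right). \tag{2}$$

The activation of hidden unit $h_j$ is computed by

$$h_j^{t} = z_j h_j^{t-1} + (1 - z_j)\,\tilde{h}_j^{t}, \qquad \tilde{h}_j^{t} = \tanh\left([W x]_j + [U (r \odot h^{t-1})]_j\right), \tag{3}$$

where $\tilde{h}_j^{t}$ is the candidate activation.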

Figure 3. Architecture of Refinement Network. The frame-level features are aggregated to get unit-level feature. Refinement Network walks through the unit-level features of the proposal to regress the coarse boundary.
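To make the GRU update in the Refinement Network concrete, here is a minimal pure-Python sketch of one forward step in the standard formulation (reset gate, update gate, tanh candidate activation). The weight names and the list-of-lists matrix layout are illustrative assumptions; a bi-directional encoder would run such a step over the unit sequence in both directions and combine the two hidden states.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gru_step(x, h_prev, W_r, U_r, W_z, U_z, W, U):
    """One GRU update: returns the new hidden state for input x."""
    # Reset gate r and update gate z, computed element-wise.
    r = [sigmoid(a + b) for a, b in zip(matvec(W_r, x), matvec(U_r, h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(W_z, x), matvec(U_z, h_prev))]
    # Candidate activation uses the reset-gated previous state.
    gated = [rj * hj for rj, hj in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b) for a, b in zip(matvec(W, x), matvec(U, gated))]
    # Interpolate between the previous state and the candidate.
    return [zj * hp + (1.0 - zj) * ht for zj, hp, ht in zip(z, h_prev, h_tilde)]
```

With all-zero weights both gates equal 0.5 and the candidate is zero, so the step simply halves the previous state, which is a convenient sanity check.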

We define three types of proposals based on the Intersection-over-Union (IoU) with the groundtruth: positive proposals with IoU above 0.7, incomplete proposals with IoU between 0.3 and 0.7, and background proposals with IoU below 0.1. During training, both the input proposals and the target boundary coordinates are first fit to the applicable video units. The unit-level features are encoded one by one through the BiGRU network. A fully connected layer then receives the output of the BiGRU and regresses the localization. Parameterized coordinate offsets are used for regression. The Refinement Network regresses both the interval center and the proposal span. The regression loss is defined as


where is the interval center, is the proposal span, (positive and incomplete proposals are considered), and is the smooth- loss. Here and are the center frame and length of the groundtruth () or video snippet ().
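The exact parameterization of the offsets is not recoverable from the text above, so the following is only a sketch under an explicit assumption: center and span offsets are encoded in the common R-CNN style (center shift normalized by the proposal span, log ratio of spans) and penalized with the smooth-L1 loss. All function names are hypothetical.

```python
import math

def smooth_l1(x):
    """Smooth-L1 penalty: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def offsets(proposal, target):
    """(center, span) offsets of target relative to proposal; both are (start, end)."""
    p_c, p_s = (proposal[0] + proposal[1]) / 2.0, proposal[1] - proposal[0]
    t_c, t_s = (target[0] + target[1]) / 2.0, target[1] - target[0]
    # Assumed encoding: normalized center shift and log span ratio.
    return (t_c - p_c) / p_s, math.log(t_s / p_s)

def regression_loss(pred_offsets, proposals, groundtruths):
    """Mean smooth-L1 loss over center and span offsets of the given proposals."""
    total = 0.0
    for (d_c, d_s), p, g in zip(pred_offsets, proposals, groundtruths):
        g_c, g_s = offsets(p, g)
        total += smooth_l1(d_c - g_c) + smooth_l1(d_s - g_s)
    return total / len(proposals)
```

Only proposals that take part in regression (positive and incomplete ones in the Refinement Network) would be passed to `regression_loss`.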

3.3. Localization Network

The Localization Network is responsible for producing the evolved temporal proposals and their scores. Ideally, the network not only recognizes the action and finds precise action proposals, but also decides whether a proposal is an action or background. To this end, we adopt the structured segment networks (SSN) (Zhao et al., 2017) as the backbone. Specifically, the temporal proposals from the Refinement Network are augmented with three stages, i.e., starting, course, and ending. The structured temporal pyramid features are then calculated in the SSN framework. Inspired by the application of non-local neural networks to video classification (Wang et al., 2017a), we extend the representation by adding a non-local block before the last layer. The non-local operation is defined as:

$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \tag{5}$$

where $x$ and $y$ are the input and the output signals respectively, $\forall j$ indicates the non-local behaviour over all positions in the frame, $g(x_j)$ is the representation, $f(x_i, x_j)$ is the dot-product similarity function between $x_i$ and $x_j$, and $\mathcal{C}(x)$ is a normalization factor. To avoid breaking the existing models, the non-local block is added with a residual connection (He et al., 2016). We denote the resulting representations as non-local pyramid features. We will show in our experiments that this feature is superior to directly extracting the structured temporal pyramid features. Based on the features, a multi-task loss is used to train the Localization Network.
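A stripped-down version of a non-local block with dot-product similarity and a residual connection can be written in a few lines. This sketch makes several simplifying assumptions that are not stated in the paper: the representation function is the identity, there are no learned projections, and the response is normalized by the number of positions.

```python
def nonlocal_block(x):
    """x: list of N feature vectors, one per position. Returns x_i + y_i for each i."""
    n = len(x)
    out = []
    for xi in x:
        # Dot-product similarity between position i and every position j.
        weights = [sum(a * b for a, b in zip(xi, xj)) for xj in x]
        # Weighted sum of all positions, normalized by N (assumed normalization).
        yi = [sum(w * xj[d] for w, xj in zip(weights, x)) / n for d in range(len(xi))]
        # Residual connection keeps the original signal intact.
        out.append([a + b for a, b in zip(xi, yi)])
    return out
```

Because of the residual connection, a block whose non-local response is zero leaves the input features unchanged, which is what allows it to be inserted into a pre-trained model.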

Action Classification. Positive and background proposals are used to train a classifier with $K$ classes for actions and one for background. We randomly sample positive and background proposals at a ratio of around 1:1 to avoid data imbalance. The cross-entropy loss is used:

$$\mathcal{L}_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \log P(c_i \mid p_i),$$

where $N$ is the number of training samples, and $P(c_i \mid p_i)$ is the score of action $c_i$ for proposal $p_i$.

Completeness Evaluation. Only a few proposals will match a groundtruth instance. A binary classifier is used to predict a value that represents whether the proposal is background or an action component, which plays an important role in ranking the proposals during evaluation. Positive and incomplete proposals are used to train this task. We use the online hard example mining strategy to overcome the imbalance of the dataset and improve classifier performance. Positive and incomplete proposals are sampled at a ratio of 1:4. During training, we keep only the top 1/4 of the incomplete proposals ranked by loss value and train on them together with the positive proposals. The hinge loss is applied:

$$\mathcal{L}_{com} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\ 1 - y_i\, P(b \mid p_i)\right),$$

where $y_i$ indicates whether the $i$-th proposal is a positive proposal ($y_i = 1$) or an incomplete proposal ($y_i = -1$), and $P(b \mid p_i)$ is the probability that proposal $p_i$ is an action component.
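The combination of hinge loss and online hard example mining can be sketched as follows. The labels of +1 for positive and -1 for incomplete proposals, the function names, and the rule of keeping at least one incomplete example are assumptions for illustration; the 1/4 keep ratio follows the description above.

```python
def hinge(score, label):
    """Hinge loss; label is +1 for a positive proposal, -1 for an incomplete one."""
    return max(0.0, 1.0 - label * score)

def completeness_loss(pos_scores, inc_scores, keep_ratio=0.25):
    """Average hinge loss over positives and the hardest incomplete proposals."""
    pos_losses = [hinge(s, +1) for s in pos_scores]
    # Online hard example mining: rank incomplete proposals by loss, keep the top quarter.
    inc_losses = sorted((hinge(s, -1) for s in inc_scores), reverse=True)
    kept = inc_losses[:max(1, int(len(inc_losses) * keep_ratio))] if inc_losses else []
    losses = pos_losses + kept
    return sum(losses) / len(losses) if losses else 0.0
```

Keeping only the highest-loss incomplete proposals turns the 1:4 sampling ratio into a roughly balanced set of training examples per batch.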

Localization Regression. The loss function in Equation (4) is used. Unlike the Refinement Network, this step uses only the positive proposals for boundary fine-tuning, so Equation (4) is evaluated over positive proposals only.

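The overall multi-task loss function in the Localization Network is defined as

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{com} + \lambda_2 \mathcal{L}_{reg},$$

where $\mathcal{L}_{cls}$, $\mathcal{L}_{com}$, and $\mathcal{L}_{reg}$ are the classification, completeness, and regression losses described above. We set $\lambda_1$ and $\lambda_2$ to 0.1 in our experiments. Based on the multi-task loss, the normalized classification score (see Equation (7)) and the completeness score are computed for a given proposal. The final ranking score for an action is the product of these two scores.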
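A small sketch of how the pieces might be combined is given below. It assumes, following the SSN backbone, that the ranking score is the softmax-normalized class probability multiplied by the completeness score, that the background class sits at index 0, and that the two 0.1 weights apply to the completeness and regression terms. These are assumptions, not details confirmed by the text.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ranking_scores(class_logits, completeness_prob):
    """Per-action ranking scores; class_logits[0] is assumed to be background."""
    probs = softmax(class_logits)[1:]  # drop the background class
    return [p * completeness_prob for p in probs]

def multi_task_loss(l_cls, l_com, l_reg, w_com=0.1, w_reg=0.1):
    """Weighted sum of the three losses (weight assignment is an assumption)."""
    return l_cls + w_com * l_com + w_reg * l_reg
```

Multiplying by the completeness score pushes down proposals that are confidently classified but cover only part of an action.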

IoU threshold                                    0.3   0.4   0.5   0.6   0.7
Wang et al. (Wang et al., 2014)                 14.6  12.1   8.5   4.7   1.5
Oneata et al. (Oneata et al., 2014)             28.8  21.8  15.0   8.5   3.2
Heilbron et al. (Caba Heilbron et al., 2016)       -     -  13.5     -     -
Escorcia et al. (Escorcia et al., 2016)            -     -  13.9     -     -
Richard and Gall (Richard and Gall, 2016)       30.0  23.2  15.2     -     -
Yeung et al. (Yeung et al., 2016)               36.0  26.4  17.1     -     -
PSDF (Yuan et al., 2016)                        33.6  26.1  18.8     -     -
S-CNN (Shou et al., 2016)                       36.3  28.7  19.0  10.3   5.3
Conv & De-conv (Shou et al., 2017)              38.6  28.2  22.4  12.0   7.5
CDC (Shou et al., 2017)                         40.1  29.4  23.3  13.1   7.9
SSAD (Lin et al., 2017)                         43.0  35.0  24.6     -     -
TPC+S-CNN (Yang et al., 2017)                   41.9  32.5  25.3  14.7   9.0
TURN TAP (Gao et al., 2017b)                    44.1  34.9  25.6     -     -
TPC+FGM (Yang et al., 2017)                     44.1  37.1  28.2  20.6  12.7
TAG (Xiong et al., 2017)                        48.7  39.8  28.2     -     -
R-C3D (Xu et al., 2017)                         44.8  35.6  28.9     -     -
SSN (Zhao et al., 2017)                         51.9  41.0  29.8     -     -
CBR (Gao et al., 2017a)                         50.1  41.3  31.0  19.1   9.9
ETP [ours]                                      48.2  42.4  34.2  23.4  13.9
Table 1. Action detection results on THUMOS14 (in %). The mAP at different IoU thresholds is reported. The results are sorted in ascending order of performance at IoU threshold 0.5. ‘-’ indicates that the result is not available in the corresponding paper. Bold faces are the top results while underlines correspond to the runners-up.

4. Experiments

In this section, we evaluate the effectiveness of the proposed framework on the action localization benchmarks. We first introduce the evaluation datasets and the experimental setup, and then evaluate the performance of the proposed framework and compare it with several state-of-the-art approaches. Finally, we discuss the effect of parameters and components. We denote our approach as ETP (Evolving Temporal Proposals).

4.1. Dataset and Evaluation Metric

We conduct the experiments on the THUMOS Challenge 2014 dataset (THUMOS14) (Idrees et al., 2017). The whole dataset contains 1,010 videos for validation and 1,574 videos for testing. For the temporal action localization task, only 200 validation-set videos and 213 testing-set videos with temporal annotations of 20 action classes are provided. The UCF101 dataset (Soomro et al., 2012) is designated as the official training set. As the videos in UCF101 are trimmed, we train our framework on the THUMOS14 validation set and then evaluate on the testing set.

The mean average precision (mAP) at different IoU thresholds is reported. Following recent works on precise temporal action localization (Shou et al., 2017; Yang et al., 2017), we consider the thresholds {0.3, 0.4, 0.5, 0.6, 0.7}, and the mAP at an IoU threshold of 0.5 is used for comparing different approaches.

Figure 4. Visualization of the action instances by our proposed approach on THUMOS14 dataset.

4.2. Experimental Settings


We implement our approach based on PyTorch. In the Actionness Network, we use ResNet-34 (He et al., 2016) to extract the frame-level class-specific actionness scores. The ResNet-34 network is pre-trained on the ImageNet dataset and then fine-tuned on the UCF101 dataset. Random horizontal flip and random center crop are used for data augmentation.

We use 2 BiGRU cells with 512 hidden units each as the RNN units to build the Refinement Network. The training batch size is 128 and training runs for 20 iterations. Only the positive and incomplete proposals are used as training samples. We use SGD with momentum 0.9 as the optimizer. The learning rate is decayed every 5 iterations.

In the Localization Network, the feature extraction is based on Inception V3 (Szegedy et al., 2016) with batch normalization, pre-trained on the Kinetics Human Action Video dataset (Kay et al., 2017). Both the spatial and the temporal flow networks are trained using SGD with momentum 0.9. The network is trained for 90 iterations with an initial learning rate of 0.1, which is scaled down by a factor of 0.1 every 5 iterations until it falls below a preset minimum.
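The step schedule described for the Localization Network can be written as a one-line rule. The base rate of 0.1, the factor of 0.1, and the step of 5 iterations come from the text; the floor `min_lr` is a hypothetical value, since the actual lower bound is not available here.

```python
def step_lr(iteration, base_lr=0.1, gamma=0.1, step=5, min_lr=1e-6):
    """Learning rate after `iteration` iterations of step decay with a floor."""
    lr = base_lr * gamma ** (iteration // step)
    return max(lr, min_lr)  # min_lr is an assumed floor, not a reported setting
```

In a PyTorch training loop the same behaviour is usually obtained from a step-based scheduler; the plain function is shown only to make the decay rule explicit.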

Inference. During the testing phase, the initial proposals are generated by Algorithm 1. The Refinement Network is then used for coarse regression of the proposals, which are sent to the Localization Network to refine the boundary regression and predict the action classes. The choice of the NMS threshold strongly influences the testing results. To achieve precise localization results at higher IoU thresholds, we empirically set the NMS threshold to 0.36.
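For reference, a greedy temporal NMS with the 0.36 threshold mentioned above can be sketched as follows. The `(start, end, score)` tuple layout and the function names are assumptions; this is the generic greedy procedure rather than the authors' exact code.

```python
def temporal_iou(a, b):
    """Intersection-over-Union of two temporal segments given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, threshold=0.36):
    """Greedy NMS over (start, end, score) tuples, highest score first."""
    kept = []
    for p in sorted(proposals, key=lambda q: q[2], reverse=True):
        # Keep a proposal only if it does not overlap too much with any kept one.
        if all(temporal_iou(p, k) <= threshold for k in kept):
            kept.append(p)
    return kept
```

A lower threshold suppresses more overlapping detections, which trades recall for fewer duplicate predictions of the same action instance.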

Class              (Yeung et al., 2016)  (Shou et al., 2016)  (Xu et al., 2017)  Ours
BaseballPitch                      14.6                 14.9               26.1  22.5
BasketballDunk                      6.3                 20.1               54.0  30.3
Billiards                           9.4                  7.6                8.3   8.1
CleanAndJerk                       42.8                 24.8               27.9  40.9
CliffDiving                        15.6                 27.5               49.2  16.7
CricketBowling                     10.8                 15.7               30.6  16.3
CricketShot                         3.5                 13.8               10.9   7.2
Diving                             10.8                 17.6               26.2  50.9
FrisbeeCatch                       10.4                 15.3               20.1   2.3
GolfSwing                          13.8                 18.2               16.1  44.4
HammerThrow                        28.9                 19.1               43.2  71.7
HighJump                           33.3                 20.0               30.9  51.2
JavelinThrow                       20.4                 18.2               47.0  47.3
LongJump                           39.0                 34.8               57.4  81.9
PoleVault                          16.3                 32.1               42.7  56.5
Shotput                            16.6                 12.1               19.4  32.0
SoccerPenalty                       8.3                 19.2               15.8  19.7
TennisSwing                         5.6                 19.3               16.6  29.1
ThrowDiscus                        29.5                 24.4               29.2  39.1
VolleyballSpiking                   5.2                  4.6                5.6  14.0
mAP@0.5                            17.1                 19.0               28.9  34.2
Table 2. Per-class AP at IoU threshold 0.5 on THUMOS14 (in %).

4.3. Comparison with the State of the Art

We first evaluate the overall results of our proposed framework for action localization and compare them with several state-of-the-art approaches. There are a few parameters in ETP, including the lengths of units in the Refinement Network, and feature types in the Actionness Network and the Localization Network. In our experiments, we set 64 frames for unit length, and extract the non-local pyramid features. The effects of these parameters, as well as the components of the framework, will be evaluated in Section 4.4.

Table 1 summarizes the mAPs over all action classes in THUMOS14. We compare ETP with the results from the challenge (Wang et al., 2014; Oneata et al., 2014) and with state-of-the-art approaches. From Table 1 we can see that at an IoU threshold of 0.5, ETP outperforms all the challenge results as well as the state-of-the-art approaches shown in the middle part of Table 1, including the actionness-based approach TAG (Xiong et al., 2017), the Convolutional-De-Convolutional Networks (Shou et al., 2017), the Cascaded Boundary Regression models (Gao et al., 2017a), and the very recent Temporal Preservation Networks (Yang et al., 2017). The performance gains over previous works at IoU thresholds from 0.4 to 0.7 confirm the effectiveness of our evolving temporal proposals for precise temporal action localization. Table 2 further shows the per-class results at an IoU threshold of 0.5 for our approach and several previous works (Yeung et al. (Yeung et al., 2016), S-CNN (Shou et al., 2016), and R-C3D (Xu et al., 2017)). Notably, our approach performs the best on 12 action classes and improves by more than 10% absolute AP over the next best for seven actions, including Diving, High Jump, Pole Vault, and Long Jump; for Diving, Golf Swing, Hammer Throw, and Long Jump the gain exceeds 20%. Figure 4 illustrates some prediction results for these actions.

Figure 5. Per-class AP with the Refinement Network incorporated.
Figure 6. Per-class AP w.r.t. various features (STPF: structured temporal pyramid features; NLPF: non-local pyramid features).
Figure 7. Evaluation of the ETP framework with different video unit lengths in Refinement Network on THUMOS14.

4.4. Ablation Study

In this part, we conduct control experiments that switch off different components of the proposed framework.

Effect of Refinement Network. We first evaluate the effect of the Refinement Network on the whole framework. Figure 5 compares, for all the action classes, the full framework with a pipeline in which the outputs of the Actionness Network are sent directly to the Localization Network. 13 categories show performance gains after incorporating the Refinement Network, and the overall AP increases by 1.3%.

Unit Length in RN. We also evaluate the setting of the video unit length in the Refinement Network. The lengths {16, 32, 64, 128} are tested, and the stride is set to half of the length. In Figure 7 we plot the localization results against the video unit length and the IoU threshold. With a length of 16 frames per video unit, we already obtain an mAP of 32.6% at an IoU threshold of 0.5, which outperforms previous works. We observe significant performance gains when the length increases from 16 to 32, after which the performance tends to saturate. There is a performance drop when the unit length is 128, and we conjecture that at such a large unit length the GRU-based sequence encoder no longer receives optimal context information.

Non-local Pyramid Features. Recall that the non-local pyramid features are obtained by adding a non-local residual block, based on the non-local operation shown in Equation (5). In Figure 6 we report the detailed localization results for each action learned using the non-local pyramid features as well as the structured temporal pyramid features in (Zhao et al., 2017). For 16 action classes we achieve higher AP values, which is consistent with the observations from previous work (Wang et al., 2017a).

IoU threshold   0.3   0.4   0.5   0.6   0.7
RGB            39.9  33.7  25.3  16.1   8.6
Flow           34.8  30.4  25.0  18.0  10.7
RGB+Flow       48.2  42.4  34.2  23.4  13.9
Table 3. mAP (in %) from different modalities.

Video Modality. Our last experiment evaluates the effect of different modalities on temporal action localization. The results are shown in Table 3. The RGB modality achieves a higher mAP when the IoU threshold is small, and the Flow modality tends to be better as the threshold increases. As shown in Table 3, using both modalities leads to performance gains at all IoU thresholds.

5. Conclusions

In this paper, we have proposed Evolving Temporal Proposals (ETP), a framework with three components (i.e., Actionness Network, Refinement Network, and Localization Network) that generates temporal proposals for precise action localization in untrimmed videos. Through empirical temporal action localization experiments, we have shown that ETP is more effective than previous systems, generating very competitive results on the THUMOS14 dataset. We leverage the non-local pyramid features to effectively model the activity, which improves the ability to discriminate between completeness and incompleteness of a proposal. For future work, we plan to optimize the inference procedure of the proposed framework and to explore one-stream video action localization.


References

  • Alexe et al. (2010) Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. 2010. What is an object?. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Buch et al. (2017) Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. 2017. SST: Single-Stream Temporal Action Proposals. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Caba Heilbron et al. (2016) Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. 2016. Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Cheng et al. (2014) Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, and Philip Torr. 2014. BING: Binarized Normed Gradients for Objectness Estimation at 300fps. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3286–3293.
  • Escorcia et al. (2016) Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. 2016. DAPs: Deep Action Proposals for Action Understanding. In European Conference on Computer Vision (ECCV).
  • Gao et al. (2017a) Jiyang Gao, Zhenheng Yang, and Ram Nevatia. 2017a. Cascaded Boundary Regression for Temporal Action Detection. arXiv:1705.01180 (2017).
  • Gao et al. (2017b) Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, and Ram Nevatia. 2017b. TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals. In International Conference on Computer Vision (ICCV). 3648–3656.
  • Girshick et al. (2014) Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 580–587.
  • Harel et al. (2006) Jonathan Harel, Christof Koch, and Pietro Perona. 2006. Graph-Based Visual Saliency. In Neural Information Processing Systems (NIPS).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
  • Idrees et al. (2017) Haroon Idrees, Amir R Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. 2017. The THUMOS challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding 155 (2017), 1–23.
  • Itti and Koch (2001) Laurent Itti and Christof Koch. 2001. Computational modelling of visual attention. Nature Reviews Neuroscience 2, 3 (2001), 194.
  • Jiang and Learned-Miller (2017) Huaizu Jiang and Erik Learned-Miller. 2017. Face Detection with the Faster R-CNN. In IEEE International Conference on Automatic Face & Gesture Recognition (FG). 650–657.
  • Jiang et al. (2013) Yu-Gang Jiang, Yanran Wang, Rui Feng, Xiangyang Xue, Yingbin Zheng, and Hanfang Yang. 2013. Understanding and Predicting Interestingness of Videos. In AAAI Conference on Artificial Intelligence (AAAI).
  • Kay et al. (2017) Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The kinetics human action video dataset. arXiv:1705.06950 (2017).
  • Laptev (2005) Ivan Laptev. 2005. On space-time interest points. International Journal of Computer Vision 64, 2-3 (2005), 107–123.
  • Laptev et al. (2008) Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. 2008. Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Lin et al. (2017) Tianwei Lin, Xu Zhao, and Zheng Shou. 2017. Single Shot Temporal Action Detection. In ACM International Conference on Multimedia (MM).
  • Liu et al. (2016) Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV). 21–37.
  • Lu et al. (2016) Yao Lu, Aakanksha Chowdhery, and Srikanth Kandula. 2016. Optasia: A relational platform for efficient large-scale video analytics. In ACM Symposium on Cloud Computing (SoCC).
  • Lu and Shapiro (2017) Yao Lu and Linda G Shapiro. 2017. Closing the Loop for Edge Detection and Object Proposals. In AAAI Conference on Artificial Intelligence (AAAI). 4204–4210.
  • Lu et al. (2012) Yao Lu, Wei Zhang, Cheng Jin, and Xiangyang Xue. 2012. Learning attention map from images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1067–1074.
  • Lu et al. (2011) Yao Lu, Wei Zhang, Hong Lu, and Xiangyang Xue. 2011. Salient object detection using concavity context. In International Conference on Computer Vision (ICCV). 233–240.
  • Ma et al. (2018) Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, and Xiangyang Xue. 2018. Arbitrary-Oriented Scene Text Detection via Rotation Proposals. IEEE Transactions on Multimedia (2018).
  • Ma et al. (2016) Shugao Ma, Leonid Sigal, and Stan Sclaroff. 2016. Learning activity progression in LSTMs for activity detection and early detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1942–1950.
  • Oneata et al. (2014) Dan Oneata, Jakob Verbeek, and Cordelia Schmid. 2014. The LEAR submission at Thumos 2014. In ECCV THUMOS Workshop.
  • Redmon et al. (2016) Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 779–788.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS). 91–99.
  • Richard and Gall (2016) Alexander Richard and Juergen Gall. 2016. Temporal Action Detection Using a Statistical Language Model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
  • Schuster and Paliwal (1997) Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
  • Shou et al. (2017) Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. 2017. CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Shou et al. (2016) Zheng Shou, Dongang Wang, and Shih-Fu Chang. 2016. Temporal Action Localization in Untrimmed Videos via Multi-Stage CNNs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Neural Information Processing Systems (NIPS). 568–576.
  • Singh et al. (2016) Bharat Singh, Tim K Marks, Michael Jones, Oncel Tuzel, and Ming Shao. 2016. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1961–1970.
  • Soomro et al. (2012) Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402 (2012).
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2818–2826.
  • Wang et al. (2017b) Li Wang, Yao Lu, Hong Wang, Yingbin Zheng, Hao Ye, and Xiangyang Xue. 2017b. Evolving Boxes for fast Vehicle Detection. In IEEE International Conference on Multimedia & Expo (ICME). 1135–1140.
  • Wang et al. (2014) Limin Wang, Yu Qiao, and Xiaoou Tang. 2014. Action Recognition and Detection by Combining Motion and Appearance Features. In ECCV THUMOS Workshop.
  • Wang et al. (2017a) Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2017a. Non-local Neural Networks. arXiv:1711.07971 (2017).
  • Wu et al. (2016) Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xiangyang Xue. 2016. Multi-stream multi-class fusion of deep networks for video classification. In ACM International Conference on Multimedia (MM). 791–800.
  • Xiong et al. (2017) Yuanjun Xiong, Yue Zhao, Limin Wang, Dahua Lin, and Xiaoou Tang. 2017. A Pursuit of Temporal Accuracy in General Activity Detection. arXiv:1703.02716 (2017).
  • Xu et al. (2017) Huijuan Xu, Abir Das, and Kate Saenko. 2017. R-C3D: Region Convolutional 3D Network for Temporal Activity Detection. In International Conference on Computer Vision (ICCV).
  • Yang et al. (2017) Ke Yang, Peng Qiao, Dongsheng Li, Shaohe Lv, and Yong Dou. 2017. Exploring Temporal Preservation Networks for Precise Temporal Action Localization. arXiv:1708.03280 (2017).
  • Ye et al. (2015) Hao Ye, Zuxuan Wu, Rui-Wei Zhao, Xi Wang, Yu-Gang Jiang, and Xiangyang Xue. 2015. Evaluating two-stream CNN for video classification. In ACM International Conference on Multimedia Retrieval (ICMR). 435–442.
  • Yeung et al. (2016) Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. 2016. End-To-End Learning of Action Detection From Frame Glimpses in Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Yuan et al. (2016) Jun Yuan, Bingbing Ni, Xiaokang Yang, and Ashraf A Kassim. 2016. Temporal Action Localization with Pyramid of Score Distribution Features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zhao et al. (2017) Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. 2017. Temporal Action Detection With Structured Segment Networks. In International Conference on Computer Vision (ICCV).
  • Zheng et al. (2010) Yingbin Zheng, Renzhong Wei, Hong Lu, and Xiangyang Xue. 2010. Semantic Video Indexing by Fusing Explicit and Implicit Context Spaces. In ACM International Conference on Multimedia (MM). 967–970.