1 Introduction
Recently, as an essential but challenging task in the large research scope of video analysis, temporal action detection in untrimmed videos has drawn tremendous attention from the research community [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. Given a long untrimmed video consisting of multiple action instances and complex background content, temporal action detection aims at solving two problems: (1) recognizing the categories of the actions contained in the video; (2) localizing the temporal intervals (starting and ending boundaries) where actions of interest occur. Temporal action detection has been applied to multiple practical applications, such as video surveillance, human-robot interaction and intelligent home care.
For temporal action detection, accurately localizing the starting and ending boundaries of a complex action instance is a challenging problem, since an action instance can happen at an arbitrary temporal location with uncertain duration in a video of arbitrary length. As addressed in [12], the localization error is the most common and the most impactful error that hampers the detection performance of existing state-of-the-art approaches; preferentially fixing localization errors can significantly boost the detection average mAP. Therefore, to achieve high temporal localization accuracy, most recent detection methods [3, 4, 6, 7, 13, 14, 15] apply the boundary regression paradigm to refine the boundaries of a given proposal. However, we argue that directly regressing the action boundary temporally constitutes a difficult learning task and hardly yields sufficiently accurate boundaries.
To alleviate this deficiency and to improve the localization accuracy of current detection methods, we propose a novel Boundary Likelihood Pinpointing (BLP) network. The main contribution of BLP is that we cast the problem of localizing temporal boundaries as that of assigning probabilities to each equally divided unit of a search interval. Specifically, instead of using boundary regression, we propose a novel boundary pinpointing paradigm to perform accurate temporal action localization, which is implemented in three steps (see Fig. 1). First, given a loosely localized action proposal within a video, we obtain a larger search interval by extending the proposal boundaries by a factor γ and equally divide it into M units. Second, we assign one or more discrete probabilities to each unit, indicating whether the unit is inside the temporal span of the action ground truth or is the starting or ending boundary of the action instance. Finally, we pinpoint the boundaries by simply maximizing the likelihood of the optimal boundaries under these probabilities. Since these probabilities provide far more detailed and useful boundary information, they encourage the model to yield more accurate boundaries than regression models, which just predict two temporal boundary coordinates. We evaluate the BLP model on two challenging datasets: THUMOS’14
[16] and ActivityNet [17]. Extensive experiments demonstrate that the BLP model can obtain detection results with more precise boundaries than direct regression. Integrating our BLP model with an existing action classifier into a detection framework leads to competitive detection mAP on both datasets, especially when the evaluation tIoU is high.
Specifically, our detection framework achieves 34.5% (tIoU = 0.7) and 66.9% (tIoU = 0.95) relative gains over the mAP of the state of the art on THUMOS’14 and ActivityNet, respectively.

Relation to prior work. Recently, an immense number of deep models [18, 19, 20, 21, 22] have been proposed for action recognition, among which the Two-Stream [18] and C3D [19] models are deployed in most existing methods. Due to the explosive growth of untrimmed video data, another challenging task, temporal action detection, has moved to the center of attention. Currently, many approaches [6, 4, 3, 7, 14, 15] adopt a “detection by classification” framework, in which boundary regression is widely employed to adjust the temporal boundaries and boost the localization accuracy. Different from the aforementioned work, we propose the boundary pinpointing paradigm. This paradigm estimates the optimal boundaries by maximizing the likelihood under predefined probabilities, where these probabilities provide more useful information than direct regression. Our idea stems from a novel object localization methodology called LocNet [23], which revises the horizontal and vertical object boundaries of a given proposal using border probabilities. Inspired by this work, BSN [24] also adopted similar boundary probabilities for temporal action proposal generation. However, BSN generates proposal boundaries by simply selecting temporal locations with high starting and ending probabilities separately. Our method localizes temporal boundaries using maximum likelihood estimation under these probabilities, which provides a more accurate measure of confidence for delimiting the boundaries at any point in time.
2 Proposed Method
2.1 Temporal Action Detection Pipeline
To begin with, we provide a brief overview of the temporal action detection pipeline. Our detection pipeline contains two major modules: an action classification network and an action localization network.
Formally, an action proposal is represented as φ = (t_s, t_e), where t_s and t_e are the starting and ending boundary coordinates of the segment, respectively. Given a set of N action proposals generated by either sliding temporal windows or other temporal action proposal methods [25, 14, 24], the action classification network anticipates action categories by predicting a set of classification scores. Each score represents how likely the n-th temporal proposal is to belong to the j-th action category. Meanwhile, for each loosely localized proposal, the action localization network localizes the boundaries where actions start and end temporally. It generates a new set of action segments that have more compact boundaries enclosing the actions inside the proposals. To eliminate redundant segments, an extra Non-Maximum Suppression (NMS) [26] operation is applied to obtain the N' final segments with accurate boundaries. Details of the localization process are discussed in Sec. 2.2. Here K, N and N' denote the number of action categories, proposals and final results, respectively.
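The NMS step above can be sketched as a greedy procedure over temporal segments. The following is a minimal NumPy illustration (the function name and the threshold value are placeholders, not from the paper):

```python
import numpy as np

def temporal_nms(segments, scores, iou_threshold=0.3):
    """Greedy temporal NMS: repeatedly keep the highest-scoring segment
    and drop remaining segments that overlap it too much.
    `segments` is an (N, 2) array of (start, end) times."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # temporal IoU between the kept segment and the remaining ones
        inter_s = np.maximum(segments[i, 0], segments[order[1:], 0])
        inter_e = np.minimum(segments[i, 1], segments[order[1:], 1])
        inter = np.maximum(0.0, inter_e - inter_s)
        union = ((segments[i, 1] - segments[i, 0])
                 + (segments[order[1:], 1] - segments[order[1:], 0]) - inter)
        tiou = inter / np.maximum(union, 1e-8)
        order = order[1:][tiou <= iou_threshold]
    return keep
```

For example, of two heavily overlapping segments, only the higher-scoring one survives, while a disjoint segment is kept.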
2.2 Boundary Likelihood Pinpointing Network
The purpose of our work is to improve the localization accuracy of the detection pipeline. Currently, most existing detection methods [3, 4, 6, 7, 13, 14, 15] accomplish this by directly regressing the two boundary coordinates, which lacks sufficiently detailed information to yield accurate boundaries. Thus, we propose a novel Boundary Likelihood Pinpointing (BLP) network as our localization network.
BLP accepts selected proposal segments and outputs conditional probabilities indicating the boundary locations. Given a proposal segment, BLP first extends it by a factor γ to create a search interval and equally divides the interval into M units. Then, BLP predicts one or more discrete probabilities for each unit, indicating whether the unit is inside the temporal span of the action ground truth or is the starting or ending boundary of the action instance. These probabilities provide more detailed information for precise boundary inference than direct boundary regression, as detailed in Sec. 2.2.1. During inference, we propose a novel boundary pinpointing paradigm: based on the probabilities generated by BLP, we pinpoint the action boundaries by simply maximizing the likelihood of the optimal boundaries. This paradigm is detailed in Sec. 2.2.2.
2.2.1 Boundary Likelihood Predictions
For each unit i within a search interval R, BLP predicts one or more conditional probabilities corresponding to a specific category c. Here we design two types of probabilities.
In-Out probabilities: We define the in-out probability p_in(i|c) to represent the likelihood of unit i being inside the temporal span of an action instance of category c. Ideally, given a ground-truth segment (g_s, g_e), the in-out probabilities should equal the target probabilities T_in(i), where T_in(i) = 1 if unit i lies inside [g_s, g_e] and T_in(i) = 0 otherwise.
Boundary probabilities: p_s(i|c) and p_e(i|c) represent two independent probabilities of unit i being the starting and ending boundary of an action instance of category c. Given a ground truth (g_s, g_e), the output boundary probabilities should ideally equal the target probabilities T_s(i) and T_e(i), where T_s(i) = 1 only for the unit containing g_s, and analogously T_e(i) = 1 only for the unit containing g_e.
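The two types of targets can be illustrated with a small helper. This is a hypothetical sketch: it assumes an in-out unit is labeled by whether its center falls inside the ground-truth span, and a boundary unit by whether it contains the ground-truth start or end time; the paper does not specify these discretization details.

```python
import numpy as np

def target_probabilities(interval, M, gt):
    """Build ideal in-out and boundary targets for one search interval.
    `interval` = (t0, t1) is divided into M equal units; `gt` = (gs, ge)."""
    t0, t1 = interval
    gs, ge = gt
    edges = np.linspace(t0, t1, M + 1)          # unit edges
    centers = 0.5 * (edges[:-1] + edges[1:])    # unit centers
    # in-out target: the unit's center lies inside the ground-truth span
    t_in = ((centers >= gs) & (centers <= ge)).astype(float)
    # boundary targets: the single unit that contains the start / end time
    t_start = np.zeros(M)
    t_end = np.zeros(M)
    t_start[np.clip(np.searchsorted(edges, gs, side='right') - 1, 0, M - 1)] = 1.0
    t_end[np.clip(np.searchsorted(edges, ge, side='right') - 1, 0, M - 1)] = 1.0
    return t_in, t_start, t_end
```

With interval (0, 8), M = 8 and ground truth (2, 5), three units are marked "in" and exactly one unit each carries the start and end targets.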
2.2.2 Inference by Boundary Pinpointing
Given the aforementioned probabilities of a search interval R, we propose a novel boundary pinpointing paradigm to infer the temporal boundaries of the action inside R. This process is implemented by adopting one of the following two BLP localization models.
In-Out localization model: Maximizes the likelihood of the in-out elements of the temporal boundary (s, e):

(s^*, e^*) = \arg\max_{(s,e)} \prod_{i=s}^{e} p_{in}(i) \prod_{i \notin [s,e]} (1 - p_{in}(i))    (1)
where (s, e) ranges over all candidate boundaries within the search interval. The first term on the right-hand side of the equation represents the likelihood that each unit of [s, e] lies inside a ground-truth interval, and the second term represents the likelihood that the units outside [s, e] lie outside a ground-truth interval.
Boundary localization model: Maximizes the likelihood of the boundary elements of the boundary (s, e):

(s^*, e^*) = \arg\max_{(s,e)} p_{s}(s) \, p_{e}(e)    (2)
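Both localization models reduce to a small search over the M units of the interval. A minimal sketch, assuming the predicted probabilities are given as NumPy arrays (an O(M^2) brute force over candidate pairs; a real implementation could share prefix sums across candidates):

```python
import numpy as np

def pinpoint_inout(p_in):
    """In-Out model: choose (s, e) maximizing the likelihood that units
    in [s, e] are inside the action and the rest are outside (log domain)."""
    M = len(p_in)
    eps = 1e-8
    log_in = np.log(np.clip(p_in, eps, 1.0))
    log_out = np.log(np.clip(1.0 - p_in, eps, 1.0))
    best, best_ll = (0, 0), -np.inf
    for s in range(M):
        for e in range(s, M):
            ll = log_in[s:e + 1].sum() + log_out[:s].sum() + log_out[e + 1:].sum()
            if ll > best_ll:
                best_ll, best = ll, (s, e)
    return best

def pinpoint_boundary(p_start, p_end):
    """Boundary model: choose (s, e) with s <= e maximizing p_start(s) * p_end(e)."""
    M = len(p_start)
    best, best_p = (0, 0), -1.0
    for s in range(M):
        for e in range(s, M):
            if p_start[s] * p_end[e] > best_p:
                best_p, best = p_start[s] * p_end[e], (s, e)
    return best
```

For a unimodal in-out profile, the in-out model recovers exactly the high-probability run of units; the boundary model simply pairs the strongest start and end peaks subject to s <= e.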
2.3 Action Detection Network Architecture
The architecture of the detection network is shown in Fig. 2. Given a video sequence consisting of frames and a set of action proposals, the network outputs category-specific action segments with accurate temporal boundaries.
BLP localization network architecture. The BLP network aims to predict the aforementioned in-out or boundary probabilities for each proposal. To begin with, a shared deep C3D model [19] processes the input video to extract rich spatio-temporal feature hierarchies and outputs a shared feature map. Then, given the search interval extended from an action proposal, we map it onto this feature map and use a 3D RoI pooling layer [6] to extract fixed-size feature maps from the activations inside the interval. The resulting feature maps are fed forward into the two fully connected (fc) layers of C3D and an extra fc layer to yield a one-dimensional feature vector, whose length is proportional to the number of divided units M and the number of action categories K (one probability per unit for the in-out case, two for the boundary case). Finally, in order to output the category-specific conditional probabilities, the feature vector is reshaped and fed into a sigmoid layer to obtain the final conditional probability matrix.

Action classification network architecture. For a given proposal, the action classification network anticipates action categories by predicting a set of softmax scores over the K + 1 categories (including “background”). To this end, the fc7 features are fed into another fc layer and an extra softmax layer to output the class probabilities.

2.4 Optimization
We train the detection network by optimizing the classification and localization networks jointly. The multi-task objective function is:
L = \frac{1}{N_{cls}} \sum_{i} L_{cls}(a_{i}, a_{i}^{*}) + \lambda \frac{1}{N_{loc}} \sum_{j} L_{loc}(p_{j}, T_{j})    (3)
where N_cls and N_loc stand for the batch size and the number of proposal segments, respectively, and λ is the trade-off parameter, set empirically. i and j are the indexes of action proposals. For the classification network, L_cls is a standard multi-class cross-entropy loss, where a_i and a_i^* are the predicted class probability and the ground truth, respectively; for the localization network, L_loc adopts a binary logistic regression loss conditioned on a specific class c, where p_j denotes the predicted probabilities (p_in, or p_s and p_e) for each segment and T_j the corresponding target probabilities. Specifically, in the in-out case, the loss is given by:

L_{loc} = -\frac{1}{M} \sum_{i=1}^{M} \Big[ T_{in}(i) \log p_{in}(i) + \big(1 - T_{in}(i)\big) \log\big(1 - p_{in}(i)\big) \Big]    (4)
for the boundary case, it is:
L_{loc} = -\frac{1}{2M} \sum_{a \in \{s,e\}} \sum_{i=1}^{M} \Big[ \lambda^{+} T_{a}(i) \log p_{a}(i) + \big(1 - T_{a}(i)\big) \log\big(1 - p_{a}(i)\big) \Big]    (5)
In equation (5), we adopt the trade-off parameter λ+ as in [23] to balance the terms of the boundary and non-boundary elements, since only a few of the M units are true boundaries.
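The two localization losses are plain (weighted) binary cross-entropies over the M units and can be sketched as follows; the function names and the clipping constant are illustrative, not from the paper:

```python
import numpy as np

def inout_loss(p_in, t_in):
    """Binary cross-entropy over the M units (in-out case)."""
    eps = 1e-8
    p = np.clip(p_in, eps, 1 - eps)  # avoid log(0)
    return -np.mean(t_in * np.log(p) + (1 - t_in) * np.log(1 - p))

def boundary_loss(p, t, lam_pos):
    """Weighted binary cross-entropy (boundary case): the rare boundary
    units are up-weighted by lam_pos to balance the two terms."""
    eps = 1e-8
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(lam_pos * t * np.log(p) + (1 - t) * np.log(1 - p))
```

As expected, predictions close to the targets yield a smaller loss than predictions far from them.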
3 Experiments
In this section, we evaluate the proposed BLP network on two prevailing datasets: THUMOS’14 [16] and ActivityNet v1.3 [17].

Baseline Model: We take R-C3D [6] as our baseline, since it is a regression-based temporal action detection method. To detect actions, we integrate the examined BLP localization model and the R-C3D classification model into one holistic detection framework. For a fair comparison, we train and test our detection network with the same classification network and the same proposal set generated by R-C3D for all experiments. The whole BLP model is implemented in Caffe [27].

3.1 Datasets and Experimental Details
THUMOS’14. THUMOS’14 contains 20 different sport activities, with 200 videos for training and 213 videos for testing. Evaluation metrics. We report the mean Average Precision (mAP) of each action category at tIoU thresholds [0.1:0.1:0.7], and the mAP at tIoU = 0.5 is used for the final comparison with other methods. Implementation details. The weights of the C3D model are pre-trained on Sports-1M and fine-tuned on UCF101. The λ in loss function (3) is set to 20. Other implementation details are the same as in [6].

ActivityNet. ActivityNet v1.3 contains 19,994 videos with 200 classes and is divided into three sets: training, validation and testing, with a ratio of 2:1:1. Evaluation metrics. We report the mAP at tIoU = 0.5, 0.75 and 0.95, and the average of mAPs over tIoU thresholds [0.5:0.05:0.95] is used for comparison. Implementation details. The C3D model is initialized with the pre-trained Sports-1M weights fine-tuned on ActivityNet training videos. We train BLP with a fixed learning rate for the first 10 epochs, which is then decreased for the last 5 epochs. The λ is set to 250.

3.2 Ablation Experiments
In this section, we explore the best hyperparameter settings for BLP.
How many units should a search interval be divided into? Given a video search interval, we divide it into M units. To explore the influence of M, we examine three In-Out models with M ∈ {16, 32, 48} (the extension factor held fixed). As shown in Table 1, the In-Out model achieves the best detection performance when M = 32. We analyze that with finer resolution (M = 48), each unit contains fewer features with which to determine whether the unit is inside an action of interest. Conversely, with coarse resolution (M = 16), each unit spans a longer time interval, so the temporal boundary localization may be ambiguous and less precise. The same analysis applies to the Boundary models. As a result, we choose M = 32 for the following experiments.
How far should a proposal be extended? A search interval is obtained by extending a temporal segment by a factor γ. Our intuitive assumption is that with larger γ, the BLP model will comprehend and leverage more surrounding temporal context. To explore the impact of γ, we investigate six In-Out and Boundary models with γ ∈ {1.0, 1.6, 1.8, 2.0, 2.4, 3.0}. As shown in Table 2, both models achieve their peak performance when γ = 2.0, while the worst performance occurs when no context is considered (γ = 1.0). However, including redundant context (γ = 3.0) also degrades performance. Thus, we choose γ = 2.0 for the following experiments.
Table 1: Detection mAP (%) on THUMOS’14 for In-Out models with different numbers of units M.

tIoU    0.1   0.2   0.3   0.4   0.5
M=16   54.8  52.7  47.9  39.4  31.2
M=32   54.9  52.9  48.5  40.3  31.6
M=48   53.3  51.1  47.1  39.7  29.6
Table 2: Detection mAP (%) at tIoU = 0.5 on THUMOS’14 for different extension factors γ.

γ          1.0   1.6   1.8   2.0   2.4   3.0
In-Out    30.5  31.3  31.6  32.1  31.8  31.7
Boundary  29.3  32.4  32.2  32.5  31.9  31.9
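The search-interval construction used throughout these ablations (extend a proposal by γ, then split into M equal units) can be sketched as follows. The assumption that the interval is extended symmetrically around the proposal center follows LocNet-style practice [23] and is not stated explicitly here:

```python
def search_interval(t_start, t_end, gamma=2.0, M=32):
    """Extend a proposal [t_start, t_end] by factor gamma around its
    center and split the result into M equal units; returns unit edges."""
    center = 0.5 * (t_start + t_end)
    half = 0.5 * gamma * (t_end - t_start)  # half-width of the search interval
    lo, hi = center - half, center + half
    step = (hi - lo) / M
    return [lo + i * step for i in range(M + 1)]
```

For instance, a proposal [4, 8] with γ = 2.0 yields the search interval [2, 10]; with M = 4 the unit edges are 2, 4, 6, 8, 10.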
3.3 Action Localization Effectiveness Analysis
In this section, we compare the localization performance of the proposed In-Out and Boundary models with the regression-based model R-C3D on the THUMOS’14 testing set and the ActivityNet validation set. As shown in Fig. 3, to evaluate the localization performance of each examined model, we report the class-specific recall (averaging per-class recalls) as a function of the tIoU thresholds [0.05:0.05:1.0] for the final detection results generated by the corresponding detection pipeline. We also report the average recall (AR) for each model in the legend; higher AR indicates that the model yields more accurate temporal boundaries. Fig. 3 shows that the two proposed models achieve remarkably higher recall than the baseline, surpassing its AR by an average of 5.9% and 4.8% on the two datasets, respectively. We argue that the in-out and boundary probabilities help the BLP model yield more accurate boundaries that have larger overlap with the ground-truth instances. This demonstrates the effectiveness and superior localization performance of the BLP model.
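The tIoU metric underlying these recall curves is the one-dimensional analogue of spatial IoU; a minimal helper:

```python
def tiou(seg_a, seg_b):
    """Temporal IoU between two (start, end) segments, as used by the
    recall and mAP evaluations at thresholds such as 0.5 or 0.75."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, segments [0, 10] and [5, 15] overlap for 5 units out of a 15-unit union, giving tIoU = 1/3; disjoint segments give 0.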
3.4 Action Detection Performance Analysis
The detection performance is highly related to the choice of feature extractor. Since Two-Stream features [18, 20] and other improved 3D ConvNet features [21] are more discriminative than the vanilla C3D features deployed in our model, for a fair comparison we only compare with state-of-the-art methods that adopt vanilla C3D as their feature extractor.
THUMOS’14. The comparison results on the THUMOS’14 testing set are summarized in Table 3. We can observe that: (1) The detection frameworks with the proposed In-Out and Boundary localization models outperform the baseline R-C3D [6] by 3.2% and 3.6%, respectively, which demonstrates that our boundary pinpointing paradigm can truly boost the localization performance and yield much more accurate temporal boundaries than boundary regression. (2) Compared with other well-developed regression-based detection methods [14, 4], the detection performance of our probability-based localization method is remarkably superior. (3) Our detector shows superior mAP over the state-of-the-art method SS-TAD [8] across a wide range of tIoU thresholds; in particular, at tIoU = 0.7 we outperform SS-TAD by 35.4% relatively. These results confirm that our well-designed probabilities provide more useful boundary information for accurate localization.
Table 3: Detection mAP (%) on the THUMOS’14 testing set at different tIoU thresholds.

Detection Method        0.7   0.6   0.5   0.4   0.3   0.2   0.1
S-CNN [1]               5.3  10.3  19.0  28.7  36.3  43.5  47.7
CBR-C3D [4]             7.9  13.8  22.7  30.1  37.7  44.3  48.2
CDC [2]                 7.9  13.1  23.3  29.4  40.1    –     –
TURN + S-CNN [14]        –     –   25.6  34.9  44.1  50.9  54.0
SS-TAD [8]              9.6    –   29.2    –   45.7    –     –
R-C3D (Baseline) [6]     –     –   28.9  35.6  44.8  51.5  54.5
R-C3D + In-Out         12.6  23.0  32.1  41.1  49.2  53.9  56.2
R-C3D + Boundary       13.0  22.3  32.5  41.3  48.5  53.0  54.7
ActivityNet v1.3. The comparison results on the ActivityNet v1.3 testing set are shown in Table 4. The results show that after using the BLP models to refine the temporal boundaries, we obtain clear improvements over the baseline R-C3D [6] across the whole range of tIoU thresholds as well as in average mAP. Meanwhile, compared with the state-of-the-art method CDC [2], our method shows competitive performance and obtains a 66.9% relative gain when the tIoU is high (tIoU = 0.95). This indicates that after the refinement, the segments have more precise boundaries and larger overlap with the ground-truth instances.
Table 4: Detection mAP (%) on the ActivityNet v1.3 testing set.

Detection Method       0.95   0.75    0.5   Average
Wang et al. [28]       0.06   2.88  42.48   14.62
CDC [2]                0.20  25.70  43.00   22.90
R-C3D (Baseline) [6]   1.69  11.47  26.45   13.33
R-C3D + In-Out         2.50  14.12  26.65   15.00
R-C3D + Boundary       2.82  15.00  27.82   15.68
4 Conclusion
In this paper, we propose a novel Boundary Likelihood Pinpointing (BLP) network for accurate temporal action localization. Specifically, instead of using boundary regression, we propose a substitute paradigm called boundary pinpointing. The localization process starts by assigning conditional probabilities to each equally divided unit of a search interval. These probabilities provide a measure of confidence for each unit being within an action instance or being at one of its two boundaries. We then exploit these probabilities to accurately pinpoint the temporal boundaries under a simple probabilistic framework. Extensive experiments demonstrate the effectiveness of the BLP localization model. Integrating the BLP model with an existing action classifier into a detection pipeline achieves competitive detection performance, with 34.5% (tIoU = 0.7) and 66.9% (tIoU = 0.95) relative gains over the mAP of state-of-the-art detectors on THUMOS’14 and ActivityNet, respectively.
References

[1] Zheng Shou, Dongang Wang, and Shih-Fu Chang, “Temporal action localization in untrimmed videos via multi-stage CNNs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1049–1058.
 [2] Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang, “CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 1417–1426.
 [3] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin, “Temporal action detection with structured segment networks,” in The IEEE International Conference on Computer Vision (ICCV), 2017, vol. 8.
 [4] Jiyang Gao, Zhenheng Yang, and Ram Nevatia, “Cascaded boundary regression for temporal action detection,” in Proceedings of the British Machine Vision Conference (BMVC), 2017.
 [5] Xiyang Dai, Bharat Singh, Guyue Zhang, Larry S Davis, and Yan Qiu Chen, “Temporal context network for activity localization in videos,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 5727–5736.
 [6] Huijuan Xu, Abir Das, and Kate Saenko, “R-C3D: Region convolutional 3D network for temporal activity detection,” in The IEEE International Conference on Computer Vision (ICCV), 2017, vol. 6, p. 8.
 [7] Tianwei Lin, Xu Zhao, and Zheng Shou, “Single shot temporal action detection,” in Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017, pp. 988–996.
 [8] Shyamal Buch, Victor Escorcia, Bernard Ghanem, Li Fei-Fei, and Juan Carlos Niebles, “End-to-end, single-stream temporal action detection in untrimmed videos,” in Proceedings of the British Machine Vision Conference (BMVC), 2017.
 [9] Ke Yang, Peng Qiao, Dongsheng Li, Shaohe Lv, and Yong Dou, “Exploring Temporal Preservation Networks for Precise Temporal Action Localization,” arXiv.org, Aug. 2017.
 [10] Fabian Caba Heilbron, Wayner Barrios, Victor Escorcia, and Bernard Ghanem, “SCC: Semantic context cascade for efficient action detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, vol. 2.
 [11] Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager, “Temporal Convolutional Networks for Action Segmentation and Detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, pp. 1003–1012, IEEE.
 [12] Humam Alwassel, Fabian Caba Heilbron, Victor Escorcia, and Bernard Ghanem, “Diagnosing error in temporal action detectors,” in The European Conference on Computer Vision (ECCV), September 2018.
 [13] Tianwei Lin, Xu Zhao, and Zheng Shou, “Temporal convolution based action proposal: Submission to activitynet 2017,” arXiv preprint arXiv:1707.06750, 2017.
 [14] Jiyang Gao, Zhenheng Yang, Kan Chen, Chen Sun, and Ram Nevatia, “TURN TAP: Temporal unit regression network for temporal action proposals,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [15] Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A Ross, Jia Deng, and Rahul Sukthankar, “Rethinking the Faster R-CNN architecture for temporal action localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1130–1139.
 [16] Y.G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar, “THUMOS challenge: Action recognition with a large number of classes,” http://crcv.ucf.edu/THUMOS14/, 2014.
 [17] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles, “ActivityNet: A large-scale video benchmark for human activity understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 961–970.
 [18] Karen Simonyan and Andrew Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
 [19] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 2015, pp. 4489–4497.
 [20] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 20–36.
 [21] Joao Carreira and Andrew Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 4724–4733.
 [22] ChaoYuan Wu, Manzil Zaheer, Hexiang Hu, R Manmatha, Alexander J Smola, and Philipp Krähenbühl, “Compressed video action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6026–6035.
 [23] Spyros Gidaris and Nikos Komodakis, “LocNet: Improving localization accuracy for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 789–798.
 [24] Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang, “BSN: Boundary sensitive network for temporal action proposal generation,” in European Conference on Computer Vision, 2018.
 [25] Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem, “DAPs: Deep action proposals for action understanding,” in European Conference on Computer Vision. Springer, 2016, pp. 768–784.
 [26] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis, “Soft-NMS – improving object detection with one line of code,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
 [27] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 675–678.
 [28] R. Wang and D. Tao, “UTS at ActivityNet 2016,” ActivityNet Large Scale Activity Recognition Challenge, 2016.