Recently, as an essential but challenging task within the broad research scope of video analysis, temporal action detection in untrimmed videos has drawn tremendous attention from the research community [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. Given a long untrimmed video consisting of multiple action instances and complex background content, temporal action detection aims at solving two problems: (1) recognizing the categories of the actions contained in the video; (2) localizing the temporal intervals (starting and ending boundaries) where actions of interest occur. Temporal action detection has been applied in multiple practical scenarios, such as video surveillance, human-robot interaction, and intelligent home care.
For temporal action detection, accurately localizing the starting and ending boundaries of a complex action instance is a challenging problem, since an action instance can occur at an arbitrary temporal location with uncertain duration in a video of arbitrary length. As prior error-diagnosis work has shown, localization error is the most common and the most impactful error hampering the detection performance of existing state-of-the-art approaches, and preferentially fixing localization errors can significantly boost the detection average-mAP. Therefore, to achieve high temporal localization accuracy, most recent detection methods [3, 4, 6, 7, 13, 14, 15] apply a boundary regression paradigm to refine the boundaries of a given proposal. However, we argue that directly regressing the action boundaries is a difficult learning task and hardly yields sufficiently accurate boundaries.
To alleviate this deficiency and address the need to improve the localization accuracy of current detection methods, we propose a novel Boundary Likelihood Pinpointing (BLP) network. The main contribution of BLP is that we cast the problem of localizing temporal boundaries as that of assigning probabilities to each equally divided unit of a search interval. Specifically, instead of using boundary regression, we propose a novel boundary pinpointing paradigm to perform accurate temporal action localization, implemented in three steps (see Fig. 1). First, given a loosely localized action proposal within a video, we obtain a larger search interval by extending the proposal boundaries by a factor, and equally divide it into M units. Second, we assign one or more discrete probabilities to each unit, indicating whether the unit is inside the temporal span of the action ground truth or is the starting or ending boundary of the action instance. Finally, we pinpoint the boundaries by simply maximizing the likelihood of the boundary estimates under these probabilities. Since these probabilities provide far more detailed and useful boundary information, they encourage the model to yield more accurate boundaries than regression models, which predict just two temporal boundary coordinates. We evaluate the BLP model on two challenging datasets: THUMOS’14 and ActivityNet. Extensive experiments demonstrate that the BLP model obtains detection results with more precise boundaries than direct regression. Integrating our BLP model with an existing action classifier into a detection framework leads to competitive detection mAP on both datasets, especially when the evaluation tIoU is high. Specifically, our detection framework achieves 34.5% (tIoU = 0.7) and 66.9% (tIoU = 0.95) relative gains over the mAP of the state of the art on THUMOS’14 and ActivityNet, respectively.
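The three-step paradigm above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the per-unit in-out probabilities are already available (in the paper they are produced by the BLP network), and `pinpoint_boundaries` is a hypothetical helper name.

```python
import numpy as np

def pinpoint_boundaries(start, end, unit_probs, extension=2.0):
    """Extend a proposal, divide the search interval into M units, and
    pick the unit pair with maximal in-out likelihood (steps 1 and 3;
    step 2, predicting unit_probs, is done by the network)."""
    M = len(unit_probs)
    # Step 1: build the search interval by extending the proposal.
    center, length = (start + end) / 2.0, (end - start) * extension
    lo, hi = center - length / 2.0, center + length / 2.0
    edges = np.linspace(lo, hi, M + 1)  # boundaries of M equal units

    # Step 3: maximize the likelihood that all units in [s, e] lie
    # inside the action while all remaining units lie outside.
    eps = 1e-8
    log_in = np.log(unit_probs + eps)
    log_out = np.log(1.0 - unit_probs + eps)
    best, best_ll = (0, M - 1), -np.inf
    for s in range(M):
        for e in range(s, M):
            ll = log_in[s:e + 1].sum() + log_out[:s].sum() + log_out[e + 1:].sum()
            if ll > best_ll:
                best_ll, best = ll, (s, e)
    return edges[best[0]], edges[best[1] + 1]
```

For example, a proposal [10, 20] extended by a factor of 2 yields the search interval [5, 25]; with high in-out probabilities concentrated on the middle units, the returned boundaries tighten around that region.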
Relation to prior work. Recently, an immense number of deep models [18, 19, 20, 21, 22] have been proposed for action recognition, among which the Two-Stream and C3D models are deployed in most existing methods. Due to the explosive growth of untrimmed video data, another challenging task called temporal action detection has moved to the center of attention. Currently, many approaches [6, 4, 3, 7, 14, 15] adopt a “detection by classification” framework, in which boundary regression has been widely employed for adjusting temporal boundaries and boosting localization accuracy. Different from the aforementioned work, we propose the boundary pinpointing paradigm, which estimates the optimal boundaries by maximizing the likelihood under predefined probabilities; these probabilities provide more useful information than direct regression. Our idea stems from a novel object localization methodology called LocNet, which revises the horizontal and vertical object boundaries of a given proposal using border probabilities. Inspired by this work, BSN also adopted similar boundary probabilities for temporal action proposal generation. However, BSN generates proposal boundaries by simply selecting temporal locations with high starting and ending probabilities separately, whereas our method localizes temporal boundaries using maximum likelihood estimation under these probabilities, which provides a more accurate measure of confidence for delimiting the boundaries at any point in time.
2 Proposed Method
2.1 Temporal Action Detection Pipeline
To begin with, we provide a brief overview of the temporal action detection pipeline. Our detection pipeline contains two major modules: an action classification network and an action localization network.
Formally, an action proposal is represented as a temporal segment defined by its starting and ending boundary coordinates. Given a set of action proposals generated by either sliding temporal windows or other temporal action proposal methods [25, 14, 24], the action classification network predicts action categories by producing a set of classification scores, where each score represents how likely the n-th temporal proposal is to belong to the j-th action category. Meanwhile, for each loosely localized proposal, the action localization network localizes the boundaries where actions start and end temporally, generating a new set of action segments with more compact boundaries enclosing the actions inside the proposal. To eliminate redundant segments, an extra Non-Maximum Suppression (NMS) operation is applied to obtain the final segments with accurate boundaries. Details of the localization process are discussed in Sec. 2.2.
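As a concrete illustration of the final post-processing step, a minimal greedy temporal NMS over scored segments might look as follows. The helper names are hypothetical; the paper applies NMS to the refined segments produced by the localization network.

```python
def temporal_iou(a, b):
    """Temporal IoU between two segments given as (start, end) pairs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms_segments(segments, scores, tiou_thresh=0.5):
    """Greedy temporal NMS: keep highest-scoring segments, suppress
    any segment overlapping a kept one above the tIoU threshold."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) < tiou_thresh for j in keep):
            keep.append(i)
    return [segments[i] for i in keep]
```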
2.2 Boundary Likelihood Pinpointing Network
The purpose of our work is to improve the localization accuracy of the detection pipeline. Currently, most existing detection methods [3, 4, 6, 7, 13, 14, 15] accomplish this by directly regressing the two boundary coordinates, which provides too little information to yield accurate boundaries. Thus, we propose a novel Boundary Likelihood Pinpointing (BLP) network as our localization network.
BLP accepts selected proposal segments and outputs conditional probabilities indicating the boundary locations. Given a proposal segment, BLP first extends it by a factor to create a search interval and equally divides the interval into M units. Then, BLP predicts one or more discrete probabilities for each unit to indicate whether the unit is inside the temporal span of the action ground truth or is the starting or ending boundary of the action instance. These probabilities provide more detailed information for precise boundary inference than direct boundary regression, as detailed in Sec. 2.2.1. During inference, we propose a novel boundary pinpointing paradigm: based on the probabilities generated by BLP, we pinpoint the action boundaries by simply maximizing the likelihood of the boundary estimates. This paradigm is detailed in Sec. 2.2.2.
2.2.1 Boundary Likelihood Predictions
For each unit $i$ ($i = 1, \dots, M$) within a search interval, BLP predicts one or more conditional probabilities corresponding to a specific category. Here we design two types of probabilities.
In-Out probabilities: We define the in-out probability $p_{in}(i)$ to represent the likelihood of unit $i$ being inside the temporal span of an action instance of the considered category. Ideally, given a ground truth segment, the in-out probabilities should equal the target probabilities $T_{in}(i)$, where $T_{in}(i) = 1$ if unit $i$ lies inside the ground truth span and $T_{in}(i) = 0$ otherwise.
Boundary probabilities: $p_{s}(i)$ and $p_{e}(i)$ represent two independent probabilities of unit $i$ being the starting and ending boundary, respectively, of an action instance of the considered category. Given a ground truth segment, the output boundary probabilities should ideally equal the target probabilities $T_{s}(i)$ and $T_{e}(i)$, which are 1 only for the units containing the ground truth starting and ending boundaries, respectively, and 0 elsewhere.
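One plausible way to construct these per-unit targets from a ground truth segment is sketched below. This is an illustrative assumption (a unit is labeled by its center, and boundary targets mark the closest unit); `unit_targets` and its arguments are hypothetical names.

```python
import numpy as np

def unit_targets(edges, gt_start, gt_end):
    """edges: the M+1 boundaries of the M units of a search interval.
    Returns in-out targets and starting/ending boundary targets."""
    centers = (edges[:-1] + edges[1:]) / 2.0
    # In-out target: 1 for units whose center falls inside the ground truth.
    t_in = ((centers >= gt_start) & (centers <= gt_end)).astype(float)
    # Boundary targets: 1 only for the unit closest to each boundary.
    t_s = np.zeros_like(centers)
    t_e = np.zeros_like(centers)
    t_s[np.argmin(np.abs(centers - gt_start))] = 1.0
    t_e[np.argmin(np.abs(centers - gt_end))] = 1.0
    return t_in, t_s, t_e
```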
2.2.2 Inference by Boundary Pinpointing
Given the aforementioned probabilities, we propose a novel boundary pinpointing paradigm to infer the temporal boundaries of the action inside the search interval. This process is implemented by adopting one of the following two BLP localization models.
In-Out localization model: Maximizes the likelihood of the in-out elements of a candidate temporal boundary $B = [s, e]$:

$B^{*} = \arg\max_{B} \prod_{i \in B} p_{in}(i) \prod_{i \notin B} \big(1 - p_{in}(i)\big),$

where $i$ ranges over the units of the search interval. The first term on the right-hand side represents the likelihood of each unit of $B$ being inside a ground truth interval, and the second term represents the likelihood of the units that are not part of $B$ being outside a ground truth interval.
Boundary localization model: Maximizes the likelihood of the boundary elements of the boundary $B = [s, e]$:

$B^{*} = \arg\max_{s \le e} \; p_{s}(s) \cdot p_{e}(e).$
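Inference under the Boundary model reduces to an exhaustive search over ordered unit pairs. A minimal sketch (the function name is illustrative, not from the paper):

```python
def boundary_pinpoint(p_start, p_end):
    """Return unit indices (s, e), s <= e, maximizing p_start[s] * p_end[e]."""
    M = len(p_start)
    best, best_p = (0, M - 1), -1.0
    for s in range(M):
        for e in range(s, M):  # enforce the ordering constraint s <= e
            p = p_start[s] * p_end[e]
            if p > best_p:
                best_p, best = p, (s, e)
    return best
```

The ordering constraint distinguishes this from BSN-style selection, which picks the starting and ending peaks independently.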
2.3 Action Detection Network Architecture
The architecture of the detection network is shown in Fig. 2. Given a video sequence consisting of frames and a set of action proposals, the network outputs category-specific action segments with accurate temporal boundaries.
BLP localization network architecture. The BLP network aims to predict the aforementioned in-out or boundary probabilities for each proposal. To begin with, a deep shared C3D model is utilized to process the input video, extracting rich spatio-temporal feature hierarchies and outputting a shared feature map. Then, given a search interval extended from an action proposal, we map it onto the feature map and use a 3D RoI pooling layer to extract fixed-size feature maps from the activations inside the interval. The resulting feature maps are fed forward into two fully connected (fc) layers of C3D and an extra fc layer to yield a one-dimensional feature vector, whose length equals the number of units M times the number of action categories for in-out probabilities, and twice that for boundary probabilities. Finally, in order to output the category-specific conditional probabilities, this feature vector is reshaped and fed into a sigmoid layer to obtain the final conditional probability matrix, with one row per action category.
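The shape bookkeeping of this probability head can be sketched with NumPy standing in for the fc and sigmoid layers. `probability_head` and its arguments are illustrative names, not the paper's API.

```python
import numpy as np

def probability_head(fc_out, num_units, num_classes, boundary=False):
    """Reshape the flat fc output into a category-specific probability
    matrix: one row per class, M columns for in-out, 2M for boundary."""
    per_unit = 2 if boundary else 1
    probs = 1.0 / (1.0 + np.exp(-fc_out))  # element-wise sigmoid
    return probs.reshape(num_classes, per_unit * num_units)
```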
Action classification network architecture. For a given proposal, the action classification network predicts action categories by producing a set of softmax scores over the categories (including “background”). To this end, the fc7 features are fed into another fc layer and an extra softmax layer to output the class probabilities.
We train the detection network by optimizing the classification and localization networks jointly. The multi-task objective function is

$L = \frac{1}{N_{cls}} \sum_{i} L_{cls}(a_i, a_i^{*}) + \lambda \frac{1}{N_{loc}} \sum_{j} L_{loc}(p_j, T_j),$

where $N_{cls}$ and $N_{loc}$ stand for the batch size and the number of proposal segments, respectively, $\lambda$ is a trade-off parameter set empirically, and $i$ and $j$ are the indexes of action proposals. For the classification network, $L_{cls}$ is a standard multi-class cross-entropy loss, where $a_i$ and $a_i^{*}$ are the predicted class probability and the ground truth, respectively. For the localization network, $L_{loc}$ adopts a binary logistic regression loss conditioned on a specific class, where $p_j$ represents the predicted in-out or boundary probabilities of each segment and $T_j$ are the corresponding target probabilities. Specifically, in the in-out case, the loss is given by

$L_{loc} = -\frac{1}{M} \sum_{i=1}^{M} \Big[ T_{in}(i) \log p_{in}(i) + \big(1 - T_{in}(i)\big) \log \big(1 - p_{in}(i)\big) \Big];$

for the boundary case, it is

$L_{loc} = -\frac{1}{2M} \sum_{i=1}^{M} \sum_{b \in \{s, e\}} \Big[ T_{b}(i) \log p_{b}(i) + \big(1 - T_{b}(i)\big) \log \big(1 - p_{b}(i)\big) \Big].$
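The per-unit binary logistic regression loss can be written compactly; below is a NumPy sketch of the in-out case under a hypothetical helper name.

```python
import numpy as np

def binary_logistic_loss(p, t, eps=1e-8):
    """Mean binary cross-entropy between predicted probabilities p
    and 0/1 targets t, clipped for numerical stability."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))
```

The boundary case applies the same loss to the starting and ending probability vectors and averages the two.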
3 Experiments

We choose R-C3D as our baseline, since it is a regression-based temporal action detection method. To detect actions, we integrate the examined BLP localization model and the R-C3D classification model into one holistic detection framework. For a fair comparison, we train and test our detection network with the same classification network and the same proposal set generated by R-C3D for all experiments. The whole BLP model is implemented in Caffe.
3.1 Datasets and Experimental Details
THUMOS’14. THUMOS’14 contains 20 different sport activities, with 200 videos for training and 213 videos for testing. Evaluation metrics. We report the mean Average Precision (mAP) of each action category at tIoU thresholds [0.1:0.1:0.7], and the mAP at tIoU = 0.5 is used for the final comparison with other methods. Implementation details. The weights of the C3D model are pre-trained on Sports-1M and fine-tuned on UCF101. The hyper-parameter in the loss function is set to 20. Other implementation details follow the baseline.
ActivityNet. ActivityNet v1.3 contains 19,994 videos with 200 classes and is divided into three sets: training, validation, and testing, with a ratio of 2:1:1. Evaluation metrics. We report the mAP at tIoU = 0.5, 0.75, and 0.95, and the average of the mAPs at tIoU thresholds [0.5:0.05:0.95] is used for comparison. Implementation details. The C3D model is initialized with the pre-trained Sports-1M weights fine-tuned on the ActivityNet training videos. We train BLP with a fixed learning rate for the first 10 epochs, which is then decreased for the last 5 epochs. The corresponding hyper-parameter is set to 250.
3.2 Ablation Experiments
In this section, we explore the best hyper-parameter settings for BLP.
How many units should a search interval be divided into? Given a video search interval, we divide it into M units. To explore the influence of M, we examine three In-Out models with different values of M (the extension factor is fixed). As shown in Table 1, the In-Out model achieves the best detection performance when M = 32. We analyze that with a finer resolution (larger M), each unit contains fewer features with which to determine whether it is inside an action of interest. Conversely, with a coarser resolution (smaller M), each unit spans a longer time interval, so the temporal boundary localization may be ambiguous and less precise. The same analysis applies to the Boundary models. As a result, we choose M = 32 for the following experiments.
How far should a proposal be extended? A search interval is obtained by extending a temporal segment by a factor. Our intuitive assumption is that with a larger factor, the BLP model can comprehend and leverage more surrounding temporal context. To explore its impact, we investigate six In-Out and Boundary models with different extension factors. As shown in Table 2, both models achieve their peak performance with a factor of 2.0, while the worst performance occurs when no context is considered (no extension). However, including redundant context (an overly large factor) also degrades performance. Thus, we choose a factor of 2.0 for the following experiments.
3.3 Action Localization Effectiveness Analysis
In this section, we compare the localization performance of the proposed In-Out and Boundary models with the regression-based model R-C3D on the THUMOS’14 testing set and the ActivityNet validation set. As shown in Fig. 3, to evaluate the localization performance of each examined model, we report the class-specific recall (averaging per-class recalls) as a function of tIoU thresholds [0.05:0.05:1.0] for the final detection results generated by the corresponding detection pipeline. We also report the average recall (AR) for each model in the legend; a higher AR indicates that the model yields more accurate temporal boundaries. Fig. 3 shows that the two proposed models achieve remarkably higher recall than the baseline, surpassing its AR by 5.9% and 4.8% on average on the two datasets, respectively. We argue that the in-out and boundary probabilities help the BLP model yield more accurate boundaries with larger overlap with the ground truth instances. This demonstrates the effectiveness and superior localization performance of the BLP model.
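The recall metric used here can be computed as follows; a minimal sketch assuming per-class lists of detected and ground truth segments (all names are illustrative):

```python
def temporal_iou(a, b):
    """Temporal IoU between two segments given as (start, end) pairs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def class_specific_recall(detections, ground_truths, tiou):
    """Average, over classes, of the fraction of ground truth instances
    matched by at least one detection at the given tIoU threshold."""
    recalls = []
    for cls, gts in ground_truths.items():
        dets = detections.get(cls, [])
        hit = sum(1 for g in gts if any(temporal_iou(d, g) >= tiou for d in dets))
        recalls.append(hit / len(gts))
    return sum(recalls) / len(recalls)
```

Sweeping `tiou` over [0.05:0.05:1.0] produces the recall curves of Fig. 3, and averaging over the sweep gives the AR.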
3.4 Action Detection Performance Analysis
The detection performance is highly related to the choice of feature extractor. Since Two-Stream features [18, 20] and other improved 3D ConvNet features are more discriminative than the vanilla C3D features deployed in our model, we only compare with state-of-the-art methods that adopt vanilla C3D as their feature extractor, for a fair comparison.
THUMOS’14. The comparison results on the THUMOS’14 testing set are summarized in Table 3. We observe the following: (1) The detection frameworks with the proposed In-Out and Boundary localization models outperform the baseline R-C3D by 3.2% and 3.6%, respectively, which demonstrates that our boundary pinpointing paradigm truly boosts localization performance and yields much more accurate temporal boundaries than boundary regression. (2) Compared with other well-developed regression-based detection methods [14, 4], our probability-based localization method is remarkably superior. (3) Our detector shows superior mAP over the state-of-the-art method SS-TAD across a wide range of tIoU thresholds; in particular, at tIoU = 0.7 we outperform SS-TAD by 35.4% relatively. These results confirm that our designed probabilities provide more useful boundary information for accurate localization.
| Method | tIoU=0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TURN + S-CNN | - | - | 25.6 | 34.9 | 44.1 | 50.9 | 54.0 |
| R-C3D (Baseline) | - | - | 28.9 | 35.6 | 44.8 | 51.5 | 54.5 |
| R-C3D + In-Out | 12.6 | 23.0 | 32.1 | 41.1 | 49.2 | 53.9 | 56.2 |
| R-C3D + Boundary | 13.0 | 22.3 | 32.5 | 41.3 | 48.5 | 53.0 | 54.7 |
ActivityNet v1.3. The comparison results on the ActivityNet v1.3 testing set are shown in Table 4. The results show that after using the BLP models to refine temporal boundaries, we obtain clear improvements over the baseline R-C3D across the full range of tIoU thresholds as well as in average mAP. Meanwhile, compared with the state-of-the-art method CDC, our method shows competitive performance and achieves a 66.9% relative gain at high tIoU (tIoU = 0.95). This indicates that after refinement, the segments have more precise boundaries and larger overlap with the ground truth instances.
| Method | tIoU=0.95 | 0.75 | 0.5 | Average |
| --- | --- | --- | --- | --- |
| Wang et al. | 0.06 | 2.88 | 42.48 | 14.62 |
| R-C3D (Baseline) | 1.69 | 11.47 | 26.45 | 13.33 |
| R-C3D + In-Out | 2.50 | 14.12 | 26.65 | 15.00 |
| R-C3D + Boundary | 2.82 | 15.00 | 27.82 | 15.68 |
In this paper, we propose a novel Boundary Likelihood Pinpointing (BLP) network for accurate temporal action localization. Specifically, instead of using boundary regression, we propose a substitute paradigm called boundary pinpointing. The localization process starts by assigning conditional probabilities to each equally divided unit of a search interval. These probabilities measure the confidence of each unit being within an action instance or being one of its two boundaries, and we exploit them to accurately pinpoint the temporal boundaries under a simple probabilistic framework. Extensive experiments demonstrate the effectiveness of the BLP localization model. Integrating our BLP model with an existing action classifier into a detection pipeline achieves competitive detection performance, with 34.5% (tIoU = 0.7) and 66.9% (tIoU = 0.95) relative gains over the mAP of state-of-the-art detectors on THUMOS’14 and ActivityNet, respectively.
-  Zheng Shou, Dongang Wang, and Shih-Fu Chang, “Temporal action localization in untrimmed videos via multi-stage cnns,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1049–1058.
-  Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang, “Cdc: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 1417–1426.
-  Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin, “Temporal action detection with structured segment networks,” in The IEEE International Conference on Computer Vision (ICCV), 2017, vol. 8.
-  Jiyang Gao, Zhenheng Yang, and Ram Nevatia, “Cascaded boundary regression for temporal action detection,” in Proceedings of the British Machine Vision Conference (BMVC), 2017.
-  Xiyang Dai, Bharat Singh, Guyue Zhang, Larry S Davis, and Yan Qiu Chen, “Temporal context network for activity localization in videos,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 5727–5736.
-  Huijuan Xu, Abir Das, and Kate Saenko, “R-c3d: Region convolutional 3d network for temporal activity detection,” in The IEEE International Conference on Computer Vision (ICCV), 2017, vol. 6, p. 8.
-  Tianwei Lin, Xu Zhao, and Zheng Shou, “Single shot temporal action detection,” in Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017, pp. 988–996.
-  Shyamal Buch, Victor Escorcia, Bernard Ghanem, Li Fei-Fei, and Juan Carlos Niebles, “End-to-end, single-stream temporal action detection in untrimmed videos,” in Proceedings of the British Machine Vision Conference (BMVC), 2017.
-  Ke Yang, Peng Qiao, Dongsheng Li, Shaohe Lv, and Yong Dou, “Exploring Temporal Preservation Networks for Precise Temporal Action Localization,” arXiv.org, Aug. 2017.
-  F Caba Heilbron, Wayner Barrios, Victor Escorcia, and Bernard Ghanem, “Scc: Semantic context cascade for efficient action detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, vol. 2.
-  Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager, “Temporal Convolutional Networks for Action Segmentation and Detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, pp. 1003–1012, IEEE.
-  Humam Alwassel, Fabian Caba Heilbron, Victor Escorcia, and Bernard Ghanem, “Diagnosing error in temporal action detectors,” in The European Conference on Computer Vision (ECCV), September 2018.
-  Tianwei Lin, Xu Zhao, and Zheng Shou, “Temporal convolution based action proposal: Submission to activitynet 2017,” arXiv preprint arXiv:1707.06750, 2017.
-  Jiyang Gao, Zhenheng Yang, Kan Chen, Chen Sun, and Ram Nevatia, “Turn tap: Temporal unit regression network for temporal action proposals,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A Ross, Jia Deng, and Rahul Sukthankar, “Rethinking the faster r-cnn architecture for temporal action localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1130–1139.
-  Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar, “THUMOS challenge: Action recognition with a large number of classes,” http://crcv.ucf.edu/THUMOS14/, 2014.
-  Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 961–970.
-  Karen Simonyan and Andrew Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.
-  Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 2015, pp. 4489–4497.
-  Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 20–36.
-  Joao Carreira and Andrew Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 4724–4733.
-  Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R Manmatha, Alexander J Smola, and Philipp Krähenbühl, “Compressed video action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6026–6035.
-  Spyros Gidaris and Nikos Komodakis, “Locnet: Improving localization accuracy for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 789–798.
-  Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang, “Bsn: Boundary sensitive network for temporal action proposal generation,” in European Conference on Computer Vision, 2018.
-  Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem, “Daps: Deep action proposals for action understanding,” in European Conference on Computer Vision. Springer, 2016, pp. 768–784.
-  Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis, “Soft-nms – improving object detection with one line of code,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 675–678.
-  R. Wang and D. Tao, “Uts at activitynet 2016,” ActivityNet Large Scale Activity Recognition Challenge, 2016.