1 Introduction
Driven by the need to process a large number of untrimmed videos generated daily by various video capturing devices, temporal action localization is drawing increasing attention from the research community [27, 23, 6, 30, 16, 17, 25, 15, 28, 14, 19].
Temporal action localization typically involves, first, generating video segments as candidate action proposals and, second, jointly classifying them into an action class and regressing/refining their temporal boundaries so as to better localize them in time
[5, 16, 2, 25]. However, for actions in the wild, that is, in unconstrained scenarios, there are large variations in how actions are performed; this makes it difficult to predict accurate boundaries. Also, unlike object boundaries, there might not even be a sensible definition of what the exact temporal extent of an action is. This makes temporal boundary annotations subjective and, possibly, inconsistent across different annotators. Such issues are not taken into consideration by the traditional regression losses used for boundary refinement (such as the $\ell_1$ loss [6, 5]). To address the above issues, inspired by recent works (e.g., [7, 12]
), we first propose to model the boundary predictions in temporal action localization as univariate Gaussian distributions, for which we learn the means and variances; the latter express the uncertainty about each prediction. Then, we exploit this kind of uncertainty by using two uncertainty-aware boundary regression losses. First, we use the Kullback-Leibler (KL) divergence between a Dirac delta, representing the ground truth location of the boundary, and the univariate Gaussian; this is the loss proposed in
[7] for the problem of object detection. Second, we propose to approximate the expectation of the smooth $\ell_1$ loss that is typically used as the regression loss; to backpropagate the error with respect to the parameters of the Gaussian, we resort to the reparameterization trick and an approximation by sampling, as in [12]. Experimental evaluation of the above losses shows that the network learns to assign large variances to the samples that are predicted to be far from the ground truth boundary values. As the network converges and the predictions become more accurate, this behaviour changes and the network assigns small variances to accurate predictions. Both uncertainty-aware losses improve detection and localization performance.
The contributions of the paper are summarized as follows:

We propose a simple and effective one-stage network that introduces and exploits uncertainty modeling of the boundary location for temporal action localization. To the best of our knowledge, this is the first paper that does so in this domain.

For action localization we propose to use two uncertainty-aware losses: the first, based on the KL-divergence, models the difference between distributions; the second, proposed by us, is based on the expectation of the smooth $\ell_1$ loss.

We show that the uncertainty modeling improves over the adopted baseline, and that our one-stage network achieves results comparable with recent one- and two-stage networks on THUMOS'14.¹
¹Code will be made public here: https://github.com/
2 Related work
One-stage action localization detectors: Single-stage networks have been extensively used for detection [16, 26, 1, 9]. However, their performance is usually inferior to that of two-stage networks. The SSD-like detection architecture presented in [16] performs well, but the temporal span of actions in videos exhibits larger and more arbitrary variations than the spatial extent of objects. Thus, it is hard to cover them all with hand-designed anchors and obtain accurate boundaries without explicitly modeling the temporal information, especially for actions of long duration. [26] is the first to propose an end-to-end network, but the C3D features [13] have been shown to be inferior to the two-stream features [24] used in [6, 25]. [1] exploits C3D as a feature extractor and a GRU [3], a concise and elegant way to model temporal information and predict the offsets. However, experimental results show that the GRU is not sufficient for learning representations for accurate localization compared to CNN-based methods. In this paper, we try to alleviate the drawbacks of the above methods using our one-stage network.
Uncertainty Learning in DNNs: To improve the robustness and interpretability of discriminative Deep Neural Networks (DNNs), introducing and learning under uncertainty is receiving increasing attention from the research community [4, 11, 7, 21]. In this respect, two main categories of uncertainty are studied: model uncertainty and data uncertainty. Model uncertainty refers to the uncertainty of the model parameters given the training data and can be reduced by collecting additional data [4]. Data uncertainty accounts for the uncertainty in the output, whose primary source is the inherent noise in the input data [11, 7, 21]. Although a few methods have been proposed for dealing with data uncertainty in classification and regression problems (e.g., in segmentation [11] or object detection [7] tasks), to the best of our knowledge, this is the first work that does so in the temporal action localization domain.
3 Method
3.1 Baseline architecture
In this work, we propose a simple single-stage network, which serves as our baseline architecture. The proposed network draws inspiration from the standard two-stage approach that includes a proposal generation stage and a detection stage. Both can be considered standard classification-regression networks that take as input a fixed-size feature representation extracted from temporal clips of varying lengths. First, in the proposal generation stage, a binary classifier labels each segment as background or foreground (i.e., as one of a set of known actions). Second, in the detection stage, a coupled regression-classification scheme refines the segment boundaries and classifies the segment into one of the known action classes. In this paper, we combine the two stages into a single-stage network that conducts end-to-end action detection and localization via two branches that perform binary classification and multi-class classification/regression separately, as shown in Fig. 1.
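To make the two-branch design concrete, the following is a minimal numpy sketch of the forward pass (layer sizes, parameter names, and the per-class output layout [score, start offset, end offset] are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def temporal_pool(features, k=4):
    """Divide a variable-length proposal (T, d) into k parts and
    average-pool each part -> fixed (k*d,) representation."""
    parts = np.array_split(features, k, axis=0)
    return np.concatenate([p.mean(axis=0) for p in parts])

def forward(x, params, num_classes):
    """Shared FC trunk, then two heads: actioness (binary) and
    per-class classification scores plus start/end offsets."""
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)            # FC + ReLU
    actioness = 1.0 / (1.0 + np.exp(-(h @ params["w_act"] + params["b_act"])))
    out = (h @ params["w_det"] + params["b_det"]).reshape(num_classes, 3)
    cls_scores = out[:, 0]       # one score per action class
    offsets = out[:, 1:]         # (num_classes, 2): start/end offsets
    return actioness, cls_scores, offsets

# toy usage with hypothetical sizes
rng = np.random.default_rng(0)
k, d, C, hidden = 4, 16, 20, 32
params = {
    "w1": 0.1 * rng.normal(size=(k * d, hidden)), "b1": np.zeros(hidden),
    "w_act": 0.1 * rng.normal(size=(hidden, 1)), "b_act": np.zeros(1),
    "w_det": 0.1 * rng.normal(size=(hidden, C * 3)), "b_det": np.zeros(C * 3),
}
feat = rng.normal(size=(37, d))  # a proposal made of 37 temporal units
a, s, off = forward(temporal_pool(feat, k), params, C)
```

The sketch only illustrates the data flow; the actual layer widths are given in the implementation details.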
Specifically, to partially preserve the input temporal structure, we divide each input proposal into $K$ parts and apply average pooling to each part, as in [25], to obtain a fixed-dimensional feature representation, which is followed by a normalization layer and a fully-connected layer (along with a ReLU layer). After that, there are two fully-connected branches: the first performs binary classification, indicating whether the proposal is an action or not, while the second outputs multi-class classification scores and the refined start and end offsets corresponding to refinements of the boundaries for each action category. Different from the lower branch shown in Fig. 1, the baseline network only predicts start and end offsets with a standard regression loss; the distribution prediction will be described in detail in Sect. 3.4.
3.2 Classification
Before discussing the different boundary regression methods, let us define the training set with supervision as follows:
$\mathcal{D} = \{(x_i, a_i, c_i, s_i, e_i)\}_{i=1}^{N}$,
where $x_i$, $a_i$, $c_i$, $s_i$, and $e_i$ denote the feature vector, the actioness label, the class label, the start offset, and the end offset of the $i$-th training sample, respectively. $a_i$ is binary, indicating whether a training example depicts an action or the background. $c_i$ is a multi-class label indicating the category a training example belongs to. Given a feature vector $x_i$, the baseline network infers the actioness score $\hat{a}_i$, the multi-class scores, the start offset $\hat{s}_i$, and the end offset $\hat{e}_i$. For the binary classification task, i.e., for learning the actioness score, we use the standard binary cross-entropy loss. However, since proposals that actually depict an action are far fewer than those that depict background, the dataset is imbalanced. To deal with this, we adopt a popular technique (see, e.g., [18]), namely hard-negative mining, where we keep the ratio between positive and negative (with respect to actioness) samples fixed to a constant value. Then, the total binary loss is given as:
$L_{bin} = -\frac{1}{|P|+|N|}\Big(\sum_{i\in P}\log \hat{a}_i + \sum_{i\in N}\log(1-\hat{a}_i)\Big)$  (1)
where $P$ and $N$ denote the sets of indices of the positive and the chosen negative samples, respectively.
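A minimal sketch of this mining scheme (the ratio value of 3 and all names are illustrative assumptions; the paper's actual ratio is a hyper-parameter):

```python
import numpy as np

def binary_loss_with_mining(scores, labels, ratio=3):
    """Binary cross-entropy with hard-negative mining: keep all positives
    and only the `ratio` * #positives hardest negatives, i.e. the
    background proposals with the largest predicted actioness."""
    scores = np.clip(scores, 1e-7, 1 - 1e-7)
    pos = np.where(labels == 1)[0]
    neg = np.where(labels == 0)[0]
    hardest = neg[np.argsort(-scores[neg])][: ratio * max(len(pos), 1)]
    loss_pos = -np.log(scores[pos]).sum()          # positives: -log a_hat
    loss_neg = -np.log(1 - scores[hardest]).sum()  # negatives: -log(1 - a_hat)
    return (loss_pos + loss_neg) / (len(pos) + len(hardest))
```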
3.3 Standard multi-class classification and boundary regression
For the multi-class classification task, suppose $C$ is the number of action categories in the dataset; the classification loss is written as:
$L_{cls} = -\frac{1}{|P|}\sum_{i\in P}\log p_{i,c_i}$  (2)
where $p_{i,c_i}$ is the softmax probability that the $i$-th sample is assigned to its ground truth class $c_i$.
For the regression task, i.e., for adjusting the start and the end offsets, typically the smooth $\ell_1$ loss is used:
$L_{reg} = \frac{1}{|P|}\sum_{i\in P}\big(f(\hat{s}_i - s_i) + f(\hat{e}_i - e_i)\big), \quad f(x) = \begin{cases}\frac{1}{2}x^2, & |x|\le 1\\ |x|-\frac{1}{2}, & \text{otherwise}\end{cases}$  (3)
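A direct transcription of the smooth $\ell_1$ function assumed here:

```python
def smooth_l1(x):
    """Smooth-l1 (Huber-like) loss: quadratic near zero, linear beyond 1."""
    x = abs(x)
    return 0.5 * x * x if x <= 1 else x - 0.5
```

Note that the two pieces meet at $|x| = 1$ with value $\tfrac{1}{2}$, so the function is continuous and its gradient is bounded by 1.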
3.4 Uncertainty-aware boundary regression
As discussed above, for modeling output uncertainty, we propose to model the boundary offsets as univariate Gaussian distributions for which the first- and second-order moments are learned by the network (see Fig. 1). That is, instead of predicting a deterministic pair of start/end boundaries, we predict a pair of univariate Gaussians. In the next two sections, we discuss two regression losses that exploit this kind of uncertainty: one that explicitly uses the distributions for computing the boundary regression loss, and one that samples from them to approximate the expectation of the smooth $\ell_1$ loss. KL regression loss: Following similar arguments as in [7], we adopt the Kullback-Leibler divergence, combined with another loss similar to the smooth $\ell_1$ loss, for computing the boundary regression loss. To this end, we treat ground truth values as Dirac delta distributions, i.e., centred at the given values, in order to indicate the lack of any prior knowledge about their uncertainty. For the sake of simplicity, if $y_g$ is the ground truth value of a boundary offset, and $\mu$, $\sigma^2$ are the mean and the variance of the corresponding network prediction, then the following regression loss is introduced when $|y_g - \mu| \le 1$:
$L_{KL} = \frac{(y_g-\mu)^2}{2\sigma^2} + \frac{1}{2}\log \sigma^2$  (4)
and the modified smooth $\ell_1$ loss when $|y_g - \mu| > 1$:
$L_{KL} = \frac{|y_g-\mu| - \frac{1}{2}}{\sigma^2} + \frac{1}{2}\log \sigma^2$  (5)
We show the above regression loss in Fig. 2 (a). It is worth noting that for large values of $|y_g - \mu|$, i.e., for predicted offsets that are far from the corresponding ground truth values, the loss decreases for predictions with large variances. That is, using the KL loss will force the network to predict offsets with large variances in order to converge quickly. By doing this, the network is given more freedom to discard some noisy training samples by enlarging the variances of the output. On the other hand, when the network starts to converge, i.e., when the distance between the predicted offsets and the ground truth values becomes smaller than a certain threshold, the network tries to make the variances smaller in order to be accurate.
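A sketch of this uncertainty-aware loss, following the formulation of [7] up to additive constants (the function name is ours):

```python
import math

def kl_reg_loss(mu, var, y_g):
    """KL-based regression loss between a Dirac at y_g and N(mu, var):
    quadratic branch for small errors, smooth-l1-style branch otherwise."""
    err = abs(y_g - mu)
    if err <= 1:
        return err * err / (2 * var) + 0.5 * math.log(var)
    return (err - 0.5) / var + 0.5 * math.log(var)
```

Evaluating it confirms the behaviour above: for a fixed large error the loss decreases as the variance grows, while for small errors small variances are preferred.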
Smooth $\ell_1$ regression loss with sampling: We propose an alternative uncertainty-aware boundary regression loss in order to avoid the explicit use of distributions in the loss computation. In particular, at each training iteration, we sample from the predicted boundary offset distributions and compute the standard smooth $\ell_1$ loss. In this way, we approximate the expectation of the smooth $\ell_1$ loss during training.
More specifically, if $y_g$ is the ground truth value of a boundary offset, and $\mu$, $\sigma^2$ are the mean and the variance of the corresponding network prediction, at each iteration we sample $y$ from $\mathcal{N}(\mu, \sigma^2)$ and compute the smooth $\ell_1$ loss, i.e., the quantity $f(y - y_g)$. However, since the sampling operation is not a well-defined differentiable operation, and would thus render backpropagation impossible, we use the well-known reparameterization trick [12]. That is, by choosing one source of randomness, the univariate standard Gaussian $\varepsilon \sim \mathcal{N}(0,1)$, we express the boundary offset prediction as $y = \mu + \sigma\varepsilon$.
Thus, the regression loss can be represented by (where $f$ is the smooth $\ell_1$ loss of Eq. (3)):
$L_{sample} = f(\mu + \sigma\varepsilon - y_g), \quad \varepsilon \sim \mathcal{N}(0,1)$  (6)
In this way, we approximate during training the expected smooth $\ell_1$ loss, which can be analytically expressed as follows:
$\mathbb{E}[L_{sample}] = \frac{m^2+\sigma^2}{2}\big(\Phi(b)-\Phi(a)\big) + m\sigma\big(\phi(a)-\phi(b)\big) + \frac{\sigma^2}{2}\big(a\phi(a)-b\phi(b)\big) - \big(m+\tfrac{1}{2}\big)\Phi(a) + \big(m-\tfrac{1}{2}\big)\big(1-\Phi(b)\big) + \sigma\big(\phi(a)+\phi(b)\big)$  (7)
where $m = \mu - y_g$, $a = \frac{-1-m}{\sigma}$, $b = \frac{1-m}{\sigma}$, and $\phi$, $\Phi$ denote the pdf and the cdf of the standard normal distribution.
We show the expected loss in Fig. 2(b). Compared with the KL loss, it has a much weaker tendency to predict offsets with large variances when $|\mu - y_g|$ is large. The curves suggest that the network tends to optimize the means first, and then turns to the variances. A detailed derivation can be found in the appendix.
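A numpy sketch of one stochastic evaluation of this loss, together with the gradients that the reparameterization trick makes available (assuming the smooth $\ell_1$ form; all names are ours):

```python
import numpy as np

def sampled_loss_and_grads(mu, sigma, y_g, rng):
    """One stochastic evaluation of the smooth-l1 loss on y = mu + sigma*eps.
    Since eps is the only source of randomness, d(loss)/d(mu) and
    d(loss)/d(sigma) follow directly by the chain rule."""
    eps = rng.standard_normal()
    x = mu + sigma * eps - y_g
    if abs(x) <= 1:
        loss, dx = 0.5 * x * x, x          # quadratic branch: f'(x) = x
    else:
        loss, dx = abs(x) - 0.5, np.sign(x)  # linear branch: f'(x) = sign(x)
    return loss, dx, dx * eps              # loss, d/d(mu), d/d(sigma)

# averaging many samples approximates the expectation in Eq. (7)
rng = np.random.default_rng(0)
vals = [sampled_loss_and_grads(0.2, 0.5, 0.2, rng) for _ in range(20000)]
mc_mean = float(np.mean([v[0] for v in vals]))
mean_grad_sigma = float(np.mean([v[2] for v in vals]))
```

With a correct prediction ($\mu = y_g$) the average gradient with respect to $\sigma$ is positive, i.e., the loss pushes the variance down, consistent with the observation that this loss keeps variances small.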
4 Experiments
Dataset: We evaluate the proposed methods on the popular THUMOS'14 [10] dataset, which contains 200 validation and 213 testing untrimmed videos, temporally annotated with 20 action classes. Following the standard practice [2, 17, 15], we train our models on the validation set and evaluate them on the testing set.
Implementation details: Our baseline method is illustrated in Fig. 1. The input dimensionality of the features that feed the first FC layer is $K \cdot d$, where $K$ is the user-defined hyper-parameter discussed in Sect. 3.1 (the number of parts each proposal is divided into) and $d$ is the feature dimension of the units (each unit consists of 16 frames) of the input proposal. The first FC layer has 1000 hidden units that feed the second FC layer, which outputs two branches. The first one predicts the actioness score (whether the input proposal depicts an action or background), and the second one predicts classification and regression scores. In the baseline case, this branch outputs, for each of the $C$ action classes, a classification score and the two boundary offsets; in the uncertainty-aware case, it additionally outputs the corresponding variances. During training, we used a batch size of 128.
4.1 One- vs two-stage networks
To demonstrate the usability of our one-stage network, we compare it to a similar two-stage architecture [25], for which we use proposals generated from our network. The two-stage network consists of a proposal generation network and a detection network, each of which has the same classification-regression structure as the corresponding part of ours. In Table 1 we show that the proposed one-stage network achieves comparable results by incorporating class-agnostic along with category-specific information in a single-stage network, with approximately half the parameters.
Table 1: Comparison between the two-stage network and our one-stage network on THUMOS'14.

mAP@IoU (%) | 0.3   | 0.4   | 0.5   | 0.6   | 0.7
Two-stage   | 49.68 | 44.67 | 36.48 | 24.29 | 13.59
One-stage   | 49.46 | 44.89 | 36.22 | 25.56 | 14.98
4.2 Uncertainty-aware losses
In this section, we compare the uncertainty-aware KL and the expected-sampling boundary regression losses on the THUMOS'14 dataset for the problem of temporal action detection. In Fig. 3, we visualize the means and variances of the offsets learned after training with each of the two losses. Moreover, in Table 2 we report the performance of the two networks. We observe that, compared to the sampling loss, the KL loss encourages learning larger variances. As discussed in Sect. 3, the network can learn more from "easy" samples and ignore the "hard" ones by increasing their variances, which enhances the detection performance and boosts the baseline at all tIoUs, by roughly 1–3% (see Table 2).
While for the sampling loss the variances look smaller compared to the KL loss (see Fig. 3(b)), the uncertainty is constrained dynamically as the distance between ground truth and prediction becomes smaller. It boosts the performance by up to 1.5% (see Table 2) at all tIoUs apart from 0.7, by keeping the uncertainty at a relatively low level.
KL vs sampling regression loss: Using the KL boundary regression loss arrives at slightly better results than using the sampling loss. We argue that, due to the extreme imbalance between positive and negative proposals generated by sliding windows, it is more important to suppress the negative, noisy samples than to boost the accuracy of the positive boundary predictions. The KL divergence can suppress the negative proposals by enlarging the corresponding variances, whereas sampling yields more realistic variances by boosting the regression accuracy, which explains the difference.
4.3 Comparison to state-of-the-art
In Table 2 we report the experimental results of the two uncertainty-aware losses compared to several related works. We note that the KL loss achieves the second highest performance among the single-stage methods, and is even comparable with current two-stage methods; we highlight that, with the uncertainty estimation, our result outperforms the other one-stage methods apart from [9] by a large margin of more than 6% at all tIoUs, without bells and whistles. As for [9], its main-stream branch uses [16] as the backbone network (they considerably improve the backbone's mAP@tIoU=0.5, which is still not as good as ours, 37.9%), but it has two extra branches to deal with. That is, a proposal generation branch and a classification branch need to be trained as well, which triples the parameters of [16] to achieve the reported performance.

Table 2: Comparison with state-of-the-art methods on THUMOS'14 (mAP@IoU, %).

Method            | 0.3  | 0.4  | 0.5  | 0.6  | 0.7
Two-stage methods
CDC [22]          | 40.1 | 29.4 | 23.3 | 13.1 | 7.9
SSN [30]          | 51.9 | 41.0 | 29.8 | –    | –
CBR [5]           | 50.1 | 41.3 | 31.0 | 19.1 | 9.9
Faster R-CNN [2]  | 53.2 | 48.5 | 42.8 | 33.8 | 20.8
BSN [17]          | 53.5 | 45.0 | 36.9 | 28.4 | 20.0
TAD [25]          | 52.5 | 46.6 | 37.4 | 24.5 | 12.4
TBN [29]          | 53.8 | 47.1 | 39.1 | 29.7 | 20.8
BMN [15]          | 56.0 | 47.4 | 38.8 | 29.7 | 20.5
GTAN [20]         | 57.8 | 47.2 | 38.8 | –    | –
One-stage methods
RL [27]           | 36.0 | 26.4 | 17.1 | –    | –
SSAD [16]         | 43.0 | 35.0 | 24.6 | 15.4 | 7.7
SAP [8]           | –    | –    | 27.7 | –    | –
SSTAD [1]         | 45.7 | –    | 29.2 | –    | 9.6
Decoup-SSAD [9]   | 60.2 | 54.1 | 44.2 | 32.3 | 19.1
Ours
Baseline          | 49.5 | 44.9 | 36.2 | 25.6 | 15.0
Sampling          | 50.5 | 45.1 | 37.7 | 26.1 | 14.9
KL                | 51.8 | 47.7 | 37.9 | 27.6 | 16.0
5 Conclusion
In this paper we propose an uncertainty-aware boundary regression loss for the problem of temporal action localization in videos. We model boundary offset predictions as univariate Gaussian distributions and we compute the expectation of the smooth $\ell_1$ loss for improving localization. We compare with another uncertainty-aware loss that explicitly uses the predicted distributions, which we apply to the problem of temporal action localization for the first time. In the future, we intend to investigate the use of the predicted variances in the test phase in the direction of improving inference.
References
[1] (2017) End-to-end, single-stream temporal action detection in untrimmed videos. In BMVC.
[2] (2018) Rethinking the Faster R-CNN architecture for temporal action localization. In CVPR, pp. 1130–1139.
[3] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[4] (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In ICML, pp. 1050–1059.
[5] (2017) Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180.
[6] (2017) TURN TAP: temporal unit regression network for temporal action proposals. arXiv preprint arXiv:1703.06189.
[7] (2019) Bounding box regression with uncertainty for accurate object detection. In CVPR, pp. 2888–2897.
[8] (2018) SAP: self-adaptive proposal model for temporal action detection based on reinforcement learning. In AAAI.
[9] (2019) Decoupling localization and classification in single shot temporal action detection. arXiv preprint arXiv:1904.07442.
[10] (2014) THUMOS challenge: action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/.
[11] (2017) What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, pp. 5574–5584.
[12] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[13] (2011) Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, pp. 3361–3368.
[14] (2019) Fast learning of temporal action proposal via dense boundary generator. arXiv preprint arXiv:1911.04127.
[15] (2019) BMN: boundary-matching network for temporal action proposal generation. In ICCV.
[16] (2017) Single shot temporal action detection. In ACM Multimedia, pp. 988–996.
[17] (2018) BSN: boundary sensitive network for temporal action proposal generation. In ECCV.
[18] (2016) SSD: single shot multibox detector. In ECCV, pp. 21–37.
[19] (2019) Multi-granularity generator for temporal action proposal. In CVPR, pp. 3604–3613.
[20] (2019) Gaussian temporal awareness networks for action localization. In CVPR, pp. 344–353.
[21] (2019) Probabilistic face embeddings. arXiv preprint arXiv:1904.09658.
[22] (2017) CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR, pp. 1417–1426.
[23] (2016) Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR, pp. 1049–1058.
[24] (2014) Two-stream convolutional networks for action recognition in videos. In NIPS, pp. 568–576.
[25] (2019) Exploring feature representation and training strategies in temporal action localization. In ICIP.
[26] (2017) R-C3D: region convolutional 3D network for temporal activity detection. In ICCV.
[27] (2016) End-to-end learning of action detection from frame glimpses in videos. In CVPR, pp. 2678–2687.
[28] (2019) Graph convolutional networks for temporal action localization. In ICCV, pp. 7094–7103.
[29] (2019) Boundary information matters more: accurate temporal action detection with temporal boundary network. In ICASSP, pp. 1642–1646.
[30] (2017) Temporal action detection with structured segment networks. In ICCV.
6 Detailed Derivation of the expectation of the smooth $\ell_1$ loss
Lemma 1. Suppose the predicted offset is $y = \mu + \sigma\varepsilon$, $\varepsilon \sim \mathcal{N}(0,1)$, and the corresponding ground truth is $y_g$. The smooth $\ell_1$ loss is defined by:
$f(x) = \begin{cases}\frac{1}{2}x^2, & |x|\le 1\\ |x|-\frac{1}{2}, & |x|>1\end{cases}$
where $x = y - y_g$. The expectation of the loss can be analytically expressed as follows:
$\mathbb{E}[f(x)] = \frac{m^2+\sigma^2}{2}\big(\Phi(b)-\Phi(a)\big) + m\sigma\big(\phi(a)-\phi(b)\big) + \frac{\sigma^2}{2}\big(a\phi(a)-b\phi(b)\big) - \big(m+\tfrac{1}{2}\big)\Phi(a) + \big(m-\tfrac{1}{2}\big)\big(1-\Phi(b)\big) + \sigma\big(\phi(a)+\phi(b)\big)$
where $m = \mu - y_g$, $a = \frac{-1-m}{\sigma}$, $b = \frac{1-m}{\sigma}$, and $\phi$, $\Phi$ denote the pdf and the cdf of the standard normal distribution.
Proof.
According to Eq. 6, $x = m + \sigma\varepsilon$, so $|x| \le 1$ if and only if $\varepsilon \in [a, b]$. We can therefore split the expectation into three parts:
$\mathbb{E}[f(x)] = \int_a^b \tfrac{1}{2}(m+\sigma\varepsilon)^2\phi(\varepsilon)\,d\varepsilon + \int_{-\infty}^a \big(-(m+\sigma\varepsilon)-\tfrac{1}{2}\big)\phi(\varepsilon)\,d\varepsilon + \int_b^{\infty}\big((m+\sigma\varepsilon)-\tfrac{1}{2}\big)\phi(\varepsilon)\,d\varepsilon$  (8)
We conquer the three parts one by one, using the standard Gaussian identities:
$\int_a^b \phi(\varepsilon)\,d\varepsilon = \Phi(b)-\Phi(a), \quad \int_a^b \varepsilon\phi(\varepsilon)\,d\varepsilon = \phi(a)-\phi(b), \quad \int_a^b \varepsilon^2\phi(\varepsilon)\,d\varepsilon = \Phi(b)-\Phi(a) + a\phi(a) - b\phi(b)$  (9)
For the first part, expanding the square and applying the identities of Eq. (9):
$\int_a^b \tfrac{1}{2}(m+\sigma\varepsilon)^2\phi(\varepsilon)\,d\varepsilon = \frac{m^2+\sigma^2}{2}\big(\Phi(b)-\Phi(a)\big) + m\sigma\big(\phi(a)-\phi(b)\big) + \frac{\sigma^2}{2}\big(a\phi(a)-b\phi(b)\big)$  (10)
For the second part, since $\int_{-\infty}^a \varepsilon\phi(\varepsilon)\,d\varepsilon = -\phi(a)$:
$\int_{-\infty}^a \big(-(m+\sigma\varepsilon)-\tfrac{1}{2}\big)\phi(\varepsilon)\,d\varepsilon = -\big(m+\tfrac{1}{2}\big)\Phi(a) + \sigma\phi(a)$  (11)
For the third part, since $\int_b^{\infty} \varepsilon\phi(\varepsilon)\,d\varepsilon = \phi(b)$:
$\int_b^{\infty}\big((m+\sigma\varepsilon)-\tfrac{1}{2}\big)\phi(\varepsilon)\,d\varepsilon = \big(m-\tfrac{1}{2}\big)\big(1-\Phi(b)\big) + \sigma\phi(b)$  (12)
Summing Eqs. (10)–(12) yields the claimed expression. As a sanity check, when $\sigma \to 0$ and $|m| < 1$, we have $\Phi(a) \to 0$ and $\Phi(b) \to 1$, and the expectation reduces to $f(m) = \tfrac{1}{2}m^2$; for $|m| \gg 1$ it behaves like $|m| - \tfrac{1}{2}$, i.e., it no longer rewards enlarging $\sigma$, in agreement with the behaviour shown in Fig. 2(b). ∎
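Assuming the smooth $\ell_1$ loss form, the closed-form expectation stated above can be checked numerically against a Monte Carlo estimate (function names are ours):

```python
import math
import numpy as np

def smooth_l1(x):
    x = abs(x)
    return 0.5 * x * x if x <= 1 else x - 0.5

def expected_smooth_l1(mu, sigma, y_g):
    """Closed-form E[f(mu + sigma*eps - y_g)], eps ~ N(0,1), per Lemma 1."""
    phi = lambda t: math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)  # std normal pdf
    Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))           # std normal cdf
    m = mu - y_g
    a, b = (-1 - m) / sigma, (1 - m) / sigma
    return ((m * m + sigma**2) / 2 * (Phi(b) - Phi(a))
            + m * sigma * (phi(a) - phi(b))
            + sigma**2 / 2 * (a * phi(a) - b * phi(b))
            - (m + 0.5) * Phi(a)
            + (m - 0.5) * (1 - Phi(b))
            + sigma * (phi(a) + phi(b)))

# Monte Carlo check of the closed form
rng = np.random.default_rng(0)
mu, sigma, y_g = 0.7, 0.8, 0.2
errors = mu + sigma * rng.standard_normal(100_000) - y_g
mc = float(np.mean([smooth_l1(e) for e in errors]))
assert abs(mc - expected_smooth_l1(mu, sigma, y_g)) < 1e-2
```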