Background Modeling via Uncertainty Estimation for Weakly-supervised Action Localization

06/12/2020 ∙ by Pilhyeon Lee, et al. ∙ Microsoft ∙ Yonsei University

Weakly-supervised temporal action localization aims to detect intervals of action instances with only video-level action labels for training. A crucial challenge is to separate frames of action classes from the remaining frames, denoted as background frames (i.e., frames not belonging to any action class). Previous methods attempt background modeling by either synthesizing pseudo background videos with static frames or introducing an auxiliary class for background. However, they overlook the essential fact that background frames can be dynamic and inconsistent. Accordingly, we cast the problem of identifying background frames as out-of-distribution detection and isolate it from conventional action classification. Beyond our base action localization network, we propose a module to estimate the probability of being background (i.e., uncertainty [20]), which allows us to learn uncertainty given only video-level labels via multiple instance learning. A background entropy loss is further designed to reject background frames by forcing them to have a uniform probability distribution over action classes. Extensive experiments verify the effectiveness of our background modeling and show that our method significantly outperforms state-of-the-art methods on the standard benchmarks, THUMOS'14 and ActivityNet (1.2 and 1.3). Our code and the trained model are available at




1 Introduction

Temporal action localization (TAL) is a very challenging problem of finding and classifying action intervals in untrimmed videos, which plays a crucial role in video understanding and analysis. Due to its wide applications (e.g., video surveillance Vishwakarma2012ASO, video summarization Lee2012DiscoveringIP; Xiong2019LessIM, action retrieval Ma2005AGF), TAL has drawn much attention from the research community. A lot of work has been done in the fully-supervised manner and achieved impressive progress shou2016temporal; shou2017cdc; xu2017r; zhao2017temporal; chao2018rethinking; lin2018bsn; long2019gaussian; Lin2019BMNBN; Zeng2019GraphCN. However, they suffer from the extremely high cost of acquiring precise temporal annotations.

To relieve the high-cost issue and enlarge the scalability, researchers direct their attention to the same task with weak supervision, namely weakly-supervised temporal action localization (WTAL). Among the various levels of weak supervision, thanks to the cheap cost, video-level action label is mainly utilized wang2017untrimmednets; singh2017hide; shou2018autoloc, while other types (e.g., temporal ordering bojanowski2014weakly; kuehne2017weakly; huang2016connectionist, frequencies of action instances Xu2019SegregatedTA; Narayan20193CNetCC) are also explored. In this paper, we follow the mainstream using only video-level labels, where each video is labeled as positive for action classes if it contains corresponding action frames and as negative otherwise. Note that a video may have multiple action classes as its label.

Existing work adopts attention mechanism nguyen2018weakly; Yuan2019MARGINALIZEDAA or multiple instance learning formulation paul2018w; Narayan20193CNetCC to predict frame-level class scores from video-level labels. Nonetheless, weakly-supervised methods still show highly inferior performances when compared to fully-supervised ones. According to the recent literature Xu2019SegregatedTA; liu2019completeness; Nguyen2019WeaklySupervisedAL; lee2020background, the performance degradation comes from the false alarms of complex background frames, since video-level labels do not have any clue for background. To bridge the gap, some studies liu2019completeness; Nguyen2019WeaklySupervisedAL; lee2020background attempt background modeling in the weakly-supervised setting. Liu et al. liu2019completeness synthesize pseudo background videos by merging static frames and label them as the background class, assuming all background frames are static. However, the assumption is too strong and cannot cover all scenarios, as some background frames could be dynamic (Fig. 1(a)). Nguyen et al. Nguyen2019WeaklySupervisedAL and BaS-Net lee2020background observe that every untrimmed video has background frames, and identify such background frames as the background class. Still, it is difficult to force all background frames to belong to one specific class, as they may have very inconsistent appearances or semantics (Fig. 1(b)).

In this paper, we embrace the observation on dynamism and inconsistency in background frames, and propose to formulate the problem of rejecting background frames as an out-of-distribution detection problem hendrycks2016baseline; liang2018enhancing, where action and background are mapped to in-distribution and out-of-distribution, respectively. To detect out-of-distribution samples, it is natural to estimate the probability that a sample is from out-of-distribution, also known as uncertainty bendale2016towards; lakshminarayanan2017simple; Dhamija2018ReducingNA. Accordingly, we aim to estimate uncertainty (i.e., the probability of being background) to identify background frames. Regarding that background frames should have low scores for all action classes, we utilize feature magnitudes to model uncertainty, i.e., our model needs to produce feature vectors with large magnitudes for action frames while ones with small magnitudes for background frames. Unfortunately, it is inappropriate to directly handle individual frames, as we do not have frame-level labels in the weakly-supervised setting.

In order to learn uncertainty only with video-level supervision, we leverage the formulation of multiple instance learning Maron1998MultipleInstanceAF; Andrews2002SupportVM; zhou2004multi, where a model is trained with a bag (i.e., untrimmed video) instead of instances (i.e., frames). From each untrimmed video, we select top-k and bottom-k frames in terms of the feature magnitude and consider them as pseudo action and background frames, respectively. Thereafter, we design uncertainty modeling loss to manipulate the magnitudes of pseudo action/background features, which enables our model to learn uncertainty without frame-level labels. Moreover, we introduce background entropy loss to force pseudo background frames to have uniform probability distribution for action classes. By jointly optimizing the losses for background modeling along with a general action classification loss, our model successfully separates action frames from background frames, achieving a new state-of-the-art performance with a large margin on THUMOS’14 and ActivityNet. The effectiveness of our method is verified by ablation study.

(a) An example of dynamic background frames from a SoccerPenalty video

(b) An example of inconsistent background frames from a GolfSwing video

Figure 1: Observation on dynamism and inconsistency of background frames from THUMOS’14. It should be noted that none of them contain any action instance, i.e., they are all background frames. (a) The frames in the red box showing soccer players celebrating are very dynamic, even though they are background frames. (b) There are two types of background frames: black scenes with subtitles (green box) and a golfer preparing to shoot (blue box). These two types have very inconsistent appearances.

Summary of contributions. Our contributions are three-fold: 1) We are the first to formulate background frames as out-of-distribution, overcoming the difficulty in modeling background with regard to their unconstrained and inconsistent properties. 2) We design a new framework for weakly-supervised action localization, where uncertainty is learned for background modeling only with video-level labels via multiple instance learning. 3) We further encourage separation between action and background with a loss maximizing the entropy of action probability distribution from background frames.

2 Related Work

Fully-supervised action localization. The goal of temporal action localization is to find temporal intervals of action instances from long untrimmed videos and classify them. For the task, many approaches depend on accurate temporal annotations for each training video, i.e., start time and end time of action instances. Most of them first generate proposals, and then classify them. To generate proposals, some methods adopt the sliding window method shou2016temporal; yuan2016temporal; shou2017cdc; yang2018exploring; xiong2017pursuit; chao2018rethinking, while others predict start and end frames of action instances lin2018bsn; Lin2019BMNBN. Moreover, there are methods based on Gaussian modeling of each action instance long2019gaussian and an efficient method without proposal generation alwassel2018action. It should be noted that fully-supervised methods leverage frame-level annotations to distinguish action and background frames, while weakly-supervised ones do not.

Weakly-supervised action localization. Recently, due to the extremely high cost of frame-wise annotations, many attempts have been made to solve temporal action localization with weak supervision, mostly video-level labels. UntrimmedNets wang2017untrimmednets first tackles the problem by selecting relevant segments in soft and hard ways. STPN nguyen2018weakly forces the model to select sparse action segments, while Hide-and-seek singh2017hide and MAAN Yuan2019MARGINALIZEDAA extend discriminative parts by randomly hiding or sampling segments, respectively. W-TALC paul2018w and 3C-Net Narayan20193CNetCC both employ deep metric learning to force features from the same action to be closer to each other than to those from different action classes. Meanwhile, AutoLoc shou2018autoloc and CleanNet Liu2019WeaklyST attempt to regress the intervals of action instances, instead of performing hard thresholding. TSM Yu2019TemporalSM proposes to model each action instance as a multi-phase process and predict the evolving sequence of phases. There are also several studies exploiting additional information, e.g., the frequency of action instances Xu2019SegregatedTA; Narayan20193CNetCC or human pose zhang2020MultiinstanceMA.

Apart from the methods above, Liu et al. liu2019completeness, Nguyen et al. Nguyen2019WeaklySupervisedAL and BaS-Net lee2020background seek to explicitly model background. However, as mentioned in Sec. 1, they have innate limitations in that background frames could be dynamic and inconsistent, which leads to difficulty in separating background. In contrast, we consider background as out-of-distribution regarding its properties and propose to learn uncertainty as well as action class scores. In experiments, the effectiveness of our approach is verified, and it outperforms the state-of-the-art methods by a large margin.

Out-of-distribution detection. The aim of out-of-distribution detection is to determine whether an input sample comes from in-distribution (i.e., training distribution) or not hendrycks2016baseline; liang2018enhancing; Dhamija2018ReducingNA. The problem has also been studied in several different ways such as open-set recognition bendale2015towards; bendale2016towards, outlier rejection xu2014deep; geifman2017selective, anomaly detection chalapathy2019deep, or uncertainty estimation graves2011practical; gal2016dropout; springenberg2016bayesian; lakshminarayanan2017simple; malinin2018predictive; malinin2019reverse; malinin2019EnsembleDD. To tackle the problem, ODIN liang2018enhancing uses temperature scaling in the softmax function and adds small perturbations to the input. Meanwhile, Lakshminarayanan et al. lakshminarayanan2017simple and Dhamija et al. Dhamija2018ReducingNA predict uncertainty of samples by using the MLP ensemble model and the feature magnitudes, respectively. Moreover, OpenMax bendale2016towards directly estimates uncertainty without using any out-of-distribution sample for training.

3 Method

In this section, we provide the details of the proposed method. The overview of our architecture is illustrated in Fig. 2. We first set up our baseline with the conventional pipeline for weakly-supervised action localization (Sec. 3.1). Next, we cast background identification problem as out-of-distribution detection and tackle it by modeling uncertainty (Sec. 3.2). Thereafter, the objective functions to train our model are introduced (Sec. 3.3). Lastly, we explain how the inference is performed (Sec. 3.4).

3.1 Main pipeline

Feature extraction.

Due to the memory constraint, we first split each video into multi-frame non-overlapping segments, $v_n = \{s_{n,l}\}_{l=1}^{L_n}$, where $L_n$ denotes the number of segments in the $n$-th video $v_n$. To handle the large variation in video lengths, a fixed number of $T$ segments $\{\tilde{s}_{n,t}\}_{t=1}^{T}$ are sampled from each original video. Then spatio-temporal features $x^{\text{RGB}}_{n,t} \in \mathbb{R}^{D}$ and $x^{\text{flow}}_{n,t} \in \mathbb{R}^{D}$ are extracted from the sampled RGB and flow segments, respectively. Note that any feature extractor can be used. Afterwards, we concatenate the RGB and flow features into complete feature vectors $x_{n,t} \in \mathbb{R}^{2D}$, which are then stacked to build a feature map of length $T$, i.e., $X_n = [x_{n,1}, \ldots, x_{n,T}] \in \mathbb{R}^{2D \times T}$.

Figure 2: Overview of the proposed method. The main pipeline serves the conventional process for weakly-supervised action localization. In the background modeling part, the pseudo action/background segments are selected based on feature magnitudes, which are used to calculate two losses for background modeling: (a) uncertainty modeling loss which enlarges and reduces the feature magnitudes of the pseudo action and background segments respectively, and (b) background entropy loss forcing the pseudo background segments to have uniform action probability distribution.

Feature embedding.

To embed the extracted features, we feed them into a single 1-D convolutional layer followed by ReLU activation. Formally, $F_n = g_{\text{embed}}(X_n; \phi_{\text{embed}})$, where $g_{\text{embed}}$ denotes the convolution operator with the activation function and $\phi_{\text{embed}}$ denotes the learnable parameters of the convolutional layer. Concretely, the dimension of the embedded features is the same as that of the input features, i.e., $F_n = [f_{n,1}, \ldots, f_{n,T}] \in \mathbb{R}^{2D \times T}$.

Segment-level classification.

From the embedded features, we predict segment-level class scores, which are later used for action localization. For the $n$-th video $v_n$, the class scores are derived by the action classifier, i.e., $\mathcal{A}_n = g_{\text{cls}}(F_n; \phi_{\text{cls}})$, where $g_{\text{cls}}$ represents the linear classifier with parameters $\phi_{\text{cls}}$, $\mathcal{A}_n \in \mathbb{R}^{C \times T}$ denotes the segment-level action scores, and $C$ is the number of action classes.
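The embedding and classification steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel size of 3, the zero padding, the time-major `(T, D)` layout, the toy sizes, and the function names `conv1d_relu` and `segment_scores` are all choices made for this example.

```python
import numpy as np

def conv1d_relu(x, weight, bias):
    """Temporal 1-D convolution with 'same' zero padding, followed by ReLU.

    x: (T, D_in) features, weight: (K, D_in, D_out) with odd K, bias: (D_out,).
    """
    kernel_size = weight.shape[0]
    pad = kernel_size // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad along time only
    out = np.stack([
        np.einsum("kd,kde->e", padded[t:t + kernel_size], weight)
        for t in range(x.shape[0])
    ]) + bias
    return np.maximum(out, 0.0)

def segment_scores(embedded, cls_weight, cls_bias):
    """Linear classifier applied to every segment: (T, D) -> (T, C)."""
    return embedded @ cls_weight + cls_bias

rng = np.random.default_rng(0)
T, D, C = 6, 4, 3                            # toy sizes, not the paper's
x = rng.normal(size=(T, D))
w_embed = rng.normal(size=(3, D, D)) * 0.1   # kernel size 3 is an assumption
embedded = conv1d_relu(x, w_embed, np.zeros(D))
scores = segment_scores(embedded, rng.normal(size=(D, C)), np.zeros(C))
```

The embedded features keep the input dimension, as the text requires, and every segment receives one raw score per action class.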

Action score aggregation.

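Adopting multiple instance learning Maron1998MultipleInstanceAF; Andrews2002SupportVM; zhou2004multi, we aggregate the top $k^{\text{act}}$ scores along all segments for each action class and average them to build a video-level class score:

$$a_c(v_n) = \frac{1}{k^{\text{act}}} \max_{\substack{\hat{\mathcal{A}}_{n;c} \subset \mathcal{A}_n[c,:] \\ |\hat{\mathcal{A}}_{n;c}| = k^{\text{act}}}} \sum_{a \in \hat{\mathcal{A}}_{n;c}} a, \tag{1}$$

where $\hat{\mathcal{A}}_{n;c}$ is the subset containing $k^{\text{act}}$ action scores for class $c$, and $k^{\text{act}}$ is a hyper-parameter controlling the number of the aggregated segments.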

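Thereafter, we obtain the video-level action probability for each action class by applying the softmax function to the aggregated scores:

$$p_c(v_n) = \frac{\exp\left(a_c(v_n)\right)}{\sum_{c'=1}^{C} \exp\left(a_{c'}(v_n)\right)}, \tag{2}$$

where $p_c(v_n)$ represents the softmax score for the $c$-th action of the $n$-th video.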
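The top-k aggregation and the softmax described above can be sketched as follows. The function name `video_level_probs`, the time-major `(T, C)` layout, and the toy scores are assumptions made for illustration only.

```python
import numpy as np

def video_level_probs(segment_scores, k):
    """Aggregate segment-level scores (T, C) into video-level probabilities (C,)."""
    # Sort each class's scores along time and keep the k largest.
    top_k = np.sort(segment_scores, axis=0)[-k:]
    # Average the top-k scores to obtain one video-level score per class.
    video_scores = top_k.mean(axis=0)
    # Numerically stable softmax over classes.
    exp = np.exp(video_scores - video_scores.max())
    return exp / exp.sum()

# Toy example: 4 segments, 2 classes.
toy_scores = np.array([[0.1, 1.0],
                       [2.0, 0.5],
                       [3.0, 0.4],
                       [0.2, 0.9]])
probs = video_level_probs(toy_scores, k=2)
```

With `k=2`, the first class is scored by the mean of 2.0 and 3.0 and the second by the mean of 0.9 and 1.0, so the first class receives the larger video-level probability.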

3.2 Considering background as out-of-distribution

Decomposition of action localization.

From the main pipeline, we obtain the action probabilities for each segment, but the essential component for action localization, i.e., background identification, is not carefully considered. Regarding the unconstrained and inconsistent nature of background frames, we treat background as out-of-distribution bendale2016towards; lakshminarayanan2017simple; Dhamija2018ReducingNA. Considering the probability for class $c$ of segment $\tilde{s}_{n,t}$, it can be decomposed into two parts with the chain rule, i.e., the in-distribution action classification and the background identification. Let $d \in \{0, 1\}$ denote the variable for the background identification. If the segment belongs to any action class, let $d = 1$; otherwise $d = 0$ (it belongs to background). Then, the posterior probability for class $c$ of $\tilde{s}_{n,t}$ is given by:

$$P(y_{n,t} = c \mid \tilde{s}_{n,t}) = P(y_{n,t} = c \mid d = 1, \tilde{s}_{n,t}) \, P(d = 1 \mid \tilde{s}_{n,t}), \tag{3}$$

where $y_{n,t}$ is the label of the corresponding segment, i.e., if $\tilde{s}_{n,t}$ belongs to the $c$-th action class, then $y_{n,t} = c$, while $y_{n,t} = 0$ for background segments.

Uncertainty modeling.

In Eq. 3, the probability for in-distribution action classification, i.e., $P(y_{n,t} = c \mid d = 1, \tilde{s}_{n,t})$, is estimated with the softmax function as in a general classification task. Additionally, it is necessary to model the probability that a segment belongs to any action class, i.e., $P(d = 1 \mid \tilde{s}_{n,t})$, to tackle the background identification problem. Assuming that background frames should produce low scores for all action classes, we model uncertainty with the magnitudes of feature vectors, namely, background features have small magnitudes, while action features have large ones. Then the probability that the $t$-th segment in the $n$-th video ($\tilde{s}_{n,t}$) is an action segment is defined by:

$$P(d = 1 \mid \tilde{s}_{n,t}) = \frac{\min\left(m, \lVert f_{n,t} \rVert\right)}{m}, \tag{4}$$

where $f_{n,t}$ is the corresponding feature vector of $\tilde{s}_{n,t}$, $\lVert \cdot \rVert$ is a norm function (we use the L2 norm here), and $m$ is the pre-defined maximum feature magnitude. From the equation, it is ensured that the probability falls between 0 and 1, i.e., $0 \le P(d = 1 \mid \tilde{s}_{n,t}) \le 1$.
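As a rough sketch of the magnitude-based uncertainty and the posterior of Eq. 3, the snippet below clips the L2 norm of each feature at the maximum magnitude and divides by it. This clip-and-divide form is one form consistent with the description (a probability bounded by a pre-defined maximum magnitude) and should be read as an assumption, as should the names `action_probability`, `segment_posterior`, and `max_magnitude`.

```python
import numpy as np

def action_probability(features, max_magnitude):
    """Probability of being an action segment from feature magnitudes.

    features: (T, D). Returns (T,) values in [0, 1].
    """
    magnitudes = np.linalg.norm(features, axis=1)  # L2 norm per segment
    return np.minimum(magnitudes, max_magnitude) / max_magnitude

def segment_posterior(segment_scores, features, max_magnitude):
    """Softmax class probability times action probability, per segment."""
    shifted = segment_scores - segment_scores.max(axis=1, keepdims=True)
    softmax = np.exp(shifted)
    softmax /= softmax.sum(axis=1, keepdims=True)
    return softmax * action_probability(features, max_magnitude)[:, None]

toy_features = np.array([[3.0, 4.0], [0.0, 0.0], [30.0, 40.0]])
p_action = action_probability(toy_features, max_magnitude=10.0)
posterior = segment_posterior(np.zeros((3, 2)), toy_features, max_magnitude=10.0)
```

A zero-magnitude feature yields a posterior of zero for every class, which is how a background segment is rejected regardless of its softmax scores.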

Multiple instance learning.

To learn uncertainty only with video-level labels, we borrow the concept of multiple instance learning Maron1998MultipleInstanceAF; Andrews2002SupportVM; zhou2004multi, where a model is trained with a bag (i.e., untrimmed video), rather than instances (i.e., segments). In this setting, we select the top $k^{\text{act}}$ segments in terms of the feature magnitude, and treat them as the pseudo action segments $\{\tilde{s}_{n,i} \mid i \in \mathcal{S}^{\text{act}}\}$, where $\mathcal{S}^{\text{act}}$ indicates the set of pseudo action indices. Meanwhile, the bottom $k^{\text{bkg}}$ segments are considered as the pseudo background segments $\{\tilde{s}_{n,j} \mid j \in \mathcal{S}^{\text{bkg}}\}$, where $\mathcal{S}^{\text{bkg}}$ denotes the set of indices for pseudo background. $k^{\text{act}}$ and $k^{\text{bkg}}$ represent the number of segments sampled for action and background, respectively. Then the pseudo action/background segments serve as the representatives of the input untrimmed video, and they are used for training the model with video-level labels.
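The selection of pseudo action and background segments by feature magnitude can be sketched as below. The function name `select_pseudo_segments` and the parameter names `k_act` and `k_bkg` are illustrative, and the tie-breaking behavior is simply that of `np.argsort`.

```python
import numpy as np

def select_pseudo_segments(features, k_act, k_bkg):
    """Return (action indices, background indices) for one video.

    features: (T, D). The k_act largest-magnitude segments are treated as
    pseudo action, the k_bkg smallest-magnitude ones as pseudo background.
    """
    magnitudes = np.linalg.norm(features, axis=1)
    order = np.argsort(magnitudes)   # ascending by magnitude
    bkg_idx = order[:k_bkg]
    act_idx = order[-k_act:]
    return act_idx, bkg_idx

toy_features = np.array([[1.0, 0.0], [5.0, 0.0], [0.1, 0.0], [3.0, 0.0], [4.0, 0.0]])
act_idx, bkg_idx = select_pseudo_segments(toy_features, k_act=2, k_bkg=1)
```

No frame-level label is used here: the video acts as the bag, and the extremes of the magnitude ranking stand in for its action and background instances.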

3.3 Training objectives

Our model is optimized with three losses: 1) video-level classification loss $\mathcal{L}_{\text{cls}}$ for action classification of each input video, 2) uncertainty modeling loss $\mathcal{L}_{\text{um}}$ which manipulates the magnitudes of action and background feature vectors for background identification, and 3) background entropy loss $\mathcal{L}_{\text{be}}$ for preventing background segments from having a high probability for any action class. The overall loss function is as follows:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{cls}} + \alpha \mathcal{L}_{\text{um}} + \beta \mathcal{L}_{\text{be}}, \tag{5}$$

where $\alpha$ and $\beta$ are hyper-parameters for balancing the losses.

Video-level classification loss.

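For multi-label action classification, we use the binary cross entropy loss with normalized video-level labels wang2017untrimmednets as follows:

$$\mathcal{L}_{\text{cls}} = \frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} -y_{n;c} \log p_c(v_n), \tag{6}$$

where $p_c(v_n)$ represents the video-level softmax score for the $c$-th class of the $n$-th video (Eq. 2), $y_{n;c}$ is the normalized video-level label for the $c$-th class of the $n$-th video, and $N$ is the number of videos.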
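A minimal sketch of the video-level classification loss is given below. It assumes that "normalized" means dividing each multi-hot label vector by its number of positive classes so that it sums to one, and that every video has at least one positive class; the function name `video_classification_loss` is hypothetical.

```python
import numpy as np

def video_classification_loss(video_probs, labels):
    """Cross entropy between normalized labels and video-level softmax scores.

    video_probs: (N, C) softmax scores, labels: (N, C) multi-hot labels.
    Assumes every row of labels has at least one positive entry.
    """
    normalized = labels / labels.sum(axis=1, keepdims=True)
    per_video = np.sum(-normalized * np.log(video_probs), axis=1)
    return float(per_video.mean())

toy_probs = np.array([[0.5, 0.5], [0.25, 0.75]])
toy_labels = np.array([[1.0, 1.0], [0.0, 1.0]])
cls_loss = video_classification_loss(toy_probs, toy_labels)
```

For a video with two positive classes, a softmax output of 0.5 for each is the best achievable, which is why the labels are normalized before being compared with softmax scores.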

Uncertainty modeling loss.

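In order to learn uncertainty, we train the model to produce feature vectors with large magnitudes for pseudo action segments but ones with small magnitudes for pseudo background segments, as illustrated in Fig. 2 (a). Formally, uncertainty modeling loss takes the form:

$$\mathcal{L}_{\text{um}} = \frac{1}{N} \sum_{n=1}^{N} \left( \max\left(0, m - \lVert f_n^{\text{act}} \rVert\right) + \lVert f_n^{\text{bkg}} \rVert \right)^2, \tag{7}$$

where $f_n^{\text{act}} = \frac{1}{k^{\text{act}}} \sum_{i \in \mathcal{S}^{\text{act}}} f_{n,i}$ and $f_n^{\text{bkg}} = \frac{1}{k^{\text{bkg}}} \sum_{j \in \mathcal{S}^{\text{bkg}}} f_{n,j}$ are the mean features of the pseudo action and background segments of the $n$-th video, respectively. $\lVert \cdot \rVert$ is the norm function, and $m$ is the pre-defined maximum feature magnitude, the same as in Eq. 4.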
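The following sketch computes an uncertainty modeling loss for a single video. It assumes a hinge-style form that penalizes the mean pseudo action feature for falling short of the maximum magnitude and the mean pseudo background feature for having any magnitude at all, then squares the sum; this exact form and the name `uncertainty_modeling_loss` are assumptions for illustration.

```python
import numpy as np

def uncertainty_modeling_loss(features, act_idx, bkg_idx, max_magnitude):
    """Per-video loss pushing action magnitudes up to m and background ones to 0.

    features: (T, D); act_idx / bkg_idx: indices of pseudo action / background.
    """
    mean_act = features[act_idx].mean(axis=0)
    mean_bkg = features[bkg_idx].mean(axis=0)
    # Hinge term: zero once the action magnitude reaches the maximum.
    act_term = max(0.0, max_magnitude - float(np.linalg.norm(mean_act)))
    bkg_term = float(np.linalg.norm(mean_bkg))
    return (act_term + bkg_term) ** 2

toy_features = np.array([[6.0, 8.0], [0.0, 0.0], [3.0, 4.0]])
um_loss = uncertainty_modeling_loss(toy_features, act_idx=[0], bkg_idx=[1, 2],
                                    max_magnitude=10.0)
```

In this toy case the action feature already has magnitude 10, so only the background term (magnitude 2.5) contributes to the loss.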

Background entropy loss.

Though uncertainty modeling loss encourages background segments to produce low scores for all actions, softmax scores for some action classes could be high due to the relativeness of the softmax function. To prevent background segments from having a high softmax score for any action class, we define a loss function which maximizes the entropy of action probabilities of background segments, i.e., background segments are forced to have a uniform probability distribution for action classes as described in Fig. 2 (b). Background entropy loss is calculated as follows:

$$\mathcal{L}_{\text{be}} = \frac{1}{NC} \sum_{n=1}^{N} \sum_{c=1}^{C} -\log\left(p_c(\tilde{s}_n^{\text{bkg}})\right), \tag{8}$$

where $p_c(\tilde{s}_n^{\text{bkg}}) = \frac{1}{k^{\text{bkg}}} \sum_{j \in \mathcal{S}^{\text{bkg}}} p_c(\tilde{s}_{n,j})$ is the averaged action probability for the $c$-th class of the pseudo background segments, and $p_c(\tilde{s}_{n,j})$ is the softmax score for the $c$-th class of the segment $\tilde{s}_{n,j}$.
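A possible per-video form of the background entropy loss is sketched below. It assumes the loss is the mean negative log of the class probabilities averaged over the pseudo background segments, i.e., a cross entropy against the uniform distribution, which is smallest exactly when the averaged distribution is uniform. The name `background_entropy_loss` and the small clipping constant are choices made for this example.

```python
import numpy as np

def background_entropy_loss(bkg_scores, eps=1e-12):
    """Loss that is minimized when background class probabilities are uniform.

    bkg_scores: (K, C) raw class scores of the pseudo background segments.
    """
    shifted = bkg_scores - bkg_scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)      # softmax per segment
    mean_probs = probs.mean(axis=0)                # average over segments
    return float(-np.log(np.clip(mean_probs, eps, None)).mean())

uniform_loss = background_entropy_loss(np.zeros((2, 4)))
peaked_loss = background_entropy_loss(np.array([[10.0, 0.0, 0.0, 0.0]]))
```

With four classes the loss bottoms out at log 4 for a uniform distribution and grows as the background segments lean toward any single class.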

3.4 Inference

At test time, for an input video, we first obtain the video-level softmax score and threshold it with $\theta_{\text{vid}}$ to determine which action classes are to be localized. For the remaining action classes, we calculate the segment-level posterior probability by multiplying the segment-level softmax score and the probability of being an action segment as in Eq. 3. Afterwards, the segments whose posterior probabilities are larger than $\theta_{\text{seg}}$ are selected as the candidate segments. Finally, consecutive candidate segments are grouped into a single proposal. Since we use multiple thresholds for $\theta_{\text{seg}}$, non-maximum suppression (NMS) is performed for the proposals. We note that no duplicate proposal is allowed.
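The thresholding and grouping steps of inference can be sketched for one action class as follows. Proposals are represented as half-open `(start, end)` segment-index intervals, which is a convention chosen here; the function names `group_candidates` and `multi_threshold_proposals` are hypothetical, and duplicates across thresholds are dropped with a set.

```python
import numpy as np

def group_candidates(posterior, threshold):
    """Group consecutive segments whose posterior exceeds the threshold.

    posterior: (T,) values for one class. Returns [start, end) index pairs.
    """
    proposals, start = [], None
    for t, keep in enumerate(posterior > threshold):
        if keep and start is None:
            start = t                      # a new run of candidates begins
        elif not keep and start is not None:
            proposals.append((start, t))   # the run ended at t - 1
            start = None
    if start is not None:
        proposals.append((start, len(posterior)))
    return proposals

def multi_threshold_proposals(posterior, thresholds):
    """Collect proposals over several thresholds without duplicates."""
    unique = set()
    for threshold in thresholds:
        unique.update(group_candidates(posterior, threshold))
    return sorted(unique)

toy_posterior = np.array([0.1, 0.6, 0.7, 0.2, 0.9, 0.8, 0.95])
single = group_candidates(toy_posterior, 0.5)
merged = multi_threshold_proposals(toy_posterior, [0.5, 0.65])
```

A higher threshold shortens the first interval but leaves the second unchanged, so the shared interval appears only once in the merged pool.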

4 Experiments

4.1 Experimental settings


THUMOS’14 THUMOS14 is a widely used dataset for temporal action localization, containing 200 validation videos and 213 test videos of 20 action classes. It is very challenging as the lengths of the videos are diverse and actions occur frequently (on average 15 instances per video). We use validation videos for training and test videos for testing. On the other hand, ActivityNet caba2015activitynet is a large-scale benchmark containing two versions (1.2 and 1.3). ActivityNet 1.3, consisting of 200 action categories, has 10,024 training videos, 4,926 validation videos and 5,044 test videos. ActivityNet 1.2 is a subset of version 1.3, and is composed of 4,819 training videos, 2,383 validation videos and 2,480 test videos of 100 action classes. Because the ground-truths for the test videos of ActivityNet are withheld for the challenge, we utilize validation videos for evaluation.

Evaluation metrics.

We evaluate our method with mean average precisions (mAPs) under several different intersection over union (IoU) thresholds, which are the standard evaluation metrics for temporal action localization. The official evaluation code of ActivityNet is used for measuring mAPs.
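For reference, the temporal IoU that underlies these metrics can be computed as in the short sketch below. This is not the official ActivityNet evaluation code; the function name `temporal_iou` and the toy intervals are illustrative only.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) temporal intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

# A predicted interval overlapping half of a ground-truth interval.
iou = temporal_iou((0.0, 10.0), (5.0, 15.0))
# Whether the prediction would count as a match at each IoU threshold.
hits = {th: iou >= th for th in (0.3, 0.4, 0.5, 0.6, 0.7)}
```

Here the overlap is 5 and the union is 15, so the prediction matches only at the loosest threshold of 0.3, which illustrates why mAP drops as the IoU threshold increases.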

Implementation details.

We employ two different feature extractors, namely UntrimmedNets wang2017untrimmednets and I3D networks carreira2017quo pre-trained on ImageNet deng2009imagenet and Kinetics carreira2017quo, respectively. Each input segment consists of 5 frames for UntrimmedNets and 16 frames for I3D. It should be noted that we do not finetune the feature extractor for fair comparison. The TVL1 algorithm wedel2009improved is used to extract optical flow from videos. We fix the number of segments $T$ as 750 and 50 for THUMOS’14 and ActivityNet, respectively. The sampling method is the same as STPN nguyen2018weakly. The number of the pseudo action/background frames is determined by the ratio parameters, i.e., $k^{\text{act}} = T / r^{\text{act}}$ and $k^{\text{bkg}} = T / r^{\text{bkg}}$. All hyper-parameters ($m$, $r^{\text{act}}$, $r^{\text{bkg}}$, $\alpha$, $\beta$, and $\theta_{\text{vid}}$) are set by grid search. To enrich the proposal pool, we use multiple thresholds from 0 to 0.25 with a step size of 0.025 for $\theta_{\text{seg}}$, then perform non-maximum suppression (NMS) with an IoU threshold of 0.7.
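The threshold sweep and the NMS step can be sketched as follows. This is a generic greedy temporal NMS, not necessarily the exact procedure used by the authors: how a score is assigned to each proposal is not described here, so the `(start, end, score)` tuples are an assumption, as is the inclusion of both endpoints in the threshold range.

```python
def temporal_nms(proposals, iou_threshold):
    """Greedy NMS over (start, end, score) proposals, highest score first."""
    def iou(a, b):
        intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
        return intersection / union if union > 0 else 0.0

    kept = []
    for candidate in sorted(proposals, key=lambda p: p[2], reverse=True):
        # Keep a candidate only if it does not overlap a kept one too much.
        if all(iou(candidate, other) <= iou_threshold for other in kept):
            kept.append(candidate)
    return kept

# Thresholds 0, 0.025, ..., 0.25 (including both endpoints is an assumption).
thresholds = [round(0.025 * i, 3) for i in range(11)]

toy_proposals = [(0.0, 10.0, 0.9), (1.0, 10.0, 0.8), (20.0, 30.0, 0.7)]
kept = temporal_nms(toy_proposals, iou_threshold=0.7)
```

The second proposal overlaps the first with an IoU of 0.9 and is suppressed, while the third is disjoint and survives.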

| Supervision | Method | Feature | mAP@0.3 | mAP@0.4 | mAP@0.5 | mAP@0.6 | mAP@0.7 | AVG |
|---|---|---|---|---|---|---|---|---|
| Full | S-CNN shou2016temporal | - | 36.3 | 28.7 | 19.0 | 10.3 | 5.3 | 19.9 |
| Full | CDC shou2017cdc | - | 40.1 | 29.4 | 23.3 | 13.1 | 7.9 | 22.8 |
| Full | R-C3D xu2017r | - | 44.8 | 35.6 | 28.9 | - | - | - |
| Full | SSN zhao2017temporal | - | 51.9 | 41.0 | 29.8 | - | - | - |
| Full | TAL-Net chao2018rethinking | - | 53.2 | 48.5 | 42.8 | 33.8 | 20.8 | 39.8 |
| Full | BSN lin2018bsn | - | 53.5 | 45.0 | 36.9 | 28.4 | 20.0 | 36.8 |
| Full | GTAN long2019gaussian | - | 57.8 | 47.2 | 38.8 | - | - | - |
| Full | BMN Lin2019BMNBN | - | 56.0 | 47.4 | 38.8 | 29.7 | 20.5 | 38.5 |
| Full | P-GCN Zeng2019GraphCN | - | 63.6 | 57.8 | 49.1 | - | - | - |
| Weak† | STAR Xu2019SegregatedTA | I3D | 48.7 | 34.7 | 23.0 | - | - | - |
| Weak† | 3C-Net Narayan20193CNetCC | I3D | 44.2 | 34.1 | 26.6 | - | 8.1 | - |
| Weak† | PreTrimNet zhang2020MultiinstanceMA | I3D | 41.4 | 32.1 | 23.1 | 14.2 | 7.7 | 23.7 |
| Weak | UntrimmedNets wang2017untrimmednets | - | 28.2 | 21.1 | 13.7 | - | - | - |
| Weak | Hide-and-seek singh2017hide | - | 19.5 | 12.7 | 6.8 | - | - | - |
| Weak | STPN nguyen2018weakly | UNT | 31.1 | 23.5 | 16.2 | 9.8 | 5.1 | 17.1 |
| Weak | AutoLoc shou2018autoloc | UNT | 35.8 | 29.0 | 21.2 | 13.4 | 5.8 | 21.0 |
| Weak | W-TALC paul2018w | UNT | 32.0 | 26.0 | 18.8 | - | 6.2 | - |
| Weak | Liu et al. liu2019completeness | UNT | 37.5 | 29.1 | 19.9 | 12.3 | 6.0 | 21.0 |
| Weak | CleanNet Liu2019WeaklyST | UNT | 37.0 | 30.9 | 23.9 | 13.9 | 7.1 | 22.6 |
| Weak | BaS-Net lee2020background | UNT | 42.8 | 34.7 | 25.1 | 17.1 | 9.3 | 25.8 |
| Weak | Ours | UNT | 44.0 | 36.5 | 27.3 | 18.8 | 9.6 | 27.2 |
| Weak | STPN nguyen2018weakly | I3D | 35.5 | 25.8 | 16.9 | 9.9 | 4.3 | 18.5 |
| Weak | W-TALC paul2018w | I3D | 40.1 | 31.1 | 22.8 | - | 7.6 | - |
| Weak | MAAN Yuan2019MARGINALIZEDAA | I3D | 41.1 | 30.6 | 20.3 | 12.0 | 6.9 | 22.2 |
| Weak | Liu et al. liu2019completeness | I3D | 41.2 | 32.1 | 23.1 | 15.0 | 7.0 | 23.7 |
| Weak | TSM Yu2019TemporalSM | I3D | 39.5 | - | 24.5 | - | 7.1 | - |
| Weak | Nguyen et al. Nguyen2019WeaklySupervisedAL | I3D | 46.6 | 37.5 | 26.8 | 17.6 | 10.0 | 27.7 |
| Weak | BaS-Net lee2020background | I3D | 44.6 | 36.0 | 27.0 | 18.6 | 10.4 | 27.3 |
| Weak | RPN huang2020relational | I3D | 48.2 | 37.2 | 27.9 | 16.7 | 8.1 | 27.6 |
| Weak | Ours | I3D | 46.9 | 39.2 | 30.7 | 20.8 | 12.5 | 30.0 |

Table 1: Comparison on THUMOS’14 (mAP@IoU in %). † indicates the use of additional information, such as action frequency or human pose. UNT and I3D denote the use of UntrimmedNets and I3D features respectively, while AVG represents the average mAP under the thresholds 0.3:0.1:0.7.

4.2 Comparison with state-of-the-art methods

We compare our method with the existing fully-supervised and weakly-supervised methods under several IoU thresholds. The results on THUMOS’14, ActivityNet 1.2, and 1.3 are reported in Table 1, Table 2, and Table 4, respectively. We separate the entries by horizontal lines regarding the levels of supervision. For readability, all results are reported on the percentage scale.

Table 1 demonstrates the results on THUMOS’14. As shown, our method achieves a new state-of-the-art performance on weakly-supervised temporal action localization, regardless of the choice of the feature extractor (UntrimmedNets and I3D). Notably, our method with I3D features significantly outperforms the existing background modeling approaches, Liu et al. liu2019completeness, Nguyen et al. Nguyen2019WeaklySupervisedAL and BaS-Net lee2020background, by large margins of 7.6 %, 3.9 % and 3.7 % at the IoU threshold of 0.5, respectively. Moreover, even with a much lower level of supervision, our method performs better than several fully-supervised methods, following the latest fully-supervised approaches with the least gap. The quantitative results on ActivityNet 1.2 are demonstrated in Table 2. Consistent with the results on THUMOS’14, our method outperforms all weakly-supervised approaches. Moreover, our method follows SSN zhao2017temporal with a small gap, which shows the potential of weakly-supervised action localization. We also summarize the performances on ActivityNet 1.3 in Table 4. We see that our method surpasses all existing weakly-supervised methods including those which use external information.

| Sup. | Method | 0.5 | 0.55 | 0.6 | 0.65 | 0.7 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95 | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full | SSN zhao2017temporal | 41.3 | 38.8 | 35.9 | 32.9 | 30.4 | 27.0 | 22.2 | 18.2 | 13.2 | 6.1 | 26.6 |
| Weak† | 3C-Net Narayan20193CNetCC | 35.4 | - | - | - | 22.9 | - | - | - | 8.5 | - | 21.1 |
| Weak | UntrimmedNets wang2017untrimmednets | 7.4 | 6.1 | 5.2 | 4.5 | 3.9 | 3.2 | 2.5 | 1.8 | 1.2 | 0.7 | 3.6 |
| Weak | AutoLoc shou2018autoloc | 27.3 | 24.9 | 22.5 | 19.9 | 17.5 | 15.1 | 13.0 | 10.0 | 6.8 | 3.3 | 16.0 |
| Weak | CleanNet Liu2019WeaklyST | 37.1 | 33.4 | 29.9 | 26.7 | 23.4 | 20.3 | 17.2 | 13.9 | 9.2 | 5.0 | 21.6 |
| Weak | W-TALC paul2018w | 37.0 | 33.5 | 30.4 | 25.7 | 14.6 | 12.7 | 10.0 | 7.0 | 4.2 | 1.5 | 18.0 |
| Weak | Liu et al. liu2019completeness | 36.8 | - | - | - | - | 22.9 | - | - | - | 5.6 | 22.4 |
| Weak | TSM Yu2019TemporalSM | 28.3 | 26.0 | 23.6 | 21.2 | 18.9 | 17.0 | 14.0 | 11.1 | 7.5 | 3.5 | 17.1 |
| Weak | RPN huang2020relational | 37.6 | - | - | - | - | 23.9 | - | - | - | 5.4 | 23.3 |
| Weak | BaS-Net lee2020background | 38.5 | 35.5 | 32.7 | 29.8 | 27.1 | 24.2 | 20.7 | 16.7 | 11.9 | 5.6 | 24.3 |
| Weak | Ours | 40.3 | 36.8 | 34.3 | 31.6 | 28.5 | 25.1 | 21.8 | 17.3 | 12.7 | 5.9 | 25.4 |

Table 2: Comparison on ActivityNet 1.2 (mAP@IoU in %). AVG represents the averaged mAP at the thresholds 0.5:0.05:0.95, while † denotes the use of action frequency information. Our method uses I3D features.

Figure 3: Ablation study on THUMOS’14. The mAP@0.5 and the average mAP under the IoU thresholds 0.3:0.1:0.7 (denoted by AVG) are reported. We also plot the performance of Nguyen et al. Nguyen2019WeaklySupervisedAL to compare different background modeling approaches. Full results can be found in the supplementary materials.

| Supervision | Method | mAP@0.5 | mAP@0.75 | mAP@0.95 | AVG |
|---|---|---|---|---|---|
| Full | CDC shou2017cdc | 45.3 | 26.0 | 0.2 | 23.8 |
| Full | R-C3D xu2017r | 26.8 | - | - | 12.7 |
| Full | TAL-Net chao2018rethinking | 38.2 | 18.3 | 1.3 | 20.2 |
| Full | BSN lin2018bsn | 46.5 | 30.0 | 8.0 | 30.0 |
| Full | GTAN long2019gaussian | 52.6 | 34.1 | 8.9 | 34.3 |
| Full | BMN Lin2019BMNBN | 50.1 | 34.8 | 8.3 | 33.9 |
| Full | P-GCN Zeng2019GraphCN | 48.3 | 33.2 | 3.3 | 31.1 |
| Weak† | STAR Xu2019SegregatedTA | 31.1 | 18.8 | 4.7 | - |
| Weak† | PreTrimNet zhang2020MultiinstanceMA | 34.8 | 20.9 | 5.3 | 22.5 |
| Weak | STPN nguyen2018weakly | 29.3 | 16.9 | 2.6 | - |
| Weak | MAAN Yuan2019MARGINALIZEDAA | 33.7 | 21.9 | 5.5 | - |
| Weak | Liu et al. liu2019completeness | 34.0 | 20.9 | 5.7 | 21.2 |
| Weak | TSM Yu2019TemporalSM | 30.3 | 19.0 | 4.5 | - |
| Weak | Nguyen et al. Nguyen2019WeaklySupervisedAL | 36.4 | 19.2 | 2.9 | - |
| Weak | BaS-Net lee2020background | 34.5 | 22.5 | 4.9 | 22.2 |
| Weak | Ours | 37.0 | 23.9 | 5.7 | 23.7 |

Table 4: Comparison on ActivityNet 1.3 (mAP@IoU in %). AVG means the averaged mAP under the thresholds 0.5:0.05:0.95. † denotes the use of additional information. We note that all methods employ I3D as the feature extractor.

4.3 Ablation study

We conduct an ablation study on THUMOS’14 to investigate the contribution of each component. Firstly, the baseline is set as the main pipeline only with video-level classification loss ($\mathcal{L}_{\text{cls}}$). Thereafter, the proposed losses for background modeling, i.e., uncertainty modeling loss ($\mathcal{L}_{\text{um}}$) and background entropy loss ($\mathcal{L}_{\text{be}}$), are added successively. We note that background entropy loss cannot stand alone, as it is calculated with the pseudo background segments which are selected based on feature magnitudes. The mAP at the IoU threshold 0.5 and the average mAP of the variants are reported in Fig. 3. For comparison with different background modeling methods, we also plot the performance of Nguyen et al. Nguyen2019WeaklySupervisedAL, which pushes background frames into an auxiliary class. As can be seen, we obtain a large performance gain of 6.5 % in mAP@0.5 by modeling uncertainty with feature magnitudes. Notably, this variant (baseline + $\mathcal{L}_{\text{um}}$) outperforms the current state-of-the-art background modeling method by a considerable gap, which verifies the effectiveness of our uncertainty modeling. Moreover, by adding background entropy loss, the performance is further improved, achieving a new state-of-the-art with a large margin.

4.4 Qualitative comparison

To confirm the superiority of our background modeling, we compare our method with another background modeling approach by visualization. We choose BaS-Net lee2020background for comparison, since it is an existing state-of-the-art background modeling method and the implementation is publicly available. As shown in Fig. 5, our model detects the action instances more precisely than BaS-Net. More specifically, in the red boxes, we notice that BaS-Net splits one action instance into multiple incomplete detection results. We conjecture that this problem arises because BaS-Net strongly forces all background frames to belong to a specific class, which makes the model misclassify confusing parts of action instances as the background class and fail to cover complete action instances. On the contrary, our model provides better separation between action and background via uncertainty modeling, which allows our model to successfully localize the complete action instances without false alarms.

Figure 5: Qualitative comparison with BaS-Net lee2020background on THUMOS’14. For each example video, there are five plots with sampled frames. The first and second plots show the final scores and the detection results of the corresponding action class from BaS-Net, respectively. The third and fourth plots represent the final scores and the detection results from our model, respectively. The last plot denotes the ground truth action intervals. The horizontal axis in each plot represents the timesteps of the videos, while the vertical axes of the first and third plots indicate the scores, ranging from 0 to 1. In the red boxes, while BaS-Net fails to cover complete action instances and splits them into multiple detection results, our method is able to accurately localize them.

5 Conclusion

In this work, we identified the inherent limitations of existing background modeling approaches, with the observation that background frames may be dynamic and inconsistent. Thereafter, based on these properties, we proposed to formulate background frames as out-of-distribution samples and model uncertainty with feature magnitudes. In order to train the model without frame-level annotations, we designed a new architecture, where uncertainty is learned via multiple instance learning. Furthermore, background entropy loss was introduced to prevent background segments from leaning toward any specific action class. The ablation study verified that both our uncertainty modeling with feature magnitudes and the background entropy loss are beneficial for the localization performance. Through extensive experiments on the most popular benchmarks, THUMOS’14 and ActivityNet, our method achieved a new state-of-the-art with a large margin on weakly-supervised temporal action localization.

Broader Impact

Our highly accurate weakly-supervised action localization method makes various action detection applications available without intensive manual labeling efforts, which saves a lot of cost. Action detection is widely used in various domains, such as security, education, sports, and entertainment. By exploiting our method, many people and organizations can benefit, including governments, colleges, movie makers, and even citizens. For example, a possible application is video surveillance for automatic identification of events of interest, e.g., traffic accidents. If the system fails to detect such actions, it may affect some work of traffic police, but it can still reduce their work for detection. Moreover, our method can be used to summarize large-scale online videos with the detected representative actions, which will significantly reduce human effort.