Temporal action localization (TAL), the challenging problem of finding and classifying action intervals in untrimmed videos, plays a crucial role in video understanding and analysis. Due to its wide applications (e.g., video surveillance Vishwakarma2012ASO, video summarization Lee2012DiscoveringIP; Xiong2019LessIM, action retrieval Ma2005AGF), TAL has drawn much attention from the research community. A large body of work has been done in the fully-supervised manner and achieved impressive progress shou2016temporal; shou2017cdc; xu2017r; zhao2017temporal; chao2018rethinking; lin2018bsn; long2019gaussian; Lin2019BMNBN; Zeng2019GraphCN. However, these methods suffer from the extremely high cost of acquiring precise temporal annotations.
To relieve the high annotation cost and improve scalability, researchers have directed their attention to the same task with weak supervision, namely weakly-supervised temporal action localization (WTAL). Among the various levels of weak supervision, video-level action labels are the most widely utilized owing to their low cost wang2017untrimmednets; singh2017hide; shou2018autoloc, while other types (e.g., temporal ordering bojanowski2014weakly; kuehne2017weakly; huang2016connectionist, frequencies of action instances Xu2019SegregatedTA; Narayan20193CNetCC) have also been explored. In this paper, we follow the mainstream setting using only video-level labels, where a video is labeled as positive for an action class if it contains corresponding action frames and as negative otherwise. Note that a video may have multiple action classes as its label.
Existing work adopts the attention mechanism nguyen2018weakly; Yuan2019MARGINALIZEDAA or the multiple instance learning formulation paul2018w; Narayan20193CNetCC to predict frame-level class scores from video-level labels. Nonetheless, weakly-supervised methods still perform far worse than fully-supervised ones. According to the recent literature Xu2019SegregatedTA; liu2019completeness; Nguyen2019WeaklySupervisedAL; lee2020background, the performance degradation comes from false alarms on complex background frames, since video-level labels provide no clue about background. To bridge the gap, some studies liu2019completeness; Nguyen2019WeaklySupervisedAL; lee2020background attempt background modeling in the weakly-supervised setting. Liu et al. liu2019completeness synthesize pseudo background videos by merging static frames and label them as the background class, assuming all background frames are static. However, this assumption is too strong and cannot cover all scenarios, as some background frames can be dynamic (Fig. 1(a)). Nguyen et al. Nguyen2019WeaklySupervisedAL and BaS-Net lee2020background observe that every untrimmed video has background frames, and assign such frames to a background class. Still, it is difficult to force all background frames into one specific class, as they may have highly inconsistent appearances and semantics (Fig. 1(b)).
In this paper, we embrace the observation on the dynamism and inconsistency of background frames, and propose to formulate the problem of rejecting background frames as an out-of-distribution detection problem hendrycks2016baseline; liang2018enhancing, where action and background are mapped to in-distribution and out-of-distribution, respectively. To detect out-of-distribution samples, it is natural to estimate the probability that a sample comes from out-of-distribution, also known as uncertainty bendale2016towards; lakshminarayanan2017simple; Dhamija2018ReducingNA. Accordingly, we aim to estimate uncertainty (i.e., the probability of being background) to identify background frames. Since background frames should have low scores for all action classes, we utilize feature magnitudes to model uncertainty, i.e., our model should produce feature vectors with large magnitudes for action frames and ones with small magnitudes for background frames. Unfortunately, it is inappropriate to directly supervise individual frames, as we do not have frame-level labels in the weakly-supervised setting.
In order to learn uncertainty only with video-level supervision, we leverage the formulation of multiple instance learning Maron1998MultipleInstanceAF; Andrews2002SupportVM; zhou2004multi, where a model is trained with a bag (i.e., an untrimmed video) instead of instances (i.e., frames). From each untrimmed video, we select the top-k and bottom-k frames in terms of feature magnitude and consider them as pseudo action and background frames, respectively. Thereafter, we design an uncertainty modeling loss to manipulate the magnitudes of pseudo action/background features, which enables our model to learn uncertainty without frame-level labels. Moreover, we introduce a background entropy loss to force pseudo background frames to have a uniform probability distribution over action classes. By jointly optimizing the losses for background modeling along with a general action classification loss, our model successfully separates action frames from background frames, achieving a new state-of-the-art performance by a large margin on THUMOS'14 and ActivityNet. The effectiveness of our method is further verified by ablation studies.
Summary of contributions. Our contributions are three-fold: 1) We are the first to formulate background frames as out-of-distribution, overcoming the difficulty in modeling background with regard to their unconstrained and inconsistent properties. 2) We design a new framework for weakly-supervised action localization, where uncertainty is learned for background modeling only with video-level labels via multiple instance learning. 3) We further encourage separation between action and background with a loss maximizing the entropy of action probability distribution from background frames.
2 Related Work
Fully-supervised action localization. The goal of temporal action localization is to find temporal intervals of action instances in long untrimmed videos and classify them. For this task, many approaches depend on accurate temporal annotations for each training video, i.e., the start and end times of action instances. Most of them first generate proposals, and then classify them. To generate proposals, some methods adopt the sliding window approach shou2016temporal; yuan2016temporal; shou2017cdc; yang2018exploring; xiong2017pursuit; chao2018rethinking, while others predict the start and end frames of action instances lin2018bsn; Lin2019BMNBN. Moreover, there are Gaussian modeling of each action instance long2019gaussian and an efficient method without proposal generation alwassel2018action. It should be noted that fully-supervised methods leverage frame-level annotations to distinguish action and background frames, while weakly-supervised ones do not.
Weakly-supervised action localization. Recently, due to the extremely high cost of frame-wise annotations, many attempts have been made to solve temporal action localization with weak supervision, mostly video-level labels. UntrimmedNets wang2017untrimmednets first tackle the problem by selecting relevant segments in soft and hard ways. STPN nguyen2018weakly forces the model to select sparse action segments, while Hide-and-seek singh2017hide and MAAN Yuan2019MARGINALIZEDAA extend discriminative parts by randomly hiding or sampling segments, respectively. W-TALC paul2018w and 3C-Net Narayan20193CNetCC both employ deep metric learning to force features from the same action class to be closer to each other than to those from different action classes. Meanwhile, AutoLoc shou2018autoloc and CleanNet Liu2019WeaklyST attempt to regress the intervals of action instances, instead of performing hard thresholding. TSM Yu2019TemporalSM proposes to model each action instance as a multi-phase process and predict the evolving sequence of phases. There are also several studies exploiting additional information, e.g., the frequency of action instances Xu2019SegregatedTA; Narayan20193CNetCC, or human pose zhang2020MultiinstanceMA.
Apart from the methods above, Liu et al. liu2019completeness, Nguyen et al. Nguyen2019WeaklySupervisedAL and BaS-Net lee2020background seek to explicitly model background. However, as mentioned in Sec. 1, they have the innate limitation that background frames can be dynamic and inconsistent, which makes it difficult to separate background. In contrast, we consider background as out-of-distribution in light of these properties and propose to learn uncertainty as well as action class scores. In our experiments, the effectiveness of our approach is verified, outperforming the state-of-the-art methods by a large margin.
Out-of-distribution detection. The aim of out-of-distribution detection is to determine whether an input sample comes from the in-distribution (i.e., training distribution) or not hendrycks2016baseline; liang2018enhancing; Dhamija2018ReducingNA. The problem has also been studied in several different forms such as open-set recognition bendale2015towards; bendale2016towards, outlier rejection xu2014deep; geifman2017selective; chalapathy2019deep, and uncertainty estimation graves2011practical; gal2016dropout; springenberg2016bayesian; lakshminarayanan2017simple; malinin2018predictive; malinin2019reverse; malinin2019EnsembleDD. To tackle the problem, ODIN liang2018enhancing uses temperature scaling in the softmax function and adds small perturbations to the input. Meanwhile, Lakshminarayanan et al. lakshminarayanan2017simple and Dhamija et al. Dhamija2018ReducingNA predict the uncertainty of samples by using an MLP ensemble and feature magnitudes, respectively. Moreover, OpenMax bendale2016towards directly estimates uncertainty without using any out-of-distribution samples for training.
3 Method
In this section, we provide the details of the proposed method. The overview of our architecture is illustrated in Fig. 2. We first set up our baseline with the conventional pipeline for weakly-supervised action localization (Sec. 3.1). Next, we cast the background identification problem as out-of-distribution detection and tackle it by modeling uncertainty (Sec. 3.2). Thereafter, the objective functions to train our model are introduced (Sec. 3.3). Lastly, we explain how inference is performed (Sec. 3.4).
3.1 Main pipeline
Due to the memory constraint, we first split each video into multi-frame non-overlapping segments $\{s_{n,l}\}_{l=1}^{L_n}$, where $L_n$ denotes the number of segments in the $n$-th video $v_n$. To handle the large variation in video lengths, a fixed number $T$ of segments is sampled from each original video. Then spatio-temporal features $x_{n,l}^{\text{RGB}} \in \mathbb{R}^{D}$ and $x_{n,l}^{\text{flow}} \in \mathbb{R}^{D}$ are extracted from the sampled RGB and flow segments, respectively. Note that any feature extractor can be used. Afterwards, we concatenate the RGB and flow features into complete feature vectors $x_{n,l} \in \mathbb{R}^{2D}$, which are then stacked to build a feature map of length $T$, i.e., $X_n = [x_{n,1}, \ldots, x_{n,T}] \in \mathbb{R}^{T \times 2D}$.
To embed the extracted features, we feed them into a single 1-D convolutional layer followed by ReLU activation. Formally, $F_n = g_{\text{emb}}(X_n; \phi)$, where $g_{\text{emb}}$ denotes the convolution operator with the activation function and $\phi$ is the set of learnable parameters of the convolutional layer. Concretely, the dimension of the embedded features is the same as that of the input features, i.e., $F_n \in \mathbb{R}^{T \times 2D}$.
From the embedded features, we predict segment-level class scores, which are later used for action localization. For the $n$-th video $v_n$, the class scores are derived by the action classifier, i.e., $\mathcal{A}_n = g_{\text{cls}}(F_n; \theta)$, where $g_{\text{cls}}$ represents the linear classifier with parameters $\theta$, $\mathcal{A}_n \in \mathbb{R}^{T \times C}$ denotes the segment-level action scores, and $C$ is the number of action classes.
Action score aggregation.
Adopting multiple instance learning Maron1998MultipleInstanceAF; Andrews2002SupportVM; zhou2004multi, we aggregate the top $k_{\text{act}}$ scores over all segments for each action class and average them to build a video-level class score:
$$a_n(c) = \frac{1}{k_{\text{act}}} \max_{\mathcal{S} \subset \mathcal{A}_n(c),\, |\mathcal{S}| = k_{\text{act}}} \sum_{a \in \mathcal{S}} a, \quad (1)$$
where $\mathcal{A}_n(c)$ is the subset containing the action scores for class $c$, and $k_{\text{act}}$ is a hyper-parameter controlling the number of aggregated segments.
Thereafter, we obtain the video-level action probability for each action class by applying the softmax function to the aggregated scores:
$$p_n(c) = \frac{\exp\left(a_n(c)\right)}{\sum_{c'=1}^{C} \exp\left(a_n(c')\right)}, \quad (2)$$
where $p_n(c)$ represents the softmax score for the $c$-th action class of the $n$-th video.
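The top-k aggregation and softmax steps above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code; the function name `video_level_scores` is our own:

```python
import numpy as np

def video_level_scores(segment_scores: np.ndarray, k: int) -> np.ndarray:
    """Aggregate segment-level class scores (T x C) into video-level
    class probabilities: average the top-k scores per class (Eq. 1),
    then apply a softmax over classes (Eq. 2)."""
    # For each class, keep the k largest segment scores and average them.
    topk = np.sort(segment_scores, axis=0)[-k:]   # (k, C)
    aggregated = topk.mean(axis=0)                # (C,)
    # Softmax over action classes (shifted for numerical stability).
    exp = np.exp(aggregated - aggregated.max())
    return exp / exp.sum()
```

Selecting a subset of high-scoring segments, rather than averaging over all of them, prevents the many background segments of an untrimmed video from diluting the video-level score.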
3.2 Considering background as out-of-distribution
Decomposition of action localization.
From the main pipeline, we obtain the action probabilities for each segment, but the essential component for action localization, i.e., background identification, is not carefully considered. In light of the unconstrained and inconsistent nature of background frames, we treat background as out-of-distribution bendale2016towards; lakshminarayanan2017simple; Dhamija2018ReducingNA. The probability for class $c$ of segment $s_{n,l}$ can be decomposed into two parts with the chain rule, i.e., in-distribution action classification and background identification. Let $d$ denote the binary variable for background identification: $d = 1$ if the segment belongs to any action class, and $d = 0$ otherwise (i.e., it belongs to background). Then the posterior probability for class $c$ of $s_{n,l}$ is given by:
$$P(y_{n,l} = c \mid s_{n,l}) = P(y_{n,l} = c \mid d = 1, s_{n,l})\, P(d = 1 \mid s_{n,l}), \quad (3)$$
where $y_{n,l}$ is the label of the corresponding segment, i.e., $y_{n,l} = c$ if $s_{n,l}$ belongs to the $c$-th action class, while $y_{n,l} = 0$ for background segments.
In Eq. 3, the probability for in-distribution action classification, i.e., $P(y_{n,l} = c \mid d = 1, s_{n,l})$, is estimated with the softmax function as in a general classification task. Additionally, it is necessary to model the probability that a segment belongs to any action class, i.e., $P(d = 1 \mid s_{n,l})$, to tackle the background identification problem. Assuming that background frames should produce low scores for all action classes, we model uncertainty with the magnitudes of feature vectors; namely, background features have small magnitudes, while action features have large ones. Then the probability that the $l$-th segment in the $n$-th video is an action segment is defined by:
$$P(d = 1 \mid s_{n,l}) = \frac{\min\left(m, \|f_{n,l}\|\right)}{m}, \quad (4)$$
where $f_{n,l}$ is the corresponding feature vector of $s_{n,l}$, $\|\cdot\|$ is a norm function (we use the L2 norm here), and $m$ is the pre-defined maximum feature magnitude. From the equation, it is ensured that the probability falls between 0 and 1, i.e., $0 \le P(d = 1 \mid s_{n,l}) \le 1$.
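Eq. 4 reduces to a clipped, normalized feature norm. A minimal sketch (the function name `action_prob` and the default value of `m` are ours, chosen only for illustration):

```python
import numpy as np

def action_prob(feature: np.ndarray, m: float = 100.0) -> float:
    """P(d=1 | segment) as in Eq. 4: the L2 magnitude of the segment
    feature, clipped at the maximum magnitude m and normalized to [0, 1]."""
    return min(m, float(np.linalg.norm(feature))) / m
```

A zero feature yields probability 0 (certain background), while any feature with magnitude at or above `m` yields probability 1 (certain action).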
Multiple instance learning.
To learn uncertainty only with video-level labels, we borrow the concept of multiple instance learning Maron1998MultipleInstanceAF; Andrews2002SupportVM; zhou2004multi, where a model is trained with a bag (i.e., an untrimmed video) rather than with instances (i.e., segments). In this setting, we select the top $k_{\text{act}}$ segments in terms of feature magnitude and treat them as the pseudo action segments $\{s_{n,l} \mid l \in \mathcal{S}_n^{\text{act}}\}$, where $\mathcal{S}_n^{\text{act}}$ indicates the set of pseudo action indices. Meanwhile, the bottom $k_{\text{bg}}$ segments are considered as the pseudo background segments $\{s_{n,l} \mid l \in \mathcal{S}_n^{\text{bg}}\}$, where $\mathcal{S}_n^{\text{bg}}$ denotes the set of indices for pseudo background. $k_{\text{act}}$ and $k_{\text{bg}}$ represent the numbers of segments sampled for action and background, respectively. The pseudo action/background segments then serve as the representatives of the input untrimmed video, and they are used for training the model with video-level labels.
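The pseudo-label selection above is a simple magnitude ranking. A sketch under our own naming (`select_pseudo_segments` is not from the paper):

```python
import numpy as np

def select_pseudo_segments(features: np.ndarray, k_act: int, k_bg: int):
    """Rank the T segment features (T x D) by L2 magnitude and return
    index arrays for pseudo action (top-k) and pseudo background
    (bottom-k) segments."""
    magnitudes = np.linalg.norm(features, axis=1)  # (T,)
    order = np.argsort(magnitudes)                 # ascending by magnitude
    bg_idx = order[:k_bg]                          # smallest magnitudes
    act_idx = order[-k_act:]                       # largest magnitudes
    return act_idx, bg_idx
```

Because the ranking uses the model's own features, the pseudo labels improve as training progresses, without ever requiring frame-level annotations.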
3.3 Training objectives
Our model is optimized with three losses: 1) video-level classification loss $\mathcal{L}_{\text{cls}}$ for action classification of each input video, 2) uncertainty modeling loss $\mathcal{L}_{\text{um}}$, which manipulates the magnitudes of action and background feature vectors for background identification, and 3) background entropy loss $\mathcal{L}_{\text{be}}$, which prevents background segments from having a high probability for any action class. The overall loss function is as follows:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{cls}} + \alpha \mathcal{L}_{\text{um}} + \beta \mathcal{L}_{\text{be}},$$
where $\alpha$ and $\beta$ are hyper-parameters for balancing the losses.
Video-level classification loss.
For multi-label action classification, we use the cross-entropy loss with normalized video-level labels wang2017untrimmednets as follows:
$$\mathcal{L}_{\text{cls}} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} \tilde{y}_n(c) \log p_n(c),$$
where $p_n(c)$ represents the video-level softmax score for the $c$-th class of the $n$-th video (Eq. 2), and $\tilde{y}_n(c)$ is the normalized video-level label for the $c$-th class of the $n$-th video.
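The classification loss can be written compactly in NumPy. This is a hedged sketch of the standard cross-entropy with labels normalized to sum to one per video (the function name and the epsilon constant are our own choices):

```python
import numpy as np

def classification_loss(video_probs: np.ndarray, labels: np.ndarray) -> float:
    """Cross-entropy between video-level class probabilities (N x C)
    and multi-hot action labels normalized to sum to 1 per video."""
    # Normalize multi-hot labels so multi-label videos contribute evenly.
    norm_labels = labels / labels.sum(axis=1, keepdims=True)
    # Mean over videos of the per-video cross-entropy (eps avoids log(0)).
    return float(-(norm_labels * np.log(video_probs + 1e-8)).sum(axis=1).mean())
```

A video labeled with two action classes thus assigns weight 0.5 to each, matching the normalized-label formulation of UntrimmedNets.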
Uncertainty modeling loss.
In order to learn uncertainty, we train the model to produce feature vectors with large magnitudes for pseudo action segments and ones with small magnitudes for pseudo background segments, as illustrated in Fig. 2 (a). Formally, the uncertainty modeling loss takes the form:
$$\mathcal{L}_{\text{um}} = \frac{1}{N} \sum_{n=1}^{N} \left( \max\left(0,\, m - \|\bar{f}_n^{\,\text{act}}\|\right) + \|\bar{f}_n^{\,\text{bg}}\| \right)^2,$$
where $\bar{f}_n^{\,\text{act}}$ and $\bar{f}_n^{\,\text{bg}}$ are the mean features of the pseudo action and background segments of the $n$-th video, respectively, $\|\cdot\|$ is the norm function, and $m$ is the pre-defined maximum feature magnitude, the same as in Eq. 4.
Background entropy loss.
Though the uncertainty modeling loss encourages background segments to produce low scores for all actions, the softmax scores for some action classes could still be high due to the relative nature of the softmax function. To prevent background segments from having a high softmax score for any action class, we define a loss function that maximizes the entropy of the action probabilities of background segments, i.e., background segments are forced to have a uniform probability distribution over action classes, as described in Fig. 2 (b). The background entropy loss is calculated as follows:
$$\mathcal{L}_{\text{be}} = \frac{1}{N} \sum_{n=1}^{N} \left( -\frac{1}{C} \sum_{c=1}^{C} \log \bar{p}_n^{\,\text{bg}}(c) \right),$$
where $\bar{p}_n^{\,\text{bg}}(c) = \frac{1}{k_{\text{bg}}} \sum_{l \in \mathcal{S}_n^{\text{bg}}} p_{n,l}(c)$ is the averaged action probability for the $c$-th class over the pseudo background segments, and $p_{n,l}(c)$ is the softmax score for the $c$-th class of segment $s_{n,l}$.
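This loss is the cross-entropy of the averaged background distribution against the uniform distribution, minimized when all classes receive probability $1/C$. A sketch with our own naming:

```python
import numpy as np

def background_entropy_loss(bg_probs: np.ndarray) -> float:
    """Per-video background entropy loss: cross-entropy of the averaged
    background action probabilities (k_bg x C softmax scores) against
    the uniform distribution over C classes."""
    avg = bg_probs.mean(axis=0)        # averaged distribution, shape (C,)
    C = avg.shape[0]
    # -(1/C) * sum_c log p(c); smallest when avg is uniform (= log C).
    return float(-np.log(avg + 1e-8).sum() / C)
```

Any peaked distribution over action classes incurs a higher loss than the uniform one, so background segments cannot lean toward a single action class.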
3.4 Inference
At test time, for an input video, we first obtain the video-level softmax scores and threshold them with $\theta_{\text{vid}}$ to determine which action classes are to be localized. For the remaining action classes, we calculate the segment-level posterior probability by multiplying the segment-level softmax score and the probability of being an action segment, as in Eq. 3. Afterwards, the segments whose posterior probabilities are larger than $\theta_{\text{seg}}$ are selected as candidate segments. Finally, consecutive candidate segments are grouped into a single proposal. Since we use multiple thresholds for $\theta_{\text{seg}}$, non-maximum suppression (NMS) is performed on the proposals. We note that no duplicate proposals are allowed.
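The threshold-and-group step of inference can be sketched as follows; this is an illustrative helper (name and the exclusive-end convention are ours), omitting the multi-threshold loop and NMS:

```python
import numpy as np

def generate_proposals(posterior: np.ndarray, threshold: float):
    """Threshold per-segment posterior scores for one action class and
    group consecutive above-threshold segments into (start, end)
    proposals, with end exclusive."""
    keep = posterior > threshold
    proposals, start = [], None
    for t, flag in enumerate(keep):
        if flag and start is None:
            start = t                      # a new run of candidates begins
        elif not flag and start is not None:
            proposals.append((start, t))   # the run ended at t-1
            start = None
    if start is not None:                  # run extends to the last segment
        proposals.append((start, len(keep)))
    return proposals
```

Running this with several thresholds and merging the resulting proposal pools via NMS yields the final localization results.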
4 Experiments
4.1 Experimental settings
THUMOS'14 THUMOS14 is a widely used dataset for temporal action localization, containing 200 validation videos and 213 test videos of 20 action classes. It is very challenging, as the lengths of the videos are diverse and actions occur frequently (on average 15 instances per video). We use the validation videos for training and the test videos for testing. On the other hand, ActivityNet caba2015activitynet is a large-scale benchmark with two versions (1.2 and 1.3). ActivityNet 1.3, covering 200 action categories, has 10,024 training videos, 4,926 validation videos and 5,044 test videos. ActivityNet 1.2 is a subset of version 1.3, and is composed of 4,819 training videos, 2,383 validation videos and 2,480 test videos of 100 action classes. Because the ground-truths for the test videos of ActivityNet are withheld for the challenge, we utilize the validation videos for evaluation.
We evaluate our method with mean average precisions (mAPs) under several different intersection over union (IoU) thresholds, which are the standard evaluation metrics for temporal action localization. The official evaluation code of ActivityNet111https://github.com/activitynet/ActivityNet/tree/master/Evaluation is used for measuring mAPs.
We employ two different feature extractors, namely UntrimmedNets wang2017untrimmednets and I3D networks carreira2017quo, pre-trained on ImageNet deng2009imagenet and Kinetics carreira2017quo, respectively. Each input segment consists of 5 frames for UntrimmedNets and 16 frames for I3D. It should be noted that we do not fine-tune the feature extractors, for fair comparison. The TV-L1 algorithm wedel2009improved is used to extract optical flow from videos. We fix the number of sampled segments $T$ to 750 and 50 for THUMOS'14 and ActivityNet, respectively. The sampling method is the same as in STPN nguyen2018weakly. The numbers of pseudo action/background segments are determined by ratio parameters, i.e., $k_{\text{act}} = T / r_{\text{act}}$ and $k_{\text{bg}} = T / r_{\text{bg}}$. All hyper-parameters, including $m$, $\alpha$, $\beta$, $r_{\text{act}}$, and $r_{\text{bg}}$, are set by grid search. To enrich the proposal pool, we use multiple thresholds for $\theta_{\text{seg}}$ from 0 to 0.25 with a step size of 0.025, then perform non-maximum suppression (NMS) with an IoU threshold of 0.7.
4.2 Comparison with state-of-the-art methods
We compare our method with existing fully-supervised and weakly-supervised methods under several IoU thresholds. The results on THUMOS'14, ActivityNet 1.2, and 1.3 are reported in Table 1, Table 2, and Table 4, respectively. We separate the entries with horizontal lines according to the level of supervision. For readability, all results are reported on the percentage scale.
Table 1 demonstrates the results on THUMOS'14. As shown, our method achieves a new state-of-the-art performance on weakly-supervised temporal action localization, regardless of the choice of feature extractor (UntrimmedNets or I3D). Notably, our method with I3D features significantly outperforms the existing background modeling approaches, Liu et al. liu2019completeness, Nguyen et al. Nguyen2019WeaklySupervisedAL and BaS-Net lee2020background, by large margins of 7.6%, 3.9% and 3.7% at the IoU threshold of 0.5, respectively. Moreover, even with a much lower level of supervision, our method performs better than several fully-supervised methods, trailing the latest fully-supervised approaches by the smallest gap. The quantitative results on ActivityNet 1.2 are demonstrated in Table 2. Consistent with the results on THUMOS'14, our method outperforms all weakly-supervised approaches. Moreover, our method follows SSN zhao2017temporal with a small gap, which shows the potential of weakly-supervised action localization. We also summarize the performances on ActivityNet 1.3 in Table 4. We see that our method surpasses all existing weakly-supervised methods, including those that use external information.
4.3 Ablation study
We conduct an ablation study on THUMOS'14 to investigate the contribution of each component. Firstly, the baseline is set as the main pipeline with only the video-level classification loss ($\mathcal{L}_{\text{cls}}$). Thereafter, the proposed losses for background modeling, i.e., the uncertainty modeling loss ($\mathcal{L}_{\text{um}}$) and the background entropy loss ($\mathcal{L}_{\text{be}}$), are added in turn. We note that the background entropy loss cannot stand alone, as it is calculated with the pseudo background segments, which are selected based on feature magnitudes. The mAP at the IoU threshold 0.5 and the average mAP of the variants are reported in Fig. 4. For comparison with a different background modeling method, we also plot the performance of Nguyen et al. Nguyen2019WeaklySupervisedAL, which pushes background frames into an auxiliary class. As can be seen, we enjoy a large performance gain of 6.5% in mAP@0.5 by modeling uncertainty with feature magnitudes. Notably, this variant ($\mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{um}}$) outperforms the current state-of-the-art background modeling method by a considerable gap, which verifies the effectiveness of our uncertainty modeling. Moreover, by adding the background entropy loss, the performance is further improved, achieving a new state-of-the-art by a large margin.
4.4 Qualitative comparison
To confirm the superiority of our background modeling, we compare our method with another background modeling approach by visualization. We choose BaS-Net lee2020background for comparison, since it is an existing state-of-the-art background modeling method and the implementation is publicly available222https://github.com/Pilhyeon/BaSNet-pytorch. As shown in Fig. 5, our model detects the action instances more precisely than BaS-Net. More specifically, in the red boxes, we notice that BaS-Net splits one action instance into multiple incomplete detection results. We conjecture that this problem arises because BaS-Net strongly forces all background frames to belong to a specific class, which makes the model misclassify confusing parts of action instances as the background class and fail to cover complete action instances. On the contrary, our model provides better separation between action and background via uncertainty modeling, which allows our model to successfully localize the complete action instances without false alarms.
5 Conclusion
In this work, we identified the inherent limitations of existing background modeling approaches, based on the observation that background frames may be dynamic and inconsistent. Accordingly, we proposed to formulate background frames as out-of-distribution samples and to model uncertainty with feature magnitudes. In order to train the model without frame-level annotations, we designed a new architecture where uncertainty is learned via multiple instance learning. Furthermore, a background entropy loss was introduced to prevent background segments from leaning toward any specific action class. Our ablation study verified that uncertainty modeling with feature magnitudes and the background entropy loss are both beneficial for localization performance. Through extensive experiments on the most popular benchmarks, THUMOS'14 and ActivityNet, our method achieved a new state-of-the-art on weakly-supervised temporal action localization by a large margin.
Our highly accurate weakly supervised actional localization method makes various action detection applications available without intensive manual labeling efforts, which saves a lot of costs. Action detection is widely used in various domains, such as security, education, sports, and entertainment. By exploiting our method, lots of people or organizations can benefit from it, including governments, colleges, movie makers, and even citizens. For example, a possible application is video surveillance for automatic identification of events of interest, e.g., traffic accidents. If the system fails to detect such actions, it may affect some work of traffic police, but it can still reduce their work for detection. Moreover, our method can be used to summarize large-scale online videos with the detected representative actions, which will significantly reduce the human efforts.