Temporal action localization aims to locate the start and end frames of actions of interest in untrimmed videos, e.g., CliffDive and PassBall. Weakly-supervised Temporal Action Localization (WTAL) [stpn, autoloc, untrimmednet] trains such a localizer using only video-level annotations like “the video contains CliffDive”, without specifying the start and end frames. WTAL is therefore especially valuable in the era of video-sharing social networking services [chen2016micro], where billions of videos have only video-level user-generated tags.
Not surprisingly, WTAL performs worse than its fully-supervised counterpart [re_faster, bsn, mgg]. As shown in Figure 1, any WTAL method encounters three types of localization errors due to the weak supervision: 1) Over-completeness: the ground-truth is a sub-sequence of the localization, which contains additional background frames (Top). 2) Incompleteness: the localization only selects the discriminative sub-sequence of the ground-truth, i.e., it misclassifies foreground as background (Mid). 3) Class Confusion: the localization misclassifies different actions (Bottom). Prior studies typically blame these errors on the unlabeled background segments, because the reasons are intuitively sound. For example, if we could model the background as a special Background class, errors 1 and 2 would be mitigated since WTAL knows what visual cues are background [basnet, multi_branch]; if we could exploit the difference between the background and foreground, each action class model would be more discriminative due to training on more negative samples of the background [wum, stpn, w_talc], so all three errors might be addressed.
However, it is impossible to fundamentally resolve the “background” issue by using only video-level supervision. The reasons are two-fold. First, as there is no video-level label of the background either, even if we set up a Background class, the Background model will eventually be reduced to a simple prior, such as the assumption that the foreground is sparsely distributed in the background [stpn, dgam]. Second, learning more discriminative features between background and foreground is no easier than WTAL per se, because if we had such features, videos could be trimmed into segments of foreground actions, and the problem would no longer be weakly-supervised. As a result, all such methods essentially resort to bootstrapping from another trivial prior [3c_net, w_talc]: segments of the same action should look similar, which is however not always the case for complex actions (e.g., ThrowDiscus in Figure 1).
In this paper, we propose a causal inference framework [pearl, rubin] to identify the true cause of the WTAL deficiency, which is indeed not the background, and then show how to remedy the deficiency by eliminating that cause. To begin with, consider the possible confounders in Figure 1. In causal terms, a confounder is the common cause of several effects, which are observed to be correlated even if there is no causation between them. For example, the context “athletic track” may be the confounder between the video and its video-level label LongJump; so, even if the “Prepare”/“Finish” background is not the action, it is inevitably correlated with the foreground and the label, misleading WTAL. The contextual confounder for the ThrowDiscus and Shotput confusion can be found similarly. As another example, the object “leg on snowboard” may be the confounder for the Snowboarding video: if a segment has no such object, it is wrongly localized as negative for Snowboarding.
It is well known that such confounding bias can never be removed by using only the observational data correlation [pearl2009causality], i.e., the likelihood of an action label given a video, estimated from a Multiple Instance Learning (MIL) model [3c_net]. Instead, we should adopt causal interventions to remove the confounding effect. Ideally, if we could observe all the possible confounders, we could estimate the average effect of the video evenly associated with each confounder [backdoor, zhangdong], mimicking a physical intervention such as “recording” the videos of LongJump everywhere. Unfortunately, this observed-confounder assumption is not valid in general, because the confounder is too elusive to be defined across different actions that have complex visual cues, e.g., it can be the scene context in LongJump or the object in Snowboarding. Fortunately, there is another way around: the Deconfounder [blessing], a general theory that uses latent variables, which can generate the observed data, as a good substitute for the confounders.
To implement the deconfounder for WTAL, we propose an unsupervised Temporal Smoothing PCA (TS-PCA) model whose bases can reconstruct the whole video dataset, the majority of which consists of unlabelled background segments (Section 3.2.2). Therefore, our punchline is that the background is indeed not a curse but a blessing for WTAL. It is worth noting that the TS-PCA deconfounder is specially designed so that each projection directly contributes to the label prediction logits. This has two crucial benefits: 1) Its implementation is model-agnostic, i.e., any existing WTAL model can seamlessly incorporate it in a plug-and-play fashion (Section 3.2.1). 2) The TS-PCA projection can be used as a foreground/background score function that further enhances the prediction (Section 3.2.2). The deconfounder model is applied to four state-of-the-art WTAL methods with public code on the THUMOS-14 and ActivityNet-1.3 datasets. Significant improvement is observed, and the deconfounded WUM [wum] establishes new state-of-the-art performance (Section 4).
2 Related Work
Weakly-supervised Temporal Action Localization. Recently, many methods [prototypical, wum, 3c_net, w_talc] have been proposed to handle the task of temporal action localization with only video-level labels. UntrimmedNets [untrimmednet] locates action instances by selecting relevant segments in soft and hard ways. Autoloc [autoloc] directly predicts the temporal boundary of each action instance with a novel outer-inner-contrastive loss. BasNet [basnet] proposes an auxiliary class representing background to suppress activations from background frames. However, these methods still suffer from localization errors, namely over-completeness and incompleteness. Many approaches have been proposed to tackle these issues by erasing salient features [hide_and_seek, zhong2018step], imposing a diversity loss [multi_branch], or employing a marginalized average aggregation strategy [maan]. Nevertheless, these methods ignore the challenging action-context confusion issue caused by the absence of frame-wise labels. DGAM [dgam] builds a generative model of the frame representation, which helps the action-context separation. However, all components of DGAM are coupled together, so it cannot be used in a plug-and-play fashion. In this paper, we propose an unsupervised TS-PCA deconfounder to address the WTAL problem, contributing a formal answer based on causal inference.
Causal Inference. Causal inference can be used to remove spurious bias [bareinboim2012controlling] and disentangle the desired model effects [besserve2018counterfactuals] in domain-specific applications by pursuing the causal effect [pearl2016causal, rubin2019essential]. Recently, causal inference has been introduced and widely used in various computer vision tasks [niu2020counterfactual, qi2020two, zhangdong], including image classification [chalupka2014visual], few-shot learning [yue2020interventional], and semantic segmentation [zhangdong]. In our work, as the confounder is unobserved, existing deconfounding methods based on the observed-confounder assumption are no longer valid [qi2020two, yue2020interventional, zhangdong]. To this end, we propose to derive a substitute confounder from a fitted data generation model [blessing].
In this section, we discuss how to derive our deconfounding techniques from the causal effect perspective and how to implement it into the prevailing WTAL framework.
3.1 Action Localization as Causal Effects
Weakly-supervised temporal action localization (WTAL) aims to predict a set of action instances given an untrimmed video. Suppose we have $N$ training videos $\{x_n\}_{n=1}^{N}$ with their video-level labels $\{y_n\}_{n=1}^{N}$, where $y_n \in \{0,1\}^{C}$ is a $C$-dimensional multi-hot vector representing the presence of $C$ action categories and $x_n \in \mathbb{R}^{T \times D}$ is the extracted video features of $T$ segments with $D$ dimensions.
The prevailing pipeline for training WTAL is illustrated in the blue dashed box of Figure 2. Note that the whole framework is trained with only video-level action labels. First, video $x_n$ goes through an action classifier $f$ to generate segment-wise scores for each action category, namely Class Activation Sequences (CASs), which are denoted as $A_n = f(x_n) \in \mathbb{R}^{T \times C}$ for $T$ segments and $C$ categories. The CAS score reflects the probability that each segment belongs to a specific action category. Afterwards, the CASs are aggregated by a function $g$ to produce the video-level class scores $\hat{y}_n = g(A_n)$. Specifically, $g$ is usually a rule-based function, and the most widely used one is the top-$k$ mean technique [basnet, 3c_net, w_talc], which averages the top-$k$ scores over all segments for each action category; a softmax operation is then performed to derive the video-level action probabilities. Therefore, the video-level classification model can be seen as a proxy objective for CAS learning, trained with a binary cross-entropy loss for each class (Eq. (1)).
During inference, action instances are inferred by thresholding the CAS to obtain segment-wise action intervals. However, as discussed in Section 1, it is essentially ill-posed to infer the segment-level CAS from only video-level labels via Eq. (1) due to the unobserved confounders. In the following discussion, when the context is clear, we omit the per-sample subscript for simplicity.
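The aggregation step described above can be illustrated with a minimal NumPy sketch. It assumes a CAS array of shape (T, C), a hypothetical helper `video_scores` for the top-k mean followed by softmax, and a standard per-class binary cross-entropy averaged over classes; the exact loss weighting used by any particular WTAL model may differ.

```python
import numpy as np

def video_scores(cas, k):
    """Aggregate a (T, C) CAS into C video-level probabilities (top-k mean + softmax)."""
    topk = np.sort(cas, axis=0)[-k:]        # k largest scores per class, shape (k, C)
    logits = topk.mean(axis=0)              # top-k mean per class, shape (C,)
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()

def bce_loss(probs, labels, eps=1e-8):
    """Binary cross-entropy averaged over classes (assumed form of the proxy loss)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)))

# Toy video: T=4 segments, C=2 classes; only class 0 is active in two segments.
cas = np.array([[2.0, 0.0], [4.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
probs = video_scores(cas, k=2)
loss = bce_loss(probs, np.array([1.0, 0.0]))
```

Only the video-level label enters the loss, which is why the segment-level CAS is learned through a proxy objective.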
Causal Effect Analysis.
We slightly abuse the notation $A(x)$ to denote that there exists a “ground-truth” CAS for the ground-truth video-level action label $y$. In causal inference, $A(x)$ is also known as the potential outcome [rubin], and the challenge is how to use, and only use, the training data $\{(x, y)\}$ to estimate it. Denote the conditional Monte Carlo approximation [burt1971conditional] of $\mathbb{E}[A(x)]$ as $\mathbb{E}[A \mid X = x]$. However, as shown in Figure 3, when there are confounders $U$, they affect both the input video features $X$ and the CAS $A$, i.e., $x$ may correlate with the observed label $y$ even if $x$ is not the true cause; recall the localization errors illustrated in Figure 1. Therefore, the Monte Carlo estimation is biased:
$$\mathbb{E}[A(x)] \neq \mathbb{E}[A \mid X = x]. \tag{2}$$
Suppose we could observe and measure all the possible confounders $U$ and associate them with each data sample. Then the potential outcome could be estimated by averaging over the confounders,
$$\mathbb{E}[A(x)] = \mathbb{E}_{U}\big[\mathbb{E}[A \mid X = x, U]\big], \tag{3}$$
which is also known as the backdoor adjustment [pearl2009causality]. Unfortunately, the above observed-confounder assumption is not valid in general, because the confounder is too elusive to be defined across different actions that have complex visual cues. Next, we propose a deconfounder method to remove the confounding effect.
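To make the difference between the biased conditional estimate and the backdoor adjustment concrete, here is a small numerical sketch. The binary confounder and all probability values are made up for illustration only; they are not taken from any dataset.

```python
import numpy as np

# Hypothetical binary confounder u in {0, 1} (e.g., "athletic track" absent/present).
p_u = np.array([0.5, 0.5])               # P(U = u)
p_x1_given_u = np.array([0.2, 0.8])      # P(X = 1 | U = u): the confounder influences the input
e_a_given_x1_u = np.array([0.3, 0.7])    # E[A | X = 1, U = u]: and also the outcome

# Observational estimate E[A | X = 1]: confounder weighted by P(U = u | X = 1).
p_u_given_x1 = p_x1_given_u * p_u / np.sum(p_x1_given_u * p_u)
observational = float(np.sum(e_a_given_x1_u * p_u_given_x1))

# Backdoor adjustment: confounder weighted by its marginal P(U = u).
adjusted = float(np.sum(e_a_given_x1_u * p_u))
```

In this toy setting the observational estimate is 0.62 while the adjusted one is 0.5; the gap is the confounding bias that conditioning alone cannot remove.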
3.2 Deconfounded WTAL
We now develop the deconfounder [blessing] for WTAL task, which is a general algorithm that uses latent variables that can generate the observed data, as a good substitute for the unobserved confounders (Section 3.2.1). Note that the substitute is obtained from an unsupervised generative model learned from all videos — regardless of foreground or background, labeled or unlabeled — the blessings of unlabeled background (Section 3.2.2).
3.2.1 Deconfounded CAS Function
The deconfounder theory infers a substitute confounder to perform causal inference. As shown in Figure 3, we assume that there exists a substitute confounder $Z$ which could generate the distribution of the input video features $x = (x_1, \dots, x_T)$, where $x_t$ is the feature of the $t$-th segment. Therefore, all the segment-level features are conditionally independent given $Z$:
$$P(x_1, \dots, x_T \mid Z) = \prod_{t=1}^{T} P(x_t \mid Z). \tag{4}$$
Now we show by contradiction that $Z$ is a good substitute which can capture all the unobserved confounders. Suppose that there exists an unobserved confounder $U$ not captured by $Z$, which affects multiple segment-level features within $x$ and the segment-level labels $A$. Then, the segment features would be dependent, even conditional on $Z$, through the impact of $U$. This dependence contradicts Eq. (4).
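Given the substitute confounder $Z$ that can generate $x$, we can replace the expectation over the unobserved confounders in Eq. (3) with a single $Z$, thanks to the (weak) ignorability [imai2004causal, rosenbaum1983central]:
$$\mathbb{E}[A(x)] = \mathbb{E}_{Z}\big[\mathbb{E}[A \mid X = x, Z]\big], \tag{5}$$
where $\mathbb{E}[A \mid X = x, Z]$ is the outcome model to estimate $\mathbb{E}[A(x)]$. In particular, if we denote $z$ as a deterministic inference result (i.e., decoded) from the generative model $P(Z \mid x)$, the expectation in Eq. (5) collapses to:
$$\mathbb{E}[A(x)] = \mathbb{E}[A \mid X = x, Z = z]. \tag{6}$$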
We propose to use a decoupled additive model as the overall deconfounded CAS function $\tilde{A}(x, z)$:
$$\tilde{A}(x, z) = f(x) + \lambda\, h(z), \tag{7}$$
where $f$ is the conventional CAS function in any WTAL model, $h$ is the adjustment function of the substitute confounder $z$, and $\lambda$ is a trade-off parameter. It is worth noting that the deconfounded CAS is model-agnostic: all we need to achieve deconfounded WTAL is to calibrate any trained conventional $f(x)$ by adding $\lambda\, h(z)$, as shown in the dashed red box of Figure 2. Before we detail how to obtain $z$ and $h(z)$ in Section 3.2.2, we first outline the deconfounded WTAL based on the deconfounded CAS function in Eq. (7).
Algorithm 1 illustrates the overview of the deconfounded WTAL. The inputs are training videos with only video-level class labels, and the output is the deconfounded CAS scores. Given any WTAL model and the generative model that encodes the substitute confounder, the two models are trained separately. With this decoupled training scheme, any existing WTAL model can be seamlessly combined with the adjustment function in a plug-and-play fashion. Specifically, the WTAL model is typically trained with the loss function in Eq. (1), and the generative model is trained with the objective detailed in Eq. (8). During inference, based on the trained WTAL model and the generated substitute confounder, Eq. (7) is adopted to generate the deconfounded CAS scores, as also shown in Figure 2. Action instances are then inferred by thresholding the deconfounded CAS, which is implemented with existing methods and is not within the scope of this paper.
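For readers unfamiliar with the final thresholding step, a minimal sketch is given below. The helper `intervals_from_scores` is hypothetical, operates on the score sequence of a single class, and uses a fixed threshold; the baselines considered here rely on their own post-processing, which may include multiple thresholds and non-maximum suppression.

```python
import numpy as np

def intervals_from_scores(scores, threshold):
    """Return [start, end) segment-index pairs where scores exceed the threshold."""
    mask = np.asarray(scores) > threshold
    padded = np.concatenate(([False], mask, [False]))
    # Indices where the mask switches value: even entries are starts, odd entries are ends.
    change = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(change[::2], change[1::2])]

# Segments 1-2 and segment 4 exceed the threshold.
found = intervals_from_scores([0.1, 0.8, 0.9, 0.2, 0.7], threshold=0.5)
```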
As we mentioned, to be a good substitute confounder, $z$ needs to satisfy two requirements: 1) it captures the joint distribution of the input features; and 2) it is model-agnostic, which requires it to generate scores that can be seamlessly fused with the CAS of any WTAL model. Subject to these constraints, we propose to use a temporal smoothing PCA (TS-PCA) as a simple yet effective generative model for encoding the substitute confounder $z$.
Temporal Smoothing PCA.
TS-PCA is a simple and effective generative model to learn the substitute confounder $z$. The model directly projects the features $x$ into a hidden factor $z$, which can be discriminative and naturally reflects the background or foreground. Without such an unsupervised technique, the projection would have to be trained together with the WTAL model, which would violate the decoupled training requirement of the deconfounder.
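Generally, TS-PCA learns $K$ unit feature projectors $\{p_k\}_{k=1}^{K}$, with $p_k \in \mathbb{R}^{D}$, to maximize the variance of the projected features. Given the training video set, the overall objective function of TS-PCA is:
$$\mathcal{L} = \mathcal{L}_{proj} + \alpha\, \mathcal{L}_{rec} + \beta\, \mathcal{L}_{smooth}, \tag{8}$$
where $\alpha$ and $\beta$ are parameters to control the relative contributions. $\mathcal{L}_{proj}$ denotes the projection loss, which is the basis of PCA and is composed of a variance part and an orthogonal part (Eq. (9)).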
Moreover, with the feature projectors, we can approximately reconstruct the original features by mapping the projected features back with the transposed projectors. The reconstruction loss is as follows:
Motivated by the fact that nearby segments tend to have similar visual semantics, we propose a temporal smooth loss to penalize the feature projection differences of consecutive segments:
For each video, we can calculate the substitute confounder $z$ using the trained TS-PCA above, where $z$ is computed by projecting the features with the trained projectors. (For clarity of the approach, we set the number of projectors to 1. Moreover, in practice, we generate $z$ after subtracting the mean value according to Eq. (9), to ensure that the distribution of $z$ is zero-centered.) At this point, the foreground and background segments are separated into different clusters based on the sign of $z$. We next utilize polarization to determine which direction represents the foreground and which represents the background.
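The three TS-PCA terms and the zero-centered projection described above can be sketched as follows for a single video and a single projector. The concrete forms are assumptions made for illustration: negative projected variance plus a unit-norm penalty for the projection term (with one projector, orthonormality reduces to unit norm), mean squared error for reconstruction, and squared differences of consecutive projections for smoothness. The function names and default weights are hypothetical.

```python
import numpy as np

def ts_pca_losses(x, p):
    """x: (T, D) segment features of one video; p: (D,) projector.

    Returns the (projection, reconstruction, smoothness) terms under assumed forms.
    """
    xc = x - x.mean(axis=0, keepdims=True)        # zero-center the features
    z = xc @ p                                    # projected features, shape (T,)
    variance = np.mean(z ** 2)                    # variance part (to be maximized)
    unit_penalty = (p @ p - 1.0) ** 2             # keeps the projector at unit norm
    l_proj = -variance + unit_penalty
    recon = np.outer(z, p)                        # map projections back to feature space
    l_rec = np.mean(np.sum((xc - recon) ** 2, axis=1))
    l_smooth = np.mean((z[1:] - z[:-1]) ** 2)     # penalize jumps between neighbors
    return float(l_proj), float(l_rec), float(l_smooth)

def ts_pca_objective(x, p, alpha=1.0, beta=1.0):
    """Weighted sum of the three terms (weights alpha and beta are illustrative)."""
    l_proj, l_rec, l_smooth = ts_pca_losses(x, p)
    return l_proj + alpha * l_rec + beta * l_smooth

# Toy video whose variation lies entirely along the first feature dimension.
x = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
along = ts_pca_losses(x, np.array([1.0, 0.0]))    # projector aligned with the variation
across = ts_pca_losses(x, np.array([0.0, 1.0]))   # projector orthogonal to it
```

In practice the projector would be obtained by minimizing this objective over all training videos, for example with gradient descent; the sketch only evaluates the terms.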
Now we introduce how to adjust the sign of $z$ to obtain the adjustment term $h(z)$ in Eq. (7), so that the positive sign always stands for the foreground cluster. With the learned $z$, we first discriminate the foreground and background with the assistance of the pre-trained WTAL model, and then we unify the shapes of the CAS and the adjusted $z$ through broadcasting.
Intuitively, the overall foreground/background distribution in $z$ is consistent with that of the CAS $A = f(x)$ when treating all action categories as foreground. We transform $A \in \mathbb{R}^{T \times C}$ into a category-agnostic vector $a \in \mathbb{R}^{T}$ through max-pooling along the category dimension, where $a_t$ represents the probability of segment $t$ belonging to the foreground rather than the background. Therefore, we can utilize $a$ to identify which of the two clusters in $z$ is the foreground. Specifically, for each video, we divide the segments into two groups based on the sign of $z$, i.e., $S^{+} = \{t : z_t > 0\}$ and $S^{-} = \{t : z_t < 0\}$. Then we compute the average category-agnostic CAS scores of the two groups:
$$\bar{a}^{+} = \frac{1}{|S^{+}|} \sum_{t \in S^{+}} a_t, \qquad \bar{a}^{-} = \frac{1}{|S^{-}|} \sum_{t \in S^{-}} a_t,$$
where $|\cdot|$ represents the number of elements in the set. We compare the values of $\bar{a}^{+}$ and $\bar{a}^{-}$, and flip the sign of $z$ if the latter is larger,
$$h(z) = \mathrm{sgn}\big(\bar{a}^{+} - \bar{a}^{-}\big)\, z,$$
where $\mathrm{sgn}(\cdot)$ is a function that maps positive values to $1$ and negative values to $-1$. Therefore, the positive sign always stands for the foreground cluster.
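The polarization and the additive calibration can be sketched in a few lines of NumPy. The function names, the tie-breaking rule (no flip when the two group means are equal), the early return when one group is empty, and the default trade-off value are assumptions for illustration.

```python
import numpy as np

def polarize(z, cas):
    """Flip the sign of z (T,) so that positive values mark the foreground cluster."""
    fg_score = cas.max(axis=1)                   # category-agnostic score per segment
    pos, neg = z > 0, z < 0
    if not pos.any() or not neg.any():           # nothing to compare against
        return z
    sign = 1.0 if fg_score[pos].mean() >= fg_score[neg].mean() else -1.0
    return sign * z

def deconfounded_cas(cas, z, lam=0.5):
    """Additive calibration: conventional CAS plus lam times the polarized z."""
    h = polarize(z, cas)[:, None]                # (T, 1), broadcast over the C classes
    return cas + lam * h

# The first two segments are foreground according to the CAS, but z marks them negative.
cas = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.0], [0.2, 0.1]])
z = np.array([-1.0, -1.0, 1.0, 1.0])
h = polarize(z, cas)
calibrated = deconfounded_cas(cas, z, lam=0.5)
```

Because the adjustment is added after the WTAL model has been trained, the same two functions can be applied to the CAS of any baseline without retraining it.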
Following previous works [basnet, stpn], we conduct experiments on THUMOS-14 and ActivityNet-1.3 benchmarks. We show the effectiveness of the TS-PCA deconfounder on different baselines in Section 4.2. We then compare the proposed method with the state-of-the-art models in Section 4.3. Moreover, we validate the components in TS-PCA with detailed ablation studies and show the qualitative results in Section 4.4 and Section 4.5, respectively.
Datasets. We evaluate the proposed deconfounder strategy on two standard datasets for WTAL, namely THUMOS-14 [thumos] and ActivityNet-1.3 [activitynet]. THUMOS-14 provides temporally annotated untrimmed videos from 20 action classes in its validation and test sets. Each untrimmed video contains at least one action category. Following the standard practice of previous works [basnet, stpn], we train the network on the validation set and evaluate the trained model on the test set. ActivityNet-1.3 is a large-scale video benchmark for temporal action localization, which consists of 19,994 videos with 200 classes annotated, with 50% for training, 25% for validation, and the remaining 25% for testing. As in the literature [w_talc, maan], we train our model on the training set and perform evaluations on the validation set.
Following the standard evaluation metrics, we report mean average precision (mAP) at several intersection over union (IoU) thresholds. We use the official benchmarking code provided by [activitynet] to evaluate our model on THUMOS-14 and ActivityNet-1.3.
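As a reminder of what the IoU threshold means for temporal intervals, the following sketch computes the temporal IoU between a predicted and a ground-truth interval. It is not the official benchmarking code, and `temporal_iou` is a hypothetical helper.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals on the time axis."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Intersection 2 s, union 6 s: IoU = 1/3, a match at threshold 0.3 but not at 0.5.
iou = temporal_iou((2.0, 6.0), (4.0, 8.0))
```

A prediction counts as a true positive at a given threshold when its IoU with a not-yet-matched ground-truth instance of the same class reaches that threshold; average precision is then computed per class and averaged into mAP.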
Baseline Models. To demonstrate the generalizability of the proposed deconfounder strategy, we integrated it on four popular WTAL models including STPN [stpn], A2CL-PT [min2020adversarial], BasNet [basnet] and WUM [wum] with available source code. The four models utilize different strategies to improve the performance, serving as good baselines for demonstrating TS-PCA’s generalizability for different WTAL models. For a fair comparison, we follow the same settings as reported in the official codes of STPN, A2CL-PT, BasNet and WUM.
Implementation Details. I3D networks [i3d] are used to extract segment features which take segments with 16 frames as input. I3D networks are pre-trained on Kinetics [i3d]. Note that the feature extractors are not finetuned during training.
|Nguyen et al. [nguyen2019weakly]||36.4||19.2||2.9||-|
4.2 Effectiveness on Different Baselines
To demonstrate the generalizability of TS-PCA, we deploy it with four WTAL models on the THUMOS-14 dataset, namely STPN [stpn], A2CL-PT [min2020adversarial], BasNet [basnet] and WUM [wum]. We also perform similar experiments with STPN and A2CL-PT [min2020adversarial] on the ActivityNet-1.3 dataset. Experimental results on THUMOS-14 are shown in Table 1, where “+TP” denotes that the TS-PCA deconfounder is deployed. We can observe that all four models with +TP obtain clear improvements in average mAP (see Table 1 for the exact gains on STPN, A2CL-PT, BasNet and WUM). On the ActivityNet-1.3 dataset, STPN and A2CL-PT with +TP show similar improvements in average mAP, as shown in Table 2. These results show the effectiveness of the TS-PCA deconfounder in calibrating CAS scores. Note that WUM is a recently proposed SOTA method, so its absolute improvement on THUMOS-14 is significant. Together, these results demonstrate both the effectiveness and the generalizability of the deconfounder.
4.3 Comparisons with State-of-the-Art Methods
We adopt the TS-PCA deconfounder on WUM as our model to compare with other methods on the testing set of THUMOS-14. We first compare WUM+TP with several baseline models in both weakly-supervised and fully-supervised training manners. The results are shown in Table 3. It can be observed that WUM+TP outperforms other weakly-supervised methods by a large margin, establishing new state-of-the-art performances.
Specifically, WUM+TP improves mAP over the second-best model, A2CL-PT, at the IoU thresholds reported in Table 3. These results further demonstrate the effectiveness of the TS-PCA deconfounder in calibrating CAS. Moreover, our method even achieves better results than some fully-supervised methods such as SSN [ssn], which is trained with temporal boundary annotations, showing the strength of the proposed method.
Table 4 shows the comparison results on ActivityNet-1.3, where we adopt the TS-PCA deconfounder on A2CL-PT. (We choose A2CL-PT rather than WUM since WUM’s source code and pre-trained model on ActivityNet-1.3 are not publicly available.) It can be observed that A2CL-PT+TP outperforms other state-of-the-art weakly-supervised models. In particular, A2CL-PT+TP surpasses WUM in mAP (Table 4). Moreover, A2CL-PT+TP achieves performance comparable to the fully-supervised models R-C3D [r-c3d] and CDC [cdc].
4.4 Ablation Studies
We further investigate the contributions of different components in the TS-PCA deconfounder in detail. We conduct ablation studies on the THUMOS-14 dataset. Without loss of generality, we incorporate WUM with TS-PCA as the basic model to verify the effectiveness of different components. Specifically, the ablation studies aim to answer two questions. Q1: What is the contribution of each loss component? Q2: How many feature projectors are needed to encode the substitute confounder?
4.4.1 Contributions of Different Loss Components
To get a better understanding of the proposed TS-PCA deconfounder, we further evaluate the key components of its objective. TS-PCA-R: We discard the reconstruction loss, which encourages the reconstruction of video features based on the learned projectors and helps capture the distribution of the input data. TS-PCA-S: We discard the smooth loss, which penalizes the feature projection differences of consecutive segments, to verify its contribution. TS-PCA-R-S: Both the reconstruction loss and the smooth loss are discarded.
As shown in Table 5, WUM deployed with the full TS-PCA outperforms all its variants, namely TS-PCA-R, TS-PCA-S and TS-PCA-R-S, in average mAP. Therefore, the reconstruction loss and the smooth loss are both effective and necessary.
4.5 Qualitative Results
As mentioned in Section 1, Over-completeness, Incompleteness, and Confusion are the three types of localization errors frequently encountered in WTAL. The deconfounder theory infers a substitute confounder to calibrate conventional CAS, alleviating these problems. Figure 4 illustrates the qualitative results related to the above three common issues: 1) Over-completeness (Top): it’s hard for the WTAL model WUM to discriminate background from foreground. With the assistance of deconfounded strategy, false-positive background segments are largely reduced. 2) Incompleteness (Mid): the unobtrusive action segments are easily overlooked by WUM due to confounding bias. The deconfounded CAS could improve the completeness of action instances. 3) Confusion (Bottom): when two or more action categories such as CricketBowling and CricketShot are visually correlated, it’s hard for the WTAL model to distinguish them. The deconfounded CAS can also tackle this challenge.
4.6 False Positive Analysis
Following DETAD [detad], we utilize BasNet [basnet] on THUMOS-14 as the baseline model to conduct the false positive analysis [detad] on Over-completeness, Incompleteness, and Confusion. We analyze the top-5G predictions in detail, where G is the number of ground-truth instances. To observe the trend of each error type, we split the top-5G predictions into five equal splits and investigate the percentage of each error type. To highlight the comparisons, we reorganize the figures in Figure 5: BasNet+TP obtains lower error rates than BasNet for all three types of errors. As we have reduced the three error types caused by confounders, the remaining error types, which are caused by the lack of full supervision, become more prominent than in the baseline. Note that the higher percentages of background error and double detection error do NOT mean that our model makes more errors, because its total number of errors is smaller than that of the baseline.
In this paper, we first summarize the three basic localization errors in WTAL. Then, we propose a causal inference framework to identify that the reasons are due to the unobserved and un-enumerated confounders. To capture the elusive confounders, we present the unsupervised TS-PCA deconfounder, which exploits the unlabelled background to model an observed substitute for the confounder, to remove the confounding effect. Moreover, with a novel decoupled training scheme, the deconfounder is model-agnostic and could support any WTAL model in a plug-and-play fashion. Significant improvement on four state-of-the-art WTAL methods demonstrates the effectiveness and generalization ability of our deconfounded WTAL.
Acknowledgements. We thank Chong Chen at Damo Academy for his valuable suggestions on loss functions of TS-PCA.