As the number of videos grows tremendously, extracting frames containing actions from untrimmed videos is becoming more important so that humans can browse them more efficiently. Furthermore, such frames are also useful data for machines to learn action representations. Accordingly, temporal action localization (TAL) has been developed to find frames containing actions in untrimmed videos, usually by training deep networks with full supervision, i.e., individual frames are labeled as action classes or background. However, training with full supervision has several pitfalls: frame-level annotations are (1) expensive, (2) subjective (especially at action boundaries), and (3) error-prone. Consequently, the research community has become interested in weakly-supervised temporal action localization (WTAL).
WTAL also aims to predict frame-wise labels but with weak supervision (e.g., video-level labels, the frequency of action instances in videos, or the temporal ordering of action instances). Among these, the video-level label is the most commonly used weak supervision, where each video is treated as a positive sample for an action class if it contains corresponding action frames. We note that a video can have multiple action classes as its label. In order to disseminate the video-level label to individual frames, some previous methods formulate WTAL as multiple instance learning (MIL), which employs labels for bags of instances rather than for individual instances [wang2017untrimmednets, paul2018w, Xu2019SegregatedTA]. As a video can be defined as a set of multiple frames, these methods first classify individual frames into action classes and then aggregate the frame-level scores to predict the video's action classes, so that the video-level classification loss can guide frame-level predictions.
In this paper, we argue that previous MIL-based approaches do not fully model the problem, in that background frames are not treated as a separate class although they do not belong to any action class. As a result, background frames are trained to be classified as the video's action classes to minimize the video-level loss, even though they lack the characteristic features of those actions. This inconsistency pushes background frames towards action classes, causing false positives and performance degradation.
To tackle this problem, we introduce an auxiliary class for background frames. Since all untrimmed videos contain background frames, they are positive samples for their original action classes and the background class at the same time. The aforementioned inconsistency is resolved, as all frames in a video now have their own categories to target. We note that our approach is in line with fully-supervised methods for object detection [ren2015faster, redmon2016you, liu2016ssd] and TAL [shou2016temporal, zhao2017temporal] in employing a background class. However, in the weakly-supervised setting, introducing the background class alone does not lead to improvement, because we have no negative samples for the background class to train on. This means the network will eventually learn to produce high scores for the background class regardless of the input video.
Hence, to better exploit the background class, we design the Background Suppression Network (BaS-Net), which contains two branches: Base branch and Suppression branch. Base branch has the usual MIL architecture, which takes frame-wise features as input and produces a frame-wise class activation sequence (CAS) to classify videos as positive samples for their action classes and the background class. Meanwhile, Suppression branch starts with a filtering module, which is expected to attenuate input features from background frames, followed by the same architecture as Base branch with shared weights. Unlike Base branch, the objective of Suppression branch is to minimize the scores for the background class for all videos while optimizing the original objective for the action classes. Because the two branches share weights, they cannot satisfy both of their contrasting objectives at the same time given the same input. To resolve this, the filtering module learns to suppress the activations from backgrounds. Finally, Suppression branch becomes free from the interference of background frames and, as a result, localizes actions more precisely.
The effectiveness of our method is illustrated in Fig. 1. Thanks to the filtering module, Suppression branch succeeds in suppressing the activations from background frames and localizes the action instance more accurately. In a later section, the ablation study verifies that explicitly modeling the background class and joint learning with contrasting training objectives are both necessary to improve performance.
Our contributions are three-fold:
We introduce an auxiliary class representing background, which was a missing element in modeling the weakly-supervised temporal action localization problem.
We propose an asymmetrical two-branch weight-sharing architecture with a filtering module and contrasting objectives to suppress activations from background frames.
Our BaS-Net outperforms current state-of-the-art WTAL methods in experiments on the most popular benchmarks: THUMOS'14 and ActivityNet.
2 Related Work
Fully-supervised temporal action localization (TAL)
TAL is challenging because it requires not only action classes but also temporal intervals of the actions. To tackle the problem, previous methods mostly depend on full supervision, i.e., temporal annotations. Many of them [shou2016temporal, yuan2016temporal] generate proposals by sliding windows and classify them into the action classes plus a background class. Furthermore, several works [xu2017r, chao2018rethinking] attempt to generalize object detection algorithms to TAL. Most recently, sophisticated proposal generation [lin2018bsn] and Gaussian temporal modeling [long2019gaussian] have been proposed for accurate action localization.
Weakly-supervised temporal action localization (WTAL)
WTAL solves the same problem but with less supervision, e.g., video-level labels. To derive frame-wise scores from video-level labels, previous methods generate a class activation sequence (CAS). Some of them tackle the well-known problem that the CAS tends to focus on a few discriminative frames [singh2017hide, Yuan2019MARGINALIZEDAA, liu2019completeness]. STPN [nguyen2018weakly] leverages class-agnostic attention weights along with the CAS, while AutoLoc [shou2018autoloc] generates proposals by regression instead of thresholding. Meanwhile, some works [wang2017untrimmednets, paul2018w, Xu2019SegregatedTA] formulate WTAL as a multiple instance learning (MIL) problem, as we do. However, as mentioned in Sec. 1, they do not fully model the WTAL problem: without a background class, background frames are forced to be classified as action classes. On the contrary, we introduce an auxiliary background class and also propose to suppress background frames for better action localization.
3 Proposed Method
In this section, we describe the details of our Background Suppression Network (BaS-Net). The overall architecture of BaS-Net is illustrated in Fig. 2. Before the detailed description, we first formulate the weakly-supervised temporal action localization (WTAL) problem.
Suppose that we are given $N$ training videos $\{v_n\}_{n=1}^{N}$ with their video-level labels $\{y_n\}_{n=1}^{N}$, where $y_n \in \{0,1\}^{C}$ is a $C$-dimensional binary vector with $y_{n;c} = 1$ if the $n$-th video contains the $c$-th action category and $y_{n;c} = 0$ otherwise, for $C$ action classes. A video may contain multiple action classes, i.e., $\sum_{c} y_{n;c} \geq 1$. Each input video goes through a network to generate frame-level class scores, i.e., a class activation sequence (CAS). Afterwards, the scores are aggregated to produce a video-level class score. The network is trained to correctly predict the video-level label, which serves as a proxy objective for the CAS. At test time, frame-wise action intervals are inferred by thresholding the CAS for the predicted action classes.
3.1 Background class
As discussed in Sec. 1, without a background class, activations from background frames lean towards action classes, which disturbs accurate localization. In order to alleviate this disturbance, we introduce an auxiliary class representing the background. Then, naturally, all training videos are labeled as positive samples for the background class, since every untrimmed video contains background frames. This leads to an imbalance problem: we have no negative samples for the background class to use for training, so the corresponding CAS will always be high. Consequently, adding the background class alone does not bring performance improvement, which is also verified by the ablation study in Sec. 4.
3.2 Two-branch Architecture
Hence, we design a two-branch architecture to better exploit the background class. As illustrated in Fig. 2, our architecture contains two branches following a feature extractor: Base branch and Suppression branch. Both branches, sharing their weights, take a feature map and produce a CAS to predict video-level scores, with two differences: i) Suppression branch contains a filtering module which learns to filter out background frames, ultimately suppressing their activations in the CAS. ii) Their training objectives are different. The objective of Base branch is to classify an input video as a positive sample for its original action classes and also for the background class. On the other hand, Suppression branch with the filtering module is trained to minimize the background class score while keeping the same objective for the original action classes. The weight-sharing strategy prevents the branches from satisfying both of their objectives at the same time when the same input is given. Therefore, the filtering module becomes the only means to resolve this conflict and is trained to suppress activations from background frames so that both objectives can be pursued simultaneously. This reduces the interference of background frames and improves action localization performance.
3.3 Background Suppression Network
We first divide each input video $v_n$ into 16-frame non-overlapping segments due to memory constraints. To deal with the large variation of video lengths, we sample a fixed number $T$ of segments from each video. Then, we feed the sampled RGB and flow segments into the pre-trained feature extractor to generate $D$-dimensional feature vectors $x_{t}^{\mathrm{RGB}}$ and $x_{t}^{\mathrm{flow}}$, respectively. Afterwards, RGB and flow features are concatenated to build complete features $x_{t} = [x_{t}^{\mathrm{RGB}}; x_{t}^{\mathrm{flow}}] \in \mathbb{R}^{2D}$, which are then stacked along the temporal dimension to form a feature map of length $T$, i.e., $X_n \in \mathbb{R}^{T \times 2D}$ (Fig. 2 (a)).
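As a minimal sketch of this feature-map construction (the segment count and per-stream feature dimension below are placeholder values, not necessarily the paper's):

```python
import numpy as np

# Placeholder dimensions: T sampled segments, D-dim features per stream.
T, D = 750, 1024
rng = np.random.default_rng(0)
rgb = rng.standard_normal((T, D))    # stand-in for pre-extracted RGB features
flow = rng.standard_normal((T, D))   # stand-in for pre-extracted flow features

# Concatenate the two streams per segment; stacking along the temporal
# dimension yields a (T, 2D) feature map for the whole video.
feature_map = np.concatenate([rgb, flow], axis=1)
print(feature_map.shape)  # → (750, 2048)
```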
To predict segment-level class scores, we generate a CAS, in which each segment has its class scores, by feeding the feature map into temporal 1D convolutional layers. This can be formalized as follows for a video $v_n$:

$\mathcal{A}_n = f_{\mathrm{conv}}(X_n; \phi), \quad (1)$

where $\phi$ denotes the trainable parameters of the convolutional layers and $\mathcal{A}_n \in \mathbb{R}^{T \times (C+1)}$. $\mathcal{A}_n$ has $C+1$ class dimensions because we use $C$ action classes and one auxiliary class for the background.
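For illustration only, a kernel-size-1 temporal convolution reduces to a per-segment linear map; real implementations typically stack several temporal convolutions, so this is a simplification with assumed dimensions:

```python
import numpy as np

# Toy CAS generation: T segments, D-dim features, C action classes.
T, D, C = 750, 2048, 20
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((T, D))

# A kernel-size-1 temporal conv is a per-segment linear map.
W = rng.standard_normal((D, C + 1)) * 0.01  # C action classes + 1 background
b = np.zeros(C + 1)
cas = feature_map @ W + b                   # class activation sequence
print(cas.shape)  # → (750, 21)
```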
Afterwards, we aggregate the segment-level class scores to derive a single video-level class score, which will be compared to the ground truth. There are several approaches to gathering the scores; we adopt the top-k mean technique following the previous work [wang2017untrimmednets, paul2018w]. The video-level class score $s_{n;c}$ for class $c$ of video $v_n$ is derived as follows:

$s_{n;c} = \frac{1}{k} \max_{\substack{\mathcal{T} \subset \{1,\dots,T\},\, |\mathcal{T}| = k}} \sum_{t \in \mathcal{T}} \mathcal{A}_{n}[t, c], \quad k = \lceil T / r \rceil, \quad (2)$

where $r$ is a hyperparameter to control the ratio of selected segments in a video.
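The top-k mean aggregation can be sketched as follows (the value of the ratio hyperparameter is a hyperparameter, not specified here):

```python
import numpy as np

def topk_mean_scores(cas, r):
    """Aggregate a (T, C+1) CAS into video-level class scores by averaging
    the k largest activations per class, with k = ceil(T / r)."""
    T = cas.shape[0]
    k = max(1, int(np.ceil(T / r)))
    topk = np.sort(cas, axis=0)[-k:]   # k largest activations per class
    return topk.mean(axis=0)           # video-level score, shape (C+1,)

# Toy CAS with T=6 segments and 2 classes; r=3 selects k=2 segments,
# so each class score is the mean of its two largest activations.
cas = np.arange(12, dtype=float).reshape(6, 2)
scores = topk_mean_scores(cas, r=3)
```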
The video-level class score is then used to predict the probability of being a positive sample for each class by applying the softmax function along the class dimension:

$p_{n;c} = \frac{\exp(s_{n;c})}{\sum_{c'=1}^{C+1} \exp(s_{n;c'})}, \quad (3)$

where $p_{n}$ has $C+1$ dimensions and each dimension indicates the probability of being a positive sample for its respective category for video $v_n$.
To train the network, we define a loss function $\mathcal{L}_{\mathrm{base}}$ with the binary cross-entropy loss for each class:

$\mathcal{L}_{\mathrm{base}} = \frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C+1} \left( -\tilde{y}_{n;c} \log p_{n;c} - (1 - \tilde{y}_{n;c}) \log (1 - p_{n;c}) \right), \quad (4)$

where $\tilde{y}_{n}$ is the video-level label for the $n$-th video extended with the background class. The additional label for the background class is set to be positive, considering that all training videos contain background frames.
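A compact sketch of equations (3) and (4), with the caveat that the exact normalization of the loss is an assumption here:

```python
import numpy as np

def video_probabilities(scores):
    """Softmax along the (C+1) class dimension of the video-level scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def base_branch_loss(probs, action_label):
    """Per-class binary cross-entropy. In Base branch the background class
    (last index) is always labeled positive, since every untrimmed video
    contains background frames."""
    label = np.append(action_label, 1.0)   # extended label [y_1..y_C, 1]
    eps = 1e-8
    bce = -(label * np.log(probs + eps) + (1 - label) * np.log(1 - probs + eps))
    return float(bce.mean())

probs = video_probabilities(np.array([2.0, -1.0, 0.5]))  # C=2 actions + bg
loss = base_branch_loss(probs, action_label=np.array([1.0, 0.0]))
```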
Different from Base branch, Suppression branch contains a filtering module at its front, which is trained to suppress background frames via the opposite training objective for the background class. The filtering module consists of two temporal 1D convolutional layers followed by a sigmoid function. The output of the filtering module is a sequence of foreground weights $w_n \in [0,1]^{T}$, which range from 0 to 1. While the configuration of the filtering module is similar to the attention module in STPN [nguyen2018weakly], it should be noted that the training objectives are different, so what it targets and what it learns also differ from STPN. The foreground weights from the filtering module are multiplied with the feature map over the temporal dimension to filter out background frames. This step can be expressed as follows:

$X'_n = X_n \odot w_n, \quad (5)$

where $X'_n \in \mathbb{R}^{T \times 2D}$ and $\odot$ denotes element-wise multiplication over the temporal dimension.
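A sketch of the filtering step, simplifying the two temporal convolutions to per-segment linear maps (the hidden width and the ReLU in between are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def filtering_module(features, W1, W2):
    """Two kernel-size-1 temporal convs (per-segment linear maps here,
    with an assumed ReLU between them) followed by a sigmoid, yielding
    one foreground weight in (0, 1) per segment."""
    hidden = np.maximum(features @ W1, 0.0)
    return sigmoid(hidden @ W2).squeeze(-1)   # shape (T,)

rng = np.random.default_rng(0)
T, D, H = 6, 8, 4
X = rng.standard_normal((T, D))
w = filtering_module(X, rng.standard_normal((D, H)), rng.standard_normal((H, 1)))

# Element-wise multiplication over the temporal dimension attenuates
# segments with small foreground weights.
X_filtered = X * w[:, None]
```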
The remaining process is analogous to Base branch except that the input feature map is the filtered one, $X'_n$. We note that the convolutional layers of the two branches share weights. Following equations (2) and (3), we obtain the video-level class score $s^{\mathrm{supp}}_{n}$ and the class-wise probability $p^{\mathrm{supp}}_{n}$, where backgrounds are suppressed.
We build the loss function $\mathcal{L}_{\mathrm{supp}}$ with the binary cross-entropy loss for each class:

$\mathcal{L}_{\mathrm{supp}} = \frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C+1} \left( -\bar{y}_{n;c} \log p^{\mathrm{supp}}_{n;c} - (1 - \bar{y}_{n;c}) \log (1 - p^{\mathrm{supp}}_{n;c}) \right), \quad (6)$

where $\bar{y}_{n}$ is the extended label in which we set the label for the background class to 0, which is different from that of Base branch, in order to train the filtering module to suppress background frames.
We jointly train Base branch and Suppression branch. The overall loss function we optimize is composed as follows:

$\mathcal{L} = \alpha \mathcal{L}_{\mathrm{base}} + \beta \mathcal{L}_{\mathrm{supp}} + \gamma \mathcal{L}_{\mathrm{norm}}, \quad (7)$

where $\alpha$, $\beta$, and $\gamma$ are hyperparameters. Following the previous work [nguyen2018weakly, Xu2019SegregatedTA], we employ the L1 normalization of the attention weights, i.e., $\mathcal{L}_{\mathrm{norm}} = \frac{1}{N} \sum_{n=1}^{N} \| w_n \|_{1}$, in order to make the foreground weights more polarized.
3.4 Classification and Localization
After describing how our model is configured and trained, we turn to how it works at test time. Since we suppress activations from background frames with our filtering module, it is natural to use the output of Suppression branch for inference. For classification, we discard classes whose probabilities in $p^{\mathrm{supp}}_{n}$ are below a threshold $\theta_{\mathrm{class}}$. Then, for the remaining categories, we threshold the CAS with a threshold $\theta_{\mathrm{act}}$ to select candidate segments. Afterwards, each set of consecutive candidate segments becomes a proposal. We compute the confidence score for each proposal using the contrast between its inner and outer areas, following the recent work [liu2019completeness].
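The proposal-generation step above can be sketched as follows, with a hypothetical threshold value:

```python
import numpy as np

def extract_proposals(cas_c, theta_act):
    """Threshold one class's CAS and merge runs of consecutive
    above-threshold segments into (start, end) proposals
    (segment indices, end exclusive)."""
    keep = cas_c > theta_act
    proposals, start = [], None
    for t, k in enumerate(keep):
        if k and start is None:
            start = t
        elif not k and start is not None:
            proposals.append((start, t))
            start = None
    if start is not None:
        proposals.append((start, len(keep)))
    return proposals

cas_c = np.array([0.1, 0.6, 0.7, 0.2, 0.8, 0.9, 0.9, 0.1])
print(extract_proposals(cas_c, theta_act=0.5))  # → [(1, 3), (4, 7)]
```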
Table 1. Detection results on THUMOS'14 (mAP % at IoU thresholds 0.1–0.9).

| Supervision | Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
| Full | Richard et al. [richard2016temporal] | 39.7 | 35.7 | 30.0 | 23.2 | 15.2 | - | - | - | - |
| Full | Yeung et al. [yeung2016end] | 48.9 | 44.0 | 36.0 | 26.4 | 17.1 | - | - | - | - |
| Full | PSDF + T-SVM [yuan2016temporal] | 51.4 | 42.6 | 33.6 | 26.1 | 18.8 | - | - | - | - |
| Full | Yuan et al. [yuan2017temporal] | 51.0 | 45.2 | 36.5 | 27.8 | 17.8 | - | - | - | - |
| Full | Action Search [alwassel2018action] | 51.8 | 42.4 | 30.8 | 20.2 | 11.1 | - | - | - | - |
| Weak | STPN (UNT) [nguyen2018weakly] | 45.3 | 38.8 | 31.1 | 23.5 | 16.2 | 9.8 | 5.1 | 2.0 | 0.3 |
| Weak | W-TALC (UNT) [paul2018w] | 49.0 | 42.8 | 32.0 | 26.0 | 18.8 | - | 6.2 | - | - |
| Weak | Liu et al. (UNT) [liu2019completeness] | 53.5 | 46.8 | 37.5 | 29.1 | 19.9 | 12.3 | 6.0 | - | - |
| Weak | STPN (I3D) [nguyen2018weakly] | 52.0 | 44.7 | 35.5 | 25.8 | 16.9 | 9.9 | 4.3 | 1.2 | 0.1 |
| Weak | W-TALC (I3D) [paul2018w] | 55.2 | 49.6 | 40.1 | 31.1 | 22.8 | - | 7.6 | - | - |
| Weak | Liu et al. (I3D) [liu2019completeness] | 57.4 | 50.8 | 41.2 | 32.1 | 23.1 | 15.0 | 7.0 | - | - |
4 Experiments
In this section, we evaluate our BaS-Net with extensive experiments. We first describe the experimental settings, followed by a comparison with state-of-the-art methods and an ablation study. Lastly, we visually demonstrate qualitative results of our method.
4.1 Experimental Settings
We conduct experiments on weakly-supervised temporal action localization task on the most popular benchmarks: THUMOS’14 [THUMOS14] and ActivityNet [caba2015activitynet]. They consist of untrimmed videos and provide both video-level action labels and frame-level temporal annotations. Note that we utilize only video-level labels for training and temporal annotations are used only for evaluation.
We use two networks, namely UntrimmedNet [wang2017untrimmednets] and I3D networks [carreira2017quo]
, as our feature extractor. They are pre-trained on ImageNet[deng2009imagenet] and Kinetics [carreira2017quo], respectively. We note that the feature extractor is not fine-tuned for fair comparison. We use TVL1 algorithm [wedel2009improved] for generating optical flow of segments.
We fix the number of input segments $T$ to 750. To sample segments from each video, we use stratified random perturbation during training and uniform sampling during testing, the same as STPN [nguyen2018weakly]. All remaining hyperparameters are determined empirically by grid search. For $\theta_{\mathrm{act}}$, we use a set of thresholds from 0 to 0.5 with step 0.025 and perform non-maximum suppression (NMS) with a threshold of 0.7 to remove highly overlapping proposals. Experiments are conducted on a single GTX 1080Ti GPU.
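Standard greedy temporal NMS, as used here, can be sketched as:

```python
def temporal_iou(a, b):
    """IoU between two (start, end) temporal intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms(proposals, scores, thresh):
    """Greedy temporal NMS: visit proposals in descending score order and
    drop any whose IoU with an already-kept proposal exceeds thresh."""
    order = sorted(range(len(proposals)), key=lambda i: -scores[i])
    kept = []
    for i in order:
        if all(temporal_iou(proposals[i], proposals[j]) <= thresh for j in kept):
            kept.append(i)
    return kept

props = [(0.0, 10.0), (1.0, 11.0), (20.0, 30.0)]
print(nms(props, scores=[0.9, 0.8, 0.7], thresh=0.7))  # → [0, 2]
```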
Table 3. Detection results on ActivityNet1.3 (mAP % at IoU thresholds 0.5, 0.75, 0.95, and average).

| Supervision | Method | 0.5 | 0.75 | 0.95 | Avg |
| Full | Singh et al. [singh2016untrimmed]* | 34.5 | - | - | - |
|  | Xiong et al. [xiong2017pursuit]* | 39.1 | 23.5 | 5.5 | 24.0 |
|  | Liu et al. [liu2019completeness] | 34.0 | 20.9 | 5.7 | 21.2 |
4.2 Comparison with state-of-the-art methods
We compare our BaS-Net with current state-of-the-art fully-supervised and weakly-supervised approaches at several IoU thresholds. The results on THUMOS'14, ActivityNet1.2, and ActivityNet1.3 are summarized in Table 1, Table 4, and Table 3, respectively. In the tables, methods at different levels of supervision are separated by horizontal lines for fair comparison. We note that STAR [Xu2019SegregatedTA] cannot be directly compared with our method: although it is a weakly-supervised method, its level of supervision differs from ours since it exploits additional annotations, i.e., the frequency of action instances.
Table 1 presents the quantitative results on THUMOS'14 in chronological order. The lower two partitions are grouped by the choice of feature extractor: UntrimmedNet (UNT) and I3D. Our method significantly outperforms all state-of-the-art methods at the same level of supervision, regardless of the feature extractor. We also compare our BaS-Net with fully-supervised approaches. Even with a much lower level of supervision, our method shows the smallest gap to the latest fully-supervised methods. Furthermore, our method even outperforms several fully-supervised methods at some IoU thresholds.
We also evaluate our BaS-Net on ActivityNet1.3 in Table 3. We see that our method outperforms all other weakly-supervised approaches. Moreover, despite using weaker labels, our algorithm outperforms STAR at all IoU thresholds.
Experimental results on ActivityNet1.2 are shown in Table 4 to compare our method against additional methods. Our model outperforms all weakly-supervised methods, trailing the fully-supervised method by only a small gap.
Table 4. Detection results on ActivityNet1.2 (mAP % at IoU thresholds 0.5, 0.75, 0.95, and average).

| Method | 0.5 | 0.75 | 0.95 | Avg |
| Liu et al. [liu2019completeness] | 36.8 | 22.0 | 5.6 | 22.4 |
Table 2. Ablation results on THUMOS'14 comparing the baseline, Base branch, Suppression branch, and BaS-Net.
4.3 Ablation study
In Table 2, we conduct an ablation study on THUMOS'14 to investigate the contributions of the different components of BaS-Net.
Baseline. We set the baseline as the vanilla MIL setting, i.e., Base branch without the auxiliary background class.
Base branch. We add the auxiliary background class to the baseline, yielding Base branch. As shown, the performance does not improve but rather decreases. We conjecture that this is because the network is trained to always produce high activations for the background class for any video, due to the lack of negative samples, which disturbs classification. This indicates that merely correcting the WTAL problem setting by explicitly modeling the background class cannot lead to performance improvement.
Suppression branch. We evaluate a variant with only Suppression branch in order to assess the role of Base branch. With the filtering module acting like attention, it improves localization performance over the baseline. However, we note that this gain does not come from background modeling, since there is no positive sample for the background class.
BaS-Net. By employing both branches and jointly training them with contrasting objectives, BaS-Net learns the background class as well as the action classes and shows the best performance by large margins over the others.
We also examine how effectively each branch detects background frames by measuring the F-measure. Table 5 demonstrates that BaS-Net requires joint learning of both branches.
4.4 Qualitative results
Fig. 3 shows several qualitative results on THUMOS’14.
Sparse case. Fig. 3 (a) is a challenging example because the humans look small and the actions occur sparsely, i.e., background frames occupy a large portion of the video. Despite these challenges, our method successfully suppresses the activations from background frames and localizes the action interval precisely.
Frequent case. In Fig. 3 (b), the video contains very frequent instances of Shotput, which makes localization difficult. Nonetheless, by distinguishing actions from the background, our method accurately finds the actions.
Challenging background case. Fig. 3 (c) shows an example with a challenging background that has a very similar appearance to the foreground. As a result, in Base branch, some background frames show even higher activations than foreground frames. Even so, our Suppression branch successfully attenuates the background activations, indicating that explicitly modeling the background is important.
5 Conclusion
In this work, we identified a problem posed by the lack of background modeling in previous weakly-supervised temporal action localization methods. To solve the problem, we proposed to classify not only the action classes but also the background class in multiple instance learning. Moreover, to better exploit background information, we introduced a new two-branch architecture with an asymmetrical training strategy. The ablation study showed that the background class and the training strategy are both necessary to achieve performance improvement. Through extensive experiments, we demonstrated that our framework is effective at suppressing background and outperforms current state-of-the-art methods for weakly-supervised temporal action localization on both THUMOS'14 and ActivityNet.
Acknowledgments
This project was partly supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2017M3C4A7069370) and the Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No. 2019-0-01558: Study on audio, video, 3d map and activation map generation system using deep generative model).