Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization

10/22/2020 · by Yuanhao Zhai, et al.

Weakly-supervised Temporal Action Localization (W-TAL) aims to classify and localize all action instances in an untrimmed video under only video-level supervision. However, without frame-level annotations, it is challenging for W-TAL methods to identify false positive action proposals and generate action proposals with precise temporal boundaries. In this paper, we present a Two-Stream Consensus Network (TSCN) to simultaneously address these challenges. The proposed TSCN features an iterative refinement training method, where a frame-level pseudo ground truth is iteratively updated, and used to provide frame-level supervision for improved model training and false positive action proposal elimination. Furthermore, we propose a new attention normalization loss to encourage the predicted attention to act like a binary selection, and promote the precise localization of action instance boundaries. Experiments conducted on the THUMOS14 and ActivityNet datasets show that the proposed TSCN outperforms current state-of-the-art methods, and even achieves comparable results with some recent fully-supervised methods.


1 Introduction

The task of Weakly-supervised Temporal Action Localization (W-TAL) aims at simultaneously localizing and classifying all action instances in a long untrimmed video, given only video-level categorical labels in the learning phase. Compared to its fully-supervised counterpart, which requires frame-level annotations of all action instances during training, W-TAL greatly simplifies the procedure of data collection and avoids the annotation bias of human annotators, and has therefore been widely studied [18, 41, 34, 27, 30, 1, 23, 46, 24, 28, 26, 43, 20] in recent years.

Several W-TAL methods [41, 30, 27, 23, 28, 26, 20] adopt a Multiple Instance Learning (MIL) framework, where a video is treated as a bag of frames/snippets to perform the video-level action classification. During testing, the trained model slides over time and generates a Temporal-Class Activation Map (T-CAM) [49, 27] (i.e., a sequence of probability distributions over action classes at each time step) and an attention sequence that measures the relative importance of each snippet. The action proposals are generated by thresholding the attention values and/or the T-CAM. This MIL framework is usually built on two feature modalities, i.e., RGB frames and optical flow, which are fused in two possible ways. Early fusion methods [30, 34, 1, 23, 24, 20] concatenate the RGB and optical flow features before they are fed into the network, and late fusion methods [27, 23, 28, 26] compute a weighted sum of their respective outputs before generating action proposals. An example of late fusion is shown in Fig. 1.

Figure 1: Visualization of two-stream outputs and their late fusion result. The first two rows are an input video and the ground truth action instances, respectively. The last three rows are attention sequences (scaled from 0 to 1) predicted by the RGB stream, the flow stream and their weighted sum (i.e., the fusion result), respectively, and the horizontal and vertical axes denote time and the intensity of attention values, respectively. The green boxes denote the localization results generated by thresholding the attention at the value of 0.5. By properly combining the two different attention distributions predicted by the RGB and flow streams, the late fusion result achieves a higher true positive rate and a lower false positive rate, and thus has better localization performance

Despite these recent developments, two major challenges still persist. One of the most critical problems that prior W-TAL methods suffer from is the inability to rule out false positive action proposals. Without frame-level annotations, they may localize action instances that do not necessarily correspond to the video-level labels. For example, a model may falsely localize the action “swimming” by only checking the existence of water in the scene. Therefore, it is necessary to exploit more fine-grained supervision to guide the learning process. Another problem lies in the generation of action proposals. In previous methods, action proposals are generated by thresholding the activation sequence with a fixed, empirically preset threshold. This threshold has a significant impact on the quality of action proposals: a high threshold may result in incomplete action proposals, while a low threshold can bring more false positives. How to get out of this dilemma, however, has rarely been studied.

In this paper, we introduce a Two-Stream Consensus Network (TSCN) to address the two aforementioned problems. To eliminate false positive action proposals, we design an iterative refinement training scheme, where a frame-level pseudo ground truth is generated from the late fusion attention sequence and serves as more precise frame-level supervision to iteratively update the two-stream models. Our intuition is simple: late fusion is essentially a voting ensemble of the RGB and flow streams, and if a proper fusion parameter (i.e., the hyperparameter to control the relative importance of the two streams) is selected, late fusion can provide a more accurate result than each individual stream. The advantage of combining these two streams has been demonstrated by the Two-Stream Convolutional Networks [37] for action recognition. As shown in Fig. 1, the two streams produce different activation distributions, which lead to different false positives and false negatives. However, when they are combined, the false positive action proposals that only exist in one stream can be largely eliminated, and a high activation value occurs only when both streams are confident that an action instance exists. Since the late fusion result is of higher quality than each single-stream result, it can in turn serve as a frame-level pseudo ground truth to supervise and refine both streams. To generate high-quality action proposals, we introduce a new attention normalization loss. It pushes the predicted attention to approach extreme values, i.e., 0 and 1, so as to avoid ambiguity. As a result, simply setting the threshold to 0.5 yields high-quality action proposals.

Formally, given an input video, RGB and optical flow features are first extracted from pre-trained deep networks. Then two-stream base models are trained with video-level labels on RGB and optical flow features, respectively, where the attention normalization loss is used to learn the attention distribution. After obtaining two-stream attention sequences, a frame-level pseudo ground truth is generated based on their weighted sum (i.e., the late fusion attention sequence), and in turn provides frame-level supervision to improve the two-stream models. We iteratively update the pseudo ground truth and refine the two-stream base models, and the normalization term at the same time forces the predicted attention to approach a binary selection. The final localization result is obtained by thresholding the late fusion attention sequence.
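To make the training schedule concrete, the following is a minimal sketch of the iterative refinement loop. It is an illustration under assumptions, not the paper's exact implementation: the helper train_one_refinement_iteration, the number of refinement iterations, the fusion weight xi and the hard-label threshold theta are all hypothetical placeholders.

```python
# A minimal sketch of the iterative refinement schedule described above.
# `train_one_refinement_iteration` is a hypothetical helper that trains both
# streams (with video-level labels, plus the pseudo ground truth when given)
# and returns per-video attention sequences for the RGB and flow streams.
def iterative_refinement(videos, train_one_refinement_iteration,
                         num_refinement_iterations=3, xi=0.5, theta=0.5):
    pseudo_gt = None  # refinement iteration 0: video-level labels only
    for n in range(num_refinement_iterations + 1):
        rgb_att, flow_att = train_one_refinement_iteration(videos, pseudo_gt)
        # late fusion of the two streams' attention sequences (xi is assumed)
        fused = [xi * a + (1.0 - xi) * f for a, f in zip(rgb_att, flow_att)]
        # hard pseudo ground truth for the next iteration (threshold assumed)
        pseudo_gt = [(lam > theta).float() for lam in fused]
    return fused  # the final localization thresholds the late-fusion attention
```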

To summarize, our contribution is threefold:

  • We introduce a Two-Stream Consensus Network (TSCN) for W-TAL. The proposed TSCN uses an iterative refinement training method, where a pseudo ground truth generated from the late fusion attention sequence at the previous iteration provides more precise frame-level supervision to the current iteration.

  • We propose an attention normalization loss function, which forces the attention to act like a binary selection, and thus improves the quality of action proposals generated by the thresholding method.

  • Extensive experiments are conducted on two standard benchmarks (i.e., THUMOS14 and ActivityNet) to demonstrate the effectiveness of the proposed method. Our TSCN significantly outperforms previous state-of-the-art W-TAL methods, and even achieves comparable results to some recent fully-supervised TAL methods.

2 Related Work

Action Recognition. Traditional methods [19, 8, 7, 39] aim to model spatio-temporal information via hand-crafted features. Two-Stream Convolutional Networks [37] use two separate Convolutional Neural Networks (CNNs) to exploit appearance and motion cues from RGB frames and optical flow, respectively, and use a late fusion method to reconcile the two-stream outputs. Feichtenhofer et al. [10] study different ways to fuse the two streams. The Inflated 3D ConvNet (I3D) [3] expands the 2D CNNs in two-stream networks to 3D CNNs. Several recent methods [47, 5, 35, 40, 31] focus on directly learning motion cues from RGB frames instead of computing optical flow.

Fully-supervised Temporal Action Localization. Fully-supervised TAL requires frame-level annotations of all action instances during training. Several large-scale datasets have been created for this task, such as THUMOS [15, 13], ActivityNet [2], and Charades [36]. Many methods [33, 48, 12, 14, 6, 42, 22, 4] adopt a two-stage pipeline, i.e., action proposal generation followed by action classification. Several methods [42, 6, 11, 4] adapt the Faster R-CNN [32] framework to TAL. Most recently, some methods [22, 25, 21] try to generate action proposals with more flexible durations. Zeng et al. [45] apply Graph Convolutional Networks (GCNs) [17, 38] to TAL to exploit proposal-proposal relations.

Weakly-supervised Temporal Action Localization. W-TAL, which only requires video-level supervision during training, greatly reduces the data annotation effort and has recently drawn increasing attention from the community. Hide-and-Seek [18] randomly hides part of the input video to guide the network to discover other relevant parts. UntrimmedNet [41] consists of a selection module to select the important snippets and a classification module to perform per-snippet classification. Sparse Temporal Pooling Network (STPN) [27] improves UntrimmedNet by adding a sparse loss to enforce the sparsity of selected segments. W-TALC [30] jointly optimizes a co-activity similarity loss and a multiple instance learning loss to train the network. AutoLoc [34] is one of the first two-stage methods in W-TAL: it first generates initial action proposals and then regresses the boundaries of the action proposals with an Outer-Inner-Contrastive loss. CleanNet [24] improves AutoLoc by leveraging the temporal contrast in snippet-level action classification predictions. Liu et al. [23] propose a multi-branch network to model different stages of an action. Besides, several methods [28, 20] focus on modeling the background and achieve state-of-the-art performance.

Recently, RefineLoc [1] uses an iterative refinement method to help the model capture complete action instances. Our method is distinct from RefineLoc in three main aspects. (1) We adopt a late fusion framework, while RefineLoc adopts an early fusion framework. (2) Our pseudo ground truth is generated from the two-stream late fusion attention sequence, which provides better localization performance than each single stream, while RefineLoc generates the pseudo ground truth by expanding previous localization results, which might result in coarser and over-complete action proposals. (3) We introduce a new attention normalization loss to explicitly avoid the ambiguity of attention, while RefineLoc has no explicit constraints on attention values.

Figure 2: An overview of the proposed Two-Stream Consensus Network, which consists of three parts: (1) RGB and optical flow snippet-level features are extracted with pre-trained models; (2) two-stream base models are separately trained using these RGB and optical flow features; (3) frame-level pseudo ground truth is generated from the two-stream late fusion attention sequence, and in turn provides frame-level supervision to two-stream base models

3 Two-Stream Consensus Network

In this section, we first formulate the task of Weakly-supervised Temporal Action Localization (W-TAL), and then describe the proposed Two-Stream Consensus Network (TSCN) in detail. The overall architecture is shown in Fig. 2.

3.1 Problem Formulation

Assume we are given a set of training videos. For each video, we only have its video-level categorical label $\mathbf{y} \in \mathbb{R}^{C}$, where $\mathbf{y}$ is a normalized multi-hot vector and $C$ is the number of action categories. The goal of temporal action localization is to generate a set of action proposals $\{(t_s, t_e, c, q)\}$ for each testing video, where $t_s$, $t_e$, $c$ and $q$ denote the start time, the end time, the predicted action category and the confidence score of the action proposal, respectively.

3.2 Feature Extraction

Following recent W-TAL methods [34, 27, 30, 23, 24, 28, 26, 43, 20], we construct TSCN upon snippet-level feature sequences extracted from the raw video volume. The RGB and optical flow features are extracted with pre-trained deep networks (e.g., I3D [3]) from non-overlapping fixed-length RGB frame snippets and optical flow snippets, respectively. They provide high-level appearance and motion information of the corresponding snippets. Formally, given a video with $T$ non-overlapping snippets, we denote the RGB features and optical flow features as $X^{\mathrm{RGB}} = \{x_t^{\mathrm{RGB}}\}_{t=1}^{T}$ and $X^{\mathrm{flow}} = \{x_t^{\mathrm{flow}}\}_{t=1}^{T}$, respectively, where $x_t^{\mathrm{RGB}}, x_t^{\mathrm{flow}} \in \mathbb{R}^{D}$ are the feature representations of the $t$-th RGB snippet and optical flow snippet, respectively, and $D$ denotes the channel dimension.
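As a concrete illustration of this feature layout, below is a small sketch in PyTorch. The snippet length of 16 frames and the generic backbone callable (standing in for a frozen I3D or UntrimmedNet feature extractor) are assumptions for illustration.

```python
import torch

def extract_snippet_features(frames, backbone, snippet_len=16):
    """frames: (num_frames, C, H, W) video tensor.
    backbone: any callable mapping a (snippet_len, C, H, W) clip to a (D,) feature.
    Returns a (T, D) sequence of snippet-level features."""
    T = frames.shape[0] // snippet_len          # number of non-overlapping snippets
    frames = frames[: T * snippet_len]          # drop the incomplete tail snippet
    clips = frames.view(T, snippet_len, *frames.shape[1:])
    with torch.no_grad():                       # the backbone is frozen (not fine-tuned)
        feats = torch.stack([backbone(clip) for clip in clips])
    return feats                                # X in R^{T x D}

# The same routine is run twice per video: once on RGB frames and once on
# stacked optical flow fields, yielding X^RGB and X^flow.
```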

3.3 Two-Stream Base Models

After obtaining the RGB and optical flow features, we first use two-stream base models to perform the video-level action classification, and then iteratively refine the base models with a frame-level pseudo ground truth. The features of the two modalities are fed into two separate base models, which use the same architecture but do not share parameters. Therefore, in this subsection, we omit the RGB and flow subscripts for conciseness.

Since the features are not originally trained for the W-TAL task, we stack the input features into a sequence $X \in \mathbb{R}^{T \times D}$ and use a set of temporal convolutional layers to generate new features $\hat{X} = \{\hat{x}_t\}_{t=1}^{T}$, where $\hat{x}_t \in \mathbb{R}^{D'}$ and $D'$ denotes the output feature dimension.

As a video may contain background snippets, to perform video-level classification, we need to select snippets that are likely to contain action instances and meanwhile filter out snippets that are likely to contain background. To this end, an attention value $\lambda_t \in [0, 1]$ that measures the likelihood of the $t$-th snippet containing an action is given by a fully-connected (FC) layer:

$\lambda_t = \sigma(\mathbf{w}_{\mathrm{att}}^{\top} \hat{x}_t + b_{\mathrm{att}})$  (1)

where $\sigma(\cdot)$, $\mathbf{w}_{\mathrm{att}}$ and $b_{\mathrm{att}}$ are the sigmoid function, weight vector and bias of the attention layer, respectively. We then perform attention-weighted pooling over the feature sequence to generate a single foreground feature $\bar{x}$, and feed it to an FC softmax layer to get the video-level prediction:

$\bar{x} = \frac{\sum_{t=1}^{T} \lambda_t \hat{x}_t}{\sum_{t=1}^{T} \lambda_t}$  (2)
$p_c = \frac{\exp(\mathbf{w}_c^{\top} \bar{x} + b_c)}{\sum_{c'=1}^{C} \exp(\mathbf{w}_{c'}^{\top} \bar{x} + b_{c'})}$  (3)

where $p_c$ is the probability that the video contains the $c$-th action, and $\mathbf{w}_c$ and $b_c$ are the weight and bias of the FC layer for category $c$. The classification loss function is defined as the standard cross entropy loss:

$\mathcal{L}_{\mathrm{cls}} = -\sum_{c=1}^{C} y_c \log p_c$  (4)

where $y_c$ denotes the value of the label vector $\mathbf{y}$ at index $c$.

Ideally, an attention value is expected to be binary, where 1 indicates the presence of an action and 0 indicates background. Recently, several methods [28, 20] introduce a background category and use background classification to guide the learning of attention. In this work, instead of using background classification, we introduce an attention normalization term to force the attention to approach extreme values:

$\mathcal{L}_{\mathrm{norm}} = -\left( \frac{1}{\ell} \sum_{t \in \mathcal{T}^{+}} \lambda_t - \frac{1}{\ell} \sum_{t \in \mathcal{T}^{-}} \lambda_t \right)$  (5)

where $\mathcal{T}^{+}$ and $\mathcal{T}^{-}$ are the index sets of the top-$\ell$ and bottom-$\ell$ attention values, $\ell = \max(1, \lfloor T/s \rfloor)$, and $s$ is a hyperparameter to control the number of selected snippets. This normalization loss aims to maximize the difference between the average top-$\ell$ attention values and the average bottom-$\ell$ attention values, and forces the foreground attention toward 1 and the background attention toward 0.

Therefore, the overall loss for base model training is the weighted sum of the classification loss and the attention normalization term:

$\mathcal{L}_{\mathrm{base}} = \mathcal{L}_{\mathrm{cls}} + \alpha \mathcal{L}_{\mathrm{norm}}$  (6)

where $\alpha$ is a hyperparameter to control the weight of the normalization loss.
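For concreteness, here is a minimal sketch of the two training losses, following the reconstructions of Eqs. (4)–(6) above. The default values of s and alpha are placeholders rather than the paper's tuned settings.

```python
import torch

def classification_loss(video_pred, label):
    # Eq. (4): cross entropy between the video-level prediction p (C,)
    # and the normalized multi-hot label y (C,)
    return -(label * torch.log(video_pred + 1e-8)).sum()

def attention_norm_loss(attention, s=8):
    # Eq. (5) as reconstructed above: maximize the gap between the average
    # top-l and bottom-l attention values (s controls l; its value is assumed)
    T = attention.shape[0]
    l = max(1, T // s)
    top = torch.topk(attention, l, largest=True).values.mean()
    bottom = torch.topk(attention, l, largest=False).values.mean()
    return -(top - bottom)

def base_loss(video_pred, label, attention, alpha=0.1):
    # Eq. (6): weighted sum of the two terms (alpha assumed)
    return classification_loss(video_pred, label) + alpha * attention_norm_loss(attention)
```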

In addition, the temporal-class activation map (T-CAM) $S = \{s_t\}_{t=1}^{T}$, $s_t \in \mathbb{R}^{C}$, is also generated by sliding the classification FC softmax layer over all snippets:

$s_t^c = \frac{\exp(\mathbf{w}_c^{\top} \hat{x}_t + b_c)}{\sum_{c'=1}^{C} \exp(\mathbf{w}_{c'}^{\top} \hat{x}_t + b_{c'})}$  (7)

where $s_t^c$ is the T-CAM value of the $t$-th snippet for category $c$.
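To make the base model concrete, below is a minimal PyTorch sketch of one stream. The layer sizes (1024-D input, 512-D hidden, a single temporal convolution) and the 20-class head are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class StreamBaseModel(nn.Module):
    """One stream of TSCN (sketch): temporal conv + attention + classifier."""

    def __init__(self, in_dim=1024, hid_dim=512, num_classes=20):
        super().__init__()
        # temporal convolution over the snippet feature sequence (sizes assumed)
        self.temporal_conv = nn.Sequential(
            nn.Conv1d(in_dim, hid_dim, kernel_size=3, padding=1),
            nn.ReLU())
        self.attention_fc = nn.Linear(hid_dim, 1)         # attention layer, Eq. (1)
        self.classifier = nn.Linear(hid_dim, num_classes)

    def forward(self, x):
        # x: (T, in_dim) snippet features of a single video
        h = self.temporal_conv(x.t().unsqueeze(0)).squeeze(0).t()     # (T, hid_dim)
        attention = torch.sigmoid(self.attention_fc(h)).squeeze(-1)   # (T,), Eq. (1)
        # attention-weighted pooling -> foreground feature, Eq. (2)
        fg = (attention.unsqueeze(-1) * h).sum(dim=0) / (attention.sum() + 1e-8)
        video_pred = torch.softmax(self.classifier(fg), dim=-1)       # (C,), Eq. (3)
        # the same classifier slides over all snippets to produce the T-CAM, Eq. (7)
        tcam = torch.softmax(self.classifier(h), dim=-1)              # (T, C)
        return attention, video_pred, tcam
```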

3.4 Pseudo Ground Truth Generation

We iteratively refine the two-stream base models with a frame-level pseudo ground truth. Specifically, we divide the whole training process into several refinement iterations. At refinement iteration 0, only video-level labels are used for training. At refinement iteration $n$ ($n \ge 1$), a frame-level pseudo ground truth generated at refinement iteration $n-1$ provides frame-level supervision for the current refinement iteration. However, without true ground truth annotations, we can neither measure the quality of the pseudo ground truth, nor guarantee that the pseudo ground truth helps the base models achieve higher performance.

Inspired by two-stream late fusion, we introduce a simple yet effective method to generate the pseudo ground truth. Intuitively, locations at which both streams have high activations are likely to contain ground truth action instances; locations at which only one stream has high activations are likely to be either false positive action proposals or true action instances that only one stream can detect; locations at which both streams have low activations are likely to be the background.

Following this intuition, we use the fusion attention sequence at refinement iteration $n-1$ to generate the pseudo ground truth for refinement iteration $n$, where the fusion attention is computed as $\lambda^{\mathrm{fuse}}_t = \xi \lambda^{\mathrm{RGB}}_t + (1 - \xi) \lambda^{\mathrm{flow}}_t$, and $\xi$ is a hyperparameter to control the relative importance of the RGB and flow attentions. We introduce two pseudo ground truth generation methods.

Soft pseudo ground truth directly uses the fusion attention values as pseudo labels: $\hat{\lambda}_t = \lambda^{\mathrm{fuse}}_t$. The soft pseudo labels contain the probability of a snippet being foreground action, but also add uncertainty to the model.

Hard pseudo ground truth thresholds the attention sequence to generate a binary sequence:

$\hat{\lambda}_t = \mathbb{1}\left[\lambda^{\mathrm{fuse}}_t > \theta\right]$  (8)

where $\theta$ is the threshold value and $\mathbb{1}[\cdot]$ is the indicator function. Setting a large value of $\theta$ will eliminate action proposals for which only one stream has high activations, and therefore reduces the false positive rate. In contrast, setting a small value of $\theta$ helps the models generate more action proposals and achieve a higher recall. Hard pseudo labels remove the uncertainty and provide stronger supervision, but introduce an extra hyperparameter.

After generating the frame-level pseudo ground truth, we force the attention sequence generated by each stream to be similar to the pseudo ground truth with a mean square error (MSE) loss (although it is straightforward to use a cross entropy loss for the hard pseudo ground truth, we found in practice that the cross entropy loss and the MSE loss achieve similar performance; to simplify training, we use the MSE loss for both kinds of pseudo ground truth):

$\mathcal{L}_{\mathrm{pseudo}} = \frac{1}{T} \sum_{t=1}^{T} \left( \lambda_t - \hat{\lambda}_t \right)^2$  (9)

At refinement iteration $n$ ($n \ge 1$), the total loss for each stream is

$\mathcal{L} = \mathcal{L}_{\mathrm{base}} + \beta \mathcal{L}_{\mathrm{pseudo}}$  (10)

where $\beta$ is a hyperparameter to control the relative importance of the two losses.
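The pseudo ground truth generation and the refinement loss can be summarized in a few lines, following the reconstructions of Eqs. (8)–(10) above; the default xi and theta values are placeholders, since the paper tunes them by grid search.

```python
def fuse_attention(att_rgb, att_flow, xi=0.5):
    # late fusion of the two streams' attention sequences (xi assumed)
    return xi * att_rgb + (1.0 - xi) * att_flow

def soft_pseudo_gt(fused_att):
    # soft pseudo labels: use the fused attention values directly
    return fused_att.detach()

def hard_pseudo_gt(fused_att, theta=0.5):
    # Eq. (8) as reconstructed: binarize the fused attention at threshold theta
    return (fused_att.detach() > theta).float()

def pseudo_loss(attention, pseudo_gt):
    # Eq. (9): MSE between one stream's attention and the pseudo ground truth
    return ((attention - pseudo_gt) ** 2).mean()

# At refinement iteration n >= 1, the per-stream total loss (Eq. (10)) is
# base_loss(...) + beta * pseudo_loss(...), with beta tuned by grid search.
```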

3.5 Action Localization

During testing, following BaS-Net [20], we first temporally upsample the attention sequence and the T-CAM via linear interpolation. Then, we select the top-$k$ action categories from the fusion video-level prediction, computed as the weighted sum of the two streams' video-level predictions, to perform action localization. For each of these categories, following our intention that the attention performs a binary selection, we generate action proposals by directly thresholding the fusion attention value at 0.5 and concatenating consecutive snippets. The action proposals are scored via a variant of the Outer-Inner-Contrastive score [34]: instead of using the average T-CAM, we use the attention-weighted T-CAM to measure the outer and inner temporal contrast. Formally, given an action proposal $(t_s, t_e, c)$, the fusion attention $\lambda^{\mathrm{fuse}}$ and the fusion T-CAM $s^{\mathrm{fuse}}$, the score is computed as

$q = \frac{1}{t_e - t_s} \sum_{t = t_s}^{t_e} \lambda^{\mathrm{fuse}}_t s^{\mathrm{fuse},c}_t - \frac{1}{2 T_o} \left( \sum_{t = t_s - T_o}^{t_s - 1} \lambda^{\mathrm{fuse}}_t s^{\mathrm{fuse},c}_t + \sum_{t = t_e + 1}^{t_e + T_o} \lambda^{\mathrm{fuse}}_t s^{\mathrm{fuse},c}_t \right)$  (11)

where $s^{\mathrm{fuse},c}_t$ is the fusion T-CAM value of the $t$-th snippet for category $c$, and $T_o$ denotes the length of the inflated outer area on each side of the proposal, set to a fraction of the proposal length as in [34]. We discard action proposals whose confidence scores fall below a preset value.
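A sketch of the proposal generation and scoring step is given below. The merging of consecutive above-threshold snippets follows the text; the scoring mirrors the reconstruction of Eq. (11) above, and the 1/4 inflation ratio and zero score cutoff are assumptions (the former borrowed from AutoLoc [34]).

```python
import torch

def localize(fused_att, fused_tcam, category, att_thresh=0.5, score_thresh=0.0):
    """fused_att: (T,) fusion attention; fused_tcam: (T, C) fusion T-CAM.
    Returns a list of (start, end, category, score) proposals in snippet units."""
    keep = fused_att > att_thresh
    weighted = fused_att * fused_tcam[:, category]   # attention-weighted T-CAM
    T = fused_att.shape[0]
    proposals, t = [], 0
    while t < T:
        if keep[t]:
            start = t
            while t < T and keep[t]:
                t += 1
            end = t                                   # proposal covers snippets [start, end)
            inflate = max(1, (end - start) // 4)      # outer inflation length (assumed 1/4)
            inner = weighted[start:end].mean()
            outer = torch.cat([weighted[max(0, start - inflate):start],
                               weighted[end:min(T, end + inflate)]])
            contrast = inner - (outer.mean() if outer.numel() > 0 else 0.0)
            if contrast.item() >= score_thresh:
                proposals.append((start, end, category, contrast.item()))
        else:
            t += 1
    return proposals
```

In the full method, this procedure is run for each of the top-$k$ predicted categories on the upsampled attention and T-CAM sequences.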

4 Experiments

4.1 Dataset and Evaluation

THUMOS14 dataset [15] contains 200 validation videos and 213 testing videos with temporal annotations from 20 categories for the TAL task. We use the validation videos for training and the testing videos for evaluation.

ActivityNet dataset [2] has two release versions, i.e., ActivityNet v1.3 and ActivityNet v1.2. ActivityNet v1.3 covers 200 action categories, with a training set of 10,024 videos and a validation set of 4,926 videos. ActivityNet v1.2 is a subset of ActivityNet v1.3 and covers 100 action categories, with 4,819 and 2,383 videos in the training and validation sets, respectively. (In our experiments, a small number of videos in each split are excluded because they are no longer accessible on YouTube.) We use the training set and the validation set for training and testing, respectively.

Evaluation Metrics. Following the standard protocol on temporal action localization, we evaluate our method with mean Average Precision (mAP) under different Intersection-over-Union (IoU) thresholds. We use the evaluation code provided by ActivityNet (https://github.com/activitynet/ActivityNet/tree/master/Evaluation) to perform the experiments.
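For reference, the temporal IoU used by these thresholds follows the standard interval definition (this helper is an illustration, not the official evaluation code):

```python
def temporal_iou(p_start, p_end, g_start, g_end):
    # intersection and union of two temporal intervals (proposal vs. ground truth)
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0
```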

4.2 Implementation Details

Two off-the-shelf feature extraction backbones are used in our experiments, i.e., UntrimmedNet [41] and I3D [3], which operate on short fixed-length snippets. The two backbones are pre-trained on ImageNet [9] and Kinetics [3], respectively, and are not fine-tuned, for fair comparison. The RGB and flow snippet-level features are extracted at the global_pool layer as fixed-dimensional vectors.

The networks are implemented in PyTorch [29]. We use the Adam [16] optimizer with a fixed learning rate. The base models are trained for a fixed number of epochs at refinement iteration 0 and at each later refinement iteration, with the epoch counts set separately for ActivityNet and THUMOS14. The maximal number of refinement iterations is also set separately for the THUMOS14 and ActivityNet datasets, and we choose the base models that achieve the lowest loss at the previous refinement iteration to generate the pseudo ground truth. To eliminate fragmentary action proposals, temporal max pooling is applied to the fusion attention sequence before pseudo ground truth generation on the ActivityNet dataset. We use a whole video as a batch. All hyperparameters, including the loss weights $\alpha$ and $\beta$, the snippet-selection parameter $s$, the fusion weight $\xi$ and the pseudo ground truth threshold $\theta$, are determined via grid search, with dataset-specific choices for THUMOS14 and ActivityNet where needed. We choose the top-$k$ action categories and also reject categories whose fusion classification prediction scores fall below a preset threshold when performing action localization.

Method mAP@IoU (%)
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Fully-supervised

Yuan et al. [44] 51.0 45.2 36.5 27.8 17.8 - - - -
CDC [33] - - 40.1 29.4 23.3 13.1 7.9 - -
R-C3D [42] 54.5 51.5 44.8 35.6 28.9 - - - -
SSN [48] 66.0 59.4 51.9 41.0 29.8 - - - -
BSN [22] - - 53.5 45.0 36.9 28.4 20.0 - -
TAL-Net [4] 59.8 57.1 53.2 48.5 42.8 33.8 20.8 - -
GTAN [25] 69.1 63.7 57.8 47.2 38.8 - - - -
BMN [21] - - 56.0 47.4 38.8 29.7 20.5 - -

Weakly-supervised

UntrimmedNet [41] 44.4 37.7 28.2 21.1 13.7 - - - -
STPN (UNT) [27] 45.3 38.8 31.1 23.5 16.2 9.8 5.1 2.0 0.3
AutoLoc (UNT) [34] - - 35.8 29.0 21.2 13.4 5.8 - -
W-TALC (UNT) [30] 49.0 42.8 32.0 26.0 18.8 - 6.2 - -
Liu et al. (UNT) [23] 53.5 46.8 37.5 29.1 19.9 12.3 6.0 - -
RefineLoc (UNT) [1] - - 36.1 - 22.6 - 5.8 - -
CleanNet (UNT) [24] - - 37.0 30.9 23.9 13.9 7.1 - -
BaS-Net (UNT) [20] 56.2 50.3 42.8 34.7 25.1 17.1 9.3 3.7 0.5
Ours (UNT) 58.9 52.9 45.0 36.6 27.6 18.8 10.2 4.0 0.5
STPN (I3D) [27] 52.0 44.7 35.5 25.8 16.9 9.9 4.3 1.2 0.1
W-TALC (I3D) [30] 55.2 49.6 40.1 31.1 22.8 - 7.6 - -
Liu et al. (I3D) [23] 57.4 50.8 41.2 32.1 23.1 15.0 7.0 - -
RefineLoc (I3D) [1] - - 40.8 - 23.1 - 5.3 - -
Nguyen et al. (I3D) [28] 60.4 56.0 46.6 37.5 26.8 17.6 9.0 3.3 0.4
BaS-Net (I3D) [20] 58.2 52.3 44.6 36.0 27.0 18.6 10.4 3.9 0.5
Ours (I3D) 63.4 57.6 47.8 37.7 28.7 19.4 10.2 3.9 0.7
Table 1: Comparison of our method with state-of-the-art TAL methods on the THUMOS14 testing set. UNT and I3D are abbreviations for UntrimmedNet feature and I3D feature, respectively

4.3 Comparison with the State-of-the-art

Experiments on THUMOS14. Table 1 summarizes the performance comparison between the proposed TSCN and state-of-the-art fully-supervised and weakly-supervised TAL methods on the THUMOS14 testing set. With UntrimmedNet features, TSCN outperforms other W-TAL methods by a large margin, and even achieves comparable results to some recent W-TAL methods with I3D features (e.g., Nguyen et al. [28] and BaS-Net [20]) at high IoU thresholds.

With I3D features, our performance improves significantly and outperforms previous W-TAL methods at most IoU thresholds. We note that the proposed TSCN achieves performance comparable to some recent fully-supervised methods (e.g., R-C3D [42]). TSCN even outperforms TAL-Net [4] at IoU thresholds 0.1 and 0.2. However, as the IoU threshold increases, the performance of TSCN drops significantly, because generating more precise action boundaries requires true frame-level ground truth supervision.

Experiments on ActivityNet. The performance comparisons on ActivityNet v1.2 and v1.3 are shown in Table 2 and Table 3, respectively, where our models are trained with I3D features. The proposed TSCN outperforms previous W-TAL methods in terms of the average mAP over IoU thresholds 0.5:0.05:0.95 on both release versions of ActivityNet, verifying the efficacy of our design intuition.

Method mAP@IoU (%) Avg
0.5 0.75 0.95
UntrimmedNet [41] 7.4 3.2 0.7 3.6
AutoLoc [34] 27.3 15.1 3.3 16.0
W-TALC [30] 37.0 - - 18.0
Liu et al. [23] 36.8 22.0 5.6 22.4
Ours 37.6 23.7 5.7 23.6
Table 2: Comparison of our method with state-of-the-art W-TAL methods on the ActivityNet v1.2 validation set. The Avg column indicates the average mAP at IoU thresholds 0.5:0.05:0.95
Method mAP@IoU (%) Avg
0.5 0.75 0.95
STPN [27] 29.3 16.9 2.7 -
Liu et al. [23] 34.0 20.9 5.7 21.2
Nguyen et al. [28] 36.4 19.2 2.9 -
Ours 35.3 21.4 5.3 21.7
Table 3: Comparison of our method with state-of-the-art W-TAL methods on the ActivityNet v1.3 validation set. The Avg column indicates the average mAP at IoU thresholds 0.5:0.05:0.95

4.4 Ablation Study

In this subsection, a set of ablation studies is conducted on the THUMOS14 testing set with UntrimmedNet features to analyze the efficacy of each component of the proposed TSCN.

Ablation study on $\mathcal{L}_{\mathrm{norm}}$. The goal of $\mathcal{L}_{\mathrm{norm}}$ in Equation (5) is to force the attention values to approach extreme values, and therefore generate a clean foreground feature and improve action proposal quality. Some recent methods [28, 20] introduce background classification to W-TAL. In particular, a background classification loss $\mathcal{L}_{\mathrm{bg}}$ [28] is introduced to classify the background, where a background attention is defined as $1 - \lambda_t$, and a background feature is generated via background-attention-weighted pooling over all snippets to perform the background classification. Therefore, $\mathcal{L}_{\mathrm{bg}}$ is in essence an implicit attention normalization loss. However, one drawback of such a background loss is that assigning background labels to all videos makes the value of the background category in the T-CAM increase. We reproduce $\mathcal{L}_{\mathrm{bg}}$ in our model, compare it with our proposed $\mathcal{L}_{\mathrm{norm}}$, and list the results in Table 4. The results reveal that both $\mathcal{L}_{\mathrm{bg}}$ and $\mathcal{L}_{\mathrm{norm}}$ help improve the performance. The proposed $\mathcal{L}_{\mathrm{norm}}$ achieves higher attention variance and better localization performance than $\mathcal{L}_{\mathrm{bg}}$, demonstrating that our attention normalization term can better avoid the ambiguity of attention. Surprisingly, with both $\mathcal{L}_{\mathrm{bg}}$ and $\mathcal{L}_{\mathrm{norm}}$, the localization performance is still lower than that with only $\mathcal{L}_{\mathrm{norm}}$; we think this is because the noise of background classification reduces the accuracy of the action proposal scores.

Table 4: Comparison of our method with different attention normalization functions on the THUMOS14 testing set. $\mathcal{L}_{\mathrm{bg}}$ is the background classification loss introduced in [28], and $\mathcal{L}_{\mathrm{norm}}$ is defined in Equation (5). The Var column denotes the average attention variance over the whole testing set
$\mathcal{L}_{\mathrm{bg}}$ $\mathcal{L}_{\mathrm{norm}}$ mAP@0.3 mAP@0.5 mAP@0.7 Var
- - 29.6 16.1 4.1 0.0440
✓ - 34.3 19.3 6.7 0.0599
- ✓ 40.9 24.0 8.2 0.0937
✓ ✓ 40.6 23.6 7.8 0.0886
Figure 3: Comparison between models trained with different pseudo ground truth on the THUMOS14 testing set. The upper bounds denote models trained with ground truth actionness sequence

Ablation study on Pseudo Ground Truth. Fig. 3 plots the performance comparison between different pseudo ground truth methods at different refinement iterations. Both soft and hard pseudo ground truth help improve the localization performance. The hard pseudo ground truth removes uncertainty and thus achieves a larger performance improvement. However, with the same frame-level supervision, the flow stream outperforms the RGB stream by a large margin. We think this is due to the nature of the two modalities: the RGB modality is less sensitive to actions than the optical flow modality. To demonstrate this, we generate a true frame-level ground truth actionness sequence (action categories are not used) and train our model in the same way as with the pseudo ground truth. The results are plotted in Fig. 3 as upper bounds. They verify our hypothesis and demonstrate that the optical flow modality is more suitable for the action localization task than the RGB modality.

Modality Label mAP@IoU (%) Precision (%) Recall (%) F-measure
0.3 0.4 0.5 0.6 0.7
RGB video 19.8 13.2 8.2 4.5 1.9 10.2 20.9 0.1371
RGB frame 31.4 22.1 14.4 8.9 5.2 20.9 30.8 0.2489
Flow video 40.2 32.0 23.2 15.4 7.2 25.5 43.3 0.3207
Flow frame 40.8 32.7 24.1 16.8 8.7 30.9 42.4 0.3573
Fusion video 40.9 32.4 24.0 15.9 8.2 23.6 44.4 0.3078
Fusion frame 45.0 36.5 27.6 18.8 10.2 31.3 44.6 0.3680
Table 5: Comparison between the model trained with only video-level labels and the model trained with the hard pseudo ground truth on the THUMOS14 testing set. The Label column denotes the supervision used in training, where “video” indicates only video-level labels are leveraged, and “frame” indicates the hard pseudo ground truth is also leveraged during training. Precision, recall and F-measure are calculated at a fixed IoU threshold

Table 5 lists the detailed performance comparison between the model trained with only video-level labels and that trained with the hard pseudo ground truth. The results show that the pseudo ground truth improves the localization performance for both modalities at all IoU thresholds, and thus improves the performance of the fusion result. Also, the pseudo ground truth greatly improves the precision and recall for the RGB stream and the fusion result, and improves the precision for the flow stream with a minor loss of recall (the overall F-measure improves significantly), which demonstrates that the pseudo ground truth can help eliminate false positive action proposals.

Qualitative Analysis. Three representative examples of TAL results are plotted in Fig. 4 to illustrate the efficacy of the proposed pseudo supervision. In the first example of diving and cliff diving, with only video-level labels, the RGB stream provides a worse localization result than the flow stream, and thus leads to a noisy fusion attention sequence. The pseudo ground truth guides the RGB stream to identify false positive action proposals and discover true action instances, and further leads to a cleaner fusion attention sequence, where high activations correspond better to the ground truth. In the second example of cricket shot, with only video-level supervision, the RGB stream can only distinguish certain scenes, and fails to separate proximate action instances. In contrast, the flow stream can precisely detect action instances. Therefore, the pseudo ground truth helps the RGB stream separate consecutive action instances. In the last example of soccer penalty, both streams have high activations at certain false positive temporal locations. Under this circumstance, the false positive action proposals will have higher activations under frame-level pseudo supervision. Eliminating such false positive action proposals, however, requires true ground truth supervision. To summarize, the two modalities have their own strengths and limitations: the RGB stream is sensitive to appearance, so it fails in scenes shot from unusual angles and in separating proximate action instances in the same scene; the flow stream is sensitive to motion and provides more accurate results, but it fails for slow or occluded motion. Qualitative results reveal that the pseudo ground truth helps the two streams reach a consensus at most temporal locations. Therefore, the fusion attention sequence becomes cleaner and helps generate more precise action proposals and more reliable confidence scores.

Figure 4: Qualitative results on the THUMOS14 testing set. The eight rows in each example are the input video, the ground truth action instances, and the RGB, flow and fusion attention sequences from the model trained with only video-level labels and from the model trained with the frame-level pseudo ground truth, respectively. Action proposals are represented by green boxes. The horizontal and vertical axes are time and the intensity of attention, respectively

5 Conclusions

In this paper, we propose a Two-Stream Consensus Network (TSCN) for W-TAL, which benefits from an iterative refinement training method and a new attention normalization loss. The iterative refinement training uses a novel frame-level pseudo ground truth as fine-grained supervision and iteratively improves the two-stream base models. The attention normalization loss reduces the ambiguity of attention values and thus leads to more precise action proposals. Experiments on two benchmarks demonstrate that the proposed TSCN outperforms current state-of-the-art methods and verify our design intuition.

Acknowledgement

This work was supported partly by National Key R&D Program of China Grant 2018AAA0101400, NSFC Grants 61629301, 61773312, and 61976171, China Postdoctoral Science Foundation Grant 2019M653642, Young Elite Scientists Sponsorship Program by CAST Grant 2018QNRC001, and Natural Science Foundation of Shaanxi Grant 2020JQ-069.

References

  • [1] H. Alwassel, A. Pardo, F. C. Heilbron, A. Thabet, and B. Ghanem (2019) RefineLoc: iterative refinement for weakly-supervised action localization. arXiv preprint arXiv:1904.00227. Cited by: §1, §1, §2, Table 1.
  • [2] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970. Cited by: §2, §4.1.
  • [3] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §2, §3.2, §4.2.
  • [4] Y. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar (2018) Rethinking the faster r-cnn architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1130–1139. Cited by: §2, §4.3, Table 1.
  • [5] N. Crasto, P. Weinzaepfel, K. Alahari, and C. Schmid (2019) MARS: motion-augmented rgb stream for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7882–7891. Cited by: §2.
  • [6] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Qiu Chen (2017) Temporal context network for activity localization in videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5793–5802. Cited by: §2.
  • [7] N. Dalal, B. Triggs, and C. Schmid (2006) Human detection using oriented histograms of flow and appearance. In Proceedings of the European Conference on Computer Vision, pp. 428–441. Cited by: §2.
  • [8] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 886–893. Cited by: §2.
  • [9] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and F. F. Li (2009) ImageNet: a large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.2.
  • [10] C. Feichtenhofer, A. Pinz, and A. Zisserman (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941. Cited by: §2.
  • [11] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia (2017) Turn tap: temporal unit regression network for temporal action proposals. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3628–3636. Cited by: §2.
  • [12] J. Gao, Z. Yang, and R. Nevatia (2017) Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180. Cited by: §2.
  • [13] A. Gorban, H. Idrees, Y. Jiang, A. R. Zamir, I. Laptev, M. Shah, and R. Sukthankar (2015) THUMOS challenge: action recognition with a large number of classes. Cited by: §2.
  • [14] F. C. Heilbron, W. Barrios, V. Escorcia, and B. Ghanem (2017) Scc: semantic context cascade for efficient action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3175–3184. Cited by: §2.
  • [15] Y. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar (2014) THUMOS challenge: action recognition with a large number of classes. Cited by: §2, §4.1.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [17] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.
  • [18] K. Kumar Singh and Y. Jae Lee (2017) Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3524–3533. Cited by: §1, §2.
  • [19] I. Laptev (2005) On space-time interest points. International Journal of Computer Vision, pp. 107–123. Cited by: §2.
  • [20] P. Lee, Y. Uh, and H. Byun (2020) Background suppression network for weakly-supervised temporal action localization. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §1, §1, §2, §3.2, §3.3, §3.5, §4.3, §4.4, Table 1.
  • [21] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen (2019) Bmn: boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3889–3898. Cited by: §2, Table 1.
  • [22] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang (2018) Bsn: boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision, pp. 3–19. Cited by: §2, Table 1.
  • [23] D. Liu, T. Jiang, and Y. Wang (2019) Completeness modeling and context separation for weakly supervised temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1298–1307. Cited by: §1, §1, §2, §3.2, Table 1, Table 3.
  • [24] Z. Liu, L. Wang, Q. Zhang, Z. Gao, Z. Niu, N. Zheng, and G. Hua (2019) Weakly supervised temporal action localization through contrast based evaluation networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3899–3908. Cited by: §1, §1, §2, §3.2, Table 1.
  • [25] F. Long, T. Yao, Z. Qiu, X. Tian, J. Luo, and T. Mei (2019) Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 344–353. Cited by: §2, Table 1.
  • [26] S. Narayan, H. Cholakkal, F. S. Khan, and L. Shao (2019) 3c-net: category count and center loss for weakly-supervised action localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8679–8687. Cited by: §1, §1, §3.2.
  • [27] P. Nguyen, T. Liu, G. Prasad, and B. Han (2018) Weakly supervised action localization by sparse temporal pooling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6752–6761. Cited by: §1, §1, §2, §3.2, Table 1, Table 3.
  • [28] P. X. Nguyen, D. Ramanan, and C. C. Fowlkes (2019) Weakly-supervised action localization with background modeling. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5502–5511. Cited by: §1, §1, §2, §3.2, §3.3, §4.3, §4.4, Table 1, Table 3, Table 4.
  • [29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §4.2.
  • [30] S. Paul, S. Roy, and A. K. Roy-Chowdhury (2018) W-talc: weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision, pp. 563–579. Cited by: §1, §1, §2, §3.2, Table 1, Table 3.
  • [31] A. Piergiovanni and M. S. Ryoo (2019) Representation flow for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [32] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Proceedings of Neural Information Processing Systems, pp. 91–99. Cited by: §2.
  • [33] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S. Chang (2017) Cdc: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5734–5743. Cited by: §2, Table 1.
  • [34] Z. Shou, H. Gao, L. Zhang, K. Miyazawa, and S. Chang (2018) Autoloc: weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision, pp. 154–171. Cited by: §1, §1, §2, §3.2, §3.5, Table 1, Table 3.
  • [35] Z. Shou, X. Lin, Y. Kalantidis, L. Sevilla-Lara, M. Rohrbach, S. Chang, and Z. Yan (2019) Dmc-net: generating discriminative motion cues for fast compressed video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1268–1277. Cited by: §2.
  • [36] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In Proceedings of the European Conference on Computer Vision, pp. 510–526. Cited by: §2.
  • [37] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Proceedings of Neural Information Processing Systems, pp. 568–576. Cited by: §1, §2.
  • [38] M. Tan, Q. Shi, A. van den Hengel, C. Shen, J. Gao, F. Hu, and Z. Zhang (2015) Learning graph structure for multi-label image classification via clique generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4100–4109. Cited by: §2.
  • [39] H. Wang, A. Kläser, C. Schmid, and C. Liu (2011) Action recognition by dense trajectories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3169–3176. Cited by: §2.
  • [40] L. Wang, P. Koniusz, and D. Q. Huynh (2019) Hallucinating idt descriptors and i3d optical flow features for action recognition with cnns. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8698–8708. Cited by: §2.
  • [41] L. Wang, Y. Xiong, D. Lin, and L. Van Gool (2017) Untrimmednets for weakly supervised action recognition and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4325–4334. Cited by: §1, §1, §2, §4.2, Table 1, Table 3.
  • [42] H. Xu, A. Das, and K. Saenko (2017) R-c3d: region convolutional 3d network for temporal activity detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5783–5792. Cited by: §2, §4.3, Table 1.
  • [43] T. Yu, Z. Ren, Y. Li, E. Yan, N. Xu, and J. Yuan (2019) Temporal structure mining for weakly supervised action detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5522–5531. Cited by: §1, §3.2.
  • [44] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng (2017) Temporal action localization by structured maximal sums. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3684–3692. Cited by: Table 1.
  • [45] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan (2019) Graph convolutional networks for temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7094–7103. Cited by: §2.
  • [46] Y. Zhai, L. Wang, Z. Liu, Q. Zhang, G. Hua, and N. Zheng (2019) Action coherence network for weakly supervised temporal action localization. In Proceedings of the IEEE International Conference on Image Processing, pp. 3696–3700. Cited by: §1.
  • [47] Y. Zhao, Y. Xiong, and D. Lin (2018) Recognize actions by disentangling components of dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6566–6575. Cited by: §2.
  • [48] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin (2017) Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2914–2923. Cited by: §2, Table 1.
  • [49] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929. Cited by: §1.