Learning Action Completeness from Points for Weakly-supervised Temporal Action Localization

08/11/2021 · Pilhyeon Lee et al. · Yonsei University

We tackle the problem of localizing temporal intervals of actions with only a single frame label for each action instance for training. Owing to label sparsity, existing work fails to learn action completeness, resulting in fragmentary action predictions. In this paper, we propose a novel framework, where dense pseudo-labels are generated to provide completeness guidance for the model. Concretely, we first select pseudo background points to supplement point-level action labels. Then, taking the points as seeds, we search for the optimal sequence that is likely to contain complete action instances while agreeing with the seeds. To learn completeness from the obtained sequence, we introduce two novel losses that contrast action instances with background ones in terms of action score and feature similarity, respectively. Experimental results demonstrate that our completeness guidance indeed helps the model to locate complete action instances, leading to large performance gains, especially under high IoU thresholds. Moreover, we demonstrate the superiority of our method over existing state-of-the-art methods on four benchmarks: THUMOS'14, GTEA, BEOID, and ActivityNet. Notably, our method even performs comparably to recent fully-supervised methods, at a sixfold cheaper annotation cost. Our code is available at https://github.com/Pilhyeon.


Code Repositories

Learning-Action-Completeness-from-Points

Official Pytorch Implementation of 'Learning Action Completeness from Points for Weakly-supervised Temporal Action Localization' (ICCV-21 Oral)



1 Introduction

Figure 1: Simplified illustration of our idea. We use points as seeds to find the optimal sequence, which in turn provides completeness guidance to the model.

The goal of temporal action localization lies in locating the starting and ending timestamps of action instances and classifying them. Thanks to its various applications [Ma2005AGF, Vishwakarma2012ASO, Xiong2019LessIM], it has drawn much attention from researchers, leading to rapid and remarkable progress in the fully-supervised setting (i.e., frame-level labels) [Liu2019MGG, shou2017cdc, shou2016temporal, xu2017r]. Meanwhile, attempts have been made to reduce the prohibitively expensive cost of annotating individual frames by devising weakly-supervised models with video-level labels [Fernando2020WSGN, Ma_2021_ASL, wang2017untrimmednets, Yuan2019MARGINALIZEDAA]. However, they fall largely behind their fully-supervised counterparts, mainly on account of their weak ability to distinguish action from background frames [lee2020background, lee2021um, Nguyen2019WeaklySupervisedAL, Xu2019SegregatedTA].

To narrow the performance gap between them, another level of weak supervision has been proposed recently, namely the point-supervised setting. In this setting, only a single timestamp (point) with its action category is annotated for each action instance during training. In terms of the labeling cost, point-level labels require a negligible extra cost compared to video-level ones, while being cheaper than frame-level ones (50s vs. 300s per 1-min video) [ma2020sfnet].

Despite the affordable cost, point-level supervision offers coarse locations as well as the total number of action instances, equipping models with a strong ability to spot actions. Consequently, point-supervised methods show comparable or even superior performance to fully-supervised counterparts under low intersection over union (IoU) thresholds. However, it has been revealed that they suffer from incomplete predictions, resulting in highly inferior performance at high IoU thresholds. We conjecture that this problem is attributed to the sparse nature of point-level labels, which induces the models to learn only a small part of actions rather than the full extent of action instances. In other words, they fail to learn action completeness from the point annotations. Although SF-Net [ma2020sfnet] mines pseudo action and background points to alleviate the label sparsity, the mined points are discontinuous and thus do not provide completeness cues.

In this paper, we aim to allow the model to learn action completeness under the point-supervised setting. To this end, we introduce a new framework, where dense pseudo-labels (i.e., sequences) are generated based on the point annotations to provide completeness guidance to the model. The overall workflow is illustrated in Fig. 1.

Technically, we first select pseudo background points to augment the point-level action labels. As such point annotations are discontiguous, it is infeasible to learn completeness from them directly. Instead, we propose to search for the optimal sequence covering complete action instances among the candidates consistent with the point labels. However, it is non-trivial to measure how complete the instances in each candidate sequence are without full supervision. To realize it, we borrow the outer-inner-contrast concept [shou2018autoloc] as a proxy for instance completeness. Intuitively, a complete action instance generally shows large score contrast, i.e., much higher action scores for inner frames than for surrounding frames. In contrast, a fragmentary instance probably has high action scores in its outer region (still within the action), leading to small score contrast. This generalizes to background instances as well. Based on this property, we derive the score of an input sequence by aggregating the score contrast of the action and background instances constituting the sequence. By maximizing this score, we can obtain the optimal sequence that is likely to be well-aligned with the ground truth we do not have. In experiments, we present the accuracy of optimal sequences and the correlation between score contrast and completeness.
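The outer-inner-contrast intuition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the outer-margin rule (a fraction `delta` of the instance length, at least one segment) is an assumption for demonstration.

```python
import numpy as np

def outer_inner_contrast(scores, start, end, delta=0.25):
    """Outer-inner contrast of one instance over [start, end):
    mean inner score minus mean score of the flanking outer regions,
    each delta * instance-length long (at least one segment)."""
    length = end - start
    margin = max(1, int(round(delta * length)))
    inner = scores[start:end].mean()
    outer = np.concatenate([scores[max(0, start - margin):start],
                            scores[end:min(len(scores), end + margin)]])
    return inner - (outer.mean() if outer.size else 0.0)
```

A complete instance (high scores inside, low outside) yields a large contrast, while a fragmentary instance still surrounded by high scores yields a contrast near zero.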

From the obtained sequence, the model is supposed to learn action completeness. To this end, we design a score contrastive loss to maximize the agreement between the model outputs and the optimal sequence by enlarging the completeness score of the sequence. With this loss, the model is trained to discriminate each action (background) instance from its surroundings in terms of action scores. Moreover, we introduce a feature contrastive loss to encourage feature discrepancy between action and background instances. Experiments validate that the proposed losses complementarily help the model to detect complete action instances, leading to large performance gains under high IoU thresholds.

To summarize, our contributions are three-fold.

  • We introduce a new framework, where the dense optimal sequence is generated to provide completeness guidance to the model in the point-supervised setting.

  • We propose two novel losses that facilitate action completeness learning by contrasting action instances with background ones with respect to action score and feature similarity, respectively.

  • Our model achieves a new state-of-the-art with a large gap on four benchmarks. Furthermore, it even performs favorably against fully-supervised approaches.

Figure 2: Overview of the proposed method. Besides the conventional objectives, i.e., video-level and point-level classification losses, we propose to learn action completeness (the lower part). Based on the final action scores, the optimal sequence is selected among the candidates consistent with the point-level labels. It in turn provides completeness guidance through two proposed losses that contrast action instances with background ones with respect to (a) action score and (b) feature similarity.

2 Related Work

Fully-supervised temporal action localization.   In order to tackle temporal action localization, fully-supervised methods rely on precise temporal annotations, i.e., frame-level labels. They mainly adopt the two-stage paradigm (proposal generation and classification) and can be roughly categorized into two groups by the way they generate proposals. The first group prepares a large number of proposals using the sliding window technique [chao2018rethinking, shou2017cdc, shou2016temporal, xiong2017pursuit, yang2018exploring, yuan2016temporal, zhao2017temporal]. The second group first predicts the probability of each frame being a start (end) point of an action instance, and then uses the combinations of probable start and end points as proposals [lin2020fast, Lin2019BMNBN, lin2018bsn, zhao2020bottom-up]. Meanwhile, there are graph modeling methods taking snippets [bai2020bcgnn, xu2020g-tad] or proposals [zeng2019p-gcn] as nodes. Different from fully-supervised methods that utilize expensive frame-level labels for action completeness learning, our method enables it with only point-level labels by introducing a novel framework.

Weakly-supervised temporal action localization.   To alleviate the cost issue of frame-level labels, many attempts have been made recently to solve the same task in the weakly-supervised setting, mainly using video-level labels. UntrimmedNets [wang2017untrimmednets] tackles it by selecting segments that contribute to video-level classification. STPN [nguyen2018weakly] puts a constraint that key frames should be sparse. In addition, there are background modeling approaches under the video-supervised setting [Islam2021HAM-Net, lee2020background, lee2021um, Nguyen2019WeaklySupervisedAL]. To learn reliable attention weights, DGAM [Shi2020DGAM] designs a generative model, while EM-MIL [luo2020EMMIL] adopts the expectation-maximization strategy. Meanwhile, metric learning is utilized for action representation learning [Islam2020metric, Narayan20193CNetCC, paul2018w] or action-background separation [min2020A2CL]. There are also methods that explore sub-actions [Jain2020ActionBytes, Luo_2021_AUMN] or exploit the complementarity of RGB and flow modalities [Yang_2021_UGCT, zhai2020TSCN]. Besides, several methods leverage external information, e.g., action count [Narayan20193CNetCC, Xu2019SegregatedTA], pose [zhang2020MultiinstanceMA], or audio [Lee2021audio-visual]. Moreover, some approaches aim to detect complete action instances by aggregating multiple predictions [liu2019completeness], erasing the most discriminative part [singh2017hide, zhong2018step], or directly regressing the action intervals [Liu2019WeaklyST, shou2018autoloc].

Most recently, point-level supervision has begun to be explored, providing rich information at an affordable cost. Moltisanti et al. [Moltisanti2019CVPR] first utilize point-level labels for action localization. SF-Net [ma2020sfnet] adopts a pseudo-label mining strategy to acquire more labeled frames. Meanwhile, Ju et al. [ju2020point] perform boundary regression based on key-frame prediction. However, these methods do not explicitly consider action completeness, and therefore produce predictions that cover only part of the action instances. In contrast, we propose to learn action completeness from dense pseudo-labels by contrasting action instances with surrounding background ones. In Sec. 4, the efficacy of our method is clearly verified by notable performance boosts at high IoU thresholds.

3 Method

In this section, we first describe the problem setting and detail the baseline setup. Afterward, the optimal sequence search is elaborated, followed by our action completeness learning strategy. Lastly, we explain the joint learning and the inference of our model. The overall architecture of our method is illustrated in Fig. 2.

Problem setting. Following [ju2020point, ma2020sfnet], we set up the problem of point-supervised temporal action localization. Given an input video of $T$ segments, a single point and the category of each action instance are provided as $\{(t_i, y_i)\}_{i=1}^{M}$, where the $i$-th action instance is labeled at the $t_i$-th segment (frame) with its action label $y_i$, and $M$ is the total number of action instances in the input video. The points are sorted in temporal order (i.e., $t_1 < t_2 < \dots < t_M$). The label $y_i \in \{0, 1\}^{C}$ is a binary vector with $y_i(c) = 1$ if the $i$-th action instance contains the $c$-th action class and $y_i(c) = 0$ otherwise, for $C$ action classes. It is worth noting that the video-level label can be readily acquired by aggregating the point-level ones, i.e., $y_{vid}(c) = \mathbb{1}\big[\sum_{i=1}^{M} y_i(c) > 0\big]$, where $\mathbb{1}[\cdot]$ is the indicator function.

3.1 Baseline Setup

Our baseline is shown in the upper part of Fig. 2. We first divide the input video into 16-frame segments, which are then fed to the pre-trained feature extractor. Following [lee2020background, paul2018w], we exploit both RGB and flow streams with early fusion. The two-stream features are fused by concatenation, resulting in $X \in \mathbb{R}^{T \times D}$, where $D$ and $T$ denote the feature dimension and the number of segments, respectively.

The extracted features then go through a single 1D convolutional layer followed by ReLU activation, which produces the embedded features $F \in \mathbb{R}^{T \times D}$. In practice, we set the dimension of the embedded features to the same as that of the extracted features. Afterward, the embedded features are fed into a 1D convolutional layer with the sigmoid function to predict the segment-level class scores $P \in \mathbb{R}^{T \times C}$, where $C$ indicates the number of action classes. Meanwhile, we derive the class-agnostic background scores $b \in \mathbb{R}^{T}$ to model background frames which do not belong to any action class. Thereafter, we fuse the action scores with the complement of the background probability to get the final scores $\hat{P} \in \mathbb{R}^{T \times C}$, i.e., $\hat{p}_t(c) = p_t(c) \cdot (1 - b_t)$. This fusion strategy is similar to that of [lee2021um], although the out-of-distribution modeling is not incorporated in our model.

The segment-level action scores are then aggregated to build a single video-level class score. We use temporal top-$k$ pooling for aggregation as in [lee2020background, paul2018w]. Formally, the video-level probability is calculated as follows.

$$\hat{p}_{vid}(c) = \frac{1}{k} \max_{\mathcal{A} \subset \{1, \dots, T\},\; |\mathcal{A}| = k} \sum_{t \in \mathcal{A}} \hat{p}_t(c), \tag{1}$$

where $\mathcal{A}$ denotes a subset of $\{1, \dots, T\}$ containing $k$ segments, i.e., $|\mathcal{A}| = k$.
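Since the maximum of a sum over all size-$k$ subsets is simply the sum of the $k$ largest scores, top-$k$ pooling reduces to sorting. A minimal sketch follows; the ratio controlling $k$ is an illustrative choice, not necessarily the paper's setting.

```python
import numpy as np

def video_level_score(final_scores, ratio=8):
    """Temporal top-k pooling: average the k highest segment scores per
    class. final_scores: (T, C) array. k = max(1, T // ratio); the value
    ratio=8 is an assumption for illustration."""
    T = final_scores.shape[0]
    k = max(1, T // ratio)
    topk = np.sort(final_scores, axis=0)[-k:]   # (k, C) largest per class
    return topk.mean(axis=0)                    # (C,) video-level score
```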

Our baseline model includes two loss functions using video- and point-level labels, respectively. As aforementioned, the video-level class label $y_{vid}$ can be derived by accumulating the point-level labels. The video-level classification loss is then calculated with binary cross-entropy.

$$\mathcal{L}_{video} = -\frac{1}{C} \sum_{c=1}^{C} \Big[ y_{vid}(c) \log \hat{p}_{vid}(c) + \big(1 - y_{vid}(c)\big) \log \big(1 - \hat{p}_{vid}(c)\big) \Big] \tag{2}$$

The point-level classification loss is also computed by binary cross-entropy but involves a background term for effectively training $b$. In addition, we adopt the focal loss [lin2017focal] to facilitate the training process. Formally, the classification loss for action points is defined as follows.

$$\mathcal{L}_{act} = -\frac{1}{M} \sum_{i=1}^{M} \sum_{c=1}^{C} y_i(c) \big(1 - \hat{p}_{t_i}(c)\big)^{\beta} \log \hat{p}_{t_i}(c), \tag{3}$$

where $M$ indicates the number of action instances in the video and $\beta$ is the focusing parameter, which is set to 2 following the original paper [lin2017focal].
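The effect of the focal modulating factor can be seen in a tiny sketch (a generic binary focal term in the spirit of [lin2017focal], not the paper's exact loss): confidently correct predictions contribute much less than uncertain ones.

```python
import numpy as np

def focal_bce(p, y, beta=2.0):
    """Binary focal cross-entropy for one prediction p and label y in {0, 1}.
    The (1 - p)^beta and p^beta factors down-weight terms the model already
    predicts confidently."""
    p = np.clip(p, 1e-7, 1 - 1e-7)   # numerical safety for log
    return -(y * (1 - p) ** beta * np.log(p)
             + (1 - y) * p ** beta * np.log(1 - p))
```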

Training only with action points would lead the network to always produce low background scores rather than learn to separate action and background. Therefore, we gather pseudo background points to supplement the action ones. Our principle for selection is that at least one background frame must be placed between two adjacent action instances to separate them. By the problem definition, two different action points are sampled from different instances, so we use the action points as surrogates for the corresponding instances. Concretely, between two adjacent action points, we find the segments whose background scores are larger than the threshold $\tau$. If no segment satisfies the condition in a section, we select the one with the largest background score. Meanwhile, when multiple background points are selected in a section, we mark all points between them as background, since it is trivial that no action exists there. In practice, this strategy proves more effective than global mining [ma2020sfnet] by collecting more hard points. Given the pseudo background point set, the classification loss for background points is computed by:
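The mining procedure above can be sketched as follows. This is a simplified illustration; the threshold value `tau=0.95` is a hypothetical placeholder, not the paper's grid-searched value.

```python
import numpy as np

def mine_background_points(bkg_scores, action_points, tau=0.95):
    """Between each pair of adjacent action points, collect segments whose
    background score exceeds tau (falling back to the section's argmax if
    none qualifies), then mark everything between the outermost hits as
    background."""
    pseudo = set()
    for a, b in zip(action_points[:-1], action_points[1:]):
        section = np.arange(a + 1, b)            # segments strictly between
        if section.size == 0:
            continue
        hits = section[bkg_scores[section] > tau]
        if hits.size == 0:                       # guarantee >= 1 point per section
            hits = np.array([section[np.argmax(bkg_scores[section])]])
        pseudo.update(range(hits.min(), hits.max() + 1))
    return sorted(pseudo)
```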

$$\mathcal{L}_{bkg} = -\frac{1}{M_{bkg}} \sum_{j=1}^{M_{bkg}} \Big[ \big(1 - b_{t_j}\big)^{\beta} \log b_{t_j} + \frac{1}{C} \sum_{c=1}^{C} \hat{p}_{t_j}(c)^{\beta} \log \big(1 - \hat{p}_{t_j}(c)\big) \Big], \tag{4}$$

where $M_{bkg}$ denotes the number of selected background points and $\beta$ is the focusing factor, the same as in (3). For pseudo background points, we penalize the final scores for all action classes, while encouraging the background scores.

The total point-level loss function is defined as the sum of the losses for action and pseudo background points.

$$\mathcal{L}_{point} = \mathcal{L}_{act} + \mathcal{L}_{bkg} \tag{5}$$
Figure 3: Optimal sequence search for class $c$. Given the final scores and the point-level labels, we select pseudo background points. Then, among all possible candidates, we search for the optimal sequence that maximizes the completeness score in (6).

3.2 Optimal Sequence Search

As discussed in Sec. 1, the point-level classification loss is insufficient to learn action completeness, as point labels cover only a small portion of action instances. Therefore, we propose to generate dense pseudo-labels that can offer some hints about action completeness for the model. In detail, we consider all possible sequence candidates consistent with the action and pseudo background points. Among them, we find the optimal sequence that can provide good completeness guidance to the model. However, it is non-trivial without full supervision to measure how well a candidate sequence covers complete action instances. To enable it, we re-purpose the outer-inner-contrast concept [shou2018autoloc] as a proxy for judging the completeness score of a sequence. Intuitively, the contrast between inner and outer scores is likely to be large for a complete action instance but small for a fragmentary one. Note that our purpose is different from the original paper [shou2018autoloc]. It was originally designed for parametric boundary regression. In contrast, we utilize it as a scoring function to search for the optimal sequence, from which the model could learn action completeness.

Before detailing the scoring function, we present the formulation of candidate sequences. Due to the multi-label nature of temporal action localization, we consider class-specific sequences for each action class. Note that all segments belonging to other action classes are considered background for sequences of class $c$. A sequence is then defined as multiple action and background (including other actions) instances that alternate consecutively. Formally, a sequence of class $c$ can be expressed as $\pi^{c} = \{(s_n, e_n, a_n)\}_{n=1}^{N}$, where $s_n$ and $e_n$ denote the start and end points of the $n$-th instance, respectively, while $N$ is the total number of instances for class $c$. In addition, $a_n \in \{0, 1\}$ indicates the type of the instance, i.e., $a_n = 1$ if the $n$-th instance is of the $c$-th action class, and $a_n = 0$ otherwise (background).

Given an input sequence, we compute its completeness score by averaging the contrast scores of the individual action and background instances contained in the sequence. Note that the contrast scores of background instances are included in the calculation, which proves to be effective for finding more accurate optimal sequences, as will be shown in Sec. 4.3. Formally, the completeness score of a sequence for the $c$-th action class is computed by:

$$R^{c}(\pi^{c}) = \frac{1}{N} \sum_{n=1}^{N} \left( \frac{1}{l_n} \sum_{t=s_n}^{e_n} \bar{p}_t(c) - \frac{1}{2 \delta l_n} \Big( \sum_{t=s_n - \delta l_n}^{s_n - 1} \bar{p}_t(c) + \sum_{t=e_n + 1}^{e_n + \delta l_n} \bar{p}_t(c) \Big) \right), \tag{6}$$

where $\bar{p}_t(c) = \hat{p}_t(c)$ for action instances ($a_n = 1$) and $\bar{p}_t(c) = 1 - \hat{p}_t(c)$ for background instances ($a_n = 0$), $l_n = e_n - s_n + 1$ is the temporal length of the $n$-th instance of $\pi^{c}$, $\delta$ is a hyper-parameter adjusting the outer range (set to 0.25), and $N$ is the total number of action and background instances for class $c$. Then, the optimal sequence for class $c$ can be obtained by finding the sequence that maximizes the score, i.e., $\pi^{c*} = \arg\max_{\pi^{c}} R^{c}(\pi^{c})$ using (6). The optimal sequence search process is illustrated in Fig. 3. By evaluating the completeness score, our method can reject underestimation (Fig. 3a) and overestimation (Fig. 3b) cases. Consequently, we obtain the optimal sequence that is most likely to contain complete action instances.
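The scoring of a whole candidate sequence can be sketched as below, assuming half-open `(start, end, is_action)` instances that tile the video; the outer-margin rounding is an illustrative choice. A sequence that matches a clean action gets a higher score than one that underestimates it.

```python
import numpy as np

def completeness_score(class_scores, sequence, delta=0.25):
    """Completeness score of a candidate sequence: average outer-inner
    contrast over all instances. Background instances are scored on the
    complemented scores (1 - p), so they contribute to the average too."""
    total = 0.0
    for start, end, is_action in sequence:
        s = class_scores if is_action else 1.0 - class_scores
        m = max(1, int(round(delta * (end - start))))   # outer margin
        inner = s[start:end].mean()
        outer = np.concatenate([s[max(0, start - m):start],
                                s[end:min(len(s), end + m)]])
        total += inner - (outer.mean() if outer.size else 0.0)
    return total / len(sequence)
```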

However, the search space grows exponentially as $T$ increases, leading to an exorbitant cost for optimal sequence search. To relieve this issue, we implement the search process with a greedy algorithm under a limited budget, which greatly saves the computational cost. The detailed algorithm and cost analysis are presented in Sec. B of the appendix. Note that the optimal sequence search is performed only for the action classes contained in the video.

3.3 Action Completeness Learning

Given the class-specific optimal sequences $\{\pi^{c*}\}$, our goal is to let the model learn action completeness. To this end, we design two losses that enable completeness learning by contrasting action instances with background ones. This helps the model produce complete action predictions, as validated in Sec. 4.

Firstly, we propose a score contrastive loss that encourages the model to separate action (background) instances from their surroundings in terms of final scores. It can also be interpreted as fitting the model outputs to the optimal sequences (Fig. 2a). Formally, the loss is computed by:

$$\mathcal{L}_{score} = \frac{1}{|\mathcal{C}^{+}|} \sum_{c \in \mathcal{C}^{+}} \big(1 - R^{c}(\pi^{c*})\big)^{2}, \tag{7}$$

where $\mathcal{C}^{+}$ is the set of action classes contained in the video, and the squared term focuses the loss on the instances that are largely inconsistent with the optimal sequence (i.e., small $R^{c}(\pi^{c*})$).

Secondly, inspired by the recent success of contrastive learning [chen2020simCLR, he2020moco, khosla2020SupCon], we design a feature contrastive loss. Our intuition is that features from different instances of the same action class should be closer to each other than to any background instance in the same video (Fig. 2b). We note that our loss differs from [chen2020simCLR, he2020moco, khosla2020SupCon] in that they pull together different views of an input image, whereas ours attracts different action instances in a given video. In addition, ours does not need negative sampling from different images, as background instances are obtained from the same video.

To extract the representative feature of each action (or background) instance, we modify the segment-of-interest (SOI) pooling [chao2018rethinking] by replacing max-pooling with random sampling. In detail, we evenly divide each input instance into three intervals, from each of which a single segment is randomly sampled. Then, the embedded features of the sampled segments are averaged, producing the representative feature for each instance of the sequence $\pi^{c*}$.
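A minimal NumPy sketch of this stochastic SOI pooling (interval splitting via `linspace` is an implementation assumption):

```python
import numpy as np

def soi_pool(features, start, end, rng=None):
    """Stochastic segment-of-interest pooling: split [start, end) into three
    equal intervals, randomly sample one segment index from each, and average
    the sampled embedded features into one representative vector."""
    rng = rng or np.random.default_rng()
    bounds = np.linspace(start, end, 4)           # three interval boundaries
    picks = []
    for i in range(3):
        lo, hi = int(bounds[i]), int(bounds[i + 1])
        picks.append(int(rng.integers(lo, max(hi, lo + 1))))  # safe for short spans
    return features[picks].mean(axis=0)
```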

Taking the $\ell_2$-normalized instance features as inputs, we derive the feature contrastive loss. The loss is computed only for the classes whose action counts are larger than 1, i.e., at least two action instances exist in the video. Note that background instances do not attract each other. Given the optimal sequences $\{\pi^{c*}\}$, the proposed feature contrastive loss is formulated as:

$$\mathcal{L}_{feat} = \frac{1}{|\mathcal{C}^{+}|} \sum_{c \in \mathcal{C}^{+}} \ell^{c}, \qquad \ell^{c} = \frac{-1}{|\mathcal{P}^{c}|} \sum_{(m, n) \in \mathcal{P}^{c}} \log \frac{\exp(f_m^{c} \cdot f_n^{c} / \tau_{f})}{\sum_{k \neq m} \exp(f_m^{c} \cdot f_k^{c} / \tau_{f})}, \tag{8}$$

where $\ell^{c}$ is the partial loss for class $c$, $\mathcal{P}^{c}$ denotes the set of pairs of distinct action instances of class $c$, the denominator runs over all other instances (action and background) in the video, and $\tau_{f}$ denotes the temperature parameter.
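A small sketch of this instance-level contrastive term, in the supervised-contrastive style; the naive double loop and the temperature value are illustrative, not the paper's implementation.

```python
import numpy as np

def feature_contrastive_loss(feats, labels, temperature=0.1):
    """Each action instance is an anchor; other instances of the same class
    are positives, every other instance in the video (including background,
    labeled -1) is a negative. feats: (N, D) rows assumed L2-normalized."""
    sims = feats @ feats.T / temperature
    loss, terms = 0.0, 0
    for i in range(len(feats)):
        if labels[i] < 0:                 # background instances are not anchors
            continue
        pos = [j for j in range(len(feats)) if j != i and labels[j] == labels[i]]
        if not pos:                       # need at least two instances of a class
            continue
        others = [j for j in range(len(feats)) if j != i]
        denom = np.exp(sims[i, others]).sum()
        for j in pos:
            loss -= np.log(np.exp(sims[i, j]) / denom)
            terms += 1
    return loss / max(terms, 1)
```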

3.4 Joint Training and Inference

The overall training objective of our model is as follows.

$$\mathcal{L}_{total} = \mathcal{L}_{video} + \mathcal{L}_{point} + \lambda_{1} \mathcal{L}_{score} + \lambda_{2} \mathcal{L}_{feat}, \tag{9}$$

where $\lambda_{1}$ and $\lambda_{2}$ are weighting parameters for balancing the losses, which are determined empirically.

During test time, we first threshold the video-level score with $\theta_{vid}$ to determine which action categories are to be localized. Then, only for the remaining classes, we threshold the segment-level final scores with $\theta_{seg}$ to select candidate segments. Afterward, consecutive candidates are merged into a single proposal, which becomes a localization result. We set the confidence of each proposal to its outer-inner-contrast score, as in [lee2020background, liu2019completeness]. To augment the proposal pool, we use multiple values for $\theta_{seg}$ and perform non-maximum suppression (NMS) to remove overlapping proposals. Note that the optimal sequence search is not performed at test time, so it does not affect the inference time.
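The per-class inference step (threshold, merge runs, score by outer-inner contrast) can be sketched as below; the outer-margin rule of a quarter of the proposal length is an assumption for illustration.

```python
import numpy as np

def localize(class_scores, theta_seg=0.2):
    """Threshold one class's segment scores, merge consecutive surviving
    segments into proposals, and set each proposal's confidence to its
    outer-inner contrast. Returns (start, end, confidence) triples with
    half-open [start, end) spans."""
    keep = class_scores >= theta_seg
    T = len(class_scores)
    proposals, t = [], 0
    while t < T:
        if not keep[t]:
            t += 1
            continue
        s = t
        while t < T and keep[t]:          # extend the run of kept segments
            t += 1
        m = max(1, (t - s) // 4)          # outer margin (illustrative)
        inner = class_scores[s:t].mean()
        outer = np.concatenate([class_scores[max(0, s - m):s],
                                class_scores[t:min(T, t + m)]])
        conf = inner - (outer.mean() if outer.size else 0.0)
        proposals.append((s, t, conf))
    return proposals
```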

| Supervision | Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | AVG (0.1:0.5) | AVG (0.3:0.7) |
|---|---|---|---|---|---|---|---|---|---|---|
| Frame-level (Full) | BMN [Lin2019BMNBN] | - | - | 56.0 | 47.4 | 38.8 | 29.7 | 20.5 | - | 38.5 |
| | P-GCN [zeng2019p-gcn] | 69.5 | 67.8 | 63.6 | 57.8 | 49.1 | - | - | 61.6 | - |
| | G-TAD [xu2020g-tad] | - | - | 54.5 | 47.6 | 40.2 | 30.8 | 23.4 | - | 39.3 |
| | BC-GNN [bai2020bcgnn] | - | - | 57.1 | 49.1 | 40.4 | 31.2 | 23.1 | - | 40.2 |
| | Zhao et al. [zhao2020bottom-up] | - | - | 53.9 | 50.7 | 45.4 | 38.0 | 28.5 | - | 43.3 |
| Video-level (Weak) | Lee et al. [lee2021um] | 67.5 | 61.2 | 52.3 | 43.4 | 33.7 | 22.9 | 12.1 | 51.6 | 32.9 |
| | CoLA [Zhang_2021_cola] | 66.2 | 59.5 | 51.5 | 41.9 | 32.2 | 22.0 | 13.1 | 50.3 | 32.1 |
| | AUMN [Luo_2021_AUMN] | 66.2 | 61.9 | 54.9 | 44.4 | 33.3 | 20.5 | 9.0 | 52.1 | 32.4 |
| | TS-PCA [Liu_2021_TS_PCA] | 67.6 | 61.1 | 53.4 | 43.4 | 34.3 | 24.7 | 13.7 | 52.0 | 33.9 |
| | UGCT [Yang_2021_UGCT] | 69.2 | 62.9 | 55.5 | 46.5 | 35.9 | 23.8 | 11.4 | 54.0 | 34.6 |
| Point-level (Weak) † | SF-Net [ma2020sfnet] | 71.0 | 63.4 | 53.2 | 40.7 | 29.3 | 18.4 | 9.6 | 51.5 | 30.2 |
| | Ju et al. [ju2020point] | 72.8 | 64.9 | 58.1 | 46.4 | 34.5 | 21.8 | 11.9 | 55.3 | 34.5 |
| | Ours | 75.1 | 70.5 | 63.3 | 55.2 | 43.9 | 33.3 | 20.8 | 61.6 | 43.3 |
| Point-level (Weak) ‡ | Moltisanti et al. [Moltisanti2019CVPR] | 24.3 | 19.9 | 15.9 | 12.5 | 9.0 | - | - | 16.3 | - |
| | SF-Net [ma2020sfnet] | 68.3 | 62.3 | 52.8 | 42.2 | 30.5 | 20.6 | 12.0 | 51.2 | 31.6 |
| | Ju et al. [ju2020point] | 72.3 | 64.7 | 58.2 | 47.1 | 35.9 | 23.0 | 12.8 | 55.6 | 35.4 |
| | Ours | 75.7 | 71.4 | 64.6 | 56.5 | 45.3 | 34.5 | 21.8 | 62.7 | 44.5 |

Table 1: State-of-the-art comparison on THUMOS'14 (mAP@IoU, %). We also include the methods under video-level and frame-level supervision for reference. The average mAPs are computed under the IoU thresholds 0.1:0.5 and 0.3:0.7 with a step size of 0.1. † indicates the use of manually annotated point labels from [ma2020sfnet], while ‡ denotes the use of labels automatically generated in [Moltisanti2019CVPR].

4 Experiments

4.1 Experimental Settings

Datasets. THUMOS'14 [THUMOS14] contains 20 action classes, with 200 and 213 untrimmed videos for validation and test, respectively. It is known to be challenging due to the diverse lengths and frequent occurrence of action instances. Following convention [nguyen2018weakly], we use the validation videos for training and the test videos for testing. GTEA [lei2018gtea] contains 28 videos of 7 fine-grained daily actions in the kitchen, of which 21 and 7 videos are utilized for training and test, respectively. BEOID [damen2014BEOID] has 58 videos with a total of 30 action categories. We follow the data split provided by [ma2020sfnet]. ActivityNet [caba2015activitynet] is a large-scale dataset with two versions. Version 1.3 includes 10,024 training, 4,926 validation, and 5,044 test videos with 200 action classes. Version 1.2 consists of 4,819 training, 2,383 validation, and 2,480 test videos with 100 categories. We evaluate our model on the validation sets of both versions. It should be noted that our model takes only point-level annotations for training.

Evaluation metrics. Following the standard protocol of temporal action localization, we compute mean average precisions (mAPs) under several different intersection over union (IoU) thresholds. We note that performance at small IoU thresholds demonstrates the ability to find actions, while performance under high IoU thresholds reflects the completeness of action predictions.
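For concreteness, the temporal IoU between a predicted and a ground-truth interval underlying these metrics is:

```python
def temporal_iou(a, b):
    """IoU of two temporal intervals given as (start, end) pairs:
    intersection length divided by union length."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

A prediction covering only half of a ground-truth instance caps at IoU 0.5, which is why fragmentary predictions hurt mAP most at high thresholds.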

Implementation details. We employ the two-stream I3D networks [carreira2017quo] pre-trained on Kinetics-400 [carreira2017quo] as our feature extractor, which is not fine-tuned in our experiments for fair comparison. To obtain optical flow maps, we use the TV-L1 algorithm [wedel2009improved]. Each video is split into 16-frame segments, which are taken as inputs by the feature extractor, resulting in 1024-dim features for each modality (i.e., $D = 2048$ after concatenation). We use the original number of segments as $T$ without sampling. Our model is optimized by Adam [kingma2014adam] with a batch size of 16. Hyper-parameters, e.g., the loss weights $\lambda_1$ and $\lambda_2$ and the threshold $\tau$, are determined by grid search. The video-level threshold $\theta_{vid}$ is set to 0.5, while the segment-level threshold $\theta_{seg}$ spans from 0 to 0.25 with a step size of 0.05. NMS is performed with a threshold of 0.6.

4.2 Comparison with State-of-the-art Methods

In Table 1, we compare our method with state-of-the-art models under different levels of supervision on THUMOS'14. We note that fully-supervised models require far more expensive annotation than weakly-supervised counterparts. In the comparison, our model significantly outperforms the state-of-the-art point-supervised approaches. We also notice large performance margins at high IoU thresholds, e.g., 11% in mAP@0.6 and 9% in mAP@0.7. This confirms that the proposed method aids in locating complete action instances. At the same time, our model largely surpasses the video-supervised methods at a comparable labeling cost. Further, our model even performs favorably against the fully-supervised methods in terms of average mAPs at a much lower annotation cost. It is, however, also shown that ours lags behind them at high IoU thresholds, due to the lack of boundary information.

We provide the experimental results on the GTEA and BEOID benchmarks in Table 2. On both datasets, our method beats the existing state-of-the-art methods by a large margin. Notably, our method shows significant performance boosts under the high thresholds of 0.5 and 0.7, verifying the efficacy of the proposed completeness learning.

Table 3 and Table 4 summarize the results on ActivityNet. Our model shows superior performance over all existing weakly-supervised approaches on both versions. It can also be observed that the performance gains over video-level labels are relatively small compared to THUMOS'14, which we conjecture is due to the far less frequent action instances (1.5 vs. 15 instances per video).

| Dataset | Method | 0.1 | 0.3 | 0.5 | 0.7 | AVG |
|---|---|---|---|---|---|---|
| GTEA | SF-Net [ma2020sfnet] | 58.0 | 37.9 | 19.3 | 11.9 | 31.0 |
| | SF-Net* [ma2020sfnet] | 52.9 | 37.6 | 21.7 | 13.7 | 31.1 |
| | Ju et al. [ju2020point] | 59.7 | 38.3 | 21.9 | 18.1 | 33.7 |
| | Li et al. [li2021seg-timestamp] | 60.2 | 44.7 | 28.8 | 12.2 | 36.4 |
| | Ours | 63.9 | 55.7 | 33.9 | 20.8 | 43.5 |
| BEOID | SF-Net [ma2020sfnet] | 62.9 | 40.6 | 16.7 | 3.5 | 30.9 |
| | SF-Net* [ma2020sfnet] | 64.6 | 42.2 | 27.3 | 12.2 | 36.5 |
| | Ju et al. [ju2020point] | 63.2 | 46.8 | 20.9 | 5.8 | 34.9 |
| | Li et al. [li2021seg-timestamp] | 71.5 | 40.3 | 20.3 | 5.5 | 34.4 |
| | Ours | 76.9 | 61.4 | 42.7 | 25.1 | 51.8 |

Table 2: State-of-the-art comparison on GTEA and BEOID (mAP@IoU, %). AVG denotes the average mAP at the thresholds 0.1:0.1:0.7. * denotes results reproduced with the official implementation.
| Supervision | Method | 0.5 | 0.75 | 0.95 | AVG |
|---|---|---|---|---|---|
| Frame-level | SSN [zhao2017temporal] | 41.3 | 27.0 | 6.1 | 26.6 |
| Video-level | Lee et al. [lee2021um] | 41.2 | 25.6 | 6.0 | 25.9 |
| | AUMN [Luo_2021_AUMN] | 42.0 | 25.0 | 5.6 | 25.5 |
| | UGCT [Yang_2021_UGCT] | 41.8 | 25.3 | 5.9 | 25.8 |
| | CoLA [Zhang_2021_cola] | 42.7 | 25.7 | 5.8 | 26.1 |
| Point-level | SF-Net [ma2020sfnet] | 37.8 | - | - | 22.8 |
| | Ours | 44.0 | 26.0 | 5.9 | 26.8 |

Table 3: State-of-the-art comparison on ActivityNet 1.2 (mAP@IoU, %). AVG is the average mAP at the thresholds 0.5:0.05:0.95.

4.3 Analysis

Effect of each component. In Table 5, we conduct an ablation study to investigate the contribution of each component. The upper section reports the baseline performances, from which we observe a large score gain brought by the point-level supervision, especially under low IoU thresholds. It mainly comes from background modeling [lee2020background, lee2021um, Nguyen2019WeaklySupervisedAL] and the help of point annotations in spotting action instances. The lower section demonstrates the results of the proposed method, where completeness guidance is provided for the model. We observe absolute average mAP gains of 4.7% and 1.7% from the proposed contrastive losses regarding score and feature similarity, respectively. Moreover, with the two losses combined, the performance is further boosted to 52.8%. This clearly shows that the two proposed losses are complementary and beneficial for precise action localization. Notably, the scores at high IoU thresholds are largely improved, verifying the efficacy of our completeness learning.

| Supervision | Method | 0.5 | 0.75 | 0.95 | AVG |
|---|---|---|---|---|---|
| Frame-level | BMN [Lin2019BMNBN] | 50.1 | 34.8 | 8.3 | 33.9 |
| | P-GCN [zeng2019p-gcn] | 48.3 | 33.2 | 3.3 | 31.1 |
| | G-TAD [xu2020g-tad] | 50.4 | 34.6 | 9.0 | 34.1 |
| | BC-GNN [bai2020bcgnn] | 50.6 | 34.8 | 9.4 | 34.2 |
| | Zhao et al. [zhao2020bottom-up] | 43.5 | 33.9 | 9.2 | 30.1 |
| Video-level | Lee et al. [lee2021um] | 37.0 | 23.9 | 5.7 | 23.7 |
| | AUMN [Luo_2021_AUMN] | 38.3 | 23.5 | 5.2 | 23.5 |
| | TS-PCA [Liu_2021_TS_PCA] | 37.4 | 23.5 | 5.9 | 23.7 |
| Point-level | Ours | 40.4 | 24.6 | 5.7 | 25.1 |

Table 4: State-of-the-art comparison on ActivityNet 1.3 (mAP@IoU, %). AVG is the average mAP at the thresholds 0.5:0.05:0.95.
| $\mathcal{L}_{video}$ | $\mathcal{L}_{point}$ | $\mathcal{L}_{score}$ | $\mathcal{L}_{feat}$ | 0.1 | 0.3 | 0.5 | 0.7 | AVG |
|---|---|---|---|---|---|---|---|---|
| ✓ | | | | 51.9 | 37.1 | 20.3 | 6.0 | 28.7 |
| ✓ | ✓ | | | 70.7 | 58.1 | 40.7 | 16.1 | 47.3 |
| ✓ | ✓ | ✓ | | 75.1 | 64.4 | 44.5 | 20.0 | 52.0 |
| ✓ | ✓ | | ✓ | 72.1 | 60.5 | 42.1 | 17.9 | 49.0 |
| ✓ | ✓ | ✓ | ✓ | 75.7 | 64.6 | 45.3 | 21.8 | 52.8 |

Table 5: Ablation study on THUMOS'14 (mAP@IoU, %). AVG represents the average mAP at the IoU thresholds 0.1:0.1:0.7.
| Scoring method | Sequence accuracy | 0.1 | 0.3 | 0.5 | 0.7 | AVG |
|---|---|---|---|---|---|---|
| Baseline | N/A | 70.7 | 58.1 | 40.7 | 16.1 | 47.3 |
| (a) Inner scores | 74.0 | 74.7 | 61.4 | 40.9 | 15.2 | 49.0 |
| (b) Contrast-act | 80.1 | 74.3 | 63.3 | 43.6 | 19.5 | 50.8 |
| (c) Contrast-both | 83.9 | 75.7 | 64.6 | 45.3 | 21.8 | 52.8 |

Table 6: Comparison of different scoring methods for optimal sequence search on THUMOS'14 (mAP@IoU, %). AVG denotes the average mAP at the IoU thresholds 0.1:0.1:0.7.
Figure 4: Qualitative comparison with SF-Net [ma2020sfnet] on THUMOS'14. We provide two examples with different action classes: (1) CleanAndJerk and (2) SoccerPenalty. For each video, we present the final scores and detection results from SF-Net and our model as well as the ground-truth action intervals. The detection threshold is set to 0.2 for our method and to the mean score for SF-Net, following the original paper. The red boxes indicate the frames that are misclassified by SF-Net but detected by our method. Note that all of our detection results show high IoUs (> 0.6) with the ground-truths.

Comparison of different scoring methods. In Table 6, we compare different sequence scoring methods with respect to the frame-level accuracy of optimal sequences on the training set as well as localization performance on the test set of THUMOS'14. Specifically, we investigate three variants: (a) inner scores, (b) score contrast of action instances, and (c) score contrast of both action and background instances. Compared to inner scores, the contrast methods generate more accurate optimal sequences and bring larger performance gains at high IoU thresholds. Moreover, we observe that incorporating background instances into the score calculation helps to find highly accurate optimal sequences, thereby improving the localization performance at test time.
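The difference between variants (a) and (b) can be sketched as follows. The surrounding extent used for the contrast (one instance length on each side) and the function names are assumptions for illustration; the exact definition is Eq. (6) of the main paper.

```python
import numpy as np

def inner_score(scores, s, e):
    """Mean action score inside the instance [s, e] (variant (a))."""
    return scores[s:e + 1].mean()

def contrast_score(scores, s, e):
    """Sketch of the score-contrast variant: the inner score minus the
    mean score of the immediate surroundings (here, one instance length
    on each side -- an assumed extent)."""
    length = e - s + 1
    left = scores[max(0, s - length):s]
    right = scores[e + 1:e + 1 + length]
    outer = np.concatenate([left, right])
    outer_mean = outer.mean() if outer.size else 0.0
    return inner_score(scores, s, e) - outer_mean
```

On a score plateau, the inner score cannot distinguish a fragmentary instance from the complete one (both average the same high value), whereas the contrast score penalizes the fragment because its surroundings still contain high-scoring action frames. Variant (c) additionally applies the same contrast to background instances.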

Method                   Distribution   Sequence accuracy   mAP@IoU (%)           AVG
                                                            0.3    0.5    0.7
SF-Net [ma2020sfnet]     Manual         N/A                 53.3   28.8   9.7    40.6
                         Uniform        N/A                 52.0   30.2   11.8   40.5
                         Gaussian       N/A                 47.4   26.2   9.1    36.7
Ju et al. [ju2020point]  Manual         N/A                 58.1   34.5   11.9   44.3
                         Uniform        N/A                 55.6   32.3   12.3   42.9
                         Gaussian       N/A                 58.2   35.9   12.8   44.8
Ours                     Manual         83.7                63.3   43.9   20.8   51.7
                         Uniform        76.6                60.4   42.6   20.2   49.3
                         Gaussian       83.9                64.6   45.3   21.8   52.8
Table 7: Comparison of the point-level labels from different distributions on THUMOS’14. AVG denotes the average mAP at the IoU thresholds 0.1:0.1:0.7.

Comparison of different label distributions. In Table 7, we explore different label distributions. "Manual" indicates the use of human annotations from [ma2020sfnet], whereas the others denote simulated labels drawn from the corresponding distributions. Our method significantly outperforms the existing methods regardless of the distribution choice, showing its robustness. We also observe that our method performs slightly worse with "Uniform" labels compared to the other distributions. We conjecture this is because less discriminative points have more chances to be annotated. Their neighbors are likely to have lower confidence, probably leading to sub-optimal sequences by the greedy algorithm. Indeed, the optimal sequence accuracy is the lowest under the uniform distribution, which supports our claim.

4.4 Qualitative Comparison

We present qualitative comparisons with SF-Net [ma2020sfnet] in Fig. 4. It can be clearly seen that our method locates the action instances more precisely. Specifically, in the left example, SF-Net produces fragmentary predictions with false negatives, whereas our method detects the complete action instances without splitting them. In the right example, while SF-Net overestimates the action instances with false positives, our method produces precise detection results by well distinguishing action frames from background ones. The red boxes highlight the false negatives and false positives of SF-Net in the left and right examples, respectively. We note that all the predictions of our model in both examples have IoUs larger than 0.6 with the corresponding ground-truth instances, validating the effectiveness of our completeness learning. Comparisons on other benchmarks and more visualization results can be found in Sec. C of the appendix.
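The post-processing behind these detections (thresholding the final scores at 0.2 and measuring temporal IoU against the ground truth) can be sketched as:

```python
import numpy as np

def extract_intervals(scores, thresh=0.2):
    """Group consecutive frames whose score exceeds the threshold
    into detected intervals [start, end] (inclusive, frame indices)."""
    above = scores >= thresh
    intervals, start = [], None
    for t, a in enumerate(above):
        if a and start is None:
            start = t                      # a new interval opens
        elif not a and start is not None:
            intervals.append((start, t - 1))
            start = None                   # the interval closes
    if start is not None:                  # interval runs to the end
        intervals.append((start, len(scores) - 1))
    return intervals

def temporal_iou(a, b):
    """Intersection-over-union of two inclusive temporal intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union
```

This is only the generic threshold-and-group scheme implied by the caption of Fig. 4; the released code may apply additional smoothing or NMS.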

5 Conclusion

In this paper, we presented a new framework for point-supervised temporal action localization, where dense sequences provide completeness guidance to the model. Concretely, our framework finds the optimal sequence consistent with point labels based on the completeness score, which is efficiently implemented with a greedy algorithm. To learn completeness from the obtained sequence, we introduced two novel losses that encourage contrast between action and background instances regarding action score and feature similarity, respectively. Experiments validated that the optimal sequences are accurate and that the proposed losses indeed help to detect complete action instances. Moreover, our model achieves a new state-of-the-art by a large margin on four benchmarks. Notably, it even outperforms fully-supervised methods on average despite the lower supervision level.

Acknowledgements

This project was partly supported by the National Research Foundation of Korea grant funded by the Korea government (MSIT) (No. 2019R1A2C2003760) and the Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No. 2020-0-01361: Artificial Intelligence Graduate School Program (YONSEI UNIVERSITY)).

A Regarding Point-level Supervision

In this paper, we tackle temporal action localization under point-level supervision. Here, "points" denote timestamps on the temporal axis, whereas "points" have also been widely used to represent spatial pixels in the literature. Bearman et al. [Bearman2016whats_the_points] introduce the first weakly-supervised semantic segmentation framework that takes as supervision a single annotated pixel for each object. Since that work, a great deal of effort [ke2021universal_segmentation, laradji2021point_covid, laradji2020point_instance2, ren2020ufo, zhou2019point_instance] has been devoted to utilizing point-level supervision for various segmentation tasks in images or videos, thanks to its affordable annotation cost. Meanwhile, there are also attempts to employ point-level supervision to train object detectors [mcever2020pcams, papadopoulos2017extreme_detection, papadopoulos2017click_detection]. On the other hand, spatial points have also been explored to provide supervision for the weakly-supervised spatio-temporal action localization task [mettes2019point_action, mettes2016spot_on].

We remark that the definition of “point” in our problem setting is based on the temporal dimension, differing from that of the work above.

B Greedy Optimal Sequence Search

As discussed in the main paper, the search space of optimal sequence selection grows exponentially with the length of the input video, which makes exhaustive search intractable. To bypass this cost issue, we design a greedy algorithm that makes locally optimal choices at each step under a fixed budget. Specifically, we process an input video sequentially, taking one segment at a timestep. At each timestep t, we consider all candidate sequences of length t that are consistent with the point labels, and compute their completeness scores by averaging the contrast scores of the action and background instances constituting the sequences (Eq. (6) of the main paper). In this calculation, we do not include the ongoing (i.e., not terminated) instance, as it is infeasible to derive its contrast score without looking ahead into the future. Afterwards, we keep only the top-B candidates (where B is the budget size) regarding the completeness scores. When the step reaches the end of the video, we terminate the algorithm and select the optimal sequence with the highest score. In this way, we save a large amount of computational cost, thereby making the search process tractable. The pseudo-code of our algorithm for a single class is described in Algorithm 1.
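The procedure above can be sketched in Python as follows. This is a simplified single-class illustration: score_fn is a hypothetical callback standing in for the contrast score of Eq. (6), consistency is checked naively against the seed points, and instance categories are forced to alternate between action and background.

```python
import heapq

def greedy_optimal_sequence(T, points, score_fn, budget=25):
    """Beam-style greedy search over sequences of (start, end, category)
    instances covering segments 1..T, consistent with the seed points.

    points   : dict {timestep: 'act' or 'bkg'} -- the labeled seeds
    score_fn : (start, end, category) -> contrast score of a terminated
               instance (assumed; stands in for Eq. (6) of the paper)
    budget   : beam size B; only the top-B candidates survive each step
    """
    first_cat = points[min(points)]  # category of the earliest seed
    # candidate = (terminated instances, ongoing start, ongoing category)
    beam = [([], 1, first_cat)]

    def consistent(start, end, cat):
        # every seed inside [start, end] must share the instance category
        return all(c == cat for t, c in points.items() if start <= t <= end)

    def seq_score(done):
        # completeness score: mean contrast score of terminated instances
        return sum(score_fn(s, e, c) for s, e, c in done) / max(len(done), 1)

    for t in range(2, T + 1):
        nxt = []
        for done, start, cat in beam:
            # case 1: the ongoing instance continues through segment t
            if consistent(start, t, cat):
                nxt.append((done, start, cat))
            # case 2: it terminates at t-1 and a new instance starts at t
            new_cat = 'bkg' if cat == 'act' else 'act'
            if consistent(start, t - 1, cat) and consistent(t, t, new_cat):
                nxt.append((done + [(start, t - 1, cat)], t, new_cat))
        # pruning: keep only the top-B candidates by completeness score
        beam = heapq.nlargest(budget, nxt, key=lambda c: seq_score(c[0]))

    # close the last instance and return the best full sequence
    finished = [done + [(start, T, cat)] for done, start, cat in beam]
    return max(finished, key=seq_score)
```

Because each candidate branches into at most two successors before pruning, the cost per step is O(B), matching the near-linear scaling in the budget reported in Table 8.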

Since the budget B affects the computational cost as well as the performance, we investigate several different budget sizes on THUMOS'14. For the computational cost, we train the model for 100 epochs and report the average execution time of optimal sequence selection for an epoch (i.e., 200 training videos). The selection is implemented in multiprocessing with 16 worker processes and performed on a single AMD-3960X Threadripper CPU. Table 8 shows the average mAPs (%) and the execution times (sec) with varying B. As expected, the computational cost grows nearly linearly with the budget. Besides, when B is set to a too-small value (e.g., 1), the selected optimal sequence is likely to be a local optimum, leading to a significant performance drop. On the other hand, the performance differences are insignificant when B is larger than 5. This indicates that the model is fairly robust to the budget size and that a not-too-small B is sufficient to find sequences that provide helpful completeness guidance to the model. In practice, we set B to 25, as it achieves the best performance at an affordable cost of fewer than 5 seconds for processing the whole training videos.

B                       1       5       10      25      50      100
mAP@AVG (%)             51.3    52.5    52.6    52.8    52.7    52.7
Execution time (sec)    0.683   1.343   2.151   4.398   8.512   16.769
Table 8: Analysis on the budget size B on THUMOS'14. We provide the execution times as well as the average mAPs under IoU thresholds 0.1:0.1:0.7 with B varying from 1 to 100. The average execution time for optimal sequence selection per epoch is reported in seconds.
Figure 5: Correlation between scores and IoUs with ground-truths. (a) The inner score shows moderate correlation (Pearson’s r = 0.38), whereas (b) the score contrast displays much stronger correlation (Pearson’s r = 0.68).

C Additional Experiments

C.1 Score contrast vs. completeness

To analyze the correlation between score contrast and action completeness, we draw a scatter plot of score contrast vs. IoU with ground-truth action instances, using 2,000 randomly sampled temporal intervals from the THUMOS'14 training videos. For reference, we also present the scatter plot of inner action scores vs. IoUs for the same intervals. In these experiments, we use the baseline model for a fair comparison. Fig. 5a shows a moderate correlation between inner action scores and IoUs, but there are many cases with large inner scores yet low IoUs (see bottom right). On the contrary, as shown in Fig. 5b, score contrast correlates much more strongly with IoUs, demonstrating its efficacy as a proxy for measuring action completeness without any supervision.
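The reported Pearson's r can be computed directly from the sampled (score, IoU) pairs, e.g.:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two paired score lists,
    as used for the scatter plots in Fig. 5."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # off-diagonal entry of the 2x2 correlation matrix
    return float(np.corrcoef(x, y)[0, 1])
```

Applied to (inner score, IoU) and (score contrast, IoU) pairs, this yields the r = 0.38 and r = 0.68 values quoted in the caption of Fig. 5.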

Input:  class-specific action points (ascending) t_1^act, ..., t_{N_act}^act; pseudo background points (ascending) t_1^bkg, ..., t_{N_bkg}^bkg; the number of class-specific action points N_act; the number of pseudo background points N_bkg; fixed budget size B
Output: optimal sequence s*   // a sequence of (start, end, category) instances with its completeness score (refer to Sec. 3.2 of the main paper)
    // Initialize the first instance with the same category as that of the first point label
 1: if t_1^act < t_1^bkg then c ← act else c ← bkg
 2: I ← (1, 1, c)
 3: S_1 ← {([I], 0)}
    // For each step t, find the top-B sequences which span from the first segment to the t-th segment while agreeing with point labels.
 4: for t = 2 to T do
 5:    // Find the upcoming points for action and background, respectively.
 6:    if ∃ t_i^act ≥ t then u^act ← min{t_i^act : t_i^act ≥ t};  if ∃ t_j^bkg ≥ t then u^bkg ← min{t_j^bkg : t_j^bkg ≥ t}
       // Remember the category of the closest upcoming point, as it will determine the possible cases (to continue or to be terminated)
 7:    if u^act < u^bkg then c_next ← act else c_next ← bkg
       // If t surpasses either of the last points for action and background, reverse the upcoming category
 8:    if t > t_{N_act}^act or t > t_{N_bkg}^bkg then c_next ← the opposite category
       // Update the candidate sequence set for the timestep t
 9:    S_t ← ∅
10:    while S_{t-1} ≠ ∅ do
11:       pop (s, σ) from S_{t-1}
12:       pop the last instance (t_s, t_e, c) from s          // t_e should be equal to t-1
          // The case where the last instance continues at timestep t
13:       if c = c_next or t < min(u^act, u^bkg) then
14:          s' ← s ∪ {(t_s, t, c)}
15:          S_t ← S_t ∪ {(s', σ)}
16:       end if
          // The case where the last instance is terminated at timestep t-1 and a new instance starts at timestep t
17:       if c ≠ c_next then
18:          σ' ← average of the contrast scores of the terminated instances in s ∪ {(t_s, t-1, c)}   // Update the score of the candidate sequence by averaging the contrast scores again
19:          if c = act then c' ← bkg else c' ← act          // Create a new instance that starts right after the last instance, with the category c'
20:          s' ← s ∪ {(t_s, t-1, c), (t, t, c')}
21:          S_t ← S_t ∪ {(s', σ')}
22:       end if
23:    end while
       // Pruning with the budget size B
24:    while |S_t| > B do
25:       (s_min, σ_min) ← arg min_{(s, σ) ∈ S_t} σ
26:       pop (s_min, σ_min) from S_t
27:    end while
28: end for
    // Return the optimal sequence
29: (s*, σ*) ← arg max_{(s, σ) ∈ S_T} σ
30: return s*
Algorithm 1: Greedy Optimal Sequence Search
Mining approach              mAP@IoU (%)                                      AVG
                             0.1    0.2    0.3    0.4    0.5    0.6    0.7
Global mining [ma2020sfnet]  67.4   61.1   54.9   46.3   36.4   25.7   13.4   43.6
Ours w/o filling             70.1   64.4   57.6   49.5   39.4   29.5   15.5   46.6
Ours                         70.7   65.2   58.1   49.8   40.7   30.2   16.1   47.3
Table 9: Comparison of different pseudo background mining approaches on THUMOS’14. AVG represents the average mAP at the IoU thresholds 0.1:0.1:0.7.

C.2 Analysis on Pseudo Background Mining

We compare different variants of pseudo background mining on THUMOS'14. Specifically, we consider three variants: (1) "Global mining" selects the top ηM points throughout the whole video without considering their locations, as in SF-Net [ma2020sfnet], where M is the number of action instances and η is set to 5; (2) "Ours w/o filling" follows the principle described in Sec. 3.1 except for the filling stage, i.e., we select at least one background point for each section between two action points; and (3) "Ours" additionally mines all points lying between the selected background points of a section when multiple points are found in the second variant. Note that we use the baseline model without completeness learning for a clear comparison.
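The section-wise mining with and without filling can be sketched as follows, assuming per-frame background scores and a hypothetical top-k selection per section; the exact selection rule is given in Sec. 3.1 of the main paper.

```python
import numpy as np

def mine_pseudo_background(bkg_scores, action_points, k=2, fill=True):
    """Sketch of section-wise pseudo background mining.

    For each section between two consecutive action points, select the
    top-k frames by background score (k is an illustrative assumption);
    with `fill`, additionally take every frame lying between the selected
    background points of that section.
    """
    points = sorted(action_points)
    mined = []
    for lo, hi in zip(points, points[1:]):
        inside = np.arange(lo + 1, hi)  # frames strictly between two action points
        if inside.size == 0:
            continue
        # top-k frames of the section by background score
        top = inside[np.argsort(bkg_scores[inside])[::-1][:k]]
        if fill and top.size > 1:
            # filling: also take all frames between the selected points,
            # which collects hard background frames with low scores
            mined.extend(range(int(top.min()), int(top.max()) + 1))
        else:
            mined.extend(top.tolist())
    return sorted(set(mined))
```

The variant without filling corresponds to fill=False, and "Global mining" would instead rank all frames of the video jointly, without the per-section constraint.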

The results are demonstrated in Table 9. It can be observed that both of our methods significantly outperform the “Global mining” approach, which verifies the effectiveness of our selection principle that at least one background point should be placed for each section. Moreover, by ensuring at least one background point for each section, the search space of optimal sequence selection can be significantly reduced, although we do not include the cost analysis for this experiment. Meanwhile, we notice that filling between two background points slightly boosts the localization performance. This is presumably because hard background points with low background scores can be collected in the filling step.

C.3 Optimal Sequence Visualization

In Fig. 6, we visualize the obtained optimal sequences for examples from three benchmarks. In the first example from THUMOS'14 (a), the optimal sequence covers the ground-truth action instances well, so that the model can learn action completeness from it. Moreover, although the examples from GTEA (b) and BEOID (c) contain a variety of action classes in a single video, our method successfully finds optimal sequences that show large overlaps with the ground-truth ones. Overall, all the examples show that the optimal sequences are quite accurate even though they are selected based on point-level labels without full supervision. They in turn provide completeness guidance to our model, which is shown to improve localization performance at high IoU thresholds in Sec. 4.3 of the main paper.

C.4 More Qualitative Comparison

We qualitatively compare our method with SF-Net [ma2020sfnet] on the three benchmarks. The comparison on THUMOS'14 [THUMOS14] is demonstrated in Fig. 7. As shown, SF-Net produces fragmentary predictions by splitting action instances, whereas our method outputs complete ones with high IoUs, even for the extremely long action instance in (b). The comparison on GTEA [lei2018gtea] is presented in Fig. 8. It should be noted that action localization on GTEA is challenging, as frames with different action categories are visually similar, leading to false positives. We see that SF-Net has difficulty in distinguishing action instances from background ones, resulting in inaccurate localization. On the other hand, our method successfully finds the action instances by learning completeness, showing fewer false positives. Lastly, the comparison on BEOID [damen2014BEOID] is shown in Fig. 9. It can be clearly noticed that SF-Net fails to predict the ending times of action instances, leading to the overestimation problem. On the contrary, with the help of the completeness guidance, our method better separates actions from their surroundings and locates the action instances more precisely.

Figure 6: Optimal sequence visualization on the three benchmarks. The examples are taken from (a) THUMOS’14, (b) GTEA, and (c) BEOID, respectively. Note that all of the examples belong to the training set of the corresponding benchmarks. For each video, we present the final scores and the obtained optimal sequences as well as ground-truth action intervals. The horizontal axis in each plot denotes the timesteps of the video, while the vertical axis in the first plot indicates the score values ranging from 0 to 1. For each example, different colors correspond to different action categories, while the gray color indicates the background class.
Figure 7: Qualitative comparison with SF-Net [ma2020sfnet] on THUMOS'14. We provide two examples with different action classes: (a) Diving and (b) CleanAndJerk. For each video, we present the final scores and detection results from SF-Net and our model as well as ground-truth action intervals. The horizontal axes denote the timesteps of the video, while the vertical axes are the score values ranging from 0 to 1. The detection threshold is set to 0.2 for our method and to the mean score for SF-Net, following the original paper. The red boxes indicate the frames that are misclassified by SF-Net but detected by our method. All of our detection results show high IoUs (> 0.5) with the corresponding ground-truths regardless of their lengths.
Figure 8: Qualitative comparison with SF-Net [ma2020sfnet] on GTEA. We provide two examples with different action classes: (a) Take and (b) Pour. For each video, we present the final scores and detection results from SF-Net and our model as well as ground-truth action intervals. The horizontal axis in each plot denotes the timesteps of the video, while the vertical axes are the score values ranging from 0 to 1. The detection threshold is set to 0.2 for our method and to the mean score for SF-Net, following the original paper. The red boxes indicate false alarms of SF-Net, which are rejected by our method. Compared to SF-Net, our method localizes action instances more precisely with fewer false positives.
Figure 9: Qualitative comparison with SF-Net [ma2020sfnet] on BEOID. We provide two examples with different action classes: (a) ScanCard-reader and (b) TurnTap. For each video, we present the final scores and detection results from SF-Net and our model as well as ground-truth action intervals. The horizontal axis in each plot denotes the timesteps of the video, while the vertical axes are the score values ranging from 0 to 1. The detection threshold is set to 0.2 for our method and set to the mean score for SF-Net following the original paper. The red boxes indicate false alarms of SF-Net deteriorating the performances at high IoU thresholds. While SF-Net overestimates the action instances, our method detects the complete action instances by discriminating action instances from background ones well.

References