Weakly Supervised Temporal Action Localization with Segment-Level Labels

07/03/2020 ∙ by Xinpeng Ding, et al.

Temporal action localization presents a trade-off between test performance and annotation-time cost. Fully supervised methods achieve good performance with time-consuming boundary annotations. Weakly supervised methods with cheaper video-level category labels result in worse performance. In this paper, we introduce a new segment-level supervision setting: segments are labeled when annotators observe actions happening in them. We incorporate this segment-level supervision into training together with a novel localization module. Specifically, we devise a partial segment loss, regarded as a loss sampling, to learn integral action parts from the labeled segments. Since the labeled segments are only parts of actions, the model tends to overfit as training proceeds. To tackle this problem, we first obtain a similarity matrix from discriminative features guided by a sphere loss. Then, a propagation loss based on this matrix acts as a regularization term, allowing implicit unlabeled segments to be propagated during training. Experiments validate that our method outperforms video-level supervision methods with almost the same annotation time.


1 Introduction

Many works in recent years [1, 2, 3, 4, 5, 6] tackle temporal action localization, which aims to localize and classify actions in untrimmed videos. These methods adopt the full supervision setting: annotations of temporal boundaries (start and end times) and action category labels are provided during training, as shown in Fig. 1 (a). Although great improvements have been achieved under this setting, obtaining such annotations is very time-consuming for long untrimmed videos [7].

To alleviate the requirement for temporal boundary annotations, weakly supervised methods [8, 9, 10, 11, 12, 13] have been developed. The most common setting is video-level supervision: only category labels are provided for each video at training time, as shown in Fig. 1 (c). In these methods, researchers learn class activation sequences (CAS) for action localization using excitation back-propagation, with video-level category labels guided by a classification loss; this is simple yet efficient for weakly supervised action localization. Along with the learning procedure, the CAS shrinks to the discriminative parts of actions, because those parts alone are capable of minimizing the action classification loss. Therefore, such methods are usually observed to activate discriminative action parts instead of the full action extent. Existing approaches [14, 10] have explored erasing salient parts to expand temporal class activation maps and pursue the full action extent. Nevertheless, these methods may decrease action classification accuracy and lose semantic information of actions, owing to the removal of some action parts.

In this paper, we first divide a video into non-overlapping segments, each of which contains 16 frames. Then, we propose a new segment-level supervision setting: one or two segments and their corresponding action category labels are provided at training time, as shown in Fig. 1 (b). In this setting, annotators browse the video for an action instance and simultaneously mark one or two separated seconds that belong to the action instance. The segments containing the labeled seconds are regarded as the ground-truth labeled segments. Compared with boundary annotations, segment annotations do not require the time spent finding precise start and end times, which sometimes must be accurate to 0.1 second. Furthermore, segment-level supervision provides extra localization information compared with video-level supervision.

To make full use of this segment-level information, we propose a localization module which consists of three loss terms: a partial segment loss, a sphere loss and a propagation loss. Compared with video-level supervision methods, the partial segment loss uses the labeled segments to learn more parts of action instances instead of focusing only on discriminative parts. Since the labeled segments are only a part of an action instance rather than its full extent, the model tends to overfit as training proceeds. To address this problem, we first define the segments that have high feature similarity with labeled segments as implicit segments, motivated by the intuition that the features of segments belonging to the same action instance are similar. To measure the similarity between pairs of segments, we obtain a similarity matrix generated from the discriminative features. Guided by the sphere loss, the discriminative features have a smaller maximal intra-class distance than minimal inter-class distance. Then, based on the obtained similarity matrix, a propagation loss is introduced as a regularization term which propagates labeled segments to implicit ones. The main contributions of this paper are as follows:

  • A new segment-level supervision setting is proposed for weakly supervised temporal action localization, costing almost the same annotation time as video-level supervision.

  • A novel localization module guided by a sphere loss, a partial segment loss and a propagation loss is proposed to exploit both labeled and implicit segments, keeping the model from focusing only on the discriminative parts.

  • Experimental results demonstrate that the proposed method outperforms the state-of-the-art weakly supervised temporal action localization methods under the video-level supervision setting.

Figure 1: A video annotated with (a) full supervision, (b) segment-level supervision and (c) video-level supervision.

2 Related Work

Temporal Action Localization. Temporal action localization under full supervision has seen significant progress in recent years [15, 11, 16, 2, 3, 6]. However, obtaining precise temporal boundaries (start and end times) is very time-consuming for long untrimmed videos. To reduce the cost of boundary annotation, weakly supervised temporal action localization with video-level category labels has attracted growing attention. Given only category labels, most methods [1, 9, 14, 10, 8, 17] generate class activation sequences (CAS) from a classification loss. However, the CAS guided by the classification loss is observed to shrink to salient parts instead of the full action extent. The reason behind this phenomenon is that, when optimizing the classification loss, networks tend to learn the most compact features that distinguish different categories and ignore less discriminative ones [18]. Several researchers have attempted to pursue the integral action extent. Hide-and-Seek [14] randomly hides parts of videos during training so that the network observes the whole extent. Zhong et al. [10] train multiple classifiers to erase regions step by step. However, these methods may discard discriminative parts, which decreases classification accuracy.

Figure 2: Annotation examples of (a) one-segment and (b) two-segment with the COIN annotation tool.

Regularization for Neural Networks. Regularization refers to a set of techniques that prevent overfitting in neural networks and has been widely used to improve performance, e.g., norm regularization [19] and dropout [20]. Motivated by semi-supervised learning [21], our proposed propagation loss differs from these regularizers, which act on network parameters. Similar to our segment-level supervision, semi-supervised learning is the setting in which a small amount of data is labeled while a large amount is unlabeled during training. Weston et al. [21] add a semi-supervised loss (regularizer) to the supervised loss on the entire network's output for unlabeled data. Such regularization couples well with our segment-supervised loss to improve temporal action localization performance.

Figure 3: Architecture of Our Approach. There are two main modules: a classification module and a localization module. The classification module is trained by a classification loss (CL) and the localization module is guided by a partial segment loss (PSL), a sphere loss (SL) and a propagation loss (PL).

3 Annotation of Segment-Level Supervision

To avoid re-annotation, we generate ground-truth segment labels by randomly sampling from the available temporal action boundary annotations of ActivityNet and THUMOS14. For a new dataset, however, segment labels can be annotated directly, without requiring any action boundary annotations. We consider two kinds of segment-level supervision: one-segment, in which one segment is labeled for each action instance, and two-segment, in which two segments separated by an interval are labeled for each action instance. In the two-segment setting, since temporal annotations are one-dimensional, all segments lying between the two labeled segments can also be regarded as ground-truth segments. To evaluate annotation time, we sample videos from the action classes of ActivityNet and THUMOS14. We use the COIN annotation tool [22] to label the seconds in which an action instance happens, as shown in Fig. 2; the segments containing the labeled seconds are then regarded as the ground-truth labeled segments. The experiments on the annotation time of video-level, one-segment, two-segment and full supervision are reported in Section 5.2.
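To make the label-generation step concrete, the following is a minimal sketch of how segment-level labels might be sampled from an existing boundary annotation, assuming 16-frame segments and a known frame rate; the function name and defaults are illustrative and not the authors' released code.

```python
import random

def sample_segment_labels(start_sec, end_sec, fps=25, frames_per_segment=16,
                          num_labeled=1):
    """Randomly pick `num_labeled` segment indices inside one action instance.

    A segment spans `frames_per_segment / fps` seconds; every segment fully
    covered by [start_sec, end_sec] is a valid ground-truth candidate.
    """
    seg_dur = frames_per_segment / fps
    first = int(start_sec // seg_dur) + (0 if start_sec % seg_dur == 0 else 1)
    last = int(end_sec // seg_dur) - 1            # last fully covered segment
    candidates = list(range(first, last + 1))
    if not candidates:                            # very short instance: fall back
        candidates = [int(start_sec // seg_dur)]
    return sorted(random.sample(candidates, min(num_labeled, len(candidates))))

# e.g., an instance spanning 12.3s-19.8s at 25 fps, two-segment supervision:
# sample_segment_labels(12.3, 19.8, num_labeled=2)
```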

4 Our Approach

4.1 Problem Statement and Notation

Let us denote the untrimmed videos as $\{v_i\}_{i=1}^{N}$, where $N$ denotes the number of videos. We divide each video into non-overlapping segments $\{s_t\}_{t=1}^{T}$, where $T$ denotes the number of segments. Each segment consists of 16 frames. The extracted feature of $s_t$ is denoted as $x_t \in \mathbb{R}^{D}$, where $D$ is the feature dimension; the stacked segment features of a video form $X \in \mathbb{R}^{T \times D}$. Let the action label be denoted as $y \in \{0,1\}^{C}$, where $y$ is a multi-hot vector and $C$ is the number of action classes. For a video, we denote its segment label as $G \in \{0,1\}^{T \times C}$: $G_{t,c}=1$ when there is an action instance with category $c$ occurring in the $t$-th segment, and $G_{t,c}=0$ when no action instance with category $c$ occurs in the $t$-th segment. For simplicity, we drop the video index when there is no confusion. We use $M_{i,j}$ to represent the element in the $i$-th row and $j$-th column of a matrix $M$. Naturally, $M_{i,:}$ and $M_{:,j}$ indicate the $i$-th row vector and the $j$-th column vector of $M$ respectively.

4.2 Architecture

The architecture of our approach is shown in Fig. 3. The fused segment features $X$ described in Section 4.1 are fed into a fully connected (FC) layer to obtain the discriminative feature $F$. Two main modules follow: a classification module that learns discriminative parts for distinguishing different action classes, and a localization module that observes integral action regions.

The output of a fully connected layer in the classification module is the class activation sequence (CAS), a class-specific 1D temporal map similar to the 2D class activation map used for object localization [23]. We denote the CAS for classification as $A^{cls} \in \mathbb{R}^{T \times C}$. Conducting a top-k mean operation on $A^{cls}$, a probability mass function (PMF), denoted by $p$, is generated for a classification loss. As in other video-level supervision methods [8, 9], the classification loss encourages the model to distinguish different action categories.

In the localization module, we obtain a localization CAS, denoted by $A^{loc} \in \mathbb{R}^{T \times C}$, with a fully connected layer similar to the classification module. Guided by a partial segment loss, the model pays attention to the labeled segments rather than only the discriminative ones learned from the classification loss. Since the labeled segments cover only part of the action instances, the model is prone to overfit as training proceeds. To address this drawback, we define the segments having high similarity with labeled segments as implicit segments. In order to measure the similarity of pairs of segments, a sphere loss is first adopted to ensure that the discriminative feature $F$ has a smaller maximal intra-class distance than minimal inter-class distance. Then, we measure the similarity between pairs of segments by the similarity matrix $S = F F^{\top}$, where $F F^{\top}$ and $F^{\top}$ indicate matrix multiplication and the transpose of $F$ respectively. Finally, we add a propagation loss to propagate the partially labeled segments to the entire action instances, including unlabeled implicit segments. The objective of our framework is formulated as follows:

$\mathcal{L} = \mathcal{L}_{cls} + \alpha \mathcal{L}_{psl} + \beta \mathcal{L}_{sph} + \gamma \mathcal{L}_{prop}$   (1)

where $\mathcal{L}_{cls}$, $\mathcal{L}_{psl}$, $\mathcal{L}_{sph}$ and $\mathcal{L}_{prop}$ indicate the classification loss, the partial segment loss, the sphere loss and the propagation loss respectively. $\alpha$, $\beta$ and $\gamma$ are trade-off hyperparameters.
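For reference, a minimal PyTorch sketch of how the similarity matrix and the combined objective of Equation (1) could be assembled is given below; the individual loss terms are defined in the following subsections, and all names and default weights are illustrative assumptions rather than the authors' released code.

```python
import torch

def similarity_matrix(feat):
    """Pairwise segment similarity S = F F^T; `feat` has shape (T, D')."""
    return feat @ feat.t()                                   # (T, T)

def total_loss(cls_loss, psl_loss, sphere_loss, prop_loss,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the four loss terms as in Eq. (1); weights are placeholders."""
    return cls_loss + alpha * psl_loss + beta * sphere_loss + gamma * prop_loss
```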

Figure 4: Predicted action proposals for a video clip containing the 'LongJump' category from THUMOS14. (a) 'GT' indicates the ground-truth segments belonging to action instances. (b) The model trained with the classification loss (CL) predicts only the discriminative segments. (c) Trained with the classification loss (CL) and partial segment loss (PSL), the model observes more segments belonging to the action instances. (d) After adding the sphere loss (SL), the segments belonging to the same action instance are combined. (e) Guided by the propagation loss (PL), more implicit segments belonging to action instances are predicted.

4.3 Classification Loss

Due to the variation in temporal duration, we use a top-k mean to aggregate $A^{cls}$ described in Section 4.2 into a single class score, similar to [8]. The class score for the $c$-th category, denoted by $a_c$, is defined as follows:

$a_c = \frac{1}{k} \max_{\Omega_c \subset \{1,\dots,T\},\ |\Omega_c| = k} \sum_{t \in \Omega_c} A^{cls}_{t,c}$   (2)

where $k = \lfloor T/s \rfloor$, and $s$ is a hyperparameter to control the ratio of selected segments in a video. Then, a probability mass function (PMF), $p$, is computed by employing softmax:

$p_c = \frac{\exp(a_c)}{\sum_{c'=1}^{C} \exp(a_{c'})}$   (3)

Finally, the classification loss (CL) is defined as follows:

$\mathcal{L}_{cls} = - \sum_{c=1}^{C} y_c \log p_c$   (4)

where $y_c$ is the ground-truth label for the $c$-th class described in Section 4.1.

Along with the training process, the CAS guided by the classification loss shrinks to only the discriminative parts rather than whole action instances. A few specifically activated action parts are capable of minimizing the action classification loss, but they are insufficient for optimizing action localization. The only goal of optimizing CL is to capture the action parts relevant for predicting the video label $y$, i.e., for distinguishing action categories. Along with training, the relevant parts become more and more discriminative, while the irrelevant parts, which contribute nothing to the prediction of $y$, are suppressed. As illustrated in Fig. 4 (b), for the 'LongJump' category, only the segments where the athlete jumps into the sand pit are predicted. This is because these parts are informative enough to distinguish 'LongJump' from other action categories, such as 'HighJump'.
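As a concrete reference, below is a minimal PyTorch sketch of the top-k aggregation, softmax and cross-entropy of Equations (2)-(4); the batched tensor layout and the normalization of the multi-hot label (in the style of [8]) are our assumptions.

```python
import torch
import torch.nn.functional as F

def classification_loss(cas, labels, s=8):
    """Top-k mean over time followed by softmax cross-entropy, Eqs. (2)-(4).

    cas:    (B, T, C) classification CAS
    labels: (B, C) multi-hot video-level labels
    s:      hyperparameter controlling the ratio of selected segments
    """
    B, T, C = cas.shape
    k = max(1, T // s)
    video_scores = cas.topk(k, dim=1).values.mean(dim=1)        # (B, C), Eq. (2)
    pmf = F.softmax(video_scores, dim=-1)                       # Eq. (3)
    gt = labels / labels.sum(dim=1, keepdim=True).clamp(min=1)  # assumed normalization
    return -(gt * torch.log(pmf + 1e-8)).sum(dim=1).mean()      # Eq. (4)
```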

4.4 Partial Segment Loss

In order to tackle the problem described in Section 4.3, we introduce a partial segment loss under segment-level supervision. An intuitive choice is a segment-wise loss over all segments, which urges the localization CAS $A^{loc}$ to fit the segment labels $G$ everywhere. However, the ground-truth labels cover only part of each action instance rather than its entirety. Therefore, we introduce a partial segment loss which only considers the cross-entropy loss for labeled segments and effectively ignores the other parts. We first conduct a softmax on $A^{loc}$ to obtain the normalized CAS, defined as:

$\hat{A}_{t,c} = \frac{\exp(A^{loc}_{t,c})}{\sum_{c'=1}^{C} \exp(A^{loc}_{t,c'})}$   (5)

Then, the partial segment loss can be defined as follows:

$\mathcal{L}_{psl} = - \frac{1}{\sum_{t,c} G_{t,c}} \sum_{t=1}^{T} \sum_{c=1}^{C} G_{t,c} \log \hat{A}_{t,c}$   (6)

where $G$ is the segment label described in Section 4.1. This partial segment loss can be seen as a sampling of the full segment-wise loss, which is consistent with the annotation of segment-level supervision: the process of labeling segments can itself be regarded as a sampling of action instances. Guided by the partial segment loss, the model observes more essential parts, as shown in Fig. 4 (c).
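A minimal sketch of the partial segment loss follows, assuming (as in our reading of Equation (5)) that the softmax runs over the class axis; tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def partial_segment_loss(loc_cas, seg_labels):
    """Cross-entropy restricted to the labeled segments, Eqs. (5)-(6).

    loc_cas:    (T, C) localization CAS
    seg_labels: (T, C) segment labels G; all-zero rows are unlabeled and ignored
    """
    log_p = F.log_softmax(loc_cas, dim=-1)              # Eq. (5), softmax over classes
    num_labeled = seg_labels.sum().clamp(min=1)         # number of labeled entries
    return -(seg_labels * log_p).sum() / num_labeled    # Eq. (6)
```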

4.5 Sphere Loss

As described in Section 4.2, we denote the similarity matrix between segments as $S$. To ensure that features with the same category have higher similarity than those with different categories, the discriminative feature $F$ should have the property that its maximal intra-class distance is smaller than its minimal inter-class distance. The A-Softmax loss introduced in [24] learns such features by constructing a discriminative angular distance metric, making the decision boundary more stringent and separated. However, the A-Softmax loss is designed for face recognition, where it is trained on single-label examples with no background.

In our task, we integrate a sphere loss adapted from the A-Softmax loss [24] into our network for multi-label action instances. Since an untrimmed video contains many background clips, a feature aggregation is needed to obtain a class-specific feature free of background regions. Specifically, we first compute the high attention $\lambda_{t,c}$ along the temporal axis for class $c$ as follows:

$\lambda_{t,c} = \frac{\exp(\hat{A}_{t,c})}{\sum_{t'=1}^{T} \exp(\hat{A}_{t',c})}$   (7)

where $\hat{A}$ is the normalized CAS in Equation (5). We refer to $\lambda$ as attention, as it attends to the portions of the video where an action of a certain category occurs. For example, if $\lambda_{t,c}$ is close to 1 instead of 0, the $t$-th segment of the video may contain an action instance of category $c$. Then, as in [8], we obtain the high-attention-region aggregated class-wise feature vector for category $c$ as follows:

$f_c = \frac{\sum_{t=1}^{T} \lambda_{t,c}\, F_{t,:}}{\sum_{t=1}^{T} \lambda_{t,c}}$   (8)

where $F$ is the discriminative feature described in Section 4.2. Following [24], we define the weights of the fully connected layer as $W$, and $\theta_{j,c}$ is the angle between $W_{j,:}$ and $f_c$. Then, the A-Softmax loss for category $c$ is formulated as:

$\mathcal{L}_{sph}^{c} = - \log \frac{\exp\!\left(\lVert f_c \rVert\, \psi(\theta_{c,c})\right)}{\exp\!\left(\lVert f_c \rVert\, \psi(\theta_{c,c})\right) + \sum_{j \neq c} \exp\!\left(\lVert f_c \rVert \cos\theta_{j,c}\right)}$   (9)

where $\psi(\theta_{c,c}) = (-1)^{k} \cos(m\theta_{c,c}) - 2k$, $\theta_{c,c} \in [\frac{k\pi}{m}, \frac{(k+1)\pi}{m}]$ and $k \in [0, m-1]$. $m$ is an integer that controls the size of the angular margin. More detailed explanation and proof can be found in [24]. Then, the sphere loss for the multi-label action categories of a video can be formulated as:

$\mathcal{L}_{sph} = \frac{1}{\sum_{c=1}^{C} y_c} \sum_{c:\, y_c = 1} \mathcal{L}_{sph}^{c}$   (10)

The predicted proposals after adding the sphere loss are shown in Fig. 4 (d).
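The sketch below illustrates one way the class-wise feature aggregation and an A-Softmax-style angular margin could be implemented; using a temporal attention directly as input, the weight normalization, and the per-category averaging are assumptions of this sketch rather than details confirmed above.

```python
import math
import torch
import torch.nn.functional as F

def sphere_loss(feat, attention, weight, video_labels, m=4):
    """A-Softmax-style loss on attention-aggregated class-wise features.

    feat:         (T, D) discriminative segment features F
    attention:    (T, C) temporal attention (e.g., derived from the normalized CAS)
    weight:       (C, D) weights of the angular FC layer
    video_labels: (C,) multi-hot video-level labels
    m:            integer angular margin, as in SphereFace [24]
    """
    w = F.normalize(weight, dim=1)                       # unit-norm class weights
    losses = []
    for c in torch.nonzero(video_labels).flatten():
        att = attention[:, c]
        f_c = (att.unsqueeze(1) * feat).sum(0) / att.sum().clamp(min=1e-6)  # Eq. (8)
        cos = torch.clamp(F.normalize(f_c, dim=0) @ w.t(), -1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos[c])                       # angle to the target class
        k = torch.floor(m * theta / math.pi)
        sign = torch.where(k % 2 == 0, torch.ones_like(k), -torch.ones_like(k))
        psi = sign * torch.cos(m * theta) - 2.0 * k      # psi(theta) of A-Softmax
        logits = f_c.norm() * cos                        # ||f_c|| cos(theta_j)
        pos = f_c.norm() * psi                           # margin-adjusted target logit
        denom = pos.exp() + logits.exp().sum() - logits[c].exp()
        losses.append(denom.log() - pos)                 # -log softmax with margin
    if not losses:
        return feat.new_zeros(())
    return torch.stack(losses).mean()
```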

4.6 Propagation Loss

In segment-level supervision, only part of each action instance is labeled, compared with the entire extent in fully supervised methods. This setting is similar to semi-supervised learning, which combines a small amount of labeled data with a large amount of unlabeled data during training. In many semi-supervised algorithms [25, 26], a key assumption is the structure assumption: points within the same structure (such as a cluster or a manifold) are likely to have the same label. Under this assumption, the aim is to use the structure to propagate labels from labeled data to unlabeled data. In [21], the authors add a semi-supervised loss (regularizer) to the supervised loss on the entire network's output:

$\sum_{i=1}^{L} \ell\!\left(f(x_i), y_i\right) + \lambda \sum_{i,j=1}^{L+U} \ell_{emb}\!\left(f(x_i), f(x_j), W_{ij}\right)$   (11)

where $L$ and $U$ indicate the number of labeled and unlabeled examples respectively, $f$ indicates the encoding function, and $W_{ij}$ specifies the similarity or dissimilarity between examples $x_i$ and $x_j$. $\ell$ is the loss for labeled examples and $\ell_{emb}$ is the loss between pairs of examples. $\lambda$ is the trade-off hyperparameter. In our approach, we rewrite Equation (11) as follows:

$\mathcal{L}_{psl} + \gamma\, \mathcal{L}_{prop}$   (12)

where the propagation loss is defined as follows:

$\mathcal{L}_{prop} = \frac{1}{T^{2}} \sum_{i=1}^{T} \sum_{j=1}^{T} S_{i,j}\, \big\lVert \hat{A}_{i,:} - \hat{A}_{j,:} \big\rVert_2^{2}$   (13)

where $S$ is the similarity matrix described in Section 4.2: the partial segment loss plays the role of the supervised term in Equation (11), and the similarity-weighted pairwise term propagates labels to unlabeled segments. With the propagation loss, the model can propagate the labeled segments to implicit segments by measuring their similarity, as shown in Fig. 4 (e).
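A minimal sketch of a propagation regularizer of this similarity-weighted form follows; it assumes the Laplacian-style smoothness reading of Equation (13) above, which is our reconstruction rather than a confirmed formula.

```python
import torch

def propagation_loss(sim, cas_norm):
    """Similarity-weighted smoothness over the normalized CAS, in the spirit of Eq. (13).

    sim:      (T, T) similarity matrix S = F F^T
    cas_norm: (T, C) normalized localization CAS
    """
    diff = cas_norm.unsqueeze(1) - cas_norm.unsqueeze(0)   # (T, T, C) pairwise differences
    pairwise = diff.pow(2).sum(dim=-1)                     # squared distance per pair
    return (sim * pairwise).sum() / (sim.shape[0] ** 2)
```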

4.7 Classification and Localization

We first obtain the final CAS, denoted by $A^{fin}$. Then, the class scores $a_c$ and the PMF $p$ are computed from $A^{fin}$ by Equations (2) and (3). As in [8], we use the computed PMF with a threshold to decide which action categories the video contains. For localization, we discard the categories whose class scores are below a certain threshold (0 in our experiments). Thereafter, for each of the remaining categories, we apply a threshold to $A^{fin}$ along the temporal axis to obtain the action proposals.
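The inference step described above can be sketched as follows; the thresholds, the segment duration and the grouping of consecutive above-threshold segments into proposals are illustrative assumptions.

```python
def localize(cas_final, class_scores, class_thresh=0.0, seg_thresh=0.5, seg_sec=0.64):
    """Keep classes whose video-level score passes `class_thresh`, then threshold
    their CAS along time and merge consecutive kept segments into proposals.

    cas_final:    (T, C) final CAS;  class_scores: (C,) video-level class scores.
    Returns a list of (class, start_sec, end_sec, score) tuples.
    """
    proposals = []
    for c, score in enumerate(class_scores.tolist()):
        if score < class_thresh:
            continue
        keep = (cas_final[:, c] > seg_thresh).tolist()
        start = None
        for t, on in enumerate(keep + [False]):        # sentinel flushes the last run
            if on and start is None:
                start = t
            elif not on and start is not None:
                proposals.append((c, start * seg_sec, t * seg_sec, score))
                start = None
    return proposals
```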

Loss terms            one-segment   two-segment
CL (baseline)              19.4          19.4
CL + PSL                   27.0          28.5
CL + SL                    24.4          24.4
CL + PL                    19.7          19.7
CL + PSL + SL              28.6          29.9
CL + PSL + PL              27.2          29.1
CL + SL + PL               26.1          26.1
CL + PSL + SL + PL         29.3          31.6
Table 1: Results (mAP, %) with different loss terms on THUMOS14 at IoU=0.5. 'one-segment' and 'two-segment' indicate labeling one segment and two segments for each action instance respectively.

5 Experiments

5.1 Experimental Setup

Datasets. We evaluate our method on two popular action localization benchmark datasets: THUMOS14 [27] and ActivityNet [28]. The THUMOS14 dataset has temporal annotations for a subset of videos in the validation and test sets for 20 classes. ActivityNet1.2 has 4,819 videos for training, 2,383 videos for validation, and 2,480 videos for testing whose labels are withheld. ActivityNet1.3 contains 19,994 videos with 200 action classes.

Methods mAP @ IoU
0.1 0.2 0.3 0.4 0.5 0.6 0.7 AVG
Fully-supervised Methods
S-CNN [16] 47.7 43.5 36.3 28.7 19.0 - 5.3 35.0
R-C3D [29] 54.5 51.5 44.8 35.6 28.9 - - 43.1
SSN [30] 60.3 56.2 50.6 40.8 29.1 - - 47.4
Chao et al. [2] 59.8 57.1 53.2 48.5 42.8 33.8 20.8 52.3
GTAN [11] 69.1 63.7 57.8 47.2 38.8 - - 55.32
Weakly-supervised Methods
Hide-and-Seek [14] 36.4 27.8 19.5 12.7 6.8 - - 20.6
Zhong et al. [10] 45.8 39.0 31.1 22.5 15.9 - - 30.9
STPN (UNT) [9] 45.3 38.8 31.1 23.5 16.2 9.8 5.1 31.0
W-TALC (UNT) [8] 49.0 42.8 32.0 26.0 18.8 - 6.2 33.7
AutoLoc (UNT) [31] - - 35.8 29.0 21.2 13.4 5.8 -
Liu et al. (UNT) [17] 53.5 46.8 37.5 29.1 19.9 12.3 6.0 37.4
BaSNet (UNT) [13] 56.2 50.3 42.8 34.7 25.1 17.1 9.3 41.8
Ours (UNT) 59.1 53.5 45.7 37.5 28.4 20.3 11.8 44.8
STPN (I3D) [9] 52.0 44.7 35.5 25.8 16.9 9.9 4.3 35.0
W-TALC (I3D) [8] 55.2 49.6 40.1 31.1 22.8 - 7.6 39.8
Liu et al. (I3D) [17] 57.4 50.8 41.2 32.1 23.1 15.0 7.0 40.9
3C-Net (I3D) [12] 59.1 53.5 44.2 34.1 26.6 - 8.1 43.5
BaSNet (I3D) [13] 58.2 52.3 44.6 36.0 27.0 18.6 10.4 43.6
Ours (I3D) 61.6 55.8 48.2 39.7 31.6 22.0 13.8 47.4
Table 2: Action localization performance comparison (mAP) of our method with the state-of-the-art methods on the THUMOS14 dataset. The mAP values at different IoU thresholds and the average mAP (0.1:0.1:0.5) are presented. UNT and I3D are abbreviations for UntrimmedNet features and I3D features respectively. Our two-segment model with full loss terms achieves the best performance at most IoUs with both UntrimmedNet and I3D features.

Evaluation metric. Following previous methods [8, 9, 12, 13], we use the standard evaluation protocol provided by the two datasets. For action localization, the evaluation protocol is based on the mean Average Precision (mAP) at different intersection over union (IoU) thresholds. For multi-label action classification, we use the predicted video-level scores to compute the mAP score for evaluation.

Implementation details. We use the corresponding repositories to extract the UntrimmedNet [1] and I3D [32] features; the fused feature concatenates the RGB and optical-flow streams of the corresponding extractor. As in [8, 12], we do not finetune the feature extractors. The trade-off hyperparameters $\alpha$, $\beta$ and $\gamma$ in Equation (1) weight the partial segment, sphere and propagation losses respectively. Different from previous video-level supervision methods, which use a fixed number, the ratio of selected segments $s$ in Equation (2) is set according to the number of labeled segments in each video. All of our models are implemented in PyTorch [33] and trained with Python 3.6 on an Ubuntu 16.04 system with a 12 GB NVIDIA Titan Xp GPU.

5.2 Exploratory Experiments

In the following experiments, we take I3D [32] as the feature extractor.

Ablation study. We set the model guided by the classification loss (CL) alone as the baseline. The comparison of temporal action localization performance (mAP) with different loss terms on THUMOS14 at IoU=0.5 is shown in Table 1. The baseline model obtains a mAP score of 19.4%. The partial segment loss (PSL) significantly improves the performance: with one-segment labels, combining the classification loss and partial segment loss (CL+PSL) obtains a mAP score of 27.0%, improving 7.6% over CL; with two-segment labels, i.e., with more labeled segments, the performance is further improved to 28.5%, 9.1% over CL. The sphere loss is also beneficial to localization because it generates more discriminative features: for instance, the integration of the classification loss and sphere loss (CL+SL) obtains a mAP score of 24.4%, improving 5.0% over CL. With only the propagation loss added (CL+PL), the performance is hardly improved (19.7%), since the propagation loss relies on the similarity information; however, PL can propagate predicted regions to implicit parts on top of the other loss terms. Guided by CL+PSL+PL, we obtain better performance than CL+PSL for both one-segment and two-segment labels, and PL also improves over CL+SL (26.1% vs. 24.4%). With the full combination CL+PSL+SL+PL, the action localization performance is improved to 29.3% and 31.6% mAP for one-segment and two-segment labels respectively.

Comparisons of the trade-off between annotation time and performance. To evaluate annotation time, we define a new metric named the annotation-duration ratio, $r = t_a / t_d$, where $t_a$ and $t_d$ indicate the annotation time and the duration of the videos respectively. Using the COIN annotation tool [22], we measure $r$ for video-level, one-segment, two-segment and full supervision on the THUMOS14 dataset, and repeat the measurement on the sampled ActivityNet videos. We present the trade-off between annotation time and performance on the THUMOS14 dataset in Fig. 5, where the x-axis is the annotation-duration ratio described above and the y-axis is the temporal action localization performance (mAP). As Fig. 5 indicates, our approach with segment-level supervision significantly improves the performance compared with video-level supervision methods, at the cost of only a little more annotation time.

  Methods mAP @ IoU
0.5 0.75 0.95 AVG
Fully-supervised Methods
SSN [30] 41.3 27.0 6.1 26.6
Weakly-supervised Methods
W-TALC[8] 37.0 - - 18.0
Liu et al. [17] 36.8 22.0 5.6 22.4
3C-Net [12] 37.2 23.7 9.2 21.7
BaSNet [13] 38.5 24.2 5.6 24.3
Ours 41.7 26.7 6.3 26.4
Table 3: Action localization performance comparison (mAP) of our method with the state-of-the-art methods on the ActivityNet1.2 dataset. 'AVG' means the average mAP at IoU thresholds 0.5:0.05:0.95.
  Methods mAP @ IoU
0.5 0.75 0.95 AVG
Fully-supervised Methods
GTAN [11] 52.6 34.1 8.9 34.3
BMN [6] 50.1 34.8 8.3 33.9
Weakly-supervised Methods
STPN [9] 29.3 16.9 2.6 -
Liu et al. [17] 34.0 20.9 5.7 21.2
BaSNet [13] 34.5 22.5 4.9 22.2
Ours 37.7 25.6 6.8 24.8
Table 4: Action localization performance comparison (mAP) of our method with the state-of-the-art methods on the ActivityNet1.3 dataset. 'AVG' means the average mAP at IoU thresholds 0.5:0.05:0.95.

5.3 Comparisons with the State-of-the-art

We conduct experiments on THUMOS14 and ActivityNet datasets to compare with several state-of-the-art techniques.

Action localization. Table 2 reports the comparison of our method with existing approaches on the THUMOS14 dataset. We report mAP scores at different IoU thresholds (0.1:0.1:0.7); 'AVG' represents the average mAP at IoU thresholds from 0.1 to 0.5. Results show that our model performs better than previous video-level weakly supervised methods at all IoU thresholds for both the UNT and I3D feature extractors. Specifically, for the average mAP from 0.1 to 0.5, our method improves the best previously reported performance from 41.8% to 44.8% with the UNT features. The performance is further improved by using I3D features, where we achieve an average mAP of 47.4%, improving 3.8% over BaSNet [13]. Table 3 shows the state-of-the-art comparison on the ActivityNet1.2 dataset. Following other works [12, 13], we report the average mAP at IoU thresholds 0.5:0.05:0.95. Our approach achieves an average mAP of 26.4%, which surpasses all existing video-level weakly supervised methods. Furthermore, our segment-level supervision is also competitive against the fully supervised method SSN [30]. Table 4 illustrates the performance comparison of our method with the state-of-the-art on the ActivityNet1.3 dataset. Our method with segment-level supervision achieves an average mAP of 24.8%, outperforming BaSNet with video-level supervision by 2.6%.

Action classification. Table 5 reports the action classification performance (mAP) of our method compared with the state-of-the-art methods on the THUMOS14 and ActivityNet1.2 datasets. Since the ratio of selected segments varies with the number of labeled segments in each video, it captures more appropriate sampling information than a fixed number. In comparison with the existing approaches, our method achieves competitive results of 87.6% and 93.2% mAP on THUMOS14 and ActivityNet1.2 respectively.

Methods THUMOS14 ActivityNet1.2
iDT+FV [34] 63.1 66.5
C3D [35] - 74.1
TSN [36] 67.7 88.8
W-TALC [8] 85.6 93.2
3C-Net [12] 86.9 92.4
Ours 87.6 93.2
Figure 5: Trade-off between annotation time and performance.
Table 5: Action classification performance comparison of our method with the state-of-the-art methods on the THUMOS14 and ActivityNet 1.2 datasets.
Figure 6: Illustration of the temporal predicted regions on the THUMOS14 and ActivityNet datasets. ‘GT’ indicates the ground-truth. ‘Baseline’ means the model only with CL. Our method greatly improves the performance of temporal action localization on all three action videos.

5.4 Qualitative Results

The qualitative analysis of our approach is shown in Fig. 6. The top row shows sampled segments of the action videos. We choose three actions ('CricketBowling', 'Ping-pong' and 'Calf roping') from THUMOS14 and ActivityNet to evaluate our method. 'GT' denotes the ground-truth segments. The baseline model with only CL localizes only the strongly discriminative regions, and its IoU decreases as the action duration increases. For instance, in Fig. 6 (a), the baseline model predicts segments which have high overlap with GT for the short-duration 'CricketBowling'. However, in Fig. 6 (b), a lower overlap is obtained for the longer-duration 'Ping-pong', and the overlap is lower still in Fig. 6 (c). With our proposed method, more complete and correct action segments are detected. This indicates that our method can significantly improve weakly supervised temporal action localization.

6 Conclusion

In this work, we propose a new segment-level supervision setting for weakly supervised temporal action localization, which costs almost the same annotation time as video-level supervision. Based on the segment-level supervision, we devise a localization module guided by a partial segment loss, a sphere loss and a propagation loss. Compared with video-level supervision, our approach, which exploits the segment labels and propagates them to implicit segments based on the discriminative features, significantly improves the completeness of the predicted segments.

References

  • [1] Wang, L., Xiong, Y., Lin, D., Van Gool, L.: Untrimmednets for weakly supervised action recognition and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 4325–4334

  • [2] Chao, Y.W., Vijayanarasimhan, S., Seybold, B., Ross, D.A., Deng, J., Sukthankar, R.: Rethinking the faster r-cnn architecture for temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 1130–1139
  • [3] Lin, T., Zhao, X., Su, H., Wang, C., Yang, M.: Bsn: Boundary sensitive network for temporal action proposal generation. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 3–19
  • [4] Alwassel, H., Caba Heilbron, F., Ghanem, B.: Action search: Spotting actions in videos and its application to temporal action localization. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 251–266
  • [5] Zeng, R., Huang, W., Tan, M., Rong, Y., Zhao, P., Huang, J., Gan, C.: Graph convolutional networks for temporal action localization. arXiv preprint arXiv:1909.03252 (2019)
  • [6] Lin, T., Liu, X., Li, X., Ding, E., Wen, S.: Bmn: Boundary-matching network for temporal action proposal generation. arXiv preprint arXiv:1907.09702 (2019)
  • [7] Zhao, H., Yan, Z., Wang, H., Torresani, L., Torralba, A.: Slac: A sparsely labeled dataset for action classification and localization. (2017)
  • [8] Paul, S., Roy, S., Roy-Chowdhury, A.K.: W-talc: Weakly-supervised temporal activity localization and classification. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 563–579
  • [9] Nguyen, P., Liu, T., Prasad, G., Han, B.: Weakly supervised action localization by sparse temporal pooling network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 6752–6761
  • [10] Zhong, J.X., Li, N., Kong, W., Zhang, T., Li, T.H., Li, G.: Step-by-step erasion, one-by-one collection: A weakly supervised temporal action detector. arXiv preprint arXiv:1807.02929 (2018)
  • [11] Long, F., Yao, T., Qiu, Z., Tian, X., Luo, J., Mei, T.: Gaussian temporal awareness networks for action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019) 344–353
  • [12] Narayan, S., Cholakkal, H., Khan, F.S., Shao, L.: 3c-net: Category count and center loss for weakly-supervised action localization. In: Proceedings of the IEEE International Conference on Computer Vision. (2019) 8679–8687
  • [13] Lee, P., Uh, Y., Byun, H.: Background suppression network for weakly-supervised temporal action localization. In: AAAI. (2020)
  • [14] Singh, K.K., Lee, Y.J.: Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE (2017) 3544–3553
  • [15] Lin, T., Zhao, X., Shou, Z.: Single shot temporal action detection. In: Proceedings of the 25th ACM international conference on Multimedia, ACM (2017) 988–996
  • [16] Shou, Z., Wang, D., Chang, S.F.: Temporal action localization in untrimmed videos via multi-stage cnns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 1049–1058
  • [17] Liu, D., Jiang, T., Wang, Y.: Completeness modeling and context separation for weakly supervised temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019) 1298–1307
  • [18] Yuan, Y., Lyu, Y., Shen, X., Tsang, I.W., Yeung, D.Y.: Marginalized average attentional network for weakly-supervised learning. arXiv preprint arXiv:1905.08586 (2019)
  • [19] Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. MIT press (2016)
  • [20] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1) (2014) 1929–1958
  • [21] Weston, J., Ratle, F., Mobahi, H., Collobert, R.: Deep learning via semi-supervised embedding. In: Neural networks: Tricks of the trade. Springer (2012) 639–655
  • [22] Tang, Y., Ding, D., Rao, Y., Zheng, Y., Zhang, D., Zhao, L., Lu, J., Zhou, J.: Coin: A large-scale dataset for comprehensive instructional video analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2019)
  • [23] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 2921–2929
  • [24] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2017) 212–220
  • [25] Chapelle, O., Weston, J., Schölkopf, B.: Cluster kernels for semi-supervised learning. In: Advances in neural information processing systems. (2003) 601–608
  • [26] Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation. (2002)
  • [27] Idrees, H., Zamir, A.R., Jiang, Y.G., Gorban, A., Laptev, I., Sukthankar, R., Shah, M.: The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding 155 (2017) 1–23
  • [28] Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: Activitynet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 961–970
  • [29] Xu, H., Das, A., Saenko, K.: R-c3d: Region convolutional 3d network for temporal activity detection. In: Proceedings of the IEEE International Conference on Computer Vision. (2017) 5783–5792
  • [30] Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: Proceedings of the IEEE International Conference on Computer Vision. (2017) 2914–2923
  • [31] Shou, Z., Gao, H., Zhang, L., Miyazawa, K., Chang, S.F.: Autoloc: Weakly-supervised temporal action localization in untrimmed videos. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 154–171
  • [32] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 6299–6308
  • [33] Paszke, A., Gross, S., Chintala, S.: Pytorch deep learning framework. Web page (2017)
  • [34] Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision. (2013) 3551–3558
  • [35] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 4489–4497
  • [36] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: Proceedings of the European Conference on Computer Vision, Springer (2016) 20–36