ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization

03/28/2021 ∙ by Ziyi Liu, et al. ∙ Xi'an Jiaotong University, University of Illinois at Chicago, University at Buffalo

The objective of Weakly-supervised Temporal Action Localization (WS-TAL) is to localize all action instances in an untrimmed video with only video-level supervision. Due to the lack of frame-level annotations during training, current WS-TAL methods rely on attention mechanisms to localize the foreground snippets or frames that contribute to the video-level classification task. This strategy frequently confuses context with the actual action in the localization result. Separating action and context is a core problem for precise WS-TAL, but it is very challenging and has been largely ignored in the literature. In this paper, we introduce an Action-Context Separation Network (ACSNet) that explicitly takes into account context for accurate action localization. It consists of two branches (i.e., the Foreground-Background branch and the Action-Context branch). The Foreground-Background branch first distinguishes foreground from background within the entire video, while the Action-Context branch further separates the foreground into action and context. We associate video snippets with two latent components (i.e., a positive component and a negative component), and their different combinations can effectively characterize foreground, action and context. Furthermore, we introduce extended labels with auxiliary context categories to facilitate the learning of action-context separation. Experiments on THUMOS14 and ActivityNet v1.2/v1.3 datasets demonstrate that ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.




1 Introduction

Figure 1: The illustration of action, context and background in terms of frames and points in feature space. The green dashed line is the desired boundary for the localization task. However, based on the given video-level categorical labels, the blue dashed line is learned, due to the high co-occurrence and visual similarity of action and context. Existing methods frequently identify both red and green dots as actions. The main challenge in WS-TAL is how to isolate context from action instances with merely video-level categorical labels.

Figure 2: An overview of our main idea, i.e., using extended label with auxiliary context categories to guide the training of action/context attentions. Unfortunately, such an idea is nontrivial to implement due to “lack of explicit action-context constraint” and “lack of explicit supervision”.

Temporal Action Localization (TAL) aims to localize temporal starts and ends of specific action categories in a video. It serves as a fundamental tool for several practical applications such as action retrieval, intelligent surveillance and video summarization Lee, Ghosh, and Grauman (2012); Vishwakarma and Agrawal (2013); Asadiaghbolaghi et al. (2017); Kang and Wildes (2016); Yao, Lei, and Zhong (2019). Although fully supervised TAL methods have recently achieved remarkable progress Buch et al. (2017); Xu, Das, and Saenko (2017); Gao et al. (2017); Xu, Das, and Saenko (2017); Chao et al. (2018); Lin et al. (2018, 2019); Zeng et al. (2019), manually annotating the precise temporal boundaries of action instances in untrimmed videos is time-consuming and challenging. This limitation motivates the weakly supervised setting where only video-level categorical labels are provided for model training. Compared with temporal boundary annotations, video-level categorical labels are easier to collect, and they help avoid the localization bias introduced by human annotators.

Existing weakly-supervised temporal action localization (WS-TAL) methods Wang et al. (2017); Nguyen et al. (2018); Paul, Roy, and Roy-Chowdhury (2018); Nguyen, Ramanan, and Fowlkes (2019) leverage attention mechanisms to categorize snippets or sampled frames into foreground and background based on their contribution to the video-level classification task, i.e., to find the blue dashed line in Figure 1. Then temporal action localization is reformulated as selecting consecutive foreground snippets belonging to each category. However, the foreground localized through video-level categorization involves not only the actual action instance but also its surrounding context. As illustrated in Figure 1, context is snippets or frames that frequently co-occur with the action instances of a specific category but should not be included in their localization. Different from background, which is class-agnostic, context provides strong evidence for action classification and thus can be easily confused with the action instances. We believe separating the action instances and their context is a core problem in WS-TAL, and it is very challenging due to the co-occurrence nature.

The goal of this paper is to address the action-context separation (ACS) problem in the weakly-supervised setting so as to achieve more precise action localization. We first introduce auxiliary context categories for each action class during training. As shown in Figure 2, each video-level category is divided into two sub-categories, respectively corresponding to the actual action and its context. Prior methods exploit foreground attention to achieve foreground-background separation. However, this simple idea is not applicable to action-context separation due to two difficult issues. (1) Lack of explicit action-context constraints: The sum-to-one constraint  Nguyen, Ramanan, and Fowlkes (2019) of the foreground and background attention scores does not apply to action-context separation. (2) Lack of explicit supervision: Both action and context can contribute to action classification, so the only available video-level categorical labels cannot provide direct supervision for them.

To address these two difficult issues, we introduce the Action-Context Separation Network (ACSNet). As illustrated in Figure 3, it consists of two branches, i.e., the Foreground-Background branch (FB branch) and the Action-Context branch (AC branch). The FB branch divides an untrimmed video into foreground and background based on whether a snippet supports the video-level classification. This is achieved via snippet-level categorical predictions (SCPs) and snippet-level attention predictions (SAPs), e.g., foreground attention in Figure 2. Subsequently, the AC branch further divides the obtained foreground into action and context by associating each video snippet with two latent components, i.e., a positive component and a negative component. Different combinations of these two components respectively characterize the foreground, action and context. This enables effective action-context separation with only video-level supervision. Finally, the output of AC branch facilitates the TAL by providing (1) temporal action proposals with more accurate boundaries and (2) more reliable proposal confidence scores.

The contribution of this paper is summarized below.

  1. Prior WS-TAL approaches take it for granted that the foreground localized via the classification attention is equivalent to the actual action instance, and thus they unavoidably include the co-occurring context in the localization result. We address this challenge via a novel action-context separation network (ACSNet), which not only distinguishes foreground from background but also separates action and context within the foreground to achieve more precise action localization.

  2. The proposed ACSNet features a novel Action-Context branch. It can individually characterize foreground, action and context using different combinations of two latent components, i.e., the positive component and the negative component.

  3. We propose novel extended labels with auxiliary context categories. By explicitly decoupling the actual action and its context, this new representation facilitates effective learning of action-context separation.

  4. Extensive experimental results indicate the proposed ACSNet can effectively perform action-context separation. It significantly outperforms state-of-the-art methods on three benchmarks, and it is even comparable to recent fully-supervised methods.

2 Related Work of WS-TAL

Different from action recognition, which is essentially a classification task Feichtenhofer, Pinz, and Zisserman (2016); Simonyan and Zisserman (2014); Wang et al. (2016); Ji et al. (2013); Sun et al. (2015b); Tran et al. (2015); Feichtenhofer et al. (2019), TAL requires finer-grained predictions with temporal boundaries of the target action instances. WS-TAL methods address it without temporal annotations, a setting first introduced in Sun et al. (2015a). To distinguish action instances from background, the attention mechanism is widely adopted for foreground-background separation. UntrimmedNet Wang et al. (2017) formulates the attention mechanism as a soft selection module to localize the target action, and the final localization is achieved by thresholding the snippets’ action scores. STPN Nguyen et al. (2018) proposes a sparsity loss based on the soft selection module of UntrimmedNet, which can facilitate the selection of action instances. Nguyen et al. Nguyen, Ramanan, and Fowlkes (2019) characterize background by an additional background loss and introduce other losses to guide the attention. For better evaluation of temporal action proposals, W-TALC Paul, Roy, and Roy-Chowdhury (2018) proposes a co-activity loss to enforce the feature similarity among localized instances. AutoLoc Shou et al. (2018) uses an “outer-inner-contrastive loss” to predict and regress temporal boundaries. Liu et al. Liu, Jiang, and Wang (2019) exploit a multi-branch neural network to discover distinctive action parts and fuse them to ensure completeness. CleanNet Liu et al. (2019b) designs a “contrast score” by leveraging temporal contrast in SCPs to achieve end-to-end training of localization.

However, driven by the video-level classification labels, the existing attention mechanism is merely able to capture the difference between foreground and background for classification, instead of action and non-action for localization. The proposed ACSNet manages to distinguish action instances from their surrounding context, and we extend labels by introducing auxiliary context categories to make the framework trainable.

Figure 3: The framework of the proposed ACSNet, which has two branches, i.e., Foreground-Background branch and Action-Context branch. The input video is first processed by the feature embedding to get features from both spatial and temporal streams. The FB branch focuses on foreground-background separation while the AC branch focuses on action-context separation. Video-level labels are extended to facilitate the action-context separation.

3 Action-Context Separation Network

In this section, we introduce the extended video-level labels with auxiliary context categories (Section 3.1) and the proposed Action-Context Separation Network (ACSNet). As illustrated in Figure 3, the ACSNet consists of two branches, i.e., the Foreground-Background branch (FB branch) and the Action-Context branch (AC branch). After feature extraction from the given video (Section 3.2), the FB branch distinguishes the foreground from background (Section 3.2). The obtained foreground contains both action and context. Subsequently, the AC branch localizes the actual temporal action instances by performing action-context separation within the foreground (Section 3.3). To guide the training of ACS, additional losses are introduced (Section 3.4).

3.1 Extending Video-Level Labels

Suppose we are given a video with a video-level categorical label , where if contains the -th action category. is the total number of action categories, represents the background category. To guide the division of foreground into action and context, we extend with auxiliary context categories as


where and denote the -th action category and its corresponding context, respectively. As shown in Figure 3, is used in FB branch and is used in AC branch.
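The label extension can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the index layout (background at index 0, actions at 1..N, contexts at N+1..2N), the function name `extend_label`, and the reading that the foreground target activates both the action and its context entry are all assumptions made for this example.

```python
def extend_label(action_ids, num_classes):
    """Build illustrative label vectors for one video.

    Assumed (hypothetical) layout: index 0 is background, indices 1..N are
    action categories, and indices N+1..2N are the matching context categories.
    """
    n = num_classes
    base = [0] * (n + 1)        # original video-level label (FB branch)
    fg = [0] * (2 * n + 1)      # foreground target: action and context both on
    act = [0] * (2 * n + 1)     # action target: action entry only
    ctx = [0] * (2 * n + 1)     # context target: context entry only
    for k in action_ids:        # k is a 1-based action category index
        base[k] = 1
        fg[k] = 1
        fg[n + k] = 1
        act[k] = 1
        ctx[n + k] = 1
    return base, fg, act, ctx


base, fg, act, ctx = extend_label([2], num_classes=3)
print(base, fg, act, ctx)
```

The point of the sketch is only that one video-level category yields distinct targets for foreground, action and context, which is what lets the same classifier be trained on three differently pooled features.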

3.2 Baseline Modules

This section introduces the baseline modules used in ACSNet, including feature extraction and FB branch based on the attention mechanism. While they are not our main contribution, we introduce them for completeness. Similar modules have been explored and adopted by existing methods Nguyen et al. (2018); Paul, Roy, and Roy-Chowdhury (2018); Nguyen, Ramanan, and Fowlkes (2019); Lee, Uh, and Byun (2020).

Feature Extraction

The input of the feature extraction module is the given video , which is divided into non-overlapping snippets. The outputs are the corresponding features of each snippet. For each snippet , the corresponding -dimensional features are extracted from two streams, i.e., the spatial stream (RGB) and the temporal stream (optical flow), denoted as and , respectively. Afterwards, the video is represented as and .

For notational simplicity, we use superscript “” to indicate the notations used in both streams in the rest of the paper. The notations of the spatial/temporal stream can be obtained by substituting the superscript “” with “rgb/flow”. For example, can represent either or .

Foreground-Background Branch

The goal of the FB branch is to divide the entire video into two parts, i.e., foreground and background, which can be trained by the video-level categorical label .

The inputs of FB branch are the features , and the outputs are the snippet-level attention predictions (SAPs, ) and the snippet-level classification predictions (SCPs, ). Accordingly, FB branch consists of two sub-modules, i.e., attention module () and Foreground-Background classification module (). The SAPs and SCPs of each stream are obtained by


Subsequently, the outputs of two streams are weighted to get the final SAPs and SCPs as


where by default in our experiments. We implement the attention module with a fully-connected (FC) layer followed by a sigmoid activation function, and the Foreground-Background classification module with an FC layer.

To train and with only video-level label, video-level prediction is needed. Therefore, we calculate the video-level foreground feature as


Similarly, the video-level background feature is obtained by


After obtaining and , we feed them into to obtain the video-level prediction, i.e., the foreground prediction () and background prediction (), defined as


Given video-level predictions in Eq.(8), the FB branch can be trained via regular cross-entropy loss. For , its label is , where if contains the -th action category, as shown in Figure 3. While for , assuming that all videos contain background snippets, its label is always and .
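As a rough illustration of how snippet-level attention turns snippet features into video-level foreground and background features, the sketch below uses a normalized attention-weighted average, with one minus the attention as the background weight. That complementary weighting, the `eps` term, and the function name `pool_video_features` are assumptions for this example; the exact pooling used by the FB branch is defined by the equations of this section.

```python
def pool_video_features(features, attention, eps=1e-8):
    """Attention-weighted temporal pooling (illustrative sketch).

    features:  list of T feature vectors (lists of floats, equal length)
    attention: list of T foreground attention values in [0, 1]
    Returns (foreground_feature, background_feature).
    """
    dim = len(features[0])
    bg_attention = [1.0 - a for a in attention]  # assumed complementary weight

    def weighted_mean(weights):
        total = sum(weights) + eps               # eps avoids division by zero
        return [
            sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)
        ]

    return weighted_mean(attention), weighted_mean(bg_attention)


fg_feat, bg_feat = pool_video_features(
    [[1.0, 0.0], [0.0, 1.0]], attention=[1.0, 0.0]
)
print(fg_feat, bg_feat)
```

Both pooled features would then be passed through the same classification module, so the attention is trained only through the video-level cross-entropy losses.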

3.3 Action-Context Branch

The attention mechanism trained by will be distracted by context because both action and context can support video-level classification. To avoid such distraction, after distinguishing the foreground from background, we further separate action and context within the foreground to locate the actual action instances in this section.

The inputs of the AC branch are features from two streams ( obtained in Section 3.2) and SAPs ( obtained in Section 3.2). The AC branch consists of three sub-modules, i.e., latent components generation, latent components combination, and action-context separation.

Latent Components Generation.

We introduce the concept of positive component () and negative component () to characterize foreground, action and context. Assuming the foreground is represented by two latent components, we define the one corresponding to the actual action as positive component, while the other one as negative component. They are obtained similarly as the SAPs in Eq.(2), by feeding features into positive module () and negative module ()



The positive and negative modules share the same architecture (parameters are not shared): two temporal convolution (Conv1d) layers, with a ReLU after the first layer and a sigmoid activation function after the second.

Latent Components Combination.

Given and , we use the combination of them to construct the snippet-level foreground attention (), action attention (), and context attention (). Specifically, for each stream, we have



denotes the sigmoid function. Subsequently, the outputs from two streams are fused by weighted average similar to Eq.(



where . For notational simplicity, we use subscript “” to denote either “fg”, “” or“” if necessary. By substituting the subscript “” with “fg//”, // are obtained following Eq.(13).

Instead of directly imposing simple constraints like those between foreground and background following Nguyen, Ramanan, and Fowlkes (2019), i.e., , we adopt the combinations of and to characterize and individually. We compare different approaches to obtain in the supplementary material.

Action-Context Separation.

After obtaining , and , we can start the action-context separation by leveraging label with auxiliary context categories (i.e., introduced in Section 3.1). First of all, we select all temporal indices corresponding to foreground snippets as


where denotes the cardinality (number of elements). Subsequently, the video-level feature representations of foreground, action and context are obtained as


where and is the concatenated feature from both streams and means concatenation. By substituting the subscript “” with “fg//”, , and are calculated following Eq.(15). Afterwards, they are fed into the action-context classification module to get the video-level action-context prediction as


Different from the video-level prediction from FB branch (i.e., in Eq.(8)), provides predictions on both action and context categories. Specifically, if the video contains the -th category, the label for is , where . While for and , the labels are () and (), respectively, as shown in Figure 3. After obtaining and the corresponding labels, the AC branch is also trained via regular cross-entropy loss.

Applying to each snippet, the snippet-level action-context predictions are obtained as


where is the concatenated feature. is leveraged to promote the action and suppress the context, by defining an “action-context offset ()” as


where (or ) is the prediction of the -th action (or corresponding context) of the -th snippet. Intuitively, means “offsets” for the -th class of the -th snippet, comparing the prediction of action () with that of context ().
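One plausible reading of the action-context offset is a per-class difference between the action prediction and the prediction of its context category. The sketch below assumes exactly that (a plain subtraction) and assumes a hypothetical prediction layout with background at index 0, actions at 1..N and contexts at N+1..2N; neither detail is confirmed by the text here.

```python
def action_context_offset(snippet_preds, num_classes):
    """Per-snippet, per-class offset = action score minus context score.

    snippet_preds: list of T prediction vectors of length 2N + 1, using the
    assumed layout [background, action_1..action_N, context_1..context_N].
    Returns a T x N nested list of offsets.
    """
    n = num_classes
    return [
        [pred[k] - pred[n + k] for k in range(1, n + 1)]
        for pred in snippet_preds
    ]


offsets = action_context_offset([[0.1, 0.7, 0.2, 0.1, 0.6]], num_classes=2)
print(offsets)
```

Under this reading, a positive offset marks a snippet that looks more like the action than its context for that class, and a negative offset marks the opposite.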

In summary, the AC branch outputs snippet-level action score () and the action-context offset () for the subsequent localization task.

3.4 Additional Losses

In addition to the regular cross-entropy losses, more constraints are required to train the ACSNet successfully, since there are neither temporal annotations nor action/context annotations available. In this section, we introduce two additional losses to provide extra guidance for ACSNet training, i.e., and .

For guidance loss , due to the lack of ground truth labeled action or context categories, confusion between action and context (e.g., and , and ) will occur due to symmetry. Therefore, additional guidance should be introduced to distinguish action from context, which is achieved by minimizing . Specifically, the differences between two streams are leveraged. We adopt a weighted binary logistic regression loss function to guide and , where is denoted as


where and is a binary vector indicating positive and negative samples (snippets). is the prediction to be regressed. and . For action attention , positive time index set () and negative time index set () are defined as


where and indicate high and low thresholds, respectively. Intuitively, the snippets with high/low attentions on both streams are regarded as positive/negative samples for action snippets. For context attention , we assume context contains scenes (excluding action instances), so that the corresponding positive/negative snippet index sets are defined as


Subsequently, the guidance loss is calculated as


where (or ) indicates a d-dimensional vector filled with ones (or zeros).
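The selection of positive and negative snippets for the action attention can be illustrated as follows. The text states that snippets with high attention on both streams are positives and snippets with low attention on both streams are negatives; the threshold values 0.7 and 0.3 and the function name below are placeholders chosen for this example, not values from the paper. The context index sets follow a different rule and are not covered by this sketch.

```python
def action_index_sets(att_rgb, att_flow, high=0.7, low=0.3):
    """Pick pseudo-labelled snippets for the action attention (sketch).

    att_rgb, att_flow: per-snippet attention values from the two streams.
    high, low: hypothetical thresholds; snippets between them stay unlabelled.
    Returns (positive_indices, negative_indices).
    """
    pairs = list(zip(att_rgb, att_flow))
    positives = [t for t, (r, f) in enumerate(pairs) if r > high and f > high]
    negatives = [t for t, (r, f) in enumerate(pairs) if r < low and f < low]
    return positives, negatives


pos, neg = action_index_sets([0.9, 0.8, 0.2, 0.1], [0.8, 0.2, 0.1, 0.9])
print(pos, neg)
```

Snippets on which the two streams disagree fall into neither set, so they contribute nothing to the guidance loss in this reading.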

For , in order to encourage the two latent components to focus on the foreground, we adopt the Mean Squared Error (MSE) loss between and , denoted as


where is a Gaussian smoothing function. Finally, the AC branch is trained by minimizing the total loss , calculated as


where is the sum of cross-entropy losses mentioned in Section 3.3. is the balancing weight set as .
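The MSE term with Gaussian smoothing can be sketched with the standard library alone. Which two sequences are compared is not recoverable from the text above, so the example simply smooths one attention sequence and measures its mean squared error against another; the kernel width, the truncation radius and the renormalization at sequence boundaries are assumptions for illustration.

```python
import math


def gaussian_smooth(seq, sigma=1.0, radius=2):
    """Smooth a 1-D sequence with a truncated, renormalized Gaussian kernel."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    smoothed = []
    for t in range(len(seq)):
        num, den = 0.0, 0.0
        for k, w in zip(offsets, kernel):
            if 0 <= t + k < len(seq):   # ignore taps that fall off the ends
                num += w * seq[t + k]
                den += w
        smoothed.append(num / den)
    return smoothed


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


target = gaussian_smooth([0.0, 0.0, 1.0, 0.0, 0.0])
print(mse([0.0, 0.2, 0.6, 0.2, 0.0], target))
```

Smoothing the target before taking the error penalizes isolated spikes less harshly than a raw comparison would, which is a common reason to smooth attention sequences.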

4 Localization

After the inference, FB branch outputs SAPs (), SCPs () and AC branch outputs action score (), action-context offset (). These outputs are leveraged for the TAL task. We first introduce the TAL baseline using only outputs of FB branch. Secondly, we present the contribution of AC branch to the TAL task.

4.1 Localization Baseline

The localization baseline uses only outputs of FB branch. The temporal action proposals are generated by thresholding with . The evaluation (scoring) of temporal action proposals is based on .
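Proposal generation by thresholding amounts to grouping consecutive snippets whose score exceeds the threshold. A minimal sketch follows; the function name and the choice of a strict "greater than" comparison are assumptions for this example.

```python
def threshold_proposals(scores, threshold):
    """Group consecutive above-threshold snippets into (start, end) proposals.

    scores: per-snippet scores (e.g. attention values).
    Returns a list of inclusive (start_index, end_index) pairs.
    """
    proposals = []
    start = None
    for t, score in enumerate(scores):
        if score > threshold and start is None:
            start = t                           # a new run begins
        elif score <= threshold and start is not None:
            proposals.append((start, t - 1))    # the current run ends
            start = None
    if start is not None:                       # run reaches the last snippet
        proposals.append((start, len(scores) - 1))
    return proposals


print(threshold_proposals([0.1, 0.8, 0.9, 0.2, 0.7], threshold=0.5))
```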

After obtaining a proposal , where and denote the starting and ending snippet indices, respectively. is scored by leveraging the Outer-Inner-Contrastive loss Shou et al. (2018) as


where is the sequence for scoring. denotes the inflation length and is the averaging function. Specifically, when locating the -th action category based on , we make , which is the predictions of the -th action category of all snippets. After obtaining proposals and their scores, the TAL results are collected.
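The idea behind outer-inner-contrastive scoring is that a good proposal has high scores inside its boundaries and low scores just outside them. The sketch below scores a proposal as the mean of the inner region minus the mean of the inflated outer regions; treating the inflation length as a plain integer argument and returning 0 for an empty outer region are assumptions for this example.

```python
def oic_score(seq, start, end, inflation):
    """Outer-inner-contrastive style score for one proposal (sketch).

    seq: per-snippet scores for one action category.
    start, end: inclusive snippet indices of the proposal.
    inflation: number of snippets to look at on each side (hypothetical value).
    """
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    inner = seq[start:end + 1]
    left = seq[max(0, start - inflation):start]
    right = seq[end + 1:min(len(seq), end + 1 + inflation)]
    return mean(inner) - mean(left + right)


print(oic_score([0.0, 0.2, 0.9, 0.8, 0.1, 0.0], start=2, end=3, inflation=1))
```

A proposal that is too wide pulls low-scoring snippets into the inner mean, and one that is too narrow leaves high-scoring snippets in the outer mean, so both errors lower the score.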

4.2 Improving Localization by AC branch

The two critical steps of performing TAL are the generation and evaluation of proposals. The outputs of AC branch can improve both of them. For proposal generation, in addition to thresholding ( in Table 4), we also perform thresholding step on and ( and in Table 4). Since and are both action-aware and less susceptible to the influence of context, the proposals obtained by thresholding them can provide more accurate action boundaries and less context noise.

For proposal evaluation, we can improve the quality of to make the scores calculated by Eq.(4.1) more reliable using . Specifically, we improve by suppressing the context and promoting the action as


By replacing with in Eq.(4.1), we can evaluate proposals more accurately by alleviating the influence of context.

In summary, the contribution of the AC branch to TAL is reflected in three aspects, i.e., using its outputs ( and ) to improve proposal generation ( and ), and using to improve proposal scoring (). These three aspects are validated in Table 4.

Method             Feature   mAP@IoU                            AVG
                             0.3    0.4    0.5    0.6    0.7
Fully supervised
SSN (2017)         UNT       51.9   41.0   29.8   19.6   10.7   30.6
BSN (2018)         -         53.5   45.0   36.9   28.4   20.0   36.8
MGG (2019a)        I3D       53.9   46.8   37.4   29.5   21.3   37.8
G-TAD (2020)       -         54.5   47.6   40.2   30.8   23.4   39.3
Weakly supervised
STPN (2018)        UNT       31.1   23.5   16.2    9.8    5.1   17.1
W-TALC (2018)      UNT       32     26.0   18.8   10.9    6.2   18.8
AutoLoc (2018)     UNT       35.8   29.0   21.2   13.4    5.8   21.0
CleanNet (2019b)   UNT       37.0   30.9   23.9   13.9    7.1   22.6
ACSNet (Ours)      UNT       40.3   33.8   26.7   16.8    9.2   25.4
STPN (2018)        I3D       35.5   25.8   16.9    9.9    4.3   18.5
MAAN (2019)        I3D       41.1   30.6   20.3   12.0    6.9   22.2
W-TALC (2018)      I3D       40.1   31.1   22.8   14.5    7.6   23.2
Liu et al. (2019)  I3D       41.2   32.1   23.1   15.0    7.0   23.7
BM (2019)          I3D       46.6   37.5   26.8   17.6    9.0   27.5
ASSG (2019)        I3D       50.4   38.7   25.4   15.0    6.6   27.2
BaSNet (2020)      I3D       44.6   36.0   27.0   18.6   10.4   27.3
DGAM (2020)        I3D       46.8   38.2   28.8   19.8   11.4   29.0
ACSNet (Ours)      I3D       51.4   42.7   32.4   22.0   11.7   32.0

Table 1: TAL performance comparison on THUMOS14 test set, in terms of average mAP at IoU thresholds from 0.3 to 0.7. Recent works in both fully-supervised and weakly-supervised settings are reported. UNT and I3D represent UntrimmedNet and I3D feature backbones, respectively. ACSNet achieves state-of-the-art performance on both backbones. Compared to fully-supervised methods, our ACSNet can achieve close or even better performance.

5 Experiments

In this section, we evaluate the proposed ACSNet via detailed ablation studies to explore the contribution brought by AC branch. Meanwhile, we compare our method with state-of-the-art WS-TAL methods and recent fully-supervised TAL methods on two standard benchmarks.

5.1 Experimental Setting

Evaluation Datasets. The THUMOS14 dataset Jiang et al. (2014) provides temporal annotations for action categories, including 200 untrimmed videos from the validation set and 213 untrimmed videos from the test set. On average, each video contains action instances and frames are non-action background. Following conventions, the validation and test sets are leveraged for training and testing, respectively. ActivityNet v1.2 & v1.3 Fabian Caba Heilbron and Niebles (2015) provide temporal annotations for / action categories, including a training set with / untrimmed videos and a validation set with / untrimmed videos (see footnote 1).

Footnote 1: In our experiments, there are / and / videos accessible from YouTube in the training and validation set for ActivityNet v1.2 / v1.3, respectively.

Evaluation metric. Following the standard evaluation protocol, we evaluate the TAL performance using mean average precision (mAP) values at different levels of IoU thresholds. Specifically, the IoU threshold sets are and for THUMOS14 and ActivityNet, respectively. Both THUMOS14 and ActivityNet benchmarks provide standard evaluation implementations, which are directly exploited in our experiments for fair comparison.
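The IoU that underlies these thresholds is the temporal intersection-over-union between a predicted segment and a ground-truth segment. The sketch below is a generic illustration of that quantity, not the benchmarks' official evaluation code; segments are assumed to be (start, end) pairs in seconds.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) pairs."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


# A prediction counts as correct at threshold 0.5 only if its IoU with a
# ground-truth instance of the same class reaches 0.5.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```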

Method              Version   mAP(%)@IoU            Avg
                              0.5    0.75   0.95
SSN (2017)          v1.2      41.3   27.0   6.1    26.6
SSN (2017)          v1.3      39.1   23.5   5.5    24.0
AutoLoc (2018)      v1.2      27.3   15.1   3.3    16.0
TSM (2019)          v1.2      28.3   17.0   3.5    17.1
W-TALC (2018)       v1.2      37.0   12.7   1.5    18.0
CleanNet (2019b)    v1.2      37.1   20.3   5.0    21.6
Liu et al. (2019)   v1.2      36.8   22.0   5.6    22.4
BaSNet (2020)       v1.2      38.5   24.2   5.6    24.3
DGAM (2020)         v1.2      41.0   23.5   5.3    24.4
ACSNet (Ours)       v1.2      40.1   26.1   6.8    26.0
STPN (2018)         v1.3      29.3   16.9   2.6    -
TSM (2019)          v1.3      30.3   19.0   4.5    -
Liu et al. (2019)   v1.3      34.0   20.9   5.7    21.2
BM (2019)           v1.3      36.4   19.2   2.9    -
BaSNet (2020)       v1.3      34.5   22.5   4.9    22.2
ACSNet (Ours)       v1.3      36.3   24.2   5.8    23.9

Table 2: TAL performance comparison on ActivityNet v1.2 and v1.3 validation set, in terms of average mAP at IoU thresholds from 0.5 to 0.95. Our result is also comparable to fully-supervised models.

5.2 Comparisons with State-of-the-Art Methods

As presented in Table 1, the proposed ACSNet significantly outperforms existing WS-TAL methods in terms of mAP at all IoU thresholds on the THUMOS14 test set. The proposed ACSNet also achieves state-of-the-art performance on ActivityNet v1.2 and v1.3, as presented in Table 2. However, the improvement is not as significant as that on THUMOS14, possibly because ActivityNet v1.2/v1.3 has only / non-action frames per video on average, while THUMOS14 contains on average. With a lower non-action ratio, the improvement brought by context suppression could be less significant.

5.3 Ablation Study

Is Context Really Useful for Classification?

We assume that action-context confusion arises because both action and context can support the classification, owing to their high co-occurrence. To validate whether the context snippets estimated by the AC branch meet this assumption, we collect the foreground/background and action/context snippets as follows. The -th snippet belongs to foreground if and otherwise it belongs to background. Among foreground snippets, if , the -th snippet is assigned as action and otherwise as context. For reference, we also collect all ground truth snippets. Therefore, five snippet sets are collected, noted as , , , , and , respectively.

Regarding connected (temporally adjacent) snippets within each set as temporal proposals, these snippet sets can be evaluated on both localization and classification tasks, as summarized in Table 3. For localization, we use the metrics introduced in Section 5.1 with for proposal evaluation, since is not biased toward either action or context. For classification, two metrics are adopted, i.e., the average classification accuracy () and proportion of groundtruth actions defined as


where means the groundtruth category and is the -th snippet’s classification prediction on the -th class.

As presented in Table 3, context snippets contain more useful information compared with , indicated by the much better classification accuracy. However, in terms of localization task, both and perform poorly, which matches our assumption of context, i.e., snippets that can support classification but contain no actual actions.

Snippet set    Acc. (%)   Prop. (%)   mAP(%)@IoU                         AVG
                                      0.3    0.4    0.5    0.6    0.7
Ground truth   91.4       62.4        100    100    100    100    100    100
Foreground     88.6       59.1        38.3   30.4   21.5   14.4   7.4    22.4
Action         91.0       61.5        42.4   34.6   25.0   16.7   9.4    25.6
Context        81.0       53.4        0.7    0.3    0.2    0      0      0.2
Background     26.7       15.1        0.1    0      0      0      0      0

Table 3: Classification and localization evaluation on different snippet sets on THUMOS14 test set. Classification metric: Average classification accuracy (), and proportion of groundtruth actions () defined in Eq.(30). Localization metric: Average mAP under the IoU thresholds from 0.3 to 0.7.

Variants   mAP(%)@IoU                         AVG
           0.3    0.4    0.5    0.6    0.7
#0()       31.4   23.4   15.8   9.4    4.8    17.0
#0()       38.3   30.4   21.5   14.4   7.4
#1         42.4   34.6   25.0   16.7   9.4
#2         49.5   40.7   29.3   19.4   10.2
#3         51.6   42.2   31.6   20.6   10.8
#4         51.4   42.7   32.4   22.0   11.7
#5         46.0   38.5   28.4   19.1   9.8    28.3

Table 4: Ablation studies of ACSNet on THUMOS14 test set. As defined in Section 5.3, the usage of // reflect the contribution of // in aspects of proposal generation/generation/evaluation. // take up // of the mAP gain upon #0 ().

TAL Contribution of AC branch. The contribution of the proposed AC branch towards the TAL task is reflected in three aspects as summarized in Section 4.2. To validate these three aspects, five ablated variants are evaluated in this section. For the convenience of the discussion, we define the following notations for experiment settings. For proposal generation settings, // are defined as: Thresholding // with // to generate temporal action proposals for all/all/-th action class. For proposal scoring settings, / are defined as: Using / as the in Eq.(4.1) for proposal evaluation. Therefore, the usage of reflects the contribution of in aspects of proposal generation. The usage of and reflect the contribution of in aspects of proposal generation and evaluation, respectively. The contribution of // to TAL is evaluated individually below, as presented in Table 4.

With and , the #0 variants are the baseline methods, which depend on the FB branch and are unrelated to the AC branch. Note that the baselines are highly sensitive to the hyper-parameter ; we choose the best one () for comparison below. In contrast, all the other ablated variants use simple average two-stream fusion (). Comparison between baseline (#0) and #1 shows the contribution solely from . Similarly, the contributions solely from and can be validated by the comparisons between #2 and #4, #1 and #2, respectively. Quantitatively, // take up // of the performance gain upon baseline.

Besides, comparing #4 with #5, an obvious performance drop is observed, indicating that the localization result from the FB branch has become a burden for the final localization. Without the proposals from the FB branch, and with the help of and on proposal generation and evaluation, “#4” achieves the best localization performance.

6 Conclusions

We propose ACSNet for weakly-supervised temporal action localization, which can separate action and context with only video-level categorical labels. This is achieved by characterizing foreground/action/context as combinations of positive and negative latent components. ACSNet significantly outperforms existing WS-TAL methods on three standard datasets, i.e., THUMOS14, ActivityNet v1.2 and v1.3. Moreover, ACSNet achieves competitive performance even compared with recent fully-supervised TAL methods. Experimental results validate the significance of action-context separation and the superiority of the proposed pipeline.

7 Acknowledgments

This work was supported partly by National Key R&D Program of China Grant 2018AAA0101400, NSFC Grants 61629301, 61773312, and 61976171, China Postdoctoral Science Foundation Grant 2019M653642, Young Elite Scientists Sponsorship Program by CAST Grant 2018QNRC001, and Natural Science Foundation of Shaanxi Grant 2020JQ-069.


  • Asadiaghbolaghi et al. (2017) Asadiaghbolaghi, M.; Clapes, A.; Bellantonio, M.; Escalante, H. J.; Poncelopez, V.; Baro, X.; Guyon, I.; Kasaei, S.; and Escalera, S. 2017. A Survey on Deep Learning Based Approaches for Action and Gesture Recognition in Image Sequences. In FG, 476–483.
  • Buch et al. (2017) Buch, S.; Escorcia, V.; Shen, C.; Ghanem, B.; and Niebles, J. C. 2017. SST: Single-stream temporal action proposals. In CVPR, 6373–6382.
  • Chao et al. (2018) Chao, Y.-W.; Vijayanarasimhan, S.; Seybold, B.; Ross, D. A.; Deng, J.; and Sukthankar, R. 2018. Rethinking the Faster R-CNN Architecture for Temporal Action Localization. In CVPR, 1130–1139.
  • Fabian Caba Heilbron and Niebles (2015) Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; and Niebles, J. C. 2015. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In CVPR, 961–970.
  • Feichtenhofer et al. (2019) Feichtenhofer, C.; Fan, H.; Malik, J.; and He, K. 2019. SlowFast networks for video recognition. In ICCV.
  • Feichtenhofer, Pinz, and Zisserman (2016) Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2016. Convolutional two-stream network fusion for video action recognition. In CVPR, 1933–1941.
  • Gao et al. (2017) Gao, J.; Yang, Z.; Sun, C.; Chen, K.; and Nevatia, R. 2017. TURN TAP: Temporal unit regression network for temporal action proposals. In ICCV, 3628–3636.
  • Ji et al. (2013) Ji, S.; Xu, W.; Yang, M.; and Yu, K. 2013. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1): 221–231.
  • Jiang et al. (2014) Jiang, Y.; Liu, J.; Zamir, A. R.; Toderici, G.; Laptev, I.; Shah, M.; and Sukthankar, R. 2014. THUMOS challenge: Action recognition with a large number of classes.
  • Kang and Wildes (2016) Kang, S. M.; and Wildes, R. P. 2016. Review of action recognition and detection methods. arXiv preprint arXiv:1610.06906.
  • Lee, Uh, and Byun (2020) Lee, P.; Uh, Y.; and Byun, H. 2020. Background Suppression Network for Weakly-Supervised Temporal Action Localization. In AAAI, 11320–11327.
  • Lee, Ghosh, and Grauman (2012) Lee, Y. J.; Ghosh, J.; and Grauman, K. 2012. Discovering important people and objects for egocentric video summarization. In CVPR, 1346–1353.
  • Lin et al. (2019) Lin, T.; Liu, X.; Li, X.; Ding, E.; and Wen, S. 2019. BMN: Boundary-matching network for temporal action proposal generation. In ICCV, 3889–3898.
  • Lin et al. (2018) Lin, T.; Zhao, X.; Su, H.; Wang, C.; and Yang, M. 2018. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. In ECCV.
  • Liu, Jiang, and Wang (2019) Liu, D.; Jiang, T.; and Wang, Y. 2019. Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization. In CVPR.
  • Liu et al. (2019a) Liu, Y.; Ma, L.; Zhang, Y.; Liu, W.; and Chang, S.-F. 2019a. Multi-granularity generator for temporal action proposal. In CVPR, 3604–3613.
  • Liu et al. (2019b) Liu, Z.; Wang, L.; Zhang, Q.; Gao, Z.; Niu, Z.; Zheng, N.; and Hua, G. 2019b. Weakly Supervised Temporal Action Localization through Contrast based Evaluation Networks. In ICCV.
  • Nguyen et al. (2018) Nguyen, P.; Liu, T.; Prasad, G.; and Han, B. 2018. Weakly supervised action localization by sparse temporal pooling network. In CVPR, 6752–6761.
  • Nguyen, Ramanan, and Fowlkes (2019) Nguyen, P. X.; Ramanan, D.; and Fowlkes, C. C. 2019. Weakly-supervised action localization with background modeling. In ICCV, 5502–5511.
  • Paul, Roy, and Roy-Chowdhury (2018) Paul, S.; Roy, S.; and Roy-Chowdhury, A. K. 2018. W-TALC: Weakly-supervised Temporal Activity Localization and Classification. In ECCV, 588–607.
  • Shi et al. (2020) Shi, B.; Dai, Q.; Mu, Y.; and Wang, J. 2020. Weakly-Supervised Action Localization by Generative Attention Modeling. In CVPR, 1009–1019.
  • Shou et al. (2018) Shou, Z.; Gao, H.; Zhang, L.; Miyazawa, K.; and Chang, S.-F. 2018. AutoLoc: Weakly-supervised Temporal Action Localization in Untrimmed Videos. In ECCV, 154–171.
  • Simonyan and Zisserman (2014) Simonyan, K.; and Zisserman, A. 2014. Two-stream convolutional networks for action recognition in videos. In NIPS, 568–576.
  • Sun et al. (2015a) Sun, C.; Shetty, S.; Sukthankar, R.; and Nevatia, R. 2015a. Temporal localization of fine-grained actions in videos by domain transfer from web images. In ACM MM, 371–380.
  • Sun et al. (2015b) Sun, L.; Jia, K.; Yeung, D.-Y.; and Shi, B. E. 2015b. Human action recognition using factorized spatio-temporal convolutional networks. In CVPR, 4597–4605.
  • Tran et al. (2015) Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3D convolutional networks. In CVPR, 4489–4497.
  • Vishwakarma and Agrawal (2013) Vishwakarma, S.; and Agrawal, A. 2013. A survey on activity recognition and behavior understanding in video surveillance. The Visual Computer 29(10): 983–1009.
  • Wang et al. (2017) Wang, L.; Xiong, Y.; Lin, D.; and Van Gool, L. 2017. UntrimmedNets for weakly supervised action recognition and detection. In CVPR, 4325–4334.
  • Wang et al. (2016) Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; and Van Gool, L. 2016. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 20–36.
  • Xu, Das, and Saenko (2017) Xu, H.; Das, A.; and Saenko, K. 2017. R-C3D: Region convolutional 3D network for temporal activity detection. In ICCV, 5794–5803.
  • Xu et al. (2020) Xu, M.; Zhao, C.; Rojas, D. S.; Thabet, A.; and Ghanem, B. 2020. G-TAD: Sub-Graph Localization for Temporal Action Detection. In CVPR, 10156–10165.
  • Yao, Lei, and Zhong (2019) Yao, G.; Lei, T.; and Zhong, J. 2019. A review of Convolutional-Neural-Network-based action recognition. Pattern Recognition Letters 118: 14–22.
  • Yu et al. (2019) Yu, T.; Ren, Z.; Li, Y.; Yan, E.; Xu, N.; and Yuan, J. 2019. Temporal structure mining for weakly supervised action detection. In ICCV, 5522–5531.
  • Yuan et al. (2019) Yuan, Y.; Lyu, Y.; Shen, X.; Tsang, I. W.; and Yeung, D.-Y. 2019. Marginalized Average Attentional Network for Weakly-Supervised Learning. In ICLR.
  • Zeng et al. (2019) Zeng, R.; Huang, W.; Tan, M.; Rong, Y.; Zhao, P.; Huang, J.; and Gan, C. 2019. Graph convolutional networks for temporal action localization. In ICCV, 7094–7103.
  • Zhang et al. (2019) Zhang, C.; Xu, Y.; Cheng, Z.; Niu, Y.; Pu, S.; Wu, F.; and Zou, F. 2019. Adversarial Seeded Sequence Growing for Weakly-Supervised Temporal Action Localization. In ACM MM, 738–746.
  • Zhao et al. (2017) Zhao, Y.; Xiong, Y.; Wang, L.; Wu, Z.; Tang, X.; and Lin, D. 2017. Temporal Action Detection with Structured Segment Networks. In ICCV, 2933–2942.