Weakly-Supervised Action Localization by Generative Attention Modeling

by Baifeng Shi, et al.
Peking University

Weakly-supervised temporal action localization is the problem of learning an action localization model when only video-level action labels are available. The general framework largely relies on classification activation: it employs an attention model to identify the action-related frames and then categorizes them into different classes. Such a method results in the action-context confusion issue: context frames near action clips tend to be recognized as action frames themselves, since they are closely related to the specific classes. To solve this problem, we propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE). Based on the observation that context exhibits a notable difference from action at the representation level, a probabilistic model, i.e., a conditional VAE, is learned to model the likelihood of each frame given the attention. By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated. Experiments on THUMOS14 and ActivityNet1.2 demonstrate the advantage of our method and its effectiveness in handling the action-context confusion problem. Code is now available on GitHub.



1 Introduction

Action localization is one of the most challenging tasks in video analytics and understanding [43, 42, 20, 37, 21]. The goal is to predict accurate start and end time stamps of different human actions. Owing to its wide application (e.g., surveillance [47], video summarization [28], highlight detection [51]), action localization has drawn much attention in the community. Thanks to powerful convolutional neural networks (CNNs) [18], performance on this task has surged in the past few years [42, 53, 6, 52, 5, 1, 23, 27]. Nevertheless, these fully-supervised methods require temporal annotations of action intervals during training, which are extremely expensive and time-consuming to obtain. Therefore, the task of weakly-supervised action localization (WSAL) has been put forward, where only video-level category labels are available.

To date in the literature, there are two main categories of approaches in WSAL. The first type [24, 29, 34, 49] generally builds a top-down pipeline, which learns a video-level classifier and then obtains frame attention by checking the produced temporal class activation map (TCAM) [60]. Note that a frame here denotes a small snippet from which appearance or motion features can be extracted. The second paradigm, on the other hand, works in a bottom-up way, i.e., temporal attention is directly predicted from raw data [30, 31, 41, 55]. The attention is then optimized in the task of video classification with video-level supervision. Frames with high attention are treated as the action part, and the remaining frames as the background part.

Figure 1: An illustration of action-context confusion. The video clip, showing a long jump process, consists of three stages of the action (approaching, jumping, and landing) and two stages of context (preparing and finishing). (a) Ground truth of action localization. (b) Action-context confusion. The context frames, which are highly related to the long jump category, are also selected.

Both kinds of methods largely rely on the video-level classification model, which leads to the intractable action-context confusion [24] issue in the absence of frame-wise labels. Take the long jump in Figure 1 as an example: the action has three stages, i.e., approaching, jumping, and landing. In addition, the frames before and after the action, i.e., preparing and finishing, contain content that is closely related to long jump but is not part of the action. We refer to such frames as context, which is a special kind of background. In this example, the context parts include the track field and sandpit, which can in fact significantly encourage the recognition of the action. Without frame-wise annotations, the classifier is normally learned by aggregating the features of all related frames, where context and action are roughly mixed up. The context frames thus tend to be recognized as action frames themselves. The action-context confusion problem has not been fully studied, though it is common in WSAL. One recent exploration [24] attempts to solve the problem by assuming a strong prior that context clips should be stationary, i.e., contain no motion. However, such an assumption is severely limited and ignores the inherent difference between context and action.

To separate context and action, the model should be able to capture the underlying discrepancy between them. Intuitively, a context frame does exhibit an obvious difference from an action frame at the appearance or motion level. For example, among the five stages in Figure 1, the action stages (approaching, jumping, and landing) clearly demonstrate more intense body postures than the context stages (preparing and finishing). In other words, the extracted feature representations for context and action are also different. Such a difference exists regardless of the action category.

Inspired by this observation, we propose a novel generative attention mechanism to model the frame representation conditioned on frame attention. In addition to the above intuition, we build a graphical model to theoretically demonstrate that the localization problem is associated with both the conventional classification and the proposed representation modeling. Our framework thus consists of two parts: the Discriminative and Generative Attention Modeling (DGAM). On one hand, the discriminative attention modeling trains a classification model on temporally pooled features weighted by the frame attention. On the other hand, a generative model, i.e., a conditional Variational Auto-Encoder (VAE), is learned to model the class-agnostic frame-wise distribution of the representation conditioned on attention values. By maximizing the likelihood of the representation, the frame-wise attention is optimized accordingly, leading to a good separation of action and context frames. Extensive experiments are conducted on THUMOS14 [13] and ActivityNet1.2 [3] to show that DGAM outperforms state-of-the-art methods by a significant margin. Comprehensive analysis further validates its effectiveness in separating action and context.

The main contribution of this work is the proposed DGAM framework for addressing the issue of action-context confusion in WSAL by modeling the frame representation conditioned on different attentions. The solution has led to elegant views of how localization is associated with the representation distribution and how to learn better attentions by modeling the representation, which have not been discussed in the literature.

2 Related Works

Video action recognition is a fundamental problem in video analytics. Most video-related tasks leverage off-the-shelf action recognition models to extract features for further analysis. Early methods normally devise hand-crafted features [19, 48, 32] for recognition. Recently, thanks to the development of deep learning techniques, many approaches focus on automatic feature extraction with end-to-end learning, e.g., two-stream network [43], temporal segment network (TSN) [50], 3D ConvNet (C3D) [46], Pseudo 3D (P3D) [36], Inflated 3D (I3D) [4]. In our experiments, I3D is utilized for feature extraction.

Fully-supervised action localization has been extensively studied recently. Many works follow the paradigms that are widely applied in object detection [8, 9, 39, 38, 25], owing to the similarity of the problem settings. More specifically, there are two main directions, namely two-stage methods and one-stage methods. Two-stage methods [58, 52, 6, 5, 42, 40, 7, 11, 23] first generate action proposals and then classify them with further refinement on temporal boundaries. One-stage methods [2, 22, 57] instead predict action category and location directly from raw data. In the fully-supervised setting, the action-context confusion can be alleviated with frame-wise annotations.

Weakly-supervised action localization is drawing increasing attention due to the time-consuming manual labeling in the fully-supervised setting. As introduced in Section 1, WSAL methods can be grouped into two categories, namely top-down and bottom-up methods. In the top-down pipeline (e.g., UntrimmedNet [49]), a video-level classification model is learned first, and then frames with high classification activation are selected as action locations. W-TALC [34] and 3C-Net [29] also force foreground features from the same class to be similar, and those from different classes to be dissimilar. Unlike the top-down scheme, the bottom-up methods directly produce the attention for each frame from data, and train a classification model with the features weighted by attention. Based on this paradigm, STPN [30] further adds a regularization term to encourage the sparsity of action. AutoLoc [41] proposes the Outer-Inner-Contrastive (OIC) loss by assuming that a complete action clip should look different from its neighbours. MAAN [55] proposes to suppress the dominance of the most salient action frames and retrieve less salient ones. Nguyen et al. [31] propose to penalize the discriminative capacity of background, which is also utilized in our classification module. Besides, a video-level clustering loss is applied in [31] to separate foreground and background. Nevertheless, all of the aforementioned methods ignore the challenging action-context confusion issue caused by the absence of frame-wise labels. Though Liu et al. [24] try to separate action and context using hard negative mining, their method is based on the strong assumption that context clips should be stationary, which has many limitations and may hence negatively influence the prediction.

Generative models have also developed rapidly in recent years [17, 10, 12]. GAN [10] employs a generator to approximate the real data distribution through adversarial training between generator and discriminator. However, the learned approximating distribution is implicitly determined by the generator and thus cannot be analytically expressed. VAE [17] approximates the real distribution by optimizing the variational lower bound on the marginal likelihood of data. Given a latent code, the conditional distribution is explicitly modeled as a Gaussian distribution, hence the data distribution can be analytically expressed by sampling latent vectors and calculating the Gaussian. Flow-based models [16] use invertible layers as the generative mapping, where the data distribution can be calculated given the Jacobian of each layer. However, all layers must have the same dimensions, which is much less flexible. In our work, we exploit the Conditional VAE (CVAE) [45] to model the frame feature distribution conditioned on the attention value.

3 Method

Suppose we have a set of training videos and the corresponding video-level labels. For each video, we sample $T$ frames (snippets) to extract the RGB or optical flow features $X = (x_t)_{t=1}^{T}$ with a pre-trained model, where $x_t \in \mathbb{R}^d$ is the feature of frame $t$, and $d$ is the feature dimension. The video-level label is denoted as $y \in \{0, 1, \ldots, C\}$, where $C$ is the number of classes and $0$ corresponds to background. For brevity, we assume that each video only belongs to one class, though the following discussion can also apply to multi-label videos.

Our method follows the bottom-up pipeline for WSAL, which learns the attention $\lambda = (\lambda_t)_{t=1}^{T}$ directly from data, where $\lambda_t \in [0, 1]$ is the attention of frame $t$. Before discussing the details of our method, we examine the action localization problem from the beginning.

3.1 Attention-based Framework

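In the attention-based action localization problem, the target is to predict the frame attention $\lambda$, which is equivalent to solving the maximum a posteriori (MAP) problem:

$$\max_{\lambda \in [0,1]} \log p(\lambda \mid X, y), \tag{1}$$

where $p(\lambda \mid X, y)$ is the unknown probability distribution of $\lambda$ given $X$ and $y$. In the absence of frame-level labels (ground truth of $\lambda$), it is difficult to approximate and optimize $p(\lambda \mid X, y)$ directly. Therefore, we transform the optimization target using Bayes' theorem,

$$\begin{aligned}
\log p(\lambda \mid X, y) &= \log p(X, y \mid \lambda) + \log p(\lambda) - \log p(X, y) \\
&= \log p(y \mid X, \lambda) + \log p(X \mid \lambda) + \log p(\lambda) - \log p(X, y) \\
&= \log p(y \mid X, \lambda) + \log p(X \mid \lambda) + \text{const},
\end{aligned} \tag{2}$$

where in the last step, we discard the constant term $\log p(X, y)$ and assume a uniform prior of $\lambda$, i.e., $p(\lambda) = \text{const}$. Our optimization problem thus becomes

$$\max_{\lambda \in [0,1]} \; \log p(y \mid X, \lambda) + \log p(X \mid \lambda). \tag{3}$$

This formulation indicates two different aspects for optimizing $\lambda$. The first term $\log p(y \mid X, \lambda)$ prefers $\lambda$ with high discriminative capacity for action classification, which is the main optimization target in previous works. In contrast, the second term $\log p(X \mid \lambda)$ forces the representation of frames to be accurately predicted from the attention $\lambda$. Given the feature difference between foreground and background, this objective encourages the model to impose different attentions on different features. Specifically, we exploit a generative model to approximate $p(X \mid \lambda)$, and force the feature $X$ to be accurately reconstructed by the model.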

Figure 2 shows the graphical model of the above problem. The model parameters ($\theta$, $\phi$, $\psi$) and the latent variables in the generative model ($z_t$) will be discussed later. Based on (3), the framework of our method consists of two components, i.e., the discriminative attention modeling and the generative attention modeling, as illustrated in Figure 3.

Figure 2: The directed graphical model of DGAM. Solid lines denote the generative model $p_\psi(z_t \mid \lambda_t)\, p_\psi(x_t \mid \lambda_t, z_t)$, dashed lines denote the variational approximation $q_\phi(z_t \mid x_t, \lambda_t)$ to the intractable posterior $p(z_t \mid x_t, \lambda_t)$, and dash-dot lines denote the video-level classification model $p_\theta(y \mid X, \lambda)$. $\phi$ and $\psi$ are jointly learned, which forms an alternating optimization together with $\theta$ and $\lambda$.

Figure 3: Framework overview. The proposed model is trained in two alternating stages (a) and (b). In stage (a), the generative model (CVAE) is frozen. The attention module and classification module are updated with the classification-based discriminative loss $\mathcal{L}_d$, the representation-based reconstruction loss $\mathcal{L}_{re}$, and the regularization loss $\mathcal{L}_{guide}$. In stage (b), the attention and classification modules are frozen. The CVAE is trained with loss $\mathcal{L}_{CVAE}$ to reconstruct the representation of frames with different $\lambda$. Since the ground truth $\lambda$ is unavailable, we utilize the $\lambda$ predicted by the attention module as a “pseudo label” for training.

3.2 Discriminative Attention Modeling

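The discriminative attention module learns the frame attention by optimizing the video-level recognition task. Specifically, we utilize the attention $\lambda$ as weights to perform temporal average pooling over all frames in the video and produce a video-level foreground feature $x_{fg} \in \mathbb{R}^d$ given by

$$x_{fg} = \frac{\sum_{t=1}^{T} \lambda_t x_t}{\sum_{t=1}^{T} \lambda_t}. \tag{4}$$

Similarly, we can also utilize $1 - \lambda$ as the weights to calculate a background feature $x_{bg}$:

$$x_{bg} = \frac{\sum_{t=1}^{T} (1 - \lambda_t)\, x_t}{\sum_{t=1}^{T} (1 - \lambda_t)}. \tag{5}$$

To optimize $\lambda$, we encourage high discriminative capability of the foreground feature $x_{fg}$ and simultaneously punish any discriminative capability of the background feature $x_{bg}$ [31]. This is equivalent to minimizing the following discriminative loss (i.e., softmax loss):

$$\mathcal{L}_d = \mathcal{L}_{fg} + \alpha \cdot \mathcal{L}_{bg} = -\log p_\theta(y \mid x_{fg}) - \alpha \cdot \log p_\theta(0 \mid x_{bg}), \tag{6}$$

where $\alpha$ is a hyper-parameter, and $p_\theta$ is our classification module, modeled by a fully-connected layer with weight $w_c \in \mathbb{R}^d$ for each class $c$ followed by a softmax layer. During training, the attention module and the classification module are jointly optimized. The graphical model of this part is illustrated in Figure 2 with dash-dot lines.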
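
The pooling and the discriminative loss described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names, the bias-free linear classifier, and the small `eps` added to the pooling denominator are assumptions made for illustration only.

```python
import math

def weighted_pool(features, weights, eps=1e-8):
    """Weighted temporal average of per-frame feature vectors (Eqs. 4 and 5)."""
    total = sum(weights) + eps  # eps is an assumption to avoid division by zero
    dim = len(features[0])
    return [sum(w * x[i] for w, x in zip(weights, features)) / total
            for i in range(dim)]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of class logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def discriminative_loss(features, attention, class_weights, label, alpha):
    """-log p(label | x_fg) - alpha * log p(0 | x_bg); class 0 is background."""
    x_fg = weighted_pool(features, attention)
    x_bg = weighted_pool(features, [1.0 - a for a in attention])

    def logits(x):
        # Bias-free linear classifier: one weight vector per class (assumption).
        return [sum(wi * xi for wi, xi in zip(w, x)) for w in class_weights]

    loss_fg = -log_softmax(logits(x_fg))[label]
    loss_bg = -log_softmax(logits(x_bg))[0]
    return loss_fg + alpha * loss_bg
```

On a toy video whose first frame looks like the action class and whose second frame looks like background, attention concentrated on the first frame should give a lower loss than attention concentrated on the second.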

3.3 Generative Attention Modeling

The discriminative attention optimization generally has difficulty in separating context and foreground when frame-wise annotations are unavailable. Based on the observation that context differs from foreground in terms of feature representation, we utilize a Conditional Variational Auto-Encoder (CVAE) to model the representation distribution of different frames. Before explaining the details, we briefly review the Variational Auto-Encoder (VAE).

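Given the observed variable $x$, VAE [17] introduces a latent variable $z$, and aims to generate $x$ from $z$, i.e.,

$$p_\psi(x) = \mathbb{E}_{p_\psi(z)}\left[ p_\psi(x \mid z) \right], \tag{7}$$

where $\psi$ denotes the parameters of the generative model, $p_\psi(z)$ is the prior (e.g., a standard Gaussian), and $p_\psi(x \mid z)$ is the conditional distribution indicating the generation procedure, which is typically estimated with a neural network $f_\psi(\cdot)$ that is referred to as the decoder. The key idea is to sample values of $z$ that are likely to produce $x$, which means that we need an approximation $q_\phi(z \mid x)$ to the intractable posterior $p(z \mid x)$. $\phi$ denotes the parameters of the approximation model, which is also estimated via a neural network $f_\phi(\cdot)$, referred to as the encoder. VAE incorporates the encoder $f_\phi$ and the decoder $f_\psi$, and learns the parameters by maximizing the variational lower bound:

$$\mathcal{L}(\psi, \phi) = -\mathrm{KL}\left( q_\phi(z \mid x) \,\|\, p_\psi(z) \right) + \mathbb{E}_{q_\phi(z \mid x)}\left[ \log p_\psi(x \mid z) \right], \tag{8}$$

where $\mathrm{KL}(q \,\|\, p)$ is the KL divergence of $p$ from $q$.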

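In our DGAM model, we expect to generate the observation $X$ based on the attention $\lambda$, i.e., $p(X \mid \lambda)$, which can be written as $\prod_{t=1}^{T} p(x_t \mid \lambda_t)$ by assuming independence between frames in a video. Similarly, we introduce a latent variable $z_t$, and attempt to generate each $x_t$ from $z_t$ and $\lambda_t$, which forms a Conditional VAE problem:

$$p_\psi(x_t \mid \lambda_t) = \mathbb{E}_{p_\psi(z_t \mid \lambda_t)}\left[ p_\psi(x_t \mid \lambda_t, z_t) \right]. \tag{9}$$

Note that the desired distribution of $x_t$ is modeled as a Gaussian, i.e., $p_\psi(x_t \mid \lambda_t, z_t) = \mathcal{N}(x_t \mid f_\psi(\lambda_t, z_t), \sigma^2 I)$, where $f_\psi$ is the decoder, $\sigma$ is a hyper-parameter, and $I$ is the unit matrix. Ideally, $z_t$ is sampled from the prior $p_\psi(z_t \mid \lambda_t)$. In DGAM, we set the prior as a Gaussian, i.e., $p_\psi(z_t \mid \lambda_t) = \mathcal{N}(z_t \mid r\lambda_t \cdot \mathbf{1}, I)$, where $\mathbf{1}$ is the all-ones vector and $r$ is a hyper-parameter indicating the discrepancy between the priors of different attention values $\lambda_t$. When $r = 0$, the prior is independent of $\lambda_t$.

During training of the CVAE, we also approximate the intractable posterior $p(z_t \mid x_t, \lambda_t)$ by a Gaussian $q_\phi(z_t \mid x_t, \lambda_t) = \mathcal{N}(z_t \mid \mu_\phi, \Sigma_\phi)$, where $\mu_\phi$ and $\Sigma_\phi$ are the outputs of the encoder $f_\phi(x_t, \lambda_t)$. We then minimize the variational loss $\mathcal{L}_{CVAE}$:

$$\begin{aligned}
\mathcal{L}_{CVAE} &= -\mathbb{E}_{q_\phi(z_t \mid x_t, \lambda_t)}\left[ \log p_\psi(x_t \mid \lambda_t, z_t) \right] + \beta \cdot \mathrm{KL}\left( q_\phi(z_t \mid x_t, \lambda_t) \,\|\, p_\psi(z_t \mid \lambda_t) \right) \\
&\simeq -\frac{1}{L} \sum_{l=1}^{L} \log p_\psi(x_t \mid \lambda_t, z_t^{(l)}) + \beta \cdot \mathrm{KL}\left( q_\phi(z_t \mid x_t, \lambda_t) \,\|\, p_\psi(z_t \mid \lambda_t) \right),
\end{aligned} \tag{10}$$

where $z_t^{(l)}$ is the $l$-th sample from $q_\phi(z_t \mid x_t, \lambda_t)$. Note that the Monte Carlo estimation of the expectation is employed with $L$ samples. $\beta$ is a hyper-parameter for the trade-off between reconstruction quality and sampling accuracy.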
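
To make the two terms of the variational loss concrete, here is a minimal pure-Python sketch for a single frame. It assumes a diagonal posterior covariance (passed as a list of variances) and treats `decoder` as a hypothetical callable standing in for the decoder network; the function names are illustrative and not taken from the released code.

```python
import math
import random

def kl_to_prior(mu, var, lam, r):
    """Closed-form KL( N(mu, diag(var)) || N(r * lam * 1, I) )."""
    m = r * lam  # prior mean shared by every latent dimension
    return 0.5 * sum(v + (u - m) ** 2 - 1.0 - math.log(v)
                     for u, v in zip(mu, var))

def neg_log_gaussian(x, mean, sigma):
    """-log N(x | mean, sigma^2 I), including the normalization constant."""
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return sq / (2 * sigma ** 2) + 0.5 * len(x) * math.log(2 * math.pi * sigma ** 2)

def cvae_loss(x, lam, mu, var, decoder, sigma, beta, r, num_samples=1):
    """Monte Carlo reconstruction term plus beta-weighted KL term for one frame."""
    recon = 0.0
    for _ in range(num_samples):
        # Reparameterized sample from the approximate posterior.
        z = [u + math.sqrt(v) * random.gauss(0.0, 1.0) for u, v in zip(mu, var)]
        recon += neg_log_gaussian(x, decoder(lam, z), sigma)
    return recon / num_samples + beta * kl_to_prior(mu, var, lam, r)
```

When the posterior equals the prior, the KL term vanishes, so the loss reduces to the reconstruction term alone.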

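For the generative attention modeling of $\lambda$, we fix the CVAE and minimize the reconstruction loss $\mathcal{L}_{re}$ given by

$$\mathcal{L}_{re} = -\sum_{t=1}^{T} \log \left\{ \mathbb{E}_{p_\psi(z_t \mid \lambda_t)}\left[ p_\psi(x_t \mid \lambda_t, z_t) \right] \right\} \simeq -\sum_{t=1}^{T} \log \left\{ \frac{1}{L} \sum_{l=1}^{L} p_\psi(x_t \mid \lambda_t, z_t^{(l)}) \right\}, \tag{11}$$

where $z_t^{(l)}$ is sampled from the prior $p_\psi(z_t \mid \lambda_t)$. In our experiments, $L$ is set to 1, and (11) can be written, up to a positive scale and an additive constant, as

$$\mathcal{L}_{re} \propto \sum_{t=1}^{T} \left\| x_t - f_\psi(\lambda_t, z_t) \right\|^2. \tag{12}$$

The graphical model of generative attention modeling is illustrated in Figure 2 with solid and dashed lines.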
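
The reduction of the reconstruction loss to a squared error can be spelled out as a short worked derivation. It assumes the Gaussian decoder with a fixed standard deviation $\sigma$ and a single prior sample $z_t$ per frame; $f(\lambda_t, z_t)$ denotes the decoder output and $d$ the feature dimension.

```latex
\begin{aligned}
-\log \mathcal{N}\!\left(x_t \mid f(\lambda_t, z_t), \sigma^2 I\right)
  &= \frac{1}{2\sigma^2}\,\lVert x_t - f(\lambda_t, z_t)\rVert^2
     + \frac{d}{2}\log\!\left(2\pi\sigma^2\right), \\
\mathcal{L}_{re}
  &\simeq \frac{1}{2\sigma^2}\sum_{t=1}^{T}\lVert x_t - f(\lambda_t, z_t)\rVert^2
     + \frac{T d}{2}\log\!\left(2\pi\sigma^2\right).
\end{aligned}
```

Since $\sigma$ is a fixed hyper-parameter, the second term does not depend on the attention, so minimizing the loss with respect to the attention amounts to minimizing the summed squared reconstruction error.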

In our framework, the CVAE cannot be directly and solely optimized due to the unavailability of the ground truth $\lambda$. Therefore, we propose to train the attention module and the CVAE in an alternating way, i.e., we first update the CVAE with the “pseudo label” of $\lambda$ given by the attention module, and then train the attention module with the CVAE fixed. The two stages are repeated for several iterations. Since there exist other loss terms for attention modeling (e.g., the discriminative loss), the pseudo label can be of high quality, and hence good convergence can be reached. Experimental results empirically validate this.
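
The alternating schedule can be sketched as a schematic loop. The three callables are hypothetical placeholders for the real update steps (they are not part of any library or of the released code), and the sketch only fixes the order of operations described above.

```python
def train_alternating(update_cvae, update_attention, predict_attention,
                      videos, num_rounds):
    """Alternate CVAE updates and attention updates for num_rounds rounds.

    update_cvae(videos, pseudo_labels) -> loss   (attention frozen)
    update_attention(videos) -> loss             (CVAE frozen)
    predict_attention(video) -> per-frame attention used as pseudo label
    """
    history = []
    for _ in range(num_rounds):
        # The frozen attention module provides pseudo labels for the CVAE.
        pseudo_labels = [predict_attention(v) for v in videos]
        cvae_loss = update_cvae(videos, pseudo_labels)
        # The frozen CVAE then supplies the reconstruction signal for attention.
        attention_loss = update_attention(videos)
        history.append((cvae_loss, attention_loss))
    return history
```

Passing the update steps as callables keeps the sketch independent of any particular deep learning framework.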

3.4 Optimization

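In addition to the above objectives, we exploit a self-guided regularization [31] to further refine the attention. The temporal class activation maps (TCAM) [30, 60] are utilized to produce the top-down, class-aware attention maps. Specifically, given a video with label $y$, the TCAM are computed by

$$\hat{\lambda}_t^{fg} = G(\sigma_s) * \frac{\exp(w_y^\top x_t)}{\sum_{c=0}^{C} \exp(w_c^\top x_t)}, \tag{13}$$

$$\hat{\lambda}_t^{bg} = G(\sigma_s) * \frac{\sum_{c=1}^{C} \exp(w_c^\top x_t)}{\sum_{c=0}^{C} \exp(w_c^\top x_t)}, \tag{14}$$

where $w_c$ indicates the parameters of the classification module for class $c$. $\hat{\lambda}_t^{fg}$ and $\hat{\lambda}_t^{bg}$ are the foreground and background TCAM, respectively. $G(\sigma_s)$ is a Gaussian smooth filter with standard deviation $\sigma_s$, and $*$ represents convolution. The generated $\hat{\lambda}_t^{fg}$ and $\hat{\lambda}_t^{bg}$ are expected to be consistent with the bottom-up, class-agnostic attention $\lambda$, hence the loss can be formulated as

$$\mathcal{L}_{guide} = \frac{1}{T} \sum_{t=1}^{T} \left( \left| \lambda_t - \hat{\lambda}_t^{fg} \right| + \left| \lambda_t - \hat{\lambda}_t^{bg} \right| \right). \tag{15}$$

To sum up, we optimize the whole framework by alternately executing the following two steps:

  1. Update attention and classification modules with loss

     $$\mathcal{L} = \mathcal{L}_d + \gamma_1 \mathcal{L}_{re} + \gamma_2 \mathcal{L}_{guide}, \tag{16}$$

     where $\gamma_1, \gamma_2$ denote the hyper-parameters.

  2. Update CVAE with loss $\mathcal{L}_{CVAE}$.

The whole architecture is illustrated in Figure 3.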
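
The self-guided regularization of this section can be sketched in pure Python. The sketch rests on several assumptions that the text does not confirm: the foreground TCAM is taken as the softmax probability of the labeled class, the background TCAM as one minus the background probability, and the Gaussian filter is truncated at three standard deviations and renormalized at the sequence borders.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]

def gaussian_smooth(seq, sigma):
    """Smooth a 1-D sequence with a truncated Gaussian kernel.

    The kernel is renormalized near the borders (an assumption), so a
    constant sequence stays constant.
    """
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    out = []
    for t in range(len(seq)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            if 0 <= t + k < len(seq):
                num += w * seq[t + k]
                den += w
        out.append(num / den)
    return out

def guide_loss(frame_logits, attention, label, sigma):
    """Mean absolute gap between attention and the two smoothed TCAMs."""
    probs = [softmax(l) for l in frame_logits]
    fg = gaussian_smooth([p[label] for p in probs], sigma)
    bg = gaussian_smooth([1.0 - p[0] for p in probs], sigma)  # class 0 = background
    return sum(abs(a - f) + abs(a - b)
               for a, f, b in zip(attention, fg, bg)) / len(attention)
```

The loss is zero exactly when the attention agrees with both smoothed activation maps at every frame.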

3.5 Action Prediction

To generate action proposals for a video during inference, we feed the video to DGAM and obtain the attention $\lambda$. By filtering out frames with attention lower than a threshold $t_{att}$, we extract consecutive segments with high attention values as the predicted locations. For each segment $[t_0, t_1]$, we temporally pool the features with attention, and get the classification score $s(t_0, t_1, c)$ for class $c$, which is the output of the classification module before softmax. We further follow [41, 24] to refine $s(t_0, t_1, c)$ by subtracting the score of its surroundings. The final score is calculated by

$$s^*(t_0, t_1, c) = s(t_0, t_1, c) - \eta \cdot s_{\mathrm{sur}}(t_0, t_1, c), \tag{17}$$

where $s_{\mathrm{sur}}(t_0, t_1, c)$ denotes the score of the surroundings of the segment and $\eta$ is the subtraction parameter.
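
The proposal step can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' procedure: the surrounding region is taken as a margin of a quarter of the segment length on each side (`margin_ratio` is a hypothetical parameter), and the surrounding score is a plain mean of per-frame class scores. For a bias-free linear classifier, attention-weighted pooling of features followed by classification equals the attention-weighted mean of per-frame scores, which is what `pooled_score` computes.

```python
def extract_segments(attention, threshold):
    """Return [start, end) index pairs of consecutive frames with attention >= threshold."""
    segments, start = [], None
    for t, a in enumerate(attention):
        if a >= threshold and start is None:
            start = t
        elif a < threshold and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(attention)))
    return segments

def pooled_score(frame_scores, attention, start, end):
    """Attention-weighted mean of per-frame class scores inside a segment."""
    den = sum(attention[start:end])
    if den <= 0:
        return 0.0
    num = sum(a * s for a, s in zip(attention[start:end], frame_scores[start:end]))
    return num / den

def refined_score(frame_scores, attention, start, end, eta, margin_ratio=0.25):
    """Inner score minus eta times the mean score of the surrounding frames."""
    margin = max(1, round((end - start) * margin_ratio))
    left = range(max(0, start - margin), start)
    right = range(end, min(len(frame_scores), end + margin))
    surround = [frame_scores[t] for t in list(left) + list(right)]
    inner = pooled_score(frame_scores, attention, start, end)
    if not surround:
        return inner
    return inner - eta * sum(surround) / len(surround)
```

A segment whose neighbours score almost as high as its interior is penalized, which favors proposals with sharp boundaries.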

4 Experiments

4.1 Datasets and Evaluation Metrics

For evaluation, we conduct experiments on two benchmarks, THUMOS14 [13] and ActivityNet1.2 [3]. During training, only video-level category labels are available.

THUMOS14 contains videos from 20 classes for the action localization task. We follow the convention to train on the validation set with 200 videos and evaluate on the test set with 212 videos. Note that we exclude the wrongly annotated video #270 from the test set, following [31, 58]. This dataset is challenging for its finely annotated action instances. Each video contains 15.5 action clips on average. The length of action instances varies widely, from a few seconds to minutes. Video length also ranges from a few seconds to 26 minutes, with an average of around 3 minutes. Compared to other large-scale datasets, e.g., ActivityNet1.2, THUMOS14 has less training data, which places higher demands on the model’s generalization ability and robustness.

ActivityNet1.2 contains 100 classes of videos with both video-level labels and temporal annotations. Each video contains 1.5 action instances on average. Following [49, 41], we train our model on the training set with 4819 videos and evaluate on the validation set with 2383 videos.

Evaluation Metrics. We follow the standard evaluation protocol and report mean Average Precision (mAP) at different intersection over union (IoU) thresholds. The results are calculated using the benchmark code provided by the official ActivityNet codebase (https://github.com/activitynet/ActivityNet/tree/master/Evaluation). For fair comparison, all results on THUMOS14 are averaged over five runs.
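
For readers unfamiliar with the metric, the sketch below shows temporal IoU and a simplified greedy matching of predictions to ground-truth segments at one IoU threshold. It is an illustration only and does not reproduce the official ActivityNet evaluation code; the function names and the greedy one-to-one matching rule are assumptions.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_predictions(preds, gts, iou_threshold):
    """Flag each prediction as true positive or not, in descending score order.

    preds: list of (start, end, score); gts: list of (start, end).
    Each ground-truth segment can be matched by at most one prediction.
    """
    used = set()
    flags = []
    for start, end, _ in sorted(preds, key=lambda p: -p[2]):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j in used:
                continue
            iou = temporal_iou((start, end), gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= iou_threshold:
            used.add(best_j)
            flags.append(True)
        else:
            flags.append(False)
    return flags
```

Average precision for a class is then obtained from the precision-recall curve induced by these flags, and mAP averages it over classes.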

Att  Cls  mAP@IoU
          0.3   0.4   0.5   0.6   0.7
O    O    43.8  35.8  26.7  18.2  9.7
O    N    44.2  36.1  27.0  18.7  9.8
N    O    46.1  38.2  28.8  19.4  11.2
N    N    46.8  38.2  28.8  19.8  11.4
Table 1: Attention evaluation on THUMOS14. The “Old” model (O) is trained without the generative attention modeling, and the “New” model (N) is our DGAM. We assemble specific models by alternately choosing Attention (Att) and Classification (Cls) modules from the two models.

4.2 Implementation Details

We utilize the I3D [4] network pre-trained on Kinetics [14] as the feature extractor (https://github.com/deepmind/kinetics-i3d). Specifically, we first extract optical flow from RGB data using the TV-L1 algorithm [35]. Then we divide both streams into non-overlapping 16-frame snippets and send them into the pre-trained I3D network to obtain two 1024-dimensional feature frames for each snippet. We train separate DGAMs for the RGB and flow streams. The proposals from them are combined with Non-Maximum Suppression (NMS) during inference. Following [30, 31], we set $T$ to 400 for all videos during training. During evaluation, we feed all frames of each video to our network if the frame number is less than $T_{max}$, otherwise we sample $T_{max}$ frames uniformly. $T_{max}$ is 400 for THUMOS14, and 200 for ActivityNet1.2.
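
The snippet splitting and uniform sampling described above can be sketched as below. The helper names are hypothetical, dropping a trailing partial snippet is an assumption, and the exact sampling rule used by the authors is not specified in the text.

```python
def split_into_snippets(num_frames, snippet_len=16):
    """Return (start, end) frame ranges of non-overlapping snippets.

    A trailing partial snippet is dropped (an assumption).
    """
    return [(s, s + snippet_len)
            for s in range(0, num_frames - snippet_len + 1, snippet_len)]

def sample_uniform(num_snippets, max_len):
    """Keep all snippet indices if few enough, otherwise pick max_len evenly spaced ones."""
    if num_snippets <= max_len:
        return list(range(num_snippets))
    step = num_snippets / max_len
    return [int(i * step) for i in range(max_len)]
```

For example, a video with 1000 snippets and a cap of 400 is reduced to 400 evenly spaced snippet indices, while a video with 150 snippets is kept whole.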

We set in Eq. (6) and in Eq. (10). In Eq. (16), we set to for RGB stream, and for flow stream. is set as . The whole architecture is implemented with PyTorch [33] and trained on single NVIDIA Tesla M40 GPU using Adam optimizer [15] with learning rate of . To stabilize the training of DGAM, we leverage a warm-up strategy in the first 300 epochs when updating and .

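Metric (as described in Section 4.3)                    w/o GAM  w/ GAM
Frames falsely captured by attention (↓)                0.777    0.698
Action frames omitted by attention (↓)                  0.858    0.707
Classifier false positives filtered by attention (↑)    1.522    1.543
Classifier false negatives captured by attention (↑)    0.001    0.001
Table 2: Statistics comparison on THUMOS14 with/without generative attention modeling. ↓ indicates lower is better, ↑ indicates higher is better. For details of notation, please refer to Section 4.3.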

Method              Supervision  Feature  mAP@IoU
                                          0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
S-CNN [42]          Full         -        47.7  43.5  36.3  28.7  19.0  10.3  5.3   -     -
R-C3D [52]          Full         -        54.5  51.5  44.8  35.6  28.9  -     -     -     -
SSN [58]            Full         -        66.0  59.4  51.9  41.0  29.8  -     -     -     -
Chao et al. [5]     Full         -        59.8  57.1  53.2  48.5  42.8  33.8  20.8  -     -
BSN [23]            Full         -        -     -     53.5  45.0  36.9  28.4  20.0  -     -
P-GCN [56]          Full         -        69.5  67.8  63.6  57.8  49.1  -     -     -     -
Hide-and-Seek [44]  Weak         -        36.4  27.8  19.5  12.7  6.8   -     -     -     -
UntrimmedNet [49]   Weak         -        44.4  37.7  28.2  21.1  13.7  -     -     -     -
Zhong et al. [59]   Weak         -        45.8  39.0  31.1  22.5  15.9  -     -     -     -
AutoLoc [41]        Weak         UNT      -     -     35.8  29.0  21.2  13.4  5.8   -     -
CleanNet [26]       Weak         UNT      -     -     37.0  30.9  23.9  13.9  7.1   -     -
STPN [30]           Weak         I3D      52.0  44.7  35.5  25.8  16.9  9.9   4.3   1.2   0.1
MAAN [55]           Weak         I3D      59.8  50.8  41.1  30.6  20.3  12.0  6.9   2.6   0.2
W-TALC [34]         Weak         I3D      55.2  49.6  40.1  31.1  22.8  -     7.6   -     -
Liu et al. [24]     Weak         I3D      57.4  50.8  41.2  32.1  23.1  15.0  7.0   -     -
TSM [54]            Weak         I3D      -     -     39.5  -     24.5  -     7.1   -     -
3C-Net [29]         Weak         I3D      56.8  49.8  40.9  32.3  24.6  -     7.7   -     -
Nguyen et al. [31]  Weak         I3D      60.4  56.0  46.6  37.5  26.8  17.6  9.0   3.3   0.4
DGAM                Weak         I3D      60.0  54.2  46.8  38.2  28.8  19.8  11.4  3.6   0.4
Table 3: Results on THUMOS14 testing set. We report mAP values at IoU thresholds 0.1:0.1:0.9. Recent works in both fully-supervised and weakly-supervised settings are reported. UNT and I3D represent UntrimmedNet and I3D feature extractor, respectively. Our method outperforms the state-of-the-art methods, especially at high IoU threshold, which means that our model could produce finer and more precise predictions. Compared to fully-supervised methods, our DGAM can achieve close or even better performance.

4.3 Statistical Evaluation on Attention

We first evaluate the learned attention of DGAM and its effectiveness in handling action-context confusion. For comparison, an “old” model is trained by removing the generative attention modeling (GAM) from DGAM, and our DGAM is denoted as the “new” model. Note that only the Attention and Classification modules are involved during inference. When evaluating, we assemble specific models by alternately choosing the two modules from the “old” or “new” models. Table 1 details the mAP results on THUMOS14. The new attention module largely improves the performance, while there is little or no improvement with the new classification module. This observation indicates that DGAM indeed learns better attention values. Even with the “old” classifier, the “new” attention boosts localization significantly.

We further collect several statistics to show the improvement intuitively in Table 2. Experiments are conducted on both “old” (w/o GAM) and “new” (w/ GAM) models. In particular, att (cls) indicates the set of frames with attention values (classification scores) larger than a threshold, and gt is the set of ground truth frames. The statistics are built from set sizes using set exclusion, intersection, and complement. Though such simple thresholding does not exactly yield the predicted locations, it reflects the quality of localization to some extent.

In Table 2, the first two statistics indicate the percentage of frames falsely captured or omitted by attention, respectively. Both false activation and omission are reduced with GAM. Moreover, the improvement in the third statistic demonstrates that GAM can better filter out the false positives (e.g., context frames) made by the classifier. The fourth statistic measures how well attention can capture the false negatives, i.e., action frames neglected by the classifier. Since GAM is devised for excluding the false positives produced by the classifier, it is not surprising that GAM contributes little to it.

L_fg  L_bg  L_guide  L_re  mAP@0.5
✓     -     -        -     21.5
✓     ✓     -        -     24.8
✓     ✓     ✓        -     26.7
✓     ✓     ✓        ✓     28.8
Table 4: Contribution of each design in DGAM on THUMOS14. Note that when adding L_re, L_CVAE is involved simultaneously.

4.4 Ablation Studies

Next we study how each component in DGAM influences the overall performance. We start with the basic model that directly optimizes the attention-based foreground classification loss $\mathcal{L}_{fg}$. The background classification loss $\mathcal{L}_{bg}$, the self-guided regularization loss $\mathcal{L}_{guide}$, and the feature reconstruction loss $\mathcal{L}_{re}$ are further included step by step. Note that adding $\mathcal{L}_{re}$ indicates involving the generative attention modeling, where $\mathcal{L}_{CVAE}$ is also optimized.

Table 4 summarizes the performance by considering one more factor at each stage on THUMOS14. Background classification is a general approach for both video recognition and localization. In our case, it is part of our discriminative attention modeling, which brings a performance gain of 3.3% (from 21.5% to 24.8%). Self-guided regularization is the additional optimization of our system, which leads to a 1.9% mAP improvement. Our generative attention modeling further contributes a significant increase of 2.1%, and the performance of DGAM finally reaches 28.8%.

Method             Supervision  mAP@IoU
                                0.5   0.55  0.6   0.65  0.7   0.75  0.8   0.85  0.9   0.95  AVG
SSN [58]           Full         41.3  38.8  35.9  32.9  30.4  27.0  22.2  18.2  13.2  6.1   26.6
UntrimmedNet [49]  Weak         7.4   6.1   5.2   4.5   3.9   3.2   2.5   1.8   1.2   0.7   3.6
AutoLoc [41]       Weak         27.3  24.9  22.5  19.9  17.5  15.1  13.0  10.0  6.8   3.3   16.0
W-TALC [34]        Weak         37.0  33.5  30.4  25.7  14.6  12.7  10.0  7.0   4.2   1.5   18.0
TSM [54]           Weak         28.3  26.0  23.6  21.2  18.9  17.0  14.0  11.1  7.5   3.5   17.1
3C-Net [29]        Weak         35.4  -     -     -     22.9  -     -     -     8.5   -     21.1
CleanNet [26]      Weak         37.1  33.4  29.9  26.7  23.4  20.3  17.2  13.9  9.2   5.0   21.6
Liu et al. [24]    Weak         36.8  -     -     -     -     22.0  -     -     -     5.6   22.4
DGAM               Weak         41.0  37.5  33.5  30.1  26.9  23.5  19.8  15.5  10.8  5.3   24.4
Table 5: Results on ActivityNet1.2 validation set. We report mAP at different IoU thresholds and mAP@AVG (average mAP on thresholds 0.5:0.05:0.95). Note that some of the compared methods utilize a weaker feature extractor than others. Our method outperforms state-of-the-art methods by a large margin, with an improvement of 2% on mAP@AVG. Our result is also comparable to fully-supervised models.

Figure 4: Evaluation on latent prior discrepancy on THUMOS14. We show mAP@0.5 with different $r$. Larger $r$ indicates larger discrepancy between the priors of $z_t$ under different attentions $\lambda_t$.

(dim)    4     5     6     7     8     9
mAP@0.5  26.5  27.5  28.0  28.8  28.3  27.7
Table 6: Evaluation on dimension of latent space on THUMOS14. We experiment with different dimensions of $z_t$.

β        0.01  0.03  0.07  0.1   0.3   0.7
mAP@0.5  28.2  28.1  28.4  28.8  28.0  28.4
Table 7: Evaluation on parameter β for reconstruction-sampling trade-off in CVAE. mAP@0.5 is reported on THUMOS14.

4.5 Evaluation on Parameters

To further understand the proposed model, we conduct evaluations to analyze the impact of different parameter settings in DGAM. mAP@ on THUMOS14 is reported.

Discrepancy between latent prior of different . In generative attention modeling, different attentions correspond to different feature distributions . The discrepancy between these distributions can be implicitly modeled by the discrepancy between latent codes sampled from different priors, which are modeled as different Gaussian distributions . Here controls the discrepancy. We evaluate every 0.25 from 0 to 1.5, and the results are shown in Figure 4. In general, the performance is relatively stable with small fluctuation, demonstrating the robustness of DGAM.

Dimension of latent space. The dimension of latent space in CVAE is crucial for quality of reconstruction and complexity of modeled distribution. High dimension can facilitate the approximation of feature distribution, hence leading to more accurate attention learning. However, more training data is also required. We evaluate different dimensions of , . As shown in Table 6, mAP improves rapidly with increasing dimension, which indicates better generative attention modeling. The result reaches the peak at dimension . After that, the performance starts dropping, partially because of the sparsity of limited data in high-dimensional latent space.

Reconstruction-sampling trade-off in CVAE. The trade-off hyper-parameter in Eq. (10) balances reconstruction quality (the first term) and sampling accuracy (the second term). With a larger value, we expect the approximated posterior to be closer to the prior, which improves precision when sampling latent vectors from the prior, while the reconstruction quality (i.e., the quality of the learned distribution) decreases. We test values between 0 and 1. As shown in Table 7, the performance fluctuates in a small range from 28% to 28.8%, indicating that our method is insensitive to this hyper-parameter.
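The balance described above can be sketched as a generic weighted CVAE objective. This is a minimal sketch and not Eq. (10) itself: it assumes a squared-error reconstruction term, a diagonal Gaussian posterior, a unit-variance Gaussian prior, and a hypothetical weight named `kl_weight`.

```python
import math


def gaussian_kl(post_mean, post_logvar, prior_mean):
    """KL( N(post_mean, diag(exp(post_logvar))) || N(prior_mean, I) )."""
    total = 0.0
    for m, lv, pm in zip(post_mean, post_logvar, prior_mean):
        total += math.exp(lv) + (m - pm) ** 2 - 1.0 - lv
    return 0.5 * total


def cvae_loss(x, x_recon, post_mean, post_logvar, prior_mean, kl_weight):
    """Reconstruction error plus a weighted posterior-to-prior KL term."""
    # First term: how well the decoder reproduces the input feature.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    # Second term: how far the approximate posterior is from the prior.
    kl = gaussian_kl(post_mean, post_logvar, prior_mean)
    return recon + kl_weight * kl
```

A larger `kl_weight` penalizes posteriors that drift from the prior, so latent codes sampled from the prior at test time resemble those seen during training, at the cost of a looser reconstruction.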

4.6 Comparisons with State-of-the-Art

Table 3 compares our DGAM with existing approaches in both weakly-supervised and fully-supervised action localization on THUMOS14. Our method outperforms other weakly-supervised methods, especially at high IoU thresholds, which means DGAM produces finer and more precise predictions. Compared with the state of the art, DGAM improves mAP at IoU=0.5 by 2%. Note that Nguyen et al. [31] achieve better performance at IoU=0.1 and 0.2 than our model, partly because our generative attention modeling may discard out-of-distribution hard candidates (outliers), which become common when the IoU threshold is low. Furthermore, our results are comparable with several fully-supervised methods, indicating the effectiveness of the proposed DGAM.

On ActivityNet1.2, we summarize the performance comparisons in Table 5. Our method significantly outperforms the state of the art. In particular, DGAM surpasses the best competitor by 2% on mAP@AVG. Our method also achieves results comparable to fully-supervised methods.
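For readers unfamiliar with the metrics, the following minimal sketch shows temporal IoU between two segments and the threshold grid 0.5:0.05:0.95 behind mAP@AVG. It is an illustration only, not the official evaluation code of either benchmark; the function names are hypothetical, and the per-threshold mAP values are assumed to be computed elsewhere.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def avg_thresholds(start=0.5, stop=0.95, step=0.05):
    """Threshold grid start:step:stop, rounded to two decimals."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]


def map_avg(map_at_threshold):
    """Average per-threshold mAP; input maps each threshold to its mAP."""
    thresholds = avg_thresholds()
    return sum(map_at_threshold[t] for t in thresholds) / len(thresholds)
```

A prediction counts as correct at a given threshold only if its temporal IoU with a ground-truth segment reaches that threshold, so higher thresholds reward tighter boundaries.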

5 Conclusion

We have presented a novel Discriminative and Generative Attention Modeling (DGAM) method to solve the action-context confusion issue in weakly-supervised action localization. In particular, we study the problem of modeling frame-wise attention based on the distribution of frame features. Observing that context features differ clearly from action features, we devise a conditional variational auto-encoder (CVAE) to construct different feature distributions conditioned on different attentions. The learned CVAE in turn refines the frame-wise attention according to the frame features. Experiments conducted on two benchmarks, i.e., THUMOS14 and ActivityNet1.2, validate our method and analysis. Moreover, we achieve new state-of-the-art results on both datasets.

Acknowledgements  This work is supported by Beijing Municipal Commission of Science and Technology under Grant Z181100008918005 and National Natural Science Foundation of China (NSFC) under Grant 61772037. Baifeng Shi thanks Prof. Tingting Jiang and Daochang Liu for enlightening discussions.

References


  • [1] H. Alwassel, F. Caba Heilbron, and B. Ghanem (2018) Action search: spotting actions in videos and its application to temporal action localization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 251–266. Cited by: §1.
  • [2] S. Buch, V. Escorcia, B. Ghanem, L. Fei-Fei, and J. C. Niebles (2017) End-to-end, single-stream temporal action detection in untrimmed videos. In British Machine Vision Conference (BMVC), Vol. 2, pp. 7. Cited by: §2.
  • [3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles (2015) Activitynet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 961–970. Cited by: §1, §4.1.
  • [4] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6299–6308. Cited by: §2, §4.2.
  • [5] Y. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar (2018) Rethinking the faster r-cnn architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1130–1139. Cited by: §1, §2, Table 3.
  • [6] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Qiu Chen (2017) Temporal context network for activity localization in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5793–5802. Cited by: §1, §2.
  • [7] J. Gao, Z. Yang, and R. Nevatia (2017) Cascaded boundary regression for temporal action detection. In British Machine Vision Conference (BMVC), Cited by: §2.
  • [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587. Cited by: §2.
  • [9] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448. Cited by: §2.
  • [10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems (NeurIPS), pp. 2672–2680. Cited by: §2.
  • [11] F. C. Heilbron, W. Barrios, V. Escorcia, and B. Ghanem (2017) Scc: semantic context cascade for efficient action detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3175–3184. Cited by: §2.
  • [12] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • [13] H. Idrees, A. R. Zamir, Y. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah (2017) The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding (CVIU) 155, pp. 1–23. Cited by: §1, §4.1.
  • [14] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §4.2.
  • [15] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.2.
  • [16] D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems (NeurIPS), pp. 10215–10224. Cited by: §2.
  • [17] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2, §3.3.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NeurIPS), pp. 1097–1105. Cited by: §1.
  • [19] I. Laptev (2005) On space-time interest points. International Journal of Computer Vision (IJCV) 64 (2-3), pp. 107–123. Cited by: §2.
  • [20] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei (2018) Recurrent tubelet proposal and recognition networks for action detection. In Proceedings of the European conference on computer vision (ECCV), pp. 303–318. Cited by: §1.
  • [21] D. Li, T. Yao, Z. Qiu, H. Li, and T. Mei (2019) Long short-term relation networks for video action detection. In Proceedings of the ACM International Conference on Multimedia (MM), pp. 629–637. Cited by: §1.
  • [22] T. Lin, X. Zhao, and Z. Shou (2017) Single shot temporal action detection. In Proceedings of the ACM international conference on Multimedia (MM), pp. 988–996. Cited by: §2.
  • [23] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang (2018) Bsn: boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §1, §2, Table 3.
  • [24] D. Liu, T. Jiang, and Y. Wang (2019) Completeness modeling and context separation for weakly supervised temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1298–1307. Cited by: §1, §1, §2, §3.5, Table 3, Table 5.
  • [25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European Conference on Computer Vision (ECCV), pp. 21–37. Cited by: §2.
  • [26] Z. Liu, L. Wang, Q. Zhang, Z. Gao, Z. Niu, N. Zheng, and G. Hua (2019) Weakly supervised temporal action localization through contrast based evaluation networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3899–3908. Cited by: Table 3, Table 5.
  • [27] F. Long, T. Yao, Z. Qiu, X. Tian, J. Luo, and T. Mei (2019) Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 344–353. Cited by: §1.
  • [28] Y. Ma, X. Hua, L. Lu, and H. Zhang (2005) A generic framework of user attention model and its application in video summarization. IEEE Transaction on multimedia (TMM) 7 (5), pp. 907–919. Cited by: §1.
  • [29] S. Narayan, H. Cholakkal, F. S. Khan, and L. Shao (2019) 3C-net: category count and center loss for weakly-supervised action localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 8679–8687. Cited by: §1, §2, Table 3, Table 5.
  • [30] P. Nguyen, T. Liu, G. Prasad, and B. Han (2018) Weakly supervised action localization by sparse temporal pooling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6752–6761. Cited by: §1, §2, §3.4, §4.2, Table 3.
  • [31] P. X. Nguyen, D. Ramanan, and C. C. Fowlkes (2019) Weakly-supervised action localization with background modeling. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5502–5511. Cited by: §1, §2, §3.2, §3.4, §4.1, §4.2, §4.6, Table 3.
  • [32] D. Oneata, J. Verbeek, and C. Schmid (2013) Action and event recognition with fisher vectors on a compact feature set. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1817–1824. Cited by: §2.
  • [33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, Cited by: §4.2.
  • [34] S. Paul, S. Roy, and A. K. Roy-Chowdhury (2018) W-talc: weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 563–579. Cited by: §1, §2, Table 3, Table 5.
  • [35] J. S. Pérez, E. Meinhardt-Llopis, and G. Facciolo (2013) TV-l1 optical flow estimation. Image Processing On Line (IPOL) 2013, pp. 137–150. Cited by: §4.2.
  • [36] Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5533–5541. Cited by: §2.
  • [37] Z. Qiu, T. Yao, C. Ngo, X. Tian, and T. Mei (2019) Learning spatio-temporal representation with local and global diffusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12056–12065. Cited by: §1.
  • [38] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788. Cited by: §2.
  • [39] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems (NeurIPS), pp. 91–99. Cited by: §2.
  • [40] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S. Chang (2017) Cdc: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5734–5743. Cited by: §2.
  • [41] Z. Shou, H. Gao, L. Zhang, K. Miyazawa, and S. Chang (2018) Autoloc: weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 154–171. Cited by: §1, §2, §3.5, §4.1, Table 3, Table 5.
  • [42] Z. Shou, D. Wang, and S. Chang (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1049–1058. Cited by: §1, §2, Table 3.
  • [43] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems (NeurIPS), pp. 568–576. Cited by: §1, §2.
  • [44] K. K. Singh and Y. J. Lee (2017) Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3544–3553. Cited by: Table 3.
  • [45] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems (NeurIPS), pp. 3483–3491. Cited by: §2.
  • [46] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497. Cited by: §2.
  • [47] S. Vishwakarma and A. Agrawal (2013) A survey on activity recognition and behavior understanding in video surveillance. The Visual Computer (Vis Comput) 29 (10), pp. 983–1009. Cited by: §1.
  • [48] H. Wang and C. Schmid (2013) Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3551–3558. Cited by: §2.
  • [49] L. Wang, Y. Xiong, D. Lin, and L. Van Gool (2017) Untrimmednets for weakly supervised action recognition and detection. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 4325–4334. Cited by: §1, §2, §4.1, Table 3, Table 5.
  • [50] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), pp. 20–36. Cited by: §2.
  • [51] B. Xiong, Y. Kalantidis, D. Ghadiyaram, and K. Grauman (2019) Less is more: learning highlight detection from video duration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1258–1267. Cited by: §1.
  • [52] H. Xu, A. Das, and K. Saenko (2017) R-c3d: region convolutional 3d network for temporal activity detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5783–5792. Cited by: §1, §2, Table 3.
  • [53] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei (2016) End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2678–2687. Cited by: §1.
  • [54] T. Yu, Z. Ren, Y. Li, E. Yan, N. Xu, and J. Yuan (2019) Temporal structure mining for weakly supervised action detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5522–5531. Cited by: Table 3, Table 5.
  • [55] Y. Yuan, Y. Lyu, X. Shen, I. W. Tsang, and D. Yeung (2019) Marginalized average attentional network for weakly-supervised learning. In International Conference on Learning Representations (ICLR), Cited by: §1, §2, Table 3.
  • [56] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan (2019) Graph convolutional networks for temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 7094–7103. Cited by: Table 3.
  • [57] D. Zhang, X. Dai, X. Wang, and Y. Wang (2018) S3D: single shot multi-span detector via fully 3d convolutional network. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: §2.
  • [58] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin (2017) Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2914–2923. Cited by: §2, §4.1, Table 3, Table 5.
  • [59] J. Zhong, N. Li, W. Kong, T. Zhang, T. H. Li, and G. Li (2018) Step-by-step erasion, one-by-one collection: a weakly supervised temporal action detector. In Proceedings of the ACM international conference on Multimedia (MM), pp. 35–44. Cited by: Table 3.
  • [60] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929. Cited by: §1, §3.4.