Action localization is one of the most challenging tasks in video analytics and understanding [43, 42, 20, 37, 21]. The goal is to predict accurate start and end timestamps of different human actions. Owing to its wide applications (e.g., surveillance, video summarization, highlight detection), action localization has drawn considerable attention in the community. Thanks to powerful convolutional neural networks (CNNs), performance on this task has surged in the past few years [42, 53, 6, 52, 5, 1, 23, 27]. Nevertheless, these fully-supervised methods require temporal annotations of action intervals during training, which are extremely expensive and time-consuming to obtain. Therefore, the task of weakly-supervised action localization (WSAL) has been put forward, where only video-level category labels are available.
Existing WSAL methods can be roughly divided into two paradigms. The first generally builds a top-down pipeline, which learns a video-level classifier and then obtains frame attention by checking the produced temporal class activation map (TCAM). Note that a frame here denotes a small snippet from which appearance or motion features can be extracted. The second paradigm, in contrast, works in a bottom-up way, i.e., temporal attention is directly predicted from raw data [30, 31, 41, 55]. The attention is then optimized through video classification with video-level supervision. Frames with high attention are treated as action, and the remaining frames as background.
Both kinds of methods largely rely on the video-level classification model, which leads to the intractable action-context confusion issue in the absence of frame-wise labels. Take the long jump in Figure 1 as an example. The action has three stages, i.e., approaching, jumping, and landing. In addition, the frames before and after the action, i.e., preparing and finishing, contain content that is closely related to long jump but are not part of the action. We refer to such frames as context, which is a special kind of background. In this example, the context parts include the track field and sandpit, which can in fact significantly encourage the recognition of the action. Without frame-wise annotations, the classifier is normally learned by aggregating the features of all related frames, where context and action are roughly mixed up. The context frames thus tend to be recognized as action frames themselves. The action-context confusion problem has not been fully studied, though it is common in WSAL. One recent exploration attempts to solve the problem by assuming a strong prior that context clips should be stationary, i.e., contain no motion. However, such an assumption is severely limited and ignores the inherent difference between context and action.
To separate context and action, the model should be able to capture the underlying discrepancy between them. Intuitively, context frames do exhibit an obvious difference from action frames at the appearance or motion level. For example, among the five stages in Figure 1, the action stages (approaching, jumping, and landing) clearly demonstrate more intense body postures than the context stages (preparing and finishing). In other words, the extracted feature representations for context and action are also different. Such a difference exists regardless of the action category.
Inspired by this observation, we propose a novel generative attention mechanism to model the frame representation conditioned on frame attention. In addition to the above intuition, we build a graphical model to theoretically demonstrate that the localization problem is associated with both the conventional classification and the proposed representation modeling. Our framework thus consists of two parts: the Discriminative and Generative Attention Modeling (DGAM). On one hand, the discriminative attention modeling trains a classification model on temporally pooled features weighted by the frame attention. On the other hand, a generative model, i.e., a conditional Variational Auto-Encoder (VAE), is learned to model the class-agnostic frame-wise distribution of representation conditioned on attention values. By maximizing the likelihood of the representation, the frame-wise attention is optimized accordingly, leading to a good separation of action and context frames. Extensive experiments are conducted on THUMOS14 and ActivityNet1.2 to show that DGAM outperforms the state of the art by a significant margin. Comprehensive analysis further validates its effectiveness in separating action and context.
The main contribution of this work is the proposed DGAM framework for addressing the issue of action-context confusion in WSAL by modeling the frame representation conditioned on different attentions. The solution has led to elegant views of how localization is associated with the representation distribution and how to learn better attentions by modeling the representation, which have not been discussed in the literature.
2 Related Works
Video action recognition is a fundamental problem in video analytics. Most video-related tasks leverage off-the-shelf action recognition models to extract features for further analysis. Early methods normally devise hand-crafted features [19, 48, 32]. More recent deep models include the two-stream network [43], temporal segment network (TSN), 3D ConvNet (C3D), Pseudo 3D (P3D), and Inflated 3D (I3D). In our experiments, I3D is utilized for feature extraction.
Fully-supervised action localization has been extensively studied recently. Many works follow the paradigms that are widely applied in object detection area [8, 9, 39, 38, 25] due to their commonalities in problem setting. To be more specific, there are mainly two directions, namely two-stage method and one-stage method. Two-stage methods [58, 52, 6, 5, 42, 40, 7, 11, 23] first generate action proposals and then classify them with further refinement on temporal boundaries. One-stage methods [2, 22, 57] instead predict action category and location directly from raw data. In fully-supervised setting, the action-context confusion could be alleviated with frame-wise annotations.
Weakly-supervised action localization is drawing increasing attention due to the time-consuming manual labeling in the fully-supervised setting. As introduced in Section 1, WSAL methods can be grouped into two categories, namely top-down and bottom-up methods. In the top-down pipeline (e.g. UntrimmedNet), a video-level classification model is learned first, and then frames with high classification activation are selected as action locations. W-TALC and 3C-Net also force foreground features from the same class to be similar, and those from different classes to be dissimilar. Unlike the top-down scheme, the bottom-up methods directly produce the attention for each frame from data, and train a classification model with the features weighted by attention. Based on this paradigm, STPN further adds a regularization term to encourage the sparsity of action. AutoLoc proposes the Outer-Inner-Contrastive (OIC) loss by assuming that a complete action clip should look different from its neighbours. MAAN proposes to suppress the dominance of the most salient action frames and retrieve less salient ones. Nguyen et al. propose to penalize the discriminative capacity of background, which is also utilized in our classification module. Besides, a video-level clustering loss has been applied to separate foreground and background. Nevertheless, all of the aforementioned methods ignore the challenging action-context confusion issue caused by the absence of frame-wise labels. Though Liu et al. try to separate action and context using hard negative mining, their method is based on the strong assumption that context clips should be stationary, which has many limitations and may hence harm the prediction.
Generative models have also developed rapidly in recent years [17, 10, 12]. GAN employs a generator to approximate the real data distribution through adversarial training between generator and discriminator. However, the learned distribution is implicitly determined by the generator and thus cannot be analytically expressed. VAE approximates the real distribution by optimizing the variational lower bound on the marginal likelihood of data. Given a latent code, the conditional distribution is explicitly modeled as a Gaussian distribution, hence the data distribution can be analytically expressed by sampling latent vectors and calculating the Gaussian. Flow-based models use invertible layers as the generative mapping, where the data distribution can be calculated given the Jacobian of each layer. However, all layers must have the same dimensions, which is much less flexible. In our work, we exploit Conditional VAE (CVAE) to model the frame feature distribution conditioned on attention value.
Suppose we have a set of training videos and the corresponding video-level labels. For each video, we sample $T$ frames (snippets) and extract the RGB or optical flow features $X = (x_t)_{t=1}^{T}$ with a pre-trained model, where $x_t \in \mathbb{R}^d$ is the feature of frame $t$, and $d$ is the feature dimension. The video-level label is denoted as $y \in \{0, 1, \dots, C\}$, where $C$ is the number of classes and $0$ corresponds to background. For brevity, we assume that each video only belongs to one class, though the following discussion can also apply to multi-label videos.
Our method follows the bottom-up pipeline for WSAL, which learns the attention $\lambda = (\lambda_t)_{t=1}^{T}$ directly from data, where $\lambda_t \in [0, 1]$ is the attention of frame $t$. Before discussing the details of our method, we examine the action localization problem from the beginning.
3.1 Attention-based Framework
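In the attention-based action localization problem, the target is to predict the frame attention $\lambda$, which is equivalent to solving the maximum a posteriori (MAP) problem:
$$\max_{\lambda \in [0,1]} \log p(\lambda \mid X, y), \tag{1}$$
where $p(\lambda \mid X, y)$ is the unknown probability distribution of $\lambda$ given $X$ and $y$. In the absence of frame-level labels (ground truth of $\lambda$), it is difficult to approximate and optimize $p(\lambda \mid X, y)$ directly. Therefore, we transform the optimization target using Bayes' theorem,
$$\begin{aligned} \log p(\lambda \mid X, y) &= \log p(X, y \mid \lambda) + \log p(\lambda) - \log p(X, y) \\ &= \log p(y \mid X, \lambda) + \log p(X \mid \lambda) + \log p(\lambda) - \log p(X, y) \\ &= \log p(y \mid X, \lambda) + \log p(X \mid \lambda) + \text{const}, \end{aligned} \tag{2}$$
where in the last step we assume a uniform prior of $\lambda$, i.e., $p(\lambda) = \text{const}$, so that both $\log p(\lambda)$ and the term $\log p(X, y)$, which does not depend on $\lambda$, are constants that can be discarded. Our optimization problem thus becomes
$$\max_{\lambda \in [0,1]} \; \log p(y \mid X, \lambda) + \log p(X \mid \lambda). \tag{3}$$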
This formulation indicates two different aspects for optimizing $\lambda$. The first term, $\log p(y \mid X, \lambda)$, prefers $\lambda$ with high discriminative capacity for action classification, which is the main optimization target in previous works. In contrast, the second term, $\log p(X \mid \lambda)$, forces the representation of frames to be accurately predicted from the attention $\lambda$. Given the feature difference between foreground and background, this objective encourages the model to impose different attentions on different features. Specifically, we exploit a generative model to approximate $p(X \mid \lambda)$, and force the feature $X$ to be accurately reconstructed by the model.
Figure 2 shows the graphical model of the above problem. The model parameters and the latent variables of the generative model will be discussed later. Based on (3), the framework of our method consists of two components, i.e., the discriminative attention modeling and the generative attention modeling, as illustrated in Figure 3.
3.2 Discriminative Attention Modeling
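The discriminative attention module learns the frame attention by optimizing the video-level recognition task. Specifically, we utilize the attention $\lambda$ as weight to perform temporal average pooling over all frames in the video and produce a video-level foreground feature $x_{fg} \in \mathbb{R}^d$ given by
$$x_{fg} = \frac{\sum_{t=1}^{T} \lambda_t x_t}{\sum_{t=1}^{T} \lambda_t}. \tag{4}$$
Similarly, we can also utilize $1 - \lambda$ as the weight to calculate a background feature $x_{bg}$:
$$x_{bg} = \frac{\sum_{t=1}^{T} (1 - \lambda_t)\, x_t}{\sum_{t=1}^{T} (1 - \lambda_t)}. \tag{5}$$
To optimize $\lambda$, we encourage high discriminative capability of the foreground feature $x_{fg}$ and simultaneously punish any discriminative capability of the background feature $x_{bg}$. This is equivalent to minimizing the following discriminative loss (i.e. softmax loss):
$$\mathcal{L}_d = \mathcal{L}_{fg} + \alpha \cdot \mathcal{L}_{bg} = -\log p_\theta(y \mid x_{fg}) - \alpha \cdot \log p_\theta(0 \mid x_{bg}), \tag{6}$$
where $\alpha$ is a hyper-parameter, and $p_\theta$ is our classification module modeled by a fully-connected layer with weight $w_c \in \mathbb{R}^d$ for each class $c$ and a following softmax layer. During training, the attention module and the classification module are jointly optimized. The graphical model of this part is illustrated in Figure 2 with dash-dot lines.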
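As a minimal sketch of the pooling and loss described in this section, the pure-Python snippet below computes attention-weighted foreground and background features and a softmax-based discriminative loss. The function names, the toy linear classifier without bias, and the default value of `alpha` are illustrative assumptions rather than the authors' implementation.

```python
import math


def weighted_pool(features, weights):
    """Weighted temporal average of per-frame feature vectors."""
    total = max(sum(weights), 1e-8)  # guard against all-zero weights
    dim = len(features[0])
    return [sum(w * x[i] for w, x in zip(weights, features)) / total
            for i in range(dim)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of class logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def discriminative_loss(features, attention, class_weights, label, alpha=0.1):
    """Foreground loss on the labeled class plus alpha times the background loss.

    class_weights[c] is the weight vector of class c; index 0 is background.
    alpha=0.1 is a placeholder value, not taken from the paper.
    """
    fg = weighted_pool(features, attention)
    bg = weighted_pool(features, [1.0 - a for a in attention])

    def logits(x):
        return [sum(wi * xi for wi, xi in zip(w, x)) for w in class_weights]

    loss_fg = -log_softmax(logits(fg))[label]  # foreground should predict the label
    loss_bg = -log_softmax(logits(bg))[0]      # background should predict class 0
    return loss_fg + alpha * loss_bg
```

In a real model the two modules would be differentiable and trained jointly; the sketch only illustrates how the attention enters both pooled features.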
3.3 Generative Attention Modeling
The discriminative attention optimization generally has difficulty in separating context and foreground when frame-wise annotations are unavailable. Based on the observation that context differs from foreground in terms of feature representation, we utilize a Conditional Variational Auto-Encoder (CVAE) to model the representation distribution of different frames. Before explaining the details, we briefly review the Variational Auto-Encoder (VAE).
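Given the observed variable $x$, VAE introduces a latent variable $z$, and aims to generate $x$ from $z$, i.e.,
$$p_\psi(x) = \mathbb{E}_{p_\psi(z)}\left[\, p_\psi(x \mid z) \,\right], \tag{7}$$
where $\psi$ denotes the parameters of the generative model, $p_\psi(z)$ is the prior (e.g. a standard Gaussian), and $p_\psi(x \mid z)$ is the conditional distribution indicating the generation procedure, which is typically estimated with a neural network $f_\psi(\cdot)$ that is referred to as the decoder. The key idea is to sample values of $z$ that are likely to produce $x$, which means that we need an approximation $q_\phi(z \mid x)$ to the intractable posterior $p(z \mid x)$. $\phi$ denotes the parameters of the approximation model, and $q_\phi(z \mid x)$ is also estimated via a neural network $f_\phi(\cdot)$, which is referred to as the encoder. VAE incorporates the encoder $f_\phi(\cdot)$ and the decoder $f_\psi(\cdot)$, and learns the parameters by maximizing the variational lower bound:
$$\mathcal{L}(\psi, \phi) = -KL\left(q_\phi(z \mid x) \,\|\, p_\psi(z)\right) + \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\psi(x \mid z)\right], \tag{8}$$
where $KL(q \,\|\, p)$ is the KL divergence of $p$ from $q$.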
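In our DGAM model, we expect to generate the observation $X$ based on the attention $\lambda$, i.e., $p(X \mid \lambda)$, which can be written as $\prod_{t=1}^{T} p(x_t \mid \lambda_t)$ by assuming independence between frames in a video. Similarly, we introduce a latent variable $z_t$, and attempt to generate each $x_t$ from $z_t$ and $\lambda_t$, which forms a Conditional VAE problem:
$$p_\psi(x_t \mid \lambda_t) = \mathbb{E}_{p_\psi(z_t \mid \lambda_t)}\left[\, p_\psi(x_t \mid \lambda_t, z_t) \,\right]. \tag{9}$$
Note that the desired distribution of $x_t$ is modeled as a Gaussian, i.e., $p_\psi(x_t \mid \lambda_t, z_t) = \mathcal{N}(x_t \mid f_\psi(\lambda_t, z_t), \sigma^2 \cdot I)$, where $f_\psi(\cdot)$ is the decoder, $\sigma$ is a hyper-parameter, and $I$ is the unit matrix. Ideally, $z_t$ is sampled from the prior $p_\psi(z_t \mid \lambda_t)$. In DGAM, we set the prior as a Gaussian, i.e., $p_\psi(z_t \mid \lambda_t) = \mathcal{N}(z_t \mid r\lambda_t \cdot \mathbf{1}, I)$, where $\mathbf{1}$ is the all-ones vector and $r$ is a hyper-parameter indicating the discrepancy between priors of different attention values $\lambda_t$. When $r = 0$, the prior is independent of $\lambda_t$.
During training of CVAE, we also approximate the intractable posterior $p(z_t \mid x_t, \lambda_t)$ by a Gaussian $q_\phi(z_t \mid x_t, \lambda_t) = \mathcal{N}(z_t \mid \mu_\phi, \Sigma_\phi)$, where $\mu_\phi$ and $\Sigma_\phi$ are the outputs of the encoder $f_\phi(x_t, \lambda_t)$. We then minimize the variational loss $\mathcal{L}_{CVAE}$:
$$\begin{aligned} \mathcal{L}_{CVAE} &= -\mathbb{E}_{q_\phi(z_t \mid x_t, \lambda_t)}\left[\log p_\psi(x_t \mid \lambda_t, z_t)\right] + \beta \cdot KL\left(q_\phi(z_t \mid x_t, \lambda_t) \,\|\, p_\psi(z_t \mid \lambda_t)\right) \\ &\simeq -\frac{1}{L}\sum_{l=1}^{L} \log p_\psi(x_t \mid \lambda_t, z_t^{(l)}) + \beta \cdot KL\left(q_\phi(z_t \mid x_t, \lambda_t) \,\|\, p_\psi(z_t \mid \lambda_t)\right), \end{aligned} \tag{10}$$
where $z_t^{(l)}$ is the $l$-th sample from $q_\phi(z_t \mid x_t, \lambda_t)$. Note that the Monte Carlo estimation of the expectation is employed with $L$ samples. $\beta$ is a hyper-parameter for the trade-off between reconstruction quality and sampling accuracy.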
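To make the two terms of the CVAE objective concrete, the following sketch evaluates a single-sample version of the loss for one frame. It assumes a diagonal-covariance encoder output (`mu`, `log_var`) and an attention-dependent prior with mean `r * lam` in every latent dimension and identity covariance; all function names and default values are hypothetical.

```python
import math


def kl_to_prior(mu, log_var, lam, r):
    """KL( N(mu, diag(exp(log_var))) || N(r * lam * 1, I) ) in closed form."""
    mean = r * lam
    return 0.5 * sum(math.exp(lv) + (m - mean) ** 2 - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def gaussian_nll(x, recon, sigma):
    """Negative log-density of x under N(recon, sigma^2 * I)."""
    sq_err = sum((a - b) ** 2 for a, b in zip(x, recon))
    return (sq_err / (2.0 * sigma ** 2)
            + 0.5 * len(x) * math.log(2.0 * math.pi * sigma ** 2))


def cvae_loss(x, recon, mu, log_var, lam, r=1.0, sigma=1.0, beta=1.0):
    """Single-sample estimate: reconstruction term plus beta-weighted KL term."""
    return gaussian_nll(x, recon, sigma) + beta * kl_to_prior(mu, log_var, lam, r)
```

Here `recon` stands for the decoder output for one latent sample drawn from the approximate posterior; a larger `beta` pulls the posterior toward the prior at the cost of reconstruction quality.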
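For the generative attention modeling of $\lambda$, we fix CVAE and minimize the reconstruction loss $\mathcal{L}_{re}$ given by
$$\mathcal{L}_{re} = -\sum_{t=1}^{T} \log\left\{ \mathbb{E}_{p_\psi(z_t \mid \lambda_t)}\left[\, p_\psi(x_t \mid \lambda_t, z_t) \,\right] \right\} \simeq -\sum_{t=1}^{T} \log\left\{ \frac{1}{L}\sum_{l=1}^{L} p_\psi(x_t \mid \lambda_t, z_t^{(l)}) \right\}, \tag{11}$$
where $z_t^{(l)}$ is sampled from the prior $p_\psi(z_t \mid \lambda_t)$. In our experiments, $L$ is set to $1$, and (11) can be written as
$$\mathcal{L}_{re} = -\sum_{t=1}^{T} \log p_\psi(x_t \mid \lambda_t, z_t) = \frac{1}{2\sigma^2}\sum_{t=1}^{T} \left\| x_t - f_\psi(\lambda_t, z_t) \right\|^2 + \text{const}. \tag{12}$$
The graphical model of generative attention modeling is illustrated in Figure 2 with solid and dashed lines.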
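The reconstruction objective for the attention can be illustrated with a small sketch: with the decoder held fixed, one latent code per frame is drawn from the attention-dependent prior and the squared reconstruction error is accumulated. The decoder below is a hypothetical toy function that ignores the latent code, chosen only so that the effect of the attention is easy to see; it is not the model used in the paper.

```python
import random


def sample_prior(lam, r, latent_dim, rng):
    """Draw z ~ N(r * lam * 1, I) for one frame."""
    return [rng.gauss(r * lam, 1.0) for _ in range(latent_dim)]


def reconstruction_loss(frames, attention, decoder, r, latent_dim, rng):
    """Sum of squared errors between each frame feature and its decoded
    estimate, using one prior sample per frame."""
    loss = 0.0
    for x, lam in zip(frames, attention):
        z = sample_prior(lam, r, latent_dim, rng)
        recon = decoder(lam, z)
        loss += sum((a - b) ** 2 for a, b in zip(x, recon))
    return loss


# Toy decoder (hypothetical): ignores z and predicts a feature equal to the attention.
toy_decoder = lambda lam, z: [lam, lam]
frames = [[1.0, 1.0], [0.0, 0.0]]
rng = random.Random(0)
good = reconstruction_loss(frames, [1.0, 0.0], toy_decoder, r=1.0, latent_dim=4, rng=rng)
bad = reconstruction_loss(frames, [0.0, 1.0], toy_decoder, r=1.0, latent_dim=4, rng=rng)
```

In this toy setting the attention that matches the feature pattern gives a lower loss than the swapped attention, which is the signal used to refine the attention when the generative model is fixed.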
In our framework, the CVAE cannot be directly and solely optimized due to the unavailability of the ground truth $\lambda$. Therefore, we propose to train the attention module and CVAE in an alternating way, i.e., we first update CVAE with a “pseudo label” of $\lambda$ given by the attention module, and then train the attention module with the CVAE fixed. The two stages are repeated for several iterations. Since there exist other loss terms for attention modeling (e.g. the discriminative loss $\mathcal{L}_d$), the pseudo label can be of high quality and hence good convergence can be reached. Experimental results empirically validate this.
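The alternating schedule described above can be sketched as a simple loop over two update callables. The function and argument names, as well as the number of steps per stage, are assumptions made for illustration; the actual update rules are whatever optimizer steps the two modules use.

```python
def alternate_training(update_cvae, update_attention, rounds, cvae_steps, attention_steps):
    """Run the two stages in turn and return the order of the updates performed.

    update_cvae: callable performing one CVAE step, using the current
        attention as pseudo label (attention module frozen).
    update_attention: callable performing one attention/classifier step
        with the CVAE frozen.
    """
    history = []
    for _ in range(rounds):
        for _ in range(cvae_steps):
            update_cvae()
            history.append("cvae")
        for _ in range(attention_steps):
            update_attention()
            history.append("attention")
    return history
```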
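In addition to the above objectives, we exploit a self-guided regularization to further refine the attention. The temporal class activation maps (TCAM) [30, 60] are utilized to produce the top-down, class-aware attention maps. Specifically, given a video with label $y$, the TCAM are computed by
$$\hat{\lambda}_t^{fg} = G(\sigma_s) * \frac{\exp(w_y^\top x_t)}{\sum_{c=0}^{C} \exp(w_c^\top x_t)}, \qquad \hat{\lambda}_t^{bg} = G(\sigma_s) * \frac{\sum_{c=1}^{C} \exp(w_c^\top x_t)}{\sum_{c=0}^{C} \exp(w_c^\top x_t)}, \tag{13}$$
where $w_c$ indicates the parameters of the classification module for class $c$. $\hat{\lambda}_t^{fg}$ and $\hat{\lambda}_t^{bg}$ are foreground and background TCAM, respectively. $G(\sigma_s)$ is a Gaussian smooth filter with standard deviation $\sigma_s$, and $*$ represents convolution. The generated $\hat{\lambda}_t^{fg}$ and $\hat{\lambda}_t^{bg}$ are expected to be consistent with the bottom-up, class-agnostic attention $\lambda$, hence the loss $\mathcal{L}_{guide}$ can be formulated as
$$\mathcal{L}_{guide} = \frac{1}{T}\sum_{t=1}^{T} \left| \lambda_t - \hat{\lambda}_t^{fg} \right| + \left| \lambda_t - \hat{\lambda}_t^{bg} \right|. \tag{14}$$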
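To sum up, we optimize the whole framework by alternately executing the following two steps:
1. Update the attention and classification modules with loss
$$\mathcal{L} = \mathcal{L}_d + \gamma_1 \mathcal{L}_{re} + \gamma_2 \mathcal{L}_{guide}, \tag{15}$$
where $\gamma_1, \gamma_2$ denote the hyper-parameters.
2. Update CVAE with loss $\mathcal{L}_{CVAE}$.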
The whole architecture is illustrated in Figure 3.
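The self-guided regularization described in this section can be sketched as follows. The snippet assumes that the foreground map is the softmax probability of the labeled class, that the background map is one minus the probability of the background class (index 0), and that smoothing uses a normalized Gaussian kernel with edge replication; these choices and all names are illustrative assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def gaussian_smooth(signal, sigma):
    """1-D Gaussian filtering with edge replication; kernel radius is 3 * sigma."""
    radius = max(1, int(3 * sigma))
    kernel = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    norm = sum(kernel)
    n = len(signal)
    out = []
    for t in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(t + j - radius, 0), n - 1)  # clamp to valid frames
            acc += (w / norm) * signal[idx]
        out.append(acc)
    return out


def tcam_attention(frame_logits, label, sigma=1.0):
    """Smoothed class-aware maps from per-frame class logits (index 0 = background)."""
    probs = [softmax(l) for l in frame_logits]
    fg = gaussian_smooth([p[label] for p in probs], sigma)
    bg = gaussian_smooth([1.0 - p[0] for p in probs], sigma)
    return fg, bg


def guide_loss(attention, fg, bg):
    """Mean absolute disagreement between the attention and the two TCAM maps."""
    n = len(attention)
    return sum(abs(a - f) + abs(a - b) for a, f, b in zip(attention, fg, bg)) / n
```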
3.5 Action Prediction
To generate action proposals for a video during inference, we feed the video to DGAM and obtain the attention $\lambda$. By filtering out frames with attention lower than a threshold $t_{att}$, we extract consecutive segments with high attention values as the predicted locations. For each segment $[t_s, t_e]$, we temporally pool the features with attention, and get the classification score $s(t_s, t_e, c)$ for class $c$, which is the output of the classification module before softmax. We further follow [41, 24] to refine $s(t_s, t_e, c)$ by subtracting the score of its surroundings. The final score $s^*(t_s, t_e, c)$ is calculated by
$$s^*(t_s, t_e, c) = s(t_s, t_e, c) - \eta \cdot s\left(t_s - \frac{t_e - t_s}{4},\, t_s,\, c\right) - \eta \cdot s\left(t_e,\, t_e + \frac{t_e - t_s}{4},\, c\right), \tag{16}$$
where $\eta$ is the subtraction parameter.
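The inference procedure of this section can be sketched in two small steps: thresholding the attention into consecutive segments, and refining a segment score by subtracting the scores of its surroundings. As a simplifying assumption, the segment score here is the mean of hypothetical per-frame class scores rather than the classifier output on attention-pooled features, and the surrounding regions are taken as a quarter of the segment length on each side.

```python
def extract_segments(attention, threshold):
    """Return half-open (start, end) index pairs of consecutive frames whose
    attention is at least the threshold."""
    segments, start = [], None
    for t, a in enumerate(attention):
        if a >= threshold and start is None:
            start = t
        elif a < threshold and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(attention)))
    return segments


def mean_score(scores, start, end):
    """Average per-frame score over [start, end), clipped to the video; 0 if empty."""
    start, end = max(start, 0), min(end, len(scores))
    return sum(scores[start:end]) / (end - start) if end > start else 0.0


def refined_score(scores, start, end, eta=0.5):
    """Inner score minus eta times the scores of the left and right surroundings."""
    margin = max(1, (end - start) // 4)
    inner = mean_score(scores, start, end)
    left = mean_score(scores, start - margin, start)
    right = mean_score(scores, end, end + margin)
    return inner - eta * left - eta * right
```

A segment whose surroundings score almost as high as its interior is penalized, which favors proposals with sharp boundaries.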
4.1 Datasets and Evaluation Metrics
THUMOS14 contains videos from 20 classes for the action localization task. We follow the convention to train on the validation set with 200 videos and evaluate on the test set with 212 videos. Note that we exclude the wrongly annotated video #270 from the test set, following [31, 58]. This dataset is challenging for its finely annotated action instances. Each video contains 15.5 action clips on average. The length of action instances varies widely, from a few seconds to minutes. Video length also ranges from a few seconds to 26 minutes, with an average of around 3 minutes. Compared to other large-scale datasets, e.g., ActivityNet1.2, THUMOS14 has less training data, which places higher demands on the model’s generalization ability and robustness.
ActivityNet1.2 contains 100 classes of videos with both video-level labels and temporal annotations. Each video contains 1.5 action instances on average. Following [49, 41], we train our model on the training set with 4819 videos and evaluate on the validation set with 2383 videos.
Evaluation Metrics. We follow the standard evaluation protocol and report mean Average Precision (mAP) at different intersection over union (IoU) thresholds. The results are calculated using the benchmark code provided by the official ActivityNet codebase (https://github.com/activitynet/ActivityNet/tree/master/Evaluation). For fair comparison, all results on THUMOS14 are averaged over five runs.
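For readers unfamiliar with the metric, the temporal IoU underlying it can be sketched in a few lines. This is an illustrative helper, not the official benchmark code; the full mAP computation additionally ranks predictions by score and matches each ground-truth instance at most once.

```python
def temporal_iou(a, b):
    """Intersection over union of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred, ground_truths, threshold):
    """A prediction counts as a hit if it overlaps some ground-truth segment
    of the same class with IoU at or above the threshold."""
    return any(temporal_iou(pred, gt) >= threshold for gt in ground_truths)
```

For example, segments (0, 10) and (5, 15) share 5 units out of a union of 15, so the prediction is a hit at IoU threshold 0.3 but not at 0.5.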
4.2 Implementation Details
We utilize the I3D network pre-trained on Kinetics as the feature extractor (https://github.com/deepmind/kinetics-i3d). Specifically, we first extract optical flow from RGB data using the TV-L1 algorithm. Then we divide both streams into non-overlapping 16-frame snippets and send them into the pre-trained I3D network to obtain two 1024-dimension feature frames for each snippet. We train separate DGAMs for the RGB and flow streams. The proposals from them are combined with Non-Maximum Suppression (NMS) during inference. Following [30, 31], we set $T$ to 400 for all videos during training. During evaluation, we feed all frames of each video to our network if the frame number is less than $T_{max}$, otherwise we sample $T_{max}$ frames uniformly. $T_{max}$ is 400 for THUMOS14, and 200 for ActivityNet1.2.
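Two of the steps above lend themselves to a short sketch: uniform sampling of feature frames when a video exceeds the frame budget, and merging proposals from the two streams with greedy one-dimensional NMS. The function names, the proposal format `(start, end, score)` and the IoU threshold of 0.5 are assumptions for illustration only.

```python
def sample_indices(num_frames, max_frames):
    """Keep all frames if the video fits the budget, else pick max_frames
    uniformly spaced indices."""
    if num_frames <= max_frames:
        return list(range(num_frames))
    step = num_frames / max_frames
    return [int(i * step) for i in range(max_frames)]


def temporal_iou(a, b):
    """IoU of two segments; only the first two entries (start, end) are used."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms(proposals, iou_threshold=0.5):
    """Greedy NMS over (start, end, score) proposals, highest score first."""
    kept = []
    for p in sorted(proposals, key=lambda q: q[2], reverse=True):
        if all(temporal_iou(p, k) <= iou_threshold for k in kept):
            kept.append(p)
    return kept


# Proposals from the two streams are simply concatenated before suppression.
rgb_proposals = [(0, 10, 0.9)]
flow_proposals = [(1, 11, 0.8), (20, 30, 0.7)]
merged = nms(rgb_proposals + flow_proposals)
```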
The whole architecture is implemented with PyTorch and trained on a single NVIDIA Tesla M40 GPU using the Adam optimizer. To stabilize the training of DGAM, we leverage a warm-up strategy in the first 300 epochs.
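| Method | Supervision | Feature | mAP@IoU 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Chao et al. | Full | - | 59.8 | 57.1 | 53.2 | 48.5 | 42.8 | 33.8 | 20.8 | - | - |
| Zhong et al. | Weak | - | 45.8 | 39.0 | 31.1 | 22.5 | 15.9 | - | - | - | - |
| Liu et al. | Weak | I3D | 57.4 | 50.8 | 41.2 | 32.1 | 23.1 | 15.0 | 7.0 | - | - |
| Nguyen et al. | Weak | I3D | 60.4 | 56.0 | 46.6 | 37.5 | 26.8 | 17.6 | 9.0 | 3.3 | 0.4 |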
4.3 Statistical Evaluation on Attention
We first evaluate the learned attention of DGAM and its effectiveness on handling action-context confusion. For comparison, an “old” model is trained by removing the generative attention modeling (GAM) from DGAM, and our DGAM is denoted as the “new” model. Note that only Attention and Classification modules are involved during inference. When evaluating, we assemble specific models by alternately choosing the two modules from “old” or “new” models. Table 1 details the mAP results on THUMOS14. It can be found that the new attention module largely improves the performance, while there is little or no improvement with the new classification module. This observation indicates that DGAM indeed learns better attention values. Even with “old” classifier, the “new” attention can boost the localization significantly.
We further collect several statistics to show the improvement intuitively in Table 2. Experiments are conducted on both “old” (w/o GAM) and “new” (w/ GAM) models. In particular, att (cls) indicates the set of frames with attention values (classification scores) larger than a threshold, and gt is the set of ground truth frames. $|\cdot|$ represents the size of a set. ‘$\setminus$’, ‘$\cap$’ and ‘$\overline{\,\cdot\,}$’ indicate set exclusion, intersection and complement, respectively. Though such simple thresholding does not exactly give the predicted locations, it somewhat reflects the quality of localization.
In Table 2, two of the statistics give the percentage of frames falsely captured or omitted by attention. They show that both false activation and omission can be reduced with GAM. Moreover, an improvement in a third statistic demonstrates that GAM can better filter out the false positives (e.g. context frames) made by the classifier. A fourth statistic measures how well attention can capture the false negatives, i.e., action frames neglected by the classifier. Since GAM is devised for excluding the false positives produced by the classifier, it is not surprising that GAM contributes little to it.
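Statistics of this kind can be computed with plain set operations on frame indices. The sketch below is an illustration only: the choice of denominators (the size of the thresholded attention set for false activation, and the size of the ground-truth set for omission) is an assumption, as are the function names and the default threshold.

```python
def threshold_set(values, threshold):
    """Indices of frames whose value is larger than the threshold."""
    return {t for t, v in enumerate(values) if v > threshold}


def attention_statistics(attention, gt_frames, threshold=0.5):
    """Return (false activation rate, omission rate) of thresholded attention."""
    att = threshold_set(attention, threshold)
    gt = set(gt_frames)
    false_activation = len(att - gt) / len(att) if att else 0.0  # |att \ gt| / |att|
    omission = len(gt - att) / len(gt) if gt else 0.0            # |gt \ att| / |gt|
    return false_activation, omission
```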
4.4 Ablation Studies
Next we study how each component in DGAM influences the overall performance. We start with the basic model that directly optimizes the attention-based foreground classification loss $\mathcal{L}_{fg}$. The background classification loss $\mathcal{L}_{bg}$, the self-guided regularization loss $\mathcal{L}_{guide}$, and the feature reconstruction loss $\mathcal{L}_{re}$ are further included step by step. Note that adding $\mathcal{L}_{re}$ indicates involving the generative attention modeling, where $\mathcal{L}_{CVAE}$ is also optimized.
Table 4 summarizes the performance by considering one more factor at each stage on THUMOS14. Background classification is a general approach for both video recognition and localization. In our case, it is part of our discriminative attention modeling, which brings a performance gain of 3.3%. Self-guided regularization is the additional optimization of our system, which leads to 1.9% mAP improvement. Our generative attention modeling further contributes a significant increase of 2.1% and the performance of DGAM finally reaches 28.8%.
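| Method | Supervision | mAP@IoU 0.5 | 0.55 | 0.6 | 0.65 | 0.7 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95 | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Liu et al. | Weak | 36.8 | - | - | - | - | 22.0 | - | - | - | 5.6 | 22.4 |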
4.5 Evaluation on Parameters
To further understand the proposed model, we conduct evaluations to analyze the impact of different parameter settings in DGAM. mAP@0.5 on THUMOS14 is reported.
Discrepancy between latent prior of different $\lambda_t$. In generative attention modeling, different attentions $\lambda_t$ correspond to different feature distributions $p_\psi(x_t \mid \lambda_t)$. The discrepancy between these distributions can be implicitly modeled by the discrepancy between latent codes sampled from different priors, which are modeled as different Gaussian distributions $\mathcal{N}(r\lambda_t \cdot \mathbf{1}, I)$. Here $r$ controls the discrepancy. We evaluate $r$ at every 0.25 from 0 to 1.5, and the results are shown in Figure 4. In general, the performance is relatively stable with small fluctuation, demonstrating the robustness of DGAM.
Dimension of latent space. The dimension of the latent space in CVAE is crucial for the quality of reconstruction and the complexity of the modeled distribution. A high dimension can facilitate the approximation of the feature distribution, hence leading to more accurate attention learning. However, more training data is also required. We evaluate several latent dimensions. As shown in Table 6, mAP improves rapidly with increasing dimension, which indicates better generative attention modeling. The result peaks at an intermediate dimension. After that, the performance starts dropping, partially because of the sparsity of limited data in a high-dimensional latent space.
Reconstruction-sampling trade-off in CVAE. The hyper-parameter $\beta$ in Eq. (10) balances reconstruction quality (the first term) and sampling accuracy (the second term). With a larger $\beta$, we expect the approximated posterior to be closer to the prior, which improves the precision when sampling latent vectors from the prior, while the reconstruction quality (i.e. the quality of the learned distribution) will decrease. We test different values of $\beta$ from 0 to 1. As shown in Table 7, the performance fluctuates in a small range from 28% to 28.8%, indicating that our method is insensitive to $\beta$.
4.6 Comparisons with State-of-the-Art
Table 3 compares our DGAM with existing approaches in both weakly-supervised and fully-supervised action localization on THUMOS14. Our method outperforms other weakly-supervised methods, especially at high IoU thresholds, which means DGAM can produce finer and more precise predictions. Compared with the state of the art, DGAM improves mAP at IoU=0.5 by 2%. Note that Nguyen et al. achieve better performance at IoU=0.1 and 0.2 than our model, partially because our generative attention modeling may discard out-of-distribution hard candidates (outliers), which become common when IoU is low. Furthermore, our results are comparable with several fully-supervised methods, indicating the effectiveness of the proposed DGAM.
On ActivityNet1.2, we summarize the performance comparisons in Table 5. Our method significantly outperforms the state of the art. Particularly, DGAM surpasses the best competitor by 2% on mAP@AVG. Our method also demonstrates comparable results to fully-supervised methods.
We have presented a novel Discriminative and Generative Attention Modeling (DGAM) method to solve the action-context confusion issue in weakly-supervised action localization. Particularly, we study the problem of modeling frame-wise attention based on the distribution of frame features. With the observation that context features differ clearly from action features, we devise a conditional variational auto-encoder (CVAE) to construct different feature distributions conditioned on different attentions. The learned CVAE in turn refines the desired frame-wise attention according to the frame features. Experiments conducted on two benchmarks, i.e., THUMOS14 and ActivityNet1.2, validate our method and analysis. More remarkably, we achieve new state-of-the-art results on both datasets.
Acknowledgements This work is supported by Beijing Municipal Commission of Science and Technology under Grant Z181100008918005, National Natural Science Foundation of China (NSFC) under Grant 61772037. Baifeng Shi thanks Prof. Tingting Jiang and Daochang Liu for enlightening discussions.
- (2018) Action search: spotting actions in videos and its application to temporal action localization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 251–266.
- (2017) End-to-end, single-stream temporal action detection in untrimmed videos. In British Machine Vision Conference (BMVC), Vol. 2, pp. 7.
- (2015) Activitynet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 961–970.
-  (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6299–6308. Cited by: §2, §4.2.
-  (2018) Rethinking the faster r-cnn architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1130–1139. Cited by: §1, §2, Table 3.
-  (2017) Temporal context network for activity localization in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5793–5802. Cited by: §1, §2.
-  (2017) Cascaded boundary regression for temporal action detection. In British Machine Vision Conference (BMVC), Cited by: §2.
-  (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587. Cited by: §2.
-  (2015) Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448. Cited by: §2.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems (NeurIPS), pp. 2672–2680. Cited by: §2.
-  (2017) Scc: semantic context cascade for efficient action detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3175–3184. Cited by: §2.
-  (2017) Beta-vae: learning basic visual concepts with a constrained variational framework.. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
-  (2017) The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding (CVIU) 155, pp. 1–23. Cited by: §1, §4.1.
-  (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §4.2.
-  (2014) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.2.
-  (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems (NeurIPS), pp. 10215–10224. Cited by: §2.
-  (2013) Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2, §3.3.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NeurIPS), pp. 1097–1105. Cited by: §1.
-  (2005) On space-time interest points. International Journal of Computer Vision (IJCV) 64 (2-3), pp. 107–123. Cited by: §2.
-  (2018) Recurrent tubelet proposal and recognition networks for action detection. In Proceedings of the European conference on computer vision (ECCV), pp. 303–318. Cited by: §1.
-  (2019) Long short-term relation networks for video action detection. In Proceedings of the ACM International Conference on Multimedia (MM), pp. 629–637. Cited by: §1.
-  (2017) Single shot temporal action detection. In Proceedings of the ACM international conference on Multimedia (MM), pp. 988–996. Cited by: §2.
-  (2018) Bsn: boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §1, §2, Table 3.
-  (2019) Completeness modeling and context separation for weakly supervised temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1298–1307. Cited by: §1, §1, §2, §3.5, Table 3, Table 5.
-  (2016) Ssd: single shot multibox detector. In European Conference on Computer Vision (ECCV), pp. 21–37. Cited by: §2.
-  (2019) Weakly supervised temporal action localization through contrast based evaluation networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3899–3908. Cited by: Table 3, Table 5.
-  (2019) Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 344–353. Cited by: §1.
-  (2005) A generic framework of user attention model and its application in video summarization. IEEE Transaction on multimedia (TMM) 7 (5), pp. 907–919. Cited by: §1.
-  (2019) 3C-net: category count and center loss for weakly-supervised action localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 8679–8687. Cited by: §1, §2, Table 3, Table 5.
-  (2018) Weakly supervised action localization by sparse temporal pooling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6752–6761. Cited by: §1, §2, §3.4, §4.2, Table 3.
-  (2019) Weakly-supervised action localization with background modeling. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5502–5511. Cited by: §1, §2, §3.2, §3.4, §4.1, §4.2, §4.6, Table 3.
-  (2013) Action and event recognition with Fisher vectors on a compact feature set. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1817–1824. Cited by: §2.
-  (2017) Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop. Cited by: §4.2.
-  (2018) W-TALC: weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 563–579. Cited by: §1, §2, Table 3, Table 5.
-  (2013) TV-L1 optical flow estimation. Image Processing On Line (IPOL) 2013, pp. 137–150. Cited by: §4.2.
-  (2017) Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5533–5541. Cited by: §2.
-  (2019) Learning spatio-temporal representation with local and global diffusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12056–12065. Cited by: §1.
-  (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788. Cited by: §2.
-  (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 91–99. Cited by: §2.
-  (2017) CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5734–5743. Cited by: §2.
-  (2018) AutoLoc: weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 154–171. Cited by: §1, §2, §3.5, §4.1, Table 3, Table 5.
-  (2016) Temporal action localization in untrimmed videos via multi-stage CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1049–1058. Cited by: §1, §2, Table 3.
-  (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NeurIPS), pp. 568–576. Cited by: §1, §2.
-  (2017) Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3544–3553. Cited by: Table 3.
-  (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3483–3491. Cited by: §2.
-  (2015) Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497. Cited by: §2.
-  (2013) A survey on activity recognition and behavior understanding in video surveillance. The Visual Computer (Vis Comput) 29 (10), pp. 983–1009. Cited by: §1.
-  (2013) Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3551–3558. Cited by: §2.
-  (2017) UntrimmedNets for weakly supervised action recognition and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4325–4334. Cited by: §1, §2, §4.1, Table 3, Table 5.
-  (2016) Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), pp. 20–36. Cited by: §2.
-  (2019) Less is more: learning highlight detection from video duration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1258–1267. Cited by: §1.
-  (2017) R-C3D: region convolutional 3D network for temporal activity detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5783–5792. Cited by: §1, §2, Table 3.
-  (2016) End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2678–2687. Cited by: §1.
-  (2019) Temporal structure mining for weakly supervised action detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5522–5531. Cited by: Table 3, Table 5.
-  (2019) Marginalized average attentional network for weakly-supervised learning. In International Conference on Learning Representations (ICLR). Cited by: §1, §2, Table 3.
-  (2019) Graph convolutional networks for temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 7094–7103. Cited by: Table 3.
-  (2018) S3D: single shot multi-span detector via fully 3D convolutional network. In Proceedings of the British Machine Vision Conference (BMVC). Cited by: §2.
-  (2017) Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2914–2923. Cited by: §2, §4.1, Table 3, Table 5.
-  (2018) Step-by-step erasion, one-by-one collection: a weakly supervised temporal action detector. In Proceedings of the ACM International Conference on Multimedia (MM), pp. 35–44. Cited by: Table 3.
-  (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929. Cited by: §1, §3.4.