Marginalized Average Attentional Network for Weakly-Supervised Learning

by   Yuan Yuan, et al.

In weakly-supervised temporal action localization, previous works have failed to locate dense and integral regions for each entire action due to the overestimation of the most salient regions. To alleviate this issue, we propose a marginalized average attentional network (MAAN) to suppress the dominant response of the most salient regions in a principled manner. The MAAN employs a novel marginalized average aggregation (MAA) module and learns a set of latent discriminative probabilities in an end-to-end fashion. MAA samples multiple subsets from the video snippet features according to a set of latent discriminative probabilities and takes the expectation over all the averaged subset features. Theoretically, we prove that the MAA module with learned latent discriminative probabilities successfully reduces the difference in responses between the most salient regions and the others. Therefore, MAAN is able to generate better class activation sequences and identify dense and integral action regions in the videos. Moreover, we propose a fast algorithm to reduce the complexity of constructing MAA from O(2^T) to O(T^2). Extensive experiments on two large-scale video datasets show that our MAAN achieves superior performance on weakly-supervised temporal action localization.





1 Introduction

Weakly-supervised temporal action localization has been of interest to the community recently. The setting is to train a model with solely video-level class labels, and to predict both the class and the temporal boundary of each action instance at test time. The major challenge in the weakly-supervised localization problem is to find the right way to express and infer the underlying location information with only the video-level class labels. Traditionally, this is achieved by explicitly sampling several possible instances with different locations and durations (Bilen & Vedaldi, 2016; Kantorov et al., 2016; Zhang et al., 2017). The instance-level classifiers would then be trained through multiple instance learning (Cinbis et al., 2017; Yuan et al., 2017a) or curriculum learning (Bengio et al., 2009). However, the lengths of actions and videos vary so much that the number of instance proposals per video varies widely and can be huge. As a result, traditional methods based on instance proposals become infeasible in many cases.

Recent research, however, has pivoted to acquiring the location information by generating the class activation sequence (CAS) directly (Nguyen et al., 2018), which gives, for each snippet over time, the classification score of each action class. The CAS along the 1D temporal dimension for a video is inspired by the class activation map (CAM) (Zhou et al., 2016a, 2014; Pinheiro & Collobert, 2015; Oquab et al., 2015) in weakly-supervised object detection. The CAM-based models have shown that despite being trained on image-level labels, convolutional neural networks (CNNs) have the remarkable ability to localize objects. Similar to object detection, the basic idea behind CAS-based methods for action localization during training is to sample non-overlapping snippets from a video, then to aggregate the snippet-level features into a video-level feature, and finally to yield a video-level class prediction. During testing, the model generates a CAS for each class that identifies the discriminative action regions, and then applies a threshold on the CAS to localize each action instance in terms of the start time and the end time.

In CAS-based methods, the feature aggregator that aggregates multiple snippet-level features into a video-level feature is the critical building block of weakly-supervised neural networks. A model’s ability to capture the location information of an action is primarily determined by the design of the aggregators. While using the global average pooling over a full image or across the video snippets has shown great promise in identifying the discriminative regions (Zhou et al., 2016a, 2014; Pinheiro & Collobert, 2015; Oquab et al., 2015), treating each pixel or snippet equally loses the opportunity to benefit from several more essential parts. Some recent works (Nguyen et al., 2018; Zhu et al., 2017) have tried to learn attentional weights for different snippets to compute a weighted sum as the aggregated feature. However, they suffer from the weights being easily dominated by only a few most salient snippets.

In general, models trained with only video-level class labels tend to respond to small and sparse discriminative regions from the snippets of interest. This deviates from the objective of the localization task, which is to locate dense and integral regions for each entire action. To mitigate this gap and reduce the effect of the domination by the most salient regions, several heuristic tricks have been proposed for existing models. For example, Wei et al. (2017) and Zhang et al. (2018b) attempt to heuristically erase the most salient regions predicted by the model which are currently being mined, and force the network to attend to other salient regions in the remaining regions by forwarding the model several times. However, the heuristic multiple-run model is not end-to-end trainable. It is the ensemble of multiple-run mined regions, not the single model’s own ability, that learns the entire action regions. “Hide-and-seek” (Singh & Lee, 2017) randomly masks out some regions of the input during training, forcing the model to localize other salient regions when the most salient regions happen to be masked out. However, all the input regions are masked out with the same probability due to the uniform prior, and it is very likely that most of the time it is the background that is being masked out. A detailed discussion about related works can be found in Appendix D.

To this end, we propose the marginalized average attentional network (MAAN) to alleviate the issue raised by the domination of the most salient region in an end-to-end fashion for weakly-supervised action localization. Specifically, MAAN suppresses the action prediction response of the most salient regions by employing marginalized average aggregation (MAA) and learning the latent discriminative probability in a principled manner. Unlike the previous attentional pooling aggregator which calculates the weighted sum with attention weights, MAA first samples a subset of features according to their latent discriminative probabilities, and then calculates the average of these sampled features. Finally, MAA takes the expectation (marginalization) of the average aggregated subset features over all the possible subsets to achieve the final aggregation. As a result, MAA not only alleviates the domination by the most salient regions, but also maintains the scale of the aggregated feature within a reasonable range. We theoretically prove that, with the MAA, the learned latent discriminative probability indeed reduces the difference of response between the most salient regions and the others. Therefore, MAAN can identify more dense and integral regions for each action. Moreover, since enumerating all the possible subsets is exponentially expensive, we further propose a fast iterative algorithm to reduce the complexity of the expectation calculation procedure and provide a theoretical analysis. Furthermore, MAAN is easy to train in an end-to-end fashion since all the components of the network are differentiable. Extensive experiments on two large-scale video datasets show that MAAN consistently outperforms the baseline models and achieves superior performance on weakly-supervised temporal action localization.

In summary, our main contributions include: (1) a novel end-to-end trainable marginalized average attentional network (MAAN) with a marginalized average aggregation (MAA) module in the weakly-supervised setting; (2) theoretical analysis of the properties of MAA and an explanation of the reasons MAAN alleviates the issue raised by the domination of the most salient regions; (3) a fast iterative algorithm that can effectively reduce the computational complexity of MAA; and (4) a superior performance on two benchmark video datasets, THUMOS14 and ActivityNet1.3, on the weakly-supervised temporal action localization.

2 Marginalized Average Attentional Network

In this section, we describe our proposed MAAN for weakly-supervised temporal action localization. We first derive the formulation of the feature aggregation module in MAAN as a MAA procedure in Sec. 2.1. Then, we study the properties of MAA in Sec. 2.2, and present our fast iterative computation algorithm for MAA construction in Sec. 2.3. Finally, we describe our network architecture that incorporates MAA, and introduce the corresponding inference process on weakly-supervised temporal action localization in Sec. 2.4.

2.1 Marginalized Average Aggregation

Figure 1: An illustration of the weighted sum aggregation and the marginalized average aggregation.

Let $\{x_1, x_2, \dots, x_T\}$ denote the set of snippet-level features to be aggregated, where $x_t \in \mathbb{R}^m$ is the $m$-dimensional feature representation extracted from a video snippet centered at time $t$, and $T$ is the total number of sampled video snippets. The conventional attentional weighted sum pooling aggregates the input snippet-level features into a video-level representation $\bar{x}$. Denote the set of attentional weights corresponding to the snippet-level features as $\{\lambda_1, \lambda_2, \dots, \lambda_T\}$, where $\lambda_t$ is a scalar attentional weight for $x_t$. Then the aggregated video-level representation is given by

$$\bar{x} = \sum_{t=1}^{T} \lambda_t x_t, \tag{1}$$

as illustrated in Figure 1 (a). Different from the conventional aggregation mechanism, the proposed MAA module aggregates the features by first generating a set of binary indicators to determine whether a snippet should be sampled or not. The model then computes the average aggregation of these sampled snippet-level representations. Lastly, the model computes the expectation (marginalization) of the aggregated average feature over all the possible subsets, and obtains the proposed marginalized average aggregated feature. Formally, in the proposed MAA module, we first define a set of probabilities $\{p_1, p_2, \dots, p_T\}$, where each $p_t \in [0, 1]$ is a scalar corresponding to $x_t$, similar to the notation $\lambda_t$ mentioned previously. We then sample a set of random variables $\{z_1, z_2, \dots, z_T\}$, where $z_t \sim \mathrm{Bernoulli}(p_t)$, i.e., $z_t = 1$ with probability $p_t$. The sampled set is used to represent the subset selection of snippet-level features, in which $z_t = 1$ indicates that $x_t$ is selected, otherwise not. Therefore, the average aggregation of the sampled subset of snippet-level representations is given by $s = \sum_{i=1}^{T} z_i x_i / \sum_{i=1}^{T} z_i$, and our proposed aggregated feature, defined as the expectation of all the possible subset-level average aggregated representations, is given by

$$\bar{x} = E[s] = E\left[\frac{\sum_{i=1}^{T} z_i x_i}{\sum_{i=1}^{T} z_i}\right], \tag{2}$$

which is illustrated in Figure 1 (b).
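To make the contrast between the two aggregators concrete, the following is a minimal sketch in plain Python rather than the authors' implementation. It evaluates the expectation in Eq. (2) by enumerating every subset, which is only feasible for very small $T$. The function names are hypothetical, and the treatment of the empty subset (0/0 counted as a zero vector) is an assumption.

```python
from itertools import product

def weighted_sum(weights, feats):
    """Eq. (1)-style aggregation: sum_t weight_t * x_t."""
    dim = len(feats[0])
    return [sum(w * x[d] for w, x in zip(weights, feats)) for d in range(dim)]

def maa_brute_force(probs, feats):
    """Eq. (2)-style aggregation by enumerating all 2^T selection patterns.

    Assumption: the empty subset (0/0) contributes a zero vector.
    """
    dim = len(feats[0])
    out = [0.0] * dim
    for z in product((0, 1), repeat=len(probs)):
        k = sum(z)
        if k == 0:
            continue
        # Probability of this particular selection pattern.
        pz = 1.0
        for zi, pi in zip(z, probs):
            pz *= pi if zi else (1.0 - pi)
        # Add pz times the plain average of the selected features.
        for d in range(dim):
            avg = sum(zi * x[d] for zi, x in zip(z, feats)) / k
            out[d] += pz * avg
    return out

probs = [0.9, 0.5, 0.1]                       # illustrative values only
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three 2-dimensional snippets
print(weighted_sum(probs, feats))
print(maa_brute_force(probs, feats))
```

Because every term is an average over the selected snippets, the MAA output stays within the range spanned by the input features, whereas the plain weighted sum scales with the total weight.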

2.2 Partial order Preservation and Dominant Response Suppression

Direct learning and prediction with the attention weights in Eq. (1) in weakly-supervised action localization leads to an over-response in the most salient regions. The MAA in Eq. (2) has two properties that alleviate the domination effect of the most salient regions. First, the partial order preservation property, i.e., the latent discriminative probabilities preserve the partial order with respect to their attention weights. Second, the dominant response suppression property, i.e., the differences in the latent discriminative probabilities between the most salient items and others are smaller than the differences between their attention weights. The partial order preservation property guarantees that it does not mix up the action and non-action snippets by assigning a high latent discriminative probability to a snippet with low response. The dominant response suppression property reduces the dominant effect of the most salient regions and encourages the identification of dense and more integral action regions. Formally, we present the two properties in Proposition 1 and Proposition 2, respectively. Detailed proofs can be found in Appendix A and Appendix B respectively.

Proposition 1.

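Let $z_i \sim \mathrm{Bernoulli}(p_i)$ for $i \in \{1, \dots, T\}$. Then for $T \ge 2$, Eq. (3) holds true, and $p_i \ge p_j \Leftrightarrow c_i \ge c_j \Leftrightarrow \lambda_i \ge \lambda_j$.

$$\bar{x} = E\left[\frac{\sum_{i=1}^{T} z_i x_i}{\sum_{i=1}^{T} z_i}\right] = \sum_{i=1}^{T} c_i p_i x_i = \sum_{i=1}^{T} \lambda_i x_i, \tag{3}$$

where $c_i = E\left[1 / \left(1 + \sum_{k=1, k \ne i}^{T} z_k\right)\right]$ and $\lambda_i = c_i p_i$ for $i \in \{1, \dots, T\}$.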

Proposition 1 shows that the latent discriminative probabilities $\{p_i\}$ preserve the partial order of the attention weights $\{\lambda_i\}$. This means that a large attention weight corresponds to a large discriminative probability, which guarantees that the latent discriminative probabilities preserve the ranking of the action prediction response. Eq. (3) can be seen as a factorization of the attention weight $\lambda_i$ into the multiplication of two components, $p_i$ and $c_i$, for $i \in \{1, \dots, T\}$. $p_i$ is the latent discriminative probability related to the feature of snippet $i$ itself. The factor $c_i$ captures the contextual information of snippet $i$ from the other snippets. This factorization can be considered to be introducing structural information into the aggregation. Factor $c_i$ can be considered as performing a structural regularization for learning the latent discriminative probabilities $p_i$ for $i \in \{1, \dots, T\}$, as well as for learning a more informative aggregation.
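The factorization in Proposition 1 can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not the authors' code: it takes the context factor to be $c_i = E[1/(1+\sum_{k \ne i} z_k)]$ with independent $z_k \sim \mathrm{Bernoulli}(p_k)$, uses scalar features for brevity, and counts the empty subset as 0. The helper names `context_factors` and `expected_average` are hypothetical.

```python
from itertools import product

def context_factors(probs):
    """c_i = E[1 / (1 + sum_{k != i} z_k)], enumerating the other T-1 indicators."""
    T = len(probs)
    cs = []
    for i in range(T):
        others = probs[:i] + probs[i + 1:]
        c = 0.0
        for z in product((0, 1), repeat=T - 1):
            pz = 1.0
            for zk, pk in zip(z, others):
                pz *= pk if zk else (1.0 - pk)
            c += pz / (1 + sum(z))
        cs.append(c)
    return cs

def expected_average(probs, values):
    """E[sum z_i v_i / sum z_i] for scalar values; the empty subset counts as 0."""
    total = 0.0
    for z in product((0, 1), repeat=len(probs)):
        k = sum(z)
        if k == 0:
            continue
        pz = 1.0
        for zi, pi in zip(z, probs):
            pz *= pi if zi else (1.0 - pi)
        total += pz * sum(zi * v for zi, v in zip(z, values)) / k
    return total

probs = [0.9, 0.6, 0.3, 0.1]      # illustrative probabilities, sorted descending
values = [2.0, -1.0, 0.5, 4.0]    # scalar stand-ins for the snippet features
cs = context_factors(probs)
lams = [c * p for c, p in zip(cs, probs)]
lhs = expected_average(probs, values)
rhs = sum(l * v for l, v in zip(lams, values))
print(abs(lhs - rhs) < 1e-12)                       # both sides of the factorization agree
print(all(a > b for a, b in zip(lams, lams[1:])))   # ordering of p carries over to lambda
```

A snippet with a larger probability sees less competing mass among the other snippets, so its context factor is also larger, which is why the ordering carries over.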

Proposition 2.

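Let $z_i \sim \mathrm{Bernoulli}(p_i)$ for $i \in \{1, \dots, T\}$. Denote $c_i = E\left[1 / \left(1 + \sum_{k=1, k \ne i}^{T} z_k\right)\right]$ and $\lambda_i = c_i p_i$ for $i \in \{1, \dots, T\}$. Denote $\mathcal{I} = \left\{ i \;\middle|\; c_i \ge \sum_{t=1}^{T} \lambda_t / \sum_{t=1}^{T} p_t \right\}$ as an index set. Then $\mathcal{I} \ne \emptyset$ and for $\forall i \in \mathcal{I}, \forall j \in \{1, \dots, T\}$, inequality (4) holds true.

$$\left| \frac{p_i}{\sum_{t=1}^{T} p_t} - \frac{p_j}{\sum_{t=1}^{T} p_t} \right| \le \left| \frac{\lambda_i}{\sum_{t=1}^{T} \lambda_t} - \frac{\lambda_j}{\sum_{t=1}^{T} \lambda_t} \right|. \tag{4}$$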
The index set $\mathcal{I}$ can be viewed as the set of the most salient features. Proposition 2 shows that the difference between the normalized latent discriminative probabilities of the most salient regions and others is smaller than the difference between their attention weights. It means that the prediction for each snippet using the latent discriminative probability can reduce the gap between the most salient features and the others compared to conventional methods that are based on attention weights. Thus, MAAN suppresses the dominant responses of the most salient features and encourages the model to identify dense and more integral action regions.

Directly learning the attention weights leads to an over-response to the most salient region in weakly-supervised temporal localization. Namely, the attention weights for only a few snippets are too large and dominate the others, while attention weights for most of the other snippets that also belong to the true action are underestimated. Proposition 2 shows that latent discriminative probabilities are able to reduce the gap between the most salient features and the others compared to the attention weights. Thus, by employing the latent discriminative probabilities for prediction instead of the attention weights, our method can alleviate the dominant effect of the most salient region in weakly-supervised temporal localization.
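The suppression effect can be illustrated with a small numerical check. This is a sketch under assumptions, not the authors' code: it takes $\lambda_i = c_i p_i$ with $c_i = E[1/(1+\sum_{k \ne i} z_k)]$, normalizes both the probabilities and the attention weights to sum to one, and compares the gap between the most salient snippet and every other snippet. The probabilities and helper names are illustrative.

```python
def count_distribution(probs):
    """Distribution of the number of selected items: q[k] = P(sum z = k)."""
    q = [1.0]
    for p in probs:
        nxt = [0.0] * (len(q) + 1)
        for k, qk in enumerate(q):
            nxt[k] += (1.0 - p) * qk      # item not selected
            nxt[k + 1] += p * qk          # item selected
        q = nxt
    return q

def attention_from_probs(probs):
    """lambda_i = c_i * p_i with c_i = E[1 / (1 + sum_{k != i} z_k)]."""
    lams = []
    for i, p in enumerate(probs):
        q = count_distribution(probs[:i] + probs[i + 1:])
        c = sum(qk / (1 + k) for k, qk in enumerate(q))
        lams.append(c * p)
    return lams

def normalized(v):
    s = sum(v)
    return [a / s for a in v]

probs = [0.95, 0.7, 0.6, 0.5, 0.2]   # illustrative latent probabilities
lams = attention_from_probs(probs)
p_hat, lam_hat = normalized(probs), normalized(lams)
top = max(range(len(probs)), key=lambda i: probs[i])
for j in range(len(probs)):
    gap_p = abs(p_hat[top] - p_hat[j])
    gap_lam = abs(lam_hat[top] - lam_hat[j])
    print(j, round(gap_p, 4), round(gap_lam, 4), gap_p <= gap_lam + 1e-12)
```

In each printed row the gap measured on the normalized probabilities is no larger than the gap measured on the normalized attention weights, which is the behavior the proposition describes for the most salient snippet.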

2.3 Recurrent Fast Computation

Figure 2: The purple box demonstrates the marginalized average aggregation module, where the inputs are $\{p_t\}$ and $\{x_t\}$ and the output is $\bar{x}$. The two black boxes demonstrate the computation graphs of $q_i^t$ and $m_i^t$, respectively. The black hollow point indicates its value is 0, while the value of the black solid point is non-zero. $q_0^0$ is initialized as 1.

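Given a video containing $T$ snippet-level representations, there are $2^T$ possible configurations for the subset selection. Directly summing up all the $2^T$ configurations to calculate $\bar{x}$ has a complexity of $O(2^T)$. In order to reduce the exponential complexity, we propose an iterative method to calculate $\bar{x}$ with $O(T^2)$ complexity. Let us denote the aggregated feature of $\{x_1, \dots, x_t\}$ with length $t$ as $h_t$, and denote $Y_t = \sum_{i=1}^{t} z_i x_i$ and $Z_t = \sum_{i=1}^{t} z_i$ for simplicity, then we have a set of

$$h_t = E\left[\frac{Y_t}{Z_t}\right] = E\left[\frac{\sum_{i=1}^{t} z_i x_i}{\sum_{i=1}^{t} z_i}\right], \quad t \in \{1, \dots, T\}, \tag{5}$$

and the aggregated feature of $\{x_1, \dots, x_T\}$ can be obtained as $\bar{x} = h_T$. In Eq. (5), $Z_t$ is the summation of all the $z_i$, which indicates the number of elements selected in the subset. Although there are $2^t$ distinct configurations for $\{z_1, \dots, z_t\}$, there are only $t+1$ distinct values for $Z_t$, i.e. $0, 1, \dots, t$. Therefore, we can divide all the $2^t$ distinct configurations into $t+1$ groups, where the configurations sharing the same $Z_t$ fall into the same group. Then the expectation $h_t$ can be calculated as the summation of the $t+1$ parts. That is, $h_t = \sum_{i=0}^{t} m_i^t$, where $m_i^t$, indicating the $i$-th part of $h_t$ for group $Z_t = i$, is shown in Eq. (6).

$$m_i^t = P(Z_t = i)\, E\left[\frac{Y_t}{Z_t} \,\middle|\, Z_t = i\right]. \tag{6}$$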

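In order to calculate $h_{t+1} = \sum_{i=0}^{t+1} m_i^{t+1}$, given $m_i^t$, $i \in \{0, \dots, t\}$, we can calculate $m_i^{t+1}$, $i \in \{0, 1, \dots, t+1\}$ recurrently. The key idea here is that $m_i^{t+1}$ comes from two cases: if $z_{t+1} = 0$, then $m_i^{t+1}$ is the same as $m_i^t$; if $z_{t+1} = 1$, then $m_i^{t+1}$ is the weighted average of $m_{i-1}^t$ and $x_{t+1}$. The latter case is also related to the probability $P(Z_t = i-1)$. By denoting $q_{i-1}^t = P(Z_t = i-1)$ for simplicity, we can obtain $m_i^{t+1}$ as a function of several elements:

$$m_i^{t+1} = f\left(m_{i-1}^t, m_i^t, x_{t+1}, p_{t+1}, q_{i-1}^t\right). \tag{7}$$

Similarly, the computation of $q_i^{t+1} = P(Z_{t+1} = i)$ comes from two cases: the probability of selecting $i-1$ items from the first $t$ items and selecting the $(t+1)$-th item, i.e., $q_{i-1}^t p_{t+1}$; and the probability of selecting $i$ items all from the first $t$ items and not selecting the $(t+1)$-th item, i.e., $q_i^t (1 - p_{t+1})$. We derive the function of $m_i^{t+1}$ and $q_i^{t+1}$ in Proposition 3. Detailed proofs can be found in Appendix C.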

Proposition 3.

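Let $z_t \sim \mathrm{Bernoulli}(p_t)$, $Z_t = \sum_{i=1}^{t} z_i$ and $Y_t = \sum_{i=1}^{t} z_i x_i$ for $t \in \{1, \dots, T\}$. Define $m_i^t$, $i \in \{0, \dots, t\}$ as Eq. (6) and $q_i^t = P(Z_t = i)$, then $m_i^{t+1}$, $i \in \{0, 1, \dots, t+1\}$ can be obtained recurrently by Eq. (8) and Eq. (9).

$$m_i^{t+1} = p_{t+1}\left(b_{i-1} m_{i-1}^t + (1 - b_{i-1})\, q_{i-1}^t x_{t+1}\right) + (1 - p_{t+1})\, m_i^t, \tag{8}$$

$$q_i^{t+1} = p_{t+1}\, q_{i-1}^t + (1 - p_{t+1})\, q_i^t, \tag{9}$$

where $b_i = \frac{i}{i+1}$, $q_{-1}^t = 0$, $q_{t+1}^t = 0$, $q_0^0 = 1$, $m_0^t = 0$, and $m_{t+1}^t = 0$.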

Proposition 3 provides a recurrent formula to calculate $m_i^t$. With this recurrent formula, we calculate the aggregation $\bar{x}$ by iteratively calculating $m_i^t$ from $i = 0$ to $t$ and $t = 1$ to $T$. Therefore, we can obtain the aggregated feature of $\{x_1, \dots, x_T\}$ as $\bar{x} = h_T = \sum_{i=0}^{T} m_i^T$. The iterative computation procedure is summarized in Algorithm 1 in Appendix E. The time complexity is $O(T^2)$.

With the fast iterative algorithm in Algorithm 1, the MAA becomes practical for end-to-end training. A demonstration of the computation graph for $q_i^{t+1}$ in Eq. (9) and $m_i^{t+1}$ in Eq. (8) is presented in the left and right-hand sides of Figure 2, respectively. From Figure 2, we can see clearly that, to compute $m_i^{t+1}$ (the big black node on the right), it needs $m_{i-1}^t$, $m_i^t$, $x_{t+1}$, $p_{t+1}$, and $q_{i-1}^t$. The MAA can be easily implemented as a subnetwork for end-to-end training and can be used to replace the operation of other feature aggregators.
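The recurrent computation can be sketched in a few lines of plain Python. This is not Algorithm 1 from Appendix E, only a sketch derived from the two cases described in this section: when the new snippet is not selected, each group keeps its previous value; when it is selected, group $i$ receives group $i-1$ rescaled by $(i-1)/i$ plus the new feature weighted by $P(Z_t = i-1)/i$. The brute-force routine is included only to check agreement on a small input, and the empty subset is assumed to contribute zero.

```python
from itertools import product

def maa_recurrent(probs, feats):
    """O(T^2) evaluation of E[sum z_i x_i / sum z_i] via the (m, q) recurrences."""
    dim = len(feats[0])
    q = [1.0]                  # q[i] = P(Z_t = i), starting from P(Z_0 = 0) = 1
    m = [[0.0] * dim]          # m[i] = P(Z_t = i) * E[Y_t / Z_t | Z_t = i]
    for p, x in zip(probs, feats):
        t = len(q) - 1         # number of snippets processed so far
        new_q = [0.0] * (t + 2)
        new_m = [[0.0] * dim for _ in range(t + 2)]
        for i in range(t + 2):
            q_prev = q[i - 1] if i >= 1 else 0.0
            q_same = q[i] if i <= t else 0.0
            new_q[i] = p * q_prev + (1.0 - p) * q_same
            if i == 0:
                continue       # the group with no selected snippet contributes 0
            b = (i - 1) / i    # rescales an average over i-1 items to one over i items
            for d in range(dim):
                m_prev = m[i - 1][d]
                m_same = m[i][d] if i <= t else 0.0
                new_m[i][d] = (p * (b * m_prev + (1.0 - b) * q_prev * x[d])
                               + (1.0 - p) * m_same)
        q, m = new_q, new_m
    return [sum(mi[d] for mi in m) for d in range(dim)]

def maa_brute_force(probs, feats):
    """Reference O(2^T) enumeration used only to validate the recurrence."""
    dim = len(feats[0])
    out = [0.0] * dim
    for z in product((0, 1), repeat=len(probs)):
        k = sum(z)
        if k == 0:
            continue
        pz = 1.0
        for zi, pi in zip(z, probs):
            pz *= pi if zi else (1.0 - pi)
        for d in range(dim):
            out[d] += pz * sum(zi * x[d] for zi, x in zip(z, feats)) / k
    return out

probs = [0.9, 0.5, 0.1, 0.7]                              # illustrative values
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
fast, slow = maa_recurrent(probs, feats), maa_brute_force(probs, feats)
print(all(abs(a - b) < 1e-12 for a, b in zip(fast, slow)))
```

The inner loop touches at most $t+2$ groups per snippet, which gives the quadratic cost in the number of snippets. Since every operation is a sum or product, the same recurrence can be written with differentiable tensor operations.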

Figure 3: Network architecture for the weakly-supervised action localization.

2.4 Network Architecture and Temporal Action Localization

Figure 4: The feature aggregators used in STPN and MAAN.

Network Architecture: We now describe the network architecture that employs the MAA module described above for weakly-supervised temporal action localization. We start from a previous state-of-the-art base architecture, the sparse temporal pooling network (STPN) (Nguyen et al., 2018). As shown in Figure 3, it first divides the input video into several non-overlapped snippets and extracts the I3D (Carreira & Zisserman, 2017) feature for each snippet. Each snippet-level feature is then fed to an attention module to generate an attention weight between 0 and 1. STPN then uses a feature aggregator to calculate a weighted sum of the snippet-level features with these class-agnostic attention weights to create a video-level representation, as shown on the left in Figure 4. The video-level representation is then passed through an FC layer followed by a sigmoid layer to obtain class scores. Our MAAN uses the attention module to generate the latent discriminative probability and replaces the feature aggregator from the weighted sum aggregation by the proposed marginalized average aggregation, which is demonstrated on the right in Figure 4.

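Training with video-level class labels: Formally, the model first performs aggregation of the snippet-level features (i.e. $x_1, x_2, \dots, x_T$) to obtain the video-level representation ($\bar{x} = E\left[\sum_{i=1}^{T} z_i x_i / \sum_{i=1}^{T} z_i\right]$). Then, it applies a logistic regression layer (FC layer + sigmoid) to output the video-level classification prediction probability. Specifically, the prediction probability for class $c \in \{1, 2, \dots, C\}$ is parameterized as $\sigma_j^c = \sigma(w_c^\top \bar{x}_j)$, where $\bar{x}_j$ is the aggregated feature for video $j \in \{1, \dots, N\}$. Suppose the videos $\bar{x}_j$ are i.i.d. and the action classes are independent of each other; the negative log-likelihood function (cross-entropy loss) is then given as follows:

$$L(W) = -\sum_{j=1}^{N} \sum_{c=1}^{C} \left( y_j^c \log \sigma_j^c + (1 - y_j^c) \log (1 - \sigma_j^c) \right), \tag{10}$$

where $y_j^c \in \{0, 1\}$ is the ground-truth video-level label for class $c$ happening in video $j$ and $W = [w_1, \dots, w_C]$.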
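As a small illustration of the training objective, the sketch below computes a per-class binary cross-entropy over a batch of already aggregated video features. It is a plain-Python sketch under assumptions (no bias term, list-based vectors, hypothetical function names), not the authors' PyTorch code.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def video_level_loss(agg_feats, labels, weights):
    """Negative log-likelihood summed over videos j and classes c.

    agg_feats[j] : aggregated feature of video j (list of floats)
    labels[j][c] : 1 if class c occurs in video j, else 0
    weights[c]   : weight vector of class c (bias omitted for brevity)
    """
    loss = 0.0
    for x_bar, y in zip(agg_feats, labels):
        for w_c, y_c in zip(weights, y):
            prob = sigmoid(sum(w * x for w, x in zip(w_c, x_bar)))
            loss -= y_c * math.log(prob) + (1 - y_c) * math.log(1.0 - prob)
    return loss

# One video, two classes, all-zero weights: every class probability is 0.5.
print(video_level_loss([[1.0, 2.0]], [[1, 0]], [[0.0, 0.0], [0.0, 0.0]]))
```

Each class is scored independently with a sigmoid, so a video may carry several positive labels at once.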

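Temporal Action Localization: Let $s^c = w_c^\top \bar{x}$ be the video-level action prediction score, and $\sigma(s^c)$ be the video-level action prediction probability. In STPN, as $\bar{x} = \sum_{t=1}^{T} \lambda_t x_t$, the $s^c$ can be rewritten as:

$$s^c = w_c^\top \bar{x} = \sum_{t=1}^{T} \lambda_t w_c^\top x_t. \tag{11}$$

In STPN, the prediction score of snippet $t$ for action class $c$ in a video is defined as:

$$s_t^c = \lambda_t\, \sigma(w_c^\top x_t), \tag{12}$$

where $\sigma(\cdot)$ denotes the sigmoid function. In MAAN, as $\bar{x} = E\left[\sum_{i=1}^{T} z_i x_i / \sum_{i=1}^{T} z_i\right]$, according to Proposition 1, the $s^c$ can be rewritten as:

$$s^c = w_c^\top E\left[\frac{\sum_{i=1}^{T} z_i x_i}{\sum_{i=1}^{T} z_i}\right] = \sum_{t=1}^{T} c_t\, p_t\, w_c^\top x_t. \tag{13}$$

The latent discriminative probability $p_t$ corresponds to the class-agnostic attention weight for snippet $t$. According to Proposition 1 and Proposition 2, $c_t$ does not relate to snippet $t$, but captures the context of other snippets. $w_c$ corresponds to the class-specific weights for action class $c$ for all the snippets, and $w_c^\top x_t$ indicates the relevance of snippet $t$ to class $c$. To generate temporal proposals, we compute the prediction score of snippet $t$ belonging to action class $c$ in a video as:

$$s_t^c = p_t\, \sigma(w_c^\top x_t). \tag{14}$$

We denote $s^c = (s_1^c, s_2^c, \dots, s_T^c)^\top$ as the class activation sequence (CAS) for class $c$. Similar to STPN, the threshold is applied to the CAS for each class to extract the one-dimensional connected components to generate its temporal proposals. We then perform non-maximum suppression among temporal proposals of each class independently to remove highly overlapped detections.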

Compared to STPN (Eq. (12)), MAAN (Eq. (14)) employs the latent discriminative probability $p_t$ instead of directly using the attention weight $\lambda_t$ (equivalent to $c_t p_t$) for prediction. Proposition 2 suggests that MAAN can suppress the dominant response compared to STPN. Thus, MAAN is more likely to achieve a better performance in weakly-supervised temporal action localization.
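The localization step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the snippet score is the latent probability multiplied by the sigmoid of the class logit, expresses the threshold as a ratio of the maximum CAS value, returns segments as snippet index ranges, and omits non-maximum suppression. All function names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def class_activation_sequence(probs, feats, w_c):
    """Per-snippet score: p_t * sigmoid(w_c . x_t)."""
    return [p * sigmoid(sum(w * x for w, x in zip(w_c, x_t)))
            for p, x_t in zip(probs, feats)]

def proposals_from_cas(cas, ratio):
    """Return [start, end) snippet-index segments where cas > ratio * max(cas)."""
    thr = ratio * max(cas)
    segments, start = [], None
    for t, s in enumerate(cas):
        if s > thr and start is None:
            start = t                      # a connected component opens
        elif s <= thr and start is not None:
            segments.append((start, t))    # the component closes before t
            start = None
    if start is not None:
        segments.append((start, len(cas)))
    return segments

cas = [0.1, 0.8, 0.9, 0.05, 0.7]           # illustrative CAS values
print(proposals_from_cas(cas, 0.5))
```

Converting the index ranges to seconds only requires the snippet length and the frame rate of the video.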

3 Experiments

This section discusses the experiments on the weakly-supervised temporal action localization problem, which is our main focus. We have also extended our algorithm to address the weakly-supervised image object detection problem; the relevant experiments are presented in Appendix F.

3.1 Experimental Settings

Datasets. We evaluate MAAN on two popular action localization benchmark datasets, THUMOS14 (Jiang et al., 2014) and ActivityNet1.3 (Heilbron et al., 2015). THUMOS14 contains 20 action classes for the temporal action localization task, which consists of 200 untrimmed videos (3,027 action instances) in the validation set and 212 untrimmed videos (3,358 action instances) in the test set. Following standard practice, we train the models on the validation set without using the temporal annotations and evaluate them on the test set. ActivityNet1.3 is a large-scale video benchmark for action detection which covers a wide range of complex human activities. It provides samples from 200 activity classes with an average of 137 untrimmed videos per class and 1.41 activity instances per video, for a total of 849 video hours. This dataset contains 10,024 training videos, 4,926 validation videos and 5,044 test videos. In the experiments, we train the models on the training videos and test on the validation videos.

Evaluation Metrics. We follow the standard evaluation metric by reporting mean average precision (mAP) values at several different levels of intersection over union (IoU) thresholds. We use the benchmarking code provided by ActivityNet to evaluate the models.
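For readers unfamiliar with the metric, temporal IoU between a predicted and a ground-truth segment can be computed as in the sketch below. It is an illustration only and does not reproduce the ActivityNet benchmarking code; the function names and the (start, end) representation in seconds are assumptions.

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, ground_truths, iou_threshold):
    """A prediction counts as correct if it overlaps some ground truth enough."""
    return any(temporal_iou(pred, gt) >= iou_threshold for gt in ground_truths)

print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
print(is_true_positive((0.0, 10.0), [(5.0, 15.0)], 0.5))
```

A full average-precision computation additionally ranks predictions by confidence and matches each ground-truth segment at most once, which this sketch leaves out.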

Implementation Details. We use two-stream I3D networks (Carreira & Zisserman, 2017) pre-trained on the Kinetics dataset (Kay et al., 2017) to extract the snippet-level feature vectors for each video. All the videos are divided into sets of non-overlapping video snippets. Each snippet contains 16 consecutive frames or optical flow maps. We input each 16 stacked RGB frames or flow maps into the I3D RGB or flow models to extract the corresponding 1024-dimensional feature vectors. Because video lengths vary, during training we uniformly divide each video into $T$ non-overlapped segments and randomly sample one snippet from each segment. Therefore, we sample $T$ snippets for each video as the input of the model for training. We set $T$ to in our MAAN model. The attention module in Figure 3 consists of an FC layer of , a LeakyReLU layer, an FC layer of , and a sigmoid non-linear activation, to generate the latent discriminative probability $p_t$. We pass the aggregated video-level representation through an FC layer of followed by a sigmoid activation to obtain class scores. We use the ADAM optimizer (Kingma & Ba, 2014) with an initial learning rate of to optimize network parameters. At test time, we first reject classes whose video-level probabilities are below . We then forward all the snippets of the video to generate the CAS for the remaining classes. We generate the temporal proposals by cutting the CAS with a threshold . The combination ratio of two-stream modalities is set to and . Our algorithm is implemented in PyTorch. We run all the experiments on a single NVIDIA Tesla M40 GPU with 24 GB of memory.

3.2 THUMOS14 dataset

We first compare our MAAN model on the THUMOS14 dataset with several baseline models that use different feature aggregators in Figure 3 to gain some basic understanding of the behavior of our proposed MAA. The descriptions of the four baseline models are listed below.

(1) STPN. It employs the weighted sum aggregation to generate the video-level representation. (2) Dropout. It explicitly performs dropout sampling with dropout probability in STPN to obtain the video-level representation, , . (3) Normalization. Denoted as “Norm” in the experiments, it utilizes the weighted average aggregation for the video-level representation. (4) SoftMax Normalization. Denoted as “SoftMaxNorm” in the experiments, it applies the softmax function as the normalized weights to get the weighted average aggregated video-level feature, .
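The four baseline aggregators can be sketched in plain Python to make their differences explicit. This is a hedged illustration rather than the authors' implementation: the dropout probability is left as a parameter, dropped snippets are simply zeroed without rescaling (an assumption), and the softmax is applied directly to the given attention scores (also an assumption). All function names are hypothetical.

```python
import math
import random

def weighted_sum(lams, feats):
    """STPN-style aggregation: sum_t lambda_t * x_t."""
    dim = len(feats[0])
    return [sum(l * x[d] for l, x in zip(lams, feats)) for d in range(dim)]

def dropout_sum(lams, feats, drop_prob, rng):
    """Weighted sum after dropping each snippet independently with drop_prob."""
    kept = [l if rng.random() >= drop_prob else 0.0 for l in lams]
    return weighted_sum(kept, feats)

def norm_average(lams, feats):
    """Weighted average: weights are divided by their sum."""
    total = sum(lams)
    return weighted_sum([l / total for l in lams], feats)

def softmax_norm(lams, feats):
    """Weighted average with softmax-normalized weights."""
    peak = max(lams)                       # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in lams]
    total = sum(exps)
    return weighted_sum([e / total for e in exps], feats)

lams = [0.9, 0.2, 0.1]                     # illustrative attention weights
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(0)                     # seeded for repeatability
print(weighted_sum(lams, feats))
print(dropout_sum(lams, feats, 0.5, rng))
print(norm_average(lams, feats))
print(softmax_norm(lams, feats))
```

Only the first two aggregators change the scale of the output with the total weight; the two normalized variants always return a convex combination of the snippet features.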

Method         AP@0.1  AP@0.2  AP@0.3  AP@0.4  AP@0.5  AP@0.6  AP@0.7  AP@0.8  AP@0.9  Cls mAP
STPN             57.4    48.7    40.3    29.5    19.8    11.4     5.8     1.7     0.2     94.2
Dropout          53.4    44.9    35.4    25.0    16.2     8.7     4.3     1.3     0.1     92.4
Norm             48.0    39.9    30.5    20.9    12.3     5.7     2.4     0.6     0.1     95.2
SoftMaxNorm      22.2    17.2    12.8     9.6     6.3     4.3     2.8     1.0     0.1     94.8
MAAN             59.8    50.8    41.1    30.6    20.3    12.0     6.9     2.6     0.2     94.1

Table 1: Comparison of the proposed MAAN with four baseline feature aggregators on the THUMOS14 test set. All values are reported in percentage. AP@k denotes the detection average precision at IoU threshold k. The last column is the classification mAP.

We test all the models with the cutting threshold set to 0.2 of the max value of the CAS. We compare the detection average precision (%) at IoU = [0.1 : 0.1 : 0.9] and the video-level classification mean average precision (%) (denoted as Cls mAP) on the test set in Table 1. From Table 1, we can observe that although all the methods achieve a similar video-level classification mAP, their localization performances vary a lot. This shows that good video-level classification performance does not guarantee good snippet-level localization performance, because the former only requires the correct prediction of the existence of an action, while the latter requires the correct prediction of both its existence and its duration and location. Moreover, Table 1 demonstrates that MAAN matches or outperforms all the baseline models at every IoU threshold in the weakly-supervised temporal localization task. Both “Norm” and “SoftMaxNorm” are normalized weighted average aggregations. However, “SoftMaxNorm” performs the worst, because the softmax function over-amplifies the weight of the most salient snippet. As a result, it tends to identify very few discriminative snippets and obtains sparse and non-integral localization. “Norm” also performs worse than our MAAN. It is the normalized weighted average over the snippet-level representation, while MAAN can be considered as the normalized weighted average (expectation) over the subset-level representation. Therefore, MAAN encourages the identification of dense and integral action segments, whereas “Norm” encourages the identification of only several discriminative snippets. MAAN works better than “Dropout” because “Dropout” randomly drops snippets with a uniform probability regardless of their attention weights, so the scale of the aggregated feature varies a lot from iteration to iteration. In contrast, MAAN samples with the learnable latent discriminative probability and takes the expectation, which keeps the scale of the aggregated feature stable. Compared to STPN, MAAN also achieves superior results. MAAN implicitly factorizes the attention weight into $c_t p_t$, where $p_t$ learns the latent discriminative probability of the current snippet, and $c_t$ captures the contextual information and regularizes the network to learn a more informative aggregation. The properties of MAA prevent the predicted class activation sequences from concentrating only on the most salient regions. The quantitative results show the effectiveness of the MAA feature aggregator.

Figure 5: Visualization of the one-dimensional activation sequences on an example of the HammerThrow action in the test set of THUMOS14. The horizontal axis denotes the temporal dimension, which is normalized to [0, 1]. The first row of each model shows the ground-truth action segments. The second row demonstrates the predicted activation sequence for class HammerThrow.

Figure 5 visualizes the one-dimensional CASs of the proposed MAAN and all the baseline models. The temporal CAS generated by MAAN can cover large and dense regions to obtain more accurate action segments. In the example in Figure 5, MAAN discovers almost all the actions that are annotated in the ground truth, whereas STPN misses several action segments and also tends to output only the more salient regions in each action segment. The activation sequences of the other methods are much sparser than those of MAAN. The first row of Figure 5 shows several action segments in red and in green, corresponding to action segments that are relatively difficult and easy to localize, respectively. We can see that all the easy segments contain the whole person who is performing the “HammerThrow” action, while the difficult segments contain only a part of the person or the action. Our MAAN successfully localizes the easy segments as well as the difficult segments, whereas all the other methods fail on the difficult ones. This shows that MAAN can identify several dense and integral action regions rather than only the most discriminative region identified by the other methods.

We also compare our model with the state-of-the-art action localization approaches on the THUMOS14 dataset. The numerical results are summarized in Table 2. We include both fully and weakly-supervised learning, as in (Nguyen et al., 2018). As shown in Table 2, our implemented STPN performs slightly better than the results reported in the original paper (Nguyen et al., 2018). From Table 2, our proposed MAAN outperforms the STPN and most of the existing weakly-supervised action localization approaches. Furthermore, our model still presents competitive results compared with several recent fully-supervised approaches even when trained with only video-level labels.

| Supervision | Method | AP@0.1 | AP@0.2 | AP@0.3 | AP@0.4 | AP@0.5 | AP@0.6 | AP@0.7 | AP@0.8 | AP@0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Fully supervised | Richard et al. (Richard & Gall, 2016) | 39.7 | 35.7 | 30.0 | 23.2 | 15.2 | - | - | - | - |
| Fully supervised | Shou et al. (Shou et al., 2016) | 47.7 | 43.5 | 36.3 | 28.7 | 19.0 | 10.3 | 5.3 | - | - |
| Fully supervised | Yeung et al. (Yeung et al., 2016) | 48.9 | 44.0 | 36.0 | 26.4 | 17.1 | - | - | - | - |
| Fully supervised | Yuan et al. (Yuan et al., 2016) | 51.4 | 42.6 | 33.6 | 26.1 | 18.8 | - | - | - | - |
| Fully supervised | Shou et al. (Shou et al., 2017) | - | - | 40.1 | 29.4 | 23.3 | 13.1 | 7.9 | - | - |
| Fully supervised | Yuan et al. (Yuan et al., 2017b) | 51.0 | 45.2 | 36.5 | 27.8 | 17.8 | - | - | - | - |
| Fully supervised | Xu et al. (Xu et al., 2017) | 54.5 | 51.5 | 44.8 | 35.6 | 28.9 | - | - | - | - |
| Fully supervised | Zhao et al. (Zhao et al., 2017) | 66.0 | 59.4 | 51.9 | 41.0 | 29.8 | - | - | - | - |
| Weakly supervised | Wang et al. (Wang et al., 2017) | 44.4 | 37.7 | 28.2 | 21.1 | 13.7 | - | - | - | - |
| Weakly supervised | Singh & Lee (Singh & Lee, 2017) | 36.4 | 27.8 | 19.5 | 12.7 | 6.8 | - | - | - | - |
| Weakly supervised | STPN (Nguyen et al., 2018) (UN) | 45.3 | 38.8 | 31.1 | 23.5 | 16.2 | 9.8 | 5.1 | 2.0 | 0.3 |
| Weakly supervised | STPN (Nguyen et al., 2018) (I3D) | 52.0 | 44.7 | 35.5 | 25.8 | 16.9 | 9.9 | 4.3 | 1.2 | 0.1 |
| Weakly supervised | STPN (Nguyen et al., 2018) (ours) | 57.4 | 48.7 | 40.3 | 29.5 | 19.8 | 11.4 | 5.8 | 1.7 | 0.2 |
| Weakly supervised | AutoLoc (Shou et al., 2018) | - | - | 35.8 | 29.0 | 21.2 | 13.4 | 5.8 | - | - |
| Weakly supervised | MAAN (ours) | 59.8 | 50.8 | 41.1 | 30.6 | 20.3 | 12.0 | 6.9 | 2.6 | 0.2 |

Table 2: Comparison of our algorithm to the previous approaches on the THUMOS14 test set. AP (%) is reported for different IoU thresholds. Both the fully-supervised and the weakly-supervised results are listed. (“UN”: using UntrimmedNet features, “I3D”: using I3D features, “ours”: our implementation.)

| Supervision | Method | AP@0.5 | AP@0.75 | AP@0.95 |
|---|---|---|---|---|
| Fully supervised | Singh & Cuzzolin (Singh & Cuzzolin, 2016) | 34.5 | - | - |
| Fully supervised | Wang & Tao (Wang & Tao, 2016) | 45.1 | 4.1 | 0.0 |
| Fully supervised | Shou et al. (Shou et al., 2017) | 45.3 | 26.0 | 0.2 |
| Fully supervised | Xiong et al. (Xiong et al., 2017) | 39.1 | 23.5 | 5.5 |
| Weakly supervised | STPN (Nguyen et al., 2018) | 29.3 | 16.9 | 2.6 |
| Weakly supervised | STPN (Nguyen et al., 2018) (ours) | 29.8 | 17.7 | 4.1 |
| Weakly supervised | MAAN (ours) | 33.7 | 21.9 | 5.5 |

Table 3: Comparison of our algorithm to the state-of-the-art approaches on the ActivityNet1.3 validation set. AP (%) is reported for different IoU thresholds. (“ours” means our implementation.)
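
The IoU thresholds in Tables 2 and 3 refer to the temporal overlap between a predicted segment and a ground-truth segment. As a minimal sketch, assuming segments are given as (start, end) pairs in seconds and that a prediction is commonly counted as correct when its overlap reaches the threshold, the computation could look as follows. The function names are illustrative and are not taken from the paper.

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two segments given as (start, end) pairs."""
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    # Length of the overlapping interval (zero if the segments are disjoint).
    inter = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    # Union length = sum of the two lengths minus the shared part.
    union = (end_a - start_a) + (end_b - start_b) - inter
    return inter / union if union > 0 else 0.0


def matches(pred, gt, iou_threshold):
    """Assumed matching rule: a prediction matches if IoU >= threshold."""
    return temporal_iou(pred, gt) >= iou_threshold


if __name__ == "__main__":
    # Overlap of 5 s over a union of 15 s.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(matches((0.0, 10.0), (5.0, 15.0), 0.5))
```

Higher thresholds such as 0.75 or 0.95 demand tighter segment boundaries, which is consistent with the AP values dropping sharply toward the right-hand columns of both tables.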

3.3 ActivityNet1.3 dataset

We train the MAAN model on the ActivityNet1.3 training set and compare our performance with the recent state-of-the-art approaches on the validation set in Table 3. Action segments in ActivityNet are usually much longer than those in THUMOS14 and occupy a larger percentage of a video. We use a set of thresholds, defined relative to the max value of the CAS, to generate the proposals from the one-dimensional CAS. As shown in Table 3, with this set of thresholds, our implementation of STPN performs slightly better than the results reported in the original paper (Nguyen et al., 2018). With the same thresholds and experimental setting, our proposed MAAN model outperforms the STPN approach on the large-scale ActivityNet1.3. As on THUMOS14, our model achieves results that are close to those of some of the fully-supervised approaches.
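
The proposal generation step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the CAS is assumed to be a non-negative one-dimensional array for a single class, the threshold ratios are hypothetical placeholders (the values used in the paper are not reproduced here), and each proposal is scored by its mean CAS value, which is also an assumption.

```python
import numpy as np


def cas_to_proposals(cas, ratios):
    """Threshold a 1-D CAS at ratio * max(cas) and return contiguous runs.

    Returns a list of (start, end, score) tuples, where `end` is exclusive
    and `score` is the mean CAS value inside the run (an assumed scoring rule).
    """
    cas = np.asarray(cas, dtype=float)
    proposals = []
    for ratio in ratios:
        mask = cas >= ratio * cas.max()
        # Pad with False so every run has a rising and a falling edge.
        padded = np.concatenate(([False], mask, [False]))
        edges = np.flatnonzero(padded[1:] != padded[:-1])
        # Edges alternate: run start, run end (exclusive), run start, ...
        for start, end in zip(edges[0::2], edges[1::2]):
            score = float(cas[start:end].mean())
            proposals.append((int(start), int(end), score))
    return proposals


if __name__ == "__main__":
    cas = [0.1, 0.8, 1.0, 0.2, 0.6, 0.7, 0.1]
    # Hypothetical ratios, chosen only for this example.
    for proposal in cas_to_proposals(cas, ratios=[0.5, 0.9]):
        print(proposal)
```

Using several ratios yields nested proposals of different lengths for the same peak, so long actions can be covered by the proposals from the lower thresholds.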

4 Conclusion

We have proposed the marginalized average attentional network (MAAN) for weakly-supervised temporal action localization. MAAN employs a novel marginalized average aggregation (MAA) operation to encourage the network to identify dense and integral action segments and is trained in an end-to-end fashion. Theoretically, we have proved that MAA reduces the gap between the most discriminative regions in the video and the others, and thus MAAN generates better class activation sequences from which to infer the action locations. We have also proposed a fast algorithm to reduce the computational complexity of MAA. Our proposed MAAN achieves superior performance on both the THUMOS14 and the ActivityNet1.3 datasets on weakly-supervised temporal action localization tasks compared to current state-of-the-art methods.
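
To make the MAA operation concrete, the following sketch computes the expectation of the averaged subset feature in two ways. It assumes that each snippet is selected independently with its latent probability and that the empty subset contributes a zero vector; both are assumptions made for this illustration. The first function enumerates all 2^T subsets. The second uses a simple dynamic program over the number of other selected snippets and costs O(T^3) as written; it is not the O(T^2) algorithm proposed in the paper and serves only to check the brute-force result. All function names are hypothetical.

```python
import itertools

import numpy as np


def maa_bruteforce(x, p):
    """Exact expectation over all 2^T subsets. x: (T, D), p: (T,)."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    out = np.zeros(x.shape[1])
    for bits in itertools.product([0, 1], repeat=len(p)):
        z = np.array(bits)
        if z.sum() == 0:
            continue  # assumption: the empty subset contributes zero
        # Probability of this subset under independent selection.
        prob = np.prod(np.where(z == 1, p, 1.0 - p))
        out += prob * (z @ x) / z.sum()
    return out


def maa_count_dp(x, p):
    """Same expectation via the count of other selected snippets."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    T = len(p)
    coeffs = np.zeros(T)
    for i in range(T):
        # dist[k] = P(exactly k snippets other than i are selected).
        dist = np.zeros(T)
        dist[0] = 1.0
        for j in range(T):
            if j == i:
                continue
            new = dist * (1.0 - p[j])
            new[1:] += dist[:-1] * p[j]
            dist = new
        # Snippet i gets weight 1 / (k + 1) when k others are selected.
        coeffs[i] = p[i] * np.sum(dist / np.arange(1, T + 1))
    return coeffs @ x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 3))
    p = rng.uniform(size=6)
    print(np.allclose(maa_bruteforce(x, p), maa_count_dp(x, p)))
```

When every probability equals one, both functions reduce to plain average pooling over the snippets, which is a convenient sanity check for any implementation.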

5 Acknowledgement

We thank our anonymous reviewers for their helpful feedback and suggestions. Prof. Ivor W. Tsang was supported by ARC FT130100746, ARC LP150100671, and DP180100106.

References


  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.
  • Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. ACM, 2009.
  • Bilen & Vedaldi (2016) Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016.
  • Carreira & Zisserman (2017) J. Carreira and A Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  • Cinbis et al. (2017) Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Weakly supervised object localization with multi-fold multiple instance learning. IEEE transactions on pattern analysis and machine intelligence, 39(1):189–203, 2017.
  • Girdhar & Ramanan (2017) Rohit Girdhar and Deva Ramanan. Attentional pooling for action recognition. In Advances in Neural Information Processing Systems, pp. 33–44, 2017.
  • Gkioxari et al. (2015) Georgia Gkioxari, Ross Girshick, and Jitendra Malik. Contextual action recognition with r* cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1080–1088, 2015.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1025–1035, 2017.
  • Heilbron et al. (2015) F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
  • Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1693–1701, 2015.
  • Jiang et al. (2014) Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes., 2014.
  • Kantorov et al. (2016) Vadim Kantorov, Maxime Oquab, Minsu Cho, and Ivan Laptev. ContextLocNet: Context-aware deep network models for weakly supervised localization. In ECCV, 2016.
  • Kay et al. (2017) W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. In arXiv:1705.06950v1, 2017.
  • Kim et al. (2017) Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. Structured attention networks. arXiv preprint arXiv:1702.00887, 2017.
  • Kingma & Ba (2014) D. Kingma and J. Ba. Adam: A method for stochastic optimization. In arXiv preprint arXiv:1412.6980, 2014.
  • Kong & Fowlkes (2017) Shu Kong and Charless Fowlkes. Low-rank bilinear pooling for fine-grained classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7025–7034. IEEE, 2017.
  • Mensch & Blondel (2018) Arthur Mensch and Mathieu Blondel. Differentiable dynamic programming for structured prediction and attention. arXiv preprint arXiv:1802.03676, 2018.
  • Nguyen et al. (2018) Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. Weakly supervised action localization by sparse temporal pooling network. CVPR, 2018.
  • Oquab et al. (2015) Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 685–694, 2015.
  • Pinheiro & Collobert (2015) Pedro O Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1713–1721, 2015.
  • Richard & Gall (2016) Alexander Richard and Juergen Gall. Temporal action detection using a statistical language model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3131–3140, 2016.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Sharma et al. (2015) Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. Action recognition using visual attention. arXiv preprint arXiv:1511.04119, 2015.
  • Shou et al. (2016) Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1049–1058, 2016.
  • Shou et al. (2017) Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR, 2017.
  • Shou et al. (2018) Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, and Shih-Fu Chang. Autoloc: Weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 154–171, 2018.
  • Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pp. 568–576, 2014.
  • Singh & Cuzzolin (2016) Gurkirt Singh and Fabio Cuzzolin. Untrimmed video classification for activity detection: submission to activitynet challenge. arXiv preprint arXiv:1607.01979, 2016.
  • Singh & Lee (2017) Krishna Kumar Singh and Yong Jae Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In The IEEE International Conference on Computer Vision (ICCV), 2017.
  • Tran et al. (2015) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497, 2015.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010, 2017.
  • Wah et al. (2011) C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, 2011.
  • Wang et al. (2016) Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pp. 20–36. Springer, 2016.
  • Wang et al. (2017) Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool. Untrimmednets for weakly supervised action recognition and detection. CVPR, 2017.
  • Wang & Tao (2016) R. Wang and D. Tao. Acitivitynet large scale activity recognition challenge. UTS at Activitynet, 2016.
  • Wei et al. (2017) Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming Cheng, Yao Zhao, and Shuicheng Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In IEEE CVPR, 2017.
  • Xiong et al. (2017) Yuanjun Xiong, Yue Zhao, Limin Wang, Dahua Lin, and Xiaoou Tang. A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716, 2017.
  • Xu et al. (2017) H. A. Xu, A. Das, and K. Saenko. R-c3d: Region convolutional 3d network for temporal activity detection. In ICCV, 2017.
  • Yeung et al. (2016) Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2678–2687, 2016.
  • Yuan et al. (2016) J. Yuan, B. Ni, X. Yang, and A. A. Kassim. Temporal action localization with pyramid of score distribution features. In CVPR, 2016.
  • Yuan et al. (2017a) Yuan Yuan, Xiaodan Liang, Xiaolong Wang, Dit-Yan Yeung, and Abhinav Gupta. Temporal dynamic graph LSTM for action-driven video object detection. In ICCV, pp. 1819–1828, 2017a.
  • Yuan et al. (2017b) Z. Yuan, J. Stroud, T. Lu, and J. Deng. Temporal action localization by structured maximal sums. In CVPR, 2017b.
  • Zhang et al. (2017) Dingwen Zhang, Deyu Meng, and Junwei Han. Co-saliency detection via a self-paced multiple-instance learning framework. IEEE transactions on pattern analysis and machine intelligence, 39(5):865–878, 2017.
  • Zhang et al. (2018a) Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. Gaan: Gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294, 2018a.
  • Zhang et al. (2018b) Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas Huang. Adversarial complementary learning for weakly supervised object localization. arXiv preprint arXiv:1804.06962, 2018b.
  • Zhao et al. (2017) Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. In ICCV, 2017.
  • Zhou et al. (2016a) B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. CVPR, 2016a.
  • Zhou et al. (2014) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856, 2014.
  • Zhou et al. (2016b) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition, 2016b.
  • Zhu et al. (2017) Yi Zhu, Yanzhao Zhou, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Soft proposal networks for weakly supervised object localization. arXiv preprint arXiv:1709.01829, 2017.

Appendix A Proof of Proposition 1

A.1 Proof of Equation (3)


In addition,


Thus, we achieve


A.2 Proof of


Denote , then we have


Since , we achieve that . Since and , and , it follows that .

Appendix B Proof of Proposition 2


When , we have . Then inequality (4) trivially holds true. Without loss of generality, assume and there exists a strict inequality. Then such that for and for . Otherwise, we obtain or for and there exists a strict inequality. It follows that or , which contradicts . Thus, we obtain the set .

Without loss of generality, for and , we have and , then we obtain that . It follows that


Appendix C Proof of Proposition 3

C.1 Computation of


where denotes the indicator function.

We achieve Eq. (26) by partitioning the summation into groups . Terms belonging to group have .

Let , and we achieve Eq. (28).

C.2 Proof of the recurrence formula of

We now give the proof of the recurrence formula in Eq. (29).


Then, we have