Impressive progress has been made in the past two years on Temporal Action Localization (TAL) in untrimmed videos [23, 15, 46, 67, 68, 51, 50, 36, 22, 73, 17, 66, 11, 10, 16, 7, 6, 69, 52]. These methods were proposed for the fully-supervised setting: model training requires full annotation of the ground truth temporal boundary (start time and end time) of each action instance. However, untrimmed videos are usually very long and contain substantial background content. Manually annotating temporal boundaries for a new large-scale dataset is therefore very expensive and time-consuming, which might prohibit applying fully-supervised methods to new domains that lack enough fully annotated training data.
This motivates us to develop TAL methods that require significantly fewer ground truth annotations for training. As illustrated in Fig. 1, in this paper we focus on the following scenario: during training, we only have the video-level labels, which are much easier to collect than the boundary annotations; during testing, we still aim to predict both (1) the action class and (2) the temporal boundary (i.e. start time and end time) of each action instance. We refer to this scenario as the weakly-supervised setting, which is the focus of this paper.
Recently, a few methods have been proposed to tackle TAL in such a weakly-supervised setting. UntrimmedNet and Hide-and-Seek achieve state-of-the-art performance and carry out the localization in a similar manner. Given a training video, several segments are randomly sampled and fed into a network together to yield a video-level class prediction. During testing, the trained network is slid over time to produce a sequence of classification scores for each action. This score sequence is similar to the Class Activation Map but has only one dimension, and thus we refer to it as the Class Activation Sequence (CAS). Finally, a simple thresholding method is applied on the CAS to localize each action instance in terms of its start time and end time.
However, performing localization via thresholding may not be robust to noise in the CAS. When there are a few dips of low activation within an interval of high activations, a large threshold might over-segment one action instance into several segments, whereas a small threshold might include too much irrelevant background preceding and succeeding the action instance. One possible solution is to improve the quality of the CAS. Alternatively, instead of thresholding, many fully-supervised TAL methods detect action instances at the segment level directly [51, 6]. Some works further employ boundary regression models to predict more accurate boundaries [36, 17, 66, 16]. Thus, we design a framework called AutoLoc, which conducts direct boundary prediction by predicting the center location and the duration of each action instance.
But how to train the boundary prediction model without ground truth boundary annotations remains unsolved. To address this challenge, we propose a novel Outer-Inner-Contrastive (OIC) loss to provide the needed segment-level supervision for training the boundary prediction model. Given the CAS of the ground truth action, we denote the inner boundary as the boundary of a predicted action instance, and we inflate the inner boundary slightly to obtain the outer boundary. As illustrated in Fig. 1, we define the OIC loss as the average activation in the outer red area minus the average activation in the inner green area. Minimizing the OIC loss favors an area of high inner activations but low outer activations, and thus localizes the salient interval on the CAS, which is likely to be well aligned with the ground truth segment. Equipped with the OIC loss, AutoLoc can automatically discover the segment-level supervision from the video-level annotations for training the boundary prediction model. In Sec. 5, we experimentally compare with the state-of-the-art methods and also study several variants of our model.
In summary, we make three novel contributions in this paper:
(1) To the best of our knowledge, AutoLoc is the first weakly-supervised TAL framework that can directly predict the temporal boundary of each action instance with only the video-level annotations available during training, specifically addressing the localization task at the segment level.
(2) To enable the training of such a parametric boundary prediction model, we design a novel OIC loss to automatically discover the segment-level supervision, and we prove that the OIC loss is differentiable with respect to the underlying boundary prediction model.
(3) We demonstrate the effectiveness of AutoLoc on two standard benchmarks. AutoLoc significantly outperforms the state-of-the-art weakly-supervised TAL methods and even achieves results comparable to some fully-supervised methods that use the boundary annotations during training. When the overlap IoU threshold is set to 0.5 during evaluation, our method improves mAP on THUMOS’14 from 13.7% to 21.2% (54.7% relative gain) and improves mAP on ActivityNet from 7.4% to 27.3% (268.9% relative gain).
2 Related Works
2.0.1 Video Action Analysis
Detailed reviews can be found in recent surveys [65, 42, 2, 9, 3, 31]. Researchers have developed quite a few deep networks for video action analysis, such as 3D ConvNets [60, 27, 61], LSTM, the two-stream network, I3D, etc. For example, Wang et al. proposed the Temporal Segment Network, which employed the two-stream network to model the long-range temporal structure in video and served as an effective backbone network in various video analysis tasks such as recognition, localization, and weakly-supervised learning.
2.0.2 Temporal Action Localization with Weak Supervision and Full Supervision
Several large-scale video datasets have been created for TAL, such as Charades [54, 53], ActivityNet, and THUMOS [29, 19]. Obtaining the ground truth temporal boundaries that provide full supervision for training fully-supervised TAL models requires substantial annotation effort for each such large-scale dataset. Therefore, it is useful and important to develop TAL models that can be trained with weak supervision only.
Video-level annotation is one kind of weak supervision that can be collected more easily and is thus of great interest to the community. Sun et al. were the first to consider TAL with only the video-level annotations available during training, and the authors discovered additional supervision from web images. Recently, Singh et al. designed Hide-and-Seek to address the challenge that weakly-supervised detection methods usually focus on the most discriminative parts while neglecting other relevant parts of the target instance. Wang et al. proposed a framework called UntrimmedNet, consisting of a classification module to perform action classification and a selection module to detect important temporal segments. These recent methods effectively learn an action classification model during training in order to generate a reasonably good Class Activation Sequence (CAS) over time. But in order to detect temporal boundaries, simple thresholding is applied on the CAS during testing. Therefore, although these methods can excel at video-level action recognition, their temporal localization performance still has large room for improvement.
However, the fully-supervised TAL methods (boundary annotations available during training) have gone beyond the simple thresholding method. First, some researchers performed localization at the segment level: they first generated candidate segments via sliding windows or proposal methods, and then classified each segment into certain actions [51, 17, 66, 16, 7]. Motivated by the success of single-shot object detection methods [38, 44, 43], Lin et al. removed the proposal stage and directly conducted TAL in a single-shot fashion to simultaneously predict the temporal boundary and the action class. Second, direct boundary prediction via anchor generation and boundary regression has recently been adapted from object detection [38, 44, 43, 45, 18] to fully-supervised TAL and has proven quite effective in detecting more accurate boundaries [36, 73, 17, 66, 16]. This motivates us to generalize segment-level localization and direct boundary prediction to weakly-supervised TAL: we develop AutoLoc to first generate anchor segments and then regress their boundaries to obtain the predicted segments; in order to train the boundary regressors, we propose the OIC loss to provide the segment-level supervision.
2.0.3 Weakly-supervised Deep Learning Methods
Other types of weak supervision for action detection have also been explored in the past. For instance, Huang et al.  and Richard et al.  both utilized the order of actions as the supervision used during training. Mettes et al.  worked on the spatio-temporal action detection using only the point-level supervision for training.
Weakly-supervised deep learning methods have also been widely studied in other vision tasks such as object detection [74, 75, 49, 35, 30, 14, 59, 71, 57, 5, 32, 20], semantic segmentation [34, 24, 41, 4], video captioning, visual relation detection, etc. As a counterpart of weakly-supervised video TAL, weakly-supervised image object detection has been significantly improved by combining Multiple Instance Learning (MIL) and deep networks [49, 30, 59, 5, 32]: built upon Fast-RCNN, these methods first generated candidate proposals beforehand; then they employed deep networks to classify each proposal, and the scores from all proposals were fused together to obtain one label prediction for the whole image, to be compared with the image-level label. One such MIL-based deep network is ContextLocNet, which further inflated the prediction box to obtain its outer box in order to take the contextual information into account. Our work bypasses the costly proposal generation and predicts the boundaries from raw input videos in a single-shot fashion. Although we focus on video TAL in this paper, it would also be interesting to adapt our method to image object detection in the future.
3 Outer-Inner-Contrastive Loss
In this section, we formulate how to compute the proposed OIC loss during the network forward pass of AutoLoc and prove that the OIC loss is differentiable with respect to the underlying boundary prediction model during the backward pass. The whole pipeline and details of AutoLoc will be presented in Sec. 4.
As illustrated by the bottom-right part in Fig. 2, for each predicted segment , we can compute its OIC loss. Each predicted segment consists of the action/inner boundary , the inflated outer boundary , and the action class . These boundaries are at the snippet-level granularity (for example, boundary corresponds to the location of the -st snippet). In order to fetch the corresponding snippet-level activation on the CAS, we round each boundary of continuous value to its nearest integer (i.e. the location of the nearest snippet). We denote the class activation at the -th snippet on the CAS of action as . The OIC loss of the prediction is defined as the average activation in the outer area minus the average activation in the inner area:
During training, we set to the ground truth action and we minimize to encourage high activations inside and penalize high activations outside.
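The verbal definition above (average outer activation minus average inner activation, with boundaries rounded to the nearest snippet) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `oic_loss`, the 0-based snippet indexing, the inclusive boundaries, and the convention that the outer area excludes the inner area are all assumptions made for the example.

```python
import numpy as np

def oic_loss(cas, inner, outer):
    """Average activation in the outer area minus average activation in the inner area.

    cas   : 1-D array, activations of one action class (one value per snippet).
    inner : (start, end) of the predicted segment in snippet units, may be fractional.
    outer : (start, end) of the inflated segment; assumed to contain `inner`.
    Boundaries are inclusive and 0-based (an assumption of this sketch).
    """
    n = len(cas)
    # Round continuous boundaries to the nearest snippet and keep them inside the video.
    x1, x2 = (int(np.clip(np.rint(b), 0, n - 1)) for b in inner)
    X1, X2 = (int(np.clip(np.rint(b), 0, n - 1)) for b in outer)

    inner_sum = cas[x1:x2 + 1].sum()
    inner_len = x2 - x1 + 1
    # The outer area is the inflated segment with the inner segment removed.
    outer_sum = cas[X1:X2 + 1].sum() - inner_sum
    outer_len = (X2 - X1 + 1) - inner_len

    inner_mean = inner_sum / inner_len
    outer_mean = outer_sum / outer_len if outer_len > 0 else 0.0
    return outer_mean - inner_mean

cas = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0])
well_aligned = oic_loss(cas, inner=(2, 5), outer=(1, 6))   # high inside, low outside
misaligned = oic_loss(cas, inner=(0, 3), outer=(-1, 4))    # half background inside
```

With this toy CAS, the well-aligned segment receives a lower (more negative) loss than the misaligned one, which is the behavior the loss is designed to reward.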
We prove that the OIC loss is differentiable to the inner and outer boundaries. Therefore, the supervision discovered by the OIC loss can be back-propagated to the underlying boundary prediction model. Detailed derivation can be found in the supplementary material. The gradients corresponding to the predicted segment w.r.t its inner boundary are as follows:
The gradients corresponding to the predicted segment w.r.t its outer boundary are as follows:
Note that these gradients indeed have the physical meanings about how to adjust the boundaries. For example, in Equation 2, represents how much the average inner activation is higher than the activation at the inner left boundary . If the average inner activation is much higher than the activation at the inner left boundary , is likely to belong to the background and thus we would like to move in the positive (right) direction. Similarly, represents how much the activation at the inner left boundary is higher than the average outer activation. is the adversarial outcome of and . Consequently, indicates how the model wants to adjust the inner left boundary eventually: if , moves in the positive (right) direction; if , moves in the negative (left) direction.
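Since the equations themselves are not reproduced in the text above, the following is a hedged reconstruction of the inner-boundary gradients under a continuous relaxation. All symbols are assumptions introduced for illustration: $[x_1, x_2]$ is the inner boundary, $[X_1, X_2]$ the outer boundary, $f_k$ the activation of action $k$, and $A_i$, $A_o$ the average inner and outer activations. The original derivation works with rounded snippet indices, so its normalizing lengths may differ by constants.

```latex
A_i = \frac{\int_{x_1}^{x_2} f_k(u)\,du}{x_2 - x_1}, \qquad
A_o = \frac{\int_{X_1}^{X_2} f_k(u)\,du - \int_{x_1}^{x_2} f_k(u)\,du}
           {(X_2 - X_1) - (x_2 - x_1)}, \qquad
L_{OIC} = A_o - A_i

\frac{\partial L_{OIC}}{\partial x_1}
  = \frac{f_k(x_1) - A_o}{(X_2 - X_1) - (x_2 - x_1)}
  - \frac{A_i - f_k(x_1)}{x_2 - x_1}

\frac{\partial L_{OIC}}{\partial x_2}
  = \frac{A_o - f_k(x_2)}{(X_2 - X_1) - (x_2 - x_1)}
  + \frac{A_i - f_k(x_2)}{x_2 - x_1}
```

Under these assumptions the two terms of $\partial L_{OIC} / \partial x_1$ match the verbal description: one measures how much the average inner activation exceeds the activation at the inner left boundary, the other how much that boundary activation exceeds the average outer activation. When the first dominates, the derivative is negative and a gradient descent step moves the inner left boundary to the right.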
In this section, we walk through the pipeline of AutoLoc as illustrated in Fig. 2. The training and testing pipelines of AutoLoc are very similar, so we distinguish them explicitly only where they differ.
4.1 Input Data Preparation and Feature Extraction
Each input data sample fed into AutoLoc is a single untrimmed video. Following UntrimmedNet, for each input video, we first divide it into non-overlapping snippets of 15 frames each and extract a feature for each snippet individually.
UntrimmedNet has been proven to be effective in training a TSN classifier with only the video-level labels. Therefore, we first train an UntrimmedNet network (the soft version) in advance and then use the trained network as our backbone for feature extraction.
This backbone network consists of one spatial stream accepting RGB input and one temporal stream accepting Optical Flow input. For each stream, we employ the Inception network architecture with Batch Normalization and extract the 1024-dimensional feature at the layer. Finally, for each snippet, we concatenate the extracted spatial feature and temporal feature into one feature vector of 2048 dimensions. For each input video of snippets in total, we obtain a feature map of shape 2048 (channels) by (snippets).
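The shape bookkeeping described above can be illustrated with a small NumPy sketch. The backbone itself is not modeled; `num_snippets` and `build_feature_map` are hypothetical helper names, and dropping trailing frames that do not fill a whole snippet is an assumption of this sketch.

```python
import numpy as np

SNIPPET_LEN = 15  # frames per snippet, as stated in the text

def num_snippets(num_frames, snippet_len=SNIPPET_LEN):
    # Non-overlapping snippets; leftover frames at the end are dropped (assumption).
    return num_frames // snippet_len

def build_feature_map(spatial_feats, temporal_feats):
    """Concatenate per-snippet features from the two streams.

    spatial_feats, temporal_feats: arrays of shape (T, 1024), one row per snippet.
    Returns an array of shape (2048, T): channels by snippets.
    """
    assert spatial_feats.shape == temporal_feats.shape
    return np.concatenate([spatial_feats, temporal_feats], axis=1).T

T = num_snippets(100)  # a 100-frame video yields 6 full snippets
feature_map = build_feature_map(np.zeros((T, 1024)), np.ones((T, 1024)))
```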
4.2 Classification Branch
The goal of the classification branch is to obtain the Class Activation Sequence (CAS). We build our Activation Generator S based on UntrimmedNet. On top of the layer, UntrimmedNet attaches one Fully Connected (FC) layer of nodes to classify each snippet into action categories and also attaches another Fully Connected (FC) layer of just 1 node to predict the attention score (importance) for each snippet. The corresponding scores from the spatial stream and the temporal stream are averaged to obtain the final score. For each video, we use these two FC layers in the UntrimmedNet that are trained beforehand to respectively extract a classification score sequence of shape (actions) by (snippets) and an attention score sequence of dimensions. For each snippet, we set its classification scores of all classes to 0 when its attention score is lower than the threshold (7 is chosen via grid search on the THUMOS’14 training set and also works well on ActivityNet); then we regard such a gated classification score as the activation, which ranges within [0, 1]. Finally, for each video, we obtain its CAS of shape (actions) by (snippets).
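The attention gating step that turns classification scores into the CAS can be sketched as follows. This is a minimal illustration, assuming the classification scores already lie in [0, 1] and have been averaged over the two streams; the function name `gated_cas` is hypothetical, and the default threshold of 7 is the value quoted in the text.

```python
import numpy as np

def gated_cas(class_scores, attention, threshold=7.0):
    """Zero out the class scores of snippets whose attention is below the threshold.

    class_scores: array of shape (K, T), per-class score for each snippet.
    attention   : array of shape (T,), attention score for each snippet.
    Returns the gated scores, shape (K, T), used as the CAS in this sketch.
    """
    class_scores = np.asarray(class_scores, dtype=float)
    attention = np.asarray(attention, dtype=float)
    keep = attention >= threshold            # True where the snippet is retained
    return class_scores * keep[np.newaxis, :]

scores = [[0.2, 0.9, 0.5],
          [0.8, 0.1, 0.5]]
cas = gated_cas(scores, attention=[3.0, 8.0, 7.0])
```

In this toy example only the first snippet falls below the threshold, so its scores for both classes are set to 0 while the other two snippets are left unchanged.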
4.3 Localization Branch
The goal of the localization branch is to learn a parametric model for predicting the segment boundary directly. Recent fully-supervised TAL methods [36, 73, 17, 66, 16] have shown the effectiveness of regressing anchors for direct boundary prediction: the anchor is a hypothesis of the possible segment; the predicted boundary is obtained by regressing (1) the center location and (2) the temporal length of the anchor segment; and a multi-anchor mechanism is used to cover possible segments of different temporal scales. Therefore, we design a localization network B that looks at each temporal position on the feature map and outputs the two boundary regression values needed for each anchor. Then we regress the anchors using these regression values to obtain the predicted action boundaries (inner boundaries) and inflate the inner boundaries to obtain the outer boundaries. Finally, based on the CAS, we introduce an OIC layer equipped with the OIC loss to generate the final segment predictions.
4.3.2 Network Architecture of the Localization Network B
Given an input video, its feature map of shape 2048 channels by snippets is fed into B. B first stacks 3 identical temporal convolutional layers, which slide convolutional filters over time. Each temporal convolutional layer has 128 filters, all of which have kernel size 3 in time with stride 1 and padding 1. Each temporal convolutional layer is followed by one Batch Normalization layer and one ReLU layer.
Finally, B adds one more temporal convolutional layer to output the boundary regression values. Filters in this layer have kernel size 3 in time with stride 1 and padding 1. Similar to YOLO [44, 43], the boundary predicted by B is designed to be class-agnostic. This allows us to learn a generic boundary predictor, which may be used for generating action proposals for unseen actions in the future. Consequently, the total number of filters in this layer is twice the number of anchor scales, because for each anchor B predicts two boundary regression values: (1) one indicating how to shift the center location of the anchor and (2) one indicating how to scale the length of the anchor.
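To make the layer shapes concrete, here is a framework-free NumPy sketch of a forward pass through a network shaped like B. It is only a sketch under stated assumptions: weights are random, Batch Normalization is omitted for brevity, and the output width of two values per anchor scale is inferred from the statement that B predicts two regression values for each anchor. The names `temporal_conv`, `init_localization_params`, and `localization_forward` are hypothetical.

```python
import numpy as np

def temporal_conv(x, weight, bias):
    """1-D convolution over time: kernel size 3, stride 1, zero padding 1.

    x: (C_in, T), weight: (C_out, C_in, 3), bias: (C_out,). Returns (C_out, T).
    """
    padded = np.pad(x, ((0, 0), (1, 1)))            # pad only the time axis
    num_steps = x.shape[1]
    out = np.empty((weight.shape[0], num_steps))
    for t in range(num_steps):
        window = padded[:, t:t + 3]                 # (C_in, 3) slice centered on t
        out[:, t] = np.tensordot(weight, window, axes=([1, 2], [0, 1])) + bias
    return out

def init_localization_params(num_anchor_scales, in_channels=2048, hidden=128, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    sizes = [(hidden, in_channels), (hidden, hidden), (hidden, hidden),
             (2 * num_anchor_scales, hidden)]       # last layer: 2 values per anchor scale
    return [(0.01 * rng.standard_normal((c_out, c_in, 3)), np.zeros(c_out))
            for c_out, c_in in sizes]

def localization_forward(x, params):
    h = x
    for weight, bias in params[:-1]:
        h = np.maximum(temporal_conv(h, weight, bias), 0.0)   # ReLU (BatchNorm omitted)
    weight, bias = params[-1]
    return temporal_conv(h, weight, bias)           # (2 * num_anchor_scales, T)

rng = np.random.default_rng(0)
features = rng.standard_normal((2048, 10))          # 10 snippets
params = init_localization_params(num_anchor_scales=6, rng=rng)
regression = localization_forward(features, params)
```

Because every layer uses kernel size 3 with stride 1 and padding 1, the temporal length is preserved, so each output column lines up with one input snippet.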
4.3.3 Details of the Boundary Transformation
Since each temporal position on the feature map and each temporal position on the CAS both correspond to the same location of an input snippet, we make boundary predictions at the snippet-level granularity. We outline the boundary prediction procedure in Fig. 3.
Anchor generation. At the temporal position on the feature map, we generate a hypothesized segment (anchor) of length . In practice, we use multi-scale anchors. We determine their scales according to the typical time duration range of segments in each specific dataset.
Boundary regression. As aforementioned, for each anchor at the temporal position , B predicts two boundary regression values and . We can obtain the predicted segment via regressing the center location and the temporal length . We denote the boundary of this predicted segment as the inner boundary, which can be computed by and . Furthermore, we clip the predicted boundary and to fit into the range of the whole video. More details about clipping can be found in the supplementary material.
Boundary inflation. A ground truth segment usually exhibits relatively higher activations on CAS within the inner area compared to the contextual area preceding and succeeding . Therefore, we inflate the inner boundary by a ratio to obtain the corresponding outer boundary and . As discussed in the supplementary material, setting to 0.25 is a good choice.
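The three steps above (anchor generation, boundary regression, boundary inflation) can be sketched as one function. The exact regression formulas are not preserved in the text, so this sketch assumes the parameterization common in anchor-based detectors: the center shift is scaled by the anchor length and the length is scaled exponentially. It also assumes that inflation adds a margin of ratio times the segment length on each side and that clipping happens last. All names are hypothetical.

```python
import numpy as np

def transform_boundaries(t, anchor_len, d_center, d_length, num_snippets,
                         inflate_ratio=0.25):
    """Turn one anchor at temporal position t into inner and outer boundaries.

    d_center, d_length: the two regression values predicted for this anchor.
    Returns (inner, outer), each a (start, end) pair in snippet units.
    """
    # Assumed parameterization, borrowed from anchor-based object detection.
    center = t + anchor_len * d_center
    length = anchor_len * np.exp(d_length)
    x1, x2 = center - length / 2.0, center + length / 2.0

    # Inflate the inner boundary on both sides (assumed: margin = ratio * length).
    X1, X2 = x1 - inflate_ratio * length, x2 + inflate_ratio * length

    # Clip everything to the extent of the video.
    lo, hi = 0.0, float(num_snippets - 1)
    inner = (float(np.clip(x1, lo, hi)), float(np.clip(x2, lo, hi)))
    outer = (float(np.clip(X1, lo, hi)), float(np.clip(X2, lo, hi)))
    return inner, outer

inner, outer = transform_boundaries(t=10, anchor_len=8, d_center=0.0, d_length=0.0,
                                    num_snippets=100)
clipped_inner, clipped_outer = transform_boundaries(t=1, anchor_len=8, d_center=0.0,
                                                    d_length=0.0, num_snippets=100)
```

With zero regression values the predicted segment is the anchor itself, here 8 snippets centered on position 10, and the outer boundary extends it by 2 snippets on each side.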
4.3.4 The OIC layer for Obtaining the Final Predictions
Finally, we introduce an OIC layer which uses the OIC loss to measure how likely each segment contains actions and then removes the segments that are not likely to contain actions. During testing, this OIC layer outputs a set of predicted segments. During training, this OIC layer further computes the total OIC loss and back-propagates the gradients to the underlying boundary prediction model.
Concretely, given an input video, the classification branch generates its CAS and the localization branch predicts the candidate class-agnostic segments. Note that since all temporal convolutional layers in B slide over time with stride 1, the set of segments predicted at each temporal position on the feature map and the activations at each temporal position on the CAS are paired, corresponding to the same input snippet. Thus at the temporal position of each snippet, B has predicted class-agnostic anchor segments. Then for each action, we iteratively go through the following steps on the CAS to obtain the final class-specific segment predictions. Note that during training we consider only the ground truth actions while during testing we consider all actions. If a temporal position has the activation lower than 0.1 on the CAS, we discard all the predictions corresponding to this temporal position. For each of the remaining positions, among its anchor segment predictions, we only keep the one with the lowest OIC loss which means selecting the anchor of the most likely scale. Finally, for all the kept segment predictions, we remove the segment predictions with the OIC loss higher than -0.3. We perform Non-Maximum Suppression (NMS) over all segment predictions with overlap IoU threshold 0.4. All these thresholds are chosen by grid search on the THUMOS’14 training set and also work well on ActivityNet. Alg. 1 in the supplementary material summarizes the above steps.
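The selection steps above can be sketched for a single action class as follows. This is a plain-Python illustration rather than the authors' Alg. 1: the data layout (`segments[t][m]`, `losses[t][m]`) and the function names are assumptions, while the thresholds 0.1, -0.3, and 0.4 are the values quoted in the text. NMS here ranks segments by ascending OIC loss.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def candidates_for_class(cas_row, segments, losses, min_act=0.1):
    """Keep, for each sufficiently active position, the anchor with the lowest OIC loss.

    cas_row       : activations of one class, one value per temporal position.
    segments[t][m]: (start, end) predicted by anchor m at position t.
    losses[t][m]  : OIC loss of that prediction for this class.
    """
    kept = []
    for t, activation in enumerate(cas_row):
        if activation < min_act:                 # discard low-activation positions
            continue
        m = min(range(len(losses[t])), key=lambda j: losses[t][j])
        kept.append((segments[t][m][0], segments[t][m][1], losses[t][m]))
    return kept

def select_segments(candidates, max_oic=-0.3, nms_iou=0.4):
    """Drop segments with OIC loss above max_oic, then apply greedy temporal NMS."""
    kept = sorted((c for c in candidates if c[2] <= max_oic), key=lambda c: c[2])
    final = []
    for c in kept:                               # lowest loss first
        if all(temporal_iou(c[:2], f[:2]) <= nms_iou for f in final):
            final.append(c)
    return final

cas_row = [0.05, 0.9, 0.8]
segments = [[(0, 1), (0, 2)], [(1, 3), (0, 4)], [(1.2, 3.1), (2, 6)]]
losses = [[-0.9, -0.8], [-0.6, -0.2], [-0.5, -0.1]]
final = select_segments(candidates_for_class(cas_row, segments, losses))
```

In this toy example the first position is discarded for low activation, and the two remaining candidates overlap heavily, so NMS keeps only the one with the lower OIC loss.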
During training, the total loss is the sum of the OIC losses of all kept segment predictions. We can compute the gradients triggered by each kept segment prediction according to Sec. 3.2 and then accumulate them to update the underlying boundary predictor B. During testing, all the kept segment predictions are output as our final segment predictions. Each segment prediction consists of (1) the predicted action class, (2) the confidence score, which is set to 1 minus its OIC loss, and (3) the start time and the end time obtained by converting the inner boundary from the snippet-level granularity (the continuous value before rounding to its nearest integer) to time.
In this section, we first introduce two standard benchmarks and the corresponding evaluation metrics. Note that during training, we only use the video-level labels; during testing, we use the ground truth segments with boundary annotations for evaluating the performance of temporal action localization. We compare our method with the state-of-the-art methods and then conduct some ablation studies to investigate different variants of our method.
5.1 Datasets and Evaluation
5.1.1 THUMOS’14
The temporal action localization task in THUMOS’14 contains 20 actions. Its validation set has 200 untrimmed videos. Each video contains at least one action. We use these 200 videos in the validation set for training. The trained model is tested on the test set which contains 213 videos.
5.1.2 ActivityNet v1.2 
To facilitate comparisons, we follow Wang et al.  to use the ActivityNet release version 1.2 which covers 100 activity classes. The training set has 4,819 videos and the validation set has 2,383 videos. We train on the training set and test on the validation set.
5.1.3 Evaluation Metrics
Given the testing videos, the system outputs a ranked list of action segment predictions. Each prediction contains the action class, the start time and the end time, and the confidence score. We follow the conventions [29, 1] to evaluate mean Average Precision (mAP). Each prediction is regarded as correct only when (1) the predicted class is correct and (2) its temporal overlap IoU with the ground truth segment exceeds the evaluation threshold. We do not allow duplicate detections for the same ground truth segment.
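The correctness criterion above can be sketched as a greedy matching routine. This is one common way to implement such a rule and is not taken from the official evaluation toolkits, whose tie-breaking and duplicate handling may differ in detail; the function names and tuple layouts are assumptions.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def mark_true_positives(predictions, ground_truths, iou_threshold=0.5):
    """Label each prediction of one video as correct or not, in descending score order.

    predictions  : list of (start, end, label, score).
    ground_truths: list of (start, end, label).
    A ground truth segment can be matched at most once (no duplicate detections).
    """
    matched = set()
    flags = []
    for start, end, label, _ in sorted(predictions, key=lambda p: -p[3]):
        best_iou, best_idx = 0.0, None
        for idx, (gt_start, gt_end, gt_label) in enumerate(ground_truths):
            if gt_label != label or idx in matched:
                continue
            iou = temporal_iou((start, end), (gt_start, gt_end))
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou > iou_threshold:
            matched.add(best_idx)
            flags.append(True)
        else:
            flags.append(False)
    return flags

ground_truths = [(0, 10, "GolfSwing")]
predictions = [(0, 9, "GolfSwing", 0.9),    # overlaps well, highest score
               (1, 10, "GolfSwing", 0.8),   # duplicate of the same ground truth
               (0, 10, "PoleVault", 0.7)]   # wrong class
flags = mark_true_positives(predictions, ground_truths)
```

Average Precision for each class is then computed from these flags over the ranked list, and mAP averages it over classes.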
5.2 Implementation Details
We implement AutoLoc using Caffe. We use the stochastic gradient descent algorithm to train AutoLoc. Through experimental studies, we find that the training process converges quickly on both THUMOS’14 and ActivityNet, after 1 training epoch. Following Faster R-CNN, each mini-batch processes one whole untrimmed video. The learning rate is initially set to 0.001 and is reduced by one order of magnitude every 200 iterations on THUMOS’14 and every 500 iterations on ActivityNet. We set the weight decay to 0.0005. We choose anchors of snippet-level length 1, 2, 4, 8, 16, 32 for THUMOS’14 and 16, 32, 64, 128, 256, 512 for ActivityNet. We use CUDA 8.0 and cuDNN v5. We use one single NVIDIA GeForce GTX TITAN X GPU.
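The step-wise learning rate schedule described above can be written as a one-line function. This is a sketch of the stated schedule only; the function name is hypothetical, and whether the authors configured it through a particular solver policy is not stated in the text.

```python
def learning_rate(iteration, base_lr=0.001, step=200, gamma=0.1):
    """Learning rate after `iteration` updates: divided by 10 every `step` iterations."""
    return base_lr * gamma ** (iteration // step)

# THUMOS'14 setting from the text: step of 200 iterations.
thumos_schedule = [learning_rate(i) for i in (0, 199, 200, 400)]
# ActivityNet setting from the text: step of 500 iterations.
activitynet_lr = learning_rate(499, step=500)
```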
Table 1: mAP (%) on THUMOS’14 at IoU thresholds from 0.3 to 0.7.
| Supervision | Method | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
|---|---|---|---|---|---|---|
| Full | Karaman et al. | 0.5 | 0.3 | 0.2 | 0.2 | 0.1 |
| Full | Wang et al. | 14.6 | 12.1 | 8.5 | 4.7 | 1.5 |
| Full | Heilbron et al. | - | - | 13.5 | - | - |
| Full | Escorcia et al. | - | - | 13.9 | - | - |
| Full | Oneata et al. | 28.8 | 21.8 | 15.0 | 8.5 | 3.2 |
| Full | Richard and Gall | 30.0 | 23.2 | 15.2 | - | - |
| Full | Yeung et al. | 36.0 | 26.4 | 17.1 | - | - |
| Full | Yuan et al. | 33.6 | 26.1 | 18.8 | - | - |
| Full | Yuan et al. | 36.5 | 27.8 | 17.8 | - | - |
| Full | Dai et al. | - | 33.3 | 25.6 | 15.9 | 9.0 |
| Full | TURN TAP | 44.1 | 34.9 | 25.6 | - | - |
| Full | Gao et al. | 50.1 | 41.3 | 31.0 | 19.1 | 9.9 |
| Weak | Sun et al. | 8.5 | 5.2 | 4.4 | - | - |
| Weak | Wang et al. | 28.2 | 21.1 | 13.7 | - | - |
| Weak | Ours - AutoLoc | 35.8 | 29.0 | 21.2 | 13.4 | 5.8 |
5.3 Comparisons with the State-of-the-art
The results on THUMOS’14 are shown in Table 1. Our method significantly outperforms the state-of-the-art weakly-supervised TAL methods that are trained with the video-level labels only. Although the recent weakly-supervised TAL methods (i.e. Hide-and-Seek and Wang et al.) can generate reasonably good CAS, they perform TAL by applying simple thresholding on the CAS, which might not be robust to noise in the CAS. Our method directly predicts the segment boundary with the contextual information taken into account. Our method can even achieve better or comparable results to some fully-supervised methods (e.g. S-CNN) that are trained with the segment-level boundary annotations. The results of SSN correspond to the model with the same backbone network architecture as ours.
The results on ActivityNet v1.2 are shown in Table 2, where our method again achieves substantial improvements. Wang et al. did not report temporal localization results on ActivityNet in their paper, but their trained models and source code have been released publicly, and thus we can evaluate their results on ActivityNet as well.
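Table 2: mAP (%) on ActivityNet v1.2 at IoU thresholds from 0.5 to 0.95, and their average.
| Supervision | Method | 0.5 | 0.55 | 0.6 | 0.65 | 0.7 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weak | Wang et al. | 7.4 | 6.1 | 5.2 | 4.5 | 3.9 | 3.2 | 2.5 | 1.8 | 1.2 | 0.7 | 3.6 |
| Weak | Ours - AutoLoc | 27.3 | 24.9 | 22.5 | 19.9 | 17.5 | 15.1 | 13.0 | 10.0 | 6.8 | 3.3 | 16.0 |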
In this Section, we address several questions quantitatively to analyze our model.
5.4.1 Q1: How Effective is the Proposed OIC Loss?
In order to evaluate the effectiveness of the proposed OIC loss, we enumerate all candidate segments at the snippet-level granularity (for example, a segment starting at the location of the 2-nd snippet and ending at the location of the 6-th snippet). We leverage the OIC loss to measure how likely each segment is to contain actions and then select the most likely ones. Concretely, for each segment, we compute its OIC loss for each action. Then we follow Sec. 4.3.4 to remove segments with high OIC loss and remove duplicate predictions via NMS. We denote this approach as OIC Selection. As shown in Table 3, although not as good as AutoLoc, OIC Selection still significantly improves over the state-of-the-art results. This is because the OIC loss explicitly favors segments that have high activations inside and low activations outside, and such a segment of low OIC loss is usually well aligned with the ground truth segment. This confirms the effectiveness of the proposed OIC loss.
5.4.2 Q2: How Important is Looking into the Contrast Between the Inner Area and the Outer Area?
The core idea of the OIC loss is encouraging high activations in the inner area while penalizing high activations in the outer area. We consider another variant that can also discover the segment-level supervision but does not model the contrast between inner and outer. Specifically, we change the OIC loss in AutoLoc to the Inner Only Loss, which only encourages high activations inside the segment but does not look into the contextual area. The detailed formulation of how to compute the loss and gradients in the Inner Only Loss can be found in the supplementary material. As shown in Table 3, the performance drops considerably: mAP at IoU threshold 0.5 on ActivityNet falls from 27.3% to 4.6%. Consequently, when designing the loss for training the boundary predictor, it is very important and effective to take into account the contrast between the inner area and the outer area.
Notably, the idea of looking into the contrast between inner and outer is related to the usage of Laplacian of Gaussian (LoG) filter for blob detection . The operation of computing the OIC loss is effectively convolving the CAS with a step function as shown in Fig. 4, which can be regarded as a variant of the LoG filter for the sake of easing the network training. As proven in the supplementary material, the integral of the LoG filter and the integral of the step function are both zero on the range . Further, we approach the scale selection in blob detection by the multi-anchor mechanism and the boundary regression method. Despite the simplicity of the OIC loss, it turns out to be quite effective in practice for localizing likely action segments.
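The filtering view described above can be checked numerically with a small NumPy sketch. It assumes a fixed segment shape (an inner part of `inner_len` snippets flanked by `margin` snippets on each side), for which the OIC loss at every placement equals the cross-correlation of the CAS with a zero-sum step kernel. The name `oic_step_kernel` and the toy CAS are illustrative.

```python
import numpy as np

def oic_step_kernel(inner_len, margin):
    """Step filter: positive weights on the two outer margins, negative inside.

    The outer weights average the 2 * margin outer snippets and the inner weights
    average the inner snippets with a minus sign, so the kernel sums to zero.
    """
    outer = np.full(margin, 1.0 / (2 * margin))
    inner = np.full(inner_len, -1.0 / inner_len)
    return np.concatenate([outer, inner, outer])

kernel = oic_step_kernel(inner_len=4, margin=1)
cas = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0])

# response[s] is the OIC loss of the segment whose outer boundary starts at snippet s.
response = np.correlate(cas, kernel, mode="valid")
best_start = int(response.argmin())
```

The response is lowest where the inner part of the kernel covers the run of high activations, analogous to how a blob detector responds most strongly at a matching location and scale.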
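Table 3: mAP (%) on ActivityNet v1.2 at IoU thresholds from 0.5 to 0.95, and their average, for the variants discussed in this section.
| Method | 0.5 | 0.55 | 0.6 | 0.65 | 0.7 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Wang et al. | 7.4 | 6.1 | 5.2 | 4.5 | 3.9 | 3.2 | 2.5 | 1.8 | 1.2 | 0.7 | 3.6 |
| Ours - AutoLoc | 27.3 | 24.9 | 22.5 | 19.9 | 17.5 | 15.1 | 13.0 | 10.0 | 6.8 | 3.3 | 16.0 |
| Q1: OIC Selection | 15.8 | 13.7 | 11.9 | 10.3 | 8.8 | 7.5 | 6.4 | 5.1 | 3.6 | 2.2 | 8.5 |
| Q2: Inner Only Loss | 4.6 | 3.7 | 2.7 | 1.9 | 1.3 | 0.9 | 0.5 | 0.2 | 0.1 | 0.0 | 1.6 |
| Q3: Direct Optimization | 21.8 | 19.6 | 17.8 | 15.8 | 13.8 | 11.7 | 9.8 | 7.8 | 5.5 | 2.7 | 12.6 |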
5.4.3 Q3: What is the Advantage of Learning a Model on the Training Videos Compared to Directly Optimizing the Boundaries on the Testing Videos?
AutoLoc trains a model on the training videos and then applies the trained model to perform inference on the testing videos. Alternatively, without training the boundary predictor B on the training videos, we can directly train/optimize B from scratch on each testing video individually: we follow the testing pipeline described in Sec. 4.3.4 while also conducting back-propagation to update B, iteratively finding likely segments on each testing video. We refer to this approach as Direct Optimization. As shown in Table 3, its performance is reasonably good, which again confirms the effectiveness of the OIC loss, but it is still not as good as AutoLoc. The reason is that Direct Optimization optimizes the predicted boundaries according to the testing video’s CAS, which may not be very accurate. Eventually Direct Optimization overfits such an inaccurate CAS and thus produces imperfect boundary predictions. In AutoLoc, B has been trained on multiple training videos and thus is robust to noise in the CAS. Consequently, AutoLoc may still predict good boundaries even when the testing video’s CAS is not perfect. Furthermore, Direct Optimization requires optimizing the boundary predictions on the testing video until convergence, and thus its testing speed is much slower than that of AutoLoc. For example, on ActivityNet, Direct Optimization converges after 25 training iterations (25 forward passes and 25 backward passes). In contrast, AutoLoc directly applies the trained model to do inference on the testing video and thus requires only one forward pass during testing.
6 Conclusion and Future Works
In this paper, we have presented a novel weakly-supervised TAL framework that directly predicts temporal boundaries in a single-shot fashion, and we have proposed a novel OIC loss to provide the needed segment-level supervision. In the future, it would be interesting to extend AutoLoc to object detection in images.
We appreciate the support from Mitsubishi Electric for this project.
8 Supplementary Materials
8.1 Visualization Examples
In this section, we show two sets of experimental testing results on the THUMOS’14 test set. We can observe that simple thresholding sometimes over-segments a whole action instance into two segments and sometimes mistakenly merges two consecutive action instances into one segment. In contrast, our AutoLoc method is robust to noise in the CAS and can localize the action instances correctly.
As shown in Fig. 5, the activations of being PoleVault in the first half of the ground truth segment are low because the camera does not capture the full human body. In this case, the threshold used in the simple thresholding method is relatively high, so it cuts the whole segment into two segments. However, our AutoLoc model has been trained on multiple training videos to predict at the segment level. Thus, AutoLoc is robust to such noise in the CAS and can predict the correct segment as a whole.
Although the threshold used in the simple thresholding method is relatively high for the example shown in Fig. 5, the same threshold is actually low in other cases such as the one shown in Fig. 6. In Fig. 6, two GolfSwing instances happen consecutively. After the first instance ends and before the second instance starts, the scene of the first instance fades into the scene of the second instance, resulting in high activations of being GolfSwing within this transition interval. The thresholding method mistakenly detects these two instances as one whole segment. In contrast, our AutoLoc model correctly detects the ending of the first instance and the beginning of the second instance and thus localizes these two segments separately. Note that the ending boundary of the first instance predicted by AutoLoc is not well aligned with the annotated ground truth. In the video content between the ground truth ending boundary and our detected ending boundary, the person keeps still with his arms lifted. It is therefore ambiguous when precisely the first GolfSwing instance ends, and the ending boundary predicted by AutoLoc should also be acceptable.
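The two failure modes of simple thresholding described above can be reproduced on toy data. The following is a minimal sketch, not the thresholding baseline evaluated in the paper: the function name `threshold_segments`, the toy CAS values, and the threshold of 0.5 are all assumptions chosen for illustration.

```python
def threshold_segments(cas, threshold):
    """Return inclusive (start, end) snippet indices of runs where cas >= threshold."""
    segments = []
    start = None
    for t, score in enumerate(cas):
        if score >= threshold and start is None:
            start = t  # a new run begins at this snippet
        elif score < threshold and start is not None:
            segments.append((start, t - 1))  # the run ended at the previous snippet
            start = None
    if start is not None:
        segments.append((start, len(cas) - 1))  # the run reaches the end of the video
    return segments


# Hypothetical CAS with a dip inside one action instance (cf. the PoleVault case).
dip_cas = [0.1, 0.6, 0.4, 0.9, 0.9, 0.1]
# Hypothetical CAS with high activations between two instances (cf. the GolfSwing case).
bridge_cas = [0.1, 0.9, 0.6, 0.9, 0.1]

over_segmented = threshold_segments(dip_cas, 0.5)   # one instance split in two
merged = threshold_segments(bridge_cas, 0.5)        # two instances fused into one
```

With the same threshold, the first toy sequence is cut into two segments while the second is returned as a single segment, which mirrors why no single threshold suits both examples.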
8.2 Detailed Derivation of the OIC Back-propagation
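Here we present the detailed derivation of the gradients of the OIC loss used during back-propagation.
Each predicted segment $\phi$ consists of the action/inner boundary $[x_1, x_2]$ and the inflated outer boundary $[X_1, X_2]$. These boundaries are at the snippet-level granularity (for example, boundary $x$ corresponds to the location of the $x$-th snippet). We denote the class activation at the $x$-th snippet on the CAS of being the action $k$ as $f_k(x)$. The OIC loss of the prediction $\phi$ (i.e. $\mathcal{L}_{OIC}(\phi)$) is defined as the average activation in the outer area, $A_o(\phi)$, minus the average activation in the inner area, $A_i(\phi)$, as follows:
$$\mathcal{L}_{OIC}(\phi) = A_o(\phi) - A_i(\phi) = \frac{\int_{X_1}^{X_2} f_k(u)\,du - \int_{x_1}^{x_2} f_k(u)\,du}{(X_2 - X_1 + 1) - (x_2 - x_1 + 1)} - \frac{\int_{x_1}^{x_2} f_k(u)\,du}{x_2 - x_1 + 1}$$
The gradient corresponding to the predicted segment $\phi$ w.r.t. the left outer boundary $X_1$ of $\phi$ is as follows:
$$\frac{\partial \mathcal{L}_{OIC}(\phi)}{\partial X_1} = \frac{A_o(\phi) - f_k(X_1)}{(X_2 - X_1 + 1) - (x_2 - x_1 + 1)}$$
Likewise, the gradient corresponding to the predicted segment $\phi$ w.r.t. the right outer boundary $X_2$ of $\phi$ can be computed as follows:
$$\frac{\partial \mathcal{L}_{OIC}(\phi)}{\partial X_2} = \frac{f_k(X_2) - A_o(\phi)}{(X_2 - X_1 + 1) - (x_2 - x_1 + 1)}$$
The gradient corresponding to the predicted segment $\phi$ w.r.t. the left inner boundary $x_1$ of $\phi$ is as follows:
$$\frac{\partial \mathcal{L}_{OIC}(\phi)}{\partial x_1} = \frac{f_k(x_1) - A_o(\phi)}{(X_2 - X_1 + 1) - (x_2 - x_1 + 1)} + \frac{f_k(x_1) - A_i(\phi)}{x_2 - x_1 + 1}$$
Likewise, the gradient corresponding to the predicted segment $\phi$ w.r.t. the right inner boundary $x_2$ of $\phi$ can be computed as follows:
$$\frac{\partial \mathcal{L}_{OIC}(\phi)}{\partial x_2} = \frac{A_o(\phi) - f_k(x_2)}{(X_2 - X_1 + 1) - (x_2 - x_1 + 1)} + \frac{A_i(\phi) - f_k(x_2)}{x_2 - x_1 + 1}$$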
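The loss and its four boundary gradients can be evaluated numerically on a toy CAS. This is a sketch under stated assumptions rather than the authors' OIC layer: boundaries are taken to be already rounded to integer snippet indices (0-based here, to match Python lists), each area includes its end snippets, and the gradients use the closed forms obtained by differentiating the outer and inner averages with respect to each boundary. The function name `oic_loss_and_grads` and the toy values are hypothetical.

```python
def oic_loss_and_grads(cas, inner, outer):
    """Outer-minus-inner average activation and its gradients w.r.t. the four boundaries."""
    x1, x2 = inner
    X1, X2 = outer
    inner_len = x2 - x1 + 1
    outer_len = (X2 - X1 + 1) - inner_len  # snippets in the outer area only
    inner_sum = sum(cas[x1:x2 + 1])
    outer_sum = sum(cas[X1:X2 + 1]) - inner_sum
    a_in = inner_sum / inner_len    # average activation in the inner area
    a_out = outer_sum / outer_len   # average activation in the outer area
    loss = a_out - a_in
    grads = {
        "X1": (a_out - cas[X1]) / outer_len,
        "X2": (cas[X2] - a_out) / outer_len,
        "x1": (cas[x1] - a_out) / outer_len + (cas[x1] - a_in) / inner_len,
        "x2": (a_out - cas[x2]) / outer_len + (a_in - cas[x2]) / inner_len,
    }
    return loss, grads


# Hypothetical CAS with a single peak; inner area = snippets 2..4, outer = 1..5.
toy_cas = [0.0, 0.2, 0.8, 1.0, 0.8, 0.2, 0.0]
loss, grads = oic_loss_and_grads(toy_cas, inner=(2, 4), outer=(1, 5))
```

On this symmetric example the loss is negative (the inner area is more active than its context), the two outer gradients vanish, and the two inner gradients are equal in magnitude with opposite signs.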
8.3 The Connection of the LoG Filter and Our OIC Loss
In Sec. 5.4 of the paper, we have discussed how our OIC loss relates to the Laplacian of Gaussian (LoG) filter for blob detection. When we convolve the LoG filter over a signal, the LoG filter achieves the response of maximum magnitude at the center of the target blob, provided the scale of the Laplacian is matched with the scale of the blob.
The operation of computing the OIC loss is effectively convolving the CAS with a step function, as shown in Fig. 5 of the paper. This step function implied by our OIC loss can be regarded as a variant of the LoG filter that eases the network training. The optimum (minimum OIC loss) is achieved at the center of a segment whose activations in the inner area are relatively high and whose activations in the outer area are relatively low. Despite its simplicity, the OIC loss turns out to be quite effective in practice for localizing likely action segments.
In addition, the integral of the LoG filter over the range $(-\infty, +\infty)$ is 0. The step function implied by our OIC loss exhibits the same characteristic. Recall that the OIC loss is defined as the average activation in the outer area minus the average activation in the inner area. Therefore, the value of this step function is $1/N_o$ in the outer area and $-1/N_i$ in the inner area, where $N_o$ and $N_i$ denote the lengths of the outer area and the inner area respectively, and its value elsewhere is 0. Consequently, the integral over the outer area is 1, the integral over the inner area is $-1$, and the total integral of this step function over $(-\infty, +\infty)$ is 0 as well.
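The zero-integral property can be checked directly on a discretized version of the step function. The sketch below assumes integer, 0-based snippet indices and areas that include their end snippets; `oic_step_kernel` and the toy values are hypothetical names and data introduced only for this illustration.

```python
def oic_step_kernel(num_snippets, inner, outer):
    """Discrete step function whose dot product with the CAS gives the OIC loss."""
    x1, x2 = inner
    X1, X2 = outer
    inner_len = x2 - x1 + 1
    outer_len = (X2 - X1 + 1) - inner_len
    kernel = [0.0] * num_snippets  # zero outside the outer boundary
    for t in range(X1, X2 + 1):
        if x1 <= t <= x2:
            kernel[t] = -1.0 / inner_len  # inner area: negative weight
        else:
            kernel[t] = 1.0 / outer_len   # outer area: positive weight
    return kernel


toy_cas = [0.0, 0.2, 0.8, 1.0, 0.8, 0.2, 0.0]
kernel = oic_step_kernel(len(toy_cas), inner=(2, 4), outer=(1, 5))

outer_mass = sum(w for w in kernel if w > 0)   # weights over the outer area
inner_mass = sum(w for w in kernel if w < 0)   # weights over the inner area
response = sum(w * a for w, a in zip(kernel, toy_cas))  # equals outer avg minus inner avg
```

The positive and negative weights cancel, so a constant CAS produces a zero response regardless of its level, as with the LoG filter.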
8.4 Details of Boundary Transformation
8.4.1 Details of Clipping
As mentioned in Sec. 4.3 of the paper, when the predicted inner boundaries $x_1$ and $x_2$ exceed the range of the whole video, we clip $x_1$ and $x_2$ to fit into the range of the whole video. In the example shown in Fig. 7, A is the predicted inner boundary.
C is obtained by clipping A to fit into $[1, T]$ ($T$ is the total number of snippets). According to Equation 2 in the paper, using this clipping method, the gradient w.r.t. the inner left boundary is
$$\frac{\partial \mathcal{L}_{OIC}(\phi)}{\partial x_1} = \frac{f_k(1) - A_o(\phi)}{(X_2 - X_1 + 1) - (x_2 - x_1 + 1)} + \frac{f_k(1) - A_i(\phi)}{x_2 - x_1 + 1},$$
where $f_k(1)$ is the activation of action $k$ at the first snippet, $[X_1, X_2]$ is the outer boundary, and $A_o(\phi)$ and $A_i(\phi)$ are the average activations in the outer area and the inner area respectively. This gradient can be positive, in which case the update moves $x_1$ left. However, we would actually like to move $x_1$ right through optimization. The gradient can be positive because, in our implementation, a segment with boundary $[x_1, x_2]$ includes the snippets at $x_1$ and $x_2$, and the activations at the first and the last snippets of the whole video may not be zero.
Therefore, we conduct clipping with zero-padding to obtain B: we pad one snippet of zero activation at both ends of the whole video and then clip A to fit into $[0, T+1]$. Using this clipping method, the gradient w.r.t. the inner left boundary is
$$\frac{\partial \mathcal{L}_{OIC}(\phi)}{\partial x_1} = \frac{-A_o(\phi)}{(X_2 - X_1 + 1) - (x_2 - x_1 + 1)} + \frac{-A_i(\phi)}{x_2 - x_1 + 1},$$
which is always non-positive. Therefore, even though the predicted boundary might sometimes exceed the range of the whole video, our clipping method allows the boundary prediction model to move back and explore potential boundary locations inside the video.
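The effect of the two clipping schemes on the sign of the inner-left gradient can be illustrated with a few numbers. This is a sketch under assumptions, not the paper's implementation: snippets are indexed 1..T with padding at 0 and T+1, the gradient has the form "(boundary activation minus outer average) over outer length plus (boundary activation minus inner average) over inner length", and the averages and lengths below are made-up illustrative values rather than quantities computed from a real segment. All function names are hypothetical.

```python
def pad_cas(cas):
    """Add one zero-activation snippet at each end; real snippets sit at indices 1..T."""
    return [0.0] + list(cas) + [0.0]


def clip_plain(boundary, num_snippets):
    """Clip a boundary into [1, T]."""
    return min(max(boundary, 1), num_snippets)


def clip_zero_padded(boundary, num_snippets):
    """Clip a boundary into [0, T + 1], i.e. onto the padded sequence."""
    return min(max(boundary, 0), num_snippets + 1)


def grad_inner_left(f_x1, a_out, a_in, outer_len, inner_len):
    """Assumed form of the OIC gradient w.r.t. the left inner boundary."""
    return (f_x1 - a_out) / outer_len + (f_x1 - a_in) / inner_len


cas = [0.9, 0.8, 0.3, 0.1]   # hypothetical CAS; the first snippet is already active
padded = pad_cas(cas)
T = len(cas)
predicted_x1 = -2            # prediction that falls before the start of the video

x1_plain = clip_plain(predicted_x1, T)          # lands on the first real snippet
x1_padded = clip_zero_padded(predicted_x1, T)   # lands on the zero padding

# Illustrative (assumed) averages and lengths for the current prediction.
a_out, a_in, outer_len, inner_len = 0.2, 0.6, 2, 3
grad_plain = grad_inner_left(padded[x1_plain], a_out, a_in, outer_len, inner_len)
grad_padded = grad_inner_left(padded[x1_padded], a_out, a_in, outer_len, inner_len)
```

With plain clipping the gradient is positive here because the first snippet's activation exceeds both averages, whereas with zero-padding the boundary activation is 0 and the gradient cannot be positive as long as the averages are non-negative.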
8.4.2 Details of Inflation
As presented in Sec. 4.3 of the paper, in order to take the contextual information into account, we inflate the inner boundary $x_1$ and $x_2$ to obtain the outer boundary $X_1$ and $X_2$. However, when the predicted segment is very short, $X_1$ would be very close to $x_1$ and $X_2$ would be very close to $x_2$. As mentioned in Sec. 3.1 of the paper, when fetching the corresponding activation on the CAS, we round each boundary to its nearest integer (i.e. the location of the nearest snippet). If the inner boundary and the outer boundary are very close, they might be rounded to the same location on the CAS, in which case there is no outer area. In order to keep a minimum outer area on the CAS so that the model can look into the contextual information, we force $X_1$ to be not larger than $x_1 - 1$ and force $X_2$ to be not smaller than $x_2 + 1$.
In the example shown in Fig. 8, the inner boundary and the outer boundary are very close. In the OIC layer, these boundaries are rounded to the same location. In order to look into the contrast between the inner area and the outer area, we make sure the outer boundary is extended by at least a certain minimum offset (i.e. 1 snippet).
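The minimum-offset rule can be sketched in a few lines. The code below is illustrative only: it assumes that inflation extends each side by a fixed ratio of the inner segment length (the ratio of 0.25 and this exact inflation formula are assumptions), and it uses Python's built-in `round`, which is adequate for the non-tie values in this example. The function name `inflate` is hypothetical.

```python
def inflate(inner, ratio=0.25, min_offset=1.0):
    """Inflate an inner boundary, keeping the outer boundary at least min_offset away."""
    x1, x2 = inner
    pad = ratio * (x2 - x1)              # assumed: extend by a ratio of the inner length
    X1 = min(x1 - pad, x1 - min_offset)  # outer left is at most x1 - min_offset
    X2 = max(x2 + pad, x2 + min_offset)  # outer right is at least x2 + min_offset
    return X1, X2


short_inner = (10.2, 10.6)               # hypothetical very short prediction

# Without the minimum offset, the outer boundary rounds onto the inner boundary.
naive_outer = inflate(short_inner, min_offset=0.0)
rounded_inner = tuple(round(b) for b in short_inner)
rounded_naive = tuple(round(b) for b in naive_outer)

# With the minimum offset of one snippet, an outer area survives rounding.
safe_outer = inflate(short_inner, min_offset=1.0)
rounded_safe = tuple(round(b) for b in safe_outer)

# For a long segment the ratio-based inflation already exceeds the minimum offset.
long_outer = inflate((10.0, 30.0))
```

For the short segment, the naive outer boundary collapses onto the inner one after rounding, while the offset version leaves one context snippet on each side.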
8.5 Details of the OIC Layer
Alg. 1 summarizes the steps that determine whether to keep each segment prediction in the OIC layer during training.
8.6.1 Exploration Study
In order to determine a few hyper-parameters in AutoLoc, we first hypothesize a reasonable range for each and then explore it quantitatively via grid search on the training data. For example, the ratio $\alpha$ is used to inflate the inner boundary to obtain the outer boundary. $\alpha$ should not be too large (larger than 1/2), which may include too much irrelevant area, and should not be too small (smaller than 1/8), so that sufficient contextual area preceding and succeeding the predicted action boundary (i.e. inner boundary) is taken into account. Therefore, we vary $\alpha$ within a reasonable range from 1/2 to 1/8 on the training videos of THUMOS’14 and ActivityNet respectively. As shown in Table 4 and Table 5, within this range the results are all acceptable and comparable on both datasets. $\alpha = 1/4$ is slightly better, and thus we set $\alpha$ to 1/4 for all the experiments in the paper.
8.6.2 Formulation of the Inner Only Loss
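In Sec. 5.4 of the paper, we have investigated changing the OIC loss in AutoLoc to the Inner Only Loss, which only encourages high activations inside the segment and does not look into the contextual area. Denoting the inner boundary of a predicted segment $\phi$ as $[x_1, x_2]$ and the class activation of action $k$ at the $x$-th snippet as $f_k(x)$, the Inner Only Loss is defined as the negative average activation in the inner area, $A_i(\phi)$:
$$\mathcal{L}_{IO}(\phi) = -A_i(\phi) = -\frac{\int_{x_1}^{x_2} f_k(u)\,du}{x_2 - x_1 + 1}$$
Since the Inner Only Loss only considers the inner boundary, we only need to back-propagate the gradients w.r.t. the inner boundary, which can be computed as follows:
$$\frac{\partial \mathcal{L}_{IO}(\phi)}{\partial x_1} = \frac{f_k(x_1) - A_i(\phi)}{x_2 - x_1 + 1}, \qquad \frac{\partial \mathcal{L}_{IO}(\phi)}{\partial x_2} = \frac{A_i(\phi) - f_k(x_2)}{x_2 - x_1 + 1}$$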
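A small numerical sketch makes the difference from the OIC loss concrete. It assumes that the Inner Only Loss is the negative average activation inside the inner boundary (an assumption consistent with "only encourages the high activations inside the segment"), with integer 0-based snippet indices and inclusive ends; `inner_only_loss_and_grads` and the toy CAS are hypothetical.

```python
def inner_only_loss_and_grads(cas, inner):
    """Negative inner average activation and its gradients w.r.t. the inner boundary."""
    x1, x2 = inner
    inner_len = x2 - x1 + 1
    a_in = sum(cas[x1:x2 + 1]) / inner_len  # average activation in the inner area
    loss = -a_in
    grads = {
        "x1": (cas[x1] - a_in) / inner_len,
        "x2": (a_in - cas[x2]) / inner_len,
    }
    return loss, grads


toy_cas = [0.0, 0.2, 0.8, 1.0, 0.8, 0.2, 0.0]
wide_loss, wide_grads = inner_only_loss_and_grads(toy_cas, inner=(2, 4))
peak_loss, peak_grads = inner_only_loss_and_grads(toy_cas, inner=(3, 3))
```

On this toy sequence the single-snippet segment at the peak attains a lower loss than the wider segment, which illustrates how a loss without contextual contrast can favor shrinking onto the highest activation.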
8.6.3 The Advantage of Combining Anchor Mechanism and Boundary Regression Compared to Enumeration with OIC Selection
As shown in Table 3, even without training on the training videos, Direct Optimization outperforms OIC Selection. This indicates the advantage of combining the anchor mechanism and boundary regression over performing enumeration with OIC selection.
Direct Optimization and OIC Selection both generate a set of candidate segments first and then leverage the OIC loss to obtain the final predictions. OIC Selection enumerates a large set of possible segments and uses the OIC loss to select the likely ones, but it sometimes keeps quite a few false alarms. In contrast, Direct Optimization generates a small set of anchors at each position on the CAS, selects the one with the lowest OIC loss, which has the most likely temporal scale, and then refines the boundary of the selected anchor through optimization. Consequently, Direct Optimization compares segments across various scales and is thus less likely to fall into a local minimum. The precise boundary can then be found through the subsequent optimization. We give a visualization example in Fig. 9 as an illustration.
OIC Selection might enumerate both predictions A and B. A is better aligned with the ground truth boundary. B localizes a peak because the activation at the peak is indeed much higher than the activations in its contextual area. However, B might be a solution at a local minimum, compared to A, which has an even lower OIC loss. Since A and B do not overlap heavily (their IoU is 0.2), both A and B are likely to be kept by OIC Selection even after NMS. As for Direct Optimization, in the case that the model has two anchors (one regressed to A and the other regressed to B), the model keeps A, which has the lower OIC loss, and removes B directly.
In addition, in terms of speed, enumeration is not practical for long videos, whereas videos in the TAL task are usually untrimmed and long.
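The reason NMS alone does not remove a peak-only false alarm such as B can be seen from a small temporal IoU computation. The sketch below is a generic greedy NMS over 1-D segments ranked by loss, not the exact procedure used in the experiments; the segment coordinates, the loss values, and the NMS threshold of 0.4 are assumptions chosen so that the IoU equals the 0.2 quoted above.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms(segments, losses, iou_threshold):
    """Greedy NMS: visit segments from lowest loss, drop those overlapping a kept one."""
    order = sorted(range(len(segments)), key=lambda i: losses[i])
    kept = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return [segments[i] for i in kept]


seg_a = (0.0, 10.0)   # hypothetical prediction covering the whole action
seg_b = (4.0, 6.0)    # hypothetical prediction covering only a peak inside it
losses = [-0.6, -0.5] # assumed OIC losses: A is lower (better) than B

iou_ab = temporal_iou(seg_a, seg_b)
kept_by_nms = nms([seg_a, seg_b], losses, iou_threshold=0.4)
kept_by_lowest_loss = min(zip(losses, [seg_a, seg_b]))[1]  # pick the lowest-loss candidate
```

Because the IoU is below the threshold, NMS keeps both segments, whereas choosing the single candidate with the lowest loss at that position retains only A.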
-  Activitynet challenge 2016. http://activity-net.org/challenges/2016/ (2016)
-  Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: A review. In: ACM Computing Surveys (2011)
-  Asadi-Aghbolaghi, M., Clapés, A., Bellantonio, M., Escalante, H.J., Ponce-López, V., Baró, X., Guyon, I., Kasaei, S., Escalera, S.: A survey on deep learning based approaches for action and gesture recognition in image sequences. In: FG (2017)
-  Bearman, A., Russakovsky, O., Ferrari, V., Fei-Fei, L.: What’s the point: Semantic segmentation with point supervision. In: ECCV (2016)
-  Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: CVPR (2016)
-  Buch, S., Escorcia, V., Ghanem, B., Fei-Fei, L., Niebles, J.C.: End-to-end, single-stream temporal action detection in untrimmed videos. In: BMVC (2017)
-  Buch, S., Escorcia, V., Shen, C., Ghanem, B., Niebles, J.C.: Sst: Single-stream temporal action proposals. In: CVPR (2017)
-  Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR (2017)
-  Cheng, G., Wan, Y., Saudagar, A.N., Namuduri, K., Buckles, B.P.: Advances in human action recognition: A survey (2015), http://arxiv.org/abs/1501.05964
-  Dai, X., Singh, B., Zhang, G., Davis, L.S., Chen, Y.Q.: Temporal context network for activity localization in videos. In: ICCV (2017)
-  Dave, A., Russakovsky, O., Ramanan, D.: Predictive-corrective networks for action detection. In: CVPR (2017)
-  Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles (1997)
-  Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)
-  Durand, T., Mordan, T., Thome, N., Cord, M.: Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In: CVPR (2017)
-  Escorcia, V., Heilbron, F.C., Niebles, J.C., Ghanem, B.: Daps: Deep action proposals for action understanding. In: ECCV (2016)
-  Gao, J., Yang, Z., Nevatia, R.: Cascaded boundary regression for temporal action detection. In: BMVC (2017)
-  Gao, J., Yang, Z., Sun, C., Chen, K., Nevatia, R.: Turn tap: Temporal unit regression network for temporal action proposals. In: ICCV (2017)
-  Girshick, R.: Fast r-cnn. In: ICCV (2015)
-  Gorban, A., Idrees, H., Jiang, Y.G., Zamir, A.R., Laptev, I., Shah, M., Sukthankar, R.: THUMOS challenge: Action recognition with a large number of classes. http://www.thumos.info/ (2015)
-  Gudi, A., van Rosmalen, N., Loog, M., van Gemert, J.: Object-extent pooling for weakly supervised single-shot localization. In: BMVC (2017)
-  Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: Activitynet: A large-scale video benchmark for human activity understanding. In: CVPR (2015)
-  Heilbron, F.C., Barrios, W., Escorcia, V., Ghanem, B.: Scc: Semantic context cascade for efficient action detection. In: CVPR (2017)
-  Heilbron, F.C., Niebles, J.C., Ghanem, B.: Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In: CVPR (2016)
-  Hong, S., Yeo, D., Kwak, S., Lee, H., Han, B.: Weakly supervised semantic segmentation using web-crawled videos. In: CVPR (2017)
-  Huang, D.A., Fei-Fei, L., Niebles, J.C.: Connectionist temporal modeling for weakly supervised action labeling. In: ECCV (2016)
-  Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
-  Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. In: TPAMI (2013)
-  Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: ACM MM (2014)
-  Jiang, Y.G., Liu, J., Zamir, A.R., Toderici, G., Laptev, I., Shah, M., Sukthankar, R.: THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/ (2014)
-  Jie, Z., Wei, Y., Jin, X., Feng, J., Liu, W.: Deep self-taught learning for weakly supervised object localization. In: CVPR (2017)
-  Kang, S.M., Wildes, R.P.: Review of action recognition and detection methods. arXiv preprint arXiv:1610.06906 (2016)
-  Kantorov, V., Oquab, M., Cho, M., Laptev, I.: Contextlocnet: Context-aware deep network models for weakly supervised localization. In: ECCV (2016)
-  Karaman, S., Seidenari, L., Bimbo, A.D.: Fast saliency based pooling of fisher encoded dense trajectories. In: ECCV THUMOS Workshop (2014)
-  Khoreva, A., Benenson, R., Hosang, J., Hein, M., Schiele, B.: Simple does it: Weakly supervised instance and semantic segmentation. In: CVPR (2017)
-  Kim, D., Yoo, D., Kweon, I.S., et al.: Two-phase learning for weakly supervised object localization. In: ICCV (2017)
-  Lin, T., Zhao, X., Shou, Z.: Single shot temporal action detection. In: ACM MM (2017)
-  Lindeberg, T.: Feature detection with automatic scale selection. IJCV (1998)
-  Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: ECCV (2016)
-  Mettes, P., van Gemert, J.C., Snoek, C.G.: Spot on: Action localization from pointly-supervised proposals. In: ECCV (2016)
-  Oneata, D., Verbeek, J., Schmid, C.: The lear submission at thumos 2014. In: ECCV THUMOS Workshop (2014)
-  Papandreou, G., Chen, L.C., Murphy, K.P., Yuille, A.L.: Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In: CVPR (2015)
-  Poppe, R.: A survey on vision-based human action recognition. In: Image and vision computing (2010)
-  Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: CVPR (2016)
-  Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: CVPR (2017)
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: NIPS (2015)
-  Richard, A., Gall, J.: Temporal action detection using a statistical language model. In: CVPR (2016)
-  Richard, A., Kuehne, H., Gall, J.: Weakly supervised action learning with rnn based fine-to-coarse modeling. In: CVPR (2017)
-  Shen, Z., Li, J., Su, Z., Li, M., Chen, Y., Jiang, Y.G., Xue, X.: Weakly supervised dense video captioning. In: CVPR (2017)
-  Shi, M., Caesar, H., Ferrari, V.: Weakly supervised object localization using things and stuff transfer. In: ICCV (2017)
-  Shou, Z., Chan, J., Zareian, A., Miyazawa, K., Chang, S.F.: Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: CVPR (2017)
-  Shou, Z., Wang, D., Chang, S.F.: Temporal action localization in untrimmed videos via multi-stage cnns. In: CVPR (2016)
-  Sigurdsson, G.A., Divvala, S., Farhadi, A., Gupta, A.: Asynchronous temporal fields for action recognition. In: CVPR (2017)
-  Sigurdsson, G.A., Russakovsky, O., Farhadi, A., Laptev, I., Gupta, A.: Much ado about time: Exhaustive annotation of temporal data. In: HCOMP (2016)
-  Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: Crowdsourcing data collection for activity understanding. In: ECCV (2016)
-  Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS (2014)
-  Singh, K.K., Lee, Y.J.: Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In: ICCV (2017)
-  Sun, C., Paluri, M., Collobert, R., Nevatia, R., Bourdev, L.: Pronet: Learning to propose object-specific boxes for cascaded neural networks. In: CVPR (2016)
-  Sun, C., Shetty, S., Sukthankar, R., Nevatia, R.: Temporal localization of fine-grained actions in videos by domain transfer from web images. In: ACM MM (2015)
-  Tang, P., Wang, X., Bai, X., Liu, W.: Multiple instance detection network with online instance classifier refinement. In: CVPR (2017)
-  Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: ICCV (2015)
-  Tran, D., Ray, J., Shou, Z., Chang, S.F., Paluri, M.: Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038 (2017)
-  Wang, L., Qiao, Y., Tang, X.: Action recognition and detection by combining motion and appearance features. In: ECCV THUMOS Workshop (2014)
-  Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.V.: Temporal segment networks: Towards good practices for deep action recognition. In: ECCV (2016)
-  Wang, L., Xiong, Y., Lin, D., Van Gool, L.: Untrimmednets for weakly supervised action recognition and detection. In: CVPR (2017)
-  Weinland, D., Ronfard, R., Boyer, E.: A survey of vision-based methods for action representation, segmentation and recognition. In: Computer Vision and Image Understanding (2011)
-  Xu, H., Das, A., Saenko, K.: R-c3d: Region convolutional 3d network for temporal activity detection. In: ICCV (2017)
-  Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L.: End-to-end learning of action detection from frame glimpses in videos. In: CVPR (2016)
-  Yuan, J., Ni, B., Yang, X., Kassim, A.: Temporal action localization with pyramid of score distribution features. In: CVPR (2016)
-  Yuan, Z., Stroud, J.C., Lu, T., Deng, J.: Temporal action localization by structured maximal sums. In: CVPR (2017)
-  Zhang, H., Kyaw, Z., Yu, J., Chang, S.F.: Ppr-fcn: weakly supervised visual relation detection via parallel pairwise r-fcn. In: ICCV (2017)
-  Zhang, J., Lin, Z., Brandt, J., Shen, X., Sclaroff, S.: Top-down neural attention by excitation backprop. In: ECCV (2016)
-  Zhao, H., Yan, Z., Wang, H., Torresani, L., Torralba, A.: Slac: A sparsely labeled dataset for action classification and localization. arXiv preprint arXiv:1712.09374 (2017)
-  Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: ICCV (2017)
-  Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR (2016)
-  Zhu, Y., Zhou, Y., Ye, Q., Qiu, Q., Jiao, J.: Soft proposal networks for weakly supervised object localization. In: ICCV (2017)