Despite great success in understanding single images, video understanding still needs further exploration. Temporal action proposal generation, which aims to extract temporal intervals that may contain an action, has drawn much attention recently. It is a challenging task, since high-quality proposals require not only accurate classification of an action but also precise starting and ending times.
Previous temporal action proposal generation methods can generally be classified into two main types. The first type generates proposals by sliding windows. These methods first predefine a series of temporal windows with fixed lengths as proposal candidates. Those candidates are then scored to indicate the probability of action existence. Finally, ranking is applied to get the top proposals. Early works like SST and SCNN-prop try to get high recall by generating dense proposal candidates. SST generates proposals at each time step by utilizing an RNN. TURN and S3D add a boundary regression network to get more precise starting and ending times. However, the disadvantages of sliding window methods are obvious: (1) high-density sliding windows are costly in time; (2) without a boundary regression network, the temporal boundaries are not precise; (3) sliding windows require multiple predefined lengths and strides, introducing additional hyper-parameters and design choices.
The second type generates proposals by actioness grouping. These methods evaluate the probability of action existence for each temporal point and group points with high actioness scores to form final proposals. For example, TAG first uses an actioness classifier to evaluate the actioness probabilities of individual snippets and generates proposals by the classic watershed algorithm. BSN adopts three binary classifiers to evaluate the starting, ending and actioness probabilities of each snippet separately. It then combines candidate starting and ending locations into proposals when the gaps between them are not too large. Methods based on actioness scores tend to generate more precise boundaries. However, the quality of the proposals generated by this type of method depends heavily on the grouping strategy. Besides, evaluating actioness probabilities for all points and then grouping them limits processing efficiency.
How do we humans recognize and localize an action in a video? Do we need predefined windows, or to scan the whole video sequence? The answer is obviously no. For any single frame in a video, a human can directly tell whether an action is happening. Sometimes a human does not even need to see the very start or end of the action to predict its location.
Inspired by this, we present a simple yet effective system named Deep Point-wise Prediction Network (DPP) to generate temporal action proposals. Our method can be divided into two sibling streams: (1) predicting the probability of action existence for each temporal point in the feature maps; (2) predicting the starting time and ending time for each position that potentially contains an action. The whole architecture consists of three parts. The first part is a backbone network that extracts high level spatio-temporal features. The second part is the Temporal Feature Pyramid Network (TFPN), which is inspired by the Feature Pyramid Network (FPN) for object detection. The third part includes a binary classifier for the actioness score and a predictor for the starting and ending times. The whole system is trained end-to-end with a joint loss of classification and localization.
In summary, the main contributions of our work are three-fold:
- We propose a novel method named Deep Point-wise Prediction for temporal action proposal generation, which can generate high quality temporal action proposals with precise boundaries in real time.
- Our proposed DPP breaks through the performance limitation of sliding window based methods. It needs no extra design for predefined sliding windows or anchors. DPP also gets promising results with different backbone networks.
- We evaluate DPP on the standard THUMOS 2014 dataset and achieve state-of-the-art performance.
2 Related Work
Action Recognition is an important task of video understanding. Architectures for this task usually consist of two parts: a spatio-temporal feature extraction network and a category classifier. Since action recognition and temporal action proposal generation both need spatio-temporal features for the following steps, this task is worth investigating. Earlier works like improved Dense Trajectory (iDT) use traditional feature extraction methods consisting of HOF, HOG, and MBH. With the development of convolutional neural networks, many researchers adopt two-stream networks for this task, which combine a 2D convolutional neural network and optical flow to capture appearance and motion features respectively. Recently, as various 3D convolutional neural networks such as C3D, P3D, I3D and 3D-ResNet have appeared, adopting 3D convolutional neural networks to extract spatio-temporal features has become increasingly popular [1, 2, 25, 3].
Temporal Action Proposals and Detection. Since natural videos are usually long and untrimmed, temporal action proposals and detection have attracted intensive interest from researchers [6, 26, 1, 25, 3, 8]. DAP leverages LSTM to encode the video sequence for temporal features. SST presents a method combining C3D and GRU to generate temporal action proposals, trying to capture long-term dependencies. SCNN-prop adopts multi-scale sliding windows to generate segment proposals, then uses a 3D convolutional neural network and fully-connected layers to extract features and classify proposals, respectively. Recent studies focus more on how to get proposals with precise boundaries. TURN applies a coordinate regression network to adjust proposal boundaries. CBR proposes cascaded boundary regression for further boundary refinement. Other methods like TAL-net modify Faster-RCNN to fit the temporal action proposal generation task.
For temporal action detection, methods can be divided into two main types: one-stage [26, 13, 7, 25, 21] and two-stage [1, 6, 14]. One-stage methods like S3D generate temporal action proposals and classify them simultaneously, while two-stage methods such as TURN and BSN generate proposals first and then re-extract features to classify those proposals.
In this section, we introduce the proposed Deep Point-wise Prediction Network and how it works in detail.
3.1 Deep Point-wise Prediction Network
As shown in Figure 2, Deep Point-wise Prediction Network consists of three sub-networks, which are backbone network, Temporal Feature Pyramid Network, and prediction network.
Backbone Network. We use a backbone network and spatial pooling to generate the first-level feature map from a video sequence (we contrast different backbones in our experiments). More specifically, given a video sequence with shape of , through backbone network, we get a feature map with shape of , where is the frame number, and are height and width respectively, is output channel varying with backbone networks. Then we adopt a transposed 3D convolutional layer to upsample the feature map in dimension and a 2D average pooling layer to pool the spatial features. Finally, we get our first-level temporal feature map with the shape of .
Temporal Feature Pyramid Network. The core unit of the Temporal Feature Pyramid Network is the Temporal Reduction Unit (TRU). It receives the current feature map as input and outputs the next feature map, in which each point has a larger receptive field. It consists of four 1D temporal convolutional layers, the first three with stride 1 and the last with stride 2. As a result, each feature map is half the size of the previous one in the temporal dimension. TRUs at different levels share the same weights.
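The halving behaviour of the pyramid can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function name `pyramid_lengths` and the first-level length of 64 are assumptions, chosen because six levels starting from 64 points add up to the 126 samples per clip reported in Section 4.1.

```python
def pyramid_lengths(first_len, num_levels):
    """Temporal length of each pyramid level when every level halves the previous one."""
    lengths = []
    cur = first_len
    for _ in range(num_levels):
        lengths.append(cur)
        cur //= 2  # a stride-2 temporal convolution halves the length
    return lengths


# 64 is an assumed first-level length, not a value stated in the text.
lengths = pyramid_lengths(64, 6)
print(lengths, sum(lengths))
```

Because every point of every level produces one prediction, the sum of the level lengths is also the number of proposal candidates per clip.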
Prediction Network. The Prediction Network is applied to the different feature maps and generates predictions for every point. The first part is a binary classifier that generates foreground and background scores. The second part is a predictor that generates the left offset and right offset of proposals. Both parts are implemented with 1D convolutions.
3.2 Label Assignment
During training, we need to assign an actioness label to every output point according to the ground truth. We design a simple but effective label assignment strategy here. First, points in feature maps are mapped into time points in the original video. For example, for a point in -level feature map with position , its corresponding position in the original video is . If the corresponding position of a point is inside any ground truth, we define it as a positive point. Further restriction for positive labels is introduced in Section 3.3. Since adjacent ground truths do not overlap, a point can only be inside one ground truth. By contrast, previous methods, whether sliding window based or actioness grouping based, adopt a temporal Intersection over Union (tIoU) threshold strategy to define positive proposals and assign corresponding ground truth proposals [25, 7, 6, 1, 26, 4], and their predefined segments may overlap with more than one ground truth simultaneously. Compared with the tIoU based matching strategy, our label assignment process is simpler and more straightforward.
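A minimal sketch of this point-wise label assignment is given below. The mapping from a feature-map index to a position in the clip is an assumption here (the centre of the point's stride cell), and the names `point_to_time` and `assign_labels` are hypothetical; only the "inside a ground truth" rule comes from the text.

```python
def point_to_time(pos, stride):
    """Map a feature-map index to a frame position (assumed: centre of its stride cell)."""
    return (pos + 0.5) * stride


def assign_labels(level_len, stride, ground_truths):
    """For each point, return the index of the ground truth containing it, or -1 for background."""
    labels = []
    for pos in range(level_len):
        t = point_to_time(pos, stride)
        match = -1
        for idx, (start, end) in enumerate(ground_truths):
            if start <= t <= end:
                match = idx
                break  # ground truths do not overlap, so at most one can match
        labels.append(match)
    return labels


# Hypothetical clip of 256 frames, a level with 8 points (stride 32), two ground truths.
print(assign_labels(8, 32, [(40, 100), (180, 230)]))
```

No tIoU threshold is involved: a point is positive as soon as its mapped position falls inside a ground truth interval.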
3.3 Scale Assignment
To predict the proposal location for every point, we try to learn transformation of left offset and right offset between ground truths and current point. Specifically, for points in -level feature map with position and corresponding ground truth proposal with boundary , our localization target is:
where indicates that the point is from feature map, is the length of this feature map, projects the point in feature map into the original input video sequence. is a coefficient which is set as 3.0 in our training to control the importance of localization part in final loss.
As we can learn from the label assignment strategy in Section 3.2, a ground truth may be assigned to different points in feature maps of different levels. If we keep all these positive points for training, optimization can be difficult because of the large scale variations in boundary offsets. Also, as a result of the fixed sizes of convolutional kernels, points in the same level feature map have the same receptive field, and points in higher level feature maps tend to have bigger receptive fields. It is hard for a point to predict proposal boundaries far from its receptive field. In feature map, the stride of adjacent points is . Its receptive field size is several times the stride. Here, we want to restrict the target left offset and right offset to around the receptive field of the current point. So we divide the original localization targets by the default stride of the corresponding feature map to regularize them. For target offsets close to the default stride of the corresponding feature map, this operation centers them around 1, and the log function further centers them around 0. We add additional restrictions for positive points as below:
where is a parameter to control the localization range. Note that points regarded as positive in Section 3.2 that do not satisfy the condition in Eq.(2) are ignored during training. As increases, a ground truth is likely to be optimized by more feature maps.
In conclusion, Eq.(1) computes the regularized left offset and right offset between each time point in feature maps and corresponding ground truth proposals. With predictions from our regressor, we can easily get the final boundaries by inverse transformation of Eq.(1). Eq.(2) selects valuable boundary prediction targets for training.
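The sketch below illustrates one plausible reading of this encoding and its inverse. The exact form of Eq.(1) and Eq.(2) is an assumption reconstructed from the description (offsets divided by the stride, then passed through a log, then restricted to a symmetric range); the names `encode`, `decode`, `is_valid_target` and `sigma` are hypothetical.

```python
import math


def encode(point_time, gt_start, gt_end, stride):
    """Assumed targets: log of the stride-normalised left and right offsets.

    The point must lie strictly inside the ground truth, otherwise the log is undefined.
    """
    left = math.log((point_time - gt_start) / stride)
    right = math.log((gt_end - point_time) / stride)
    return left, right


def decode(point_time, left, right, stride):
    """Inverse transformation: recover the boundaries from (predicted) offsets."""
    start = point_time - stride * math.exp(left)
    end = point_time + stride * math.exp(right)
    return start, end


def is_valid_target(left, right, sigma):
    """Assumed range restriction: both targets must lie within [-sigma, sigma]."""
    return abs(left) <= sigma and abs(right) <= sigma


# Hypothetical point at frame 80 inside a ground truth spanning frames 40 to 100.
left, right = encode(80.0, 40.0, 100.0, stride=32.0)
print(decode(80.0, left, right, stride=32.0))
print(is_valid_target(left, right, sigma=math.log(3.0)))
```

Under this reading, a range parameter of log 3 keeps offsets between one third of the stride and three times the stride, so the same ground truth is rejected at a fine level (small stride) and accepted at a coarser one.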
3.4 Loss Function
Our loss consists of two parts, the action loss and the localization loss. The overall loss is a combination of these two losses, defined as:
For the action loss, we use cross entropy loss, which is effective for classification tasks,
where is the actioness label for sample, and is a vector containing two elements, the predicted foreground and background scores with Softmax activation. For the localization loss, we adopt the widely used Smooth L1 loss,
where is the number of points we define as positive samples, is boundary prediction of point and is the target defined in Section 3.3.
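As a rough illustration, the joint loss can be sketched in plain Python as follows. The normalisation (mean over all sampled points for the action loss, mean over positive points for the localization loss) and all function names are assumptions; the weight of 3.0 on the localization part is the coefficient mentioned in Section 3.3.

```python
import math


def cross_entropy(probs, labels):
    """Mean cross entropy; probs[i] = (foreground, background) after softmax, labels[i] in {0, 1}."""
    total = 0.0
    for (fg, bg), y in zip(probs, labels):
        total -= math.log(fg if y == 1 else bg)
    return total / len(labels)


def smooth_l1(pred, target):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    diff = abs(pred - target)
    return 0.5 * diff * diff if diff < 1.0 else diff - 0.5


def localization_loss(preds, targets):
    """Smooth L1 summed over (left, right) offsets, averaged over positive points."""
    if not preds:
        return 0.0
    total = sum(
        smooth_l1(p, t)
        for pred, target in zip(preds, targets)
        for p, t in zip(pred, target)
    )
    return total / len(preds)


def total_loss(probs, labels, preds, targets, lam=3.0):
    """Action loss plus lam times localization loss."""
    return cross_entropy(probs, labels) + lam * localization_loss(preds, targets)


# One positive point with hypothetical scores and offset predictions.
print(total_loss([(0.5, 0.5)], [1], [(0.5, 2.0)], [(0.0, 0.0)]))
```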
4.1 Dataset and Setup
THUMOS 2014. We evaluate the proposed method on the THUMOS 2014 dataset, which is a standard and widely used dataset for the temporal action proposal generation task. It contains 200 validation and 213 test untrimmed videos whose action instances are annotated temporally. Following the conventions [26, 25, 6, 1, 14, 17], we train our models on the validation set and evaluate them on the testing set.
For temporal action proposal generation, we adopt the conventional evaluation metric. We calculate Average Recall (AR), which is the mean value of recall over different tIoU thresholds, under various Average Numbers of proposals (AN), denoted as AR@AN. Specifically, tIoU set of is used in our experiments.
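The metric can be sketched for a single video as follows. This is a simplified illustration: it assumes the proposal list has already been truncated to the desired number of proposals, and the threshold values in the usage line are placeholders rather than the set used in the experiments.

```python
def tiou(a, b):
    """Temporal IoU of two segments given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(proposals, ground_truths, threshold):
    """Fraction of ground truths matched by at least one proposal at this tIoU threshold."""
    hits = sum(
        1 for gt in ground_truths
        if any(tiou(p, gt) >= threshold for p in proposals)
    )
    return hits / len(ground_truths)


def average_recall(proposals, ground_truths, thresholds):
    """Mean recall over the given tIoU thresholds."""
    return sum(recall_at(proposals, ground_truths, t) for t in thresholds) / len(thresholds)


# Hypothetical proposals, ground truths and thresholds.
props = [(0.0, 10.0), (20.0, 32.0)]
gts = [(0.0, 10.0), (20.0, 30.0)]
print(average_recall(props, gts, [0.5, 0.7, 0.9]))
```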
During training, we use stochastic gradient descent (SGD) as our optimizer. The momentum factor is set to 0.9 and the weight decay factor is set to 0.0001 to regularize weights. We apply a multi-step learning rate scheduler. For all models, the training process lasts for 10 epochs. The initial learning rate is set to 0.0001. It is divided by 10 at epoch 7 and divided by 10 again at epoch 10. Training for one epoch means iterating over the dataset once. To form a batch while training, we clip videos into segments of equal length, which is 256 frames in our experiments. The overlap of adjacent clips is 128 frames. We adopt a sampling frequency of 8 fps in our experiments. According to our network architecture introduced in Section 3.1, we finally get 126 samples for one clip regardless of the assignment strategy. To reduce overfitting, we adopt a multi-scale crop strategy for each frame in addition to random horizontal flipping. As in most foreground/background tasks, there is a huge imbalance between positive and negative samples in our experiments. Thus, we randomly sample negative samples in each batch to keep the ratio of positive to negative samples at about 1:1. This strategy proves effective and results in more stable training.
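The clipping scheme can be illustrated with a short sketch. The clip length of 256 frames and the overlap of 128 frames come from the setup above; the function name `clip_starts` and the handling of the video tail (one extra clip aligned to the end of the video) are assumptions.

```python
def clip_starts(num_frames, clip_len=256, overlap=128):
    """Start frames of overlapping clips covering a video of num_frames sampled frames."""
    step = clip_len - overlap
    last = max(num_frames - clip_len, 0)
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        starts.append(last)  # assumed: add a final clip aligned to the end of the video
    return starts


# A hypothetical video of 600 sampled frames.
print(clip_starts(600))
```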
During inference, we predict the actioness score and boundary offsets for each point in all feature maps. The final boundaries are computed by the inverse transformation of Eq.(1). Then proposals from different clips of the same video are gathered. Finally, all proposals of a video are sorted by actioness score and filtered by Non-Maximum Suppression (NMS) with a threshold of 0.7.
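The final filtering step can be sketched with standard greedy NMS over temporal segments. This is a generic version of the technique, not necessarily the exact variant used in the experiments; the function names and the example proposals are illustrative.

```python
def tiou(a, b):
    """Temporal IoU of two segments given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(proposals, threshold=0.7):
    """Greedy NMS: keep proposals in score order, dropping those that overlap a kept one too much.

    proposals: list of (start, end, score) tuples.
    """
    ordered = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    for cand in ordered:
        if all(tiou(cand, k) <= threshold for k in kept):
            kept.append(cand)
    return kept


# Hypothetical proposals gathered from all clips of one video.
print(temporal_nms([(0.0, 10.0, 0.9), (1.0, 10.0, 0.8), (20.0, 30.0, 0.7)]))
```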
4.2 Ablation Study
Comparison with pre-defined sliding windows. For sliding window based methods, the density of sliding windows at each timestamp is an important factor that influences performance. Most of them adopt a multi-scale anchor strategy to cover more ground truth proposals [13, 25]. One might assume that denser pre-defined sliding windows lead to better results. To explore the influence of sliding window density, we set up a fair contrast experiment; results are shown in Table 1. For better comparison with our method, we use the same architecture as in Figure 2 and assign a base sliding window to each point in the feature maps. The ratios in Table 1 mean the number of sliding windows at each point. For example, in the second row, there are two pre-defined sliding windows for each position in the feature maps. One is the base sliding window; the other is a sliding window with the same center but half the length of the base sliding window. Thus, the number of output proposals is twice that of our method. During training of the sliding window based methods, we assign positive labels to pre-defined sliding windows whose tIoU with any ground truth exceeds 0.5 [7, 13, 25].
Within a certain limit, more sliding windows do result in a higher average recall. However, overly dense sliding windows do not help, and our method is superior to the best performance of the sliding window based methods. This may have several causes. One possible reason is that multi-ratio sliding windows introduce ambiguity: sliding windows at the same position with different ratios share the same input features but are expected to produce different predictions. In addition, our scale assignment strategy restricts the target predictions of each point to within its receptive field, which is likely to result in better performance. Meanwhile, more sliding windows mean more outputs in both training and inference, which reduces speed. In conclusion, compared with sliding window based methods, DPP has the following advantages: (1) no ambiguity, which makes optimization much easier; (2) fewer hyper-parameters that need to be manually designed; (3) fewer proposal candidates, resulting in faster processing.
Analysis of Scale Assignment. We design a novel scale assignment strategy in Section 3.3. And according to Eq.(1), decides the localization target range of each pyramid. As increases, the localization target range will be larger. Thus a ground truth is more likely to match different pyramids, resulting in more positive proposal candidates.
Table 2 shows the influence of on the performance of DPP. And gets the best performance, which is used in all the following experiments. We can compute by the inverse transformation of Eq.(1) that, when , the lower bound and upper bound of localization target are about and three times of default size for each pyramid.
Exploration of Backbone Network. To test different backbones, we fix the number of pyramids at 6. As Table 3 shows, 3D ResNet-50, 3D ResNet-101 and C3D are compared in our experiments. Backbone networks with heavier weights tend to perform better. We also test the speed of the different backbone networks. C3D outperforms the other backbones in average recall but is slower. With almost the same average recall, 3D ResNet-101 attains about twice the speed of C3D. Note that all fps data is evaluated on a single GeForce GTX 1080 Ti. For each experiment, fps is computed as the mean fps of three epochs.
Varying Pyramids for DPP. DPP adopts a pyramid structure to generate feature maps with different scales. We make a contrast experiment here to explore how pyramid amounts affect the performance of DPP.
Table 4 shows the results for pyramid amounts varying from 3 to 6, where npc means the number of proposals in one clip. All experiments in Table 4 use C3D as the backbone network. Under the metrics AR@100 and AR@200, 6 pyramids perform best, while under the metric AR@50, 3 pyramids perform best. Since the differences among the results of these experiments are slight, we can infer that our proposed DPP is robust to variation in the number of pyramids.
4.3 Comparison with State-of-the-art Methods
We compare the proposed DPP with other state-of-the-art methods on temporal action proposal generation in Table 5. To illustrate the effectiveness of DPP, all methods adopt C3D to extract spatio-temporal features, and our method outperforms the other methods.
All methods in the top part of the table adopt pre-defined sliding windows to generate proposal candidates, which is similar to anchor-based methods in object detection such as SSD. As we can see, DPP surpasses all sliding-window based methods by a large margin. Specifically, DPP outperforms TURN, which performs best among the sliding-window based methods, by improvement of in AR@200.
Actioness-grouping methods like BSN group temporal points with high actioness scores to form temporal action proposals. Compared to BSN, DPP increases AR@200 with . MGG ensembles the actioness-grouping based method proposed in and a sliding-window based method to get higher results. Such methods cost much time when predicting, while our method generates high quality proposals at high speed. The fps of the four methods in the top part of Table 5 is evaluated on a GeForce Titan X GPU, and our method is evaluated on a GeForce GTX 1080 Ti GPU. Though BSN and MGG do not report their fps, given the difference in principles, sliding-window based methods are expected to run faster than actioness-grouping based methods. Thus, compared to ensemble methods, DPP achieves comparable or even better results at a much faster speed.
In this paper, we present a simple yet efficient method named Deep Point-wise Prediction to generate high quality temporal action proposals. Unlike previous work, we do not use any pre-defined sliding windows to generate proposal candidates, but directly predict left and right offsets for each point in different feature maps. We note that previous works in 2D object detection share similar ideas [12, 10]. Without the ambiguity of using the same feature to regress different proposal candidates, our method achieves better localization and generates higher quality proposals. In experiments, we explore different settings of our method and demonstrate its robustness. DPP is evaluated on the standard THUMOS 2014 dataset to demonstrate its effectiveness.
- (2017) SST: single-stream temporal action proposals. In , pp. 2911–2920.
- (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308.
- (2018) Rethinking the Faster R-CNN architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1130–1139.
- (2016) DAPs: deep action proposals for action understanding. In European Conference on Computer Vision, pp. 768–784.
- (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941.
- (2017) TURN TAP: temporal unit regression network for temporal action proposals. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3628–3636.
- (2017) Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180.
- (2017) Learning spatio-temporal features with 3D residual networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3154–3160.
- (2018) Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6546–6555.
- (2015) DenseBox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874.
- (2017) The THUMOS challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding 155, pp. 1–23.
- (2019) FoveaBox: beyond anchor-based object detector. arXiv preprint arXiv:1904.03797.
- (2017) Single shot temporal action detection. In Proceedings of the 25th ACM International Conference on Multimedia, pp. 988–996.
- (2018) BSN: boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
- (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
- (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
- (2018) Multi-granularity generator for temporal action proposal. arXiv preprint arXiv:1811.11524.
- (2017) Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- (2000) The watershed transform: definitions, algorithms and parallelization strategies. Fundamenta Informaticae 41 (1, 2), pp. 187–228.
- (2016) Temporal action localization in untrimmed videos via multi-stage CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1049–1058.
- (2015) Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497.
- (2013) Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3551–3558.
- (2015) Towards good practices for very deep two-stream ConvNets. arXiv preprint arXiv:1507.02159.
- (2018) S3D: single shot multi-span detector via fully 3D convolutional networks. arXiv preprint arXiv:1807.08069.
- (2017) Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2914–2923.