Online Action Detection in Untrimmed, Streaming Videos - Modeling and Evaluation

02/19/2018, by Zheng Shou, et al.

The goal of Online Action Detection (OAD) is to detect an action in a timely manner and to recognize its action category. Early works focused on early action detection, which is effectively formulated as a classification problem rather than online detection in streaming videos, because these works used partially seen, short video clips that begin at the start of the action. Recently, researchers started to tackle the OAD problem in the challenging setting of untrimmed, streaming videos that contain substantial background shots. However, they evaluate OAD in terms of per-frame labeling, which does not require detection at the instance level and does not evaluate the timeliness of the online detection process. In this paper, we design new protocols and metrics. Further, to specifically address the challenges of OAD in untrimmed, streaming videos, we propose three novel methods: (1) we design a hard negative sample generation module based on the Generative Adversarial Network (GAN) framework to better distinguish ambiguous background shots that share similar scenes but lack the true characteristics of the action start; (2) during training, we impose a temporal consistency constraint between data around the action start and data succeeding the action start to model their similarity; (3) we introduce an adaptive sampling strategy to handle the scarcity of the important training data around the action start. We conduct extensive experiments using THUMOS'14 and ActivityNet. We show that our proposed strategies lead to significant performance gains and improve state-of-the-art results. A systematic ablation study also confirms the effectiveness of each proposed method.




1 Introduction

Online Action Detection (OAD), as shown in Figure 1, aims to detect the occurrence and class of the action start as soon as it happens; the detection model continuously monitors the live video stream in real time without any side information or access to the future video frames. This task is important in many practical applications, such as monitoring surveillance cameras, robot cognition, etc.

Related problems have been studied before [32, 65, 6, 42, 33, 36, 4]. However, most of them simulated experiments using short video clips trimmed to start exactly at the onset of action and to include only part of the action, without surrounding complex background streams. This problem was thus effectively formulated as partial action classification.

Figure 1: Framework of Online Action Detection and notations for the proposed evaluation protocols. STDL: Start Time Decision Latency; STPE: Start Time Prediction Error.

In this paper, we aim to accomplish OAD in streaming, untrimmed videos, which are crawled from the Web and contain large amounts of background shots. These videos are challenging to analyze because they require models that can distinguish complex backgrounds occurring before actions and detect action starts in a timely manner. De Geest et al. [18] first simulated OAD using untrimmed, realistic videos and benchmarked existing models for this scenario. Gao et al. [16] designed a training strategy to encourage Reinforced Encoder-Decoder (RED) Networks to make correct frame-level label predictions as early as possible.

However, both works evaluated performance using the mean Average Precision (mAP) of per-frame action category labeling. First, a good OAD model does not necessarily require high accuracy in terms of per-frame classification. Second, when a series of action instances occur back to back without background gaps, even perfect per-frame classification does not lead to correct action start detection, because per-frame labeling does not require detecting the transition from background to action. Per-frame evaluation also bypasses the difficulty of detecting the action start at the instance level and does not explicitly measure the detection timeliness. Consequently, we design new evaluation protocols that (1) explicitly require the model to detect when each action starts and then (2) measure the offset of the predicted start time from the true start time (defined as STPE in Figure 1). Although RED [16] achieves 45.3% per-frame labeling mAP on THUMOS'14, its performance drops significantly under the proper OAD evaluation protocols, as shown in Figure 6. This indicates that OAD in the untrimmed, streaming setting remains very challenging and is far from being solved.

In this paper, we identify three challenges in training a good OAD model and accordingly propose three novel solutions. Since state-of-the-art video classification models such as C3D [52, 53], TSN [59], and I3D [7] all accept a short video snippet/segment/window as network input, in this paper we also accept windows as input and set the window length to 16 frames. (1) As shown in the example in Figure 2, it is important to learn and detect characteristics that can correctly distinguish the start window from the background that precedes the start of action and may share very similar scenes (but without the actual occurrence of actions). To this end, we introduce an auxiliary generative network trained in an adversarial process to automatically generate hard negative samples during training. Although hard negative data may be rare in the training videos, our generator directly learns to model the distribution of hard negatives and thus can generate a much larger pool of hard negatives. (2) We define the start window and its follow-up window in Figure 2. A start window contains both action frames and background frames. Background preceding action can provide temporal contextual information, but it can also be confusing and prevent the feature representation of start windows from being more similar to actions than to backgrounds. To remedy this, since the follow-up window lies completely inside the action, we propose to model the similarity between the start window and its follow-up window by imposing a temporal consistency constraint during training. (3) It is important to accurately classify start windows in OAD, but each action instance has only a few start windows, so training samples for start windows are much scarcer than others such as background windows and windows fully inside actions. To address this, we design an adaptive sampling strategy to increase the percentage of start windows in each training batch. Finally, the experiments in Section 5 prove the effectiveness of each proposed method, and putting all three together results in significant performance gains.

Figure 2: We identify three challenges in training a good OAD model and propose three novel methods to significantly enhance the accuracy and timeliness in detecting the start of action.

In summary, we make three contributions in this paper:

(a) We propose new protocols to correctly formulate OAD in untrimmed, streaming videos as a detection problem and evaluate performance in terms of the detection timeliness and the action category prediction accuracy.

(b) We design three novel strategies for training good OAD models: generating hard negative samples based on GAN to assist OAD model in continuously investigating characteristics that are discriminative for OAD, imposing a temporal consistency constraint between the start window and its follow-up window to model their similarity, and also adaptively sampling start windows more frequently.

(c) Extensive simulations using THUMOS’14 and ActivityNet demonstrate the effectiveness and necessity of all three novel strategies in our approach.

2 Related work

Action Classification. Given a video clip, the goal of classification is to recognize the action categories contained in the whole video. Impressive progress has been made on this problem, from hand-crafted features [57, 58, 38, 26] to recent approaches based on deep networks [52, 13, 46, 12, 59, 61, 7]. Detailed reviews can be found in surveys [60, 39, 3, 8, 5, 29]. Various network architectures have been proposed: 3D ConvNets have been studied in [52, 27, 53]; Simonyan and Zisserman first proposed the two-stream network [46], and based on it, Wang et al. built a framework called Temporal Segment Network [59]; Carreira and Zisserman expanded the Inception network [51, 24] from 2D to 3D, resulting in I3D [7].

Temporal Action Detection (TAD). In contrast to OAD, TAD belongs to the offline setting, in which the complete video is available during testing. Recently, TAD in long, untrimmed videos has attracted a lot of interest [45, 41, 21, 11, 64, 63, 66, 50, 44, 16, 62, 14, 17, 10]. Given an untrimmed video, TAD needs to temporally localize each action instance: not only predict its category but also detect when it starts and ends. During evaluation, TAD compares the temporal overlap, measured by Intersection-over-Union (IoU), between the prediction and the ground truth. Only when the overlap exceeds a threshold is the prediction regarded as correct. This evaluation protocol properly measures how accurately detection models fire at the instance level, and it inspires us to design new and proper evaluation protocols for OAD.

Online Action Detection. Researchers have worked on streaming video for a long time, mainly focusing on early action detection. This is essentially a partial video classification problem rather than detection, because the videos used in early action detection are short and temporally trimmed to start at the onset of action. These works evaluate the classification accuracy when only a certain percentage of the action has been observed since the action start [32, 65, 6, 42, 33, 36, 4]. Hoai and De la Torre [22, 23] made attempts to detect actions in an online manner using several simple datasets (e.g., one action instance per video).

However, OAD in this paper targets a more challenging and realistic setting: (1) beyond simple actions, we consider actions of high-level semantics in the wild; (2) the streaming videos contain considerable amounts of background content, which can be quite diverse and thus hard to distinguish. We employ untrimmed videos from standard TAD datasets collected from YouTube to simulate experiments for OAD. De Geest et al. [18] were the first to introduce OAD in the aforementioned setting. Recently, Gao et al. [16] proposed a reinforced encoder-decoder LSTM model to detect action online and anticipate actions in the future. But both papers evaluated OAD using per-frame labeling accuracy, which is not an appropriate evaluation protocol.

In addition, there are also works on spatio-temporally localizing actions in an online manner but also limited to short videos [47, 48]. Li et al. [34] and Liu et al. [35] leveraged Kinect sensors and performed detection based on tracked skeleton information. Vondrick et al. [55] targeted future prediction, which is a more ambitious task than OAD.

Adversarial Learning. The idea of training in an adversarial process was first proposed in [19] and has been adopted in many applications [37, 56, 25, 67, 54]. Generative Adversarial Network (GAN) [19, 40] consists of two networks trained simultaneously to compete with each other: a generator network G that learns to generate fake samples indistinguishable from real data and a discriminator network D which is optimized to correctly recognize whether input data samples are real or fake.

3 Model

In this section, we first introduce our OAD framework and then propose three novel approaches to improve the capability of backbone networks at detecting action in a timely manner. We follow state-of-the-art video classification networks like C3D [52, 53], TSN [59], I3D [7] to accept temporal sliding windows as input. In particular, we set the window length to 16 frames and use C3D as our backbone network in this section to illustrate technical ideas.

3.1 Framework

Testing. We outline our OAD framework shown in Figure 1 by walking through the testing pipeline. During testing, when a new frame arrives at time t, we feed the window ending at t into our network to make a prediction for time stamp t. Each prediction output consists of the time t, the semantic class c_t, which can be background or one of K actions (K is the total number of action categories to be detected), and the confidence score s_t. In order to detect the action start, we compare the predictions at t−1 and t. We output an action start prediction when the following conditions are all satisfied: (1) c_t is an action; (2) c_t ≠ c_{t−1}; (3) s_t exceeds the threshold obtained by grid search on the training set. As an alternative, we have also studied adding a proposal stage specifically for detecting the action start and then classifying the action class. We found this alternative is not as effective as the approach outlined above. Details can be found in the supplemental material.
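The start-detection rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the background index, threshold value, and function name are hypothetical (in the paper the threshold is found by grid search on the training set).

```python
BACKGROUND = 0     # hypothetical index for the background class
THRESHOLD = 0.5    # hypothetical value; grid-searched in the paper

def detect_action_starts(classes, scores):
    """Scan per-timestamp predictions and fire an action-start event when
    (1) the predicted class is an action, (2) it differs from the previous
    prediction, and (3) its confidence exceeds the threshold."""
    starts = []
    prev_class = BACKGROUND
    for t, (c, s) in enumerate(zip(classes, scores)):
        if c != BACKGROUND and c != prev_class and s > THRESHOLD:
            starts.append((t, c, s))
        prev_class = c
    return starts

# Background for two frames, then action class 3 begins at t=2.
classes = [0, 0, 3, 3]
scores = [0.9, 0.8, 0.7, 0.95]
print(detect_action_starts(classes, scores))  # fires once, at t=2
```

Note that the event fires only on the transition into the action, not on every action frame, which is exactly what instance-level start detection requires.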


Training. The focus of our work is to develop innovative strategies to train a robust OAD model. During training, the complete videos are available. We slide windows over time with a stride of 1 frame to first collect a set of training windows to be fed into the network. For each window, we assign its label as the action class of the last frame of the window. Untrimmed videos have large portions of background; thus it is important to balance background data and action data in each training batch. We construct each training batch by randomly sampling half of the batch from background windows and randomly sampling the other half from windows whose labels are actions.

3.2 Adaptively sample the training data

Since we want to detect actions as soon as possible, it is important for OAD to accurately classify start windows. This is challenging because start windows contain various background contents, and the number of start windows is quite scarce. Therefore, instead of random sampling, we design an adaptive sampling strategy to pay more attention to start windows. Concretely, we randomly sample half of the training batch from start windows and randomly sample the other half from the remaining windows, which can be backgrounds or windows completely inside actions.
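The adaptive sampling strategy can be sketched as follows; the function name is hypothetical, windows are treated as opaque objects, and sampling with replacement is used for simplicity (the batch size of 24 matches Section 5.1).

```python
import random

def build_batch(start_windows, other_windows, batch_size=24):
    """Adaptive sampling: half of each training batch comes from the scarce
    start windows, the other half from background windows and windows fully
    inside actions."""
    half = batch_size // 2
    batch = [random.choice(start_windows) for _ in range(half)]
    batch += [random.choice(other_windows) for _ in range(batch_size - half)]
    random.shuffle(batch)  # avoid a fixed ordering within the batch
    return batch
```

Even if a video contains only a handful of start windows, they now occupy half of every batch instead of appearing roughly in proportion to their scarcity.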

After each training batch is constructed, we can feed it into our network as shown in Figure 5 (a) and train the network by minimizing the multi-class classification softmax loss L_classify. We denote the set of start windows as S = {(x_i^s, y_i^s)}, where x_i^s is the i-th start window to be fed into our model and y_i^s is its corresponding ground truth label. Similarly, we express the set of remaining windows as R = {(x_j^r, y_j^r)}. The label space of y is {1, ..., K+1}, where the first K classes are actions and the (K+1)-th class stands for background. Our model takes a window x as input and predicts a vector o of K+1 dimensions. Finally, we apply the softmax function to obtain the normalized probability of x being class k: p_k(x) = exp(o_k) / Σ_{k'=1}^{K+1} exp(o_{k'}). We use E to represent expectation. The training objective is:

L_classify = E_{(x,y)∈S}[ −log p_y(x) ] + E_{(x,y)∈R}[ −log p_y(x) ]    (1)
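A minimal numpy sketch of this multi-class softmax loss over a batch; shapes and function names are illustrative, not the authors' code.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over the class dimension."""
    e = np.exp(o - o.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def classification_loss(logits, labels):
    """Multi-class softmax loss of Equation 1: mean negative log-probability
    of the ground truth class over the batch (start and remaining windows
    are simply concatenated here)."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

# 2 samples, K+1 = 3 classes (2 actions + background, 0-indexed)
logits = np.array([[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]])
labels = np.array([0, 2])  # first sample is action 0, second is background
print(classification_loss(logits, labels))
```

With uniform logits the loss reduces to log(K+1), the entropy of a random guess, which is a handy sanity check.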

Figure 3: An illustration of the data distribution in the high-level feature space when training the model with and without the Temporal Consistency (TC) constraint. We can observe that the TC constraint pulls start windows closer to follow-up windows and thus farther apart from negatives. Although hard negatives that have subtle differences from start windows may still lie close to start windows, we propose the GAN-based method to distinguish them.
Figure 4: L2 distance histogram for all pairs of start window and its follow-up window in the THUMOS’14 test set [28] computed using trained models with and without Temporal Consistency (TC) respectively. We can observe a significant shift after taking into account TC. This implies that TC indeed helps our model to make feature representation of start windows more similar to follow-up windows and thus more distinguishable from negatives.
Figure 5: Network architectures of our OAD models built on C3D and the proposed training objectives. (a) Our basic OAD model consists of 3D ConvNets from conv1a to pool5 and 3 fully connected layers (fc6, fc7, fc8). We keep the same backbone network architecture as C3D [52] while changing the number of nodes in fc8 to K+1 to stand for K actions and background. The output of fc8 is used for calculating the multi-class classification softmax loss. (b) We impose a temporal consistency constraint between the start window and its paired follow-up window by adding another loss term in the training objective to minimize the L2 distance computed using their fc7 activations. The two streams in this siamese network share the same parameters. (c) Further, we design a GAN-based framework to automatically generate hard negative samples on the fly to help our model more accurately distinguish actions from negatives. G is the generator and D is the discriminator. G accepts random noise as input and outputs fake features. We add an additional class in fc8 for fake samples. All blue blocks of the same name are the same layer and share weights. More details can be found in Section 3.4.

3.3 Impose temporal consistency constraint

Follow-up windows are completely inside actions, and thus are far away from negative data in the feature space, as shown in Figure 3 (a). But the start window is a mixture of action frames and background frames, and thus in the feature space start windows can be close to or even mixed with negatives. It is important to accurately distinguish start windows from negatives in OAD so that the model can detect the action start in a timely manner when the video stream switches from negative to action. Thus we impose a Temporal Consistency (TC) constraint between each start window and its follow-up window to model their similarity explicitly.

Concretely, we denote the training set of paired start and follow-up windows as P = {(x_i^s, x_i^f, y_i^s)}, where x_i^s represents the start window, x_i^f is its associated follow-up window, and y_i^s is still the ground truth label. We impose the temporal consistency constraint by minimizing the L2 distance between the feature representations of x_i^s and x_i^f:

L_TC = E_{(x^s, x^f)∈P}[ ||F(x^s) − F(x^f)||_2^2 ]    (2)

where the function F(·) extracts the feature representation. In Figure 5 (b), we set F to be the output of fc7 because it is also the input to the final layer for classification. Now, the overall objective becomes

L = L_classify + λ · L_TC    (3)

where λ is the cost weighting parameter.

As illustrated in Figure 3, training with the TC constraint draws feature representations of start windows closer to follow-up windows and thus more separable from negatives. Figure 4 further confirms this hypothesis quantitatively.
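The TC term and the combined objective of this section can be sketched in a few lines of numpy. Features are assumed to be arrays of shape (batch, dim), e.g. fc7 activations of paired start and follow-up windows; the function names are illustrative, and λ = 0.1 follows Section 5.1.

```python
import numpy as np

def tc_loss(feat_start, feat_followup):
    """Temporal Consistency loss of Equation 2: mean squared L2 distance
    between features of paired start and follow-up windows."""
    diff = feat_start - feat_followup
    return np.mean(np.sum(diff * diff, axis=1))

def overall_loss(cls_loss, feat_start, feat_followup, lam=0.1):
    """Overall objective of Equation 3: classification loss plus the
    lambda-weighted TC term."""
    return cls_loss + lam * tc_loss(feat_start, feat_followup)
```

Minimizing this term directly penalizes any gap between a start window and its fully-inside-action follow-up window in feature space, which is what shifts the histogram in Figure 4.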

3.4 Generate hard negative samples via GAN

As illustrated in Figure 2, it is important to train the OAD model to capture subtle differences that can serve as evidence for discriminating start windows from negatives preceding the action start. This motivates us to find such hard negative samples in the training set during training the model. However, exhaustively finding such samples is time consuming because such hard negatives are rare and may even not exist in the training data. Therefore, we propose to train a model to automatically synthesize samples that are hard to distinguish from true start windows.

To this end, we design a framework based on GAN to help OAD model separate start windows and hard negatives. We first pre-train our model using Equation 3. Based on such initialization, we train Generator (G) and Discriminator (D) in an alternating manner during each iteration.

Training G. Since directly generating videos is very challenging, we use GAN to generate features rather than raw videos. As shown in Figure 5 (c), our GAN model has fixed 3D ConvNets (from conv1a to pool5) to extract real features from raw videos and also has a G to generate fake features. The upper layers serve as D, to be explained later.

G accepts a random noise vector z as input and learns to capture the true distribution of real start windows. Consequently, G has the potential to generate various fake samples which might not exist in the real training set. Therefore, our model can continuously explore the classification boundary in the high-level feature space. Following [40], z is a 100-dimensional vector randomly drawn from the standard normal distribution. In practice, we find that a simple G consisting of two fully connected layers works well. Each fully connected layer is followed by a BatchNorm layer and a ReLU layer.
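A forward-pass sketch of such a two-layer G in numpy. The 100-dimensional noise matches the text; the layer widths, weight initialization, and simplified batch normalization below are illustrative placeholders, not the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_bn_relu(x, W, b):
    """One generator stage: fully connected layer, a simplified BatchNorm
    (normalizing over the batch), then ReLU."""
    h = x @ W + b
    h = (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-5)
    return np.maximum(h, 0.0)

def generator(z, params):
    """Forward pass of a two-layer G; weights here are random placeholders,
    whereas the real G is trained with the feature matching loss."""
    (W1, b1), (W2, b2) = params
    return fc_bn_relu(fc_bn_relu(z, W1, b1), W2, b2)

# z is 100-dimensional as in the paper; the layer widths below are small
# illustrative choices, not the sizes used in the experiments.
noise_dim, hidden, feat_dim = 100, 64, 128
params = [(0.01 * rng.standard_normal((noise_dim, hidden)), np.zeros(hidden)),
          (0.01 * rng.standard_normal((hidden, feat_dim)), np.zeros(feat_dim))]
z = rng.standard_normal((8, noise_dim))  # a batch of noise vectors
fake_features = generator(z, params)     # fake pool5-level features
```

Because the final stage ends in a ReLU, the generated features are non-negative, matching the range of pooled ConvNet activations.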

When training G, the principle is to generate hard negative samples that are similar to real start windows. Conventional GANs utilize a binary real/fake classifier to provide the supervision signal for training G. However, this method usually encounters instability issues. Following [43], instead of adding a binary classifier, we require G to generate fake data matching the statistics of the real data. Specifically, the feature matching objective forces G to match the expectation of the real features on an intermediate layer of D (we use fc7, as indicated in Figure 5 (c)).

Formally, we denote the feature extraction by the fixed 3D ConvNets as F_conv and the process from fc6 to fc7 in D as F_D. During training G, we fix D, so the total objective to minimize contains only the feature matching loss, defined as follows:

L_match = || E_{x^s∈S}[ F_D(F_conv(x^s)) ] − E_z[ F_D(G(z)) ] ||_2^2    (4)

where G stands for the generator, S = {(x_i^s, y_i^s)} is the training set of start windows, x_i^s represents the start window, and y_i^s is the ground truth label.
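The feature matching loss reduces to a squared distance between two batch means, as this small sketch shows (inputs are assumed to be fc7-level feature batches; the function name is illustrative).

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Feature matching objective for training G (following [43]):
    squared L2 distance between the batch mean of real start-window
    features and the batch mean of generated features, both taken at an
    intermediate layer of D."""
    diff = real_feats.mean(axis=0) - fake_feats.mean(axis=0)
    return float(np.sum(diff * diff))

real = np.array([[1.0, 3.0], [3.0, 1.0]])   # two real feature vectors
fake = np.array([[2.0, 2.0]])               # one generated feature vector
print(feature_matching_loss(real, fake))    # means coincide, so loss is 0
```

Matching first-order statistics rather than fooling a binary classifier is what gives this objective its reputation for more stable G training.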

Training D. The principle for designing D is that the generated samples should still be separable from real start windows despite their similarity, so that the generated samples can be regarded as hard negatives. As shown in Figure 5 (c), D consists of fc6, fc7, and fc8. Instead of adding a binary real/fake classifier, we add an additional node in fc8 to represent the hard negative class, which is the ground truth label for generated samples. Note that this additional class is removed during testing.

Similarly, some previous works also replaced the binary discriminator with a multi-class classifier that has an additional class for fake samples [49, 43, 9]. However, their motivation is mainly extending GAN to the semi-supervised setting: the unlabeled real samples could belong to any class except fake. But in this paper, we focus on generating hard negatives which should be similar to actions but dissimilar to backgrounds; meanwhile our D needs to distinguish hard negatives from not only actions but also from backgrounds.

Given a feature φ either extracted from real data (φ = F_conv(x)) or generated by G (φ = G(z)), D accepts φ as input and predicts a vector o which goes through a softmax function to get the class probabilities: p_k(φ) = exp(o_k) / Σ_{k'=1}^{K+2} exp(o_{k'}), where k ∈ {1, ..., K+2}. Regarding real samples, we can calculate their corresponding classification loss term by extending L_classify defined in Equation 1:

L_real = E_{(x,y)∈S}[ −log p_y(F_conv(x)) ] + E_{(x,y)∈R}[ −log p_y(F_conv(x)) ]    (5)

As for generated fake samples, the loss is:

L_fake = E_z[ −log p_{K+2}(G(z)) ]    (6)

where K+2 represents the hard negative class. During training D, G is fixed, and we also need to take into account the temporal consistency loss defined in Equation 2; thus the full objective to be optimized is:

L_D = L_real + L_fake + λ · L_TC    (7)

where λ is the cost weighting parameter.
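A numpy sketch of the discriminator objective of Equation 7; logits over K+2 classes and the precomputed TC value are assumed given, and the names are illustrative. With 0-based indexing, the hard negative class sits at index K+1.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels):
    """Mean negative log-likelihood of the given labels."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def discriminator_loss(real_logits, real_labels, fake_logits, tc, K, lam=0.1):
    """Full objective for training D: classification loss on real windows
    over K+2 classes, a loss pushing generated features to the hard
    negative class, and the lambda-weighted TC term."""
    hard_negative = np.full(len(fake_logits), K + 1)  # index of fake class
    return nll(real_logits, real_labels) + nll(fake_logits, hard_negative) + lam * tc
```

With uniform logits each nll term equals log(K+2), which gives a quick sanity check on the implementation.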

4 Evaluation

4.1 Conventional protocols and metrics

Hoai and De la Torre [22, 23] first worked on OAD in untrimmed videos and proposed three evaluation protocols to respectively evaluate classification accuracy, detection timeliness, and localization precision. As comprehensively discussed in [18], these protocols do not suit OAD in realistic videos because they are designed for a simplified setting: each video contains only one action instance of interest. In addition, using three protocols individually rather than a single coherent one makes it difficult to compare different approaches. Therefore, recent works [18, 16] evaluated OAD using the mAP of classifying every frame.

4.2 Proposed new protocols and metrics

However, (1) per-frame labeling is a classification problem while OAD in practice requires the system to make detections at the instance level, and (2) the focus of OAD should be timely detection of the action start rather than accurately labeling every video frame. To remedy this, we carefully design new protocols and metrics for OAD.

Similar to the standard evaluation protocols used in object detection and temporal action detection, we evaluate OAD at the instance level: as a video is streamed, the OAD system outputs a list of detected action start points. As shown in Figure 1, each action start prediction is associated with two times: t_d is the time when the system makes this prediction and t_p is the time that the system predicts as the time when the action starts. Following temporal action detection, we evaluate OAD results using mAP over all action classes and do not allow duplicate detections for the same ground truth. A prediction is counted as correct only when its action class is correct and the prediction satisfies a certain criterion regarding the detection timeliness. Details about this criterion are introduced in the following.

In temporal action detection, Intersection-over-Union (IoU) is used to measure the temporal overlap between the ground truth interval and the predicted interval, and a prediction is correct only when its IoU is higher than the threshold used in evaluation. Likewise, we introduce two timeliness measurements for OAD: given the ground truth start time t_g, we define Start Time Decision Latency (STDL) as |t_d − t_g| to measure the detection latency, and we define Start Time Prediction Error (STPE) as |t_p − t_g| to measure the prediction offset error. When evaluating the timeliness using one of the above measurements, a prediction can be regarded as correct only when its timeliness is smaller than the threshold.

Note that in the above definitions we use absolute values because the thresholding constraint for the timeliness measurement represents a time offset tolerance: a correct prediction can precede or succeed the ground truth as long as it is within the offset tolerance. In addition, all approaches discussed in this paper follow the same testing pipeline outlined in Section 3.1, in which the predicted start time is the time of the current frame; therefore, for each prediction t_d and t_p coincide, and STDL equals STPE. Hence, during experiments we only need to measure the timeliness using STPE.
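A simplified sketch of evaluating one action class at a single STPE threshold. This is an illustration of the protocol's spirit, not the authors' evaluation code: predictions are (score, t_p) pairs, matching picks the first unmatched ground truth within tolerance, and no interpolation of the precision-recall curve is performed.

```python
def average_precision(predictions, gt_starts, tolerance):
    """Point-level AP for one action class: a prediction is a true positive
    if |t_p - t_g| <= tolerance for some still-unmatched ground truth start
    (duplicate detections of the same ground truth are not allowed)."""
    predictions = sorted(predictions, key=lambda p: -p[0])  # by confidence
    matched = [False] * len(gt_starts)
    tp, fp, precisions = 0, 0, []
    for score, t_pred in predictions:
        hit = None
        for i, t_gt in enumerate(gt_starts):
            if not matched[i] and abs(t_pred - t_gt) <= tolerance:
                hit = i
                break
        if hit is None:
            fp += 1
        else:
            matched[hit] = True
            tp += 1
            precisions.append(tp / (tp + fp))  # precision at this recall point
    return sum(precisions) / len(gt_starts) if gt_starts else 0.0

# Two ground truth starts; one prediction is within the 5s tolerance,
# the other is far off, so only half the recall is reached at precision 1.
print(average_precision([(0.9, 11.0), (0.8, 300.0)], [10.0, 50.0], tolerance=5.0))
```

Averaging this quantity over all action classes and over a range of tolerance thresholds yields curves like those in Figure 6.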

5 Experiments

5.1 Implementation details

In order to simulate the OAD setting, we employ standard benchmarks consisting of long, untrimmed videos and stream their frames sequentially. We implement our system using TensorFlow [2]. Through grid search on the training videos, we find that our models converge well in general under the following settings: we use the Adam optimizer [31] with batch size 24 and 5K training iterations; λ in Equations 3 and 7 is 0.1; weight decay is 0.00005; in Figure 5 (a) and (b), the learning rate is 0.00001, except 0.0001 for fc8 since it is randomly initialized; in Figure 5 (c), the learning rate is 0.00001 for D and 0.003 for G, the latter larger because G is trained from scratch. We conduct experiments on a single NVIDIA Titan X GPU with 12GB memory.

5.2 Results on THUMOS’14

Dataset. THUMOS'14 [28] involves 20 actions and over 20 hours of video: 200 validation videos (3,007 action instances) and 213 test videos (3,358 action instances). These videos are untrimmed, and each contains at least one action instance. Each video has 16.8 action instances on average. We use the validation videos for training and the test videos to simulate streaming videos for testing OAD.

Figure 6: Experimental results on THUMOS’14. Left: y-axis is mAP at AP depth reaching recall 100% and x-axis is varying STPE (Start Time Prediction Error) threshold. Right: y-axis is mAP averaged over STPE threshold from 1s to 10s at AP depth reaching maximum recall which varies from 0.1 to 1 in x-axis.

Comparisons. Our network architecture can be found in Figure 5. Since we build our model upon C3D [52], which has been pre-trained on Sports-1M [30], we use it to initialize the models shown in Figure 5. Since the pool5 output is 8,192-dimensional, we set the number of nodes to 4,096 for both fully connected layers in G.

We compare with the following baselines. (1) Random guess: we assign confidence scores for the K actions and background by randomly splitting a total score of 1 into K+1 numbers, all within [0, 1]. (2) C3D w/o ours: we use a C3D model which has exactly the same network architecture as our model used during testing but is trained without our proposed strategies. (3) RED: Gao et al. [16] proposed Reinforced Encoder-Decoder (RED) Networks and achieved the state-of-the-art performance on THUMOS'14 in the OAD setting. We requested results from the authors and evaluated them using the proposed protocols. Since the duration of action instances varies from 1s to 20s, during evaluation we vary the STPE threshold from 1s to 10s. As shown in Figure 6, when using the proposed training strategies specifically designed to tackle OAD, our approach improves over the C3D w/o ours baseline by a large margin. Our approach is also far better than random guess and outperforms RED, a recently developed state-of-the-art method. In Figure 7, we show some qualitative results.

Figure 7: Qualitative comparisons on THUMOS’14. Green indicates ground truth action starts; Red indicates action starts detected by C3D OAD model trained with our proposed approaches; Orange indicates action starts detected by the same C3D model trained without our proposed approaches. (a) Once the BasketballDunk action starts, our approach correctly detects it sooner than C3D w/o ours; (b) In this Billiards example, our approach detects action start exactly when the Billiards action begins and much sooner than C3D w/o ours; (c) In this example consisting of a CricketBowling instance and a CricketShot instance back-to-back, our approach detects the start of CricketBowling timely and detects the start of CricketShot exactly when it begins, but C3D w/o ours misses the CricketBowling instance and detects the CricketShot instance with delay.

5.3 Results on ActivityNet

Dataset. ActivityNet [20, 1] involves 200 actions and over 800 hours of untrimmed video: around 10K training videos (15K instances) and 5K validation videos (7.6K instances). Each video has 1.7 action instances on average. We train on the training videos and evaluate OAD using the validation videos.

Comparisons. Given the superior performance of TSN on the ActivityNet video classification task [59] and following [15], for each window of 16 frames we use TSN to extract a 3,072-dimensional feature vector to serve as input to our model. Our basic OAD model on ActivityNet consists of three fully connected layers (i.e., fc6, fc7, fc8) that are the same as those in C3D, but we train this model directly from scratch. As for G, since the dimension of fake samples here is 3,072, we set both fully connected layers in G to be 2,048-dimensional.

The duration of action instances in ActivityNet varies from 1s to 200s, so during evaluation we vary the STPE threshold from 10s to 100s. As shown in Table 1, our approach is again much better than random guess and significantly improves over the baseline TSN w/o ours, which also accepts TSN features as input and has the same testing network architecture as our approach but is trained without our proposed training methods.

STPE threshold (s)    10       50       100
Random guess           0.07     0.17     0.20
TSN w/o ours          12.49    32.94    44.86
Our approach          13.28    34.76    47.65
Table 1: Experimental mAP results (%) on ActivityNet when varying the STPE (Start Time Prediction Error) threshold.

5.4 Discussions

Efficiency. In terms of testing speed, unlike offline detection, which evaluates how many frames can be processed per second, OAD requires evaluating the delay between receiving a new video frame and the system outputting a prediction for this frame. Our model in Figure 5 (c) is able to respond within 0.16s. In addition, our approaches can be applied to existing video classification networks without adding complicated modules during testing. During training, the proposed GAN module only adds a generator of two fully connected layers and is thus quite lightweight.

Evaluation of individual strategies. We conduct an in-depth study on THUMOS'14 to analyze the performance gain contributed by each proposed training strategy. In Figure 8, all approaches have the same network architecture during testing: C3D is trained without any proposed strategies; C3D-adaptive adds the adaptive sampling strategy and improves over C3D; C3D-adaptive-TC additionally imposes the temporal consistency constraint during training and further improves over C3D-adaptive; C3D-adaptive-GAN instead trains within our proposed GAN-based framework on top of adaptive sampling and also outperforms C3D-adaptive; C3D-adaptive-TC-GAN combines all three proposed strategies during training and achieves the best performance. Consequently, all the proposed strategies are effective and crucial for training a good OAD model. In the supplemental material, we present more detailed results for this ablation study.

Figure 8: Experimental results on THUMOS’14. Left: the y-axis is mAP at AP depth reaching recall 100%, and the x-axis is the varying STPE (Start Time Prediction Error) threshold. Right: the y-axis is mAP averaged over STPE thresholds from 1s to 10s at AP depth reaching the maximum recall, which varies from 0.1 to 1 on the x-axis.

6 Conclusion and Future Work

In this paper, we conduct a solid study of the OAD problem. We design a new formulation and evaluation protocols that more appropriately assess the performance of OAD models. We propose three methods to improve the ability of OAD models to detect actions in a timely manner, and extensive experiments demonstrate the effectiveness of our approach. The overall performance still leaves large room for improvement, which confirms the difficulty of OAD, and we hope this problem will attract more attention from the community in the near future.

