ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos

04/15/2020 ∙ by Guillaume Vaudaux-Ruth, et al. ∙ Sorbonne Université, ONERA

Summarizing video content is an important task in many applications. This task can be defined as the computation of the ordered list of actions present in a video. Such a list could be extracted using action detection algorithms. However, it is not necessary to determine the temporal boundaries of actions to know that they occur. Moreover, localizing precise boundaries usually requires dense video analysis to be effective. In this work, we propose to directly compute this ordered list by sparsely browsing the video and selecting one frame per action instance, a task known as action spotting in the literature. To do this, we propose ActionSpotter, a spotting algorithm that takes advantage of Deep Reinforcement Learning to efficiently spot actions while adapting its video browsing speed, without additional supervision. Experiments performed on the THUMOS14 and ActivityNet datasets show that our framework outperforms state-of-the-art detection methods. In particular, the spotting mean Average Precision on THUMOS14 is significantly improved from 59.7% to 65.6% while skipping 23% of the video.




I Introduction

Many works address the analysis of actions in videos, as it enables many applications. For example, it can be used to index YouTube videos [12] or to control robots by gestures [30]. Evaluating the quality of actions [26] can help improve sports performance, and the segmentation of video streams is well suited to monitoring elderly people in their homes [4]. Action detection [40] is also an interesting subject since it aims at detecting action instances in untrimmed videos. Thus, in addition to the action class, action detection returns the precise start and end times of each action instance. It can be used for video surveillance by detecting accidents, thefts or fights, but also for the protection of people (e.g. fall detection).

Visual summarization of videos has also been widely studied, in particular to provide a smart scroll bar when streaming videos. The two most common frameworks are key frame selection [22, 18, 5, 42] and key sub-shot selection [24, 25]. Both frameworks are classically unsupervised and aim at finding clusters that best describe all frames.

Fig. 1: Overview of the proposed spot frame extraction framework. It extracts one spot frame per action occurrence (in green in the figure) by sparsely browsing the video.

However, these techniques are not optimal for summarizing the actions in videos. Indeed, it can be very difficult to find precise action boundaries, as is done in action detection. Furthermore, these boundaries are not necessary to know that an action occurs. Moreover, a visual summary does not necessarily reflect the number of actions. In fact, visual summarization [22, 18, 5, 25, 42] is intended to extract salient or key images from a video stream. Thus, an action composed of several visual parts will be described using several key frames, which makes it impossible to count action instances.

To address these limitations, this paper focuses on action spotting [1], which consists in producing an ordered list of action instances in the video. Specifically, [1] defines action spotting as "the process of finding any temporal occurrence of an action while observing as little as possible from a video". To this end, we propose to select one frame per action instance, as shown in Figure 1.

To the best of our knowledge, the spotting task has only recently appeared in the literature: Alwassel et al. [1] use it as pre-processing for detection, while Bhardwaj et al. [2] and Wu et al. [37] use it to produce accurate classification.

In this work, we focus specifically on the task of extracting spot frames (one frame per action instance) and propose a measure to quantify the quality of the set of spot frames. This measure is based on the mean Average Precision (mAP) used in the AVA dataset toolbox [12] and has been adapted for action spotting. The script implementing this measure, which estimates spotting performance, is available in the supplementary material.

In order to perform fully supervised action spotting, [1] introduced human trajectory annotations. To free ourselves from these expensive annotations, we introduce ActionSpotter, a reinforcement learning based algorithm that requires only detection annotations to spot actions, and can handle videos containing multiple classes of action with multiple occurrences.

Relying solely on detection annotations instead of trajectory annotations [1] is not only less expensive, but also allows us to use common annotations provided in any action detection dataset and to compare the spots produced by ActionSpotter with those extracted from action detectors. As a result, ActionSpotter outperforms the best state-of-the-art detectors, which densely explore videos whereas ActionSpotter is sparse.

However, such learning is not trivial, as it requires combining deep learning with an advanced reinforcement learning algorithm. This step was bypassed in [1] thanks to expensive trajectory annotations. Thus, ActionSpotter is the first practical action spotting algorithm that handles multiple action classes and simple annotations.

To summarize, our contributions are as follows:

  • We define the action spotting task and a corresponding metric (the script is released in attachment).

  • We propose ActionSpotter, a reinforcement learning architecture extracting spot frames while observing as few frames as possible. It is based on the state-of-the-art actor-critic architecture [13] to efficiently learn the policy.

  • We show that this architecture is more relevant for action spotting than a post-processing of state-of-the-art detectors.

Our framework is presented in Section III after related works in Section II. Then, experiments on THUMOS14 and ActivityNet are presented in Section IV before the conclusion.

Fig. 2: Overall description of our pipeline. In a first stage, the CNN backbone encodes the frame (or non-overlapping chunk of frames) into a feature vector which is then forwarded to a GRU layer. The resulting hidden state vector is then individually processed by (SF), (CL), (BROW) and (crit). The (SF) stage decides whether to turn the current frame into a spot frame or to skip it. The (CL) stage predicts the action class related to the spot frame and the (BROW) stage outputs the next video frame to look at. The (crit) stage is only used to ensure better convergence in the reinforcement learning framework.

II Related works

Algorithms based on deep reinforcement learning have been proposed in the literature to quickly browse videos, especially for fast video classification [38, 36], early video detection [23, 10] or action detection [40, 1].

For example, [38] achieves an impressive result, producing a state-of-the-art classifier by exploring only 8 frames per video on average. However, this is only possible thanks to the prior that videos contain a single action (or even object): in this context, a single good image could be enough to decide the class of the whole video.

In detection or spotting, as this assumption does not hold, performance decreases quickly when frames are skipped. This is particularly evident for short actions in long videos that can be completely missed by a greedy browsing. Typically, Giancola et al. [11] introduce a benchmark for action spotting in soccer videos and propose an action spotting method based on a classification of video slices. However, as actions are very short, they process all the frames of the video to perform the spotting. Only [40] tackles detection using few images.

In this paper, we rely on reinforcement learning, but for another reason: spot images are not provided by the ground truth which is only composed of temporal segments. Thus, even though [38] also uses an architecture based on reinforcement learning, the goal of our work is different, our objective being to produce a good spotting. Such an idea can be found for object spotting [29] in 2D images, but we believe we are the first to apply it to temporal action spotting.

Moreover, our framework balances the trade-off between accuracy and frame skipping and, in contrast to [40, 38, 36, 1], ActionSpotter processes videos online. Indeed, we show that using the current frame and a memory of previous ones is sufficient to take correct decisions.

These considerations lead to a practical spotting algorithm that does not rely on detection or segmentation. This property is interesting because detectors have to tackle a much more difficult task: they must find the starting and ending points of actions, which are often ambiguously defined.

The article closest to our work is [1] where Alwassel et al. propose an algorithm that produces one spot per video. Their process is fully supervised and is done by learning the trajectories made by a human during the exploration of the videos. The authors argue that the use of supervised trajectories during training is much more direct and simple than reinforcement learning. However, this is an important limitation of this method as it requires a lot of human acquisition to obtain the browsing strategies of the videos. Moreover, [1] does not consider the spot frame extraction for itself but only as a first step for action detection and thus does not present spotting results.

III Method

III-A Action spotting and proposed evaluation metric

In this section, we present the proposed methodology for action spotting. As presented in introduction, the goal is to browse a video in order to select spot frames summarizing human activity in videos (see Figure 1). We are thus interested in optimizing the quality of spot frames (described below), but also the proportion of skipped frames, called the skip ratio.

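Spot Frames: If $C$ is a predefined set of action classes and $v$ a video sequence of $T$ frames (or frame chunks) containing $N$ action occurrences of classes $c_1, \dots, c_N \in C$ localized at segments $[s_1, e_1], \dots, [s_N, e_N]$, then our goal is to produce a set $S$ of $N$ spot frames/likelihoods/labels $(t_i, p_i, l_i)$ such that $t_i \in [s_i, e_i]$ and $l_i = c_i$. During the detection step, only spot frames with a likelihood $p_i$ such that $p_i > \tau$, for a detection threshold $\tau$, are retained.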

Evaluation Metric: In order to propose an unbiased metric reflecting the quality of the extracted spot frames, we build it on the basis of the state-of-the-art metric used in object detection [6] or temporal action localization [7]: the mean Average Precision (mAP). This metric is for example used on Pascal, THUMOS or AVA challenge (a derived version is used in MS COCO). Thus, we adapt this metric to action spotting and propose a new evaluation script publicly available to ensure a fair evaluation.

To compute this metric, spot frames are sorted in decreasing order of likelihood. Then, the intersections between the timestamps of the spot frames and the ground truth segments are computed iteratively. A spot frame is flagged as a correct detection if and only if its timestamp intersects a ground truth segment, it is classified with the correct label and it is the first to match this ground truth segment. A spot frame that does not match any ground truth segment, or is not the first to match one, is a false alarm. Finally, a ground truth segment that is not matched by any spot frame corresponds to a missed detection. This way, each rank in the list of sorted spot frames corresponds to a precision/recall point. The area under the curve spanned by these points, averaged over all activity classes, is returned as the quality of the spotting (mAP).
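
The matching procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the released evaluation script: the function names `spotting_average_precision` and `spotting_map`, the tuple layouts, and the non-interpolated rectangle rule used for the area under the precision/recall curve are all choices made here for the example.

```python
def spotting_average_precision(spots, segments):
    """Average precision for ONE action class (hypothetical helper).

    spots:    list of (video_id, timestamp, score) predicted for this class
    segments: list of (video_id, start, end) ground-truth segments of this class
    """
    if not segments:
        return 0.0
    matched = set()            # ground-truth segments already claimed by a spot
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    # Spot frames are processed in decreasing order of likelihood.
    for vid, t, _score in sorted(spots, key=lambda s: -s[2]):
        hit = next(
            (i for i, (gv, start, end) in enumerate(segments)
             if gv == vid and start <= t <= end and i not in matched),
            None,
        )
        if hit is None:
            fp += 1            # no segment, or segment already matched: false alarm
        else:
            matched.add(hit)   # first spot to match this segment: correct detection
            tp += 1
        precision = tp / (tp + fp)
        recall = tp / len(segments)
        ap += precision * (recall - prev_recall)   # assumed: no interpolation
        prev_recall = recall
    return ap


def spotting_map(spots_by_class, segments_by_class):
    """Mean of the per-class APs over all classes that have ground truth."""
    classes = list(segments_by_class)
    return sum(
        spotting_average_precision(spots_by_class.get(c, []), segments_by_class[c])
        for c in classes
    ) / len(classes)
```

With two ground-truth segments and three spots, the second of which falls in an already matched segment, the second spot counts as a false alarm and lowers the precision reached by the third one.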

To the best of our knowledge, we are the first to focus on the quality of a set of spot frames for itself.

III-B ActionSpotter: Actor Critic based semantic spot frame extractor

We designed a pipeline called ActionSpotter containing three networks that work together to both browse video frames in an online way and extract spot frames reflecting human activity. The overall state of our pipeline at step $n$ is given by three elements: the current frame $x_{t_n}$ (where $t_n$ is the index of the frame visited at step $n$), a memory $h_n$ and a set of spot frames/likelihoods/labels $S_n$. Importantly, frames are read in the order of the video stream and each frame can only be used once.

Memory: At step $n$, the frame (or frame chunk) $x_{t_n}$ is forwarded to a CNN-based backbone network, which extracts a feature vector $f_n$ containing spatial information. Any state-of-the-art backbone network can be employed for this stage. Then, a Gated Recurrent Unit (GRU) layer is used to encode temporal information. It takes as inputs the feature vector $f_n$ and the previous hidden state $h_{n-1}$ and produces the current hidden state $h_n$, seen as a latent vector that contains the memory of the previously viewed frames: $h_n = \mathrm{GRU}(f_n, h_{n-1})$.

Classification Network: Then, the classification network CL reads the current hidden state $h_n$ and produces a probability distribution $\mathrm{CL}(h_n)$ over action classes. The predicted action label $l_n$ is then the class with the highest probability under $\mathrm{CL}(h_n)$.

Spot Frame Selector Agent: The memory $h_n$ is also forwarded to the Spot Frame Selector Agent SF, which produces a likelihood for the current frame to be a good spot frame: $p_n = \mathrm{SF}(h_n)$. Formally, the output of our algorithm is updated as $S_n = S_{n-1} \cup \{(t_n, p_n, l_n)\}$, where $t_n$ is the index of the current frame and $l_n$ its predicted label. During testing, only spot frames with a likelihood greater than a detection threshold $\tau$ (constant for all the dataset's videos) are considered.

Browser Agent: In parallel, this memory is also forwarded to the Browser Agent BROW, which decides the index $t_{n+1}$ of the next frame to visit from the current memory $h_n$, i.e. $t_{n+1} = \mathrm{BROW}(h_n)$.

Skip ratio: Importantly, our pipeline does not go through all the images. Let $T$ be the number of frames in the video and $K$ the number of steps needed to process it (i.e. the number of frames actually visited); the ratio of skipped frames is then defined as $1 - K/T$.

Global dynamic: The overall hybrid policy, classically named $\pi$ in reinforcement learning, is the combination of the browser, the spot frame selector and the classification network: $\pi = (\mathrm{BROW}, \mathrm{SF}, \mathrm{CL})$.


Figure 2 gives an illustration of the overall framework.
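
To make the online dynamic concrete, here is a minimal sketch of the browsing loop and of the resulting skip ratio. The names `browse`, `step_fn` and `toy_step` are hypothetical, and `step_fn` is only a stand-in for the backbone, GRU, SF, CL and BROW stages taken together; recording a (frame index, likelihood, label) triple at every visited frame is also an assumption of this sketch.

```python
def browse(num_frames, step_fn):
    """Single forward pass over a video; each frame is read at most once.

    step_fn(t, memory) -> (memory, likelihood, label, jump) is a hypothetical
    stand-in for the learned networks. `jump` is the number of frames to move
    forward (1 = go to the next frame).
    """
    t, memory, visited = 0, None, 0
    spots = []
    while t < num_frames:
        memory, likelihood, label, jump = step_fn(t, memory)
        spots.append((t, likelihood, label))   # candidate spot for this frame
        visited += 1
        t += max(1, jump)                      # the stream only moves forward
    skip_ratio = 1.0 - visited / num_frames    # fraction of frames never read
    return spots, skip_ratio


def toy_step(t, memory):
    """Toy policy: constant likelihood, always skips one frame."""
    memory = (memory or 0) + 1                 # trivial "memory": a step counter
    return memory, 0.5, "background", 2


spots, skip_ratio = browse(10, toy_step)
```

In this toy run, frames 0, 2, 4, 6 and 8 are visited, so half of the frames are skipped. At test time, the recorded candidates would then be filtered with the detection threshold.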

III-C Training and objectives

The training goal is to select relevant weights for the policy $\pi$ (i.e. for its component networks) so that, by processing the video stream $v$, the policy provides an accurate set of spot frames $S$ while skipping a large number of frames (i.e. with a skip ratio as high as possible). It is important to note that the browser and the spot frame selection network are not learned in a supervised way, removing the need for specific annotations. Instead, a state-of-the-art reinforcement learning algorithm is used with a specific reward function quantifying the relevance of each step under policy $\pi$ (both for accuracy and browsing ratio).

Reward. As we use mAP to evaluate model performance, we choose to introduce a reward directly linked with this metric: the global policy has to maximize the final mAP of the video being processed (plus an entropy term which will be explained later). Moreover, we can easily find a trade-off between the efficiency of video browsing and the accuracy of spot frame selection by discounting the final mAP by $\gamma^{K}$, where $\gamma$ is the discount factor and $K$ the number of explored frames.

According to reward shaping theory [19], this final reward can be hybridized with a potential-based shaping to help reinforcement learning converge without changing the optimal policy. A natural shaping potential is the mAP after each step. Thus, our local reward at step $n$ is just the difference of mAP between step $n$ and step $n-1$ under policy $\pi$ (plus the entropy term).


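Then, the cumulative discounted reward is obtained by summing these local rewards, each weighted by an increasing power of the discount factor $\gamma$.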


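By omitting entropy, we can directly verify that the cumulative discounted reward telescopes to $\gamma^{K}\,\mathrm{mAP}(S_K)$, where $K$ is the number of explored frames and $S_K$ the final set of spot frames, exactly as desired, since the mAP of the initial empty set of spot frames is zero. This invariance under shaping holds even with entropy, as shown by [19]. It is also important to note that this reward does not depend on the detection threshold, as the mAP does not.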
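
The telescoping argument can be checked numerically. The sketch below assumes the standard potential-based form of [19] with the mAP as potential, i.e. a local reward $\gamma\,\mathrm{mAP}_n - \mathrm{mAP}_{n-1}$; the factor $\gamma$ on the new potential, the function names and the toy mAP trajectory are assumptions made for this illustration, and the entropy term is omitted.

```python
def shaped_rewards(map_values, gamma):
    """Local rewards from a trajectory of mAP values.

    map_values[n] is the mAP after step n, with map_values[0] = 0 for the
    empty spot set. Assumed form: r_n = gamma * mAP_n - mAP_{n-1}.
    """
    return [gamma * map_values[n] - map_values[n - 1]
            for n in range(1, len(map_values))]


def discounted_return(rewards, gamma):
    """Sum of rewards weighted by increasing powers of the discount factor."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


# Toy trajectory (made-up values): the mAP only changes when a spot is added.
map_values = [0.0, 0.0, 0.3, 0.3, 0.65]
gamma = 0.98
num_steps = len(map_values) - 1

total = discounted_return(shaped_rewards(map_values, gamma), gamma)
target = gamma ** num_steps * map_values[-1]   # discounted final mAP
```

Here `total` and `target` agree up to floating-point error: every intermediate mAP value cancels, and only the discounted final mAP remains.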

Actor-Critic optimization. The Policy Gradient method is based on the total expected reward, and therefore requires a long sequence of actions to update the policy. Good and bad actions are then averaged, which can introduce convergence issues. The Actor-Critic approach [13] is known to avoid this issue by evaluating each action independently. It uses two models, named actor and critic.

The actor is straightforwardly trained to find the policy that maximizes the expected return:


The critic measures how good the policy is (value-based) and produces an estimate of the value function, which is the expected discounted reward. Thus, the loss function linked to the critic is defined as:


In reinforcement learning, it is crucial to balance exploration and exploitation. Following the work of Haarnoja et al. [13], it is possible to integrate this balance directly into the reward, by adding an entropy penalty which forces the actor to uniformly explore states with equal rewards. Thus, a penalty $\alpha H$ is added in equation 2, where $\alpha$ is the temperature parameter, which balances exploration and exploitation and is automatically adjusted following [13], and $H$ is the entropy function.

In our algorithm, the actor is the combination of the Browser Agent BROW and the Spot Frame Selector Agent SF. Thus, $H$ is the entropy applied to the distributions over the choices that update the current state, i.e. the output distributions of BROW and SF.
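
As a rough illustration of how actor and critic interact, the sketch below computes reward-to-go returns, a squared-error critic loss and an advantage-weighted actor loss on a toy trajectory. It is a generic advantage actor-critic sketch, not the exact losses of this paper: the squared error, the advantage baseline, the omission of the entropy term and all function names are assumptions made here.

```python
def discounted_returns(rewards, gamma):
    """Reward-to-go at every step: G_n = r_n + gamma * G_{n+1}."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def critic_loss(values, returns):
    """Assumed critic objective: mean squared error to the observed returns."""
    return sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)


def advantages(values, returns):
    """How much better each step turned out than the critic expected."""
    return [g - v for v, g in zip(values, returns)]


def actor_loss(log_probs, advs):
    """Policy-gradient surrogate: raise log-probabilities of good actions."""
    return -sum(lp * a for lp, a in zip(log_probs, advs)) / len(advs)


# Toy trajectory with made-up numbers.
returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
values = [1.0, 1.0, 1.0]            # hypothetical critic predictions
advs = advantages(values, returns)
```

Because each action is weighted by its own advantage rather than by the total return of the episode, good and bad actions within one trajectory are no longer averaged together.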

On the other hand, the classification network CL is trained in a supervised way, using Cross-Entropy (CE). Thus,


Final loss: Combining previous losses, our final objective is to minimize:


As the objective is non-differentiable, we use REINFORCE [35] to derive the expected gradient:


We can then approximate this equation by using Monte Carlo sampling and finally use stochastic gradient descent to minimize our final objective.
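
For reference, the generic score-function (REINFORCE) identity and its Monte Carlo approximation can be written as below. This is the textbook undiscounted form, given only as a reminder and not as the exact expression used in this work; the symbols $\theta$ (policy parameters), $s_n$ and $a_n$ (state and sampled action at step $n$), $r_n$ (local reward), $G_n$ (reward-to-go) and $B$ (number of sampled trajectories) are notation assumed here.

```latex
\begin{align}
G_n &= \sum_{k \ge n} r_k, \\
\nabla_\theta \, \mathbb{E}_{\pi_\theta}\Big[\sum_{n} r_n\Big]
  &= \mathbb{E}_{\pi_\theta}\Big[\sum_{n} G_n \, \nabla_\theta \log \pi_\theta(a_n \mid s_n)\Big] \\
  &\approx \frac{1}{B} \sum_{b=1}^{B} \sum_{n} G_n^{(b)} \, \nabla_\theta \log \pi_\theta\big(a_n^{(b)} \mid s_n^{(b)}\big).
\end{align}
```

The last line is the quantity that sampling actions from the policy makes computable: only the log-probabilities of the sampled actions need to be differentiated, not the reward itself.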

Approach                    Detection mAP@IoU                       Spotting mAP
                            0.1     0.2     0.3     0.4     0.5
Glimpses [40]               48.9    44.0    36.0    26.4    17.1    -
SMS [41]                    51.0    45.2    36.5    27.8    17.8    -
M-CNN [33]                  47.7    43.5    36.3    28.7    19.0    41.2
CDC [32]                    -       -       41.3    30.7    24.7    31.5
TURN [8]                    54.0    50.9    44.1    34.9    25.6    44.8
R-C3D [39]                  54.5    51.5    44.8    35.6    28.9    52.2
SSN [43]                    66.0    59.4    51.9    41.0    29.8    -
A-Search [1]                -       -       51.8    42.4    30.8    -
CBR [9]                     60.1    56.7    50.1    41.3    31.0    50.1
BSN + UNet [21]             -       -       53.5    45.0    36.9    -
Re-thinking F-RCNN [31]     59.8    57.1    53.2    48.5    42.5    -
D-SSAD [16]                 -       -       60.2    54.1    44.2    59.7
Ours (TSN backbone)         -       -       -       -       -       62.4
Ours (I3D backbone)         -       -       -       -       -       65.6

TABLE I: Results on THUMOS14 validation set. Second column: state-of-the-art detector results according to the detection metric, computed with different IoU thresholds ranging from 0.1 to 0.5. Last column: mAP results for the spotting task.

IV Experiments

This section is structured to highlight the relevance of ActionSpotter for accurate action spotting. The main experiment shows that ActionSpotter offers better sets of spot frames than state-of-the-art detectors (a comparison with a skip ratio equivalent to [40] is also proposed), highlighting that action spotting is a challenging task. Then, we compare ActionSpotter to several baselines to highlight the relevance of reinforcement learning for this asymmetrical problem where the ideal output is not defined in the ground truth. Finally, some additional experiments reveal that ActionSpotter is able to balance accuracy and skip ratio, simply by modifying the discount factor associated with reinforcement learning.

IV-A Datasets

We evaluate our approach on the well-known THUMOS14 [17] and ActivityNet [15] datasets.

The THUMOS14 dataset is composed of 101 activity classes for action recognition and a subset of 20 classes for action detection. The validation and testing sets contain respectively 200 and 212 untrimmed videos temporally annotated with action instances. We adopt the classical train/test setting of the THUMOS14 protocol: training is done on the 20-class validation set and evaluation on the testing set, the original training set not being suited for detection.

The ActivityNet v1.2 dataset contains 9,682 videos in 100 classes collected from YouTube. The dataset is divided into three subsets: 4,819 videos for training, 2,383 for validation and 2,480 for testing. Action spotting results on ActivityNet are reported on the validation set, as the evaluation server does not compute our spotting metric.

IV-B Implementation details

As previously mentioned, any type of backbone network can be used to encode local information from images. In order to have the same backbones as the state-of-the-art action detectors, we rely on classical TSN [34] and I3D [3] feature extractors. This setting allows good reproducibility as such features are provided by [28] and [21]. These techniques operate on both RGB frames and optical-flow fields to capture appearance and motion information. TSN operates on individual video frames while I3D features are extracted from non-overlapping 16-frame video slices. For this second technique, as each feature represents a non-overlapping slice of frames, the Browser Agent BROW does not process individual frames but frame slices. This does not change the skip ratio, because it is equivalent to counting visited slices among all slices instead of visited frames among all frames.

For the memory, we use a one-layer GRU with respectively 2,048 and 400 hidden units for THUMOS14 and ActivityNet. BROW, SF, CL and crit each have 3 linear layers with ReLU activation functions. The first two layers have the same number of hidden units as the GRU layer and the last one has the size of the network output.

In our main setting, BROW can choose to move to the next frame, skip one frame or skip three frames. At training time, in order to approximate Eq. 8, the actions performed by BROW and SF are sampled from a categorical distribution parameterized by their respective logits. At testing time, the action performed is the one with the highest likelihood for BROW (and for CL). SF directly outputs a likelihood.
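
The difference between the two regimes can be sketched with the standard library alone. The helper names `softmax` and `select_action`, the `JUMPS` tuple and the example logits are illustrative assumptions; in a PyTorch implementation the same role would typically be played by a categorical distribution object.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, training, rng=random):
    """Sample during training (exploration), take the arg max at test time."""
    probs = softmax(logits)
    if training:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=probs.__getitem__)


# Assumed encoding of the three browsing choices as forward jumps:
# next frame (+1), skip one frame (+2), skip three frames (+4).
JUMPS = (1, 2, 4)

browser_logits = [0.1, 2.0, -1.0]           # made-up logits for illustration
test_action = select_action(browser_logits, training=False)
train_action = select_action(browser_logits, training=True, rng=random.Random(0))
```

Sampling keeps low-probability jumps reachable during training, which is what allows the Monte Carlo estimate of the gradient, while the arg max makes test-time browsing deterministic.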

We use PyTorch for implementation and Adam for optimization with an initial learning rate of

and a batch size of 32. Convergence is much faster when and

are trained alone for few epochs before starting the whole reinforcement learning process with


IV-C Comparison between ActionSpotter and detectors

We evaluate the performance of ActionSpotter on the THUMOS14 and ActivityNet datasets using our proposed spotting metric. Results are presented in Tables I and II. These tables also present results obtained by state-of-the-art detectors for the spotting task. We use published results or available code to obtain detection results, so there are no re-implementation issues. Detection results are transformed into spotting results by extracting the centers of the predicted segments.
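
The conversion from detections to spots amounts to a one-line transformation. In this sketch the function name `detections_to_spots` and the tuple layout (start, end, score, label) are assumptions made for illustration.

```python
def detections_to_spots(detections):
    """Replace each predicted segment by its temporal center.

    detections: list of (start, end, score, label)
    returns:    list of (timestamp, score, label)
    """
    return [((start + end) / 2.0, score, label)
            for start, end, score, label in detections]


spots = detections_to_spots([(10.0, 20.0, 0.8, "Diving"), (31.0, 33.0, 0.4, "Diving")])
```

The detector confidence is kept unchanged as the spot likelihood, so the ranking of predictions used by the spotting metric is the one produced by the detector.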

ActivityNet v1.2
Approach               Detection mAP@IoU                  Spotting mAP
                       0.5     0.75    0.95    Avg
W-TALC [28]            37.0    14.6    -       18.0       -
SSN-SW [43]            -       -       -       18.1       -
3C-Net [27]            37.2    23.7    9.2     21.7       -
FPTADC [14]            37.6    21.8    2.4     21.9       -
SSN-TAG [43]           39.2    25.3    5.4     25.9       55.4
BSN [21]               46.5    30.0    8.0     30.0       49.6
BMN [20]               50.1    34.8    8.3     33.85      55.3
Ours (I3D backbone)    -       -       -       -          60.2

TABLE II: Results on ActivityNet v1.2 validation set. The column Avg indicates the average mAP for IoU thresholds 0.5:0.05:0.95.

These experiments show that ActionSpotter significantly outperforms state-of-the-art detectors for the spotting task: the mAP of the latest action detector D-SSAD [16] is improved from 59.7% to 65.6% on the THUMOS14 dataset.

It can be pointed out that FrameGlimpses [40] offers very low performance but has a skip ratio of 98%. Currently, ActionSpotter performance also decreases from 62.4% to 50.9% when using TSN backbone and increasing the skip ratio to 98%. This shows that it is difficult to do accurate spotting (and even more detection) with only 2% of the frames, which is not the case for action classification [38].

One may wonder why spotting performance is not similar to detection performance with a low Intersection over Union (IoU) value. In fact, this is due to a difference in the matching mechanism. In the case of detection, predicted segments match with ground-truth segments according to the best IoU while, in the case of spotting, the predicted spots match the ground truth according to their scores (which makes sense since IoU no longer exists). Thus, for the detection task, it is better to predict segments with good localization and random confidence than to produce one segment per action with good confidence but coarse localization. Conversely, only the confidence score matters in spotting. This difference between the matching processes induces some changes in the ranking of methods between detection and spotting: while CDC [32] is more effective in detection than M-CNN [33], the opposite holds for spotting.

Thus, spotting results obtained with detectors cannot be easily compared with those of spotting algorithms as their primary goal is not the same. However, the large gap of performance between our method and detection methods shows that it is not sufficient to post-process detector results to have an optimal spotting, and that it is necessary to have specific algorithms such as ActionSpotter.

IV-D Ablation study and comparison with baselines

Previous experiments show that action spotting requires a specific approach. Moreover, spotting is an asymmetric problem since any frame of a ground truth segment can be used as a spot frame and, therefore, optimal spot frames are not defined by these segments. Based on this, Reinforcement Learning appears to be suited to tackle this problem and we show: (i) comparative results between supervised and reinforced spotting algorithm, (ii) an ablated version of ActionSpotter. Results are presented in table III.

Method                                     mAP (%)
Naive Segmentation                         32
Multi-Task Segmentation                    43
Supervised ActionSpotter                   52
No Memory ActionSpotter                    45
ActionSpotter (memory + reinforcement)     65

TABLE III: Comparison between ActionSpotter and other spotting algorithms on THUMOS14 validation set.

Naive segmentation is a simple semantic CNN-based segmentation (based on I3D + 2 layers) where the ground truth is constructed as follows: the center of each action segment is labeled with its action class and all other frames are labeled as background. This spotting-oriented baseline only reaches 32% mAP.
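
The construction of this baseline's ground truth can be sketched as follows. The function name `center_labels`, the frame-index segment format and the `"background"` label string are assumptions made for this illustration.

```python
def center_labels(num_frames, segments, background="background"):
    """Per-frame targets: only the center frame of each segment gets its class.

    segments: list of (start_frame, end_frame, action_class), inclusive bounds
    """
    labels = [background] * num_frames
    for start, end, action_class in segments:
        center = (start + end) // 2
        if 0 <= center < num_frames:
            labels[center] = action_class
    return labels


labels = center_labels(10, [(2, 6, "Diving")])
```

In this toy example a single frame out of ten carries an action label, which shows how strongly imbalanced the resulting per-frame classification problem is.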

A two-task learning process is then used as a multi-task segmentation: the first task learns the centers of segments and the second predicts the action class. This second baseline deals with a more balanced problem, which helps stabilize gradients, and reaches 43% mAP.

Fig. 3: Example of ActionSpotter outputs. The figure displays outputs of ActionSpotter on THUMOS14 validation videos: columns represent the frames and, the 3 rows represent respectively ground truth, explored frames and selected spot frames.

Then, we train the ActionSpotter architecture on the supervised segmentation of action centers (instead of using reinforcement). In this setting, performance reaches 52.3% mAP, which is higher than the other baselines but well below ActionSpotter.

Finally, we consider ActionSpotter (with reinforcement) by removing its memory. Performance drops to 45%: as expected, without memory, reinforcement and supervision are practically the same since the decision is frame based.

These results highlight the fact that it is not easy to supervise action spotting. Actually, it is preferable to let the network select the easiest spot frame rather than imposing them. That’s exactly what reinforcement does as the reward is only based on the final output. Figure 3 shows the qualitative results on six videos: in most cases, the spot frames are not extracted in the middle of the actions. ActionSpotter adapts itself to the difficulty of the processed videos, as shown by the fact that the skip ratio is higher in areas without actions.

In addition, ActionSpotter allows direct optimization of the mAP score using a state-of-the-art shaping technique. Indeed, this shaping technique is an important component, as performance drops to 47% without it. ActionSpotter is therefore based on 3 key ideas: end-to-end learning, reinforcement to tackle asymmetry and shaping to allow better convergence.

IV-E Trade-off between accuracy and skip ratio

ActionSpotter manages the lack of spot frame ground truth and tackles the spotting problem by optimizing the mAP. Moreover, it makes it possible to balance the mAP and the browser skip ratio. We report in Table IV the impact of the discount factor $\gamma$ on policy learning and thus on the spotting ability. More precisely, we report the mAP and the skip ratio for different values of $\gamma$.

Discount factor $\gamma$    1.0     0.99    0.98    0.96    0.95
mAP (%)                     65.6    64.3    63.4    61.4    61.3
skip ratio (%)              23      29      53      51      51

TABLE IV: Performance on THUMOS14 validation set according to the discount factor $\gamma$.

As expected, for values of $\gamma$ close to 1, the lower the discount factor, the higher the skip ratio (convergence issues appear quickly when $\gamma$ decreases). But, as a trade-off exists between speed and accuracy, accuracy decreases too. Nevertheless, when the skip ratio increases from 23% to 53%, the mAP decreases by only about 2 points. For $\gamma = 0.98$, our algorithm is still better than the best detector for spotting while using only 47% of the frames. Currently, with the presented action space setting that puts a lot of emphasis on mAP, it is difficult to go beyond a 53% skip ratio. But, by adding more actions, it is possible to skip more frames, even if mAP decreases significantly. For example, a mAP of 50.9% is obtained using only 2% of the frames (same skip ratio as [40]). But there is no point in skipping so many frames if it leads to such a low performance level. As mentioned before, it seems difficult to perform accurate detection or spotting with 2% of the frames, unlike classification [38], which assumes only one action.

Thus, while skipping many frames degrades performance, skipping some frames improves it. Indeed, by removing the browser and processing all video frames, the mAP decreases from 65.6% to 64.0%. Using the browser agent leads to a 1.6% improvement in mAP while using 23% fewer frames.

It is interesting to note that we also train our pipeline without the Browser but with uniform subsampling. Precisely, we subsample videos with different sampling rates and observe the mAP. Figure 4 shows that ActionSpotter produces better results regardless of the skip ratio.
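
For comparison, the uniform sub-sampling baseline reads frames at a fixed stride, so its skip ratio is set in advance rather than adapted to the content. The helper below is a minimal sketch; the name `uniform_subsample` and the stride parameterization are assumptions made here.

```python
def uniform_subsample(num_frames, stride):
    """Visit every `stride`-th frame and report the resulting skip ratio."""
    visited = list(range(0, num_frames, stride))
    skip_ratio = 1.0 - len(visited) / num_frames
    return visited, skip_ratio


visited, skip_ratio = uniform_subsample(100, 4)
```

With a stride of 4, 25 of the 100 frames are read whatever the video contains, whereas the learned browser can read densely around actions and sparsely elsewhere for the same overall budget.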

Fig. 4: Benefit of reinforcement browsing: mAP according to the percentage of viewed frames for ActionSpotter algorithm and uniform sub-sampling.

V Conclusion

We propose an algorithm called ActionSpotter able to compute accurate semantic video summaries produced by collecting one frame per action instance (problem known as action spotting). Our algorithm is based on state-of-the-art reinforcement algorithm and is able to tackle action spotting in a streaming context (frames are not stored) while skipping some video frames.

To evaluate the proposed algorithm, we introduce a metric that quantifies the accuracy of the action spotting. It is based on the mean Average Precision (mAP) conventionally used in detection. The main result of this paper is that adapting state-of-the-art action detectors into spotting algorithms is less effective than learning spotting end-to-end: ActionSpotter reaches 65% mAP (while skipping 23% of video frames), whereas state-of-the-art detectors only reach 59% mAP (with a dense exploration).

Indeed, unlike action detectors that have to deal with ambiguous temporal boundaries, ActionSpotter focuses on extracting one frame per instance thanks to reinforcement learning. Thus, ActionSpotter shows that action spotting requires specific approaches and that future work should continue to take advantage of its specificity. Typically, in the future, we plan to extend ActionSpotter to weakly labeled videos using loss derived from reinforcement that manages semi-supervised learning.


  • [1] H. Alwassel, F. Caba Heilbron, and B. Ghanem (2018) Action search: spotting actions in videos and its application to temporal action localization. In

    Proceedings of the European Conference on Computer Vision (ECCV)

    pp. 251–266. Cited by: §I, §I, §I, §I, §I, §II, §II, §II, TABLE I.
  • [2] S. Bhardwaj, M. Srinivasan, and M. M. Khapra (2019) Efficient video classification using fewer frames. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 354–363. Cited by: §I.
  • [3] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the kinetics dataset. CoRR abs/1705.07750. External Links: Link, 1705.07750 Cited by: §IV-B.
  • [4] A. Chan-Hon-Tong, C. Achard, and L. Lucat (2014) Simultaneous segmentation and classification of human actions in video streams using deeply optimized hough transform. Pattern Recognition 47 (12), pp. 3807–3818. Cited by: §I.
  • [5] N. Ejaz, I. Mehmood, and S. W. Baik (2013) Efficient visual attention based framework for extracting key frames from videos. Signal Processing: Image Communication 28 (1), pp. 34–44. Cited by: §I, §I.
  • [6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2010-06) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §III-A.
  • [7] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970. Cited by: §III-A.
  • [8] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia (2017-10) TURN tap: temporal unit regression network for temporal action proposals. In The IEEE International Conference on Computer Vision (ICCV), Cited by: TABLE I.
  • [9] J. Gao, Z. Yang, and R. Nevatia (2017) Cascaded boundary regression for temporal action detection. In BMVC, Cited by: TABLE I.
  • [10] M. Gao, M. Xu, L. S. Davis, R. Socher, and C. Xiong (2019) Startnet: online detection of action start in untrimmed videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5542–5551. Cited by: §II.
  • [11] S. Giancola, M. Amine, T. Dghaily, and B. Ghanem (2018) SoccerNet: a scalable dataset for action spotting in soccer videos. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1792–179210. Cited by: §II.
  • [12] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056. Cited by: §I, §I.
  • [13] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: 2nd item, §III-C, §III-C.
  • [14] J. He, Y. Song, and H. Jiang (2020-02) Bi-direction feature pyramid temporal action detection network. pp. 889–901. External Links: ISBN 978-3-030-41403-0, Document Cited by: TABLE II.
  • [15] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles (2015-06) ActivityNet: a large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 961–970. External Links: Document, ISSN 1063-6919 Cited by: §IV-A.
  • [16] Y. Huang, Q. Dai, and Y. Lu (2019) Decoupling localization and classification in single shot temporal action detection. CoRR abs/1904.07442. External Links: Link, 1904.07442 Cited by: TABLE I, §IV-C.
  • [17] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar (2014) THUMOS challenge: action recognition with a large number of classes. Cited by: §IV-A.
  • [18] S. K. Kuanar, R. Panda, and A. S. Chowdhury (2013) Video key frame extraction through dynamic delaunay clustering with a structural constraint. Journal of Visual Communication and Image Representation 24 (7), pp. 1212–1227. Cited by: §I, §I.
  • [19] A. D. Laud (2004) Theory and application of reward shaping in reinforcement learning. Technical report Cited by: §III-C.
  • [20] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen (2019) BMN: boundary-matching network for temporal action proposal generation. CoRR abs/1907.09702. External Links: Link, 1907.09702 Cited by: TABLE II.
  • [21] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang (2018) BSN: boundary sensitive network for temporal action proposal generation. CoRR abs/1806.02964. External Links: Link, 1806.02964 Cited by: TABLE I, §IV-B, TABLE II.
  • [22] T. Liu, H. Zhang, and F. Qi (2003) A novel video key-frame-extraction algorithm based on perceived motion energy model. IEEE transactions on circuits and systems for video technology 13 (10), pp. 1006–1013. Cited by: §I, §I.
  • [23] S. Ma, L. Sigal, and S. Sclaroff (2016) Learning activity progression in lstms for activity detection and early detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1942–1950. Cited by: §II.
  • [24] I. Mademlis, A. Tefas, N. Nikolaidis, and I. Pitas (2016) Movie shot selection preserving narrative properties. In 2016 IEEE 18th International Workshop on Multimedia Signal Processing (MMSP), pp. 1–5. Cited by: §I.
  • [25] I. Mademlis, A. Tefas, and I. Pitas (2018) A salient dictionary learning framework for activity video summarization via key-frame extraction. Information Sciences 432, pp. 319–331. Cited by: §I, §I.
  • [26] M. Morel, C. Achard, R. Kulpa, and S. Dubuisson (2017) Automatic evaluation of sports motion: a generic computation of spatial and temporal errors. Image and Vision Computing 64, pp. 67–78. Cited by: §I.
  • [27] S. Narayan, H. Cholakkal, F. S. Khan, and L. Shao (2019) 3C-net: category count and center loss for weakly-supervised action localization. External Links: 1908.08216 Cited by: TABLE II.
  • [28] S. Paul, S. Roy, and A. K. Roy-Chowdhury (2018) W-TALC: weakly-supervised temporal activity localization and classification. CoRR abs/1807.10418. External Links: Link, 1807.10418 Cited by: §IV-B, TABLE II.
  • [29] H. Perreault, G. Bilodeau, N. Saunier, and M. Héritier (2020) SpotNet: self-attention multi-task network for object detection. arXiv preprint arXiv:2002.05540. Cited by: §II.
  • [30] S. S. Rautaray and A. Agrawal (2015) Vision based hand gesture recognition for human computer interaction: a survey. Artificial intelligence review 43 (1), pp. 1–54. Cited by: §I.
  • [31] B. Seybold, D. Ross, J. Deng, R. Sukthankar, S. Vijayanarasimhan, and Y. Chao (2018) Rethinking the faster r-cnn architecture for temporal action localization. In CVPR 2018, Cited by: TABLE I.
  • [32] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S. Chang (2017) CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR, Cited by: TABLE I, §IV-C.
  • [33] Z. Shou, D. Wang, and S. Chang (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR, Cited by: TABLE I, §IV-C.
  • [34] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 568–576. External Links: Link Cited by: §IV-B.
  • [35] R. S. Sutton and A. G. Barto (1998) Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA. External Links: ISBN 0262193981 Cited by: §III-C.
  • [36] W. Wu, D. He, X. Tan, S. Chen, and S. Wen (2019) Multi-agent reinforcement learning based frame sampling for effective untrimmed video recognition. arXiv preprint arXiv:1907.13369. Cited by: §II, §II.
  • [37] Z. Wu, C. Xiong, C. Ma, R. Socher, and L. S. Davis (2019-06) AdaFrame: adaptive frame selection for fast video recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  • [38] Z. Wu, C. Xiong, C. Ma, R. Socher, and L. S. Davis (2019) AdaFrame: adaptive frame selection for fast video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1278–1287. Cited by: §II, §II, §II, §II, §IV-C, §IV-E.
  • [39] H. Xu, A. Das, and K. Saenko (2017) R-C3D: region convolutional 3d network for temporal activity detection. CoRR abs/1703.07814. External Links: Link, 1703.07814 Cited by: TABLE I.
  • [40] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei (2016) End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2678–2687. Cited by: §I, §II, §II, §II, TABLE I, §IV-C, §IV-E, §IV.
  • [41] Z. Yuan, J. Stroud, T. Lu, and J. Deng (2017) Temporal action localization by structured maximal sums. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: TABLE I.
  • [42] K. Zhang, W. Chao, F. Sha, and K. Grauman (2016) Video summarization with long short-term memory. In European conference on computer vision, pp. 766–782. Cited by: §I, §I.
  • [43] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, D. Lin, and X. Tang (2017) Temporal action detection with structured segment networks. CoRR abs/1704.06228. External Links: Link, 1704.06228 Cited by: TABLE I, TABLE II.