A Self-Adaptive Proposal Model for Temporal Action Detection based on Reinforcement Learning

06/22/2017 ∙ by Jingjia Huang, et al. ∙ Peking University ∙ NetEase, Inc

Existing action detection algorithms usually generate action proposals through an extensive search over the video at multiple temporal scales, which incurs a huge computational overhead and deviates from the way humans perceive actions. We argue that detecting actions should naturally be a process of observation and refinement: observe the current window and refine the span of the attended window to cover true action regions. In this paper, we propose an active action proposal model that learns to find actions by continuously adjusting the temporal bounds in a self-adaptive way. The whole process can be viewed as an agent that is first placed at a random position in the video and then applies a sequence of transformations to the currently attended region to discover actions according to a learned policy. We use reinforcement learning, specifically the Deep Q-learning algorithm, to learn the agent's decision policy. In addition, we use a temporal pooling operation to extract a more effective feature representation for long temporal windows, and design a regression network to adjust the position offsets between the predicted results and the ground truth. Experimental results on THUMOS 2014 validate the effectiveness of the proposed approach, which achieves performance competitive with current action detection algorithms using far fewer proposals.


1 Introduction

Temporal action detection requires not only determining whether an action occurs in a video but also locating the temporal extent in which it occurs, which is a challenging problem for real-life long untrimmed videos. Most modern approaches (Shou et al., 2016; Zhu and Newsam, 2016; Xiong et al., 2017) solve the problem via a two-step pipeline: first, a set of class-independent action proposals is generated by running an action/background classifier over the video at multiple temporal scales; then the proposals are classified by a pre-trained action detector, and post-processing such as non-maximum suppression is applied. However, such an extensive search for action localization is unsatisfying in terms of both accuracy and computational efficiency. Just as a human detects an action by successively altering the span of the attended region to narrow the difference between the bounds of the current window and those of the true action region, an ideal algorithm should be a process of sequential, iterative observation and refinement that consumes as few search steps as possible.

In this paper, we propose a class-specific action detection model that learns to continuously adjust the current region to cover the groundtruth more precisely in a self-adaptive way. This is achieved by applying a sequence of transformations to a temporal window that is initially placed at a random position in the video and that eventually covers as much of the action region as possible. The sequence of transformations is decided by an agent that analyzes the content of the currently attended region and selects the next best action according to a learned policy, which is trained via reinforcement learning based on the Deep Q-Learning algorithm (Mnih et al., 2015). Unlike existing approaches that locate the action by following a fixed path, our method generates different search trajectories for different action instances, depending on the video content, the starting search position and the sequence of actions taken. As a result, the trained agent locates a single instance of an action in about 15 steps, which means that the model processes only about 15 successive regions of the video to explore an uncovered video segment; it is therefore far more computationally efficient than sliding-window-based approaches.

Our model draws inspiration from works that use reinforcement learning to build active models for object localization in images (Caicedo and Lazebnik, 2015; Jie et al., 2016; Bellver et al., 2016). However, we cannot process the video in the top-down manner that has proved effective for image object localization, because videos are usually too long (from hundreds to thousands of frames). Instead, we start the search from a randomly selected position in the video; the search terminates when an instance of an action has been found or the maximum number of transformation steps has been reached, and a new search then begins from a position away from the currently attended region. We incorporate a temporal pooling operation into the feature extraction process to better represent long video segments, and design a "jump" action to keep the agent from trapping itself in a region where no action occurs. We conducted a comprehensive experimental evaluation on the challenging THUMOS’14 dataset (Jiang et al., 2014), and the results demonstrate that the proposed method achieves competitive precision and recall with a small number of action proposals.

2 Related work

Action recognition has been studied for years, and a large amount of research has been done (Laptev and Lindeberg, 2003; Wang and Schmid, 2013; Simonyan and Zisserman, 2014; Tran et al., 2015). Early work often tackled the problem with hand-crafted visual features (Laptev and Lindeberg, 2003; Wang and Schmid, 2013). More recently, motivated by the success of deep learning on image analysis tasks, some approaches have introduced deep models, especially Convolutional Neural Networks (CNNs), to better exploit the spatio-temporal information contained in a video clip. Simonyan and Zisserman (2014) propose the two-stream network architecture, with one branch processing the RGB signal and the other processing the optical-flow signal. Tran et al. (2015) construct the C3D model, which applies 3D convolution directly to the spatio-temporal video volume and integrates appearance and motion cues for a better feature representation. Other efforts (Donahue et al., 2015; Yue-Hei Ng et al., 2015) attempt to combine frame-level CNN feature representations with long-range temporal structure to cope with input videos of long duration. To date, deep-learning-based approaches have achieved state-of-the-art performance.

Unlike action recognition, where actions are contained in a trimmed video clip and the aim is to predict the category, temporal action detection needs not only to classify the action but also to provide its temporal localization. Most existing approaches address the problem via a sliding-window strategy for candidate generation and focus on feature representation and classifier construction (Shou et al., 2016; Gaidon et al., 2013; Oneata et al., 2013; Yuan et al., 2016). Shou et al. (2016) use a multi-stage CNN detection network for action localization, in which background windows are first filtered out by a binary action/background classifier based on C3D features, and an action detection network that incorporates both a classification loss and a temporal localization loss is then trained for candidate refinement. Because the C3D model is limited to 16-frame inputs, they uniformly sample 16 frames from the whole window, which is inferior to the temporal pooling operation used in our approach. Gao et al. (2017) decompose the input video into short video units, pool features extracted from a set of contiguous units to represent a long video clip, and employ a coordinate regression network to refine the temporal action boundaries. Our approach also includes location regression, but the regression offsets are calculated from the relative deviation rather than the absolute value, which helps the model converge more efficiently. Unlike the works mentioned above, Yeung et al. (2016) propose an attention-based model that predicts the action position through a few glimpses and is trained via reinforcement learning. The difference between their work and ours is that our approach locates the action by continuously adjusting the span of the current window rather than predicting the bounds directly.

Most recent approaches to object detection are built upon the "proposal + classification" paradigm (Girshick et al., 2014; Ren et al., 2015). Object proposals are usually either generated by methods that rely on hand-crafted low-level visual cues, such as SelectiveSearch (Uijlings et al., 2013) and Edgebox (Zitnick and Dollár, 2014), or produced by a fully convolutional network applied to CNN features extracted from anchor boxes arranged uniformly over the image, as in Faster R-CNN (Ren et al., 2015). However, generating a large number of proposals for an image with only one or two objects is unnecessary and computationally inefficient. Some works attempt to reduce the number of proposals with an active object detection strategy (Caicedo and Lazebnik, 2015; Jie et al., 2016; Mathe et al., 2016). Caicedo and Lazebnik (2015) learn an optimal policy to locate a single object in the image via Deep Q-Learning, starting from the whole image in a top-down way and adaptively adjusting the window scale and position to focus on the true region. Jie et al. (2016) propose an effective tree-structured reinforcement learning approach, which learns to balance the exploration of uncovered new objects against the refinement of covered ones, and can localize multiple objects in a single run. Inspired by (Caicedo and Lazebnik, 2015; Jie et al., 2016), we design a reinforcement-learning-based approach for temporal action localization, which locates action instances within a long untrimmed video via the learned policy in a bottom-up way, and uses a regression network to refine the predicted temporal window boundaries.

Figure 1: The framework of our proposed action proposal model based on Deep Q-learning, which incorporates a regression network for better action localization.

3 Self-Adaptive Action Proposal Model

In this section, we present our action-proposal generation model, which is self-adaptive: it gradually adjusts its predicted results according to the content of the attended window and the history of executed actions, so as to cover the true action region as accurately as possible in a few steps. We cast the problem of temporal action localization as a Markov Decision Process (MDP), in which the agent interacts with the environment and makes a sequence of decisions to achieve a set goal. In our formulation, the environment is the input video clip; the agent observes the current video segment, called the temporal window, and changes the position or span of the window to achieve the goal of locating the action precisely. During the training phase, the agent receives a positive or negative reward after each decision, from which it learns an effective policy. In addition, we construct a regression network to refine the final detection results and improve the localization accuracy. The framework of our proposal generation model is illustrated in Fig. 1. In the following subsections, the set of actions $\mathcal{A}$, the set of states $\mathcal{S}$, the reward function $R(s, a)$ of the MDP, and the regression network are discussed in detail. To avoid confusion, the action performed by the actor in the video is called a motion in this section; in other sections the meaning of "action" is determined by the context.

3.1 MDP Formulation

The set of actions $\mathcal{A}$ can be divided into two categories: one group for transforming the temporal window, such as "move left", "move right" and "expand left", and one action for terminating the search, "trigger", as shown in Fig. 2. The transformation group includes regular actions, which comprise translation and scaling, and one irregular action. The regular actions vary the current window in terms of position or time span around the attended region, such as "move left", "expand left" or "shrink", and are adopted by the agent to increase the intersection with a groundtruth that overlaps the current window. The irregular action, namely "jump", translates the window to a new position away from the current site, so that the agent does not trap itself around the present location when no motion occurs nearby. Each regular action changes the window by an amount proportional to the current window size. For instance, suppose that the current window is denoted as $[t_s, t_e]$, where $t_s$ and $t_e$ stand for the left and right boundary respectively. The action "move left" translates the window to the new site $[t_s - \delta, t_e - \delta]$ with $\delta = \alpha (t_e - t_s)$, while the action "expand left" scales the window with the change $t_s' = t_s - \delta$ and $t_e' = t_e$. Here, $\alpha$ is a parameter that trades off search speed against localization accuracy. In this paper, we set $\alpha = 0.2$. The action "jump" randomly selects a new window on the left or right side, which has the same size as the current window and lies at a distance from the present site. The regular actions make the agent gradually adjust its position to cover the motion more accurately once it has found one, while the action "jump" lets the agent explore unknown regions that may contain the motion in a discontinuous and efficient way.
The action "trigger" is employed by the agent whenever it considers that a motion has been localized by the current window; it stops the current search sequence and restarts a new search for the next motion from an initial window position away from the current site.
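The window transformations described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `Window`, `apply_action` and `jump_factor` are hypothetical, "expand right" is assumed as the counterpart of "expand left", the symmetric form of "shrink" is an assumption, and the jump distance used during training is not stated in the text, so it is left as a free parameter. Clipping to the video bounds is omitted.

```python
import random
from dataclasses import dataclass

ALPHA = 0.2  # step ratio relative to the current window size, as reported in the paper


@dataclass(frozen=True)
class Window:
    start: float  # left boundary (frame index)
    end: float    # right boundary (frame index)

    @property
    def span(self) -> float:
        return self.end - self.start


def apply_action(w, action, alpha=ALPHA, jump_factor=2.0, rng=random):
    """Return the window obtained by applying one action to window w."""
    delta = alpha * w.span  # change proportional to the current window size
    if action == "move_left":
        return Window(w.start - delta, w.end - delta)
    if action == "move_right":
        return Window(w.start + delta, w.end + delta)
    if action == "expand_left":
        return Window(w.start - delta, w.end)
    if action == "expand_right":  # assumed counterpart of "expand left"
        return Window(w.start, w.end + delta)
    if action == "shrink":
        # Assumption: shrink symmetrically, delta / 2 on each side.
        return Window(w.start + delta / 2, w.end - delta / 2)
    if action == "jump":
        # Same-size window on a random side; jump_factor is a hypothetical
        # parameter because the training-time distance is not given.
        shift = rng.choice((-1.0, 1.0)) * jump_factor * w.span
        return Window(w.start + shift, w.end + shift)
    if action == "trigger":
        return w  # terminates the search; the window itself is unchanged
    raise ValueError(f"unknown action: {action}")
```

For a window covering frames 100 to 200, "move left" shifts both bounds by 20 frames, and "expand left" moves only the left bound by 20 frames.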

Figure 2: Illustrations of the actions adopted by the agent for motion search in our experiment. Yellow windows with dashed lines represent the next windows after taking the corresponding action.

The state of the MDP is the concatenation of two components: the representation of the current window and the history of taken actions. To describe the motion within the current window in a general way, we use as the representation the feature extracted from the C3D CNN model (Tran et al., 2015), which is pretrained on Sports-1M and finetuned on UCF-101. Here, we choose the feature vector from the fc6 layer (4096 dimensions), considering its good abstract representation of the semantic information about the motion. The original C3D model accepts only 16 frames as input, whereas the duration of a temporal window is usually far longer than that. To tackle this problem, we design two different solutions: (i) uniformly select 16 frames from the whole duration; (ii) feed all the frames into the C3D model and add an additional pooling layer (average pooling in our case) between the "pool5" layer and the "fc6" layer, which reduces the dimension of the feature vector extracted from "pool5" to the value expected by the C3D model. The history of taken actions is a binary vector that records which actions have been adopted by the agent in the past. Each action in the history is represented by a 7-dimensional binary vector in which all values are zero except the one corresponding to the taken action. In the experiment, we record the 10 most recent actions as the history. The history of taken actions informs the agent of the search path it has followed and the regions it has already attended, which helps to stabilize search trajectories that might otherwise get stuck in repetitive cycles.
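To make the state layout concrete, the following is a minimal sketch of how a 4096-dimensional window feature and a history of 10 one-hot action vectors (7 dimensions each) could be concatenated into a 4166-dimensional state. The action names, the class `ActionHistory` and the helper `build_state` are hypothetical. `temporal_average_pool` is a simplified stand-in for the pooling idea: it averages per-chunk feature vectors along time, whereas the paper pools inside the network between "pool5" and "fc6".

```python
from collections import deque

# Assumed names for the 7 actions; the text names six of them explicitly.
ACTIONS = ("move_left", "move_right", "expand_left", "expand_right",
           "shrink", "jump", "trigger")
HISTORY_LEN = 10    # number of past actions kept in the state
FEATURE_DIM = 4096  # fc6 feature size


def temporal_average_pool(chunk_features):
    """Average a list of equal-length feature vectors along the time axis."""
    if not chunk_features:
        raise ValueError("need at least one feature vector")
    n = len(chunk_features)
    return [sum(column) / n for column in zip(*chunk_features)]


class ActionHistory:
    """Fixed-length record of the most recent actions as one-hot vectors."""

    def __init__(self, length=HISTORY_LEN):
        empty = [0.0] * len(ACTIONS)
        self._slots = deque([list(empty) for _ in range(length)], maxlen=length)

    def push(self, action):
        one_hot = [0.0] * len(ACTIONS)
        one_hot[ACTIONS.index(action)] = 1.0
        self._slots.append(one_hot)  # the oldest entry is dropped automatically

    def as_vector(self):
        return [value for slot in self._slots for value in slot]


def build_state(window_feature, history):
    """Concatenate the window feature with the flattened action history."""
    if len(window_feature) != FEATURE_DIM:
        raise ValueError("unexpected feature dimension")
    return list(window_feature) + history.as_vector()
```

With these sizes the state has 4096 + 10 × 7 = 4166 entries, and pushing an eleventh action discards the oldest one.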

The reward function $R(s, a)$ provides feedback to the agent when it performs the action $a$ at the current state $s$: it rewards actions that improve the motion localization accuracy and penalizes actions that reduce it. The quality of motion localization is evaluated via a simple yet indicative measurement, the Intersection over Union (IoU) between the currently attended temporal window and the groundtruth. Supposing that $w$ stands for the current window and $g$ represents the groundtruth region of a motion, the IoU between $w$ and $g$ is defined as $\mathrm{IoU}(w, g) = \mathrm{span}(w \cap g) / \mathrm{span}(w \cup g)$. The reward is determined by the sign of the difference between the IoUs of two successive states $s$ and $s'$, where the agent moves from $s$ to $s'$ by executing the action $a$. Specifically, it is formulated as follows:

$$ R(s, a) = \begin{cases} +1, & \text{if } \mathrm{IoU}(w', g_i) > \mathrm{IoU}(w, g_i) \text{ for some } 1 \le i \le n \\ -1, & \text{otherwise} \end{cases} \tag{1} $$

where $w$ and $w'$ are the attended windows corresponding to states $s$ and $s'$ respectively, and $n$ is the number of groundtruths within the input video. The reward function returns +1 or -1. Equation 1 indicates that the agent receives the reward +1 if the new window $w'$ has more overlap with any of the groundtruths than the previous window $w$, and the reward -1 otherwise. Such a binary reward makes it clear to the agent which action at the present state drives the attended window towards the groundtruth, and thus accelerates the convergence of the model during the training phase. In addition, this reward scheme facilitates better localization of motion regions, especially for videos with multiple motion instances, as there is no restriction on which motion should be focused on at each state. The "trigger" action has a different reward scheme, as it terminates the search and there is no next state. The reward of "trigger" is a piecewise function of an IoU threshold, which can be written as follows:

$$ R(s, a_t) = \begin{cases} +\eta, & \text{if } \max_{1 \le i \le n} \mathrm{IoU}(w, g_i) > \tau \\ -\eta, & \text{otherwise} \end{cases} \tag{2} $$

In Equation 2, $a_t$ represents the "trigger" action, $\eta$ is the reward value, chosen as 3 in our experiment, and $\tau$ is the IoU threshold, which controls the tradeoff between localization accuracy and computational overhead. A larger $\tau$ encourages the agent to locate the motion more precisely, but it takes more action steps to complete the search. In the training phase, we do not stop the search process when the agent correctly performs the action "trigger" for the first time, but let it continue to explore uncovered regions. Therefore, our model observes many termination states that have an IoU with a groundtruth of more than $\tau$. We use $\tau = 0.5$ for our problem, and find that a larger $\tau$, such as 0.6 or 0.7, yields a negligible improvement in recall, which is validated by the experiments.
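The two reward schemes can be illustrated with a short Python sketch. It follows the textual description (a reward of +1 when the new window overlaps some groundtruth more than the previous window did, -1 otherwise; ±3 for "trigger" with an IoU threshold of 0.5). Windows and groundtruths are assumed to be `(start, end)` tuples, and the function names are hypothetical.

```python
def temporal_iou(w, g):
    """IoU of two temporal segments given as (start, end) tuples."""
    intersection = max(0.0, min(w[1], g[1]) - max(w[0], g[0]))
    union = (w[1] - w[0]) + (g[1] - g[0]) - intersection
    return intersection / union if union > 0 else 0.0


def transform_reward(prev_window, new_window, ground_truths):
    """+1 if the new window improves the IoU with any groundtruth, else -1."""
    improved = any(
        temporal_iou(new_window, g) > temporal_iou(prev_window, g)
        for g in ground_truths
    )
    return 1.0 if improved else -1.0


def trigger_reward(window, ground_truths, eta=3.0, tau=0.5):
    """+eta if the best IoU with any groundtruth exceeds tau, else -eta."""
    best = max((temporal_iou(window, g) for g in ground_truths), default=0.0)
    return eta if best > tau else -eta
```

For example, with a single groundtruth at frames 5 to 15, moving the window from (0, 10) to (2, 12) raises the IoU from 1/3 to 7/13, so the transformation is rewarded with +1, while triggering on (0, 10) is penalized with -3.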

The goal of the agent is to maximize the sum of discounted rewards received while continuously transforming the currently attended window during a sequence of interactions with the environment (an episode). In other words, the agent needs to learn a policy $\pi(s)$ that specifies the optimal action at the current state $s$ with respect to the long-term benefit. Because the state transition probabilities are unknown and the environment is model-free, we use reinforcement learning, specifically Deep Q-learning, to estimate the optimal value of each state-action pair. In this paper, we follow the deep Q-learning framework proposed by Mnih et al. (2015), which estimates the action-value function via a neural network. The architecture of our Deep Q-Network (DQN) is illustrated in the upper branch of Fig. 1. Similar to (Caicedo and Lazebnik, 2015; Jie et al., 2016), the C3D CNN model is used only for feature extraction, and we do not train the whole pipeline end to end to learn the full feature hierarchy, because a CNN model pretrained on a large dataset generalizes well and because there is not enough motion detection data to train both networks jointly. During the training phase, the agent runs multiple episodes with randomly initialized positions for each video clip. We train a separate DQN for each motion category and follow the $\epsilon$-greedy policy. Specifically, at the current state the agent selects a random action from the whole action set with probability $\epsilon$, and greedily chooses the optimal action according to the learned policy with probability $1 - \epsilon$. Over the training epochs, $\epsilon$ is annealed linearly from 1.0 to 0.1, which gradually shifts the agent from exploration to exploitation. Following (Mnih et al., 2015), we also incorporate a replay memory to collect transition experiences from past episodes; each record may be reused for several model updates, which helps to break short-term correlations between states. At each update, a minibatch (e.g. 200 records) is randomly sampled from the replay memory as training samples.
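The exploration schedule and replay memory can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' Torch 7 code: the per-epoch granularity of the annealing, the function names, and the `gamma` argument of `q_target` (the discount factor, whose value is not given in the text) are all assumptions. The Q-network itself is omitted; `q_values` stands for its output over the actions.

```python
import random
from collections import deque


def epsilon_at(epoch, num_epochs, eps_start=1.0, eps_end=0.1):
    """Linearly anneal epsilon from eps_start to eps_end over training."""
    fraction = min(max(epoch / num_epochs, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a list of Q-values; returns an action index."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit


def q_target(reward, next_q_values, terminal, gamma):
    """One-step Q-learning target; gamma is a hypothetical discount factor."""
    if terminal:  # e.g. after "trigger" there is no next state
        return reward
    return reward + gamma * max(next_q_values)


class ReplayMemory:
    """Bounded buffer of transitions sampled uniformly for updates."""

    def __init__(self, capacity=2000):
        self._buffer = deque(maxlen=capacity)

    def push(self, transition):
        self._buffer.append(transition)  # oldest record is evicted when full

    def sample(self, batch_size=200, rng=random):
        k = min(batch_size, len(self._buffer))
        return rng.sample(list(self._buffer), k)

    def __len__(self):
        return len(self._buffer)
```

Uniform sampling from the buffer is what decorrelates consecutive transitions; the capacity of 2000 and the minibatch size of 200 mirror the values reported in the implementation details.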

3.2 Regression Network

Inspired by Fast R-CNN (Girshick et al., 2014), where a regression network is incorporated to correct the position deviation between the predicted result and the groundtruth, we also introduce a regression model to refine the motion proposals. As shown in the lower branch of Fig. 1, the regression channel accepts a 4096-dimensional feature vector as input and outputs two coordinate offsets, one for the starting moment and one for the end moment. Unlike spatial bounding box regression, in which coordinate scaling is needed due to varying camera-projection perspectives, we directly use the original temporal coordinates (i.e. frame numbers) to calculate the offsets, taking advantage of the unified frame rate among the video clips in our experiment. The regression offsets are represented as the ratio of the position deviation to the predicted span, and are defined as follows:

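$$ \delta_s = \frac{g_s - p_s}{p_e - p_s}, \qquad \delta_e = \frac{g_e - p_e}{p_e - p_s} \tag{3} $$

where $p_s$ and $p_e$ are the frame indexes of the predicted starting and end moment, while $g_s$ and $g_e$ are the frame indexes of the matched groundtruth. The loss for temporal coordinate regression, $L_{reg}$, is defined as follows:

$$ L_{reg} = \frac{1}{N} \sum_{i=1}^{N} \left( \left| \hat{\delta}_{s,i} - \delta_{s,i} \right| + \left| \hat{\delta}_{e,i} - \delta_{e,i} \right| \right) \tag{4} $$

where $N$ is the number of actions that correctly perform "trigger" in a minibatch, and $\hat{\delta}_{s,i}$, $\hat{\delta}_{e,i}$ are the offsets predicted by the regression network. In other words, we only regress the position of temporal windows whose IoU with the groundtruth is more than 0.5. We use the $L_1$ norm to make the loss insensitive to outliers.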
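As a worked illustration of relative offsets and an L1 regression loss, consider the sketch below. The sign convention (groundtruth minus prediction, divided by the predicted span) and all function names are assumptions made for the example; only the idea of span-normalized offsets and an outlier-insensitive L1 loss comes from the text.

```python
def regression_targets(pred, gt):
    """Relative start/end offsets of a groundtruth w.r.t. a predicted window."""
    p_s, p_e = pred
    g_s, g_e = gt
    span = p_e - p_s
    if span <= 0:
        raise ValueError("predicted window must have positive span")
    return ((g_s - p_s) / span, (g_e - p_e) / span)


def apply_offsets(pred, offsets):
    """Refine a predicted window with (start, end) relative offsets."""
    p_s, p_e = pred
    span = p_e - p_s
    return (p_s + offsets[0] * span, p_e + offsets[1] * span)


def l1_regression_loss(predicted_offsets, target_offsets):
    """Mean L1 loss over windows that correctly triggered (IoU > 0.5)."""
    n = len(target_offsets)
    if n == 0:
        return 0.0
    total = sum(
        abs(p[0] - t[0]) + abs(p[1] - t[1])
        for p, t in zip(predicted_offsets, target_offsets)
    )
    return total / n
```

For a predicted window of frames 100 to 200 and a matched groundtruth of frames 110 to 190, the targets are (0.1, -0.1); applying them to the prediction recovers the groundtruth bounds. Because the offsets are divided by the span, a 10-frame error counts the same for a short window as a 100-frame error does for a window ten times longer.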

4 Experimental Results

We evaluate the performance of our model on the THUMOS’14 dataset. Following the standard evaluation protocol, our method achieves a recall competitive with state-of-the-art results and outperforms existing methods by a large margin on the action detection task.

4.1 Implementation Details

Datasets. We validate the quality of our method on labeled untrimmed videos from the challenging THUMOS’14 dataset, which contains over 20 hours of video from 20 sport action categories. The dataset comprises 413 videos, with 200 for validation and 213 for testing. We train our model on the validation set and report results on the test set.
Training Details. Our model is implemented in Torch 7 (Collobert et al., 2011). We train a category-specific model for each action and keep the same parameter settings for all of them. In the pre-processing stage, we downsample the videos to extract a more compact C3D descriptor. The replay memory buffer size is set to 2000, and the minibatch size is 200. The learning rate for the DQN is 1e-3 with a decay rate of 5e-5, while the learning rate for the regression network is 1e-4 with a decay rate of 9e-5. Dropout is applied with a ratio of 0.2. To accelerate training, we force the agent to take a "jump" action if the IoU of the current window is zero, which drives the window to a region around a groundtruth.
Testing Details. During the test phase, the agent starts its search from the beginning of the video and takes actions to adjust its position according to the attended region. We set the maximum number of action steps for the agent to 15. The agent restarts its search from the back bound of the current window if it takes a "trigger" action or reaches the maximum number of action steps. Note that, unlike in training, the agent always takes a leap forward (two times farther than a "move left/right" action) when it adopts a "jump" action. We choose the windows where the agent takes the "trigger" action as proposals and use the pre-trained TSN (Wang et al., 2016) as our classifier.
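The test-time procedure can be summarized by the following sketch. It is a simplified reading of the description above, not the released implementation: `policy` is a hypothetical callable that maps the current window to an action name (in the paper this role is played by the trained DQN), `init_span` and `max_searches` are hypothetical parameters, "expand right" and the symmetric "shrink" are assumptions, and the clipping to the video bounds and the one-frame progress guard are additions that keep the loop well defined.

```python
MOVES = {
    # Each entry maps (start, end, delta) to the transformed (start, end).
    "move_left": lambda s, e, d: (s - d, e - d),
    "move_right": lambda s, e, d: (s + d, e + d),
    "expand_left": lambda s, e, d: (s - d, e),
    "expand_right": lambda s, e, d: (s, e + d),
    "shrink": lambda s, e, d: (s + d / 2, e - d / 2),
    # At test time "jump" always leaps forward, twice as far as a move.
    "jump": lambda s, e, d: (s + 2 * d, e + 2 * d),
}


def generate_proposals(video_length, policy, init_span,
                       max_steps=15, alpha=0.2, max_searches=1000):
    """Run repeated searches over a video and collect "trigger" windows."""
    proposals = []
    start = 0.0
    for _ in range(max_searches):
        if start >= video_length:
            break
        window = (start, min(start + init_span, float(video_length)))
        for _ in range(max_steps):
            action = policy(window)
            if action == "trigger":
                proposals.append(window)
                break
            delta = alpha * (window[1] - window[0])
            s, e = MOVES[action](window[0], window[1], delta)
            s, e = max(0.0, s), min(float(video_length), e)  # stay in the video
            if e <= s:
                break  # degenerate window, abandon this search
            window = (s, e)
        # Restart from the back bound of the current window; the extra
        # one-frame guard (an assumption) guarantees forward progress.
        start = max(window[1], start + 1.0)
    return proposals
```

As a usage example, a policy that always answers "trigger" on a 100-frame video with `init_span=25` yields the four consecutive windows (0, 25), (25, 50), (50, 75) and (75, 100).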

4.2 Temporal Proposals Evaluation

All the regions attended by the agent can be regarded as temporal proposal candidates. Our method runs for about 400 steps, with around 50 triggers on average, for each video. Fig. 6 shows an instance of the detection process of the DQN. We score each attended region with the Q-value predicted by the model, and add a large bonus only to "trigger" regions in order to give them higher priority when ranking the proposals. To assess the recall performance of our method, we use the metrics from Escorcia et al. (2016):
Recall vs. Average Number of Proposals: the average recall over all categories at IoU 0.5 is calculated as a function of the average number of proposals. A better proposal approach is expected to achieve a higher recall with fewer proposals.
Recall vs. IoU: for a fixed number of proposals, recall is calculated at IoU thresholds from 0.05 to 1. To measure the localization quality of the top-ranked proposals, which are the most important for the subsequent recognition task, we fix the number of proposals to 100.
We compare our method with DAPs (Escorcia et al., 2016), SCNN-prop (Shou et al., 2016), Sparse-prop (Heilbron et al., 2016) and sliding window. SCNN-prop and DAPs are the state-of-the-art methods, while sliding window is the baseline. For DAPs, SCNN-prop and Sparse-prop, we plot the curves using the proposal results provided by the authors. The sliding-window baseline generates proposals from all sliding windows of lengths from 16 to 512 with 50% overlap, and each window is scored with a random value. As shown in Fig. 3(a), our method achieves better recall than the state-of-the-art methods when the number of proposals is small, and it has a competitive recall for the top 100 proposals according to Fig. 3(b). Note that the recall growth of our method slows down after about 70 proposals. This is because our DQN agent tries to find the ground truth as fast as possible, and tends to stop exploring when it considers that an action region with IoU more than 0.5 has been found. Therefore, apart from the "trigger" segments, the other proposals are intermediate results of the exploration process, which are unreliable in most cases.
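The ranking and recall computation described above can be sketched for a single video as follows. This is an illustrative simplification, not the evaluation code of Escorcia et al. (2016): the function names are hypothetical, the size of the trigger bonus is not given in the text and is therefore a free parameter, and the official metric additionally averages over videos and categories.

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end) tuples."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def ranking_score(q_value, is_trigger, bonus):
    """Score of an attended region; bonus is a hypothetical large constant."""
    return q_value + (bonus if is_trigger else 0.0)


def recall_at(proposals, ground_truths, iou_threshold=0.5, top_k=None):
    """Fraction of groundtruths matched by the top-k scored proposals.

    proposals: list of (start, end, score) tuples.
    ground_truths: list of (start, end) tuples.
    """
    if not ground_truths:
        return 0.0
    ranked = sorted(proposals, key=lambda p: p[2], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    hits = sum(
        1 for g in ground_truths
        if any(temporal_iou((p[0], p[1]), g) >= iou_threshold for p in ranked)
    )
    return hits / len(ground_truths)
```

Sweeping `top_k` at a fixed threshold gives a recall versus number-of-proposals curve, while sweeping `iou_threshold` at `top_k=100` gives a recall versus IoU curve.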

Figure 3: Evaluation results of recall performance on THUMOS’14, panels (a) and (b). S-CNN and DAPs are state-of-the-art methods, while Sliding Window (dashed line) is the baseline. We use the code provided by (Escorcia et al., 2016) to calculate recalls.

4.3 Temporal Action Detection Analysis

Following the convention of (Jiang et al., 2014), we evaluate the performance of our method on the temporal localization task with the mean Average Precision (mAP) score at 50% IoU. In the experiments, we take "trigger" windows as proposals and classify them with a pre-trained TSN. Our method is compared with other state-of-the-art methods in the literature, including S-CNN (Shou et al., 2016), Oneata et al. (2013) and Yeung et al. (2016). As shown in Fig. 4, our method outperforms the state-of-the-art approaches on THUMOS’14 by a large margin of 8.7 percentage points in mAP (27.7% versus 19.0% for S-CNN).

Figure 4: Histograms of average precision for each category on THUMOS’14. The results are calculated with the official toolkit. The mAP (%) for Oneata et al. (2013), Yeung et al. (2016), S-CNN (Shou et al., 2016) and Ours are 14.4, 17.1, 19.0 and 27.7 respectively.

To further analyze the contributions of the different model components, namely temporal pooling and coordinate regression, to the action detection task, we conduct ablation studies. We construct three models, which are described as follows:
Ours: The integrated model with the architecture shown in Fig. 1, where the DQN agent generates proposals from features processed through the temporal pooling layer and refines the proposals with the regression network. The average number of "trigger" proposals is 67 per video.
Ours-POOLING: The model without the temporal pooling layer, which uniformly samples video frames from the input video segment to extract C3D features and refines the proposals with the regression network. The average number of "trigger" proposals is 50 per video.
Ours-POOLING-RGN: The basic model without both the temporal pooling layer and the regression network, which uniformly samples video frames and uses only the DQN for proposal generation. The average number of "trigger" proposals is 50 per video.
For each model, we evaluate the proposals and the overall action localization performance. First, we use Recall vs. Average Number of Proposals at IoU=0.5 to evaluate the proposal performance, as shown in Fig. 5. Then we present the quantitative detection results of the models in Table 1, reported as mAP scores at 50% IoU. The mAPs are calculated with a fixed average number of proposals (50), which equals the average number of "trigger" proposals for Ours-POOLING and Ours-POOLING-RGN. The last row of Table 1 reports the mAP calculated with all "trigger" proposals for Ours, where the average number is 67. Interestingly, Ours-POOLING appears to have better recall than Ours, as shown in Fig. 5, yet Ours outperforms the other models by a large margin in overall detection performance. In line with the observation of (Escorcia et al., 2016), we consider that this discrepancy suggests that Ours produces proposals with fewer hard negatives, which allows the activity classifier to keep the number of false positives low. In addition, the results show that localization regression consistently benefits the detection task. We also compare our proposal models with existing related works, and our model achieves the best performance on the action detection task, as shown in Table 1.

Figure 5: Recall evaluation for the ablation study: Recall vs. Average Number of Proposals at IoU=0.5.

| Model | Proposal number | mAP (%) |
| SCNN (Shou et al., 2016) | NA | 19.0 |
| Yeung et al. (2016) | NA | 17.1 |
| Oneata et al. (2013) | NA | 14.4 |
| Ours-POOLING-RGN | 50 | 22.3 |
| Ours-POOLING | 50 | 24.6 |
| Ours | 50 | 26.4 |
| Ours | 67 | 27.7 |

Table 1: Temporal action detection results for various proposal models: mAP calculated with a fixed average number of proposals on THUMOS’14. NA means that the proposal number is not specified for the method.
Figure 6: An instance of how the DQN agent takes actions to generate proposals. The examples are sampled from the action PoleVault of THUMOS’14. The last row is the time line, where the red lines correspond to the ground truth. The top 3 rows are the running details corresponding to three action instances. The blue lines are the agent’s search histories. A green circle indicates a correct "trigger" decision, while a red one indicates a wrong one.

4.4 Run-time Performance

The run-time behavior of our method depends on the DQN’s performance. A well-trained DQN agent concentrates on the ground truth within a couple of steps once it perceives the action segment. It can also accelerate the exploration of the video with the "jump" action. In addition, the choice of the scalar $\alpha$ is an important factor that influences run-time performance. A large $\alpha$ makes the agent take only a brief glance over the video in most cases, but also results in coarse proposals. As a trade-off, we set $\alpha = 0.2$ during the training and testing phases. On a Tesla K80 platform, the average run-time of our model over all testing videos in THUMOS’14 is 50.4 FPS, including the online C3D feature extraction.

5 Conclusion

In this paper, we have introduced an active action proposal model that learns to adaptively adjust the span of the currently attended window to cover the true action regions in a few steps. We build our model on deep reinforcement learning and learn an optimal policy to direct the agent's actions. In order to locate the action precisely, we design a regression network to correct the offsets between the predicted bounds and the groundtruth. Experimental results on the THUMOS’14 dataset validate that the proposed approach achieves performance comparable to most modern action detection methods with far fewer action proposals.

References

  • Shou et al. [2016] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1049–1058, 2016.
  • Zhu and Newsam [2016] Yi Zhu and Shawn Newsam. Efficient action detection in untrimmed videos via multi-task learning. arXiv preprint arXiv:1612.07403, 2016.
  • Xiong et al. [2017] Yuanjun Xiong, Yue Zhao, Limin Wang, Dahua Lin, and Xiaoou Tang. A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716, 2017.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Caicedo and Lazebnik [2015] Juan C Caicedo and Svetlana Lazebnik. Active object localization with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2488–2496, 2015.
  • Jie et al. [2016] Zequn Jie, Xiaodan Liang, Jiashi Feng, Xiaojie Jin, Wen Lu, and Shuicheng Yan. Tree-structured reinforcement learning for sequential object localization. In Advances in Neural Information Processing Systems, pages 127–135, 2016.
  • Bellver et al. [2016] Miriam Bellver, Xavier Giró-i Nieto, Ferran Marqués, and Jordi Torres. Hierarchical object detection with deep reinforcement learning. arXiv preprint arXiv:1611.03718, 2016.
  • Jiang et al. [2014] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/, 2014.
  • Laptev and Lindeberg [2003] Ivan Laptev and Tony Lindeberg. Space-time interest points. In 9th International Conference on Computer Vision, Nice, France, pages 432–439. IEEE conference proceedings, 2003.
  • Wang and Schmid [2013] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
  • Tran et al. [2015] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
  • Donahue et al. [2015] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
  • Yue-Hei Ng et al. [2015] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015.
  • Gaidon et al. [2013] Adrien Gaidon, Zaid Harchaoui, and Cordelia Schmid. Temporal localization of actions with actoms. IEEE transactions on pattern analysis and machine intelligence, 35(11):2782–2795, 2013.
  • Oneata et al. [2013] Dan Oneata, Jakob Verbeek, and Cordelia Schmid. Action and event recognition with fisher vectors on a compact feature set. In Proceedings of the IEEE International Conference on Computer Vision, pages 1817–1824, 2013.
  • Yuan et al. [2016] Jun Yuan, Bingbing Ni, Xiaokang Yang, and Ashraf A Kassim. Temporal action localization with pyramid of score distribution features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3093–3102, 2016.
  • Gao et al. [2017] Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, and Ram Nevatia. Turn tap: Temporal unit regression network for temporal action proposals. arXiv preprint arXiv:1703.06189, 2017.
  • Yeung et al. [2016] Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2678–2687, 2016.
  • Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • Uijlings et al. [2013] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
  • Zitnick and Dollár [2014] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, pages 391–405. Springer, 2014.
  • Mathe et al. [2016] Stefan Mathe, Aleksis Pirinen, and Cristian Sminchisescu. Reinforcement learning for visual object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2894–2902, 2016.
  • Collobert et al. [2011] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
  • Wang et al. [2016] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36, 2016.
  • Escorcia et al. [2016] Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. DAPs: Deep Action Proposals for Action Understanding. Springer International Publishing, 2016.
  • Heilbron et al. [2016] Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In Computer Vision and Pattern Recognition, pages 1914–1923, 2016.