Human spatial action localization and classification in videos are challenging tasks that are key to better video understanding. Action detection is especially challenging, as it requires localizing the actor in the scene, as well as classifying the action. This is done for every frame in a video with little or no context. In contrast, a related task is action recognition, which uses signals from all video frames to predict the action. Action detection has important applications, such as surveillance and human-robot interaction. However, most current approaches are computationally expensive and are far from real-time performance, which limits their usage in real life applications.
Understanding actions in videos has been an active area of research in recent years. Following the success of deep convolutional neural networks (CNNs) on the task of image classification, researchers have used CNNs for the tasks of action recognition and localization. For image classification, appearance is typically the only cue available, represented by RGB pixel values. Videos provide an extra signal: motion. Researchers have explored many ways to model motion cues, including 3D CNNs and recurrent neural networks. One of the most successful approaches is the two-stream network, which usually consists of a spatial network that models appearance, whose input is RGB frames, and a temporal network that models motion. Optical flow is often chosen as input to the temporal network; however, other inputs can be used, such as dense trajectories. While adding the temporal stream often improves the model, it adds complexity, as optical flow is usually computed using a third-party algorithm that runs separately from the RGB stream. This limits parallelization and full utilization of compute resources like GPUs, and adds memory overhead. Using a third-party algorithm also prevents the model from being trainable end-to-end, so the visual and motion pathways cannot learn to coordinate. Finally, as shown in , optical flow algorithms optimize the end-point-error (EPE), which does not necessarily align with the objective for action detection.
One of the challenges of the action detection task is the absence of large-scale annotated datasets. This problem forces researchers to work with relatively shallow architectures or use an architecture that is pre-trained on the image classification task. Only recently have large-scale datasets for action recognition emerged, such as Kinetics . Pre-training on a large-scale dataset for action recognition should transfer well to the task of action localization.
To the best of our knowledge, all past efforts that used two-stream networks for action detection trained the two streams separately. The predictions from both streams were then fused using a fusion algorithm. Training the two streams separately prevents the model from exploiting dependencies between the appearance and motion cues. On the downside, training the two networks jointly on a small action localization dataset might lead to overfitting, as the model has a very high capacity compared to the amount of labeled data. However, pre-training on Kinetics should solve this overfitting problem.
In this work, we propose an end-to-end trainable framework for real-time spatial action detection. Following the advances in real-time object detection, we build our framework with motivation from YOLOv2 , the state-of-the-art real-time object detector. We generalize its architecture to a two-stream network architecture for action detection. Instead of training each stream separately, we train both streams jointly by fusing the final activations from each stream and applying a convolutional layer to produce the final prediction.
We replace the usual third party algorithms used for computing optical flow [5, 6, 7] with a trainable neural network. We use Flownet2  for optical flow computation and integrate it in our architecture at the beginning of the temporal stream. Using Flownet2 has two advantages: first, the framework becomes end-to-end trainable. While Flownet2 is trained to optimize the EPE, the computed optical flow might not be optimal for the objective of action detection. Fine-tuning Flownet2 for the task of action detection should result in better optical flow for our objective . Secondly, while other efforts on action detection usually use implementations of optical flow algorithms that are totally separate from the model, integrating the optical flow computation in the network improves the computational speed of the framework, as it makes better use of parallelization and reduces the data transfer overhead.
Finally, to address the overfitting problem that may be caused by the use of small-scale datasets or by training the two streams jointly, we pre-train our model for the task of action recognition on Kinetics. The pre-trained model is then trained on the task of action detection with a weak learning rate to preserve a relatively generic feature initialization and to prevent overfitting.
We test our framework using UCF-101-24 , a realistic and challenging dataset for action localization. We use temporally trimmed videos as our framework does not yet include temporal localization.
II Related Work
In recent years, deep CNNs have been very successful for computer vision tasks. Specifically, they have shown great improvements for the tasks of image classification [10, 11] and object detection [12, 4] when compared to traditional hand-crafted methods. Studying actions in videos has also been an active area of research. Videos provide two types of information: appearance, which is what exists in static images or individual frames of video, and motion. Researchers have used different approaches for modeling motion, including two-stream networks and 3D-CNNs.
Two-stream networks have been one of the most successful approaches for modeling motion for the tasks of action recognition and detection. In this approach, the network is designed as two feed-forward pathways: a spatial stream for modeling appearance and a temporal stream for modeling motion. While RGB images are a good representation of appearance information, optical flow is a good representation for motion. The spatial and temporal streams take RGB frames and optical flow as inputs, respectively. Many efforts for solving the action detection problem have followed this approach. Gkioxari and Malik, motivated by R-CNNs, use selective search to find region proposals. They use two separate CNNs (appearance and motion) for feature extraction. These features are fed to a Support Vector Machine (SVM) to predict action classes. Region proposals are linked using the Viterbi algorithm. Weinzaepfel et al. obtain frame-level region proposals using EdgeBox. The frames are then linked by tracking high-scoring proposals using a tracking-by-detection approach, which uses two separate CNNs for modeling appearance and motion, and an SVM classifier similar to . Peng and Schmid, motivated by Faster R-CNN, use region proposal networks (RPNs) to find frame-level region proposals. They use a motion RPN to obtain high-quality proposals and show that it is complementary to an appearance RPN. Stacking the optical flow from multiple frames is shown to improve the motion R-CNN. Region proposals are then linked using the Viterbi algorithm. Both appearance and motion streams are trained separately.
Singh et al. proposed the first deep learning-based approach to address real-time performance for action detection. They proposed using the single shot detector (SSD), which is a real-time object detector. They also employ a real-time, but less accurate, optical flow computation. Combining these two components, they managed to achieve a rate of . They propose a novel greedy algorithm for online incremental action linking across the temporal dimension. While this work is significantly faster than previous efforts, it sacrifices accuracy for speed by using a less accurate flow computation. Kalogeiton et al. propose generalizing the anchor box regression method used by Faster R-CNN and SSD to anchor cuboids, which consist of a sequence of bounding boxes over time. They take a fixed number of frames as input. Then, feature maps from all frames in this sequence are used to regress and find scores for anchor cuboids. At test time, the anchor cuboids are linked to create tubelets, which do not have a fixed temporal extent. While most methods for solving the action detection problem followed the two-stream approach, Hou et al. use 3D-CNNs. They suggest generalizing R-CNN to videos by designing a tube CNN (T-CNN). Instead of obtaining frame-level action proposals and using a post-processing algorithm to link actions temporally to form action tubes, T-CNN learns the action tubes directly from RGB frames.
Optical flow estimation has been dominated by variational approaches that follow . More recently, approaches that use deep CNNs for optical flow estimation [24, 25, 8] have shown promise. Flownet is the first end-to-end trainable deep CNN for optical flow estimation. It is trained using synthetic data to optimize EPE. The authors provide two architectures to estimate optical flow. The first is a standard CNN that takes the concatenated channels from two consecutive frames and predicts the flow directly. The second is a two-stream architecture that attempts to find a good representation for each image before they are combined by a correlation layer. However, Flownet falls behind other top methods due to inaccuracies with small displacements present in realistic data. Flownet2 addresses this problem by introducing a stacked architecture, which includes a subnetwork that is specialized to small displacements. It achieves more than 50% improvement in EPE compared to Flownet. Having a trainable network for estimating optical flow can be very useful, especially when integrated with other tasks. Sevilla-Lara et al. studied the integration of the trainable optical flow networks Flownet and Spynet into the task of action recognition. Among their conclusions, they found that fine-tuning optical flow networks for the objective of action recognition consistently improves performance.
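For readers unfamiliar with the metric, the end-point error is commonly defined as the average Euclidean distance between predicted and ground-truth flow vectors. The following is a minimal sketch under that common definition, not the evaluation code of any of the cited works; the function name `endpoint_error` and the flat list-of-tuples flow layout are assumptions made for illustration.

```python
import math

def endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    Both arguments are lists of (u, v) displacement tuples, one per pixel
    (a hypothetical flat layout chosen to keep the sketch short).
    """
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flow fields must be non-empty and of equal size")
    total = 0.0
    for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow):
        total += math.hypot(pu - gu, pv - gv)
    return total / len(pred_flow)

# Toy example with three pixels.
pred = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
gt = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
epe = endpoint_error(pred, gt)
```

Because EPE only measures per-pixel displacement accuracy, a flow field with low EPE is not guaranteed to be the most discriminative input for an action classifier, which is the motivation for fine-tuning discussed above.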
We propose a framework for efficient and accurate action detection, as outlined in Figure 1. We follow the two-stream network architecture  and integrate optical flow computation in our framework by using Flownet2 as input to the motion stream. We build each stream on YOLOv2 . In contrast to previous methods, instead of training each stream separately, we apply early fusion and train both streams jointly. Finally, the fused feature maps are used to regress bounding boxes, class scores, and overlap estimates, similar to YOLOv2.
III-A Two-Stream YOLOv2 with Early Fusion
YOLO is a real-time object detector. While there have been many successful object detection methods, such as R-CNN, these methods rely on extracting region proposals for candidate objects, either by an external algorithm like Selective Search or EdgeBox, or by an RPN. These proposals are then fed to a CNN to extract features and predict object classes. In contrast, YOLO defines object detection as a regression problem. A single network predicts both the spatial bounding boxes and their associated object classes. This design enables end-to-end training and optimization, which allows YOLO to run in real-time (). Compared to R-CNN, YOLO uses the entire image to predict objects and their locations, meaning that it encodes appearance as well as contextual information about object classes. This is critical for the task of action detection, as context is an extremely important clue for which action class is present in the scene (e.g., surfing is associated with sea, skiing is associated with snow). YOLOv2 is an improved version of YOLO, which adopts the anchor box idea that is used by Faster R-CNN and SSD. A pass-through layer is added, which brings high-resolution features from early layers in the network to the final low-resolution layers. This layer improves the performance with small-scale objects that the previous version struggled with. Moreover, YOLOv2 is even faster, as it maintains high accuracy with small-scale images. The fully-connected layer was removed, which makes the network completely convolutional, reducing the number of parameters. We built our framework on YOLOv2, as it is the best fit for our objective, running at above real-time speeds while maintaining state-of-the-art accuracy. Moreover, it encodes better contextual information, which is critical for the task of action detection. We use the open-source implementation and the pre-trained models provided by https://github.com/longcw/yolo2-pytorch.
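To make the regression formulation concrete, the sketch below decodes one raw anchor prediction into a box, following the parameterization described in the YOLO9000 paper (sigmoid offsets within a grid cell, exponential scaling of the anchor size). It is an illustration only; the function names and the tuple layout are assumptions, and the implementation used in this work may differ in detail.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def decode_box(t, cell, anchor):
    """Decode one raw prediction into (bx, by, bw, bh, confidence).

    t      : (tx, ty, tw, th, to) raw network outputs
    cell   : (cx, cy) top-left offset of the grid cell, in grid units
    anchor : (pw, ph) anchor (prior) width and height, in grid units
    """
    tx, ty, tw, th, to = t
    cx, cy = cell
    pw, ph = anchor
    bx = sigmoid(tx) + cx      # box centre stays inside its cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)     # anchor size rescaled by the prediction
    bh = ph * math.exp(th)
    return bx, by, bw, bh, sigmoid(to)

# All-zero outputs give a box centred in the cell with the anchor's size.
box = decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(3, 4), anchor=(1.5, 2.0))
```

The last value plays the role of the overlap estimate mentioned in the text: a score trained to reflect how well the predicted box overlaps the ground truth.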
In contrast to previous efforts, we train both input streams jointly. Training the two streams independently prevents the networks from learning complementary features. Associating appearance and motion cues can be very useful for identifying the action in the scene. We apply early fusion by concatenating the final activations of both streams channel-wise. We apply a 1x1 convolutional kernel on top of the fused activations. By applying this convolution, we combine the features from both streams across each spatial location where there is high correspondence. The final activations are used to regress bounding boxes, class scores, and overlap estimates, similar to YOLOv2.
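The fusion step described above can be sketched without any deep learning library. The example below is a toy illustration in plain Python, assuming feature maps stored as nested lists indexed `[channel][row][column]`; the helper names and the weights are hypothetical, and an actual implementation would use a tensor library with learned weights.

```python
def concat_channels(spatial, temporal):
    """Channel-wise concatenation of two feature maps shaped [C][H][W]."""
    same_height = len(spatial[0]) == len(temporal[0])
    same_width = len(spatial[0][0]) == len(temporal[0][0])
    if not (same_height and same_width):
        raise ValueError("both streams must share the same spatial size")
    return spatial + temporal

def conv1x1(features, weights, bias):
    """1x1 convolution: a per-location weighted sum over input channels.

    weights is [C_out][C_in] and bias is [C_out].
    """
    c_in = len(features)
    height, width = len(features[0]), len(features[0][0])
    out = []
    for w_row, b in zip(weights, bias):
        channel = [
            [b + sum(w_row[c] * features[c][y][x] for c in range(c_in))
             for x in range(width)]
            for y in range(height)
        ]
        out.append(channel)
    return out

# Toy activations: one channel per stream on a 2x2 grid.
spatial = [[[1.0, 2.0], [3.0, 4.0]]]
temporal = [[[0.5, 0.5], [1.0, 1.0]]]
fused = concat_channels(spatial, temporal)   # two channels
weights = [[1.0, 2.0]]                       # hypothetical weights, one output channel
bias = [0.0]
out = conv1x1(fused, weights, bias)
```

Because the kernel is 1x1, each output value mixes appearance and motion features from the same spatial location only, which is what lets the network learn correspondences between the two streams at that location.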
III-B Integrating Flownet
Previous two-stream approaches for solving action detection use non-trainable optical flow algorithms [5, 6, 7] that are completely separate from their detection model. In contrast, we integrate optical flow computation in our pipeline. This provides two advantages. First, our framework becomes fully trainable end-to-end. Fine-tuning optical flow for the task at hand can be very useful. Sevilla-Lara et al. observe that a CNN trained to optimize the EPE might not be the best representative of motion for the task of action recognition. They propose fine-tuning the optical flow network for action recognition with a weak learning rate and they observe consistent improvements. Motivated by this work, we fine-tune Flownet2 for the task of action detection. Second, integrating Flownet2 in our pipeline leverages the computational power of GPUs, as all we need is a forward pass from the video frames to the final detections. Other methods usually use publicly available CPU implementations of variational optical flow algorithms, which are significantly slower and incur data transfer overhead. While Singh et al. use a less accurate, faster optical flow algorithm called DIS-Fast, Flownet2 has architectures that are faster with matching quality, or that run at the same speed with significantly higher quality, as shown in Table I. We chose to test our model with three variations of Flownet2. The full-stack architecture, Flownet2, is the most accurate but slowest architecture. Flownet2-CSS is a less accurate but faster version. Finally, we test with Flownet2-SD, a relatively small network that is specialized toward small displacements. This model is less accurate than the first two; however, it is significantly faster. We use the open-source implementation and pre-trained models provided by https://github.com/NVIDIA/flownet2-pytorch.
III-C Pre-Training Using Kinetics
One of the challenges that researchers face when working on the task of action detection is the absence of large-scale annotated datasets. Providing bounding boxes for every frame in every video for a large-scale dataset is an extremely difficult task. One of the most successful ways to deal with this kind of problem is through transfer learning. Deep CNN architectures trained on large-scale image classification datasets like ImageNet have shown that they can learn features generic enough such that they can be used for other vision tasks. This suggests that features learned from one task can be transferred to another. It was also observed that the more similar the two tasks are, the better the performance after transfer.
After the release of Kinetics, Carreira and Zisserman studied the effect of pre-training different architectures with Kinetics and then used the pre-trained models to train on smaller datasets (e.g., UCF-101, HMDB) for the same task of action recognition. They report a consistent boost in performance after pre-training; however, the extent of the improvement varies with different architectures. In their study, the transfer should be optimal, as the target and source tasks are the same. Previous efforts for solving action detection usually use network architectures that are pre-trained on image classification using ImageNet or on object detection using Pascal VOC. However, T-CNN uses a pre-trained C3D model that is trained using the UCF-101 action recognition dataset, which is considerably smaller than Kinetics.
The tasks of action recognition and detection are very similar. In fact, action recognition can be considered a subtask of action detection. Similarly, action detection and object detection are also related, mainly through the localization subtask. In order to gain benefit from both tasks and make use of the large-scale Kinetics dataset, we start with YOLOv2 architectures for both streams that are pre-trained on object detection using Pascal VOC. We then train our framework using Kinetics with a weak learning rate in order to preserve some of the features that can help with localization, while fine-tuning for a different classification task.
We evaluate different variations of our architecture with respect to detection performance and runtime:
- Flownet2 provides improvement in both speed and accuracy. To test the quality of Flownet2 against other accurate optical flow algorithms, we replace it with the method of Brox and Malik, an accurate but slow optical flow algorithm.
- Fine-tuning Flownet2 for the task of action detection produces optical flow that is a better representation of the action-related motion in the scene. To validate this idea, we train models with frozen and fine-tuned Flownet2 parameters.
- To investigate transfer learning from the task of action recognition, we train models with and without Kinetics pre-training. For the models that were not pre-trained, we use the parameters trained on object detection using PASCAL VOC.
- Finally, to allow a choice between accuracy and speed, we replace Flownet2 with either Flownet2-SD or Flownet2-CSS and observe how they compare to the full-stack estimator in terms of accuracy and speed.
We use UCF-101 to test our framework. This dataset consists of videos for 101 actions in realistic environments collected from YouTube, and it is mainly used for the task of action recognition. For the action detection task, a subset of 24 actions has been annotated with bounding boxes, comprising 3,207 videos. This is currently the largest dataset available for the task of action detection. While this dataset includes untrimmed videos, we use the trimmed ones, as our framework does not include a temporal localization component. We use split 1 for splitting training and testing data.
IV-B Evaluation Metric
We use frame mean average precision (f-mAP) to evaluate our methods. This computes the area under the precision-recall curve for the frame-level detections. A detection is a true positive if its intersection over union (IoU) with the ground truth exceeds a threshold and its action class is predicted correctly.
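The metric can be sketched as follows. This is a simplified illustration, not the evaluation script used for the reported numbers: it assumes boxes given as `(x1, y1, x2, y2)`, integrates the precision-recall curve without interpolation, and leaves the matching of detections to ground-truth boxes to the caller. All function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(det_box, det_label, gt_box, gt_label, threshold=0.5):
    """A detection counts only if the class matches and IoU exceeds the threshold."""
    return det_label == gt_label and iou(det_box, gt_box) > threshold

def average_precision(detections, num_gt):
    """Area under the precision-recall curve for one class.

    detections: list of (score, is_tp) pairs over all frames.
    num_gt: number of ground-truth boxes of this class.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    ap = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            ap += true_positives / rank   # precision at each new recall level
    return ap / num_gt if num_gt else 0.0

def frame_map(ap_per_class):
    """Mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

Raising the IoU threshold makes the true-positive test stricter, which is why f-mAP is usually reported at several thresholds.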
IV-C Implementation Details
We use PyTorch for all experimentation. For Kinetics pre-training, we initialize both streams using parameters trained on PASCAL VOC. We use the SGD optimizer with a learning rate of 0.0008. We pre-train on Kinetics with optical flow from Flownet2. We train on UCF-101 using the Adam optimizer with a learning rate of and batch size of 32. We observed that the Adam optimizer added more stability when training a multi-task objective. We apply random cropping, HSV distortion, and horizontal flipping for data augmentation. During training, we sample two consecutive frames randomly from each sequence. We scale the images and optical flow to . For fine-tuning all the Flownet2 architectures, we used a learning rate of . We used the pre-computed Brox et al. optical flow provided by https://github.com/gurkirt/realtime-action-detection. For testing, we select the detection box with the highest score in the current frame. We do not apply any post-processing action linking algorithm.
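The test-time selection rule is simple enough to state in a few lines. The sketch below assumes each detection is a dictionary with the hypothetical keys `box`, `label`, and `score`; it illustrates the rule and is not the authors' code.

```python
def top_detection(detections):
    """Return the highest-scoring detection of a frame, or None if the frame has none."""
    if not detections:
        return None
    return max(detections, key=lambda d: d["score"])

# Toy frame with two candidate boxes (labels are placeholders).
frame_detections = [
    {"box": (10, 20, 110, 220), "label": "action_a", "score": 0.62},
    {"box": (15, 25, 105, 210), "label": "action_b", "score": 0.81},
]
best = top_detection(frame_detections)
```

Keeping a single box per frame fits the temporally trimmed setting, where no action linking across frames is applied.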
V-A Ablation Study
We experiment with different variations of our architecture to show the value of our proposals. We report the frame mAP at different IoU thresholds for 8 different models in Table II. First, to study the impact of pre-training using Kinetics, we compare it against models pre-trained using Pascal VOC. We observe a consistent improvement when pre-training with Kinetics, both for networks trained with Brox optical flow, where we notice a 2.5% gain in frame mAP (0.5 threshold), and for networks using Flownet2, where the gain is 4.5%. The difference in the gain can be explained by the fact that we pre-trained on Kinetics using Flownet2. Second, we study the value of fine-tuning Flownet2 for the task of action detection. We compare models with frozen and fine-tuned Flownet2 parameters. We observe an improvement of 2% for models pre-trained with Pascal VOC and 2.5% for models pre-trained using Kinetics. Combining pre-training with fine-tuning Flownet2, we see a gain of . We notice that a model pre-trained with Kinetics and fine-tuned for action detection outperforms all other variations at all IoU thresholds.
Finally, we test with Flownet2-CSS and Flownet2-SD which are faster, less accurate variations of Flownet2. We observe that with pre-training and fine-tuning, these models outperform the Brox optical flow-trained model (Brox + VOC), while being significantly faster. We show the AUC curves for all 8 models we tested in Figure 2.
V-B Comparison with Top Performers
We compare our results with other top performers on the UCF-101-24 dataset, as shown in Table III. It should be noted that out of all reported results, only one variation of the Singh et al. framework runs in real-time ().
We observe that all of our models that use Kinetics pre-training and fine-tuning for Flownet2 variants outperform the other top performers. However, we can only fairly compare our results to Hou et al. , as both our tests use temporally trimmed videos from the UCF-101 dataset. The other methods [21, 18, 14, 16] test on untrimmed videos, as they perform both spatial and temporal detections. While they have an advantage over our framework as linking actions temporally can improve the spatial detections, they also suffer from a disadvantage as they have a greater chance of getting a false positive if they detect an action in a frame where there is no action being performed.
|Method||f-mAP (%)|
|Weinzaepfel et al.||35.84|
|Hou et al.||41.37|
|Peng et al.||65.37|
|Singh et al. RGB + DIS-Fast||65.66¹|
|Singh et al. RGB + Brox||68.31¹|
|Kalogeiton et al.||67.1|
|Brox + Kinetics||73.18|
|Tuned Flownet2 + Kinetics||74.07|
|Tuned Flownet2-CSS + Kinetics||72.13|
|Tuned Flownet2-SD + Kinetics||71.67|
¹ As reported in https://github.com/gurkirt/realtime-action-detection
: untrimmed videos. : trimmed videos. : real-time .
V-C Detection Runtime
We propose an end-to-end trainable pipeline. Integrating the flow computation in our framework using Flownet2 improves the utilization of compute resources: we can make the best use of GPU parallelization and avoid the memory transfer overhead that arises when the framework is separated into two parts. The frames per second (fps) rates for our architectures are shown in Table IV. We used an NVIDIA GTX Titan X GPU for testing the runtime speed, which is the same card used for previously proposed work on real-time action detection . We test using batch sizes of 1 and 4. With a batch size of 1 (online), the system has no latency. If a small latency is acceptable, we can buffer the input frames to use a batch size of 4, which improves the fps rate. We compare our results to Singh et al. , the only real-time method for action detection. However, in their reported runtime, they do not account for the overhead caused by transferring the optical flow computed using DIS-Fast to their two-stream SSD networks. Nevertheless, our model using Flownet2-SD is the fastest, achieving 25 fps with no latency or 31 fps with minimal latency.
|Model||batch size = 1||batch size = 4|
|Singh et al.  RGB+DIS-Fast||-||28|
|Tuned Flownet2 + Kinetics||12||15|
|Tuned Flownet2-CSS + Kinetics||17||21|
|Tuned Flownet2-SD + Kinetics||25||31|
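The trade-off between the two batch sizes can be illustrated with a small buffering sketch. It is a simplified model and not the authors' runtime code: the function names are hypothetical, and the 30 fps camera rate used in the example is an assumption, not a figure from the experiments.

```python
def batched(frames, batch_size):
    """Group an incoming frame stream into lists of at most batch_size frames."""
    batch = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch   # flush the final, possibly partial, batch

def buffering_delay(batch_size, camera_fps):
    """Seconds the first frame of a batch waits before the batch is full."""
    return (batch_size - 1) / camera_fps

batches = list(batched(range(6), batch_size=4))
online_delay = buffering_delay(1, camera_fps=30.0)     # batch size 1: no buffering
buffered_delay = buffering_delay(4, camera_fps=30.0)   # batch size 4: three frame intervals
```

With a batch size of 1 each frame is processed as soon as it arrives, whereas a batch of 4 trades a short wait for better GPU utilization and a higher overall frame rate.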
In this work, we propose a real-time, end-to-end trainable two-stream network for action detection by generalizing the YOLOv2 network architecture. We train two-stream YOLOv2 networks jointly to learn complementary features between the appearance and motion streams. We show that transfer learning from the task of action recognition to action detection introduces a boost in performance. Additionally, fine-tuning a trainable optical flow estimator for the task of action detection results in a better representation for the action-related motion in the scene, improving our model’s performance. Finally, we show that by integrating the optical flow computation and training end-to-end, our framework runs in real-time (), faster than all previous methods.
We would like to thank Brendan Duke of the Machine Learning Research Group at the University of Guelph for his help with training on the Kinetics dataset and for helpful suggestions toward improving the manuscript.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. CoRR, 2014.
- Sevilla-Lara et al.  Laura Sevilla-Lara, Yiyi Liao, Fatma Guney, Varun Jampani, Andreas Geiger, and Michael J. Black. On the integration of optical flow and action recognition, 2017.
- Kay et al.  Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
- Redmon and Farhadi  Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6517–6525, 2017. doi: 10.1109/CVPR.2017.690.
- Farnebäck  Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Proceedings of the 13th Scandinavian Conference on Image Analysis, SCIA’03, pages 363–370, Berlin, Heidelberg, 2003. Springer-Verlag. ISBN 3-540-40601-8.
- Zach et al.  C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime tv-l1 optical flow. In Proceedings of the 29th DAGM Conference on Pattern Recognition, pages 214–223, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-74933-2.
- Brox and Malik  T. Brox and J. Malik. Large displacement optical flow: Descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, March 2011. ISSN 0162-8828. doi: 10.1109/TPAMI.2010.143.
- Ilg et al.  E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
- Soomro et al.  Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012.
- Szegedy et al.  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
- Girshick et al.  Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
- Gkioxari and Malik  Georgia Gkioxari and Jitendra Malik. Finding action tubes. CoRR, abs/1411.6031, 2014.
- Weinzaepfel et al.  Philippe Weinzaepfel, Zaïd Harchaoui, and Cordelia Schmid. Learning to track for spatio-temporal action localization. CoRR, abs/1506.01929, 2015.
- Zitnick and Dollar  Larry Zitnick and Piotr Dollar. Edge boxes: Locating object proposals from edges. In ECCV. European Conference on Computer Vision, September 2014.
- Peng and Schmid  Xiaojiang Peng and Cordelia Schmid. Multi-region Two-Stream R-CNN for Action Detection, pages 744–759. Springer International Publishing, Cham, 2016. ISBN 978-3-319-46493-0. doi: 10.1007/978-3-319-46493-0_45.
- Ren et al.  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.
- Singh et al.  Gurkirt Singh, Suman Saha, Michael Sapienza, Philip Torr, and Fabio Cuzzolin. Online real time multiple spatiotemporal action localisation and prediction. 2017.
- Liu et al.  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.
- Kroeger et al.  Till Kroeger, Radu Timofte, Dengxin Dai, and Luc J. Van Gool. Fast optical flow using dense inverse search. CoRR, abs/1603.03590, 2016.
- Kalogeiton et al.  Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. CoRR, abs/1705.01861, 2017.
- Hou et al.  Rui Hou, Chen Chen, and Mubarak Shah. Tube convolutional neural network (T-CNN) for action detection in videos. CoRR, abs/1703.10664, 2017.
- Horn and Schunck  Berthold K. P. Horn and Brian G. Schunck. Determining optical flow. Artificial Intelligence, 17:185–203, 1981.
- Ranjan and Black  Anurag Ranjan and Michael J. Black. Optical flow estimation using a spatial pyramid network. CoRR, abs/1611.00850, 2016.
- Dosovitskiy et al.  A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. v. d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2758–2766, Dec 2015. doi: 10.1109/ICCV.2015.316.
- Redmon et al.  Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
- Deng et al.  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- Carreira and Zisserman  Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. CoRR, abs/1705.07750, 2017.
- Everingham et al.  M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, January 2015.
- Tran et al.  Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
- Paszke et al.  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.