Real-Time End-to-End Action Detection with Two-Stream Networks

02/23/2018 ∙ by Alaaeldin El-Nouby, et al. ∙ University of Guelph

Two-stream networks have been very successful for solving the problem of action detection. However, prior work using two-stream networks trains both streams separately, which prevents the network from exploiting regularities between the two streams. Moreover, unlike the visual stream, the dominant forms of optical flow computation typically do not maximally exploit GPU parallelism. We present a real-time end-to-end trainable two-stream network for action detection. First, we integrate the optical flow computation in our framework by using Flownet2. Second, we apply early fusion for the two streams and train the whole pipeline jointly end-to-end. Finally, for better network initialization, we transfer from the task of action recognition to action detection by pre-training our framework using the recently released large-scale Kinetics dataset. Our experimental results show that training the pipeline jointly end-to-end, with the optical flow fine-tuned for the objective of action detection, improves detection performance significantly. Additionally, we observe an improvement when initializing with parameters pre-trained using Kinetics. Last, we show that by integrating the optical flow computation, our framework is more efficient, running at real-time speeds (up to 31 fps).




I Introduction

Human spatial action localization and classification in videos are challenging tasks that are key to better video understanding. Action detection is especially challenging, as it requires localizing the actor in the scene, as well as classifying the action. This is done for every frame in a video with little or no context. In contrast, a related task is action recognition, which uses signals from all video frames to predict the action. Action detection has important applications, such as surveillance and human-robot interaction. However, most current approaches are computationally expensive and are far from real-time performance, which limits their usage in real life applications.

Understanding actions in videos has been an active area of research in recent years. Following the success of deep convolutional neural networks (CNNs) on the task of image classification, researchers have used CNNs for the tasks of action recognition and localization. For image classification, appearance is typically the only cue available, represented by RGB pixel values. Videos provide an extra signal: motion. Researchers have worked on many different ways to model motion cues, including 3D CNNs and recurrent neural networks. One of the most successful approaches is the two-stream network [1], which usually consists of a spatial network that models appearance, whose input is RGB frames, and a temporal network that models motion. Optical flow is often chosen as input to the temporal network; however, other inputs can be used, such as dense trajectories. While adding the temporal stream often improves the model, it adds complexity, as optical flow is usually computed using a third-party algorithm, which works separately from the RGB stream. This limits parallelization and full utilization of compute resources like GPUs, and it adds memory overhead. Also, using a third-party algorithm prevents the model from being trainable end-to-end, so the visual and motion pathways cannot learn to coordinate. Finally, as shown in [2], optical flow algorithms optimize the end-point-error (EPE), which does not necessarily align with the objective for action detection.

One of the challenges of the action detection task is the absence of large-scale annotated datasets. This problem forces researchers to work with relatively shallow architectures or use an architecture that is pre-trained on the image classification task. Only recently have large-scale datasets for action recognition emerged, such as Kinetics [3]. Pre-training on a large-scale dataset for action recognition should transfer well to the task of action localization.

To the best of our knowledge, all past efforts that used two-stream networks for action detection trained the two streams separately. The predictions from both streams were then fused using a fusion algorithm. Training the two streams separately prevents the model from exploiting dependencies between the appearance and motion cues. On the other hand, training the two networks jointly on a small action localization dataset might lead to overfitting, as the model has a very high capacity compared to the amount of labeled data. Pre-training on Kinetics should mitigate this overfitting problem.

In this work, we propose an end-to-end trainable framework for real-time spatial action detection. Following the advances in real-time object detection, we build our framework with motivation from YOLOv2 [4], the state-of-the-art real-time object detector. We generalize its architecture to a two-stream network architecture for action detection. Instead of training each stream separately, we train both streams jointly by fusing the final activations from each stream and applying a convolutional layer to produce the final prediction.

We replace the usual third party algorithms used for computing optical flow [5, 6, 7] with a trainable neural network. We use Flownet2 [8] for optical flow computation and integrate it in our architecture at the beginning of the temporal stream. Using Flownet2 has two advantages: first, the framework becomes end-to-end trainable. While Flownet2 is trained to optimize the EPE, the computed optical flow might not be optimal for the objective of action detection. Fine-tuning Flownet2 for the task of action detection should result in better optical flow for our objective [2]. Secondly, while other efforts on action detection usually use implementations of optical flow algorithms that are totally separate from the model, integrating the optical flow computation in the network improves the computational speed of the framework, as it makes better use of parallelization and reduces the data transfer overhead.

Finally, to address the overfitting problem that may be caused by the use of small-scale datasets or by training the two streams jointly, we pre-train our model for the task of action recognition on Kinetics. The pre-trained model is then trained on the task of action detection with a low learning rate to preserve a relatively generic feature initialization and to prevent overfitting.

We test our framework using UCF-101-24 [9], a realistic and challenging dataset for action localization. We use temporally trimmed videos as our framework does not yet include temporal localization.

II Related Work

In recent years, deep CNNs have been very successful for computer vision tasks. Specifically, they have shown great improvements for the tasks of image classification [10, 11] and object detection [12, 4] when compared to traditional hand-crafted methods. Studying actions in videos has been an active area of research. Videos provide two types of information: appearance, which is what exists in static images or individual frames of video, and motion. Researchers have used different approaches for modeling motion, including two-stream networks and 3D-CNNs.

Two-stream networks [1] have been one of the most successful approaches for modeling motion for the tasks of action recognition and detection. In this approach, the network is designed as two feed-forward pathways: a spatial stream for modeling appearance and a temporal stream for modeling motion. While RGB images are a good representation of appearance information, optical flow is a good representation for motion. The spatial and temporal streams take RGB frames and optical flow as inputs, respectively. Many efforts for solving the action detection problem have followed this approach. Gkioxari and Malik [13], motivated by R-CNNs [12], use selective search to find region proposals. They use two separate CNNs (appearance and motion) for feature extraction. These features are fed to a Support Vector Machine (SVM) to predict action classes. Region proposals are linked using the Viterbi algorithm.

Weinzaepfel et al. [14] obtain frame-level region proposals using EdgeBox [15]. The frames are then linked by tracking high-scoring proposals using a tracking-by-detection approach, which uses two separate CNNs for modeling appearance and motion, and an SVM classifier similar to [12]. Peng and Schmid [16], motivated by Faster R-CNN [17], use region proposal networks (RPNs) to find frame-level region proposals. They use a motion RPN to obtain high-quality proposals and show that it is complementary to an appearance RPN. They also show that stacking optical flow from multiple frames improves the motion R-CNN. Region proposals are then linked using the Viterbi algorithm. Both appearance and motion streams are trained separately. Singh et al. [18] proposed the first deep learning-based approach to address real-time performance for action detection. They use the single shot detector (SSD) [19], which is a real-time object detector. They also employ a real-time, but less accurate, optical flow computation [20]. Combining these two components, they achieve a rate of 28 fps (see Table IV). They propose a novel greedy algorithm for online incremental action linking across the temporal dimension. While this work is significantly faster than previous efforts, it sacrifices accuracy for speed by using a less accurate flow computation. Kalogeiton et al. [21] propose generalizing the anchor box regression method used by Faster R-CNN [17] and SSD [19] to anchor cuboids, which consist of a sequence of bounding boxes over time. They take a fixed number of frames as input. Then, feature maps from all frames in this sequence are used to regress and find scores for anchor cuboids. At test time, the anchor cuboids are linked to create tubelets, which do not have a fixed temporal extent. While most methods for solving the action detection problem followed the two-stream approach, Hou et al. [22] use 3D-CNNs. They suggest generalizing R-CNN to videos by designing a tube CNN (T-CNN). Instead of obtaining frame-level action proposals and using a post-processing algorithm to link actions temporally to form action tubes, T-CNN learns the action tubes directly from RGB frames.

Optical flow estimation has been dominated by variational approaches that follow [23]. Recently, however, approaches that use deep CNNs for optical flow estimation [24, 25, 8] have shown promise. Flownet [25] is the first end-to-end trainable deep CNN for optical flow estimation. It is trained using synthetic data to optimize EPE. The authors provide two architectures to estimate optical flow. The first is a standard CNN that takes the concatenated channels from two consecutive frames and predicts the flow directly. The second is a two-stream architecture that attempts to find a good representation for each image before they are combined by a correlation layer. However, Flownet falls behind other top methods due to inaccuracies with small displacements present in realistic data. Flownet2 [8] addresses this problem by introducing a stacked architecture, which includes a subnetwork that is specialized to small displacements. It achieves more than 50% improvement in EPE compared to Flownet. Having a trainable network for estimating optical flow can be very useful, especially when integrated with other tasks. Sevilla-Lara et al. [2] studied the integration of the trainable optical flow networks Flownet [25] and Spynet [24] on the task of action recognition. Among their conclusions, they found that fine-tuning optical flow networks for the objective of action recognition consistently improved performance.
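Since EPE is the quantity these flow networks are trained to minimize, a small worked example may help. The sketch below uses the standard definition of average end-point error (the mean Euclidean distance between predicted and ground-truth flow vectors); the function name and the toy flow values are illustrative assumptions, not taken from the paper.

```python
import math


def endpoint_error(pred, gt):
    """Average end-point error between two flow fields.

    pred, gt: equal-length lists of (u, v) displacement pairs, one per pixel.
    Hypothetical helper for illustration only.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("flow fields must be non-empty and the same size")
    total = 0.0
    for (u, v), (u_gt, v_gt) in zip(pred, gt):
        # Euclidean distance between the predicted and true displacement.
        total += math.hypot(u - u_gt, v - v_gt)
    return total / len(pred)


# Toy three-pixel example: per-pixel errors are 0, 2, and 5.
pred_flow = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
gt_flow = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
epe = endpoint_error(pred_flow, gt_flow)
print(round(epe, 4))
```

A low EPE means the flow is geometrically accurate everywhere; it says nothing about whether the errors fall on the regions that matter for recognizing the action, which is the mismatch that fine-tuning is meant to address.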

III Methodology

Fig. 1: Our framework takes a sequence of video frames as input. (a) Flownet2 is used to estimate optical flow, which is input to the motion stream. (b) The two streams follow the YOLOv2 architecture. (c) We apply early fusion by concatenating the activations from both streams channel-wise and then applying a 1x1 convolutional kernel on the fused activations. (d) Finally, similar to YOLOv2, the final feature maps are used to regress bounding boxes, class scores, and overlap estimates.

We propose a framework for efficient and accurate action detection, as outlined in Figure 1. We follow the two-stream network architecture [1] and integrate optical flow computation in our framework by using Flownet2 as input to the motion stream. We build each stream on YOLOv2 [4]. In contrast to previous methods, instead of training each stream separately, we apply early fusion and train both streams jointly. Finally, the fused feature maps are used to regress bounding boxes, class scores, and overlap estimates, similar to YOLOv2.

III-A Two-Stream YOLOv2 with Early Fusion

YOLO [26] is a real-time object detector. While there have been many successful object detection methods, such as R-CNN [12], these methods rely on extracting region proposals for candidate objects, either by an external algorithm like Selective Search or EdgeBox, or by an RPN. These proposals are then fed to a CNN to extract features and predict object classes. In contrast, YOLO defines object detection as a regression problem. A single network predicts both the spatial bounding boxes and their associated object classes. This design enables end-to-end training and optimization, and it allows YOLO to run in real time. Compared to R-CNN [12], YOLO uses the entire image to predict objects and their locations, meaning that it encodes appearance as well as contextual information about object classes. This is critical for the task of action detection, as context is an important cue for which action class is present in the scene (e.g., surfing is associated with sea, skiing is associated with snow). YOLOv2 is an improved version of YOLO, which adopts the anchor box idea that is used by Faster R-CNN [17] and SSD [19]. A pass-through layer is added, which brings high-resolution features from early layers of the network to the final low-resolution layers. This layer improves the performance on small-scale objects that the previous version struggled with. Moreover, YOLOv2 is even faster, as it maintains high accuracy with small-scale images. The fully-connected layer was removed, which makes the network completely convolutional and reduces the number of parameters. We built our framework on YOLOv2, as it is the best fit for our objective, running at above real-time speeds while maintaining state-of-the-art accuracy. Moreover, it encodes better contextual information, which is critical for the task of action detection. We use an open-source implementation and its pre-trained models.

In contrast to previous efforts, we train both input streams jointly. Training the two streams independently prevents the networks from learning complementary features. Associating appearance and motion cues can be very useful for identifying the action in the scene. We apply early fusion by concatenating the final activations of both streams channel-wise. We apply a 1x1 convolutional kernel on top of the fused activations. By applying this convolution, we combine the features from both streams across each spatial location where there is high correspondence. The final activations are used to regress bounding boxes, class scores, and overlap estimates, similar to YOLOv2.
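The fusion step described above can be sketched in a few lines. The example below is a minimal pure-Python illustration, assuming feature maps stored as nested lists in [channel][row][column] order; the helper names, the toy shapes, and the weights are hypothetical and do not reflect the actual PyTorch implementation or the real layer sizes.

```python
def concat_channels(appearance, motion):
    """Concatenate two [C][H][W] feature maps along the channel axis."""
    if (len(appearance[0]) != len(motion[0])
            or len(appearance[0][0]) != len(motion[0][0])):
        raise ValueError("both streams must share the same spatial size")
    return appearance + motion


def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution: a per-location linear map across channels.

    x: [C_in][H][W], weight: [C_out][C_in], bias: [C_out].
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    out = []
    for o, row in enumerate(weight):
        if len(row) != c_in:
            raise ValueError("weight does not match the input channels")
        out.append([
            [bias[o] + sum(row[c] * x[c][i][j] for c in range(c_in))
             for j in range(width)]
            for i in range(height)
        ])
    return out


# Toy activations: 2 appearance channels and 1 motion channel on a 1x2 grid.
appearance = [[[1.0, 2.0]], [[3.0, 4.0]]]
motion = [[[5.0, 6.0]]]
fused = concat_channels(appearance, motion)

# Hypothetical 1x1 kernel mapping 3 fused channels to 2 output channels.
weight = [[1.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
bias = [0.0, 1.0]
fused_out = conv1x1(fused, weight, bias)
print(fused_out)
```

Because the kernel is 1x1, each output value mixes appearance and motion channels at the same spatial location only, which is why the two streams must produce feature maps of the same spatial size before fusion.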

III-B Integrating Flownet

Previous two-stream approaches for solving action detection use non-trainable optical flow algorithms [5, 6, 7] that are completely separate from their detection model. In contrast, we integrate optical flow computation in our pipeline. This provides two advantages. Firstly, our framework becomes fully trainable end-to-end. Fine-tuning optical flow for the task at hand can be very useful. Sevilla-Lara et al. [2] observe that a CNN trained to optimize the EPE might not be the best representative of motion for the task of action recognition. They propose fine-tuning the optical flow network for action recognition with a low learning rate and they observe consistent improvements. Motivated by this work, we fine-tune Flownet2 for the task of action detection. Secondly, integrating Flownet2 in our pipeline leverages the computational power of GPUs, as all we need is a forward pass starting from the video frames to the final detections. Other methods usually use publicly available CPU implementations of variational optical flow algorithms, which are significantly slower and incur data transfer overhead. While [18] uses a less accurate, faster optical flow algorithm called DIS-Fast [20], Flownet2 has architectures that are faster with matching quality, or that run at the same speed with significantly higher quality, as shown in Table I. We chose to test our model with three variations of Flownet2. The full-stack architecture, Flownet2, is the most accurate but slowest architecture. Flownet2-CSS is a less accurate but faster version. Finally, we test with Flownet2-SD, a relatively small network that is specialized toward small displacements. This model is less accurate than the first two; however, it is significantly faster. We use an open-source implementation and its pre-trained models.

Method Sintel Final AEE (Train) Runtime (ms per frame)
DIS-Fast [20] (CPU) 6.31 70
FlownetS [25] (GPU) 5.45 18
Flownet2-CSS [8] (GPU) 3.55 69
Flownet2 [8] (GPU) 3.14 123

TABLE I: Average Endpoint Error (AEE) and Runtime comparison of different variations of Flownet and DIS-Fast, as reported in [8]

III-C Pre-Training Using Kinetics

One of the challenges that researchers face when working on the task of action detection is the absence of large-scale annotated datasets. Providing bounding boxes for every frame in every video for a large-scale dataset is an extremely difficult task. One of the most successful ways to deal with this kind of problem is through transfer learning. Deep CNN architectures trained on large-scale image classification datasets like ImageNet [27] have shown that they can learn features generic enough such that they can be used for other vision tasks. This suggests that features learned from one task can be transferred to another. It was also observed that the more similar the two tasks are, the better the performance after transfer.

After the release of Kinetics [3], Carreira and Zisserman [28] studied the effect of pre-training different architectures with Kinetics and then used the pre-trained model to train on smaller datasets (e.g., UCF-101, HMDB) for the same task of action recognition. They report a consistent boost in performance after pre-training; however, the extent of the improvement varies with different architectures. In that study, the transfer should be optimal, as the target and source tasks are the same. Previous efforts for solving action detection usually use network architectures pre-trained on image classification using ImageNet or pre-trained on object detection using Pascal VOC [29]. However, T-CNN [22] uses a pre-trained C3D model [30] that is trained using the UCF-101 action recognition dataset, which is considerably smaller than Kinetics.

The tasks of action recognition and detection are very similar. In fact, action recognition can be considered a subtask of action detection. Similarly, action detection and object detection are also related, mainly through the localization subtask. In order to benefit from both tasks and make use of the large-scale Kinetics dataset, we start with YOLOv2 architectures for both streams that are pre-trained on object detection using Pascal VOC. We then train our framework using Kinetics with a low learning rate in order to preserve some of the features that can help with localization, while fine-tuning for a different classification task.

IV Experiments

We evaluate different variations of our architecture with respect to detection performance and runtime:

  • Flownet2 provides improvement in both speed and accuracy. Therefore, to test the quality of Flownet2 compared to other accurate optical flow algorithms, we substitute the method of Brox and Malik [7], an accurate but slow optical flow algorithm.

  • Fine-tuning Flownet2 for the task of action detection produces optical flow that is a better representation of the action-related motion in the scene. To validate this idea, we train models with frozen and fine-tuned Flownet2 parameters.

  • To investigate transfer learning from the task of activity recognition, we train models with and without Kinetics pre-training. For the models that were not pre-trained, we use the parameters trained on object detection using PASCAL VOC.

  • Finally, to have the ability to choose between accuracy and speed, we substitute Flownet2 with either Flownet2-SD or Flownet2-CSS, observing how they compare in terms of accuracy and speed to the full-stack estimator.

IV-A Dataset

We use UCF-101 to test our framework. This dataset consists of videos for 101 actions in realistic environments collected from YouTube, and it is mainly used for the task of action recognition. For the action detection task, a subset of 24 actions has been annotated with bounding boxes, consisting of 3,207 videos. This is currently the largest dataset available for the task of action detection. While this dataset includes untrimmed videos, we use the trimmed ones, as our framework does not include a temporal localization component. We use split 1 for splitting training and testing data.

IV-B Evaluation Metric

We use frame mean average precision (f-mAP) to evaluate our methods. This computes the area under the precision recall curve for the frame-level detections. A true positive is a detection that has an intersection over union (IoU) more than a threshold with the ground truth, and the action class is predicted correctly.
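The metric can be made concrete with a short sketch. The code below assumes boxes given as (x1, y1, x2, y2) corner coordinates and computes average precision as a step-wise area under the precision-recall curve; the box format, the function names, and the integration rule are assumptions for illustration, since the paper does not specify the exact evaluation code, and matching of several detections to one ground-truth box is not modeled here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(det_box, det_label, gt_box, gt_label, threshold=0.5):
    """A detection counts only if the class matches and IoU exceeds the threshold."""
    return det_label == gt_label and iou(det_box, gt_box) > threshold


def average_precision(scored_hits, num_gt):
    """Area under the precision-recall curve for one class.

    scored_hits: list of (score, is_tp) pairs for frame-level detections.
    num_gt: number of ground-truth boxes for this class.
    """
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    true_positives = 0
    ap = 0.0
    prev_recall = 0.0
    for rank, (_, hit) in enumerate(ranked, start=1):
        if hit:
            true_positives += 1
            recall = true_positives / num_gt
            precision = true_positives / rank
            # Accumulate the area of the step added at this recall level.
            ap += precision * (recall - prev_recall)
            prev_recall = recall
    return ap


overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
hit = is_true_positive((0, 0, 10, 10), "Diving", (0, 0, 10, 8), "Diving")
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
print(round(overlap, 3), hit, round(ap, 3))
```

The f-mAP is then the mean of the per-class average precision over the 24 annotated action classes.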

IV-C Implementation Details

We use PyTorch [31] for all experiments. For Kinetics pre-training, we initialize both streams using parameters trained on PASCAL VOC and use the SGD optimizer with a learning rate of 0.0008. We pre-train on Kinetics with optical flow from Flownet2. We train on UCF-101 using the Adam optimizer with a batch size of 32. We observed that the Adam optimizer added more stability when training a multi-task objective. We apply random cropping, HSV distortion, and horizontal flipping for data augmentation. During training, we sample two consecutive frames randomly from each sequence. We rescale the images and optical flow to a fixed input size. We fine-tune all the Flownet2 architectures with a separate learning rate. We used pre-computed Brox et al. optical flow. For testing, we select the detection box with the highest score in the current frame. We do not apply any post-processing action linking algorithm.
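The test-time rule of keeping only the highest-scoring box per frame can be sketched as follows. The dictionary layout of a detection ("box", "label", "score") is a hypothetical choice made for this illustration and is not the format used by the actual implementation.

```python
def select_top_detection(detections):
    """Return the single highest-scoring detection for a frame, or None."""
    if not detections:
        return None
    return max(detections, key=lambda d: d["score"])


# Hypothetical detections for one frame.
frame_detections = [
    {"box": (12, 30, 80, 140), "label": "HorseRiding", "score": 0.62},
    {"box": (10, 28, 84, 150), "label": "HorseRiding", "score": 0.91},
    {"box": (200, 40, 240, 90), "label": "PoleVault", "score": 0.15},
]
best = select_top_detection(frame_detections)
print(best["label"], best["score"])
```

Keeping one box per frame is a natural fit for trimmed clips with a single dominant actor; frames with several actors would need a score threshold plus non-maximum suppression instead.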

V Results

V-A Ablation Study

We experiment with different variations of our architecture to show the value of our proposals. We report the frame mAP at different IoU thresholds for 8 different models in Table II. First, to study the impact of pre-training using Kinetics, we compare it against models pre-trained using Pascal VOC. We observe a consistent improvement when pre-training with Kinetics, both for networks trained with Brox optical flow, where we notice a gain of 2.5 percentage points in frame mAP (0.5 threshold), and for networks using Flownet2, where the gain is 4.5 points. The difference in the gain can be explained by the fact that we pre-trained on Kinetics using Flownet2. Second, we study the value of fine-tuning Flownet2 for the task of action detection. We compare models with frozen and fine-tuned Flownet2 parameters. We observe an improvement of 2 points for models pre-trained with Pascal VOC and 2.5 points for models pre-trained using Kinetics. Combining pre-training with fine-tuning Flownet2, we see a gain of about 7 points (from 66.97 to 74.07 at the 0.5 threshold). We notice that a model pre-trained with Kinetics and fine-tuned for action detection outperforms all other variations at all IoU thresholds.

Finally, we test with Flownet2-CSS and Flownet2-SD, which are faster, less accurate variations of Flownet2. We observe that with pre-training and fine-tuning, these models outperform the Brox optical flow-trained model (Brox + VOC) at the 0.2 and 0.5 IoU thresholds, while being significantly faster. We show the AUC curves for all 8 models we tested in Figure 2.
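The gains quoted in this ablation can be recomputed directly from the f-mAP values at the 0.5 IoU threshold in Table II. The short script below does only that arithmetic; the dictionary and helper names are illustrative.

```python
# f-mAP at IoU = 0.5, copied from Table II.
fmap_05 = {
    "Brox + VOC": 70.64,
    "Brox + Kinetics": 73.18,
    "Flownet2 + VOC": 66.97,
    "Flownet2 + Kinetics": 71.51,
    "Tuned Flownet2 + VOC": 69.03,
    "Tuned Flownet2 + Kinetics": 74.07,
}


def gain(better, baseline):
    """Difference in f-mAP (percentage points) between two models."""
    return round(fmap_05[better] - fmap_05[baseline], 2)


kinetics_gain_brox = gain("Brox + Kinetics", "Brox + VOC")
kinetics_gain_flownet2 = gain("Flownet2 + Kinetics", "Flownet2 + VOC")
tuning_gain_voc = gain("Tuned Flownet2 + VOC", "Flownet2 + VOC")
tuning_gain_kinetics = gain("Tuned Flownet2 + Kinetics", "Flownet2 + Kinetics")
combined_gain = gain("Tuned Flownet2 + Kinetics", "Flownet2 + VOC")

print(kinetics_gain_brox, kinetics_gain_flownet2)
print(tuning_gain_voc, tuning_gain_kinetics, combined_gain)
```

The combined gain is close to the sum of the separate pre-training and fine-tuning gains, which suggests the two improvements are largely complementary.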

Model IoU = 0.2 IoU = 0.5 IoU = 0.75
Brox + VOC 77.93 70.64 32.73
Brox + Kinetics 80.24 73.18 33.81
Flownet2 + VOC 75.43 66.97 28.57
Flownet2 + Kinetics 79.41 71.51 32.83
Tuned Flownet2 + VOC 76.69 69.03 31.88
Tuned Flownet2 + Kinetics 81.31 74.07 34.41
Tuned Flownet2-CSS + Kinetics 79.90 72.13 32.24
Tuned Flownet2-SD + Kinetics 78.86 71.67 33.39

TABLE II: Comparison of variants of our architecture using f-mAP. We test with different IoU thresholds
Fig. 2: AUC plot for UCF-101-24 dataset using variations of our architecture.

Fig. 3: Action detection results for four action classes from the UCF-101 dataset using a model pre-trained using Kinetics, and using tuned Flownet2 optical flow as input. Panel labels include Horse Riding, Pole Vaulting, and Cliff Diving.

V-B Comparison with Top Performers

We compare our results with other top performers on the UCF-101-24 dataset, as shown in Table III. It should be noted that, out of all previously reported results, only one variation of the Singh et al. framework runs in real time (28 fps; see Table IV).

We observe that all of our models that use Kinetics pre-training and fine-tuning for Flownet2 variants outperform the other top performers. However, we can only fairly compare our results to Hou et al. [22], as both our tests use temporally trimmed videos from the UCF-101 dataset. The other methods [21, 18, 14, 16] test on untrimmed videos, as they perform both spatial and temporal detections. While they have an advantage over our framework as linking actions temporally can improve the spatial detections, they also suffer from a disadvantage as they have a greater chance of getting a false positive if they detect an action in a frame where there is no action being performed.

Model IoU = 0.5
Weinzaepfel et al. [14] 35.84
Hou et al. [22] 41.37
Peng et al. [16] 65.37
Singh et al. [18] RGB + DIS-Fast 65.66
Singh et al. [18] RGB + Brox 68.31
Kalogeiton et al.[21] 67.1
Brox + Kinetics 73.18
Tuned Flownet2 + Kinetics 74.07
Tuned Flownet2-CSS + Kinetics 72.13
Tuned Flownet2-SD + Kinetics 71.67

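Weinzaepfel et al. [14], Peng et al. [16], Singh et al. [18], and Kalogeiton et al. [21] test on untrimmed videos; Hou et al. [22] and our models test on trimmed videos.

TABLE III: Comparison of the f-mAP with other top performers using an IoU threshold of 0.5.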

V-C Detection Runtime

We propose an end-to-end trainable pipeline. Integrating the flow computation in our framework using Flownet2 improves the utilization of compute resources: we make better use of GPU parallelization and avoid the memory transfer overhead that arises when the framework is separated into two parts. The frames per second (fps) rates for our architectures are shown in Table IV. We used an NVIDIA GTX Titan X GPU for testing the runtime speed, which is the same card used in previous work on real-time action detection [18]. We test using batch sizes of 1 and 4. With a batch size of 1 (online), the system has no latency. If a small latency is acceptable, we can buffer the input frames to use a batch size of 4, which improves the fps rate. We compare our results to Singh et al. [18], the only other real-time method for action detection. However, in their reported runtime, they do not account for the overhead caused by transferring the optical flow computed using DIS-Fast to their two-stream SSD networks. Nevertheless, our model using Flownet2-SD is the fastest, achieving 25 fps with no latency or 31 fps with minimal latency.

Model batch size = 1 batch size = 4
Singh et al. [18] RGB+DIS-Fast - 28
Tuned Flownet2 + Kinetics 12 15
Tuned Flownet2-CSS + Kinetics 17 21
Tuned Flownet2-SD + Kinetics 25 31
TABLE IV: Frames per second rate of our models compared to the other reported real-time method.
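The trade-off between online and batched processing in Table IV can be expressed as simple arithmetic. The sketch below converts the reported fps rates into per-frame and per-batch processing times; it assumes the batch-size-4 column counts total frames processed per second, and the helper names are illustrative.

```python
def ms_per_frame(fps):
    """Average processing time per frame, in milliseconds."""
    return 1000.0 / fps


def batch_time_ms(fps, batch_size):
    """Time to process one full batch at the given throughput, in milliseconds."""
    return batch_size * 1000.0 / fps


# (fps at batch size 1, fps at batch size 4), copied from Table IV.
fps_table = {
    "Tuned Flownet2 + Kinetics": (12, 15),
    "Tuned Flownet2-CSS + Kinetics": (17, 21),
    "Tuned Flownet2-SD + Kinetics": (25, 31),
}

for model, (fps_online, fps_batched) in fps_table.items():
    speedup = fps_batched / fps_online
    print(model, round(ms_per_frame(fps_online), 1),
          round(batch_time_ms(fps_batched, 4), 1), round(speedup, 2))

sd_online_ms = ms_per_frame(25)
sd_batch_ms = batch_time_ms(31, 4)
```

For Flownet2-SD, online processing takes 40 ms per frame, while a batch of 4 takes roughly 129 ms to process; the extra delay a viewer experiences also depends on how long it takes the camera to deliver four frames, which is not reported here.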

VI Conclusion

In this work, we propose a real-time, end-to-end trainable two-stream network for action detection by generalizing the YOLOv2 network architecture. We train two-stream YOLOv2 networks jointly to learn complementary features between the appearance and motion streams. We show that transfer learning from the task of action recognition to action detection introduces a boost in performance. Additionally, fine-tuning a trainable optical flow estimator for the task of action detection results in a better representation for the action-related motion in the scene, improving our model's performance. Finally, we show that by integrating the optical flow computation and training end-to-end, our framework runs in real time (up to 31 fps), faster than all previous methods.


We would like to thank Brendan Duke of the Machine Learning Research Group at the University of Guelph for his help with training on the Kinetics dataset and helpful suggestions toward improving the manuscript.