Fast and accurate video object detection methods are highly valuable in a vast number of scenarios. Single-image object detectors like Faster R-CNN and R-FCN have achieved excellent accuracy on still images, so it is natural to apply them to video tasks. One intuitive way is to apply them frame by frame on videos, but this is far from optimal. First, image detectors typically involve a heavy feature network like ResNet-101, which runs rather slowly (5 fps) even on GPUs. This hampers their potential in real-time applications like autonomous driving and video surveillance. Second, single-image detectors are vulnerable to the common image degeneration problem in videos. As shown in Figure 2, frames may suffer from defocus, motion blur, strange object positions and all sorts of deterioration, leaving visual clues too weak for successful detection. These two problems make object detection in videos challenging.
Feature-level methods have addressed either one of the two problems. These methods treat the single-image recognition pipeline as two stages: 1. the image is passed through a general feature network; 2. the result is then generated by a task-specific sub-network. When transferring image detectors to videos, feature-level methods seek ways to improve the feature stage, while the task network remains unchanged. This task-independence makes feature-level methods versatile and conceptually simple. To improve speed, feature-level methods reuse sparsely sampled deep features in the first stage [34, 26], because nearby video frames provide redundant information. This saves the expensive feature network inference and boosts speed to real-time level, but sacrifices accuracy. On the other hand, accuracy can be improved by multi-frame feature aggregation [33, 21]. This enables successful detection on low-quality frames, but the aggregation cost can be huge and thus further slows down the framework. In this work, we combine the advantages of both tracks. We present a new feature-level framework, which runs at real-time speed and outperforms the per-frame detection baseline.
Our method, called Impression Network, is inspired by the way humans understand videos. When a new frame arrives, humans do not forget previous frames. Instead, the impression is accumulated along the video, which helps us understand degenerated frames with limited visual clues. This mechanism is embodied in our method to enhance frame features and improve accuracy. Moreover, we combine it with sparse keyframe feature extraction to obtain real-time inference speed. The pipeline of our method is shown in Figure 1.
To address the redundancy and improve speed, we split a video into segments of equal length. For each segment, only one keyframe is selected for deep feature extraction. With flow-guided feature propagation [34, 5], the key feature is reused by non-key frames to generate detection results. Based on this, we adopt our Impression mechanism to perform multi-frame feature fusion. When a key feature is extracted, it not only goes to the task network, but is also absorbed by an impression feature. The impression feature is then propagated down to the next keyframe. The task feature for the next keyframe is a weighted combination of its own feature and the impression feature, and the impression feature is updated by absorbing the feature of that frame. This process continues along the whole video. In this framework, the impression feature accumulates high-quality video object information and is propagated all the way down, helping enhance incoming key features if the frames are deteriorated. It improves the overall quality of task features, and thus increases detection accuracy.
The Impression mechanism also contributes to the speed. With the iterative aggregation policy, it minimizes the cost of feature fusion. Previous work has shown that video frame features should be spatially aligned with flow-guided warping before aggregation, and that the cost of flow computation is not negligible. The intuitive approach requires one flow estimation for each frame being aggregated, while Impression Network needs only one extra flow estimation for adjacent segments, which is much more efficient.
Without bells and whistles, Impression Network surpasses state-of-the-art image detectors on the ImageNet VID dataset. It is three times faster (20 fps) and significantly more accurate. We hope Impression Network can provide a new perspective on feature aggregation in video tasks.
Code will be released to facilitate future research.
2 Related Work
Feature Reuse in Video Recognition: As shown by earlier analysis [30, 35, 15, 32, 27], consecutive video frames are highly similar, as are their high-level convolutional features. This suggests that video sequences have an inherent redundancy, which can be exploited to reduce time cost. In single-image detectors [8, 10, 22, 7, 3], the heavy feature network (encoder) is much more costly than the task sub-network (decoder). Hence, when transplanting image detectors to videos, speed can be greatly improved by reusing the deep features of frames. Clockwork Convnets exploit the different rates at which features at different levels evolve. By updating low-level and high-level convolutional features at different frequencies, it partially avoids redundant feature computation. This makes the network faster, but sacrifices accuracy due to the lack of end-to-end training. Deep Feature Flow is another successful feature-level acceleration method. It cheaply propagates the top feature of sparse keyframes to other frames, achieving a significant speed-up (from 5 fps to 20 fps). Deep Feature Flow requires motion estimation such as optical flow [12, 2, 29, 23, 5, 13] to propagate features, which introduces error and therefore brings a minor accuracy drop. Impression Network inherits the idea of Deep Feature Flow, but also utilizes temporal information to enhance the shared features. It is not only faster than the per-frame baseline, but also more accurate.
Exploiting Temporal Information in Video Tasks: Applying state-of-the-art still-image detectors frame by frame on videos does not provide optimal results. This is mainly due to the low-quality images in videos. Single-image detectors are vulnerable to deteriorated images because they are restricted to the frame they are looking at, while ignoring the ample temporal information from other frames in the video. Temporal feature aggregation [17, 20, 25, 28, 18, 1, 31] provides a way to utilize such information. Flow-Guided Feature Aggregation (FGFA) aims at enhancing frame features by aggregating all frame features in a consecutive neighborhood. The aggregation weight is learned through end-to-end training. FGFA boosts video detection accuracy to a new level, yet it is three times slower than the per-frame solution. This is caused by the aggregation cost. For each frame in the fusion range, FGFA requires one optical flow computation to spatially align it with the target frame, which costs even more time than the feature network. Additionally, since neighboring frames are highly similar, the exhaustive dense aggregation leads to extra redundancy. Impression Network fuses features in an iterative manner, where only one flow estimation is needed for every new keyframe. Moreover, the sparse feature sampling reduces the amount of replicated information.
3 Impression Network
3.1 Impression Network Inference
Given a video, our task is to generate detection results for all of its frames. To avoid redundant feature computation, we split the frame sequence into segments of equal length. In each segment, only one frame (by default the central frame) is selected as the keyframe for feature extraction via the feature network. The key feature is propagated to the remaining frames with flow-guided warping, where the flow field is computed by a light-weight flow network, following the practice of Deep Feature Flow. Features of all frames are then fed into the task network to generate detection results.
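As a minimal sketch of the segment scheduling described above, the Python snippet below maps a frame index to the keyframe of its segment. The function name, the zero-based indexing, the choice of `(seg_len - 1) // 2` as the central offset, and the clamping for a shorter final segment are assumptions made for illustration; they are not taken from the original implementation.

```python
def keyframe_index(frame_idx, seg_len, num_frames):
    """Return the index of the keyframe whose feature serves frame `frame_idx`.

    Assumptions (hypothetical, for illustration only):
    - frames are indexed from 0,
    - the keyframe is the central frame of each segment,
    - if the last segment is shorter, the keyframe is clamped to the last frame.
    """
    segment = frame_idx // seg_len                  # which segment the frame falls in
    key = segment * seg_len + (seg_len - 1) // 2    # central frame of that segment
    return min(key, num_frames - 1)


# With segments of 10 frames, frames 0..9 share the keyframe at index 4,
# which is the 5th frame of the segment when counting from 1.
keys = [keyframe_index(i, seg_len=10, num_frames=25) for i in range(25)]
```

Only the frames returned by this mapping would pass through the heavy feature network; every other frame would reuse a propagated feature.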
In this framework, we use the impression mechanism to exploit long-range, cross-segment temporal information. The inference phase of Impression Network is illustrated in Figure 1. Each segment generates three features: the key feature, calculated by passing the keyframe through the feature network; the task feature, shared by all frames in the segment for the detection sub-network; and the impression feature, which contains long-term temporal information. For the first segment, the task feature and the impression feature are identical to the key feature. For the next segment, the task feature is a weighted combination of the previous impression feature and the new key feature. The aggregation unit uses a tiny FCN to generate position-wise weight maps. Generally, larger weights are assigned to the feature with better quality, so the task feature is a position-wise weighted sum of the two features.
Notice that such quality is not a handcrafted metric; instead, it is learned by end-to-end training to minimize the task loss. We observe that when the keyframe is deteriorated by motion blur or defocus, its feature gets a lower quality score, as shown in Figure 4. Also notice that the aggregation of cross-segment features is not simply adding them up. Former practice shows that, due to spatial misalignment in video frames, a naive weighted mean yields worse results. Here we use flow-guided aggregation. Specifically, we first calculate the flow field between the previous keyframe and the current keyframe, then perform spatial warping accordingly on the previous impression feature to align it with the current key feature; the fusion is then done with the warped impression feature and the current key feature to generate the task feature. The task feature and the current key feature are then mingled to get the new impression feature.
Here a constant factor controls the contribution of the task feature to the new impression feature. It serves as a gate to control the memory of the framework (detailed in Figure 6). If it is set to 0, the new impression feature will only contain information from the current key feature. The procedure continues until all frames in a video are processed.
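The update rules described in this subsection can be written compactly. The notation below is assumed for illustration: $f_k$ is the key feature of segment $k$, $f^{\mathrm{task}}_k$ and $f^{\mathrm{imp}}_k$ are its task and impression features, $w_k$ is the position-wise weight map of the new key feature, $g$ is the memory gate, $\mathcal{F}$ is the flow estimator and $\mathcal{W}$ is the bilinear warping operator. This is one form consistent with the prose, and it may differ in detail from the original equations.

```latex
\begin{aligned}
\tilde{f}^{\mathrm{imp}}_{k-1} &= \mathcal{W}\!\left(f^{\mathrm{imp}}_{k-1},\ \mathcal{F}\!\left(I^{\mathrm{key}}_{k},\, I^{\mathrm{key}}_{k-1}\right)\right), \\
f^{\mathrm{task}}_{k} &= w_{k} \odot f_{k} + (1 - w_{k}) \odot \tilde{f}^{\mathrm{imp}}_{k-1}, \\
f^{\mathrm{imp}}_{k} &= \frac{g \, f^{\mathrm{task}}_{k} + f_{k}}{1 + g},
\qquad f^{\mathrm{task}}_{0} = f^{\mathrm{imp}}_{0} = f_{0}.
\end{aligned}
```

Under this form, setting $g = 0$ gives $f^{\mathrm{imp}}_{k} = f_{k}$, which matches the statement that the impression feature then only contains information from the current key feature.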
By iteratively absorbing every keyframe feature, the impression feature contains visual information over a large time span. The weighted aggregation of the impression feature and the new key feature can be seen as a balance between memory and new information, depending on the quality of the incoming keyframe. When the new keyframe is deteriorated, the impression feature compensates for the weak feature, helping infer bounding box and class information from low-level visual clues such as color distribution. On the other hand, the impression feature also keeps getting updated. Since sharp and clear frames get higher scores, they contribute more to an effective impression. Compared to exhaustively aggregating all nearby features in a fixed range for every frame, our framework is more natural and elegant. The whole process is summarized in Algorithm 1.
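The iterative process can be sketched in a few lines of Python. This is not a reproduction of Algorithm 1: the function name, the callables `weight_fn` and `warp_fn`, and the gate rule `(g * task + f) / (1 + g)` are assumptions chosen to agree with the description (first segment initialised with its own key feature, and a gate of 0 leaving only the current key feature in the impression). Features can be plain floats or array-like objects that support arithmetic.

```python
def impression_inference(key_feats, weight_fn, warp_fn, g=1.0):
    """Return one task feature per keyframe.

    key_feats : sequence of key features, one per segment
    weight_fn : hypothetical callable (new_feat, old_feat) -> weight of new_feat in [0, 1]
    warp_fn   : hypothetical callable (feat, k) -> feat aligned to keyframe k
    g         : memory gate (assumed form, see the lead-in)
    """
    task_feats = []
    impression = None
    for k, feat in enumerate(key_feats):
        if impression is None:
            # First segment: task and impression features equal the key feature.
            task = feat
            impression = feat
        else:
            aligned = warp_fn(impression, k)          # align memory with the new keyframe
            w = weight_fn(feat, aligned)              # quality-aware weight of the new feature
            task = w * feat + (1.0 - w) * aligned     # task feature used for detection
            impression = (g * task + feat) / (1.0 + g)  # absorb the new key feature
        task_feats.append(task)
    return task_feats


# Toy run with scalar "features", equal weights and an identity warp.
toy = impression_inference([0.0, 4.0, 0.0], lambda new, old: 0.5, lambda f, k: f, g=1.0)
```

Note that each new keyframe triggers exactly one call to `warp_fn` for the impression feature, which is where the single extra flow estimation per segment comes from.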
3.2 Impression Network Training
The training procedure of Impression Network is rather simple. With video data provided, a standard single-image object detection pipeline can be transferred to video tasks with slight modifications. The end-to-end training framework is illustrated in Figure 3.
During training, each data batch contains three images from the same video sequence, sampled with two random offsets whose ranges are controlled by the segment length. This setting is coherent with the inference phase: the first image represents an arbitrary frame from the current segment, the second stands for the keyframe of the current segment, and the third stands for the previous keyframe. For simplicity, we refer to the three images as the current frame, the new keyframe and the old keyframe. The ground truth of the current frame is provided as the label.
In each iteration, the feature network is first applied to the new keyframe and the old keyframe to get their deep features. Then, two image pairs (the new keyframe with the old keyframe, and the current frame with the new keyframe) are fed into the flow network, yielding two optical flow fields. The flow-guided warping unit then uses the first flow field to propagate the old keyframe feature and align it with the new keyframe feature. The aggregation unit weights and fuses the warped old keyframe feature and the new keyframe feature, generating the fused feature. The old keyframe feature in training corresponds to the impression feature in inference. This is an approximation, since it only contains information from one previous keyframe. Finally, the fused feature is warped to the current frame according to the second flow field to get the task feature for a standard detection sub-network. Since all the components are differentiable, the detection loss propagates all the way back to jointly fine-tune the feature network, the task network, the flow network and the feature aggregation unit, optimizing task performance. Notice that single-image datasets can be fully exploited in this framework, in which case the three images are all the same.
3.3 Module Design
Feature Network: We use ResNet-101 pretrained for ImageNet classification. The fully connected layers are removed. For a denser feature map, the feature stride is reduced from 32 to 16. Specifically, the stride of the last block is modified from 2 to 1. To maintain the receptive field, a dilation of 2 is applied to convolution layers with kernel size greater than 1. A 1024-channel convolution layer (randomly initialized) is appended to reduce the feature dimension.
Flow-Guided Feature Propagation: Before aggregation, we spatially align frame features by flow-guided warping. The optical flow field between the target frame and the reference frame is calculated first to obtain pixel-level motion paths, then the deep feature of the reference frame is warped to the target frame with the bilinear sampler W. The warped feature is finally multiplied by a predicted position-wise scale map that refines it. We adopt the state-of-the-art CNN-based FlowNet [5, 13] for optical flow computation. Specifically, we use FlowNet-S. The flow network is pretrained on the FlyingChairs dataset. The scale map has the same channel dimension as the task features, and is predicted in parallel with the flow field through an additional convolution layer attached to the top of FlowNet-S. The new layer is initialized with weights of all zeros and fixed biases of all ones. The implementation of the bilinear sampling unit has been well described in [14, 4, 34]. It is fully differentiable.
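To make the warping step concrete, here is a minimal pure-Python sketch of flow-guided bilinear sampling on a single-channel 2-D feature map. The function names, the list-of-lists layout, the zero padding outside the map, and the flow convention (each target position stores the displacement to its matching position in the reference frame) are assumptions for illustration; a practical implementation would use a batched, differentiable sampler.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample feat[row][col] at real-valued (x, y); positions outside the map count as 0."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= xx < w and 0 <= yy < h:
                # Bilinear kernel: weight falls off linearly with distance on each axis.
                weight = (1.0 - abs(x - xx)) * (1.0 - abs(y - yy))
                value += weight * feat[yy][xx]
    return value


def warp_feature(ref_feat, flow, scale=None):
    """Warp a reference feature map to the target frame.

    flow[y][x] = (dx, dy) is the assumed displacement from target position (x, y)
    to the corresponding position in the reference frame.
    scale is an optional position-wise scale map; None behaves like a map of ones.
    """
    h, w = len(ref_feat), len(ref_feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            s = 1.0 if scale is None else scale[y][x]
            out[y][x] = s * bilinear_sample(ref_feat, x + dx, y + dy)
    return out


ref = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
zero_flow = [[(0.0, 0.0)] * 3 for _ in range(3)]
half_flow = [[(0.5, 0.0)] * 3 for _ in range(3)]
identity = warp_feature(ref, zero_flow)
shifted = warp_feature(ref, half_flow)
```

A scale map of all ones leaves the warped feature unchanged, which is the behaviour the zero-weight, unit-bias initialization of the extra layer would produce at the start of training.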
Aggregation Unit: The aggregation weights of features are generated by a quality estimation network. It has three randomly initialized convolution layers. The output is a position-wise raw score map, which is applied on each channel of the task feature. Raw score maps of different features are normalized by a softmax function so that they sum to one at each position. We then multiply the score maps with the features and sum them up to obtain the fused feature as in Eq. 3.
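The normalization and fusion step can be illustrated at a single spatial position. In the sketch below, the raw scores stand in for the outputs of the quality estimation network, which is not modelled; the function names and the list-based layout are assumptions for illustration only.

```python
import math


def softmax_weights(raw_scores):
    """Normalize raw quality scores at one position so that they sum to one."""
    peak = max(raw_scores)                         # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_position(values, raw_scores):
    """Fuse several features at one position.

    values     : one channel vector per feature, e.g. [[c0, c1], [c0, c1]]
    raw_scores : one raw score per feature; the same weight is shared by all channels
    """
    weights = softmax_weights(raw_scores)
    channels = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(channels)]


# Equal raw scores reduce to a plain average of the two features.
averaged = fuse_position([[2.0, 4.0], [0.0, 0.0]], [0.0, 0.0])
# A much higher score for the first feature lets it dominate the result.
dominant = softmax_weights([10.0, 0.0])
```

Because the weights are computed per position, a frame that is blurred only in one region can still contribute its sharp regions to the fused feature.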
Detection Network: We use the state-of-the-art R-FCN as the detection sub-network. RPN and R-FCN are attached to the 1024-channel convolution of the feature network, using the first and second 512 channels respectively. RPN uses 9 anchors and generates 300 proposals for each image. We use groups of position-sensitive score maps for R-FCN.
3.4 Runtime Complexity Analysis
We compare the inference time of our method with that of per-frame evaluation.
In each segment, Impression Network requires: 1. as many flow warpings as there are frames in the segment, one for impression feature propagation and one for each non-key frame; 2. one feature fusion operation; 3. one feature network inference for the keyframe feature; 4. one detection sub-network inference per frame. In comparison, the per-frame solution takes one feature network inference and one detection sub-network inference per frame. Notice that compared to the feature network (ResNet-101 in our practice) and FlowNet, the complexity of the bilinear warping, the feature fusion and the detection sub-network is negligible, so the ratio is dominated by the costs of the feature network and the flow network.
In practice, the flow network is considerably smaller than ResNet-101, while the segment length is large to reduce redundancy. This suggests that, unlike existing feature aggregation methods such as FGFA, Impression Network can perform multi-frame feature fusion while maintaining a noticeable speedup over the per-frame solution.
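The approximation can be checked with a small back-of-the-envelope calculation. The sketch below keeps only the two dominant costs named in the text; the function name and the example cost values are hypothetical and are not measurements from the paper.

```python
def runtime_ratio(seg_len, flow_cost, feat_cost=1.0):
    """Approximate ratio of Impression Network time to per-frame time for one segment.

    Only the dominant terms are kept (an assumption following the text):
    - one feature network inference per segment,
    - seg_len flow estimations per segment (seg_len - 1 non-key frames + 1 impression).
    The per-frame baseline runs the feature network on every frame.
    """
    impression_cost = feat_cost + seg_len * flow_cost
    per_frame_cost = seg_len * feat_cost
    return impression_cost / per_frame_cost


# Hypothetical numbers: flow network at one tenth of the feature network cost,
# segments of 10 frames. The ratio equals 1/seg_len + flow_cost/feat_cost.
ratio = runtime_ratio(seg_len=10, flow_cost=0.1)
```

The two terms show the trade-off directly: a longer segment amortizes the feature network, while the flow network sets a floor on the achievable ratio.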
4.1 Experiment Setup
ImageNet VID dataset: It is a large-scale video object detection dataset. There are 3862, 555 and 937 snippets, with frame rates of 25 and 30 fps, in the training, validation and test sets, respectively. All the video snippets are fully annotated. The ImageNet VID dataset has 30 object categories, which are a subset of the categories of the ImageNet DET dataset. In our experiments, following the practice in [16, 19, 34, 33], models are trained on the training set, while evaluations are done on the validation set with the standard mean average precision (mAP) metric.
Implementation Details: Our training set consists of the full ImageNet VID train set, together with images from the ImageNet DET train set. Only the same 30 categories are used. As mentioned before, each training batch contains three images. If sampled from DET, all three images are the same. In both training and testing, images are resized so that the shorter side is 600 pixels for the feature network and 300 pixels for the flow network. The whole framework is trained end to end with SGD, where 120K iterations are performed on 8 GPUs. The learning rate is reduced after the first 70K iterations and kept at the lower value for the remaining 50K iterations. For clear comparison, no bells and whistles such as multi-scale training or box-level post-processing are used. Inference time is measured on an NVIDIA GTX 1060 GPU.
4.2 Ablation Study
Architecture Design: Table 1 summarizes the main experimental results. It shows a comparison of the single-frame baseline, Impression Network and its variants.
Method (a) is the standard ResNet-101 R-FCN applied frame by frame to videos. The accuracy is close to the previously reported mAP, which shows its validity as a strong baseline for our evaluations. The runtime is slightly faster, probably due to differences in the implementation environment. The 6 fps inference speed is insufficient for real-time applications, where a speed of 15 fps is typically required.
Method (b) is a variant of Method (a) with sparse feature extraction. In this approach, videos are divided into segments of equal length. Only one keyframe in each segment is passed through the feature network for feature extraction. That feature is then propagated to other frames with optical flow. Finally, the detection sub-network generates results for every frame. The structure is identical to a Deep Feature Flow framework for video object detection. Specifically, the segment length is set to 10 for all experiments in this table. We select the 5th frame as the keyframe, because this minimizes the average feature propagation distance, and thus reduces the error introduced and improves accuracy (explained later). Compared to per-frame evaluation, there is a minor accuracy drop, mainly because of lessened information, as well as errors in flow-guided feature warping. However, the inference speed increases remarkably to 21 fps, showing that sparse feature extraction is an efficient way to trade accuracy for speed.
Method (c) is a degenerated version of Impression Network. Keyframe features are iteratively fused to generate the impression feature, but without the quality-aware weighting unit. The weights in Eq. 3 are naively fixed to a constant. For all experiments here, the memory gate in Eq. 4 is set to a fixed value. With information from previous frames fused into the current task feature, mAP increases over the per-frame baseline. Notice that sparse feature extraction is still enabled here, which shows that 1. the computational redundancy of per-frame evaluation is huge; 2. such redundancy is not necessary for higher accuracy. Due to the one additional flow estimation for each segment, the framework slows down a little, yet still runs at a real-time-level 20 fps.
Method (d) is the proposed Impression Network. Here the aggregation unit uses the tiny FCN to generate position-wise weights. Through end-to-end training, the sub-network learns to assign smaller weights to features of deteriorated frames, as shown in Figure 4. Experiments on the ImageNet VID validation set show that the learned weight in Algorithm 1 follows a normal distribution. Quality-aware weighting brings another mAP improvement, mainly because of the increase in valid information. Overall, Impression Network increases mAP over the per-frame baseline to a level comparable to the exhaustive feature aggregation method, while being significantly faster, running at 20 fps. Impression Network shows that, if redundancy and temporal information are properly handled, the speed and accuracy of video object detection can actually be improved simultaneously. Examples are shown in Figure 5.
Method (e) is Impression Network without end-to-end training. The feature network is trained in a single-image detection framework, the same as in Method (a). The flow network is the off-the-shelf FlowNet-S pretrained on FlyingChairs. Only the weighting and detection sub-networks learn during training. This clearly worsens the mAP, showing the importance of end-to-end optimization.
The Influence of Memory Gate: As shown in Eq. 4, the memory gate controls the composition of the impression feature. Here we study its influence on mAP. Experiment settings are the same as Method (d) in Table 1, except that the value of the memory gate is varied. Figure 6 shows the average contribution of previous keyframes to the current detection at different gate values. It can be seen that the gate controls the available range of temporal information. When it is set to 0, the impression feature consists solely of the previous key feature, just like how the framework is trained, while setting it to 1.0 brings in more temporal information. Figure 7 shows the mAP at different gate settings. A larger gate value benefits accuracy. The involvement of long-range feature aggregation may help detection in longer series of low-quality frames.
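The effect of the gate on the temporal range can be illustrated numerically. The sketch below is not the data behind Figure 6: it assumes a constant fusion weight `w` for the new key feature and the gate rule `impression = (g * task + key) / (1 + g)`, both of which are illustrative assumptions, and then expands the recursion to obtain the share of each past keyframe in the current task feature.

```python
def keyframe_contributions(g, w=0.5, history=4):
    """Share of the current and the `history` previous key features in the task feature.

    Assumed rules (hypothetical, constant weight w):
        task_k       = w * key_k + (1 - w) * impression_{k-1}
        impression_k = (g * task_k + key_k) / (1 + g)
    which unrolls to impression_k = a * impression_{k-1} + b * key_k.
    """
    a = g * (1.0 - w) / (1.0 + g)     # how much old memory survives one update
    b = (g * w + 1.0) / (1.0 + g)     # how much of the new key feature is absorbed
    shares = [w]                      # current keyframe
    shares += [(1.0 - w) * b * a ** (j - 1) for j in range(1, history + 1)]
    return shares


no_memory = keyframe_contributions(g=0.0)   # only the previous keyframe contributes
long_memory = keyframe_contributions(g=1.0) # contributions decay geometrically
```

With a gate of 0 the memory term vanishes after one step, whereas a larger gate spreads the contribution over several previous keyframes with geometrically decreasing shares.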
Table 2: mAP (%) at different keyframe ids, with the corresponding average propagation distance (frames).
Different Keyframe Selection: In the aforementioned experiments, to reduce error, we select the central frame of each segment as the keyframe. Here we explain this choice and compare different keyframe schedules. Flow-guided feature warping introduces error, and the error is positively correlated with propagation distance. This is because larger displacement increases the difficulty of pixel-level matching. Hence, we take the average feature propagation distance as a metric for flow error, and seek the way to minimize it.
The average distance depends on the id of the keyframe within the segment and on the segment length: the key feature needs to be propagated to all non-key frames, and there is also an impression feature propagation whose distance equals the segment length. There is an optimal keyframe id that minimizes the average distance,
which shows that the central frame is the best. Table 2 shows the mAP at different keyframe selections, which is coherent with our assumption. Notice that selecting the first frame enables strict real-time inference, while selecting the central frame introduces a slight latency. This can be traded off according to application needs.
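The claim that the central frame minimizes the average propagation distance is easy to verify numerically. The sketch below assumes zero-based keyframe ids and averages over the number of frames in the segment; the exact normalization of the original formula is not known here, but it does not change which keyframe id is optimal.

```python
def average_propagation_distance(key_id, seg_len):
    """Average feature propagation distance for a keyframe at position key_id.

    Counts the distance from the keyframe to every frame of its segment, plus one
    impression propagation of length seg_len between consecutive keyframes.
    Dividing by seg_len is an assumed normalization.
    """
    to_frames = sum(abs(i - key_id) for i in range(seg_len))
    return (to_frames + seg_len) / seg_len


seg_len = 10
distances = [average_propagation_distance(k, seg_len) for k in range(seg_len)]
best = min(range(seg_len), key=lambda k: distances[k])
```

For a segment of 10 frames the two central positions tie, and the first frame gives the largest average distance together with the last one, which is consistent with choosing a central keyframe when a small latency is acceptable.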
Table 3: mAP (%) and runtime (ms) of different methods.
4.3 Compare with Other Feature-Level Methods
We compare Impression Network with other feature-level video object detection methods. In Figure 8, we compare the speed-accuracy curves of Impression Network and Deep Feature Flow. The per-frame baseline is also marked. The segment length varies from 1 to 20. Impression Network is more accurate than the per-frame solution even in the high-speed zone. Similar to Deep Feature Flow, it also offers a smooth accuracy-speed trade-off as the segment length varies. The accuracy drops a little when the segment length gets close to 1, which is reasonable because Impression Network is trained for aggregating sparse frame features. Dense sampling limits the aggregation range and results in a less useful impression.
Table 3 compares Impression Network with Flow-Guided Feature Aggregation and its faster variant. Both are described in the original FGFA work. FGFA is the standard version with a fusion radius of 10, and FGFA-fast is the accelerated version, which only calculates flow fields for adjacent frames and composes them for non-adjacent pairs. This comparison shows that the accuracy of Impression Network is on par with the best aggregation-based method, while being much more efficient.
5 Conclusion and Future Work
This work presents a fast and accurate feature-level method for video object detection. The proposed Impression mechanism explores a novel scheme for feature aggregation in videos. Since Impression Network works at the feature stage, it is complementary to existing box-level post-processing methods like Seq-NMS. For now we use FlowNet-S to guide feature propagation for clear comparison, while more efficient flow algorithms exist and can surely benefit our method. We use a fixed segment length for simplicity, while an adaptively varying length may schedule computation more reasonably. Moreover, as a feature-level method, Impression Network inherits task-independence, and has the potential to tackle the image degeneration problem in other video tasks.
- [1] N. Ballas, L. Yao, C. Pal, and A. Courville. Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432, 2015.
- [2] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. Computer Vision-ECCV 2004, pages 25–36, 2004.
- [3] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
- [4] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. arXiv preprint arXiv:1703.06211, 2017.
- [5] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
- [6] R. Gadde, V. Jampani, and P. V. Gehler. Semantic video CNNs through representation warping. arXiv preprint arXiv:1708.03088, 2017.
- [7] R. Girshick. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
- [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
- [9] W. Han, P. Khorrami, T. L. Paine, P. Ramachandran, M. Babaeizadeh, H. Shi, J. Li, S. Yan, and T. S. Huang. Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465, 2016.
- [10] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.
- [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [12] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial intelligence, 17(1-3):185–203, 1981.
- [13] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. arXiv preprint arXiv:1612.01925, 2016.
- [14] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
- [15] D. Jayaraman and K. Grauman. Slow and steady feature analysis: higher order temporal coherence in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3852–3861, 2016.
- [16] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, et al. T-CNN: Tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
- [17] A. Kar, N. Rai, K. Sikka, and G. Sharma. AdaScan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. arXiv preprint arXiv:1611.08240, 2016.
- [18] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
- [19] B. Lee, E. Erdenee, S. Jin, M. Y. Nam, Y. G. Jung, and P. K. Rhee. Multi-class multi-object tracking using changing point detection. In European Conference on Computer Vision, pages 68–83. Springer, 2016.
- [20] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek. VideoLSTM convolves, attends and flows for action recognition. Computer Vision and Image Understanding, 2017.
- [21] Y. Liu, J. Yan, and W. Ouyang. Quality aware network for set to set recognition. arXiv preprint arXiv:1704.03373, 2017.
- [22] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
- [23] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1164–1172, 2015.
- [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- [25] S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. arXiv preprint arXiv:1511.04119, 2015.
- [26] E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell. Clockwork convnets for video semantic segmentation. In Computer Vision–ECCV 2016 Workshops, pages 852–868. Springer, 2016.
- [27] L. Sun, K. Jia, T.-H. Chan, Y. Fang, G. Wang, and S. Yan. DL-SFA: deeply-learned slow feature analysis for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2632, 2014.
- [28] L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4597–4605, 2015.
- [29] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 1385–1392, 2013.
- [30] L. Wiskott and T. J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural computation, 14(4):715–770, 2002.
- [31] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015.
- [32] Z. Zhang and D. Tao. Slow feature analysis for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):436–450, 2012.
- [33] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei. Flow-guided feature aggregation for video object detection. arXiv preprint arXiv:1703.10025, 2017.
- [34] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei. Deep feature flow for video recognition. arXiv preprint arXiv:1611.07715, 2016.
- [35] W. Zou, S. Zhu, K. Yu, and A. Y. Ng. Deep learning of invariant features via simulated fixations in video. In Advances in neural information processing systems, pages 3203–3211, 2012.