Towards High Performance Video Object Detection for Mobiles

04/16/2018 · by Xizhou Zhu, et al. · Microsoft

Despite the recent success of video object detection on desktop GPUs, its architecture is still far too heavy for mobiles. It is also unclear whether the key principles of sparse feature propagation and multi-frame feature aggregation apply at very limited computational resources. In this paper, we present a lightweight network architecture for video object detection on mobiles. A lightweight image object detector is applied on sparse key frames. A very small network, Light Flow, is designed for establishing correspondence across frames. A flow-guided GRU module is designed to effectively aggregate features on key frames. For non-key frames, sparse feature propagation is performed. The whole network can be trained end-to-end. The proposed system achieves a 60.2% mAP score on ImageNet VID validation at a speed of 25.6 frames per second on mobiles (e.g., Huawei Mate 8).




1 Introduction

Object detection has achieved significant progress in recent years using deep neural networks [1]. The general trend has been to make deeper and more complicated object detection networks [2, 3, 4, 5, 6, 7, 8, 9, 10, 11] in order to achieve higher accuracy. However, these advances in accuracy do not necessarily make networks more efficient with respect to size and speed. In many real world applications, such as robotics, self-driving cars, augmented reality, and mobile phones, object detection needs to be carried out in real time on a computationally limited platform.

Recently, there has been rising interest in building very small, low-latency models that can be easily matched to the design requirements of mobile and embedded vision applications, for example, SqueezeNet [12], MobileNet [13], and ShuffleNet [14]. These structures are general, but not specifically designed for object detection tasks. For this purpose, several small deep neural network architectures for object detection in static images have been explored, such as YOLO [15], YOLOv2 [11], Tiny YOLO [16], and Tiny SSD [17]. However, directly applying these detectors to videos faces new challenges. First, applying the deep networks on all video frames introduces unaffordable computational cost. Second, recognition accuracy suffers from deteriorated appearances in videos that are seldom observed in still images, such as motion blur, video defocus, and rare poses.

Figure 1: Speed-accuracy trade-off for different lightweight object detectors. Curves are drawn with varying image resolutions. Inference time is evaluated with TensorFlow Lite [18] on a single 2.3GHz Cortex-A72 processor of a Huawei Mate 8.

To address these issues, the current best practice [19, 20, 21] exploits temporal information for speedup and improved detection accuracy on videos. On one hand, sparse feature propagation is used in [19, 21] to save expensive feature computation on most frames: features on these frames are cheaply propagated from sparse key frames. On the other hand, multi-frame feature aggregation is performed in [20, 21] to improve feature quality and detection accuracy.

Built on the two principles, the latest work [21] provides a good speed-accuracy tradeoff on desktop GPUs. Unfortunately, the architecture is not mobile-friendly. For example, flow estimation, the key and common component in feature propagation and aggregation [19, 20, 21], is still far from the demand of real-time computation on mobiles. Aggregation with long-term dependency is also restricted by the limited runtime memory of mobiles.

This paper describes a lightweight network architecture for mobile video object detection. It is primarily built on the two principles: propagating features on the majority of non-key frames while computing and aggregating features on sparse key frames. However, both structures need to be carefully redesigned for mobiles by considering speed, size, and accuracy. On all frames, we present Light Flow, a very small deep neural network to estimate feature flow, which offers instant availability on mobiles. On sparse key frames, we present flow-guided Gated Recurrent Unit (GRU) based feature aggregation, an effective aggregation on a memory-limited platform. Additionally, we exploit a light image object detector for computing features on key frames, which leverages advanced and efficient techniques such as depthwise separable convolution [22] and Light-Head R-CNN [23].

The proposed techniques are unified into an end-to-end learning system. Comprehensive experiments show that the model steadily pushes forward the performance (speed-accuracy trade-off) envelope, towards high performance video object detection on mobiles. For example, we achieve a 60.2% mAP score on ImageNet VID validation at a speed of 25.6 frames per second on mobiles (e.g., Huawei Mate 8). This is an order of magnitude faster than the best previous effort on fast object detection, with on-par accuracy (see Figure 1). To the best of our knowledge, this is the first time realtime video object detection has been achieved on mobiles with reasonably good accuracy.

2 Revisiting Video Object Detection Baseline

Object detection in static images has achieved significant progress in recent years using deep CNNs [1]. State-of-the-art detectors share a similar network architecture, consisting of two conceptual steps. The first step is the feature network, which extracts a set of convolutional feature maps F over the input image via a fully convolutional backbone network [24, 25, 26, 27, 28, 29, 30, 13, 14], denoted as N_feat. The second step is the detection network, which generates detection results upon the feature maps F by performing region classification and bounding box regression over either sparse object proposals [2, 3, 4, 5, 6, 7, 8, 9] or dense sliding windows [10, 15, 11, 31], via a multi-branched sub-network, denoted as N_det. It is randomly initialized and jointly trained with N_feat.

Directly applying these detectors to video object detection faces challenges from two aspects. For speed, applying single image detectors on all video frames is inefficient, since the backbone network is usually deep and slow. For accuracy, detection suffers from deteriorated appearances in videos that are seldom observed in still images, such as motion blur, video defocus, and rare poses.

Current best practice [19, 20, 21] exploits temporal information via sparse feature propagation and multi-frame feature aggregation to address the speed and accuracy issues, respectively.

2.0.1 Sparse Feature Propagation

Since contents are highly related between consecutive frames, exhaustive feature extraction need not be computed on most frames. Deep feature flow [19] provides an efficient way: the expensive feature network N_feat is computed only on sparse key frames (e.g., every 10th frame), and the key frame feature maps are propagated to the majority of non-key frames, which results in a significant speedup with a minor drop in accuracy.

During inference, the feature maps on any non-key frame i are propagated from its preceding key frame k by

    F_i = W(F_k, M_{i→k}),    (1)

where F_k = N_feat(I_k) is the feature of key frame k, and W represents the differentiable bilinear warping function. The two-dimensional motion field M_{i→k} between frames I_i and I_k is estimated through a flow network N_flow(I_k, I_i), which is much cheaper than N_feat.
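To make the propagation step concrete, the bilinear warping W can be sketched in NumPy. The function name and the (C, H, W) layout are our own choices for illustration; a real system would run this as a differentiable GPU op (e.g., a grid-sample kernel).

```python
import numpy as np

def bilinear_warp(feature, flow):
    """Warp a feature map (C, H, W) with a flow field (2, H, W).

    flow[0] holds x-displacements and flow[1] y-displacements, so the
    output at (y, x) bilinearly samples the input at
    (y + flow[1, y, x], x + flow[0, y, x]), clipped to the image.
    """
    C, H, W = feature.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x = np.clip(xs + flow[0], 0, W - 1)
    src_y = np.clip(ys + flow[1], 0, H - 1)
    x0, y0 = np.floor(src_x).astype(int), np.floor(src_y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = src_x - x0, src_y - y0
    # Blend the four neighboring feature vectors per output position.
    out = ((1 - wy) * (1 - wx) * feature[:, y0, x0]
           + (1 - wy) * wx * feature[:, y0, x1]
           + wy * (1 - wx) * feature[:, y1, x0]
           + wy * wx * feature[:, y1, x1])
    return out
```

With zero flow the warp is an identity, and an integer flow shifts the feature map, which makes the sketch easy to sanity-check.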

2.0.2 Multi-frame Feature Aggregation

To improve detection accuracy, flow-guided feature aggregation (FGFA) [20] aggregates feature maps from nearby frames, which are aligned well through the estimated flow.

The aggregated feature maps F̄_i at frame i are obtained as a weighted average of nearby frames' feature maps,

    F̄_i = Σ_{j=i−r}^{i+r} W_{j→i} ⊙ F_{j→i},    (2)

where ⊙ denotes element-wise multiplication, and the weight W_{j→i} is adaptively computed as the similarity between the propagated feature maps F_{j→i} and the feature map F_i at frame i.
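The adaptive weighting can be illustrated with a minimal NumPy sketch, assuming cosine similarity and a per-position softmax over frames (FGFA computes the similarity on small embedded features; that detail is omitted here). The function name `aggregate` is ours.

```python
import numpy as np

def aggregate(propagated, reference):
    """Aggregate propagated feature maps (a list of (C, H, W) arrays,
    already warped into the reference frame) by weighting each spatial
    position with its cosine similarity to the reference feature,
    softmax-normalized across frames.
    """
    sims = []
    for f in propagated:
        num = (f * reference).sum(axis=0)
        den = (np.linalg.norm(f, axis=0)
               * np.linalg.norm(reference, axis=0) + 1e-8)
        sims.append(num / den)                       # (H, W) cosine similarity
    sims = np.stack(sims)                            # (T, H, W)
    weights = np.exp(sims) / np.exp(sims).sum(0)     # softmax over frames
    return sum(w[None] * f for w, f in zip(weights, propagated))
```

Positions where a propagated feature agrees with the reference get larger weight; since the weights sum to one per position, aggregating copies of the reference returns the reference itself.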

To avoid dense aggregation on all frames, [21] suggested sparsely recursive feature aggregation, which operates only on sparse key frames. This retains the feature quality from aggregation while also reducing the computational cost. Specifically, given two succeeding key frames k and k', the aggregated feature at frame k' is computed by

    F̄_{k'} = W_{k→k'} ⊙ F̄_{k→k'} + W_{k'→k'} ⊙ F_{k'},    (3)

where F̄_{k→k'} = W(F̄_k, M_{k'→k}) is the aggregated key frame feature warped to frame k'.
2.1 Practice for Mobiles

The two principles, sparse feature propagation and multi-frame feature aggregation, yield the current best practice towards high performance (speed-accuracy trade-off) video object detection [21] on desktop GPUs. On mobiles, however, computational capability and runtime memory are very limited. Therefore, the corresponding principles for mobiles should be explored:

  • Feature extraction and aggregation only operate on sparse key frames, while lightweight feature propagation is performed on the majority of non-key frames.

  • Flow estimation is the key to feature propagation and aggregation. However, the flow networks used in [19, 20, 21] are still far from the demand of real-time processing on mobiles. Specifically, FlowNet [32] requires an order of magnitude more FLOPs than MobileNet [13] under the same input resolution, and even the smallest FlowNet Inception used in [19] requires several times more FLOPs. A much cheaper flow network is necessary.

  • Feature aggregation should operate on feature maps aligned by flow. Otherwise, displacements caused by large object motion would introduce severe errors into the aggregation. Long-term dependency in aggregation is also favoured, because more temporal information can be fused together for better feature quality.

  • The backbone network of the single image detector should be as small as possible, since it is needed to compute features on sparse key frames.

3 Model Architecture for Mobiles

Based on the above principles, we design a much smaller network architecture for mobile video object detection. Inference pipeline is illustrated in Figure 2.

Figure 2: Illustration of video object detection for mobiles by the proposed method.

Given a key frame I_{k'} and its preceding key frame I_k, feature maps F_{k'} = N_feat(I_{k'}) are first extracted and then aggregated with the preceding key frame's aggregated feature maps F̄_k by

    F̄_{k'} = G(F̄_{k→k'}, F_{k'}),    (4)

where F̄_{k→k'} = W(F̄_k, M_{k'→k}) and G is a flow-guided feature aggregation function. The detection network N_det is applied on F̄_{k'} to get detection predictions for the key frame I_{k'}.

Given a non-key frame I_i, the feature propagation from key frame I_k to frame I_i is denoted as

    F_i = W(F̄_k, M_{i→k}),    (5)

where F̄_k is the aggregated feature maps of key frame k, and W represents the differentiable bilinear warping function also used in [19]. The detection network N_det is then applied on F_i to get detection predictions for the non-key frame I_i.

Next, we describe two new techniques specially designed for mobiles: Light Flow, a more efficient flow network, and flow-guided GRU based feature aggregation for better modeling of long-term dependency, yielding better feature quality and accuracy.

3.1 Light Flow

FlowNet [32] was originally proposed for pixel-level optical flow estimation. It is designed in an encoder-decoder mode followed by multi-resolution optical flow predictors. The two input RGB frames are concatenated to form a 6-channel input. In the encoder, the input is converted into a bundle of feature maps, reduced in spatial dimensions to 1/64 of the input size through a series of convolutional layers. In the decoder, the feature maps are fed to multiple deconvolution layers to recover high resolution flow predictions. After each deconvolution layer, the feature maps are concatenated with the last feature maps in the encoder sharing the same spatial resolution, together with an upsampled coarse flow prediction. An optical flow predictor follows each concatenated feature map in the decoder. A loss function is applied to each predictor during training, but only the finest prediction is used during inference.

To greatly speed up the flow network, we present Light Flow, a more lightweight flow network with several deliberate design changes based on FlowNet [32]. It causes only a minor drop in accuracy (about 16% increase in end-point error) but speeds up by nearly 65× theoretically (see Table 2).

In the encoder part, convolution is always the bottleneck of computation. Motivated by MobileNet [13], we replace all convolutions with 3×3 depthwise separable convolutions [22] (each 3×3 depthwise convolution followed by a 1×1 pointwise convolution). Compared with a standard 3×3 convolution, the computation cost of a 3×3 depthwise separable convolution is reduced by 8 to 9 times with only a slight drop in accuracy [13].
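The claimed reduction is easy to verify by counting multiply-accumulates per output position; the helper below is only an illustrative cost model, and its name is ours.

```python
# Cost ratio of a standard k x k convolution vs. its depthwise separable
# replacement (depthwise k x k + pointwise 1 x 1), per output position.
# With k = 3 the reduction approaches 9x as the output channel count
# grows, matching the "8 to 9 times" figure cited for MobileNet.
def separable_speedup(k, c_in, c_out):
    standard = k * k * c_in * c_out          # full convolution
    separable = k * k * c_in + c_in * c_out  # depthwise + pointwise
    return standard / separable

print(separable_speedup(3, 32, 64))   # ~7.9x for a small layer
print(separable_speedup(3, 256, 256)) # ~8.7x for a wider layer
```

The ratio simplifies to k²·C_out / (k² + C_out), so it depends only on the kernel size and the output channel count.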

In the decoder part, each deconvolution operation is replaced by nearest-neighbor upsampling followed by a depthwise separable convolution. [33] replaces deconvolution with nearest-neighbor upsampling followed by a standard convolution to address the checkerboard artifacts caused by deconvolution. We leverage this idea and further replace the standard convolution with a depthwise separable convolution, to reduce computation cost.

Finally, we adopt a simple and effective way to fuse the multi-resolution predictions. It is inspired by FCN [34], which fuses multi-resolution semantic segmentation predictions into the final prediction by explicit summation. Unlike [32], we do not use only the finest optical flow prediction as the final prediction during inference. Instead, the multi-resolution predictions are up-sampled to the same spatial resolution as the finest prediction and then averaged as the final prediction. Accordingly, during training, a single loss function is applied on the averaged optical flow prediction instead of multiple loss functions after each prediction. This design noticeably reduces end-point error.
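The fusion scheme can be sketched as follows, assuming (C, H, W) arrays whose resolutions differ by powers of two, as in the Light Flow decoder; the function names are ours.

```python
import numpy as np

def nearest_upsample(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) array by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_predictions(preds):
    """Average multi-resolution predictions after upsampling each one to
    the finest resolution. `preds` is ordered coarse to fine; the
    upsampling factor is inferred from the height ratio, assuming widths
    scale consistently.
    """
    target_h = preds[-1].shape[1]
    upsampled = [nearest_upsample(p, target_h // p.shape[1]) for p in preds]
    return sum(upsampled) / len(upsampled)
```

Because the average is a fixed linear combination, a single training loss on the fused output still back-propagates to every predictor branch.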

3.1.1 Architecture and Implementation

The network of Light Flow is illustrated in Table 1. Each convolution operation is followed by batch normalization [35] and Leaky ReLU nonlinearity [36] with the slope fixed as 0.1. Following [32, 37], Light Flow is pre-trained on the Flying Chairs dataset. For training Light Flow, Adam [38] with a weight decay of 0.00004 is used as the optimization method. 70k iterations are performed on 4 GPUs, with each GPU holding 64 image pairs. A warm-up learning rate scheme is used: we first train with a learning rate of 0.001 for 10k iterations, then with a learning rate of 0.01 for 20k iterations, dividing it by 2 every following 10k iterations.

Name       Filter Shape   Stride  Output size  Input
Images     —              —       512×384×6    —
Conv1_dw   3×3×6 dw       2       256×192×6    Images
Conv1      1×1×6×32       1       256×192×32   Conv1_dw
Conv2_dw   3×3×32 dw      2       128×96×32    Conv1
Conv2      1×1×32×64      1       128×96×64    Conv2_dw
Conv3_dw   3×3×64 dw      2       64×48×64     Conv2
Conv3      1×1×64×128     1       64×48×128    Conv3_dw
Conv4a_dw  3×3×128 dw     2       32×24×128    Conv3
Conv4a     1×1×128×256    1       32×24×256    Conv4a_dw
Conv4b_dw  3×3×256 dw     1       32×24×256    Conv4a
Conv4b     1×1×256×256    1       32×24×256    Conv4b_dw
Conv5a_dw  3×3×256 dw     2       16×12×256    Conv4b
Conv5a     1×1×256×512    1       16×12×512    Conv5a_dw
Conv5b_dw  3×3×512 dw     1       16×12×512    Conv5a
Conv5b     1×1×512×512    1       16×12×512    Conv5b_dw
Conv6a_dw  3×3×512 dw     2       8×6×512      Conv5b
Conv6a     1×1×512×1024   1       8×6×1024     Conv6a_dw
Conv6b_dw  3×3×1024 dw    1       8×6×1024     Conv6a
Conv6b     1×1×1024×1024  1       8×6×1024     Conv6b_dw
Conv7_dw   3×3×1024 dw    1       8×6×1024     Conv6b
Conv7      1×1×1024×256   1       8×6×256      Conv7_dw
Conv8_dw   3×3×768 dw     1       16×12×768    [↑2 Conv7, Conv5b]
Conv8      1×1×768×128    1       16×12×128    Conv8_dw
Conv9_dw   3×3×384 dw     1       32×24×384    [↑2 Conv8, Conv4b]
Conv9      1×1×384×64     1       32×24×64     Conv9_dw
Conv10_dw  3×3×192 dw     1       64×48×192    [↑2 Conv9, Conv3]
Conv10     1×1×192×32     1       64×48×32     Conv10_dw
Conv11_dw  3×3×96 dw      1       128×96×96    [↑2 Conv10, Conv2]
Conv11     1×1×96×16      1       128×96×16    Conv11_dw
Optical Flow Predictors
Conv12_dw  3×3×256 dw     1       8×6×256      Conv7
Conv12     1×1×256×2      1       8×6×2        Conv12_dw
Conv13_dw  3×3×128 dw     1       16×12×128    Conv8
Conv13     1×1×128×2      1       16×12×2      Conv13_dw
Conv14_dw  3×3×64 dw      1       32×24×64     Conv9
Conv14     1×1×64×2       1       32×24×2      Conv14_dw
Conv15_dw  3×3×32 dw      1       64×48×32     Conv10
Conv15     1×1×32×2       1       64×48×2      Conv15_dw
Conv16_dw  3×3×16 dw      1       128×96×16    Conv11
Conv16     1×1×16×2       1       128×96×2     Conv16_dw
Multiple Optical Flow Predictions Fusion
Average    —              —       128×96×2     ↑16 Conv12 + ↑8 Conv13 + ↑4 Conv14 + ↑2 Conv15 + Conv16
Table 1: The details of the Light Flow architecture. 'dw' in filter shape denotes a depthwise separable convolution, '↑k' is a k× nearest-neighbor upsampling, and [·, ·] is the concatenation operation.

When applying Light Flow in our method, two modifications are made to obtain further speedup. First, following [19, 20, 21], Light Flow is applied on images with half the input resolution of the feature network, and has an output stride of 4. As the feature network has an output stride of 16, the flow field is downsampled to match the resolution of the feature maps. Second, since Light Flow is very small and has computation comparable to the detection network N_det, sparse feature propagation is applied on the intermediate feature maps of the detection network (see Section 3.3; the 256-d feature maps in RPN [5] and the 490-d feature maps in Light-Head R-CNN [23]), to further reduce computation for non-key frames.
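One practical detail behind downsampling a flow field: the displacement values must be rescaled along with the resolution, since they are measured in pixels of the flow's own grid. A minimal sketch, with simple striding standing in for proper interpolation and a function name of our own choosing:

```python
import numpy as np

def resize_flow(flow, scale):
    """Downscale a flow field (2, H, W) by an integer factor and rescale
    its displacement values to the new resolution. A displacement of 4
    pixels at full resolution corresponds to 2 pixels after 2x
    downsampling, hence the division by `scale`.
    """
    assert flow.shape[1] % scale == 0 and flow.shape[2] % scale == 0
    small = flow[:, ::scale, ::scale]  # subsample spatially
    return small / scale               # displacements shrink with resolution
```

Skipping the value rescaling would make every warped feature overshoot its true position by the downsampling factor.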

3.2 Flow-guided GRU based Feature Aggregation

Previous works [20, 21] have shown that feature aggregation plays an important role in improving detection accuracy, and that learning complex, long-term temporal dynamics matters for a wide variety of sequence learning and prediction tasks. However, [20] aggregates feature maps from nearby frames in a linear and memoryless way, so it only models short-term dependencies. Though recursive aggregation [21] has proven successful in fusing more past frames, it can be difficult to train to learn long-term dynamics, likely due in part to the vanishing and exploding gradient problems that result from propagating gradients through the many steps of a recurrent network.

Recently, [39] has shown that the Gated Recurrent Unit (GRU) [40] is more powerful in modeling long-term dependencies than LSTM [41] and RNN [42], because nonlinearities are incorporated into the network state updates. Inspired by this work, we incorporate the convolutional GRU proposed by [43] into the flow-guided feature aggregation function G, instead of the simple weighted average used in [20, 21]. The aggregation function in Eq. (4) is computed by

    z_{k'} = σ(W_z ∗ F_{k'} + U_z ∗ F̄_{k→k'}),
    r_{k'} = σ(W_r ∗ F_{k'} + U_r ∗ F̄_{k→k'}),
    F̃_{k'} = φ(W ∗ F_{k'} + U ∗ (r_{k'} ⊙ F̄_{k→k'})),
    F̄_{k'} = (1 − z_{k'}) ⊙ F̄_{k→k'} + z_{k'} ⊙ F̃_{k'},    (6)

where W_z, W_r, W, U_z, U_r, U are parameter tensors, ⊙ denotes elementwise multiplication, ∗ denotes convolution, σ is the sigmoid function, and φ is the ReLU function.

Compared with the original GRU [40], there are three key differences. First, convolution is used instead of fully connected matrix multiplication, which would be too costly when the GRU is applied to image feature maps. Second, φ is the ReLU function instead of the hyperbolic tangent (tanh), for faster and better convergence. Third, we apply the GRU only on sparse key frames (e.g., every 10th frame) instead of on consecutive frames. Since two successive inputs to the GRU are 10 frames apart (1/3 second for a video at 30 fps), object displacement must be taken into account, and features should thus be aligned before GRU aggregation. By contrast, previous works [44, 45, 46, 43] based on either convolutional LSTM or convolutional GRU do not consider such a design, since they operate on consecutive frames, where object displacement is small and can be neglected.
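A single flow-guided GRU aggregation step can be sketched in NumPy, with `h_prev` standing for the warped aggregated key-frame feature and `x` for the current key-frame feature. The naive `conv2d`, the parameter naming, and the omission of biases are our simplifications for illustration.

```python
import numpy as np

def conv2d(x, w):
    """'Same' 2D convolution: x is (Cin, H, W), w is (Cout, Cin, k, k)."""
    cout, cin, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    H, W = x.shape[1:]
    out = np.zeros((cout, H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + k, j:j + k]
            out[:, i, j] = (w * patch).sum(axis=(1, 2, 3))
    return out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_gru_step(h_prev, x, params):
    """One convolutional GRU step: h_prev is the aggregated feature
    warped from the previous key frame, x the current key-frame feature.
    ReLU replaces tanh in the candidate state, as described above.
    """
    z = sigmoid(conv2d(x, params["Wz"]) + conv2d(h_prev, params["Uz"]))
    r = sigmoid(conv2d(x, params["Wr"]) + conv2d(h_prev, params["Ur"]))
    h_cand = np.maximum(0.0, conv2d(x, params["W"])
                        + conv2d(r * h_prev, params["U"]))
    # Update gate z blends the old aggregate with the new candidate.
    return (1 - z) * h_prev + z * h_cand
```

With all parameters at zero, both gates evaluate to 0.5 and the candidate to zero, so the step halves the previous state, which gives a simple deterministic check.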

3.3 Lightweight Key-frame Object Detector

For key frame, we need a lightweight single image object detector, which consists of a feature network and a detection network. For the feature network, we adopt the state-of-the-art lightweight MobileNet [13] as the backbone network, which is designed for mobile recognition tasks. The MobileNet module is pre-trained on ImageNet classification task [47]. For the detection network, RPN [5] and the recently presented Light-Head R-CNN [23] are adopted, because of their light weight. Detailed implementation is illustrated below.

3.3.1 Feature Network

We remove the final average pooling and the fully-connected layer of MobileNet [13], and retain the convolutional layers. Since our input image resolution is very small (e.g., 224×400), we increase feature resolution to get higher performance. First, a 3×3 convolution is applied on top to reduce the feature dimension to 128, and then nearest-neighbor upsampling is utilized to increase the feature stride from 32 to 16. To bring in more detailed information, a 1×1 convolution with 128 filters is applied to the last feature maps of feature stride 16, and then added to the upsampled 128-d feature maps.
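The two-level fusion described above can be sketched as follows. For brevity, both projections are modeled as 1×1 channel projections (the paper uses a 3×3 convolution for the top reduction), and the function and parameter names are ours.

```python
import numpy as np

def fuse_feature_levels(c32, c16, proj32, lateral16):
    """Combine a stride-32 feature map c32 (C32, H, W) with a stride-16
    map c16 (C16, 2H, 2W) into a shared 128-d map: project the top map,
    2x nearest-neighbor upsample it, and add a 1x1 lateral projection of
    the finer map. proj32 is (128, C32), lateral16 is (128, C16).
    """
    top = np.einsum("oc,chw->ohw", proj32, c32)    # reduce channels
    top = top.repeat(2, axis=1).repeat(2, axis=2)  # stride 32 -> 16
    lat = np.einsum("oc,chw->ohw", lateral16, c16) # 1x1 lateral branch
    return top + lat                               # element-wise sum
```

This is the familiar top-down/lateral fusion pattern: the coarse map contributes semantics, the finer map contributes spatial detail, and the sum keeps the output at stride 16.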

3.3.2 Detection Network

RPN [5] and Light-Head R-CNN [23] are applied on the shared 128-d feature maps. To reduce the computation of RPN, 256-d intermediate feature maps are utilized, half the number originally used in [5]. Three aspect ratios {1:2, 1:1, 2:1} and four scales are set for RPN to cover objects of different shapes. For Light-Head R-CNN, a 1×1 convolution with 490 (10×7×7) filters is applied, followed by a 7×7 groups position-sensitive RoI warping [6]. Then, two sibling fully connected layers are applied on the warped features to predict RoI classification and regression.

3.4 End-to-end Training

All the modules in the entire architecture, including N_feat, N_det, and N_flow, can be jointly trained for the video object detection task. In SGD, nearby video frames spanning several key frames and a final frame are randomly sampled, where the key frame duration l = 10 and n = 8 key frame samples are set for our experiments. In the forward pass, the inference pipeline of Figure 2 is exactly performed over the sampled frames, and the final detection result for the last frame incurs a loss against the ground truth annotation. All operations are differentiable, and thus the system can be trained end-to-end.

4 Experiments

Experiments are performed on ImageNet VID [47], a large-scale benchmark for video object detection. Following the practice in [48, 49], model training and evaluation are performed on the 3,862 training video snippets and the 555 validation video snippets, respectively. The snippets are at frame rates of 25 or 30 fps in general. 30 object categories are involved, which are a subset of ImageNet DET annotated categories.

In training, following [48, 49], both the ImageNet VID training set and the ImageNet DET training set are utilized. In each mini-batch of SGD, either nearby video frames from ImageNet VID or a single image from ImageNet DET are sampled, at a 1:1 ratio. The single image is copied to form a static video snippet for training. In SGD, 240k iterations are performed on 4 GPUs, with each GPU holding one mini-batch. The learning rate is decayed in three stages, covering the first 120k, the middle 60k, and the last 60k iterations, respectively.

By default, the key-frame object detector is MobileNet+Light-Head R-CNN, and flow is estimated by Light Flow. The key frame duration is l = 10 frames. In both training and inference, images are resized to a shorter side of 224 pixels for the image recognition network and 112 pixels for the flow network. Inference time is evaluated with TensorFlow Lite [18] on a single 2.3GHz Cortex-A72 processor of a Huawei Mate 8. Theoretical computation is counted in FLOPs (floating point operations; note that a multiply-add is counted as 2 operations).

Following the practice in MobileNet [13], two width multipliers, α and β, are introduced for controlling the computational complexity by adjusting the network width. For each layer (except the final prediction layers) in N_feat and N_det, the output channel number is multiplied by α; for each layer in N_flow, it is multiplied by β. The resulting network parameter number and theoretical computation change quadratically with the width multiplier. We experiment with α ∈ {1.0, 0.75, 0.5} and β ∈ {1.0, 0.75, 0.5}. By default, α and β are set as 1.0.
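The quadratic dependence on the width multiplier can be checked on a single convolutional layer; the helper name and layer sizes below are illustrative only.

```python
# Parameter count of a k x k convolution scales quadratically with a
# width multiplier alpha, since both the input and the output channel
# counts are shrunk by the same factor.
def conv_params(c_in, c_out, k, alpha=1.0):
    return int(c_in * alpha) * int(c_out * alpha) * k * k

full = conv_params(64, 128, 3)             # alpha = 1.0
half = conv_params(64, 128, 3, alpha=0.5)  # alpha = 0.5
print(full / half)  # 4.0: halving the width quarters the parameters
```

The same argument applies to FLOPs, since each weight is used once per output position.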

4.1 Ablation Study

4.1.1 Ablation on flow networks

The middle panel of Table 2 compares the proposed Light Flow with existing flow estimation networks on the Flying Chairs test set (384×512 input resolution). Following the protocol in [32], accuracy is evaluated by the average end-point error (EPE). Compared with the original FlowNet design in [32], Light Flow (1.0×) achieves a roughly 65× theoretical speedup with about 15× fewer parameters. The flow estimation accuracy drop is small (about 16% relative increase in EPE). It is worth noting that Light Flow achieves higher accuracy than the FlowNet Half and FlowNet Inception variants utilized in [19], with at least an order of magnitude less computation overhead. Light Flow can be further sped up with a reduced network width, at some cost in flow estimation accuracy. Flow estimation is thus not a bottleneck in our mobile video object detection system.

Would such a light-weight flow network effectively guide feature propagation? To answer this question, we experiment with integrating different flow networks into our mobile video object detection system. The key-frame object detector is MobileNet+Light-Head R-CNN.

The rightmost panel of Table 2 presents the results. The detection system utilizing Light Flow achieves accuracy very close to that utilizing the heavyweight FlowNet (61.2% vs. 61.5%), and is an order of magnitude faster. In fact, the original FlowNet is so heavy that the detection system with FlowNet is even slower than simply applying the MobileNet+Light-Head R-CNN detector on each frame.

flow network Flying Chairs test ImageNet VID validation
EPE params (M) FLOPs (B) mAP params (M) FLOPs (B)
FlowNet [32] 2.71 38.7 53.48 61.5 45.1 6.41
FlowNet Half [19] 3.53 9.7 14.50 - - -
FlowNet Inception [19] 3.68 3.5 7.28 - - -
FlowNet 2.0 [37] 1.71 162.5 269.39 - - -
1.0× Light Flow 3.14 2.6 0.82 61.2 9.0 0.41
0.75× Light Flow 3.63 1.4 0.48 60.6 7.8 0.37
0.5× Light Flow 4.44 0.7 0.23 60.1 7.1 0.34
Table 2: Ablation of different flow networks for optical flow prediction on Flying Chairs and for video object detection on ImageNet VID.

4.1.2 Ablation on feature aggregation

How important is it to exploit flow to align features across frames? To answer this question, we experiment with a degenerated version of our method, in which no flow-guided feature propagation is applied before aggregating features across key frames.

Figure 3: Ablation on the effect of flow guidance in flow-guided GRU and in sparse feature propagation.

Figure 3 shows the speed-accuracy curves of our method with and without flow guidance. The curves are drawn by adjusting the key frame duration l. The curve with flow guidance surpasses that without flow guidance, and the performance gap widens as the key frame duration increases (from a 1.5% to a 2.9% mAP score gap). This is because the spatial disparity across frames is more obvious when the key frame duration is long. It is worth noting that accuracy drops further if no flow is applied even for sparse feature propagation on the non-key frames.

Table 3 presents the results of training and inference on frame sequences of varying lengths. We tried training on sequences of 2, 4, 8, 16, and 32 frames. The trained network is either applied on trimmed sequences of the same length as in training, or on untrimmed video sequences without a specific length restriction. The experiments suggest that it is beneficial to train on long sequences, but the gain saturates at length 8. Inference on untrimmed video sequences leads to accuracy on par with that on trimmed sequences, and is easier to implement. By default, we train on sequences of length 8, and apply the trained network on untrimmed sequences.

Table 4 further compares the proposed flow-guided GRU with the feature aggregation approach in [21]. An mAP score of 58.4% is achieved by the aggregation approach in [21], comparable to the single frame baseline at a 6.5× theoretical speedup. However, it is still 2.8% lower in mAP than flow-guided GRU, at close computational overhead.

We further studied several design choices in flow-guided GRU. Using the ReLU function for the nonlinearity φ leads to a 3.9% higher mAP score than tanh; ReLU also seems to converge faster in our network. If computation allows, it is more effective to increase accuracy by making the flow-guided GRU module wider (a 1.2% mAP score increase by enlarging the channel width from 128-d to 256-d) than by stacking multiple flow-guided GRU layers (accuracy drops when stacking 2 or 3 layers).

train sequence length 2 4 8 16 32
inference trimmed, mAP (%) 59.5 61.0 61.5 61.6 61.5
inference untrimmed, mAP (%) 56.4 60.6 61.2 61.4 61.5
Table 3: Ablation of sequence length in training and inference.
aggregation method mAP (%) params (M) FLOPs (B)
single frame baseline 58.3 5.6 2.39
feature aggregation in [21] 58.4 8.3 0.37
GRU (128-d, default) 61.2 9.0 0.41
GRU (256-d) 62.4 13.0 0.64
GRU (tanh for φ) 57.3 9.0 0.41
GRU (stacking 2 layers) 61.4 9.9 0.47
GRU (stacking 3 layers) 60.6 10.8 0.53
Table 4: Ablation on feature aggregation.

4.1.3 Accurate Realtime Video Object Detection on Mobile

Figure 4 presents the speed-accuracy trade-off curves of our method, drawn with the key frame duration length l varying from 1 to 20. Multiple curves are presented, corresponding to networks of different complexity (different α and β). When l = 1, the image recognition network is densely applied on each frame, as in the single frame baseline; the difference is that flow-guided GRU aggregation is applied. The accuracy derived by such dense feature aggregation is noticeably higher than that of the single frame baseline. With increased key frame duration length, the accuracy drops gracefully as the computation overhead is relieved. The accuracy of our method at a long duration length is still on par with that of the single frame baseline, while being 10.6× more computationally efficient. The above observation holds across the curves of networks of different complexity.

Comparing the different curves, we observe that under adequate computational power, networks of higher complexity (larger α and β) lead to a better speed-accuracy tradeoff, while networks of lower complexity (smaller α and β) perform better under very limited computational power.

Figure 4: Speed-accuracy trade-off curves of our method utilizing networks of different computational complexity. Curves are drawn with different key frame duration lengths l.

On our mobile test platform, the proposed system achieves an accuracy of 60.2% at a speed of 25.6 frames per second (α = 1.0, β = 0.5, l = 10), and an accuracy of 51.2% at a frame rate above 50Hz (α = 0.5, β = 0.5, l = 10). Table 5 summarizes the results.

method mAP (%) Params (M) FLOPs (B) runtime (fps)
Single frame baseline (α = 1.0) 58.3 5.6 2.39 4.0
Single frame baseline (α = 0.75) 53.1 3.4 1.36 7.6
Single frame baseline (α = 0.5) 48.6 1.7 0.62 16.4
Our method (α = 1.0, β = 1.0) 61.2 9.0 0.41 12.5
Our method (α = 1.0, β = 0.75) 60.8 7.8 0.37 18.2
Our method (α = 1.0, β = 0.5) 60.2 7.1 0.34 25.6
Our method (α = 0.75, β = 0.75) 56.4 5.3 0.23 26.3
Our method (α = 0.75, β = 0.5) 56.0 4.6 0.20 37.0
Our method (α = 0.5, β = 0.5) 51.2 2.6 0.11 52.6
Table 5: Speed-accuracy performance of our method.

5 In Context of Previous Work on Mobile

There are also some other endeavors trying to make object detection efficient enough for devices with limited computational power. They can be mainly classified into two major branches: lightweight image object detectors making the per-frame object detector fast, and mobile video object detectors exploiting temporal information.

5.1 Lightweight Image Object Detector

In spite of the work towards more accurate object detection by exploiting deeper and more complex networks, there are also efforts designing lightweight image object detectors for practical applications. Among them, the improvements of YOLO [15] and SSD [10], together with the latest Light-Head R-CNN [23], offer the best speed-accuracy trade-offs.

YOLO [15] and SSD [10] are one-stage object detectors, where the detection result is directly produced by the network in a sliding window fashion. YOLO frames object detection as a regression problem, and a lightweight detection head directly predicts bounding boxes on the whole image. In YOLO and its improvements, like YOLOv2 [11] and Tiny YOLO [16], specifically designed feature extraction networks are utilized for computational efficiency. In SSD, the output space of bounding boxes is discretized into a set of anchor boxes, which are classified by a lightweight detection head. In its improvements, like SSDLite [50] and Tiny SSD [17], more efficient feature extraction networks are also utilized.

Light-Head R-CNN [23] is a two-stage detector, where the object detector is applied on a small set of region proposals. In previous two-stage detectors, either the detection head or the layer preceding it is heavyweight. In Light-Head R-CNN, position-sensitive feature maps [6] are exploited to relieve the burden. It shows better speed-accuracy performance than the single-stage detectors.

A lightweight image object detector is an indispensable component of our video object detection system, on top of which our system further improves the speed-accuracy trade-off curve significantly. Here we choose to integrate Light-head R-CNN, thanks to its outstanding performance; other lightweight image object detectors should be generally applicable within our system.

5.2 Mobile Video Object Detector

Despite the practical importance of video object detection on devices with limited computational power, the literature is scarce. Only very recently have two works sought to exploit temporal information for addressing this problem.

In Fast YOLO [51], a modified YOLOv2 [11] detector is applied on sparse key frames, and the detected bounding boxes are directly copied to the non-key frames as their detection results. Although sparse key frames are exploited for acceleration, no feature aggregation or flow-guided warping is applied, and no end-to-end training for video object detection is performed. Without these important components, its accuracy cannot compete with ours. Direct comparison is difficult, however, because the paper does not report accuracy numbers on any dataset, and no code is publicly available.

In [44], MobileNet SSDLite [50] is applied densely on all the video frames, and multiple Bottleneck-LSTM layers are applied on the derived image feature maps to aggregate information from multiple frames. Without sparse key frames, it cannot speed up over the single-frame baseline. Extending it to exploit sparse key-frame features would be non-trivial, as it would involve feature alignment, which is also lacking in [44]. Its performance also cannot easily be compared with ours: it reports accuracy on a subset of ImageNet VID whose split is not publicly known, and its code is not public either.

Neither system can compete with the proposed system. Neither aligns features across frames. Besides, [51] does not aggregate features from multiple frames to improve accuracy, while [44] does not exploit sparse key frames for acceleration. Such design choices are vital for high performance video object detection.
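The cross-frame feature alignment both systems lack can be sketched as warping key-frame features to the current frame with an estimated flow field. The following is a minimal NumPy illustration of bilinear flow-guided warping, not the in-network implementation (which runs Light Flow and warps inside the computation graph):

```python
import numpy as np

def warp_feature(feat_key, flow):
    """Bilinearly warp a key-frame feature map to the current frame.

    feat_key: (C, H, W) features computed on the sparse key frame
    flow: (2, H, W) per-pixel displacement (dx, dy) mapping each current-frame
          location back to its source location on the key frame
    """
    C, H, W = feat_key.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Source sampling coordinates on the key frame, clamped to the border.
    sx = np.clip(xs + flow[0], 0, W - 1)
    sy = np.clip(ys + flow[1], 0, H - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0
    # Bilinear interpolation over the four neighboring feature vectors.
    return (feat_key[:, y0, x0] * (1 - wx) * (1 - wy)
            + feat_key[:, y0, x1] * wx * (1 - wy)
            + feat_key[:, y1, x0] * (1 - wx) * wy
            + feat_key[:, y1, x1] * wx * wy)

feat = np.random.rand(8, 16, 16)
same = warp_feature(feat, np.zeros((2, 16, 16)))
print(np.allclose(same, feat))  # True -- zero flow is an identity warp
```

On non-key frames, only the cheap flow network and this warp are run, which is why sparse propagation accelerates the whole system.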

5.3 Comparison on ImageNet VID

Of all the systems discussed in Section 5.1 and Section 5.2, SSDLite [50], Tiny YOLO [16], and YOLOv2 [11] are the most related systems that can be compared with reasonable effort. They all seek to improve the speed-accuracy trade-off by optimizing the image object detection network. Although they do not report results on ImageNet VID [47], they all fortunately release their code. We first carefully reproduced the results reported in their papers (on PASCAL VOC [52] and COCO [53]), and then trained models on ImageNet VID, also utilizing the ImageNet VID and ImageNet DET train sets. The trained models are applied on each video frame for video object detection. By varying the input image frame size (shorter side in {448, 416, 384, 352, 320, 288, 256, 224} for SSDLite and Tiny YOLO, and in {320, 288, 256, 224, 192, 160, 128} for YOLOv2), we can draw their speed-accuracy trade-off curves. The technical report of Fast YOLO [51] is also closely related, but it neither reports accuracy nor has public code, so we cannot compare with it. Note that the comparison is at the detection system level; we do not dive into the details of the varying technical designs.

Figure 1 presents the speed-accuracy curves of different systems on ImageNet VID validation. For our system, the curve is also drawn by adjusting the image size (shorter side for the image object detection network in {320, 288, 256, 224, 208, 192, 176, 160}; the input image resolution of the flow network is kept at half the resolution of the image recognition network), for fair comparison. The width multipliers α and β are set to 1.0 and 0.5 respectively, and the key frame duration length is set to 10. Our system surpasses all the existing systems by a clear margin. Our method achieves an accuracy of 60.2% at 25.6 fps, while YOLOv2, SSDLite, and Tiny YOLO obtain accuracies of 58.7%, 57.1%, and 44.1% at frame rates of 0.3, 3.8, and 2.2 fps respectively. To the best of our knowledge, this is the first time realtime video object detection has been achieved on mobile with reasonably good accuracy.

6 Discussion

In this paper, we propose a lightweight network for video object detection on mobile devices. We verify that the principles of sparse feature propagation and multi-frame feature aggregation still hold under very limited computational resources. A very small flow network, Light Flow, is proposed, together with a flow-guided GRU module for effective feature aggregation.
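The gating logic of a GRU-style aggregation step can be sketched as follows. This is a simplified fully-connected version for illustration only: the actual module operates on feature maps with convolutions, the previous key-frame state is assumed to have already been flow-warped to the current key frame, and the ReLU candidate activation (in place of the standard tanh) is an assumption of this sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def flow_gru_step(h_warp, x, Wz, Wr, Wh):
    """One simplified GRU step for aggregating key-frame features.

    h_warp: aggregated features from previous key frames, already warped
            to the current key frame by the flow network
    x: features extracted on the current key frame
    Wz, Wr, Wh: parameter matrices acting on the concatenation [h_warp, x]
    """
    hx = np.concatenate([h_warp, x])
    z = sigmoid(Wz @ hx)  # update gate: how much new evidence to absorb
    r = sigmoid(Wr @ hx)  # reset gate: how much history to consult
    # Candidate state (ReLU assumed here in place of the standard tanh).
    h_tilde = np.maximum(Wh @ np.concatenate([r * h_warp, x]), 0)
    return (1 - z) * h_warp + z * h_tilde

rng = np.random.default_rng(0)
d = 4
h_warp, x = rng.standard_normal(d), rng.standard_normal(d)
Wz, Wr, Wh = (rng.standard_normal((d, 2 * d)) for _ in range(3))
h_new = flow_gru_step(h_warp, x, Wz, Wr, Wh)
print(h_new.shape)  # (4,)
```

The gating lets the aggregated state keep long-term appearance information while still adapting quickly when a new key frame contradicts it.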

A possible issue with the current approach is that there is a short latency in processing online streaming videos, because recognition on the key frames is still not fast enough. It would be interesting to study this problem in the future.