Object detection (OD) is an important component of products in many domains: digital security and surveillance, autonomous vehicles, etc. Complex solutions usually solve multiple tasks in parallel, so it is essential to have a fast and accurate algorithm.
In , the authors compared modern meta-architectures for OD and showed that there is a speed/accuracy trade-off: higher-quality two-stage models run slower than less accurate single-shot methods. However, even single-shot methods run reasonably fast only at small input resolutions (e.g., 300x300 pixels).
For deployment (by deployment, we understand the steps that should be done in order to use the algorithm in market-ready solutions), it is not enough to have just a fast object detector. It has to be robust against the two typical types of errors: missed objects and false alarms.
In this work, we address the problem of designing an object detector for deployment: specifically, how to make it fast yet reasonably accurate, and how to compensate for its blinking and false positives. We selected the autonomous driving domain and require the detection of two classes of objects, vehicles and pedestrians, which is a common task in advanced driver-assistance systems (ADAS). Our main contributions are:
We describe the design steps to build an OD that can run at 60+ fps on edge devices.
We provide the designed model for evaluation as part of an open-source inference framework.
We propose a way to weaken the confidence of typical false positives, which helps to reduce the false alarm rate.
We present a lightweight strategy to post-process detections in order to compensate for their blinking.
1.1 Related Work
There are two major groups of DL OD: one-stage and two-stage methods. Among two-stage methods, Faster R-CNN  provides the best quality but is the slowest. R-FCN  aims to improve speed by sharing all computations via position-sensitive score maps, but at the cost of accuracy. One-stage methods, such as SSD , are the fastest. However, their speed degrades on high-resolution input.
An important line of research is the design of lightweight backbones that can perform on par with the top classification networks. CNNs that utilize depth-wise convolutions ,  achieve a dramatic parameter reduction and faster inference time. The authors of  show that only SSD-like OD can adopt lightweight backbones without a large drop in accuracy.
One more promising technique for obtaining a lightweight, well-performing solution is knowledge distillation . There are plenty of works in this direction for the classification task , , . For OD, the topic is not as well explored. In , the authors adopt Faster R-CNN as the object detection framework for distillation and propose a set of steps needed to make distillation work. Although this looks promising, one must first train a deep teacher model, which requires additional time and sometimes additional data (to prevent overfitting). Nevertheless, our findings are complementary and can be used along with distillation.
Most modern ODs are based on backbones pre-trained on ImageNet. In many cases, pre-training is a separate task that usually requires a lot of time. Recent works ,  suggest ways to design CNNs that can be trained directly from scratch for OD. Here we propose steps to train a lightweight OD directly, without specifically designed CNN blocks or many hours of backbone pre-training on additional data.
OD often suffers from false positives, and no one will deploy an OD network in an application that regularly produces false alarms. One could argue that the better the detector accuracy, the fewer the false positives; but, as noted, there is a speed/accuracy trade-off. In the next chapter, we propose a simple method that decreases the confidence of false positives.
Usually, before running any OD, one should select a confidence threshold for the detector: all detections above this value are considered positives, and those below it are discarded. So, when running a good OD, one will see a box around the object most of the time, but sometimes it blinks. This happens when the confidence of the detected object drops below the threshold, so the detection is filtered out and the object is missed in some frames. Such situations can be compensated with trackers . However, a tracker is a separate algorithm that can be computationally expensive. We outline an extremely cheap tracking strategy, based on re-detection, which exploits the nature of OD.
2 Designing Object Detector for Deployment
In this section we consider all the aspects of designing a lightweight object detection architecture for deployment that is able to run at real-time speed on edge computing devices (at the edge). Our target use case is OD for the ADAS scenario, so the final detector recognizes objects of two classes: pedestrians and vehicles (the latter includes cars, trucks, buses, etc.). An ADAS typically receives input from a monocular RGB camera, which is usually mounted inside the car on the windshield or on top of the car and provides a video stream with a 16:9 aspect ratio. Nevertheless, all findings and insights can be applied to other classes of objects due to their high-level nature.
In our research we used self-collected datasets to evaluate the accuracy of the final model. They consist of representative sets of objects captured by several cameras under various weather conditions and contain multiple road scenes: city roads, countryside, highways, etc.
2.1 Design Practices
2.1.1 Real-time CNN
As mentioned, there are many object detection frameworks, such as Faster R-CNN, R-FCN, and SSD; moreover, many variations of them have recently been designed to improve on the originals , , . In our work, we chose SSD as the detection architecture based on the comparison made in . It was shown there that SSD generally performs worse than two-stage detectors like Faster R-CNN or R-FCN, but outperforms them with a lightweight backbone. Following this, we used MobileNet  as the feature extractor inside the SSD detection framework, since it is light both in computational complexity and in the number of parameters. Furthermore, we applied several modifications on top of it and inside the SSD pipeline to run it in real time on a mainstream CPU.
Resolution. We used an input resolution and aspect ratio different from the original SSD for two reasons. The first is to improve the detection of small objects: we increased the input resolution of the CNN from the default 300x300 to 672x384, which helps to recognize pedestrians with a minimum size of 40x80 pixels and vehicles with a minimum size of 40x30 pixels on a 720p frame. The second is that this resolution has an aspect ratio close to 16:9, used in popular video formats like 720p and 1080p, so the loss of information along the width dimension is smaller than with a square resolution.
Depth-wise head. Besides the backbone, we also used the depth-wise block in the SSD "head". In the recent MobileNetV2 paper , a similar architecture was called SSDLite. The authors argue that this change reduces computational complexity without affecting quality dramatically.
Extended SSD. We used more prediction branches in SSD to improve the handling of small and medium-size objects. We added two additional branches (one for small and one for medium size) and put them on the same feature maps as the first two branches of the original MobileNet+SSD architecture . This also forced us to change the sizes of the prior boxes placed on these feature maps. For example, if the original prior boxes had parameters , they would be evenly split to have parameters  and , accordingly. Fig. 1 shows a scheme of such a split.
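As an illustration, the even split of one prior-box scale range into two sub-ranges can be sketched as follows (the function name is ours; the concrete parameter values live in Fig. 1 and are not reproduced here):

```python
def split_prior_scales(s_min, s_max):
    """Evenly split the scale range [s_min, s_max] handled by one
    prediction branch into two adjacent sub-ranges, so that two
    branches cover the object sizes previously assigned to one."""
    s_mid = (s_min + s_max) / 2.0
    return (s_min, s_mid), (s_mid, s_max)
```

For example, a branch covering scales 0.1–0.3 would split into 0.1–0.2 and 0.2–0.3; in SSD terms, these become the min/max sizes of the two new branches.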
It may seem that these branches significantly increase the GMAC count and slow down inference, since they are placed on the feature maps with the highest spatial resolution. However, for the target use case, where we need to detect just two classes, the change is not so dramatic, while it reasonably improves quality. Table 2 shows such a comparison for networks with 672x384 input resolution after two-stage pre-training and with the depth-wise head.
(Table 2, row excerpt: model with extra predictors — 0.635, 0.423, 3.6)
The changes from each design choice are summarized in Table 1.
Pruning. Since we are solving a two-class detection problem (pedestrians and vehicles), we can use fewer channels in most layers. This can be done by applying pruning methods that remove whole convolutional kernels to obtain an immediate inference speed-up. These methods can be based on a straightforward strategy, such as random sampling, or on more sophisticated algorithms that consider the importance of filters , . In , the authors show that different pruning methods give comparable results for a similar problem. Guided by simplicity and ease of reproducibility, we used random filter sampling. After pruning the network, one more training stage of a couple of epochs is performed to adapt the weights; see the results in Table 3. The pruned model shows slightly better results for both pedestrians and vehicles, because pruning has a regularization effect and can help convergence.
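A minimal sketch of random whole-filter pruning, using plain Python lists for the weights (the names and the keep ratio are our assumptions, not the paper's):

```python
import random

def random_filter_prune(w_conv, w_next, keep_ratio=0.75, seed=0):
    """Structured (whole-filter) pruning by random sampling.
    w_conv is a list of a conv layer's output filters; w_next is a
    list of the next layer's filters, each a list of input-channel
    kernels. Dropping whole filters of w_conv and the matching
    input channels of w_next keeps the network consistent."""
    rng = random.Random(seed)
    out_ch = len(w_conv)
    n_keep = max(1, int(out_ch * keep_ratio))
    keep = sorted(rng.sample(range(out_ch), n_keep))
    pruned_conv = [w_conv[i] for i in keep]
    pruned_next = [[f[i] for i in keep] for f in w_next]
    return pruned_conv, pruned_next
```

Removing whole filters, rather than individual weights, is what yields an immediate speed-up on standard hardware, since the pruned layers simply become narrower.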
2.1.2 Two-stage pre-training
One important aspect of designing an OD model is that, to achieve sufficient quality, the backbone should be pre-trained on a diverse dataset such as ImageNet, which contains millions of images. However, this process can be time-consuming and has disadvantages such as learning bias, domain mismatch , etc. While experimenting with various datasets, we found that for the object detection use case it is possible to use just the COCO  dataset to get a decent OD.
Training a model from scratch requires good gradient estimation, so a large batch is essential in this case. However, since the input resolution of the network was increased, only a limited number of images fit into memory; the actual number depends on the specific device configuration. We could provide a batch size of 96 images during training, which leads to weak accuracy results. That is why we propose two-stage pre-training on COCO. At the first stage, we trained MobileNet+SSD at the original resolution of 300x300 pixels with a large batch size. After that, we changed the resolution to the target one, adjusted the sizes of the prior boxes, and used the weights of the small-resolution model to initialize a fine-tuning stage with a smaller batch size. We did not freeze any layer weights during either stage. The intermediate results for single-stage training, the results of the two-stage training scheme, and this scheme with additional prediction layers are shown in Table 1.
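The schedule above can be sketched as a configuration. The stage-1 batch size below is a hypothetical placeholder (the text only calls it "large"); 96 is the batch the paper reports fitting at the high resolution:

```python
def two_stage_schedule(stage1_batch=256, stage2_batch=96):
    """Two-stage COCO pre-training: stage 1 trains MobileNet+SSD at
    the original 300x300 with a large batch; stage 2 fine-tunes at
    the target 672x384 with adjusted prior boxes and a smaller
    batch, initialized from stage-1 weights. No layers are frozen
    in either stage. stage1_batch=256 is an assumed value."""
    return [
        {"stage": 1, "resolution": (300, 300), "batch": stage1_batch,
         "init": "random", "frozen_layers": []},
        {"stage": 2, "resolution": (672, 384), "batch": stage2_batch,
         "init": "stage1_weights", "frozen_layers": []},
    ]
```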
For comparison, we trained MobileNet+SSD 300x300 both on ImageNet+COCO and with the proposed procedure on COCO only. Table 4 shows the results of these experiments. The results are similar; hence the two-stage scheme allows using just a single dataset and avoids spending time on additional hyperparameter tuning for backbone pre-training.
(Table 4, row excerpt: two-stage scheme only on COCO — 0.359)
After two-stage pre-training, we fine-tuned the final topology on the proprietary dataset.
2.1.3 False positives suppression
Pedestrian detection remains a hard task due to large scale and appearance variability. In , the authors show that objects with a significant horizontal gradient, such as poles, tree trunks, etc., are strong false positives (i.e., they receive a high confidence value) classified as pedestrians. These objects are typical of road scenes (see Fig. 2), so we chose them as the first candidates to suppress.
To address the problem, we collected an additional dataset of frames containing only such objects, so these images contain no positives. This is done to balance positives and hard false positives; it is also complicated to find hard false positives and positives in the same image. Furthermore, such images do not require any annotation, which makes gathering them cheap. Our aim is to show how to use this kind of data to suppress false positives without reducing detection quality. We ran the already trained detector on this dataset and kept only the frames with false positives of reasonable confidence (more than 0.3), to make sure each additional frame containing a false positive has a strong impact on training. The batch size was then increased by 30% to include such images, so the total size of the training dataset increased by 30%. By default, the SSD framework does not compute a loss on frames without positives, so we modified the loss layer to allow a contribution from such frames and continued training for 5 epochs with the same parameters. This trick reduces the false positive rate, but it also degrades overall accuracy. We hypothesize that such false positives make training harder: the network quickly leaves the local optimum and starts to learn how to filter false positives rather than how to detect pedestrians well. To remedy this, we halved the learning rate from the default. This prevents the network from moving too far from the local optimum while still reducing the confidence of false positives.
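The frame-mining step above can be sketched as follows (the names are ours; `frame_detections` maps a frame id to the confidences the trained detector produced on the negatives-only set):

```python
def mine_hard_negative_frames(frame_detections, conf_thr=0.3):
    """Keep only the frames from a negatives-only dataset on which
    the trained detector fired with reasonable confidence, so that
    every mined frame carries a strong false positive that will
    influence fine-tuning."""
    return [frame_id for frame_id, confs in frame_detections.items()
            if any(c > conf_thr for c in confs)]
```

The mined frames are then mixed into training (with the loss layer modified to accept frames without positives), adding roughly 30% to the batch and dataset size.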
The results in Table 5 show that, using the proposed scheme, the detector finds more positives at the same number of false positives. The final average precision of the network remains the same, so this scheme does not actually eliminate strong false positives; rather, it reduces their confidence, making them not so strong.
2.1.4 Results post-processing
When working with a video stream rather than a single image, it is important to have an auxiliary part of the overall pipeline responsible for tracking objects when the main detector fails. A lot of research has been done in this field , , , , , . While simple approaches can run in real time, they usually get stuck on background objects when the tracked object is occluded. More complex solutions can handle this, but they usually require extra computation to discriminate the background. We propose a tracking strategy that exploits the nature of OD: it distinguishes objects from background and uses almost no extra computation.
Almost all DL-based OD perform a final classification of multiple proposals. Some proposals have confidence high enough to pass the threshold; the rest are usually discarded. The idea of the proposed tracking method is to match detections of reasonable confidence in the current frame to detections in the previous frame, using the popular Intersection over Union (IoU) metric as a similarity measure. We consider all detections with confidence higher than 0.2 and, for each, choose the best match by IoU. If a low-confidence detection in the current frame has a match in the previous frame, it is retained despite its low confidence. This simple re-detection approach compensates for detector errors and the negative influence of a strict detection threshold, so detections blink less often. Moreover, such retained detections do not stick to the background, because the detector discriminates them and does not even produce proposals when the object is occluded or has left the frame. The runtime of this procedure is less than 0.1 ms on our test system in a challenging road scene with more than 10 objects, so it is reasonable to use.
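A minimal sketch of this re-detection matching (the 0.2 low-confidence bound is from the text; the deployment threshold `conf_high` and the IoU cut-off are hypothetical values chosen for illustration):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def retain_by_redetection(curr, prev, conf_low=0.2,
                          conf_high=0.5, iou_thr=0.5):
    """curr/prev are lists of (box, confidence) per frame.
    Keep high-confidence detections as usual; additionally keep a
    low-confidence detection if it overlaps a detection from the
    previous frame, which suppresses blinking."""
    kept = []
    for box, conf in curr:
        if conf >= conf_high:
            kept.append((box, conf))
        elif conf >= conf_low and any(
                iou(box, pb) >= iou_thr for pb, _ in prev):
            kept.append((box, conf))
    return kept
```

A low-confidence box is kept only while something overlapping it existed in the previous frame, so spurious background boxes are not propagated, matching the behavior described above.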
2.2 Inference with OpenVINO
One more important thing to consider is the inference engine. For this purpose, hardware vendors provide highly optimized inference frameworks such as NVIDIA TensorRT  and Intel OpenVINO .
In our work we used OpenVINO and its Intel DL Deployment Toolkit as the target inference solution. OpenVINO can import models from many DL frameworks and optimize them for various Intel hardware, such as CPUs, GPUs, FPGAs, and Movidius VPUs.
We performed all the experiments using Caffe framework with additional layers to implement SSD and depth-wise convolutions.
Table 6 shows the accuracy of our final MobileNet+SSD architecture at 672x384 resolution.
To measure its performance we used the publicly available OpenVINO toolkit and ran experiments on an Intel Core i5-6500 CPU, which can be used in edge devices. Table 7 also compares the results with naive inference using the Caffe framework compiled with the Intel MKL library. It can be seen that the designed OD can be offloaded to the integrated GPU while still running in real time, so the CPU remains available for other tasks. Table 8 shows the performance of the designed OD on the target hardware compared with a simple two-class baseline network (SSD with a MobileNet backbone, without the proposed improvements) at the same resolution.
In this work, we have outlined some important problems of developing a DL-based object detector and provided solutions to deal with them. Based on these insights, we developed a lightweight CNN that runs at 60+ frames per second with the OpenVINO toolkit on a general-purpose CPU. We focused on the ADAS case; however, these practices can be applied to other domains. It is worth noting that real systems usually consist of complex pipelines combining multiple tasks that share the same hardware, so every component of such a system should operate faster than real time for the whole pipeline to run in real time. Thus we believe the described practices are important for developing performance-critical OD applications.
Quantization or even binarization of the model weights, as well as more sophisticated pruning methods, could further improve inference speed. Moreover, designing the CNN in a hardware-friendly way may further boost performance. We leave the evaluation and development of such practices for future research.
-  Intel® OpenVINOTM Toolkit. https://software.intel.com/en-us/openvino-toolkit.
-  NVIDIA TensorRT Programmable Inference Accelerator. https://developer.nvidia.com/tensorrt.
-  D. Anisimov and T. Khanova. Towards lightweight convolutional neural networks for object detection. arXiv preprint arXiv:1707.01395, 2017.
-  G. R. Bradski. Computer vision face tracking for use in a perceptual user interface. In Workshop on Applications of Computer Vision, pages 214–219, 1998.
-  Z. Cai, Q. Fan, R. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, pages 354–370, 2016.
-  G. Chen and et al. Learning efficient object detection models with knowledge distillation. In NIPS, pages 742–751, 2017.
-  F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1251–1258, 2017.
-  E. Crowley, G. Gray, and A. Storkey. Moonshine: Distilling with cheap convolutions. arXiv preprint arXiv:1711.02613, 2017.
-  J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, pages 379–387, 2016.
-  M. Danelljan and et al. Eco: Efficient convolution operators for tracking. In CVPR, 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
-  H. Fan and H. Ling. Sanet: Structure-aware network for visual tracking. In CVPR Deep Vision Workshop, 2017.
-  J. F. Henriques and et al. High-speed tracking with kernelized correlation filters. In TPAMI, 2015.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
-  J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, pages 7310–7319, 2017.
-  J. Jeong, H. Park, and N. Kwak. Enhancement of ssd by concatenating feature maps for object detection. arXiv:1705.09587, 2017.
-  Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures. In ICPR, pages 23–26, 2010.
-  H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv:1608.08710, 2016.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. Ssd: Single shot multibox detector. In ECCV, pages 21–37, 2016.
-  A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. arXiv:1603.00831, 2016.
-  H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
-  R. Rajaram, E. Ohn-Bar, and M. Trivedi. Looking at pedestrians at different scales: A multiresolution approach and evaluations. In IEEE Transactions on Intelligent Transportation Systems, pages 3565–3576, 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510–4520, 2018.
-  Z. Shen, Z. Liu, J. Li, Y. Jiang, Y. Chen, and X. Xue. Dsod: Learning deeply supervised object detectors from scratch. In ICCV, 2017.
-  Z. Shen, H. Shi, R. Feris, L. Cao, S. Yan, D. Liu, X. Wang, X. Xue, and T. Huang. Learning object detectors from scratch with gated recurrent feature pyramids. arXiv:1712.00886, 2017.
-  G. Urban and et al. Do deep convolutional nets really need to be deep and convolutional? In ICLR, 2017.
-  W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li. Coordinating filters for faster deep neural networks. arXiv:1703.09746, 2017.
-  S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.