1 Introduction
Implementing CNN-based object detection on stationary surveillance cameras can enhance the safety of cities, homes, offices, and factories by detecting unauthorized substances or discovering anomalous events (e.g., a collapsed person). However, computational efficiency is a key requirement, since such devices demand battery operation to ease installation. Many successful approaches to improving the efficiency of image classification have been proposed, such as model compression [2] and model cascades with domain-specific models [3]. However, object detection is more complex than image classification, and while these techniques are likely to remain effective, there is a need for additional methods.
Instead of compressing large models, we aim to train a computation-efficient model for each specific surveillance camera, and we propose a framework to train such domain specific models (DSMs). The framework is based on knowledge distillation [1, 4, 5], but targets reducing the accuracy gap between student and teacher models by training the student on a restricted class of domain-specific images. Since such training may be conducted on edge devices, we improve training efficiency by culling easy-to-classify images, at only a small accuracy penalty.
This paper’s contributions are summarized below.
- We propose an end-to-end framework for training domain specific models (DSMs) to mitigate the tradeoff between object-detection accuracy and computational efficiency. To the best of our knowledge, this is the first successful demonstration of training DSMs for object detection tasks.
- By training resnet18-based Faster-RCNN DSMs, we observe a 19.7% accuracy (relative mAP) improvement over a COCO-trained model of the same size, tested on a customized YouTubeLive dataset.
- Since edge devices have limited resources, we propose culling the training dataset to significantly reduce the computational resources required for training. Only training data with high utility is added. This filtering allows us to reduce training time by 93% with an accuracy loss of only 3.6%.
2 Training Domain Specific Models
Large-scale object detection datasets such as COCO [6] contain a large and diverse set of natural images. A small model trained on such a large dataset typically yields more misdetections than a large model. Furthermore, [4] showed that misdetections usually occur between foreground and background (false positives and false negatives); misdetections rarely result from inter-class errors. In video surveillance, because frames in a video stream share a stationary background, a compact model can be good enough to separate foreground from background. This motivates our DSM framework, which trains compact models on a dataset constructed from domain-specific data.

As illustrated in Algorithm 1, our DSM framework consists of preparing the data and training the DSM. A major challenge when deploying models in surveillance is preparing the training data, since manually labelling frames in videos is cumbersome. To overcome this, we label the dataset used to train the DSM with the predictions of a much larger teacher model with higher accuracy, treating these predictions as ground-truth labels. Furthermore, we compare the teacher's prediction on each image to that of the DSM to determine whether to store the image and its teacher-generated label in our compiled dataset. After the training set is compiled, it is used to train the DSM.
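A minimal sketch of this labeling step, assuming a torchvision-style detection model that returns dictionaries of boxes, labels, and scores in eval mode; the 0.7 confidence cutoff is our own illustrative choice, not a value from the paper:

```python
import torch

@torch.no_grad()
def pseudo_label(teacher, frames, score_thresh=0.7, device="cuda"):
    """Run the teacher on unlabeled frames and keep its confident
    detections as ground-truth labels for the student (DSM)."""
    teacher.eval().to(device)
    dataset = []
    for frame in frames:  # frame: 3xHxW float tensor in [0, 1]
        pred = teacher([frame.to(device)])[0]
        keep = pred["scores"] > score_thresh  # drop low-confidence boxes
        dataset.append({
            "image": frame,
            "boxes": pred["boxes"][keep].cpu(),
            "labels": pred["labels"][keep].cpu(),
        })
    return dataset
```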
Training an object detection model can take hours even with a GPU, which is challenging for applications requiring frequent retraining. We exploit the fact that a DSM pretrained on a large-scale general dataset can already provide good predictions for a large fraction of the domain-specific data. Our procedure therefore builds a compact dataset composed only of data on which the DSM is inconsistent with the teacher's predictions. Keeping data on which the DSM and teacher detections are consistent is computationally redundant, because such data contributes little gradient signal. We define the following to quantify the consistency between teacher and DSM:
\[
\mathrm{consistency} = \frac{TP}{TP + FP + FN} \tag{1}
\]

where TP, FP, and FN denote the number of true positive, false positive, and false negative bounding-box (BB) detections of the DSM on the image, computed by treating the teacher's detections as ground truth. Significantly fewer training data and steps are required, with only a minimal penalty in accuracy.
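A sketch of this filter under our reading of Eq. (1), using greedy IoU matching to count TP/FP/FN between DSM and teacher boxes; the 0.5 IoU threshold and the 0.9 keep/discard cutoff are illustrative assumptions:

```python
from torchvision.ops import box_iou

def consistency(dsm_boxes, teacher_boxes, iou_thresh=0.5):
    """Treat teacher boxes as ground truth, greedily match DSM boxes
    to them, and return TP / (TP + FP + FN) as in Eq. (1)."""
    if len(teacher_boxes) == 0:
        return 1.0 if len(dsm_boxes) == 0 else 0.0
    if len(dsm_boxes) == 0:
        return 0.0
    ious = box_iou(dsm_boxes, teacher_boxes)  # (num_dsm, num_teacher)
    tp, matched = 0, set()
    for i in range(len(dsm_boxes)):
        j = int(ious[i].argmax())
        if ious[i, j] >= iou_thresh and j not in matched:
            matched.add(j)
            tp += 1
    fp = len(dsm_boxes) - tp      # DSM boxes with no teacher match
    fn = len(teacher_boxes) - tp  # teacher boxes the DSM missed
    return tp / (tp + fp + fn)

def keep_for_training(dsm_boxes, teacher_boxes, thresh=0.9):
    # Keep only images where the DSM disagrees with the teacher.
    return consistency(dsm_boxes, teacher_boxes) < thresh
```

An image enters the training set only when the DSM disagrees with the teacher, so frames the pretrained DSM already handles well never consume training compute.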
Table 1: Mean accuracy of COCO-pretrained students versus trained DSMs. Teacher: Res101 [48M/68ms].

| Student | COCO | DSM | Improvement |
|---|---|---|---|
| Res18 [12M/26ms] | 54.5 | 74.3 | +19.7 |
| Squeeze [6M/21ms] | 41.5 | 63.3 | +21.7 |
Table 2: Mean accuracy of res18 DSMs versus the number of training samples (selected with the DDS strategy), and the corresponding training time; missing cells are marked –.

| Dataset | Classes | 64 | 128 | 256 | 512 | All (3600) |
|---|---|---|---|---|---|---|
| coral | 1 | – | – | – | – | 90 |
| taipei | 4 | – | – | – | – | 68.2 |
| jackson | 2 | – | – | – | – | 87.0 |
| kentucky | 2 | – | – | – | – | 67.2 |
| castro | 3 | – | – | – | – | 77.6 |
| Training Time [min] | – | 1.8 | 3.6 | 7.4 | 14.6 | 110 |
3 Experiments
Models. We develop Faster-RCNN object detection models in PyTorch, pretrained on MS-COCO [7, 8, 6]. We use three backbones for the region proposal network (RPN): resnet101 (res101), resnet18 (res18), and squeezenet (squeeze) [9, 10]. Res18 and squeeze hold 10% and 19% top-5 ImageNet error respectively, a common accuracy range for compact CNN models such as MobileNet [11]. During training, res101 is used as the teacher, while res18 and squeeze are used as DSMs. While we chose Faster-RCNN for its accuracy on YouTubeLive, YOLO/SSD detectors can also be used with this framework for improved efficiency, because training requires only the bounding-box labels [12, 13]. We release the code and the dataset at https://github.com/kentaroy47/training-domain-specific-models.

Dataset. We obtain 5 fixed-angle videos from YouTubeLive. Each video is 2 hours long at 1 frame per second (fps), yielding 7200 images. We split the images evenly: the first 3600 images are used for training and the later 3600 for testing.
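The experiments use the authors' own Faster-RCNN codebase; as a non-authoritative sketch, a comparable teacher/student pair could be assembled with torchvision's detection API (the `resnet_fpn_backbone` helper covers only the ResNet variants, so the squeezenet student is omitted here):

```python
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

def build_detector(backbone_name: str, num_classes: int = 91):
    """Faster-RCNN with a ResNet-FPN backbone (91 = COCO classes + background)."""
    backbone = resnet_fpn_backbone(backbone_name, pretrained=True)
    return FasterRCNN(backbone, num_classes=num_classes)

teacher = build_detector("resnet101")  # large, accurate teacher
student = build_detector("resnet18")   # compact DSM student
```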
Results. As shown in Table 1, we first train our res18 DSM on the full 3600 training images for 10 epochs using stochastic gradient descent. Compared to the res18 model pretrained on MS-COCO but without domain-specific training, we achieve an average accuracy improvement of 19.7%. Table 2 shows the effectiveness of DDS (the dataset culling described in Section 2) on res18. Using DDS, we reduce training time by 93% (256 images) with only a 3.6% accuracy penalty. If we instead pick 256 training images sequentially (the "simple" strategy), accuracy worsens by 10.2% compared to DDS.
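A minimal training-loop sketch for this step, again assuming torchvision-style detection models that return a loss dictionary in train mode; the learning rate and momentum below are illustrative assumptions, not the paper's values:

```python
import torch

def train_dsm(student, culled_dataset, epochs=10, lr=1e-3, device="cuda"):
    # lr/momentum are placeholders; substitute the values used in practice.
    student.train().to(device)
    optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for sample in culled_dataset:  # entries built by pseudo_label()
            images = [sample["image"].to(device)]
            targets = [{"boxes": sample["boxes"].to(device),
                        "labels": sample["labels"].to(device)}]
            loss_dict = student(images, targets)  # dict of RPN/RoI losses
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```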
3.1 Comparison against Data Distillation
Data Distillation [5] is fundamentally different from our application setting and methods. In [5], models with large network capacity were shown to achieve higher accuracy by bootstrapping the dataset with automatically labeled extra data. In our framework, by contrast, in order to fully utilize the small network capacity, we aim to train the models with only the domain-specific data.
Table 3 shows that following the data distillation approach (i.e., aggregating PASCAL with the entire YouTubeLive data) yields a lower rmAP improvement than our approach of training with only domain-specific data. Training the small models with the entire YouTubeLive dataset also yields lower improvements. This is fundamentally due to the limited capacity of the compact but computationally efficient models. We observe that when training small models, using a larger dataset does not always produce better results, whereas restricting the data domain can.
Table 3: rmAP improvement of student models trained on different datasets.

| | PASCAL+YoutubeLive | YoutubeLive | Domain Specific |
|---|---|---|---|
| Res18 | +9.7 | +12.9 | +19.7 |
| Squeeze | +10.3 | +13.5 | +21.7 |
References
- [1] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [2] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
- [3] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. NoScope: Optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11):1586–1597, 2017.
- [4] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pages 742–751, 2017.
- [5] Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, and Kaiming He. Data distillation: Towards omni-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4119–4128, 2018.
- [6] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
- [7] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
- [8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [10] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
- [11] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- [12] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
- [13] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.