Training Domain Specific Models for Energy-Efficient Object Detection

11/06/2018
by   Kentaro Yoshioka, et al.
Stanford University

We propose an end-to-end framework for training domain specific models (DSMs) to obtain both high accuracy and computational efficiency for object detection tasks. DSMs are trained with distillation [hinton2015distilling] and focus on achieving high accuracy in a limited domain (e.g. a fixed view of an intersection). We argue that DSMs can capture essential features well even with a small model size, enabling higher accuracy and efficiency than traditional techniques. In addition, we improve the training efficiency by culling easy-to-classify images from the training set. For the limited domain, we observe that compact DSMs significantly surpass the accuracy of COCO-trained models of the same size. By training on a compact dataset, we show that the training time can be reduced by 93% with an accuracy drop of only 3.6%.



1 Introduction

Implementing CNN-based object detection on stationary surveillance cameras can enhance the safety of cities, homes, offices, and factories by detecting unauthorized substances or discovering anomalous events (e.g. a collapsed person). However, computational efficiency is a key requirement, since such devices demand battery operation to ease installation. Many successful approaches to improve the efficiency of image classification have been proposed, such as model compression [han2015deep] and model cascades with domain specific models [kang2017noscope]. However, object detection is more complex than image classification, and while these techniques are likely to remain effective, there is a need for additional methods.

Instead of compressing large models, we aim to train a computation-efficient model for each specific surveillance camera, and we propose a framework to train such domain specific models (DSMs). The framework is based on knowledge distillation [hinton2015distilling, chen2017learning, girshickdata] but aims to reduce the accuracy gap between student and teacher models by training the student on a restricted class of domain specific images. Since such training may be conducted on edge devices, we improve the training efficiency by culling easy-to-classify images, with only a small accuracy penalty.

This paper’s contributions are summarized below.

  • We propose an end-to-end framework for training domain specific models (DSMs) to mitigate the tradeoff between object-detection accuracy and computational efficiency. To the best of our knowledge, this is the first successful demonstration of training DSMs for object detection tasks.

  • By training resnet18-based Faster-RCNN DSMs, we observed a 19.7% accuracy (relative mAP) improvement compared to a COCO-trained model of the same size, tested on a customized YoutubeLive dataset.

  • Since edge devices have limited resources, we propose culling the training dataset to significantly reduce the computation required for training. Only training data with high utility for training is kept. This filtering allows us to reduce training time by 93% with an accuracy loss of only 3.6%.

2 Training Domain Specific Models

Large-scale object detection datasets such as COCO [lin2014microsoft] contain a large and diverse set of natural images. Using a small model on such a large dataset would typically yield more misdetections than a large model. Furthermore, [chen2017learning] showed that misdetections usually occur between foreground and background (false positives and false negatives); rarely do misdetections occur as a result of inter-class errors. In video surveillance, because frames in a video stream share a stationary background, a compact model can be good enough to separate foreground from background. This motivates our DSM framework: train compact models on a dataset constructed from domain-specific data.

Require: Domain Specific Model (DSM), Teacher model
1: procedure 1. Prepare Dataset (given domain images x_1, ..., x_M)
2:     for i = 1 to M do
3:         y_teacher ← Teacher.predict(x_i)
4:         y_DSM ← DSM.predict(x_i)
5:         compute score L_i from y_teacher and y_DSM
6:     Collect the N pairs (x_i, y_teacher) with the largest values of L_i.
7:     Compile the Difficult Dataset (DDS) from these pairs.
8: procedure 2. Train DSM
9:     DSM.train(DDS)
10: procedure 3. Inference
11:     Detection ← DSM.predict(image)
Algorithm 1 Training Domain Specific Models
Figure 1: Object detection results on a test image, before and after domain specific training.

As illustrated in Algorithm 1, our DSM framework consists of preparing the data and training the DSM. A major challenge when deploying models in surveillance is preparing the training data, since manually labelling frames in videos is cumbersome. To overcome this, we label the dataset used to train the DSM using the predictions of a much larger teacher model with higher accuracy, treating these predictions as ground-truth labels. Furthermore, we compare the prediction made by the teacher on each image to that of the DSM; this comparison determines whether the image and its teacher label are stored in our compiled dataset. After the training set is compiled, it is used to train the DSM.
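The teacher-labeling step above can be sketched as follows. This is a minimal illustration, not the authors' code: the tuple layout and the 0.5 confidence threshold are assumptions.

```python
# Sketch of the teacher-labeling step: the teacher's confident detections
# become pseudo-ground-truth labels for the student DSM. The detection
# format and confidence threshold are illustrative assumptions.

def pseudo_label(teacher_detections, conf_thresh=0.5):
    """Keep only confident teacher detections as ground-truth labels.

    teacher_detections: list of (box, class_id, confidence) tuples,
    where box is (x1, y1, x2, y2).
    """
    return [(box, cls) for box, cls, conf in teacher_detections
            if conf >= conf_thresh]

labels = pseudo_label([((10, 10, 50, 50), "person", 0.92),
                       ((60, 20, 80, 40), "car", 0.31)])
# only the confident "person" detection survives as a label
```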

Training an object detection model can take hours even with a GPU, which is challenging for applications requiring frequent retraining. We exploit the fact that when the DSM is pretrained on a large-scale general dataset, it already provides good predictions for a large fraction of the domain-specific data. This procedure develops a compact dataset composed only of data on which the DSM is inconsistent with the prediction made by the teacher model. Keeping data on which the DSM and teacher detections are consistent is computationally redundant, because such data contributes little to the gradient signal. We define a score L to quantify the consistency between teacher and DSM:

L = (FP + FN) / (TP + FP + FN)          (1)

where TP, FP, and FN represent the number of true positive, false positive, and false negative bounding-box (BB) detections on the image, with the teacher's detections treated as ground truth. Significantly less training data and fewer training steps are required, with only a minimal penalty in accuracy.
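A minimal sketch of computing this consistency score: DSM detections are matched to teacher detections by IoU, and the fraction of unmatched boxes (FP + FN) measures disagreement. The greedy matching scheme and the 0.5 IoU threshold are assumptions; the paper only states that the score is built from TP, FP, and FN counts.

```python
# Consistency score of Eq. (1), sketched with greedy IoU matching.
# Teacher detections are treated as ground truth (all details assumed).

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def consistency_score(dsm_boxes, teacher_boxes, iou_thresh=0.5):
    """Higher score = more disagreement = more useful training image."""
    unmatched_teacher = list(teacher_boxes)
    tp = fp = 0
    for d in dsm_boxes:
        hit = next((t for t in unmatched_teacher
                    if iou(d, t) >= iou_thresh), None)
        if hit is not None:
            unmatched_teacher.remove(hit)  # true positive
            tp += 1
        else:
            fp += 1                        # DSM box with no teacher match
    fn = len(unmatched_teacher)            # teacher boxes the DSM missed
    total = tp + fp + fn
    return (fp + fn) / total if total else 0.0
```

An image on which the DSM reproduces the teacher exactly scores 0; an image where they share no boxes scores 1 and is a prime candidate for the DDS.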

Teacher: Res101 [48M / 68ms]

Model                 COCO   DSM    Improvement
Res18 [12M / 26ms]    54.5   74.3   +19.7
Squeeze [6M / 21ms]   41.5   63.3   +21.7

Table 1: Domain specific training results, reporting the mean accuracy over 5 datasets. Res101 results are used as ground truth, so accuracy is relative mAP (rmAP). The parameter count and per-image GPU inference time are reported alongside each model name.
                                   Number of training samples
Dataset (classes)     Strategy   64     128    256    512    All (3600)
coral (1)             Simple     81     89.4   89.6   90     90
                      DDS        89.6   89.6   89.6   89.8
taipei (4)            Simple     50.4   62     62.1   62.8   68.2
                      DDS        60.7   61.7   62.2   64.2
jackson (2)           Simple     52.5   60.1   60.9   72.8   87.0
                      DDS        71.6   76.7   78.3   80.6
kentucky (2)          Simple     35     38.7   44.2   54.8   67.2
                      DDS        53.1   63.8   66.4   69.5
castro (3)            Simple     60.4   63     65.2   66     77.6
                      DDS        67.2   68.35  75.0   77
mean accuracy drop    Simple     23.6   20.8   13.8   11.3   0
                      DDS        9.5    5.9    3.6    1.7
Training time [min]   -          1.8    3.6    7.4    14.6   110

Table 2: Number of training samples versus accuracy with res18. For "simple", we pick the first N training images and filter out the rest. For the difficult dataset (DDS), the N training images with the highest L are chosen. The mean accuracy drop over the 5 datasets is computed with respect to the model trained with all 3600 images. The training time does not include teacher-model labeling, which takes about 10 minutes. We use a single Titan Xp GPU to measure training time.

3 Experiments

Models.

We develop Faster-RCNN object detection models in PyTorch, pretrained on MS-COCO [paszke2017pytorch, ren2015faster, lin2014microsoft]. We use 3 models as the backbone of the region proposal network (RPN): resnet101 (res101), resnet18 (res18), and squeezenet (squeeze) [he2016deep, iandola2016squeezenet]. Res18 and squeeze have 10% and 19% top-5 ImageNet error respectively, a common accuracy range for compact CNN models like MobileNet [howard2017mobilenets]. During training, res101 is used as the teacher, while res18 and squeeze are used as DSMs. While we chose Faster-RCNN for its accuracy on YoutubeLive, YOLO/SSD detectors can also be used with this framework for improved efficiency, since training requires only the bounding-box labels [redmon2016you, liu2016ssd]. We release the code and the dataset at https://github.com/kentaroy47/training-domain-specific-models

Dataset. We obtain 5 fixed-angle videos from YouTubeLive. Each video is 2 hours long, sampled at 1 frame per second (fps), yielding 7200 images per video. We split the images evenly: the first 3600 images are used for training and the last 3600 for testing.
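The 50/50 temporal split described above can be sketched as follows (a minimal illustration; the frame list stands in for decoded video frames):

```python
# Temporal train/test split: first half of each 7200-frame video for
# training, second half for testing, as described in the Dataset section.

def split_video(frames):
    half = len(frames) // 2
    return frames[:half], frames[half:]

train, test = split_video(list(range(7200)))
# train holds frames 0..3599, test holds frames 3600..7199
```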

Results. As shown in Table 1, we first train our res18 DSM using the full 3600 training images for 10 epochs with stochastic gradient descent. Compared to the res18 model pretrained on MS-COCO but without domain specific training, we achieve an average accuracy improvement of 19.7%.

Table 2 shows the effectiveness of DDS on res18. Using DDS, we were able to reduce the training time by 93% (256 images) with only a 3.6% accuracy penalty. If we instead pick the first 256 training images sequentially (strategy "simple" in the table), the accuracy worsens by 10.2% compared to DDS.
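The two selection strategies of Table 2 can be sketched as below. The interfaces are assumptions: `scores` holds one precomputed consistency score per frame.

```python
# "simple": keep the first N frames; "DDS": keep the N frames on which
# the DSM and teacher disagree most (highest consistency score L).

def select_simple(frames, n):
    return frames[:n]

def select_dds(frames, scores, n):
    # rank frame indices by descending disagreement, keep the top n
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return [frames[i] for i in sorted(ranked[:n])]  # restore temporal order
```

For a stationary camera, "simple" tends to select near-duplicate consecutive frames, while DDS spreads the budget over the frames the pretrained student actually gets wrong, which is consistent with the gap reported in Table 2.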

3.1 Comparison against Data Distillation

Data Distillation [girshickdata] is fundamentally different from our application setting and methods. In [girshickdata], models with large network capacity were shown to achieve higher accuracy by bootstrapping the dataset with additional unlabeled data. In contrast, in order to fully utilize the small network capacity, our framework trains the models with only the domain specific data.

We show in Table 3 that following the data distillation approach (i.e. aggregating PASCAL with the entire YoutubeLive data) yields a lower rmAP improvement than our approach of training with only domain specific data. Training the small models with the entire YoutubeLive dataset also yields lower improvements. This is fundamentally due to the limited capacity of the compact but computationally-efficient model. We observe that for training small models, a larger dataset does not always give better results, whereas restricting the data domain can.

Mean accuracy improvement
Model     PASCAL+YoutubeLive   YoutubeLive   Domain Specific
Res18     +9.7                 +12.9         +19.7
Squeeze   +10.3                +13.5         +21.7

Table 3: Accuracy improvement observed for multiple dataset settings. For PASCAL+YoutubeLive, we train the models with PASCAL-VOC2007 and YoutubeLive data. The accuracy improvement is the rmAP improvement over the COCO-trained model.

References