Dataset Culling: Towards Efficient Training of Distillation-Based Domain Specific Models

02/01/2019
by Kentaro Yoshioka, et al., Stanford University

Real-time CNN-based object detection models for applications like surveillance can achieve high accuracy but require extensive computation. Recent work has shown 10 to 100x reductions in computation cost with domain-specific network settings. However, this prior work focused on inference only: if the domain network requires frequent retraining, training and retraining costs can become a significant bottleneck. To address training costs, we propose Dataset Culling: a pipeline to significantly reduce the required training dataset size for domain-specific models. Dataset Culling reduces the dataset size by filtering out non-essential data for training, and by reducing the size of each image until detection degrades. Both of these operations use a confusion loss metric, which enables us to execute the culling with minimal computation overhead. On a custom long-duration dataset, we show that Dataset Culling can reduce training costs 47x with no accuracy loss, or even with slight improvements. Code is available at https://github.com/kentaroy47/DatasetCulling


1 Introduction

Convolutional neural network (CNN) object detectors have recently achieved significant improvements in accuracy [1][2] but have also become more computationally expensive. Since CNNs generally obtain better classification performance with larger networks, there is a tradeoff between accuracy and computation cost (or efficiency). One way around this tradeoff is to leverage application and domain knowledge. For example, models for stationary surveillance and traffic cameras must detect pedestrians and cars, but not different species of dogs. By leveraging such specialization, smaller models can be used.

Recent approaches utilize domain specialization to train compact domain specific models (DSMs) with distillation [3]. Compact student models can achieve high accuracy when trained with sufficient domain data, and such student models can be 10-100x smaller than the teacher. [4] utilized this idea in a model cascade, [5] pushed the idea to the extreme by training frequently with extremely small student models, and [6] used unlabeled data to augment the student dataset.

Figure 1: Dataset Culling aims to reduce the size of the unlabelled training data (number of images and image resolution) to reduce the computation costs for both the teacher and student.

The computation cost in conventional teacher-student frameworks has three components: 1. inference cost for the student, 2. inference cost for the teacher (for labeling), and 3. training cost for the student. Importantly, small student models may require frequent retraining to cancel out drift in the data statistics of the environment. For example, in a traffic surveillance setting, the appearance of pedestrians and cyclists may change seasonally. Hence, frequent retraining may be necessary when a small model is used, because its capacity to learn features is limited. Therefore, with a small model, one can achieve computationally efficient inference but at high (re)training overheads. For our surveillance application, a day's worth of surveillance data (86,400 images at 1 fps) requires 100 GPU hours (Nvidia K80 on AWS P2) to train.

Figure 2: Our Dataset Culling pipeline. First, by culling the data with the confidence loss, the dataset size is reduced 50x (in surveillance). The dataset is further reduced 6x by culling with precision, using teacher predictions. Finally, optResolution is applied to reduce computation by another 1.2-6x.

Prior work has discussed ways to reduce the computation costs of the student model during inference. However, there has been little focus on the costs associated with (re)training, or on teacher costs. Our contributions are:

  • We propose Dataset Culling, which significantly reduces the computation cost of training. We show training speedups of over 47x with no accuracy penalty. To the best of our knowledge, this is the first successful demonstration of improving the training efficiency of DSMs.

  • Dataset Culling uses the student predictions to keep only difficult-to-predict data for training. We show that the dataset size can be culled by a factor of 300 to minimize training costs.

  • We develop optResolution as part of the Dataset Culling pipeline. This optimizes the image resolution of the dataset to further improve inference and training costs.

Figure 3: Object detection results of Dataset Culling for 3 scenes from the surveillance dataset (top to bottom: Coral, Jackson, Kentucky). Accuracy and the computation cost per image (GFLOPs) are shown. The student model is trained with a culled dataset of size 128, and optResolution is set automatically. While optResolution introduces an accuracy penalty (1% mAP on average), it dramatically improves the computation cost. For example, the computation cost for inference is improved by up to 18x for Coral.

2 Efficient training of DSMs

The role of DSMs is to achieve high object detection performance with a small model size. However, training DSMs can itself be computationally problematic. To reduce training time, we propose Dataset Culling. This procedure removes (filters out) training samples that are believed to be easy to predict, and minimizes the image resolution of the dataset while maintaining accuracy (Fig. 1). By reducing both the dataset size and the image resolution, we reduce 1) the number of expensive teacher inference passes for labelling, 2) the number of training steps for the student, and 3) the computation per image for the student. Previous work [4] required the student and teacher to be run on all data samples for training, incurring significant computation costs.

2.1 Dataset Culling

The Dataset Culling pipeline is illustrated in Fig. 2. We first assess the difficulty of a stream of data by running inference through the student. During training, model parameters are only updated when the label and the prediction differ; in other words, "easy" data, for which the student already provides good predictions, does not contribute to training. The designed confidence loss (shown below) assesses the difficulty of a data sample from the student's output probabilities. For example, if the model's output probability (i.e., confidence) for an object class is high, we assume the sample is reasonably easy to infer; if the confidence is very low, the region is likely background. Intermediate values, however, mean the data is hard to infer.

To evaluate how difficult an image is to predict, we develop a confidence loss metric. This loss uses the model's output confidence levels to determine whether a data sample is 1) difficult to predict, and kept, or 2) easy, and culled away.

The input x is the prediction confidence, a is a constant that sets the intercept to zero, and p sets the weighting of low-confidence predictions. In our experiments, a and p are set to weight low-confidence detections substantially more than confident results. The absolute form of the loss function is not essential: we observed similar results with other functions that emphasize unconfident predictions. When the model provides multiple detection results, the confidence loss is computed for each prediction, and the per-prediction losses are summed to obtain the overall loss of the data sample. The loss is therefore also a function of the total number of objects in the image, which ensures that images containing no objects (e.g., images at midnight) are not misinterpreted as difficult data. This first stage of culling yields a 10 to 50x reduction in the size of the training data.
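To make the first stage concrete, the following Python sketch ranks images by a summed per-detection confidence loss and keeps only the hardest ones. This is a minimal sketch under stated assumptions: the functional form a - x^p, the default constants, and the helper names (confidence_loss, cull_by_confidence, student_detect) are illustrative choices rather than the repository's API; as noted above, the exact form of the loss is not essential.

```python
import numpy as np

def confidence_loss(confidences, a=1.0, p=2.0):
    """Per-image difficulty from the student's detection confidences.

    Illustrative form: each detection with confidence x contributes
    a - x**p, so a fully confident detection (x = 1, with a = 1)
    contributes zero and uncertain detections dominate. Summing over
    detections also makes the loss grow with the number of objects,
    so empty frames (e.g., images at midnight) score as easy.
    """
    x = np.asarray(confidences, dtype=np.float64)
    if x.size == 0:
        return 0.0  # no detections -> nothing to learn from
    return float(np.sum(a - x ** p))

def cull_by_confidence(images, student_detect, keep):
    """Stage 1: keep the `keep` hardest images, judged by the student alone.

    `student_detect(img)` is assumed to return the post-threshold
    detection confidences of the student model for one image.
    """
    losses = [confidence_loss(student_detect(img)) for img in images]
    order = np.argsort(losses)[::-1]  # hardest (highest loss) first
    return [images[i] for i in order[:keep]]
```

Because only the cheap student runs in this stage, the entire unlabeled stream can be scored with little overhead before the expensive teacher is ever invoked.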

In the second stage of culling, we feed the remaining samples into the computationally expensive teacher model. We compare the teacher's and student's answers, and use the difference to directly determine difficulty. Here, we compute the average precision by treating the teacher predictions as ground truths. Using this second stage of culling, we further reduce the number of data samples, by 6x in our surveillance setting (Fig. 2). Furthermore, in some cases we can even improve the student's mAP, since we eliminate data that adds little to no feedback for enhancing the student.
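A minimal sketch of this second stage follows. For brevity it scores each image with a simplified per-image precision (greedy IoU matching of student boxes against teacher boxes treated as ground truth) instead of full average precision; the function names and the 0.5 IoU threshold are assumptions for illustration.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def per_image_precision(student_boxes, teacher_boxes, thr=0.5):
    """Fraction of student boxes that match a teacher box at IoU >= thr,
    with teacher predictions treated as ground truth (greedy matching)."""
    if not student_boxes:
        return 1.0 if not teacher_boxes else 0.0
    unmatched = list(teacher_boxes)
    hits = 0
    for sb in student_boxes:
        best = max(unmatched, key=lambda tb: iou(sb, tb), default=None)
        if best is not None and iou(sb, best) >= thr:
            hits += 1
            unmatched.remove(best)
    return hits / len(student_boxes)

def cull_by_precision(images, student_detect, teacher_detect, keep):
    """Stage 2: run the expensive teacher only on stage-1 survivors and
    keep the `keep` images where the student agrees least with the teacher."""
    scores = [per_image_precision(student_detect(im), teacher_detect(im))
              for im in images]
    order = np.argsort(scores)  # least agreement first
    return [images[i] for i in order[:keep]]
```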

2.2 Optimizing the image resolution (optResolution)

Our second technique in Dataset Culling is optResolution, which sets the image resolution as a function of the prediction difficulty. By decreasing the CNN input image size, we reduce the number of multiply-and-add operations and the memory footprint; halving the image resolution, for example, yields roughly a 4x improvement in computational efficiency. OptResolution takes advantage of the fact that object detection difficulty depends on the scene and the application itself. For example, objects of interest in indoor and sports videos are usually large and relatively easy to classify at low resolution, whereas traffic-monitoring cameras demand high resolution in order to capture both relatively small pedestrians and large vehicles. Traditionally, human supervision or expensive teacher inference was required to tune the resolution [7].

Dataset Culling integrates optResolution with low computational overhead. We first feed an image at its original pixel size s_0 into the student model and compute the confidence loss. We then downsample the image to size s_1, infer again, and recompute the confidence loss. This downsampling is performed recursively, k times, until the change in confidence loss exceeds a predefined threshold, since a large change indicates that objects are becoming harder to infer. The image resolution of the data that is finally kept is s_k. In our implementation, we compute the mean-squared error (MSE) of the confidence loss against the full-resolution inference results as

MSE = (1/N) * sum_i ( L_conf,i(s_k) - L_conf,i(s_0) )^2

where L_conf,i(s_k) is the confidence loss of the downsampled inference for image i, and N is the size of the culled dataset. One limitation of optResolution is the strong assumption that the overall object size is constant between training and runtime. For example, pedestrians viewed by a surveillance camera are assumed to remain roughly the same size from train time to test time, unless the camera moves to a different position or orientation. In such cases, however, the model would require retraining either way.
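The sketch below shows one way to implement this search, reusing the confidence_loss helper from the earlier sketch. The scale ladder, the MSE threshold, and student_detect_at (assumed to run the student on an image resized by a given scale and return its detection confidences) are illustrative assumptions.

```python
import numpy as np

def opt_resolution(images, student_detect_at,
                   scales=(1.0, 0.8, 0.66, 0.5, 0.4), mse_threshold=0.1):
    """Pick the smallest image scale whose confidence loss stays close
    to the full-resolution result, measured over the culled dataset."""
    full = np.array([confidence_loss(student_detect_at(im, 1.0))
                     for im in images])
    chosen = 1.0
    for s in scales[1:]:  # progressively downsample
        down = np.array([confidence_loss(student_detect_at(im, s))
                         for im in images])
        mse = float(np.mean((down - full) ** 2))
        if mse > mse_threshold:  # objects are becoming hard to infer
            break
        chosen = s
    return chosen
```

In the experiments below, this style of search settles on a scale of 0.5 for the easy indoor Coral scene and 0.8 for the small-object Jackson traffic scene (Fig. 4).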

Dataset        Metric                       Target dataset size
                                            64             128            256            Full    No train
Surveillance   Accuracy [mAP]               85.56 (-3.0%)  88.3 (-0.3%)   89.3 (+0.8%)   88.5    58.6
(86,400 imgs)  Total train time [GPU h]     1.9 (54x)      2.0 (50x)      2.2 (47x)      104     -
                 Student training           0.07           0.14           0.28           96      -
                 Student prediction         1.54           1.54           1.54           0       -
                 Teacher prediction         0.33           0.33           0.33           8       -
Sports         Accuracy [mAP]               93.7 (-0.1%)   93.8 (0%)      93.8 (0%)      93.8    80.7
(3,600 imgs)   Total train time [GPU h]     0.16 (16x)     0.23 (11x)     0.40 (6x)      2.5     -
                 Student training           0.07           0.14           0.28           2       -
                 Student prediction         0.06           0.06           0.06           0       -
                 Teacher prediction         0.03           0.03           0.06           0.5     -
Table 1: Impact of culling the dataset on accuracy and training time. Here, the dataset size is reduced using the first two stages of Fig. 2 (culling by confidence and by precision) without optResolution. Times are in GPU hours; parentheses give the accuracy change and the training speedup relative to full-dataset training.

3 Experiments

Filtering strategy         mAP     GPU hours
Intermittent sampling      0.731   0.15
Confidence only            0.911   1.7
Precision only             0.954   8.0
Confidence + Precision     0.948   2.0
Full dataset               0.958   104
Table 2: Ablation study comparing Dataset Culling strategies on the Jackson dataset. Conducting both confidence-loss and precision filtering strikes a good balance between accuracy and computation. Confidence-only culling discards a portion of the samples that precision-only culling keeps, and Confidence + Precision discards a smaller fraction of them, which explains the small mAP gap. All strategies target a dataset size of 128.

Models. For the experiments, we build Faster-RCNN object detection models pretrained on MS-COCO [1][8]. We use ResNet-101 (res101) for the teacher and ResNet-18 (res18) for the student as the region proposal network (RPN) backbone [9]. We expect similar outcomes when MobileNet [10] is used for the RPN, since both models achieve similar ImageNet accuracy. We chose Faster-RCNN for its accuracy and adaptive image resolution, but Dataset Culling can also be applied to fixed-resolution object detection frameworks such as SSD and YOLO, which have similar training frameworks [11][12].
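As a rough illustration of such a teacher-student pair, the snippet below builds Faster-RCNN detectors with ResNet-101 and ResNet-18 FPN backbones using torchvision. This is our own sketch, not the authors' released code; the resnet_fpn_backbone helper exists in torchvision, but its exact signature varies across versions.

```python
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

def build_detector(backbone_name, num_classes=91):
    # Wrap a torchvision ResNet with a feature pyramid network (FPN)
    # and attach the standard Faster-RCNN RPN and detection heads.
    # 91 classes corresponds to torchvision's COCO label space.
    backbone = resnet_fpn_backbone(backbone_name, pretrained=True)
    return FasterRCNN(backbone, num_classes=num_classes)

teacher = build_detector("resnet101")  # expensive, used for labeling
student = build_detector("resnet18")   # cheap, deployed and retrained
```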

Custom Long-Duration Dataset. We collect 8 long-duration, fixed-angle videos from YouTube Live to evaluate Dataset Culling. As manually labelling all frames of video is cumbersome, we label the dataset by treating the teacher predictions as ground-truth labels, as in [4]. In this paper, we report accuracy as mean average precision (mAP) at 50% IoU. We refer to 5 of the videos as "surveillance"; each consists of a 30-hour fixed-angle stream with 1 to 4 detection classes. The first 24 hours (86,400 images) are used for training and the subsequent 6 hours (21,600 images) for validation. We refer to 3 of the videos as "sports"; these consist of 2 hours (7,200 images) of fixed-angle video, with "person" as the only class. The images are split evenly: the first 3,600 images for training and the last 3,600 for testing.

Results. Object detection results for 3 scenes are shown in Fig. 3, and mAP and computation costs are reported in Table 1. For the surveillance results, domain-specific training improved the student's accuracy by 31% compared to the COCO-pretrained student. Compared with full-dataset training, Dataset Culling improves the training time 47x by reducing the dataset size to 256 images. A slight increase in accuracy is observed because Dataset Culling has an effect similar to hard example mining [13], where training on limited but difficult data benefits model accuracy. Since we include the time to run inference on the entire training set in the training time, culling the dataset further does not dramatically improve the training time. However, for smaller datasets (sports), increasing the culling ratio does contribute to training efficiency, because model training is a large fraction of the total training time.

Figure 4: The image resolution is scaled manually to observe the change in mAP (blue solid line) and the computed MSE of the confidence loss (black dashed line) in two domains. The red star indicates the image resolution proposed by optResolution, achieving both high accuracy and low computation cost.

Ablation study. We perform an ablation study, shown in Table 2. We construct a dataset of 128 difficult-to-predict images with each of four filtering techniques and compare the mAPs of the resulting student models. Filtering with only the precision metric (no confidence-loss culling or image scaling) achieves the highest accuracy, but its training time is 4x higher than our final approach (confidence + precision). With both culling procedures, we realize a good balance of accuracy and computation.

optResolution. Fig. 4 shows the results of optResolution, which provides a well-tuned image resolution that satisfies both accuracy and computation-cost goals. For indoor surveillance (Coral), the objects of interest are large and easy to detect, so optResolution selects a resolution scale of 0.5. For traffic surveillance in Jackson, the objects are small and difficult to detect, so our procedure selects a scale of 0.8.

4 Conclusions

Domain-specific models dramatically decrease the cost of inference, but if models need frequent retraining, training costs can become a significant bottleneck. We show that Dataset Culling reduces training time and overall computation costs by 47x, with little to no penalty in object detection performance on our long-duration, fixed-angle datasets.

References