Learning Efficient Detector with Semi-supervised Adaptive Distillation

01/02/2019 · Shitao Tang et al.

Knowledge Distillation (KD) has been used in image classification for model compression. However, few studies apply this technique to single-stage object detectors. Focal loss shows that the accumulated errors of easily-classified samples dominate the overall loss in the training process. This problem is also encountered when applying KD to the detection task. For KD, the teacher-defined hard samples are far more important than any others. We propose the adaptive distillation loss (ADL) to address this issue by adaptively mimicking the teacher's logits, with more attention paid to two types of hard samples: hard-to-learn samples predicted by the teacher with low certainty and hard-to-mimic samples with a large gap between the teacher's and the student's predictions. ADL enlarges the distillation loss for hard-to-learn and hard-to-mimic samples and reduces it for the dominant easy samples, enabling distillation to work on the single-stage detector for the first time, even when the student and the teacher are identical. Moreover, ADL is effective in both the supervised and the semi-supervised setting, even when the labeled data and unlabeled data come from different distributions. For distillation on unlabeled data, ADL achieves better performance than existing data distillation, which simply utilizes hard targets, making the student detector surpass its teacher. On the COCO dataset, semi-supervised adaptive distillation (SAD) makes a student detector with a ResNet-50 backbone surpass its teacher with a ResNet-101 backbone, while the student has half of the teacher's computational complexity. The code is available at https://github.com/Tangshitao/Semi-supervised-Adaptive-Distillation


1 Introduction

Boosted by the development of deep convolutional neural networks (CNNs), the accuracy of object detection has been improved greatly [15, 18, 12]. Along with the requirement of high performance for a detector, low latency is demanded in a wide range of practical applications, e.g., mobile apps and autonomous cars. There is much previous work on speeding up CNNs, including detection pipeline optimization [15, 18, 12], architecture design [9, 25], pruning [7], quantization [26], decomposition [10, 22] and knowledge distillation [8]. In this paper, we propose a semi-supervised adaptive distillation (SAD) scheme to accelerate networks for object detection.

Knowledge distillation encourages the student network to converge to a better solution by mimicking the teacher network's feature maps or softened logits. It has achieved great success in image classification [8, 19, 24]. However, when applying it to object detection, due to the small capacity of the student network, it is hard to mimic all feature maps or logits well. Knowledge transferring has been applied in two-stage detectors. Chen et al. [1] proposed a weighted cross-entropy loss to underweight matching errors in background regions. Li et al. [11] mimicked feature maps between the student and the teacher pooled from the same region proposal and discarded those from uninterested regions. Wei et al. [23] introduced quantization mimic to reduce the search scope of the student network. All the above works attempt to design sophisticated rules to focus on mimicking informative neurons of the teacher network. In these works, both the teacher and student detectors are two-stage. The application of KD to single-stage detectors has not been explored yet.

Compared with the two-stage detector, the single-stage detector needs to process many more samples due to its dense anchor setting. Without a region proposal network (RPN), the imbalance between easy and hard samples is a particular challenge for the single-stage detector. Most of the samples are easy ones during KD for the single-stage detector, and these easy samples dominate the KD loss. The lack of guidance from hard samples makes KD inefficient for the single-stage detector. Two types of samples are important in the distillation process: (1) hard-to-mimic samples, for which the gap between the student's prediction and the teacher's prediction is large; (2) hard-to-learn samples, whose uncertainties defined by the teacher's prediction are large. Both hard-to-mimic and hard-to-learn samples are defined with respect to the teacher model and should receive more attention for effective distillation in the single-stage detector. The hard samples defined in focal loss or online hard example mining (OHEM) are selected by comparing predictions with the ground truth [13, 21], and are thus determined only by the detector itself. Because the difficulties in supervised training and in distillation come from different sources, the balancing between hard and easy samples should be performed separately for the supervision loss and the distillation loss. With this motivation, we propose an adaptive distillation loss (ADL), which pays more attention to teacher-defined hard samples and adaptively adjusts the distillation weights between easy-to-mimic/easy-to-learn and hard-to-mimic/hard-to-learn samples in the distillation process. Besides, ADL is also effective in the self distillation setting [5], just as knowledge distillation is.

Annotating object detection bounding boxes is extremely time-consuming, which hinders the wide application of object detection. Previous work [20, 16] has demonstrated that unlabeled data can potentially help image classification and object detection. However, in the knowledge distillation scenario, it remains an open question how to extract the knowledge of unlabeled data to guide the training of the student network. The proposed adaptive distillation loss also works well in a semi-supervised setting. Provided with potentially unlimited unlabeled data from internet-scale sources, the teacher can present more knowledge to the student via the augmented transferring set. Data distillation [16] expresses the knowledge of unlabeled data as the annotations produced by the teacher. However, representing knowledge as hard targets of unlabeled data may not be optimal: most of them can be predicted by the teacher with very high confidence, so they can also be easily classified by the student. By contrast, the soft targets provided by ADL contain a balanced mix of easy and hard samples. Thus, we propose to utilize soft targets in semi-supervised distillation for the single-stage detector.

In real-world applications, unlabeled data are far more abundant than labeled data and their distributions also differ, i.e., most unlabeled images do not contain any targeted object. Thus the efficiency of semi-supervised KD is affected by the large number of background images. Given these considerations, we raise a practical problem: how to select the unlabeled data that transfer knowledge most efficiently. In this paper, we show that a simple filtering mechanism effectively addresses this problem.

We select the state-of-the-art single-stage detector RetinaNet [13] to validate the effectiveness of the proposed ADL. Experiments on the standard detection dataset COCO verify that the proposed ADL consistently improves the student network's performance and exploits the knowledge of unlabeled data to help the student network converge to a better solution. Surprisingly, our student detector with a ResNet-50 backbone even surpasses its teacher detector with a ResNet-101 backbone, although the student has only half of the computational complexity of its teacher. The student detector with ResNet-50 achieves an mAP of 36.7 on COCO test-dev, while the teacher detector with ResNet-101 achieves 36.0 when trained only with labeled data.

In this paper, we make the following contributions:

  • We design an adaptive knowledge distillation loss, which is able to pay more attention to teacher-defined hard samples and adaptively adjust the distillation weights between easy-to-mimic/easy-to-learn samples and hard-to-mimic/hard-to-learn samples for the single-stage object detector.

  • We develop the proposed adaptive knowledge distillation in a semi-supervised learning setting. The student even surpasses its teacher through semi-supervised KD.

  • In order to improve the efficiency of KD in the semi-supervised setting, a data filtering mechanism is proposed to select the transferring set from unlabeled data when the unlabeled data and labeled data have different distributions.

2 Related Work

Deep network compression and acceleration Many works have been proposed to accelerate convolutional neural networks due to the demand from practical applications. Knowledge transferring is one approach, which transfers knowledge from the teacher model to the student model. Previous work explores this area by representing knowledge in different forms. FitNet [19] makes the student mimic the full feature maps of the teacher. KD [8] proposes to supervise the student by soft targets predicted by the teacher; the probability distribution from the teacher model provides extra information beyond the one-hot target encoding. Our work is closely related to knowledge distillation.

Semi-supervised learning and self training Semi-supervised learning has been studied for years. The goal is to train a model with both labeled and unlabeled data. In [20], experiments show that an object detector can gain extra improvement from semi-supervised learning. Another work is data distillation [16]. It first trains a model with labeled data and then uses the model to make predictions on unlabeled data through multi-transform inference and data transformations. Those operations improve performance and generate extra knowledge. Different from data distillation, our work focuses on knowledge transferring from a strong teacher to a weak student.

Object detection Object detectors include single-stage and two-stage approaches. The two-stage approach consists of two parts: the first generates a sparse set of candidate object proposals, which is then fed to a classification and localization subnet for further classification and location regression. A single-stage detector directly forwards raw pixels through a convolutional neural network to obtain the final classification and location results. One of the major problems in both types of detectors is class imbalance. To address it, Shrivastava et al. [21] introduce online hard example mining (OHEM), which selects the top k samples sorted by loss in each mini-batch. In contrast to the two-stage detector, where the region proposal network significantly reduces the candidate locations, the single-stage detector suffers from a more severe class imbalance problem. Different from OHEM, focal loss [13] pays more attention to hard examples than easy examples by multiplying the common cross-entropy loss by a focal term. Our distillation loss design follows the spirit of focal loss.

Model compression in object detection Recently, model compression has been studied to facilitate the application of CNN-based object detectors on devices with limited computation resources. Chen et al. [1] utilize soft targets to guide the student model in both the region proposal network and the region convolutional neural network, and balance positive and negative examples by re-weighting their losses. Instead of addressing the class imbalance problem directly, Li et al. [11] propose to match the feature maps after the RoI-pooling layer, where the candidate regions have been significantly reduced. These methods are designed for two-stage detectors and cannot be applied to single-stage detectors directly. In contrast, our carefully designed loss is another way to address the class imbalance problem.

3 Semi-supervised Adaptive Distillation

This section introduces semi-supervised adaptive distillation (SAD), as shown in Figure 1.

3.1 Adaptive Distillation

In this section, we discuss the design of the distillation loss for the single-stage detector. Compared with the two-stage detector, the distinguishing feature of the single-stage detector is its dense sampling of possible object locations. In a single-stage detector, dense anchors are set on multiple feature maps of the backbone network, so distillation needs to be performed over a large number of output logits between the teacher and the student. In RetinaNet, a typical number of anchors is 100k, and most of them correspond to easy-to-mimic or easy-to-learn samples. Though each easy sample contributes little to the distillation loss, the sum of losses from these easy samples dominates the distillation loss during training. Thus the hard-to-mimic/hard-to-learn samples worth mining are not learned well, which has so far restricted the capacity of KD on single-stage detectors.

Without loss of generality, we study the case of cross entropy for binary classification. The original focal loss is defined as

FL(p_t) = -(1 - p_t)^γ · log(p_t),   (1)

p_t = p if y = 1, and p_t = 1 - p otherwise,   (2)

where y ∈ {±1} specifies the ground-truth class and p ∈ [0, 1] is the model's estimated probability for the class with label y = 1.
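As a concrete illustration, here is a minimal PyTorch-style sketch of the binary focal loss of Equations (1)-(2). It is not the authors' released code; the tensor layout, the 0/1 label convention, and the default γ = 2 are assumptions.

```python
import torch

def binary_focal_loss(p: torch.Tensor, y: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Element-wise focal loss for binary classification, Eq. (1)-(2).

    p: predicted probabilities in [0, 1], same shape as y.
    y: ground-truth labels in {0, 1} (1 denotes the positive class).
    """
    eps = 1e-12                                   # numerical stability for log
    p_t = torch.where(y == 1, p, 1.0 - p)         # p_t from Eq. (2)
    return -((1.0 - p_t) ** gamma) * torch.log(p_t.clamp(min=eps))  # Eq. (1)
```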

In the following, we denote by q the soft probability predicted by the teacher and by p the one predicted by the student. Knowledge distillation is inspired by the Kullback–Leibler (KL) divergence, which measures the similarity between two distributions and, for the binary case, is defined as

KL(q || p) = q · log(q / p) + (1 - q) · log((1 - q) / (1 - p)).   (3)

The student model tries to mimic the soft class probability distribution predicted by the teacher model. We abbreviate KL(q || p) as KL in the rest of the paper.
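A matching sketch of the binary KL divergence of Equation (3), under the same assumptions as the sketch above (teacher probability q, student probability p):

```python
import torch

def binary_kl(q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Element-wise KL(q || p) for Bernoulli distributions, Eq. (3)."""
    eps = 1e-12
    q = q.clamp(eps, 1.0 - eps)   # teacher soft targets
    p = p.clamp(eps, 1.0 - eps)   # student predictions
    return q * torch.log(q / p) + (1.0 - q) * torch.log((1.0 - q) / (1.0 - p))
```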

3.1.1 Focal Distillation Loss

The common way of adapting focal loss to knowledge distillation is to multiply KL by a focal term. If the focal term used by the classification loss (hard targets) is shared with the KL loss (soft targets), the joint loss of the classification loss and KL can be defined as

L_joint = FT · (CE(p, y) + KL(q || p)),   (4)

where FT is the focal term and CE is the cross-entropy loss. As shown in Equation (1), the focal term is

FT = (1 - p_t)^γ.   (5)

Thus, the focal distillation loss is

L_FDL = (1 - p_t)^γ · KL(q || p).   (6)

L_FDL is a simple modification combining the focal loss and KL. We consider it as a baseline in our experiments.
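Combining the two sketches above gives an illustrative version of the focal distillation baseline of Equation (6); the composition below is our reading of the equations, not a reference implementation.

```python
import torch

def focal_distillation_loss(p, q, y, gamma: float = 2.0):
    """L_FDL = (1 - p_t)^gamma * KL(q || p), Eq. (6); reuses binary_kl from above."""
    p_t = torch.where(y == 1, p, 1.0 - p)   # ground-truth-defined p_t, Eq. (2)
    focal_term = (1.0 - p_t) ** gamma        # FT from Eq. (5)
    return focal_term * binary_kl(q, p)      # weight the distillation term by FT
```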

3.1.2 Adaptive Distillation Loss

However, as our experiments will show, L_FDL is dominated by the focal term, so KL contributes little to the overall loss. To address this problem, we propose the following loss. We argue that KD on a single-stage detector should focus on measuring the distance between the probability distributions of the student and the teacher. Based on this motivation, a modulating factor between 0 and 1 should be used to weight each sample adaptively. Inspired by the KL divergence, we come up with the following term:

ADW = (1 - exp(-(KL + β · T(q))))^γ,   (7)

where KL is defined in Equation (3) and ADW stands for adaptive distillation weight. Just as in the focal loss, the hyperparameter γ controls the rate at which easy examples are down-weighted. The KL term controls the weight of each sample; however, it only adjusts the weights according to the gap between the student and the teacher during training. Given that hard-to-learn samples are extremely important for distillation, we introduce β to adjust the percentage of the overall weights occupied by hard-to-learn samples (PHLS), defined as:

PHLS = Σ_{i ∈ hard-to-learn} ADW_i / Σ_{i ∈ all samples} ADW_i,   (8)

T(q) = -(q · log q + (1 - q) · log(1 - q)).   (9)

T(q), the entropy of the teacher's prediction, reaches its maximum when q is 0.5 and its minimum when q approaches 0 or 1. The teacher probability q reflects the uncertainty of classifying a sample: when q approaches 0.5, the corresponding sample is treated as a hard-to-learn sample, and a sample with a high KL is treated as a hard-to-mimic sample. Intuitively, PHLS increases when β becomes larger. Thus, KL controls the weights of hard-to-mimic samples, which are adjusted during the training process, while β · T(q) controls the weights of hard-to-learn samples initially defined by the teacher. Their combination adaptively adjusts the distillation weights. Finally, the adaptive distillation loss is

L_ADL = ADW · KL = (1 - exp(-(KL + β · T(q))))^γ · KL.   (10)

In addition, we will show in the experiments that L_ADL is also effective in the self distillation setting, where the teacher and the student are identical.
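The adaptive distillation weight and loss of Equations (7)-(10) can be sketched as follows, reusing `binary_kl` from above. The default γ (the focal-loss default) and the exact tensor handling are assumptions; β = 1.5 follows the setting reported later in the experiments.

```python
import torch

def adaptive_distillation_loss(p: torch.Tensor, q: torch.Tensor,
                               beta: float = 1.5, gamma: float = 2.0) -> torch.Tensor:
    """L_ADL = ADW * KL, with ADW = (1 - exp(-(KL + beta * T(q))))^gamma.

    p: student probabilities; q: teacher soft targets (treated as constants).
    """
    eps = 1e-12
    q = q.clamp(eps, 1.0 - eps)
    kl = binary_kl(q, p)                                             # hard-to-mimic signal
    entropy = -(q * torch.log(q) + (1 - q) * torch.log(1 - q))       # teacher entropy T(q), Eq. (9)
    adw = (1.0 - torch.exp(-(kl + beta * entropy))) ** gamma         # adaptive weight, Eq. (7)
    return adw * kl                                                  # Eq. (10)
```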

The original focal loss in one image is the sum of the focal losses over all anchors, normalized by the number of anchors assigned to ground-truth boxes. The proposed adaptive distillation loss for soft targets is summed in the same way but uses a different normalizer. We found that training is unstable when L_ADL is normalized by the normalizer of the original focal loss, because mimicking the soft targets of negative samples predicted by the teacher also contributes to L_ADL. In addition, the number of positive samples is unknown for the unlabeled data in the semi-supervised setting. To make the KD training more stable and robust, we define the normalizer as:

N = Σ_{i ∈ anchors} q_i^η,   (11)

where q_i is the soft target of anchor i predicted by the teacher. Since the teacher assigns probabilities close to 0 to negative samples, N is approximately the sum of the probabilities of positive samples raised to the power η over all anchors, so the modulating exponent reduces the weight contribution from negative samples. In sum, the model is trained with the sum of L_ADL over all anchors divided by N. The exponent η is set to 1.8 empirically in all experiments.
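Under the assumption that the normalizer of Equation (11) is the sum over all anchors of the teacher probabilities raised to the exponent 1.8, the per-image normalization can be sketched as follows (the lower clamp on the normalizer is an illustrative safeguard, not taken from the paper):

```python
import torch

def normalized_adl(p_anchors: torch.Tensor, q_anchors: torch.Tensor,
                   norm_exponent: float = 1.8) -> torch.Tensor:
    """Sum of ADL over all anchors in an image, divided by the normalizer of Eq. (11)."""
    loss = adaptive_distillation_loss(p_anchors, q_anchors).sum()
    # Negative anchors have q close to 0, so q^eta contributes almost nothing,
    # making the normalizer behave like a soft count of positive anchors.
    normalizer = (q_anchors ** norm_exponent).sum().clamp(min=1.0)
    return loss / normalizer
```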

3.1.3 Loss for Distilling Student Model

For the student model, we optimize the following objective:

L_student = L_fl + L_reg + L_ADL,   (12)

where L_fl is the original focal loss, L_reg is the bounding box regression loss, and L_ADL is the proposed adaptive distillation loss.
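A hypothetical training objective combining the three terms of Equation (12) as an unweighted sum, reusing the sketches above; the smooth-L1 choice for the box loss is an assumption, and RetinaNet's normalization of the focal loss by the number of positive anchors is omitted for brevity.

```python
import torch.nn.functional as F

def student_training_loss(cls_p, cls_q, cls_y, box_pred, box_target):
    """L_student = focal loss (hard targets) + box regression + ADL (soft targets)."""
    l_fl = binary_focal_loss(cls_p, cls_y).sum()     # supervised classification loss
    l_reg = F.smooth_l1_loss(box_pred, box_target)   # bounding box loss (smooth L1 assumed)
    l_adl = normalized_adl(cls_p, cls_q)             # distillation from teacher soft targets
    return l_fl + l_reg + l_adl
```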

Figure 1: Semi-supervised adaptive distillation (SAD) schematic. First, the teacher annotates the unlabeled images and selects those with at least one predicted annotation. These selected samples are then combined with the labeled ones. Finally, the student is trained with L_student under the guidance of the teacher.

3.2 Semi-supervised Adaptive Distillation Scheme

In general, improving the transferring efficiency through L_ADL alone is not enough to bridge the gap between the teacher and the student. Therefore, we further improve distillation performance by mining data efficiently. Labeling samples by humans is expensive, especially for the object detection task, whereas collecting web-scale unlabeled samples is economical. Hence we investigate exploiting unlabeled data for KD. Previous work [16] introduces semi-supervised data distillation, in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. It reveals a strong connection between the improvement of the student model and the amount of unlabeled data used. Essentially, semi-supervised learning improves distillation performance by augmenting the transferring set. In the semi-supervised setting, the teacher is trained on a labeled dataset and the student is trained on both the labeled dataset and an unlabeled dataset, which together compose the transferring dataset. The student is guided by soft targets from the teacher on the unlabeled data and by ground truth on the labeled data simultaneously. KD aims to improve the transferring efficiency given that, in the common setting, the transferring set consists only of the labeled data. We suppose that expanding the transferring set can improve KD further.

Generating labels on unlabeled data Data distillation proposes to use the final output of the teacher model as the annotations for unlabeled data. These labels can be generated from the soft targets according to a mapping function. In addition, they are mostly easy samples, given that the soft-target probabilities of most of them are greater than 0.95. However, the hard samples dropped by non-maximum suppression are of great importance in knowledge transferring. Thus, we propose to use the combination of hard targets and soft targets. The steps, sketched in code after the list, are as follows.

(1) Train the teacher model with labeled data;

(2) Generate hard targets for unlabeled data using the teacher model;

(3) Train the student model with both labeled and unlabeled data using both soft targets and hard targets.
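The three steps can be summarized in the following sketch; the callables and their interfaces are placeholders rather than the authors' API.

```python
def semi_supervised_adaptive_distillation(train_teacher, detect, train_student_with_distillation,
                                          labeled_set, unlabeled_set):
    """Three-step SAD pipeline; the callables are supplied by the surrounding training code."""
    # (1) Train the teacher model with labeled data only.
    teacher = train_teacher(labeled_set)

    # (2) Generate hard targets for the unlabeled data using the teacher.
    pseudo_labeled = [(img, detect(teacher, img)) for img in unlabeled_set]

    # (3) Train the student on labeled + unlabeled data, using ground truth / hard targets
    #     for the supervised losses and the teacher's soft targets for the ADL term.
    transfer_set = list(labeled_set) + pseudo_labeled
    return train_student_with_distillation(teacher, transfer_set)
```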

Unlabeled data selection For real-world applications, the unlabeled data and the labeled data often do not follow the same distribution. Different types of data are collected for different purposes, e.g., ImageNet [3] for classification and COCO [14] for detection. We can simulate a real-world scenario in which, for a given task, only one dataset contains labels; all the remaining datasets contain unlabeled data for the task, and the amount of such unlabeled data is much larger than that of the labeled data. It is therefore inefficient to use all the available data to distill the student. For unlabeled data collected from different sources, most images do not contain any objects of interest and can be easily classified as negative by a well-trained model. On the contrary, images containing at least one positive sample are generally harder to detect. Given these considerations, we propose to select those images for which the teacher produces at least one annotation to distill the student, so that knowledge is transferred more efficiently.
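A minimal sketch of this selection rule; the detection format and the default score threshold are assumptions.

```python
def select_unlabeled_images(unlabeled_images, teacher_detections, score_threshold=0.5):
    """Keep only images for which the teacher predicts at least one annotation.

    unlabeled_images: iterable of image ids.
    teacher_detections: dict mapping image id -> list of (box, class, score) from the teacher.
    """
    return [img for img in unlabeled_images
            if any(score >= score_threshold
                   for _, _, score in teacher_detections.get(img, []))]
```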

4 Experiment

We evaluate our methods on the detection task of the COCO benchmark [14]. We report all studies by evaluating on the mini-val (5k images) or test-dev (41k images) split with the standard COCO average-precision metrics, including AP, AP50 and AP75.

4.1 Data Splits

In the 2017 version of COCO, there are 115k labeled images and 120k unlabeled images. We refer to them as co-115 and un-120, respectively. We train the teacher model on co-115 and the student model on the union of co-115 and un-120.

Optimization We evaluate our method using RetinaNet, one of the state-of-the-art single-stage detectors. All the hyper-parameters are the same as in [13]. We use the implementation from Detectron [6]. All the models are trained with synchronized SGD over 8 GPUs with a total of 16 images per minibatch (2 images per GPU). The initial learning rate is set to 0.01, weight decay to 0.0001, and momentum to 0.9. For training models using only co-115, we set the number of iterations to 90,000. For training models on both co-115 and un-120, 270,000 iterations are used. The learning rate is divided by 10 at 70% and 90% of the total number of iterations. Further increasing the number of iterations does not improve the performance.
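For concreteness, the schedule can be written as a small configuration sketch; the keys below are illustrative and do not correspond to Detectron's actual configuration names.

```python
def make_schedule(use_unlabeled: bool) -> dict:
    """Training schedule described above, as a plain dictionary."""
    total_iters = 270_000 if use_unlabeled else 90_000
    return {
        "base_lr": 0.01,
        "weight_decay": 1e-4,
        "momentum": 0.9,
        "images_per_batch": 16,   # 2 images per GPU on 8 GPUs
        "lr_drop_iters": [int(0.7 * total_iters), int(0.9 * total_iters)],  # divide lr by 10
        "total_iters": total_iters,
    }
```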

Loss We use the loss introduced above for knowledge distillation. The focusing parameter γ is set to the same value for the soft-target loss and the hard-target loss. The classification loss is applied to all ∼100k anchors in each sampled image.

4.2 Student-teacher Pairs

We validate our methods in different student-teacher pairs.

Distillation over scales We first investigate the performance improvement through distillation when the input size is reduced. For KD, the teacher and the student should produce the same number of output logits even though their input sizes differ. Therefore, we simply add a deconvolutional layer on top of the final feature map of the student model to match the size of the teacher's final feature map. In the experiments, the input size of the student model is 400×677 and the input size of the teacher model is 800×1333. Both the teacher and the student use ResNet-50 as the backbone.
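A minimal sketch of matching the student's final feature map to the teacher's resolution with a deconvolution; the channel count, kernel size, and stride are assumptions, since the paper only states that a deconvolutional layer is added.

```python
import torch
import torch.nn as nn

class LogitUpsampler(nn.Module):
    """Upsample the student's final feature map so its logits align with the teacher's."""
    def __init__(self, channels: int = 256):
        super().__init__()
        # A stride-2 deconvolution doubles the spatial resolution
        # (e.g., a 400-px-input feature grid is mapped onto the 800-px-input grid).
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1)

    def forward(self, student_feat: torch.Tensor) -> torch.Tensor:
        return self.deconv(student_feat)
```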

Distillation over small models In addition to distillation over different input sizes, we also examine distillation over detectors with different capacities, in which a strong teacher model distills a weak student model. Experiments with several pairs of teacher and student are conducted in this study.

4.3 Adaptive Distillation Study

In this section, we compare different methods. Unless otherwise specified, we use ResNet-50 as the teacher and ResNet-50-half as the student, with an input scale of 800×1333.

Feature map mimic First we evaluate naive logits mimicking with an L2 loss. Entire-feature-map regression is implemented through the mimicking method described in [11]. Results of logits mimicking and entire-feature-map mimicking are shown in Table 2. The mimicked models do not obtain any improvement over the baseline; it is difficult for the small network to learn from the teacher through this method.

Focal Knowledge Distillation We try the loss function L_FDL, which shares the same focal term between hard targets and soft targets. Surprisingly, the performance of the student model drops from 34.3 to 33.9. We attribute the decrease to the supervision of the ground truth in the focal term being so strong that it suppresses the effect of the soft targets: the gradient is 0 when p is equal to the ground truth, i.e., L_FDL is not at its minimum when p is equal to q.

Adaptive distillation loss Results using the proposed L_ADL with varying β are shown in Table 1. When β is 0, which is equivalent to using only KL inside the modulating factor, L_ADL does not work, since the PHLS is very small. The performance improves as β becomes larger. With β = 1.5, ADL yields nearly 2 AP improvement for the student. Compared with L_FDL, our proposed L_ADL has the property that the loss reaches its minimum when the output p produced by the student equals q produced by the teacher. Unless otherwise noted, the abbreviation ADL refers to the loss defined in Equation (10), and we use β = 1.5 for all the following experiments.

ADL under different student-teacher pairs In Table 3, we show distillation results on co-115 over different student-teacher pairs using L_ADL. The performance of the student models improves significantly with distillation, despite architectural differences between teacher and student. In general, a weak student model achieves over 1 point improvement in mAP and 2 points in AP50. As the results show, simply adding a deconvolution layer on top of the classification and localization subnets harms the performance compared to the results (30.5 mAP) reported in [13], but the model outperforms the normal 400×677 model after distillation.

β AP AP50 AP75
baseline 28.8 45.8 30.6
0 28.9 45.9 30.6
0.5 29.4 46.3 31.2
1.0 30.5 48.5 32.7
1.5 30.7 48.8 32.7
Table 1: Varying β in ADL. The performance increases as β (and thus PHLS) becomes larger.

Method AP AP50 AP75
baseline 28.8 45.8 30.6
Feature map mimic 28.8 45.8 30.6
FDL 28.5 45.5 30.2
ADL 30.7 48.8 32.7
Table 2: Results for different distillation methods. Feature map mimic minimizes the L2 loss between the student's and the teacher's feature maps. FDL is the focal distillation loss of Equation (6). ADL is the adaptive distillation loss of Equation (10).

Model Scale AP AP50 AP75
T (ResNet-50) 800 35.4 54.6 37.9
S (ResNet-50 up) 400 29.8 48.8 30.9
AD 400 31.2 50.9 32.5
T (ResNet-50) 800 35.4 54.6 37.9
S (ResNet-50 half) 800 28.8 45.8 30.6
AD 800 30.7 48.8 32.7
T (ResNext-101) 600 37.9 57.2 40.6
S (ResNet-50) 600 34.3 53.2 36.9
AD 600 35.2 54.1 37.7
Table 3: Distillation with co-115 using L_ADL over different student-teacher pairs. S stands for student and T stands for teacher. The same notation is used in the following tables.

4.4 Semi-supervised Adaptive Distillation Study

Student (scale) Teacher (scale) co-115 GT co-115 ST un-120 HT un-120 ST AP AP50 AP75
ResNet-50 up (400) ResNet-50 (800) 28.8 45.8 30.6
32.1 51.6 33.9
32.3 51.3 34.1
33.2 53.2 35.1
ResNet-50 half (800) ResNet-50 (800) 28.8 45.8 30.6
32.1 50.6 34.2
32.3 50.3 34.6
33.1 52.1 35.2
ResNet-50 (600) ResNext-101 (600) 34.3 53.2 36.9
35.6 54.7 37.9
35.9 54.9 38.5
36.6 55.8 38.9
Table 4: Distillation results using un-120 under different settings. Ground truths are abbreviated as GT, soft targets as ST, and hard targets (produced by the teacher) as HT. The notation co-115 denotes the 115k annotated COCO training set, while un-120 denotes the 120k COCO unlabeled set.

We conduct different experiments to study distillation with the un-120. The results are summarized in Table 4.

Experiment setting The notation co-115 ST or un-120 ST denotes soft targets produced by the teacher. Ground truths are abbreviated as GT and hard targets as HT. Hard targets are predicted by the teacher using the method introduced in [16].

Effect of un-120 First, we investigate the method introduced in [16]. Following its protocol, we generate annotations for un-120 by selecting a score threshold that makes the average number of annotated instances per unlabeled image roughly equal to the average number of instances per labeled image. Compared with the model using only co-115 GT, the use of un-120 yields significant improvement.
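Such a threshold can be found, for example, with a simple binary search; the sketch below is illustrative and assumes the teacher's detection scores have already been collected for every unlabeled image.

```python
def pick_score_threshold(detection_scores_per_image, target_per_image, tol=0.05):
    """Binary-search a score threshold so the average number of kept detections per
    unlabeled image roughly matches the average number of instances per labeled image.

    detection_scores_per_image: list of lists of teacher confidence scores (one list per image).
    """
    num_images = len(detection_scores_per_image)
    all_scores = [s for scores in detection_scores_per_image for s in scores]
    lo, hi = 0.0, 1.0
    for _ in range(30):
        mid = (lo + hi) / 2.0
        avg_kept = sum(s >= mid for s in all_scores) / num_images
        if abs(avg_kept - target_per_image) < tol:
            break
        if avg_kept > target_per_image:
            lo = mid   # too many detections kept -> raise the threshold
        else:
            hi = mid   # too few detections kept -> lower the threshold
    return (lo + hi) / 2.0
```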

Effect of soft targets and hard targets on un-120 We use both soft targets and hard targets of un-120 in the training stage; note that both are produced by the teacher model. Compared with the distillation model using only hard targets of un-120, the combination yields over 1 AP improvement for all student-teacher pairs, which shows the effectiveness of the hard samples contained in the soft targets for knowledge transferring.

4.5 Surpass the Teacher

In this section, we show that the student can surpass the teacher with our ADL under the semi-supervised setting. We consider two cases: the student and the teacher are identical, or the teacher performs better than the student. The former is called self distillation.

Self distillation In the self distillation experiments, the teacher and the student are parameterized identically. We conduct experiments with different input scales (400, 500, 600) and different backbones (ResNet-50, ResNet-101). In Table 5, GT is short for ground truth and ST is short for soft targets. As shown in Table 5, our self distillation method achieves an improvement of around 0.5 AP over the teacher model trained with ground truth only. The improvement from self distillation is consistent across different input scales and backbones. We also tried performing self distillation using both co-115 and un-120, but no improvement was gained from adding un-120, which indicates that co-115 is sufficient to transfer knowledge from a teacher to a student of the same capacity.

Model Scales Targets AP AP50 AP75
ResNet-50 400 GT 30.5 47.5 32.7
ST+GT 31.0 48.1 32.9
ResNet-50 500 GT 32.5 50.9 34.8
ST+GT 33.3 51.4 35.6
ResNet-50 600 GT 34.3 53.2 36.9
ST+GT 34.7 53.4 37.0
ResNet-101 600 GT 36.0 54.8 38.7
ST+GT 36.3 55.2 38.9
Table 5: Self distillation results. GT represents ground truth and ST represents soft targets. The student is trained with GT and ST using the proposed ADL. As the results show, improvement is obtained at different scales and with different networks even though the student and the teacher are identical.

The teacher is better than the student In this experiment, we show that when the teacher is better than the student, the student can surpass its teacher by augmenting the transferring set. We conducted experiments with two student-teacher pairs: (ResNet-50, ResNet-101) and (ResNet-101, ResNext-101). The teacher is trained on co-115 and the student is trained on the union of co-115 and un-120. The teachers are 1.7 and 2.2 AP higher than their students, respectively, yet the students still beat their teachers by some margin. In Table 4, the ResNet-50 (600) student detector reaches an mAP of 36.6 guided by the ResNext-101 teacher, only slightly higher than the result obtained when guided by ResNet-101 in this experiment. We suppose the limitation for the (ResNet-50, ResNext-101) pair is caused by the limited amount of data in the transferring set.

Model Scale AP AP50 AP75
T (ResNet-101) 600 36.0 54.8 38.7
S (ResNet-50) 600 34.3 53.2 36.9
AD 600 36.3 55.2 38.9
T (ResNext-101) 500 36.6 55.5 39.3
S (ResNet-101) 500 34.4 52.7 36.9
AD 500 36.8 55.7 39.4
Table 6: Results on the union of co-115 and un-120 using the proposed adaptive distillation loss. We show that the student can surpass its teacher.
Method AP time (ms)
YOLOv2 [17] 21.6 25
SSD321 [15] 28.0 61
DSSD321 [4] 28.0 85
R-FCN [2] 29.9 85
SSD513 [15] 31.2 125
DSSD513 [4] 33.2 156
FPN-FRCN [12] 36.2 172
RetinaNet-50-400 30.5 69
RetinaNet-50-600 34.3 90
RetinaNet-50-800 35.7 123
RetinaNet-101-500 34.4 90
RetinaNet-101-800 37.8 190
SAD (ours) 36.7 90
SAD (ours) 36.9 90
Table 7: Speed (ms) versus accuracy (AP) on COCO test-dev, which has no public labels and requires evaluation on a server. Our detector achieves an AP of 36.9 while running at 90 ms per image. The distilled detector is both more accurate and faster than RetinaNet-50-800.

Comparison with other detectors We compare our distilled detector with other detectors. As shown in Table 7, our detector is significantly better than every other detector except FPN-FRCN and RetinaNet-101-800. With an mAP comparable to these two detectors, our distilled detectors run about 2× faster.

4.6 Distillation with Dissimilar-distribution Data

In this section, we evaluate the proposed unlabeled data selection method. The unlabeled data come from the ImageNet dataset rather than the COCO dataset, so the unlabeled data and the labeled data have different distributions.

Configuration First, 110k images are randomly selected from the ImageNet images that contain at least one annotation predicted by the teacher. Another 110k images are randomly selected from the ImageNet images that do not contain any annotation predicted by the teacher. The threshold on the teacher's predictions is kept the same as the one used for un-120. We refer to the former as ImageNet-110+ and to the latter as ImageNet-110−, where the subscript + denotes positive responses from the teacher and − denotes negative responses. We choose the (ResNet-50, ResNext-101) pair in this experiment.

Results We conducted experiments using combinations of ImageNet-110+ and ImageNet-110−, as shown in Figure 2. We randomly sample a fraction of the images from ImageNet-110+ and the complementary fraction from ImageNet-110−, so that the total number of images is still 110k. The more images sampled from ImageNet-110+, the better the student performs, with the fraction varying from 0 to 0.6. The proposed unlabeled data selection improves the distillation performance when the fraction is small; for real-world data, the fraction is typically between 0 and 0.2.

Figure 2: Adaptive distillation applied to ImageNet data, varying the fraction of images drawn from ImageNet-110+.

5 Conclusion

In this paper, we design an adaptive distillation loss for the single-stage detector and demonstrate its effectiveness with RetinaNet in the common distillation setting. We also investigate this adaptive distillation in a semi-supervised learning setting, showing that the student model gains substantial improvement from using both the hard targets and the soft targets produced by the teacher on unlabeled data. The student can even surpass its teacher given a large enough transferring set. Finally, we demonstrate that the proposed unlabeled data selection method effectively transfers knowledge through unlabeled data whose distribution differs from that of the labeled data.

References