LabelEnc: A New Intermediate Supervision Method for Object Detection

by Miao Hao, et al.
Megvii Technology Limited

In this paper we propose a new intermediate supervision method, named LabelEnc, to boost the training of object detection systems. The key idea is to introduce a novel label encoding function that maps the ground-truth labels into a latent embedding, which acts as an auxiliary intermediate supervision to the detection backbone during training. Our approach mainly involves a two-step training procedure. First, we optimize the label encoding function via an AutoEncoder defined in the label space, approximating the "desired" intermediate representations for the target object detector. Second, taking advantage of the learned label encoding function, we introduce a new auxiliary loss attached to the detection backbones, thus benefiting the performance of the derived detector. Experiments show our method improves a variety of detection systems by around 2% on the COCO dataset, for both one-stage and two-stage frameworks. Moreover, the auxiliary structures only exist during training, i.e. the method is completely cost-free at inference time.


1 Introduction

Object detection is one of the fundamental problems in computer vision. In the deep learning era, modern object detection networks [32, 11, 23, 24, 14, 38, 30, 31] are composed of two main components: one is the backbone part $f(\cdot;\theta_f)$, which generates the intermediate embedding from each image; the other is the detection head $d(\cdot;\theta_d)$, which extracts instance information (i.e. the class label as well as the corresponding bounding box) from the intermediate representation. To learn the parameters $\theta_f$ and $\theta_d$, earlier work like [15] proposes to optimize them separately on different datasets. However, most recent state-of-the-art detection frameworks [11, 32, 23, 30, 26] suggest joint optimization of the backbones and detection heads for a simpler pipeline and better performance, formulated as follows:

$$\theta_f^*, \theta_d^* = \arg\min_{\theta_f,\theta_d} \mathbb{E}_{(x,y)\sim\mathcal{P}}\; \mathcal{L}_{det}\big(d(f(x;\theta_f);\theta_d),\, y\big) \tag{1}$$

where $(x, y)$ stands for a pair of image and ground-truth label; $\mathcal{P}$ is the dataset distribution; and $\mathcal{L}_{det}$ represents the detection loss, which is usually composed of classification terms and bounding-box regression terms [32].

Typically, the backbone part contains too many parameters, so it may be nontrivial or very costly to optimize directly on the detection dataset [45, 34, 22, 13]. A common practice is to introduce pretraining, for instance, initializing the backbone weights $\theta_f$ in Eq. 1 with ImageNet pretrained [32, 30, 31, 23, 26, 24] or self-supervised [7, 12] weights. Though such a pretraining-then-finetuning paradigm has been demonstrated to achieve state-of-the-art performance [14, 4], we find that pretraining only the backbone weights may be suboptimal for the optimization. Since the weights in the detection head are still randomly initialized, the gradient passed from the detection head to the backbone during training can be very noisy, especially at the very beginning. The noisy gradient may significantly harm the pretrained weights, causing slower convergence or poorer performance. In fact, such degradation has been observed in many codebases and a few workarounds have been proposed. For example, a well-known workaround is to freeze a few weight layers in the backbone during finetuning to avoid unstable optimization [32, 11]; however, it still seems insufficient to fully address the issue.

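In this paper, we propose to deal with the problem from a new direction: introducing an auxiliary intermediate supervision directly to the backbone. The key motivation is that, if we can provide a feasible supervision in the training phase, the backbone part could be effectively optimized even before the detection head converges. We formulate our method as follows:

$$\theta_f^*, \theta_d^* = \arg\min_{\theta_f,\theta_d} \mathbb{E}_{(x,y)\sim\mathcal{P}} \Big[ \mathcal{L}_{det}\big(d(f(x;\theta_f);\theta_d),\, y\big) + \lambda\, \mathcal{R}\big(f(x;\theta_f),\, y\big) \Big] \tag{2}$$

where $\mathcal{R}(\cdot)$ means the auxiliary loss attached to the outputs of the backbone $f(\cdot;\theta_f)$; it is independent of the detection head $d(\cdot;\theta_d)$ and thus not affected by the latter's convergence progress. $\lambda$ is the balancing coefficient.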

The core of our approach is thus the design of the auxiliary loss $\mathcal{R}$. Intuitively, the auxiliary supervision aims to minimize the distance between the latent feature representation and some "ideal" embedding of the corresponding training sample. However, how should the desired representation be defined and calculated? Some previous works, especially Knowledge Distillation [17] methods, suggest acquiring the intermediate supervision from more powerful teacher models; nevertheless, their representations are not guaranteed to be optimal. Instead, in this work we point out for the first time that the inverse of the underlying optimal detection head (i.e. $d^{-1}(\cdot;\theta_d^*)$) could be the feasible embedding, which traces the ground-truth label back to the corresponding latent feature. Further discussion is given in Sec. 3.

Motivated by the analyses, in our proposed method LabelEnc, we introduce a novel label encoding function to realize the auxiliary loss $\mathcal{R}$ in Eq. 2, which maps the ground-truth labels into the latent embedding space, thus providing an auxiliary intermediate supervision to the detector's training. The label encoding function is designed to approximate the inverse of the optimal detection head, $d^{-1}(\cdot;\theta_d^*)$, since the underlying optimal parameters $\theta_d^*$ and the "inverse" form in the latter's formulation are nontrivial to derive directly. Thus our method in general involves a two-step training pipeline. First, to learn the label encoding function, we train an AutoEncoder architecture, embedding the ground-truth labels into the latent space according to the (approximated) optimal detection head. Second, with the help of the learned label encoding function, we optimize Eq. 2 under the auxiliary supervision $\mathcal{R}$; in addition, the initial weights in the detection head can be inherited from the AutoEncoder instead of random values for more stable optimization.

We evaluate our method on various object detection systems. Across different backbones (e.g. ResNet-50, ResNet-101 [16] and Deformable Convolutional Networks [9, 46]) and detection frameworks (e.g. RetinaNet [24], FCOS [38] and FPN [23]), our training pipeline consistently achieves significant performance gains, e.g. about 2% improvement on the COCO [25] dataset. More importantly, our method is completely cost-free at inference time, as the auxiliary structures only exist during training. Please refer to Sec. 4 for detailed results.

In conclusion, the major contributions of our paper are as follows:

  • We propose a new auxiliary intermediate supervision method named LabelEnc, to boost the training of object detection systems. With the novel label encoding function, our method can effectively overcome the drawbacks of randomly initialized detection heads, leading to more stable optimization and better performance.

  • Our method is demonstrated to be generally applicable to most modern object detection systems. Previous methods like [21, 46, 34] also introduce various auxiliary losses, but they usually rely on specific backbone architectures or detection frameworks. Furthermore, though the underlying formulations appear to be somewhat complex, the implementation of our approach is relatively simple. Code will be released soon.

2 Related Work

2.0.1 Auxiliary supervision.

Auxiliary Supervision is a common technique to improve the performance of the model in indirect ways, e.g., weight decay, Center Loss [41], etc. Among various auxiliary supervision methods, Multi-task Learning (MTL) [5] methods are used commonly. MTL solves multiple tasks using a single model. By sharing parameters between the tasks, inductive bias is transferred and better generalization is gained. In object detection, Mask R-CNN [14] combines object detection with instance segmentation by adding a simple mask branch to Faster R-CNN model [32]. The MTL strategy can improve the performance of the detection branch efficiently, but it requires additional mask annotation. [20], on the contrary, does not need additional annotations, but it requires carefully-designed auxiliary tasks.

Deep supervision is another common form of auxiliary supervision. Instead of introducing additional tasks, deep supervision introduces supervision on additional layers. DSN [19] first proposes the concept by adding additional supervision on the hidden layers. Inception [36] also uses similar auxiliary classifiers on lower stages of the network. In semantic segmentation, PSPNet [44] and ExFuse [43] adopt deep supervision in order to improve the low-level features. In object detection, DSOD [34] utilizes deep supervision with dense connections to enable from-scratch training. In our method, we adopt the idea of deep supervision by proposing a label encoding function, with which we can map the labels into a latent embedding for auxiliary intermediate supervision.

2.0.2 Knowledge distillation.

Our method shares some common inspiration with Knowledge Distillation [17, 33, 42, 40]. In Knowledge Distillation, training is a two-step process: a large teacher model is trained first, and its predictions are then used to supervise a smaller student model. Knowledge Distillation has been used in several fields, e.g., face [27], speech [37], re-id [8]. There are several works focusing on object detection as well: [6] uses a balanced loss on classification, a bounded loss on regression and a loss on features; [46] and [21] propose distillation methods based on RoIs; [39] makes distillation focus on object-local areas.

From the distillation perspective, the label encoding function is the teacher model in our pipeline. It is trained in the first step and utilized for supervision in Step 2. However, it has a relatively simple architecture and does not extract features from real images. In contrast, traditional distillation relies heavily on a big teacher model. Usually, the stronger the teacher model is, the better the distillation performance it can give. However, teacher models with high performance are not always available in practice: the state-of-the-art models are the best teachers we can find. This limits the performance of traditional distillation.

2.0.3 Label encoding.

There are several works that use label encoding to boost training [3, 1, 35]. However, few of them are evaluated on the supervised object detection task. Among them, our method is most similar to [28], which uses an AutoEncoder to model the labels of semantic segmentation. The AutoEncoder is then used to perform auxiliary supervision. Compared with our method, there are two main differences. First, in object detection, label structures hardly exist. Segmentation has rich information in label structures thanks to the outline of regions in the annotation, e.g. a cat has a long tail, a thick body and a small head. In object detection, by contrast, such structures are very limited, since all objects are just boxes with different scales and aspect ratios. Second, we propose a joint optimization scheme that introduces auxiliary structures for training the AutoEncoder, which we empirically find vital to the performance, whereas in [28] the AutoEncoder is trained independently.

3 Method

3.1 Intermediate Auxiliary Supervision

As mentioned in the introduction, the core of our method is to define the supervision term $\mathcal{R}$ in Eq. 2, which is expected to provide feasible supervision to the backbone training. Intuitively, the auxiliary loss should encourage the latent feature generated by the backbone network $f(\cdot;\theta_f)$ to be close to some "ideal" embedding for each training sample:

$$\mathcal{R}\big(f(x;\theta_f),\, y\big) = \mathcal{D}\big(f(x;\theta_f),\, \phi(y)\big) \tag{3}$$

where $\mathcal{D}(\cdot,\cdot)$ represents the distance measurement and $\phi(y)$ denotes the "ideal" embedding of the label $y$. Therefore, a problem arises: how to define the so-called "ideal" feature $\phi(y)$? Obviously, the calculation of $\phi(y)$ cannot directly rely on the training of the detection head $d(\cdot;\theta_d)$, otherwise it may be unstable and redundant with the existing detection loss $\mathcal{L}_{det}$.

Let us think one step further. If we have finished the optimization in Eq. 2 in some way, i.e. the corresponding optimal weights $\theta_f^*$ and $\theta_d^*$ have been obtained, we can intuitively define the inverse of the detection head as the "optimal" intermediate embedding. So,

$$\phi(y) := d^{-1}(y;\theta_d^*) \tag{4}$$

We argue that this definition of the "ideal" embedding $\phi(y)$ is feasible because if the auxiliary loss tends to zero, the backbone output coincides with $d^{-1}(y;\theta_d^*)$, and applying the detection head to it gives $d(d^{-1}(y;\theta_d^*);\theta_d^*) = y$, i.e. the detector predicts the ground truth exactly. Unfortunately, Eq. 4 cannot be directly used in the optimization. First, when substituting Eq. 4 into Eq. 2, we find that $\theta_d^*$ exists on both sides of the equation, so we cannot determine its value in advance. Second, even if $\theta_d^*$ is given, the inverse form $d^{-1}(\cdot;\theta_d^*)$ is still difficult to calculate due to the high nonlinearity of neural networks (actually the inverse is generally not unique).

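We deal with the second problem first. Notice that for any label $y$, we have $d(d^{-1}(y;\theta_d^*);\theta_d^*) = y$, where $d(\cdot;\theta_d^*)$ is the optimal detection head. Motivated by this, to approximate $d^{-1}(\cdot;\theta_d^*)$ we introduce a new network $h(\cdot;\theta_h)$, whose parameters are learned by the optimization:

$$\theta_h^* = \arg\min_{\theta_h} \mathbb{E}_{(x,y)\sim\mathcal{P}}\; \mathcal{L}_{det}\big(d(h(y;\theta_h);\theta_d^*),\, y\big) \tag{5}$$

Here, $\mathcal{L}_{det}$ is the detection loss, following the definition in Eq. 1. Intuitively, $h$ maps the ground-truth label into the latent feature space and $d$ recovers the label from the latent representation. So, we say that $h$ approximates the "inverse" of $d$. It is worth noting that the composite function $d(h(\cdot))$ actually represents an AutoEncoder defined in the label space. Thus we name $h$ the label encoding function. Thanks to the approximation, we rewrite Eq. 4 as follows:

$$\phi(y) \approx h(y;\theta_h^*) \tag{6}$$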
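Then we come back to the first problem. In Eq. 6, note that the optimization of the label encoder weights $\theta_h^*$ still involves the optimal head weights $\theta_d^*$ (Eq. 5). So, in our formulations (Eq. 2, 5 and 6) there still exists a recursive dependence on $\theta_d^*$. To get out of the dilemma, we use an unrolling trick, i.e. recursively substituting Eq. 6 and Eq. 5 into Eq. 2. Thus we obtain the final formulations (please refer to the appendix for the detailed derivation):

$$\theta_f^*, \theta_d^* = \arg\min_{\theta_f,\theta_d} \mathbb{E}_{(x,y)\sim\mathcal{P}} \Big[ \mathcal{L}_{det}\big(d(f(x;\theta_f);\theta_d),\, y\big) + \lambda\, \mathcal{D}\big(f(x;\theta_f),\, h(y;\theta_h^*)\big) \Big] \tag{7}$$

$$\theta_h^* = \arg\min_{\theta_h} \mathbb{E}_{(x,y)\sim\mathcal{P}}\; \mathcal{L}_{det}\big(d(h(y;\theta_h);\theta_d'),\, y\big) \quad \text{s.t.} \quad \theta_d' = \arg\min_{\theta_d} \min_{\theta_f} \mathbb{E}_{(x,y)\sim\mathcal{P}} \Big[ \mathcal{L}_{det}\big(d(f(x;\theta_f);\theta_d),\, y\big) + \lambda\, \mathcal{D}\big(f(x;\theta_f),\, h(y;\theta_h)\big) \Big] \tag{8}$$
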
Here is short for .

Eq. 7 and Eq. 8 compose the core idea of our method. The formulations actually imply a two-step training pipeline. In the first step, by optimizing the auxiliary AutoEncoder defined in Eq. 8, we obtain a label encoding function $h(\cdot;\theta_h^*)$ mapping the ground-truth label map into the latent space. Then in the second step, we train the detection framework with the intermediate supervision of $h(y;\theta_h^*)$, as described in Eq. 7. In the next subsections, we will introduce the optimization details.

3.2 Step 1: AutoEncoder Training

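In this subsection we aim to derive the label encoding function via Eq. 8. However, directly solving Eq. 8 is not easy: since the label encoder weights $\theta_h$ exist in both the target and the constraint, it is actually a bilevel optimization problem, which seems nontrivial to implement with current deep learning tools. Therefore, we propose to relax the formulation into a joint optimization scheme over the label encoding function $h(\cdot;\theta_h)$, the detection head $d(\cdot;\theta_d)$ and the backbone $f(\cdot;\theta_f)$, as follows:

$$\min_{\theta_h,\theta_d,\theta_f} \mathbb{E}_{(x,y)\sim\mathcal{P}} \Big[ \mathcal{L}_{det}\big(d(h(y;\theta_h);\theta_d),\, y\big) + \lambda_1\, \mathcal{L}_{det}\big(d(f(x;\theta_f);\theta_d),\, y\big) + \lambda_2\, \mathcal{D}\big(f(x;\theta_f),\, h(y;\theta_h)\big) \Big] \tag{9}$$

where $\lambda_1$ and $\lambda_2$ are balancing coefficients; in our experiments we simply set them to 1. It is clear that Eq. 9 simply corresponds to a multi-task training paradigm with three loss terms: the first one is the reconstruction loss (L1) for the label's AutoEncoder; the second term is the common detection loss (L2), which enforces $d$ to be a valid detection head; the third loss (L3) minimizes the gap between the two latent spaces (namely the outputs of the backbone and the label encoding function respectively).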

Figure 1: Step 1: AutoEncoder training. L1 – AutoEncoder reconstruction loss; L2 – detection loss; L3 – distance minimization loss; please refer to Eq. 9 for details. The solid and dashed lines indicate the forward and backward flows respectively

Fig. 1 illustrates the implementation and optimization of Eq. 9. According to Eq. 9, the same detection head is applied in both the L1 and L2 terms, which is why we mark "shared detection head" in Fig. 1. It is also worth noting that we forbid the gradient flow from L3 to the label encoding function $h$. The motivation is that, in Eq. 8 (which is the original form of Eq. 9), the optimization of the backbone weights $\theta_f$ does not directly affect the label encoder weights $\theta_h$, so we preserve this property in the implementation. We empirically find that the above details are critical to the final performance.
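The way the three Step 1 terms combine can be sketched in a few lines. This is only an illustrative sketch, not the released implementation: the function name `step1_objective`, its arguments and the scalar toy stand-ins are all hypothetical, and which term carries which coefficient is an assumption. Plain Python cannot express the blocked gradient from L3 into the label encoder, so that detail is only marked in a comment.

```python
def step1_objective(x, y, f, h, d, det_loss, dist, lam1=1.0, lam2=1.0):
    """Return the joint Step 1 objective and its three terms (L1, L2, L3)."""
    z_img = f(x)  # latent feature from the detection backbone
    z_lab = h(y)  # latent feature from the label encoding function
    l1 = det_loss(d(z_lab), y)  # L1: reconstruction loss of the label AutoEncoder
    l2 = det_loss(d(z_img), y)  # L2: ordinary detection loss, same (shared) head d
    # L3: gap between the two latent spaces. In an autograd framework z_lab
    # would be detached here so that L3 sends no gradient into h.
    l3 = dist(z_img, z_lab)
    total = l1 + lam1 * l2 + lam2 * l3
    return total, (l1, l2, l3)


# Toy scalar stand-ins (hypothetical, for illustration only).
f = lambda x: 2.0 * x          # "backbone"
h = lambda y: y                # "label encoding function"
d = lambda z: z                # "detection head"
sq = lambda a, b: (a - b) ** 2  # squared error used for both loss and distance

total, terms = step1_objective(1.0, 3.0, f, h, d, det_loss=sq, dist=sq)
print(total, terms)
```

With both coefficients left at 1, as in the text, the objective is simply the sum of the three terms.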


Before optimization, we follow the common initialization practice, i.e. using pretrained weights (e.g. pretrained on ImageNet [10]) for the backbone parameters and Gaussian random weights for the label encoder parameters $\theta_h$ and the detection head parameters $\theta_d$. One may argue that, according to the introduction, a randomly initialized detection head may cause unstable training. However, since this training step mainly aims to learn the label encoding function, the detection backbone and the detection head are merely "auxiliary structures" in this step, whose performance is not that important. Furthermore, as we will introduce, the architecture of the label encoding function is relatively simple, so the optimization does not seem difficult.

3.3 Step 2: Detector Training with Intermediate Supervision

Figure 2: Step 2: Detector training with intermediate supervision. Please refer to Eq. 7 for the detailed definitions. The solid and dashed lines indicate the forward and backward flows respectively

After the label encoding function has been learned, we use it as the intermediate supervision to improve object detector training, according to Eq. 7. Fig. 2 illustrates the implementation. In addition to the common detection loss, we introduce an auxiliary loss to directly supervise the detection backbone. The coefficient $\lambda$ is also simply set to 1. Besides, Eq. 7 also suggests that the label encoder weights $\theta_h^*$ are fixed rather than optimization variables. So, we block the gradient flow from the auxiliary loss to the label encoding function, as shown in the figure. After training, the auxiliary structure, i.e. the label encoding function, is removed. The resulting backbone and detection head form the learned object detector.

Another important implementation detail is initialization. From Eq. 7 and Eq. 8 we know that in the two training steps, the detection backbones and the detection heads share the same network architectures respectively, but their parameters are not necessarily the same. So, in Step 2, we reinitialize the backbone parameters (using ImageNet pretrained weights, for instance) before training. As for the detection head parameters $\theta_d$, we empirically find that initializing them with the corresponding parameters learned in Step 1 (see Eq. 9) results in better performance and more stable convergence. This may be because the pretrained detection head passes less gradient noise to the backbone than a randomly initialized head.

3.4 Implementation Details and Remarks

3.4.1 Ground-truth label representation.

As mentioned above, in both training steps the label encoding function needs to take ground-truth labels as the network inputs. This is nontrivial because in the detection task each image contains a different number of instances, each of which may have a different class label and bounding box. We have to produce a fixed-size label map that contains all the ground-truth information for each image.

We propose to use a $C \times H \times W$ tensor to represent the ground-truth objects in one image, where $H \times W$ equals the image size and $C$ is the number of classes in the dataset (e.g. 80 for the COCO [25] dataset). For an object of the $c$-th class, we fill the corresponding region (according to the bounding box) in the $c$-th channel with positive values: the value decays linearly from 1 at the object center to 0.5 at the box boundary. Fig. 1 and Fig. 2 visualize the encoding. In particular, if two bounding boxes of the same class overlap with each other, the joint region is filled with the larger of the values calculated separately. Additionally, in training, the boxes are augmented by multiplying a random number between 0 and 1 with a probability of 0.5. All other values in the tensor remain zero.
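A minimal NumPy sketch of this label map may help make the encoding concrete. Several points are assumptions made for illustration rather than details given in the text: the function name `encode_labels`, the `(x1, y1, x2, y2)` pixel box format, measuring the linear decay by the larger of the two per-axis normalized distances from the box center, and applying the random augmentation as a per-box scaling of the filled values.

```python
import numpy as np


def encode_labels(boxes, classes, height, width, num_classes, rng=None):
    """Rasterize boxes (x1, y1, x2, y2) into a (num_classes, H, W) label map."""
    label_map = np.zeros((num_classes, height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    for (x1, y1, x2, y2), c in zip(boxes, classes):
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        half_w = max((x2 - x1) / 2.0, 1e-6)
        half_h = max((y2 - y1) / 2.0, 1e-6)
        # Normalized distance from the box center: 0 at the center, 1 on the
        # boundary. ASSUMPTION: the larger of the two per-axis distances.
        r = np.maximum(np.abs(xs - cx) / half_w, np.abs(ys - cy) / half_h)
        # Linear decay from 1 (center) to 0.5 (boundary); zero outside the box.
        values = np.where(r <= 1.0, 1.0 - 0.5 * r, 0.0).astype(np.float32)
        # ASSUMPTION: augmentation scales one box's values with probability 0.5.
        if rng is not None and rng.random() < 0.5:
            values = values * np.float32(rng.uniform(0.0, 1.0))
        # Overlapping boxes of the same class keep the larger value.
        label_map[c] = np.maximum(label_map[c], values)
    return label_map


demo = encode_labels([(0, 0, 4, 4)], [1], height=5, width=5, num_classes=3)
print(demo[1])
```

Passing `rng=np.random.default_rng(seed)` enables the training-time augmentation; leaving it as `None` gives the deterministic map.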

3.4.2 Architecture of label encoding function.

For ease of optimization, we use a relatively simple architecture to implement the label encoding function. The design of the structure is inspired by ResNet [16], with a reduced number of residual blocks in each stage. In addition, the Max Pooling layer is replaced by a strided convolution. The number of input channels is set to 80 to match the number of classes in the COCO [25] dataset. Batch Normalization [18] is not used here. We use the same architecture for all experiments in the paper. Please refer to the appendix for details.

3.4.3 Multi-scale intermediate supervision.

Recently, state-of-the-art detection frameworks like [23, 24, 38] usually introduce Feature Pyramid Networks (FPNs) to generate multi-scale feature maps, which greatly improves the capacity to detect objects of various sizes. Our approach can be easily generalized to multi-scale cases. First, we attach one FPN structure to the label encoding function so that it can produce multi-resolution representations. Then in both Step 1 and Step 2, we make the intermediate supervision terms (see Eq. 9 and Eq. 7) applied on all the scale levels. As shown in the following experiments, our method can effectively boost the detection frameworks with FPNs.

3.4.4 Distance measurement.

In Eq. 7 and Eq. 9, the distance measurement term $\mathcal{D}(\cdot,\cdot)$ is used to minimize the difference between two feature maps. One simple alternative is to use the $L_2$-distance directly. However, there are several issues: 1) the sizes of the two feature maps may be different; 2) since the feature maps are generated from different domains, directly minimizing their difference may suffer from very large gradients. So, we propose to introduce a feature adaption block into the distance measurement, which is defined as follows:

$$\mathcal{D}(x_f, x_h) = \big\| \mathrm{LN}\big(\mathcal{A}(x_f)\big) - \mathrm{LN}(x_h) \big\|_2^2 \tag{10}$$

where $\mathrm{LN}(\cdot)$ means Layer Normalization [2]; $\|\cdot\|_2$ is the $L_2$-distance; $x_f$ and $x_h$ are feature maps derived from the backbone and the label encoding function respectively. $\mathcal{A}(\cdot)$ represents the feature adaption network, which acts as the transformer between the two domains. We implement $\mathcal{A}$ with three convolution layers with 256 channels each. The parameters are learned jointly with the outer optimization. Similar to the label encoding function, $\mathcal{A}$ is also an auxiliary structure and is discarded after training.
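The role of the normalization in this distance can be illustrated with a small NumPy sketch. It is written under stated assumptions and is not the authors' implementation: the names `layer_norm` and `adapted_distance` are hypothetical, the normalization is taken over a whole `(C, H, W)` map without learned affine parameters, the squared difference is averaged rather than summed, and the adaption network is passed in as an arbitrary callable (the identity in the example) instead of three convolution layers.

```python
import numpy as np


def layer_norm(feat, eps=1e-5):
    """Normalize one (C, H, W) feature map to zero mean and unit variance."""
    return (feat - feat.mean()) / np.sqrt(feat.var() + eps)


def adapted_distance(feat_backbone, feat_label, adapt):
    """Mean squared difference between layer-normalized feature maps.

    `adapt` stands in for the feature adaption network and is applied to the
    backbone feature before normalization (an assumption of this sketch).
    """
    a = layer_norm(adapt(feat_backbone))
    b = layer_norm(feat_label)
    return float(np.mean((a - b) ** 2))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
other = rng.normal(size=(4, 8, 8))
identity = lambda t: t

same = adapted_distance(x, x, identity)
rescaled = adapted_distance(10.0 * x + 3.0, x, identity)
different = adapted_distance(other, x, identity)
print(same, rescaled, different)
```

Because both inputs are normalized first, a feature map that differs from the target only by a positive scale and an offset yields a distance close to zero, which is one way to see why normalization tames the gradient magnitude when the two maps come from different domains.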

4 Experiment

4.1 Setup

All our experiments are done with PyTorch [29]. We use the COCO [25] dataset to evaluate our method. Following the common practice [23, 24], we train our models on the union of the 80k train images and a subset of 35k validation images (trainval35k). We test our models on the remaining 5k validation images (minival). All results are evaluated with mmAP, i.e. mAP@[0.5,0.95], using the common single-scale test protocol. For both training and inference, we resize each image to 800 pixels on the shorter edge. The total training batch size is 16 on 8 GPUs. We mainly use the so-called 1× schedule for training, which refers to 90k iterations with two learning rate decays at 60k and 80k iterations. We use almost the same training protocol for our Step 1 and Step 2 training, as well as for all the counterpart baseline models, with two exceptions. For Step 1, we find that adding L3 from the beginning causes L3 to be nearly zero: the network somehow finds a way to cheat, causing terrible results. So we add an additional 30k warmup iterations without L3, which we find sufficient to solve the problem. For Step 2, we remove the auxiliary loss in the last 10k iterations, which results in minor improvements. Since our training pipeline involves two steps, the total number of iterations doubles. For fair comparison, we provide a 2× schedule for baseline models as well, which refers to 180k iterations with two learning rate decays at 120k and 160k iterations.
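The schedule described above can be written down as a few small helper functions. This is a sketch under assumptions, not the training code: the function names are hypothetical, the decay factor of 0.1 is a common default that the text does not state, and how the 30k warmup iterations are counted relative to the 90k schedule is not specified, so the Step 1 helper only encodes "no L3 during the first 30k iterations".

```python
def lr_multiplier(it, decay_points=(60_000, 80_000), gamma=0.1):
    """Step schedule: scale the base learning rate by gamma at each decay point.

    ASSUMPTION: gamma=0.1 is a conventional value, not taken from the text.
    """
    return gamma ** sum(it >= p for p in decay_points)


def aux_weight_step2(it, total_iters=90_000, tail=10_000, lam=1.0):
    """Auxiliary loss weight in Step 2: lam, then 0 for the last `tail` iterations."""
    return lam if it < total_iters - tail else 0.0


def l3_weight_step1(it, warmup=30_000, lam2=1.0):
    """L3 weight in Step 1: 0 during the warmup iterations, lam2 afterwards."""
    return 0.0 if it < warmup else lam2


for it in (0, 30_000, 60_000, 80_000, 89_999):
    print(it, lr_multiplier(it), aux_weight_step2(it), l3_weight_step1(it))
```

The 2× baseline schedule corresponds to calling `lr_multiplier(it, decay_points=(120_000, 160_000))` over 180k iterations.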

4.2 Main Results

Model                        Backbone     Baseline (1×)   Baseline (2×)   Ours
RetinaNet [24] (our impl.)   ResNet50     36.1            36.4            38.4
RetinaNet [24] (our impl.)   ResNet101    38.1            38.6            40.3
RetinaNet [24] (our impl.)   Res101-DCN   40.6            41.1            42.1
FCOS [38] (our impl.)        ResNet50     36.7            37.0            38.9
FCOS [38] (our impl.)        ResNet101    38.8            39.2            41.2
FCOS [38] (our impl.)        Res101-DCN   41.9            41.9            43.2
FPN [23] (our impl.)         ResNet50     36.8            37.3            38.8
FPN [23] (our impl.)         ResNet101    38.9            39.6            40.9
FPN [23] (our impl.)         Res101-DCN   41.8            42.7            43.2

Table 1: Experiments on various baselines (mmAP/%)

In order to show the effectiveness of our model on different detection frameworks, we evaluate our method on RetinaNet [24], FCOS [38] and FPN [23], which are representative baselines of one-stage detectors, anchor-free methods and two-stage frameworks respectively. We also evaluate our method on various commonly-used backbones, including ResNet-50, ResNet-101 [16] and Deformable Convolutional Networks (DCNs) [9].

Results are presented in Table 1. Compared with the counterparts trained with the 1× schedule, our method achieves performance gains of at least 2.0% on both ResNet-50 and ResNet-101 backbones. On ResNet-101-DCN, there are still improvements of 1.4% on average. Compared with the baselines trained with the 2× schedule, the gap becomes smaller but remains considerable, which suggests that our improvements are not mainly brought by more training iterations. It is worth noting that although our training pipeline doubles the total number of iterations, our label encoding function can usually be reused among different backbones (see the next subsection). Therefore in practice, we usually need to run Step 1 only once for different models.
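The averages quoted above can be checked directly from the numbers in Table 1. The short script below only redoes that arithmetic; the dictionary layout and the helper name `mean_gain` are illustrative choices.

```python
# (model, backbone) -> (baseline 1x, baseline 2x, ours), copied from Table 1.
TABLE1 = {
    ("RetinaNet", "ResNet50"): (36.1, 36.4, 38.4),
    ("RetinaNet", "ResNet101"): (38.1, 38.6, 40.3),
    ("RetinaNet", "Res101-DCN"): (40.6, 41.1, 42.1),
    ("FCOS", "ResNet50"): (36.7, 37.0, 38.9),
    ("FCOS", "ResNet101"): (38.8, 39.2, 41.2),
    ("FCOS", "Res101-DCN"): (41.9, 41.9, 43.2),
    ("FPN", "ResNet50"): (36.8, 37.3, 38.8),
    ("FPN", "ResNet101"): (38.9, 39.6, 40.9),
    ("FPN", "Res101-DCN"): (41.8, 42.7, 43.2),
}


def mean_gain(backbone, baseline="1x"):
    """Average mmAP gain of 'Ours' over the chosen baseline for one backbone."""
    col = {"1x": 0, "2x": 1}[baseline]
    gains = [row[2] - row[col] for (_, bb), row in TABLE1.items() if bb == backbone]
    return sum(gains) / len(gains)


for bb in ("ResNet50", "ResNet101", "Res101-DCN"):
    print(bb, round(mean_gain(bb, "1x"), 2), round(mean_gain(bb, "2x"), 2))
```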

4.3 Ablation Study

4.3.1 Step 1: is joint optimization required?

In Sec. 3.2, to optimize Eq. 9 we propose a joint optimization scheme that takes all three loss terms into account. Recall that in Step 1, only the learned label encoding function is carried over into the next stage. As a result, one may ask whether the auxiliary structure, i.e. the detection backbone, is really necessary in training. In other words, can we use only L1 (the AutoEncoder reconstruction loss) in Eq. 9 for this step? If so, the training step could be further simplified. Unfortunately, we find this is not the case.

To validate the argument, we conduct a comparison by removing L2 and L3 in Eq. 9 to derive the label encoding function. Other settings, such as Step 2, remain unchanged. The results are listed in Table 2, where the modified counterparts are marked with "reconstruction loss only". We compare them on RetinaNet with ResNet-50 and ResNet-101 backbones. It is clear that, without the auxiliary backbone, our method (although it still outperforms the baseline models) shows significant degradation in precision.

Backbone    Method                            mmAP (%)
ResNet50    Baseline (1×)                     36.1
ResNet50    Baseline (2×)                     36.4
ResNet50    Ours (reconstruction loss only)   36.9
ResNet50    Ours                              38.4
ResNet101   Baseline (1×)                     38.1
ResNet101   Baseline (2×)                     38.6
ResNet101   Ours (reconstruction loss only)   39.0
ResNet101   Ours                              40.3

Table 2: Ablation study of removing the auxiliary structures in Step 1

Discussion and remarks.

In Step 1, although the existence of auxiliary structures is vital, we find that the exact weights in the backbone are actually less important. From Eq. 7 and Eq. 8, we know that the Step 1 backbone weights do not affect the optimization of the label encoding function directly; they only contribute to the optimization of the Step 1 detection head. Also, unlike the Step 2 detection head, which is initialized from the Step 1 head, the backbone is reinitialized in Step 2. Therefore, the trained auxiliary detection backbone in Step 1 is completely discarded.

The observation inspires an interesting assumption: is the final performance actually insensitive to the detailed backbone architecture in Step 1? We try to verify the guess by using different backbones in Step 1 and Step 2. As reported in Table 3, we use ResNet-50 as the auxiliary backbone in the first step. Whereas in Step 2, the final detection backbone is ResNet-101. Compared with the model whose backbones in both stages are ResNet-101, the performances almost keep the same. The new finding thus suggests another advantage of our method in practice. The label encoding function can be pretrained once but re-used for multiple detectors with different backbones, as long as they have the same detection head. This property of our method greatly reduces the cost of the practical applications.

Step 1 Backbone   Step 2 Backbone   mmAP (%)
ResNet50          ResNet101         40.3
ResNet101         ResNet101         40.3

Table 3: Comparisons of different detection backbones in Step 1

4.3.2 Is Step 1 alone sufficient?

In Step 1, we only aim to obtain the label encoding function for later intermediate supervision. However, the training framework in Step 1 is quite similar to that in Step 2, and it contains a detection model (the auxiliary structure) that can be used for testing. Intuitively, the detection model in Step 1 should improve as well. One may even guess that Step 1 alone is sufficient. We show the ablation in Table 4, where we run only Step 1 and test the performance of the detection model (the auxiliary structure). We compare multiple models with the ResNet50 backbone. Step 1 alone does improve the detection model over the baseline, but it is clearly not sufficient.

Model       Backbone   Method       mmAP (%)
RetinaNet   ResNet50   Step1-only   37.9
RetinaNet   ResNet50   Ours         38.4
FCOS        ResNet50   Step1-only   38.0
FCOS        ResNet50   Ours         38.9
FPN         ResNet50   Step1-only   37.3
FPN         ResNet50   Ours         38.8

Table 4: Results of only using Step 1

4.3.3 Step 2: do intermediate supervision and initialization matter?

In Step 2, we use two methods to facilitate the optimization, i.e. intermediate supervision on the backbone and the initialization of the detection head. In Table 5, we show the ablation studies on them. The baseline framework is RetinaNet [24] with a ResNet-50 backbone. We also study the combinations for the case where only the reconstruction loss is used in Step 1 (please refer to Table 2). The results suggest that both methods contribute to the final performance.

Step 1 Supervision Initialization mmAP (%)
ResNet50 Baseline 36.1
(reconstruction loss only)
Ours 36.8
Table 5: Intermediate supervision and initialization in Step 2

4.4 Comparison with Knowledge Distillation

Our two-step pipeline resembles Knowledge Distillation (KD). Actually, if we train an object detector alone in Step 1 instead of our label encoding function with a joint framework, and use it in Step 2 for supervision, the method becomes KD. In Table 6 we show comparison between our method and the alternative mentioned above, denoted as “Vanilla KD”. On a lightweight backbone, i.e. MobileNet, our method can reach similar performance to Knowledge Distillation, although we only use a label encoding function instead of a heavy ResNet-50 that extracts “real” features. On a heavier backbone, i.e. ResNet-50, our method outperforms KD with ResNet-50 and ResNet-101 as teachers, whose improvements are limited due to the small performance gap between teacher and student. Knowledge distillation requires a teacher network that is strong enough, which is usually not easy to find when the student network is already strong. Our method, on the other hand, is not limited by it.

Backbone    Method       Teacher Network           mmAP (%)
MobileNet   Baseline     -                         27.7
MobileNet   Vanilla KD   ResNet50                  29.7
MobileNet   Ours         Label Encoding Function   29.8
ResNet50    Baseline     -                         36.1
ResNet50    Vanilla KD   ResNet50                  36.8
ResNet50    Vanilla KD   ResNet101                 36.5
ResNet50    Ours         Label Encoding Function   38.4

Table 6: Comparison with Knowledge Distillation (%)

4.5 Performance on Mask Prediction

Above we mainly focus on object detection. However, our previous discussion when proposing the method (Sec. 1 and Sec. 3) is based on the structure and optimization of detection networks, not object detection task itself. Thus it is likely that our method can be extended to other tasks with similar framework. We tested our method on Mask R-CNN [14], which produces mask prediction in instance segmentation, but has a similar framework to FPN. It is worth noting that for Mask R-CNN, we use masks instead of boxes as the input for label encoding function. Results are presented in Table 7. It indicates our method improves mask prediction as well.

Backbone    Method         box     mask
ResNet50    Baseline (1)   37.4    34.2
            Baseline (2)   38.2    34.6
            Ours           39.1    35.6
ResNet101   Baseline (1)   40.0    36.0
            Baseline (2)   40.6    36.4
            Ours           41.7    37.6
Table 7: Experiments on Mask R-CNN (mmAP/%)
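
Switching from boxes to masks only changes how the label tensor fed to the label encoding function is filled. The sketch below is a minimal pure-Python illustration under stated assumptions: a binary per-class map of shape C×H×W, where a box fills its whole rectangle and a mask fills only its foreground pixels. The function names, the binary fill value, and the toy sizes are hypothetical and may differ from the encoding actually used in our experiments.

```python
def empty_label(num_classes, height, width):
    """Create a num_classes x height x width label tensor filled with zeros."""
    return [[[0.0] * width for _ in range(height)] for _ in range(num_classes)]


def add_box(label, cls, x0, y0, x1, y1):
    """Fill the rectangle [x0, x1) x [y0, y1) in channel `cls` (assumed encoding)."""
    for y in range(y0, y1):
        for x in range(x0, x1):
            label[cls][y][x] = 1.0


def add_mask(label, cls, mask):
    """Fill only the foreground pixels of a binary H x W mask in channel `cls`."""
    for y, row in enumerate(mask):
        for x, foreground in enumerate(row):
            if foreground:
                label[cls][y][x] = 1.0


def channel_area(label, cls):
    """Number of filled pixels in channel `cls`."""
    return sum(sum(row) for row in label[cls])


if __name__ == "__main__":
    # Toy 4x4 image with 2 classes and one object of class 1.
    box_label = empty_label(2, 4, 4)
    add_box(box_label, 1, 0, 0, 3, 3)  # 3x3 box covers 9 pixels

    mask_label = empty_label(2, 4, 4)
    cross = [[0, 1, 0, 0],
             [1, 1, 1, 0],
             [0, 1, 0, 0],
             [0, 0, 0, 0]]
    add_mask(mask_label, 1, cross)     # cross-shaped mask covers 5 pixels

    print(channel_area(box_label, 1), channel_area(mask_label, 1))
```

For the same object, the mask input marks a subset of the pixels that its bounding box would mark, so the label encoding function receives a tighter description of the object's extent.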

5 Conclusions

In this paper, we propose a new training pipeline for object detection systems. We design a label encoding function and use it to introduce intermediate supervision on the detection backbone. Our method is generally applicable and efficient, adding no extra cost at inference time. To demonstrate its effectiveness, we evaluate it on a variety of detection models and obtain consistent improvements.


References

  • [1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid (2013) Label-embedding for attribute-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 819–826. Cited by: §2.0.3.
  • [2] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.4.4.
  • [3] S. Bengio, J. Weston, and D. Grangier (2010) Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems, pp. 163–171. Cited by: §2.0.3.
  • [4] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162. Cited by: §1.
  • [5] R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §2.0.1.
  • [6] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker (2017) Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pp. 742–751. Cited by: §2.0.2.
  • [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §1.
  • [8] Y. Chen, N. Wang, and Z. Zhang (2018) Darkrank: accelerating deep metric learning via cross sample similarities transfer. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.0.2.
  • [9] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773. Cited by: §1, §4.2.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §3.2.
  • [11] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §1, §1.
  • [12] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §1.
  • [13] K. He, R. Girshick, and P. Dollár (2019) Rethinking imagenet pre-training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4918–4927. Cited by: §1.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1, §1, §2.0.1, §4.5.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37 (9), pp. 1904–1916. Cited by: §1.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §3.4.2, §4.2.
  • [17] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §2.0.2.
  • [18] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.4.2.
  • [19] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu (2015) Deeply-supervised nets. In Artificial intelligence and statistics, pp. 562–570. Cited by: §2.0.1.
  • [20] W. Lee, J. Na, and G. Kim (2019) Multi-task self-supervised object detection via recycling of bounding box annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4984–4993. Cited by: §2.0.1.
  • [21] Q. Li, S. Jin, and J. Yan (2017) Mimicking very efficient network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6356–6364. Cited by: 2nd item, §2.0.2.
  • [22] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun (2018) Detnet: a backbone network for object detection. arXiv preprint arXiv:1804.06215. Cited by: §1.
  • [23] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §1, §1, §1, §3.4.3, §4.1, §4.2, Table 1.
  • [24] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §1, §1, §1, §3.4.3, §4.1, §4.2, §4.3.3, Table 1.
  • [25] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1, §3.4.1, §3.4.2, §4.1.
  • [26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1, §1.
  • [27] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang (2016) Face model compression by distilling knowledge from neurons. In Thirtieth AAAI conference on artificial intelligence, Cited by: §2.0.2.
  • [28] M. Mostajabi, M. Maire, and G. Shakhnarovich (2018) Regularizing deep networks by modeling and predicting label structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5629–5638. Cited by: §2.0.3.
  • [29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §4.1.
  • [30] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1, §1.
  • [31] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §1, §1.
  • [32] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §1, §2.0.1.
  • [33] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550. Cited by: §2.0.2.
  • [34] Z. Shen, Z. Liu, J. Li, Y. Jiang, Y. Chen, and X. Xue (2017) Dsod: learning deeply supervised object detectors from scratch. In Proceedings of the IEEE international conference on computer vision, pp. 1919–1927. Cited by: 2nd item, §1, §2.0.1.
  • [35] X. Sun, B. Wei, X. Ren, and S. Ma (2017) Label embedding network: learning label representation for soft training of deep networks. arXiv preprint arXiv:1710.10393. Cited by: §2.0.3.
  • [36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §2.0.1.
  • [37] Z. Tang, D. Wang, and Z. Zhang (2016) Recurrent neural network training with dark knowledge transfer. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5900–5904. Cited by: §2.0.2.
  • [38] Z. Tian, C. Shen, H. Chen, and T. He (2019) Fcos: fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9627–9636. Cited by: §1, §1, §3.4.3, §4.2, Table 1.
  • [39] T. Wang, L. Yuan, X. Zhang, and J. Feng (2019) Distilling object detectors with fine-grained feature imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4933–4942. Cited by: §2.0.2.
  • [40] X. Wang, R. Zhang, Y. Sun, and J. Qi (2018) Kdgan: knowledge distillation with generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 775–786. Cited by: §2.0.2.
  • [41] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515. Cited by: §2.0.1.
  • [42] S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §2.0.2.
  • [43] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun (2018) Exfuse: enhancing feature fusion for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–284. Cited by: §2.0.1.
  • [44] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §2.0.1.
  • [45] R. Zhu, S. Zhang, X. Wang, L. Wen, H. Shi, L. Bo, and T. Mei (2019) ScratchDet: training single-shot object detectors from scratch. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2268–2277. Cited by: §1.
  • [46] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets v2: more deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9308–9316. Cited by: 2nd item, §1, §2.0.2.


Appendix 0.A Architecture of Label Encoding Function

Stage    Block      Kernel Size   Stride   Output Channels
Stage1   Conv       7×7           2        128
Stage2   ResBlock   1×1           1        64
                    3×3           2        64
                    1×1           1        256
Stage3   ResBlock   1×1           1        128
                    3×3           2        128
                    1×1           1        512
         ResBlock   1×1           1        128
                    3×3           1        128
                    1×1           1        512
Stage4   ResBlock   1×1           1        256
                    3×3           2        256
                    1×1           1        1024
         ResBlock   1×1           1        256
                    3×3           1        256
                    1×1           1        1024
Stage5   ResBlock   1×1           1        512
                    3×3           2        512
                    1×1           1        2048
Table 8: Architecture of our label encoding function. It has 19 layers. Most stages have the same output channels and stride as in ResNet-50 and ResNet-101, which is convenient for later supervision. The exception is the first convolution, which has 80 input channels and 128 output channels instead of 3 and 64, to accommodate the COCO dataset, whose labels span 80 object categories. We also remove the max pooling, and we do not use batch normalization.
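
The layer count and strides stated in the caption of Table 8 can be checked with a few lines of Python. The sketch below transcribes the table into a list (the tuple layout and the helper name `summarize` are illustrative, not part of our implementation) and assumes each ResBlock is the three-convolution bottleneck listed in the table.

```python
# (stage name, number of ResBlocks, stage stride, output channels), from Table 8.
# A block count of 0 marks the single 7x7 convolution of Stage1.
STAGES = [
    ("Stage1", 0, 2, 128),
    ("Stage2", 1, 2, 256),
    ("Stage3", 2, 2, 512),
    ("Stage4", 2, 2, 1024),
    ("Stage5", 1, 2, 2048),
]


def summarize(stages, convs_per_block=3):
    """Return (total conv layers, [(stage, cumulative stride, channels), ...])."""
    total_layers = 0
    stride = 1
    summary = []
    for name, num_blocks, stage_stride, channels in stages:
        total_layers += 1 if num_blocks == 0 else num_blocks * convs_per_block
        stride *= stage_stride
        summary.append((name, stride, channels))
    return total_layers, summary


if __name__ == "__main__":
    layers, summary = summarize(STAGES)
    print("conv layers:", layers)
    for name, stride, channels in summary:
        print(f"{name}: output stride {stride}, {channels} channels")
```

Stages 2 to 5 end up with output strides 4, 8, 16 and 32 and with 256, 512, 1024 and 2048 channels, the same as the corresponding ResNet-50 stages. Because the max pooling is removed, the stride-2 convolution in Stage2 supplies the downsampling that the pooling layer provides in ResNet.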

Appendix 0.B Derivation of Eq. (7, 8)

In Sec. 1 and Sec. 3 we introduce our model as follows:


where the optimal weights of the label encoding function () are derived from:


Clearly, there exist nested dependencies on the two variables and . Thus the above equations are infeasible to compute directly.

Notice that in Eq. 11, actually acts as a constant in the optimization. We define a function as follows:


Compared with Eq. 11, we use , instead of and respectively to distinguish them. It is easy to see that . Then we can rewrite Eq. 12 as follows:




Eq. 14 suggests that we need to find a certain satisfying that the optimal point of the partial function is also , i.e. . It motivates us to approximate with the following optimization, since Eq. 14 is nontrivial to compute directly:


which derives our formulations in the text.

Appendix 0.C Feature Visualization

In this section we analyze our method by visualizing feature maps. We pick the second layer of the multi-scale feature maps from RetinaNet and use images from the validation set. We visualize each feature map by its intensity, i.e., the norm of the feature vector at each pixel: the larger the norm (the stronger the intensity), the brighter the pixel appears in the figure. Visualization results are shown in Fig. 3. The four columns are: (a) the original images; (b) features from the baseline model; (c) features from our model; (d) features from our encoding function, which are the optimization target for (c). Compared with the baseline features, which have clear boundaries along the outline of each object, our features are closer to the boxes. Under the supervision from (d), the features extend beyond the object outline and “try to reach the box edges”. We believe this is beneficial for later instance extraction by the detection head. Note that Fig. 3 shows only a spatial projection of the features; information across channels is not visible here.

Figure 3: Visualization of features from the baseline and from our model. (a) The original images. (b) Features from the baseline. (c) Features from our model. (d) Features from our encoding function.
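
The intensity maps described above can be sketched in a few lines of Python. This is a minimal illustration rather than our visualization code: it assumes the L2 norm over channels (the text only says "norm"), represents a C×H×W feature map as nested lists, and uses a simple min-max rescaling to turn norms into brightness. The helper names `intensity_map` and `to_brightness` are hypothetical.

```python
import math


def intensity_map(feature):
    """Per-pixel L2 norm over channels of a C x H x W nested-list feature map."""
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [math.sqrt(sum(feature[c][y][x] ** 2 for c in range(channels)))
         for x in range(width)]
        for y in range(height)
    ]


def to_brightness(intensity):
    """Linearly rescale intensities to [0, 1]; a larger norm gives a brighter pixel."""
    flat = [v for row in intensity for v in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(v - lo) / scale for v in row] for row in intensity]


if __name__ == "__main__":
    # Toy feature map with 2 channels and 2x2 spatial size.
    feat = [
        [[3.0, 0.0], [0.0, 0.0]],
        [[4.0, 0.0], [0.0, 1.0]],
    ]
    print(to_brightness(intensity_map(feat)))
```

Collapsing the channel dimension in this way is what makes the result a purely spatial projection: two pixels with the same norm look identical even if their channel patterns differ.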