Pedestrian detection is a long-standing topic in computer vision. It aims to locate and classify pedestrians of various sizes and aspect ratios. It plays a crucial role in many applications, ranging from video surveillance to autonomous driving, which usually demand both fast and accurate detection.
Recent years have seen a tremendous increase in the accuracy of pedestrian detection, driven by deep convolutional neural networks (CNNs) [10, 20, 8]. CNN-based object detectors [5, 6, 17, 12], such as Faster R-CNN, have become the mainstream approach to object detection. With increasingly deep architectures, these models achieve better performance, but usually at a growing computational expense. Consequently, the speed-accuracy trade-off makes it difficult to apply these cumbersome detectors in real-world scenarios. In this work, we aim to learn a pedestrian detector with both fast speed and satisfactory accuracy through hierarchical knowledge distillation.
Knowledge distillation aims to improve a lightweight model’s performance by learning from a well-trained but cumbersome model. It is usually formulated as a teacher-student architecture, which treats the lightweight model as the student and the cumbersome model as the teacher. Distillation was first applied to classification tasks, but recent works show its great potential in detection tasks [21, 11, 3]. Li et al. propose a feature mimicking framework for training efficient object detectors, which relieves the detector training pipeline of ImageNet pre-training. They apply distillation by adding supervision to high-level features, which helps the small network learn better object representations during training. Chen et al. use the final outputs of the teacher’s region proposal network and region classification network as distillation targets, and also adopt an intermediate supervision to improve the student’s performance.
However, these works have several shortcomings. Li et al. consider only high-level features for distillation, which loses much spatial information, so their model cannot easily detect pedestrians with severe scale variations. Chen et al. also use an intermediate supervision as a distillation target, but their student detectors still lack supervision from multiple levels, which is crucial to knowledge distillation.
In this paper, we propose a unified hierarchical knowledge distillation framework for the pedestrian detection task. Firstly, instead of adding supervision only to the final feature map, we perform knowledge distillation at multiple layers, ranging from low-level local details to high-level semantic abstractions, which we refer to as pyramid distillation. Secondly, we perform pyramid RoIAlign to extract features from multi-level feature maps, followed by a concatenation operation, and apply distillation on all levels, which enables the student to see more feature levels with a wider receptive field. The intuition is that a student should learn spatial details as well as high-level representations to become a comprehensive pedestrian detector. Additionally, multi-level distillation can serve as intermediate supervision that improves the gradient flow during backpropagation. Finally, distillation is also performed on the logit features, with which our approach forms a comprehensive, unified distillation framework. Our contributions are threefold:
We propose a hierarchical distillation technique, which adds supervision to multi-level feature maps, for learning better object representations with both strong semantic abstractions and precise spatial responses.
We train a lightweight pedestrian detector through our unified, hierarchical distillation framework, which achieves competitive performance on the widely used Caltech pedestrian detection benchmark.
2 Related Work
This section briefly discusses previous works that are most related to the proposed framework in object detection and knowledge distillation.
CNNs for Detection. Recently, CNN-based object detectors [5, 6] have become the mainstream for detection tasks. Regular CNN detectors can be categorized into single-stage [16, 14, 13] and two-stage [17, 12, 19] detectors. Faster R-CNN is a typical two-stage detector, which generates region proposals in the first stage and classifies them in the second stage. The Feature Pyramid Network (FPN) introduces a top-down architecture with lateral connections for building high-level semantic feature maps at all scales.
Knowledge Distillation in Detection. Knowledge distillation aims to improve a lightweight model’s performance by learning from a well-trained but cumbersome model [9, 18]. Li et al. propose a feature mimicking framework based on high-level feature distillation to train efficient detectors without ImageNet pre-training. Wei et al. combine feature mimicking with quantization, which helps the student network better match the feature maps of the teacher network for more efficient feature distillation. Chen et al. design a distillation framework for detection based on final output distillation and a single intermediate supervision.
In this section, we first present an overview of our framework, and then introduce our hierarchical knowledge distillation, which is achieved with multiple supervisions. Firstly, we perform distillation on multiple feature levels in the pyramid, referred to as Pyramid Distillation (PD). Secondly, we perform another distillation on the output proposals, referred to as Region Distillation (RD), enabling the student to focus on the positive regions. Finally, we add a final distillation at the very end of the detector, referred to as Logit Distillation (LD).
3.1 Framework overview
Our detection framework is built on FPN, which introduces top-down connections that join different backbone levels to construct a pyramid of deep features. A region proposal network (RPN) is then applied across all pyramid levels to generate an over-complete set of proposals. Following the RPN, we crop features according to the proposals and feed them to a regional classifier to obtain the final detection results (class labels and boxes). In the training stage, we adopt a cross-entropy loss for classification and a regression loss for the boxes. The RPN is trained following Ren et al.
In this work, we adopt ResNet18 as the student model and ResNet50 as the teacher model. In the training stage, as illustrated in Fig. 1, the input images are fed to both the student and the teacher model for feature extraction, and pyramid features are generated for both of them. Then, the regions of interest (RoIs) generated by the student are used to crop regions from the pyramids of both the student and the teacher detector. Following the RPN and region cropping, two fully connected (FC) layers and a pair of sibling output layers compute the final classes and boxes.
3.2 Hierarchical knowledge distillation
Pyramid Distillation, Region Distillation and Logit Distillation form our comprehensive, hierarchical teacher-student learning architecture.
3.2.1 Pyramid distillation
Inputs are first fed into FPNs to generate feature pyramids for both the student and the teacher model. Different levels of the feature pyramid often carry information with different semantic meanings. The lowest level contains only local details, such as edges and contours. Going deeper in the pyramid, the features become more abstract; the highest level, for instance, carries stronger semantic meanings, e.g. object parts. Rather than letting the student focus only on learning the abstractions, it is equally important for it to learn the details. Previous methods often omit this learning objective. To tackle this problem, we propose pyramid distillation for learning better object representations with both strong abstractions and precise details. As shown in Fig. 2 (a), Pyramid Distillation is defined as:
Here, the normalizer is the total number of locations in the pyramid, and the two feature terms are a given level of the student’s feature pyramid and the same level of the teacher’s feature pyramid, respectively.
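To make the description above concrete, the following is a minimal NumPy sketch of a pyramid-level distillation term. It assumes a squared-difference distance, spatial positions as the "locations" used for normalization, and matching channel counts between student and teacher at every level; none of these choices are confirmed by the text, and the function name `pyramid_distillation_loss` is hypothetical.

```python
import numpy as np


def pyramid_distillation_loss(student_pyramid, teacher_pyramid):
    """Squared feature difference summed over all pyramid levels and divided
    by the total number of spatial locations in the pyramid.

    Assumption: each level is a (C, H, W) array and the student and teacher
    share the same shape at every level.
    """
    total_sq_err = 0.0
    num_locations = 0
    for f_s, f_t in zip(student_pyramid, teacher_pyramid):
        assert f_s.shape == f_t.shape, "levels must have matching shapes"
        total_sq_err += float(np.sum((f_s - f_t) ** 2))
        num_locations += f_s.shape[-2] * f_s.shape[-1]  # H * W of this level
    return total_sq_err / num_locations


# Toy pyramid with three levels of decreasing resolution (shapes are illustrative).
rng = np.random.default_rng(0)
level_shapes = [(4, 32, 32), (4, 16, 16), (4, 8, 8)]
teacher_pyramid = [rng.standard_normal(s) for s in level_shapes]
student_pyramid = [t + 0.1 for t in teacher_pyramid]  # student is off by 0.1 everywhere

pd_loss = pyramid_distillation_loss(student_pyramid, teacher_pyramid)
```

Because every level contributes to the same sum, low-resolution semantic levels and high-resolution detail levels are supervised jointly rather than only the final feature map.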
3.2.2 Region distillation
RoIAlign was first introduced in Mask R-CNN. There, a single level of the feature pyramid is first selected according to the area of each RoI, and the region feature is then cropped from that level, as shown in Fig. 2 (b). In this work, for each RoI we crop features from all levels of the feature pyramid, as shown in Fig. 2 (a). The resulting region feature contains both low-level details and high-level abstractions. Based on this Pyramid RoIAlign, we define Region Distillation as:
Here, the normalizer is the total number of locations in the regions.
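The idea of cropping each RoI from every pyramid level and distilling the concatenated result can be sketched as follows. This is a simplified illustration, not the paper's implementation: it replaces RoIAlign's bilinear sampling with plain average pooling over bins, assumes a squared-difference distance, and uses hypothetical names (`crop_and_pool`, `pyramid_roi_features`, `region_distillation_loss`) and toy strides.

```python
import numpy as np


def crop_and_pool(feature, box, stride, out_size=7):
    """Crop a box (x1, y1, x2, y2 in image pixels) from a (C, H, W) feature map
    and average-pool it into a (C, out_size, out_size) grid.
    Simplification: average pooling stands in for RoIAlign's bilinear sampling.
    """
    c, h, w = feature.shape
    x1, y1, x2, y2 = [v / stride for v in box]  # map the box to feature coordinates
    xs = np.linspace(x1, x2, out_size + 1)      # bin edges along x
    ys = np.linspace(y1, y2, out_size + 1)      # bin edges along y
    out = np.zeros((c, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            r0 = int(np.clip(np.floor(ys[i]), 0, h - 1))
            r1 = int(np.clip(np.ceil(ys[i + 1]), r0 + 1, h))
            c0 = int(np.clip(np.floor(xs[j]), 0, w - 1))
            c1 = int(np.clip(np.ceil(xs[j + 1]), c0 + 1, w))
            out[:, i, j] = feature[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out


def pyramid_roi_features(pyramid, strides, box, out_size=7):
    """Crop the same RoI from every level and concatenate along channels."""
    crops = [crop_and_pool(f, box, s, out_size) for f, s in zip(pyramid, strides)]
    return np.concatenate(crops, axis=0)


def region_distillation_loss(student_pyramid, teacher_pyramid, strides, boxes):
    """Squared difference of region features, divided by the total number of
    spatial locations over all regions (an assumed normalization)."""
    total, num_locations = 0.0, 0
    for box in boxes:
        r_s = pyramid_roi_features(student_pyramid, strides, box)
        r_t = pyramid_roi_features(teacher_pyramid, strides, box)
        total += float(np.sum((r_s - r_t) ** 2))
        num_locations += r_s.shape[1] * r_s.shape[2]
    return total / num_locations


# Toy setup: a 64x64 image, three levels with strides 4, 8, 16 and 4 channels each.
rng = np.random.default_rng(0)
strides = [4, 8, 16]
teacher_pyramid = [rng.standard_normal((4, 64 // s, 64 // s)) for s in strides]
student_pyramid = [t + 0.1 for t in teacher_pyramid]
boxes = [(8, 8, 40, 56)]  # RoIs proposed by the student, in image pixels

region_feature = pyramid_roi_features(teacher_pyramid, strides, boxes[0])
rd_loss = region_distillation_loss(student_pyramid, teacher_pyramid, strides, boxes)
```

The same student-generated boxes are applied to both pyramids, so the two region features are spatially aligned before they are compared.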
3.2.3 Logit distillation
Additionally, we add a final distillation in the training stage. In the second stage of the detector, we apply Logit Distillation to the fully connected layer before the detection output, which forces the student to mimic the teacher’s final behavior. It is defined as:
Here, the normalizer is the number of proposals in the second-stage detector.
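A logit-level term normalized by the number of proposals might look like the sketch below. The squared-difference distance and the function name `logit_distillation_loss` are assumptions made for illustration; the text only states that the student mimics the teacher's FC-layer output and that the term is normalized by the proposal count.

```python
import numpy as np


def logit_distillation_loss(student_logits, teacher_logits):
    """Squared difference summed over the feature dimension and averaged over
    proposals. Both inputs are (num_proposals, dim) arrays computed on the
    same set of proposals.
    """
    num_proposals = student_logits.shape[0]
    return float(np.sum((student_logits - teacher_logits) ** 2)) / num_proposals


# Three proposals with two-dimensional outputs (toy values).
teacher_logits = np.array([[2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]])
student_logits = teacher_logits + np.array([0.5, -0.5])

ld_loss = logit_distillation_loss(student_logits, teacher_logits)  # 0.5
```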
3.2.4 Final hierarchical distillation objective
The final objective of our proposed hierarchical knowledge distillation is:
Here, the weighting factors balance all the objectives so that they are of the same magnitude.
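As a rough sketch of how such a weighted objective is typically assembled, the snippet below adds the three distillation terms, each scaled by its own factor, to the student's detection loss. Including the detection loss in the sum, the argument names, and the toy values are all assumptions for illustration and are not taken from the paper.

```python
def hierarchical_objective(det_loss, pd_loss, rd_loss, ld_loss, w_pd, w_rd, w_ld):
    """Weighted sum of the detection loss and the three distillation terms.
    The weights are chosen so that the scaled terms have similar magnitude.
    """
    return det_loss + w_pd * pd_loss + w_rd * rd_loss + w_ld * ld_loss


# Toy values: each weight rescales its term to roughly the detection-loss scale.
total_loss = hierarchical_objective(
    det_loss=1.0, pd_loss=2.0, rd_loss=0.5, ld_loss=0.25,
    w_pd=0.5, w_rd=2.0, w_ld=4.0,
)  # 1.0 + 1.0 + 1.0 + 1.0 = 4.0
```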
In this section, we first introduce the dataset, evaluation metrics, implementation details and overall performance. Then, we perform an ablation study to validate the contributing factors proposed in this work. Finally, we compare our unified framework with other state-of-the-art methods.
Dataset. We evaluate our unified, hierarchical knowledge distillation framework on the Caltech dataset. This pedestrian detection dataset contains 2.5 hours of video captured from a moving vehicle. We train our model on Caltech10, which samples ten times as many frames from the videos as the original training set while still using the original annotations. For evaluation, we test the model on the standard test set and report results under the Reasonable configuration.
Evaluation Metrics. Following Dollar et al., we use the log-average miss rate (MR) over a range of false positives per image (FPPI) as the evaluation metric. We report MRs on different subsets, including the reasonable subset and the small subset.
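For readers unfamiliar with the metric, a minimal sketch of the log-average miss rate is given below. It assumes the standard Caltech protocol of nine reference points evenly spaced in log space over the FPPI range [10^-2, 10^0], with the miss rate taken at the last curve point whose FPPI does not exceed each reference; the FPPI range itself is not stated in the text above, and the function name is hypothetical.

```python
import numpy as np


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate sampled at log-spaced FPPI references.

    fppi must be sorted in ascending order, with miss_rate giving the miss
    rate at each FPPI value. If the curve has no point at or below a
    reference FPPI, the miss rate there is taken to be 1.0.
    """
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    refs = np.logspace(-2.0, 0.0, num_points)
    sampled = []
    for ref in refs:
        idx = np.where(fppi <= ref)[0]
        sampled.append(miss_rate[idx[-1]] if idx.size else 1.0)
    sampled = np.maximum(np.array(sampled), 1e-10)  # guard against log(0)
    return float(np.exp(np.mean(np.log(sampled))))


# A flat curve: the log-average equals the constant miss rate.
flat_mr = log_average_miss_rate([0.001, 0.01, 0.1, 1.0], [0.5, 0.5, 0.5, 0.5])

# A curve that never reaches FPPI <= 1: every reference falls back to 1.0.
empty_mr = log_average_miss_rate([2.0], [0.1])
```

Averaging in log space weights the low-FPPI end of the curve as heavily as the high-FPPI end, which is why lower values indicate better detectors across the whole operating range.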
Implementation Details. Our model is trained for 6 epochs using an SGD optimizer with an initial learning rate of 0.002, which we decrease by a factor of 0.1 at the 4th and 6th epochs. The input image is resized to 1.5 times the original image size. We adopt random horizontal flipping as the only data augmentation. In the final distillation objective, the three balancing factors are set to 0.5, 30 and 30, respectively.
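The step schedule described above can be written out as a short sketch. It assumes 1-indexed epochs and that each decay takes effect at the start of the named epoch, which the text does not specify; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.002, milestones=(4, 6), gamma=0.1):
    """Step decay: multiply the base rate by gamma once for every milestone
    epoch that has been reached (epochs are 1-indexed)."""
    num_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** num_decays


# Learning rate for each of the 6 training epochs.
schedule = [learning_rate(e) for e in range(1, 7)]
# Epochs 1-3 use 0.002, epochs 4-5 use about 0.0002, epoch 6 uses about 0.00002.
```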
Overall performance. As shown in Table 1, when trained with our proposed distillation framework, our student model has 6 times fewer parameters than the teacher model yet still achieves competitive performance, with 10.03% MR on the reasonable subset and 12.28% MR on the small subset.
4.1 Ablation study
In Table 2, we validate the contributing factors proposed in this work, including Pyramid Distillation (PD), Region Distillation (RD), Logit Distillation (LD) and Pyramid RoIAlign (PyRoIAlign). Compared with the baseline (2nd row of Table 2), using LD alone (3rd row) as supervision improves performance by 1.09%. Using only RD (5th row) yields a larger improvement of 1.70%. This suggests that the region feature carries more supervision than the final output logit, because the dimension of the logit is significantly reduced after passing through two FC layers. PyRoIAlign is critical to our distillation framework and boosts performance by a notable margin: using RD without PyRoIAlign (4th row) improves performance by only 0.57%, which is 1.13% less than with PyRoIAlign enabled (5th row). This degradation is caused by using supervision from only single-level features for distillation, as discussed in Sec 3.2.1. Moreover, combining supervision from both RD and LD (6th row) gives a notable improvement of 1.78%. Furthermore, adding supervision from PD forms our comprehensive, hierarchical distillation framework (8th row), which improves over the baseline by 2.49% and demonstrates the effectiveness of the proposed framework. Compared with the experiment without PD (6th row), this experiment (8th row) also shows that PD is important to our unified framework, contributing a 0.71% improvement. Most importantly, this configuration (8th row) gives the best performance of our student model, 10.03% MR on the reasonable subset. This is very competitive with our teacher model, which further supports the effectiveness of our comprehensive, hierarchical distillation framework.
4.2 Comparison with other methods
In Table 3, we compare our proposed method with other state-of-the-art methods on the Caltech dataset. Our student model outperforms some state-of-the-art methods by a notable margin, even though they use a stronger network architecture (VGG-16) than ours. Although our student model has far fewer parameters, it still achieves 1.7% lower MR than Cai et al., 1.5% lower MR than Ouyang et al., and 0.3% lower MR than Zhang et al. Even with limited representation ability, our model remains competitive with other state-of-the-art methods [1, 22].
We propose a comprehensive, hierarchical knowledge distillation framework for training a lightweight pedestrian detector. Compared with single-level supervision, our unified framework utilizes multiple intermediate supervisions for distillation, which significantly improves the efficiency of transferring knowledge from a teacher model to a student model. In addition, our hierarchical distillation framework helps the student model learn better representations from multi-level feature maps with both abstractions and details. Experimental results on the Caltech pedestrian detection benchmark demonstrate the effectiveness of our proposed framework.
[1] (2016) A unified multi-scale deep convolutional neural network for fast object detection. In ECCV.
[2] (2015) Learning complexity-aware cascades for deep pedestrian detection. In CVPR.
[3] (2017) Learning efficient object detection models with knowledge distillation. In NeurIPS.
[4] (2012) Pedestrian detection: an evaluation of the state of the art. TPAMI 34 (4), pp. 743–761.
[5] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.
[6] (2015) Fast R-CNN. In ICCV.
[7] (2017) Mask R-CNN. In ICCV.
[8] (2016) Deep residual learning for image recognition. In CVPR.
[9] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[10] (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS.
[11] (2017) Mimicking very efficient network for object detection. In CVPR.
[12] (2017) Feature pyramid networks for object detection. In CVPR.
[13] (2018) Focal loss for dense object detection. TPAMI.
[14] (2016) SSD: single shot multibox detector. In ECCV.
[15] (2018) Jointly learning deep features, deformable parts, occlusion and classification for pedestrian detection. TPAMI 40 (8), pp. 1874–1887.
[16] (2016) You only look once: unified, real-time object detection. In CVPR.
[17] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS.
[18] (2015) FitNets: hints for thin deep nets. In ICLR.
[19] (2018) ZoomNet: deep aggregation learning for high-performance small pedestrian detection. In ACML.
[20] (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
[21] (2018) Quantization mimic: towards very tiny CNN for object detection. In ECCV.
[22] (2016) Is Faster R-CNN doing well for pedestrian detection? In ECCV.
[23] (2016) How far are we from solving pedestrian detection? In CVPR.
[24] (2018) Occluded pedestrian detection through guided attention in CNNs. In CVPR.