Knowledge Distillation for End-to-End Person Search

by Bharti Munjal, et al.

We introduce knowledge distillation for end-to-end person search. End-to-end methods, which solve detection and re-identification jointly, are the current state of the art for person search, yet their largest drop in performance stems from a sub-optimal detector. We propose two distinct approaches for the extra supervision of end-to-end person search methods in a teacher-student setting. The first is adopted from state-of-the-art knowledge distillation in object detection: we employ it to supervise the detector of our person search model at various levels using a specialized detector. The second approach is new, simple and yet considerably more effective: it distills knowledge from a teacher re-identification technique via a pre-computed look-up table of ID features. This relaxes the learning of identification features and allows the student to focus on the detection task. The procedure not only fixes the sub-optimal detector training in the joint optimization, thereby improving person search, but also closes the performance gap between teacher and student for model compression. Overall, we demonstrate significant improvements for two recent state-of-the-art methods using our proposed knowledge distillation approach on two benchmark datasets. Moreover, on the model compression task our approach brings the performance of smaller models on par with the larger ones.



1 Introduction

Person search is the complex (multi-)task of jointly localizing people and verifying their identity against a provided query person ID. Person search has recently gained attention [Liu2017NPSM, xiao2017joint, Xu2014PSS, zheng2016prw], also thanks to its numerous applications, including cross-camera visual tracking, person verification, and surveillance. The two tasks in person search, i.e. detection and re-identification, are arguably in contrast with each other: detection should disregard any specific nuance of individuals and just retrieve any person, while re-identification should focus on these nuances, to distinguish individuals and retrieve the queried person. This dispute is reflected in the literature, which argues that detection and re-identification in person search should be addressed either separately [Xu2014PSS, zheng2016prw, Chen_2018_ECCV] or jointly [Liu2017NPSM, xiao2017joint, Xiao2017IANTI, munjal2019cvpr].

Most recent state-of-the-art work in person search [xiao2017joint, munjal2019cvpr] demonstrates the benefits of end-to-end optimization. These approaches add the re-identification task onto the Faster R-CNN [ren2015faster] detection framework and learn both objectives jointly. As a result, the detector performance degrades, as illustrated in Fig. 1, yet the joint model remains state-of-the-art w.r.t. person search performance.

In this work, we propose knowledge distillation to address the sub-optimal detector performance in end-to-end person search, and model compression to reach the same level of performance with a smaller model. Knowledge distillation [NIPS2015_distilling] stems from the observation that training a neural network from labelled data requires a large amount of over-parameterization, as is the case for the teacher, generally a large and accurate model. The teacher's supervisory training signal enables the training of a typically smaller student [YangAAAI18teachers].

Figure 1: Joint detection and re-identification provides the best performance [xiao2017joint, munjal2019cvpr], but analysis shows that detection is harmed and limits performance. Cyan (dot-marker) and blue (star-marker) curves illustrate the OIM [xiao2017joint] detection and person search performance respectively, when varying the relative training importance via the weight $\lambda$, cf. (3). Both for ResNet50 (left) and ResNet18 (right), the more weight is given to the re-id task, the more the detector is harmed (decreasing cyan curve), remaining below the original Faster R-CNN performance (black dashed line). A teacher-student framework for the detection part of OIM, OIM + $KD_{det}$ (magenta), improves detection (light magenta, dot marker) but harms re-id, visible in the drop of person search performance (dark magenta, star marker). The new knowledge distillation for the re-id part, OIM + $KD_{LUT}$ (green), improves overall person search (dark green, star-marker) while also keeping the original detector performance.

Here we propose two distinct ways to distill knowledge and supervise the two joint tasks of person search, i.e. detection and re-identification. We build upon the end-to-end approach of OIM [xiao2017joint], the de-facto building block of the most recent person search approaches [munjal2019cvpr, Xiao2017IANTI, Liu2017NPSM]. For the detection part, we adopt the teacher-student framework of [NIPS2017_6676], which employs a multi-task loss for intermediate feature guidance, region proposals, and the classification and regression outputs. For the re-identification part, we propose a new, simple and yet effective strategy that relieves the student from learning a look-up table (LUT) for the identities in the training set. Instead, we copy and fix a pre-computed teacher LUT, which relaxes the student's task of identification feature learning. We test our new distilled person search models on the CUHK-SYSU [xiao2017joint] and PRW-mini [zheng2016prw, munjal2019cvpr] datasets, extending both the baseline OIM [xiao2017joint] technique and the most recent query-based method QEEPS [munjal2019cvpr]. We demonstrate performance improvement in all cases, achieving 85.0% mAP on CUHK-SYSU and 39.7% mAP on PRW-mini. Notably, the same approach allows us to train a smaller student network, therefore realizing model compression [bucilaKDD06]. In fact, we show that a ResNet18 student still provides 84.1% mAP on CUHK-SYSU with only 46% of the parameters of the larger ResNet50 teacher.

We summarize our contributions as follows: i. we introduce knowledge distillation for person search and propose two distinct teacher-student frameworks, i.e. for the detector and for re-identification parts; ii. we integrate our approach into the OIM [xiao2017joint] and QEEPS [munjal2019cvpr] person search models; iii. we show significant improvement over baseline methods on both the CUHK-SYSU [xiao2017joint] and PRW-mini [zheng2016prw, munjal2019cvpr] datasets; iv. we also show that our knowledge distillation approach enables model compression without drop in performance.

2 Related Work

In this section, we first discuss the literature on the multiple tasks encompassed within person search. Then we review prior art on knowledge distillation and model compression.

Person Detection.  There is a large body of work on person detection, from methods based on hand-crafted features [Felzenszwalb2009ObjectDW, Dollr2014FastFP] to deep convolutional neural network (CNN) feature-learning methods [Girshick2015FastR, Girshick2014RichFH, Yang2015ConvolutionalCF]. The best CNN detectors are either one-stage [Liu2016SSDSS, Redmon2017YOLO9000BF] or two-stage [ren2015faster, Girshick2015FastR, Girshick2014RichFH]. The latter select object proposals via a region proposal network and then classify those into persons vs background. In line with OIM [xiao2017joint], we adopt Faster R-CNN with a ResNet [he2016resnet] backbone, due to its robustness and flexibility.

Person Re-Identification.  Person re-identification is the task of classifying the same individuals as provided by a query sample, within a gallery of cropped, centered and aligned persons. Earlier approaches focused on manual feature design [Wang2007ShapeAA, Gray2008ViewpointIP, Farenzena2010PersonRB, Zhao2013UnsupervisedSL] and metric learning [Liao2015PersonRB, Zhao2017PersonRB, Kstinger2012LargeSM, Li2015MultiScaleLF, Liao2015EfficientPC, Paisitkriangkrai2015LearningTR, Ali_2018_ECCV]. More recent re-identification approaches are based on CNNs [ahmed2015cvpr, li2014cvpr_deepreid, Yi2014DeepML, zhang2017alignedreid] and are mainly concerned with the estimation of an ID-feature embedding space, either via Siamese networks and contrastive losses [ahmed2015cvpr, li2014cvpr_deepreid, Liu2017EndtoEndCA, Varior2016ASL, Xu2018AttCompNet, Yi2014DeepML, Cheng2016PersonRB, Ding2015DeepFL, Chen2017BeyondTL], or with ID-classification and cross-entropy losses [Xiao2016LearningDF, Zheng2016MARSAV].

Person Search.  Xu et al. [Xu2014PSS] introduced person search as the task of finding a person in a set of non-cropped gallery images, given a crop of the queried person. Person search involves detecting people in gallery images as well as verifying their ID against the provided query-ID. It therefore encompasses the two tasks of detection and re-identification. Thanks to large-scale datasets (CUHK-SYSU [xiao2017joint] and PRW [zheng2016prw]), person search has witnessed progress, but it remains divided into approaches addressing the two tasks separately [Xu2014PSS, zheng2016prw, Chen_2018_ECCV, person_eccv18] vs jointly [Liu2017NPSM, xiao2017joint, Xiao2017IANTI, munjal2019cvpr, Yan_2019_CVPR]. We consider the two tasks jointly, since this was most recently proven beneficial [munjal2019cvpr].

End-to-End Person Search.  Xiao et al. [xiao2017joint] introduced a model for the end-to-end training of joint person search. They extended Faster R-CNN to estimate an ID embedding for re-identification, and introduced an Online Instance Matching (OIM) loss to effectively train it. IAN [Xiao2017IANTI] also proposed an end-to-end approach, using a center loss [Wen2016ADF] with the goal of improving the intra-class feature compactness. More recently, Munjal et al. [munjal2019cvpr] proposed query-guidance for OIM, dubbed QEEPS, i.e. conditioning person search on the non-cropped query image. We propose knowledge distillation for the end-to-end OIM [xiao2017joint] and demonstrate the generality of our approach by also applying it to the current state-of-the-art QEEPS [munjal2019cvpr].

Knowledge Distillation and Model Compression.  Knowledge distillation, proposed by [bucilua2006model, NIPS2014_5484] and popularized by [NIPS2015_distilling], aims to train a small neural network from a larger and more accurate one. It has gained attention for its promise of more effective training [Romero2015-iclr], better accuracy [belagiannisECCV2018] and efficiency [NIPS2014_5484, polino2018iclr, urban2016ArXiv, XuBMVC2018], but it remains largely limited to networks solving the single task of classification. When moving from classification to detection, as only a few works [DBLP:journals/corr/ShenVBK16, NIPS2017_6676, Li_2017_CVPR] have attempted, complex modelling questions arise as to where and how to supervise, which are not entirely answered yet. Such questions include the class importance imbalance, as for the background vs the other classes, and the implicit multi-task objective, since detection implies joint bounding box regression and classification. Here, we apply knowledge distillation to the even more complex multi-task problem of person search.

Distilling knowledge from a larger to a smaller network realizes model compression, i.e. training a smaller yet accurate network. This has also been addressed for classification via quantization [Zhou2017IncrementalNQ, Zhu2017TrainedTQ, Gong2014CompressingDC] and binarization [Rastegari2016XNORNetIC, Binarization_NIPS2015_5647] of floating point operations, network pruning [Carreira_2018_CVPR, Chen2015CompressingNN, Han2015LearningBW, Li2017PruningFF, Yu2018NISPPN] and factorization [Sironi2013LearningSF, Lebedev2015SpeedingupCN]. The proposed knowledge distillation is directly applicable and realizes model compression for person search for the first time.

3 Background - Online Instance Matching

The Online Instance Matching (OIM) loss and the end-to-end architecture proposed by Xiao et al. [xiao2017joint] are currently the de-facto standard for identification feature learning in end-to-end person search [Xiao2017IANTI, munjal2019cvpr, Liu2017NPSM]. The architecture of [xiao2017joint] is based on Faster R-CNN [ren2015faster] with a ResNet backbone [he2016resnet], a Region Proposal Network (RPN) and a Region Classification Network (RCN), as illustrated in Fig. 2 (blue region). In parallel to the RCN, [xiao2017joint] defines an ID Net, which provides an identification feature embedding. They introduce the OIM loss as an additional objective in the Faster R-CNN framework, focusing on the task of learning unique identity (ID) features for the image instances of the same person. This is accomplished by learning a lookup table (LUT) for all the identities in the training set. We refer to this approach as OIM in the text.

Figure 2: We propose two knowledge distillation approaches for person search. The first approach is motivated by [NIPS2017_6676] and uses the output of a specialized teacher detector (shown in green) to guide the detector of our person search student network (shown in blue). The second approach uses a copy of the LUT from a person search teacher (shown in yellow) and fixes it during the student's training, thereby relaxing the task of ID feature learning and allowing the student to focus on the detection task.

In more detail, given an ID feature $x \in \mathbb{R}^D$, OIM maintains a lookup table $V \in \mathbb{R}^{D \times L}$ of $D$-dimensional ID features corresponding to the $L$ labeled identities, and also a circular queue $U \in \mathbb{R}^{D \times Q}$ containing the most recent ID features of the unlabeled identities that appear in the recent mini-batches. During the forward pass of the network, the computed ID feature of each mini-batch sample is compared with all the entries in $V$ and $U$ to compute the OIM loss. The OIM objective is to minimize the negative expected log-likelihood: given the softmax probability $p_t$ of the positive ID $t$ in the mini-batch, the OIM loss is $\mathcal{L}_{OIM} = -\mathbb{E}_x[\log p_t]$.

During the backward pass, the entries of $V$ corresponding to the identities in the current mini-batch are updated with a moving average. The OIM detection and re-identification objectives are in conflict during the optimization, which results in a significantly sub-optimal detector performance, as illustrated in Fig. 1. Intuitively, one would expect that adjusting the relative weights of the two tasks might solve this problem. In fact, we notice that the relative weights do play a role: by decreasing the weight of the re-id task, the detector approaches the performance of the specialized detector (standalone Faster R-CNN). However, at the same time the re-id task becomes harder to train and its performance drops significantly.
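The forward comparison against the LUT and queue, and the backward moving-average update, can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the function names, the similarity temperature, and the momentum value are illustrative assumptions.

```python
import numpy as np

def oim_loss(x, target_id, lut, queue, temperature=0.1):
    """Sketch of the OIM loss for one L2-normalized, D-dim ID feature x.

    lut:   (L, D) lookup table V of labeled-identity features
    queue: (Q, D) circular queue U of recent unlabeled features
    Returns -log p_t, the negative log-likelihood of the true identity.
    """
    # Cosine similarities of x against all LUT and queue entries.
    logits = np.concatenate([lut @ x, queue @ x]) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_id])

def update_lut(lut, x, target_id, momentum=0.5):
    """Moving-average LUT update performed during the backward pass."""
    v = momentum * lut[target_id] + (1.0 - momentum) * x
    lut[target_id] = v / np.linalg.norm(v)      # keep entries L2-normalized
    return lut
```

Note that only `lut` rows of identities present in the mini-batch are touched; the queue is refreshed separately with unlabeled features.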

4 Knowledge Distillation in OIM

We propose two independent approaches for knowledge distillation in OIM for person search. First, we draw from the most recent literature on knowledge distillation for object detection to supervise the detector of the OIM model with the help of a specialized (hence stronger) person detector. The second approach relaxes the task of re-identification with the help of a pre-computed identification feature table; as a result, the detector becomes the focus of the optimization without compromising the quality of the identification features. Both proposed approaches contribute to recovering the sub-optimal detector performance, as we illustrate experimentally in Sec. 6. However, we also demonstrate that the two approaches are not complementary. In the following we discuss both approaches in detail.

4.1 KD using a Specialized Detector

As a first approach, we propose that the detector of the student model (OIM) mimic the superior output of a stronger detector (the teacher). The teacher in this case is a specialized Faster R-CNN with the same backbone architecture as the student. This approach stems from the belief that a better, dedicated detector can be a good teacher for the detector in the OIM model, without modifying the OIM training.

To this end, we adopt the approach of Chen et al. [NIPS2017_6676] for the supervision of the student at various levels, from the mimicking of the base features to the supervision of the detector. This corresponds to the green region in Fig. 2. As illustrated in the figure, we employ supervision for three different components of the Faster R-CNN [ren2015faster] object detection framework. Following [NIPS2017_6676], these components are: i. intermediate base feature representations, using hint-based learning [Romero2015-iclr]; ii. the RPN classification and regression modules, to produce better region proposals; iii. likewise, the RCN classification and regression modules, to generate stronger object detections. We refer to this approach as $KD_{det}$ in the text.

Following [NIPS2017_6676, Romero2015-iclr], the hint loss $L_{hint}$ is given as an $L_2$ loss between an intermediate layer $Z$ of the teacher BaseNet and an adapted intermediate layer $r(W)$ of the student BaseNet:

$L_{hint} = \lVert r(W) - Z \rVert_2^2$

The classification loss $L_{cls}$ and bounding box regression loss $L_{reg}$ for both the RPN and RCN sub-networks are given as:

$L_{cls} = \mu\, L_{hard}(P_s, y) + (1 - \mu)\, L_{soft}(P_s, P_t)$

$L_{reg} = L_{sL1}(R_s, y_{reg}) + \nu\, L_b(R_s, R_t, y_{reg})$

$L_{hard}$ and $L_{sL1}$ correspond to the losses w.r.t. the ground-truth labels [ren2015faster], while $L_{soft}$ and $L_b$ correspond to the losses w.r.t. the output of the teacher network. Motivated by [NIPS2017_6676], we use a soft cross-entropy loss for $L_{soft}$, where $P_t$ and $P_s$ are the softened teacher and student classification probabilities with temperature 10. For $L_b$ we use the teacher bounded regression loss as in [NIPS2017_6676]. The overall optimization objective for this approach is given as:

$L = L_{cls}^{RPN} + L_{reg}^{RPN} + L_{cls}^{RCN} + L_{reg}^{RCN} + \gamma\, L_{hint} + \lambda\, L_{OIM} \quad (3)$

whereby $L_{OIM}$ represents the OIM loss given in Sec. 3. For details on the contribution of each loss term (except $L_{OIM}$), we refer the reader to [NIPS2017_6676]. We set $\mu$, $\nu$ and $\gamma$ to 0.5 as in [NIPS2017_6676], while we investigate different values of $\lambda$ in Sec. 6 and also Fig. 1.
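The two distillation signals of this section, the temperature-softened classification loss and the hint loss, can be sketched as follows. This is a minimal NumPy sketch on single logit vectors, under stated simplifications: the per-class weighting of [NIPS2017_6676] and the learned adaptation layer are omitted, and the function names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def soft_cross_entropy(student_logits, teacher_logits, T=10.0):
    """L_soft: cross-entropy between temperature-softened distributions.
    T > 1 flattens both distributions, exposing the teacher's dark knowledge."""
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    return -np.sum(p_t * np.log(p_s))

def hint_loss(student_feat, teacher_feat):
    """L_hint: L2 loss between (adapted) student and teacher base features."""
    return np.sum((student_feat - teacher_feat) ** 2)
```

The soft cross-entropy is minimized when the student's softened distribution matches the teacher's, which is what drives the student toward the teacher's outputs.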

4.2 KD using Pre-Trained Re-ID Features

The OIM optimization includes an iterative estimation of i. the ID feature embeddings, and ii. the lookup table (LUT) used for the evaluation of these ID feature embeddings and the computation of the OIM loss. The LUT in the original model [xiao2017joint] is randomly initialized. It eventually converges over time; however, this iterative complexity impacts the learning of the parallel detection task.

We propose a new approach to person search with OIM, whereby we leverage knowledge distillation to relax one of the two tasks, i.e. estimating the LUT. In other words, distilling knowledge for the re-identification means that the student is no longer tasked with training the OIM LUT. Most importantly, what was originally an optimization goal becomes instead a supervisory signal, which eases the training and improves the performance.

To accomplish this, we fix the LUT of our student OIM model using a copy from a pre-trained OIM model. This approach to knowledge distillation is illustrated by the yellow components in Fig. 2. We refer to this approach as $KD_{LUT}$ in the text. The pre-computed LUT is fixed, hence not updated during the back-propagation step. The network thus obtains the optimal supervision for the ID features from the very first iteration.

The proposed $KD_{LUT}$ approach contrasts with $KD_{det}$, since it aims to simplify the learning of the ID features, while the latter focuses on directly supervising and improving the detector. Moreover, the proposed method does not add any overhead to the network training compared to the original OIM. In fact, it reduces the computation (FLOPS), since the LUT is fixed and we skip the step of updating it during back-propagation. By contrast, $KD_{det}$ additionally requires a forward pass over the teacher network.
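The frozen-LUT mechanics can be sketched as follows. This is a minimal NumPy sketch, with an illustrative function name and temperature: the student scores its ID feature against a teacher table that is copied once and never updated, so the moving-average step of standard OIM disappears.

```python
import numpy as np

def kd_lut_step(x, target_id, teacher_lut, temperature=0.1):
    """One student supervision signal under the frozen-LUT distillation (sketch).

    teacher_lut: (L, D) table copied from a pre-trained teacher and kept
    frozen. Unlike standard OIM, no moving-average update follows this step,
    which also removes the LUT-update computation from training.
    """
    logits = teacher_lut @ x / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_id])            # teacher_lut stays unchanged
```

Because the table is optimal from iteration one, the ID branch receives a stable target while the detector branch is free to dominate the optimization.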

Finally, we would like to emphasize that the proposed knowledge distillation is applicable to any OIM-based person search method, since we just require copying and fixing the LUT from a pre-trained model. We show this by applying $KD_{LUT}$ to the most recent work, Query-guided End-to-End Person Search (QEEPS) [munjal2019cvpr], in Sec. 6.

5 Model Compression

In this work, we also demonstrate model compression using our proposed knowledge distillation approaches. No modification to the approaches of Sec. 4.1 and 4.2 is required. The only difference is that, for model compression, the backbone architecture of our student OIM network (blue region in Fig. 2) is much smaller than those of the specialized detector (green region) and of the pre-trained OIM model (yellow).

Prior works in knowledge distillation [Denil_NIPS2013_5025, NIPS2017_6676] show that neural networks are often over-parametrized, and that a proper teacher-student knowledge distillation can scale down this redundancy while keeping the performance intact. In other words, supervision from a stronger model as teacher allows a weaker model to reach a level of performance that the smaller model, with current training procedures, is unable to reach on its own.

In our evaluation we focus on different sizes of the ResNet [he2016resnet] architecture to study model compression. In particular, our teacher has a larger backbone (ResNet50), while our student is based on ResNet18, the smallest standard variant of this architecture. We discuss this further in Sec. 6.

6 Experiments

CUHK-SYSU.  CUHK-SYSU is a large-scale person search dataset [xiao2017joint] consisting of 18,184 images, labeled with 8,432 identities and 96,143 pedestrian bounding boxes (23,430 of which are ID-labeled). The dataset displays a large variation in person appearance, background, illumination conditions, etc. For our evaluations, we adopt the standard train/test split as detailed in [xiao2017joint, munjal2019cvpr].

PRW-mini.  PRW [zheng2016prw] is another important dataset focusing on the task of person search. Different from CUHK-SYSU, PRW is acquired in a single setup, i.e. a university campus, using six cameras. Overall it consists of 11,816 images with 34,304 bounding boxes and 932 identities. The diversity in the background and in the appearance of the persons is limited compared to CUHK-SYSU. Munjal et al. [munjal2019cvpr] proposed PRW-mini, a publicly available subset of PRW, for a faster yet representative benchmarking of the full dataset. The motivation came from the high computational complexity of their query-guided person search method (QEEPS), which we also evaluate in this work.

Evaluation Metrics.  Following [xiao2017joint, Xiao2017IANTI, Liu2017NPSM, munjal2019cvpr], we adopt mean Average Precision (mAP) and Cumulative Matching Characteristic (CMC top-K) for results on person search, while we report mAP for person detection results. The mAP metric is common in the detection literature, reflecting both the precision and the recall of the results. CMC is specific to re-identification and reports the probability of retrieving at least one correct ID within the top-K predictions (CMC top-1 is adopted here, which we refer to as top-1).
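The CMC top-K computation can be sketched as follows for a single query. This is a minimal NumPy sketch with an illustrative function name; it ranks gallery entries by similarity and checks whether a correct ID appears among the top K.

```python
import numpy as np

def cmc_top_k(similarities, gallery_ids, query_id, k=1):
    """CMC top-k for one query: 1 if any of the k gallery entries with the
    highest similarity carries the correct ID, else 0. The per-query scores
    are averaged over all queries to obtain the reported CMC top-k."""
    ranked = np.argsort(-np.asarray(similarities))   # descending similarity
    return int(any(gallery_ids[i] == query_id for i in ranked[:k]))
```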

Implementation Details.  Our implementation of OIM is based on ResNet [he2016resnet] and uses the first four blocks (conv1 to conv4) as BaseNet. The input images are re-scaled such that their shorter side is 600 pixels, unless specified otherwise. We employ the same training strategy as in [munjal2019cvpr]. Note that we augment the data by flipping and initialize the backbone architecture using pre-trained ImageNet [imagenet_cvpr09] weights.

6.1 Knowledge Distillation in OIM

In Table 1, we summarize the ablation study of our proposed knowledge distillation approaches, i.e. $KD_{det}$ and $KD_{LUT}$, on the CUHK-SYSU [xiao2017joint] dataset. All experiments in this section consider ResNet50 as the backbone architecture. We begin with the evaluation of a pure detector (Faster R-CNN [ren2015faster]), obtaining 83.6% mAP and 88.1% recall as the baseline for detection accuracy. We refer to this as DET in the text. Then, we evaluate a basic OIM [xiao2017joint] model with a relative weight of the re-id task of $\lambda = 1.0$. We set this as our baseline for person search for the rest of the experiments in this section and refer to it as OIM. It achieves 78.0% mAP and 77.9% CMC top-1 for person search, while significantly underperforming on detection accuracy compared to DET, by 8.4pp mAP. In Fig. 1, we investigate the effect of the relative weighting of the detection and re-identification tasks by varying $\lambda$, demonstrating the contrasting nature of the two tasks.

Next, we detail the results of our distillation approaches. Applying $KD_{det}$ to OIM with DET as the teacher and $\lambda = 1.0$ does not change the person search performance, but it reasonably improves the detector accuracy, by 3.7pp mAP and 2.3pp recall. On the other hand, selecting $\lambda = 0.3$ degrades person search by more than 5pp in both mAP and top-1, but the detector recovers almost the entire performance of the teacher DET. This result is intuitive, since the detector of the student model receives an additional signal from the teacher network while, at the same time, the relative weight of the contrasting re-id task is decreased.

Keeping the results of $KD_{det}$ in view, for our second distillation approach $KD_{LUT}$ we select $\lambda = 0.1$, to ensure the detector of the student gets the required focus during the joint optimization; the supervision through the fixed LUT is supposed to simplify the re-id task, justifying the lower value of its relative weight $\lambda$. Interestingly, we observe significant improvements for both detection (6.7pp mAP and 4.2pp recall) and person search (3.2pp mAP and 3.1pp top-1) over the baseline OIM. Note that $KD_{LUT}$ is a significantly simpler knowledge distillation approach than $KD_{det}$, and still it improves significantly on both the detection and re-id tasks, unlike $KD_{det}$. This result further motivates the importance of research into appropriate supervision and optimal training procedures.

We further combine $KD_{det}$ with $KD_{LUT}$ in our evaluations in Table 1. We notice that the detector of the student reaches the Faster R-CNN detector performance (83.2% mAP and 87.8% recall); however, there is a drop in person search results compared to employing only $KD_{LUT}$ (80.3 vs 81.2 in mAP). Clearly, the addition of $KD_{det}$ increases the focus on the detection task, weakening the training of the re-id branch and hence lowering the person search results. This result indicates the challenges in combining the two knowledge distillation approaches ($KD_{det}$, $KD_{LUT}$). One could adjust the relative weighting of the two approaches; we, however, adopt $KD_{LUT}$ alone for all of our next experiments, due to its superior performance and simplicity.

| Student (ResNet50) | $\lambda$ | Type of KD | Teacher | PS mAP(%) | PS top-1(%) | Det mAP(%) | Det Recall(%) |
|---|---|---|---|---|---|---|---|
| Faster R-CNN (DET)* | - | - | - | - | - | 83.6 | 88.1 |
| OIM (baseline)* | 1.0 | - | - | 78.0 | 77.9 | 75.2 | 82.7 |
| OIM | 1.0 | $KD_{det}$ | DET | 78.3 | 77.5 | 78.9 | 85.0 |
| OIM | 0.3 | $KD_{det}$ | DET | 72.9 | 72.0 | 82.3 | 87.2 |
| OIM | 0.1 | $KD_{LUT}$ | OIM-R50 | 81.2 | 81.0 | 81.9 | 86.9 |
| OIM | 0.1 | $KD_{det}$, $KD_{LUT}$ | DET, OIM-R50 | 80.3 | 79.9 | 83.2 | 87.8 |
| OIM | 0.1 | $KD_{LUT}$ | OIM-R18 | 75.0 | 73.9 | 81.7 | 86.8 |
| OIM | 0.1 | $KD_{LUT}$ | QEEPS-R50 | 83.8 | 84.2 | 81.7 | 86.8 |
| QEEPS [munjal2019cvpr] (QEEPS)* | 1.0 | - | - | 84.4 | 84.4 | - | - |
| QEEPS | 0.1 | $KD_{LUT}$ | OIM-R50 | 83.2 | 82.9 | - | - |
| QEEPS | 0.1 | $KD_{LUT}$ | QEEPS-R50 | 85.0 | 85.5 | - | - |

Table 1: Knowledge distillation for ResNet50 student models. The upper rows report the detection and person search results of the student model using the two proposed distillation methods ($KD_{det}$ and $KD_{LUT}$). For $KD_{det}$, the ResNet50 Faster R-CNN detector (DET) is used as teacher; for $KD_{LUT}$, the ResNet50 OIM model (OIM-R50) is used as teacher. The lower rows report the results of using different teachers with OIM and QEEPS students. OIM-R18 denotes the ResNet18 OIM model, QEEPS-R50 the ResNet50 QEEPS model. (*) indicates models trained without KD.

Significance of Teachers’ Quality.  In the lower part of Table 1, we study the effect of using different teacher models in $KD_{LUT}$, namely a ResNet18 OIM (OIM-R18) and the recently proposed QEEPS [munjal2019cvpr] with ResNet50 (QEEPS-R50), in addition to the baseline OIM-R50. Notice how using OIM-R18 as teacher drops the person search performance to 75.0% mAP and 73.9% top-1, which is 3pp mAP and 4pp top-1 lower than the baseline OIM, while using QEEPS-R50 as teacher gives a significant improvement in person search over the baseline, of 5.8pp mAP and 6.3pp top-1. This is 2.6pp mAP and 3.2pp top-1 better than using OIM-R50 as a teacher. Overall, we can conclude that stronger teachers provide stronger supervision and hence better results, while inferior teachers may harm the performance. However, we also demonstrate that a student using the same model as teacher, e.g. an OIM ResNet50 student with an OIM-R50 teacher (cf. Table 1), also improves the results, due to improved training conditions. It is worth noting that the detector performance in this case remains almost the same when using different teachers for $KD_{LUT}$ (mAP 81.9%, 81.7% and 81.7% for the OIM-R50, OIM-R18 and QEEPS-R50 teachers, respectively), since a higher relative importance of the detector encourages its convergence to the performance of a pure detector (mAP 83.6%).

Next, we also evaluate the state-of-the-art QEEPS model [munjal2019cvpr] with ResNet50 as a student, with the supervision of OIM-R50 and QEEPS-R50 as teachers. QEEPS does not expose an intermediate detection stage, therefore we only report person search results. As shown in the table, using OIM-R50 to train QEEPS drops the person search performance by 1.2pp mAP and 1.5pp top-1, while using QEEPS-R50 results in an improvement of 0.6pp mAP and 1.1pp top-1.

(a) CUHK-SYSU

| Method | mAP(%) | top-1(%) |
|---|---|---|
| OIM [xiao2017joint] | 75.5 | 78.7 |
| Distilled OIM | 83.8 | 84.2 |
| QEEPS [munjal2019cvpr] | 84.4 | 84.4 |
| Distilled QEEPS | 85.0 | 85.5 |
| IAN [Xiao2017IANTI] | 76.3 | 80.1 |
| NPSM [Liu2017NPSM] | 77.9 | 81.2 |
| Context Graph [Yan_2019_CVPR] | 84.1 | 86.5 |
| CLSA [person_eccv18] | 87.2 | 88.5 |

(b) PRW-mini

| Method | mAP(%) | top-1(%) |
|---|---|---|
| OIM‡ | 38.3 | 70.0 |
| Distilled OIM‡ | 39.5 | 73.3 |
| QEEPS [munjal2019cvpr] | 39.1 | 80.0 |
| Distilled QEEPS | 39.7 | 80.0 |
| Mask-G [Chen_2018_ECCV] | 33.1 | 70.0 |

Table 2: Comparison to the state of the art on (a) CUHK-SYSU [xiao2017joint] (image size = 600) and (b) PRW-mini [munjal2019cvpr] (image size = 900), where OIM‡ is the same as in [munjal2019cvpr].

Comparison to the State-of-the-Art.  It is important to note that our knowledge distillation approach can be directly applied to all methods that use the OIM loss [xiao2017joint] for learning the identification features [Yan_2019_CVPR, Chen_2018_ECCV, munjal2019cvpr, xiao2017joint]. In Table 2 (a), we demonstrate our approach on the CUHK-SYSU [xiao2017joint] dataset for two such state-of-the-art methods, i.e. OIM [xiao2017joint] and QEEPS [munjal2019cvpr]. Our distilled OIM, using $KD_{LUT}$ with QEEPS as teacher, outperforms OIM [xiao2017joint] by 8.3pp in mAP and 5.5pp in top-1. Similarly, our distilled QEEPS outperforms QEEPS [munjal2019cvpr] by 0.6pp mAP and 1.1pp top-1, achieving overall 85.0% mAP and 85.5% top-1. Also notice that our distilled QEEPS outperforms IAN [Xiao2017IANTI], NPSM [Liu2017NPSM] and a recent context-graph-based approach [Yan_2019_CVPR] by 8.7pp, 7.1pp and 0.9pp mAP, respectively.

Our knowledge distillation approach is also applicable to other methods like IAN [Xiao2017IANTI], NPSM [Liu2017NPSM] and CLSA [person_eccv18] that learn the identification features using softmax loss. In this case, first we would need to compute an OIM-like ID feature table for the teacher model and then use it to additionally supervise the identification feature learning of the student model.

In Table 2 (b), we report results on PRW-mini. In this case we adopt an image size of 900 pixels, following [munjal2019cvpr, Shen_2018_ECCV]. Our distilled OIM‡ (above the dashed line; OIM‡ uses image size 900, whereas OIM uses image size 600), trained with re-id distillation from QEEPS, surpasses OIM‡ by 1.2pp mAP and 3.3pp top-1. Similarly, our distilled QEEPS surpasses QEEPS [munjal2019cvpr] by 0.6pp mAP.

6.2 Model Compression

In Table 3, we report the model compression results for our two knowledge distillation methods, detection distillation and re-id distillation. We employ Resnet18 (with ~46% of the parameters of Resnet50) as the network architecture for all student entries in this table. We keep DET as the teacher for detection distillation.

The person search results of our baseline Resnet18 OIM model, i.e. 69.1% mAP and 68% top-1, are significantly lower than those of the Resnet50 OIM model, by around 9%. Its detector is only slightly worse than that of the Resnet50 OIM, but significantly below the pure detector accuracy of DET (82.4% mAP). As shown in Table 3, applying detection distillation to OIM at λ=1 improves the detection performance by 3.3pp mAP and 2.1pp recall, with no effect on the person search results. We also evaluate detection distillation at λ=0.6, as this setting provides the best trade-off between the detection and re-id tasks (cf. Fig. 1 (right)). We notice that detection improves by 1.3pp mAP and 1pp recall, while person search performance drops by a large margin of 9.5pp mAP and 11.4pp top-1. This result indicates that, while the detector is directly supervised through detection distillation, a lower weight λ on the re-id task during joint optimization is counter-productive for person search.
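The role of λ described above can be made concrete with a one-line sketch. The exact weighting scheme of the joint objective is an assumption here (the paper may combine the terms differently); the sketch only illustrates that λ scales the re-id term relative to the detection term.

```python
def joint_loss(det_loss, reid_loss, lam):
    """Hypothetical joint person-search objective: lam scales the re-id loss
    relative to the detection loss, so a smaller lam de-emphasizes re-id."""
    return det_loss + lam * reid_loss
```

Under this reading, λ=1 corresponds to the standard joint training, while λ=0.1 strongly relaxes the re-id term, which is viable only when re-id is additionally supervised by the teacher's ID feature table.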

Student (Resnet18)    λ    Type of KD        Teacher        | Person Search     | Detection
                                                            | mAP(%)  top-1(%)  | mAP(%)  Recall(%)
Faster R-CNN (DET)*   -    -                 -                -       -           82.4    87.2
OIM (Baseline)*       1    -                 -                69.1    68.0        74.2    81.6
OIM                   1    det. KD           DET              69.1    68.0        77.5    83.7
OIM*                  0.6  -                 -                79.9    80.4        78.4    84.4
OIM                   0.6  det. KD           DET              70.4    69.0        79.7    85.4

OIM                   0.1  re-id KD          OIM-R50          80.5    80.9        80.6    85.9
OIM                   0.1  det. + re-id KD   DET, OIM-R50     78.4    78.0        82.0    86.9
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
OIM                   0.1  re-id KD          OIM-R18          72.9    71.1        80.6    85.9

OIM                   0.1  re-id KD          QEEPS-R50        82.4    83.0        80.8    86.0
QEEPS*                1    -                 -                76.6    76.0        -       -
QEEPS                 0.1  re-id KD          OIM-R50          82.1    81.4        -       -
QEEPS                 0.1  re-id KD          QEEPS-R50        84.1    84.3        -       -

Table 3: Knowledge distillation for Resnet18 student models. Above the dashed line are the detection and person search results of the student model using the two proposed distillation methods (detection KD and re-id KD). For detection KD, the Resnet50 Faster R-CNN detector (DET) is used as teacher; for re-id KD, the Resnet50 OIM model (OIM-R50) is used as teacher. Below the dashed line are the results of using different teachers with OIM and QEEPS as students. OIM-R18 denotes the Resnet18 OIM model, QEEPS-R50 the Resnet50 QEEPS model. (*) indicates models trained without KD.

Next, we notice that our second distillation approach, re-id distillation, is also quite effective in the model compression scenario. It brings a significant improvement over the baseline on both the person search and detection benchmarks: 11.4pp mAP and 12.9pp top-1 for person search, and 6.4pp mAP and 4.3pp recall for detection. Quite interestingly, our distilled Resnet18 OIM model outperforms the Resnet50 baseline OIM model both for person search (80.5% vs 78% mAP) and detection (80.6% vs 75.2% mAP).

Finally, below the dashed line in Table 3, we report results for Resnet18-based OIM and QEEPS student models supervised by different teacher models: the Resnet18 OIM, the Resnet50 OIM, and the Resnet50 QEEPS. In particular, the Resnet18 QEEPS student supervised by the Resnet50 QEEPS teacher achieves performance on par with the state-of-the-art results of QEEPS [munjal2019cvpr].

7 Conclusions

We have introduced knowledge distillation for person search and proposed two approaches to supervise either the detector or the re-identification part of two recent state-of-the-art models, OIM [xiao2017joint] and QEEPS [munjal2019cvpr]. In both cases we improve performance on the CUHK-SYSU [xiao2017joint] and PRW-mini [munjal2019cvpr] datasets, and the benefits extend to model compression.

Our approach is the first of its kind, relaxing the multi-task person search optimization by transferring one task to a teacher. We plan to further investigate whether this multi-task relaxation may apply to other multi-task goals.