Existing person re-identification benchmarks and methods mainly focus on matching cropped pedestrian images between queries and candidates. However, it is different from real-world scenarios where the annotations of pedestrian bounding boxes are unavailable and the target person needs to be searched from a gallery of whole scene images. To close the gap, we propose a new deep learning framework for person search. Instead of breaking it down into two separate tasks---pedestrian detection and person re-identification, we jointly handle both aspects in a single convolutional neural network. An Online Instance Matching (OIM) loss function is proposed to train the network effectively, which is scalable to datasets with numerous identities. To validate our approach, we collect and annotate a large-scale benchmark dataset for person search. It contains 18,184 images, 8,432 identities, and 96,143 pedestrian bounding boxes. Experiments show that our framework outperforms other separate approaches, and the proposed OIM loss function converges much faster and better than the conventional Softmax loss.
Person re-identification (re-id) [39, 8] aims at matching a target person with a gallery of pedestrian images. It has many video surveillance applications, such as finding criminals, cross-camera person tracking, and person activity analysis. The problem is challenging because of complex variations of human poses, camera viewpoints, lighting, occlusion, resolution, and background clutter, and thus has drawn much research attention in recent years [43, 21, 24, 35, 19, 4].
Although numerous person re-id datasets and methods have been proposed, there is still a big gap between the problem setting itself and real-world applications. In most benchmarks [19, 10, 14, 18, 45], the gallery only contains manually cropped pedestrian images (Figure 1(a)), while in real applications, the goal is to find a target person in a gallery of whole scene images, as shown in Figure 1(b). Following the protocols of these benchmarks, most existing person re-id methods assume perfect pedestrian detections. However, such manually cropped bounding boxes are unavailable in practical applications. Off-the-shelf pedestrian detectors inevitably produce false alarms, misdetections, and misalignments, which can harm the final search performance significantly.
In 2014, Xu et al. took the first step towards closing this gap. They introduced the person search problem to the community, and proposed a sliding window search strategy based on a combination of pedestrian detection and person matching scores. However, the performance is limited by the handcrafted features, and the sliding window framework is not scalable.
In this paper, we propose a new deep learning framework for person search. Different from conventional approaches that break the problem down into two separate tasks, pedestrian detection and person re-identification, we jointly handle both aspects in a single Convolutional Neural Network (CNN). Our CNN consists of two parts: given a whole input gallery image, a pedestrian proposal net produces bounding boxes of candidate people, which are fed into an identification net to extract features for comparison with the target person. The pedestrian proposal net and the identification net adapt to each other during the joint optimization. For example, the proposal net can focus more on recall rather than precision, as false alarms can be eliminated by the subsequent feature matching process. Meanwhile, misalignments of proposals are also acceptable, as they can be further adjusted by the identification net. To improve the scalability of the whole system, inspired by recent advances in object detection, we let both parts share the underlying convolutional feature maps, which significantly accelerates the inference procedure.
Traditional re-id feature learning mainly employs pairwise or triplet distance loss functions [19, 1, 3, 5]. However, they are not efficient, as only a few data samples are compared at a time, and there are $O(N^2)$ potential input combinations, where $N$ is the number of images. Different sampling strategies could significantly impact the convergence rate and quality, but finding efficient sampling strategies becomes much more difficult as $N$ increases. Another approach is learning to classify identities with the Softmax loss function, which effectively compares all the samples at the same time. But as the number of classes increases, training the big Softmax classifier matrix becomes much slower or even fails to converge. In this paper, we propose a novel Online Instance Matching (OIM) loss function to cope with these problems. We maintain a lookup table of features from all the labeled identities, and compare distances between mini-batch samples and all the registered entries. On the other hand, many unlabeled identities could appear in scene images, which can serve as negatives for the labeled identities. We thus exploit a circular queue to store their features for comparison as well. This is another advantage brought by the person search problem setting. The proposed parameter-free OIM loss converges much faster and better than the Softmax loss in our experiments.
The contribution of our work is three-fold. First, we propose a new deep learning framework to search for a target person in a gallery of whole scene images. Instead of simply combining pedestrian detectors and person re-id methods, we jointly optimize both objectives in a single CNN so that they better adapt to each other. Second, we propose an Online Instance Matching loss function to learn identification features more effectively, which makes our framework scalable to large datasets with numerous identities. Together with the fast inference speed, our framework is much closer to real-world application requirements. Finally, we collect and annotate a large-scale benchmark dataset for person search, covering hundreds of scenes from street and movie snapshots. The dataset contains 18,184 images, 8,432 identities, and 96,143 pedestrian bounding boxes. We validate the effectiveness of our approach against other baselines on this dataset. The dataset and code are made public to facilitate further research: https://github.com/ShuangLI59/person_search.
Person re-identification. Early person re-identification methods addressed the problem by manually designing discriminative features [34, 12, 42], learning feature transforms across camera views [26, 25, 30], and learning distance metrics [46, 11, 26, 24, 22]. In recent years, many researchers have proposed deep learning based methods that jointly handle all these aspects. Li et al. and Ahmed et al. designed specific CNN models for person re-id; both networks take a pair of cropped pedestrian images as input and employ a binary verification loss function to train the parameters. Ding et al. and Cheng et al. exploited triplet samples for training CNNs to minimize the feature distance between instances of the same person and maximize the distance between different people. Apart from pairwise or triplet loss functions, Xiao et al. proposed to learn features by classifying identities; multiple datasets are combined and a domain guided dropout technique is proposed to improve the feature learning. Several recent works address person re-id on abnormal images, such as low-resolution or partially occluded images.
Concurrent with our prior arXiv submission, Zheng et al.  also contributed a benchmark dataset for person search. They exploited separate detection and re-id methods with scores re-weighting to solve the problem, while in this work we propose a deep learning framework that jointly handles both aspects.
Pedestrian detection. DPM, ACF, and Checkerboards are the most commonly used off-the-shelf pedestrian detectors. They rely on hand-crafted features and linear classifiers to detect pedestrians. In recent years, CNN-based pedestrian detectors have also been developed [37, 40]. Various factors, including CNN model structures, training data, and different training strategies, have been studied empirically. Tian et al. exploited pedestrian and scene attribute labels to train CNN pedestrian detectors in a multi-task manner. Cai et al. proposed a complexity-aware boosting algorithm for learning CNN detector cascades.
We propose a new deep learning framework that jointly handles pedestrian detection and person re-identification in a single convolutional neural network (CNN), as shown in Figure 2. Given a whole scene image as input, we first use a stem CNN to transform the raw pixels into convolutional feature maps. A pedestrian proposal net is built upon these feature maps to predict bounding boxes of candidate people, which are then fed into an identification net with RoI-Pooling to extract an L2-normalized feature vector for each of them. At the inference stage, we rank the gallery people according to their feature distances to the target person. At the training stage, we propose an Online Instance Matching
(OIM) loss function on top of the feature vectors to supervise the identification net, together with several other loss functions for training the proposal net in a multi-task manner. Below we will first detail the CNN model structure, and then elaborate on the OIM loss function.
We adopt the ResNet-50 as our base CNN model. It has a convolution layer in front (named conv1), followed by four blocks (named conv2_x to conv5_x), each containing several residual units. We exploit conv1 to conv4_3 as the stem part. Given an input image, the stem produces convolutional feature maps at a reduced resolution relative to the original image.
On top of these feature maps, we build a pedestrian proposal network to detect person candidates. A convolutional layer is first added to transform the features specifically for pedestrians. Then we follow the region proposal network design to associate a set of anchors at each feature map location, and use a Softmax classifier to predict whether each anchor is a pedestrian or not, as well as a linear regression to adjust their locations. We keep the top-scoring adjusted bounding boxes after non-maximum suppression as our final proposals.
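The proposal selection step described above can be sketched as follows; this is a minimal greedy NMS illustration, and the IoU threshold and top-k budget here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7, top_k=128):
    """Greedy non-maximum suppression; boxes are (x1, y1, x2, y2) rows.
    Repeatedly keeps the highest-scoring box and drops overlapping ones."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0 and len(keep) < top_k:
        i = order[0]
        keep.append(i)
        # IoU of the best box with the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```

In the full pipeline, the boxes entering NMS are the anchor locations after the linear regression adjustment.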
To find the target person among all these proposals, we build an identification net to extract the features of each proposal and compare them against the target's. We first exploit an RoI-Pooling layer to pool a region from the stem feature maps for each proposal. The pooled features are then passed through the remaining conv4_4 to conv5_3 blocks of the ResNet-50, followed by a global average pooling layer that summarizes them into a feature vector. On one hand, as the pedestrian proposals inevitably contain some false alarms and misalignments, we again use a Softmax classifier and a linear regression to reject non-persons and refine the locations. On the other hand, we project the features into an L2-normalized low-dimensional subspace (id-feat), and use them to compute cosine similarities with the target person at inference time. During the training stage, we supervise the id-feat with the proposed OIM loss function. Together with the other loss functions for detection, the whole net is jointly trained in a multi-task learning manner, rather than with alternating optimizations.
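As a concrete illustration of the inference step, the following sketch ranks gallery features by cosine similarity to a query; feature dimensions and function names are hypothetical, not from the paper's code.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale feature vectors (last axis) to unit L2-norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def rank_gallery(query_feat, gallery_feats):
    """Rank gallery people by cosine similarity to the query.
    With L2-normalized features, cosine similarity is a plain dot product."""
    q = l2_normalize(query_feat)
    g = l2_normalize(gallery_feats)
    sims = g @ q                      # (N,) cosine similarities
    return np.argsort(sims)[::-1]     # gallery indices, most similar first
```

This is why L2 normalization matters: it makes the training-time similarity and the inference-time ranking use the same cosine criterion.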
There are three different types of proposals: labeled identities, unlabeled identities, and background clutter. Suppose there are $L$ different target people in the training set. When a proposal matches a target person, we call it an instance of that labeled identity, and assign a class-id (from 1 to $L$) to it accordingly. There are also many proposals that correctly predict pedestrians but do not belong to any of our target people; we call these unlabeled identities. We demonstrate some examples of labeled and unlabeled identities in Figure 3 with blue and orange bounding boxes, respectively. The remaining proposals are just false alarms on other objects or background regions. The proposed loss function only considers the labeled and unlabeled identities, and leaves the other proposals untouched.
As our goal is to distinguish different people, a natural objective is to minimize the feature discrepancy among the instances of the same person, while maximizing the discrepancy among different people. To fulfill this goal, we need to memorize the features of all the people. This could be done offline by running a network forward pass on all the training images, but it is not practical when using stochastic gradient descent (SGD) for optimization. Thus in our approach, we choose an online approximation instead. Denote the features of a labeled identity inside a mini-batch by $x \in \mathbb{R}^D$, where $D$ is the feature dimension. We maintain a lookup table (LUT) $V \in \mathbb{R}^{D \times L}$ to store the features of all the labeled identities, as demonstrated in Figure 3. During the forward propagation, we compute cosine similarities between the mini-batch sample and all the labeled identities by $V^T x$. During backward propagation, if the target class-id is $t$, we update the $t$-th column of the LUT by $v_t \leftarrow \gamma v_t + (1 - \gamma) x$, where $\gamma \in [0, 1]$, and then scale $v_t$ to have unit L2-norm.
Apart from labeled identities, many unlabeled identities are also valuable for learning feature representations. They can safely be used as negative classes for all the labeled identities. We use a circular queue to store the features of the unlabeled identities that appear in recent mini-batches. Denote the features in this circular queue by $U \in \mathbb{R}^{D \times Q}$, where $Q$ is the queue size; we can also compute their cosine similarities with the mini-batch sample by $U^T x$. After each iteration, we push the new feature vectors into the queue, while popping the out-of-date ones to keep the queue size unchanged.
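The LUT and circular-queue bookkeeping above can be sketched in a few lines of numpy; this is a simplified illustration (the class name, default momentum value, and row-major LUT layout are our assumptions, not the paper's implementation).

```python
import numpy as np
from collections import deque

class OIMBuffer:
    """External (non-parametric) buffers for the OIM loss: a lookup table
    for the L labeled identities and a circular queue for unlabeled ones."""
    def __init__(self, feat_dim, num_labeled, queue_size, momentum=0.5):
        self.lut = np.zeros((num_labeled, feat_dim))   # V, stored as L x D
        self.queue = deque(maxlen=queue_size)          # U, up to Q entries
        self.momentum = momentum                       # gamma in the update

    def similarities(self, x):
        """Cosine similarities of an L2-normalized sample x with all entries."""
        labeled = self.lut @ x                             # V^T x, shape (L,)
        unlabeled = np.array([u @ x for u in self.queue])  # U^T x, shape (Q,)
        return labeled, unlabeled

    def update(self, x, class_id=None):
        """After backward: momentum-update the matched LUT column, or push
        an unlabeled feature into the circular queue (oldest one is evicted)."""
        if class_id is not None:
            v = self.momentum * self.lut[class_id] + (1 - self.momentum) * x
            self.lut[class_id] = v / (np.linalg.norm(v) + 1e-12)
        else:
            self.queue.append(x)
```

The `deque(maxlen=...)` gives the circular-queue behavior for free: appending beyond the capacity silently drops the oldest entry.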
Based on these two data structures, we define the probability of $x$ being recognized as the labeled identity with class-id $i$ by a Softmax function

$$p_i = \frac{\exp(v_i^T x / \tau)}{\sum_{j=1}^{L} \exp(v_j^T x / \tau) + \sum_{k=1}^{Q} \exp(u_k^T x / \tau)}, \quad (1)$$

where a higher temperature $\tau$ leads to a softer probability distribution. Similarly, the probability of being recognized as the $i$-th unlabeled identity in the circular queue is

$$q_i = \frac{\exp(u_i^T x / \tau)}{\sum_{j=1}^{L} \exp(v_j^T x / \tau) + \sum_{k=1}^{Q} \exp(u_k^T x / \tau)}. \quad (2)$$

The OIM objective is to maximize the expected log-likelihood

$$\mathcal{L} = \mathrm{E}_x\left[\log p_t\right], \quad (3)$$

and its gradient with respect to $x$ can be derived as

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{1}{\tau} \Big( (1 - p_t)\, v_t - \sum_{j=1, j \neq t}^{L} p_j v_j - \sum_{k=1}^{Q} q_k u_k \Big).$$
It can be seen that our OIM loss effectively compares the mini-batch sample with all the labeled and unlabeled identities, driving the underlying feature vector to be similar to the target one, while pushing it away from the others.
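The probabilities and gradient above can be sanity-checked numerically; the following sketch, with hypothetical small dimensions and a random setup of our own choosing, compares the analytic gradient against central finite differences of the log-likelihood.

```python
import numpy as np

def log_p_t(x, V, U, t, tau):
    """log p_t: scores of x against the LUT V (columns v_j) and the
    circular queue U (columns u_k), with temperature tau."""
    logits = np.concatenate([V.T @ x, U.T @ x]) / tau
    return logits[t] - np.log(np.sum(np.exp(logits)))

# Hypothetical small setup: D=4 features, L=3 labeled, Q=2 unlabeled.
rng = np.random.default_rng(0)
D, L, Q, tau, t = 4, 3, 2, 0.1, 1
x = rng.normal(size=D); x /= np.linalg.norm(x)
V = rng.normal(size=(D, L))
U = rng.normal(size=(D, Q))

# Analytic gradient: (1/tau) * (e_t - p) summed against all columns [V U].
logits = np.concatenate([V.T @ x, U.T @ x]) / tau
p = np.exp(logits - logits.max()); p /= p.sum()
grad = (np.eye(L + Q)[t] - p) @ np.concatenate([V, U], axis=1).T / tau

# Central finite-difference check of d(log p_t)/dx.
num = np.array([(log_p_t(x + 1e-6 * e, V, U, t, tau)
                 - log_p_t(x - 1e-6 * e, V, U, t, tau)) / 2e-6
                for e in np.eye(D)])
assert np.allclose(grad, num, rtol=1e-4, atol=1e-6)
```

The vectorized form `(e_t - p) @ [V U]^T / tau` expands to exactly the three terms of the gradient equation: the $(1 - p_t)v_t$ part, the labeled negatives, and the queue negatives.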
Why not Softmax loss? A natural question is why we do not learn a classifier matrix with a conventional Softmax loss to predict the class-id. There are two main drawbacks. First, large-scale person search datasets have a large number of identities, while each identity only has several instances and each image only contains a few identities. We would need to learn a correspondingly large number of discriminant functions simultaneously, but during each SGD iteration we only have positive samples from tens of classes. The classifier matrix suffers from large variance of gradients and thus cannot be learned effectively, even with proper pre-training and high momentum. Second, we cannot exploit the unlabeled identities with the Softmax loss, as they have no specific class-ids.
Although our OIM loss formulation is similar to the Softmax one, the major difference is that the OIM loss is non-parametric. The LUT and circular queue are treated as external buffers, rather than network parameters, and the gradients operate directly on the features without the transformation by a classifier matrix. The potential drawback of this non-parametric loss is that it could overfit more easily. We find that projecting the features into an L2-normalized low-dimensional subspace helps reduce overfitting.
Scalability. Computing the partition function in Eq. (1) and Eq. (2) could become time consuming as the number of identities increases. To overcome this problem, we can approximate the denominators by sub-sampling the labeled and unlabeled identities, which results in optimizing a lower bound of Eq. (3).
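A minimal sketch of that approximation, assuming the target's own entry is always retained in the sub-sampled denominator (the function name and sampling scheme are illustrative, not from the paper):

```python
import numpy as np

def oim_subsampled_logprob(x, V, U, t, tau, num_samples, rng):
    """Approximate log p_t of Eq. (1): keep the target column t and replace
    the rest of the denominator with a random subset of the other columns,
    giving a cheaper surrogate of the full objective."""
    L, Q = V.shape[1], U.shape[1]
    negatives = np.setdiff1d(np.arange(L + Q), [t])
    picked = rng.choice(negatives, size=min(num_samples, negatives.size),
                        replace=False)
    W = np.concatenate([V, U], axis=1)        # D x (L + Q), all entries
    cols = np.concatenate([[t], picked])      # target first, then negatives
    logits = W[:, cols].T @ x / tau
    return logits[0] - np.log(np.sum(np.exp(logits)))
```

When `num_samples` covers all other identities, this reduces exactly to the full log-probability; smaller sample sizes trade fidelity for speed.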
We collect and annotate a large-scale person search dataset to evaluate our proposed method. We exploit two data sources to diversify the scenes. On one hand, we use hand-held cameras to shoot street snaps around an urban city. On the other hand, we collect movie snapshots that contain pedestrians, as they enrich the variations of viewpoints, lighting, and background conditions. In this section, we show the basic statistics of our dataset, and define the evaluation protocols and metrics.
After collecting all the images, we first densely annotate all the pedestrian bounding boxes in these scenes, and then associate the persons that appear across different images, resulting in 8,432 labeled identities. The statistics of the two data sources are listed in Table 1. We did not annotate people who appear with half bodies or in abnormal poses such as sitting or squatting. Moreover, people who change clothes or decorations across different video frames are not associated in our dataset, since the person search problem requires recognizing identities mainly by their clothes and body shapes rather than faces. We ensure that the background pedestrians do not contain labeled identities, so they can safely serve as negative samples for identification. Note that we also ignore background pedestrians whose heights fall below a minimum threshold, as they would be hard to recognize even for human labelers. The height distributions of labeled and unlabeled identities are demonstrated in Figure 4. It can be seen that our dataset has rich variations of pedestrian scales.
Table 1: Statistics of the two data sources and the train/test split (columns: Source / Split, # Images, # Pedestrians, # Identities).
We split the dataset into a training and a test subset, ensuring no overlapping images or labeled identities between them. Table 1 shows the statistics of these two subsets. We divide the test identity instances into queries and galleries. For each test identity, we randomly choose one of his/her instances as the query, while the corresponding gallery set consists of two parts: all the images containing the other instances, and some randomly sampled images not containing this person. Different queries have different galleries, and jointly they cover all the test images.
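The per-query gallery construction described above can be sketched as follows; this is a simplified illustration, and the identifiers and sizes are hypothetical.

```python
import numpy as np

def build_gallery(image_ids_with_person, all_image_ids, gallery_size, rng):
    """Per-query gallery: every image containing the person's other
    instances, padded with random images that do not contain him/her."""
    positives = list(image_ids_with_person)
    pool = [i for i in all_image_ids if i not in image_ids_with_person]
    n_pad = max(0, gallery_size - len(positives))
    padding = rng.choice(pool, size=n_pad, replace=False).tolist()
    return positives + padding
```

Because every image either contains some query's instances or is eligible as padding for others, the union of all per-query galleries covers the whole test set.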
To better understand how gallery size affects person search performance, we define a set of protocols with gallery sizes ranging from small subsets up to the full image set. Since each image contains several pedestrians on average, even a moderate gallery size means finding the target person among hundreds of people. This setting is comparable with existing person re-id datasets (e.g., CUHK-03, VIPeR) in terms of the number of gallery pedestrians, and is even more challenging, as there could be thousands of background clutter bounding boxes distracting our attention.
We employ two kinds of evaluation metrics: cumulative matching characteristics (CMC top-K) and mean average precision (mAP). The first is inherited from the person re-id problem, where a match is counted if at least one of the top-K predicted bounding boxes overlaps a ground truth with intersection-over-union (IoU) greater than or equal to a fixed threshold. The second is inspired by object detection tasks. We follow the ILSVRC object detection criterion to judge the correctness of predicted bounding boxes. An average precision (AP) is calculated for each query based on its precision-recall curve, and we then average the APs across all the queries to get the final result.
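As an illustration of the mAP metric, a simplified AP computation over a score-ranked list of gallery boxes might look like the following; it assumes the match indicators already encode the IoU test and that all ground-truth instances appear somewhere in the ranked list, so the normalizer is the number of matches.

```python
import numpy as np

def average_precision(is_match):
    """AP for one query from a score-ranked list of gallery boxes, where
    is_match[i] = 1 if the i-th ranked box correctly localizes the person
    (i.e., its IoU with a ground-truth box passes the threshold)."""
    is_match = np.asarray(is_match, dtype=float)
    if is_match.sum() == 0:
        return 0.0
    cum_tp = np.cumsum(is_match)
    precision = cum_tp / (np.arange(len(is_match)) + 1)
    # average the precision values at each recall point (each true positive)
    return float((precision * is_match).sum() / is_match.sum())

def mean_ap(per_query_matches):
    """mAP: mean of the per-query APs."""
    return float(np.mean([average_precision(m) for m in per_query_matches]))
```

For example, a ranked list matching at ranks 1 and 3 yields AP = (1/1 + 2/3) / 2.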
To evaluate the effectiveness of our approach and study the impact of various factors on person search performance, we conduct several groups of experiments on the new dataset. In this section, we first detail the baseline methods and experiment settings in Section 5.1. Then we compare our joint framework with the baselines that use separate pedestrian detection and person re-identification in Section 5.2. Section 5.3 shows the effectiveness of our proposed Online Instance Matching (OIM) loss. Finally, we present the influence of various factors, including detection recall and gallery size.
We implement our framework based on Caffe [16, 32] and py-faster-rcnn [9, 27]. An ImageNet-pretrained ResNet-50 is exploited for parameter initialization. We fix the first convolution layer and treat the batch normalization (BN) layers in the stem part as constant affine transformations, while keeping the other BN layers in the identification part as normal. The temperature scalar $\tau$ in Eq. (1) and Eq. (2) and the size of the circular queue are set to fixed values. All the losses have the same loss weight. Each mini-batch consists of two scene images. The learning rate is initialized to a small value, dropped by a factor of ten partway through training, and then kept unchanged until the model converges.
We compare our framework with conventional methods that break the problem down into two separate tasks: pedestrian detection and person re-identification. Three pedestrian detection and five person re-id methods are used in our experiments, resulting in 15 baseline combinations. For pedestrian detection, we directly use the off-the-shelf deep learning CCF detector, as well as two other detectors specifically fine-tuned on our dataset: one is ACF, and the other is Faster-RCNN (CNN) with ResNet-50, which is equivalent to our framework without the identification task. The recall-precision curve of each detector on our dataset is plotted in Figure 5. We also use the ground truth (GT) bounding boxes as the results of a perfect detector.
For person re-identification, we use several popular re-id feature representations, including DenseSIFT-ColorHist (DSIFT), Bag of Words (BoW), and Local Maximal Occurrence (LOMO). Each feature representation is used in conjunction with a specific distance metric, including Euclidean, Cosine similarity, KISSME, and XQDA, where KISSME and XQDA are trained on our dataset. Moreover, by discarding the pedestrian proposal network in our framework and training the remaining net to classify identities with the Softmax loss from cropped pedestrian images, we obtain another baseline re-id method (IDNet). This training scheme has been exploited previously to learn discriminative re-id feature representations. In our experiments, when training IDNet with detector boxes, we found that adding background clutter as a unique class improves the result, while adding unlabeled identities does not.
The following results are reported using the default gallery-size protocol unless otherwise specified.
We first compare our proposed person search framework (with or without using unlabeled identities) with baseline combinations that break the problem down into separate detection and re-identification tasks. The results are summarized in Table 2. Our method outperforms the others by a large margin. Compared with CNN+IDNet, the gain comes from the joint optimization of the detection and identification parts, as well as the effective use of unlabeled identities in the OIM loss.
|CMC top-1 (%)||CCF||ACF||CNN||GT|
|Ours||—||—||76.1||78.5|
|Ours (w/o unlabeled)||—||—||72.7||75.5|
From Table 2 we can also see that the choice of detector affects the person search performance significantly for each re-id method. Directly using an off-the-shelf detector may not be a good choice when applying existing re-id methods to real-world person search applications; otherwise, the detector can become a bottleneck that diminishes the returns of better re-id methods.
On the other hand, the relative performance of different re-id methods is consistent across all the detectors. This implies that existing person re-id datasets can still guide us in designing better feature representations, but they lack some valuable data, such as unlabeled identities and background clutter, which come with person search datasets.
Another interesting phenomenon is that although IDNet and LOMO+XQDA have similar performance when using GT boxes or the fine-tuned ACF and CNN detectors, IDNet is significantly better when using the off-the-shelf CCF detector. We observe that the CCF detection results contain many misalignments; hand-crafted features in such cases are not as robust as their IDNet counterparts.
We validate the effectiveness of the proposed Online Instance Matching (OIM) loss by comparing it against Softmax baselines with or without pretraining the classifier matrix. The training identification accuracy and test person search mAP curves are demonstrated in Figure 6. First, we can see that using the Softmax loss without pretraining the classifier keeps the accuracy low during the whole process. This phenomenon verifies our analysis in Section 3.2 that learning a large classifier matrix is difficult. Even with proper pretraining, the training accuracy still improves slowly, and the test mAP plateaus at a low value.
In contrast, the proposed OIM loss starts with a low training accuracy but converges much faster and consistently improves the test performance. The parameter-free OIM loss learns features directly, without needing to learn a big classifier matrix. Moreover, the mismatch between the training and test criteria no longer exists, as both are computed from the inner product of L2-normalized feature vectors, which represents the cosine similarity.
We further evaluate the impact of the OIM loss on the standard person re-identification task. We train two different base CNNs, Inception (from scratch) and ResNet-50 (ImageNet-pretrained), with either the Softmax loss or the OIM loss, on three large-scale person re-id datasets: CUHK03, Market1501, and Duke [48, 28]. Following their own protocols, we evaluate the CMC top-1 accuracy with each loss function, as listed in Table 3. The OIM loss consistently outperforms the Softmax loss, regardless of which base CNN is used. We refer readers to the Open-ReID benchmarks (https://github.com/Cysu/open-reid) for more details.
Sub-sampling the identities. As the number of identities increases, the computation time of the OIM loss could become the bottleneck of the whole system. Thus we proposed in Section 3.2 to approximate Eq. (1) and Eq. (2) by sub-sampling both the labeled and unlabeled identities in the denominators. We validate this approach here by training the framework with several sub-sampling sizes. The test mAP curves are demonstrated in Figure 6(a). In general, sub-sampling a small number of identities relaxes the training objective, which leads to slightly inferior performance but a much faster convergence rate. This indicates that our framework is scalable to larger datasets with even more identities, given a proper sub-sampling rate.
Low-dimensional subspace. We further investigate how the dimension of the L2-normalized feature vector affects person search performance. The results are summarized in Table 4. We observe that using the full global-pooled feature vector directly with L2-normalization leads to lower training error, but its test performance is worse. This suggests that projecting the features into a proper low-rank subspace is very important for regularizing the network training. In our experiments, a range of low dimensions gives similar test performance, and we choose a relatively low one to accelerate the computation of feature distances.
Detection Recall. We investigate how detection recall affects person search performance by using LOMO+XQDA as the re-id method and setting different thresholds on the detection scores. A lower threshold reduces misdetections (increases the recall) but results in more false alarms. We choose recall rates ranging up to the maximum value of each detector. The final person search mAP under each setting is demonstrated in Figure 6(b). An interesting observation is that higher recall does not necessarily lead to higher person search performance, which means the re-id method can still get confused by some false alarms. This again indicates that we should not focus solely on training re-id methods with manually cropped pedestrians, but should consider detection jointly under the person search problem setting.
Gallery size. Person search becomes more challenging as the gallery size increases. We evaluate several methods under different test gallery sizes, from small subsets up to the full set of images, following the protocols defined in Section 4.2. The test mAPs are demonstrated in Figure 6(c). Note that for each test query, the corresponding gallery images are randomly sampled from the whole set; all test images are covered even with small gallery sizes. The performance gaps among different methods shrink as the gallery size increases, indicating that all the methods may suffer from some common hard samples, and we could further improve performance with hard example mining.
In this paper, we propose a new deep learning framework for person search. It jointly handles detection and identification in a single CNN. An Online Instance Matching loss function is proposed to train the network effectively. Its non-parametric nature enables faster and better convergence, which is validated through a series of experiments.
Acknowledgements. This work is supported in part by SenseTime Group Limited, in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14213616, CUHK14206114, CUHK14205615, CUHK14207814, CUHK14203015, CUHK14239816, and CUHK419412, in part by the Hong Kong Innovation and Technology Support Programme Grant ITS/121/15FX, in part by the State Key Development Program under Grant 2016YFB1001004, in part by the Guangdong Natural Science Foundation under Grant No. 2014A030313201, and in part by National Natural Science Foundation of China under Grant 61371192.