Segmentation Mask Guided End-to-End Person Search

08/27/2019 · Dingyuan Zheng et al. · Beijing Jiaotong University and Xi'an Jiaotong-Liverpool University

Person search aims to search for a target person among multiple images recorded by multiple surveillance cameras, and faces challenges from both pedestrian detection and person re-identification. Besides the large intra-class variations caused by varying illumination conditions, occlusions and poses, background clutters in the detected pedestrian bounding boxes further deteriorate the extracted features for each person, making them less discriminative. To tackle these problems, we develop a novel approach which guides the network with segmentation masks so that discriminative features can be learned invariant to the background clutters. We demonstrate that joint optimization of pedestrian detection, person re-identification and pedestrian segmentation produces more discriminative features for pedestrians and consequently leads to better person search performance. Extensive experiments on the benchmark CUHK-SYSU dataset show that our proposed model achieves state-of-the-art performance with 86.3% mAP and 86.5% top-1 accuracy.


1 Introduction

Person re-identification has been widely applied in video surveillance systems with increasing demands in urban safety, and it has attracted great attention in the computer vision community during the last decade. Person re-identification is generally treated as a retrieval problem Zheng et al. (2016b); Hermans et al. (2017): given a probe image, it aims to find all the images in the gallery set with the same identity. However, person re-identification has not been fully solved, since the images captured by surveillance cameras typically suffer from illumination variations, occlusions and low resolution owing to the shooting environment. These challenges increase the intra-class variations and raise the recognition difficulty.

To this end, a great deal of research on person re-identification is devoted to extracting more discriminative features to represent human individuals, either hand-crafted features Liao et al. (2015); Gray and Tao (2008) or CNN features Wu et al. (2016); Wang et al. (2016). Most existing person re-identification methods operate on cropped pedestrian bounding boxes without considering background clutters. Specifically, human individuals are represented by features extracted from the regions within the detected pedestrian bounding boxes, and the Euclidean distance is computed to evaluate the similarity of each probe-gallery pair. As a result, different persons with similar backgrounds may lie close to each other in the learned feature space. For example, in Fig. 1, the person in the bounding box of the third image is different from the probe image. However, it is ranked before the person in the bounding box of the fourth image, who has the same identity as the probe, simply because its background is more similar to that of the probe image.


Figure 1: An example of the negative effect of background clutters on person re-identification performance. The blue box in the first column is the probe image. The other columns are the search results from rank-1 to rank-3. Green boxes indicate correct search results, while the red box indicates an incorrect result.

One straightforward yet effective solution to this problem is to make the foreground, i.e. the human body, the dominant region for feature extraction. Zhao et al. (2017) adopt a pose estimation approach to locate key body joints, and then aggregate the local features extracted from pre-defined body regions with the global features extracted from the whole image. Based on a similar idea, Tian et al. propose a person-region guided pooling network assisted by human parsing maps to alleviate the background bias problem Tian et al. (2018). Recently, researchers have also attempted to introduce attention mechanisms into person re-identification for pedestrian feature extraction Tay et al. (2019); Li et al. (2018a, b).

Person search aims to search for the target person among multiple images recorded by different surveillance cameras, where pedestrian bounding boxes are not available. Different from person re-identification, which assumes that most pedestrian bounding boxes are manually cropped or perfectly detected by state-of-the-art detectors such as Faster R-CNN Ren et al. (2015), person search handles the challenges of both pedestrian detection and re-identification. Specifically, in the pedestrian detection step, misalignments and false alarms caused by the detector further decrease the recognition rate Ouyang and Wang (2013, 2012). Meanwhile, person search also suffers from the aforementioned background clutters in the generated pedestrian bounding boxes.

In a recent work, Chen et al. adopt segmentation masks to alleviate the background clutter problem in person search Chen et al. (2018). Specifically, a two-stream model is established to extract pedestrian features, with one stream emphasizing the foreground information in the regions covered by the segmentation mask and the other stream retaining the global information of the original image. However, in Chen et al. (2018) the foreground regions are heuristically fixed; in other words, to what extent the background should be removed is fully determined by the pedestrian segmentation mask. Besides, this work handles pedestrian masking, pedestrian detection and person re-identification as separate steps, ignoring the fact that jointly optimizing them can bring further performance gains.

Inspired by previous works Xiao et al. (2016b, 2017, 2019) that solve the person search task in an end-to-end manner, we propose a novel end-to-end person search framework that uses segmentation masks to mitigate the negative effect of background clutters. Different from the previous work Chen et al. (2018), which designates the foreground regions explicitly by segmentation masks, we use the segmentation masks to guide the feature extraction network to learn enriched foreground features through a parallel mask branch. To this end, segmentation masks are precisely labeled in our newly created dataset. Our proposed approach jointly optimizes pedestrian detection, person re-identification and pedestrian segmentation, and obtains more discriminative pedestrian features by benefiting from end-to-end learning.

Our contributions are summarized as follows.

  • We propose a segmentation mask guided person search framework to mitigate the negative effect of background clutters in the detected pedestrian bounding boxes. The framework is trained end-to-end and exploits the inherent relations among pedestrian detection, person re-identification and pedestrian segmentation, so that more discriminative pedestrian features can be learned, which effectively enhances the person search performance.

  • We create a new dataset which contains precise pedestrian segmentation mask annotations for 1,833 images from the existing CUHK-SYSU dataset. The dataset will be released for future segmentation mask based person search research and can be downloaded from https://github.com/Dingyuan-Zheng/maskPS. Moreover, we find that our approach requires segmentation mask annotations for only a portion of the dataset rather than for the whole dataset.

  • Extensive experiments on the benchmark CUHK-SYSU dataset show that our proposed segmentation mask guided end-to-end person search framework outperforms a wide range of state-of-the-art person search methods, obtaining 86.3% mAP and 86.5% top-1 accuracy.

2 Related Work

In this section, we first review existing works for the two sub-tasks of person search, namely pedestrian detection and person re-identification. We then review recent achievements on person search.

2.1 Pedestrian Detection

Pedestrian detection has witnessed significant improvement in the past few decades. The first landmark work by Dalal and Triggs (2005) adopts the HOG+SVM architecture, and DPM Felzenszwalb et al. (2009) is then developed to better address the occlusion issue. After that, ICF Dollár et al. (2009) and its variants Zhang et al. (2014); Nam et al. (2014) outperform the previous hand-crafted-feature-based pedestrian detection methods. More recently, great progress has been made in general object detection thanks to convolutional neural networks Filali et al. (2016); Cholakkal et al. (2016); Dai et al. (2016); Lin et al. (2017); Girshick (2015); Ren et al. (2015). Further, Zhang et al. (2016) discuss the feasibility of Faster R-CNN for the pedestrian detection task. In this paper, we also adopt Faster R-CNN as our pedestrian detector.

2.2 Person Re-identification

With the great success of convolutional neural networks, researchers have proposed numerous deep learning based person re-identification solutions Zheng et al. (2017); Cheng et al. (2016); Varior et al. (2016); Xiao et al. (2016a); McLaughlin et al. (2016). A re-identification system is typically composed of two components: feature extraction and similarity metric learning. Some researchers attempt to improve person re-identification by enhancing the feature representation. For instance, in Yi et al. (2014); Li et al. (2014), the original image is horizontally split into patches, and part matching is then applied among the generated local patches. In Zhao et al. (2017), local features of the body sub-regions defined by pose estimation results are merged with whole-body features to improve the robustness of the final feature representation. Other researchers improve person re-identification through well-designed similarity metric learning: one category adopts verification losses, for example the contrastive loss Varior et al. (2016), triplet loss Liu et al. (2017b) or quadruplet loss Chen et al. (2017), while another category utilizes an identification loss Zheng et al. (2017, 2016a) or both Geng et al. (2016). In this paper, our person search framework is built upon the identification model.

2.3 Person Search

As an extension of conventional person re-identification, person search retrieves the target person from raw scene images, where pedestrian bounding boxes are not available Xu et al. (2014). In the pioneering work Xiao et al. (2017), Xiao et al. show that pedestrian detection and person identification can be solved in an end-to-end framework. Following this work, Xiao et al. Xiao et al. (2019) enhance the discriminability of pedestrian features by introducing the center loss. Liu et al. Liu et al. (2017a) recursively shrink the attentive regions until the target person is retrieved. In Munjal et al. (2019), the global context of query-gallery pairs is exploited by a query-guided region proposal network and a similarity sub-network in a siamese structure. Yan et al. (2019) further improve the person search performance by exploiting co-travelers as global context. A recent person search approach Chen et al. (2018) uses segmentation masks to separate the foreground person from the original input and aggregates the foreground and whole-image features extracted by a two-stream model. The authors also state that better person search performance can be achieved by solving pedestrian detection and identification separately with off-line pedestrian masks. Different from Chen et al. (2018), we jointly optimize these three tasks in an end-to-end framework. In particular, we use the segmentation mask to guide the network to learn the discriminative regions automatically rather than explicitly specifying these regions.

3 Proposed Method

In this section, we present our novel person search framework guided by partially labeled segmentation masks, as shown in Fig. 3. We first introduce our new dataset, which contains partially labeled segmentation masks, and then elaborate our end-to-end person search framework.

Figure 2: Two examples from our created dataset. We only provide segmentation masks for the labeled persons (persons with identities in [1, 5532] in the CUHK-SYSU dataset). The shaded regions in the first and third columns indicate the labeled persons; the second and fourth columns show their segmentation masks.

3.1 A New Dataset with Partially Labeled Segmentation Masks

To the best of our knowledge, all current segmentation-mask-based person search/re-identification approaches rely on off-line pedestrian masks generated by Fully Convolutional Networks (FCN) Long et al. (2015) or Fully Convolutional Instance-aware Semantic Segmentation (FCIS) Li et al. (2017), without considering the benefit of jointly optimizing the pedestrian segmentation and person re-identification tasks. To extract more discriminative features, mitigate the negative effect of background clutters, and build an end-to-end framework that jointly optimizes pedestrian detection, person identification and pedestrian segmentation, we create a new person search dataset that provides precise pedestrian segmentation mask annotations. We labeled the pedestrian segmentation masks for a portion of the images in the CUHK-SYSU dataset Xiao et al. (2016b).

The CUHK-SYSU dataset Xiao et al. (2016b) is a large-scale person search dataset collected from diverse scenes. Specifically, it contains 18,184 images; 6,057 query persons in 12,490 images are captured in the street, while the remaining 2,375 query persons in 5,694 images are collected from movies and dramas. The dataset is split into a training set and a test set, with no overlap of images or query persons between the two splits. The training split contains 11,206 images with 5,532 query persons, while the test split contains 6,978 images with 2,900 query persons. Each query person appears in at least two images. The dataset also contains two subsets to evaluate person search performance under low resolution and occlusion. The person identities of the training split are in the range [-1, 5532], where -1 indicates unlabeled persons and 0 indicates the background (non-person).

In our dataset, to guarantee a uniform data distribution, we divide the training set into N portions (N=2,241 in our case) with 5 images in each portion (except the last portion, which contains 6 images), randomly select one image from each portion, and filter out the images containing only unlabeled persons (persons labeled with -1). In the end, 1,833 images in the training set are selected for segmentation mask labeling, as sketched below.
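For concreteness, the following Python sketch illustrates one way the portion-based sampling described above could be implemented; the names train_image_ids and has_labeled_person are hypothetical helpers of ours, not the authors' released code.

```python
import random

def sample_images_for_mask_labeling(train_image_ids, has_labeled_person,
                                    portion_size=5, seed=0):
    """Draw one image per portion; drop images containing only unlabeled (-1) persons."""
    rng = random.Random(seed)
    n_portions = len(train_image_ids) // portion_size      # 11,206 // 5 = 2,241 portions
    selected = []
    for i in range(n_portions):
        start = i * portion_size
        # The last portion absorbs the remainder (6 images in this dataset).
        end = start + portion_size if i < n_portions - 1 else len(train_image_ids)
        candidate = rng.choice(train_image_ids[start:end])
        if has_labeled_person(candidate):
            selected.append(candidate)
    return selected
```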

In our observation, accessories such as handbags, luggage cases and baby carriages might act as suggestive context in person re-identification. As a consequence, we treat these objects as foreground during the mask annotation process. Note that we provide mask annotations only for the labeled persons (persons with identities in [1, 5532]) in a raw scene image. Samples of images with segmentation masks are shown in Fig. 2.

We use Labelme Wada (2016) as the annotation tool, and all segmentation masks follow unified annotation rules: when a person is occluded by non-person objects, we keep only the visible part of the occluded person, and accessories are kept as well. Statistics of our created dataset are given in Table 1. The selected 1,833 images from the CUHK-SYSU training set contain 9,084 pedestrians in total, with 2,815 labeled persons and 6,269 unlabeled persons; we annotate segmentation masks only for the labeled persons. The remaining 9,373 images in the training set contain 12,270 labeled persons and 33,918 unlabeled persons.

Dataset                 Number of images   Number of pedestrians
Images with masks       1,833              LP: 2,815    UP: 6,269
Images without masks    9,373              LP: 12,270   UP: 33,918

Table 1: Statistics of our created dataset. LP: labeled persons, UP: unlabeled persons. The labeled persons (2,815) in the selected 1,833 images are annotated with pedestrian segmentation masks.

3.2 Our Proposed Person Search Framework

Person search aims to retrieve the target person across raw scene images without pedestrian bounding boxes. Our proposed approach jointly optimizes three sub-tasks, namely pedestrian detection, person identification and pedestrian segmentation, in an end-to-end person search framework. Apart from the pedestrian detection module, which produces online pedestrian bounding boxes, and the person identification module, which categorizes person identities, we further establish a parallel pedestrian segmentation branch to predict pedestrian masks. Benefiting from the end-to-end optimization of the three tasks, more discriminative pedestrian features can be extracted. The overall schematic of the proposed segmentation masks guided end-to-end person search framework is shown in Fig. 3. The network is elaborated as follows.

Images of arbitrary size are resized such that the shorter side has 600 pixels. An image is then fed into the first part of the residual backbone network He et al. (2016). Specifically, we divide the residual network into two parts: for ResNet-50, the first part contains the layers from Conv1 to Res4, and the remaining Res5 forms the second part.
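As an illustration of this split, a minimal sketch using torchvision's ResNet-50 (our own construction under the stated assumption, not the authors' code) could look as follows:

```python
import torch.nn as nn
import torchvision

def build_backbone_parts():
    """Part 1 (Conv1-Res4) feeds RPN and ROIAlign; Part 2 (Res5) processes pooled ROI features."""
    resnet = torchvision.models.resnet50(pretrained=True)  # ImageNet initialization, as in Sec. 4.2
    part1 = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
                          resnet.layer1, resnet.layer2, resnet.layer3)
    part2 = resnet.layer4
    return part1, part2
```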

To address pedestrian detection, we adopt the region proposal network (RPN) Ren et al. (2015) to produce online pedestrian proposals. The RPN is trained with a cross-entropy loss to distinguish pedestrians from background, which we express as:

$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1-y_i)\log(1-p_i)\right], \qquad (1)$$

where $N$ is the number of generated proposals, $p_i$ is the prediction score of the $i$-th proposal and $y_i$ is its ground truth label indicating person or non-person. We use the Smoothed-L1 loss Girshick (2015), $L_{reg}$, to regress the precise location of each pedestrian:

$$L_{reg} = \frac{1}{N_{pos}}\sum_{i \in pos}\mathrm{smooth}_{L1}\left(t_i - t_i^{*}\right), \qquad (2)$$

where $t_i - t_i^{*}$ denotes the coordinate differences between a predicted box and its related ground truth location, and $pos$ denotes the $N_{pos}$ proposals associated with pedestrians; these two losses together are denoted as $L_{rpn}$. The generated candidate boxes are associated with either the background or a foreground part (a ground truth bounding box). Since we only provide mask annotations for the labeled persons (persons with identities in [1, 5532]) in a raw scene image, the candidate boxes associated with foreground parts are consequently divided into two types: candidate boxes associated with labeled persons, denoted as proposals with mask, and candidate boxes associated with unlabeled persons, denoted as proposals without mask, as shown in Fig. 3.
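The following sketch shows how the two RPN losses in Eqs. (1)-(2) can be computed with standard PyTorch functions; the tensor shapes and the positive-only regression are our assumptions about a typical Faster R-CNN-style implementation.

```python
import torch.nn.functional as F

def rpn_losses(cls_logits, labels, box_deltas, box_targets):
    """cls_logits: (N, 2) person/background scores; labels: (N,) with 1 = person, 0 = background;
    box_deltas / box_targets: (N, 4) predicted and ground-truth regression offsets."""
    l_cls = F.cross_entropy(cls_logits, labels)                              # Eq. (1)
    pos = labels == 1                                                        # regress person proposals only
    l_reg = (F.smooth_l1_loss(box_deltas[pos], box_targets[pos])             # Eq. (2)
             if pos.any() else cls_logits.sum() * 0)
    return l_cls + l_reg                                                     # together denoted L_rpn
```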

All the proposals generated by the RPN, together with the feature maps produced by the first part of the residual network, are fed into the ROIAlign layer He et al. (2017) to produce a fixed-size feature map for each ROI.

For person identification, once the fixed-size feature maps are obtained, they are further convolved by the second part of the residual network, and the output feature maps $F \in \mathbb{R}^{C \times H \times W}$ are summarized into 2,048-dimensional feature vectors $f$ through an average pooling layer, where $C$ is the channel width and $H \times W$ denotes the size of the feature maps. To further reduce the false alarms produced by the RPN and refine the predicted locations of the candidate pedestrians, $f$ is fed into two fully connected layers and again supervised by the classification and regression losses; following Xiao et al. (2017), we denote these two losses together as $L_{frcnn}$. Besides, $f$ is projected into a 256-dimensional feature vector $x$ through a third fully connected layer followed by L2-normalization, and $x$ is used as the final feature representation of each retrieved pedestrian. In the training phase, we adopt the OIM loss Xiao et al. (2017) to supervise the person identification module, where $p_t$ indicates the probability of the identification feature $x$ belonging to the $t$-th class:

$$p_t = \frac{\exp\!\left(v_t^{\top} x / \tau\right)}{\sum_{j=1}^{L}\exp\!\left(v_j^{\top} x / \tau\right) + \sum_{k=1}^{Q}\exp\!\left(u_k^{\top} x / \tau\right)}. \qquad (3)$$

Here $\tau$ is a parameter that controls the softness of the probability function. The features of the labeled identities are stored in a lookup table $V \in \mathbb{R}^{256 \times L}$, with $v_i$ denoting the current feature of class $i$ among the $L$ = 5,532 categories, and the table is continuously updated during the training phase as follows:

$$v_t \leftarrow \gamma v_t + (1-\gamma)\,x, \qquad (4)$$

where $\gamma$ is a momentum parameter used to adjust the update rate. The features of the unlabeled persons are stored in a circular queue $U \in \mathbb{R}^{256 \times Q}$, where $u_k$ indicates the feature of the $k$-th unlabeled person. The objective of the OIM loss is to maximize the expected log-likelihood, so the identification loss is defined as the negative log-likelihood:

$$L_{oim} = -\,\mathbb{E}_{x}\left[\log p_t\right]. \qquad (5)$$
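To make Eqs. (3)-(5) concrete, here is a simplified OIM-style module in PyTorch. The queue size, tau and gamma values are placeholders of ours, and the circular-queue update for unlabeled persons is omitted, so this is only a sketch of the mechanism rather than the exact OIM implementation of Xiao et al. (2017).

```python
import torch
import torch.nn.functional as F

class OIMLossSketch(torch.nn.Module):
    def __init__(self, feat_dim=256, num_ids=5532, queue_size=5000, tau=0.1, gamma=0.5):
        super().__init__()
        self.tau, self.gamma = tau, gamma
        self.register_buffer("lut", torch.zeros(num_ids, feat_dim))       # labeled identities (Eq. 3/4)
        self.register_buffer("queue", torch.zeros(queue_size, feat_dim))  # circular queue of unlabeled persons

    def forward(self, feats, targets):
        # feats: (B, 256) L2-normalized identification features; targets: (B,) ids in [0, 5531], -1 = unlabeled
        sims = feats @ torch.cat([self.lut, self.queue]).t() / self.tau   # logits of Eq. (3)
        labeled = targets >= 0
        loss = F.cross_entropy(sims[labeled], targets[labeled])           # negative log-likelihood, Eq. (5)
        with torch.no_grad():                                             # momentum update, Eq. (4)
            for f, t in zip(feats[labeled], targets[labeled]):
                self.lut[t] = F.normalize(self.gamma * self.lut[t] + (1 - self.gamma) * f, dim=0)
        return loss
    # (Pushing unlabeled features into the circular queue is omitted for brevity.)
```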

Most importantly, in order to improve the discriminability of $x$, we establish a parallel mask branch on top of the shared feature maps. Specifically, we pick out the feature maps associated with labeled persons and use them to predict a segmentation mask of size $m \times m$ for each proposal with mask ($m$ equals 7 in our case). The predicted masks are supervised by a binary cross-entropy loss He et al. (2017), which can be written as:

$$L_{mask} = -\frac{1}{N_m N_p}\sum_{j=1}^{N_m}\sum_{i=1}^{N_p}\left[y_i^{j}\log p_i^{j} + \left(1-y_i^{j}\right)\log\left(1-p_i^{j}\right)\right], \qquad (6)$$

where $p_i^{j}$ denotes the probability of the $i$-th pixel in the $j$-th predicted mask being recognized as foreground, $y_i^{j}$ is its related ground truth, $N_p$ is the number of pixels in a predicted pedestrian mask ($N_p = m \times m$), and $N_m$ is the number of proposals with mask.
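Since the mask targets are only $m \times m$ binary grids, Eq. (6) reduces to an averaged binary cross entropy over the proposals with mask; a minimal sketch with assumed tensor shapes:

```python
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks):
    """mask_logits, gt_masks: (N_m, m, m) predicted logits and binary targets for the proposals with mask."""
    # Averages over all N_m * m * m pixels, matching Eq. (6).
    return F.binary_cross_entropy_with_logits(mask_logits, gt_masks.float())
```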

Finally, we adopt a multi-task loss to train our person search framework in an end-to-end manner. The total loss is defined as:

$$L = L_{rpn} + L_{frcnn} + L_{oim} + \lambda L_{mask}, \qquad (7)$$

where $\lambda = 1$ when the input image contains labeled segmentation masks, and $\lambda = 0$ otherwise.
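In code, the switch $\lambda$ of Eq. (7) simply amounts to skipping the mask term for images without labeled masks; a minimal sketch:

```python
def total_loss(l_rpn, l_frcnn, l_oim, l_mask, has_labeled_masks):
    lam = 1.0 if has_labeled_masks else 0.0   # Eq. (7): lambda = 1 only for images with labeled masks
    return l_rpn + l_frcnn + l_oim + lam * l_mask
```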

With the assistance of the partially labeled segmentation masks, our proposed person search framework generates more discriminative features invariant to background clutters, compared with the previous segmentation-mask-based state-of-the-art approach Chen et al. (2018).


Figure 3: The schematic of our proposed segmentation masks guided person search framework. The model is trained end-to-end with a multi-task loss. We adopt an RPN to generate candidate boxes, and we denote the proposals associated with labeled persons as proposals with mask and the proposals associated with unlabeled persons as proposals without mask, since we label segmentation masks only for the labeled persons in a raw image. Feature vectors of all candidate boxes contribute to the regression, classification and identification losses, whereas only the feature maps of the proposals with mask are fed into the mask branch.

4 Experimental Results

In this section, we first introduce the dataset and evaluation metrics, followed by implementation details. We then compare our proposed method with previous state-of-the-art results. Finally, our proposed person search framework is verified in an ablation study.

4.1 Dataset and Evaluation Metrics

We use our newly labeled CUHK-SYSU dataset, introduced in Sec. 3.1, in our experiments. We adopt both mean average precision (mAP) and top-1 matching rate to evaluate all our experiments, similar to Xiao et al. (2017). A match is accepted only if the overlap between the detected pedestrian bounding box and the ground truth bounding box is larger than a pre-defined intersection-over-union (IoU) threshold of 0.5.
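The IoU test used for accepting a detection as a match can be written as follows; this is a standard implementation sketch, not code specific to this paper.

```python
def iou(box_a, box_b):
    """Boxes in (x1, y1, x2, y2) format; a detection matches a ground-truth box only if iou > 0.5."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```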

4.2 Implementation Details

Training Phase.

During training, we initialize our residual backbone with the ImageNet-pretrained ResNet-50 model and adopt SGD as the optimizer; the initial learning rate is decayed by a constant factor at fixed epoch intervals. Because of the large memory consumption of the Faster R-CNN framework Ren et al. (2015), we use a small batch size throughout the training epochs. All our experiments are implemented in PyTorch on a Titan X Pascal GPU.

Note that the shorter side of the input images is resized to 600 pixels, and we augment the training data by horizontally flipping the training images together with their ground truth bounding boxes and ground truth masks, as sketched below. In particular, the ground truth masks are resized to $m \times m$ to match the masks generated by the mask branch. For the mask branch, we adopt an architecture similar to that of Mask R-CNN He et al. (2017). The aforementioned losses are used jointly to supervise the training process.
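A sketch of the horizontal-flip augmentation applied jointly to an image, its ground truth boxes and its ground truth masks; the array layouts are our assumptions, not the authors' data format.

```python
import numpy as np

def horizontal_flip(image, boxes, masks):
    """image: (H, W, 3); boxes: (N, 4) in (x1, y1, x2, y2); masks: (N, H, W) binary masks."""
    h, w = image.shape[:2]
    image = image[:, ::-1]              # flip image left-right
    masks = masks[:, :, ::-1]           # flip masks consistently
    flipped = boxes.copy()
    flipped[:, 0] = w - boxes[:, 2]     # new x1 = W - old x2
    flipped[:, 2] = w - boxes[:, 0]     # new x2 = W - old x1
    return image, flipped, masks
```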

Inference Phase. At test time, the shorter side of both query and gallery images is resized to 600 pixels, as in the training process. We use the features generated from the last residual block (Res5) to represent each pedestrian, whether the probe person or the persons detected in the gallery set. The Euclidean distance is then computed for each probe-gallery pair to assess their similarity, as sketched below.
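At inference, ranking the gallery then reduces to sorting probe-gallery Euclidean distances; a sketch with assumed tensor shapes:

```python
import torch

def rank_gallery(probe_feat, gallery_feats):
    """probe_feat: (D,) probe feature; gallery_feats: (M, D) features of detected gallery persons."""
    dists = torch.cdist(probe_feat.unsqueeze(0), gallery_feats).squeeze(0)  # (M,) Euclidean distances
    return torch.argsort(dists)  # gallery indices from most to least similar
```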

4.3 Comparison with State-of-the-Art Approaches

In this subsection, we report the person search performance of our model on our newly labeled CUHK-SYSU person search dataset and compare it with several state-of-the-art approaches, including methods that optimize pedestrian detection and identification jointly (OIM Xiao et al. (2017), IAN Xiao et al. (2019), NPSM Liu et al. (2017a), QEEPS Munjal et al. (2019) and GCNPS Yan et al. (2019)), as well as methods that solve pedestrian detection and person identification separately (DSIFT+Euclidean Zhao et al. (2013), DSIFT+KISSME Koestinger et al. (2012), BoW Zheng et al. (2015) + cosine similarity, LOMO+XQDA Liao et al. (2015), and MGTS Chen et al. (2018)).

4.3.1 Overall Person Search Performance on CUHK-SYSU

The comparative results with gallery size 100 are summarized in Table 2. We follow the notation defined in Chen et al. (2018) and Yan et al. (2019), where "CNN" denotes the Faster R-CNN detector with a ResNet-50 backbone.

The methods above the dashed line handle pedestrian detection and person identification separately. It can be observed that the deep CNN based pedestrian features Chen et al. (2018) achieve better performance than hand-crafted features Zhao et al. (2013); Koestinger et al. (2012); Zheng et al. (2015). CNN+MGTS Chen et al. (2018) also utilizes segmentation masks to produce more discriminative features by filtering out the background, and achieves the best performance among the methods that address pedestrian detection and person identification separately. Our proposed method instead uses the segmentation mask to guide the network to extract discriminative pedestrian features rather than explicitly specifying the foreground regions, and optimizes pedestrian detection and person identification jointly. Our framework achieves about a 3% gain over Chen et al. (2018) on both mAP and top-1 matching rate.

All the joint methods (below the dashed line) are built upon the Faster R-CNN Ren et al. (2015) framework, among which OIM Xiao et al. (2017) can be regarded as the baseline. The major distinction between our method and OIM Xiao et al. (2017) is the added pedestrian segmentation mask branch. We achieve a significant performance improvement of 10.8% mAP and 7.8% top-1 over Xiao et al. (2017), which demonstrates the value of the pedestrian segmentation masks and the newly labeled dataset. The other methods Xiao et al. (2019); Liu et al. (2017a); Munjal et al. (2019); Yan et al. (2019) are state-of-the-art person search approaches with good performance: IAN Xiao et al. (2019) improves person search by introducing the center loss to reduce intra-class variations; NPSM Liu et al. (2017a) recursively shrinks the search area; QEEPS Munjal et al. (2019) builds a strong person search framework by learning query-guided global context; and Yan et al. (2019) utilize a GCN to explore the impact of context persons on the person search task. Nevertheless, we still achieve about a 2% gain on both mAP and top-1 accuracy over Munjal et al. (2019), and about a 2% improvement on mAP over Yan et al. (2019), which proves the effectiveness of our method.

The visualization of person search results on the CUHK-SYSU dataset is shown in Fig. 5. In each group, the upper row shows the search results of OIM Xiao et al. (2017) and the lower row shows those of our model. The persons in the bounding boxes of the third and fourth images in the upper rows of groups (a) and (b), as well as the person in the bounding box of the third image in the upper row of group (c), are different from their probe images. However, these persons are ranked before persons who share the same identities as the probe images, simply because their backgrounds are more similar to those of the probe images. With the assistance of the partially labeled segmentation masks, our model focuses on the foreground and can distinguish persons based on detailed textural information rather than background noise.

Method mAP(%) top-1(%)
CNN + DSIFT + Euclidean Zhao et al. (2013) 34.5 39.4
CNN + DSIFT + KISSME Zhao et al. (2013)Koestinger et al. (2012) 47.8 53.6
CNN + BoW + Cosine Zheng et al. (2015) 56.9 62.3
CNN + LOMO + XQDA Liao et al. (2015) 68.9 74.1
CNN + MGTS Chen et al. (2018) 83.0 83.7
------------------------------------------------------------
OIM Xiao et al. (2017) 75.5 78.7
IAN(ResNet-50) Xiao et al. (2019) 76.3 80.1
NPSM Liu et al. (2017a) 77.9 81.2
QEEPS Munjal et al. (2019) 84.4 84.4
GCNPS Yan et al. (2019) 84.1 86.5
Ours 86.3 86.5
Table 2: Comparison with the state-of-the-art on the CUHK-SYSU dataset with a gallery size of 100.

4.3.2 Impact of Gallery Size

Each gallery image in the CUHK-SYSU dataset contains several pedestrians on average, so with a gallery size of 100, person search aims to retrieve each target person from several hundred candidate pedestrians, and the task becomes more challenging as the gallery size grows. We therefore also report the performance of our model with various gallery sizes, namely [50, 100, 500, 1,000, 2,000, 4,000]. The results are shown in Fig. 4. As expected, the person search performance of all methods drops with increasing gallery size, while our framework remains superior to the other approaches across all gallery sizes.

Figure 4: Person search performance comparison on the CUHK-SYSU dataset with different gallery sizes [50, 100, 500, 1,000, 2,000, 4,000]. (a) Model with ResNet-50 backbone. (b) Model with ResNet-101 backbone.

4.3.3 Impact of Occlusion and Low Resolution

Person search becomes even harder when pedestrians are occluded or the captured images have low resolution. Therefore, to demonstrate the robustness of our method, we further evaluate our model on two subsets: one containing target persons with occlusion and the other containing target persons at low resolution. The results are shown in Table 3. We follow the notation defined in Xiao et al. (2019), where "Whole" denotes the full set containing 2,900 probe images. We observe that the performance degrades under these two extreme conditions compared with the full set. However, our person search framework still outperforms the other approaches Xiao et al. (2016b); Xiao et al. (2019).

Method Low-Res Occlusion Whole
mAP(%) top-1(%) mAP(%) top-1(%) mAP(%) top-1(%)
E2E-PS(VGGNet) 46.1 51.0 44.3 45.4 69.6 72.9
E2E-PS(Res-101) 47.9 52.0 47.7 48.1 74.2 78.1
IAN(Res-101) 52.6 54.4 53.0 54.5 77.2 80.4
Ours(Res-50) 66.7 66.8 70.8 71.3 86.3 86.5
Table 3: Person search performance on low resolution and occlusion subsets.

4.4 Ablation Study

With the assistance of the newly labeled dataset, our proposed person search framework produces more discriminative features by utilizing the partially labeled segmentation masks. To evaluate the effectiveness of our approach, we report the person search performance as we progressively increase the number of images with segmentation masks. The results are shown in Table 4, where the proportion of training images with segmentation masks is varied. In total, 1,833 images are labeled with segmentation masks, which accounts for around 16% of the 11,206 training images; when all 1,833 images are used for training, we denote the setting as "Full". There is an obvious gain when 12% of the images carry segmentation masks, and the performance tends to saturate once 15% of the images are used, which is why we only label 16% of all the images.

Proportion of masked images   3%     6%     9%     12%    15%    Full
mAP(%)                        85.1   85.3   85.3   86.1   86.3   86.3
top-1(%)                      85.2   85.4   85.7   86.1   86.5   86.5

Table 4: Person search performance on the CUHK-SYSU dataset with various proportions of training images carrying segmentation masks.
Figure 5: Three groups of top-3 comparison results for person search on the CUHK-SYSU dataset. The upper row in each group shows the search results of OIM Xiao et al. (2017), and the lower row shows the search results of our model (both models adopt ResNet-50 as backbone). The blue boxes in the first column indicate the probe images, and the green boxes in the other columns indicate their top-3 search results. Best viewed in color.

5 Conclusion

Person search handles the challenges of both pedestrian detection and person identification, and inevitably introduces background clutters into the detected candidate boxes. To address this problem, with the assistance of our newly created dataset, which contains labeled segmentation masks for a portion of the images in the existing CUHK-SYSU dataset, we propose a novel segmentation mask guided person search framework to extract more discriminative and robust features invariant to background clutters for each human individual. Moreover, our person search framework is trained end-to-end, which shows that joint optimization of pedestrian detection, person re-identification and pedestrian segmentation is an effective solution for person search. Finally, extensive experiments show that our proposed method achieves state-of-the-art performance on the CUHK-SYSU dataset.

References

  • D. Chen, S. Zhang, W. Ouyang, J. Yang, and Y. Tai (2018) Person search via a mask-guided two-stream cnn model. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: §1, §1, §2.3, §3.2, §4.3.1, §4.3.1, §4.3, Table 2.
  • W. Chen, X. Chen, J. Zhang, and K. Huang (2017) Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 403–412. Cited by: §2.2.
  • D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1335–1344. Cited by: §2.2.
  • H. Cholakkal, J. Johnson, and D. Rajan (2016) A classifier-guided approach for top-down salient object detection. Signal Processing: Image Communication 45, pp. 24–40. Cited by: §2.1.
  • J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §2.1.
  • N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. Cited by: §2.1.
  • P. Dollár, Z. Tu, P. Perona, and S. Belongie (2009) Integral channel features. Cited by: §2.1.
  • P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2009) Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32 (9), pp. 1627–1645. Cited by: §2.1.
  • I. Filali, M. S. Allili, and N. Benblidia (2016) Multi-scale salient object detection using graph ranking and global–local saliency refinement. Signal Processing: Image Communication 47, pp. 380–401. Cited by: §2.1.
  • M. Geng, Y. Wang, T. Xiang, and Y. Tian (2016) Deep transfer learning for person re-identification. arXiv preprint arXiv:1611.05244. Cited by: §2.2.
  • R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §2.1, §3.2.
  • D. Gray and H. Tao (2008) Viewpoint invariant pedestrian recognition with an ensemble of localized features. In European conference on computer vision, pp. 262–275. Cited by: §1.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §3.2, §3.2, §4.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.
  • A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §1.
  • M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof (2012) Large scale metric learning from equivalence constraints. In 2012 IEEE conference on computer vision and pattern recognition, pp. 2288–2295. Cited by: §4.3.1, §4.3, Table 2.
  • S. Li, S. Bak, P. Carr, and X. Wang (2018a) Diversity regularized spatiotemporal attention for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 369–378. Cited by: §1.
  • W. Li, R. Zhao, T. Xiao, and X. Wang (2014) Deepreid: deep filter pairing neural network for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 152–159. Cited by: §2.2.
  • W. Li, X. Zhu, and S. Gong (2018b) Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2285–2294. Cited by: §1.
  • Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei (2017) Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2359–2367. Cited by: §3.1.
  • S. Liao, Y. Hu, X. Zhu, and S. Z. Li (2015) Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2197–2206. Cited by: §1, Table 2.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §2.1.
  • H. Liu, J. Feng, Z. Jie, K. Jayashree, B. Zhao, M. Qi, J. Jiang, and S. Yan (2017a) Neural person search machines. In Proceedings of the IEEE International Conference on Computer Vision, pp. 493–501. Cited by: §2.3, §4.3.1, §4.3, Table 2.
  • H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan (2017b) End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing 26 (7), pp. 3492–3506. Cited by: §2.2.
  • J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §3.1.
  • N. McLaughlin, J. Martinez del Rincon, and P. Miller (2016) Recurrent convolutional network for video-based person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1325–1334. Cited by: §2.2.
  • B. Munjal, S. Amin, F. Tombari, and F. Galasso (2019) Query-guided end-to-end person search. arXiv preprint arXiv:1905.01203. Cited by: §2.3, §4.3.1, §4.3, Table 2.
  • W. Nam, P. Dollár, and J. H. Han (2014) Local decorrelation for improved pedestrian detection. In Advances in Neural Information Processing Systems, pp. 424–432. Cited by: §2.1.
  • W. Ouyang and X. Wang (2012) A discriminative deep model for pedestrian detection with occlusion handling. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3258–3265. Cited by: §1.
  • W. Ouyang and X. Wang (2013) Joint deep learning for pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2056–2063. Cited by: §1.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2.1, §3.2, §4.2, §4.3.1.
  • C. Tay, S. Roy, and K. Yap (2019) AANet: attribute attention network for person re-identifications. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7134–7143. Cited by: §1.
  • M. Tian, S. Yi, H. Li, S. Li, X. Zhang, J. Shi, J. Yan, and X. Wang (2018) Eliminating background-bias for robust person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5794–5803. Cited by: §1.
  • R. R. Varior, M. Haloi, and G. Wang (2016) Gated siamese convolutional neural network architecture for human re-identification. In European conference on computer vision, pp. 791–808. Cited by: §2.2.
  • K. Wada (2016) labelme: Image Polygonal Annotation with Python. Note: https://github.com/wkentaro/labelme Cited by: §3.1.
  • F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang (2016) Joint learning of single-image and cross-image representations for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1288–1296. Cited by: §1.
  • L. Wu, C. Shen, and A. v. d. Hengel (2016) Personnet: person re-identification with deep convolutional neural networks. arXiv preprint arXiv:1601.07255. Cited by: §1.
  • J. Xiao, Y. Xie, T. Tillo, K. Huang, Y. Wei, and J. Feng (2019) IAN: the individual aggregation network for person search. Pattern Recognition 87, pp. 332–340. Cited by: §1, §2.3, §4.3.1, §4.3.3, §4.3, Table 2.
  • T. Xiao, H. Li, W. Ouyang, and X. Wang (2016a) Learning deep feature representations with domain guided dropout for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1249–1258. Cited by: §2.2.
  • T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang (2016b) End-to-end deep learning for person search. arXiv preprint arXiv:1604.01850 2, pp. 2. Cited by: §1, §3.1, §3.1, §4.3.3.
  • T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang (2017) Joint detection and identification feature learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3415–3424. Cited by: §1, §2.3, §3.2, Figure 5, §4.1, §4.3.1, §4.3.1, §4.3, Table 2.
  • Y. Xu, B. Ma, R. Huang, and L. Lin (2014) Person search in a scene by jointly modeling people commonness and person uniqueness. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 937–940. Cited by: §2.3.
  • Y. Yan, Q. Zhang, B. Ni, W. Zhang, M. Xu, and X. Yang (2019) Learning context graph for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2158–2167. Cited by: §4.3.1, §4.3.1, §4.3, Table 2.
  • D. Yi, Z. Lei, S. Liao, and S. Z. Li (2014) Deep metric learning for person re-identification. In 2014 22nd International Conference on Pattern Recognition, pp. 34–39. Cited by: §2.2.
  • L. Zhang, L. Lin, X. Liang, and K. He (2016) Is faster r-cnn doing well for pedestrian detection?. In European conference on computer vision, pp. 443–457. Cited by: §2.1.
  • S. Zhang, C. Bauckhage, and A. B. Cremers (2014) Informed haar-like features improve pedestrian detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 947–954. Cited by: §2.1.
  • H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang (2017) Spindle net: person re-identification with human body region guided feature decomposition and fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1077–1085. Cited by: §1, §2.2.
  • R. Zhao, W. Ouyang, and X. Wang (2013) Unsupervised salience learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3586–3593. Cited by: §4.3.1, §4.3, Table 2.
  • L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian (2016a) Mars: a video benchmark for large-scale person re-identification. In European Conference on Computer Vision, pp. 868–884. Cited by: §2.2.
  • L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In Proceedings of the IEEE international conference on computer vision, pp. 1116–1124. Cited by: §4.3.1, §4.3, Table 2.
  • L. Zheng, Y. Yang, and A. G. Hauptmann (2016b) Person re-identification: past, present and future. arXiv preprint arXiv:1610.02984. Cited by: §1.
  • L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian (2017) Person re-identification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1367–1376. Cited by: §2.2.