Person re-identification (ReID) involves spotting a specific person of interest, e.g. a missing child, across disjoint camera views. Due to the widespread deployment of modern surveillance systems, ReID has attracted increasing attention from both academia and industry [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. Most existing ReID approaches assume that the pedestrian’s entire body is visible, and tend to ignore the more challenging occlusion situations. However, in real-world applications, pedestrians are very often occluded by objects or other pedestrians.
Occlusion poses a major challenge for ReID, as it alters the appearance of pedestrians. As illustrated in Fig. 1(a), similar occlusions reduce inter-class distance, meaning that images of different identities may have similar visual features. Moreover, as shown in Fig. 1(b), different occlusions enlarge intra-class distance, meaning that two images of the same pedestrian may differ considerably in appearance, because occlusions vary in both location and content. Both effects tend to produce incorrect retrieval results.
Existing occluded ReID methods usually rely on outside tools to acquire the visibility cues of body parts (e.g. the prediction confidence of pose estimation models). Besides incurring extra computational cost, this strategy may not be robust to complex occlusions, such as occlusions between pedestrians. As illustrated in Fig. 1(c), both human parsing and pose estimation tools may fail when facing occlusions between pedestrians. Moreover, visibility is not equivalent to discriminative power. On the one hand, a visible part may look quite similar across different pedestrians. For example, nearly all pedestrians’ forearms are bare in images captured in summer. On the other hand, invisible body parts may be occluded by discriminative accessories, such as backpacks and bags, as shown in Fig. 1(d). These accessories are critical for ReID, but tend to be ignored by outside tools. It is therefore reasonable to seek a robust and easy-to-use method that automatically infers and utilizes discriminative body parts to handle the occlusion problem.
Accordingly, in this paper, we propose a novel framework named QPM for occluded ReID. QPM includes a part branch and a global branch. The part branch automatically infers part-specific quality scores rather than visibility scores. More specifically, it jointly learns discriminative part features and predicts part quality scores in an end-to-end fashion. This is achieved by including both pair-wise part distances and pair-wise part quality scores in the triplet loss; as a result, the part branch can automatically assign low quality scores to poor-quality body parts in order to weaken their influence. Another key benefit of our approach is that it is independent of any outside tools and does not require annotations of quality scores. However, when a pedestrian is occluded by other persons, the model may predict a relatively high score for occluded regions, because it cannot reliably differentiate the body parts of different pedestrians. We solve this problem in the global branch.
The global branch includes two main components. First, we propose an identity-aware spatial attention (ISA) module based on the predicted part quality scores. In this module, a coarse identity-aware feature is processed by a simple two-layer network and optimized using the cross-entropy loss function. Then, it is utilized to suppress noisy responses and highlight responses from the body region of the pedestrian needing to be identified. Therefore, it can be used to handle occlusions between pedestrians. Second, we design an adaptive and efficient approach to generate global features from the common non-occluded regions with respect to each image pair. By contrast, existing works typically extract fixed global features, ignoring the difference in occlusion locations between a pair of images.
In the inference stage, both the part and global features are utilized to calculate the similarity score between each pair of images. The weighted average of both scores represents the overall similarity of an image pair. We conduct extensive experiments on four popular datasets for occluded ReID, i.e. Partial-iLIDS, Partial-REID, Occluded-Duke, and P-DukeMTMC. The results show that our simple QPM model consistently outperforms existing approaches by significant margins. Moreover, our approach has the further advantages of being robust and easy to use.
In summary, the main contributions of this paper are as follows:
We propose an end-to-end framework that jointly learns discriminative features and predicts part quality scores. Compared with existing works, it does not rely on any outside tools in either the training or inference stages.
We propose a novel identity-aware spatial attention (ISA) approach that efficiently handles the occlusion between pedestrians. Experimental results show that it outperforms existing spatial attention methods.
We introduce an Adaptive Global Feature Extraction (AGFE) module that extracts global features from the common non-occluded regions for each image pair, which significantly improves ReID performance.
The remainder of this paper is organized as follows. We first review the related works in Section II. Then, we describe the proposed QPM in more detail in Section III. Extensive experimental results on four benchmarks are reported and analyzed in Section IV, after which the conclusions of the present work are outlined in Section V.
II Related Work
II-A Occluded Person ReID Models
One main challenge for occluded ReID is to identify visible body regions. Most existing works utilize visibility cues provided by outside tools [21, 22, 6, 23, 24, 25]. Here, we divide occluded ReID methods into two categories depending on whether or not outside tools are required during training and testing.
The first category of methods employs outside tools in both the training and testing stages [14, 15, 16]. For example, Miao et al. utilized pose landmarks to identify visible local patches and adopted only the commonly visible local patches of an image pair for matching. Wang et al. utilized pose landmarks to learn high-order relation and topology information of the visible local features, so as to better match probe and gallery images. Gao et al. employed graph matching and utilized pose landmarks to self-mine part visibility scores; they then matched probe and gallery images by calculating part-to-part distances in visible regions. However, in addition to the extra computational cost, a key downside of these approaches is that external tools may not be reliable when encountering complex occlusions, as illustrated in Fig. 1(c-d).
The second category of methods relies on outside tools during training only. For example, one such work designed an occlusion-sensitive foreground probability generator that enables the model to focus on non-occluded human body parts. He et al. further combined pose landmarks and human masks to generate spatial attention maps that guide discriminative feature learning. These approaches can reduce the impact of occlusions on the extracted features. Despite the convenience they afford during testing, they still rely on body-part visibility information for each training image.
Moreover, two very recent works [30, 31] try to predict visibility scores without outside tools. VPM classifies each pixel in a partial image into one body part, assuming that the partial image contains no occlusion. It is therefore designed for the partial ReID problem and cannot be directly used to solve the occluded ReID problem. ISP performs cascaded clustering on feature maps to generate pseudo-labels of human parts for each pixel. However, ISP contains no strategy to handle occlusions between pedestrians. In addition, the clustering process is time-consuming.
In a departure from existing works, our approach aims to predict part quality rather than visibility in an end-to-end framework. We do not employ any outside tools or require any quality annotations in either the training or inference stages. Moreover, our approach is robust to complex occlusions. QPM is thus both effective and efficient.
II-B Part-based Person ReID Models
Due to their powerful representation ability, part-based methods are popular for ReID. Depending on how body part locations are obtained, we divide existing works into three categories.
Fixed Location-based Methods. Methods in this category typically split the output feature maps of one backbone model into several stripes in fixed locations [3, 32, 33]. Part features are then extracted from the respective stripes. For example, Sun et al. uniformly divided the output feature maps into 6 horizontal stripes to represent different part-level features. Wang et al. also partitioned one image into horizontal stripes. The main advantage of this strategy lies in its efficiency.
Outside Tools-based Methods. These methods utilize outside tools, e.g. pose estimation [34, 25, 23, 35] and human parsing models [21, 6], to detect body parts. They then extract part features from the detected body parts. There are two key downsides of these approaches: first, they require additional computational cost; second, ReID performance is vulnerable to the reliability of outside tools.
Attention-based Methods. These methods predict body part locations based on the feature maps produced by ReID models [36, 2, 37, 38]. For example, Zhao et al. proposed to predict a set of masks and perform element-wise multiplication between one mask and each channel of the feature maps to produce part-specific features. In comparison, Li et al. designed a hard regional attention model that can predict bounding boxes for each body part. However, the lack of explicit supervision for part alignment may make attention models difficult to optimize.
In part-based ReID methods, it is a common practice to concatenate part features as the final representation [23, 3, 33, 36, 39, 40, 41, 42]. However, this practice is less effective for occluded ReID, as it ignores the impact of features from occluded parts. In this paper, we accordingly propose to jointly learn part features and predict part quality. For simplicity, we extract part features from fixed part locations. If part locations are provided by outside tools or attention modules, the performance of our approach can be further improved.
The architecture of QPM is illustrated in Fig. 2. It consists of a part feature learning branch and a global feature learning branch. The part branch outputs part-level features, as well as quality scores that indicate the discriminative power of each body part. The global branch generates global features that are adaptively and efficiently extracted from the common non-occluded regions for each gallery-query image pair. In the following, we will introduce three key designs in the two branches individually.
III-A Joint Learning Part Feature and Quality Scores
Part Feature Extractor. Following prior work, we adopt ResNet-50 as the backbone and remove its last spatial down-sampling operation to increase the size of the output feature maps. To obtain the part features, the output feature maps are first uniformly split into parts in the vertical orientation; following prior work, we use 6 parts. Next, the feature maps for each part are processed by a Region Average Pooling (RAP) operation and one Conv layer. The parameters of the Conv layer are not shared between parts. For each part, the feature vector produced by the Conv layer is utilized as the final part feature and is optimized by the cross-entropy loss (Eq. 1). This loss is computed over the N images in a batch, where N denotes the batch size, and each part feature has its own classification layer.
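The stripe-wise pooling described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function name `region_average_pooling` and the nested-list layout (row, column, channel) are assumptions made for the example, and the part-specific Conv layer that follows RAP is omitted.

```python
def region_average_pooling(feature_map, num_parts=6):
    """Average-pool a feature map over `num_parts` horizontal stripes.

    feature_map: nested lists indexed as [row][column][channel]
                 (hypothetical layout chosen for this sketch).
    Returns one pooled feature vector per stripe, top to bottom.
    """
    height = len(feature_map)
    if height % num_parts != 0:
        raise ValueError("height must be divisible by num_parts")
    rows_per_part = height // num_parts

    parts = []
    for k in range(num_parts):
        # Rows belonging to the k-th stripe.
        stripe = feature_map[k * rows_per_part:(k + 1) * rows_per_part]
        pixels = [pixel for row in stripe for pixel in row]
        channels = len(pixels[0])
        # Mean over all pixels of the stripe, channel by channel.
        parts.append(
            [sum(p[c] for p in pixels) / len(pixels) for c in range(channels)]
        )
    return parts
```

Assuming a 384 × 128 input and a total backbone stride of 16 (ResNet-50 with its last down-sampling removed), the feature map would have 24 rows, so each of the 6 parts would cover 4 rows.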
Part Quality Predictor. A key challenge for occluded ReID is to identify visible body parts. Recent works [14, 16] typically rely on outside tools to infer whether or not a part is visible. However, as argued in Section I, visibility is not equivalent to discriminative power. We accordingly propose an efficient method to predict part feature quality rather than part visibility. As shown in Fig. 2, we feed the feature of each part into the part quality predictor module. This module comprises one Conv layer, one batch normalization (BN) layer, and one sigmoid activation layer. The output of the sigmoid layer is the quality score of the corresponding part. The parameters of the predictor are not shared between parts.
As no annotation of part quality scores is provided, we cannot impose a direct supervision signal on the predicted part quality. Recent works on video-based recognition have revealed that frame-wise features and quality scores can be jointly optimized with the same identification loss [44, 45] or metric learning loss. In these works, e.g., QAN and MG-RAFA, the quality scores are utilized to aggregate multiple frame-level features into a single video-level feature. However, this approach cannot be directly used in image-based ReID, as recognition is based on a single image. Inspired by these works, we propose to jointly optimize part features and part quality scores using the triplet loss. Accordingly, the quality scores in QPM are imposed on part distances instead of frame features. Moreover, both pair-wise part distances and pair-wise part quality scores are included in the triplet loss, instead of using only a sample’s own quality score. In this way, our approach optimizes pair-wise part distances and pair-wise part quality scores for occluded ReID in an end-to-end manner.
Specifically, given a probe image and a gallery image to be compared, we first calculate their part-wise cosine distances, one for each part. Next, the part-wise distances are combined via a weighted average (Eq. 2), in which the weight of each part is determined by the quality scores of that part for the probe image and the gallery image. In this way, body parts with low quality scores contribute less to the overall part distance, which weakens the impact of occlusion.
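As a rough illustration of the quality-weighted part distance, the sketch below combines part-wise cosine distances with a weighted average. The exact weighting in the original formulation is not recoverable from the text, so the example assumes that the weight of each part is the product of the probe and gallery quality scores; the function names are likewise hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def weighted_part_distance(parts_p, parts_g, quality_p, quality_g):
    """Quality-weighted average of part-wise cosine distances.

    ASSUMPTION: the per-part weight is quality_p[k] * quality_g[k];
    the original weighting may differ.
    """
    weights = [qp * qg for qp, qg in zip(quality_p, quality_g)]
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("at least one part needs a positive weight")
    dists = [cosine_distance(u, v) for u, v in zip(parts_p, parts_g)]
    return sum(w * d for w, d in zip(weights, dists)) / total
```

Under this assumption, a part that receives a near-zero quality score in either image barely influences the distance, which matches the behaviour described above.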
To sample sufficient triplets during training, we randomly sample a fixed number of images for each of a fixed number of random identities to create a mini-batch; the batch size N is the product of these two numbers. The triplet loss (Eq. 3) is a hinge loss with a margin, normalized by the number of triplets in a batch that violate the triplet constraint. The distances of the positive and negative image pairs in a triplet are both calculated by Eq. 2. In order to reduce the triplet loss, the part quality predictors have to assign lower scores to occluded parts.
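The triplet constraint can be sketched as follows. This is a simplified illustration under stated assumptions: each triplet is given directly as a pair of precomputed distances (positive pair, negative pair), the function name `triplet_loss` is hypothetical, and the default margin of 0.3 is the value reported in the implementation details.

```python
def triplet_loss(triplet_distances, margin=0.3):
    """Hinge-based triplet loss averaged over violating triplets.

    triplet_distances: list of (d_pos, d_neg) tuples, where d_pos is the
    distance of the positive pair and d_neg that of the negative pair.
    Triplets that already satisfy d_pos + margin <= d_neg contribute
    nothing and are excluded from the normalization.
    """
    hinge = [max(0.0, d_pos - d_neg + margin) for d_pos, d_neg in triplet_distances]
    violating = [h for h in hinge if h > 0.0]
    if not violating:
        return 0.0
    return sum(violating) / len(violating)
```

Because the distances fed into this loss depend on the predicted quality scores, lowering the score of an occluded part is one way for the model to shrink the positive-pair distance and thereby reduce the loss.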
Discussion. With the help of the above constraints, the model can predict the quality scores of the part features without using any external tools. However, this approach still has limitations. Take the predicted quality scores in Fig. 4 as an example. It can be seen that the model works well for occlusions caused by objects (e.g. cars, trees, and boxes). In these situations, the quality scores of occluded body parts are very low. However, when a pedestrian is occluded by other persons, the model may predict a relatively high score for occluded regions, because it cannot reliably differentiate the body parts of different pedestrians. In the following, we handle this problem in the global feature learning branch.
Most existing approaches [14, 15, 29] simply extract global features from visible regions for each image. However, this strategy suffers from two problems. First, as explained above, the visibility or quality prediction of image regions may be disturbed by occlusions between pedestrians. Second, as shown in Fig. 1(b), two images may differ in occlusion locations, meaning that they are not directly comparable even if features are extracted from visible regions for each image. In the following, we propose an Identity-aware Spatial Attention (ISA) approach and an adaptive global feature extraction approach to handle these two problems, respectively.
III-B Identity-aware Spatial Attention
Fig. 3 illustrates the structure of ISA. ISA makes use of a coarse identity-aware feature to generate spatial attention for each image. This attention suppresses occlusions caused by objects and other pedestrians, meaning that only features of spatial regions relevant to the target pedestrian are highlighted.
In more detail, we feed the backbone feature maps into another Conv layer and obtain the feature maps of the global branch, from which global features are then extracted. Similar to the part branch, we uniformly partition these feature maps into parts and obtain part-level features via RAP operations. We then obtain a coarse identity-aware global feature by fusing these part-level features via weighted averaging (Eq. 4), where the weights are the normalized part quality scores produced by the part branch in Section III-A.
To further reduce the impact of occlusions, we process the coarse identity-aware feature using a simple two-layer network inspired by the squeeze-and-excitation network. As illustrated in Fig. 3, the dimension of the feature is first reduced via a Conv layer followed by a ReLU layer, and then recovered to the original dimension by means of another Conv layer. The reduction ratio of the first Conv layer is set to 4.
Moreover, to ensure that noisy elements in the coarse feature are suppressed via the reduction operation, we impose a cross-entropy loss on the feature produced by the first Conv layer for each image in a batch, using a dedicated classification layer.
By adopting this approach, the information in the output of the two-layer network is identity-aware and noise-free. Accordingly, we employ this output to generate the spatial attention map for the feature maps of the global branch. The attention map is computed by taking the inner product between this identity-aware feature and the feature vector of each pixel in the feature maps, followed by a sigmoid function. Accordingly, the attention map is a matrix with the same height and width as the feature maps. As identity-relevant pixels obtain a high response value in the attention map, we use it to weight the feature maps via element-wise multiplication with each channel, which produces new feature maps.
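The attention computation can be sketched in a few lines of plain Python. This is an illustrative approximation rather than the original module: the function names and the nested-list layout (row, column, channel) are assumptions, the quality scores are assumed to be normalized by their sum, and the two-layer refinement network is left out, so the identity vector is passed in directly.

```python
import math


def coarse_identity_feature(part_features, quality_scores):
    """Quality-weighted average of part features.

    ASSUMPTION: scores are normalized by dividing by their sum.
    """
    total = sum(quality_scores)
    weights = [q / total for q in quality_scores]
    dim = len(part_features[0])
    return [
        sum(w * feat[c] for w, feat in zip(weights, part_features))
        for c in range(dim)
    ]


def spatial_attention(feature_map, identity_vec):
    """Sigmoid of the inner product between identity_vec and each pixel.

    feature_map is indexed as [row][column][channel]; the result has the
    same height and width, with one value in (0, 1) per pixel.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    return [
        [sigmoid(sum(f * v for f, v in zip(pixel, identity_vec))) for pixel in row]
        for row in feature_map
    ]


def apply_attention(feature_map, attention):
    """Scale every channel of each pixel by that pixel's attention value."""
    return [
        [[a * f for f in pixel] for pixel, a in zip(row, att_row)]
        for row, att_row in zip(feature_map, attention)
    ]
```

Pixels whose features align with the identity vector receive attention values close to 1, while pixels belonging to objects or other pedestrians are pushed towards 0.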
Discussion. To the best of our knowledge, ISA is one of the first efficient methods to address occlusions between pedestrians in occluded ReID. A few very recent works [48, 49] can potentially solve this problem. These approaches adopt a co-attention mechanism and attempt to search for pixel-level correspondence between each pair of images, enabling features to be extracted from semantically corresponding regions. However, during both training and inference, these approaches rely on computationally expensive matrix multiplication to infer semantically corresponding pixels for each pair of query and gallery images. Their computational cost is therefore significantly higher than that of ISA, which generates attention for each image independently.
III-C Adaptive Global Feature Extraction
As indicated in Fig. 5, after the processing of ISA, the responses on the feature maps focus primarily on the body of the target pedestrian. However, this does not mean that it is reasonable to extract global-level features directly from these feature maps, because the two images being compared may differ in their occlusion locations. To ensure semantic consistency, it is essential to adaptively extract global features from the common non-occluded regions for each image pair. We design the Adaptive Global Feature Extraction (AGFE) module, which relies on the part quality scores, to achieve this goal.
For example, given a probe image and a gallery image, we first obtain their feature maps from the output of the ISA module. In the next step, we equally partition each of them into parts and apply the RAP operation to each part. In this way, we obtain a set of feature vectors for each image. We then adopt the part quality scores from the part branch to aggregate these feature vectors and obtain the global-level features of the probe and gallery images, respectively. More specifically, each global-level feature is a weighted combination of the image's part-level feature vectors (Eq. 9), in which the weight of each part is shared by the two images and is computed from their part quality scores (Eq. 10).
The final global representations are supervised with a cross-entropy classification loss.
We also apply a triplet loss to ensure that the intra-class distances are smaller than the inter-class distances; this triplet loss is similar to Eq. 3.
Existing works usually do not extract global features from the common non-occluded regions of each image pair. As shown in Table V, the AGFE module significantly improves performance for occluded ReID. This is because the AGFE module extracts semantically aligned global features.
III-D Occluded ReID via QPM
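The pair-adaptive pooling idea can be illustrated with the sketch below. The precise weight formula is not recoverable from the text, so the example assumes that the shared weight of each part is the product of the two images' quality scores, normalized to sum to one; the function names are hypothetical.

```python
def pair_weights(quality_p, quality_g):
    """Part weights shared by a probe/gallery pair.

    ASSUMPTION: weight_k is proportional to quality_p[k] * quality_g[k].
    """
    raw = [qp * qg for qp, qg in zip(quality_p, quality_g)]
    total = sum(raw)
    if total <= 0.0:
        raise ValueError("at least one part needs a positive weight")
    return [r / total for r in raw]


def adaptive_global_features(parts_p, parts_g, quality_p, quality_g):
    """Pool both images' part features with the same pair-specific weights."""
    weights = pair_weights(quality_p, quality_g)

    def pool(parts):
        dim = len(parts[0])
        return [sum(w * part[c] for w, part in zip(weights, parts)) for c in range(dim)]

    return pool(parts_p), pool(parts_g)
```

Because the same weights are applied to both images, a part that is occluded in either image is down-weighted in both global features, so the two features describe the same body regions.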
During training, the overall objective function of QPM combines the classification and triplet losses of the part branch, the cross-entropy loss of the ISA module, and the classification and triplet losses on the final global features.
All network parameters, including those of the fully connected layers, are optimized jointly via gradient descent.
Many existing works design local and global branches [50, 51, 14, 15] for multi-source structural information integration. Similarly, two components in QPM make up the final distance between a pair of query and gallery images: namely, the distance between part-level features and the distance between global-level features. In our approach, the distance between the part-level features is computed according to Eq. 2. The global-level features with respect to the image pair are obtained according to Eq. 9 and Eq. 10.
The final distance (Eq. 15) is a weighted combination of the two distances, in which a weight balances their contributions; this weight is consistently set to 0.6 in this work.
In the inference stage, we first compute the distance between a query image and each of the gallery images using the part features according to Eq. 2. Body parts with low quality scores contribute less to this distance, which weakens the impact of occlusion. In this way, we efficiently obtain a fixed number of nearest neighbors for the query image. Then, we compute the final distance according to Eq. 15 between the query image and each of these nearest neighbors. Because the pair-specific global features are computed only for the nearest neighbors, the AGFE module hardly increases the inference time cost.
IV-A Datasets and Settings
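The two-stage retrieval procedure can be sketched as follows. This is a simplified illustration with several assumptions: part distances are precomputed, `global_dist_fn` is a hypothetical callback that returns the pair-specific global distance for a gallery index, and the balancing weight `lam` is applied to the part distance (the text does not state which term it multiplies).

```python
def rank_gallery(part_dists, global_dist_fn, k=30, lam=0.6):
    """Rank gallery images for one query in two stages.

    Stage 1: sort all gallery images by the part-level distance.
    Stage 2: re-rank only the top-k candidates by the combined distance.

    ASSUMPTION: final = lam * part + (1 - lam) * global.
    """
    order = sorted(range(len(part_dists)), key=lambda i: part_dists[i])
    top, rest = order[:k], order[k:]

    # The (more expensive) pair-specific global distance is evaluated
    # only for the k nearest neighbours found in stage 1.
    final = {
        i: lam * part_dists[i] + (1.0 - lam) * global_dist_fn(i)
        for i in top
    }
    reranked = sorted(top, key=lambda i: final[i])
    return reranked + rest
```

Restricting the second stage to `k` candidates keeps the number of pair-specific feature computations per query constant, regardless of the gallery size.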
Partial-iLIDS was constructed based on the iLIDS dataset. It contains 238 images of 119 identities, all of which were captured in an airport. Some images in the dataset contain people occluded by other individuals or luggage. Each pedestrian has 1 full-body image and 1 occluded image. All probe images are occluded person images, while all gallery images are holistic images.
Partial-REID was collected on a university campus and includes 600 images of 60 pedestrians. Each person has 5 full-body images and 5 occluded images. These images were collected under different viewpoints, backgrounds, and types of severe occlusion. All probe images are occluded person images, while all gallery images are holistic images.
Occluded-Duke was constructed based on the DukeMTMC database. It is composed of 15,618 training images of 702 identities, 2,210 occluded query images of 519 identities, and 17,661 gallery images. There are rich variations in Occluded-Duke, including different viewpoints and a large variety of obstacles, such as cars, bicycles, trees, and other persons. Occluded-Duke is a more difficult and practical dataset, since both probe and gallery images contain occlusions.
P-DukeMTMC is another subset of DukeMTMC. There are 12,927 training images of 665 identities, 2,163 query images of 634 identities, and 9,053 gallery images. These images are occluded by different types of obstacles in public, e.g., people, luggage, cars, and guideboards.
We conduct experiments using the PyTorch framework. We set both the number of identities per batch and the number of images per identity to 8; therefore, the batch size is 64. We adopt random erasing to simulate occlusion. All images are resized to 384 × 128 pixels and augmented via random horizontal flipping. The number of body parts is set to 6. The margin for the triplet loss is set to 0.3. The number of nearest neighbors is set to 30. The SGD optimizer is utilized for model optimization. Following [3, 41, 32], we do not use weight regularization in the SGD optimizer. Fine-tuned from the IDE model, QPM is trained in an end-to-end fashion for 70 epochs. The initial learning rate is set to 0.01 and is multiplied by 0.1 every 20 epochs.
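For concreteness, the step-wise learning rate schedule described above can be written as a small function. This is only a sketch: the function name is hypothetical and epochs are assumed to be counted from 0.

```python
def learning_rate(epoch, base_lr=0.01, gamma=0.1, step=20):
    """Step decay: multiply the learning rate by `gamma` every `step` epochs.

    ASSUMPTION: `epoch` is 0-indexed, so epochs 0-19 use base_lr.
    """
    return base_lr * gamma ** (epoch // step)


# Learning rate at the start of each 10-epoch block of a 70-epoch run.
schedule = [learning_rate(e) for e in range(0, 70, 10)]
```

In PyTorch, the same behaviour is commonly obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)`.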
Evaluation Protocols. We report the Cumulative Matching Characteristics (CMC) and mean Average Precision (mAP) values for the proposed approach. All experimental results are obtained in the single-query setting, using an evaluation package provided by prior work.
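To make the two metrics concrete, the sketch below computes a Rank-k hit and the average precision for a single query, given the identity labels of the gallery images in ranked order. It is a simplified illustration with hypothetical function names, not the evaluation package used in the experiments; in particular, it omits the filtering of same-camera gallery images that standard ReID protocols typically apply.

```python
def rank_k_hit(ranked_labels, query_label, k):
    """True if a correct match appears within the top-k ranked gallery images."""
    return query_label in ranked_labels[:k]


def average_precision(ranked_labels, query_label):
    """Average precision of one ranked list for one query."""
    hits = 0
    precisions = []
    for position, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            # Precision measured at the position of each correct match.
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

The CMC Rank-k accuracy is the fraction of queries for which `rank_k_hit` is true, and mAP is the mean of `average_precision` over all queries.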
Moreover, we provide stability analysis on the performance of QPM in the supplementary material.
| Method | Rank-1 | Rank-5 | Rank-10 | mAP |
| Random Erasing | 40.5 | 59.6 | 66.8 | 30.0 |
| Adver Occluded | 44.5 | - | - | 32.2 |
IV-B Performance under Supervised Setting
Results on Occluded-Duke. The performance of QPM and state-of-the-art methods on Occluded-Duke is tabulated in Table I. Some recent methods have achieved competitive performance with the help of pose landmarks: for example, HOReID learns high-order relation and topology information for local features of visible landmarks, facilitating a better match between probe and gallery images, and achieves 55.1% Rank-1 accuracy and 43.8% mAP. In comparison, QPM outperforms HOReID by 9.3% and 5.9% in terms of Rank-1 accuracy and mAP, respectively. Moreover, QPM does not depend on pose estimation tools in either training or testing. This performance improvement demonstrates the effectiveness of QPM.
Results on P-DukeMTMC. Comparison results on P-DukeMTMC are summarized in Table II. As the table shows, QPM achieves 89.4% Rank-1 accuracy and 74.4% mAP, surpassing the best previous method, PVPM, by 4.3% and 4.5% in terms of Rank-1 accuracy and mAP, respectively. These results are consistent with those obtained on the Occluded-Duke database and further demonstrate that our method can effectively handle the occlusion problem for ReID.
IV-C Performance under Transfer Setting
In this setting, the ReID model is trained using the Market-1501 database. Then, the model is directly evaluated on the Partial-REID, Partial-iLIDS, and P-DukeMTMC databases.
Results on Partial-REID and Partial-iLIDS. Each of the two databases contains two types of testing data: namely, partial images from which occluded regions have been manually removed, and the original occluded images. Accordingly, depending on the testing data used, existing methods can be roughly divided into two categories: those using partial images and those using the original occluded images. The performance of QPM and state-of-the-art methods is tabulated in Table III.
For Partial-iLIDS, QPM outperforms all other methods that also evaluate on the original occluded images. For example, QPM beats PGFA by 8.2% and 4.8% in terms of Rank-1 and Rank-3 accuracy, respectively. It also outperforms one of the most recent methods using partial data, i.e. HOReID, by 4.7% in terms of Rank-1 accuracy. For Partial-REID, QPM outperforms all other methods using the original occluded images: for example, QPM beats PVPM by 3.4% in terms of Rank-1 accuracy. While its performance is slightly lower than that of HOReID (which evaluates on partial data), it should be noted that, unlike HOReID, our method requires neither the manual removal of occluded regions during testing nor any additional tools. ISP achieves 65.7% Rank-1 and 75.3% Rank-3 accuracy on Partial-REID, which is lower than the results of QPM. This is because the image quality of Partial-REID is poor and the image resolution is relatively low. In this case, ISP is prone to errors in pixel classification. In comparison, our method is more robust.
Results on P-DukeMTMC. Finally, we evaluate the performance of QPM on P-DukeMTMC under the transfer setting. The results presented in Table IV show that QPM achieves state-of-the-art performance under all metrics. For example, QPM outperforms one of the most recent methods, i.e. PVPM, by 5.8% and 1.9% in terms of Rank-1 accuracy and mAP, respectively. Experiments on this database further justify the effectiveness of QPM.
| Part Bilinear | 39.2 | 50.6 | 56.4 | 25.4 |
IV-D Ablation Study
Ablation studies are conducted on two large-scale datasets, i.e. Occluded-Duke and P-DukeMTMC. The experimental results on these two benchmarks are shown in Table V. It should be noted that Table V lists the results on Occluded-Duke under the supervised setting and the results on P-DukeMTMC under the transfer setting. These results show the robustness and effectiveness of our method under different experimental settings.
Effectiveness of the quality scores. In Table V, ‘Baseline’ refers to the PCB model. ‘Baseline(+triplet)’ equips PCB with the triplet loss in Eq. 3, while all part quality scores are set to 1. ‘Part branch’ means that we adopt only the part feature learning branch of QPM for ReID. As shown in Table V, ‘Baseline(+triplet)’ slightly improves the performance relative to the baseline. In comparison, ‘Part branch’ brings a significant performance improvement, suggesting that part quality scores considerably benefit the occluded ReID task.
Effectiveness of the ISA module. In Table V, ‘GAP global’ means that we perform global average pooling (GAP) on the feature maps to obtain the global feature for each image. ‘AGFE global’ means that we utilize the adaptive global feature extraction module. When the ISA module is added, the performance of both types of global features improves. In particular, ISA improves the Rank-1 accuracy of ‘AGFE global’ by 6.0% and 4.6%, as well as mAP by 3.0% and 1.8%, on the two databases, respectively.
Effectiveness of the AGFE module. Compared with ‘GAP global’ and ‘GAP global(+ISA)’, the AGFE module consistently brings about significant performance gains. For example, ‘AGFE global(+ISA)’ outperforms ‘GAP global(+ISA)’ in the Rank-1 accuracy by as much as 19.9% on Occluded-Duke. These experimental results indicate that it is vital to adaptively extract global features from the common non-occluded regions for each image pair.
Moreover, we compare the performance of ‘AGFE global’ with ‘SI global’ in Table VI. ‘SI global’ means that we obtain global features for each single image (SI) using Eq. 4, without considering the difference in occlusion locations for each image pair. To facilitate a fair comparison, ‘SI global’ adopts the same loss functions as ‘AGFE global’, i.e., the cross-entropy loss and triplet loss. It is shown that the performance of ‘SI global’ drops dramatically compared with ‘AGFE global’, which suggests that Eq. 4 alone cannot promote ReID performance. This experimental result indicates that it is vital to extract semantically consistent global features for each image pair.
Effectiveness of the combination. With both ISA and AGFE modules, the quality of global features is promoted significantly. For example, ‘AGFE global(+ISA)’ outperforms ‘GAP global’ in the Rank-1 accuracy by as much as 21.8% and 20.0% on Occluded-Duke and P-DukeMTMC databases, respectively.
Finally, the combination of the part branch and the global branch, which is denoted as QPM in Table V, achieves better performance than either branch alone. The above comparisons justify the effectiveness of each key component in QPM.
ISA vs. Other Approaches. To facilitate fair comparison, all experiments are based on the ‘AGFE global’ model. We equip the ‘AGFE global’ model with different spatial attention modules respectively and summarize their performance in Table VII. It is shown that ISA significantly outperforms all other methods by at least 3.3% and 2.9% in terms of Rank-1 accuracy on Occluded-Duke and P-DukeMTMC, respectively. This is because ISA is identity-aware; therefore, it effectively handles the occlusions between pedestrians.
Moreover, we illustrate the heat maps of the feature maps produced by different attention modules in Fig. 5. We make the following observations. First, heat maps for the baseline model have high responses on both occluded and non-occluded regions; second, existing popular spatial attention models [67, 68, 69, 70] handle occlusions between pedestrians poorly; third, with the identity-aware guidance, our ISA module can reliably separate discriminative body parts from those occluded by objects or by other pedestrians. The above comparisons further demonstrate the effectiveness of ISA.
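One common way to obtain such heat maps from a convolutional feature map is sketched below. The text does not state how the heat maps in Fig. 5 were produced, so the channel-wise aggregation and min-max scaling used here are assumptions, and `activation_heatmap` is a hypothetical helper.

```python
import numpy as np

def activation_heatmap(feature_map, eps=1e-12):
    """Collapse a (C, H, W) feature map into an (H, W) heat map in [0, 1].

    Assumption: the response at each location is the sum of absolute
    activations over channels, followed by min-max scaling.
    """
    energy = np.abs(feature_map).sum(axis=0)  # (H, W) per-location response
    energy = energy - energy.min()            # shift so the minimum is 0
    return energy / (energy.max() + eps)      # scale so the maximum is ~1
```

The resulting map is typically upsampled to the input resolution and overlaid on the image, so that high responses on occluders become visible.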
Comparisons of Model Complexity. In this experiment, we demonstrate that QPM not only achieves superior performance in terms of occluded ReID accuracy, but also offers advantages in terms of both its time and space complexities.
Three recent powerful occluded ReID approaches are compared: PGFA [14], HOReID [15], and the approach of [59]. To facilitate fair comparison, all experimental settings follow the descriptions in the respective papers. Following [14, 15, 59], the input images are resized to 256 × 128 pixels for HOReID and 384 × 128 pixels for PGFA, [59], and QPM. The batch size is set to 64 uniformly for all methods. Comparisons are conducted on a Titan V GPU, and results are summarized in Table VIII. The inference time cost in Table VIII includes the feature extraction of the query image and the matching time of all gallery images.
As is evident from Table VIII, the model size of QPM is much smaller since it does not need an additional detection model. In addition, QPM is faster in both training and testing. Specifically, our test time is only 60% and 24% of that of PGFA [14] and HOReID [15], respectively. This is because QPM does not require time-consuming human keypoint extraction. Although QPM uses less additional information during training and testing than PGFA and HOReID, it still has significant performance advantages. Accordingly, the above comparisons demonstrate that the proposed QPM model is both compact and efficient.
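The inference time defined above (query feature extraction plus matching against all gallery images) could be measured along the following lines. This is a CPU-only sketch with hypothetical names (`timed_query`, `extract`). Euclidean distance is assumed for matching, and on a GPU the device would also need to be synchronized before each clock reading.

```python
import time
import numpy as np

def timed_query(extract, query_image, gallery_feats):
    """Time one query: feature extraction plus matching against the gallery.

    extract:       callable mapping an image to a (D,) feature vector.
    gallery_feats: (G, D) pre-computed gallery features.
    Assumption: matching uses Euclidean distance.
    """
    start = time.perf_counter()
    query_feat = extract(query_image)                           # feature extraction
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)  # match all gallery images
    ranking = np.argsort(dists)                                 # most similar first
    elapsed = time.perf_counter() - start
    return ranking, elapsed
```

Averaging `elapsed` over many queries gives a per-query cost that can be compared across methods as a ratio.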
IV-E Parameter Analysis
The Impact of the Part Number. In this experiment, we analyze the impact of the part number. Experimental results are summarized in Table IX. It is shown that the optimal part number is 6, which is consistent with conclusions reported in prior work. Therefore, we consistently set the part number to 6 in this work.
The Impact of the Feature Dimension. Table X shows the ReID performance with different feature dimensions. QPM consistently achieves state-of-the-art performance when the feature dimension is set to 256, 512, or 1024, with the best result achieved at 1024.
The Impact of the Weight. Table XI shows the ReID performance with different values of the weight. It is shown that the optimal value is 0.6. Therefore, we consistently set the weight to 0.6 in this work.
In this paper, we propose a novel framework named QPM to handle the occluded person ReID problem. Unlike most existing methods, which depend on visibility cues from outside tools, QPM jointly learns part features and predicts part quality in an end-to-end framework without using any annotations or outside tools. Moreover, based on the predicted part quality scores, we propose a novel identity-aware spatial attention (ISA) model to handle occlusion between pedestrians. We further design a novel approach that adaptively generates global features from common non-occluded regions for each image pair. Finally, extensive experiments on four popular datasets demonstrate the effectiveness of QPM.
-  Z. Zhang, C. Lan, W. Zeng, and Z. Chen, “Densely semantically aligned person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 667–676.
-  W. Li, X. Zhu, and S. Gong, “Harmonious attention network for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2285–2294.
-  Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, “Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline),” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 480–496.
-  J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang, “Attention-aware compositional network for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2119–2128.
-  Y. Chen, X. Zhu, and S. Gong, “Person re-identification by deep learning multi-scale representations,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2590–2600.
-  C. Song, Y. Huang, W. Ouyang, and L. Wang, “Mask-guided contrastive attention model for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1179–1188.
-  M. Tian, S. Yi, H. Li, S. Li, X. Zhang, J. Shi, J. Yan, and X. Wang, “Eliminating background-bias for robust person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5794–5803.
-  A. Wu, W.-S. Zheng, and J.-H. Lai, “Robust depth-based person re-identification,” IEEE Trans. Image Process, vol. 26, no. 6, pp. 2588–2603, 2017.
-  H. Luo, W. Jiang, Y. Gu, F. Liu, X. Liao, S. Lai, and J. Gu, “A strong baseline and batch normalization neck for deep person re-identification,” IEEE Trans. Multimedia, vol. 22, no. 10, pp. 2597–2609, 2019.
-  W. J. Scheirer, P. J. Flynn, C. Ding, G. Guo, V. Struc, M. Al Jazaery, K. Grm, S. Dobrisek, D. Tao, Y. Zhu et al., “Report on the btas 2016 video person recognition evaluation,” in IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS), 2016.
-  L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian, “Glad: Global–local-alignment descriptor for scalable person re-identification,” IEEE Trans. Multimedia, vol. 21, no. 4, pp. 986–999, 2018.
-  C. Yan, G. Pang, X. Bai, C. Liu, N. Xin, L. Gu, and J. Zhou, “Beyond triplet loss: person re-identification with fine-grained difference-aware pairwise loss,” IEEE Trans. Multimedia, 2021.
-  B. Jiang, X. Wang, A. Zheng, J. Tang, and B. Luo, “Ph-gcn: Person retrieval with part-based hierarchical graph convolutional network,” IEEE Trans. Multimedia, 2021.
-  J. Miao, Y. Wu, P. Liu, Y. Ding, and Y. Yang, “Pose-guided feature alignment for occluded person re-identification,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 542–551.
-  G. Wang, S. Yang, H. Liu, Z. Wang, Y. Yang, S. Wang, G. Yu, E. Zhou, and J. Sun, “High-order information matters: Learning relation and topology for occluded person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2020.
-  S. Gao, J. Wang, H. Lu, and Z. Liu, “Pose-guided visible part matching for occluded person reid,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2020.
-  A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv preprint arXiv:1703.07737, 2017.
-  L. He, J. Liang, H. Li, and Z. Sun, “Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7073–7082.
-  W.-S. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, and S. Gong, “Partial person re-identification,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 4678–4686.
-  J. Zhuo, Z. Chen, J. Lai, and G. Wang, “Occluded person re-identification,” in 2018 IEEE Int. Conf. Multimedia Expo, 2018, pp. 1–6.
-  M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, and M. Shah, “Human semantic parsing for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1062–1071.
-  L. Qi, J. Huo, L. Wang, Y. Shi, and Y. Gao, “Maskreid: A mask based deep ranking neural network for person re-identification,” arXiv preprint arXiv:1804.03864, 2018.
-  L. Zheng, Y. Huang, H. Lu, and Y. Yang, “Pose-invariant embedding for deep person re-identification,” IEEE Trans. Image Process, vol. 28, no. 9, pp. 4500–4509, 2019.
-  J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu, “Pose transferrable person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4099–4108.
-  C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian, “Pose-driven deep convolutional model for person re-identification,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3960–3969.
-  L. He, Y. Wang, W. Liu, X. Liao, H. Zhao, Z. Sun, and J. Feng, “Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019.
-  L. He and W. Liu, “Guided saliency feature learning for person re-identification in crowded scenes,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 357–373.
-  L. He, Z. Sun, Y. Zhu, and Y. Wang, “Recognizing partial biometric patterns,” arXiv preprint arXiv:1810.07399, 2018.
-  J. Zhuo, J. Lai, and P. Chen, “A novel teacher-student learning framework for occluded person re-identification,” arXiv preprint arXiv:1907.03253, 2019.
-  K. Zhu, H. Guo, Z. Liu, M. Tang, and J. Wang, “Identity-guided human semantic parsing for person re-identification,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 346–363.
-  Y. Sun, Q. Xu, Y. Li, C. Zhang, Y. Li, S. Wang, and J. Sun, “Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 393–402.
-  G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou, “Learning discriminative features with multiple granularities for person re-identification,” in Proc. ACM Int. Conf. Multimedia, 2018, pp. 274–282.
-  F. Zheng, C. Deng, X. Sun, X. Jiang, X. Guo, Z. Yu, F. Huang, and R. Ji, “Pyramidal person re-identification via multi-loss dynamic training,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8514–8522.
-  M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, “A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 420–429.
-  H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang, “Spindle net: Person re-identification with human body region guided feature decomposition and fusion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1077–1085.
-  L. Zhao, X. Li, Y. Zhuang, and J. Wang, “Deeply-learned part-aligned representations for person re-identification,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3219–3228.
-  D. Li, X. Chen, Z. Zhang, and K. Huang, “Learning deep context-aware features over body and latent parts for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 384–393.
-  X. Gong, Z. Yao, X. Li, Y. Fan, B. Luo, J. Fan, and B. Lao, “Lag-net: Multi-granularity network for person re-identification via local attention system,” IEEE Trans. Multimedia, 2021.
-  Y. Li, J. He, T. Zhang, X. Liu, Y. Zhang, and F. Wu, “Diverse part discovery: Occluded person re-identification with part-aware transformer,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 2898–2907.
-  C. Wan, Y. Wu, X. Tian, J. Huang, and X.-S. Hua, “Concentrated local part discovery with fine-grained part representation for person re-identification,” IEEE Trans. Multimedia, vol. 22, no. 6, pp. 1605–1618, 2019.
-  C. Ding, K. Wang, P. Wang, and D. Tao, “Multi-task learning with coarse priors for robust part-aware person re-identification,” IEEE Trans. Pattern Anal. Mach. Intell, 2020.
-  K. Wang, P. Wang, C. Ding, and D. Tao, “Batch coherence-driven network for part-aware person re-identification,” IEEE Trans. Image Process, vol. 30, pp. 3405–3418, 2021.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning. PMLR, 2015, pp. 448–456.
-  Z. Zhang, C. Lan, W. Zeng, and Z. Chen, “Multi-granularity reference-aided attentive feature aggregation for video-based person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10 407–10 416.
-  Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan, “See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4747–4756.
-  Y. Liu, J. Yan, and W. Ouyang, “Quality aware network for set to set recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5790–5799.
-  J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
-  S. Zhao, C. Gao, J. Zhang, H. Cheng, C. Han, X. Jiang, X. Guo, W.-S. Zheng, N. Sang, and X. Sun, “Do not disturb me: Person re-identification under the interference of other pedestrians,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 647–663.
-  X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli, “See more, know more: Unsupervised video object segmentation with co-attention siamese networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3623–3632.
-  M. Zhang, W. Li, R. Tao, H. Li, and Q. Du, “Information fusion for classification of hyperspectral and lidar data using ip-cnn,” IEEE Trans. Geosci. Remote. Sens. Lett., 2021.
-  P. Xie, M. Zhao, and X. Hu, “Pisltrc: Position-informed sign language transformer with content-aware convolution,” IEEE Trans. Multimedia, 2021.
-  W.-S. Zheng, S. Gong, and T. Xiang, “Person re-identification by probabilistic relative distance comparison,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 649–656.
-  Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by gan improve the person re-identification baseline in vitro,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3754–3762.
-  L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian, “Person re-identification in the wild,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1367–1376.
-  K. Zhou and T. Xiang, “Torchreid: A library for deep learning person re-identification in pytorch,” arXiv preprint arXiv:1910.10093, 2019.
-  Y. Suh, J. Wang, S. Tang, T. Mei, and K. M. Lee, “Part-aligned bilinear representations for person re-identification,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 418–437.
-  Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” in Proc. AAAI Conf. Artif. Intell., 2020, pp. 13 001–13 008.
-  H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang, “Adversarially occluded samples for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5098–5107.
-  J. Miao, Y. Wu, and Y. Yang, “Identifying visible parts via pose estimation for occluded person re-identification,” IEEE Trans. Neural Netw. Learn. Syst., 2021.
-  K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5693–5703.
-  L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1116–1124.
-  S. Liao, A. K. Jain, and S. Z. Li, “Partial face recognition: Alignment-free approach,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 35, no. 5, pp. 1193–1205, 2012.
-  Z. Gao, H. Zhang, L. Gao, Z. Cheng, R. Hong, and S. Chen, “Dcr: A unified framework for holistic/partial person reid,” IEEE Trans. Multimedia, 2020.
-  H. Luo, W. Jiang, X. Fan, and C. Zhang, “Stnreid: Deep convolutional networks with pairwise spatial transformer networks for partial person re-identification,” IEEE Trans. Multimedia, vol. 22, no. 11, pp. 2905–2913, 2020.
-  X. Chang, T. M. Hospedales, and T. Xiang, “Multi-level factorisation net for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2109–2118.
-  K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, “Omni-scale feature learning for person re-identification,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 3702–3712.
-  Y. Liu, Z. Yuan, W. Zhou, and H. Li, “Spatial and temporal mutual promotion for video-based person re-identification,” in Proc. AAAI Conf. Artif. Intell., vol. 33, no. 01, 2019, pp. 8786–8793.
-  Z. Zhang, C. Lan, W. Zeng, X. Jin, and Z. Chen, “Relation-aware global attention for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 3186–3195.
-  F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3156–3164.
-  S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, “Cbam: Convolutional block attention module,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3–19.