AlignedReID: Surpassing Human-Level Performance in Person Re-Identification

11/22/2017 ∙ by Xuan Zhang, et al. ∙ Megvii Technology Limited 0

In this paper, we propose a novel method called AlignedReID that extracts a global feature which is jointly learned with local features. Global feature learning benefits greatly from local feature learning, which performs an alignment/matching by calculating the shortest path between two sets of local features, without requiring extra supervision. After the joint learning, we only keep the global feature to compute the similarities between images. Our method achieves rank-1 accuracy of 94.0 outperforming state-of-the-art methods by a large margin. We also evaluate human-level performance and demonstrate that our method is the first to surpass human-level performance on Market1501 and CUHK03, two widely used Person ReID datasets.



There are no comments yet.


page 1

page 3

page 5

page 8

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Person re-identification (ReID), identifying a person of interest at other time or place, is a challenging task in computer vision. Its applications range from tracking people across cameras to searching for them in a large gallery, from grouping photos in a photo album to visitor analysis in a retail store. Like many visual recognition problems, variations in pose, viewpoints illumination, and occlusion make this problem non-trivial.

Traditional approaches have focused on low-level features such as colors, shapes, and local descriptors [9, 11]

. With the renaissance of deep learning, the convolutional neural network (CNN) has dominated this field

[24, 32, 6, 54, 16, 24], by learning features in an end-to-end fashion through various metric learning losses such as contrastive loss [32], triplet loss [18], improved triplet loss [6], quadruplet loss [3], and triplet hard loss [13].

Many CNN-based approaches learn a global feature, without considering the spatial structure of the person. This has a few major drawbacks: 1) inaccurate person detection boxes might impact feature learning, e.g., Figure 1 (a-b); 2) the pose change or non-rigid body deformation makes the metric learning difficult, e.g., Figure 1 (c-d); 3) occluded parts of the human body might introduce irrelevant context into the learned feature, e.g., Figure 1 (e-f); 4) it is non-trivial to emphasis local differences in a global feature, especially when we have to distinguish two people with very similar appearances, e.g., Figure 1 (g-h). To explicitly overcome these drawbacks, recent studies have paid attention to part-based, local feature learning. Some works [33, 38, 43]

divide the whole body into a few fixed parts, without considering the alignment between parts. However, it still suffers from inaccurate detection box, pose variation, and occlusion. Other works use pose estimation result for the alignment

[52, 37, 50], which requires additional supervision and a pose estimation step (which is often error-prone).

Figure 1: Challenges in ReID: (a-b) inaccurate detection, (c-d) pose misalignments, (e-f) occlusions, (g-h) very similar appearance.

In this paper, we propose a new approach, called AlignedReID, which still learns a global feature, but performs an automatic part alignment during the learning, without requiring extra supervision or explicit pose estimation. In the learning stage, we have two branches for learning a global feature and local features jointly. In the local branch, we align local parts by introducing a shortest path loss. In the inference stage, we discard the local branch and only extract the global feature. We find that only applying the global feature is almost as good as combining global and local features. In other words, the global feature itself, with the aid of local features learning, can greatly address the drawbacks we mentioned above, in our new joint learning framework. In addition, the form of global feature keeps our approach attractive for the deployment of a large ReID system, without costly local features matching.

We also adopt a mutual learning approach [49] in the metric learning setting, to allow two models to learn better representations from each other. Combining AlignedReID and mutual learning, our system outperforms state-of-the-art systems on Market1501, CUHK03, and CUHK-SYSU by a large margin. To understand how well human perform in the ReID task, we measure the best human performance of ten professional annotators on Market1501 and CUHK03. We find that our system with re-ranking [57] has a higher level of accuracy than the human. To the best of our knowledge, this is the first report in which machine performance exceeds human performance on the ReID task.

Figure 2: The framework of AlignedReID. Both the global branch and the local branch share the same convolution network to extract the feature map. The global feature is extracted by applying global pooling directly on the feature map. For the local branch, one convolution layer is applied after horizontal pooling, which is a global pooling with a horizontal orientation. Triplet hard loss is applied, which selects triplet samples by hard sample mining according to global distances.

2 Related Work

Metric Learning. Deep metric learning methods transform raw images into embedding features, then compute the feature distances as their similarities. Usually, two images of the same person are defined as a positive pair, whereas two images of different persons are a negative pair. Triplet loss [18] is motivated by the margin enforced between positive and negative pairs. Selecting suitable samples for the training model through hard mining has been shown to be effective [13, 3, 39]. Combining softmax loss with metric learning loss to speed up the convergence is also a popular method [10].

Feature Alignments. Many works learn a global feature to represent an image of a person, ignoring the spatial local information of images. Some works consider local information by dividing images into several parts without an alignment [33, 38, 43], but these methods suffer from inaccurate detection boxes, occlusion and pose misalignment.

Recently, aligning local features by pose estimation has become a popular approach. For instance, pose invariant embedding (PIE) aligns pedestrians to a standard pose to reduce the impact of pose [52] variation. A Global-Local-Alignment Descriptor (GLAD) [37] does not directly align pedestrians, but rather detects key pose points and extracts local features from corresponding regions. SpindleNet [50] uses a region proposed network (RPN) to generate several body regions, gradually combining the response maps from adjacent body regions at different stages. These methods require extra pose annotation and have to deal with the errors introduced by pose estimation.

Mutual Learning. [49] presents a deep mutual learning strategy where an ensemble of students learn collaboratively and teach each other throughout the training process. DarkRank [4] introduces a new type of knowledge-cross sample similarity for model compression and acceleration, achieving state-of-the-art performance. These methods use mutual learning in classification. In this work, we study mutual learning in the metric learning setting.

Re-Ranking. After obtaining the image features, most current works choose the L2 Euclidean distance to compute a similarity score for a ranking or retrieval task. [35, 57, 1] perform an additional re-ranking to improve ReID accuracy. In particular, [57] proposes a re-ranking method with -reciprocal encoding, which combines the original distance and Jaccard distance.

3 Our Approach

In this section, we present our AlignedReID framework, as shown in Figure 1.

Figure 3: Example of AlignedReID local distance computed by finding the shortest path. The black arrows show the shortest path in the corresponding distance matrix on the right. The black lines show the corresponding alignment between the two images on the left.
Figure 4: Framework of the mutual learning approach. Two networks with parameters and are trained together. Each network has two branches: a classification branch and a metric learning branch. The classification branches are trained with classification losses, and learn each other through classification mutual loss. The metric learning branches are trained with metric losses, which include both global distance and local distance. Meanwhile, the metric learning branches learn each other by metric mutual loss.

3.1 AlignedReID

In AlignedReID, we generate a single global feature as the final output of the input image, and use the L2 distance as the similarity metric. However, the global feature is learned jointly with local features in the learning stage.

For each image, we use a CNN, such as Resnet50 [12], to extract a feature map, which is the output of the last convolution layer (, where is the channel number and is the spatial size, e.g., in Figure 1). A global feature (a

-d vector) is extracted by directly applying global pooling on the feature map. For the local features, a horizontal pooling, which is a global pooling in the horizontal direction, is first applied to extract a local feature for each row, and a

convolution is then applied to reduce the channel number from to . In this way, each local feature (a -d vector) represents a horizontal part of the image for a person. As a result, a person image is represented by a global feature and local features.

The distance of two person images is the summation of their global and local distances. The global distance is simply the L2 distance of the global features. For the local distance, we dynamically match the local parts from top to bottom to find the alignment of local features with the minimum total distance. This is based on a simple assumption that, for two images of the same person, the local feature from one body part of the first image is more similar to the semantically corresponding body part of the other image.

Given the local features of two images, and , we first normalize the distance to by an element-wise transformation:


where is the distance between the -th vertical part of the first image and the -th vertical part of the second image. A distance matrix is formed based on these distances, where its -element is . We define the local distance between the two images as the total distance of the shortest path from to in the matrix . The distance can be calculated through dynamic programming as follows:


where is the total distance of the shortest path when walking from to in the distance matrix , and is the total distance of the final shortest path (i.e., the local distance) between the two images.

As shown in Fig. 3, images A and B are samples of the same person. The alignment between the corresponding body parts, such as part in image A, and part in image B, are included in the shortest path. Meanwhile, there are alignments between non-corresponding parts, such as part in image A, and part in image B, still included in the shortest path. These non-corresponding alignments are necessary to maintain the order of vertical alignment, as well as make the corresponding alignments possible. The non-corresponding alignment has a large L2 distance, and its gradient is close to zero in Eq.1. Hence, the contribution of such alignments in the shortest path is small. The total distance of the shortest path, i.e., the local distance between two images, is mostly determined by the corresponding alignments.

The global and local distance together define the similarity between two images in the learning stage, and we chose TriHard loss proposed by [13] as the metric learning loss. For each sample, according to the global distances, the most dissimilar one with the same identity and the most similar one with a different identity is chosen, to obtain a triplet. For the triplet, the loss is computed based on both the global distance and the local distance with different margins. The reason for using the global distance to mine hard samples is due to two considerations. First, the calculation of the global distance is much faster than that of the local distance. Second, we observe that there is no significant difference in mining hard samples using both distances.

Note that in the inference stage, we only use the global features to compute the similarity of two person images. We make this choice mainly because we unexpectedly observed that the global feature itself is also almost as good as the combined features. This somehow counter-intuitive phenomenon might be caused by two factors: 1) the feature map jointly learned is better than learning the global feature only, because we have exploited the structure prior of the person image in the learning stage; 2) with the aid of local feature matching, the global feature can pay more attention to the body of the person, rather than over fitting the background.

3.2 Mutual Learning for Metric Learning

We apply mutual learning to train models for AlignedReID, which can further improve performance. A distillation-based model usually transfers knowledge from a pre-trained large teacher network to a smaller student network, such as [4]. In this paper, we train a set of student models simultaneously, transferring knowledge between each other, such as [49]. Differing from [49]

, which only adopts the Kullback-Leibler (KL) distance between classification probabilities, we propose a new mutual learning loss for metric learning.

The framework of our mutual learning approach is shown in Fig. 4

. The overall loss function includes the metric loss, the metric mutual loss, the classification loss and the classification mutual loss. The metric loss is decided by both the global distances and the local distances, while the metric mutual loss is decided only by the global distances. The classification mutual loss is the KL divergence for classification as in


Given a batch of N images, each network extracts their global features and calculates the global distance between each other as an batch distance matrix, where and denote the -th element in the matrices separately. The mutual learning loss is defined as



represents the zero gradient function, which treats the variable as constant when calculating gradients, stopping the backpropagation in the learning stage.

By applying the zero gradient function, the second-order gradients is


We found that it speeds up the convergence and improves the accuracy compared to a mutual loss without the zero gradient function.

4 Experiments

In this section, we present our results on three most widely used ReID datasets: Market1501 [53], CUHK03 [14], and CUHK-SYSU [41].

4.1 Datasets

Market1501 contains 32,668 images of 1,501 labeled persons of six camera views. There are 751 identities in the training set and 750 identities in the testing set. In the original study on this proposed dataset, the author also uses mAP as the evaluation criteria to test the algorithms.

CUHK03 contains 13,164 images of 1,360 identities. It provides bounding boxes detected from deformable part models (DPMs) and manual labeling.

CUHK-SYSU is a large-scale benchmark for a person search, containing 18,184 images (99,809 bounding boxes) and 8,432 identities. The training set contains 11,206 images of 5,532 query persons, whereas the test set contains 6,978 images of 2,900 persons.

Note that we only train a single model using training samples from all three datasets, as in [40, 50]. We follow the official training and evaluation protocols on Market1501 and CUHK-SYSU, and mainly report the mAP and rank-1 accuracy. For CUHK03, because we train one single model for all benchmarks, it is slightly different from the standard procedure in [14], which splits the dataset randomly 20 times, and the gallery for testing has 100 identities each time. We only randomly split the dataset once for training and testing, and the gallery includes 200 identities. It means our task might be more difficult than the standard procedure. Similarly, we evaluate our method with rank-1, -5, and -10 accuracy on CUHK03.

4.2 Implementation Details

We use Resnet50 and Resnet50-Xception (Resnet-X) pre-trained on ImageNet

[28] as the base models. Resnet50-Xception replaces the filter kernel through the Xception cell [7], which contains one channel-wise convolution layer and one spatial convolution layer. Each image is resized into pixels. The data augmentation includes random horizontal flipping and cropping. The margins of TriHard loss for both the global and local distances is set to , and the mini-batch size is set to , in which each identity has

images. Each epoch includes 2000 mini-batches. We use an Adam optimizer with an initial learning rate of

, and shrink this learning rate by a factor of at 80 and 160 epochs until achieving convergence.

For mutual learning, the weight of classification mutual loss (KL) is set to , and the weight of metric mutual loss is set to . The optimizer uses Adam with an initial learning rate of , which is reduced to and at 60 epochs and 120 epochs until convergence is achieved.

Re-ranking is an effective technique for boosting the performance of ReID [57]. We follow the method and details in [57]. In all of our experiments, we combined metric learning loss with classification (identification) loss.

4.3 Advantage of AlignedReID

Figure 5: The black lines show the alignments of local parts between two persons: the thicker the line is, the greater it contributes to the shortest path. Persons have the same identities in (a-c), while persons have different identities in (d).
Market1501 CUHK-SYSU CUHK03
Base model Methods mAP r = 1 r = 5 mAP r = 1 r = 5 r=1 r=5 r = 10
Resnet50 Baseline 64.5 83.8 94.1 88.5 90.9 96.6 83.3 95.8 97.9
GL-Baseline 58.0 80.4 92.0 86.0 88.2 95.6 81.7 95.0 97.2
AlignedReID 72.8 89.2 96.0 92.9 94.5 98.0 88.1 97.5 98.8
Resnet50-X Baseline 61.7 83.6 93.3 87.9 90.4 95.8 80.4 94.5 97.1
GL-Baseline 57.1 79.7 92.4 85.9 87.9 95.6 80.7 94.7 97.1
AlignedReID 71.8 89.4 95.8 91.3 93.5 97.3 88.3 97.1 98.5
Table 1: Experiment results of AlignedReID. We combine metric learning loss with classification loss in our experiments.
Market1501 CUHK-SYSU CUHK03
Loss Base model mAP r = 1 r=5 mAP r = 1 r=5 r = 1 r = 5 r = 10
Baseline Resnet50 64.5 83.8 94.1 88.5 90.9 96.6 83.3 95.8 97.9
Resnet50-X 61.7 83.6 93.3 87.9 90.4 95.8 80.4 94.5 97.1
Baseline+MC Resnet50 67.8 86.2 94.6 89.8 92.2 97.1 83.8 95.4 97.5
Resnet50-X 68.7 87.3 95.4 89.7 91.8 96.8 84.6 96.2 98.1
Baseline+MM Resnet50 66.8 86.3 95.1 90.1 92.2 96.9 83.8 95.6 97.7
Resnet50-X 66.8 86.2 94.6 90.1 92.5 96.8 84.2 95.8 97.8
AlignedReID Resnet50 72.8 89.2 96.0 92.9 94.5 98.0 88.1 97.5 98.8
Resnet50-X 71.8 89.4 95.8 91.3 93.5 97.3 88.3 97.1 98.5
AlignedReID+MC Resnet50 70.9 88.6 95.9 91.4 93.2 97.2 87.6 97.0 98.3
Resnet50-X 71.2 88.5 95.8 91.2 93.3 97.4 86.9 96.8 98.4
AlignedReID+MM Resnet50 79.3 91.8 97.1 94.4 95.7 98.8 92.4 98.9 99.5
Resnet50-X 79.3 91.2 96.9 94.4 95.8 98.7 92.3 99.6 99.8
Table 2: Results of mutual learning. MC stands for experiments with classification mutual loss. MM stands for experiments with both classification mutual loss and metric mutual loss.

In this section, we analyze the advantage of our AlignedReID model.

We first show some typical results of the alignment in Fig 5. In FIg 5(a), the detection box of the right person is inaccurate, which results in a serious misalignment of heads. AlignedReID matches the first part of the left image with the first three parts of the right image in the shortest path. Fig 5(b) presents another difficult situation where human body structure is defective. The left image does not contain the parts below the knee. In the alignment, the skirt side of the right image are associated with the skirt parts of the left one, while the leg parts of the right image provide small contribution to the shortest path. Fig 5(c) shows an example of occlusion, where the lower part of the persons are invisible. The alignment shows that the occlude parts contribute small in the shortest path, hence the other parts are paid more attention in the learning stage. Fig 5(d) show two different persons with similar appearances. The shirt logo of the right person has no similar part in the left person, which results in a large shortest path distance (local distance) between these two images.

We then compare our AlignedReID with two similar networks: Baseline which has no local feature branch, and GL-Baseline which has local feature branch without alignment. In GL-Baseline, the local loss is the sum of distances of spatial corresponding local features. All results are obtained by using the same network and the same training setting. The results are shown in Table 1. Compared to Baseline, GL-Baseline often gets a worse accuracy. Hence, a local branch without alignment does not help. Meanwhile, AlignedReID boosts 3.1 7.9 rank-1 accuracy and 3.6 10.1 mAP on all datasets. The local feature branch with alignment helps the network focus on useful image regions and discriminates similar person images with subtle differences.

We find that if we apply the local distance together with the global distance in the inference stage, rank-1 accuracy further improves approximately 0.3 0.5. However, it is time consuming and not practical when searching in a large gallery. Hence, we recommend using the global feature only.

4.4 Analysis of Mutual Learning

In the mutual learning experiment, we simultaneously train two AlignedReID models. One model is based on Resnet50, and the other is based on Resnet50-Xception. We compare their performances for three cases: with both metric mutual loss and classification mutual loss, with only classification mutual loss, and with no mutual loss. We also conduct a similar mutual learning experiment as a baseline, where the global features are trained without local features. The results are shown in Table 2.

Both experiments show that the metric mutual learning method can further improve performance. With the baseline mutual learning experiment, the classification mutual loss significantly improves performance on all datasets. However, with the AlignedReID mutual learning experiment, because the models without mutual learning perform well enough, the classification mutual loss cannot further improve performance.

4.5 Comparison with Other Methods

In this subsection, we compare the results of AlignedReID with state-of-the-art methods, in Table 3 5. In the tables, AlignedReID represents our method with mutual learning, and AlignedReID (RK) is our method with both mutual learning and re-ranking [57] with -reciprocal encoding.

On Market1501, GLAD [37] achieves an 89.9 rank-1 accuracy, and our AlignedReID achieves a 91.8 rank-1 accuracy, exceeding it. For mAP, [13] obtains 81.1 owing to the use of re-ranking. With the help of re-ranking, the rank-1 accuracy and mAP are further improved to 94.4 and 90.7 in our AlignedReID (RK), outperforming the best of previous works by 4.5 and 9.6 separately.

On CUHK03, without re-ranking, HydraPlus-Net [20] achieves 91.8 rank-1 accuracy and our AlignedReID yields 92.4. Note that our test gallery size is two times large as that used in [20]. Furthermore, our AlignedReID (RK) obtains a 97.8 rank-1 accuracy, exceeding state-of-the-art by 6.0.

There have not been many studies reported on CUHK-SYSU. With this dataset, AlignedReID achieves 94.4 mAP and 95.8 rank-1 accuracy, which is much higher than any published results.

Methods mAP r=1
Temporal [23] 22.3 47.9
Learning [47] 35.7 61.0
Gated [32] 39.6 65.9
Person [5] 45.5 71.8
Re-ranking [57] 63.6 77.1
Pose [52] 56.0 79.3
Scalable [1] 68.8 82.2
Improving [16] 64.7 84.3
In [13] 69.1 84.9
In (RK)[13] 81.1 86.7
Spindle[50] - 76.9
Deep[49] 68.8 87.7
DarkRank[4] 74.3 89.8
GLAD[37] 73.9 89.9
HydraPlus-Net[20] - 76.9
AlignedReID 79.3 91.8
AlignedReID (RK) 90.7 94.4
Table 3: Comparison on Market1501 in single query mode
Methods r=1 r=5 r=10
Person [15] 44.6 - -
Learning [47] 62.6 90.0 94.8
Gated [32] 61.8 - -
A [34] 57.3 80.1 88.3
Re-ranking [57] 64.0 - -
In [13] 75.5 95.2 99.2
Joint [42] 77.5 - -
Deep [10] 84.1 - -
Looking [2] 72.4 95.2 95.8
Unlabeled [56] 84.6 97.6 98.9
A [55] 83.4 97.1 98.7
Spindle[50] 88.5 97.8 98.6
DarkRank[4] 89.7 98.4 99.2
GLAD[37] 85.0 97.9 99.1
HydraPlus-Net[20] 91.8 98.4 99.1
AlignedReID 92.4 98.9 99.5
AlignedReID (RK) 97.8 99.6 99.8
Table 4: Comparison on CUHK03 labeled dataset
Methods mAP r=1
End[41] 55.7 62.7
Deep [29] 74.0 76.7
Neural [17] 77.9 81.2
AlignedReID 94.4 95.7
Table 5: Comparison with existing methods on CUHK-SYSU
Figure 6: Interface of our human performance evaluation system for CUHK03. The left side shows a query image and the right side shows 10 images sampled using our deep model.

5 Human Performance in Person ReID

Given the significant improvement of our approach, we are curious to find the quality of human performance. Thus, we conduct human performance evaluations on Market1501 and CUHK03.

To make the study feasible, for each query image, the annotator does not have to find the same person from the entire gallery set. We ask him or her to pick the answer from a much smaller set of selected images.

In CUHK03, for each query image, there is only one image for the identical person in the gallery set. The annotator looks for the identical person among images selected: our ReID model first generates the top10 results in the gallery set for the query image; if the “ground truth” is not among the top10 results, we replace the th result with the ground truth.

For Market1501, there may be more than one ground truth in the gallery set. The annotator needs to pick one from images selected as follows: our ReID model generated the top50 results in the gallery set for the query image; if any ground truth is not among them, it would be used to replace one non-ground truth result with the lowest rank. In this way, we make sure that all ground truths are in the selected images.

The interface of the human performance evaluation system is presented in Fig 6. The images are randomly shuffled before being displayed to the annotator. The evaluation website will be available soon. Ten professional annotators participate in the evaluation. Because only one candidate is chosen, we are unable to obtain the mAP of human beings as a standard evaluation. The rank-1 accuracies are computed for each annotator on all datasets. The best accuracy is then used as the human performance, which is shown in Table 6.

On Market1501, human beings achieve a 93.5 rank-1 accuracy, which is better than all state-of-the-art methods. The rank-1 accuracy in our AlignedReID (RK) reaches 94.4 rank-1, exceeding the human performance. On CUHK03, the human performance reaches a 95.7 rank-1 accuracy, which is much higher than any known state-of-the-art methods. Our AlignedReID (RK) obtains a 97.8 rank-1 accuracy, surpassing the human performance.

Figure 7 shows some examples, where an annotator selected a wrong answer, while the top1 result provided by our method is correct.

Market1501 CUHK03
Human Rank 1 93.5 95.7
Human Rank 2 91.1 91.9
Human Rank 3 90.6 91.2
Human Rank 4 90.0 91.1
Human Rank 5 88.3 90.0
AlignedReID (RK) 94.4 97.8
Table 6: Results of human performance evaluation. We show the accuracies of the five annotators who did best in the evaluation. We also show our AlignedReID results with re-ranking.
Figure 7: Top: query image. Middle: the result picked by an annotator. Bottom: top1 result by our method.

6 Conclusion

In this paper, we have demonstrated that an implicit alignment of local features can substantially improve global feature learning. This surprising result gives us an important insight: the end-to-end learning with structure prior is more powerful than a “blind” end-to-end learning.

Although we show that our methods outperform humans in the Market1501 and CUHK03 datasets, it is still early to claim that machines beat humans in general. Figure 8 presents a few “big” mistakes which seldom confuses humans. This indicates that the machine still has a lot of room for improvement.

Figure 8: Top: query image. Middle: top1 result by our method. Bottom: ground truth.