Relation Network for Person Re-identification

11/21/2019 ∙ by Hyunjong Park, et al. ∙ Yonsei University

Person re-identification (reID) aims at retrieving an image of the person of interest from a set of images typically captured by multiple cameras. Recent reID methods have shown that exploiting local features describing body parts, together with a global feature of the person image itself, gives robust feature representations, even in the case of missing body parts. However, using the individual part-level features directly, without considering relations between body parts, makes it difficult to differentiate the identities of different persons having similar attributes in corresponding parts. To address this issue, we propose a new relation network for person reID that considers relations between individual body parts and the rest of them. Our model makes a single part-level feature incorporate partial information of other body parts as well, making it more discriminative. We also introduce a global contrastive pooling (GCP) method to obtain a global feature of a person image. We propose to use contrastive features for GCP to complement conventional max and average pooling techniques. We show that our model outperforms the state of the art on the Market1501, DukeMTMC-reID, and CUHK03 datasets, demonstrating the effectiveness of our approach for learning discriminative person representations.


Introduction

Person re-identification (reID) is one of the fundamental tasks in computer vision, with the purpose of retrieving a particular person from a set of pedestrian images captured by multiple cameras. It has been getting a lot of attention in recent years, due to a wide range of applications including pedestrian detection [36] and multi-person tracking [28]. This problem is very challenging, since pedestrians have different attributes (e.g., clothing, gender, hair), and their pictures are taken under different conditions, such as illumination, occlusion, background clutter, and camera types. Remarkable advances in convolutional neural networks (CNNs) over the last decade allow us to obtain person representations [16, 17, 8] robust to these factors of variation, especially human pose, and they also enable learning metrics [31, 4] for computing the similarities of person features.

Person reID methods using CNNs typically focus on extracting a global feature of a person image [16, 17, 8, 31, 4] to obtain a compact descriptor for efficient retrieval. This, however, gives a limited representation, as the global feature may not account for intra-class variations (e.g., human pose, occlusion, background clutter). To address this problem, part-based methods [32, 23, 34, 14, 18, 33, 26, 7] have been proposed. They extract local features from body parts (e.g., arms, legs, torso), often together with the global feature of the person image itself, and aggregate them for effective person reID. To leverage body parts, these approaches extract pose maps from off-the-shelf pose estimators [32, 23, 34], compute attention maps to consider discriminative regions of interest [14, 18, 33], or slice person images into horizontal grids [26, 7]. Part-level features provide better person representations than a global one, but aggregating the individual local features, e.g., by concatenating them without considering relations between body parts, is limited in representing the identity of a person discriminatively. In particular, this does not differentiate the identities of different persons that have similar attributes in corresponding parts between images, since part-based methods compute the similarity of corresponding part-level features independently.

In this paper, we propose to make each part-level feature incorporate information of other body parts to obtain discriminative person representations for effective person reID. To this end, we introduce a new relation module exploiting one-vs.-rest relations of body parts. It accounts for the relations between individual body parts and the rest of them, so that each part-level feature contains information of the corresponding part itself and of other body parts, making it more discriminative. As will be seen in our experiments, considering the relations between body parts provides better part-level features, with a clear advantage over current part-based methods. We have observed that 1) directly using both global average and max pooling techniques (GAP and GMP) to obtain a global feature of a person image does not provide a performance gain, and 2) GMP gives better results than GAP. Based on this, we also present a global contrastive pooling (GCP) method to obtain better feature representations based on GMP, which adaptively aggregates GAP and GMP results of the entire set of part-level features. Specifically, it uses the discrepancy between the pooling results, and distills the complementary information into the max-pooled features in a residual manner. Experimental results on standard benchmarks, including Market1501 [35], DukeMTMC-reID [21], and CUHK03 [13], demonstrate the advantage of our approach for person reID. To encourage comparison and future work, our code and models are available online: https://cvlab-yonsei.github.io/projects/RRID/.

The main contributions of this paper can be summarized as follows: 1) We introduce a relation network for part-based person reID to obtain discriminative local features. 2) We propose a new pooling method exploiting contrastive features, GCP, to extract a global feature of a person image. 3) We achieve a new state of the art, outperforming other part-based reID methods by a large margin.

Figure 1: Overview of our framework. The proposed reID model mainly consists of three parts: We first extract part-level features by applying GMP to individual horizontal slices of the feature map from the backbone network. We then input the local features into separate modules, a one-vs.-rest relation module and GCP, that give local relational and global contrastive features, respectively.

Related Work

Person reID.

Several person reID methods based on CNNs have recently been proposed. They typically formulate the reID task as a multi-class classification problem [37], where person images of the same identity belong to the same category. A classification loss encourages the images of the same identity to be embedded nearby in feature space. Other reID methods additionally use person images of different identities for training, and enforce the feature distance between person images of the same identity to be smaller than that between different identities using a ranking loss. There are many attempts to obtain discriminative feature representations, e.g., leveraging generative adversarial networks (GANs) to distill identity-related features [8], using attributes to offer complementary information [17], or exploiting body parts to extract diverse person features [32, 34, 23, 18, 14, 30, 33, 26, 7, 29].

Part-based methods enhance the discriminative capabilities of various body parts. We classify them into three categories: The first approach uses a pose estimator (or a landmark detector) to extract a pose map [32, 34, 23]. This requires additional data with landmark annotations to train the pose estimator, and the retrieval accuracy of reID largely depends on the performance of the estimator. The second approach leverages body parts implicitly using an attention map [14, 18, 33, 30], which can be obtained without auxiliary supervisory signals (i.e., pose annotations). It provides a feature representation robust to background clutter, focusing on the regions of interest, but the attended regions may not contain discriminative body parts. The third approach also exploits body parts implicitly, dividing person images into horizontal grids of multiple scales [26, 7, 29]. It assumes that person pictures, localized by off-the-shelf object detectors [6], generally have the same body parts in particular grids (e.g., legs in the lower parts of person images). This is, however, problematic when the detectors do not localize the persons tightly. Our method belongs to the third category. In contrast to other methods, we aggregate local features while considering relations between body parts, rather than exploiting them directly. Furthermore, we introduce a GCP method to obtain a global feature of a person image, providing discriminative person representations.

Relation network.

Exploiting relational reasoning [22, 2, 24, 27] is important for many tasks requiring the capacity to reason about dependencies between different entities (e.g., objects, actors, scene elements). Many works have been proposed to support relation-centric computation, including interaction networks [3] and gated graph sequence networks [15]. The relation network [22] is a representative method that has been successfully adapted to computer vision problems, including visual question answering [22], object detection [2], action recognition [24], and few-shot learning [27]. The basic idea behind the relation network is to consider all pairs of entities and to integrate all these relations, e.g., to answer a question [22] or to localize objects of interest [2]. Motivated by this work, we leverage relations of body parts to obtain better person representations for part-based person reID. In contrast, we exploit the relations between individual body parts and the rest of them, rather than considering all pairs of parts. This encourages each part-level feature to incorporate information of other body parts as well, making it more discriminative, while retaining compact feature representations for person reID.

Our Approach

We show in Fig. 1 an overview of our framework. We extract a feature map of size H × W × C from a person image, where H, W, and C are the height, width, and number of channels, respectively. The resulting feature map is divided equally into six horizontal grids. We then apply GMP to each grid, and obtain part-level features of size 1 × 1 × C. We feed these features through two modules in order to extract novel local and global person representations: the one-vs.-rest relation module and GCP. The first module makes each part-level feature more discriminative by considering the relations between individual body parts and the rest of them, and outputs local relational features of size 1 × 1 × c, where c < C. The second module provides a global contrastive feature of size 1 × 1 × c representing the person image itself. We concatenate the global contrastive and local relational features along the channel dimension, and use the resulting feature of size 1 × 1 × 7c as a person representation for reID. We train our model end-to-end using cross-entropy and triplet losses, with triplets of anchor, positive, and negative person images, where the anchor image has the same identity as the positive one while having a different identity from the negative one. At test time, we extract features of person images, and compute the Euclidean distance between them to determine the identities of persons.
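As a minimal sketch (not the authors' code), the test-time retrieval step amounts to ranking gallery features by their Euclidean distance to the query feature; the tensor shapes below are assumptions:

```python
import torch

def rank_gallery(query_feat: torch.Tensor, gallery_feats: torch.Tensor) -> torch.Tensor:
    """query_feat: (D,) person representation; gallery_feats: (N, D).
    Returns gallery indices sorted from most to least similar."""
    dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)  # (N,) Euclidean distances
    return torch.argsort(dists)  # ascending distance = descending similarity
```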

(a) One-vs.-rest relation module
(b) GCP
Figure 2: Illustration of (a) the one-vs.-rest relation module and (b) GCP. The relation module gives a local relational feature q_i for each part-level feature p_i. Here, we show the process of extracting the feature q_1. Other local relational features are computed similarly. GCP outputs a global contrastive feature q_G considering all part-level features simultaneously. We do not share the weight parameters of the convolutional layers across part-level features. See text for details.
(a) GAP
(b) GMP
(c) GAP+GMP
(d) GCP
Figure 3: Illustration of various pooling methods: (a) GAP; (b) GMP; (c) GAP + GMP; (d) GCP.

Relation networks for part-based reID

Part-level features.

We exploit a ResNet-50 [9] trained for ImageNet classification [5] as our backbone network to extract an initial feature map from an input person image. Specifically, following the work of [26], we remove the GAP and fully connected layers from the ResNet-50 architecture, and set the stride of the last convolutional layer to 1. Similar to other part-based reID methods [26, 7], we split the initial feature map into six horizontal grids of size (H/6) × W × C. We apply GMP to each of them, and obtain part-level features of size 1 × 1 × C.
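A minimal PyTorch sketch of this step, assuming a torchvision ResNet-50 and the layer names it exposes; the class and variable names below are ours, not the authors':

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class PartFeatureExtractor(nn.Module):
    """Backbone with the last stride set to 1; outputs six GMP'd part-level features."""
    def __init__(self, num_parts: int = 6):
        super().__init__()
        backbone = resnet50(pretrained=True)
        backbone.layer4[0].conv2.stride = (1, 1)          # keep spatial resolution in the last stage
        backbone.layer4[0].downsample[0].stride = (1, 1)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop GAP and FC layers
        self.num_parts = num_parts

    def forward(self, x):                                  # x: (B, 3, H_img, W_img)
        fmap = self.backbone(x)                            # (B, C, H, W), C = 2048
        strips = fmap.chunk(self.num_parts, dim=2)         # six horizontal grids
        # GMP over each grid -> part-level features p_i of size (B, C, 1, 1)
        return [torch.amax(s, dim=(2, 3), keepdim=True) for s in strips]
```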

One-vs.-rest relational module.

Extracting part-level features from the horizontal grids allows us to leverage body parts implicitly for diverse person representations. Existing reID methods [26, 7, 29] use these local features independently for person retrieval. They concatenate all local features in a particular order, considering rough geometric correspondences between person images. Although this gives a structural person representation robust to geometric variations and occlusion, the local features cover small parts of an image only, and, more importantly, they do not account for the relations between body parts. That is, individual parts are isolated and do not communicate with each other, which hinders computing the similarity between different persons with similar attributes in corresponding parts. To alleviate this problem, we propose to leverage the relations between body parts for person representations. Specifically, we introduce a new relation network (Fig. 2(a)) that exploits one-vs.-rest relations of body parts, making it possible for each part-level feature to contain information of the corresponding part itself and of other body parts.

Concretely, we denote by p_i each part-level feature of size 1 × 1 × C. We apply average pooling to all part-level features except the one of the particular part i, aggregating the information from other body parts as follows: r_i = (1/5) Σ_{j≠i} p_j for the six-part case. We then add a 1 × 1 convolutional layer separately for each p_i and r_i, giving feature maps p'_i and r'_i of size 1 × 1 × c, respectively. The relation network concatenates the features p'_i and r'_i, and outputs a local relational feature q_i for each part i. We depict in Fig. 2(a) an example of extracting the local relational feature q_1. Here, we assume that the concatenated feature contains information of the original part p'_i itself and of other body parts. We thus use a skip connection [9] to transfer the relational information of p'_i and r'_i to p'_i, as follows:

q_i = p'_i + T_R([p'_i ; r'_i]),    (1)

where T_R is a sub-network consisting of a 1 × 1 convolution, batch normalization [11], and ReLU [12] layers, and [· ; ·] denotes a concatenation of features. The residual T_R([p'_i ; r'_i]) supports the part-level feature p'_i, making it more discriminative and robust to occlusion. We could leverage all pairwise relations between the features p_i, similar to [3], but this requires a large computational cost and increases the dimension of the features drastically. In contrast, our one-vs.-rest relation module computes the features q_i in linear time, and also retains a compact feature representation.
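A minimal PyTorch sketch of this module under the dimensions stated above (C = 2048, c = 256); the class and variable names are ours, not the authors':

```python
import torch
import torch.nn as nn

class OneVsRestRelation(nn.Module):
    """One-vs.-rest relation module: each part-level feature p_i is combined
    with the average r_i of the remaining parts, following Eq. (1)."""
    def __init__(self, num_parts: int = 6, C: int = 2048, c: int = 256):
        super().__init__()
        # weights are not shared across parts
        self.reduce_p = nn.ModuleList([nn.Conv2d(C, c, 1) for _ in range(num_parts)])
        self.reduce_r = nn.ModuleList([nn.Conv2d(C, c, 1) for _ in range(num_parts)])
        self.relate = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2 * c, c, 1), nn.BatchNorm2d(c), nn.ReLU(inplace=True))
            for _ in range(num_parts)])

    def forward(self, parts):                              # list of (B, C, 1, 1) tensors
        q = []
        for i, p_i in enumerate(parts):
            rest = [p for j, p in enumerate(parts) if j != i]
            r_i = torch.stack(rest, dim=0).mean(dim=0)     # one-vs.-rest average
            p_red, r_red = self.reduce_p[i](p_i), self.reduce_r[i](r_i)
            q.append(p_red + self.relate[i](torch.cat([p_red, r_red], dim=1)))  # Eq. (1)
        return q                                           # local relational features q_i
```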

GCP.

To represent an entire person image, previous reID methods use GAP [26], GMP [29], or both [7]. GAP covers all body parts of the person image (Fig. 3(a)), but it is easily distracted by background clutter and occlusion. GMP overcomes this problem by aggregating the feature from the most discriminative part useful for reID while discarding background clutter (Fig. 3(b)). This, however, does not contain information from all body parts. A hybrid approach exploiting both GAP and GMP [7] may perform better, but it is also influenced by background clutter (Fig. 3(c)). It has been shown that GMP is more effective than GAP [7], which we also verify in our experiments. Motivated by this, we propose a novel GCP method based on GMP to extract a global feature from all body parts (Fig. 2(b)). Rather than applying GAP or GMP to the initial feature map from the input person image [26, 7], we first perform average and max pooling over all part-level features. We denote by p_avg and p_max the resulting feature maps obtained by average and max pooling, respectively. Note that p_avg and p_max are robust to background clutter, as we use GMP to obtain the initial part-level features (Fig. 1). That is, we aggregate the features from the most discriminative parts for every horizontal region. In particular, p_max corresponds to the result of GMP with respect to the initial feature map from the backbone network. We then compute a contrastive feature p_cont by subtracting p_max from p_avg, namely, the discrepancy between them. It aggregates the most discriminative information from individual body parts (e.g., green boxes in Fig. 3(d)) except the one for p_max (e.g., the red box in Fig. 3(d)). We add bottleneck layers to reduce the number of channels of p_cont and p_max from C to c, denoted by p'_cont and p'_max, respectively, and finally transfer the complementary information of the contrastive feature p'_cont to p'_max (Fig. 3(d)). Formally, we obtain a global contrastive feature q_G of the input image as follows:

q_G = p'_max + T_G(p'_cont),    (2)

where T_G is a sub-network that consists of a 1 × 1 convolutional, batch normalization [11], and ReLU [12] layers. The global feature q_G is based on p'_max, and aggregates the complementary information from the contrastive feature p'_cont with reference to p'_max. It thus inherits the advantages of GMP, such as the robustness to background clutter, while covering all body parts. We concatenate the global contrastive feature q_G in (2) and the local relational ones q_i (i = 1, ..., 6) in (1), and use the result as a person representation for reID.
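A minimal PyTorch sketch of GCP under the stated dimensions (C = 2048, c = 256), assuming the contrastive feature is the difference between the average- and max-pooled part-level features as described above; names are ours:

```python
import torch
import torch.nn as nn

class GlobalContrastivePooling(nn.Module):
    """GCP: distill the complementary (contrastive) information into the
    max-pooled feature in a residual manner, following Eq. (2)."""
    def __init__(self, C: int = 2048, c: int = 256):
        super().__init__()
        self.bottleneck_max = nn.Conv2d(C, c, 1)     # p_max  -> p'_max
        self.bottleneck_cont = nn.Conv2d(C, c, 1)    # p_cont -> p'_cont
        self.transfer = nn.Sequential(nn.Conv2d(c, c, 1), nn.BatchNorm2d(c), nn.ReLU(inplace=True))

    def forward(self, parts):                        # list of part-level features, each (B, C, 1, 1)
        stacked = torch.stack(parts, dim=0)          # (num_parts, B, C, 1, 1)
        p_avg = stacked.mean(dim=0)                  # average pooling over parts
        p_max = stacked.amax(dim=0)                  # max pooling over parts
        p_cont = p_avg - p_max                       # contrastive feature (discrepancy)
        return self.bottleneck_max(p_max) + self.transfer(self.bottleneck_cont(p_cont))  # Eq. (2)
```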

Methods | F-dim. | Market1501 (mAP / rank-1) | CUHK03 Labeled (mAP / rank-1) | CUHK03 Detected (mAP / rank-1) | DukeMTMC-reID (mAP / rank-1)
SVDNet [25] 2,048 62.1 82.3 37.8 40.9 37.3 41.5 56.8 76.7
Triplet [10] 128 69.1 84.9 - - - - - -
HA-CNN [14] 1,024 75.7 91.2 41.0 44.4 38.6 41.7 63.8 80.5
Deep-Person [1] 2,048 79.5 92.3 - - - - 64.8 80.9
AlignedReID [19] 2,048 79.1 91.8 - - 59.6 61.5 69.7 82.1
PCB [26] 1,536 77.3 92.4 - - 54.2 61.3 65.3 81.9
PCB+RPP [26] 1,536 81.0 93.1 - - 57.5 63.7 68.5 82.9
HPM [7] 3,840 82.7 94.2 - - 57.5 63.9 74.3 86.6
MGN [29] 2,048 86.9 95.7 67.4 68.0 66.0 66.8 78.4 88.7
Ours-S 1,792 88.0 94.8 73.5 76.6 69.5 72.5 77.1 89.3
Ours-F 3,840 88.9 95.2 75.6 77.9 69.6 74.4 78.6 89.7
Table 1: Quantitative comparison with the state of the art in person reID. We measure mAP (%) and rank-1 accuracy (%) on the Market1501 [35], CUHK03 [13], and DukeMTMC-reID [21] datasets. The suffixes “-S” and “-F” denote our models using f_6 and f_{2,4,6}, respectively.
Figure 4: Qualitative comparison of person reID results on the Market1501 dataset [35]. We show the top-5 retrieval results (left: rank-1, right: rank-5) for AlignedReID [19], Deep-Person [1], PCB [26] and Ours-F. Retrieved images with green and red boxes are correct and incorrect results, respectively.

Training loss.

We exploit ground-truth identification labels of person images to learn the person representation. To train our model, we use cross-entropy and triplet losses, balanced by the parameter λ, as follows:

L = L_tri + λ L_ce,    (3)

where we denote by L_tri and L_ce the triplet and cross-entropy losses, respectively. The cross-entropy loss is defined as

L_ce = −(1/M) Σ_{i=1}^{M} log ŷ_i,    (4)

where we denote by M and y_i the number of images in a mini-batch and a ground-truth identification label, respectively. ŷ_i is a predicted identification label for each feature q in the person representation, defined as

ŷ_i = exp(W_{y_i}^T q) / Σ_{k=1}^{N_c} exp(W_k^T q),    (5)

where N_c is the number of identification labels, and W_k is the classifier for the feature q and the label k. We use a fully connected layer for the classifier. To enhance the ranking performance, we use the batch-hard triplet loss [10], formulated as follows:

L_tri = (1 / (PK)) Σ_{i=1}^{P} Σ_{a=1}^{K} [ α + max_{p=1..K} ||x_a^i − x_p^i||_2 − min_{j≠i, n=1..K} ||x_a^i − x_n^j||_2 ]_+,    (6)

where P is the number of identities in a mini-batch, K is the number of images for each identity in the mini-batch (M = PK), and [·]_+ = max(·, 0). α is a margin parameter to control the distances between positive and negative pairs in feature space. We denote by x_a^i, x_p^i, and x_n^j the person representations of anchor, positive, and negative images, respectively, where the superscripts and subscripts correspond to identification and image indexes.
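A minimal sketch of the batch-hard triplet loss in (6), assuming person representations of shape (M, D) and integer identity labels of shape (M,); the margin α is left as an argument:

```python
import torch

def batch_hard_triplet_loss(feats: torch.Tensor, labels: torch.Tensor, margin: float) -> torch.Tensor:
    """feats: (M, D) person representations; labels: (M,) identity labels; margin: α in Eq. (6)."""
    dist = torch.cdist(feats, feats)                          # (M, M) Euclidean distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)      # (M, M) same-identity mask
    hardest_pos = (dist * same_id.float()).max(dim=1).values  # farthest positive per anchor
    hardest_neg = dist.masked_fill(same_id, float('inf')).min(dim=1).values  # closest negative
    return torch.relu(margin + hardest_pos - hardest_neg).mean()
```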

Extension to different numbers of grids.

We have so far described our model using one global and six local features, i.e., seven features for a person representation, which is denoted by f_6 hereafter. Without loss of generality, we can use different numbers of horizontal grids for the person representation, to consider parts of multiple scales [7], such as f_2 and f_4 that split the initial feature map into two and four horizontal regions, respectively. Accordingly, we concatenate the features of f_2, f_4, and f_6, i.e., f_{2,4,6}, and use it as a final person representation for an effective reID. Note that f_2, f_4, and f_6 contain different local relational features, and thus have different global contrastive features. Note also that these features share the same backbone network with the same parameters.

Experimental Results

Implementation details

Dataset.

We test our method on the following datasets and compare its performance with the state of the art. 1) The Market1501 dataset [35] contains 32,668 person images of 1,501 identities captured by six cameras. We use the training/test split provided by [35], which consists of 12,936 images of 751 identities for training and 3,368 query and 19,732 gallery images of 750 identities for testing. 2) The CUHK03 dataset [13] provides 14,097 images of 1,467 identities observed by two cameras, and it offers two types of person images: manually labeled ones and ones detected by the DPM method [6]. Following the training/test split of [38], we divide it into 7,365 images of 767 identities for training, and 1,400 query and 5,332 gallery images of 700 identities for testing. 3) The DukeMTMC-reID dataset [21] offers 16,522 training images of 702 identities, and 2,228 query and 17,661 gallery images of another 702 identities.

Training.

We resize all images to 384 × 128 for training. We set the numbers of feature channels C and c to 2,048 and 256, respectively. This results in 1,792- and 3,840-dimensional features for f_6 and f_{2,4,6}, respectively. We augment the training datasets with horizontal flipping and random erasing [39]. We use stochastic gradient descent (SGD) as the optimizer, with a momentum of 0.9 and a weight decay of 5e-4. We train our model with a batch size of 64 for 80 epochs, where we randomly choose 16 identities and sample 4 person images for each identity (P = 16, K = 4). The learning rate is initially set to 1e-3 and 1e-2 for the backbone network and the other parts, respectively, and is divided by 10 every 20 epochs after the first 40 epochs. We empirically set the weight parameter λ to 2, and fix it for all experiments. All networks are trained end-to-end using PyTorch [20]. Training our model takes about six, three, and eight hours with two NVIDIA Titan Xps for the Market1501, CUHK03, and DukeMTMC-reID datasets, respectively.
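A sketch of this optimization setup in PyTorch, with placeholder modules standing in for the backbone and the remaining layers (the attribute names are ours):

```python
import torch
import torch.nn as nn

# Placeholders for the actual backbone and reID heads (relation module, GCP, classifiers).
model = nn.ModuleDict({'backbone': nn.Linear(8, 8), 'heads': nn.Linear(8, 8)})

optimizer = torch.optim.SGD(
    [{'params': model['backbone'].parameters(), 'lr': 1e-3},   # backbone learning rate
     {'params': model['heads'].parameters(), 'lr': 1e-2}],     # other parts
    lr=1e-2, momentum=0.9, weight_decay=5e-4)
# Keep the learning rates for 40 epochs, then divide by 10 every 20 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40, 60], gamma=0.1)

for epoch in range(80):
    # ... one pass over mini-batches of P = 16 identities x K = 4 images each ...
    scheduler.step()
```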

Comparison with the state of the art

Quantitative results.

We compare in Table 1 our models with the state of the art, including part-based person reID methods. We measure mean average precision (mAP) (%) and rank-1 accuracy (%) on the Market1501 [35], CUHK03 [13], and DukeMTMC-reID [21] datasets. We report reID results for a single query for a fair comparison. The suffixes “-S” and “-F” denote our models using f_6 and f_{2,4,6}, respectively, for the final person representations. Table 1 shows that Ours-S outperforms state-of-the-art reID methods in terms of mAP and rank-1 accuracy on all datasets, except MGN [29]. This demonstrates the effectiveness of our one-vs.-rest relation module and GCP. Moreover, Ours-S uses 1,792-dimensional features, allowing efficient person retrieval, while providing state-of-the-art results. Ours-F gives the best results on all datasets in terms of mAP. We achieve an mAP of 88.9% and rank-1 accuracy of 95.2% on the Market1501, an mAP of 69.6%/75.6% and rank-1 accuracy of 74.4%/77.9% with detected/labeled images on the CUHK03, and an mAP of 78.6% and rank-1 accuracy of 89.7% on the DukeMTMC-reID. The rank-1 accuracy of Ours-F is slightly lower than that of MGN [29] on the Market1501, but Ours-F outperforms MGN on the other datasets by a significant margin. Note that person images in the CUHK03 and DukeMTMC-reID datasets are much more difficult to retrieve, as they typically exhibit large pose variations, background clutter, occlusion, and confusing attributes.

Qualitative results.

Figure 4 shows a visual comparison of person retrieval results with the state of the art [19, 1, 26] on the Market1501 dataset [35]. We show the top-5 retrieval results for the query image. The results for all comparisons have been obtained with the official models provided by the authors. We can see that our method retrieves correct person images, and in particular it is robust to attribute variations (e.g., bicycles) and background clutter (e.g., grass). Other part-based methods, including AlignedReID and PCB, try to match local features between images for retrieval. For example, they focus on finding the correspondences for bicycles or grass in the query image, giving many false-positive results. Note that the gallery set contains many images of persons riding a bicycle, but they have different identities from the person in the query.

Pooling for GF | Pooling for LF | F-dim. | Market1501 (mAP / rank-1) | CUHK03 Labeled (mAP / rank-1) | CUHK03 Detected (mAP / rank-1) | DukeMTMC-reID (mAP / rank-1)
GAP - 256 74.5 88.2 57.0 60.6 54.3 58.4 62.9 79.6
- GAP 1,536 79.0 92.3 65.1 67.9 62.4 65.6 70.0 84.0
GAP GAP 1,792 82.9 92.9 68.1 71.4 63.6 66.2 73.5 85.5
GMP GMP 1,792 83.7 93.2 70.7 73.9 64.3 66.8 74.8 86.1
GMP GMP 1,792 86.7 94.2 73.3 75.8 67.6 70.2 76.3 88.3
GAP GMP 1,792 85.8 94.1 72.6 75.0 67.6 69.6 75.6 88.2
GAP+GMP GMP 1,792 86.6 94.3 72.9 75.8 68.1 70.3 76.5 88.4
GCP GMP 1,792 88.0 94.8 73.5 76.6 69.5 72.5 77.1 89.3
GMP GMP 3,840 86.7 94.4 72.8 74.6 67.7 69.1 76.1 87.7
GAP GMP 3,840 86.5 94.2 72.5 74.7 67.3 69.9 76.5 87.3
GAP+GMP GMP 3,840 86.5 94.1 72.8 75.4 66.5 69.9 76.6 87.8
GCP GMP 3,840 87.3 94.5 73.0 75.9 69.0 71.6 77.4 88.3
GCP GMP 3,840 88.9 95.2 75.6 77.9 69.6 74.4 78.6 89.7
Table 2: Quantitative comparison of different network architectures. We measure mAP(%) and rank-1 accuracy(%) on the Market1501 [35], CUHK03 [13] and DukeMTMC-reID [21] datasets. Numbers in bold indicate the best performance and underscored ones are the second best. GF: Global features; LF: Local features; RM: One-vs.-rest relational module; Ext.: An extension to multiple scales.

Discussion

Ablation study.

We show an ablation analysis of the different components of our model in Table 2. We compare the performance of several variants of our model in terms of mAP and rank-1 accuracy. From the first three rows, we can see that using both global and local features improves the retrieval performance, which confirms the findings of part-based reID methods. The fourth row shows that GMP gives better results than GAP, as GMP avoids background clutter. The results in the next row demonstrate the effect of local features obtained using our relation module. For example, this gives performance gains of 3% and 1% in mAP and rank-1 accuracy, respectively, on the Market1501 dataset, which is quite significant. The fifth to eighth rows compare the retrieval performance according to the pooling methods for the global feature, and we can see that GCP performs better than GMP, GAP, and GAP+GMP in terms of mAP and rank-1 accuracy. For example, compared with GAP+GMP, our GCP improves mAP from 86.6% to 88.0% on the Market1501 dataset. The last four rows suggest that exploiting part-level features of multiple scales is important, and using all components performs best.

We show in Fig. 5(a) retrieval results with and without the relation module. We can see that person representations obtained from the relation module successfully discriminate the same attribute (e.g., violet shirts) for person images of different identities (e.g., gender), and they are robust to occlusion (e.g., the person occluded by a bag or a bicycle). We compare in Fig. 5(b) retrieval results for different pooling methods. We confirm once again that GAP is not robust to background clutter and that GMP sees only the most discriminative region (e.g., a bicycle) rather than the person. We observe that GCP alleviates these problems while maintaining the advantage of GMP.

Figure 5: Visual comparison of retrieval results: (a) with/without the relation module and (b) different pooling methods. We show top-1 results. The relation module discriminates the same attribute for person images of different identities. GCP aggregates features from discriminative regions, and provides a person representation robust to background clutter, overcoming the drawbacks of GAP and GMP.
Figure 6: Examples of activation maps for the models using the one-vs.-rest feature r_i and the combined feature r_all. We also show top-3 retrieval results for each model. We can see that the rest regions (e.g., the fifth and sixth horizontal grids) are more highly activated by the person representation using r_i than by that using r_all, indicating that our one-vs.-rest relation allows each part-level feature to see the rest regions effectively.

One-vs.-rest relation.

We consider relations between each part-level feature p_i and its rest feature r_i. To show how the relation module works, we train a model using a combined rest feature r_all, obtained by averaging all part-level features, instead of r_i in (1). In this case, we do not use the multi-scale extension. We visualize activation maps of r_i and r_all in Fig. 6. We observe that r_i focuses more on the regions whose attributes are different from those of p_i, compared to r_all. This demonstrates that r_i extracts the complementary features of p_i, which are helpful for person reID but not contained in p_i, from the rest of the parts. This also verifies that using the feature of p_i only is not enough to discriminate the identities of different persons having similar attributes in corresponding parts between images. The mAP/rank-1 accuracy for the model using r_all on Market1501 are 84.5%/93.4%, which are lower than the 86.7%/94.2% obtained using r_i in Table 2.

Performance comparison of local features.

We demonstrate the capability of the one-vs.-rest relation module to provide discriminative features. We extract a local feature of size 1 × 1 × c for each horizontal region with and without using the relation module. Given a query image, we then retrieve person images using a single local feature in the person representation of f_6. We report mAP and rank-1 accuracy for individual local features extracted from different horizontal regions (1: top, 6: bottom) on the Market1501 dataset [35] in Table 3. From this table, we can observe two things: (1) The relation module improves mAP and rank-1 accuracy drastically. The improvements in mAP and rank-1 accuracy are 21.7% and 19.6%, respectively, on average. The rank-1 accuracy measures the performance of retrieval for the easiest match, while mAP characterizes the ability to retrieve all person images of the same identity, indicating that the relation module is beneficial especially for retrieving challenging person images. (2) The local features from the third and last horizontal regions give the best and the worst results, respectively. This suggests that the middle part, typically corresponding to the torso of a person, provides the most discriminative feature for specifying a particular person, and the bottom part (e.g., legs, or sometimes background due to incorrect localization of the person detector) gives the least discriminative feature for person reID.

Horizontal region | w/o RM (mAP / rank-1) | w/ RM (mAP / rank-1)
1 29.6 49.4 58.7 82.2
2 46.4 69.2 68.5 87.1
3 55.3 77.2 73.5 89.1
4 51.8 72.7 71.9 87.6
5 49.7 69.2 69.7 85.4
6 35.5 53.4 55.7 76.9
Table 3: Quantitative comparison of single local features on the Market1501 dataset [35]. RM: One-vs.-rest relational module.

Performance comparison of global features.

Table 4 compares the reID performance of single global features in terms of mAP and rank-1 accuracy on the Market1501 dataset [35]. We use only the 256-dimensional global feature in the person representation of f_6 for person retrieval, obtained by different pooling methods. Note that the size of this global feature is much smaller than that of typical person representations in Table 1. We can see from Table 4 that GCP gives the best retrieval results in terms of both mAP and rank-1 accuracy, outperforming GAP, GMP, and GAP+GMP by a large margin. Compared with other reID methods in Table 1, GCP offers a good compromise between accuracy and feature size. For example, our 256-dimensional global contrastive feature achieves a rank-1 accuracy of 93.4%, which is comparable to the 93.1% of PCB+RPP [26] using 1,536-dimensional features.

Methods | mAP | rank-1
GAP 81.0 91.3
GMP 81.9 92.0
GAP+GMP 82.5 92.6
GCP 84.6 93.4
Table 4: Quantitative comparison of global features on the Market1501 dataset [35].

Conclusion

We have presented a relation network for person reID considering the relations between individual body parts and the rest of them, making each part-level feature more discriminative. We have also proposed to use contrastive features for a global person representation. We set a new state of the art on person reID, outperforming other reID methods by a significant margin. The ablation analysis clearly demonstrates the effectiveness of each component in our model.

Acknowledgments.

This research was supported by R&D program for Advanced Integrated-intelligence for Identification (AIID) through the National Research Foundation of KOREA (NRF) funded by Ministry of Science and ICT (NRF-2018M3E3A1057289).

References