Camera-aware Proxies for Unsupervised Person Re-Identification

12/19/2020 · Menglin Wang et al. · Zhejiang University

This paper tackles the purely unsupervised person re-identification (Re-ID) problem that requires no annotations. Some previous methods adopt clustering techniques to generate pseudo labels and use the produced labels to train Re-ID models progressively. These methods are relatively simple but effective. However, most clustering-based methods take each cluster as a pseudo identity class, neglecting the large intra-ID variance caused mainly by the change of camera views. To address this issue, we propose to split each single cluster into multiple proxies, each proxy representing the instances coming from the same camera. These camera-aware proxies enable us to deal with large intra-ID variance and generate more reliable pseudo labels for learning. Based on the camera-aware proxies, we design both intra- and inter-camera contrastive learning components for our Re-ID model to effectively learn the ID discrimination ability within and across cameras. Meanwhile, a proxy-balanced sampling strategy is also designed, which facilitates our learning further. Extensive experiments on three large-scale Re-ID datasets show that our proposed approach outperforms most unsupervised methods by a significant margin. Notably, on the challenging MSMT17 dataset, we gain 14.3% Rank-1 and 10.2% mAP improvements when compared to the second place.


1 Introduction

Person re-identification (Re-ID) is the task of identifying the same person across non-overlapping cameras. This task has attracted extensive research interest due to its significance in surveillance and public security. State-of-the-art Re-ID performance is achieved mainly by fully supervised methods Sun et al. (2018); Chen et al. (2019). These methods need sufficient annotations that are expensive and time-consuming to obtain, making them impractical in real-world deployments. Therefore, more and more recent studies focus on unsupervised settings, aiming to learn Re-ID models via unsupervised domain adaptation (UDA) Wei et al. (2018b); Qi et al. (2019b); Zhong et al. (2019) or purely unsupervised Lin et al. (2019); Li et al. (2018); Wu et al. (2019b) techniques. Although considerable progress has been made on the unsupervised Re-ID task, there is still a large performance gap compared to the supervised counterpart.

Figure 1: (a) T-SNE van der Maaten and Hinton (2008) visualization of the feature distribution on Market-1501. The features are extracted by an ImageNet-pretrained model for images of randomly selected IDs. The images from one camera are marked with the same colored bounding boxes. (b) and (c) display two sub-regions.

This work addresses the purely unsupervised Re-ID task, which does not require any labeled data and is therefore more challenging than the UDA-based problem. Previous methods mainly resort to pseudo labels for learning, adopting clustering Lin et al. (2019); Zeng et al. (2020), k-nearest neighbors (k-NN) Li et al. (2018); Chen et al. (2018), or graph Ye et al. (2017); Wu et al. (2019b) based association techniques to generate pseudo labels. The clustering-based methods learn Re-ID models by iteratively conducting a clustering step and a model updating step. These methods follow a relatively simple routine but achieve promising results. We therefore follow this research line and propose a more effective approach.

Previous clustering-based methods Lin et al. (2019); Zeng et al. (2020); Fan et al. (2018); Zhai et al. (2020) treat each cluster as a pseudo identity class, neglecting the intra-ID variance caused by changes of pose, illumination, and camera view. When observing the distribution of features extracted by an ImageNet Krizhevsky et al. (2012)-pretrained model from Market-1501 Zheng et al. (2015), we notice that, among the images belonging to the same ID, those within one camera tend to gather more closely than those from different cameras. That is, one ID may present multiple sub-clusters, as demonstrated in Figure 1(b) and (c).

The above-mentioned phenomenon inspires us to propose a camera-aware proxy assisted learning method. Specifically, we split each single cluster, obtained by a camera-agnostic clustering method, into multiple camera-aware proxies, each representing the instances coming from the same camera. These camera-aware proxies can better capture local structures within IDs. More importantly, when treating each proxy as an intra-camera pseudo identity class, the variance and noise within a class are greatly reduced. Taking advantage of the proxy-based labels, we design an intra-camera contrastive learning Chen et al. (2020) component to jointly tackle multiple camera-specific Re-ID tasks. Compared to the global Re-ID task, each camera-specific task deals with fewer IDs and smaller variance while using more reliable pseudo labels, and therefore is easier to learn. The intra-camera learning enables our Re-ID model to effectively learn discrimination ability within cameras. Besides, we also design an inter-camera contrastive learning component, which exploits both positive and hard negative proxies across cameras to learn global discrimination ability. A proxy-balanced sampling strategy is also adopted to select appropriate samples within each mini-batch, facilitating the model learning further.

In contrast to previous clustering-based methods, the proposed approach distinguishes itself in the following aspects:

  • Instead of using camera-agnostic clusters, we produce camera-aware proxies which can better capture local structure within IDs. They also enable us to deal with large intra-ID variance caused by different cameras, and generate more reliable pseudo labels for learning.

  • With the assistance of the camera-aware proxies, we design both intra- and inter-camera contrastive learning components which effectively learn ID discrimination ability within and across cameras. We also propose a proxy-balanced sampling strategy to facilitate the model learning further.

  • Extensive experiments on three large-scale datasets, including Market-1501 Zheng et al. (2015), DukeMTMC-reID Zheng et al. (2017), and MSMT17 Wei et al. (2018a), show that the proposed approach outperforms both purely unsupervised and UDA-based methods. Notably, on the challenging MSMT17 dataset, we gain 14.3% Rank-1 and 10.2% mAP improvements when compared to the second place.

2 Related Work

2.1 Unsupervised Person Re-ID

According to whether using external labeled datasets or not, unsupervised Re-ID methods can be grouped into purely unsupervised or UDA-based categories.

Purely unsupervised person Re-ID does not require any annotations and thus is more challenging. Existing methods mainly resort to pseudo labels for learning. Clustering Lin et al. (2019); Zeng et al. (2020), k-NN Li et al. (2018); Chen et al. (2018), or graph Ye et al. (2017); Wu et al. (2019b) based association techniques have been developed to generate pseudo labels. Most clustering-based methods like BUC Lin et al. (2019) and HCT Zeng et al. (2020) perform in a camera-agnostic way, which can maintain the similarity within IDs but may neglect the intra-ID variance caused by the change of camera views. Conversely, TAUDL Li et al. (2018), DAL Chen et al. (2018), and UGA Wu et al. (2019b) divide the Re-ID task into intra- and inter-camera learning stages, by which the discrimination ability learned within cameras can facilitate ID association across cameras. These methods generate intra-camera pseudo labels via a sparse sampling strategy and need a proper way to associate IDs across cameras. In contrast, our cross-camera association is straightforward. Moreover, we propose distinct learning strategies in both the intra- and inter-camera learning parts.

Unsupervised domain adaptation (UDA) based person Re-ID requires some source datasets that are fully annotated, but leaves the target dataset unlabeled. Most existing methods address this task by either transferring image styles Wei et al. (2018b); Deng et al. (2018a); Liu et al. (2019) or reducing distribution discrepancy Qi et al. (2019b); Wu et al. (2019a) across domains. These methods focus more on transferring knowledge from source to target domain, leaving the unlabeled target datasets underexploited. To sufficiently exploit unlabeled data, clustering Fan et al. (2018); Zhai et al. (2020); Ge et al. (2020b) or k-NN Zhong et al. (2019) based methods have also been developed, analogous to those introduced in the purely unsupervised task. Differently, these methods either take into account both original and transferred data Fan et al. (2018); Zhong et al. (2019); Ge et al. (2020b), or integrate a clustering procedure together with an adversarial learning step Zhai et al. (2020).

Figure 2: An overview framework of the proposed method. It iteratively alternates between a clustering step and a model updating step. In the clustering step, a global clustering is first performed and then each cluster is split into multiple camera-aware proxies to generate pseudo labels. In the model updating step, intra- and inter-camera losses are designed based on a proxy-level memory bank to perform contrastive learning.

2.2 Intra-Camera Supervised Person Re-ID

Intra-camera supervision (ICS) Zhu et al. (2019); Qi et al. (2020) is a setting proposed in recent years. It assumes that IDs are independently labeled within each camera view and that no inter-camera ID association is annotated. Two key problems are therefore how to effectively perform the supervised intra-camera learning and how to conduct the unsupervised inter-camera learning. To address them, various methods such as PCSL Qi et al. (2020), ACAN Qi et al. (2019a), MTML Zhu et al. (2019), MATE Zhu et al. (2020), and Precise-ICS Wang et al. (2021) have been developed. Most of these methods pay much attention to the association of IDs across cameras. When taking camera-aware proxies as pseudo labels, our work shares a similar scenario in the intra-camera learning with these ICS methods. Differently, our inter-camera association is straightforward due to the proxy generation scheme, so we focus more on the way to generate reliable proxies and conduct effective learning. Besides, the unsupervised Re-ID task tackled in our work is more challenging than the ICS problem.

2.3 Metric Learning with Proxies

Metric learning plays an important role in person Re-ID and other fine-grained recognition tasks. An extensively utilized loss for metric learning is the triplet loss Hermans et al. (2017), which considers the distances of an anchor to a positive instance and a negative instance. Proxy-NCA Movshovitz-Attias et al. (2017) proposes to use proxies for the measurement of similarity and dissimilarity. A proxy, which represents a set of instances, can capture more contextual information; meanwhile, using proxies instead of data instances greatly reduces the number of triplets. Both advantages help metric learning gain better performance. Further, with the awareness of intra-class variance, Magnet Rippel et al. (2016), MaPML Qian et al. (2018), SoftTriple Qian et al. (2019), and GEORGE Sohoni et al. (2020) adopt multiple proxies to represent a single cluster, by which local structures are better represented. Our work is inspired by these studies. However, in contrast to setting a fixed number of proxies for each class or designing a complex adaptive strategy, we split a cluster into a variable number of proxies simply according to the involved camera views, making our proxies more suitable for the Re-ID task.

3 A Clustering-based Re-ID Baseline

We first set up a baseline model for the unsupervised Re-ID task. Following the common practice in clustering-based methods Fan et al. (2018); Lin et al. (2019); Zeng et al. (2020), our baseline learns a Re-ID model iteratively, alternating at each iteration between a clustering step and a model updating step. In contrast to these existing methods, we adopt a different strategy in the model updating step, making our baseline model more effective. The details are introduced as follows.

Given an unlabeled dataset $\mathcal{X} = \{x_1, x_2, \cdots, x_N\}$, where $x_i$ is the $i$-th image and $N$ is the image number, we build our Re-ID model upon a deep neural network $f_\theta$ with parameters $\theta$. The parameters are initialized by an ImageNet Krizhevsky et al. (2012)-pretrained model. When image $x_i$ is input, the network performs feature extraction and outputs feature $f_\theta(x_i)$. Then, at each iteration, we adopt DBSCAN Ester et al. (1996) to cluster the features of all images, and further select reliable clusters by leaving out isolated points. All images within each cluster are assigned the same pseudo identity label. By this means, we get a labeled dataset $\mathcal{X}' = \{(x_i, \tilde{y}_i)\}_{i=1}^{N'}$, in which $\tilde{y}_i \in \{1, \cdots, Y\}$ is a generated pseudo label, $N'$ is the number of images contained in the selected clusters, and $Y$ is the cluster number.
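To make the routine concrete, here is a minimal sketch of the clustering step in Python. It assumes cosine distance on L2-normalized features and scikit-learn's DBSCAN; the function name and the eps/min_samples values are our own placeholders (Section 5.1 notes that the actual clustering builds a Jaccard distance from k-reciprocal nearest neighbors before running DBSCAN).

```python
# A minimal sketch of the clustering step, not the authors' code.
# Assumes cosine distance on L2-normalized features; the paper instead
# feeds DBSCAN a Jaccard distance built from k-reciprocal neighbors.
import numpy as np
from sklearn.cluster import DBSCAN

def generate_pseudo_labels(features, eps=0.6, min_samples=4):
    """features: (N, d) L2-normalized image features.
    Returns an (N,) label array; -1 marks isolated points left out of X'."""
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="cosine").fit_predict(features)

# Usage: images labeled -1 are discarded; the rest form the labeled set.
feats = np.random.randn(1000, 2048).astype(np.float32)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
pseudo = generate_pseudo_labels(feats)
Y = pseudo.max() + 1   # number of reliable clusters
```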

Once pseudo labels are generated, we adopt a non-parametric classifier Wu et al. (2018) for model updating. It is implemented via an external memory bank and a non-parametric Softmax loss. More specifically, we construct a memory bank $\mathcal{K} \in \mathbb{R}^{Y \times d}$, where $d$ is the feature dimension. During back-propagation, when the model parameters are updated by gradient descent, the memory bank is updated by

$\mathcal{K}[j] \leftarrow \mu \, \mathcal{K}[j] + (1 - \mu) \, f_\theta(x_i)$,   (1)

where $\mathcal{K}[j]$ is the $j$-th entry of the memory, storing the updated feature centroid of class $j$. Moreover, $x_i$ is an image belonging to class $j$ and $\mu \in [0, 1]$ is an updating rate.

Then, the non-parametric Softmax loss is defined by

$\mathcal{L}_{base} = -\log \frac{\exp(\mathcal{K}[\tilde{y}_i]^{\top} f_\theta(x_i) / \tau)}{\sum_{j=1}^{Y} \exp(\mathcal{K}[j]^{\top} f_\theta(x_i) / \tau)}$,   (2)

where $\tau$ is a temperature factor. This loss achieves classification via pulling an instance close to the centroid of its class while pushing it away from the centroids of all other classes. Such a non-parametric loss plays a key role in recent contrastive learning techniques Wu et al. (2018); Zhong et al. (2019); Chen et al. (2020); He et al. (2019), demonstrating a powerful ability in unsupervised feature learning.
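The memory bank and the loss can be sketched in PyTorch as follows; this is a schematic under the conventions of Eqs. (1) and (2), with the class name and the default μ and τ values chosen for illustration rather than taken from the released implementation.

```python
# A schematic memory-bank classifier implementing Eq. (1) and Eq. (2);
# names and default values are illustrative, not the released code.
import torch
import torch.nn.functional as F

class ClassMemory:
    def __init__(self, num_classes, dim, mu=0.2, tau=0.07):
        # One L2-normalized centroid per pseudo identity class.
        self.bank = F.normalize(torch.randn(num_classes, dim), dim=1)
        self.mu, self.tau = mu, tau

    @torch.no_grad()
    def update(self, feats, labels):
        # Eq. (1): EMA update of each class centroid, then re-normalize.
        for f, y in zip(feats, labels):
            self.bank[y] = self.mu * self.bank[y] + (1 - self.mu) * f
            self.bank[y] = F.normalize(self.bank[y], dim=0)

    def loss(self, feats, labels):
        # Eq. (2): non-parametric Softmax over class centroids;
        # gradients flow to the features, not the memory.
        logits = feats @ self.bank.t() / self.tau   # (B, Y)
        return F.cross_entropy(logits, labels)
```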

4 The Camera-aware Proxy Assisted Method

Like previous clustering-based methods Fan et al. (2018); Lin et al. (2019); Zeng et al. (2020); Zhai et al. (2020), the above-mentioned baseline model conducts clustering in a camera-agnostic way. This may maintain the similarity within each identity class, but neglects the intra-ID variance. Considering that the most severe intra-ID variance is caused by the change of camera views, we split each single class into multiple camera-specific proxies, each representing the instances coming from the same camera. The obtained camera-aware proxies not only capture the variance within classes, but also enable us to divide the model updating step into intra- and inter-camera learning parts. Such a divide-and-conquer strategy facilitates our model updating. The entire framework is illustrated in Figure 2, in which the modified clustering step and the improved model updating step are alternately iterated.

More specifically, at each iteration, we split the camera-agnostic clustering results into camera-aware proxies, and generate a new set of pseudo labels that are assigned in a per-camera manner. That is, the proxies within each camera view are independently labeled. It also means that two proxies split from the same cluster may be assigned two different labels. We denote the newly labeled dataset of the $c$-th camera by $\mathcal{X}_c = \{(x_i, \tilde{y}_i, l_i, c_i)\}_{i=1}^{N_c}$. Here, image $x_i$, which previously is annotated with a global pseudo label $\tilde{y}_i$, is additionally annotated with an intra-camera pseudo label $l_i \in \{1, \cdots, P_c\}$ and a camera label $c_i = c$. $N_c$ and $P_c$ are, respectively, the number of images and proxies in camera $c$, and $C$ is the number of cameras. Then, the entire labeled dataset is $\mathcal{X}' = \mathcal{X}_1 \cup \mathcal{X}_2 \cup \cdots \cup \mathcal{X}_C$.
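The split itself is a simple bookkeeping step. Below is a sketch, with our own function and variable names, that assigns every (cluster, camera) pair one proxy index together with a per-camera label; ordering the proxies camera by camera would recover the offset-based memory indexing used later in Eq. (3).

```python
# A sketch of splitting camera-agnostic clusters into camera-aware proxies.
from collections import defaultdict

def split_into_proxies(global_labels, cam_ids):
    """global_labels[i]: cluster id of image i (-1 = outlier);
    cam_ids[i]: camera id. Returns per-image proxy indices, per-camera
    labels l_i, and the proxy count P_c of every camera."""
    proxy_of, intra_of = {}, {}
    per_cam = defaultdict(int)                 # P_c counters
    proxy_ids, intra_labels = [], []
    for y, c in zip(global_labels, cam_ids):
        if y < 0:                              # outlier dropped by clustering
            proxy_ids.append(-1); intra_labels.append(-1); continue
        key = (y, c)                           # one proxy per (cluster, camera)
        if key not in proxy_of:
            proxy_of[key] = len(proxy_of)      # global proxy index
            intra_of[key] = per_cam[c]         # independent labeling per camera
            per_cam[c] += 1
        proxy_ids.append(proxy_of[key])
        intra_labels.append(intra_of[key])
    return proxy_ids, intra_labels, dict(per_cam)
```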

Consequently, we construct a proxy-level memory bank $\mathcal{K} \in \mathbb{R}^{P \times d}$, where $P = \sum_{c=1}^{C} P_c$ is the total number of proxies in all cameras. Each entry of the memory stores a proxy, which is updated by the same strategy as introduced in Eq. (1) but considers only the images belonging to that proxy. Based on the memory bank, we design an intra-camera contrastive learning loss $\mathcal{L}_{intra}$ that jointly learns per-camera non-parametric classifiers to gain discrimination ability within cameras. Meanwhile, we also design an inter-camera contrastive learning loss $\mathcal{L}_{inter}$, which considers both positive and hard negative proxies across cameras to boost the discrimination ability further.

4.1 The Intra-camera Contrastive Learning

With the per-camera pseudo labels, we can learn a classifier for each camera and jointly learn all the classifiers. This strategy has the following two advantages. First, the pseudo labels generated from the camera-aware proxies are more reliable than the global pseudo labels. It means that the model learning can suffer less from label noise and gain better intra-camera discrimination ability. Second, the feature extraction network shared in the joint learning is optimized to be discriminative in different cameras concurrently, which implicitly helps the Re-ID model to gain cross-camera discrimination ability.

Therefore, we learn one non-parametric classifier for each camera and jointly learn the $C$ classifiers for all cameras. To this end, we define the intra-camera contrastive learning loss as follows:

$\mathcal{L}_{intra} = -\sum_{c=1}^{C} \frac{1}{N_c} \sum_{i=1}^{N_c} \log \frac{\exp(\mathcal{K}[j]^{\top} f_\theta(x_i) / \tau)}{\sum_{p=A_c+1}^{A_c+P_c} \exp(\mathcal{K}[p]^{\top} f_\theta(x_i) / \tau)}$.   (3)

Here, given image $x_i$, together with its per-camera pseudo label $l_i$ and camera label $c_i = c$, we set $A_c$ to be the total proxy number accumulated from the first to the $(c-1)$-th camera, and $j = A_c + l_i$ to be the index of the corresponding entry in the memory. The term $1/N_c$ balances the varying numbers of images in different cameras.

This loss performs contrastive learning within cameras. As illustrated in Figure 3(a), it pulls an instance close to the proxy to which it belongs and pushes it away from all other proxies in the same camera.
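A mini-batch sketch of Eq. (3) in PyTorch follows, assuming the memory rows are grouped camera by camera so that camera c occupies rows [A_c, A_c + P_c); averaging over each camera's in-batch samples stands in for the 1/N_c balancing term. Names are illustrative.

```python
# A schematic mini-batch version of the intra-camera loss in Eq. (3).
import torch
import torch.nn.functional as F

def intra_camera_loss(feats, intra_labels, cams, memory,
                      offsets, sizes, tau=0.07):
    """feats: (B, d) normalized features; memory: (P, d) proxy bank;
    offsets[c] = A_c and sizes[c] = P_c for every camera c."""
    loss = feats.new_zeros(())
    for c in cams.unique().tolist():
        m = cams == c
        # Compare only against the proxies of the same camera.
        block = memory[offsets[c]: offsets[c] + sizes[c]]
        logits = feats[m] @ block.t() / tau
        loss = loss + F.cross_entropy(logits, intra_labels[m])
    return loss
```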

Figure 3: Illustration of intra- and inter-camera losses.

4.2 The Inter-camera Contrastive Learning

Although the intra-camera learning introduced above provides our model with considerable discrimination ability, the model is still weak at cross-camera discrimination. Therefore, we propose an inter-camera contrastive learning loss, which explicitly exploits correlations across cameras to boost the discrimination ability.

Specifically, given image $x_i$, we retrieve all positive proxies from different cameras, which share the same global pseudo label $\tilde{y}_i$. Besides, the $K$-nearest negative proxies in all cameras are taken as the hard negative proxies, which are crucial to deal with the similarity across identity classes. The inter-camera contrastive learning loss aims to pull an image close to all positive proxies while pushing it away from the mined hard negative proxies, as demonstrated in Figure 3(b). To this end, we define the loss as follows:

$\mathcal{L}_{inter} = -\frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \log \frac{\exp(\mathcal{K}[p]^{\top} f_\theta(x_i) / \tau)}{\sum_{k \in \mathcal{P} \cup \mathcal{Q}} \exp(\mathcal{K}[k]^{\top} f_\theta(x_i) / \tau)}$,   (4)

where $\mathcal{P}$ and $\mathcal{Q}$ denote the index sets of the positive and hard negative proxies of image $x_i$, respectively, and $|\mathcal{P}|$ is the cardinality of $\mathcal{P}$.
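For a single image, Eq. (4) can be sketched as follows: pos_idx collects the proxies sharing the image's global pseudo label, and the K most similar remaining proxies serve as the hard negatives Q. This is an illustrative sketch, not the authors' released code.

```python
# A schematic single-image version of the inter-camera loss in Eq. (4).
import torch

def inter_camera_loss(feat, pos_idx, memory, k=50, tau=0.07):
    """feat: (d,) normalized feature; pos_idx: LongTensor with the indices
    of the positive proxies P; memory: (P_total, d) proxy bank."""
    sims = memory @ feat / tau                       # similarity to every proxy
    neg_mask = torch.ones(len(memory), dtype=torch.bool)
    neg_mask[pos_idx] = False
    # The K nearest negatives form the hard-negative set Q.
    hard_neg = sims[neg_mask].topk(min(k, int(neg_mask.sum()))).values
    log_denom = torch.cat([sims[pos_idx], hard_neg]).logsumexp(dim=0)
    # Average the log-ratio over all positives (the 1/|P| factor).
    return -(sims[pos_idx] - log_denom).mean()
```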

4.3 A Summary of the Algorithm

The proposed approach iteratively alternates between the camera-aware proxy clustering step and the intra- and inter-camera learning step. The entire loss for model learning is

$\mathcal{L} = \mathcal{L}_{intra} + \lambda \mathcal{L}_{inter}$,   (5)

where $\lambda$ is a parameter to balance the two terms. We summarize the whole procedure in Algorithm 1.

Input: An unlabeled training set $\mathcal{X}$, a DNN model $f_\theta$, the iteration number num_iters, the number of training batches num_batches, momentum $\mu$, and temperature $\tau$;
Output: Trained model $f_\theta$;

1:  for iter = 1 to num_iters do
2:      Perform a global clustering and remove outliers;
3:      Split clusters into camera-aware proxies, and generate the per-camera pseudo-labeled dataset $\mathcal{X}'$;
4:      Construct a proxy-level memory bank $\mathcal{K}$;
5:      for b = 1 to num_batches do
6:          Sample mini-batch images with the proxy-balanced sampling strategy;
7:          Forward to extract the features of the samples;
8:          Compute the loss in Eq. (5);
9:          Backward to update model $f_\theta$;
10:         Update proxy entries in the memory with the sample features;

Algorithm 1: Camera-aware Proxy Assisted Learning

A proxy-balanced sampling strategy. A mini-batch in Algorithm 1 involves an update to the Re-ID model using a small set of samples, and it is not trivial to choose appropriate samples for each batch. A traditional random sampling strategy may be overwhelmed by identities having more images than the others. Class-balanced sampling, which randomly chooses a fixed number of classes and a fixed number of samples per class as in Hermans et al. (2017), tends to sample an identity more frequently from image-rich cameras, causing ineffective learning for image-deficient cameras. To make samples more effective, we propose a proxy-balanced sampling strategy: in each mini-batch, we choose a fixed number of proxies and an equal number of samples per proxy. This strategy performs balanced optimization over all camera-aware proxies and enhances the learning of rare proxies, thus promoting the learning efficacy.
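A minimal sketch of such a sampler follows; the numbers of proxies and instances per batch are placeholders, and sampling with replacement is one simple way to let image-poor proxies fill their quota.

```python
# A minimal proxy-balanced batch sampler; parameter values are placeholders.
import random
from collections import defaultdict

def proxy_balanced_batches(proxy_ids, num_proxies=8, num_instances=4):
    """proxy_ids[i]: proxy index of image i (-1 for discarded outliers).
    Yields lists of image indices, one mini-batch at a time."""
    images_of = defaultdict(list)
    for i, p in enumerate(proxy_ids):
        if p >= 0:
            images_of[p].append(i)
    proxies = list(images_of)
    random.shuffle(proxies)                  # uniform over proxies
    for s in range(0, len(proxies) - num_proxies + 1, num_proxies):
        batch = []
        for p in proxies[s:s + num_proxies]:
            # Sample with replacement so rare proxies still fill their quota.
            batch += random.choices(images_of[p], k=num_instances)
        yield batch
```

Because every proxy is drawn with equal probability regardless of its image count, image-deficient cameras contribute to each epoch as often as image-rich ones.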

5 Experiments

5.1 Experiment Setting

Datasets and metrics.

We evaluate the proposed method on three large-scale datasets: Market-1501 Zheng et al. (2015), DukeMTMC-reID Zheng et al. (2017), and MSMT17 Wei et al. (2018a).

Market-1501 Zheng et al. (2015) contains 32,668 images of 1,501 identities captured by 6 disjoint cameras. It is split into three sets. The training set has 12,936 images of 751 identities, the query set has 3,368 images of 750 identities, and the gallery set contains 19,732 images of 750 identities.

DukeMTMC-reID Zheng et al. (2017) is a subset of DukeMTMC Ristani et al. (2016). It contains 36,411 images of 1,812 identities captured by 8 cameras. Among them, 702 identities are used for training and the remaining identities are for testing.

MSMT17 Wei et al. (2018a) is the largest and most challenging dataset. It has 126,411 images of 4,101 identities captured by 15 cameras, covering both indoor and outdoor scenarios. 32,621 images of 1,041 identities are used for training; the rest, including 82,621 gallery images and 11,659 query images, are used for testing.

Performance is evaluated by the Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP), as is common practice. For the CMC measurement, we report Rank-1, Rank-5, and Rank-10. Note that no post-processing techniques such as re-ranking Zhong et al. (2017) are used in our evaluation.

Implementation details.

We adopt an ImageNet-pretrained ResNet-50 He et al. (2016) as the network backbone. Based upon it, we remove the fully-connected classification layer and add a Batch Normalization (BN) layer after the Global Average Pooling (GAP) layer. The L2-normalized feature is used for updating the proxies in the memory during training, and also for the distance ranking during inference. The memory updating rate $\mu$, the temperature factor $\tau$, the number $K$ of hard negative proxies, and the balancing factor $\lambda$ in Eq. (5) are all set empirically. At the beginning of each epoch (i.e., iteration), we compute the Jaccard distance with k-reciprocal nearest neighbors Zhong et al. (2017) and use DBSCAN Ester et al. (1996) with a fixed distance threshold for the camera-agnostic global clustering. During training, only the intra-camera loss is used in the first 5 epochs; in the remaining epochs, the intra- and inter-camera losses work together. The initial learning rate follows a warmup scheme in the first 10 epochs and is then decayed step-wise. Each training batch consists of images randomly sampled from a fixed number of proxies, with an equal number of images per proxy. Random flipping, cropping, and erasing are applied as data augmentation.
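The loss schedule above reduces to a one-line rule; a sketch with the 5-epoch switch from the text and the balance factor λ (lam) left as a free parameter:

```python
# Sketch of the training loss schedule: intra-camera loss only for the
# first 5 epochs, then the combined objective of Eq. (5).
def total_loss(l_intra, l_inter, epoch, lam):
    return l_intra if epoch < 5 else l_intra + lam * l_inter
```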

Figure 4: T-SNE visualization of features extracted by the models of Baseline, CAP2, and CAP6, shown from left to right in the upper row. Typical examples of IDs #4-7 are shown at the bottom.

5.2 Ablation Studies

In this subsection, we investigate the effectiveness of the proposed method by examining the intra- and inter-camera learning components, together with the proxy-balanced sampling strategy. For reference, we first present the results of the baseline model introduced in Section 3, as shown in Table 1. Then, we examine six variants of the proposed camera-aware proxy (CAP) assisted model, referred to as CAP1-6.

Models   L_intra L_inter PBsampling |  Market-1501          |  DukeMTMC-ReID        |  MSMT17
                                    |  R1   R5   R10  mAP   |  R1   R5   R10  mAP   |  R1   R5   R10  mAP
Baseline                            |  79.7 88.3 91.2 62.9  |  74.3 82.7 86.0 57.5  |  34.0 43.7 49.0 13.7
CAP1        ✓                       |  78.7 89.3 92.9 58.9  |  74.0 83.7 86.6 57.0  |  48.6 61.7 67.1 23.0
CAP2        ✓               ✓       |  82.3 91.7 94.1 64.6  |  76.5 86.4 89.8 60.9  |  51.3 64.0 69.4 24.8
CAP3                ✓               |  89.8 95.4 97.1 75.1  |  76.7 84.8 86.8 59.9  |  66.3 76.5 80.0 34.0
CAP4                ✓       ✓       |  91.1 96.3 97.4 79.9  |  78.0 85.6 87.9 61.6  |  66.9 77.4 80.7 35.3
CAP5        ✓       ✓               |  89.5 94.9 96.4 75.9  |  79.1 87.8 89.9 64.5  |  66.7 76.9 80.5 35.1
CAP6        ✓       ✓       ✓       |  91.4 96.3 97.7 79.2  |  81.1 89.3 91.8 67.3  |  67.4 78.0 81.4 36.9
Table 1: Comparison of the proposed method and its variants. L_intra refers to the intra-camera loss, L_inter is the inter-camera loss, and PBsampling is the proxy-balanced sampling strategy. When PBsampling is not selected, the model uses the class-balanced sampling strategy.

Compared with the baseline model, the proposed full model (CAP6) significantly boosts the performance on all three datasets. The full model gains 11.7% Rank-1 and 16.3% mAP improvements on Market-1501, and 6.8% Rank-1 and 9.8% mAP improvements on DukeMTMC-ReID. Moreover, it dramatically boosts the performance on MSMT17, achieving 33.4% Rank-1 and 23.2% mAP improvements over the baseline. The MSMT17 dataset is far more challenging than the other two datasets, containing complex scenarios and appearance variations. The superior performance on MSMT17 shows that our full model gains an outstanding ability to deal with severe intra-ID variance. In the following, we take a close look at each component.

Effectiveness of the intra-camera learning. Compared with the baseline model, the intra-camera learning benefits from two aspects: 1) each intra-camera Re-ID task is easier than the global counterpart because it involves fewer IDs and smaller intra-ID variance; 2) the intra-camera learning suffers less from label noise since the per-camera pseudo labels are more reliable. These advantages enable the intra-camera learning to achieve promising performance. As shown in Table 1, the CAP1 model, which only employs the intra-camera loss, performs comparably to the baseline. When adopting the proxy-balanced sampling strategy, the CAP2 model outperforms the baseline on all datasets. In addition, we observe that the performance drops when the intra-camera loss is removed from the full model (CAP4 vs. CAP6), validating the necessity of this component.

Effectiveness of the inter-camera learning. Complementary to the intra-camera learning, the inter-camera learning improves the Re-ID model by explicitly exploiting correlations across cameras. It not only deals with the intra-ID variance by pulling positive proxies together, but also tackles the inter-ID similarity problem by pushing hard negative proxies away. With this component, CAP5 and CAP6 significantly boost the performance over CAP1 and CAP2, respectively. In addition, we find that the inter-camera loss alone (CAP3) already produces decent performance, and adding the intra-camera loss or the sampling strategy boosts it further.

Effectiveness of the proxy-balanced sampling strategy. The proxy-balanced sampling strategy is proposed to balance the varying numbers of images contained in different proxies. To show that it is indeed helpful, we compare it with the extensively used class-balanced strategy, which ignores camera information. Table 1 shows that the models using our sampling strategy (CAP2, CAP4, and CAP6) are superior to their counterparts, validating the effectiveness of this strategy.

Methods | Reference | Market-1501: R1 R5 R10 mAP | DukeMTMC-ReID: R1 R5 R10 mAP | MSMT17: R1 R5 R10 mAP
Purely Unsupervised
BUC Lin et al. (2019) AAAI19 66.2 79.6 84.5 38.3 47.4 62.6 68.4 27.5 - - - -
UGA Wu et al. (2019b) ICCV19 87.2 - - 70.3 75.0 - - 53.3 49.5 - - 21.7
SSL Lin et al. (2020) CVPR20 71.7 83.8 87.4 37.8 52.5 63.5 68.9 28.6 - - - -
MMCL† Wang and Zhang (2020) CVPR20 80.3 89.4 92.3 45.5 65.2 75.9 80.0 40.2 35.4 44.8 49.8 11.2
HCT Zeng et al. (2020) CVPR20 80.0 91.6 95.2 56.4 69.6 83.4 87.4 50.7 - - - -
CycAs Wang et al. (2020b) ECCV20 84.8 - - 64.8 77.9 - - 60.1 50.1 - - 26.7
SpCL† Ge et al. (2020b) NeurIPS20 88.1 95.1 97.0 73.1 - - - - 42.3 55.6 61.2 19.1
CAP This paper 91.4 96.3 97.7 79.2 81.1 89.3 91.8 67.3 67.4 78.0 81.4 36.9
Unsupervised Domain Adaptation
PUL Fan et al. (2018) TOMM18 45.5 60.7 66.7 20.5 30.0 43.4 48.5 16.4 - - - -
SPGAN Deng et al. (2018b) CVPR18 51.5 70.1 76.8 22.8 41.1 56.6 63.0 22.3 - - - -
ECN Zhong et al. (2019) CVPR19 75.1 87.6 91.6 43.0 63.3 75.8 80.4 40.4 30.2 41.5 46.8 10.2
pMR Wang et al. (2020a) CVPR20 83.0 91.8 94.1 59.8 74.5 85.3 88.7 55.8 - - - -
MMCL Wang and Zhang (2020) CVPR20 84.4 92.8 95.0 60.4 72.4 82.9 85.0 51.4 43.6 54.3 58.9 16.2
AD-Cluster Zhai et al. (2020) CVPR20 86.7 94.4 96.5 68.3 72.6 82.5 85.5 54.1 - - - -
MMT Ge et al. (2020a) ICLR20 87.7 94.9 96.9 71.2 78.0 88.8 92.5 65.1 50.1 63.9 69.8 23.3
SpCL Ge et al. (2020b) NeurIPS20 90.3 96.2 97.7 76.7 82.9 90.1 92.5 68.8 53.1 65.8 70.5 26.5
Fully Supervised
PCB Sun et al. (2018) ECCV18 93.8 - - 81.6 83.3 - - 69.2 68.2 - - 40.4
ABD-Net Chen et al. (2019) ICCV19 95.6 - - 88.3 89.0 - - 78.6 82.3 90.6 - 60.8
CAP’s Upper Bound This paper 93.3 97.5 98.4 85.1 87.7 93.7 95.4 76.0 77.1 87.4 90.8 53.7
Table 2: Comparison with state-of-the-art methods. Both purely unsupervised and UDA-based methods are included, and several fully supervised methods are provided for reference. The first and second best results among all unsupervised methods are, respectively, marked in red and blue. † indicates a UDA-based method working under the purely unsupervised setting.

Visualization of learned feature representations. To investigate how each learning component behaves, we utilize t-SNE van der Maaten and Hinton (2008) to visualize the feature representations learned by the baseline model, the intra-camera learned model CAP2, and the full model CAP6. Figure 4 presents the image features of 10 IDs taken from MSMT17. From the figure we observe that the baseline model fails to distinguish several pairs of IDs. In contrast, the CAP2 model, which conducts the intra-camera learning only, separates these pairs better. With the additional inter-camera learning component, the full model can distinguish most of the IDs by greatly improving the intra-ID compactness and inter-ID separability, although it may still fail in some tough cases.

5.3 Comparison with State-of-the-Arts

In this section, we compare the proposed method (named CAP) with state-of-the-art methods. The comparison results are summarized in Table 2.

Comparison with purely unsupervised methods. Five recent purely unsupervised methods are included for comparison: BUC Lin et al. (2019), UGA Wu et al. (2019b), SSL Lin et al. (2020), HCT Zeng et al. (2020), and CycAs Wang et al. (2020b). Both BUC and HCT are clustering-based, sharing the same technique with ours. Additionally, we also compare with MMCL Wang and Zhang (2020) and SpCL Ge et al. (2020b), two UDA-based methods working under the purely unsupervised setting. From the table, we observe that our proposed method outperforms all state-of-the-art counterparts by a great margin. For instance, compared with the second-place method on each dataset, our approach obtains 3.3% Rank-1 and 6.1% mAP gain on Market, 3.2% Rank-1 and 7.2% mAP gain on Duke, and 17.3% Rank-1 and 10.2% mAP gain on MSMT17.

Comparison with UDA-based methods. Recent unsupervised works focus more on UDA techniques that exploit external labeled data to boost performance. Table 2 presents eight UDA methods. Surprisingly, without using any labeled information, our approach outperforms seven of them on both Market and Duke, and is on par with SpCL. On the challenging MSMT17 dataset, our approach surpasses all methods by a great margin, achieving 14.3% Rank-1 and 10.4% mAP gain when compared to SpCL.

Comparison with fully supervised methods. Finally, we provide two fully supervised methods for reference: the well-known PCB Sun et al. (2018) and the state-of-the-art ABD-Net Chen et al. (2019). We also report the performance of our network backbone trained with ground-truth labels, which indicates the upper bound of our approach. We observe that our unsupervised model (CAP) greatly narrows the gap with PCB on all three datasets. Besides, there is still room for improvement if we could strengthen our backbone by integrating recent attention-based techniques like ABD-Net.

6 Conclusion

In this paper, we have presented a novel camera-aware proxy assisted learning method for the purely unsupervised person Re-ID task. Our method is able to deal with the large intra-ID variance resulting from the change of camera views, which is crucial for a Re-ID model to improve performance. With the assistance of camera-aware proxies, the proposed intra- and inter-camera learning components effectively improve ID discrimination within and across cameras, as validated by experiments on three large-scale datasets. Comparisons with both purely unsupervised and UDA-based methods demonstrate the superiority of our method.

References

  • T. Chen, S. Ding, J. Xie, Y. Yuan, W. Chen, Y. Yang, Z. Ren, and Z. Wang (2019) ABD-net: attentive but diverse person re-identification. In ICCV, Cited by: §1, §5.3, Table 2.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §1, §3.
  • Y. Chen, X. Zhu, and S. Gong (2018) Deep association learning for unsupervised video person re-identification. In BMVC, Cited by: §1, §2.1.
  • W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao (2018a) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In CVPR, Cited by: §2.1.
  • W. Deng, L. Zheng, Q. Ye, Y. Yang, and J. Jiao (2018b) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person reidentification. In CVPR, Cited by: Table 2.
  • M. Ester, H. Kriegel, J. Sander, X. Xu, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Cited by: §3, §5.1.
  • H. Fan, L. Zheng, C. Yan, and Y. Yang (2018) Unsupervised person re-identification: clustering and fine-tuning. ACM TOMM. Cited by: §1, §2.1, §3, §4, Table 2.
  • Y. Ge, D. Chen, and H. Li (2020a) Mutual mean-teaching: pseudo label refinery for unsupervised domain adaptation on person re-identification. In ICLR, Cited by: Table 2.
  • Y. Ge, D. Chen, F. Zhu, R. Zhao, and H. Li (2020b) Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. In NeurIPS, Cited by: §2.1, §5.3, Table 2.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §5.1.
  • A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §2.3, §4.3.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: §1, §3.
  • M. Li, X. Zhu, and S. Gong (2018) Unsupervised person re-identification by deep learning tracklet association. In ECCV, Cited by: §1, §1, §2.1.
  • Y. Lin, X. Dong, L. Zheng, Y. Yan, and Y. Yang (2019) A bottom-up clustering approach to unsupervised person re-identification. In AAAI, Cited by: §1, §1, §1, §2.1, §3, §4, §5.3, Table 2.
  • Y. Lin, L. Xie, Y. Wu, C. Yan, and Q. Tian (2020) Unsupervised person re-identification via softened similarity learning. In CVPR, Cited by: §5.3, Table 2.
  • J. Liu, Z. Zha, D. Chen, R. Hong, and M. Wang (2019) Adaptive transfer network for cross-domain person re-identification. In CVPR, Cited by: §2.1.
  • Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In ICCV, Cited by: §2.3.
  • L. Qi, L. Wang, J. Huo, Y. Shi, and Y. Gao (2019a) Adversarial camera alignment network for unsupervised cross-camera person re-identification. arXiv preprint arXiv:1908.00862. Cited by: §2.2.
  • L. Qi, L. Wang, J. Huo, Y. Shi, and Y. Gao (2020) Progressive cross-camera soft-label learning for semi-supervised person re-identification. IEEE TCSVT. Cited by: §2.2.
  • L. Qi, L. Wang, J. Huo, L. Zhou, Y. Shi, and Y. Gao (2019b) A novel unsupervised camera-aware domain adaptation framework for person re-identification. In ICCV, Cited by: §1, §2.1.
  • Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin (2019) SoftTriple loss: deep metric learning without triplet sampling. In ICCV, Cited by: §2.3.
  • Q. Qian, J. Tang, H. Li, S. Zhu, and R. Jin (2018) Large-scale distance metric learning with uncertainty. In CVPR, Cited by: §2.3.
  • O. Rippel, M. Paluri, P. Dollar, and L. Bourdev (2016) Metric learning with adaptive density discrimination. In ICLR, Cited by: §2.3.
  • E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, Cited by: §5.1.
  • N. Sohoni, J. Dunnmon, G. Angus, A. Gu, and C. Ré (2020) No subclass left behind: fine-grained robustness in coarse-grained classification problems. In NeurIPS, Cited by: §2.3.
  • Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, Cited by: §1, §5.3, Table 2.
  • L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. JMLR. Cited by: Figure 1, §5.2.
  • D. Wang and S. Zhang (2020) Unsupervised person re-identification via multi-label classification. In CVPR, Cited by: §5.3, Table 2.
  • G. Wang, J. Lai, W. Liang, and G. Wang (2020a) Smoothing adversarial domain attack and p-memory reconsolidation for cross-domain person re-identification. In CVPR, Cited by: Table 2.
  • M. Wang, B. Lai, H. Chen, J. Huang, X. Gong, and X. Hua (2021) Towards precise intra-camera supervised person re-identification. In WACV, Cited by: §2.2.
  • Z. Wang, J. Zhang, L. Zheng, Y. Liu, Y. Sun, Y. Li, and S. Wang (2020b) CycAs: self-supervised cycle association for learning re-identifiable descriptions. In ECCV, Cited by: §5.3, Table 2.
  • L. Wei, S. Zhang, W. Gao, and Q. Tian (2018a) Person transfer gan to bridge domain gap for person re-identification. In CVPR, Cited by: 3rd item, §5.1, §5.1.
  • L. Wei, S. Zhang, W. Gao, and Q. Tian (2018b) Person transfer gan to bridge domain gap for person re-identification. In CVPR, Cited by: §1, §2.1.
  • A. Wu, W. Zheng, and J. Lai (2019a) Unsupervised person re-identification by camera-aware similarity consistency learning. In ICCV, Cited by: §2.1.
  • J. Wu, Y. Yang, H. Liu, S. Liao, Z. Lei, and S. Z. Li (2019b) Unsupervised graph association for person re-identification. In ICCV, Cited by: §1, §1, §2.1, §5.3, Table 2.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR, Cited by: §3, §3.
  • M. Ye, A. J. Ma, L. Zheng, J. Li, and P. C. Yuen (2017) Dynamic label graph matching for unsupervised video re-identification. In ICCV, Cited by: §1, §2.1.
  • K. Zeng, M. Ning, Y. Wang, and Y. Guo (2020) Hierarchical clustering with hard-batch triplet loss for person re-identification. In CVPR, Cited by: §1, §1, §2.1, §3, §4, §5.3, Table 2.
  • Y. Zhai, S. Lu, Q. Ye, X. Shan, J. Chen, R. Ji, and Y. Tian (2020) Ad-cluster: augmented discriminative clustering for domain adaptive person re-identification. In CVPR, Cited by: §1, §2.1, §4, Table 2.
  • L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In ICCV, Cited by: 3rd item, §1, §5.1, §5.1.
  • Z. Zheng, L. Zheng, and Y. Yang (2017) Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In ICCV, Cited by: 3rd item, §5.1, §5.1.
  • Z. Zhong, L. Zheng, and S. Li (2017) Re-ranking person re-identification with k-reciprocal encoding. In CVPR, Cited by: §5.1, §5.1.
  • Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang (2019) Invariance matters: exemplar memory for domain adaptive person re-identification. In CVPR, Cited by: §1, §2.1, §3, Table 2.
  • X. Zhu, X. Zhu, M. Li, P. Morerio, V. Murino, and S. Gong (2020) Intra-camera supervised person re-identification. arXiv preprint arXiv:2002.05046. Cited by: §2.2.
  • X. Zhu, X. Zhu, M. Li, V. Murino, and S. Gong (2019) Intra-camera supervised person re-identification: a new benchmark. In ICCVW, Cited by: §2.2.