Person re-identification (Re-ID) aims to identify the same person across different camera views and has been used extensively in large-scale surveillance systems. Although great progress has been made in supervised person Re-ID, its reliance on extensive manual annotation greatly constrains its application. In contrast, collecting pedestrian images without annotation is much cheaper and easier. Increasing research attention has therefore been drawn to unsupervised person Re-ID, which learns directly from unlabeled data and is thus more scalable and better suited to real-world deployment.
Existing unsupervised person Re-ID methods can be broadly divided into two categories: unsupervised domain adaptation (UDA) methods and purely unsupervised methods. Methods of the first type assume a fully annotated source-domain dataset and an unlabeled target-domain dataset. Most UDA-based methods address the task by learning knowledge from the labeled source domain and transferring it to the unlabeled target domain [32, 1, 8]. Methods of the second type are pseudo-label-based and fully unsupervised: they learn directly from unlabeled data in the target domain and use the learned feature representations to estimate pseudo labels [23, 29, 9]. This setting requires no annotations and is more challenging. Existing fully unsupervised Re-ID works mainly exploit pseudo labels obtained from clustering and apply contrastive learning, which has shown excellent performance in unsupervised representation learning [27, 3, 11].
The performance of unsupervised methods relies on feature representation learning. A recent state-of-the-art method uses a memory bank to store all instance features, treats each image as an individual class, and learns the representation by matching features of the same instance across different augmented views. However, each class in Re-ID datasets usually contains more than one positive instance. SpCL alleviates this problem by matching an instance with the centroid of its multiple positives. To further ensure that each positive converges to its centroid at a uniform pace, cluster contrast learning updates the memory dictionary and computes the contrastive loss at the cluster level.
Although cluster contrast learning has achieved impressive performance, applying contrastive learning only at the cluster level ignores the relationship between hard instances at the instance level. Previous works in deep metric learning have focused on hard sample mining to place more emphasis on hard samples inside a class. These methods aim to separate samples from different categories and bring samples from the same category closer together. However, they usually adopt a mini-batch-based deep metric loss, such as hard triplet loss or multi-similarity loss, which uses only a small portion of the data and does not consider the information of all categories.
To learn discriminative feature representations for Re-ID and to exploit the information in hard samples more fully, this paper introduces a novel hard-sample mining strategy and proposes a simple and effective method, hard-sample guided hybrid contrast learning (HHCL), for unsupervised Re-ID. In summary, this paper makes the following contributions:
We propose a hybrid contrast learning framework for unsupervised person Re-ID which combines both cluster-level contrastive loss and instance-level contrastive loss.
We introduce a novel hard instance mining strategy, which is based on an instance memory bank, to explore more discriminative information by selecting global hard samples online for each input instance.
Extensive experiments on two popular large-scale Re-ID benchmarks demonstrate that our HHCL outperforms previous state-of-the-art methods and significantly improves the performance of unsupervised person Re-ID.
2 Related Works
2.1 Unsupervised Re-ID
The domain adaptation strategy has been widely used for unsupervised person Re-ID [1, 8]. Transfer-based methods follow the UDA strategy: they either use a model pre-trained on the labeled source-domain dataset to initialize training on the target domain, or use style transfer to translate labeled images into the target domain. However, UDA can be very challenging when the categories in the two domains differ substantially. If the domains are not similar enough, it is difficult to obtain high-quality pseudo labels, and the resulting label noise can be high enough to hurt performance.
More recently, researchers have paid more attention to pseudo-label-based methods that do not require source-domain data. The pseudo labels can be generated by a pre-trained classifier or by a clustering algorithm based on feature similarity, such as K-means or DBSCAN. The pseudo labels are then used to fine-tune the Re-ID model in a supervised manner. HCT combined hierarchical clustering with hard-batch triplet loss to improve the quality of pseudo labels. MMCL formulated unsupervised person Re-ID as a multi-label classification task to progressively seek true labels. SpCL adopted a self-paced contrastive learning strategy to form more reliable clusters. CACL designed an asymmetric contrastive learning framework to help the siamese network effectively mine the invariance in feature learning.
2.2 Mining Schemes
Sampling is a fundamental operation for reducing bias during model learning. Random sampling is one of the commonly used approaches, and different sampling methods are proposed to facilitate the learning of various loss functions. For the person re-ID task, identity sampling is widely used during the training stage, such as pair-wise sampling for contrastive loss and semi-hard negative mining method for triplet loss.
Hard sample mining is considered a vital component of many deep metric learning algorithms, used to accelerate network convergence or to improve the final discriminative ability of the network, because hard samples are more informative for training. Training should therefore focus more on hard samples than on easy ones. However, existing hard mining schemes in deep metric learning operate on mini-batch training data and often suffer from slow convergence: in each update they use only one or a few negative examples from the mini-batch and do not interact with negative classes that were not sampled into the current mini-batch. In this paper, we propose a new strategy that selects global hard samples from a memory bank for each input feature to improve model performance. Our hard mining strategy considers the relationship between each query instance and the clusters of other pseudo labels, rather than only the inter-instance relationship with a small fraction of the categories.
Given an unlabeled training set of image samples, the goal is to learn an encoder that extracts features from input images. For inference, this encoder is applied to the gallery set and the query set. The gallery set contains the full collection of retrieval images in the database, and the representations of the query images are used to search the gallery for the most similar matches according to the Euclidean distance between query and gallery embeddings, where a smaller distance implies greater similarity between the images. Feature representations of the same person should therefore be as close as possible.
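The retrieval step described above can be illustrated with a minimal sketch. It assumes that the encoder has already produced feature vectors; the function names `euclidean` and `rank_gallery` and the toy two-dimensional features are hypothetical and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_gallery(query, gallery):
    """Return gallery indices ordered from smallest to largest distance."""
    return sorted(range(len(gallery)), key=lambda i: euclidean(query, gallery[i]))


# Toy features standing in for encoder outputs (illustrative values only).
query_feature = [1.0, 0.0]
gallery_features = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(rank_gallery(query_feature, gallery_features))
```

The first index in the returned list is the gallery image whose embedding is closest to the query, which is the top-ranked match.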
As shown in Fig. 2, our hybrid contrast learning framework for fully unsupervised Re-ID consists of two main components: Cluster Centroid Contrastive Loss (CCCL) and Hard Instance Contrastive Loss (HICL).
4.2 Hybrid Contrast Learning
To increase intra-class compactness and inter-class separability, state-of-the-art contrastive learning methods minimize the distance between samples of the same category and maximize the distance between samples of different categories with the InfoNCE loss. In this loss, an encoded query is contrasted against a set of candidate features, the positive feature is the candidate that has the same label as the query, and a temperature hyper-parameter controls the scale of the similarities.
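As an illustration, the InfoNCE loss for a single query can be sketched as the negative log of a softmax over temperature-scaled similarities. The sketch assumes dot-product similarity on features that are already normalized; the function names and the default temperature of 0.05 are placeholders chosen for the example, not values stated in this paper.

```python
import math


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, candidates, positive_index, temperature=0.05):
    """InfoNCE loss for one query against a list of candidate features.

    candidates[positive_index] is the positive; all others act as negatives.
    """
    logits = [dot(query, k) / temperature for k in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_partition = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_partition - logits[positive_index]


query = [1.0, 0.0]
candidates = [[0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
print(info_nce(query, candidates, positive_index=0))
```

The loss is small when the positive is the candidate most similar to the query and grows when a negative is more similar than the positive.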
Approaches based on a memory dictionary differ in their non-parametric loss functions. SSL considers each image as an individual instance and both computes the loss and updates the memory dictionary at the instance level, so the features of all training data must be stored. To decrease memory usage and take full advantage of clustering outliers, SpCL computes the loss at the cluster level but updates the memory dictionary at the instance level. However, the updating progress of each cluster is then inconsistent because of varying cluster sizes and the randomness of sampling. ClusterNCE loss both updates the feature vectors and computes the loss at the cluster level. Although ClusterNCE needs storage for only as many features as there are clusters, a single feature vector is not enough to represent a cluster. The averaged momentum representation calculated from all instances belonging to one cluster may lose intra-class diversity, while updating the cluster representation with only a single instance feature would introduce more bias because of the noisy pseudo labels generated by unsupervised clustering.
Thus, we propose a new unsupervised Re-ID framework that combines a cluster-level loss with an instance-level loss. The overall loss function of our method, Eq. (2), is a weighted combination of the two terms, in which a balancing factor, set to 0.5 by default, controls their relative weight. In the following, we detail the terms of this objective function.
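The role of the balancing factor can be made concrete with a small sketch. The convex-combination form used here is an assumption inferred from the ablation study, which states that a factor of 0 leaves only the hard instance contrastive loss and a factor of 1 leaves only the cluster-level loss; the function name `hybrid_loss` and its argument names are hypothetical.

```python
def hybrid_loss(cluster_loss, instance_loss, balance=0.5):
    """Weighted combination of the cluster-level and instance-level losses.

    Assumed form: balance * cluster_loss + (1 - balance) * instance_loss,
    so balance=0 keeps only the instance term and balance=1 only the cluster term.
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie in [0, 1]")
    return balance * cluster_loss + (1.0 - balance) * instance_loss


# With the default factor of 0.5, both terms contribute equally.
print(hybrid_loss(cluster_loss=2.0, instance_loss=4.0))
```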
Cluster Centroid Contrastive Loss. Some instance-level memory dictionary techniques, such as [22, 8], maintain the feature of every instance in the dataset and update the corresponding memory entries with the instance features in each mini-batch. These techniques suffer from inconsistent memory updating, since different instances within the same cluster are in different updating states. In every training iteration, because cluster sizes are unbalanced, a smaller cluster can have a higher proportion of its instances updated than a larger cluster. Unlike these instance-level memory dictionaries, we use a cluster-level memory dictionary that keeps one feature per cluster instead of preserving every instance feature. The corresponding memory entry is updated regardless of whether the cluster is large or small, which ensures updating consistency of features within the same cluster.
In this loss, the number of clusters is that of the current training epoch, and a temperature hyper-parameter scales the similarities. Unlike the unified contrastive loss, outliers are discarded.
We calculate the cluster centroids and store them in a memory bank for the cluster centroid contrastive loss. The cluster memory bank is updated using, for each class, the average of the instance features of that class in the mini-batch.
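A minimal sketch of such a cluster memory update is given below. The text states only that the update uses the per-class average of the mini-batch features; the momentum blending, the re-normalization step, the default momentum of 0.2, and the convention that outliers carry a label absent from the memory are all assumptions made for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def update_cluster_memory(memory, batch_features, batch_labels, momentum=0.2):
    """Blend each stored centroid with the mini-batch mean of its class.

    memory maps a pseudo label to its stored centroid. Labels that are not in
    memory (for example clustering outliers) are skipped.
    """
    grouped = {}
    for feature, label in zip(batch_features, batch_labels):
        grouped.setdefault(label, []).append(feature)
    for label, features in grouped.items():
        if label not in memory:
            continue
        dim = len(features[0])
        mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
        blended = [momentum * c + (1.0 - momentum) * m
                   for c, m in zip(memory[label], mean)]
        memory[label] = l2_normalize(blended)
    return memory


memory = {0: [1.0, 0.0], 1: [0.0, 1.0]}
update_cluster_memory(memory, [[0.0, 1.0], [0.0, 1.0]], [0, 0], momentum=0.5)
print(memory)
```

Because every cluster present in the mini-batch is updated once from its batch mean, a small cluster and a large cluster are refreshed at the same rate, which is the consistency property discussed above.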
4.3 Memory Based Hard Mining Scheme
To further distinguish easily confused sample pairs and explore the inter-instance relationship, we propose a novel hard sample mining strategy based on a memory dictionary. We construct another memory-based dictionary to store instance features, organized by pseudo identity, with each identity holding its instances. As shown in Fig. 1, traditional hard mining strategies such as hard triplet loss are based on a pairwise loss that computes the distance of the hardest positive and the hardest negative instances within a mini-batch. In contrast, our proposed method is based on all pseudo-labeled categories and draws negative samples for each query from across these categories. Our hard mining strategy considers the relationship between each query instance and the clusters of other pseudo labels, rather than only the inter-instance relationship with a small fraction of the categories.
For each query, we construct sample pairs that include one positive pair and several hard negative pairs, and we define the hard instance contrastive loss over these pairs with an instance-level temperature hyper-parameter. The hard positive is the instance feature that has the lowest cosine similarity with the query within the same cluster, and the hard negative for a given other class is the instance feature belonging to that class that has the highest cosine similarity with the query.
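The selection rule for hard samples can be sketched as follows. The mining logic follows the description above (lowest cosine similarity within the query's cluster, highest cosine similarity within each other class). The softmax-style loss over the mined similarities, the function names, and the default temperature of 0.1 are assumptions for illustration, and features are assumed to be non-zero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    inner = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return inner / norms


def mine_hard_samples(query, query_label, memory_features, memory_labels):
    """Return (hard positive similarity, {label: hard negative similarity})."""
    hard_positive = None
    hard_negatives = {}
    for feature, label in zip(memory_features, memory_labels):
        similarity = cosine(query, feature)
        if label == query_label:
            # Hardest positive: least similar instance of the same cluster.
            if hard_positive is None or similarity < hard_positive:
                hard_positive = similarity
        elif label not in hard_negatives or similarity > hard_negatives[label]:
            # Hardest negative per class: most similar instance of that class.
            hard_negatives[label] = similarity
    return hard_positive, hard_negatives


def hard_instance_loss(hard_positive, hard_negatives, temperature=0.1):
    """Contrast the hard positive against one hard negative per other class."""
    logits = [hard_positive / temperature]
    logits += [s / temperature for s in hard_negatives.values()]
    peak = max(logits)
    log_partition = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_partition - logits[0]


features = [[1.0, 0.0], [0.6, 0.8], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
labels = [0, 0, 1, 1, 2]
positive, negatives = mine_hard_samples([1.0, 0.0], 0, features, labels)
print(positive, negatives, hard_instance_loss(positive, negatives))
```

Because the search runs over the whole instance memory rather than a mini-batch, every other pseudo-labeled class contributes one negative to the loss for each query.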
Similarly, to ensure memory updating consistency, all instance features of the corresponding K identities in the mini-batch are updated in the instance memory bank in each training iteration.
5.1 Data and Metrics
We evaluate our approach on two large-scale benchmark datasets, Market1501 and DukeMTMC-reID, which are widely used for real-world person Re-ID tasks.
Market1501 contains 32,668 images of 1,501 person identities, captured by 6 cameras in front of the Tsinghua University campus. It contains 12,936 images of 751 identities for training and 19,732 images of 750 identities for testing. All images were cropped by a pedestrian detector, which inevitably introduced some misalignment, missing parts, and false positives.
DukeMTMC-reID consists of 36,411 images of 1,404 different identities collected by 8 cameras. The dataset is split by randomly selecting 702 identities as the training set and the remaining 702 identities as the testing set. It contains 16,522 images for training, and 2,228 query images and 17,661 gallery images for testing.
We follow the standard training/test split and evaluation protocol. For the evaluation metrics, we use the Rank-k matching accuracy (for k = 1, 5, and 10) from the Cumulated Matching Characteristics (CMC), which measures whether a correct match for the query image appears in the top-k list, together with the mean Average Precision (mAP). All results reported in this paper are under the single-query setting, and no post-processing technique is applied.
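The two metrics can be illustrated with a simplified sketch. Each query is represented by its ranked gallery list, encoded as booleans that mark correct matches. The function names are hypothetical, and the sketch omits the filtering of same-camera gallery images that the standard protocol applies.

```python
def rank_k_hit(ranked_matches, k):
    """True if at least one correct match appears in the top-k positions."""
    return any(ranked_matches[:k])


def average_precision(ranked_matches):
    """Average of the precision values at each position holding a correct match."""
    hits = 0
    precision_sum = 0.0
    for position, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def evaluate(all_ranked_matches, ks=(1, 5, 10)):
    """Return ({k: Rank-k accuracy}, mAP) over all queries."""
    total = len(all_ranked_matches)
    cmc = {k: sum(rank_k_hit(r, k) for r in all_ranked_matches) / total for k in ks}
    mean_ap = sum(average_precision(r) for r in all_ranked_matches) / total
    return cmc, mean_ap


# Two toy queries: ranked gallery lists with correct matches marked True.
rankings = [[True, False, True], [False, True, False]]
print(evaluate(rankings))
```

Rank-k only asks whether any correct match is near the top, whereas mAP rewards placing all correct matches early in the ranking, so the two metrics are complementary.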
We adopt ResNet-50 as the backbone of the feature extractor and initialize the model with parameters pre-trained on ImageNet. We remove all sub-module layers after layer-4 and add global average pooling (GAP) followed by a batch normalization layer and an L2-normalization layer, which produces 2048-dimensional features. During testing, we take the features of the global average pooling layer to calculate the distance. At the beginning of each epoch, we use DBSCAN for clustering to generate pseudo labels. The input images are resized to a fixed size. For training images, we perform random horizontal flipping, padding with 10 pixels, random cropping, and random erasing. Each mini-batch contains 256 images of 16 pseudo person identities (16 instances for each person). We adopt the Adam optimizer with a weight decay of 5e-4 to train the Re-ID model. The initial learning rate is set to 3.5e-4 and is reduced to 1/10 of its previous value every 20 epochs over a total of 50 epochs. Following the clustering method of prior work, we use DBSCAN with the Jaccard distance computed over k nearest neighbors, where k = 30. For DBSCAN, the maximum distance d between two samples is set to 0.45, and the minimal number of neighbors for a core point is set to 4.
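The stated hyper-parameters and the step-wise learning rate schedule can be summarized in a short sketch. The dictionary keys and the function name `learning_rate` are hypothetical, and zero-based epoch indexing is an assumption; only the numeric values come from the text above.

```python
# Hyper-parameters as stated in the text; key names are illustrative only.
TRAINING_CONFIG = {
    "identities_per_batch": 16,
    "instances_per_identity": 16,
    "batch_size": 256,
    "weight_decay": 5e-4,
    "base_lr": 3.5e-4,
    "lr_step_epochs": 20,
    "lr_decay_factor": 0.1,
    "total_epochs": 50,
    "dbscan_max_distance": 0.45,
    "dbscan_min_neighbors": 4,
    "jaccard_k_neighbors": 30,
}


def learning_rate(epoch, config=TRAINING_CONFIG):
    """Learning rate at a zero-based epoch under step decay."""
    steps = epoch // config["lr_step_epochs"]
    return config["base_lr"] * config["lr_decay_factor"] ** steps


for epoch in (0, 19, 20, 40, 49):
    print(epoch, learning_rate(epoch))
```

Under this reading, the learning rate stays at its initial value for epochs 0 to 19, drops by a factor of ten for epochs 20 to 39, and drops again for the final ten epochs.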
5.3.1 Comparison with unsupervised method
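| Method | Reference | Market-1501 mAP | Market-1501 R1 | Market-1501 R5 | Market-1501 R10 | DukeMTMC-reID mAP | DukeMTMC-reID R1 | DukeMTMC-reID R5 | DukeMTMC-reID R10 |
|---|---|---|---|---|---|---|---|---|---|
| Unsupervised Domain Adaptation | | | | | | | | | |
| OSNet | ICCV’19 | 84.9 | 94.8 | - | - | 73.5 | 88.6 | - | - |
| ICE (w/ GT) | ICCV’21 | 86.6 | 95.1 | 98.3 | 98.9 | 76.5 | 88.2 | 94.1 | 95.7 |
| HHCL (w/ GT) | This paper | 87.2 | 94.6 | 98.5 | 99.1 | 80.0 | 89.8 | 95.2 | 96.7 |

Table 1: Experimental results of the proposed HHCL and state-of-the-art methods on Market-1501 and DukeMTMC-reID. Note that the best results are bolded.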
We compare our proposed method with state-of-the-art Re-ID methods, including: 1) unsupervised domain adaptation methods for person Re-ID (e.g., ECN, MAR, SSG, MMCL, JVTC, DG-Net++, ECN+, MMT, DCML, MEB, and SpCL); and 2) purely unsupervised methods for person Re-ID (SSL, MMCL, JVTC, HCT, CycAs, SpCL, CAP, CACL, CCL, and ICE). The comparison results on Market-1501 and DukeMTMC-reID are reported in Tab. 1.
As shown in Tab. 1, our method is competitive with all the state-of-the-art methods. On both datasets, the proposed HHCL, without any identity annotation, achieves better performance than all UDA methods, which use an additional labeled source dataset, and is also competitive with the purely unsupervised methods. Under the fully unsupervised setting, HHCL is evaluated by mAP and rank-1 accuracy on Market-1501, where it is 1.9% higher than the current state of the art (ICE). On DukeMTMC-reID, our method also achieves high mAP and rank-1 accuracy. These results indicate that our method is effective for unsupervised person Re-ID learning.
5.3.2 Comparison with supervised method
Our HHCL method can easily be used as a supervised approach by replacing the pseudo labels with ground-truth labels. We find that the proposed unsupervised method, which uses no ground truth, is already comparable to some excellent supervised methods, such as PCB and DG-Net, and that HHCL achieves even better performance under the supervised setting. This shows that our method benefits from ground-truth labels, which avoid the noise introduced by pseudo labels, and further demonstrates its effectiveness for person Re-ID in both the unsupervised and supervised settings.
5.4 Ablation Study
Influence of Hyper-Parameter. Tab. 2 reports the experimental results under different values of the hyper-parameter. As mentioned for Eq. (2), the balancing factor takes a value between 0 and 1 and determines the weights of the cluster-level loss and the instance-level loss. When the balancing factor is 0, the loss function contains only the hard instance contrastive loss. From Fig. 3, we can see that the model then converges very slowly in the early stage of training, and that comparing only against hard samples is not beneficial for learning generalized features or obtaining better clustering pseudo labels. Conversely, when the balancing factor is 1 and only the cluster-level loss is used, convergence is faster, but only one feature is retained for each cluster, which loses intra-class diversity and still does not help the network learn more discriminative features. Combining both kinds of contrastive loss clearly leads to better performance. When the balancing factor is 0.5, we obtain the best performance of 84.2% in mAP, indicating that the proposed hybrid contrastive learning method has a distinct advantage over either loss alone.
|IBN + GeM||87.8||95.1||98.2||98.8|
|IBN + GeM + LS||88.2||94.9||98.3||98.9|
|IBN + GeM||76.8||87.9||93.4||94.9|
|IBN + GeM + LS||77.3||87.7||93.5||95.1|
Instance-batch normalization (IBN) and Generalized Mean Pooling (GeM) have proved effective in both supervised and UDA-based Re-ID methods. We compare the performance of HHCL under different settings in Tab. 3. The performance of the proposed HHCL can be further improved with an IBN-ResNet50 backbone network and a GeM pooling layer.
In this paper, we propose a novel method for fully unsupervised person Re-ID. The new concepts and techniques introduced include a more efficient hybrid contrast learning framework and a memory-based hard sample mining scheme. Specifically, the proposed HHCL approach comprehensively considers both cluster-level and instance-level information. To effectively exploit the invariance within and between clusters, HHCL leverages hard samples to guide the network to learn more robust and discriminative features. Extensive experiments on two benchmark datasets demonstrate that HHCL achieves the best results compared with existing purely unsupervised and UDA-based Re-ID methods.
This work was supported in part by 111 Project of China (B17007), and in part by the National Natural Science Foundation of China (61602011).
- (2020) Deep credible metric learning for unsupervised domain adaptation person re-identification. In ECCV.
- (2021) ICE: inter-instance contrastive encoding for unsupervised person re-identification. ArXiv abs/2103.16364.
- (2020) A simple framework for contrastive learning of visual representations. ArXiv abs/2002.05709.
- (2021) Cluster contrast for unsupervised person re-identification. ArXiv abs/2103.11568.
- (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
- (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD.
- (2019) Self-similarity grouping: a simple unsupervised cross domain adaptation approach for person re-identification. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6111–6120.
- (2020) Mutual mean-teaching: pseudo label refinery for unsupervised domain adaptation on person re-identification. ArXiv abs/2001.01526.
- (2020) Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. ArXiv abs/2006.02713.
- (2007) Evaluating appearance models for recognition, reacquisition, and tracking.
- (2020) Momentum contrast for unsupervised visual representation learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9726–9735.
- (2016) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- (2017) In defense of the triplet loss for person re-identification. ArXiv abs/1703.07737.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. ArXiv abs/1502.03167.
- (2020) Joint visual and temporal consistency for unsupervised domain adaptive person re-identification. In ECCV.
- (2021) Cluster-guided asymmetric contrastive learning for unsupervised person re-identification. ArXiv abs/2106.07846.
- (2020) Unsupervised person re-identification via softened similarity learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3387–3396.
- (2018) Two at once: enhancing learning and generalization capacities via ibn-net. In ECCV.
- Fine-tuning cnn image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, pp. 1655–1668.
- (2018) Beyond part models: person retrieval with refined part pooling. In ECCV.
- (2018) Representation learning with contrastive predictive coding. ArXiv abs/1807.03748.
- (2020) Unsupervised person re-identification via multi-label classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10981–10990.
- (2016) Towards unsupervised open-set person re-identification. 2016 IEEE International Conference on Image Processing (ICIP), pp. 769–773.
- (2021) Camera-aware proxies for unsupervised person re-identification. In AAAI.
- (2019) Multi-similarity loss with general pair weighting for deep metric learning. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5017–5025.
- (2020) CycAs: self-supervised cycle association for learning re-identifiable descriptions. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pp. 72–88.
- (2018) Unsupervised feature learning via non-parametric instance discrimination. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3733–3742.
- (2018) Unsupervised feature learning via non-parametric instance-level discrimination. ArXiv abs/1805.01978.
- (2017) Cross-view asymmetric metric learning for unsupervised person re-identification. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 994–1002.
- (2019) Unsupervised person re-identification by soft multilabel learning. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2143–2152.
- (2020) Hierarchical clustering with hard-batch triplet loss for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13657–13665.
- (2020) Multiple expert brainstorming for domain adaptive person re-identification. ArXiv abs/2007.01546.
- (2015) Scalable person re-identification: a benchmark. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1116–1124.
- (2019) Joint discriminative and generative learning for person re-identification. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2133–2142.
- (2017) Unlabeled samples generated by gan improve the person re-identification baseline in vitro. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3774–3782.
- (2017) Re-ranking person re-identification with k-reciprocal encoding. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3652–3661.
- (2019) Invariance matters: exemplar memory for domain adaptive person re-identification. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 598–607.
- (2021) Learning to adapt invariance in memory for person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, pp. 2723–2738.
- (2019) Omni-scale feature learning for person re-identification. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3701–3711.
- (2020) Joint disentangling and adaptation for cross-domain person re-identification. ArXiv abs/2007.10315.