ICE: Inter-instance Contrastive Encoding for Unsupervised Person Re-identification

03/30/2021 ∙ by Hao Chen, et al. ∙ Inria

Unsupervised person re-identification (ReID) aims at learning discriminative identity features without annotations. Recently, self-supervised contrastive learning has gained increasing attention for its effectiveness in unsupervised representation learning. The main idea of instance contrastive learning is to match the same instance across different augmented views. However, the relationship between different instances of the same identity has not been explored in previous methods, leading to sub-optimal ReID performance. To address this issue, we propose Inter-instance Contrastive Encoding (ICE), which leverages inter-instance pairwise similarity scores to boost previous class-level contrastive ReID methods. We first use pairwise similarity ranking as one-hot hard pseudo labels for hard instance contrast, which aims at reducing intra-class variance. Then, we use similarity scores as soft pseudo labels to enhance the consistency between augmented and original views, which makes our model more robust to augmentation perturbations. Experiments on several large-scale person ReID datasets validate the effectiveness of our proposed unsupervised method ICE, which is competitive with even supervised methods.


1 Introduction

Person re-identification (ReID) aims at retrieving a person of interest across non-overlapping cameras by comparing the similarity of appearance representations. Supervised ReID methods [26, 2, 20] use human-annotated labels to build discriminative appearance representations which are robust to pose, camera property and view-point variation. However, annotating cross-camera identity labels is a cumbersome task, which makes supervised methods less scalable in real-world deployments. Unsupervised methods [18, 19, 30] directly train a model on unlabeled data and thus offer better scalability.

Most previous unsupervised ReID methods [25, 9, 39] are based on unsupervised domain adaptation (UDA). UDA methods adapt a model from a labeled source domain to an unlabeled target domain. The source domain provides a good starting point that facilitates target domain adaptation. With the help of a large-scale source dataset, state-of-the-art UDA methods [9, 39] significantly enhance the performance of unsupervised ReID. However, the performance of UDA methods is strongly influenced by the source dataset's scale and quality. Moreover, a large-scale labeled dataset is not always available in the real world. In this case, fully unsupervised methods [18, 19] offer more flexibility, as they do not require any identity annotation and directly learn from unlabeled data in a target domain.

Recently, contrastive learning has shown excellent performance in unsupervised representation learning. State-of-the-art contrastive methods [36, 3, 11] consider each image instance as a class and learn representations by matching augmented views of the same instance. As an identity is usually composed of multiple positive instances, considering different images of the same identity as different classes hurts the performance of the fine-grained ReID task. Self-paced Contrastive Learning (SpCL) [10] alleviates this problem by matching an instance with the centroid of the multiple positives, where each positive converges to its centroid at a uniform pace. Although SpCL has achieved impressive performance, it does not consider inter-instance affinities, which can be leveraged to reduce intra-class variance and make clusters more compact. In supervised ReID, state-of-the-art methods [2, 20] usually adopt a hard triplet loss [13] to lay more emphasis on hard samples inside a class, so that hard samples can get closer to normal samples. In this paper, we introduce Inter-instance Contrastive Encoding (ICE), in which we match an instance with its hardest positive in a mini-batch to make clusters more compact and improve pseudo label quality. Matching the hardest positive refers to using one-hot “hard” pseudo labels.

Since no ground truth is available, mining hardest positives within clusters is likely to introduce false positives into the training process. In addition, one-hot labels do not take the complex inter-instance relationship into consideration when multiple pseudo positives and negatives exist in a mini-batch. Contrastive methods usually use data augmentation to mimic real-world distortions, e.g., occlusion, view-point and resolution variance. After data augmentation, certain pseudo positives may become less similar to an anchor, while certain pseudo negatives may become more similar. As a robust model should be invariant to distortions from data augmentation, we propose to use the inter-instance pairwise similarity as “soft” pseudo labels to enhance the consistency before and after augmentation.

Our proposed ICE incorporates class-level label (centroid contrast), instance pairwise hard label (hardest positive contrast) and instance pairwise soft label (augmentation consistency) into one fully unsupervised person ReID framework. Without any identity annotation, ICE significantly outperforms state-of-the-art UDA and fully unsupervised methods on main-stream person ReID datasets.

To summarize, our contributions are:

  1. We propose to use pairwise similarity ranking to mine hardest samples as one-hot hard pseudo labels for hard instance contrast, which reduces intra-class variance.

  2. We propose to use pairwise similarity scores as soft pseudo labels to enhance the consistency between augmented and original instances, which alleviates label noise and makes our model more robust to augmentation perturbation.

  3. Extensive experiments highlight the importance of inter-instance pairwise similarity in contrastive learning. Our proposed method ICE outperforms state-of-the-art methods by a considerable margin, pushing unsupervised ReID closer to real-world deployment.

2 Related Work

Unsupervised person ReID.

Recent unsupervised person ReID methods can be roughly categorized into unsupervised domain adaptation (UDA) and fully unsupervised methods. Among UDA-based methods, several works [31, 17] leverage semantic attributes to reduce the domain gap between source and target domains. Several works [35, 46, 6, 47, 49] use generative networks to transfer labeled source domain images into the style of the target domain. Another possibility is to assign pseudo labels to unlabeled images, where pseudo labels are obtained from clustering [25, 8, 40] or reference data [37]. Pseudo label noise can be reduced by selecting credible samples [1] or using a teacher network to assign soft labels [9]. All these UDA-based methods require a labeled source dataset. Fully unsupervised methods have better flexibility for deployment. BUC [18] first treats each image as a cluster and progressively merges clusters. Lin et al. [19] replace clustering-based pseudo labels with similarity-based softened labels. Hierarchical clustering is proposed in [38] to improve the quality of pseudo labels. Since each identity usually has multiple positive instances, MMCL [30] introduces a memory-based multi-label classification loss into unsupervised ReID. JVTC [16] and CycAs [33] explore temporal information to refine visual similarity. SpCL [10] considers each cluster and outlier as a single class and then conducts instance-to-centroid contrastive learning. CAP [32] calculates identity centroids for each camera and conducts intra- and inter-camera centroid contrastive learning. Both SpCL and CAP focus on instance-to-centroid contrast but neglect inter-instance affinities.

Contrastive Learning.

Recent contrastive learning methods [36, 11, 3] consider unsupervised representation learning as a dictionary look-up problem. Wu et al. [36] retrieve a target representation from a memory bank that stores representations of all the images in a dataset. MoCo [11] introduces a momentum encoder and a queue-like memory bank to dynamically update negatives for contrastive learning. In SimCLR [3], the authors directly retrieve representations within a large batch. However, all these methods consider different instances of the same class as different classes, which is not suitable for a fine-grained ReID task. These methods learn invariance from augmented views, which can be regarded as a form of consistency regularization.

Consistency regularization.

Consistency regularization refers to the assumption that model predictions should be consistent when fed perturbed versions of the same image, which is widely considered in recent semi-supervised learning [27, 24, 4]. The perturbation can come from data augmentation [24], temporal ensembling [27, 15] and shallow-deep features [43, 4]. Artificial perturbations are applied in contrastive learning as strong augmentation [5, 34] and a momentum encoder [11] to make a model robust to data variance. Wei et al. [34] propose to regularize inter-instance consistency between two sets of augmented views, which neglects the intra-class variance problem. We simultaneously reduce intra-class variance and regularize consistency between augmented and original views, which is more suitable for fine-grained ReID tasks.

3 Proposed Method

Figure 1: General architecture of ICE. We maximize the similarity between an anchor and its pseudo positives in both inter-class (proxy agreement between an instance representation and its cluster proxy) and intra-class (instance agreement between an instance and its pseudo positive) manners.

3.1 Overview

Given a person ReID dataset $X$, our objective is to train a robust model on $X$ without annotation. For inference, representations of a same person are supposed to be as close as possible. State-of-the-art contrastive methods [11, 3] consider each image as an individual class and maximize similarities between augmented views of a same instance with the InfoNCE loss [28]:

$$\mathcal{L}_{InfoNCE} = -\log \frac{\exp(q \cdot k_{+}/\tau)}{\sum_{i=0}^{N}\exp(q \cdot k_{i}/\tau)} \quad (1)$$

where $q$ and $k_{+}$ are two augmented views of a same instance within a set of candidates $\{k_0, k_1, \dots, k_N\}$, and $\tau$ is a temperature hyper-parameter that controls the scale of similarities.
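As an illustration of Eq. (1), the following PyTorch sketch computes the instance-level InfoNCE loss for one anchor; the function name, tensor shapes and default temperature are ours for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, k_neg, tau=0.07):
    """Instance-level InfoNCE loss (Eq. 1), illustrative sketch.

    q:     (D,)   anchor representation (one augmented view)
    k_pos: (D,)   representation of the other augmented view of the same instance
    k_neg: (N, D) representations of the other candidate instances
    tau:   temperature controlling the scale of similarities
    """
    q, k_pos = F.normalize(q, dim=0), F.normalize(k_pos, dim=0)
    k_neg = F.normalize(k_neg, dim=1)
    l_pos = (q @ k_pos).unsqueeze(0)            # similarity to the positive view
    l_neg = k_neg @ q                           # similarities to the negatives
    logits = torch.cat([l_pos, l_neg]) / tau    # the positive sits at index 0
    target = torch.zeros(1, dtype=torch.long)   # index of the positive
    return F.cross_entropy(logits.unsqueeze(0), target)
```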

Following MoCo [11], we design our proposed ICE with an online encoder and a momentum encoder, as shown in Fig. 1. The online encoder is a regular network, e.g., ResNet50 [12], which is updated by back-propagation. The momentum encoder (weights noted as $\hat{\theta}$) has the same structure as the online encoder, but it is updated by the accumulated weights of the online encoder (weights noted as $\theta$):

$$\hat{\theta}^{(t)} = \alpha\,\hat{\theta}^{(t-1)} + (1-\alpha)\,\theta^{(t)} \quad (2)$$

where $\alpha$ is a momentum coefficient that controls the update speed of the momentum encoder, and $t$ and $t-1$ refer respectively to the current and last iteration. The momentum encoder builds momentum representations with the moving-averaged weights, which are more stable to label noise.
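The momentum update of Eq. (2) can be sketched in a few lines of PyTorch; the helper name and the default coefficient are illustrative (the paper fixes its own value of $\alpha$), and model buffers are ignored for brevity.

```python
import torch

@torch.no_grad()
def update_momentum_encoder(online_encoder, momentum_encoder, alpha=0.999):
    """EMA update of the momentum encoder weights (Eq. 2), illustrative sketch.

    Each momentum weight becomes a moving average of the corresponding
    online weight; alpha controls the update speed of the momentum encoder.
    """
    for p_online, p_momentum in zip(online_encoder.parameters(),
                                    momentum_encoder.parameters()):
        p_momentum.data.mul_(alpha).add_(p_online.data, alpha=1.0 - alpha)
```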

At the beginning of each training epoch, we use the momentum encoder to extract appearance representations of all the samples in the training set $X$. We use the clustering algorithm DBSCAN [7] on these appearance representations to generate pseudo identity labels $Y$. We only consider clustered inliers for contrastive learning, while un-clustered outliers are discarded. We calculate proxy centroids and store them in a memory for a proxy contrastive loss (see Sec. 3.2). Note that this proxy memory can be camera-agnostic [10] or camera-aware [32].
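A minimal sketch of this pseudo-label generation step, assuming a scikit-learn DBSCAN on a plain cosine distance matrix; the paper actually clusters on a re-ranked k-reciprocal Jaccard distance (Sec. 4.2), and all names below are illustrative.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import DBSCAN

@torch.no_grad()
def generate_pseudo_labels(momentum_encoder, loader, eps=0.55, min_samples=4):
    """Extract momentum representations and cluster them with DBSCAN.

    Returns L2-normalized features and pseudo identity labels; label -1 marks
    un-clustered outliers, which are discarded from contrastive training.
    """
    momentum_encoder.eval()
    feats = []
    for images, _ in loader:                        # no augmentation at this stage
        feats.append(F.normalize(momentum_encoder(images), dim=1).cpu())
    feats = torch.cat(feats).numpy()
    dist = np.clip(1.0 - feats @ feats.T, 0, None)  # cosine distance matrix
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit_predict(dist)
    return feats, labels
```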

Then, we use a random identity sampler to split the training set into mini-batches, where each mini-batch contains $P$ pseudo identities and each identity has $K$ instances. We train the whole network by combining the proxy contrastive loss $\mathcal{L}_{proxy}$ (with class-level labels), a hard instance contrastive loss $\mathcal{L}_{h\_ins}$ (with hard instance pairwise labels, see Sec. 3.3) and a soft instance consistency loss $\mathcal{L}_{s\_ins}$ (with soft instance pairwise labels, see Sec. 3.4):

$$\mathcal{L} = \mathcal{L}_{proxy} + \lambda_{h\_ins}\,\mathcal{L}_{h\_ins} + \lambda_{s\_ins}\,\mathcal{L}_{s\_ins} \quad (3)$$

where $\lambda_{h\_ins}$ and $\lambda_{s\_ins}$ balance the contribution of the two instance-level losses.

To increase the consistency before and after data augmentation, we use different augmentation settings for prediction and target representations in the three losses (see Tab. 1). The overall ICE algorithm is summarized in Appendix A.

Loss Predictions (augmentation) Targets (augmentation)
$\mathcal{L}_{proxy}$ online representation (Strong) proxy (None)
$\mathcal{L}_{h\_ins}$ online representation (Strong) momentum representation (Strong)
$\mathcal{L}_{s\_ins}$ online representation (Strong) momentum representation (None)
Table 1: Augmentation settings for the 3 losses.

3.2 Proxy Centroid Contrastive Baseline

For a camera-agnostic memory, the proxy of cluster $k$ is defined as the averaged momentum representation of all the instances belonging to this cluster:

$$p_k = \frac{1}{N_k}\sum_{y_i = k} m_i \quad (4)$$

where $m_i$ is the momentum representation of instance $x_i$ and $N_k$ is the number of instances belonging to the cluster $k$.

We apply a set of data augmentations on the mini-batch images and feed them to the online encoder. For an online representation $f_i$ belonging to the cluster $k$, the camera-agnostic proxy contrastive loss is a softmax log loss with one positive proxy and all the negatives in the memory:

$$\mathcal{L}_{cluster} = -\log \frac{\exp(\langle f_i, p_k\rangle/\tau_{proxy})}{\sum_{j=1}^{C}\exp(\langle f_i, p_j\rangle/\tau_{proxy})} \quad (5)$$

where $\langle\cdot,\cdot\rangle$ denotes cosine similarity, $C$ is the number of clusters in a training epoch and $\tau_{proxy}$ is a temperature hyper-parameter. Different from the unified contrastive loss [9], outliers are not considered as single-instance clusters. In this way, outliers are not pushed away from clustered instances, which allows us to mine more hard samples for our proposed hard instance contrast. As shown in Fig. 2, all the clustered instances converge to a common cluster proxy centroid. However, images inside a cluster are prone to be affected by camera styles, leading to high intra-class variance. This problem can be alleviated by adding a cross-camera proxy contrastive loss [32].
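A compact sketch of the camera-agnostic proxy contrastive loss of Eq. (5); the memory layout, function name and default temperature are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def cluster_proxy_loss(f_online, cluster_ids, proxy_memory, tau_proxy=0.5):
    """Camera-agnostic proxy contrastive loss (Eq. 5), illustrative sketch.

    f_online:     (B, D) online representations of the augmented mini-batch
    cluster_ids:  (B,)   pseudo identity label of each image (long tensor)
    proxy_memory: (C, D) one centroid per cluster, built from momentum features
    """
    f_online = F.normalize(f_online, dim=1)
    proxies = F.normalize(proxy_memory, dim=1)
    logits = f_online @ proxies.t() / tau_proxy   # similarity to every cluster proxy
    # softmax log loss with the instance's own cluster centroid as the positive
    return F.cross_entropy(logits, cluster_ids)
```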

Figure 2: Proxy contrastive loss. Inside a cluster, an instance is pulled to the cluster centroid by $\mathcal{L}_{cluster}$ and to cross-camera centroids by $\mathcal{L}_{cam}$.

For a camera-aware memory, if we have $N_{cam}$ cameras, a camera proxy $p_k^{c}$ is defined as the averaged momentum representation of all the instances belonging to the cluster $k$ in camera $c$:

$$p_k^{c} = \frac{1}{N_k^{c}}\sum_{y_i = k,\; c_i = c} m_i \quad (6)$$

where $N_k^{c}$ is the number of instances belonging to the cluster $k$ captured by camera $c$.

Given an online representation $f_i$ belonging to the cluster $k$ under camera $c$, the cross-camera proxy contrastive loss is a softmax log loss with one positive cross-camera proxy and the nearest negative proxies in the memory:

$$\mathcal{L}_{cam} = -\frac{1}{|\mathcal{P}|}\sum_{p \in \mathcal{P}} \log \frac{\exp(\langle f_i, p\rangle/\tau_{cam})}{\exp(\langle f_i, p\rangle/\tau_{cam}) + \sum_{n \in \mathcal{N}}\exp(\langle f_i, n\rangle/\tau_{cam})} \quad (7)$$

where $\langle\cdot,\cdot\rangle$ denotes cosine similarity, $\tau_{cam}$ is a cross-camera temperature hyper-parameter, $\mathcal{P}$ is the set of cross-camera positive proxies of $f_i$ (i.e., the proxies of cluster $k$ under the other cameras) and $\mathcal{N}$ is the set of its nearest negative proxies in the memory. Thanks to this cross-camera proxy contrastive loss, instances from one camera are pulled closer to proxies of other cameras, which reduces intra-class camera style variance.

We define the overall proxy contrastive loss by combining cluster and camera proxies:

$$\mathcal{L}_{proxy} = \mathcal{L}_{cluster} + \mathcal{L}_{cam} \quad (8)$$
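The cross-camera term of Eq. (7) can be sketched as below, assuming the loss is averaged over the cross-camera positive proxies and that the nearest negative proxies have already been retrieved from the memory; Eq. (8) is then the sum of this term and the cluster-level loss above. All names and defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_camera_proxy_loss(f, pos_proxies, neg_proxies, tau_cam=0.07):
    """Cross-camera proxy contrastive loss (Eq. 7), illustrative sketch.

    f:           (D,)   online representation of the anchor
    pos_proxies: (P, D) proxies of the anchor's cluster under other cameras
    neg_proxies: (N, D) nearest negative proxies retrieved from the memory
    """
    f = F.normalize(f, dim=0)
    pos = F.normalize(pos_proxies, dim=1) @ f / tau_cam     # (P,)
    neg = F.normalize(neg_proxies, dim=1) @ f / tau_cam     # (N,)
    # for each positive proxy: -log( exp(pos) / (exp(pos) + sum(exp(neg))) )
    pairs = torch.cat([pos.unsqueeze(1),
                       neg.unsqueeze(0).expand(pos.size(0), -1)], dim=1)
    return (torch.logsumexp(pairs, dim=1) - pos).mean()
```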

3.3 Hard Instance Contrastive Loss

Although intra-class variance can be alleviated by the cross-camera contrastive loss, it has two drawbacks: 1) more memory space is needed to store camera-aware proxies, and 2) it cannot be used when camera ids are unavailable. We propose a camera-agnostic alternative by exploring inter-instance relationships instead of using camera labels. As training proceeds, the encoders become stronger, which helps outliers progressively enter clusters and become hard inliers. Pulling hard inliers closer to normal inliers effectively increases the compactness of clusters.

A mini-batch is composed of $P$ identities, where each identity has $K$ positive instances. Given an anchor instance belonging to the $i$-th class, we sample the hardest positive momentum representation $m_{hard}$, i.e., the one that has the lowest cosine similarity with the anchor's online representation $f_a$ (see Fig. 4). For the same anchor, we have $J = (P-1)\times K$ negative instances that do not belong to the $i$-th class. The hard instance contrastive loss for the anchor is a softmax log loss over the $J+1$ (1 positive and $J$ negative) pairs, which is defined as:

$$\mathcal{L}_{h\_ins} = -\log \frac{\exp(\langle f_a, m_{hard}\rangle/\tau_{hard})}{\exp(\langle f_a, m_{hard}\rangle/\tau_{hard}) + \sum_{j=1}^{J}\exp(\langle f_a, m^{-}_{j}\rangle/\tau_{hard})} \quad (9)$$

where $m^{-}_{j}$ is the momentum representation of the $j$-th negative instance and $\tau_{hard}$ is the hard instance temperature hyper-parameter. By minimizing the distance between the anchor and the hardest positive and maximizing the distance between the anchor and all negatives, $\mathcal{L}_{h\_ins}$ increases intra-class compactness and inter-class separability.
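A sketch of Eq. (9) for a single anchor, assuming its pseudo positives and pseudo negatives have already been gathered from the mini-batch; names and the default temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def hard_instance_contrastive_loss(f_anchor, m_pos, m_neg, tau_hard=0.1):
    """Hard instance contrastive loss (Eq. 9), illustrative sketch.

    f_anchor: (D,)     online representation of the anchor
    m_pos:    (K-1, D) momentum representations of its pseudo positives
    m_neg:    (J, D)   momentum representations of all pseudo negatives
    The hardest positive (lowest cosine similarity to the anchor) is
    contrasted against all negatives with a softmax log loss.
    """
    f_anchor = F.normalize(f_anchor, dim=0)
    pos_sim = F.normalize(m_pos, dim=1) @ f_anchor      # (K-1,)
    neg_sim = F.normalize(m_neg, dim=1) @ f_anchor      # (J,)
    hardest_pos = pos_sim.min()                         # least similar pseudo positive
    logits = torch.cat([hardest_pos.view(1), neg_sim]) / tau_hard
    target = torch.zeros(1, dtype=torch.long)           # positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```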

Figure 3: Comparison between triplet and hard instance contrastive loss.

Relation with triplet loss.

Both $\mathcal{L}_{h\_ins}$ and the triplet loss [13] pull an anchor closer to positive instances and away from negative instances. As shown in Fig. 3, the traditional triplet loss pushes a negative pair away from a positive pair by a margin. Differently, the proposed $\mathcal{L}_{h\_ins}$ pushes all the negative instances as far away as possible with a softmax. If we select only one negative instance, $\mathcal{L}_{h\_ins}$ can be transformed into the triplet loss. If we calculate pairwise distances within a mini-batch to select the hardest positive and the hardest negative instances, $\mathcal{L}_{h\_ins}$ is equivalent to the batch-hard triplet loss [13]. We compare the batch-hard triplet loss (hardest negative) with the proposed $\mathcal{L}_{h\_ins}$ (all negatives) in Tab. 2.

Figure 4: Based on inter-instance similarity ranking between anchor (A), pseudo positives (P) and pseudo negatives (N), Hard Instance Contrastive Loss matches an anchor with its hardest positive in a mini-batch. Soft Instance Consistency Loss regularizes the inter-instance similarity before and after data augmentation.
Negatives in $\mathcal{L}_{h\_ins}$ Market1501 DukeMTMC-reID
mAP Rank1 mAP Rank1
hardest 80.1 92.8 68.2 82.5
all 82.3 93.8 69.9 83.3
Table 2: Comparison between using the hardest negative and all negatives in the denominator of $\mathcal{L}_{h\_ins}$.

3.4 Soft Instance Consistency Loss

Both proxy and hard instance contrastive losses are trained with one-hot hard pseudo labels, which cannot capture the complex inter-instance similarity relationship between multiple pseudo positives and negatives. In particular, inter-instance similarity may change after data augmentation. As shown in Fig. 4, the anchor becomes less similar to pseudo positives because of the visual distortions. Meanwhile, the anchor becomes more similar to pseudo negatives, since both of them have red shirts. By maintaining the consistency before and after augmentation, a model is supposed to be more invariant to augmentation perturbations. We use the inter-instance similarity scores without augmentation as soft labels to rectify those with augmentation.

For a batch of images after data augmentation, we measure the inter-instance similarity between an anchor and all the mini-batch instances, as shown in Fig. 4. Then, the inter-instance similarities are turned into a prediction distribution by a softmax:

$$P_i = \frac{\exp(\langle f_a, m_i\rangle/\tau_{soft})}{\sum_{j=1}^{P\times K}\exp(\langle f_a, m_j\rangle/\tau_{soft})} \quad (10)$$

where $\tau_{soft}$ is the soft instance temperature hyper-parameter, $f_a$ is the online representation of the anchor and $m_i$ is the momentum representation of each instance in the mini-batch.

For the same batch without data augmentation, we measure the inter-instance similarity between the momentum representation of the same anchor and those of all the mini-batch instances, because the momentum encoder is more stable. We get a target distribution $Q_i$:

$$Q_i = \frac{\exp(\langle \tilde{m}_a, \tilde{m}_i\rangle/\tau_{soft})}{\sum_{j=1}^{P\times K}\exp(\langle \tilde{m}_a, \tilde{m}_j\rangle/\tau_{soft})} \quad (11)$$

where $\tilde{m}$ denotes momentum representations of the non-augmented images.

The soft instance consistency loss is the Kullback-Leibler divergence between the two distributions:

$$\mathcal{L}_{s\_ins} = D_{KL}(Q\,\|\,P) = \sum_{i=1}^{P\times K} Q_i \log \frac{Q_i}{P_i} \quad (12)$$
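Eqs. (10)-(12) can be sketched as follows for one anchor, assuming the KL divergence is taken from the non-augmented target distribution to the augmented prediction distribution; names and the default temperature are ours.

```python
import torch
import torch.nn.functional as F

def soft_instance_consistency_loss(f_aug, m_batch_aug, m_anchor, m_batch, tau_soft=0.1):
    """Soft instance consistency loss (Eqs. 10-12), illustrative sketch.

    f_aug:       (D,)   online representation of the augmented anchor
    m_batch_aug: (B, D) momentum representations of the augmented mini-batch
    m_anchor:    (D,)   momentum representation of the non-augmented anchor
    m_batch:     (B, D) momentum representations of the non-augmented mini-batch
    """
    # prediction distribution P (Eq. 10): augmented anchor vs. augmented batch
    log_p = F.log_softmax(F.normalize(m_batch_aug, dim=1) @
                          F.normalize(f_aug, dim=0) / tau_soft, dim=0)
    # target distribution Q (Eq. 11): non-augmented anchor vs. non-augmented batch
    with torch.no_grad():
        q = F.softmax(F.normalize(m_batch, dim=1) @
                      F.normalize(m_anchor, dim=0) / tau_soft, dim=0)
    # KL(Q || P) (Eq. 12)
    return F.kl_div(log_p, q, reduction="sum")
```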

In previous methods, consistency is regularized between weakly augmented and strongly augmented images [24] or between two sets of differently strongly augmented images [34]. Some methods [15, 27] also adopted the mean square error (MSE) as their consistency loss function. We compare our setting with other possible settings in Tab. 3.

Consistency Market1501 DukeMTMC-reID
mAP Rank1 mAP Rank1
MSE 80.0 92.7 68.4 82.1
Strong-strong Aug 80.4 92.8 68.2 82.5
ours 82.3 93.8 69.9 83.3
Table 3: Comparison of consistency loss. Ours refers to KL divergence between images with and without data augmentation.
Figure 5: Parameter analysis on the Market-1501 dataset.

4 Experiments

4.1 Datasets and Evaluation Protocols

Market-1501 [41], DukeMTMC-reID [22] and MSMT17 [35] datasets are used to evaluate our proposed method. Market-1501 is collected in front of a supermarket at Tsinghua University with 6 cameras. It contains 12,936 images of 751 identities for training and 19,732 images of 750 identities for testing. DukeMTMC-reID is a subset of the DukeMTMC dataset. It contains 16,522 images of 702 persons for training, and 2,228 query images and 17,661 gallery images of 702 persons for testing, captured by 8 cameras. MSMT17 is a large-scale ReID dataset, which contains 32,621 training images of 1,041 identities and 93,820 testing images of 3,060 identities collected from 15 cameras. Both Cumulative Matching Characteristics (CMC) Rank1, Rank5 and Rank10 accuracies and mean Average Precision (mAP) are used in our experiments.
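For reference, a simplified sketch of how CMC and mAP are typically computed for ReID retrieval, following the standard protocol of removing gallery images that share both identity and camera with the query; this is a generic illustration, not the official evaluation code.

```python
import numpy as np

def evaluate(dist, q_ids, g_ids, q_cams, g_cams, max_rank=10):
    """CMC Rank-k and mAP for ReID retrieval (simplified illustrative sketch).

    dist: (Q, G) pairwise distances between query and gallery representations.
    """
    cmc, aps, valid_q = np.zeros(max_rank), [], 0
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                      # gallery sorted by distance
        # drop gallery samples with the same identity and camera as the query
        keep = ~((g_ids[order] == q_ids[i]) & (g_cams[order] == q_cams[i]))
        matches = (g_ids[order][keep] == q_ids[i]).astype(np.float32)
        if matches.sum() == 0:                           # identity absent from gallery
            continue
        valid_q += 1
        first_hit = int(np.argmax(matches))              # rank of first correct match
        if first_hit < max_rank:
            cmc[first_hit:] += 1
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    return cmc / valid_q, float(np.mean(aps))
```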

4.2 Implementation details

General training settings.

To conduct a fair comparison with state-of-the-art methods, we use an ImageNet [23] pre-trained ResNet50 [12] as our backbone network. We report results of IBN-ResNet50 [21] in Appendix B. An Adam optimizer with a weight decay rate of 0.0005 is used to optimize our networks. The learning rate is set to 0.00035 with a warm-up scheme in the first 10 epochs. No learning rate decay is used in the training. The momentum encoder is updated with a momentum coefficient $\alpha$ (Eq. 2). We renew pseudo labels every 400 iterations and repeat this process for 40 epochs. We use a batch size of 32, sampled as $P$ pseudo identities with $K$ instances each. The temperatures $\tau_{proxy}$ and $\tau_{cam}$ and the number of nearest negative proxies are fixed in the proxy contrastive baseline. Our network is trained on 4 Nvidia 1080 GPUs under the Pytorch framework. The total training time is around 2 hours on Market-1501. After training, only the momentum encoder is used for inference.

Clustering settings.

We calculate the $k$-reciprocal Jaccard distance [44] for clustering, where $k$ is set to 30. We set the minimum number of cluster samples to 4 and the distance threshold to 0.55 for DBSCAN. We also report results with a smaller threshold of 0.5 (more appropriate for the smaller dataset Market1501) and a larger threshold of 0.6 (more appropriate for the larger dataset MSMT17) in Appendix C.

Data augmentation.

All images are resized to 256×128. The strong data augmentation refers to random horizontal flipping, cropping, Gaussian blurring and random erasing [45].
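A possible torchvision composition of this strong augmentation; the padding, probabilities and blur parameters are assumptions rather than the paper's exact configuration.

```python
from torchvision import transforms

# illustrative strong-augmentation pipeline (flip, crop, Gaussian blur, random erasing)
strong_augmentation = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.Pad(10),
    transforms.RandomCrop((256, 128)),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0))], p=0.5),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),
])
```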

4.3 Parameter analysis

Compared to the proxy contrastive baseline, ICE brings in four more hyper-parameters: $\lambda_{h\_ins}$ and $\tau_{hard}$ for the hard instance contrastive loss, and $\lambda_{s\_ins}$ and $\tau_{soft}$ for the soft instance consistency loss. We analyze the sensitivity of each hyper-parameter on the Market-1501 dataset. The mAP results are illustrated in Fig. 5. As hardest positives are likely to be false positives, an overlarge or undersized $\lambda_{h\_ins}$ introduces more noise. $\lambda_{h\_ins}$ and $\lambda_{s\_ins}$ balance the weight of each loss in Eq. 3; given the results, we fix them at their best-performing values. $\tau_{hard}$ and $\tau_{soft}$ control the similarity scale in the hard instance contrastive loss and the soft instance consistency loss, and are likewise fixed at the values that give the best mAP.

Camera-aware memory Market1501 DukeMTMC-reID MSMT17
mAP R1 R5 R10 mAP R1 R5 R10 mAP R1 R5 R10
Baseline 79.3 91.5 96.8 97.6 67.3 81.4 90.8 92.9 36.4 67.8 78.7 82.5
Baseline + $\mathcal{L}_{h\_ins}$ 80.5 92.6 97.3 98.4 68.8 82.4 90.4 93.6 38.0 69.1 79.9 83.4
Baseline + $\mathcal{L}_{s\_ins}$ 81.1 93.2 97.5 98.5 68.4 82.0 91.0 93.2 38.1 68.7 79.8 83.7
Baseline + $\mathcal{L}_{h\_ins}$ + $\mathcal{L}_{s\_ins}$ (ICE) 82.3 93.8 97.6 98.4 69.9 83.3 91.5 94.1 38.9 70.2 80.5 84.4
Camera-agnostic memory Market1501 DukeMTMC-reID MSMT17
mAP R1 R5 R10 mAP R1 R5 R10 mAP R1 R5 R10
Baseline 65.8 85.3 95.1 96.6 50.9 67.9 81.6 86.6 24.1 52.3 66.2 71.6
Baseline + $\mathcal{L}_{h\_ins}$ 78.2 91.3 96.9 98.0 65.4 79.6 88.9 91.9 30.3 60.8 72.9 77.6
Baseline + $\mathcal{L}_{s\_ins}$ 47.2 66.7 86.0 91.6 36.2 50.4 70.3 76.3 17.8 38.8 54.2 60.9
Baseline + $\mathcal{L}_{h\_ins}$ + $\mathcal{L}_{s\_ins}$ (ICE) 79.5 92.0 97.0 98.1 67.2 81.3 90.1 93.0 29.8 59.0 71.7 77.0
Table 4: Comparison of different losses. A camera-aware memory occupies up to 6, 8 and 15 times more memory space than a camera-agnostic memory on the Market1501, DukeMTMC-reID and MSMT17 datasets, respectively.
Figure 6: Dynamic cluster numbers during 40 training epochs on DukeMTMC-reID. “hard” and “soft” denote $\mathcal{L}_{h\_ins}$ and $\mathcal{L}_{s\_ins}$, respectively. A lower number denotes that clusters are more compact.
Figure 7: Dynamic KL divergence during 40 training epochs on DukeMTMC-reID. Lower KL divergence denotes that a model is more robust to augmentation perturbation.

4.4 Ablation study

The performance boost of ICE in unsupervised ReID mainly comes from the proposed hard instance contrastive loss and soft instance consistency loss. We conduct ablation experiments to validate the effectiveness of each loss, as reported in Tab. 4. We illustrate the number of clusters during training in Fig. 6 and the t-SNE [29] visualization after training in Fig. 8 to evaluate the compactness of clusters. We also illustrate the dynamic KL divergence of Eq. 12 to measure the representation sensitivity to augmentation perturbation in Fig. 7.

Hard instance contrastive loss.

Our proposed $\mathcal{L}_{h\_ins}$ reduces the intra-class variance in a camera-agnostic manner, which increases the quality of pseudo labels. By reducing intra-class variance, a cluster is supposed to be more compact. With the same clustering algorithm, we expect fewer clusters when clusters are more compact. As shown in Fig. 6, DBSCAN generates more clusters during the training without our proposed $\mathcal{L}_{h\_ins}$. The full ICE framework has fewer clusters, which is closer to the real number of identities in the training set. On the other hand, as shown in Fig. 8, the full ICE framework has better intra-class compactness and inter-class separability than the camera-aware baseline on the test set. This compactness contributes to the better unsupervised ReID performance in Tab. 4.

Soft instance consistency loss.

The hard instance contrastive loss reduces the intra-class variance between naturally captured views, while the soft instance consistency loss mainly reduces the variance from artificially augmented perturbation. If we compare the blue (ICE full) and yellow (w/o soft) curves in Fig. 7, we can see that the model trained without $\mathcal{L}_{s\_ins}$ is less robust to augmentation perturbation. The quantitative results in Tab. 4 confirm that $\mathcal{L}_{s\_ins}$ improves the performance of the baseline. The best performance is obtained by applying both $\mathcal{L}_{h\_ins}$ and $\mathcal{L}_{s\_ins}$ on the camera-aware baseline.

Camera-agnostic scenario.

The above results are obtained with a camera-aware memory, which relies on ground-truth camera ids. We further validate the effectiveness of the two proposed losses with a camera-agnostic memory, whose results are also reported in Tab. 4. Our proposed $\mathcal{L}_{h\_ins}$ significantly improves the performance over the camera-agnostic baseline. However, $\mathcal{L}_{s\_ins}$ cannot be used alone without an intra-class variance constraint. $\mathcal{L}_{h\_ins}$ reduces intra-class variance, so that the anchor is more similar to its pseudo positives than to its pseudo negatives before augmentation in Fig. 4. $\mathcal{L}_{s\_ins}$ ensures that this relationship still holds after augmentation. However, when strong intra-class variance exists (e.g., strong camera style variance), maintaining this relationship amounts to maintaining intra-class variance, which decreases the ReID performance. On medium datasets (e.g., Market1501 and DukeMTMC-reID) without strong camera variance, our proposed camera-agnostic intra-class variance constraint is enough to make $\mathcal{L}_{s\_ins}$ beneficial to ReID. On large datasets (e.g., MSMT17 with 15 cameras) with strong camera variance, the camera-agnostic variance constraint alone is not enough. We provide the dynamic cluster numbers of camera-agnostic ICE in Appendix D.

Figure 8: T-SNE visualization of 10 random classes in DukeMTMC-reID test set between camera-aware baseline (Left) and ICE (Right).
Method Reference Market1501 DukeMTMC-reID MSMT17
mAP R1 R5 R10 mAP R1 R5 R10 mAP R1 R5 R10
Unsupervised Domain Adaptation
ECN [47] CVPR’19 43.0 75.1 87.6 91.6 40.4 63.3 75.8 80.4 10.2 30.2 41.5 46.8
MAR [37] CVPR’19 40.0 67.7 81.9 - 48.0 67.1 79.8 - - - - -
SSG [8] ICCV’19 58.3 80.0 90.0 92.4 53.4 73.0 80.6 83.2 13.3 32.2 51.2 -
MMCL [30] CVPR’20 60.4 84.4 92.8 95.0 51.4 72.4 82.9 85.0 16.2 43.6 54.3 58.9
JVTC [16] ECCV’20 61.1 83.8 93.0 95.2 56.2 75.0 85.1 88.2 20.3 45.4 58.4 64.3
DG-Net++ [49] ECCV’20 61.7 82.1 90.2 92.7 63.8 78.9 87.8 90.4 22.1 48.8 60.9 65.9
ECN+ [48] PAMI’20 63.8 84.1 92.8 95.4 54.4 74.0 83.7 87.4 16.0 42.5 55.9 61.5
MMT [9] ICLR’20 71.2 87.7 94.9 96.9 65.1 78.0 88.8 92.5 23.3 50.1 63.9 69.8
DCML [1] ECCV’20 72.6 87.9 95.0 96.7 63.3 79.1 87.2 89.4 - - - -
MEB [39] ECCV’20 76.0 89.9 96.0 97.5 66.1 79.6 88.3 92.2 - - - -
SpCL [10] NeurIPS’20 76.7 90.3 96.2 97.7 68.8 82.9 90.1 92.5 26.8 53.7 65.0 69.8
Fully Unsupervised
BUC [18] AAAI’19 29.6 61.9 73.5 78.2 22.1 40.4 52.5 58.2 - - - -
SSL [19] CVPR’20 37.8 71.7 83.8 87.4 28.6 52.5 63.5 68.9 - - - -
JVTC [16] ECCV’20 41.8 72.9 84.2 88.7 42.2 67.6 78.0 81.6 15.1 39.0 50.9 56.8
MMCL [30] CVPR’20 45.5 80.3 89.4 92.3 40.2 65.2 75.9 80.0 11.2 35.4 44.8 49.8
HCT [38] CVPR’20 56.4 80.0 91.6 95.2 50.7 69.6 83.4 87.4 - - - -
CycAs [33] ECCV’20 64.8 84.8 - - 60.1 77.9 - - 26.7 50.1 - -
SpCL(agnostic) [10] NeurIPS’20 73.1 88.1 95.1 97.0 65.3 81.2 90.3 92.2 19.1 42.3 55.6 61.2
ICE(agnostic) This paper 79.5 92.0 97.0 98.1 67.2 81.3 90.1 93.0 29.8 59.0 71.7 77.0
CAP(aware)[32] AAAI’21 79.2 91.4 96.3 97.7 67.3 81.1 89.3 91.8 36.9 67.4 78.0 81.4
ICE(aware) This paper 82.3 93.8 97.6 98.4 69.9 83.3 91.5 94.1 38.9 70.2 80.5 84.4
Supervised
PCB [26] ECCV’18 81.6 93.8 97.5 98.5 69.2 83.3 90.5 92.5 40.4 68.2 - -
DG-Net [42] CVPR’19 86.0 94.8 - - 74.8 86.6 - - 52.3 77.2 - -
ICE (w/ ground truth) This paper 86.6 95.1 98.3 98.9 76.5 88.2 94.1 95.7 50.4 76.4 86.6 90.0
Table 5: Comparison of ReID methods on Market1501, DukeMTMC-reID and MSMT17 datasets. The best and second best unsupervised results are marked in red and blue.

4.5 Comparison with state-of-the-art methods

We compare ICE with state-of-the-art ReID methods in Tab. 5. Here, results are reported without any post-processing techniques, e.g., Re-Ranking [44].

Comparison with unsupervised methods.

Previous unsupervised methods can be categorized into unsupervised domain adaptation (UDA) and fully unsupervised methods. We first list state-of-the-art UDA methods, including ECN [47], MAR [37], SSG [8], MMCL [30], JVTC [16], DG-Net++ [49], ECN+ [48], MMT [9], DCML [1], MEB [39], SpCL [10]. UDA methods usually rely on source domain annotation to reduce the pseudo label noise. Without any identity annotation, our proposed ICE outperforms all of them on the three datasets.

Under the fully unsupervised setting, ICE also achieves better performance than state-of-the-art methods, including BUC [18], SSL [19], MMCL [30], JVTC [16], HCT [38], CycAs [33], SpCL [10] and CAP [32]. CycAs leveraged temporal information to assist visual matching, while our method only considers visual similarity. SpCL and CAP are based on proxy contrastive learning, which are considered respectively as camera-agnostic and camera-aware baselines in our method. With a camera-agnostic memory, the performance of ICE(agnostic) remarkably surpasses the camera-agnostic baseline SpCL, especially on Market1501 and MSMT17 datasets. With a camera-aware memory, ICE(aware) outperforms the camera-aware baseline CAP on all the three datasets. By mining hard positives to reduce intra-class variance, ICE is more robust to hard samples. We illustrate some hard examples in Fig. 9, where ICE succeeds to notice important visual clues, , characters in the shirt (1st row), blonde hair (2nd row), brown shoulder bad (3rd row) and badge (4th row).

Comparison with supervised methods.

We further provide two well-known supervised methods for reference: the Part-based Convolutional Baseline (PCB) [26] and the joint Discriminative and Generative Network (DG-Net) [42]. Unsupervised ICE achieves competitive performance with the supervised PCB. If we replace the clustering-generated pseudo labels with ground truth, ICE can be transformed into a supervised method. The supervised ICE is competitive with state-of-the-art supervised ReID methods (e.g., DG-Net), which shows that supervised contrastive learning has potential for future supervised ReID.

Figure 9: Comparison of top 5 retrieved images on Market1501 between CAP [32] and ICE. Green boxes denote correct results, while red boxes denote false results. Important visual clues are marked with red dashes.

5 Conclusion

In this paper, we propose a novel inter-instance contrastive encoding method, ICE, to address unsupervised ReID. Departing from previous proxy-based contrastive ReID methods, we focus on inter-instance affinities to make a model more robust to data variance. We first mine the hardest positive with mini-batch instance pairwise similarity ranking to form a hard instance contrastive loss, which effectively reduces intra-class variance. Smaller intra-class variance contributes to the compactness of clusters. Then, we use mini-batch instance pairwise similarity scores as soft labels to enhance the consistency before and after data augmentation, which makes the model robust to artificial augmentation variance. By combining the proposed hard instance contrastive loss and soft instance consistency loss, ICE significantly outperforms previous unsupervised ReID methods on the Market1501, DukeMTMC-reID and MSMT17 datasets.

References

  • [1] G. Chen, Y. Lu, J. Lu, and J. Zhou (2020) Deep credible metric learning for unsupervised domain adaptation person re-identification. In European Conference on Computer Vision, Cited by: §2, §4.5, Table 5.
  • [2] H. Chen, B. Lagadec, and F. Bremond (2020-03) Learning discriminative and generalizable representations by spatial-channel partition for person re-identification. In The IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §1, §1.
  • [3] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §1, §2, §3.1.
  • [4] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton (2020) Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029. Cited by: §2.
  • [5] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §2.
  • [6] Y. Chen, X. Zhu, and S. Gong (2019) Instance-guided context rendering for cross-domain person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 232–242. Cited by: §2.
  • [7] M. Ester, H. Kriegel, J. Sander, and X. Xu (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Cited by: Appendix C, §3.1.
  • [8] Y. Fu, Y. Wei, G. Wang, Y. Zhou, H. Shi, and T. S. Huang (2019) Self-similarity grouping: a simple unsupervised cross domain adaptation approach for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6112–6121. Cited by: §2, §4.5, Table 5.
  • [9] Y. Ge, D. Chen, and H. Li (2020) Mutual mean-teaching: pseudo label refinery for unsupervised domain adaptation on person re-identification. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2, §3.2, §4.5, Table 5.
  • [10] Y. Ge, F. Zhu, D. Chen, R. Zhao, and H. Li (2020) Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. In Advances in Neural Information Processing Systems, Cited by: Appendix B, Appendix C, §1, §2, §3.1, §4.5, §4.5, Table 5.
  • [11] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020-06) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §2, §3.1, §3.1.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1, §4.2.
  • [13] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §1, §3.3.
  • [14] J. Jia, Q. Ruan, and T. M. Hospedales (2019) Frustratingly easy person re-identification: generalizing person re-id in practice. In BMVC, Cited by: Appendix B.
  • [15] S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. ArXiv abs/1610.02242. Cited by: §2, §3.4.
  • [16] J. Li and S. Zhang (2020) Joint visual and temporal consistency for unsupervised domain adaptive person re-identification. arXiv preprint arXiv:2007.10854. Cited by: §2, §4.5, §4.5, Table 5.
  • [17] S. Lin, H. Li, C. Li, and A. C. Kot (2018) Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification. In BMVC, Cited by: §2.
  • [18] Y. Lin, X. Dong, L. Zheng, Y. Yan, and Y. Yang (2019) A bottom-up clustering approach to unsupervised person re-identification. In AAAI, Cited by: §1, §1, §2, §4.5, Table 5.
  • [19] Y. Lin, L. Xie, Y. Wu, C. Yan, and Q. Tian (2020) Unsupervised person re-identification via softened similarity learning. ArXiv abs/2004.03547. Cited by: §1, §1, §2, §4.5, Table 5.
  • [20] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang (2019-06) Bag of tricks and a strong baseline for deep person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §1, §1.
  • [21] X. Pan, P. Luo, J. Shi, and X. Tang (2018) Two at once: enhancing learning and generalization capacities via ibn-net. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 464–479. Cited by: Appendix B, §4.2.
  • [22] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, Cited by: §4.1.
  • [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, pp. 211–252. Cited by: §4.2.
  • [24] K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel (2020) FixMatch: simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685. Cited by: §2, §3.4.
  • [25] L. Song, C. Wang, L. Zhang, B. Du, Q. Zhang, C. Huang, and X. Wang (2020) Unsupervised domain adaptive re-identification: theory and practice. Pattern Recognition 102, pp. 107173. Cited by: §1, §2.
  • [26] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pp. 480–496. Cited by: §1, §4.5, Table 5.
  • [27] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, Cited by: §2, §3.4.
  • [28] A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. ArXiv abs/1807.03748. Cited by: §3.1.
  • [29] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. External Links: Link Cited by: §4.4.
  • [30] D. Wang and S. Zhang (2020-06) Unsupervised person re-identification via multi-label classification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §4.5, §4.5, Table 5.
  • [31] J. Wang, X. Zhu, S. Gong, and W. Li (2018) Transferable joint attribute-identity deep learning for unsupervised person re-identification. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2275–2284. Cited by: §2.
  • [32] M. Wang, B. Lai, J. Huang, X. Gong, and X. Hua (2021) Camera-aware proxies for unsupervised person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Cited by: Appendix C, §2, §3.1, §3.2, Figure 9, §4.5, Table 5.
  • [33] Z. Wang, J. Zhang, L. Zheng, Y. Liu, Y. Sun, Y. Li, and S. Wang (2020) CycAs: self-supervised cycle association for learning re-identifiable descriptions. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 72–88. External Links: ISBN 978-3-030-58621-8 Cited by: §2, §4.5, Table 5.
  • [34] C. Wei, H. Wang, W. Shen, and A. Yuille (2020) CO2: consistent contrast for unsupervised visual representation learning. ArXiv abs/2010.02217. Cited by: §2, §3.4.
  • [35] L. Wei, S. Zhang, W. Gao, and Q. Tian (2018) Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 79–88. Cited by: §2, §4.1.
  • [36] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §1, §2.
  • [37] H. Yu, W. Zheng, A. Wu, X. Guo, S. Gong, and J. Lai (2019) Unsupervised person re-identification by soft multilabel learning. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2143–2152. Cited by: §2, §4.5, Table 5.
  • [38] K. Zeng, M. Ning, Y. Wang, and Y. Guo (2020) Hierarchical clustering with hard-batch triplet loss for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13657–13665. Cited by: §2, §4.5, Table 5.
  • [39] Y. Zhai, Q. Ye, S. Lu, M. Jia, R. Ji, and Y. Tian (2020) Multiple expert brainstorming for domain adaptive person re-identification. arXiv preprint arXiv:2007.01546. Cited by: §1, §4.5, Table 5.
  • [40] X. Zhang, J. Cao, C. Shen, and M. You (2019) Self-training with progressive augmentation for unsupervised cross-domain person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8222–8231. Cited by: §2.
  • [41] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1116–1124. Cited by: §4.1.
  • [42] Z. Zheng, X. Yang, Z. Yu, L. Zheng, Y. Yang, and J. Kautz (2019) Joint discriminative and generative learning for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.5, Table 5.
  • [43] Z. Zheng and Y. Yang (2021) Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. International Journal of Computer Vision, pp. 1–15. Cited by: §2.
  • [44] Z. Zhong, L. Zheng, D. Cao, and S. Li (2017) Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1318–1327. Cited by: §4.2, §4.5.
  • [45] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2020) Random erasing data augmentation. In AAAI, Cited by: §4.2.
  • [46] Z. Zhong, L. Zheng, S. Li, and Y. Yang (2018-09) Generalizing a person retrieval model hetero- and homogeneously. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • [47] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang (2019) Invariance matters: exemplar memory for domain adaptive person re-identification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §4.5, Table 5.
  • [48] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang (2020) Learning to adapt invariance in memory for person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §4.5, Table 5.
  • [49] Y. Zou, X. Yang, Z. Yu, B. V. K. V. Kumar, and J. Kautz (2020) Joint disentangling and adaptation for cross-domain person re-identification. ArXiv abs/2007.10315. Cited by: §2, §4.5, Table 5.

Appendix A Algorithm Details

The ICE algorithm details are provided in Algorithm 1.

Input: Unlabeled dataset $X$, ImageNet pre-trained online encoder $\theta$, ImageNet pre-trained momentum encoder $\hat{\theta}$, maximal epoch $E_{max}$ and maximal iteration $I_{max}$.
Output: Momentum encoder $\hat{\theta}$ after training.
1 for epoch = 1 to $E_{max}$ do
2       Encode $X$ into momentum representations with the momentum encoder $\hat{\theta}$;
3       Re-rank and generate clustering pseudo labels $Y$ on the momentum representations with DBSCAN;
4       Calculate cluster proxies with Eq. (4) and camera proxies with Eq. (6) based on $Y$;
5       for iteration = 1 to $I_{max}$ do
6             Calculate inter-instance similarities in a mini-batch;
7             Train $\theta$ with the total loss in Eq. (3), which combines the proxy contrastive loss in Eq. (8), the hard instance contrastive loss in Eq. (9) and the soft instance consistency loss in Eq. (12);
8             Update $\hat{\theta}$ with Eq. (2);
9       end for
10 end for
Algorithm 1 Inter-instance Contrastive Encoding (ICE) for fully unsupervised ReID.

Appendix B Backbone Network

Instance-batch normalization (IBN) [21] has shown better performance than regular batch normalization in unsupervised domain adaptation [21, 10] and domain generalization [14]. We compare the performance of ICE with ResNet50 and IBN-ResNet50 backbones in Tab. 6. The performance of our proposed ICE can be further improved with an IBN-ResNet50 backbone network.

Appendix C Threshold in clustering

In DBSCAN [7], the distance threshold is the maximum distance between two samples for one to be considered as in the neighborhood of the other. A smaller distance threshold is likely to make DBSCAN mark more hard positives as different classes. On the contrary, a larger distance threshold makes DBSCAN mark more hard negatives as the same class.

In the main paper, the distance threshold for DBSCAN between same-cluster neighbors is set to 0.55, which is a trade-off value for the Market1501, DukeMTMC-reID and MSMT17 datasets. To get a better understanding of how sensitive ICE is to the distance threshold, we vary the threshold from 0.5 to 0.6. As shown in Tab. 7, a smaller threshold is more appropriate for the relatively smaller dataset Market1501, while a larger threshold is more appropriate for the relatively larger dataset MSMT17. The state-of-the-art unsupervised ReID methods SpCL [10] and CAP [32] each use their own distance threshold. Our proposed ICE can always outperform SpCL and CAP on the three datasets with a threshold between 0.5 and 0.6.

Figure 10: Dynamic cluster numbers of ICE(agnostic) during 40 training epochs on DukeMTMC-reID. A lower number denotes that clusters are more compact (less intra-cluster variance).

Appendix D Camera-agnostic scenario

As mentioned in the main paper, we provide the dynamic cluster numbers of camera-agnostic ICE during the training in Fig. 10. The red curve is trained without the hard instance contrastive loss as an intra-class variance constraint. In this case, the soft instance consistency loss maintains high intra-class variance, which leads to less compact clusters. The orange curve is trained without $\mathcal{L}_{s\_ins}$; it has fewer clusters in the early epochs but more clusters in the final epochs than the blue curve. The blue curve is trained with both $\mathcal{L}_{h\_ins}$ and $\mathcal{L}_{s\_ins}$; its cluster number is the most accurate among the three curves in the final epochs. Fig. 10 confirms that combining $\mathcal{L}_{h\_ins}$ and $\mathcal{L}_{s\_ins}$ reduces naturally captured and artificially augmented view variance at the same time, which gives the optimal ReID performance.