Dual-Refinement: Joint Label and Feature Refinement for Unsupervised Domain Adaptive Person Re-Identification

12/26/2020 · Yongxing Dai et al. · National University of Singapore · Singapore University of Technology and Design · Peking University

Unsupervised domain adaptive (UDA) person re-identification (re-ID) is a challenging task due to the absence of labels for the target domain data. To handle this problem, some recent works adopt clustering algorithms to generate pseudo labels off-line, which can then be used as the supervision signal for on-line feature learning in the target domain. However, the off-line generated labels often contain substantial noise that significantly hinders the discriminability of the on-line learned features, and thus limits the final UDA re-ID performance. To this end, we propose a novel approach, called Dual-Refinement, that jointly refines pseudo labels at the off-line clustering phase and features at the on-line training phase, to alternately boost the label purity and feature discriminability in the target domain for more reliable re-ID. Specifically, at the off-line phase, a new hierarchical clustering scheme is proposed, which selects representative prototypes for every coarse cluster. Thus, labels can be effectively refined by using the inherent hierarchical information of person images. Besides, at the on-line phase, we propose an instant memory spread-out (IM-spread-out) regularization, which takes advantage of the proposed instant memory bank to store sample features of the entire dataset and enables spread-out feature learning over the entire training data instantly. Our Dual-Refinement method reduces the influence of noisy labels and refines the learned features within the alternating training process. Experiments demonstrate that our method outperforms the state-of-the-art methods by a large margin.


I Introduction

Person re-identification (re-ID) aims at identifying images of the same person as the query in a gallery database across disjoint cameras. Due to its great value in practical applications concerning public security and surveillance, person re-ID has attracted increasing attention in the research community [69, 61]. Most existing works on person re-ID focus on fully supervised scenarios [50, 53, 44, 33, 3, 49, 34], where superior performance has been obtained on some benchmarks because the model is trained and tested in the same domain. However, performance often drops dramatically when models trained on a labeled source domain are directly applied to an unlabeled target domain, owing to the domain gap. To handle the domain gap in the cross-domain re-ID scenario, plenty of unsupervised domain adaptation (UDA) approaches [60, 58] have recently been proposed. Many UDA re-ID methods adopt clustering algorithms [42, 10, 64, 59, 12] to generate pseudo labels for the unlabeled target domain data, which can then be used as the supervision signal for model training.

Concretely, in clustering-based UDA methods [42, 10, 64, 59, 12], the model is often pre-trained on the labeled source domain data in a fully supervised manner. The model is then fine-tuned on the unlabeled target domain data in an alternating training manner comprising an off-line pseudo label generation phase and an on-line feature learning phase. Specifically, at the off-line phase, pseudo labels are generated by clustering the features of the target domain samples, which are extracted with the trained model. At the on-line phase, the model is trained under the supervision of the pseudo labels generated at the off-line phase. The off-line label generation and on-line model training are conducted alternately and iteratively over the whole learning process. However, the pseudo labels generated at the off-line phase often contain noise, which directly affects the on-line feature learning performance [12]. Meanwhile, the discriminability of the on-line learned features, in turn, affects the off-line pseudo label generation at the next epoch. Thus, we need to alleviate the influence of label noise at both the off-line and on-line phases to improve UDA re-ID performance. Therefore, in this paper, we propose a novel approach called Dual-Refinement to jointly refine the off-line pseudo label generation and the on-line feature learning under noisy pseudo labels, during an alternating training process.

To refine the off-line pseudo labels, we design a novel hierarchical clustering scheme at the off-line phase. Existing clustering-based UDA methods [42, 10, 64] generally perform clustering based on the local similarities among samples [8] to assign pseudo labels, i.e., a sample's nearest neighbors are more likely to be grouped into the same cluster. However, such sample neighborhoods tend to ignore the global and inherent characteristics of each cluster, due to the high intra-cluster variance caused by different poses or viewpoints of a person. As shown in Fig. 1 (a), when only the local similarities are considered, a sample can easily be grouped into the wrong cluster, because it may share the same pose and viewpoint with, and thus appear very similar to, a sample of that cluster. As a result, off-line pseudo label assignment based on such coarse clustering often introduces label noise.

To refine the noisy pseudo labels, we propose to consider the global characteristics of every coarse cluster by selecting several representative prototypes within it. Specifically, we propose a new hierarchical clustering method, in which we perform fine clustering after the coarse clustering, and the fine sub-cluster centers serve as the representative prototypes of every coarse cluster. Compared with local similarities, the average similarity between a sample and these representative prototypes is more powerful in capturing each person's global and inherent information. It therefore provides a more robust criterion for off-line pseudo label assignment and mitigates the label noise issue. As shown in Fig. 1 (a), when the global characteristics within every cluster are considered, the sample is correctly grouped into cluster A, because its average similarity to cluster A's prototypes is 0.6, which is larger than the 0.5 achieved with cluster B's prototypes. Thus, by considering the global characteristics within each coarse cluster, we can assign more reliable pseudo labels.

Fig. 1: (a) In our proposed hierarchical clustering method, clusters A and B are first obtained by coarse clustering. Then the cluster centers obtained by further fine clustering are selected as the representative prototypes of each coarse cluster; here each of clusters A and B has three representative prototypes. Similarity histograms denote the similarity between the sample and each prototype. Though the sample appears more similar to its single nearest neighbor in cluster B, its average similarity to cluster A's prototypes is 0.6, higher than its average similarity to cluster B's prototypes. By considering the global characteristics based on the representative prototypes, we can exploit more inherent and robust similarities. (b) Two-dimensional visualization of the feature space. Points and triangles denote two different clusters. Dashed ovals represent class decision boundaries. Arrows represent spreading out the features. By enforcing the spread-out property in on-line feature learning, the effect of noisy pseudo labels can be alleviated.

At the on-line phase, to refine the feature learning and alleviate the effects of noisy pseudo labels, we propose an instant memory spread-out (IM-spread-out) regularization scheme in our Dual-Refinement method. Since the pseudo labels can still contain noise, directly using them to supervise the metric learning (classification and triplet losses) limits the on-line feature learning performance. As shown in Fig. 1 (b), label noise originates from the off-line clustering, which aims at discovering class decision boundaries [2, 21], and noisy samples tend to be located near the decision boundaries. These noisy samples thus confuse the on-line feature learning and limit the discriminability of the learned features. To pull noisy samples away from the decision boundaries and boost feature discriminability, we apply our IM-spread-out regularization during on-line feature learning, as shown in Fig. 1 (b). The spread-out property does not break the inherent characteristics of reliable samples, because reliable samples are inherently compact within a cluster and thus remain robust under our IM-spread-out regularization.

To effectively capture the global distribution, we enforce the spread-out property on the whole training dataset. Specifically, we consider every sample in the target domain as an instance, and our IM-spread-out regularization satisfies positive-centered and negative-separated properties, where a sample's nearest neighbors are treated as its positives and all the remaining samples of the entire dataset as its negatives. However, it is hard to enforce spread-out constraints on the whole training data with mini-batch training, since a mini-batch only captures the local data distribution. A possible solution is to use a memory bank [54, 55] to store the features of all the samples. However, existing memory bank mechanisms [54, 55] are updated with outdated features [18, 75] without back-propagation, and thus are not suitable for our on-line feature learning. Therefore, to enable an effective spread-out constraint on the whole training data and alleviate the effects of noisy labels at the on-line stage, we propose a new instant memory bank that is updated together with the encoder by back-propagating gradients. Our instant memory bank memorizes the sample features as they are fed into the bank, and is updated together with the network at each training iteration. This means our memory bank always stores the instant features (rather than outdated features) of all the samples, and thus effectively and efficiently captures the global distribution. Thanks to the instant memory bank, our proposed IM-spread-out regularization effectively alleviates the effects of the noisy supervision signal during each training iteration of the on-line stage, and further boosts the discriminability of the on-line features.
To the best of our knowledge, this is the first memory bank mechanism that is able to maintain the instant features of all the training samples during network optimization iterations, which thus greatly facilitates our spread-out scheme for feature optimization based on the global distribution. The on-line feature refinement and off-line pseudo label refinement are conducted in an alternative manner. Finally, the trained model can generalize well in the target domain.

The major contributions can be summarized as follows:

  • We propose a novel approach, called Dual-Refinement, to alleviate the pseudo label noise in clustering-based UDA re-ID, including the off-line pseudo label refinement to assign more accurate labels and the on-line feature refinement to alleviate the effects of noisy supervision signal.

  • We design a hierarchical clustering scheme to select representative prototypes for every coarse cluster, which captures more global and inherent characteristics of each person and thus refines the pseudo labels at the off-line phase.

  • We propose an IM-spread-out regularization scheme to alleviate the effects of pseudo label noise at the on-line phase, thus improving feature discriminability in the target domain. Moreover, a novel instant memory bank is proposed to store instant features, enabling the spread-out property to be enforced on the whole target training dataset.

  • Extensive experiments have shown that our method outperforms state-of-the-art UDA approaches by a large margin.

II Related Work

II-A Unsupervised Domain Adaptation

The existing general UDA methods fall into two main categories: closed-set UDA [47, 11, 31, 32, 23] and open-set UDA [36, 41]. In closed-set UDA, the target and source domains share exactly the same classes. Most closed-set UDA works [47, 11, 31] try to learn domain-invariant features so that the class decision boundary generalizes well to the target domain. Long et al. propose DAN [31] and RTN [32] to minimize the Maximum Mean Discrepancy (MMD) across domains. In open-set UDA [36, 41], the target and source domains share only part of their classes. Saito et al. [41] use adversarial training to align target samples with known source samples or recognize them as an unknown class. All the general UDA methods mentioned above assume that the source and target domains share all or part of their classes in an image classification scenario, which makes them difficult to apply directly to UDA person re-ID tasks.

II-B Unsupervised Domain Adaptation for Person re-ID

The unsupervised domain adaptation methods for person re-ID fall mainly into two groups: GAN-based [6, 52, 74] and clustering-based [10, 64, 59, 42, 12]. SPGAN [6] and PTGAN [52] use CycleGAN [77] to translate the style of the source domain to the target domain and conduct feature learning with the source domain labels. HHL [74] uses StarGAN [4] to learn features with camera invariance and domain connectedness. UDAP [42] first proposes the clustering-based UDA framework for re-ID. SSG [10] and PCB-PAST [64] bring information from both the global body and local parts into the clustering-based framework. Some clustering-based methods [12, 59, 64] are devoted to solving the pseudo label noise problem. MMT [12] proposes an on-line peer-teaching framework to refine the noisy pseudo labels, using reliable on-line soft labels generated from the temporally averaged model of one network to supervise the training of another network. However, both MMT [12] and ACT [59] introduce extra networks that bring additional noise, and they are not memory-efficient. Besides, PCB-PAST [64] proposes a ranking-based triplet loss to alleviate the influence of label noise in on-line metric learning, but it only considers the local data distribution within mini-batches. Other works such as ECN [75] and ECN+GPP [76] use a traditional memory bank [57] to treat every sample as an instance and learn invariant features. However, these methods need additional images generated by GANs, and the features stored in a traditional memory bank are outdated [18].

Different from all the UDA re-ID methods mentioned above, we propose a novel Dual-Refinement method that jointly refines the pseudo labels at the off-line stage and alleviates the effects of noisy labels at the on-line stage to refine the on-line features. Moreover, different from traditional memory mechanisms [57, 75], which update the memory with outdated features and thus may cause inconsistencies in feature learning due to discrepancies between the memory updating and the encoder updating, we also propose a novel instant memory bank that is instantly updated with the encoder by back-propagating gradients, and thus effectively enforces the spread-out property on the whole training data and captures the global characteristics of the whole target domain.

II-C Learning with Noisy Labels

Existing works on learning with noisy labels can be categorized into four main groups. Methods in the first category focus on learning a label transition matrix [35, 14, 38, 56]. However, it is hard to estimate the noise transition for UDA re-ID because the classes in the target domain are unknown. The second category [13, 66, 45] designs loss functions robust to label noise, but these bring extra constraints, such as the mean absolute error term in GCE [66]. The third category [25, 16, 22] utilizes additional networks to refine the noisy labels. Co-teaching [16] uses two networks in a co-trained manner. These methods need extra networks and complicated sample selection strategies. The last category [40, 46, 17] learns from noisy labels in a self-training manner. Han et al. [17] design a complicated class-prototype selection strategy to train with samples robust to ground-truth noise, while we utilize the inherent hierarchical structure to cluster and assign labels. Different from the aforementioned methods, which target the image classification scenario, we propose a hierarchical clustering scheme that captures the diversity within every coarse cluster and can thus handle the label noise issue in UDA re-ID.

Fig. 2: The framework of our method. After initializing the CNN by pre-training it on the labeled source domain, we train our network in the target domain in an alternating manner comprising two stages. At the beginning of every training epoch, we conduct the off-line stage, where we use the trained model to extract all the sample features and then perform hierarchical clustering on these features to assign pseudo labels. Then, we conduct the on-line stage, where we use the pseudo labels generated at the off-line stage to fine-tune the model with the classification loss and triplet loss, together with a label-free IM-spread-out regularization. These two stages are performed alternately and iteratively in the target domain. The instant memory bank is used to store instant sample features. The spread-out feature learning aims to separate the features of different samples and concentrate each sample's feature with its corresponding memory entry in the feature space. Best viewed in color.

II-D Feature Embedding with Spread-out Property

Feature embedding learning with the spread-out property has improved performance in deep local feature learning [65], unsupervised embedding learning [62], and face recognition [29, 67, 7]. Zhang et al. [65] propose a Global Orthogonal Regularization to fully utilize the feature space by making negative pairs close to orthogonal. Ye et al. [62] use a siamese network to learn data-augmentation-invariant and instance spread-out features under instance-wise supervision. These works [65, 62] only guarantee the spread-out property of features within a mini-batch. Specifically, in [62], only the augmented sample is treated as the positive and the remaining samples within a mini-batch as the negatives, and the limited number of positive and negative samples may lead to less discriminative feature learning. Other works in face recognition [29, 67, 7] use a regularization term on the classifier weights to make class centers spread out in the holistic feature space, but they require supervision from ground-truth labels. Unlike the above works, our IM-spread-out regularization is used to alleviate the influence of noisy labels on on-line metric learning for UDA re-ID. Besides, we use an instant memory bank to enforce the spread-out property on the entire training data instead of a mini-batch, where the positive samples are the top-k similar ones selected from the memory bank and all the remaining samples of the whole training data are the negatives. Thanks to the instant memory bank, the samples' diversity can further boost spread-out feature learning.

III Approach

In unsupervised domain adaptive person re-ID, we are given a labeled source domain dataset and an unlabeled target domain dataset. The source dataset contains person images, each annotated with an identity label. The target dataset contains unlabeled person images. Our goal is to use both the labeled source data and the unlabeled target data to learn discriminative image representations in the target domain.

III-A Overview of Framework

As shown in Fig. 2, the framework of our method contains two stages: the off-line pseudo label refinement stage and the on-line feature learning stage. The network (CNN) is initialized by pre-training on source domain data, following a method similar to [33]. The CNN is a deep feature encoder that encodes a person image into a fixed-dimensional feature.

At the off-line stage, we propose a hierarchical clustering scheme for the target domain features extracted by the network (CNN) trained in the last epoch. By clustering, we assign samples in the same cluster the same pseudo label, so each target domain sample receives two kinds of pseudo labels: a coarse noisy label and a refined label. The off-line pseudo labels are then used for on-line feature learning.

At the on-line stage, we use samples with noisy labels to train with the classification loss and triplet loss, and samples with refined labels to train with the corresponding refined losses. FC is a fully connected layer followed by a softmax function, which serves as the identity classifier. To alleviate the influence of pseudo label noise in on-line supervised metric learning, we propose a label-free IM-spread-out regularization scheme that is trained together with the classification loss and triplet loss [20]. Specifically, we propose a novel instant memory bank to store all the samples' features instantly. The instant memory bank thus enforces the spread-out property on the whole target training dataset instead of a mini-batch, capturing the global characteristics of the target domain distribution. We conduct the off-line stage at the beginning of every epoch and the on-line stage during every epoch, in an alternating and iterative manner. For simplicity, we omit the superscript for target domain data in the following sections.

Below, we first introduce the general clustering-based UDA procedure. We then introduce our Dual-Refinement method, which optimizes both stages of the clustering-based UDA procedure, i.e., jointly refines the off-line pseudo labels and the on-line features, and thus improves the overall UDA performance.

III-B General Clustering-based UDA Procedure

Existing clustering-based UDA methods [42, 12, 64] for re-ID usually pre-train the backbone network with the classification loss and triplet loss on the labeled source dataset, and then use the pre-trained model to initialize training on the target dataset. We follow a method similar to [33] to pre-train the backbone model on the labeled source domain, and then perform the training procedure on the target domain, which contains two stages: (1) off-line assignment of pseudo labels based on clustering at the beginning of every training epoch; (2) on-line training of the network with metric learning losses on the pseudo-labeled target domain data during every training epoch. These two stages are conducted alternately and iteratively throughout the training process.

III-B1 Off-line assigning pseudo labels

Following the existing clustering-based UDA methods [42, 64, 10], we first extract the features of all images in the target domain using the CNN trained in the last epoch, and then calculate the pair-wise similarity between two samples by:

(1)

where the set in Eq. (1) is the k-reciprocal nearest neighbor set of a sample, as introduced in [72]. We then use the pair-wise similarity to calculate the Jaccard distance:

(2)

We use this distance metric to perform DBSCAN clustering [8] on the target domain and obtain a number of clusters. We consider each cluster as a unique class and assign the same pseudo label to all samples belonging to the same cluster, yielding a pseudo-labeled target domain dataset.
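The symbols in Eq. (1) (2) were lost in this copy; for concreteness, a hedged reconstruction assuming the standard k-reciprocal encoding of [72] (the notation below is ours, not necessarily the paper's):

```latex
% Hedged reconstruction; R(x, k) denotes the k-reciprocal nearest
% neighbor set of sample x, as in [72].
S(x_i, x_j) = \frac{\lvert R(x_i, k) \cap R(x_j, k) \rvert}
                   {\lvert R(x_i, k) \cup R(x_j, k) \rvert}
\qquad \text{(cf. Eq. (1))}

d_J(x_i, x_j) = 1 - S(x_i, x_j)
\qquad \text{(cf. Eq. (2))}
```

Under this form, two samples are close in Jaccard distance when their k-reciprocal neighbor sets overlap heavily, which is what DBSCAN then clusters on.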

III-B2 On-line model training with metric learning losses

In this step, we use the target dataset labeled with pseudo labels to train the network with the classification loss and triplet loss [20], which are formulated as follows:

(3)
(4)

where the classification loss is the cross-entropy between the outputs of the classification layer FC and the pseudo labels, and the triplet loss uses the features of the hardest positive and hardest negative of each sample, found through batch hard mining [20] under the supervision of the pseudo labels, together with a margin. We denote this general UDA re-ID method as the baseline in this paper and fine-tune the network with an overall loss obtained by combining the two losses:

(5)
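As a reference, these losses typically take the following standard forms in clustering-based UDA re-ID (a hedged reconstruction; the batch size N_b, features f, pseudo labels ỹ, and margin m are our notation):

```latex
\mathcal{L}_{cls} = -\frac{1}{N_b} \sum_{i=1}^{N_b} \log p\big(\tilde{y}_i \mid x_i\big)

\mathcal{L}_{tri} = \frac{1}{N_b} \sum_{i=1}^{N_b}
  \Big[\, m + \max_{\tilde{y}_j = \tilde{y}_i} \lVert f_i - f_j \rVert_2
           - \min_{\tilde{y}_k \neq \tilde{y}_i} \lVert f_i - f_k \rVert_2 \,\Big]_{+}

\mathcal{L}_{baseline} = \mathcal{L}_{cls} + \mathcal{L}_{tri}
```

The max/min over same-label and different-label samples within the mini-batch is the batch hard mining of [20].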

III-C Off-line Pseudo Label Refinement

At the off-line stage, we design a hierarchical-clustering-guided label refinement strategy. The refining process contains two stages of clustering, from coarse to fine. As shown in Fig. 2, each coarse cluster contains several sub-clusters. If only one-stage coarse clustering is employed, many noisy samples will be assigned false labels, because coarse clustering overlooks the high variance within a cluster, as shown in Fig. 1 (a). During the second stage (fine clustering), we select the sub-cluster centers as the representative prototypes of each coarse cluster and reassign cluster labels based on the average similarity between a sample and the prototypes of each coarse cluster. This captures the more global characteristics of each person and thus provides more robust pseudo labels.

Following the pseudo label assignment in Section III-B, we obtain the coarse clusters. We extract the features of each coarse class and then perform k-means clustering [30] on every coarse class's feature set, i.e., each coarse cluster is split into R sub-clusters. After such fine clustering, we take the center of every sub-cluster as a prototype, obtaining R representative prototypes for every coarse cluster. By considering all the prototypes, we capture more global and inherent characteristics of every coarse cluster.

As shown in Fig. 1 (a), the global characteristics within each coarse cluster help assign more reliable pseudo labels. Thus, we define a refined similarity score of each sample with respect to each class, calculated by

(6)

where the sample feature and the prototypes are all L2-normalized. This similarity score can be viewed as the average similarity between the sample and the R representative prototypes of a class. Instead of assigning the coarse label to all samples within a cluster, our refined similarity score provides a more reliable similarity between every sample and every coarse cluster. With this similarity, we reassign more reliable labels to the target data by

(7)

which takes the class with the highest refined similarity score as the refined pseudo label of the sample. Now every target sample has two kinds of pseudo labels: the coarse noisy label and the refined label.
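The refinement step can be sketched in a few lines of NumPy. This is an illustrative simplification (the function name, array shapes, and toy data are ours), assuming the prototypes have already been obtained by k-means within each coarse cluster:

```python
import numpy as np

def refine_labels(feats, prototypes):
    """Reassign pseudo labels by average similarity to each coarse
    cluster's representative prototypes (hypothetical sketch).

    feats:      (N, d) L2-normalized sample features
    prototypes: dict {cluster_id: (R, d) L2-normalized sub-cluster centers}
    returns:    (N,) refined pseudo labels
    """
    cluster_ids = sorted(prototypes)
    # (N, C): average cosine similarity of every sample to every
    # coarse cluster's R prototypes -- the spirit of Eq. (6).
    scores = np.stack(
        [(feats @ prototypes[c].T).mean(axis=1) for c in cluster_ids],
        axis=1,
    )
    # The spirit of Eq. (7): pick the most similar coarse cluster.
    return np.array(cluster_ids)[scores.argmax(axis=1)]
```

Averaging over all R prototypes, rather than trusting a single nearest neighbor, is what makes the reassignment robust to one prototype that happens to share a pose or viewpoint with the sample.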

III-D On-line Feature Refinement

III-D1 Metric learning with pseudo labels

We can use the refined labels to optimize the network with the metric learning losses obtained by replacing the noisy pseudo labels with our refined labels in Eq. (3) (4). We combine the metric losses under the supervision of both noisy and refined pseudo labels by:

(8)

where a weighting parameter controls the relative contribution of the refined and noisy pseudo labels to both the classification loss and the triplet loss. Moreover, keeping some weight on the noisy labels prevents mistaking hard samples for noisy samples, i.e., refining pseudo labels excessively, because the clustering quality is low at the early training stage.
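Eq. (8) plausibly combines the two supervision signals with a single weight λ (a hedged reconstruction; the tilde marks losses computed with the refined labels):

```latex
\mathcal{L}_{metric} = (1 - \lambda)\,\big(\mathcal{L}_{cls} + \mathcal{L}_{tri}\big)
                     + \lambda\,\big(\tilde{\mathcal{L}}_{cls} + \tilde{\mathcal{L}}_{tri}\big)
```

With λ < 1, some weight stays on the coarse labels, matching the text's note that this prevents mistaking hard samples for noisy ones while early-stage clustering is still unreliable.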

III-D2 IM-spread-out regularization with instant memory bank

Although we have designed the off-line pseudo label refinement strategy, it is not possible to eliminate all label noise. To alleviate the effects of noisy labels on feature learning, we propose a label-free regularization that aims to spread out the features over the whole feature space and pull samples assigned false labels away from those classes. Though the spread-out property computed within mini-batches has shown its effectiveness in recent works [7, 62], in our task, to capture the whole characteristics of the target domain for a more effective spread-out, we enforce the spread-out property on the entire target training dataset instead of a mini-batch.

To capture the whole characteristics of the target domain, one possible solution is to use a memory bank [54, 55] to store the features of all the samples. However, traditional memory bank mechanisms [57, 54, 75] can only memorize outdated features [18], because each entry in a traditional memory bank is updated only once per epoch, while the network is continuously updated at every iteration. This discrepancy between the memory updating and the encoder updating leads to inconsistencies in on-line feature learning [18]. Hence, such memory bank methods are not suitable for our on-line feature refinement problem. To this end, we propose a new instant memory bank that stores the instant features of all the samples, i.e., all the entries in the bank are updated together with the network at every iteration.

As shown in Fig. 2, we propose an instant memory bank in which each entry is a feature-dimensional vector approximating the feature of one sample; the bank thus memorizes the approximated features of the entire dataset. For simplicity, all features and memory entries are L2-normalized. To make the memory entries approximate the sample features accurately, the similarity between each entry and its corresponding feature should be as large as possible, i.e., close to 1.

The spread-out property means that the features of any two different samples in the entire training dataset should be dissimilar, i.e., their similarity should be close to -1. To further improve the discriminability of the learned features, the features should satisfy not only the spread-out property but also a positive-centered property. We assume that the k-nearest neighbors of a sample in the memory belong to the same class. We consider the samples in this neighbor set, along with the sample itself, as its positives, and all samples outside the set as its negatives. To concentrate the positives and push the negatives far away, our IM-spread-out regularization is formulated as:

(9)

where a margin is used to spread the negatives. The calculation of Eq. (9) with the instant memory bank can easily be implemented with standard matrix operations, which is training-efficient. The gradients of the regularization loss with respect to the memory entries are derived as follows:

(10)
(11)
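The instant-update behavior can be illustrated with a small NumPy sketch. This is a hypothetical simplification (the class name, the hinge-style loss, and the hyper-parameters are ours; the paper's exact loss of Eq. (9) and gradients of Eq. (10) (11) may differ, and the encoder-side gradient is omitted for brevity):

```python
import numpy as np

def normalize(x):
    """L2-normalize rows (or a single vector)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

class InstantMemoryBank:
    """Sketch of an instant memory bank: every entry is adjusted by the
    gradient of a spread-out loss at every iteration, not once per epoch."""

    def __init__(self, num_samples, dim, lr=0.5, margin=0.3, k=2):
        rng = np.random.default_rng(0)
        self.M = normalize(rng.normal(size=(num_samples, dim)))
        self.lr, self.margin, self.k = lr, margin, k

    def spread_out_step(self, f, i):
        """One hinge-style step for unit-norm feature f of sample i:
        pull f's own entry and its k nearest entries (assumed positives)
        toward f, push entries whose similarity exceeds `margin` away."""
        sims = self.M @ f                       # cosine similarities
        order = np.argsort(-sims)
        pos = set(order[: self.k].tolist()) | {i}
        grad = np.zeros_like(self.M)
        for j in range(len(self.M)):
            if j in pos:
                grad[j] = -f                    # d/dm_j of (1 - m_j . f)
            elif sims[j] > self.margin:
                grad[j] = f                     # d/dm_j of (m_j . f - margin)
        self.M = normalize(self.M - self.lr * grad)
```

After one step, the similarities of the assumed positives to f rise and those of margin-violating negatives fall, while every entry stays on the unit sphere; a full implementation would instead register the bank as network parameters so the same gradients flow through back-propagation.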
Algorithm 1: Alternative Training Procedure of Our Dual-Refinement Method

Input: Labeled source dataset; unlabeled target dataset; feature encoder pretrained on ImageNet; identity classifier; instant memory bank; maximum number of training epochs; maximum number of training iterations per epoch.
Output: Optimized feature encoder for the target domain.

1. Pretrain the feature encoder on the labeled source dataset with the classification loss and triplet loss.
2. Use the source-pretrained encoder to extract all target samples' features and initialize each instant memory bank entry with the corresponding feature.
3. For each training epoch:
   a. // Off-line pseudo label generation and refinement
      Extract the features of the target dataset with the current encoder and calculate the Jaccard distance by Eq. (1) (2);
      Perform DBSCAN clustering and assign the coarse pseudo labels;
      Perform fine clustering and assign the refined pseudo labels by Eq. (6) (7) to obtain the refined target dataset.
   b. // On-line feature learning and refinement
      For each training iteration:
         Sample a mini-batch from the refined target dataset;
         Update the feature encoder, classifier, and instant memory bank by back-propagating the gradients of the overall loss (Eq. (12)).
4. Return the feature encoder.

All the entries in our instant memory bank are updated instantly by Eq. (10) (11) together with the network at every training iteration, i.e., the bank always holds the instant features of all the training samples, and thus effectively captures the characteristics of the whole target domain distribution in real time.

By performing spread-out regularization on the instant features of all training samples through our instant memory bank, the effects of the noisy pseudo labels on on-line feature learning are alleviated and the features' discriminability is further boosted. Note that our spread-out loss can be seen as a variant of the circle loss [43], yet it differs significantly from the original, which must be trained on mini-batches in a fully supervised manner.

III-D3 Overall loss

By combining Eq. (8), supervised by pseudo labels, with the label-free regularization of Eq. (9), the overall objective loss is formulated as:

(12)

where the two weights balance the losses. Our proposed off-line pseudo label refinement and on-line feature refinement are conducted alternately and iteratively over the whole learning process. Details of the overall training procedure are given in Algorithm 1.

IV Experiments

IV-A Datasets and Evaluation Protocol

We conduct experiments on three large-scale person re-ID datasets, namely Market1501 [68], DukeMTMC-ReID [70], and MSMT17 [52]. The mean average precision (mAP) and the Cumulative Matching Characteristic (CMC) curve [15] are used as evaluation metrics. Specifically, we report the rank-1 (R1), rank-5 (R5), and rank-10 (R10) accuracies from the CMC curve. No post-processing such as re-ranking [72] is applied at the testing stage.
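As a reference for how these metrics behave, here is a minimal single-query sketch of average precision and CMC (our own illustration, not the paper's evaluation code):

```python
import numpy as np

def ap_and_cmc(dist, good):
    """Single-query re-ID evaluation sketch.

    dist : (G,) distances from the query to every gallery image
    good : (G,) boolean mask of gallery images sharing the query identity
    Returns (average precision, cumulative matching flags over ranks).
    """
    order = np.argsort(dist)                 # rank gallery by distance
    hits = good[order]                       # relevance at each rank
    ranks = np.flatnonzero(hits) + 1         # 1-based ranks of true matches
    precisions = np.cumsum(hits)[hits] / ranks
    ap = precisions.mean() if ranks.size else 0.0
    cmc = (np.cumsum(hits) > 0).astype(int)  # 1 from first correct rank on
    return ap, cmc
```

mAP is then the mean of `ap` over all queries, and rank-k accuracy is the mean of `cmc[k-1]`.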

Market1501 [68] contains 32,668 labeled images of 1,501 identities captured from 6 different camera views. All person images are detected by a Deformable Part Model. The training set consists of 12,936 images of 751 identities and the testing set consists of 19,732 images of 705 identities.

DukeMTMC-ReID [70] contains 36,411 labeled images of 1,404 identities captured from 8 different camera views; it is a subset of the DukeMTMC dataset. The training set consists of 16,522 images of 702 identities. The testing set consists of 2,228 query images of the other 702 identities and 17,661 gallery images.

MSMT17 [52] is a large-scale dataset containing 126,441 images of 4,101 identities. The images are captured by 15 cameras (12 outdoor, 3 indoor) over 4 days. The bounding box of every person is detected by Faster R-CNN. The training set contains 32,621 images of 1,041 identities, and the testing set contains 11,659 query images and 82,161 gallery images. MSMT17 poses a greater challenge to cross-domain person re-ID than the other two datasets mentioned above.

IV-B Implementation Details

We utilize ResNet50 [19] pre-trained on ImageNet [5] as the backbone network. We add a batch normalization (BN) layer followed by ReLU after the global average pooling (GAP) layer. The stride of the last residual layer is set to 1. The identity classifier is a fully connected (FC) layer followed by a softmax function. All images are resized to a fixed size. For data augmentation, we perform random cropping, random flipping, and random erasing [73]. The margin of the triplet loss is set to 0.5, and the margin of our IM-spread-out regularization is set to 0.35. If not specified, the two weights balancing the joint loss in Eq. (12) are kept fixed. We use the Adam [24] optimizer with weight decay and momentum to train the network. In the pre-training stage, the learning rate follows a warmup strategy, increasing linearly during the first 10 epochs; it is then divided by 10 at the 40th and 70th epochs, in a total of 80 epochs. We set the batch size to 64 in all experiments. When training on target data, the learning rate is divided by 10 at the 20th epoch, in a total of 40 epochs. During testing, we extract the normalized feature after the BN layer and use the Euclidean distance to measure the similarity between query and gallery images in the testing set. Our model is implemented on the PyTorch [37] platform and trained with 4 NVIDIA TITAN XP GPUs.
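A side note on the testing protocol: on L2-normalized features, ranking by Euclidean distance is equivalent to ranking by cosine similarity, since for unit vectors the squared distance is an affine function of the inner product. A quick NumPy check of this identity:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=128)
a /= np.linalg.norm(a)  # L2-normalize
b = rng.normal(size=128)
b /= np.linalg.norm(b)

# For unit vectors: ||a - b||^2 = 2 - 2 * (a . b)
sq_dist = np.sum((a - b) ** 2)
cos_sim = a @ b
assert np.isclose(sq_dist, 2 - 2 * cos_sim)
```

This is why normalized-feature retrieval results are unaffected by the choice between the two measures.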

Methods Reference DukeMTMC-ReID→Market1501 Market1501→DukeMTMC-ReID
mAP R1 R5 R10 mAP R1 R5 R10
LOMO [27] CVPR 2015 8.0 27.2 41.6 49.1 4.8 12.3 21.3 26.6
BOW [68] ICCV 2015 14.8 35.8 52.4 60.3 8.3 17.1 28.8 34.9
UMDL [39] CVPR 2016 12.4 34.5 52.6 59.6 7.3 18.5 31.4 37.6
PTGAN [52] CVPR 2018 - 38.6 - 66.1 - 27.4 - 50.7
PUL [9] TOMM 2018 20.5 45.5 60.7 66.7 16.4 30.0 43.4 48.5
SPGAN [6] CVPR 2018 22.8 51.5 70.1 76.8 22.3 41.1 56.6 63.0
ATNet [28] CVPR 2019 25.6 55.7 73.2 79.4 24.9 45.1 59.5 64.2
TJ-AIDL [51] CVPR 2018 26.5 58.2 74.8 81.1 23.0 44.3 59.6 65.0
SPGAN+LMP [6] CVPR 2018 26.7 57.7 75.8 82.4 26.2 46.4 62.3 68.0
CamStyle [71] TIP 2019 27.4 58.8 78.2 84.3 25.1 48.4 62.5 68.9
HHL [74] ECCV 2018 31.4 62.2 78.8 84.0 27.2 46.9 61.0 66.7
ECN [75] CVPR 2019 43.0 75.1 87.6 91.6 40.4 63.3 75.8 80.4
PDA-Net [26] ICCV 2019 47.6 75.2 86.3 90.2 45.1 63.2 77.0 82.5
UDAP [42] PR 2020 53.7 75.8 89.5 93.2 49.0 68.4 80.1 83.5
PCB-PAST [64] ICCV 2019 54.6 78.4 - - 54.3 72.4 - -
SSG [10] ICCV 2019 58.3 80.0 90.0 92.4 53.4 73.0 80.6 83.2
MMCL [48] CVPR 2020 60.4 84.4 92.8 95.0 51.4 72.4 82.9 85.0
ACT [59] AAAI 2020 60.6 80.5 - - 54.5 72.4 - -
ECN-GPP [76] TPAMI 2020 63.8 84.1 92.8 95.4 54.4 74.0 83.7 87.4
AD-Cluster [63] CVPR 2020 68.3 86.7 94.4 96.5 54.1 72.6 82.5 85.5
MMT [12] ICLR 2020 71.2 87.7 94.9 96.9 65.1 78.0 88.8 92.5
Dual-Refinement This paper 78.0 90.9 96.4 97.7 67.7 82.1 90.1 92.5
Methods Reference DukeMTMC-ReID→MSMT17 Market1501→MSMT17
mAP R1 R5 R10 mAP R1 R5 R10
ECN [75] CVPR 2019 10.2 30.2 41.5 46.8 8.5 25.3 36.3 42.1
SSG [10] ICCV 2019 13.3 32.2 - 51.2 13.2 31.6 - 49.6
ECN-GPP [76] TPAMI 2020 16.0 42.5 55.9 61.5 15.2 40.4 53.1 58.7
MMCL [48] CVPR 2020 16.2 43.6 54.3 58.9 15.1 40.8 51.8 56.7
MMT [12] ICLR 2020 23.3 50.1 63.9 69.8 22.9 49.2 63.1 68.8
Dual-Refinement This paper 26.9 55.0 68.4 73.2 25.1 53.3 66.1 71.5
TABLE I: Comparison between the proposed method and state-of-the-art unsupervised domain adaptation methods for person re-ID. The best results are highlighted in bold and the second-best results are underlined. '-' indicates results not reported.
Methods DukeMTMC-ReID→Market1501 Market1501→DukeMTMC-ReID
mAP R1 R5 R10 mAP R1 R5 R10
Fully Supervised (upper bound) 81.2 93.1 97.7 98.6 70.3 84.2 91.7 93.9
Direct Transfer (lower bound) 28.6 58.0 73.7 79.8 27.6 44.5 60.6 66.1
Baseline 67.9 85.7 94.3 96.3 56.4 72.5 84.5 88.2
Baseline with only LR 74.4 88.7 95.1 97.1 65.5 80.0 89.8 92.7
Baseline with only IM-SP 75.5 89.0 95.8 97.5 66.3 80.5 89.6 92.4
Baseline with both LR and IM-SP 78.0 90.9 96.4 97.7 67.7 82.1 90.1 92.5
TABLE II: Ablation studies on supervised, direct transfer and variants combined with baseline. LR means off-line pseudo label refinement with hierarchical clustering. IM-SP means on-line feature refinement with the IM-spread-out regularization in Eq. (9). Our method (Baseline with both LR and IM-SP) is comparable to the fully supervised methods.

IV-C Comparisons with State-of-the-Arts

IV-C1 Results on Market1501 and DukeMTMC-ReID

In Table I, we compare our method with state-of-the-art methods. GAN-based methods include PTGAN [52], SPGAN [6], ATNet [28], CamStyle [71], HHL [74], and PDA-Net [26]; UDAP [42], PCB-PAST [64], SSG [10], ACT [59], and MMT [12] are based on clustering; AD-Cluster [63] combines GAN and clustering; ECN [75], ECN-GPP [76], and MMCL [48] use a memory bank. On DukeMTMC-ReID→Market1501, our method achieves 78.0% mAP and 90.9% rank-1 accuracy, outperforming the state-of-the-art GAN-based method AD-Cluster [63] by 9.7% on mAP and 4.2% on rank-1 accuracy, the best clustering-based method MMT [12] by 6.8% on mAP and 3.2% on rank-1 accuracy, and the best memory-bank-based method ECN-GPP [76] by 14.2% on mAP and 6.8% on rank-1 accuracy. When using Market1501 as the source dataset and DukeMTMC-ReID as the target dataset, we achieve 67.7% mAP and 82.1% rank-1 accuracy, which is 13.6% and 9.5% higher than AD-Cluster [63]. Compared with state-of-the-art UDA methods, our method improves performance by a large margin. Note that our method trains a single model and does not use any images generated by GANs. In contrast, MMT [12] uses dual networks for target-domain training, with more than twice as many parameters as our method, while the performance of ECN [75] and ECN-GPP [76] heavily depends on the quality of extra augmented images produced by GANs.

IV-C2 Results on MSMT17

Our method still outperforms state-of-the-art methods on this challenging dataset by a large margin. With DukeMTMC-ReID as the source dataset, our method achieves 26.9% mAP and 55.0% rank-1 accuracy, which is 3.6% and 4.9% higher than the state-of-the-art MMT [12]. With Market1501 as the source dataset, we achieve 25.1% mAP and 53.3% rank-1 accuracy, surpassing MMT [12] by 2.2% and 4.1%. The improvements on such a challenging dataset strongly demonstrate the effectiveness of our method.

IV-D Ablation Study

In this section, we conduct extensive ablation studies to evaluate the effectiveness of the different components of our method.

IV-D1 Comparisons between supervised learning, direct transfer, and baseline

In Table II, we compare fully supervised learning, direct transfer, and the baseline method described in Section III-B. The fully supervised method can be seen as the upper bound for UDA re-ID: it uses the ground-truth labels of the target domain and trains with the classification loss and triplet loss suggested in [33]. However, when this source-trained model is applied directly to the target domain, there is a huge performance gap caused by domain bias. When the model is trained on DukeMTMC-ReID and directly tested on Market1501, mAP drops from 70.3% to 28.6%, and rank-1 accuracy drops from 84.2% to 58.0%. The baseline method uses coarse clustering to assign the noisy pseudo labels and trains the model with the classification and triplet losses, as described in Section III-B. It improves over direct transfer by a large margin: when transferring from DukeMTMC-ReID to Market1501, the baseline's mAP and rank-1 accuracy are 39.3% and 27.7% higher than direct transfer. Due to the large domain gap, a considerable performance margin still remains between the baseline and the fully supervised method. Notably, our baseline is already comparable to state-of-the-art methods [76, 63] trained with extra GAN-generated images, as shown in Table I.

IV-D2 Effectiveness of the off-line pseudo label refinement

In Table II, we evaluate the effectiveness of the off-line pseudo label refinement, denoted as LR. Baseline with only LR means that we only conduct the off-line refinement to generate pseudo labels that supervise the on-line metric learning. When testing on DukeMTMC-ReID, Baseline with only LR outperforms the baseline by 9.1% on mAP and 7.5% on rank-1 accuracy. This shows that the hierarchical-clustering-guided pseudo label refinement plays an important role at the off-line stage, which promotes discriminative on-line feature learning.

IV-D3 Effectiveness of the on-line feature refinement

In Table II, we denote the IM-spread-out regularization as IM-SP. Using only the on-line feature refinement, i.e., Baseline with only IM-SP, yields 7.6% higher mAP and 3.3% higher rank-1 accuracy than the baseline on DukeMTMC-ReID→Market1501. This shows that the on-line feature refinement improves performance even without the off-line refinement. From the performance of our full method, i.e., Baseline with both LR and IM-SP, we conclude that both the off-line label refinement and the on-line feature refinement are indispensable in our Dual-Refinement method.

IV-D4 Comparisons between the IM-spread-out regularization and its variants

In Table III, we compare different implementations and variants of the IM-spread-out regularization. SP+TM replaces our instant memory bank with the traditional memory bank [57]; its mAP is 2.9% lower than our method (SP+IM) when tested on Market1501, showing that the outdated features stored in a traditional memory bank degrade performance. We also compare against the spread-out regularization trained within mini-batches (SP+MB) and the invariance loss [75, 76] equipped with a traditional memory bank (IN+TM). The performance of IN+TM on both DukeMTMC-ReID and Market1501 is worse than ours, indicating that the invariance loss cannot learn features as discriminative as our spread-out regularization. Training with mini-batches cannot fully exploit the whole feature space, because the spread-out property is only enforced within each mini-batch; due to this limitation, SP+MB is inferior to SP+TM. Finally, traditional memory banks are not updated together with the encoder through back-propagation, which causes inconsistencies in on-line feature learning; because of these inconsistencies, training with the traditional memory bank (SP+TM) is inferior to our instant memory bank (SP+IM).

Method Duke→Market1501 Market1501→Duke
mAP R1 mAP R1
Ours+SP+IM 78.0 90.9 67.7 82.1
Ours+SP+TM 75.1 88.7 66.3 80.9
Ours+SP+MB 73.2 88.4 62.6 77.2
Ours+IN+TM 74.9 89.2 66.2 79.7
TABLE III: Analysis on our IM-spread-out regularization and its variants. SP: Spread-out regularization. IN: Invariance loss. IM: Instant memory bank. MB: Mini-batch. TM: Traditional memory bank.

IV-D5 Analysis on the quality of pseudo labels

In Fig. 3, we evaluate the effect of pseudo label quality. We use the F-score [1] to evaluate clustering quality; an F-score closer to 1.0 implies better clustering and less noise in the pseudo labels. As shown in Fig. 3 (a), the refined labels obtained by our off-line hierarchical clustering are of higher quality than the baseline's noisy labels from coarse clustering alone. As shown in Fig. 3 (b), evaluated on DukeMTMC-ReID→Market1501, pseudo labels with decreasing noise improve UDA re-ID performance during training.
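The precise F-score variant is specified in [1]; as one concrete possibility, a pairwise F-score over same-cluster decisions can be computed as follows (an illustrative sketch, not necessarily the exact variant used in the paper):

```python
from itertools import combinations

def pairwise_f_score(pred, true):
    """Pairwise F-score between predicted cluster labels and ground truth.
    A pair is a true positive if both labelings place it in the same cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = true[i] == true[j]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

A perfect clustering gives 1.0; merging everything into one cluster keeps recall at 1 but collapses precision, pulling the F-score down.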

Fig. 3: Evaluation of the effect of pseudo label quality on DukeMTMC-ReID→Market1501. (a) F-score of the clustering quality of the baseline and our method. (b) Performance comparison between the baseline and our method during training.
Method Market1501→DukeMTMC-ReID
R1 (%) Time (hours) GPU Memory (MB)
Baseline 72.5 3.17 8692
Dual-Refinement 82.1 3.53 9600
MMT 78.0 11.45 15068
TABLE IV: Computational cost comparisons.

IV-E Computational Cost Comparisons

As shown in Table IV, we compare the computational cost of our Dual-Refinement method with the baseline and the state-of-the-art MMT [12]. The experiments are conducted on Market1501→DukeMTMC-ReID. MMT trains two networks with each other, which is not memory efficient; compared with MMT, Dual-Refinement achieves higher performance while costing less training time and GPU memory. Compared with the baseline, Dual-Refinement introduces only a little extra GPU memory (about 908 MB) and training time (about 0.36 hours) because of the proposed instant memory bank, yet it outperforms the baseline's rank-1 accuracy (R1) by a large margin. Based on the above analysis, Dual-Refinement is superior not only in performance but also in computational cost.

IV-F Parameter Analysis

In this section, we evaluate the influence of four hyper-parameters: the two loss weights in Eq. (12), the number of nearest neighbors in Eq. (9), and the fine cluster number in Eq. (6). When evaluating one of the four parameters, we fix the others. We evaluate these parameters on Market1501 and DukeMTMC-ReID and compare performance in terms of mAP. The parameter values used in the other experiments are chosen based on the following analysis.

IV-F1 Loss weight

This parameter controls the balance between the metric losses under the two kinds of pseudo label supervision. As shown in Fig. 4 (a), the performance increases consistently as this weight grows until it reaches its peak, after which it slips slightly; this means that refining the pseudo labels excessively may induce extra noise. Take Duke→Market as an example: when using noisy labels only, mAP is 75.5%; when setting the same weight for the losses supervised by noisy and refined labels, mAP increases to 78.0%. If only coarse clustering is used to assign pseudo labels at the off-line stage, the amplification of label noise during alternate training leads to sub-optimal features; when we use hierarchical clustering to correct the label noise explicitly, cleaner labels are generated to boost on-line feature learning.

Fig. 4: (a) Evaluation on different values of the parameter in Eq. (12). (b) Evaluation on different values of the parameter in Eq. (12). (c) Evaluation on different value of nearest neighborhoods. (d) Evaluation on different fine clustering number .

IV-F2 Loss weight

As shown in Fig. 4 (b), we evaluate the performance under different values of the IM-spread-out weight. Setting it to zero means learning without the IM-spread-out regularization, which yields low performance. As the weight increases, the performance is greatly enhanced until it reaches its peak. This shows that enforcing the spread-out property on the entire dataset with our specially designed IM-spread-out regularization helps alleviate the effects of the noisy supervision signal. When the weight exceeds 0.1, the performance drops consistently; this can be explained by the features spreading out too much, which breaks the inherent similarities between samples.

IV-F3 The number of nearest neighbors

This parameter is the number of positives, which we assume to be the nearest neighbors of a sample. As shown in Fig. 4 (c), the performance stays relatively high when the neighborhood size varies from 1 to 10, but drops when it grows beyond 10. This shows that the IM-spread-out regularization is robust for small neighborhood sizes, while larger ones degrade performance because too many neighbors introduce too many noisy positives.

IV-F4 The fine cluster number

Fig. 4 (d) shows that our method achieves its best performance at different fine cluster numbers for Duke→Market1501 and Market1501→Duke. These results reveal that hierarchical clustering lets us exploit the hierarchical information of the target data itself to assign more reliable pseudo labels. The smallest setting corresponds to using only the coarse clustering, while the next corresponds to representing each coarse cluster by the average feature within it as the fine cluster centroid.
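The coarse-to-fine refinement can be illustrated with a rough NumPy sketch (our simplified reading, not the paper's exact Eqs. (6)(7)): split each coarse cluster into fine sub-clusters via k-means, then relabel every sample with the coarse id of its nearest fine prototype.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the fine cluster centroids."""
    rng = np.random.default_rng(seed)
    k = min(k, len(x))
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each point to its nearest centroid, then recompute means
        assign = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = x[assign == c].mean(axis=0)
    return centers

def refine_labels(feats, coarse, k=2):
    """Split each coarse cluster into k fine prototypes, then relabel every
    sample with the coarse id of its globally nearest fine prototype."""
    protos, proto_coarse = [], []
    for c in np.unique(coarse):
        centers = kmeans(feats[coarse == c], k, seed=int(c))
        protos.append(centers)
        proto_coarse += [c] * len(centers)
    protos = np.vstack(protos)
    proto_coarse = np.asarray(proto_coarse)
    d = ((feats[:, None] - protos[None]) ** 2).sum(-1)
    return proto_coarse[np.argmin(d, axis=1)]
```

For well-separated coarse clusters, the refined labels agree with the coarse ones; the refinement only matters for samples lying near the boundary between clusters.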

V Conclusion

In this work, we propose a novel approach called Dual-Refinement to alleviate pseudo label noise in clustering-based UDA re-ID, comprising off-line pseudo label refinement to assign more accurate labels and on-line feature refinement to alleviate the effects of noisy supervision signals. Specifically, we design an off-line pseudo label refining strategy that exploits the hierarchical information in the target domain data. We also propose an on-line IM-spread-out regularization to alleviate the effects of noisy samples; it is equipped with an instant memory bank that considers the entire target dataset during training. Compared to state-of-the-art UDA re-ID methods, Dual-Refinement is trained with only a single model and shows significant performance improvements.

References

  • [1] E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information retrieval 12 (4), pp. 461–486. Cited by: §IV-D5.
  • [2] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §I.
  • [3] T. Chen, S. Ding, J. Xie, Y. Yuan, W. Chen, Y. Yang, Z. Ren, and Z. Wang (2019) Abd-net: attentive but diverse person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8351–8361. Cited by: §I.
  • [4] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II-B.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §IV-B.
  • [6] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 994–1003. Cited by: §II-B, §IV-C1, TABLE I.
  • [7] Y. Duan, J. Lu, and J. Zhou (2019) Uniformface: learning deep equidistributed representation for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3415–3424. Cited by: §II-D, §III-D2.
  • [8] M. Ester, H. Kriegel, J. Sander, and X. Xu (1996) A density-based algorithm for discovering clusters in large spatial databases with noise.. In Kdd, Vol. 96, pp. 226–231. Cited by: §I, §III-B1.
  • [9] H. Fan, L. Zheng, C. Yan, and Y. Yang (2018) Unsupervised person re-identification: clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (4), pp. 1–18. Cited by: TABLE I.
  • [10] Y. Fu, Y. Wei, G. Wang, Y. Zhou, H. Shi, and T. S. Huang (2019) Self-similarity grouping: a simple unsupervised cross domain adaptation approach for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6112–6121. Cited by: §I, §I, §I, §II-B, §III-B1, §IV-C1, TABLE I.
  • [11] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pp. 1180–1189. Cited by: §II-A.
  • [12] Y. Ge, D. Chen, and H. Li (2020) Mutual mean-teaching: pseudo label refinery for unsupervised domain adaptation on person re-identification. International Conference on Learning Representations. Cited by: §I, §I, §II-B, §III-B, §IV-C1, §IV-C2, §IV-E, TABLE I.
  • [13] A. Ghosh, N. Manwani, and P. Sastry (2015) Making risk minimization tolerant to label noise. Neurocomputing 160, pp. 93–107. Cited by: §II-C.
  • [14] J. Goldberger and E. Ben-Reuven (2016) Training deep neural-networks using a noise adaptation layer. Cited by: §II-C.
  • [15] D. Gray, S. Brennan, and H. Tao (2007) Evaluating appearance models for recognition, reacquisition, and tracking. In Proc. IEEE international workshop on performance evaluation for tracking and surveillance (PETS), Vol. 3, pp. 1–7. Cited by: §IV-A.
  • [16] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, pp. 8527–8537. Cited by: §II-C.
  • [17] J. Han, P. Luo, and X. Wang (2019) Deep self-learning from noisy labels. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5138–5147. Cited by: §II-C.
  • [18] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §I, §II-B, §III-D2.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §IV-B.
  • [20] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §III-A, §III-B2, §III-B2.
  • [21] J. Huang, Q. Dong, S. Gong, and X. Zhu (2019) Unsupervised deep learning by neighbourhood discovery. In International Conference on Machine Learning, pp. 2849–2858. Cited by: §I.
  • [22] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2018) MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304–2313. Cited by: §II-C.
  • [23] G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann (2019) Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4893–4902. Cited by: §II-A.
  • [24] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-B.
  • [25] K. Lee, X. He, L. Zhang, and L. Yang (2018) Cleannet: transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5447–5456. Cited by: §II-C.
  • [26] Y. Li, C. Lin, Y. Lin, and Y. F. Wang (2019) Cross-dataset person re-identification via unsupervised pose disentanglement and adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7919–7929. Cited by: §IV-C1, TABLE I.
  • [27] S. Liao, Y. Hu, X. Zhu, and S. Z. Li (2015) Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2197–2206. Cited by: TABLE I.
  • [28] J. Liu, Z. Zha, D. Chen, R. Hong, and M. Wang (2019) Adaptive transfer network for cross-domain person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7202–7211. Cited by: §IV-C1, TABLE I.
  • [29] W. Liu, R. Lin, Z. Liu, L. Liu, Z. Yu, B. Dai, and L. Song (2018) Learning towards minimum hyperspherical energy. In Advances in neural information processing systems, pp. 6222–6233. Cited by: §II-D.
  • [30] S. Lloyd (1982) Least squares quantization in pcm. IEEE transactions on information theory 28 (2), pp. 129–137. Cited by: §III-C.
  • [31] M. Long, Y. Cao, J. Wang, and M. Jordan (2015) Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pp. 97–105. Cited by: §II-A.
  • [32] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. In Advances in neural information processing systems, pp. 136–144. Cited by: §II-A.
  • [33] H. Luo, W. Jiang, Y. Gu, F. Liu, X. Liao, S. Lai, and J. Gu (2019) A strong baseline and batch normalization neck for deep person re-identification. IEEE Transactions on Multimedia. Cited by: §I, §III-A, §III-B, §IV-D1.
  • [34] N. Martinel, G. L. Foresti, and C. Micheloni (2020) Deep pyramidal pooling with attention for person re-identification. IEEE Transactions on Image Processing 29 (), pp. 7306–7316. Cited by: §I.
  • [35] A. Menon, B. Van Rooyen, C. S. Ong, and B. Williamson (2015) Learning from corrupted binary labels via class-probability estimation. In International Conference on Machine Learning, pp. 125–134. Cited by: §II-C.
  • [36] P. Panareda Busto and J. Gall (2017) Open set domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 754–763. Cited by: §II-A.
  • [37] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §IV-B.
  • [38] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952. Cited by: §II-C.
  • [39] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian (2016) Unsupervised cross-dataset transfer learning for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1306–1315. Cited by: TABLE I.
  • [40] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich (2014) Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596. Cited by: §II-C.
  • [41] K. Saito, S. Yamamoto, Y. Ushiku, and T. Harada (2018) Open set domain adaptation by backpropagation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 153–168. Cited by: §II-A.
  • [42] L. Song, C. Wang, L. Zhang, B. Du, Q. Zhang, C. Huang, and X. Wang (2020) Unsupervised domain adaptive re-identification: theory and practice. Pattern Recognition, pp. 107173. Cited by: §I, §I, §I, §II-B, §III-B1, §III-B, §IV-C1, TABLE I.
  • [43] Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei (2020) Circle loss: a unified perspective of pair similarity optimization. arXiv preprint arXiv:2002.10857. Cited by: §III-D2.
  • [44] Y. Sun, L. Zheng, Y. Li, Y. Yang, Q. Tian, and S. Wang (2019) Learning part-based convolutional features for person re-identification. IEEE transactions on pattern analysis and machine intelligence. Cited by: §I.
  • [45] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §II-C.
  • [46] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa (2018) Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5552–5560. Cited by: §II-C.
  • [47] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell (2014) Deep domain confusion: maximizing for domain invariance. arXiv preprint arXiv:1412.3474. Cited by: §II-A.
  • [48] D. Wang and S. Zhang (2020) Unsupervised person re-identification via multi-label classification. arXiv preprint arXiv:2004.09228. Cited by: §IV-C1, TABLE I.
  • [49] G. Wang, Y. Yuan, J. Li, S. Ge, and X. Zhou (2020) Receptive multi-granularity representation for person re-identification. IEEE Transactions on Image Processing 29 (), pp. 6096–6109. Cited by: §I.
  • [50] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou (2018) Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM international conference on Multimedia, pp. 274–282. Cited by: §I.
  • [51] J. Wang, X. Zhu, S. Gong, and W. Li (2018) Transferable joint attribute-identity deep learning for unsupervised person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2275–2284. Cited by: TABLE I.
  • [52] L. Wei, S. Zhang, W. Gao, and Q. Tian (2018) Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 79–88. Cited by: §II-B, §IV-A, §IV-A, §IV-C1, TABLE I.
  • [53] L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian (2018) GLAD: global-local-alignment descriptor for scalable person re-identification. IEEE Transactions on Multimedia 21 (4), pp. 986–999. Cited by: §I.
  • [54] Z. Wu, A. A. Efros, and S. X. Yu (2018) Improving generalization via scalable neighborhood component analysis. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 685–701. Cited by: §I, §III-D2.
  • [55] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §I, §III-D2.
  • [56] X. Xia, T. Liu, N. Wang, B. Han, C. Gong, G. Niu, and M. Sugiyama (2019) Are anchor points really indispensable in label-noise learning? In Advances in Neural Information Processing Systems, pp. 6835–6846. Cited by: §II-C.
  • [57] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang (2017) Joint detection and identification feature learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3415–3424. Cited by: §II-B, §II-B, §III-D2, §IV-D4.
  • [58] F. Yang, K. Yan, S. Lu, H. Jia, D. Xie, Z. Yu, X. Guo, F. Huang, and W. Gao (2020) Part-aware progressive unsupervised domain adaptation for person re-identification. IEEE Transactions on Multimedia. Cited by: §I.
  • [59] F. Yang, K. Li, Z. Zhong, Z. Luo, X. Sun, H. Cheng, X. Guo, F. Huang, R. Ji, and S. Li (2020) Asymmetric co-teaching for unsupervised cross domain person re-identification. In AAAI. Cited by: §I, §I, §II-B, §IV-C1, TABLE I.
  • [60] F. Yang, Z. Zhong, Z. Luo, S. Lian, and S. Li (2019) Leveraging virtual and real person for unsupervised person re-identification. IEEE Transactions on Multimedia. Cited by: §I.
  • [61] M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. Hoi (2020) Deep learning for person re-identification: a survey and outlook. arXiv preprint arXiv:2001.04193. Cited by: §I.
  • [62] M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6210–6219. Cited by: §II-D, §III-D2.
  • [63] Y. Zhai, S. Lu, Q. Ye, X. Shan, J. Chen, R. Ji, and Y. Tian (2020) AD-cluster: augmented discriminative clustering for domain adaptive person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §IV-C1, §IV-D1, TABLE I.
  • [64] X. Zhang, J. Cao, C. Shen, and M. You (2019) Self-training with progressive augmentation for unsupervised cross-domain person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8222–8231. Cited by: §I, §I, §I, §II-B, §III-B1, §III-B, §IV-C1, TABLE I.
  • [65] X. Zhang, F. X. Yu, S. Kumar, and S. Chang (2017) Learning spread-out local feature descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4595–4603. Cited by: §II-D.
  • [66] Z. Zhang and M. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, pp. 8778–8788. Cited by: §II-C.
  • [67] K. Zhao, J. Xu, and M. Cheng (2019) Regularface: deep face recognition via exclusive regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1136–1144. Cited by: §II-D.
  • [68] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1116–1124. Cited by: §IV-A, §IV-A, TABLE I.
  • [69] L. Zheng, Y. Yang, and A. G. Hauptmann (2016) Person re-identification: past, present and future. arXiv preprint arXiv:1610.02984. Cited by: §I.
  • [70] Z. Zheng, L. Zheng, and Y. Yang (2017) Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3754–3762. Cited by: §IV-A, §IV-A.
  • [71] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang (2019) CamStyle: a novel data augmentation method for person re-identification. IEEE Transactions on Image Processing. Cited by: §IV-C1, TABLE I.
  • [72] Z. Zhong, L. Zheng, D. Cao, and S. Li (2017) Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1318–1327. Cited by: §III-B1, §IV-A.
  • [73] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2017) Random erasing data augmentation. arXiv preprint arXiv:1708.04896. Cited by: §IV-B.
  • [74] Z. Zhong, L. Zheng, S. Li, and Y. Yang (2018) Generalizing a person retrieval model hetero-and homogeneously. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–188. Cited by: §II-B, §IV-C1, TABLE I.
  • [75] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang (2019) Invariance matters: exemplar memory for domain adaptive person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 598–607. Cited by: §I, §II-B, §II-B, §III-D2, §IV-C1, §IV-D4, TABLE I.
  • [76] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang (2020) Learning to adapt invariance in memory for person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II-B, §IV-C1, §IV-D1, §IV-D4, TABLE I.
  • [77] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). Cited by: §II-B.