Imitating Targets from all sides: An Unsupervised Transfer Learning method for Person Re-identification

04/10/2019 · Jiajie Tian et al. · UNC Charlotte, Beijing Jiaotong University

Person re-identification (Re-ID) models usually show a limited performance when they are trained on one dataset and tested on another due to the inter-dataset bias (e.g., completely different identities and backgrounds) and the intra-dataset difference (e.g., camera-style variations). To address this issue, given a labelled source training set and an unlabelled target training set, we propose an unsupervised transfer learning method characterized by 1) bridging the inter-dataset bias and the intra-dataset difference simultaneously via the proposed ImitateModel; 2) regarding the unsupervised person Re-ID problem as a semi-supervised learning problem formulated by a dual classification loss to learn a discriminative representation across domains; 3) exploiting the underlying commonality across different domains in the class-style space to improve the generalization ability of Re-ID models. Extensive experiments are conducted on two widely employed benchmarks, Market-1501 and DukeMTMC-reID, and the experimental results demonstrate that the proposed method achieves competitive performance against other state-of-the-art unsupervised Re-ID approaches.


1 Introduction

Person re-identification (Re-ID) [39] aims to match people across non-overlapping camera views. It has attracted significant attention due to its great potential in video surveillance applications. Thanks to the development of deep learning [29, 34], person Re-ID performance has been significantly improved in recent years. However, it remains a very challenging task because a query person-of-interest often undergoes large variations in appearance, illumination and background under different cameras. Furthermore, the high performance achieved for person Re-ID is restricted to supervised learning frameworks, where the database consists of a large number of manually labelled images. In a practical person Re-ID deployment, such manual labelling is not only expensive to aggregate as the number of cameras increases, but also improbable in many cases because it requires the same person to appear in every pair of existing cameras. Moreover, when models trained on one supervised dataset are directly used on another, the Re-ID performance declines precipitously due to the inter-dataset bias [7, 27].

One solution to this problem is unsupervised domain adaptation (UDA), where models are trained on a source domain consisting of labelled images and adapted to a target domain composed of unlabelled images. Recently, numerous unsupervised methods for person Re-ID [17, 30] have been proposed to extract view-invariant features. But these methods only achieve a limited Re-ID performance compared to their supervised counterparts. The main reason is that the inter-domain bias between the labelled source domain and the unlabelled target domain is not reduced effectively. Images from different domains are taken from different views, in different seasons and against different backgrounds, and even the people who appear in them might come from different countries. We consider these differences as the domain gap or inter-domain bias. In the unsupervised setting, no labelled pairs in the target domain are provided, so it is all the more important to exploit the label information in the source domain to shrink the inter-domain bias. Another factor that influences the performance of person Re-ID is the intra-domain difference, which is caused by different camera configurations in the target domain. Even within the same domain, images captured by different cameras have distinctive styles due to varying lighting conditions, shooting angles, backgrounds, etc.

In this work, we propose a method to explicitly address the issues mentioned above. On the one hand, persons in the target domain are imitated from the labelled source domain. On the other hand, a content-preserving pseudo target domain is derived to lessen the intra-domain difference. We leverage a dual classification loss on both the source domain and the imitated target domain to strengthen the discriminative ability of the proposed person Re-ID model. There are some works [7, 32] that focus on similarity-preserving source-target translation models to bridge domain gaps, and some methods [42, 43] that concentrate on camera style adaptation to generate new datasets in the styles of other cameras. However, these works either only emphasize narrowing the inter-domain bias or only consider diminishing the intra-domain difference. A transferred model might be disturbed by the overall data gap between two domains during the training phase and hampered by the camera styles of the target domain during the testing phase. We therefore argue that both the inter-domain bias and the intra-domain difference should be fully considered in the person Re-ID model. To enhance the generalization ability of the person Re-ID model, we further exploit a latent commonality of domains beyond source and target, i.e., the distance between images of the same person should be smaller than that between different persons across any camera in any domain. In view of these aspects, we develop a novel unsupervised transfer learning method, named ImitateModel, to train a cross-dataset person Re-ID network. ImitateModel does not need any manual annotations for images in the target domain, but requires identity labels for the source dataset and camera IDs for each image in both the source and target datasets. Note that the camera ID of each image can be easily obtained along with the raw videos.

To sum up, three contributions are made: (I) We design the ImitateModel to simultaneously decrease the inter-domain bias and the intra-domain difference. (II) We propose a dual classification loss to learn a discriminative representation. (III) We exploit the underlying commonality of domains beyond source and target.

2 Related Work

Supervised person re-ID. Most existing person re-ID models are based on supervised learning, trained on a large number of labelled images across cameras. They focus on feature engineering [10, 15, 21, 23, 37], distance metric learning [3, 4, 13, 18, 26], or creating new deep learning architectures [1, 19]. For example, Kalayeh et al. [15] learned both part-level features and global features. Chen et al. [3] proposed a quadruplet loss to handle the weakness of the triplet loss for person re-ID. Li et al. [19] proposed a new person re-ID network with a joint learning of soft and hard attentions, which took advantage of both joint attention selection and feature representation learning. Although these models offer a promising performance on recent person Re-ID datasets (e.g., Market-1501 [38] and DukeMTMC-reID [28, 40]), they are hard to utilize in practical applications due to the demand for tremendous amounts of labelled data.

Unsupervised person Re-ID. Hand-crafted features [2, 9, 10, 21, 25] can be directly employed for unsupervised cross-domain person Re-ID. But the cross-domain data is not fully exploited by these features because they neglect the inter-domain bias. In the unsupervised person Re-ID community, some works [8, 24, 33, 35] attempted to predict pseudo-labels of unlabelled target images. For instance, Fan et al. [8] proposed a method that iteratively applied data clustering, instance selection and fine-tuning to estimate labels of images in the target domain. Liu et al. [24] predicted reliable labels with k-reciprocal nearest neighbors. Other works [20, 22, 27, 31] aimed at learning domain-invariant features. Peng et al. [27] presented a multi-task dictionary to learn a view-invariant representation, and Li et al. [20] proposed to learn a shared space between the source domain and the target domain under a deep learning framework. Lin et al. [22] tried to align cross-dataset mid-level features in the task of attribute learning, while Wang et al. [31] presented a deep Re-ID model to represent an attribute-semantic and identity-discriminative feature space. Different from these models, we propose an ImitateModel to diminish both the inter-domain bias and the intra-domain difference, and design a dual classification loss to strengthen the supervision from transferred knowledge.

Image-Image Domain Adaptation for Re-ID. Image-image domain adaptation aims at generating a new dataset that connects the source domain and the target domain in some way, such as content and style. A number of methods [7, 32, 42, 43] have studied image-image translation for person re-ID. Deng et al. [7] proposed a Similarity Preserving cycle-consistent Generative Adversarial Network (SPGAN) to create a new dataset for cross-domain person re-ID models. Wei et al. [32] presented PTGAN to narrow down the domain gap in the cross-dataset person Re-ID model. Both [42] and [43] targeted reducing the intra-domain difference and trained several style transfer models between different cameras within a dataset, but the former employed the Label Smoothing Regularization (LSR) loss to train a person re-ID model, while the latter utilized a triplet loss to train a cross-dataset person re-ID model. In contrast, the proposed ImitateModel develops an imitated target domain transferred from the source dataset and a pseudo target domain transferred from the target dataset, based on which both the inter-domain bias and the intra-domain difference are addressed.

Figure 1: The pipeline of the proposed model. It consists of two branches: 1) Classification branch supervised on the source domain and semi-supervised on the imitated target domain; 2) Commonality branch restricted by a triplet loss on the source domain, imitated target domain, pseudo target domain, and the target domain.

3 Proposed Method

Problem Definition. For unsupervised domain adaptation in person re-ID, a source dataset with labelled image-camera pairs and an unlabelled dataset from the target domain are provided. The source dataset $S = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ consists of $N_s$ images with corresponding labels $y_i^s \in \{1, \dots, M_s\}$ (a total of $M_s$ different persons) captured by a total of $C_s$ cameras, and the target dataset $T = \{x_i^t\}_{i=1}^{N_t}$ consists of $N_t$ images captured by a total of $C_t$ cameras. Our goal is to leverage both the labelled source training images and the unlabelled target training images to learn a re-ID model that generalizes well during testing in the target domain.

3.1 Supervised Learning for Person Re-ID

To obtain a good performance in the person re-identification task, the prime goal is to learn discriminative representations to distinguish person identities. With labelled source images, an effective strategy is to adopt the ID-discriminative embedding (IDE) [39, 40, 41] borrowed from the classification task. To this end, we employ ResNet-50 [12] pre-trained on ImageNet [6] as our base model, and append pool-5 and an embedding layer, named "embedding-1024", to extract discriminative features, as shown in Fig. 1. The output dimension of the last fully-connected layer is modified to fit the number of training identities. As explained, one branch of the proposed person Re-ID model is regarded as a classification task and employs the cross-entropy loss on the source dataset as described in Eq. (1). The IDE-based model does achieve a very good performance on a single person Re-ID dataset, but only a limited performance on the cross-domain person Re-ID problem. To address this issue, we propose the ImitateModel to enhance the generalization ability.

\mathcal{L}_{src} = -\frac{1}{N_s} \sum_{i=1}^{N_s} \log p(y_i^s \mid x_i^s),    (1)

where $p(y_i^s \mid x_i^s)$ is the predicted probability of image $x_i^s$ belonging to its ground-truth identity $y_i^s$. We also refer to this model as the baseline.
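As a concrete illustration of this IDE-style baseline, the following sketch (assuming PyTorch/torchvision; the attribute names such as embedding_1024 are ours, only the layer names "pool-5" and "embedding-1024" and the loss come from the text) shows how backbone, embedding and classifier could be wired together:

```python
# A minimal sketch of the IDE-style baseline, not the authors' released code.
import torch.nn as nn
from torchvision import models

class IDEBaseline(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet50(pretrained=True)                      # ImageNet pre-trained base model
        self.features = nn.Sequential(*list(backbone.children())[:-1])   # conv blocks + pool-5
        self.embedding_1024 = nn.Linear(2048, 1024)                      # "embedding-1024" layer
        self.classifier = nn.Linear(1024, num_classes)                   # matches #training identities

    def forward(self, x):
        f = self.features(x).flatten(1)       # 2048-dim pool-5 feature (used as descriptor at test time)
        e = self.embedding_1024(f)
        return f, e, self.classifier(e)

criterion = nn.CrossEntropyLoss()             # realizes the ID classification loss of Eq. (1)
```

During training, only the classifier output would feed the cross-entropy loss of Eq. (1); at test time the 2048-dim pool-5 feature serves as the person descriptor, as stated in Section 3.4.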

Figure 2: Examples of image transfer on DukeMTMC-reID and Market-1501 by the proposed ImitateModel. An image captured by a certain camera is transferred to the views of all the cameras in both datasets. The transferred model preserves the content and identity of the source image while reflecting the style of the target view.

3.2 ImitateModel

The inter-dataset bias caused by different domains is a critical factor that degrades the generalization ability of unsupervised person Re-ID models. As we have no information about the target dataset, e.g., the identities of people or the styles of images, how to utilize the labelled source dataset is the key. On the other hand, the intra-dataset difference induced by different cameras in the target dataset is also a crucial element, because in the test procedure images of the same person usually come from different cameras of the target domain. Zhong et al. [42] showed that a model trained on the source dataset is more sensitive to image variations caused by different cameras than to other data augmentations. Therefore, if transfer learning is employed in unsupervised person Re-ID, how to narrow down the inter-dataset bias and reduce the intra-dataset difference simultaneously is a significant issue. To bridge the inter-domain gap, we propose to generate an imitated target dataset, denoted by $I$, which preserves the person identities of the source domain while reflecting the styles of the different cameras in the target domain. Specifically, each image in the source domain is adapted to all the camera styles in the target domain. To diminish the intra-domain difference, we propose to develop a pseudo target dataset, denoted by $P$, which diversifies camera styles for each image in the target domain. In particular, images in the target domain are transferred to all the camera styles of the target domain.

For image generation, StarGAN [5] and CycleGAN [44] are widely employed in the person Re-ID area. For example, [7] utilizes CycleGAN for image-image domain adaptation, while [42] employs StarGAN to construct a camera style transfer model. In our work, we build the ImitateModel on StarGAN on two grounds. First, to learn our entire ImitateModel, a StarGAN-based network only requires training once, while a CycleGAN-based network requires a total of $C_s \times C_t$ translation models, as each pair of cameras from source to target must be learned separately. Second, the inter-domain bias and the intra-domain difference can be handled simultaneously through a StarGAN-based network, which benefits the extraction of the underlying commonality across numerous domains (see details in Section 3.4). Examples of real images and fake images generated by the proposed ImitateModel are shown in Fig. 2.
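The following hypothetical sketch illustrates how a single StarGAN-style conditional generator G(image, camera_code) could produce both new datasets; G and the one-hot camera encoding are assumptions for illustration, not the authors' released code:

```python
# Hypothetical sketch: one conditional generator creates the imitated target set
# (source -> target cameras) and the pseudo target set (target -> target cameras).
import torch

def camera_code(cam_id, num_cams):
    code = torch.zeros(num_cams)
    code[cam_id] = 1.0                               # one-hot target-camera condition
    return code

@torch.no_grad()
def transfer_to_all_target_cams(G, image, num_target_cams):
    """Return one style-transferred copy of `image` per target camera (identity preserved)."""
    fakes = []
    for cam in range(num_target_cams):
        c = camera_code(cam, num_target_cams).unsqueeze(0)
        fakes.append(G(image.unsqueeze(0), c))       # content/identity kept, camera style changed
    return torch.cat(fakes, dim=0)

# Imitated target set: apply to every source image; pseudo target set: apply to every target image.
```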

3.3 Semi-supervised Learning for Person Re-ID

Denote the imitated target dataset as $I$, which is constructed by image-image translation for every camera pair from the source domain to the target domain through the ImitateModel. The dataset consists of images with corresponding labels (a total of $M_s$ different persons, the same as the source dataset) under a total of $C_t$ cameras, and it preserves the person identities of the source dataset. Specifically, for a real image $x_{i,j}^s$ (the i-th image under the j-th camera) in the source dataset, we generate $C_t$ imitated images via the learned ImitateModel. These images preserve the person identity, but their styles are similar to those of the corresponding target cameras. Therefore, the imitated target dataset contains $C_t \times N_s$ images.

We argue that an approach based on supervised learning can perform better than an unsupervised learning method on the same person Re-ID problem, as the former encodes more information than the latter. Therefore, in order to boost the cross-domain person Re-ID performance, we recast unsupervised person Re-ID as a semi-supervised person Re-ID task by imitating the target domain. Specifically, a semi-labelled domain $U$ is constructed by fusing the labelled imitated target domain $I$ and the unlabelled target domain $T$. These two domains have similar target styles but totally different identities. From the perspective of semi-supervised classification, a cross-entropy loss on domain $U$ is formulated as shown in Eq. (2).

\mathcal{L}_{imt} = -\frac{1}{N_{imt}} \sum_{i=1}^{N_{imt}} \log p(y_i^{imt} \mid x_i^{imt}),    (2)

where $p(y_i^{imt} \mid x_i^{imt})$ is the predicted probability of the imitated image $x_i^{imt}$ belonging to its ground-truth identity $y_i^{imt}$.

Further, the dual classification loss is designed as follows:

\mathcal{L}_{cls} = \mathcal{L}_{src} + \lambda \mathcal{L}_{imt},    (3)

where $\lambda$ is a hyper-parameter that controls the influence of the imitated target dataset.
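A minimal sketch of the dual classification loss, assuming the reconstructed form of Eqs. (1)-(3) above; the function and argument names are illustrative only:

```python
# Sketch of the dual classification loss (Eq. (3)), assuming standard cross-entropy terms.
import torch.nn.functional as F

def dual_classification_loss(src_logits, src_labels, imt_logits, imt_labels, lam=1.0):
    l_src = F.cross_entropy(src_logits, src_labels)   # Eq. (1): labelled source images
    l_imt = F.cross_entropy(imt_logits, imt_labels)   # Eq. (2): imitated target images (source IDs)
    return l_src + lam * l_imt                        # lam weights the imitated target dataset
```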

Figure 3: Three domains in the class-style space: the source domain $S$, the imitated target domain $I$, and the pseudo target and target domain $P \cup T$. Here, the source domain and the imitated target domain share the same identities, and the imitated target domain and the pseudo target and target domain possess similar styles. The horizontal axis represents classes, i.e., identities, and the vertical axis delineates image styles. The shorter the distance between two images in this space, the more similar the persons in these images are. It can be observed that there is a latent commonality among all domains in the space, even though an inter-domain bias exists for every pair of domains.

3.4 Mining Commonality

In Section 3.2, with the ImitateModel trained on the source domain and the target domain, we actually create two new domains, $I$ and $P$, where the former is described in Section 3.3. The pseudo target domain $P$ is built by image-image translation for every camera pair from the target domain to itself. The pseudo target domain consists of images under a total of $C_t$ cameras and preserves the same identities as the target domain. In particular, with the learned ImitateModel, for a real image $x_{i,j}^t$ (the i-th image under the j-th camera) in the target domain, a total of $C_t$ pseudo images are generated. These images hold the person identity of the original image, but their styles are similar to the corresponding target camera styles, so the pseudo target dataset contains $C_t \times N_t$ images. Note that the image transferred to its own camera style is included in the pseudo images.

As mentioned above, the source domain and the target domain have totally different classes and styles, which leads to a limited performance when models trained on the source domain are directly executed on the target domain. That is because models trained on the source domain only learn to extract camera-invariant image features under the source domain camera styles to distinguish source classes. The models are unaware of any information about the target classes or target domain camera styles. In other words, if models could exploit the latent commonalities of the source domain and the target domain, a better performance on the target domain could be achieved. Naturally, one of the underlying commonalities is that the distance between images of the same person should be smaller than that between different persons. Based on this intuition, we design a second branch after embedding-1024 in the Re-ID network, named "commonality" or "embedding-128", as shown in Fig. 1. The two branches have different goals: the first branch is a classification task to learn a discriminative image feature, while the second branch is a commonality mining task to acquire more common information of the source and target domains. The commonality branch is restricted by a triplet loss:

\mathcal{L}_{tri}(X) = \sum_{x_a \in X} \big[\, d(x_a, x_p) - d(x_a, x_n) + m \,\big]_{+},    (4)

where $X$ represents the images in a training batch, $x_a$ is an anchor point, $x_p$ is the farthest positive sample to $x_a$, and $x_n$ is the closest negative sample to $x_a$ in the batch. $m$ is a margin parameter, which is set to 0.3 in our experiments, and $d(\cdot,\cdot)$ is the Euclidean distance between two images in the commonality feature space. Note that during the Re-ID test process, the feature at the pool-5 (2048-dim) layer is utilized as the person descriptor.
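A possible PyTorch realization of this triplet loss with batch-hard mining (farthest positive, closest negative), assuming 128-dim commonality embeddings and at least one positive and one negative per anchor in each batch:

```python
# Sketch of the batch-hard triplet loss of Eq. (4); margin m = 0.3 as in the paper.
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    dist = torch.cdist(embeddings, embeddings, p=2)                      # pairwise Euclidean distances
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1))                   # positive mask (same identity)
    pos_dist = (dist * same.float()).max(dim=1).values                   # farthest positive per anchor
    neg_dist = dist.masked_fill(same, float('inf')).min(dim=1).values    # closest negative per anchor
    return torch.clamp(pos_dist - neg_dist + margin, min=0).mean()
```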

To illustrate the commonality of all domains, we view the domains in the class-style space, where three clusters are formed as shown in Fig. 3. $S$ denotes source domain classes and source domain styles, $I$ represents source domain classes and target domain styles, and $P \cup T$ corresponds to target domain classes and target domain styles. The last cluster consists of the pseudo target domain and the target domain, where the person identities are unknown. However, we do know that a target image and its pseudo images belong to the same class, and other images from the target domain can be viewed as different classes. Clearly, all three clusters share this commonality, which is learned by the triplet loss:

\mathcal{L}_{S} = \mathcal{L}_{tri}(S),    (5)
\mathcal{L}_{I} = \mathcal{L}_{tri}(I),    (6)
\mathcal{L}_{P \cup T} = \mathcal{L}_{tri}(P \cup T).    (7)

Consequently, the total loss for the underlying commonality task can be written as follows:

\mathcal{L}_{com} = \beta_1 \mathcal{L}_{S} + \beta_2 \mathcal{L}_{I} + \beta_3 \mathcal{L}_{P \cup T},    (8)

where $\beta_1$, $\beta_2$ and $\beta_3$ are hyper-parameters that control the contribution of the three clusters to the latent commonality loss $\mathcal{L}_{com}$.

Considering both the classification branch and the commonality mining branch, the total training objective of the proposed network is formulated as follows:

\mathcal{L} = \mu_1 \mathcal{L}_{cls} + \mu_2 \mathcal{L}_{com},    (9)

where $\mu_1$ and $\mu_2$ are hyper-parameters that control the proportions of the classification task and the commonality mining task. Note that our model is trained in an end-to-end manner.
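Under the reconstructed Eqs. (8) and (9), the two branches could be combined as in the following sketch; the default weights are placeholders, not the tuned values reported in Section 4.2:

```python
# Illustrative combination of the classification and commonality branches (Eqs. (8)-(9)).
def total_loss(l_cls, l_tri_S, l_tri_I, l_tri_PT,
               betas=(1.0, 1.0, 1.0), mu1=1.0, mu2=1.0):
    # Eq. (8): weighted commonality loss over the three clusters S, I and P∪T
    l_com = betas[0] * l_tri_S + betas[1] * l_tri_I + betas[2] * l_tri_PT
    # Eq. (9): end-to-end objective combining the two branches
    return mu1 * l_cls + mu2 * l_com
```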

4 Experiments

In this section, we conduct studies to examine the effectiveness of each part of the proposed network and compare cross-domain person Re-ID performance against a number of state-of-the-art methods.

4.1 Datasets

To evaluate the performance of the proposed method, experiments are executed on two widely used person Re-ID datasets: Market-1501 [38] and DukeMTMC-reID [28, 40]. The details on the number of training samples under each camera are presented in Table 1.

Camera    Market-1501    DukeMTMC-reID
1         2017           2809
2         1709           3009
3         2707           1088
4         920            1395
5         2338           1685
6         3245           3700
7         -              1330
8         -              1506
Table 1: Number of training samples with respect to each camera in the Market-1501 and DukeMTMC-reID datasets.

Market-1501 [38] is collected from 6 camera views and contains 32,668 labelled images of 1,501 identities. The dataset is split into two non-overlapping fixed parts: 12,936 images of 751 identities for training and 19,732 gallery images of the other 750 identities for testing. During testing, 3,368 query images of 750 identities are used to retrieve the corresponding persons in the gallery.

DukeMTMC-reID [28, 40] contains 36,411 labelled images of 1,404 identities captured by 8 cameras. It is split into two non-overlapping fixed parts: 16,522 images of 702 identities for training and 17,661 gallery images of the other 702 identities for testing. During testing, 2,228 query images of 702 identities are used to retrieve the persons in the gallery.

We adopt the conventional rank-1 accuracy and mAP as metrics for cross-domain re-ID evaluation [38] on both datasets (a minimal computation sketch is given after the settings below). In the experiments, there are two source-target settings:

1. Source: DukeMTMC-reID/ Target: Market-1501.

2. Source: Market-1501/ Target: DukeMTMC-reID.
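For reference, a compact and unofficial sketch of how rank-1 accuracy and mAP can be computed from query and gallery features, omitting the standard same-camera/junk filtering for brevity; all array names are illustrative:

```python
# Sketch of rank-1 / mAP evaluation from extracted features (NumPy arrays assumed).
import numpy as np

def evaluate(query_feats, query_ids, gallery_feats, gallery_ids):
    rank1, aps = 0.0, []
    for qf, qid in zip(query_feats, query_ids):
        dists = np.linalg.norm(gallery_feats - qf, axis=1)     # Euclidean distance to every gallery image
        order = np.argsort(dists)                              # ranked gallery list
        matches = (gallery_ids[order] == qid).astype(float)
        rank1 += matches[0]                                    # top-1 hit or miss
        hits = np.cumsum(matches)
        precision = hits / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / max(matches.sum(), 1))  # average precision
    n = len(query_ids)
    return rank1 / n, float(np.mean(aps))                      # rank-1 accuracy, mAP
```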

4.2 Experimental Settings

ImitateModel. Given the Market-1501 and DukeMTMC-reID datasets with camera labels, we employ StarGAN [5] to train an ImitateModel that transfers images between every camera pair across the two datasets. Note that no identity annotation is required during this training. The architecture in [5] is maintained; specifically, the generator is composed of down-sampling, bottleneck and up-sampling layers, while we adopt the PatchGAN architecture [14] as our discriminator, which includes an input layer, hidden layers and output layers. The input images are resized before training and the Adam optimizer [16] is employed. Following the update rule in [11], the generator is trained once after the discriminator parameters have been updated five times. Note that for each image in the two datasets, a set of style-transferred images preserving the identity of the original image is generated to be used in the two source-target settings.
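This update schedule could be organized as in the following hedged sketch, where the generator G, discriminator D, data loader and the StarGAN-style loss functions are passed in as assumptions:

```python
# Rough outline of the GAN update rule described above (one G step per five D steps).
import torch

def train_gan(G, D, loader, g_loss_fn, d_loss_fn, n_critic=5, lr=1e-4):
    g_opt = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))  # beta values assumed
    d_opt = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))
    for step, (real, cam) in enumerate(loader):
        d_opt.zero_grad()
        d_loss_fn(G, D, real, cam).backward()       # discriminator step (adversarial + penalty terms)
        d_opt.step()
        if (step + 1) % n_critic == 0:              # generator trained once per n_critic D updates
            g_opt.zero_grad()
            g_loss_fn(G, D, real, cam).backward()   # generator step (adversarial + reconstruction terms)
            g_opt.step()
```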

Re-ID model training. The input images are resized, and we initialize the learning rate to 0.01 for the layers pre-trained on ImageNet and to 0.1 for the other layers. The learning rate is multiplied by a factor of 0.1 every 40 epochs, and we use the SGD optimizer for a total of 60 epochs. For experiments on the first source-target setting, trained on DukeMTMC-reID and tested on Market-1501, the mini-batch sizes of the source images and imitated target images are set to 64 and 72 for the classification task, while for the commonality mining task mini-batches are drawn from the source images, imitated target images, pseudo target images and target images. The involved hyper-parameters are set to 1, 0.9, 0.8, 0.2, 0.5, and 0.8, respectively. For experiments on the second source-target setting, trained on Market-1501 and tested on DukeMTMC-reID, the mini-batch sizes of the counterparts are set to 64 and 128 for the classification task, and the hyper-parameters are set to 1.4, 1, 1, 0.2, 0.5, and 0.6, respectively. During training, our goal is to minimize the total loss described in Eq. (9). In the test procedure, 2048-dim (pool-5) features are extracted to compute the Euclidean distance between query and gallery images.
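A sketch of this optimizer configuration (assuming PyTorch and the IDEBaseline attribute names from the earlier sketch; momentum and weight decay values are assumptions, not reported by the paper):

```python
# Per-layer learning rates and step decay as described above.
import torch

model = IDEBaseline(num_classes=702)  # e.g., Market-1501 -> Duke setting; identity count assumed
optimizer = torch.optim.SGD([
    {"params": model.features.parameters(), "lr": 0.01},        # ImageNet-pretrained backbone
    {"params": model.embedding_1024.parameters(), "lr": 0.1},   # newly added layers
    {"params": model.classifier.parameters(), "lr": 0.1},
], momentum=0.9, weight_decay=5e-4)                             # assumed values
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)  # x0.1 every 40 of 60 epochs
```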

Method                                    Duke→Market-1501               Market-1501→Duke
                                          R-1  R-5  R-10  R-20  mAP      R-1  R-5  R-10  R-20  mAP
Supervised learning (trained on target)   85.5 94.0 96.1  97.5  66.0     73.2 84.8 88.2  91.0  52.7
Direct transfer (baseline)                46.0 63.0 69.7  76.6  19.1     29.9 46.2 53.4  58.8  15.6
Semi-supervised (dual classification)     64.4 81.8 87.4  91.6  31.4     48.8 63.6 68.6  74.2  25.7
+ one triplet loss                        68.1 84.3 89.1  93.0  36.1     52.1 66.7 71.4  76.5  29.5
+ two triplet losses                      68.2 85.0 89.7  93.3  37.8     52.7 66.9 72.3  77.0  30.1
Ours (full model)                         72.4 87.4 91.4  94.7  40.1     55.6 68.3 72.4  76.5  31.8
Table 2: Ablation studies: component comparisons using Duke / Market as the source dataset and Market / Duke as the target dataset.

4.3 Ablation studies

To highlight the components of the proposed person Re-ID model, we conduct experiments to evaluate their contributions to the cross-domain person Re-ID performance.

Comparisons between supervised learning and direct transfer. The supervised person Re-ID model (baseline), which is trained on the target training set and evaluated on the target test set, shows an excellent performance, as reported in Table 2. However, a large performance drop can be observed when the model is trained on the source training set and tested on the target set directly. For instance, the baseline model trained and tested on Market-1501 achieves a rank-1 accuracy of 85.5% and an mAP of 66.0%, but these decline to 46.0% and 19.1% when it is trained on DukeMTMC-reID and tested on Market-1501. The main reason is the bias between the data distributions of the two domains.

The effectiveness of semi-supervised learning using the ImitateModel. An imitated target dataset transferred from the source domain to the target domain is created by the ImitateModel. It preserves the identities of the source dataset while reflecting the camera styles of the target dataset. The semi-supervised dataset is composed of the imitated target dataset and the target dataset, on which we formulate a dual classification loss to learn a discriminative feature under the target style. As shown in Table 2, the performance of the dual classification loss is consistently improved in all settings. Compared to the direct transfer method, the proposed semi-supervised method obtains an improvement of +18.4% in rank-1 accuracy and a boost of +12.3% in mAP on Market-1501, and +18.9% in rank-1 accuracy and +10.1% in mAP on Duke. This demonstrates the effectiveness of the proposed semi-supervised formulation.

The effectiveness of commonality mining using the ImitateModel. A pseudo target dataset transferred from the target dataset to itself is generated via the ImitateModel. Together with the target dataset, it forms the cluster $P \cup T$. The triplet loss is then imposed over the three clusters $S$, $I$ and $P \cup T$ to capture the commonality among them in the class-style space. The goal is to reduce both the inter-domain bias and the intra-domain difference.

In fact, we first attempt to use only one triplet loss to capture the commonality over the three datasets. As shown in Table 2, a single triplet loss already improves the performance considerably thanks to capturing this commonality. For example, when tested on Market-1501, the loss improves rank-1 accuracy by +3.7% and mAP by +4.7%, and when tested on Duke, it improves rank-1 accuracy by +3.3% and mAP by +3.8%. The consistent improvements indicate the existence of the latent commonality.

Method              Duke→Market-1501          Market-1501→Duke
                    R-1   R-5   R-10  mAP     R-1   R-5   R-10  mAP
LOMO [21] 27.2 41.6 49.1 8.0 12.3 21.3 26.6 4.8
UMDL [27] 34.5 52.6 59.6 12.4 18.5 31.4 37.6 7.3
BoW [38] 35.8 52.4 60.3 14.8 17.2 28.8 34.9 8.3
PTGAN [32] 38.6 - 66.1 - 27.4 - 50.7 -
PUL [8] 45.5 60.7 66.7 20.5 30.0 43.4 48.5 16.4
SPGAN [7] 51.5 70.1 76.8 22.8 41.1 56.6 63.0 22.3
CAMEL [36] 54.5 - - 26.3 - - - -
MMFA [22] 56.7 75.0 81.8 27.4 45.3 59.8 66.3 24.7
SPGAN+LMP [7] 57.7 75.8 82.4 26.7 46.4 62.3 68.0 26.2
TJ-AIDL [31] 58.2 74.8 81.1 26.5 44.3 59.6 65.0 23.0
HHL [42] 62.2 78.8 84.0 31.4 46.9 61.0 66.7 27.2
Ours 72.4 87.4 91.4 40.1 55.6 68.3 72.4 31.8
Table 3: Performance comparisons with state-of-the-art person Re-ID methods using Duke / Market as the source dataset and Market / Duke as the target dataset.

In addition, we also evaluate the impact of combining two triplet losses to capture the underlying commonality. As shown in Table 2, the combination of two triplet losses has only a small influence on the rank-1 accuracy and mAP compared with a single triplet loss. For instance, when tested on Market-1501, this objective gives 68.2% rank-1 accuracy and 37.8% mAP, and when tested on Duke, it gives 52.7% rank-1 accuracy and 30.1% mAP.

Finally, we verify our hypothesis that the underlying commonality of the three datasets can be captured in the form of a triplet loss. It is clear that the full model significantly outperforms the semi-supervised learning variant in all settings. For instance, when tested on Market-1501 with Duke as the source dataset, the full model obtains a rank-1 accuracy of 72.4% and an mAP of 40.1%. Similar improvements can be observed when tested on DukeMTMC-reID, where it obtains a rank-1 accuracy of 55.6% and an mAP of 31.8%. The consistent improvements indicate that the underlying commonality is critical to enhancing the generalization ability of models.

The effectiveness of the ImitateModel. As mentioned above, the inter-dataset bias and the intra-dataset difference can be reduced by using the ImitateModel. In fact, at the dataset level, differing classes and domain styles are significant factors degrading the performance of models trained on the source domain and tested on the target domain. At the camera level, within the same dataset, the style differences between cameras affect the target testing process. The above experiments have proven the success of the ImitateModel in bridging the inter-dataset bias and the intra-dataset difference.

4.4 Comparison with the state-of-the-art methods

We compare our method against a number of state-of-the-art unsupervised learning methods on Market-1501 and DukeMTMC-reID in Table 3, which reports the evaluation results when using these two datasets as the source and target domains respectively. The compared methods are categorized into 4 groups: two hand-crafted methods, LOMO [21] and BoW [38]; three unsupervised methods, UMDL [27], PUL [8] and CAMEL [36]; two unsupervised domain adaptation approaches without GAN, TJ-AIDL [31] and MMFA [22]; and three unsupervised domain adaptation approaches with GAN, PTGAN [32], SPGAN [7] and HHL [42].

The two hand-crafted methods [21, 38] achieve relatively poor accuracy because both are directly applied to the target test dataset, where there is a large inter-domain bias. To overcome this problem, several unsupervised methods that train the model on the target set have been proposed and achieve higher results. For instance, CAMEL [36] gives 54.5% rank-1 accuracy when trained on DukeMTMC-reID and tested on Market-1501.

Compared with unsupervised domain adaptation methods without GAN, our method is preferable. Specifically, when tested on Market-1501, our results outperform all the other methods, achieving a rank-1 accuracy of 72.4% and an mAP of 40.1%. For instance, compared with the recently published TJ-AIDL [31], our results are higher by +14.2% in rank-1 accuracy and +13.6% in mAP. When tested on DukeMTMC-reID, our method is higher than TJ-AIDL by +11.3% in rank-1 accuracy and +8.8% in mAP, and again outperforms all the other methods.

Moreover, compared with unsupervised domain adaptation methods using GAN, our method is also superior. For instance, when tested on Market-1501, compared with the recently published HHL [42], our results are higher by +10.2% in rank-1 accuracy and +8.7% in mAP. When tested on DukeMTMC-reID, our results are higher by +8.7% in rank-1 accuracy and +4.6% in mAP.

5 Conclusion

In this paper, we propose the ImitateModel to address the unsupervised person re-identification (Re-ID) task by generating new domains that shrink the inter-domain bias and the intra-domain difference at the same time. A dual classification loss is proposed to constrain the source and imitated target domains under a semi-supervised framework and learn discriminative features. Furthermore, the latent commonality in the class-style space across domains is exploited to enhance the generalization ability. Experiments are conducted on Market-1501 and DukeMTMC-reID, and the results demonstrate that the proposed architecture outperforms numerous state-of-the-art approaches.

References

  • [1] E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3908–3916, 2015.
  • [2] L. Bazzani, M. Cristani, and V. Murino. Symmetry-driven accumulation of local features for human characterization and re-identification. Computer Vision and Image Understanding, 117(2):130–144, 2013.
  • [3] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 403–412, 2017.
  • [4] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1335–1344, 2016.
  • [5] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
  • [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [7] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 994–1003, 2018.
  • [8] H. Fan, L. Zheng, C. Yan, and Y. Yang. Unsupervised person re-identification: Clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(4):83, 2018.
  • [9] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2360–2367. IEEE, 2010.
  • [10] D. Gray and H. Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In European conference on computer vision, pages 262–275. Springer, 2008.
  • [11] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [13] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [14] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
  • [15] M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, and M. Shah. Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1062–1071, 2018.
  • [16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [17] E. Kodirov, T. Xiang, and S. Gong. Dictionary learning with iterative laplacian regularisation for unsupervised person re-identification. In BMVC, volume 3, page 8, 2015.
  • [18] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In 2012 IEEE conference on computer vision and pattern recognition, pages 2288–2295. IEEE, 2012.
  • [19] W. Li, X. Zhu, and S. Gong. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2285–2294, 2018.
  • [20] Y.-J. Li, F.-E. Yang, Y.-C. Liu, Y.-Y. Yeh, X. Du, and Y.-C. Frank Wang. Adaptation and re-identification network: An unsupervised deep transfer learning approach to person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 172–178, 2018.
  • [21] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2197–2206, 2015.
  • [22] S. Lin, H. Li, C.-T. Li, and A. C. Kot. Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification. arXiv preprint arXiv:1807.01440, 2018.
  • [23] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, and Y. Yang. Improving person re-identification by attribute and identity learning. arXiv preprint arXiv:1703.07220, 2017.
  • [24] Z. Liu, D. Wang, and H. Lu. Stepwise metric promotion for unsupervised video person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 2429–2438, 2017.
  • [25] B. Ma, Y. Su, and F. Jurie. Covariance descriptor based on bio-inspired features for person re-identification and face verification. Image and Vision Computing, 32(6-7):379–390, 2014.
  • [26] S. Paisitkriangkrai, C. Shen, and A. Van Den Hengel. Learning to rank in person re-identification with metric ensembles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1846–1855, 2015.
  • [27] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian. Unsupervised cross-dataset transfer learning for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1306–1315, 2016.
  • [28] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, pages 17–35. Springer, 2016.
  • [29] F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang. Joint learning of single-image and cross-image representations for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1288–1296, 2016.
  • [30] H. Wang, S. Gong, and T. Xiang. Unsupervised learning of generative topic saliency for person re-identification. In BMVC, 2014.
  • [31] J. Wang, X. Zhu, S. Gong, and W. Li. Transferable joint attribute-identity deep learning for unsupervised person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2275–2284, 2018.
  • [32] L. Wei, S. Zhang, W. Gao, and Q. Tian. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 79–88, 2018.
  • [33] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5177–5186, 2018.
  • [34] T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1249–1258, 2016.
  • [35] M. Ye, A. J. Ma, L. Zheng, J. Li, and P. C. Yuen. Dynamic label graph matching for unsupervised video re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 5142–5150, 2017.
  • [36] H.-X. Yu, A. Wu, and W.-S. Zheng. Cross-view asymmetric metric learning for unsupervised person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 994–1002, 2017.
  • [37] R. Zhao, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 144–151, 2014.
  • [38] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pages 1116–1124, 2015.
  • [39] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
  • [40] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, pages 3754–3762, 2017.
  • [41] Z. Zhong, L. Zheng, D. Cao, and S. Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1318–1327, 2017.
  • [42] Z. Zhong, L. Zheng, S. Li, and Y. Yang. Generalizing a person retrieval model hetero-and homogeneously. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–188, 2018.
  • [43] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang. Camera style adaptation for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5157–5166, 2018.
  • [44] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.