Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification (https://arxiv.org/pdf/1711.07027.pdf)
Person re-identification (re-ID) models trained on one domain often fail to generalize well to another. In our attempt, we present a "learning via translation" framework. In the baseline, we translate the labeled images from source to target domain in an unsupervised manner. We then train re-ID models with the translated images by supervised methods. Yet, being an essential part of this framework, unsupervised image-image translation suffers from the information loss of source-domain labels during translation. Our motivation is two-fold. First, for each image, the discriminative cues contained in its ID label should be maintained after translation. Second, given the fact that two domains have entirely different persons, a translated image should be dissimilar to any of the target IDs. To this end, we propose to preserve two types of unsupervised similarities, 1) self-similarity of an image before and after translation, and 2) domain-dissimilarity of a translated source image and a target image. Both constraints are implemented in the similarity preserving generative adversarial network (SPGAN) which consists of a Siamese network and a CycleGAN. Through domain adaptation experiment, we show that images generated by SPGAN are more suitable for domain adaptation and yield consistent and competitive re-ID accuracy on two large-scale datasets.
This paper considers domain adaptation in person re-ID. The re-ID task aims at searching for the images relevant to a query. In our setting, the source domain is fully annotated, while the target domain does not have ID labels. In the community, domain adaptation of re-ID is gaining increasing popularity for two reasons: 1) the labeling process is expensive, and 2) when models trained on one dataset are directly used on another, the re-ID accuracy drops dramatically due to dataset bias. Therefore, supervised, single-domain re-ID methods may be limited in real-world scenarios, where domain-specific labels are not available.
A common strategy for this problem is unsupervised domain adaptation (UDA). But this line of methods assumes that the source and target domains contain the same set of classes. Such an assumption does not hold for person re-ID because different re-ID datasets usually contain entirely different persons (classes). In domain adaptation, a recent trend is image-level domain translation [18, 4, 28]. In the baseline approach, two steps are involved. First, labeled images from the source domain are transferred to the target domain, so that the transferred images have a style similar to that of the target. Second, the style-transferred images and their associated labels are used in supervised learning in the target domain. In the literature, commonly used style transfer methods include [27, 22, 48, 57]. In this paper, we use CycleGAN, following the practice in [27, 18].
In person re-ID, there is a distinct yet unconsidered requirement for the baseline described above: the visual content associated with the ID label of an image should be preserved after image-image translation. In our scenario, such visual content usually refers to the underlying (latent) ID information of a foreground pedestrian. To meet this requirement tailored for re-ID, we need additional constraints on the mapping function. In this paper, we propose a solution to this requirement, motivated by two aspects. First, a translated image, despite its style changes, should contain the same underlying identity as its corresponding source image. Second, in re-ID, the source and target domains contain two entirely different sets of identities. Therefore, a translated image should be different from any image in the target dataset in terms of the underlying ID.
This paper introduces the Similarity Preserving cycle-consistent Generative Adversarial Network (SPGAN), an unsupervised domain adaptation approach which generates images for effective target-domain learning. SPGAN is composed of a Siamese network (SiaNet) and a CycleGAN. Using a contrastive loss, the SiaNet pulls a translated image close to its counterpart in the source, and pushes the translated image away from any image in the target. In this manner, the contrastive loss satisfies the specific requirement in re-ID. Note that the added constraint is unsupervised, i.e., the source labels are not used during domain adaptation. During training, in each mini-batch (batch size = 1), a training image is first used to update the Generator (of CycleGAN), then the Discriminator (of CycleGAN), and finally the layers in SiaNet. Through the coordination between CycleGAN and SiaNet, we are able to generate samples which not only possess the style of the target domain but also preserve their underlying ID information.
Using SPGAN, we are able to create a dataset on the target domain in an unsupervised manner. The dataset inherits the labels from the source domain and thus can be used in supervised learning in the target domain. The contributions of this work are summarized below:
Minor contribution: we present a “learning via translation” baseline for domain adaptation in person re-ID.
Major contribution: we introduce SPGAN to improve the baseline. SPGAN works by preserving the underlying ID information during image-image translation.
Image-image translation. Image-image translation aims at constructing a mapping function between two domains. A representative method is the conditional GAN, which produces impressive translation results using paired training data. However, paired training data is often difficult to acquire. Unpaired image-image translation is thus more applicable. To tackle unpaired settings, a cycle consistency loss is introduced by [22, 48, 57]. Other works propose an unsupervised distance loss for one-sided domain mapping, or a general framework based on a shared latent space assumption. A camera style adaptation method based on CycleGAN has also been proposed for re-ID. Our work aims to find a mapping function between the source domain and target domain, and we are more concerned with similarity preserving translation.
Neural style transfer [12, 23, 43, 21, 5, 24, 19, 25] is another strategy of image-image translation, which aims at replicating the style of one image, while our work focuses on learning the mapping function between two domains, rather than two images.
Unsupervised domain adaptation. Our work relates to unsupervised domain adaptation (UDA), where no labeled target images are available during training. In this community, some methods aim to learn a mapping between source and target distributions [37, 13, 9, 38]. Correlation Alignment (CORAL) proposes to match the mean and covariance of two distributions. Recent methods [18, 4, 28] use an adversarial approach to learn a transformation in the pixel space from one domain to another. Other methods seek to find a domain-invariant feature space [34, 31, 10, 30, 42, 11, 2]. Long et al. and Tzeng et al. use the Maximum Mean Discrepancy (MMD) for this purpose. Ganin et al. and Ajakan et al. introduce a domain confusion loss to learn domain-invariant features. Different from the setting in this paper, most UDA methods assume that class labels are the same across domains, while different re-ID datasets contain entirely different person identities (classes). Therefore, the approaches mentioned above cannot be utilized directly for domain adaptation in re-ID.
Unsupervised person re-ID. Hand-crafted features [32, 14, 7, 33, 26, 51] can be directly employed for unsupervised re-ID, but these feature design methods do not fully exploit the rich information in the data distribution. Some methods are based on saliency statistics [50, 44]. K-means clustering has also been used for learning an unsupervised asymmetric metric. Peng et al. propose asymmetric multi-task dictionary learning for cross-dataset transfer.
Recently, several works focus on label estimation for the unlabeled target dataset. Ye et al. use graph matching for cross-camera label estimation. Fan et al. propose a progressive method based on iterations between K-means clustering and IDE fine-tuning. Liu et al. employ a reciprocal search process to refine the estimated labels. Wu et al. propose a dynamic sampling strategy for one-shot video-based re-ID. Our work seeks to learn re-ID models that can be applied directly to the target domain, and can potentially cooperate with label estimation methods in model initialization. Finally, we would like to refer the reader to the concurrent work named TJ-AIDL, which utilizes additional attribute annotation to learn a feature representation space for the unlabeled target dataset.
Given an annotated dataset from the source domain and an unlabeled dataset from the target domain, our goal is to use the labeled source images to train a re-ID model that generalizes well to the target domain. Figure 2 presents the pipeline of the “learning via translation” framework, which consists of two steps, i.e., source-target image translation for training data creation, and supervised feature learning for re-ID.
Feature learning. With the translated dataset that contains labels, feature learning methods are applied to train re-ID models. Specifically, we adopt the setting of a previously published baseline, in which the rank-1 accuracy and mAP on the fully-supervised Market-1501 dataset are 75.8% and 52.2%, respectively.
The focus of this paper is to improve Step 1, so that with better training samples, the overall re-ID accuracy can be improved. The experiments will validate the proposed translation step with several feature learning methods. A brief summary of the different methods considered in this paper is presented in Table 1. We denote as “Direct Transfer” the method that uses the original source training set, instead of its translated version, for model learning. This method yields the lowest accuracy because the style difference between the source and target is not resolved (to be shown in Table 2). Using CycleGAN and SPGAN to generate a new training set, which is more style-consistent with the target, yields improvement.
|Method||Train. Set||Test. Set||Accuracy|
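CycleGAN introduces two generator-discriminator pairs, $\{G, D_T\}$ and $\{F, D_S\}$, which map a sample from the source (target) domain to the target (source) domain and produce a sample that is indistinguishable from those in the target (source) domain, respectively. For generator $G$ and its associated discriminator $D_T$, the adversarial loss is
$$\mathcal{L}_{Tadv}(G, D_T, p_x, p_y) = \mathbb{E}_{y \sim p_y}\big[(D_T(y) - 1)^2\big] + \mathbb{E}_{x \sim p_x}\big[D_T(G(x))^2\big], \tag{1}$$
where $p_x$ and $p_y$ denote the sample distributions in the source and target domain, respectively. For generator $F$ and its associated discriminator $D_S$, the adversarial loss is
$$\mathcal{L}_{Sadv}(F, D_S, p_y, p_x) = \mathbb{E}_{x \sim p_x}\big[(D_S(x) - 1)^2\big] + \mathbb{E}_{y \sim p_y}\big[D_S(F(y))^2\big]. \tag{2}$$
Because paired training data is lacking, infinitely many alternative mapping functions exist. To reduce the space of possible mapping functions, CycleGAN introduces a cycle-consistent loss, which attempts to recover the original image after a cycle of translation and reverse translation. The cycle-consistent loss is
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_x}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim p_y}\big[\lVert G(F(y)) - y \rVert_1\big]. \tag{3}$$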
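Apart from the cycle-consistent loss and the adversarial loss, we use the target domain identity constraint as an auxiliary for image-image translation. The target domain identity constraint is introduced to regularize the generator to be the identity mapping on samples from the target domain. With $x \sim p_x$ a source sample, $y \sim p_y$ a target sample, $G$ the source-to-target generator and $F$ the target-to-source generator, it is written as
$$\mathcal{L}_{ide}(G, F, p_x, p_y) = \mathbb{E}_{x \sim p_x}\lVert F(x) - x \rVert_1 + \mathbb{E}_{y \sim p_y}\lVert G(y) - y \rVert_1. \tag{4}$$
As noted in prior work, generators $G$ and $F$ may change the color of input images without $\mathcal{L}_{ide}$. In our experiments, we observe that the model may generate unrealistic results without $\mathcal{L}_{ide}$ (Fig. 4(b)). This is undesirable for re-ID feature learning.
Thus, we use $\mathcal{L}_{ide}$ to preserve the color composition between the input and output (see Section 4.3).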
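The loss terms discussed in this subsection can be sketched numerically. The snippet below is a minimal NumPy sketch, not the authors' TensorFlow code: the least-squares form of the adversarial term, the use of a mean absolute error for the cycle and identity terms, and the names `G`, `F`, `adversarial_loss`, `cycle_loss` and `identity_loss` are assumptions made for illustration, with the generators passed in as plain callables.

```python
import numpy as np

def adversarial_loss(d_real, d_fake):
    """Least-squares GAN loss: real scores are pushed to 1, fake scores to 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def cycle_loss(x, y, G, F):
    """Mean absolute error after a full translation cycle, in both directions."""
    return np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))

def identity_loss(x, y, G, F):
    """Penalizes G for changing target samples and F for changing source samples."""
    return np.mean(np.abs(G(y) - y)) + np.mean(np.abs(F(x) - x))

# Toy check with hypothetical "generators" that only shift brightness.
G = lambda img: img + 0.1   # source -> target (assumed direction)
F = lambda img: img - 0.1   # target -> source (assumed direction)
x = np.zeros((2, 4, 4, 3))  # fake source batch
y = np.ones((2, 4, 4, 3))   # fake target batch

print(cycle_loss(x, y, G, F))     # F undoes G, so this is close to 0
print(identity_loss(x, y, G, F))  # each generator shifts by 0.1, so close to 0.2
```

In this toy case the cycle term vanishes while the identity term does not, which illustrates why the two constraints are complementary: a generator pair can be perfectly cycle-consistent and still alter every image it touches.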
Applied in person re-ID, similarity preserving is an essential function to generate improved samples for domain adaptation. As analyzed in Section 1, we aim to preserve the ID-related information for each translated image. We emphasize that such information should not be the background or image style, but should be underlying and latent. To fulfill this goal, we integrate a SiaNet with CycleGAN, as shown in Fig. 3. During training, CycleGAN learns a mapping function between the two domains, and SiaNet learns a latent space that constrains the learning of the mapping function.
We train the SiaNet with a contrastive loss,
$$\mathcal{L}_{con}(i, x_1, x_2) = (1 - i)\,\{\max(0, m - d)\}^2 + i\,d^2, \tag{5}$$
where $x_1$ and $x_2$ are a pair of input vectors, $d$ denotes the Euclidean distance between the normalized embeddings of the two input vectors, and $i$ represents the binary label of the pair: $i = 1$ if $x_1$ and $x_2$ are a positive pair, and $i = 0$ if they are a negative pair.
$m$ is the margin that defines the separability in the embedding space. When $m = 0$, the loss of negative training pairs is not back-propagated in the system. When $m > 0$, both positive and negative sample pairs are considered. A larger $m$ means that the loss of negative training samples has a higher weight in back-propagation.
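The behavior of the margin described above can be checked with a small example. The following is a minimal pure-Python sketch that assumes the standard contrastive loss form (squared distance for positive pairs, squared hinge on margin minus distance for negative pairs); the function names and the default margin are illustrative choices, and the embeddings are assumed to be non-zero vectors.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def contrastive_loss(x1, x2, label, margin=2.0):
    """label = 1 for a positive pair, 0 for a negative pair."""
    u, w = l2_normalize(x1), l2_normalize(x2)
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, w)))
    positive_term = label * d ** 2
    negative_term = (1 - label) * max(0.0, margin - d) ** 2
    return positive_term + negative_term

same = ([1.0, 0.0], [1.0, 0.0])
print(contrastive_loss(*same, label=1))              # identical positive pair: no penalty
print(contrastive_loss(*same, label=0, margin=2.0))  # identical negative pair: maximal penalty
print(contrastive_loss(*same, label=0, margin=0.0))  # margin 0 switches negative pairs off
```

Because the embeddings are normalized, the distance lies in [0, 2], so a margin of 2 keeps every negative pair active until the two embeddings point in opposite directions.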
Training image pair selection. In Eq. 5, the contrastive loss uses binary labels of input image pairs. The design of the pair similarities reflects the “self-similarity” and “domain-dissimilarity” principles. Note that, we select training pairs in an unsupervised manner, so that we use the contrastive loss without additional annotations.
Formally, CycleGAN has two generators, i.e., generator $G$, which maps source-domain images to the style of the target domain, and generator $F$, which maps target-domain images to the style of the source domain. Suppose two samples denoted as $x_S$ and $x_T$ come from the source domain and target domain, respectively. Given $G$ and $F$, we define two positive pairs: 1) $x_S$ and $G(x_S)$, 2) $x_T$ and $F(x_T)$. In either image pair, the two images contain the same person; the only difference is that they have different styles. In the learning procedure, we encourage the whole network to pull these two images close.
On the other hand, for generators $G$ and $F$, we also define two types of negative training pairs: 1) $G(x_S)$ and $x_T$, 2) $F(x_T)$ and $x_S$. This design of negative training pairs is based on the prior knowledge that datasets in different re-ID domains have entirely different sets of IDs. Thus, a translated image should have a different ID from any target image. In this manner, the network pushes two dissimilar images away from each other. Training pairs are shown in Fig. 1. Some positive pairs are also shown in (a) and (d) of each column in Fig. 4.
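The pair selection rule can be written down compactly. The sketch below is illustrative only: the names `build_training_pairs`, `G` (source to target) and `F` (target to source) are assumptions, and the exact composition of the second negative pair is inferred by symmetry from the description above. Labels follow the contrastive loss convention of 1 for positive and 0 for negative pairs, and no ID annotation is involved.

```python
def build_training_pairs(x_s, x_t, G, F):
    """Return (image_a, image_b, label) triples for one source/target sample.

    label 1 = positive pair (same person, different style),
    label 0 = negative pair (persons from different domains).
    """
    g_xs = G(x_s)  # source image rendered in target style
    f_xt = F(x_t)  # target image rendered in source style
    return [
        (x_s, g_xs, 1),  # self-similarity
        (x_t, f_xt, 1),  # self-similarity
        (g_xs, x_t, 0),  # domain-dissimilarity
        (f_xt, x_s, 0),  # domain-dissimilarity
    ]

# Toy usage with string stand-ins for images and generators.
pairs = build_training_pairs(
    "xS", "xT",
    G=lambda img: "G(" + img + ")",
    F=lambda img: "F(" + img + ")",
)
for a, b, label in pairs:
    print(a, b, label)
```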
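Overall objective function. The final SPGAN objective can be written as
$$\mathcal{L}_{sp} = \mathcal{L}_{Tadv} + \mathcal{L}_{Sadv} + \lambda_1 \mathcal{L}_{cyc} + \lambda_2 \mathcal{L}_{ide} + \lambda_3 \mathcal{L}_{con}, \tag{6}$$
where $\mathcal{L}_{Tadv}$ and $\mathcal{L}_{Sadv}$ are the adversarial losses, $\mathcal{L}_{cyc}$ is the cycle-consistent loss, $\mathcal{L}_{ide}$ is the identity loss, $\mathcal{L}_{con}$ is the contrastive loss, and $\lambda_t$, $t \in \{1, 2, 3\}$, controls the relative importance of the four objectives. The first three losses belong to the CycleGAN formulation, and the contrastive loss induced by SiaNet imposes a new constraint on the system.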
SPGAN training procedure. In the training phase, SPGAN is divided into three components which are learned alternately: the generators, the discriminators and the SiaNet. When the parameters of two components are fixed, the parameters of the third component are updated. We train SPGAN until convergence or until the maximum number of iterations is reached.
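The alternating schedule can be sketched as a plain loop. This is a framework-free skeleton under stated assumptions: the three update callables are hypothetical placeholders, each assumed to keep the other two components fixed, and the order generators, discriminators, SiaNet follows the per-mini-batch order given in the introduction.

```python
def train_spgan(batches, update_generators, update_discriminators,
                update_sianet, max_iters):
    """Alternate the three updates once per mini-batch, up to max_iters steps."""
    steps = 0
    for batch in batches:
        if steps >= max_iters:
            break
        update_generators(batch)      # discriminators and SiaNet held fixed
        update_discriminators(batch)  # generators and SiaNet held fixed
        update_sianet(batch)          # generators and discriminators held fixed
        steps += 1
    return steps

# Toy usage: record the order in which the components are updated.
log = []
done = train_spgan(
    batches=["img0", "img1", "img2"],
    update_generators=lambda b: log.append(("G", b)),
    update_discriminators=lambda b: log.append(("D", b)),
    update_sianet=lambda b: log.append(("S", b)),
    max_iters=2,
)
print(done, log)
```

A convergence test is omitted here; in practice it would replace or complement the `max_iters` check.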
Feature learning is the second step of the “learning via translation” framework. Once we have the style-transferred dataset composed of the translated images and their associated labels, the feature learning step is the same as in supervised methods. Since we mainly focus on Step 1 (source-target image translation), we adopt the baseline ID-discriminative Embedding (IDE) following the practice in [52, 53, 54]. We employ ResNet-50 as the base model and only modify the output dimension of the last fully-connected layer to the number of training identities. During testing, given an input image, we extract the 2,048-dim Pool5 vector for retrieval under the Euclidean distance.
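Retrieval under the Euclidean distance amounts to sorting gallery descriptors by their distance to the query descriptor. The snippet below is a minimal NumPy sketch with a hypothetical helper name (`rank_gallery`) and 2-dimensional toy vectors standing in for the 2,048-dim Pool5 features.

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted from nearest to farthest (Euclidean)."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    return np.argsort(dists, kind="stable")

query = np.array([0.0, 0.0])
gallery = np.array([[3.0, 4.0],   # distance 5
                    [1.0, 0.0],   # distance 1
                    [0.0, 2.0]])  # distance 2
print(rank_gallery(query, gallery))  # nearest gallery item comes first
```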
Local Max Pooling. To further improve re-ID performance on the target dataset, we introduce a feature pooling method named local max pooling (LMP). It works on a well-trained IDE model and can reduce the impact of noisy signals incurred by the fake translated images. In the original ResNet-50, global average pooling (GAP) is conducted on Conv5. In our proposal (Fig. 5), we first partition the Conv5 feature maps into $P$ parts horizontally, and then conduct global max/avg pooling on each part. Finally, we concatenate the output of global max pooling (GMP) or GAP of each part as the final feature representation. The procedure is nonparametric, and can be directly used in the testing phase. In the experiment, we will compare local max pooling and local average pooling, and demonstrate the superiority of the former (LMP).
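The pooling step can be sketched in a few lines. This is a minimal NumPy sketch under stated assumptions: the feature map is laid out as (height, width, channels), "horizontal parts" are taken to be stripes along the height axis, and the function name `local_pooling` is illustrative.

```python
import numpy as np

def local_pooling(feature_map, num_parts, mode="max"):
    """Pool each horizontal stripe of an (H, W, C) map and concatenate.

    Returns a vector of length num_parts * C.
    """
    stripes = np.array_split(feature_map, num_parts, axis=0)
    reduce = np.max if mode == "max" else np.mean
    return np.concatenate([reduce(s, axis=(0, 1)) for s in stripes])

# Toy Conv5-like map: 8 rows, 4 columns, 2 channels.
fmap = np.arange(8 * 4 * 2, dtype=float).reshape(8, 4, 2)

lmp = local_pooling(fmap, num_parts=2, mode="max")  # local max pooling
gap = local_pooling(fmap, num_parts=1, mode="avg")  # one part + average = GAP
print(lmp.shape, gap.shape)
```

With one part and average pooling the function reduces to global average pooling, so the same code covers both the ResNet-50 default and the stripe-wise variants.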
We select two large-scale re-ID datasets for experiment, i.e., Market-1501 and DukeMTMC-reID [36, 53]. Market-1501 is composed of 1,501 identities, 12,936 training images and 19,732 gallery images (with 2,793 distractors). It is split into 751 identities for training and 750 identities for testing. Each identity is captured by at most 6 cameras. All the bounding boxes are produced by DPM. DukeMTMC-reID is a re-ID version of the DukeMTMC dataset. It contains 34,183 image boxes of 1,404 identities: 702 identities are used for training and the remaining 702 for testing. There are 2,228 queries and 17,661 database images. For both datasets, we adopt rank-1 accuracy and mAP for re-ID evaluation. Sample images of the two datasets are shown in Fig. 6.
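For readers unfamiliar with the two metrics, the snippet below gives a simplified sketch of rank-1 accuracy and average precision for a single query (mAP is the mean of the latter over all queries). It is an assumption-laden illustration rather than the official evaluation code: it takes a ranked list of match flags as input and does not implement protocol details such as discarding gallery images from the same camera as the query.

```python
def average_precision(ranked_matches):
    """ranked_matches[k] is True if the k-th ranked gallery image shares the query ID."""
    hits, precision_sum = 0, 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0

def rank1(ranked_matches):
    """1.0 if the top-ranked gallery image is a correct match, else 0.0."""
    return 1.0 if ranked_matches and ranked_matches[0] else 0.0

ranked = [False, True, True]      # first hit at rank 2, second at rank 3
print(rank1(ranked))              # top result is wrong
print(average_precision(ranked))  # (1/2 + 2/3) / 2
```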
|CycleGAN (basel.) + identity loss||38.5||54.6||60.8||66.6||19.9||48.1||66.2||72.7||80.1||20.7|
|SPGAN + LMP||46.9||62.6||68.5||74.0||26.4||58.1||76.0||82.7||87.9||26.9|
SPGAN training and testing.
We use TensorFlow to train SPGAN using the training images of Market-1501 and DukeMTMC-reID. Note that we do not use any ID annotation during the training procedure. In all experiments, the loss weights in Eq. 6 and the margin in Eq. 5 are set empirically. The initial learning rate is 0.0002, and training stops after 5 epochs. During the testing procedure, we employ generator $G$ for Market-1501 → DukeMTMC-reID translation and generator $F$ for DukeMTMC-reID → Market-1501 translation. The translated images are used to fine-tune the model trained on source images.
For CycleGAN, we adopt the architecture released by its authors. SiaNet contains 4 convolutional layers, 4 max pooling layers and 1 fully connected (FC) layer, configured as follows: (1) Conv., stride = 2, #feature maps = 64; (2) Max pooling, stride = 2; (3) Conv., stride = 2, #feature maps = 128; (4) Max pooling, stride = 2; (5) Conv., stride = 2, #feature maps = 256; (6) Max pooling, stride = 2; (7) Conv., stride = 2, #feature maps = 512; (8) Max pooling, stride = 2; (9) FC, output dimension = 128.
Feature learning for re-ID. As described in Section 3.3, we adopt IDE for feature learning. Specifically, ResNet-50 pretrained on ImageNet is fine-tuned on the translated training set. We modify the output of the last fully-connected layer to 751 and 702 for Market-1501 and DukeMTMC-reID, respectively. We use mini-batch SGD to train CNN models on a Tesla K80 GPU. Training parameters such as batch size, maximum number of epochs, momentum and gamma are set to 16, 50, 0.9 and 0.1, respectively. The initial learning rate is set to 0.001 and decays to 0.0001 after 40 epochs.
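The learning rate schedule described above is a single step decay. A minimal sketch follows, assuming zero-based epoch indices and that the decay takes effect from epoch 40 onward; the function name and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.1, decay_epoch=40):
    """Step schedule: base_lr before decay_epoch, base_lr * gamma afterwards."""
    return base_lr * gamma if epoch >= decay_epoch else base_lr

# Learning rate for each of the 50 training epochs.
schedule = [learning_rate(e) for e in range(50)]
print(schedule[0], schedule[39], schedule[40], schedule[49])
```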
Comparison between supervised learning and direct transfer. The supervised learning method and the direct transfer method are specified in Table 1. When comparing the two methods in Table 2, we can clearly observe a large performance drop when directly using a source-trained model on the target domain. For instance, the rank-1 accuracy of the ResNet-50 model trained and tested on Market-1501 is far higher than that of the same model trained on DukeMTMC-reID and tested on Market-1501. A similar drop can be observed when DukeMTMC-reID is used as the target domain, which is consistent with previously reported experiments. The reason behind the performance drop is the bias of data distributions in different domains.
The effectiveness of the “learning via translation” baseline using CycleGAN. In this baseline domain adaptation approach (Section 3.1), we first translate the labeled images from the source domain to the target domain and then use the translated images to train re-ID models. As shown in Table 2, this baseline framework effectively improves the re-ID performance on the target dataset. Compared to the direct transfer method, the CycleGAN transfer baseline improves rank-1 accuracy on Market-1501. When tested on DukeMTMC-reID, the performance gain is +5.0% in rank-1 accuracy. Through such an image-level domain adaptation method, effective domain adaptation baselines can be learned.
The impact of the target domain identity constraint. We conduct experiments to verify the influence of the identity loss on performance in Table 2. We arrive at mixed observations. On the one hand, on DukeMTMC-reID, compared with the CycleGAN baseline, CycleGAN with the identity loss achieves similar rank-1 accuracy and mAP. On the other hand, on Market-1501, CycleGAN with the identity loss improves both rank-1 accuracy and mAP. The reason is that Market-1501 has a larger inter-camera variance. When translating Duke images to the Market style, the translated images may be more prone to translation errors induced by the camera variances. Therefore, the identity loss is more effective when Market is the target domain.
As shown in Fig. 4, this loss prevents CycleGAN from generating strangely colored images.
SPGAN effect. On top of the CycleGAN baseline, we replace CycleGAN with SPGAN. The effectiveness of the proposed similarity preserving constraint can be seen in Table 2. On DukeMTMC-reID, the similarity preserving constraint improves over CycleGAN with the identity loss in both rank-1 accuracy and mAP, and gains are also observed on Market-1501. The working mechanism of SPGAN consists in preserving the underlying visual cues associated with the ID labels. The consistent improvement suggests that this working mechanism is critical for generating suitable samples for training in the target domain. Examples of translated images by SPGAN are shown in Fig. 6.
Comparison of different feature learning methods. In Step 2, we evaluate three feature learning methods, i.e., IDE (described in Section 3.3), IDE+, and SVDNet. Results are shown in Fig. 7. An interesting observation is that, while IDE+ and SVDNet are superior to IDE under the scenario of “Direct Transfer”, the three learning methods are basically on par with each other when using training samples generated by SPGAN.
A possible explanation is that some translated images are noisy, which has a large effect on better learning methods.
Sensitivity of SPGAN to key parameters. The margin $m$ defined in Eq. 5 is a key parameter. If $m = 0$, the loss of negative pairs is not back-propagated. If $m$ gets larger, the weight of negative pairs in the loss calculation increases. We conduct experiments to verify the impact of $m$, and results are shown in Table 2. When turning off the contribution of negative pairs in Eq. 5 ($m = 0$), SPGAN only marginally improves the accuracy on Market-1501, and even compromises the system on Duke. When increasing $m$ to 2, we have much superior accuracy. This indicates that the negative pairs are critical to the system.
Moreover, we evaluate the impact of the weight of the contrastive loss in Eq. 6 on Market-1501. This weight controls the relative importance of the proposed similarity preserving constraint. As shown in Fig. 9, the proposed constraint is effective when compared with a weight of zero, but a larger weight does not bring more gains in accuracy; the best-performing value is indicated in Fig. 9.
Local max pooling.
We apply LMP on the Conv5 layer to mitigate the influence of noise. Note that LMP is directly adopted in the feature extraction step for testing without fine-tuning. We empirically study how the number of parts and the pooling mode affect the performance. The experiment is conducted on SPGAN. The performance with various numbers of parts ($P$) and different pooling modes (max or average) is provided in Table 3. When we use average pooling and a single part ($P = 1$), we recover the original GAP used in ResNet-50. From these results, we speculate that with more parts, a finer partition leads to more discriminative descriptors and thus higher re-ID accuracy.
Moreover, we test LMP on supervised learning and domain adaptation scenarios with three feature learning methods, i.e., IDE, IDE+, and SVDNet. As shown in Fig. 9, LMP does not guarantee stable improvement on supervised learning, as observed for IDE+ and SVDNet. However, when applied in the scenario of domain adaptation, LMP yields improvement over IDE, IDE+, and SVDNet. The superiority of LMP probably lies in that max pooling filters out some detrimental signals in the descriptor induced by noisy translated images.
We compare the proposed method with the state-of-the-art unsupervised learning methods on Market-1501 and DukeMTMC-reID in Table 4 and Table 5, respectively.
Market-1501. On Market-1501, we first compare our results with two hand-crafted features, i.e., Bag-of-Words (BoW) and local maximal occurrence (LOMO). These two hand-crafted features are directly applied on the test dataset without any training process, and their inferiority can be clearly observed. We also compare with existing unsupervised methods, including Clustering-based Asymmetric MEtric Learning (CAMEL), Progressive Unsupervised Learning (PUL), and UMDL. The results of UMDL are reproduced by Fan et al. In the single-query setting, we achieve rank-1 accuracy = 51.5% and mAP = 22.8%, which outperforms the second best method by +6.0% in rank-1 accuracy. In the multiple-query setting, we arrive at rank-1 accuracy = 57.0%, which is +2.5% higher than CAMEL. The comparisons indicate the competitiveness of the proposed method on Market-1501.
DukeMTMC-reID. On DukeMTMC-reID, we compare the proposed method with BoW, LOMO, UMDL, and PUL under the single-query setting (there is no multiple-query setting in DukeMTMC-reID). The result obtained by the proposed method is rank-1 accuracy = 41.1%, mAP = 22.3%. Compared with the second best method, i.e., PUL, our result is +11.1% higher in rank-1 accuracy. This supports the superiority of SPGAN.
This paper focuses on domain adaptation in person re-ID. When models trained on one dataset are directly transferred to another dataset, the re-ID accuracy drops dramatically due to dataset bias. To achieve improved performance in the new dataset, we present a “learning via translation” framework for domain adaptation, characterized by 1) unsupervised image-image translation and 2) supervised feature learning. We further propose that the underlying (latent) ID information for the foreground pedestrian should be preserved after image-image translation. To meet this requirement tailored for re-ID, we introduce the unsupervised self-similarity and domain-dissimilarity for similarity preserving image generation (SPGAN). We show that SPGAN better qualifies the generated images for domain adaptation and yields consistent improvement over the CycleGAN.
Acknowledgment. Weijian Deng, Qixiang Ye, and Jianbin Jiao are supported by the NSFC under Grant 61671427, 61771447, and Beijing Municipal Science and Technology Commission. Liang Zheng is the recipient of a SIEF STEM+ Business Fellowship, and Yi Yang is the recipient of the Google Faculty Research Award. We thank Pengxu Wei for many helpful comments.
Tensorflow: A system for large-scale machine learning. In OSDI, 2016.
Unsupervised domain adaptation by backpropagation. In ICML, 2015.
Domain-adversarial training of neural networks. Journal of Machine Learning Research, 2016.
Image style transfer using convolutional neural networks. In CVPR, 2016.
Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
Unsupervised cross-dataset transfer learning for person re-identification. In CVPR, 2016.
European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, 2016.
Transferable joint attribute-identity deep learning for unsupervised person re-identification. In CVPR, 2018.