Domain Adaptive Attention Model for Unsupervised Cross-Domain Person Re-Identification

05/25/2019 ∙ by Yangru Huang, et al. ∙ Beijing Jiaotong University

* Equal contribution.  † Corresponding author.

Person re-identification (Re-ID) across multiple datasets is a challenging yet important task due to the possibly large distinctions between different datasets and the lack of training samples in practical applications. This work proposes a novel unsupervised domain adaptation framework which transfers discriminative representations from the labeled source domain (dataset) to the unlabeled target domain (dataset). We propose to formulate the domain adaptation task as a one-class classification problem with a novel domain similarity loss. Given the feature map of any image from a backbone network, a novel domain adaptive attention model (DAAM) first automatically learns to separate the feature map of an image into a domain-shared feature (DSH) map and a domain-specific feature (DSP) map simultaneously. Specifically, a residual attention mechanism is designed to model the DSP feature map in order to avoid negative transfer. Then, a DSH branch and a DSP branch are introduced to learn the DSH and DSP feature maps respectively. To reduce the domain divergence caused by the source and target datasets being collected from different environments, we project the DSH feature maps from different domains into a new nominal domain, and a novel domain similarity loss is proposed based on one-class classification. In addition, a novel unsupervised person Re-ID loss is proposed to make full use of the unlabeled target data. Extensive experiments on the Market-1501 and DukeMTMC-reID benchmarks demonstrate state-of-the-art performance of the proposed method. Code will be released to facilitate further studies on the cross-domain person re-identification task.


1 Introduction

The task of person re-identification (Re-ID) is to match people across non-overlapping camera views. It has become one of the most studied problems in video surveillance due to its great potential for security and safety management applications. It is a challenging task because a person's appearance often changes dramatically across camera views owing to changes in body pose, view angle, occlusion and illumination conditions.

Figure 1: Top: The labeled source dataset (DukeMTMC-reID) and the unlabeled target dataset (Market-1501) are collected from different environments, where the illuminations and backgrounds are widely different. Bottom: The mAP of the Re-ID model trained on DukeMTMC-reID is high when tested in the same environment (supervised learning), but drops drastically when the model is applied to the Market-1501 dataset directly (direct transfer). In contrast, the proposed method transfers discriminative representations effectively.

In order to address these issues, most existing person Re-ID methods are designed on supervised learning [45, 46, 28, 14, 47, 43, 44, 19], which aims to learn a discriminative representation from labeled data. Recently, benefiting from the success of deep learning [20, 6, 18], supervised learning based methods have obtained significant performance improvements [35, 7]. However, these methods require a large amount of labeled data to train the Re-ID model, which is often unavailable in real-world applications [31]. To make person Re-ID more scalable, one solution is to utilize unlabeled data, which are abundant in the context of Re-ID in a busy public space because thousands of people pass through each camera view every day. To apply a Re-ID model to such unlabeled data, a natural approach is to train the model on an existing labeled source dataset collected elsewhere, and then apply it directly to the current unlabeled target data. However, as shown in Fig. 1, since different datasets are often collected from different environments with widely different illuminations, backgrounds and image qualities, the performance of a Re-ID model trained on the source dataset often drops drastically when applied to the target data.

To handle this challenge, we formulate the person Re-ID task as an unsupervised domain adaptation (UDA) problem [11, 15, 16, 26], where the existing labeled dataset and the current unlabeled dataset are modeled as the source and target domains, respectively. The source and target domains share an identical feature space with the same dimension but contain totally different person identities (IDs), and the domain divergence [3, 2] is caused by the differences in illuminations, backgrounds and image qualities between the two datasets. The goal of UDA is to learn discriminative representations from the source domain that remain effective in the target domain.

To achieve this goal, given the feature map of any image from a backbone network, a novel domain adaptive attention model (DAAM) is proposed to generate a domain-shared (DSH) feature map from the original feature map, which is discriminative to different persons and robust to different domains. The DSH feature map is transferred across domains and is used to support the Re-ID task in the target domain. In addition, to alleviate negative transfer caused by the domain divergence, the domain-specific (DSP) feature map is modeled by the residual of the domain adaptive attention. Then, a DSH branch and a DSP branch are introduced to the network to learn the DSH and DSP feature maps respectively: 1) For the DSH branch, we project the DSH feature maps from the source and target domains into a new nominal domain to reduce the domain divergence, and a novel domain similarity loss is proposed based on one-class classification [32, 34]. Simultaneously, the DSH branch is trained on the source domain with ordinary supervised person Re-ID techniques [47] to make the DSH feature map discriminative to different person IDs. In addition, to make full use of the unlabeled target data, a clustering method is utilized to approximate weak labels of the target data, and a novel person Re-ID loss is proposed for the unlabeled target domain based on the approximated weak labels and a weighted cross-entropy loss. 2) For the DSP branch, a domain-specific loss is introduced to ensure the DSP feature map is distinguishable between domains, and an orthogonality loss [4] is utilized to encourage the DSH and DSP feature maps to be separable and complementary to each other. Although the DSP feature map is useless for the person Re-ID task in the target domain, it ensures that the domain-specific information is accounted for in the model rather than acting as a distractor that corrupts the learned DSH feature map.

To sum up, the main technical contributions of this paper are three-fold:

- A novel domain adaptive attention model is proposed to automatically separate the feature map of an image into a domain-shared feature map and a domain-specific feature map simultaneously.

- A novel formulation of the domain adaptation task as a one-class classification problem is proposed, together with a domain similarity loss.

- A novel unsupervised person Re-ID loss is proposed for the unlabeled target domain based on a clustering process and a weighted cross-entropy loss.

Extensive experimental analyses and evaluations on the Market-1501 and DukeMTMC-reID benchmarks demonstrate that the proposed method achieves state-of-the-art performance. To facilitate further studies on the cross-domain person re-identification task, the source code and trained models of this work will be released.

2 Related Works

Person Re-ID has been one of the most studied problems due to its important applications, and most existing works are designed on supervised learning frameworks [29, 19, 35, 39, 21, 5]. These methods require sufficient labeled images across cameras to learn a discriminative representation or matching function. However, it is often difficult to label person Re-ID data in many real-world applications, which severely limits the scalability of these supervised learning based methods.

To solve this scalability issue, a natural solution is unsupervised domain adaptation, which aims to transfer useful Re-ID information from a labeled source domain (dataset) to an unlabeled target domain (dataset). However, most existing unsupervised domain adaptation methods [38, 25, 26, 12, 36, 4, 37] are designed for cases where the source and target domains share the same recognition task (i.e., the same set of classes), which does not hold for the person Re-ID problem, as different datasets contain totally different person identities.

To address this problem, a few unsupervised domain adaptation methods [27, 31, 40, 10, 23] have recently been proposed for person Re-ID. The model proposed in [27] utilizes cross-domain ranking SVMs, and UDML [31] presents an asymmetric multi-task dictionary learning approach which attempts to transfer a view-invariant representation from a labeled source dataset to an unlabeled target dataset. However, both methods are designed around hand-crafted features and are less effective than the proposed deep model when a large number of training samples is available. Wang et al. [40] introduce transferable joint attribute-identity deep learning to simultaneously learn an attribute-semantic and identity-discriminative feature representation space which is transferable to the target domain; however, the method requires additional attribute annotations. PUL [10] generates weak labels for unlabeled target data by iteratively clustering, selecting samples and fine-tuning the CNN. Different from PUL [10], we separate the feature map into the domain-shared (DSH) feature map and the domain-specific (DSP) feature map with the proposed domain adaptive attention model (DAAM), and the DSP feature map is modeled to avoid negative transfer. ARN [23] extends the domain separation network [4] to the person Re-ID task, leveraging encoders to model domain-shared and domain-specific features. Different from these two methods, our model uses the DAAM to separate the feature map, which is more direct than learning separate encoders. In addition, a domain similarity loss based on one-class classification is proposed to better reduce the domain divergence, and a novel person Re-ID loss is proposed to make full use of the unlabeled target data. Besides, several methods employ GANs to transfer the style of person images across domains [41, 8, 50, 24, 49]. The proposed method does not contradict those GAN-based methods: they aim to increase the number of cross-domain training samples with style transfer, while our goal is to make better use of existing samples. Furthermore, their generated images can also be used as training samples in our method.

3 Methodology

Suppose there are two types of training data: a labeled source dataset $\mathcal{S} = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ (the proposed method can be easily extended to multiple source datasets; more details of multi-source domain adaptation are given in the supplemental material) and an unlabeled target dataset $\mathcal{T} = \{x_j^t\}_{j=1}^{n_t}$, where $x_i^s$ and $x_j^t$ are pedestrian images from the source and target datasets respectively. The person ID $y_i^s$ of each image is only available in the source training data. The source and target datasets are collected from different environments and contain totally different person identities (IDs). To transfer dataset-shared discriminative representations from the source dataset to the target dataset, the source and target datasets are represented by the source and target domains respectively, and the person Re-ID task is formulated as an unsupervised domain adaptation problem [11, 15, 16, 26], where the source and target domains share an identical feature space with the same dimension but contain totally different person identities. To solve this problem, as shown in Fig. 3, a novel deep network is designed which consists of four modules: a backbone network, a domain adaptive attention module, a domain-shared branch and a domain-specific branch. ResNet-50 is chosen as the backbone network, the same as most state-of-the-art person Re-ID methods [35, 40, 23, 10, 19]. Given any image $x$, the output of the backbone network is the corresponding feature map $F \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$ and $C$ are the height, width and number of channels, respectively. More details of the domain adaptive attention module, the domain-shared branch and the domain-specific branch are given below.

Figure 2:   Diagram of the domain adaptive attention module. Spatial attention learns the domain-shared regions, and channel attention pays more attention to the domain-shared kernels.

Figure 3: The framework of the proposed method. The domain adaptive attention module separates the feature map into domain-shared (DSH) and domain-specific (DSP) feature maps simultaneously. Then, a DSH branch and a DSP branch are introduced to learn these two feature maps respectively. The Re-ID loss and domain similarity loss are adopted on the DSH branch to make the DSH feature discriminative to different persons and transferable across domains. In contrast, the domain-specific loss makes the DSP branch capture the domain-distinguishable information to avoid distracting the DSH feature. Finally, the soft orthogonality constraint is introduced to make these two parts complementary and separable.

3.1 Domain Adaptive Attention Module

The feature map $F$ of any image is naturally divided into a domain-shared discriminative feature map $F_{dsh}$ and a domain-specific feature map $F_{dsp}$, that is,

$F = F_{dsh} + F_{dsp}$.   (1)

To make the model easy to learn, a domain-shared attention map $A$ is proposed to approximate $F_{dsh}$ and $F_{dsp}$ by:

$F_{dsh} = A \odot F, \quad F_{dsp} = (1 - A) \odot F$,   (2)

where $\odot$ denotes the element-wise product. Inspired by [22], $A$ is decomposed into a spatial attention $A^{sp} \in \mathbb{R}^{H \times W}$ and a channel attention $A^{ch} \in \mathbb{R}^{C}$ as:

$A = A^{sp} \otimes A^{ch}$,   (3)

where $\otimes$ broadcasts the two attention maps over the channel and spatial dimensions respectively.

Therefore, a spatial attention module and a channel attention module are proposed to learn $A^{sp}$ and $A^{ch}$ respectively. They are highly complementary and compatible with each other in functionality: the former selects the domain-shared body regions, while the latter pays more attention to the domain-shared convolutional kernels. The structure of the attention module is shown in Fig. 2.

Spatial Attention Module. The input of the module is the feature map $F$, and the output is the spatial attention $A^{sp}$. The spatial attention is produced by exploiting the inter-pixel relationship of the feature map and is shared by all channels. Hence, a global average-pooling operation is first applied across all channels; then a convolution layer with stride 2 and an upsampling layer are introduced to select spotlight regions based on spatial location. In order to effectively integrate with the channel information later, a further convolution layer is added to automatically learn an adaptive scale.

Channel Attention Module. The input of the module is the feature map $F$, and the output is the channel attention $A^{ch}$. The channel attention is produced by exploiting the inter-channel relationship of the feature map and is independent of pixel locations. Hence, global average pooling (GAP) is first applied over all spatial positions, and the subsequent process is performed as follows:

$A^{ch} = \sigma(W_2\, \delta(W_1\, \mathrm{GAP}(F)))$,   (4)

where $W_1 \in \mathbb{R}^{C/r \times C}$ and $W_2 \in \mathbb{R}^{C \times C/r}$ are the transformation matrices, $\delta$ and $\sigma$ denote the ReLU and sigmoid functions, and $r$ is the reduction ratio.
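The full module can be sketched in PyTorch as below. This is a minimal sketch consistent with Eqs. (1)-(4): the exact kernel sizes, gating functions and upsampling mode of the paper's attention sub-networks are not recoverable from the text, so the 3x3 strided convolution, sigmoid gates and bilinear upsampling here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DAAM(nn.Module):
    """Sketch of the domain adaptive attention module: splits a backbone
    feature map x (N, C, H, W) into a domain-shared part A * x and a
    domain-specific residual (1 - A) * x, where A combines a spatial and
    a channel attention (Eqs. 1-3)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Spatial branch: strided conv + upsampling, then a conv that
        # learns an adaptive scale (layer sizes are assumptions).
        self.spatial_conv = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
        self.spatial_scale = nn.Conv2d(1, 1, kernel_size=1)
        # Channel branch: two-layer bottleneck with reduction ratio r (Eq. 4).
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        n, c, h, w = x.shape
        # Spatial attention A_sp (N, 1, H, W): pooled over channels, shared by all channels.
        s = x.mean(dim=1, keepdim=True)
        s = self.spatial_conv(s)
        s = F.interpolate(s, size=(h, w), mode='bilinear', align_corners=False)
        a_sp = torch.sigmoid(self.spatial_scale(s))
        # Channel attention A_ch (N, C, 1, 1): pooled over pixels (Eq. 4).
        z = x.mean(dim=(2, 3))
        a_ch = torch.sigmoid(self.fc2(F.relu(self.fc1(z)))).view(n, c, 1, 1)
        # Combined attention and residual split (Eqs. 2-3): F = F_dsh + F_dsp.
        a = a_sp * a_ch
        return a * x, (1.0 - a) * x   # F_dsh, F_dsp
```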

3.2 Domain-Shared Branch

Given $F_{dsh}$ of any image $x$, the domain-shared (DSH) branch is designed to extract feature representations that are applied to the person Re-ID task in the target domain. Specifically, a global average pooling (GAP) operation, a fully connected (FC) layer, a batch normalization (BN) layer and a ReLU activation function are employed sequentially to project $F_{dsh}$ to a 256-dimensional feature vector $f_{dsh}$. To make $f_{dsh}$ transferable to different domains and discriminative to different persons, two types of loss functions, the domain similarity loss and the person Re-ID loss, are introduced.
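As a sketch, this projection head might be implemented as follows, assuming the standard 2048-channel output of ResNet-50:

```python
import torch
import torch.nn as nn


class DSHHead(nn.Module):
    """Sketch of the DSH projection: GAP -> FC -> BN -> ReLU -> 256-d vector."""

    def __init__(self, in_channels: int = 2048, out_dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(in_channels, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, f_map):                 # f_map: (N, C, H, W), here F_dsh
        v = f_map.mean(dim=(2, 3))            # global average pooling
        return torch.relu(self.bn(self.fc(v)))
```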

Domain Similarity Loss. To make $f_{dsh}$ transferable across domains, a natural approach is to make the distribution of $f_{dsh}$ in the source domain similar to that in the target domain. To reduce the domain divergence, the images from the source and target domains are forced to project into a unified new nominal domain, which can be formulated as a one-class classification (OCC) problem [9]. Therefore, a novel domain similarity loss is proposed based on OCC. Specifically, an FC layer and a sigmoid activation function are applied sequentially to $f_{dsh}$ to predict the probability $p_i$ that image $x_i$ belongs to the new nominal domain, and the domain similarity loss is:

$\mathcal{L}_{ds} = -\frac{1}{n_s + n_t} \sum_{i=1}^{n_s + n_t} \log p_i$.   (5)

Compared with the widely used domain adaptation method [13], where a gradient reversal layer (GRL) is added to confuse the domain discriminator, the proposed OCC loss attempts to project $f_{dsh}$ into one single nominal domain, so the constraint is stronger. An intuitive comparison is shown in Fig. 4. The MMD loss [17] has also been proposed to map instances from the source and target domains into a new data space, but it requires that the source and target domains have identical classes, so it cannot be used directly for the person Re-ID task, where the source and target domains contain totally different person IDs.
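A minimal sketch of this loss, assuming a single linear head on the 256-d DSH feature; every sample, whether from the source or the target domain, is pushed toward the single nominal-domain class:

```python
import torch
import torch.nn as nn


class DomainSimilarityLoss(nn.Module):
    """Sketch of the OCC-based domain similarity loss (Eq. 5)."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, 1)  # nominal-domain head

    def forward(self, f_dsh):                     # (N, 256), both domains mixed
        # Probability of belonging to the nominal domain; the loss pushes
        # every p_i toward 1, pulling the two distributions together.
        p = torch.sigmoid(self.classifier(f_dsh)).clamp_min(1e-7)
        return -p.log().mean()
```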

Figure 4:   The implication of OCC. GRL attempts to confuse the domain discriminator, while OCC attempts to pull the two distributions together.

Person Re-ID Loss. For the labeled source dataset, the person Re-ID loss is similar to those of existing supervised learning methods. That is, every person ID is considered as a class, and the person Re-ID task is formulated as a classification task. Therefore, an FC layer and a softmax activation function are applied sequentially to $f_{dsh}$ to output the probability $p(y_i^s \mid x_i^s)$ that image $x_i^s$ belongs to person $y_i^s$. Then, the cross-entropy loss is adopted:

$\mathcal{L}_{src} = -\frac{1}{n_s} \sum_{i=1}^{n_s} \log p(y_i^s \mid x_i^s)$.   (6)

Since the training data in the target domain are unlabeled, the Re-ID loss in Eq. (6) cannot be applied to the target dataset directly. To make full use of the unlabeled target dataset, we assume that images with more similar appearances are more likely to come from the same person. Following this assumption, the clustering method k-means++ [1] is adopted to generate weak labels for the images of the target dataset. Specifically, the samples of the target dataset are clustered into $K$ groups based on $f_{dsh}$, with corresponding group centers $c_1, \dots, c_K$. We assume that the images within a group come from the same person, and each image $x_j^t$ is then assigned a weak label by

$\tilde{y}_j = \arg\min_{k \in \{1, \dots, K\}} \| f_{dsh}(x_j^t) - c_k \|_2$.   (7)

Unlike true labels, $\tilde{y}_j$ is approximated and may be inaccurate. Hence a weighted cross-entropy loss is proposed:

$\mathcal{L}_{tgt} = -\frac{1}{n_t} \sum_{j=1}^{n_t} w_j \log p(\tilde{y}_j \mid x_j^t)$,   (8)

where $p(\tilde{y}_j \mid x_j^t)$ is calculated by applying an FC layer and a softmax function sequentially to $f_{dsh}$. The sample weight $w_j$ evaluates the confidence that $x_j^t$ belongs to person $\tilde{y}_j$. Generally speaking, a sample closer to its group center should have a larger confidence, hence $w_j$ is defined as a descending function $\phi(\cdot)$ of the distance between $f_{dsh}(x_j^t)$ and the corresponding group center $c_{\tilde{y}_j}$:

$w_j = \phi(\| f_{dsh}(x_j^t) - c_{\tilde{y}_j} \|_2)$.   (9)
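These steps can be sketched as follows. scikit-learn's KMeans (which uses k-means++ initialization) stands in for the clustering step, and exp(-distance) is used purely as an illustrative choice for the descending weight function $\phi$, whose exact form is not specified above:

```python
import numpy as np
import torch.nn.functional as F
from sklearn.cluster import KMeans


def assign_weak_labels(target_feats, n_clusters=650):
    """Cluster target DSH features (Eq. 7) and derive sample weights (Eq. 9).
    target_feats: (N, 256) numpy array of f_dsh for the unlabeled target images."""
    km = KMeans(n_clusters=n_clusters, init='k-means++').fit(target_feats)
    labels = km.labels_                                       # weak labels (Eq. 7)
    dists = np.linalg.norm(target_feats - km.cluster_centers_[labels], axis=1)
    weights = np.exp(-dists)                                  # descending in distance
    return labels, weights


def weighted_reid_loss(logits, weak_labels, weights):
    """Weighted cross-entropy of Eq. (8); weak_labels (long) and weights
    (float) must be converted to torch tensors beforehand. Averaging over
    the batch, rather than summing, is an assumption."""
    ce = F.cross_entropy(logits, weak_labels, reduction='none')
    return (weights * ce).mean()
```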

3.3 Domain-Specific Branch

Similar to the DSH branch, the domain-specific (DSP) branch adopts a GAP operation, an FC layer, a BN layer and a ReLU activation function sequentially to project $F_{dsp}$ to a 256-dimensional feature vector $f_{dsp}$. The network parameters of the DSP branch are independent of those of the DSH branch since the two branches are designed for different tasks. To ensure that it is domain-specific, $f_{dsp}$ should be distinguishable between domains. Therefore, a domain classifier composed of an FC layer and a softmax function is introduced to predict the probabilities $p_i^s$ and $p_i^t$ that image $x_i$ comes from the source domain or the target domain, respectively. The domain classifier should predict the domain well, and the domain-specific loss is defined as a cross-entropy loss:

$\mathcal{L}_{dsp} = -\frac{1}{n_s + n_t} \sum_{i=1}^{n_s + n_t} ( d_i \log p_i^s + (1 - d_i) \log p_i^t )$,   (10)

where $d_i = 1$ if $x_i$ comes from the source domain and $d_i = 0$ otherwise.

In addition, to ensure that $f_{dsh}$ and $f_{dsp}$ are mutually exclusive and independent, they are encouraged to satisfy a soft orthogonality constraint. This helps the model learn more robust DSH features by further eliminating DSP information for person Re-ID. The soft orthogonality constraint loss [4] is written as follows:

$\mathcal{L}_{orth} = \sum_{i} \langle f_{dsh}(x_i), f_{dsp}(x_i) \rangle^2$,   (11)

where $\langle \cdot, \cdot \rangle$ means the inner product of two vectors.
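Both losses are short to sketch; batch averaging (rather than summation) in the orthogonality term is an assumption:

```python
import torch.nn.functional as F


def domain_specific_loss(domain_logits, domain_labels):
    """Cross-entropy of Eq. (10). domain_logits: (N, 2) from the DSP
    domain classifier; domain_labels: 1 for source images, 0 for target."""
    return F.cross_entropy(domain_logits, domain_labels)


def orthogonality_loss(f_dsh, f_dsp):
    """Soft orthogonality constraint of Eq. (11): squared per-sample inner
    products between the two 256-d feature vectors."""
    return ((f_dsh * f_dsp).sum(dim=1) ** 2).mean()
```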

4 Learning

The proposed network is optimized by minimizing Eq. (5), Eq. (6), Eq. (8), Eq. (10) and Eq. (11) jointly. The total loss function is defined as:

$\mathcal{L} = \mathcal{L}_{src} + \mathcal{L}_{tgt} + \mathcal{L}_{ds} + \mathcal{L}_{dsp} + \mathcal{L}_{orth}$.   (12)

In the learning procedure, we first pre-train the network on the labeled source dataset in a supervised way using only $\mathcal{L}_{src}$. Then, the features $f_{dsh}$ of the target data are extracted by the pre-trained model, the weak labels $\tilde{y}_j$ of the target data are estimated by applying the clustering method to $f_{dsh}$, and the sample weights $w_j$ in Eq. (8) are calculated by Eq. (9). Next, the whole network is updated by minimizing $\mathcal{L}$ over both the source and target datasets, and new $f_{dsh}$ are extracted. The weak labels and sample weights are then updated from the new $f_{dsh}$, and the network is re-trained with them to enter the next iteration. The iterations terminate when a stopping criterion is met; the number of iterations is typically 10 in our experiments. Alg. 1 summarizes the proposed learning method.

Input: Labeled source data $\mathcal{S}$, unlabeled target data $\mathcal{T}$ and the number of clusters $K$;
Output: The trained network parameters.
1  Pre-train the network on $\mathcal{S}$ by $\mathcal{L}_{src}$.
2  for $t = 1, \dots, T$ do
3      Extract $f_{dsh}$ of the target data.
4      Update the group centers $c_k$ by k-means++ and the weak labels $\tilde{y}_j$ by Eq. (7).
5      Compute the sample weights $w_j$ according to Eq. (9).
6      Update the network by minimizing Eq. (12).
7  end
Algorithm 1: The proposed learning algorithm.
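Putting the pieces together, Alg. 1 can be sketched as follows. total_loss realizes Eq. (12) as an unweighted sum, and pretrain, extract_dsh_features and finetune are hypothetical helpers standing in for ordinary PyTorch training and inference loops; assign_weak_labels is the sketch from Section 3.2.

```python
def total_loss(l_src, l_tgt, l_ds, l_dsp, l_orth):
    # Eq. (12): joint objective over all five losses.
    return l_src + l_tgt + l_ds + l_dsp + l_orth


def run_alg1(model, source_loader, target_loader, n_iters=7, n_clusters=650):
    """Skeleton of Alg. 1 (helper functions are hypothetical)."""
    pretrain(model, source_loader)                            # line 1: L_src only
    for _ in range(n_iters):                                  # line 2
        feats = extract_dsh_features(model, target_loader)    # line 3: f_dsh
        labels, weights = assign_weak_labels(feats, n_clusters)   # lines 4-5
        finetune(model, source_loader, target_loader, labels, weights)  # line 6: Eq. (12)
```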

5 Experiment

5.1 Datasets

Following [23, 10, 41, 8, 24], the experiments are conducted on two widely used benchmark datasets: Market-1501 [46] and DukeMTMC-reID [33]. Market-1501 contains 32,668 images of 1,501 identities captured by 6 camera views. The pedestrians are cropped with bounding boxes predicted by the DPM detector [30]. Following the standard setting [46], the whole dataset is divided into a training set containing 12,936 images of 751 identities and a testing set containing 19,732 images of 750 identities. DukeMTMC-reID consists of 36,411 images of 1,812 persons from 8 high-resolution cameras, where 1,404 people appear in more than two cameras and the images of the other 408 people are regarded as distractors. 16,522 images of 702 persons are randomly selected from the dataset as the training set, and the remaining 702 persons form the testing set, which contains 2,228 query images and 17,661 gallery images. The dataset split follows [33].

In the experiments, there are two source-target settings: (1) the source dataset is Market-1501 and the target dataset is DukeMTMC-reID; (2) the source dataset is DukeMTMC-reID and the target dataset is Market-1501. The training data in the source and target datasets are labeled and unlabeled, respectively. The rank-1, rank-5 and rank-10 accuracies and the mean Average Precision (mAP) [46] are employed for person Re-ID evaluation. All results are obtained under the single-query mode without re-ranking [48] refinement for fair comparison.
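For reference, a minimal single-query mAP and rank-k computation consistent with this protocol might look as follows. It omits the per-query camera filtering that the standard Market-1501 protocol additionally applies, and assumes every query has at least one correct gallery match:

```python
import numpy as np


def evaluate(query_feats, query_ids, gallery_feats, gallery_ids, topk=(1, 5, 10)):
    """Single-query CMC (rank-k) and mAP over Euclidean distances."""
    dists = np.linalg.norm(query_feats[:, None] - gallery_feats[None], axis=2)
    cmc = np.zeros(max(topk))
    aps = []
    for i in range(len(query_feats)):
        order = np.argsort(dists[i])                    # gallery sorted by distance
        matches = gallery_ids[order] == query_ids[i]    # boolean relevance list
        cmc[matches.argmax():] += 1                     # rank of first correct match
        hits = np.cumsum(matches)
        precision = hits[matches] / (np.flatnonzero(matches) + 1)
        aps.append(precision.mean())                    # average precision per query
    cmc /= len(query_feats)
    return float(np.mean(aps)), {k: float(cmc[k - 1]) for k in topk}
```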

5.2 Implementation Details

The parameters of ResNet-50 [18] are pre-trained on ImageNet, and all other network parameters are initialized randomly. The proposed method is implemented in PyTorch and all input images are resized to a fixed size. Stochastic gradient descent with a momentum of 0.9 is adopted. At each iteration of Alg. 1, the learning rate starts from its initial value and is decayed twice, after 20 epochs and 120 epochs respectively; the training at each iteration lasts for 200 epochs. The number of clusters $K$ and the number of iterations $T$ in Alg. 1 are set to 650 and 7 respectively for both Market-1501 and DukeMTMC-reID, and their influences are evaluated in the experiments.

5.3 Comparative Results

The proposed method is compared with eight state-of-the-art unsupervised domain adaptation person Re-ID methods under the same setting, including a handcrafted-feature-based approach (UMDL [31]), GAN-based deep learning methods (SPGAN [8], PTGAN [41] and HHL [49]) and other deep transfer learning approaches (CAMEL [42], PUL [10], TJ-AIDL [40] and ARN [23]). The comparative results on Market-1501 and DukeMTMC-reID are shown in Table 1 and Table 2, respectively.

Methods         mAP    Rank-1   Rank-5   Rank-10
UMDL [31]       12.4   34.5     52.6     59.6
PUL [10]        20.5   45.5     60.7     66.7
CAMEL [42]      26.3   54.5     -        -
TJ-AIDL [40]    26.5   58.2     74.8     81.1
PTGAN [41]      -      38.6     -        66.1
SPGAN [8]       26.7   57.7     75.8     82.4
HHL [49]        31.4   62.2     78.8     84.0
ARN [23]        39.4   70.3     80.4     86.3
Ours            53.1   77.8     89.9     93.7

Table 1: Comparative results where DukeMTMC-reID is the source dataset and Market-1501 is the target dataset. The best results are in bold.
Methods         mAP    Rank-1   Rank-5   Rank-10
UMDL [31]       7.3    18.5     31.4     37.6
PUL [10]        16.4   30.0     43.4     48.5
TJ-AIDL [40]    23.0   44.3     59.6     65.0
PTGAN [41]      -      27.4     -        50.7
SPGAN [8]       26.2   46.4     62.3     68.0
HHL [49]        27.2   46.9     61.0     66.7
ARN [23]        33.4   60.2     73.9     79.5
Ours            48.8   71.3     82.4     86.3

Table 2: Comparative results where Market-1501 is the source dataset and DukeMTMC-reID is the target dataset. The best results are in bold.
Methods                   DukeMTMC-reID → Market-1501       Market-1501 → DukeMTMC-reID
                          mAP    Rank-1  Rank-5  Rank-10    mAP    Rank-1  Rank-5  Rank-10
direct transfer           17.5   42.3    59.5    67.1       14.5   29.3    45.4    52.0
DSH                       48.6   74.4    86.6    90.8       45.2   66.6    78.3    82.3
DSH + DSP                 48.5   74.7    87.6    90.7       45.3   66.9    78.7    82.2
DSH + DSP + IA            49.1   75.2    87.9    91.5       46.0   67.7    78.9    83.4
DSH + DSP + DAAM          52.1   76.8    89.1    92.7       48.2   70.2    81.3    85.4
DSH + DSP + DAAM + Orth   53.1   77.8    89.9    93.7       48.8   71.3    82.4    86.3

Table 3: Ablation studies of the proposed model. IA denotes two individual attention modules used to learn the domain-shared and domain-specific parts respectively.

As shown in Table 1 and Table 2, we have the following key findings: (1) The proposed method outperforms UMDL by a large margin, because a deep network can learn more discriminative representations than hand-crafted features. (2) PUL also trains its model iteratively on the unlabeled dataset with weak labels. The proposed method obtains +32.3% (+41.3%) and +32.6% (+32.4%) improvements in Rank-1 and mAP on Market-1501 (DukeMTMC-reID), respectively. The reason is that the proposed DAAM extracts more effective domain-shared features through the domain adaptive attention module. (3) Compared with the GAN-based methods SPGAN, PTGAN and HHL, the proposed method achieves higher performance without generating new images, indicating that it makes fuller and more effective use of the existing unlabeled data. (4) The proposed method outperforms the second-best method ARN by +7.5% (+11.1%) in Rank-1 accuracy and +13.7% (+15.4%) in mAP on Market-1501 (DukeMTMC-reID), and the advantage is clear. The reason is that the proposed method separates the feature map into domain-shared and domain-specific parts, which is more direct than the encoder-based separation used in ARN. In addition, the proposed Re-ID loss makes better use of the target training data.

5.4 Ablation Studies

Effectiveness of the network modules. A series of experiments is conducted to validate the contribution of each module, and the results are shown in Table 3. First, compared with direct transfer, the Rank-1 accuracy on Market-1501 (DukeMTMC-reID) gains +32.1% (+37.3%) by introducing the DSH branch, indicating that creating a shared representation between the two domains improves the generalization ability of the model. When the DSP branch is added, performance shows no improvement (and even drops slightly), because two branches with totally different tasks confuse the feature map during learning. When individual attention modules (IA) are designed for the DSH and DSP branches respectively, the improvement is limited, because IA cannot effectively separate the feature map into DSH and DSP parts. In contrast, the proposed DAAM exploits the mutual exclusion and complementarity of domain-shared and domain-specific features through the residual attention, and the results demonstrate its effectiveness. Finally, the orthogonality constraint loss further refines the results.


Methods              DukeMTMC-reID → Market-1501    Market-1501 → DukeMTMC-reID
                     mAP    Rank-1                   mAP    Rank-1
w/o $\mathcal{L}_{ds}$      50.8   74.1                     46.2   67.9
GRL                  52.4   75.9                     47.7   69.8
w/o weights          42.7   69.4                     40.6   64.3
Ours (full model)    53.1   77.8                     48.8   71.3

Table 4: Effect of the proposed losses: the OCC-based domain similarity loss $\mathcal{L}_{ds}$ and the weighted cross-entropy loss.

Effectiveness of the proposed losses. To validate the effectiveness of the proposed domain similarity loss in Eq. (5), two cases are evaluated: removing $\mathcal{L}_{ds}$ entirely (denoted by "w/o $\mathcal{L}_{ds}$"), and replacing $\mathcal{L}_{ds}$ with GRL [13]. As shown in Table 4, the full model clearly outperforms both cases, indicating that the proposed $\mathcal{L}_{ds}$ is effective for unsupervised cross-domain person Re-ID.

In addition, we replace the proposed weighted cross-entropy loss in Eq. (8) with the normal cross-entropy loss where all weights are set to 1.0. The performance (denoted as "w/o weights") is much poorer than that of the full model. The reason is that the weak labels are approximated, and the proposed weights represent the confidence of each sample well. This is also the main difference between the proposed method and other unsupervised person Re-ID methods [10].

Figure 5:   Effect of the number of iterations. The performance improves in the early iterations and converges after 5 iterations for both datasets.
Figure 6:   Effect of the number of clusters $K$ on Re-ID accuracy. The proposed method outperforms the second-best method ARN in all cases.
Figure 7:   Visualisation of feature maps. The first column shows the original images, the second the domain-shared feature maps and the third the domain-specific feature maps. The warmer the color, the greater the weight.
Figure 8:   Feature distributions visualized by t-SNE. (a) The model trained on Market-1501 in a supervised way. (b) The model trained on DukeMTMC-reID directly transferred to Market-1501. (c) Identity information transferred from DukeMTMC-reID to Market-1501 with the proposed method.

Influences of hyper-parameters. Here we evaluate the influences of the two main hyper-parameters: the number of iterations $T$ in Alg. 1 and the number of clusters $K$. First, we evaluate the learned model after each iteration, and the results are shown in Fig. 5. As the model becomes stronger at each iteration and generates more reliable weak labels for the target images, the performance improves in the early iterations; it converges after 5 iterations for both datasets.

We also vary the number of clusters $K$ over 300, 350, …, 1000, and the results are shown in Fig. 6, where the performance of ARN is listed as a reference. The proposed method outperforms ARN in all test cases. In addition, it yields the best result at $K = 650$ for both datasets, which suggests the choice of $K$ is robust across datasets.

5.5 Visualisation

First, the domain-shared and domain-specific feature maps are visualized following [51]. As shown in Fig. 7, the domain-shared attention map pays more attention to the body of the pedestrian, while the residual attention map of the domain-specific branch reflects the background and other factors. This demonstrates that our DAAM effectively separates the feature map into domain-shared and domain-specific parts.

Second, in order to observe the discriminative power of the learned features, we visualize the distributions of the features on Market-1501 learned by supervised learning, direct transfer and our method, respectively. As shown in Fig. 8, our model separates different identities far better than direct transfer, and its result is very close to that of supervised learning.

6 Conclusion

In this paper, we have proposed a novel unsupervised cross-domain transfer learning network architecture for the Re-ID task. It differs significantly from existing methods in that it transfers knowledge from a labeled dataset to an unlabeled dataset by jointly modeling the domain-shared and domain-specific features. Extensive experiments on the Market-1501 and DukeMTMC-reID datasets have demonstrated the effectiveness and robustness of the proposed model.

References

  • [1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
  • [2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • [3] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS, pages 137–144, 2007.
  • [4] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In NIPS, pages 343–351, 2016.
  • [5] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In CVPR, pages 403–412, 2017.
  • [6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1251–1258, 2017.
  • [7] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
  • [8] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In CVPR, pages 994–1003, 2018.
  • [9] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition, 58:121–134, 2016.
  • [10] H. Fan, L. Zheng, C. Yan, and Y. Yang. Unsupervised person re-identification: Clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(4):83, 2018.
  • [11] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In CVPR, pages 2960–2967, 2013.
  • [12] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.
  • [13] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [14] M. Geng, Y. Wang, T. Xiang, and Y. Tian. Deep transfer learning for person re-identification. arXiv preprint arXiv:1611.05244, 2016.
  • [15] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073. IEEE, 2012.
  • [16] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In ICCV, pages 999–1006. IEEE, 2011.
  • [17] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In NIPS, pages 1205–1213, 2012.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [19] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [20] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
  • [21] D. Li, X. Chen, Z. Zhang, and K. Huang. Learning deep context-aware features over body and latent parts for person re-identification. In CVPR, pages 384–393, 2017.
  • [22] W. Li, X. Zhu, and S. Gong. Harmonious attention network for person re-identification. In CVPR, pages 2285–2294, 2018.
  • [23] Y.-J. Li, F.-E. Yang, Y.-C. Liu, Y.-Y. Yeh, X. Du, and Y.-C. Frank Wang. Adaptation and re-identification network: An unsupervised deep transfer learning approach to person re-identification. In CVPR Workshops, pages 172–178, 2018.
  • [24] J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu. Pose transferrable person re-identification. In CVPR, pages 4099–4108, 2018.
  • [25] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.
  • [26] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In NIPS, pages 136–144, 2016.
  • [27] A. J. Ma, J. Li, P. C. Yuen, and P. Li. Cross-domain person reidentification using domain adaptation ranking svms. IEEE transactions on image processing, 24(5):1599–1613, 2015.
  • [28] T. Matsukawa, T. Okabe, E. Suzuki, and Y. Sato. Hierarchical gaussian descriptor for person re-identification. In CVPR, pages 1363–1372, 2016.
  • [29] S. Paisitkriangkrai, C. Shen, and A. Van Den Hengel. Learning to rank in person re-identification with metric ensembles. In CVPR, pages 1846–1855, 2015.
  • [30] M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV, pages 1307–1314. IEEE, 2011.
  • [31] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian. Unsupervised cross-dataset transfer learning for person re-identification. In CVPR, pages 1306–1315, 2016.
  • [32] P. Perera and V. M. Patel. Learning deep features for one-class classification. arXiv preprint arXiv:1801.05365, 2018.
  • [33] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, pages 17–35. Springer, 2016.
  • [34] L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, R. Vandermeulen, A. Binder, E. Müller, and M. Kloft. Deep one-class classification. In ICML, pages 4390–4399, 2018.
  • [35] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, pages 480–496, 2018.
  • [36] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, pages 4068–4076, 2015.
  • [37] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, pages 7167–7176, 2017.
  • [38] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [39] R. R. Varior, M. Haloi, and G. Wang. Gated siamese convolutional neural network architecture for human re-identification. In ECCV, pages 791–808. Springer, 2016.
  • [40] J. Wang, X. Zhu, S. Gong, and W. Li. Transferable joint attribute-identity deep learning for unsupervised person re-identification. In CVPR, pages 2275–2284, 2018.
  • [41] L. Wei, S. Zhang, W. Gao, and Q. Tian. Person transfer gan to bridge domain gap for person re-identification. In CVPR, pages 79–88, 2018.
  • [42] H.-X. Yu, A. Wu, and W.-S. Zheng. Cross-view asymmetric metric learning for unsupervised person re-identification. In ICCV, pages 994–1002, 2017.
  • [43] L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. In CVPR, pages 1239–1248, 2016.
  • [44] Y. Zhang, B. Li, H. Lu, A. Irie, and X. Ruan. Sample-specific svm learning for person re-identification. In CVPR, pages 1278–1287, 2016.
  • [45] R. Zhao, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In CVPR, pages 144–151, 2014.
  • [46] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In ICCV, pages 1116–1124, 2015.
  • [47] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
  • [48] Z. Zhong, L. Zheng, D. Cao, and S. Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, pages 1318–1327, 2017.
  • [49] Z. Zhong, L. Zheng, S. Li, and Y. Yang. Generalizing a person retrieval model hetero-and homogeneously. In ECCV, pages 172–188, 2018.
  • [50] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang. Camera style adaptation for person re-identification. In CVPR, pages 5157–5166, 2018.
  • [51] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.