Learning to Adapt Invariance in Memory for Person Re-identification

08/01/2019 ∙ by Zhun Zhong, et al. ∙ Xiamen University

This work considers the problem of unsupervised domain adaptation in person re-identification (re-ID), which aims to transfer knowledge from the source domain to the target domain. Existing methods primarily aim to reduce the inter-domain shift between the two domains, but usually overlook the relations among target samples. This paper investigates the intra-domain variations of the target domain and proposes a novel adaptation framework w.r.t. three types of underlying invariance, i.e., Exemplar-Invariance, Camera-Invariance, and Neighborhood-Invariance. Specifically, an exemplar memory is introduced to store features of samples, which makes it possible to enforce the invariance constraints effectively and efficiently over the global dataset. We further present the Graph-based Positive Prediction (GPP) method to explore reliable neighbors for the target domain; it is built upon the memory and trained on the source samples. Experiments demonstrate that 1) the three invariance properties are indispensable for effective domain adaptation, 2) the memory plays a key role in implementing invariance learning and improves performance at limited extra computation cost, 3) GPP facilitates the invariance learning and thus significantly improves the results, and 4) our approach produces new state-of-the-art adaptation accuracy on three large-scale re-ID benchmarks.


1 Introduction

Person re-identification (re-ID) [zheng2016personsurvery] is an image retrieval task that aims to find matches of a query person in a database captured by disjoint cameras. The predominant methods have demonstrated impressive performance when trained and tested on the same data distribution. However, their performance may degrade significantly when they are evaluated on a different domain, due to dataset shifts caused by changes of scenario, season, illumination, camera deployment, etc. This raises a domain adaptation problem that is often encountered in real-world applications and has attracted increasing attention in the community [fan2017pul, deng2018image, wang2018reid, Zhong_2018_ECCV, zhong2019invariance]. In this work, we study the problem of unsupervised domain adaptation (UDA) in re-ID. The goal is to improve the generalization ability of models on a target domain, using a labeled source domain and an unlabeled target domain.

Conventional UDA methods are mainly designed for a closed-set setting, where the source and target domains share a common label space, i.e., the classes of the two domains are exactly the same. A popular approach is to align the feature distributions of both domains, but it does not readily apply to re-ID, because domain adaptation in re-ID is a special open-set problem [busto2017open-set, saito2018open, sohn2019unsupervised] in which the source and target domains have completely disjoint classes/identities. Under this label constraint, directly aligning the feature distributions of the two domains would align samples from different classes and may be detrimental to the adaptation accuracy.

To address the challenges of domain adaptive re-ID, recent works concentrate on aligning the source-target distributions in a common space, such as pixel-level space [deng2018image, wei2018person] and attribute label space [wang2018reid, lin2018multibmvc]. Despite their success, these works only consider the overall inter-domain shift between the source and target domains, and largely overlook the intra-domain variations of the target domain. In a re-ID system, intra-domain variations are important factors that affect performance. If they are ignored, an adapted model will perform poorly when the intra-domain variations of the target testing set differ substantially from those of the source domain.

In this work, we explicitly consider the intra-domain variations of the target domain and design our framework w.r.t. three types of underlying invariance, i.e., Exemplar-Invariance (EI), Camera-Invariance (CI), and Neighborhood-Invariance (NI), as described below.

Exemplar-Invariance (EI): The first property is motivated by the retrieval results of re-ID. Given a re-ID model trained on a labeled source training set, we evaluate it on a source/target testing set. On the one hand, we observe that the top-ranked retrieval results (both positive and negative samples) are always visually similar to the query when tested on the source set. A similar phenomenon is shown in image classification [wu2018unsupervised]. This indicates that the model has learned to distinguish persons by apparent similarity for the source domain. On the other hand, when tested on the target set, the top-ranked results often include many samples that are visually dissimilar to the query. This suggests that the ability of the model to distinguish persons by apparent similarity is degraded on the target domain. In reality, each person exemplar can differ significantly from others, even from those that share the same identity. Therefore, it is possible to enable the model to capture the apparent representation by learning to distinguish individual exemplars. To achieve this goal, we introduce exemplar-invariance (EI) to improve the discrimination ability of the model on the target domain, by encouraging each exemplar to be close to itself while far away from others.

Camera-Invariance (CI): Camera style (CamStyle) difference is a critical factor for re-ID that can be clearly identified, since the appearance of a person may change considerably under different cameras [zhong2018camera, zhong2019camstyle]. Because the camera deployments of the source and target domains are usually different, the model trained on the source domain may suffer from the variations caused by the target cameras. To address this problem, Zhong et al. [Zhong_2018_ECCV] introduce camera-invariance (CI) by enforcing a target example and its corresponding CamStyle-transferred images to be close to each other. Inspired by them, we integrate camera-invariance learning into our model by classifying a target example and its CamStyle counterparts into the same class.

Neighborhood-Invariance (NI): Apart from the easily identified camera variance, some other latent intra-domain variations are hard to discern explicitly without fine-grained labels, such as changes of pose, view, and background. To overcome this difficulty, we attempt to generalize the model with the neighbors of target samples. Suppose we are given an appropriate model trained on the source and target domains; a target sample and its nearest neighbors in the target set are then likely to share the same identity. Considering this trait, we introduce neighborhood-invariance (NI) to learn a model that is more robust to the latent intra-domain variations of the target domain. We accomplish this constraint by encouraging an exemplar and its reliable neighbors to be close to each other. Examples of the three types of invariance are illustrated in Fig. 1.

Fig. 1: Examples of three underlying properties of invariance. Colors indicate identities. (a) Exemplar-invariance: an input exemplar is enforced to be away from others. (b) Camera-invariance: an input exemplar and its CamStyle transferred images (with dashed outline) are encouraged to be close to each other. (c) Neighborhood-invariance: an input exemplar and its reliable neighbors (highlighted in dashed circle) are forced to be close to each other. Best viewed in color.

Intuitively, a straightforward way to implement the three invariance properties is to constrain them with contrastive/triplet loss [hadsell2006contrastive, hermans2017defense] within a training mini-batch. However, the number of samples in a mini-batch is relatively small compared with the entire training set. In this manner, it is difficult to form a mini-batch with ideal examples, and the overall relations between training samples cannot be considered thoroughly during the network adaptation procedure. To deal with this issue, we develop a novel framework to effectively accommodate the three invariance properties for domain adaptive re-ID. Specifically, we introduce an exemplar memory module into the network to store the up-to-date representations of all training samples. The memory enables the network to enforce the invariance constraints over the entire/global target training data instead of the current mini-batch. With the memory, the invariance learning of the target domain can be effectively implemented with a non-parametric classification loss, considering each target sample as an individual class.

In our previous work [zhong2019invariance], we directly select the top-k nearest neighbors from the memory for the learning of NI. This straightforward strategy ignores the underlying relations between samples in the memory. As a result, the similarity estimation of hard samples may not be accurate when the model has inferior discriminative ability. As a notable extension of our previous work [zhong2019invariance], we propose a graph-based positive prediction (GPP) approach to address this problem, thereby promoting the invariance learning. GPP is built upon the memory module and designed with graph convolutional networks (GCNs); it aims to predict positive neighbors from the memory for a training target sample. In addition to the target memory, we also construct a memory for saving features of the source domain. This enables us to imitate the neighbor exploring process of target invariance learning and thus learn GPP on the labeled source domain. The learned GPP is then directly applied to the unlabeled target domain to facilitate the learning of NI.

In summary, our contributions are as follows:

  • This work comprehensively investigates the intra-domain variations of the target domain and studies three underlying properties of target invariance. The experiment demonstrates that the three properties are indispensable for improving the transferable ability of the model in the context of re-ID.

  • This work proposes a novel framework equipped with a memory module that can effectively enforce the three constraints into the network. The memory enables us to fully exploit the sample relations over the whole training set instead of the mini-batch. With the memory, the performance can be significantly improved, requiring very limited extra computation cost and GPU memory.

  • This work introduces a Graph-based Positive Prediction (GPP) approach to leverage the relationships between candidate neighbors and infer accurate positive neighbors for the training target sample. The experiment shows that GPP is beneficial to the learning of neighborhood-invariance and could consistently improve the results, especially the mAP.

  • In addition, we analyze the mechanism of the three invariance properties, which helps us to understand how these three properties encourage the network to adapt to the target domain.

  • Experiments demonstrate the effectiveness and superiority of the proposed method over state-of-the-art UDA approaches. Our results outperform the state of the art by a large margin on three datasets: Market-1501, DukeMTMC-reID, and MSMT17.

2 Related Work

Unsupervised domain adaptation. This work is closely related to unsupervised domain adaptation (UDA). In classical UDA, methods are designed under the assumption of the closed-set scenario, where the classes of the source and target domains are precisely the same. For this problem, recent popular approaches mainly focus on distribution alignment learning, which attempts to reduce the domain discrepancy by using Maximum Mean Discrepancy (MMD) minimization [gretton2007kernel, long2015learning, yan2017mind] or domain adversarial training [bousmalis2016domain, tzeng2017adversarial]. However, these methods are usually not applicable to the open-set scenario, where unknown classes exist in the source/target domain, since samples of unknown classes should not be aligned with those of known classes. This gives rise to the problem of open-set UDA, introduced by Busto and Gall [busto2017open-set]. To tackle this problem, Busto and Gall [busto2017open-set] develop a method to learn a mapping from the source domain to the target domain by jointly predicting unknown-class target samples and discarding them. Saito et al. [saito2018open] introduce an adversarial learning framework to separate target samples into known and unknown classes. Meanwhile, unknown classes are rejected during feature alignment. Recently, Sohn et al. [sohn2019unsupervised] consider a more challenging setting of open-set UDA, where the source and target domains belong to disjoint label spaces, i.e., the classes of the two domains are completely different. In practice, many tasks match this setting, e.g., cross-ethnicity face recognition and the UDA in re-ID considered in our work. To address this problem, Sohn et al. first reformulate the disjoint classification task as a binary verification task. Then, a Feature Transfer Network (FTN) is proposed to transfer the source feature space to a new space and align it with the target feature space. In this paper, we study this problem from the perspective of intra-domain variations in the target domain and propose an effective framework to improve the generalization ability of the model by target invariance learning.

Unsupervised person re-identification (re-ID). Recent supervised methods have made great achievements in re-ID [Li_2018_CVPR, zhong2017re, sun2018beyond, zheng2019joint, sun2019dissecting], benefiting from richly labeled data and the increasing capacity of deep networks [Yawei2019Taking, resnet]. However, the performance of these strong methods may drop considerably when tested on an unseen (target) dataset, due to the dataset shift. To address this problem, many methods based on hand-crafted features are designed for unsupervised re-ID, such as ELF [gray2008viewpoint], LOMO [liao2015lomo] and BOW [zheng2015scalable]. These methods can be directly applied to any dataset without training, but fail to obtain satisfactory performance on large-scale, complex scenarios. Although labeling re-ID data is difficult, it is relatively easy to collect sufficient unlabeled data in the target domain. Recently, many works have been proposed to transfer the knowledge of a labeled dataset to an unlabeled one. These works can mainly be divided into two categories: 1) discovering pseudo labels for the target dataset, and 2) reducing the source-target discrepancy in a common space. For the first category, methods use a labeled source dataset to initialize a re-ID model and explore pseudo labels for the target dataset based on clustering [yu2017cross, fan2017pul], associating labels with the labeled source dataset [yu2019unsupervised], assigning labels with nearest neighbors [chen2018deep, yang2018leveraging, li2018unsupervised, li2019unsupervised], or regarding camera style counterparts as positive samples [Zhong_2018_ECCV]. These methods are closely related to our work in that they use the relationships between target samples to refine the model. The main difference is that our work comprehensively considers three latent invariance constraints and enforces them over the entire dataset. We show the mutual benefit among the three invariance properties and the effectiveness of exploiting the global sample relationship.
Methods of the second category attempt to align the distributions of the source and target domains in a common space, such as image-level space [deng2018image, wei2018person, Bak_2018_ECCV, zhong2019camstyle] and attribute label space [wang2018reid, lin2018multibmvc]. These methods only consider the overall discrepancy between the source and target domains, but largely ignore the intra-domain variations in the target domain. In this work, we explicitly consider the invariance properties in the target domain to address the problem of domain adaptive re-ID.

Neural networks with augmented-memory. Learning neural networks with augmented memory has been proposed to address various tasks, such as question answering [MemoryNetworks2015, sukhbaatar2015end], few-shot learning [santoro2016meta, vinyals2016matching], and video understanding [wu2018long]. The augmented memory enables the networks to store intermediate knowledge in a structured and addressable table. Work on learning with augmented memory can mainly be divided into two categories. The first category aims to augment neural networks with a fully differentiable memory module, such as Neural Turing Machine [graves2014neural] and Memory Networks [MemoryNetworks2015]. Memory Networks introduce a long-term memory component that can be read and written. The memory is used to query and retrieve facts for the task of question answering. The second category focuses on developing a non-parametric memory [vinyals2016matching, xiao2017joint, wu2018unsupervised, wu2018improving], which directly saves the features of samples into a feature bank that is updated during training. These approaches then use an attention mechanism to calculate the similarities between the query and instances in the memory. We draw inspiration from these methods and develop a memory-based framework for unsupervised domain adaptive re-ID.

Fig. 2: The framework of the proposed method. During training, the inputs are drawn from the labeled source domain and the unlabeled target domain. We feed the inputs forward through the feature extractor to obtain the up-to-date representations. Subsequently, two branches are designed to optimize the framework with the source data and the target data, respectively. In each branch, we introduce a memory to store the features of the corresponding domain data. The source branch aims to learn a basic representation for the model with an identity classification loss, as well as to learn a graph-based positive prediction (GPP) network with a binary classification loss. To train GPP, we first select the top-ranked features of the input from the source memory. Then, the selected features are regarded as candidate neighbors and are used to train the GPP network. The GPP network is trained to predict the positive and negative neighbors of the input. The target branch attempts to enforce the invariance learning on the target data. The invariance learning loss is calculated by estimating the similarities between the target sample and all features in the target memory. In addition, we employ the GPP network to infer reliable positive neighbors for the target sample, thereby facilitating the invariance learning. In testing, the L2-normalized output of the global average pooling (GAP) is used as the feature of an image.

Graph convolutional network (GCN). In real-world applications, data are often represented in the form of graphs or networks, such as knowledge graphs and social networks. Recently, many works concentrate on extending deep convolutional neural networks to handle graph data. According to how they operate on graph data, these works are generally divided into spectral-based methods [bruna2014spectral, defferrard2016convolutional, henaff2015deep, kipf2016semi] and spatial-based methods [monti2017geometric, niepert2016learning, hamilton2017inductive, shen2018graph, wang2019linkage]. The spectral-based methods are developed based on graph spectral theory. These methods usually handle the entire graph to learn a domain-dependent graph network, which makes them difficult to apply to large graphs and prevents them from being applied to a graph with a different structure. The spatial-based methods extend the convolution operation of CNNs to the graph structure. The graph convolution is directly performed on the graph nodes and their spatial neighbors. This work is mostly inspired by spatial-based methods and proposes a graph-based positive prediction (GPP) approach to predict reliable neighbors across graphs formed by different candidates. The proposed GPP is also related to LBFC [wang2019linkage] and SGGNN [shen2018graph], which are designed for face clustering and re-ID re-ranking, respectively. The main difference is that GPP is designed to explore reliable neighbors that help to generalize re-ID networks. Importantly, GPP is trained on hard negative samples selected from the whole dataset, while LBFC and SGGNN are trained with limited samples from a mini-batch.

3 The Proposed Method

This paper aims to address the problem of unsupervised domain adaptation (UDA) in person re-identification (re-ID). In the context of UDA in re-ID, we are provided with a fully labeled source domain $\{X_s, Y_s\}$ and an unlabeled target domain $X_t$. The source domain includes $N_s$ person images of $M$ identities. Each person image $x_{s,i}$ is associated with an identity $y_{s,i}$. The target domain contains $N_t$ person images. The identity annotation of the target domain is not available. In general, the source and target domains are drawn from different distributions, and the identities in the source and target domains are completely different. In this paper, our goal is to learn a transferable deep re-ID model that generalizes well on the target testing set. The labeled source domain and unlabeled target domain are exploited during training.

3.1 Overview of Framework

The framework of the proposed method is illustrated in Fig. 2. In our method, the inputs are sampled from the labeled source domain and the unlabeled target domain. The inputs are first fed forward through the feature extractor to obtain the up-to-date features. The feature extractor is composed of convolutional layers, global average pooling (GAP) [lin2013network], and a batch normalization layer [ioffe2015batch]. The convolutional layers are the residual blocks of ResNet-50 [resnet] pre-trained on ImageNet [deng2009imagenet]. The output of the feature extractor is 2,048-dimensional. Subsequently, the source branch and the target branch are developed for training our model with the source data and the target data, respectively. Each branch includes an exemplar memory module. The memory module serves as feature storage that saves the up-to-date output of the feature extractor for each source/target sample. The aims of the source branch are twofold. On the one hand, we use the identity classifier with cross-entropy loss to learn basic representations for the feature extractor. On the other hand, we select the top-ranked features of the input from the source memory. These top-ranked features are regarded as candidate neighbors and are used to learn a Graph-based Positive Prediction (GPP) network. The GPP network consists of several graph convolution layers [kipf2016semi] and a positive classifier, which predicts the probability that a candidate is a positive sample of the input. The target branch is designed to enforce the proposed three properties of invariance on unlabeled target data, i.e., exemplar-, camera-, and neighborhood-invariance. Given a target sample, the invariance learning loss is calculated by estimating the similarities between the target sample and all target samples in the target memory. During invariance learning, we use the learned GPP to infer reliable positive neighbors of the target sample. These neighbors are selected from the candidates of the target memory, according to the probabilities obtained by the positive classifier of GPP. Note that the GPP network is trained with only the source data. Besides, the loss of the GPP network is not used to update the feature extractor.

3.2 Supervised Learning for Source Domain

Given the labeled source data, we are able to learn a basic discriminative model in a supervised way. In this paper, we treat the training process of the source domain as an identity classification problem [zheng2016personsurvery]. Thus, the supervised loss of the source domain is the cross-entropy loss, formulated as

$$\mathcal{L}_{src} = -\log p(y_{s,i} \mid x_{s,i}), \tag{1}$$

where $p(y_{s,i} \mid x_{s,i})$ is the predicted probability that the source image $x_{s,i}$ belongs to identity $y_{s,i}$. $p(y_{s,i} \mid x_{s,i})$ is obtained by the identity classifier of the source branch.

A model trained on the labeled source data is reported to achieve high accuracy on a testing set drawn from the same distribution. However, without adapting the model, the performance will deteriorate significantly when tested on an unseen (target) testing set. This deterioration is mainly caused by the domain shift and becomes more severe as the shift increases. Next, we will introduce the exemplar memory based target invariance learning method to improve the transferability of the model.

3.3 Invariance Learning for Target Domain

The deep re-ID model trained with only the source data is usually sensitive to the intra-domain variations of the target domain. In fact, these variations are critical factors influencing the performance on the target testing set. Therefore, it is necessary to consider the intra-domain variations of the target domain when transferring knowledge from the source domain to the target domain. To this end, this work investigates three underlying properties of target invariance and adapts the network with the constraints of the three properties. The three properties of target invariance are Exemplar-Invariance (EI), Camera-Invariance (CI) and Neighborhood-Invariance (NI), which are described as follows.

3.3.1 Three Properties of Target Invariance

Exemplar-invariance. Given a well-trained re-ID model, the top-ranked results are usually visually similar to the query. This suggests that the model has learned to discriminate persons by apparent similarity. However, this phenomenon may no longer hold when the model is tested on a dataset with a different distribution. In fact, the appearance of each person image may be very different from others, even if they share the same identity. We call this property exemplar-invariance, where each person image should be close to itself while far away from others. Therefore, by enforcing exemplar-invariance on the target images, it is possible to improve the ability of the model to distinguish person images based on the apparent representation.

Camera-invariance. Camera style variation is a natural and important factor in re-ID, where a person image may undergo significant changes in appearance under different cameras. A model trained using labeled source data can learn camera-invariance for the source cameras, but may suffer from the variations caused by the target cameras, because the camera settings of the two domains can be very different. Inspired by HHL [Zhong_2018_ECCV], we achieve target camera-invariance by learning with unlabeled target images and their camera-style-transferred counterparts. These counterparts are in the styles of different cameras but share the same identity as the original image. Under the camera-invariance constraint, a target image and its camera-style-transferred counterparts are encouraged to be close to each other. Supposing there are $C$ cameras in the target set, we train a CamStyle model [zhong2018camera] for the target domain with CycleGAN [zhu2017cyclegan] or StarGAN [stargan]. For Market-1501 [zheng2015scalable] and DukeMTMC-reID [zheng2017unlabeled], which have fewer cameras, we use CycleGAN to train the CamStyle model. For MSMT17, which has many cameras, we apply StarGAN to train the CamStyle model. With the learned CamStyle model, each real target image collected from camera $c$ is augmented with $C-1$ images in the styles of the other cameras while retaining the original identity. Examples of real images and fake images generated by the CamStyle model are shown in Fig. 3.

Fig. 3: Example of camera style-transferred images on DukeMTMC-reID. A real image collected from a certain camera is transferred to images in the styles of other cameras. In this process, the identity information is preserved to some extent. The real image and its corresponding fake images are assumed to belong to the same class during camera-invariance learning.

Neighborhood-invariance. Besides the camera-invariance property, which is natural and can be easily identified, there are many crucial intra-domain variations that are hard to overcome without identity annotations, such as variations of pose and background. In fact, for each target image, there may exist a number of positive samples in the target dataset. If we could exploit these positive samples during adaptation, we would be able to further improve the robustness of the re-ID model to the variations in the target domain. Assuming that we are given an appropriate transferred model and a query target sample, the nearest neighbors of the query in the target set are likely to be positive samples of the query. In light of this, we introduce neighborhood-invariance under the assumption that a target image and its reliable neighbors share the same identity and should be close to each other.

Next, we will introduce the loss functions that enforce the three invariance constraints in the network.

3.3.2 Target Invariance Learning with Memory

An essential step in enforcing the three properties of invariance is to estimate the similarities (relationships) among target samples. Intuitively, a straightforward solution is to calculate the similarities between samples within a training mini-batch and enforce the three constraints by contrastive loss [hadsell2006contrastive] or triplet loss [hermans2017defense]. However, the number of samples in a training mini-batch is far smaller than that in the entire training set due to the limited GPU memory. This causes two problems during invariance learning. First, it is hard to compose an appropriate mini-batch that includes various samples together with their corresponding CamStyle counterparts and neighbors. Second, the similarities between mini-batch samples are local and limited compared to the overall dataset. To address these problems, we introduce an exemplar memory into the network to store the up-to-date representations of all target samples. The memory enables the network to directly estimate the similarities between a training target sample and all target samples, thereby effectively implementing the invariance constraints globally.

Exemplar Memory. The exemplar memory is a feature bank [xiao2017joint] that stores the up-to-date features of the entire dataset. Given unlabeled target data including $N_t$ images, we construct an exemplar memory ($\mathcal{M}$) which has $N_t$ slots. Each slot stores the L2-normalized output of the feature extractor for the corresponding target image. At initialization, all features in the memory are set to zero. During each training iteration, for a target training sample $x_{t,i}$, we forward it through the model and obtain the L2-normalized output of the feature extractor, $f(x_{t,i})$. During back-propagation, we update the feature in the memory for the training sample $x_{t,i}$ through

$$\mathcal{M}[i] \leftarrow \alpha \mathcal{M}[i] + (1 - \alpha) f(x_{t,i}), \tag{2}$$

where $\mathcal{M}[i]$ is the feature of image $x_{t,i}$ in the $i$-th slot. The hyper-parameter $\alpha$ controls the updating rate. $\mathcal{M}[i]$ is then L2-normalized via $\mathcal{M}[i] \leftarrow \mathcal{M}[i] / \lVert \mathcal{M}[i] \rVert_2$. Next, we will introduce the approach of memory based invariance learning.
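The memory update described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `l2_normalize` and `update_memory`, the argument name `alpha`, the toy dimensions, and the convex-combination form of the update are all inferred from the description of an updating rate followed by L2 normalization.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def update_memory(memory, index, feature, alpha):
    """Blend the stored slot with the new feature, then re-normalize the slot.

    alpha is the (assumed) updating rate: alpha = 0 overwrites the slot with
    the new feature, while values close to 1 change the slot slowly.
    """
    blended = [alpha * m + (1.0 - alpha) * f
               for m, f in zip(memory[index], feature)]
    memory[index] = l2_normalize(blended)
    return memory[index]

# Hypothetical toy setup: 3 target images with 2-dimensional features.
memory = [[0.0, 0.0] for _ in range(3)]   # all slots initialized to zeros
feature = l2_normalize([3.0, 4.0])        # stands in for the extractor output
slot = update_memory(memory, 1, feature, alpha=0.5)
```

Because the slots start at zero, the first update of a slot simply stores the normalized feature regardless of `alpha`; later updates move the slot gradually toward the most recent feature.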

Exemplar-invariance learning. The exemplar-invariance enforces a target image to be close to itself while far away from others. To achieve this goal, we regard the target images as different classes and apply a non-parameterized manner to classify each image into its own class. For simplicity, we assign the corresponding index as the class of each target sample. Specifically, given a target image

, we first compute the cosine similarities between the embedding

and features saved in the memory . Then, the predicted probability that belongs to class is calculated using softmax function,

(3)

where the temperature factor balances the scale of the distribution. The objective of exemplar-invariance learning is to minimize the negative log-likelihood over the target image, as

(4)
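As a rough illustration of the softmax classification over memory slots and the negative log-likelihood objective, consider the following Python sketch. The function names are hypothetical, features are assumed to be L2-normalized so that dot products equal cosine similarities, and the default temperature of 0.05 is the best-performing value in Table I.

```python
import math

def softmax_over_memory(feature, memory, temperature):
    """Probability of each memory slot given an L2-normalized feature."""
    logits = [sum(f * m for f, m in zip(feature, slot)) / temperature for slot in memory]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exemplar_invariance_loss(feature, memory, index, temperature=0.05):
    """Negative log-likelihood of the sample's own slot."""
    probs = softmax_over_memory(feature, memory, temperature)
    return -math.log(probs[index])

memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
feature = [1.0, 0.0]
loss_own = exemplar_invariance_loss(feature, memory, index=0)    # close to zero
loss_other = exemplar_invariance_loss(feature, memory, index=1)  # large penalty
```

A small temperature sharpens the distribution, so a feature that matches its own slot incurs almost no loss while a mismatch is penalized heavily.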

Camera-invariance learning. Camera-invariance enforces a target image to be close to its style-transferred counterparts. To introduce camera-invariance learning into the model, we classify each real image and its style-transferred counterparts into the same class. The loss function of camera-invariance is defined as,

(5)

where the classified image is randomly sampled from the style-transferred images of the target image. In this way, images of the same sample in different camera styles are forced to be close to each other.

Neighborhood-invariance learning. Neighborhood-invariance enforces a target image to be close to its reliable neighbors. To endow the network with this property, we classify the target image into the classes of its reliable neighbors. Suppose a set of reliable neighbors of the target image has been selected from the memory. We assign the weight of the probability that the image belongs to each class as,

(6)

where denotes the size of . The objective of neighborhood-invariance learning is formulated as a soft-label loss,

(7)

Note that, in order to distinguish between exemplar-invariance learning and neighborhood-invariance learning, the target image is not classified into its own class in Eq. 7.
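The soft-label loss can be illustrated with a short Python sketch. It assumes that every reliable neighbor receives the same weight, equal to one over the number of selected neighbors, and that the sample's own slot is dropped before weighting; the function name and the toy probabilities are hypothetical.

```python
import math

def neighborhood_invariance_loss(probs, index, neighbors):
    """Soft-label loss with equal weight on each reliable neighbor."""
    # The sample's own slot is excluded, as noted in the text.
    selected = [j for j in neighbors if j != index]
    if not selected:
        return 0.0
    # Assumption: uniform weights that sum to one over the selected neighbors.
    weight = 1.0 / len(selected)
    return -sum(weight * math.log(probs[j]) for j in selected)

# Toy softmax output over four memory slots for the sample stored in slot 0.
probs = [0.5, 0.25, 0.125, 0.125]
loss = neighborhood_invariance_loss(probs, index=0, neighbors=[0, 1, 2])
```

The loss falls as probability mass moves onto the neighbor slots, which is what pulls a sample toward its reliable neighbors.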

The key to neighborhood-invariance learning is to include as many positive samples as possible in the reliable-neighbor set while rejecting negative samples. In Section 3.4, we introduce a vanilla neighbor selection (VNS) method and a graph-based positive prediction (GPP) method for exploring reliable neighbors.

Fig. 4: The pipeline of graph-based positive prediction (GPP). Given the embedding of an input sample, 1) we first compute the similarities between the input and features in the memory. 2) The top-k ranked samples are selected as candidate neighbors and utilized to construct a graph for positive prediction. 3) The features of nodes are refined by graph convolutional networks (GCNs) on the graph. 4) The positive classifier is employed to predict positive probabilities of every node. For the source domain, the positive classification loss is computed for training the network of GPP. For the target domain, we select reliable neighbors for the input target sample, depending on the predicted positive probabilities of nodes.

Overall loss of invariance learning. By jointly considering the exemplar-invariance, camera-invariance and neighborhood-invariance, the overall loss of invariance learning over a target training image can be written as,

(8)

where is an image randomly sampled from the union set of and its camera style-transferred images. In Eq. 8, when , we optimize the network with the exemplar-invariance learning and camera-invariance learning by classifying into its own class. When , the network is optimized with the neighborhood-invariance learning by leading to be close to its reliable neighbors in .

3.4 Graph-based Positive Prediction

Vanilla Neighbor Selection (VNS). A naive way to select neighbors of a target image is based on the cosine similarities between its feature and the features saved in the memory. In the vanilla neighbor selection (VNS) method, we directly select the top-k ranked samples from the memory as the reliable neighbors. However, in such a simple approach, the similarity between the input and each memory feature is estimated independently, neglecting the relationships among the features in the memory. As a consequence, the similarity estimation of hard positive and negative samples may not be accurate, especially when the model performs poorly on the target domain.
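A minimal Python sketch of this top-k selection is given below. It assumes L2-normalized features, so a dot product equals the cosine similarity, and it optionally skips the query's own slot; the function name and the `exclude` parameter are hypothetical.

```python
def vanilla_neighbor_selection(feature, memory, k, exclude=None):
    """Return the indices of the k memory slots most similar to `feature`."""
    sims = []
    for j, slot in enumerate(memory):
        if j == exclude:
            continue  # optionally skip the query's own slot
        # Features are assumed L2-normalized, so this is the cosine similarity.
        sims.append((sum(f * m for f, m in zip(feature, slot)), j))
    sims.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in sims[:k]]

memory = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
neighbors = vanilla_neighbor_selection([1.0, 0.0], memory, k=2, exclude=0)
```

Each similarity is computed independently of the others, which is exactly the limitation that motivates the graph-based alternative.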

To overcome this problem, we propose a novel neighbor selection method, called graph-based positive prediction (GPP). The GPP network consists of graph convolutional networks (GCNs) [kipf2016semi] and a positive classifier. The goal of GPP is to refine the similarities between the input and the features in the memory by leveraging the relationships among the features. To achieve this goal, the GPP network is trained to predict the probability that each sample in the memory is a positive of the input sample. We train the GPP network with labeled source samples and then apply it to infer reliable neighbors for target training samples.

3.4.1 Training GPP on Source Domain

To train the network of GPP, we simulate the positive prediction process on the source domain. In the implementation, we construct a source memory with slots for storing the features of source samples. The network of GPP includes several graph convolutional layers and a positive classifier. The training process of GPP is illustrated in Fig. 4, which is divided into four steps as described below.

1. Similarity computation. Given a training source sample , we first extract its feature , and compute the cosine similarities between and features in the source memory .

2. Graph construction. We select -nearest-neighbors from the ranked-list as candidate neighbors of , which is denoted as . Then, we construct a complete undirected graph , where denotes the set of nodes and indicates the set of edges. We denote as the node features. In order to encode the information of , we normalize by subtracting ,

(9)

In practice, the can be represented by a matrix with a size of , where is the feature dimension of each node. As well, the weights of are represented by an adjacency matrix , where

(10)

Lastly, each row of the adjacency matrix is normalized by the softmax function.

3. Feature updating with GCNs. In this step, we aim to improve the node representations with the neighbor relations, so that positive samples can be accurately predicted from the candidates. To achieve this goal, we employ graph convolutional networks (GCNs) [kipf2016semi, wang2019linkage] to update the node features on the graph. The input to the GCNs is a set of node features and an adjacency matrix, and the output is a new set of node features. Every graph convolutional layer in the GCNs can be written as a non-linear function,

(11)

where the input of the first layer is the set of node features and the output of the last layer is the refined set of node features. Each layer applies a matrix concatenation operation and has its own learnable weight matrix, whose size is determined by the dimensions of the layer's input and output features. In this paper, we adopt 4 graph convolutional layers to form the GCNs. The output is used to predict the positive probabilities for nodes by a positive classifier, as described in the next step.

Fig. 5: Toy example of invariance learning. Dot colors denote classes. In each step, an input and its reliable neighbors (highlighted in circle) are enforced to be close by neighborhood-invariance learning, while an input and other samples (out of circle) are enforced to be far away by exemplar-invariance learning. With the interaction of exemplar-invariance and neighborhood-invariance, samples with the same class are gradually grouped closer, while dissimilar groups are separated.

4. Prediction with positive classifier. Given the updated features , we use a positive classifier to predict the probability that a node belongs to the positive sample of the input . For training of the GPP network, the loss function on the source domain is formulated by,

(12)

where . is the predicted probability that node belongs to the positive of the input . is the ground-truth binary label, defined as,

(13)

We optimize the network of GPP with computed on the labeled source data.
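The training steps above can be sketched end to end in plain Python. Several details are assumptions rather than the paper's definitions: edge weights are taken to be inner products between candidate features before the row-wise softmax, the layer non-linearity is ReLU, the weight matrix has twice as many rows as the input dimension to match the concatenation, a candidate is labeled positive when it shares the input's identity, and the loss is a binary cross-entropy averaged over nodes. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[dot(row, col) for col in zip(*b)] for row in a]

def build_graph(feature, memory, k):
    """Step 2: pick the k most similar slots, then build node features and adjacency."""
    ranked = sorted(range(len(memory)), key=lambda j: dot(feature, memory[j]), reverse=True)
    candidates = ranked[:k]
    # Node features: candidate features with the input feature subtracted.
    nodes = [[m - f for m, f in zip(memory[j], feature)] for j in candidates]
    # Assumption: edge weights are inner products between candidate features,
    # followed by the row-wise softmax mentioned in the text.
    adjacency = [softmax([dot(memory[i], memory[j]) for j in candidates]) for i in candidates]
    return candidates, nodes, adjacency

def graph_conv_layer(nodes, adjacency, weight):
    """Step 3: concatenate each node with its aggregated neighbors, project, apply ReLU."""
    aggregated = matmul(adjacency, nodes)
    concatenated = [n + a for n, a in zip(nodes, aggregated)]  # width 2 * d_in
    projected = matmul(concatenated, weight)                   # weight: (2 * d_in, d_out)
    return [[max(0.0, v) for v in row] for row in projected]   # ReLU is an assumed choice

def positive_classification_loss(probs, labels, eps=1e-12):
    """Step 4: binary cross-entropy averaged over nodes (averaging is an assumption)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

# Toy source memory; the input sample has identity 7, shared with slots 0 and 1.
memory = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
identities = [7, 7, 3, 5]
candidates, nodes, adjacency = build_graph([1.0, 0.0], memory, k=3)
refined = graph_conv_layer(nodes, adjacency, weight=[[1.0], [0.0], [0.0], [-1.0]])
labels = [1 if identities[j] == 7 else 0 for j in candidates]
loss = positive_classification_loss([0.5, 0.5, 0.5], labels)
```

In a real system the classifier probabilities would come from the refined node features; here they are fixed at 0.5 only to keep the example small.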

3.4.2 Reliable Neighbors Selection on Target Domain

On the target domain, we run the GPP network with the same steps as in training on the source domain, except for the loss computation in the last step. For a target sample, we first compute the similarities between its feature and the features in the target memory with step 1. Then, we obtain the positive probabilities for the candidate neighbors with the remaining steps. Finally, the reliable neighbors are selected depending on the positive probabilities, defined as,

(14)

where the threshold controls whether a candidate is selected as a reliable neighbor.
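The thresholding step can be sketched as a one-line filter in Python. The strict comparison is an assumption, the function name is hypothetical, and the default of 0.9 follows the value reported as best in the parameter analysis.

```python
def select_reliable_neighbors(candidates, positive_probs, threshold=0.9):
    """Keep the candidates whose predicted positive probability exceeds the threshold."""
    # Assumption: a strict comparison; the text does not say how ties are handled.
    return [j for j, p in zip(candidates, positive_probs) if p > threshold]

reliable = select_reliable_neighbors([4, 9, 2], [0.95, 0.40, 0.91])
```

Unlike a fixed top-k rule, the number of neighbors returned varies from sample to sample.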

3.5 Final Loss for Network

By combining the losses of supervised learning, invariance learning and graph-based positive prediction learning, the final loss for the overall network is formulated as,

(15)

In summary, we introduce a framework for UDA in person re-ID, in which the supervised learning loss aims to maintain a basic representation for persons. Meanwhile, the invariance learning loss attempts to take the knowledge from the labeled source domain and incorporate the invariance properties of the target domain into the network. The GPP loss trains a reliable neighbor prediction network that could facilitate the invariance learning. Note that the GPP loss is only calculated on the source domain and is only used to update the network of GPP.

3.6 Discussion on the Three Invariance Properties

We analyze the advantages and disadvantages of the three invariance properties, and how they affect one another. Exemplar-invariance pushes every exemplar away from all others. This is beneficial because it enlarges the distance between exemplars of different identities. However, exemplars of the same identity are also pushed apart, which is harmful to the system. In contrast, neighborhood-invariance encourages each exemplar and its neighbors to be close to each other. This is beneficial because it reduces the distance between exemplars of the same identity. However, neighborhood-invariance might also pull images of different identities closer, because we cannot guarantee that each neighbor shares the same identity as the input exemplar. Therefore, there is a trade-off between exemplar-invariance and neighborhood-invariance, where the former pushes exemplars of different identities apart while the latter encourages exemplars of the same identity to be close to each other. In other words, the interaction between exemplar-invariance and neighborhood-invariance can be considered a kind of local clustering. On the one hand, when enforcing neighborhood-invariance, samples with the same identity are progressively grouped closer, through the neighborhood relation and the connection of their shared neighbors. On the other hand, exemplar-invariance pushes dissimilar samples away from each other, so that dissimilar groups are separated. Camera-invariance has a similar effect to exemplar-invariance and also leads the exemplar and its camera-style transferred samples to share the same representation. A toy example of invariance learning is shown in Fig. 5.

4 Experiment

4.1 Dataset

We evaluate the proposed method on three large-scale person re-identification (re-ID) benchmarks: Market-1501 [zheng2015scalable], DukeMTMC-reID [zheng2017unlabeled] and MSMT17 [wei2018person].

Market-1501 [zheng2015scalable] includes 32,668 labeled person images of 1,501 identities collected from six non-overlapping camera views. For evaluation, the dataset is divided into 12,936 images of 751 identities for training, and 3,368 query images and 19,732 images of 750 identities for testing.

DukeMTMC-reID [zheng2017unlabeled] is a re-ID benchmark collected from the DukeMTMC dataset [ristani2016performance]. The dataset is captured from eight cameras and includes 36,411 person images from 1,812 identities. It contains 16,522 images of 702 identities for training, 2,228 query images of 702 identities and 17,661 gallery images for testing.

MSMT17 [wei2018person] is a newly released large-scale person re-ID benchmark. It is composed of 126,441 person images from 4,101 identities collected by a 15-camera system. The training set consists of 32,621 images of 1,041 identities, and the testing set contains 11,659 images as query and 82,161 images as gallery. The dataset has serious variations of scene and lighting, and is more challenging than the other two benchmarks.

Evaluation Protocol. During training, we use a labeled training dataset as the source domain and an unlabeled training dataset as the target domain. In testing, performance is evaluated on the target testing set by the cumulative matching characteristic (CMC) and mean Average Precision (mAP). For CMC, we use rank-1, rank-5, rank-10, and rank-20 as metrics.
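The two metrics can be illustrated with a small Python sketch. It takes, for each query, a 0/1 list marking which ranked gallery items share the query's identity. This is a simplification: standard re-ID protocols also remove gallery images from the same camera as the query before ranking, which is omitted here, and the function names are hypothetical.

```python
def average_precision(ranked_matches):
    """AP for one query; `ranked_matches` is a 0/1 list over the ranked gallery."""
    hits, precision_sum = 0, 0.0
    for rank, match in enumerate(ranked_matches, start=1):
        if match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def evaluate(all_ranked_matches, ranks=(1, 5, 10, 20)):
    """Return CMC scores at the given ranks and mAP over all queries."""
    cmc = {}
    for r in ranks:
        # A query counts as a hit at rank r if any correct match is in its top r.
        hits = sum(1 for m in all_ranked_matches if any(m[:r]))
        cmc[r] = hits / len(all_ranked_matches)
    mean_ap = sum(average_precision(m) for m in all_ranked_matches) / len(all_ranked_matches)
    return cmc, mean_ap

queries = [[1, 0, 1, 0], [0, 1, 0, 0]]
cmc, mean_ap = evaluate(queries, ranks=(1, 2))
```

CMC only looks at the first correct match per query, while mAP rewards ranking all correct matches early, which is why the two are reported together.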

4.2 Experimental Settings

Deep re-ID model. We adopt ResNet-50 [resnet] as the backbone of the feature extractor and initialize the model with the parameters pre-trained on ImageNet [deng2009imagenet]. Specifically, we fix the first two residual layers and set the stride size of the last residual layer to 1. After the global average pooling (GAP) layer, we add a batch normalization (BN) layer [ioffe2015batch] followed by ReLU [nair2010relu] and Dropout [srivastava2014dropout]. The identity classifier is an FC layer followed by a softmax function. The input image is resized to 256 . During training, we perform random flipping, random cropping and random erasing [zhong2017random] for data augmentation. The probability of dropout is set to 0.5. We train the re-ID model with a learning rate of 0.01 for the ResNet-50 base layers and of 0.1 for the others in the first 40 epochs. The learning rate is divided by 10 for the next 20 epochs. The SGD optimizer is used to train the re-ID model with a mini-batch size of 128 for both source and target domains. We initialize the updating rate of the memory to 0.01 and increase it linearly with the number of epochs, i.e., . Unless otherwise specified, we set the temperature factor . For vanilla neighbor selection, we set and directly select the top- nearest-neighbors from the memory as the reliable neighbors. For GPP, we set the number of candidate neighbors and neighbor selection threshold . We train the model with exemplar-invariance and camera-invariance learning for the first 10 epochs and add neighborhood-invariance learning for the remaining epochs. In testing, we extract the L2-normalized output of GAP as the image feature and adopt the Euclidean metric to measure the similarities between query and gallery images.
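The learning-rate schedule of the re-ID model can be written as a small Python function. This is a sketch only: counting epochs from zero and the function name are assumptions, and the memory updating rate is left out because its formula is not recoverable from the text.

```python
def learning_rate(epoch, is_backbone):
    """Step schedule: base rate for epochs 0-39, divided by 10 for epochs 40-59."""
    base = 0.01 if is_backbone else 0.1  # ResNet-50 base layers vs. the other layers
    return base if epoch < 40 else base / 10.0

schedule = [(learning_rate(e, True), learning_rate(e, False)) for e in (0, 39, 40, 59)]
```

Using a smaller rate for the pre-trained backbone than for the newly added layers is a common way to preserve the ImageNet initialization.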

Network of graph-based positive prediction (GPP). The GPP network contains four graph convolutional layers and a positive classifier. The input and output dimensions of these graph convolutional layers are: 2048→2048, 2048→512, 512→256, 256→256. The positive classifier is composed of a 256-dimensional FC layer, a BN layer, a PReLU layer [he2015PReLU], and a 2-dimensional FC layer. We utilize the SGD optimizer to update the GPP network after the 5th epoch. The learning rate is initialized as 0.01 and divided by 10 after the 40th epoch.

Baseline setting. The baseline is the network trained using only the identity classifier. We employ the baseline in two ways: 1) Train on target, training the baseline on the labeled target training data and testing on the target testing set, and 2) Source only, training the baseline on a labeled source training set and directly applying it to a target testing set without modification. “Train on target” and “Source only” can be considered the upper bound and lower bound of our method, respectively.

Temperature   Duke → Market      Market → Duke
              Rank-1   mAP       Rank-1   mAP
0.01          50.1     21.1      45.6     21.5
0.03          79.4     54.5      68.1     45.6
0.05          84.1     63.8      74.0     54.4
0.1           75.7     46.5      63.0     39.8
0.5           59.0     30.9      51.0     28.0
1.0           57.9     30.3      44.6     23.8
TABLE I: Evaluation with different values of the temperature factor in Eq. 3.
Fig. 6: Evaluation with different number of candidate samples for graph-based positive prediction.
Methods           Invariance    DukeMTMC-reID → Market-1501      Market-1501 → DukeMTMC-reID
                  EI  CI  NI    R-1   R-5   R-10  R-20  mAP      R-1   R-5   R-10  R-20  mAP
Train on target   -   -   -     87.6  95.5  97.2  98.3  69.4     75.6  87.3  90.6  92.9  57.8
Source only       -   -   -     43.1  58.8  67.3  74.3  17.7     28.9  44.0  50.9  57.5  14.8
Ours              ✓   -   -     48.7  67.4  74.0  80.2  21.0     34.2  51.3  58    64.2  18.7
Ours              ✓   ✓   -     63.1  79.1  84.6  89.1  28.4     53.9  70.8  76.1  80.7  29.7
Ours              ✓   -   ✓     71.8  83.1  87.1  90.6  45.7     67.2  80.0  83.8  86.7  48.3
Ours              ✓   ✓   ✓     84.1  92.8  95.4  96.9  63.8     74.0  83.7  87.4  90.0  54.4
TABLE II: Methods comparison when transferred to Market-1501 and DukeMTMC-reID. Train on target: baseline model trained with the labeled target training data. Source only: baseline model trained with only labeled source data. EI: exemplar-invariance. CI: camera-invariance. NI: neighborhood-invariance.

4.3 Parameter Analysis

We first investigate the sensitivity of our approach to three important hyper-parameters, i.e., the temperature factor in Eq. 3, the number of candidate neighbors in the graph, and the neighbor selection threshold in Eq. 14. To isolate the impact of each parameter, we vary one parameter and keep the others fixed. Experiments are evaluated on the setting of transferring between Market-1501 and DukeMTMC-reID.

Temperature factor. In Table I, we report the impact of the temperature factor in Eq. 3. Assigning a lower value to the temperature gives rise to a lower entropy, which commonly produces better results. However, when the temperature is extremely low, the model does not converge and performs poorly, e.g., at 0.01. The best results are obtained at 0.05.

Number of candidate neighbors . In Fig. 6, we evaluate the performance of using a different number of candidate neighbors in the graph . When , our approach reduces to the model trained with only exemplar-invariance and camera-invariance. It is evident that injecting neighborhood-invariance into the network () can significantly improve accuracy, and our method is insensitive to the changes of . The rank-1 accuracy and mAP first improve with the increase of and reach stable when . This improvement tendency is reasonable, because: 1) Using a larger will include more positive samples in the graph and GPP might discover more hard positive samples for neighborhood-invariance learning; 2) Most positive samples are within the top-100 nearest neighbours, so it is unnecessary to include too many candidate samples. Considering the trade-off between accuracy and speed, we set in our approach.

Fig. 7: Evaluation with different values of in Eq. 14

Threshold of positive neighbor selection. In Fig. 7, we compare the performance of using different values of the threshold in Eq. 14. On the one hand, assigning too high a value to the threshold would select only easy positive samples for neighbor-invariance learning, so the model struggles with hard positive samples in testing. Note that, when , our method reduces to the model trained without neighbor-invariance learning, since extremely few samples would be selected as reliable positive neighbors. On the other hand, giving too low a value to the threshold might include too many false positive samples for neighbor-invariance learning. Pulling a sample toward false positive samples would undoubtedly have deleterious effects on the results. Our approach reaches the best result when the threshold is around 0.9.

According to above analysis, we set , and in the following experiments.

4.4 Evaluation and Analysis

Performance of baseline in domain adaptation. In Table II, we report the results of the baseline when transferring between Market-1501 and DukeMTMC-reID. When trained on the labeled target training set, the baseline (Train on target) achieves high accuracy. However, we observe a serious drop in performance when the baseline (Source only) is trained only using the labeled source set and directly applied to the target testing set. For the case of testing on Market-1501, the baseline (Train on target) achieves a rank-1 accuracy of 87.6%, but the rank-1 accuracy of the baseline (Source only) declines to 43.1%. A similar drop can be observed when transferred from Market-1501 to DukeMTMC-reID. This decline in accuracy is mainly caused by the domain shifts between datasets.

Ablation study on invariance learning. To study the effectiveness of the proposed invariance learning of target domain, we conduct ablation experiments in Table II. We start from the basic model (Our method w/ EI), which enforces exemplar-invariance learning into the baseline (Source only) model, and then add camera-invariance, neighborhood-invariance, and both.

First, we show the effect of adding exemplar-invariance learning. As shown in Table II, “Ours w/ EI” consistently improves the results over the baseline (Source only). Specifically, the rank-1 accuracy improves from 43.1% to 48.7% and 28.9% to 34.2% when tested on Market-1501 and DukeMTMC-reID, respectively. This demonstrates that exemplar-invariance learning is an effective way to improve the discrimination of person descriptors for the target domain.

# CamStyle samples         Duke → Market      Market → Duke
                           Rank-1   mAP       Rank-1   mAP
0                          71.8     45.7      67.2     48.3
1                          83.5     61.1      72.1     52.3
3                          84.0     62.8      73.0     53.0
(number of cameras) - 1    84.1     63.8      74.0     54.4
TABLE III: Evaluation with different numbers of camera style samples for each target image. The last row uses one fewer sample than the number of cameras in the target domain. Model is trained with all three invariance constraints.
Module          Training time   GPU memory   Duke → Market    Market → Duke
Memory   GPP    (s/iter)        (MB)         R-1    mAP       R-1    mAP
-        -      0.50            6,800        70.2   40.6      55.3   33.7
✓        -      0.52            7,000        77.4   45.5      65.4   42.6
✓        ✓      0.63            9,800        84.1   63.8      74.0   54.4
TABLE IV: Results and computational cost analysis of the exemplar memory and graph-based positive prediction (GPP).
Neighbor selection   Condition    Duke → Market    Market → Duke
                                  R-1    mAP       R-1    mAP
VNS                  top-8        77.4   45.5      65.4   42.6
Variant VNS          threshold    77.0   46.4      64.5   42.8
Variant GPP          top-8        79.8   51.3      68.2   45.3
GPP                  threshold    84.1   63.8      74.0   54.4
TABLE V: Results of using different neighbor selection methods. VNS: Vanilla neighbor selection. GPP: graph-based positive prediction.

Second, we validate the effectiveness of camera-invariance learning over the basic model (Ours w/ EI). In Table II, we observe a significant improvement when adding camera-invariance learning to the system. For example, “Ours w/ EI+CI” achieves a rank-1 accuracy of 63.1% when transferred from DukeMTMC-reID to Market-1501. This is higher than “Ours w/ EI” by 14.4% in rank-1 accuracy. The improvement demonstrates that the image variations caused by target cameras severely impact the performance on the target testing set. Injecting camera-invariance learning into the model effectively improves the robustness of the system to camera style variations. In Table III, we also report the results of using different numbers of camera style samples for each target image. Even with only one camera style sample for each target image, the results of our approach are considerably improved and are only slightly lower than those obtained with all camera style samples. This indicates that the number of camera style samples in our approach can be agnostic to the number of cameras, and thus our approach is scalable to a scenario with many cameras.

Third, we evaluate the effect of neighborhood-invariance learning. As reported in Table II, neighborhood-invariance significantly improves the performance of the basic model (Ours w/ EI). When adding neighborhood-invariance to the basic model, “Ours w/ EI+NI” obtains 67.2% rank-1 accuracy and 48.3% mAP when transferred from Market-1501 to DukeMTMC-reID. This improves on the basic model by 33.0% in rank-1 accuracy and by 29.6% in mAP. A similar improvement is achieved when transferred to Market-1501.

Finally, we demonstrate the mutual benefit among the three invariance properties. As shown in the last row of Table II, the three invariance properties are complementary to each other. Integrating them yields a clear improvement over independently adding camera-invariance or neighborhood-invariance to the basic model. For example, “Ours w/ EI+CI+NI” achieves a rank-1 accuracy of 84.1% when transferred from DukeMTMC-reID to Market-1501, outperforming “Ours w/ EI+CI” by 21% and “Ours w/ EI+NI” by 12.3%. A similar improvement is observed when transferred to DukeMTMC-reID. In particular, our final model (Ours w/ EI+CI+NI) has only a small gap to the upper bound model (Train on target). For instance, our final model reaches a rank-1 accuracy of 74.0% and an mAP of 54.4% when transferred to DukeMTMC-reID, which is lower than “Train on target” by 1.6% in rank-1 accuracy and by 3.4% in mAP. This demonstrates that our approach has a strong capacity for bridging the gap between person re-ID domains.

Methods Reference Market-1501 DukeMTMC-reID
Source R-1 R-5 R-10 mAP Source R-1 R-5 R-10 mAP
LOMO [liao2015lomo] CVPR 15 - 27.2 41.6 49.1 8.0 - 12.3 21.3 26.6 4.8
BOW [zheng2015scalable] ICCV 15 - 35.8 52.4 60.3 14.8 - 17.1 28.8 34.9 8.3
UMDL [peng2016unsupervised] CVPR 16 Duke 34.5 52.6 59.6 12.4 Market 18.5 31.4 37.6 7.3
PTGAN [wei2018person] CVPR 18 Duke 38.6 - 66.1 - Market 27.4 - 50.7 -
PUL [fan2017pul] TOMM 18 Duke 45.5 60.7 66.7 20.5 Market 30.0 43.4 48.5 16.4
SPGAN [deng2018image] CVPR 18 Duke 51.5 70.1 76.8 22.8 Market 41.1 56.6 63.0 22.3
CAMEL [yu2017cross] ICCV 17 Multi 54.5 73.1 - 26.3 Multi 40.3 57.6 - 19.8
MMFA [lin2018multibmvc] BMVC 18 Duke 56.7 75.0 81.8 27.4 Market 45.3 59.8 66.3 24.7
SPGAN+LMP [deng2018image] CVPR 18 Duke 57.7 75.8 82.4 26.7 Market 46.4 62.3 68.0 26.2
TJ-AIDL [wang2018reid] CVPR 18 Duke 58.2 74.8 81.1 26.5 Market 44.3 59.6 65.0 23.0
CamStyle [zhong2019camstyle] TIP 19 Duke 58.8 78.2 84.3 27.4 Market 48.4 62.5 68.9 25.1
DECAMEL [yu2018unsupervised] TPAMI 19 Multi 60.2 76.0 - 32.4 - - - - -
HHL [Zhong_2018_ECCV] ECCV 18 Duke 62.2 78.8 84.0 31.4 Market 46.9 61.0 66.7 27.2
DASy [Bak_2018_ECCV] ECCV 18 SyRI 65.7 - - - - - - - -
MAR [yu2019unsupervised] CVPR 19 MSMT 67.7 81.9 - 40.0 MSMT 67.1 79.8 - 48.0
ECN [zhong2019invariance] CVPR 19 Duke 75.1 87.6 91.6 43.0 Market 63.3 75.8 80.4 40.4
Ours This paper Duke 84.1 92.8 95.4 63.8 Market 74.0 83.7 87.4 54.4
TABLE VI: Unsupervised person re-ID performance comparison with state-of-the-art methods on Market-1501 and DukeMTMC-reID. Market: Market-1501 [zheng2015scalable]. Duke: DukeMTMC-reID [zheng2017unlabeled]. MSMT: MSMT17 [wei2018person]. SyRI: SyRI [Bak_2018_ECCV]. Multi: a combination of seven datasets.

Benefit of exemplar memory. In Table IV, we validate the effectiveness of the exemplar memory. When training our model without memory, we enforce the invariance learning within a mini-batch. Specifically, the inputs of the target mini-batch are composed of the target samples, their corresponding CamStyle samples and their corresponding k-nearest neighbors. Without the memory, we cannot employ GPP in our model. Therefore, for a fair comparison, we train our memory-based model without GPP, i.e., we select the reliable neighbors with the vanilla neighbor selection (VNS) method. As shown in Table IV, the exemplar-memory-based method clearly outperforms the mini-batch-based method. This demonstrates the effectiveness of leveraging relations across the whole dataset through the memory. It is noteworthy that using the exemplar memory adds little training time (+0.02 s/iter) and GPU memory (+200 MB) compared with using the mini-batch.

Fig. 8: The curve of selected reliable neighbors in (a) recall and (b) precision throughout the training. Results are compared between vanilla neighbor selection (VNS) and graph-based positive prediction (GPP).

Effectiveness of graph-based positive prediction. In Table IV, we report the results of training our method with and without graph-based positive prediction (GPP). Without GPP, our method reduces to training with vanilla neighbor selection (VNS), i.e., directly selecting the k-nearest neighbors from the memory as reliable neighbors. Clearly, adding GPP to the system significantly improves the accuracy, at the cost of an extra 0.11 s/iter of training time and 2,800 MB of GPU memory.

In Table V, we also evaluate two other neighbor selection methods that are constructed as follows:

  • Variant VNS: We train a positive prediction classifier as the same way as GPP, but without using graph convolution layers to update features of candidate samples. is selected according to the predicted positive probability and threshold . We set for variant VNS.

  • Variant GPP: We use the same architecture as the proposed GPP. The top- samples are selected as , according to the positive prediction of GPP. We set for variant GPP.

Table V shows that: (1) GPP based methods consistently outperform VNS based methods, whether selecting by top-k samples or by a threshold on the predicted positive probability. This demonstrates the effectiveness of GPP, which updates features with the neighbor relations. (2) For GPP based methods, it is better to select reliable neighbors based on a threshold than to use a fixed number of top-k samples. Because the number of positive samples differs greatly from one sample to another, using a fixed top-k might include negative samples that have low positive probability or ignore positive samples that have high positive probability. In contrast, the threshold based strategy is more flexible and always selects samples with high positive probability. (3) The improvement from the threshold based strategy is limited or even negative for the VNS method. This is because the inputs of the positive prediction classifier are unrelated features, which do not consider the relations among candidates. In this case, the positive prediction classifier fails to refine the similarities between the input and the candidates.

To further validate the effectiveness of GPP, we evaluate the selected reliable neighbors throughout the training for VNS and GPP. Two metrics are applied: recall and precision. Recall is the fraction of true positive samples in the selected neighbor set over the total number of positive samples in the dataset. Precision is the fraction of true positive samples among all samples in the selected neighbor set. Fig. 8 shows that: 1) GPP consistently produces higher precision than the VNS method. 2) The recall of GPP is lower than that of VNS in the early training epochs. However, the recall of GPP grows consistently with the training epochs and is much higher than that of VNS in later training epochs. This confirms the superiority of GPP over VNS in terms of reliable neighbor selection and adaptation accuracy.
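The two neighbor-quality metrics can be computed as in the following Python sketch. It assumes ground-truth identities are available for analysis only, and that the query itself is not counted among its own positives; the function name is hypothetical.

```python
def neighbor_precision_recall(selected, identities, query_index):
    """Precision and recall of a selected neighbor set for one query sample."""
    query_id = identities[query_index]
    # Positives: every other sample in the dataset with the query's identity.
    positives = {j for j, pid in enumerate(identities) if pid == query_id and j != query_index}
    true_pos = len(positives & set(selected))
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(positives) if positives else 0.0
    return precision, recall

identities = [1, 1, 1, 2, 2]
precision, recall = neighbor_precision_recall([1, 3], identities, query_index=0)
```

A selector that returns few but correct neighbors scores high precision and low recall, which matches the early-training behavior reported for GPP.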

4.5 Comparison with State-of-the-art Methods

In Table VI and Table VII, we compare our approach with state-of-the-art unsupervised learning methods when tested on Market-1501, DukeMTMC-reID and MSMT17.

SOTA on Market-1501 and DukeMTMC-reID. Table VI reports the comparisons when tested on Market-1501 and DukeMTMC-reID. We compare with two hand-crafted feature based methods without transfer learning: LOMO [liao2015lomo] and BOW [zheng2015scalable], four unsupervised methods that use labeled source data to initialize the model but ignore the labeled source data when learning features for the target domain: CAMEL [yu2017cross], DECAMEL [yu2018unsupervised], UMDL [peng2016unsupervised] and PUL [fan2017pul], and nine unsupervised domain adaptation approaches: PTGAN [wei2018person], SPGAN [deng2018image], MMFA [lin2018multibmvc], TJ-AIDL [wang2018reid], DASy [Bak_2018_ECCV], MAR [yu2019unsupervised], CamStyle [zhong2019camstyle], HHL [Zhong_2018_ECCV] and ECN [zhong2019invariance].

We first compare with hand-crafted feature based methods, which require learning on neither the labeled source set nor the unlabeled target set. These two hand-crafted features have proven effective on small datasets, but fail to produce competitive results on large-scale datasets. For example, the rank-1 accuracy of LOMO is 12.3% when tested on DukeMTMC-reID, which is much lower than that of transfer learning based methods.

Next, we compare with four unsupervised methods. Benefiting from initializing the model with the labeled source data and learning with unlabeled target data, these four unsupervised approaches are commonly superior to hand-crafted methods. For example, PUL obtains a rank-1 accuracy of 45.5% when using DukeMTMC-reID as the source set and tested on Market-1501, surpassing BOW by 9.7% in rank-1 accuracy.

Methods                       Reference    Source    MSMT17
                                                     R-1    R-5    R-10   mAP
TAUDL [li2018unsupervised]    ECCV 18      MSMT*     28.4   -      -      12.5
UTAL [li2019unsupervised]     TPAMI 19     MSMT*     31.4   -      -      13.1
DECA [yu2018unsupervised]     TPAMI 19     Multi     30.3   -      -      11.1
PTGAN [wei2018person]         CVPR 18      Market    10.2   -      24.4   2.9
ECN [zhong2019invariance]     CVPR 19      Market    25.3   36.3   42.1   8.5
Ours                          This paper   Market    40.4   53.1   58.7   15.2
PTGAN [wei2018person]         CVPR 18      Duke      11.8   -      27.4   3.3
ECN [zhong2019invariance]     CVPR 19      Duke      30.2   41.5   46.8   10.2
Ours                          This paper   Duke      42.5   55.9   61.5   16.0
TABLE VII: Unsupervised/semi-supervised person re-ID performance comparison with state-of-the-art methods on MSMT17. *: Using within-camera identity annotations of target dataset. Multi: a combination of seven datasets.

Finally, we compare with the domain adaptation based approaches, which produce state-of-the-art results. As can be seen, our approach outperforms them by a large margin on both datasets. Specifically, our method achieves rank-1 accuracy = 84.1% and mAP = 63.8% when using DukeMTMC-reID as the source set and tested on Market-1501, and obtains rank-1 accuracy = 74.0% and mAP = 54.4% in the reverse setting. It is worth noting that our approach significantly outperforms MAR [yu2019unsupervised], which uses a larger dataset (MSMT17) as the source domain. Our method surpasses the current best method, ECN [zhong2019invariance], by 20% and 14.4% in mAP when tested on Market-1501 and DukeMTMC-reID, respectively.

SOTA on MSMT17. We also evaluate our approach on a larger and more challenging dataset, i.e., MSMT17. Since it is a newly released dataset, only three unsupervised domain adaptation methods (PTGAN [wei2018person], DECA [yu2018unsupervised], and ECN [zhong2019invariance]) have reported results on MSMT17. In addition, we compare with two semi-supervised methods, TAUDL [li2018unsupervised] and UTAL [li2019unsupervised], which use within-camera identity annotations of the target domain. As shown in Table VII, our approach clearly exceeds both the unsupervised domain adaptation methods and the semi-supervised methods. For instance, our method produces rank-1 accuracy = 42.5% and mAP = 16.0% when using DukeMTMC-reID as the source set, which is higher than ECN by 12.3% in rank-1 accuracy and by 5.8% in mAP. A similar advantage can be observed when using Market-1501 as the source domain.
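The margins quoted above can be re-derived from Table VII. The following is a minimal sketch for checking that arithmetic; the dictionary layout and the helper name `margin_over` are illustrative assumptions, and only the numbers are taken from the table.

```python
# Rank-1 and mAP values (in %) copied from Table VII, target dataset MSMT17.
# The nesting (source -> method -> metric) is an assumption made for this sketch.
TABLE_VII = {
    "Market": {
        "ECN": {"R-1": 25.3, "mAP": 8.5},
        "Ours": {"R-1": 40.4, "mAP": 15.2},
    },
    "Duke": {
        "ECN": {"R-1": 30.2, "mAP": 10.2},
        "Ours": {"R-1": 42.5, "mAP": 16.0},
    },
}


def margin_over(baseline, method, source, table=TABLE_VII):
    """Absolute gain (percentage points) of `method` over `baseline` for one source set."""
    base = table[source][baseline]
    ours = table[source][method]
    # Round to one decimal to match the precision reported in the table.
    return {metric: round(ours[metric] - base[metric], 1) for metric in base}


if __name__ == "__main__":
    for source in TABLE_VII:
        print(source, margin_over("ECN", "Ours", source))
```

With DukeMTMC-reID as the source this gives gains of 12.3 (rank-1) and 5.8 (mAP), matching the text; with Market-1501 as the source the same computation gives 15.1 and 6.7.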

5 Conclusion

In this paper, we propose an exemplar memory based unsupervised domain adaptation (UDA) framework for the person re-ID task. With the exemplar memory, we can directly evaluate the relationships between target samples and thus effectively enforce the underlying invariance constraints of the target domain during network training. Moreover, the memory enables us to design a graph-based positive prediction (GPP) method that can infer reliable neighbors from candidate neighbors. Experiments demonstrate the effectiveness of invariance learning, the memory and GPP in improving the transferability of deep re-ID models. Our approach achieves new state-of-the-art UDA accuracy on three large-scale domains.

References