1 Introduction
To reidentify a particular is to identify it as (numerically) the same particular as one encountered on a previous occasion plantinga1961things . Image/video reidentification (reID) is a fundamental problem in computer vision, and reID techniques serve as an indispensable tool for numerous real-life applications, for instance, person reID for public safety Zheng2016Oct and travel time measurement via vehicle reID coifman1998vehicle . The key component of reID tasks is the notion of identity, which makes reID tasks quite different from traditional classification tasks in two ways. Firstly, determining the label involves two samples, i.e., there is no label when an individual sample is given. Secondly, in reID tasks the samples in the test set belong to previously unseen identities, while in classification tasks the test samples all fall into known classes. Take the person reID task as an example: our target is to spot a person of interest in an image set, where that person does not belong to a specific class and does not appear in the training set.
In many practical situations, the training data and the testing data come from different domains. Going back to the person reID example, when a new camera is placed in a new environment, a new domain is added, and its data are too costly and impractical to annotate; a serviceable reID model should nevertheless achieve satisfactory accuracy on such unlabeled data. Unsupervised domain adaptation means learning a model for the target domain when given both a fully annotated source dataset and an unlabeled target dataset. Existing algorithms for unsupervised domain adaptive reID tasks typically learn domain-invariant representations or generate data for the target domain through newly designed networks, which are practical solutions but lack theoretical support Li2018Apr ; Wang2018Mar ; Deng2017Nov . Meanwhile, current theoretical analyses of unsupervised domain adaptation are only concerned with classification tasks BenDavid2010 ; BenDavid2014 ; Mansour2009Feb and are not suitable for reID tasks. A theoretical guarantee for the domain adaptive reID problem is therefore needed.
In this paper, we first theoretically analyze unsupervised domain adaptive reID tasks based on BenDavid2014 , in which three assumptions are introduced for classification. In BenDavid2014 , it is assumed that the source domain and the target domain share the same label space. However, in reID tasks, the notion of label is defined for pairwise data, and the label indicates whether a data pair belongs to the same ID or not. We adapt the three assumptions to the input space of pairwise data. Moreover, instead of imposing the assumptions on the raw data as in BenDavid2014 , we assume a resemblance between the feature spaces of the two domains. The first assumption is that the criteria for classifying feature pairs are the same in the two domains, which is referred to as the covariate shift assumption. The second is Separately Probabilistic Lipschitzness, indicating that the feature pairs can be divided into clusters. The last assumption is the weight ratio, which concerns the probability that a feature is repeated among the features from the two domains. Based on the three assumptions, we show the learnability of unsupervised domain adaptive reID tasks. Moreover, since our guarantee is built on well-extracted features from real images, the encoder, i.e., the feature extractor, is trained via a novel self-training framework; self-training was originally proposed for NLP tasks mcclosky2006effective ; mcclosky2006reranking . Concretely, we iteratively refine the encoder by making guesses on the unlabeled target domain and then training the encoder with these samples. In light of the mentioned assumptions, we propose several loss functions on the encoder and on the samples with guessed labels, and the problem of selecting which samples with guessed labels to train on is solved by minimizing the proposed loss functions. For the Separately Probabilistic Lipschitzness assumption, we wish to minimize the intra-cluster and inter-cluster distance. The sample selection problem then becomes a data clustering problem, and minimizing the loss functions is transformed into finding a distance metric for the data. Another metric is designed for the weight ratio. Combining the two metrics gives a distance that evaluates the confidence of the guessed labels. Finally, the DBSCAN clustering method Ester is employed to generate data clusters according to a threshold on the distance. With pseudo-labels on the selected data clusters from the target domain, the encoder is trained with triplet loss Weinberger2009 , which is effective for reID tasks.
We carry out experiments on diverse reID tasks, which demonstrate the superiority of our framework. First, the well-studied person reID task is tested, and we show the adaptation results between two large-scale datasets, i.e., Market-1501 Zheng2015Dec and DukeMTMC-reID Ristani2016 . Then we evaluate our algorithm on the vehicle reID task, in which the larger datasets VeRi-776 Liu7553002 and PKU-VehicleID liu2016deep are involved. To sum up, the structure of our paper is shown in Figure 2 and our contributions are as follows:

We introduce theoretical guarantees for unsupervised domain adaptive reID based on BenDavid2014 . A learnability result is shown under three assumptions that concern the feature space. To the best of our knowledge, our paper is the first theoretical analysis of domain adaptive reID tasks.

We theoretically turn the goal of satisfying the assumptions into tractable loss functions on the encoder network and data samples.

A self-training scheme is proposed to iteratively minimize the loss functions. Our framework is applicable to all reID tasks, and its effectiveness is verified on large-scale datasets for diverse reID tasks.
1.1 Related work
Unsupervised domain adaptation has been widely studied for decades, and the algorithms are divided into four categories in the survey margolis2011literature . Using the notions in the survey, our proposed method can be viewed as a combination of feature representation and self-training. More recently, unsupervised domain adaptation has also been widely studied for the person reID task.
Unsupervised domain adaptation and feature representation.
Feature representation based methods try to find a latent feature space shared between domains. In satpal2007domain , the goal is to minimize a distance between the means of the two domains. In a more general manner, pan2008transfer and chen2009extracting try to find a feature space in which the source and target distributions are similar, employing the Maximum Mean Discrepancy (MMD) statistic. Also, ganin2016domain utilize features that cannot discriminate between source and target domains. Our method shares with these methods the intuition that some features from the source and target domains are generalizable. However, unlike these methods, our method approximates this intuition iteratively, and we do not directly optimize the distribution of target domain features.
Unsupervised domain adaptation and self-training.
Self-training methods make guesses on the target domain and iteratively refine the guesses; they are closely related to the Expectation Maximization (EM) algorithm nigam2006semi . In tan2009adapting , the weight on the target data is increased at each iteration, which effectively alters the relative contribution of the source and target domains. A more similar work is bacchiani2003unsupervised , in which the model is initially trained on the source domain, and then the top-1 recognition hypotheses on the target domain are used for adapting the language model. In our algorithm, we do not guess the labels, since different reID datasets have totally different labels (identities); instead, we perform clustering on the data.
Unsupervised domain adaptive person reID.
Due to the rapid development of person reID techniques, several useful unsupervised domain adaptive person reID methods have been proposed. peng2016unsupervised adopt a multi-task dictionary learning scheme to learn a view-invariant representation. Besides, generative models are also applied to domain adaptation in Deng2017Nov ; Wei_2018_CVPR . Wang et al. Wang2018Mar design a network that learns an attribute-semantic and identity-discriminative feature representation. Similarly, Li et al. Li2018Apr leverage information across datasets and derive domain-invariant features via an adaptation and a reID network. Though all the above methods address the adaptation problem, they are not supported by a theoretical framework, and their generalization abilities are not verified on other reID tasks. Fan et al. Fan2017May propose a progressive unsupervised learning method consisting of clustering and fine-tuning the network, which is similar to our self-training scheme. However, they focus only on unsupervised learning, not unsupervised domain adaptation. In addition, their iterative framework is not guided by specific assumptions and thus has no theoretically derived loss functions, unlike ours.
2 Notations and Basic Definitions
In classification tasks, let be the input space and be the output space, and each sample from the input space is denoted by bold lowercase letters . We denote the source domain as and the target domain as , both of which are probability distributions over the input space . Moreover, the real label of each sample is denoted by a labeling function . However, the above notation cannot be used directly to analyze reID tasks, because no identity is shared between the two domains, i.e. and do not have the same output (label) space. Fortunately, by treating reID as classifying data pairs as same or different, we are still able to use the notation and earlier results with some simple reformulations.
Specifically, in reID tasks, we have a training set consisting of data pairs, which means that the input space is , and the output space is , where means the identities in the pair are the same and means different. Observe that in reID tasks the two domains do share some overlapping cues, such as the color of clothes or whether a backpack is worn in person reID. That is, we can encode the original data from the two domains with some feature variables or latent variables, and it is then reasonable to assume that the distributions of features from the two domains satisfy criteria analogous to the assumptions used in BenDavid2014 for classification tasks. Formally, we denote the feature encoder as and , and then the labeling function is , where is the extracted feature space. For simplicity, we denote , where means two different raw data. Note that the labeling function is symmetric, i.e. .
3 Assumptions and DA-Learnability
In this section, we first introduce some assumptions reflecting how the source domain interacts with the target domain. Then, with these assumptions, we show the learnability of unsupervised domain adaptive reID.
The first assumption is covariate shift, which means that the criteria for classifying data pairs are the same for the source domain and the target domain. In other words, we have for classification tasks, and similarly we can define the covariate shift for reID tasks on the extracted feature space.
Definition 1 (Covariate Shift).
We say that the source and target distributions satisfy the covariate shift assumption if they have the same labeling function, i.e. if we have .
Another assumption is inspired by “Probabilistic Lipschitzness”, which was originally proposed for semi-supervised learning in Urner2011 and then investigated with application to domain adaptation tasks in BenDavid2014 . This assumption captures the intuition that in a classification task, the data can be divided into label-homogeneous clusters that are separated by low-density regions. However, in reID tasks, the labeling function is a multivariable function, so the original Probabilistic Lipschitzness is not applicable. Note that the intuition in reID tasks is that similar pairs form a cluster. That is, for a given instance, the data similar to it can be grouped into a cluster, and the cluster is separated from the rest of the data space by a low-density gap. Mathematically, we have the following definition.
Definition 2 (Separately Probabilistic Lipschitzness (SPL)).
Let be monotonically increasing. Symmetric function is SPL with respect to a distribution on , if for all ,
(1) 
To ensure the learnability of the domain adaptation task, we still need a critical assumption concerning how much overlap there is between the source and target domain. We again follow the assumption used in BenDavid2014 on the source and target distribution, which is a relaxation of the pointwise density ratio between the two distributions.
Definition 3 (Weight Ratio).
Let be a collection of subsets of the input space measurable with respect to both and . For some we define the weight ratio of the source distribution and the target distribution with respect to as
(2) 
Further, we define the weight ratio of the source distribution and the target distribution with respect to as
(3) 
Following the notation in BenDavid2014 , we also assume that our domain is the unit cube and let denote the set of axis-aligned rectangles in . For our reID tasks, the risk of a classifier on the target domain is
(4) 
Let the Nearest Neighbor classifier be . The following theorem then implies the learnability of domain adaptive reID; its proof is included in the supplemental materials.
Theorem 1.
Let the domain be the unit cube,, and for some , let be a pair of source and target distributions over satisfying the covariate shift assumption, with , and their common deterministic labeling function satisfying the SPL property with respect to the target distribution, for some function . Then, for all , for all , if is a source generated sample set of size at least
then, with probability at least (over the choice of ), is at most .
4 Reinforcing the Assumptions
In the previous section, we showed that with some assumptions on the extracted feature space, unsupervised domain adaptation is learnable. Thus we are concerned with how to train a feature extractor, i.e., an encoder, that satisfies the mentioned assumptions. Briefly speaking, we first derive several loss functions according to the assumptions and then iteratively train the encoder to minimize the loss functions via a self-training framework.
Self-training framework.
Assume that we have an encoder and some samples with guessed labels on the target domain, and the loss function is . In self-training, first a is used to extract features from all available unlabeled samples, and the target is then to minimize the loss by selecting samples, that is . In the next round, with these selected samples, the encoder is updated by solving the minimization problem .
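The alternation described above can be sketched in a few lines of Python. This is only an illustrative skeleton under stated assumptions: the names `self_train`, `select` and `update` are hypothetical and do not come from the paper; `select` stands for the sample selection step with the encoder fixed, and `update` stands for the encoder refinement step with the selected samples fixed.

```python
from typing import Any, Callable, List, Sequence, Tuple


def self_train(
    encoder: Any,
    target_samples: Sequence[Any],
    select: Callable[[Any, Sequence[Any]], List[Tuple[Any, int]]],
    update: Callable[[Any, List[Tuple[Any, int]]], Any],
    rounds: int,
) -> Tuple[Any, List[int]]:
    """Alternate sample selection and encoder refinement (hypothetical helper).

    `select` returns (sample, guessed_label) pairs chosen with the encoder fixed;
    `update` returns a refined encoder trained on those pairs.
    """
    history: List[int] = []
    for _ in range(rounds):
        # Step 1: encoder fixed, choose samples together with guessed labels.
        selected = select(encoder, target_samples)
        # Step 2: selected samples fixed, refine the encoder on them.
        encoder = update(encoder, selected)
        # Track how many samples were used in this round.
        history.append(len(selected))
    return encoder, history
```

Any selection rule and any training routine can be plugged in through the two callables; in the framework of this paper, selection would be the clustering step and the update would be the triplet-loss training.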
It is worthwhile to note that the covariate shift assumption only depends on the property of labeling function, thus in this section we only consider the proposed SPL and weight ratio.
4.1 Reinforcing the SPL
Recall that the original data is and we wish to iteratively find an encoder such that in the feature space the SPL property is satisfied as much as possible. So we first need a definition to evaluate whether one encoder is better than another concerning the SPL property.
Definition 4.
Encoder is said to be more clusterable than with respect to a labeling function and a distribution over , if there exists , and with , such that
The above equation differs from the original SPL (1) for the reason that the original form is too strict to be satisfied. Now we can easily define a loss function
where means a set of samples and is the guessed labeling function. However, directly optimizing the loss function is infeasible since its analytical form is unknown. To overcome this difficulty, we adopt the intra-cluster distance and the inter-cluster distance,
(5)  
(6) 
We show that minimizing and is appropriate for being more clusterable through the following theorem.
Theorem 2.
For two encoders , a distribution and a labeling function , then
For the proof, we refer the reader to the supplemental materials. Here, Definition 4 and Theorem 2 describe how to evaluate an encoder with a fixed distribution and labeling function . Likewise, we can fix the encoder and rewrite the results to evaluate the samples with guessed labels; for conciseness, the details are omitted. When and are fixed during the iteration procedure, minimizing and is straightforward. In contrast, the strategy for picking out samples with guessed labels requires more attention.
Selecting samples via clustering.
In spite of the similarity between and , they do not share the same strategy regarding the sample selection step. For , if all the data in are encoded with a , then for each pair , it is natural to assume that a smaller implies a higher confidence that . Likewise, a larger implies a higher confidence that . But choosing a high-confidence different pair as training data does not really improve the performance, because the accuracy is more sensitive to the minimal distance of different pairs, i.e., . So rather than directly selecting different pairs, we treat the selected samples as a series of clusters, and dissimilar pairs are selected on the basis of different clusters. That is to say, in order to minimize and simultaneously, we perform clustering on the data with guessed labels.
Distance metrics and loss functions.
Up to this point, we are facing an unsupervised clustering problem, whose outcome is largely determined by the distance metric. In other words, designing a sample selection strategy to minimize a loss turns into designing a distance metric between samples, and a better distance should lead to a lower and . It is commonly observed in image retrieval that the contextual similarity measure Jegou2007 is more robust, which is beneficial for a lower .
In practice, we adopt the k-reciprocal encoding in Zhong2017Jan as the distance metric, which is a variation of the Jaccard distance between nearest-neighbor sets. Precisely, with an encoder , all samples from are encoded, and with these features a distance matrix is computed, where and is the total number of target samples. Then is updated by
(7) 
where the index set is the so-called robust set for . is determined by first choosing mutual nearest neighbors for the probe and then incrementally adding elements. Specifically, denote the indices of mutual nearest neighbors of as and then for all , if , let . In particular, for a pair , we have
(8) 
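To make the construction of the robust set and the resulting Jaccard distance concrete, the following Python sketch computes a plain set-based variant on a precomputed distance matrix. It is a sketch under assumptions rather than the exact form of Eqs. (7) and (8): the function names are hypothetical, the neighborhood size `k`, the half-size `k // 2` and the two-thirds overlap rule for expanding the set follow the usual description of k-reciprocal re-ranking and are assumed here, and any weighting of neighbors is omitted.

```python
from typing import List, Sequence, Set


def k_nearest(dist: Sequence[Sequence[float]], i: int, k: int) -> Set[int]:
    """Indices of the k nearest samples to i; i itself always ranks first."""
    order = sorted(range(len(dist)), key=lambda j: (j != i, dist[i][j], j))
    return set(order[:k])


def k_reciprocal(dist: Sequence[Sequence[float]], i: int, k: int) -> Set[int]:
    """Mutual neighbors: j is near i and i is near j."""
    return {j for j in k_nearest(dist, i, k) if i in k_nearest(dist, j, k)}


def robust_set(dist: Sequence[Sequence[float]], i: int, k: int) -> Set[int]:
    """Expand the mutual-neighbor set of i with sets of its members.

    A member's smaller set is merged when at least two thirds of it already
    overlaps the base set (assumed expansion rule).
    """
    base = k_reciprocal(dist, i, k)
    expanded = set(base)
    half_k = max(1, k // 2)
    for j in base:
        candidate = k_reciprocal(dist, j, half_k)
        if 3 * len(base & candidate) >= 2 * len(candidate):
            expanded |= candidate
    return expanded


def jaccard_distance_matrix(
    dist: Sequence[Sequence[float]], k: int
) -> List[List[float]]:
    """Jaccard distance between the robust sets of every pair of samples."""
    n = len(dist)
    sets = [robust_set(dist, i, k) for i in range(n)]
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            union = sets[i] | sets[j]  # never empty: i belongs to its own set
            out[i][j] = 1.0 - len(sets[i] & sets[j]) / len(union)
    return out
```

Two samples that share all of their mutual neighbors get distance 0, and two samples with disjoint neighborhoods get distance 1, which is what makes this distance convenient for the clustering step. The quadratic loops are kept for readability and would need vectorizing for datasets of realistic size.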
4.2 Reinforcing the weight ratio
As mentioned before, the weight ratio is a crucial part of supporting the learnability of domain adaptation. Rather than directly defining a loss based on the original weight ratio definition, we proceed as in the SPL case and minimize the loss
(9) 
where is the source domain. The intuition here is to enhance the degree of similarity, meaning that each target feature is close to some source features. We denote as the weight ratio when using as the encoder, where is defined in Section 3. The following theorem demonstrates that our makes sense; the proof is in the supplemental materials.
Theorem 3.
However, unlike and , it is hard to optimize on for because of the infimum. On the other hand, selecting samples is easily done by giving more confidence to samples with smaller . More specifically, for each from , we search for its nearest neighbor in . The function measuring the confidence for each is denoted by
(10) 
where means the nearest neighbor of in source domain , and a smaller means a higher confidence. To transform and onto the same scale, we perform a simple normalization on , i.e., divided by . Combining with , the final distance matrix is and
(11) 
where is a balancing parameter.
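The steps in this subsection (nearest source neighbor, normalization by the maximum, and combination with the Jaccard distance through a balancing parameter) can be sketched as follows. The exact form of Eq. (11) is not reproduced here, so the combination rule in `combine` (a convex mix of the pairwise Jaccard distance with the sum of the two per-sample confidence terms) is an assumption made for illustration, as are all function names and the use of the Euclidean distance.

```python
import math
from typing import List, Sequence


def nearest_source_distance(
    target_feats: Sequence[Sequence[float]],
    source_feats: Sequence[Sequence[float]],
) -> List[float]:
    """Distance from each target feature to its nearest source feature."""
    return [min(math.dist(t, s) for s in source_feats) for t in target_feats]


def normalize_by_max(values: Sequence[float]) -> List[float]:
    """Divide by the largest value so that the result lies in [0, 1]."""
    peak = max(values)
    return [v / peak if peak > 0 else 0.0 for v in values]


def combine(
    d_jaccard: Sequence[Sequence[float]],
    d_weight: Sequence[float],
    lam: float,
) -> List[List[float]]:
    """Assumed combination: (1 - lam) * pairwise term + lam * per-sample terms."""
    n = len(d_weight)
    return [
        [
            (1.0 - lam) * d_jaccard[i][j] + lam * (d_weight[i] + d_weight[j])
            for j in range(n)
        ]
        for i in range(n)
    ]
```

With this assumed rule, a pair is considered close only if its Jaccard distance is small and both samples lie near some source feature; a larger balancing parameter puts more emphasis on the second condition.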
4.3 Overall algorithm
So far, the general outline of reinforcing the assumptions has been elaborated, except for the details of the clustering method. In our framework, a good clustering method should possess the following properties: (a) it does not require the number of clusters as an input, because a cluster corresponds to an identity and the number of identities is unknown; (b) it is able to avoid pairs of low confidence, that is, it allows some points not to belong to any cluster; (c) it is scalable enough to incorporate our theoretically derived distance metric. We employ the clustering method named DBSCAN Ester , which has stood the test of time and has exactly the properties listed above.
Now we provide some other practical details of our domain adaptive reID algorithm. At the beginning, an encoder is well trained on and all the pairs are computed with Eqn. (11). Next, we describe how we set the threshold controlling whether a pair should be used for training. Intuitively, the threshold should not be tied to a particular task, since the scale of varies across tasks. So in our method, we first sort all the distances from lowest to highest, and the average value of the top pairs is set to be the threshold , where is the total number of possible pairs and is a percentage. On these data with pseudo-labels, the encoder is then trained with triplet loss Weinberger2009 . Our whole framework is summarized in Algorithm 1.
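The threshold rule and the clustering step can be illustrated with a small, dependency-free Python sketch. It is a minimal version under assumptions: `pair_threshold` averages the smallest fraction of pairwise distances as described above, and `dbscan` is a textbook DBSCAN on a precomputed distance matrix in which the threshold plays the role of the neighborhood radius; the `min_samples` parameter and both function names are assumptions, not values or identifiers taken from the paper.

```python
from collections import deque
from typing import List, Optional, Sequence


def pair_threshold(dist: Sequence[Sequence[float]], fraction: float) -> float:
    """Mean of the smallest `fraction` of all pairwise distances."""
    n = len(dist)
    pairs = sorted(dist[i][j] for i in range(n) for j in range(i + 1, n))
    top = max(1, int(fraction * len(pairs)))
    return sum(pairs[:top]) / top


def dbscan(
    dist: Sequence[Sequence[float]], eps: float, min_samples: int
) -> List[int]:
    """DBSCAN on a precomputed distance matrix; label -1 marks noise."""
    n = len(dist)
    labels: List[Optional[int]] = [None] * n  # None means not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = [m for m in range(n) if dist[j][m] <= eps]
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # core point: keep growing the cluster
    return [lab if lab is not None else -1 for lab in labels]
```

Points labeled -1 are left out of training, which corresponds to property (b) above, and the number of clusters is never supplied, which corresponds to property (a).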
5 Experiments
In this section, we test our unsupervised domain adaptation algorithm on person reID and vehicle reID. The performance is evaluated by the cumulative matching characteristic (CMC) and mean Average Precision (mAP), which are multi-gallery-shot evaluation metrics defined in Zheng2015Dec .
Parameter settings and implementation details.
In all the following reID experiments, we empirically set , and . The encoder is ResNet-50 he2016deep pretrained on ImageNet. Both triplet and softmax losses are used for initializing the network on the source domain, while only triplet loss is used for refining the encoder on the target domain. More details about the network, training parameters and visual examples from different domains are included in the supplemental materials. Moreover, in the supplemental materials we also investigate other distance metrics and clustering methods.
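As a reminder of what refining the encoder with triplet loss on pseudo-labeled target data involves, the sketch below shows the standard hinge form of the loss and one simple way of forming triplets from cluster labels. It is illustrative only: the margin value, the exhaustive triplet enumeration and the function names are assumptions, and the actual training settings are those given in the supplemental materials.

```python
import math
from typing import List, Sequence, Tuple


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.5,  # assumed value, for illustration only
) -> float:
    """Hinge loss asking the positive to be closer than the negative by a margin."""
    gap = math.dist(anchor, positive) - math.dist(anchor, negative)
    return max(0.0, gap + margin)


def triplets_from_pseudo_labels(labels: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Enumerate (anchor, positive, negative) index triples from cluster labels.

    Samples labeled -1 (not assigned to any cluster) are skipped.
    """
    kept = [i for i, lab in enumerate(labels) if lab != -1]
    return [
        (a, p, n)
        for a in kept
        for p in kept
        for n in kept
        if a != p and labels[a] == labels[p] and labels[a] != labels[n]
    ]
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only violating triplets contribute to the encoder update.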
5.1 Person reID
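Table 1: Number of identities and images in each dataset.

| Datasets | Training #IDs | Training #Images | Testing #IDs | Testing #Images |
|---|---|---|---|---|
| Market Zheng2015Dec | 751 | 12,936 | 750 | 19,732 |
| Duke Ristani2016 | 702 | 16,522 | 702 | 19,889 |
| VeRi Liu7553002 | 576 | 37,778 | 200 | 13,257 |
| PKU liu2016deep | 2,290 | 24,157 | - | - |
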
Market-1501 Zheng2015Dec and DukeMTMC-reID Ristani2016 are two large-scale datasets that are frequently used for unsupervised domain adaptation experiments. Both datasets are split into a training set and a testing set. The details, including the number of identities and images, are shown in Table 1.
Comparison methods are selected in three aspects. Firstly, we show the performance of direct transfer, that is, directly using the initial source-trained encoder on the target domain. The plain self-training scheme, in which sample selection depends only on the Euclidean distance, is also compared as a baseline. Secondly, our method is compared with three recent state-of-the-art methods: SPGAN Deng2017Nov , TJ-AIDL Wang2018Mar and ARN Li2018Apr (footnote 1: our results also outperform PTGAN Wei_2018_CVPR by a large margin, but the comparison with PTGAN is not shown here since we adopt a different backbone network). We report the original results quoted from their papers. Thirdly, we show the results of our method with and without , which can be viewed as ablation studies. The results are shown in Table 2, from which we can observe the following facts: (a) The accuracy of the self-training baseline is high and even better than that of two recent methods, indicating that our clustering-based self-training scheme is fairly good. (b) The version without is better than the self-training baseline, which shows the effectiveness of , and after incorporating the final method achieves the highest accuracy, reflecting the advantage of . Thus both of our assumptions are useful according to the ablation studies. (c) Although the proposed is beneficial, the increase in accuracy it brings varies across tasks. We think this is related to the distributions of the source and target domains. Please refer to B.3 for more discussion of this problem.
Furthermore, we draw the mAP curves (Figure 1) during the iterations of the adaptation task Duke → Market, in which the self-training baseline, using distance without and are compared. We can see that, except for the baseline, all the curves have a similar tendency toward convergence. A subtle distinction is that after 18 iterations, methods with smaller become unstable, while methods with larger move toward convergence.
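Table 2: Results on person reID. D→M denotes DukeMTMC-reID → Market-1501 and M→D denotes Market-1501 → DukeMTMC-reID.

| Methods | D→M rank-1 | D→M rank-5 | D→M rank-10 | D→M mAP | M→D rank-1 | M→D rank-5 | M→D rank-10 | M→D mAP |
|---|---|---|---|---|---|---|---|---|
| Direct Transfer | 46.8 | 64.6 | 71.5 | 19.1 | 27.3 | 41.2 | 47.1 | 11.9 |
| Self-training Baseline | 66.7 | 80.0 | 85.0 | 39.6 | 40.8 | 53.9 | 60.5 | 24.7 |
| SPGAN Deng2017Nov | 57.7 | 75.8 | 82.4 | 26.7 | 46.4 | 62.3 | 68.0 | 26.2 |
| TJ-AIDL Wang2018Mar | 58.2 | 74.8 | 81.1 | 26.5 | 44.3 | 59.6 | 65.0 | 23.0 |
| ARN Li2018Apr | 70.3 | 80.4 | 86.3 | 39.4 | 60.2 | 73.9 | 79.5 | 33.4 |
| Ours w/o | 75.1 | 88.7 | 92.4 | 52.5 | 68.1 | 80.1 | 83.2 | 49.0 |
| Ours | 75.8 | 89.5 | 93.2 | 53.7 | 68.4 | 80.1 | 83.5 | 49.0 |
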
5.2 Vehicle reID
We use VeRi-776 Liu7553002 and part of PKU-VehicleID liu2016deep for vehicle reID experiments (footnote 2: in PKU-VehicleID, the camera information is not provided but is needed when computing the CMC and mAP, so we only test the setting with PKU-VehicleID as the source dataset and VeRi-776 as the target dataset); the details are included in Table 1. Unlike person reID, there are currently no unsupervised domain adaptation algorithms designed for vehicle reID. Thus, we use the existing solutions for person reID as comparisons (footnote 3: we only test SPGAN, because (1) the source code of ARN is not available and (2) TJ-AIDL requires attribute labels as an input, which are not available in vehicle reID datasets; for SPGAN, the experiments are carried out with the default parameters for person reID). As shown in Table 1, not only are the conclusions from person reID verified again, but the generalization ability of our method is also shown. We find that the compared SPGAN generates quite presentable images, which we include in the supplemental materials, but its accuracy is still lower than that of the self-training baseline, not to mention our proposed method.
6 Conclusion and Future Work
In this work, we bridge the gap between theories of unsupervised domain adaptation and reID tasks. Inspired by previous work BenDavid2014 , we make assumptions on the extracted feature space and then show the learnability of unsupervised domain adaptive reID tasks. Treating the assumptions as the goal of our encoder, several loss functions are proposed and then minimized via a self-training framework.
Though the proposed solution is effective and outperforms state-of-the-art methods, there are still unsolved problems in our algorithm. Firstly, with regard to the weight ratio assumption, we propose the loss function , which is ignored when updating the encoder because of the intractable infimum. Designing another feasible loss function is therefore an interesting direction of research. Another promising issue is to improve the data selection step in the self-training scheme. We turn the data selection step into a clustering problem, which can be thought of as a version with a hard threshold. This suggests that there may be a better strategy that utilizes the relative values between distances. We hope that our analyses can open the door to new developments in domain adaptive reID tasks and lift the burden of designing large and complicated networks.
References
 [1] Michiel Bacchiani and Brian Roark. Unsupervised language model adaptation. In Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03). 2003 IEEE International Conference on, volume 1, pages I–I. IEEE, 2003.
 [2] Shai BenDavid, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, May 2010.

[3]
Shai BenDavid and Ruth Urner.
Domain adaptation–can quantity compensate for quality?
Annals of Mathematics and Artificial Intelligence
, 70(3):185–202, Mar 2014.  [4] Bo Chen, Wai Lam, Ivor Tsang, and TakLam Wong. Extracting discriminative concepts for domain adaptation in text mining. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 179–188. ACM, 2009.
 [5] Benjamin Coifman. Vehicle reidentification and travel time measurement in realtime on freeways using existing loop detector infrastructure. Transportation Research Record: Journal of the Transportation Research Board, (1643):181–191, 1998.

[6]
Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, Yi Yang, and Jianbin
Jiao.
Imageimage domain adaptation with preserved selfsimilarity and
domaindissimilarity for person reidentification.
In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, June 2018.  [7] Martin Ester, HansPeter Kriegel, Jörg Sander, and Xiaowei Xu. A densitybased algorithm for discovering clusters a densitybased algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pages 226–231. AAAI Press, 1996.
 [8] Hehe Fan, Liang Zheng, and Yi Yang. Unsupervised Person Reidentification: Clustering and Finetuning. arXiv, May 2017.
 [9] Brendan J Frey and Delbert Dueck. Clustering by passing messages between data points. science, 315(5814):972–976, 2007.

[10]
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo
Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky.
Domainadversarial training of neural networks.
The Journal of Machine Learning Research, 17(1):2096–2030, 2016.  [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 [12] H. Jegou, H. Harzallah, and C. Schmid. A contextual dissimilarity measure for accurate and efficient image search. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, June 2007.
 [13] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv, Dec 2014.

[14]
YuJhe Li, FuEn Yang, YenCheng Liu, YuYing Yeh, Xiaofei Du, and YuChiang
Frank Wang.
Adaptation and reidentification network: An unsupervised deep transfer learning approach to person reidentification.
In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.  [15] Hongye Liu, Yonghong Tian, Yaowei Wang, Lu Pang, and Tiejun Huang. Deep relative distance learning: Tell the difference between similar vehicles. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2167–2175, 2016.
 [16] X. Liu, W. Liu, H. Ma, and H. Fu. Largescale vehicle reidentification in urban surveillance videos. In 2016 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, July 2016.
 [17] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain Adaptation: Learning Bounds and Algorithms. arXiv, Feb 2009.
 [18] Anna Margolis. A literature review of domain adaptation with unlabeled data. Technical report, pages 1–42, 2011.
 [19] David McClosky, Eugene Charniak, and Mark Johnson. Effective selftraining for parsing. In Proceedings of the main conference on human language technology conference of the North American Chapter of the Association of Computational Linguistics, pages 152–159. Association for Computational Linguistics, 2006.
 [20] David McClosky, Eugene Charniak, and Mark Johnson. Reranking and selftraining for parser adaptation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 337–344. Association for Computational Linguistics, 2006.
 [21] Kamal Nigam, Andrew McCallum, and Tom Mitchell. Semisupervised text classification using em. SemiSupervised Learning, pages 33–56, 2006.
 [22] Sinno Jialin Pan, James T Kwok, and Qiang Yang. Transfer learning via dimensionality reduction. In AAAI, volume 8, pages 677–682, 2008.
 [23] Peixi Peng, Tao Xiang, Yaowei Wang, Massimiliano Pontil, Shaogang Gong, Tiejun Huang, and Yonghong Tian. Unsupervised crossdataset transfer learning for person reidentification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1306–1315, 2016.
 [24] Alvin Plantinga. Things and persons. The Review of Metaphysics, pages 493–519, 1961.
 [25] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multitarget, multicamera tracking. In European Conference on Computer Vision workshop on Benchmarking MultiTarget Tracking, 2016.
 [26] Sandeepkumar Satpal and Sunita Sarawagi. Domain adaptation of conditional probability models via feature subsetting. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 224–235. Springer, 2007.

 [27] Songbo Tan, Xueqi Cheng, Yuefen Wang, and Hongbo Xu. Adapting naive bayes to domain adaptation for sentiment analysis. In European Conference on Information Retrieval, pages 337–349. Springer, 2009.
 [28] Ruth Urner, Shai BenDavid, and Shai ShalevShwartz. Access to unlabeled data can speed up prediction time. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pages 641–648, USA, 2011. Omnipress.
 [29] Nam Vo and James Hays. Generalization in Metric Learning: Should the Embedding Layer be the Embedding Layer? arXiv, Mar 2018.

 [30] Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. Transferable joint attributeidentity deep learning for unsupervised person reidentification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [31] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person reidentification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [32] Kilian Q. Weinberger and Lawrence K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
 [33] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable Person Reidentification: A Benchmark. 2015 IEEE International Conference on Computer Vision (ICCV), pages 1116–1124, Dec 2015.
 [34] Liang Zheng, Yi Yang, and Alexander G. Hauptmann. Person Reidentification: Past, Present and Future. arXiv, Oct 2016.
 [35] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Reranking person reidentification with kreciprocal encoding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3652–3661. IEEE, 2017.
Supplementary Material
Appendix A Theorems and Proofs
To prove Theorem 1, we first give a lemma on the upper error bound of . Let denote the set of axis-aligned rectangles in and, given some , let denote the class of axis-aligned rectangles with side length . For a sample set from the source domain, we have
Lemma 1.
Let the domain be the unit cube, , and for some and some , let be source and target distributions over satisfying the covariate shift assumption, with , and their common reid labeling function satisfying the SPL property with respect to the target distribution, for some function . Then, for all , and all ,
(12) 
Proof.
A test pair gets the wrong label under two conditions: (a) at least one of the two test samples has no close neighbor among the training data; (b) the test pair has a close neighboring pair with the opposite label. For (a), we can use the results from Lemma 7 and Theorem 8 in BenDavid2014 . Specifically, let be a cover of the set using boxes of side length . We have
(13) 
If is in the box , then the probability of (a) can be expressed as . Observing that
and , we see that (a) is bounded by . For (b), we denote the nearest neighbor to in by ; then (b) means that in the box we have
(14) 
Seeing that
So
Combining the two bounds together, we conclude our proof. ∎
Under a stronger weight ratio assumption, i.e., , we obtain the following domain adaptation learnability result.
Theorem 4.
Let the domain be the unit cube,, and for some , let be a pair of source and target distributions over satisfying the covariate shift assumption, with , and their common deterministic labeling function satisfying the SPL property with respect to the target distribution, for some function . Then, for all , for all , if is a source generated sample set of size at least
then, with probability at least (over the choice of ), the target error of the Nearest Neighbor classifier is at most .
Proof.
From the proof in Theorem 1, the error was bounded under two circumstances. As for (a), we apply Markov’s inequality and get
(15) 
Then for (b), we just set , so . Finally, setting the probability to be smaller than yields that if
then with probability at least , the target error of the Nearest Neighbor classifier is at most . ∎
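For reference, the Markov step invoked in this proof has the generic form below. This is only a reminder of the standard inequality, not a reconstruction of equation (15): the symbols Z (a nonnegative error quantity), B (an upper bound on its expectation), \epsilon and \delta are placeholder notation assumed here.

```latex
% Markov's inequality for a nonnegative random variable Z with E[Z] <= B:
\Pr\left[ Z \ge \epsilon \right] \;\le\; \frac{\mathbb{E}[Z]}{\epsilon} \;\le\; \frac{B}{\epsilon},
\qquad \text{so} \qquad
B \le \epsilon\,\delta \;\Longrightarrow\; \Pr\left[ Z \ge \epsilon \right] \le \delta .
```

A bound on the expected error therefore becomes a high-probability statement once the sample size is large enough to push the expectation bound below the product of the two tolerances.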
Theorem 5.
For two encoders , a distribution and a labeling function , then
Proof.
There exists , and with , such that
When , and ,
So And let , and , then
So
We have
Denote as the mean value of , then
In like manner, denote as the mean value of , then
∎
Theorem 6.
For two encoders , a distribution , if is a random variable and its support is a subset of , then
Proof.
∎
Appendix B Additional Experimental Details and Results
We present the structure of the paper in Figure 2. The most important contributions of our work are Theorems 2 and 3, both of which aim to turn abstract theoretical assumptions into practical loss functions. Although Theorem 1 seems like a straightforward extension of previous work BenDavid2014 , it plays a fundamental role in the paper. The DA-learnability shown in Theorem 1 implies that the three assumptions imposed on the distributions of the two domains in Section 3 are sufficient for solving the domain adaptive reID problem. In other words, Theorem 1 shows that reinforcing the three assumptions, as done in Section 4, is sufficient.
B.1 Visualization of datasets and results
To understand the variations between different domains more clearly, Figure 3 presents some samples from the datasets used in our experiments. These datasets all have their own special characteristics. For instance, people riding bicycles are common in Market1501 but rare in DukeMTMCreID. More importantly, the images in these reID datasets are heavily related to the cameras, which means that the images contain information closely tied to the camera, such as background, viewpoint or lighting conditions.
Moreover, we present some samples generated by SPGAN for vehicle reID. As shown in Figure 4, its image-to-image translation indeed works, but it fails to produce reID results as satisfactory as those for person reID. This indicates that either the proposed generative method is not suitable for unsupervised domain adaptive vehicle reID, or its parameters need to be carefully tuned for a new task.
B.2 Encoder network
The encoder network is a ResNet50 he2016deep pretrained on ImageNet; the whole network is presented in Figure 5.
Person reID.
The size of input images is , so the output of conv5 is , and an average pooling layer is added after conv5 to produce an output of size . We denote the output of this layer as feat1. During training on the source domain, feat1 is connected to a fully connected (fc) layer with 2048 outputs, denoted fc0, which is in turn connected to an fc layer with 751 (Market1501) or 702 (DukeMTMCreID) outputs. Let the output of this final fc layer be fc1. The loss functions are Softmax(fc1) and Triplet(feat1), which are added directly (without an extra balancing parameter). The model is trained with the Adam optimizer Kingma2014Dec . Training parameters are set as follows: batch size 128 (PK sampling with P=16, K=8); maximum number of epochs 70; learning rate 3e-4.
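The PK sampling mentioned above (P identities per batch, K images per identity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' sampler: the function name pk_batches, the fixed seed, sampling with replacement for identities that have fewer than K images, and dropping a trailing group of fewer than P identities are all choices made here for the example.

```python
import random
from collections import defaultdict


def pk_batches(labels, P, K, seed=0):
    """Yield batches of P*K sample indices: P identities, K images each."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_id[pid].append(idx)
    ids = list(by_id)
    rng.shuffle(ids)
    # Walk through the shuffled identities P at a time.
    # Assumption: a trailing group with fewer than P identities is dropped.
    for start in range(0, len(ids) - P + 1, P):
        batch = []
        for pid in ids[start:start + P]:
            pool = by_id[pid]
            if len(pool) >= K:
                batch.extend(rng.sample(pool, K))
            else:
                # Assumption: too few images, so sample with replacement.
                batch.extend(rng.choices(pool, k=K))
        yield batch


# Toy data: 40 identities with 10 images each (hypothetical sizes).
labels = [pid for pid in range(40) for _ in range(10)]
batches = list(pk_batches(labels, P=16, K=8))
```

With P=16 and K=8 every batch holds 128 indices, matching the batch size quoted for source-domain training; P=32, K=4 gives the same batch size with more identities per batch.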
When training with data from the target domain, there is no fc1 layer and we use two triplet losses, namely Triplet(feat1) and Triplet(fc0). The trick of using two triplet losses comes from Vo2018Mar . The model is trained by stochastic gradient descent, and in each iteration step we perform data augmentation (random flip and random erasing) on the data. Training parameters are set as follows: batch size 128 (PK sampling with P=32, K=4); momentum 0.9; maximum number of epochs 70; learning rate 6e-5. The networks are trained with two TITAN X GPUs.
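To make the two-level triplet objective concrete, here is a small pure-Python sketch that evaluates a hinge triplet loss on one (anchor, positive, negative) triple at each of the two feature levels and sums them. The margin value 0.3, the unweighted sum, and the helper names are assumptions for illustration; the paper does not state the margin or the triplet mining strategy in this section.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge triplet loss: the positive should be closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def two_level_triplet_loss(feat1_triplet, fc0_triplet, margin=0.3):
    """Sum of Triplet(feat1) and Triplet(fc0) for one triple.

    Assumption: the two terms are added without a balancing weight.
    """
    return triplet_loss(*feat1_triplet, margin) + triplet_loss(*fc0_triplet, margin)


# Toy 2-D embeddings (hypothetical values): the feat1 triple is already
# well separated, while the fc0 triple violates the margin.
feat1_triplet = ([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
fc0_triplet = ([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
total = two_level_triplet_loss(feat1_triplet, fc0_triplet)
```

In this toy case only the fc0 term contributes to the loss, which shows how supervising both levels can penalize a violation that one embedding alone would miss.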
Vehicle reID.
All parameters, including the network architecture, are the same as for person reID, except the size of the input data. The input data here are resized to and the output of conv5 is .
B.3 More results
Effectiveness of .
From Table 2, Table 1 and Figure 1, we observe that, from a practical point of view, using is actually not appealing. We think the reasons are twofold. Firstly, the effectiveness of depends on the distributions of the source and target domains. In Figure 6, we design a simple example in a 2D feature space to show the validity of . In the left figure, the grey points denote the features extracted from the source data and the colored points denote the features of the target data with their real labels. In the middle figure, we show the pseudo labels generated with DBSCAN when setting , i.e., not using . In the right figure, the results with are shown. Comparing the middle and right figures, we can see that is important in such a situation. The key idea in this demo is that the “easy” target points happen to be near the source data. Here, “easy” target points are those for which points belonging to the same ID are “close” in the feature space extracted by the present encoder. This example also applies to classification tasks, since the Weight Ratio is a shared assumption between our work and BenDavid2014 . Secondly, is derived from , but the potential value of is not fully exploited in our algorithm. Thus, using in practical applications is not appealing. However, when not using , the results are stable and already outperform existing methods by a large margin, which shows the power of the selftraining scheme in domain adaptive reID problems. For real applications, if computational resources are limited, we recommend simply setting and not making the effort to search for an optimal .
Comparison of distance metrics.
As for other contextual distance metrics, we test the performance of the original Jaccard distance with or without (also set ). For the Jaccard distance, we first compute the nearest neighbor set of each sample and then compute the distance between the sets. We find that taking into consideration is also beneficial for the Jaccard distance. However, both variants are worse than the selftraining baseline, i.e., Euclidean distance. The reason is that the Jaccard distance only considers the nearest neighbor sets, so pairs without overlapping nearest neighbors have a Jaccard distance of 1, which is too strict to generate enough training pairs. This shortcoming also leads to a slow or even halted increase in accuracy; for more details see the convergence comparison paragraph and Figure 1. As shown in Table 3, the reciprocal encoding employed in our method improves the performance of the plain Jaccard distance.
Methods (DukeMTMCreID → Market1501)  rank1  rank5  rank10  mAP

Jaccard distance  63.8  80.0  85.3  37.1
Jaccard distance with  65.7  81.2  86.5  38.1
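The saturation of the plain Jaccard distance described above can be reproduced on toy data. The sketch below is an illustration only: it uses plain k-nearest-neighbor sets (not the reciprocal encoding of the full method), and the 1-D points and k=3 are hypothetical values chosen for the example.

```python
def k_nearest(dist, i, k):
    """Indices of the k nearest samples to sample i (i itself included)."""
    order = sorted(range(len(dist)), key=lambda j: dist[i][j])
    return set(order[:k])


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two index sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two well-separated groups of 1-D points (hypothetical data).
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
dist = [[abs(p - q) for q in points] for p in points]

neighbors = [k_nearest(dist, i, k=3) for i in range(len(points))]
d_same = jaccard_distance(neighbors[0], neighbors[1])   # same group
d_cross = jaccard_distance(neighbors[0], neighbors[3])  # different groups
```

Any two samples whose neighbor sets do not overlap receive a distance of exactly 1, no matter how far apart they really are, so the metric cannot rank such pairs against each other.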
Comparison of clustering methods.
Because of the restrictions on a suitable clustering method, we only test one alternative, affinity propagation frey2007clustering . For the task DukeMTMCreID → Market1501, we investigate the effectiveness of affinity propagation with other distance metrics. Affinity propagation is not a proper clustering method here because all data are used for clustering, which means it cannot avoid pairs of low confidence. As shown in Table 4, an interesting fact is that, with affinity propagation, simply using Euclidean distance is better than our proposed distance. The reason is that the number of IDs (clusters) generated by affinity propagation is much larger when using our proposed distance. In Figure 7, we show the number of IDs at each iteration step. Our distance leads to a larger number of clusters because it enlarges the gap between dissimilar pairs, which ought to help discard unhelpful stray samples. However, affinity propagation assigns every sample to some cluster, and therefore our distance performs worse than Euclidean distance.
Distance Metrics  rank1  rank5  rank10  mAP 

Euclidean  63.5  76.6  80.7  36.9 
Ours w/o  62.4  74.6  78.9  35.2 
Ours  62.4  74.5  78.8  35.6 
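The contrast with a density-based method can be illustrated with a minimal DBSCAN written from scratch: samples in low-density regions are labeled as noise (-1) and can be excluded from training, whereas a method that assigns every sample to a cluster has no such option. This is a pure-Python sketch, not the authors' implementation; the eps and min_samples values and the toy points are hypothetical.

```python
def dbscan(dist, eps, min_samples):
    """Label samples from a precomputed distance matrix; -1 marks noise.

    min_samples counts the point itself, as in common implementations.
    """
    n = len(dist)
    labels = [None] * n              # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_samples:
            labels[i] = -1           # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Two dense groups plus one stray sample (hypothetical 1-D features).
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0]
dist = [[abs(a - b) for b in points] for a in points]
labels = dbscan(dist, eps=0.5, min_samples=2)
kept = [i for i, lab in enumerate(labels) if lab != -1]
```

The stray sample at 20.0 receives the label -1 and is dropped from `kept`, which is the sample-selection behavior that a clustering method covering all samples cannot provide.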
Parameter analysis.
Among all the parameters in our algorithm, the most influential are the percentage and the balancing parameter . Since the influence of has been reported, here we perform experiments with a series of different from DukeMTMCreID to Market1501; the results are shown in Table 5. As we can see from the table, even a small change () of has a discernible impact on the final accuracy. This is because we use large-scale datasets, so the number of all possible pairs from the target dataset is large. Taking Market1501 as an example, the number of training images is 12,936, so the number of all data pairs is over . Thus a small change of can cause a large change of the threshold.
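The scale of the pair count can be checked with a few lines of arithmetic. The sketch below assumes unordered pairs of distinct images; the one-per-mille shift is a hypothetical value used only to show how many pairs a small change in a percentage-style parameter corresponds to, and is not a setting from the paper.

```python
import math

n_images = 12936                       # Market1501 training images, as stated in the text
n_pairs = math.comb(n_images, 2)       # unordered pairs of distinct images (assumption)

# Hypothetical: how many pairs does a shift of 0.1% of all pairs correspond to?
pairs_per_mille = n_pairs // 1000

print(f"{n_pairs:,} pairs in total")
print(f"{pairs_per_mille:,} pairs per 0.1% shift")
```

Under these assumptions there are about 8.4e7 pairs, so even a 0.1% shift moves tens of thousands of pairs across the selection threshold.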

Convergence comparison.
In Figure 8, we use DukeMTMCreID as the source domain and Market1501 as the target domain. We first show the convergence results with different distance metrics and clustering methods in (a) and (b). Several conclusions can be drawn from the curves. First, the Jaccard distance based version becomes more stable after adding . Second, the accuracy of the Jaccard distance based version almost stops increasing after 14 iterations, which is caused by the special property of the Jaccard distance mentioned before. Third, the affinity propagation version converges very fast, and after about 8 iterations the accuracy stops increasing; this is caused by the inaccurate number of clusters and by the fact that all the samples are used to train the network, so the loss functions fail to be minimized through the sample selection step. Moreover, we show the results with different in (c) and (d). All the curves have a similar convergence tendency, which demonstrates that our iteration process is robust with respect to the crucial parameter .