1 Introduction
Having large-scale labeled datasets is one of the reasons for the recent success of deep convolutional neural networks (CNNs) [14]. Nevertheless, collecting and annotating numerous samples in various domains is an extremely expensive and time-consuming process. Meanwhile, traditional CNNs trained on one large dataset show low generalization ability on another due to data bias or shift [38].
Unsupervised domain adaptation (UDA) methods tackle this problem by transferring knowledge from a label-rich source domain to a fully unlabeled target domain [28, 27]. Deep UDA methods, which usually seek to jointly achieve a small source generalization error and a small cross-domain distribution discrepancy, have achieved remarkable performance [40, 22, 9, 10, 2, 39, 25, 33, 30, 16].
Most prior efforts focus on matching global source and target data distributions to learn domain-invariant representations. However, the learned representations may not only bring the source and target domains closer, but also mix samples with different class labels together. Recent studies [23, 34, 13, 32, 44, 29, 21, 35, 41] started to consider learning discriminative representations for the target domain. Specifically, some of them [34, 32, 44] proposed to use pseudo-labels to learn target discriminative representations, which encourages a low-density separation between classes in the target domain [20]. Despite their efficacy, these approaches face two critical limitations. Firstly, they require the strong prior assumption that the correctly pseudo-labeled samples can reduce the bias caused by falsely pseudo-labeled samples. This assumption is hard to satisfy, especially when the domain discrepancy is large: the learned classifiers might be incapable of confidently distinguishing target samples, or of pseudo-labeling them with the required accuracy. Secondly, they back-propagate the category loss for target samples based on pseudo-labeled samples, which makes the target performance vulnerable to error accumulation.
During our exploration, we empirically observed distinct data patterns in the target domain. The motivation is demonstrated in Fig. 1. Intra-class distribution variance exists in the target domain. Some target samples, which we call easy samples, are very likely to be classified correctly since they are sufficiently close to the source domain, and we can directly assign pseudo-labels to them without any adaptation. Some target samples, which we call hard samples, lie far away from the source domain and are ambiguous with respect to the classification boundaries. Moreover, some easy samples, which we call false-easy samples, lie in the support of non-corresponding source classes and are prone to be falsely pseudo-labeled with high confidence. These falsely labeled samples introduce wrong information into the category alignment and potentially result in error accumulation. Thus it is a prerequisite to alleviate their negative influence in the context of UDA.
In this paper, we propose a Progressive Feature Alignment Network (PFAN), which largely extends the ability of prior discriminative representation-based approaches by explicitly enforcing the category alignment in a progressive manner. Firstly, an Easy-to-Hard Transfer Strategy (EHTS) progressively selects reliable pseudo-labeled target samples with cross-domain similarity measurements. However, the selected samples may include some target samples that are misclassified with high confidence. Then, to suppress the negative influence of falsely labeled samples, we propose an Adaptive Prototype Alignment (APA) to align the source and target prototypes for each category. Rather than back-propagating the category loss for target samples based on pseudo-labeled samples, our work statistically aligns the cross-domain class distributions based on the source samples and the selected pseudo-labeled target samples.
The EHTS and APA are updated iteratively and alternately: EHTS boosts the robustness of APA by providing reliable pseudo-labeled samples, and the cross-domain category alignment learned by APA can effectively alleviate the falsely labeled samples introduced by the EHTS. Moreover, upon observing that a good adaptation model usually requires a non-saturated source classifier, we consider a simple yet efficient way to retard the convergence speed of the source classification loss by introducing a temperature variate into the softmax function. The experimental results reveal that the proposed PFAN exceeds the state-of-the-art performance on three UDA datasets.
2 Related Work
We summarize the work most relevant to our proposed approach. We focus primarily on deep UDA methods due to their empirical superiority in this problem.
Inspired by the recent success of generative adversarial networks (GAN) [11], deep adversarial domain adaptation has received increasing attention for learning domain-invariant representations that reduce the domain discrepancy, and has provided remarkable results [9, 39, 29, 43, 44, 17]. These methods try to find a feature space such that confusion between the source and the target distributions in that space is maximal. For example, [9] proposed a gradient reversal layer to train a feature extractor that produces features which maximize the domain binary classifier loss, while simultaneously minimizing the label predictor loss.
Many approaches utilize a distance metric to measure the domain discrepancy between the source and target domains, such as maximum mean discrepancy (MMD), KL-divergence or Wasserstein distance [12, 22, 37, 24, 42, 6]. Most prior efforts intend to achieve domain alignment by matching the source and target marginal distributions. However, an exact domain-level alignment does not imply a fine-grained class-to-class overlap. Thus, it is important to pursue category-level alignment in the absence of true target labels.
[3, 5, 23, 34, 32, 44, 41] utilize pseudo-labels to compensate for the lack of categorical information in the target domain. [23] jointly matched both the marginal distribution and conditional distribution using a revised MMD. [32] utilized an asymmetric tri-training strategy to learn discriminative representations for the target domain. [44] iteratively selected pseudo-labeled target samples based on the classifier from the previous training epoch and retrained the model using the enlarged training set. [41] proposed to assign pseudo-labels to all target samples and utilize them to achieve semantic alignment across domains. However, these approaches rely heavily on the hypothesis that correctly pseudo-labeled samples can reduce the bias caused by falsely pseudo-labeled samples. They do not explicitly alleviate the falsely pseudo-labeled samples, so when such samples dominate, their performance is limited.
3 Progressive Feature Alignment Network
In this section, we first provide the details of the proposed PFAN and then theoretically investigate the effectiveness of our approach. The overall architecture of PFAN is depicted in Fig. 2. It consists of three components: EHTS, APA, and the softmax function with a temperature variate. EHTS iteratively provides reliable pseudo-labeled samples from easy to hard, and APA explicitly enforces the cross-domain category alignment.
3.1 Task Formulation
In UDA, we are given a source domain $\mathcal{D}_S = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ of $n_s$ labeled samples and a target domain $\mathcal{D}_T = \{x_j^t\}_{j=1}^{n_t}$ of $n_t$ unlabeled samples [28]. The source and target domains are drawn from the joint probability distributions $P(X_S, Y_S)$ and $Q(X_T, Y_T)$ respectively, and $P \neq Q$. We assume that the source and target domains contain the same object classes, and we consider $C$ classes in all.
3.2 Easy-to-Hard Transfer Strategy
The EHTS is biased to favor easier samples, and this bias helps to avoid including the hard samples, which are more likely to be given false pseudo-labels. In our approach, the set of easy samples grows progressively, so the “hard” samples can still be selected in later steps. The pseudo-labeled samples selected by EHTS can then be aligned with their corresponding source categories as described in Section 3.3.
The EHTS first computes a $d$-dimensional prototype $c_k^S$ of each class $k$ in the source domain. The source prototype is the mean vector of the embedded source samples in each class, obtained through an embedding function $G$ (i.e. the feature extractor in Fig. 2) with trainable parameters $\theta$,
$$c_k^S = \frac{1}{N_s^k} \sum_{(x_i^s, y_i^s) \in \mathcal{D}_S^k} G(x_i^s; \theta), \tag{1}$$
where $\mathcal{D}_S^k$ denotes the set of samples labeled with class $k$ in the source domain and $N_s^k$ is the number of corresponding samples. Then, a set of $C$ prototypes, one per class, is obtained. The embedded target samples are supposed to gather around the source prototypes in the latent feature space. Thus, we use a similarity measurement $\psi$ to cluster the $j$-th unlabeled target sample, $x_j^t$, to the corresponding source prototypes, where $\psi$ is computed as follows,
$$\psi_k(x_j^t) = CS\big(G(x_j^t; \theta),\, c_k^S\big), \quad k = 1, \dots, C, \tag{2}$$
where $CS(\cdot, \cdot)$ denotes the cosine similarity function between two vectors. $x_j^t$ is added to the target set of class $k^*$ with the pseudo-label $\hat{y}_j^t = k^*$, where $k^* = \arg\max_k \psi_k(x_j^t)$.
Then, the unlabeled target samples are partitioned into $C$ classes and each sample is scored by its similarity. To obtain the “easy” samples, we require that the similarity scores be above a certain threshold $\tau$. During the training process, the similarity values increase continuously because the source samples and the target samples become closer to each other in the hidden space as training proceeds. “Hard” samples in the earlier stages may be selected as “easy” in the later stages. However, a constant threshold would turn too many “hard” samples into “easy” samples in each step. To control the growth rate of the “easy” samples, we gradually adjust the threshold step by step as follows,
(3) 
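where $\mu$ is a constant and $m$ denotes the training step. Therefore, the sample selection function is formulated as follows,
$$w_j = \begin{cases} 1, & \text{if } \max_k \psi_k(x_j^t) > \tau, \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$
where $\psi_k(x_j^t)$ is the similarity of Eq. (2) between the $j$-th target sample $x_j^t$ and the $k$-th source prototype, $\tau$ is the threshold of Eq. (3), $w_j = 1$ indicates that $x_j^t$ is selected, and $w_j = 0$ indicates that it is not. Finally, we obtain a selected pseudo-labeled target domain $\hat{\mathcal{D}}_T$, where $\hat{n}_t$ denotes the number of selected samples.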
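As a rough illustration of Eqs. (1), (2) and (4), the selection step can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the features are assumed to be already embedded by the feature extractor, and the threshold is passed in as a plain number because its schedule (Eq. (3)) is not reproduced here.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def source_prototypes(features, labels):
    """Per-class mean of embedded source samples, in the spirit of Eq. (1)."""
    by_class = {}
    for feature, label in zip(features, labels):
        by_class.setdefault(label, []).append(feature)
    return {label: mean_vector(group) for label, group in by_class.items()}


def select_easy_samples(target_features, prototypes, threshold):
    """Pseudo-label each target sample with its most similar prototype and
    keep it only if that similarity exceeds the threshold (Eqs. (2), (4)).

    Returns a list of (sample_index, pseudo_label, similarity) tuples.
    """
    selected = []
    for j, feature in enumerate(target_features):
        scores = {k: cosine_similarity(feature, c) for k, c in prototypes.items()}
        pseudo_label = max(scores, key=scores.get)
        if scores[pseudo_label] > threshold:
            selected.append((j, pseudo_label, scores[pseudo_label]))
    return selected


# Toy usage with hypothetical 2-D features.
protos = source_prototypes(
    [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]], [0, 0, 1, 1]
)
easy = select_easy_samples([[1.0, 0.1], [0.5, 0.5], [0.1, 1.0]], protos, 0.9)
print([(j, k) for j, k, _ in easy])
```

In this toy run the ambiguous middle sample is equally similar to both prototypes and falls below the threshold, so it would be left for a later training step.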
3.3 Adaptive Prototype Alignment
In this section, we introduce the proposed APA, which considers the pairwise semantic similarity across domains to explicitly alleviate the negative influence of the false-easy samples and enforce cross-domain category consistency. It can be implemented by aligning the prototypes of the source samples and the selected target samples for each category. We measure the distance between two prototypes as follows,
$$d(c_k^S, c_k^T) = \lVert c_k^S - c_k^T \rVert^2, \tag{5}$$
where $c_k^S$ and $c_k^T$ represent the source and target prototypes of class $k$, respectively. We opt for the squared Euclidean distance as the distance measure function. The justification is that the cluster mean yields optimal cluster representatives when a Bregman divergence (e.g. squared Euclidean distance and Mahalanobis distance) is used [36]. An alternative approach for prototype alignment is to compute and align local prototypes based on the mini-batches sampled from the source domain and the selected pseudo-labeled target set at each iteration. However, this approach is fragile because the categorical information in each mini-batch is expected to be insufficient: even one falsely labeled sample in the target mini-batch may cause a large bias between the computed prototype and the true prototype.
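To overcome the aforementioned problems, we propose to adaptively align the global prototypes. The APA first computes the initial global prototypes from the selected pseudo-labeled target samples as follows,
$$c_{k(0)}^T = \frac{1}{|\hat{\mathcal{D}}_T^k|} \sum_{(x_j^t, \hat{y}_j^t) \in \hat{\mathcal{D}}_T^k} G(x_j^t; \theta), \tag{6}$$
where $\hat{\mathcal{D}}_T^k$ is the set of selected target samples with pseudo-label $k$ and $G(\cdot; \theta)$ is the feature extractor.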
In each iteration, we compute a set of local prototypes $\tilde{c}_{k(i)}^T$ using the mini-batch samples. The accumulated prototypes are computed as the average of all previous local prototypes in each iteration,
$$\bar{c}_{k(I)}^T = \frac{1}{I} \sum_{i=1}^{I} \tilde{c}_{k(i)}^T, \tag{7}$$
where $I$ denotes the number of iterations in the current training step. Then, the new global prototypes $c_{k(I)}^T$ are updated as follows,
(8) 
where the similarity term is the cosine similarity defined in Eq. (2), which acts as the trade-off parameter. The global prototypes of the source domain are updated analogously. To this end, the APA loss is formulated as follows,
(9) 
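The prototype distance of Eq. (5) and the accumulated average of Eq. (7) can be illustrated with a short Python sketch. The names below are hypothetical, the similarity-weighted update of Eq. (8) is intentionally not reproduced, and summing the per-class distances into a single loss is an assumption about the form of Eq. (9) rather than a statement of it.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two prototypes (Eq. (5))."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


class AccumulatedPrototype:
    """Running average of the local (mini-batch) prototypes of one class,
    i.e. the mean of all local prototypes seen so far (Eq. (7))."""

    def __init__(self, dim):
        self.mean = [0.0] * dim
        self.count = 0

    def update(self, local_prototype):
        # Incremental mean: new_mean = mean + (x - mean) / count.
        self.count += 1
        self.mean = [
            m + (x - m) / self.count for m, x in zip(self.mean, local_prototype)
        ]
        return self.mean


def prototype_alignment_loss(source_protos, target_protos):
    """Assumed loss form: sum of squared distances over the classes that
    have a prototype in both domains."""
    shared = source_protos.keys() & target_protos.keys()
    return sum(
        squared_euclidean(source_protos[k], target_protos[k]) for k in shared
    )


# Toy usage with hypothetical 2-D prototypes.
acc = AccumulatedPrototype(dim=2)
acc.update([1.0, 0.0])
print(acc.update([0.0, 1.0]))
print(prototype_alignment_loss({0: [0.0, 0.0]}, {0: [3.0, 4.0]}))
```

Keeping a running mean avoids storing every past local prototype, which is one simple way to realize the accumulation described above.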
The motivation of APA is intuitive: 1) the accumulated prototypes are introduced to estimate the accumulated shift caused by the falsely labeled samples, and we can then use their similarity with the previous global prototypes to decide the new global prototypes; and 2) we statistically align the cross-domain category distributions, which can alleviate the error accumulation of the pseudo-labels.
3.4 Training Losses
In this work, we empirically found that a good adaptor needs a non-saturated source classifier. This empirical result is supported by the theoretical analysis described in Section 3.5. The justification is that the adaptation model is biased towards minimizing the source classification loss, which usually converges rapidly because the true source labels are available. However, this bias may lead to overfitting to the source samples and result in limited target performance. Inspired by [15], we propose to add a high temperature variate $T$ ($T > 1$) to the source classifier (as depicted in Fig. 2). In this way we can retard the convergence speed of the source classification loss and effectively guide the adaptor to a better adaptation performance. We achieve this behavior via the following softmax function,
$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \tag{10}$$
where $q$ denotes the class probabilities for a source sample and $z$ is the logit produced by the source classifier. Using a higher value for $T$ produces a softer output and naturally retards the convergence speed.
Adversarial learning has been successfully introduced to UDA by extracting domain-invariant features to achieve domain alignment [9]. However, the learned representations cannot ensure category alignment, which is the main source of performance reduction. Therefore, our work simultaneously considers domain-level and category-level alignment. In our PFAN, the input $x$ is first embedded by the feature extractor $G$ into a $d$-dimensional feature vector $f$, i.e. $f = G(x; \theta)$. In order to make $f$ domain-invariant, the parameters of the feature extractor are optimized by maximizing the loss of the domain discriminator, while the parameters of the domain discriminator are trained by minimizing the same loss. The discriminator is optimized with a standard classification loss:
(11) 
In addition, we also need to simultaneously minimize the loss of the label predictor for the labeled source samples and the APA loss. Formally, our ultimate goal is to optimize the following minimax objective:
(12) 
where the classification term is the standard cross-entropy loss, and the two weights control the interaction among the source classification loss, the domain confusion loss and the APA loss. The pseudo-code for training PFAN is shown in Algorithm 1; the EHTS and APA work alternately and iteratively.
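The effect of the temperature in Eq. (10) can be seen in a small, self-contained Python sketch. The function names and the example logits are hypothetical; the sketch only shows that a higher temperature softens the output, so the cross-entropy on an already well-classified sample stays larger.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (the form of Eq. (10))."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy of one sample given class probabilities and its label."""
    return -math.log(probs[label])


logits = [2.0, 1.0, 0.0]  # hypothetical logits; class 0 is the true label
for temperature in (1.0, 1.8):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs], round(cross_entropy(probs, 0), 3))
```

With the higher temperature the probability of the top class is lower and the loss is higher, which is consistent with the slower convergence of the source classification loss described above.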
3.5 Theoretical Analysis
In this section, we theoretically show that our approach improves the bound on the expected error on the target samples, making use of the theory of domain adaptation [1]. Formally, let $\mathcal{H}$ be the hypothesis class. Given two domains $\mathcal{S}$ and $\mathcal{T}$, the probabilistic bound on the error of a hypothesis $h$ on the target domain is
$$\forall h \in \mathcal{H}, \quad R_{\mathcal{T}}(h) \le R_{\mathcal{S}}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T}) + \lambda, \tag{13}$$
where the expected error on the target samples, $R_{\mathcal{T}}(h)$, is bounded by three terms: (1) the expected error on the source domain, $R_{\mathcal{S}}(h)$; (2) the domain divergence $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T})$, measured by a discrepancy distance between the two distributions $\mathcal{S}$ and $\mathcal{T}$ w.r.t. the hypothesis set $\mathcal{H}$; (3) the shared error of the ideal joint hypothesis, $\lambda$.
In Inequality (13), $R_{\mathcal{S}}(h)$ is expected to be small and is easily optimized by a deep network since we have source labels. On the other hand, prior efforts [9] seek to minimize $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T})$ by domain-classifier-based adversarial learning. However, a small $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{S}, \mathcal{T})$ and a small $R_{\mathcal{S}}(h)$ do not guarantee a small $R_{\mathcal{T}}(h)$. It is possible that $\lambda$ tends to be large when the cross-domain category alignment is not explicitly enforced (i.e. the marginal distribution is well aligned, but the class-conditional distribution is not guaranteed to be). Therefore, $\lambda$ needs to be bounded as well. Unfortunately, we cannot directly measure $\lambda$ due to the absence of true target labels. Thus, we resort to the pseudo-labels for its approximate evaluation and minimization.
Definition 1.
If denotes the expected risk on the selected pseudolabeled target set , the ideal joint hypothesis is the hypothesis which minimizes the combined error
and the combined error of the ideal hypothesis is
(14) 
where and are the labeling functions for the source and target domains, respectively.
To bound the combined error of the ideal hypothesis, the following inequality holds:
Theorem 1.
Let be the pseudolabeling function. Given and as the minimum shared error and the degree to which the target samples are falsely labeled on , respectively. We have
(15) 
We show the derivation of Theorem 1 in the Supplementary Material. It is easy to find a suitable hypothesis in the hypothesis class to approximate the source risk and the risk on the pseudo-labeled target set, respectively, since we have the source labels and the target pseudo-labels. However, we assume that when the category alignment has not been achieved, there exists an optimality gap between the two (Fig. 3(a)). Most existing methods do not consider this phenomenon and directly minimize the source risk, which leads to overfitting to the source samples.
Remark 1 (Minimizing the source risk).
The proposed softmax function with a temperature variate alleviates the overfitting to source samples (i.e. it enforces a non-saturated source classifier) by retarding the convergence speed of the source classification loss. This guides the adaptation model to a better target performance, i.e., a smaller target risk. Note that when the cross-domain category distributions are well aligned, the aforementioned optimality gap is removed (Fig. 3(b)).
Recall that the labeling function can be decomposed into the feature extractor and label classifier
. By considering the 01 loss function
for , we have(16) 
where
(17) 
Remark 2 (Minimizing shared error).
The proposed approach aims to progressively align features at the category level, i.e., it aligns each class in the source domain with the target class carrying the same pseudo-label. When the categories are aligned, the shared error is expected to be minimized.
Remark 3 (Minimizing the degree to which the target samples are falsely labeled on ).
The proposed EHTS aims to select reliable pseudolabeled samples in the target domain which minimizes .
4 Experiments
4.1 Datasets and Baselines
Office-31 [31] is a popular benchmark for evaluating domain adaptation. It contains 4,110 images of 31 categories in total, collected from three domains: Amazon (A), comprising 2817 images downloaded from online merchants; Webcam (W), involving 795 low-resolution images acquired from webcams; and DSLR (D), containing 498 high-resolution images taken by digital SLR cameras. We evaluate all 6 combinations of two domains.
ImageCLEF-DA [4], originally used for the ImageCLEF 2014 domain adaptation challenge, consists of twelve common classes from three domains: ImageNet ILSVRC 2012 (I), Pascal VOC 2012 (P), and Caltech-256 (C). Each domain has 600 images in total and contains 50 images per class. We test 6 tasks by using all domain combinations.
MNIST [19], SVHN [26] and USPS [7] contain digit images of 10 classes. In particular, the images in MNIST and USPS are grey, and are of size 28×28 and 16×16, respectively; SVHN consists of color images of size 32×32, and there is often more than one digit in one image. Following previous works, we consider three transfer tasks: MNIST→SVHN, SVHN→MNIST and MNIST→USPS.
4.2 Implementation Details
Following previous practices, we instantiate our backbone with AlexNet pre-trained on ImageNet for Office-31 and ImageCLEF-DA, and employ the CNN architecture of [39] for the digit datasets. As suggested by [25], we fine-tune the feature extractor on top of the backbone and train the label predictor from scratch via back-propagation. We utilize stochastic gradient descent (SGD) for training, with a momentum of 0.9 and an annealed learning rate (lr) that is a function of the training progress, which increases linearly from 0 to 1 as training proceeds. In order to suppress noisy signals, especially in the initial training steps, we use a schedule similar to that of [9] to adaptively change the values of the two loss weights in Eq. (12). We fix the temperature in Eq. (10) and the constant in Eq. (3) for all experiments. The batch size is 128. The means and standard deviations of all results are obtained over 5 random runs. All experiments are implemented in the Caffe framework.
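For readers who want a concrete picture of such schedules, the sketch below implements the annealing schedules published in [9], on which this setup is said to be based. Treat it as an assumption: the exact formulas and constants used for PFAN are not restated here, and the default values (`lr0=0.01`, `alpha=10`, `beta=0.75`, `delta=10`) are those reported in [9], not values confirmed for this work. The function names are hypothetical.

```python
import math


def annealed_learning_rate(progress, lr0=0.01, alpha=10.0, beta=0.75):
    """Learning rate lr0 / (1 + alpha * p) ** beta, with p in [0, 1]
    the training progress (schedule of [9]; constants are assumptions)."""
    return lr0 / (1.0 + alpha * progress) ** beta


def adaptation_weight(progress, delta=10.0):
    """Loss weight 2 / (1 + exp(-delta * p)) - 1, which grows smoothly from
    0 towards 1 and so suppresses noisy signals early in training
    (schedule of [9]; constant is an assumption)."""
    return 2.0 / (1.0 + math.exp(-delta * progress)) - 1.0


for p in (0.0, 0.25, 0.5, 1.0):
    print(p, round(annealed_learning_rate(p), 5), round(adaptation_weight(p), 4))
```

The learning rate decays monotonically while the loss weight ramps up, so the adaptation terms only become influential once the features are reasonably trained.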
Table 1: Accuracy (%) on Office-31.
Method  A→W  D→W  W→D  A→D  D→A  W→A  Avg
AlexNet [18]  61.5±0.5  95.1±0.3  99.0±0.2  64.4±0.5  48.8±0.3  47.0±0.4  69.3
DDC [40]  61.8±0.4  95.0±0.5  98.5±0.4  64.4±0.3  52.1±0.6  52.2±0.4  70.6
DAN [22]  68.5±0.4  96.0±0.3  99.0±0.2  67.0±0.4  54.0±0.4  53.1±0.3  72.9
RTN [24]  73.3±0.3  96.8±0.2  99.6±0.1  71.0±0.2  50.5±0.3  51.0±0.1  73.7
RevGrad [9]  73.0±0.5  96.4±0.3  99.2±0.3  72.3±0.3  53.4±0.4  51.2±0.5  74.3
JAN [25]  74.9±0.3  96.6±0.2  99.5±0.2  71.8±0.2  58.3±0.3  55.0±0.4  76.0
MADA [29]  78.5±0.2  99.8±0.1  100.0±0.0  74.1±0.1  56.0±0.2  54.5±0.3  77.1
MSTN [41]  80.5±0.4  96.9±0.1  99.9±0.1  74.5±0.4  62.5±0.4  60.0±0.6  79.1
PFAN  83.0±0.3  99.0±0.2  99.9±0.1  76.3±0.3  63.3±0.3  60.8±0.5  80.4

Table 2: Accuracy (%) on ImageCLEF-DA.
Method  I→P  P→I  I→C  C→I  C→P  P→C  Avg
AlexNet [18]  66.2±0.2  70.0±0.2  84.3±0.2  71.3±0.4  59.3±0.5  84.5±0.3  73.9
DAN [22]  67.3±0.2  80.5±0.3  87.7±0.3  76.0±0.3  61.6±0.3  88.4±0.2  76.9
RevGrad [9]  66.5±0.5  81.8±0.4  89.0±0.5  79.8±0.5  63.5±0.4  88.7±0.4  78.2
JAN [25]  67.2±0.5  82.8±0.4  91.3±0.5  80.0±0.5  63.5±0.4  91.0±0.4  79.3
MADA [29]  68.3±0.3  83.0±0.1  91.0±0.2  80.7±0.2  63.8±0.2  92.2±0.3  79.8
MSTN [41]  67.3±0.3  82.8±0.2  91.5±0.1  81.7±0.3  65.3±0.2  91.2±0.2  80.0
PFAN  68.5±0.5  84.4±0.4  92.2±0.6  82.3±0.4  66.3±0.3  91.7±0.2  80.9
4.3 Comparisons with State-of-the-Arts
State-of-the-arts.
We compare our approach with various state-of-the-art UDA methods, including AlexNet [18], Deep Domain Confusion (DDC) [40], Deep Adaptation Network (DAN) [22], Residual Transfer Network (RTN) [24], Reverse Gradient (RevGrad) [9], Adversarial Discriminative Domain Adaptation (ADDA) [39], Joint Adaptation Networks (JAN) [25], Asymmetric Tri-Training (ATT) [32], Multi-Adversarial Domain Adaptation (MADA) [29], and Moving Semantic Transfer Network (MSTN) [41]. For all of the above methods, we summarize the results reported in their original papers. For simplicity, we refer to our method as PFAN hereafter.
Table 1 displays the results on Office-31. The proposed PFAN outperforms all compared methods in general and improves the state-of-the-art result from 79.1% to 80.4% on average. On the hard transfer tasks, PFAN exhibits substantially better transfer ability than the others. In contrast to JAN, MADA and MSTN, our PFAN additionally considers both the target intra-class variation and the non-saturated source classifier. Our better performance over them could indicate the effectiveness of these two components. RevGrad also takes domain adversarial adaptation into account, but its results are still inferior to ours. The advantage of our model over RevGrad is that we further perform EHTS and APA, which, as supported by our experiments, can explicitly enforce the cross-domain category alignment and hence deliver better performance.
The results on ImageCLEF-DA are reported in Table 2. Our approach outperforms all compared methods on most transfer tasks, which reveals that PFAN scales to different datasets. However, the improvements are smaller than on Office-31, since the difference in domain sizes causes shift [25]. More comparisons with ResNet-based methods [30, 44, 17] are provided in the Supplementary Material.
The results of digit classification are reported in Table 3. We follow the training protocol established in [39]. For adaptation between MNIST and USPS, we randomly sample 2000 images from MNIST and 1800 from USPS. For adaptation between SVHN and MNIST, we use the full training sets. For the hard transfer task MNIST→SVHN, we reproduced MSTN [41] but were unable to get it to converge, since the performance of this approach depends strongly on the accuracy of the pseudo-labeled samples, which was lower on this task. In contrast, our approach outperforms the second-best result by +4.8%, which clearly demonstrates its effectiveness in selecting reliable pseudo-labeled samples and alleviating the negative influence of falsely labeled samples in this challenging scenario. For the easier tasks SVHN→MNIST and MNIST→USPS, our approach also shows superiority.
Table 3: Accuracy (%) on the digit datasets.
Source  MNIST  SVHN  MNIST
Target  SVHN  MNIST  USPS
Source Only  33.0±1.2  60.1±1.1  75.2±1.6
RevGrad [9]  35.7  73.9  77.1±1.8
ADDA [39]  -  76.0±1.8  89.4±0.2
ATT [32]  52.8  85.0  -
MSTN [41]  did not converge  91.7±1.5  92.9±1.1
PFAN  57.6±1.8  93.9±0.8  95.0±1.3

Table 4: Ablation study (accuracy, %).
Model  A→W  I→P  SVHN→MNIST
Source Only  61.6  66.2  60.1
PFAN (Random)  77.0  67.0  87.2
PFAN (Full)  81.9  68.0  92.5
PFAN (woAPA)  76.4  67.1  82.0
PFAN (woA)  82.2  68.1  93.0
PFAN (woT)  80.6  67.9  92.1
PFAN  83.0  68.5  93.9
4.4 Further Empirical Analysis
Ablation Study.
To isolate the contribution of each component, we perform an ablation study by evaluating several variants of PFAN: (1) PFAN (Random), which randomly selects the target samples instead of using the easy-to-hard order; (2) PFAN (Full), which uses all target samples throughout training; (3) PFAN (woAPA), which trains completely without the APA (i.e. with the APA weight in Eq. (12) set to zero); (4) PFAN (woA), which aligns the prototypes based on the current mini-batch without considering the global and accumulated prototypes; (5) PFAN (woT), which removes the temperature from our model (i.e. with the temperature in Eq. (10) set to 1). The results are shown in Table 4. We observe that performance degrades whenever any one of these components is removed, which supports the design of each of them. It is noteworthy that PFAN outperforms both PFAN (Random) and PFAN (Full), which reveals that the EHTS can provide more reliable and informative target samples for the cross-domain category alignment.
Pseudo-labeling Accuracy.
We show the relationship between the pseudo-labeling accuracy and the test accuracy in Fig. 5. We found that (1) the pseudo-labeling accuracy remains high and stable as training proceeds, thanks to the EHTS selecting reliable pseudo-labeled samples; (2) the test accuracy increases with the number of pseudo-labeled samples, which implies that although the numbers of correctly and falsely labeled samples both increase proportionally, our approach can explicitly alleviate the negative influence of the falsely labeled samples.
Comparing Different Temperatures.
We perform a sensitivity analysis of the temperature in Eq. (10) on one transfer task. We report the classification accuracy as the temperature varies from 1.5 to 2.0 in steps of 0.1; the results are given in Table 5. The accuracy first increases and then decreases as the temperature varies, and is best at a temperature of 1.8. The results implicitly confirm our assumption that a good UDA model needs a non-saturated source classifier.
Non-saturated source classifier.
To further verify our hypothesis about the non-saturated source classifier, we investigate the source classification loss under different temperature settings. The results are reported in Fig. 4(a). The model with the lower temperature converges faster than the one with the higher temperature, especially at the beginning of training. However, this difference gradually decreases as training proceeds. The justification is that a higher temperature retards the convergence speed of the source classification loss (i.e. it alleviates the adaptor overfitting to the source samples), thus yielding better adaptation.
Distribution Discrepancy.
The domain adaptation theory [1] suggests that the $\mathcal{A}$-distance can be used as a measure of domain discrepancy. The empirical $\mathcal{A}$-distance is estimated as $\hat{d}_{\mathcal{A}} = 2(1 - 2\epsilon)$, where $\epsilon$ is the generalization error of a classifier trained to discriminate the source and target features. We utilize a kernel SVM to estimate the $\mathcal{A}$-distance. Fig. 4(b) shows the $\mathcal{A}$-distance calculated with the features from AlexNet, RevGrad and PFAN on two transfer tasks. We observe that our method significantly reduces the $\mathcal{A}$-distance compared with AlexNet. However, compared with RevGrad, PFAN shows a smaller improvement in $\mathcal{A}$-distance but improves the performance by a large margin, which demonstrates that a low domain divergence does not imply better performance in the target domain. This phenomenon is consistent with the analysis in Section 3.5.
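The proxy distance used above is a one-line computation once the domain classifier's error is known. The sketch below uses the standard estimate from the domain adaptation literature, 2(1 - 2ε); the function names are hypothetical, and the predictions are assumed to come from any held-out evaluation of a source-versus-target classifier (the kernel SVM itself is not reproduced here).

```python
def error_rate(predictions, labels):
    """Fraction of domain predictions that disagree with the domain labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


def proxy_a_distance(domain_classifier_error):
    """Proxy A-distance 2 * (1 - 2 * error): 2 when the domains are perfectly
    separable (error 0) and 0 when they are indistinguishable (error 0.5)."""
    if not 0.0 <= domain_classifier_error <= 1.0:
        raise ValueError("error must lie in [0, 1]")
    return 2.0 * (1.0 - 2.0 * domain_classifier_error)


# Hypothetical held-out domain predictions (0 = source, 1 = target).
err = error_rate([0, 0, 1, 1, 0, 1, 1, 0], [0, 0, 1, 1, 1, 1, 0, 0])
print(err, proxy_a_distance(err))
```

A lower value means the domain classifier struggles to tell the two feature distributions apart, which is the sense in which the features are better aligned.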
Table 5: Accuracy (%) for different temperatures.
Temperature  1.5  1.6  1.7  1.8  1.9  2.0
Accuracy (%)  80.9  81.8  82.1  83.0  80.9  80.7
Feature Visualization.
We utilize t-SNE [8] to visualize the deep features of the network activations on one transfer task (8 randomly selected classes) learned by RevGrad (the bottleneck layer) and PFAN (the bottleneck layer). As shown in Fig. 4(c)-4(d), the RevGrad features on the target domain cannot be discriminated very well, and some categories are mixed up in the feature space. By contrast, PFAN learns more discriminative representations, which jointly enlarge the inter-class dispersion and reduce the intra-class variations.
5 Conclusion
In this paper, we proposed a novel approach called Progressive Feature Alignment Network, which takes advantage of target domain intra-class variance and cross-domain category consistency to address UDA problems. The proposed EHTS and APA complement each other in selecting reliable pseudo-labeled samples and alleviating the bias caused by the falsely labeled samples. The performance is further improved by retarding the convergence speed of the source classification loss. Extensive experiments reveal that our approach outperforms state-of-the-art UDA approaches on three domain adaptation datasets.
References
 [1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1–2):151–175, 2010.
 [2] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.
 [3] L. Bruzzone and M. Marconcini. Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE transactions on pattern analysis and machine intelligence, 32(5):770–787, 2010.
 [4] B. Caputo, H. Müller, J. Martinez-Gomez, M. Villegas, B. Acar, N. Patricia, N. Marvasti, S. Üsküdarlı, R. Paredes, M. Cazorla, et al. ImageCLEF 2014: Overview and analysis of the results. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 192–211. Springer, 2014.
 [5] M. Chen, K. Q. Weinberger, and J. Blitzer. Co-training for domain adaptation. In Advances in neural information processing systems, pages 2456–2464, 2011.

 [6] Q. Chen, Y. Liu, Z. Wang, I. Wassell, and K. Chetty. Re-weighted adversarial adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7976–7985, 2018.
 [7] J. S. Denker, W. Gardner, H. P. Graf, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, H. S. Baird, and I. Guyon. Neural network recognizer for hand-written zip code digits. In Advances in neural information processing systems, pages 323–331, 1989.
 [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
 [9] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189, 2015.
 [10] M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, and B. Schölkopf. Domain adaptation with conditional transferable components. In International conference on machine learning, pages 2839–2848, 2016.
 [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [12] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
 [13] P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers. Associative domain adaptation. In International Conference on Computer Vision (ICCV), volume 2, page 6, 2017.
 [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [15] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [16] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML), 2018.

 [17] G. Kang, L. Zheng, Y. Yan, and Y. Yang. Deep adversarial attention alignment for unsupervised domain adaptation: the benefit of target expectation maximization. In European Conference on Computer Vision, 2018.
 [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

 [20] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
 [21] S. Li, S. Song, G. Huang, Z. Ding, and C. Wu. Domain invariant and class discriminative feature learning for visual domain adaptation. IEEE Transactions on Image Processing, 27(9):4260–4273, 2018.
 [22] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105, 2015.

 [23] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE international conference on computer vision, pages 2200–2207, 2013.
 [24] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems, pages 136–144, 2016.

 [25] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208–2217, 2017.
 [26] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [27] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
 [28] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.

 [29] Z. Pei, Z. Cao, M. Long, and J. Wang. Multi-adversarial domain adaptation. In AAAI Conference on Artificial Intelligence, 2018.
 [30] P. O. Pinheiro and A. Element. Unsupervised domain adaptation with similarity learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8004–8013, 2018.
 [31] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European conference on computer vision, pages 213–226. Springer, 2010.
 [32] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. In International Conference on Machine Learning, pages 2988–2997, 2017.
 [33] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. arXiv preprint arXiv:1712.02560, 2017.
 [34] O. Sener, H. O. Song, A. Saxena, and S. Savarese. Learning transferrable representations for unsupervised domain adaptation. In Advances in Neural Information Processing Systems, pages 2110–2118, 2016.
 [35] R. Shu, H. H. Bui, H. Narui, and S. Ermon. A DIRT-T approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735, 2018.
 [36] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
 [37] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.
 [38] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE, 2011.
 [39] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
 [40] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
 [41] S. Xie, Z. Zheng, L. Chen, and C. Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, pages 5419–5428, 2018.
 [42] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, 2017.
 [43] J. Zhang, Z. Ding, W. Li, and P. Ogunbona. Importance weighted adversarial nets for partial domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8156–8164, 2018.
 [44] W. Zhang, W. Ouyang, W. Li, and D. Xu. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3801–3809, 2018.