Recent advances in deep neural networks have convincingly demonstrated a high capability for learning vision models on large datasets. For instance, an ensemble of residual nets achieves 3.57% top-5 error on the ImageNet test set, which is even lower than the reported human-level performance of 5.1%. These achievements rely heavily on large quantities of annotated data for deep model learning. However, performing intensive manual labeling on a new dataset is expensive and time-consuming. A valid question is why not recycle off-the-shelf learnt knowledge/models from a source domain for new domain(s). The difficulty originates from the domain gap, which may adversely affect performance, especially when the source and target data distributions are very different. An appealing way to address this challenge is unsupervised domain adaptation, which aims to utilize labeled examples or learnt models in the source domain and the large number of unlabeled examples in the target domain to generalize a target model.
A common practice in unsupervised domain adaptation is to align data distributions between source and target domains, or to build invariance across domains, by minimizing domain shift through measures such as correlation distances [27, 34] or maximum mean discrepancy. In this paper, we explore general-purpose and task-specific domain adaptations under the framework of Prototypical Networks. The design of prototypical networks assumes the existence of an embedding space in which the projections of samples in each class cluster around a single prototype (or centroid). Classification is then performed by computing the distances to the prototype representations of each class in the embedding space. In this way, the general-purpose adaptation represents each class distribution by a prototype and matches the prototypes of each class in the embedding space learnt on the data from different domains. The inspiration for task-specific adaptation comes from the rationale that the target data should be classified correctly by the task-specific model when the source and target distributions are well aligned. In the context of prototypical networks, task-specific adaptation is equivalent to adapting the score distributions produced by prototypes in different domains.
By consolidating the ideas of general-purpose adaptation and task-specific adaptation into unsupervised domain adaptation, we present a novel Transferrable Prototypical Networks (TPN) architecture. Ideally, TPN learns a non-linear mapping (a neural network) of the input examples into an embedding space in which the representations are invariant across domains. Specifically, TPN takes a batch of labeled source and unlabeled target examples, compares each target example to each of the prototypes computed on source data, and assigns the label of the nearest prototype as a “pseudo” label to each target example. The general-purpose adaptation is then formulated to minimize the distances between the prototypes measured on source data, target data with pseudo labels, and source plus target data, which alleviates domain discrepancy at the class level. In task-specific adaptation, we utilize a softmax over the distances from the embedding of each example to the prototypes as the classifier. KL-divergence is exploited to model the mismatch between the score distributions of classifiers built on prototypes computed in each domain or their combination. In this case, domain discrepancy is reduced at the sample level. The whole TPN is trained end-to-end by minimizing the classification loss on labeled source data plus the two adaptation terms, alternating between pseudo-label assignment and parameter updates from batch to batch. At the inference stage, each prototype is computed a priori. A test target example is projected into the embedding space and compared to each prototype, and the outputs of the softmax are taken as predictions.
2 Related Work
Inspired by the recent advances in image representation using deep convolutional neural networks (DCNNs), a few deep architecture based methods have been proposed for unsupervised domain adaptation. In particular, one common deep solution for unsupervised domain adaptation is to guide the feature learning in DCNNs by minimizing the domain discrepancy with Maximum Mean Discrepancy (MMD). MMD is an effective non-parametric metric for comparing the distributions of source and target domains. One of the early works incorporates MMD into DCNNs with a regular supervised classification loss on the source domain to learn representations that are both semantically meaningful and domain invariant. Later, Long et al. simultaneously exploit the transferability of features from multiple layers via the multiple kernel variant of MMD. The work is further extended by adapting classifiers through a residual transfer module. Most recently, domain shift reduction has been explored in the joint distributions of the network activations of multiple task-specific layers.
Another branch of unsupervised domain adaptation in DCNNs exploits domain confusion by learning a domain discriminator [4, 14, 29, 30, 35]. Here the domain discriminator is designed to predict the domain (source/target) of each input sample and is trained in an adversarial fashion, similar to GANs, for learning domain invariant representations. For example, a domain confusion loss measured in the domain discriminator has been devised to enforce the learnt representation to be domain invariant. Similar in spirit, Ganin et al. treat domain confusion as a binary classification task and optimize the domain discriminator via a gradient reversal algorithm. Coupled GANs directly apply GANs to the domain adaptation problem to explicitly reduce the domain shift by learning a joint distribution of multi-domain images. Recently, adversarial learning has been combined with discriminative feature learning for unsupervised domain adaptation. Most recently, the domain discriminator has been extended by learning a domain-invariant feature extractor and performing feature augmentation.
In summary, our approach belongs to the domain discrepancy based methods. Similar to previous approaches [16, 31], our TPN leverages additional unlabeled target data for learning task-specific classifiers. The novelty lies in the exploitation of multi-granular domain discrepancy in Prototypical Networks, at the class level and the sample level, which has not been fully explored in the literature. Class-level domain discrepancy is reduced by learning similar prototypes of each class in different domains, while sample-level discrepancy is reduced by enforcing similar score distributions across prototypes of different domains.
3 Unsupervised Domain Adaptation
Our Transferrable Prototypical Networks (TPN) remould Prototypical Networks towards the scenario of unsupervised domain adaptation by jointly bridging the domain gap via minimizing multi-granular domain discrepancies, and constructing classifiers with unlabeled target data and labeled source data. The classifiers in Prototypical Networks are typically obtained by measuring distances between an example and the prototype of each class, and can be flexibly adapted across domains by only updating the prototypes in a specific domain. To learn transferrable representations in Prototypical Networks, TPN first utilizes the classifier learnt on source-only data to directly predict the pseudo labels of unlabeled target data, and thus produces another two kinds of prototype-based classifiers constructed on target-only and source-target data. The training of TPN is then performed by simultaneously classifying each source sample into its correct class and reducing multi-granular domain discrepancy at the class level and the sample level. The class-level domain discrepancy is reduced by matching the prototypes of each class, and the sample-level domain discrepancy is minimized by enforcing the score distributions over classes of each sample to be synchronized across different domains. We alternate the above two steps in each training iteration and optimize the whole TPN in an end-to-end fashion.
3.1 Preliminary—Prototypical Networks
Prototypical Networks were originally proposed to construct an embedding space in which points cluster around a single prototype representation of each class. In particular, we are given a set $\mathcal{S}=\{(x_i,y_i)\}_{i=1}^{N}$ with $N$ labeled samples belonging to $C$ categories, where $y_i\in\{1,\dots,C\}$ is the class label of sample $x_i$. The objective is to learn an embedding function $f(x_i;\theta)$ for transforming each input sample into an $M$-dimensional embedding space through a deep architecture of Prototypical Networks, where $\theta$ represents the learnable parameters. To convey the high-level description of the class as meta-data, the prototype $\mu_c$ of each class is defined by taking the average of all embedded samples belonging to that class:
$$\mu_c=\frac{1}{|\mathcal{S}_c|}\sum_{(x_i,y_i)\in\mathcal{S}_c} f(x_i;\theta), \tag{1}$$
where $\mathcal{S}_c$ denotes the set of samples from class $c$. Given a query sample $x_i$, Prototypical Networks directly produce its score distribution over the $C$ classes via a softmax function on distances to the prototypes, whose $c$-th element is the probability of $x_i$ belonging to class $c$:
$$p(y_i=c\,|\,x_i)=\frac{\exp\big(-d(f(x_i;\theta),\mu_c)\big)}{\sum_{c'=1}^{C}\exp\big(-d(f(x_i;\theta),\mu_{c'})\big)}, \tag{2}$$
where $d(\cdot,\cdot)$ is the distance function (e.g., Euclidean distance) between the query sample and the prototype. The training of Prototypical Networks is performed by minimizing the negative log-likelihood of assigning the correct class label $c$ to this sample:
$$\mathcal{L}(\theta)=-\log p(y_i=c\,|\,x_i). \tag{3}$$
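The prototype, softmax and negative log-likelihood steps described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the embedding network is omitted (the toy 2-D vectors stand in for its outputs), squared Euclidean distance is assumed as the distance function, and all function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_prototypes(embeddings, labels):
    """Average the embedded samples of each class to get its prototype."""
    by_class = {}
    for z, y in zip(embeddings, labels):
        by_class.setdefault(y, []).append(z)
    return {c: [sum(col) / len(zs) for col in zip(*zs)]
            for c, zs in by_class.items()}


def class_scores(z, prototypes):
    """Softmax over negative distances from embedding z to each prototype."""
    logits = {c: -sq_dist(z, mu) for c, mu in prototypes.items()}
    top = max(logits.values())  # shift logits for numerical stability
    exps = {c: math.exp(v - top) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def nll_loss(z, label, prototypes):
    """Negative log-likelihood of the correct class for one sample."""
    return -math.log(class_scores(z, prototypes)[label])


# Toy 2-D "embeddings" standing in for the network outputs (assumption).
embeddings = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
prototypes = class_prototypes(embeddings, labels)
query_scores = class_scores([1.0, 1.0], prototypes)
```

Because the classifier is defined entirely by the prototypes, swapping in prototypes computed on another domain changes the classifier without touching the embedding parameters.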
3.2 Problem Formulation
In unsupervised domain adaptation, we are given $N_s$ labeled samples $\mathcal{D}_s=\{(x_i^s,y_i^s)\}_{i=1}^{N_s}$ in the source domain and $N_t$ unlabeled samples $\mathcal{D}_t=\{x_i^t\}_{i=1}^{N_t}$ in the target domain. Based on the widely adopted assumption of the existence of a shared feature space for source and target domains [16, 20, 29], the ultimate goal of this task is to design an embedding function $f(\cdot;\theta)$ which formally reduces domain shift in the shared feature space and enables learning of both transferrable representations and classifiers depending on $\mathcal{D}_s$ and $\mathcal{D}_t$. Different from existing transfer techniques [16, 17], which are typically composed of two cascaded networks for learning domain-invariant features and target-discriminative classifiers respectively, we consider unsupervised domain adaptation in the framework of Prototypical Networks. Such a framework naturally unifies the learning of features and classifiers into one network by constructing classifiers purely on the prototype of each class. This design reflects a very simple inductive bias that is beneficial in the domain adaptation regime. Specifically, to make Prototypical Networks transferrable across domains, two adaptation mechanisms are devised to align the distributions of source and target domains by reducing multi-granular (i.e., class-level and sample-level) domain discrepancies. The general-purpose adaptation matches the prototypes of each class, and the task-specific adaptation enforces similar score distributions over classes of each sample, across different domains, as shown in Figure 1.
3.3 General-purpose Domain Adaptation
Most existing works resolve unsupervised domain adaptation by minimizing the domain discrepancy between source and target data distributions with MMD, or by maximizing the domain confusion across domains via a domain discriminator. Both the domain discrepancy and the domain confusion terms are measured over the entire source and target data, irrespective of the specific class of each sample. Moreover, the domain discrepancy has seldom been exploited across domains for each class, possibly because measuring such class-level domain discrepancy needs the labels of both source and target samples, while in typical unsupervised domain adaptation settings no label is provided for target samples.
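Inspired by self-labeling [11, 24] for domain adaptation, we directly utilize the prototype-based classifier learnt on labeled source data to match each target sample to the nearest prototype in the source domain, and then assign the target sample a “pseudo” label. As such, all the target samples carry pseudo labels. After obtaining the real/pseudo labels of source/target data, three kinds of classifiers (i.e., prototypes $\{\mu_c^s\}$, $\{\mu_c^t\}$ and $\{\mu_c^{st}\}$) can be calculated on source-only data ($\mathcal{D}_s$), target-only data ($\hat{\mathcal{D}}_t$, i.e., target samples with pseudo labels) and source-target data ($\mathcal{D}_s\cup\hat{\mathcal{D}}_t$), respectively:
$$\mu_c^s=\frac{1}{|\mathcal{X}_c^s|}\sum_{x_i^s\in\mathcal{X}_c^s} f(x_i^s;\theta),\qquad \mu_c^t=\frac{1}{|\mathcal{X}_c^t|}\sum_{x_i^t\in\mathcal{X}_c^t} f(x_i^t;\theta),\qquad \mu_c^{st}=\frac{1}{|\mathcal{X}_c^s|+|\mathcal{X}_c^t|}\Big(\sum_{x_i^s\in\mathcal{X}_c^s} f(x_i^s;\theta)+\sum_{x_i^t\in\mathcal{X}_c^t} f(x_i^t;\theta)\Big), \tag{4}$$
where $\mathcal{X}_c^s$ and $\mathcal{X}_c^t$ denote the sets of source/target samples from the same class $c$.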
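The self-labeling step and the three prototype sets can be sketched as follows. This is an illustrative sketch only: embeddings are given directly as toy vectors, nearest-prototype assignment uses squared Euclidean distance, and the helper names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def prototypes(embeddings, labels):
    """Per-class mean embedding."""
    by_class = {}
    for z, y in zip(embeddings, labels):
        by_class.setdefault(y, []).append(z)
    return {c: mean(zs) for c, zs in by_class.items()}


def pseudo_labels(target_embeddings, source_prototypes):
    """Label each target embedding with its nearest source prototype."""
    return [min(source_prototypes,
                key=lambda c: sq_dist(z, source_prototypes[c]))
            for z in target_embeddings]


source = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
source_labels = [0, 0, 1, 1]
target = [[1.0, 1.0], [5.0, 1.0]]

mu_s = prototypes(source, source_labels)              # source-only
target_labels = pseudo_labels(target, mu_s)           # self-labeling
mu_t = prototypes(target, target_labels)              # target-only
mu_st = prototypes(source + target,
                   source_labels + target_labels)     # source-target
```

Note that a class receiving no pseudo-labeled target samples has no target-only prototype in this sketch; how such classes are handled in practice is not specified here.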
To measure the class-level domain discrepancy across domains, we take inspiration from MMD-based transfer techniques [16, 17] and compute the pairwise reproducing kernel Hilbert space (RKHS) distance between the prototypes of the same class from different domains. The basic idea is that if the data distributions of source and target domains are identical, the prototypes of the same class obtained on different domains are the same. Formally, we define the following class-level discrepancy loss as
$$\mathcal{L}_G=\frac{1}{C}\sum_{c=1}^{C}\Big(\big\|\bar{\mu}_c^s-\bar{\mu}_c^t\big\|_{\mathcal{H}}^2+\big\|\bar{\mu}_c^s-\bar{\mu}_c^{st}\big\|_{\mathcal{H}}^2+\big\|\bar{\mu}_c^t-\bar{\mu}_c^{st}\big\|_{\mathcal{H}}^2\Big), \tag{5}$$
where $\bar{\mu}_c^s$, $\bar{\mu}_c^t$ and $\bar{\mu}_c^{st}$ denote the corresponding prototypes in the reproducing kernel Hilbert space $\mathcal{H}$. By minimizing this term, the prototypes of each class computed in each domain are enforced to be in close proximity in the embedding space, leading to an invariant representation distribution across domains in general.
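A minimal sketch of the class-level discrepancy follows. It makes two assumptions that are not taken from the text: a linear kernel is used, so that the RKHS distance reduces to the squared Euclidean distance in the embedding space, and the pairwise terms are averaged over the classes present in all three prototype sets. The function name is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_level_discrepancy(mu_s, mu_t, mu_st):
    """Average pairwise prototype distance per class (linear-kernel assumption)."""
    classes = set(mu_s) & set(mu_t) & set(mu_st)
    if not classes:
        return 0.0
    total = 0.0
    for c in classes:
        total += sq_dist(mu_s[c], mu_t[c])    # source vs target
        total += sq_dist(mu_s[c], mu_st[c])   # source vs source-target
        total += sq_dist(mu_t[c], mu_st[c])   # target vs source-target
    return total / len(classes)


# Toy prototypes for two classes in a 2-D embedding space.
mu_s = {0: [0.0, 1.0], 1: [4.0, 1.0]}
mu_t = {0: [1.0, 1.0], 1: [5.0, 1.0]}
mu_st = {0: [1.0 / 3.0, 1.0], 1: [13.0 / 3.0, 1.0]}

loss = class_level_discrepancy(mu_s, mu_t, mu_st)
aligned = class_level_discrepancy(mu_s, mu_s, mu_s)
```

The loss vanishes only when the three prototypes of every class coincide, which is the condition the general-purpose adaptation drives towards.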
Connections with MMD. MMD is a kernel two-sample test which measures the distribution difference between source and target data by mapping them into a reproducing kernel Hilbert space. The empirical estimate of (squared) MMD is computed by
$$\mathrm{MMD}^2(\mathcal{D}_s,\mathcal{D}_t)=\Big\|\frac{1}{N_s}\sum_{i=1}^{N_s}\phi(x_i^s)-\frac{1}{N_t}\sum_{j=1}^{N_t}\phi(x_j^t)\Big\|_{\mathcal{H}}^2, \tag{6}$$
where $\phi(\cdot)$ is the mapping to the RKHS $\mathcal{H}$. Taking a close look at the objective of MMD and our class-level discrepancy loss in Eq.(5), we can observe some interesting connections. Concretely, the means of source and target data (i.e., $\frac{1}{N_s}\sum_i\phi(x_i^s)$ and $\frac{1}{N_t}\sum_j\phi(x_j^t)$) measured in MMD can be interpreted as the holistic prototype of each domain in the RKHS. MMD is then expressed as the RKHS distance between the holistic prototypes across domains. Our class-level domain discrepancy, different from MMD, is computed as the RKHS distance across the prototypes of each class from different domains. In other words, a fine-grained alignment of source and target data distributions is performed at the class level, instead of simply minimizing the distance between holistic prototypes across domains.
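The difference between holistic and class-level alignment can be made concrete with a toy case in which the domain means coincide while the classes are swapped. The sketch assumes a linear kernel (so the mapping to the RKHS is the identity) and, purely for illustration, uses known target labels; both function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def linear_mmd(source, target):
    """Squared distance between the holistic prototypes of two domains."""
    return sq_dist(mean(source), mean(target))


def class_wise_gap(source, source_labels, target, target_labels):
    """Average squared distance between per-class prototypes across domains."""
    classes = sorted(set(source_labels) & set(target_labels))
    gaps = []
    for c in classes:
        mu_s = mean([z for z, y in zip(source, source_labels) if y == c])
        mu_t = mean([z for z, y in zip(target, target_labels) if y == c])
        gaps.append(sq_dist(mu_s, mu_t))
    return sum(gaps) / len(gaps)


# The two classes occupy swapped positions in the target domain.
source, source_labels = [[0.0, 0.0], [2.0, 0.0]], [0, 1]
target, target_labels = [[2.0, 0.0], [0.0, 0.0]], [0, 1]

holistic = linear_mmd(source, target)
per_class = class_wise_gap(source, source_labels, target, target_labels)
```

Here the holistic measure reports perfect alignment although every class is misaligned, which is the situation the class-level term is designed to detect.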
3.4 Task-specific Domain Adaptation
The general-purpose domain adaptation only enforces similarity in feature distributions, while leaving the relations between samples and task-specific classifiers (i.e., prototypes) unexploited. We therefore devise a second adaptation mechanism, i.e., task-specific adaptation, to reduce sample-level domain discrepancy by aligning the score distributions of different classifiers (i.e., prototypes) across domains for each sample. The rationale of sample-level domain discrepancy is that each source/target sample should be classified correctly by the task-specific classifiers when source and target distributions are well aligned, leading to consistent decisions of classifiers across domains.
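In particular, given each source/target sample $x_i$, three score distributions ($P_i^s$, $P_i^t$ and $P_i^{st}$) are obtained via the three kinds of classifiers (i.e., prototypes $\{\mu_c^s\}$, $\{\mu_c^t\}$ and $\{\mu_c^{st}\}$) learnt on source-only, target-only and source-target data, respectively. To measure the sample-level domain discrepancy, we utilize KL-divergence to evaluate the pairwise distance between the score distributions from different domains. The sample-level discrepancy loss over the source and target samples is defined as
$$\mathcal{L}_T=\frac{1}{N_s+N_t}\sum_{x_i\in\mathcal{D}_s\cup\mathcal{D}_t}\Big(D_{KL}(P_i^s,P_i^t)+D_{KL}(P_i^s,P_i^{st})+D_{KL}(P_i^t,P_i^{st})\Big),\qquad D_{KL}(P,Q)=\tfrac{1}{2}\big(d_{KL}(P\,\|\,Q)+d_{KL}(Q\,\|\,P)\big), \tag{7}$$
where $d_{KL}(\cdot)$ is the KL-divergence factor and $D_{KL}(\cdot)$ is the symmetric pairwise KL-divergence.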
Please note that different from general-purpose domain adaptation which independently matches the prototypes of each class across different domains, task-specific adaptation simultaneously adapts the prototypes of all classes, pursuing similar score distributions over classes of each sample.
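The sample-level discrepancy can be sketched as an average of symmetric KL-divergences between the three score distributions of each sample. The 1/2 weighting in the symmetric form, the small epsilon used to guard the logarithm, and the function names are assumptions of this sketch.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def sym_kl(p, q):
    """Symmetric pairwise KL-divergence (1/2 weighting is an assumption)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def sample_level_discrepancy(scores_s, scores_t, scores_st):
    """Average pairwise symmetric KL over all samples.

    Each argument holds one score distribution per sample, produced by the
    source-only, target-only and source-target prototypes respectively.
    """
    total = 0.0
    for p_s, p_t, p_st in zip(scores_s, scores_t, scores_st):
        total += sym_kl(p_s, p_t) + sym_kl(p_s, p_st) + sym_kl(p_t, p_st)
    return total / len(scores_s)


# One sample scored by three slightly different classifiers.
loss = sample_level_discrepancy([[0.9, 0.1]], [[0.6, 0.4]], [[0.8, 0.2]])
# Identical score distributions give zero discrepancy.
aligned = sample_level_discrepancy([[0.7, 0.3]], [[0.7, 0.3]], [[0.7, 0.3]])
```

Since every class score enters each KL term, a single sample couples the prototypes of all classes at once, in contrast to the per-class matching of the class-level loss.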
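3.5 Optimization
The overall training objective of our TPN integrates the supervised classification loss in Eq.(3) and the multi-granular discrepancy losses (i.e., the class-level discrepancy loss in Eq.(5) and the sample-level discrepancy loss in Eq.(7)). Hence we obtain the following optimization problem:
$$\min_{\theta}\ \mathcal{L}_S+\alpha\,\mathcal{L}_G+\beta\,\mathcal{L}_T, \tag{8}$$
where $\mathcal{L}_S$ is the classification loss of Eq.(3) on labeled source data, $\mathcal{L}_G$ and $\mathcal{L}_T$ are the class-level and sample-level discrepancy losses, and $\alpha$ and $\beta$ are tradeoff parameters. With this overall objective, the crucial goal of the optimization is to learn the deep embedding function $f(\cdot;\theta)$ whose output representations are invariant across domains.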
Training Procedure. To address the optimization problem in Eq.(8), we split the training process into two steps: 1) calculate the classifier (i.e., prototypes $\{\mu_c^s\}$) on the source domain and use it to assign pseudo labels to target samples; 2) calculate the classifiers (i.e., prototypes $\{\mu_c^t\}$ and $\{\mu_c^{st}\}$) on target-only and source-target data, and update $\theta$ by gradient descent on the overall objective function. We alternate the two steps in each training iteration and stop once a convergence criterion is met. Note that to remedy the error of self-labeling, we only assign pseudo labels to the target examples whose maximum scores exceed 0.6, and we resample the target examples for labeling in each training iteration to avoid overfitting to pseudo labels. Furthermore, the training process of our TPN is also resistant to the noise of pseudo labels, since we iteratively utilize both labeled source examples and pseudo-labeled target examples for learning the embedding function. This procedure not only ensures the accuracy in the source domain, but also effectively minimizes class-level and sample-level discrepancy. Such a cycle gradually improves the accuracy in the target domain.
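The confidence filter used when assigning pseudo labels can be sketched as below. Only the 0.6 threshold comes from the text; the function name, the list-based representation of score distributions and the toy values are illustrative assumptions.

```python
def confident_pseudo_labels(score_rows, threshold=0.6):
    """Return (sample_index, label) pairs whose top score exceeds the threshold.

    score_rows holds one score distribution per target sample, as produced by
    the source-prototype classifier.
    """
    kept = []
    for i, scores in enumerate(score_rows):
        label = max(range(len(scores)), key=scores.__getitem__)
        if scores[label] > threshold:
            kept.append((i, label))
    return kept


# Three target samples scored over three classes.
target_scores = [
    [0.70, 0.20, 0.10],   # confident: kept with label 0
    [0.40, 0.35, 0.25],   # ambiguous: discarded for this iteration
    [0.10, 0.05, 0.85],   # confident: kept with label 2
]
selected = confident_pseudo_labels(target_scores)
```

Samples rejected in one iteration can still be selected later, since the target examples are resampled and rescored as the embedding improves.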
Inference. After training TPN, we obtain the deep embedding function $f(\cdot;\theta)$. With this, all three sets of prototypes ($\{\mu_c^s\}$, $\{\mu_c^t\}$ and $\{\mu_c^{st}\}$) are calculated over the whole training set in advance and stored in memory. Any one of the three prototype sets can be utilized as the final classifier for classifying a test target sample at the testing stage. We empirically verified that the performance is not sensitive to the selection of prototypes (the accuracy fluctuates within 0.002 when using different sets of prototypes for the four domain shifts in the experiments), which implicitly reveals the domain invariant characteristic of the learnt feature representation. Hence, given a test target sample, we compute its embedding representation via $f(\cdot;\theta)$ and compare the distances to the prototypes of each class to output the final prediction scores.
3.6 Theoretical Analysis
We formalize the error bound of TPN by extending the theory of learning from different domains of Ben-David et al. As TPN performs training on a mixture of labeled source examples and target samples with pseudo labels, the classification error is naturally considered as the linearly weighted sum of errors in the source and target domains. Denote $y_s$ and $\hat{y}_t$ as the ground truth labels of source examples and the pseudo labels of target samples, respectively, and $h$ as a hypothesis. The error is then formally written as
$$\epsilon_\gamma(h)=\gamma\,\epsilon_t(h,\hat{y}_t)+(1-\gamma)\,\epsilon_s(h,y_s), \tag{9}$$
where $\gamma$ is the tradeoff parameter. The terms $\epsilon_t(h,\hat{y}_t)$ and $\epsilon_s(h,y_s)$ represent the expected error over the sample distribution of the target domain and the source domain with respect to pseudo labels and ground truth labels, respectively.
Next, a valid question is how close the error is to an oracle error that evaluates the classifier learnt on the ground truth labels of the target examples. The closer the two losses are, the more desirable the domain adaptation performs. The following Lemma proves that the difference between the two losses could be bounded for our TPN.
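Lemma 1. Let $h$ be a hypothesis in class $\mathcal{H}$. Then
$$\big|\epsilon_\gamma(h)-\epsilon_t(h,y_t)\big|\le(1-\gamma)\Big(\tfrac{1}{2}\,d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_s,\mathcal{D}_t)+\lambda\Big)+\gamma\rho, \tag{10}$$
where $\epsilon_t(h,y_t)$ is the oracle target error with respect to the ground truth target labels $y_t$, and $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_s,\mathcal{D}_t)$ measures the domain discrepancy in the hypothesis space $\mathcal{H}$. $\rho$ denotes the ratio of target examples with false pseudo labels. $\lambda$ is the combined error in the two domains of the joint ideal hypothesis $h^*$, which is the optimal hypothesis obtained by minimizing the combined error:
$$h^*=\arg\min_{h\in\mathcal{H}}\ \epsilon_s(h,y_s)+\epsilon_t(h,y_t). \tag{11}$$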
Lemma 1 decomposes the bound into three terms: the domain discrepancy measured by the disagreement of hypotheses in the space $\mathcal{H}$, the error $\lambda$ of the ideal joint hypothesis, and the ratio $\rho$ of noise in the pseudo labels. In TPN, the first term is assessed by quantifying the class-level discrepancy of prototypes and the sample-level discrepancy over score distributions across different domains. As stated by Ben-David et al., when the combined error of the joint ideal hypothesis is large, there is no classifier that performs well on both domains. In the most relevant cases for domain adaptation, however, $\lambda$ is usually considered to be negligibly small, and thus the second term can be disregarded. Furthermore, in each iteration, TPN searches for the optimal hypothesis and improves the accuracy of pseudo-label prediction on target examples. The increase of correct pseudo labels in turn benefits the reduction of domain discrepancy. We will empirically verify in Section 4.3 that the third term, the noise in pseudo labels, is iteratively decreased. As such, TPN constantly tightens the bound in Eq.(10).
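To see how the three terms interact, the right-hand side of the bound can be evaluated numerically. The sketch assumes the bound takes the form (1 - gamma) * (d / 2 + lambda) + gamma * rho, obtained by applying the triangle inequality to the weighted error; the divergence value, the tradeoff weight and the later noise ratio used below are hypothetical, and only the initial noise ratio of 0.447 is taken from the reported experiments.

```python
def bound_rhs(gamma, divergence, joint_error, noise_ratio):
    """Right-hand side of the assumed bound on |weighted error - oracle error|.

    gamma       : weight on the pseudo-labeled target error (in [0, 1])
    divergence  : hypothesis-space divergence between the two domains
    joint_error : combined error of the ideal joint hypothesis
    noise_ratio : fraction of target samples with wrong pseudo labels
    """
    return (1.0 - gamma) * (0.5 * divergence + joint_error) + gamma * noise_ratio


# Hypothetical divergence 0.2 and negligible joint error, equal weighting.
initial = bound_rhs(gamma=0.5, divergence=0.2, joint_error=0.0, noise_ratio=0.447)
# Same setting after the pseudo-label noise has dropped (hypothetical value).
later = bound_rhs(gamma=0.5, divergence=0.2, joint_error=0.0, noise_ratio=0.2)
```

With the other terms fixed, reducing the pseudo-label noise lowers the bound linearly with slope gamma, which matches the claim that better pseudo labels tighten the bound.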
4 Experiments
We conduct extensive evaluations of TPN for unsupervised domain adaptation on four domain shifts, including three Digits image transfers across three Digits datasets (i.e., MNIST, USPS and SVHN) and one synthetic-to-real image transfer on the VisDA 2017 dataset.
4.1 Datasets and Experimental Settings
Datasets. The MNIST (M) and USPS (U) image datasets are both handwritten Digits datasets containing 10 classes of digits. The MNIST dataset consists of 70k images and the USPS dataset includes 9.3k images. Unlike the two, the SVHN (S) dataset is a real-world Digits dataset of house numbers in Google street view images and contains 100k cropped Digits images. The VisDA 2017 dataset is the largest synthetic-to-real object classification dataset to date, with over 280k images in the training, validation and testing splits (domains). All three domains share the same 12 object categories. The training domain consists of 152k synthetic images which are generated by rendering 3D models of the same object categories from different angles and under different lighting conditions. The validation domain includes 55k images obtained by cropping objects in real images from COCO. The testing domain contains 72k images cropped from video frames in YT-BB.
Digits Image Transfer. Following prior work, we consider three directions: M → U, U → M and S → M, for unsupervised domain adaptation among Digits datasets. For the transfer between MNIST and USPS, we sample 2k images from MNIST and 1.8k images from USPS. For S → M, the two training sets are fully utilized. In addition, the CNN architecture for the three Digits image transfer tasks is a simply modified version of LeNet (with 2 conv layers), which is also exploited in prior work.
Synthetic-to-Real Image Transfer. The second experiment was conducted on the most challenging synthetic-to-real image transfer task in VisDA 2017. As the annotations of the testing data in VisDA are not publicly available, we take the training data (i.e., synthetic images) as the source domain and the validation data (i.e., cropped COCO images) as the target domain. Moreover, we adopt a 50-layer ResNet pre-trained on ImageNet as our basic CNN structure.
Implementation Details. The two tradeoff parameters $\alpha$ and $\beta$ in Eq.(8) are simply set as 1. A common difficulty in unsupervised domain adaptation is the lack of annotations in the target domain, which prevents such parameters from being well estimated. As such, we directly fix the tradeoff parameters in all the experiments. We strictly follow [2, 30] and set the embedding size $M$ as 10/512 for Digits/synthetic-to-real image transfer. We mainly implement TPN based on Caffe. Specifically, the network weights are trained by ADAM with 0.0005 weight decay and 0.9/0.999 momentum for Digits/synthetic-to-real image transfer. The learning rate and mini-batch size are set as 0.0002/0.00001 and 128/60 for Digits/synthetic-to-real image transfer. The maximum number of training iterations is set as 70k for all the experiments. Moreover, following prior work, we pre-train TPN on labeled source data. For Digits image transfer tasks, we adopt the classification accuracy on the target domain as the evaluation metric. For synthetic-to-real image transfer, we measure the per-category classification accuracy on the target domain. The final metric is the average of the accuracy over all categories.
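The two evaluation metrics differ when classes are imbalanced, as the following sketch shows. The function names and the toy labels are illustrative assumptions.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-category accuracies over all categories."""
    correct, total = {}, {}
    for t, p in zip(y_true, y_pred):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Three samples of class 0 (one misclassified) and one sample of class 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
acc = overall_accuracy(y_true, y_pred)
mean_acc = mean_per_class_accuracy(y_true, y_pred)
```

Averaging per category gives every class equal weight regardless of how many target samples it contains.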
Compared Methods. To empirically verify the merit of our TPN, we compare the following approaches: (1) Source-only directly exploits the classification model trained on the source domain to classify target samples. (2) RevGrad treats domain confusion as a binary classification task and trains the domain discriminator via gradient reversal. (3) DC explores a Domain Confusion loss measured in the domain discriminator for unsupervised domain adaptation. (4) DAN utilizes the multiple kernel variant of MMD to align feature representations from multiple layers. (5) RTN extends DAN by adapting classifiers through a residual transfer module. (6) ADDA designs a unified unsupervised domain adaptation model based on adversarial learning objectives. (7) JAN learns a transfer model by aligning the joint distributions of the network activations of multiple layers across domains. (8) MCD aligns the distributions of source and target domains by utilizing the task-specific decision boundaries. (9) S-En explores the mean teacher variant of temporal ensembling for unsupervised domain adaptation. (10) TPN is the proposal in this paper. Moreover, two slightly different settings of TPN, named TPN$_{gen}$ and TPN$_{task}$, are trained with only general-purpose and only task-specific adaptation, respectively. (11) Train-on-target is an oracle run that trains the classifier on all labeled target samples.
4.2 Performance Comparison
Digits Image Transfer. Table 2(a) shows the performance comparisons on three transfer directions among Digits datasets. Overall, the results across the three adaptations consistently indicate that our proposed TPN achieves superior performance against other state-of-the-art techniques, including MMD based models (DAN, RTN, JAN) and domain discriminator based approaches (RevGrad, DC, ADDA, MCD). In particular, the accuracy of TPN reaches 92.1% and 94.1% on the adaptations M → U and U → M, an absolute improvement over the best competitor ADDA of 2.7% and 4%, respectively, which is generally considered significant progress on the adaptation between MNIST and USPS. It is noteworthy that, compared to JAN, our TPN also evidently improves the classification accuracy on the harder transfer S → M, where the source and target domains are substantially different. The results in general highlight the key importance of exploring both class-level and sample-level domain discrepancy via general-purpose and task-specific adaptation in unsupervised domain adaptation, leading to more domain-invariant feature representations.
The performance of Source-only, which trains the classifier only on labeled source data, can be regarded as a lower bound without domain adaptation. By additionally incorporating a domain adaptation term (MMD/domain discriminator), RevGrad, DC, DAN, RTN, ADDA, JAN and MCD lead to a large performance boost over Source-only, which indicates the advantage of measuring the domain discrepancy/domain confusion over the source and target data. Furthermore, their performance on the harder transfer S → M is much lower than that of our TPN$_{gen}$ and TPN$_{task}$, which exploit the class-level/sample-level domain discrepancy in Prototypical Networks by matching the prototypes across domains for each class and the score distributions of different classifiers (i.e., prototypes) for each sample, respectively. This confirms the effectiveness of leveraging class-level and sample-level domain discrepancy in general-purpose and task-specific adaptation, especially between more distinct domains. For the two easy transfer tasks between MNIST and USPS, TPN$_{task}$ is inferior to ADDA, MCD and TPN$_{gen}$, which indicates that solely matching the score distributions of each sample might inject noise more easily than a domain discriminator or class-level domain discrepancy on transfer tasks across similar domains. In addition, by simultaneously utilizing both general-purpose and task-specific adaptation, our TPN consistently boosts the performance on all three Digits image transfer tasks. The results demonstrate the advantage of jointly leveraging multi-granular domain discrepancy at the class level and the sample level for unsupervised domain adaptation. Note that we exclude the published results of S-En from this comparison, as S-En is originally built with deeper CNNs (i.e., 9 conv layers) on Digits image datasets while our TPN is based on a 2 conv-layer LeNet. When equipped with the same CNNs as S-En, the accuracy of our TPN is boosted to 98.6% on M → U, which is higher than the 98.3% of S-En.
Synthetic-to-Real Image Transfer. The performance comparisons for the synthetic-to-real image transfer task on the VisDA 2017 dataset are summarized in Table 2(b). Here the results of S-En are all reported in the setting with multiple data augmentations (DA). Our TPN performs consistently better than the other runs without any DA involved. In particular, the mean accuracy across all 12 categories reaches 80.4%, an absolute improvement over JAN of 13.9%. Similar to the observations on the hard Digits image transfer S → M, TPN$_{gen}$ and TPN$_{task}$ exhibit better performance than JAN by taking class-level and sample-level domain discrepancy into account for unsupervised domain adaptation. In addition, the two variants do not perform identically (see Table 2(b)), and a larger degree of improvement is attained when both general-purpose and task-specific adaptation are exploited by the full TPN. Please note that the highest accuracy of S-En, 82.8%, is obtained with test-time augmentation (Test-aug), i.e., averaged predictions over 16 different augmentations of each image, while the accuracy of 80.4% of our TPN is obtained with a single model without any DA. When relying on one kind of DA (Mini-aug), S-En only achieves 74.2%, which is still lower than ours.
4.3 Experimental Analysis
Feature Visualization. Figure 2(a)-(b) depict the t-SNE visualizations of features learnt by Source-only and our TPN on the VisDA 2017 dataset (10k samples in each domain). We can see that the distribution of target samples is far from that of source samples for the Source-only run without domain adaptation. Through domain adaptation by TPN, the two distributions are brought closer, making the target distribution indistinguishable from the source one.
Confusion Matrix Visualization. Figure 2(c)-(f) show the visualizations of the confusion matrices for the classifiers learnt by Source-only, JAN, our TPN and Train-on-target on VisDA. Examining the confusion matrix of Source-only reveals that the domain shift is relatively large and that the majority of the confusion is observed between objects with similar 3D structures, e.g., knife & skateboard (sktbrd) and truck & car. Through domain adaptation by JAN and TPN, the confusion is reduced for most classes. In particular, among all 12 categories, TPN achieves higher accuracies than JAN for 10 categories, demonstrating that the features learnt by our TPN are more discriminative on the target domain.
Convergence Analysis. To illustrate the convergence of our TPN, we visualize the evolution of the embedded representation of a subset of the VisDA 2017 dataset (10k samples for each domain) with t-SNE during training. Figure 3(a)-(h) illustrate that the target classes become increasingly well discriminated by the TPN source classifier. Figure 3(i) further shows that the accuracy constantly increases (i.e., the noise of the pseudo labels decreases) and the two adaptation losses decrease as more iterations are performed. Specifically, at the initial time, the ratio of target examples with false pseudo labels is 44.7%, i.e., only 55.3% of target samples are assigned the correct labels. As the training iterations of our TPN increase, the noise of the pseudo labels gradually decreases and the final accuracy is boosted to 80.4% after model convergence. This again verifies that minimizing class-level and sample-level domain discrepancy leads to better adaptation.
5 Conclusions
We have presented Transferrable Prototypical Networks (TPN), which explore domain adaptation in an unsupervised manner. In particular, we study the problem from the viewpoint of both general-purpose and task-specific adaptation. To verify our claim, we have devised a measure for each adaptation in the framework of prototypical networks. The general-purpose adaptation pushes the prototypes of each class computed in each domain to be close in the embedding space, resulting in an invariant representation distribution across domains in general. The task-specific adaptation further takes the decisions of classifiers into account when aligning feature distributions, which ideally leads to domain-invariant representations. Experiments conducted on the transfers across the MNIST, USPS and SVHN datasets validate our proposal and analysis. More remarkably, we achieve new state-of-the-art single-model performance on synthetic-to-real image transfer in the VisDA 2017 challenge.
-  Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 2010.
-  Geoffrey French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for domain adaptation. In ICLR, 2018.
-  Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer series in statistics New York, 2001.
-  Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
-  Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 2012.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
-  Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, 2013.
-  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
-  Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
-  Fuchen Long, Ting Yao, Qi Dai, Xinmei Tian, Jiebo Luo, and Tao Mei. Deep domain adaptation hashing with adversarial learning. In SIGIR, 2018.
-  Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
-  Mingsheng Long, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.
-  Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In NIPS, 2016.
-  Laurens Van Der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.
-  Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In Workshop on Deep Learning and Unsupervised Feature Learning, NIPS, 2011.
-  Sinno Jialin Pan, James T Kwok, and Qiang Yang. Transfer learning via dimensionality reduction. In AAAI, 2008.
-  Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. VisDA: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
-  Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In CVPR, 2017.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
-  Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In ICML, 2017.
-  Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018.
-  Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
-  Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
-  Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, 2017.
-  Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
-  Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
-  Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
-  Riccardo Volpi, Pietro Morerio, Silvio Savarese, and Vittorio Murino. Adversarial feature augmentation for unsupervised domain adaptation. In CVPR, 2018.
-  Ting Yao, Chong-Wah Ngo, and Shiai Zhu. Predicting domain adaptivity: redo or recycle? In ACM MM, 2012.
-  Ting Yao, Yingwei Pan, Chong-Wah Ngo, Houqiang Li, and Tao Mei. Semi-supervised domain adaptation with subspace learning for visual recognition. In CVPR, 2015.
-  Yiheng Zhang, Zhaofan Qiu, Ting Yao, Dong Liu, and Tao Mei. Fully convolutional adaptation networks for semantic segmentation. In CVPR, 2018.