1 Introduction
Deep neural networks have significantly advanced the state-of-the-art performance for various machine learning problems [13, 15] and applications [11, 20, 30]. A common prerequisite of deep neural networks is rich labeled data to train a high-capacity model with sufficient generalization power. Such rich supervision is often prohibitive in real-world applications due to the huge cost of data annotation. Thus, to reduce the labeling cost, there is a strong need to develop versatile algorithms that can leverage rich labeled data from a related source domain. However, this domain adaptation paradigm is hindered by the dataset shift underlying different domains, which forms a major bottleneck to adapting the category models to novel target tasks [29, 36].
A major line of the existing domain adaptation methods bridges different domains by learning domain-invariant feature representations in the absence of target labels, i.e., unsupervised domain adaptation. Existing methods assume that the source and target domains share the same set of class labels [32, 12], which is crucial for directly applying the source-trained classifier to the target domain. Recent studies in deep learning reveal that deep networks can disentangle explanatory factors of variations behind domains [8, 42], thus learning more transferable features to improve domain adaptation significantly. These deep domain adaptation methods typically embed distribution matching modules, including moment matching [38, 21, 22, 23] and adversarial training [10, 39, 37, 24, 17], into deep architectures for end-to-end learning of transferable representations.
Although existing methods can reduce the feature-level domain shift, they assume label spaces across domains are identical. In real-world applications, it is often formidable to find a relevant dataset with the label space identical to the target dataset of interest, which is often unlabeled. A more practical scenario is Partial Domain Adaptation (PDA) [5, 43, 6], which assumes that the source label space is a superspace of the target label space, relaxing the constraint of identical label spaces. PDA enables knowledge transfer from a big domain of many labels to a small domain of few labels. With the emergence of Big Data, large-scale labeled datasets such as ImageNet-1K [31] and Google Open Images [19] are readily accessible to empower data-driven artificial intelligence. These repositories are almost universal to subsume categories of the target domain, making PDA feasible for many applications. PDA can also work in the regime where target data are in limited categories. For example, the functions of proteins are limited. A large database of known protein structures can be collected, which includes all functions. For a new species, proteins have different structures, but their functions are contained in the database. Predicting protein functions for new species thus falls into the PDA problem.
As a generalization of standard domain adaptation, partial domain adaptation is more challenging: the target labels are unknown at training, and there must be many “outlier” source classes that are useless for the target task. This technical challenge is intuitively illustrated in Figure 1, where the target classes (shown in purple and orange) will be forcefully aligned to the outlier source classes (like ‘+’) by existing domain adaptation methods. As a result, negative transfer will happen because the learner migrates harmful knowledge from the source domain to the target domain. Negative transfer is the principal obstacle to the application of domain adaptation techniques [29].
Thus, matching the whole source and target domains as previous methods do [21, 10] is not a safe solution to the PDA problem. We need to develop algorithms versatile enough to transfer useful examples from the many-class dataset (source domain) to the few-class dataset (target domain) while robust enough to irrelevant or outlier examples. Three approaches to partial domain adaptation [5, 43, 6] address PDA by weighing each data point in the domain-adversarial networks, where a domain discriminator is learned to distinguish the source and target. While decreasing the impact of irrelevant examples on domain alignment, they do not undo the negative effect of the outlier classes on the source classifier. Moreover, they evaluate the transferability of source samples without considering the underlying discriminative and multimodal structures. As a result, they remain vulnerable to aligning the features of outlier source classes with those of target classes, giving way to negative transfer.
Towards a safe approach to partial domain adaptation, we present the Example Transfer Network (ETN), which improves the previous work [5, 43, 6] by learning to transfer useful examples. ETN automatically evaluates the transferability of source examples with a transferability quantifier based on their similarities to the target domain, which is used to weigh their contributions to both the source classifier and the domain discriminator. In particular, ETN improves the weight quality over previous work [43] by further revealing the discriminative structure to the transferability quantifier. By this means, irrelevant source examples can be better detected and filtered out. Another key improvement of ETN over the previous methods is the capability to simultaneously confine the source classifier and the domain-adversarial network within the auto-discovered shared label space, thus promoting the positive transfer of relevant examples and mitigating negative transfer of irrelevant examples. Comprehensive experiments demonstrate that our model achieves state-of-the-art results on several benchmark datasets, including Office-31, Office-Home, ImageNet-1K, and Caltech-256.
2 Related Work
Domain Adaptation
Domain adaptation, a special scenario of transfer learning [29], bridges domains of different distributions to mitigate the burden of annotating target data for machine learning [28, 9, 44, 41], computer vision [32, 12, 16] and natural language processing [7]. The main technical difficulty of domain adaptation is to formally reduce the distribution discrepancy across different domains. Deep networks can learn representations that disentangle explanatory factors of variations behind data [3] and manifest invariant factors across different populations. These invariant factors enable knowledge transfer across relevant domains [42]. Deep networks have been extensively explored for domain adaptation [27, 16], yielding significant performance gains against shallow domain adaptation methods.
While deep representations can disentangle complex data distributions, recent advances show that they can only reduce, but not remove, the cross-domain discrepancy [38]. Thus deep learning alone cannot bound the generalization risk for the target task [25, 1]. Recent works bridge deep learning and domain adaptation [38, 21, 10, 39, 22]. They extend deep networks to domain adaptation by adding adaptation layers through which high-order statistics of distributions are explicitly matched [38, 21, 22], or by adding a domain discriminator to distinguish features of the source and target domains, while the features are learned adversarially to deceive the discriminator in a minimax game [10, 39].
Partial Domain Adaptation
While standard domain adaptation advances rapidly, it still relies on the vanilla assumption that the source and target domains share the same label space. This assumption does not hold in partial domain adaptation (PDA), which transfers models from many-class domains to few-class domains. There are three valuable efforts towards the PDA problem. Selective Adversarial Network (SAN) [5] adopts multiple adversarial networks with a weighting mechanism to select out source examples in the outlier classes. Partial Adversarial Domain Adaptation (PADA) [6] improves SAN by employing only one adversarial network and further adding the class-level weight to the source classifier. Importance Weighted Adversarial Nets (IWAN) [43] uses the Sigmoid output of an auxiliary domain classifier (not involved in domain-adversarial training) to derive the probability of a source example belonging to the target domain, which is used to weigh source examples in the domain-adversarial network. These pioneering approaches achieve dramatic performance gains over standard methods in partial domain adaptation tasks.
These efforts mitigate negative transfer caused by outlier source classes and promote positive transfer among shared classes. However, as outlier classes are only selected out for the domain discriminators, the source classifier is still trained with all classes [5], and its performance on shared classes may be degraded by outlier classes. Further, the domain discriminator that IWAN [43] uses to obtain the importance weights distinguishes the source and target domains based only on the feature representations, without exploiting the discriminative information in the source domain. This results in importance weights that are not discriminative enough to distinguish shared classes from outlier classes. This paper proposes an Example Transfer Network (ETN) that further down-weights the irrelevant examples of outlier classes on the source classifier and adopts a discriminative domain discriminator to quantify the example transferability.
Open-Set Domain Adaptation
On par with domain adaptation, research has been dedicated to open set recognition, with the goal to reject outliers while correctly recognizing inliers during testing. Open Set SVM [18] trains a probabilistic SVM and rejects unknown samples by a threshold. Open Set Neural Network [2] generalizes deep neural networks to open set recognition by introducing an OpenMax layer, which estimates the probability of an input being from an unknown class and rejects the unknown point by a threshold. Open Set Domain Adaptation (OSDA) [4, 33] tackles the setting where the training and testing data are from different distributions and label spaces. OSDA methods often assume that the classes shared by the source and target domains are known at training. Unlike OSDA, in our scenario, target classes are entirely unknown at training. It is interesting to extend our work to the open set scenario under the generic assumption that all target classes are unknown.
3 Example Transfer Network
The scenario of partial domain adaptation (PDA) [5] constitutes a source domain $\mathcal{D}_s = \{(\mathbf{x}_i^s, \mathbf{y}_i^s)\}_{i=1}^{n_s}$ of $n_s$ labeled examples associated with $|\mathcal{C}_s|$ classes and a target domain $\mathcal{D}_t = \{\mathbf{x}_j^t\}_{j=1}^{n_t}$ of $n_t$ unlabeled examples drawn from $|\mathcal{C}_t|$ classes. Note that in PDA the source domain label space $\mathcal{C}_s$ is a superspace of the target domain label space $\mathcal{C}_t$, i.e. $\mathcal{C}_s \supset \mathcal{C}_t$. The source and target domains are drawn from different probability distributions $p$ and $q$ respectively. Besides $p \neq q$ as in standard domain adaptation, we further have $p_{\mathcal{C}_t} \neq q$ in partial domain adaptation, where $p_{\mathcal{C}_t}$ denotes the distribution of the source domain data in label space $\mathcal{C}_t$. The goal of PDA is to learn a deep network that enables end-to-end training of a transferable feature extractor $G_f$ and an adaptive classifier $G_y$ to sufficiently close the distribution discrepancy across domains and bound the target risk $\Pr_{(\mathbf{x}, \mathbf{y}) \sim q}\left[G_y(G_f(\mathbf{x})) \neq \mathbf{y}\right]$.
We incur deteriorated performance when directly applying the source classifier $G_y$ trained with standard domain adaptation methods to the target domain. In partial domain adaptation, it is difficult to identify which part of the source label space $\mathcal{C}_s$ is shared with the target label space $\mathcal{C}_t$ because the target domain is fully unlabeled and $\mathcal{C}_t$ is unknown at the training stage. Under this condition, most existing deep domain adaptation methods [21, 10, 39, 22] are prone to negative transfer, a degenerate case where the classifier with adaptation performs even worse than the classifier without adaptation. Negative transfer happens because they assume that the source and target domains have identical label spaces and match the whole distributions $p$ and $q$, even though $p_{\mathcal{C}_s \setminus \mathcal{C}_t}$ and $q$ are non-overlapping and cannot be matched in principle. Thus, decreasing the negative effect of the source examples in the outlier label space $\mathcal{C}_s \setminus \mathcal{C}_t$ is the key to mitigating negative transfer in partial domain adaptation. Besides, we also need to reduce the distribution shift across $p_{\mathcal{C}_t}$ and $q$ to enhance positive transfer in the shared label space $\mathcal{C}_t$ as before. Note that the irrelevant source examples may come from both outlier classes and shared classes, thus requiring a versatile algorithm to identify them.
3.1 Transferability Weighting Framework
The key technical problem of domain adaptation is to reduce the distribution shift between the source and target domains. Domain adversarial networks [10, 39] tackle this problem by learning transferable features in a two-player minimax game: the first player is a domain discriminator trained to distinguish the feature representations of the source domain from the target domain, and the second player is a feature extractor trained simultaneously to deceive the domain discriminator.
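Specifically, the domain-invariant features are learned in a minimax optimization procedure: the parameters $\theta_f$ of the feature extractor $G_f$ are trained by maximizing the loss of the domain discriminator $G_d$, while the parameters $\theta_d$ of the domain discriminator $G_d$ are trained by minimizing the same loss. Note that our goal is to learn a source classifier $G_y$ that transfers to the target, hence the loss of the source classifier $G_y$ is also minimized. This leads to the optimization problem proposed in [10]:
$$E(\theta_f, \theta_y, \theta_d) = \frac{1}{n_s} \sum_{\mathbf{x}_i \in \mathcal{D}_s} L_y\big(G_y(G_f(\mathbf{x}_i)), \mathbf{y}_i\big) - \frac{1}{n_a} \sum_{\mathbf{x}_i \in \mathcal{D}_a} L_d\big(G_d(G_f(\mathbf{x}_i)), d_i\big), \tag{1}$$
where $\mathcal{D}_a = \mathcal{D}_s \cup \mathcal{D}_t$ is the union of the source and target domains and $n_a = |\mathcal{D}_a|$, $d_i$ is the domain label, and $L_y$ and $L_d$ are the cross-entropy loss functions.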
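The domain-adversarial objective described above can be sketched numerically as the mean source classification loss minus the mean domain discrimination loss. This is only an illustrative sketch in plain Python: the function names are hypothetical, the inputs are already-computed probabilities rather than network outputs, and the convention that the domain label is 1 for source and 0 for target is an assumption.

```python
import math


def cross_entropy(class_probs, label):
    """Multi-class cross-entropy for one example: -log p[label]."""
    return -math.log(class_probs[label])


def binary_cross_entropy(p_source, d):
    """Domain loss for one example.

    p_source: predicted probability that the example is from the source.
    d: domain label, assumed 1 for source and 0 for target.
    """
    return -(d * math.log(p_source) + (1 - d) * math.log(1.0 - p_source))


def dann_objective(class_probs_s, labels_s, domain_probs_a, domain_labels_a):
    """Mean classifier loss on the source minus mean domain loss on D_a."""
    l_y = sum(cross_entropy(p, y) for p, y in zip(class_probs_s, labels_s))
    l_y /= len(labels_s)
    l_d = sum(binary_cross_entropy(p, d)
              for p, d in zip(domain_probs_a, domain_labels_a))
    l_d /= len(domain_labels_a)
    return l_y - l_d


if __name__ == "__main__":
    # One source example, one target example, fully confused discriminator.
    print(dann_objective([[0.9, 0.1]], [0], [0.5, 0.5], [1, 0]))
```

The feature extractor and classifier would minimize this quantity while the discriminator maximizes it (equivalently, minimizes its own loss), which is the two-player game described in the text.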
While domain adversarial networks yield reliable results for standard domain adaptation, they will incur performance degeneration on partial domain adaptation where $\mathcal{C}_t \subset \mathcal{C}_s$. This degeneration is caused by the outlier classes $\mathcal{C}_s \setminus \mathcal{C}_t$ in the source domain, which are undesirably matched to the target classes $\mathcal{C}_t$. Due to the domain gap, even the source examples in the shared label space $\mathcal{C}_t$ may not transfer well to the target domain. As a consequence, we need to design a new framework for partial domain adaptation.
This paper presents a new transferability weighting framework to address the technical difficulties of partial domain adaptation. Denote by $w(\mathbf{x}_i^s)$ the weight of each source example $\mathbf{x}_i^s$, which quantifies the example’s transferability. Then for a source example with a larger weight, we should increase its contribution to the final model to enhance positive transfer; otherwise, we should decrease its contribution to mitigate negative transfer. IWAN [43], a previous work for partial domain adaptation, reweighs the source examples in the loss of the domain discriminator $G_d$. We further put the weights in the loss of the source classifier $G_y$. This significantly enhances our ability to diminish the irrelevant source examples that deteriorate our final model.
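Furthermore, because the target labels are unknown, the shared classes are hard to identify, which makes partial domain adaptation more difficult. We thus believe that the exploitation of unlabeled target examples by semi-supervised learning is also indispensable. We make use of the entropy minimization principle [14]. Let $\hat{\mathbf{y}}_j^t = G_y(G_f(\mathbf{x}_j^t)) \in \mathbb{R}^{|\mathcal{C}_s|}$; the entropy loss to quantify the uncertainty of a target example’s predicted label is $H(\hat{\mathbf{y}}_j^t) = -\sum_{c=1}^{|\mathcal{C}_s|} \hat{y}_{j,c}^t \log \hat{y}_{j,c}^t$.
The transferability weighting framework is shown in Figure 2. By weighting the losses of the source classifier $G_y$ and the domain discriminator $G_d$ using the transferability $w(\mathbf{x}_i^s)$ of each source example, and combining the entropy minimization criterion, we achieve the following objective:
$$E_{G_y} = \frac{1}{n_s} \sum_{i=1}^{n_s} w(\mathbf{x}_i^s)\, L\big(G_y(G_f(\mathbf{x}_i^s)), \mathbf{y}_i^s\big) + \frac{\gamma}{n_t} \sum_{j=1}^{n_t} H\big(G_y(G_f(\mathbf{x}_j^t))\big), \tag{2}$$
$$E_{G_d} = -\frac{1}{n_s} \sum_{i=1}^{n_s} w(\mathbf{x}_i^s) \log\big(G_d(G_f(\mathbf{x}_i^s))\big) - \frac{1}{n_t} \sum_{j=1}^{n_t} \log\big(1 - G_d(G_f(\mathbf{x}_j^t))\big), \tag{3}$$
where $\gamma$ is a hyperparameter to trade off the labeled source examples and unlabeled target examples.
The transferability weighting framework can be trained end-to-end by a minimax optimization procedure as follows, yielding a saddle point solution $(\hat{\theta}_f, \hat{\theta}_y, \hat{\theta}_d)$:
$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E_{G_y} - E_{G_d}, \qquad (\hat{\theta}_d) = \arg\min_{\theta_d} E_{G_d}. \tag{4}$$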
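To make the role of the transferability weights concrete, the weighted source-classifier objective (weighted cross-entropy plus an entropy term on target predictions) and the weighted domain-discriminator objective can be sketched on precomputed probabilities. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the trade-off hyperparameter is passed as `gamma`, and the discriminator is assumed to output the probability of an example being from the source domain.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def weighted_classifier_objective(weights, class_probs_s, labels_s,
                                  class_probs_t, gamma):
    """Weighted source cross-entropy plus gamma times mean target entropy."""
    src = sum(w * -math.log(p[y])
              for w, p, y in zip(weights, class_probs_s, labels_s))
    src /= len(labels_s)
    tgt = sum(entropy(p) for p in class_probs_t) / len(class_probs_t)
    return src + gamma * tgt


def weighted_discriminator_objective(weights, d_probs_s, d_probs_t):
    """Weighted log-loss on source examples plus log-loss on target examples.

    d_probs_*: discriminator outputs, assumed to be P(example is source).
    """
    src = -sum(w * math.log(d) for w, d in zip(weights, d_probs_s))
    src /= len(d_probs_s)
    tgt = -sum(math.log(1.0 - d) for d in d_probs_t) / len(d_probs_t)
    return src + tgt
```

A source example with weight zero contributes nothing to either objective, which is how irrelevant examples are filtered out of both the classifier and the discriminator.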
3.2 Example Transferability Quantification
With the proposed transferability weighting framework in Equations (2) and (3), the key technical problem is how to quantify the transferability $w(\mathbf{x}_i^s)$ of each source example. We introduce an auxiliary domain discriminator $\tilde{G}_d$, which is also trained to distinguish the representations of the source domain from the target domain, using a loss similar to Equation (3) but dropping $w(\mathbf{x}_i^s)$. It is not involved in the adversarial training procedure, i.e., the features are not learned to confuse $\tilde{G}_d$. Such an auxiliary domain discriminator can roughly quantify the transferability of the source examples, through the Sigmoid probability of classifying each source example to the target domain.
Such an auxiliary domain discriminator discriminates source and target domains based on the assumption that source examples of shared classes $\mathcal{C}_t$ are closer to the target domain than source examples in the outlier classes $\mathcal{C}_s \setminus \mathcal{C}_t$, thus having a higher probability of being predicted as from the target domain. However, the auxiliary domain discriminator only distinguishes the source and target examples based on domain information. There is potentially only a small gap between $\tilde{G}_d$’s outputs for transferable and irrelevant source examples, especially when $\tilde{G}_d$ is trained well. So the model is still exposed to the risk of mixing up the transferable and irrelevant source examples, yielding unsatisfactory transferability measures $w(\mathbf{x}_i^s)$. In partial domain adaptation, the source examples in $\mathcal{C}_t$ differ from those in $\mathcal{C}_s \setminus \mathcal{C}_t$ mainly in that $\mathcal{C}_t$ is shared with the target domain while $\mathcal{C}_s \setminus \mathcal{C}_t$ has no overlap with the target domain. Thus, it is natural to integrate discriminative information into our weight design to resolve the ambiguity between shared and outlier classes.
Inspired by AC-GANs [26] that integrate the label information into the discriminator, we aim to integrate the label information into the auxiliary domain discriminator $\tilde{G}_d$. However, we hope to develop a transferability measure with both the discriminative information and domain information to generate clearly separable weights for source data in $\mathcal{C}_t$ and $\mathcal{C}_s \setminus \mathcal{C}_t$ respectively. Thus, we add an auxiliary label predictor $\tilde{G}_y$ with leaky-softmax activation. Within $\tilde{G}_y$, the features from the feature extractor $G_f$ are transformed to a $|\mathcal{C}_s|$-dimensional vector $\mathbf{z}$. Then $\mathbf{z}$ is passed through a leaky-softmax activation as follows,
$$\tilde{\sigma}(\mathbf{z}) = \frac{\exp(\mathbf{z})}{|\mathcal{C}_s| + \sum_{c=1}^{|\mathcal{C}_s|} \exp(z_c)}, \tag{5}$$
where $z_c$ is the $c$-th dimension of $\mathbf{z}$. The leaky-softmax has the property that the element-sum of its outputs is smaller than $1$; when the logit $z_c$ of class $c$ is very large, the probability to classify an example as class $c$ is high. As the auxiliary label predictor is trained on source examples and labels, the source examples will have a higher probability of being classified as a specific source class $c$, while the target examples will have smaller logits and uncertain predictions. Therefore, the element-sum of the leaky-softmax outputs is closer to $1$ for source examples and closer to $0$ for target examples. If we define $\tilde{G}_d$ as
$$\tilde{G}_d(G_f(\mathbf{x}_i)) = \sum_{c=1}^{|\mathcal{C}_s|} \tilde{G}_y^c(G_f(\mathbf{x}_i)), \tag{6}$$
where $\tilde{G}_y^c(G_f(\mathbf{x}_i))$ is the probability of each example $\mathbf{x}_i$ belonging to class $c$, then $\tilde{G}_d(G_f(\mathbf{x}_i))$ can be seen as computing the probability of each example belonging to the source domain. For a source example, the smaller the value of $\tilde{G}_d(G_f(\mathbf{x}_i))$ is, the more probable that it comes from the target domain, meaning that it is closer to the target domain and more likely to be in the shared label space $\mathcal{C}_t$. Thus, the output of $\tilde{G}_d$ is suitable for transferability quantification.
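The behaviour of the leaky-softmax can be illustrated with a small numerical sketch. It assumes that the leaky-softmax is a standard softmax whose denominator is increased by the number of source classes, so that the outputs sum to less than one; the function names `leaky_softmax` and `source_probability` are hypothetical.

```python
import math


def leaky_softmax(z):
    """Assumed form: exp(z_c) / (K + sum_k exp(z_k)), with K = len(z).

    Numerator and denominator are both scaled by exp(-m) for numerical
    stability; m >= 0, so exp(-m) never overflows.
    """
    m = max(0.0, max(z))
    exps = [math.exp(v - m) for v in z]
    denom = len(z) * math.exp(-m) + sum(exps)
    return [e / denom for e in exps]


def source_probability(z):
    """Element-sum of the leaky-softmax outputs (auxiliary discriminator)."""
    return sum(leaky_softmax(z))


if __name__ == "__main__":
    # A confident, source-like prediction versus small, target-like logits.
    print(source_probability([10.0, 0.0, 0.0]))
    print(source_probability([-10.0, -10.0, -10.0]))
```

Under this assumed form, one large logit pushes the element-sum towards one, while uniformly small logits push it towards zero, which matches the source-versus-target behaviour described in the text.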
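We train the auxiliary label predictor $\tilde{G}_y$ with the leaky-softmax by a multitask loss over $|\mathcal{C}_s|$ one-vs-rest binary classification tasks for the $|\mathcal{C}_s|$-class classification problem:
$$E_{\tilde{G}_y} = -\frac{\lambda}{n_s} \sum_{i=1}^{n_s} \sum_{c=1}^{|\mathcal{C}_s|} \Big[ y_{i,c}^s \log \tilde{G}_y^c(G_f(\mathbf{x}_i^s)) + (1 - y_{i,c}^s) \log\big(1 - \tilde{G}_y^c(G_f(\mathbf{x}_i^s))\big) \Big], \tag{7}$$
where $y_{i,c}^s$ denotes whether class $c$ is the ground-truth label for source example $\mathbf{x}_i^s$, and $\lambda$ is a hyperparameter. We also train the auxiliary domain discriminator $\tilde{G}_d$ to distinguish the features of the source domain and the target domain as
$$E_{\tilde{G}_d} = -\frac{1}{n_s} \sum_{i=1}^{n_s} \log\big(\tilde{G}_d(G_f(\mathbf{x}_i^s))\big) - \frac{1}{n_t} \sum_{j=1}^{n_t} \log\big(1 - \tilde{G}_d(G_f(\mathbf{x}_j^t))\big). \tag{8}$$
From Equations (6) to (8), we observe that the outputs of the auxiliary domain discriminator $\tilde{G}_d$ depend on the outputs of the auxiliary label predictor $\tilde{G}_y$. This guarantees that $\tilde{G}_d$ is trained with both label and domain information, resolving the ambiguity between shared and outlier classes to better quantify the example transferability.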
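The two auxiliary losses can be sketched on precomputed probabilities as follows. This is an illustrative sketch only: the function names are hypothetical, `lam` stands for the hyperparameter of the multitask loss, and the per-class probabilities are assumed to come from a leaky-softmax so that each lies strictly between zero and one.

```python
import math


def one_vs_rest_loss(class_probs, label, lam):
    """lam-scaled sum of per-class binary cross-entropies, one example."""
    total = 0.0
    for c, p in enumerate(class_probs):
        y = 1.0 if c == label else 0.0
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return lam * total


def auxiliary_domain_loss(src_sums, tgt_sums):
    """Log-loss of the auxiliary discriminator.

    src_sums / tgt_sums: element-sums of the label predictor's outputs for
    source and target examples, read as P(example is source).
    """
    src = -sum(math.log(s) for s in src_sums) / len(src_sums)
    tgt = -sum(math.log(1.0 - t) for t in tgt_sums) / len(tgt_sums)
    return src + tgt
```

Because the discriminator output is just the sum of the label predictor's outputs, minimizing the second loss also shapes the label predictor, which is how label and domain information are coupled.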
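Finally, with the help of the auxiliary label predictor $\tilde{G}_y$ and the auxiliary domain discriminator $\tilde{G}_d$, we can derive more accurate and discriminative weights to quantify the transferability of each source example as
$$w(\mathbf{x}_i^s) = 1 - \tilde{G}_d(G_f(\mathbf{x}_i^s)). \tag{9}$$
Since the outputs of $\tilde{G}_d$ for source examples are closer to $1$, implying very small weights, we normalize the weights in each mini-batch of batch size $B$ as $w(\mathbf{x}) \leftarrow w(\mathbf{x}) \big/ \big(\frac{1}{B} \sum_{i=1}^{B} w(\mathbf{x}_i)\big)$.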
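The weight computation and per-batch normalization can be sketched as below, assuming the weight is one minus the auxiliary discriminator's source probability and that normalization divides by the mini-batch mean. The function name `normalized_weights` is hypothetical.

```python
def normalized_weights(source_probs):
    """Turn auxiliary-discriminator outputs into mean-one batch weights.

    source_probs: outputs of the auxiliary discriminator for one
    mini-batch of source examples, read as P(example is source).
    """
    raw = [1.0 - p for p in source_probs]
    mean = sum(raw) / len(raw)
    if mean == 0.0:
        # Degenerate batch: every example looks purely source-like.
        return raw
    return [w / mean for w in raw]


if __name__ == "__main__":
    # Examples that look less source-like receive larger weights.
    print(normalized_weights([0.9, 0.7, 0.99]))
```

After normalization the weights in a batch average to one, so the overall scale of the weighted losses stays comparable to the unweighted case.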
3.3 Minimax Optimization Problem
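With the aforementioned derivation, we now formulate our final model, Example Transfer Network (ETN). We unify the transferability weighting framework in Equations (2)–(3) and the example transferability quantification in Equations (6)–(9). Denoting by $\theta_{\tilde{y}}$ the parameters of the auxiliary label predictor $\tilde{G}_y$, the proposed ETN model can be solved by a minimax optimization problem that finds saddle-point solutions $\hat{\theta}_f$, $\hat{\theta}_y$, $\hat{\theta}_d$ and $\hat{\theta}_{\tilde{y}}$ to model parameters as follows,
$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E_{G_y} - E_{G_d}, \qquad (\hat{\theta}_d) = \arg\min_{\theta_d} E_{G_d}, \qquad (\hat{\theta}_{\tilde{y}}) = \arg\min_{\theta_{\tilde{y}}} E_{\tilde{G}_d} + E_{\tilde{G}_y}. \tag{10}$$
ETN enhances partial domain adaptation by learning to transfer relevant examples and diminish outlier examples for both the source classifier $G_y$ and the domain discriminator $G_d$. It exploits progressive weighting schemes from the auxiliary domain discriminator $\tilde{G}_d$ and the auxiliary label predictor $\tilde{G}_y$, which quantify the transferability of source examples well.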
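Table 1: Classification accuracy (%) on the twelve partial domain adaptation tasks of Office-Home (ResNet-50).
| Method | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet [15] | 46.33 | 67.51 | 75.87 | 59.14 | 59.94 | 62.73 | 58.22 | 41.79 | 74.88 | 67.40 | 48.18 | 74.17 | 61.35 |
| DANN [10] | 43.76 | 67.90 | 77.47 | 63.73 | 58.99 | 67.59 | 56.84 | 37.07 | 76.37 | 69.15 | 44.30 | 77.48 | 61.72 |
| ADDA [37] | 45.23 | 68.79 | 79.21 | 64.56 | 60.01 | 68.29 | 57.56 | 38.89 | 77.45 | 70.28 | 45.23 | 78.32 | 62.82 |
| RTN [22] | 49.31 | 57.70 | 80.07 | 63.54 | 63.47 | 73.38 | 65.11 | 41.73 | 75.32 | 63.18 | 43.57 | 80.50 | 63.07 |
| IWAN [43] | 53.94 | 54.45 | 78.12 | 61.31 | 47.95 | 63.32 | 54.17 | 52.02 | 81.28 | 76.46 | 56.75 | 82.90 | 63.56 |
| SAN [5] | 44.42 | 68.68 | 74.60 | 67.49 | 64.99 | 77.80 | 59.78 | 44.72 | 80.07 | 72.18 | 50.21 | 78.66 | 65.30 |
| PADA [6] | 51.95 | 67.00 | 78.74 | 52.16 | 53.78 | 59.03 | 52.61 | 43.22 | 78.79 | 73.73 | 56.60 | 77.09 | 62.06 |
| ETN | 59.24 | 77.03 | 79.54 | 62.92 | 65.73 | 75.01 | 68.29 | 55.37 | 84.37 | 75.72 | 57.66 | 84.54 | 70.45 |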
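Table 2: Classification accuracy (%) on Office-31 and ImageNet-Caltech (ResNet-50). I→C denotes ImageNet→Caltech and C→I denotes Caltech→ImageNet.
| Method | A→W | D→W | W→D | A→D | D→A | W→A | Avg (Office-31) | I→C | C→I | Avg (ImageNet-Caltech) |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet [15] | 75.59±1.09 | 96.27±0.85 | 98.09±0.74 | 83.44±1.12 | 83.92±0.95 | 84.97±0.86 | 87.05±0.94 | 69.69±0.78 | 71.29±0.74 | 70.49±0.76 |
| DAN [21] | 59.32±0.49 | 73.90±0.38 | 90.45±0.36 | 61.78±0.56 | 74.95±0.67 | 67.64±0.29 | 71.34±0.46 | 71.30±0.46 | 60.13±0.50 | 65.72±0.48 |
| DANN [10] | 73.56±0.15 | 96.27±0.26 | 98.73±0.20 | 81.53±0.23 | 82.78±0.18 | 86.12±0.15 | 86.50±0.20 | 70.80±0.66 | 67.71±0.76 | 69.23±0.71 |
| ADDA [37] | 75.67±0.17 | 95.38±0.23 | 99.85±0.12 | 83.41±0.17 | 83.62±0.14 | 84.25±0.13 | 87.03±0.16 | 71.82±0.45 | 69.32±0.41 | 70.57±0.43 |
| RTN [22] | 78.98±0.55 | 93.22±0.52 | 85.35±0.47 | 77.07±0.49 | 89.25±0.39 | 89.46±0.37 | 85.56±0.47 | 75.50±0.29 | 66.21±0.31 | 70.85±0.30 |
| IWAN [43] | 89.15±0.37 | 99.32±0.32 | 99.36±0.24 | 90.45±0.36 | 95.62±0.29 | 94.26±0.25 | 94.69±0.31 | 78.06±0.40 | 73.33±0.46 | 75.70±0.43 |
| SAN [5] | 93.90±0.45 | 99.32±0.52 | 99.36±0.12 | 94.27±0.28 | 94.15±0.36 | 88.73±0.44 | 94.96±0.36 | 77.75±0.36 | 75.26±0.42 | 76.51±0.39 |
| PADA [6] | 86.54±0.31 | 99.32±0.45 | 100.0±0.00 | 82.17±0.37 | 92.69±0.29 | 95.41±0.33 | 92.69±0.29 | 75.03±0.36 | 70.48±0.44 | 72.76±0.40 |
| ETN | 94.52±0.20 | 100.0±0.00 | 100.0±0.00 | 95.03±0.22 | 96.21±0.27 | 94.64±0.24 | 96.73±0.16 | 83.23±0.24 | 74.93±0.28 | 79.08±0.26 |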
Table 3: Classification accuracy (%) on Office-31 (VGG).
| Method | A→W | D→W | W→D | A→D | D→A | W→A | Avg |
|---|---|---|---|---|---|---|---|
| VGG [34] | 60.34±0.84 | 97.97±0.63 | 99.36±0.36 | 76.43±0.48 | 72.96±0.56 | 79.12±0.54 | 81.03±0.57 |
| DAN [21] | 58.78±0.43 | 85.86±0.32 | 92.78±0.28 | 54.76±0.44 | 55.42±0.56 | 67.29±0.20 | 69.15±0.37 |
| DANN [10] | 50.85±0.12 | 95.23±0.24 | 94.27±0.16 | 57.96±0.20 | 51.77±0.14 | 62.32±0.12 | 68.73±0.16 |
| ADDA [37] | 53.28±0.15 | 94.33±0.18 | 95.36±0.08 | 58.78±0.12 | 50.24±0.10 | 63.34±0.08 | 69.22±0.12 |
| RTN [22] | 69.35±0.42 | 98.42±0.48 | 99.59±0.32 | 75.43±0.38 | 81.45±0.32 | 82.98±0.36 | 84.54±0.38 |
| IWAN [43] | 82.90±0.31 | 79.75±0.26 | 88.53±0.16 | 90.95±0.33 | 89.57±0.24 | 93.36±0.22 | 87.51±0.25 |
| SAN [5] | 83.39±0.36 | 99.32±0.45 | 100.0±0.00 | 90.70±0.20 | 87.16±0.23 | 91.85±0.35 | 92.07±0.27 |
| PADA [6] | 86.05±0.36 | 99.42±0.24 | 100.0±0.00 | 81.73±0.34 | 93.00±0.24 | 95.26±0.27 | 92.54±0.24 |
| ETN | 85.66±0.16 | 100.0±0.00 | 100.0±0.00 | 89.43±0.17 | 95.93±0.23 | 92.28±0.20 | 93.88±0.13 |
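Table 4: Classification accuracy (%) of ETN and its variants on Office-Home (ResNet-50).
| Method | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ETN w/o classifier | 56.18 | 71.93 | 79.32 | 65.11 | 65.57 | 73.66 | 65.47 | 52.90 | 82.88 | 72.93 | 56.93 | 82.91 | 68.93 |
| ETN w/o auxiliary | 48.36 | 50.42 | 79.13 | 56.57 | 45.88 | 65.49 | 56.38 | 49.07 | 77.53 | 75.57 | 58.81 | 78.32 | 61.79 |
| ETN | 59.24 | 77.03 | 79.54 | 62.92 | 65.73 | 75.01 | 68.29 | 55.37 | 84.37 | 75.72 | 57.66 | 84.54 | 70.45 |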
4 Experiments
We conduct experiments to evaluate our approach against state-of-the-art (partial) domain adaptation methods. Codes and datasets will be available at github.com/thuml.
4.1 Setup
Office-31 [32] is the de facto benchmark for domain adaptation. It is relatively small, with 4,652 images in 31 classes. Three domains, namely A, D and W, are collected by downloading from amazon.com (A), taking photos with a DSLR camera (D) and with a web camera (W). Following the protocol in [5], we select images from the 10 categories shared by Office-31 and Caltech-256 to build the new target domain, creating six partial domain adaptation tasks: A→W, D→W, W→D, A→D, D→A and W→A. Note that there are 31 categories in the source domain and 10 categories in the target domain.
Office-Home [40] is a larger dataset, with 4 domains of distinct styles: Artistic, Clip Art, Product and Real-World. Each domain contains images of 65 object categories. Denoting them as Ar, Cl, Pr and Rw, we obtain twelve partial domain adaptation tasks: Ar→Cl, Ar→Pr, Ar→Rw, Cl→Ar, Cl→Pr, Cl→Rw, Pr→Ar, Pr→Cl, Pr→Rw, Rw→Ar, Rw→Cl and Rw→Pr. For PDA, we use images from the first 25 classes in alphabetical order as the target domain and images from all 65 classes as the source domain.
ImageNet-Caltech is a large dataset built with ImageNet-1K [31] and Caltech-256. They share 84 classes, and thus we form two partial domain adaptation tasks: ImageNet (1000)→Caltech (84) and Caltech (256)→ImageNet (84). As most networks are trained on the training set of ImageNet, we use images from the ImageNet validation set as the target domain for the Caltech (256)→ImageNet (84) task.
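The size of the outlier label space in each benchmark follows directly from the class counts given above. The short sketch below only restates that arithmetic; the dictionary layout and the helper name `outlier_classes` are illustrative choices, not part of the experimental protocol.

```python
# (source classes, target classes) for each setting described in the setup.
TASKS = {
    "Office-31": (31, 10),
    "Office-Home": (65, 25),
    "ImageNet->Caltech": (1000, 84),
    "Caltech->ImageNet": (256, 84),
}


def outlier_classes(n_source, n_target):
    """Number of source classes that do not appear in the target domain."""
    return n_source - n_target


if __name__ == "__main__":
    for name, (n_s, n_t) in TASKS.items():
        n_out = outlier_classes(n_s, n_t)
        print(f"{name}: {n_out} outlier classes, "
              f"{n_out / n_s:.1%} of the source label space")
```

In every setting the outlier classes make up the majority of the source label space, and the ImageNet→Caltech task is the most extreme case.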
We compare the proposed ETN model with state-of-the-art deep learning and (partial) domain adaptation methods: ResNet-50 [15], Deep Adaptation Network (DAN) [21], Domain-Adversarial Neural Networks (DANN) [10], Adversarial Discriminative Domain Adaptation (ADDA) [37], Residual Transfer Networks (RTN) [22], Selective Adversarial Network (SAN) [5], Importance Weighted Adversarial Network (IWAN) [43] and Partial Adversarial Domain Adaptation (PADA) [6].
Besides ResNet-50 [15], we also evaluate ETN and some methods based on VGG [34] on the Office-31 dataset. We perform an ablation study to justify the example transfer mechanism, by evaluating two ETN variants: 1) ETN w/o classifier is the variant without weights on the source classifier; 2) ETN w/o auxiliary is the variant without the auxiliary label predictor on the auxiliary domain discriminator.
We implement all methods based on PyTorch, and fine-tune ResNet-50 [15] and VGG [34] pre-trained on ImageNet. New layers are trained from scratch, and their learning rates are 10 times that of the fine-tuned layers. We use mini-batch SGD with momentum of 0.9 and the learning rate decay strategy implemented in DANN [10]: the learning rate is adjusted during SGD using $\eta_p = \frac{\eta_0}{(1 + \alpha p)^{\beta}}$, where $p$ is the training progress linearly changing from $0$ to $1$. The flip-coefficient of the gradient reversal layer is increased gradually from $0$ to $1$ as in DANN [10]. Hyperparameters are optimized with importance weighted cross-validation [35].
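The two schedules borrowed from DANN [10] can be sketched as simple functions of the training progress. The text does not state the constants, so the defaults below (initial rate 0.01, decay constants 10 and 0.75, and a sigmoid-shaped ramp with steepness 10 for the gradient reversal coefficient) are assumptions taken from the DANN paper; the function names are hypothetical.

```python
import math


def learning_rate(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Inverse-power learning rate decay; p is training progress in [0, 1].

    eta0, alpha and beta are assumed defaults, not values from this paper.
    """
    return eta0 / (1.0 + alpha * p) ** beta


def flip_coefficient(p, steepness=10.0):
    """Gradient reversal coefficient ramping smoothly from 0 towards 1.

    The sigmoid-shaped ramp and its steepness are assumed, following DANN.
    """
    return 2.0 / (1.0 + math.exp(-steepness * p)) - 1.0


if __name__ == "__main__":
    for step in range(0, 11, 5):
        p = step / 10.0
        print(p, learning_rate(p), flip_coefficient(p))
```

Starting the flip coefficient at zero suppresses the noisy discriminator signal early in training, when the discriminator has not yet learned anything useful.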
4.2 Results
The classification results based on ResNet-50 on the twelve tasks of Office-Home, the six tasks of Office-31 and the two large-scale tasks of ImageNet-Caltech are shown in Tables 1 and 2. We also compare all methods on Office-31 with the VGG backbone in Table 3. ETN outperforms all other methods w.r.t. average accuracy, showing that ETN performs well with different base networks on different datasets.
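The Avg column of the result tables is consistent with an unweighted mean over tasks. As a sanity check, the sketch below recomputes the average for the ETN row of the Office-Home results reported in this paper; the helper name `mean_accuracy` is hypothetical.

```python
# ETN accuracies (%) on the twelve Office-Home tasks, in table order.
ETN_OFFICE_HOME = [
    59.24, 77.03, 79.54, 62.92, 65.73, 75.01,
    68.29, 55.37, 84.37, 75.72, 57.66, 84.54,
]


def mean_accuracy(accuracies):
    """Unweighted mean over tasks, rounded to two decimals."""
    return round(sum(accuracies) / len(accuracies), 2)


if __name__ == "__main__":
    print(mean_accuracy(ETN_OFFICE_HOME))
```

The same check can be applied to any other row to confirm that its average matches the per-task numbers.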
Specifically, we have several observations. 1) ADDA, DANN, and DAN outperform ResNet only on some tasks, implying that they suffer from the negative transfer issue. 2) RTN exploits the entropy minimization criterion to amend itself with semi-supervised learning. Thus, it has some improvement over ResNet but still suffers from negative transfer for some tasks. 3) Partial domain adaptation methods (SAN [5] and IWAN [43]) perform better than ResNet and other domain adaptation methods on most tasks, due to their weighting mechanism to mitigate negative transfer caused by outlier classes and promote positive transfer among shared classes. 4) ETN outperforms SAN and IWAN on most tasks, showing its power to discriminate the outlier classes from the shared classes accurately and to transfer relevant examples.
In particular, ETN outperforms SAN and IWAN by a much larger margin on the large-scale ImageNet-Caltech dataset, indicating that ETN is more robust to outlier classes and performs better even on datasets with a large number of outlier classes ($916$ in ImageNet→Caltech) relative to the shared classes ($84$ in ImageNet→Caltech). ETN has two advantages: learning discriminative weights and filtering outlier classes out from both the source classifier and the domain discriminator, which boost partial domain adaptation performance.
We inspect the efficacy of different modules by comparing the results of ETN variants in Table 4. 1) ETN outperforms ETN w/o classifier, showing that the weighting mechanism on the source classifier can reduce the negative influence of outlier-class examples and focus the source classifier on the examples belonging to the target label space. 2) ETN also outperforms ETN w/o auxiliary by a larger margin, showing that the auxiliary classifier can inject label information into the domain discriminator to yield discriminative weights, which in turn enables ETN to filter out irrelevant examples.
4.3 Analysis
Feature Visualization: We plot in Figure 3 the t-SNE embeddings [8] of the features learned by DANN, SAN, IWAN and ETN on A (31 classes)→W (10 classes) with class information in the target domain. We observe that features learned by DANN, IWAN, and SAN are not clustered as clearly as those of ETN, indicating that ETN can better discriminate target examples than the compared methods.
Class Overlap: We conduct a wide range of partial domain adaptation experiments with different numbers of target classes. Figure 4 shows that when the number of target classes decreases, the performance of DANN degrades quickly, implying that negative transfer becomes more severe when the label space overlap becomes smaller. The performance of SAN decreases slowly and stably, indicating that SAN potentially eliminates the influence of outlier classes. IWAN only performs better than DANN when the label space non-overlap is very large and negative transfer is very severe. ETN performs stably and consistently better than all compared methods, showing the advantage of ETN for partial domain adaptation. ETN also performs better than DANN in standard domain adaptation when the label spaces totally overlap, implying that the weighting mechanism will not degrade performance when there are no outlier classes.
Convergence Performance: As shown in Figure 5, the test errors of all methods converge fast, but the baselines converge to high error rates. Only ETN converges to the lowest test error. This implies that ETN can be trained more efficiently and stably than previous domain adaptation methods.
Weight Visualization: We plot the approximate density function of the weights in Equation (9) generated by IWAN [43] and ETN for all source examples in Figure 6 on task Cl→Pr. The orange curve shows examples in shared classes $\mathcal{C}_t$ and the blue curve shows outlier classes $\mathcal{C}_s \setminus \mathcal{C}_t$. Compared to IWAN, our ETN approach assigns much larger weights to shared classes and much smaller weights to outlier classes. Most examples of outlier classes have nearly zero weights, explaining the strong performance of ETN on these datasets.
5 Conclusion
This paper presented the Example Transfer Network (ETN), a discriminative and robust approach to partial domain adaptation. It quantifies the transferability of source examples by integrating the discriminative information into the transferability quantifier and down-weights the negative influence of the outlier source examples upon both the source classifier and the domain discriminator. Based on the evaluation, our model performs strongly on partial domain adaptation tasks.
Acknowledgements
This work is supported by the National Key R&D Program of China (No. 2016YFB1000701) and the National Natural Science Foundation of China (61772299, 71690231, and 61672313).
References
 [1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1–2):151–175, 2010.

 [2] Abhijit Bendale and Terrance E. Boult. Towards open set deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1563–1572, 2016.
 [3] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(8):1798–1828, 2013.
 [4] P. P. Busto and J. Gall. Open set domain adaptation. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 754–763, 2017.
 [5] Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Michael I. Jordan. Partial transfer learning with selective adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [6] Zhangjie Cao, Lijia Ma, Mingsheng Long, and Jianmin Wang. Partial adversarial domain adaptation. In The European Conference on Computer Vision (ECCV), September 2018.
 [7] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.
 [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.
 [9] L. Duan, I. W. Tsang, and D. Xu. Domain transfer multiple kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(3):465–479, 2012.
 [10] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor S. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17:59:1–59:35, 2016.
 [11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
 [12] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
 [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.
 [14] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems (NeurIPS), pages 529–536. MIT Press, 2005.
 [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [16] J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
 [17] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A. Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 1994–2003, 2018.
 [18] Lalit P. Jain, Walter J. Scheirer, and Terrance E. Boult. Multi-class open set recognition using probability of inclusion. In European Conference on Computer Vision (ECCV), pages 393–409, 2014.
 [19] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Shahab Kamali, Matteo Malloci, Jordi Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017.
 [20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
 [21] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), 2015.
 [22] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 136–144, 2016.
 [23] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2208–2217, 2017.
 [24] Zelun Luo, Yuliang Zou, Judy Hoffman, and Fei-Fei Li. Label efficient learning of transferable representations across domains and tasks. In Advances in Neural Information Processing Systems (NeurIPS), pages 164–176, 2017.

 [25] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Conference on Computational Learning Theory (COLT), 2009.
 [26] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2642–2651, 2017.

 [27] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
 [28] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 22(2):199–210, 2011.
 [29] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(10):1345–1359, 2010.
 [30] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 91–99, 2015.
 [31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [32] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision (ECCV), 2010.

 [33] Kuniaki Saito, Shohei Yamamoto, Yoshitaka Ushiku, and Tatsuya Harada. Open set domain adaptation by backpropagation. In European Conference on Computer Vision (ECCV), September 2018.
 [34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015 (arXiv:1409.1556v6).
 [35] Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research (JMLR), 8(May):985–1005, 2007.
 [36] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528, 2011.
 [37] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [38] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. 2014.
 [39] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Simultaneous deep transfer across domains and tasks. In IEEE International Conference on Computer Vision (ICCV), 2015.
 [40] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [41] X. Wang and J. Schneider. Flexible transfer learning under support and model shift. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
 [42] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems (NeurIPS), 2014.
 [43] Jing Zhang, Zewei Ding, Wanqing Li, and Philip Ogunbona. Importance weighted adversarial nets for partial domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [44] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning (ICML), 2013.