A Sample Selection Approach for Universal Domain Adaptation

by Omri Lifshitz, et al.

We study the problem of unsupervised domain adaptation in the universal scenario, in which only some of the classes are shared between the source and target domains. We present a scoring scheme that is effective in identifying the samples of the shared classes. The score is used to select which samples in the target domain to pseudo-label during training. Another loss term encourages diversity of labels within each batch. Taken together, our method is shown to outperform, by a sizable margin, the current state of the art on the literature benchmarks.




1 Introduction

In real world situations, the necessity of applying domain adaptation is the rule and not the exception, since “no man ever steps in the same river twice”. This is true not only for the input samples, whose distribution is likely to change both because of the shifting setting and due to the practical considerations of collecting training samples, but also with regards to the output labels. In many cases, the classes seen and labeled during training differ from those encountered during the deployment phase.

Unsupervised domain adaptation seeks to learn a classifier in a source domain in which supervised training samples exist, such that it would be effective in a target domain for which only unsupervised samples exist. Universal domain adaptation (UDA) adds the challenge that some of the classes in the source domain do not appear in the target domain and vice versa. Therefore, the classifier, when applied to the target domain, has to classify only to the relevant classes, and also identify the samples that belong to the classes that are unique to the target domain.

The method we propose is based on three losses. The first loss is the conventional domain confusion loss, which encourages the representation of the samples to be domain agnostic. The second one is the pseudo-labeling loss, which is a very common loss in semi-supervised learning and, in particular, in unsupervised domain adaptation. However, the application of pseudo-labels in the UDA setting requires additional care, since labeling every sample is almost guaranteed to lead to adverse results. We, therefore, propose to identify the samples in the target domain for which the labels are likely to be in the set of shared classes.

The third loss is the batch diversity loss, which encourages the predicted samples both from the source domain and from the target domain to be uniformly distributed between the classes, and not all predicted to be from the same limited set of classes. This is especially important in the UDA case, in which the learned model can declare all target samples to be from an unknown class. This trivial decision can be justified by the reasoning that a cat in the source domain is not the same as a cat in the target domain. Naturally, diversity among the labels is more justified for correctly classified samples, and we, therefore, similarly to the pseudo-labeling scheme above, attempt to apply it only for samples from the shared classes.

In our work, the samples from the target domain that are likely to be from the shared classes are identified based on two signals. The first is the certainty of the classifier, assuming that the classifier is more likely to be confused when encountering samples from unseen classes. Second, the samples in the target domain that are more similar to the source domain are likely to be from the shared classes. We, therefore, suggest a scoring scheme that combines the outputs of both the label classifier and the domain classifier.

Our experiments show that using a scoring scheme based on the two aforementioned signals together with the three loss terms improves the state of the art accuracy in the UDA scenario.

Our main contributions are: (i) a direct method for UDA, which employs selective pseudo-labels as the main loss, (ii) encouraging diversity in the labels by the batch diversity loss, (iii) a new sample scoring scheme that outperforms the one previously proposed, and (iv) state of the art results across datasets and benchmarks.

2 Related work

The problem of unsupervised domain adaptation can be divided into four different categories, based on the relation between the label sets of the source and target domains: closed-set, open-set, partial and universal. Closed-set domain adaptation is a scenario where the source and target domains share the same label set. The main challenge in this scenario is to overcome the domain gap that comes as a result of the samples being taken from different distributions. There are two common approaches to the closed-set problem: feature adaptation and generative models. Generative approaches [1, 24, 13, 15, 18, 14, 28] attempt to generate labeled target samples from the source samples. Methods based on CycleGAN [32] generate synthetic target-like samples from the source domain and source-like samples from the target in order to train classifiers on each of the domains [12, 20].

Methods based on feature adaptation aim to reduce the discrepancy between the feature distribution of samples from the source and target domains. In [8], a domain adversarial network is introduced and added to a classifier network with the purpose of creating features that are indiscriminate with respect to a shift between domains, yet still discriminative for the main classification task. By introducing a gradient reversal unit, the feature extractor is trained to produce features that confuse the domain classifier.

Open-set domain adaptation, first proposed by [2], assumes knowledge of the shared label set between the source and target domain, while all private label sets are marked as “unknown”. A modification proposed by [23] requires no data from the private source label set.

Partial-set domain adaptation assumes that the target domain’s label set is a subset of the source’s label set. Cao et al. [3] employ adversarial distribution matching by using a number of domain discriminators together with a weighting scheme at both the class and instance level. Zhang et al. [30] use an adversarial method to identify the source samples that are potentially from the private source label set. The results were further improved by using a single adversarial domain network and down-weighting the data of the source private set for the classifier and the domain adversary during training [4].

Universal domain adaptation was first introduced in [29] and unlike the aforementioned scenarios, it does not assume any prior knowledge about the relation between the source and target label sets.

Pseudo-labels are a simple yet effective tool used in closed-set domain adaptation, in order to learn categorical representation of the target domain [7, 22, 25, 26, 31, 5]. Although the use of pseudo-labels during training can greatly improve the final outcome of the network, the use of false pseudo-labels leads to negative transfer, which is a major concern in UDA.

3 Problem setting

(a) (b) (c)
Figure 1: Possible label set relations. The UDA framework covers all possible relations between the source and target domain. (a) The case of “partial domain”, the target label set is a subset of the source label set and thus all target labels are in the shared set. (b) The “open set” case, in which the source label set is a subset of the target’s. (c) In the general UDA case, the target label set and the source label set intersect, yet both domains have a private label set. In all cases, the shared label set is unknown during training.

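We follow the setting of UDA proposed by [29]. During training, we are provided with a source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}$ of labeled data sampled from distribution $p_s$ and a target domain $\mathcal{D}_t = \{x_i^t\}$ of unlabeled data sampled from distribution $p_t$, which is the marginalization of the distribution of samples and their labels in the target domain. We denote by $\mathcal{C}_s$ ($\mathcal{C}_t$) the label set of the source (target) domain. The shared label set is denoted by $\mathcal{C} = \mathcal{C}_s \cap \mathcal{C}_t$. For convenience, we denote the private label sets of the source and target domain in the following manner: $\bar{\mathcal{C}}_s = \mathcal{C}_s \setminus \mathcal{C}$ and $\bar{\mathcal{C}}_t = \mathcal{C}_t \setminus \mathcal{C}$, respectively.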

As can be seen in Fig. 1, UDA generalizes all other variants of domain adaptation. Namely, the partial-set case in which the target classes are a subset of the source classes (closed-set is a special case of partial-set), and the open-set case in which the source classes are a subset of the target classes. The latter case is the more challenging of the two since some of the target domain samples cannot be adapted to match the samples seen during training.

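The Jaccard index of the label sets of the two domains,

$$\xi = \frac{|\mathcal{C}_s \cap \mathcal{C}_t|}{|\mathcal{C}_s \cup \mathcal{C}_t|},$$

is used to measure the overlap in classes. The objective in the UDA scenario is to create a model $f$ that maximises the target classification accuracy on the shared label set, as well as distinguishes between samples with labels from $\mathcal{C}$ and those in $\bar{\mathcal{C}}_t$, i.e.,

$$\max_f \; \mathbb{E}_{(x, y) \sim q_t} \, \mathbb{1}\left[ f(x) = \tilde{y} \right], \qquad (1)$$

where $q_t$ is the joint distribution of target samples and their labels,

$$\tilde{y} = \begin{cases} y & \text{if } y \in \mathcal{C}, \\ \mu & \text{otherwise,} \end{cases} \qquad (2)$$

and $\mu$ is the symbol used to mark unknown classes not seen in the labeled training set $\mathcal{D}_s$.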

4 Method

(a) (b)
Figure 2: Architecture of the network during training and deployment. During the training stage (a), the score along with the label and domain classification is used to calculate the loss. During the deployment stage (b) the scores are used as a threshold to decide whether the sample is from the shared label set or should be marked as unknown.

The architecture we employ is shown in Fig. 2(a). It consists of a domain classifier $D$, a feature extractor $F$, and a label classifier $G$. By using one adversarial domain classifier $D$, our method is simpler than previous work [29], which uses two domain classifiers.

Input $x$ (from both domains) is fed into the feature extractor $F$, yielding the feature vector $z = F(x)$. $z$ is, in turn, fed to both the domain classifier $D$ and the label classifier $G$. The label classifier outputs the label prediction over the $|\mathcal{C}_s|$ classes from the source domain, $\hat{y}(x) = G(z)$, which is a vector of pseudo probabilities obtained by the softmax function. The adversarial domain classifier yields the probability of the sample being from the source domain, $\hat{d}(x) = D(z)$. The results from both classifiers are used for calculating the sample transfer score and for calculating the losses.

The sample transfer score, $s(x)$, estimates the confidence that $x$ is from the shared label set. The score is calculated using the prediction $\hat{y}(x)$ and the domain classification $\hat{d}(x)$ as detailed below. A higher value of $s(x)$ indicates that the sample $x$ appears to be from the shared label set and that the correct label was identified.

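During the deployment stage, the test sample undergoes the same path as before, but rather than calculating losses, we use the score as a threshold to decide whether we should predict a class or label the sample as $\mu$, the symbol that represents all labels unseen during training. We use a hyper-parameter $w_0$ and output the class label according to the following:

$$y(x) = \begin{cases} \arg\max_k \hat{y}_k(x) & \text{if } s(x) > w_0, \\ \mu & \text{otherwise,} \end{cases} \qquad (3)$$

where $\hat{y}(x)$ is the label classifier's output and $s(x)$ is the sample transfer score.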
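The deployment rule described above can be sketched in a few lines. This is only an illustration: the function name, the `UNKNOWN` marker and the example numbers are hypothetical and are not taken from the authors' implementation.

```python
UNKNOWN = "unknown"  # hypothetical stand-in for the symbol marking unseen classes


def predict_or_reject(class_probs, score, threshold):
    """Return the index of the most probable source class, or UNKNOWN.

    class_probs: softmax output of the label classifier (list of floats).
    score:       sample transfer score of the same sample.
    threshold:   decision threshold (a hyper-parameter).
    """
    if score > threshold:
        # Confident enough: predict the arg-max class.
        return max(range(len(class_probs)), key=class_probs.__getitem__)
    # Otherwise the sample is treated as belonging to an unseen class.
    return UNKNOWN


print(predict_or_reject([0.1, 0.7, 0.2], score=0.6, threshold=0.5))
print(predict_or_reject([0.4, 0.3, 0.3], score=0.2, threshold=0.5))
```

The first call passes the threshold and returns a class index; the second falls below it and is rejected as unknown.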


4.1 The sample transfer score

We define a scoring mechanism that represents the confidence that a sample is from the shared label set $\mathcal{C}$. This score is used in both training and deployment. During the training stage, the scores are used as a threshold for losses on samples from the target domain, as explained in the following sections. During the deployment stage, the scores are used in order to decide whether a sample should be labeled as unknown or predicted as one of the classes in the source label set, as shown in Eq. 3.

The score is a combination of two signals: (i) the confidence in the classification label, as it manifests itself in the vector of pseudo probabilities $\hat{y}$, and (ii) the estimation of the probability of it being in the source domain, as estimated by $\hat{d}$. The usage of the second signal on target domain samples is meant to measure the similarity of these samples to the source domain samples. Naturally, target samples that are more similar to the source domain samples are more likely to be in the shared label set.

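It is reasonable to expect that

$$\mathbb{E}_{x \sim p_t \,\mid\, y \in \mathcal{C}} \, \max_k \hat{y}_k(x) \;>\; \mathbb{E}_{x \sim p_t \,\mid\, y \in \bar{\mathcal{C}}_t} \, \max_k \hat{y}_k(x). \qquad (4)$$

In other words, the maximal value of the pseudo probability can be used as a measure for identifying the target samples that have labels in $\mathcal{C}$. We, therefore, derive the following scoring mechanism for target samples:

$$s(x) = \max_k \hat{y}_k(x) \cdot \hat{d}(x). \qquad (5)$$

Let us notice that as $\hat{d}(x) \in [0, 1]$ (higher values are associated with source samples) and $\max_k \hat{y}_k(x) \in [0, 1]$, it holds that $s(x) \in [0, 1]$.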
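As a rough illustration of how the two signals can be combined, the sketch below assumes the score is the product of the classifier's maximal pseudo probability and the domain classifier's source probability. The product form is an assumption made for this example (it keeps the score in [0, 1]), and the function name is hypothetical.

```python
def sample_transfer_score(class_probs, source_prob):
    """Combine classifier confidence with similarity to the source domain.

    class_probs: softmax output of the label classifier.
    source_prob: domain classifier's probability that the sample is from
                 the source domain (higher means more source-like).
    Assumption: the two signals are multiplied.
    """
    return max(class_probs) * source_prob


# A confident, source-like target sample scores high ...
confident = sample_transfer_score([0.9, 0.05, 0.05], source_prob=0.8)
# ... while an uncertain, target-like sample scores low.
uncertain = sample_transfer_score([0.4, 0.3, 0.3], source_prob=0.3)
print(confident > uncertain)
```

Under this assumption, either a low classifier confidence or a low source probability is enough to pull the score down, so a sample needs both signals to be selected.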

In [29], the authors propose to use a different scoring scheme and use it as a weight for training the second domain classifier they use (we do not employ this component). Their scoring scheme assigns the following scores to target domain samples:

$$w^t(x) = \hat{d}(x) - \frac{H(\hat{y}(x))}{\log |\mathcal{C}_s|}, \qquad (6)$$

where $H(\hat{y})$ is the entropy of vector $\hat{y}$ and $\hat{d}$ is the domain classifier's estimate that the sample is from the source domain. In their work, source domain samples are also scored, by the score $w^s(x)$, while we only select target samples as detailed below. Nevertheless, despite using scoring for completely different losses and on different sets of samples, we explore empirically the replacement of our scoring mechanism with theirs and demonstrate that our scheme is superior by a sizable margin.

4.2 Pseudo-labels

In order to utilize the unlabeled data as much as possible, we opt to use pseudo-labels. As explained in [5], pseudo-labels can be an extremely simple yet effective tool when training a network in a semi-supervised scenario. The difficulty with pseudo-labels in the UDA scenario is the high risk of negative transfer, i.e., decreasing the classifier’s performance due to the incorporation of false supervision. In the universal scenario, the target label set is unknown and, therefore, assuming that the network’s classification is correct is even more likely to be detrimental than in the conventional domain adaptation case.

In order to deal with the risk of negative transfer, our approach is to use pseudo-labels only on high confidence samples that are likely to be in the shared label set $\mathcal{C}$. As a confidence measure, we employ the sample’s transfer score, $s(x)$, and only use pseudo-labels for samples where $s(x)$ is above a certain threshold. We use a dynamic threshold that changes during the training process according to the following:


where $t$ is the current training step and $T$ is the total number of training steps. A dynamic threshold is used in order to avoid negative transfer: at first the threshold is set at a high value, and as the training advances and the network classifies samples more reliably, it becomes reasonable to lower the threshold.
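To make the idea of a decreasing threshold concrete, the following sketch uses a linear decay between two values. This is a hypothetical schedule chosen only for illustration: the functional form, the function name and the `start`/`end` values are assumptions and are not the schedule used in the paper.

```python
def linear_threshold(step, total_steps, start=0.9, end=0.5):
    """Hypothetical schedule: decay linearly from `start` to `end`.

    step:        current training step.
    total_steps: total number of training steps.
    """
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


for step in (0, 2500, 5000, 10000):
    print(step, round(linear_threshold(step, 10000), 3))
```

Early in training only the highest-scoring target samples pass; as training proceeds, progressively lower-scoring samples are admitted.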

Our pseudo-label classification loss is the following:


where is the standard cross-entropy loss, and is a trade-off parameter.
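A minimal sketch of a selective pseudo-label loss is given below. It assumes the pseudo-label of a target sample is the arg-max of its own prediction, and that the loss is averaged over the selected samples only; both the averaging convention and the names (`pseudo_label_loss`, `trade_off`) are assumptions made for this example.

```python
import math


def pseudo_label_loss(target_probs, scores, threshold, trade_off=1.0):
    """Mean cross-entropy against arg-max pseudo-labels on selected targets.

    target_probs: list of softmax vectors for the target samples in a batch.
    scores:       sample transfer score of each target sample.
    threshold:    only samples with score > threshold receive pseudo-labels.
    trade_off:    weight of the pseudo-label term.
    """
    selected = [p for p, s in zip(target_probs, scores) if s > threshold]
    if not selected:
        return 0.0  # no sample is confident enough, so the term vanishes
    # Cross-entropy against the arg-max label reduces to -log(max probability).
    total = sum(-math.log(max(p)) for p in selected)
    return trade_off * total / len(selected)


probs = [[0.8, 0.2], [0.5, 0.5]]
print(pseudo_label_loss(probs, scores=[0.7, 0.2], threshold=0.5))
```

Only the first sample passes the threshold here, so the uncertain second sample contributes no (possibly false) supervision.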

4.3 The batch diversity loss

The method also employs a regularization term aimed at enforcing the samples to be distributed among the different labels in a uniform manner. This regularization helps to better utilize all of the clusters formed in the encoding space, which is likely to be beneficial when transferring to a different domain.

Given a batch of samples , for each sample the pseudo probability of it being in class is denoted by . We define the following regularization term:


This term has a maximal value of 1, which is reached when all of the samples from the batch are mapped to a single class. Its minimal value is and that is obtained if and only if for each sample . Adding this penalty to the network’s objective encourages a solution that is more uniformly distributed across the different classes.

This loss term is used on samples from both domains. However, in order to avoid negative transfer of samples that are from the target domain specific labels , we apply this loss only to samples from the target domain whose sample transfer score is above a threshold . Let us denote by the source samples in the batch and the target samples. Let us also denote by the target samples for which . Our loss term becomes the following:
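The sketch below shows one penalty that is consistent with the extremes stated above: the sum over classes of the squared batch-mean probability. It equals 1 when every sample is mapped with full confidence to the same class, and equals one over the number of classes when the batch-mean prediction is uniform. The paper's exact expression is not reproduced here and may differ (for example, in exactly when the minimum is attained); the function names are hypothetical.

```python
def batch_diversity_penalty(batch_probs):
    """Sum over classes of the squared mean predicted probability (assumed form)."""
    if not batch_probs:
        return 0.0
    n = len(batch_probs)
    num_classes = len(batch_probs[0])
    mean_probs = [sum(p[k] for p in batch_probs) / n for k in range(num_classes)]
    return sum(m * m for m in mean_probs)


def selected_diversity_penalty(source_probs, target_probs, target_scores, threshold):
    """Apply the penalty to all source samples plus the selected target samples."""
    kept = [p for p, s in zip(target_probs, target_scores) if s > threshold]
    return batch_diversity_penalty(list(source_probs) + kept)


collapsed = [[1.0, 0.0], [1.0, 0.0]]   # every sample predicted as class 0
spread = [[1.0, 0.0], [0.0, 1.0]]      # predictions spread over both classes
print(batch_diversity_penalty(collapsed), batch_diversity_penalty(spread))
```

Minimizing such a term pushes the batch-level label distribution toward uniform, while the score-based selection keeps likely target-private samples from being forced into source classes.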


4.4 Domain adversarial loss

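In addition to the losses described above, we also use the conventional adversarial domain loss first introduced in [8]. The domain classifier’s network is trained with a binary cross-entropy loss:

$$L_{adv} = -\mathbb{E}_{x \sim p_s} \log D(F(x)) \;-\; \mathbb{E}_{x \sim p_t} \log\big(1 - D(F(x))\big),$$

where $F$ is the feature extractor and $D(F(x))$ is the domain classifier's estimate that $x$ comes from the source domain.

A gradient reversal layer is used when backpropagating to the feature extractor network $F$.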
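For reference, the binary cross-entropy of a domain classifier can be computed as in the sketch below, with source samples labeled 1 and target samples labeled 0. The function name and the choice to average over all samples in the batch are assumptions for this example, and the gradient reversal step is not shown since it requires an automatic differentiation framework.

```python
import math


def domain_bce_loss(source_domain_probs, target_domain_probs, eps=1e-12):
    """Binary cross-entropy of the domain classifier (source = 1, target = 0).

    Each input is a list of predicted probabilities of "being from the source".
    `eps` guards against log(0).
    """
    losses = [-math.log(max(d, eps)) for d in source_domain_probs]
    losses += [-math.log(max(1.0 - d, eps)) for d in target_domain_probs]
    return sum(losses) / len(losses)


# A classifier at chance level (0.5 everywhere) yields a loss of log(2).
print(domain_bce_loss([0.5, 0.5], [0.5, 0.5]))
```

With gradient reversal, the feature extractor is updated to increase this loss while the domain classifier is updated to decrease it, which is what drives the features toward being domain agnostic.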


4.5 The compound loss

To summarize, the final loss used is the following:


Note that the components are unweighted. However, the pseudo-label classification loss contains the trade-off parameter.

5 Experiments

We compare our method with state of the art methods from different domain adaptation settings. We also perform a comparison between our scoring mechanism and that proposed in [29] and perform an ablation study to show the necessity of our losses.

Datasets  Following [29], we use four datasets. Office-Home [27] is a dataset made up of 65 different classes from four domains: Artistic (Ar), Clipart (Cl), Product (Pr) and Real-world images (Rw). Keeping in line with [29], we test each combination of source and target domain by setting the first 10 classes in alphabetical order as the shared label set $\mathcal{C}$, the next five as the source private set, $\bar{\mathcal{C}}_s$, and the remaining 50 classes as the target private set, $\bar{\mathcal{C}}_t$. Office31 [21] consists of three domains, each with 31 classes. The domains are Amazon (A), DSLR (D) and Webcam (W). The 10 classes shared between this dataset and Caltech-256 [10] are used as the shared label set. Of the remaining classes, we set the first 10 in alphabetical order as $\bar{\mathcal{C}}_s$ and the last 11 as $\bar{\mathcal{C}}_t$. VisDA2017 [19] is a dataset with a single source and target domain, testing the ability to perform transfer learning from synthetic images to natural images. The dataset has 12 classes, identical in each domain; we use the first six as the shared label set, the next three as the private source label set and the last three as the private target label set. ImageNet-Caltech employs ImageNet-1K [6] with 1000 different classes and Caltech-256 [10] with 256 classes. The shared label set comprises the 84 classes common to the two datasets, while the source and target private label sets comprise all other classes in each dataset.
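The class splits above determine the Jaccard index of the label sets (the overlap measure from the problem setting), which can be checked with a short computation. The helper name is hypothetical; the split sizes are those stated in the text.

```python
def jaccard_index(num_shared, num_source_private, num_target_private):
    """Jaccard index of two label sets, given shared and private class counts."""
    union = num_shared + num_source_private + num_target_private
    return num_shared / union


splits = {
    "Office-Home": (10, 5, 50),                        # 65 classes in total
    "Office31": (10, 10, 11),                          # 31 classes in total
    "ImageNet-Caltech": (84, 1000 - 84, 256 - 84),     # 84 shared classes
}
for name, split in splits.items():
    print(f"{name}: {jaccard_index(*split):.2f}")
# Office-Home: 0.15
# Office31: 0.32
# ImageNet-Caltech: 0.07
```

The low values show how small the shared label set is relative to the union of classes, especially for ImageNet-Caltech.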

Evaluation protocol  The protocol of the Open-Set challenge in VisDA2018 is employed. After the training stage, the model is tested only on samples from the target domain. The network must classify the test data into $|\mathcal{C}| + 1$ different classes, where the last label contains all labels from the target domain’s private label set. As detailed above, our network tries to classify using the labels from the source domain and only classifies into the “unknown” class if the sample’s transfer score is lower than a predetermined threshold.
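The reported metric is an average of per-class accuracies, with all target-private classes collapsed into a single "unknown" class. A minimal sketch of such a metric follows; the function name and the toy labels are hypothetical, and the ground-truth labels are assumed to have been mapped to "unknown" beforehand.

```python
from collections import defaultdict


def average_class_accuracy(true_labels, predicted_labels):
    """Mean of per-class accuracies over the classes present in true_labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    # Every class contributes equally, regardless of its number of samples.
    return sum(correct[c] / total[c] for c in total) / len(total)


truth = ["cat", "cat", "dog", "unknown", "unknown"]
preds = ["cat", "dog", "dog", "unknown", "cat"]
print(average_class_accuracy(truth, preds))
```

Because each class is weighted equally, a model that labels everything as unknown cannot score well even when target-private samples dominate the test set.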

Implementation details  The architecture of $F$, $G$, and $D$ follows that of UAN [29] in order to provide a direct comparison with this previous work. The method is implemented in PyTorch using a ResNet-50 model [11], pretrained on ImageNet [6], as the backbone feature extractor $F$. The label classifier network, $G$, is a fully connected network with a single layer used to classify the features $z$. The domain classifier network, $D$, is comprised of three fully connected layers with ReLU between the first two.

Our method enjoys a very limited number of hyperparameters. Early on during the development process, we fixed the following hyperparameters across all datasets:

, and . Below we provide some parameter sensitivity experiments to demonstrate the robustness of the method to its parameters.

5.1 Classification results

Method Office-Home
Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Avg
ResNet [11] 59.37 76.58 87.48 68.86 71.11 81.66 73.72 56.30 86.07 78.68 59.22 78.59 73.22
DANN [8] 56.17 81.72 85.87 68.67 73.38 83.76 69.92 56.84 85.80 79.41 57.26 78.26 73.17
RTN [16] 50.46 77.80 86.90 65.12 73.40 85.07 67.86 45.23 85.50 79.20 55.55 78.79 70.91
IWAN [30] 52.55 81.40 86.51 70.58 70.99 85.29 74.88 57.33 85.07 77.48 59.65 79.91 73.39
PADA [4] 39.59 69.37 76.26 62.57 67.39 77.47 48.39 35.79 79.60 75.94 44.50 78.10 62.91
ATI [2] 52.90 80.37 85.91 71.08 72.41 84.39 74.28 57.84 85.61 76.06 60.17 78.42 73.29
OSBP [23] 47.75 60.90 76.78 59.23 61.58 74.33 61.67 44.50 79.31 70.59 54.95 75.18 63.90
UAN [29] 63.00 82.83 87.85 76.88 78.70 85.36 78.22 58.59 86.80 83.37 63.17 79.43 77.02
Ours 63.59 85.02 91.42 77.01 84.09 88.29 79.50 56.49 89.85 77.52 61.00 85.69 78.29
Table 1: Average class accuracy (%) on Office-Home (Jaccard index 0.15). The results for all methods besides ours are taken from UAN [29].
Method Office31 ImageNet-Caltech VisDA2017
A→W D→W W→D A→D D→A W→A Avg I→C C→I Avg
ResNet [11] 75.94 89.60 90.91 80.45 78.83 81.42 82.86 70.28 65.14 67.71 52.80
DANN [8] 80.65 80.94 88.07 82.67 74.82 83.54 81.78 71.37 66.54 68.96 52.94
RTN [16] 85.70 87.80 88.91 82.69 74.64 83.26 84.18 71.94 66.15 69.05 53.92
IWAN [30] 85.25 90.09 90.00 84.27 84.22 86.25 86.68 72.19 66.48 69.34 58.72
PADA [4] 85.37 79.26 90.91 81.68 55.32 82.61 79.19 65.47 58.73 62.10 44.98
ATI [2] 79.38 92.60 90.08 84.40 78.85 81.57 84.48 71.59 67.36 69.48 54.81
OSBP [23] 66.13 73.57 85.62 72.92 47.35 60.48 67.68 62.08 55.48 58.78 30.26
UAN [29] 85.62 94.77 97.99 86.50 85.45 85.12 89.24 75.28 70.17 72.73 60.83
Ours 90.25 95.25 96.96 88.84 90.19 89.30 91.80 76.13 74.67 75.40 64.31
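Table 2: Average class accuracy (%) for Office31 (Jaccard index 0.32), ImageNet-Caltech (Jaccard index 0.07) and VisDA2017 (Jaccard index 0.50). In each row, the first seven values are for Office31, the next three for ImageNet-Caltech and the last one for VisDA2017.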

We compare our approach with prior methods in the UDA setting. Tabs. 1 and 2 present the results on the accepted benchmarks of the field. The accuracy of methods other than ours is taken from [29]. As can be observed, our approach achieves state of the art results on the majority of the domain adaptation tasks across the different datasets. Relative to UAN [29], the strongest prior method, our method improves the average classification accuracy on each dataset by roughly 1.3 to 3.5 percentage points.

5.2 Scoring scheme analysis

A→W D→W W→D A→D D→A W→A Avg
UAN[29] 85.62 94.77 97.99 86.50 85.45 85.12 89.24
Ours with 86.23 93.26 91.79 84.31 86.09 85.41 87.84
Ours with 85.26 93.81 95.16 82.84 85.31 83.51 87.65
Ours, $s$ w/o $\hat{d}$ 89.95 95.70 96.46 88.05 89.92 89.05 91.52
Ours, $s$ w/o $\max \hat{y}$ 81.9 89.11 87.85 83.68 81.73 82.55 82.55
Ours with $s$ 90.25 95.25 96.96 88.84 90.19 89.30 91.80
Table 3: Comparison on Office31 between UAN [29] and our approach when using either our score $s$ (with ablation on the score’s components) or UAN’s $w^t$ as the scoring scheme, as well as other variants.

In Fig. 3 we present the estimated probability density function for the different components of the score $s$ on the Office31 dataset for the domain shift A→D. The domain classifier's output $\hat{d}$, shown in Fig. 3(a), displays the expected behavior. In Fig. 3(b) we analyze the max probability of the classifier, $\max \hat{y}$, validating the hypothesis in Eq. 4 and justifying using this component as part of our scoring scheme. Finally, in Fig. 3(c), we present the full sample transfer score $s$. The results show that target samples with higher scores are typically from the shared label set. This justifies the use of our scoring scheme to distinguish between samples that we can predict correctly and those that should be labeled as unknown.

(a) (b) (c)
Figure 3: Distributions of the different components of the scoring scheme on the four following sample groups: source samples in (orange), source sample in (blue), target samples in (red) and target samples in (green). (a) Distribution of the domain classifier’s output . (b) The label classifier’s maximum probability, . (c) The score , which combines both.

Comparing scoring schemes

We next compare our proposed scoring scheme $s$, as shown in Eq. 5, to the scoring scheme $w^t$ proposed in [29], given by Eq. 6. In order to compare the two scoring schemes, we use the score $w^t$ on the target samples instead of $s$. The method that uses $w^t$ was tuned to optimize its performance. In addition to the scoring scheme $w^t$, we also compare to an entropy based one, since entropy has been shown to be a good criterion in domain adaptation [9, 17]. Based on the assumption that target samples from the shared label set are similar to the source samples and will thus have a lower entropy, we define the following scoring scheme:


The comparison is done on the Office31 dataset and the results appear in Tab. 3. As can be seen, our scoring scheme produces superior results across the entire dataset. We thus conclude that our scoring mechanism outperforms both the one proposed in [29] and the entropy based one when used in the context of our method.

Tab. 3 also presents an ablation study on the components of the scoring mechanism. “$s$ w/o $\hat{d}$” refers to the score function when removing the domain factor from Eq. 5 and “$s$ w/o $\max \hat{y}$” to the score function when removing the classification component. The results show that both components are necessary for achieving our final result. However, the classification component is more crucial to the success of our scoring mechanism, and by itself already outperforms the state of the art.

5.3 Parameter sensitivity

Figure 4: (blue) Accuracy on Office31 A→D and (red) on OfficeHome Ar→Cl as a function of the pseudo-label threshold, with the other hyperparameters fixed. Note that there are two axes due to a different level of baseline performance.

Pseudo-label threshold analysis 

(a) (b)
Figure 5: (a) Accuracy w.r.t. the diversity loss threshold. (b) Accuracy w.r.t. the decision threshold (with a dynamic pseudo-label threshold). Both results are on OfficeHome Art to Clipart.

We study the sensitivity of our method to the pseudo-label threshold, which is used to determine whether or not the pseudo-label of a target sample should be taken into consideration when calculating its loss. While our method employs a dynamic threshold, in order to obtain a clearer picture, we perform the experiment with the threshold fixed. We compare the average accuracy on the OfficeHome dataset with the domain shift Ar→Cl and on Office31 with the domain shift A→D. The tests are conducted by fixing all other hyperparameters to the default values and only changing the value of the threshold.

The results are presented in Fig. 4. By taking the lowest threshold possible, 0, we allow the use of pseudo-labels on every sample seen during the training stage. As can be seen from the performance graph, this yields a much lower result than higher thresholds, probably due to negative transfer. This result is also evident in Tabs. 4 and 5 when setting the threshold to 0.

The second edge case is a threshold of 1, which is the maximal value that the score can have. With this threshold, no score will ever exceed the threshold and thus it is equivalent to not using pseudo-labels at all. From Fig. 4 one can observe that this setting does not yield the best results, meaning that the use of pseudo-labels does, in fact, help train the network. Tab. 5 shows the results on the OfficeHome dataset in the case where pseudo-labels are applied as suggested in our approach and when they are not applied at all. These results show that the use of pseudo-labels during training does improve the accuracy during the deployment stage across the entire dataset. These results are also affirmed by Tab. 4 for the Office31 tasks.

In addition, one can observe that setting a very low threshold also leads to results much lower than the best possible performance. This comes as no surprise, since using pseudo-labels on all the target samples leads to negative transfer.

Dynamic threshold 

A→W D→W W→D A→D D→A W→A Avg
Ours w/o pseudo-labels 85.38 94.34 96.67 87.58 85.00 85.32 89.05
Ours, zero threshold 87.92 90.62 87.44 85.19 84.00 84.65 86.64
Ours, static threshold 89.75 95.47 95.95 88.78 89.53 88.88 91.39
Ours 90.25 95.25 96.96 88.84 90.19 89.30 91.80
Table 4: Comparison between different thresholds for the use of pseudo-labels on Office31.
Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Avg
Ours w/o pseudo-labels 63.0 83.97 90.53 72.40 75.56 84.25 70.83 54.21 89.01 74.96 56.76 84.20 74.97
Ours, zero threshold 57.34 81.50 88.51 72.58 83.06 86.32 75.73 58.46 86.94 74.84 58.33 82.29 75.49
Ours, static threshold 62.93 82.86 90.60 74.52 84.27 88.84 81.66 55.10 88.92 78.56 61.44 84.25 77.83
Ours 63.59 85.02 91.42 77.01 84.09 88.29 79.50 56.49 89.85 77.52 61.00 85.69 78.29
Table 5: Comparison between different thresholds for the use of pseudo-labels on Office-Home.
A→W D→W W→D A→D D→A W→A Avg
No diversity loss 87.97 95.77 97.11 88.20 89.16 88.53 91.12
Diversity loss on target samples 85.68 95.22 96.58 87.81 89.29 88.25 90.47
Diversity loss on all samples 90.25 95.25 96.96 88.84 90.19 89.30 91.80
Table 6: Results on Office31 when using the diversity loss in different manners.
Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Avg
No diversity loss 63.81 84.65 91.42 77.65 84.15 87.81 79.92 56.71 88.11 76.6 60.24 85.43 78.04
Target only 63.04 85.03 91.49 77.07 84.47 87.78 79.36 56.42 88.80 77.32 60.21 85.61 78.05
Target + Source 63.59 85.02 91.42 77.01 84.09 88.29 79.50 56.49 89.85 77.52 61.00 85.69 78.29
Table 7: Results on OfficeHome when using the diversity loss in different manners.

We next analyze the advantage of employing a dynamic threshold for the use of pseudo-labels. The analysis is done on the Office31 dataset by fixing a set threshold, (which was found to provide the optimal value).

The results are reported in Tab. 4. As can be seen, the use of a dynamic threshold does seem to give better results overall. This is probably because, in the later parts of the training, the dynamic threshold admits additional samples whose transfer score lies below the static threshold.

Another insight into the advantage of using a dynamic threshold can be seen by looking at Fig. 4. From the figure, one can observe that different domains and datasets give better results when using different thresholds for the pseudo-labels. In particular, the best-performing threshold for the domain shift Ar→Cl in OfficeHome differs from the one for Office31 A→D. The dynamic threshold enables us to take advantage of different thresholds during training, yielding improved results over a number of datasets.

Diversity loss threshold analysis  We next analyze the threshold used to determine which of the target samples are used when calculating the batch diversity loss. We compare the average accuracy when only changing the threshold while all other hyper-parameters are fixed at the default value (including the threshold ). The results can be found in Fig. 5(a). The accuracy varies by around 1% but it is clear that the use of pseudo-labels under a relatively stable threshold value does improve the final result. Note that the default threshold is not the one that produces the best possible outcome.

Decision threshold analysis  Another component of the network we analyze is the decision threshold, which is used to decide whether the model would label a sample as unknown or use the predicted label. The analysis is done in a similar manner to the two previous sections. However, here we use a dynamic pseudo-label threshold as described in Eq. 7.

As is evident from the results in Fig. 5(b), there is little variance in the results for thresholds within a wide range. Beyond that range, we see a sharp fall in the accuracy, until it finally reaches the lowest possible value at the maximal threshold. This fall in accuracy occurs because for high enough thresholds only a very small number of samples have a transfer score higher than the threshold and thus most samples are labeled as unknown. The extreme case, as seen in the graph, is where no sample can pass this threshold and all are labeled as unknown, leading to an accuracy score that is the fraction of samples from novel target classes in this benchmark.

Diversity loss study

We next explore the contribution of the batch diversity loss and whether the diversity loss only needs to be applied on the target domain, for which we have no label, as a means of semi-supervised learning, or whether using it also on the source domain helps improve the network’s outcome. All hyper-parameters used in the analysis are the same default parameters as described above.

Tab. 7 (7) presents the results on the Office31 (OfficeHome) dataset in three scenarios: (i) when not applying the diversity loss at all, (ii) when using the diversity loss on both domains and (iii) when using the loss only on the target domain. As can be seen, the results for the different settings are very similar. However, the use of the batch diversity loss on both domains does yield slightly better results.
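The three scenarios can be written as a single switch over where the loss is applied. The sketch below assumes one common form of a batch diversity loss, the negative entropy of the batch-averaged prediction; the loss defined earlier in the paper may differ, and the names `batch_diversity_loss`, `diversity_term`, and the `mode` strings are hypothetical.

```python
import numpy as np


def batch_diversity_loss(probs, eps=1e-12):
    """Negative entropy of the batch-averaged class distribution (assumed form).

    Lower values mean the batch predictions are spread over more classes.
    """
    mean_p = np.asarray(probs, dtype=float).mean(axis=0)
    return float(np.sum(mean_p * np.log(mean_p + eps)))


def diversity_term(source_probs, target_probs, mode):
    """Select the scenario: 'none', 'both' domains, or 'target' only."""
    if mode == "none":
        return 0.0
    if mode == "target":
        return batch_diversity_loss(target_probs)
    if mode == "both":
        return batch_diversity_loss(source_probs) + batch_diversity_loss(target_probs)
    raise ValueError(f"unknown mode: {mode}")


# Toy batches of softmax outputs over three classes.
diverse = np.eye(3)                              # each sample predicts a different class
collapsed = np.tile([1.0, 0.0, 0.0], (3, 1))     # every sample predicts class 0

loss_diverse = batch_diversity_loss(diverse)
loss_collapsed = batch_diversity_loss(collapsed)
print(loss_diverse, loss_collapsed)
```

Under this assumed form, a batch whose predictions cover all classes receives a lower loss than a batch that collapses onto one class, which is the behavior a diversity term is meant to reward.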

6 Conclusions

We study unsupervised domain adaptation in the challenging case where there is only a partial overlap between the source and target domain classes. Our method adapts through the use of pseudo-labels and a diversity loss. However, since some of the samples of the target domain cannot be properly labeled by any of the source labels, we propose to score the samples and apply a threshold.

Our scoring takes into consideration the confidence of the label classifier, as well as the confidence of the domain discriminator. The more certain the first network is in its prediction and the less certain the second is that the sample is from the target domain, the more likely the target domain sample is from the shared label set.
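The intuition can be illustrated with a toy score that rises with classifier confidence and falls with the discriminator's belief that the sample comes from the target domain. This is only a sketch of the stated intuition: the subtraction used here, the name `illustrative_score`, and the inputs are assumptions, and the paper's actual scoring equation is the one given in its method section.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def illustrative_score(class_logits, p_target):
    """Hypothetical score: max class probability minus P(sample is from target).

    High when the label classifier is confident and the domain discriminator
    is unsure that the sample belongs to the target domain.
    """
    confidence = softmax(np.asarray(class_logits, dtype=float)).max(axis=1)
    return confidence - np.asarray(p_target, dtype=float)


# Two made-up target samples.
class_logits = np.array([
    [5.0, 0.0, 0.0],   # confident prediction
    [0.1, 0.0, 0.0],   # nearly uniform prediction
])
p_target = np.array([0.5, 0.95])  # discriminator output: unsure vs. sure it is target

scores = illustrative_score(class_logits, p_target)
print(scores)
```

The first sample, confidently classified and ambiguous to the discriminator, scores higher and would be the more likely candidate for the shared label set under this toy rule.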

The method obtains state-of-the-art results by a sizable margin on the relevant literature benchmarks, despite being simpler than previous work. We also demonstrate that our scoring scheme is superior to those previously proposed.


This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC CoG 725974).


  • [1] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2016) Unsupervised pixel-level domain adaptation with generative adversarial networks. External Links: 1612.05424 Cited by: §2.
  • [2] P. P. Busto, A. Iqbal, and J. Gall (2019) Open set domain adaptation for image and action recognition. External Links: 1907.12865 Cited by: §2, Table 1, Table 2.
  • [3] Z. Cao, M. Long, J. Wang, and M. I. Jordan (2017) Partial transfer learning with selective adversarial networks. External Links: 1707.07901 Cited by: §2.
  • [4] Z. Cao, L. Ma, M. Long, and J. Wang (2018) Partial adversarial domain adaptation. External Links: 1808.04205 Cited by: §2, Table 1, Table 2.
  • [5] J. Choi, M. Jeong, T. Kim, and C. Kim (2019) Pseudo-labeling curriculum for unsupervised domain adaptation. External Links: 1908.00262 Cited by: §2, §4.2.
  • [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §5, §5.
  • [7] G. French, M. Mackiewicz, and M. Fisher (2017) Self-ensembling for visual domain adaptation. External Links: 1706.05208 Cited by: §2.
  • [8] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. Journal of Machine Learning Research 17 (59), pp. 1–35. Cited by: §2, §4.4, Table 1, Table 2.
  • [9] Y. Grandvalet and Y. Bengio (2004) Semi-supervised learning by entropy minimization. In Proceedings of the 17th International Conference on Neural Information Processing Systems, NIPS’04, Cambridge, MA, USA, pp. 529–536. External Links: Link Cited by: §5.2.
  • [10] G. Griffin, A. Holub, and P. Perona (2006) Caltech256 image dataset. External Links: Link Cited by: §5.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Table 1, Table 2, §5.
  • [12] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2017) CyCADA: cycle-consistent adversarial domain adaptation. External Links: 1711.03213 Cited by: §2.
  • [13] L. Hu, M. Kan, S. Shan, and X. Chen (2018-06) Duplex generative adversarial network for unsupervised domain adaptation. In CVPR, pp. 1498–1507. Cited by: §2.
  • [14] S. Huang, A. Lin, S. Chen, Y. Wu, P. Hsu, and S. Lai (2018-08) AugGAN: cross domain adaptation with gan-based data augmentation. In ECCV. Cited by: §2.
  • [15] Y. Liu, Y. Yeh, T. Fu, S. Wang, W. Chiu, and Y. F. Wang (2017) Detach and adapt: learning cross-domain disentangled deep representation. External Links: 1705.01314 Cited by: §2.
  • [16] M. Long, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. CoRR abs/1602.04433. External Links: Link, 1602.04433 Cited by: Table 1, Table 2.
  • [17] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. External Links: 1602.04433 Cited by: §5.2.
  • [18] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim (2017) Image to image translation for domain adaptation. External Links: 1712.00479 Cited by: §2.
  • [19] X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko (2017) VisDA: the visual domain adaptation challenge. External Links: 1710.06924 Cited by: §5.
  • [20] P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo (2017) From source to target and back: symmetric bi-directional adaptive gan. External Links: 1705.08824 Cited by: §2.
  • [21] K. Saenko, B. Kulis, M. Fritz, and T. Darrell (2010) Adapting visual category models to new domains. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV’10, Berlin, Heidelberg, pp. 213–226. External Links: ISBN 3-642-15560-X, 978-3-642-15560-4, Link Cited by: §5.
  • [22] K. Saito, Y. Ushiku, and T. Harada (2017) Asymmetric tri-training for unsupervised domain adaptation. External Links: 1702.08400 Cited by: §2.
  • [23] K. Saito, S. Yamamoto, Y. Ushiku, and T. Harada (2018) Open set domain adaptation by backpropagation. External Links: 1804.10427 Cited by: §2, Table 1, Table 2.
  • [24] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa (2017) Generate to adapt: aligning domains using generative adversarial networks. External Links: 1704.01705 Cited by: §2.
  • [25] O. Sener, H. O. Song, A. Saxena, and S. Savarese (2016) Learning transferrable representations for unsupervised domain adaptation. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 2110–2118. External Links: Link Cited by: §2.
  • [26] R. Shu, H. H. Bui, H. Narui, and S. Ermon (2018) A dirt-t approach to unsupervised domain adaptation. External Links: 1802.08735 Cited by: §2.
  • [27] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan (2017) Deep hashing network for unsupervised domain adaptation. External Links: 1706.07522 Cited by: §5.
  • [28] R. Volpi, P. Morerio, S. Savarese, and V. Murino (2017) Adversarial feature augmentation for unsupervised domain adaptation. External Links: 1711.08561 Cited by: §2.
  • [29] K. You, M. Long, Z. Cao, J. Wang, and M. I. Jordan (2019-06) Universal domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3, §4.1, §4, §5.1, §5.2, Table 1, Table 2, Table 3, §5, §5, §5.
  • [30] J. Zhang, Z. Ding, W. Li, and P. Ogunbona (2018) Importance weighted adversarial nets for partial domain adaptation. External Links: 1803.09210 Cited by: §2, Table 1, Table 2.
  • [31] W. Zhang, W. Ouyang, W. Li, and D. Xu (2018-06) Collaborative and adversarial network for unsupervised domain adaptation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3801–3809. Cited by: §2.
  • [32] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. External Links: 1703.10593 Cited by: §2.