However, one major disadvantage of deep learning is that neural networks generally require a large amount of training data to converge to a good solution and generalize. When the training data are insufficient, model performance usually suffers. Even when the training data are sufficient, the domain gap, i.e. the difference in data distributions between the source domain (the data the model is trained on) and the target domain (the data of the desired target task), may still lead to poor generalization. This is because conventional machine learning tasks usually assume that the training data distribution is the same as the testing data distribution. In the real world, however, the testing data are uncontrollable, so the difference between the source and the target domain can be substantial, which results in overfitting to the source domain, that is, the model does not generalize well to the testing set.
To reduce the domain gap and better utilize the source domain knowledge, domain adaptation techniques have been proposed. Domain adaptation assumes that the source domain has a sufficient amount of labeled data to train a good model, while the desired target domain does not. Note that in most domain adaptation tasks, the target domain does not have labeled data at all, which is known as unsupervised domain adaptation. Domain adaptation methods leverage the knowledge from the source domain with sufficient labeled data to learn a model that works well in the target domain with insufficient or no labeled data. Typically, domain adaptation methods reduce the domain gap by diminishing the divergence between the label-rich source and the label-scarce target domain [14, 15, 11, 12, 36, 25, 34, 30, 21, 8, 7, 28]. A significant drawback of most existing works is that they assume the source label space and the target label space are identical, i.e. every target class is assumed to have appeared during training and the training classes do not contain classes that are absent from the target domain. This setting is known as closed set domain adaptation (CSDA), and it is infrequent in real-world scenarios since it is hard to guarantee that the target domain classes are the same as the source domain classes. Aligning the source and the target domain by brute force when their label spaces differ is extremely detrimental to the model’s generalizability, because the CSDA methods would try to align the additional unknown classes as well during adaptation; this is known as the negative transfer phenomenon.
To handle domain adaptation tasks without assuming identical label spaces, open set domain adaptation (OSDA) methods were proposed to first detect the irrelevant or unknown samples and then perform domain adaptation only on the known classes, avoiding adaptation on the unknown samples [4, 29, 18, 2]. This can be achieved by forcing the model to learn a clear boundary between known classes and unknown classes and clear boundaries among the known classes. After adaptation, the model is applied to the target samples as follows: a target sample is either classified as one of the known classes or rejected as unknown. However, all of the conventional OSDA methods employ a rejection threshold hyperparameter: if a score or statistic for a sample is higher than the threshold, the sample is rejected as unknown and discarded during adaptation. The model’s performance is therefore highly sensitive to this rejection threshold.
In our work, we address this issue in OSDA by utilizing 1) an entropy-based instance-level reweighting strategy and 2) extreme value theory (EVT), which has proven to be useful in many classification tasks due to its ability to model extreme values [9, 23, 6, 37, 3]. We propose to use the entropy of a sample’s predicted probability distribution to measure the likelihood that it belongs to the unknown classes. That is, the higher the entropy, the more likely the sample belongs to the unknown classes, because the model is more uncertain about its prediction. We utilize the entropy values to construct an instance-level weight for domain adaptation instead of setting a hard threshold. In this way, every sample is taken into account during the adaptation process, and a sample with high entropy receives less weight to avoid the negative transfer problem. At inference, we model the tail of the entropy distribution by fitting a generalized extreme value (GEV) distribution, and we use the cumulative distribution function (CDF) score to indicate whether a sample belongs to the unknown or known classes. Experimental results on three conventional domain adaptation datasets show that our method outperforms both state-of-the-art CSDA and OSDA benchmarks.
We claim the following key contributions. First, we model the tail of entropy distributions with EVT to detect and reject unknown classes. Second, the entropy-based instance-level weighting strategy avoids setting a hard threshold manually and thus is more robust and stable. Third, we have done extensive experiments on three conventional datasets in domain adaptation, which show that our model outperforms benchmarks by a significant margin.
The rest of the paper is organized as follows. In Section 2, we discuss the related literature and methods. In Section 3, we explain our OSDA method in detail. In Section 4, we conduct experiments to validate the effectiveness of the proposed method, and in Section 5, we conclude the paper by reiterating the main contributions.
2 Related Work
Existing works on CSDA typically try to diminish the domain gap either by minimizing a statistical divergence between the two domains or by an adversarial approach. MCD [28] utilizes two task-specific classifiers to detect the target domain samples which are far from the source domain by maximizing the two classifiers’ inconsistency regarding predictions. Based on the mean teacher model [33], which was originally proposed for semi-supervised learning tasks, the self-ensemble network calculates the exponential moving average of the student model weights and assigns it to the teacher model to reduce the domain gap. There is also a plethora of works on adversarial learning to reduce the domain gap. DANN [8] reduces the domain gap by introducing a domain discriminator during the training process to discriminate between domains, where the domain discriminator is optimized through a specially designed gradient reversal layer. ADDA [34] incorporates adversarial training to reduce the domain gap, using a discriminator to distinguish between domains. The soft labeling method [14] utilizes a soft cross-entropy loss function to optimize for domain invariance and thus align the source and target domains. WDGRL [30] minimizes the Wasserstein distance across domains adversarially to learn domain-invariant representations. CADA [20] aligns the joint distribution of labels and features across two domains by exploring classifier predictions for adversarial domain adaptation. Our proposed method differs from these prior works in that we focus on the practical OSDA setting, while these prior works assume an identical label space across domains and thus cannot recognize the unknown-class samples included during domain adaptation. Our method is able to detect the unknown classes by instance-level weights based on entropy to avoid negative transfer.
There are also several existing works on OSDA that address the issue of unknown classes. STA [19] develops a coarse-to-fine progressive separation method for unknown and known classes. OSBP [29] forces the generator, by adversarial training, to either match target samples with source known classes or ignore them in adaptation as unknown samples. Based on the work of SE [7] in CSDA, KASE [18] modifies the adaptation loss and then aligns the target domain with the source domain, addressing the unknown classes using weights calculated from entropy values. By factorization and joint separation, D-FRODA [2] represents the source and target classes in a shared embedding space for domain adaptation. ATI [4] detects the target samples that potentially belong to the known classes by learning a mapping from the source domain to the target domain. Our work differs from these prior works in that they rely on a hard threshold to recognize and reject the unknown samples, so their accuracy largely depends on how the threshold is set. The proposed method instead utilizes entropy to indicate the likelihood of unknown classes and constructs a soft instance-level weight. We further fit an EVT model on the tail of the entropy distribution to detect unknown classes. Therefore, by avoiding manual thresholding and introducing EVT to detect unknown classes, the proposed method is more robust and better suited to the OSDA setting, which is also validated by the experimental results.
In this section, denotes the number of known classes; denote a source, target sample, respectively; denotes a source label; denote the source, target sample distribution, respectively; denote the source, target domain, respectively; denotes a feature extractor; denotes a -way classifier to classify samples into one of the known classes; and denotes the binary domain classifier to classify samples into source or target domains.
Even though all samples in the source domain are labeled, in one loss component we assume a split of known classes to (the classes of interest) and (the irrelevant classes), where , and we treat the classes in as unknown. To this end, we denote by as the sample distribution of the samples corresponding to the classes in (unknown-class distribution). This is a known concept, e.g. [18, 29].
3.1 Loss functions
Entropy Weighted Domain Adversarial Training: In our work, we propose a novel loss that trains the model adversarially with a domain classifier that distinguishes samples from the source and target domains. The domain classifier outputs a probability indicating the likelihood that a sample belongs to the target domain, and instance-level weights are placed on target samples to avoid negative transfer in OSDA. The domain classifier is optimized based on a binary cross-entropy loss. We train the feature extractor to maximize the domain classification loss and the domain classifier to minimize the domain classification loss.
The domain classification loss is calculated by
Here is the partitioning function such that .
The feature extractor is trained to maximize the domain classification loss so that the embeddings from both domains are similar, and the domain classifier is trained to minimize it for good classification performance. The known-class classifier is trained together with a different loss component, which is discussed later. In this adversarial manner, the feature extractor learns to fool the domain classifier, while the domain classifier learns to perform better against the feature extractor.
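As an illustration of the entropy-weighted domain loss described above, the following is a minimal sketch in plain Python. The weight form `exp(-H) / Z`, with the partitioning function `Z` estimated on the same batch, is an assumption chosen for illustration rather than the exact definition used in this work, and all function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a discrete probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weights(target_probs):
    """Soft instance weights for a batch of target predictions.

    ASSUMPTION: w_i = exp(-H_i) / Z, where Z sums exp(-H_j) over the
    same batch, so low-entropy (likely known-class) samples weigh more.
    """
    raw = [math.exp(-entropy(p)) for p in target_probs]
    z = sum(raw)  # partitioning function estimated on this batch
    return [r / z for r in raw]

def weighted_domain_bce(d_source, d_target, target_weights, eps=1e-12):
    """Binary cross-entropy for a domain classifier whose output is the
    probability that a sample comes from the target domain.

    Source terms are averaged uniformly; target terms use the instance
    weights, which already sum to one.
    """
    src = -sum(math.log(1.0 - d + eps) for d in d_source) / len(d_source)
    tgt = -sum(w * math.log(d + eps)
               for w, d in zip(target_weights, d_target))
    return src + tgt

# A confident prediction and a maximally uncertain one (4 known classes).
batch = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
weights = entropy_weights(batch)
loss = weighted_domain_bce([0.1, 0.2], [0.9, 0.6], weights)
```

In this sketch the uncertain sample still contributes to the loss, only with a smaller weight, which is the point of replacing a hard rejection threshold with soft weights.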
Entropy Maximization for Source Unknown Classes: The entropy of classifier predictions is high if the sample is from unknown classes and low otherwise, because the classifier is optimized to minimize the cross-entropy loss on source known classes. Therefore, the known-class predictions are close to one-hot vectors with low entropy. Based on this observation, the unknown-class detection task can be tackled by separating the known classes and unknown classes by their entropy values. In order to force the classifier to predict unknown classes with high entropy, we penalize the model if the prediction for a source unknown sample differs from the uniform distribution, because we want the unknown-class predictions to be as uncertain as possible. The loss function component is calculated as
By maximizing the source unknown entropy and minimizing the source known entropy, we can further separate the known and unknown classes with a clear boundary based on their entropy values.
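One way to penalize deviation from the uniform distribution is sketched below. It uses the KL divergence from the prediction to the uniform distribution, which equals `log(K)` minus the prediction entropy, so minimizing it maximizes entropy. This particular form and the function names are assumptions for illustration, not necessarily the loss used in this work.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a discrete probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def unknown_entropy_loss(batch_probs):
    """Mean KL(p || uniform) over a batch of source-unknown predictions.

    ASSUMPTION: KL(p || u) = log(K) - H(p) for K known classes, so the
    loss is zero exactly when every prediction is uniform.
    """
    k = len(batch_probs[0])
    return sum(math.log(k) - entropy(p) for p in batch_probs) / len(batch_probs)

# A uniform prediction incurs no penalty; a confident one is penalized.
uniform_loss = unknown_entropy_loss([[0.25, 0.25, 0.25, 0.25]])
confident_loss = unknown_entropy_loss([[0.97, 0.01, 0.01, 0.01]])
```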
Total Loss: Let us denote as the parameters of the feature extractor , as the parameters of the classifier and as the parameters of the domain classifier . The total loss to minimize is calculated as
where is the conventional cross-entropy classification loss on source known classes and are weights for the loss components.
To force the domain classifier to minimize the adversarial loss and the feature extractor to maximize the adversarial loss, we seek a saddle point of satisfying the conditions:
At a saddle point, the domain classifier parameters minimize the domain classification loss, the classifier parameters minimize the conventional classification loss, and the feature extractor parameters maximize the domain classification loss (thus the feature divergence is minimized across the two domains) while minimizing the classification loss (thus the features are discriminative).
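The saddle-point conditions described above can be written out under an assumed notation. In the sketch below, $\theta_F$, $\theta_C$ and $\theta_D$ denote the parameters of the feature extractor, classifier and domain classifier; $\mathcal{L}_{cls}$, $\mathcal{L}_{ent}$ and $\mathcal{L}_{d}$ denote the source classification, source unknown entropy and domain classification losses; and $\alpha$, $\beta$, $\gamma$ are the loss weights. These symbols and the sign convention follow the usual gradient-reversal formulation and are assumptions, not necessarily the exact form used in this work.

```latex
\begin{align}
\mathcal{L}(\theta_F, \theta_C, \theta_D)
  &= \alpha\, \mathcal{L}_{cls}(\theta_F, \theta_C)
   + \beta\, \mathcal{L}_{ent}(\theta_F, \theta_C)
   - \gamma\, \mathcal{L}_{d}(\theta_F, \theta_D), \\
(\hat{\theta}_F, \hat{\theta}_C)
  &= \arg\min_{\theta_F, \theta_C} \mathcal{L}(\theta_F, \theta_C, \hat{\theta}_D), \\
\hat{\theta}_D
  &= \arg\max_{\theta_D} \mathcal{L}(\hat{\theta}_F, \hat{\theta}_C, \theta_D).
\end{align}
```

Because $\mathcal{L}_{d}$ enters with a negative sign, maximizing $\mathcal{L}$ over $\theta_D$ minimizes the domain classification loss, while minimizing $\mathcal{L}$ over $\theta_F$ maximizes it.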
In the total loss, in order to align domains more effectively in OSDA, we propose a novel loss component with a domain discriminator and instance-level weights to reduce the domain gap adversarially. The other two components, the source classification loss and the source unknown entropy loss, already appear in existing works such as [18, 29].
Target Sample Classification:
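We propose to fit GEV using the tail of the entropy distribution of source samples. The probability density function for GEV is calculated as
$$f(x; \mu, \sigma, \xi) = \frac{1}{\sigma}\, t(x)^{\xi + 1} e^{-t(x)}, \qquad t(x) = \begin{cases} \left(1 + \xi\, \frac{x - \mu}{\sigma}\right)^{-1/\xi}, & \xi \neq 0, \\ e^{-(x - \mu)/\sigma}, & \xi = 0, \end{cases}$$
and the CDF is calculated as
$$F(x; \mu, \sigma, \xi) = e^{-t(x)},$$
where $\mu$, $\sigma > 0$ and $\xi$ are the location, scale and shape parameters associated with the distribution.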
After fitting GEV using the source samples’ entropy values obtained from the trained model, we calculate the CDF value of each target sample’s entropy under the fitted source GEV. If the CDF value is higher than 0.5, the sample is classified as an unknown sample; otherwise it is passed to the known-class classifier and assigned to one of the known classes. We abbreviate our overall methodology as ADAGEV (Adversarial Domain Adaptation with GEV).
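The decision rule above can be sketched in plain Python as follows. The GEV CDF uses the standard location, scale and shape parameterization; the parameter values in the example are hypothetical rather than fitted, and the function names are illustrative. In practice the parameters would be estimated from the source entropy tail, for instance by maximum likelihood. SciPy's `scipy.stats.genextreme` can do this, but note that its shape parameter has the opposite sign to the one used here.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """CDF of the generalized extreme value distribution.

    mu: location, sigma: scale (> 0), xi: shape.
    """
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    if abs(xi) < 1e-12:
        # Gumbel limit as xi -> 0
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:
        # Outside the support: below the lower bound when xi > 0,
        # above the upper bound when xi < 0.
        return 0.0 if xi > 0.0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def classify_by_entropy(h, mu, sigma, xi, threshold=0.5):
    """Reject a target sample as unknown if its entropy has a CDF value
    above the threshold under the GEV fitted on source entropies."""
    return "unknown" if gev_cdf(h, mu, sigma, xi) > threshold else "known"

# Hypothetical fitted parameters, for illustration only.
MU, SIGMA, XI = 0.8, 0.2, 0.1
low_entropy_label = classify_by_entropy(0.1, MU, SIGMA, XI)
high_entropy_label = classify_by_entropy(1.3, MU, SIGMA, XI)
```

Samples labelled "known" by this rule would then be passed to the known-class classifier for the final prediction.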
3.2 Training strategy
We train our model to minimize the total loss function. Since we want the feature extractor to maximize the adversarial loss, we utilize the gradient reversal layer [8] as the first layer of the domain discriminator, such that during forward propagation the layer is essentially an identity function, while in backward propagation the gradient is multiplied by a negative constant. In this way, the total loss function can be jointly optimized in one single step instead of by alternate training, i.e. optimizing each loss component individually step by step. Avoiding alternate training has two advantages: 1. lower computational costs, as fewer gradient iterations are calculated, and 2. fewer hyperparameters to tune, e.g. there is no need to choose how many times each component is optimized in turn.
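The behavior of a gradient reversal layer can be illustrated with a hand-written forward and backward pass. This is a toy sketch in plain Python; the class name and the scaling factor `lam` are assumptions, and in an automatic differentiation framework the same idea would be implemented as a custom differentiable function.

```python
class GradientReversal:
    """Identity in the forward pass; negates and scales the gradient in
    the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # assumed positive scaling factor

    def forward(self, features):
        # Features pass through unchanged to the domain discriminator.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is reversed,
        # so a single minimization step on the discriminator loss pushes
        # the feature extractor to maximize that loss.
        return [-self.lam * g for g in grad_output]

layer = GradientReversal(lam=2.0)
out = layer.forward([1.0, -2.0])        # same values as the input
grad_in = layer.backward([0.5, -1.0])   # sign flipped and scaled by lam
```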
To estimate the partitioning function, we have tried the following approaches: 1. using the same batch of data as the one sampled to calculate the numerator of the instance weights; 2. using a different batch of data of the same size; 3. using the sum of the values from the first two approaches (sum of weights of the samples). We find that in practice the first approach achieves the best overall performance with the lowest computational cost, and thus it is used in ADAGEV. The entire algorithm is exhibited in Algorithm 1.
4 Experimental results
We evaluate ADAGEV on three conventional benchmark datasets in domain adaptation: Digits, Office-31 [27] and VisDA [24]. In the experiments, the source domain samples are labeled and the target domain samples are not labeled, and the task is to either classify target samples into one of the known classes or reject them as unknown. We use the conventional OSDA metrics OS and OS* as performance measures, which denote the mean accuracy for all classes (all known classes and the additional unknown class) and the mean accuracy for the known classes, respectively. The hyperparameters are set as (discussed later). The benchmark state-of-the-art methods are discussed in Section 2, and the accuracies are obtained from the corresponding papers except [18], which only reported the OS accuracy. To complete the comparison for OS*, we ran the source code of [18], and after finetuning we are able to achieve a slightly better OS performance than the original paper; both the OS and OS* accuracies from this run are reported in our experiments. We discuss the network architectures for each experiment in the following subsections. For a fair comparison, the same network architecture is used in ADAGEV as in the benchmark algorithms (which all use the same architecture). The stopping criteria are also discussed in the following subsections.
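For concreteness, the OS and OS* metrics defined above can be computed as in the following sketch. The function names and the `"unknown"` label are illustrative assumptions; all target unknown classes are assumed to be mapped to that single label beforehand.

```python
def per_class_accuracy(y_true, y_pred, classes):
    """Accuracy of each class in `classes` that occurs in y_true."""
    acc = {}
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        if idx:  # classes with no ground-truth samples are skipped
            acc[c] = sum(1 for i in idx if y_pred[i] == c) / len(idx)
    return acc

def os_metrics(y_true, y_pred, known_classes, unknown_label="unknown"):
    """Return (OS, OS*): mean per-class accuracy over the known classes
    plus the unknown class, and over the known classes only."""
    acc = per_class_accuracy(
        y_true, y_pred, list(known_classes) + [unknown_label])
    known = [acc[c] for c in known_classes if c in acc]
    os_star = sum(known) / len(known)
    os_all = sum(acc.values()) / len(acc)
    return os_all, os_star

# Toy example with two known classes (0 and 1) and one unknown class.
y_true = [0, 0, 1, 1, "unknown", "unknown"]
y_pred = [0, 1, 1, 1, "unknown", 0]
os_all, os_star = os_metrics(y_true, y_pred, known_classes=[0, 1])
```

In the toy example the per-class accuracies are 0.5, 1.0 and 0.5, so OS* is 0.75 and OS is about 0.667; OS falls below OS* whenever the unknown class is recognized less accurately than the known classes on average.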
4.1 Digits Experiments
In the Digits experiments, we follow the conventions in OSDA to use USPS and MNIST to conduct evaluations, and we perform domain adaptation from SVHN to MNIST, USPS to MNIST and MNIST to USPS. Note that the USPS to MNIST and the MNIST to USPS tasks are easier than the SVHN to MNIST task, since MNIST and USPS both contain 2-dimensional black-and-white images of hand-written digits, while SVHN contains 3-dimensional RGB images of real-world house numbers taken from Google Street View images. Therefore, the domain gap between SVHN and MNIST is significantly larger than in the other two experiments. For consistency, we use digits 0, 1, 2, 3 as known classes, 4, 5, 6 as source unknown classes and 7, 8, 9 as target unknown classes, following the conventional protocols. Also following the existing works, we use the same CNN architectures (details can be found in the appendix of [29]) for a fair comparison. We compare ADAGEV with the state-of-the-art methods in both CSDA and OSDA which are discussed in Section 2. We report accuracies for ADAGEV after training for 20 epochs. The benchmarks either do not mention their stopping criteria or use 200 epochs.
Table 1 shows the overall performance comparison, where the model trained on source only performs the worst, which is expected because it does not align the two domains at all. The CSDA benchmarks DAN and DANN perform slightly better than training only on source data, because they reduce domain misalignment, but they do not reject unknown classes in domain adaptation. The OSDA methods OSBP, STA and KASE outperform the CSDA methods but only by a small margin (1.6%, 2.8%, 3.6%, respectively), from which we observe the difficulty of the OSDA scenario. ADAGEV achieves 92.8% OS accuracy, outperforming the previous state-of-the-art OSDA methods by 9.6%, 8.4% and 7.6%, respectively. The method also outperforms the CSDA benchmarks by 12.1% and 11.2%. In the most difficult task S-M, our method achieves more than 20% improvement on both OS and OS* accuracies, which again validates the effectiveness of the proposed method in the OSDA scenario. We also experiment with 5 different random seeds, and the standard deviations of the average OS and OS* are 0.49 and 0.37, indicating that the performance is robust and stable. We also observe that OS* is always higher than OS for the same method, meaning that the unknown class accuracy is lower than the mean accuracy on the known classes. This points to a future research direction in OSDA: more attention should be paid to the detection and separation of the unknown classes. Our relative OS improvements over the benchmarks are 25.1%, 15.0%, 13.7%, 11.5%, 10.0% and 8.92%, which are substantial.
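The relation between the absolute gains and the relative improvements quoted above can be checked with a short calculation. The baseline accuracies below are not copied from a table; they are derived by subtracting the stated gains from the 92.8% OS accuracy of ADAGEV, and the function name is illustrative.

```python
def relative_improvement(ours, absolute_gain):
    """Relative improvement (in %) over a baseline whose accuracy is
    `ours - absolute_gain` percentage points."""
    baseline = ours - absolute_gain
    return 100.0 * absolute_gain / baseline

ADAGEV_OS = 92.8
# Gains over the two CSDA benchmarks, then the three OSDA benchmarks.
absolute_gains = [12.1, 11.2, 9.6, 8.4, 7.6]
relative = [round(relative_improvement(ADAGEV_OS, g), 1)
            for g in absolute_gains]
```

The resulting values agree with the last five relative improvements listed in the text after rounding.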
Ablation study In order to better understand which component of our model contributes most to the performance, we conduct ablation studies by removing components from our model. The ablation study results are exhibited in Figure 1, where only the OS accuracy is shown for simplicity since the OS* accuracy shows a very similar trend, i.e. the OS* accuracy is consistently higher than the OS accuracy by 1% to 2%. Note that in the second experiment (i.e. w/o EVT) we replace EVT by a binary classifier that classifies target samples into known/unknown classes, where the classifier is trained using source known and unknown samples.
From Figure 1 we observe that removing the soft reweighting strategy contributes most to the accuracy drop, i.e. the average OS drops by 10.0% without entropy-based reweighting. Removing the EVT component is also detrimental to the performance, where the average OS drops by 7.3% without EVT. Including both reweighting and EVT in ADAGEV achieves the new state-of-the-art performance in OSDA.
Experiments with less source data In domain adaptation tasks, it is usually assumed that the source domain contains a sufficient amount of labeled data. To see how our method performs with less source data, we limit the source data to 10%, 25% and 50%. From Figure 2, both the OS and OS* accuracies increase when more source data are available. The model trained with 10% of the source data achieves performance comparable to the benchmarks, and the model trained with 25% of the source data already outperforms the previous state-of-the-art OSDA method. This validates that ADAGEV performs robustly even when less source data are available.
Sensitivity on hyperparameters We also experiment on varying the loss function weight hyperparameters to observe their sensitivity. The results are shown in Figure 3, where in general, the hyperparameter for source classification loss is less sensitive in that varying its value results in the least accuracy changes, which we postulate is because the source knowledge can already be learned well for a small weight. The hyperparameters for domain discriminator loss and unknown entropy maximization loss are more sensitive to the changes where either increasing or decreasing the values results in an accuracy drop. The combination which provides the best accuracy is used as the final hyperparameters across all datasets.
4.2 Office-31 Experiments
We also experiment on the Office-31 dataset, another frequently used domain adaptation dataset containing three domains: Amazon, DSLR and Webcam. The following 6 domain adaptation tasks are created: Amazon to DSLR, DSLR to Amazon, Amazon to Webcam, Webcam to Amazon, DSLR to Webcam and Webcam to DSLR. Each domain contains 31 classes, and the images are either collected directly from www.amazon.com or they are office environment images taken under different lighting conditions, etc. using a webcam or a digital single-lens reflex (DSLR) camera. Therefore, the domain gap between DSLR and Webcam is slightly smaller. Following the same experimental setup as in previous works, we set the 10 classes shared with the Office-Caltech dataset [11] as the known classes; of the remaining classes in alphabetical order, the first 10 are set as the unknown classes for the source domain and the last 11 are set as the unknown classes for the target domain. We use the same CNN model, AlexNet [16], as in previous works across all methods for a fair comparison. Following the conventions in [29], we report accuracies after training for 50 epochs. The experimental results are presented in Table 2.
From Table 2, the CSDA methods generally have accuracy below 80% since they are unable to address the additional unknown classes during adaptation. The OSDA methods outperform the CSDA methods in this experiment by a large margin, where the previous state-of-the-art method KASE achieves 87.0% OS accuracy and 87.8% OS* accuracy, significantly outperforming other methods. ADAGEV achieves 88.4% OS and 89.8% OS* accuracies, consistently beating the previous state-of-the-art method in all domain adaptation experiments for both known and unknown classes. Our relative OS improvements over the benchmarks vary from 1.6% to 28.3%, with an average of 11.5%.
4.3 VisDA Experiments
The VisDA dataset [24] is a conventional but more difficult benchmark in domain adaptation. The source domain contains synthetic images and the target domain contains real-world images. For a fair comparison, we follow the conventions and set “bicycle,” “bus,” “car,” “motorcycle,” “train,” “truck” as the 6 known categories. In alphabetical order, the first 3 remaining categories “aeroplane,” “horse,” “knife” are set as source unknown categories and the last 3 remaining categories “person,” “plant,” “skateboard” are set as target unknown categories. We report accuracies after training for 50 epochs.
From Table 3, the previous state-of-the-art methods DANN and OSBP consistently outperform the model trained only on source data, while ADAGEV achieves OS accuracy gains of 17.7%, 8.9%, 5.0% and OS* accuracy gains of 7.6%, 4.3%, 5.8% compared with the benchmarks, which again validates the effectiveness of the proposed method even in this difficult task. Our relative OS improvements over the benchmarks are 35.7%, 15.2% and 8.0%.
We propose an open set domain adaptation method which models the tail of the entropy distributions using EVT and utilizes an instance-level reweighting strategy to detect and reject unknown samples. Experiments show that the method achieves the new state-of-the-art performance by beating the existing benchmarks by a large margin for three conventional domain adaptation datasets.
References
[1] Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
[2] Learning factorized representations for open-set domain adaptation. In ICLR, 2019.
[3] Towards open set deep networks. In CVPR, 2016.
[4] Open set domain adaptation. In ICCV, 2017.
[5] Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[6] Open set incremental learning for automatic target recognition. IEEE T-GRS, 2019.
[7] Self-ensembling for visual domain adaptation. In ICLR, 2017.
[8] Unsupervised domain adaptation by backpropagation. In ICML.
[9] Recent advances in open set recognition: a survey. IEEE T-PAMI, 2020.
[10] Detectron. https://github.com/facebookresearch/detectron, 2018.
[11] Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
[12] Cross language text classification via subspace co-regularized multi-view learning. In ICML, 2012.
[13] Deep residual learning for image recognition. In CVPR, 2015.
[14] Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
[15] Domain adaptation by mixture of alignments of second- or higher-order scatter tensors. In CVPR.
[16] ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[17] Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[18] Known-class aware self-ensemble for open set domain adaptation. arXiv:1905.01068, 2019.
[19] Separate to adapt: open set domain adaptation via progressive separation. In CVPR, 2019.
[20] Conditional adversarial domain adaptation. In NIPS, 2018.
[21] Unsupervised domain adaptation with residual transfer networks. In NIPS, 2016.
[22] Reading digits in natural images with unsupervised feature learning. In NIPS, 2011.
[23] C2AE: class conditioned auto-encoder for open-set recognition. In CVPR, 2019.
[24] VisDA: the visual domain adaptation challenge. arXiv:1710.06924, 2017.
[25] A unified framework for multimodal domain adaptation. In ACM Multimedia, 2018.
[26] Faster R-CNN: towards real-time object detection with region proposal networks. IEEE T-PAMI, 2015.
[27] Adapting visual category models to new domains. In ECCV, 2010.
[28] Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018.
[29] Open set domain adaptation by backpropagation. In ECCV, 2018.
[30] Wasserstein distance guided representation learning for domain adaptation. In AAAI, 2017.
[31] Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[32] Towards VQA models that can read. In CVPR, 2019.
[33] Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In ICLR, 2017.
[34] Adversarial discriminative domain adaptation. In CVPR, 2017.
[35] Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[36] Semi-supervised domain adaptation with subspace learning for visual recognition. In CVPR, 2015.
[37] Sparse representation-based open set recognition. IEEE T-PAMI, 2017.
[38] Learning to count objects in natural images for visual question answering. In ICML, 2018.