In recent years, deep learning has made impressive progress in the classification task. The success of deep neural networks is based on large-scale datasets with a tremendous amount of labeled samples. However, in many practical situations, a large number of labeled samples are inaccessible. Deep neural networks pre-trained on existing datasets cannot generalize well to new data with different appearance characteristics. Essentially, the difference in data distribution between domains makes it difficult to transfer knowledge from the source to the target domain. This transfer problem is known as domain shift.
Unsupervised domain adaptation (UDA) tackles the above domain shift problem while transferring the model from a labeled source domain to an unlabeled target domain. The common idea of UDA is to make features extracted by neural networks similar between domains [17, 6]. In particular, the domain-adversarial learning methods [6, 33] train a domain discriminator to distinguish whether a feature comes from the source domain or the target domain. To fool the discriminator, the feature generator has to output similar source and target feature distributions. However, it is challenging for this type of UDA method to learn discriminative features on the target domain [28, 38], because it overlooks whether the aligned target features can be discriminated by the classifier.
Self-training based methods have become another solution for UDA and achieve state-of-the-art performance on multiple tasks. A typical way of self-training is to generate pseudo-labels from the target predictions with large probability and train the model with these pseudo-labels. In this way, the features contributing to the target classification are enhanced. However, the alignment between the source and target feature distributions is implicit and has no theoretical guarantee. With unmatched target features, self-training based methods can suffer a drop in performance in the case of shallow networks [41, 27].
In conclusion, domain-adversarial learning is able to align the feature distributions with a theoretical guarantee, while self-training can learn discriminative target features. It is ideal to have a method that combines the advantages of these two types of methods. To achieve this goal, we first analyze the loss function of self-training with pseudo-labels on the unlabeled target domain. Previous works on learning from noisy labels [29, 40] proposed accounting for noisy labels with a confusion matrix. Following their approach, we reveal that the loss function using pseudo-labels differs from the loss function learned with the ground truth by a confusion matrix. Concretely, the commonly used cross entropy loss becomes:
$$\mathcal{L}_{CE}(x, y) = -\sum_{k=1}^{K} p(y=k \mid x)\,\log p(\hat{y}=k \mid x) = -\sum_{k=1}^{K}\sum_{l=1}^{K} p(y=k \mid \hat{y}=l, x)\,p(\hat{y}=l \mid x)\,\log p(\hat{y}=k \mid x) = -\sum_{k,l=1}^{K} \eta^{(x)}_{kl}\,p(\hat{y}=l \mid x)\,\log p(\hat{y}=k \mid x),$$
where $K$ represents the number of categories, $y$ is the ground truth label for the sample $x$, $\hat{y}$ is the model prediction, i.e., pseudo-labels, and $\eta^{(x)}_{kl} = p(y=k \mid \hat{y}=l, x)$ is the $kl$-th component of the confusion matrix $\eta^{(x)}$.
If the confusion matrix can be estimated correctly, we can minimize the noise in pseudo-labels and boost the training of target samples. In this paper, we propose a novel method called Adversarial-learned Loss for Domain Adaptation (ALDA). As illustrated in Fig. 1, we generate the confusion matrix with a discriminator network. After multiplying with the confusion matrix, the pseudo-label vector turns into a corrected label vector, which serves as the training label on the target domain. As there is no direct way to optimize the confusion matrix, we learn it with noise-correcting domain discrimination. Specifically, the domain discriminator has to produce different corrected labels for different domains, while the feature generator aims to confuse the domain discriminator. The adversarial process finally leads to a proper confusion matrix on the target domain.
The main contributions of this paper are as follows:
- We analyze the noise in pseudo-labels with the confusion matrix, and propose our Adversarial-learned Loss for Domain Adaptation (ALDA) method, which uses adversarial learning to estimate the confusion matrix.
- We theoretically prove that ALDA can align the feature distributions between domains and correct the target prediction of the classifier. In this way, ALDA takes the strengths of domain-adversarial learning and self-training based methods.
- ALDA can outperform state-of-the-art methods on four standard unsupervised domain adaptation datasets.
Unsupervised Domain Adaptation. With the success of deep learning, unsupervised domain adaptation (UDA) [34, 17, 19, 6] has been embedded into deep neural networks to transfer the knowledge between the labeled source domain and the unlabeled target domain. It has been revealed that the accuracy of the classifier on the target domain is bounded by the accuracy on the source domain and the domain discrepancy [1]. Therefore, the major line of current UDA study is to align the distributions between the source and target domains. The distribution divergence between domains can be measured by Maximum Mean Discrepancy (MMD) [34, 17] or second-order statistics [30].
Domain-adversarial Methods. The domain-adversarial learning-based methods [6, 33] utilize a domain discriminator to represent the domain discrepancy. These methods play a minimax game: the discriminator is trained to distinguish whether a feature comes from a source or a target sample, while the feature generator has to confuse the discriminator. However, due to practical issues, e.g., mode collapse [2], domain-adversarial learning cannot match the multi-modal distributions. Recently, by taking the classifier prediction into account [18, 10], the discriminator can match the distributions of each category, which significantly enhances the final classification results.
Self-training Methods. Semi-supervised learning [14, 8, 31] is a task similar to domain adaptation, which also deals with labeled and unlabeled samples. With the data "manifold" assumption, some methods train the model based on its own predictions to smooth the decision boundary around the data. In particular, entropy minimization [8] uses the prediction entropy as a regularizer for unlabeled samples. The pseudo-label method [14] selects high-confidence predictions as training targets for unlabeled samples. The Mean Teacher method [31] sets the exponential moving average of the model as the teacher model and lets the prediction of the teacher model guide the original model.
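For unsupervised domain adaptation, we have a labeled source domain $\mathcal{D}_S = \{(x_s^i, y_s^i)\}_{i=1}^{n_s}$ and an unlabeled target domain $\mathcal{D}_T = \{x_t^j\}_{j=1}^{n_t}$. We train a generator network $G$ to extract the high-level feature $G(x)$ from the data $x_s$ or $x_t$, and a classifier network $C$ to finish the $K$-class classification task on the feature space. The classifier outputs probability vectors $p_s, p_t \in \mathbb{R}^K$, indicating the prediction probability of $x_s, x_t$ respectively.
In this paper, we consider providing a proper loss function on the target domain. Theoretically, the ideal loss function is the loss with the ground truth $y_t$:
$$\mathcal{L}_T(x_t, \mathcal{L}) = \sum_{k=1}^{K} p(y_t = k \mid x_t)\,\mathcal{L}(p_t, k),$$
where $\mathcal{L}$ is a basic loss function, e.g., cross entropy (CE) or mean absolute error (MAE).
However, the target ground truth is unavailable in the UDA setting. The pseudo-label method [14, 41] substitutes $y_t$ with the model prediction: $\hat{y}_t = \arg\max_k p_t^k$ if $\max_k p_t^k > \delta$, where $\delta$ is a threshold. As mentioned in the introduction, we analyze the difference between the ideal loss and the loss with pseudo-labels:
$$\mathcal{L}_T(x_t, \mathcal{L}) = \sum_{k=1}^{K}\sum_{l=1}^{K} p(y_t = k \mid \hat{y}_t = l, x_t)\,p(\hat{y}_t = l \mid x_t)\,\mathcal{L}(p_t, k) = \sum_{k,l=1}^{K} \eta^{(x_t)}_{kl}\,p(\hat{y}_t = l \mid x_t)\,\mathcal{L}(p_t, k),$$
where $\eta^{(x_t)}$ is the confusion matrix with components $\eta^{(x_t)}_{kl} = p(y_t = k \mid \hat{y}_t = l, x_t)$. The confusion matrix is unknown on the unlabeled target domain. For brevity, we define $c^{(x_t)}_k = \sum_{l} \eta^{(x_t)}_{kl}\,p(\hat{y}_t = l \mid x_t)$ and name $c^{(x_t)}$ as the corrected label vector.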
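As a small numerical illustration of the corrected label vector, the sketch below multiplies a confusion matrix by a pseudo-label distribution. The 3-class matrix `eta`, the function name `corrected_label`, and the indexing convention `eta[k][l] = p(y = k | y_hat = l, x)` are assumptions made for this example, not values from the paper.

```python
def corrected_label(eta, q):
    """Return c with c[k] = sum_l eta[k][l] * q[l].

    eta[k][l] is read as p(y = k | y_hat = l, x) (assumed convention),
    q[l] is the pseudo-label distribution p(y_hat = l | x).
    """
    K = len(q)
    return [sum(eta[k][l] * q[l] for l in range(K)) for k in range(K)]


# Hypothetical 3-class confusion matrix; every column sums to 1.
eta = [
    [0.8, 0.1, 0.0],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.8],
]
q = [0.0, 1.0, 0.0]  # one-hot pseudo-label for class 1

c = corrected_label(eta, q)
print(c)  # the column of eta selected by the pseudo-label
```

With a one-hot pseudo-label, the corrected label vector is simply the matching column of the confusion matrix, which is why estimating that column is enough to correct a hard pseudo-label.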
In previous works studying noisy labels, it is commonly assumed that the confusion matrix is conditionally independent of the inputs $x_t$ and uniform with noise rate $\alpha$. The unhinged loss has been proved to be robust to the uniform noise [36, 7]:
$$\mathcal{L}_{unh}(p, k) = 1 - p_k. \tag{5}$$
However, these assumptions cannot hold in the case of pseudo-labels, which makes the problem more intractable.
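For reference, the robustness of the unhinged loss under the uniform-noise assumption can be checked directly. The derivation below is a sketch under stated assumptions: the uniform confusion matrix is taken to be $\eta_{kl} = 1-\alpha$ for $k = l$ and $\alpha/(K-1)$ for $k \neq l$, $q_l$ abbreviates the pseudo-label probability $p(\hat{y}_t = l \mid x_t)$, and $p_k$ is the classifier output with $\sum_k p_k = 1$.

```latex
\begin{aligned}
\sum_{k,l}\eta_{kl}\,q_l\,(1-p_k)
 &= \sum_{k}\Big[(1-\alpha)\,q_k + \tfrac{\alpha}{K-1}\,(1-q_k)\Big](1-p_k) \\
 &= \Big(1-\tfrac{\alpha K}{K-1}\Big)\sum_{k} q_k\,(1-p_k)
    + \tfrac{\alpha}{K-1}\sum_{k}(1-p_k) \\
 &= \Big(1-\tfrac{\alpha K}{K-1}\Big)\sum_{k} q_k\,(1-p_k) + \alpha .
\end{aligned}
```

Under these assumptions the ideal loss is an affine function of the pseudo-label loss, with a positive slope whenever $\alpha < (K-1)/K$, so both are minimized by the same classifier. This argument relies on the noise being uniform and input-independent, which is exactly what fails for pseudo-labels.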
The general idea of our method is that if we can adequately estimate the noise matrix $\eta^{(x_t)}$, the noise in pseudo-labels will be corrected and we can approximately optimize the ideal loss function on the target domain.
Firstly, to simplify the noisy label problem, we assume that the noise is class-wise uniform with vector $\xi^{(x_t)} \in \mathbb{R}^K$.
Definition 1. Noise is class-wise uniform with vector $\xi^{(x_t)} \in \mathbb{R}^K$, if $\eta^{(x_t)}_{kl} = \xi^{(x_t)}_k$ for $k = l$, and $\eta^{(x_t)}_{kl} = \frac{1 - \xi^{(x_t)}_l}{K-1}$ for $k \neq l$.
In this work, we propose to use an extra neural network, called the noise-correcting domain discriminator, to learn the vector $\xi^{(x_t)}$.
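A class-wise uniform confusion matrix can be built from the noise vector alone. The sketch below is illustrative only: the function name `classwise_uniform_eta` and the sample vector `xi` are hypothetical, and the construction assumes the reading of Definition 1 in which the diagonal entry of column `l` is `xi[l]` and the remaining mass is spread evenly over the other `K - 1` classes.

```python
def classwise_uniform_eta(xi):
    """Build a K x K confusion matrix from a noise vector xi.

    Column l holds p(y = k | y_hat = l): xi[l] on the diagonal and
    (1 - xi[l]) / (K - 1) everywhere else (assumed reading of Definition 1).
    """
    K = len(xi)
    return [
        [xi[l] if k == l else (1.0 - xi[l]) / (K - 1) for l in range(K)]
        for k in range(K)
    ]


xi = [0.9, 0.6, 0.75]  # hypothetical per-class probabilities that the pseudo-label is correct
eta = classwise_uniform_eta(xi)

# Each column is a probability distribution over the true class.
column_sums = [sum(eta[k][l] for k in range(len(xi))) for l in range(len(xi))]
print(column_sums)
```

The assumption reduces the unknowns per sample from a full $K \times K$ matrix to $K$ numbers, which is what makes it practical to predict them with a network.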
Noise-correcting Domain Discrimination
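As shown in Fig. 2, the noise-correcting domain discriminator $D$ is a multi-layer neural network, which takes the deep feature $G(x)$ as the input and outputs a multi-class score vector $D(G(x)) \in \mathbb{R}^K$. After a sigmoid layer, the discriminator produces the noise vector $\xi^{(x)} = \sigma(D(G(x)))$. Each component of $\xi^{(x)}$ denotes the probability that the pseudo-label is the same as the correct label: $\xi^{(x)}_k = p(y = k \mid \hat{y} = k, x)$.
We adopt the idea of domain-adversarial learning [6], which makes the discriminator and the generator play a minimax game. Instead of letting the discriminator perform a domain classification task, we let the discriminator generate different noise vectors for the source and target domains. As illustrated in Fig. 2, for the source feature $G(x_s)$, the discriminator aims to minimize the discrepancy between the corrected label vector $c^{(x_s)}$ and the ground truth $y_s$. The adversarial loss for the source data is:
$$\mathcal{L}_{Adv}(x_s, y_s) = \mathcal{L}_{BCE}(c^{(x_s)}, y_s) = \sum_{k=1}^{K} -y_s^k \log c^{(x_s)}_k - (1 - y_s^k)\log\big(1 - c^{(x_s)}_k\big),$$
where $y_s^k$ is the $k$-th component of the one-hot ground truth label and $\mathcal{L}_{BCE}$ is the binary cross entropy summed over the $K$ components.
As for the target feature $G(x_t)$, the discriminator does the opposite. The discriminator will correct pseudo-labels to the opposite distribution $u^{(\hat{y}_t)}$, in which $u^{(\hat{y}_t)}_k = 0$ for $k = \hat{y}_t$ and $u^{(\hat{y}_t)}_k = \frac{1}{K-1}$ for $k \neq \hat{y}_t$. The adversarial loss for the target data is:
$$\mathcal{L}_{Adv}(x_t) = \mathcal{L}_{BCE}(c^{(x_t)}, u^{(\hat{y}_t)}).$$
The total adversarial loss becomes:
$$\mathcal{L}_{Adv}(x_s, y_s, x_t) = \mathcal{L}_{Adv}(x_s, y_s) + \mathcal{L}_{Adv}(x_t).$$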
The discriminator needs to minimize the loss function to distinguish between the source and target feature. On the other hand, the generator has to fool the discriminator, by maximizing the above loss function. Compared to the common domain-adversarial learning, this adversarial loss takes the classifier prediction and the label information into consideration. In this way, our noise-correcting domain discriminator can achieve the class-wise feature alignment.
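The two adversarial terms can be sketched for a single sample in plain Python. This is an illustration under assumptions, not the authors' implementation: the discrepancy is taken to be a binary cross entropy summed over classes, the pseudo-label is hard (one-hot), the noise is class-wise uniform, and all function names and score values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def corrected_vector(scores, y_hat):
    """Corrected label for a hard pseudo-label y_hat under class-wise uniform noise."""
    K = len(scores)
    xi = sigmoid(scores[y_hat])  # estimated probability that the pseudo-label is correct
    return [xi if k == y_hat else (1.0 - xi) / (K - 1) for k in range(K)]


def bce(c, target, eps=1e-12):
    """Binary cross entropy summed over the K components."""
    return -sum(
        t * math.log(ck + eps) + (1.0 - t) * math.log(1.0 - ck + eps)
        for ck, t in zip(c, target)
    )


def adv_loss_source(scores, y_hat, y_true):
    # Source: pull the corrected label towards the one-hot ground truth.
    target = [1.0 if k == y_true else 0.0 for k in range(len(scores))]
    return bce(corrected_vector(scores, y_hat), target)


def adv_loss_target(scores, y_hat):
    # Target: push the corrected label towards the "opposite" distribution,
    # 0 on the pseudo-label and 1 / (K - 1) elsewhere.
    K = len(scores)
    target = [0.0 if k == y_hat else 1.0 / (K - 1) for k in range(K)]
    return bce(corrected_vector(scores, y_hat), target)


confident = [5.0, 0.0, 0.0]   # discriminator trusts pseudo-label 0
doubtful = [-5.0, 0.0, 0.0]   # discriminator distrusts pseudo-label 0
print(adv_loss_source(confident, 0, 0), adv_loss_source(doubtful, 0, 0))
print(adv_loss_target(confident, 0), adv_loss_target(doubtful, 0))
```

In this toy setting the source term rewards trusting a correct pseudo-label while the target term rewards distrusting it, so a discriminator that minimizes both has to tell the two domains apart, and a generator that maximizes them has to make the features indistinguishable.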
As revealed in the works of generative adversarial networks (GANs) [20], the training process of adversarial learning can be unstable. Following [22], we add a classification task on the source domain to the discriminator to make its training more stable. Consequently, the discriminator not only has to distinguish the source and target domains but also has to correctly classify the source samples.
To embed the classification task into training, we add a regularization term to the loss of the discriminator:
$$\mathcal{L}_{Reg}(x_s, y_s) = \mathcal{L}_{CE}\big(p_D^{(x_s)}, y_s\big),$$
where $p_D^{(x_s)} = \mathrm{softmax}(D(G(x_s)))$ and $\mathcal{L}_{CE}$ is the cross entropy loss. Then the final loss function for the discriminator becomes:
$$\min_{D}\; \mathbb{E}_{(x_s, y_s),\, x_t}\big[\mathcal{L}_{Adv}(x_s, y_s, x_t) + \mathcal{L}_{Reg}(x_s, y_s)\big].$$
Corrected Loss Function
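After the adversarial learning of the confusion matrix $\eta^{(x_t)}$, we can construct a proper loss function for the target samples. As the unhinged loss (Eq. 5) is robust to the uniform part of the noise, we choose the unhinged loss $\mathcal{L}_{unh}$ as the basic loss function $\mathcal{L}$:
$$\mathcal{L}_T(x_t, \mathcal{L}_{unh}) = \sum_{k,l} \eta^{(x_t)}_{kl}\,p(\hat{y}_t = l \mid x_t)\,\mathcal{L}_{unh}(p_t, k) = \sum_{k} c^{(x_t)}_k\,\mathcal{L}_{unh}(p_t, k).$$
Together with the supervised loss on the source domain, the losses for the classifier and the generator become:
$$\min_{C}\; \mathbb{E}\big[\mathcal{L}_{CE}(p_s, y_s) + \lambda\,\mathcal{L}_T(x_t, \mathcal{L}_{unh})\big],$$
$$\min_{G}\; \mathbb{E}\big[\mathcal{L}_{CE}(p_s, y_s) + \lambda\,\mathcal{L}_T(x_t, \mathcal{L}_{unh}) - \lambda\,\mathcal{L}_{Adv}(x_s, y_s, x_t)\big],$$
where $\lambda$ is a trade-off parameter.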
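The corrected target loss amounts to weighting the per-class unhinged loss by the corrected label vector. The snippet below is a minimal sketch assuming the unhinged loss takes the form `1 - p[k]`; the function names and the numeric vectors are hypothetical.

```python
def unhinged(p, k):
    """Unhinged loss for class k, assumed here to be 1 - p[k]."""
    return 1.0 - p[k]


def corrected_target_loss(c, p):
    """Weight the per-class unhinged loss by the corrected label vector c."""
    return sum(ck * unhinged(p, k) for k, ck in enumerate(c))


p_t = [0.2, 0.7, 0.1]        # classifier prediction on a target sample
c_soft = [0.1, 0.7, 0.2]     # corrected label from the discriminator (hypothetical)
c_hard = [0.0, 1.0, 0.0]     # uncorrected one-hot pseudo-label

print(corrected_target_loss(c_soft, p_t))
print(corrected_target_loss(c_hard, p_t))
```

When the corrected label is one-hot, the expression reduces to plain self-training on the pseudo-label; a softer corrected label spreads part of the training signal onto the other classes.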
In the feature space $\mathcal{F}$ generated by the generator $G$, the source and target feature distributions are $\mathcal{P}_s$ and $\mathcal{P}_t$ respectively. If we assume that both distributions are continuous with densities $P_s$ and $P_t$, for a feature vector $f \in \mathcal{F}$, the probabilities that it belongs to the source and target distributions are $P_s(f)$ and $P_t(f)$ respectively.
Theorem 1. When the noise-correcting domain discrimination achieves the optimal point $D^*$ and $G^*$, the feature distributions generated by $G^*$ are aligned: $P_s(f) = P_t(f)$.
Proof. The proof is given in the supplemental material.
As a result, the noise-correcting domain discrimination can align the feature distributions between the source and target domains. According to the theory of [1], the expected error on the target samples can be bounded by the expected error on the source domain and the feature discrepancy between domains. Therefore, the target expected error of our noise-correcting domain discrimination is theoretically bounded.
|Method|A→W|D→W|W→D|A→D|D→A|W→A|Avg|

|Method|Ar→Cl|Ar→Pr|Ar→Rw|Cl→Ar|Cl→Pr|Cl→Rw|Pr→Ar|Pr→Cl|Pr→Rw|Rw→Ar|Rw→Cl|Rw→Pr|Avg|
|ResNet-50 [9]|34.9|50.0|58.0|37.4|41.9|46.2|38.5|31.2|60.4|53.9|41.2|59.9|46.1|
|DANN [6]|45.6|59.3|70.1|47.0|58.5|60.9|46.1|43.7|68.5|63.2|51.8|76.8|57.6|
|JAN [19]|45.9|61.2|68.9|50.4|59.7|61.0|45.8|43.4|70.3|63.9|52.4|76.8|58.3|
|CDAN+E [18]|50.7|70.6|76.0|57.6|70.0|70.0|57.4|50.9|77.3|70.9|56.7|81.6|65.8|
|TAT [16]|51.6|69.5|75.4|59.4|69.5|68.6|59.5|50.5|76.8|70.9|56.6|81.6|65.8|
|ALDA|53.7|70.1|76.4|60.2|72.6|71.5|56.8|51.9|77.1|70.2|56.3|82.1|66.6|
Furthermore, we can prove that by optimizing the corrected loss function, the noise in pseudo-labels is reduced.
Theorem 2. When the optimal point $D^*$ and $G^*$ are achieved in Theorem 1, if there is an optimal labeling function in the feature space , then and , we have:
where denotes that for and otherwise.
Proof. The proof is given in the supplemental material.
As Theorem 2 shows, when we optimize the target loss , the loss of pseudo-labels will be enhanced when () and suppressed otherwise (). In this way, the training of classifier can be corrected by the discriminator on the target domain and will be more efficient than the original pseudo-label method.
We evaluate the proposed adversarial-learned loss for domain adaptation (ALDA) against state-of-the-art approaches on four standard unsupervised domain adaptation datasets: digits, Office-31, Office-Home, and VisDA-2017.
Digits. Following the evaluation protocol of [18], we experiment on three adaptation scenarios: USPS to MNIST (U→M), MNIST to USPS (M→U), and SVHN to MNIST (S→M). MNIST [13] contains images of handwritten digits and USPS [11] contains images. Street View House Numbers (SVHN) [21] consists of images with digits and numbers in natural scenes. We report the evaluation results on the test sets of MNIST and USPS.
Office-31 [26] is a commonly used dataset for unsupervised domain adaptation, which contains images and categories collected from three domains: Amazon (A), Webcam (W) and DSLR (D). We evaluate all methods across six domain adaptation tasks: A→W, D→W, W→D, A→D, D→A and W→A.
Office-Home [37] is a more difficult domain adaptation dataset than Office-31, including images from four different domains: Artistic images (Ar), Clip Art (Cl), Product images (Pr) and Real-World (Rw). For each domain, the dataset contains images of object categories that are common in office and home scenarios. We evaluate all methods in all 12 adaptation scenarios.
VisDA-2017 [25] is a large-scale dataset and challenge for unsupervised domain adaptation from simulation to real. The dataset contains synthetic images as the source domain and real-world images as the target domain. object categories are shared by these two domains. Following previous works [28, 18], we evaluate all methods on the validation set of VisDA.
|Method|U→M|M→U|S→M|Avg|

|Method|A→W|D→W|W→D|A→D|D→A|W→A|Avg|
|DANN + ST|91.8|98.4|100.0|89.1|68.8|68.7|86.1|
For the other three datasets, we employ ResNet-50 [9] as the generator network. The ResNet-50 is pre-trained on ImageNet [4]. Our discriminator consists of three fully connected layers with dropout, which is the same as other works [6, 18]. As we train the classifier and discriminator from scratch, we set their learning rates to be times that of the generator. We train the model with the Stochastic Gradient Descent (SGD) optimizer with the momentum of . We schedule the learning rate with the strategy in [6]: the learning rate is adjusted by , where is the training progress linearly changing from to , , , . We implement the algorithms using PyTorch [23].
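A progress-based annealing schedule of the kind described here is often written as a power-law decay. The sketch below assumes the form `base_lr / (1 + alpha * q) ** beta` with `q` the training progress in [0, 1]; the function name `annealed_lr` and the constants `base_lr=0.01`, `alpha=10`, `beta=0.75` are illustrative assumptions and are not taken from this paper.

```python
def annealed_lr(progress, base_lr=0.01, alpha=10.0, beta=0.75):
    """Learning rate after a fraction `progress` (0 to 1) of training.

    The functional form and the default constants are assumptions made
    for illustration only.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    return base_lr / (1.0 + alpha * progress) ** beta


total_steps = 5
schedule = [annealed_lr(step / total_steps) for step in range(total_steps + 1)]
print(schedule)  # starts at base_lr and decreases monotonically
```

In a PyTorch training loop, a value like this would typically be written into each optimizer parameter group at every step, with the from-scratch modules given a larger multiple of the generator's rate.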
There are two hyper-parameters in our method: the threshold $\delta$ of pseudo-labels and the trade-off $\lambda$. If the prediction confidence of a target sample is below the threshold, we ignore that sample in training. We set $\delta$ to for digit adaptation and for the Office-31, Office-Home and VisDA datasets. In all experiments, $\lambda$ is gradually increased from to by , same as .
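The thresholding rule can be sketched as a simple filter over a batch of target predictions. The function name `select_confident`, the sample batch, and the two threshold values tried below are hypothetical choices for illustration.

```python
def select_confident(batch_probs, delta):
    """Return (index, pseudo_label) pairs for samples whose top probability exceeds delta."""
    selected = []
    for i, probs in enumerate(batch_probs):
        label = max(range(len(probs)), key=lambda k: probs[k])
        if probs[label] > delta:
            selected.append((i, label))
    return selected


batch = [
    [0.05, 0.92, 0.03],
    [0.40, 0.35, 0.25],
    [0.10, 0.10, 0.80],
]

strict = select_confident(batch, delta=0.9)
loose = select_confident(batch, delta=0.6)
print(strict)  # only the most confident sample survives
print(loose)   # a lower threshold admits more target samples
```

A lower threshold brings more target samples into training at the cost of noisier pseudo-labels, which is the trade-off the threshold controls.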
Image Results. Table 1 reports the results with ResNet-50 on Office-31. ALDA significantly outperforms state-of-the-art methods. Because ALDA combines with self-training methods to learn discriminative features, ALDA achieves better results than the domain-adversarial learning-based methods, e.g., DANN, JAN, MADA. Similar to ALDA, CDAN+E also takes the classification prediction into the discrimination and uses the entropy of the prediction as an importance weight. However, ALDA outperforms CDAN+E on hard transfer tasks, e.g., A→W, A→D, D→A and W→A. The outstanding results show that it is important to combine the domain-adversarial learning and self-training based methods properly.
Table 2 summarizes the results with ResNet-50 on Office-Home. For this more difficult adaptation dataset, ALDA still exceeds the most advanced methods. Compared to Office-31, Office-Home has more categories and a larger appearance gap between domains. A larger number of categories means more components in the discriminator output of ALDA, which results in a stronger capacity for class-wise domain discrimination.
Table 3 shows the quantitative results with ResNet-50 and ResNet-101 on the VisDA classification dataset. Even though it is only based on ResNet-50, ALDA performs better than other domain adaptation methods.
Digits Results. Table 4 summarizes the experimental results for digits adaptation compared with state-of-the-art methods. For fair comparisons, we only resize and normalize the images and do not apply any additional data augmentation like [5]. We conduct each experiment three times and report the average results and variance. As the table shows, ALDA outperforms the most advanced distribution alignment methods, e.g., DANN, MCD, CDAN, and self-training based methods, e.g., Mean Teacher with a confidence threshold (MT+CT). ALDA also reduces the performance gap between UDA and supervised learning on the target domain by a large margin.
In Table 4, we also investigate the effect of the threshold for pseudo-labels on the digits datasets. As we decrease the threshold from to , the performance improves. This is because the digits datasets are relatively easy to transfer and do not require high thresholds to obtain high-precision pseudo-labels. A lower threshold takes more target samples into training, which promotes the training of samples with low prediction confidence. For the digits datasets, ALDA with achieves the best result.
In Table 5, we perform an ablation study on Office-31 to investigate the effect of different components in ALDA. Firstly, we apply self-training [41] to unsupervised domain adaptation, which is denoted as “ST”. “DANN+ST” denotes that we directly combine the domain-adversarial learning and the self-training methods. However, the performance of “DANN+ST” is inferior to “ALDA”, which shows the importance of properly combining these two methods. To investigate the effect of the regularization term in Eq. 10, we remove the term $\mathcal{L}_{Reg}$ from the final loss of the discriminator, denoted as “ALDA w/o $\mathcal{L}_{Reg}$”. The results show that without $\mathcal{L}_{Reg}$, the performance of ALDA drops dramatically. This is because the regularization term can enhance the stability of the adversarial process.
To investigate the effect of the corrected target loss in Eq. 13, we remove $\mathcal{L}_T$ and only keep the noise-correcting domain discrimination, denoted as “ALDA w/o $\mathcal{L}_T$”. As Table 5 shows, “ALDA w/o $\mathcal{L}_T$” achieves competitive results but is inferior to “ALDA”. This shows the strength of our noise-correcting domain discrimination and the importance of combining domain discrimination and corrected pseudo-labels to enhance the performance. Additionally, we replace the corrected target loss with the uncorrected target loss, i.e., self-training with pseudo-labels, which is denoted as “ALDA+ST w/o $\mathcal{L}_T$”. However, “ALDA+ST w/o $\mathcal{L}_T$” does not improve the performance, which demonstrates the importance of correcting pseudo-labels.
As mentioned before, the unhinged loss has been proved to be robust to the uniform part of the noise. To verify the effect of choosing the unhinged loss as the basic loss function, we substitute the unhinged loss with the cross-entropy loss in the target loss $\mathcal{L}_T$, denoted as “ALDA w/ $\mathcal{L}_{ce}$”. The results in Table 5 demonstrate that the cross-entropy loss performs worse than the unhinged loss in ALDA. The unhinged loss can remove the uniform part of the noise, which facilitates the noise-correcting process.
We use t-SNE [35] to visualize the features extracted by ResNet-50, Self-training, DANN and ALDA for A→W adaptation (31 classes) in Fig. 3. When using ResNet-50 only, the target feature distribution is not aligned with the source. Although self-training and DANN can align the distributions of the source and target domains, their target clusters are not fully matched with the source clusters. For ALDA, the target clusters are closely matched with the corresponding source clusters, which demonstrates that the target features extracted by ALDA are well aligned and discriminative.
In this paper, we propose Adversarial-Learned Loss for Domain Adaptation (ALDA) to combine the strengths of domain-adversarial learning and self-training. We first introduce the confusion matrix to represent the noise in pseudo-labels. As the confusion matrix is unknown, we employ noise-correcting domain discrimination to learn the confusion matrix. Then the target classifier is optimized with the corrected loss function. ALDA is theoretically and experimentally shown to be effective for unsupervised domain adaptation and achieves state-of-the-art performance on four standard datasets.
This work was supported in part by The National Key Research and Development Program of China (Grant Nos: 2018AAA0101400), in part by The National Nature Science Foundation of China (Grant Nos: 61936006, 61973271).
- [1] (2010) A theory of learning from different domains. Machine Learning 79 (1), pp. 151–175.
- [2] (2017) Mode regularized generative adversarial. In ICLR.
- [3] Domain adaptation for semantic segmentation with maximum squares loss. In ICCV.
- [4] (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
- [5] (2018) Self-ensembling for visual domain adaptation. In ICLR.
- [6] (2016) Domain-adversarial training of neural networks. JMLR 17, pp. 2096–2030.
- [7] (2017) Robust loss functions under label noise for deep neural networks. In AAAI.
- [8] (2004) Semi-supervised learning by entropy minimization. In NIPS.
- [9] (2016) Deep residual learning for image recognition. In CVPR.
- [10] (2018) Conditional generative adversarial network for structured domain adaptation. In CVPR.
- [11] (1994) A database for handwritten text recognition research. PAMI 16, pp. 550–554.
- [12] (2015) Adam: a method for stochastic optimization. In ICLR.
- [13] (1998) Gradient-based learning applied to document recognition. In Proceedings of the IEEE, Vol. 86, pp. 2278–2324.
- [14] (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In ICML.
- [15] (2019) Distant supervised centroid shift: a simple and efficient approach to visual domain adaptation. In CVPR.
- [16] (2019) Transferable adversarial training: a general approach to adapting deep classifiers. In ICML.
- [17] (2015) Learning transferable features with deep adaptation networks. In ICML.
- [18] (2017) Conditional adversarial domain adaptation. In NeurIPS.
- [19] Deep transfer learning with joint adaptation networks. In ICML.
- [20] (2017) Least squares generative adversarial networks. In ICCV.
- [21] (2011) Reading digits in natural images with unsupervised feature learning. In NIPS.
- [22] (2016) Conditional image synthesis with auxiliary classifier GANs. In ICML.
- [23] (2017) Automatic differentiation in PyTorch. In NIPS-W.
- [24] (2018) Multi-adversarial domain adaptation. In AAAI.
- [25] (2017) VisDA: the visual domain adaptation challenge. ArXiv abs/1710.06924.
- [26] (2010) Adapting visual category models to new domains. In ECCV.
- [27] (2019) Semi-supervised domain adaptation via minimax entropy. ArXiv abs/1904.06487.
- [28] (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR.
- [29] (2014) Learning from noisy labels with deep neural networks. In ICLR.
- [30] (2016) Deep CORAL: correlation alignment for deep domain adaptation. In ECCV Workshops.
- [31] (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In ICLR.
- [32] (2011) Unbiased look at dataset bias. In CVPR.
- [33] (2017) Adversarial discriminative domain adaptation. In CVPR.
- [34] (2014) Deep domain confusion: maximizing for domain invariance. CoRR abs/1412.3474.
- [35] (2008) Visualizing data using t-SNE. In JMLR.
- [36] (2015) Learning with symmetric label noise: the importance of being unhinged. In NIPS.
- [37] (2017) Deep hashing network for unsupervised domain adaptation. In CVPR.
- [38] (2018) Learning semantic representations for unsupervised domain adaptation. In ICML.
- [39] (2018) Collaborative and adversarial network for unsupervised domain adaptation. In CVPR.
- [40] (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In NIPS.
- [41] (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV.