Domain adaptation (DA) aims to learn a discriminative classifier in the presence of a shift between training data in the source domain and test data in the target domain [2, 6, 33, 35, 36]. Currently, DA can be divided into three categories: supervised DA, semi-supervised DA, and unsupervised DA (UDA). When only a few labeled data are available in the target domain, supervised DA is also known as few-shot DA. Since unlabeled data in the target domain can be easily obtained, UDA exhibits the greatest potential in the real world [6, 7, 9, 11, 22, 25, 26].
UDA methods train with clean labeled data in the source domain (i.e., clean source data) and unlabeled data in the target domain (i.e., unlabeled target data) to obtain classifiers for the target domain. They mainly rely on three orthogonal techniques: integral probability metrics (IPM) [8, 11, 12, 18, 22], adversarial training [7, 10, 16, 20, 26, 31], and pseudo labeling. Compared to IPM- and adversarial-training-based methods, the pseudo-labeling-based method (i.e., asymmetric tri-training domain adaptation (ATDA)) can construct a high-quality target-specific representation, providing better classification performance. Besides, ATDA has been theoretically justified.
However, in the wild, the data volume of the source domain tends to be large. To avoid the expensive labeling cost, labeled data in the source domain normally come from amateur annotators or the Internet [19, 27, 29]. This brings us a new, more realistic and more challenging problem, wildly unsupervised domain adaptation (abbreviated as WUDA, Figure 1). This adaptation aims to transfer knowledge from noisy labeled data in the source domain (i.e., noisy source data) to unlabeled target data. Unfortunately, existing UDA methods share an implicit assumption that there are no noisy source data. Namely, these methods focus on transferring knowledge from clean source data to unlabeled target data. Therefore, these methods cannot handle WUDA well.
In this paper, we theoretically reveal the deficiency of existing UDA methods. To improve these methods, a straightforward strategy is a two-step approach. As shown in Figure 1, we can first use label-noise algorithms to train a model on noisy source data, then leverage this trained model to assign pseudo labels to the noisy source data. Via UDA methods, we can then transfer knowledge from pseudo-labeled source data to unlabeled target data. Nonetheless, pseudo-labeled source data are still noisy, and such a two-step strategy may relieve but cannot eliminate noise effects.
To circumvent the issue of the two-step approach, under the theoretical guidance, we present a robust one-step approach called Butterfly. At a high level, Butterfly directly transfers knowledge from noisy source data to unlabeled target data, and uses the transferred knowledge to construct target-specific representations. At a low level, Butterfly maintains four networks divided into two branches (Figure 2): two networks in Branch-I are jointly trained on noisy source data and pseudo-labeled target data (data in the mixture domain), while two networks in Branch-II are trained on pseudo-labeled target data.
The reason why Butterfly can be robust is rooted in the dual-checking principle: Butterfly checks high-correctness data out, from not only the data in the mixture domain but also the pseudo-labeled target data. After cross-propagating these high-correctness data, Butterfly can obtain high-quality domain-invariant representations (DIR) and target-specific representations (TSR) simultaneously in an iterative manner. If we only check data in the mixture domain (i.e., single checking), the errors in pseudo-labeled target data will accumulate, leading to low-quality DIR and TSR.
We conduct experiments on simulated WUDA tasks, including MNIST-to-SYND tasks, SYND-to-MNIST tasks and human-sentiment tasks. Besides, we conduct experiments on real-world WUDA tasks. Empirical results demonstrate that Butterfly can robustly transfer knowledge from noisy source data to unlabeled target data. Meanwhile, Butterfly performs much better than existing UDA methods when the source domain suffers from extreme noise.
2 Wildly unsupervised domain adaptation
In this section, we first define the new problem setting, and then analyze why it is so difficult.
2.1 Problem setting
We use the following notations in this section: 1) a space and as a label set; 2) , and represent densities of noisy, correct and incorrect multivariate random variables (m.r.v.) defined on , respectively (see footnote 1), and , and are their marginal densities; 3) represents the density of the m.r.v. defined on ; 4) we use to represent the loss function between two labelling functions; 5) we use and to represent expected risks on the noisy and correct m.r.v.; 6) we use , and to represent the expected discrepancy between two labelling functions under different marginal densities; and 7) the ground-truth and pseudo labeling functions of the target domain are denoted by and .
Footnote 1: There are two common ways to express the density of a noisy m.r.v. (Appendix 0.A). One way is to use a mixture of densities of correct and incorrect m.r.v.
We formally define the new adaptation as follows.
Definition 1 (Wildly Unsupervised Domain Adaptation)
Let be a multivariate random variable defined on the space with respect to a probability density , where . Given i.i.d. data and drawn from and , wildly unsupervised domain adaptation aims to train with and to accurately annotate each .
2.2 WUDA provably ruins all UDA methods
Theoretically, we analyze why existing UDA methods cannot directly transfer useful knowledge from noisy source data to unlabeled target data. We first present a theorem to show relations between and .
In Eq. (2), represents the expected risk on the incorrect m.r.v. To ensure that we can gain useful knowledge from , we need to avoid . Specifically, we assume: there is a constant such that .
Theorem 1 shows that the expected risk equals only in two cases: 1) and and 2) some special combinations (e.g., special , , , and ) to make the second term in Eq. (1) equal zero or to make the second term in Eq. (2) equal . Case 1) means that data in the source domain are clean, which is unrealistic in the wild. Case 2) almost never happens, since it is hard to find such special combinations when , , and are unknown. Thus, has an essential difference with . Then, we derive the upper bound of as follows.
For any labelling function , we have
To ensure that we can gain useful knowledge from , we assume: there is a constant such that and , where and .
3 Two-step approach versus one-step approach
In this section, we first analyze the deficiency of the two-step approach and then prove that the one-step approach can eliminate noise effects under certain assumptions.
3.1 Two-step approach (a compromise solution)
To reduce noise effects from noisy source data, a straightforward way is to apply a two-step strategy. For example, we first use Co-teaching to train a model with noisy source data, then these data are assigned pseudo labels using the trained model. Via the ATDA approach, we can transfer knowledge from the pseudo-labeled source data to unlabeled target data.
Nonetheless, the pseudo-labeled source data are still noisy. Let labels of noisy source data be replaced with pseudo labels after pre-processing. Noise effects will become pseudo-label effects as follows.
where and correspond to and in . It is clear that the difference between and is . The first term in may be less than that in due to Co-teaching, but the second term in may be higher than that in since Co-teaching does not attempt to minimize it. Thus, it is hard to say whether (i.e., ). This means that the two-step strategy may not really reduce noise effects.
3.2 One-step approach (a noise-eliminating solution)
To eliminate noise effects, we aim to select correct data simultaneously from noisy source data and pseudo-labeled target data.
In theory, we prove that noise effects will be eliminated if we can select correct data with a high probability. Let represent the probability that incorrect data is selected from noisy source data, and represent the probability that incorrect data is selected from pseudo-labeled target data. Theorem 3 shows that if and and presents a new upper bound of .
Data drawn from the distribution of can be regarded as a pool that mixes the selected () and unselected () noisy source data. Data drawn from the distribution of can be regarded as a pool that mixes the selected () and unselected () pseudo-labeled target data. Theorem 3 shows that if selected data have a high probability to be correct ones ( and ), then and approach , meaning that noise effects are eliminated. This motivates us to find a reliable way to select correct data from noisy source data and pseudo-labeled target data and build up a one-step approach for WUDA.
4 Butterfly: Towards robust one-step approach
This section presents a robust one-step approach called Butterfly in detail, and demonstrates how Butterfly minimizes all terms in the right-hand side of Eq. (3).
4.1 Principled design of Butterfly
Guided by Theorem 3, a robust approach should check high-correctness data out (meaning and ). This checking process will make and , , become . Then, we can obtain gradients of , and w.r.t. parameters of and use these gradients to minimize them, which minimizes and as . Note that cannot be directly minimized since we cannot pinpoint clean source data. However, following , we can indirectly minimize via minimizing , as , where the last inequality follows (5). This means that a robust approach guided by Theorem 3 can minimize all terms in the right side of inequality in (3).
To realize this robust approach, we propose a Butterfly framework (Algorithm 2), which trains four networks divided into two branches (Figure 2). By using the dual-checking principle, Branch-I checks which data are correct in the mixture domain, while Branch-II checks which pseudo-labeled target data are correct. To ensure that these checked data are highly correct, we apply the small-loss trick based on memorization effects of deep learning. After cross-propagating these checked data, Butterfly can obtain high-quality DIR and TSR simultaneously in an iterative manner. Theoretically, Branch-I minimizes ; while Branch-II minimizes . This means that Butterfly can minimize all terms in the right-hand side of the inequality in (3).
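The small-loss trick can be illustrated in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes per-sample losses have already been computed for a mini-batch, and the names `select_small_loss` and `keep_rate` are hypothetical. Samples with the smallest losses are treated as likely correct, on the premise that deep networks tend to fit clean patterns before memorizing noisy ones.

```python
import numpy as np

def select_small_loss(losses, keep_rate):
    """Return sorted indices of the keep_rate fraction of samples with the smallest loss."""
    losses = np.asarray(losses, dtype=float)
    n_keep = max(1, int(round(keep_rate * losses.size)))
    # argsort is ascending, so the first n_keep entries are the small-loss samples.
    return np.sort(np.argsort(losses, kind="stable")[:n_keep])

# Toy mini-batch: samples 1 and 4 have large losses and are treated as suspect.
batch_losses = [0.10, 2.30, 0.05, 0.40, 1.90, 0.20]
kept = select_small_loss(batch_losses, keep_rate=0.5)
```

With `keep_rate=0.5`, half of the batch is checked out; how the keep rate should evolve over epochs is a separate design choice that this sketch leaves open.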
4.2 Loss function in Butterfly
Due to , and in Theorem 3, four networks trained by Butterfly share the same loss function but with different inputs.
where is the batch size, and represents a network (e.g., and ). is a mini-batch for training a network, where could be data in the mixture domain or the target domain (Figure 2), and represents parameters of and is an -by-vector whose elements equal or . For two networks in Branch-I, following , we also add a regularizer in their loss functions, where and are weights of the first fully-connected layer of and . With this regularizer, and will learn from different features.
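To make the role of the 0/1 vector and the regularizer concrete, here is a hedged sketch. It rests on assumptions that are not confirmed by the text above: the per-sample loss is taken to be cross-entropy, the batch loss is normalized by the number of checked samples, and the regularizer is taken to be the sum of absolute entries of the product of the two first-layer weight matrices (an ATDA-style form). The function names are hypothetical.

```python
import numpy as np

def masked_cross_entropy(probs, labels, mask):
    """Average cross-entropy over the samples whose mask entry is 1 (assumed loss form)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    mask = np.asarray(mask, dtype=float)
    per_sample = -np.log(probs[np.arange(labels.size), labels] + 1e-12)
    # Normalise by the number of checked samples, not by the batch size.
    return float((mask * per_sample).sum() / max(mask.sum(), 1.0))

def orthogonality_regularizer(w1, w2):
    """Sum of absolute entries of w1^T w2 (assumed ATDA-style regularizer)."""
    return float(np.abs(np.asarray(w1).T @ np.asarray(w2)).sum())

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
labels = np.array([0, 1, 0])
mask = np.array([1, 1, 0])                 # the third sample was not checked out
loss = masked_cross_entropy(probs, labels, mask)

w1 = np.array([[1.0, 0.0], [0.0, 0.0]])    # reads only the first input feature
w2 = np.array([[0.0, 0.0], [0.0, 1.0]])    # reads only the second input feature
reg = orthogonality_regularizer(w1, w2)    # zero: the two layers use disjoint features
```

Under these assumptions, the penalty vanishes when the two first layers rely on disjoint input features, which is one way to read "learn from different features".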
4.3 Training procedure of Butterfly
The two networks in each branch first check high-correctness data out and then cross-update their parameters using these data. Algorithm 1 shows how and (or and ) check these data out and use the checked data to update their parameters.
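The check-then-cross-update pattern can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 1: the per-sample losses of the two networks are given as plain arrays, the names `small_loss_mask`, `cross_update_masks` and `keep_rate` are hypothetical, and the actual gradient step is omitted. Each network is trained on the samples checked out by its peer, so that one network's selection errors are not fed straight back into itself.

```python
import numpy as np

def small_loss_mask(losses, keep_rate):
    """0/1 vector marking the keep_rate fraction of samples with the smallest loss."""
    losses = np.asarray(losses, dtype=float)
    n_keep = max(1, int(round(keep_rate * losses.size)))
    mask = np.zeros(losses.size, dtype=int)
    mask[np.argsort(losses, kind="stable")[:n_keep]] = 1
    return mask

def cross_update_masks(losses_f1, losses_f2, keep_rate):
    """Each network is updated on the samples checked out by its peer."""
    mask_from_f1 = small_loss_mask(losses_f1, keep_rate)
    mask_from_f2 = small_loss_mask(losses_f2, keep_rate)
    return {"train_f1_on": mask_from_f2, "train_f2_on": mask_from_f1}

losses_f1 = [0.1, 1.5, 0.3, 2.0]   # per-sample losses under the first network
losses_f2 = [0.2, 0.4, 1.8, 2.2]   # per-sample losses under the second network
masks = cross_update_masks(losses_f1, losses_f2, keep_rate=0.5)
```

The returned masks play the role of the 0/1 vectors that weight each sample in the loss function.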
Based on the loss function defined in Eq. (7), the entire training procedure of Butterfly is shown in Algorithm 2. First, the algorithm initializes training data for two branches ( for Branch-I and for Branch-II), four networks ( and ) and the number of pseudo labels (line ). In the first epoch (), and are the same as because there are only unlabeled target data. After a mini-batch is fetched from (line ), and check high-correctness data out and update their parameters using Algorithm 1 (lines ). Using similar procedures, and can also update their parameters using Algorithm 1 (lines -).
In each epoch, after mini-batch updating, we randomly select unlabeled target data and assign them pseudo labels using and (lines ). Following , the Labeling function in Algorithm 2 (line ) assigns pseudo labels for unlabeled target data, when predictions of and agree and at least one of them is confident about their predictions (probability above or ). Using this function, we can obtain the pseudo-labeled target data for training Branch-II in the next epoch. Then, we merge and to be for training Branch-I in the next epoch (line ). Finally, we update , and in lines - according to  and .
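The agreement-plus-confidence rule of the Labeling function can be sketched as below. This is a minimal illustration, assuming the two networks output class-probability vectors; `assign_pseudo_labels` is a hypothetical name, and the threshold value 0.95 is illustrative only.

```python
import numpy as np

def assign_pseudo_labels(probs_f1, probs_f2, threshold):
    """Return (indices, labels) of target samples that receive a pseudo label."""
    probs_f1 = np.asarray(probs_f1, dtype=float)
    probs_f2 = np.asarray(probs_f2, dtype=float)
    pred1, pred2 = probs_f1.argmax(axis=1), probs_f2.argmax(axis=1)
    agree = pred1 == pred2
    # At least one of the two networks must be confident about the shared prediction.
    confident = np.maximum(probs_f1.max(axis=1), probs_f2.max(axis=1)) > threshold
    idx = np.flatnonzero(agree & confident)
    return idx, pred1[idx]

probs_f1 = [[0.97, 0.03], [0.60, 0.40], [0.20, 0.80], [0.55, 0.45]]
probs_f2 = [[0.70, 0.30], [0.30, 0.70], [0.04, 0.96], [0.60, 0.40]]
idx, labels = assign_pseudo_labels(probs_f1, probs_f2, threshold=0.95)
```

In this toy batch, sample 1 is rejected because the networks disagree, and sample 3 is rejected because neither network is confident.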
4.4 Relations to Co-teaching and TCL
Although Co-teaching applies the small-loss trick and the cross-update technique to train deep networks against noisy data, it can only deal with one-domain problems instead of cross-domain problems. Recalling definitions of and in (2), Co-teaching can only minimize the first term in or , and ignores the second term in . This deficiency prevents Co-teaching from eliminating noise effects. However, Butterfly can naturally eliminate them. Recently, transferable curriculum learning (TCL) has been proposed as a robust UDA method to handle noise. TCL uses the small-loss trick to train the domain-adversarial neural network (DANN). However, TCL can only minimize , while Butterfly can minimize all terms in the right-hand side of (3).
5.1 Simulated WUDA tasks
We verify the effectiveness of our approach on three benchmark datasets (vision and text), including MNIST, SYN-DIGITS (SYND) and Amazon products reviews (e.g., book, dvd, electronics and kitchen). They are used to construct basic tasks: MNIST→SYND (M→S), SYND→MNIST (S→M), book→dvd (B→D), book→electronics (B→E), ..., and kitchen→electronics (K→E). These tasks are often used for evaluation of UDA methods [7, 25, 26]. Since all source datasets are clean, we need to corrupt source datasets manually by a noise transition matrix [15, 17], which can form simulated WUDA tasks. We assume that the matrix has two representative structures: 1) symmetry flipping; 2) pair flipping, which are defined in Appendix 0.B.
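A hedged sketch of how source labels could be corrupted with such matrices is given below. It assumes the definitions commonly used in the label-noise literature (the authoritative definitions are those of Appendix 0.B): symmetry flipping spreads the noise rate uniformly over the other classes, and pair flipping moves it to the next class cyclically. The function names and the noise rate 0.3 are hypothetical choices for illustration.

```python
import numpy as np

def symmetry_flipping(num_classes, noise_rate):
    """Keep a label with prob. 1 - noise_rate; otherwise flip uniformly to another class."""
    t = np.full((num_classes, num_classes), noise_rate / (num_classes - 1))
    np.fill_diagonal(t, 1.0 - noise_rate)
    return t

def pair_flipping(num_classes, noise_rate):
    """Keep a label with prob. 1 - noise_rate; otherwise flip to the next class (cyclic)."""
    t = (1.0 - noise_rate) * np.eye(num_classes)
    for i in range(num_classes):
        t[i, (i + 1) % num_classes] += noise_rate
    return t

def corrupt_labels(labels, transition, seed=0):
    """Sample a noisy label for each clean label from the matching row of the matrix."""
    rng = np.random.default_rng(seed)
    num_classes = transition.shape[0]
    return np.array([rng.choice(num_classes, p=transition[y]) for y in labels])

clean = np.repeat(np.arange(10), 500)              # 500 clean labels per digit class
noisy = corrupt_labels(clean, pair_flipping(10, 0.3))
observed_rate = float(np.mean(noisy != clean))     # fraction of flipped labels
```

For two classes the two constructions produce the same matrix under these assumed definitions.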
The noise rate is chosen from . Intuitively, means almost over half of the noisy source data have wrong labels that cannot be learned without additional assumptions. means only labels are corrupted, which is a low-level noise situation. Note that pair case is much harder than symmetry case . For each basic task, we have four kinds of noisy source data: Pair- (P), Pair- (P), Symmetry- (S), Symmetry- (S). Thus, we evaluate the performance of each method using simulated WUDA tasks: digit recognition tasks and human-sentiment tasks. Note that the human-sentiment task is a binary classification problem, so pair flipping is equal to symmetry flipping. Thus, we only have human-sentiment tasks. Results on human-sentiment tasks are reported in Appendix 0.C.
5.2 Real-world WUDA tasks
We also verify the efficacy of our approach on the “cross-dataset benchmark” including Bing, Caltech256, Imagenet and SUN. In this benchmark, Bing, Caltech256, Imagenet and SUN contain common classes. Since the Bing dataset was formed by collecting images retrieved by Bing image search, it contains rich noisy data, with the presence of multiple objects in the same image, polysemy and caricaturization. We use Bing as noisy source data, and Caltech256, Imagenet and SUN as unlabeled target data, which can form three real-world WUDA tasks.
We realize Butterfly using four networks (abbreviated as B-Net) and compare B-Net with the following baselines: 1) ATDA: a representative pseudo-label-based UDA method; 2) deep adaptation networks (DAN): a representative IPM-based UDA method; 3) DANN: a representative adversarial-training-based UDA method; 4) Co-teaching+ATDA (Co+ATDA): a two-step method, which is a combination of the state-of-the-art label-noise algorithm (Co-teaching) and UDA method (ATDA); 5) TCL: an existing robust UDA method; 6) B-Net with one target-specific network (B-Net-1T): without considering (single-checking method). Note that ATDA is the most related UDA method compared to B-Net. Implementation details are given in Appendix 0.D.
5.4 Results on simulated WUDA (including tasks)
Table 1 reports the accuracy on the unlabeled target data in tasks. As can be seen, on S case (the easiest case), most methods work well. ATDA has a satisfactory performance although it does not consider the noise effects explicitly. Then, when facing harder cases (i.e., P and P), ATDA fails to transfer useful knowledge from noisy source data to unlabeled target data. On pair-flip cases, the performance of ATDA is much lower than that of our methods. When facing the hardest cases (i.e., M→S with P and S), DANN has higher accuracy than DAN and ATDA. However, when facing the easiest cases (i.e., S→M with P and S), the performance of DANN is worse than that of DAN and ATDA.
Although the two-step method Co+ATDA outperforms ATDA in all tasks, it cannot beat one-step methods (B-Net-1T and B-Net) in terms of average accuracy. This result provides evidence for the claim in Section 3. In Table 1, B-Net outperforms B-Net-1T in out of tasks. This reveals that pseudo-labeled target data indeed reduce the quality of TSR. Note that B-Net cannot outperform all methods in all tasks. In the task S→M with P, Co+ATDA outperforms all methods (slightly higher than B-Net), since pseudo-labeled source data are almost correct. In the task M→S with S, B-Net-1T outperforms all methods, including the second-best B-Net. We conjecture that pseudo-labeled target data may contain much instance-dependent noise in this special case, where small-loss data may not be fully correct.
Figures 3 and 4 show the target-domain accuracy vs. number of epochs among ATDA, Co+ATDA, B-Net-1T and B-Net. Besides, we show the accuracy of ATDA trained with clean source data (ATDA-TCS) as a reference point. When the accuracy of one method is close to that of ATDA-TCS (red dashed line), this method successfully eliminates noise effects. From our observations, it is clear that B-Net is very close to ATDA-TCS in out of tasks (except for the S→M task with P, Figure 3-(d)), which provides evidence for Theorem 3. Since the P case is the hardest one, it is reasonable that B-Net cannot perfectly eliminate noise effects. An interesting phenomenon is that B-Net outperforms ATDA-TCS in M→S tasks (Figure 4-(a), (c)). This means that B-Net transfers more useful knowledge (from noisy source data to unlabeled target data) even than ATDA (from clean source data to unlabeled target data).
5.5 Results on real-world WUDA (including tasks)
Finally, we show our results on real-world WUDA tasks. Table 2 reports the target-domain accuracy for tasks. B-Net enjoys the best performance on all tasks. It should be noted that, in both Bing→Caltech256 and Bing→ImageNet tasks, ATDA is slightly worse than B-Net. However, in the Bing→SUN task, ATDA is much worse than B-Net. The reason is that the DIR between Bing and SUN is more affected by noisy source data. This phenomenon is also observed when comparing DANN and TCL. ATDA is slightly better than Co+ATDA. This abnormal phenomenon can be explained using Eq. (4). In Eq. (4), after using Co-teaching to assign pseudo labels to noisy source data ( in Figure 1), the second term in may increase, which results in , i.e., noise effects actually increase. This phenomenon provides evidence that a two-step method may not really reduce noise effects.
This paper opens a new problem called wildly unsupervised domain adaptation (WUDA). However, existing UDA methods cannot handle WUDA well. Under the theoretical guidance, we propose a robust one-step approach called Butterfly. Butterfly maintains four deep networks simultaneously: two take care of all adaptations, while the other two focus on classification in the target domain. We compare Butterfly with existing UDA methods on simulated and real-world WUDA tasks. Empirical results demonstrate that Butterfly can robustly transfer knowledge from noisy source data to unlabeled target data. In the future, we can extend our Butterfly framework to address few-shot DA and open-set UDA when the source domain contains noisy data.
-  D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. Kanwal, T. Maharaj, A. Fischer, A. Courville, and Y. Bengio. A closer look at memorization in deep networks. In ICML, 2017.
-  S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. MLJ, 79(1-2):151–175, 2010.
-  Y. Bengio. Evolving culture versus local minima. In Growing Adaptive Machines, pages 109–138. 2014.
-  A. Bergamo and L. Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In NeurIPS, pages 181–189, 2010.
-  J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
-  Y. Ganin and V. S. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, pages 1180–1189, 2015.
-  Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky. Domain-adversarial training of neural networks. JMLR, 17:59:1–59:35, 2016.
-  M. Ghifary, D. Balduzzi, W. B. Kleijn, and M. Zhang. Scatter component analysis : A unified framework for domain adaptation and domain generalization. TPAMI, 39(7):1414–1430, 2017.
-  B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073, 2012.
-  M. Gong, K. Zhang, B. Huang, C. Glymour, D. Tao, and K. Batmanghelich. Causal generative domain adaptation networks. CoRR, abs/1804.04333, 2018.
-  M. Gong, K. Zhang, T. Liu, D. Tao, and C. Glymour. Domain adaptation with conditional transferable components. In ICML, pages 2839–2848, 2016.
-  A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012.
-  G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.
-  Y. Guo and M. Xiao. Cross language text classification via subspace co-regularized multi-view learning. In ICML, 2012.
-  B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. W. Tsang, and M. Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pages 8527–8537, 2018.
-  J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, pages 1994–2003, 2018.
-  L. Jiang, Z. Zhou, T. Leung, L. Li, and F. Li. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pages 2309–2318, 2018.
-  J. Lee and M. Raginsky. Minimax statistical learning with wasserstein distances. In NeurIPS, pages 2692–2701, 2018.
-  K. Lee, X. He, L. Zhang, and L. Yang. Cleannet: Transfer learning for scalable image classifier training with label noise. In CVPR, pages 5447–5456, 2018.
-  Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao. Deep domain generalization via conditional invariant adversarial networks. In ECCV, pages 647–663, 2018.
-  T. Liu and D. Tao. Classification with noisy labels by importance reweighting. TPAMI, 38(3):447–461, 2016.
-  M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.
-  E. Malach and S. Shalev-Shwartz. Decoupling “when to update” from “how to update”. In NeurIPS, pages 961–971, 2017.
-  S. Motiian, Q. Jones, S. M. Iranmanesh, and G. Doretto. Few-shot adversarial domain adaptation. In NeurIPS, pages 6673–6683, 2017.
-  K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. In ICML, pages 2988–2997, 2017.
-  K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, pages 3723–3732, 2018.
-  F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. TPAMI, 33(4):754–766, 2011.
-  Y. Shu, Z. Cao, M. Long, and J. Wang. Transferable curriculum for weakly-supervised domain adaptation. In AAAI, 2019.
-  T. Tommasi and T. Tuytelaars. A testbed for cross-dataset analysis. In ECCV TASK-CV Workshops, pages 18–31, 2014.
-  E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, pages 4068–4076, 2015.
-  E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, pages 2962–2971, 2017.
-  J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492, 2010.
-  M. Xiao and Y. Guo. Feature space independent semi-supervised domain adaptation via kernel matching. TPAMI, 37(1):54–66, 2015.
-  T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In CVPR, pages 2691–2699, 2015.
-  K. Zhang, M. Gong, and B. Schölkopf. Multi-source domain adaptation: A causal view. In AAAI, pages 3150–3157, 2015.
-  K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In ICML, pages 819–827, 2013.
Appendix 0.A Review of generation of noisy labels
This section presents a review on two label-noise generation processes.
0.A.1 Transition matrix
We assume that there is a clean multivariate random variable () defined on with a probability density , where is a label set with labels. However, samples of () cannot be directly obtained and we can only observe noisy source data from the multivariate random variable () defined on with a probability density . is generated by a transition probability , i.e., the flip rate from a clean label to a noisy label . When we generate using , we often assume that , i.e., the class-conditional noise. All these transition probabilities are summarized into a transition matrix , where .
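As a reading aid, the class-conditional noise model can be written out explicitly. The block below is a sketch with placeholder symbols that are assumptions of this illustration, not the paper's notation: $T$ for the transition matrix, $K$ for the number of labels, $y$ for the clean label and $\tilde{y}$ for the noisy label. The last identity holds only under the class-conditional assumption, i.e., when the noisy label depends on the instance solely through the clean label.

```latex
% Placeholder symbols (assumptions): T = transition matrix, K = number of labels,
% y = clean label, \tilde{y} = noisy label, x = instance.
\[
T_{ij} = \Pr(\tilde{y} = j \mid y = i), \qquad \sum_{j=1}^{K} T_{ij} = 1,
\]
\[
\Pr(\tilde{y} = j \mid x) = \sum_{i=1}^{K} T_{ij} \, \Pr(y = i \mid x).
\]
```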
The transition matrix is easily estimated in certain situations. However, in more complex situations, such as the Clothing1M dataset, noisy source data are directly generated by selecting data from a pool, which mixes correct data (data with correct labels) and incorrect data (data with incorrect labels). Namely, how the correct label is corrupted to () is unclear.
0.A.2 Sample selection
Formally, there is a multivariate random variable defined on with a probability density , where and means “correct” and means “incorrect”. Nonetheless, samples from cannot be obtained and we can only observe from a distribution with the following density.
where . The density in Eq. (8) means that we lose the information from . If we uniformly select samples drawn from , the noise rate of these samples is . It is clear that the multivariate random variable is the clean multivariate random variable defined in Appendix 0.A.1. Then, is used to describe the density of incorrect multivariate random variable . Using and , can be expressed by the following equation.
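The sample-selection view can be checked numerically. The sketch below assumes that the noisy density is a convex combination of the correct and incorrect densities, with the noise rate as the mixing weight; the symbol `rho`, its value 0.3 and the synthetic loss ranges are hypothetical choices for illustration. Under that assumption, the empirical risk on the noisy pool splits into the same mixture of the risks on correct and incorrect data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 10_000, 0.3                      # rho is a hypothetical noise rate
is_incorrect = rng.random(n) < rho        # hidden selector: True means "incorrect"
per_sample_loss = np.where(is_incorrect,
                           rng.uniform(0.5, 1.5, n),   # synthetic losses on incorrect data
                           rng.uniform(0.0, 0.5, n))   # synthetic losses on correct data

rho_hat = is_incorrect.mean()                          # observed noise rate of the pool
risk_noisy = per_sample_loss.mean()                    # risk measured on the noisy pool
risk_correct = per_sample_loss[~is_incorrect].mean()
risk_incorrect = per_sample_loss[is_incorrect].mean()
# Mixture of the two risks, weighted by the observed noise rate.
mixture = (1 - rho_hat) * risk_correct + rho_hat * risk_incorrect
```

Here `risk_noisy` and `mixture` agree up to floating-point error, and both differ from `risk_correct` whenever the pool contains incorrect data with a different average loss.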