1 Introduction
Domain adaptation (DA) aims to learn a discriminative classifier in the presence of a shift between training data in the source domain and test data in the target domain [2, 6, 33, 35, 36]. Currently, DA can be divided into three categories: supervised DA [30], semi-supervised DA [14] and unsupervised DA (UDA) [25]. When only a few labeled data are available in the target domain, supervised DA is also known as few-shot DA [24]. Since unlabeled data in the target domain can be easily obtained, UDA exhibits the greatest potential in the real world [6, 7, 9, 11, 22, 25, 26].
UDA methods train with clean labeled data in the source domain (i.e., clean source data) and unlabeled data in the target domain (i.e., unlabeled target data) to obtain classifiers for the target domain. These methods mainly rely on three orthogonal techniques: integral probability metrics (IPM) [8, 11, 12, 18, 22], adversarial training [7, 10, 16, 20, 26, 31] and pseudo labeling [25]. Compared to IPM-based and adversarial-training-based methods, the pseudo-labeling-based method (i.e., asymmetric tri-training domain adaptation (ATDA) [25]) can construct a high-quality target-specific representation, providing better classification performance. Besides, ATDA has been theoretically justified [25].
However, in the wild, the data volume of the source domain tends to be large. To avoid the expensive labeling cost, labeled data in the source domain normally come from amateur annotators or the Internet [19, 27, 29]. This brings us a new, more realistic and more challenging problem, wildly unsupervised domain adaptation (abbreviated as WUDA, Figure 1). This adaptation aims to transfer knowledge from noisy labeled data in the source domain (i.e., noisy source data) to unlabeled target data. Unfortunately, existing UDA methods share an implicit assumption that there are no noisy source data. Namely, these methods focus on transferring knowledge from clean source data to unlabeled target data. Therefore, these methods cannot handle WUDA well.
In this paper, we theoretically reveal the deficiency of existing UDA methods. To improve these methods, a straightforward strategy is a two-step approach. As shown in Figure 1, we can first use label-noise algorithms to train a model on noisy source data, and then leverage this trained model to assign pseudo labels to the noisy source data. Via UDA methods, we can transfer knowledge from pseudo-labeled source data to unlabeled target data. Nonetheless, pseudo-labeled source data are still noisy, and such a two-step strategy may relieve but cannot eliminate noise effects.
To circumvent the issue of the two-step approach, under the theoretical guidance, we present a robust one-step approach called Butterfly. At a high level, Butterfly directly transfers knowledge from noisy source data to unlabeled target data, and uses the transferred knowledge to construct target-specific representations. At a low level, Butterfly maintains four networks divided into two branches (Figure 2): two networks in BranchI are jointly trained on noisy source data and pseudo-labeled target data (data in the mixture domain), while two networks in BranchII are trained on pseudo-labeled target data.
The reason why Butterfly can be robust is rooted in the dual-checking principle: Butterfly checks high-correctness data out, not only from the data in the mixture domain but also from the pseudo-labeled target data. After cross-propagating these high-correctness data, Butterfly can obtain high-quality domain-invariant representations (DIR) and target-specific representations (TSR) simultaneously in an iterative manner. If we only check data in the mixture domain (i.e., single checking), the errors in pseudo-labeled target data will accumulate, leading to low-quality DIR and TSR.
We conduct experiments on simulated WUDA tasks, including MNIST-to-SYND tasks, SYND-to-MNIST tasks and human-sentiment tasks. Besides, we conduct experiments on real-world WUDA tasks. Empirical results demonstrate that Butterfly can robustly transfer knowledge from noisy source data to unlabeled target data. Meanwhile, Butterfly performs much better than existing UDA methods when the source domain suffers from extreme (e.g., ) noise.
2 Wildly unsupervised domain adaptation
In this section, we first define the new problem setting, and then analyze why it is so difficult.
2.1 Problem setting
We use the following notation in this section: 1) a space and a label set; 2) the densities of the noisy, correct and incorrect multivariate random variables (m.r.v.), together with their marginal densities; 3) the density of the m.r.v. defined on the space; 4) a loss function between two labelling functions; 5) the expected risks on the noisy and correct m.r.v.; 6) the expected discrepancies between two labelling functions under different marginal densities; and 7) the ground-truth and pseudo labelling functions of the target domain. There are two common ways to express the density of the noisy m.r.v. (Appendix 0.A); one way is to use a mixture of the densities of the correct and incorrect m.r.v.
We formally define the new adaptation as follows.
Definition 1 (Wildly Unsupervised Domain Adaptation)
Let be a multivariate random variable defined on the space with a probability density , where . Given i.i.d. data and drawn from and , wildly unsupervised domain adaptation aims to train with and to accurately annotate each .
2.2 WUDA provably ruins all UDA methods
Theoretically, we analyze why existing UDA methods cannot directly transfer useful knowledge from noisy source data to unlabeled target data well. We first present a theorem to show relations between and .
Theorem 1
Remark 1
In Eq. (2), represents the expected risk on the incorrect m.r.v. To ensure that we can gain useful knowledge from , we need to avoid . Specifically, we assume: there is a constant such that .
Theorem 1 shows that the expected risk only equals in two cases: 1) and ; and 2) some special combinations (e.g., special , , , and ) make the second term in Eq. (1) equal zero or make the second term in Eq. (2) equal . Case 1) means that data in the source domain are clean, which is unrealistic in the wild. Case 2) almost never happens, since it is hard to find such special combinations when , , and are unknown. Thus, has an essential difference with . Then, we derive the upper bound of as follows.
Theorem 2
For any labelling function , we have
(3) 
Remark 2
To ensure that we can gain useful knowledge from , we assume: there is a constant such that and , where and .
3 Two-step approach versus one-step approach
In this section, we first analyze the deficiency of the two-step approach and then prove that the one-step approach can eliminate noise effects under certain assumptions.
3.1 Two-step approach (a compromise solution)
To reduce noise effects from noisy source data, a straightforward way is to apply a two-step strategy. For example, we first use Co-teaching [15] to train a model with noisy source data, and then assign pseudo labels to these data using the trained model. Via the ATDA approach, we can transfer knowledge from the pseudo-labeled source data to unlabeled target data.
Nonetheless, the pseudo-labeled source data are still noisy. Let labels of noisy source data be replaced with pseudo labels after pre-processing. Noise effects will become pseudo-label effects as follows.
(4) 
where and correspond to and in . It is clear that the difference between and is . The first term in may be less than that in due to Co-teaching, but the second term in may be higher than that in since Co-teaching does not attempt to minimize it. Thus, it is hard to say whether (i.e., ). This means that the two-step strategy may not really reduce noise effects.
3.2 One-step approach (a noise-eliminating solution)
To eliminate noise effects, we aim to select correct data simultaneously from noisy source data and pseudo-labeled target data.
In theory, we prove that noise effects will be eliminated if we can select correct data with a high probability. Let represent the probability that incorrect data are selected from noisy source data, and represent the probability that incorrect data are selected from pseudo-labeled target data. Theorem 3 shows that if and and presents a new upper bound of .
Theorem 3
Remark 3
Data drawn from the distribution of can be regarded as a pool that mixes the selected () and unselected () noisy source data. Data drawn from the distribution of can be regarded as a pool that mixes the selected () and unselected () pseudo-labeled target data. Theorem 3 shows that if selected data have a high probability to be correct ones ( and ), then and approach , meaning that noise effects are eliminated. This motivates us to find a reliable way to select correct data from noisy source data and pseudo-labeled target data and build up a one-step approach for WUDA.
4 Butterfly: Towards a robust one-step approach
This section presents a robust one-step approach called Butterfly in detail, and demonstrates how Butterfly minimizes all terms on the right-hand side of Eq. (3).
4.1 Principled design of Butterfly
Guided by Theorem 3, a robust approach should check high-correctness data out (meaning and ). This checking process will make and , , become . Then, we can obtain gradients of , and w.r.t. parameters of and use these gradients to minimize them, which minimizes and as . Note that cannot be directly minimized since we cannot pinpoint clean source data. However, following [25], we can indirectly minimize via minimizing , as , where the last inequality follows (5). This means that a robust approach guided by Theorem 3 can minimize all terms on the right-hand side of the inequality in (3).
To realize this robust approach, we propose the Butterfly framework (Algorithm 2), which trains four networks divided into two branches (Figure 2). Using the dual-checking principle, BranchI checks which data are correct in the mixture domain, while BranchII checks which pseudo-labeled target data are correct. To ensure that these checked data are highly correct, we apply the small-loss trick based on the memorization effects of deep learning [1]. After cross-propagating these checked data [3], Butterfly can obtain high-quality DIR and TSR simultaneously in an iterative manner. Theoretically, BranchI minimizes ; while BranchII minimizes . This means that Butterfly can minimize all terms on the right-hand side of the inequality in (3).
4.2 Loss function in Butterfly
Due to , and in Theorem 3, four networks trained by Butterfly share the same loss function but with different inputs.
(7) 
where is the batch size, and represents a network (e.g., and ). is a mini-batch for training a network, where could be data in the mixture domain or the target domain (Figure 2), and represents parameters of and is a vector whose elements equal or . For the two networks in BranchI, following [25], we also add a regularizer to their loss functions, where and are weights of the first fully-connected layer of and . With this regularizer, and will learn from different features.
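As a rough illustration of the loss described above, the following NumPy sketch computes a cross-entropy that is averaged only over checked samples, together with a weight regularizer of the kind used in [25]. The 0/1 coding of the selection vector, the normalization by the number of selected samples, the exact regularizer form, and all function names are assumptions made for illustration; they are not taken from Eq. (7).

```python
import numpy as np

def softmax(logits):
    # Numerically stable row-wise softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def per_sample_cross_entropy(logits, labels):
    # Cross-entropy of each sample against its (possibly noisy) label.
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12)

def checked_loss(logits, labels, u):
    # Assumed form: mean cross-entropy over samples whose entry in u is 1.
    u = np.asarray(u, dtype=float)
    losses = per_sample_cross_entropy(logits, labels)
    return float((u * losses).sum() / max(u.sum(), 1.0))

def weight_regularizer(w1, w2):
    # Assumed form: sum of absolute entries of w1^T w2, which is small
    # when the two first-layer weight matrices use different features.
    return float(np.abs(w1.T @ w2).sum())
```

Under these assumptions, samples with a zero entry in the selection vector contribute nothing to the gradient, so the same function can serve all four networks with different inputs.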
4.3 Training procedure of Butterfly
The two networks in each branch first check high-correctness data out and then cross-update their parameters using these data. Algorithm 1 shows how and (or and ) check these data out and use the checked data to update their parameters.
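The checking and cross-update step can be sketched as follows. This is a minimal illustration of the small-loss trick in the style of Co-teaching [15], not a transcription of Algorithm 1: the `keep_rate` parameter and the function names are hypothetical, and the per-sample losses are assumed to have been computed by each network on the same mini-batch.

```python
import numpy as np

def check_small_loss(losses, keep_rate):
    # Return a 0/1 vector marking the keep_rate fraction of samples
    # with the smallest loss (assumed to be the high-correctness data).
    losses = np.asarray(losses, dtype=float)
    n_keep = int(round(keep_rate * losses.size))
    u = np.zeros(losses.size, dtype=int)
    if n_keep > 0:
        u[np.argsort(losses, kind="stable")[:n_keep]] = 1
    return u

def cross_check(losses_f1, losses_f2, keep_rate):
    # Cross update (assumed, as in Co-teaching): each network is trained
    # on the samples that its peer network has checked out.
    u_for_f2 = check_small_loss(losses_f1, keep_rate)  # chosen by network 1
    u_for_f1 = check_small_loss(losses_f2, keep_rate)  # chosen by network 2
    return u_for_f1, u_for_f2
```

Exchanging the selections between the two networks is what keeps one network's selection errors from feeding directly back into its own update.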
Based on the loss function defined in Eq. (7), the entire training procedure of Butterfly is shown in Algorithm 2. First, the algorithm initializes training data for two branches ( for BranchI and for BranchII), four networks ( and ) and the number of pseudo labels (line ). In the first epoch (), and are the same as because there are only unlabeled target data. After a mini-batch is fetched from (line ), and check high-correctness data out and update their parameters using Algorithm 1 (lines ). Using similar procedures, and can also update their parameters using Algorithm 1 (lines ).
In each epoch, after mini-batch updating, we randomly select unlabeled target data and assign them pseudo labels using and (lines ). Following [25], the Labeling function in Algorithm 2 (line ) assigns pseudo labels to unlabeled target data when predictions of and agree and at least one of them is confident about its prediction (probability above or ). Using this function, we can obtain the pseudo-labeled target data for training BranchII in the next epoch. Then, we merge and to be for training BranchI in the next epoch (line ). Finally, we update , and in lines according to [25] and [15].
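The labeling rule described above (agreement between two networks plus confidence of at least one of them) can be sketched in a few lines. The function name and the single `threshold` parameter are hypothetical; the concrete threshold values used in Algorithm 2 are not reproduced here.

```python
import numpy as np

def assign_pseudo_labels(probs_1, probs_2, threshold):
    # probs_1, probs_2: class-probability matrices predicted by two networks
    # for the same unlabeled target samples (one row per sample).
    probs_1 = np.asarray(probs_1, dtype=float)
    probs_2 = np.asarray(probs_2, dtype=float)
    pred_1 = probs_1.argmax(axis=1)
    pred_2 = probs_2.argmax(axis=1)
    # Condition 1: both networks predict the same class.
    agree = pred_1 == pred_2
    # Condition 2: at least one network is confident (hypothetical threshold).
    confident = np.maximum(probs_1.max(axis=1), probs_2.max(axis=1)) > threshold
    keep = np.flatnonzero(agree & confident)
    # Return indices of labeled samples and their pseudo labels.
    return keep, pred_1[keep]
```

Samples that fail either condition stay unlabeled and can be reconsidered in a later epoch.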
4.4 Relations to Co-teaching and TCL
Although Co-teaching [15] applies the small-loss trick and the cross-update technique to train deep networks against noisy data, it can only deal with one-domain problems instead of cross-domain problems. Recalling definitions of and in (2), Co-teaching can only minimize the first term in or , and ignores the second term in . This deficiency prevents Co-teaching from eliminating noise effects . However, Butterfly can naturally eliminate them. Recently, transferable curriculum learning (TCL) has been proposed as a robust UDA method to handle noise [28]. TCL uses the small-loss trick to train the domain-adversarial neural network (DANN) [7]. However, TCL can only minimize , while Butterfly can minimize all terms on the right-hand side of (3).
5 Experiments
5.1 Simulated WUDA tasks
We verify the effectiveness of our approach on three benchmark datasets (vision and text), including MNIST, SYNDIGITS (SYND) and Amazon products reviews (e.g., book, dvd, electronics and kitchen). They are used to construct basic tasks: MNIST-to-SYND (MS), SYND-to-MNIST (SM), book-to-dvd (BD), book-to-electronics (BE), ..., and kitchen-to-electronics (KE). These tasks are often used for evaluation of UDA methods [7, 25, 26]. Since all source datasets are clean, we need to corrupt source datasets manually by a noise transition matrix [15, 17], which can form simulated WUDA tasks. We assume that the matrix has two representative structures: 1) Symmetry flipping; 2) Pair flipping [15], which are defined in Appendix 0.B.
The noise rate is chosen from . Intuitively, means almost half of the noisy source data have wrong labels that cannot be learned without additional assumptions. means only labels are corrupted, which is a low-level noise situation. Note that the pair case is much harder than the symmetry case [15]. For each basic task, we have four kinds of noisy source data: Pair (P), Pair (P), Symmetry (S), Symmetry (S). Thus, we evaluate the performance of each method using simulated WUDA tasks: digit recognition tasks and human-sentiment tasks. Note that the human-sentiment task is a binary classification problem, so pair flipping is equal to symmetry flipping. Thus, we only have human-sentiment tasks. Results on human-sentiment tasks are reported in Appendix 0.C.
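The two flipping structures can be illustrated with a short sketch. Since Appendix 0.B is not reproduced here, the code assumes the definitions commonly used in the label-noise literature (e.g., in [15]): symmetry flipping spreads the noise rate evenly over all other classes, and pair flipping moves it entirely to the next class. The function names are hypothetical.

```python
import numpy as np

def symmetry_flipping(n_classes, noise_rate):
    # Assumed definition: keep the label with probability 1 - noise_rate,
    # otherwise flip uniformly to one of the other classes.
    t = np.full((n_classes, n_classes), noise_rate / (n_classes - 1))
    np.fill_diagonal(t, 1.0 - noise_rate)
    return t

def pair_flipping(n_classes, noise_rate):
    # Assumed definition: keep the label with probability 1 - noise_rate,
    # otherwise flip to the next class (cyclically).
    t = (1.0 - noise_rate) * np.eye(n_classes)
    for i in range(n_classes):
        t[i, (i + 1) % n_classes] = noise_rate
    return t
```

With two classes the two matrices coincide under these definitions, which matches the remark above that pair flipping equals symmetry flipping for the binary human-sentiment tasks.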
5.2 Real-world WUDA tasks
We also verify the efficacy of our approach on the “cross-dataset benchmark”, including Bing, Caltech256, Imagenet and SUN [29]. In this benchmark, Bing, Caltech256, Imagenet and SUN contain common classes. Since the Bing dataset was formed by collecting images retrieved by Bing image search, it contains rich noisy data, with the presence of multiple objects in the same image, polysemy and caricaturization [29]. We use Bing as noisy source data, and Caltech256, Imagenet and SUN as unlabeled target data, which forms three real-world WUDA tasks.
5.3 Baselines
We realize Butterfly using four networks (abbreviated as BNet) and compare BNet with the following baselines: 1) ATDA: a representative pseudo-label-based UDA method [25]; 2) deep adaptation networks (DAN): a representative IPM-based UDA method [22]; 3) DANN: a representative adversarial-training-based UDA method [7]; 4) Co-teaching+ATDA (Co+ATDA): a two-step method, which is a combination of the state-of-the-art label-noise algorithm (Co-teaching) [15] and a UDA method (ATDA) [25]; 5) TCL: an existing robust UDA method; 6) BNet with target-specific network (BNet1T): without considering (single-checking method). Note that ATDA is the UDA method most closely related to BNet. Implementation details are given in Appendix 0.D.
5.4 Results on simulated WUDA (including tasks)
Table 1 reports the accuracy on the unlabeled target data in tasks. As can be seen, on S case (the easiest case), most methods work well. ATDA has a satisfactory performance although it does not consider the noise effects explicitly. Then, when facing harder cases (i.e., P and P), ATDA fails to transfer useful knowledge from noisy source data to unlabeled target data. On Pair-flip cases, the performance of ATDA is much lower than that of our methods. When facing the hardest cases (i.e., MS with P and S), DANN has higher accuracy than DAN and ATDA. However, when facing the easiest cases (i.e., SM with P and S), the performance of DANN is worse than that of DAN and ATDA.
Although the two-step method Co+ATDA outperforms ATDA in all tasks, it cannot beat the one-step methods (BNet1T and BNet) in terms of average accuracy. This result is evidence for the claim in Section 3. In Table 1, BNet outperforms BNet1T in out of tasks. This reveals that pseudo-labeled target data indeed reduce the quality of TSR. Note that BNet cannot outperform all methods in all tasks. In the task SM with P, Co+ATDA outperforms all methods (slightly higher than BNet), since pseudo-labeled source data are almost correct. In the task MS with S, BNet1T outperforms all methods, including the second-best BNet. We conjecture that pseudo-labeled target data may contain much instance-dependent noise in this special case, where small-loss data may not be fully correct.
Figures 3 and 4 show the target-domain accuracy vs. the number of epochs among ATDA, Co+ATDA, BNet1T and BNet. Besides, we show the accuracy of ATDA trained with clean source data (ATDATCS) as a reference point. When the accuracy of one method is close to that of ATDATCS (red dashed line), this method successfully eliminates noise effects. From our observations, it is clear that BNet is very close to ATDATCS in out of tasks (except for SM task with P, Figure 3(d)), which is evidence for Theorem 3. Since P case is the hardest one, it is reasonable that BNet cannot perfectly eliminate noise effects. An interesting phenomenon is that BNet outperforms ATDATCS in MS tasks (Figure 4(a), (c)). This means that BNet transfers even more useful knowledge (from noisy source data to unlabeled target data) than ATDA does (from clean source data to unlabeled target data).
Table 1: Target-domain accuracy on simulated WUDA tasks.
Tasks    Type  DAN     DANN    ATDA    TCL     Co+ATDA  BNet1T  BNet
SM       P     90.17%  79.06%  55.95%  80.81%  95.37%   93.45%  95.29%
SM       P     67.00%  55.34%  53.66%  55.97%  75.43%   83.53%  90.21%
SM       S     90.74%  75.19%  89.87%  80.23%  95.22%   94.44%  95.88%
SM       S     89.31%  65.87%  87.53%  68.54%  92.03%   94.89%  94.97%
MS       P     40.82%  58.78%  33.74%  58.88%  58.02%   58.35%  60.36%
MS       P     28.41%  43.70%  19.50%  45.31%  46.80%   54.05%  56.62%
MS       S     30.62%  53.52%  49.80%  56.74%  56.64%   54.90%  57.05%
MS       S     28.21%  43.76%  17.20%  49.91%  54.29%   57.51%  56.18%
Average  -     58.16%  58.01%  50.91%  62.05%  71.73%   73.89%  75.82%
5.5 Results on real-world WUDA (including tasks)
Finally, we show our results on real-world WUDA tasks. Table 2 reports the target-domain accuracy for tasks. BNet enjoys the best performance on all tasks. It should be noted that, in both the Bing-to-Caltech256 and Bing-to-ImageNet tasks, ATDA is slightly worse than BNet. However, in the Bing-to-SUN task, ATDA is much worse than BNet. The reason is that the DIR between Bing and SUN is more affected by noisy source data. This phenomenon is also observed when comparing DANN and TCL. On average, ATDA is slightly better than Co+ATDA. This abnormal phenomenon can be explained using Eq. (4). In Eq. (4), after using Co-teaching to assign pseudo labels to noisy source data ( in Figure 1), the second term in may increase, which results in , i.e., noise effects actually increase. This phenomenon is evidence that a two-step method may not really reduce noise effects.
Table 2: Target-domain accuracy on real-world WUDA tasks (Bing as noisy source data).
Target      DAN     DANN    ATDA    TCL     Co+ATDA  BNet1T  BNet
Caltech256  77.83%  78.00%  80.84%  79.35%  79.89%   81.26%  81.71%
Imagenet    70.29%  72.16%  74.89%  72.53%  74.73%   74.81%  75.00%
SUN         24.56%  26.80%  26.26%  28.80%  26.31%   30.45%  30.54%
Average     57.56%  58.99%  60.66%  60.23%  60.31%   62.17%  62.42%
6 Conclusions
This paper opens a new problem called wildly unsupervised domain adaptation (WUDA). However, existing UDA methods cannot handle WUDA well. Under the theoretical guidance, we propose a robust one-step approach called Butterfly. Butterfly maintains four deep networks simultaneously: two take care of all adaptations, while the other two focus on classification in the target domain. We compare Butterfly with existing UDA methods on simulated and real-world WUDA tasks. Empirical results demonstrate that Butterfly can robustly transfer knowledge from noisy source data to unlabeled target data. In the future, we can extend our Butterfly framework to address few-shot DA and open-set UDA when the source domain contains noisy data.
References
 [1] D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. Kanwal, T. Maharaj, A. Fischer, A. Courville, and Y. Bengio. A closer look at memorization in deep networks. In ICML, 2017.
 [2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. MLJ, 79(1-2):151–175, 2010.
 [3] Y. Bengio. Evolving culture versus local minima. In Growing Adaptive Machines, pages 109–138. 2014.
 [4] A. Bergamo and L. Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In NeurIPS, pages 181–189, 2010.
 [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
 [6] Y. Ganin and V. S. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, pages 1180–1189, 2015.
 [7] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky. Domain-adversarial training of neural networks. JMLR, 17:59:1–59:35, 2016.
 [8] M. Ghifary, D. Balduzzi, W. B. Kleijn, and M. Zhang. Scatter component analysis: A unified framework for domain adaptation and domain generalization. TPAMI, 39(7):1414–1430, 2017.
 [9] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073, 2012.
 [10] M. Gong, K. Zhang, B. Huang, C. Glymour, D. Tao, and K. Batmanghelich. Causal generative domain adaptation networks. CoRR, abs/1804.04333, 2018.
 [11] M. Gong, K. Zhang, T. Liu, D. Tao, and C. Glymour. Domain adaptation with conditional transferable components. In ICML, pages 2839–2848, 2016.
 [12] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012.
 [13] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.
 [14] Y. Guo and M. Xiao. Cross language text classification via subspace co-regularized multi-view learning. In ICML, 2012.
 [15] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. W. Tsang, and M. Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pages 8527–8537, 2018.
 [16] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, pages 1994–2003, 2018.
 [17] L. Jiang, Z. Zhou, T. Leung, L. Li, and F. Li. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pages 2309–2318, 2018.
 [18] J. Lee and M. Raginsky. Minimax statistical learning with wasserstein distances. In NeurIPS, pages 2692–2701, 2018.
 [19] K. Lee, X. He, L. Zhang, and L. Yang. Cleannet: Transfer learning for scalable image classifier training with label noise. In CVPR, pages 5447–5456, 2018.
 [20] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao. Deep domain generalization via conditional invariant adversarial networks. In ECCV, pages 647–663, 2018.
 [21] T. Liu and D. Tao. Classification with noisy labels by importance reweighting. TPAMI, 38(3):447–461, 2016.
 [22] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.
 [23] E. Malach and S. Shalev-Shwartz. Decoupling "when to update" from "how to update". In NeurIPS, pages 961–971, 2017.
 [24] S. Motiian, Q. Jones, S. M. Iranmanesh, and G. Doretto. Few-shot adversarial domain adaptation. In NeurIPS, pages 6673–6683, 2017.
 [25] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. In ICML, pages 2988–2997, 2017.
 [26] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, pages 3723–3732, 2018.
 [27] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. TPAMI, 33(4):754–766, 2011.
 [28] Y. Shu, Z. Cao, M. Long, and J. Wang. Transferable curriculum for weakly-supervised domain adaptation. In AAAI, 2019.
 [29] T. Tommasi and T. Tuytelaars. A testbed for cross-dataset analysis. In ECCV TASK-CV Workshops, pages 18–31, 2014.
 [30] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, pages 4068–4076, 2015.
 [31] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, pages 2962–2971, 2017.
 [32] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492, 2010.
 [33] M. Xiao and Y. Guo. Feature space independent semi-supervised domain adaptation via kernel matching. TPAMI, 37(1):54–66, 2015.
 [34] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In CVPR, pages 2691–2699, 2015.
 [35] K. Zhang, M. Gong, and B. Schölkopf. Multi-source domain adaptation: A causal view. In AAAI, pages 3150–3157, 2015.
 [36] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In ICML, pages 819–827, 2013.
Appendix 0.A Review of generation of noisy labels
This section presents a review of two label-noise generation processes.
0.A.1 Transition matrix
We assume that there is a clean multivariate random variable () defined on with a probability density , where is a label set with labels. However, samples of () cannot be directly obtained and we can only observe noisy source data from the multivariate random variable () defined on with a probability density . is generated by a transition probability , i.e., the flip rate from a clean label to a noisy label . When we generate using , we often assume that , i.e., the class conditional noise [21]. All these transition probabilities are summarized into a transition matrix , where .
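To make the generation process concrete, the sketch below draws a noisy label for each clean label from the corresponding row of a transition matrix. It assumes class-conditional noise as described above, i.e., the flip probability depends only on the clean label and not on the instance; the function name, the row-indexed layout of the matrix and the `seed` parameter are illustrative assumptions.

```python
import numpy as np

def corrupt_labels(clean_labels, transition, seed=0):
    # transition[i, j] is assumed to be the probability that clean label i
    # is observed as noisy label j; each row must sum to one.
    rng = np.random.default_rng(seed)
    clean_labels = np.asarray(clean_labels)
    n_classes = transition.shape[0]
    # Class-conditional noise: sample each noisy label from the row of its
    # clean label, independently of the instance features.
    return np.array([rng.choice(n_classes, p=transition[y]) for y in clean_labels])
```

With an identity matrix the labels are returned unchanged, while off-diagonal mass controls how often each class is flipped and to which class.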
The transition matrix is easily estimated in certain situations [21]. However, in more complex situations, such as the Clothing1M dataset [34], noisy source data are directly generated by selecting data from a pool, which mixes correct data (data with correct labels) and incorrect data (data with incorrect labels). Namely, how the correct label is corrupted to () is unclear.
0.A.2 Sample selection
Formally, there is a multivariate random variable defined on with a probability density , where and means “correct” and means “incorrect”. Nonetheless, samples from cannot be obtained and we can only observe from a distribution with the following density.
(8) 
where . The density in Eq. (8) means that we lose the information from . If we uniformly select samples drawn from , the noise rate of these samples is . It is clear that the multivariate random variable is the clean multivariate random variable defined in Appendix 0.A.1. Then, is used to describe the density of the incorrect multivariate random variable . Using and , can be expressed by the following equation.