1 Introduction
Deep learning algorithms rely heavily on large annotated training samples (Daniely and Granot, 2019). However, it is often expensive, and sometimes infeasible, to annotate large datasets accurately. Therefore, cheap datasets with label noise have been widely employed to train deep learning models (Xiao et al., 2015). Recent results show that label noise significantly degrades the performance of deep learning models, as deep neural networks easily memorize and eventually fit label noise (Zhang et al., 2017; Arpit et al., 2017).

Existing methods for learning with noisy labels can be divided into two categories: algorithms with statistically inconsistent or consistent classifiers. Methods in the first category usually employ heuristics to reduce the side-effects of label noise, such as extracting reliable examples
(Yu et al., 2019; Han et al., 2018b; Malach and Shalev-Shwartz, 2017; Ren et al., 2018; Jiang et al., 2018), correcting labels (Ma et al., 2018; Kremer et al., 2018; Tanaka et al., 2018; Reed et al., 2014), and adding regularization (Han et al., 2018a; Guo et al., 2018; Veit et al., 2017; Vahdat, 2017; Li et al., 2017, 2020). Although those methods empirically work well, the classifiers learned from noisy data are not guaranteed to be statistically consistent. To address this limitation, algorithms in the second category have been proposed. They aim to design classifier-consistent algorithms (Zhang and Sabuncu, 2018; Kremer et al., 2018; Liu and Tao, 2016; Northcutt et al., 2017; Scott, 2015; Natarajan et al., 2013; Goldberger and Ben-Reuven, 2017; Patrini et al., 2017; Thekumparampil et al., 2018; Yu et al., 2018; Liu and Guo, 2019; Xu et al., 2019), where classifiers learned by exploiting noisy data statistically converge to the optimal classifiers defined on the clean domain. Intuitively, when facing large-scale noisy data, models trained via classifier-consistent algorithms will approximate the optimal models trained with clean data.

The transition matrix $T(X)$ plays an essential role in designing statistically consistent classifiers, where we set $T_{ij}(X=x) = P(\bar{Y}=j \mid Y=i, X=x)$ as the probability of the event that the clean label $i$ of an instance $x$ is flipped to the noisy label $j$, $X$ as the random variable of instances/features, $\bar{Y}$ as the variable for the noisy label, and $Y$ as the variable for the clean label. The basic idea is that the clean class posterior $P(Y \mid X)$ can be inferred by using the transition matrix and the noisy class posterior $P(\bar{Y} \mid X)$ (which can be estimated by using noisy data). In general, the transition matrix is unidentifiable and thus hard to learn (Xia et al., 2019). Current state-of-the-art methods (Han et al., 2018b, a; Patrini et al., 2017; Northcutt et al., 2017; Natarajan et al., 2013) assume that the transition matrix is class-dependent and instance-independent, i.e., $T(X=x) = T$. Given anchor points, i.e., data points that belong to a specific class almost surely, the class-dependent and instance-independent transition matrix is identifiable (Liu and Tao, 2016; Scott, 2015), and it can be estimated by exploiting the noisy class posterior of anchor points (Liu and Tao, 2016; Patrini et al., 2017; Yu et al., 2018) (more details can be found in Section 2). In this paper, we focus on learning the class-dependent and instance-independent transition matrix, which can improve the classification accuracy of current methods if the matrix is estimated more accurately.

The estimation error for the noisy class posterior is usually much larger than that for the clean class posterior, especially when the sample size is limited. An illustrative example is given in Fig. 1. The rationale is that label noise is randomly generated according to a class-dependent transition matrix. Specifically, to learn the noisy class posterior, we need to fit the mapping from instances to clean (latent) labels as well as the mapping from clean labels to noisy labels. Since the latter mapping is random and independent of instances, the learned mapping that fits label noise is prone to overfitting and thus leads to a large estimation error for the noisy class posterior. This error in turn leads to a large estimation error for the transition matrix.
As estimating the transition matrix is a bottleneck for designing consistent classifiers, a large estimation error will significantly degrade the classification performance (Xia et al., 2019).
Motivated by this problem, to reduce the estimation error of the transition matrix, we propose the dual transition estimator (dual-T estimator) in this paper to effectively estimate transition matrices. At a high level, by properly introducing an intermediate class, the dual-T estimator avoids directly estimating the noisy class posterior: it factorizes the original transition matrix into two new transition matrices, which we denote as $T^{\clubsuit}$ and $T^{\spadesuit}$. $T^{\clubsuit}$ represents the transition from the clean labels to the intermediate class labels, and $T^{\spadesuit}$ the transition from the clean and intermediate class labels to the noisy labels. Note that although we estimate two transition matrices rather than one, we are not reducing the original problem to a harder one. Philosophically, our idea belongs to the divide-and-conquer paradigm, i.e., decomposing a hard problem into simple subproblems and composing the solutions of the subproblems to solve the original problem. The two new transition matrices are easier to estimate than the original transition matrix because we will show that (1) there is no estimation error for the transition matrix $T^{\clubsuit}$; (2) the estimation error for the transition matrix $T^{\spadesuit}$ relies on predicting noisy class labels, which is much easier than learning a class posterior, as the labels are discrete while the posteriors are continuous; and (3) the estimators for the two new transition matrices are easy to implement in practice. We will also theoretically show that the two new transition matrices are easier to estimate than the original transition matrix. Empirical results on several datasets and label noise settings consistently justify the effectiveness of the dual-T estimator in reducing the estimation error of transition matrices and boosting the classification performance.
The rest of the paper is organized as follows. In Section 2, we review the current transition matrix estimator that exploits anchor points. In Section 3, we introduce our method and analyze how it reduces the estimation error. Experimental results on both synthetic and realworld datasets are provided in Section 4. Finally, we conclude the paper in Section 5.
2 Estimating Transition Matrix
Problem setup. Let $D$ be the distribution of a pair of random variables $(X, Y) \in \mathcal{X} \times \{1, \ldots, C\}$, where $X$ denotes the variable of instances, $Y$ the variable of labels, $\mathcal{X}$ the feature space, $\{1, \ldots, C\}$ the label space, and $C$ the number of classes. In many real-world classification problems, examples independently drawn from $D$ are unavailable. Before being observed, their clean labels are randomly flipped into noisy labels because of, e.g., contamination (Scott et al., 2013). Let $\bar{D}$ be the distribution of the noisy pair $(X, \bar{Y})$, where $\bar{Y}$ denotes the variable of noisy labels. In label-noise learning, we only have a sample set $\bar{S} = \{(x_n, \bar{y}_n)\}_{n=1}^{N}$ independently drawn from $\bar{D}$. The aim is to learn a robust classifier from the noisy sample $\bar{S}$ that can assign clean labels to test instances.
Transition matrix. To build statistically consistent classifiers, which converge to the optimal classifiers defined by using clean data, we need to introduce the concept of the transition matrix (Natarajan et al., 2013; Liu and Tao, 2016; Reed et al., 2014). Specifically, the $ij$-th entry of the transition matrix, i.e., $T_{ij}(x) = P(\bar{Y}=j \mid Y=i, X=x)$, represents the probability that the instance $x$ with the clean label $i$ will have a noisy label $j$. The transition matrix has been widely studied to build statistically consistent classifiers, because the clean class posterior $P(Y \mid X=x)$ can be inferred by using the transition matrix and the noisy class posterior $P(\bar{Y} \mid X=x)$, i.e., we have $P(\bar{Y}=j \mid X=x) = \sum_{i=1}^{C} T_{ij}(x)\, P(Y=i \mid X=x)$. Specifically, the transition matrix has been used to modify loss functions to build risk-consistent estimators, e.g.,
(Goldberger and Ben-Reuven, 2017; Patrini et al., 2017; Yu et al., 2018; Xia et al., 2019), and has been used to correct hypotheses to build classifier-consistent algorithms, e.g., (Natarajan et al., 2013; Scott, 2015; Patrini et al., 2017). Moreover, the state-of-the-art statistically inconsistent algorithms (Jiang et al., 2018; Han et al., 2018b) also use diagonal entries of the transition matrix to help select reliable examples for training.

As the noisy class posterior can be estimated by exploiting the noisy training data, the key question remains how to effectively estimate the transition matrix. Given only noisy data, the transition matrix is unidentifiable without any knowledge of the clean labels (Xia et al., 2019). Specifically, the transition matrix can be decomposed into the product of two new transition matrices, i.e., $T(x) = T_1(x) T_2(x)$, and a different clean class posterior can be obtained by composing $T_2(x)$ with $P(Y \mid X=x)$, i.e., $P'(Y \mid X=x) = T_2(x)^\top P(Y \mid X=x)$. Therefore, $(T(x), P(Y \mid X=x))$ and $(T_1(x), P'(Y \mid X=x))$ are both valid decompositions of the same noisy class posterior. The current state-of-the-art methods (Han et al., 2018b, a; Patrini et al., 2017; Northcutt et al., 2017; Natarajan et al., 2013) therefore studied a special case by assuming that the transition matrix is class-dependent and instance-independent, i.e., $T(x) = T$. Note that there are specific settings (Elkan and Noto, 2008; Lu et al., 2018; Bao et al., 2018) where noise is independent of instances. A series of assumptions (Liu and Tao, 2016; Scott, 2015; Ramaswamy et al., 2016) were further proposed to identify or efficiently estimate the transition matrix by exploiting only noisy data. In this paper, we focus on estimating the class-dependent and instance-independent transition matrix, which is the focus of the vast majority of current state-of-the-art label-noise learning algorithms (Han et al., 2018b, a; Patrini et al., 2017; Northcutt et al., 2017; Natarajan et al., 2013; Jiang et al., 2018).
The matrix estimated by our method can then be seamlessly embedded into these algorithms, and the classification accuracy of the algorithms can be improved if the transition matrix is estimated more accurately.
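As an illustration of how an estimated transition matrix plugs into such algorithms, here is a minimal forward-correction loss in the spirit of Patrini et al. (2017), written with NumPy; the function name and interface are ours, not from any of the cited implementations:

```python
import numpy as np

def forward_corrected_nll(clean_probs, noisy_labels, T):
    """Forward loss correction sketch (after Patrini et al., 2017).

    The model's estimated clean posterior is pushed through T to obtain
    noisy-class probabilities, P(noisy=j|x) = sum_i T[i, j] * P(clean=i|x),
    and negative log-likelihood is taken against the observed noisy labels.
    clean_probs: (n, c) rows of estimated P(Y | x); T: (c, c) transition matrix.
    """
    noisy_probs = clean_probs @ T  # row-wise application of the transition
    n = len(noisy_labels)
    picked = noisy_probs[np.arange(n), noisy_labels]  # probability of the observed noisy label
    return -np.mean(np.log(picked + 1e-12))           # small constant for numerical stability
```

Minimizing this loss against noisy labels drives the composed model toward the true noisy posterior, so the underlying clean posterior estimate becomes consistent, which is the point of the correction.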
Transition matrix estimation. The anchor point assumption (Liu and Tao, 2016; Scott, 2015; Xia et al., 2019) is widely adopted for estimating the transition matrix. Anchor points are defined in the clean data domain. Formally, an instance $x^i$ is an anchor point of the $i$-th clean class if $P(Y=i \mid X=x^i) = 1$ (Liu and Tao, 2016; Xia et al., 2019). Suppose we have access to the noisy class posterior and to anchor points; then the transition matrix can be obtained via
$$P(\bar{Y}=j \mid X=x^i) = \sum_{k=1}^{C} P(\bar{Y}=j \mid Y=k, X=x^i)\, P(Y=k \mid X=x^i) = P(\bar{Y}=j \mid Y=i) = T_{ij},$$
where the second equation holds because $P(Y=k \mid X=x^i) = 1$ when $k=i$ and $0$ otherwise, and the last equation holds because the transition matrix is instance-independent. According to this equation, to estimate the transition matrix we need to find anchor points and estimate the noisy class posterior; the transition matrix can then be estimated as follows,
$$\hat{T}_{ij} = \hat{P}(\bar{Y}=j \mid X=x^i). \qquad (1)$$
This estimation method has been widely used (Liu and Tao, 2016; Patrini et al., 2017; Xia et al., 2019) in label-noise learning, and we term it the transition estimator (T-estimator).
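As a concrete sketch (our own illustration, not the authors' code), the T-estimator of Eq. (1) takes a matrix of estimated noisy class posteriors and reads off its rows at anchor points, with the anchor points identified by the argmax heuristic of Liu and Tao (2016) discussed next:

```python
import numpy as np

def t_estimator(noisy_posteriors):
    """T-estimator sketch: estimate T from estimated noisy class posteriors.

    noisy_posteriors: (n, c) array whose n-th row approximates P(noisy | X = x_n).
    For each class i, the instance maximizing column i is taken as the anchor
    point x^i, and the i-th row of the estimate is the posterior at that point,
    following Eq. (1).
    """
    _, c = noisy_posteriors.shape
    T_hat = np.zeros((c, c))
    for i in range(c):
        anchor = np.argmax(noisy_posteriors[:, i])  # index of the presumed x^i
        T_hat[i] = noisy_posteriors[anchor]         # T_hat[i, j] = P-hat(noisy=j | x^i)
    return T_hat
```

If the estimated posteriors are exact and each class has a true anchor point in the sample, the rows of the returned matrix coincide with the rows of the true T; in practice the posterior estimation error propagates directly into the estimate, which is the weakness the paper targets.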
Note that some methods assume anchor points are already given (Yu et al., 2018). However, this assumption can be too strong for applications where anchor points are hard to identify. It has been proven that anchor points can be learned from noisy data (Liu and Tao, 2016), i.e., $x^i = \arg\max_{x} \hat{P}(\bar{Y}=i \mid X=x)$, but this result only holds for binary classification. The same estimator has also been employed for multi-class classification (Patrini et al., 2017); it empirically performs well but lacks a theoretical guarantee. How to identify anchor points in multi-class classification with a theoretical guarantee remains an open problem.
Eq. (1) and the above discussion on learning anchor points show that the T-estimator relies heavily on the estimation of the noisy class posterior. Unfortunately, due to the randomness of label noise, the estimation error of the noisy class posterior is usually large. As illustrated in Fig. 1, with the same number of training examples, the estimation error of the noisy class posterior is significantly larger than that of the clean class posterior. This motivates us to seek an alternative estimator that avoids directly using the estimated noisy class posterior to approximate the transition matrix.
3 Reducing Estimation Error for Transition Matrix
To avoid directly using the estimated noisy class posterior to approximate the transition matrix, we propose a new estimator in this section.
3.1 Dual-T Estimator
By introducing an intermediate class, the transition matrix can be factorized in the following way:
$$T_{ij} = P(\bar{Y}=j \mid Y=i) = \sum_{l=1}^{C} P(\bar{Y}=j \mid Y'=l, Y=i)\, P(Y'=l \mid Y=i) = \sum_{l=1}^{C} T^{\spadesuit}_{(i)lj}\, T^{\clubsuit}_{il}, \qquad (2)$$
where $Y'$ represents the random variable for the introduced intermediate class, $T^{\spadesuit}_{(i)lj} = P(\bar{Y}=j \mid Y'=l, Y=i)$, and $T^{\clubsuit}_{il} = P(Y'=l \mid Y=i)$. Note that $T^{\spadesuit}$ and $T^{\clubsuit}$ are two transition matrices representing the transition from the clean and intermediate class labels to the noisy class labels and the transition from the clean labels to the intermediate class labels, respectively.
By looking at Eq. (2), it may seem that we have changed an easy problem into a hard one. However, this is not the case: we break the problem down into simple subproblems, and combining the solutions to the subproblems gives a solution to the original problem. Thus, philosophically, our idea belongs to the divide-and-conquer paradigm. In the rest of this subsection, we explain why it is easy to estimate the transition matrices $T^{\clubsuit}$ and $T^{\spadesuit}$. In the next subsection, we theoretically compare the estimation error of the dual-T estimator with that of the T-estimator.
It can be found that $T^{\clubsuit}$ has a similar form to $T$. We can employ the same method developed for $T$, i.e., the T-estimator, to estimate $T^{\clubsuit}$. However, there seem to be two challenges: (1) it looks difficult to access $P(Y' \mid X)$; (2) we may also have an error in estimating $P(Y' \mid X)$. Fortunately, both challenges can be addressed by properly introducing the intermediate class. Specifically, we design the intermediate class in such a way that $P(Y' \mid X=x) := \hat{P}(\bar{Y} \mid X=x)$, where $\hat{P}(\bar{Y} \mid X=x)$ represents an estimated noisy class posterior. Note that $\hat{P}(\bar{Y} \mid X=x)$ can be obtained by exploiting the noisy data at hand. As we have discussed, due to the randomness of label noise, estimating $P(\bar{Y} \mid X)$ directly will incur a large estimation error, especially when the noisy training sample size is limited. However, as we have direct access to $P(Y' \mid X)$, according to Eq. (1), the estimation error for $T^{\clubsuit}$ is zero if anchor points are given. (If the anchor points have to be learned, the estimation error remains the same for the T-estimator and the dual-T estimator by employing $x^i = \arg\max_{x} \hat{P}(\bar{Y}=i \mid X=x)$.)
Although the transition matrix $T^{\spadesuit}$ involves three variables, i.e., the clean class, the intermediate class, and the noisy class, we have class labels available for two of them, i.e., the intermediate class and the noisy class. Note that the intermediate class labels can be assigned by using $P(Y' \mid X)$. Usually, the clean class labels are not available. This motivates us to find a way to eliminate the dependence on the clean class for $T^{\spadesuit}$. From an information-theoretic point of view (Csiszár et al., 2004), if the clean class $Y$ is less informative for the noisy class $\bar{Y}$ than the intermediate class $Y'$, in other words, if given $Y'$, $Y$ contains no more information for predicting $\bar{Y}$, then $\bar{Y}$ is independent of $Y$ conditioned on $Y'$, i.e.,
$$P(\bar{Y}=j \mid Y'=l, Y=i) = P(\bar{Y}=j \mid Y'=l), \quad \text{i.e.,}\quad T^{\spadesuit}_{(i)lj} = T^{\spadesuit}_{lj}. \qquad (3)$$
A sufficient condition for the above equality to hold is to let the intermediate class labels be identical to the noisy labels. Note that it is hard to find an intermediate class whose labels are exactly identical to the noisy labels; this mismatch is the main factor contributing to the estimation error for $T^{\spadesuit}$. Note also that since we have labels for both the noisy class and the intermediate class, $T^{\spadesuit}_{lj}$ in Eq. (3) is easy to estimate by simply counting the discrete labels, and the resulting estimate has a small estimation error which converges to zero exponentially fast (Boucheron et al., 2013).
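As a sanity check (our own worked example, not part of the original analysis), consider the extreme case in which the intermediate labels coincide exactly with the noisy labels, i.e., $Y' = \bar{Y}$. Then

```latex
P(\bar{Y}=j \mid Y'=l) = \mathbb{1}\{j=l\}
\;\Longrightarrow\;
T^{\spadesuit} = I,
\qquad
T^{\clubsuit}_{il} = P(Y'=l \mid Y=i) = P(\bar{Y}=l \mid Y=i) = T_{il},
```

so the factorization of Eq. (2) composes back to $T$ exactly. The closer the intermediate labels are to the noisy labels, the smaller the error introduced by the conditional-independence approximation of Eq. (3).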
Based on the above discussion, by factorizing the transition matrix into $T^{\clubsuit}$ and $T^{\spadesuit}$, we change the problem of estimating the noisy class posterior into the problem of fitting the noisy labels. Note that the noisy class posterior lies in the continuous range $[0,1]$ while the noisy class labels lie in the discrete set $\{1, \ldots, C\}$. Intuitively, learning the class labels is much easier than learning the class posteriors. In Section 4, our experiments on synthetic and real-world datasets further justify this by showing a significant gap between the estimation errors of the T-estimator and the dual-T estimator.
Implementation of the dual-T estimator. The dual-T estimator is described in Algorithm 1. Specifically, the transition matrix $T^{\clubsuit}$ can be easily estimated by letting $P(Y' \mid X=x) := \hat{P}(\bar{Y} \mid X=x)$ and then employing the T-estimator (see Section 2). By generating intermediate class labels, e.g., letting $\hat{y}'_n = \arg\max_{l} \hat{P}(\bar{Y}=l \mid X=x_n)$ be the label for the instance $x_n$, the transition matrix $T^{\spadesuit}$ can be estimated via counting, i.e.,
$$\hat{T}^{\spadesuit}_{lj} = \frac{\sum_{n=1}^{N} \mathbb{1}\{\hat{y}'_n = l \wedge \bar{y}_n = j\}}{\sum_{n=1}^{N} \mathbb{1}\{\hat{y}'_n = l\}}, \qquad (4)$$
where $\mathbb{1}\{\cdot\}$ is an indicator function which equals one when the enclosed statement holds true and zero otherwise, $(x_n, \bar{y}_n)$ are examples from the training sample $\bar{S}$, and $\wedge$ represents the AND operation.
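Putting the two steps together, a minimal NumPy sketch of the dual-T estimator follows (our own illustration of the procedure, with hypothetical names): $T^{\clubsuit}$ is obtained by applying the anchor-point rule of Eq. (1) to the known function $P(Y' \mid x) := \hat{P}(\bar{Y} \mid x)$, so this step adds no extra error, and $T^{\spadesuit}$ is obtained by the counting rule of Eq. (4).

```python
import numpy as np

def dual_t_estimator(noisy_posteriors, noisy_labels):
    """Dual-T estimator sketch.

    noisy_posteriors: (n, c) rows of the estimated noisy posterior P-hat(noisy | x_n),
    which by construction equals the intermediate-class posterior P(Y' | x_n).
    noisy_labels: (n,) observed noisy labels.
    """
    _, c = noisy_posteriors.shape
    # T-clubsuit via anchor points on the intermediate class (no extra error:
    # the posterior of Y' is known exactly, it is the model output itself).
    T_club = np.zeros((c, c))
    for i in range(c):
        anchor = np.argmax(noisy_posteriors[:, i])
        T_club[i] = noisy_posteriors[anchor]
    # T-spadesuit via counting, Eq. (4): compare intermediate labels
    # (argmax of the posterior rows) against the observed noisy labels.
    y_prime = np.argmax(noisy_posteriors, axis=1)
    T_spade = np.zeros((c, c))
    for l in range(c):
        mask = (y_prime == l)
        if mask.sum() > 0:  # rows with no intermediate label l stay zero
            for j in range(c):
                T_spade[l, j] = np.mean(noisy_labels[mask] == j)
    # Compose: T_hat[i, j] = sum_l T_club[i, l] * T_spade[l, j], per Eq. (2).
    return T_club @ T_spade
```

When the intermediate labels happen to match the noisy labels exactly, the counting step yields the identity matrix and the composed estimate reduces to the anchor-point estimate, mirroring the degenerate case discussed above Eq. (3).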
Many statistically consistent algorithms (Goldberger and Ben-Reuven, 2017; Patrini et al., 2017; Yu et al., 2018; Xia et al., 2019) consist of a two-step training procedure: the first step estimates the transition matrix, and the second step builds a statistically consistent algorithm, for example, by modifying loss functions. Our proposed dual-T estimator can be seamlessly embedded into their frameworks. More details can be found in Section 4.
3.2 Theoretical Analysis
In this subsection, we justify that the estimation error can be greatly reduced if we estimate $T^{\clubsuit}$ and $T^{\spadesuit}$ rather than estimating $T$ directly.
As we have discussed, the estimation error of the T-estimator is caused by estimating the noisy class posterior, while the estimation error of the dual-T estimator comes from the estimation error of $T^{\spadesuit}$, i.e., from fitting the noisy class labels and from estimating $T^{\spadesuit}_{lj}$ by counting discrete labels. Note that to eliminate the dependence on the clean label for $T^{\spadesuit}$, we need Eq. (3) to hold. Let $\epsilon_1$ be the estimation error for the noisy class posterior, $\epsilon_2$ the estimation error for $T^{\spadesuit}_{lj}$ incurred by counting discrete labels, and $\epsilon_3$ the estimation error incurred by fitting the noisy class labels. We will show that, under the following assumption, the estimation error of the dual-T estimator is smaller than the estimation error of the T-estimator.
Assumption 1.
For all $l, j \in \{1, \ldots, C\}$, $|\epsilon_2| + |\epsilon_3| \le |\epsilon_1|$.
Assumption 1 easily holds. Theoretically, the counting error $\epsilon_2$ involves no predefined hypothesis space, and the probability that $|\epsilon_2|$ is larger than any fixed positive number converges to zero exponentially fast (Boucheron et al., 2013). Thus, $\epsilon_2$ is usually much smaller than $\epsilon_1$ and $\epsilon_3$. We therefore focus on comparing $\epsilon_3$ with $\epsilon_1$ while ignoring $\epsilon_2$. Intuitively, the error $\epsilon_3$ is smaller than $\epsilon_1$ because it is easier to obtain a small estimation error when fitting noisy class labels than when estimating noisy class posteriors: the noisy class posterior lies in the continuous range $[0,1]$ while the noisy class labels lie in the discrete set $\{1, \ldots, C\}$. For example, suppose we have an instance $x$ with noisy label $\bar{y}$; then, as long as the estimated posterior probability $\hat{P}(\bar{Y}=\bar{y} \mid X=x)$ is greater than that of every other class, the noisy label will be accurately learned, even though the estimation error of the noisy class posterior itself may still be large. We also empirically verify the relation among these errors in Appendix 2.

Theorem 1.
Under Assumption 1, the estimation error of the dual-T estimator is smaller than the estimation error of the T-estimator.
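The intuition behind Assumption 1 can be seen in a toy numerical example (ours): an estimated posterior can be far from the truth in absolute value while still producing exactly the right noisy labels through the argmax.

```python
import numpy as np

# True and (poorly) estimated noisy class posteriors for two instances.
true_post = np.array([[0.9, 0.1],
                      [0.2, 0.8]])
est_post = np.array([[0.6, 0.4],      # off by 0.3 on the first instance
                     [0.45, 0.55]])   # off by 0.25 on the second

# Largest entry-wise posterior estimation error: substantial (0.3).
posterior_error = np.abs(true_post - est_post).max()
# Fraction of instances whose predicted noisy label is wrong: zero,
# since the argmax of each estimated row still matches the true row.
label_error = np.mean(true_post.argmax(1) != est_post.argmax(1))
```

Here the posterior estimate is off by up to 0.3, yet every noisy label is predicted correctly, so the counting step of the dual-T estimator sees no error at all from these instances.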
4 Experiments
We compare the transition matrix estimation error produced by the proposed dual-T estimator and the T-estimator on both synthetic and real-world datasets. We also compare the classification accuracy of state-of-the-art label-noise learning algorithms (Liu and Tao, 2016; Patrini et al., 2017; Jiang et al., 2018; Han et al., 2018b; Xia et al., 2019; Zhang et al., 2018; Malach and Shalev-Shwartz, 2017) obtained by using the T-estimator and the dual-T estimator, respectively. MNIST (LeCun et al., 2010), Fashion-MNIST (or F-MNIST) (Xiao et al., 2017), CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and Clothing1M (Xiao et al., 2015) are used in the experiments. Note that as there is no estimation error for $T^{\clubsuit}$, we do not need an ablation study to show how the two new transition matrices contribute to the overall estimation error.
4.1 Transition Matrix Estimation
We compare the estimation error of our dual-T estimator and the T-estimator on both synthetic and real-world datasets with different sample sizes and different noise types. The synthetic dataset is created by sampling from two different 10-dimensional Gaussian distributions: one has zero mean and unit variance in all dimensions; the other has mean two and unit variance in all dimensions. The real-world image datasets used to evaluate the transition matrix estimation error are MNIST (LeCun et al., 2010), F-MNIST (Xiao et al., 2017), CIFAR-10, and CIFAR-100 (Krizhevsky et al., 2009).

We conduct experiments on the commonly used noise types (Han et al., 2018b; Xia et al., 2019). Specifically, two representative structures of the transition matrix are investigated: (1) symmetry flipping (Sym) (Patrini et al., 2017); (2) pair flipping (Pair) (Han et al., 2018b). To generate noisy datasets, we corrupt the training and validation sets of each dataset according to the transition matrix $T$.
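For reference, the two noise structures can be generated as follows (a sketch under the standard definitions of these noise types; the function names are ours):

```python
import numpy as np

def symmetric_T(c, rate):
    """Symmetry flipping: the flip probability is spread uniformly
    over the c - 1 incorrect classes."""
    T = np.full((c, c), rate / (c - 1))
    np.fill_diagonal(T, 1.0 - rate)
    return T

def pair_T(c, rate):
    """Pair flipping: each class flips only to the next class (cyclically)."""
    T = np.eye(c) * (1.0 - rate)
    for i in range(c):
        T[i, (i + 1) % c] = rate
    return T

def corrupt(labels, T, rng):
    """Sample a noisy label for each clean label from the matching row of T."""
    return np.array([rng.choice(len(T), p=T[y]) for y in labels])
```

Each row of these matrices sums to one, so corrupting a clean dataset amounts to sampling, per example, from the categorical distribution given by the row of its clean class.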
Neural network classifiers are used to estimate the transition matrices. For a fair comparison, the same network structure is used for both estimators. Specifically, on the synthetic dataset, a two-hidden-layer network is used; on the real-world datasets, we follow the network structures used by the state-of-the-art method (Patrini et al., 2017), i.e., a LeNet network with dropout for MNIST, ResNet networks for F-MNIST, CIFAR-10, and CIFAR-100, and a ResNet-50 pretrained on ImageNet for Clothing1M. Each network is trained for 100 epochs with the stochastic gradient descent (SGD) optimizer, and the initial learning rate is decayed during training. Part of the training examples is held out for validation, and the model with the best validation accuracy is selected for estimating the transition matrix. The estimation error is calculated by measuring the distance between the estimated transition matrix and the ground truth. The average estimation error and the standard deviation over repeated experiments for both estimators are illustrated in Figs. 2 and 3.

Fig. 2 illustrates the estimation errors of the T-estimator and the dual-T estimator on the synthetic dataset. For both noise types, the estimation error of both methods tends to decrease as the training sample size increases. However, the estimation error of the dual-T estimator is consistently smaller than that of the T-estimator. Moreover, the estimation error of the dual-T estimator is less sensitive to the noise type. Specifically, even when the T-estimator is trained with all the training examples, its estimation error on Pair noise is approximately double that on Sym noise, as observed at the right-hand side of the estimation error curves. In contrast, when the dual-T estimator is trained with all the training examples, its estimation error does not differ significantly across noise types. Similarly, the experiments on the real-world image datasets illustrated in Fig. 3 show that the estimation error of the dual-T estimator is consistently smaller than that of the T-estimator except on CIFAR-100, which illustrates the effectiveness of the proposed estimator. On CIFAR-100, both estimators have larger estimation errors compared to the results on MNIST, F-MNIST, and CIFAR-10. The dual-T estimator outperforms the T-estimator with large sample sizes; however, when the training sample size is small, the estimation error of the dual-T estimator can be larger than that of the T-estimator, because the number of images per class is too small, so the counting-based estimate can be very sparse and incur a large estimation error.
Table 1: Classification accuracy on MNIST and F-MNIST under Sym-20%, Sym-50%, and Pair-45% label noise for CE, Mixup, Decoupling, MentorNet, Co-teaching, Forward, Reweighting, and Revision, where each transition-matrix-based method is evaluated with both the T-estimator and the dual-T estimator.
Table 2: Classification accuracy on CIFAR-10 and CIFAR-100 under Sym-20%, Sym-50%, and Pair-45% label noise for CE, Mixup, Decoupling, MentorNet, Co-teaching, Forward, Reweighting, and Revision, where each transition-matrix-based method is evaluated with both the T-estimator and the dual-T estimator.
Table: Classification accuracy on Clothing1M for CE, Mixup, Decoupling, MentorNet, Co-teaching, Forward, Reweighting, and Revision.
4.2 Classification Accuracy Evaluation
We investigate how the estimates of the T-estimator and the dual-T estimator affect the classification accuracy in label-noise learning. The experiments are conducted on MNIST, F-MNIST, CIFAR-10, CIFAR-100, and Clothing1M. The classification accuracies are reported in Table 1 and Table 2. Eight popular baselines are selected for comparison: Co-teaching (Han et al., 2018b) and MentorNet (Jiang et al., 2018), which use diagonal entries of the transition matrix to help select reliable examples for training; Forward (Patrini et al., 2017) and Revision (Xia et al., 2019), which use the transition matrix to correct hypotheses; and Reweighting (Liu and Tao, 2016), which uses the transition matrix to build risk-consistent algorithms. There are three baselines that do not require any knowledge of the transition matrix: CE, which trains a network on the noisy sample directly by using the cross-entropy loss; Decoupling (Malach and Shalev-Shwartz, 2017), which trains two networks and updates the parameters only on the examples for which the two classifiers give different predictions; and Mixup (Zhang et al., 2018), which reduces the memorization of corrupted labels by applying linear interpolation to feature-target pairs. The estimates of the T-estimator and the dual-T estimator are both plugged into the baselines that rely on the transition matrix, yielding a T-estimator variant and a dual-T variant of Co-teaching, MentorNet, Forward, Revision, and Reweighting.

The settings of our experiments may differ from those in the original papers, so the reported accuracies can differ as well. For instance, in the original paper of Co-teaching (Han et al., 2018b), the noise rate is given and all data are used for training. In contrast, we assume the noise rate is unknown and needs to be estimated, and we leave out part of the data as the validation set for transition matrix estimation, training only on the remainder. In the original paper of Revision (Xia et al., 2019), the experiments on Clothing1M use clean data for validation; in contrast, we only use noisy data for validation.
In Table 1 and Table 2, we bold the better classification accuracy produced by each baseline integrated with the T-estimator or the dual-T estimator; the best classification accuracy among all the methods in each column is additionally highlighted. The tables show that, for most experiments, the classification accuracy of the methods using our estimate is better than that obtained with the T-estimator's estimate. This is because the dual-T estimator leads to a smaller estimation error than the T-estimator when training with a large sample size, which can be observed at the right-hand side of the estimation error curves in Fig. 3. The baselines with the most significant improvement from our estimate are Co-teaching and MentorNet. Co-teaching outperforms all the other methods under Sym noise. On the Clothing1M dataset, Revision has the best classification accuracy. The experiments on the real-world datasets not only show the effectiveness of the dual-T estimator in improving the classification accuracy of current label-noise learning algorithms, but also reflect the importance of transition matrix estimation in label-noise learning.
5 Conclusion
The transition matrix plays an important role in label-noise learning. In this paper, to avoid the large estimation error of the noisy class posterior, which leads to a poorly estimated transition matrix, we have proposed a new transition matrix estimator named the dual-T estimator. The new estimator exploits the divide-and-conquer paradigm: it factorizes the original transition matrix into the product of two easy-to-estimate transition matrices by introducing an intermediate class. Both theoretical analysis and experiments on synthetic and real-world label-noise data show that our estimator reduces the estimation error of the transition matrix, which leads to better classification accuracy for current label-noise learning algorithms.
Acknowledgments
TL was supported by Australian Research Council Project DE190101473. BH was supported by HKBU Tier 1 Startup Grant and HKBU CSD Startup Grant. GN and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan.
References
 A closer look at memorization in deep networks. In ICML, pp. 233–242. Cited by: §1.
 Classification from pairwise similarity and unlabeled data. In ICML, pp. 452–461. Cited by: §2.
 Concentration inequalities: a nonasymptotic theory of independence. Oxford university press. Cited by: §3.1, §3.2.
 Information theory and statistics: a tutorial. Foundations and Trends® in Communications and Information Theory 1 (4), pp. 417–528. Cited by: §3.1.
 Generalization bounds for neural networks via approximate description length. In NeurIPS, pp. 12988–12996. Cited by: Appendix B, §1.
 Learning classifiers from only positive and unlabeled data. In SIGKDD, pp. 213–220. Cited by: §2.
 Training deep neuralnetworks using a noise adaptation layer. In ICLR, Cited by: §1, §2, §3.1.

Curriculumnet: weakly supervised learning from largescale web images
. In ECCV, pp. 135–150. Cited by: §1.  Masking: a new perspective of noisy supervision. In NeurIPS, pp. 5836–5846. Cited by: §1, §1, §2.
 Coteaching: robust training of deep neural networks with extremely noisy labels. In NeurIPS, pp. 8527–8537. Cited by: §1, §1, §2, §2, §4.1, §4.2, §4.2, §4.
 MentorNet: learning datadriven curriculum for very deep neural networks on corrupted labels. In ICML, pp. 2309–2318. Cited by: §1, §2, §2, §4.2, §4.
 Robust active label correction. In AISTATS, pp. 308–316. Cited by: §1.
 Learning multiple layers of features from tiny images. Cited by: §4.1, §4.
 MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann. lecun. com/exdb/mnist 2. Cited by: §4.1, §4.
 Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In AISTATS, Cited by: §1.
 Learning from noisy labels with distillation. In ICCV, pp. 1910–1918. Cited by: §1.
 Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence 38 (3), pp. 447–461. Cited by: §1, §1, §2, §2, §2, §2, §4.2, §4.
 Peer loss functions: learning from noisy labels without knowing noise rates. arXiv preprint arXiv:1910.03231. Cited by: §1.
 On the minimal supervision for training any binary classifier from only unlabeled data. In ICLR, Cited by: §2.
 Dimensionalitydriven learning with noisy labels. In ICML, pp. 3361–3370. Cited by: §1.
 Decoupling" when to update" from" how to update". In NeurIPS, pp. 960–970. Cited by: §1, §4.2, §4.
 Learning with noisy labels. In NeurIPS, pp. 1196–1204. Cited by: §1, §1, §2, §2.
 Learning with confident examples: rank pruning for robust classification with noisy labels. In UAI, Cited by: §1, §1, §2.

Making deep neural networks robust to label noise: a loss correction approach.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 1944–1952. Cited by: §1, §1, §2, §2, §2, §2, §3.1, §4.1, §4.1, §4.2, §4.  Mixture proportion estimation via kernel embeddings of distributions. In ICML, pp. 2052–2060. Cited by: §2.
 Training deep neural networks on noisy labels with bootstrapping. CoRR. Cited by: §1, §2.
 Learning to reweight examples for robust deep learning. In ICML, pp. 4331–4340. Cited by: §1.
Classification with asymmetric label noise: consistency and maximal denoising. In COLT, pp. 489–511. Cited by: §2.
A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS, pp. 838–846. Cited by: §1, §1, §2, §2, §2.
 Joint optimization framework for learning with noisy labels. In CVPR, pp. 5552–5560. Cited by: §1.
Robustness of conditional GANs to noisy labels. In NeurIPS, pp. 10271–10282. Cited by: §1.
 Toward robustness against label noise in training deep discriminative neural networks. In NeurIPS, pp. 5596–5605. Cited by: §1.
Learning from noisy large-scale datasets with minimal supervision. In CVPR, pp. 839–847. Cited by: §1.
Are anchor points really indispensable in label-noise learning? In NeurIPS, pp. 6835–6846. Cited by: §1, §1, §2, §2, §2, §3.1, §4.1, §4.2, §4.2, §4.
Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §4.1, §4.
Learning from massive noisy labeled data for image classification. In CVPR, pp. 2691–2699. Cited by: §1, §4.
L_DMI: a novel information-theoretic loss function for training deep nets robust to label noise. In NeurIPS, pp. 6222–6233. Cited by: §1.
How does disagreement help generalization against label corruption? In ICML. Cited by: §1.
 Learning with biased complementary labels. In ECCV, pp. 68–83. Cited by: §1, §1, §2, §2, §3.1.
Understanding deep learning requires rethinking generalization. In ICLR. Cited by: §1.
Mixup: beyond empirical risk minimization. In ICLR. Cited by: §4.2, §4.
 Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, pp. 8778–8788. Cited by: §1.
Appendix A Proof of Theorem 1
Proof.
According to Eq. (1) in the main paper, the estimation error for the estimator is
(5) 
As we have assumed, for every instance , for all ,
(6) 
Then, we have
(7) 
The estimation error for the entry of the dual estimator is
(8)  
where the first equation holds because there is no estimation error for the transition matrix denoting the transition from the clean class to the intermediate class (as we have discussed in Section 3.1). The estimation error for the dual estimator comes from the estimation error in fitting the noisy class labels (to eliminate the dependence on the clean label) and the estimation error for , introduced by counting discrete labels.
We have assumed that the estimation error for is , i.e., , and that the estimation error for fitting the noisy class labels is , i.e., . Note that, to eliminate the dependence on the clean label for , we need to achieve for all . An error will be introduced if the noisy class labels are not perfectly fitted. We have that .
We have
(9)  
where the second equation holds because the transition matrices are independent of instances. Hence, the estimation error of is
(10) 
Therefore, under Assumption 1 in the main paper, the estimation error of the dual estimator is smaller than the estimation error of the estimator. ∎
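The comparison that closes the proof can be summarized compactly. Since the original symbols were lost in extraction, the notation below is hypothetical: write ε_post for the error of estimating the noisy class posterior, ε_fit for the error of fitting the noisy class labels, and ε_count for the error of counting discrete labels. A sketch of the comparison, not the paper's exact inequalities, is then:

```latex
% Hypothetical notation; a sketch of the argument, not the paper's exact bounds.
% The estimator's error is driven by the noisy-posterior estimation error,
% while the dual estimator's error is driven by the fitting and counting errors:
\underbrace{\epsilon_{\mathrm{fit}} + \epsilon_{\mathrm{count}}}_{\text{drives the dual estimator's error}}
\;\le\;
\underbrace{\epsilon_{\mathrm{post}}}_{\text{drives the estimator's error}}
\qquad \text{(the relation posited by Assumption 1)}.
```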
Appendix B Empirical Validation of Assumption 1
We empirically verify the relations among the three different errors in Assumption 1. Note that is the estimation error for the noisy class posterior, i.e., ; is the estimation error for counting discrete labels, i.e., ; is the estimation error for fitting the noisy class labels, i.e., .
The experiments are conducted on the synthetic dataset, and the settings are the same as those of the synthetic experiments in Section . The three errors are calculated on the training set, since both the estimator and the dual estimator estimate the transition matrix on the training set.
Figure 4 shows that the error is very small and can be ignored. is consistently smaller than when the sample size is small. The recent work of Daniely and Granot [2019] shows that the sample complexity of the network is linear in the number of parameters, which means that, in practice, we may not have enough training examples to learn the noisy class posterior well (e.g., on CIFAR-10, CIFAR-100, and Fashion-MNIST), so Assumption 1 can be easily satisfied. It is also worth mentioning that, even if Assumption 1 does not hold, the estimation error of the dual estimator may still be smaller than that of the estimator. Specifically, the error of the proposed estimator is upper bounded by . In general, an increase in the upper bound does not imply an increase in the error .
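To make the counting error concrete, here is a minimal numpy-only sketch (a hypothetical setup with made-up names, not the paper's code): a known 3-class transition matrix is estimated purely by counting discrete label pairs, mirroring the counting step of the dual estimator, and the maximum entry-wise estimation error shrinks at roughly the parametric 1/sqrt(n) rate — which is why this error is typically negligible compared with fitting a noisy class posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# A known transition matrix P(noisy label | intermediate label) for 3 classes.
T_true = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8]])

def counting_estimate(n):
    """Estimate T_true purely by counting discrete label pairs."""
    inter = rng.integers(0, 3, size=n)                      # intermediate labels
    noisy = np.array([rng.choice(3, p=T_true[y]) for y in inter])  # noisy labels
    T_hat = np.zeros((3, 3))
    for i in range(3):
        mask = inter == i
        for j in range(3):
            T_hat[i, j] = np.mean(noisy[mask] == j)         # empirical frequency
    return T_hat

for n in (100, 1000, 10000):
    err = np.abs(counting_estimate(n) - T_true).max()       # max entry-wise error
    print(n, round(err, 3))
```

With the fixed seed the run is deterministic, and the counting error generally shrinks as n grows, consistent with this error being ignorable at realistic sample sizes.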