Dual T: Reducing Estimation Error for Transition Matrix in Label-noise Learning

06/14/2020 · by Yu Yao, et al.

The transition matrix, denoting the transition relationship from clean labels to noisy labels, is essential for building statistically consistent classifiers in label-noise learning. Existing methods for estimating the transition matrix rely heavily on estimating the noisy class posterior. However, the estimation error for the noisy class posterior could be large due to the randomness of label noise, which would in turn make the transition matrix poorly estimated. Therefore, in this paper, we aim to solve this problem by exploiting the divide-and-conquer paradigm. Specifically, we introduce an intermediate class to avoid directly estimating the noisy class posterior. Through this intermediate class, the original transition matrix can then be factorized into the product of two easy-to-estimate transition matrices. We term the proposed method the dual T-estimator. Both theoretical analyses and empirical results illustrate the effectiveness of the dual T-estimator for estimating transition matrices, leading to better classification performance.


1 Introduction

Deep learning algorithms rely heavily on large annotated training samples (Daniely and Granot, 2019). However, it is often expensive and sometimes infeasible to annotate large datasets accurately. Therefore, cheap datasets with label noise have been widely employed to train deep learning models (Xiao et al., 2015). Recent results show that label noise significantly degenerates the performance of deep learning models, as deep neural networks easily memorize and eventually fit label noise (Zhang et al., 2017; Arpit et al., 2017).

Existing methods for learning with noisy labels can be divided into two categories: algorithms with statistically inconsistent or consistent classifiers. Methods in the first category usually employ heuristics to reduce the side effects of label noise, such as extracting reliable examples (Yu et al., 2019; Han et al., 2018b; Malach and Shalev-Shwartz, 2017; Ren et al., 2018; Jiang et al., 2018), correcting labels (Ma et al., 2018; Kremer et al., 2018; Tanaka et al., 2018; Reed et al., 2014), and adding regularization (Han et al., 2018a; Guo et al., 2018; Veit et al., 2017; Vahdat, 2017; Li et al., 2017, 2020). Although those methods empirically work well, the classifiers learned from noisy data are not guaranteed to be statistically consistent. To address this limitation, algorithms in the second category have been proposed. They aim to design classifier-consistent algorithms (Zhang and Sabuncu, 2018; Kremer et al., 2018; Liu and Tao, 2016; Northcutt et al., 2017; Scott, 2015; Natarajan et al., 2013; Goldberger and Ben-Reuven, 2017; Patrini et al., 2017; Thekumparampil et al., 2018; Yu et al., 2018; Liu and Guo, 2019; Xu et al., 2019), where classifiers learned by exploiting noisy data statistically converge to the optimal classifiers defined on the clean domain. Intuitively, when facing large-scale noisy data, models trained via classifier-consistent algorithms will approximate the optimal models trained with clean data.

The transition matrix T(x) plays an essential role in designing statistically consistent classifiers, where T_{ij}(x) = P(Ỹ = j | Y = i, X = x), P(·) denotes the probability of an event, X is the random variable of instances/features, Ỹ the variable for the noisy label, and Y the variable for the clean label. The basic idea is that the clean class posterior can be inferred by using the transition matrix and the noisy class posterior (which can be estimated by using noisy data). In general, the transition matrix is unidentifiable and thus hard to learn (Xia et al., 2019). Current state-of-the-art methods (Han et al., 2018b, a; Patrini et al., 2017; Northcutt et al., 2017; Natarajan et al., 2013) assume that the transition matrix is class-dependent and instance-independent, i.e., T_{ij}(x) = P(Ỹ = j | Y = i). Given anchor points, i.e., data points that belong to a specific class almost surely, the class-dependent and instance-independent transition matrix is identifiable (Liu and Tao, 2016; Scott, 2015), and it can be estimated by exploiting the noisy class posterior of anchor points (Liu and Tao, 2016; Patrini et al., 2017; Yu et al., 2018) (more details can be found in Section 2). In this paper, we focus on learning the class-dependent and instance-independent transition matrix, which can be used to improve the classification accuracy of current methods if the matrix is estimated more accurately.

Figure 1: Estimation errors for clean class posteriors and noisy class posteriors on synthetic data. The estimation errors are calculated as the average absolute difference between the ground-truth and estimated class posteriors on randomly sampled test data points. The other details are the same as those of the synthetic experiments in Section 4.

The estimation error for the noisy class posterior is usually much larger than that of the clean class posterior, especially when the sample size is limited. An illustrative example is in Fig. 1. The rationale is that label noise is randomly generated according to a class-dependent transition matrix. Specifically, to learn the noisy class posterior, we need to fit the mapping from instances to clean (latent) labels, as well as the mapping from clean labels to noisy labels. Since the latter mapping is random and independent of instances, the learned mapping that fits label noise is prone to overfitting and thus will lead to a large estimation error for the noisy class posterior. The error will also lead to a large estimation error for the transition matrix. As estimating the transition matrix is a bottleneck for designing consistent classifiers, the large estimation error will significantly degenerate the classification performance (Xia et al., 2019).

Motivated by this problem, in this paper, to reduce the estimation error of the transition matrix, we propose the dual transition estimator (dual T-estimator) to effectively estimate transition matrices. At a high level, by properly introducing an intermediate class, the dual T-estimator avoids directly estimating the noisy class posterior by factorizing the original transition matrix into two new transition matrices, which we denote as T♣ and T♠. T♣ represents the transition from the clean labels to the intermediate class labels, and T♠ the transition from the clean and intermediate class labels to the noisy labels. Note that although we are going to estimate two transition matrices rather than one, we are not reducing the original problem to a harder one. In spirit, our idea belongs to the divide-and-conquer paradigm, i.e., decomposing a hard problem into simple sub-problems and composing the solutions of the sub-problems to solve the original problem. The two new transition matrices are easier to estimate than the original transition matrix, because we will show that (1) there is no estimation error for the transition matrix T♣, (2) the estimation error for the transition matrix T♠ relies on predicting noisy class labels, which is much easier than learning a class posterior since labels are discrete while posteriors are continuous, and (3) the estimators for the two new transition matrices are easy to implement in practice. We will also theoretically show that the two new transition matrices are easier to estimate than the original transition matrix. Empirical results on several datasets and label noise settings consistently justify the effectiveness of the dual T-estimator in reducing the estimation error of transition matrices and boosting the classification performance.

The rest of the paper is organized as follows. In Section 2, we review the current transition matrix estimator that exploits anchor points. In Section 3, we introduce our method and analyze how it reduces the estimation error. Experimental results on both synthetic and real-world datasets are provided in Section 4. Finally, we conclude the paper in Section 5.

2 Estimating Transition Matrix

Problem setup. Let D be the distribution of a pair of random variables (X, Y) ∈ 𝒳 × {1, …, C}, where X denotes the variable of instances, Y the variable of labels, 𝒳 the feature space, {1, …, C} the label space, and C the number of classes. In many real-world classification problems, examples independently drawn from D are unavailable. Before being observed, their clean labels are randomly flipped into noisy labels because of, e.g., contamination (Scott et al., 2013). Let D̃ be the distribution of the noisy pair (X, Ỹ), where Ỹ denotes the variable of noisy labels. In label-noise learning, we only have a sample set independently drawn from D̃. The aim is to learn a robust classifier from the noisy sample that can assign clean labels to test instances.

Transition matrix. To build statistically consistent classifiers, which will converge to the optimal classifiers defined by using clean data, we need to introduce the concept of the transition matrix (Natarajan et al., 2013; Liu and Tao, 2016; Reed et al., 2014). Specifically, the ij-th entry of the transition matrix, i.e., T_{ij}(x) = P(Ỹ = j | Y = i, X = x), represents the probability that the instance x with the clean label i will have the noisy label j. The transition matrix has been widely studied to build statistically consistent classifiers, because the clean class posterior P(Y | X = x) can be inferred by using the transition matrix and the noisy class posterior P(Ỹ | X = x), i.e., we have P(Ỹ = j | X = x) = Σ_i T_{ij}(x) P(Y = i | X = x). Specifically, the transition matrix has been used to modify loss functions to build risk-consistent estimators, e.g., (Goldberger and Ben-Reuven, 2017; Patrini et al., 2017; Yu et al., 2018; Xia et al., 2019), and to correct hypotheses to build classifier-consistent algorithms, e.g., (Natarajan et al., 2013; Scott, 2015; Patrini et al., 2017). Moreover, state-of-the-art statistically inconsistent algorithms (Jiang et al., 2018; Han et al., 2018b) also use the diagonal entries of the transition matrix to help select reliable examples for training.
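To make this relation concrete, the following minimal numpy sketch (our illustration, not code from the paper; the 3-class matrix T and the posterior values are made up) recovers a clean class posterior from a noisy one when the transition matrix is known and invertible.

```python
import numpy as np

# Illustrative 3-class transition matrix: T[i, j] = P(noisy = j | clean = i).
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

# Noisy class posterior for one instance x: P(noisy | X = x).
p_noisy = np.array([0.70, 0.20, 0.10])

# Since P(noisy | x) = T^T P(clean | x), the clean posterior follows by solving
# the linear system T^T p_clean = p_noisy.
p_clean = np.linalg.solve(T.T, p_noisy)

# Numerical clean-up: clip tiny negatives and renormalise.
p_clean = np.clip(p_clean, 0.0, None)
p_clean /= p_clean.sum()
print(p_clean)  # approx. [0.857, 0.143, 0.0]
```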

As the noisy class posterior can be estimated by exploiting the noisy training data, the key remaining step is how to effectively estimate the transition matrix. Given only noisy data, the transition matrix is unidentifiable without any knowledge of the clean labels (Xia et al., 2019). Specifically, the transition matrix T can be decomposed into the product of two new transition matrices, i.e., T = AB, and a different "clean" class posterior can be obtained by composing A with the true one, i.e., P'(Y | X = x) = A^⊤ P(Y | X = x); then (T, P(Y | X)) and (B, P'(Y | X)) explain the same noisy class posterior and are both valid decompositions. The current state-of-the-art methods (Han et al., 2018b, a; Patrini et al., 2017; Northcutt et al., 2017; Natarajan et al., 2013) therefore study a special case by assuming that the transition matrix is class-dependent and instance-independent, i.e., T_{ij}(x) = P(Ỹ = j | Y = i). Note that there are specific settings (Elkan and Noto, 2008; Lu et al., 2018; Bao et al., 2018) where noise is independent of instances. A series of assumptions (Liu and Tao, 2016; Scott, 2015; Ramaswamy et al., 2016) were further proposed to identify or efficiently estimate the transition matrix by only exploiting noisy data. In this paper, we focus on estimating the class-dependent and instance-independent transition matrix, which is the focus of the vast majority of current state-of-the-art label-noise learning algorithms (Han et al., 2018b, a; Patrini et al., 2017; Northcutt et al., 2017; Natarajan et al., 2013; Jiang et al., 2018). The matrix estimated by our method can then be seamlessly embedded into these algorithms, and their classification accuracy can be improved if the transition matrix is estimated more accurately.

Transition matrix estimation. The anchor point assumption (Liu and Tao, 2016; Scott, 2015; Xia et al., 2019) is widely adopted for estimating the transition matrix. Anchor points are defined in the clean data domain. Formally, an instance x^i is an anchor point of the i-th clean class if P(Y = i | X = x^i) = 1 (Liu and Tao, 2016; Xia et al., 2019). Suppose we have access to the noisy class posterior and anchor points; the transition matrix can then be obtained via

P(Ỹ = j | X = x^i) = Σ_k T_{kj}(x^i) P(Y = k | X = x^i) = T_{ij}(x^i) = T_{ij},

where the second equation holds because P(Y = k | X = x^i) = 1 when k = i and 0 otherwise, and the last equation holds because the transition matrix is independent of the instance. According to this equation, to estimate the transition matrix we need to find anchor points and estimate the noisy class posterior; the transition matrix can then be estimated as

T̂_{ij} = P̂(Ỹ = j | X = x^i).    (1)

This estimation method has been widely used in label-noise learning (Liu and Tao, 2016; Patrini et al., 2017; Xia et al., 2019), and we term it the transition estimator (T-estimator).
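The following is a minimal sketch of the T-estimator per Eq. (1), combined with the common heuristic (discussed next) of taking, for each class, the instance with the largest estimated noisy posterior as its anchor point. The function and variable names are ours; `noisy_posterior` is assumed to hold an estimated P̂(Ỹ | X) over a pool of instances.

```python
import numpy as np

def t_estimator(noisy_posterior: np.ndarray) -> np.ndarray:
    """Estimate the transition matrix from an estimated noisy class posterior.

    noisy_posterior: array of shape (n_samples, n_classes), row n holding the
    estimated P̂(noisy = j | X = x_n) for instance x_n.
    """
    n_classes = noisy_posterior.shape[1]
    T_hat = np.empty((n_classes, n_classes))
    for i in range(n_classes):
        # Heuristic anchor point for class i: the instance with the largest
        # estimated posterior for noisy class i (cf. Patrini et al., 2017).
        anchor = np.argmax(noisy_posterior[:, i])
        # Eq. (1): the i-th row of T is the noisy posterior at the anchor point.
        T_hat[i] = noisy_posterior[anchor]
    return T_hat
```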

Note that some methods assume anchor points have already been given (Yu et al., 2018). However, this assumption can be strong in applications where anchor points are hard to identify. It has been proven that anchor points can be learned from noisy data (Liu and Tao, 2016), e.g., by taking x^i = arg max_x P̂(Ỹ = i | X = x), although this result only holds for binary classification. The same estimator has also been employed for multi-class classification (Patrini et al., 2017); it empirically performs well but lacks a theoretical guarantee. How to identify anchor points in the multi-class classification problem with a theoretical guarantee remains an unsolved problem.

Eq. (1) and the above discussion on learning anchor points show that the T-estimator relies heavily on the estimation of the noisy class posterior. Unfortunately, due to the randomness of label noise, the estimation error of the noisy class posterior is usually large. As illustrated in Fig. 1, with the same number of training examples, the estimation error of the noisy class posterior is significantly larger than that of the clean class posterior. This motivates us to seek an alternative estimator that avoids directly using the estimated noisy class posterior to approximate the transition matrix.

3 Reducing Estimation Error for Transition Matrix

To avoid directly using the estimated noisy class posterior to approximate the transition matrix, we propose a new estimator in this section.

3.1 Dual T-Estimator

By introducing an intermediate class, the transition matrix can be factorized in the following way:

T_{ij}(x) = P(Ỹ = j | Y = i, X = x) = Σ_l P(Ỹ = j | Y' = l, Y = i, X = x) P(Y' = l | Y = i, X = x),    (2)

where Y' represents the random variable for the introduced intermediate class, T♠ denotes the transition matrix with entries P(Ỹ = j | Y' = l, Y = i), and T♣ denotes the transition matrix with entries P(Y' = l | Y = i). In other words, T♠ and T♣ are two transition matrices representing the transition from the clean and intermediate class labels to the noisy class labels and the transition from the clean labels to the intermediate class labels, respectively.
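The factorization can be sanity-checked numerically. The snippet below is our sketch (the class count and conditionals are made up), and it already assumes the conditional independence introduced later in Eq. (3), so that the sum over the intermediate class reduces to a matrix product; it checks that composing the two matrices again yields a valid transition matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 3  # number of classes (illustrative)

# Made-up row-stochastic conditionals:
# T_club[i, l]  = P(Y' = l | Y = i)      (clean -> intermediate)
# T_spade[l, j] = P(noisy = j | Y' = l)  (intermediate -> noisy), assuming
# P(noisy | Y', Y) = P(noisy | Y') as in Eq. (3).
T_club = rng.dirichlet(np.ones(C), size=C)
T_spade = rng.dirichlet(np.ones(C), size=C)

# Under Eq. (3), the sum over the intermediate class in Eq. (2) becomes
# T[i, j] = sum_l T_club[i, l] * T_spade[l, j], i.e. a matrix product.
T = T_club @ T_spade

# T is again a valid (row-stochastic) transition matrix.
assert np.allclose(T.sum(axis=1), 1.0)
print(np.round(T, 3))
```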

At first glance of Eq. (2), it may seem that we have turned an easy problem into a hard one. However, this is not the case: we actually break the problem down into simple sub-problems, and combining the solutions to the sub-problems gives a solution to the original problem. In spirit, our idea belongs to the divide-and-conquer paradigm. In the rest of this subsection, we explain why the transition matrices T♣ and T♠ are easy to estimate. In the next subsection, we theoretically compare the estimation error of the dual T-estimator with that of the T-estimator.

It can be seen that T♣ has a similar form to T, so we can employ the same method developed for T, i.e., the T-estimator, to estimate T♣. However, there appear to be two challenges: (1) it looks difficult to access P(Y' | X); (2) we may also incur an error when estimating P(Y' | X). Fortunately, both challenges can be addressed by properly designing the intermediate class. Specifically, we design the intermediate class such that P(Y' | X = x) = P̂(Ỹ | X = x), where P̂(Ỹ | X = x) is an estimated noisy class posterior, which can be obtained by exploiting the noisy data at hand. As we have discussed, due to the randomness of label noise, estimating P(Ỹ | X) directly will have a large estimation error, especially when the noisy training sample size is limited. However, since we have direct access to P(Y' | X), according to Eq. (1), the estimation error for T♣ is zero if anchor points are given (if the anchor points have to be learned, the estimation error introduced by identifying them remains the same for the T-estimator and the dual T-estimator, since both can employ the same identification rule, e.g., x^i = arg max_x P̂(Ỹ = i | X = x)).

Although the transition matrix T♠ involves three variables, i.e., the clean class, the intermediate class, and the noisy class, we have class labels available for two of them: the intermediate class and the noisy class. Note that the intermediate class labels can be assigned by using P(Y' | X = x) = P̂(Ỹ | X = x). Usually, the clean class labels are not available, which motivates us to find a way to eliminate the dependence of T♠ on the clean class. From an information-theoretic point of view (Csiszár et al., 2004), if the clean class Y is less informative for the noisy class Ỹ than the intermediate class Y', in other words, if given Y', Y contains no more information for predicting Ỹ, then Ỹ is independent of Y conditioned on Y', i.e.,

P(Ỹ = j | Y' = l, Y = i) = P(Ỹ = j | Y' = l).    (3)

A sufficient condition for the above equality to hold is that the intermediate class labels are identical to the noisy labels. In practice, it is hard to find an intermediate class whose labels are exactly identical to the noisy labels, and this mismatch is the main factor contributing to the estimation error of T♠. Note also that, since we have labels for both the noisy class and the intermediate class, P(Ỹ | Y') in Eq. (3) is easy to estimate by simply counting the discrete labels, and this counting estimate has a small estimation error that converges to zero exponentially fast (Boucheron et al., 2013).

Based on the above discussion, by factorizing the transition matrix into T♣ and T♠, we change the problem of estimating the noisy class posterior into the problem of fitting the noisy labels. Note that the noisy class posterior lies in the continuous range [0, 1], while the noisy class labels lie in the discrete set {1, …, C}. Intuitively, learning the class labels is much easier than learning the class posteriors. In Section 4, our experiments on synthetic and real-world datasets further justify this by showing a significant gap between the estimation errors of the T-estimator and the dual T-estimator.

Implementation of the dual T-estimator. The dual T-estimator is described in Algorithm 1. Specifically, the transition matrix T♣ can be easily estimated by letting P(Y' | X) = P̂(Ỹ | X) and then employing the T-estimator (see Section 2). By generating intermediate class labels, e.g., letting ȳ' = arg max_l P̂(Ỹ = l | X = x) be the label for the instance x, the transition matrix T♠ can be estimated via counting, i.e.,

T̂♠_{lj} = Σ_n 1{ȳ'_n = l ∧ ỹ_n = j} / Σ_n 1{ȳ'_n = l},    (4)

where 1{·} is an indicator function which equals one when its argument holds true and zero otherwise, (x_n, ỹ_n) are examples from the training sample with assigned intermediate labels ȳ'_n, and ∧ represents the AND operation.

  Input: Noisy training sample S; noisy validation sample S_v.
1:  Obtain the learned noisy class posterior P̂(Ỹ | X) by exploiting the training and validation sets;
2:  Let P(Y' | X) = P̂(Ỹ | X) and employ the T-estimator to estimate T̂♣ according to Eq. (1);
3:  Use Eq. (4) to estimate T̂♠;
4:  T̂ = T̂♣ T̂♠;
  Output: The estimated transition matrix T̂.
Algorithm 1 Dual T-Estimator
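A compact sketch of Algorithm 1 might look as follows. This is our code, not the authors' implementation; it assumes `noisy_posterior` holds the learned P̂(Ỹ | X) evaluated on the noisy training sample and `noisy_labels` holds the observed 0-indexed noisy labels of the same sample.

```python
import numpy as np

def estimate_T_club(noisy_posterior: np.ndarray) -> np.ndarray:
    """Step 2: T-estimator applied with P(Y' | X) set to the learned P̂(noisy | X)."""
    n_classes = noisy_posterior.shape[1]
    T_club = np.empty((n_classes, n_classes))
    for i in range(n_classes):
        anchor = np.argmax(noisy_posterior[:, i])  # heuristic anchor point for class i
        T_club[i] = noisy_posterior[anchor]
    return T_club

def estimate_T_spade(intermediate_labels: np.ndarray, noisy_labels: np.ndarray,
                     n_classes: int) -> np.ndarray:
    """Step 3 (Eq. (4)): count co-occurrences of intermediate and noisy labels."""
    T_spade = np.zeros((n_classes, n_classes))
    for l_label, j_label in zip(intermediate_labels, noisy_labels):
        T_spade[l_label, j_label] += 1.0
    row_sums = T_spade.sum(axis=1, keepdims=True)
    return T_spade / np.clip(row_sums, 1.0, None)  # avoid division by zero

def dual_t_estimator(noisy_posterior: np.ndarray, noisy_labels: np.ndarray) -> np.ndarray:
    """Algorithm 1: compose the two easy-to-estimate matrices (step 4)."""
    n_classes = noisy_posterior.shape[1]
    # Intermediate labels are assigned by the learned noisy posterior itself.
    intermediate_labels = noisy_posterior.argmax(axis=1)
    T_club = estimate_T_club(noisy_posterior)
    T_spade = estimate_T_spade(intermediate_labels, noisy_labels, n_classes)
    return T_club @ T_spade
```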

Many statistically consistent algorithms (Goldberger and Ben-Reuven, 2017; Patrini et al., 2017; Yu et al., 2018; Xia et al., 2019) consist of a two-step training procedure. The first step estimates the transition matrix and the second step builds statistically consistent algorithms, for example, via modifying loss functions. Our proposed dual -estimator can be seamlessly embedded into their frameworks. More details can be found in Section 4.
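As one concrete example of such an embedding, the estimated matrix can be plugged into a forward-style loss correction in the spirit of Patrini et al. (2017). The sketch below is ours and operates on per-example probabilities rather than logits; the function name is hypothetical.

```python
import numpy as np

def forward_corrected_nll(clean_probs: np.ndarray, noisy_label: int,
                          T_hat: np.ndarray) -> float:
    """Forward-corrected negative log-likelihood for one example.

    clean_probs: model output P(clean = i | x), shape (n_classes,).
    T_hat: estimated transition matrix, T_hat[i, j] ~ P(noisy = j | clean = i).
    The predicted noisy posterior is T_hat^T @ clean_probs, and the loss is the
    usual NLL against the observed noisy label.
    """
    noisy_probs = T_hat.T @ clean_probs
    return float(-np.log(noisy_probs[noisy_label] + 1e-12))
```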

3.2 Theoretical Analysis

In this subsection, we justify that the estimation error can be greatly reduced by estimating T♣ and T♠ rather than estimating T directly.

As discussed before, the estimation error of the T-estimator is caused by estimating the noisy class posterior, while the estimation error of the dual T-estimator comes from estimating T♠, i.e., from fitting the noisy class labels and from estimating P(Ỹ | Y') by counting discrete labels. Note that, to eliminate the dependence of T♠ on the clean label, we need P(Ỹ = j | Y' = l, Y = i) = P(Ỹ = j | Y' = l) to hold. Let the estimation error for the noisy class posterior be ε_1, i.e., |P̂(Ỹ = j | X = x) − P(Ỹ = j | X = x)| = ε_1. Let the estimation error of estimating P(Ỹ | Y') by counting discrete labels be ε_2, and let the estimation error introduced by fitting the noisy class labels (i.e., by the mismatch between the intermediate labels and the noisy labels) be ε_3. We will show that, under the following assumption, the estimation error of the dual T-estimator is smaller than the estimation error of the T-estimator.

Assumption 1.

For all instances x, ε_2 + ε_3 < ε_1.

Assumption 1 is easy to satisfy. Theoretically, the error ε_2 involves no predefined hypothesis space, and the probability that ε_2 is larger than any fixed positive number converges to zero exponentially fast (Boucheron et al., 2013). Thus, ε_2 is usually much smaller than ε_1 and ε_3, and we therefore focus on comparing ε_3 with ε_1 while ignoring ε_2. Intuitively, the error ε_3 is smaller than ε_1 because it is easier to obtain a small estimation error when fitting noisy class labels than when estimating noisy class posteriors. We note that the noisy class posterior lies in the continuous range [0, 1] while the noisy class labels lie in the discrete set {1, …, C}. For example, suppose we have an instance x whose noisy label ỹ satisfies P(Ỹ = ỹ | X = x) = 1; then, as long as the estimated posterior P̂(Ỹ = ỹ | X = x) is greater than 1/2, the noisy label will be accurately learned, even though the estimation error of the noisy class posterior can be as large as 1/2. We also empirically verify the relation among these errors in Appendix B.

Theorem 1.

Under Assumption 1, the estimation error of the dual T-estimator is smaller than the estimation error of the T-estimator.
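In the notation above, the comparison behind Theorem 1 can be summarized as follows (a sketch only; x^i denotes an anchor point of class i and T̂^dual the dual T-estimate; the precise statement is derived in Appendix A):

```latex
% Compact restatement of the error comparison in the notation of Section 3.2.
\begin{align*}
  \text{$T$-estimator:} \quad
    \bigl|\hat{T}_{ij} - T_{ij}\bigr|
      &= \bigl|\hat{P}(\tilde{Y}=j \mid X=x^i) - P(\tilde{Y}=j \mid X=x^i)\bigr|
       = \varepsilon_1,\\[2pt]
  \text{dual $T$-estimator:} \quad
    \bigl|\hat{T}^{\mathrm{dual}}_{ij} - T_{ij}\bigr|
      &\le \varepsilon_2 + \varepsilon_3
       \;<\; \varepsilon_1 \quad \text{(by Assumption 1).}
\end{align*}
```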

4 Experiments

We compare the transition matrix estimation error produced by the proposed dual T-estimator and the T-estimator on both synthetic and real-world datasets. We also compare the classification accuracy of state-of-the-art label-noise learning algorithms (Liu and Tao, 2016; Patrini et al., 2017; Jiang et al., 2018; Han et al., 2018b; Xia et al., 2019; Zhang et al., 2018; Malach and Shalev-Shwartz, 2017) obtained by using the T-estimator and the dual T-estimator, respectively. MNIST (LeCun et al., 2010), Fashion-MNIST (F-MNIST) (Xiao et al., 2017), CIFAR10, CIFAR100 (Krizhevsky et al., 2009), and Clothing1M (Xiao et al., 2015) are used in the experiments. Note that, as there is no estimation error for T♣, we do not need an ablation study to show how the two new transition matrices individually contribute to the estimation error.

Figure 2: Estimation error of transition matrix on the synthetic dataset.

4.1 Transition Matrix Estimation

We compare the estimation error of our estimator and the T-estimator on both synthetic and real-world datasets with different sample sizes and noise types. The synthetic dataset is created by sampling from two different 10-dimensional Gaussian distributions: one has zero mean and unit variance in every dimension, and the other has mean two and unit variance in every dimension. The real-world image datasets used to evaluate the transition matrix estimation error are MNIST (LeCun et al., 2010), F-MNIST (Xiao et al., 2017), CIFAR10, and CIFAR100 (Krizhevsky et al., 2009).

We conduct experiments on commonly used noise types (Han et al., 2018b; Xia et al., 2019). Specifically, two representative structures of the transition matrix are investigated: (1) symmetry flipping (Sym-ε) (Patrini et al., 2017); (2) pair flipping (Pair-ε) (Han et al., 2018b), where ε denotes the noise rate. To generate noisy datasets, we corrupt the training and validation set of each dataset according to the transition matrix T.
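For reference, the two noise structures can be generated as in the following minimal sketch, which uses the standard constructions associated with these names; the function names and the label-corruption helper are ours.

```python
import numpy as np

def symmetric_T(n_classes: int, noise_rate: float) -> np.ndarray:
    """Sym-eps: keep the label w.p. 1 - eps, otherwise flip uniformly to another class."""
    T = np.full((n_classes, n_classes), noise_rate / (n_classes - 1))
    np.fill_diagonal(T, 1.0 - noise_rate)
    return T

def pair_T(n_classes: int, noise_rate: float) -> np.ndarray:
    """Pair-eps: keep the label w.p. 1 - eps, otherwise flip to the next class."""
    T = np.eye(n_classes) * (1.0 - noise_rate)
    for i in range(n_classes):
        T[i, (i + 1) % n_classes] = noise_rate
    return T

def corrupt_labels(clean_labels: np.ndarray, T: np.ndarray,
                   rng: np.random.Generator) -> np.ndarray:
    """Draw a noisy label for each clean label according to its row of T."""
    return np.array([rng.choice(len(T), p=T[y]) for y in clean_labels])
```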

Neural network classifiers are used to estimate transition matrices. For a fair comparison, the same network structure is used for both estimators. Specifically, on the synthetic dataset, a two-hidden-layer network is used; on the real-world datasets, we follow the network structures used by the state-of-the-art method (Patrini et al., 2017), i.e., a LeNet network with dropout for MNIST, a ResNet for F-MNIST and CIFAR10, a deeper ResNet for CIFAR100, and a ResNet-50 pre-trained on ImageNet for Clothing1M. The networks are trained for 100 epochs with a stochastic gradient descent (SGD) optimizer, and the learning rate is decayed by a fixed factor during training. Part of the training examples is held out for validation, and the model with the best validation accuracy is selected for estimating the transition matrix. The estimation error is calculated as the distance between the estimated transition matrix and the ground truth T. The average estimation error and the standard deviation over repeated trials for both estimators are illustrated in Figs. 2 and 3.

Fig. 2 illustrates the estimation errors of the T-estimator and the dual T-estimator on the synthetic dataset. For both noise types, the estimation error of both estimators tends to decrease as the training sample size increases. However, the estimation error of the dual T-estimator is consistently smaller than that of the T-estimator. Moreover, the estimation error of the dual T-estimator is less sensitive to the noise type than that of the T-estimator. Specifically, even when the T-estimator is trained with all the training examples, its estimation error on Pair-ε noise is approximately double that on Sym-ε noise, as can be seen at the right-hand side of the estimation error curves. In contrast, when the dual T-estimator is trained with all the training examples, its estimation error does not differ significantly across noise types.

Figure 3: Transition matrix estimation error on MNIST, F-MNIST, CIFAR10, and CIFAR100. The error bar for standard deviation in each figure has been shaded. The lower the better.

Similar to the results on the synthetic dataset, the experiments on the real-world image datasets illustrated in Fig. 3 also show that the estimation error of the dual T-estimator is consistently smaller than that of the T-estimator, except on CIFAR100, which illustrates the effectiveness of the proposed dual T-estimator. On CIFAR100, both estimators have larger estimation errors than on MNIST, F-MNIST, and CIFAR10. The dual T-estimator outperforms the T-estimator for large sample sizes. However, when the training sample size is small, the estimation error of the dual T-estimator can be larger than that of the T-estimator; this is because the number of images per class is too small to estimate the transition matrix, which can be very sparse and lead to a large estimation error.

Table 1: Classification accuracy (percentage) on MNIST, F-MNIST, CIFAR10, and CIFAR100 under Sym-20%, Sym-50%, and Pair-45% label noise. The compared methods are CE, Mixup, Decoupling, T-MentorNet, Dual-T-MentorNet, T-Coteaching, Dual-T-Coteaching, T-Forward, Dual-T-Forward, T-Reweighting, Dual-T-Reweighting, T-Revision, and Dual-T-Revision.

Table 2: Classification accuracy (percentage) on Clothing1M for CE, Mixup, Decoupling, MentorNet, Coteaching, Forward, Reweighting, and Revision.

4.2 Classification Accuracy Evaluation

We investigate how the estimates from the T-estimator and the dual T-estimator affect the classification accuracy in label-noise learning. The experiments are conducted on MNIST, F-MNIST, CIFAR10, CIFAR100, and Clothing1M, and the classification accuracies are reported in Table 1 and Table 2. Eight popular baselines are selected for comparison: Coteaching (Han et al., 2018b) and MentorNet (Jiang et al., 2018), which use diagonal entries of the transition matrix to help select reliable examples for training; Forward (Patrini et al., 2017) and Revision (Xia et al., 2019), which use the transition matrix to correct hypotheses; and Reweighting (Liu and Tao, 2016), which uses the transition matrix to build risk-consistent algorithms. Three further baselines do not require any knowledge of the transition matrix: CE, which trains a network on the noisy sample directly using the cross-entropy loss; Decoupling (Malach and Shalev-Shwartz, 2017), which trains two networks and updates the parameters only on the examples for which the two classifiers disagree; and Mixup (Zhang et al., 2018), which reduces the memorization of corrupt labels by applying linear interpolation to feature-target pairs. The estimates of the T-estimator and the dual T-estimator are both applied to the baselines that rely on the transition matrix. The baselines using the estimate of the T-estimator are called T-Coteaching, T-MentorNet, T-Forward, T-Revision, and T-Reweighting; the baselines using the estimate of the dual T-estimator are called Dual-T-Coteaching, Dual-T-MentorNet, Dual-T-Forward, Dual-T-Revision, and Dual-T-Reweighting.

The settings of our experiments may differ from those in the original papers, so the reported accuracies can differ. For instance, in the original paper of Coteaching (Han et al., 2018b), the noise rate is given and all data are used for training. In contrast, we assume the noise rate is unknown and needs to be estimated, and we use only part of the data for training, since a portion is left out as the validation set for transition matrix estimation. In the original paper of T-Revision (Xia et al., 2019), the experiments on Clothing1M use clean data for validation; in contrast, we only use noisy data for validation.

In Table 1 and Table 2, we bold the better classification accuracy produced by each baseline method when integrated with the T-estimator or the dual T-estimator, and the best classification accuracy among all methods in each column is additionally highlighted. The tables show that, for most of the experiments, the classification accuracy obtained with our estimate is better than that obtained with the T-estimator. This is because the dual T-estimator leads to a smaller estimation error than the T-estimator when trained with a large sample size, which can be observed at the right-hand side of the estimation error curves in Fig. 3. The baselines with the most significant improvement from our estimate are Coteaching and MentorNet: Dual-T-Coteaching outperforms all the other methods under Sym-ε noise. On the Clothing1M dataset, Dual-T-Revision achieves the best classification accuracy. The experiments on the real-world datasets not only show the effectiveness of the dual T-estimator in improving the classification accuracy of current label-noise learning algorithms, but also reflect the importance of transition matrix estimation in label-noise learning.

5 Conclusion

The transition matrix plays an important role in label-noise learning. In this paper, to avoid the large estimation error of the noisy class posterior, which leads to a poorly estimated transition matrix, we have proposed a new transition matrix estimator named the dual T-estimator. The new estimator exploits the divide-and-conquer paradigm: by introducing an intermediate class, it factorizes the original transition matrix into the product of two easy-to-estimate transition matrices. Both theoretical analyses and experiments on synthetic and real-world label-noise data show that our estimator reduces the estimation error of the transition matrix, which in turn leads to better classification accuracy for current label-noise learning algorithms.

Acknowledgments

TL was supported by Australian Research Council Project DE-190101473. BH was supported by HKBU Tier 1 Start-up Grant and HKBU CSD Start-up Grant. GN and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan.

References

  • D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. (2017) A closer look at memorization in deep networks. In ICML, pp. 233–242. Cited by: §1.
  • H. Bao, G. Niu, and M. Sugiyama (2018) Classification from pairwise similarity and unlabeled data. In ICML, pp. 452–461. Cited by: §2.
  • S. Boucheron, G. Lugosi, and P. Massart (2013) Concentration inequalities: a nonasymptotic theory of independence. Oxford university press. Cited by: §3.1, §3.2.
  • I. Csiszár, P. C. Shields, et al. (2004) Information theory and statistics: a tutorial. Foundations and Trends® in Communications and Information Theory 1 (4), pp. 417–528. Cited by: §3.1.
  • A. Daniely and E. Granot (2019) Generalization bounds for neural networks via approximate description length. In NeurIPS, pp. 12988–12996. Cited by: Appendix B, §1.
  • C. Elkan and K. Noto (2008) Learning classifiers from only positive and unlabeled data. In SIGKDD, pp. 213–220. Cited by: §2.
  • J. Goldberger and E. Ben-Reuven (2017) Training deep neural-networks using a noise adaptation layer. In ICLR, Cited by: §1, §2, §3.1.
  • S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott, and D. Huang (2018) CurriculumNet: weakly supervised learning from large-scale web images. In ECCV, pp. 135–150. Cited by: §1.
  • B. Han, J. Yao, G. Niu, M. Zhou, I. Tsang, Y. Zhang, and M. Sugiyama (2018a) Masking: a new perspective of noisy supervision. In NeurIPS, pp. 5836–5846. Cited by: §1, §1, §2.
  • B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018b) Co-teaching: robust training of deep neural networks with extremely noisy labels. In NeurIPS, pp. 8527–8537. Cited by: §1, §1, §2, §2, §4.1, §4.2, §4.2, §4.
  • L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2018) MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pp. 2309–2318. Cited by: §1, §2, §2, §4.2, §4.
  • J. Kremer, F. Sha, and C. Igel (2018) Robust active label correction. In AISTATS, pp. 308–316. Cited by: §1.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §4.1, §4.
  • Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2. Cited by: §4.1, §4.
  • M. Li, M. Soltanolkotabi, and S. Oymak (2020) Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In AISTATS, Cited by: §1.
  • Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L. Li (2017) Learning from noisy labels with distillation. In ICCV, pp. 1910–1918. Cited by: §1.
  • T. Liu and D. Tao (2016) Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence 38 (3), pp. 447–461. Cited by: §1, §1, §2, §2, §2, §2, §4.2, §4.
  • Y. Liu and H. Guo (2019) Peer loss functions: learning from noisy labels without knowing noise rates. arXiv preprint arXiv:1910.03231. Cited by: §1.
  • N. Lu, G. Niu, A. K. Menon, and M. Sugiyama (2018) On the minimal supervision for training any binary classifier from only unlabeled data. In ICLR, Cited by: §2.
  • X. Ma, Y. Wang, M. E. Houle, S. Zhou, S. M. Erfani, S. Xia, S. Wijewickrema, and J. Bailey (2018) Dimensionality-driven learning with noisy labels. In ICML, pp. 3361–3370. Cited by: §1.
  • E. Malach and S. Shalev-Shwartz (2017) Decoupling "when to update" from "how to update". In NeurIPS, pp. 960–970. Cited by: §1, §4.2, §4.
  • N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari (2013) Learning with noisy labels. In NeurIPS, pp. 1196–1204. Cited by: §1, §1, §2, §2.
  • C. G. Northcutt, T. Wu, and I. L. Chuang (2017) Learning with confident examples: rank pruning for robust classification with noisy labels. In UAI, Cited by: §1, §1, §2.
  • G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952. Cited by: §1, §1, §2, §2, §2, §2, §3.1, §4.1, §4.1, §4.2, §4.
  • H. Ramaswamy, C. Scott, and A. Tewari (2016) Mixture proportion estimation via kernel embeddings of distributions. In ICML, pp. 2052–2060. Cited by: §2.
  • S. E. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich (2014) Training deep neural networks on noisy labels with bootstrapping. CoRR. Cited by: §1, §2.
  • M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In ICML, pp. 4331–4340. Cited by: §1.
  • C. Scott, G. Blanchard, and G. Handy (2013) Classification with asymmetric label noise: consistency and maximal denoising. In Conference On Learning Theory, pp. 489–511. Cited by: §2.
  • C. Scott (2015) A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In Artificial Intelligence and Statistics, pp. 838–846. Cited by: §1, §1, §2, §2, §2.
  • D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa (2018) Joint optimization framework for learning with noisy labels. In CVPR, pp. 5552–5560. Cited by: §1.
  • K. K. Thekumparampil, A. Khetan, Z. Lin, and S. Oh (2018) Robustness of conditional gans to noisy labels. In NeurIPS, pp. 10271–10282. Cited by: §1.
  • A. Vahdat (2017) Toward robustness against label noise in training deep discriminative neural networks. In NeurIPS, pp. 5596–5605. Cited by: §1.
  • A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie (2017) Learning from noisy large-scale datasets with minimal supervision. In CVPR, pp. 839–847. Cited by: §1.
  • X. Xia, T. Liu, N. Wang, B. Han, C. Gong, G. Niu, and M. Sugiyama (2019) Are anchor points really indispensable in label-noise learning?. In NeurIPS, pp. 6835–6846. Cited by: §1, §1, §2, §2, §2, §3.1, §4.1, §4.2, §4.2, §4.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §4.1, §4.
  • T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang (2015) Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2691–2699. Cited by: §1, §4.
  • Y. Xu, P. Cao, Y. Kong, and Y. Wang (2019) L_DMI: a novel information-theoretic loss function for training deep nets robust to label noise. In NeurIPS, pp. 6222–6233. Cited by: §1.
  • X. Yu, B. Han, J. Yao, G. Niu, I. W. Tsang, and M. Sugiyama (2019) How does disagreement help generalization against label corruption?. ICML. Cited by: §1.
  • X. Yu, T. Liu, M. Gong, and D. Tao (2018) Learning with biased complementary labels. In ECCV, pp. 68–83. Cited by: §1, §1, §2, §2, §3.1.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In ICLR, Cited by: §1.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. In ICLR, Cited by: §4.2, §4.
  • Z. Zhang and M. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, pp. 8778–8788. Cited by: §1.

Appendix A Proof of Theorem 1

Proof.

According to Eq. (1) in the main paper, the estimation error for the T-estimator is

|T̂_{ij} − T_{ij}| = |P̂(Ỹ = j | X = x^i) − P(Ỹ = j | X = x^i)|,    (5)

where x^i is an anchor point of the i-th class. As we have assumed, for all instances x and all j,

|P̂(Ỹ = j | X = x) − P(Ỹ = j | X = x)| = ε_1.    (6)

Then, we have

|T̂_{ij} − T_{ij}| = ε_1.    (7)

The estimation error for the ij-th entry of the dual T-estimator is

|Σ_l T̂♠_{lj} T̂♣_{il} − Σ_l T♠_{lj} T♣_{il}| = |Σ_l (T̂♠_{lj} − T♠_{lj}) T♣_{il}|,    (8)

where the first equation holds because there is no estimation error for the transition matrix T♣ denoting the transition from the clean class to the intermediate class (as discussed in Section 3.1). The estimation error for the dual T-estimator therefore comes from the estimation error for fitting the noisy class labels (to eliminate the dependence on the clean label) and the estimation error for P(Ỹ | Y') by counting discrete labels.

We have assumed that the estimation error for P(Ỹ | Y') by counting is ε_2 and that the estimation error for fitting the noisy class labels is ε_3. Note that, to eliminate the dependence of T♠ on the clean label, we need to achieve P(Ỹ = j | Y' = l, Y = i) = P(Ỹ = j | Y' = l) for all i, and an error is introduced whenever there is an error in fitting the noisy class labels. Combining the two error sources, we have |T̂♠_{lj} − T♠_{lj}| ≤ ε_2 + ε_3.

We also have

T_{ij} = P(Ỹ = j | Y = i, X = x) = P(Ỹ = j | Y = i) = Σ_l T♠_{lj} T♣_{il},    (9)

where the second equation holds because the transition matrices are independent of instances. Hence, the estimation error of the dual T-estimator is

|Σ_l (T̂♠_{lj} − T♠_{lj}) T♣_{il}| ≤ Σ_l |T̂♠_{lj} − T♠_{lj}| T♣_{il} ≤ (ε_2 + ε_3) Σ_l T♣_{il} = ε_2 + ε_3.    (10)

Therefore, under Assumption 1 in the main paper, the estimation error of the dual T-estimator is smaller than the estimation error of the T-estimator. ∎

Appendix B Empirical Validation of Assumption 1

We empirically verify the relations among the three different errors in Assumption 1. Recall that ε_1 is the estimation error for the noisy class posterior, ε_2 is the estimation error of estimating P(Ỹ | Y') by counting discrete labels, and ε_3 is the estimation error introduced by fitting the noisy class labels.

The experiments are conducted on the synthetic dataset, and the setting is the same as that of the synthetic experiments in Section 4. The three errors are calculated on the training set, since both the T-estimator and the dual T-estimator estimate the transition matrix on the training set.

Figure 4: The relations among ε_1, ε_2, and ε_3.

Figure 4 shows that the error ε_2 is very small and can be ignored. ε_3 is consistently smaller than ε_1 when the sample size is small. The recent work of Daniely and Granot (2019) shows that the sample complexity of a network is linear in the number of parameters, which means that, usually, we may not have enough training examples to learn the noisy class posterior well (e.g., on CIFAR10, CIFAR100, and Fashion-MNIST), so Assumption 1 can be easily satisfied. It is also worth mentioning that, even if Assumption 1 does not hold, the estimation error of the dual T-estimator may still be smaller than that of the T-estimator. Specifically, the error of the proposed estimator is only upper bounded by ε_2 + ε_3, and in general an increase of this upper bound does not imply an increase of the actual error.