Learning with noisy labels dates back to [Angluin and Laird, 1988] and has recently drawn a lot of attention, especially from the deep learning community, e.g., [Reed et al., 2015, Zhang and Sabuncu, 2018, Kremer et al., 2018, Goldberger and Ben-Reuven, 2017, Patrini et al., 2017, Thekumparampil et al., 2018, Yu et al., 2018b, Liu and Guo, 2020, Xu et al., 2019, Yu et al., 2019, Han et al., 2018b, Malach and Shalev-Shwartz, 2017, Ren et al., 2018, Jiang et al., 2018, Ma et al., 2018, Tanaka et al., 2018, Han et al., 2018a, Guo et al., 2018, Veit et al., 2017, Vahdat, 2017, Li et al., 2017, 2020b, 2020a, Hu et al., 2020, Lyu and Tsang, 2020, Nguyen et al., 2020]. The main reason is that it is expensive and sometimes even infeasible to accurately label large-scale datasets [Karimi et al., 2019], while it is relatively easy to obtain cheap but noisy datasets [Yu et al., 2018b, Vijayanarasimhan and Grauman, 2014, Welinder and Perona, 2010].
Methods for dealing with label noise can be divided into two categories: algorithms that do not model label noise and algorithms that do. In the first category, many heuristics reduce the side effects of label noise without modeling it, e.g., extracting confident examples with small losses [Han et al., 2018b, Yu et al., 2019, Wang et al., 2019]. Although these algorithms empirically work well, their reliability cannot be guaranteed without modeling the label noise explicitly. For example, the small-loss-based methods rely on accurate label noise rates.
This inspires researchers to model and learn label noise [Goldberger and Ben-Reuven, 2017, Scott, 2015, Scott et al., 2013]. The transition matrix $T(x)$ [Natarajan et al., 2013, Cheng et al., 2020] was proposed to explicitly model the generation process of label noise, where $T_{ij}(x) = P(\bar{Y} = j \mid Y = i, X = x)$, $P(\cdot)$ denotes the probability of an event, $X$ denotes the random variable for the instance, $\bar{Y}$ the noisy label, and $Y$ the latent clean label. Given the transition matrix, an optimal classifier defined by clean data can be learned by exploiting noisy data only [Patrini et al., 2017, Liu and Tao, 2016, Yu et al., 2018b]. The basic idea is that the clean class posterior can be inferred by using the noisy class posterior (learned from the noisy data) and the transition matrix [Berthon et al., 2020].
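To make this inversion concrete, here is a minimal numpy sketch (ours, not the paper's code) for the simplest setting of a known, instance-independent transition matrix $T$: the noisy posterior equals $T^\top$ applied to the clean posterior, so the clean posterior can be recovered by solving a linear system.

```python
import numpy as np

# Assumed, known transition matrix with T[i, j] = P(noisy = j | clean = i);
# each row is a distribution over noisy labels.
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

clean_posterior = np.array([0.7, 0.2, 0.1])

# What a model trained on noisy data would estimate:
# P(noisy | x) = T^T P(clean | x).
noisy_posterior = T.T @ clean_posterior

# Invert the relation to recover the clean class posterior.
recovered = np.linalg.solve(T.T, noisy_posterior)

print(np.allclose(recovered, clean_posterior))
```

In the instance-dependent setting studied in this paper, $T$ varies with $x$, which is exactly why a per-instance approximation of the transition matrix is needed.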
However, in general, it is ill-posed to learn the transition matrix by only exploiting noisy data [Cheng et al., 2020, Xia et al., 2019], i.e., the transition matrix is unidentifiable. Therefore, assumptions have been proposed to tackle this issue: additional information is given [Berthon et al., 2020]; the matrix is symmetric [Menon et al., 2018]; the noise rates for instances are upper bounded [Cheng et al., 2020]; or the noise is even assumed to be instance-independent [Xia et al., 2019, Han et al., 2018a, Patrini et al., 2017, Northcutt et al., 2017, Natarajan et al., 2013], i.e., $T(x) = T$ for all $x$. There are specific applications where these assumptions hold. In general, however, they are hard to verify, and the gap between instance-independent and instance-dependent transition matrices is large.
To solve the above problem, in this paper, we propose a new but practical assumption for instance-dependent label noise: the noise depends only on parts of instances. We term this kind of noise parts-dependent label noise. This assumption is motivated by the observation that annotators usually annotate instances based on their parts rather than the whole instances. Specifically, there is psychological and physiological evidence showing that humans perceive objects starting from their parts [Palmer, 1977, Wachsmuth et al., 1994, Logothetis and Sheinberg, 1996]. There are also computational theories and learning algorithms showing that object recognition relies on parts-based representations [Biederman, 1987, Ullman et al., 1996, Dietterich et al., 1997, Norouzi et al., 2013, Hosseini-Asl et al., 2015, Agarwal et al., 2004]. Since instances can be well reconstructed by combinations of parts [Lee and Seung, 1999, 2001], the parts-dependence assumption is mild in this sense. Intuitively, for a given instance, a combination of parts-dependent transition matrices can well approximate the instance-dependent transition matrix, which is empirically verified in Section 4.2.
To fulfil this approximation, we need to learn the transition matrices for parts and the combination parameters. Since the parts are semantic [Lee and Seung, 1999], their contributions to perceiving an instance should be similar to their contributions to understanding (or annotating) it [Biederman, 1987, Agarwal et al., 2004]. Therefore, it is natural to assume that, for constructing the instance-dependent transition matrix, the combination parameters of the parts-dependent transition matrices are identical to those of the parts for reconstructing the instance. We illustrate this in Figure 1, where the combinations in the top and bottom panels share the same parameters. The transition matrices for parts can be learned by exploiting anchor points, which are defined as instances that belong to a specific clean class with probability one [Liu and Tao, 2016]. Note that the assumption on the combination parameters and the requirement of anchor points might be strong. If they do not hold, the parts-dependent transition matrices might be poorly estimated. To address this issue, we also use the slack variable trick in [Xia et al., 2019] to revise the learned transition matrices.
Extensive experiments on both synthetic and real-world datasets show that parts-dependent transition matrices can well address instance-dependent label noise. Specifically, when the instance-dependent label noise is heavy, i.e., 50%, the proposed method outperforms state-of-the-art methods by almost 10% in classification accuracy on CIFAR-10. More details can be found in Section 4.
The rest of the paper is organized as follows. In Section 2, we briefly review related work on modeling label noise and parts-based learning. In Section 3, we discuss how to learn parts-dependent transition matrices. In Section 4, we provide empirical evaluations of our learning algorithm. In Section 5, we conclude the paper. Code will be available online.
2 Related Work
Label noise models Currently, there are three typical label noise models, i.e., the random classification noise (RCN) model [Biggio et al., 2011, Natarajan et al., 2013, Manwani and Sastry, 2013], the class-conditional label noise (CCN) model [Patrini et al., 2017, Xia et al., 2019, Zhang and Sabuncu, 2018], and the instance-dependent label noise (IDN) model [Berthon et al., 2020, Cheng et al., 2020, Du and Cai, 2015]. Specifically, RCN assumes that clean labels flip randomly with a constant rate [A.Aslam and E.Decatur, 1996, Angluin and Laird, 1988, Kearns, 1993]; CCN assumes that the flip rate depends on the latent clean class [Ma et al., 2018, Han et al., 2018b, Yu et al., 2019]; IDN considers the most general case of label noise, where the flip rate depends on the instance itself. However, without additional assumptions, IDN is non-identifiable and hard to learn with only noisy data [Xia et al., 2019]. The proposed parts-dependent label noise (PDN) model assumes that the label noise depends on parts of instances, which could be an important “intermediate” model between CCN and IDN.
Estimating the transition matrix
The transition matrix bridges the class posterior probabilities for noisy and clean data. It is essential for building classifier-consistent and risk-consistent estimators in label-noise learning [Patrini et al., 2017, Liu and Tao, 2016, Scott, 2015, Yu et al., 2018b]. To estimate the transition matrix, a cross-validation method is used for binary classification [Natarajan et al., 2013]. For multi-class classification, the transition matrix can be learned by exploiting anchor points [Patrini et al., 2017, Yu et al., 2018a]. To remove the strong dependence on anchor points, data points with high noisy class posterior probabilities (similar to anchor points) can also be used to estimate the transition matrix via a slack variable trick [Xia et al., 2019]. The slack variable is added to revise the transition matrix, and both can be learned and validated together by using only noisy data.
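The anchor-point estimator above can be sketched in a few lines of numpy. This is our illustrative simplification, not the cited implementations: the softmax posteriors here are synthetic stand-ins for a network trained on noisy data, and the point with the highest estimated noisy posterior for class $i$ serves as a proxy anchor point whose noisy posterior becomes the $i$-th row of the transition matrix.

```python
import numpy as np

# Hypothetical noisy-class-posterior estimates for a pool of 100 instances
# over 3 classes; in practice these come from a model fit on noisy labels.
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 3))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def estimate_transition_matrix(posteriors):
    """For each class i, use the instance with the largest estimated
    P(noisy = i | x) as a proxy anchor point; its noisy posterior is
    taken as the i-th row of the transition matrix."""
    n_classes = posteriors.shape[1]
    T = np.empty((n_classes, n_classes))
    for i in range(n_classes):
        anchor = np.argmax(posteriors[:, i])
        T[i] = posteriors[anchor]
    return T

T_hat = estimate_transition_matrix(posteriors)
print(np.allclose(T_hat.sum(axis=1), 1.0))  # each row is a distribution
```

The slack-variable refinement of [Xia et al., 2019] would then add a learnable correction on top of such an initial estimate.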
Parts-based learning Non-negative matrix factorization (NMF) [D.Lee and Seung, 1999] is the representative work of parts-based learning. It decomposes a non-negative data matrix into the product of two non-negative factor matrices. In contrast to principal component analysis (PCA) [Abdi and J. Williams, 2010] and vector quantization (VQ) [Gray, 1990], which learn holistic rather than parts-based representations, NMF allows additive but not subtractive combinations. Several variants have extended the applicable range of NMF methods [Liu et al., 2010, Guan et al., 2019, Liu et al., 2017, Yoo and Choi, 2010].
3 Parts-dependent Label Noise
Preliminaries Let $S = \{(x_i, \bar{y}_i)\}_{i=1}^{n}$ be the noisy training sample, which contains instance-dependent label noise. Our aim is to learn a robust classifier from the noisy training sample that assigns clean labels to test data. In the rest of the paper, we use $A_{i\cdot}$ to denote the $i$-th row of a matrix $A$, $A_{\cdot j}$ its $j$-th column, and $A_{ij}$ its $(i,j)$-th entry. We use $\|\cdot\|$ to denote the norm of a matrix or vector, e.g., $\|\cdot\|_2$ for the $\ell_2$ norm.
Learning parts-based representations NMF has been widely employed to learn parts-based representations [D.Lee and Seung, 1999]. Many variants of NMF have been proposed to enlarge its application fields [Guan et al., 2019, Liu et al., 2017, Yoo and Choi, 2010], e.g., allowing the data matrix and/or the matrix of parts to have mixed signs [Liu et al., 2010]. For our problem, we do not require the matrix of parts to be non-negative, as our input data matrix is not restricted to be non-negative. However, we require the combination parameters (also known as the new representation in the NMF community [D.Lee and Seung, 1999, Liu et al., 2017, Guan et al., 2019]) of each instance to be not only non-negative but also of unit norm. This is because we want to treat the parameters as weights that measure how much the parts contribute to reconstructing the corresponding instance.
Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ be the data matrix, where $d$ is the dimension of the data points. The parts-based representation learning for the parts-dependent label noise problem can be formulated as

$$\min_{p_1, \ldots, p_r,\, h_1, \ldots, h_n} \sum_{i=1}^{n} \Big\| x_i - \sum_{l=1}^{r} h_{il}\, p_l \Big\|_2^2, \quad \text{s.t. } h_{il} \geq 0,\ \sum_{l=1}^{r} h_{il} = 1, \qquad (1)$$

where the $r$ parts $p_1, \ldots, p_r$ are linearly combined with weights $h_i = (h_{i1}, \ldots, h_{ir})^\top$ to reconstruct the instance $x_i$. Note that, to exploit the power of deep learning, the data matrix could consist of deep representations extracted by a deep neural network trained on the noisy training data.
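For fixed parts, the simplex-constrained weights in Eq. (1) can be fit per instance by projected gradient descent. The sketch below is our illustration under that assumption (fixed, given parts); `project_simplex` is the standard Euclidean projection onto the probability simplex, and the toy instance is an exact convex combination of two parts so the true weights are recoverable.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {h : h >= 0, sum(h) = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def fit_weights(x, parts, steps=500, lr=0.1):
    # Projected gradient descent for min_h ||x - parts @ h||^2
    # subject to h lying on the probability simplex (Eq. (1)'s constraint).
    h = np.full(parts.shape[1], 1.0 / parts.shape[1])
    for _ in range(steps):
        grad = 2.0 * parts.T @ (parts @ h - x)
        h = project_simplex(h - lr * grad)
    return h

# Toy check: an instance that is exactly 0.3 * part_1 + 0.7 * part_2.
parts = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])  # columns are parts
x = 0.3 * parts[:, 0] + 0.7 * parts[:, 1]
h = fit_weights(x, parts)
print(np.round(h, 2))  # approximately [0.3, 0.7]
```

In the full problem, the parts themselves are learned jointly (alternating updates are a common choice), but the per-instance subproblem shown here is what produces the weights reused later for the transition matrices.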
Approximating instance-dependent transition matrices Since there are computational theories [Biederman, 1987, Ullman et al., 1996] and learning algorithms [Agarwal et al., 2004, Hosseini-Asl et al., 2015] showing that object recognition relies on parts-based representations, it is natural to model label noise at the parts level. Thus, we propose a parts-dependent noise (PDN) model, where label noise depends on parts rather than the whole instance. Specifically, for each part, e.g., the $l$-th part $p_l$, we assume there is a parts-dependent transition matrix $T_l$. Since we have $r$ parts, there are $r$ different parts-dependent transition matrices, i.e., $T_1, \ldots, T_r$. Similar to the idea that parts can be used to reconstruct instances, we exploit the idea that the instance-dependent transition matrix can be approximated by a combination of parts-dependent transition matrices, as illustrated in the bottom panel of Figure 1.
To approximate the instance-dependent transition matrices, we need to learn the parts-dependent transition matrices and the combination parameters. However, they are not identifiable, because it is ill-posed to factorize the instance-dependent transition matrix into the product of parts-dependent transition matrices and combination parameters. Fortunately, we can identify the parts-dependent transition matrices by assuming that the parameters for reconstructing the instance-dependent transition matrix are identical to those for reconstructing the instance. The rationale behind this assumption is that the learned parts are semantic [Lee and Seung, 1999], and their contributions to perceiving an instance should be similar to their contributions to understanding and annotating it [Biederman, 1987, Agarwal et al., 2004]. Let $h_i = (h_{i1}, \ldots, h_{ir})^\top$ be the combination parameters for reconstructing the instance $x_i$. The instance-dependent transition matrix can then be approximated by

$$T(x_i) \approx \hat{T}(x_i) = \sum_{l=1}^{r} h_{il}\, T_l. \qquad (2)$$
Note that $h_i$ can be learned via Eq. (1). The normalization constraint on the combination parameters, i.e., $h_{il} \geq 0$ and $\sum_{l=1}^{r} h_{il} = 1$, ensures that the combined matrix on the right-hand side of Eq. (2) is also a valid transition matrix, i.e., it is non-negative and each of its rows sums to one.
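The closure property just stated is easy to verify numerically. A minimal sketch of Eq. (2) with two assumed row-stochastic parts-dependent matrices and simplex weights for one instance:

```python
import numpy as np

# Two assumed parts-dependent transition matrices T_1, T_2 (row-stochastic).
T_parts = np.array([
    [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]],  # T_1
    [[0.6, 0.2, 0.2],   [0.2, 0.6, 0.2],   [0.2, 0.2, 0.6]],    # T_2
])
h = np.array([0.25, 0.75])  # combination weights for one instance (simplex)

# Eq. (2): T_instance = sum_l h[l] * T_parts[l].
T_instance = np.tensordot(h, T_parts, axes=1)

# A convex combination of row-stochastic matrices is row-stochastic,
# so the approximation is itself a valid transition matrix.
print(np.allclose(T_instance.sum(axis=1), 1.0))
```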
Learning the parts-dependent transition matrices Note that the parts-dependent transition matrices $T_1, \ldots, T_r$ in Eq. (2) are unknown. We will show that they can be learned by exploiting anchor points. The concept of anchor points was proposed in [Liu and Tao, 2016]. They are defined in the clean data domain, i.e., an instance $x$ is an anchor point of the $i$-th clean class if $P(Y = i \mid X = x)$ is equal to one.
Let $x^i$ be an anchor point of the $i$-th class. We have

$$P(\bar{Y} = j \mid X = x^i) = \sum_{k=1}^{C} T_{kj}(x^i)\, P(Y = k \mid X = x^i) = T_{ij}(x^i), \qquad (3)$$

where $C$ is the number of classes, the first equation holds because of the law of total probability, and the second equation holds because $P(Y = k \mid X = x^i) = 0$ for all $k \neq i$ and $P(Y = i \mid X = x^i) = 1$. As the noisy class posterior $P(\bar{Y} = j \mid X = x^i)$ can be unbiasedly learned [Bartlett et al., 2006] by exploiting the noisy training sample and the anchor point $x^i$, Eq. (3) shows that the $i$-th row of the instance-dependent transition matrix can be unbiasedly learned. This sheds light on the learnability of the parts-dependent transition matrices. Specifically, as shown in Figure 1, we reconstruct the instance-dependent transition matrix by a weighted combination of the parts-dependent transition matrices. If the instance-dependent transition matrix¹ and the combination parameters are given, learning the parts-dependent transition matrices is a convex problem.

¹Note that, according to Eq. (3), given an anchor point $x^i$, the $i$-th row of its instance-dependent transition matrix can be learned and is thus available.
Given an anchor point $x^i$ of the $i$-th class, we can learn the $i$-th rows of the parts-dependent transition matrices by matching the $i$-th row of the reconstructed transition matrix, i.e., $\hat{T}_{i\cdot}(x^i)$, with the $i$-th row of the instance-dependent transition matrix, i.e., $T_{i\cdot}(x^i)$. Since we have $r$ parts-dependent transition matrices, to identify all the entries of their $i$-th rows, we need at least $r$ anchor points of the $i$-th class to build enough equations. Let $x^i_1, \ldots, x^i_{n_i}$ be anchor points of the $i$-th class, where $n_i \geq r$. We robustly learn the $i$-th rows of the parts-dependent transition matrices by minimizing the reconstruction error instead of solving equations. Therefore, we propose the following optimization problem to learn the parts-dependent transition matrices:

$$\min_{T_1, \ldots, T_r} \sum_{i=1}^{C} \sum_{m=1}^{n_i} \Big\| T_{i\cdot}(x^i_m) - \sum_{l=1}^{r} h^i_{ml}\, (T_l)_{i\cdot} \Big\|_2^2, \qquad (4)$$
where $h^i_m = (h^i_{m1}, \ldots, h^i_{mr})^\top$ denotes the combination parameters of the anchor point $x^i_m$ learned via Eq. (1), and the sum over the index $i$ accumulates the reconstruction error over all rows of the transition matrices. Note that in Eq. (4) we require that anchor points for each class are given. If anchor points are not available, they can be learned from the noisy data, as is done in [Patrini et al., 2017, Liu and Tao, 2016, Xia et al., 2019].
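For a single class, the inner problem of Eq. (4) is an ordinary least-squares problem in the $i$-th rows of $T_1, \ldots, T_r$. The sketch below is a simplified, unconstrained illustration (our construction, with synthetic ground truth): the anchor points' simplex weights form the design matrix, the anchor rows obtained via Eq. (3) form the targets, and in the noiseless case least squares recovers the true rows exactly. The paper's full problem additionally keeps each learned row a valid distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
r, C, n_anchors = 2, 3, 5  # parts, classes, anchor points of class i

# Synthetic ground truth: the i-th rows of T_1, ..., T_r (each a distribution).
B_true = rng.dirichlet(np.ones(C), size=r)          # shape (r, C)

# Simplex combination weights of each anchor point (from Eq. (1)).
H = rng.dirichlet(np.ones(r), size=n_anchors)       # shape (n_anchors, r)

# Observed i-th rows at the anchor points (obtainable via Eq. (3)).
Rows = H @ B_true                                    # shape (n_anchors, C)

# Minimize sum_m || Rows_m - H_m B ||^2 over B via least squares.
B_hat, *_ = np.linalg.lstsq(H, Rows, rcond=None)

print(np.allclose(B_hat, B_true, atol=1e-6))  # exact recovery, noiseless case
```

This also makes the counting argument visible: with fewer than $r$ anchor points the design matrix `H` is rank-deficient and the rows are not identifiable.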
Implementation The overall procedure for learning the parts-dependent transition matrices is summarized in Algorithm 1. Given only a noisy training sample $S$, we first learn deep representations of the instances; note that we use a noisy validation set to select the deep model. Then, we minimize Eq. (1) to learn the combination parameters. The parts-dependent transition matrices are learned by minimizing Eq. (4). Finally, according to Eq. (2), we use the weighted combination to obtain an instance-dependent transition matrix for each instance. Note that, as we learn the anchor points from the noisy training data as in [Patrini et al., 2017, Liu and Tao, 2016, Xia et al., 2019], instances that are merely similar to anchor points will be selected if no true anchor points are available in the training data. In that case, the transition matrices will be poorly estimated. To address this issue, we employ the slack variable from [Xia et al., 2019] to revise the learned transition matrices.
4 Experiments
4.1 Experiment setup
Datasets We verify the efficacy of our approach on manually corrupted versions of three datasets, i.e., Fashion-MNIST [Xiao et al., 2017], SVHN [Netzer et al., 2011], and CIFAR-10 [Krizhevsky, 2009], and on one real-world noisy dataset, i.e., clothing1M [Xiao et al., 2015]. Fashion-MNIST contains 60,000 training images and 10,000 test images with 10 classes. SVHN and CIFAR-10 both have 10 classes of images, but the former contains 73,257 training images and 26,032 test images, while the latter contains 50,000 training images and 10,000 test images. These three datasets contain clean data, and we corrupted their training sets manually according to Algorithm 2. More details about this instance-dependent label noise generation approach can be found in Appendix B. IDN-$\tau$ means that the noise rate is controlled to be $\tau$. All experiments on datasets with synthetic instance-dependent label noise are repeated five times. Clothing1M has 1M images with real-world noisy labels and 10k images with clean labels for testing.
For all the datasets, we leave out 10% of the noisy training examples as a noisy validation set, which is used for model selection. We also conduct synthetic experiments on MNIST [LeCun et al.]. Due to the space limit, we put the corresponding experimental results in Appendix C.
Baselines and measurements We compare the proposed method with the following state-of-the-art approaches: (i) CE, which trains a standard deep network with the cross-entropy loss on noisy datasets. (ii) Decoupling [Malach and Shalev-Shwartz, 2017], which trains two networks on samples for which the predictions of the two networks differ. (iii) MentorNet [Jiang et al., 2018], Co-teaching [Han et al., 2018b], and Co-teaching+ [Yu et al., 2019], which mainly handle noisy labels by training on instances with small loss values. (iv) Joint [Tanaka et al., 2018], which jointly optimizes the sample labels and the network parameters. (v) DMI [Xu et al., 2019], which proposes a novel information-theoretic loss function for training deep neural networks robust to label noise. (vi) Forward [Patrini et al., 2017], Reweight [Liu and Tao, 2016], and T-Revision [Xia et al., 2019], which utilize a class-dependent transition matrix to correct the loss function. We use classification accuracy on the clean test set to evaluate the performance of each model; higher classification accuracy means the algorithm is more robust to label noise.
Network structure and optimization
For a fair comparison, all experiments are conducted on NVIDIA Tesla V100 GPUs, and all methods are implemented in PyTorch. We use a ResNet-18 network for Fashion-MNIST and a ResNet-34 network for SVHN and CIFAR-10. The transition matrix for each instance is learned according to Algorithm 1. Exploiting the transition matrices, we can bridge the class posterior probabilities for noisy and clean data. We first use SGD with momentum 0.9, weight decay, and batch size 128 to initialize the network. The learning rate is divided by 10 at the 40th and 80th epochs, and we train for 100 epochs in total. Then, the optimizer is changed to Adam with a smaller learning rate to learn the classifier and the slack variable. Note that the slack variable is initialized with all-zero entries in the experiments. During training, the revised matrix can be ensured to be a valid transition matrix by first projecting its negative entries to zero and then performing row normalization. For clothing1M, we use a ResNet-50 pre-trained on ImageNet. Different from existing methods, we do not use the 50k clean training data or the 14k clean validation data, but only exploit the 1M noisy data to learn the transition matrices and classifiers. Note that, for real-world scenarios, it is more practical that no extra clean data is provided to help adjust the model. After the transition matrix is obtained according to Algorithm 1, we use SGD with momentum 0.9, weight decay, and batch size 32, and run for 10 epochs. For learning the classifier and the slack variable, Adam is used with a smaller learning rate.
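The projection used during training, clipping negative entries to zero and then renormalizing each row, can be sketched as follows (a minimal numpy illustration of the described step, not the authors' code):

```python
import numpy as np

def project_to_transition_matrix(M):
    """Map an arbitrary matrix (e.g., a transition matrix plus a slack
    variable) back to a valid transition matrix: non-negative entries,
    each row summing to one."""
    M = np.maximum(M, 0.0)                     # zero out negative entries
    return M / M.sum(axis=1, keepdims=True)    # row normalization

# A revised matrix that has drifted out of the valid set.
M = np.array([[0.9, -0.1, 0.2],
              [0.3,  0.5, 0.2]])
P = project_to_transition_matrix(M)
print(np.allclose(P.sum(axis=1), 1.0))
```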
Explanation We abbreviate our proposed method of learning with parts-dependent transition matrices as PTD. Methods with “-F” and “-R” mean that the instance-dependent transition matrices are exploited by the Forward [Patrini et al., 2017] method and the Reweight [Liu and Tao, 2016] method, respectively; methods with “-V” mean that the transition matrices are revised. Details of these methods can be found in Appendix A.
Figure 2: Illustration of the transition matrix approximation error and the hyperparameter sensitivity. Figure (a) illustrates how the approximation error for the instance-dependent transition matrix varies as the number of parts increases. Figure (b) illustrates how the number of parts affects the test classification performance. Standard deviations are shown as shaded error bars.
4.2 Ablation study
We have described how to learn parts-dependent transition matrices for approximating the instance-dependent transition matrix in Section 3. To further show that our proposed method is not sensitive to the number of parts, we perform an ablation study in this subsection. The experiments are conducted on CIFAR-10 with a 50% noise rate.
In Figure 2, we show how well the instance-dependent transition matrix can be approximated by the class-dependent transition matrix and by the parts-dependent transition matrices. We use the $\ell_2$ norm to measure the difference. For each instance, we analyze the approximation error of a specific row rather than the whole transition matrix, because we only used one row of the instance-dependent transition matrix to generate the noisy label. Specifically, given an instance with clean class label $i$ (note that we have access to the clean labels of the test data to conduct the evaluation), we only exploit the $i$-th row of the instance-dependent transition matrix to flip the label from class $i$ to another class. Note that “Class-dependent” represents the standard class-dependent transition matrix learning methods [Liu and Tao, 2016, Patrini et al., 2017] and “T-Revision” represents the revision method for learning a class-dependent transition matrix [Xia et al., 2019]. The Class-dependent and T-Revision methods are independent of parts; their curves are therefore flat. We can see that the parts-dependent (PTD) transition matrix achieves a much smaller approximation error than the class-dependent (parts-independent) transition matrix, and the results are insensitive to the number of parts. Figure 2 also shows that the classification performance of our proposed method is robust to changes in the number of parts. More detailed experimental results can be found in Appendix D.
4.3 Comparison with the State-of-the-Arts
For Fashion-MNIST and SVHN, in the easy cases, e.g., IDN-10% and IDN-20%, almost all methods work well. In the IDN-30% case, the advantages of PTD begin to show: we clearly surpass all methods except T-Revision, e.g., the classification accuracy of PTD-R-V is 1.14% higher than Co-teaching+ on Fashion-MNIST and 1.33% higher than DMI on SVHN. As the noise rate rises, T-Revision is gradually defeated: the classification accuracy of PTD-R-V becomes 1.68% and 2.01% higher than T-Revision on SVHN and CIFAR-10, respectively. Finally, in the hardest case, i.e., IDN-50%, the superiority of PTD widens the performance gap: the classification accuracy of PTD-R-V is 6.97% and 9.07% higher than the best baseline method.
For CIFAR-10, the algorithms assisted by PTD overtake the other methods with clear gaps. From the IDN-10% to the IDN-50% case, the advantages of our proposed method grow with the noise rate. In the 10% and 20% cases, the performance of PTD-R-V is outstanding, i.e., its classification accuracy is 2.02% and 2.16% higher than that of the best baseline, Joint. In the 30% and 40% cases, the gap expands to 3.25% and 3.87%. Lastly, in the 50% case, PTD-R-V outperforms state-of-the-art methods by almost 10% of classification accuracy.
To sum up, the synthetic experiments reveal that our method is powerful in handling instance-dependent label noise, particularly in situations with high noise rates.
Results on real-world datasets The proposed method outperforms the baselines, as shown in Table 4, where the highest accuracy is boldfaced. This comparison suggests that the label noise in the clothing1M dataset is more likely to be instance-dependent, and that our proposed method models instance-dependent noise better than the other methods.
5 Conclusion
In this paper, we focus on learning with instance-dependent label noise, which is a more general case of label noise but lacks understanding and learning methods. Inspired by parts-based learning, we exploit parts-dependent transition matrices to approximate the instance-dependent transition matrix, which is intuitive and learnable. Experimental results show that our proposed method consistently outperforms existing methods, especially in the case of high noise rates. In the future, we can extend this work in the following aspects. First, we can incorporate prior knowledge of the transition matrices and parts (e.g., sparsity), which could improve parts-based learning. Second, we can introduce slack variables to revise the combination parameters.
Acknowledgments
TLL was supported by Australian Research Council Project DE-190101473. NNW was supported by the National Natural Science Foundation of China under Grants 61922066 and 61876142. GN and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan. The authors would like to give special thanks to Pengqian Lu for helpful discussions and comments.
References
- A.Aslam and E.Decatur  Javed A.Aslam and Scott E.Decatur. On the sample complexity of noise-tolerant learning. Information Processing Letters, 1996.
- Abdi and J. Williams  Hervé Abdi and Lynne J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459, 2010.
- Agarwal et al.  Shivani Agarwal, Aatif Awan, and Dan Roth. Learning to detect objects in images via a sparse, part-based representation. IEEE transactions on pattern analysis and machine intelligence, 26(11):1475–1490, 2004.
- Angluin and Laird  Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
- Bartlett et al.  Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- Berthon et al.  Antonin Berthon, Bo Han, Gang Niu, Tongliang Liu, and Masashi Sugiyama. Confidence scores make instance-dependent label-noise learning possible. arXiv preprint arXiv:2001.03772, 2020.
- Biederman  Irving Biederman. Recognition-by-components: a theory of human image understanding. Psychological review, 94(2):115, 1987.
- Biggio et al.  Battista Biggio, Blaine Nelson, and Pavel Laskov. Support vector machines under adversarial label noise. In ACML, 2011.
- Cheng et al.  Jiacheng Cheng, Tongliang Liu, Kotagiri Ramamohanarao, and Dacheng Tao. Learning with bounded instance-and label-dependent label noise. In ICML, 2020.
- Dietterich et al.  Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence, 89(1-2):31–71, 1997.
- D.Lee and Seung  Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, pages 788–791, 1999.
- Du and Cai  Jun Du and Zhihua Cai. Modelling class noise with symmetric and asymmetric distributions. In AAAI, 2015.
- Goldberger and Ben-Reuven  Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.
- Gray  Robert M. Gray. Vector quantization. In Readings in Speech Recognition, 1990.
- Gretton et al.  Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, pages 131–160, 2009.
- Guan et al.  Naiyan Guan, Tongliang Liu, Zhang Yangmuzi, Dacheng Tao, and Larry Steven Davis. Truncated cauchy non-negative matrix factorization. IEEE Transactions on pattern analysis and machine intelligence, 41(1):246–259, 2019.
- Guo et al.  Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, and Dinglong Huang. Curriculumnet: Weakly supervised learning from large-scale web images. In ECCV, pages 135–150, 2018.
- Han et al. [2018a] Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor Tsang, Ya Zhang, and Masashi Sugiyama. Masking: A new perspective of noisy supervision. In NeurIPS, pages 5836–5846, 2018a.
- Han et al. [2018b] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pages 8527–8537, 2018b.
- Hosseini-Asl et al.  Ehsan Hosseini-Asl, Jacek M Zurada, and Olfa Nasraoui. Deep learning of part-based representation of data using sparse autoencoders with nonnegativity constraints. IEEE Transactions on Neural Networks and Learning Systems, 27(12):2486–2498, 2015.
- Hu et al.  Wei Hu, Zhiyuan Li, and Dingli Yu. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In ICLR, 2020. URL https://openreview.net/forum?id=Hke3gyHYwH.
- Jiang et al.  Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pages 2309–2318, 2018.
- Karimi et al.  Davood Karimi, Haoran Dou, Simon K Warfield, and Ali Gholipour. Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. arXiv preprint arXiv:1912.02911, 2019.
- Kearns  Michael Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing (STOC '93), 1993.
- Kremer et al.  Jan Kremer, Fei Sha, and Christian Igel. Robust active label correction. In AISTATS, pages 308–316, 2018.
- Krizhevsky  Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
- LeCun et al.  Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
- Lee and Seung  Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
- Lee and Seung  Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In NeurIPS, pages 556–562, 2001.
- Li et al. [2020a]  Junnan Li, Richard Socher, and Steven C.H. Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. In ICLR, 2020a. URL https://openreview.net/forum?id=HJgExaVtwr.
- Li et al. [2020b] Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In AISTATS, 2020b.
- Li et al.  Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In ICCV, pages 1910–1918, 2017.
- Liu et al.  Ding Liu, Chris H.Q., Tao Li, and Michael I. Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on pattern analysis and machine intelligence, 32(1):45–55, 2010.
- Liu and Tao  Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2016.
- Liu et al.  Tongliang Liu, Mingming Gong, and Dacheng Tao. Large cone non-negative matrix factorization. IEEE Transactions on Neural Networks and Learning Systems, 28(9):2129–2141, 2017.
- Liu and Guo  Yang Liu and Hongyi Guo. Peer loss functions: Learning from noisy labels without knowing noise rates. In ICML, 2020.
- Logothetis and Sheinberg  Nikos K Logothetis and David L Sheinberg. Visual object recognition. Annual review of neuroscience, 19(1):577–621, 1996.
- Lyu and Tsang  Yueming Lyu and Ivor W. Tsang. Curriculum loss: Robust learning and generalization against label corruption. In ICLR, 2020. URL https://openreview.net/forum?id=rkgt0REKwS.
- Ma et al.  Xingjun Ma, Yisen Wang, Michael E Houle, Shuo Zhou, Sarah M Erfani, Shu-Tao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In ICML, pages 3361–3370, 2018.
- Malach and Shalev-Shwartz  Eran Malach and Shai Shalev-Shwartz. Decoupling "when to update" from "how to update". In NeurIPS, pages 960–970, 2017.
- Manwani and Sastry  Naresh Manwani and P.S. Sastry. Noise tolerance under risk minimization. IEEE Transactions on Cybernetics, 2013.
- Menon et al.  Aditya Krishna Menon, Brendan Van Rooyen, and Nagarajan Natarajan. Learning from binary labels with instance-dependent noise. Machine Learning, 107(8-10):1561–1595, 2018.
- Natarajan et al.  Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, pages 1196–1204, 2013.
- Netzer et al.  Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- Nguyen et al.  Duc Tam Nguyen, Chaithanya Kumar Mummadi, Thi Phuong Nhung Ngo, Thi Hoai Phuong Nguyen, Laura Beggel, and Thomas Brox. Self: Learning to filter noisy labels with self-ensembling. In ICLR, 2020. URL https://openreview.net/forum?id=HkgsPhNYPS.
- Norouzi et al.  Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. In NeurIPS, 2013.
- Northcutt et al.  Curtis G Northcutt, Tailin Wu, and Isaac L Chuang. Learning with confident examples: Rank pruning for robust classification with noisy labels. In UAI, 2017.
- Palmer  Stephen E Palmer. Hierarchical structure in perceptual representation. Cognitive psychology, 9(4):441–474, 1977.
- Patrini et al.  Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, pages 1944–1952, 2017.
- Reed et al.  Scott E Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. In ICLR, 2015.
- Ren et al.  Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, pages 4331–4340, 2018.
- Scott  Clayton Scott. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS, pages 838–846, 2015.
- Scott et al.  Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In COLT, pages 489–511, 2013.
- Tanaka et al.  Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In CVPR, 2018.
- Thekumparampil et al.  Kiran K Thekumparampil, Ashish Khetan, Zinan Lin, and Sewoong Oh. Robustness of conditional gans to noisy labels. In NeurIPS, pages 10271–10282, 2018.
- Ullman et al.  Shimon Ullman et al. High-level vision: Object recognition and visual cognition, volume 2. MIT press Cambridge, MA, 1996.
- Vahdat  Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In NeurIPS, pages 5596–5605, 2017.
- Veit et al.  Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy large-scale datasets with minimal supervision. In CVPR, pages 839–847, 2017.
- Vijayanarasimhan and Grauman  Sudheendra Vijayanarasimhan and Kristen Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. International Journal of Computer Vision, 108(1-2):97–114, 2014.
- Wachsmuth et al.  E Wachsmuth, MW Oram, and DI Perrett. Recognition of objects and their component parts: responses of single units in the temporal cortex of the macaque. Cerebral Cortex, 4(5):509–522, 1994.
- Wang et al.  Xiaobo Wang, Shuo Wang, Jun Wang, Hailin Shi, and Tao Mei. Co-mining: Deep face recognition with noisy labels. In ICCV, pages 9358–9367, 2019.
- Welinder and Perona  Peter Welinder and Pietro Perona. Online crowdsourcing: rating annotators and obtaining cost-effective labels. In CVPR-Workshop, pages 25–32, 2010.
- Xia et al.  Xiaobo Xia, Tongliang Liu, Nannan Wang, Bo Han, Chen Gong, Gang Niu, and Masashi Sugiyama. Are anchor points really indispensable in label-noise learning? In NeurIPS, pages 6835–6846, 2019.
- Xiao et al.  Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- Xiao et al.  Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In CVPR, pages 2691–2699, 2015.
- Xu et al.  Yilun Xu, Peng Cao, Yuqing Kong, and Yizhou Wang. L_dmi: A novel information-theoretic loss function for training deep nets robust to label noise. In NeurIPS, pages 6222–6233, 2019.
- Yoo and Choi  Jiho Yoo and Seungjin Choi. Nonnegative matrix factorization with orthogonality constraints. Management Science, 58(11):2037–2056, 2010.
- Yu et al.  Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor W Tsang, and Masashi Sugiyama. How does disagreement benefit co-teaching? In ICML, 2019.
- Yu et al. [2018a] Xiyu Yu, Tongliang Liu, Mingming Gong, Kayhan Batmanghelich, and Dacheng Tao. An efficient and provable approach for mixture proportion estimation using linear independence assumption. In CVPR, pages 4480–4489, 2018a.
- Yu et al. [2018b] Xiyu Yu, Tongliang Liu, Mingming Gong, and Dacheng Tao. Learning with biased complementary labels. In ECCV, pages 68–83, 2018b.
- Zhang and Sabuncu  Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, pages 8778–8788, 2018.
Appendix A How to learn robust classifiers by exploiting parts-dependent transition matrices
For readers who are not familiar with how to use the transition matrix to learn robust classifiers, in this supplementary material, we describe how to learn robust classifiers by exploiting parts-dependent transition matrices.
We begin by introducing notation. Let $D$ be the distribution of the variables $(X, Y)$ and $\bar{D}$ the distribution of the variables $(X, \bar{Y})$. Let $\{(x_i, y_i)\}_{i=1}^{n}$ be i.i.d. samples drawn from the distribution $D$, $\{(x_i, \bar{y}_i)\}_{i=1}^{n}$ i.i.d. samples drawn from the distribution $\bar{D}$, and $C$ the number of label classes.
The aim of multi-class classification is to learn a classifier that can assign labels to given instances. The classifier is of the following form: $f(x) = \arg\max_{i \in \{1, \ldots, C\}} g_i(x)$, where $g_i(x)$ is an estimate of $\Pr(Y = i \mid X = x)$. The expected risk of employing $f$ is defined as
$$R(f) = \mathbb{E}_{(X, Y) \sim D}\big[\ell(f(X), Y)\big],$$
where $\ell$ is a loss function.
The optimal classifier to learn is the one that minimizes the risk $R(f)$. Since the distribution $D$ is usually unknown, the optimal classifier is approximated by the minimizer of the empirical risk:
$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).$$
Given only the noisy training samples $\{(x_i, \bar{y}_i)\}_{i=1}^{n}$, the noisy version of the empirical risk is defined as:
$$\bar{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), \bar{y}_i).$$
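As a toy illustration (not part of the paper), the clean and noisy empirical risks under the cross-entropy loss can be sketched in NumPy; here `probs` plays the role of the posterior estimate, and all names are illustrative:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Per-example cross-entropy loss; probs is (n, C), labels is (n,)."""
    n = probs.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12)

def empirical_risk(probs, labels):
    """Average loss over the sample, i.e. the empirical risk R_n."""
    return cross_entropy(probs, labels).mean()

# Toy example: 3 instances, 2 classes.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
clean = np.array([0, 1, 0])   # clean labels -> clean empirical risk
noisy = np.array([0, 1, 1])   # last label flipped -> noisy empirical risk

r_clean = empirical_risk(probs, clean)
r_noisy = empirical_risk(probs, noisy)
```

Flipping a label that the model predicts well raises the empirical risk, which is why minimizing the noisy risk directly can mislead the classifier.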
In the main paper (Section 3), we show how to approximate the instance-dependent transition matrix by exploiting parts-dependent transition matrices. For an instance $x$, according to the definition of the instance-dependent transition matrix $T(x)$, we have that $\Pr(\bar{Y} = j \mid X = x) = \sum_{i=1}^{C} T_{ij}(x) \Pr(Y = i \mid X = x)$. We let $\hat{T}(x)$ denote the approximation of $T(x)$ obtained from the parts-dependent transition matrices.
The empirical risk of our PTD-F algorithm is defined as:
$$\bar{R}^{F}_{n}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(\hat{T}^{\top}(x_i)\, g(x_i), \bar{y}_i\big).$$
Here, $g(x)$ is an estimate for $\Pr(Y \mid X = x)$ and $\hat{T}^{\top}(x)\, g(x)$ is an estimate for $\Pr(\bar{Y} \mid X = x)$.
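The forward-correction idea behind this risk can be sketched in NumPy as follows. This is a generic illustration under assumed shapes, not the paper's implementation: the clean-posterior estimate is mapped through each instance's transition matrix to get a noisy-posterior estimate, which is matched against the noisy labels with cross-entropy.

```python
import numpy as np

def forward_corrected_risk(probs, noisy_labels, T):
    """Forward loss correction: probs is the clean-posterior estimate
    (n, C), T holds per-instance transition matrices (n, C, C) whose
    rows are indexed by the clean class. The corrected risk applies
    cross-entropy to T(x)^T g(x) against the noisy labels."""
    # Noisy posterior estimate: p_bar_j(x) = sum_i T_ij(x) * p_i(x).
    noisy_probs = np.einsum('nij,ni->nj', T, probs)
    n = probs.shape[0]
    return -np.log(noisy_probs[np.arange(n), noisy_labels] + 1e-12).mean()

# Sanity check: with identity transition matrices, the corrected risk
# reduces to the ordinary cross-entropy risk on the noisy labels.
probs = np.array([[0.7, 0.3], [0.4, 0.6]])
noisy = np.array([0, 1])
T_id = np.stack([np.eye(2)] * 2)
risk = forward_corrected_risk(probs, noisy, T_id)
```

In practice `probs` would come from a softmax network and the minimization would run by gradient descent; the snippet only shows how the transition matrix enters the loss.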
When the slack variable $\Delta \hat{T}$ is introduced to modify the instance-dependent transition matrices, reviewing Eq. (8), we replace $\hat{T}(x)$ with $\hat{T}(x) + \Delta \hat{T}$. The empirical risks of PTD-F-V and PTD-R-V are then defined by applying this replacement to the risks of PTD-F and PTD-R, respectively.
To learn noise-robust classifiers under noisy supervision, we minimize the empirical risks of PTD-F, PTD-R, PTD-F-V, and PTD-R-V, respectively.
Appendix B Instance-dependent Label Noise Generation
Note that it is more realistic that different instances have different flip rates. Without constraining different instances to have the same flip rate, it is more challenging to model the label noise and train robust classifiers. In Step 1, in order to control the global flip rate as $\tau$ without constraining all of the instances to have the same flip rate, we sample their flip rates from a truncated normal distribution. Specifically, this distribution limits the flip rates of instances to the range $[0, 1]$. Their mean and standard deviation are equal to the mean $\tau$ and the standard deviation 0.1 of the selected truncated normal distribution, respectively.
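Step 1 can be sketched as follows; the rejection-sampling routine and all names in it are illustrative, not the authors' code. With the mean well inside the unit interval, almost no draws are rejected:

```python
import numpy as np

def sample_flip_rates(tau, n, std=0.1, low=0.0, high=1.0, seed=0):
    """Sample n per-instance flip rates from a normal distribution with
    mean tau and standard deviation std, truncated to [low, high],
    via simple rejection sampling."""
    rng = np.random.default_rng(seed)
    rates = np.empty(0)
    while rates.size < n:
        draw = rng.normal(tau, std, size=2 * n)
        # Keep only draws inside the allowed range of flip rates.
        rates = np.concatenate([rates, draw[(draw >= low) & (draw <= high)]])
    return rates[:n]

# Global flip rate tau = 0.4; each instance gets its own rate.
rates = sample_flip_rates(tau=0.4, n=1000)
```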
In Step 2, we sample parameters from the standard normal distribution for generating instance-dependent label noise. The dimensionality of each parameter is $S \times C$, where $S$ denotes the dimensionality of the instance and $C$ the number of classes. Learning these parameters is critical to modeling instance-dependent label noise. However, it is hard to identify these parameters without any assumption.
Note that an instance $x$ with clean label $y$ will be flipped only according to the $y$-th row of its transition matrix. Thus, in Steps 4 to 7, we only use the $y$-th row of the instance-dependent transition matrix for the instance $x$. Specifically, Steps 5 and 7 ensure that the diagonal entry of the $y$-th row is $1 - \rho$, where $\rho$ is the flip rate sampled for $x$; Step 6 ensures that the sum of the off-diagonal entries of that row is $\rho$.
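Steps 4 to 7 can be sketched as follows. The softmax normalization of the off-diagonal scores is an assumption for illustration, as is every name in the snippet; what the sketch preserves is the construction of only the $y$-th row, with diagonal entry $1 - \rho$ and off-diagonal entries summing to $\rho$:

```python
import numpy as np

def transition_row(x, y, W, rho):
    """Build only the y-th row of an instance-dependent transition
    matrix. x: instance (S,); y: clean label; W: per-class parameters
    (C, S, C) sampled from a standard normal; rho: this instance's
    flip rate. Off-diagonal entries are softmax-normalized to sum to
    rho; the diagonal entry is set to 1 - rho."""
    C = W.shape[2]
    scores = x @ W[y]                     # raw scores for the y-th row, shape (C,)
    mask = np.ones(C, dtype=bool)
    mask[y] = False                       # off-diagonal positions
    off = np.exp(scores[mask] - scores[mask].max())
    row = np.zeros(C)
    row[mask] = rho * off / off.sum()     # Step 6: off-diagonals sum to rho
    row[y] = 1.0 - rho                    # Steps 5 and 7: diagonal is 1 - rho
    return row

rng = np.random.default_rng(0)
S, C = 8, 4
W = rng.standard_normal((C, S, C))        # Step 2: parameters per class
x = rng.standard_normal(S)
row = transition_row(x, y=2, W=W, rho=0.3)
```

The noisy label is then drawn from the categorical distribution given by `row`, so the instance keeps its clean label with probability $1 - \rho$.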
Appendix C Complementary experiments on a synthetic noisy dataset
In the main paper (Section 4), we present the experimental results on three synthetic noisy datasets, i.e., Fashion-MNIST, SVHN, and CIFAR-10. In this supplementary material, we provide the experimental results on another synthetic noisy dataset, MNIST. MNIST contains 60,000 training images and 10,000 test images with 10 classes. We use a LeNet-5 network for this dataset. The detailed experimental results are shown in Table 5. The classification performance shows that our proposed method is more robust than the baseline methods when coping with instance-dependent label noise.
Appendix D The experimental results of ablation study
In Section 4.2, we showed that our proposed method is insensitive to the number of parts. Due to the space limit, we only provided illustrations in figures there. In this supplementary material, more detailed results of the ablation study, including the means and standard deviations of the approximation error and classification accuracy, are shown in Table 6 and Table 7.