1 Introduction
Learning with noisy labels dates back to [Angluin and Laird, 1988] and has recently drawn much attention, especially from the deep learning community, e.g., [Reed et al., 2015, Zhang and Sabuncu, 2018, Kremer et al., 2018, Goldberger and Ben-Reuven, 2017, Patrini et al., 2017, Thekumparampil et al., 2018, Yu et al., 2018b, Liu and Guo, 2020, Xu et al., 2019, Yu et al., 2019, Han et al., 2018b, Malach and Shalev-Shwartz, 2017, Ren et al., 2018, Jiang et al., 2018, Ma et al., 2018, Tanaka et al., 2018, Han et al., 2018a, Guo et al., 2018, Veit et al., 2017, Vahdat, 2017, Li et al., 2017, 2020b, 2020a, Hu et al., 2020, Lyu and Tsang, 2020, Nguyen et al., 2020]. The main reason is that it is expensive and sometimes even infeasible to accurately label large-scale datasets [Karimi et al., 2019], while it is relatively easy to obtain cheap but noisy datasets [Yu et al., 2018b, Vijayanarasimhan and Grauman, 2014, Welinder and Perona, 2010].
Methods for dealing with label noise can be divided into two categories: algorithms that do not model the label noise and algorithms that do. In the first category, many heuristics reduce the side-effects of label noise without modeling it, e.g., extracting
confident examples with small losses [Han et al., 2018b, Yu et al., 2019, Wang et al., 2019]. Although these algorithms empirically work well, without modeling the label noise explicitly, their reliability cannot be guaranteed. For example, the small-loss-based methods rely on accurate label noise rates.
This inspires researchers to model and learn label noise [Goldberger and Ben-Reuven, 2017, Scott, 2015, Scott et al., 2013]. The transition matrix $T(x) \in [0,1]^{C \times C}$ [Natarajan et al., 2013, Cheng et al., 2020] was proposed to explicitly model the generation process of label noise, where $T_{ij}(x) = \Pr(\bar{Y}=j \mid Y=i, X=x)$ denotes the probability that the instance $x$ with clean label $i$ obtains the noisy label $j$, $X$ denotes the random variable for the instance, $\bar{Y}$ the noisy label, and $Y$ the latent clean label. Given the transition matrix, an optimal classifier defined by clean data can be learned by exploiting noisy data only [Patrini et al., 2017, Liu and Tao, 2016, Yu et al., 2018b]. The basic idea is that the clean class posterior can be inferred from the noisy class posterior (learned from the noisy data) and the transition matrix [Berthon et al., 2020].
However, in general, it is ill-posed to learn the transition matrix by only exploiting noisy data [Cheng et al., 2020, Xia et al., 2019], i.e., the transition matrix is unidentifiable. Therefore, assumptions have been proposed to tackle this issue. For example, additional information is given [Berthon et al., 2020]; the matrix is symmetric [Menon et al., 2018]; the noise rates for instances are upper bounded [Cheng et al., 2020]; or the noise is even assumed to be instance-independent [Xia et al., 2019, Han et al., 2018a, Patrini et al., 2017, Northcutt et al., 2017, Natarajan et al., 2013], i.e., $T(x) = T$ for all $x$. There are specific applications where these assumptions hold. In general, however, they are hard to verify, and the gaps between instance-independent and instance-dependent transition matrices are large.
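As a minimal illustration of this idea (not the method proposed in this paper), the following NumPy sketch, with a hypothetical 3-class transition matrix and a hypothetical noisy-posterior estimate, shows how the clean class posterior could be recovered when the transition matrix is known and invertible.

```python
import numpy as np

# Noisy and clean class posteriors are linked by the transition matrix:
#   Pr(noisy Y = j | x) = sum_i T_ij(x) * Pr(clean Y = i | x).
# If T(x) is known and invertible, the clean posterior can be recovered.

T = np.array([[0.8, 0.1, 0.1],    # hypothetical class-conditional example;
              [0.2, 0.7, 0.1],    # row i holds the flip probabilities from clean class i
              [0.1, 0.2, 0.7]])

noisy_posterior = np.array([0.5, 0.3, 0.2])   # hypothetical estimate learned from noisy data

clean_posterior = np.linalg.solve(T.T, noisy_posterior)   # invert the linear relation
clean_posterior = np.clip(clean_posterior, 0, None)
clean_posterior /= clean_posterior.sum()                   # renormalize against numerical error
print(clean_posterior)
```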
To solve the above problem, in this paper, we propose a new but practical assumption for instance-dependent label noise: the noise depends only on parts of instances. We term this kind of noise parts-dependent label noise. This assumption is motivated by the fact that annotators usually annotate instances based on their parts rather than on the whole instances. Specifically, there is psychological and physiological evidence showing that we humans perceive objects starting from their parts [Palmer, 1977, Wachsmuth et al., 1994, Logothetis and Sheinberg, 1996]. There are also computational theories and learning algorithms showing that object recognition relies on parts-based representations [Biederman, 1987, Ullman et al., 1996, Dietterich et al., 1997, Norouzi et al., 2013, Hosseini-Asl et al., 2015, Agarwal et al., 2004]. Since instances can be well reconstructed by combinations of parts [Lee and Seung, 1999, 2001], the parts-dependence assumption is mild in this sense. Intuitively, for a given instance, a combination of parts-dependent transition matrices can well approximate the instance-dependent transition matrix, which is empirically verified in Section 4.2.
To fulfil the approximation, we need to learn the transition matrices for parts and the combination parameters. Since the parts are semantic [Lee and Seung, 1999], their contributions to perceiving an instance could be similar to their contributions to understanding (or annotating) it [Biederman, 1987, Agarwal et al., 2004]. Therefore, it is natural to assume that, for constructing the instance-dependent transition matrix, the combination parameters of the parts-dependent transition matrices are identical to those of the parts for reconstructing the instance. We illustrate this in Figure 1, where the combinations in the top and bottom panels share the same parameters. The transition matrices for parts can be learned by exploiting anchor points, which are defined as instances that belong to a specific clean class with probability one [Liu and Tao, 2016]. Note that the assumption on the combination parameters and the requirement of anchor points might be strong. If they do not hold, the parts-dependent transition matrices might be poorly estimated. To address this issue, we also use the slack variable trick in [Xia et al., 2019] to revise the approximated instance-dependent transition matrix.
Extensive experiments on both synthetic and real-world datasets show that the parts-dependent transition matrices can well address instance-dependent label noise. Specifically, when the instance-dependent label noise is heavy, i.e., 50%, the proposed method outperforms state-of-the-art methods by almost 10% in classification accuracy on CIFAR-10. More details can be found in Section 4.
The rest of the paper is organized as follows. In Section 2, we briefly review related work on modeling label noise and partsbased learning. In Section 3, we discuss how to learn partsdependent transition matrices. In Section 4, we provide empirical evaluations of our learning algorithm. In Section 5, we conclude our paper. Codes will be available online.
2 Related Work
Label noise models Currently, there are three typical label noise models, i.e., the random classification noise (RCN) model [Biggio et al., 2011, Natarajan et al., 2013, Manwani and Sastry, 2013], the class-conditional label noise (CCN) model [Patrini et al., 2017, Xia et al., 2019, Zhang and Sabuncu, 2018], and the instance-dependent label noise (IDN) model [Berthon et al., 2020, Cheng et al., 2020, Du and Cai, 2015]. Specifically, RCN assumes that clean labels flip randomly with a constant rate [Aslam and Decatur, 1996, Angluin and Laird, 1988, Kearns, 1993]; CCN assumes that the flip rate depends on the latent clean class [Ma et al., 2018, Han et al., 2018b, Yu et al., 2019]; IDN considers the most general case of label noise, where the flip rate depends on the instance. However, without any additional assumption, IDN is non-identifiable and hard to learn with only noisy data [Xia et al., 2019]. The proposed parts-dependent label noise (PDN) model assumes that the label noise depends on the parts of instances, which could be an important “intermediate” model between CCN and IDN.
Estimating the transition matrix
The transition matrix bridges the class posterior probabilities for noisy and clean data. It is essential for building classifier-consistent and risk-consistent estimators in label-noise learning [Patrini et al., 2017, Liu and Tao, 2016, Scott, 2015, Yu et al., 2018b]. To estimate the transition matrix, a cross-validation method is used for the binary classification task [Natarajan et al., 2013]. For the multi-class classification task, the transition matrix can be learned by exploiting anchor points [Patrini et al., 2017, Yu et al., 2018a]. To remove the strong dependence on anchor points, data points having high noisy class posterior probabilities (similar to anchor points) can also be used to estimate the transition matrix via a slack variable trick [Xia et al., 2019]. The slack variable is added to revise the transition matrix, and both can be learned and validated together by using noisy data.
Parts-based learning Non-negative matrix factorization (NMF) [D. Lee and Seung, 1999] is the representative work of parts-based learning. It decomposes a non-negative data matrix into the product of two non-negative factor matrices. In contrast to principal component analysis (PCA) [Abdi and J. Williams, 2010] and vector quantization (VQ) [Gray, 1990], which learn holistic but not parts-based representations, NMF allows additive but not subtractive combinations. Several variants have extended the applicable range of NMF methods [Liu et al., 2010, Guan et al., 2019, Liu et al., 2017, Yoo and Choi, 2010].

3 Parts-dependent Label Noise
Preliminaries Let $\bar{S} = \{(x_i, \bar{y}_i)\}_{i=1}^{n}$ be the noisy training sample that contains instance-dependent label noise. Our aim is to learn a robust classifier from the noisy training sample that can assign clean labels to test data. In the rest of the paper, we use $M_{i\cdot}$ to denote the $i$-th row of a matrix $M$, $M_{\cdot j}$ the $j$-th column of $M$, and $M_{ij}$ the $ij$-th entry of $M$. We use $\|\cdot\|$ to denote the $\ell_2$ norm of a vector or the Frobenius norm of a matrix, e.g., $\|x_i\|$.
Learning parts-based representations NMF has been widely employed to learn parts-based representations [D. Lee and Seung, 1999]. Many variants of NMF have been proposed to enlarge its application fields [Guan et al., 2019, Liu et al., 2017, Yoo and Choi, 2010], e.g., allowing the data matrix or/and the matrix of parts to have mixed signs [Liu et al., 2010]. For our problem, we do not require the matrix of parts to be non-negative, as our input data matrix is not restricted to be non-negative. However, we require the combination parameters (also known as the new representation in the NMF community [D. Lee and Seung, 1999, Liu et al., 2017, Guan et al., 2019]) for each instance to be not only non-negative but also of unit $\ell_1$ norm. This is because we want to treat the parameters as weights that measure how much the parts contribute to reconstructing the corresponding instance.
Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ be the data matrix, where $d$ is the dimension of the data points. The parts-based representation learning for the parts-dependent label noise problem can be formulated as

$$\min_{W, H} \sum_{i=1}^{n} \|x_i - W h_i\|_2^2, \quad \mathrm{s.t.}\ h_i \geq 0,\ \|h_i\|_1 = 1,\ i = 1, \ldots, n, \qquad (1)$$

where $W = [w_1, \ldots, w_r] \in \mathbb{R}^{d \times r}$ is the matrix of parts (each column $w_l$ of $W$ denotes a part of the instances) and $h_i \in \mathbb{R}^{r}$, the $i$-th column of $H$, denotes the combination parameters used to reconstruct the instance $x_i$. Eq. (1) corresponds to the top panel of Figure 1, where parts are linearly combined to reconstruct an instance. Note that, to exploit the power of deep learning, the data matrix can consist of deep representations extracted by a deep neural network trained on the noisy training data.
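As a rough illustration of how Eq. (1) could be optimized on the extracted deep representations, the following sketch uses alternating projected-gradient steps with a Euclidean projection onto the probability simplex; the optimizer, step size, and function names are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex
    (non-negative entries summing to one)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def learn_parts(X, r, n_iter=500, lr=1e-3, seed=0):
    """Approximately solve Eq. (1): X ~ W H with simplex-constrained columns of H.
    X: d x n matrix of (deep) instance representations; r: number of parts."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    W = rng.standard_normal((d, r)) * 0.1    # parts are not required to be non-negative
    H = np.full((r, n), 1.0 / r)             # start from uniform combination weights
    for _ in range(n_iter):
        R = W @ H - X                         # reconstruction residual
        W -= lr * R @ H.T                     # gradient step on the parts
        H -= lr * W.T @ R                     # gradient step on the combination weights
        H = np.apply_along_axis(project_simplex, 0, H)  # keep h_i >= 0, ||h_i||_1 = 1
    return W, H
```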
Approximating instance-dependent transition matrices Since there are computational theories [Biederman, 1987, Ullman et al., 1996] and learning algorithms [Agarwal et al., 2004, Hosseini-Asl et al., 2015] showing that object recognition relies on parts-based representations, it is natural to model label noise at the parts level. Thus, we propose a parts-dependent noise (PDN) model, where label noise depends on parts rather than on the whole instance. Specifically, for each part, e.g., $w_l$, we assume there is a parts-dependent transition matrix, e.g., $T^l$. Since we have $r$ parts, there are $r$ different parts-dependent transition matrices, i.e., $T^1, \ldots, T^r$. Similar to the idea that parts can be used to reconstruct instances, we exploit the idea that the instance-dependent transition matrix can be approximated by a combination of parts-dependent transition matrices, which is illustrated in the bottom panel of Figure 1.
To approximate the instance-dependent transition matrices, we need to learn the parts-dependent transition matrices and the combination parameters. However, they are not identifiable, because it is ill-posed to factorize the instance-dependent transition matrix into the product of parts-dependent transition matrices and combination parameters. Fortunately, we can identify the parts-dependent transition matrices by assuming that the parameters for reconstructing the instance-dependent transition matrix are identical to those for reconstructing the instance. The rationale behind this assumption is that the learned parts are semantic [Lee and Seung, 1999], and their contributions to perceiving an instance should be similar to their contributions to understanding and annotating it [Biederman, 1987, Agarwal et al., 2004]. Let $h(x_i) = [h_1(x_i), \ldots, h_r(x_i)]^\top$ (i.e., the $i$-th column of $H$ in Eq. (1)) be the combination parameters used to reconstruct the instance $x_i$. The instance-dependent transition matrix can be approximated by

$$T(x_i) \approx \hat{T}(x_i) = \sum_{l=1}^{r} h_l(x_i)\, T^l. \qquad (2)$$

Note that $h(x_i)$ can be learned via Eq. (1). The normalization constraint on the combination parameters, i.e., $h_l(x_i) \geq 0$ and $\sum_{l=1}^{r} h_l(x_i) = 1$, ensures that the combined matrix on the right-hand side of Eq. (2) is also a valid transition matrix, i.e., it is non-negative and the sum of each row equals one.
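A minimal sketch of the combination in Eq. (2), assuming the part-dependent matrices and the simplex-constrained weights from Eq. (1) are available; the array shapes and names are illustrative.

```python
import numpy as np

def combine_transition_matrix(h_i, T_parts):
    """Eq. (2): approximate the instance-dependent transition matrix as a convex
    combination of part-dependent transition matrices.
    h_i: length-r simplex weights of one instance; T_parts: array of shape (r, C, C)."""
    T_hat = np.tensordot(h_i, T_parts, axes=1)   # sum_l h_l(x_i) * T^l
    # A convex combination of row-stochastic matrices is row-stochastic,
    # so T_hat is a valid transition matrix whenever each T^l is.
    assert np.allclose(T_hat.sum(axis=1), 1.0)
    return T_hat
```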
Learning the parts-dependent transition matrices Note that the parts-dependent transition matrices in Eq. (2) are unknown. We will show that they can be learned by exploiting anchor points. The concept of anchor points was proposed in [Liu and Tao, 2016]. They are defined in the clean data domain, i.e., an instance $x$ is an anchor point of the $i$-th clean class if $\Pr(Y = i \mid X = x)$ is equal to one.
Let $x^i$ be an anchor point of the $i$-th class. We have

$$\Pr(\bar{Y} = j \mid X = x^i) = \sum_{k=1}^{C} \Pr(\bar{Y} = j \mid Y = k, X = x^i)\Pr(Y = k \mid X = x^i) = \Pr(\bar{Y} = j \mid Y = i, X = x^i) = T_{ij}(x^i), \qquad (3)$$

where the first equation holds because of the law of total probability, and the second equation holds because $\Pr(Y = i \mid X = x^i) = 1$ and $\Pr(Y = k \mid X = x^i) = 0$ for all $k \neq i$. As $\Pr(\bar{Y} = j \mid X = x^i)$ can be unbiasedly learned [Bartlett et al., 2006] by exploiting the noisy training sample and the anchor point $x^i$, Eq. (3) shows that the $i$-th row of the instance-dependent transition matrix, $T_{i\cdot}(x^i)$, can be unbiasedly learned. This sheds light on the learnability of the parts-dependent transition matrices. Specifically, as shown in Figure 1, we reconstruct the instance-dependent transition matrix by a weighted combination of the parts-dependent transition matrices. If the instance-dependent transition matrix (note that, according to Eq. (3), given an anchor point $x^i$, the $i$-th row of its instance-dependent transition matrix can be learned and is thus available) and the combination parameters are given, learning the parts-dependent transition matrices is a convex problem.
Given an anchor point $x^i$, we can learn the $i$-th rows of the parts-dependent transition matrices by matching the $i$-th row of the reconstructed transition matrix, i.e., $\sum_{l=1}^{r} h_l(x^i) T^{l}_{i\cdot}$, with the $i$-th row of the instance-dependent transition matrix, i.e., $T_{i\cdot}(x^i)$. Since we have $r$ parts-dependent transition matrices, to identify all the entries of the $i$-th rows of the parts-dependent transition matrices, we need at least $r$ anchor points of the $i$-th class to build enough equations. Let $x^{i}_1, \ldots, x^{i}_{m_i}$ be the anchor points of the $i$-th class, where $m_i \geq r$. We robustly learn the $i$-th rows of the parts-dependent transition matrices by minimizing the reconstruction error instead of solving the equations exactly. Therefore, we propose the following optimization problem to learn the parts-dependent transition matrices:

$$\min_{T^1, \ldots, T^r} \sum_{i=1}^{C} \sum_{k=1}^{m_i} \Big\| T_{i\cdot}(x^{i}_k) - \sum_{l=1}^{r} h_l(x^{i}_k)\, T^{l}_{i\cdot} \Big\|_2^2, \qquad (4)$$

where the sum over the index $i$ calculates the reconstruction error over all rows of the transition matrices. Note that Eq. (4) assumes that anchor points for each class are given. If anchor points are not available, they can be learned from the noisy data as in [Patrini et al., 2017, Liu and Tao, 2016, Xia et al., 2019].
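Under the assumption that (approximate) anchor points have been found and that the rows $T_{i\cdot}(x)$ at those points have been estimated via Eq. (3), e.g., from the noisy class posterior of a network trained on noisy data, a hedged least-squares sketch of Eq. (4) for a single class is given below; the unconstrained solve followed by a projection to valid probability rows is an illustrative simplification.

```python
import numpy as np

def fit_part_transition_rows(T_rows_anchor, H_anchor):
    """Sketch of Eq. (4) for one clean class i.
    T_rows_anchor: (m, C) array, the i-th row of T(x) estimated at m anchor points
                   of class i (i.e., the noisy class posteriors there, Eq. (3)).
    H_anchor:      (m, r) array, combination parameters h(x) of the same anchor points.
    Returns an (r, C) array: the i-th rows of the r part-dependent transition matrices."""
    # Least squares: H_anchor @ rows ~ T_rows_anchor, solved jointly over all classes j.
    rows, *_ = np.linalg.lstsq(H_anchor, T_rows_anchor, rcond=None)
    # Project back to valid probability rows (non-negative, summing to one).
    rows = np.clip(rows, 0.0, None)
    rows /= rows.sum(axis=1, keepdims=True) + 1e-12
    return rows
```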
Implementation The overall procedure to learn the parts-dependent transition matrices is summarized in Algorithm 1. Given only a noisy training sample $\bar{S}$, we first learn deep representations of the instances. Note that we use a noisy validation set to select the deep model. Then, we minimize Eq. (1) to learn the combination parameters. The parts-dependent transition matrices are learned by minimizing Eq. (4). Finally, we use the weighted combination in Eq. (2) to obtain an instance-dependent transition matrix for each instance. Note that we learn the anchor points from the noisy training data, as in [Patrini et al., 2017, Liu and Tao, 2016, Xia et al., 2019]; if there are no anchor points available in the training data, instances that are only similar to anchor points will be selected, and the instance-dependent transition matrix will then be poorly estimated. To address this issue, we employ the slack variable in [Xia et al., 2019] to revise the approximated instance-dependent transition matrix.
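The steps above can be chained into a rough end-to-end sketch; it reuses the illustrative helpers from the previous sketches, and the feature extractor, anchor selection, and slack-variable handling are all assumptions made for illustration rather than the authors' exact implementation.

```python
import numpy as np

def project_to_transition_matrix(T):
    """Make a matrix a valid transition matrix: clip negative entries to zero,
    then normalize each row to sum to one (as in the slack-variable revision)."""
    T = np.clip(T, 0.0, None)
    return T / (T.sum(axis=1, keepdims=True) + 1e-12)

def build_instance_transition_matrices(features, r, anchor_ids, anchor_T_rows):
    """features: (n, d) deep representations learned on the noisy data.
    anchor_ids[i]: indices of (approximate) anchor points of class i.
    anchor_T_rows[i]: (m_i, C) estimated i-th rows of T(x) at those anchors (Eq. (3))."""
    W, H = learn_parts(features.T, r)                       # Eq. (1), sketched earlier
    C = anchor_T_rows[0].shape[1]
    T_parts = np.zeros((r, C, C))
    for i in range(C):                                      # Eq. (4), per clean class
        T_parts[:, i, :] = fit_part_transition_rows(anchor_T_rows[i],
                                                    H[:, anchor_ids[i]].T)
    T_hat = np.stack([combine_transition_matrix(H[:, j], T_parts)   # Eq. (2)
                      for j in range(features.shape[0])])
    delta = np.zeros((C, C))                                # slack variable, initialized to zero
    # During training, delta would be learned jointly with the classifier; the revised
    # matrices are kept valid by the projection below.
    return np.stack([project_to_transition_matrix(T + delta) for T in T_hat])
```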
4 Experiments
4.1 Experiment setup
Datasets We verify the efficacy of our approach on manually corrupted versions of three datasets, i.e., Fashion-MNIST [Xiao et al., 2017], SVHN [Netzer et al., 2011], and CIFAR-10 [Krizhevsky, 2009], and on one real-world noisy dataset, i.e., Clothing1M [Xiao et al., 2015]. Fashion-MNIST contains 60,000 training images and 10,000 test images with 10 classes. SVHN and CIFAR-10 both have 10 classes of images, but the former contains 73,257 training images and 26,032 test images, and the latter contains 50,000 training images and 10,000 test images. These three datasets contain clean data. We corrupted the training sets manually according to Algorithm 2. More details about this instance-dependent label noise generation approach can be found in Appendix B. IDN-x% means that the noise rate is controlled to be x%. All experiments on the datasets with synthetic instance-dependent label noise are repeated five times. Clothing1M has 1M images with real-world noisy labels and 10k images with clean labels for testing.
For all the datasets, we leave out 10% of the noisy training examples as a noisy validation set, which is used for model selection. We also conduct synthetic experiments on MNIST [LeCun et al.]. Due to the space limit, we put the corresponding experimental results in Appendix C.
Baselines and measurements We compare the proposed method with the following state-of-the-art approaches: (i) CE, which trains the standard deep network with the cross-entropy loss on noisy datasets. (ii) Decoupling [Malach and Shalev-Shwartz, 2017], which trains two networks on samples for which the predictions of the two networks are different. (iii) MentorNet [Jiang et al., 2018], Co-teaching [Han et al., 2018b], and Co-teaching+ [Yu et al., 2019], which mainly handle noisy labels by training on instances with small loss values. (iv) Joint [Tanaka et al., 2018], which jointly optimizes the sample labels and the network parameters. (v) DMI [Xu et al., 2019], which proposes a novel information-theoretic loss function for training deep neural networks robust to label noise. (vi) Forward [Patrini et al., 2017], Reweight [Liu and Tao, 2016], and T-Revision [Xia et al., 2019], which utilize a class-dependent transition matrix to correct the loss function. We use the classification accuracy on the clean test set to evaluate the performance of each model. Higher classification accuracy means that the algorithm is more robust to label noise.
Network structure and optimization
For a fair comparison, all experiments are conducted on NVIDIA Tesla V100 GPUs, and all methods are implemented in PyTorch. We use a ResNet-18 network for Fashion-MNIST and a ResNet-34 network for SVHN and CIFAR-10. The transition matrix for each instance is learned according to Algorithm 1. Exploiting the transition matrices, we can bridge the class posterior probabilities for noisy and clean data. We first initialize the network with SGD using momentum 0.9, weight decay, batch size 128, and an initial learning rate that is divided by 10 at the 40th and 80th epochs; we train for 100 epochs in total. Then, the optimizer is changed to Adam to learn the classifier and the slack variable. Note that the slack variable is initialized to a matrix with all-zero entries in the experiments. During training, the revised matrix can be kept a valid transition matrix by first projecting its negative entries to zero and then performing row normalization. For Clothing1M, we use a ResNet-50 pre-trained on ImageNet. Different from existing methods, we do not use the 50k clean training data or the 14k clean validation data; we only exploit the 1M noisy data to learn the transition matrices and classifiers. Note that, for real-world scenarios, it is more practical that no extra clean data is provided to help adjust the model. After the transition matrices are obtained according to Algorithm 1, we use SGD with momentum 0.9, weight decay, and batch size 32, and run for 10 epochs. For learning the classifier and the slack variable, Adam is then used.
Explanation We abbreviate our proposed method of learning with the parts-dependent transition matrices as PTD. Methods with “F” and “R” mean that the instance-dependent transition matrices are exploited by using the Forward [Patrini et al., 2017] method and the Reweight [Liu and Tao, 2016] method, respectively; methods with “V” mean that the transition matrices are revised with the slack variable. Details of these methods can be found in Appendix A.
Figure 2: Illustration of the transition matrix approximation error and the hyper-parameter sensitivity. (a) How the approximation error for the instance-dependent transition matrix varies as the number of parts increases. (b) How the number of parts affects the test classification performance. The error bar for the standard deviation in each figure is shaded.
4.2 Ablation study
We have described how to learn parts-dependent transition matrices for approximating the instance-dependent transition matrix in Section 3. To further verify that our proposed method is not sensitive to the number of parts, we perform an ablation study in this subsection. The experiments are conducted on CIFAR-10 with a 50% noise rate.
In Figure 2(a), we show how well the instance-dependent transition matrix can be approximated by employing the class-dependent transition matrix and the parts-dependent transition matrix. We use the norm of the difference between the estimated row and the true row as the measure. For each instance, we analyze the approximation error of a specific row rather than of the whole transition matrix. The reason is that we only used one row of the instance-dependent transition matrix to generate the noisy label. Specifically, given an instance with clean class label $y$ (note that we have access to clean labels for the test data to conduct the evaluation), we only exploit the $y$-th row of the instance-dependent transition matrix to flip the label from the class $y$ to another class. Note that "Class-dependent" represents the standard class-dependent transition matrix learning methods [Liu and Tao, 2016, Patrini et al., 2017] and "T-Revision" represents the revision method for learning a class-dependent transition matrix [Xia et al., 2019]. The Class-dependent and T-Revision methods are independent of parts; their curves are therefore flat. We can see that the parts-dependent (PTD) transition matrix achieves a much smaller approximation error than the class-dependent (parts-independent) transition matrix, and the results are insensitive to the number of parts. Figure 2(b) shows that the classification performance of our proposed method is robust and not sensitive to changes in the number of parts. More detailed experimental results can be found in Appendix D.
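For reference, the ablation metric described above can be computed as in the following sketch, assuming the true rows used to generate the noise are kept for evaluation; the choice of the $\ell_1$ norm here is an assumption, since the text does not pin down the norm.

```python
import numpy as np

def row_approximation_error(T_true_rows, T_est, clean_labels, ord=1):
    """T_true_rows: (n, C) true y-th rows of T(x) used to generate the noise.
    T_est:        (n, C, C) estimated instance-dependent transition matrices.
    clean_labels: (n,) clean class labels y of the evaluated instances.
    Returns the mean norm of the difference on the y-th row only."""
    est_rows = T_est[np.arange(len(clean_labels)), clean_labels]   # pick the y-th row
    return np.mean(np.linalg.norm(T_true_rows - est_rows, ord=ord, axis=1))
```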
Table 1: Classification accuracy (mean±std, %) on Fashion-MNIST with instance-dependent label noise.
Method  IDN-10%  IDN-20%  IDN-30%  IDN-40%  IDN-50%
CE  88.54±0.31  88.38±0.42  84.22±0.35  68.86±0.78  51.42±0.66
Decoupling  89.27±0.31  86.50±0.35  85.33±0.47  78.54±0.53  57.32±2.11
MentorNet  90.00±0.34  87.02±0.41  86.02±0.82  80.12±0.76  58.62±1.36
Co-teaching  90.82±0.33  87.89±0.41  86.88±0.32  82.78±0.95  63.22±1.58
Co-teaching+  90.92±0.51  89.77±0.45  88.52±0.45  83.57±1.77  59.32±2.77
Joint  70.24±0.99  56.83±0.45  51.27±0.67  44.24±0.78  30.45±0.45
DMI  91.98±0.62  90.33±0.21  84.81±0.44  69.01±1.87  51.64±1.78
Forward  89.05±0.43  88.61±0.43  84.27±0.46  70.25±1.28  57.33±3.75
Reweight  90.33±0.27  89.70±0.35  87.04±0.35  80.29±0.89  65.27±1.33
T-Revision  91.56±0.31  90.68±0.66  89.46±0.45  84.01±1.24  68.99±1.04
PTD-F  90.48±0.17  90.01±0.31  87.42±0.65  83.89±0.49  68.25±2.61
PTD-R  91.01±0.22  90.03±0.32  87.68±0.42  84.03±0.52  72.43±1.76
PTD-F-V  91.61±0.19  90.79±0.29  89.33±0.33  85.32±0.36  71.89±2.54
PTD-R-V  92.01±0.35  91.08±0.46  89.66±0.43  85.69±0.77  75.96±1.38
Table 2: Classification accuracy (mean±std, %) on SVHN with instance-dependent label noise.
Method  IDN-10%  IDN-20%  IDN-30%  IDN-40%  IDN-50%
CE  90.77±0.45  90.23±0.62  86.33±1.34  65.66±1.65  48.01±4.59
Decoupling  90.49±0.15  90.47±0.66  85.27±0.34  82.57±1.45  42.56±2.79
MentorNet  90.28±0.12  90.37±0.37  86.49±0.49  83.75±0.75  40.27±3.14
Co-teaching  91.33±0.31  90.56±0.67  88.93±0.78  85.47±0.64  45.90±2.31
Co-teaching+  93.05±1.20  91.05±0.82  85.33±2.71  57.24±3.77  42.56±3.65
Joint  86.01±0.34  78.58±0.72  76.34±0.56  65.14±1.72  46.78±3.77
DMI  93.51±1.09  93.22±0.62  91.78±1.54  69.34±2.45  48.93±2.34
Forward  90.89±0.63  90.65±0.27  87.32±0.59  78.46±2.58  46.27±3.90
Reweight  92.49±0.44  91.09±0.34  90.25±0.77  84.48±0.86  45.46±3.56
T-Revision  94.24±0.53  94.00±0.88  93.01±0.83  88.63±1.37  49.02±4.33
PTD-F  93.62±0.61  92.77±0.45  90.11±0.94  87.25±0.77  54.82±4.65
PTD-R  93.21±0.45  92.36±0.68  90.57±0.42  86.78±0.63  55.88±3.73
PTD-F-V  94.70±0.37  94.39±0.62  92.07±0.59  90.56±1.21  57.92±4.32
PTD-R-V  94.44±0.37  94.23±0.46  93.11±0.78  90.64±0.98  58.09±2.57
Table 3: Classification accuracy (mean±std, %) on CIFAR-10 with instance-dependent label noise.
Method  IDN-10%  IDN-20%  IDN-30%  IDN-40%  IDN-50%
CE  74.49±0.29  68.21±0.72  60.48±0.62  49.84±1.27  38.86±2.71
Decoupling  74.09±0.78  70.01±0.66  63.05±0.65  44.27±1.91  38.63±2.32
MentorNet  74.45±0.66  70.56±0.34  65.42±0.79  46.22±0.98  39.89±2.62
Co-teaching  76.99±0.17  72.99±0.45  67.22±0.64  49.25±1.77  42.77±3.41
Co-teaching+  74.27±1.20  71.07±0.77  64.77±0.58  47.73±2.32  39.47±2.14
Joint  76.89±0.37  73.89±0.34  69.03±0.79  54.75±5.98  44.72±7.72
DMI  75.02±0.45  69.89±0.33  61.88±0.64  51.23±1.18  41.45±1.97
Forward  73.45±0.23  68.99±0.62  60.21±0.75  47.17±2.96  40.75±2.09
Reweight  74.55±0.23  68.42±0.75  62.58±0.46  50.12±0.96  41.08±2.45
T-Revision  74.61±0.39  69.32±0.64  64.09±0.37  50.38±0.87  42.57±3.27
PTD-F  76.01±0.45  73.45±0.62  65.25±0.84  49.88±0.85  46.88±1.25
PTD-R  78.71±0.22  75.02±0.73  71.86±0.42  56.15±0.45  49.07±2.56
PTD-F-V  76.29±0.38  73.88±0.61  69.01±0.47  50.43±0.62  48.76±2.01
PTD-R-V  79.01±0.20  76.05±0.53  72.28±0.49  58.62±0.88  53.98±2.34
4.3 Comparison with the State-of-the-Art
Results on synthetic noisy datasets Tables 1, 2, and 3 report the classification accuracy on Fashion-MNIST, SVHN, and CIFAR-10, respectively.
For Fashion-MNIST and SVHN, in the easy cases, e.g., IDN-10% and IDN-20%, almost all methods work well. In the IDN-30% case, the advantages of PTD begin to show: it clearly surpasses all methods except T-Revision, e.g., the classification accuracy of PTD-R-V is 1.14% higher than Co-teaching+ on Fashion-MNIST and 1.33% higher than DMI on SVHN. When the noise rate rises, T-Revision is gradually defeated: in the IDN-40% case, the classification accuracy of PTD-R-V is 1.68% and 2.01% higher than T-Revision on Fashion-MNIST and SVHN, respectively. Finally, in the hardest case, i.e., IDN-50%, the superiority of PTD widens the performance gap: the classification accuracy of PTD-R-V is 6.97% and 9.07% higher than the best baseline method on the two datasets.
For CIFAR-10, the algorithms assisted by PTD overtake the other methods with clear gaps. From the IDN-10% to the IDN-50% case, the advantage of our proposed method grows with the noise rate. In the 10% and 20% cases, the performance of PTD-R-V is outstanding, i.e., the classification accuracy is 2.02% and 2.16% higher than the best baseline (Co-teaching and Joint, respectively). In the 30% and 40% cases, the gap is expanded to 3.25% and 3.87%. Lastly, in the 50% case, PTD-R-V outperforms state-of-the-art methods by almost 10% in classification accuracy.
Table 4: Classification accuracy (%) on Clothing1M.
CE  Decoupling  MentorNet  Co-teaching  Co-teaching+  Joint  DMI
68.88  54.53  56.79  60.15  65.15  70.88  70.12
Forward  Reweight  T-Revision  PTD-F  PTD-R  PTD-F-V  PTD-R-V
69.91  70.40  70.97  70.07  71.51  70.26  71.67
To sum up, the synthetic experiments reveal that our method is powerful in handling instance-dependent label noise, particularly at high noise rates.
Results on the real-world dataset The proposed method outperforms the baselines, as shown in Table 4, where the highest accuracy is boldfaced. The comparison indicates that the label noise in the Clothing1M dataset is more likely to be instance-dependent, and that our proposed method models instance-dependent noise better than the other methods.
5 Conclusion
In this paper, we focus on learning with instance-dependent label noise, which is a more general but much less studied case of label noise. Inspired by parts-based learning, we exploit parts-dependent transition matrices to approximate the instance-dependent transition matrix, which is intuitive and learnable. Experimental results show that our proposed method consistently outperforms existing methods, especially in the case of high noise rates. In the future, we can extend the work in the following aspects. First, we can incorporate prior knowledge of the transition matrices and the parts (e.g., sparsity), which would improve parts-based learning. Second, we can introduce slack variables to modify the combination parameters.
Acknowledgments
TLL was supported by Australian Research Council Project DE190101473. NNW was supported by the National Natural Science Foundation of China under Grants 61922066 and 61876142. GN and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan. The authors give special thanks to Pengqian Lu for helpful discussions and comments.
References
 Aslam and Decatur [1996] Javed A. Aslam and Scott E. Decatur. On the sample complexity of noise-tolerant learning. Information Processing Letters, 1996.
 Abdi and J. Williams [2010] Hervé Abdi and Lynne J. Williams. Principal component analysis. wiley interdisciplinary reviews computational statistics, 2(4):433–459, 2010.
 Agarwal et al. [2004] Shivani Agarwal, Aatif Awan, and Dan Roth. Learning to detect objects in images via a sparse, partbased representation. IEEE transactions on pattern analysis and machine intelligence, 26(11):1475–1490, 2004.
 Angluin and Laird [1988] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
 Bartlett et al. [2006] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
 Berthon et al. [2020] Antonin Berthon, Bo Han, Gang Niu, Tongliang Liu, and Masashi Sugiyama. Confidence scores make instancedependent labelnoise learning possible. arXiv preprint arXiv:2001.03772, 2020.
 Biederman [1987] Irving Biederman. Recognitionbycomponents: a theory of human image understanding. Psychological review, 94(2):115, 1987.
 Biggio et al. [2011] Battista Biggio, Blaine Nelson, and Pavel Laskov. Support vector machines under adversarial label noise. In ACML, 2011.
 Cheng et al. [2020] Jiacheng Cheng, Tongliang Liu, Kotagiri Ramamohanarao, and Dacheng Tao. Learning with bounded instanceand labeldependent label noise. In ICML, 2020.
 Dietterich et al. [1997] Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.
 D. Lee and Seung [1999] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
 Du and Cai [2015] Jun Du and Zhihua Cai. Modelling class noise with symmetric and asymmetric distributions. In AAAI, 2015.
 Goldberger and Ben-Reuven [2017] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural networks using a noise adaptation layer. In ICLR, 2017.
 Gray [1990] Robert M. Gray. Vector quantization. In Readings in Speech Recognition, 1990.
 Gretton et al. [2009] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, pages 131–160, 2009.
 Guan et al. [2019] Naiyan Guan, Tongliang Liu, Zhang Yangmuzi, Dacheng Tao, and Larry Steven Davis. Truncated cauchy nonnegative matrix factorization. IEEE Transactions on pattern analysis and machine intelligence, 41(1):246–259, 2019.

 Guo et al. [2018] Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R Scott, and Dinglong Huang. Curriculumnet: Weakly supervised learning from large-scale web images. In ECCV, pages 135–150, 2018.
 Han et al. [2018a] Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor Tsang, Ya Zhang, and Masashi Sugiyama. Masking: A new perspective of noisy supervision. In NeurIPS, pages 5836–5846, 2018a.
 Han et al. [2018b] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pages 8527–8537, 2018b.

 Hosseini-Asl et al. [2015] Ehsan Hosseini-Asl, Jacek M Zurada, and Olfa Nasraoui. Deep learning of part-based representation of data using sparse autoencoders with nonnegativity constraints. IEEE Transactions on Neural Networks and Learning Systems, 27(12):2486–2498, 2015.
 Hu et al. [2020] Wei Hu, Zhiyuan Li, and Dingli Yu. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In ICLR, 2020. URL https://openreview.net/forum?id=Hke3gyHYwH.
 Jiang et al. [2018] Lu Jiang, Zhengyuan Zhou, Thomas Leung, LiJia Li, and Li FeiFei. MentorNet: Learning datadriven curriculum for very deep neural networks on corrupted labels. In ICML, pages 2309–2318, 2018.
 Karimi et al. [2019] Davood Karimi, Haoran Dou, Simon K Warfield, and Ali Gholipour. Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. arXiv preprint arXiv:1912.02911, 2019.

 Kearns [1993] Michael Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing (STOC '93), 1993.
 Kremer et al. [2018] Jan Kremer, Fei Sha, and Christian Igel. Robust active label correction. In AISTATS, pages 308–316, 2018.
 Krizhevsky [2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 [27] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
 Lee and Seung [1999] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401(6755):788–791, 1999.
 Lee and Seung [2001] Daniel D Lee and H Sebastian Seung. Algorithms for nonnegative matrix factorization. In NeurIPS, pages 556–562, 2001.

 Li et al. [2020a] Junnan Li, Richard Socher, and Steven C.H. Hoi. DivideMix: Learning with noisy labels as semi-supervised learning. In ICLR, 2020a. URL https://openreview.net/forum?id=HJgExaVtwr.
 Li et al. [2020b] Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In AISTATS, 2020b.
 Li et al. [2017] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and LiJia Li. Learning from noisy labels with distillation. In ICCV, pages 1910–1918, 2017.
 Liu et al. [2010] Ding Liu, Chris H.Q., Tao Li, and Michael I. Jordan. Convex and seminonnegative matrix factorizations. IEEE Transactions on pattern analysis and machine intelligence, 32(1):45–55, 2010.
 Liu and Tao [2016] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2016.
 Liu et al. [2017] Tongliang Liu, Mingming Gong, and Dacheng Tao. Large cone nonnegative matrix factorization. IEEE Transactions on Neural Networks and Learning Systems, 28(9):2129–2141, 2017.
 Liu and Guo [2020] Yang Liu and Hongyi Guo. Peer loss functions: Learning from noisy labels without knowing noise rates. In ICML, 2020.
 Logothetis and Sheinberg [1996] Nikos K Logothetis and David L Sheinberg. Visual object recognition. Annual review of neuroscience, 19(1):577–621, 1996.
 Lyu and Tsang [2020] Yueming Lyu and Ivor W. Tsang. Curriculum loss: Robust learning and generalization against label corruption. In ICLR, 2020. URL https://openreview.net/forum?id=rkgt0REKwS.
 Ma et al. [2018] Xingjun Ma, Yisen Wang, Michael E Houle, Shuo Zhou, Sarah M Erfani, ShuTao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionalitydriven learning with noisy labels. In ICML, pages 3361–3370, 2018.
 Malach and Shalev-Shwartz [2017] Eran Malach and Shai Shalev-Shwartz. Decoupling “when to update” from “how to update”. In NeurIPS, pages 960–970, 2017.
 Manwani and Sastry [2013] Naresh Manwani and P.S. Sastry. Noise tolerance under risk minimization. IEEE Transactions on Cybernetics, 2013.
 Menon et al. [2018] Aditya Krishna Menon, Brendan Van Rooyen, and Nagarajan Natarajan. Learning from binary labels with instancedependent noise. Machine Learning, 107(810):1561–1595, 2018.
 Natarajan et al. [2013] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, pages 1196–1204, 2013.
 Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y.Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 Nguyen et al. [2020] Duc Tam Nguyen, Chaithanya Kumar Mummadi, Thi Phuong Nhung Ngo, Thi Hoai Phuong Nguyen, Laura Beggel, and Thomas Brox. Self: Learning to filter noisy labels with selfensembling. In ICLR, 2020. URL https://openreview.net/forum?id=HkgsPhNYPS.
 Norouzi et al. [2013] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zeroshot learning by convex combination of semantic embeddings. In NeurIPS, 2013.
 Northcutt et al. [2017] Curtis G Northcutt, Tailin Wu, and Isaac L Chuang. Learning with confident examples: Rank pruning for robust classification with noisy labels. In UAI, 2017.
 Palmer [1977] Stephen E Palmer. Hierarchical structure in perceptual representation. Cognitive psychology, 9(4):441–474, 1977.
 Patrini et al. [2017] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, pages 1944–1952, 2017.
 Reed et al. [2015] Scott E Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. In ICLR, 2015.
 Ren et al. [2018] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, pages 4331–4340, 2018.
 Scott [2015] Clayton Scott. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS, pages 838–846, 2015.
 Scott et al. [2013] Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In COLT, pages 489–511, 2013.
 Tanaka et al. [2018] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In CVPR, 2018.
 Thekumparampil et al. [2018] Kiran K Thekumparampil, Ashish Khetan, Zinan Lin, and Sewoong Oh. Robustness of conditional gans to noisy labels. In NeurIPS, pages 10271–10282, 2018.
 Ullman et al. [1996] Shimon Ullman et al. High-level vision: Object recognition and visual cognition, volume 2. MIT Press, Cambridge, MA, 1996.
 Vahdat [2017] Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In NeurIPS, pages 5596–5605, 2017.
 Veit et al. [2017] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy largescale datasets with minimal supervision. In CVPR, pages 839–847, 2017.

 Vijayanarasimhan and Grauman [2014] Sudheendra Vijayanarasimhan and Kristen Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. International Journal of Computer Vision, 108(1-2):97–114, 2014.
 Wachsmuth et al. [1994] E Wachsmuth, MW Oram, and DI Perrett. Recognition of objects and their component parts: responses of single units in the temporal cortex of the macaque. Cerebral Cortex, 4(5):509–522, 1994.

 Wang et al. [2019] Xiaobo Wang, Shuo Wang, Jun Wang, Hailin Shi, and Tao Mei. Co-mining: Deep face recognition with noisy labels. In ICCV, pages 9358–9367, 2019.
 Welinder and Perona [2010] Peter Welinder and Pietro Perona. Online crowdsourcing: rating annotators and obtaining cost-effective labels. In CVPR Workshop, pages 25–32, 2010.
 Xia et al. [2019] Xiaobo Xia, Tongliang Liu, Nannan Wang, Bo Han, Chen Gong, Gang Niu, and Masashi Sugiyama. Are anchor points really indispensable in labelnoise learning? In NeurIPS, pages 6835–6846, 2019.
 Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
 Xiao et al. [2015] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In CVPR, pages 2691–2699, 2015.
 Xu et al. [2019] Yilun Xu, Peng Cao, Yuqing Kong, and Yizhou Wang. L_dmi: A novel informationtheoretic loss function for training deep nets robust to label noise. In NeurIPS, pages 6222–6233, 2019.
 Yoo and Choi [2010] Jiho Yoo and Seungjin Choi. Nonnegative matrix factorization with orthogonality constraints. Management Science, 58(11):2037–2056, 2010.
 Yu et al. [2019] Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor W Tsang, and Masashi Sugiyama. How does disagreement benefit coteaching? In ICML, 2019.
 Yu et al. [2018a] Xiyu Yu, Tongliang Liu, Mingming Gong, Kayhan Batmanghelich, and Dacheng Tao. An efficient and provable approach for mixture proportion estimation using linear independence assumption. In CVPR, pages 4480–4489, 2018a.
 Yu et al. [2018b] Xiyu Yu, Tongliang Liu, Mingming Gong, and Dacheng Tao. Learning with biased complementary labels. In ECCV, pages 68–83, 2018b.
 Zhang and Sabuncu [2018] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, pages 8778–8788, 2018.
Appendix A How to learn robust classifiers by exploiting parts-dependent transition matrices
For those who are not familiar with how to use the transition matrix to learn robust classifiers, in this supplementary material, we explain how to learn robust classifiers by exploiting the parts-dependent transition matrices.
We begin by introducing notation. Let $D$ be the distribution of the clean random variables $(X, Y)$ and $\bar{D}$ the distribution of the noisy random variables $(X, \bar{Y})$. Let $\{(x_i, y_i)\}_{i=1}^{n}$ be i.i.d. samples drawn from the distribution $D$, $\{(x_i, \bar{y}_i)\}_{i=1}^{n}$ i.i.d. samples drawn from the distribution $\bar{D}$, and $C$ the number of label classes.
The aim of multi-class classification is to learn a classifier that can assign labels to given instances. The classifier is of the following form: $f(x) = \arg\max_{i \in \{1, \ldots, C\}} g_i(x)$, where $g_i(x)$ is an estimate of $\Pr(Y = i \mid X = x)$. The expected risk of employing $f$ is defined as

$$R(f) = \mathbb{E}_{(X, Y) \sim D}\left[\ell(f(X), Y)\right]. \qquad (5)$$

The optimal classifier is the one that minimizes the risk $R(f)$. Since the distribution $D$ is usually unknown, the optimal classifier is approximated by the minimizer of the empirical risk

$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i). \qquad (6)$$

Given only the noisy training sample $\bar{S} = \{(x_i, \bar{y}_i)\}_{i=1}^{n}$, the noisy version of the empirical risk is defined as

$$\bar{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), \bar{y}_i). \qquad (7)$$
In the main paper (Section 3), we show how to approximate the instance-dependent transition matrix by exploiting the parts-dependent transition matrices. For an instance $x$, according to the definition of the instance-dependent transition matrix, we have $\Pr(\bar{Y} = j \mid X = x) = \sum_{i=1}^{C} T_{ij}(x) \Pr(Y = i \mid X = x)$. We therefore let

$$\bar{g}(x) = \hat{T}(x)^{\top} g(x), \qquad (8)$$

where $g(x) = [g_1(x), \ldots, g_C(x)]^{\top}$ estimates the clean class posterior and $\hat{T}(x)$ is the approximated instance-dependent transition matrix from Eq. (2). The empirical risk of our PTD-F algorithm is defined as

$$\bar{R}_n^{F}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(\bar{g}(x_i), \bar{y}_i\big). \qquad (9)$$

By employing the importance reweighting technique [Gretton et al., 2009, Liu and Tao, 2016, Xia et al., 2019], the empirical risk of our PTD-R algorithm is defined as

$$\bar{R}_n^{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \frac{g_{\bar{y}_i}(x_i)}{\bar{g}_{\bar{y}_i}(x_i)} \, \ell\big(f(x_i), \bar{y}_i\big). \qquad (10)$$

Here, $g_{\bar{y}_i}(x_i)$ is an estimate of $\Pr(Y = \bar{y}_i \mid X = x_i)$ and $\bar{g}_{\bar{y}_i}(x_i)$ is an estimate of $\Pr(\bar{Y} = \bar{y}_i \mid X = x_i)$.
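A hedged PyTorch-style sketch of the PTD-F and PTD-R losses defined in Eqs. (9) and (10), assuming the per-instance transition matrices $\hat{T}(x)$ have already been built via Eq. (2); the tensor names, the clamping constant, and detaching the importance weights are illustrative implementation choices.

```python
import torch
import torch.nn.functional as F

def ptd_forward_loss(logits, noisy_labels, T_hat):
    """PTD-F (Eq. (9)): forward-correct the clean-posterior estimate with T_hat.
    logits: (B, C); noisy_labels: (B,); T_hat: (B, C, C) instance-dependent matrices."""
    clean_post = F.softmax(logits, dim=1)                        # g(x)
    noisy_post = torch.bmm(T_hat.transpose(1, 2),                # T_hat(x)^T g(x)
                           clean_post.unsqueeze(2)).squeeze(2)
    return F.nll_loss(torch.log(noisy_post.clamp_min(1e-12)), noisy_labels)

def ptd_reweight_loss(logits, noisy_labels, T_hat):
    """PTD-R (Eq. (10)): importance-reweighted cross-entropy on the noisy labels."""
    clean_post = F.softmax(logits, dim=1)
    noisy_post = torch.bmm(T_hat.transpose(1, 2),
                           clean_post.unsqueeze(2)).squeeze(2)
    idx = torch.arange(logits.size(0))
    # Whether to backpropagate through the weights is a design choice; here they are detached.
    weights = (clean_post[idx, noisy_labels] /
               noisy_post[idx, noisy_labels].clamp_min(1e-12)).detach()
    return (weights * F.cross_entropy(logits, noisy_labels, reduction='none')).mean()
```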
When the slack variable $\Delta T$ is introduced to modify the instance-dependent transition matrices, reviewing Eq. (8), we replace $\hat{T}(x)$ with $\hat{T}(x) + \Delta T$ to get $\bar{g}^{V}(x)$, i.e.,

$$\bar{g}^{V}(x) = \big(\hat{T}(x) + \Delta T\big)^{\top} g(x). \qquad (11)$$

Then the empirical risks of PTD-F-V and PTD-R-V are defined as $\bar{R}_n^{FV}(f)$ and $\bar{R}_n^{RV}(f)$, i.e.,

$$\bar{R}_n^{FV}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(\bar{g}^{V}(x_i), \bar{y}_i\big) \qquad (12)$$

and

$$\bar{R}_n^{RV}(f) = \frac{1}{n} \sum_{i=1}^{n} \frac{g_{\bar{y}_i}(x_i)}{\bar{g}^{V}_{\bar{y}_i}(x_i)} \, \ell\big(f(x_i), \bar{y}_i\big). \qquad (13)$$
To learn noise-robust classifiers under noisy supervision, we minimize the empirical risks of PTD-F, PTD-R, PTD-F-V, and PTD-R-V, respectively.
Appendix B Instance-dependent Label Noise Generation
Note that it is more realistic that different instances have different flip rates. Without constraining different instances to have the same flip rate, it is more challenging to model the label noise and to train robust classifiers. In Step 1, in order to control the global flip rate to be $\tau$ without constraining all instances to have the same flip rate, we sample the instance flip rates from a truncated normal distribution. This distribution limits the flip rates of instances to a bounded range around $\tau$; their mean and standard deviation are equal to the mean $\tau$ and the standard deviation 0.1 of the selected truncated normal distribution, respectively.
In Step 2, we sample parameters from the standard normal distribution for generating instance-dependent label noise. The dimensionality of each parameter is $d$, where $d$ denotes the dimensionality of the instance. Learning these parameters is critical to modeling instance-dependent label noise; however, it is hard to identify them without any assumption.
Note that an instance with clean label $y$ will have its label flipped only according to the $y$-th row of the transition matrix. Thus, in Steps 4 to 7, we only compute the $y$-th row of the instance-dependent transition matrix for the instance. Specifically, Steps 5 and 7 ensure that the diagonal entry of the $y$-th row is $1-q$, where $q$ is the sampled flip rate of the instance, and Step 6 ensures that the sum of the off-diagonal entries is $q$.
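The generation procedure described above could look roughly as follows; the truncation range of the flip-rate distribution and the softmax over the non-true classes are assumptions about details not spelled out in this section.

```python
import numpy as np
from scipy import stats

def gen_instance_dependent_noise(X, y, num_classes, tau, seed=0):
    """Sketch of instance-dependent label noise generation (cf. Algorithm 2).
    X: (n, d) instances; y: (n,) clean labels; tau: average noise rate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 1: per-instance flip rates from a truncated normal with mean tau and std 0.1
    # (the truncation to [0, 1] below is an assumption).
    q = stats.truncnorm.rvs((0 - tau) / 0.1, (1 - tau) / 0.1, loc=tau, scale=0.1,
                            size=n, random_state=seed)
    # Step 2: parameters sampled from the standard normal, one d-dimensional vector per class.
    W = rng.standard_normal((d, num_classes))
    noisy_y = y.copy()
    for i in range(n):
        # Steps 4-7: build only the y_i-th row of T(x_i).
        scores = X[i] @ W
        scores[y[i]] = -np.inf                     # exclude the true class before the softmax
        probs = np.exp(scores - scores[np.isfinite(scores)].max())
        row = q[i] * probs / probs.sum()           # off-diagonal entries sum to q_i (Step 6)
        row[y[i]] = 1.0 - q[i]                     # diagonal entry is 1 - q_i (Steps 5 and 7)
        noisy_y[i] = rng.choice(num_classes, p=row)
    return noisy_y, q
```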
Appendix C Complementary experiments on a synthetic noisy dataset
In the main paper (Section 4), we present the experimental results on three synthetic noisy datasets, i.e., Fashion-MNIST, SVHN, and CIFAR-10. In this supplementary material, we provide the experimental results on another synthetic noisy dataset, MNIST. MNIST contains 60,000 training images and 10,000 test images with 10 classes. We use a LeNet-5 network for it. The detailed experimental results are shown in Table 5. The classification performance shows that our proposed method is more robust than the baseline methods when coping with instance-dependent label noise.
Appendix D Detailed experimental results of the ablation study
In Section 4.2, we have shown that our proposed method is insensitive to the number of parts. Due to the space limit, we only provided an illustration there by exploiting figures. In this supplementary material, more detailed results of the ablation study, including means and standard deviations of the approximation error and the classification accuracy, are shown in Table 6 and Table 7.
Table 5: Classification accuracy (mean±std, %) on MNIST with instance-dependent label noise.
Method  IDN-10%  IDN-20%  IDN-30%  IDN-40%  IDN-50%
CE  98.24±0.07  98.21±0.06  96.78±0.12  93.76±0.18  79.69±4.35
Decoupling  96.63±0.12  96.62±0.22  92.73±0.36  90.34±0.33  80.56±2.67
MentorNet  97.45±0.11  97.21±0.13  92.88±0.31  88.23±1.65  80.02±1.71
Co-teaching  97.56±0.12  97.32±0.15  94.81±0.24  92.45±0.59  83.30±1.37
Co-teaching+  98.32±0.07  98.07±0.12  96.70±0.35  94.37±0.48  82.97±1.11
Joint  98.53±0.06  98.17±0.14  96.51±0.17  93.07±0.62  83.72±3.22
DMI  98.63±0.04  98.40±0.11  97.75±0.21  96.45±0.23  87.52±1.03
Forward  97.23±0.15  96.87±0.15  95.01±0.27  90.30±0.61  77.42±3.28
Reweight  98.21±0.07  97.99±0.13  96.96±0.14  94.55±0.67  80.87±4.14
T-Revision  98.49±0.06  98.39±0.09  97.55±0.14  96.50±0.31  84.71±3.47
PTD-F  98.55±0.05  97.92±0.27  97.34±0.11  94.67±0.83  84.01±2.11
PTD-R  98.22±0.10  98.12±0.17  97.06±0.13  94.75±0.54  82.72±2.04
PTD-F-V  98.71±0.05  98.46±0.11  97.77±0.09  96.07±0.45  88.55±1.96
PTD-R-V  98.66±0.03  98.43±0.15  97.81±0.23  96.73±0.20  88.67±1.25
Table 6: Means and standard deviations of the transition matrix approximation error on CIFAR-10 (IDN-50%) with different numbers of parts $r$.
Number of parts  Class-dependent  T-Revision  PTD  PTD-F-V  PTD-R-V
r = 10  0.945±0.051  0.922±0.037  0.840±0.030  0.815±0.011  0.811±0.020
r = 11  0.945±0.051  0.922±0.037  0.841±0.022  0.802±0.010  0.815±0.011
r = 12  0.945±0.051  0.922±0.037  0.831±0.015  0.806±0.014  0.812±0.014
r = 13  0.945±0.051  0.922±0.037  0.814±0.024  0.790±0.019  0.791±0.017
r = 14  0.945±0.051  0.922±0.037  0.821±0.040  0.792±0.022  0.791±0.016
r = 15  0.945±0.051  0.922±0.037  0.829±0.034  0.812±0.017  0.802±0.025
r = 16  0.945±0.051  0.922±0.037  0.831±0.029  0.800±0.018  0.800±0.020
r = 17  0.945±0.051  0.922±0.037  0.819±0.012  0.800±0.011  0.792±0.013
r = 18  0.945±0.051  0.922±0.037  0.829±0.011  0.798±0.012  0.794±0.017
r = 19  0.945±0.051  0.922±0.037  0.827±0.017  0.799±0.013  0.795±0.018
r = 20  0.945±0.051  0.922±0.037  0.832±0.025  0.805±0.021  0.800±0.015
Table 7: Means and standard deviations (%) of classification accuracy on CIFAR-10 (IDN-50%) with different numbers of parts $r$.
Number of parts  PTD-F  PTD-R  PTD-F-V  PTD-R-V
r = 10  46.84±2.34  49.02±2.55  48.84±2.74  53.78±2.77
r = 11  47.22±1.77  49.11±1.98  48.64±1.58  53.72±2.63
r = 12  47.01±2.65  48.75±1.95  48.62±3.05  53.52±1.99
r = 13  47.05±1.87  48.99±2.67  48.63±1.42  53.33±1.96
r = 14  47.01±1.65  49.12±3.02  48.77±1.46  53.72±2.13
r = 15  46.88±1.29  49.14±1.89  48.65±1.01  53.90±1.67
r = 16  47.19±1.49  49.03±1.78  48.59±2.03  53.98±1.95
r = 17  47.01±1.36  49.02±2.06  48.62±1.62  54.01±1.72
r = 18  47.09±1.45  48.89±2.51  48.58±1.03  53.69±2.31
r = 19  47.39±1.48  49.09±2.58  48.79±1.01  53.75±2.77
r = 20  46.88±1.25  49.07±2.56  48.76±1.75  53.98±2.34