L_DMI: An Information-theoretic Noise-robust Loss Function

09/08/2019, by Yilun Xu, et al., Peking University

Accurately annotating a large-scale dataset is notoriously expensive in both time and money. Although acquiring a low-quality-annotated dataset can be much cheaper, using such a dataset without particular treatment often badly damages the performance of trained models. Various methods have been proposed for learning with noisy labels. However, they only handle limited kinds of noise patterns, require auxiliary information (e.g., the noise transition matrix), or lack theoretical justification. In this paper, we propose a novel information-theoretic loss function, L_DMI, for training deep neural networks robust to label noise. The core of L_DMI is a generalized version of mutual information, termed Determinant based Mutual Information (DMI), which is not only information-monotone but also relatively invariant. To the best of our knowledge, L_DMI is the first loss function that is provably not sensitive to noise patterns and noise amounts, and it can be applied to any existing classification neural network straightforwardly without any auxiliary information. In addition to the theoretical justification, we also empirically show that using L_DMI outperforms all other counterparts in the classification task on the Fashion-MNIST, CIFAR-10 and Dogs vs. Cats datasets with a variety of synthesized noise patterns and noise amounts, as well as on the real-world dataset Clothing1M. Code is available at https://github.com/Newbeeer/L_DMI


1 Introduction

Deep neural networks, together with large-scale accurately annotated datasets, have achieved remarkable performance in a great many classification tasks in recent years (e.g., Krizhevsky et al. (2012); He et al. (2016)). However, it is usually money- and time-consuming to have experts annotate labels for large-scale datasets. While collecting labels from crowdsourcing platforms like Amazon Mechanical Turk is a potential way to obtain annotations more cheaply and quickly, the collected labels are usually very noisy. Noisy labels hamper the performance of deep neural networks since the commonly used cross entropy loss is not noise-robust. This raises an urgent demand for noise-robust loss functions.

Some previous works have proposed loss functions for training deep neural networks with noisy labels. However, they either use auxiliary information (Patrini et al. (2017); Hendrycks et al. (2018)), like an additional set of clean data or the noise transition matrix, which cannot be easily obtained in practice, or make assumptions on the noise (Ghosh et al. (2017); Zhang and Sabuncu (2018)) and thus can only handle limited kinds of noise patterns (see the preliminaries for definitions of the different noise patterns), like uniform noise only or diagonally dominant noise only.

One reason that the loss functions used in previous works are not robust to a certain noise pattern, say diagonally non-dominant noise, is that they are distance-based, i.e., the loss is the distance between the classifier's outputs and the labels (e.g., 0-1 loss, cross entropy loss). When datapoints are labeled by a careless annotator who tends to label the a priori popular class (e.g., for medical images with classes "malignant" and "benign", a careless annotator labels "benign" when the underlying true label is "benign" and also labels "benign" with 90% probability when the underlying true label is "malignant"), the collected noisy labels have a diagonally non-dominant noise pattern and are extremely biased toward one class ("benign"). In this situation, distance-based losses will prefer the meaningless classifier that always outputs the a priori popular class ("benign") over the classifier that outputs the true labels.

To address this issue, instead of using a distance-based loss, we use an information-theoretic loss such that the classifier whose outputs have the highest mutual information with the labels has the lowest loss. The key observation is that the meaningless classifier carries no information about anything and will be naturally eliminated by an information-theoretic loss. Moreover, the information-monotonicity of mutual information guarantees that adding noise to a classifier's output will make this classifier less preferred by the information-theoretic loss.

However, the key observation is not sufficient. In fact, we want an information measure $I$ to satisfy, for every two classifiers $h_1$ and $h_2$,

$$I(h_1(X); \tilde{Y}) \geq I(h_2(X); \tilde{Y}) \iff I(h_1(X); Y) \geq I(h_2(X); Y),$$

where $Y$ denotes the clean labels and $\tilde{Y}$ the noisy labels.

Unfortunately, the traditional Shannon mutual information (MI) does not satisfy the above formula. Therefore, we generalize it to a new information measure, DMI (Determinant based Mutual Information). Like MI, DMI measures the correlation between two random variables. It is defined as the determinant of the matrix that describes the joint distribution over the two variables. Intuitively, when two random variables are independent, their joint distribution matrix is an outer product of the marginals, so it has rank one and zero determinant. Moreover, DMI is not only information-monotone like MI, but also relatively invariant because of the multiplication property of the determinant. The relative invariance of DMI makes it satisfy the above formula.
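As a quick numerical sanity check of this intuition, a few lines of PyTorch (an illustrative sketch of ours; the helper name dmi is not from the paper) confirm that an independent joint distribution has zero DMI while a correlated one does not:

```python
import torch

def dmi(joint):
    """DMI(X, Y) = |det(Q_{X,Y})| for a joint-distribution matrix Q."""
    return torch.abs(torch.det(joint))

# Independent X, Y: the joint is an outer product of the marginals,
# so it has rank one and DMI is (numerically) 0.
p_x = torch.tensor([0.7, 0.3])
p_y = torch.tensor([0.4, 0.6])
print(dmi(torch.outer(p_x, p_y)))  # ~ 0

# Perfectly correlated X, Y: diagonal joint, strictly positive DMI.
print(dmi(torch.diag(p_x)))        # 0.21
```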

Based on DMI, we propose a noise-robust loss function, which is simply

$$\mathcal{L}_{DMI}(h) = -\log\big(\mathrm{DMI}(h(X), \tilde{Y})\big).$$

With $\mathcal{L}_{DMI}$, the following equation holds:

$$\mathcal{L}_{DMI}(h;\ \text{noisy labels}) = \mathcal{L}_{DMI}(h;\ \text{clean labels}) + \alpha,$$

where $\alpha$ depends only on the noise and is a constant given the dataset. The equation reveals that with $\mathcal{L}_{DMI}$, training with the noisy labels is theoretically equivalent to training with the clean labels in the dataset, regardless of the noise pattern, including the noise amount.

In a nutshell, our main contributions can be summarized in the following three points:

  • We propose a new information measure, DMI, which is information-monotone and relatively invariant. Under the performance measure based on DMI, the measurement based on noisy labels is consistent with the measurement based on clean labels.

  • We design a novel information-theoretic noise-robust loss function, L_DMI, based on the above information measure. It is insensitive to noise patterns and can be applied to any existing classification neural network straightforwardly without any auxiliary information.

  • Extensive experiments on the Fashion-MNIST, CIFAR-10 and Dogs vs. Cats datasets with a variety of synthesized noise patterns and noise amounts, as well as on the real-world dataset Clothing1M, demonstrate the superior performance of L_DMI.

2 Related Work

A series of works have attempted to design noise-robust loss functions. In the context of binary classification, some loss functions (e.g., 0-1 loss (Manwani and Sastry (2013)), ramp loss (Brooks (2011)), unhinged loss (Van Rooyen et al. (2015)), savage loss (Masnadi-Shirazi and Vasconcelos (2009))) have been proved to be robust to uniform or symmetric noise, and Natarajan et al. (2013) presented a general way to modify any given surrogate loss function. Ghosh et al. (2017) generalized the existing results from the binary classification problem to the multi-class classification problem and proved that MAE (Mean Absolute Error) is robust to diagonally dominant noise. Zhang and Sabuncu (2018) showed that MAE performs poorly with deep neural networks, and they combined MAE and the cross entropy loss to obtain a new loss function. Patrini et al. (2017) provided two kinds of loss correction methods using the noise transition matrix. Hendrycks et al. (2018) proposed another loss correction technique that uses an additional set of clean data. To the best of our knowledge, we are the first to provide a loss function that is provably insensitive to noise patterns and noise amounts.

Instead of designing an inherently noise-robust loss function, several works used special architectures to deal with the problem of training deep neural networks with noisy labels. Some of them focused on estimating the noise transition matrix to handle the label noise and proposed a variety of ways to constrain the optimization (Sukhbaatar et al. (2014); Xiao et al. (2015); Goldberger and Ben-Reuven (2016); Vahdat (2017); Han et al. (2018a); Yao et al. (2019)). Others focused on distinguishing noisy labels from clean labels and used example re-weighting strategies to give the noisy labels less weight (Reed et al. (2014); Ren et al. (2018); Ma et al. (2018)). While these methods seem to perform well in practice, they cannot guarantee robustness to label noise theoretically, and they are also outperformed by our method empirically.

On the other hand, Zhang et al. (2016) have shown that deep neural networks can easily memorize completely random labels, so several works propose frameworks to prevent this overfitting issue empirically in the setting of deep learning from noisy labels. For example, the teacher-student curriculum learning framework (Jiang et al. (2017)) and the co-teaching framework (Han et al. (2018b)) have been shown to be helpful. Multi-task frameworks that jointly estimate true labels and learn to classify images have also been introduced (Veit et al. (2017); Lee et al. (2018); Tanaka et al. (2018); Yi and Wu (2019)). We take a different perspective and focus on designing an inherently noise-robust loss function.

3 Preliminaries

3.1 Problem settings

We denote the set of classes by $\mathcal{C}$ and the size of $\mathcal{C}$ by $C$. We also denote the domain of datapoints by $\mathcal{X}$. A classifier is denoted by $h : \mathcal{X} \to \Delta_{\mathcal{C}}$, where $\Delta_{\mathcal{C}}$ is the set of all possible distributions over $\mathcal{C}$. $h$ represents a randomized classifier such that, given $x \in \mathcal{X}$, $h(x)_c$ is the probability that $h$ maps $x$ into class $c$. Note that, fixing the input $x$, the randomness of a classifier is independent of everything else.

There are $N$ datapoints $\{x_i\}_{i=1}^{N}$. For each datapoint $x_i$, there is an unknown ground truth label $y_i \in \mathcal{C}$. We assume that there is an unknown prior distribution $Q_{X,Y}$ over $\mathcal{X} \times \mathcal{C}$ such that $\{(x_i, y_i)\}_{i=1}^{N}$ are i.i.d. samples drawn from $Q_{X,Y}$.

Note that here we allow the datapoints to be "imperfect" instances, i.e., there may still exist uncertainty about $Y$ even conditioning on fully knowing $X$.

Traditional supervised learning aims to train a classifier that is able to classify new datapoints into their ground truth categories with access to $\{(x_i, y_i)\}_{i=1}^{N}$. However, in the setting of learning with noisy labels, we instead only have access to $\{(x_i, \tilde{y}_i)\}_{i=1}^{N}$, where $\tilde{y}_i$ is a noisy version of $y_i$.

We use a random variable $\tilde{Y}$ to denote the noisy version of $Y$ and $T_{\tilde{Y}|Y}$ to denote the transition distribution between $Y$ and $\tilde{Y}$, i.e.,

$$T_{\tilde{Y}|Y}(\tilde{y}\,|\,y) = \Pr[\tilde{Y} = \tilde{y} \mid Y = y].$$

We use $T$ to represent the matrix format of $T_{\tilde{Y}|Y}$.

Generally speaking (Patrini et al. (2017); Ghosh et al. (2017); Zhang and Sabuncu (2018)), label noise can be divided into several kinds according to the noise transition matrix $T$. It is defined as class-independent (or uniform) if a label is substituted by a uniformly random label regardless of the classes, i.e., all off-diagonal entries in each row of $T$ are equal. It is defined as diagonally dominant if, for every row of $T$, the magnitude of the diagonal entry is larger than that of any non-diagonal entry, i.e., $T_{yy} > T_{y\tilde{y}}$ for all $\tilde{y} \neq y$. It is defined as diagonally non-dominant if it is not diagonally dominant (e.g., the careless-annotator example mentioned in the introduction).
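For concreteness, the three patterns can be illustrated by $3 \times 3$ transition matrices such as the following (illustrative values of ours, with rows indexed by the true label):

$$T_{\text{uniform}} = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.1 & 0.8 & 0.1 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}, \quad T_{\text{dominant}} = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.1 & 0.8 & 0.1 \\ 0.2 & 0.2 & 0.6 \end{pmatrix}, \quad T_{\text{non-dominant}} = \begin{pmatrix} 0.4 & 0.5 & 0.1 \\ 0.1 & 0.8 & 0.1 \\ 0.1 & 0.2 & 0.7 \end{pmatrix},$$

where $T_{\text{uniform}}$ resamples a label uniformly at random with probability 0.3, and the first row of $T_{\text{non-dominant}}$ has an off-diagonal entry (0.5) exceeding its diagonal entry (0.4).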

We assume that the noise is independent of the datapoints conditioning on the ground truth, which is commonly assumed in the literature (Patrini et al. (2017); Ghosh et al. (2017); Zhang and Sabuncu (2018)):

Assumption 3.1 (Independent noise).

$\tilde{Y}$ is independent of $X$ conditioning on $Y$.

We also need the noisy version to still be informative.

Assumption 3.2 (Informative noisy label).

$T$ is invertible, i.e., $\det(T) \neq 0$.

3.2 Information theory concepts

Since Shannon’s seminal work Shannon (1948), information theory has shown its powerful impact in various of fields, including several recent deep learning works Hjelm et al. (2018); Cao et al. (2018). Our work is also inspired by information theory. This section introduces several basic information theory concepts.

Information theory is commonly concerned with random variables. For every random variable $W$, Shannon's entropy $H(W)$ measures the uncertainty of $W$; for example, a deterministic $W$ has the lowest entropy. For every two random variables $W_1$ and $W_2$, the Shannon mutual information $MI(W_1, W_2)$ measures the amount of relevance between $W_1$ and $W_2$; for example, when $W_1$ and $W_2$ are independent, they have the lowest Shannon mutual information, zero.
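For reference, for discrete random variables these quantities are defined as

$$H(W) = -\sum_{w} \Pr[W = w] \log \Pr[W = w], \qquad MI(W_1, W_2) = \sum_{w_1, w_2} \Pr[W_1 = w_1, W_2 = w_2] \log \frac{\Pr[W_1 = w_1, W_2 = w_2]}{\Pr[W_1 = w_1]\,\Pr[W_2 = w_2]}.$$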

Shannon mutual information is non-negative and symmetric, i.e., $MI(W_1, W_2) = MI(W_2, W_1)$, and it also satisfies a desirable property, information-monotonicity: the mutual information between $W_1$ and $W_2$ always decreases if either $W_1$ or $W_2$ is further "processed".

Fact 3.3 (Information-monotonicity, Csiszár et al. (2004)).

For all random variables $W_1, W_2, W_3$, when $W_3$ is less informative for $W_2$ than $W_1$ is, i.e., $W_3$ is independent of $W_2$ conditioning on $W_1$,

$$MI(W_3, W_2) \leq MI(W_1, W_2).$$

This property naturally implies that for all random variables $W_1, W_2$,

$$MI(W_1, W_2) \leq MI(W_2, W_2),$$

since $W_2$ is always the most informative random variable for itself.

Based on Shannon mutual information, a performance measure for a classifier can be naturally defined. A high quality classifier's output $h(X)$ should have high mutual information with the ground truth category $Y$. Thus, a classifier $h$'s performance can be measured by $MI(h(X), Y)$.

However, in our setting, we only have access to i.i.d. samples of $X$ and $\tilde{Y}$. A natural attempt is to measure a classifier $h$'s performance by $MI(h(X), \tilde{Y})$. Unfortunately, under this performance measure, the measurement based on noisy labels $\tilde{Y}$ may not be consistent with the measurement based on true labels $Y$ (see a counterexample in Appendix B). That is, it can happen that

$$MI(h_1(X), \tilde{Y}) > MI(h_2(X), \tilde{Y}) \quad \text{while} \quad MI(h_1(X), Y) < MI(h_2(X), Y).$$

Thus, we cannot use Shannon mutual information as the performance measure for classifiers. Here we give a new version of mutual information, Determinant based Mutual Information (DMI), such that under the performance measure based on DMI, the measurement based on noisy labels is consistent with the measurement based on true labels.

Definition 3.4 (Determinant based Mutual Information).

Given two discrete random variables $W_1, W_2$, we define the Determinant based Mutual Information between $W_1$ and $W_2$ as

$$DMI(W_1, W_2) = |\det(Q_{W_1, W_2})|,$$

where $Q_{W_1, W_2}$ is the matrix format of the joint distribution over $W_1$ and $W_2$.

DMI is a generalized version of Shannon's mutual information: it preserves all properties of Shannon mutual information, including non-negativity, symmetry and information-monotonicity, and it is additionally relatively invariant.

Theorem 3.5 (Properties of DMI).

DMI is non-negative, symmetric and information-monotone. Moreover, it is relatively invariant: for all random variables $W_1, W_2, W_3$, when $W_3$ is less informative for $W_2$ than $W_1$ is, i.e., $W_3$ is independent of $W_2$ conditioning on $W_1$,

$$DMI(W_2, W_3) = DMI(W_2, W_1)\,|\det(T_{W_3|W_1})|,$$

where $T_{W_3|W_1}$ is the matrix format of $\Pr[W_3 = w_3 \mid W_1 = w_1]$.

Proof.

The non-negativity and symmetry follow directly from the definition, so we only need to prove the relative invariance. Note that

$$\Pr[W_2 = w_2, W_3 = w_3] = \sum_{w_1} \Pr[W_2 = w_2, W_1 = w_1]\,\Pr[W_3 = w_3 \mid W_1 = w_1],$$

as $W_3$ is independent of $W_2$ conditioning on $W_1$. Thus, in matrix format,

$$Q_{W_2, W_3} = Q_{W_2, W_1}\, T_{W_3|W_1},$$

where $Q_{W_2, W_3}$, $Q_{W_2, W_1}$, $T_{W_3|W_1}$ are the matrix formats of the corresponding distributions. We have

$$\det(Q_{W_2, W_3}) = \det(Q_{W_2, W_1}) \det(T_{W_3|W_1})$$

because of the multiplication property of the determinant (i.e., $\det(AB) = \det(A)\det(B)$ for every two square matrices $A, B$). Therefore, $DMI(W_2, W_3) = DMI(W_2, W_1)\,|\det(T_{W_3|W_1})|$.

The relative invariance and the symmetry imply the information-monotonicity of DMI. When $W_3$ is less informative for $W_2$ than $W_1$ is,

$$DMI(W_2, W_3) \leq DMI(W_2, W_1),$$

because of the fact that for every square transition matrix $T$, $|\det(T)| \leq 1$ (Seneta (2006)). ∎

Based on DMI, an information-theoretic performance measure for each classifier $h$ is naturally defined as $DMI(h(X), Y)$. Under this performance measure, the measurement based on noisy labels $\tilde{Y}$ is consistent with the measurement based on clean labels $Y$, i.e., for every two classifiers $h_1$ and $h_2$,

$$DMI(h_1(X), \tilde{Y}) \geq DMI(h_2(X), \tilde{Y}) \iff DMI(h_1(X), Y) \geq DMI(h_2(X), Y).$$

4 L_DMI: An Information-theoretic Noise-robust Loss Function

4.1 Method overview

Our loss function is defined as

$$\mathcal{L}_{DMI}(h) := -\log\big(DMI(h(X), \tilde{Y})\big) = -\log\big|\det(Q_{h(X), \tilde{Y}})\big|,$$

where $Q_{h(X), \tilde{Y}}$ is the matrix format of the joint distribution over $h(X)$ and $\tilde{Y}$. The randomness of $h(X)$ comes from both the randomness of $X$ and the randomness of $h$. The $\log$ here resolves many scaling issues (e.g., for a $C \times C$ matrix $A$ and a constant $k$, $\det(kA) = k^C \det(A)$, while $\log|\det(kA)| = C \log k + \log|\det(A)|$).

Figure 1: The computation of L_DMI in each iteration step.

Figure 1 shows the computation of $\mathcal{L}_{DMI}$. In each iteration step, we sample a batch of $N$ datapoints and their noisy labels. We denote the outputs of the classifier on this batch by a $C \times N$ matrix $O$: each column of $O$ is a distribution over $\mathcal{C}$, representing the classifier's output for one datapoint. We denote the noisy labels by an $N \times C$ 0-1 matrix $L$: each row of $L$ is a one-hot vector representing one label, i.e., $L_{ic} = 1$ if and only if $\tilde{y}_i = c$.

We define $U := \frac{1}{N} O L$, i.e.,

$$U_{cc'} = \frac{1}{N}\sum_{i=1}^{N} O_{ci} L_{ic'}.$$

We have $\mathbb{E}[U] = Q_{h(X), \tilde{Y}}$ ($\mathbb{E}$ means expectation; see the proof in Appendix B). Thus, $-\log|\det(U)|$ is an empirical estimation of $\mathcal{L}_{DMI}(h)$. Abusing notation a little, we define

$$\mathcal{L}_{DMI}(h;\ \text{batch}) := -\log|\det(U)|$$

as the empirical loss function. Our formal training process is shown in Appendix A.
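The empirical loss translates directly into a few lines of PyTorch. The following is a minimal sketch (the function name, the softmax over raw logits, and the small epsilon guarding against a near-singular $U$ are our choices; see the linked repository for the authors' official implementation):

```python
import torch
import torch.nn.functional as F

def dmi_loss(logits: torch.Tensor, noisy_labels: torch.Tensor) -> torch.Tensor:
    """Empirical L_DMI = -log|det(U)| with U = (1/N) O L.

    logits: (N, C) raw classifier outputs for a batch.
    noisy_labels: (N,) integer noisy labels in {0, ..., C-1}.
    """
    num_classes = logits.shape[1]
    # O^T: each row is the predicted distribution over classes, shape (N, C).
    probs = F.softmax(logits, dim=1)
    # L: one-hot matrix of the noisy labels, shape (N, C).
    one_hot = F.one_hot(noisy_labels, num_classes).float()
    # U = (1/N) O L, an empirical C x C joint-distribution matrix.
    U = probs.t() @ one_hot / logits.shape[0]
    # L_DMI(h; batch) = -log|det(U)|.
    return -torch.log(torch.abs(torch.det(U)) + 1e-6)
```

One motivation for the cross entropy pretraining used in Section 5 and Algorithm 1 is that $\det(U)$ can be near zero for an untrained network, which makes this loss unstable at initialization.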

4.2 Theoretical justification

Theorem 4.1 (Main Theorem).

With Assumption 3.1 and Assumption 3.2, $\mathcal{L}_{DMI}$ is:

legal

if there exists a ground truth classifier $h^*$ such that $h^*(X) = Y$, then it must have the lowest loss, i.e., for every classifier $h$,

$$\mathcal{L}_{DMI}(h^*) \leq \mathcal{L}_{DMI}(h),$$

and the inequality is strict when $h(X)$ is not a permutation of $h^*(X)$, i.e., there does not exist a permutation $\pi : \mathcal{C} \to \mathcal{C}$ s.t. $h(X) = \pi(h^*(X))$;

noise-robust

for the set $\mathcal{H}$ of all possible classifiers,

$$\operatorname*{argmin}_{h \in \mathcal{H}} \mathcal{L}_{DMI}(h;\ \text{noisy labels}) = \operatorname*{argmin}_{h \in \mathcal{H}} \mathcal{L}_{DMI}(h;\ \text{clean labels}),$$

and in fact, training using noisy labels is the same as training using clean labels in the dataset except for a constant shift,

$$\mathcal{L}_{DMI}(h;\ \text{noisy labels}) = \mathcal{L}_{DMI}(h;\ \text{clean labels}) + \alpha;$$

information-monotone

for every two classifiers $h_1, h_2$, if $h_2(X)$ is less informative for $\tilde{Y}$ than $h_1(X)$ is, i.e., $h_2(X)$ is independent of $\tilde{Y}$ conditioning on $h_1(X)$, then

$$\mathcal{L}_{DMI}(h_2) \geq \mathcal{L}_{DMI}(h_1).$$

Proof.

The relative invariance of DMI (Theorem 3.5) implies

$$DMI(h(X), \tilde{Y}) = DMI(h(X), Y)\,|\det(T)|.$$

Therefore,

$$\mathcal{L}_{DMI}(h;\ \text{noisy labels}) = -\log DMI(h(X), Y) - \log|\det(T)| = \mathcal{L}_{DMI}(h;\ \text{clean labels}) + \alpha.$$

Thus, the information-monotonicity and the noise-robustness of $\mathcal{L}_{DMI}$ follow, with the constant $\alpha = -\log|\det(T)|$.

The legal property follows from the information-monotonicity of $\mathcal{L}_{DMI}$, as $Y$ is the most informative random variable for itself, and from the fact that for every square transition matrix $T$, $|\det(T)| = 1$ if and only if $T$ is a permutation matrix (Seneta (2006)). ∎
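The constant shift in the noise-robust property is easy to verify numerically. The following sketch (our construction; random matrices stand in for $Q_{h(X),Y}$ and $T$) checks that corrupting the labels changes the loss by exactly $-\log|\det(T)|$:

```python
import torch

torch.manual_seed(0)
C = 4

# A random joint distribution standing in for Q_{h(X), Y} (entries sum to 1).
Q_clean = torch.rand(C, C)
Q_clean /= Q_clean.sum()

# A random row-stochastic noise transition matrix T (rows sum to 1).
T = torch.softmax(torch.randn(C, C), dim=1)

# Under Assumption 3.1, the noisy joint is Q_{h(X), Y~} = Q_{h(X), Y} T.
Q_noisy = Q_clean @ T

loss = lambda Q: -torch.log(torch.abs(torch.det(Q)))
shift = -torch.log(torch.abs(torch.det(T)))

# loss(noisy) == loss(clean) + shift, up to floating-point error.
print(loss(Q_noisy).item(), (loss(Q_clean) + shift).item())
```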

5 Experiments

We evaluate our method on both synthesized and real-world noisy datasets with different deep neural networks to demonstrate that our method is independent of both architecture and data domain. We call our method DMI and compare it with: CE (the cross entropy loss), FW (the forward loss, Patrini et al. (2017)), GCE (the generalized cross entropy loss, Zhang and Sabuncu (2018)) and LCCN (the latent class-conditional noise model, Yao et al. (2019)). For the synthesized data, noise is added to the training and validation sets, and test accuracy is computed with respect to the true labels. For our method, we pick the best learning rate based on the minimum validation loss. For the other methods, we use the best hyperparameters they provided in similar settings. The classifiers are pretrained with the cross entropy loss first. All reported experiments were repeated five times. We implement all networks and training procedures in PyTorch (Paszke et al. (2017)) and conduct all experiments on NVIDIA TITAN Xp GPUs. The explicit noise transition matrix and the experimental data for each synthesized case are shown in Appendix C and Appendix D.

5.1 Experiments on Fashion-MNIST

Fashion-MNIST (Xiao et al. (2017)) consists of 70,000 grayscale fashion product images from 10 classes, split into a training set, a validation set and a test set. We convert the labels into two classes, bags and clothes, to synthesize a highly imbalanced dataset (bags correspond to a single original class, so clothes outnumber bags roughly nine to one). We use a simple two-layer convolutional neural network as the classifier, and Adam with default parameters is used as the optimizer during training.

We synthesize three cases of noise patterns. (1) Class-independent: with probability $r$, a true label is substituted by a random label through uniform sampling. (2) Class-dependent (a): with probability $r$, bags → clothes, that is, a true label of the a priori less popular class, "bags", is flipped to the popular one, "clothes". This happens in the real world when annotators are lazy (e.g., a careless medical image annotator may be more likely to label "benign" since most images are in the "benign" category). (3) Class-dependent (b): with probability $r$, clothes → bags, that is, the a priori more popular class, "clothes", is flipped to the other one, "bags". This happens in the real world when annotators are risk-averse and there are smaller adverse effects when labeling an image into a certain class (e.g., a risk-averse medical image annotator may be more likely to label "malignant", since it is usually safer when the annotator is not confident, even if it is less likely a priori). Note that the parameter $r$ in the above three cases also represents the amount of noise: when $r = 0$, the labels are clean, and when $r = 1$, the labels are totally uninformative. Moreover, in cases (2) and (3), as $r$ increases, the noise pattern changes from diagonally dominant to diagonally non-dominant.
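A sketch of how such noise can be synthesized from a transition matrix (the helper name and the binary class ordering, 0 = bags and 1 = clothes, are our assumptions):

```python
import torch

def corrupt(labels: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    """Sample noisy labels with Pr[y_noisy = j | y_true = i] = T[i, j]."""
    return torch.multinomial(T[labels], num_samples=1).squeeze(1)

r = 0.6
# Case (2): bags -> clothes with probability r; clothes stay untouched.
T_case2 = torch.tensor([[1 - r, r],
                        [0.0,   1.0]])
noisy = corrupt(torch.tensor([0, 0, 1, 1]), T_case2)
```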

Figure 2: Test accuracy (mean and std. dev.) on Fashion-MNIST.

Note that CE is distance-based and DMI is information-theoretic. As we mentioned in the introduction, CE performs badly when the noise is diagonally non-dominant and the labels are biased toward one class, as distance-based losses prefer the meaningless classifier that always outputs the majority class of the labels (the always-"clothes" classifier in case (2) and the always-"bags" classifier in case (3)). The experimental results match this expectation: CE performs similarly to our DMI for diagonally dominant noise; for diagonally non-dominant noise, however, CE only obtains the meaningless classifier while DMI still performs well.

5.2 Experiments on CIFAR-10 and Dogs vs. Cats

CIFAR-10 [3] consists of 60,000 color images from 10 classes, split into a training set, a validation set and a test set. Dogs vs. Cats [5] consists of images from 2 classes, dogs and cats, likewise split into training, validation and test sets. We use ResNet-34 as the classifier for CIFAR-10 and VGG-16 as the classifier for Dogs vs. Cats. SGD with a momentum of 0.9 and weight decay is used as the optimizer during training. We use per-pixel normalization, horizontal random flips and random crops after padding as data augmentation. Following Yao et al. (2019), the noise for CIFAR-10 is added between similar classes, i.e., truck → automobile, bird → airplane, deer → horse, cat ↔ dog, with probability $r$. The noise for Dogs vs. Cats is added as cat → dog with probability $r$.

As shown in Figure 3, our method DMI outperforms all other methods in almost every experiment, and its accuracy drops slowly as the noise amount increases. GCE performs well under diagonally dominant noise but fails under diagonally non-dominant noise. This phenomenon matches its theory: it assumes that the label noise is diagonally dominant. FW needs to pre-estimate a noise transition matrix before training, and LCCN uses the output of the model to estimate the true labels. These tasks become harder as the noise amount grows larger, so their performance also drops quickly as the noise amount increases.

Figure 3: Test accuracy (mean and std. dev.) on CIFAR-10 and Dogs vs. Cats.

5.3 Experiments on Clothing1M

Clothing1M (Xiao et al. (2015)) is a large-scale real-world dataset consisting of 1 million images of clothes collected from shopping websites, with noisy labels from 14 classes assigned according to the surrounding text provided by the sellers. It has an additional 14k and 10k clean images for validation and test, respectively. We use ResNet-50 as the classifier and apply random crops, random flips, and brightness and saturation adjustments as data augmentation. SGD with momentum and weight decay is used as the optimizer during training, and the learning rate is decayed once partway through training.

Table 1: Test accuracy (mean) on Clothing1M; columns: CE, FW, GCE, LCCN, DMI.

As shown in Table 1, DMI also outperforms the other methods in the real-world setting.

6 Conclusion

We propose a simple yet powerful loss function, L_DMI, for training deep neural networks robust to label noise. It is based on DMI, a generalized version of mutual information. We provide theoretical validation of our approach and compare it experimentally with previous methods on both synthesized and real-world datasets. To the best of our knowledge, L_DMI is the first loss function that is provably not sensitive to noise patterns and noise amounts, and it can be applied to any existing classification neural network architecture straightforwardly without any auxiliary information.

References

  • J. P. Brooks (2011) Support vector machines with the ramp loss and the hard margin loss. Operations research 59 (2), pp. 467–479. Cited by: §2.
  • P. Cao, Y. Xu, Y. Kong, and Y. Wang (2018) Max-mig: an information theoretic approach for joint learning from crowds. Cited by: §3.2.
  • [3] CIFAR-10 and CIFAR-100 datasets. Note: https://www.cs.toronto.edu/~kriz/cifar.html, 2009. Cited by: §5.2.
  • I. Csiszár, P. C. Shields, et al. (2004) Information theory and statistics: a tutorial. Foundations and Trends® in Communications and Information Theory 1 (4), pp. 417–528. Cited by: Fact 3.3.
  • [5] Dogs vs. Cats competition. Note: https://www.kaggle.com/c/dogs-vs-cats, 2013. Cited by: §5.2.
  • A. Ghosh, H. Kumar, and P. Sastry (2017) Robust loss functions under label noise for deep neural networks. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §1, §2, §3.1.
  • J. Goldberger and E. Ben-Reuven (2016) Training deep neural-networks using a noise adaptation layer. Cited by: §2.
  • B. Han, J. Yao, G. Niu, M. Zhou, I. Tsang, Y. Zhang, and M. Sugiyama (2018a) Masking: a new perspective of noisy supervision. In Advances in Neural Information Processing Systems, pp. 5836–5846. Cited by: §2.
  • B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018b) Co-teaching: robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pp. 8527–8537. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1.
  • D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel (2018) Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems, pp. 10456–10465. Cited by: §1, §2.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §3.2.
  • L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2017) Mentornet: regularizing very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055 4. Cited by: §2.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • K. Lee, X. He, L. Zhang, and L. Yang (2018) CleanNet: transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5447–5456. Cited by: §2.
  • X. Ma, Y. Wang, M. E. Houle, S. Zhou, S. M. Erfani, S. Xia, S. Wijewickrema, and J. Bailey (2018) Dimensionality-driven learning with noisy labels. arXiv preprint arXiv:1806.02612. Cited by: §2.
  • N. Manwani and P. Sastry (2013) Noise tolerance under risk minimization. IEEE transactions on cybernetics 43 (3), pp. 1146–1151. Cited by: §2.
  • H. Masnadi-Shirazi and N. Vasconcelos (2009) On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost. In Advances in Neural Information Processing Systems, pp. 1049–1056. Cited by: §2.
  • N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari (2013) Learning with noisy labels. In Advances in neural information processing systems, pp. 1196–1204. Cited by: §2.
  • A. Paszke, S. Gross, S. Chintala, and G. Chanan (2017) PyTorch: tensors and dynamic neural networks in Python with strong GPU acceleration. Cited by: §5.
  • G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952. Cited by: §1, §2, §3.1, §3.1, §5.
  • S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich (2014) Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596. Cited by: §2.
  • M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050. Cited by: §2.
  • E. Seneta (2006) Non-negative matrices and Markov chains. Springer Science & Business Media. Cited by: §3.2, §4.2.
  • C. E. Shannon (1948) A mathematical theory of communication. Bell System Technical Journal 27 (3), pp. 379–423. Cited by: §3.2.
  • S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus (2014) Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080. Cited by: §2.
  • D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa (2018) Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5552–5560. Cited by: §2.
  • A. Vahdat (2017) Toward robustness against label noise in training deep discriminative neural networks. In Advances in Neural Information Processing Systems, pp. 5596–5605. Cited by: §2.
  • B. Van Rooyen, A. Menon, and R. C. Williamson (2015) Learning with symmetric label noise: the importance of being unhinged. In Advances in Neural Information Processing Systems, pp. 10–18. Cited by: §2.
  • A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie (2017) Learning from noisy large-scale datasets with minimal supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 839–847. Cited by: §2.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §5.1.
  • T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang (2015) Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2691–2699. Cited by: §2, §5.3.
  • J. Yao, H. Wu, Y. Zhang, I. W. Tsang, and J. Sun (2019) Safeguarded dynamic label regression for noisy supervision. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §2, §5, §5.2.
  • K. Yi and J. Wu (2019) Probabilistic end-to-end noise correction for learning with noisy labels. arXiv preprint arXiv:1903.07788. Cited by: §2.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §2.
  • Z. Zhang and M. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, pp. 8778–8788. Cited by: §1, §2, §3.1, §3.1, §5.

Appendix A Training Process

0:  A training dataset $D_t$, a validation dataset $D_v$, a classifier $h_\theta$ modeled by a deep neural network, the running epoch number $E$, the learning rate $\alpha$ and the batch size $M$.
1:  Pretrain the classifier $h_\theta$ on the dataset $D_t$ with the cross entropy loss
2:  Initialize the best classifier: $h^* \leftarrow h_\theta$
3:  Randomly sample a batch $B_v$ of $M$ samples from the validation dataset $D_v$
4:  Initialize the minimum validation loss: $L^* \leftarrow \mathcal{L}_{DMI}(h_\theta; B_v)$
5:  for epoch $= 1, \dots, E$ do
6:     for batch $= 1, \dots, \lfloor |D_t| / M \rfloor$ do
7:        Randomly sample a batch $B$ of $M$ samples from the training dataset $D_t$
8:        Compute the training loss: $L \leftarrow \mathcal{L}_{DMI}(h_\theta; B)$
9:        Update $\theta$: $\theta \leftarrow \theta - \alpha \nabla_\theta L$
10:     end for
11:     Randomly sample a batch $B_v$ of $M$ samples from the validation dataset $D_v$
12:     Compute the validation loss: $L_v \leftarrow \mathcal{L}_{DMI}(h_\theta; B_v)$
13:     if $L_v < L^*$ then
14:        Update the minimum validation loss: $L^* \leftarrow L_v$
15:        Update the best classifier: $h^* \leftarrow h_\theta$
16:     end if
17:  end for
18:  return the best classifier $h^*$
Algorithm 1 The training process with $\mathcal{L}_{DMI}$
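A condensed PyTorch rendering of Algorithm 1 is sketched below under assumed names: model, train_loader and val_batch are placeholders, dmi_loss is the empirical loss from Section 4.1, the cross entropy pretraining (line 1) is elided, and for simplicity one fixed validation batch is reused rather than resampled each epoch.

```python
import copy
import torch

def train_dmi(model, train_loader, val_batch, epochs, lr):
    # Lines 2-4: initialize the incumbent best classifier and its
    # validation loss on a (fixed) validation batch.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    x_val, y_val = val_batch
    best_model = copy.deepcopy(model)
    with torch.no_grad():
        best_loss = dmi_loss(model(x_val), y_val)

    for _ in range(epochs):
        # Lines 6-10: gradient steps on the empirical L_DMI of each batch.
        for x, y_noisy in train_loader:
            opt.zero_grad()
            dmi_loss(model(x), y_noisy).backward()
            opt.step()
        # Lines 11-16: keep the classifier with the lowest validation loss.
        with torch.no_grad():
            val_loss = dmi_loss(model(x_val), y_val)
        if val_loss < best_loss:
            best_loss, best_model = val_loss, copy.deepcopy(model)
    return best_model
```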

Appendix B Other Proofs

Claim B.1.

$\mathbb{E}[U] = Q_{h(X), \tilde{Y}}$, where $U = \frac{1}{N} O L$ is defined as in Section 4.1.

Proof.

Recall that the randomness of $h(X)$ comes from both $X$ and $h$, and that, fixing the input, the randomness of $h$ is independent of everything else. For every pair of classes $c, c'$,

$$\mathbb{E}[U_{cc'}] = \mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^{N} O_{ci} L_{ic'}\Big] = \mathbb{E}_{(x, \tilde{y}) \sim Q_{X, \tilde{Y}}}\big[h(x)_c\, \mathbb{1}[\tilde{y} = c']\big] \qquad \text{(i.i.d. samples)}$$

$$= \sum_{x} \Pr[X = x, \tilde{Y} = c']\, \Pr[h(X) = c \mid X = x] \qquad \text{(definition of randomized classifier)}$$

$$= \Pr[h(X) = c, \tilde{Y} = c'] = (Q_{h(X), \tilde{Y}})_{cc'}. \qquad \text{(fixing the input, the randomness of $h$ is independent of everything else)}$$

∎

Claim B.2.

Under the performance measure based on Shannon mutual information, the measurement based on noisy labels $\tilde{Y}$ is not necessarily consistent with the measurement based on true labels $Y$, i.e., there exist classifiers $h_1$ and $h_2$ with

$$MI(h_1(X), Y) > MI(h_2(X), Y) \quad \text{but} \quad MI(h_1(X), \tilde{Y}) < MI(h_2(X), \tilde{Y}).$$

Proof.

See a counterexample: pick joint distributions $Q_{h_1(X), Y}$ and $Q_{h_2(X), Y}$ (sharing the same marginal over $Y$) and a noise transition matrix $T$ such that, for the induced noisy joint distributions $Q_{h_i(X), \tilde{Y}} = Q_{h_i(X), Y}\, T$, we get $MI(h_1(X), Y) > MI(h_2(X), Y)$ while $MI(h_1(X), \tilde{Y}) < MI(h_2(X), \tilde{Y})$. Therefore, Shannon mutual information does not preserve the order of classifiers under label noise. ∎
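Concrete counterexamples, even for two classes, are easy to find by random search. The sketch below (ours, not the paper's original counterexample) computes Shannon MI directly from joint-distribution matrices and prints the first order flip it encounters:

```python
import torch

def mi(joint):
    """Shannon mutual information from a joint-distribution matrix."""
    px = joint.sum(dim=1, keepdim=True)  # marginal of the classifier output
    py = joint.sum(dim=0, keepdim=True)  # marginal of the label
    return (joint * torch.log(joint / (px @ py))).sum()

torch.manual_seed(0)
p_y = torch.tensor([0.5, 0.5])           # shared prior over the true label Y
for _ in range(100_000):
    # Columns of A_i are Pr[h_i(X) = c | Y = y]; both joints share the Y-marginal.
    A1, A2 = (torch.softmax(torch.randn(2, 2) * 3, dim=0) for _ in range(2))
    Q1, Q2 = A1 * p_y, A2 * p_y
    T = torch.softmax(torch.randn(2, 2) * 3, dim=1)  # noise transition matrix
    # Noisy joints under Assumption 3.1: Q_{h(X), Y~} = Q_{h(X), Y} T.
    if mi(Q1) > mi(Q2) and mi(Q1 @ T) < mi(Q2 @ T):
        print("counterexample:", Q1, Q2, T, sep="\n")
        break
```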

Appendix C Noise Transition Matrices

The matrices below follow from the case definitions in Sections 5.1 and 5.2; rows are indexed by the true label and columns by the noisy label.

On Fashion-MNIST (classes ordered as (bags, clothes)), case (1): $T = \begin{pmatrix} 1 - r/2 & r/2 \\ r/2 & 1 - r/2 \end{pmatrix}$;

On Fashion-MNIST, case (2): $T = \begin{pmatrix} 1 - r & r \\ 0 & 1 \end{pmatrix}$;

On Fashion-MNIST, case (3): $T = \begin{pmatrix} 1 & 0 \\ r & 1 - r \end{pmatrix}$;

On CIFAR-10, $T$ maps truck → automobile, bird → airplane, deer → horse and cat ↔ dog, each with probability $r$, and leaves the remaining mass on the diagonal;

On Dogs vs. Cats (classes ordered as (cat, dog)), $T = \begin{pmatrix} 1 - r & r \\ 0 & 1 \end{pmatrix}$.

For Fashion-MNIST, case (1), the noises with $r < 1$ are diagonally dominant.

For the other cases, the noises with $r < 0.5$ are diagonally dominant and the noises with $r > 0.5$ are diagonally non-dominant.

Appendix D Experimental Data

Table 2: Test accuracy on Fashion-MNIST (mean ± std. dev.); columns: CE and DMI for each of cases (1), (2) and (3).

Table 3: Test accuracy on CIFAR-10 (mean ± std. dev.); columns: CE, FW, GCE, LCCN, DMI.

Table 4: Test accuracy on Dogs vs. Cats (mean ± std. dev.); columns: CE, FW, GCE, LCCN, DMI.