Gold Loss Correction
The growing importance of massive datasets with the advent of deep learning makes robustness to label noise a critical property for classifiers to have. Sources of label noise include automatic labeling for large datasets, non-expert labeling, and label corruption by data poisoning adversaries. In the last case, corruptions may be arbitrarily bad, even so bad that a classifier predicts the wrong labels with high confidence. To protect against such sources of noise, we leverage the fact that a small set of clean labels is often easy to procure. We demonstrate that robustness to label noise up to severe strengths can be achieved by using a set of trusted data with clean labels, and propose a loss correction that utilizes trusted examples in a data-efficient manner to mitigate the effects of label noise on deep neural network classifiers. Across vision and natural language processing tasks, we experiment with various label noises at several strengths, and show that our method significantly outperforms existing methods.
Robustness to label noise is set to become an increasingly important property of supervised learning models. With the advent of deep learning, the need for more labeled data makes it inevitable that not all examples will have high-quality labels. This is especially true of data sources that admit automatic label extraction, such as web crawling for images, and tasks for which high-quality labels are expensive to produce, such as semantic segmentation or parsing. Additionally, label corruption may arise in data poisoning (li2016; steinhardtpoison). Both natural and malicious label corruptions tend to sharply degrade the performance of classification systems (Zhu2004).
Most prior work on label corruption robustness assumes that all training data are potentially corrupted. However, it is usually the case that a number of trusted examples are available. Trusted data are gathered to create validation and test sets. When it is possible to curate trusted data, a small set of trusted data could be created for training. We depart from the assumption that all training data are potentially corrupted by assuming that a subset of the training data is trusted. In turn, we demonstrate that having some amount of trusted training data enables significant robustness gains.
To leverage the additional information from trusted labels, we propose a new loss correction and empirically verify it on a number of vision and natural language datasets with label corruption. Specifically, we demonstrate recovery from extremely high levels of label noise, including the dire case when the untrusted data has a majority of its labels corrupted. Such severe corruption can occur in adversarial situations like data poisoning, or when the number of classes is large. In comparison to loss corrections that do not employ trusted data (Patrini), our method is significantly more accurate in problem settings with moderate to severe label noise. Relative to a recent method which also uses trusted data (Li), our method is far more data-efficient and generally more accurate. These results demonstrate that systems can weather label corruption with access only to a small number of gold standard labels. Experiment code is available at https://github.com/mmazeika/glc.
The performance of machine learning systems reliant on labeled data has been shown to degrade noticeably in the presence of label noise (Nettleton; Pechenizkiy). In the case of adversarial label noise, this degradation can be even worse (Reed). Accordingly, modeling, correcting, and learning with noisy labels has been well studied (Natarajan; Biggio; Frenay).
The methods of (Mnih; Larsen; Patrini; Sukhbaatar) allow for label noise robustness by modifying the model's architecture or by implementing a loss correction. Unlike Mnih, who focus on binary classification of aerial images, and Larsen, who assume symmetric label noise, (Patrini; Sukhbaatar) consider label noise in the multiclass problem setting with asymmetric noise.
Sukhbaatar introduce a stochastic matrix measuring label corruption, note that it cannot be calculated without access to the true labels, and propose a method of forward loss correction. Forward loss correction adds a linear layer to the end of the model, and the loss is adjusted accordingly to incorporate learning about the label noise.
Patrini also make use of the forward loss correction mechanism, and propose an estimate of the label corruption matrix which relies on strong assumptions and does not make use of clean labels.
Contra (Sukhbaatar; Patrini), we make the assumption that during training the model has access to a small set of clean labels. See (semiverified) for a general analysis of this assumption. This assumption has been leveraged by others for the purpose of label noise robustness, most notably (GoogleMultiLabel; Li; Xiao; LearningToReweight). GoogleMultiLabel use human-verified labels to train a label cleaning network by estimating the residuals between the noisy and clean labels in a multilabel classification setting. In the multiclass setting that we focus on in this work, Li propose distilling the predictions of a model trained on clean labels into a second network trained on these predictions and the noisy labels. Our work differs from these two in that we do not train neural networks on the clean labels alone.
We are given an untrusted dataset $\widetilde{\mathcal{D}}$ of $u$ examples $(x, \tilde{y})$, and we assume that these examples are potentially corrupted examples from the true data distribution $p(x, y)$ with $K$ classes. Corruption is specified by a label noise distribution $p(\tilde{y} \mid y, x)$. We are also given a trusted dataset $\mathcal{D}$ of $t$ examples drawn from $p(x, y)$, where $t/u \ll 1$. We refer to $t/u$ as the trusted fraction. Concretely, a web scraper labeling images from metadata may produce an untrusted set, while expert-annotated examples would form a trusted dataset and be a gold standard.
We explore two avenues of utilizing the trusted dataset $\mathcal{D}$ to improve on the forward loss correction approach. The first directly uses the trusted data while training the final classifier. As this could be applied to existing methods, we run ablations to demonstrate its effect. The second avenue uses the additional information conferred by the clean labels to better model the label noise for use in a corrected classifier.
Our method makes use of $\mathcal{D}$ to estimate the $K \times K$ matrix of corruption probabilities $C_{ij} = p(\tilde{y} = j \mid y = i)$. Once this estimate is obtained, we use it to train a modified classifier from which we recover an estimate of the desired conditional distribution $p(y \mid x)$. We call this method the Gold Loss Correction (GLC), so named because we make use of trusted or gold standard labels.
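Estimating The Corruption Matrix. First, we train a classifier $\hat{p}(\tilde{y} \mid x)$ on the untrusted dataset $\widetilde{\mathcal{D}}$. Let $i$ and $j$ be in the set of possible labels. To estimate the probability $p(\tilde{y} = j \mid y = i)$, we use the identity $p(\tilde{y} = j \mid y = i, x)\, p(x \mid y = i) = p(\tilde{y} = j, x \mid y = i)$. Integrating over all $x$ gives us
$$\int p(\tilde{y} = j \mid y = i, x)\, p(x \mid y = i)\, dx = p(\tilde{y} = j \mid y = i).$$
We can approximate the integral on the left with the expectation of $p(\tilde{y} = j \mid y = i, x)$ over the empirical distribution of $x$ given $y = i$. Assuming conditional independence of $\tilde{y}$ and $y$ given $x$, we have $p(\tilde{y} = j \mid y = i, x) = p(\tilde{y} = j \mid x)$, which is directly approximated by $\hat{p}(\tilde{y} = j \mid x)$, the classifier trained on $\widetilde{\mathcal{D}}$. More explicitly, let $A_i$ be the subset of $x$ in the trusted dataset $\mathcal{D}$ with label $i$. Denote our estimate of the corruption matrix $C$ by $\widehat{C}$. We have
$$\widehat{C}_{ij} = \frac{1}{|A_i|} \sum_{x \in A_i} \hat{p}(\tilde{y} = j \mid x) \approx p(\tilde{y} = j \mid y = i).$$
This is how we estimate our corruption matrix for the GLC. The approximation relies on $\hat{p}(\tilde{y} \mid x)$ being a good estimate of $p(\tilde{y} \mid x)$, on the number of trusted examples of each class, and on the extent to which the conditional independence assumption is satisfied. The conditional independence assumption is reasonable, as it is usually the case that noisy labeling processes do not have access to the true label. Moreover, when the data are separable (i.e. $y$ is deterministic given $x$), the assumption follows. A proof of this is provided in the Supplementary Material. We investigate these factors in experiments.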
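The estimation step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `estimate_corruption_matrix`, the list-of-lists representation, and the uniform fallback for a class with no trusted examples are all assumptions made for the example. The input `noisy_probs` stands for the softmax outputs of the classifier trained on the untrusted data, evaluated on the trusted examples.

```python
def estimate_corruption_matrix(noisy_probs, trusted_labels, num_classes):
    """Average the noisy-label classifier's softmax over trusted examples of each class.

    Row i of the result approximates p(noisy label = j | true label = i).
    """
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for probs, label in zip(noisy_probs, trusted_labels):
        counts[label] += 1
        for j, p in enumerate(probs):
            sums[label][j] += p

    c_hat = []
    for i in range(num_classes):
        if counts[i] == 0:
            # Assumption for this sketch: use a uniform row if a class
            # has no trusted examples (the source does not specify this case).
            c_hat.append([1.0 / num_classes] * num_classes)
        else:
            c_hat.append([s / counts[i] for s in sums[i]])
    return c_hat


# Toy usage: three trusted examples, two classes.
toy_probs = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
toy_labels = [0, 0, 1]
toy_c_hat = estimate_corruption_matrix(toy_probs, toy_labels, 2)
```

Each row is an average of probability vectors, so every row of the estimate sums to one without any extra normalization.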
Training a Corrected Classifier. Now with our estimate $\widehat{C}$ of the corruption matrix, we follow the method of (Sukhbaatar; Patrini) to train a corrected classifier, which we now briefly describe. Given the softmax output $s$ of an untrained classifier, we define the new output as $\tilde{s} = \widehat{C}^\top s$, so that $\tilde{s}_j = \sum_i \widehat{C}_{ij} s_i$. We then train $\tilde{s}$ on the noisy labels with cross-entropy loss. We can further improve on this method by using trusted data to train the corrected classifier. Thus, we apply no correction on examples from the trusted set encountered during training. This has the effect of allowing the GLC to handle a degree of instance-dependency in the label noise (Menon), though our experiments suggest that most of the GLC's performance gains derive from our $\widehat{C}$ estimate. A concrete algorithm of our method is provided here.
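The per-example loss described above can be sketched as follows. This is a hedged illustration under the convention that row $i$ of the estimated matrix holds the distribution of noisy labels given true label $i$; the names `corrected_probs` and `glc_loss`, and the small floor inside the logarithm, are assumptions of this sketch rather than details from the source.

```python
import math


def corrected_probs(c_hat, s):
    """Forward correction: s_tilde[j] = sum_i c_hat[i][j] * s[i]."""
    k = len(s)
    return [sum(c_hat[i][j] * s[i] for i in range(k)) for j in range(k)]


def glc_loss(c_hat, s, label, trusted):
    """Cross-entropy for one example.

    Untrusted examples are scored against the corrected output;
    trusted examples are scored against the uncorrected softmax.
    """
    probs = s if trusted else corrected_probs(c_hat, s)
    # The 1e-12 floor is only a numerical safeguard for this sketch.
    return -math.log(max(probs[label], 1e-12))


# Toy usage with a hypothetical 2-class estimate.
toy_c_hat = [[0.7, 0.3], [0.2, 0.8]]
toy_softmax = [0.9, 0.1]
loss_trusted = glc_loss(toy_c_hat, toy_softmax, label=0, trusted=True)
loss_untrusted = glc_loss(toy_c_hat, toy_softmax, label=1, trusted=False)
```

In a real training loop the same computation would be batched and differentiated by a deep learning framework; the scalar version here only shows which output each kind of example is trained against.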
Generating Corrupted Labels. Suppose our dataset has $t + u$ examples. We sample a set of $t$ trusted datapoints $\mathcal{D}$, and the remaining $u$ untrusted examples form $\widetilde{\mathcal{D}}$, which we probabilistically corrupt according to a true corruption matrix $C$. Note that we do not have knowledge of which of our untrusted examples are corrupted. We only know that they are potentially corrupted.
To generate the untrusted labels from the true labels in $\widetilde{\mathcal{D}}$, we first obtain a corruption matrix $C$. Then, for an example with true label $i$, we sample the corrupted label from the categorical distribution parameterized by the $i$th row of $C$. Note that this does not satisfy the conditional independence assumption required for our estimate of $C$. However, we find that the GLC still works well in practice, perhaps because this assumption is also satisfied when the data are separable, in the sense that each $x$ has a single true $y$, which is approximately true in many of our experiments.
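The sampling step described above can be sketched with the standard library alone. The function name `corrupt_labels` and the fixed seed are assumptions for illustration; the corruption matrix is passed as a list of rows, each a probability vector over noisy labels.

```python
import random


def corrupt_labels(true_labels, c, seed=0):
    """Sample one noisy label per example from row `y` of corruption matrix `c`."""
    rng = random.Random(seed)  # fixed seed so the corruption is reproducible
    classes = list(range(len(c)))
    return [rng.choices(classes, weights=c[y], k=1)[0] for y in true_labels]


# Toy usage: an identity matrix leaves labels unchanged,
# while a full flip matrix swaps the two classes.
identity = [[1.0, 0.0], [0.0, 1.0]]
full_flip = [[0.0, 1.0], [1.0, 0.0]]
unchanged = corrupt_labels([0, 1, 1, 0], identity)
flipped = corrupt_labels([0, 1, 1, 0], full_flip)
```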
Comparing Loss Correction Methods. The GLC differs from previous loss corrections for label noise in that it reasonably assumes access to a high-quality annotation source. Therefore, to compare to other loss correction methods, we ask how each method performs when starting from the same dataset with the same label noise. In other words, the only additional information our method uses is knowledge of which examples are trusted, and which are potentially corrupted.








MNIST. The MNIST dataset contains grayscale images of the digits 0-9. The training set has 60,000 images and the test set has 10,000 images. For preprocessing, we rescale the pixels to a fixed interval. We train a 2-layer fully connected network with 256 hidden dimensions. We train with Adam for 10 epochs using batches of size 32 and a learning rate of 0.001. For regularization, we use weight decay on all layers.
CIFAR. The two CIFAR datasets contain color images. CIFAR10 has ten classes, and CIFAR100 has 100 classes. CIFAR100 has 20 "superclasses" which partition its 100 classes into 20 semantically similar sets. We use these superclasses for hierarchical noise. Both datasets have 50,000 training images and 10,000 testing images. For both datasets, we train a Wide Residual Network (wideresnet) of depth 40 and a widening factor of 2. We train for 75 epochs using SGD with Nesterov momentum and a cosine learning rate schedule (sgdr).
IMDB. The IMDB Large Movie Reviews dataset (imdb) contains 50,000 highly polarized movie reviews from the Internet Movie Database, split evenly into train and test sets. We pad and clip reviews to a length of 200 tokens, and learn 50-dimensional word vectors from scratch for a vocab size of 5,000. We train an LSTM with 64 hidden dimensions on this data. We train using the Adam optimizer (adam) for 3 epochs with batch size 64 and the suggested learning rate of 0.001. For regularization, we use dropout (dropout) on the linear output layer with a dropping probability of 0.2.
Twitter. The Twitter Part of Speech dataset (Gimpel2011) contains 1,827 tweets annotated with 25 POS tags. This is split into a training set of 1,000 tweets, a development set of 327 tweets, and a test set of 500 tweets. We use the development set to augment the training set. We use pretrained 50-dimensional word vectors, and for each token, we concatenate word vectors in a fixed window centered on the token. These form our training and test set. We use a window size of 3, and train a 2-layer fully connected network with hidden size 256, and use the GELU nonlinearity (gelu). We train with Adam for 15 epochs with batch size 64 and learning rate of 0.001. For regularization, we use weight decay on all but the linear output layer.
SST. The Stanford Sentiment Treebank dataset consists of single sentence movie reviews (sst). We use the 2-class version (i.e. SST2), which has 6,911 reviews in the training set, 872 in the development set, and 1,821 in the test set. We use the development set to augment the training set. We pad and clip reviews to a length of 30 tokens and learn 100-dimensional word vectors from scratch for a vocab size of 10,000. Our classifier is a word-averaging model with an affine output layer. We use the Adam optimizer for 5 epochs with batch size 50 and learning rate 0.001. For regularization, we use weight decay on the output layer.
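Table 1: Vision dataset results, reported as area under the error curve across corruption strengths.

| Dataset | Corruption Type | Percent Trusted | Trusted Only | No Corr. | Forward | Forward Gold | Distill. | Confusion Matrix | GLC (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| MNIST | Uniform | 5 | 37.6 | 12.9 | 14.5 | 13.5 | 42.1 | 21.8 | 10.3 |
| MNIST | Uniform | 10 | 12.9 | 12.3 | 13.9 | 12.3 | 9.2 | 15.1 | 6.3 |
| MNIST | Uniform | 25 | 6.6 | 9.3 | 11.8 | 9.2 | 5.8 | 11.0 | 4.7 |
| MNIST | Flip | 5 | 37.6 | 50.1 | 51.7 | 41.4 | 46.5 | 11.7 | 3.4 |
| MNIST | Flip | 10 | 12.9 | 51.1 | 48.8 | 36.4 | 32.4 | 5.6 | 2.9 |
| MNIST | Flip | 25 | 6.6 | 47.7 | 50.2 | 37.1 | 28.2 | 3.8 | 2.6 |
| MNIST | Mean | | 19.0 | 30.6 | 31.8 | 25.0 | 27.4 | 11.5 | 5.0 |
| CIFAR10 | Uniform | 5 | 39.6 | 31.9 | 9.1 | 27.8 | 29.7 | 22.4 | 9.0 |
| CIFAR10 | Uniform | 10 | 31.3 | 31.9 | 8.6 | 20.6 | 18.3 | 22.7 | 6.9 |
| CIFAR10 | Uniform | 25 | 17.4 | 32.7 | 7.7 | 27.1 | 11.6 | 16.7 | 6.4 |
| CIFAR10 | Flip | 5 | 39.6 | 53.3 | 38.6 | 47.8 | 29.7 | 8.1 | 6.6 |
| CIFAR10 | Flip | 10 | 31.3 | 53.2 | 36.5 | 51.0 | 18.1 | 8.2 | 6.2 |
| CIFAR10 | Flip | 25 | 17.4 | 52.7 | 37.6 | 49.5 | 11.8 | 7.1 | 6.1 |
| CIFAR10 | Mean | | 29.4 | 42.6 | 23.0 | 37.3 | 19.9 | 14.2 | 6.9 |
| CIFAR100 | Uniform | 5 | 82.4 | 48.8 | 47.7 | 49.6 | 87.5 | 53.6 | 42.4 |
| CIFAR100 | Uniform | 10 | 67.3 | 48.4 | 47.2 | 48.9 | 61.2 | 49.7 | 33.9 |
| CIFAR100 | Uniform | 25 | 52.2 | 45.4 | 43.6 | 46.0 | 39.8 | 39.6 | 27.3 |
| CIFAR100 | Flip | 5 | 82.4 | 62.1 | 61.6 | 62.6 | 87.1 | 28.6 | 27.1 |
| CIFAR100 | Flip | 10 | 67.3 | 61.9 | 61.0 | 62.2 | 61.8 | 26.9 | 25.8 |
| CIFAR100 | Flip | 25 | 52.2 | 59.6 | 57.5 | 61.4 | 40.0 | 25.1 | 24.7 |
| CIFAR100 | Hierarchical | 5 | 82.4 | 50.9 | 51.0 | 52.4 | 87.1 | 45.8 | 34.8 |
| CIFAR100 | Hierarchical | 10 | 67.3 | 51.9 | 50.5 | 52.1 | 61.7 | 38.8 | 30.2 |
| CIFAR100 | Hierarchical | 25 | 52.2 | 54.3 | 47.0 | 51.1 | 39.7 | 29.7 | 25.4 |
| CIFAR100 | Mean | | 67.3 | 53.7 | 51.9 | 54.0 | 62.9 | 37.5 | 30.2 |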

Forward Loss Correction. The forward correction method from Patrini also obtains an estimate of the corruption matrix $C$ by training a classifier on the noisy labels and using the resulting softmax probabilities. However, this method does not make use of a trusted fraction of the training data. Instead, it uses the $\operatorname{argmax}$ at a fixed percentile of softmax probabilities for a given class as a heuristic for detecting an example that is truly a member of said class. As in the original paper, we replace this with the $\operatorname{argmax}$ over all softmax probabilities for a given class on CIFAR100 experiments. The estimate of $C$ is then used to train a corrected classifier in the same way as the GLC.
Forward Gold. To examine the effect of training on trusted labels as done by the GLC, we augment the Forward method by replacing its estimate of $C$ with the identity on trusted examples. We call this method Forward Gold. It can also be seen as the GLC with the Forward method's estimate of $C$.
Distillation. The distillation method of Li involves training a neural network on a large trusted dataset and using this network to provide soft targets for the untrusted data. In this way, labels are “distilled” from a neural network. If the classifier’s decisions for untrusted inputs are less reliable than the original noisy labels, then the network’s utility is limited. Thus, to obtain a reliable neural network, a large trusted dataset is necessary. A new classifier is trained using labels that are a convex combination of the soft targets and the original untrusted labels.
Confusion Matrices. An intuitive alternative to the GLC is to estimate the corruption matrix $C$ by a confusion matrix. To do this, we train a classifier on the untrusted examples, obtain its confusion matrix on the trusted examples, row-normalize the matrix, and then train a corrected classifier as in the GLC.
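A minimal sketch of the confusion matrix baseline described above, assuming the classifier's softmax outputs on the trusted examples are available as lists. The function name `confusion_matrix_estimate` and the uniform fallback for an empty row are assumptions of this example, not details from the source.

```python
def confusion_matrix_estimate(noisy_probs, trusted_labels, num_classes):
    """Row-normalized confusion matrix of hard predictions on trusted examples."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for probs, label in zip(noisy_probs, trusted_labels):
        predicted = max(range(num_classes), key=lambda j: probs[j])  # argmax
        counts[label][predicted] += 1

    rows = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption for this sketch: uniform row for a class with no trusted examples.
            rows.append([1.0 / num_classes] * num_classes)
        else:
            rows.append([c / total for c in row])
    return rows


# Toy usage: hard predictions discard the probability mass on non-argmax classes.
toy_probs = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
toy_estimate = confusion_matrix_estimate(toy_probs, [0, 0, 1], 2)
```

On this toy input every off-diagonal entry is zero even though the softmax outputs place 20 to 40 percent of their mass off the argmax, which illustrates how taking hard predictions can differ from averaging the probabilities themselves.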
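Table 2: Natural language dataset results, reported as area under the error curve across corruption strengths.

| Dataset | Corruption Type | Percent Trusted | Trusted Only | No Corr. | Forward | Forward Gold | Distill. | Confusion Matrix | GLC (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| SST | Uniform | 5 | 45.4 | 27.5 | 26.5 | 26.6 | 43.4 | 26.1 | 24.2 |
| SST | Uniform | 10 | 35.2 | 27.2 | 26.2 | 25.9 | 33.3 | 25.0 | 23.5 |
| SST | Uniform | 25 | 26.1 | 26.5 | 25.3 | 24.6 | 25.0 | 22.4 | 21.7 |
| SST | Flip | 5 | 45.4 | 50.2 | 50.3 | 50.3 | 48.8 | 26.0 | 24.9 |
| SST | Flip | 10 | 35.2 | 49.9 | 50.1 | 49.9 | 42.1 | 24.6 | 23.5 |
| SST | Flip | 25 | 26.1 | 48.7 | 49.0 | 47.3 | 31.8 | 22.4 | 21.7 |
| SST | Mean | | 35.6 | 38.3 | 37.9 | 37.4 | 37.4 | 24.4 | 23.3 |
| IMDB | Uniform | 5 | 36.9 | 26.7 | 27.9 | 27.6 | 35.5 | 25.4 | 25.0 |
| IMDB | Uniform | 10 | 26.2 | 25.8 | 27.2 | 26.1 | 24.9 | 23.3 | 22.3 |
| IMDB | Uniform | 25 | 22.2 | 21.4 | 23.0 | 20.1 | 21.0 | 18.9 | 18.7 |
| IMDB | Flip | 5 | 36.9 | 49.2 | 49.2 | 49.2 | 41.8 | 25.8 | 25.2 |
| IMDB | Flip | 10 | 26.2 | 47.8 | 48.3 | 47.5 | 28.0 | 22.1 | 22.0 |
| IMDB | Flip | 25 | 22.2 | 39.4 | 39.6 | 36.6 | 23.5 | 19.2 | 18.5 |
| IMDB | Mean | | 28.5 | 35.0 | 35.9 | 34.5 | 29.1 | 22.5 | 22.0 |
| Twitter | Uniform | 5 | 35.9 | 37.1 | 51.7 | 44.1 | 32.0 | 41.5 | 31.0 |
| Twitter | Uniform | 10 | 23.6 | 33.5 | 49.5 | 40.2 | 22.2 | 33.6 | 22.3 |
| Twitter | Uniform | 25 | 16.3 | 25.5 | 40.6 | 26.4 | 16.6 | 20.0 | 15.5 |
| Twitter | Flip | 5 | 35.9 | 56.2 | 61.6 | 54.8 | 36.4 | 23.4 | 15.8 |
| Twitter | Flip | 10 | 23.6 | 53.8 | 59.0 | 48.9 | 26.1 | 15.9 | 12.9 |
| Twitter | Flip | 25 | 16.3 | 43.0 | 52.5 | 36.7 | 20.5 | 13.3 | 12.8 |
| Twitter | Mean | | 25.3 | 41.5 | 52.5 | 41.9 | 25.7 | 24.6 | 18.4 |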

Corruption-Generating Matrices. We consider three types of corruption matrices: corrupting uniformly to all classes, i.e. $C_{ij} = 1/K$ where $K$ is the number of classes, flipping a label to a different class, and corrupting uniformly to classes which are semantically similar. To create a uniform corruption at different strengths, we take a convex combination of an identity matrix and the matrix $\mathbb{1}\mathbb{1}^\top / K$. We refer to the coefficient of $\mathbb{1}\mathbb{1}^\top / K$ as the corruption strength for a "uniform" corruption. A "flip" corruption at strength $m$ involves, for each row, giving an off-diagonal column probability mass $m$ and the entries along the diagonal probability mass $1 - m$. Finally, a more realistic corruption is hierarchical corruption. For this corruption, we apply uniform corruption only to semantically similar classes; for example, "bed" may be corrupted to "couch" but not "beaver" in CIFAR100. For CIFAR100, examples are deemed semantically similar if they share the same "superclass" label specified by the dataset creators.
Experiments and Analysis of Results. We train the models described in Section 4.1 under uniform, label-flipping, and hierarchical label corruptions at various fractions of trusted data. To assess the performance of the GLC, we compare it to other loss correction methods and two baselines: one where we train a network only on trusted data without any label corrections, and one where the network trains on all data without any label corrections. We record errors on the test sets at the corruption strengths $\{0, 0.1, \ldots, 1.0\}$. Since we compute the model's accuracy at numerous corruption strengths, CIFAR experiments involve training over 500 Wide Residual Networks. In Tables 1 and 2, we report the area under the error curves across corruption strengths for all baselines and corrections. A sample of error curves is displayed in Figure 2. These curves are the linear interpolation of the errors at the eleven corruption strengths.
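The three corruption types and the area-under-the-error-curve summary described above can be sketched as follows. This is an illustration under stated assumptions, not the experiment code: the function names are hypothetical, the flip target for each row is chosen at random with a fixed seed (the source only says that one off-diagonal column receives the mass), and the hierarchical version spreads the corrupted mass uniformly over all classes sharing a superclass, including the true class, by analogy with the uniform case.

```python
import random


def uniform_corruption(k, strength):
    """(1 - strength) * I + strength * (all-ones matrix / k)."""
    return [[(1.0 - strength) * (1.0 if i == j else 0.0) + strength / k
             for j in range(k)] for i in range(k)]


def flip_corruption(k, strength, seed=0):
    """Each row keeps 1 - strength on the diagonal and moves `strength` to one other class."""
    rng = random.Random(seed)
    c = [[0.0] * k for _ in range(k)]
    for i in range(k):
        target = rng.choice([j for j in range(k) if j != i])  # assumed: random target
        c[i][i] = 1.0 - strength
        c[i][target] = strength
    return c


def hierarchical_corruption(superclass, strength):
    """Uniform corruption restricted to classes that share a superclass."""
    k = len(superclass)
    c = [[0.0] * k for _ in range(k)]
    for i in range(k):
        group = [j for j in range(k) if superclass[j] == superclass[i]]
        for j in group:
            c[i][j] += strength / len(group)
        c[i][i] += 1.0 - strength
    return c


def area_under_error_curve(strengths, errors):
    """Trapezoid rule, i.e. the area under the linear interpolation of the errors."""
    return sum((strengths[n + 1] - strengths[n]) * (errors[n] + errors[n + 1]) / 2.0
               for n in range(len(strengths) - 1))
```

For instance, `hierarchical_corruption([0, 0, 1, 1], 0.5)` describes four classes in two superclasses, and no probability mass crosses the superclass boundary.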
Across all datasets, the GLC obtains a better mean area under the error curve than the baselines and the Forward and Distillation methods. The rankings of the other methods and baselines are mixed. On MNIST, training on the trusted data alone outperforms all methods save for the GLC and Confusion Matrix, but performs significantly worse on CIFAR100, even with large trusted fractions.
The Confusion Matrix correction performs second to the GLC, which indicates that normalized confusion matrices are not as accurate as our method of estimating the corruption matrix $C$. We verified this by inspecting the estimates directly, and found that normalized confusion matrices give a highly biased estimate due to taking an $\operatorname{argmax}$ over class scores rather than using random sampling. Figure 1 shows an example of this bias in the case of label flipping corruption.
Interestingly, Forward Gold performs worse than Forward on several datasets. We did not observe the same behavior when turning off the corresponding component of the GLC, and believe it may be due to variance introduced during training by the difference in signal provided by the Forward method's estimate of $C$ and the clean labels. The GLC provides a superior estimate of $C$, and thus may be better able to leverage training on the clean labels. Additional results on SVHN are in the Supplementary Material.
We also compare the GLC to the recent work of LearningToReweight, which proposes a loss correction that uses a trusted set and meta-learning. We find that the GLC consistently outperforms this method. To conserve space, results are in the Supplementary Material.
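Table 3: Results with weak classifier labels, reported as percent error.

| Dataset | Percent Trusted | Trusted Only | No Corr. | Forward | Forward Gold | Distill. | Confusion Matrix | GLC (Ours) |
|---|---|---|---|---|---|---|---|---|
| CIFAR10 | 1 | 62.9 | 28.3 | 28.1 | 30.9 | 60.4 | 31.9 | 26.9 |
| CIFAR10 | 5 | 39.6 | 27.1 | 26.6 | 25.5 | 28.1 | 27.0 | 21.9 |
| CIFAR10 | 10 | 31.3 | 25.9 | 25.1 | 22.9 | 17.8 | 24.2 | 19.2 |
| CIFAR10 | Mean | 44.6 | 27.1 | 26.6 | 26.4 | 35.44 | 27.7 | 22.7 |
| CIFAR100 | 5 | 82.4 | 71.1 | 73.9 | 73.6 | 88.3 | 74.1 | 68.7 |
| CIFAR100 | 10 | 67.3 | 66.0 | 68.2 | 66.1 | 62.5 | 63.8 | 56.6 |
| CIFAR100 | 25 | 52.2 | 56.9 | 56.9 | 51.4 | 39.7 | 50.8 | 40.8 |
| CIFAR100 | Mean | 67.3 | 64.7 | 66.3 | 63.7 | 63.5 | 62.9 | 55.4 |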

Our next benchmark for the GLC is to use noisy labels obtained from a weak classifier. This models the scenario of label noise arising from a classification system weaker than one’s own, but with access to information about the true labels that one wishes to transfer to one’s own system. For example, scraping image labels from surrounding text on web pages provides a valuable signal, but these labels would train a subpar classifier without correcting the label noise. This setting exactly satisfies the conditional independence assumption on label noise used for our estimate, because the weak classifier does not take the true label as input when outputting noisy labels.
Weak Classifier Label Generation. To obtain the labels, we train 40-layer Wide Residual Networks on CIFAR10 and CIFAR100 with clean labels for ten epochs each. Then, we sample from their softmax distributions at a fixed temperature, and fix the resulting labels. This results in noisy labels which we use in place of the labels obtained through the uniform, flip, and hierarchical corruption methods. The labelings produced by the weak classifiers are highly corrupted on both CIFAR10 and CIFAR100. Despite this, we are able to significantly recover performance with the use of a trusted set. Note that unlike the previous corruption methods, weak classifier labels have only one corruption strength. Thus, performance is measured in percent error rather than area under the error curve. Results are displayed in Table 3.
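Sampling labels from a temperature-scaled softmax, as described above, can be sketched as follows. The function names, the fixed seed, and the temperature values in the usage lines are hypothetical choices for illustration only; `all_logits` stands for the weak classifier's pre-softmax scores, one list per example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; larger temperatures give flatter distributions."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_weak_labels(all_logits, temperature, seed=0):
    """Draw one label per example from the tempered softmax, then keep the labels fixed."""
    rng = random.Random(seed)
    labels = []
    for logits in all_logits:
        probs = softmax_with_temperature(logits, temperature)
        labels.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return labels


# Toy usage with a hypothetical temperature value.
sharp = softmax_with_temperature([2.0, 0.0], 1.0)
flat = softmax_with_temperature([2.0, 0.0], 5.0)
weak_labels = sample_weak_labels([[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]], temperature=5.0)
```

Because the labels depend only on the input through the classifier's scores, a labeling process of this kind never sees the true label.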
Analysis of Results. On average, the GLC outperforms all other methods in the weak classifier label experiments. The Distillation method performs better than the GLC by a small margin at the highest trusted fraction, but performs worse at lower trusted fractions, indicating that the GLC enjoys superior data efficiency. This is highlighted by the GLC attaining a 26.9% error rate on CIFAR10 with a trusted fraction of only 1%, down from the error rate of the weak classifier labels themselves. It should be noted, however, that training with no correction attains 28.3% error on this experiment, suggesting that the weak classifier labels have low bias. The improvement conferred by the GLC is greater with larger trusted fractions.
Data Efficiency. We have seen that the GLC works for small trusted fractions. We further corroborate its data efficiency by turning to the Clothing1M dataset (Xiao). Clothing1M is a massive dataset with both human-annotated and noisy labels, which we use to compare the data efficiency of the GLC to that of Distillation when very few trusted labels are present. It consists of 1 million noisily labeled clothing images obtained by crawling online marketplaces. 50,000 images have human-annotated labels, from which we take subsamples as our trusted set.
For both the GLC and Distillation, we first fine-tune a ResNet-34 on untrusted training examples for four epochs, and use this to estimate our corruption matrix. Thereafter, we fine-tune the network for four more epochs on the combined trusted and untrusted sets using the respective method. During fine-tuning, we freeze the first seven layers, and train using gradient descent with Nesterov momentum and a cosine learning rate schedule. For preprocessing, we randomly crop and use mirroring. We also upsample the trusted dataset, finding this to give better performance for both methods.
As shown in Figure 3, the GLC outperforms Distillation by a large margin, especially with fewer trusted examples. This is because Distillation requires fine-tuning a classifier on the trusted data alone, which generalizes poorly with very few examples. By contrast, estimating the corruption matrix $C$ can be done with very few examples. Correspondingly, we find that our advantage decreases as the number of trusted examples increases.
With more trusted labels, performance on Clothing1M saturates, as evident in Figure 3. We consider the extreme and train on the entire trusted set for Clothing1M. We fine-tune a pretrained 50-layer ResNeXt (resnext) on untrusted training examples to estimate our corruption matrix. Then, we fine-tune the ResNeXt on all training examples. During fine-tuning, we use gradient descent with Nesterov momentum. During the first two epochs, we tune only the output layer. Thereafter, we tune the whole network for two epochs at one learning rate, and for another two epochs at a second learning rate. Then we apply our loss correction. Now, we fine-tune the entire network for two epochs, continue training at a different learning rate, and early-stop based upon the validation set. Compared with the accuracy that Xiao obtain in this setting in a previous work, our method obtains a state-of-the-art accuracy, while with this procedure the Forward method obtains a lower accuracy than ours.
Improving Estimation. For some datasets, the classifier $\hat{p}(\tilde{y} \mid x)$ may be a poor estimate of $p(\tilde{y} \mid x)$, presenting a bottleneck in the estimation of the corruption matrix $C$ for the GLC. To see the extent to which this could impact performance, and whether simple methods for improving $\hat{p}(\tilde{y} \mid x)$ could help, we ran several variants of the GLC experiment on CIFAR100 under the label flipping corruption at a fixed trusted fraction, which we now describe. For all variants, we averaged the area under the error curve over five random initializations.
1. In the first variant, we replaced the GLC estimate $\widehat{C}$ with $C$, the true corruption matrix.
2. As demonstrated by (hendrycks17baseline; Guo2017), modern deep neural network classifiers tend to have overconfident softmax distributions. We found this to be the case with our $\hat{p}(\tilde{y} \mid x)$ estimate, despite the higher entropy of the noisy labels, so we used the temperature scaling confidence calibration method proposed in (Guo2017) to calibrate $\hat{p}(\tilde{y} \mid x)$.
3. Suppose we know the base rates of corrupted labels $\tilde{b}$, where $\tilde{b}_i = p(\tilde{y} = i)$, and the base rate of true labels $b$ of the trusted set. If we posit that $C$ corrupted the labels, then we should have $b^\top C = \tilde{b}^\top$. Thus, we may obtain a superior estimate of the corruption matrix by computing a new estimate $\widehat{C}'$ subject to $b^\top \widehat{C}' = \tilde{b}^\top$.
We found that using the true corruption matrix in place of our estimate $\widehat{C}$ provides a benefit in area under the error curve, but neither the confidence calibration nor the base rate incorporation was able to change the performance from the original GLC. This indicates that the GLC is robust to the use of uncalibrated networks for estimating $C$, and that improving its performance may be difficult without directly improving the performance of the neural network used to estimate $\hat{p}(\tilde{y} \mid x)$.
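The base rate relation in the third variant can be checked numerically. The sketch below only evaluates how far a given matrix is from satisfying the constraint; it does not implement the constrained re-estimation itself, whose details are not given here. The function names and the toy numbers are assumptions for illustration, with row $i$ of the matrix holding the distribution of noisy labels given true label $i$.

```python
def implied_noisy_base_rates(b, c):
    """Compute b^T C: the noisy-label base rates implied by true base rates b and matrix c."""
    k = len(b)
    return [sum(b[i] * c[i][j] for i in range(k)) for j in range(k)]


def base_rate_gap(b, c_hat, b_tilde):
    """Largest absolute difference between implied and observed noisy base rates."""
    implied = implied_noisy_base_rates(b, c_hat)
    return max(abs(p - q) for p, q in zip(implied, b_tilde))


# Toy usage with hypothetical numbers for two balanced classes.
toy_b = [0.5, 0.5]
toy_c_hat = [[0.7, 0.3], [0.2, 0.8]]
toy_implied = implied_noisy_base_rates(toy_b, toy_c_hat)  # approximately [0.45, 0.55]
```

A gap near zero means the estimate already agrees with the observed base rates, in which case imposing the constraint would change little.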
In this work, we have shown the impact of having a small set of trusted examples on label noise robustness in neural network classifiers. We proposed the Gold Loss Correction (GLC), a method for coping with label noise. This method leverages the assumption that the model has access to a small set of correct labels in order to yield accurate estimates of the noise distribution. Throughout our experiments, which consider several corruptions at numerous strengths, including severe strengths, the GLC surpasses previous label noise robustness methods across various natural language processing and vision domains. These results demonstrate that the GLC is a powerful, data-efficient method for improving robustness to label noise.
We thank NVIDIA for donating GPUs used in this research.
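We show here that the conditional independence assumption required by our estimator is satisfied when the data are separable, meaning that the label is deterministic given the input.
Let $(X, Y, \widetilde{Y})$ be random variables following a data distribution $p(x, y, \tilde{y})$, where $Y$ and $\widetilde{Y}$ are categorical. Semantically, $Y$ represents the true label, and $\widetilde{Y}$ represents the noisy label. Suppose that the data are separable, meaning that for each $x$, $p(y \mid x) = 0$ holds for all but one label $y = y_x$, in which case we have $p(y_x \mid x) = 1$. For brevity in the rest of the proof, we will use shorthand probability notation, i.e. $p(y \mid x) = p(Y = y \mid X = x)$. Using the separability assumption, we have
$$p(\tilde{y} \mid x) = \sum_{y} p(\tilde{y}, y \mid x) = p(\tilde{y}, y_x \mid x) = p(\tilde{y} \mid y_x, x)\, p(y_x \mid x) = p(\tilde{y} \mid y_x, x). \quad (1)$$
We will use this to show that $p(\tilde{y}, y \mid x) = p(\tilde{y} \mid x)\, p(y \mid x)$ for all $x, y, \tilde{y}$. Let $x$ and $\tilde{y}$ be given. For $y \neq y_x$, we have
$$0 \le p(\tilde{y}, y \mid x) \le p(y \mid x) = 0,$$
because separability implies $p(y \mid x) = 0$ for $y \neq y_x$. This is also equal to $p(\tilde{y} \mid x)\, p(y \mid x)$, so the case where $y \neq y_x$ is covered. Suppose $y = y_x$. We have
$$p(\tilde{y}, y_x \mid x) = p(\tilde{y} \mid y_x, x)\, p(y_x \mid x) = p(\tilde{y} \mid x)\, p(y_x \mid x),$$
where in the last step we use equation (1). This completes the proof.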
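| Dataset | Corruption Type | Percent Trusted | Trusted Only | No Corr. | Ren et al. | GLC (Ours) |
|---|---|---|---|---|---|---|
| MNIST | Uniform | 5 | 37.6 | 12.9 | 10.6 | 10.3 |
| MNIST | Uniform | 10 | 12.9 | 12.3 | 7.7 | 6.3 |
| MNIST | Uniform | 25 | 6.6 | 9.3 | 8.5 | 4.7 |
| MNIST | Flip | 5 | 37.6 | 50.1 | 20.2 | 3.4 |
| MNIST | Flip | 10 | 12.9 | 51.1 | 22.7 | 2.9 |
| MNIST | Flip | 25 | 6.6 | 47.7 | 22.7 | 2.6 |
| MNIST | Mean | | 19.0 | 30.6 | 15.4 | 5.0 |
| CIFAR10 | Uniform | 5 | 39.6 | 31.9 | 30.5 | 9.0 |
| CIFAR10 | Uniform | 10 | 31.3 | 31.9 | 30.8 | 6.9 |
| CIFAR10 | Uniform | 25 | 17.4 | 32.7 | 33.3 | 6.4 |
| CIFAR10 | Flip | 5 | 39.6 | 53.3 | 21.9 | 6.6 |
| CIFAR10 | Flip | 10 | 31.3 | 53.2 | 23.0 | 6.2 |
| CIFAR10 | Flip | 25 | 17.4 | 52.7 | 24.4 | 6.1 |
| CIFAR10 | Mean | | 29.4 | 42.6 | 27.3 | 6.9 |
| CIFAR100 | Uniform | 5 | 82.4 | 48.8 | 68.5 | 42.4 |
| CIFAR100 | Uniform | 10 | 67.3 | 48.4 | 71.5 | 33.9 |
| CIFAR100 | Uniform | 25 | 52.2 | 45.4 | 72.8 | 27.3 |
| CIFAR100 | Flip | 5 | 82.4 | 62.1 | 67.2 | 27.1 |
| CIFAR100 | Flip | 10 | 67.3 | 61.9 | 68.4 | 25.8 |
| CIFAR100 | Flip | 25 | 52.2 | 59.6 | 71.5 | 24.7 |
| CIFAR100 | Mean | | 67.3 | 54.4 | 70.0 | 30.2 |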

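| Dataset | Corruption Type | Percent Trusted | Trusted Only | No Corr. | Forward | Forward Gold | Distill. | Confusion Matrix | GLC (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| MNIST | Uniform | 5 | 37.6 | 12.9 | 14.5 | 13.5 | 42.1 | 21.8 | 10.3 |
| MNIST | Uniform | 10 | 12.9 | 12.3 | 13.9 | 12.3 | 9.2 | 15.1 | 6.3 |
| MNIST | Uniform | 25 | 6.6 | 9.3 | 11.8 | 9.2 | 5.8 | 11.0 | 4.7 |
| MNIST | Flip | 5 | 37.6 | 50.1 | 51.7 | 41.4 | 46.6 | 11.7 | 3.4 |
| MNIST | Flip | 10 | 12.9 | 51.1 | 48.8 | 36.4 | 32.4 | 5.6 | 2.9 |
| MNIST | Flip | 25 | 6.6 | 47.7 | 50.2 | 37.1 | 28.2 | 3.8 | 2.6 |
| MNIST | Mean | | 19.0 | 30.6 | 31.8 | 25.0 | 27.4 | 11.5 | 5.0 |
| SVHN | Uniform | 0.1 | 80.4 | 25.5 | 26.2 | 26.8 | 80.9 | 25.7 | 24.4 |
| SVHN | Uniform | 1 | 79.7 | 25.5 | 24.2 | 24.9 | 80.4 | 28.2 | 28.1 |
| SVHN | Uniform | 5 | 24.3 | 25.5 | 15.0 | 15.7 | 24.1 | 2.7 | 2.8 |
| SVHN | Flip | 0.1 | 80.4 | 51.0 | 51.0 | 50.9 | 89.1 | 19.8 | 19.4 |
| SVHN | Flip | 1 | 79.7 | 51.0 | 43.9 | 49.5 | 86.3 | 17.8 | 21.7 |
| SVHN | Flip | 5 | 24.3 | 51.0 | 43.2 | 49.0 | 17.6 | 2.2 | 2.2 |
| SVHN | Mean | | 61.5 | 38.2 | 33.9 | 36.1 | 63.1 | 16.1 | 16.4 |
| CIFAR10 | Uniform | 5 | 39.6 | 31.9 | 9.1 | 27.9 | 29.7 | 22.4 | 9.0 |
| CIFAR10 | Uniform | 10 | 31.3 | 31.9 | 8.6 | 20.6 | 18.3 | 22.7 | 6.9 |
| CIFAR10 | Uniform | 25 | 17.4 | 32.7 | 7.7 | 27.1 | 11.6 | 16.7 | 6.4 |
| CIFAR10 | Flip | 5 | 39.6 | 53.3 | 38.6 | 47.8 | 29.7 | 8.1 | 6.6 |
| CIFAR10 | Flip | 10 | 31.3 | 53.2 | 36.5 | 51.0 | 18.1 | 8.2 | 6.2 |
| CIFAR10 | Flip | 25 | 17.4 | 52.7 | 37.6 | 49.5 | 11.8 | 7.1 | 6.1 |
| CIFAR10 | Mean | | 29.4 | 42.6 | 23.0 | 37.3 | 19.9 | 14.2 | 6.9 |
| CIFAR100 | Uniform | 5 | 82.4 | 48.8 | 47.7 | 49.6 | 87.5 | 53.6 | 42.4 |
| CIFAR100 | Uniform | 10 | 67.3 | 48.4 | 47.2 | 48.9 | 61.2 | 49.7 | 33.9 |
| CIFAR100 | Uniform | 25 | 52.2 | 45.4 | 43.6 | 46.0 | 39.8 | 39.6 | 27.3 |
| CIFAR100 | Flip | 5 | 82.4 | 62.1 | 61.6 | 62.6 | 87.1 | 28.6 | 27.1 |
| CIFAR100 | Flip | 10 | 67.3 | 61.9 | 61.0 | 62.2 | 61.9 | 26.9 | 25.8 |
| CIFAR100 | Flip | 25 | 52.2 | 59.6 | 57.5 | 61.4 | 40.0 | 25.1 | 24.7 |
| CIFAR100 | Hierarchical | 5 | 82.4 | 50.9 | 51.0 | 52.4 | 87.1 | 45.8 | 34.8 |
| CIFAR100 | Hierarchical | 10 | 67.3 | 51.9 | 50.5 | 52.1 | 61.7 | 38.8 | 30.2 |
| CIFAR100 | Hierarchical | 25 | 52.2 | 54.3 | 47.0 | 51.1 | 39.7 | 29.7 | 25.4 |
| CIFAR100 | Mean | | 67.3 | 53.7 | 51.9 | 54.0 | 62.9 | 37.5 | 30.2 |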













































