RoGNoisyLabel
Description Code for the paper "Robust Inference via Generative Classifiers for Handling Noisy Labels".
Large-scale datasets may contain significant proportions of noisy (incorrect) class labels, and it is well-known that modern deep neural networks (DNNs) generalize poorly from such noisy training datasets. To mitigate the issue, we propose a novel inference method, termed Robust Generative classifier (RoG), applicable to any discriminative (e.g., softmax) neural classifier pretrained on noisy datasets. In particular, we induce a generative classifier on top of hidden feature spaces of the pretrained DNNs, for obtaining a more robust decision boundary. By estimating the parameters of the generative classifier using the minimum covariance determinant estimator, we significantly improve the classification accuracy with neither retraining of the deep model nor changing its architecture. With the assumption of Gaussian distribution for features, we prove that RoG generalizes better than baselines under noisy labels. Finally, we propose the ensemble version of RoG to improve its performance by investigating the layer-wise characteristics of DNNs. Our extensive experimental results demonstrate the superiority of RoG given different learning models optimized by several training techniques to handle diverse scenarios of noisy labels.
Deep neural networks (DNNs) tend to generalize well when they are trained on large-scale datasets with ground-truth label annotations. For example, DNNs have achieved state-of-the-art performance on many classification tasks, e.g., image classification (He et al., 2016), object detection (Girshick, 2015), and speech recognition (Amodei et al., 2016). However, as the scale of the training dataset increases, it becomes infeasible to obtain all ground-truth class labels from domain experts. A common practice is collecting the class labels by mining social media (Mahajan et al., 2018) or web data (Krause et al., 2016). Machine-generated labels are often used; e.g., the Open Images Dataset V4 contains 70 million such labels for training images (Kuznetsova et al., 2018). However, they may contain incorrect labels, and recent studies have shown that modern deep architectures may generalize poorly from such noisy datasets (Zhang et al., 2017) (e.g., see the black line of Figure 1(a)).
To address the poor generalization issue of DNNs with noisy labels, many training strategies have been investigated (Reed et al., 2014; Patrini et al., 2017; Ma et al., 2018; Han et al., 2018b; Hendrycks et al., 2018; Goldberger & Ben-Reuven, 2017; Jiang et al., 2018; Ren et al., 2018; Zhang & Sabuncu, 2018; Malach & Shalev-Shwartz, 2017; Han et al., 2018a). However, using such training methods may incur expensive back-and-forth costs (e.g., additional time and hyperparameter tuning) and suffer from reproducibility issues. This motivates our approach of developing a more plausible inference method applicable to any pretrained deep model. Hence, our direction is complementary to the prior works: one can combine ours and a prior training method for the best performance (see Tables 3, 4, & 5 in Section 4).

The key contribution of our work is to develop such an inference method, Robust Generative classifier (RoG), which is applicable to any discriminative (e.g., softmax) neural classifier pretrained on noisy datasets (without retraining). Our main idea is inducing a better posterior distribution from the pretrained (though noisy) feature representation by utilizing a robust generative classifier. Here, our belief is that the softmax DNNs can learn meaningful feature patterns shared by multiple training examples even under datasets with noisy labels, e.g., see (Arpit et al., 2017).
To motivate our approach, we first observe that training samples with noisy labels (red circles) are distributed like outliers when their hidden features are projected in a 2-dimensional space using t-SNE (Maaten & Hinton, 2008) (see Figure 1(b)). In other words, this phenomenon implies that DNN representations, even when trained with noisy labels, may still exhibit clustering properties (i.e., the DNN learns embeddings that tend to group clean examples of the same class into clusters while pushing the examples with corrupt labels outside these clusters). The observation inspires us to induce a generative classifier on the pretrained hidden features since it can model the joint data distribution $P(x, y)$ for input $x$ and its label $y$ for outlier detection and thus produce a robust posterior $P(y \mid x)$ for prediction. Here, one may suggest training a deep generative classifier from scratch. However, such a fully generative approach is expensive and has not been popular for recent state-of-the-art classification. We instead post-process a light generative classifier only for inference.

In particular, we propose to induce the generative classifier under the linear discriminant analysis (LDA) assumption and choose its parameters by the minimum covariance determinant (MCD) (Rousseeuw, 1984) estimator, which calculates more robust parameters. We provide theoretical support for the generalization property (Durrant & Kabán, 2010) of RoG based on MCD: it provably has smaller errors in parameter estimation under some Gaussian assumptions. To improve RoG further, we observe that RoG built from low-level features can often be more effective since DNNs tend to have similar hidden features at early layers, regardless of whether they are trained with clean or noisy labels (Arpit et al., 2017; Morcos et al., 2018). Based on these observations, we finally propose an ensemble version of RoG to incorporate the effects of both low and high layers.
We demonstrate the effectiveness of RoG using modern neural architectures on image classification and natural language processing tasks. In all tested cases, our methods (e.g., see green and blue lines in Figure 1(a)) significantly outperform the softmax classifier, although they use the same feature representations trained on the noisy dataset. In particular, we demonstrate that RoG can be used to further improve various prior training methods (Reed et al., 2014; Patrini et al., 2017; Ma et al., 2018; Han et al., 2018b; Hendrycks et al., 2018) which are specialized to handle the noisy environment. For example, we improve the test accuracy of the state-of-the-art training method (Han et al., 2018b) on the CIFAR-100 dataset with 45% noisy labels from 33.34% to 43.02%. Finally, RoG is shown to work properly against more semantic noisy labels (generated from a machine labeler) and open-set noisy labels (Wang et al., 2018), i.e., noisy samples from the open world.

One of the major directions for handling noisy labels is utilizing estimated/corrected labels during training: Reed et al. (2014) proposed a bootstrapping method which trains deep models with new labels generated by a convex combination of the raw (noisy) labels and their predictions, and Ma et al. (2018) improved the bootstrapping method by utilizing the dimensionality of subspaces during training. Patrini et al. (2017) modified the loss and posterior distribution to eliminate the influence of noisy labels, and Hendrycks et al. (2018) improved such a loss correction method by utilizing the information from data with true class labels. Another promising direction has focused on training on selected (cleaner) samples: Jiang et al. (2018) introduced a meta-learning model, called MentorNet, and Han et al. (2018a) proposed a meta approach which can improve MentorNet. Ren et al. (2018) adaptively assigned weights to training samples based on their gradient directions. Malach & Shalev-Shwartz (2017) and Han et al. (2018b) proposed selection methods based on an ensemble of deep models. Adopting the above training methods might incur expensive back-and-forth costs. On the other hand, our generative inference method is very cheap and enjoys an orthogonal usage, i.e., ours can be easily applied to improve any of them.
Inducing a generative classifier (e.g., a mixture of Gaussians) on pretrained deep models has also been investigated for various purposes: Hermansky et al. (2000) propose Tandem approaches which induce a generative model on top of hidden features for speech recognition. More recently, by inducing the generative model, Lee et al. (2018) introduce the Mahalanobis distance-based confidence score for novelty detection. However, their methods use naive parameter estimation that assumes perfectly clean training labels, which can be highly influenced by outliers. We overcome the issue by using the MCD estimator.
In this section, we propose a novel inference method which obtains a robust posterior distribution from any softmax neural classifier pretrained on datasets with noisy labels. Our idea is inducing the generative classifier given hidden features of the deep model. We show the robustness of our method in terms of high breakdown points (Hampel, 1971), and generalization error (Durrant & Kabán, 2010). We also investigate the layer-wise characteristics of generative classifiers, and introduce an ensemble of them to improve its performance.
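Let $x \in \mathcal{X}$ be an input and $y \in \mathcal{Y} = \{1, \dots, C\}$ be its class label. Without loss of generality, suppose that a pretrained softmax neural classifier is given:

$$P(y = c \mid x) = \frac{\exp\left(w_c^\top f(x) + b_c\right)}{\sum_{c'} \exp\left(w_{c'}^\top f(x) + b_{c'}\right)},$$

where $w_c$ and $b_c$ are the weight and the bias of the softmax classifier for class $c$, and $f(\cdot) \in \mathbb{R}^d$ denotes the output of the penultimate layer of DNNs. Then, without any modification on the pretrained softmax neural classifier, we induce a generative classifier by assuming the class-conditional distribution follows the multivariate Gaussian distribution. In particular, we define $C$ Gaussian distributions with a tied covariance $\Sigma$, i.e., linear discriminant analysis (LDA) (Fisher, 1936), and a Bernoulli distribution for the class prior:

$$P(f(x) \mid y = c) = \mathcal{N}\left(f(x) \mid \mu_c, \Sigma\right), \qquad P(y = c) = \beta_c,$$

where $\mu_c$ is the mean of the multivariate Gaussian distribution and $\beta_c$ is the normalized prior for class $c$. We provide an analytic justification of the LDA (i.e., tied covariance) assumption in the supplementary material. Then, based on the Bayesian rule, we induce a new posterior different from the softmax one as follows:

$$P(y = c \mid f(x)) = \frac{P(y = c)\, P(f(x) \mid y = c)}{\sum_{c'} P(y = c')\, P(f(x) \mid y = c')} = \frac{\exp\left(\mu_c^\top \Sigma^{-1} f(x) - \frac{1}{2} \mu_c^\top \Sigma^{-1} \mu_c + \log \beta_c\right)}{\sum_{c'} \exp\left(\mu_{c'}^\top \Sigma^{-1} f(x) - \frac{1}{2} \mu_{c'}^\top \Sigma^{-1} \mu_{c'} + \log \beta_{c'}\right)}.$$

To estimate the parameters of the generative classifier, one can compute the sample class mean and covariance of training samples $\{(x_1, y_1), \dots, (x_N, y_N)\}$:

$$\bar{\mu}_c = \frac{1}{N_c} \sum_{i: y_i = c} f(x_i), \qquad \bar{\Sigma} = \frac{1}{N} \sum_{c} \sum_{i: y_i = c} \left(f(x_i) - \bar{\mu}_c\right)\left(f(x_i) - \bar{\mu}_c\right)^\top, \qquad \bar{\beta}_c = \frac{N_c}{N}, \tag{1}$$

where $N_c$ is the number of samples labeled to be class $c$.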
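The induced classifier and the naive sample estimator can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `fit_lda` and `lda_posterior` are hypothetical, `features` stands for penultimate-layer outputs that have already been extracted, and every class is assumed to have at least one sample.

```python
import numpy as np

def fit_lda(features, labels, num_classes):
    """Sample estimates of class means, tied covariance, and class priors."""
    n, d = features.shape
    means = np.zeros((num_classes, d))
    priors = np.zeros(num_classes)
    cov = np.zeros((d, d))
    for c in range(num_classes):
        fc = features[labels == c]        # assumes class c is non-empty
        means[c] = fc.mean(axis=0)
        priors[c] = len(fc) / n
        centered = fc - means[c]
        cov += centered.T @ centered      # accumulate within-class scatter
    return means, cov / n, priors

def lda_posterior(features, means, cov, priors):
    """Posterior P(y = c | f(x)) under the tied-covariance Gaussian model."""
    precision = np.linalg.pinv(cov)       # pseudo-inverse for numerical safety
    weights = means @ precision           # row c equals (Sigma^{-1} mu_c)^T
    biases = -0.5 * np.sum(weights * means, axis=1) + np.log(priors)
    logits = features @ weights.T + biases
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the exponentials
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```

Prediction is then `lda_posterior(test_features, means, cov, priors).argmax(axis=1)`; the pretrained network itself is left untouched.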
However, one can expect that the naive sample estimator (1) can be highly influenced by outliers (i.e., training samples with noisy labels). In order to improve the robustness, we propose the so-called Robust Generative classifier (RoG), which utilizes the minimum covariance determinant (MCD) estimator (Rousseeuw & Driessen, 1999) to estimate its parameters. For each class $c$, the main idea of MCD is finding a subset $X_{K_c}$ for which the determinant of the corresponding sample covariance is minimized:

$$\min_{X_{K_c} \subset X_c} \det\left(\widehat{\Sigma}_c\right) \quad \text{subject to} \quad |X_{K_c}| = K_c, \tag{2}$$

where $X_c$ is the set of training samples labeled to be class $c$, $\widehat{\Sigma}_c$ is the sample covariance of $X_{K_c}$, and $0 < K_c \le N_c$ is a hyperparameter. Then, only using the samples in $\bigcup_c X_{K_c}$, it estimates the parameters, i.e., $\widehat{\mu}_c, \widehat{\Sigma}, \widehat{\beta}_c$, of the generative classifier, by following (1). Such a new estimator can be more robust by removing the outliers which might be widely scattered in datasets (see Figure 1(c)).
The robustness of the MCD estimator has been justified in the literature: it is known to have a near-optimal breakdown point (Hampel, 1971), where the breakdown point is the smallest fraction of data points that need to be replaced by arbitrary values (i.e., outliers) to fool the estimator completely. Formally, denote $X_{N,m}$ as a set obtained by replacing $m$ data points of set $X_N$ by some arbitrary values. Then, for a multivariate mean estimator $\mu = \mu(X_N)$ from $X_N$, the breakdown point is defined as follows (see the supplementary material for more detailed explanations including the breakdown point of the covariance estimator):

$$\varepsilon^*(\mu, X_N) = \frac{1}{N} \min\left\{ m \in [N] : \sup_{X_{N,m}} \left\| \mu(X_N) - \mu(X_{N,m}) \right\| = \infty \right\},$$

where the set $\{1, \dots, n\}$ is denoted by $[n]$ for positive integer $n$. While the breakdown point of the naive sample estimator is 0%, the MCD estimator for the generative classifier under the LDA assumption is known to attain the near-optimal breakdown value of $\lfloor (N - d + 1)/2 \rfloor / N$ (Lopuhaa et al., 1991). Inspired by this fact, we choose the default value of $K_c$ in (2) by $\lfloor (N_c + d + 1)/2 \rfloor$.
We also establish the following theoretical support that the MCD-based generative classifier (i.e., RoG) can have smaller errors in parameter estimation, compared to the naive sample estimator, under some assumptions made for analytic tractability.
Assume the following:
For (clean) sample of correct label, the classconditional distribution of hidden feature of DNNs has mean and tied covariance matrix . For (outlier) sample of incorrect label, the distribution of hidden feature has mean and covariance matrix , where
is the identity matrix.
All classes have the same number of samples (i.e., ), the same fraction of outliers, and the sample fraction of samples selected by MCD estimator.
The outliers are widely scattered such that .
The number of outliers is not too large such that and .
Let and be the outputs of the MCD and sample estimators, respectively. Then, for all , converge almost surely to their expectation as , and it holds that
(3)  
(4) 
where .
The proof of the above theorem is given in the supplementary material, where it is built upon the fact that the determinants can be expressed as the th degree polynomial of outlier ratio under the assumptions. We note that one might enforce the assumptions of the diagonal covariance matrices in to hold under an affine translation of hidden features. In addition, the assumption in holds when is large enough. Nevertheless, we think most assumptions of Theorem 1 are not necessary to claim the superiority of RoG and it is an interesting future direction to explore to relax them.
Under the assumption that the class-conditional distributions of features are Gaussian, the generalization error of a generative classifier is known to be bounded as follows (Durrant & Kabán, 2010):
for some constant . Therefore, (3) and (4) together imply that utilizing the MCD estimator provides a better generalization bound, compared to the sample estimator.
Even though the MCD estimator has several advantages, the optimization (2) is computationally intractable (i.e., NP-hard) to solve (Bernholt, 2006). To handle this issue, we aim to compute its approximate solution, by following the idea of Hubert & Van Driessen (2004). We design a two-step scheme as follows: (a) obtain the mean and covariance, i.e., $\widehat{\mu}_c, \widehat{\Sigma}_c$, using Algorithm 1 for each class $c$, and (b) compute the tied covariance $\widehat{\Sigma}$ from the per-class covariances. In other words, we apply the MCD estimator for each class, and combine the individual covariances into a single one due to the tied covariance assumption of LDA. Even though finding the optimal solution of the MCD estimator under a single Gaussian distribution is still intractable, Algorithm 1 can produce a locally optimal solution since it monotonically decreases the determinant under any random initial subset (Rousseeuw & Driessen, 1999). We fix the number of iterations of Algorithm 1 in our experiments since additional iterations would not improve the results significantly.
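The two-step scheme can be illustrated with the concentration-step idea of Rousseeuw & Driessen (1999). The sketch below is not a transcription of Algorithm 1, which is not reproduced in this text. Several choices are assumptions made for illustration: the helper names, the default `max_iter=2`, the subset size `(N_c + d + 1) // 2`, and the subset-size weighting used to tie the covariances.

```python
import numpy as np

def mean_and_cov(x):
    """Sample mean and (biased) sample covariance of the rows of x."""
    mean = x.mean(axis=0)
    centered = x - mean
    return mean, centered.T @ centered / len(x)

def approx_mcd(x, k, max_iter=2, seed=0):
    """Approximate MCD for one class via concentration steps."""
    rng = np.random.default_rng(seed)
    subset = rng.choice(len(x), size=k, replace=False)   # random initial subset
    mean, cov = mean_and_cov(x[subset])
    logdet = np.linalg.slogdet(cov)[1]
    for _ in range(max_iter):
        diff = x - mean
        # squared Mahalanobis distance of every sample to the current estimate
        dist = np.einsum("ij,jk,ik->i", diff, np.linalg.pinv(cov), diff)
        candidate = np.argsort(dist)[:k]                 # keep the k closest samples
        new_mean, new_cov = mean_and_cov(x[candidate])
        new_logdet = np.linalg.slogdet(new_cov)[1]
        if new_logdet >= logdet:                         # determinant stopped decreasing
            break
        subset, mean, cov, logdet = candidate, new_mean, new_cov, new_logdet
    return subset, mean, cov

def robust_tied_params(features, labels, num_classes, max_iter=2):
    """Per-class approximate MCD, then one tied covariance (assumed weighting)."""
    d = features.shape[1]
    means, covs, sizes = [], [], []
    for c in range(num_classes):
        xc = features[labels == c]
        k = min(len(xc), (len(xc) + d + 1) // 2)         # assumed default subset size
        _, mean, cov = approx_mcd(xc, k, max_iter=max_iter, seed=c)
        means.append(mean)
        covs.append(cov)
        sizes.append(k)
    weights = np.array(sizes, dtype=float) / sum(sizes)
    tied = np.tensordot(weights, np.stack(covs), axes=1)  # weighted average of covariances
    return np.stack(means), tied
```

Each concentration step keeps the `k` samples closest to the current estimate in Mahalanobis distance, so samples with corrupted labels that lie far from the class cluster tend to be dropped after the first step.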
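Table 1. Test accuracy (%) on CIFAR-10 for different inference methods, with and without the ensemble.

| Model | Inference method | Ensemble | Clean | Uniform (20%) | Uniform (40%) | Uniform (60%) |
| --- | --- | --- | --- | --- | --- | --- |
| DenseNet | Softmax | - | 94.11 | 81.01 | 72.34 | 55.42 |
| DenseNet | Generative + sample | - | 94.18 | 85.12 | 76.75 | 60.14 |
| DenseNet | Generative + sample | ✓ | 93.97 | 87.40 | 81.27 | 69.81 |
| DenseNet | Generative + MCD (ours) | - | 94.22 | 86.54 | 80.27 | 67.67 |
| DenseNet | Generative + MCD (ours) | ✓ | 94.18 | 87.41 | 81.83 | 75.45 |
| ResNet | Softmax | - | 94.76 | 80.88 | 61.98 | 39.96 |
| ResNet | Generative + sample | - | 94.80 | 82.97 | 65.92 | 42.76 |
| ResNet | Generative + sample | ✓ | 94.82 | 83.36 | 68.57 | 46.45 |
| ResNet | Generative + MCD (ours) | - | 94.76 | 83.86 | 68.03 | 44.87 |
| ResNet | Generative + MCD (ours) | ✓ | 94.68 | 84.62 | 75.28 | 54.57 |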
To further improve the performance of our method, we consider the ensemble of generative classifiers not only from the penultimate features but also from other low-level features in DNNs. Formally, given training data, we extract the $\ell$-th hidden features of DNNs, denoted by $f_\ell(x)$, and compute the corresponding parameters of a generative classifier (i.e., $\widehat{\mu}_{\ell,c}$ and $\widehat{\Sigma}_\ell$) using the (approximated version of the) MCD estimator. Then, the final posterior distribution is obtained by the weighted sum of all posterior distributions of generative classifiers:

$$P(y = c \mid x) = \sum_{\ell} \alpha_\ell\, P\left(y = c \mid f_\ell(x)\right),$$

where $\alpha_\ell$ is an ensemble weight at the $\ell$-th layer. In our experiments, we choose the weight of each layer by optimizing the negative log-likelihood (NLL) loss over the validation set. One can expect that this natural scheme can bring an extra gain in performance due to ensemble effects.
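A minimal sketch of the ensemble step follows, assuming the per-layer posteriors on the validation set are already stacked into an array of shape (layers, samples, classes). The softmax parameterization that keeps the weights on the probability simplex, the plain gradient descent, and the names `ensemble_posterior` and `fit_ensemble_weights` are illustrative assumptions; the text only states that the weights minimize the NLL on the validation set.

```python
import numpy as np

def ensemble_posterior(layer_posteriors, weights):
    """Weighted sum over layers; layer_posteriors has shape (L, N, C)."""
    return np.tensordot(weights, layer_posteriors, axes=1)

def fit_ensemble_weights(layer_posteriors, labels, steps=500, lr=0.5):
    """Choose layer weights by gradient descent on the validation NLL."""
    num_layers, n, _ = layer_posteriors.shape
    # probability each layer assigns to the given validation label: shape (L, N)
    p_true = layer_posteriors[:, np.arange(n), labels]
    theta = np.zeros(num_layers)                 # unconstrained parameters
    for _ in range(steps):
        alpha = np.exp(theta - theta.max())
        alpha /= alpha.sum()                     # weights on the simplex (assumption)
        mix = alpha @ p_true + 1e-12             # ensemble probability of the label
        grad_alpha = -(p_true / mix).mean(axis=1)            # d NLL / d alpha
        grad_theta = alpha * (grad_alpha - alpha @ grad_alpha)  # chain rule through softmax
        theta -= lr * grad_theta
    alpha = np.exp(theta - theta.max())
    return alpha / alpha.sum()
```

With simplex weights the ensemble output is itself a valid distribution over classes, which is why this parameterization is convenient here.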
Table 2. Test accuracy (%) of the softmax classifier and RoG for all networks and datasets. Each cell reports Softmax / RoG.

| Noise type (%) | ResNet, CIFAR-10 | ResNet, CIFAR-100 | ResNet, SVHN | DenseNet, CIFAR-10 | DenseNet, CIFAR-100 | DenseNet, SVHN |
| --- | --- | --- | --- | --- | --- | --- |
| Clean | 94.76 / 94.68 | 76.81 / 76.97 | 95.96 / 96.09 | 94.11 / 94.18 | 75.69 / 72.67 | 96.59 / 96.18 |
| Uniform (20%) | 80.88 / 84.62 | 64.43 / 68.29 | 83.52 / 91.67 | 81.01 / 87.41 | 61.72 / 64.29 | 86.92 / 89.50 |
| Uniform (40%) | 61.98 / 75.28 | 48.62 / 60.76 | 72.89 / 87.16 | 72.34 / 81.83 | 50.89 / 55.68 | 81.91 / 85.71 |
| Uniform (60%) | 39.96 / 54.57 | 27.57 / 48.42 | 61.23 / 80.52 | 55.42 / 75.45 | 38.33 / 44.12 | 71.18 / 77.67 |
| Flip (20%) | 79.65 / 88.73 | 65.14 / 73.37 | 85.49 / 93.00 | 79.18 / 91.23 | 65.37 / 69.03 | 95.04 / 94.86 |
| Flip (40%) | 58.13 / 61.56 | 46.61 / 66.71 | 65.88 / 87.96 | 56.29 / 86.42 | 46.04 / 69.38 | 88.83 / 93.57 |
In this section, we demonstrate the effectiveness of the proposed method using deep neural networks on various vision and natural language processing tasks. We provide more detailed experimental setups in the supplementary material. The code will be publicly available in the final draft.
For evaluation, we apply the proposed method to deep neural networks including DenseNet (Huang & Liu, 2017) and ResNet (He et al., 2016) for the classification tasks on CIFAR (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), Twitter Part of Speech (Gimpel et al., 2011), and Reuters (Lewis et al., 2004) datasets with noisy labels. Following the setups of (Ma et al., 2018; Han et al., 2018b), we first consider two types of random noisy labels: corrupting a label to another class uniformly at random (uniform) and corrupting a label only to a specific class (flip). Our method is also evaluated on semantic noisy labels from a machine classifier and open-set noisy labels (Wang et al., 2018).
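The two random noise types can be generated as in the sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, the corrupted fraction is applied over the whole label array rather than per class, and the class-to-class `flip_map` is a placeholder, since the specific flip targets are not given in this text.

```python
import numpy as np

def uniform_noise(labels, num_classes, rate, rng):
    """Corrupt a `rate` fraction of labels to a different class chosen uniformly."""
    noisy = labels.copy()
    num_noisy = int(round(rate * len(labels)))
    idx = rng.choice(len(labels), size=num_noisy, replace=False)
    # a non-zero offset modulo num_classes always lands on another class
    offsets = rng.integers(1, num_classes, size=num_noisy)
    noisy[idx] = (labels[idx] + offsets) % num_classes
    return noisy

def flip_noise(labels, flip_map, rate, rng):
    """Corrupt a `rate` fraction of labels to one fixed target class per source class."""
    noisy = labels.copy()
    num_noisy = int(round(rate * len(labels)))
    idx = rng.choice(len(labels), size=num_noisy, replace=False)
    noisy[idx] = flip_map[labels[idx]]           # flip_map[c] is the target of class c
    return noisy
```

For example, `flip_map = (np.arange(10) + 1) % 10` sends every class to its successor, which is one hypothetical choice of flip pattern.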
For ensembles of generative classifiers, we induce the generative classifiers from basic blocks of the last dense (or residual) block of DenseNet (or ResNet), where ensemble weights of each layer are tuned on an additional validation set, which consists of 1000 images with noisy labels. Here, when learning the weights, we use only 500 samples out of 1000, chosen by the MCD estimator to remove the outliers (see the supplementary material for more details). The size of the feature maps of each convolutional layer is reduced by average pooling for computational efficiency: $F \times H \times W \to F \times 1$, where $F$ is the number of channels and $H \times W$ is the spatial dimension.
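The pooling step amounts to averaging each channel over its spatial positions. A minimal sketch, assuming a batch of feature maps stored channel-first with shape (N, F, H, W); the function name is hypothetical:

```python
import numpy as np

def pool_features(feature_maps):
    """Average over the spatial axes: (N, F, H, W) -> (N, F)."""
    return feature_maps.mean(axis=(2, 3))
```

The pooled vectors of dimension F are what the per-layer generative classifiers are fitted on, which keeps each covariance matrix at size F by F.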
Table 3. Test accuracy (%) of ResNet-44 trained with different training methods under uniform noise. Each cell reports Softmax / RoG.

| Dataset | Training method | Clean | Uniform (20%) | Uniform (40%) | Uniform (60%) |
| --- | --- | --- | --- | --- | --- |
| CIFAR-10 | Cross-entropy | 94.34 / 94.20 | 81.95 / 84.63 | 63.84 / 74.72 | 62.45 / 67.47 |
| CIFAR-10 | Bootstrap (hard) | 94.56 / 94.52 | 82.90 / 86.27 | 75.97 / 80.72 | 72.91 / 75.41 |
| CIFAR-10 | Bootstrap (soft) | 94.46 / 94.28 | 80.29 / 84.82 | 65.22 / 74.22 | 58.55 / 66.68 |
| CIFAR-10 | Forward | 94.53 / 94.52 | 85.80 / 86.84 | 77.95 / 79.87 | 72.56 / 74.75 |
| CIFAR-10 | Backward | 94.39 / 94.44 | 77.44 / 79.16 | 62.83 / 68.29 | 56.64 / 66.44 |
| CIFAR-10 | D2L | 94.55 / 94.29 | 88.89 / 89.00 | 86.68 / 87.00 | 76.83 / 77.92 |
| CIFAR-100 | Cross-entropy | 76.31 / 75.40 | 61.11 / 64.82 | 45.08 / 55.90 | 34.97 / 41.25 |
| CIFAR-100 | Bootstrap (hard) | 75.65 / 75.49 | 61.61 / 64.81 | 51.27 / 57.22 | 39.04 / 43.69 |
| CIFAR-100 | Bootstrap (soft) | 76.40 / 76.02 | 60.28 / 64.04 | 47.66 / 56.51 | 34.68 / 42.47 |
| CIFAR-100 | Forward | 75.84 / 75.93 | 63.73 / 66.02 | 53.03 / 57.69 | 41.28 / 45.28 |
| CIFAR-100 | Backward | 76.75 / 76.28 | 56.24 / 62.13 | 37.70 / 50.23 | 23.55 / 37.18 |
| CIFAR-100 | D2L | 76.13 / 75.93 | 71.90 / 72.09 | 63.61 / 64.85 | 9.51 / 40.57 |
| SVHN | Cross-entropy | 96.38 / 96.41 | 83.45 / 91.14 | 60.86 / 80.36 | 38.29 / 54.99 |
| SVHN | Bootstrap (hard) | 96.40 / 96.12 | 83.43 / 91.98 | 74.25 / 86.83 | 66.51 / 77.14 |
| SVHN | Bootstrap (soft) | 96.51 / 96.10 | 86.43 / 90.84 | 58.22 / 79.90 | 44.52 / 62.52 |
| SVHN | Forward | 96.36 / 96.00 | 88.21 / 91.99 | 80.35 / 86.49 | 82.16 / 84.99 |
| SVHN | Backward | 96.43 / 96.09 | 87.00 / 87.11 | 72.02 / 73.32 | 50.54 / 64.01 |
| SVHN | D2L | 96.49 / 96.37 | 92.31 / 93.58 | 94.46 / 94.68 | 92.87 / 93.25 |
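Table 4. Test accuracy (%) of a 9-layer CNN trained with different training methods. RoG is applied to the model pretrained by Co-teaching.

| Dataset | Noise type (%) | Cross-entropy | Decoupling | MentorNet | Co-teaching | Co-teaching + RoG |
| --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | Flip (45%) | 49.50 | 48.80 | 58.14 | 71.17 | 71.26 |
| CIFAR-10 | Uniform (50%) | 48.87 | 51.49 | 71.10 | 74.12 | 76.67 |
| CIFAR-10 | Uniform (20%) | 76.25 | 80.44 | 80.76 | 82.13 | 84.32 |
| CIFAR-100 | Flip (45%) | 31.99 | 26.05 | 31.60 | 33.34 | 43.18 |
| CIFAR-100 | Uniform (50%) | 25.21 | 25.80 | 39.00 | 41.49 | 45.42 |
| CIFAR-100 | Uniform (20%) | 47.55 | 44.52 | 52.13 | 54.27 | 58.16 |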
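Table 5. Test accuracy (%) of 2-layer FCNs on NLP datasets under uniform noise. Each cell reports Softmax / RoG.

| Dataset | Training method | Clean | Uniform (20%) | Uniform (40%) | Uniform (60%) |
| --- | --- | --- | --- | --- | --- |
| Twitter | Cross-entropy | 87.47 / 85.28 | 79.13 / 81.66 | 66.74 / 79.37 | 50.83 / 73.65 |
| Twitter | Forward (gold) | 78.07 / 83.59 | 72.97 / 81.60 | 64.55 / 78.24 | 51.59 / 72.33 |
| Twitter | GLC | 83.47 / 84.68 | 66.09 / 81.66 | 59.72 / 79.00 | 53.14 / 72.93 |
| Reuters | Cross-entropy | 95.88 / 94.77 | 87.74 / 92.83 | 76.54 / 82.20 | 57.49 / 64.98 |
| Reuters | Forward (gold) | 94.57 / 94.75 | 88.44 / 93.24 | 77.85 / 82.56 | 61.01 / 66.56 |
| Reuters | GLC | 95.97 / 94.91 | 81.45 / 92.75 | 73.40 / 83.82 | 59.21 / 67.91 |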
We first evaluate the performance of generative classifiers with various assumptions: identity covariance (Euclidean) and tied covariance (LDA). In the case of identity covariance, we also apply a robust estimator called the least trimmed square (LTS) estimator (Rousseeuw, 1984), which finds a subset with the smallest error and computes the sample mean from it, i.e., $\min_{X_{K_c} \subset X_c} \sum_{x_i \in X_{K_c}} \| f(x_i) - \widehat{\mu}_c \|_2^2$. Figure 2(a) reports the test set accuracy of the softmax and generative classifiers on features extracted from the penultimate layer using ResNet-34 trained on the CIFAR-10 dataset with the uniform noise type. First, one can observe that the generative classifiers with the LDA assumption (blue and purple bars) generalize better from noisy labels than the softmax (red bar) and the generative classifiers with identity covariance (orange and green bars). Here, we remark that they still provide classification accuracy comparable to the softmax classifier when the model is trained on the clean dataset (i.e., noise = 0%). On top of that, by utilizing the MCD estimator, the classification accuracy (blue bar) is further improved compared to that obtained with only the naive sample estimator (purple bar). This is because the MCD estimator indeed selects the training samples with clean labels, as shown in Figure 2(b). The above results justify the proposed generative classifier, in comparison with other alternatives.

Next, to confirm that the ensemble approach is indeed effective, we measure the classification accuracy of generative classifiers from different basic blocks of ResNet-34. First, we found that the performance of the generative classifiers from low-level features is more stable, while the accuracy of the generative classifier from the penultimate layer significantly decreases as the noisy fraction increases, as shown in Figure 2(c). We expect that this is because the dimension (i.e., number of channels) of low-level features is usually smaller than that of high-level features. Since the breakdown point of MCD is inversely proportional to the feature dimension, the generative classifiers from low-level features can be more robust. This also coincides with the prior observation in the literature (Morcos et al., 2018) that DNNs tend to have similar hidden features at early layers, regardless of whether they are trained with clean or noisy labels. Since the generative classifiers from low-level features are more stable, the ensemble method significantly improves the classification accuracy, as shown in Table 1. Finally, Table 2 reports the classification accuracy for all networks and datasets, where the proposed method outperforms the softmax classifier in nearly all noisy settings.
We compare the performance of the standard softmax classifier and RoG when they are combined with various other training methods for noisy environments; more detailed explanations of the training methods are given in the supplementary material. First, we consider the following methods that require training only a single network: hard/soft bootstrapping (Reed et al., 2014), forward/backward (Patrini et al., 2017), and D2L (Ma et al., 2018). Following the same experimental setup as Ma et al. (2018) (the code is available at https://github.com/xingjunm/dimensionalitydrivenlearning), we use ResNet-44 and only consider uniform noise of various levels. Table 3 shows the classification accuracy of the softmax classifier and the ensemble version of RoG. Note that RoG improves the classification accuracy over the softmax classifier in every noisy setting, where the gains due to ours are more significant than those due to other special training methods.
We also consider the following methods that require training multiple networks, i.e., an ensemble of classifiers or a meta-learning model: Decoupling (Malach & Shalev-Shwartz, 2017), MentorNet (Jiang et al., 2018) and Co-teaching (Han et al., 2018b). Following the same experimental setup as Han et al. (2018b) (we used a reference implementation: https://github.com/bhanML/Coteaching), we use a 9-layer convolutional neural network (CNN), and consider the CIFAR-10 and CIFAR-100 datasets with uniform and flip noise. In this setup, we only apply RoG to a model pretrained by Co-teaching since it outperforms other training methods. As shown in Table 4, RoG with the Co-teaching method achieves the best performance in all tested cases.

Table 6. Test accuracy (%) of DenseNet-100 trained on semantic noisy labels produced by different label generators; the noisy fraction of each generator is given in parentheses. Each cell reports Softmax / RoG.

| Training method | CIFAR-10, DenseNet (32%) | CIFAR-10, ResNet (38%) | CIFAR-10, VGG (34%) | CIFAR-100, DenseNet (34%) | CIFAR-100, ResNet (37%) | CIFAR-100, VGG (37%) |
| --- | --- | --- | --- | --- | --- | --- |
| Cross-entropy | 67.24 / 68.33 | 62.26 / 64.15 | 68.77 / 70.04 | 50.72 / 61.14 | 50.68 / 53.09 | 51.08 / 53.64 |
| Bootstrap (hard) | 67.31 / 68.40 | 62.22 / 63.98 | 69.11 / 70.09 | 51.31 / 53.66 | 50.62 / 52.62 | 50.91 / 53.46 |
| Bootstrap (soft) | 67.17 / 68.38 | 62.15 / 64.03 | 69.28 / 70.11 | 50.57 / 54.71 | 50.68 / 53.30 | 51.41 / 53.76 |
| Forward | 67.46 / 68.20 | 61.96 / 64.24 | 68.95 / 70.09 | 50.59 / 53.91 | 51.04 / 53.36 | 51.05 / 53.63 |
| Backward | 67.31 / 68.66 | 62.40 / 63.45 | 69.04 / 70.18 | 50.54 / 54.01 | 50.30 / 53.03 | 51.15 / 53.50 |
| D2L | 66.91 / 68.57 | 59.10 / 60.25 | 57.97 / 59.94 | 5.00 / 31.67 | 23.71 / 39.92 | 40.97 / 45.42 |
We further apply our inference method to non-convolutional neural networks on natural language processing (NLP) tasks: text categorization on Reuters (Lewis et al., 2004), and part-of-speech (POS) tagging on Twitter POS (Gimpel et al., 2011). By following the same experimental setup as Hendrycks et al. (2018) (the code is available at https://github.com/mmazeika/glc), we train 2-layer fully connected networks (FCNs) using the forward (gold) and GLC (Hendrycks et al., 2018) methods. Here, forward (gold) is an augmented version of forward (Patrini et al., 2017) which estimates a corruption matrix using the trusted data. Note that both methods are designed to train a single network using a set of trusted data with golden clean labels (1% of training samples in our experiments). Hence, the setting is slightly different from what we have considered so far, but we run RoG (without utilizing the 1% knowledge of ground truth) for comparison. Table 5 shows that, even with this disadvantage, RoG can improve the performance over the baselines for these NLP datasets with noisy labels.
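Table 7. Test accuracy (%) of DenseNet-100 trained on CIFAR-10 with 60% open-set noise.

| Open-set data | Softmax | RoG |
| --- | --- | --- |
| CIFAR-100 | 79.01 | 83.37 |
| ImageNet | 86.88 | 87.05 |
| CIFAR-100 + ImageNet | 81.58 | 84.35 |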
In this section, our method is evaluated under more realistic noisy labels. First, in order to generate more semantically meaningful noisy labels, we train DenseNet-100, ResNet-34 and VGG-13 (Simonyan & Zisserman, 2015) using 5% and 20% of CIFAR-10 and CIFAR-100 training samples with clean labels, respectively. Then, we produce the labels of the remaining training samples based on their predictions, and train another DenseNet-100 with the pseudo-labeled samples. Figure 3(a) shows a confusion graph for pseudo-labels, obtained from ResNet-34 trained on 5% of CIFAR-10: each node corresponds to a class, and an edge from the node represents its most confusing class. Note that the weak classification system produces semantically noisy labels; e.g., “Cat” is confused with “Dog”, but not with “Car”. We remark that DenseNet and VGG also produce similar confusion graphs. Table 6 shows that RoG consistently improves the performance, while the gains due to other special training methods are not very significant.
Our final benchmark is the open-set noisy scenario (Wang et al., 2018). In this case, some training images are often from the open world and not relevant to the targeted classification task at all, i.e., out-of-distribution. However, they are still labeled as certain classes. For example, as shown in Figure 3(b), noisy samples like “chair” from CIFAR-100 and “door” from (downsampled) ImageNet (Chrabaszcz et al., 2017) can be labeled as “bird” for training, even though their true labels are not contained within the set of training classes in the CIFAR-10 dataset. In our experiments, open-set noisy datasets are built by replacing some training samples in CIFAR-10 by out-of-distribution samples, while keeping the labels and the number of images per class unchanged. We train DenseNet-100 on CIFAR-10 with 60% open-set noise. As shown in Table 7, our method achieves comparable or significantly better test accuracy than the softmax classifier.
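The construction of such an open-set noisy dataset can be sketched as follows. The function name and the uniform choice of which images to replace are assumptions for illustration; the text only specifies that some training images are replaced by out-of-distribution images while the labels stay as they are.

```python
import numpy as np

def add_open_set_noise(images, ood_images, rate, rng):
    """Replace a `rate` fraction of images with out-of-distribution images.

    The label array is not touched, so the replaced images keep the labels
    (and the per-class counts) of the images they overwrite.
    """
    noisy = images.copy()
    num_replace = int(round(rate * len(images)))
    idx = rng.choice(len(images), size=num_replace, replace=False)
    # sample with replacement only if there are too few out-of-distribution images
    picks = rng.choice(len(ood_images), size=num_replace,
                       replace=len(ood_images) < num_replace)
    noisy[idx] = ood_images[picks]
    return noisy, idx
```

Returning `idx` makes it possible to check afterwards how many of the replaced samples an outlier-removal step actually discards.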
We propose a new inference method for handling noisy labels. Our main idea is to induce a generative classifier on top of fixed features from the pretrained model. Such “deep generative classifiers” have been largely dismissed in fully supervised classification settings, as they are often substantially outperformed by discriminative deep classifiers (e.g., softmax classifiers). In contrast to this common belief, we show that it is possible to formulate a simple generative classifier that is significantly more robust to labeling noise without much sacrifice of discriminative performance in the clean labeling setting. We expect that our work will bring a refreshing perspective to other related tasks, e.g., memorization (Zhang et al., 2017), adversarial attacks (Szegedy et al., 2014), and semi-supervised learning (Oliver et al., 2018).

Gaussian discriminant analysis. In this section, we describe the basic concepts of the discriminative and generative classifiers (Ng & Jordan, 2002). Formally, denote the random variables of the input and label as $\mathbf{x} \in \mathcal{X}$ and $y \in \mathcal{Y} = \{1, \dots, C\}$, respectively. For the classification task, the discriminative classifier directly defines a posterior distribution $P(y \mid \mathbf{x})$, i.e., it learns a direct mapping between the input $\mathbf{x}$ and the label $y$. A popular discriminative model is the softmax classifier, which defines the posterior distribution as follows:

$$P(y = c \mid \mathbf{x}) = \frac{\exp\left(\mathbf{w}_c^\top \mathbf{x} + b_c\right)}{\sum_{c'} \exp\left(\mathbf{w}_{c'}^\top \mathbf{x} + b_{c'}\right)},$$

where $\mathbf{w}_c$ and $b_c$ are the weights and bias for class $c$, respectively. In contrast to the discriminative classifier, the generative classifier defines the class-conditional distribution $P(\mathbf{x} \mid y)$ and the class prior $P(y)$ in order to indirectly define the posterior distribution by specifying the joint distribution $P(\mathbf{x}, y) = P(y)\, P(\mathbf{x} \mid y)$. Gaussian discriminant analysis (GDA) is a popular method to define the generative classifier by assuming that the class-conditional distribution follows a multivariate Gaussian distribution and the class prior follows a Bernoulli distribution:

$$P(\mathbf{x} \mid y = c) = \mathcal{N}\left(\mathbf{x} \mid \mu_c, \Sigma_c\right), \qquad P(y = c) = \frac{\beta_c}{\sum_{c'} \beta_{c'}},$$

where $\mu_c$ and $\Sigma_c$ are the mean and covariance of the multivariate Gaussian distribution, and $\beta_c$ is the unnormalized prior for class $c$. This classifier has been studied in various machine learning areas (e.g., semi-supervised learning (Lasserre et al., 2006) and incremental learning (Lee et al., 2018)).

In this paper, we focus on the special case of GDA, also known as linear discriminant analysis (LDA). In addition to the Gaussian assumption, LDA further assumes that all classes share the same covariance matrix, i.e., $\Sigma_c = \Sigma$. Since the quadratic term cancels out under this assumption, the posterior distribution of the generative classifier can be represented as follows:

$$P(y = c \mid \mathbf{x}) = \frac{\exp\left(\mu_c^\top \Sigma^{-1} \mathbf{x} - \frac{1}{2} \mu_c^\top \Sigma^{-1} \mu_c + \log \beta_c\right)}{\sum_{c'} \exp\left(\mu_{c'}^\top \Sigma^{-1} \mathbf{x} - \frac{1}{2} \mu_{c'}^\top \Sigma^{-1} \mu_{c'} + \log \beta_{c'}\right)}.$$

One can note that the above form of the posterior distribution is equivalent to the softmax classifier by considering $\mu_c^\top \Sigma^{-1}$ and $-\frac{1}{2} \mu_c^\top \Sigma^{-1} \mu_c + \log \beta_c$ as its weight and bias, respectively. This implies that $\mathbf{x}$ might be fitted in a Gaussian distribution during the training of a softmax classifier.
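The equivalence between the LDA posterior and a softmax classifier described above can be checked numerically. The following is a minimal sketch, not the authors' code: the function names, the random toy parameters, and the use of NumPy are all assumptions made for illustration.

```python
import numpy as np

def lda_posterior(x, means, cov, beta):
    """Posterior from Gaussian class-conditionals that share one covariance."""
    prec = np.linalg.inv(cov)
    diffs = x - means                                   # shape (C, d)
    # Log-density up to a constant that is identical for every class.
    log_lik = -0.5 * np.einsum("cd,de,ce->c", diffs, prec, diffs)
    logits = log_lik + np.log(beta)
    logits -= logits.max()                              # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def softmax_from_lda(x, means, cov, beta):
    """Same posterior, written as a softmax with LDA-derived weights and biases."""
    prec = np.linalg.inv(cov)
    W = means @ prec                                    # row c is mu_c^T Sigma^{-1}
    b = -0.5 * np.einsum("cd,cd->c", W, means) + np.log(beta)
    logits = W @ x + b
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

# Toy parameters (hypothetical values, chosen only to exercise the functions).
rng = np.random.default_rng(0)
C, d = 4, 3
means = rng.normal(size=(C, d))
A = rng.normal(size=(d, d))
cov = A @ A.T + d * np.eye(d)                           # symmetric positive definite
beta = rng.uniform(0.5, 2.0, size=C)                    # unnormalized class priors
x = rng.normal(size=d)

p_generative = lda_posterior(x, means, cov, beta)
p_softmax = softmax_from_lda(x, means, cov, beta)
print(np.allclose(p_generative, p_softmax))
```

The two functions differ only in whether the quadratic term in the input is kept; since that term is shared by all classes under a tied covariance, it cancels in the normalization.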
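Breakdown points. The robustness of the MCD estimator can be explained by the fact that it has a high breakdown point (Hampel, 1971). Specifically, the breakdown point of an estimator measures the smallest fraction of observations that need to be replaced by arbitrary values to carry the estimate beyond all bounds. Formally, denote by $\mathcal{X}_m$ a set obtained by replacing $m$ data points of a set $\mathcal{X}$ of $n$ points by some arbitrary values. Then, for a multivariate mean estimator $\hat{\mu} = \hat{\mu}(\mathcal{X})$ from $\mathcal{X}$, the breakdown point is defined as follows (see Appendix A for more detailed explanations including the breakdown point of the covariance estimator):

$$\varepsilon^{*}(\hat{\mu}, \mathcal{X}) = \min_{1 \le m \le n} \left\{ \frac{m}{n} : \sup_{\mathcal{X}_m} \left\| \hat{\mu}(\mathcal{X}_m) - \hat{\mu}(\mathcal{X}) \right\| = \infty \right\}.$$

For a multivariate estimator of covariance $\hat{\Sigma}$, we have

$$\varepsilon^{*}(\hat{\Sigma}, \mathcal{X}) = \min_{1 \le m \le n} \left\{ \frac{m}{n} : \sup_{\mathcal{X}_m} \max_{i} \left| \log \lambda_i\big(\hat{\Sigma}(\mathcal{X}_m)\big) - \log \lambda_i\big(\hat{\Sigma}(\mathcal{X})\big) \right| = \infty \right\},$$

where the $i$-th largest eigenvalue of a general $d \times d$ matrix $A$ is denoted by $\lambda_i(A)$, such that $\lambda_1(A) \ge \dots \ge \lambda_d(A)$. This implies that we consider a covariance estimator to be broken whenever any of the eigenvalues can become arbitrarily large or arbitrarily close to 0.

We describe the details of all the experiments in Section 4. The code is available at [anonymized].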
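The notion of a breakdown point discussed earlier in this appendix can be illustrated with a small experiment. The sketch below is only an illustration under stated assumptions: it uses the coordinate-wise median as a stand-in robust estimator (not the MCD estimator), and the data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 2))            # 100 two-dimensional clean points

shifts = {}
for value in (1e3, 1e6, 1e9):
    corrupted = clean.copy()
    corrupted[0] = value                     # replace a single point by an arbitrary value
    mean_shift = np.linalg.norm(corrupted.mean(axis=0) - clean.mean(axis=0))
    median_shift = np.linalg.norm(
        np.median(corrupted, axis=0) - np.median(clean, axis=0)
    )
    shifts[value] = (mean_shift, median_shift)
    print(f"value={value:.0e}  mean shift={mean_shift:.3g}  median shift={median_shift:.3g}")
```

Replacing one point is enough to move the sample mean arbitrarily far, which corresponds to the smallest possible breakdown point, whereas the median moves only by a bounded amount.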
Detailed model architecture and datasets. We consider two state-of-the-art neural network architectures: DenseNet (Huang & Liu, 2017) and ResNet (He et al., 2016). For DenseNet, our model follows the same setup as in Huang & Liu (2017): 100 layers, growth rate $k = 12$ and dropout rate 0. Also, we use ResNet with 34 and 44 layers, filters = 64 and dropout rate 0 (the ResNet architecture is available at https://github.com/kuangliu/pytorchcifar). The softmax classifier is used, and each model is trained by minimizing the cross-entropy loss. We train DenseNet and ResNet for classifying the CIFAR-10 (or CIFAR-100) and SVHN datasets: the former consists of 50,000 training and 10,000 test images with 10 (or 100) image classes, and the latter consists of 73,257 training and 26,032 test images with 10 digits (we do not use the extra SVHN dataset for training). Following the experimental setup of Ma et al. (2018), all networks were trained using SGD with momentum 0.9, weight decay
and an initial learning rate of 0.1. The learning rate is divided by 10 after epochs 40 and 80 for CIFAR-10/SVHN (120 epochs in total), and after epochs 80, 120 and 160 for CIFAR-100 (200 epochs in total). For our method, we extract the hidden features at the {79, 89, 99}-th layers and the {27, 29, 31, 33}-th layers for DenseNet and ResNet, respectively. We assume a uniform class prior distribution.
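The step-wise learning rate schedule described above can be sketched as a small helper. This is an illustrative sketch rather than the training code: the function and constant names are hypothetical, and it assumes 0-indexed epochs, so that the rate drops once the stated number of epochs has been completed.

```python
def learning_rate(epoch, milestones, base_lr=0.1, factor=10.0):
    """Divide base_lr by `factor` once for every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)

CIFAR10_SVHN_MILESTONES = (40, 80)       # 120 epochs in total
CIFAR100_MILESTONES = (80, 120, 160)     # 200 epochs in total

for epoch in (0, 39, 40, 80, 119):
    print(epoch, learning_rate(epoch, CIFAR10_SVHN_MILESTONES))
```

In a PyTorch training loop the same schedule would typically be expressed with a multi-step scheduler; the plain function is used here only to keep the sketch free of dependencies.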
In Table 5, we evaluate RoG on NLP tasks using the Twitter and Reuters datasets: the former is a part-of-speech (POS) tagging task, and the latter is a text categorization task. The Twitter dataset consists of 14,619 training samples from 25 different classes, but some of the classes contain only a small number of training samples. We exclude the classes that contain fewer than 100 samples, which leaves 14,468 training samples and 7,082 test samples from 19 different classes. Similarly, we exclude the classes with fewer than 100 samples in the Reuters dataset, which then consists of 5,444 training samples and 2,179 test samples from 7 different classes. For training, we use 2-layer FCNs with ReLU nonlinearity and uniform noise on both the Twitter and Reuters datasets, and we extract the hidden features at both layers of the 2-layer FCNs.
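The class filtering step described above (dropping classes with fewer than 100 samples) can be sketched as follows. The helper name and the toy data are hypothetical, and the sketch assumes that class sizes are counted on the list of labels passed in.

```python
from collections import Counter

def drop_small_classes(samples, labels, min_size=100):
    """Keep only samples whose class has at least `min_size` members."""
    counts = Counter(labels)
    kept_classes = {c for c, n in counts.items() if n >= min_size}
    pairs = [(x, y) for x, y in zip(samples, labels) if y in kept_classes]
    kept_samples = [x for x, _ in pairs]
    kept_labels = [y for _, y in pairs]
    return kept_samples, kept_labels, kept_classes

# Toy data: class "b" has only 99 samples and is therefore removed.
labels = ["a"] * 150 + ["b"] * 99 + ["c"] * 100
samples = list(range(len(labels)))
kept_samples, kept_labels, kept_classes = drop_small_classes(samples, labels)
print(sorted(kept_classes), len(kept_samples))
```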
We also consider open-set noisy scenarios (Wang et al., 2018). Figure 4 shows examples of the open-set noisy dataset, which is built by replacing some training images in CIFAR-10 with external images from CIFAR-100 and Downsampled ImageNet (Chrabaszcz et al., 2017). The latter is equivalent to the ILSVRC 1,000-class ImageNet dataset (Deng et al., 2009), but with images downsampled to $32 \times 32$ resolution. In Table 7, we maintain the original labels of the CIFAR-10 dataset, while replacing 60% of the training data in CIFAR-10 with training data from CIFAR-100 and Downsampled ImageNet.
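The construction of the open-set noisy dataset can be sketched as below. This is a minimal illustration, not the authors' data pipeline: the function name, the random selection of which images to replace, and the tiny placeholder arrays are all assumptions.

```python
import numpy as np

def make_open_set_noise(images, labels, external_images, noise_fraction, seed=0):
    """Replace a fraction of images with external ones while keeping the original labels."""
    rng = np.random.default_rng(seed)
    n = len(images)
    n_noisy = int(round(noise_fraction * n))
    replace_idx = rng.choice(n, size=n_noisy, replace=False)
    # Assumes the external pool holds at least n_noisy images.
    source_idx = rng.choice(len(external_images), size=n_noisy, replace=False)
    noisy_images = images.copy()
    noisy_images[replace_idx] = external_images[source_idx]
    return noisy_images, labels.copy(), replace_idx

# Placeholder data: zeros stand in for in-distribution images, ones for external images.
images = np.zeros((10, 2, 2, 3))
labels = np.arange(10) % 3
external_images = np.ones((20, 2, 2, 3))
noisy_images, noisy_labels, replaced = make_open_set_noise(
    images, labels, external_images, noise_fraction=0.6
)
print(len(replaced), np.array_equal(noisy_labels, labels))
```

Because the labels are left untouched, every replaced image carries a label from a class it does not belong to, which is what makes the noise open-set.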
Validation. For our method, the ensemble weights are chosen by optimizing the NLL loss over the validation set. We assume that the validation set consists of 1000 images with noisy labels (of the same type and fraction as the training set). However, if we use all validation samples, the performance of our method can be affected by outliers. To mitigate this issue, we use only half of them, chosen by the MCD estimator. Specifically, we first compute the Mahalanobis distance for all validation samples using the parameters from the MCD estimator, and select the 500 samples with the smallest distances. One can then expect our ensemble method to be more robust against the noisy labels in the validation set. For the Twitter and Reuters datasets, the validation sets consist of 570 and 210 samples with noisy labels, respectively.
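The validation-sample selection step can be sketched as follows. This is a hedged illustration: `class_means` and `shared_cov` are hypothetical inputs standing in for the parameters produced by the MCD estimator, and the sketch assumes that each sample's distance is measured to the mean of its (possibly noisy) labeled class.

```python
import numpy as np

def select_by_mahalanobis(features, labels, class_means, shared_cov, keep=500):
    """Return indices of the `keep` samples closest to the mean of their labeled class."""
    prec = np.linalg.inv(shared_cov)
    diffs = features - class_means[labels]              # shape (n, d)
    dist2 = np.einsum("nd,de,ne->n", diffs, prec, diffs)  # squared Mahalanobis distance
    return np.argsort(dist2)[:keep]

# Toy example: two classes, with samples 3 and 15 carrying the wrong label.
rng = np.random.default_rng(0)
class_means = np.array([[0.0, 0.0], [10.0, 10.0]])
shared_cov = np.eye(2)
labels = np.array([0] * 10 + [1] * 10)
features = class_means[labels] + 0.1 * rng.normal(size=(20, 2))
features[[3, 15]] = class_means[[1, 0]]                 # place them at the other class mean
kept = select_by_mahalanobis(features, labels, class_means, shared_cov, keep=18)
print(sorted(kept.tolist()))
```

Samples whose features lie far from the robust mean of their labeled class are the ones most likely to be mislabeled, so they are ranked last and dropped.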
Training method for noisy label learning. We consider the following training methods for noisy label learning:
Hard bootstrapping (Reed et al., 2014): Training with new labels generated by a convex combination (the “hard” version) of the noisy labels and their predicted labels.
Soft bootstrapping (Reed et al., 2014): Training with new labels generated by a convex combination (the “soft” version) of the noisy labels and their predictions.
Backward (Patrini et al., 2017): Training via loss correction by multiplying the cross-entropy loss by a noise-aware correction matrix.
Forward (Patrini et al., 2017): Training with label correction by multiplying the network prediction by a noise-aware correction matrix.
Forward (Gold) (Hendrycks et al., 2018): An augmented version of Forward method which replaces its corruption matrix estimation with the identity on trusted samples.
GLC (Gold Loss Correction) (Hendrycks et al., 2018): Training with the corruption matrix which is estimated by using the trusted dataset.
D2L (Ma et al., 2018): Training with new labels generated by a convex combination of the noisy labels and their predictions, where its weights are chosen by utilizing the Local Intrinsic Dimensionality (LID).
Decoupling (Malach & Shalev-Shwartz, 2017): Updating the parameters using only the samples on which two classifiers make different predictions.
MentorNet (Jiang et al., 2018): An extra teacher network is pretrained and then used to select clean samples for its student network.
Co-teaching (Han et al., 2018b): A simple ensemble method where each network selects its small-loss training data and back-propagates on the training data selected by its peer network.
Cross-entropy: The conventional approach of training with the cross-entropy loss.
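To make the bootstrapping entries in the list above concrete, the targets built as a convex combination of noisy labels and model predictions can be sketched as follows. The function names and the mixing weight `beta` are assumptions made for illustration and are not taken from this paper.

```python
import numpy as np

def bootstrap_targets(noisy_onehot, probs, beta, hard):
    """Convex combination of the noisy labels and the model's own predictions."""
    if hard:
        # "Hard" version: use the one-hot vector of the predicted class.
        pred = np.eye(probs.shape[1])[probs.argmax(axis=1)]
    else:
        # "Soft" version: use the predicted class probabilities directly.
        pred = probs
    return beta * noisy_onehot + (1.0 - beta) * pred

def cross_entropy(targets, probs, eps=1e-12):
    """Mean cross-entropy between (possibly soft) targets and predictions."""
    return float(-(targets * np.log(probs + eps)).sum(axis=1).mean())

noisy_onehot = np.array([[1.0, 0.0, 0.0]])
probs = np.array([[0.2, 0.7, 0.1]])
soft = bootstrap_targets(noisy_onehot, probs, beta=0.8, hard=False)
hard = bootstrap_targets(noisy_onehot, probs, beta=0.8, hard=True)
print(soft, hard, cross_entropy(soft, probs))
```

With `beta = 0.8`, the soft target is `[0.84, 0.14, 0.02]` and the hard target is `[0.8, 0.2, 0.0]`, so part of the probability mass moves from the given label to the class the model currently prefers.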
Figure 5 shows the classification accuracy of the generative classifiers from different basic blocks of ResNet-34 (He et al., 2016) and DenseNet-100 (Huang & Liu, 2017). One can note that the generative classifiers from DenseNet and ResNet show different patterns due to their architecture designs. In the case of DenseNet, we found that it produces meaningful features after the 20th basic block.
In this section, we present a proof of Theorem 1, which consists of two statements: the limit of the estimation error (3) and the estimated error ratio (4). We prove the two statements one by one below. For convenience, we do not explicitly mention the Continuous Mapping Theorem (P. Billingsley, Convergence of Probability Measures, John Wiley & Sons, 1999), nor that the number of training samples goes to infinity, for each convergence in the proof.

We start with the following lemma, which shows the convergence of the sample and MCD estimators as the number of training samples goes to infinity.
Suppose we have number of dimensional training samples and contains outlier samples with the fixed fraction . We assume the outlier samples are from an arbitrary distribution with zero mean and finite covariance matrix , and the clean samples are from a distribution of the hidden features of DNNs with mean and covariance matrix . Let and be the mean and covariance matrix of sample estimator, and let and be the mean and covariance matrix of MCD estimator which selects samples from with the fixed fraction to optimize its objective (2). Then the mean and covariance matrix of sample estimator converge almost surely to below as
In addition, if and , the mean and covariance matrix of MCD estimator converge almost surely to below as
A proof of the lemma is given in appendix D.3, where it is built upon the fact that the determinant of covariance matrix with some assumptions can be expressed as the th degree polynomial of outlier ratio.
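As a reading aid, the limits of the sample estimator in the lemma are consistent with the standard moment identities for a two-component mixture. The symbols below are assumptions made for this sketch and are not necessarily the paper's notation: $\delta$ is the outlier fraction, $\mu$ and $\Sigma_{\mathrm{in}}$ are the mean and covariance of the clean samples, and $\Sigma_{\mathrm{out}}$ is the covariance of the zero-mean outlier distribution.

```latex
% Mixture: clean component with weight (1 - \delta), outlier component with weight \delta.
\begin{align*}
\mathbb{E}[x] &= (1-\delta)\,\mu + \delta \cdot 0 = (1-\delta)\,\mu, \\
\mathrm{Cov}[x]
  &= (1-\delta)\left(\Sigma_{\mathrm{in}} + \mu\mu^{\top}\right)
     + \delta\,\Sigma_{\mathrm{out}}
     - (1-\delta)^{2}\,\mu\mu^{\top} \\
  &= (1-\delta)\,\Sigma_{\mathrm{in}} + \delta\,\Sigma_{\mathrm{out}}
     + \delta(1-\delta)\,\mu\mu^{\top}.
\end{align*}
```

Under these assumptions, the sample mean is shrunk toward the outlier mean by the factor $(1-\delta)$, and the sample covariance is inflated both by the outlier covariance and by the rank-one term $\delta(1-\delta)\,\mu\mu^{\top}$.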
Lemma 1 states the convergences of sample and MCD estimators on a single distribution of hidden features of DNNs. Without loss of generality, one can assume the mean of outlier distribution is zero, i.e., by an affine translation of hidden features. Furthermore, one can extend Lemma 1 to number of distributions, which have the class mean and class covariance matrix on each class label with the assumptions . Then the class mean of MCD and sample estimators converge almost surely as follows:
which implies that
This completes the proof of the limit of estimation error (3).
Recall the class mean of MCD and sample estimators converge almost surely as follows:
Then one can derive the limits of the mean distances of the sample and MCD estimators as follows:
(5)  
(6) 
On the other hand, the assumptions state that all class covariance matrices are the same, i.e., . Then the tied covariance matrices and are given by gathering and on each class, respectively:
(7) 
From the tied covariance matrices (7) and Lemma 1, one can derive their convergence and limits as follows:
(8)  
(9) 
Next, we define a function of the tied covariance matrix as follows:
Then (9) clearly implies that . Since the condition number of is in , i.e., , one can deduce that for and that it is a monotonically decreasing function, by using the change of variables with . Hence has its maximum at , which implies
(10) 
Therefore the limits of mean distance of estimators (5), (6) and ratio of the function (10) hold the statement (4),
This completes the proof of Theorem 1.
In this part, we present a proof of Lemma 1. We show the almost sure convergence of the sample and MCD estimators as the number of training samples goes to infinity.
Proof of the convergence of the sample estimator. First of all, the set of training samples contains outlier samples with the fixed fraction . So, is from a mixture distribution . Then the mean and covariance matrix of the sample estimator, and , estimate the mean and covariance matrix of the mixture distribution , respectively. One can derive and directly as follows:
(11) 
Since has the finite covariance matrix, i.e.,
, one can apply the Strong Law of Large Numbers (W. Feller, An Introduction to Probability Theory and Its Applications, John Wiley & Sons, 1968) to the sample estimator of the mixture distribution . Then the mean and covariance matrix of the sample estimator converge almost surely to the mean and covariance matrix of , respectively:
This completes the proof of the convergence of the sample estimator.
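The convergence of the sample estimator on contaminated data can be checked with a small simulation. This is only a sketch under stated assumptions: the clean and outlier samples are drawn from isotropic Gaussians, and the values of `mu`, `sigma_in`, `sigma_out`, and `delta` are hypothetical choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, delta = 200_000, 3, 0.3                 # sample size, dimension, outlier fraction
mu = np.array([2.0, -1.0, 0.5])               # clean mean (outliers have zero mean)
sigma_in, sigma_out = 1.0, 2.0                # clean / outlier standard deviations

n_out = int(round(delta * n))                 # fixed fraction of outliers
outliers = sigma_out * rng.normal(size=(n_out, d))
clean = mu + sigma_in * rng.normal(size=(n - n_out, d))
x = np.concatenate([outliers, clean], axis=0)

sample_mean = x.mean(axis=0)
sample_cov = np.cov(x, rowvar=False)

# Mean and covariance of the corresponding two-component mixture.
limit_mean = (1 - delta) * mu
limit_cov = (
    ((1 - delta) * sigma_in**2 + delta * sigma_out**2) * np.eye(d)
    + delta * (1 - delta) * np.outer(mu, mu)
)

print(np.abs(sample_mean - limit_mean).max())
print(np.abs(sample_cov - limit_cov).max())
```

Both printed deviations shrink as `n` grows, which is the behavior the Strong Law of Large Numbers guarantees for the sample estimator.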
Proof of the convergence of the MCD estimator. Consider a collection of subsets with the size , where each subset contains outlier samples with the fraction . Then is from a mixture distribution . One can derive the mean and covariance matrix of the mixture distribution in the same way as (11):
(12) 
Thus sample mean estimator and covariance estimator of a subset converge almost surely to and respectively:
by the Strong Law of Large Numbers.
On the other hand, there is a subset in which is selected by the MCD estimator. Then the determinant of its covariance matrix is the minimum over all subsets of size in , and , as . Since the determinant is a continuous function, the Continuous Mapping Theorem (P. Billingsley, Convergence of Probability Measures, John Wiley & Sons, 1999) implies
and
By the assumption , is nonempty. It shows the existence of . From the covariance matrix (12), is a th degree polynomial of as follows:
Since the assumption gives , has the lower bound as follows:
Then for all and the equality holds only for . It implies and (13) is shown. Therefore the mean and covariance matrix of the MCD estimator converge almost surely to and , respectively:
This completes the proof of Lemma 1.
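The key step of the argument, that the determinant of the subset covariance is smallest when the subset contains no outliers, can be illustrated numerically. The sketch below makes assumptions beyond the text: both covariances are taken to be isotropic (`var_in` and `var_out` times the identity) with `var_out > var_in`, outliers have zero mean, and all names and values are hypothetical.

```python
import numpy as np

def mixture_cov(delta, mu, var_in, var_out):
    """Covariance of a mixture with outlier fraction `delta` (isotropic components)."""
    d = mu.shape[0]
    return (
        ((1 - delta) * var_in + delta * var_out) * np.eye(d)
        + delta * (1 - delta) * np.outer(mu, mu)
    )

mu = np.array([2.0, -1.0, 0.5])
var_in, var_out = 1.0, 4.0                      # outliers are more spread out than clean data
deltas = np.linspace(0.0, 1.0, 101)
dets = np.array(
    [np.linalg.det(mixture_cov(t, mu, var_in, var_out)) for t in deltas]
)
best_delta = deltas[dets.argmin()]
print(best_delta, dets.min())
```

Under these assumptions the determinant is minimized at an outlier fraction of zero, so an estimator that searches for the minimum-determinant subset is drawn toward subsets made of clean samples only.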