I Problem Statement
In the usual supervised learning setting, we have access to a set of $n$ labeled training images. Denote each image by $x_i$, and denote the class label for image $x_i$ by $y_i \in \{1, \dots, K\}$. Denote this ideal training set by $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$.
As discussed in the introduction, accurate labels are difficult to obtain for large datasets, so we suppose that we have access only to noisy labels, denoted by $y'_i$. Denote the noisy training set by $\mathcal{D}' = \{(x_1, y'_1), \dots, (x_n, y'_n)\}$.
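We assume a probabilistic model of label noise in which each noisy label $y'_i$ depends only on the true label $y_i$ and not on the image $x_i$. We further suppose that the noisy labels are i.i.d. conditioned on the true labels. That is, $y'_i$ and $y'_m$ are independent of each other given the true labels $y_i$ and $y_m$, and $\Pr(y'_i = k \mid y_i = j) = \Pr(y'_m = k \mid y_m = j)$ for all image pairs $i$ and $m$. We represent the conditional noise model by the column-stochastic matrix $\Phi \in \mathbb{R}^{K \times K}$:
$$\Pr(y'_i = k \mid y_i = j) = \phi_{kj}, \tag{1}$$
where $\phi_{kj}$ is the $(k,j)$th element of $\Phi$.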
In our simulations, we synthesize the noisy labels. For the standard datasets CIFAR-10 and MNIST, we fix a noise distribution $\Phi$ and create noisy labels for the training samples by drawing i.i.d. from the distribution specified by (1). We do not perturb the labels for the test samples.
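The label-synthesis step can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code: the function name `sample_noisy_labels`, the 0-indexed labels, and the 3-class matrix are hypothetical, and the matrix is read with column `j` holding the distribution of the noisy label given true label `j`.

```python
import random

def sample_noisy_labels(true_labels, phi, rng=None):
    """Draw one noisy label per true label from a column-stochastic matrix.

    phi[k][j] is read as Pr(noisy = k | true = j), so column j of phi is the
    distribution of the noisy label when the true label is j (0-indexed).
    """
    rng = rng if rng is not None else random.Random()
    num_classes = len(phi)
    classes = list(range(num_classes))
    noisy = []
    for j in true_labels:
        column_j = [phi[k][j] for k in range(num_classes)]
        noisy.append(rng.choices(classes, weights=column_j, k=1)[0])
    return noisy

# Hypothetical 3-class matrix: label 0 is never corrupted, label 1 always
# becomes 2, and label 2 is redistributed over all three classes.
phi = [[1.0, 0.0, 0.1],
       [0.0, 0.0, 0.8],
       [0.0, 1.0, 0.1]]
labels = [0, 1, 2, 1, 0]
noisy = sample_noisy_labels(labels, phi, rng=random.Random(0))
```

Only the training labels would be passed through such a routine; the test labels are left as they are.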
While the proposed method works for any $\Phi$, we use two parametric noise models in the sequel. First, we choose a noise level $p \in [0, 1]$, and we set
$$\Phi = (1 - p) I + \frac{p}{K} \mathbf{1} \mathbf{1}^T, \tag{2}$$
where $I$ is the $K \times K$ identity matrix and $\mathbf{1}$ is the all-ones column vector. That is, the noisy label is the true label with probability $1 - p$ and is drawn uniformly from $\{1, \dots, K\}$ with probability $p$. We call this the uniform noise model.
Second, we again choose a noise level $p$, and we set
$$\Phi = (1 - p) I + p \Delta, \tag{3}$$
where the columns of $\Delta \in \mathbb{R}^{K \times K}$ are drawn uniformly from the unit simplex, i.e. the set of vectors with nonnegative elements that sum to one. The matrix $\Delta$ is constant over a single instantiation of the noisy training set $\mathcal{D}'$. We call this the nonuniform noise model.
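The two noise models can be constructed as in the sketch below. It assumes the uniform model mixes the identity with a uniform redraw, and the nonuniform model mixes the identity with a matrix whose columns lie on the simplex, as described above. The function names are hypothetical. The uniform simplex draw uses normalized exponential variables, which is one standard way to sample a flat Dirichlet distribution.

```python
import random

def uniform_noise_matrix(num_classes, p):
    """Keep the true label w.p. 1 - p; otherwise redraw uniformly over classes."""
    K = num_classes
    return [[(1.0 - p) * (1.0 if k == j else 0.0) + p / K for j in range(K)]
            for k in range(K)]

def sample_simplex(num_classes, rng):
    """Uniform draw from the unit simplex via normalized Exp(1) variables."""
    draws = [rng.expovariate(1.0) for _ in range(num_classes)]
    total = sum(draws)
    return [d / total for d in draws]

def nonuniform_noise_matrix(num_classes, p, rng=None):
    """Keep the true label w.p. 1 - p; otherwise draw from a random column."""
    rng = rng if rng is not None else random.Random()
    K = num_classes
    # delta_columns[j][k] is entry (k, j) of the random simplex matrix.
    delta_columns = [sample_simplex(K, rng) for _ in range(K)]
    return [[(1.0 - p) * (1.0 if k == j else 0.0) + p * delta_columns[j][k]
             for j in range(K)] for k in range(K)]

def column_sums(matrix):
    """Sum of each column; every entry should be 1 for a valid noise model."""
    return [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]

phi_uniform = uniform_noise_matrix(10, 0.3)
phi_nonuniform = nonuniform_noise_matrix(10, 0.3, rng=random.Random(0))
```

Both constructions are column-stochastic by design, so `column_sums` serves as a quick sanity check before sampling labels from either matrix.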
I-A Learning Deep Networks with Noise Models
Our objective is to learn a deep network from the noisy training set $\mathcal{D}'$ that accurately classifies cleanly-labeled images. Our approach is to take a standard deep network, which we call the base model, and augment it with a noise model that accounts for label noise. Then, the base and noise models are learned jointly via stochastic gradient descent. The noise model has a role only during training: as the noise model is learned, it effectively denoises the labels during backpropagation, making it possible to learn a more accurate base model. After training, the noise model is disconnected, and test images are classified using the base model output.
We use two standard deep networks for the base model. The first is a deep convolutional network (CNN). It has three processing layers, with rectified linear units (ReLUs) and max- and average-pool operations between layers. The hyperparameters are similar to those used in the popular “AlexNet” architecture, described in [2]. The second model is a standard deep neural network (DNN), with three rectified linear processing layers.
We lump the base model parameters (processing layer weights and biases, etc.) into a single parameter vector $\theta$. Further, let $h(x; \theta) \in \mathbb{R}^K$ be the output vector of the final layer of the base model for input image $x$. Define the usual softmax function
$$\sigma(h)_k = \frac{\exp(h_k)}{\sum_{l=1}^{K} \exp(h_l)}, \quad k = 1, \dots, K. \tag{4}$$
Then, for test image $x$, the base model estimate of the distribution of the class label is
$$\hat{p}(y \mid x; \theta) = \sigma(h(x; \theta)). \tag{5}$$
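One approach to noisy labels is to use the base model without modification and treat $y'_i$ as the true label for $x_i$. Taking the standard cross-entropy loss, one can minimize the empirical risk
$$\mathcal{L}(\theta; \mathcal{D}') = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(y = y'_i \mid x_i; \theta) \tag{6}$$
$$\phantom{\mathcal{L}(\theta; \mathcal{D}')} = -\frac{1}{n} \sum_{i=1}^{n} \log \sigma(h(x_i; \theta))_{y'_i}. \tag{7}$$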
As shown in Section III, the base model alone offers satisfactory performance when the label noise is not too severe; otherwise the incorrect labels overwhelm the model, and it fails.
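The baseline of treating noisy labels as true labels can be sketched as follows, assuming the base model's final-layer outputs are already available as plain lists of scores. The names `softmax` and `base_loss` and the toy inputs are hypothetical; in practice the loss would be minimized over the network parameters by SGD rather than merely evaluated.

```python
import math

def softmax(h):
    """Numerically stable softmax of a list of scores."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

def base_loss(outputs, noisy_labels):
    """Average cross-entropy, treating the noisy labels as if they were true."""
    losses = [-math.log(softmax(h)[y]) for h, y in zip(outputs, noisy_labels)]
    return sum(losses) / len(losses)

# Hypothetical final-layer outputs for two images and their noisy labels.
outputs = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
noisy_labels = [0, 2]
loss = base_loss(outputs, noisy_labels)
```

Nothing in this loss distinguishes a corrupted label from a clean one, which is why heavy noise can overwhelm the model.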
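To motivate our approach, we first describe the method presented in [9]. Suppose momentarily that the true noise distribution, characterized by $\Phi$, is known. One can augment the base model with a linear noise model, with weight matrix equal to $\Phi$, as depicted in Figure 1(a). For this architecture, we can express the estimate of the distribution of the noisy class label as
$$\hat{p}(y' = k \mid x; \theta, \Phi) = \sum_{j=1}^{K} \phi_{kj} \, \hat{p}(y = j \mid x; \theta), \quad \text{i.e.} \tag{8}$$
$$\hat{p}(y' \mid x; \theta, \Phi) = \Phi \cdot \sigma(h(x; \theta)), \tag{9}$$
where $\cdot$ is standard matrix-vector multiplication. We can then minimize the empirical cross-entropy of the noisy labels directly:
$$\mathcal{L}(\theta; \Phi, \mathcal{D}') = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(y' = y'_i \mid x_i; \theta, \Phi) \tag{10}$$
$$\phantom{\mathcal{L}(\theta; \Phi, \mathcal{D}')} = -\frac{1}{n} \sum_{i=1}^{n} \log \left[ \Phi \cdot \sigma(h(x_i; \theta)) \right]_{y'_i}, \tag{11}$$
where $[\cdot]_k$ returns the $k$th element of a vector. Then, each test sample $x$ is classified according to the output of the base model, i.e. $\hat{p}(y \mid x; \theta)$. Because the noise model is known perfectly, one might expect that this approach gives the best possible performance. While it does provide excellent performance, in Section III we show that even better performance is possible in most cases.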
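A sketch of the known-noise architecture follows: the base model's class probabilities are multiplied by the noise matrix, and the cross-entropy is taken against the noisy labels. The helper names and the 3-class matrix are hypothetical, and the matrix is assumed column-stochastic with entry `(k, j)` giving the probability of noisy label `k` given true label `j`.

```python
import math

def softmax(h):
    """Numerically stable softmax of a list of scores."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Standard matrix-vector product for list-of-rows matrices."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def noisy_label_distribution(h, phi):
    """Predicted distribution of the noisy label: phi times softmax(h)."""
    return matvec(phi, softmax(h))

def known_noise_loss(outputs, noisy_labels, phi):
    """Average cross-entropy of the noisy labels under the linear noise layer."""
    losses = [-math.log(noisy_label_distribution(h, phi)[y])
              for h, y in zip(outputs, noisy_labels)]
    return sum(losses) / len(losses)

# Hypothetical 3-class uniform noise matrix with noise level 0.3.
phi = [[0.8, 0.1, 0.1],
       [0.1, 0.8, 0.1],
       [0.1, 0.1, 0.8]]
confident = [10.0, 0.0, 0.0]  # base model is nearly certain of class 0
dist = noisy_label_distribution(confident, phi)
```

Even when the base model is nearly certain of class 0, the predicted noisy-label distribution stays close to the first column of the matrix, so a mislabeled sample incurs a bounded loss instead of dragging the base model toward the wrong class.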
The noise model, however, is usually unknown. Furthermore, we do not know which labels are corrupted, so we cannot estimate a noise model directly. The authors of [9] suggested that one can estimate the noise probabilities $\Phi$ while simultaneously learning the base model parameters $\theta$. The challenge here is that convolutional networks are sufficiently expressive that the base model may fit the noisy labels directly and learn a trivial noise model. To prevent this, the authors of [9] add a regularization term that penalizes the trace of the estimate $\hat{\Phi}$ of $\Phi$. This encourages a diffuse noise model estimate and permits the base model to learn from denoised labels. The associated loss function is
$$\mathcal{L}(\theta, \hat{\Phi}; \mathcal{D}') = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(y' = y'_i \mid x_i; \theta, \hat{\Phi}) + \lambda \, \mathrm{tr}(\hat{\Phi}) \tag{12}$$
$$\phantom{\mathcal{L}(\theta, \hat{\Phi}; \mathcal{D}')} = -\frac{1}{n} \sum_{i=1}^{n} \log \left[ \hat{\Phi} \cdot \sigma(h(x_i; \theta)) \right]_{y'_i} + \lambda \, \mathrm{tr}(\hat{\Phi}), \tag{13}$$
where $\mathrm{tr}(\cdot)$ is the matrix trace, and $\lambda$ is a regularization parameter chosen via cross-validation. When minimizing (12), one must take care to project the estimate $\hat{\Phi}$ onto the space of stochastic matrices at every iteration, else it will not correspond to a meaningful model of label noise.
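The trace-penalized objective and the per-iteration normalization can be illustrated as below. This is a sketch under stated assumptions: the exact projection used in [9] is not specified here, so `renormalize_columns` is a simple clip-and-rescale heuristic standing in for it (it is not the Euclidean projection), and all function names are hypothetical.

```python
import math

def softmax(h):
    """Numerically stable softmax of a list of scores."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

def trace(matrix):
    """Sum of the diagonal entries of a square matrix."""
    return sum(matrix[k][k] for k in range(len(matrix)))

def trace_regularized_loss(outputs, noisy_labels, phi_hat, lam):
    """Noisy-label cross-entropy under phi_hat plus lam times its trace."""
    total = 0.0
    for h, y in zip(outputs, noisy_labels):
        probs = softmax(h)
        noisy_prob = sum(phi_hat[y][j] * probs[j] for j in range(len(probs)))
        total -= math.log(noisy_prob)
    return total / len(outputs) + lam * trace(phi_hat)

def renormalize_columns(matrix, eps=1e-12):
    """Clip negative entries and rescale each column to sum to one.

    Assumption: a stand-in for the projection step, chosen for simplicity.
    Columns that clip to all zeros are reset to the uniform distribution.
    """
    K = len(matrix)
    clipped = [[max(v, 0.0) for v in row] for row in matrix]
    sums = [sum(clipped[k][j] for k in range(K)) for j in range(K)]
    return [[clipped[k][j] / sums[j] if sums[j] > eps else 1.0 / K
             for j in range(K)] for k in range(K)]

# Hypothetical estimate after a gradient step that left the feasible set.
phi_hat = renormalize_columns([[2.0, -1.0], [2.0, 1.0]])
```

The penalty grows with the diagonal mass of the estimate, so a larger regularization weight pushes the learned noise model away from the identity.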
II Dropout Regularization
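We propose to augment the base model with a different noise architecture. As depicted in Figure 1(b), we add a softmax layer with square weight matrix $W \in \mathbb{R}^{K \times K}$, unconstrained. We interpret the output of this softmax layer, denoted $\hat{p}(y' \mid x; \theta, W)$, as the probability distribution over the noisy label $y'$. This results in the effective conditional probability distribution of the noisy label $y'$ conditioned on $y$:
$$\Pr(y' = k \mid y = j) = \left[ \sigma(W e_j) \right]_k, \tag{14}$$
where $e_j$ is the $j$th elementary vector. We use this architecture without loss of generality. Because the softmax function is invertible, there is a one-to-one relationship between noise distributions induced by $\Phi$ and (1) and those induced by $W$ and (14). For any $W$ and base model parameters $\theta$, the estimate of the distribution of the noisy class label is
$$\hat{p}(y' \mid x; \theta, W) = \sigma\left( W \cdot \hat{p}(y \mid x; \theta) \right) \tag{15}$$
$$\phantom{\hat{p}(y' \mid x; \theta, W)} = \sigma\left( W \cdot \sigma(h(x; \theta)) \right). \tag{16}$$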
This architecture offers two major advantages. First, the matrix $W$ is unconstrained during optimization. Because the softmax layer implicitly normalizes the resulting conditional probabilities, there is no need to normalize $W$ or force its entries to be nonnegative. This simplifies the optimization process by eliminating the projection step described above.
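The correspondence between an unconstrained weight matrix and a valid noise model can be checked numerically. The sketch below assumes that the conditional distribution for true label `j` is the softmax of column `j` of the weight matrix, as in (14). The function names are hypothetical, and taking the elementwise logarithm is just one choice of weights that reproduces a given strictly positive noise matrix.

```python
import math

def softmax(h):
    """Numerically stable softmax of a list of scores."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

def induced_noise_matrix(W):
    """Noise matrix whose column j is the softmax of column j of W."""
    K = len(W)
    columns = [softmax([W[k][j] for k in range(K)]) for j in range(K)]
    return [[columns[j][k] for j in range(K)] for k in range(K)]

def weights_from_noise_matrix(phi):
    """One W that induces phi: the elementwise log (entries must be > 0)."""
    return [[math.log(v) for v in row] for row in phi]

# Hypothetical strictly positive, column-stochastic noise matrix.
phi = [[0.8, 0.1, 0.2],
       [0.1, 0.7, 0.2],
       [0.1, 0.2, 0.6]]
W = weights_from_noise_matrix(phi)
phi_back = induced_noise_matrix(W)  # recovers phi up to rounding
```

Any real-valued weight matrix passed to `induced_noise_matrix` yields nonnegative columns that sum to one, which is why no projection is needed during training.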
Second, it is congruent with dropout regularization, which we apply to the output of the base model to prevent the base model from learning the noisy labels directly. Dropout is a well-established technique for preventing overfitting in deep learning [13]. It regularizes learning by introducing binary multiplicative noise during training. At each gradient step, the base model outputs are multiplied by random variables drawn i.i.d. from the Bernoulli distribution with parameter $q$. This “thins” out the network, effectively sampling from a different network for each gradient step.
Applying dropout to the base model output entails forming the effective weight matrix
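$$\tilde{W} = W \, \mathrm{diag}(r) \tag{17}$$
$$\phantom{\tilde{W}} = W \odot (\mathbf{1} r^T), \tag{18}$$
where $r \in \{0, 1\}^K$ has entries drawn i.i.d. from the Bernoulli distribution with parameter $q$ and $\odot$ represents the Hadamard (elementwise) product. We choose a different vector $r$ for each minibatch, i.e. each SGD step, in the training set. Again using the cross-entropy loss, the resulting loss function is
$$\mathcal{L}(\theta, W; \mathcal{D}') = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(y' = y'_i \mid x_i; \theta, \tilde{W}) \tag{19}$$
$$\phantom{\mathcal{L}(\theta, W; \mathcal{D}')} = -\frac{1}{n} \sum_{i=1}^{n} \log \left[ \sigma\left( \tilde{W} \cdot \sigma(h(x_i; \theta)) \right) \right]_{y'_i}. \tag{20}$$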
Observing the conditional distribution in (14), each instantiation of the multiplicative noise zeros out a fraction of the elements of $W$, forcing the associated probabilities to a baseline, uniform value. This forces the learning “action” onto the remaining probabilities, which encourages a nontrivial noise model. The Bernoulli parameter $q$ determines the sparsity of each instantiation. In our simulations, we find that a setting corresponding to an aggressively sparse model works best.
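The dropout step can be sketched as follows. Several details are assumptions made for illustration and are not confirmed by the text: the Bernoulli parameter is read as a keep probability (`keep_prob`), the mask is applied to the base model's output probabilities (equivalently, it zeros whole columns of the weight matrix), and no rescaling of the surviving entries is performed. All function names are hypothetical.

```python
import math
import random

def softmax(h):
    """Numerically stable softmax of a list of scores."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

def sample_mask(num_classes, keep_prob, rng):
    """One Bernoulli mask per minibatch; 1 keeps a unit, 0 drops it."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(num_classes)]

def effective_weights(W, mask):
    """Zero column j of W wherever mask[j] == 0."""
    return [[w_kj * m_j for w_kj, m_j in zip(row, mask)] for row in W]

def dropout_noisy_distribution(h, W, mask):
    """Softmax noise layer applied to the masked base model output."""
    base_probs = softmax(h)
    W_eff = effective_weights(W, mask)
    scores = [sum(w * p for w, p in zip(row, base_probs)) for row in W_eff]
    return softmax(scores)

def dropout_loss(outputs, noisy_labels, W, mask):
    """Average noisy-label cross-entropy for one minibatch and one mask."""
    losses = [-math.log(dropout_noisy_distribution(h, W, mask)[y])
              for h, y in zip(outputs, noisy_labels)]
    return sum(losses) / len(losses)

rng = random.Random(0)
W = [[2.0, 0.0], [0.0, 2.0]]
mask = sample_mask(2, keep_prob=0.5, rng=rng)  # redrawn for every minibatch
batch_loss = dropout_loss([[1.0, -1.0]], [0], W, mask)
```

When every unit is dropped, the scores are all zero and the noisy-label distribution is exactly uniform, which is the baseline value referred to above.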
The usual dropout procedure involves “averaging” together the different models when classifying samples by reducing the learned weights. In our setting, this is unnecessary. The noise model serves only as an intermediate step for denoising the noisy labels to train a more accurate base model. The noise model is disconnected at test time, and averaging is not performed.
III Experimental Results
In this section, we demonstrate the performance of the proposed method. We state results on two datasets (CIFAR-10 and MNIST), two noise models (uniform and nonuniform), and two base models (CNN and DNN). For training the CNN, we use the model architecture from the publicly available MATLAB toolbox [14]. Other than changing the size of the input units, we keep the model hyperparameters constant. For training the DNN, we use the architecture used in [10], which has rectified linear units in each layer. In each case, we present results for label noise probabilities $p \in \{0.3, 0.5, 0.7\}$, i.e. label noise that corrupts 30%, 50%, and 70% of the training samples. As mentioned earlier, we use the same dropout rate in all simulations. We train the CNN and DNN end-to-end using stochastic gradient descent with batch size 100. When training on the MNIST dataset, we perform early stopping, ceasing iterations when the loss function begins to increase. We emphasize that the loss function does not depend on the true labels, so choosing when to stop does not require knowledge of the uncorrupted dataset. MATLAB code for these simulations is available at [15].
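The early-stopping rule can be sketched as a scan over recorded loss values. This is an assumption-laden illustration: the text does not say how often the loss is evaluated or on which samples, so the function below simply takes a list of noisy-label loss values (hypothetical numbers here) and returns the index of the last evaluation before the first increase.

```python
def early_stopping_index(losses):
    """Index of the last evaluation before the loss first increases.

    If the loss never increases, the final index is returned.
    """
    if not losses:
        raise ValueError("losses must contain at least one value")
    for t in range(1, len(losses)):
        if losses[t] > losses[t - 1]:
            return t - 1
    return len(losses) - 1

# Hypothetical per-epoch losses computed from noisy labels only.
noisy_losses = [2.30, 1.90, 1.62, 1.65, 1.58]
stop_at = early_stopping_index(noisy_losses)
```

Because the losses are computed from the noisy labels alone, the stopping decision never consults the clean labels.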
III-A CIFAR Images
The CIFAR-10 dataset [16] is a subset of the Tiny Images dataset [17]. CIFAR-10 consists of 50,000 training images and 10,000 test images, each of which belongs to one of ten object categories, which are equally represented in the training and test sets. Each image has dimension $32 \times 32 \times 3$, where the latter dimension reflects the three color channels of the images.
First, we state results for the uniform noise model using the CNN. For $p \in \{0.3, 0.5, 0.7\}$, we choose $\Phi$ as indicated in (2). We corrupt the labels in the CIFAR-10 training set according to $\Phi$, and we leave the test labels uncorrupted. For reference, the CNN achieves 20.49% classification error when trained on the noise-free dataset.
We state the classification error over the test set in Table I. As a baseline, we present results for the base model, in which the noisy labels are treated as true labels and the model parameters are chosen to minimize the standard loss function in (6). We also present results for the true noise model, in which $\Phi$ is known, a linear noise layer with weights $\Phi$ is appended to the base model, and the model parameters are chosen to minimize the loss function in (10). Next, we present results for the proposed softmax architecture, first without regularization (referred to as “Softmax” in Table I) and then with the proposed dropout regularization (“Dropout”).
Finally, we compare to the results presented in [9] (“Trace”), in which a linear layer is added, but the label noise model is learned jointly with the base model parameters according to the trace-penalized loss function of (12). We emphasize that these results come with significant caveats. While the noise level and network architecture used here are the same as those of [9], the authors of [9] used a nonuniform noise model which we do not replicate in this paper. Therefore, these results are from a roughly comparable, but not strictly identical, noise scenario.
Table I: Classification error (%) on the CIFAR-10 test set for the CNN under uniform label noise.
Noise level  True noise  Base model  Softmax  Dropout  Trace ([9])
30%          25.76       29.78       26.04    24.43    26
50%          29.63       38.76       33.40    32.64    35
70%          36.24       48.34       37.10    33.00    63
In most cases, the proposed dropout method gives the best performance, even better than the true noise model, which supposes that $\Phi$ is known a priori. Only in the case of 50% noise does the true noise model outperform dropout. Note that even without dropout regularization, the proposed softmax noise model gives satisfactory performance, consistently outperforming the base model.
Because there is a one-to-one relationship between the softmax and linear noise models, one might expect their performance to be similar. To understand further why this is not so, in Figure 2 we plot the true noise model alongside the equivalent noise matrices learned via the proposed dropout scheme. The learned models are of the correct form (approximately uniform and diagonally dominant), but they are also more pessimistic: as measured by the average diagonal value of the learned noise matrices, they underestimate the probability of a correct noisy label by a few percent at each of the 30%, 50%, and 70% noise levels. This suggests that a CNN may learn from noisy labels better if the denoising model is pessimistic. This notion is a topic for future investigation.
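As a point of reference for the diagonal comparison, the diagonal of the true uniform noise matrix follows directly from the description of the uniform model: the true label is kept with probability $1-p$, and a uniform redraw lands on it with probability $p/K$. The worked values below assume $K = 10$ classes, as in CIFAR-10.

```latex
\begin{align*}
\phi_{kk} &= (1 - p) + \frac{p}{K}, \\
p = 0.3:\quad \phi_{kk} &= 0.7 + 0.03 = 0.73, \\
p = 0.5:\quad \phi_{kk} &= 0.5 + 0.05 = 0.55, \\
p = 0.7:\quad \phi_{kk} &= 0.3 + 0.07 = 0.37.
\end{align*}
```

A learned noise matrix is pessimistic in the sense used here when its average diagonal entry falls below the corresponding value.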
Next, we state results for the nonuniform noise model using a CNN. For $p \in \{0.3, 0.5, 0.7\}$, we corrupt the labels in the CIFAR-10 training set according to $\Phi$ as indicated in (3). We again compare the proposed dropout scheme to the base model, the true noise model, and the trace-regularized scheme of [9]. We emphasize again that these error rates, taken directly from [9], are for a similar but not identical noise model. We omit results for the unregularized softmax scheme.
Table II states the classification error for the different schemes over the CIFAR-10 test set. Again dropout performs well, outperforming the base model and performing better than or on par with the trace-regularized scheme. In this case, however, dropout does not outperform the true noise model. Indeed, overall dropout performs worse under nonuniform noise. To investigate this further, we plot the values of $\Phi$ used for simulations and the noise model learned via dropout in Figure 3. Similar to before, dropout learns a more pessimistic noise model, as measured by the average diagonal entries at the 30%, 50%, and 70% noise levels. Further, the learned noise models are close to uniform, even though the true model is nonuniform. We hypothesize that the failure of dropout to learn a nonuniform noise model explains the performance gap. We emphasize, though, the state-of-the-art performance of the model learned by dropout.
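Table II: Classification error (%) on the CIFAR-10 test set for the CNN under nonuniform label noise.
Noise level  True noise  Base model  Dropout  Trace ([9])
30%          24.95       30.49       25.4     26
50%          29.9        39.47       31.28    35
70%          63.91       65.6        63.04    63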
III-B MNIST Images
MNIST is a set of images of handwritten digits [18]. It has 60,000 training images and 10,000 test images. We use a version of the dataset in which the original black-and-white images are normalized to grayscale and fit to a dimension of $28 \times 28$. For reference, the CNN achieves 0.89% classification error when trained on the uncorrupted training set.
First, we present results for learning the CNN model parameters on the MNIST training set corrupted by uniform noise. As usual we take $\Phi$ as defined in (2) for $p \in \{0.3, 0.5, 0.7\}$. We compare the proposed dropout method to the base and true noise models. For this scenario, there is no prior work against which to compare.
Table III: Classification error (%) on the MNIST test set for the CNN under uniform label noise.
Noise level  True noise  Base model  Dropout
30%          1.3         8.3         1.2
50%          2.06        25.44       1.92
70%          3.31        44.42       3.12
We state the results in Table III. Dropout outperforms the true noise model for 30% and 50% noise, and performs only slightly worse at 70% noise. Still, dropout proves quite robust to label noise, outperforming the base model substantially.
In Table IV we state the results of the same experiment, this time with $\Phi$ drawn according to the nonuniform noise model of (3). Similar to the CIFAR-10 case, the relative performance of dropout is worse. It slightly underperforms relative to the true noise model for 30% and 50%, and it performs substantially worse for 70%. This is due to two factors: first, the dropout scheme learns nonuniform noise models poorly, as seen above; second, the MNIST dataset does not cluster as naturally as the CIFAR-10 dataset.
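Table IV: Classification error (%) on the MNIST test set for the CNN under nonuniform label noise.
Noise level  True noise  Base model  Dropout
30%          1.72        4.5         1.83
50%          2.29        34.5        2.83
70%          3.58        48.80       24.6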
To compare the dropout performance on MNIST with previous work, we also state results for a three-layer DNN as described in [10]. As mentioned above, this network has rectified linear units in each layer. The DNN is less sophisticated than the CNN, so it has worse performance overall. When trained on the uncorrupted MNIST training set, it achieves 1.84% classification error.
We first state results for uniform noise, shown in Table V. As before, we corrupt the MNIST training set labels with noise drawn according to (2). In addition to the true noise and base models, we compare the proposed dropout scheme to that presented in [10], where a “bootstrapping” scheme is used to denoise the corrupted labels during training. Similar to before, the proposed dropout scheme outperforms every other scheme, including the true noise model, except at the 70% noise level. However, dropout significantly outperforms bootstrapping in all regimes; at 70% noise, dropout performs even better than bootstrapping does at 50% noise.
Table V: Classification error (%) on the MNIST test set for the DNN under uniform label noise.
Noise level  True noise  Base model  Dropout  Bootstrapping ([10])
30%          2.46        3.42        2.41     2
50%          3.72        23.4        3.63     45
70%          7.59        45.33       8.77     N/A
Similar results obtain for nonuniform noise, as shown in Table VI. Again, dropout has worse relative performance due to its difficulty in learning a nonuniform noise model, and this gap is significant at the 70% noise level. We plot the true and learned noise model for the 70% noise level in Figure 4. Similar to before, the learned model is more pessimistic and closer to a uniform distribution than the true model. We hypothesize that this has a more drastic effect because the MNIST digits do not cluster as naturally as the CIFAR images.
Table VI: Classification error (%) on the MNIST test set for the DNN under nonuniform label noise.
Noise level  True noise  Base model  Dropout  Bootstrapping ([10])
30%          3.71        6.03        2.45     2
50%          5.24        36.35       4.58     45
70%          6.76        53.55       43.03    N/A
While preparing this manuscript, we became aware of a recently published approach [19]. It uses the “AlexNet” convolutional neural network, pre-trained on a noise-free version of the ILSVRC-2012 dataset. Then, for a different, noisy training set, it fine-tunes the last CNN layer using an auxiliary image regularization function, optimized via the alternating direction method of multipliers (ADMM). The regularization encourages the model to identify and discard incorrectly-labeled images. This approach has a somewhat different setting (in particular, it relies on a pre-trained CNN, whereas the results reported herein suppose that the end-to-end network must be trained via noisy labels), so we cannot give a direct comparison of our method to theirs. However, [19] reports a classification error rate of 7.83% for 50% noise on the MNIST set, whereas dropout achieves 2.83%. This suggests that at least in some regimes dropout provides superior performance.
IV Conclusion and Future Work
We have proposed a simple and effective method for learning a deep network from training data whose labels are corrupted by noise. We augmented a standard deep network with a softmax layer that models the label noise. To learn the classifier and the noise model jointly, we applied dropout regularization to the weights of the final softmax layer. On the CIFAR-10 and MNIST datasets, this approach achieves state-of-the-art performance, and in some cases it outperforms models in which the label noise statistics are known a priori.
A consistent feature of this approach is that it learns a noise model that overestimates the probability of a label flip. One way to interpret this result is that the deep network is encouraged to learn to cluster the data, rather than to classify it, to a greater extent than one would expect from the noise statistics. In other words, it is better to let deep networks cluster ambiguously-labeled data than to risk learning noisy labels. The details of this phenomenon, including which noise model is “ideal” for training an accurate network, are a topic for future research.
Acknowledgment
This work is supported in part by the US National Science Foundation award to XWC (IIS-1554264).
References
[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[3] X. Zhu and X. Wu, “Class noise vs. attribute noise: A quantitative study,” Artificial Intelligence Review, vol. 22, no. 3, pp. 177–210, 2004.
[4] J. A. Sáez, M. Galar, J. Luengo, and F. Herrera, “Analyzing the presence of noise in multiclass problems: alleviating its influence with the one-vs-one decomposition,” Knowledge and information systems, vol. 38, no. 1, pp. 179–206, 2014.
[5] B. Frénay and M. Verleysen, “Classification in the presence of label noise: a survey,” Neural Networks and Learning Systems, IEEE Transactions on, vol. 25, no. 5, pp. 845–869, 2014.
[6] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Learning with noisy labels,” in Advances in neural information processing systems, 2013, pp. 1196–1204.
[7] J. Larsen, L. Nonboe, M. Hintz-Madsen, and L. K. Hansen, “Design of robust neural network classifiers,” in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, vol. 2. IEEE, 1998, pp. 1205–1208.
[8] V. Mnih and G. E. Hinton, “Learning to label aerial images from noisy data,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012, pp. 567–574.
[9] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus, “Training convolutional networks with noisy labels,” arXiv preprint arXiv:1406.2080, 2014.
[10] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” arXiv preprint arXiv:1412.6596, 2014.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
[12] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A matlab-like environment for machine learning,” in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.
[13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[14] A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for matlab,” in Proceedings of the 23rd Annual ACM Conference on Multimedia Conference. ACM, 2015, pp. 689–692.
[15] “Code repository,” https://www.dropbox.com/sh/4kgbhgz7328ke16/AADY0gaOcaI5MIMAO0HY77Gqa?dl=0.
[16] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
[17] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 11, pp. 1958–1970, 2008.
[18] Y. LeCun, C. Cortes, and C. J. Burges, “The mnist database of handwritten digits,” 1998.
[19] S. Azadi, J. Feng, S. Jegelka, and T. Darrell, “Auxiliary image regularization for deep cnns with noisy labels,” arXiv preprint arXiv:1511.07069, 2015.