Learning Deep Networks from Noisy Labels with Dropout Regularization

by Ishan Jindal, et al.
Wayne State University

Large datasets often have unreliable labels, such as those obtained from Amazon's Mechanical Turk or social media platforms, and classifiers trained on mislabeled datasets often exhibit poor performance. We present a simple, effective technique for accounting for label noise when training deep neural networks. We augment a standard deep network with a softmax layer that models the label noise statistics. Then, we train the deep network and noise model jointly via end-to-end stochastic gradient descent on the (perhaps mislabeled) dataset. The augmented model is overdetermined, so in order to encourage the learning of a non-trivial noise model, we apply dropout regularization to the weights of the noise model during training. Numerical experiments on noisy versions of the CIFAR-10 and MNIST datasets show that the proposed dropout technique outperforms state-of-the-art methods.



I Problem Statement

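In the usual supervised learning setting, we have access to a set of $N$ labeled training images. Denote each image by $x_i$, and denote the class label for image $x_i$ by $y_i \in \{1, \dots, K\}$. Denote this ideal training set by

$$\mathcal{D} = \{(x_1, y_1), \dots, (x_N, y_N)\}.$$

As discussed in the introduction, accurate labels are difficult to obtain for large datasets, so we suppose that we have access only to noisy labels, denoted by $y'_i$. Denote the noisy training set by

$$\mathcal{D}' = \{(x_1, y'_1), \dots, (x_N, y'_N)\}.$$

We assume a probabilistic model of label noise in which each noisy label $y'_i$ depends only on the true label $y_i$ and not on the image $x_i$. We further suppose that the noisy labels are i.i.d. conditioned on the true labels. That is, $y'_i$ and $y'_j$ are independent of each other given the true labels $y_i$ and $y_j$, and for image pairs $(x_i, y_i)$ and $(x_j, y_j)$ with $i \neq j$,

$$p(y'_i, y'_j \mid y_i, y_j, x_i, x_j) = p(y'_i \mid y_i)\, p(y'_j \mid y_j).$$

We represent the conditional noise model by the column-stochastic matrix $\Phi \in \mathbb{R}^{K \times K}$:

$$p(y'_i = k \mid y_i = j) = \Phi_{kj}, \qquad (1)$$

where $\Phi_{kj}$ is the $(k, j)$th element of $\Phi$.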

In our simulations, we synthesize the noisy labels. From the standard datasets CIFAR-10 and MNIST, we fix a noise distribution $\Phi$ and create noisy labels by drawing i.i.d. from the distribution specified by (1) for the training samples. We do not perturb the labels for the test samples.

While the proposed method works for any noise matrix $\Phi$, we use two parametric noise models in the sequel. First, we choose a noise level $p \in [0, 1]$, and we set

$$\Phi = (1 - p) I + \frac{p}{K} \mathbf{1}\mathbf{1}^T, \qquad (2)$$

where $I$ is the identity matrix and $\mathbf{1}$ is the all-ones column vector. That is, the noisy label is the true label with probability $1 - p$ and is drawn uniformly from $\{1, \dots, K\}$ with probability $p$. We call this the uniform noise model.
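The uniform noise model and the label-synthesis step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' MATLAB code: the function names, the seed, the sample size, and the randomly generated stand-in labels are all assumptions, and classes are indexed from 0 instead of 1.

```python
import numpy as np


def uniform_noise_matrix(num_classes, p):
    """Return the column-stochastic matrix (1 - p) * I + (p / K) * 1 1^T."""
    K = num_classes
    return (1.0 - p) * np.eye(K) + (p / K) * np.ones((K, K))


def corrupt_labels(true_labels, noise_matrix, rng):
    """Draw each noisy label from the column indexed by its true label."""
    K = noise_matrix.shape[0]
    return np.array([rng.choice(K, p=noise_matrix[:, y]) for y in true_labels])


rng = np.random.default_rng(0)            # hypothetical seed
K, p = 10, 0.3
Phi = uniform_noise_matrix(K, p)
y_true = rng.integers(0, K, size=5000)    # stand-in for real training labels
y_noisy = corrupt_labels(y_true, Phi, rng)

# A redrawn label can coincide with the true one, so the expected
# fraction of changed labels is p * (K - 1) / K rather than p.
flip_rate = np.mean(y_noisy != y_true)
print(Phi[0, 0], flip_rate)
```

Note that the diagonal entries of this matrix equal $1 - p + p/K$, so at a nominal noise level of 30% with ten classes roughly 27% of the labels actually change.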

Second, we again choose a noise level $p$, and we set

$$\Phi = (1 - p) I + p \Delta, \qquad (3)$$

where the columns of $\Delta$ are drawn uniformly from the unit simplex, i.e., the set of vectors with nonnegative elements that sum to one. The matrix $\Delta$ is constant over a single instantiation of the noisy training set $\mathcal{D}'$. We call this the non-uniform noise model.
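One way to draw the non-uniform noise matrix is to use the fact that a Dirichlet distribution with all concentration parameters equal to one is uniform on the unit simplex. The sketch below makes that assumption explicit; the function name and seed are illustrative, not taken from the paper's code.

```python
import numpy as np


def nonuniform_noise_matrix(num_classes, p, rng):
    """Return (1 - p) * I + p * Delta with simplex-uniform columns in Delta."""
    K = num_classes
    # Dirichlet(1, ..., 1) is uniform on the unit simplex. Each row of the
    # draw is one sample, so transpose to place the samples in columns.
    Delta = rng.dirichlet(np.ones(K), size=K).T
    return (1.0 - p) * np.eye(K) + p * Delta


rng = np.random.default_rng(1)            # hypothetical seed
Phi = nonuniform_noise_matrix(10, 0.5, rng)
print(Phi.sum(axis=0))                    # every column sums to one
```

The matrix is drawn once and then reused for every training sample, which matches the statement that it is constant over a single instantiation of the noisy training set.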

I-A Learning Deep Networks with Noise Models

Fig. 1: (a) A deep network augmented with a linear noise model. (b) A deep network augmented with a softmax/dropout noise model.

Our objective is to learn a deep network from the noisy training set that accurately classifies cleanly-labeled images. Our approach is to take a standard deep network, which we call the base model, and augment it with a noise model that accounts for label noise. Then, the base and noise models are learned jointly via stochastic gradient descent. The noise model has a role only during training: as the noise model is learned, it effectively denoises the labels during backpropagation, making it possible to learn a more accurate base model. After training, the noise model is disconnected, and test images are classified using the base model output.

We use two standard deep networks for the base model. The first is a deep convolutional neural network (CNN). It has three processing layers, with rectified linear units (ReLUs) and max- and average-pool operations between layers. The hyperparameters are similar to those used in the popular “AlexNet” architecture, described in [2]. The second model is a standard deep neural network (DNN) with three rectified linear processing layers.

We lump the base model parameters (processing layer weights and biases, etc.) into a single parameter vector $\theta$. Further, let $h(x; \theta) \in \mathbb{R}^K$ be the output vector of the final layer of the base model. Define the usual softmax function

$$\sigma(z)_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}, \quad k = 1, \dots, K.$$

Then, for test image $x$, the base model estimate of the distribution of the class label is

$$\hat{p}(y \mid x; \theta) = \sigma(h(x; \theta)).$$

One approach to noisy labels is to use the base model without modification and treat $y'_i$ as the true label for $x_i$. Taking the standard cross-entropy loss, one can minimize the empirical risk

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{p}(y = y'_i \mid x_i; \theta). \qquad (6)$$

As shown in Section III, the base model alone offers satisfactory performance when the label noise is not too severe; otherwise the incorrect labels overwhelm the model, and it fails.

To motivate our approach, we first describe the method presented in [9]. Suppose momentarily that the true noise distribution, characterized by the noise matrix $\Phi$, is known. One can augment the base model with a linear noise model, with weight matrix equal to $\Phi$, as depicted in Figure 1(a). For this architecture, we can express the estimate of the distribution of the noisy class label as

$$\hat{p}(y' \mid x; \theta, \Phi) = \Phi \cdot \sigma(h(x; \theta)),$$

where $\cdot$ is standard matrix-vector multiplication. We can then minimize the empirical cross-entropy of the noisy labels directly:

$$L(\theta; \Phi) = -\frac{1}{N} \sum_{i=1}^{N} \log \left[ \Phi \cdot \sigma(h(x_i; \theta)) \right]_{y'_i},$$

where $[\cdot]_k$ returns the $k$th element of a vector. Then, each test sample $x$ is classified according to the output of the base model, i.e., $\hat{y} = \arg\max_k \left[ \sigma(h(x; \theta)) \right]_k$. Because the noise model is known perfectly, one might expect that this approach gives the best possible performance. While it does provide excellent performance, in Section III we show that even better performance is possible in most cases.
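The forward pass and loss for the linear noise layer can be sketched as follows. This is an illustrative NumPy version that only evaluates the loss (no gradients or training loop); the function names and the random stand-in logits and labels are assumptions, and classes are indexed from 0.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def noisy_label_cross_entropy(logits, noisy_labels, Phi):
    """Cross-entropy of noisy labels under a linear noise layer.

    logits: (N, K) base model outputs; Phi: (K, K) column-stochastic matrix.
    """
    p_clean = softmax(logits)              # estimate of p(y | x), one row per sample
    p_noisy = p_clean @ Phi.T              # row i equals Phi @ p_clean[i]
    picked = p_noisy[np.arange(len(noisy_labels)), noisy_labels]
    return -np.mean(np.log(picked))


rng = np.random.default_rng(0)             # hypothetical seed
N, K = 8, 4
logits = rng.normal(size=(N, K))           # stand-in for base model outputs
noisy_labels = rng.integers(0, K, size=N)  # stand-in for noisy labels

# With Phi = I the noise layer is a no-op and the loss reduces to the base loss.
base_loss = noisy_label_cross_entropy(logits, noisy_labels, np.eye(K))
# With a fully uniform Phi the predicted noisy distribution is flat.
uniform_loss = noisy_label_cross_entropy(logits, noisy_labels, np.full((K, K), 1.0 / K))
print(base_loss, uniform_loss)
```

The two limiting cases are a quick sanity check: an identity noise matrix recovers the plain cross-entropy, and a fully uniform one yields $\log K$ regardless of the base model.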

The noise model, however, is usually unknown. Furthermore, we do not know which labels are corrupted, so we cannot estimate a noise model directly. The authors of [9] suggested that one can estimate the noise probabilities $\Phi$ while simultaneously learning the base model parameters $\theta$. The challenge here is that convolutional networks are sufficiently expressive that the base model may fit the noisy labels directly and learn a trivial noise model. To prevent this, the authors of [9] add a regularization term that penalizes the trace of the estimate $\hat{\Phi}$ of $\Phi$. This encourages a diffuse noise model estimate and permits the base model to learn from denoised labels. The associated loss function is

$$L(\theta, \hat{\Phi}) = -\frac{1}{N} \sum_{i=1}^{N} \log \left[ \hat{\Phi} \cdot \sigma(h(x_i; \theta)) \right]_{y'_i} + \lambda \operatorname{tr}(\hat{\Phi}), \qquad (12)$$

where $\operatorname{tr}(\cdot)$ is the matrix trace, and $\lambda$ is a regularization parameter chosen via cross-validation. When minimizing (12), one must take care to project the estimate $\hat{\Phi}$ onto the space of stochastic matrices at every iteration, else it will not correspond to a meaningful model of label noise.
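The projection step can be illustrated with a Euclidean projection of each column onto the unit simplex, using the standard sort-based algorithm. The text does not say which projection [9] uses, so this choice is an assumption, as are the function names and the small example matrix.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of a vector onto the unit simplex."""
    u = np.sort(v)[::-1]                       # entries in decreasing order
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]  # last index that stays positive
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)


def project_columns(Phi_hat):
    """Project every column so that the matrix is column-stochastic."""
    return np.column_stack([project_to_simplex(col) for col in Phi_hat.T])


def trace_penalty(Phi_hat, lam):
    """Regularization term lam * tr(Phi_hat) from the trace-penalized loss."""
    return lam * np.trace(Phi_hat)


# Hypothetical estimate after a gradient step; it is no longer stochastic.
Phi_hat = np.array([[1.2, 0.1],
                    [-0.1, 0.7]])
Phi_proj = project_columns(Phi_hat)
print(Phi_proj)
print(trace_penalty(Phi_proj, lam=0.1))
```

Having to run a step like this after every update is the extra bookkeeping that the softmax parameterization in Section II avoids.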

II Dropout Regularization

We propose to augment the base model with a different noise architecture. As depicted in Figure 1(b), we add a softmax layer with square weight matrix $W \in \mathbb{R}^{K \times K}$, unconstrained. We interpret the output of this softmax layer, denoted $\sigma(W \cdot \sigma(h(x; \theta)))$, as the probability distribution over the noisy label $y'$. This results in the effective conditional probability distribution of the noisy label $y'$ conditioned on the true label $y$:

$$p(y' = k \mid y = j) = \left[ \sigma(W e_j) \right]_k, \qquad (14)$$

where $e_j$ is the $j$th elementary vector. We use this architecture without loss of generality. Because the softmax function is invertible (up to an additive constant), there is a one-to-one relationship between noise distributions induced by $\Phi$ and (1) and those induced by $W$ and (14). For any $W$ and base model parameters $\theta$, the estimate of the distribution of the noisy class label is

$$\hat{p}(y' \mid x; \theta, W) = \sigma(W \cdot \sigma(h(x; \theta))).$$

This architecture offers two major advantages. First, the matrix $W$ is unconstrained during optimization. Because the softmax layer implicitly normalizes the resulting conditional probabilities, there is no need to normalize $W$ or force its entries to be nonnegative. This simplifies the optimization process by eliminating the projection step described above.
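The correspondence between an unconstrained weight matrix and a column-stochastic noise matrix can be checked numerically. The sketch below assumes a noise matrix with strictly positive entries, so that its element-wise logarithm exists; the names `softmax_columns`, `W`, and `Phi` are illustrative.

```python
import numpy as np


def softmax_columns(W):
    """Apply softmax to each column, so column j equals softmax(W e_j)."""
    W = W - W.max(axis=0, keepdims=True)
    E = np.exp(W)
    return E / E.sum(axis=0, keepdims=True)


K, p = 10, 0.3
Phi = (1 - p) * np.eye(K) + (p / K) * np.ones((K, K))   # uniform noise matrix

W = np.log(Phi)                    # one valid choice; requires Phi > 0
Phi_from_W = softmax_columns(W)    # recovers Phi, since its columns sum to one

# Adding a constant to the weights leaves every softmax unchanged,
# which is why the mapping is one-to-one only up to that constant.
W_shifted = W + 5.0
Phi_from_shifted = softmax_columns(W_shifted)
print(np.abs(Phi_from_W - Phi).max())
```

Any real-valued $W$ maps to a valid noise matrix this way, so no constraint handling is needed during optimization.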

Second, it is congruent with dropout regularization, which we apply to the output of the base model to prevent the base model from learning the noisy labels directly. Dropout is a well-established technique for preventing overfitting in deep learning [13]. It regularizes learning by introducing binary multiplicative noise during training. At each gradient step, the base model outputs are multiplied by random variables drawn i.i.d. from the Bernoulli distribution with parameter $q$. This “thins” out the network, effectively sampling from a different network for each gradient step.

Applying dropout in this way entails forming the effective weight matrix

$$\tilde{W} = W \odot (\mathbf{1} r^T),$$

where $r \in \{0, 1\}^K$ has entries drawn i.i.d. from the Bernoulli distribution with parameter $q$, $\mathbf{1}$ is the all-ones column vector, and $\odot$ represents the Hadamard (element-wise) product. Masking the base model output by $r$ is equivalent to zeroing the corresponding columns of $W$, since $W (r \odot z) = \tilde{W} z$ for any vector $z$. We choose a different vector $r$ for each mini-batch, i.e., each SGD step, in the training set. Again using the cross-entropy loss, the resulting loss function is

$$L(\theta, W) = -\frac{1}{N} \sum_{i=1}^{N} \log \left[ \sigma(\tilde{W} \cdot \sigma(h(x_i; \theta))) \right]_{y'_i}.$$

Observing the conditional distribution in (14), each instantiation of the multiplicative noise zeros out a fraction of the columns $W e_j$, forcing the associated probabilities to a baseline, uniform value. This forces the learning “action” onto the remaining probabilities, which encourages a non-trivial noise model. The Bernoulli parameter $q$ determines the sparsity of each instantiation. In our simulations, we find that an aggressively sparse setting of $q$ works best.
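A single training-time forward pass through the softmax noise layer with dropout might look like the sketch below. It assumes the mask acts on the base model outputs, which is the same as zeroing columns of the weight matrix, and it treats the Bernoulli parameter as a keep probability. The value `keep_prob=0.1`, the function names, and the random stand-in inputs are illustrative assumptions, not values confirmed by the text.

```python
import numpy as np


def softmax(z, axis=-1):
    """Softmax along the given axis, stabilized by max-subtraction."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def dropout_noise_forward(logits, W, keep_prob, rng):
    """One forward pass with a fresh dropout mask for the mini-batch."""
    K = W.shape[0]
    r = rng.binomial(1, keep_prob, size=K)   # Bernoulli(keep_prob) mask
    W_eff = W * r[np.newaxis, :]             # zero column j wherever r[j] == 0
    p_clean = softmax(logits)                # (N, K) base model estimate
    p_noisy = softmax(p_clean @ W_eff.T)     # row i: softmax(W_eff @ p_clean[i])
    return p_noisy, W_eff, r


rng = np.random.default_rng(0)               # hypothetical seed
N, K = 100, 10
logits = rng.normal(size=(N, K))             # stand-in for base model outputs
W = rng.normal(size=(K, K))                  # stand-in for noise-layer weights
p_noisy, W_eff, r = dropout_noise_forward(logits, W, keep_prob=0.1, rng=rng)

# Implied conditionals p(y' | y = j) are the columns of this matrix;
# every dropped column becomes the uniform distribution 1 / K.
cond = softmax(W_eff, axis=0)
print(r, cond[:, 0])
```

At test time none of this is used: predictions come from the base model's softmax output alone.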

The usual dropout procedure involves “averaging” together the different models when classifying samples by scaling down the learned weights. In our setting, this is unnecessary. The noise model serves only as an intermediate step for denoising the noisy labels to train a more accurate base model. The noise model is disconnected at test time, and averaging is not performed.

III Experimental Results

In this section, we demonstrate the performance of the proposed method. We state results on two datasets (CIFAR-10 and MNIST), two noise models (uniform and non-uniform), and two base models (CNN and DNN). For training the CNN, we use the model architecture from the publicly-available MATLAB toolbox [14]. Other than changing the size of the input units, we keep the model hyperparameters constant. For training the DNN, we use the architecture used in [10], whose layers consist of ReLUs. In each case, we present results for label noise probabilities $p \in \{0.3, 0.5, 0.7\}$, i.e., label noise that corrupts 30%, 50%, and 70% of the training samples. As mentioned earlier, we use the same dropout parameter in all simulations. We train the CNN and DNN end-to-end using stochastic gradient descent with batch size 100. When training on the MNIST dataset, we perform early stopping, ceasing iterations when the loss function begins to increase. We emphasize that the loss function does not depend on the true labels, so choosing when to stop does not require knowledge of the uncorrupted dataset. MATLAB code for these simulations is available at [15].

III-A CIFAR Images

The CIFAR-10 dataset [16] is a subset of the Tiny Images dataset [17]. CIFAR-10 consists of 50,000 training images and 10,000 test images, each of which belongs to one of ten object categories, which are equally represented in the training and test sets. Each image has dimension $32 \times 32 \times 3$, where the last dimension reflects the three color channels of the images.

First, we state results for the uniform noise model using the CNN. For $p \in \{0.3, 0.5, 0.7\}$, we choose the noise matrix $\Phi$ as indicated in (2). We corrupt the labels in the CIFAR-10 training set according to $\Phi$, and we leave the test labels uncorrupted. For reference, the CNN achieves 20.49% classification error when trained on the noise-free dataset.

We state the classification error over the test set in Table I. As a baseline, we present results for the base model, in which the noisy labels are treated as true labels and the model parameters are chosen to minimize the standard loss function in (6). We also present results for the true noise model, in which the noise matrix $\Phi$ is known, a linear noise layer with weights $\Phi$ is appended to the base model, and the model parameters are chosen to minimize the cross-entropy of the noisy labels under that layer. Next, we present results for the proposed softmax architecture, first without regularization (referred to as “Softmax” in Table I) and then with the proposed dropout regularization (“Dropout”).

Finally, we compare to the results presented in [9] (“Trace”), in which a linear layer is added, but the label noise model is learned jointly with the base model parameters according to the trace-penalized loss function of (12). We emphasize that these results come with significant caveats. While the noise level and network architecture used here are the same as those of [9], the authors of [9] used a non-uniform noise model which we do not replicate in this paper. Therefore, these results are from a roughly comparable, but not strictly identical, noise scenario.

| Noise level | True noise | Base model | Softmax | Dropout | Trace ([9]) |
|---|---|---|---|---|---|
| 30% | 25.76 | 29.78 | 26.04 | 24.43 | 26 |
| 50% | 29.63 | 38.76 | 33.40 | 32.64 | 35 |
| 70% | 36.24 | 48.34 | 37.10 | 33.00 | 63 |

TABLE I: Classification error rates (%) on the CIFAR-10 dataset with uniform label noise and the CNN architecture.

In most cases, the proposed dropout method gives the best performance, even better than the true noise model, which supposes that the noise matrix is known a priori. Only in the case of 50% noise does the true noise model outperform dropout. Note that even without dropout regularization, the proposed softmax noise model gives satisfactory performance, consistently outperforming the base model.

Because there is a one-to-one relationship between the softmax and linear noise models, one might expect their performance to be similar. To understand further why this is not so, in Figure 2 we plot the true noise model alongside the equivalent noise matrices learned via the proposed dropout scheme. The learned models are of the correct form (approximately uniform and diagonally dominant), but they are also more pessimistic, underestimating the probability of a correct label by a few percent. Indeed, the average diagonal values of the learned noise matrices fall below the true values at the 30%, 50%, and 70% noise levels. This suggests that a CNN may learn from noisy labels better if the denoising model is pessimistic. This notion is a topic for future investigation.

Fig. 2: True and learned uniform noise distributions. The first row, panels (a) to (c), shows the elements of the true noise matrix for the uniform noise model with 30%, 50% and 70% noise levels. The second row, panels (d) to (f), shows the noise model learned via the proposed dropout method at the same noise levels.
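The “pessimism” comparison comes down to checking the average diagonal of the noise matrix implied by the learned weights against the true diagonal value $1 - p + p/K$ of the uniform model. The sketch below shows the computation; since no trained weights are available here, the “learned” weights are a hypothetical stand-in built from a slightly higher noise level, and the 0.03 offset is purely illustrative.

```python
import numpy as np


def equivalent_noise_matrix(W):
    """Column j is softmax(W e_j), the implied p(y' | y = j)."""
    W = W - W.max(axis=0, keepdims=True)
    E = np.exp(W)
    return E / E.sum(axis=0, keepdims=True)


def true_uniform_diagonal(p, num_classes):
    """Diagonal entry of (1 - p) * I + (p / K) * 1 1^T."""
    return 1.0 - p + p / num_classes


K = 10
for p in (0.3, 0.5, 0.7):
    # Hypothetical "learned" weights: the log of a uniform noise matrix with
    # a slightly higher noise level, standing in for a trained W.
    p_learned = p + 0.03
    W = np.log((1 - p_learned) * np.eye(K) + (p_learned / K) * np.ones((K, K)))
    learned_diag = np.mean(np.diag(equivalent_noise_matrix(W)))
    print(p, true_uniform_diagonal(p, K), learned_diag)
```

For ten classes the true diagonal values are 0.73, 0.55, and 0.37 at the three noise levels, so a learned average diagonal below those values indicates a model that assumes more label flips than actually occurred.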

Next, we state results for the non-uniform noise model using a CNN. For $p \in \{0.3, 0.5, 0.7\}$, we corrupt the labels in the CIFAR-10 training set according to the noise matrix $\Phi$ as indicated in (3). We again compare the proposed dropout scheme to the base model, the true noise model, and the trace-regularized scheme of [9]. We emphasize again that these error rates, taken directly from [9], are for a similar but not identical noise model. We omit results for the unregularized softmax scheme.

Table II states the classification error for the different schemes over the CIFAR-10 test set. Again dropout performs well, outperforming the base model and performing better than or on par with the trace-regularized scheme. In this case, however, dropout does not outperform the true noise model at 30% and 50% noise, and the two are nearly tied at 70%. Indeed, overall dropout performs worse under non-uniform noise. To investigate this further, we plot the noise matrices used for simulations and the noise model learned via dropout in Figure 3. Similar to before, dropout learns a more pessimistic noise model, with average diagonal entries below the true values at the 30%, 50%, and 70% noise levels. Further, the learned noise models are close to uniform, even though the true model is non-uniform. We hypothesize that the failure of dropout to learn a non-uniform noise model explains the performance gap. We emphasize, though, the state-of-the-art performance of the model learned by dropout.

| Noise level | True noise | Base model | Dropout | Trace ([9]) |
|---|---|---|---|---|
| 30% | 24.95 | 30.49 | 25.4 | 26 |
| 50% | 29.9 | 39.47 | 31.28 | 35 |
| 70% | 63.91 | 65.6 | 63.04 | 63 |

TABLE II: Classification error rates (%) on the CIFAR-10 dataset with non-uniform label noise and the CNN architecture.

Fig. 3: True and learned non-uniform noise distributions. The first row, panels (a) to (c), shows the elements of the true noise matrix for the non-uniform noise model with 30%, 50% and 70% noise levels. The second row, panels (d) to (f), shows the noise model learned via the proposed dropout method at the same noise levels.

III-B MNIST Images

MNIST is a set of images of handwritten digits [18]. It has 60,000 training images and 10,000 test images. We use a version of the dataset in which the original black-and-white images are normalized to grayscale and fit to a dimension of $28 \times 28$. For reference, the CNN achieves 0.89% classification error when trained on the uncorrupted training set.

First, we present results for learning the CNN model parameters on the MNIST training set corrupted by uniform noise. As before, we take the noise matrix $\Phi$ as defined in (2) for $p \in \{0.3, 0.5, 0.7\}$. We compare the proposed dropout method to the base and true noise models. For this scenario, there is no prior work against which to compare.

| Noise level | True noise | Base model | Dropout |
|---|---|---|---|
| 30% | 1.3 | 8.3 | 1.2 |
| 50% | 2.06 | 25.44 | 1.92 |
| 70% | 3.31 | 44.42 | 3.12 |

TABLE III: Classification error rates (%) for the CNN architecture trained on the MNIST dataset corrupted by uniform noise.

We state the results in Table III. Dropout achieves a lower error rate than the true noise model at all three noise levels, although the margins are small. Dropout also proves quite robust to label noise, outperforming the base model substantially.

In Table IV we state the results of the same experiment, this time with the noise matrix drawn according to the non-uniform noise model of (3). Similar to the CIFAR-10 case, the relative performance of dropout is worse. It slightly under-performs relative to the true noise model for 30% and 50%, and it performs substantially worse for 70%. This is due to two factors: first, the dropout scheme learns non-uniform noise models poorly, as seen above; second, the MNIST dataset does not cluster as naturally as the CIFAR-10 dataset.

| Noise level | True noise | Base model | Dropout |
|---|---|---|---|
| 30% | 1.72 | 4.5 | 1.83 |
| 50% | 2.29 | 34.5 | 2.83 |
| 70% | 3.58 | 48.80 | 24.6 |

TABLE IV: Classification error rates (%) for the CNN architecture trained on the MNIST dataset corrupted by non-uniform noise.

To compare the dropout performance on MNIST with previous work, we also state results for a three-layer DNN as described in [10]. As mentioned above, this network uses rectified linear units in each layer. The DNN is less sophisticated than the CNN, so it has worse performance overall. When trained on the uncorrupted MNIST training set, it achieves 1.84% classification error.

We first state results for uniform noise, shown in Table V. As before, we corrupt the MNIST training set labels with noise drawn according to (2). In addition to the true noise and base models, we compare the proposed dropout scheme to that presented in [10], where a “bootstrapping” scheme is used to denoise the corrupted labels during training. Similar to before, the proposed dropout scheme outperforms the base model and the true noise model, except at the 70% noise level, where the true noise model is better. Relative to bootstrapping, dropout is comparable at 30% noise and far better at 50% noise; at 70% noise, dropout (8.77%) performs even better than bootstrapping does at 50% noise (45%).

| Noise level | True noise | Base model | Dropout | Bootstrapping ([10]) |
|---|---|---|---|---|
| 30% | 2.46 | 3.42 | 2.41 | 2 |
| 50% | 3.72 | 23.4 | 3.63 | 45 |
| 70% | 7.59 | 45.33 | 8.77 | N/A |

TABLE V: Classification error rates (%) for the DNN architecture trained on the MNIST dataset corrupted by uniform noise.

Similar results hold for non-uniform noise, as shown in Table VI. Again, dropout has worse relative performance due to its difficulty in learning a non-uniform noise model, and this gap is significant at the 70% noise level. We plot the true and learned noise model for the 70% noise level in Figure 4. Similar to before, the learned model is more pessimistic and closer to a uniform distribution than the true model. We hypothesize that this has a more drastic effect because the MNIST digits do not cluster as naturally as the CIFAR images.

Fig. 4: True and learned noise model for the CNN architecture over the MNIST digits with 70% label noise. Panels: (a) true noise; (b) learned noise.

| Noise level | True noise | Base model | Dropout | Bootstrapping ([10]) |
|---|---|---|---|---|
| 30% | 3.71 | 6.03 | 2.45 | 2 |
| 50% | 5.24 | 36.35 | 4.58 | 45 |
| 70% | 6.76 | 53.55 | 43.03 | N/A |

TABLE VI: Classification error rates (%) for the DNN architecture trained on the MNIST dataset corrupted by non-uniform noise.

While preparing this manuscript, we became aware of a recently-published approach [19]. It uses the “AlexNet” convolutional neural network, pretrained on a noise-free version of the ILSVRC2012 dataset. Then, for a different, noisy training set, it fine-tunes the last CNN layer using an auxiliary image regularization function, optimized via the alternating direction method of multipliers (ADMM). The regularization encourages the model to identify and discard incorrectly-labeled images. This approach has a somewhat different setting: it relies on a pretrained CNN, whereas the results reported herein suppose that the end-to-end network must be trained via noisy labels. We therefore cannot give a direct comparison of our method to theirs. However, [19] reports a classification error rate of 7.83% for 50% noise on the MNIST set, whereas dropout achieves 2.83% (CNN, non-uniform noise, Table IV). This suggests that at least in some regimes dropout provides superior performance.

IV Conclusion and Future Work

We have proposed a simple and effective method for learning a deep network from training data whose labels are corrupted by noise. We augmented a standard deep network with a softmax layer that models the label noise. To learn the classifier and the noise model jointly, we applied dropout regularization to the weights of the final softmax layer. On the CIFAR-10 and MNIST datasets, this approach achieves state-of-the-art performance, and in some cases it outperforms models in which the label noise statistics are known a priori.

A consistent feature of this approach is that it learns a noise model that overestimates the probability of a label flip. One way to interpret this result is that the deep network is encouraged to learn to cluster the data, rather than to classify it, to a greater extent than one would expect from the noise statistics. In other words, it is better to let deep networks cluster ambiguously-labeled data than to risk learning noisy labels. The details of this phenomenon, including which noise model is “ideal” for training an accurate network, are a topic for future research.


This work is supported in part by the US National Science Foundation award to XWC (IIS-1554264).


  • [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.   IEEE, 2009, pp. 248–255.
  • [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [3] X. Zhu and X. Wu, “Class noise vs. attribute noise: A quantitative study,” Artificial Intelligence Review, vol. 22, no. 3, pp. 177–210, 2004.
  • [4] J. A. Sáez, M. Galar, J. Luengo, and F. Herrera, “Analyzing the presence of noise in multi-class problems: alleviating its influence with the one-vs-one decomposition,” Knowledge and information systems, vol. 38, no. 1, pp. 179–206, 2014.
  • [5] B. Frénay and M. Verleysen, “Classification in the presence of label noise: a survey,” Neural Networks and Learning Systems, IEEE Transactions on, vol. 25, no. 5, pp. 845–869, 2014.
  • [6] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Learning with noisy labels,” in Advances in neural information processing systems, 2013, pp. 1196–1204.
  • [7] J. Larsen, L. Nonboe, M. Hintz-Madsen, and L. K. Hansen, “Design of robust neural network classifiers,” in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, vol. 2.   IEEE, 1998, pp. 1205–1208.
  • [8] V. Mnih and G. E. Hinton, “Learning to label aerial images from noisy data,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012, pp. 567–574.
  • [9] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus, “Training convolutional networks with noisy labels,” arXiv preprint arXiv:1406.2080, 2014.
  • [10] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” arXiv preprint arXiv:1412.6596, 2014.
  • [11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia.   ACM, 2014, pp. 675–678.
  • [12] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A matlab-like environment for machine learning,” in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.
  • [13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [14] A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for matlab,” in Proceedings of the 23rd Annual ACM Conference on Multimedia Conference.   ACM, 2015, pp. 689–692.
  • [15] “Code repository,” https://www.dropbox.com/sh/4kgbhgz7328ke16/AADY0gaOcaI5MIMAO0HY77Gqa?dl=0.
  • [16] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
  • [17] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 11, pp. 1958–1970, 2008.
  • [18] Y. LeCun, C. Cortes, and C. J. Burges, “The mnist database of handwritten digits,” 1998.
  • [19] S. Azadi, J. Feng, S. Jegelka, and T. Darrell, “Auxiliary image regularization for deep cnns with noisy labels,” arXiv preprint arXiv:1511.07069, 2015.