1 Introduction
Neural networks exhibit state-of-the-art performance on many learning tasks, such as classification and segmentation. However, training these networks requires an abundance of carefully labeled data; networks tend to overfit quickly to noise in training labels, which makes them less effective on noisy real-world problems. Expert-labeled data is expensive and time-consuming to collect, while label noise is common in less carefully curated datasets due to measurement inaccuracies, human error, and similar factors.
Nonetheless, the latter type of data, albeit noisy, is much more readily available and in much larger quantities. One recent strategy shown to perform well on datasets containing significant amounts of label noise is augmenting the neural network with an uncertainty estimation method such as Monte Carlo Dropout [5]. These uncertainty estimation models display a delayed memorization of noisy training labels and generalize better to clean test data. Models augmented with Monte Carlo Dropout show a slower degradation of classification performance, consistently across benchmark datasets such as MNIST and CIFAR-10 [5]. In addition to its resilience against noisy training labels, MCDropout adds no training overhead and only minimal cost at inference time.
The robustness and low computational cost of MCDropout make it an effective and practical solution against noisy labels. In this paper, our goal is not only to determine whether MCDropout performs consistently better in these noisy-label situations, but also to provide an in-depth analysis of why it performs better. We present an investigation into the performance of, and the latent representation learned by, a model augmented with MCDropout. We first evaluate the accuracy of MCDropout models in comparison with deterministic neural networks on datasets such as MNIST and CIFAR-10 with artificially injected noisy labels and Animal-10n with natural annotation noise. Second, we measure neuron responsiveness in each layer to better explore the differences between latent representations learned by certainty and MCDropout models. Finally, we study network sparsity and find that the sparsity offered by MCDropout models contributes to robustness against noisy training labels. To our knowledge, our work provides the first detailed analysis of MCDropout in the setting of noisy labels.
The rest of the paper is organized as follows. In Section 2, we provide background on the noisy label setting, label noise taxonomies, Monte Carlo Dropout and related work. In Section 3, we describe our study directions, including measuring efficacy, neuron responsiveness via volatility, and network sparsity. In Section 4, we demonstrate the effectiveness of MCDropout on empirical datasets: MNIST and CIFAR-10 with artificially corrupted training labels, and Animal-10n, a real-world dataset containing annotation noise. We further analyze the neuron responsiveness and network sparsity of MCDropout models in comparison with deterministic networks. Finally, in Section 5, we discuss optimal placement of MCDropout in a neural network and conclude our paper.
2 Preliminaries
In this section, we present the problem statement, the preliminaries on label noise, Monte Carlo Dropout and related work.
We consider a fully supervised learning problem in image classification, where the images and their associated labels in the training set are denoted by $D = \{(x_i, y_i)\}_{i=1}^{n}$, with $n$ denoting the total number of training samples and all pairs $(x_i, y_i)$ sampled i.i.d. from a joint distribution $P(X, Y)$. However, instead of observing the correctly annotated labels, we observe the training data $\tilde{D} = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$, where $\tilde{y}_i$, given by a probabilistic process, may deviate from $y_i$. Our task is to learn a robust classifier on $\tilde{D}$, which contains noisy labels, such that for an incoming test image $x$ the classifier best predicts the unknown label $y$.

Across this paper, we refer to the deterministic neural network without uncertainty estimation as the “certainty model” or “deterministic model” interchangeably. We refer to the neural network augmented with MCDropout layers as the MCDropout model.
2.1 Label Noise Taxonomies
There are several categorizations of label noise. One commonly used categorization depends on whether or not the noisy label depends on the features. If the noisy label generation process is conditionally independent of the features, then a noise transition matrix $T \in [0, 1]^{K \times K}$, where $K$ is the number of classes, is sufficient to describe the label noise generation process. Each entry $T_{ij}$ in $T$ is a probability such that the true label $i$ will be changed into a noisy label $j$ with probability $T_{ij}$. If the observed label differs from the true label with a uniform probability, then the noise is label-independent and is called symmetric or uniform noise. If the observed label is changed from the true label with probabilities depending on the original ground truth, then the noise is label-dependent and is called asymmetric noise. On the other hand, if the corruption process depends on both the features and the labels, the label noise is called instance-dependent. A more recent study proposes a new but practical assumption within instance-dependent label noise, defined as part-dependent label noise, where the noise depends partially on an instance [26].

Another perspective on label noise is via uncertainty characterization [5]. The noisy label generation process is probabilistic and random, so uncertainty characterization naturally comes into play. From the viewpoint of deep learning uncertainty, the noise in the labels can be considered a type of aleatoric uncertainty, a measurement of the intrinsic and irreducible uncertainty within the data. Within aleatoric uncertainty, homoscedastic uncertainty is constant across the input while heteroscedastic uncertainty depends on the input. Hence if the noise transition matrix is uniform or symmetric, the label noise can be considered homoscedastic; if it is label-dependent, the label noise can be considered heteroscedastic. In recent noise simulation schemes, label noise is applied to samples that are more likely to be mislabeled according to a pre-learned model [1]. We consider this type of noise as epistemic uncertainty, a term that describes uncertainty induced by models.

2.2 Monte Carlo Dropout
The deep learning uncertainty perspective on label noise inspires us to study label noise via deep learning uncertainty estimation techniques. Chen et al. [5] proposed using epistemic uncertainty estimation methods when learning with noisy labels. Comparing Monte Carlo Dropout, Bootstrap [15], a Bayesian CNN based on Bayes by Backprop [4], and certainty neural networks trained in noisy label settings, the authors discovered that Monte Carlo Dropout (MCDropout) had a prolonged memorization effect and achieved the best classification performance on the test set.
We include Figure 1 as a motivating example. Hence, in this paper, we focus on why MCDropout is robust against noisy labels in comparison with certainty models. In this section, we provide background information on MCDropout.
The core idea of MCDropout is to enable dropout regularization at both training and test time. With multiple forward passes at inference time, the prediction is not deterministic and can be used to estimate the posterior distribution. As a result, MCDropout offers a Bayesian interpretation. The authors of [8], who first proposed the method, established the theoretical framework of MCDropout as approximate Bayesian inference and proved that MCDropout minimises the Kullback–Leibler divergence between an approximate distribution and the posterior of a deep Gaussian process. More formally, let $d_l$ denote dropout at the $l$-th layer of a neural network, where $l = 1, \dots, L$. Then at inference time, with $M$ forward passes, we obtain a distribution of logits and predictions per test sample, from which we can compute the expected value, standard deviation, variation ratio and entropy to assess uncertainty.

2.3 Related Work on Deep Learning with Noisy Labels
While there has not been much work on applying epistemic uncertainty methods to address noisy labels, abundant research has been done on deep learning with noisy labels, ranging over loss function adjustment, robust architecture design, data processing, data filtering and so on. The authors of [10, 28, 25, 18] devised robust loss functions to achieve a smaller risk on unseen clean data when learning with noisy labels. Sample selection techniques that filter clean labels for training and remove noisy labels have been proposed in [14, 11, 27, 20]. Sample selection and label correction for spatial computing is studied in [6]. Losses devised to estimate the noise transition matrix and correct the labels are studied in [22, 12, 2]. Semi-supervised learning is another family of techniques for noisy labels, where the noisy labelled data are treated as unlabeled and the clean labelled data as labeled [21, 7, 17].

Figure 1: (Left) MNIST test accuracy when training labels contain 15% noise. (Right) MNIST test accuracy when training labels contain 35% noise. Our previous study suggests that MCDropout has the best classification performance among several uncertainty estimation methods. Further, MCDropout does not increase training time per epoch and has relatively cheap inference cost. Hence in this paper, we focus on investigating the robustness of MCDropout when training with noisy labels.
[Figure: three pairs of image panels, each pair labeled Certain and MCDropout.]
3 Investigation
Our goal is to analyze the latent representations learned by MCDropout models, particularly in comparison with certainty models, trained in the presence of noisy labels. Similar to the definitions presented in Bau et al. [3], we use the term representation to describe the outputs of a particular layer in a model. More specifically: which channels of the layer have been activated for various data inputs? How strongly have these channels been activated? What is the variation in a specific channel’s possible activations? Comparing the representations lends insight and intuition as to why one model may perform better than another. Essentially, we investigate why MCDropout performs better than a certainty model by comparing the latent representations learned by the two models.
Again following the vocabulary used in Bau et al. [3], we refer to feature maps as the output of every layer in the network; the aggregate of the feature maps makes up the network’s learned representation. We refer to a neuron as a specific channel of the feature map. In this paper, we use the term activation gamut to refer to all the possible values that a particular neuron can produce. We approximate the activation gamut as the set of a neuron’s activation values over the images in a dataset.
We compare the classification efficacy, neuron responsiveness, and network sparsity of the two models. To understand how the two models have learned and encoded information differently, we evaluate the trained models on the test set and cache neuron activations from each layer, from which we derive statistics such as the mean and standard deviation of each neuron over the test samples.
3.1 Measuring Efficacy
We first train MCDropout and certainty models on training data with noisy labels and evaluate their accuracy on a cleanly labeled test set. We present the learning behaviors during training and testing over epochs.
3.2 Measuring Responsiveness
Next we compare neuron responsiveness measured by volatility in the two models. We define volatility as the standard deviation of a neuron’s activations over a dataset; if a neuron is capable of producing vastly different activation values for different input images, the neuron’s activation gamut would possess high standard deviation, indicating a highly responsive neuron.
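To compute the activation gamut of a neuron, we first cache the feature maps $F_{i,j}$, post-ReLU, produced by the $i$-th neuron on the $j$-th test set image. We find the mean activation value, $a_{i,j}$, for the feature map $F_{i,j}$. In other words, for a feature map with $h$ rows and $w$ columns,

$$a_{i,j} = \frac{1}{hw} \sum_{r=1}^{h} \sum_{c=1}^{w} F_{i,j}[r, c].$$

Per neuron, this results in $N$ values, one per test image, which compose its activation gamut

$$G_i = \{a_{i,1}, a_{i,2}, \dots, a_{i,N}\}.$$

We can perform statistical analysis on these gamuts and aggregate them per layer, such as finding the mean activation value of all neurons in the $l$-th network layer:

$$\mu_l = \frac{1}{|\mathcal{N}_l|} \sum_{i \in \mathcal{N}_l} \operatorname{mean}(G_i),$$

where $\mathcal{N}_l$ denotes the set of neurons in layer $l$. We also find the average gamut standard deviation for all neurons in the $l$-th network layer:

$$\sigma_l = \frac{1}{|\mathcal{N}_l|} \sum_{i \in \mathcal{N}_l} \operatorname{std}(G_i).$$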
We would expect the activation gamut of a volatile neuron to possess a higher standard deviation than that of a non-volatile neuron. The activation gamut of a volatile neuron may also include extremes, showing a higher maximum activation than a non-volatile neuron.
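The per-neuron and per-layer statistics described in this subsection can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the nested-list data layout, and the choice of population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def mean_activation(feature_map):
    """Average a 2D post-ReLU feature map (a list of rows) into one scalar."""
    return mean(v for row in feature_map for v in row)


def activation_gamut(feature_maps):
    """Gamut of one neuron: its mean activation on every cached test image."""
    return [mean_activation(fm) for fm in feature_maps]


def layer_statistics(layer):
    """Aggregate gamuts over a layer.

    `layer` is a list of neurons; each neuron is a list of per-image
    feature maps. Returns (mean of gamut means, mean of gamut standard
    deviations). Population std is an assumption of this sketch.
    """
    gamuts = [activation_gamut(neuron) for neuron in layer]
    layer_mean = mean(mean(g) for g in gamuts)
    layer_volatility = mean(pstdev(g) for g in gamuts)
    return layer_mean, layer_volatility


# Hypothetical toy layer: two neurons, three images, 2x2 feature maps.
responsive = [[[0.0, 0.0], [0.0, 0.0]],
              [[1.0, 1.0], [1.0, 1.0]],
              [[2.0, 2.0], [2.0, 2.0]]]
constant = [[[0.5, 0.5], [0.5, 0.5]]] * 3
print(layer_statistics([responsive, constant]))
```

In the toy layer, the first neuron's gamut spans several values while the second always returns the same value, so only the first contributes to the layer's volatility.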
3.3 Measuring Sparsity
Along with research directions into network uncertainty and robustness, neural network sparsity has become a subject of interest for many machine learning researchers [9, 13, 24]. Sparse neural networks are desirable because they require less computation at test time, demand less memory [9], and are less likely to overfit to training data [19]. In the context of our investigation, the tendency of sparse neural networks to overfit more slowly to training data can allow them to avoid memorizing noisy training labels. We can evaluate network sparsity at a per-neuron level: which neurons never or rarely activate, for any and all test samples, and how common are these neurons throughout the entire model? Network sparsity can be defined in terms of the subset of neurons whose output is always zero [9], or very close to zero: these neurons do not affect the final predictions in any significant manner. The larger the subset of neurons with this property, the sparser the network’s learned representation. Because neural networks can easily overfit to noise in training labels [5], we are interested in the observed property of sparse models to overfit more slowly. With fewer tunable parameters available, sparse models have fewer degrees of freedom to overfit to noise.
4 Results
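Table 1 (MNIST)

| Metric | Model | conv0 | conv1 | fc1 | fc2 | fc3 |
|---|---|---|---|---|---|---|
| Activation STD | Certain MNIST | 0.215 | 0.5367 | 2.386 | 1.335 | 1.5733 |
| Activation STD | MCDropout MNIST | 0.0646 | 0.1085 | 0.4207 | 0.2144 | 0.9182 |
| Activation Mean | Certain MNIST | 1.009 | 1.3936 | 1.4381 | 0.8383 | 0.0324 |
| Activation Mean | MCDropout MNIST | 0.2443 | 0.2041 | 0.1567 | 0.1208 | 0.498 |
| Unresponsive neurons | Certain MNIST | 0.0 | 0.0 | 0.0916 | 0.0 | 0.0 |
| Unresponsive neurons | MCDropout MNIST | 0.1666 | 0.25 | 0.5083 | 0.2023 | 0.0 |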
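Table 2 (CIFAR-10)

| Metric | Model | conv0 | conv1 | conv2 | conv3 | fc1 | fc2 | fc3 |
|---|---|---|---|---|---|---|---|---|
| Activation STD | Certain ConvNet | 0.0602 | 0.0343 | 0.1715 | 0.1279 | 7.3578 | 11.8364 | 6.5248 |
| Activation STD | MCDropout ConvNet | 0.047 | 0.0123 | 0.0708 | 0.0871 | 4.3155 | 9.2449 | 4.480 |
| Activation Mean | Certain ConvNet | 0.0818 | 0.04378 | 0.238 | 0.106 | 1.6077 | 5.038 | 0.075 |
| Activation Mean | MCDropout ConvNet | 0.060 | 0.0149 | 0.091 | 0.0616 | 0.4894 | 3.340 | 2.424 |
| Unresponsive neurons | Certain ConvNet | 0.4166 | 0.4583 | 0.4323 | 0.4414 | 0.5449 | 0.2031 | 0.0 |
| Unresponsive neurons | MCDropout ConvNet | 0.4583 | 0.71875 | 0.7083 | 0.6172 | 0.7851 | 0.0781 | 0.0 |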
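Table 3 (Animal-10n)

| Metric | Model | conv0 | conv1 | conv2 | conv3 | fc1 | fc2 | fc3 |
|---|---|---|---|---|---|---|---|---|
| Activation STD | Certain ConvNet | 0.2037 | 0.0202 | 0.0596 | 0.0304 | 2.077 | 5.071 | 5.695 |
| Activation STD | MCDropout ConvNet | 0.0191 | 0.0132 | 0.0277 | 0.0301 | 1.6326 | 4.032 | 3.8372 |
| Activation Mean | Certain ConvNet | 0.0217 | 0.0146 | 0.0599 | 0.0191 | 0.4833 | 3.017 | 2.256 |
| Activation Mean | MCDropout ConvNet | 0.0172 | 0.0086 | 0.0265 | 0.0148 | 0.210 | 1.534 | 1.9296 |
| Unresponsive neurons | Certain ConvNet | 0.625 | 0.6354 | 0.4583 | 0.6367 | 0.4804 | 0.1094 | 0.0 |
| Unresponsive neurons | MCDropout ConvNet | 0.6666 | 0.7708 | 0.7448 | 0.4687 | 0.6679 | 0.5 | 0.0 |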
We study the classification efficacy, neuron responsiveness, and network sparsity of MCDropout and certainty models on three benchmark classification datasets: MNIST, CIFAR-10, and Animal-10n [23]. We use two different architectures: LeNet-5 [16] and ConvNet, a convolutional neural network architecture with 4 convolutional layers followed by 3 fully connected layers. To maximize the effect of MCDropout, we use an all-layer MCDropout architecture where each layer in the certainty model is augmented with MCDropout. Our investigation compares the findings for the original certainty model and its augmented all-layer MCDropout model.
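To make the MCDropout inference procedure concrete, the following sketch runs several stochastic forward passes through a toy fully connected network with dropout kept active after every hidden layer, then summarizes the predictions with the statistics named in Section 2.2. It is a minimal illustration under stated assumptions: the weights, the dropout rate, the number of passes, and all function names are hypothetical and are not taken from the experiments in this paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(values, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale survivors."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]


def forward(x, layers, p, rng):
    """One stochastic pass; dropout follows every hidden layer and stays on."""
    h = x
    for weights in layers[:-1]:
        h = [max(0.0, sum(w * v for w, v in zip(row, h))) for row in weights]
        h = dropout(h, p, rng)
    logits = [sum(w * v for w, v in zip(row, h)) for row in layers[-1]]
    return softmax(logits)


def mc_dropout_predict(x, layers, p=0.5, passes=100, seed=0):
    """Summarize the predictive distribution over several stochastic passes."""
    rng = random.Random(seed)
    samples = [forward(x, layers, p, rng) for _ in range(passes)]
    classes = range(len(samples[0]))
    mean_probs = [sum(s[k] for s in samples) / passes for k in classes]
    std_probs = [
        math.sqrt(sum((s[k] - mean_probs[k]) ** 2 for s in samples) / passes)
        for k in classes
    ]
    votes = [s.index(max(s)) for s in samples]
    # Variation ratio: share of passes that disagree with the modal class.
    variation_ratio = 1.0 - max(votes.count(k) for k in classes) / passes
    # Entropy of the mean predictive distribution.
    entropy = -sum(q * math.log(q) for q in mean_probs if q > 0.0)
    return {"mean": mean_probs, "std": std_probs,
            "variation_ratio": variation_ratio, "entropy": entropy}


# Hypothetical weights: 2 inputs -> 2 hidden units -> 3 classes.
toy_layers = [[[1.0, -1.0], [0.5, 0.5]],
              [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]]
print(mc_dropout_predict([1.0, 2.0], toy_layers))
```

Setting `p=0.0` recovers a deterministic network: every pass gives the same prediction, so the standard deviation and variation ratio collapse to zero.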
We train both models on noisy training labels. Because we are evaluating these models on a classification task, mislabeled data simply means training samples that are labeled with an incorrect class. For MNIST and CIFAR-10, we use a uniform noise simulation scheme to add noise to 35% of our training labels: in this scheme, each label has a 35% chance of being corrupted, and a corrupted label is equally likely to be replaced by any of the other classes. Once training is complete, we run our trained models on clean test data and compute all the neurons’ individual activation maps after the application of an activation function. All of our chosen architectures use ReLU as their activation.
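A uniform noise simulation scheme of this kind can be sketched as follows. This is an illustrative reading rather than the exact procedure used in the experiments: it assumes each label is flipped independently with probability `noise_rate`, and the function name and seed handling are hypothetical.

```python
import random


def inject_uniform_noise(labels, num_classes, noise_rate=0.35, seed=0):
    """Flip each label with probability noise_rate to a different class.

    A flipped label is drawn uniformly from the other num_classes - 1
    classes, so it never equals the original label.
    """
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            other = rng.randrange(num_classes - 1)
            # Shift indices at or above y by one to skip the true class.
            noisy.append(other if other < y else other + 1)
        else:
            noisy.append(y)
    return noisy


clean = [i % 10 for i in range(10000)]
noisy = inject_uniform_noise(clean, num_classes=10)
flipped = sum(1 for a, b in zip(clean, noisy) if a != b) / len(clean)
print(f"fraction of corrupted labels: {flipped:.3f}")
```

Because flips are sampled independently, the realized fraction of corrupted labels is close to, but not exactly, the requested rate; sampling a fixed-size subset of indices would be an equally reasonable alternative.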
4.1 Classification Efficacy
We compare the performance of the certainty and MCDropout models trained on noisy data and plot their training and validation accuracy curves over time (Figure 5). We see an emerging trend that is consistent across all models and datasets. The certainty model overfits to the noise in the training data and reaches a similar or higher final training accuracy than the MCDropout model. However, the MCDropout model consistently produces a higher validation accuracy.
Next we investigate why MCDropout outperforms the certainty model by analyzing the representations learned by both models. Consider the results on MNIST shown in Figure 5 (left). Given that the training accuracies of the certainty and MCDropout models are quite similar after 100 epochs, both models are clearly learning something. However, given the vastly different test accuracies of the two models (the uncertain models undoubtedly generalize better to the test dataset), the models are representing information differently.
4.2 Neuron Responsiveness Measured by Volatility
Next, we compare the volatility of the neurons in the uncertain and certain models. As mentioned earlier, we cache each neuron’s activation map for every image in the test set and find the mean activation value for each feature map. To measure volatility, we compare two statistics per layer: the standard deviation of the layer’s mean activation values and the mean of the layer’s mean activation values.
We show the results of this investigation in Table 1 for MNIST, Table 2 for CIFAR-10 and Table 3 for Animal-10n. With the exception of a few layers, the neurons in the certainty models activate more strongly (the mean activation is higher in most layers) and with greater variation (the standard deviation of the activations is higher in most layers).
4.3 Network Sparsity
We can compare the sparsity via analysis of the neurons’ individual feature maps. Sparser models contain more neurons whose feature maps have values close to some constant c, usually 0, no matter the input sample from the test set.
For qualitative evaluation, we visualize the post-activation feature maps of individual neurons for a given sample of test data. We can visually compare how many neurons seem near constant or activate in only small patches of the map. We show heatmaps from various layers in Figure 2 (MNIST), Figure 3 (CIFAR-10), and Figure 4 (Animal-10n). In all cases, activation maps from MCDropout models have spatially sparse activations: when they do activate, it is in tight, localized regions, and large patches of each activation map remain inactive. In addition, several of the MCDropout activation maps show very little activation at all.

We also calculate the mean activation value per neuron per image in each experiment’s test dataset. We plot per-neuron histograms of these mean activation values for the certain and MCDropout models in Figures 6 (MNIST), 7 (CIFAR-10), and 8 (Animal-10n). We can then compare the mean and support of the resulting activation distributions: in many cases, the distributions from MCDropout models possess smaller supports and are centered more closely around a mean activation value of 0.0. This provides an intuitive understanding of why MCDropout is more robust against noisy labels: the neurons that may be influenced by noisy labels in the certainty model are not activated in MCDropout models. MCDropout layers provide regularization against these “corrupted” neurons.
These qualitative traits show that the MCDropout model’s learned representation is more sparse. For a quantitative analysis, we count how many neurons are “relatively unresponsive” based on their gamut of possible activations over all the test images. Neurons that rarely activate (that is, the mean of their activations over all images in the test set falls below some epsilon threshold) are tallied in the final rows of Tables 1 (MNIST), 2 (CIFAR-10), and 3 (Animal-10n). We report these numbers as the ratio of “relatively unresponsive” neurons to the total number of neurons in the layer. The results show that the majority of the MCDropout models’ layers have more dead neurons than the corresponding layers in the certain model do. This indicates that the uncertain model has learned a more sparse representation.
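The tally of relatively unresponsive neurons can be sketched as a thresholded count over per-neuron gamuts. This is a minimal sketch: the default epsilon is a hypothetical placeholder, since no threshold value is stated here, and the function name and input layout are assumptions.

```python
from statistics import mean


def unresponsive_ratio(gamuts, epsilon=1e-3):
    """Fraction of neurons in a layer whose mean activation is below epsilon.

    `gamuts` holds one list per neuron, containing that neuron's mean
    activation on each test image. The default epsilon is illustrative.
    """
    if not gamuts:
        raise ValueError("layer has no neurons")
    unresponsive = sum(1 for g in gamuts if mean(g) < epsilon)
    return unresponsive / len(gamuts)


# Hypothetical layer of four neurons evaluated on three test images.
toy_gamuts = [
    [0.0, 0.0, 0.0],     # never activates
    [0.0, 0.0005, 0.0],  # activates negligibly
    [0.2, 0.9, 0.4],     # responsive
    [1.5, 0.0, 0.0],     # activates rarely but strongly
]
print(unresponsive_ratio(toy_gamuts))
```

A mean-based threshold treats a neuron that fires strongly on a single image as responsive; counting the fraction of images with near-zero activation would be a stricter alternative.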
5 Discussion
We have compared the representations, at a per-neuron level, learned by MCDropout models and certainty models when trained with noisy labels. The representation learned by MCDropout models is less volatile but more sparse, a plausible explanation for their greater effectiveness and generalization in noisy-label scenarios. MCDropout provides regularization so that neurons are not overly influenced by the noisy labels; as a result, these neurons are not activated at test time, which contributes to robustness against noisy-label training. With fewer free parameters to over-explain training label noise, MCDropout models forge representations that are less capable of overfitting to noisy labels.
Our larger goal in this investigation is not to build state-of-the-art models on any of the presented datasets or to find the best network for dealing with noisy labels. Rather, we investigate interpretable metrics and observations from the learned representations of models that identify why the MCDropout model outperforms the certainty model.
In our experiments, we primarily analyzed all-layer MCDropout for the purpose of maximizing the MCDropout effect. However, we acknowledge that there are other configurations of uncertainty placement. As seen in Figure 9, we further analyze different MCDropout placement configurations on the MNIST dataset and find that MCDropout on all layers achieves the best test classification accuracy when training with noisy labels. Such behavior is consistent with the theoretical result that all-layer MCDropout best approximates a Bayesian neural network. While other configurations, such as converting only convolutional layers, internal layers, or final layers to MCDropout layers, still outperform the certainty model, the best-performing model benefits from the largest number of MCDropout layers. We believe research on the optimal trade-off between classification performance and MCDropout layer placement is critical for noisy-label training with constraints on memory or inference time.
We hope this investigation helps us ask and answer more questions. It would be interesting to see whether our observations about volatility and sparsity hold for other uncertainty estimation or ensemble methods such as Bootstrap [15] or Bayes by Backprop [4]. These observations may also help us tune MCDropout-related hyperparameters, such as the best locations in a model architecture to place MCDropout layers, or layers capable of uncertainty estimation in general.
References
[1] (2020) Label noise types and their effects on deep learning. arXiv preprint arXiv:2003.10471.
[2] (2019) Unsupervised label noise modeling and loss correction. In International Conference on Machine Learning, pp. 312–321.
[3] (2019) GAN dissection: visualizing and understanding generative adversarial networks. In Proceedings of the International Conference on Learning Representations (ICLR).
[4] (2015) Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, Volume 37, ICML’15, pp. 1613–1622.
[5] (2021) Uncertainty estimation methods in the presence of noisy labels. Advances in Neural Information Processing Systems, Women in Machine Learning Workshop.
[6] (2020) Robust deep learning with active noise cancellation for spatial computing. arXiv preprint arXiv:2011.08341.
[7] (2018) A semi-supervised two-stage approach to learning from noisy labels. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1215–1224.
[8] (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
[9] (2019) The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574.
[10] (2017) Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
[11] (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. arXiv preprint arXiv:1804.06872.
[12] (2018) Using trusted data to train deep networks on labels corrupted by severe noise. arXiv preprint arXiv:1802.05300.
[13]
[14] (2018) MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304–2313.
[15]
[16] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[17] (2020) DivideMix: learning with noisy labels as semi-supervised learning. arXiv preprint arXiv:2002.07394.
[18] (2019) Curriculum loss: robust learning and generalization against label corruption. arXiv preprint arXiv:1905.10045.
[19] (2018) A survey of sparse-learning methods for deep neural networks. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pp. 647–650.
[20] (2017) Decoupling "when to update" from "how to update". arXiv preprint arXiv:1706.02613.
[21] (2019) SELF: learning to filter noisy labels with self-ensembling. arXiv preprint arXiv:1910.01842.
[22] (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952.
[23] (2019) SELFIE: refurbishing unclean samples for robust deep learning. In ICML.
[24] (2017) Training sparse neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 455–462.
[25] (2019) Symmetric cross entropy for robust learning with noisy labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 322–330.
[26] (2020) Part-dependent label noise: towards instance-dependent label noise. Advances in Neural Information Processing Systems 33.
[27] (2019) How does disagreement help generalization against label corruption? In International Conference on Machine Learning, pp. 7164–7173.
[28] (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. arXiv preprint arXiv:1805.07836.