1 Introduction
Convolutional neural networks (CNNs) provide state-of-the-art results for many machine learning challenges, such as image classification [1], detection [2] and segmentation [3]. However, training these models requires large datasets of labeled samples. Time and cost limitations come into play in the creation of such datasets, and often result in imperfect labeling, or label noise, due to human error [4]. An alternative to manual annotation is to take images from the Internet and use the surrounding text to produce labels. This approach results in noisy labels too.
Perhaps surprisingly, it has been repeatedly shown, e.g. in [5], that neural nets trained on datasets with high levels of label noise are still able to make accurate predictions. Yet, as we show hereafter, the ability of the network to overcome label noise depends on the type of the noise distribution.
Fig. 1 demonstrates this behavior for different types of noise distributions. Fig. 1(a) shows embeddings of deep features of the 10 classes in MNIST, where we randomly change the labels of of the training data. A neural network trained with this data is capable of reaching prediction accuracy. The same behavior is also observed when the labels of each class are consistently flipped to another specific class (e.g. to , to , etc.). On the other hand, Fig. 1(b) shows the case where concentrated groups of samples have all their labels flipped to the same label. Here too, of the labels are changed, but the noise is no longer distributed uniformly in feature space but is rather locally concentrated in different parts. In this case, the neural network does not overcome the label noise and prediction accuracy drops to .
In this work, we offer an explanation for this phenomenon that is based on a connection between neural networks and K-nearest neighbors (KNN). We demonstrate that a CNN, in a similar way to the KNN algorithm, predicts the label of a test sample based on a neighborhood of the training samples. Thus, analyzing the behavior of KNN in the presence of label noise can serve as a way to understand the behavior of CNN in the presence of this noise.
We develop an analytical expression for the expected accuracy of the network at any given noise level for various types of label noise. We test our hypothesis on both the MNIST and CIFAR-10 datasets. We show that empirical curves of accuracy-per-noise-level fit well with the curves produced by our proposed mathematical expression.
From the relationship between neural networks and KNN follows an important conclusion about the resistance of CNN to label noise: the amount of resistance depends on how well the noisy samples are spread in the training set. In cases where they are randomly spread, the resistance is high since the probability of noisy samples overcoming the correct ones in any local neighborhood is small. However, when the noisy samples are locally concentrated, neural nets are completely unable to overcome the noise.
2 Related Work
Classification in the presence of label noise has long been explored in the context of classical machine learning [6]. Recently, it has also been studied in the context of deep neural networks. Several works, e.g. [5, 7, 8] have shown that neural nets trained on large and noisy datasets can still produce highly accurate results.
For example, Krause et al. [5] report classification results on up to categories. Their key observation is that working with large scale datasets that are collected by image search on the web leads to excellent results even though such data is known to contain noisy labels.
Sun et al. [8] report logarithmic growth in performance as a function of training set size. They perform their experiments on the JFT-300M dataset, which has more than 375M noisy labels for 300M images. The annotations have been cleaned using complex algorithms. Still, they estimate that as much as of the labels are noisy and they have no way of detecting them.
In [10, 11], an extra noise layer is introduced to the network to address label noise. It is assumed that the observed labels were created from the true labels by passing through a noisy channel whose parameters are unknown. Their method simultaneously learns both the neural network parameters and the noise distribution. They report improvement in classification results on several datasets.
Xiao et al. [12] combine a small set of clean labeled data with a large collection of noisy labeled data. They model the relationships between images, class labels and label noise with a probabilistic graphical model and further integrate it into an end-to-end deep learning system. In a synthetic experiment they show that the robustness of their algorithm to noise is of up to on the CIFAR-10 dataset. They also show that on a large clothing dataset, their method outperforms previous techniques that do not use noisy labels.
Reed et al. [13] combat noisy labels by means of consistency. They consider a prediction to be consistent if the same prediction is made given similar percepts, where the notion of similarity is between deep network features computed from the input data. They report substantial improvements in several challenging recognition tasks.
Liu et al. [14] propose to use importance reweighting to deal with label noise in CNN. They extend the idea of using an unbiased loss function for reweighting to improve resistance to label noise in the classical machine learning setting [15].
Li et al. [16] suggest using a small clean dataset (with no noisy labels) together with side information that provides label relations in the form of a graph to improve learning from noisy labels.
Malach and Shalev-Shwartz [17] suggest a different method for overcoming label noise. They train two networks, and only allow a training sample to participate in the gradient descent stage of training if these networks disagree on the prediction for this sample. This allows the training process to ignore incorrectly labeled training samples, as long as both networks agree about what the correct label should be.
Rolnick et al. [18] treat the case where for each clean label, several noisy labels (for the same sample) are added to the training. They show that adding up to (for MNIST) or (for CIFAR-10) noisy labels for each clean label decreases the accuracy by only . In addition, they show that training in this regime requires a significant but manageable increase in the dataset size that is related to the factor by which the correct labels have been diluted.
The explanation they provide for this behavior is based on an analysis of the stochastic gradient step. Specifically, they claim that within a batch, gradient updates from randomly sampled noisy labels roughly cancel out, while gradients from correct samples that are marginally more frequent sum together and contribute to learning. By this logic, large batch sizes are more robust to noise since the mean gradient over a larger batch is closer to the gradient of correct labels.
3 Label Noise Types
In the “ideal” classification setting, we have a training set $S = \{(x_i, y_i)\}$ and a test set $T$, where $x_i$ is typically an image, and $y_i$ is a label from the label set $\mathcal{L} = \{1, \dots, L\}$. The network is trained on $S$ and tested on $T$. Yet, in the setting with label noise, the network is trained on a noisy training set $\tilde{S}$, which is derived from the clean data $S$ by changing some of the labels. We next describe several different types of label noise.
In the simplest label noise scenario, a random subset of the training samples receives a random new label, uniformly sampled from $\mathcal{L}$. This occurs, for example, when a human operator makes a random error while labeling the training samples [4]. We define the noise level, $\gamma$, as the fraction of the training set that gets its labels reassigned, and we say that these samples have been corrupted. This setting is used, for example, by Bekker et al. [10] and we will refer to it as random label-noise.
Another common type of label noise is flip label-noise. In this setting, each label has one counterpart with which it may be replaced. For example, humans might consistently confuse two particular breeds of dogs that appear very similar. Again, samples are randomly selected, and for each one the true label is replaced with its counterpart. This setting is used, for example, by Reed et al. [13].
A more general case is confusion-matrix label-noise. In this setting, the probability of the new label depends on the original label, and is described by a conditional probability function $C(\tilde{y} \mid y)$, where $y$ is the original label and $\tilde{y}$ is the new one. $C$ can also be called a confusion matrix. This setting captures similarity in appearance between images of different categories, which leads to error in labeling. This setting is used, for example, by Sukhbaatar et al. [9]. This noise type includes the previous two cases: the random case arises with $C(\tilde{y} \mid y) = 1/L$ for every pair of labels, where $L$ is the number of classes, and the flip type corresponds to the case where the confusion matrix is a permutation matrix.
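The three randomly spread noise types above can be sketched with a single corruption routine driven by a confusion matrix. This is a minimal illustration rather than the code used in the experiments: the function names, the cyclic-shift choice of flip counterpart, and the convention that row `y` of `confusion` holds the distribution of the new label given the original label `y` are all assumptions made for this example.

```python
import random


def corrupt_labels(labels, num_classes, noise_level, confusion, seed=0):
    """Return (noisy_labels, corrupted_indices).

    A random subset of round(noise_level * len(labels)) samples is selected,
    and each selected label y is redrawn from the row confusion[y].
    """
    rng = random.Random(seed)
    n_corrupt = round(noise_level * len(labels))
    chosen = rng.sample(range(len(labels)), n_corrupt)
    noisy = list(labels)
    for idx in chosen:
        row = confusion[labels[idx]]
        noisy[idx] = rng.choices(range(num_classes), weights=row, k=1)[0]
    return noisy, set(chosen)


def random_confusion(num_classes):
    # Random noise: every label is equally likely, including the original one.
    return [[1.0 / num_classes] * num_classes for _ in range(num_classes)]


def flip_confusion(num_classes):
    # Flip noise: label y is always replaced by one fixed counterpart.
    # The cyclic shift y -> (y + 1) mod L is an illustrative permutation.
    return [[1.0 if j == (i + 1) % num_classes else 0.0
             for j in range(num_classes)]
            for i in range(num_classes)]
```

Passing any other row-stochastic matrix as `confusion` gives the general confusion-matrix setting; note that under random noise a corrupted sample may receive its original label again.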
In all previous settings, the noisy labels are randomly spread in the training set. In the locally concentrated noise setting, which we consider in this work, the noisy labels are locally concentrated in some region of the feature space. This type of error could occur for example if a human operator is tasked with marking images as either cat or dog, but consistently marks all poodles as cat. In this example, all poodle samples are concentrated in a subregion of the dog samples, and all are mislabeled. We will show that KNN and, by extension CNN, are resilient to randomly spread label noise but not to locally concentrated noise.
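To make the contrast with randomly spread noise concrete, the sketch below flips a contiguous region of one class in feature space. It is a simplified stand-in for the clustering procedure of Section 5, not a reproduction of it: instead of clustering, it picks one anchor sample of the class and flips the labels of the class members closest to it. The function name and all parameters are hypothetical.

```python
import math
import random


def concentrate_noise(features, labels, target_class, new_label, fraction, seed=0):
    """Flip the labels of a locally concentrated group inside one class.

    features: list of equal-length tuples (feature vectors).
    fraction: share of the target class whose labels are changed.
    """
    rng = random.Random(seed)
    members = [i for i, y in enumerate(labels) if y == target_class]
    anchor = features[rng.choice(members)]
    # Sort class members by Euclidean distance to the anchor sample.
    members.sort(key=lambda i: math.dist(features[i], anchor))
    n_flip = round(fraction * len(members))
    noisy = list(labels)
    for i in members[:n_flip]:
        noisy[i] = new_label
    return noisy
```

Because every flipped sample lies near the anchor, a neighborhood drawn from that region consists almost entirely of mislabeled samples, which is the situation a plurality vote cannot recover from.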
4 Mathematical Analysis
Equipped with the characterization of the different types of label noise, in this section we suggest that the prediction produced by neural networks is in fact the plurality label in a neighborhood of training samples, i.e. the most common label in the neighborhood. Following this assumption, we produce an analytical expression for the expected accuracy of a neural network, which is in fact the probability of the plurality label remaining unchanged when label noise is added. To show this, we take the following strategy: We first show empirically that the output of the softmax of a CNN resembles the output of a KNN. With this observation in hand, we derive a formula for KNN with the hypothesis that it applies also to CNN. The experiments in Section 5 demonstrate the validity of this hypothesis.
4.1 The connection between CNN and KNN
We start by investigating the relationship between neural networks and KNN. This connection is demonstrated by examining the output of the softmax layer of the network, which is essentially a probability distribution over the class labels. We have empirically observed for various networks, different datasets and the different noise types that when a sample $x$ is fed to the network, the output of this layer, denoted as $s(x)$, tends to encapsulate the local distribution of the training samples in the vicinity of $x$. The final output of the network is $\hat{y}(x) = \arg\max_i s_i(x)$. We suggest that this output is the most common label in the neighborhood of $x$, or the plurality label. The conclusion is that, similarly to KNN, neural networks output the most common label seen in the training set in the neighborhood around a given input $x$.
For demonstration purposes, we present some representative results for this phenomenon on the MNIST and CIFAR-10 datasets in Figs. 2 and 3 respectively. They demonstrate that when a sample $x$ is fed into the network, the output of the network’s softmax layer is approximately the distribution of the labels in the neighborhood of training samples around $x$. For example, when there is random noise with noise level , we see that the peak of the softmax is at and the rest of the bins contain approximately , which is the number of noisy samples from each class expected to be in any local neighborhood. In the case of flip noise, it can be seen that the softmax probabilities spread only over the classes with which the flip occurs and that the value is proportional to the amount of noise.
As the network’s prediction is the argmax of this distribution, i.e. the most common label in the neighborhood (the plurality label), the network makes a mistake only when the “wrong” class achieves plurality in a local neighborhood. This is the case when locally concentrated noise is added and the test sample is taken from its vicinity.
Appendix 0.A describes another experiment that demonstrates the similarity between the softmax outputs and the local distribution of the labels of the training samples.
These findings provide us with an intuition into how CNNs are able to overcome label noise: only the plurality label in a neighborhood determines the output of the network. Therefore, adding label noise in a way that does not change the plurality label should not affect the network’s prediction. As long as the noise is randomly spread in the training set, the plurality label is likely to remain unchanged. The higher the noise level, the more likely it is that a plurality label switch will occur in some neighborhoods. In Section 4.2, we produce an analytical expression for this probability. When the noise is locally concentrated, however, the KNN-like behavior of the network leaves it with no resilience to noise. We empirically show that indeed CNNs are not resilient to this kind of noise.
4.2 Prediction accuracy
Having the relationship between CNN and KNN established, we turn to calculate the effects of label noise on the KNN accuracy, and thus also on that of CNN. We start with some definitions.
Definition 1 (Prediction Accuracy)
Prediction accuracy is defined as
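$$A \triangleq \frac{1}{|T|} \sum_{(x, y) \in T} \mathbb{1}\left[\hat{y}(x) = y\right], \tag{1}$$
where $T$ is the test set, $\hat{y}(x)$ is the network’s prediction for a test sample $x$ with true label $y$, and $\mathbb{1}[\cdot]$ is the indicator function.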
In the KNN model, the predicted label for $x$ is derived from a set $N(x)$ of $K$ neighboring training samples. The prediction is simply the most common label in the neighborhood, or the plurality label, which we denote by $y^{+}(x)$. The KNN approximation for the expected accuracy is defined as follows.
Definition 2 (KNN Prediction Accuracy)
KNN prediction accuracy is defined as
$$A_{\mathrm{KNN}} \triangleq \frac{1}{|T|} \sum_{(x, y) \in T} Q(x), \tag{2}$$
where $T$ is the test set and $Q(x) \triangleq \Pr\left[y^{+}(x) = y\right]$ is the probability that the plurality label of test sample $x$ in $N(x)$ is correct.
By expanding the expression in Eq. (2), we obtain an analytical formula for the accuracy of a KNN classifier:
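Proposition 1 (Plurality Accuracy)
Assuming that the members of each local neighborhood in the data are selected independently of all other neighborhoods, the probability that the plurality label is correct is given by
$$Q(x) = \sum_{\substack{n_1 + \dots + n_L = K \\ n_\ell > n_i \ \forall i \neq \ell}} \binom{K}{n_1, \dots, n_L} \prod_{i=1}^{L} q_i^{n_i}, \tag{3}$$
where $\ell$ is the correct label, $L$ is the number of labels, $n_i$ is the number of appearances of the label $i$ in the neighborhood $N(x)$ of $K$ training samples, and $q_i$ is the probability of any such appearance.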
Proof
Let $x_1, \dots, x_K$ be an ordering of the samples in the neighborhood, and let the labels string $\tilde{y}_1 \tilde{y}_2 \dots \tilde{y}_K$ be an assignment of labels for each sample respectively. We assume that the selection of labels is done i.i.d., and denote by $q_i$ the probability that the label $i$ is assigned to a given sample. Notice that the i.i.d. assumption is an approximation, since in reality all the labels of the entire training set are assigned together, while enforcing that exactly a fraction $\gamma$ of the labels (the noise level) undergoes corruption. A truly random assignment of labels may result in a larger or smaller number of samples being corrupted. Due to independence, the probability of the labels string is simply the product of the probabilities of each label in it. The order carries no meaning; therefore, the probability only depends on the number of appearances of each label $i$ in the string, which we denote as $n_i$. Therefore, the probability of a labels string is given by:
$$\Pr\left(\tilde{y}_1 \tilde{y}_2 \dots \tilde{y}_K\right) = \prod_{i=1}^{L} q_i^{n_i}. \tag{4}$$
Since the probability of a string depends only on the values of $n_1, \dots, n_L$, we can simplify the calculations by grouping all strings for which these values are the same. Denoting by $G(n_1, \dots, n_L)$ such a group, we have
$$\Pr\left(G(n_1, \dots, n_L)\right) = \binom{K}{n_1, \dots, n_L} \prod_{i=1}^{L} q_i^{n_i}, \tag{5}$$
where the multinomial coefficient counts the number of different orderings that can be made of a string with the required number of repeats of each letter. The probability $Q(x)$ is the sum of probabilities for all strings in which the plurality label is the correct one. Let the correct label be $\ell$; then these are the strings for which $n_\ell > n_i$ for all $i \neq \ell$. Combining this requirement with Eq. (5) leads to Eq. (3). ∎
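For small neighborhoods, the sum in Proposition 1 can be checked numerically by direct enumeration. The following is a brute-force sketch and not the optimized implementation used for the experiments; the function name and arguments are illustrative, and treating ties as failures (strict plurality) is an assumption of this example.

```python
from itertools import product
from math import factorial, prod


def plurality_probability(q, K, correct):
    """Probability that label `correct` is the strict plurality label among
    K i.i.d. draws from the label distribution q (brute-force enumeration)."""
    L = len(q)
    total = 0.0
    for counts in product(range(K + 1), repeat=L):
        if sum(counts) != K:
            continue  # not a valid assignment of K labels
        if any(counts[i] >= counts[correct] for i in range(L) if i != correct):
            continue  # the correct label is not the strict plurality
        # Multinomial coefficient K! / (n_1! ... n_L!).
        coeff = factorial(K)
        for n in counts:
            coeff //= factorial(n)
        total += coeff * prod(p ** n for p, n in zip(q, counts))
    return total
```

The enumeration visits (K + 1)^L count vectors, so it is only practical for small K and L; it is meant as a reference against which faster schemes can be validated.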
What is left to show is how to calculate $q_i$. The probability $q_i$ is derived from the process that creates the noisy training set. Let $x$ be a test sample, and let $x'$ be a training sample in its neighborhood $N(x)$. Let $y'$ be the clean label of $x'$ and $\tilde{y}'$ be its noisy label. We denote by $p_x$ the clean label distribution in $N(x)$. In other words, $p_x(j) = \Pr[y' = j]$. As we show in Figs. 2 and 3, an estimate for this distribution is given by the output of the softmax layer of a network trained on clean data. Thus, the expression for $q_i$ is given by
$$q_i = (1 - \gamma)\, p_x(i) + \gamma \sum_{j=1}^{L} p_x(j)\, C(i \mid j), \tag{6}$$
where $\gamma$ is the noise level, and $C$ is the confusion matrix that defines the corruption process. Eq. (6) shows that a sample can become labeled with a noisy label $i$ in two ways: either this sample is uncorrupted and $i$ was its original label, or this sample was corrupted and received $i$ as its noisy label.
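The mixture described by Eq. (6) is straightforward to compute. The sketch below assumes the convention that `confusion[j][i]` is the probability of receiving noisy label `i` given clean label `j`; the function and argument names are illustrative and do not come from the paper.

```python
def noisy_label_distribution(p_clean, gamma, confusion):
    """Distribution of noisy labels in a neighborhood.

    p_clean[i]: probability that a neighbor's clean label is i.
    gamma: noise level (fraction of corrupted samples).
    confusion[j][i]: probability of noisy label i given clean label j.
    """
    L = len(p_clean)
    return [
        # Uncorrupted with original label i, or corrupted and assigned label i.
        (1.0 - gamma) * p_clean[i]
        + gamma * sum(p_clean[j] * confusion[j][i] for j in range(L))
        for i in range(L)
    ]
```

For a perfectly clean neighborhood, `p_clean` is a one-hot vector and the result reduces to one row of the confusion matrix mixed with that one-hot vector.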
A naive calculation of the probability in Eq. (3) by iterating over all possible valid values of the label counts is inefficient. In Appendix 0.B, we provide details on how to efficiently iterate only over the combinations where the correct label is indeed the plurality label. Next, we present how it is possible to further simplify the calculation for some special cases.
4.3 Simplified analysis of special cases
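The process of calculating $Q(x)$ can be accelerated by several orders of magnitude if the following requirements are met:

The dataset is almost perfectly learnable, meaning that a CNN is able to reach approximately 100% test accuracy when trained with clean labels.

The conditional probabilities $C(\cdot \mid y)$ are the same for all labels $y$, up to renaming of the labels.

The distribution of labels in the test set is balanced, meaning there is the same number of test samples for each label.
In these cases, the perfect learnability allows us to simplify $q_i$ by assuming that for all training samples in the neighborhood $N(x)$, the clean label is the correct label $\ell$:
$$q_i = (1 - \gamma)\, \mathbb{1}[i = \ell] + \gamma\, C(i \mid \ell), \tag{7}$$
where $\gamma$ is the noise level. Also, the probability $Q(x)$ is the same for all test samples, from which it follows that the KNN prediction accuracy equals $Q$.
For the random noise setting, $Q$ is simplified to
$$Q = \sum_{\substack{n_1 + \dots + n_L = K \\ n_\ell > n_i \ \forall i \neq \ell}} \binom{K}{n_1, \dots, n_L} \left(1 - \gamma + \frac{\gamma}{L}\right)^{n_\ell} \left(\frac{\gamma}{L}\right)^{K - n_\ell}, \tag{8}$$
where $K$ is the neighborhood size and $L$ is the number of labels, and for the flip noise setting, $Q$ is simplified to
$$Q = \sum_{\substack{n_1 + n_2 = K \\ n_1 > n_2}} \binom{K}{n_1} (1 - \gamma)^{n_1} \gamma^{n_2}, \tag{9}$$
where $n_1$ is the number of samples in $N(x)$ that have not been corrupted, and $n_2$ is the number of those that have been corrupted, i.e. flipped to the alternative label.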
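In the flip setting with a perfectly learnable dataset, the plurality label is correct exactly when the uncorrupted neighbors outnumber the corrupted ones, so the accuracy is a binomial tail. The sketch below evaluates that tail under these assumptions; the function name is illustrative, and counting ties as errors is a choice made for this example.

```python
from math import comb


def flip_noise_accuracy(K, gamma):
    """Probability that more than half of K neighbors are uncorrupted,
    when each neighbor is corrupted independently with probability gamma."""
    return sum(
        comb(K, n) * (1 - gamma) ** n * gamma ** (K - n)
        for n in range(K // 2 + 1, K + 1)  # n uncorrupted, n > K - n
    )
```

Evaluating this for a fixed odd K over a range of noise levels gives a curve that stays close to 1 for low noise and drops sharply around a noise level of 50%, where the correct and alternative labels swap roles.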
5 Experiments
[Figure caption: Random noise. The experimental curves (a, d) show the mean and standard deviation of the accuracy (taken over 10 experiments for MNIST, 7 for CIFAR-10).]
We perform several experiments that compare the empirical results of CNN trained with various types of label noise to the analytical (or numeric) curves derived from our mathematical analysis of the KNN model. We demonstrate our results on two datasets: MNIST and CIFAR-10.
The analytical expressions in Eqs. (2) and (3) are quite computationally intense. To make the running time feasible we use an optimized multi-threaded C++ implementation, and run it on a fast 8-core Intel i7 CPU. Each experiment in the analytic graphs based on Eq. (3), which are presented ahead, takes between 30 and 60 minutes to create.
To generate the empirical plots, we train multiple neural nets on data with a range of noise levels. For each noise level we train multiple networks (typically 10) and calculate the mean accuracy and its standard deviation. We use a validation/test split of 50%/50%. The validation set is used for early stopping [19], which is especially crucial because overfitting also tends to include memorization of noisy labels, which in turn ruins the network’s resistance to label noise. The test set is used to calculate the network’s accuracy.
Locally concentrated noise is produced as follows: we use the output of the penultimate layer of a network trained on clean data as a feature vector for each training sample. This is an embedding of the samples in a dimensional space. In this space, we perform k-means for each class separately to divide it into $k$ clusters. Then we select one of the clusters and change all of the labels in it into the same incorrect label. Each class has one alternative class to which the noisy labels are flipped. Running k-means with different values of $k$ results in different noise levels, from roughly 10% when $k = 10$, to roughly 50% when $k = 2$.
We start by comparing the effects of different noise settings: random, flip and locally concentrated noise. For the CIFAR-10 experiments, we use the All Convolutional Network [20].
For the MNIST dataset, we use a CNN with the following structure:
cnv@20 - cnv@20 - pool - cnv@50 - cnv@50 - pool - fc@500 - fc@10 - softmax,
where cnv is a convolutional layer using a 5x5 filter and zero-padding, fc is a fully connected layer, @c denotes the number of output channels, and pool is 2x2 max-pooling. Batch Normalization [21] is added after each convolutional and fully-connected layer, followed by a ReLU nonlinearity (except before the softmax layer). The reason we use this network for MNIST is that it achieves approximately 100% accuracy on the MNIST dataset.
Fig. 4 demonstrates that neural networks are able to resist high levels of noise, but only if the noise is randomly spread in the training set (i.e., the random and flip settings). In contrast, in the locally concentrated noise setting the network has no resistance to noise. This experiment also shows that the random noise setting is easier for the network to overcome than the flip setting. In the flip case, resistance to noise holds only until the noise level approaches 50%. In the random setting, a noticeable drop in accuracy happens only when approaching 90%. This is due to the fact that in the flip setting, at 50% there is a reversal of roles between the correct label and the alternative label, and the network ends up learning the alternative labels and ignoring the correct ones. In the random noise setting, however, the probability of the correct label being the plurality label is still higher than that of any of the other labels.
An approximate analysis of CNN accuracy based on the KNN algorithm can also be done in the locally concentrated noise setting. To do so, we need to assume that the noisy samples are concentrated in the feature space that KNN operates in. If the noise is concentrated, then the neighborhood of a test sample is almost always either:

completely contained in the corrupt area, OR

completely contained in the clean area.
In the first case, the prediction will be incorrect. In the second, it will be correct. Therefore, the expected accuracy can be determined by the fraction of test samples whose neighborhood is in the clean area. If we assume that the test samples are randomly spread in the sample space, we can expect this fraction to be $1 - \gamma$, where $\gamma$ is the noise level. Fig. 4(a) demonstrates that this is indeed the case empirically.
In Fig. 3(g,h), it can be seen that when a sample is drawn from a clean region, the output of the network shows high probability for the correct class. Yet, when sampling in a region where the noise is concentrated, the network output gives the highest probability to the class determined by the noise in that local region. Notice that the correct class gets a very low probability, as it is under-represented locally.
We now turn to present experiments for the other types of noise. We compare the empirical results with the analytical curves derived from the mathematical model of the KNN algorithm. We perform several tests comparing the empirical vs. the analytical degradation of accuracy as label noise increases. The empirical accuracy vs. noise level curve is acquired by training networks on training data with different noise levels, and measuring the networks’ accuracy on the test set. This is compared to multiple analytical curves that are produced using different values of the neighborhood size $K$. We show that the analytical curves are of the same general shape as the empirical curve.
We use the MNIST dataset with the network described above. Note that the MNIST dataset is almost perfectly learnable, which allows us to use the simplified analytical expression proposed in Section 4.3. Figs. 5 and 6 show the results for random and flip noise respectively. In Fig. 7 we show the results of an experiment where the noise follows a general confusion matrix. Indeed, our analytic curve matches the empirical curves in all three settings. As mentioned above, this is also the case for the locally concentrated noise case.
For the CIFAR-10 experiments, we use the All Convolutional Network [20]. Unlike MNIST, this dataset is not perfectly learnable, i.e., even when training with clean data the network does not achieve 100% accuracy. Therefore we must use the general case formula in Eq. (3) for the analytic curve. The results are shown in Figs. 5, 6 and 7. Also in this case, it is clear that our analytical curve matches the empirical one.
6 Conclusions
In this work, we have studied the robustness of neural networks to label noise. The underlying assumption of our analysis is that neural networks behave similarly to the K-nearest neighbors algorithm, which is especially evident in their performance when trained with noisy data. We performed several experiments that demonstrated this intuition, and then compared empirical results of training neural nets with label noise with analytical (or numeric) curves derived from a mathematical analysis of the KNN model. Our conclusion is that CNN robustness to label noise depends on the plurality label in the vicinity of a given input sample. This explains the incredible resistance of these networks to random and flip noise and their degradation in performance in the case of locally concentrated noise.
Appendix 0.A Comparison of Softmax Outputs to KNN Histograms
In this work, we have presented the conjecture that the output of the softmax layer tends to encapsulate the local distribution of the training samples in the vicinity of a given test sample. To further verify this hypothesis, we run the following test: we produce histograms of labels for K-Nearest Neighbors (with different values of K), and calculate the chi-square distance from these histograms to the softmax layer output. We use the 256-dimensional output of the penultimate layer of a network as the feature space in which we calculate KNN. The network is trained on a clean version of the CIFAR-10 dataset, and has the following structure:
cnv@20 - cnv@20 - pool - cnv@50 - cnv@50 - pool - fc@256 - fc@10 - softmax,
where cnv is a convolutional layer using a 5x5 filter and zero-padding, fc is a fully connected layer, @c denotes the number of output channels, and pool is 2x2 max-pooling. Batch Normalization is added after each convolutional and fully-connected layer, followed by a ReLU nonlinearity (except before the softmax layer). The features we use are the raw outputs of the fully connected layer with 256 output channels, before they are passed into batch normalization and ReLU. We try a range of K values, between 10 and 300, and for each sample select its preferred K value, which is the one with the lowest chi-square distance. Fig. 8(a) shows the prevalence of different choices of K. Fig. 8(b) presents the histogram of the calculated chi-square distances.
The median chi-square distance between softmax layer output and KNN histogram is , which shows that the distributions are very close to each other. To get a better sense of the meaning of this number, we show a comparison of histograms for several samples in Fig. 9, where the chi-square distance is around this value. In each pair, the softmax output and the KNN histogram for the sample’s preferred K are presented. It can be seen that these histograms are very close to each other.
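The comparison described in this appendix can be sketched as follows. The text does not state which variant of the chi-square distance is used, so the symmetric form $\frac{1}{2}\sum_i (a_i - b_i)^2 / (a_i + b_i)$ below is an assumption, as are the function names and the use of Euclidean distance in feature space.

```python
import math
from collections import Counter


def knn_label_histogram(query, features, labels, num_classes, K):
    """Normalized histogram of the labels of the K training samples
    closest to `query` (Euclidean distance)."""
    order = sorted(range(len(features)),
                   key=lambda i: math.dist(features[i], query))
    counts = Counter(labels[i] for i in order[:K])
    return [counts.get(c, 0) / K for c in range(num_classes)]


def chi_square_distance(a, b):
    """Symmetric chi-square distance between two discrete distributions.
    Bins that are empty in both distributions contribute nothing."""
    return 0.5 * sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)
```

With this form the distance is 0 for identical distributions and 1 for distributions with disjoint support, which gives a scale for interpreting a reported median value.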
Appendix 0.B Efficient Summation in the Calculation of Q
We turn to present here an efficient strategy for computing the probability in Eq. (3). A naive computation may iterate over all possible combinations of label counts, but only sum those where the plurality label is the correct one. As we shall see now, in addition to being inefficient, this is also unnecessary.
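To make the calculation more efficient, we calculate the lower and upper boundaries of each count $n_i$ such that the summation only goes through the combinations that lead to a correct plurality label. Here $n_i$ is the number of appearances of label $i$ among the $K$ neighbors, $q_i$ is the probability of each such appearance, and $L$ is the number of labels. Denoting the lower bounds by $m_i$ and the upper bounds by $M_i$, we have that
$$Q = \sum_{n_1 = m_1}^{M_1} \sum_{n_2 = m_2}^{M_2} \cdots \sum_{n_L = m_L}^{M_L} \binom{K}{n_1, \dots, n_L} \prod_{i=1}^{L} q_i^{n_i}, \tag{10}$$
where $m_i$ is the smallest number of repeats of label $i$ allowed, and $M_i$ is the largest one. Their possible values are calculated in Section 0.B.1. Notice that the number of repeats allowed for any label depends on the number of repeats already selected for all the previous labels, $n_1, \dots, n_{i-1}$.
For further efficiency, we can now decompose the summed expression so that shared parts of the calculation are only performed once. We decompose the multinomial coefficient into a product of binomial coefficients as follows:
$$\binom{K}{n_1, \dots, n_L} = \binom{K}{n_1} \binom{K - n_1}{n_2} \cdots \binom{K - n_1 - \dots - n_{L-1}}{n_L}, \tag{11}$$
and get the following formula for calculating $Q$:
$$Q = \sum_{n_1 = m_1}^{M_1} \binom{K}{n_1} q_1^{n_1} \sum_{n_2 = m_2}^{M_2} \binom{K - n_1}{n_2} q_2^{n_2} \cdots \sum_{n_L = m_L}^{M_L} \binom{K - n_1 - \dots - n_{L-1}}{n_L} q_L^{n_L}. \tag{12}$$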
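0.B.1 Defining $m_i$ and $M_i$
We will assume, without loss of generality, that the correct label is $1$. Clearly, we can repeat the same analysis by simply renaming or shuffling the labels. The bounds $m_i$ and $M_i$ on the count $n_i$ of each label $i$ need to be defined in a way that ensures:

There are exactly K letters in the string.

$1$ is the plurality label, i.e. $n_1 > n_i$ for all $i > 1$.
We can start with $M_1$, which is simply $K$. Clearly, a string consisting of K repeats of the label $1$ fulfills both requirements. Once $n_1$ is known, we can define the maximum allowed number of repeats for any other letter as $n_1 - 1$. With the definition of $M_1$, we turn to calculate $m_1$. Since $\sum_{i=1}^{L} n_i = K$ and $n_i \le n_1 - 1$ for all $i > 1$, where $L$ is the number of labels, we have that
$$K \le n_1 + (L - 1)(n_1 - 1). \tag{13}$$
By reordering the terms, we get that
$$n_1 \ge \frac{K + L - 1}{L}. \tag{14}$$
Using the fact that $m_1$ is the smallest integer satisfying (14), we have
$$m_1 = \left\lceil \frac{K + L - 1}{L} \right\rceil. \tag{15}$$
Having $m_1$ and $M_1$ set, we turn to calculate the values of $m_i$ and $M_i$ for $i > 1$. We start by defining $r_i$, which is the number of string positions that are still unassigned:
$$r_i = K - \sum_{j=1}^{i-1} n_j. \tag{16}$$
Clearly, the value of $n_i$ should be no larger than $r_i$. Thus,
$$M_i = \min\left(n_1 - 1,\; r_i\right). \tag{17}$$
Lastly, we define $m_i$ in a way that makes sure the string has no less than K letters:
$$m_i = \max\left(0,\; r_i - (L - i)(n_1 - 1)\right). \tag{18}$$
The intuition here is that if all the subsequent letters have the maximal number of repeats, $n_1 - 1$, then the label $i$ needs to be repeated enough times to bring the total repeats of all the yet unassigned letters to $r_i$.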
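The bounded, nested summation of this appendix can be sketched as a short recursion. This is an illustrative reconstruction and not the optimized multi-threaded implementation mentioned in Section 5: the function names are hypothetical, the correct label is taken to be index 0, and ties are counted as errors (every other label may appear at most one time fewer than the correct label).

```python
from math import comb


def plurality_probability_fast(q, K):
    """Probability that label 0 is the strict plurality among K i.i.d.
    draws from q, summing only over count vectors that can succeed."""
    L = len(q)
    total = 0.0
    lowest = -(-(K + L - 1) // L)  # ceil((K + L - 1) / L)
    for n_first in range(lowest, K + 1):
        cap = n_first - 1  # largest count allowed for any other label
        weight = comb(K, n_first) * q[0] ** n_first
        total += weight * _sum_rest(q, 1, K - n_first, cap)
    return total


def _sum_rest(q, i, remaining, cap):
    """Sum over the counts of labels i..L-1 that fill `remaining` positions,
    with each count at most `cap`."""
    L = len(q)
    if i == L:
        return 1.0 if remaining == 0 else 0.0
    # The later labels can absorb at most (L - 1 - i) * cap positions.
    lower = max(0, remaining - (L - 1 - i) * cap)
    upper = min(cap, remaining)
    return sum(
        comb(remaining, n) * q[i] ** n * _sum_rest(q, i + 1, remaining - n, cap)
        for n in range(lower, upper + 1)
    )
```

Each level of the recursion multiplies by one binomial factor, so the product of factors along a path equals the multinomial coefficient of the corresponding count vector, and the partial products are shared between all count vectors with a common prefix.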
References
 [1] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., eds.: Advances in Neural Information Processing Systems 25
 [2] Redmon, J., Farhadi, A.: Yolo9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242 (2016)
 [3] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR abs/1606.00915 (2016)
 [4] Ipeirotis, P.G., Provost, F., Wang, J.: Quality management on Amazon Mechanical Turk. In: ACM SIGKDD workshop on human computation
 [5] Krause, J., Sapp, B., Howard, A., Zhou, H., Toshev, A., Duerig, T., Philbin, J., Fei-Fei, L.: The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition. ArXiv e-prints (November 2015)
 [6] Frénay, B., Verleysen, M.: Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems 25 (2014)
 [7] Flatow, D., Penner, D.: On the robustness of convnets to training on noisy labels. Stanford Technical Report (2017)
 [8] Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness of data in deep learning era. CoRR abs/1707.02968 (2017)
 [9] Sukhbaatar, S., Fergus, R.: Learning from noisy labels with deep neural networks. CoRR abs/1406.2080 (2014)
 [10] Bekker, A.J., Goldberger, J.: Training deep neural-networks based on unreliable labels. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016
 [11] Goldberger, J., Ben-Reuven, E.: Training deep neural-networks using a noise adaptation layer. In: ICLR. (2017)
 [12] Xiao, T., Xia, T., Yang, Y., Huang, C., Wang, X.: Learning from massive noisy labeled data for image classification. In: CVPR, IEEE Computer Society (2015) 2691–2699
 [13] Reed, S.E., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., Rabinovich, A.: Training deep neural networks on noisy labels with bootstrapping. CoRR abs/1412.6596 (2014)
 [14] Liu, T., Tao, D.: Classification with noisy labels by importance reweighting. IEEE TPAMI 38(3) (2016)
 [15] Natarajan, N., Dhillon, I.S., Ravikumar, P.K., Tewari, A.: Learning with noisy labels. In: NIPS
 [16] Li, Y., Yang, J., Song, Y., Cao, L., Luo, J., Li, L.J.: Learning from noisy labels with distillation. In: ICCV. (2017)
 [17] Malach, E., Shalev-Shwartz, S.: Decoupling “when to update” from “how to update”. In: NIPS. (2017)
 [18] Rolnick, D., Veit, A., Belongie, S.J., Shavit, N.: Deep learning is robust to massive label noise. CoRR abs/1705.10694 (2017)
 [19] Plaut, D., Nowlan, S., Hinton, G.: Experiments on learning by back propagation. Technical Report CMU–CS–86–126, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA (1986)
 [20] Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.A.: Striving for simplicity: The all convolutional net. CoRR abs/1412.6806 (2014)
 [21] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning, Volume 37. ICML’15 (2015)