1 Introduction
Large datasets used in training modern machine learning models, such as deep neural networks, are often affected by label noise. The problem is pervasive for a simple reason: manual expert labelling of each instance at a large scale is not feasible, and so researchers often resort to cheap but imperfect surrogates. Two such popular surrogates are crowdsourcing using non-expert labellers and, especially for images, the use of search engines to query instances by a keyword, assuming the keyword as a valid label [5, 35, 3, 29, 17]. Both approaches offer the possibility to scale the acquisition of training labels, but invariably result in the introduction of label noise, which may adversely affect model training.

Our goal is to effectively train deep neural networks with modern architectures under label noise. We do so by marrying two different lines of recent research. The first strand is work on ad-hoc deep architectures tailored to the problem, primarily developed in Computer Vision [27, 32, 39, 42]. While some such approaches have shown good experimental performance on specific domains, they lack a solid theoretical framework and often need a large amount of clean labels to obtain acceptable results, in particular for pre-training or validating hyper-parameters [42, 17, 32].

The second strand is recent Machine Learning research on theoretically grounded means of combating label noise. In particular, we are interested in the design of corrected losses that are robust to label noise [38, 28, 30]. Despite their formal guarantees, these methods have not been fully appreciated in practice because, crucially, they require noise rates to be known a priori.

An estimate of the noise is often available to practitioners by polishing a subset of the training data [42], which is useful and often necessary for model selection. Yet, interestingly, recent work has provided practical algorithms for estimating the noise rates [36, 34, 21, 26, 31]; remarkably, this is achievable with absolutely no knowledge of ground truth labels. To our knowledge, no prior work has combined those estimators with loss correction techniques, nor has either idea been applied to modern deep architectures. Our contributions aim to unify these research streams:

We introduce two alternative procedures for loss correction, provided that we know a stochastic matrix $T$ summarizing the probability of one class being flipped into another under noise. The first procedure, a multi-class extension of [28, 30] applied to neural networks, is called “backward” as it multiplies the loss by $T^{-1}$. The second, inspired by [39], is named “forward” as it multiplies the network predictions by $T$. We prove that both procedures enjoy formal robustness guarantees w.r.t. the clean data distribution. Since we only operate on the loss function, the approach is independent of both architecture and application domain, and is viable for any chosen loss function.

We take a further step and extend the noise estimator of [26] to our multi-class setting, thus formulating an end-to-end solution to the problem.

We prove that for ReLU networks the Hessian of the loss is independent of label noise.

We apply our loss corrections to image recognition on MNIST, CIFAR10, CIFAR100 and sentiment analysis on IMDB; we simulate corruption by artificially injecting noise on the training labels. In order to show that no architectural choice is the secret ingredient of our robustification recipe, we experiment with a variety of network modules currently in fashion: convolutions and pooling [20], dropout [37], batch normalization [15], word embedding and residual units [11, 12]. Additional tests on LSTM [13] confirm that the procedures can be seamlessly applied to recurrent neural networks as well. Comparisons with non-corrected losses and several known methods confirm robustness of our two procedures, with the forward correction dominating the backward. Unsurprisingly, the noise estimator is the bottleneck in obtaining near-perfect robustness, yet our approach is often the best compared to prior work. Finally, we experiment with Clothing1M, the clothing images dataset of [42], and establish the new state of the art.

2 Related work
Our work leverages recent research in a number of different areas, summarized below.
Noise robustness. (We use the term robustness in its meaning of immunity to noise and not generically as “adaptivity to various scenarios”, e.g. [6].) Learning with noisy labels has been widely investigated in the literature [7]. From the theoretical standpoint, label noise has been studied in two different regimes, with vastly different conclusions. In the case of low-capacity (typically linear) models, even mild symmetric, i.e. class-independent (versus asymmetric, i.e. class-dependent), label noise can produce solutions that are akin to random guessing [22]. On the other hand, the Bayes-optimal classifier remains unchanged under symmetric [28, 26] and even instance-dependent label noise [25], implying that high-capacity models are robust to essentially any level of such noise, given sufficiently many samples.

Surrogate losses. Suppose one wishes to minimize a loss $\ell$ on clean data. When the level of noise is known a priori, [28] provided the general form of a noise-corrected loss $\hat{\ell}$ such that minimization of $\hat{\ell}$ on noisy data is equivalent to minimization of $\ell$ on clean data. In the idealized case of symmetric label noise, for certain $\ell$ one in fact does not need to know the noise rate: [8] gives a sufficient condition for which $\ell$ is robust, and several examples of such robust non-convex losses, while [41] shows that the (convex) linear or unhinged loss is its own noise-corrected loss. Another robust non-convex loss is given in [24].
Noise rate estimation. Recent work has provided methods to estimate label flip probabilities directly from noisy samples. Typically, it is required that the generating distribution is such that for each class, there exists some “perfect” instance, i.e. one that is classified with probability equal to one. Proposed estimators involve either the use of kernel mean embedding [31], or post-processing the output of a standard class-probability estimator such as logistic regression using order statistics on the range of scores [21, 26] or the slope of the induced ROC curve [34].

Deep learning with noisy labels. Several works in Deep Learning have attempted to deal with noisy labels of late, especially in Computer Vision. This is often achieved by formulating noise-aware models. [27] builds a noise model for binary classification of aerial image patches, which can handle omission and wrong location of training labels. [42] constructs a more sophisticated mix of symmetric, asymmetric and instance-dependent noise; two networks are learned by EM as models for classifier and noise type. It is often the case that a small set of clean labels is needed in order either to pre-train or fine-tune the model [42, 17, 32].

The work of [39] deserves a particular mention. The method augments the architecture by adding a linear layer on top of the network. Once learned, this layer plays the role of our matrix $T$. However, learning this architecture appears problematic; heuristics such as trace regularization and a fixed updating schedule for the linear layer are necessary. We sidestep those issues by decoupling the two phases: we first estimate $T$ and then learn with loss correction.

We are not aware of any other attempt at either applying the noise-corrected loss approach of [28] to neural networks, or combining those losses with the above noise rate estimators. Our work sits precisely in this intersection. Note that, even though in principle loss correction should not be necessary for high-capacity models like deep neural networks, owing to the aforementioned theoretical results, in practice such correction may offset the suboptimality of these models arising from training on finite samples. Specifically, we expect that directly optimizing the (corrected) objective we care about will be beneficial in the finite-sample case.
3 Preliminaries
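We begin by fixing notation. We let $[c] = \{1, \dots, c\}$ for any positive integer $c$. Column vectors are written in bold (e.g. $\mathbf{v}$) and matrices in capitals (e.g. $V$). Coordinates of a vector are denoted by a subscript (e.g. $v_j$), while rows and columns of a matrix are denoted e.g. $V_{i\cdot}$ and $V_{\cdot j}$ respectively. We denote the all-ones vector by $\mathbf{1}$, with size clear from context, and by $\Delta^{c-1} \subset [0,1]^c$ the $c$-dimensional simplex.

In supervised $c$-class classification, one has feature space $\mathcal{X} \subseteq \mathbb{R}^d$ and label space $\mathcal{Y} = \{\mathbf{e}^i : i \in [c]\}$, where $\mathbf{e}^i$ denotes the $i$-th standard canonical vector in $\mathbb{R}^c$, i.e. $\mathbf{e}^i \in \{0,1\}^c$ with $\mathbf{1}^\top \mathbf{e}^i = 1$. One observes examples $(\mathbf{x}, \mathbf{y})$ drawn from an unknown distribution $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})$ over $\mathcal{X} \times \mathcal{Y}$. We denote expectations over $p(\mathbf{x}, \mathbf{y})$ by $\mathbb{E}_{\mathbf{x},\mathbf{y}}$. Note that each $\mathbf{y}$ only has one non-zero value at the coordinate corresponding to the underlying label.

An $n$-layer neural network (w.l.o.g., we assume all layers to be fully connected, or dense; for example, convolutions can be represented by dense layers with shared sparse weights) comprises a transformation $\mathbf{h}: \mathcal{X} \to \mathbb{R}^c$, where $\mathbf{h} = \mathbf{h}^{(n)} \circ \mathbf{h}^{(n-1)} \circ \cdots \circ \mathbf{h}^{(1)}$ is the composition of a number of intermediate transformations, the layers, defined by:

$$(\forall i \in [n-1])\quad \mathbf{h}^{(i)}(\mathbf{z}) = \boldsymbol{\sigma}(W^{(i)} \mathbf{z} + \mathbf{b}^{(i)}), \qquad \mathbf{h}^{(n)}(\mathbf{z}) = W^{(n)} \mathbf{z} + \mathbf{b}^{(n)},$$

where $W^{(i)} \in \mathbb{R}^{d^{(i)} \times d^{(i-1)}}$ and $\mathbf{b}^{(i)} \in \mathbb{R}^{d^{(i)}}$ are parameters to be estimated (here, $d^{(0)} = d$, the original feature dimensionality, and $d^{(n)} = c$, the label dimensionality), and $\boldsymbol{\sigma}$ is any activation function that acts coordinate-wise, such as the ReLU $\sigma(x) = \max(0, x)$. Observe that the final layer applies a linear projection, unlike all preceding layers. To simplify notation, we write:

$$(\forall i \in [n])\quad \mathbf{x}^{(i)} = \mathbf{h}^{(i)}(\mathbf{x}^{(i-1)}),$$

with the base case $\mathbf{x}^{(0)} = \mathbf{x}$, so that e.g. $\mathbf{x}^{(1)} = \mathbf{h}^{(1)}(\mathbf{x})$ is exactly the representation in the first layer. The coordinates of $\mathbf{h}(\mathbf{x})$ represent the relative weights that the model assigns to each class to be predicted. The predicted label is thus given by $\arg\max_{i \in [c]} h_i(\mathbf{x})$. In the training phase, the output of the final layer is contrasted with the true label via two steps. First, $\mathbf{h}(\mathbf{x})$ passes through the softmax function, whose $i$-th coordinate is $\exp(h_i(\mathbf{x})) / \sum_{k=1}^{c} \exp(h_k(\mathbf{x}))$. The softmax output can be interpreted as a vector approximating the class-conditional probabilities $p(\mathbf{y}|\mathbf{x})$; we denote it by $\hat{p}(\mathbf{y}|\mathbf{x})$. Next, we measure the discrepancy between label and network output by a loss function $\ell: \mathcal{Y} \times \Delta^{c-1} \to \mathbb{R}$, for example by means of cross-entropy:

$$\ell(\mathbf{e}^i, \hat{p}(\mathbf{y}|\mathbf{x})) = -(\mathbf{e}^i)^\top \log \hat{p}(\mathbf{y}|\mathbf{x}) = -\log \hat{p}(\mathbf{y} = \mathbf{e}^i|\mathbf{x}). \tag{1}$$

With some abuse of notation, we also define a loss in vector form $\boldsymbol{\ell}: \Delta^{c-1} \to \mathbb{R}^c$, computed on every possible label:

$$\boldsymbol{\ell}(\hat{p}(\mathbf{y}|\mathbf{x})) = \big(\ell(\mathbf{e}^1, \hat{p}(\mathbf{y}|\mathbf{x})), \dots, \ell(\mathbf{e}^c, \hat{p}(\mathbf{y}|\mathbf{x}))\big)^\top \in \mathbb{R}^c. \tag{2}$$

In the following, formal results hold under very mild conditions on a generic loss function $\ell$; at times we provide examples for the cross-entropy. For simplicity, one could think of cross-entropy every time $\ell$ is mentioned.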
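To make the two forms of the loss concrete, here is a minimal NumPy sketch of the softmax and of the cross-entropy in scalar and vector form. The function names and the example logits are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax(h):
    """Numerically stable softmax over the last axis of the logits h."""
    z = h - np.max(h, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy_vector(p_hat, eps=1e-12):
    """Loss in vector form: entry i is the cross-entropy if the label were class i."""
    return -np.log(np.clip(p_hat, eps, 1.0))

def cross_entropy(y_onehot, p_hat):
    """Scalar loss for a one-hot label: selects the matching entry of the vector form."""
    return float(y_onehot @ cross_entropy_vector(p_hat))

h = np.array([2.0, 0.5, -1.0])       # hypothetical logits for c = 3 classes
p_hat = softmax(h)                   # approximates the class-conditional probabilities
loss_vec = cross_entropy_vector(p_hat)
y = np.array([0.0, 1.0, 0.0])        # one-hot label for the second class
loss = cross_entropy(y, p_hat)
```

The vector form is the object that the loss corrections in the next section act on.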
4 Label noise and loss robustness
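We now consider label noise. We assume the asymmetric, i.e. class-conditional, noise setting [28], where each label $\mathbf{y}$ in the training set is flipped to $\tilde{\mathbf{y}} \in \mathcal{Y}$ with probability $p(\tilde{\mathbf{y}}|\mathbf{y})$; feature vectors are untouched. Thus, we observe samples from a distribution $p(\mathbf{x}, \tilde{\mathbf{y}}) = \sum_{\mathbf{y}} p(\tilde{\mathbf{y}}|\mathbf{y})\, p(\mathbf{y}|\mathbf{x})\, p(\mathbf{x})$. Denote by $T \in [0,1]^{c \times c}$ the noise transition matrix specifying the probability of one label being flipped to another, so that $T_{ij} = p(\tilde{\mathbf{y}} = \mathbf{e}^j \,|\, \mathbf{y} = \mathbf{e}^i)$ for all $i, j$. The matrix is row-stochastic and not necessarily symmetric across the classes.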
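The noise process can be simulated directly from a row-stochastic matrix. The sketch below assumes labels are stored as integer class indices rather than one-hot vectors; the matrix `T`, the seed and the sample size are hypothetical.

```python
import numpy as np

def corrupt_labels(labels, T, rng):
    """Flip each integer label i to class j with probability T[i, j]."""
    T = np.asarray(T, dtype=float)
    assert np.allclose(T.sum(axis=1), 1.0), "T must be row-stochastic"
    c = T.shape[0]
    return np.array([rng.choice(c, p=T[i]) for i in labels])

rng = np.random.default_rng(0)            # seed chosen arbitrarily
T = np.array([[0.8, 0.2, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.3, 0.7]])           # hypothetical asymmetric noise
clean = rng.integers(0, 3, size=5000)     # clean labels
noisy = corrupt_labels(clean, T, rng)     # observed labels; features are untouched
```

Because the flip depends only on the clean class and not on the features, this is class-conditional noise in the sense used above.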
This is an approximation of real-world corruption which can still be useful in certain scenarios. One such case is that of classes representing a fine-grained hierarchy of concepts, for example dog breeds and bird species [17] or narrow categories of clothing [42]. Classes may be too similar to each other for non-expert human labellers to distinguish, regardless of the specific instances. Little is known about learning under the more generic feature-dependent noise, with few exceptions [42, 8, 25].
We aim to modify a loss $\ell$ so as to make it robust to asymmetric label noise; in fact, this is possible if $T$ is known. Under this assumption, which we relax later on, we introduce two alternative corrections inspired by [28] and [39].
4.1 The backward correction procedure
We can build an unbiased estimator of the loss function, such that under expected label noise the corrected loss equals the original one computed on clean data. This property is stated in the next Theorem, a multiclass generalization of [28, Theorem 1]. The Theorem is also a particular instance of the more abstract [40, Theorem 3.2].
Theorem 1
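Suppose that the noise matrix $T$ is non-singular. Given a loss $\boldsymbol{\ell}$, the backward corrected loss is defined as:
$$\boldsymbol{\ell}^{\leftarrow}(\hat{p}(\mathbf{y}|\mathbf{x})) = T^{-1}\, \boldsymbol{\ell}(\hat{p}(\mathbf{y}|\mathbf{x})).$$
Then, the loss correction is unbiased, i.e. for all $\mathbf{x}$:
$$\mathbb{E}_{\tilde{\mathbf{y}}|\mathbf{x}}\, \ell^{\leftarrow}(\tilde{\mathbf{y}}, \hat{p}(\mathbf{y}|\mathbf{x})) = \mathbb{E}_{\mathbf{y}|\mathbf{x}}\, \ell(\mathbf{y}, \hat{p}(\mathbf{y}|\mathbf{x})),$$
and therefore the minimizers are the same:
$$\arg\min_{\hat{p}(\mathbf{y}|\mathbf{x})} \mathbb{E}_{\mathbf{x},\tilde{\mathbf{y}}}\, \ell^{\leftarrow}(\tilde{\mathbf{y}}, \hat{p}(\mathbf{y}|\mathbf{x})) = \arg\min_{\hat{p}(\mathbf{y}|\mathbf{x})} \mathbb{E}_{\mathbf{x},\mathbf{y}}\, \ell(\mathbf{y}, \hat{p}(\mathbf{y}|\mathbf{x})).$$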
Proof.
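One way to see the unbiasedness claim is the following short derivation. It is a sketch under the row-stochastic convention, in which entry $(i, j)$ of $T$ is the probability that clean class $i$ is observed as class $j$, so that the noisy posterior is $p(\tilde{\mathbf{y}}|\mathbf{x}) = T^\top p(\mathbf{y}|\mathbf{x})$; the backward loss is written in vector form as $T^{-1}\boldsymbol{\ell}$.

```latex
\begin{align*}
\mathbb{E}_{\tilde{\mathbf{y}}|\mathbf{x}}\, \ell^{\leftarrow}(\tilde{\mathbf{y}}, \hat{p}(\mathbf{y}|\mathbf{x}))
  &= p(\tilde{\mathbf{y}}|\mathbf{x})^\top\, T^{-1} \boldsymbol{\ell}(\hat{p}(\mathbf{y}|\mathbf{x})) \\
  &= \big(T^\top p(\mathbf{y}|\mathbf{x})\big)^\top T^{-1} \boldsymbol{\ell}(\hat{p}(\mathbf{y}|\mathbf{x})) \\
  &= p(\mathbf{y}|\mathbf{x})^\top\, T\, T^{-1} \boldsymbol{\ell}(\hat{p}(\mathbf{y}|\mathbf{x})) \\
  &= p(\mathbf{y}|\mathbf{x})^\top \boldsymbol{\ell}(\hat{p}(\mathbf{y}|\mathbf{x}))
   = \mathbb{E}_{\mathbf{y}|\mathbf{x}}\, \ell(\mathbf{y}, \hat{p}(\mathbf{y}|\mathbf{x})).
\end{align*}
```

Equality of the minimizers then follows by taking the expectation over $\mathbf{x}$ on both sides.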
The corrected loss is effectively a linear combination of the loss values for each observable label, whose coefficients are due to the probability that $T^{-1}$ attributes to each possible true label $\mathbf{y}$, given the observed one $\tilde{\mathbf{y}}$. Intuitively, we are “going one step back” in the noise process described by the Markov chain $\mathbf{y} \to \tilde{\mathbf{y}}$. The corrected loss is differentiable, although not always non-negative, and can be minimized with any off-the-shelf algorithm for back-propagation. Although in practice $T$ would be invertible almost surely, its condition number may be problematic. A simple solution is to mix $T$ with the identity matrix before inversion; this may be seen as taking a more conservative noise-free prior.
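The backward correction can be sketched in a few lines of NumPy for the cross-entropy case. The matrix `T`, the posteriors and the mixing weight `alpha` are hypothetical; in particular, the paper only says that the noise matrix may be mixed with the identity before inversion, and the convex combination used here is one assumed way to do that.

```python
import numpy as np

def backward_loss_vector(p_hat, T, alpha=0.0, eps=1e-12):
    """Backward-corrected cross-entropy in vector form: inv(T_mix) @ loss_vector.

    alpha in [0, 1) mixes T with the identity before inversion (assumed form
    of the conditioning fix mentioned in the text).
    """
    c = T.shape[0]
    T_mix = (1.0 - alpha) * T + alpha * np.eye(c)
    loss_vec = -np.log(np.clip(p_hat, eps, 1.0))
    # Solve T_mix @ out = loss_vec instead of forming the inverse explicitly.
    return np.linalg.solve(T_mix, loss_vec)

T = np.array([[0.7, 0.3, 0.0],
              [0.0, 1.0, 0.0],
              [0.2, 0.0, 0.8]])            # hypothetical noise matrix
p_clean = np.array([0.6, 0.1, 0.3])        # clean posterior for one input
p_noisy = T.T @ p_clean                    # posterior over noisy labels
p_hat = np.array([0.5, 0.2, 0.3])          # arbitrary model prediction

plain = -np.log(p_hat)
corrected = backward_loss_vector(p_hat, T)
expected_clean = p_clean @ plain           # expected original loss, clean labels
expected_noisy = p_noisy @ corrected       # expected corrected loss, noisy labels
```

The two expectations coincide for any prediction `p_hat`, which is the unbiasedness property of Theorem 1 checked on a single input.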
4.2 The forward correction procedure
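Alternatively, we can correct the model predictions. Following [39], we start by observing that a neural network learned with no loss correction would result in a predictor for noisy labels $\hat{p}(\tilde{\mathbf{y}}|\mathbf{x})$. We can make explicit the dependency on $T$. For instance, with cross-entropy we have:
\begin{align}
\ell(\mathbf{e}^i, \hat{p}(\tilde{\mathbf{y}}|\mathbf{x})) &= -\log \hat{p}(\tilde{\mathbf{y}} = \mathbf{e}^i|\mathbf{x}) \tag{3} \\
&= -\log \sum_{j=1}^{c} p(\tilde{\mathbf{y}} = \mathbf{e}^i \,|\, \mathbf{y} = \mathbf{e}^j)\, \hat{p}(\mathbf{y} = \mathbf{e}^j|\mathbf{x}) \tag{4} \\
&= -\log \sum_{j=1}^{c} T_{ji}\, \hat{p}(\mathbf{y} = \mathbf{e}^j|\mathbf{x}), \tag{5}
\end{align}
or in matrix form $\boldsymbol{\ell}(\hat{p}(\tilde{\mathbf{y}}|\mathbf{x})) = -\log T^\top \hat{p}(\mathbf{y}|\mathbf{x})$. This loss compares the noisy label $\tilde{\mathbf{y}}$ to the averaged noisy prediction corrupted by $T$. We call this procedure “forward” correction. In order to analyze its behavior, we first need to recall the definition and properties of a broad family of losses named proper composite [33, Section 4]. Consider an invertible link function $\boldsymbol{\psi}: \Delta^{c-1} \to \mathbb{R}^c$. Many losses are said to be composite, and denoted by $\ell_{\psi}: \mathcal{Y} \times \mathbb{R}^c \to \mathbb{R}$, in the sense that they can be expressed with the aid of a link function as
$$\ell_{\psi}(\mathbf{y}, \mathbf{h}(\mathbf{x})) = \ell(\mathbf{y}, \boldsymbol{\psi}^{-1}(\mathbf{h}(\mathbf{x}))). \tag{6}$$
In the case of cross-entropy, the softmax is the inverse link function. When composite losses are also proper [33], their minimizer assumes the particular shape of the link function applied to the class-conditional probabilities $p(\mathbf{y}|\mathbf{x})$:
$$\arg\min_{\mathbf{h}} \mathbb{E}_{\mathbf{x},\mathbf{y}}\, \ell_{\psi}(\mathbf{y}, \mathbf{h}(\mathbf{x})) = \boldsymbol{\psi}(p(\mathbf{y}|\mathbf{x})). \tag{7}$$
Cross-entropy and square loss are examples of proper composite losses. An intriguing robustness property holds for forward correction of proper composite losses.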
Theorem 2
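Suppose that the noise matrix $T$ is non-singular. Given a proper composite loss $\ell_{\psi}$, define the forward loss correction as:
$$\boldsymbol{\ell}^{\rightarrow}_{\psi}(\mathbf{h}(\mathbf{x})) = \boldsymbol{\ell}(T^\top \boldsymbol{\psi}^{-1}(\mathbf{h}(\mathbf{x}))).$$
Then, the minimizer of the corrected loss under the noisy distribution is the same as the minimizer of the original loss under the clean distribution:
$$\arg\min_{\mathbf{h}} \mathbb{E}_{\mathbf{x},\tilde{\mathbf{y}}}\, \ell^{\rightarrow}_{\psi}(\tilde{\mathbf{y}}, \mathbf{h}(\mathbf{x})) = \arg\min_{\mathbf{h}} \mathbb{E}_{\mathbf{x},\mathbf{y}}\, \ell_{\psi}(\mathbf{y}, \mathbf{h}(\mathbf{x})).$$
Proof. First notice that:
$$\ell^{\rightarrow}_{\psi}(\mathbf{y}, \mathbf{h}(\mathbf{x})) = \ell(\mathbf{y}, T^\top \boldsymbol{\psi}^{-1}(\mathbf{h}(\mathbf{x}))) = \ell_{\phi}(\mathbf{y}, \mathbf{h}(\mathbf{x})), \tag{8}$$
where we denote $\boldsymbol{\phi}^{-1}(\cdot) = T^\top \boldsymbol{\psi}^{-1}(\cdot)$. Equivalently, $\boldsymbol{\phi}(\cdot) = \boldsymbol{\psi}((T^{-1})^\top \cdot)$. The link $\boldsymbol{\phi}$ is invertible by composition of invertible functions, its domain is $\Delta^{c-1}$ as for $\boldsymbol{\psi}$ and its codomain is $\mathbb{R}^c$. The last loss in Equation 8 is therefore proper composite with link $\boldsymbol{\phi}$. Finally, from Equation 7, the loss minimizer over the noisy distribution is
\begin{align}
\arg\min_{\mathbf{h}} \mathbb{E}_{\mathbf{x},\tilde{\mathbf{y}}}\, \ell_{\phi}(\tilde{\mathbf{y}}, \mathbf{h}(\mathbf{x})) &= \boldsymbol{\phi}(p(\tilde{\mathbf{y}}|\mathbf{x})) = \boldsymbol{\psi}((T^{-1})^\top p(\tilde{\mathbf{y}}|\mathbf{x})) \tag{9} \\
&= \boldsymbol{\psi}(p(\mathbf{y}|\mathbf{x})), \tag{10}
\end{align}
which proves the Theorem by Equation 7 once again.
Recall that the softmax output $\hat{p}(\mathbf{y}|\mathbf{x})$ approximates $p(\mathbf{y}|\mathbf{x})$, and thus the result applies to any neural network that is expressive enough. The property is, however, weaker than the unbiasedness of Theorem 1: robustness applies to the minimizer only, that is, the model learned by forward correction is the minimizer over the clean distribution. Yet, Theorem 2 guarantees noise robustness with no explicit matrix inversion; the “denoising” link function does it behind the scenes. This turns out to be an important factor in practice; see below.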
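The forward correction for cross-entropy can be sketched as follows: the softmax output is multiplied by the transpose of the noise matrix before taking the negative log-likelihood of the noisy label. The matrix, the posterior and the random perturbations are hypothetical and serve only to illustrate the minimizer property on one input.

```python
import numpy as np

def softmax(h):
    z = h - np.max(h)
    e = np.exp(z)
    return e / e.sum()

def forward_loss(noisy_label, h, T, eps=1e-12):
    """Forward-corrected cross-entropy: -log of entry noisy_label of T^T softmax(h)."""
    p_noisy_hat = T.T @ softmax(h)     # predicted distribution over noisy labels
    return -np.log(max(p_noisy_hat[noisy_label], eps))

def expected_forward_loss(h, T, p_clean):
    """Expectation of the forward loss when noisy labels are drawn via T from p_clean."""
    p_noisy = T.T @ p_clean
    return sum(p_noisy[j] * forward_loss(j, h, T) for j in range(len(p_noisy)))

T = np.array([[0.7, 0.3, 0.0],
              [0.0, 1.0, 0.0],
              [0.2, 0.0, 0.8]])            # hypothetical noise matrix
p_clean = np.array([0.6, 0.1, 0.3])        # clean posterior for one input
h_star = np.log(p_clean)                   # logits whose softmax equals p_clean

rng = np.random.default_rng(0)
best = expected_forward_loss(h_star, T, p_clean)
others = [expected_forward_loss(h_star + rng.normal(scale=0.5, size=3), T, p_clean)
          for _ in range(200)]
```

No perturbed logit vector achieves a lower expected loss than `h_star`, in line with Theorem 2: the minimizer under noisy labels recovers the clean posterior, and no matrix inversion is involved.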
4.3 The overall algorithm
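A limitation of the above procedures is that they require knowing $T$. In most applications, the matrix would be unknown and has to be estimated. We present here an extension of the recent noise estimator of [21, 26] to the multi-class setting. It is derived under two assumptions.
Theorem 3
Assume $p(\mathbf{x}, \mathbf{y})$ is such that:

(1) There exist “perfect examples” of each class $j \in [c]$, in the sense that
$$(\exists\, \bar{\mathbf{x}}^j \in \mathcal{X}):\quad p(\bar{\mathbf{x}}^j) > 0 \ \wedge\ p(\mathbf{y} = \mathbf{e}^j|\bar{\mathbf{x}}^j) = 1;$$

(2) given sufficiently many corrupted samples, $\mathbf{h}$ is rich enough to model $p(\tilde{\mathbf{y}}|\mathbf{x})$ accurately.

It follows that $T_{ij} = p(\tilde{\mathbf{y}} = \mathbf{e}^j|\bar{\mathbf{x}}^i)$ for all $i, j \in [c]$.
Proof. By (2), we can consider $p(\tilde{\mathbf{y}}|\mathbf{x})$ instead of $\hat{p}(\tilde{\mathbf{y}}|\mathbf{x})$. For any $j \in [c]$ and any $\mathbf{x} \in \mathcal{X}$, we have that:
$$p(\tilde{\mathbf{y}} = \mathbf{e}^j|\mathbf{x}) = \sum_{k=1}^{c} p(\tilde{\mathbf{y}} = \mathbf{e}^j \,|\, \mathbf{y} = \mathbf{e}^k)\, p(\mathbf{y} = \mathbf{e}^k|\mathbf{x}) = \sum_{k=1}^{c} T_{kj}\, p(\mathbf{y} = \mathbf{e}^k|\mathbf{x}). \tag{11}$$
By (1), when $\mathbf{x} = \bar{\mathbf{x}}^i$, $p(\mathbf{y} = \mathbf{e}^k|\bar{\mathbf{x}}^i) = 0$ for $k \neq i$, and hence the sum reduces to $T_{ij}$.
Rather surprisingly, Theorem 3 tells us that we can estimate each component of matrix $T$ just based on noisy class probability estimates, that is, the output of the softmax of a network trained with noisy labels. In particular, let $X'$ be any set of feature vectors. This can be the training set itself, but not necessarily: we do not require this sample to have any label at all and therefore any unlabeled sample from the same distribution can be used as well. We can approximate $T$ with two steps:
\begin{align}
\bar{\mathbf{x}}^i &= \arg\max_{\mathbf{x} \in X'} \hat{p}(\tilde{\mathbf{y}} = \mathbf{e}^i|\mathbf{x}) \tag{12} \\
\hat{T}_{ij} &= \hat{p}(\tilde{\mathbf{y}} = \mathbf{e}^j|\bar{\mathbf{x}}^i). \tag{13}
\end{align}
In practice, assumption (1) of Theorem 3 might hold true when $X'$ is large enough. Assumption (2) of Theorem 3 is more difficult to justify; we require that the network can perfectly model the probability of the noisy labels. Nevertheless, in the experiments we can often recover $T$ close to the ground truth and find that small estimation errors have a mild, not catastrophic, effect on the quality of the correction.
Algorithm 1 summarizes the end-to-end approach. If we know $T$, for example by manually cleaning a subset of training data, we can train with $\ell^{\leftarrow}$ or $\ell^{\rightarrow}$. Otherwise, we first have to train the network with $\ell$ on noisy data, and obtain from it estimates of $p(\tilde{\mathbf{y}}|\mathbf{x})$ for each class via the output of the softmax. After training, $\hat{T}$ can be computed from these outputs. Finally, we retrain with the corrected loss, while potentially utilizing the first network to help initialize the second one.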
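The estimation step can be sketched as follows, assuming the softmax outputs of a network trained on noisy labels are already collected in an array. The function name, the toy data and the percentile value are hypothetical; the percentile option mirrors the variant described in the experiments, and the final row normalization is the one applied there.

```python
import numpy as np

def estimate_T(noisy_probs, percentile=None):
    """Estimate the noise matrix from softmax outputs of a net trained on noisy labels.

    noisy_probs: array of shape (m, c); row k approximates the noisy posterior at x_k.
    percentile: if None, pick the argmax of each column (two-step recipe above);
                otherwise pick the sample closest to that percentile of the column.
    """
    m, c = noisy_probs.shape
    T_hat = np.empty((c, c))
    for i in range(c):
        col = noisy_probs[:, i]
        if percentile is None:
            idx = int(np.argmax(col))
        else:
            threshold = np.percentile(col, percentile)
            idx = int(np.argmin(np.abs(col - threshold)))
        T_hat[i] = noisy_probs[idx]        # row i: noisy posterior at the selected example
    return T_hat / T_hat.sum(axis=1, keepdims=True)   # row-normalize

# Toy check with exact noisy posteriors and one "perfect example" per class.
rng = np.random.default_rng(0)
T = np.array([[0.7, 0.3, 0.0],
              [0.0, 1.0, 0.0],
              [0.2, 0.0, 0.8]])            # hypothetical ground-truth noise
clean_post = np.vstack([np.eye(3), rng.dirichlet(np.ones(3), size=500)])
noisy_probs = clean_post @ T               # each row is T^T applied to a clean posterior
T_hat = estimate_T(noisy_probs)
T_hat_pct = estimate_T(noisy_probs, percentile=97)    # percentile value is arbitrary
```

In this toy setup the diagonal entry is the largest in every column of `T`, so the argmax lands on the perfect example and the matrix is recovered exactly; with a trained network the posteriors are only approximate and so is the estimate.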
4.4 Digression: noise free Hessians via ReLU
We now present a result of independent interest in the context of label noise. The ReLU activation function appears to be a good fit for an architecture in our noise model, since it brings the particular convenience that the Hessian of the loss does not depend on noise, and hence the local curvature is left unchanged. At the same time, we are assured that backward correction by $T$, or by any arbitrarily bad estimator of the matrix, has no impact on those second-order properties of the loss, something that does not hold for the forward correction. We stress the fact that other activation functions like the sigmoid do not share this guarantee. The proof makes use of the factorization trick due to [30].
Theorem 4
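Assume that all activation functions are ReLUs. (A caveat: $\ell$ must be a linear-odd loss as studied in [30]; cross-entropy and square loss are such. At the same time, we could generalize Theorem 4 to any neural network that expresses a piecewise linear function, including for example max-pooling.) Then, the Hessian of $\ell$ does not change under noise. Moreover, the Hessians of $\ell$ and $\ell^{\leftarrow}$ are the same for any $T$.
Proof. We give the proof for cross-entropy for simplicity; see [30] for a generalization. When $\mathbf{y} = \mathbf{e}^i$ the loss is:
$$\ell(\mathbf{e}^i, \hat{p}(\mathbf{y}|\mathbf{x})) = -W^{(n)}_{i\cdot}\, \mathbf{x}^{(n-1)} - b^{(n)}_i + \log \sum_{k=1}^{c} \exp\big(W^{(n)}_{k\cdot}\, \mathbf{x}^{(n-1)} + b^{(n)}_k\big).$$
The only dependence on the true class above is in the first two terms. The log-partition is independent of the precise class $i$. Evidently, the noise affects the loss only through $W^{(n)}_{i\cdot}\, \mathbf{x}^{(n-1)}$ and $b^{(n)}_i$: those are the only terms in which $\mathbf{y}$ and $\tilde{\mathbf{y}}$ may differ. Therefore we can rewrite the backward corrected loss as:
\begin{align}
\boldsymbol{\ell}^{\leftarrow}(\hat{p}(\mathbf{y}|\mathbf{x})) &= T^{-1} \boldsymbol{\ell}(\hat{p}(\mathbf{y}|\mathbf{x})) \tag{14} \\
&= T^{-1}\Big(-W^{(n)} \mathbf{x}^{(n-1)} - \mathbf{b}^{(n)} + \mathbf{1} \cdot \log \sum_{k=1}^{c} \exp\big(W^{(n)}_{k\cdot}\, \mathbf{x}^{(n-1)} + b^{(n)}_k\big)\Big) \tag{15} \\
&= -T^{-1}\big(W^{(n)} \mathbf{x}^{(n-1)} + \mathbf{b}^{(n)}\big) + \mathbf{1} \cdot \log \sum_{k=1}^{c} \exp\big(W^{(n)}_{k\cdot}\, \mathbf{x}^{(n-1)} + b^{(n)}_k\big). \tag{16}
\end{align}
In fact, note that $T^{-1}$ does not affect the log-partition function. To see this, let $A(\mathbf{x}) = \log \sum_{k=1}^{c} \exp\big(W^{(n)}_{k\cdot}\, \mathbf{x}^{(n-1)} + b^{(n)}_k\big)$, with the (vector) log-partition being $\mathbf{1} \cdot A(\mathbf{x})$. It follows that its correction is $T^{-1} \mathbf{1} \cdot A(\mathbf{x}) = \mathbf{1} \cdot A(\mathbf{x})$, by left-multiplication of $T^{-1}$ and because $T^{-1}\mathbf{1} = \mathbf{1}$, which follows from $T\mathbf{1} = \mathbf{1}$ since $T$ is row-stochastic. Thus $\boldsymbol{\ell}^{\leftarrow} = \boldsymbol{\ell}^{\leftarrow}_{\mathrm{lin}} + \mathbf{1} \cdot A(\mathbf{x})$, where $\boldsymbol{\ell}^{\leftarrow}_{\mathrm{lin}} = -T^{-1}\big(W^{(n)} \mathbf{x}^{(n-1)} + \mathbf{b}^{(n)}\big)$ is a piecewise linear function of the model parameters, and the log-partition is non-linear because of the loss and the architecture but does not depend on noise. Since the composition of piecewise linear functions is piecewise linear, the Hessian of $\boldsymbol{\ell}^{\leftarrow}_{\mathrm{lin}}$ vanishes, and therefore the Hessian of $\boldsymbol{\ell}^{\leftarrow}$ is noise independent for any $T$. The same holds for $\boldsymbol{\ell}$ (no correction) by taking $T = I$, and hence the Hessians are the same.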
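The key step of the proof can be checked numerically at the level of the logits. The sketch below is not a proof of Theorem 4 for a full ReLU network: it only verifies, by finite differences, that the Hessian of the backward-corrected cross-entropy with respect to the final-layer outputs does not depend on the observed label or on the noise matrix. The matrix, the logits and the step size are hypothetical.

```python
import numpy as np

def log_softmax(h):
    z = h - np.max(h)
    return z - np.log(np.sum(np.exp(z)))

def backward_loss(i, h, T):
    """Backward-corrected cross-entropy for observed label i, as a function of logits h."""
    return float(np.linalg.solve(T, -log_softmax(h))[i])

def numerical_hessian(f, h, step=1e-3):
    """Central finite-difference Hessian of a scalar function f at h."""
    n = h.size
    H = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            ea = np.zeros(n)
            eb = np.zeros(n)
            ea[a] = step
            eb[b] = step
            H[a, b] = (f(h + ea + eb) - f(h + ea - eb)
                       - f(h - ea + eb) + f(h - ea - eb)) / (4.0 * step ** 2)
    return H

T = np.array([[0.7, 0.3, 0.0],
              [0.0, 1.0, 0.0],
              [0.2, 0.0, 0.8]])            # hypothetical noise matrix
h = np.array([1.0, -0.5, 0.3])             # hypothetical logits
p = np.exp(log_softmax(h))
H_exact = np.diag(p) - np.outer(p, p)      # Hessian of the log-partition alone

H_plain = numerical_hessian(lambda z: backward_loss(0, z, np.eye(3)), h)
H_backward = [numerical_hessian(lambda z, i=i: backward_loss(i, z, T), h)
              for i in range(3)]
```

All Hessians agree with that of the log-partition because the rows of the inverse noise matrix sum to one, so the correction only changes the term that is linear in the logits.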
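loss  |  correction  |  Hessian of $\ell$
$\ell$ (no correction)  |  no guarantee  |  unchanged
$\ell^{\leftarrow}$ (backward)  |  unbiased estimator of $\ell$  |  unchanged
$\ell^{\rightarrow}$ (forward)  |  same minimizer of $\ell$  |  no guarantee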
Theorem 4 does not provide any assurance on minima: indeed, stationary points may change location due to label noise. What it does guarantee is that the convergence rate of first-order methods is the same: the loss curvature cannot blow up or flatten out; instead, it is the same point by point in the model space. The Theorem advocates for the use of ReLU networks, in line with the recent theoretical breakthrough allowing for deep learning with no local minima [16]. Table 1 summarizes the properties of loss correction.
5 Experiments
We now test the theory on various deep neural networks trained on MNIST [20], IMDB [23], CIFAR10, CIFAR100 [18] and Clothing1M [42] so as to stress that our approach is independent of both architecture and data domain.
5.1 Loss corrections with known or estimated $T$
We artificially corrupt labels by a parametric matrix $T$. The rationale is to mimic some of the structure of real mistakes for similar classes, e.g. cat → dog. Transitions are parameterized by $N \in [0, 1]$ such that ground truth and wrong class have probability $1 - N$ and $N$ respectively. An example of $T$ used for MNIST is on the left:
(17) 
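A parametric noise matrix of this kind can be built as below. This is a sketch under the assumption that a single noise level `N` moves probability mass `N` from the true class to one designated wrong class; the class pairs and the value of `N` are hypothetical and are not the ones used in the experiments.

```python
import numpy as np

def pair_flip_matrix(c, flips, N):
    """Noise matrix sending class i to class j with probability N for each (i, j) in flips."""
    T = np.eye(c)
    for i, j in flips:
        T[i, i] = 1.0 - N      # the ground-truth class keeps probability 1 - N
        T[i, j] = N            # the designated wrong class receives probability N
    return T

# Hypothetical example: 4 classes, flips 0 -> 1 and 3 -> 2, noise level N = 0.3.
T = pair_flip_matrix(4, [(0, 1), (3, 2)], N=0.3)
```

Classes that are not listed in `flips` keep an identity row, so the resulting matrix is row-stochastic and generally not symmetric.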
Common to all experiments is what follows. The loss chosen for comparison is cross-entropy. A portion of the training data is held out for validation, and the loss is evaluated on it during training. With the corrected losses we can validate on noisy data, which is advantageous over other approaches that measure noisy validation accuracy instead. The available standard test sets are used for testing. We use ReLU for all networks and initialize weights prior to ReLUs as in [10], otherwise by uniform sampling in . The mini-batch size is . The estimator of $T$ from noisy labels is applied to $X'$ being training and validation sets together. In fact, preliminary experiments highlighted that the large size noticeably improves the approximation of $T$; after estimation, we row-normalize the matrix. Following [26], we take a percentile in place of the $\arg\max$ of Equation 12, and we found to work well for most experiments; the estimator performs very poorly with CIFAR100, possibly due to the small number of images per class, and we found it is better off computing the $\arg\max$ instead.
Fully connected network on MNIST. In the first set of experiments we consider MNIST. Pixels are normalized in . Noise flips some of the similar digits: ; see Equation (17, left). We train an architecture with two dense hidden layers of size , with probability of dropout. AdaGrad [4] is run for 40 epochs with initial learning rate and . We repeat each experiment 5 times to account for noise and weight initialization. It is clear from Figure 0(c) that, although the model is somewhat robust to mild noise, a high level of corruption has a disrupting effect on $\ell$. Instead, our losses do not witness a drastic drop. With the estimated $\hat{T}$, performance lies in between, yet it is significantly better than with no correction. An example of $\hat{T}$ is in Equation (17, right), with .

Word embedding and LSTM on IMDB. We keep only the top 5000 most frequent words in the corpus. Each review is either truncated or padded to be 400 words long. To simulate asymmetric noise in this binary problem, we keep constant noise for the transition at , while is parameterized as above; are the two sentiments of a review. We trained two models inspired by the baselines of [2]. The first maps words into dimensional embeddings, before passing through ReLUs; dropout with probability is applied to the embedding output. In the second model the embedding has dimension and it is followed by an LSTM with units and by a last dimensional hidden layer with dropout. AdaGrad is run for 50 epochs with the same setup as above; results are averages over 5 runs. Figures 0(c)0(c) display an outcome similar to what was previously observed on MNIST, in spite of differences in dataset, number of classes, architecture and structure of $T$. Notably, our approach is effective on recurrent networks as well. Correcting with $\hat{T}$ is in line with the true $T$ here; we believe this is because estimation is easier on this binary problem.


Residual networks on CIFAR10 and CIFAR100. For both datasets we perform per-pixel mean subtraction and data augmentation as in [11], by horizontal random flips and random crops after padding with 4 pixels on each side. $T$ for CIFAR10 is described by: truck → automobile, bird → airplane, deer → horse, cat ↔ dog. In CIFAR100, the 100 classes are grouped into 20 superclasses of size 5, e.g. aquatic mammals contain beaver, dolphin, otter, seal and whale. Within superclasses, the noise flips each class into the next, circularly.
For the last experiments we use deep residual networks (ResNet), the CIFAR10/100 architectures from [11]. In short, residual blocks implement a non-linear operation in parallel with an identity shortcut. The non-linear operation is a cascade of two repetitions of batch normalization → ReLU → convolution, following the “pre-activation” recommendation of [12]. Here we experiment with ResNets of depth 14 and 32 (CIFAR10) and 44 (CIFAR100). By common practice [14], we run SGD with momentum and learning rate , and divide it by after and epoch ( in total) for CIFAR10 and after and () for CIFAR100; weight decay is . Training deep ResNets is more time consuming and thus experiments are run only once. Since we use shallower networks than the ones in [11], performance is not comparable with the original work. In figures 0(f)0(f), forward correction does not suffer any significant loss. Except with the shallowest ResNet, backward correction does not seem to work well in the low noise regime. Finally, noise estimation is particularly difficult on CIFAR100.
5.2 Comparing with other loss functions
Table 2. Left part: MNIST, fully connected; IMDB, word embedding; IMDB, word embedding + LSTM. Right part: CIFAR10, 14-layer ResNet; CIFAR10, 32-layer ResNet; CIFAR100, 44-layer ResNet. Columns for each dataset: no noise  |  symm.  |  asymm.  |  asymm. Rows for each dataset: cross-entropy, unhinged (BN), sigmoid (BN), Savage, bootstrap soft, bootstrap hard, backward (two rows), forward (two rows).

Average accuracy with standard deviation (5 runs, left part) is bold faced when statistically far from the others, by means of passing a Welch’s t-test with $p$-value ; in case the highest accuracy is due to $\ell^{\leftarrow}$ or $\ell^{\rightarrow}$ with the ground truth $T$, we denote those by and highlight the next highest accuracy as well. For experiments with no standard deviation (right part), the same rule is applied, but bold face is given to all the accuracies in a range of points from the highest. The meaning of $N$ depends on symmetric vs. asymmetric noise and on number of classes (see Section 5.1). On the first columns with no injected noise, indicates when the noise estimation recovers some natural noise and beats “loss correction” with .

We now compare with other methods. Data, architectures and artificial noise are the same as above. Additionally, we test the case of symmetric noise where $N$ is the probability of label flip that is spread uniformly among all the other classes. We select methods prescribing changes in the loss function, similarly to ours: unhinged [41], sigmoid [8], Savage [24] and soft and hard bootstrapping [32]; hyper-parameters of the last two methods are set in accordance with their paper.
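The symmetric noise described above corresponds to a particularly simple transition matrix. A minimal sketch, with a hypothetical number of classes and noise level:

```python
import numpy as np

def symmetric_noise_matrix(c, N):
    """Keep a label with probability 1 - N; spread N uniformly over the other c - 1 classes."""
    T = np.full((c, c), N / (c - 1))
    np.fill_diagonal(T, 1.0 - N)
    return T

T = symmetric_noise_matrix(10, N=0.2)     # N = 0.2 is an arbitrary illustration
```

Unlike the asymmetric matrices of Section 5.1, this matrix is symmetric and treats all classes alike.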
Unhinged loss is unbounded and cannot be used alone. In the original work, regularization is applied to address the problem when training non-parametric kernel models. We tried to regularize every layer with little success; learning either does not converge (too little regularization) or converges to very poor solutions (too much). In preliminary experiments sigmoid loss ran into the opposite issue, namely premature saturation; the loss reaches a plateau too quickly, a well-known problem with sigmoidal activation functions [9]. To make those losses usable for comparison, we stack a layer of batch normalization right before the loss function. Essentially, the network outputs are whitened and likely to operate in a bounded, non-saturated area of the loss; note that this is never required for linear or kernel models.
Table 2 presents the empirical analysis. We list the key findings: (a) In the absence of artificial noise (first column for each dataset), all losses reach similar accuracies with a spread of points; exceptions are some instances of unhinged, sigmoid and Savage. Additionally, with IMDB there are cases ( in Table 2) of loss correction with noise estimation that perform slightly better than assuming no noise. Clearly, the estimator is able to recover the natural noise in the sentiment reviews. (b) With low symmetric noise (second column) results differ between simple architectures/tasks (datasets on the left) and deep networks/more difficult problems (right); in the former case, the two corrections behave similarly and are not statistically far from the competitors; in the latter case, forward correction with known $T$ is unbeaten, with no clear winner among the remaining ones. (c) With asymmetric noise (last two columns) the two loss corrections with known $T$ are overall the best performing, confirming the practical implications of their formal guarantees; forward is usually the best. (d) If we exclude CIFAR100, the noise estimation accounts for average accuracy drops between (IMDB with LSTM model) and points (MNIST); nevertheless, our performance is better than every other method on many occasions. (e) In the experiment on CIFAR100 we obtain essentially perfect noise robustness with the ideal forward correction. The noise estimation works well except in the very last column, yet it again guarantees better accuracy than competing methods. We discuss this issue in Section 6.
5.3 Experiments on Clothing1M
Clothing1M
#  |  model  |  loss  |  init  |  training  |  accuracy
1  |  AlexNet  |  cross.  |  ImageNet  |
2  |  AlexNet [39]  |  cross.  |
3  |  AlexNet [42]  |  cross.  |
4  |  50-ResNet  |  cross.  |  ImageNet  |
5  |  50-ResNet  |  backward  |  ImageNet  |
6  |  50-ResNet  |  forward  |  ImageNet  |
7  |  50-ResNet  |  cross.  |  ImageNet  |
8  |  50-ResNet  |  cross.  |
Finally, we test on Clothing1M [42], consisting of 1M images with noisy labels, with additional of clean data respectively for training, validation and testing; we refer to those sets by their size. We aim to classify images within 14 classes, e.g. t-shirt, suit, vest. In the original work two AlexNets [19] are trained together via EM; the networks are pre-trained with ImageNet. Two practical tricks are fundamental: a first learning phase with the clean to help EM ( in Table 3) and a second phase with the mix of bootstrapped to and (). Data augmentation is also applied, same as in Section 5.1 for CIFAR10.
We learn a 50-layer ResNet pre-trained on ImageNet, the bottleneck architecture of [11], with SGD with learning rate and for 5 epochs each, momentum, and batch size . When we train with we use weight decay of and data augmentation, while with we use only weight decay of . The ResNet gives an uplift of about by training with only ( vs. ). However, the large amount of noisy images is essential to compete with . Instead of estimating the matrix by (12)-(13), we exploit the curated labels of and their noisy versions in . Forward and backward corrections are confirmed to work better than cross-entropy ( vs. ), yet cannot reach the state of the art without the additional clean data. Thus, we fine-tune the networks with , with the same learning parameters as in ; for reasons of time we only tune . The new state of the art is that outperforms [42] by more than percent, which is achieved without time-consuming bootstrapping of the .
6 Discussion and Conclusion
We have proposed a framework for training deep neural networks with noisy labels that boils down to two loss corrections. Accuracy is consistently only a few percentage points away from training cross-entropy on clean data, while corruption can worsen the performance of cross-entropy by percent or more. Forward correction often performs better. We believe the reason is not statistical, since Theorems 1 and 2 guarantee optimality in the limit of infinite data. The cause may be either numerical (via matrix inversion) or a drastic change of the loss (in particular its Hessian), which may have a detrimental effect on optimization. Indeed, backward correction is a linear combination of losses for every possible label, with coefficients that can differ by orders of magnitude, which makes learning harder. Instead, forward correction projects predictions into a probability distribution in .
The quality of noise estimation is a key factor for obtaining robustness. In practice, it works well in most experiments, with a median drop of only points of accuracy with respect to using the true . The exception is the last column for CIFAR-100, where estimation destroys most of the gain from loss correction. We believe that the mix of high noise and a limited number of images per class (500) is detrimental to the estimator. This is confirmed by the sensitivity of .
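The contrast between the two corrections can be made concrete with a small NumPy sketch. It assumes a row-stochastic noise matrix `T` with `T[i, j]` the probability that clean class `i` is observed as noisy class `j`, cross-entropy as the base loss, and an invertible `T`; the function names are hypothetical and this is an illustration, not the implementation used in the experiments. Backward correction multiplies the vector of per-class losses by the inverse of `T`, while forward correction pushes the predicted probabilities through `T` before taking the loss.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def forward_corrected_ce(probs, noisy_labels, T):
    """Cross-entropy on the predicted *noisy* label distribution probs @ T."""
    noisy_probs = probs @ T  # q_j = sum_i p_i * T[i, j]
    rows = np.arange(len(noisy_labels))
    return -np.log(noisy_probs[rows, noisy_labels])


def backward_corrected_ce(probs, noisy_labels, T):
    """Linear combination of the losses for every label, weighted by inv(T)."""
    per_class_loss = -np.log(probs)      # loss for each possible label
    T_inv = np.linalg.inv(T)             # assumes T is invertible
    corrected = per_class_loss @ T_inv.T # each row equals T_inv @ loss vector
    rows = np.arange(len(noisy_labels))
    return corrected[rows, noisy_labels]


if __name__ == "__main__":
    T = np.array([[0.8, 0.2, 0.0],
                  [0.1, 0.7, 0.2],
                  [0.0, 0.3, 0.7]])
    probs = softmax(np.array([[2.0, 0.5, -1.0]]))
    noisy = np.array([1])
    print("forward :", forward_corrected_ce(probs, noisy, T))
    print("backward:", backward_corrected_ce(probs, noisy, T))
    # The inverse has entries of mixed sign and larger magnitude than T.
    print("inv(T):\n", np.linalg.inv(T))
```

Averaging the backward-corrected loss over the noisy labels, weighted by the corresponding row of `T`, recovers the clean cross-entropy exactly. The entries of `inv(T)` can be negative and grow as `T` approaches singularity, which is one way to see the numerical concern raised above.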
Future work shall improve the estimation phase by incorporating priors on the noise structure, for example assuming low rank . Improvements in this direction may also widen the applicability to massively multiclass scenarios. It remains an open question whether instance-dependent noise may be incorporated into our approach [42, 25]. Finally, we anticipate the use of our approach as a tool for pretraining models with noisy data from the Web, in the spirit of [17].
References
 [1] F. Chollet. Keras. github.com/fchollet/keras.
 [2] A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In NIPS*29, 2015.
 [3] S. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In 27 IEEE CVPR, 2014.
 [4] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, 2011.
 [5] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from internet image searches. Proceedings of the IEEE, 98(8):1453–1466, 2010.
 [6] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In NIPS*29, 2015.
 [7] B. Frénay and M. Verleysen. Classification in the Presence of Label Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, May 2014.
 [8] A. Ghosh, N. Manwani, and P. S. Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 2015.
 [9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 29 IEEE CVPR, 2016.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In 14 ECCV, 2016.
 [13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [14] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep networks with stochastic depth. In 14 ECCV, 2016.
 [15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In 32 ICML, 2015.
 [16] K. Kawaguchi. Deep learning without poor local minima. In NIPS*30, 2016.
 [17] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei. The unreasonable effectiveness of noisy data for fine-grained recognition. In 14 ECCV, 2016.
 [18] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS*26, 2012.
 [20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [21] T. Liu and D. Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on PAMI, 38(3):447–461, 2016.
 [22] P. M. Long and R. A. Servedio. Random classification noise defeats all convex potential boosters. Machine learning, 78(3):287–304, 2010.
 [23] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In 49 ACL, 2011.
 [24] H. Masnadi-Shirazi and N. Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost. In NIPS*23, 2009.
 [25] A. Menon, B. van Rooyen, and N. Natarajan. Learning from binary labels with instance-dependent corruption. arXiv preprint arXiv:1605.00751, 2016.
 [26] A. Menon, B. van Rooyen, C. S. Ong, and B. Williamson. Learning from corrupted binary labels via class-probability estimation. In 32 ICML, 2015.
 [27] V. Mnih and G. E. Hinton. Learning to label aerial images from noisy data. In 29 ICML, 2012.
 [28] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In NIPS*27, 2013.
 [29] L. Niu, W. Li, and D. Xu. Visual recognition by learning from web data: A weakly supervised domain generalization approach. In 28 IEEE CVPR, 2015.
 [30] G. Patrini, F. Nielsen, R. Nock, and M. Carioni. Loss factorization, weakly supervised learning and label noise robustness. In 33 ICML, 2016.
 [31] H. G. Ramaswamy, C. Scott, and A. Tewari. Mixture proportion estimation via kernel embedding of distributions. In 33 ICML, 2016.
 [32] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
 [33] M. D. Reid and R. C. Williamson. Composite binary losses. JMLR, 11:2387–2422, 2010.
 [34] T. Sanderson and C. C. Scott. Class proportion estimation with application to multiclass anomaly rejection. In AISTATS, 2014.
 [35] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. IEEE Transactions on PAMI, 33(4):754–766, 2011.
 [36] C. Scott, G. Blanchard, and G. Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In 26 COLT, 2013.
 [37] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
 [38] G. Stempfel and L. Ralaivola. Learning SVMs from sloppily labeled data. In Artificial Neural Networks (ICANN), pages 884–893. Springer, 2009.
 [39] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus. Training convolutional networks with noisy labels. In ICLR Workshops, 2015.
 [40] B. van Rooyen. Machine Learning via Transitions. PhD thesis, The Australian National University, 2015.
 [41] B. van Rooyen, A. K. Menon, and R. C. Williamson. Learning with symmetric label noise: The importance of being unhinged. In NIPS*29, 2015.
 [42] T. Xiao, T. Xia, T. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In 28 IEEE CVPR, 2015.