1 Introduction
Machine-learned models can only be as good as the data they are trained on. With increasingly large modern datasets, and with automated or indirect labels such as clicks, it is ever more important to develop effective techniques for handling noisy labels.
We revisit the classical method of filtering out suspicious training examples using nearest neighbors (NN) (Wilson, 1972). Like Papernot and McDaniel (2018), we apply k-NN on the learned intermediate representation of a preliminary model, which provides a supervised distance for computing nearest neighbors and identifying suspicious examples to filter. NN methods have recently received renewed attention for their usefulness (Wang et al., 2018; Reeve and Kaban, 2019), and here, like Jiang et al. (2018), we use k-NN as an auxiliary method to improve modern deep learning.
The main contributions of this paper are:

Experimentally showing that k-NN executed on an intermediate layer of a preliminary deep model to filter suspiciously-labeled examples works as well as or better than state-of-the-art methods for handling noisy labels, and is robust to the choice of k.

Theoretically showing that, asymptotically, k-NN's predictions will identify a training example as clean only if its label is the Bayes-optimal label. We also provide finite-sample analysis in terms of the margin and how spread out the corrupted labels are (Theorem 1), rates of convergence for the margin (Theorem 2), and rates under Tsybakov's noise condition (Theorem 3), with all rates matching minimax-optimal rates in the noiseless setting.
Our work shows that even though the preliminary neural network is trained with corrupted labels, it still yields intermediate representations that are useful for k-NN filtering. After identifying examples whose labels disagree with their neighbors, one can either automatically remove them and retrain on the remainder, or send them to a human operator for further review. This strategy is also useful in human-in-the-loop systems, where one can warn the human annotator that a label is suspicious and automatically propose new labels based on its nearest neighbors' labels.

In addition to strong empirical performance, deep k-NN filtering has a couple of advantages. First, many methods require a clean set of samples whose labels can be trusted; here we show that the k-NN-based method is effective both in the presence and absence of such a clean set. Second, while k-NN does introduce the hyperparameter k, we will show that deep k-NN filtering is stable to the choice of k. Such robustness to hyperparameters is highly desirable because optimal tuning for this problem is often not available in practice (i.e., when no clean validation set is available).

2 Related Work
We review relevant prior work on training with noisy labels in Sec. 2.1, and related k-NN theory in Sec. 2.2.
2.1 Training with noisy labels
Methods to handle label noise can be classified into two main strategies: (i) explicitly identify and remove the noisy examples, and (ii) indirectly handle the noise with robust training methods.
Data Cleaning: The proposed deep k-NN filtering fits into the broad family of data cleaning methods, in that it detects and filters dirty data (Chu et al., 2016). Using NN to “edit” training data has been popular since Wilson (1972) used it to discard training examples that were not consistent with their nearest neighbors. The idea of using a preliminary model to help identify mislabeled examples dates to at least Guyon et al. (1994), who proposed using the model to compute an information gain for each example and then treating examples with high gain as suspect. Other early work used a cross-validation setup: train a classifier on part of the data, use it to predict the held-out training examples, and remove any example whose prediction disagrees with its label (Brodley and Friedl, 1999).
Noise Corruption Estimation: For multiclass problems, a popular approach is to account for noisy labels by applying a confusion matrix after the model's softmax layer (Sukhbaatar et al., 2014). Such methods rely on a confusion matrix that is often unknown and must be estimated.
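As a hedged sketch (not any particular paper's implementation), applying a confusion matrix T, with T[i, j] = P(noisy label j | true label i), after the softmax looks like:

```python
import numpy as np

def forward_corrected_nll(softmax_probs, noisy_label, T):
    """Negative log-likelihood of the noisy label under the corrupted
    distribution q = T^T p, where p is the model's softmax output and
    T[i, j] = P(noisy label j | true label i)."""
    q = T.T @ softmax_probs          # distribution over noisy labels
    return -np.log(q[noisy_label])

# Toy check: with an identity confusion matrix this reduces to cross-entropy.
p = np.array([0.7, 0.2, 0.1])
T_identity = np.eye(3)
loss_clean = forward_corrected_nll(p, 0, T_identity)   # == -log(0.7)

# With 20% symmetric noise, probability mass leaks to the other labels.
T_noisy = 0.8 * np.eye(3) + 0.1 * (np.ones((3, 3)) - np.eye(3))
loss_noisy = forward_corrected_nll(p, 0, T_noisy)
```

During training, this corrected loss is minimized instead of the plain cross-entropy, so the model can fit the clean conditional distribution even though it only sees noisy labels.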
Patrini et al. (2017) suggest deriving it from the softmax distribution of the model trained on noisy data, but there are other alternatives (Goldberger and Ben-Reuven, 2016; Jindal et al., 2016; Han et al., 2018). Accurate estimates are generally hard to attain when only untrusted data is available; Hendrycks et al. (2018) achieve more accurate estimates in the setting where some amount of known clean, trusted data is available. EM-style algorithms have also been proposed to estimate the clean label distribution (Xiao et al., 2015; Khetan et al., 2017; Vahdat, 2017).

Noise-Robust Training: Natarajan et al. (2013) propose a method to make any surrogate loss function noise-robust given knowledge of the corruption rates. Ghosh et al. (2017) prove that losses like mean absolute error (MAE) are inherently robust under symmetric or uniform label noise, while Zhang and Sabuncu (2018) show that training with MAE results in poor convergence and accuracy; they propose a new loss function based on the negative Box-Cox transformation that trades off the noise-robustness of MAE against the training efficiency of cross-entropy. Lastly, the ramp, unhinged, and savage losses have been proposed and theoretically justified as noise-robust for support vector machines (Brooks, 2011; Van Rooyen et al., 2015; Masnadi-Shirazi and Vasconcelos, 2009). Amid et al. (2019) construct a noise-robust “bi-tempered” loss by introducing temperatures into the exponential and logarithm functions. Rolnick et al. (2017) empirically show that deep learning models are robust to noise when there are enough correctly labeled examples and the model capacity and training batch size are sufficiently large. Thulasidasan et al. (2019) propose a new loss that enables the model to abstain from confusing examples during training.

Auxiliary Models: Veit et al. (2017) propose learning a label cleaning network on trusted data by predicting the differences between clean and noisy labels. Li et al. (2017) suggest training on a weighted average of the noisy labels and the distilled predictions of an auxiliary model trained on trusted data. Given a model pre-trained on noisy data, Lee et al. (2019) boost its generalization on clean data by inducing a generative classifier on top of the model's hidden features, under the assumption that the hidden features follow class-specific Gaussian distributions.

Example Weighting: Here we make a hard decision about whether to keep a training example, but one can also adapt the weights on training examples based on confidence in their labels. Liu and Tao (2015) provide an importance-weighting scheme for binary classification. Ren et al. (2018) suggest upweighting examples whose loss gradient is aligned with those of trusted examples at every training step. Jiang et al. (2017) investigate a recurrent network that learns a sample-weighting scheme for the base model.
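As a concrete illustration of the noise-robust losses discussed above, here is a minimal sketch of the generalized cross-entropy (negative Box-Cox) loss of Zhang and Sabuncu (2018); the default q = 0.7 here is an illustrative choice on our part:

```python
import numpy as np

def generalized_cross_entropy(p_y, q=0.7):
    """Negative Box-Cox loss (1 - p_y**q) / q on the softmax probability
    p_y assigned to the labeled class. As q -> 0 it recovers cross-entropy;
    q = 1 gives a (shifted) mean absolute error."""
    return (1.0 - p_y ** q) / q

# As q shrinks, the loss approaches -log(p_y) (cross-entropy).
p_y = 0.4
ce = -np.log(p_y)
approx_ce = generalized_cross_entropy(p_y, q=1e-6)
mae_like = generalized_cross_entropy(p_y, q=1.0)   # equals 1 - p_y
```

The single parameter q thus interpolates between the fast-training but noise-sensitive cross-entropy and the noise-robust but slow-training MAE.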
2.2 k-nearest neighbor theory
The theory of nearest neighbor classification has a long history (see, e.g., Fix and Hodges Jr (1951); Cover (1968); Stone (1977); Devroye et al. (1994); Chaudhuri and Dasgupta (2014)). Much of the prior work focuses on NN's statistical consistency properties. However, with the growing interest in adversarial examples and learning with noisy labels, there have recently been analyses of nearest neighbor methods in these settings. Wang et al. (2018) analyze the robustness of NN classification and provide a robust variant, where their notion of robustness is that predictions of nearby points should be similar. Gao et al. (2016) analyze the k-NN classifier under noisy labels and, like us, show that k-NN can attain similar rates in the noisy setting as in the noiseless setting. However, Gao et al. (2016) assume a noise model where labels are corrupted uniformly at random, while we allow an arbitrary corruption pattern and provide results based on a notion of how spread out the corrupted points are. Moreover, we provide finite-sample bounds, borrowing recent advances in k-NN convergence theory in the noiseless setting (Jiang, 2019), while the guarantees of Gao et al. (2016) are asymptotic. Reeve and Kaban (2019) provide stronger guarantees for a robust modification of k-NN proposed by Gao et al. (2016). To the best of our knowledge, we provide the first finite-sample rates of consistency for the classical k-NN method in the noisy setting under very mild assumptions on the label noise.
3 Deep k-NN Algorithm
Recall the standard k-nearest neighbor classifier:
Definition 1 (k-NN).
Given training points $x_1, \ldots, x_n$ with labels $y_1, \ldots, y_n \in \{0, 1\}$, let the k-NN radius of $x$ be $r_k(x) := \inf\{r > 0 : |B(x, r) \cap \{x_1, \ldots, x_n\}| \ge k\}$, where $B(x, r) := \{x' : |x - x'| \le r\}$, and let the k-NN set of $x$ be $N_k(x) := B(x, r_k(x)) \cap \{x_1, \ldots, x_n\}$. Then for all $x$, the k-NN classifier has discriminant function $\hat{\eta}_k(x) := \frac{1}{|N_k(x)|} \sum_{i : x_i \in N_k(x)} y_i$, with prediction $\mathbf{1}[\hat{\eta}_k(x) \ge 1/2]$.
Our method is detailed in Algorithm 1. It takes a dataset of examples with potentially noisy labels (the untrusted set), along with a dataset of examples with clean or trusted labels (the trusted set). Note that we allow the trusted set to be empty (i.e., for instances where no such trusted data is available). We also take as given a model architecture (e.g., a 2-layer DNN with 20 hidden nodes). Algorithm 1 uses the following subroutine:
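A minimal sketch of the filtering step, assuming embeddings from an intermediate layer (e.g. the logit layer) of the preliminary model have already been computed; the function name and the brute-force distance computation are ours, for illustration only:

```python
import numpy as np

def filter_by_knn(embeddings, labels, k=5):
    """Keep an example only if its label agrees with the majority label of
    its k nearest neighbors in the embedding (e.g. logit-layer) space.
    Returns a boolean mask over the training set."""
    n = len(labels)
    keep = np.zeros(n, dtype=bool)
    # Pairwise Euclidean distances (fine for a sketch; use a KD-tree at scale).
    d = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    for i in range(n):
        # k nearest neighbors of example i, excluding i itself.
        nbrs = np.argsort(d[i])[1:k + 1]
        votes = np.bincount(labels[nbrs])
        keep[i] = votes.argmax() == labels[i]
    return keep

# Toy data: two well-separated clusters, one mislabeled point in cluster 0.
emb = np.array([[0., 0.], [0.1, 0.], [0., 0.1], [0.1, 0.1],
                [5., 5.], [5.1, 5.], [5., 5.1], [5.1, 5.1]])
lab = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # index 3 is suspicious
mask = filter_by_knn(emb, lab, k=3)
```

One would then retrain the model on the examples where `mask` is true, or flag the rest for human review.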
Filtering Model Train Set Selection Procedure: To determine whether the model used to compute the k-NN should be trained on the trusted set alone or on the trusted and untrusted sets together, we split the trusted set 70/30 into a training portion and a held-out portion, and train two preliminary models: one on the trusted training portion alone, and one on that portion together with the untrusted set. If the model trained with the untrusted data performs better on the held-out portion, then output the combined set; otherwise output the trusted training portion alone.
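This subroutine can be sketched as follows, where `train` and `evaluate` are user-supplied stand-ins for model fitting and held-out accuracy (our names, not the paper's):

```python
import numpy as np

def select_filter_train_set(trusted, untrusted, train, evaluate, seed=0):
    """Pick the data used to train the preliminary filtering model.
    `trusted`/`untrusted` are (X, y) pairs; `train(X, y)` returns a model
    and `evaluate(model, X, y)` returns held-out accuracy. The trusted set
    is split 70/30; we train once on the 70% trusted split alone and once
    on that split plus all untrusted data, and keep whichever data choice
    scores better on the held-out 30%."""
    X_t, y_t = trusted
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_t))
    cut = int(0.7 * len(y_t))
    X_tr, y_tr = X_t[idx[:cut]], y_t[idx[:cut]]
    X_ho, y_ho = X_t[idx[cut:]], y_t[idx[cut:]]
    X_u, y_u = untrusted
    X_all = np.concatenate([X_tr, X_u])
    y_all = np.concatenate([y_tr, y_u])
    acc_trusted = evaluate(train(X_tr, y_tr), X_ho, y_ho)
    acc_both = evaluate(train(X_all, y_all), X_ho, y_ho)
    return (X_all, y_all) if acc_both >= acc_trusted else (X_tr, y_tr)

# Toy check with stub routines: the "model" just records its training size,
# and evaluation pretends that training on the extra (noisy) data hurts.
train = lambda X, y: len(y)
evaluate = lambda model, X, y: 1.0 if model == 14 else 0.5
X_t, y_t = np.zeros((20, 2)), np.arange(20) % 2
X_u, y_u = np.ones((5, 2)), np.zeros(5, dtype=int)
chosen_X, chosen_y = select_filter_train_set((X_t, y_t), (X_u, y_u), train, evaluate)
# chosen_y has 14 labels: the 70% trusted split alone was selected.
```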
4 Theoretical Analysis
For the theoretical analysis, we assume a binary classification problem with features defined on a compact set $\mathcal{X} \subseteq \mathbb{R}^D$. We assume points are drawn according to a distribution $\mathcal{D}$ as follows: the features come from a distribution $\mathcal{P}_X$ on $\mathcal{X}$, and the labels are distributed according to a measurable conditional probability function $\eta(x) := \mathbb{P}(y = 1 \mid x)$. That is, a sample $(x, y)$ is drawn from $\mathcal{D}$ by drawing $x$ according to $\mathcal{P}_X$ and then choosing $y \sim \mathrm{Bernoulli}(\eta(x))$.

The goal is to show that, given corrupted examples, the k-NN disagreement method is still able to identify the examples whose labels do not match the Bayes-optimal label.
We will make a few regularity assumptions for our analysis to hold. The first regularity assumption ensures that the support does not become arbitrarily thin anywhere. This is a standard nonparametric assumption (e.g. (Singh et al., 2009; Jiang, 2019)).
Assumption 1 (Support Regularity).
There exist $\lambda > 0$ and $r_0 > 0$ such that $\mathrm{Vol}(B(x, r) \cap \mathcal{X}) \ge \lambda \cdot \mathrm{Vol}(B(x, r))$ for all $x \in \mathcal{X}$ and $0 < r \le r_0$, where $B(x, r) := \{x' \in \mathbb{R}^D : |x - x'| \le r\}$.
Let $p_X$ denote the density of the feature distribution. The next assumption ensures that, with a sufficiently large sample, we obtain a good covering of the input space.
Assumption 2 ($p_X$ bounded from below).
$p_{X,0} := \inf_{x \in \mathcal{X}} p_X(x) > 0$.
Finally, we make a smoothness assumption on $\eta$, as done in other analyses of k-NN classification (e.g., Chaudhuri and Dasgupta (2014); Reeve and Kaban (2019)).
Assumption 3 ($\eta$ Hölder continuous).
There exist $0 < \alpha \le 1$ and $C_\alpha > 0$ such that $|\eta(x) - \eta(x')| \le C_\alpha |x - x'|^\alpha$ for all $x, x' \in \mathcal{X}$.
We propose a notion of how spread out a set of points is, based on the minimum pairwise distance between the points. This quantity will appear in the finite-sample bounds we present. Intuitively, the more spread out a contaminated set of points is, the fewer clean samples are needed to overcome the contamination of that set.
Definition 2 (Minimum pairwise distance).
For a finite set of points $C \subseteq \mathbb{R}^D$, define $\gamma(C) := \min_{x, x' \in C,\, x \ne x'} |x - x'|$.
We also define the interior region of $\mathcal{X}$ where there is at least a $\Delta$ margin in the probabilistic label:

Definition 3.
Let $\Delta > 0$. Define $\mathcal{X}_\Delta := \{x \in \mathcal{X} : |\eta(x) - 1/2| \ge \Delta\}$.
We now state the result, which says that, with high probability, uniformly over the margin region (when $\Delta$ is known), a training example's label disagrees with the k-NN classifier if and only if that label is not the Bayes-optimal prediction. Due to space, all proofs are deferred to the Appendix.
Theorem 1 (Fixed $\Delta$).
Let $0 < \delta < 1$ and suppose Assumptions 1, 2, and 3 hold. There exist constants depending only on $\lambda$, $r_0$, $p_{X,0}$, $C_\alpha$, $\alpha$, and $D$ such that the following holds with probability at least $1 - \delta$. Let $X$ be $n$ (uncorrupted) examples drawn from $\mathcal{D}$, let $C$ be a set of points with corrupted labels, and denote our sample by $X \cup C$. Suppose $k$ lies in the following range:

Then the following holds uniformly over sample points in the margin region $\mathcal{X}_\Delta$ of Definition 3: the k-NN prediction computed w.r.t. $X \cup C$ agrees with the example's label if and only if that label is the Bayes-optimal label $\mathbf{1}[\eta(x) \ge 1/2]$.
In the last result, we assumed that $\Delta$ was fixed. We next show a similar guarantee in which $\Delta \to 0$ as $n \to \infty$ when $k$ is chosen appropriately, and provide rates of convergence.
Theorem 2 (Rates of convergence for $\Delta$).
Let $0 < \delta < 1$ and suppose Assumptions 1, 2, and 3 hold. There exist constants depending only on $\lambda$, $r_0$, $p_{X,0}$, $C_\alpha$, $\alpha$, and $D$ such that the following holds with probability at least $1 - \delta$. Let $X$ be $n$ (uncorrupted) examples drawn from $\mathcal{D}$, let $C$ be a set of points with corrupted labels, and denote our sample by $X \cup C$. Suppose $k$ lies in the following range:

Then the following holds uniformly over sample points in $\mathcal{X}_\Delta$: the k-NN prediction computed w.r.t. $X \cup C$ agrees with the example's label if and only if that label is the Bayes-optimal label, where $\Delta$ satisfies:
Remark 1.
Choosing $k$ appropriately in the above result yields, up to logarithmic factors, the minimax-optimal rate for $\Delta$ for nearest neighbor classification given a sample of size $n$ in the uncorrupted setting (Chaudhuri and Dasgupta, 2014). Thus, our analysis is tight up to logarithmic factors.
We next give results under an additional margin assumption, also known as Tsybakov's noise condition (Mammen and Tsybakov, 1999; Tsybakov, 2004):
Assumption 4 (Tsybakov Noise Condition).
There exist $\beta \ge 0$ and $C_\beta > 0$ such that for all $t > 0$, $\mathcal{P}_X\left(|\eta(x) - 1/2| \le t\right) \le C_\beta t^\beta$.
Theorem 3 (Rates under Tsybakov Noise Condition).
Let $0 < \delta < 1$ and suppose Assumptions 1, 2, 3, and 4 hold. There exist constants depending only on $\lambda$, $r_0$, $p_{X,0}$, $C_\alpha$, $\alpha$, $C_\beta$, $\beta$, and $D$ such that the following holds with probability at least $1 - \delta$. Let $X$ be $n$ (uncorrupted) examples drawn from $\mathcal{D}$, let $C$ be a set of points with corrupted labels, and denote our sample by $X \cup C$. Suppose $k$ lies in the following range:

Then the excess risk $R - R^*$ is bounded accordingly, where $R$ and $R^*$ denote the risk of the k-NN method and of the Bayes-optimal classifier, respectively.
Remark 2.
Choosing $k$ appropriately in the above yields an excess-risk rate matching the lower bounds of (Audibert et al., 2007) up to logarithmic factors.
4.1 Impact of Minimum Pairwise Distance
The minimum pairwise distance across corrupted samples, $\gamma(C)$, is a key quantity in the theory presented in the previous section. We now empirically study its significance on a simulated binary classification task in two dimensions. Clean samples with label $y$ are generated by sampling i.i.d. from a class-conditional Gaussian, and the Bayes decision boundary is a line. We take 100 samples uniformly spaced on a square grid and corrupt them by flipping their true labels. With this construction, the minimum pairwise distance among corrupted points is precisely the grid width, which we vary. The training set is the union of 100 clean samples and the 100 corrupted samples. Using 1000 clean samples as a test set, we study the classification performance of a majority-vote k-NN classifier. Results are shown in Figure 1. As expected, we see that as the minimum pairwise distance decreases, so does test accuracy, and more clean training samples are needed to compensate.
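A rough, self-contained reconstruction of this simulation (the exact class means, grid placement, and sample counts are our assumptions, not the paper's) lets one study test accuracy as a function of grid width:

```python
import numpy as np

def simulate(grid_width, n_clean=100, n_test=1000, k=5, seed=0):
    """2-D binary task: clean points ~ N(mu_y, I); 100 corrupted points sit
    on a 10x10 grid with the given spacing and carry flipped labels, so the
    minimum pairwise corrupted distance equals the grid width. Returns test
    accuracy of a majority-vote k-NN classifier on the combined train set."""
    rng = np.random.default_rng(seed)
    mu = np.array([[-1.0, -1.0], [1.0, 1.0]])        # assumed class means
    y_clean = rng.integers(0, 2, n_clean)
    X_clean = mu[y_clean] + rng.normal(size=(n_clean, 2))
    # 100 corrupted points on a square grid centered at the origin;
    # linspace over 10 points makes the spacing exactly grid_width.
    g = np.linspace(-grid_width * 4.5, grid_width * 4.5, 10)
    Xc = np.array([[a, b] for a in g for b in g])
    yc = (Xc.sum(1) < 0).astype(int)                  # flipped true labels
    X = np.vstack([X_clean, Xc])
    y = np.concatenate([y_clean, yc])
    # Majority-vote k-NN on fresh clean test points.
    y_test = rng.integers(0, 2, n_test)
    X_test = mu[y_test] + rng.normal(size=(n_test, 2))
    d = np.linalg.norm(X_test[:, None] - X[None], axis=-1)
    nbrs = np.argsort(d, axis=1)[:, :k]
    pred = (y[nbrs].mean(1) > 0.5).astype(int)
    return float((pred == y_test).mean())
```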
Table 1: Area under the test-error curve across noise rates, Uniform noise corruption (lower is better). % Clean is the fraction of the data that is trusted.

Dataset | % Clean | Forward | Clean | Distill | GLC | kNN | kNN Classify | Full
Letters | 5 | 4.55 | 3.16 | 2.48 | 2.33 | 2.05 | 2.19 | 2.61
Letters | 10 | 4.28 | 2.51 | 2.05 | 1.91 | 1.78 | 1.79 | 2.23
Letters | 20 | 3.77 | 2.06 | 1.76 | 1.57 | 1.56 | 1.34 | 1.85
Phonemes | 5 | 7.89 | 1.91 | 1.79 | 2.12 | 1.26 | 2.58 | 3
Phonemes | 10 | 7.86 | 1.54 | 1.53 | 1.67 | 1.16 | 2.28 | 2.85
Phonemes | 20 | 7.72 | 1.34 | 1.33 | 1.35 | 1.13 | 1.76 | 2.24
Wilt | 5 | 5.18 | 0.56 | 0.85 | 0.54 | 0.39 | 0.93 | 1.86
Wilt | 10 | 4.68 | 0.43 | 0.75 | 0.45 | 0.32 | 0.77 | 1.77
Wilt | 20 | 4.31 | 0.36 | 0.86 | 0.35 | 0.32 | 0.57 | 1.5
Seeds | 5 | 3.29 | 4.22 | 3.71 | 4.2 | 3.08 | 2.87 | 2.71
Seeds | 10 | 3.43 | 3.04 | 2.84 | 2.99 | 2.14 | 2.74 | 2.57
Seeds | 20 | 2.99 | 2.69 | 2.27 | 2.74 | 1.72 | 2.44 | 2.2
Iris | 5 | 2.97 | 3.23 | 3.48 | 3.97 | 2.46 | 1.72 | 2.05
Iris | 10 | 2.89 | 2.48 | 2.59 | 2.25 | 1.32 | 1.26 | 1.55
Iris | 20 | 2.46 | 1.85 | 1.97 | 1.52 | 0.6 | 0.96 | 1.32
Parkinsons | 5 | 5 | 3.46 | 3.26 | 4.22 | 3.4 | 3.26 | 3.76
Parkinsons | 10 | 5.35 | 3.26 | 3.22 | 3.45 | 3.21 | 3.22 | 3.82
Parkinsons | 20 | 4.88 | 3.01 | 3.08 | 3.1 | 2.98 | 2.97 | 3.52
MNIST | 5 | 2.88 | 0.69 | 1.03 | 0.5 | 0.4 | 2.72 | 2.75
MNIST | 10 | 2.57 | 0.5 | 0.85 | 0.41 | 0.33 | 2.42 | 2.45
MNIST | 20 | 2.07 | 0.35 | 0.69 | 0.34 | 0.27 | 1.97 | 2.03
Fashion MNIST | 5 | 2.76 | 1.88 | 1.73 | 1.59 | 1.56 | 2.53 | 2.54
Fashion MNIST | 10 | 2.47 | 1.71 | 1.6 | 1.52 | 1.52 | 2.21 | 2.3
Fashion MNIST | 20 | 2.07 | 1.56 | 1.48 | 1.45 | 1.44 | 1.95 | 2.05
CIFAR-10 | 5 | 6.74 | 7 | 6.86 | 5.43 | 5.03 | 6.34 | 6.74
CIFAR-10 | 10 | 6.58 | 6.58 | 6.32 | 5.39 | 5.27 | 6.11 | 6.55
CIFAR-10 | 20 | 6.4 | 5.52 | 5.66 | 5.11 | 4.57 | 5.93 | 6.36
CIFAR-100 | 5 | 10.8 | 10.22 | 9.98 | 9.59 | 9.57 | 9.17 | 9.29
CIFAR-100 | 10 | 10.79 | 9.94 | 9.7 | 9.42 | 9.63 | 9.09 | 9.25
CIFAR-100 | 20 | 10.78 | 9.38 | 9.15 | 9.06 | 9.23 | 8.92 | 9.07
SVHN | 5 | 5.04 | 3.52 | 3.56 | 1.99 | 1.62 | 4.64 | 4.95
SVHN | 10 | 4.98 | 2.2 | 3.2 | 2.27 | 1.34 | 4.51 | 4.82
SVHN | 20 | 4.32 | 1.83 | 2.67 | 2.14 | 1.2 | 3.96 | 4.41
5 Experiments With Clean Auxiliary Data
In this section we present experiments where a small set of relatively clean labeled data is given as side information. The methods in Section 6 do not assume such a clean set is available.
5.1 Experiment Setup
We ran experiments for different trusted-set sizes, noise rates, and label corruptions, as detailed below. We randomly partition each dataset's train set into an untrusted set and a trusted set, and we present results for 95/5, 90/10, and 80/20 untrusted/trusted splits.
We corrupt the labels in the untrusted set for a fraction of the examples (the experiment's “noise rate”) using one of three schemes:
Uniform: The label is flipped to any one of the possible labels (including itself) with equal probability.
Flip: The label is flipped to any other label with equal probability.
Hard Flip: With probability equal to the noise rate, the label $y$ is flipped to $\pi(y)$, where $\pi$ is some predefined permutation of the labels. This simulates confusion between semantically similar classes, as done in Hendrycks et al. (2018).
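The three schemes above can be sketched as follows (a minimal reimplementation; the function name and arguments are ours):

```python
import numpy as np

def corrupt(labels, m, p, scheme="uniform", perm=None, seed=0):
    """Corrupt a fraction p of labels from {0, ..., m-1}.
    uniform: resample from all m labels (the label may stay the same);
    flip:    resample from the m-1 other labels;
    hard:    map y -> perm[y] for a fixed permutation perm."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    hit = rng.random(len(labels)) < p
    for i in np.where(hit)[0]:
        if scheme == "uniform":
            labels[i] = rng.integers(0, m)
        elif scheme == "flip":
            labels[i] = (labels[i] + rng.integers(1, m)) % m
        elif scheme == "hard":
            labels[i] = perm[labels[i]]
    return labels

y = np.arange(10) % 5                      # labels in {0, ..., 4}
y_hard = corrupt(y, 5, 1.0, "hard", perm=np.array([1, 2, 3, 4, 0]))
y_flip = corrupt(y, 5, 1.0, "flip")
```

Note that under the uniform scheme the effective corruption probability is slightly below the nominal noise rate, since a resampled label can coincide with the original.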
5.2 Comparison Methods
We compare against the following:
Gold Loss Correction (GLC): Hendrycks et al. (2018) estimate a corruption matrix by averaging, over the clean examples, the softmax outputs of a model trained on noisy data.
Forward: Patrini et al. (2017), similar in spirit to GLC, estimate the corruption matrix by training a model on noisy data and using its softmax outputs on prototype examples for each class. Unlike the other methods, it does not require a clean dataset.
Distill: Li et al. (2017) assign each example in the combined dataset a “soft” label that is a convex combination of its label and its softmax output from a model trained solely on clean data.
Clean: Train only on the trusted set.
Full: Train on the union of the trusted and untrusted sets.
kNN Classify: Train on the union, but at runtime classify using k-NN evaluated on the logit layer (rather than the model's decision). This is not a competing data cleaning method; rather, it double-checks the value of the logit layer as a metric space for k-NN.
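To make the GLC baseline concrete, here is a hedged sketch of its corruption-matrix estimate (function names are ours): average the noisy-trained model's softmax outputs over the trusted examples of each true class.

```python
import numpy as np

def estimate_corruption_matrix(softmax_clean, true_labels, m):
    """GLC-style estimate: row i is the average softmax output (a
    distribution over noisy labels) of the noisy-trained model on trusted
    examples whose true label is i, i.e. an estimate of
    P(noisy label | true label i)."""
    C = np.zeros((m, m))
    for i in range(m):
        C[i] = softmax_clean[true_labels == i].mean(axis=0)
    return C

# Toy example: a "model" that confuses class 0 with class 1 some of the time.
probs = np.array([[0.8, 0.2, 0.0],
                  [0.8, 0.2, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
C = estimate_corruption_matrix(probs, np.array([0, 0, 1, 2]), 3)
```

The estimated matrix can then be used for forward-style loss correction when retraining on the noisy data.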
5.3 Other Experiment Details
We report test errors and show the average across multiple runs with standard-error bands shaded. Errors are computed at 11 uniformly spaced noise rates between 0 and 1 inclusive. For the results shown in the main text, the trusted set is randomly selected and makes up a small fraction of the data; in the Appendix, we show results over different trusted-set sizes. We implement all methods using TensorFlow 2.0 and scikit-learn. We use the Adam optimizer with its default learning rate of 0.001 and a batch size of 128 across all experiments. We use a smaller k for the UCI datasets than for the other datasets, because some of the UCI datasets are small; however, we found that the k-NN method's performance is quite stable to the choice of k, as shown in Section 5.7. We describe the permutations used for hard flipping in the Appendix.

5.4 UCI and MNIST Results
We show the results for one of the UCI datasets and for Fashion MNIST in Figure 2. Due to space, results for MNIST and the remaining UCI datasets are in the Appendix. For UCI, we use a fully-connected neural network with a single ReLU hidden layer. For both MNIST datasets, we use a fully-connected neural network with two ReLU hidden layers. We see that the k-NN approach attains models with a low error rate across noise rates and either outperforms or is competitive with the next best method, GLC.

5.5 SVHN Results
We show the results in Figure 3. We train ResNet-20 from scratch on NVIDIA P100 GPUs for 100 epochs. As in the CIFAR experiments, we see that the k-NN method tends to be competitive under the uniform and flip noise types but does slightly worse under hard flip.
5.6 CIFAR Results
For CIFAR-10/100 we use ResNet-20, which we train from scratch on single NVIDIA P100 GPUs, for 100 epochs on CIFAR-10 and 150 epochs on CIFAR-100. We show results for CIFAR-10 in Figure 3 and, due to space, results for CIFAR-100 in the Appendix. We see that the k-NN method performs competitively: it generally outperforms on the uniform and flip noise types but performs worse under hard flip. It is not too surprising that k-NN is weaker under hard flip noise (where labels are mapped by a predetermined permutation), since the noise is much more structured and therefore harder to filter out by a majority vote among the neighbors. In other words, unlike the uniform and flip types, hard flip noise is no longer white label noise.
[Figure: On MNIST, k-NN consistently matches or outperforms the other methods across corruption types and noise levels.]
5.7 Robustness to k
In this section, we show that our procedure is stable in its hyperparameter k. The theoretical results suggest that a wide range of k gives statistical consistency guarantees, and in Figure 4 we show that a wide range of k gives similar results for Algorithm 1. Such robustness is highly desirable because optimal tuning of k is often not possible, especially when no sufficiently clean validation set is available.
5.8 Summary of Results
We summarize the results in Table 1 by reporting, for each dataset, the area under the error curves shown in the figures for the Uniform noise corruption. The best results are generally obtained with deep k-NN filtering. Except in one case, when k-NN filtering does not produce the best result, the best result comes from using the same k-NN to classify directly (marked kNN Classify) rather than retraining the deep model; one reason not to use kNN Classify is its slow runtime. If we insist on retraining the model and only allow k-NN to filter, then k-NN loses 6 of the 33 experiments, with Full (no filtering) the second-best method. Analogous tables for Flip and Hard Flip are in the Appendix and show similar results.
6 Experiments Without Clean Auxiliary Data
Next, we do not split the train set; that is, the entire train set is treated as untrusted. The experimental setup is otherwise as described in Section 5.
6.1 Additional Comparison Methods
We compare against two more recently proposed methods that do not use a trusted set.
Bi-Tempered Loss: Amid et al. (2019) introduce two temperatures into the loss function and train the model on noisy data using this “bi-tempered” loss. We use their code from github.com/google/bi-tempered-loss and the hyperparameters suggested in their paper, including their recommended number of normalization iterations.
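For reference, the tempered logarithm and exponential that the bi-tempered loss builds on are, for $t \ne 1$, $\log_t(x) = (x^{1-t} - 1)/(1-t)$ and $\exp_t(x) = [1 + (1-t)x]_+^{1/(1-t)}$, both of which recover the ordinary log/exp as $t \to 1$. A quick sketch (our own, not the authors' code):

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm; reduces to np.log as t -> 1."""
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """Tempered exponential, the inverse of log_t on its range."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

# exp_t inverts log_t, and t close to 1 approximates the ordinary log.
x = 2.5
roundtrip = exp_t(log_t(x, 0.5), 0.5)
near_log = log_t(x, 1.0 - 1e-8)
```

With $t < 1$ the tempered log is bounded below (making the loss less sensitive to outliers), and with $t > 1$ the tempered exponential has heavier tails than softmax.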
Robust Generative Classifier (RoG): Lee et al. (2019) induce a generative classifier on top of a model pre-trained on noisy data. We implemented their algorithm mimicking their hyperparameter choices; see the Appendix for details.
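A simplified sketch of the generative-classifier idea: fit one Gaussian per class on hidden features and classify by Mahalanobis distance. The shared covariance here is an LDA-style simplification of ours, not the authors' exact procedure:

```python
import numpy as np

def fit_gaussian_classifier(feats, labels, m):
    """Fit per-class means and a shared covariance on hidden features."""
    means = np.stack([feats[labels == c].mean(0) for c in range(m)])
    centered = feats - means[labels]
    cov = centered.T @ centered / len(feats) + 1e-6 * np.eye(feats.shape[1])
    return means, np.linalg.inv(cov)

def predict(feats, means, prec):
    """Classify by smallest Mahalanobis distance to a class mean."""
    diffs = feats[:, None, :] - means[None, :, :]
    d2 = np.einsum('ncd,de,nce->nc', diffs, prec, diffs)
    return d2.argmin(1)

# Toy features: two classes of 4-d "hidden features" with separated means.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 1.0, size=(50, 4)),
                   rng.normal(4.0, 1.0, size=(50, 4))])
labels = np.repeat([0, 1], 50)
means, prec = fit_gaussian_classifier(feats, labels, 2)
pred = predict(feats, means, prec)
```

The appeal for noisy labels is that class means and covariances are robust summary statistics, so a moderate fraction of mislabeled features perturbs the decision rule less than it perturbs a discriminatively retrained softmax layer.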
6.2 Results
Results for MNIST are shown in Figure 5. Due to space constraints, results for all the other datasets from Section 5 are in the Appendix. The results in Figure 5 are mostly representative of the other experiments. However, the results for CIFAR-100 are notable: the logit space is 100-dimensional (it is a 100-class problem), and we hypothesize that this higher-dimensional space hampers k-NN somewhat, as it only does as well as most methods there, whereas the RoG method is substantially better than most.
7 Conclusions and Open Questions
We conclude from our experiments and theory that the proposed deep k-NN filtering is a safe and dependable method to remove problematic training examples. While k-NN methods can be sensitive to the choice of k when used with small datasets (Garcia et al., 2009), we hypothesize that with today's large datasets one can blithely set k to a fixed, practically medium-sized value, as done here, and expect reasonable performance. Theoretically, we provided new results on how well k-NN can identify clean versus corrupted labels. Open theoretical questions are whether there are alternate notions for characterizing the difficulty of a particular configuration of corrupted examples, and whether we can provide both upper and lower learning bounds under such noise conditions.
References
Robust bi-tempered logistic loss based on Bregman divergences. In Advances in Neural Information Processing Systems, pp. 14987–14996.
Fast learning rates for plug-in classifiers. The Annals of Statistics 35 (2), pp. 608–633.
Identifying mislabeled training data. Journal of Artificial Intelligence Research.
Support vector machines with the ramp loss and the hard margin loss. Operations Research 59 (2), pp. 467–479.
Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pp. 3437–3445.
Data cleaning: overview and emerging challenges. In SIGMOD.
Rates of convergence for nearest neighbor procedures. In Proceedings of the Hawaii International Conference on Systems Sciences, pp. 413–415.
On the strong universal consistency of nearest neighbor regression function estimates. The Annals of Statistics 22 (3), pp. 1371–1385.
Discriminatory analysis: nonparametric discrimination: consistency properties. Technical report, California Univ Berkeley.
On the consistency of exact and approximate nearest neighbor with noisy data. arXiv preprint arXiv:1607.07526.
Completely lazy learning. IEEE Transactions on Knowledge and Data Engineering.
Robust loss functions under label noise for deep neural networks. In Thirty-First AAAI Conference on Artificial Intelligence.
Training deep neural networks using a noise adaptation layer.
Discovering informative patterns and data cleaning. In AAAI Workshop on Knowledge Discovery in Databases.
Masking: a new perspective of noisy supervision. In Advances in Neural Information Processing Systems, pp. 5836–5846.
Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems, pp. 10456–10465.
To trust or not to trust a classifier. In Advances in Neural Information Processing Systems (NeurIPS).
Non-asymptotic uniform rates of consistency for k-NN regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3999–4006.
MentorNet: regularizing very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055.
Learning deep networks from noisy labels with dropout regularization. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 967–972.
Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577.
Robust inference via generative classifiers for handling noisy labels. arXiv preprint arXiv:1901.11300.
Learning from noisy labels with distillation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1910–1918.
Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (3), pp. 447–461.
Smooth discrimination analysis. The Annals of Statistics 27 (6), pp. 1808–1829.
On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost. In Advances in Neural Information Processing Systems, pp. 1049–1056.
Learning with noisy labels. In Advances in Neural Information Processing Systems, pp. 1196–1204.
Deep k-nearest neighbors: towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765.
Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952.
Fast rates for a kNN classifier robust to unknown asymmetric label noise. arXiv preprint arXiv:1906.04542.
Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050.
Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694.
Adaptive Hausdorff estimation of density level sets. The Annals of Statistics 37 (5B), pp. 2760–2782.
Consistent nonparametric regression. The Annals of Statistics, pp. 595–620.
Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080.
Combating label noise in deep learning using abstention. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 6234–6243.
Optimal aggregation of classifiers in statistical learning. The Annals of Statistics 32 (1), pp. 135–166.
Toward robustness against label noise in training deep discriminative neural networks. In Advances in Neural Information Processing Systems, pp. 5596–5605.
Learning with symmetric label noise: the importance of being unhinged. In Advances in Neural Information Processing Systems, pp. 10–18.
Learning from noisy large-scale datasets with minimal supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 839–847.
Analyzing the robustness of nearest neighbors to adversarial examples. In International Conference on Machine Learning, pp. 5120–5129.
Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics.
Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691–2699.
Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, pp. 8778–8788.