Machine-learned models can only be as good as the data they were trained on. With today's increasingly large datasets and automated, indirect labels such as clicks, it is ever more important to investigate and provide effective techniques for handling noisy labels.
We revisit the classical method of filtering out suspicious training examples using k-nearest neighbors (k-NN) (Wilson, 1972). Like Papernot and McDaniel (2018), we apply k-NN on the learned intermediate representation of a preliminary model, using that supervised representation as the distance space in which to compute nearest neighbors and identify suspicious examples for filtering. Indeed, k-NN methods have recently received renewed attention for their usefulness (Wang et al., 2018; Reeve and Kaban, 2019). Here, like Jiang et al. (2018), we use k-NN as an auxiliary method to improve modern deep learning.
The main contributions of this paper are:
Experimentally showing that k-NN executed on an intermediate layer of a preliminary deep model to filter suspiciously-labeled examples works as well as or better than state-of-the-art methods for handling noisy labels, and is robust to the choice of k.
Theoretically showing that, asymptotically, k-NN's predictions will identify a training example as clean only if its label is the Bayes-optimal label. We also provide finite-sample analysis in terms of the margin and how spread out the corrupted labels are (Theorem 1), rates of convergence for the margin (Theorem 2), and rates under Tsybakov's noise condition (Theorem 3), with all rates matching the minimax-optimal rates in the noiseless setting.
Our work shows that even though the preliminary neural network is trained with corrupted labels, it still yields intermediate representations that are useful for k-NN filtering. After identifying examples whose labels disagree with their neighbors, one can either automatically remove them and retrain on the remaining data, or send them to a human operator for further review. This strategy is also useful in human-in-the-loop systems, where one can warn the human annotator that a label is suspicious and automatically propose new labels based on its nearest neighbors' labels.
In addition to strong empirical performance, deep k-NN filtering has a couple of advantages. First, many methods require a clean set of samples whose labels can be trusted; we show that the k-NN-based method is effective both in the presence and in the absence of such a clean set. Second, while k-NN does introduce the hyperparameter k, we will show that deep k-NN filtering is stable to the choice of k. Such robustness to hyperparameters is highly desirable because optimal tuning for this problem is often not available in practice (e.g. when no clean validation set is available).
2 Related Work
2.1 Training with noisy labels
Methods to handle label noise can be classified into two main strategies: (i) explicitly identify and remove the noisy examples, and (ii) indirectly handle the noise with robust training methods.
Data Cleaning: The proposed deep k-NN filtering fits into the broad family of data cleaning methods, in that our proposal detects and filters dirty data (Chu et al., 2016). Using k-NN to "edit" training data has been popular since Wilson (1972) used it to throw away training examples that were not consistent with their nearest neighbors. The idea of using a preliminary model to help identify mislabeled examples dates to at least Guyon et al. (1994), who proposed using the model to compute an information gain for each example and then considering examples with high gain to be suspect. Other early work used a cross-validation set-up: train a classifier on part of the data, use it to make predictions on held-out training examples, and remove any example whose prediction disagrees with its label (Brodley and Friedl, 1999).
Noise Corruption Estimation: One family of methods estimates how the labels were corrupted and uses that estimate to correct the training loss (Sukhbaatar et al., 2014). Such methods rely on a confusion matrix which is often unknown and must be estimated. Patrini et al. (2017) suggest deriving it from the softmax distribution of the model trained on noisy data, but there are other alternatives (Goldberger and Ben-Reuven, 2016; Jindal et al., 2016; Han et al., 2018). Accurate estimates are generally hard to attain when only untrusted data is available; Hendrycks et al. (2018) achieve more accurate estimates in the setting where some amount of known clean, trusted data is available. EM-style algorithms have also been proposed to estimate the clean label distribution (Xiao et al., 2015; Khetan et al., 2017; Vahdat, 2017).
Noise-Robust Training: Natarajan et al. (2013) propose a method to make any surrogate loss function noise-robust given knowledge of the corruption rates. Ghosh et al. (2017) prove that losses like mean absolute error (MAE) are inherently robust under symmetric or uniform label noise, while Zhang and Sabuncu (2018) show that training with MAE results in poor convergence and accuracy; they propose a new loss function based on the negative Box-Cox transformation that trades off the noise-robustness of MAE against the training efficiency of cross-entropy. The ramp, unhinged, and savage losses have also been proposed and theoretically justified as noise-robust for support vector machines (Brooks, 2011; Van Rooyen et al., 2015; Masnadi-Shirazi and Vasconcelos, 2009). Amid et al. (2019) construct a noise-robust "bi-tempered" loss by introducing temperatures in the exponential and log functions. Rolnick et al. (2017) empirically show that deep learning models are robust to noise when there are enough correctly labeled examples and when the model capacity and training batch size are sufficiently large. Thulasidasan et al. (2019) propose a new loss that enables the model to abstain from confusing examples during training.
Auxiliary Models: Veit et al. (2017) propose learning a label cleaning network on trusted data by predicting the differences between clean and noisy labels. Li et al. (2017) suggest training on a weighted average between noisy labels and the distilled predictions of an auxiliary model trained on trusted data. Given a model pre-trained on noisy data, Lee et al. (2019) boost its generalization on clean data by inducing a generative classifier on top of hidden features from the model; the method is grounded in the assumption that the hidden features follow a class-specific Gaussian distribution.
Example Weighting: Here we make a hard decision about whether to keep a training example, but one can also adapt the weights on training examples based on the confidence in their labels. Liu and Tao (2015) provide an importance-weighting scheme for binary classification. Ren et al. (2018) suggest upweighting examples whose loss gradient is aligned with those of trusted examples at every step in training. Jiang et al. (2017) investigate a recurrent network that learns a sample weighting scheme to give to the base model.
2.2 k-nearest neighbor theory
The theory of k-nearest neighbor classification has a long history (see e.g. Fix and Hodges Jr (1951); Cover (1968); Stone (1977); Devroye et al. (1994); Chaudhuri and Dasgupta (2014)). Much of the prior work focuses on k-NN's statistical consistency properties. However, with the growing interest in adversarial examples and learning with noisy labels, there have recently been analyses of k-nearest neighbor methods in these settings. Wang et al. (2018) analyze the robustness of k-NN classification and provide a robust variant of k-NN classification, where their notion of robustness is that predictions of nearby points should be similar. Gao et al. (2016) provide an analysis of the k-NN classifier under noisy labels and, like us, show that k-NN can attain similar rates in the noisy setting as in the noiseless setting. Gao et al. (2016) assume a noise model where labels are corrupted uniformly at random, while we allow an arbitrary corruption pattern and provide results based on a notion of how spread out the corrupted points are. Moreover, we provide finite-sample bounds borrowing recent advances in k-NN convergence theory in the noiseless setting (Jiang, 2019), while the guarantees of Gao et al. (2016) are asymptotic. Reeve and Kaban (2019) provide stronger guarantees for a robust modification of k-NN proposed by Gao et al. (2016). To the best of our knowledge, we provide the first finite-sample rates of consistency for the classical k-NN method in the noisy setting with very few assumptions on the label noise.
3 Deep k-NN Algorithm
Recall the standard k-nearest neighbor classifier:
Definition 1 (k-NN).
Let the k-NN radius of $x \in \mathcal{X}$ be $r_k(x) := \inf\{r > 0 : |B(x, r) \cap X| \ge k\}$, where $B(x, r) := \{x' \in \mathcal{X} : \|x - x'\| \le r\}$ and $X = \{x_1, \dots, x_n\}$ is the training sample, and let the k-NN set of $x$ be $N_k(x) := B(x, r_k(x)) \cap X$. Then for all $x \in \mathcal{X}$, the k-NN classifier function w.r.t. $X$ has discriminant function
$$\hat{\eta}(x) := \frac{1}{|N_k(x)|} \sum_{x_i \in N_k(x)} y_i,$$
with prediction $\mathbf{1}[\hat{\eta}(x) \ge 1/2]$.
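To make Definition 1 concrete, here is a minimal sketch of the binary k-NN classifier in NumPy. The function names are ours, and tie-breaking among equidistant neighbors is simply left to the sort order:

```python
import numpy as np

def knn_discriminant(x, X_train, y_train, k):
    """Average label over the k nearest neighbors of x (Definition 1)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nn_idx = np.argsort(dists)[:k]   # indices of the k-NN set N_k(x)
    return y_train[nn_idx].mean()    # empirical estimate of eta(x)

def knn_predict(x, X_train, y_train, k):
    """Predict 1 iff the discriminant is at least 1/2 (majority vote)."""
    return int(knn_discriminant(x, X_train, y_train, k) >= 0.5)
```

For example, with three points labeled 0 near the origin and one point labeled 1 far away, `knn_predict` with k=3 returns 0 for a query near the origin.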
Our method is detailed in Algorithm 1. It takes a dataset $D$ of examples with potentially noisy labels, along with a dataset $D_{\text{clean}}$ consisting of examples with clean or trusted labels. Note that we allow $D_{\text{clean}}$ to be empty (i.e. in instances where no such trusted data is available). We also take as given a model architecture (e.g. a 2-layer DNN with 20 hidden nodes). Algorithm 1 uses the following sub-routine:
Filtering Model Train Set Selection Procedure: To determine whether the model used to compute the k-NN should be trained only on $D_{\text{clean}}$ or on $D \cup D_{\text{clean}}$, we split $D_{\text{clean}}$ 70/30 into two sets, $D_{\text{tr}}$ and $D_{\text{val}}$, and train two preliminary models: one on $D_{\text{tr}}$ and one on $D \cup D_{\text{tr}}$. If the model trained with $D \cup D_{\text{tr}}$ performs better on the held-out $D_{\text{val}}$, then output $D \cup D_{\text{clean}}$, otherwise output $D_{\text{clean}}$.
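The filtering step itself can be sketched as follows (a condensed illustration, not the authors' exact Algorithm 1): embed each training example with an intermediate layer of the preliminary model, then keep an example only if its label is among the plurality labels of its k nearest neighbors in that embedding. The function name and the brute-force distance computation are ours:

```python
import numpy as np

def knn_filter(embeddings, labels, k):
    """Flag examples whose label disagrees with the plurality label of
    their k nearest neighbors in the learned representation.

    embeddings: (n, d) array from an intermediate layer of a preliminary
    model; labels: (n,) nonnegative integer labels.
    Returns a boolean mask that is True for examples to keep.
    """
    n = len(labels)
    keep = np.zeros(n, dtype=bool)
    # Brute-force pairwise squared distances; use a KD-tree or an
    # approximate-nearest-neighbor index at scale.
    d2 = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    for i in range(n):
        nn = np.argsort(d2[i])[:k]          # k-NN set (includes i itself)
        votes = np.bincount(labels[nn])     # label counts among neighbors
        keep[i] = votes[labels[i]] == votes.max()
    return keep
```

One can then retrain the final model on `embeddings[keep]`/`labels[keep]`, or surface the flagged examples to a human reviewer.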
4 Theoretical Analysis
For the theoretical analysis, we assume a binary classification problem with features defined on a compact set $\mathcal{X} \subseteq \mathbb{R}^D$. We assume that points are drawn according to a distribution $\mathcal{D}$ as follows: the features come from a distribution $P_{\mathcal{X}}$ on $\mathcal{X}$, and the labels are distributed according to a measurable conditional probability function $\eta(x) := \mathbb{P}(y = 1 \mid x)$. That is, a sample $(x, y)$ is drawn from $\mathcal{D}$ by drawing $x$ according to $P_{\mathcal{X}}$ and choosing $y \sim \mathrm{Bernoulli}(\eta(x))$.
The goal will be to show that given corrupted examples, the k-NN disagreement method is still able to identify the examples whose labels do not match the Bayes-optimal label.
We will make a few regularity assumptions for our analysis to hold. The first regularity assumption ensures that the support does not become arbitrarily thin anywhere. This is a standard non-parametric assumption (e.g. (Singh et al., 2009; Jiang, 2019)).
Assumption 1 (Support Regularity).
There exist $\lambda > 0$ and $r_0 > 0$ such that $\mathrm{Vol}(B(x, r) \cap \mathcal{X}) \ge \lambda \cdot \mathrm{Vol}(B(x, r))$ for all $x \in \mathcal{X}$ and $0 < r \le r_0$, where $B(x, r) := \{x' \in \mathbb{R}^D : \|x - x'\| \le r\}$.
Let $p_{\mathcal{X}}$ be the density function corresponding to $P_{\mathcal{X}}$. The next assumption ensures that, with a sufficiently large sample, we will obtain a good covering of the input space.
Assumption 2 ($p_{\mathcal{X}}$ bounded from below).
There exists $p_0 > 0$ such that $p_{\mathcal{X}}(x) \ge p_0$ for all $x \in \mathcal{X}$.
Assumption 3 ($\eta$ Hölder continuous).
There exist $0 < \alpha \le 1$ and $C_\alpha > 0$ such that $|\eta(x) - \eta(x')| \le C_\alpha \|x - x'\|^{\alpha}$ for all $x, x' \in \mathcal{X}$.
We propose a notion of how spread out a set of points is, based on the minimum pairwise distance between the points. This quantity will appear in the finite-sample bounds we present. Intuitively, the more spread out a contaminated set of points is, the fewer clean samples will be needed to overcome the contamination of that set.
Definition 2 (Minimum pairwise distance).
For a finite set of points $C \subset \mathbb{R}^D$, define $\gamma(C) := \min_{c, c' \in C,\, c \neq c'} \|c - c'\|$.
Also define the $\Delta$-interior region of $\mathcal{X}$ where there is at least $\Delta$ margin in the probabilistic label:
Let $\Delta > 0$. Define $\mathcal{X}_\Delta := \{x \in \mathcal{X} : |\eta(x) - 1/2| \ge \Delta\}$.
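The minimum pairwise distance of Definition 2 can be computed directly; a small sketch (our naming), using brute-force pairwise distances:

```python
import numpy as np

def min_pairwise_distance(A):
    """gamma(A): minimum distance between distinct points of A (Definition 2)."""
    A = np.asarray(A, dtype=float)
    d = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # ignore the zero self-distances
    return float(d.min())
```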
We now state the result, which says that with high probability, uniformly over $\mathcal{X}_\Delta$ when $\Delta$ is known, a point's label disagrees with the k-NN classifier if and only if the label is not the Bayes-optimal prediction. Due to space, all proofs are deferred to the Appendix.
Theorem 1 (Fixed $\Delta$).
Let $0 < \delta < 1$ and suppose Assumptions 1, 2, and 3 hold. There exist constants depending only on $\mathcal{D}$ such that the following holds with probability at least $1 - \delta$. Let $x_1, \dots, x_n$ be (uncorrupted) examples drawn from $\mathcal{D}$, let $C$ be a set of points with corrupted labels, and denote our sample $X := \{x_1, \dots, x_n\} \cup C$. Suppose $k$ lies in the following range:
Then the following holds uniformly over $\mathcal{X}_\Delta$: the k-NN prediction computed w.r.t. $X$ agrees with the label if and only if the label is the Bayes-optimal label.
In the last result, we assumed that $\Delta$ was fixed. We next show that a similar guarantee holds while taking $\Delta \to 0$, as long as $k$ is chosen appropriately, and we provide rates of convergence.
Theorem 2 (Rates of convergence for $\Delta$).
Let $0 < \delta < 1$ and suppose Assumptions 1, 2, and 3 hold. There exist constants depending only on $\mathcal{D}$ such that the following holds with probability at least $1 - \delta$. Let $x_1, \dots, x_n$ be (uncorrupted) examples drawn from $\mathcal{D}$, let $C$ be a set of points with corrupted labels, and denote our sample $X := \{x_1, \dots, x_n\} \cup C$. Suppose $k$ lies in the following range:
then the following holds uniformly over $\mathcal{X}_{\Delta_n}$: the k-NN prediction computed w.r.t. $X$ agrees with the label if and only if the label is the Bayes-optimal label, where
Choosing $k \propto n^{2\alpha/(2\alpha + D)}$ in the above result gives $\Delta_n = \tilde{O}(n^{-\alpha/(2\alpha + D)})$. This rate for $\Delta_n$ is the minimax-optimal rate for k-nearest neighbor classification given a sample of size $n$ (Chaudhuri and Dasgupta, 2014) in the uncorrupted setting. Thus, our analysis is tight up to logarithmic factors.
Assumption 4 (Tsybakov Noise Condition).
The following holds for some $\beta \ge 0$, $C_\beta > 0$, and all $\Delta > 0$:
$$P_{\mathcal{X}}\left(\{x : |\eta(x) - 1/2| \le \Delta\}\right) \le C_\beta \Delta^{\beta}.$$
Theorem 3 (Rates under Tsybakov Noise Condition).
Let $0 < \delta < 1$ and suppose Assumptions 1, 2, 3, and 4 hold. There exist constants depending only on $\mathcal{D}$ such that the following holds with probability at least $1 - \delta$. Let $x_1, \dots, x_n$ be (uncorrupted) examples drawn from $\mathcal{D}$, let $C$ be a set of points with corrupted labels, and denote our sample $X := \{x_1, \dots, x_n\} \cup C$. Suppose $k$ lies in the following range
and define the excess risk $R - R^*$. Then,
where $R$ and $R^*$ denote the risk of the k-NN method and the Bayes-optimal classifier, respectively.
Choosing $k \propto n^{2\alpha/(2\alpha + D)}$ in the above gives a rate of $\tilde{O}(n^{-\alpha(\beta+1)/(2\alpha + D)})$ for the excess risk. This matches the lower bounds of Audibert et al. (2007) up to logarithmic factors.
4.1 Impact of Minimum Pairwise Distance
The minimum pairwise distance across corrupted samples, $\gamma$, is a key quantity in the theory presented in the previous section. We now empirically study its significance in a simulated binary classification task in 2 dimensions. Clean samples with label $y$ are generated by sampling i.i.d. from a class-conditional Gaussian. We take 100 samples uniformly spaced on a square grid centered about the decision boundary and corrupt them by flipping their true labels. With this construction, $\gamma$ is precisely the grid width, which we vary. The training set is the union of the 100 clean samples and the 100 corrupted samples. Using 1000 clean samples as a test set, we study the classification performance of a majority-vote k-NN classifier. Results are shown in Figure 1. As expected, we see that as $\gamma$ decreases, so does test accuracy, and we need more clean training samples to compensate.
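The simulation above can be sketched as follows. The class means, the location of the decision boundary, and the 10x10 grid layout are our assumptions (the exact values are elided in the text); the construction does preserve the key property that the minimum pairwise distance of the corrupted set equals the grid width:

```python
import numpy as np

def make_data(n_clean, grid_width, rng):
    """Clean Gaussian samples plus a flipped-label grid (Section 4.1 sketch)."""
    # Clean samples: class y has mean (+1, 0) or (-1, 0), identity covariance,
    # so the (assumed) Bayes decision boundary is the vertical line x1 = 0.
    y = rng.integers(0, 2, n_clean)
    X = rng.normal(0.0, 1.0, (n_clean, 2)) \
        + np.where(y[:, None] == 1, 1.0, -1.0) * np.array([1.0, 0.0])
    # 100 corrupted points on a 10x10 grid centered at the origin, spaced
    # grid_width apart, with their true labels (1[x1 >= 0]) flipped.
    g = np.linspace(-grid_width * 4.5, grid_width * 4.5, 10)
    gx, gy = np.meshgrid(g, g)
    Xg = np.stack([gx.ravel(), gy.ravel()], axis=1)
    yg = (Xg[:, 0] < 0).astype(int)  # flipped label
    return np.vstack([X, Xg]), np.concatenate([y, yg])
```

One would then sweep `grid_width` (which equals $\gamma$), train a majority-vote k-NN classifier on the union, and measure test accuracy on fresh clean samples.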
Table 1. Columns: Dataset, % Clean, Forward, Clean, Distill, GLC, k-NN, k-NN Classify, Full.
5 Experiments With Clean Auxiliary Data
In this section we present experiments where a small set of relatively clean labeled data is given as side information. The methods in Section 6 do not assume such a clean set is available.
5.1 Experiment Set-up
We ran experiments for different sizes of the clean set, different noise rates, and different label corruptions, as detailed below.
We randomly partition each dataset's train set into a noisy set and a clean set, and we present results for 95/5, 90/10, and 80/20 splits.
We corrupt the labels of a fraction of the examples in the noisy set (the experiment's "noise rate") using one of three schemes:
Uniform: The label is flipped to any one of the labels (including itself) with equal probability.
Flip: The label is flipped to any other label with equal probability.
Hard Flip: With probability equal to the noise rate, we flip the label $y$ to $\pi(y)$, where $\pi$ is some predefined permutation of the labels. The motivation here is to simulate confusion between semantically similar classes, as done in Hendrycks et al. (2018).
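The three schemes can be sketched as follows (our function; selecting a noise-rate fraction of examples to corrupt is one reasonable reading of the setup):

```python
import numpy as np

def corrupt_labels(y, num_classes, noise_rate, scheme, rng, perm=None):
    """Corrupt a noise_rate fraction of integer labels y under one scheme."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(noise_rate * len(y)), replace=False)
    if scheme == "uniform":      # any label, including the original
        y[idx] = rng.integers(0, num_classes, size=len(idx))
    elif scheme == "flip":       # any *other* label, uniformly
        shift = rng.integers(1, num_classes, size=len(idx))  # never 0 mod m
        y[idx] = (y[idx] + shift) % num_classes
    elif scheme == "hard_flip":  # predefined permutation perm of the labels
        y[idx] = perm[y[idx]]
    return y
```

Note that `flip` can never return the original label, while `uniform` keeps it with probability 1/num_classes.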
5.2 Comparison Methods
We compare against the following:
Gold Loss Correction (GLC): Hendrycks et al. (2018) estimate a corruption matrix by averaging the softmax outputs of the clean examples under a model trained on noisy data.
Forward: Patrini et al. (2017), similar in spirit to GLC, estimates the corruption matrix by training a model on noisy data and using the softmax outputs on prototype examples for each class. Unlike the other methods, it does not require a clean dataset.
Distill: Li et al. (2017) assigns each example in the combined dataset a “soft” label that is a convex combination of its label and its softmax output from a model trained solely on clean data.
Clean: Train only on the clean data.
Full: Train on all the data, noisy and clean.
k-NN Classify: Train on all the data, but at runtime classify using k-NN evaluated on the logit layer (rather than using the model's decision). This is not a competing data-cleaning method; rather, it double-checks the value of the logit layer as a metric space for k-NN.
5.3 Other Experiment Details
We report test errors and show the average across multiple runs with standard-error bands shaded. Errors are computed at 11 uniformly spaced noise rates between 0 and 1 inclusive. For the results shown in the main text, the clean set is randomly selected as a fixed fraction of the data; in the Appendix, we show results over different clean-set sizes. We implement all methods using TensorFlow 2.0 and Scikit-Learn. We use the Adam optimizer with its default learning rate of 0.001 and a batch size of 128 across all experiments. We use a smaller $k$ for the UCI datasets than for the other datasets because some of the UCI datasets are small. However, we found that the k-NN method's performance is quite stable to the choice of $k$, as we show in Section 5.7. We describe the permutations used for hard flipping in the Appendix.
5.4 UCI and MNIST Results
We show the results for one of the UCI datasets and Fashion MNIST in Figure 2. Due to space, results for MNIST and the remaining UCI datasets are in the Appendix. For UCI, we use a fully-connected neural network with a single hidden layer and ReLU activations. For both MNIST datasets, we use a fully-connected neural network with two hidden layers and ReLU activations. We see that the k-NN approach attains models with a low error rate across noise rates and either outperforms or is competitive with the next best method, GLC.
5.5 SVHN Results
We show the results in Figure 3. We train ResNet-20 from scratch on NVIDIA P100 GPUs for 100 epochs. As in the CIFAR experiments, we see that the k-NN method tends to be competitive under the uniform and flip noise types but does slightly worse under hard flip.
5.6 CIFAR Results
For CIFAR10/100 we use ResNet-20, which we train from scratch on single NVIDIA P100 GPUs, for 100 epochs on CIFAR10 and 150 epochs on CIFAR100. We show results for CIFAR10 in Figure 3; results for CIFAR100 are in the Appendix, due to space. We see that the k-NN method performs competitively. It generally outperforms on the uniform and flip noise types but performs worse on the hard flip noise type. It is not too surprising that k-NN would be weaker in the presence of hard flip noise (i.e. where labels are mapped based on a pre-determined mapping between labels): the noise is much more structured in that case, making it harder to filter out by majority vote among the neighbors. In other words, unlike the uniform and flip noise types, hard flip is no longer white label noise. We find that on MNIST, k-NN consistently matches or outperforms the other methods across corruption types and noise levels.
5.7 Robustness to k
In this section, we show that our procedure is stable in its hyperparameter k. The theoretical results suggest that a wide range of k gives statistical consistency guarantees, and in Figure 4 we show that a wide range of k gives similar results for Algorithm 1. Such robustness is highly desirable because optimal tuning of k is often not possible, especially when no sufficient clean validation set is available.
5.8 Summary of Results
We summarize the results in Table 1 by reporting, for each dataset, the area under the curve for the results shown in the figures under the Uniform noise corruption. The best results (bolded) are generally obtained with deep k-NN. Except in one case, when k-NN filtering does not produce the best results, the best results come from using the same k-NN to classify directly rather than re-training the deep model (marked k-NN Classify). One reason not to use k-NN Classify is its slow runtime. If we insist on re-training the model and only allow the use of k-NN to filter, then k-NN loses 6 of the 33 experiments, with Full (no filtering) the second-best method. Analogous tables for Flip and Hard Flip are in the Appendix and show similar results.
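The area-under-the-curve summary can be computed with the trapezoidal rule over the 11 evaluated noise rates; a sketch (our naming, applicable to either an error or an accuracy curve):

```python
import numpy as np

def area_under_curve(noise_rates, values):
    """Trapezoidal-rule area of a metric-vs-noise-rate curve."""
    x = np.asarray(noise_rates, dtype=float)
    v = np.asarray(values, dtype=float)
    # sum of average heights times interval widths
    return float(np.sum((v[1:] + v[:-1]) / 2.0 * np.diff(x)))
```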
6 Experiments Without Clean Auxiliary Data
Next, we do not split the train set; that is, the clean set is empty. The experimental setup is otherwise as described in Section 5.
6.1 Additional Comparison Methods
We compare against two more recently-proposed methods that do not use a clean set.
Bi-Tempered Loss: Amid et al. (2019) introduce two temperatures into the loss function and train the model on noisy data using this "bi-tempered" loss. We use their code from github.com/google/bi-tempered-loss with the temperatures and number of normalization iterations suggested in their paper.
Robust Generative Classifier (RoG): Lee et al. (2019) induce a generative classifier using a model pre-trained on noisy data. We implemented their algorithm, mimicking their hyperparameter choices; see the Appendix for details.
Results for MNIST are shown in Figure 5. Due to space constraints, we put results for all the other datasets presented in Section 5 in the Appendix. The results in Figure 5 are mostly representative of the other experiments. However, the results for CIFAR-100 are notable in that the logit space is 100-dimensional (since it is a 100-class problem), and we hypothesize this higher-dimensional space befuddles k-NN a bit: it only does as well as most methods, whereas the RoG method is substantially better than most methods.
7 Conclusions and Open Questions
We conclude from our experiments and theory that the proposed deep k-NN filtering is a safe and dependable method to remove problematic training examples. While k-NN methods can be sensitive to the choice of k when used with small datasets (Garcia et al., 2009), we hypothesize that with today's large datasets one can blithely set k to a fixed, practically medium-sized value, as done here, and expect reasonable performance. Theoretically, we provided new results for how well k-NN can identify clean versus corrupted labels. Open theoretical questions are whether there are alternate notions of how to characterize the difficulty of a particular configuration of corrupted examples, and whether we can provide both upper and lower learning bounds under these noise conditions.
- Robust bi-tempered logistic loss based on bregman divergences. In Advances in Neural Information Processing Systems, pp. 14987–14996. Cited by: §2.1, §6.1.
- Fast learning rates for plug-in classifiers. The Annals of Statistics 35 (2), pp. 608–633. Cited by: Remark 2.
- Identifying mislabeled training data. Journal of Artificial Intelligence Research. Cited by: §2.1.
- Support vector machines with the ramp loss and the hard margin loss. Operations Research 59 (2), pp. 467–479. Cited by: §2.1.
- Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pp. 3437–3445. Cited by: §2.2, §4, Remark 1.
- Data cleaning: overview and emerging challenges. In SIGMOD, Cited by: §2.1.
- Rates of convergence for nearest neighbor procedures. In Proceedings of the Hawaii International Conference on Systems Sciences, pp. 413–415. Cited by: §2.2.
- On the strong universal consistency of nearest neighbor regression function estimates. The Annals of Statistics 22 (3), pp. 1371–1385. Cited by: §2.2.
- Discriminatory analysis-nonparametric discrimination: consistency properties. Technical report California Univ Berkeley. Cited by: §2.2.
- On the consistency of exact and approximate nearest neighbor with noisy data. arXiv preprint arXiv:1607.07526. Cited by: §2.2.
- Completely lazy learning. IEEE Trans. on Knowledge and Data Engineering. Cited by: §7.
- Robust loss functions under label noise for deep neural networks. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.1.
- Training deep neural-networks using a noise adaptation layer. Cited by: §2.1.
- Discovering informative patterns and data cleaning. In AAAI Workshop on Knowledge Discovery in Databases, Cited by: §2.1.
- Masking: a new perspective of noisy supervision. In Advances in Neural Information Processing Systems, pp. 5836–5846. Cited by: §2.1.
- Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in neural information processing systems, pp. 10456–10465. Cited by: §2.1, §5.1, §5.2.
- To trust or not to trust a classifier. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1.
- Non-asymptotic uniform rates of consistency for k-nn regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3999–4006. Cited by: §2.2, §4.
- Mentornet: regularizing very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055 4. Cited by: §2.1.
- Learning deep networks from noisy labels with dropout regularization. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 967–972. Cited by: §2.1.
- Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577. Cited by: §2.1.
- Robust inference via generative classifiers for handling noisy labels. arXiv preprint arXiv:1901.11300. Cited by: §2.1, §6.1.
- Learning from noisy labels with distillation. Proceedings of the IEEE International Conference on Computer Vision, pp. 1910–1918. Cited by: §2.1, §5.2.
- Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (3), pp. 447–461. Cited by: §2.1.
- Smooth discrimination analysis. The Annals of Statistics 27 (6), pp. 1808–1829. Cited by: §4.
- On the design of loss functions for classification: theory, robustness to outliers, and savageboost. In Advances in Neural Information Processing Systems, pp. 1049–1056. Cited by: §2.1.
- Learning with noisy labels. In Advances in Neural Information Processing Systems, pp. 1196–1204. Cited by: §2.1.
- Deep k-nearest neighbors: towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765. Cited by: §1.
- Making deep neural networks robust to label noise: a loss correction approach. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952. Cited by: §2.1, §5.2.
- Fast rates for a kNN classifier robust to unknown asymmetric label noise. arXiv preprint arXiv:1906.04542. Cited by: §1, §2.2, §4.
- Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050. Cited by: §2.1.
- Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694. Cited by: §2.1.
- Adaptive Hausdorff estimation of density level sets. The Annals of Statistics 37 (5B), pp. 2760–2782. Cited by: §4.
- Consistent nonparametric regression. The Annals of Statistics, pp. 595–620. Cited by: §2.2.
- Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080. Cited by: §2.1.
- Combating label noise in deep learning using abstention. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 6234–6243. Cited by: §2.1.
- Optimal aggregation of classifiers in statistical learning. The Annals of Statistics 32 (1), pp. 135–166. Cited by: §4.
- Toward robustness against label noise in training deep discriminative neural networks. In Advances in Neural Information Processing Systems, pp. 5596–5605. Cited by: §2.1.
- Learning with symmetric label noise: the importance of being unhinged. In Advances in Neural Information Processing Systems, pp. 10–18. Cited by: §2.1.
- Learning from noisy large-scale datasets with minimal supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 839–847. Cited by: §2.1.
- Analyzing the robustness of nearest neighbors to adversarial examples. In International Conference on Machine Learning, pp. 5120–5129. Cited by: §1, §2.2.
- Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. on Systems, Man and Cybernetics. Cited by: §1, §2.1.
- Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2691–2699. Cited by: §2.1.
- Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, pp. 8778–8788. Cited by: §2.1.