1 Introduction
Deep neural networks dominate the state of the art in an ever-increasing list of application domains, but for the most part this success relies on very large datasets of annotated examples available for training. Unfortunately, large amounts of high-quality annotated data are hard and expensive to acquire, whereas cheap alternatives (obtained by way of crowdsourcing or automatic labeling, for example) often introduce noisy labels into the training set. By now there is much empirical evidence that neural networks can memorize almost any training set, including ones with noisy and even random labels (Zhang et al., 2017), which in turn increases the generalization error of the model. As a result, the problems of detecting label noise and of separating noisy labels from clean ones are becoming more urgent, and therefore attract increasing attention.
Henceforth, we will call the set of examples in the training data whose labels are correct "clean data", and the set of examples whose labels are incorrect "noisy data". While all labels can eventually be learned by deep models, it has been empirically shown that most noisy datapoints are learned late, after most of the clean data has already been learned (Arpit et al., 2017). Therefore many methods focus on the learning time of an example in order to classify it as noisy or clean, by looking at its loss (Pleiss et al., 2020; Arazo et al., 2019) or loss per epoch (Li et al., 2020) in a single model. However, these methods struggle to correctly classify clean and noisy datapoints that are learned at the same time, or worse, noisy datapoints that are learned early. Additionally, many of these methods work in an end-to-end manner, and thus neither provide noise level estimation nor deliver separate sets of clean and noisy data for novel future usages.
Our first contribution is a new empirical result regarding the learning dynamics of an ensemble of deep networks, showing that the dynamics differ when training with clean data vs. noisy data. The dynamics of clean data have been studied in (Hacohen et al., 2020; Pliushch et al., 2021), where it is reported that different deep models learn examples in the same order and at the same pace. This means that when training a few models and comparing their predictions, an (approximately) binary outcome is seen at each epoch: either all the networks correctly predict the example’s label, or none of them does. This further implies that for the most part, the distribution of predictions across points is bimodal. Additionally, a variety of studies showed that the bias and variance of deep networks decrease as the networks’ complexity grows (Nakkiran et al., 2021; Neal et al., 2018), providing additional evidence that different deep networks learn data simultaneously.
In Section 3 we describe a new empirical result: when training an ensemble of deep models with noisy data, and in contrast to what happens when using clean data, different models learn different datapoints at different times (see Fig. 1). This empirical finding tells us that in an ensemble of networks, the learning dynamics of clean data and noisy data can be distinguished. When training such an ensemble with a mixture of clean and noisy data, the emerging dynamics reflect this observation, as well as the previously observed tendency of clean data to be learned faster.
In our second contribution, we use this result to develop a new algorithm for noise level estimation and noise filtration, which we call DisagreeNet (see Section 4). Importantly, unlike most alternative methods, our algorithm is simple (it does not introduce any new hyperparameters), parallelizable, easy to integrate with any supervised or semi-supervised learning method and any loss function, and does not rely on prior knowledge of the noise amount. When used for noise filtration, our empirical study (see Section 5) shows the superiority of DisagreeNet as compared to the state of the art, using different datasets, different noise models and different noise levels. When used for supervised classification by way of preprocessing the training set prior to training a deep model, it provides a significant boost in performance, more so than alternative methods.
Relation to prior art
Work on the dynamics of learning in deep models has received increased attention in recent years (e.g., Nguyen et al., 2020; Hacohen et al., 2020; Baldock et al., 2021). Our work adds a new observation to this body of knowledge, which is seemingly unique to an ensemble of deep models (as against an ensemble of other commonly used classifiers). Thus, while there exist other methods that use ensembles to handle label noise (e.g., Sabzevari et al., 2018; Feng et al., 2020; Chai et al., 2021; de Moura et al., 2018), for the most part they cannot take advantage of this characteristic of deep models, and as a result are forced to use additional knowledge, typically the availability of a clean validation set and/or prior knowledge of the noise amount.
Work on deep learning with noisy labels (see Song et al. (2022) for a recent survey) can be coarsely divided into two categories: general methods that use a modified loss or network architecture, and methods that focus on noise identification. The first group includes methods that aim to estimate the underlying noise transition matrix (Goldberger and Ben-Reuven, 2016; Patrini et al., 2017), employ a noise-robust loss (Ghosh et al., 2017; Zhang and Sabuncu, 2018; Wang et al., 2019; Xu et al., 2019), or achieve robustness to noise by way of regularization (Tanno et al., 2019; Jenni and Favaro, 2018). Methods in the second group, which is more in line with our approach, focus more directly on noise identification. Some methods assume that clean examples are usually learned faster than noisy examples (e.g. Liu et al., 2020). Others (Arazo et al., 2019; Li et al., 2020) generate soft labels by interpolating the given labels and the model’s predictions during training. Yet other methods (Jiang et al., 2018; Han et al., 2018; Malach and Shalev-Shwartz, 2017; Yu et al., 2019; Lee and Chung, 2019), like our own, inspect an ensemble of networks, usually in order to transfer information between networks and thus avoid agreement bias.
Notably, we also analyze the behavior of ensembles in order to identify the noisy examples, resembling (Pleiss et al., 2020; Nguyen et al., 2019; Lee and Chung, 2019). But unlike these methods, which track the loss of the networks, we track the dynamics of the agreement between multiple networks over epochs. We then show that this statistic is more effective, and achieves superior results. Additionally (and no less importantly), unlike these works, we do not assume prior knowledge of the noise amount or the presence of a clean validation set, and do not introduce new hyperparameters in our algorithm.
Recently, the emphasis has somewhat shifted to the use of semi-supervised learning and contrastive learning (Li et al., 2020; Liu et al., 2020; Ortego et al., 2020; Wei et al., 2020; Yao et al., 2021; Zheltonozhskii et al., 2022; Li et al., 2022; Karim et al., 2022). Semi-supervised learning is an effective paradigm for the prediction of missing labels. This paradigm is especially useful when the identification of noisy points cannot be done reliably, in which case it is advantageous to remove even labels that have a non-negligible likelihood of being correct. The effectiveness of semi-supervised learning in providing reliable pseudo-labels for unlabeled points then compensates for the loss of clean labels.
However, semi-supervised learning is not universally practical, as it often relies on the extraction of effective representations based on unsupervised learning tasks, which typically introduces implicit priors (e.g., that contrastive loss is appropriate). In contrast, our goal is to reliably identify noisy points, to be subsequently removed. Thus, our method can be easily incorporated into any SOTA method which uses supervised or semi-supervised learning (with or without contrastive learning), and may provide benefit even when semi-supervised learning is not viable.
2 Inter-Network Agreement: Definition and Scores
Measuring the similarity between deep models is not a trivial challenge, as modern deep neural networks are complex functions defined by a huge number of parameters, which are invariant to transformations hidden in the model’s architecture. Here we measure the similarity between deep models in an ensemble by measuring inter-model prediction agreement at each datapoint. Accordingly, in Section 2.2 we describe scores that are based on the state of the networks at each epoch, while in Section 2.3 we describe cumulative scores that integrate these states over many epochs. Practically (see Section 4), our proposed method relies on the cumulative scores, which are shown empirically to provide more accurate results in the noise filtration task. These scores promise added robustness, as it is no longer necessary to identify the epoch at which the score is to be evaluated.
2.1 Preliminaries
Notations Let $f^e$ denote a deep model, trained with Stochastic Gradient Descent (SGD) for $e$ epochs on training set $\mathbb{X}=\{(x_i,y_i)\}_{i=1}^{M}$, where $x_i$ denotes a single example and $y_i\in Y$ its corresponding label. Let $\mathcal{F}^e(\mathbb{X})=\{f_1^e,\ldots,f_N^e\}$ denote an ensemble of $N$ such models, where each model is initialized and trained independently on $\mathbb{X}$.
Noise model We analyze the training dynamics of an ensemble of models in the presence of label noise. Label noise is different from data noise (like image distortion or additive Gaussian noise). Here it is assumed that after the training set is sampled, the labels are corrupted by some noise function $T$, and the training set becomes $\tilde{\mathbb{X}}=\{(x_i,T(y_i))\}_{i=1}^{M}$. The two most common models of label noise are termed symmetric noise and asymmetric noise (Patrini et al., 2017). In both cases it is assumed that some fixed percentage of the labels are corrupted by $T$. With symmetric noise, $T$ assigns any new label from the set $Y\setminus\{y\}$ with equal probability. With asymmetric noise, $T$ is a deterministic permutation function (see App. F for details). Note that the asymmetric noise model is considered much harder than the symmetric noise model.
2.2 Per-Epoch Agreement Score
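Following Hacohen et al. (2020), we define the True Positive Agreement (TPA) score of an ensemble $\mathcal{F}^e(\mathbb{X})$ of $N$ models $f_1^e,\ldots,f_N^e$ at each datapoint $(x,y)$ as $TPA(x,y;\mathcal{F}^e(\mathbb{X}))=\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_{[f_i^e(x)=y]}$, where $\mathbb{1}_{[f_i^e(x)=y]}$ equals 1 if model $f_i^e$ predicts label $y$ for $x$ and 0 otherwise. The TPA score measures the average accuracy of the models in the ensemble, when seeing $x$, after each model has been trained for exactly $e$ epochs on the training set $\mathbb{X}$. Note that TPA measures the average accuracy of multiple models on one example, as opposed to the generalization error that measures the average error of one model on multiple examples.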
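As a minimal illustration of the TPA score described above, the following sketch computes, for each example, the fraction of models in the ensemble that predict the given label. The function name `tpa_scores` and the layout of `predictions` are our own assumptions for illustration, not part of the original method description.

```python
from typing import List, Sequence


def tpa_scores(predictions: Sequence[Sequence[int]], labels: Sequence[int]) -> List[float]:
    """Per-example TPA at a fixed epoch.

    predictions[i][j] is the label predicted by model i for example j
    (hypothetical layout, chosen for this illustration).
    """
    n_models = len(predictions)
    return [
        sum(model_preds[j] == y for model_preds in predictions) / n_models
        for j, y in enumerate(labels)
    ]


# Toy ensemble of 4 models evaluated on 4 examples.
preds = [
    [0, 1, 2, 2],  # model 1
    [0, 1, 1, 0],  # model 2
    [0, 2, 2, 1],  # model 3
    [0, 1, 2, 2],  # model 4
]
labels = [0, 1, 2, 3]
scores = tpa_scores(preds, labels)
print(scores)
```

In this toy case the first example is predicted correctly by every model and the last by none, while the two middle examples are points of partial agreement.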
2.3 Cumulative Scores
When inspecting the dynamics of the TPA score on clean data, we see that at the beginning the distribution of TPA is concentrated around 0, and then quickly shifts to 1 as training proceeds (see side panels in Fig. 1(a)). This implies that empirically, data is learned in a specific order by all models in the ensemble. To measure this phenomenon we use the Ensemble Learning Pace (ELP) score defined below, which essentially integrates the TPA score over a set of epochs $\mathcal{E}$:
$ELP(x)=\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}TPA(x,y;\mathcal{F}^e(\mathbb{X}))$ (1)
Here $\mathcal{F}^e(\mathbb{X})$ denotes the ensemble after $e$ epochs of training on the set $\mathbb{X}$. ELP captures both the time of learning by a single model, and its consistency across models. For example, if all the models learned the example early, the score would be high. It would be significantly lower if some of them learned it later than others (see pseudocode in App. C).
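The ELP score can be sketched in a few lines, assuming that it averages the per-epoch TPA over the chosen epochs (the normalization by the number of epochs, like the name `elp_scores` and the layout of `correct`, is an assumption made here for illustration).

```python
from typing import List, Sequence


def elp_scores(correct: Sequence[Sequence[Sequence[int]]]) -> List[float]:
    """Average the per-epoch TPA over all recorded epochs.

    correct[e][i][j] is 1 if model i predicts the given label of example j
    at epoch e, and 0 otherwise (hypothetical layout).
    """
    n_epochs = len(correct)
    n_models = len(correct[0])
    n_examples = len(correct[0][0])
    return [
        sum(correct[e][i][j] for e in range(n_epochs) for i in range(n_models))
        / (n_epochs * n_models)
        for j in range(n_examples)
    ]


# 4 epochs, 2 models, 3 examples:
#   example 0 is learned by both models at epoch 1,
#   example 1 is learned by model 0 at epoch 1 but by model 1 only at epoch 3,
#   example 2 is never learned.
correct = [
    [[0, 0, 0], [0, 0, 0]],
    [[1, 1, 0], [1, 0, 0]],
    [[1, 1, 0], [1, 0, 0]],
    [[1, 1, 0], [1, 1, 0]],
]
elp = elp_scores(correct)
print(elp)
```

The example that both models learn early receives the highest score, the one on which the models disagree for two epochs receives a lower score, and the example that is never learned receives 0.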
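In our study we evaluated two additional cumulative scores of inter-model agreement:
Cumulative loss: $CumLoss(x)=\frac{1}{N|\mathcal{E}|}\sum_{i=1}^{N}\sum_{e\in\mathcal{E}}CE(f_i^e(x),y)$
Above, $CE$ denotes the cross-entropy function, $N$ the number of models in the ensemble, and $\mathcal{E}$ the set of epochs over which the score is accumulated. This score is very similar to ELP, engaging the average of the cross-entropy loss instead of the accuracy indicator $\mathbb{1}_{[f_i^e(x)=y]}$.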
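For comparison with ELP, a cumulative loss score for a single example can be sketched as follows. We assume here that the cross-entropy is averaged over all models and epochs; the function name `cum_loss`, the layout of `probs`, and the clipping constant are hypothetical choices made for this illustration.

```python
import math
from typing import Sequence


def cum_loss(probs: Sequence[Sequence[Sequence[float]]], label: int) -> float:
    """Mean cross-entropy of one example over all epochs and models.

    probs[e][i] is the class-probability vector that model i outputs
    for this example at epoch e (hypothetical layout).
    """
    losses = [
        -math.log(max(model_probs[label], 1e-12))  # clip to avoid log(0)
        for epoch_probs in probs
        for model_probs in epoch_probs
    ]
    return sum(losses) / len(losses)


# Two epochs, two models, binary classification, given label 0.
easy = [[[0.6, 0.4], [0.7, 0.3]], [[0.9, 0.1], [0.95, 0.05]]]
hard = [[[0.2, 0.8], [0.3, 0.7]], [[0.4, 0.6], [0.5, 0.5]]]
print(cum_loss(easy, 0), cum_loss(hard, 0))
```

Unlike ELP, a low value of this score indicates an example that the ensemble fits early and confidently.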
3 The Dynamics of Agreement: New Empirical Observation
In this section we analyze, both theoretically and empirically, how measures of inter-network agreement may indicate the detrimental phenomenon of ‘overfit’. Overfit is a condition that can occur during the training of deep neural networks. It is characterized by the co-occurring decrease of train error or loss and increase of test error or loss. Recall that train loss is the quantity that is being continuously minimized during the training of deep models, while the test error is the quantity linked to generalization error. When these quantities change in opposite directions, training harms the final performance and thus early stopping is recommended.
We begin by showing in Section 3.1 that in an ensemble of linear regression models, overfit and the agreement between models are negatively correlated. When this is the case, an epoch in which the agreement between networks reaches its maximal value is likely to indicate the beginning of overfit.
Our next goal is to examine the relevance of this result to deep learning in practice. Yet inexplicably, overfit rarely occurs in practice when deep learning is used for image recognition. However, when label noise is introduced, significant overfit occurs. Capitalizing on this observation, we report in Section 3.3 that when overfit occurs in the independent training of an ensemble of deep networks, the agreement between the networks starts to decrease.
The approach we describe in Section 4 is motivated by these results: Since it has been observed that noisy data are memorized later than clean data, we hypothesize that overfit occurs when the memorization of noisy labels becomes dominant. This suggests that measuring the dynamics of agreement between networks, which is correlated with overfit as shown below, can be effectively used for the identification of label noise.
3.1 Overfit and Agreement: Theoretical Result
Since deep learning models are not amenable to a rigorous theoretical analysis, and in order to gain computational insight into such general phenomena as overfit, simpler models are sometimes analyzed (e.g. Weinshall and Amir, 2020). Accordingly, in App. A we formally analyze the relation between overfit and inter-model agreement in an ensemble of linear regression models. In this framework, it can be shown that the two phenomena are negatively correlated, namely, an increase in overfit implies a decrease in inter-model agreement. Thus, we prove (under some assumptions) the following result:
Theorem.
Assume an ensemble of models obtained by solving linear regression with gradient descent and random initialization. If overfit increases at time $t$ in all the models in the ensemble, then the agreement between the models in the ensemble at time $t$ decreases.
3.2 Measuring the Agreement between Models
In order to obtain a score that captures the level of disagreement between networks, we inspect more closely the distribution of the TPA score, defined in Section 2.2, over a sample of datapoints, and analyze its dynamics as training proceeds. First, note that if all of the models in the ensemble give identical predictions at each point, the TPA score would be either 0 (when all the networks predict a false label) or 1 (when all the networks predict the correct label). In this case, the TPA distribution is perfectly bimodal, with only two peaks at 0 and 1. If the predictions of the models at each point are independent with mean accuracy $p$, then it can be readily shown that TPA is approximately a (scaled) binomial random variable with a unimodal distribution around $p$.
Empirically, Hacohen et al. (2020) showed that in ensembles of deep models trained on ‘real’ datasets as we use here, the TPA distribution is highly bimodal. Since commonly used measures of bimodality, such as the Pearson bimodality score, are ill-fitted for the discrete TPA distribution, we measure bimodality with the following Bimodal Index (BI) score:
(2) 
The BI score measures how many examples are either correctly or incorrectly classified by all the models in the ensemble, rewarding distributions where points are (roughly) equally divided between 0 and 1. Here we use this score to measure the agreement between networks at epoch $e$.
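The contrast between the two regimes discussed above (identical predictions vs. independent predictions) can be illustrated with a small simulation. This is only a sketch on synthetic draws: the helper names, the ensemble size, and the accuracy value are assumptions, and the fraction of extreme TPA values computed below is a simple proxy for bimodality, not the BI score itself.

```python
import random
from typing import List


def simulate_tpa(n_models: int, n_examples: int, p: float,
                 shared: bool, rng: random.Random) -> List[float]:
    """Simulate per-example TPA scores for an ensemble with mean accuracy p."""
    scores = []
    for _ in range(n_examples):
        if shared:
            # All models give the same prediction: one draw decides for everyone.
            n_correct = n_models if rng.random() < p else 0
        else:
            # Each model is correct independently with probability p.
            n_correct = sum(rng.random() < p for _ in range(n_models))
        scores.append(n_correct / n_models)
    return scores


def extreme_fraction(scores: List[float]) -> float:
    """Fraction of examples on which all models agree (TPA equal to 0 or 1)."""
    return sum(s in (0.0, 1.0) for s in scores) / len(scores)


rng = random.Random(0)
independent = simulate_tpa(10, 2000, 0.7, shared=False, rng=rng)
identical = simulate_tpa(10, 2000, 0.7, shared=True, rng=rng)

frac_independent = extreme_fraction(independent)
frac_identical = extreme_fraction(identical)
mean_independent = sum(independent) / len(independent)
print(frac_identical, frac_independent, mean_independent)
```

With identical predictions every TPA value lies at 0 or 1, whereas with independent predictions almost all of the mass concentrates around the mean accuracy and only a small fraction of the examples reaches either extreme.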
When plotting the Bimodal Index (BI) of the TPA score as a function of the epochs (Fig. 1(a)), we often see two distinct phases. Initially (phase 1), BI is monotonically increasing, namely, both test accuracy and agreement are on the rise. We call it the ‘learning’ phase. Empirically, in this phase most of the clean examples are being learned (or memorized), as can also be seen in the left side panels of Fig. 1(a) (cf. Li et al., 2015). At some point BI may start to decrease, followed by another possible ascent. This is phase 2, in which empirically the memorization of noisy examples dominates the learning (see the right side panels of Fig. 1(a)). This fall and rise is explained by another set of empirical observations, namely that noisy labels are not learned in the same order by the networks of an ensemble (see App. B), which therefore predicts a decline in BI when noisy labels are being learned.
3.3 Overfit and Agreement: Empirical Evidence
Earlier work, investigating the dynamics of learning in deep networks, suggests that examples with noisy labels are learned later (Krueger et al., 2017; Zhang et al., 2017; Arpit et al., 2017; Arora et al., 2019). Since the learning of noisy labels is unlikely to improve the model’s test accuracy, we hypothesize that this may be correlated with the occurrence (or increase) of overfit. The theoretical result in Section 3.1 suggests that this may be correlated with a decrease in the agreement between networks. Our goal now is to test this prediction empirically.
We next outline empirical evidence that this is indeed the case in actual deep models. In order to boost the strength of overfit, we adopt the scenario of recognition with label noise, where overfit is abundant. When overfit indeed occurs, our experiments show that if the test accuracy drops, then the agreement score BI also decreases (see example in Fig. 1(b), bottom). This observation is confirmed with various noise models and different datasets. When overfit does not occur, this decrease is no longer observed (see example in Fig. 1(b), top).
These results suggest that a consistent drop in the BI index of a training set can be used to estimate the occurrence of overfit, and possibly even the beginning of noisy label memorization.
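One simple way to operationalize a "consistent drop" in a per-epoch agreement curve is sketched below. The function `first_consistent_drop` and its `patience` parameter are hypothetical and are not part of DisagreeNet; they only illustrate how such a drop could be located in a recorded BI curve.

```python
from typing import Optional, Sequence


def first_consistent_drop(scores: Sequence[float], patience: int = 3) -> Optional[int]:
    """Return the epoch index just before the first run of `patience`
    consecutive decreases, or None if no such run exists."""
    run = 0
    for e in range(1, len(scores)):
        if scores[e] < scores[e - 1]:
            run += 1
            if run == patience:
                return e - patience  # last epoch before the drop started
        else:
            run = 0
    return None


# Illustrative (made-up) agreement curve: rise, consistent drop, second ascent.
bi_curve = [0.10, 0.30, 0.50, 0.60, 0.58, 0.55, 0.50, 0.52, 0.60]
onset = first_consistent_drop(bi_curve)
print(onset)
```

On this made-up curve the function returns the index of the peak that precedes the drop, which under the hypothesis above would mark the candidate beginning of noisy label memorization.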
4 Dealing with Noisy Labels: Proposed Approach
When dealing with noisy labels, there are essentially three intertwined problems that may require separate treatment:

Noise level estimation: estimate the number of noisy examples.

Noise filtration: flag points whose label is to be removed.

Classifier construction: train a model without the examples that are flagged as noisy.
4.1 DisagreeNet, for Noise Level Estimation and Noise Filtration
Guided by Section 3, we propose a method, denoted DisagreeNet, to estimate the noise level in a training set, which is further used to filter out the noisy examples (see pseudocode in Alg. 1):
1. Compute the ELP score from (1) at each training example.
2. Fit a two-component Beta Mixture Model (BMM) to the ELP distribution (see Fig. 3).
3. Use the intersection between the two components of the BMM fit to divide the data into two groups.
4. Call the group with lower ELP ‘noisy data’.
5. Estimate the noise level by counting the number of datapoints in the noisy group.
As part of our ablation study, we evaluated the two alternative scores defined in Section 2.3: CumLoss and MeanMargin, where Step 2 of DisagreeNet is executed using one of them instead of the ELP score. Results are shown in Section 5.4, revealing the superiority of the ELP score in noise filtration.
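The filtration steps of DisagreeNet can be sketched as follows, starting from precomputed ELP scores. This is a minimal sketch under stated assumptions rather than the reference implementation: the beta mixture is fitted with EM-style iterations that use weighted method-of-moments updates (the fitting procedure is not specified in this section), the intersection is located by a grid search between the component means, and the ELP scores in the usage example are synthetic. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

EPS = 1e-4  # keep scores strictly inside (0, 1) so the beta density stays finite
Component = Tuple[float, float, float]  # (mixing weight, a, b)


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of Beta(a, b) at x, computed in log space for stability."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm)


def fit_moments(xs: Sequence[float], ws: Sequence[float]) -> Tuple[float, float]:
    """Weighted method-of-moments estimate of the beta parameters (a, b)."""
    total = max(sum(ws), 1e-12)
    mean = sum(w * x for w, x in zip(ws, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / total
    mean = min(max(mean, EPS), 1 - EPS)
    common = max(mean * (1 - mean) / max(var, 1e-9) - 1, 1e-3)
    return mean * common, (1 - mean) * common


def fit_bmm(xs: Sequence[float], n_iter: int = 50) -> Tuple[Component, Component]:
    """Fit a two-component beta mixture; returns (low-mean, high-mean) components."""
    resp0 = [1 - x for x in xs]  # soft initialization: low scores lean to component 0
    for _ in range(n_iter):
        resp1 = [1 - r for r in resp0]
        w0 = sum(resp0) / len(xs)
        a0, b0 = fit_moments(xs, resp0)
        a1, b1 = fit_moments(xs, resp1)
        resp0 = []
        for x in xs:
            p0 = w0 * beta_pdf(x, a0, b0)
            p1 = (1 - w0) * beta_pdf(x, a1, b1)
            resp0.append(p0 / (p0 + p1) if p0 + p1 > 0 else 0.5)
    comps = [(w0, a0, b0), (1 - w0, a1, b1)]
    comps.sort(key=lambda c: c[1] / (c[1] + c[2]))  # order by component mean
    return comps[0], comps[1]


def intersection(low: Component, high: Component, steps: int = 1000) -> float:
    """First grid point between the component means where `high` overtakes `low`."""
    (w0, a0, b0), (w1, a1, b1) = low, high
    lo, hi = a0 / (a0 + b0), a1 / (a1 + b1)
    for k in range(steps + 1):
        x = lo + (hi - lo) * k / steps
        if w1 * beta_pdf(x, a1, b1) >= w0 * beta_pdf(x, a0, b0):
            return x
    return hi


def disagreenet_filter(elp: Sequence[float]) -> Tuple[List[int], float]:
    """Return (indices flagged as noisy, estimated noise level)."""
    xs = [min(max(x, EPS), 1 - EPS) for x in elp]
    low, high = fit_bmm(xs)
    threshold = intersection(low, high)
    noisy = [i for i, x in enumerate(xs) if x < threshold]
    return noisy, len(noisy) / len(xs)


random.seed(0)
# Synthetic ELP scores (not real data): 700 "clean" points near 1, 300 "noisy" near 0.
elp = [random.betavariate(8, 2) for _ in range(700)]
elp += [random.betavariate(2, 8) for _ in range(300)]
noisy_idx, noise_level = disagreenet_filter(elp)
print(round(noise_level, 3))
```

Since the threshold is derived from the fitted mixture itself, this sketch needs neither the noise level nor a clean validation set as input, in line with the description of the method above. The number of EM iterations and the grid resolution are implementation details of the sketch only.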
4.2 Classifier construction
Aiming to achieve modular handling of noisy labels, we propose the following two-step approach:
1. Run DisagreeNet.
2. Run a SOTA supervised learning method using the filtered data.
In step 2 it is possible to invoke semi-supervised SOTA methods, using the noisy group as unsupervised data. However, given that semi-supervised learning typically involves additional assumptions (or prior knowledge) as well as high computational complexity (that restricts its applicability to smaller datasets), as discussed in Section 1, we do not consider this scenario here.
5 Empirical Evaluation
We evaluate our method in the following scenarios and tasks:
5.1 Datasets and Baselines (details are deferred to App. F)
Datasets We evaluate our method on a few standard image classification datasets, including CIFAR10 and CIFAR100 (Krizhevsky et al., 2009), Tiny Imagenet (Le and Yang, 2015), subsets of Imagenet (Deng et al., 2009), Clothing1M (Xiao et al., 2015) and Animal10N (Song et al., 2019a); see App. F for details. These datasets were used in earlier work to evaluate the success of noise estimation (Pleiss et al., 2020; Arazo et al., 2019; Li et al., 2020; Liu et al., 2020).
Baselines and comparable methods We report results with the following supervised learning methods for learning from noisy data: DYBMM and DYGMM (Arazo et al., 2019), INCV (Chen et al., 2019), AUM (Pleiss et al., 2020), Bootstrap (Reed et al., 2014), D2L (Ma et al., 2018), MentorNet (Jiang et al., 2018). We also report the results of two absolute baselines: (i) Oracle, which trains the model on the clean dataset; (ii) Random, which trains the model after the removal of a random fraction of the whole data, equivalent to the noise level.
Other methods The following methods use additional prior information, such as a clean validation set or a known level of noise: Co-teaching (Han et al., 2018), O2U (Huang et al., 2019b), LEC (Lee and Chung, 2019) and SELFIE (Song et al., 2019b). Direct comparison does injustice to the previous group of methods, and is therefore deferred to App. H. Another group of methods is excluded from the comparison because they invoke semi-supervised or contrastive learning (e.g., Ortego et al., 2020; Li et al., 2020; Karim et al., 2022; Li et al., 2022; Wei et al., 2020; Yao et al., 2021), which is a different learning paradigm (see discussion of prior art in Section 1).
5.2 Results: Noise Identification
The performance of DisagreeNet is evaluated in two tasks: (i) The detection of noisy examples, shown in Figs. 3(a)-3(b) (see also Figs. 7 and 8 in App. D), where DisagreeNet is seen to outperform the three baselines: AUM, DYGMM and DYBMM. (ii) Noise level estimation, shown in Figs. 3(c)-3(d), where the estimates are good, especially in the case of symmetric noise. We also compare DisagreeNet to MeanMargin and CumLoss, see Fig. 5.
5.3 Results: Supervised Classification
Method/Dataset  CIFAR10 sym  CIFAR100 sym  

Noise level  20%  40%  60%  20%  40%  60% 
random  
Bootstrap  
MentorNet  –  –  
D2L  –  
INCV  
AUM  
DisagreeNet+SL  
oracle 
Method/Dataset  CIFAR10 constant asym  CIFAR100 constant asym  Tiny Imagenet sym  

Noise level  20%  40%  20%  40%  20%  40% 
random  
Bootstrap      
D2L      
DYBMM  
INCV  
AUM  
DisagreeNet+SL  
oracle 
Method/Dataset  animal10N, 8% noise  Clothing1M, 38% noise  
Noise level  noise est  test accuracy  noise est  test accuracy 
CrossEntropy      
AUM      10.7  70.4 
DisagreeNet+SL  7.8  17 
Table 1: Test accuracy (%), average and standard error, in the best epoch of retraining after filtration. Results of benchmark methods (see Section 5.1) are taken from (Pleiss et al., 2020). The top and middle tables show CIFAR10, CIFAR100 and Tiny Imagenet, with simulated noise. The bottom table shows two ‘real noise’ datasets, and includes in addition results of noise level estimation (when applicable). The presumed noise level for these datasets is indicated in the top line following (Huang et al., 2019a; Song et al., 2019b).
DisagreeNet is used to remove noisy examples, after which we train a deep model from scratch using the remaining examples only. We report our main results using the DenseNet architecture, and report results with other architectures in the ablation study. Table 1 summarizes the results for simulated symmetric and asymmetric noise on 5 datasets, and 3 repetitions. It also shows results on 2 real datasets, which are assumed (in previous work) to contain significant levels of ‘real’ label noise. Additional results are reported in App. H, including methods that require additional prior knowledge.
Not surprisingly, dealing with datasets that are presumed to include inherent label noise proved more difficult, and quite different, than dealing with synthetic noise. As claimed in (Ortego et al., 2020), non-malicious label noise does less damage to networks’ generalization than random label noise: on Clothing1M, for example, hardly any overfit is seen during training, even though the data is believed to contain more than 35% noise. Still, here too, DisagreeNet achieves improved accuracy without access to a clean validation set or known noise level (see Table 1). In App. H, Table 5 we compare DisagreeNet to methods that do use such prior knowledge. Surprisingly, we see that DisagreeNet still achieves better results even without using any additional prior knowledge.
5.4 Ablation Study
How many networks are needed?
We report in Table 2 the F1 score for noisy label identification, using DisagreeNet with varying numbers of networks. The main boost in performance provided by the use of additional networks is seen when using DisagreeNet on hard noise scenarios, such as asymmetric noise, or with small amounts of noise.
Dataset  Noise  size of ensemble (number of networks)  

Method  1  2  3  4  7  10  
Cifar10 sym  
Cifar100 sym  
Cifar10 asym  
Cifar100 asym  
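For reference, the F1 score for noisy label identification reported above can be computed from the set of flagged examples and the set of truly corrupted examples, which is available when the noise is simulated. The sketch below uses the standard definition of F1; the function name and the toy index sets are illustrative only.

```python
from typing import Set


def noise_detection_f1(flagged: Set[int], truly_noisy: Set[int]) -> float:
    """F1 of noisy-label detection: harmonic mean of precision and recall."""
    true_positives = len(flagged & truly_noisy)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(flagged)
    recall = true_positives / len(truly_noisy)
    return 2 * precision * recall / (precision + recall)


# Toy index sets: 3 of the 4 flagged examples are among the 5 corrupted ones.
f1 = noise_detection_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(f1)
```

In this toy case precision is 3/4 and recall is 3/5, giving an F1 of 2/3.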
Additional ablation results
Results in App. E, Table 4 indicate robustness to architecture, scheduler, and usage of augmentation, although the standard training procedures achieve the best results. Additionally, we see robustness to changing the backbone architecture of DisagreeNet, using ResNet18 and ResNet50, see Table 3. Finally, in Fig. 5 we compare DisagreeNet using ELP to DisagreeNet using the MeanMargin and CumLoss scores, as defined in Section 2.3. In symmetric noise scenarios all scores perform well, while in asymmetric noise scenarios the ELP score performs much better, as can be seen in Figs. 4(b), 4(d). Additional comparisons of the 3 scores are reported in Apps. E and G.
Method/Dataset  CIFAR10 sym  CIFAR100 sym  

Noise level  20%  40%  60%  20%  40%  60% 
DisagreeNet+SL R18  
DisagreeNet+SL R50 
6 Summary and Discussion
We presented a new empirical observation, that the variability in the predictions of an ensemble of deep networks is much larger when labels are noisy than it is when labels are clean. This observation is used as a basis for a new method for classification with noisy labels, addressing along the way the tasks of noise level estimation, noisy label identification, and classifier construction. Our method is easy to implement, and can be readily incorporated into existing methods for deep learning with label noise, including semi-supervised methods, to improve their outcome.
Importantly, our method achieves this improvement without the additional assumptions that are commonly made by alternative methods: (i) The noise level is not assumed to be known. (ii) There is no need for a clean validation set, which many other methods require, but which is very difficult to acquire when the training set is corrupted. (iii) Almost no additional hyperparameters are introduced.
Acknowledgments
This work was supported by the Israeli Ministry of Science and Technology, and by the Gatsby Charitable Foundation.
References

Unsupervised label noise modeling and loss correction.
In
International conference on machine learning
, pp. 312–321. Cited by: item , Appendix F, §1, §1, §5.1, §5.1.  Finegrained analysis of optimization and generalization for overparameterized twolayer neural networks. In International Conference on Machine Learning, pp. 322–332. Cited by: §3.3.
 A closer look at memorization in deep networks. In International conference on machine learning, pp. 233–242. Cited by: §1, §3.3.
 Deep learning through the lens of example difficulty. Advances in Neural Information Processing Systems 34, pp. 10876–10889. Cited by: §1.
 Mixmatch: a holistic approach to semisupervised learning. Advances in Neural Information Processing Systems 32. Cited by: item .
 Detect noisy label based on ensemble learning. In Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery, H. Meng, T. Lei, M. Li, K. Li, N. Xiong, and L. Wang (Eds.), Cham, pp. 1843–1850. External Links: ISBN 9783030706654 Cited by: §1.
 Understanding and utilizing deep neural networks trained with noisy labels. In International Conference on Machine Learning, pp. 1062–1070. Cited by: item , §5.1.
 Ensemble methods for label noise detection under the noisy at random model. In 2018 7th Brazilian Conference on Intelligent Systems (BRACIS), Vol. , pp. 474–479. External Links: Document Cited by: §1.

Imagenet: a largescale hierarchical image database.
In
2009 IEEE conference on computer vision and pattern recognition
, pp. 248–255. Cited by: Appendix F, §5.1.  Label noise cleaning with an adaptive ensemble method based on noise detection metric. Sensors 20 (23). External Links: Link, ISSN 14248220, Document Cited by: §1.

Robust loss functions under label noise for deep neural networks.
In
Proceedings of the AAAI conference on artificial intelligence
, Vol. 31. Cited by: §1.  Training deep neuralnetworks using a noise adaptation layer. Cited by: §1.
 Let’s agree to agree: neural networks share classification order on real datasets. In Int. Conf. Machine Learning ICML, pp. 3950–3960. Cited by: §1, §1, §2.2, §3.2.
 Coteaching: robust training of deep neural networks with extremely noisy labels. Advances in neural information processing systems 31. Cited by: item , §1, §5.1.
 Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.1.
 O2unet: a simple noisy label detection approach for deep neural networks. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3326–3334. Cited by: Table 1.
 O2Unet: a simple noisy label detection approach for deep neural networks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Vol. , pp. 3325–3333. External Links: Document Cited by: item , Appendix H, §5.1.
 Densenet: implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869. Cited by: §5.1.
 Deep bilevel learning. In Proceedings of the European conference on computer vision (ECCV), pp. 618–633. Cited by: §1.
 Mentornet: learning datadriven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304–2313. Cited by: item , §1, §5.1.
 UNICON: combating label noise through uniform selection and contrastive learning. arXiv. External Links: Document, Link Cited by: §1, §5.1.
 Learning multiple layers of features from tiny images. Online. Cited by: Appendix F, §5.1.
 Deep nets don’t learn via memorization. In Int. Conf. Learning Representations ICLR, Cited by: §3.3.
 Tiny imagenet visual recognition challenge. CS 231N 7 (7), pp. 3. Cited by: Appendix F, §5.1.
 Robust training with ensemble consensus. arXiv preprint arXiv:1910.09792. Cited by: item , Appendix H, §1, §1, §5.1.
 Dividemix: learning with noisy labels as semisupervised learning. arXiv preprint arXiv:2002.07394. Cited by: item , Appendix F, §1, §1, §1, §5.1, §5.1.
 Learning to learn from noisy labeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5051–5059. Cited by: item .
 Selectivesupervised contrastive learning with noisy labels. arXiv. External Links: Document, Link Cited by: §1, §5.1.
 Convergent learning: do different neural networks learn the same representations?. In Adv. Neural Inform. Process. Syst. NeurIPS, pp. 196–212. Cited by: §3.2.
 Early-learning regularization prevents memorization of noisy labels. Advances in neural information processing systems 33, pp. 20331–20342. Cited by: item , Appendix F, §1, §1, §5.1.
 Dimensionality-driven learning with noisy labels. In International Conference on Machine Learning, pp. 3355–3364. Cited by: item , §5.1.
 Decoupling "when to update" from "how to update". Advances in Neural Information Processing Systems 30. Cited by: §1.
 Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment 2021 (12), pp. 124003. Cited by: §1.
 A modern take on the bias-variance tradeoff in neural networks. arXiv preprint arXiv:1810.08591. Cited by: §1.
 SELF: learning to filter noisy labels with self-ensembling. arXiv preprint arXiv:1910.01842. Cited by: item , §1.
 Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth. arXiv preprint arXiv:2010.15327. Cited by: §1.
 Multi-objective interpolation training for robustness to label noise. CoRR abs/2012.04462. External Links: Link, 2012.04462 Cited by: §1, §5.1, §5.3.
 Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1944–1952. Cited by: §1, §2.1.
 Identifying mislabeled data using the area under the margin ranking. Advances in Neural Information Processing Systems 33, pp. 17044–17056. Cited by: item , Appendix F, §1, §1, item ii, §5.1, §5.1, Table 1.
 When deep classifiers agree: analyzing correlations between learning order and image statistics. arXiv preprint arXiv:2105.08997. Cited by: §1.
 Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596. Cited by: item , §5.1.
 A two-stage ensemble method for the detection of class-label noise. Neurocomputing 275, pp. 2374–2383. External Links: ISSN 0925-2312, Document, Link Cited by: §1.
 SELFIE: refurbishing unclean samples for robust deep learning. In ICML, Cited by: §5.1.
 Selfie: refurbishing unclean samples for robust deep learning. In International Conference on Machine Learning, pp. 5907–5915. Cited by: §5.1, Table 1.
 Learning from noisy labels with deep neural networks: a survey. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1.
 Learning from noisy labels by regularized estimation of annotator confusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11244–11253. Cited by: §1.
 Symmetric cross entropy for robust learning with noisy labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 322–330. Cited by: §1.
 Combating noisy labels by agreement: a joint training method with co-regularization. arXiv. External Links: Document, Link Cited by: §1, §5.1.
 Theory of curriculum learning, with convex loss functions. Journal of Machine Learning Research 21 (222), pp. 1–19. Cited by: §3.1.
 Learning from massive noisy labeled data for image classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2691–2699. External Links: Document Cited by: Appendix F, §5.1.
 L_dmi: a novel information-theoretic loss function for training deep nets robust to label noise. Advances in neural information processing systems 32. Cited by: §1.
 Jo-SRC: a contrastive approach for combating noisy labels. arXiv. External Links: Document, Link Cited by: §1, §5.1.
 How does disagreement help generalization against label corruption?. In International Conference on Machine Learning, pp. 7164–7173. Cited by: §1.
 Understanding deep learning requires rethinking generalization. In Int. Conf. Learning Representations ICLR, Cited by: §1, §3.3.
 Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems 31. Cited by: §1.
 Contrast to divide: self-supervised pre-training for learning with noisy labels. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1657–1667. Cited by: item , §1.
Appendix
Appendix A Overfit and inter-model correlation
In this section we formally analyze the relation between two types of scores, which measure either overfit or inter-model agreement. Overfit is a condition that can occur during the training of deep neural networks. It is characterized by the co-occurrence of a decrease in train error or loss, which is continuously minimized during the training of a deep model, and an increase in test error or loss, which is the measure one would ideally like to minimize and which determines the network’s generalization error. An agreement score measures how similar the models are in their predictions.
We start by introducing the model and some notations in Section A.1. In Section A.2 we prove the main result (Prop. A.2): the occurrence of overfit at time s in all the models of the ensemble implies that the agreement between the models decreases.
A.1 Model and notations
Model. We analyze the agreement between an ensemble of models, computed by solving the linear regression problem with Gradient Descent (GD) and random initialization. In this problem, the learner estimates a linear function , where denotes an input vector and the desired output. Given a training set of pairs , let denote the training input, a matrix whose column is , and let row vector denote the output vector whose element is . When solving a linear regression problem, we seek a row vector that satisfies
(3) 
To solve (3) with GD, we perform at each iterative step the following computation:
(4) 
for some random initialization vector where usually , and learning rate . Henceforth we omit the index when self-evident from context.
Additional notations. Below, index denotes a network instance and denotes the test data.
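As a rough illustration of the gradient-descent procedure described above, the following sketch fits a linear model from a small random initialization. It assumes the common least-squares objective (half the squared norm of wX − y), with one example per column of X and a row weight vector; the function name, learning rate, step count, and initialization scale are illustrative choices and are not taken from the paper.

```python
import numpy as np

def gd_linear_regression(X, y, lr=0.005, steps=500, init_scale=0.01, seed=0):
    """Gradient descent on 0.5 * ||w X - y||^2 (assumed objective).

    X: (d, M) array with one training example per column.
    y: (M,) array of targets.
    Returns the learned row vector w of shape (d,).
    """
    rng = np.random.default_rng(seed)
    # Small random initialization (the scale is an assumption).
    w = init_scale * rng.standard_normal(X.shape[0])
    for _ in range(steps):
        residual = w @ X - y      # prediction error on each training example, shape (M,)
        grad = residual @ X.T     # gradient of the assumed loss w.r.t. w, shape (d,)
        w = w - lr * grad         # one gradient step
    return w

# Toy usage: noiseless targets generated by a known weight vector.
rng = np.random.default_rng(1)
d, M = 3, 50
X = rng.standard_normal((d, M))
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X
w_hat = gd_linear_regression(X, y)
```

Running the same function with different seeds (and, where relevant, different training matrices) yields the ensemble of linear models whose agreement is analyzed in this appendix.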

Let denote the training matrix used to train network , while denotes the test matrix. Accordingly,

Let denote the gradient step of , the model learned from training set , and let denote the cross error of on . Then

Let denote the cross gradient:
(5)
After each GD step, the model and the error are updated as follows:
We note that at step and , is a random vector in , and is a random vector in .
Test error random variable. Let denote the number of test examples. Note that is a set of test error vectors in , where the component of the vector captures the test error of model on test example . In effect, it is a sample of size from the random variable . This random variable captures the error over test point of a model computed from a random sample of training examples. The empirical variance of this random variable will be used to estimate the agreement between the models.
Overfit. Overfit occurs at step if
(6) 
Measuring inter-model agreement. In classification problems, bimodality of the ELP score captures the agreement between a set of classifiers, all trained on the same training matrix . Since here we are analyzing a regression problem, we need a comparable score to measure agreement between the predictions of linear functions. This measure is chosen to be the variance of the test error among models. Accordingly, we will measure disagreement by the empirical variance of the test error random variable , aggregated over all test examples .
More specifically, consider an ensemble of linear models trained on set to minimize (3) with gradient steps, where denotes the index of a network instance and the number of network instances. Using the test error vectors of these models , we compute the empirical variance of each element , and sum over the test examples :
Definition 1 (Inter-model disagreement).
The disagreement among a set of linear models at step is defined as follows
(7) 
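The disagreement score described above can be sketched as follows: compute, for each test example, the empirical variance of the test error across models, then sum over test examples. The array layout, the function name, and the use of the population variance (rather than the unbiased sample variance) are assumptions made for illustration.

```python
import numpy as np

def disagreement(test_errors):
    """Sum over test examples of the across-model variance of the test error.

    test_errors: (N, T) array; row i holds model i's error on each of T test points.
    """
    # Variance across the N models, computed separately for each test point.
    per_example_var = np.var(test_errors, axis=0)
    return float(per_example_var.sum())

# Two models that differ on the first test point and agree on the second.
errors = np.array([[0.0, 1.0],
                   [2.0, 1.0]])
score = disagreement(errors)
```

Under this definition the score is zero exactly when all models make identical errors on every test point, and it grows as their errors diverge.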
A.2 Overfit and Inter-Network Agreement
Lemma 1.
Assume that the learning rate is small enough so that we can neglect terms that are . Then in each gradient descent step , overfit occurs iff the gradient step of network is negatively correlated with the cross gradient .
Proof.
Lemma 2.
Assume that and is invertible. If the number of gradient steps is large enough so that is negligible, then
(9) 
Proof.
Starting from (4), we can show that
Since
Given the lemma’s assumptions, this expression can be evaluated and simplified:
(10) 
∎
From (7) it follows that a decrease in inter-model agreement at step , which is implied by increased test variance among models, is indicated by the following inequality:
(11) 
Theorem.
We make the following asymptotic assumptions:
1. All models see the same training set, where .
2. The learning rate is small enough so that we can neglect terms that are .
3. , is invertible, and the number of gradient steps is large enough so that can be neglected.
4. The number of models is large enough so that .
When these assumptions hold, the occurrence of overfit at time in all the models of the ensemble implies that the agreement between the models decreases.
Proof.
(11) can be rearranged as follows
where the last transition follows from . Using assumption 2
(12) 
where
(13) 
and
(14) 
Next, we prove that is approximately 0. We first deduce from assumptions 1 and 4 that
From assumption 3 and Lemma 2, we have that . Thus
From this derivation and (14) we may conclude that . Thus
(15) 
If overfit occurs at time in all the models of the ensemble, then from Lemma 1 and (15). From (11) we may conclude that the inter-model agreement decreases, which concludes the proof.
∎
Appendix B Noisy Labels and Inter-Model Agreement
Here we show empirical evidence that noisy labels are not learned in the same order by an ensemble of networks. To see this, we measure the distance between the TPA distribution, computed separately for clean examples and for noisy examples, and the binomial distribution, which is the expected distribution for i.i.d. classifiers with the same overall accuracy. Specifically, we compute the Wasserstein distance between the agreement distribution at each epoch and the binomials BIN(k,) and BIN(k,), where is the average accuracy on the clean examples, and is the average accuracy on the noisy examples (see Fig. 6). We see that while the distribution of model agreement on clean examples is very far from the binomial distribution, the distribution of model agreement on noisy examples is much closer, which suggests that each DNN memorizes noisy examples at an approximately independent rate.
Appendix C Computing the ELP Score: Pseudo-Code
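The comparison to the binomial baseline can be illustrated with a short sketch. It assumes that agreement is recorded per example as the number of networks (out of k) that predict the assigned label, so that both distributions live on the integers 0..k; the helper names are hypothetical, and on such a unit-spaced support the 1-Wasserstein distance reduces to the summed absolute difference between the two cumulative distributions.

```python
from math import comb

def binomial_pmf(k, p):
    """Probability mass function of BIN(k, p) on the support 0..k."""
    return [comb(k, j) * p**j * (1 - p)**(k - j) for j in range(k + 1)]

def agreement_pmf(counts, k):
    """Empirical distribution of per-example agreement counts (each in 0..k)."""
    pmf = [0.0] * (k + 1)
    for c in counts:
        pmf[c] += 1.0 / len(counts)
    return pmf

def wasserstein_1d(p, q):
    """1-Wasserstein distance between two pmfs on the same support 0..k."""
    dist, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        dist += abs(cdf_p - cdf_q)   # unit spacing between support points
    return dist

# Toy example with k = 4 networks: agreement is fully polarized
# (each example is predicted correctly by all networks or by none).
k = 4
counts = [0, 0, 4, 4]
mean_acc = sum(counts) / (k * len(counts))   # average accuracy of the ensemble
dist_polarized = wasserstein_1d(agreement_pmf(counts, k), binomial_pmf(k, mean_acc))
```

A polarized agreement distribution, as in this toy example, lies far from the binomial with the same mean accuracy, whereas independent classifiers would produce a distance close to zero.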
Appendix D Additional results
D.1 Noise level estimation on additional datasets
D.2 Precision and Recall results
Appendix E Ablation study
Table 4 summarizes experiments relating to architecture, scheduler, and augmentation usage.
Noise level  No change  ResNet34  No Aug  Constant lr 0.01  Lr 0.01 + no Aug 

Alternative scores
We evaluate the two alternative scores defined in Section 2.3, CumLoss and MeanMargin, in which case Step 2 of DisagreeNet is executed using one of them instead of the ELP score. Fig. 9 shows the Probability Distribution Function (PDF) of the three scores, revealing that ELP is more consistently bimodal (especially in the difficult asymmetric case), with modes (peaks) that appear more separable. This benefit translates to superior performance in the noise filtration task (Figs. 4(b), 4(d)). We believe that this empirical observation of increased mode separation is due to a significant difference in the pace of change in agreement values during training between clean and noisy data, in contrast with the pace of change in smoother measures of confidence like Margin and Loss (see App. G). Note that with the easier symmetric noise we do not see this difference, and indeed the other scores exhibit two nicely separated modes, sometimes achieving even better results in noise filtration than ELP (Fig. 7 in App. D.1). However, when comparing the test accuracy after retraining (see App. D), we observe that ELP still achieves superior results.
Appendix F Methodology and Technical details
Datasets. We evaluated our method on a few standard image classification datasets, including Cifar10 and Cifar100 [Krizhevsky et al., 2009] and Tiny ImageNet [Le and Yang, 2015]. Cifar10 and Cifar100 each consist of 60k color images, from 10 and 100 classes respectively. Tiny ImageNet consists of 100,000 images from 200 classes of ImageNet [Deng et al., 2009], downsampled to size 64x64. The Animal10N dataset contains 5 pairs of confusing animals with a total of 55,000 64x64 images. Clothing1M [Xiao et al., 2015] contains 1M clothing images in 14 classes. These datasets were used in earlier work to evaluate the success of noise estimation [Pleiss et al., 2020, Arazo et al., 2019, Li et al., 2020, Liu et al., 2020].
Baseline methods for comparison. We evaluate our method in the context of two approaches designed to deal with label noise: methods that focus on improving supervised learning by identifying noisy labels and removing or reducing their influence on training, and iterative methods that utilize semi-supervised algorithms in order to learn with noisy labels. First approach: DY-BMM and DY-GMM [Arazo et al., 2019] estimate mixture models on the loss to separate noisy and clean examples. INCV [Chen et al., 2019] iteratively filters out noisy examples by using cross-validation. AUM [Pleiss et al., 2020] inserts corrupted examples to determine a filtration threshold, using the mean margin as a score. Bootstrap [Reed et al., 2014] interpolates between the net predictions and the given label. D2L [Ma et al., 2018] follows Bootstrap, and uses the examples' dimensional attributes for the interpolation. Co-teaching [Han et al., 2018] uses two networks, each filtering clean data for training the other. O2U [Huang et al., 2019b] varies the learning rate to identify the noisy samples, based on a loss-based metric. MentorNet [Jiang et al., 2018] trains a mentor network, whose outputs are used as a curriculum for the student network. LEC [Lee and Chung, 2019] trains multiple networks, and uses the intersection of their small-loss examples (using a given noise rate as a threshold) to construct a filtered dataset for the next epoch.
Second approach: SELF [Nguyen et al., 2019] iteratively uses an exponential moving average of a net's predictions over the epochs, compared to the ground truth labels, to filter noisy labels and retrain. Meta-learning [Li et al., 2019] uses a gradient-based technique to update the network's weights with noise tolerance. DivideMix [Li et al., 2020] uses 2 networks to flag examples as noisy or clean with a two-component mixture, after which the SSL technique MixMatch [Berthelot et al., 2019] is used. ELR [Liu et al., 2020] identifies early-learned examples, and uses them to regularize the learning process. C2D [Zheltonozhskii et al., 2022] uses the same algorithms as ELR and DivideMix, starting from a network pretrained with an unsupervised loss.
Technical Details. Unless stated otherwise, we used an SGD optimizer with 0.9 momentum, a learning rate of 0.01, weight decay of 5e-4, and a batch size of 32. We used a cosine-annealing scheduler in all of our experiments and standard augmentation (horizontal flips, random crops) during training. We inspected the effect of different hyperparameters in the ablation study. All of our experiments were conducted on the internal cluster of the Hebrew University, on Ampere A10 GPUs.
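For readers unfamiliar with cosine annealing, the sketch below computes the learning rate per epoch using the standard closed-form schedule, starting from the base learning rate of 0.01 stated above. The total number of epochs (200) and the minimum learning rate (0) are assumptions chosen for illustration; the text does not specify them here.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, base_lr=0.01, min_lr=0.0):
    """Closed-form cosine annealing: decays from base_lr to min_lr over total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

# Learning rate at every epoch boundary, for an assumed 200-epoch run.
schedule = [cosine_annealing_lr(e, total_epochs=200) for e in range(201)]
```

The schedule decays slowly at first, fastest around the midpoint of training, and flattens again toward the end.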
Appendix G Comparing agreement to confidence in noise filtration
While the learning time of an example has been shown to be effective for noise filtration, it fails to separate noisy and clean data that are learned more or less at the same time. To tackle this problem, one needs additional information, beyond the learning time of a single network. When using an ensemble, we can use the TPA score, or else the average probability assigned to the ground truth label (denoted the "correct" logit) by the networks. The latter score conveys the model’s confidence in the ground truth label, and is used by our two alternative scores, CumLoss and MeanMargin.
Going beyond learning time, we propose to look at "how quickly" the agreement value rises from 0 to 1, denoted as the "slope" of the agreement. Since our empirical results indicate that the learning time of noisy data is much more varied, we expect a slower rise in agreement over noisy data as compared to clean data. In our experiments, ELP achieved superior results in noise filtration. We hypothesize that the difference in slope between clean and noisy data may underlie the superiority of ELP in noise filtration.
To check this hypothesis, we compare two scores computed at each data example: ELP and Logits Mean (denoted LM for simplicity). LM is defined as follows:
where is the number of networks, is the number of epochs during training, is a data example and its assigned label, and is the probability assigned by network in epoch to (the ground truth label).
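The contrast between the two scores can be made concrete with a small sketch. It assumes that LM averages, over all networks and epochs, the probability assigned to the given label, and that the agreement-based score averages a binary indicator of whether each network predicts that label (an ELP-like score; the exact ELP definition is the one given in the main text). The array layout and function names are hypothetical.

```python
import numpy as np

def logits_mean(probs, labels):
    """Average probability assigned to the given label, over networks and epochs.

    probs:  (N, E, n, C) softmax outputs of N networks over E epochs,
            for n examples and C classes.
    labels: (n,) integer array of assigned labels.
    Returns an (n,) array with one score per example.
    """
    n = labels.shape[0]
    p_label = probs[:, :, np.arange(n), labels]   # shape (N, E, n)
    return p_label.mean(axis=(0, 1))

def elp_like(probs, labels):
    """Average of the binary indicator 'network predicts the given label'."""
    hits = probs.argmax(axis=-1) == labels        # shape (N, E, n), boolean
    return hits.mean(axis=(0, 1))

# Toy example: 2 networks, 2 epochs, 1 example, 2 classes, assigned label 0.
probs = np.array([
    [[[0.40, 0.60]], [[0.90, 0.10]]],   # network 0: epoch 0, epoch 1
    [[[0.45, 0.55]], [[0.60, 0.40]]],   # network 1: epoch 0, epoch 1
])
labels = np.array([0])
lm_score = logits_mean(probs, labels)
elp_score = elp_like(probs, labels)
```

In the toy example both networks miss the label at the first epoch while already assigning it substantial probability; the binary indicator ignores that partial probability, whereas LM accumulates it.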
In order to compare the pace of increase (slope) of ELP and LM, we conduct the following analysis: We select the two groups of clean and noisy data that are learned (roughly) at the same time by some net in the ensemble, and then compute the average agreement and "correct" logit functions as a function of epoch, separately for clean and noisy data. We then compute the difference per epoch between the noisy and clean average agreement, which we denote as and . Note that and encode the difference in the slope between noisy and clean data, since they begin to rise at (roughly) the same time. Finally, we plot in Fig. 10 the difference between and , recalling that larger indicates stronger separation between the clean and noisy data.
Indeed, our analysis shows that with asymmetric noise, the difference between the agreement slope on clean and noisy data of the ELP score is consistently larger than the slope difference between the average logits on clean and noisy data. This, we believe, is the key reason why ELP outperforms LM in noise filtration. Note that this effect is much less pronounced when using the easier symmetric noise, and indeed, our empirical results show that ELP does not outperform LM significantly in this case.
To conclude, we believe that the signal encoded by the agreement values is stronger than the signal encoded in measures of confidence in the networks’ prediction when true labels are concerned, which explains its capability to correctly classify even some hard clean examples and easy noisy examples as clean and noisy, respectively. This, we believe, is a result of the polarization effect caused by the binary indicators inside TPA, which disregard misleading positive probabilities assigned to noisy labels even before they are learned by the networks.
Appendix H Comparing to methods with different assumptions
Here we compare DisagreeNet to methods that assume a known noise level, O2U [Huang et al., 2019b] and LEC [Lee and Chung, 2019], using a 9-layer CNN for training (with standard hyperparameters as detailed in App. F). Since the noise level is assumed known, we replace the estimate provided by DisagreeNet with the actual noise level. The results are summarized in Table 5. We also compare DisagreeNet to other methods that use prior knowledge, whereas DisagreeNet does not. The results are summarized in Table 6.
Dataset  Noise level  Method  

Dataset  noise type  Noise level  O2U  LEC  DisagreeNet 
Cifar10 sym 