
The Dynamic of Consensus in Deep Networks and the Identification of Noisy Labels

Deep neural networks have incredible capacity and expressibility, and can seemingly memorize any training set. This introduces a problem when training in the presence of noisy labels, as the noisy examples cannot be distinguished from clean examples by the end of training. Recent research has dealt with this challenge by utilizing the fact that deep networks seem to memorize clean examples much earlier than noisy examples. Here we report a new empirical result: for each example, when looking at the time it has been memorized by each model in an ensemble of networks, the diversity seen in noisy examples is much larger than in clean examples. We use this observation to develop a new method for noisy label filtration. The method is based on a statistic of the data, which captures the differences in ensemble learning dynamics between clean and noisy data. We test our method on three tasks: (i) noise amount estimation; (ii) noise filtration; (iii) supervised classification. We show that our method improves over existing baselines in all three tasks using a variety of datasets, noise models, and noise levels. Aside from its improved performance, our method has two other advantages: (i) simplicity, which implies that no additional hyperparameters are introduced; (ii) modularity: the method does not work in an end-to-end fashion, and can therefore be used to clean a dataset for any other future usage.



1 Introduction

Deep neural networks dominate the state of the art in an ever increasing list of application domains, but for the most part, this incredible success relies on very large datasets of annotated examples available for training. Unfortunately, large amounts of high-quality annotated data are hard and expensive to acquire, whereas cheap alternatives (obtained by way of crowd-sourcing or automatic labeling, for example) often introduce noisy labels into the training set. By now there is much empirical evidence that neural networks can memorize almost every training set, including ones with noisy and even random labels (Zhang et al., 2017), which in turn increases the generalization error of the model. As a result, the problems of identifying the existence of label noise and of separating noisy labels from clean ones are becoming more urgent, and therefore attract increasing attention.

Henceforth, we will call the set of examples in the training data whose labels are correct "clean data", and the set of examples whose labels are incorrect "noisy data". While all labels can be eventually learned by deep models, it has been empirically shown that most noisy datapoints are learned by deep models late, after most of the clean data has already been learned (Arpit et al., 2017). Therefore many methods focus on the learning time of an example in order to classify it as noisy or clean, by looking at its loss (Pleiss et al., 2020; Arazo et al., 2019) or loss per epoch (Li et al., 2020) in a single model. However, these methods struggle to correctly classify clean and noisy datapoints that are learned at the same time, or worse, noisy datapoints that are learned early. Additionally, many of these methods work in an end-to-end manner, and thus neither provide noise level estimation nor deliver separate sets of clean and noisy data for novel future usages.

Figure 1: With noisy labels, models show higher disagreement. The noisy examples are not only learned at a later stage, but each model also learns the example at its own different time.

Our first contribution is a new empirical result regarding the learning dynamics of an ensemble of deep networks, showing that the dynamics is different when training with clean data vs. noisy data. The dynamics of clean data has been studied in (Hacohen et al., 2020; Pliushch et al., 2021), where it is reported that different deep models learn examples in the same order and pace. This means that when training a few models and comparing their predictions, an (approximately) binary occurrence is seen at each epoch: either all the networks correctly predict the example’s label, or none of them does. This further implies that for the most part, the distribution of predictions across points is bimodal. Additionally, a variety of studies showed that the bias and variance of deep networks decrease as the networks’ complexity grows (Nakkiran et al., 2021; Neal et al., 2018), providing additional evidence that different deep networks learn data at the same time.

In Section 3 we describe a new empirical result: when training an ensemble of deep models with noisy data, and in contrast to what happens when using clean data, different models learn different datapoints at different times (see Fig. 1). This empirical finding tells us that in an ensemble of networks, the learning dynamics of clean data and noisy data can be distinguished. When training such an ensemble with a mixture of clean and noisy data, the emerging dynamics reflects this observation, as well as the tendency of clean data to be learned faster as previously observed.

In our second contribution, we use this result to develop a new algorithm for noise level estimation and noise filtration, which we call DisagreeNet (see Section 4). Importantly, unlike most alternative methods, our algorithm is simple (it does not introduce any new hyperparameters), parallelizable, easy to integrate with any supervised or semi-supervised learning method and any loss function, and does not rely on prior knowledge of the noise amount. When used for noise filtration, our empirical study (see Section 5) shows the superiority of DisagreeNet as compared to the state of the art, using different datasets, different noise models and different noise levels. When used for supervised classification by way of pre-processing the training set prior to training a deep model, it provides a significant boost in performance, more so than alternative methods.

Relation to prior art

Work on the dynamics of learning in deep models has received increased attention in recent years (e.g., Nguyen et al., 2020; Hacohen et al., 2020; Baldock et al., 2021). Our work adds a new observation to this body of knowledge, which is seemingly unique to an ensemble of deep models (as against an ensemble of other commonly used classifiers). Thus, while there exist other methods that use ensembles to handle label noise (e.g., Sabzevari et al., 2018; Feng et al., 2020; Chai et al., 2021; de Moura et al., 2018), for the most part they cannot take advantage of this characteristic of deep models, and as a result are forced to use additional knowledge, typically the availability of a clean validation set and/or prior knowledge of the noise amount.

Work on deep learning with noisy labels (see Song et al. (2022) for a recent survey) can be coarsely divided into two categories: general methods that use a modified loss or network architecture, and methods that focus on noise identification. The first group includes methods that aim to estimate the underlying noise transition matrix (Goldberger and Ben-Reuven, 2016; Patrini et al., 2017), employ a noise-robust loss (Ghosh et al., 2017; Zhang and Sabuncu, 2018; Wang et al., 2019; Xu et al., 2019), or achieve robustness to noise by way of regularization (Tanno et al., 2019; Jenni and Favaro, 2018). Methods in the second group, which is more in line with our approach, focus more directly on noise identification. Some methods assume that clean examples are usually learned faster than noisy examples (e.g. Liu et al., 2020). Others (Arazo et al., 2019; Li et al., 2020) generate soft labels by interpolating the given labels and the model’s predictions during training. Yet other methods (Jiang et al., 2018; Han et al., 2018; Malach and Shalev-Shwartz, 2017; Yu et al., 2019; Lee and Chung, 2019), like our own, inspect an ensemble of networks, usually in order to transfer information between networks and thus avoid agreement bias.

Notably, we also analyze the behavior of ensembles in order to identify the noisy examples, resembling (Pleiss et al., 2020; Nguyen et al., 2019; Lee and Chung, 2019). But unlike these methods, which track the loss of the networks, we track the dynamics of the agreement between multiple networks over epochs. We then show that this statistic is more effective, and achieves superior results. Additionally (and no less importantly), unlike these works, we do not assume prior knowledge of the noise amount or the presence of a clean validation set, and do not introduce new hyper-parameters in our algorithm.

Recently, the emphasis has somewhat shifted to the use of semi-supervised learning and contrastive learning (Li et al., 2020; Liu et al., 2020; Ortego et al., 2020; Wei et al., 2020; Yao et al., 2021; Zheltonozhskii et al., 2022; Li et al., 2022; Karim et al., 2022). Semi-supervised learning is an effective paradigm for the prediction of missing labels. This paradigm is especially useful when the identification of noisy points cannot be done reliably, in which case it is advantageous to remove labels whose likelihood to be true is not negligible. The effectiveness of semi-supervised learning in providing reliable pseudo-labels for unlabeled points will compensate for the loss of clean labels.

However, semi-supervised learning is not universally practical as it often relies on the extraction of effective representations based on unsupervised learning tasks, which typically introduces implicit priors (e.g., that contrastive loss is appropriate). In contrast, our goal is to reliably identify noisy points, to be subsequently removed. Thus, our method can be easily incorporated into any SOTA method which uses supervised or semi-supervised learning (with or without contrastive learning), and may provide benefit even when semi-supervised learning is not viable.

2 Inter-Network Agreement: Definition and Scores

Measuring the similarity between deep models is not a trivial challenge, as modern deep neural networks are complex functions defined by a huge number of parameters, which are invariant to transformations hidden in the model’s architecture. Here we measure the similarity between deep models in an ensemble by measuring inter-model prediction agreement at each datapoint. Accordingly, in Section 2.2 we describe scores that are based on the state of the networks at each epoch, while in Section 2.3 we describe cumulative scores that integrate these states through many epochs. Practically (see Section 4), our proposed method relies on the cumulative scores, which are shown empirically to provide more accurate results in the noise filtration task. These scores promise added robustness, as it is no longer necessary to identify the epoch at which the score is to be evaluated.

2.1 Preliminaries

Notations Let $f$ denote a deep model, trained with Stochastic Gradient Descent (SGD) for $E$ epochs on training set $\mathbb{X}=\{(x_i,y_i)\}_{i=1}^{M}$, where $x_i$ denotes a single example and $y_i$ its corresponding label. Let $\mathcal{F}=\{f_1,\ldots,f_N\}$ denote an ensemble of $N$ such models, where each model is initialized and trained independently on $\mathbb{X}$.

Noise model We analyze the training dynamics of an ensemble of models in the presence of label noise. Label noise is different from data noise (like image distortion or additive Gaussian noise). Here it is assumed that after the training set is sampled, the labels are corrupted by some noise function $g$, so that each training pair $(x, y)$ becomes $(x, g(y))$. The two most common models of label noise are termed symmetric noise and asymmetric noise (Patrini et al., 2017). In both cases it is assumed that some fixed percentage of the labels are corrupted by $g$. With symmetric noise, $g$ assigns any new label from the set $Y \setminus \{y\}$ with equal probability, where $Y$ denotes the set of labels. With asymmetric noise, $g$ is a deterministic permutation function (see App. F for details). Note that the asymmetric noise model is considered much harder than the symmetric noise model.
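The two noise models can be illustrated with a short Python sketch. This is not the authors' code: the function names, the fixed seed, and in particular the cyclic-shift permutation used for the asymmetric case are assumptions made only for this example (the permutation actually used is described in App. F).

```python
import random

def symmetric_noise(labels, num_classes, noise_fraction, seed=0):
    """Corrupt a fixed fraction of the labels; each corrupted label is replaced
    by a different label drawn uniformly from the remaining classes."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_corrupt = round(noise_fraction * len(labels))
    corrupted = rng.sample(range(len(labels)), n_corrupt)
    for i in corrupted:
        others = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(others)
    return noisy, sorted(corrupted)

def asymmetric_noise(labels, permutation, noise_fraction, seed=0):
    """Corrupt a fixed fraction of the labels with a deterministic map:
    label y always becomes permutation[y]."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_corrupt = round(noise_fraction * len(labels))
    corrupted = rng.sample(range(len(labels)), n_corrupt)
    for i in corrupted:
        noisy[i] = permutation[labels[i]]
    return noisy, sorted(corrupted)

# Toy label vector with 10 classes (hypothetical data).
labels = [i % 10 for i in range(1000)]
sym_labels, sym_idx = symmetric_noise(labels, num_classes=10, noise_fraction=0.2)
shift = [(c + 1) % 10 for c in range(10)]  # assumed permutation: cyclic shift
asym_labels, asym_idx = asymmetric_noise(labels, shift, noise_fraction=0.4)
```

In both functions the returned index list plays the role of the ground-truth "noisy data", which is what a filtration method is later asked to recover.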

2.2 Per-Epoch Agreement Score

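Following Hacohen et al. (2020), we define the True Positive Agreement (TPA) score of an ensemble $\mathcal{F}$ of $N$ models at each datapoint $(x, y)$ after $e$ epochs of training as

$$TPA(x, y; \mathcal{F}_e) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\arg\max f_i^e(x) = y\right],$$

where $f_i^e$ denotes the $i$-th model of the ensemble after $e$ epochs. The TPA score measures the average accuracy of the models in the ensemble, when seeing $x$, after each model has been trained for exactly $e$ epochs on the training set. Note that TPA measures the average accuracy of multiple models on one example, as opposed to the generalization error that measures the average error of one model on multiple examples.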
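As a small illustration of this definition, the TPA score of a single example can be computed from the labels predicted by the ensemble members at a given epoch. The function name and the toy predictions below are assumptions made for this sketch.

```python
def tpa(predictions, label):
    """Fraction of ensemble members whose predicted label equals `label`.

    predictions: predicted labels of the N models for one example,
    all taken at the same training epoch.
    """
    return sum(1 for p in predictions if p == label) / len(predictions)

# Predicted labels of a hypothetical 5-model ensemble for one example
# whose given label is 3.
print(tpa([3, 3, 3, 3, 3], label=3))  # 1.0: all models agree on the label
print(tpa([3, 1, 3, 7, 3], label=3))  # 0.6: the models disagree
print(tpa([1, 1, 7, 7, 2], label=3))  # 0.0: no model predicts the label
```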

2.3 Cumulative Scores

When inspecting the dynamics of the TPA score on clean data, we see that at the beginning the distribution of TPA is concentrated around 0, and then quickly shifts to 1 as training proceeds (see side panels in Fig. 2(a)). This implies that empirically, data is learned in a specific order by all models in the ensemble. To measure this phenomenon we use the Ensemble Learning Pace (ELP) score defined below, which essentially integrates the TPA score over a set of epochs $\mathcal{E}$:

$$ELP(x, y) = \frac{1}{|\mathcal{E}|}\sum_{e \in \mathcal{E}} TPA(x, y; \mathcal{F}_e) \qquad (1)$$

ELP captures both the time of learning by a single model, and its consistency across models. For example, if all the models learned the example early, the score would be high. It would be significantly lower if some of them learned it later than others (see pseudo-code in App. C).
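The following Python sketch illustrates how such a cumulative score behaves on toy data; it is not the pseudo-code of App. C. The helper `history_from_learning_epochs`, the array layout, and the chosen learning epochs are hypothetical.

```python
def elp(pred_history, label):
    """Average of the per-epoch TPA scores of one example.

    pred_history[e][i] is the label predicted by model i at epoch e.
    """
    per_epoch_tpa = [
        sum(1 for p in epoch_preds if p == label) / len(epoch_preds)
        for epoch_preds in pred_history
    ]
    return sum(per_epoch_tpa) / len(per_epoch_tpa)

def history_from_learning_epochs(learn_epochs, n_epochs, label, wrong_label):
    """Toy predictions: model i predicts `label` from epoch learn_epochs[i] on,
    and `wrong_label` before that."""
    return [
        [label if e >= t else wrong_label for t in learn_epochs]
        for e in range(n_epochs)
    ]

# All four models learn the example at epoch 2 (clean-like behaviour).
clean_like = history_from_learning_epochs([2, 2, 2, 2], 10, label=1, wrong_label=0)
# The models learn the example late and at different epochs (noisy-like behaviour).
noisy_like = history_from_learning_epochs([5, 7, 8, 9], 10, label=1, wrong_label=0)

print(elp(clean_like, label=1))  # 0.8
print(elp(noisy_like, label=1))  # 0.275
```

The second example receives a much lower score both because it is learned later and because the models learn it at different times.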

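In our study we evaluated two additional cumulative scores of inter-model agreement:

  1. Cumulative loss:

    $$CumLoss(x, y) = \frac{1}{N|\mathcal{E}|}\sum_{e \in \mathcal{E}}\sum_{i=1}^{N} CE\left(f_i^e(x), y\right)$$

    Above, $CE$ denotes the cross entropy function. This score is very similar to ELP, engaging the average of the cross-entropy loss instead of the accuracy indicator $\mathbb{1}\left[\arg\max f_i^e(x) = y\right]$.

  2. Area under the margin: following (Pleiss et al., 2020), the MeanMargin score is defined as follows

    $$MeanMargin(x, y) = \frac{1}{N|\mathcal{E}|}\sum_{e \in \mathcal{E}}\sum_{i=1}^{N} \left( f_i^e(x)_y - \max_{j \neq y} f_i^e(x)_j \right)$$

    The MeanMargin score is the mean of the ’margin’, the difference between the value of the ground-truth logit (before softmax) and the value of the otherwise maximal logit. Here $f_i^e(x)_j$ denotes the $j$-th logit.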
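Both alternative scores can be sketched in a few lines of Python, assuming access to the raw logits of every model at every epoch. The function names and the `logit_history[e][i]` layout are assumptions made for this example.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against `label` (stable log-sum-exp)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def margin(logits, label):
    """Logit of the given label minus the largest logit among the other classes."""
    best_other = max(z for j, z in enumerate(logits) if j != label)
    return logits[label] - best_other

def cumulative_scores(logit_history, label):
    """Return (CumLoss, MeanMargin) for one example.

    logit_history[e][i] is the list of logits of model i at epoch e.
    """
    all_logits = [logits for epoch in logit_history for logits in epoch]
    cum_loss = sum(cross_entropy(z, label) for z in all_logits) / len(all_logits)
    mean_margin = sum(margin(z, label) for z in all_logits) / len(all_logits)
    return cum_loss, mean_margin

# Two epochs, two models, three classes (hypothetical logits); given label is 0.
history = [
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # early epoch: label not yet learned
    [[2.0, 0.0, -1.0], [3.0, 1.0, 0.0]],  # later epoch: both models predict label 0
]
cum_loss, mean_margin = cumulative_scores(history, label=0)
```

Unlike the accuracy indicator used by ELP, both quantities are continuous, so they also reflect how confident each model is.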

3 The Dynamics of Agreement: New Empirical Observation

In this section we analyze, both theoretically and empirically, how measures of inter-network agreement may indicate the detrimental phenomenon of ’overfit’. Overfit is a condition that can occur during the training of deep neural networks. It is characterized by the co-occurring decrease of train error or loss and the increase of test error or loss. Recall that train loss is the quantity that is being continuously minimized during the training of deep models, while the test error is the quantity linked to generalization error. When these quantities change in opposite directions, training harms the final performance and thus early stopping is recommended.

We begin by showing in Section 3.1 that in an ensemble of linear regression models, overfit and the agreement between models are negatively correlated. When this is the case, an epoch in which the agreement between networks reaches its maximal value is likely to indicate the beginning of overfit.

Our next goal is to examine the relevance of this result to deep learning in practice. Yet inexplicably, at least as far as image datasets are concerned, overfit rarely occurs in practice when deep learning is used for image recognition. However, when label noise is introduced, significant overfit occurs. Capitalizing on this observation, we report in Section 3.3 that when overfit occurs in the independent training of an ensemble of deep networks, the agreement between the networks starts to decrease.

The approach we describe in Section 4 is motivated by these results: Since it has been observed that noisy data are memorized later than clean data, we hypothesize that overfit occurs when the memorization of noisy labels becomes dominant. This suggests that measuring the dynamics of agreement between networks, which is correlated with overfit as shown below, can be effectively used for the identification of label noise.

3.1 Overfit and Agreement: Theoretical Result

Since deep learning models are not amenable to a rigorous theoretical analysis, and in order to gain computational insight into such general phenomena as overfit, simpler models are sometimes analyzed (e.g. Weinshall and Amir, 2020). Accordingly, in App. A we formally analyze the relation between overfit and inter-model agreement in an ensemble of linear regression models. In this framework, it can be shown that the two phenomena are negatively correlated, namely, increase in overfit implies decrease in inter-model agreement. Thus, we prove (under some assumptions) the following result:


Assume an ensemble of models obtained by solving linear regression with gradient descent and random initialization. If overfit increases at time $t$ in all the models in the ensemble, then the agreement between the models in the ensemble at time $t$ decreases.

3.2 Measuring the Agreement between Models

In order to obtain a score that captures the level of disagreement between networks, we inspect more closely the distribution of the TPA score, defined in Section 2.2, over a sample of datapoints, and analyze its dynamics as training proceeds. First, note that if all of the models in the ensemble give identical predictions at each point, the TPA score would be either 0 (when all the networks predict a false label) or 1 (when all the networks predict the correct label). In this case, the TPA distribution is perfectly bimodal, with only two peaks at 0 and 1. If the predictions of the models at each point are independent with mean accuracy $p$, then it can be readily shown that TPA is approximately a (scaled) binomial random variable with a unimodal distribution around $p$.
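The independent case can be checked numerically. The sketch below computes the distribution of TPA for a hypothetical ensemble of 10 models that are each correct independently with probability 0.7; both numbers are chosen only for illustration.

```python
import math

def tpa_pmf_independent(n_models, p):
    """P(TPA = k / n_models) for k = 0..n_models, assuming each model is
    correct independently with probability p."""
    return [
        math.comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(n_models + 1)
    ]

pmf = tpa_pmf_independent(10, 0.7)
mode_k = max(range(len(pmf)), key=pmf.__getitem__)
extreme_mass = pmf[0] + pmf[-1]  # probability that TPA is exactly 0 or 1

print(mode_k / 10)             # 0.7: a single peak at the mean accuracy
print(round(extreme_mass, 3))  # 0.028: little mass at full agreement
```

Under independence almost no probability mass sits at 0 or 1, in contrast to the bimodal distribution obtained when all models agree.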


Empirically, (Hacohen et al., 2020) showed that in ensembles of deep models trained on ‘real’ datasets as we use here, the TPA distribution is highly bimodal. Since commonly used measures of bimodality, such as the Pearson bimodality score, are ill-fitted for the discrete TPA distribution, we measure bimodality with the following Bimodal Index score:


BI measures how many examples are either correctly or incorrectly classified by all the models in the ensemble, rewarding distributions where points are (roughly) equally divided between 0 and 1. Here we use this score to measure the agreement between networks at epoch $e$.


Figure 2: (a) Main panel: BI (y-axis) vs. epochs (x-axis) in an ensemble of 10 DenseNet networks, trained to classify Cifar10 with 20% symmetric noise. Side panels: TPA distribution in 6 epochs (blue - clean examples, orange - noisy ones). (b) Scatter plots of test accuracy vs. train bimodality, measured by BI as defined in (2), where changes in color from blue to yellow correspond with advancing epochs.

When plotting the Bimodal Index (BI) of the TPA score as a function of the epochs (Fig. 2(a)), we often see two distinct phases. Initially (phase 1), BI is monotonically increasing, namely, both test accuracy and agreement are on the rise. We call it the ‘learning’ phase. Empirically, in this phase most of the clean examples are being learned (or memorized), as can also be seen in the left side panels of Fig. 2(a) (cf. Li et al., 2015). At some point BI may start to decrease, followed by another possible ascent. This is phase 2, in which empirically the memorization of noisy examples dominates the learning (see the right side panels of Fig. 2(a)). This fall and rise is explained by another set of empirical observations, that noisy labels are not being learned in the same order by an ensemble of networks (see App. B), which therefore predicts a decline in BI when noisy labels are being learned.

3.3 Overfit and Agreement: Empirical Evidence

Earlier work, investigating the dynamics of learning in deep networks, suggests that examples with noisy labels are learned later (Krueger et al., 2017; Zhang et al., 2017; Arpit et al., 2017; Arora et al., 2019). Since the learning of noisy labels is unlikely to improve the model’s test accuracy, we hypothesize that this may be correlated with the occurrence (or increase) of overfit. The theoretical result in Section 3.1 suggests that this may be correlated with a decrease in the agreement between networks. Our goal now is to test this prediction empirically.

We next outline empirical evidence that this is indeed the case in actual deep models. In order to boost the strength of overfit, we adopt the scenario of recognition with label noise, where the occurrence of overfit is abundant. When overfit indeed occurs, our experiments show that if the test accuracy drops, then the agreement score BI also decreases (see example in Fig. 2(b)-bottom). This observation is confirmed with various noise models and different datasets. When overfit does not occur, the prediction is no longer observed (see example in Fig. 2(b)-top).

These results suggest that a consistent drop in the BI index of some training set can be used to estimate the occurrence of overfit, and possibly even the beginning of noisy label memorization.

4 Dealing with Noisy Labels: Proposed Approach

When dealing with noisy labels, there are essentially three intertwined problems that may require separate treatment:

  1. Noise level estimation: estimate the number of noisy examples.

  2. Noise filtration: flag points whose label is to be removed.

  3. Classifier construction: train a model without the examples that are flagged as noisy.

4.1 DisagreeNet, for Noise Level Estimation and Noise Filtration

Guided by Section 3, we propose a method, denoted DisagreeNet, to estimate the noise level in a training set, which is further used to filter out the noisy examples (see pseudo-code below in Alg. 1):

  1. Compute the ELP score from (1) at each training example.

  2. Fit a two-component beta mixture model (BMM) to the ELP distribution (see Fig. 3).

  3. Use the intersection between the 2 components of the BMM fit to divide the data into two groups.

  4. Call the group with lower ELP ’noisy data’.

  5. Estimate noise level by counting the number of datapoints in the noisy group.

Input: ELP_arr, specifying the ELP score of each point in the training set
Output: Noise level estimate, and the list of indices of noisy points
(G_low, G_high) ← divide the data into two groups using fit_BMM(ELP_arr), where G_low is the group with lower ELP;
noise_indices ← indices of ELP_arr assigned to G_low;
noise_estim ← |noise_indices| / |ELP_arr|;
return noise_estim, noise_indices
Algorithm 1 DisagreeNet
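A minimal Python sketch of these steps is given below, under several assumptions that are not taken from the paper: the two-component beta mixture is fitted with an EM loop whose M-step uses weighted moment matching (an approximation of maximum likelihood), scores are clipped away from 0 and 1 to keep the beta density finite, the intersection is located by a grid search between the two component means, and the noise level is reported as a fraction of the training set. The ELP values are synthetic.

```python
import math
import random

def beta_pdf(x, a, b):
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def fit_bmm(xs, n_iter=50):
    """EM for a two-component beta mixture with a weighted
    method-of-moments M-step (approximate, not exact maximum likelihood)."""
    params = [(2.0, 5.0), (5.0, 2.0)]  # initial low-mean / high-mean components
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w * beta_pdf(x, a, b) for w, (a, b) in zip(weights, params)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: match each component's weighted mean and variance.
        for k in range(2):
            r_sum = max(sum(r[k] for r in resp), 1e-12)
            mean = sum(r[k] * x for r, x in zip(resp, xs)) / r_sum
            var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, xs)) / r_sum
            common = max(mean * (1 - mean) / max(var, 1e-6) - 1, 1e-3)
            params[k] = (mean * common, (1 - mean) * common)
            weights[k] = r_sum / len(xs)
    return weights, params

def disagreenet(elp_arr, eps=1e-3, grid=1000):
    xs = [min(max(s, eps), 1 - eps) for s in elp_arr]  # keep the beta pdf finite
    weights, params = fit_bmm(xs)
    means = [a / (a + b) for a, b in params]
    low, high = (0, 1) if means[0] <= means[1] else (1, 0)
    # Threshold: first grid point between the two means where the high-ELP
    # component's weighted density overtakes that of the low-ELP component.
    threshold = 0.5 * (means[low] + means[high])  # fallback if no crossing is found
    for step in range(grid + 1):
        x = means[low] + (means[high] - means[low]) * step / grid
        x = min(max(x, eps), 1 - eps)
        if weights[high] * beta_pdf(x, *params[high]) >= weights[low] * beta_pdf(x, *params[low]):
            threshold = x
            break
    noise_indices = [i for i, s in enumerate(elp_arr) if s < threshold]
    noise_estim = len(noise_indices) / len(elp_arr)
    return noise_estim, noise_indices

# Synthetic ELP scores: 800 clean-like points (high ELP) followed by
# 200 noisy-like points (low ELP), i.e. a true noise level of 20%.
rng = random.Random(0)
elp_arr = [rng.betavariate(8, 2) for _ in range(800)]
elp_arr += [rng.betavariate(2, 6) for _ in range(200)]
noise_estim, noise_indices = disagreenet(elp_arr)
```

On this toy mixture the estimate should land near the true 20% noise level, with most flagged indices falling in the last 200 positions. Real ELP scores are discrete and need not follow a beta distribution, so the quality of the fit has to be checked on the data at hand.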

Figure 3: ELP distribution, shown separately for the clean data in blue and the noisy data in orange. Superimposed, in blue and orange lines, is the bi-modal BMM fit to the total (not separated) ELP distribution. Panels: Cifar10 with 40% symmetric noise; Cifar100 with 20% asymmetric noise; TinyImagenet with 40% symmetric noise.

As part of our ablation study, we evaluated the two alternative scores defined in Section 2.3: CumLoss and MeanMargin, where Step 2 of DisagreeNet is executed using one of them instead of the ELP score. Results are shown in Section 5.4, revealing the superiority of the ELP score in noise filtration.

4.2 Classifier construction

Aiming to achieve modular handling of noisy labels, we propose the following two-step approach:

  1. Run DisagreeNet.

  2. Run SOTA supervised learning method using the filtered data.

In step 2 it is possible to invoke semi-supervised SOTA methods, using the noisy group as unsupervised data. However, given that semi-supervised learning typically involves additional assumptions (or prior knowledge) as well as high computational complexity (that restricts its applicability to smaller datasets), as discussed in Section 1, we do not consider this scenario here.
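The modular two-step use can be sketched as follows. The function name and the toy dataset are hypothetical; `noise_indices` stands for the output of the filtration step, and the subsequent training is left to whichever supervised method is chosen.

```python
def split_by_noise_flags(dataset, noise_indices):
    """Split a dataset into the examples kept for supervised training
    and the examples flagged as noisy."""
    flagged = set(noise_indices)
    kept = [ex for i, ex in enumerate(dataset) if i not in flagged]
    removed = [ex for i, ex in enumerate(dataset) if i in flagged]
    return kept, removed

# Toy dataset of (example, label) pairs; indices 1 and 3 are flagged as noisy.
dataset = [("img0", 0), ("img1", 2), ("img2", 1), ("img3", 0), ("img4", 2)]
kept, removed = split_by_noise_flags(dataset, noise_indices=[1, 3])
print(kept)     # [('img0', 0), ('img2', 1), ('img4', 2)]
print(removed)  # [('img1', 2), ('img3', 0)]
```

A semi-supervised method could instead keep the examples in `removed` and discard only their labels.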

5 Empirical Evaluation

We evaluate our method in the following scenarios and tasks:

  1. Noise identification (Section 5.2), with two complementary sub-tasks: (i) estimate the noise level in the given dataset; (ii) identify the noisy examples.

  2. Supervised classification (Section 5.3), after the removal of the noisy examples.

5.1 Dataset and Baselines (details are deferred to App. F)

Datasets We evaluate our method on a few standard image classification datasets, including Cifar10 and Cifar100 (Krizhevsky et al., 2009), Tiny Imagenet (Le and Yang, 2015), subsets of Imagenet (Deng et al., 2009), Clothing1M (Xiao et al., 2015) and Animal10N (Song et al., 2019a), see App. F for details. These datasets were used in earlier work to evaluate the success of noise estimation (Pleiss et al., 2020; Arazo et al., 2019; Li et al., 2020; Liu et al., 2020).

Baselines and comparable methods We report results with the following supervised learning methods for learning from noisy data: DY-BMM and DY-GMM (Arazo et al., 2019), INCV (Chen et al., 2019), AUM (Pleiss et al., 2020), Bootstrap (Reed et al., 2014), D2L (Ma et al., 2018), MentorNet (Jiang et al., 2018). We also report the results of two absolute baselines: (i) Oracle, which trains the model on the clean dataset; (ii) Random, which trains the model after the removal of a random fraction of the whole data, equivalent to the noise level.

Other methods The following methods use additional prior information, such as a clean validation set or known level of noise: Co-teaching (Han et al., 2018), O2U (Huang et al., 2019b), LEC (Lee and Chung, 2019) and SELFIE (Song et al., 2019b). Direct comparison does injustice to the previous group of methods, and is therefore deferred to App. H. Another group of methods is excluded from the comparison because they invoke semi-supervised or contrastive learning (e.g., Ortego et al., 2020; Li et al., 2020; Karim et al., 2022; Li et al., 2022; Wei et al., 2020; Yao et al., 2021), which is a different learning paradigm (see discussion of prior art in Section 1).

Figure 4: Panels: (a) Cifar100 sym, (b) Cifar100 asym, (c) Cifar100 sym, (d) Cifar100 asym. (a)-(b) Noise identification: F1 score for the noisy label identification task, using different noise levels (x-axis), with symmetric (a) and asymmetric (b) noise models. Results reflect 3 repetitions involving an ensemble of 10 DenseNets each. (c)-(d) Noise level estimation: different noise levels are evaluated (x-axis), with symmetric (c) and asymmetric (d) noise models (the 3 comparison baselines did not report this estimate).

Implementation details We used DenseNet (Iandola et al., 2014), ResNet-18 and ResNet50 (He et al., 2016) when training on CIFAR-10/100 and Tiny Imagenet, and ResNet50 for Clothing1M and Animal10N.

5.2 Results: Noise Identification

The performance of DisagreeNet is evaluated in two tasks: (i) The detection of noisy examples, shown in Fig. 4(a)-4(b) (see also Figs. 7 and 8 in App. D), where DisagreeNet is seen to outperform the three baselines: AUM, DY-GMM and DY-BMM. (ii) Noise level estimation, shown in Fig. 4(c)-4(d), where the estimate is good especially in the case of symmetric noise. We also compare DisagreeNet to MeanMargin and CumLoss, see Fig. 5.
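For reference, the F1 score of the noisy label identification task treats "noisy" as the positive class. A minimal sketch, with a hypothetical function name and toy index sets, could look as follows.

```python
def noise_identification_f1(flagged, truly_noisy):
    """F1 score of a set of flagged indices against the true noisy indices."""
    flagged, truly_noisy = set(flagged), set(truly_noisy)
    true_positives = len(flagged & truly_noisy)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(flagged)
    recall = true_positives / len(truly_noisy)
    return 2 * precision * recall / (precision + recall)

# Four points are flagged, two of which are among the six truly noisy points:
# precision = 2/4, recall = 2/6, hence F1 = 0.4.
f1 = noise_identification_f1(flagged=[1, 2, 3, 4], truly_noisy=[3, 4, 5, 6, 7, 8])
```

With simulated noise the true noisy indices are known by construction, which is what makes this evaluation possible.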

5.3 Result: Supervised Classifications

Method/Dataset CIFAR-10 sym CIFAR-100 sym
Noise level 20% 40% 60% 20% 40% 60%
Method/Dataset CIFAR-10 constant asym CIFAR-100 constant asym Tiny Imagenet sym
Noise level 20% 40% 20% 40% 20% 40%
Bootstrap - -
D2L - -
Method/Dataset animal10N, 8% noise Clothing1M, 38% noise
Noise level noise est test accuracy noise est test accuracy
Cross-Entropy - -
AUM - - 10.7 70.4
DisagreeNet+SL 7.8 17
Table 1: Test accuracy (%), average and standard error, in the best epoch of retraining after filtration. Results of benchmark methods (see Section 5.1) are taken from (Pleiss et al., 2020). The top and middle tables show CIFAR-10, CIFAR-100 and Tiny Imagenet, with simulated noise. The bottom table shows three ‘real noise’ datasets, and includes in addition results of noise level estimation (when applicable). The presumed noise level for these datasets is indicated in the top line following (Huang et al., 2019a; Song et al., 2019b).

DisagreeNet is used to remove noisy examples, after which we train a deep model from scratch using the remaining examples only. We report our main results using the Densenet architecture, and report results with other architectures in the ablation study. Table 1 summarizes the results for simulated symmetric and asymmetric noise on 5 datasets, and 3 repetitions. It also shows results on 2 real datasets, which are assumed (in previous work) to contain significant levels of ’real’ label noise. Additional results are reported in App. H, including methods that require additional prior knowledge.

Not surprisingly, dealing with datasets that are presumed to include inherent label noise proved more difficult, and quite different, than dealing with synthetic noise. As claimed in (Ortego et al., 2020), non-malicious label noise does less damage to networks’ generalization than random label noise: on Clothing1M, for example, hardly any overfit is seen during training, even though the data is believed to contain more than 35% noise. Still, here too, DisagreeNet achieves improved accuracy without access to a clean validation set or known noise level (see Table 1). In App. H, Table 5 we compare DisagreeNet to methods that do use such prior knowledge. Surprisingly, we see that DisagreeNet still achieves better results even without using any additional prior knowledge.

5.4 Ablation Study

How many networks are needed?

We report in Table 2 the F1 score for noisy label identification, using DisagreeNet with varying numbers of networks. The main boost in performance provided by the use of additional networks is seen when using DisagreeNet on hard noise scenarios, such as asymmetric noise, or with small amounts of noise.

Dataset Noise size of ensemble (number of networks)
Method 1 2 3 4 7 10
Cifar10 sym
Cifar100 sym
Cifar10 asym
Cifar100 asym
Table 2: F1 score of DisagreeNet, using different numbers of models.

Additional ablation results

Results in App. E, Table 4 indicate robustness to architecture, scheduler, and usage of augmentation, although the standard training procedures achieve the best results. Additionally, we see robustness to changing the backbone architecture of DisagreeNet, using ResNet18 and ResNet50, see Table 3. Finally, in Fig. 5 we compare DisagreeNet using ELP to DisagreeNet using the MeanMargin and CumLoss scores, as defined in Section 2.3. In symmetric noise scenarios all scores perform well, while in asymmetric noise scenarios the ELP score performs much better, as can be seen in Figs. 5(b), 5(d). Additional comparisons of the 3 scores are reported in Apps. E and G.

Method/Dataset CIFAR-10 sym CIFAR-100 sym
Noise level 20% 40% 60% 20% 40% 60%
DisagreeNet+SL R18
DisagreeNet+SL R50
Table 3: Final accuracy results when changing the backbone architecture.
Figure 5: Panels: (a) Cifar100 sym, (b) Cifar100 asym, (c) Cifar100 sym, (d) Cifar100 asym. (a)-(b): F1 score (y-axis) for the noisy label identification task, using different noise levels (x-axis), with symmetric (a) and asymmetric (b) noise models. Results with 3 variants of DisagreeNet are shown, based on 3 scores: MeanMargin, ELP and CumLoss. (c)-(d): Error in noise level estimation (y-axis) using different noise levels (x-axis), with symmetric (c) and asymmetric (d) noise models. As can be seen, ELP very significantly outperforms the other 2 scores when handling asymmetric noise.

6 Summary and Discussion

We presented a new empirical observation, that the variability in the predictions of an ensemble of deep networks is much larger when labels are noisy, than it is when labels are clean. This observation is used as a basis for a new method for classification with noisy labels, addressing along the way the tasks of noise level estimation, noisy labels identification, and classifier construction. Our method is easy to implement, and can be readily incorporated into existing methods for deep learning with label noise, including semi-supervised methods, to improve the outcome of the methods.

Importantly, our method achieves this improvement without making additional assumptions, which are commonly made by alternative methods: (i) Noise level is expected to be unknown. (ii) There is no need for a clean validation set, which many other methods require, but which is very difficult to acquire when the training set is corrupted. (iii) Almost no additional hyperparameters are introduced.


This work was supported by the Israeli Ministry of Science and Technology, and by the Gatsby Charitable Foundations.


  • E. Arazo, D. Ortego, P. Albert, N. O’Connor, and K. McGuinness (2019) Unsupervised label noise modeling and loss correction. In International conference on machine learning, pp. 312–321. Cited by: item , Appendix F, §1, §1, §5.1, §5.1.
  • S. Arora, S. Du, W. Hu, Z. Li, and R. Wang (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322–332. Cited by: §3.3.
  • D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. (2017) A closer look at memorization in deep networks. In International conference on machine learning, pp. 233–242. Cited by: §1, §3.3.
  • R. Baldock, H. Maennel, and B. Neyshabur (2021) Deep learning through the lens of example difficulty. Advances in Neural Information Processing Systems 34, pp. 10876–10889. Cited by: §1.
  • D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019) Mixmatch: a holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems 32. Cited by: item .
  • Y. Chai, C. Wu, and J. Zeng (2021) Detect noisy label based on ensemble learning. In Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery, H. Meng, T. Lei, M. Li, K. Li, N. Xiong, and L. Wang (Eds.), Cham, pp. 1843–1850. External Links: ISBN 978-3-030-70665-4 Cited by: §1.
  • P. Chen, B. B. Liao, G. Chen, and S. Zhang (2019) Understanding and utilizing deep neural networks trained with noisy labels. In International Conference on Machine Learning, pp. 1062–1070. Cited by: item , §5.1.
  • K. G. de Moura, R. B.C. Prudencio, and G. D.C. Cavalcanti (2018) Ensemble methods for label noise detection under the noisy at random model. In 2018 7th Brazilian Conference on Intelligent Systems (BRACIS), Vol. , pp. 474–479. External Links: Document Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: Appendix F, §5.1.
  • W. Feng, Y. Quan, and G. Dauphin (2020) Label noise cleaning with an adaptive ensemble method based on noise detection metric. Sensors 20 (23). External Links: Link, ISSN 1424-8220, Document Cited by: §1.
  • A. Ghosh, H. Kumar, and P. S. Sastry (2017) Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI conference on artificial intelligence, Vol. 31. Cited by: §1.
  • J. Goldberger and E. Ben-Reuven (2016) Training deep neural-networks using a noise adaptation layer. Cited by: §1.
  • G. Hacohen, L. Choshen, and D. Weinshall (2020) Let’s agree to agree: neural networks share classification order on real datasets. In Int. Conf. Machine Learning ICML, pp. 3950–3960. Cited by: §1, §1, §2.2, §3.2.
  • B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. Advances in neural information processing systems 31. Cited by: item , §1, §5.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.1.
  • J. Huang, L. Qu, R. Jia, and B. Zhao (2019a) O2u-net: a simple noisy label detection approach for deep neural networks. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3326–3334. Cited by: Table 1.
  • J. Huang, L. Qu, R. Jia, and B. Zhao (2019b) O2U-net: a simple noisy label detection approach for deep neural networks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Vol. , pp. 3325–3333. External Links: Document Cited by: item , Appendix H, §5.1.
  • F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer (2014) Densenet: implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869. Cited by: §5.1.
  • S. Jenni and P. Favaro (2018) Deep bilevel learning. In Proceedings of the European conference on computer vision (ECCV), pp. 618–633. Cited by: §1.
  • L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2018) Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304–2313. Cited by: item , §1, §5.1.
  • N. Karim, M. N. Rizve, N. Rahnavard, A. Mian, and M. Shah (2022) UNICON: combating label noise through uniform selection and contrastive learning. arXiv. External Links: Document, Link Cited by: §1, §5.1.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Online. Cited by: Appendix F, §5.1.
  • D. Krueger, N. Ballas, S. D. Arpit, M. S. Kanwal, T. Maharaj, E. Bengio, A. Fischer, and A. C. Courville (2017) Deep nets don’t learn via memorization. In Int. Conf. Learning Representations ICLR, Cited by: §3.3.
  • Y. Le and X. Yang (2015) Tiny imagenet visual recognition challenge. CS 231N 7 (7), pp. 3. Cited by: Appendix F, §5.1.
  • J. Lee and S. Chung (2019) Robust training with ensemble consensus. arXiv preprint arXiv:1910.09792. Cited by: item , Appendix H, §1, §1, §5.1.
  • J. Li, R. Socher, and S. C. Hoi (2020) Dividemix: learning with noisy labels as semi-supervised learning. arXiv preprint arXiv:2002.07394. Cited by: item , Appendix F, §1, §1, §1, §5.1, §5.1.
  • J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli (2019) Learning to learn from noisy labeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5051–5059. Cited by: item .
  • S. Li, X. Xia, S. Ge, and T. Liu (2022) Selective-supervised contrastive learning with noisy labels. arXiv. External Links: Document, Link Cited by: §1, §5.1.
  • Y. Li, J. Yosinski, J. Clune, H. Lipson, and J. E. Hopcroft (2015) Convergent learning: do different neural networks learn the same representations?. In Adv. Neural Inform. Process. Syst. NeurIPS, pp. 196–212. Cited by: §3.2.
  • S. Liu, J. Niles-Weed, N. Razavian, and C. Fernandez-Granda (2020) Early-learning regularization prevents memorization of noisy labels. Advances in neural information processing systems 33, pp. 20331–20342. Cited by: item , Appendix F, §1, §1, §5.1.
  • X. Ma, Y. Wang, M. E. Houle, S. Zhou, S. Erfani, S. Xia, S. Wijewickrema, and J. Bailey (2018) Dimensionality-driven learning with noisy labels. In International Conference on Machine Learning, pp. 3355–3364. Cited by: item , §5.1.
  • E. Malach and S. Shalev-Shwartz (2017) Decoupling" when to update" from" how to update". Advances in Neural Information Processing Systems 30. Cited by: §1.
  • P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever (2021) Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment 2021 (12), pp. 124003. Cited by: §1.
  • B. Neal, S. Mittal, A. Baratin, V. Tantia, M. Scicluna, S. Lacoste-Julien, and I. Mitliagkas (2018) A modern take on the bias-variance tradeoff in neural networks. arXiv preprint arXiv:1810.08591. Cited by: §1.
  • D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox (2019) Self: learning to filter noisy labels with self-ensembling. arXiv preprint arXiv:1910.01842. Cited by: item , §1.
  • T. Nguyen, M. Raghu, and S. Kornblith (2020) Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth. arXiv preprint arXiv:2010.15327. Cited by: §1.
  • D. Ortego, E. Arazo, P. Albert, N. E. O’Connor, and K. McGuinness (2020) Multi-objective interpolation training for robustness to label noise. CoRR abs/2012.04462. External Links: Link, 2012.04462 Cited by: §1, §5.1, §5.3.
  • G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1944–1952. Cited by: §1, §2.1.
  • G. Pleiss, T. Zhang, E. Elenberg, and K. Q. Weinberger (2020) Identifying mislabeled data using the area under the margin ranking. Advances in Neural Information Processing Systems 33, pp. 17044–17056. Cited by: item , Appendix F, §1, §1, item ii, §5.1, §5.1, Table 1.
  • I. Pliushch, M. Mundt, N. Lupp, and V. Ramesh (2021) When deep classifiers agree: analyzing correlations between learning order and image statistics. arXiv preprint arXiv:2105.08997. Cited by: §1.
  • S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich (2014) Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596. Cited by: item , §5.1.
  • M. Sabzevari, G. Martínez-Muñoz, and A. Suárez (2018) A two-stage ensemble method for the detection of class-label noise. Neurocomputing 275, pp. 2374–2383. External Links: ISSN 0925-2312, Document, Link Cited by: §1.
  • H. Song, M. Kim, and J. Lee (2019a) SELFIE: refurbishing unclean samples for robust deep learning. In ICML, Cited by: §5.1.
  • H. Song, M. Kim, and J. Lee (2019b) Selfie: refurbishing unclean samples for robust deep learning. In International Conference on Machine Learning, pp. 5907–5915. Cited by: §5.1, Table 1.
  • H. Song, M. Kim, D. Park, Y. Shin, and J. Lee (2022) Learning from noisy labels with deep neural networks: a survey. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1.
  • R. Tanno, A. Saeedi, S. Sankaranarayanan, D. C. Alexander, and N. Silberman (2019) Learning from noisy labels by regularized estimation of annotator confusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11244–11253. Cited by: §1.
  • Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey (2019) Symmetric cross entropy for robust learning with noisy labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 322–330. Cited by: §1.
  • H. Wei, L. Feng, X. Chen, and B. An (2020) Combating noisy labels by agreement: a joint training method with co-regularization. arXiv. External Links: Document, Link Cited by: §1, §5.1.
  • D. Weinshall and D. Amir (2020) Theory of curriculum learning, with convex loss functions. Journal of Machine Learning Research 21 (222), pp. 1–19. Cited by: §3.1.
  • T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang (2015) Learning from massive noisy labeled data for image classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 2691–2699. External Links: Document Cited by: Appendix F, §5.1.
  • Y. Xu, P. Cao, Y. Kong, and Y. Wang (2019) L_dmi: a novel information-theoretic loss function for training deep nets robust to label noise. Advances in neural information processing systems 32. Cited by: §1.
  • Y. Yao, Z. Sun, C. Zhang, F. Shen, Q. Wu, J. Zhang, and Z. Tang (2021) Jo-src: a contrastive approach for combating noisy labels. arXiv. External Links: Document, Link Cited by: §1, §5.1.
  • X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, and M. Sugiyama (2019) How does disagreement help generalization against label corruption?. In International Conference on Machine Learning, pp. 7164–7173. Cited by: §1.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In Int. Conf. Learning Representations ICLR, Cited by: §1, §3.3.
  • Z. Zhang and M. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems 31. Cited by: §1.
  • E. Zheltonozhskii, C. Baskin, A. Mendelson, A. M. Bronstein, and O. Litany (2022) Contrast to divide: self-supervised pre-training for learning with noisy labels. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1657–1667. Cited by: item , §1.


Appendix A Overfit and inter-model correlation

In this section we formally analyze the relation between two types of scores, which measure either overfit or inter-model agreement. Overfit is a condition that can occur during the training of deep neural networks. It is characterized by the co-occurring decrease of train error or loss, which is continuously minimized during the training of a deep model, and the increase of test error or loss, which is the ideal measure one would have liked to minimize and which determines the network’s generalization error. An agreement score measures how similar the models are in their predictions.

We start by introducing the model and some notations in Section A.1. In Section A.2 we prove the main result (Prop. A.2): the occurrence of overfit at time s in all the models of the ensemble implies that the agreement between the models decreases.

A.1 Model and notations

Model. We analyze the agreement between an ensemble of models, computed by solving the linear regression problem with Gradient Descent (GD) and random initialization. In this problem, the learner estimates a linear function that maps an input vector to the desired output. Given a training set of input-output pairs, the training input is a matrix whose columns are the input vectors, and the output is a row vector whose elements are the corresponding desired outputs. When solving a linear regression problem, we seek a row vector of weights that satisfies


To solve (3) with GD, we perform at each iterative step the following computation:


for some random initialization vector where usually , and learning rate . Henceforth we omit the index when self evident from context.
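As a point of reference, the least-squares objective and its gradient descent update can be written in the following standard form. The symbols are assumed notation chosen to match the verbal description above: $\mathbf{w}$ is the row vector of weights, $X$ the training matrix with one example per column, $\mathbf{y}$ the row vector of desired outputs, $\mu$ the learning rate, and $s$ the step index. The factor $\tfrac{1}{2}$ is a convention.

```latex
\begin{align*}
\hat{\mathbf{w}} &= \arg\min_{\mathbf{w}} L(\mathbf{w}),
  \qquad L(\mathbf{w}) = \tfrac{1}{2}\,\lVert \mathbf{w}X - \mathbf{y} \rVert^2, \\
\mathbf{w}^{s+1} &= \mathbf{w}^{s} - \mu\,\nabla L(\mathbf{w}^{s}),
  \qquad \nabla L(\mathbf{w}^{s}) = (\mathbf{w}^{s}X - \mathbf{y})\,X^\top .
\end{align*}
```

The first line corresponds to the objective referred to as (3) and the second to the update referred to as (4), up to the choice of symbols.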

Additional notations. Below, index denotes a network instance and denotes the test data.

  • Let denote the training matrix used to train network , while denotes the test matrix. Accordingly,

  • Let denote the gradient step of , the model learned from training set , and let denote the cross error of on . Then

  • Let denote the cross gradient:


After each GD step, the model and the error are updated as follows:

We note that at step and , is a random vector in , and is a random vector in .

Test error random variable. Let denote the number of test examples. Note that is a set of test error vectors in , where the component of the vector captures the test error of model on test example . In effect, it is a sample of size from the random variable . This random variable captures the error over test point of a model computed from a random sample of training examples. The empirical variance of this random variable will be used to estimate the agreement between the models.

Overfit. Overfit occurs at step if


Measuring inter-model agreement. In classification problems, bi-modality of the ELP score captures the agreement between a set of classifiers, all trained on the same training matrix . Since here we are analyzing a regression problem, we need a comparable score to measure agreement between the predictions of linear functions. This measure is chosen to be the variance of the test error among models. Accordingly, we will measure disagreement by the empirical variance of the test error random variable , averaged over all test examples .

More specifically, consider an ensemble of linear models trained on set to minimize (3) with gradient steps, where denotes the index of a network instance and the number of network instances. Using the test error vectors of these models , we compute the empirical variance of each element , and sum over the test examples :

Definition 1 (Inter-model DisAgreement.).

The disagreement among a set of linear models at step is defined as follows


A.2 Overfit and Inter-Network Agreement

Lemma 1.

Assume that the learning rate is small enough so that we can neglect terms that are . Then in each gradient descent step , overfit occurs iff the gradient step of network is negatively correlated with the cross gradient .


Starting from (6)
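The argument can be sketched as follows under assumed notation, for a single network: $\mathbf{e}^{s} = \mathbf{w}^{s}X_T - \mathbf{y}_T$ is its test error vector at step $s$, $\mathbf{g}^{s}$ its gradient step (so that $\mathbf{w}^{s+1} = \mathbf{w}^{s} - \mu\,\mathbf{g}^{s}$), and $\Delta W^{s} = \mathbf{e}^{s}X_T^\top$ its cross gradient on the test data. These symbols are illustrative and may differ from the paper's.

```latex
\begin{align*}
\mathbf{e}^{s+1} &= \mathbf{w}^{s+1}X_T - \mathbf{y}_T
   = \mathbf{e}^{s} - \mu\,\mathbf{g}^{s}X_T, \\
\lVert \mathbf{e}^{s+1} \rVert^2
  &= \lVert \mathbf{e}^{s} \rVert^2
     - 2\mu\,\mathbf{g}^{s}X_T(\mathbf{e}^{s})^\top + O(\mu^2)
   = \lVert \mathbf{e}^{s} \rVert^2
     - 2\mu\,\mathbf{g}^{s}(\Delta W^{s})^\top + O(\mu^2).
\end{align*}
```

Neglecting the $O(\mu^2)$ term, the test error increases exactly when $\mathbf{g}^{s}(\Delta W^{s})^\top < 0$, that is, when the gradient step and the cross gradient are negatively correlated.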


Lemma 2.

Assume that and is invertible. If the number of gradient steps is large enough so that is negligible, then


Starting from (4), we can show that


Given the lemma’s assumptions, this expression can be evaluated and simplified:


From (7) it follows that a decrease in inter-model agreement at step , which is implied by increased test variance among models, is indicated by the following inequality:


We make the following asymptotic assumptions:

  1. All models see the same training set, where .

  2. The learning rate is small enough so that we can neglect terms that are .

  3. , is invertible, and the number of gradient steps is large enough so that can be neglected.

  4. The number of models is large enough so that .

When these assumptions hold, the occurrence of overfit at time in all the models of the ensemble implies that the agreement between the models decreases.


(11) can be rearranged as follows

where the last transition follows from . Using assumption 2






Next, we prove that is approximately 0. We first deduce from assumptions 1 and 4 that

From assumption 3 and Lemma 2, we have that . Thus

From this derivation and (14) we may conclude that . Thus


If overfit occurs at time in all the models of the ensemble, then from Lemma 1 and (15). From (11) we may conclude that the inter-model agreement decreases, which concludes the proof.

Appendix B Noisy Labels and Inter-Model Agreement

(a) Empirical distribution - Clean examples
(b) Empirical distribution - Noisy examples
(c) Binomial distribution - Clean examples
(d) Binomial distribution - Noisy examples
Figure 6: (a)-(d): The empirical distribution of the agreement values over epochs (x-axis: epochs, y-axis: agreement, color code: blue for low and red for high). Clearly, the distribution of noisy examples resembles the binomial distribution with matched expected value, while the clean examples distribution is far from binomial. (e) Wasserstein distance between the binomial distribution and the empirical agreement distribution over epochs.

Here we show empirical evidence that noisy labels are not learned in the same order by an ensemble of networks. To see this, we measure the distance between the TPA distribution, computed separately for clean examples and for noisy examples, and the binomial distribution, which is the expected distribution of iid classifiers with the same overall accuracy. Specifically, we compute the Wasserstein distance between the agreement distribution at each epoch and the binomials BIN(k,) and BIN(k,), where is the average accuracy on the clean examples, and is the average accuracy on the noisy examples, see Fig. 6. We see that while the distribution of model agreement on clean examples is very far from the binomial distribution, the distribution of model agreement on noisy examples is much closer, which means that each DNN memorizes noisy examples at an approximately independent rate.
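The comparison at a single epoch can be sketched as below. This is a minimal illustration, assuming agreement is recorded as the number of networks (out of k) that predict the given label for each example; the function names and the input layout are hypothetical. If agreement is measured as a fraction instead of a count, the distance scales by 1/k.

```python
from math import comb

def binomial_pmf(k, p):
    """PMF of BIN(k, p) over the support 0..k."""
    return [comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(k + 1)]

def wasserstein_to_binomial(agree_counts, k):
    """1-Wasserstein distance between the empirical agreement distribution
    and the binomial distribution with matched mean.

    agree_counts[i] -- number of networks (0..k) that predict the given label
                       of example i at one epoch (hypothetical input).
    """
    n = len(agree_counts)
    empirical = [0.0] * (k + 1)
    for c in agree_counts:
        empirical[c] += 1.0 / n
    # Average accuracy over networks and examples sets the binomial parameter.
    p = sum(agree_counts) / (n * k)
    reference = binomial_pmf(k, p)
    # On a shared integer support, W1 equals the L1 distance between the CDFs.
    distance, cdf_emp, cdf_ref = 0.0, 0.0, 0.0
    for j in range(k + 1):
        cdf_emp += empirical[j]
        cdf_ref += reference[j]
        distance += abs(cdf_emp - cdf_ref)
    return distance
```

Calling this once per epoch, separately on the clean and on the noisy examples, gives two distance curves of the kind described above.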

Appendix C Computing the ELP Score: Pseudo-Code

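Input: Training dataset with N examples, potentially noisy, network architecture, batch size, learning rate, number of networks K, number of epochs E
Output: Array containing the ELP score for each data point
/* Compute agreement during training */
Initialize K networks with different random initializations of the architecture;
Initialize agreement_arr[E,N] to zeros;
for e = 1, ..., E do
       for k = 1, ..., K do
             sample a mini-batch of indices from [1...N] uniformly;
              /* Gets a mini batch */
             compute the predictions of network k on the mini-batch;
             compute the loss with respect to the predictions and the given labels;
             agreement_arr[e, mini-batch] += indicator that each prediction equals the given label;
              /* Store whether the network k predicted correctly on the examples at epoch e */
       end for
end for
agreement_arr = agreement_arr / K;   /* normalization */
ELP = mean of agreement_arr over the epoch axis;   /* mean over epoch-axis */
Algorithm 2 Computing the ELP score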
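The final two steps of the algorithm (normalization over networks, then the mean over epochs) can be sketched in plain Python as follows. The sketch assumes the per-network, per-epoch correctness indicators have already been recorded during training; the function name and the nested-list layout are illustrative, and the training loop itself is omitted.

```python
def elp_scores(correct):
    """ELP score per example from recorded predictions.

    correct[k][e][i] -- 1 if network k predicts the given label of example i
                        at epoch e, else 0 (hypothetical layout).
    """
    num_networks = len(correct)
    num_epochs = len(correct[0])
    num_examples = len(correct[0][0])
    scores = []
    for i in range(num_examples):
        total = sum(correct[k][e][i]
                    for k in range(num_networks)
                    for e in range(num_epochs))
        # Average over networks (agreement per epoch), then over epochs.
        scores.append(total / (num_networks * num_epochs))
    return scores
```

An example that every network predicts correctly from the first epoch scores close to 1, while one that is memorized late, or at different times by different networks, scores lower.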

Appendix D Additional results

D.1 Noise level estimation on additional datasets

(a) CIFAR10 F1
(b) CIFAR10 estimation error
(c) CIFAR10 asym F1
(d) CIFAR10 asym estimation error
(e) TinyImagenet F1
(f) TinyImagenet estimation error
Figure 7: Additional results on CIFAR10 and Tiny Imagenet.

D.2 Precision and Recall results

(a) CIFAR10
(b) CIFAR100
(c) CIFAR10 asym
(d) CIFAR100 asym
(e) CIFAR10
(f) CIFAR100
(g) CIFAR10 asym
(h) CIFAR100 asym
Figure 8: Noisy label identification. Top: precision; bottom: recall.

Appendix E Ablation study

Table 4 summarizes experiments relating to architecture, scheduler, and augmentation usage.

Noise level No change ResNet34 No Aug Constant lr 0.01 Lr 0.01 + no Aug
Table 4: F1 score for Cifar100 with 2 levels of symmetric noise. Different ablation conditions are marked in columns: ResNet34 indicates a change of architecture, no Aug indicates that image augmentations are not used, and lr 0.01 indicates that no scheduler or learning rate drop are used during training.

Alternative scores

We evaluate the two alternative scores defined in Section 2.3: CumLoss and MeanMargin, in which case Step 2 of DisagreeNet is executed using one of them instead of the ELP score. Fig. 9 shows the Probability Distribution Function (PDF) of the three scores, revealing that ELP is more consistently bimodal (especially in the difficult asymmetric case), with modes (peaks) that appear more separable. This benefit translates to superior performance in the noise filtration task (Figs. 


(a) Cifar10 with 40% Symmetric noise
(b) Cifar100 with 20% Asymmetric noise
Figure 9: Distribution of the CumLoss, MeanMargin and ELP scores during training. ELP remains bimodal even for hard noise models, where the other scores become unimodal.

We believe that this empirical observation, of increased mode separation, is due to a significant difference in the pace of change in agreement values during training between clean and noisy data, in contrast with the pace of change in smoother measures of confidence like Margin and Loss (see App. G). Note that with the easier symmetric noise, we do not see this difference, and indeed the other scores exhibit two nicely separated modes, sometimes achieving even better results in noise filtration than ELP (Fig. 7 in App. D.1). However, when comparing the test accuracy after retraining (see App. D), we observe that ELP still achieves superior results.

Appendix F Methodology and Technical details

Datasets. We evaluated our method on a few standard image classification datasets, including Cifar10 and Cifar100 [Krizhevsky et al., 2009] and Tiny ImageNet [Le and Yang, 2015]. Cifar10/100 consist of 60k color images of 10 and 100 classes respectively. Tiny ImageNet consists of 100,000 images from 200 classes of ImageNet [Deng et al., 2009], downsampled to size 64x64. The Animal10N dataset contains 5 pairs of confusing animals with a total of 55,000 64x64 images. Clothing1M [Xiao et al., 2015] contains 1M clothing images in 14 classes. These datasets were used in earlier work to evaluate the success of noise estimation [Pleiss et al., 2020, Arazo et al., 2019, Li et al., 2020, Liu et al., 2020].

Baseline methods for comparison. We evaluate our method in the context of two approaches designed to deal with label noise: methods that focus on improving supervised learning by identifying noisy labels and removing or reducing their influence on training, and methods that work iteratively and utilize semi-supervised algorithms in order to learn with noisy labels. First approach: DY-BMM and DY-GMM [Arazo et al., 2019] estimate mixture models on the loss to separate noisy and clean examples. INCV [Chen et al., 2019] iteratively filters out noisy examples by using cross-validation. AUM [Pleiss et al., 2020] inserts corrupted examples to determine a filtration threshold, using the mean margin as a score. Bootstrap [Reed et al., 2014] interpolates between the net predictions and the given label. D2L [Ma et al., 2018] follows Bootstrap, and uses the examples' dimensional attributes for the interpolation. Co-teaching [Han et al., 2018] uses two networks, each filtering clean data for the training of the other. O2U [Huang et al., 2019b] varies the learning rate to identify the noisy samples, based on a loss-based metric. MentorNet [Jiang et al., 2018] trains a mentor network, whose outputs are used as a curriculum for the student network. LEC [Lee and Chung, 2019] trains multiple networks, and uses the intersection of their small-loss examples (using a given noise rate as a threshold) to construct a filtered dataset for the next epoch.

Second approach: SELF [Nguyen et al., 2019] iteratively uses an exponential moving average of a net's predictions over the epochs, compared to the ground truth labels, to filter noisy labels and retrain. Meta learning [Li et al., 2019] uses a gradient-based technique to update the network's weights with noise tolerance. DivideMix [Li et al., 2020] uses 2 networks to flag examples as noisy or clean with a two-component mixture, after which the SSL technique MixMatch [Berthelot et al., 2019] is used. ELR [Liu et al., 2020] identifies early-learned examples, and uses them to regularize the learning process. C2D [Zheltonozhskii et al., 2022] uses the same algorithm as ELR and DivideMix, and uses a net pretrained with an unsupervised loss.

Technical details. Unless stated otherwise, we used an SGD optimizer with 0.9 momentum and a learning rate of 0.01, weight decay of 5e-4, and batch size of 32. We used a Cosine-annealing scheduler in all of our experiments and used standard augmentation (horizontal flips, random crops) during training. We inspected the effect of different hyperparameters in the ablation study. All of our experiments were conducted on the internal cluster of the Hebrew University, on GPU type Ampere A10.
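For concreteness, the learning-rate schedule can be sketched in closed form. This assumes the common single-cycle cosine-annealing formula with a minimum rate of 0 and one decay period spanning all of training; the text does not state these details, and `num_epochs` and `min_lr` are hypothetical parameters.

```python
import math

def cosine_annealing_lr(epoch, num_epochs, base_lr=0.01, min_lr=0.0):
    """Learning rate at a given epoch under a single-cycle cosine schedule.

    base_lr=0.01 matches the stated learning rate; min_lr=0.0 and the
    single decay period are assumptions.
    """
    cosine = math.cos(math.pi * epoch / num_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cosine)
```

Under these assumptions the rate starts at 0.01, passes through 0.005 at the midpoint of training, and decays to 0 at the final epoch.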

Appendix G Comparing agreement to confidence in noise filtration

While the learning time of an example has been shown to be effective for noise filtration, it fails to separate noisy and clean data that are learned more or less at the same time. To tackle this problem, one needs additional information, beyond the learning time of a single network. When using an ensemble, we can use the TPA score, or else the average probability assigned to the ground truth label (denoted the "correct" logit) by the networks. The latter score conveys the model’s confidence in the ground truth label, and is used by our two alternative scores - CumLoss and MeanMargin.

Going beyond learning time, we propose to look at "how quickly" the agreement value rises from 0 to 1, denoted as the "slope" of the agreement. Since our empirical results indicate that the learning time of noisy data is much more varied, we expect a slower rise in agreement over noisy data as compared to clean data. In our experiments, ELP achieved superior results in noise filtration. We hypothesize that the difference in slope between clean and noisy data may underlie the superiority of ELP in noise filtration.

To check this hypothesis, we compare two scores computed at each data example: ELP and Logits Mean (denoted LM for simplicity). LM is defined as follows:

where is the number of networks, is the number of epochs during training, is a data example and its assigned label, and is the probability assigned by network in epoch to (the ground truth label).
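Read this way, LM averages the probability assigned to the given label over all networks and all epochs. A minimal sketch under that reading follows, assuming the probabilities have been recorded during training; the function name and the nested-list layout are illustrative.

```python
def logits_mean(probs):
    """LM score per example.

    probs[k][e][i] -- probability that network k assigns, at epoch e, to the
                      given label of example i (hypothetical layout).
    """
    num_networks = len(probs)
    num_epochs = len(probs[0])
    num_examples = len(probs[0][0])
    return [
        sum(probs[k][e][i]
            for k in range(num_networks)
            for e in range(num_epochs)) / (num_networks * num_epochs)
        for i in range(num_examples)
    ]
```

The only difference from ELP in this sketch is that the averaged quantity is a probability in [0, 1] in place of a binary correctness indicator.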

In order to compare between the pace of increase (slope) of ELP and LM, we conduct the following analysis: We select the two groups of clean and noisy data that are learned (roughly) at the same time by some net in the ensemble, and then compute the average agreement and "correct" logit functions as a function of epoch, separately for clean and noisy data. We then compute the difference per epoch between the noisy and clean average agreement, which we denote as and . Note that and encode the difference in the slope between noisy and clean data, since they begin to rise at (roughly) the same time. Finally, we plot in Fig. 10 the difference between and , recalling that larger indicates stronger separation between the clean and noisy data.

(a) 20% symmetric noise
(b) 40% symmetric noise
(c) 20% Asymmetric noise
(d) 40% Asymmetric noise
(e) 20% symmetric noise
(f) 40% symmetric noise
(g) 20% Asymmetric noise
(h) 40% Asymmetric noise
Figure 10: Top: x-axis is the learning time of the chosen clean and noisy data; y-axis is the difference between and . We see that for most of the training, the difference is positive, implying that ELP provides stronger separation between these groups. Bottom: -axis is the difference between and ; -axis is the ratio between the amount of clean and noisy data. The color represents the learning time of the groups. These graphs show that while at the end of the training the difference between and is negative, implying that LM would be better at separating these groups, these are in fact very small sets of data, as most of the data is learned by some network at an earlier stage of the training.

Indeed, our analysis shows that with asymmetric noise, the difference between the agreement slope on clean and noisy data of the ELP score is consistently larger than the agreement slope difference between the average logits on clean and noisy data. This, we believe, is the key reason as to why ELP outperforms LM in noise filtration. Note that this effect is much less pronounced when using the easier symmetric noise, and indeed, our empirical results show that ELP does not outperform LM significantly in this case.

To conclude, we believe that the signal encoded by the agreement values is stronger than the signal encoded in measures of confidence in the networks’ prediction when true labels are concerned, which explains its capability to classify correctly even some hard-clean examples and easy-noisy examples as clean and noise (respectively). This, we believe, is a result of the polarization effect caused by the binary indicators inside TPA, which disregard misleading positive probabilities assigned to noisy labels even before they are learned by the networks.

Appendix H Comparing to methods with different assumptions

Here we compare DisagreeNet to methods that assume a known noise level, O2U [Huang et al., 2019b] and LEC [Lee and Chung, 2019], using a 9-layered CNN for the training (with standard hyperparameters as detailed in App. F). Since the noise level is assumed known, we replace the estimation provided by DisagreeNet with the actual noise level. The results are summarized in Table 5. We also compare DisagreeNet to other methods that use prior knowledge, where DisagreeNet does not use prior knowledge. The results are summarized in Table 6.

Dataset Noise level Method
Dataset - noise type Noise level O2U LEC DisagreeNet
Cifar10 sym