Deep learning has achieved outstanding results in several important classification problems, using large and well-curated training data [12, 3]. However, most of the interesting data sets available in society are orders of magnitude larger but poorly curated, which means that the data may contain acquisition and labelling mistakes that can lead to poor generalisation. Therefore, one of the important challenges of the field is the development of methods that can cope with such noisy label data sets. Lately, researchers have fostered the development of this field by studying controlled synthetic label noise and discovering theories or methodologies that can then be applied to real-world noisy data sets.
The types of label noise investigated thus far can be classified into two categories: closed-set and open-set noise. Although the terms ‘closed-set’ and ‘open-set’ were coined only recently by Wang et al., who introduced the open-set noisy label problem, the closed-set noisy label problem had been studied extensively long before that. When handling closed-set label noise, the majority of the learning algorithms assume a fixed set of training labels [14, 22]. In this setting, some of the training samples are annotated with an incorrect label, while their true class is present in the training label set. These mistakes can be completely random, where labels are flipped arbitrarily to an incorrect class, or consistent, when the annotator is genuinely confused about the annotation of a particular sample. A less studied type of label noise is the open-set noisy label problem, where some data observations are sampled incorrectly, such that their true annotation is not contained within the set of known training labels. An exaggerated example of such a setting is the presence of a horse image in the training set of a cats vs dogs binary classifier. As evident from their definitions, these two types of label noise are mutually exclusive, i.e., a given noisy label cannot be closed-set and open-set at the same time.
It is quite easy to substantiate that both open-set and closed-set noise are likely to co-occur in real-world data sets. For instance, recent methods for large scale data collection propose the use of querying commercial search engines (e.g., Google Images), where the search keywords serve as the labels of the queried images. It is evident from Figure 1 that collecting images using such methods can lead to both open-set and closed-set noise. However, thus far, no systematic study with controlled label noise has been presented, where the training data set contains both types of label noise simultaneously. Even though there have been papers that evaluated their proposed methods on both [13, 24], the training data sets have been exclusively corrupted with either closed-set noise or open-set noise, but never in a combined fashion.
In this paper, we formulate a novel benchmark evaluation to address the noisy label learning problem that consists of a combination of closed-set and open-set noise. This proposed benchmark evaluation is defined by three variables: 1) the total proportion of label noise in the training set, represented by $\rho \in [0,1]$; 2) the proportion of closed-set noise within the set of samples containing noisy labels, denoted by $\omega \in [0,1]$ (this implies that a fraction $\rho\omega$ of the samples of the entire data set have a closed-set noisy label and a fraction $\rho(1-\omega)$ of the samples of the entire data set have an open-set noisy label); and 3) the source of open-set noisy label data. Note that this setup generalises both types of label noise, as it collapses to one of the two noise types when $\omega \in \{0, 1\}$.
The state-of-the-art (SOTA) approaches that aim to solve the closed-set noisy label problem focus on methods that identify the samples that were incorrectly annotated and update their labels with semi-supervised learning (SSL) approaches for the next training iteration. This strategy is likely to fail in the open-set problem because it assumes that there exists a correct class in the training labels for every training sample, which is not the case. On the other hand, the main approach addressing the open-set noise problem targets the identification of noisy samples to reduce their weights in the learning process. Such a strategy is inefficient in closed-set problems because the closed-set noisy label samples are still very meaningful during the SSL stage. Hence, to be robust in scenarios where both closed-set and open-set noise samples are present, the learning algorithm must be able to identify the type of label noise affecting each training sample, and then either update the label, if it is closed-set noise, or reduce its weight, if it is open-set noise. To achieve this, we propose a new learning algorithm, called EvidentialMix (EDM); see Fig. 2. The key contributions of our proposed algorithm are the following:
- EDM is able to accurately distinguish between clean, open-set and closed-set samples, thus allowing it to exercise different learning mechanisms depending on the type of label noise. In comparison, previous methods [14, 24] can only separate clean samples from noisy ones, but not closed-set noise samples from open-set noise samples.
- We show that our method learns better feature representations than previous methods, as evident from the t-SNE plot in Figure 4, where our method has a unique cluster for each of the known classes and another separate cluster for open-set samples. In comparison, previous methods are shown to largely overfit the open-set samples and incorrectly cluster them into one of the known classes.
- We experimentally show that EDM produces classification accuracy that is comparable to or better than that of previous methods on various label noise rates (including the extreme case where ).
2 Prior Work
There is increasing interest in training deep learning classifiers with noisy labels. For closed-set noise, Reed et al. proposed one of the first approaches that uses a transition matrix to learn how labels switch between different classes. The use of transition matrices has been further explored in many different ways [18, 5], but none of them show competitive results, likely because they do not include mechanisms to identify and handle samples containing noisy labels. Data augmentation approaches have been successfully explored by closed-set noisy label methods, where the idea is that augmentation can naturally increase the training robustness to label noise. Meta-learning is another technique explored in closed-set noisy label problems [15, 20], but the need for clean validation sets or artificial new training tasks makes this technique relatively unexplored. The use of curriculum learning (CL) for closed-set problems has been explored to re-label training samples dynamically during training, based on their loss values. This approach has been extended with the training of multiple models [17, 25] that aim to focus the training on samples with small loss that are inconsistently classified by the multiple models. Recently, the explicit identification of noisy samples using negative learning has been explored by Kim et al., with competitive results. Another important approach in handling label noise is model ensembling, as proposed by Tarvainen et al. The use of robust generative classifiers (RoG) to improve the performance of discriminative classifiers has been explored by Lee et al., where they build an ensemble of robust linear discriminative models using features extracted from several layers of the trained discriminative model; in principle, this approach has the potential to improve the performance of any method and has been successfully tested in closed-set and open-set scenarios.
Learning with open-set noisy labels has only recently been explored by Wang et al., where the idea is to identify the samples containing noisy labels and reduce their weight in the training process, since they almost certainly belong to a class not represented in the training set. Given that they are the only ones to explicitly address open-set noise, their method is the main baseline for that problem.
The current SOTA closed-set noisy label approaches are SELF and DivideMix, both of which combine several of the approaches described above. SELF combines model ensembling, re-labelling, noisy sample identification, and data augmentation, while DivideMix uses multiple model training, noisy sample identification, and data augmentation. These two approaches are likely to be vulnerable to open-set noise since they assume that training samples must belong to one of the training classes, an assumption that does not hold for open-set noise.
3.1 Problem Definition
We define the training set as , where the RGB image ( represents the image lattice), the set of training labels is denoted by , which forms the standard basis of dimensions, (, representing a multi-class problem). Note that is the noisy label for and the hidden clean label is represented by .
For the closed-set noise problem with noise rate , we assume that is labelled as
with probability, and with probability , with representing a random function that picks one of the labels in following a particular distribution parameterised by .
For the open-set noise problem with noise rate , we need to define a new training set (with ), where the label set for is represented by (with ) – this means that the images in no longer have labels in . In such open-set problem, a proportion of samples is drawn with with , while a proportion of samples are obtained with with .
The combined closed-set and open-set problem with rates is defined by mixing the two types of noise above. More specifically, of the training set contains images annotated with , while images are sampled as with label , and images belong to and labelled with .
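As a worked example of the combined setting, the sketch below counts how many training samples fall into each group. It assumes the total noise rate is written `rho` and the closed-set proportion `omega` (symbol names chosen here for illustration), and that fractional counts are rounded to the nearest integer, which the text does not specify.

```python
def noise_split(num_samples: int, rho: float, omega: float) -> dict:
    """Count clean, closed-set noisy and open-set noisy samples.

    rho   : assumed name for the total proportion of noisy samples.
    omega : assumed name for the share of noisy samples that are closed-set.
    """
    if not (0.0 <= rho <= 1.0 and 0.0 <= omega <= 1.0):
        raise ValueError("rho and omega must lie in [0, 1]")
    num_noisy = round(rho * num_samples)      # all noisy samples
    num_closed = round(omega * num_noisy)     # wrong label, known class
    num_open = num_noisy - num_closed         # image from an unknown class
    return {
        "clean": num_samples - num_noisy,
        "closed": num_closed,
        "open": num_open,
    }


# CIFAR-10-sized training set with 60% noise, a quarter of it closed-set.
print(noise_split(50000, 0.6, 0.25))
# {'clean': 20000, 'closed': 7500, 'open': 22500}
```

Setting `omega` to 1 or 0 recovers the pure closed-set and pure open-set problems, respectively.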
3.2 Noise Classification
The main impediment when dealing with this problem is the need to identify closed-set and open-set noisy samples, since the method must deal with them differently. One possible way of dealing with this problem is by associating closed-set samples with high losses computed from confident but incorrect classification, and open-set samples with uncertain classification. To achieve this, we propose the use of the subjective logic (SL) loss function, which relies on the theory of evidential reasoning and SL to quantify classification uncertainty. The SL loss makes use of the Dirichlet distribution to represent subjective opinions, encoding belief and uncertainty. A network trained with the SL loss tries to learn parameters of a predictive posterior as a Dirichlet density function for the classification of the training samples. The resulting output for a given sample is considered as the evidence for the classification of that sample over a set of class labels. Figure 3 shows the per-sample loss distribution of training samples from a network trained using the SL loss. The separation between clean, closed-set and open-set samples is easy to capture.
Our proposed EvidentialMix simultaneously trains two networks: NetS, which uses the SL loss, and NetD, which uses the SSL training mechanism and the DivideMix (DM) loss. Broadly speaking, the ability of the SL loss to estimate classification uncertainty allows NetS to divide the training set into clean-set, open-set and closed-set samples. The predicted clean-set and closed-set samples are then used to train NetD using MixMatch, while the predicted open-set samples are discarded for that epoch. Following this, NetD re-labels the entire training data set (including predicted open-set samples), which is then used to train NetS.
As NetS iteratively learns from the labels predicted by NetD, it gets better at splitting the data into the three sets. This is so because the labels from NetD become more accurate over the training process given that it is only trained on clean and closed-set samples, and never on predicted open-set samples. The two networks thus complement each other to produce accurate classification results for the combined closed-set and open-set noise problem. A detailed explanation is outlined below, while Alg. 1 delineates the full training algorithm.
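The alternation between the two networks can be sketched as a single epoch of control flow. This is only a skeleton under stated assumptions: the four callables are hypothetical placeholders for the steps described above, not functions from the authors' code.

```python
def edm_epoch(dataset, split_with_net_s, train_net_d, relabel_with_net_d, train_net_s):
    """One co-training epoch; every callable is a hypothetical stand-in."""
    # 1) NetS splits the training set using its SL loss values.
    clean, closed, open_set = split_with_net_s(dataset)
    # 2) NetD learns from clean and closed-set samples only;
    #    predicted open-set samples are left out for this epoch.
    train_net_d(clean, closed)
    # 3) NetD re-labels the whole training set, open-set samples included.
    new_labels = relabel_with_net_d(dataset)
    # 4) NetS is trained on the re-labelled data.
    train_net_s(dataset, new_labels)
    return {"clean": clean, "closed": closed, "open": open_set}
```

Passing the steps in as callables keeps the sketch independent of any particular deep learning framework.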
Algorithm 1 trains NetD and NetS, each with its own set of parameters, and both return a vector of logits over the training classes. In the warm-up stage (see line WarmUp()), we train both models for a limited number of epochs using the cross entropy loss for NetD, where the class probability is obtained by applying a softmax activation to the logits, and the following (SL) loss for NetS:
where for class , with.
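To illustrate why an evidential loss can separate the three groups, the sketch below implements the sum-of-squares loss from the evidential deep learning paper cited for the SL loss. It is an assumption that this matches the exact form used here; the ReLU evidence function and the omission of any extra regulariser are also assumptions.

```python
def sl_loss(logits, one_hot):
    """Per-sample sum-of-squares evidential loss (assumed form)."""
    evidence = [max(0.0, z) for z in logits]   # assumed ReLU evidence
    alpha = [e + 1.0 for e in evidence]        # Dirichlet parameters
    strength = sum(alpha)                      # Dirichlet strength
    loss = 0.0
    for a, y in zip(alpha, one_hot):
        p = a / strength                       # expected class probability
        # squared error plus variance of the Dirichlet mean
        loss += (y - p) ** 2 + p * (1.0 - p) / (strength + 1.0)
    return loss


confident_right = sl_loss([50.0, 0.0, 0.0], [1, 0, 0])   # behaves like a clean sample
no_evidence = sl_loss([0.0, 0.0, 0.0], [1, 0, 0])        # behaves like an open-set sample
confident_wrong = sl_loss([50.0, 0.0, 0.0], [0, 1, 0])   # behaves like a closed-set sample
print(confident_right < no_evidence < confident_wrong)   # True
```

With no evidence for any class the loss lands between the two confident cases, which is the mid-range behaviour attributed to open-set samples.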
The classification of samples into the clean, closed-set, and open-set groups is performed using the SL loss values from Eq. (2) for the entire training set. More specifically, we take the set of per-sample losses and fit a multi-component Gaussian mixture model (GMM) using the Expectation-Maximization algorithm. The idea we explore in this paper lies in the fact that the model output for the clean samples will tend to be confident and at the same time agree with its original label, producing a small loss. The model output for closed-set noise samples will also tend to be confident but at the same time disagree with its original label, generating a large loss value. The model output for open-set noise samples, however, will not produce a confident output, resulting in a loss value that is neither large nor small. Therefore, the multi-component GMM will capture each of these sets, where the clean probability is the posterior probability of the set of Gaussian components with small means (i.e., small losses), the closed-set posterior probability is computed from the components with large means (i.e., large losses), and the open-set probability is the posterior of the remaining components with intermediate means. Using these posteriors, we can build the set of clean samples, containing the samples whose clean posterior probability is larger than the other two probabilities, and the closed-set, containing the samples whose closed-set posterior probability is larger than the other two.
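The grouping of mixture components can be sketched as follows, assuming the GMM has already been fitted by any EM implementation. The parameters `tau_clean` and `tau_closed` are hypothetical mean thresholds introduced for illustration, and ties are sent to the open-set group, which is also an assumption.

```python
import math


def component_posteriors(loss, weights, means, stds):
    """Posterior of each 1-D Gaussian component given one loss value."""
    dens = [
        w * math.exp(-0.5 * ((loss - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    ]
    total = sum(dens)
    if total == 0.0:                       # far from every component
        return [1.0 / len(dens)] * len(dens)
    return [d / total for d in dens]


def split_samples(losses, weights, means, stds, tau_clean, tau_closed):
    """Assign each sample index to the clean, closed-set or open-set group."""
    clean_c = [g for g, m in enumerate(means) if m < tau_clean]
    closed_c = [g for g, m in enumerate(means) if m > tau_closed]
    open_c = [g for g, m in enumerate(means) if tau_clean <= m <= tau_closed]
    clean, closed, open_set = [], [], []
    for i, loss in enumerate(losses):
        post = component_posteriors(loss, weights, means, stds)
        p_clean = sum(post[g] for g in clean_c)
        p_closed = sum(post[g] for g in closed_c)
        p_open = sum(post[g] for g in open_c)
        if p_clean > max(p_closed, p_open):
            clean.append(i)
        elif p_closed > max(p_clean, p_open):
            closed.append(i)
        else:
            open_set.append(i)             # includes ties (assumption)
    return clean, closed, open_set
```

For example, with four equally weighted components at means 0.05, 0.15, 0.5 and 0.9, thresholds 0.3 and 0.7 place the first two components in the clean group, the third in the open-set group and the last in the closed-set group.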
Next, we train NetD with the clean set and the closed-set defined above. A mini-batch is sampled from each of the two sets, and each sample in each set is augmented a fixed number of times. The average classification probabilities for the clean and closed-set samples are then computed using the augmented samples, which, after temperature sharpening (denoted by TempSharpen), form the ‘new’ samples and labels for the clean and closed-set samples, respectively. The last stage before stochastic gradient descent (SGD) is the MixMatch process, where samples from the two new sets are linearly combined to form the mixed clean and closed-set mini-batches. SGD minimises the DM loss that combines the following functions:
where denotes the weight of the loss associated with the unlabelled data set and weights the regularisation loss. The loss terms in Eq. (3) are defined by
where represents the model output of all labels for input . After training NetD, we train the NetS model by minimising the SL loss Eq. (1) with an updated training set, represented by , formed by , for that produces the new labels .
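The NetD update described above can be sketched with three small helpers: temperature sharpening, the MixMatch-style linear combination, and a weighted sum of loss terms. The term definitions follow the published DivideMix formulation as commonly described (cross entropy on the labelled part, squared error on the unlabelled part, and a uniform-prior regulariser); treating the clean set as labelled and the closed-set as unlabelled, and the names `lambda_u` and `lambda_r`, are assumptions.

```python
import math
import random


def temp_sharpen(probs, temperature):
    """Raise probabilities to 1/T and renormalise (T < 1 sharpens)."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def mixup(x1, y1, x2, y2, alpha, rng=random):
    """Linearly combine two samples, staying closer to the first one."""
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def dm_loss(labelled, unlabelled, lambda_u, lambda_r):
    """Weighted sum of three terms; inputs are lists of (target, prediction)."""
    eps = 1e-12
    ce = -sum(
        t * math.log(p + eps) for tgt, pred in labelled for t, p in zip(tgt, pred)
    ) / len(labelled)
    mse = sum(
        (t - p) ** 2 for tgt, pred in unlabelled for t, p in zip(tgt, pred)
    ) / len(unlabelled)
    preds = [pred for _, pred in labelled + unlabelled]
    num_classes = len(preds[0])
    mean_pred = [sum(p[c] for p in preds) / len(preds) for c in range(num_classes)]
    prior = 1.0 / num_classes
    # Penalise average predictions that drift away from a uniform prior.
    reg = sum(prior * math.log(prior / (m + eps)) for m in mean_pred)
    return ce + lambda_u * mse + lambda_r * reg
```

Taking the larger of `lam` and `1 - lam` keeps each mixed sample dominated by its first input, so the clean and closed-set mini-batches retain their identity after mixing.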
The inference for a test sample relies entirely on the NetD classifier, as follows: .
We train an 18-layer PreAct ResNet (for both NetS and NetD) using stochastic gradient descent (SGD) with momentum of 0.8, weight decay of 0.0005 and batch size of 64. The learning rate is 0.02 for WarmUp and for the first 100 epochs of the main training process, and is reduced to 0.002 afterwards. The WarmUp stage lasts for 10 and 30 epochs for NetD and NetS, respectively, where NetD is trained with a cross-entropy loss (i.e., in Eq. (5) using the unchanged training set) while NetS is trained with the subjective logic loss in Eq. (1), also using the unchanged training set. After WarmUp, both models are trained for epochs. Similar to DivideMix, the number of data augmented samples is , the sharpening temperature is , the MixMatch parameter is , and the regularisation weight for the DM loss in Eq. (3) is . However, unlike DivideMix, which manually selects the weight of the unlabelled loss based on the noise rate, we set this weight to 25 for all our experiments. For the GMM, we use components, , and since these values produced stable results.
Following the prior work on closed-set and open-set noise problems [14, 24, 13], we conduct our experiments on the CIFAR-10 data set for closed-set noise [14, 13], and include the CIFAR-100 (small-scale) and ImageNet32 (large-scale) data sets for the open-set noise scenario [24, 13]. CIFAR-10 has 10 classes with 5000 32×32 pixel training images per class (forming a total of 50000 training images), and a testing set with 10000 32×32 pixel images with 1000 images per class. CIFAR-100 has 100 classes with 500 32×32 pixel training images per class, and ImageNet32 is a down-sampled variant of ImageNet, with 1281149 images and 1000 classes, resized to 32×32 pixels per image. All data sets above have been set up with curated labels, so below we introduce a new noisy label benchmark evaluation that combines closed-set and open-set synthetic label noise.
4.1 Combined Open-set and Closed-set Noisy Label Benchmark
The proposed benchmark is defined by the rate of label noise in the experiment, denoted by $\rho$, and the proportion of closed-set noise in the label noise, denoted by $\omega$. The closed-set label noise is simulated by randomly selecting a fraction $\rho\omega$ of the training samples from CIFAR-10 and symmetrically shuffling their labels, similarly to the synthetic label noise used in DivideMix. The open-set label noise is simulated by randomly selecting a fraction $\rho(1-\omega)$ of the training images from CIFAR-10 and replacing them with images randomly selected from either CIFAR-100 or ImageNet32, where a CIFAR-10 label is randomly assigned to each one of these images. Results are based on the classification accuracy on the clean testing set from CIFAR-10, using the benchmark proposed above. We also show a comparison of the sample distribution between EDM and other related approaches in the feature space using t-SNE, and the effectiveness of EDM to separate clean, closed-set and open-set noisy samples.
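A minimal sketch of this corruption procedure is given below, working on label lists and image indices rather than real images. Several choices are assumptions made for illustration: `rho` and `omega` name the noise rate and the closed-set proportion, closed-set labels are flipped to a different class so that the count is exact (symmetric noise in prior work may also allow the original class), and open-set images are drawn with replacement from the auxiliary pool.

```python
import random


def corrupt_labels(labels, num_classes, open_pool_size, rho, omega, seed=0):
    """Return (noisy_labels, open_source, kind) for a combined-noise split.

    open_source[i] is the index of the auxiliary image that replaces
    training image i, or None when the original image is kept.
    """
    rng = random.Random(seed)
    n = len(labels)
    num_noisy = round(rho * n)
    num_closed = round(omega * num_noisy)
    noisy_idx = rng.sample(range(n), num_noisy)
    closed_idx = noisy_idx[:num_closed]
    open_idx = noisy_idx[num_closed:]

    noisy_labels = list(labels)
    open_source = [None] * n
    kind = ["clean"] * n
    for i in closed_idx:
        # Closed-set: keep the image, assign a wrong but known class.
        wrong = [c for c in range(num_classes) if c != labels[i]]
        noisy_labels[i] = rng.choice(wrong)
        kind[i] = "closed"
    for i in open_idx:
        # Open-set: swap in an auxiliary image and give it a random known label.
        open_source[i] = rng.randrange(open_pool_size)
        noisy_labels[i] = rng.randrange(num_classes)
        kind[i] = "open"
    return noisy_labels, open_source, kind
```

Keeping the `kind` list alongside the labels makes it straightforward to measure afterwards how well a method separates the three groups.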
4.2 Related Approaches for Comparison
We compare our proposed approach with the three methods listed below:
- DivideMix (we used the publicly available code provided by the authors of the paper to produce our results) is the current SOTA method that converts the problem of closed-set noisy label learning into a semi-supervised learning problem. It follows a multiple-model approach that splits the training data into clean and noisy subsets by fitting a 2-component Gaussian Mixture Model (GMM) to the loss values of the training samples at each epoch. Next, the framework discards the labels of the predicted noisy samples and uses MixMatch to train the model.
- ILON (as the authors did not make their code publicly available, we implemented their method from scratch and trained a Siamese network of 18-layer PreAct ResNet to produce our results) introduces the open-set noisy label learning problem, where the proposed approach is based on an iterative solution that re-weights the samples based on the outlier measure of the Local Outlier Factor Algorithm (LOF).
- RoG (also run with the publicly available code provided by the authors) builds an ensemble of generative classifiers formed from features extracted from multiple layers of the ResNet model. The authors of RoG tested their approach on both closed-set noise and open-set noise separately, which makes it an important benchmark to consider in our combined setup.
4.3 Results and Discussion
Classification accuracy: Table 1 shows the results computed from the benchmark evaluation of the proposed EDM, in comparison with the results of RoG , ILON , and DivideMix . The evaluation relies on different rates of label noise () and closed-set noise () using CIFAR-100 and ImageNet32 as open-set data sets. Results show that our method EDM outperforms all the competing approaches for 17 out of the 20 noise settings and is a close second on the remaining 3. For , EDM produces better results than all competing methods for all values of and choice of open-set data sets, with an improvement of more than over the next best method in some cases. On the other hand, both RoG and ILON perform significantly worse than EDM and DivideMix, particularly for where the difference in accuracy is over 15% in some cases. In general, RoG and ILON are observed to perform worse when the proportion of closed-set noise increases, while the converse is true for DivideMix and EDM. It is also apparent that EDM is more robust to open-set noise than DivideMix, as evident from classification results when is small.
Feature representations: We show the t-SNE plots  in Fig. 4 for all the methods for the case where total noise rate , with the closed-set proportion using CIFAR-100 and ImageNet32 as open-set data sets. In particular, the features for all methods are extracted from the last layer of the models (in our case, we use the features from the NetD, which is the one used for classification, as explained in Sec. 3.3). In the visualisation, the brown samples are from the open-set data sets, while all other colours represent the true CIFAR-10 classes. This clearly shows that our proposed EDM is quite effective at separating open-set samples from the other clean and closed-set training samples, while DivideMix and ILON largely overfit these samples, as evident from the spread of open-set samples around CIFAR-10 classes. Interestingly, RoG also shows good separation, but with apparently more complex distributions than EDM.
Noise classification: Fig. 5 shows the distribution of loss function values for the clean, open-set and closed-set samples in cases where the noise rate , with closed-set rates using samples from both CIFAR-100 and ImageNet32 as open-set noise. From these graphs, it is clear that the SL loss in Eq. (2) is able to successfully distinguish samples from each one of the three sets above, even when only one of the noise types is available, such as the case when . This suggests that the exploration of uncertainty in the SL loss to identify samples belonging to open-set noise is effective. Among the methods tested in this paper, DivideMix also tries to separate the training samples into clean and noisy sets using the loss in Eq. (3). However, the resulting distribution seems inadequate to allow for a clear separation between the three sets because the open-set and closed-set noisy labels are basically indistinguishable. Consequently, DivideMix is able to separate clean samples from noisy samples, but not closed-set noise from open-set noise, thus forcing it to treat both noise types similarly during training (i.e., both types are treated as closed-set noise). This is not ideal given that the open-set samples will be allocated to one of the incorrect training labels, which can ultimately cause the training to overfit these samples.
In this paper, we investigate a variant of the noisy label problem that combines open-set [24, 13] and closed-set noisy labels [14, 13]. To test various methods for this new problem, we propose a new benchmark that systematically changes the total noise rate and the proportion of closed-set and open-set noise. The open-set samples were sourced from either a small-scale data set (CIFAR-100) or a large-scale data set (ImageNet32) such that the true label of these samples is not contained in the primary data set (CIFAR-10). We argue that such a problem setup is more general and closer to real-life noisy label scenarios. We then propose the EvidentialMix algorithm to address this new noise type with the use of the subjective logic loss, which produces low loss for clean samples, high loss for closed-set noisy samples, and mid-range loss for open-set samples. The clear division of the training data allows us to (1) identify and thereby remove the open-set samples from training to avoid overfitting them, given that they do not belong to any of the known classes, and (2) learn from the predicted closed-set samples in a semi-supervised fashion as in DivideMix. The evaluation shows that our proposed EDM is more effective at addressing this new combined open-set and closed-set label noise problem than the current state-of-the-art approaches for closed-set problems [14, 13] and open-set problems [24, 13].
Future work: The motivation for introducing this problem was to open the dialogue in the research community to investigate the combined open-set and closed-set label noise. Moving forward, we aim to explore more challenging noise settings, such as incorporating asymmetric and semantic noise in the proposed combined label noise problem. Since we are the first to address this problem in a controlled setup, there is no precedent on how these more challenging noise scenarios could be meaningfully incorporated. For instance, even though asymmetric closed-set noise has previously been studied in the literature, it is not obvious what its counterpart, asymmetric open-set noise, entails; it is not immediately clear how to build a noise transition matrix between CIFAR-10 and ImageNet classes. In addition, we see merit in investigating other types of uncertainty to identify open-set noise, such as with Bayesian learning, and aim to explore such methods.
Acknowledgements: IR and GC gratefully acknowledge the support of the Australian Research Council through the Centre of Excellence for Robotic Vision CE140100016 and Future Fellowship (to GC) FT190100525. GC acknowledges the support by the Alexander von Humboldt-Stiftung for the renewed research stay sponsorship. RS acknowledges the support by the Playford Trust Honours Scholarship.
- [1] (2019-05) MixMatch: A Holistic Approach to Semi-Supervised Learning. arXiv e-prints, arXiv:1905.02249.
- [2] (2017) A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819.
- [3] (2009) Imagenet: a large-scale hierarchical image database. pp. 248–255.
- [4] (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International conference on machine learning, pp. 1050–1059.
- [5] (2017) Training deep neural-networks using a noise adaptation layer.
- [6] (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645.
- [7] (2018) Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304–2313.
- [8] (2018-09) Uncertainty Aware AI ML: Why and How. arXiv e-prints, arXiv:1809.07882.
- [9] (2019) Nlnl: negative learning for noisy labels. In Proceedings of the IEEE International Conference on Computer Vision, pp. 101–110.
- [10] (2009) LoOP: local outlier probabilities. In Proceedings of the 18th ACM conference on Information and knowledge management, pp. 1649–1652.
- [11] (2009) Learning multiple layers of features from tiny images.
- [12] (2015-05) Deep learning. Nature 521, pp. 436–44.
- [13] (2019) Robust inference via generative classifiers for handling noisy labels. arXiv preprint arXiv:1901.11300.
- [14] (2020-02) DivideMix: Learning with Noisy Labels as Semi-supervised Learning. arXiv e-prints, arXiv:2002.07394.
- [15] (2019) Learning to learn from noisy labeled data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5051–5059.
- [16] (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605.
- [17] (2017) Decoupling "when to update" from "how to update". In Advances in Neural Information Processing Systems, pp. 960–970.
- [18] (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952.
- [19] (2014-12) Training Deep Neural Networks on Noisy Labels with Bootstrapping. arXiv e-prints, arXiv:1412.6596.
- [20] (2018) Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050.
- [21] (2018) Evidential deep learning to quantify classification uncertainty.
- [22] (2019-10) SELF: Learning to Filter Noisy Labels with Self-Ensembling. arXiv e-prints, arXiv:1910.01842.
- [23] (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204.
- [24] (2018-03) Iterative Learning with Open-set Noisy Labels. arXiv e-prints, arXiv:1804.00092.
- [25] (2019) How does disagreement help generalization against label corruption?. arXiv preprint arXiv:1901.04215.
- [26] (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
- [27] (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.