Sample Prior Guided Robust Model Learning to Suppress Noisy Labels

12/02/2021
by Wenkai Chen, et al., Beijing University of Technology

Imperfect labels are ubiquitous in real-world datasets and seriously harm model performance. Several recent effective methods for handling noisy labels share two key steps: 1) dividing samples into cleanly labeled and wrongly labeled sets according to training loss, and 2) using semi-supervised methods to generate pseudo-labels for the samples in the wrongly labeled set. However, current methods tend to hurt informative hard samples, whose loss distribution is similar to that of noisy ones. In this paper, we propose PGDF (Prior Guided Denoising Framework), a novel framework that learns a deep model to suppress noise by generating prior knowledge about the samples, which is integrated into both the sample dividing step and the semi-supervised step. Our framework retains more informative hard clean samples in the cleanly labeled set. It also improves the quality of pseudo-labels during the semi-supervised step by suppressing the noise in the current pseudo-label generation scheme. To further enhance the hard samples, we reweight the samples in the cleanly labeled set during training. We evaluated our method on synthetic datasets based on CIFAR-10 and CIFAR-100, as well as on the real-world datasets WebVision and Clothing1M. The results demonstrate substantial improvements over state-of-the-art methods.


Introduction

Deep learning techniques, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have recently achieved great success in object recognition montserrat2017training, image classification krizhevsky2012imagenet, and natural language processing (NLP) young2018recent. Most existing CNN or RNN deep models rely on large-scale labeled datasets, such as ImageNet russakovsky2015imagenet. However, it is very expensive and difficult to collect a large-scale dataset with clean labels yi2019probabilistic. Moreover, in the real world, noisy labels are often inevitable in manual annotation sun2019limited. Therefore, research on designing algorithms that are robust to noisy labels is of great significance wei2019harnessing.

Figure 1:

(a) Loss distribution of the clean and noisy samples in CIFAR-100 with a 50% symmetric noise ratio. (b) Mean prediction probability distribution of the clean and noisy samples in CIFAR-100 with a 50% symmetric noise ratio.

In the literature, many approaches have been proposed to improve learning performance under label noise, such as estimating the noise transition matrix goldberger2016training; patrini2017making, designing noise-robust loss functions ghosh2017robust; 2019Curriculum; xu2019l_dmi, designing noise-robust regularization 2020Early; Tanno2020Learning, sample selection han2018co; chen2019understanding, and semi-supervised learning 2020DivideMix; 2021LongReMix. Recently, semi-supervised methods have achieved state-of-the-art performance. They typically first divide samples into cleanly labeled and wrongly labeled sets according to training loss, and then use semi-supervised methods to generate pseudo-labels for the samples in the wrongly labeled set. The sample dividing step is generally based on the small-loss strategy 2019How, where at every epoch, samples with small loss are classified as clean data and samples with large loss as noise. However, the above methods fail to distinguish informative hard samples from noisy ones due to their similar loss distributions (as depicted in Figure 1(a), samples in the yellow dotted box are indistinguishable), and may thus ignore the important information in the hard samples xiao2015learning. To the best of our knowledge, very few works study hard samples under noisy-label scenarios. The work wang2019symmetric mentioned hard samples in learning with noisy labels but did not explicitly identify them. Our previous work zhuhard studied this problem in histopathology image classification.

Although hard samples and noisy samples cannot be directly distinguished by training loss, we observed that they behave differently over the training history. Based on this intuition, we propose PGDF (Prior Guided Denoising Framework), a novel framework that learns a deep model to suppress noise. We first use the training history to distinguish the hard samples from the noisy ones (as depicted in Figure 1(b), samples between the two thresholds are distinguished by their training history), and thereby pre-classify the samples into three sets as prior knowledge. This prior knowledge is integrated into the subsequent training process. Our key findings and contributions are summarized as follows:

  • Hard samples and noisy samples can be recognized from the training history. We propose a Prior Generation Module, which generates prior knowledge to pre-classify the samples into an easy set, a hard set, and a noisy set. We further optimize the pre-classification result at each epoch with an adaptive sample attribution obtained by a Gaussian Mixture Model.

  • We realize robust noisy-label suppression based on the divided sets. On one hand, we generate high-quality pseudo-labels using a distribution transition matrix estimated from the divided easy set. On the other hand, we safely enhance the informative samples in the hard set, which previous noisy-label processing methods cannot achieve because they fail to distinguish hard samples from noisy ones.

  • We experimentally show that PGDF significantly advances state-of-the-art results on multiple benchmarks with different types and levels of label noise. We also provide an ablation study to examine the effect of each component.

Related Work

In this section, we review existing work on learning with noisy labels. Noisy-label processing algorithms can typically be classified into five categories by the strategy they explore: estimating the noise transition matrix goldberger2016training; patrini2017making, designing noise-robust loss functions ghosh2017robust; 2019Curriculum; xu2019l_dmi; zhou2021asymmetric; zhou2021learning, adding noise-robust regularization 2020Early; Tanno2020Learning, selecting a sample subset han2018co; chen2019understanding, and semi-supervised learning 2020DivideMix; 2021LongReMix.

In the first category, different transition matrix estimation methods were proposed in goldberger2016training; patrini2017making, such as using an additional softmax layer goldberger2016training and a two-step estimation scheme patrini2017making. However, these transition matrix estimations fail on real-world datasets, where the underlying prior assumption is no longer valid 2019Deep. Being free of transition matrix estimation, the second category targets loss functions with more noise-tolerant power. The work ghosh2017robust adopted the mean absolute error (MAE), which is more noise-robust than the cross-entropy loss. The authors of xu2019l_dmi proposed a determinant-based mutual information loss which can be applied to any existing classification neural network regardless of the noise pattern. Recently, zhou2021learning proposed a novel strategy to restrict the model output and thus make any loss robust to noisy labels. Nevertheless, it has been reported that performance with such losses is significantly affected by noisy labels 2018Learning; such implementations perform well only in simple cases where learning is easy or the number of classes is small. For noise-robust regularization, the work Tanno2020Learning assumed the existence of multiple annotators and introduced a regularized EM-based approach to model the label transition probability. In 2020Early, a regularization term was proposed to implicitly prevent memorization of the false labels.

The most recent successful sample selection strategies in the fourth category conduct noisy-label processing by training two models simultaneously. The authors of han2018co proposed a Co-teaching scheme with two models, where each model selects a certain number of small-loss samples and feeds them to its peer model for further training. Based on this scheme, chen2019understanding improved the performance with an Iterative Noisy Cross-Validation (INCV) method. This family of methods effectively avoids the risk of false correction by simply excluding unreliable samples, but it may eliminate numerous useful samples. To address this shortcoming, methods of the fifth category, based on semi-supervised learning, treat the noisy samples as unlabeled samples and use the outputs of classification models as pseudo-labels for subsequent loss calculations. The authors of 2020DivideMix proposed DivideMix, which relies on MixMatch 2019MixMatch to linearly combine training samples classified as clean or noisy. LongReMix 2021LongReMix is a two-stage method which first finds high-confidence samples and then uses them to update the predicted clean set and train the model. Recently, 2021Augmentation used different data augmentation strategies in different steps to improve the performance of DivideMix 2020DivideMix.

Both the sample selection strategies and the semi-supervised strategies above select samples with clean labels for the subsequent training process, and all of their selection criteria are based on the training loss, since clean samples tend to have small loss during training. However, they hurt informative hard samples, whose loss distribution is similar to that of noisy ones. Our work bridges this gap by distinguishing the hard samples from the noisy ones through a sample prior.

Method

Figure 2: PGDF first generates the prior knowledge via the Prior Generation Module. It then trains two models (A and B) simultaneously. At each epoch, each model divides the original dataset into an easy set, a hard set, and a noisy set by combining the prior knowledge and the loss value of each sample; the divided dataset is used by the other network. After the first stage, the models conduct label correction for samples with the help of the estimated distribution transition matrix. Finally, the training loss is reweighted by the dividing result to further enhance the hard samples.
Figure 3: Overview of the Prior Generation Module. It pre-classifies the samples into an easy set, a hard set, and a noisy set.

The overview of our proposed PGDF is shown in Figure 2. The first stage of PGDF (Prior Guided Samples Dividing) divides the samples into an easy set, a hard set, and a noisy set. The Prior Generation Module first pre-classifies the samples into three sets as prior knowledge; then, at each epoch, the pre-classification result is optimized by the training loss (Samples Dividing Optimization). With the divided sets, the second stage (Denoising with the Divided Sets) conducts label correction for samples with the help of the estimated distribution transition matrix (Pseudo-labels Refining), and the hard samples are then further enhanced (Hard Samples Enhancing). The details of each component are described in the following.

Prior Guided Samples Dividing

Prior Generation Module. Previous work 2016Understanding shows that CNNs tend to memorize simple samples first; owing to their high representation capacity, the networks then gradually learn the remaining samples, eventually including the noisy ones. Based on this finding, many methods use the "small-loss" strategy, where at each epoch, samples with small loss are classified as clean data and samples with large loss as noise. A sample with small loss is one whose predicted probability is close to its supervising label; we work with the normalized probability rather than the loss value, as it is easier to analyze. We therefore first train a CNN classification model on the data and, for each sample, record over the training history the probability that the model assigns to the class of its corresponding label. We compute the mean prediction probability of each sample over the training history, shown in Figure 1(b). The figure shows that clean samples tend to have a higher mean prediction probability than noisy ones. We can thus set two thresholds $\tau_1 < \tau_2$ (the black dotted lines in Figure 1(b)): samples with mean prediction probability lower than $\tau_1$ are almost all noisy, while those higher than $\tau_2$ are almost all clean. However, we still cannot distinguish the samples whose mean prediction probability lies between the two thresholds. We define the clean data in this range as hard samples in our work. To the best of our knowledge, no existing scheme can distinguish these hard samples from the noisy ones.
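As a concrete illustration, this two-threshold split can be sketched as follows (a minimal sketch with hypothetical names; `prob_history` holds each sample's per-epoch softmax probability on its labeled class, and `tau1`, `tau2` correspond to $\tau_1$ and $\tau_2$ read off the histogram in Figure 1(b)):

```python
import numpy as np

def split_by_mean_probability(prob_history, tau1, tau2):
    """prob_history: (num_samples, num_epochs) array; entry [i, e] is the
    softmax probability the model assigned to sample i's labeled class at
    epoch e. Returns index arrays for the three regions of Figure 1(b)."""
    mean_prob = prob_history.mean(axis=1)
    noisy = np.where(mean_prob < tau1)[0]        # below tau1: almost all mislabeled
    easy = np.where(mean_prob > tau2)[0]         # above tau2: almost all clean
    ambiguous = np.where((mean_prob >= tau1) & (mean_prob <= tau2))[0]
    return easy, ambiguous, noisy                # ambiguous mixes hard and noisy samples
```

The ambiguous region is exactly what the classifier described next is trained to resolve.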

To address this problem, we construct the Prior Generation Module based on the prediction history of the training samples, as depicted in Figure 3. For the training set $D$ with $N$ samples, we obtain the corresponding prediction probability maps by training a CNN classification model for $E$ epochs. The module first selects easy samples $D_e$ and a part of the noisy samples $D_{n_1}$ using the mean prediction probability values. We then manually add noise to $D_e$, yielding $D'$, and record whether each sample is noisy or not. The ratio of the added noise is the same as that of the original dataset, which can be known in advance or estimated by the noise cross-validation algorithm of chen2019understanding. After that, we train the same classification model on $D'$ and record the training history again. We then discard the "easy samples" and part of the "noisy samples" of $D'$ according to the mean prediction probability, and use the remaining samples as training data for a classifier $C_{hn}$. We use a simple one-dimensional CNN consisting of three 1-D convolution layers and a fully connected layer as this classifier. We thereby obtain a classifier that takes the prediction probability map of a sample's training history as input and outputs whether it is a hard sample or a noisy one. Finally, we feed the remaining samples of $D$ into the classifier to get the hard sample set $D_h$ and another part of the noisy set $D_{n_2}$, and we combine $D_{n_1}$ and $D_{n_2}$ into the noisy set $D_n$. Algorithm 1 shows the details of this module.

0:  dataset $D = \{(x_i, y_i)\}_{i=1}^N$, where $x_i$ is an input image and $y_i$ is the label of $x_i$; easy samples ratio $r_e$; part-of-noisy samples ratio $r_n$
0:  easy set $D_e$, hard set $D_h$, noisy set $D_n$
1:  Train classification model $M$ by using $D$, and record the training history $H = \{h_i\}_{i=1}^N$, where $h_i$ is a vector with shape of $1 \times E$
2:  Calculate the mean value of $h_i$ as $m_i$, $i = 1, \dots, N$; sort descending by $m_i$; select the top $r_e N$ samples as easy samples $D_e$ and the bottom $r_n N$ samples as part of the noisy samples $D_{n_1}$
3:  Add noise to $D_e$ as $D'$, get noisy labels $y'_i$, and record whether each sample is noise or not
4:  Retrain $M$ by $D'$, and record the training history $H'$
5:  Sort $H'$ descending by mean; discard the top $r_e$ and bottom $r_n$ fractions and select the remaining training histories $H'_{mid}$
6:  Train classifier $C_{hn}$ by using $H'_{mid}$ and the recorded noise flags
7:  Put the samples in $D \setminus (D_e \cup D_{n_1})$ into $C_{hn}$ and get $D_h$ and $D_{n_2}$
8:  $D_n = D_{n_1} \cup D_{n_2}$
9:  return $D_e$, $D_h$, $D_n$
Algorithm 1 Prior Generation Module.
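The paper fixes only the structure of $C_{hn}$ (three 1-D convolution layers plus a fully connected layer); the channel widths and kernel sizes in the following PyTorch sketch are our assumptions:

```python
import torch
import torch.nn as nn

class HardNoiseClassifier(nn.Module):
    """1-D CNN over a sample's training history, i.e. the per-epoch probability
    assigned to its labeled class. Channel widths and kernel sizes are
    illustrative; the paper specifies only 3 Conv1d layers + 1 FC layer."""
    def __init__(self, num_epochs):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * num_epochs, 2)  # logits for [hard, noisy]

    def forward(self, history):                  # history: (batch, num_epochs)
        z = self.features(history.unsqueeze(1))  # -> (batch, 32, num_epochs)
        return self.fc(z.flatten(1))
```

In Algorithm 1, this classifier is trained on the histories from $D'$ (step 6), where the noisy/clean flags are known by construction, and then applied to the ambiguous samples of $D$ (step 7).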

Samples Dividing Optimization. Since the online training loss at each epoch also carries useful information for dividing samples, we use it to optimize the pre-classification result. Specifically, as shown in Figure 2, at each epoch we obtain the clean probability $w^g_i$ of each sample from the training loss by fitting a Gaussian Mixture Model (GMM), following previous work 2020DivideMix. Having already obtained the prior knowledge of each sample, we set the clean probability given by the prior as $w^p_i$ in Equation (1),

$w^p_i = \begin{cases} 1 & \text{if } x_i \in D_e \\ p^h_i / (p^h_i + p^n_i) & \text{otherwise} \end{cases}$    (1)

where $p^h_i$ is the prediction probability of the classifier $C_{hn}$ for $x_i$ to be a hard sample and $p^n_i$ is the prediction probability of the classifier for $x_i$ to be a noisy sample. Then, we combine $w^p_i$ and $w^g_i$ to get the clean probability $w_i$ by Equation (2),

$w_i = \alpha \, w^p_i + (1 - \alpha) \, w^g_i$    (2)

where $\alpha$ is a hyper-parameter. Finally, we divide the samples with $w^p_i$ equal to 1 into the easy set $D_e$, the samples with $w_i$ above 0.5 into the hard set $D_h$, and the rest into the noisy set $D_n$. Each network divides the original dataset for the other network to use, avoiding the confirmation bias of self-training, similar to previous works han2018co; chen2019understanding; 2020DivideMix.
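A minimal sketch of this step, assuming the reconstruction of Equation (2) above (the loss normalization and GMM settings mirror common DivideMix-style implementations and are assumptions here):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def clean_probability(losses, w_prior, alpha=0.5):
    """losses: per-sample training losses at the current epoch; w_prior: the
    prior clean probabilities w^p from the Prior Generation Module."""
    losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses.reshape(-1, 1))
    clean_comp = gmm.means_.argmin()   # component with the smaller mean loss is "clean"
    w_gmm = gmm.predict_proba(losses.reshape(-1, 1))[:, clean_comp]  # w^g
    return alpha * w_prior + (1.0 - alpha) * w_gmm                   # Eq. (2)
```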

Denoising with the Divided Sets

Pseudo-labels Refining. After the sample dividing phase, we combine the outputs of the two models to generate the pseudo-labels for label correction, similar to the "co-guessing" in DivideMix 2020DivideMix. Since the samples in the easy set are highly reliable, we can use this part of the data to estimate the distribution difference between the pseudo-labels and the ground truth, which can then be used to refine the pseudo-labels. Given the ground-truth labels, we use a square matrix $T$ to denote the difference between the ground-truth label distribution $Y$ and the pseudo-label distribution $\hat{Y}$, thus $\hat{Y} = Y T$ and $Y = \hat{Y} T^{-1}$.

Specifically, we use the easy set $D_e$ and its labels to estimate $T$ and refine $\hat{Y}$. We first pass the easy set to the model and get $\hat{Y}_e = M(D_e)$, where $M(\cdot)$ denotes the model outputs of $D_e$. Then we obtain $T$, of which the element $T_{ij}$ can be calculated by Equation (3),

$T_{ij} = \frac{1}{N_i} \sum_{x \in X_i} p_j(x)$    (3)

where $X_i$ consists of the samples with the same label of class $i$ in $D_e$, $N_i$ is the number of samples in $X_i$, and $p_j(x)$ is the model output softmax probability for class $j$ of the sample $x$. After that, we refine the pseudo-labels by Equation (4),

$\tilde{Y} = \hat{Y} T^{-1}$    (4)

where $\tilde{Y}$ is the matrix of refined pseudo-labels. Because $\tilde{Y}$ may contain negative values, we first apply Equation (5) to make the matrix non-negative, and then normalize along the row direction by Equation (6) to ensure the elements of each pseudo-label probability vector sum to 1:

$\tilde{Y}_{ij} \leftarrow \max(\tilde{Y}_{ij}, 0)$    (5)
$\tilde{Y}_{ij} \leftarrow \tilde{Y}_{ij} / \sum_k \tilde{Y}_{ik}$    (6)

Finally, the labels of the samples in the noisy set $D_n$ are replaced by the refined pseudo-labels $\tilde{y}_i$, and the label of each sample in the hard set $D_h$ is replaced by a combination of its refined pseudo-label $\tilde{y}_i$ and its original label $y_i$ as in Equation (7), where $w_i$ is the clean probability of sample $x_i$:

$\bar{y}_i = w_i \, y_i + (1 - w_i) \, \tilde{y}_i$    (7)
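Under the reconstruction of Equations (3)-(6) above, the refinement step can be sketched as follows (function and variable names are ours; `probs_easy` holds the model's softmax outputs on the easy set):

```python
import numpy as np

def estimate_T(probs_easy, labels_easy, num_classes):
    """Eq. (3): T[i, j] is the mean softmax probability for class j over the
    easy-set samples whose (trusted) label is class i."""
    T = np.zeros((num_classes, num_classes))
    for c in range(num_classes):
        T[c] = probs_easy[labels_easy == c].mean(axis=0)
    return T

def refine_pseudo_labels(pseudo, T):
    """Eqs. (4)-(6): map pseudo-labels through T^{-1}, clip negative entries,
    then re-normalize each row into a probability vector."""
    refined = pseudo @ np.linalg.inv(T)                    # Eq. (4)
    refined = np.clip(refined, 0.0, None)                  # Eq. (5)
    refined /= refined.sum(axis=1, keepdims=True) + 1e-12  # Eq. (6)
    return refined
```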

Hard Samples Enhancing. After generating the pseudo-labels, the samples in the easy set $D_e$ and the hard set $D_h$ are grouped into a labeled set $X$, and the noisy set $D_n$ is treated as an unlabeled set $U$. We use MixMatch 2019MixMatch to "mix" the data, where each sample is randomly interpolated with another sample to generate a mixed input and label; MixMatch transforms $X$ and $U$ into $X'$ and $U'$. To further enhance the informative hard samples, the loss on $X'$ is reweighted as shown in Equation (8), where $\beta$ is a hyper-parameter. Similar to DivideMix 2020DivideMix, the loss on $U'$ is the mean squared error as shown in Equation (9), and the regularization term is shown in Equation (10).

$\mathcal{L}_X = -\frac{1}{|X'|} \sum_{(x, p) \in X'} \big(1 + \beta \, \mathbb{1}[x \in D_h]\big) \sum_c p_c \log(p^c_{\mathrm{model}}(x; \theta))$    (8)
$\mathcal{L}_U = \frac{1}{|U'|} \sum_{(x, q) \in U'} \| q - p_{\mathrm{model}}(x; \theta) \|_2^2$    (9)
$\mathcal{L}_{\mathrm{reg}} = \sum_c \pi_c \log\Big( \pi_c \,\Big/\, \frac{1}{|X'| + |U'|} \sum_{x \in X' + U'} p^c_{\mathrm{model}}(x; \theta) \Big), \quad \pi_c = 1/C$    (10)

Finally, the total loss is defined in Equation (11); $\lambda_u$ and $\lambda_r$ follow the same settings as in DivideMix.

$\mathcal{L} = \mathcal{L}_X + \lambda_u \mathcal{L}_U + \lambda_r \mathcal{L}_{\mathrm{reg}}$    (11)
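Under the reconstruction of Equations (8)-(11) above, the objective can be sketched as follows (the hard-sample weight $1 + \beta\,\mathbb{1}[x \in D_h]$ is our assumption about the reweighting; names are ours):

```python
import torch
import torch.nn.functional as F

def pgdf_loss(logits_x, targets_x, hard_mask, logits_u, targets_u,
              beta, lambda_u, lambda_r):
    """logits_x / targets_x: model outputs and (mixed) soft labels on X';
    hard_mask: 1 for samples originating from the hard set, else 0;
    logits_u / targets_u: the analogous tensors on U'."""
    log_probs_x = F.log_softmax(logits_x, dim=1)
    weights = 1.0 + beta * hard_mask.float()                      # upweight hard samples
    loss_x = -(weights * (targets_x * log_probs_x).sum(1)).mean() # Eq. (8)

    probs_u = torch.softmax(logits_u, dim=1)
    loss_u = ((probs_u - targets_u) ** 2).mean()                  # Eq. (9)

    # Eq. (10): pull the mean prediction toward the uniform prior pi_c = 1/C
    num_classes = logits_x.size(1)
    pi = torch.full((num_classes,), 1.0 / num_classes, device=logits_x.device)
    mean_pred = torch.cat([torch.softmax(logits_x, 1), probs_u]).mean(0)
    loss_reg = (pi * torch.log(pi / mean_pred)).sum()

    return loss_x + lambda_u * loss_u + lambda_r * loss_reg       # Eq. (11)
```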

Experiment

Datasets and Implementation Details

We compare our PGDF with related approaches on four benchmark datasets, namely CIFAR-10 2009Learning, CIFAR-100 2009Learning, WebVision 2017WebVision, and Clothing1M 2015Learning. Both CIFAR-10 and CIFAR-100 have 50,000 training and 10,000 test images of size 32x32 pixels; CIFAR-10 contains 10 classes and CIFAR-100 contains 100 classes. Since CIFAR-10 and CIFAR-100 originally contain no label noise, following previous works 2019Learning; 2020DivideMix; 2021Augmentation, we experiment with two types of label noise: symmetric and asymmetric. Symmetric noise is generated by randomly replacing the labels of a percentage of the training data with all possible labels, while asymmetric noise is designed to mimic the structure of real-world label noise, where labels are only replaced by those of similar classes (e.g., deer to horse, dog to cat) 2020DivideMix. WebVision contains 2.4 million images in 1,000 classes. Since the dataset is quite large, for quick experiments we follow previous works chen2019understanding; 2020DivideMix; wu2021ngc and only use the first 50 classes of the Google image subset; its noise level is estimated at 20% 2019Prestopping. Clothing1M is a real-world dataset consisting of 1 million training images acquired from online shopping websites, organized into 14 classes; its noise level is estimated at 38.5% 2019Prestopping.
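For reference, the two noise types can be synthesized as follows (a sketch; the exact sampling convention, e.g. whether a symmetric "flip" may keep the original label, varies between papers):

```python
import numpy as np

def add_symmetric_noise(labels, noise_ratio, num_classes, rng=np.random):
    """Replace the labels of a random noise_ratio fraction of the samples
    with labels drawn uniformly from all classes."""
    labels = labels.copy()
    idx = rng.choice(len(labels), size=int(noise_ratio * len(labels)), replace=False)
    labels[idx] = rng.randint(0, num_classes, size=len(idx))
    return labels

def add_asymmetric_noise_cifar10(labels, noise_ratio, rng=np.random):
    """Flip labels to a fixed similar class (truck->automobile, bird->airplane,
    deer->horse, cat<->dog), following the mapping popularized by DivideMix."""
    mapping = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}
    labels = labels.copy()
    idx = rng.choice(len(labels), size=int(noise_ratio * len(labels)), replace=False)
    for i in idx:
        labels[i] = mapping.get(labels[i], labels[i])
    return labels
```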

In our experiments, we use the same backbones as previous methods to make our results comparable. For CIFAR-10 and CIFAR-100, we use an 18-layer PreAct ResNet 2016Identity as the backbone and train it using SGD with a batch size of 128, a momentum of 0.9, and a weight decay of 0.0005; the models are trained for roughly 300 epochs depending on the speed of convergence. The image augmentation strategy is the same as in 2021Augmentation. We set the initial learning rate to 0.02 and reduce it by a factor of 10 after 150 epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100.
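This schedule maps onto a standard PyTorch setup as follows (a sketch; torchvision's ResNet-18 stands in for the PreAct ResNet-18 used in the paper):

```python
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10)  # stand-in for PreAct ResNet-18
optimizer = torch.optim.SGD(model.parameters(), lr=0.02,
                            momentum=0.9, weight_decay=5e-4)
# Reduce the learning rate by a factor of 10 after 150 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150], gamma=0.1)

for epoch in range(300):
    # ... one PGDF training epoch (warm-up, dividing, MixMatch, loss) ...
    scheduler.step()
```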

For WebVision, we use Inception-ResNet v2 2017Inception as the backbone and train it using SGD with a momentum of 0.9, a learning rate of 0.01, and a batch size of 32. The networks are trained for 80 epochs, and the warm-up period is 1 epoch.

For Clothing1M, we use a ResNet-50 with pre-trained ImageNet weights. We train the network using SGD for 80 epochs with a momentum of 0.9, a weight decay of 0.001, and a batch size of 32. The initial learning rate is set to 0.002 and reduced by a factor of 10 after 40 epochs.

The hyper-parameters introduced in this paper are set in the same manner for all datasets: given the noise ratio $r$, we set $r_e$, $r_n$, $\alpha$, and $\beta$ accordingly. The noise ratio is estimated by the noise cross-validation algorithm of chen2019understanding.

Noise type                                     sym.                     asym.
Method / Noise ratio                    20%    50%    80%    90%    40%
Cross-Entropy                    best   86.8   79.4   62.9   42.7   85.0
                                 last   82.7   57.9   26.1   16.8   72.3
Mixup 2018mixup                  best   95.6   87.1   71.6   52.2   -
                                 last   92.3   77.3   46.7   43.9   -
M-correction 2019Unsupervised    best   94.0   92.0   86.8   69.1   87.4
                                 last   93.8   91.9   86.6   68.7   86.3
Meta-Learning 2019Learning       best   92.9   89.3   77.4   58.7   89.2
                                 last   92.0   88.8   76.1   58.3   88.6
ELR+ 2020Early                   best   95.8   94.8   93.3   78.7   93.0
                                 last   -      -      -      -      -
DivideMix 2020DivideMix          best   96.1   94.6   93.2   76.0   93.4
                                 last   95.7   94.4   92.9   75.4   92.1
LongReMix 2021LongReMix          best   96.2   95.0   93.9   82.0   94.7
                                 last   96.0   94.7   93.4   81.3   94.3
DM-AugDesc-WS-SAW 2021Augmentation best 96.3   95.6   93.7   35.3   94.4
                                 last   96.2   95.4   93.6   10.0   94.1
PGDF (ours)                      best   96.7   96.3   94.7   84.0   94.8
                                 last   96.6   96.2   94.6   83.1   94.5
Table 1: Comparison with state-of-the-art methods in test accuracy (%) on CIFAR-10 with symmetric noise (ranging from 20% to 90%) and 40% asymmetric noise. Results for previous techniques were directly copied from their respective papers.

Comparison with State-of-the-Art Methods

We compare the performance of PGDF with recent state-of-the-art methods: Mixup 2018mixup, M-correction 2019Unsupervised, Meta-Learning 2019Learning, NCT 2020Noisy, ELR+ 2020Early, DivideMix 2020DivideMix, NGC wu2021ngc, LongReMix 2021LongReMix, and DM-AugDesc-WS-SAW 2021Augmentation. We do not compare with the DM-AugDesc-WS-WAW strategy of 2021Augmentation because its authors state on their GitHub page that its reproducibility is unstable.

Method / Noise ratio                    20%    50%    80%    90%
Cross-Entropy                    best   62.0   46.7   19.9   10.1
                                 last   61.8   37.3   8.8    3.5
Mixup 2018mixup                  best   67.8   57.3   30.8   14.6
                                 last   66.0   46.6   17.6   8.1
M-correction 2019Unsupervised    best   73.9   66.1   48.2   24.3
                                 last   73.4   65.4   47.6   20.5
Meta-Learning 2019Learning       best   68.5   59.2   42.4   19.5
                                 last   67.7   58.0   40.1   14.3
ELR+ 2020Early                   best   77.6   73.6   60.8   33.4
                                 last   -      -      -      -
DivideMix 2020DivideMix          best   77.3   74.6   60.2   31.5
                                 last   76.9   74.2   59.6   31.0
LongReMix 2021LongReMix          best   77.8   75.6   62.9   33.8
                                 last   77.5   75.1   62.3   33.2
DM-AugDesc-WS-SAW 2021Augmentation best 79.6   77.6   61.8   17.3
                                 last   79.5   77.5   61.6   15.1
PGDF (ours)                      best   81.3   78.0   66.7   42.3
                                 last   81.2   77.6   65.9   41.7
Table 2: Comparison with state-of-the-art methods in test accuracy (%) on CIFAR-100 with symmetric noise (ranging from 20% to 90%). Results for previous techniques were directly copied from their respective papers.
Method                         WebVision            ILSVRC12
                             top-1   top-5        top-1   top-5
NCT 2020Noisy                75.16   90.77        71.73   91.61
ELR+ 2020Early               77.78   91.68        70.29   89.76
DivideMix 2020DivideMix      77.32   91.64        75.20   90.84
LongReMix 2021LongReMix      78.92   92.32        -       -
NGC wu2021ngc                79.16   91.84        74.44   91.04
PGDF (ours)                  81.47   94.03        75.45   93.11
Table 3: Comparison with state-of-the-art methods trained on the (mini) WebVision dataset in top-1/top-5 accuracy (%) on the WebVision validation set and the ImageNet ILSVRC12 validation set. Results for previous techniques were directly copied from their respective papers.
Method Test Accuracy
Cross-Entropy 69.21
M-correction 2019Unsupervised 71.00
Meta-Learning 2019Learning 73.47
NCT 2020Noisy 74.02
ELR+ 2020Early 74.81
DivideMix 2020DivideMix 74.76
LongReMix 2021LongReMix 74.38
DM-AugDesc-WS-SAW 2021Augmentation 75.11
PGDF (ours) 75.19
Table 4: Comparison with state-of-the-art methods in test accuracy (%) on the Clothing1M dataset. Results for previous techniques were directly copied from their respective papers.

Table 1 shows the results on CIFAR-10 with symmetric label noise ranging from 20% to 90% and with 40% asymmetric noise. Table 2 shows the results on CIFAR-100 with symmetric label noise ranging from 20% to 90%. Following the metrics of previous works 2021Augmentation; 2021LongReMix; 2020DivideMix; 2020Early, we report both the best test accuracy across all epochs and the test accuracy averaged over the last 10 epochs of training. Our PGDF outperforms the state-of-the-art methods across all noise ratios.

Table 3 compares PGDF with state-of-the-art methods on the (mini) WebVision dataset, where our method outperforms all other methods by a large margin. Table 4 shows the results on the Clothing1M dataset, where our method also achieves state-of-the-art performance, demonstrating that it works in real-world settings.

Ablation Study

Method / Noise ratio                              50%            80%
PGDF                                      best  96.26 ± 0.09   94.69 ± 0.46
                                          last  96.15 ± 0.13   94.55 ± 0.25
PGDF w/o prior knowledge                  best  95.55 ± 0.11   93.77 ± 0.19
                                          last  95.12 ± 0.19   93.51 ± 0.23
PGDF w/o two networks                     best  95.65 ± 0.35   93.84 ± 0.73
                                          last  95.22 ± 0.39   93.11 ± 0.80
PGDF w/o samples dividing optimization    best  95.73 ± 0.09   93.51 ± 0.28
                                          last  95.50 ± 0.12   93.20 ± 0.32
PGDF w/o pseudo-labels refining           best  95.78 ± 0.26   94.06 ± 0.68
                                          last  95.35 ± 0.27   93.71 ± 0.52
PGDF with "co-refinement & co-guessing"   best  95.91 ± 0.22   94.30 ± 0.51
                                          last  95.66 ± 0.29   93.81 ± 0.62
PGDF w/o hard samples enhancing           best  96.01 ± 0.19   94.39 ± 0.28
                                          last  95.78 ± 0.07   94.21 ± 0.23
Table 5: Ablation study results in terms of average test accuracy (%, 3 runs) with standard deviation on CIFAR-10 with 50% and 80% symmetric noise.

We study the effect of removing different components to provide insights into what makes our method successful. The result is shown in Table 5.

To study the effect of the prior knowledge, we divide the dataset only by $w^g_i$ and change the easy-set threshold to 0.95, because no value of $w^g_i$ is exactly equal to 1. The result shows that the prior knowledge is very effective in retaining more hard samples and filtering out more noisy ones. Removing the prior knowledge decreases the test accuracy by an average of about 0.93%.

To study the effect of the two-network scheme, we train a single network that performs all steps by itself. Removing the two-network scheme decreases the test accuracy by an average of about 0.91%.

To study the effect of the samples dividing optimization, we divide the dataset only by the prior knowledge $w^p_i$. The result shows that whether a sample is noisy depends not only on the training history, but also on the information of the image itself and its corresponding label; integrating this information makes the judgment more accurate. Removing the samples dividing optimization decreases the test accuracy by an average of about 0.68%.

To study the effect of the pseudo-labels refining phase, we use the pseudo-labels without refining them by the estimated transition matrix; this decreases the test accuracy by an average of about 0.69%. We also evaluate the pseudo-label refinement method of DivideMix 2020DivideMix by replacing our scheme with "co-refinement" and "co-guessing", which decreases the test accuracy by an average of about 0.49%.

To study the effect of the hard samples enhancing, we remove this component. The resulting drop in accuracy shows that enhancing the informative hard samples improves performance by an average of about 0.30%.

Among the prior knowledge, the two networks, the samples dividing optimization, the pseudo-labels refining phase, and the hard samples enhancing, the prior knowledge introduces the largest performance gain, and every component contributes some gain.

Hyper-parameters Analysis

To analyze how sensitive PGDF is to the hyper-parameters $r_e$ and $r_n$, we trained with different values of $r_e$ and $r_n$ on CIFAR-10 with a 50% symmetric noise ratio. Specifically, we first adjusted the value of $r_e$ with $r_n$ fixed at 0.25, obtaining the sensitivity of PGDF to $r_e$; we then adjusted the value of $r_n$ with $r_e$ fixed at 0.25, obtaining the sensitivity of PGDF to $r_n$. We report both the best test accuracy across all epochs and the test accuracy averaged over the last 10 epochs of training, as shown in Table 6. The performance is stable when $r_e$ and $r_n$ are varied within a reasonable range, so it does not rely heavily on their pre-defined settings. Although their settings depend on the noise ratio, they are still easy to set owing to this insensitivity. In fact, these hyper-parameters only serve to select a subset of highly reliable samples to train $C_{hn}$ (the classifier that distinguishes the hard and noisy samples); their exact settings are not critical and merely support the algorithm. Analysis of the other hyper-parameters is given in the Appendix.

r_e (r_n = 0.25)        0.2            0.25           0.3            0.35
best                96.09 ± 0.04   96.26 ± 0.09   96.14 ± 0.08   96.11 ± 0.06
last                95.98 ± 0.06   96.15 ± 0.13   95.99 ± 0.10   95.93 ± 0.09
r_n (r_e = 0.25)        0.2            0.25           0.3            0.35
best                96.26 ± 0.12   96.26 ± 0.09   96.25 ± 0.13   96.02 ± 0.07
last                96.08 ± 0.11   96.15 ± 0.13   96.11 ± 0.16   95.82 ± 0.10
Table 6: Results in terms of average test accuracy (%, 3 runs) with standard deviation for different values of $r_e$ and $r_n$ on CIFAR-10 with a 50% symmetric noise ratio.
Figure 4: (a) Mean prediction probability histogram of the clean and noisy samples in CIFAR-10 ($D$) with 50% symmetric noise. (b) Mean prediction probability histogram of the clean and noisy samples in the corresponding $D'$.

Why Does $C_{hn}$ Work?

To study why $C_{hn}$, trained on the artificially created $D'$, can recognize hard and noisy samples in the original dataset $D$, we plotted the mean prediction probability histograms of the clean and noisy samples both in CIFAR-10 with a 50% symmetric noise ratio ($D$) and in the corresponding $D'$. The results are shown in Figure 4. According to Figure 4(b), $D'$ also contains some clean samples that cannot be distinguished from the noisy ones by their mean prediction probability: although the samples in $D'$ are all easy ones with respect to $D$, part of them become hard ones with respect to $D'$. Notably, the mean prediction probabilities of samples trained on $D'$ follow a distribution similar to that of the original dataset $D$, and this is why our Prior Generation Module works with the artificially created $D'$.

Limitations

The quantifiable behavioral differences between hard and noisy samples remain unclear, and the features that the classifier $C_{hn}$ uses are not very interpretable. There may exist other, better metrics that can directly distinguish the hard samples from the noisy ones. Our subsequent work will continue to investigate specific quantifiable metrics to simplify the Prior Generation Module, and how to strengthen hard samples more reasonably is a direction worth studying in the future.

Conclusions

Existing methods for learning with noisy labels fail to distinguish the hard samples from the noisy ones and thus hurt model performance. In this paper, we propose PGDF, a framework that learns a deep model to suppress noise. We found that the training history can be used to distinguish hard samples from noisy samples; by integrating our Prior Generation Module, more hard clean samples can be retained. Moreover, our pseudo-labels refining and hard samples enhancing phases further boost performance. Through extensive experiments across multiple datasets, we show that PGDF outperforms state-of-the-art methods. In the future, we will conduct experiments on instance-dependent label noise, which is more challenging and would introduce many hard samples.
