A Label Management Mechanism for Retinal Fundus Image Classification of Diabetic Retinopathy

06/23/2021 ∙ by Mengdi Gao, et al. ∙ Peking University NetEase, Inc 0

Diabetic retinopathy (DR) remains the most prevalent cause of vision impairment and irreversible blindness in working-age adults. Owing to the renaissance of deep learning (DL), DL-based DR diagnosis has become a promising tool for the early screening and severity grading of DR. However, training deep neural networks (DNNs) requires an enormous amount of carefully labeled data. Noisy labels may be introduced when annotating large amounts of data, degrading model performance. In this work, we propose a novel label management mechanism (LMM) that keeps the DNN from overfitting to noisy data. LMM uses the maximum a posteriori (MAP) probability from Bayesian statistics together with a time-weighted technique to selectively correct the labels of unclean data, which gradually purifies the training data and improves classification performance. Comprehensive experiments on both synthetic-noise data (Messidor and our collected DR dataset) and real-world-noise data (ANIMAL-10N) demonstrate that LMM boosts model performance and is superior to three state-of-the-art methods.




I Introduction

Diabetes is a common and high-risk chronic disease worldwide. The International Diabetes Federation (IDF) reports that over 9% of people globally suffer from diabetes [1]. Diabetes may be complicated by diabetic retinopathy (DR). DR remains the most prevalent cause of vision impairment and irreversible blindness in working-age adults [18], and the number of people with diabetes is expected to increase to 693 million by 2045 worldwide [4]. Early diagnosis and timely intervention can significantly decrease the risk of vision impairment and are necessary for a good DR prognosis [2]. Manual interpretation of retinal photographs is a widely accepted screening tool for DR. However, the shortage of ophthalmologists limits the availability of fundus screening, especially in developing countries. Accurate detection of lesions on retinal fundus images through artificial intelligence (AI) techniques is therefore of great significance for automatically and effectively screening for DR and evaluating its grade. Over the past decade, learning-based computer vision algorithms have been widely explored and have contributed to medical imaging research. AI, especially DNNs, has exhibited impressive performance in numerous medical imaging tasks [19], such as image classification, object detection, semantic segmentation, and image registration. Automated grading of DR based on DNNs is a potential solution to reduce physicians' workload and improve diagnostic efficiency and accuracy.

While state-of-the-art results have been continuously achieved by various machine learning methods [33, 8], all of them depend on massive datasets with reliable labels. However, obtaining labeled DR retinal fundus images is costly and time-consuming. Meanwhile, unavoidable labeling mistakes by annotators may generate noisy labels that are corrupted from the ground-truth labels. It is reported that the ratio of noisy labels in real-world datasets ranges from 8.0% to 38.5% [33, 17, 16, 27, 28]. Although DNNs tend to learn simple patterns first, they are capable of memorizing noisy data [23, 14]. That is, even if only a small portion of mislabeled samples is included in the training data, it can severely degrade model performance, owing to the high capacity of DNNs to fit any noisy labels. In fact, Zhang et al. demonstrated that DNNs can easily fit an entire training dataset with any ratio of corrupted labels, leading to poor generalization on the test dataset [37]. Thus, the critical issue is how to train DNNs robustly even when noisy labels exist in the training data. Unfortunately, this issue cannot be settled completely by popular regularization techniques such as data augmentation, dropout, weight decay, and batch normalization.

Song et al. have provided a comprehensive study on learning from noisy labels with DNNs [28]. They stated the problem of supervised learning with noisy labels along with a taxonomy of label noise, and then reviewed the deep learning approaches that curb the adverse effects of learning from noisy labels. According to their research directions, the existing methods can be grouped into five categories: sample selection, robust architecture, robust regularization, robust loss function, and loss adjustment. Sample selection [14, 36, 22] aims to identify true-labeled samples in noisy training data. It excludes unreliable samples from the update, but it eliminates hard yet useful training samples as well. Robust architecture [33, 7, 35] attempts to design a new dedicated architecture or to add a noise adaptation layer on top of the SoftMax layer. However, a dedicated architecture lacks the flexibility to extend to other architectures, and a noise adaptation layer hinders a model's generalization to complex label noise. Robust regularization [25, 29, 15, 13] aims to make a DNN overfit less to false-labeled samples. This technique introduces additional hyperparameters that are sensitive to both the noise and the data type, yet it does not improve model performance remarkably. Robust loss function [6, 39, 32, 20] aims to modify the loss function so that it achieves a small risk on unseen clean data in the presence of noisy labels in the training data. The drawback of this technique is that it cannot combat heavy and complex noise. Loss adjustment [24, 12, 31, 38] is intended to weaken the adverse effects of noisy labels by adjusting the loss of training samples before updating the DNN. Although it fully explores the training data when adjusting the loss of each sample, the error incurred by false correction accumulates, particularly when the number of classes and noisy samples is large. In addition, SELFIE is a hybrid approach that combines the advantages of both sample selection and loss adjustment [27]. In SELFIE, when a training sample satisfies the refurbishing condition, its refurbished label is the label predicted most frequently over a certain number of preceding epochs (the time window). Although SELFIE reduces the possibility of false correction while exploiting the full training data, it treats all predicted labels within the time window equally, which is not appropriate in our opinion.

In this work, we propose a label management mechanism (LMM) for retinal fundus image classification of diabetic retinopathy. We hypothesize that predicted labels are more credible when they come from well-trained models, that is, from epochs in which performance rises steadily before overfitting sets in. Consequently, we apply a time-weighted technique to the labels predicted during the latest epochs. Moreover, LMM determines the refurbished labels based on the maximum a posteriori (MAP) probability from Bayesian statistics. Furthermore, we put forward a launch condition for LMM, which hinges on the gradient of the accuracy and the loss on the test dataset. This novel LMM, integrated with DNNs, aims to suppress the influence of noisy data and thereby enhance the classification and grading performance on DR images with noisy labels. The rest of the paper is organized as follows. Section II presents the background and details of the proposed approach. Section III shows the results on the public dataset and on our collected dataset. Section IV discusses relevant issues. Finally, Section V concludes the paper.

Fig. 1: The workflow diagram of deep neural network with label management mechanism.

II Material and Methods

II-A Problem Definition and Algorithm Description

In a typical DR image classification task, a training dataset of samples and their corresponding labels is collected, and each sample-label pair is independent and identically distributed. The goal of the task is to learn a function that maps the feature space of the samples to the ground-truth label space. In this work, the mapping function is learned by a CNN that classifies the retinal fundus images and is governed by a set of model parameters. These parameters are learned by minimizing the empirical risk: at each step they are updated along the descent direction of the expected loss on a mini-batch of samples, where each mini-batch is a subset of the training dataset.
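A minimal sketch of equation (1), assuming it takes the standard mini-batch gradient descent form. The notation is assumed here for illustration and is not taken from the original: $\Theta_t$ denotes the model parameters at step $t$, $\mathcal{B}_t$ the mini-batch, $\tilde{y}$ the given (possibly noisy) label of sample $x$, $\alpha$ the learning rate, and $\ell$ the loss function.

```latex
\Theta_{t+1} = \Theta_t - \alpha \, \nabla \left( \frac{1}{|\mathcal{B}_t|}
    \sum_{x \in \mathcal{B}_t} \ell\big(x, \tilde{y}; \Theta_t\big) \right)
\tag{1}
```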


In this update, the step size is set by the learning rate and the error is measured by the loss function. Considering the possible corruption of sample labels in many real-world scenarios, this study modifies the update equation (1) to render the network more robust to noisy labels. Algorithm 1 describes the overall procedure of our proposed LMM for handling noisy labels. First, in the warm-up period, which lasts for the initial epochs of training, the network is trained on the whole training dataset in the default manner of equation (1) (Lines 6–9). Notwithstanding the existence of noisy data, the memorization effect [23, 14] indicates that DNNs initially memorize the training samples with clean labels and only later those with noisy labels. Subsequently, once the start-up condition of the label management mechanism is reached, the training samples in the mini-batch are separated into clean samples, refurbished samples, and samples that are neither. The clean subset covers the low-loss instances of the mini-batch [19], in a proportion determined by the noise rate (Lines 10–12). If the noise rate is unknown, it can be estimated through cross-validation [38, 39]. During this period, each training sample is examined by checking its predictive uncertainty, which uses entropy to measure the consistency of its label predictions over the most recent epochs (Line 14) [27]. Our proposed LMM is then applied to determine the refurbished labels of the qualifying samples (Line 15), and the refurbished samples are aggregated for reuse (Line 16). Note that the intersection of the clean set and the refurbished set is not necessarily empty. If a sample belongs to both sets, being refurbished takes precedence over being clean, because mislabeled instances could be included even in the clean set. The parameters of the DNN are updated based on the clean samples together with the refurbished samples. We correct the backward loss of each refurbished sample by replacing its corrupted label with the refurbished label, and we backpropagate the losses for the refurbished and clean samples to update the network (Lines 17–18), which gives the modified update, equation (2).
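A minimal sketch of equation (2), assuming it follows the form of the modified update in SELFIE [27], on which the procedure above is modeled. The notation is assumed for illustration: $\mathcal{C}$ is the set of clean samples, $\mathcal{R}$ the set of refurbished samples, $y^{*}$ the refurbished label, $\tilde{y}$ the given label, $\alpha$ the learning rate, $\ell$ the loss function, and $\Theta_t$ the model parameters at step $t$.

```latex
\Theta_{t+1} = \Theta_t - \alpha \, \nabla \left( \frac{1}{|\mathcal{R} \cup \mathcal{C}|}
    \left( \sum_{x \in \mathcal{R}} \ell\big(x, y^{*}; \Theta_t\big)
         + \sum_{x \in \mathcal{C} \setminus \mathcal{R}} \ell\big(x, \tilde{y}; \Theta_t\big)
    \right) \right)
\tag{2}
```

The second sum runs over $\mathcal{C} \setminus \mathcal{R}$ so that a sample belonging to both sets is counted once, with its refurbished label, in line with the rule that being refurbished takes precedence over being clean.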


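Algorithm 1 Pseudocode of the proposed LMM
1: INPUT: train data, mini-batch data, uncertainty threshold, noise rate, window width
2: OUTPUT: model parameters, refurbished data
3: Initialize the model parameters;
4: for each epoch do
5:     for each mini-batch do
6:         Extract a mini-batch from the train data;
7:         if the current epoch belongs to the warm-up period then
8:             /* updated by Eq. (1) */
9:             Update the model parameters;
10:        else /* reaches start-up condition of LMM */
11:            /* Clean samples selection */
12:            Clean samples ← the low-loss data in the mini-batch, in a proportion set by the noise rate;
13:            for each sample in the mini-batch do
14:                if the sample satisfies the refurbishing condition (predictive uncertainty check) then
15:                    Calculate the refurbished label based on the LMM theorem;
16:                    Add the (sample, refurbished label) pair to the refurbished data;
17:            /* updated by Eq. (2) */
18:            Update the model parameters;
19:        Advance the iteration counter;
20: return model parameters, refurbished data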
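The two selection steps of Algorithm 1 (Lines 12 and 14) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the clean subset is taken as the (1 - noise rate) fraction of lowest-loss samples, and the predictive uncertainty is taken as the normalized entropy of the labels predicted within the window, following the definition in SELFIE [27]. The function names are hypothetical.

```python
import math
from collections import Counter


def select_clean(losses, noise_rate):
    """Return the indices of the (1 - noise_rate) fraction of lowest-loss samples.

    Assumption: the clean subset is sized by the noise rate, as in SELFIE.
    """
    keep = int(round((1.0 - noise_rate) * len(losses)))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return set(order[:keep])


def predictive_uncertainty(label_history, num_classes):
    """Normalized entropy of the labels predicted over the recent window.

    0 means the prediction never changed; 1 means all classes were
    predicted equally often.
    """
    n = len(label_history)
    counts = Counter(label_history)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(num_classes)


def is_refurbishable(label_history, num_classes, threshold):
    """A sample qualifies when its predictions are consistent enough."""
    return predictive_uncertainty(label_history, num_classes) <= threshold
```

For example, with per-sample losses `[0.9, 0.1, 0.5, 0.3, 2.0]` and a noise rate of 0.4, `select_clean` keeps the three lowest-loss samples (indices 1, 2 and 3).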

II-B Techniques of Label Management Mechanism

Fig. 1 describes the overall procedure of our proposed label management mechanism. Fundus images are fed into the base model to train the DR classification model. While the training process is monitored, the label record module saves the predicted labels, with their corresponding probabilities, of all training samples at each epoch for the Optimal Label Selection Module. At the early stage of training, the original labels are used for loss calculation, and the launch condition of LMM is checked at the same time. When the launch condition of LMM is reached, the Optimal Label Selection Module is activated and calculates estimated labels that replace the original labels. The technical details are described in the following parts.

Start-up Condition of Label Management Mechanism. Before the LMM takes effect, a warm-up period is necessary to ensure that the performance of the training model reaches a relatively steady state. In the warm-up stage of training, the model is unstable or under-fitted, which leads to large deviations in the computed refurbished labels. Therefore, an appropriate launch condition of LMM should be devised carefully, so that the refurbished labels neither fluctuate excessively nor remain unchanged. In this study, we propose two prerequisites to launch the LMM. First, the average loss of the samples in the validation dataset should fall into a range in which our training model may achieve its best performance on the validation dataset. Obviously, the loss is zero in an ideal situation, which gives the lower end of the range. As the output of the last layer is normalized by the SoftMax function, the probability of the ground-truth (GT) label for a correctly predicted sample is no less than the reciprocal of the number of categories. The upper end of the range then follows from the formula of cross-entropy, evaluated at this smallest probability of the GT label.

Second, our model should not be suspected of over-fitting or under-fitting. During training, the accuracy on the validation dataset should satisfy a condition that involves the original noise rate of the training dataset and a hyperparameter that we name the relaxation factor, which controls the base accuracy required to trigger the LMM. Generally, the value of the relaxation factor can be selected according to the dataset.
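The first prerequisite can be made concrete with a short sketch. It assumes, as the text describes, that a correctly predicted sample has a GT-label probability of at least one over the number of categories, so its cross-entropy loss is at most the negative logarithm of that value. The function names are hypothetical, and the second prerequisite (the accuracy condition with the relaxation factor) is deliberately not sketched.

```python
import math


def ce_loss_upper_bound(num_classes):
    """Largest cross-entropy loss of a correctly predicted sample.

    With SoftMax outputs, the winning class has probability >= 1 / num_classes,
    so the loss -log(p) is at most -log(1 / num_classes) = log(num_classes).
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    return -math.log(1.0 / num_classes)


def loss_in_launch_range(avg_val_loss, num_classes):
    """Check the first launch prerequisite on the average validation loss."""
    return 0.0 <= avg_val_loss <= ce_loss_upper_bound(num_classes)
```

For the 2-class task the bound is log 2 (about 0.693), and for 5-class grading it is log 5 (about 1.609).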

Bayesian Statistics for Optimal Label Selection. Bayes' formula is used in our study to estimate the optimal label for each sample: the posterior probability of a candidate label, given the sequence of labels recorded for the sample within a window of recent epochs, is proportional to the product of the prior probability of that label and the likelihood of the recorded label sequence, scaled by a normalization constant. The window width is the number of epochs covered by the Bayesian statistics.

The prior probability is given by the training model at the epoch in question. We then need to decide which statistic to use to calculate the likelihood function. Following our hypothesis that the estimated label of a sample is related to its past learning effects, we use a weighted mean statistic to introduce past knowledge into the current label estimation. In the likelihood function, each label recorded in the window carries a weight, the weights are normalized by a constant, and an adjustable parameter controls the distribution of the weights. In this study, we assign a larger weight to a more recently learned label, because the latest knowledge has the greatest impact on the decision. How the weight changes is shown in Fig. 2. The posterior probability can then be calculated.
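The following sketch shows one way a time-weighted MAP label selection could be put together. It is an illustration under explicit assumptions rather than the authors' formula: the weights are assumed to grow exponentially with recency (controlled by a hypothetical parameter `lam`), the likelihood of a class is assumed to be the weighted frequency with which it was predicted in the window, and the prior is the model's current class probabilities.

```python
import numpy as np


def time_weights(window, lam=0.5):
    """Normalized weights over the window; index 0 is the oldest epoch.

    Assumption: exponential growth with recency. lam = 0 gives equal weights.
    """
    raw = np.exp(lam * np.arange(window))
    return raw / raw.sum()


def map_label(history, prior, lam=0.5):
    """Pick the label with the maximum posterior probability.

    history: labels predicted in the last epochs, oldest first.
    prior:   current class probabilities from the model.
    """
    history = np.asarray(history, dtype=int)
    prior = np.asarray(prior, dtype=float)
    weights = time_weights(len(history), lam)

    # Weighted frequency of each class within the window (the likelihood).
    likelihood = np.zeros_like(prior)
    np.add.at(likelihood, history, weights)

    posterior = prior * likelihood
    total = posterior.sum()
    if total == 0.0:
        # No overlap between prior and history: fall back to the prior.
        return int(np.argmax(prior)), prior
    posterior = posterior / total
    return int(np.argmax(posterior)), posterior
```

With equal weights (`lam=0.0`) the history `[1, 1, 1, 0]` and a uniform prior yield label 1, which is the majority vote; with a strongly recency-biased setting (`lam=3.0`) the most recent prediction dominates and the result is label 0.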

Fig. 2: The purity varying curve of training data. (b) ROC curves of Inception v3 w/o LMM.

II-C Network Architecture and Implementation

To validate the effectiveness of LMM, we integrate it into DNN architectures and compare the performance of DNNs with and without LMM on the same train and test datasets. As new network architectures constantly emerge, it is important that the proposed LMM can be attached to any type of DNN; this flexibility ensures that the method can adapt quickly to different architectures. In the comparative experiments, three common DNNs (VGG-16 [26], Inception-V3 [30], and ResNet-50 [11]) were used to demonstrate the flexibility of LMM. Training used the Adam optimizer with a learning rate of 0.0001, a cross-entropy loss function, and a mini-batch size of 32.

Table I: Grading criterion of the Messidor dataset.
DR Grading | Grading Criterion | Dataset No. | Train No. | Test No. | Referral or Not
DR0 | (NMA = 0) AND (NHE = 0) | | | | Non-referral
DR1 | (0 < NMA ≤ 5) AND (NHE = 0) | | | | Non-referral
DR2 | (5 < NMA < 15) AND (0 < NHE < 5) AND (NNV = 0) | | | | Referral
DR3 | (NMA ≥ 15) OR (NHE ≥ 5) OR (NNV > 0) | | | | Referral

Table II: Grading criterion of our collected dataset.
DR Grading | Grading Criterion | Train No. | Test No. | Referral or Not
DR0 | No abnormalities | | | Non-referral
DR1 | Microaneurysms only | | | Non-referral
DR2 | More than just microaneurysms but less than severe non-proliferative DR | | | Referral
DR3 | Any of the following: more than 20 intraretinal hemorrhages in each of 4 quadrants; definite venous beading in more than 2 quadrants; prominent intraretinal microvascular abnormalities in more than 1 quadrant; and no signs of proliferative DR | | | Referral
DR4 | One or more of the following: vitreous hemorrhage; preretinal hemorrhage | | | Referral

II-D Dataset Accumulation and Transformation

Dataset description. To evaluate our proposed LMM, we performed image classification tasks on two DR retinal fundus image datasets: Messidor [21] and our collected retinal fundus image dataset. According to the international DR grading protocol [8, 10], the severity of DR is divided into two stages, non-proliferative DR (NPDR) and proliferative DR (PDR). NPDR is further graded into four levels: no retinopathy (DR0), mild NPDR (DR1), moderate NPDR (DR2), and severe NPDR (DR3); PDR corresponds to DR4. Messidor contains 1200 retinal fundus images in three sizes (1440×960, 2240×1488, and 2304×1536 pixels), and detailed grading information is listed in Table I. NMA, NHE, and NNV refer to the numbers of microaneurysms, hemorrhages, and neovascularizations, respectively. Notably, following the published errata, we deleted 13 duplicate images and adjusted the labels of 4 images with inconsistent grading. For our collected dataset, which was acquired from the Beijing Tongren Eye Center, the distribution and grading criterion of each stage are shown in Table II. The retinal fundus images were selected from a retrospective cohort of adult patients with no age- or gender-based exclusion criteria. All images were deidentified according to the Health Insurance Portability and Accountability Act Safe Harbor provision prior to transfer to the study investigators. Ethics review and institutional review board exemption were obtained using Quorum Review IRB. The image annotation work involved two stages. In the first stage, each fundus image was labeled independently, in a multi-blind manner, by more than three senior ophthalmologists. In the second stage, a quality check of the labeled images was carried out by a group of experts led by a chief ophthalmologist. Finally, a total of 7262 DR images were collected, and the numbers of DR0, DR1, DR2, DR3, and DR4 images are 1935, 1417, 1488, 1199, and 1223, respectively.

Noise Injection. As both datasets contain only clean samples, we need to artificially corrupt sample labels to generate noisy labels. Frénay and Verleysen [5] summarized the taxonomy of label noise in detail. As shown in Fig. 3, the noise transition matrix describes the probability of a ground-truth label being flipped to each noisy label. Under symmetric noise, a ground-truth label is flipped into each of the other labels with equal probability, so the total flipping probability, which equals the noise rate, is shared evenly among the other classes.
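Symmetric noise as described above can be sketched as follows. Under the usual definition, the transition matrix has 1 minus the noise rate on the diagonal and the noise rate divided by (number of classes minus 1) everywhere else. The function names and the `seed` parameter are assumptions made for this illustration.

```python
import numpy as np


def symmetric_transition_matrix(num_classes, noise_rate):
    """Row y gives the probability of label y being observed as each class."""
    off_diagonal = noise_rate / (num_classes - 1)
    matrix = np.full((num_classes, num_classes), off_diagonal)
    np.fill_diagonal(matrix, 1.0 - noise_rate)
    return matrix


def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Resample every label from its row of the transition matrix."""
    rng = np.random.default_rng(seed)
    matrix = symmetric_transition_matrix(num_classes, noise_rate)
    labels = np.asarray(labels, dtype=int)
    return np.array([rng.choice(num_classes, p=matrix[y]) for y in labels])
```

For five classes and a noise rate of 0.4, each diagonal entry is 0.6 and each off-diagonal entry is 0.1. Because labels are resampled independently, the realized noise rate of a finite dataset only approximates the nominal one.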

Fig. 3: Confusion matrices on DR grading before (a) and after (b) correction.

In our work, symmetric noise is introduced to construct noisy datasets for study. To evaluate robustness from light to heavy noise, in line with real-world noise rates, we tested five noise rates ranging from 0.0 to 0.4 in steps of 0.1.

II-E Quantitative Evaluation Metrics and Comparative Study

Quantitative evaluation metrics.

The performance of LMM was quantitatively evaluated by test accuracy. The test dataset contains unbiased, clean samples that are not used in training. The test accuracy degrades drastically when the DNN overfits samples with noisy labels [37]. Furthermore, the area under the curve (AUC) is also calculated as a metric. Meanwhile, data purity indicates the proportion of samples in the whole train dataset whose resulting label equals the ground-truth label, where the resulting label of a sample is either its original label or its refurbished label. Data purity may be updated after each epoch once LMM is active. Cohen's kappa (kappa) is further employed to measure the agreement between the ground-truth labels and the noisy labels of the train dataset.
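Both training-set metrics are simple to compute. The sketch below assumes data purity is the fraction of samples whose current label matches the ground truth, and uses the standard definition of Cohen's kappa (observed agreement corrected for chance agreement). The function names are hypothetical.

```python
from collections import Counter


def data_purity(true_labels, current_labels):
    """Fraction of samples whose current label equals the ground-truth label."""
    if len(true_labels) != len(current_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(t == c for t, c in zip(true_labels, current_labels))
    return matches / len(true_labels)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    classes = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in classes) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both label lists contain a single identical class
    return (p_observed - p_expected) / (1.0 - p_expected)
```

For ground truth `[0, 0, 1, 1]` and current labels `[0, 1, 1, 1]`, the purity is 0.75 and kappa is 0.5.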

Comparative study methods.

We compared our proposed method with a baseline (marked as Default) and three state-of-the-art robust training algorithms (ActiveBias, Co-teaching, and SELFIE). Default trains the DNNs without any processing of the noisy labels. ActiveBias and Co-teaching are chosen to represent the loss correction and sample selection strategies, respectively. ActiveBias re-weights the loss of training samples by emphasizing high-variance samples [3]. Co-teaching selects clean samples by loss-based separation and adopts a co-training mechanism [9]. SELFIE combines loss correction with sample selection [27].

III Experiment and Results

We first verify the validity of the proposed LMM through both 2-class DR classification and 5-class DR grading experiments on our collected dataset. The two classes are the non-referral (DR0 and DR1) and the referral (DR2, DR3, and DR4) treatment groups. The five classes are DR0 to DR4. The number and distribution of the train and test data are described in detail in Table II. After that, generalization is examined through 2-class DR classification on the public Messidor dataset, where the two classes are the non-referral (DR0 and DR1) and referral (DR2 and DR3) treatment groups. The non-referral and referral groups contain 550 and 380 images in the train dataset and 146 and 111 images in the test dataset, respectively (Table I). The DR grading experiment is omitted on Messidor because of the lack of data for each grade. All experiments are performed on an NVIDIA RTX3090 GPU with 24 GB of memory. In this work, we did not apply any data augmentation or pre-processing procedures.

Fig. 4: Grid search on Messidor with a noise rate of 0.4.

III-A Hyperparameter Selection

A DNN with LMM takes two hyperparameters: the window width and the uncertainty threshold. To determine the optimal combination, we train Inception-V3 on the Messidor dataset with a noise rate of 40%, with the two hyperparameters set on a grid. Fig. 4 illustrates the test accuracy obtained by the grid search on the noisy dataset. Regarding the uncertainty threshold, the best test accuracy is achieved with neither very small nor very large thresholds. The performance generally involves a compromise between the correctly refurbished samples and the wrongly refurbished samples: a small threshold corresponds to a low rate of both, while a large threshold corresponds to a high rate of both. As for the window width, although there is no clear winner among 5, 10, and 15, a window width of 5 achieves the highest test accuracy when the threshold is 0.4.

Fig. 5: (a) The purity varying curve of training data; (b) ROC curves of Inception v3 w/o LMM.

III-B Performance of DR 2-class image classification

The flexibility of the proposed LMM. The flexibility of LMM means that it can support any type of DNN architecture. Three common DNNs (VGG-16, ResNet-50, and Inception-V3) were separately integrated with LMM for comparative experiments on our collected dataset, with the noise rate fixed at 0.2. Taking Inception-V3 as an example, as shown in Fig. 5 (a), the data purity of the training dataset remains unchanged (80%) in the warm-up period, then climbs to 85% and fluctuates slightly once LMM is active. Furthermore, the ROC curves (Fig. 5 (b)) show that Inception-V3 achieves better performance in the DR classification task with the help of LMM. The data purity and ROC results of ResNet-50 and VGG-16 follow a trend similar to that of Inception-V3. The detailed performance metrics are shown in Table III. Compared with the benchmark VGG-16, the test accuracy and AUC of VGG16-LMM improve from 0.7450 and 0.8814 to 0.8512 and 0.9076, respectively. The outcomes of ResNet-50 and Inception-V3 show the same trend. In general, the results indicate that the proposed LMM can enhance the performance of DNNs and that the improvement is independent of the specific model.

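Table III: Performance of three DNNs with LMM on our collected dataset (2-class, noise rate 0.2).
Method | ACC | AUC | Data Purity
VGG16-LMM | 0.8512 | 0.9076 | 83.41%
Resnet50-LMM | 0.8370 | 0.9131 | 86.95%
Inception-V3-LMM | 0.8386 | 0.9072 | 85.05%

Table IV: 2-class classification on our collected dataset at different noise rates.
DR data | Method | ACC | AUC | Data Purity
No noise | Inception V3 | | | 100%
No noise | Inception V3-LMM | 0.9209 | 0.9732 |
10% noise | Inception V3 | | |
10% noise | Inception V3-LMM | 0.8917 | 0.9565 |
20% noise | Inception V3 | | |
20% noise | Inception V3-LMM | 0.8386 | 0.9072 |
30% noise | Inception V3 | | | 70%
30% noise | Inception V3-LMM | 0.7855 | 0.8380 |
40% noise | Inception V3 | | | 60%
40% noise | Inception V3-LMM | 0.6636 | 0.7247 |

Table V: 2-class classification on Messidor at different noise rates.
DR data | Method | ACC | Data Purity
No noise | Inception V3 | | 100%
No noise | Inception V3-LMM | 0.8872 |
10% noise | Inception V3 | |
10% noise | Inception V3-LMM | 0.8289 | 92.25%
20% noise | Inception V3 | |
20% noise | Inception V3-LMM | 0.7860 | 82.90%
30% noise | Inception V3 | |
30% noise | Inception V3-LMM | 0.7120 | 72.80%
40% noise | Inception V3 | |
40% noise | Inception V3-LMM | 0.6226 | 64.41%

Table VI: 5-class grading on our collected dataset at different noise rates.
DR data | Method | ACC | Kappa | Data Purity
No noise | Inception V3 | | 1.0 | 100%
No noise | Inception V3-LMM | 0.7577 | |
10% noise | Inception V3 | | |
10% noise | Inception V3-LMM | 0.6419 | 0.9031 | 92.32%
20% noise | Inception V3 | | |
20% noise | Inception V3-LMM | 0.6387 | 0.7996 | 84.14%
30% noise | Inception V3 | | |
30% noise | Inception V3-LMM | 0.5813 | 0.6974 | 76.04%
40% noise | Inception V3 | | |
40% noise | Inception V3-LMM | 0.5367 | 0.5746 | 66.27%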

Tolerance to different proportions of noise. We select Inception-V3 as the benchmark model and set the noise rate from 0.0 to 0.4 in steps of 0.1 to test the tolerance of the proposed LMM to different proportions of noise. The comparison metrics on our collected dataset are listed in Table IV. The larger the noise rate, the worse the performance of the model without LMM. When there is no noise and the training set contains only clean samples, LMM produces hardly any side effects on the performance of the well-trained model. Under noise rates from 0.1 to 0.4, LMM achieves better metrics than the benchmark model on our collected dataset. For example, at the relatively heavy noise rate of 30%, the data purity, ACC, and AUC increase by 11.29%, 11.94%, and 11.73%, respectively.

Generalization on the public Messidor dataset.

We also verify the generalization of LMM on the corrupted Messidor dataset. Considering that Messidor is relatively small (1200 images), we do not run the image grading experiment and instead train a 2-class image classification model starting from models pretrained on ImageNet. We adopt Inception-V3 with the same DNN hyperparameters as in the experiments above. Messidor is corrupted with noise rates of 0.1, 0.2, 0.3, and 0.4. The test accuracy and train data purity of the comparative experiments are displayed in Table V. Similarly, Inception-V3 with LMM weakens the influence of noisy labels on the Messidor dataset, improving test accuracy by 3.51%, 6.62%, 2.33%, and 5.84% as the noise rate rises from 0.1 to 0.4, respectively. In brief, LMM improves both model performance and training data purity on the noisy Messidor dataset.

III-C Performance of DR 5-class image grading

In addition to the referral and non-referral 2-class DR image classification, we further carried out 5-class DR image grading experiments on our collected dataset to verify the effectiveness of the proposed LMM. Considering that the differences between adjacent stages are likely subtle, 5-class DR image grading is more challenging than 2-class DR image classification. We inject symmetric noise into the ground-truth labels, with the noise rate ranging from 0 to 0.4. As shown in Fig. 3 (a), the confusion matrix before correction corresponds to the noise transition matrix. As Table VI shows, the ACC on the test dataset and the Kappa score and data purity of the training dataset are all improved by the proposed LMM under the five noise rates. In Fig. 3 (b), the confusion matrix after correction contains the refurbished labels determined by our method. Although the noise rate is high (40%), the diagonal entries all increase to a certain extent. Therefore, LMM can reduce the influence of noise at different noise rates in the 5-class DR grading task.

III-D Comparison of different methods to deal with noisy labels

Three state-of-the-art methods (ActiveBias, Co-teaching, and SELFIE) and the Default method, which applies no processing to the noisy labels, are run on our collected dataset for comparison with our proposed LMM. Fig. 6 shows the test accuracy of the five methods at varying symmetric noise rates. In general, LMM outperforms the other methods at the different noise rates. At a noise rate of 0.2, LMM increases the absolute test accuracy by 9.21%, 5.33%, 5.89%, and 2.13% compared with Default, ActiveBias, Co-teaching, and SELFIE, respectively. Additionally, Default shows vulnerability even at the light noise rate of 0.1.

Fig. 6: The best test accuracy of the five training methods using Inception-V3 on our collected dataset at different noise rate.

IV Discussion

IV-A Result with realistic noise

In addition to manually induced noise, ANIMAL-10N [27], which contains realistic noise, is used in an image classification experiment to validate the proposed LMM. ANIMAL-10N consists of 10 classes of animal images, with 50,000 training images and 5,000 testing images. Notably, the labels of ANIMAL-10N were corrupted naturally by human mistakes, and the noise rate is estimated at 8%. LMM is compared with four methods (Default, ActiveBias, Co-teaching, and SELFIE). As the correct ground-truth labels of the training dataset in ANIMAL-10N are unknown, the data purity and Kappa metrics of the training dataset cannot be calculated. The test accuracy is illustrated in the bar chart in Fig. 7. Our proposed LMM ranks first at 82.6%, and Default ranks last at 79.4%. LMM increases the accuracy by 1.6%, 2.4%, and 2.1% compared with SELFIE, Co-teaching, and ActiveBias, respectively. In brief, LMM also works well on a dataset with realistic noise.

Fig. 7: The best test accuracy of the five training methods using Inception-V3 on ANIMAL-10N.
Fig. 8: Label management mechanism applied in the self-training.

IV-B LMM Applied in weak supervision learning

The performance of an image classification DNN is sensitive, to some extent, to the quantity of training data once the model is fixed. When training data are scarce, enriching them is an effective way to improve model performance. Self-training exploits unlabeled data through pseudo-labels to achieve better model performance. Inspired by self-training [34], we apply LMM to self-training to refurbish the pseudo-labels. We carry out three groups of comparative experiments, and Fig. 8 illustrates the results, with the percentage of the train dataset on the X-axis and test accuracy on the Y-axis. The three groups differ in their training data and training strategies. Taking an x value of 10% as an example, in the control group, 10% of the training data with ground-truth labels are used to train the binary DR image classification task. The self-training group uses the same 10% of the training data plus another 10% of the data with pseudo-labels. The self-training with LMM group uses the same training data as the self-training group and applies LMM to enhance the performance of the DNN. In Fig. 8, all three groups show that test accuracy rises as the share of the training dataset increases from 10% to 30%. The self-training group does make use of the unlabeled samples and improves test accuracy compared with the control group. Self-training with LMM surpasses the other two groups, with increases of 3.2%, 5.9%, 2.5%, 2.9%, and 2.8% over the control group across the training-data percentages from 10% to 30%.

IV-C The interpretability of the model

The class activation maps (CAMs) suggest that the areas that cause a high response in our model are consistent with the suspicious areas in a clinical diagnosis. The CAMs in Fig. 9 illustrate the advantage of the DNN with LMM. The leftmost column shows the color retinal fundus images, with the corresponding ground-truth label in the upper left corner. The middle and rightmost columns show the CAMs from Inception-V3 with and without LMM, respectively, with the predicted labels and probabilities in the upper left corner. In Fig. 9 (a), the predominant lesions, including multiple hemorrhages, are located on the nasal side of the fovea. Microaneurysms and hemorrhages can be observed between the superior and inferior vascular arcades in Fig. 9 (b). The model without LMM fails to detect the lesions, resulting in false-negative predictions in both cases, whereas our proposed method accurately localizes the lesions and predicts the correct positive labels. Fig. 9 (c) is a negative case. The model without LMM mistakenly treats the reflection of the nerve fiber as a lesion, leading to a false-positive prediction with a high probability of 0.9755.
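For readers unfamiliar with CAMs, the following is a minimal sketch of how such a map can be computed, assuming the original CAM formulation in which the last convolutional feature maps are globally average pooled and fed to a linear classifier. The text does not state which CAM variant was used, so the function name and the min-max normalization are illustrative assumptions.

```python
from typing import List, Sequence


def class_activation_map(
    feature_maps: Sequence[Sequence[Sequence[float]]],
    class_weights: Sequence[float],
) -> List[List[float]]:
    """Compute a normalized CAM for one class.

    feature_maps: K maps of size H x W from the last convolutional layer.
    class_weights: K classifier weights connecting the pooled maps to the class.
    """
    k = len(feature_maps)
    h = len(feature_maps[0])
    w = len(feature_maps[0][0])
    # Weighted sum over channels at each spatial location.
    cam = [
        [sum(class_weights[c] * feature_maps[c][i][j] for c in range(k)) for j in range(w)]
        for i in range(h)
    ]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = hi - lo
    if span == 0:
        # A flat map carries no localization information.
        return [[0.0] * w for _ in range(h)]
    # Min-max normalize to [0, 1] so the map can be rendered as a heat map.
    return [[(v - lo) / span for v in row] for row in cam]
```

In practice the low-resolution map is upsampled to the input image size and overlaid on the fundus image.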

Fig. 9: Class activation maps from Inception-V3 with and without LMM.

IV-D Limitations and Future Direction

This study still has some limitations. Firstly, all of our experiments are based on datasets with a limited range of symmetric noise. In real medical settings, the distribution of label noise is unknown. It is worth exploring other types of label noise, such as asymmetric (or label-dependent) noise [25]. Asymmetric noise means that a ground-truth label is more likely to be mislabeled as one particular label, which is more realistic in practice. Secondly, the features of noisy samples need to be mined further so that more noisy samples are refurbished to correct labels and fewer clean samples are refurbished to wrong ones. Thirdly, datasets from other medical imaging modalities such as CT, MRI, and PET can be used to further validate the effectiveness of our proposed LMM.
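The difference between the two noise types can be sketched as follows. This is an illustrative sketch under stated assumptions, not the noise-injection procedure used in the experiments: symmetric noise is taken here to mean a flip to any other class with equal probability, and `flip_map` is a hypothetical mapping from a true class to the class it tends to be confused with.

```python
import random
from typing import Dict, List, Sequence


def inject_symmetric_noise(
    labels: Sequence[int], num_classes: int, noise_rate: float, seed: int = 0
) -> List[int]:
    """Flip each label with probability noise_rate to a uniformly chosen other class."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            others = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(others))
        else:
            noisy.append(y)
    return noisy


def inject_asymmetric_noise(
    labels: Sequence[int], flip_map: Dict[int, int], noise_rate: float, seed: int = 0
) -> List[int]:
    """Flip each label with probability noise_rate to its designated confusable class.

    Classes missing from flip_map are never corrupted.
    """
    rng = random.Random(seed)
    return [flip_map.get(y, y) if rng.random() < noise_rate else y for y in labels]
```

For five-class DR grading, a `flip_map` such as `{1: 2}` would model one grade being systematically confused with an adjacent grade; the specific mapping is a modeling choice, not something taken from the experiments above.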

V Conclusion

We propose a novel label management mechanism (LMM) for robustly training DNN classification models with noisy labels. Selected samples are refurbished by analyzing their previously predicted labels with the Bayesian maximum posteriori probability and a time-weighted technique. We conducted extensive experiments on two-class classification and five-class grading of DR using the public Messidor dataset and our collected DR dataset with varying noise rates. Our experimental results showed that LMM can improve the robustness of the DNN when dealing with corrupted labels. LMM guided the network to avoid accumulating noise from false corrections and allowed it to make full use of the training data. In summary, the proposed LMM has demonstrated its capability to reduce the adverse effects of noisy labels in deep-learning-based DR classification.


  • [1] A. M. Carracher, P. H. Marathe, and K. L. Close (2018) International diabetes federation 2017. Wiley Online Library. Cited by: §I.
  • [2] R. Chakrabarti, C. A. Harper, and J. E. Keeffe (2012) Diabetic retinopathy management guidelines. Expert Review of Ophthalmology 7 (5), pp. 417–439. Cited by: §I.
  • [3] H. Chang, E. Learned-Miller, and A. McCallum (2017) Active bias: training more accurate neural networks by emphasizing high variance samples. arXiv preprint arXiv:1704.07433. Cited by: §II-E.
  • [4] N. Cho, J. Shaw, S. Karuranga, Y. Huang, J. da Rocha Fernandes, A. Ohlrogge, and B. Malanda (2018) IDF diabetes atlas: global estimates of diabetes prevalence for 2017 and projections for 2045. Diabetes research and clinical practice 138, pp. 271–281. Cited by: §I.
  • [5] B. Frénay and M. Verleysen (2013) Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems 25 (5), pp. 845–869. Cited by: §II-D.
  • [6] A. Ghosh, H. Kumar, and P. Sastry (2017) Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: §I.
  • [7] J. Goldberger and E. Ben-Reuven (2016) Training deep neural-networks using a noise adaptation layer. Cited by: §I.
  • [8] V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, et al. (2016) Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Jama 316 (22), pp. 2402–2410. Cited by: §I, §II-D.
  • [9] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. arXiv preprint arXiv:1804.06872. Cited by: §II-E.
  • [10] S. Haneda and H. Yamashita (2010) International clinical diabetic retinopathy disease severity scale. Nihon rinsho. Japanese journal of clinical medicine 68, pp. 228–235. Cited by: §II-D.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §II-C.
  • [12] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel (2018) Using trusted data to train deep networks on labels corrupted by severe noise. arXiv preprint arXiv:1802.05300. Cited by: §I.
  • [13] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. Cited by: §I.
  • [14] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2018) Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304–2313. Cited by: §I, §I, §II-A.
  • [15] A. Krogh and J. A. Hertz (1992) A simple weight decay can improve generalization. In Advances in neural information processing systems, pp. 950–957. Cited by: §I.
  • [16] K. Lee, X. He, L. Zhang, and L. Yang (2018) Cleannet: transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5447–5456. Cited by: §I.
  • [17] W. Li, L. Wang, W. Li, E. Agustsson, and L. Van Gool (2017) Webvision database: visual learning and understanding from web data. arXiv preprint arXiv:1708.02862. Cited by: §I.
  • [18] L. Lin, M. Li, Y. Huang, P. Cheng, H. Xia, K. Wang, J. Yuan, and X. Tang (2020) The sustech-sysu dataset for automated exudate detection and diabetic retinopathy grading. Scientific Data 7 (1), pp. 1–10. Cited by: §I.
  • [19] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez (2017) A survey on deep learning in medical image analysis. Medical image analysis 42, pp. 60–88. Cited by: §I.
  • [20] Y. Lyu and I. W. Tsang (2019) Curriculum loss: robust learning and generalization against label corruption. arXiv preprint arXiv:1905.10045. Cited by: §I.
  • [21] T. MESSIDOR (2014) MESSIDOR: methods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology. 2014. Available on: http://messidor.crihan.fr/index-en.php Accessed: October 9. Cited by: §II-D.
  • [22] D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox (2019) Self: learning to filter noisy labels with self-ensembling. arXiv preprint arXiv:1910.01842. Cited by: §I.
  • [23] S. Park, J. Lee, C. Yun, and J. Shin (2020) Provable memorization via deep neural networks using sub-linear parameters. arXiv preprint arXiv:2010.13363. Cited by: §I, §II-A.
  • [24] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952. Cited by: §I.
  • [25] C. Shorten and T. M. Khoshgoftaar (2019) A survey on image data augmentation for deep learning. Journal of Big Data 6 (1), pp. 1–48. Cited by: §I.
  • [26] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §II-C.
  • [27] H. Song, M. Kim, and J. Lee (2019) Selfie: refurbishing unclean samples for robust deep learning. In International Conference on Machine Learning, pp. 5907–5915. Cited by: §I, §I, §II-A, §II-E, §IV-A.
  • [28] H. Song, M. Kim, D. Park, Y. Shin, and J. Lee (2020) Learning from noisy labels with deep neural networks: a survey. arXiv preprint arXiv:2007.08199. Cited by: §I, §I.
  • [29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §I.
  • [30] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §II-C.
  • [31] R. Wang, T. Liu, and D. Tao (2017) Multiclass learning with partially corrupted labels. IEEE transactions on neural networks and learning systems 29 (6), pp. 2568–2580. Cited by: §I.
  • [32] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey (2019) Symmetric cross entropy for robust learning with noisy labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 322–330. Cited by: §I.
  • [33] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang (2015) Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2691–2699. Cited by: §I, §I.
  • [34] Q. Xie, M. Luong, E. Hovy, and Q. V. Le (2020) Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698. Cited by: §IV-B.
  • [35] J. Yao, J. Wang, I. W. Tsang, Y. Zhang, J. Sun, C. Zhang, and R. Zhang (2018) Deep learning from noisy image labels with quality embedding. IEEE Transactions on Image Processing 28 (4), pp. 1909–1922. Cited by: §I.
  • [36] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, and M. Sugiyama (2019) How does disagreement help generalization against label corruption?. In International Conference on Machine Learning, pp. 7164–7173. Cited by: §I.
  • [37] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §I, §II-E.
  • [38] X. Zhang, K. Zhou, S. Wang, F. Zhang, Z. Wang, and J. Liu (2020) Learn with noisy data via unsupervised loss correction for weakly supervised reading comprehension. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 2624–2634. Cited by: §I.
  • [39] Z. Zhang and M. R. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. arXiv preprint arXiv:1805.07836. Cited by: §I.