Cost-Sensitive Regularization for Diabetic Retinopathy Grading from Eye Fundus Images

10/01/2020 ∙ by Adrian Galdran, et al. ∙ Bournemouth University

Assessing the degree of disease severity in biomedical images is a task similar to standard classification but constrained by an underlying structure in the label space. Such a structure reflects the monotonic relationship between different disease grades. In this paper, we propose a straightforward approach to enforce this constraint for the task of predicting Diabetic Retinopathy (DR) severity from eye fundus images based on the well-known notion of Cost-Sensitive classification. We expand standard classification losses with an extra term that acts as a regularizer, imposing greater penalties on predicted grades when they are farther away from the true grade associated with a particular image. Furthermore, we show how to adapt our method to the modeling of label noise in each of the sub-problems associated with DR grading, an approach we refer to as Atomic Sub-Task modeling. This yields models that can implicitly take into account the inherent noise present in DR grade annotations. Our experimental analysis on several public datasets reveals that, when a standard Convolutional Neural Network is trained using this simple strategy, improvements of 3-5% in quadratic-weighted kappa scores can be achieved at a negligible computational cost. Code to reproduce our results is released at




1 Introduction

Diabetes is regarded as a global eye health issue, with a steadily increasing world-wide affected population, expected to reach 630 million individuals by 2045 [1]. Diabetic Retinopathy (DR) is a complication of standard diabetes, caused by damage to vasculature within the retina. DR shows early signs in the form of swelling micro-lesions that destroy small vessels and release blood into the retina. Advanced DR stages are characterized by the appearance of more noticeable symptoms, e.g. proliferation of neo-vessels, leading to the detachment of the retinal layer and eventually permanent sight loss.

Retinal images acquired with fundus cameras are the tool of choice for discovering these early symptoms, representing an effective diagnostic tool suitable for automatic diagnostic systems [2]. In this context, and with the advent of Deep Learning in the last decade, a wide set of techniques has been proposed in recent years [3, 4, 5]. However, the vast majority of these works are designed for the screening task, i.e. distinguishing healthy individuals from patients at any stage of risk. Due to its difficulty, fewer works have addressed the task of DR grading, consisting of classifying an eye fundus image into one of the five categories proposed by the American Academy of Ophthalmology [6], illustrated in Fig. 1. In addition, most recent DR grading techniques [7, 8, 9] have focused on scaling up existing Convolutional Neural Networks by considering larger/better databases, but only a few works addressed the design of customized loss functions that are more suitable for this task, which is the goal of this paper.

Figure 1: Images from the Messidor-2 dataset illustrating the progressive behavior of DR. (a) Grade 1 (Mild NPDR): only a few microaneurysms can be found. (b) Grade 2 (Moderate NPDR): presence of multiple microaneurysms, blot hemorrhages, venous beading, and/or cotton wool spots. (c) Grade 3 (Severe NPDR): microaneurysms in 4 quadrants of the retina, cotton wool spots, venous beading, severe intra-retinal microvascular abnormalities. (d) Grade 4 (PDR): neovascularization, vitreous hemorrhages.

Cost-Sensitive classifiers are known to be useful for addressing two of the main challenges related to DR grading. First, they make it possible to model the underlying structure of a heterogeneous label space [10, 11, 12]. Second, they are beneficial for dealing with severely class-imbalanced scenarios [13, 14]. Despite this, to the best of our knowledge, no previous work has explored Cost-Sensitive loss minimization approaches in the context of DR grading from eye fundus images.

In this paper, we present a straightforward approach for integrating Cost-Sensitive classification constraints in the task of DR grading from retinal images. We choose to introduce these constraints by attaching an auxiliary Cost-Sensitive loss term to popular misclassification error functions, and by analyzing the impact of this process on the training of a standard CNN. In addition, we illustrate how to adapt our method to the modeling of label noise in each of the sub-problems associated with DR grading, an approach we refer to as Atomic Sub-Task modeling. We conduct a series of careful experiments demonstrating that expanding well-known loss functions with a Cost-Sensitive term brings noticeable performance increases, and that sub-task modeling leads to learning models that behave more similarly to human annotators.

2 Methodology

In this section we first describe our approach to build Cost-Sensitive (CS) classifiers, and the loss functions we select as baselines, to which we will add a CS-regularizing term. We then show how CS can be employed to model label noise for DR grading problems, and detail the training process we followed to optimize the parameters of our models.

2.1 Cost-Sensitive Regularization

In order to induce a different penalty for each kind of error, let us first consider the case in which a model produces a prediction $\hat{y}$. Such a prediction is to be compared with the corresponding label $y$. For the sake of readability, in the following we will abuse notation and use $y$ to refer interchangeably to an integer label $y = j$ and its one-hot-encoded counterpart, which takes a value of $1$ in the position corresponding to $j$ and $0$ elsewhere.

Standard loss functions like the cross-entropy error, described by:

$$\mathcal{L}_{CE}(\hat{y}, y) = -\sum_{i} y_i \log(\hat{y}_i), \tag{1}$$

are insensitive to any underlying structure in the label space. This means that for a particular example $(x, y)$, if any permutation is applied to the ordering of the classes (in both $y$ and $\hat{y}$), the resulting error will remain the same. In order to modify that behavior, we consider a cost matrix $M$ that encodes a null cost for a prediction $\hat{y}$ such that $\hat{y} = y$, but a cost that increases along with the distance between $\hat{y}$ and $y$.
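To make this insensitivity concrete, the following minimal sketch compares two hypothetical five-class predictions for an image whose true grade is 0. Both assign the same probability to the true grade, but one places the remaining mass on the neighboring grade and the other on the farthest grade. The probability vectors are illustrative values chosen for this example, not data from the experiments.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy for one example: minus the log-probability of the true class."""
    return -math.log(probs[label])

# Hypothetical predictions for an image of true grade 0 (illustrative values only).
near_miss = [0.6, 0.4, 0.0, 0.0, 0.0]  # remaining mass on the adjacent grade
far_miss = [0.6, 0.0, 0.0, 0.0, 0.4]   # remaining mass on the farthest grade

ce_near = cross_entropy(near_miss, 0)
ce_far = cross_entropy(far_miss, 0)
print(ce_near, ce_far)  # the two values coincide
```

Because only the probability of the true class enters the loss, the cross-entropy cannot tell a mild grading error from a severe one.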

A simple approach to achieve such an increasing label-dependent penalty is to encode those costs in each row of $M$, and then compute the scalar product of $\hat{y}$ with the row of $M$ corresponding to $y$, i.e. $\langle M(y, \cdot), \hat{y} \rangle$. However, due to the high imbalance of the DR grading problem (with typically few examples of classes DR1, DR3, and DR4), in our experiments we noted that simply minimizing such a quantity would lead to models remaining stuck in local minima and classifying all images into the DR0 and DR2 classes. For this reason, we prefer to combine a CS term with a base loss as follows:

$$\mathcal{L}^{cs}(\hat{y}, y) = \mathcal{L}^{base}(\hat{y}, y) + \lambda \, \langle M^{(2)}(y, \cdot), \hat{y} \rangle. \tag{2}$$

In the above equation, we have selected the $L_2$-based ground cost matrix $M^{(2)}$, since it fits nicely with the goal of maximizing the quadratic-weighted kappa score, but other cost matrices could be easily implemented if previous knowledge of the problem is available to be embedded in the loss function. We give an example of how to build different penalties in the next section.
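The combination of a base loss with a cost-sensitive term can be sketched for a single example as follows. This is a minimal illustration rather than the released implementation: it assumes cross-entropy as the base loss, a cost matrix whose entries grow with the squared distance between grades (an assumption chosen here to mirror the quadratic weighting of the kappa score), and a weight named `lam` for the CS term. The function names and the probability vectors are hypothetical.

```python
import math

def quadratic_cost_matrix(n_classes):
    # Assumption for this sketch: the cost of predicting grade j when the
    # label is grade i grows with the squared distance between both grades.
    return [[(i - j) ** 2 for j in range(n_classes)] for i in range(n_classes)]

def cs_regularized_loss(probs, label, cost_matrix, lam):
    """Base loss (cross-entropy here) plus lam times the cost-sensitive term."""
    base = -math.log(probs[label])
    # Scalar product between the predicted probabilities and the row of the
    # cost matrix that corresponds to the true label.
    cs_term = sum(cost * p for cost, p in zip(cost_matrix[label], probs))
    return base + lam * cs_term

M = quadratic_cost_matrix(5)
near_miss = [0.6, 0.4, 0.0, 0.0, 0.0]  # hypothetical prediction, true grade 0
far_miss = [0.6, 0.0, 0.0, 0.0, 0.4]   # same confidence, mass on the farthest grade

loss_near = cs_regularized_loss(near_miss, 0, M, lam=1.0)
loss_far = cs_regularized_loss(far_miss, 0, M, lam=1.0)
print(loss_near, loss_far)
```

With `lam=0.0` the function reduces to the base loss. For any positive `lam`, the prediction that places probability mass on a distant grade receives the larger loss.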

As for the base loss, in this paper we consider three different alternatives, namely the above Cross-Entropy loss together with the Focal Loss and Non-Uniform Label Smoothing Loss functions. The Focal Loss was introduced for object detection tasks in [15], but it has become widely popular in classification tasks due to its ability to penalize misclassified examples more heavily during training. In a multi-class setting, it is given by the following equation:

$$\mathcal{L}_{FL}(\hat{y}, y) = -\sum_{i} \alpha \, (1 - \hat{y}_i)^{\gamma} \, y_i \log(\hat{y}_i), \tag{3}$$

where $\alpha$ is a weighting factor and $\gamma$ is the so-called focusing parameter, which penalizes errors in wrongly classified examples more than errors in correctly classified ones.
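The effect of the focusing parameter can be illustrated with a small single-example sketch of the multi-class Focal Loss. The default values `alpha=1.0` and `gamma=2.0` and the two probability vectors are assumptions made for this example; they are not the settings used in the experiments.

```python
import math

def focal_loss(probs, label, alpha=1.0, gamma=2.0):
    """Multi-class focal loss for a single example with a one-hot label."""
    p_true = probs[label]
    # The modulating factor (1 - p_true)^gamma shrinks the loss of examples
    # that are already classified with high confidence.
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# Hypothetical predictions for an image of true grade 0.
easy = [0.9, 0.025, 0.025, 0.025, 0.025]   # confidently correct
hard = [0.1, 0.6, 0.1, 0.1, 0.1]           # mostly wrong

fl_easy = focal_loss(easy, 0)
fl_hard = focal_loss(hard, 0)
print(fl_easy, fl_hard)
```

Setting `gamma=0.0` and `alpha=1.0` recovers the cross-entropy. Larger values of `gamma` down-weight the easy example much more strongly than the hard one.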

Non-Uniform Label Smoothing Loss [16] is a straightforward modification of the popular Label Smoothing technique in which neighboring labels receive more probability mass than farther-away ones. This process is described by the following formula:

$$\mathcal{L}_{NULS}(\hat{y}, y) = \mathcal{L}_{CE}(\hat{y}, G_\sigma * y) = -\sum_{i} (G_\sigma * y)_i \log(\hat{y}_i), \tag{4}$$

where actual labels are manipulated by means of convolution with a Gaussian kernel $G_\sigma$, resulting in a lower penalty for neighboring grades and a greater loss value for far-away predictions. Unlike the Cross-Entropy and the Focal loss, the Non-Uniform Label Smoothing strategy is sensitive to the label space structure. Yet, we hypothesize that further imposing a greater penalty on farther-away grades could benefit training based on this loss, as well as on the other two functions above. In our experiments, described below, we train several models by taking the base loss $\mathcal{L}^{base}$ to be $\mathcal{L}_{CE}$, $\mathcal{L}_{FL}$, and $\mathcal{L}_{NULS}$, and varying the hyper-parameter $\lambda$ that weights the CS term from $0$ (no CS regularization whatsoever) to greater CS penalties, and observe the resulting performance.
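A minimal sketch of the label manipulation described above is given below. It assumes a Gaussian kernel sampled at integer grade distances with a hypothetical width `sigma=1.0`, and renormalizes the smoothed label so that it sums to one; the exact kernel and normalization used for the Non-Uniform Label Smoothing Loss may differ.

```python
import math

def gaussian_smoothed_label(label, n_classes, sigma=1.0):
    """Smooth a one-hot label with a Gaussian centred on the true grade."""
    weights = [math.exp(-((k - label) ** 2) / (2.0 * sigma ** 2))
               for k in range(n_classes)]
    total = sum(weights)
    # Renormalize so the soft label is a probability distribution (assumption).
    return [w / total for w in weights]

def soft_cross_entropy(probs, soft_label):
    """Cross-entropy against a soft label; probs must be strictly positive."""
    return -sum(t * math.log(p) for t, p in zip(soft_label, probs))

soft = gaussian_smoothed_label(2, 5)

# Hypothetical predictions for an image of true grade 2.
pred_near = [0.05, 0.05, 0.10, 0.75, 0.05]  # mass on the adjacent grade 3
pred_far = [0.75, 0.05, 0.10, 0.05, 0.05]   # mass on the distant grade 0

loss_near = soft_cross_entropy(pred_near, soft)
loss_far = soft_cross_entropy(pred_far, soft)
print(soft, loss_near, loss_far)
```

Since the soft label gives more mass to grade 3 than to grade 0, the prediction concentrated on the adjacent grade is penalized less than the one concentrated on the distant grade.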

2.2 Atomic Sub-Task Modeling

Annotating retinal images regarding the level of DR severity is known to be a noisy process, with high rates of inter-observer disagreement [7, 17]. In this paper we propose to leverage available data regarding the structure of that disagreement to improve DR grading accuracy. Our hypothesis is that if the kind of noise affecting labels in the training data can be estimated, we can make a model aware of such noise via a CS mechanism similar to the one described in eq. (2).


Specifically, we consider the confusion matrix $C$ from the left-hand side of eq. (5). This matrix contains information collected in [7] regarding inter-observer disagreement between retinal specialists and an adjudicated consensus during the grading process of their clinical validation dataset. Interestingly, this matrix conveys not only information about which grades are most likely to be subject to expert disagreement, but it also tells us which grades are more often mistaken for which other grades.

To formalize the above, we refer to the task of categorizing an image of actual DR grade $i$ into the $j$-th grade as $t_{ij}$, and we refer to these tasks as atomic sub-tasks. For a given grade $i$, the number of images actually belonging to that grade is $N_i = \sum_j C_{ij}$, where $C_{ij}$ denotes the entries of the confusion matrix, and normalizing $C_{ij}$ by $N_i$ provides an estimate $\hat{C}_{ij}$ of $P(t_{ij})$, which denotes the likelihood that an annotator diagnoses an image as grade $j$ when it actually was of grade $i$, as shown in the right-hand side of eq. (5):


We assume below that matrices are indexed starting from , i.e. . By observing we can draw several conclusions, for example:

  • Annotators are likely to be greatly accurate when grading and images, as derived from and .

  • Around of images are likely to be incorrectly labeled ().

  • Only of incorrectly labeled images are likely to be labeled as .

  • Approximately of those incorrectly labeled images are likely to be labeled as .
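The normalization that turns a confusion matrix into estimates of the atomic sub-task likelihoods can be sketched as below. The counts are entirely hypothetical and are NOT the values reported in [7]; rows are assumed to index the actual grade and columns the grade assigned by the annotator.

```python
def row_normalize(confusion):
    """Divide each row by its total, so row i estimates the likelihood of each
    assigned grade given that the actual grade is i."""
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized

# Hypothetical counts (not taken from [7]): rows = actual grade, columns = assigned grade.
C = [
    [90, 8, 2, 0, 0],
    [30, 50, 20, 0, 0],
    [5, 10, 75, 9, 1],
    [0, 0, 20, 70, 10],
    [0, 0, 5, 15, 80],
]
C_hat = row_normalize(C)

# With zero-based indexing, C_hat[1][0] estimates how often an image of
# actual grade 1 is labeled as grade 0 in this made-up example.
print(C_hat[1][0])
```

Reading the normalized rows in this way is what supports conclusions such as those listed above: the diagonal entries indicate how reliable each grade is, and the off-diagonal entries indicate which grades absorb the errors.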

Under the hypothesis that in a dataset labeled by a single annotator the reliability of the annotations will follow a distribution similar to the above, we can assume, for instance, that such dataset will contain reliable labels concerning grades. However, we may also assume that when an image has been annotated as of grade , this is quite likely to be incorrect, and it may well be the case that such image is actually of grade , since the corresponding atomic sub-task holds value comparable to .

Our goal is to impose in our models a penalty on erroneous predictions that takes into account all the above information. That is, we want to penalize incorrect predictions when the label is likely to be reliable, but we are willing to be more tolerant of erroneous predictions if we know the associated label is unreliable. Embedding this knowledge into a loss function is easily accomplished using the CS loss formulation as developed in the previous section: we consider in eq. (2), being the identity matrix. Higher values of will result in lower penalties, whereas lower values lead to a greater penalty.

Note, however, that for grades such that , , there is no useful information in terms of relative reliability of these grades, e.g. does not convey the information that it is harder to misdiagnose an image as than it is to misdiagnose it as . In those situations it might be better to rely on the penalty imposed by the ground cost matrix from eq. (2). For this reason, we suggest implementing an averaged Cost-Sensitive regularizer as:


We now describe the remaining training specifications apart from the loss functions.

2.3 Training Details

To analyze the impact of minimizing CS-regularized loss functions in the problem of DR grading, we follow the process of varying the hyper-parameter $\lambda$ in eq. (2). For each base loss function, we train a Convolutional Neural Network (CNN) by setting $\lambda = 0$ (no regularization), , and . If the best performance of these three experiments results from employing , we set and train the CNN again. This process is repeated until performance no longer improves.
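The incremental search over the regularization weight can be sketched as follows. Everything specific in this block is an assumption made for illustration: the initial candidate values, the rule for generating the next candidate (`grow`), and the toy scoring function that stands in for training a CNN and measuring validation performance.

```python
import math

def search_lambda(train_and_validate, initial_lambdas, grow):
    """Try the initial weights; while the best one is the largest tried so far,
    try a larger weight, and stop as soon as validation performance stalls."""
    scores = {lam: train_and_validate(lam) for lam in initial_lambdas}
    best = max(scores, key=scores.get)
    while best == max(scores):
        candidate = grow(best)
        scores[candidate] = train_and_validate(candidate)
        if scores[candidate] <= scores[best]:
            break
        best = candidate
    return best, scores

def toy_validation_score(lam):
    # Hypothetical stand-in for "train the CNN with this weight and return the
    # validation quad-kappa"; it simply peaks near lam = 100.
    return -(math.log10(lam + 1.0) - 2.0) ** 2

best_lam, all_scores = search_lambda(
    toy_validation_score,
    initial_lambdas=[0.0, 1.0, 10.0],   # hypothetical starting values
    grow=lambda lam: lam * 10.0,        # hypothetical growth rule
)
print(best_lam, sorted(all_scores))
```

In a real experiment `train_and_validate` would wrap a full training run, so each extra candidate costs one more trained model; the loop stops after the first candidate that fails to improve on the current best.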

As for the CNN, we select the ResNeXt50 architecture based on its excellent classification accuracy in other multi-class problems [18], and weights are initialized from training on the ImageNet dataset. We use Stochastic Gradient Descent with a batch size of , and the learning rate is set to . Performance (quadratic kappa score) is monitored on an independent validation set. The learning rate is decreased by a factor of whenever performance stagnates on the validation set, and the training is stopped after epochs of no further improvement. In addition, to mitigate the impact of class imbalance, we oversample minority classes [19].
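One common way to oversample minority classes is to draw training examples with a probability inversely proportional to the frequency of their class. The sketch below illustrates that idea only; it is an assumption about how oversampling could be done, not a description of the exact procedure used here or in [19], and the label distribution is made up.

```python
import random
from collections import Counter

def oversampling_weights(labels):
    """Per-example sampling weights, inversely proportional to class frequency."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Hypothetical imbalanced set of DR grades (illustrative proportions only).
labels = [0] * 70 + [1] * 5 + [2] * 15 + [3] * 5 + [4] * 5
weights = oversampling_weights(labels)

# Draw one epoch worth of example indices with replacement.
rng = random.Random(0)
resampled = rng.choices(range(len(labels)), weights=weights, k=len(labels))
print(Counter(labels[i] for i in resampled))
```

With these weights every class carries the same total sampling mass, so minority grades are seen roughly as often as the majority grade during training.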

3 Experimental Validation

In this section we describe the experimental setting we follow in order to validate our approach: the datasets considered, the techniques we compare against, and the numerical results.

3.1 Experimental Details

We consider as our primary dataset the Eyepacs database, the largest public dataset with DR grading labels. It contains around 80,000 high resolution retinal fundus images (approximately 35,000 are assigned to the training set, part of which we set aside for validation, and 55,000 are held out for testing purposes). The Eyepacs dataset contains a considerable amount of low quality images and label noise [17]. Therefore, it represents an interesting test-bed to observe the robustness of DR grading algorithms.

As a secondary test set, we also consider the Messidor-2 dataset [4], which contains images corresponding to patients. In this case, we employ the ground-truth labels released by [7], available online. These labels are extracted from a process of consensus adjudication of three retinal specialists, and they are therefore of much better quality than the Eyepacs dataset ground-truth.

For performance assessment, our main metric of interest is the quadratic-weighted kappa score (quad-kappa), which is typically used to assess inter-observer variability and is a very popular metric in this task. As further measures of correlation, we also analyze the Average of Classification Accuracy (ACA, the mean of the diagonal in a normalized confusion matrix [20]) and the Kendall-$\tau$ coefficient. We also report the mean Area Under the Receiver-Operator Curve in its multi-class extension, after considering each possible class pair [21]. For statistical testing, expert labels and model predictions in each of the two test sets (Eyepacs and Messidor-2) are bootstrapped [22] (n=1000) in a stratified manner with respect to the relative presence of DR grades. Performance differences for each metric are derived in each bootstrap and p-values are computed for testing significance. The statistical significance level was set to in each case.
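For reference, the two confusion-matrix-based metrics can be computed as in the following sketch. It uses the standard definition of the quadratic-weighted kappa (weights proportional to the squared distance between grades) and the ACA as the mean of the row-normalized diagonal; the function names and the toy label lists are assumptions made for this example, and rows are taken to index the reference grade.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with rows indexed by reference grade, columns by prediction."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return counts

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    observed = confusion_counts(y_true, y_pred, n_classes)
    n = len(y_true)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    disagreement = 0.0
    chance_disagreement = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            disagreement += weight * observed[i][j]
            # Expected count under independence of the two gradings.
            chance_disagreement += weight * row_totals[i] * col_totals[j] / n
    # Undefined when chance_disagreement is zero (e.g. a single grade overall).
    return 1.0 - disagreement / chance_disagreement

def average_classification_accuracy(y_true, y_pred, n_classes):
    """Mean of the diagonal of the row-normalized confusion matrix."""
    observed = confusion_counts(y_true, y_pred, n_classes)
    recalls = [row[i] / sum(row) for i, row in enumerate(observed) if sum(row) > 0]
    return sum(recalls) / len(recalls)

grades = [0, 1, 2, 3, 4]
kappa_perfect = quadratic_weighted_kappa(grades, grades, 5)
kappa_reversed = quadratic_weighted_kappa(grades, grades[::-1], 5)
aca_example = average_classification_accuracy([0, 0, 1, 1], [0, 1, 1, 1], 2)
print(kappa_perfect, kappa_reversed, aca_example)
```

Because the weights are quadratic in the grade distance, a prediction that is off by several grades lowers the kappa score far more than one that is off by a single grade, which is why this metric pairs naturally with a distance-aware loss.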

For comparison purposes, we select three other recent techniques that introduce methods specifically developed to solve the DR grading task: DR|GRADUATE [23], Bilinear Attention Net for DR Grading (BiRA-Net) [20], and Quadratic-Weighted Kappa Loss (QWKL) [24].

3.2 Numerical Results

After training a CNN by minimizing each of the three considered base losses (Cross-Entropy, Focal Loss, and Non-Uniform Label Smoothing) with different degrees of regularization, we select the best model and compute results first on the Eyepacs test set. We denote the unregularized models by CE, FL, and NULS respectively, and their regularized counterparts as CE-CS, FL-CS, and NULS-CS.

We then select the best hyperparameter setting for each regularized model ( in all cases), and retrain the same model but this time using our proposed Atomic Sub-Task modeling, denoted by an AST suffix in each case. We compile in Table 1 the obtained results in terms of quadratic $\kappa$-score, mean AUC, ACA and Kendall-$\tau$, for all the described options.

| Model | quad-kappa | mAUC | ACA | Kendall-τ |
|---|---|---|---|---|
| CE | 75.76 ± 0.31 | 87.35 ± 0.14 | 51.32 ± 0.44 | 67.35 ± 0.31 |
| CE-CS | 77.27 ± 0.30 | 88.42 ± 0.14 | 53.26 ± 0.42 | 69.89 ± 0.30 |
| CE-AST | 77.39 ± 0.29 | 88.49 ± 0.13 | 54.12 ± 0.44 | 69.25 ± 0.30 |
| FL | 74.72 ± 0.34 | 86.63 ± 0.16 | 51.90 ± 0.44 | 65.38 ± 0.32 |
| FL-CS | 77.38 ± 0.31 | 88.58 ± 0.14 | 54.11 ± 0.45 | 69.45 ± 0.30 |
| FL-AST | 77.94 ± 0.29 | 88.90 ± 0.13 | 54.71 ± 0.43 | 70.45 ± 0.29 |
| NULS | 77.09 ± 0.30 | 88.44 ± 0.14 | 53.02 ± 0.46 | 69.47 ± 0.29 |
| NULS-CS | 77.91 ± 0.30 | 88.82 ± 0.14 | 54.55 ± 0.44 | 70.14 ± 0.30 |
| NULS-AST | 78.71 ± 0.28 | 89.05 ± 0.13 | 54.57 ± 0.46 | 71.04 ± 0.30 |

Table 1: Performance comparison when training without regularization, with CS regularization as in eq. (2), and with Atomic Sub-Task modeling (AST) as in eq. (6), for the three considered loss functions. Statistically significant results are marked bold.

Finally, we report in Table 2 the performance of our best model (using NULS as the base loss together with Atomic Sub-Task modeling) in comparison with the techniques proposed in [23], [24], and [20], on the test sets of both Eyepacs and Messidor-2. We also provide confusion matrices for the Eyepacs test set in Fig. 2.

| Dataset | Araújo et al. [23] | QWKL [24] | BiRA-Net [20] | NULS-AST |
|---|---|---|---|---|
| Eyepacs | 74.00 / 53.6 | 74.00 / n.a. | n.a. / 54.31 | 78.71 ± 0.28 / 54.57 ± 0.46 |
| Messidor-2 | 71.00 / 59.60 | n.a. / n.a. | n.a. / n.a. | 79.79 ± 1.03 / 63.41 ± 1.99 |

Table 2: Performance comparison in terms of quad-kappa/ACA for different methods when tested on the Eyepacs and Messidor-2 datasets. Models were trained on Eyepacs and tested on Eyepacs and Messidor-2 (without retraining/fine-tuning).







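(a) Araújo et al. [23]

| | DR0 | DR1 | DR2 | DR3 | DR4 |
|---|---|---|---|---|---|
| DR0 | 81 | 16 | 3 | 0 | 0 |
| DR1 | 35 | 48 | 17 | 0 | 0 |
| DR2 | 11 | 25 | 49 | 14 | 1 |
| DR3 | 2 | 5 | 34 | 55 | 4 |
| DR4 | 2 | 8 | 22 | 33 | 35 |

(b) Zhao et al. [20]

| | DR0 | DR1 | DR2 | DR3 | DR4 |
|---|---|---|---|---|---|
| DR0 | 79 | 6 | 15 | 0 | 0 |
| DR1 | 51 | 19 | 30 | 0 | 0 |
| DR2 | 16 | 4 | 69 | 9 | 2 |
| DR3 | 3 | 1 | 43 | 47 | 6 |
| DR4 | 0 | 1 | 22 | 19 | 58 |

(c) NULS-AST

| | DR0 | DR1 | DR2 | DR3 | DR4 |
|---|---|---|---|---|---|
| DR0 | 97 | 2 | 1 | 0 | 0 |
| DR1 | 68 | 24 | 8 | 0 | 0 |
| DR2 | 26 | 13 | 50 | 10 | 1 |
| DR3 | 4 | 2 | 39 | 52 | 3 |
| DR4 | 7 | 1 | 18 | 24 | 50 |

Figure 2: (a)–(c): Normalized confusion matrices (entries in %, each row sums to 100) corresponding to: (a) the method of Araújo et al. [23], (b) Zhao et al. [20], (c) NULS-AST.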

4 Discussion and Conclusion

Results in Table 1 clearly show that introducing Cost-Sensitive regularization results in noticeable improvements, particularly when measuring performance in terms of quadratic $\kappa$-score. This is meaningful since the considered cost matrix was selected so as to quadratically penalize distance in the label space for erroneous predictions. According to Table 1, the quadratic $\kappa$-score experienced an improvement ranging from 2.66 points when regularizing the Focal loss to 0.82 points for NULS. The smaller gain for NULS could also be expected, since NULS already introduces some asymmetry in the way DR grades are treated. If Atomic Sub-Task modeling is considered, these improvements are even greater when compared with unregularized counterparts: from an increase of 3.22 points for the Focal Loss to an increase of 1.62 points for NULS. It is also worth noticing that the confusion matrix resulting from training with Atomic Sub-Task modeling shows a certain similarity with respect to the inter-observer disagreement matrix in the left-hand side of eq. (5), especially when compared with the confusion matrices produced by other techniques, as shown in Fig. 2.

It should be stressed that performance in Table 1 is not comparable to results of the competition that published the data. There are several reasons for this: the heuristics for ranking optimization common to these competitions, or the fact that participants were allowed to submit predictions on part of the testing data during the competition. In addition, the lack of cross-dataset experimentation complicates evaluating generalization ability. In contrast, the approach proposed here is a general improvement over standard techniques, not limited to the DR grading problem, and which generalizes to other datasets, as Table 2 shows.


References

  • [1] Diabetes Report, WHO. Technical report.
  • [2] Valentina Bellemo, Zhan W. Lim, Gilbert Lim, Quang D. Nguyen, Yuchen Xie, Michelle Y. T. Yip, Haslina Hamzah, Jinyi Ho, Xin Q. Lee, Wynne Hsu, Mong L. Lee, Lillian Musonda, Manju Chandran, Grace Chipalo-Mutati, Mulenga Muma, Gavin S. W. Tan, Sobha Sivaprasad, Geeta Menon, Tien Y. Wong, and Daniel S. W. Ting. Artificial intelligence using deep learning to screen for referable and vision-threatening diabetic retinopathy in Africa: a clinical validation study. The Lancet Digital Health, 1(1):e35–e44, May 2019.
  • [3] Varun Gulshan, Lily Peng, Marc Coram, Martin C. Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, Ramasamy Kim, Rajiv Raman, Philip C. Nelson, Jessica L. Mega, and Dale R. Webster. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA, 316(22):2402–2410, December 2016.
  • [4] Michael David Abràmoff, Yiyue Lou, Ali Erginay, Warren Clarida, Ryan Amelon, James C. Folk, and Meindert Niemeijer. Improved Automated Detection of Diabetic Retinopathy on a Publicly Available Dataset Through Integration of Deep Learning. Investigative Ophthalmology & Visual Science, 57(13):5200–5206, October 2016.
  • [5] Pedro Costa, Adrian Galdran, Asim Smailagic, and Aurélio Campilho. A Weakly-Supervised Framework for Interpretable Diabetic Retinopathy Detection on Retinal Images. IEEE Access, 6:18747–18758, 2018.
  • [6] C. P. Wilkinson, Frederick L. Ferris, Ronald E. Klein, Paul P. Lee, Carl David Agardh, Matthew Davis, Diana Dills, Anselm Kampik, R. Pararajasegaram, and Juan T. Verdaguer. Proposed international clinical diabetic retinopathy and diabetic macular edema disease severity scales. Ophthalmology, 110(9):1677–1682, September 2003.
  • [7] Jonathan Krause, Varun Gulshan, Ehsan Rahimy, Peter Karth, Kasumi Widner, Greg S. Corrado, Lily Peng, and Dale R. Webster. Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy. Ophthalmology, 125(8):1264–1272, August 2018.
  • [8] Feng Li, Zheng Liu, Hua Chen, Minshan Jiang, Xuedian Zhang, and Zhizheng Wu. Automatic Detection of Diabetic Retinopathy in Retinal Fundus Photographs Based on Deep Learning Algorithm. Translational Vision Science & Technology, 8(6):4–4, November 2019.
  • [9] Jaakko Sahlsten, Joel Jaskari, Jyri Kivinen, Lauri Turunen, Esa Jaanio, Kustaa Hietala, and Kimmo Kaski. Deep Learning Fundus Image Analysis for Diabetic Retinopathy and Macular Edema Grading. Scientific Reports, 9(1):1–11, July 2019.
  • [10] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a Wasserstein Loss. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2053–2061. Curran Associates, Inc., 2015.
  • [11] Arthur Mensch, Mathieu Blondel, and Gabriel Peyré. Geometric Losses for Distributional Learning. In International Conference on Machine Learning, pages 4516–4525, May 2019.
  • [12] Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. Cost-sensitive Regularization for Label Confusion-aware Event Detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5278–5283, Florence, Italy, July 2019. Association for Computational Linguistics.
  • [13] Nguyen Thai-Nghe, Zeno Gantner, and Lars Schmidt-Thieme. Cost-sensitive learning methods for imbalanced data. In The 2010 International Joint Conference on Neural Networks (IJCNN), pages 1–8, July 2010.
  • [14] Zhi-Hua Zhou and Xu-Ying Liu. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77, January 2006.
  • [15] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal Loss for Dense Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):318–327, February 2020.
  • [16] Adrian Galdran, Jihed Chelbi, Riadh Kobi, José Dolz, Hervé Lombaert, Ismail ben Ayed, and Hadi Chakor. Non-uniform Label Smoothing for Diabetic Retinopathy Grading from Retinal Fundus Images with Deep Neural Networks. Translational Vision Science & Technology, 9(2):34–34, January 2020.
  • [17] Mike Voets, Kajsa Møllersen, and Lars Ailo Bongo. Reproduction study using public data of: Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. PLOS ONE, 14(6):e0217541, June 2019.
  • [18] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1492–1500, 2017.
  • [19] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, October 2018.
  • [20] Ziyuan Zhao, Kerui Zhang, Xuejie Hao, Jing Tian, Matthew Chin Heng Chua, Li Chen, and Xin Xu. BiRA-Net: Bilinear Attention Net for Diabetic Retinopathy Grading. In 2019 IEEE International Conference on Image Processing (ICIP), pages 1385–1389, September 2019. ISSN: 1522-4880.
  • [21] David J. Hand and Robert J. Till. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 45(2):171–186, November 2001.
  • [22] Patrice Bertail, Stéphan J. Clémençon, and Nicolas Vayatis. On Bootstrapping the ROC Curve. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, pages 137–144. Curran Associates, Inc., 2009.
  • [23] Teresa Araújo, Guilherme Aresta, Luís Mendonça, Susana Penas, Carolina Maia, Ângela Carneiro, Ana Maria Mendonça, and Aurélio Campilho. DR|GRADUATE: Uncertainty-aware deep learning-based diabetic retinopathy grading in eye fundus images. Medical Image Analysis, 63:101715, July 2020.
  • [24] Jordi de la Torre, Domenec Puig, and Aida Valls. Weighted kappa loss function for multi-class classification of ordinal data in deep learning. Pattern Recognition Letters, 105:144–154, April 2018.