A straightforward mechanism to implement cost sensitive losses in pytorch
Assessing the degree of disease severity in biomedical images is a task similar to standard classification but constrained by an underlying structure in the label space. Such a structure reflects the monotonic relationship between different disease grades. In this paper, we propose a straightforward approach to enforce this constraint for the task of predicting Diabetic Retinopathy (DR) severity from eye fundus images, based on the well-known notion of Cost-Sensitive classification. We expand standard classification losses with an extra term that acts as a regularizer, imposing greater penalties on predicted grades when they are farther away from the true grade associated with a particular image. Furthermore, we show how to adapt our method to the modeling of label noise in each of the sub-problems associated with DR grading, an approach we refer to as Atomic Sub-Task modeling. This yields models that can implicitly take into account the inherent noise present in DR grade annotations. Our experimental analysis on several public datasets reveals that, when a standard Convolutional Neural Network is trained using this simple strategy, improvements of 3-5% in quadratic-weighted kappa scores can be achieved at a negligible computational cost. Code to reproduce our results is released at https://github.com/agaldran/cost_sensitive_loss_classification.
Diabetes is regarded as a global eye health issue, with a steadily increasing world-wide affected population, expected to reach 630 million individuals by 2045. Diabetic Retinopathy (DR) is a complication of diabetes, caused by damage to the vasculature within the retina. DR shows early signs in the form of swelling micro-lesions that damage small vessels and release blood into the retina. Advanced DR stages are characterized by the appearance of more noticeable symptoms, e.g. proliferation of neo-vessels, leading to the detachment of the retinal layer and eventually permanent sight loss.
Retinal images acquired with fundus cameras are the tool of choice for discovering these early symptoms, representing an effective diagnostic tool well suited to automatic diagnostic systems.
In this context, and with the advent of Deep Learning in the last decade, a wide set of techniques has been proposed in recent years [3, 4, 5]. However, the vast majority of these works are designed for the screening task, i.e. distinguishing healthy individuals from patients at any stage of risk. Due to its difficulty, fewer works have addressed the task of DR grading, which consists of classifying an eye fundus image into one of the five categories proposed by the American Academy of Ophthalmology, illustrated in Fig. 1. In addition, most recent DR grading techniques [7, 8, 9] have focused on scaling up existing Convolutional Neural Networks by considering larger/better databases, but only a few works have addressed the design of customized loss functions better suited to this task, which is the goal of this paper.
Cost-Sensitive classifiers are known to be useful for addressing two of the main challenges related to DR grading. First, they make it possible to model the underlying structure of a heterogeneous label space [10, 11, 12]. Second, they are beneficial for dealing with severely class-imbalanced scenarios [13, 14]. Despite this, to the best of our knowledge, no previous work has explored Cost-Sensitive loss minimization approaches in the context of DR grading from eye fundus images.
In this paper, we present a straightforward approach for integrating Cost-Sensitive classification constraints into the task of DR grading from retinal images. We choose to introduce these constraints by attaching an auxiliary Cost-Sensitive loss term to popular misclassification error functions, and by analyzing the impact of this process on the training of a standard CNN. In addition, we illustrate how to adapt our method to the modeling of label noise in each of the sub-problems associated with DR grading, an approach we refer to as Atomic Sub-Task modeling. We conduct a series of careful experiments demonstrating that expanding well-known loss functions with a Cost-Sensitive term brings noticeable performance increases, and that sub-task modeling leads to learning models that behave more similarly to human annotators.
In this section we first describe our approach to build Cost-Sensitive (CS) classifiers, and the loss functions we select as baselines, to which we will add a CS-regularizing term. We then show how CS can be employed to model label noise for DR grading problems, and detail the training process we followed to optimize the parameters of our models.
In order to induce a different penalty for each kind of error, let us first consider the case in which a model produces a prediction $\hat{y} \in \mathbb{R}^N$, where $N$ is the number of classes. Such a prediction is to be compared with the corresponding label $y$. For the sake of readability, in the following we will abuse notation and refer by $y$ indistinctly to an integer label and its one-hot-encoded counterpart, which takes a value of 1 in the position corresponding to $y$ and 0 elsewhere.
Standard loss functions like the cross-entropy error, described by:

$$\mathcal{L}_{CE}(\hat{y}, y) = -\sum_{i=1}^{N} y_i \log(\hat{y}_i), \qquad (1)$$

are insensitive to any underlying structure in the label space. This means that for a particular example $(x, y)$, if any permutation is applied on the entries of $\hat{y}$ other than the one corresponding to $y$, the resulting error will remain the same. In order to modify this behavior, we consider a cost matrix $M$ that encodes a null cost for a prediction $\hat{y}$ such that $\arg\max(\hat{y}) = y$, but a cost that increases along with the distance $|\arg\max(\hat{y}) - y|$.
A simple approach to achieve such an increasing label-dependent penalty is to encode those costs in each row of $M$, and then compute the scalar product of $\hat{y}$ with the row of $M$ corresponding to the ground-truth grade $y$, i.e. $\langle M(y), \hat{y} \rangle$. However, due to the high imbalance of the DR grading problem (with typically few examples of classes DR1, DR3, and DR4), in our experiments we noted that simply minimizing such a quantity would lead to models remaining stuck in local minima and classifying all images into the DR0 and DR2 classes. For this reason, we prefer to combine a CS term with a base loss $\mathcal{L}_{base}$ as follows:

$$\mathcal{L}_{CS}(\hat{y}, y) = \mathcal{L}_{base}(\hat{y}, y) + \lambda \, \langle M(y), \hat{y} \rangle, \qquad (2)$$

where $\lambda$ balances the contribution of the Cost-Sensitive regularizer.
In the above equation, we have selected the $L_2$-based ground cost matrix, $M_{i,j} = |i - j|^2$, since it fits nicely with the goal of maximizing the quadratic-weighted kappa score, but other cost matrices could easily be implemented if previous knowledge of the problem is available to be embedded in the loss function. We give an example of how to build different penalties in the next section.
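The CS-regularized loss of eq. (2) is simple to implement in PyTorch. The following is a minimal sketch, not the authors' official implementation (which lives in the linked repository): the class and argument names here are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CostSensitiveRegularizedLoss(nn.Module):
    """Base loss plus lambda * <M(y), softmax(logits)>, a sketch of eq. (2)."""
    def __init__(self, n_classes=5, lam=1.0, base_loss=None):
        super().__init__()
        self.lam = lam
        self.base_loss = base_loss or nn.CrossEntropyLoss()
        # L2-based ground cost matrix: M[i, j] = |i - j|**2, null cost on the diagonal
        grades = torch.arange(n_classes, dtype=torch.float32)
        M = (grades.view(-1, 1) - grades.view(1, -1)) ** 2
        self.register_buffer("M", M)

    def forward(self, logits, target):
        probs = F.softmax(logits, dim=1)
        # scalar product of the predicted distribution with the cost row of the true grade
        cs_term = (self.M[target] * probs).sum(dim=1).mean()
        return self.base_loss(logits, target) + self.lam * cs_term
```

With a confident, correct prediction both terms are near zero; predicting a distant grade inflates the loss through the cost row of the true grade.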
As for the base loss, in this paper we consider three different alternatives, namely the above Cross-Entropy loss together with the Focal Loss and the Non-Uniform Label Smoothing loss. The Focal Loss was introduced for object detection tasks, but it has become widely popular in classification tasks due to its ability to penalize misclassified examples during training. In a multi-class setting, it is given by the following equation:

$$\mathcal{L}_{FL}(\hat{y}, y) = -\sum_{i=1}^{N} \alpha \, (1 - \hat{y}_i)^{\gamma} \, y_i \log(\hat{y}_i), \qquad (3)$$

$\alpha$ being a weighing factor and $\gamma$ the so-called focusing parameter, which penalizes errors in wrongly classified examples more than errors in correctly classified ones.
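A possible PyTorch sketch of this multi-class Focal Loss follows; the `alpha`/`gamma` defaults are common choices from the object-detection literature, not necessarily the settings used in this paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=1.0, gamma=2.0):
    """Multi-class focal loss (illustrative sketch)."""
    logp = F.log_softmax(logits, dim=1)
    logp_t = logp.gather(1, target.view(-1, 1)).squeeze(1)  # log-prob of the true class
    p_t = logp_t.exp()
    # the (1 - p_t)^gamma factor down-weights easy, well-classified examples
    return (-alpha * (1.0 - p_t) ** gamma * logp_t).mean()
```

Setting `gamma=0` and `alpha=1` recovers the plain cross-entropy, which is a convenient sanity check.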
The Non-Uniform Label Smoothing loss is a straightforward modification of the popular Label Smoothing technique in which neighboring labels receive more probability mass than farther-away ones. This process is described by the following formula:

$$\mathcal{L}_{NULS}(\hat{y}, y) = \mathcal{L}_{CE}(\hat{y}, y \ast G_{\sigma}), \qquad (4)$$

where the actual labels are manipulated by means of a convolution with a Gaussian kernel $G_{\sigma}$, resulting in a lower penalty for neighboring grades and a greater loss value for far-away predictions. Differently from the Cross-Entropy and the Focal loss, the Non-Uniform Label Smoothing strategy is sensitive to the label space structure. Yet, we hypothesize that further imposing a greater penalty on farther-away grades could benefit training based on this loss, as well as on the other two functions above. In our experiments, described below, we train several models by considering $\mathcal{L}_{base}$ to be $\mathcal{L}_{CE}$, $\mathcal{L}_{FL}$, and $\mathcal{L}_{NULS}$, varying the hyper-parameter $\lambda$ from $\lambda = 0$ (no CS regularization whatsoever) to greater CS penalty, and observe the resulting performance.
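The Gaussian smoothing of the labels can be sketched as follows; the function names and the `sigma` value are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def nuls_targets(target, n_classes=5, sigma=1.0):
    """Non-uniform label smoothing: place a Gaussian bump at the true grade,
    renormalized to sum to 1, so neighboring grades get more mass (sketch)."""
    grades = torch.arange(n_classes, dtype=torch.float32)
    d2 = (grades.view(1, -1) - target.view(-1, 1).float()) ** 2
    w = torch.exp(-d2 / (2 * sigma ** 2))
    return w / w.sum(dim=1, keepdim=True)

def nuls_loss(logits, target, n_classes=5, sigma=1.0):
    """Cross-entropy against the smoothed (soft) labels."""
    soft = nuls_targets(target, n_classes, sigma)
    return -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```

The smoothed target is a valid probability distribution whose mass decays with the distance from the true grade.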
Annotating retinal images with the level of DR severity is known to be a noisy process, with high rates of inter-observer disagreement [7, 17]. In this paper we propose to leverage available data regarding the structure of that disagreement to improve DR grading accuracy. Our hypothesis is that, if the kind of noise affecting labels in the training data can be estimated, we can make a model aware of such noise via a CS mechanism similar to the one described in eq. (2).
Specifically, we consider the confusion matrix $C$ from the left-hand side of eq. (5). This matrix contains information, collected in prior work, regarding inter-observer disagreement between retinal specialists and an adjudicated consensus during the grading process of their clinical validation dataset. Interestingly, this matrix conveys not only information about which grades are most likely to be the subject of expert disagreement, but it also tells us which grades are most often mistaken for which other grades.
To formalize the above, we refer to the task of categorizing an image of actual DR grade $i$ into the $j$-th grade as $t_{i \to j}$, and we refer to these processes as atomic sub-tasks. For a given grade $i$, the amount of images actually belonging to that grade is $n_i = \sum_j C_{i,j}$, and normalizing $C_{i,j}$ by $n_i$ provides an estimate of $p(t_{i \to j})$, which denotes the likelihood that an annotator diagnoses an image as of grade $j$ when it actually was of grade $i$, as shown in the right-hand side of eq. (5).
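The row-normalization step above can be sketched with a hypothetical confusion matrix; the numbers below are invented for illustration and are not the matrix used in the paper.

```python
import numpy as np

# Hypothetical inter-observer confusion matrix C (rows: true grade,
# columns: assigned grade). Purely illustrative values.
C = np.array([[90,  8,  2,  0,  0],
              [10, 70, 18,  2,  0],
              [ 2, 12, 75, 10,  1],
              [ 0,  2, 15, 75,  8],
              [ 0,  0,  3, 12, 85]], dtype=float)

n_i = C.sum(axis=1, keepdims=True)  # n_i: number of images of true grade i
P = C / n_i                         # P[i, j] estimates p(t_{i -> j})
```

Each row of `P` is then a probability distribution over the grades an annotator is likely to assign to an image of true grade `i`.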
We assume below that matrices are indexed starting from 0, i.e. grades $i, j \in \{0, \dots, 4\}$. By observing $p(t_{i \to j})$ we can draw several conclusions, for example:
Annotators are likely to be greatly accurate when grading and images, as derived from and .
Around of images are likely to be incorrectly labeled ().
Only of incorrectly labeled images are likely to be labeled as .
Approximately of those incorrectly labeled images are likely to be labeled as .
Under the hypothesis that, in a dataset labeled by a single annotator, the reliability of the annotations will follow a distribution similar to the above, we can assume, for instance, that such a dataset will contain reliable labels for some grades. However, we may also assume that when an image has been annotated with one of the less reliable grades, this is quite likely to be incorrect, and it may well be the case that such an image actually belongs to a neighboring grade, since the corresponding atomic sub-task holds a comparable likelihood.
Our goal is to impose on our models a penalty on erroneous predictions that takes all of the above information into account. That is, we want to penalize incorrect predictions when the label is likely to be reliable, but we are willing to be more tolerant with erroneous predictions if we know the associated label is unreliable.
Embedding this knowledge into a loss function is easily accomplished using the CS loss formulation developed in the previous section: we consider $M_{conf} = \mathbb{I} - \hat{C}$ in eq. (2), $\mathbb{I}$ being the identity matrix and $\hat{C}$ the row-normalized confusion matrix. Higher values of $\hat{C}_{i,j}$ will result in lower penalties, whereas lower values lead to a greater penalty.
Note, however, that for grades $i, j$ such that $\hat{C}_{i,j} = 0$, there is no useful information in terms of the relative reliability of these grades; e.g. a null entry does not convey the information that it is harder to misdiagnose an image of a given grade as a distant grade than as a neighboring one. In those situations it might be better to rely on the penalty imposed by $M$ from eq. (2). For this reason, we suggest implementing an averaged Cost-Sensitive regularizer, combining the ground cost matrix $M$ and the confusion-derived cost matrix $M_{conf}$, as:

$$M_{avg} = \tfrac{1}{2}\left(M + M_{conf}\right). \qquad (6)$$
We now describe the remaining training specifications, aside from the loss functions.
For analyzing the impact of minimizing CS-regularized loss functions on the problem of DR grading, we follow the process of varying the hyper-parameter $\lambda$ in eq. (2). For each base loss function, we train a Convolutional Neural Network (CNN) by setting $\lambda = 0$ (no regularization) and two increasingly larger values of $\lambda$. If the best performance of these three experiments results from employing the largest value, we increase $\lambda$ further and train the CNN again. This process is repeated until performance does not improve anymore.
As for the CNN, we select the ResNeXt50 architecture based on its excellent classification accuracy in other multi-class problems. Performance (quadratic kappa score) is monitored on an independent validation set. The learning rate is decreased by a constant factor whenever performance stagnates on the validation set, and training is stopped after a fixed number of epochs with no further improvement. In addition, to mitigate the impact of class imbalance, we oversample minority classes.
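The minority-class oversampling step can be sketched with PyTorch's WeightedRandomSampler; the inverse-frequency weighting below is a common choice and an assumption on our part, since the paper does not detail the exact scheme.

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def balanced_sampler(labels):
    """Oversample minority DR grades: each image is drawn with probability
    inversely proportional to its class frequency (illustrative sketch)."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)          # images per grade
    weights = 1.0 / counts[labels]        # rare grades get larger sampling weight
    return WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                 num_samples=len(labels), replacement=True)
```

Passing such a sampler to a DataLoader makes every grade roughly equally represented in each epoch, regardless of the raw class frequencies.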
In this section we describe the experimental setting we follow in order to validate our approach: considered datasets, compared techniques, and numerical results.
We consider as our primary dataset the Eyepacs database (https://www.kaggle.com/c/diabetic-retinopathy-detection), the largest public dataset with DR grading labels. It contains around 80,000 high-resolution retinal fundus images: approximately 35,000 are assigned to the training set, from which we hold out a subset for validation, and 55,000 are held out for testing purposes. The Eyepacs dataset contains a considerable amount of low-quality images and label noise. Therefore, it represents an interesting test-bed to observe the robustness of DR grading algorithms.
As a secondary test set, we also consider the Messidor-2 dataset, which contains 1748 images corresponding to 874 patients. In this case, we employ publicly released ground-truth labels, available online (https://www.kaggle.com/google-brain/messidor2-dr-grades). These labels are extracted from a process of consensus adjudication by three retinal specialists, and they are therefore of much better quality than the Eyepacs ground-truth.
For performance assessment, we apply as the main metric of interest the quadratic-weighted kappa score (quad-kappa), which is typically used to assess inter-observer variability and is a very popular metric for this task. As further measures of correlation, we also analyze the Average of Classification Accuracy (ACA, the mean of the diagonal of a normalized confusion matrix) and the Kendall-τ coefficient. We also report the mean Area Under the Receiver Operating Characteristic curve in its multi-class extension, obtained by considering each possible class pair. For statistical testing, expert labels and model predictions in both test sets (Eyepacs and Messidor-2) are bootstrapped (n=1000) in a stratified manner with respect to the relative presence of DR grades. Performance differences for each metric are derived in each bootstrap iteration, and p-values are computed to test significance at a fixed significance level.
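The evaluation protocol above can be sketched as follows. `quad_kappa` follows the standard definition of the quadratic-weighted kappa; the stratified-bootstrap helper and its naming are ours.

```python
import numpy as np

def quad_kappa(y_true, y_pred, n_classes=5):
    """Quadratic-weighted kappa (standard definition)."""
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    O /= O.sum()
    E = np.outer(O.sum(axis=1), O.sum(axis=0))        # expected agreement by chance
    g = np.arange(n_classes)
    W = (g[:, None] - g[None, :]) ** 2 / (n_classes - 1) ** 2  # quadratic weights
    return 1.0 - (W * O).sum() / (W * E).sum()

def stratified_bootstrap_kappa(y_true, y_pred, n_boot=1000, seed=0):
    """Bootstrap kappa, resampling within each true DR grade so the relative
    presence of grades is preserved (sketch of the protocol)."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = np.concatenate([rng.choice(np.flatnonzero(y_true == g),
                                         size=(y_true == g).sum(), replace=True)
                              for g in np.unique(y_true)])
        scores.append(quad_kappa(y_true[idx], y_pred[idx]))
    return np.array(scores)
```

The resulting score distributions for two models can then be compared to derive p-values for the observed performance differences.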
After training a CNN by minimizing each of the three considered base losses (Cross-Entropy, Focal Loss, and Non-Uniform Label Smoothing) with different degrees of regularization, we select the best model and compute results first on the Eyepacs test set. We denote the unregularized models by CE, FL, and NULS respectively, and their regularized counterparts as CE-CS, FL-CS, and NULS-CS.
We then select the best hyperparameter setting for each regularized model, and retrain the same model, but this time using our proposed Atomic Sub-Task modeling, denoted by an AST suffix in each case. We compile in Table 1 the obtained results in terms of quadratic κ-score, mean AUC, ACA, and Kendall-τ, for all the described options.
| Method     | Quad. κ      | AUC          | ACA          | Kendall-τ    |
|------------|--------------|--------------|--------------|--------------|
| CE         | 75.76 ± 0.31 | 87.35 ± 0.14 | 51.32 ± 0.44 | 67.35 ± 0.31 |
| CE-CS      | 77.27 ± 0.30 | 88.42 ± 0.14 | 53.26 ± 0.42 | 69.89 ± 0.30 |
| CE-AST     | 77.39 ± 0.29 | 88.49 ± 0.13 | 54.12 ± 0.44 | 69.25 ± 0.30 |
| Focal Loss | 74.72 ± 0.34 | 86.63 ± 0.16 | 51.90 ± 0.44 | 65.38 ± 0.32 |
| FL-CS      | 77.38 ± 0.31 | 88.58 ± 0.14 | 54.11 ± 0.45 | 69.45 ± 0.30 |
| FL-AST     | 77.94 ± 0.29 | 88.90 ± 0.13 | 54.71 ± 0.43 | 70.45 ± 0.29 |
| NULS       | 77.09 ± 0.30 | 88.44 ± 0.14 | 53.02 ± 0.46 | 69.47 ± 0.29 |
| NULS-CS    | 77.91 ± 0.30 | 88.82 ± 0.14 | 54.55 ± 0.44 | 70.14 ± 0.30 |
| NULS-AST   | 78.71 ± 0.28 | 89.05 ± 0.13 | 54.57 ± 0.46 | 71.04 ± 0.30 |
Finally, we report in Table 2 the performance of our best model (using NULS as a base loss together with Atomic Sub-Task modeling) in comparison with previously proposed techniques, on the test sets of both Eyepacs and Messidor-2. We also provide confusion matrices for the Eyepacs test set in Fig. 2.
[Table 2: comparison of QWKL, Bira-Net, and NULS-AST on both test sets; numeric entries not recovered.]
Results in Table 1 clearly show that introducing Cost-Sensitive regularization results in noticeable improvements, particularly when measuring performance in terms of the quadratic κ-score. This is meaningful, since the considered cost matrix was selected so as to quadratically penalize distance in the label space for erroneous predictions. The quadratic κ-score improved by amounts ranging from 2.66 points when regularizing the Focal loss to 0.82 points for NULS. The smaller gain for NULS could also be expected, since NULS already introduces some asymmetry in the way DR grades are treated. If Atomic Sub-Task modeling is considered, these improvements are even greater when compared with the unregularized counterparts: from an increase of 3.22 points for the Focal Loss to an increase of 1.62 points for NULS. It is also worth noticing that the confusion matrix resulting from training with Atomic Sub-Task modeling shows a certain similarity to the inter-observer disagreement matrix on the left-hand side of eq. (5), especially when compared with the confusion matrices produced by other techniques, as shown in Fig. 2.
It should be stressed that the performance in Table 1 is not comparable to the results of the competition that published the data. There are several reasons for this: the heuristics for ranking optimization common to these competitions, or the fact that participants were allowed to submit predictions on part of the testing data during the competition. In addition, the lack of cross-dataset experimentation complicates evaluating generalization ability. In contrast, the approach proposed here is a general improvement over standard techniques, not limited to the DR grading problem, and one which generalizes to other datasets, as Table 2 shows.
Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy. Ophthalmology, 125(8):1264–1272, August 2018.