cost_sensitive_loss_classification
A straightforward mechanism to implement cost-sensitive losses in PyTorch
Assessing the degree of disease severity in biomedical images is a task similar to standard classification but constrained by an underlying structure in the label space. Such a structure reflects the monotonic relationship between different disease grades. In this paper, we propose a straightforward approach to enforce this constraint for the task of predicting Diabetic Retinopathy (DR) severity from eye fundus images, based on the well-known notion of Cost-Sensitive classification. We expand standard classification losses with an extra term that acts as a regularizer, imposing greater penalties on predicted grades when they are farther away from the true grade associated with a particular image. Furthermore, we show how to adapt our method to the modeling of label noise in each of the sub-problems associated with DR grading, an approach we refer to as Atomic Sub-Task modeling. This yields models that can implicitly take into account the inherent noise present in DR grade annotations. Our experimental analysis on several public datasets reveals that, when a standard Convolutional Neural Network is trained using this simple strategy, improvements of 3-5% in quadratic-weighted kappa scores can be achieved at a negligible computational cost. Code to reproduce our results is released at https://github.com/agaldran/cost_sensitive_loss_classification.
Diabetes is regarded as a global eye-health issue, with a steadily increasing worldwide affected population, expected to reach 630 million individuals by 2045 [1]. Diabetic Retinopathy (DR) is a complication of diabetes caused by damage to the vasculature within the retina. DR shows early signs in the form of swelling micro-lesions that destroy small vessels and release blood into the retina. Advanced DR stages are characterized by the appearance of more noticeable symptoms, e.g. the proliferation of neovessels, leading to the detachment of the retinal layer and eventually permanent sight loss.
Retinal images acquired with fundus cameras are the tool of choice for discovering these early symptoms, representing an effective diagnostic tool suitable for automatic diagnostic systems [2]. In this context, and with the advent of Deep Learning in the last decade, a wide set of techniques has been proposed in recent years [3, 4, 5]. However, the vast majority of these works are designed for the screening task, i.e. distinguishing healthy individuals from patients at any stage of risk. Due to its difficulty, fewer works have addressed the task of DR grading, which consists of classifying an eye fundus image into one of the five categories proposed by the American Academy of Ophthalmology [6], illustrated in Fig. 1. In addition, most recent DR grading techniques [7, 8, 9] have focused on scaling up existing Convolutional Neural Networks by considering larger/better databases, but only a few works have addressed the design of customized loss functions that are more suitable for this task, which is the goal of this paper.
Cost-Sensitive classifiers are known to be useful for addressing two of the main challenges related to DR grading. First, they make it possible to model the underlying structure of a heterogeneous label space [10, 11, 12]. Second, they are beneficial for dealing with severely class-imbalanced scenarios [13, 14]. Despite this, to the best of our knowledge, no previous work has explored Cost-Sensitive loss minimization approaches in the context of DR grading from eye fundus images.
In this paper, we present a straightforward approach for integrating Cost-Sensitive classification constraints into the task of DR grading from retinal images. We choose to introduce these constraints by attaching an auxiliary Cost-Sensitive loss term to popular misclassification error functions, and by analyzing the impact of this process on the training of a standard CNN. In addition, we illustrate how to adapt our method to the modeling of label noise in each of the sub-problems associated with DR grading, an approach we refer to as Atomic Sub-Task modeling. We conduct a series of careful experiments demonstrating that expanding well-known loss functions with a Cost-Sensitive term brings noticeable performance increases, and that sub-task modeling leads to learning models that behave more similarly to human annotators.
In this section we first describe our approach to building Cost-Sensitive (CS) classifiers, and the loss functions we select as baselines, to which we will add a CS-regularizing term. We then show how CS classification can be employed to model label noise for DR grading problems, and detail the training process we followed to optimize the parameters of our models.
In order to induce a different penalty for each kind of error, let us first consider the case in which a model produces a prediction $\hat{y} \in \mathbb{R}^N$, containing one probability per grade. Such a prediction is to be compared with the corresponding label $y \in \mathcal{Y} = \{0, 1, \ldots, N-1\}$. For the sake of readability, in the following we will abuse notation and refer by $y$ indistinctly to an integer label and its one-hot-encoded counterpart, which takes a value of $1$ in the position corresponding to $y$ and $0$ elsewhere. Standard loss functions like the cross-entropy error, described by:

$$\mathcal{L}_{CE}(\hat{y}, y) = -\sum_{i=1}^{N} y_i \log(\hat{y}_i), \qquad (1)$$

are insensitive to any underlying structure in the label space $\mathcal{Y}$. This means that for a particular example $(x, y)$, if any permutation is applied on $\mathcal{Y}$, the resulting error will remain the same. In order to modify that behavior, we consider a cost matrix $M$ that encodes a null cost for a prediction $\hat{y}$ such that $\arg\max(\hat{y}) = y$, but a cost that increases along with the distance $|\arg\max(\hat{y}) - y|$.
A simple approach to achieve such an increasing label-dependent penalty is to encode these costs in the rows of $M$, and then compute the scalar product of $\hat{y}$ with the row of $M$ corresponding to $y$, i.e. $\langle M(y, \cdot), \hat{y} \rangle$. However, due to the high imbalance of the DR grading problem (with typically few examples of classes DR1, DR3, and DR4), in our experiments we noted that simply minimizing such a quantity would lead to models remaining stuck in local minima and classifying all images into the DR0 and DR2 classes. For this reason, we prefer to combine a CS term with a base loss $\mathcal{L}$ as follows:
$$\mathcal{L}_{CS}(\hat{y}, y) = \mathcal{L}(\hat{y}, y) + \lambda \, \langle M(y, \cdot), \hat{y} \rangle. \qquad (2)$$
In the above equation, we have selected the $L_2$-based ground cost matrix $M(i, j) = |i - j|^2$, since it fits nicely with the goal of maximizing the quadratic-weighted kappa score, but other cost matrices could easily be implemented if previous knowledge of the problem is available to be embedded in the loss function. We give an example of how to build different penalties in the next section.
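As a concrete illustration, a minimal PyTorch sketch of the CS-regularized loss of eq. (2) could look as follows; the class and argument names here are ours, not necessarily those used in the released repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CostSensitiveRegularizedLoss(nn.Module):
    """Base loss plus lam * <M(y, .), softmax(logits)>, as in eq. (2).
    A sketch: n_classes=5 matches the DR grades, exponent=2 gives the
    L2-based ground cost M(i, j) = |i - j|**2 discussed above."""

    def __init__(self, n_classes=5, lam=1.0, exponent=2):
        super().__init__()
        self.lam = lam
        # Ground cost matrix: zero on the diagonal, growing with grade distance.
        grades = torch.arange(n_classes, dtype=torch.float32)
        M = (grades.view(-1, 1) - grades.view(1, -1)).abs() ** exponent
        self.register_buffer("M", M)
        self.base = nn.CrossEntropyLoss()

    def forward(self, logits, target):
        probs = F.softmax(logits, dim=1)
        # Row of M selected by the true grade, dotted with predicted probabilities.
        cs_term = (self.M[target] * probs).sum(dim=1).mean()
        return self.base(logits, target) + self.lam * cs_term
```

With `lam=0` this reduces exactly to the base cross-entropy, which is how the unregularized baselines below can be recovered.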
As for the base loss, in this paper we consider three different alternatives, namely the above Cross-Entropy loss together with the Focal Loss and the Non-Uniform Label Smoothing loss. The Focal Loss was introduced for object detection tasks in [15], but it has become widely popular in classification tasks due to its ability to penalize errors on misclassified examples more heavily during training. In a multi-class setting, it is given by the following equation:
$$\mathcal{L}_{FL}(\hat{y}, y) = -\sum_{i=1}^{N} \alpha \, (1 - \hat{y}_i)^{\gamma} \, y_i \log(\hat{y}_i), \qquad (3)$$
with $\alpha$ being a weighting factor and $\gamma$ the so-called focusing parameter, which penalizes errors on wrongly classified examples more than errors on correctly classified ones.
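For reference, eq. (3) fits in a few lines of PyTorch. The sketch below uses common default choices for $\alpha$ and $\gamma$ (not necessarily the paper's settings), and reduces exactly to the cross-entropy when $\gamma = 0$ and $\alpha = 1$:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=1.0, gamma=2.0):
    """Multi-class focal loss, eq. (3): -alpha * (1 - p_t)**gamma * log(p_t),
    where p_t is the predicted probability of the true class.
    alpha/gamma defaults are illustrative."""
    logp = F.log_softmax(logits, dim=1)
    logp_t = logp.gather(1, target.view(-1, 1)).squeeze(1)  # log p of true class
    p_t = logp_t.exp()
    # (1 - p_t)**gamma down-weights well-classified examples.
    return (-alpha * (1.0 - p_t) ** gamma * logp_t).mean()
```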
The Non-Uniform Label Smoothing loss is a straightforward modification of the popular Label Smoothing technique in which neighboring labels receive more probability mass than farther-away ones. This process is described by the following formula [16]:

$$\mathcal{L}_{NULS}(\hat{y}, y) = \mathcal{L}_{CE}(\hat{y}, \, y \ast g_\sigma), \qquad (4)$$
where the actual labels are manipulated by means of a convolution with a Gaussian kernel $g_\sigma$, resulting in a lower penalty for neighboring grades and a greater loss value for far-away predictions. Differently from the Cross-Entropy and the Focal Loss, the Non-Uniform Label Smoothing strategy is sensitive to the label space structure. Yet, we hypothesize that further imposing a greater penalty on farther-away grades could benefit training based on this loss, as well as on the other two functions above. In our experiments, described below, we train several models by considering $\mathcal{L}$ in eq. (2) to be $\mathcal{L}_{CE}$, $\mathcal{L}_{FL}$, and $\mathcal{L}_{NULS}$, varying the hyperparameter $\lambda$ from $0$ (no CS regularization whatsoever) to greater CS penalties, and observe the resulting performance.
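One possible construction of the Gaussian-smoothed targets behind eq. (4) is sketched below; the kernel width $\sigma$ and the function names are illustrative, and the smoothing is renormalized per row rather than handled with explicit border padding:

```python
import torch

def gaussian_smoothed_labels(target, n_classes=5, sigma=1.0):
    """Replace a one-hot label with a normalized Gaussian centered on the
    true grade: the idea behind Non-Uniform Label Smoothing. A sketch;
    renormalization stands in for proper boundary handling at grades 0/4."""
    grades = torch.arange(n_classes, dtype=torch.float32)
    # exp(-(k - y)^2 / (2 sigma^2)) per grade k, one row per example.
    dist2 = (grades.view(1, -1) - target.view(-1, 1).float()) ** 2
    soft = torch.exp(-dist2 / (2.0 * sigma ** 2))
    return soft / soft.sum(dim=1, keepdim=True)

def nuls_loss(logits, target, sigma=1.0):
    """Cross-entropy against the smoothed label distribution, as in eq. (4)."""
    soft = gaussian_smoothed_labels(target, logits.shape[1], sigma)
    return -(soft * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()
```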
Annotating retinal images with the level of DR severity is known to be a noisy process, with high rates of inter-observer disagreement [7, 17]. In this paper we propose to leverage available data regarding the structure of that disagreement to improve DR grading accuracy. Our hypothesis is that, if the kind of noise affecting labels in the training data can be estimated, we can make a model aware of such noise via a CS mechanism similar to the one described in eq. (2).

Specifically, we consider the confusion matrix $K$ from the left-hand side of eq. (5). This matrix contains information collected in [7] regarding inter-observer disagreement between retinal specialists and an adjudicated consensus during the grading process of their clinical validation dataset. Interestingly, this matrix conveys not only information about which grades are most likely to be subject to expert disagreement, but it also tells us which grades are more often mistaken for which other grades. To formalize the above, we refer to the task of categorizing an image of actual DR grade $i$ into the $j$-th grade as an atomic subtask. For a given grade $i$, the amount of images actually belonging to that grade is $n_i = \sum_j K(i, j)$, and normalizing by $n_i$ provides an estimate of $\tilde{K}(i, j) = p(j \,|\, i)$, which denotes the likelihood that an annotator diagnoses an image as grade $j$ when it actually was of grade $i$, as shown in the right-hand side of eq. (5):
$$\tilde{K}(i, j) = \frac{K(i, j)}{\sum_{k} K(i, k)}. \qquad (5)$$
We assume below that matrices are indexed starting from $0$, so that row and column indices coincide with DR grades. By observing $\tilde{K}$ we can draw several conclusions, for example:
Annotators are likely to be greatly accurate when grading and images, as derived from and .
Around of images are likely to be incorrectly labeled ().
Only of incorrectly labeled images are likely to be labeled as .
Approximately of those incorrectly labeled images are likely to be labeled as .
Under the hypothesis that, in a dataset labeled by a single annotator, the reliability of the annotations will follow a distribution similar to the above, we can assume, for instance, that such a dataset will contain reliable labels concerning grades. However, we may also assume that when an image has been annotated as of grade , this is quite likely to be incorrect, and it may well be the case that such an image is actually of grade , since the corresponding atomic subtask holds a value comparable to .
Our goal is to impose on our models a penalty on erroneous predictions that takes into account all of the above information. That is, we want to penalize incorrect predictions when the label is likely to be reliable, but we are willing to be more tolerant with erroneous predictions if we know the associated label is unreliable. Embedding this knowledge into a loss function is easily accomplished using the CS loss formulation developed in the previous section: we consider $M_K = \mathbb{1} - \tilde{K}$ in eq. (2), $\mathbb{1}$ being the identity matrix. Higher values of $\tilde{K}(i, j)$ will result in lower penalties, whereas lower values lead to a greater penalty. Note, however, that for grades $i, j$ such that $\tilde{K}(i, j) = 0$, there is no useful information in terms of the relative reliability of these grades, e.g. such entries do not convey the information that mistaking an image for one far-away grade is harder than mistaking it for another. In those situations it might be better to rely on the penalty imposed by the ground cost matrix $M$ from eq. (2). For this reason, we suggest implementing an averaged Cost-Sensitive regularizer as:
$$\mathcal{L}_{AST}(\hat{y}, y) = \mathcal{L}(\hat{y}, y) + \frac{\lambda}{2} \, \big\langle M(y, \cdot) + M_K(y, \cdot), \, \hat{y} \big\rangle. \qquad (6)$$
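The construction of $M_K = \mathbb{1} - \tilde{K}$ and of the averaged cost matrix behind eq. (6) can be sketched as follows. The confusion counts would come from the inter-observer data of [7], which is not reproduced here; the function names are ours:

```python
import torch

def subtask_cost_matrix(confusion):
    """Row-normalize annotator confusion counts into p(predicted | true),
    then turn likelihood of confusion into a penalty: M_K = Id - K_tilde.
    A sketch of the construction around eq. (5)-(6)."""
    K = confusion.float()
    K_tilde = K / K.sum(dim=1, keepdim=True)   # eq. (5): p(j | i)
    return torch.eye(K.shape[0]) - K_tilde

def averaged_cost_matrix(confusion, exponent=2):
    """Average the distance-based ground cost M with the subtask cost M_K,
    which eq. (6) then applies with weight lambda/2 on each term."""
    n = confusion.shape[0]
    g = torch.arange(n, dtype=torch.float32)
    M = (g.view(-1, 1) - g.view(1, -1)).abs() ** exponent
    return 0.5 * (M + subtask_cost_matrix(confusion))
```

With a perfectly diagonal confusion matrix (fully reliable annotators), the subtask term vanishes and only the halved ground cost remains.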
We now describe the remaining training specifications aside from the loss functions.
To analyze the impact of minimizing CS-regularized loss functions on the problem of DR grading, we follow the process of varying the hyperparameter $\lambda$ in eq. (2). For each base loss function, we train a Convolutional Neural Network (CNN) by setting $\lambda = 0$ (no regularization) and two increasing non-zero values. If the best performance among these three experiments results from employing the largest $\lambda$, we increase $\lambda$ further and train the CNN again. This process is repeated until performance does not improve anymore.
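This selection loop can be summarized in a short, hypothetical helper; `train_and_eval` (mapping a value of $\lambda$ to a validation score), the starting grid, and the escalation factor are all illustrative assumptions:

```python
def search_lambda(train_and_eval, grid=(0.0, 1.0, 10.0)):
    """Try an initial grid of lambda values; while the largest lambda tried
    keeps winning, escalate it and train again. A sketch of the procedure
    described above, not the repository's exact search."""
    scores = {lam: train_and_eval(lam) for lam in grid}
    best = max(scores, key=scores.get)
    # Keep escalating lambda only while the largest value tried is the best.
    while best == max(scores):
        nxt = best * 10 if best > 0 else 1.0
        scores[nxt] = train_and_eval(nxt)
        if scores[nxt] <= scores[best]:
            break
        best = nxt
    return best, scores[best]
```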
As for the CNN, we select the ResNeXt-50 architecture based on its excellent classification accuracy in other multi-class problems [18], with weights initialized from training on the ImageNet dataset. We use Stochastic Gradient Descent with a fixed batch size and initial learning rate. Performance (quadratic-weighted kappa score) is monitored on an independent validation set. The learning rate is decreased whenever performance stagnates on the validation set, and training is stopped after several epochs of no further improvement. In addition, to mitigate the impact of class imbalance, we oversample minority classes [19].

In this section we describe the experimental setting we follow in order to validate our approach: considered datasets, comparing techniques, and numerical results.
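The training recipe described above (minority-grade oversampling, SGD, and learning-rate reduction on plateau) can be sketched in PyTorch as follows; the batch size, reduction factor, patience, and the tiny stand-in dataset are illustrative, since the exact values are not given here:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Stand-in data: random images with DR-grade-like labels (illustrative only).
labels = torch.randint(0, 5, (100,))
images = torch.randn(100, 3, 8, 8)

# Oversample minority grades: sampling weight inversely proportional to class count.
counts = torch.bincount(labels, minlength=5).float()
weights = 1.0 / counts[labels]
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(images, labels), batch_size=8, sampler=sampler)

# A toy model stands in for the ImageNet-pretrained ResNeXt-50.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 5))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# "max" mode: reduce the LR when validation quad-kappa (higher is better) stagnates.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=5)
```

Each epoch one would iterate `loader`, back-propagate the chosen CS-regularized loss, and call `scheduler.step(val_kappa)` with the validation score.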
We consider as our primary dataset the Eyepacs database (https://www.kaggle.com/c/diabetic-retinopathy-detection), the largest public dataset with DR grading labels. It contains around 90,000 high-resolution retinal fundus images (approximately 35,000 are assigned to the training set, from which we hold out a subset for validation, and 55,000 are reserved for testing purposes). The Eyepacs dataset contains a considerable amount of low-quality images and label noise [17]. Therefore, it represents an interesting testbed for observing the robustness of DR grading algorithms.
As a secondary test set, we also consider the Messidor-2 dataset [4], which contains retinal images from a separate set of patients. In this case, we employ the ground-truth labels released by [7], available online (https://www.kaggle.com/google-brain/messidor2-dr-grades). These labels were extracted from a process of consensus adjudication by three retinal specialists, and they are therefore of much better quality than the Eyepacs ground-truth.
For performance assessment, we use as the main metric of interest the quadratic-weighted kappa score (quad-kappa), which is typically used to assess inter-observer variability and is a very popular metric for this task. As further measures of correlation, we also analyze the Average of Classification Accuracy (ACA, the mean of the diagonal of a normalized confusion matrix [20]) and the Kendall $\tau$ coefficient. We also report the mean Area Under the Receiver Operating Characteristic curve (mAUC) in its multi-class extension, obtained by considering each possible class pair [21]. For statistical testing, expert labels and model predictions in each of the two test sets (Eyepacs and Messidor-2) are bootstrapped [22] (n=1000) in a stratified manner with respect to the relative presence of DR grades. Performance differences for each metric are derived in each bootstrap sample, and p-values are computed to test significance, with a fixed statistical significance level in each case.
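A minimal sketch of the bootstrapped quad-kappa evaluation, using scikit-learn's `cohen_kappa_score`; unlike the protocol above, this simplified version does not stratify the resampling by DR grade:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def bootstrap_quad_kappa(y_true, y_pred, n_boot=1000, seed=0):
    """Quadratic-weighted kappa with a plain bootstrap over test samples.
    Returns (mean, std) over bootstrap replicates; a non-stratified sketch."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        scores.append(cohen_kappa_score(y_true[idx], y_pred[idx],
                                        weights="quadratic"))
    return float(np.mean(scores)), float(np.std(scores))
```

The mean/std pair corresponds to the "score ± deviation" entries reported in the tables below.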
After training a CNN by minimizing each of the three considered base losses (Cross-Entropy, Focal Loss, and Non-Uniform Label Smoothing) with different degrees of regularization, we select the best model and compute results first on the Eyepacs test set. We denote the unregularized models by CE, FL, and NULS respectively, and their regularized counterparts by CE-CS, FL-CS, and NULS-CS.
We then select the best hyperparameter setting for each regularized model (the same value of $\lambda$ in all cases), and retrain the same model, this time using our proposed Atomic Sub-Task modeling, denoted by an -AST suffix in each case. We compile in Table 1 the obtained results in terms of quad-kappa, mean AUC, ACA, and Kendall $\tau$, for all the described options.

Table 1:

|            | quad-kappa   | mAUC         | ACA          | Kendall      |
|------------|--------------|--------------|--------------|--------------|
| CE         | 75.76 ± 0.31 | 87.35 ± 0.14 | 51.32 ± 0.44 | 67.35 ± 0.31 |
| CE-CS      | 77.27 ± 0.30 | 88.42 ± 0.14 | 53.26 ± 0.42 | 69.89 ± 0.30 |
| CE-AST     | 77.39 ± 0.29 | 88.49 ± 0.13 | 54.12 ± 0.44 | 69.25 ± 0.30 |
| Focal Loss | 74.72 ± 0.34 | 86.63 ± 0.16 | 51.90 ± 0.44 | 65.38 ± 0.32 |
| FL-CS      | 77.38 ± 0.31 | 88.58 ± 0.14 | 54.11 ± 0.45 | 69.45 ± 0.30 |
| FL-AST     | 77.94 ± 0.29 | 88.90 ± 0.13 | 54.71 ± 0.43 | 70.45 ± 0.29 |
| NULS       | 77.09 ± 0.30 | 88.44 ± 0.14 | 53.02 ± 0.46 | 69.47 ± 0.29 |
| NULS-CS    | 77.91 ± 0.30 | 88.82 ± 0.14 | 54.55 ± 0.44 | 70.14 ± 0.30 |
| NULS-AST   | 78.71 ± 0.28 | 89.05 ± 0.13 | 54.57 ± 0.46 | 71.04 ± 0.30 |
Finally, we report in Table 2 the performance of our best model (using NULS as a base loss together with Atomic Sub-Task modeling) in comparison with the techniques proposed in [23], [24], and [20], on the test sets of both Eyepacs and Messidor-2. We also provide confusion matrices for the Eyepacs test set in Fig. 2.
Table 2 (quad-kappa / ACA):

|            | [23]          | QWK-L [24]   | BiraNet [20] | NULS-AST                    |
|------------|---------------|--------------|--------------|-----------------------------|
| Eyepacs    | 74.00 / 53.6  | 74.00 / n.a. | n.a. / 54.31 | 78.71 ± 0.28 / 54.57 ± 0.46 |
| Messidor-2 | 71.00 / 59.60 | n.a. / n.a.  | n.a. / n.a.  | 79.79 ± 1.03 / 63.41 ± 1.99 |
The results in Table 1 clearly show that introducing Cost-Sensitive regularization results in noticeable improvements, particularly when measuring performance in terms of quad-kappa. This is meaningful, since the considered cost matrix was selected so as to quadratically penalize distance in the label space for erroneous predictions. The quad-kappa score improved by amounts ranging from 2.66 when regularizing the Focal Loss to 0.82 for NULS. The smaller gain for NULS could also be expected, since NULS already introduces some asymmetry in the way DR grades are treated. If Atomic Sub-Task modeling is considered, these improvements are even greater when compared with the unregularized counterparts: from an increase of 3.22 for the Focal Loss to an increase of 1.62 for NULS. It is also worth noticing that the confusion matrix resulting from training with Atomic Sub-Task modeling shows a certain similarity to the inter-observer disagreement matrix on the left-hand side of eq. (5), especially when compared with the confusion matrices produced by other techniques, as shown in Fig. 2.
It should be stressed that the performance in Table 1 is not comparable to the results of the competition that published the data. There are several reasons for this: the heuristics for ranking optimization common to these competitions, or the fact that participants were allowed to submit predictions on part of the testing data during the competition. In addition, the lack of cross-dataset experimentation complicates evaluating generalization ability. In contrast, the approach proposed here is a general improvement over standard techniques, not limited to the DR grading problem, and one which generalizes to other datasets, as Table 2 shows.

Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy. Ophthalmology, 125(8):1264–1272, August 2018.