With advances in technology, deep convolutional networks have become a fast and accurate means of carrying out semantic segmentation tasks, and they are widely used in 2D and 3D medical image analysis. The networks effectively learn to label a binary mask: 0 for every background pixel and 1 for every foreground pixel. Historically, the binary cross entropy loss emerged as the loss function of choice for this per-pixel labelling task. Given a ground truth and a predicted output, the loss is given by

$$CE = -\frac{1}{N}\sum_{n=1}^{N}\left[ y_n \log \hat{y}_n + (1 - y_n)\log(1 - \hat{y}_n) \right]$$

for all instances $n$, where $(1 - y_n)$ is the probability of the ground truth label being 0 and $(1 - \hat{y}_n)$ is the probabilistic output from a logistic sigmoid function predicted as 0. The loss generally works well for classification and segmentation as long as the labels for all classes are balanced. If one class dominates the other, the imbalance results in the network predicting all outputs to be the dominant class, due to convergence to a non-optimal local minimum. Some recently proposed loss functions, such as the dice loss and the focal loss, tackle this problem by weighting some outputs more than others.
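As a minimal illustration (plain NumPy rather than a training framework), the per-pixel loss above can be computed directly. The example also shows the imbalance failure mode the text describes: an "all background" prediction on a mask with very few foreground pixels still receives a low average loss.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Per-pixel binary cross entropy, averaged over all N pixels."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))

# A 100-pixel mask where only 2 pixels are foreground (heavy imbalance):
y_true = np.zeros(100)
y_true[:2] = 1.0

# A degenerate network that predicts "background everywhere" still
# scores a low loss, despite missing every foreground pixel:
all_background = np.full(100, 0.01)
print(binary_cross_entropy(y_true, all_background))  # ≈ 0.10
```

This is why a network can settle into the "predict the dominant class" local minimum: the averaged loss barely penalizes the rare class.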
General evaluation of these losses is done by calculating the overall overlap between the ground truth and the prediction. This intersection over union metric (Jaccard Index) is given by

$$J = \frac{|G \cap P|}{|G \cup P|}$$

where $G$ is the ground truth mask and $P$ is the predicted mask. In contrast, the Dice Index assigns a higher weight to the true positives, and is given by the formula

$$D = \frac{2|G \cap P|}{|G| + |P|}$$

Due to its high weight on the true positives, the Dice Index is also widely used as a loss function. The Tversky Index is another proposed function, which adds further weight to the false positives and false negatives to obtain better predictions:

$$T = \frac{|G \cap P|}{|G \cap P| + \alpha|P \setminus G| + \beta|G \setminus P|}$$

These similarity metrics are generally converted to loss functions by optimizing over a sum of their class-wise difference from the optimal value. Their general form is

$$L = \sum_{c=1}^{C}\left(1 - M_c\right)$$

where the metric $M$ can be the Jaccard, Dice or Tversky Index. The subscript indicates that the summation is over the number of classes, $C$. Many loss functions have also been proposed
as weighted combinations of these losses. In this paper, we propose to enhance the properties of the dice loss using our methodology. We conduct an extensive hyperparameter search for our function and empirically show that our technique leads to better convergence of the dice loss, even under less than optimal settings of our function. We compare with the state of the art for the same problems and show performance gains over it. We use the U-Net and FocusNet architectures to compare results. Our enhancement experiments use the dice loss due to its popularity in medical image segmentation tasks, but in theory any loss function could be used here. The rest of the paper is organized as follows. In Section 2, we discuss our loss function. Section 3 describes our evaluation. Results are presented in Section 4, and we conclude with Section 5.
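The overlap indices above are easy to state concretely. The following NumPy sketch (an illustration, not the paper's implementation) computes all three on binary masks; note that with $\alpha = \beta = 0.5$ the Tversky Index reduces to the Dice Index.

```python
import numpy as np

def jaccard(g, p):
    """Intersection over union of two binary masks."""
    g, p = g.astype(bool), p.astype(bool)
    return (g & p).sum() / (g | p).sum()

def dice(g, p):
    """Dice index: doubles the weight on the true positives."""
    g, p = g.astype(bool), p.astype(bool)
    return 2 * (g & p).sum() / (g.sum() + p.sum())

def tversky(g, p, alpha=0.5, beta=0.5):
    """Tversky index: alpha weights false positives, beta false negatives."""
    g, p = g.astype(bool), p.astype(bool)
    tp = (g & p).sum()
    fp = (~g & p).sum()
    fn = (g & ~p).sum()
    return tp / (tp + alpha * fp + beta * fn)

g = np.array([1, 1, 1, 0, 0, 0])   # ground truth
p = np.array([1, 1, 0, 1, 0, 0])   # prediction: 2 TP, 1 FP, 1 FN
print(jaccard(g, p))             # 0.5
print(dice(g, p))                # ≈ 0.667
print(tversky(g, p, 0.5, 0.5))   # ≈ 0.667 (equals Dice here)
```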
2 Adaptive Logarithmic Loss
We motivate the need for our loss based on the properties of a good loss function. Once a loss function computes the error between the prediction and the ground truth, the error is backpropagated through the network in order to make it learn. This fundamental task is generally carried out well by all loss functions, though some tend to converge faster than others. Empirically, the Tversky loss converges in fewer epochs than earlier losses such as CE or the Jaccard loss. A good loss function should not take too long to converge, and it is an added bonus if it speeds up convergence. Secondly, a loss function should be able to adapt to the loss landscape as it gets closer to convergence. Keeping these points in mind, we construct a loss function that can both converge at a faster rate and adaptively refine its landscape when closer to convergence. The formula for this adaptive loss is given by
$$L_{ALL}(DL) = \begin{cases} \omega \ln\left(1 + \frac{DL}{\epsilon}\right), & DL < \theta \\ DL - C, & \text{otherwise} \end{cases}$$

where $\epsilon$ is used to make the loss function differentiable and smooth at $DL = 0$, $C = \theta - \omega \ln(1 + \theta/\epsilon)$ joins the two branches continuously at $DL = \theta$, and DL is the computed dice loss. $\omega$, $\epsilon$ and $\theta$ are hyperparameters of this loss function. Further, as the dice loss lies in [0, 1], we experiment with values of $\theta$ in [0, 1] to find the optimal threshold at which to shift to a smoother log-based loss for convergence close to the minima. As log is a monotonic function, it smooths the convergence. The derivative of this loss can be computed via the chain rule, and is shown visually in Figure 1. Differentiating a function of a function results in the product of two derivatives, so $\frac{\partial L_{ALL}}{\partial w} = \frac{\partial L_{ALL}}{\partial DL} \cdot \frac{\partial DL}{\partial w}$, where the plot of $\frac{\partial L_{ALL}}{\partial DL}$ is shown in Figure 1. Hence, given any loss as input to our adaptive function, its derivative will be multiplied by a smooth differentiable function that in turn removes any discontinuities. The loss resembles the form of the wing loss and adaptive wing loss, but to the best of our knowledge this is the first time such a form has been adapted for an image segmentation task. After experimentation, we found the optimal values for the hyperparameters (Tables 1 and 2).
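A minimal NumPy sketch of this piecewise scheme follows. It assumes the wing-loss-style form the text says the loss resembles (a logarithmic branch below the threshold $\theta$, a linear branch above it, with the constant $C$ chosen so the two branches join continuously); the hyperparameter values shown are placeholders for illustration, not the tuned values from our search.

```python
import numpy as np

def adaptive_log_loss(dl, omega=1.0, eps=1.0, theta=0.1):
    """Adaptive logarithmic wrapper around a computed dice-loss value `dl`.

    Below the threshold `theta` the loss follows a smooth log curve;
    above it, it is linear. `C` shifts the linear branch so the two
    pieces meet continuously at dl == theta.
    """
    dl = np.asarray(dl, dtype=float)
    C = theta - omega * np.log1p(theta / eps)
    return np.where(dl < theta,
                    omega * np.log1p(dl / eps),
                    dl - C)

# The two branches agree at the threshold, so the loss is continuous:
print(adaptive_log_loss(0.1 - 1e-9), adaptive_log_loss(0.1 + 1e-9))
```

In an actual Keras training loop the same piecewise expression would wrap the dice-loss tensor using the backend's `log` and `where` ops rather than NumPy, so that gradients flow through it.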
3 Evaluation

We evaluate our loss on three datasets: the ISIC 2018 skin lesion segmentation dataset, the Data Science Bowl 2018 cell nuclei segmentation dataset, and the DRIVE retinal vessel segmentation dataset. We do not apply any preprocessing except resizing the images to a constant size and scaling the pixel values to [0, 1]. For the DRIVE dataset, we extract small patches from the images to construct our dataset. The data for all experiments is divided into an 80:20 split. To keep the evaluation fair, we do not use any augmentation strategies, as different augmentations can affect performance differently. We apply a grid-search style strategy to find the optimal hyperparameters of our loss, where we first run some initial tests to observe the behaviour of the loss for given hyperparameters, and then tune them. Initially, we fixed $\omega$ and tuned $\epsilon$ and $\theta$ to their optimal settings. Then we used the empirically estimated $\epsilon$ and $\theta$ to find the optimal value for $\omega$. The values obtained are shown in Tables 1 and 2. From the tables we can see that the loss is not greatly affected by changes in $\theta$, but even small changes in $\epsilon$ can cause significant changes in the loss value. This is graphically verified via the derivative of the loss in Figure 1, where we can see that for small values of $\epsilon$ the penalty for getting a prediction wrong is much larger. All experiments were run using Keras with a TensorFlow backend. Adam with a learning rate of 1e-4 was used, with a constant batch size of 16 throughout. Our implementation of the loss will be open sourced on GitHub. The experiments were run for a maximum of 50 epochs. To evaluate the performance of our loss, we compute the intersection over union (IoU) overlap, recall, specificity, F-measure and the area under the receiver operating characteristic curve (AUC-ROC) of the corresponding network predictions trained with the various loss functions.
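All of the threshold-based metrics listed above derive from the four confusion-matrix counts. A small NumPy sketch (illustrative, assuming already-binarized predictions; AUC-ROC additionally needs the continuous network scores and is omitted here):

```python
import numpy as np

def segmentation_metrics(y_true, y_pred):
    """IoU, recall, specificity and F-measure from two binary masks."""
    t, p = y_true.astype(bool), y_pred.astype(bool)
    tp = (t & p).sum()    # foreground correctly predicted
    fp = (~t & p).sum()   # background predicted as foreground
    fn = (t & ~p).sum()   # foreground missed
    tn = (~t & ~p).sum()  # background correctly predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "IoU": tp / (tp + fp + fn),
        "recall": recall,
        "specificity": tn / (tn + fp),
        "F-measure": 2 * precision * recall / (precision + recall),
    }

m = segmentation_metrics(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]))
print(m)  # one TP, FP, FN and TN each
```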
4 Results

We compare the performance of our loss (Table 3) with the Jaccard loss (JL), dice loss (DL), Tversky loss (TL), focal loss (FL) and the combo loss (CL). The ISIC 2018 dataset shows the least imbalance, and hence the results are fairly even for this dataset, though our loss gets the best IoU. This is also exhibited in Figure 2. The nuclei segmentation dataset exhibits more class imbalance, and here our loss shows significantly superior performance compared to all the other losses. We get significant gains over the baseline AUC, as shown in Figure 3. We do not compare on the DRIVE dataset by AUC or accuracy, as these metrics are fairly saturated for this dataset and do not offer any statistically significant insights. It is interesting to note that we get an improved F-measure score and the best-in-class IoU, which, given the large number of patches extracted, is statistically significant. Our loss shows significantly better performance than the baseline dice loss on all three datasets, which means that it manages to optimize to a significantly better minimum on the loss landscape, leading to a more optimal solution. In all cases, the trend shown by our loss is to converge to within a small delta of the optimal solution and then refine the convergence using the adaptive strategy. Without the adaptive strategy, our loss often gets stuck in a local minimum, which reiterates the importance of having such a piecewise continuous loss. The other loss functions (especially the dice loss) exhibit slightly unstable convergence and also take longer to reach within a delta of the optimal solution. We observed that our loss always converged faster than JL, DL, TL and CL. FL convergence is on par with our loss, while TL converges more smoothly.
Table 3: Per-method comparison across the ISIC, Data Science Bowl and DRIVE datasets.
5 Conclusion

In this paper, we proposed an enhanced loss function that constructs a loss landscape that is easier to traverse via backpropagation. We tested our approach on three datasets that show varied class imbalance. As the imbalance in the data increases, our loss provides a more robust solution to this prominent problem (along with faster convergence) compared to other state-of-the-art loss functions. We base these conclusions on carefully constructed evaluation metrics for each task that show superior performance in favour of our loss.
References

-  N. Abraham and N. M. Khan (2018) A novel focal Tversky loss function with improved attention U-Net for lesion segmentation. CoRR abs/1810.07842. Cited by: Table 4.
-  F. Isensee et al. (2018) nnU-Net: self-adapting framework for U-Net-based medical image segmentation. CoRR abs/1809.10486. Cited by: §1.
-  J. Staal, M. D. Abràmoff, M. Niemeijer, M. A. Viergever and B. van Ginneken (2004) Ridge-based vessel segmentation in color images of the retina. IEEE Transactions on Medical Imaging 23 (4), pp. 501–509. Cited by: §3.
-  N. Codella et al. (2019) Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the International Skin Imaging Collaboration (ISIC). CoRR abs/1902.03368. Cited by: §3.
-  S. A. Taghanaki et al. (2018) Combo loss: handling input and output imbalance in multi-organ segmentation. CoRR abs/1805.02798. Cited by: §1, §4.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He and P. Dollár (2017) Focal loss for dense object detection. CoRR abs/1708.02002. Cited by: §1, §4.
-  W. Zhu et al. (2018) AnatomyNet: deep 3D squeeze-and-excitation U-Nets for fast and fully automated whole-volume anatomical segmentation. CoRR abs/1808.05238. Cited by: §1.
-  Z.-H. Feng, J. Kittler, M. Awais, P. Huber and X.-J. Wu (2017) Wing loss for robust facial landmark localisation with convolutional neural networks. CoRR abs/1711.06753. Cited by: §2.
-  C. Kaul, S. Manandhar and N. Pears (2019) FocusNet: an attention-based fully convolutional network for medical image segmentation. CoRR abs/1902.03091. Cited by: §1, §3.
-  O. Ronneberger, P. Fischer and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351, pp. 234–241. Cited by: §1, §3.
-  S. S. M. Salehi, D. Erdogmus and A. Gholipour (2017) Tversky loss function for image segmentation using 3D fully convolutional deep networks. In Machine Learning in Medical Imaging, Cham, pp. 379–387. Cited by: §1, §4.
-  X. Wang, L. Bo and L. Fuxin (2019) Adaptive wing loss for robust face alignment via heatmap regression. CoRR abs/1904.07399. Cited by: §2.