
Penalizing small errors using an Adaptive Logarithmic Loss

by   Chaitanya Kaul, et al.

Loss functions are error metrics that quantify the difference between a prediction and its corresponding ground truth. Fundamentally, they define a functional landscape for traversal by gradient descent. Although numerous loss functions have been proposed to date in order to handle various machine learning problems, little attention has been given to enhancing these functions to better traverse the loss landscape. In this paper, we simultaneously and significantly mitigate two prominent problems in medical image segmentation namely: i) class imbalance between foreground and background pixels and ii) poor loss function convergence. To this end, we propose an adaptive logarithmic loss function. We compare this loss function with the existing state-of-the-art on the ISIC 2018 dataset, the nuclei segmentation dataset as well as the DRIVE retinal vessel segmentation dataset. We measure the performance of our methodology on benchmark metrics and demonstrate state-of-the-art performance. More generally, we show that our system can be used as a framework for better training of deep neural networks.




1 Introduction

With advances in technology, deep convolutional networks have become a fast and accurate means of carrying out semantic segmentation tasks, and they are widely used in 2D and 3D medical image analysis. The networks effectively learn to label a binary mask, assigning 0 to every background pixel and 1 to every foreground pixel. Historically, the binary cross entropy loss emerged as the loss function of choice for this per-pixel labelling task. Given a ground truth $y_n$ and a predicted output $\hat{y}_n$, the loss is given by

$$\mathcal{L}_{BCE} = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n) \right]$$

for all instances $n$, where $(1 - y_n)$ is the probability of the ground truth label being 0 and $(1 - \hat{y}_n)$ is the probabilistic output from a logistic sigmoid function predicted as 0. The loss generally works well for classification and segmentation as long as the labels for all classes are balanced. If one class dominates the other, the imbalance results in the network predicting all outputs to be the dominant class, due to convergence to a non-optimal local minimum. Some recently proposed loss functions, such as the dice loss and the focal loss [6], tackle this problem by weighting some outputs more than others.
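As a minimal illustrative sketch (the function name and NumPy formulation are ours, not from the paper), binary cross entropy over a flattened mask can be written as:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy over all pixels of a flattened mask."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))
```

Note that if, say, 99% of the pixels are background, a network predicting everything as background already achieves a low BCE, which is exactly the imbalance problem described above.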

Figure 1:
The plot shows the value of the derivative of our loss against the value of the dice loss that it optimizes. It can be seen that for smaller values of the loss metric, a larger loss is backpropagated. The threshold $\gamma$ is fixed empirically, based on initial experiments, to a value on the x-axis.

General evaluation of these losses is done by calculating the overall overlap between the ground truth and the prediction. This intersection over union metric (Jaccard Index) is given by

$$J = \frac{|G \cap P|}{|G \cup P|}$$

where G is the ground truth mask and P is the predicted mask. In contrast, the Dice Index assigns a higher weight to the true positives, and is given by the formula

$$D = \frac{2\,|G \cap P|}{|G| + |P|}$$

Due to its high weight on the true positives, the Dice Index is also widely used as a loss function. The Tversky Index [11] is another proposed function, which adds further weight to the false positives and false negatives to get better predictions. These similarity metrics are generally converted to loss functions by optimizing over a sum of their class-wise difference from the optimal value. Their general form is

$$\mathcal{L} = \sum_{c=1}^{C} (1 - M_c)$$

where the metric $M$ can be the Jaccard, Dice or Tversky Index. The subscript indicates that the summation is over the number of classes, $C$. Many loss functions have also been proposed [2] [7] [5]

as weighted combinations of these losses. In this paper, we propose to enhance the properties of the dice loss using our methodology. We conduct an extensive hyperparameter search for our function and empirically show that our technique leads to better convergence of the dice loss, even under sub-optimal settings of our function. We compare with the state-of-the-art on the same problems, and show performance gains. We use the U-Net [10] and FocusNet [9] architectures to compare results. Our enhancement experiments use the dice loss due to its popularity in medical image segmentation tasks, but in principle, any loss function could be used here. The rest of the paper is organized as follows. In Section 2, we discuss our loss function. Section 3 describes our evaluation. Results are presented in Section 4 and we conclude with Section 5.
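For concreteness, the three overlap metrics above can be sketched as follows for binary masks (a minimal NumPy sketch; the function names are ours, and the Tversky weights shown are purely illustrative):

```python
import numpy as np

def jaccard_index(g, p):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(g, p).sum()
    union = np.logical_or(g, p).sum()
    return inter / union

def dice_index(g, p):
    """Dice index: counts the intersection (true positives) twice."""
    inter = np.logical_and(g, p).sum()
    return 2.0 * inter / (g.sum() + p.sum())

def tversky_index(g, p, alpha=0.3, beta=0.7):
    """Tversky index: separately weights false negatives and false positives."""
    tp = np.logical_and(g, p).sum()
    fn = np.logical_and(g, np.logical_not(p)).sum()
    fp = np.logical_and(np.logical_not(g), p).sum()
    return tp / (tp + alpha * fn + beta * fp)
```

Any such metric $M$ becomes a loss in the general form above by summing $1 - M_c$ over the classes $c$.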

2 Adaptive Logarithmic Loss

We motivate the need for our loss based on the properties of a good loss function. Once a loss function computes the error between the prediction and the ground truth values, the error is backpropagated through the network in order to make it learn. This fundamental task is generally carried out well by all loss functions, though some tend to converge faster than others. Empirically, the tversky loss converges in fewer epochs than earlier proposed losses such as the cross entropy or Jaccard losses. A good loss function should not take too long to converge; it is an added bonus if it speeds up convergence. Secondly, a loss function should be able to adapt to the loss landscape closer to convergence. Keeping these points in mind, we construct a loss function that can both converge at a faster rate and adaptively refine its landscape when closer to convergence. The formula for this adaptive loss is given by


$$\mathrm{ALL}(DL) = \begin{cases} \omega \ln\left(1 + \dfrac{DL}{\epsilon}\right) & \text{if } DL < \gamma \\[4pt] DL - C & \text{otherwise} \end{cases}$$

where $C = \gamma - \omega \ln(1 + \gamma/\epsilon)$ is used to make the loss function smooth and continuous at $DL = \gamma$, and DL is the computed dice loss. $\omega$, $\epsilon$ and $\gamma$ are hyperparameters of this loss function. Further, as the dice loss lies in [0, 1], we experiment with values of $\gamma$ in [0, 1] to find the optimal threshold at which to shift to a smoother log-based loss for convergence close to the minima. As log is a monotonic function, it smoothens the convergence. The derivative of this loss can be computed via the chain rule, and is shown visually in Figure 1. Differentiating a function of a function results in the product of two derivatives, so $\frac{\partial\, \mathrm{ALL}}{\partial x} = \frac{\partial\, \mathrm{ALL}}{\partial DL} \cdot \frac{\partial DL}{\partial x}$, where the plot of $\frac{\partial\, \mathrm{ALL}}{\partial DL}$ is shown in Figure 1. Hence, given any loss as input to our adaptive function, its derivative will be multiplied by a smooth differentiable function that in turn removes any discontinuities. The loss resembles the form of the functions proposed in [8] [12], but to the best of our knowledge, this is the first time such a function has been adapted for an image segmentation task. After experimentation, we found the optimal values for the hyperparameters to be $\omega = 10$, $\epsilon = 0.5$ and $\gamma = 0.1$.
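As a minimal sketch of such a piecewise logarithmic loss, assuming a Wing-loss-style form [8] with a scale $\omega$, a curvature term $\epsilon$ and a switch threshold $\gamma$ (the default values below are inferred from the best cells of Tables 1 and 2; the function name is ours):

```python
import numpy as np

def adaptive_log_loss(dl, omega=10.0, eps=0.5, gamma=0.1):
    """Adaptive logarithmic wrapper around a base loss value `dl` (e.g. dice loss).

    Below the threshold gamma, a log curve with slope omega / (eps + dl)
    penalizes small errors heavily; above it, the loss stays linear in dl.
    """
    C = gamma - omega * np.log(1.0 + gamma / eps)  # links the two pieces at dl = gamma
    if dl < gamma:
        return omega * np.log(1.0 + dl / eps)
    return dl - C
```

At $dl = \gamma$ the two branches agree, and the slope as $dl \to 0$ is $\omega/\epsilon = 20$ versus 1 in the linear region, which is the "larger gradient for small errors" behaviour shown in Figure 1.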

ε \ ω    6       8       10      12      14      16
0.3      81.43   81.48   81.51   81.59   81.90   81.67
0.5      81.97   81.57   82.43   82.24   81.58   81.07
1.0      81.78   82.11   81.84   81.73   82.21   81.96
2.0      81.75   81.99   82.18   81.58   81.71   81.63
Table 1: Optimizing the values of ε (rows) and ω (columns) over the corresponding Jaccard Index (%). Values are the average of 3 runs. Experiments conducted with constant γ. JI with baseline dice loss = 80.27. Results obtained using FocusNet.
γ     0.08    0.10    0.12    0.15    0.20    0.30
JI    81.60   82.43   81.54   81.51   80.85   80.97
Table 2: Optimizing the value of γ over the Jaccard Index (%). Values are the average of 3 runs. Experiments conducted with constant values of ε and ω. JI with baseline dice loss = 80.27. Results obtained using FocusNet.

3 Evaluation

The experiments for our methodology are conducted with two architectures: the benchmark U-Net [10] and the attention-based FocusNet [9]. A generic U-Net is enhanced with batch normalization, dropout and strided downsampling to improve its performance. The FocusNet architecture used is exactly the same as proposed in [9]. Three datasets exhibiting varying class imbalance are used for our experiments, to study the effect of our loss on them: the ISIC 2018 skin cancer segmentation dataset [4], the Data Science Bowl 2018 cell nuclei segmentation dataset, and the DRIVE retinal vessel segmentation dataset [3]. We do not apply any preprocessing except resizing the images to a constant size and scaling the pixel values to [0, 1]. For the DRIVE dataset, we extract small patches from the images to construct our dataset. The data for all experiments is divided into an 80:20 split. To keep the evaluation fair, we do not use any augmentation strategies, as different augmentations can affect performance differently. We apply a grid search style strategy to find the optimal hyperparameters of our loss, where we first run some initial tests to see the behaviour of the loss given some hyperparameters, and then tune them. Initially, we set $\gamma$ and tuned the values of $\epsilon$ and $\omega$ to their optimal settings. Then we used the empirically estimated $\epsilon$ and $\omega$ to find the optimal value for $\gamma$. The values obtained are shown in Tables 1 and 2. From the tables we can see that the loss is not affected much by changes in $\omega$, but even small changes in $\epsilon$ can cause significant changes in the loss value. This is graphically verified via the derivative of the loss in Figure 1, where we can see that for small values of the loss, the penalty for getting a prediction wrong is much larger. All experiments were run using Keras with a TensorFlow backend. Adam with a learning rate of 1e-4 was used, with a constant batch size of 16 throughout. Our implementation of the loss will be open sourced on GitHub. The experiments were run for a maximum of 50 epochs. To evaluate the performance of our loss, we compute the intersection over union (IoU) overlap, recall, specificity, F-measure and the area under the receiver operating characteristic curve (AUC-ROC) of the corresponding network predictions trained with the various loss functions.
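The two-stage hyperparameter search described above can be sketched as follows (the loop structure is ours; `evaluate_ji` is a hypothetical stand-in for training a model with the given hyperparameters and measuring its Jaccard Index):

```python
def two_stage_grid_search(evaluate_ji, gamma_init=0.1):
    """Stage 1 fixes gamma and sweeps (eps, omega); stage 2 fixes the best
    (eps, omega) and sweeps the threshold gamma, as in Tables 1 and 2."""
    # Stage 1: grid over the (eps, omega) values from Table 1.
    _, eps, omega = max(
        (evaluate_ji(e, w, gamma_init), e, w)
        for e in (0.3, 0.5, 1.0, 2.0)
        for w in (6, 8, 10, 12, 14, 16))
    # Stage 2: grid over the thresholds from Table 2.
    _, gamma = max(
        (evaluate_ji(eps, omega, g), g)
        for g in (0.08, 0.10, 0.12, 0.15, 0.20, 0.30))
    return eps, omega, gamma
```

Greedy two-stage tuning like this is far cheaper than a full 3D grid (10 + 6 training runs instead of 144), at the cost of assuming the threshold interacts only weakly with the other two hyperparameters.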

4 Results

Figure 2: The ROC curves for the ISIC 2018 skin cancer segmentation dataset. Our loss has a better Area Under the ROC curve than the baseline. The curves are plotted for the best performing models for our experiments on this task.
Figure 3: The ROC curves for the Data Science Bowl 2018 cell nuclei segmentation dataset. It can be seen that when the imbalance is high, our loss provides a much more robust and significant Area Under the ROC curve than the baseline, demonstrating a superior convergence. The curves are plotted for the best performing models from our experiments on this task.

We compare the performance of our loss (Table 3) with the jaccard loss (JL), dice loss (DL), tversky loss (TL) [11], focal loss (FL) [6] and the combo loss (CL) [5]. The ISIC 2018 dataset shows the least imbalance, and hence the results are fairly even for this dataset, though our loss gets the best IoU. This is also exhibited in Figure 2. The nuclei segmentation dataset exhibits more class imbalance, and here our loss shows significantly superior performance compared to all the other losses. We get significant gains over the baseline AUC, as shown in Figure 3. We do not compare results on the DRIVE dataset by AUC or accuracy, as these metrics are fairly saturated for this dataset and do not offer any statistically significant insights. It is interesting to note that we get an improved F-measure score and the best in class IoU, which, given the large number of patches extracted, is statistically significant. Our loss shows significantly better performance than the baseline dice loss on all three datasets, which means that it manages to optimize to a significantly better minimum on the loss landscape, leading to a more optimal solution. In all cases, the trend shown by our loss is to converge to within a small delta of the optimal solution and then refine the convergence using the adaptive strategy. Without the adaptive strategy, our loss often gets stuck in a local minimum, which reiterates the importance of having such a piecewise continuous loss. The other loss functions (especially the dice loss) exhibit slightly unstable convergence and also take longer to reach within delta of the optimal solution. We observed that our loss always converged faster than JL, DL, TL and CL. FL convergence is on par with our loss, while TL converges more smoothly.

Method           ISIC                         Data Science Bowl            DRIVE
                 Recall  Specificity Jaccard  Recall  Specificity Jaccard  F-Measure Recall  Jaccard
U-Net (JL)       78.62   85.21       72.96    76.27   81.29       73.64    78.46     73.28   65.37
U-Net (DL)       76.12   83.74       69.34    74.92   82.85       64.57    78.94     74.10   67.79
U-Net (TL)       80.82   86.98       74.18    79.21   85.81       77.72    79.89     74.47   66.18
U-Net (FL)       83.76   89.85       79.17    78.27   86.88       78.82    80.07     75.65   68.96
U-Net (CL)       82.19   87.96       75.87    77.34   85.63       78.24    79.26     74.33   67.42
U-Net (ALL)      83.56   88.47       77.69    79.88   87.27       79.71    81.41     75.83   69.23
FocusNet (JL)    80.13   86.17       72.28    77.12   84.19       74.97    77.87     73.28   64.67
FocusNet (DL)    80.78   85.81       71.92    78.37   84.92       77.82    77.86     73.98   64.71
FocusNet (TL)    84.86   90.62       77.63    79.64   88.27       77.28    81.31     74.19   68.66
FocusNet (FL)    86.19   93.95       82.78    79.26   89.17       78.73    81.28     76.89   69.57
FocusNet (CL)    84.82   86.19       78.63    80.65   87.34       79.35    78.64     74.18   68.34
FocusNet (ALL)   86.62   92.78       82.84    82.51   90.86       81.37    82.17     76.13   70.96
Table 3: Segmentation results for the three datasets. Values for the ISIC 2018, Data Science Bowl and DRIVE retinal blood vessel segmentation experiments are averaged over 3, 5 and 2 runs respectively, to average out the effects of random weight initialization as much as possible. All reported values are in %.
Method                 Dice   Precision  Recall
U-Net (FTL)            82.92  79.74      92.61
Att-U-Net+M+D (FTL)    85.61  85.82      89.71
FocusNet (ALL)         87.14  88.11      90.47
Table 4: Experiments run on the ISIC 2018 dataset using the training-validation-test split in [1]. Our reported values (in %) are averaged over 3 runs. 'M' denotes multi-scale input; 'D' denotes deep supervision.

5 Conclusion

In this paper, we proposed an enhanced loss function that constructs a loss landscape that is easier to traverse via backpropagation. We tested our approach on three datasets that exhibit varied class imbalance. As the imbalance in the data increases, our loss provides a more robust solution (along with faster convergence) to this prominent problem than other state-of-the-art loss functions. We base these conclusions on carefully constructed evaluation metrics for each task, which show superior performance in favour of our loss.


  • [1] N. Abraham and N. M. Khan (2018) A novel focal tversky loss function with improved attention u-net for lesion segmentation. CoRR abs/1810.07842. External Links: Link, 1810.07842 Cited by: Table 4.
  • [2] F. Isensee et al. (2018) NnU-net: self-adapting framework for u-net-based medical image segmentation. CoRR abs/1809.10486. External Links: Link, 1809.10486 Cited by: §1.
  • [3] J. Staal et al. (2004) Ridge based vessel segmentation in color images of the retina. IEEE Transactions on Medical Imaging 23 (4), pp. 501–509. Cited by: §3.
  • [4] N. C. F. Codella et al. (2019) Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the international skin imaging collaboration (ISIC). CoRR abs/1902.03368. External Links: Link, 1902.03368 Cited by: §3.
  • [5] S. A. Taghanaki et al. (2018) Combo loss: handling input and output imbalance in multi-organ segmentation. CoRR abs/1805.02798. External Links: Link, 1805.02798 Cited by: §1, §4.
  • [6] T. Lin et al. (2017) Focal loss for dense object detection. CoRR abs/1708.02002. External Links: Link, 1708.02002 Cited by: §1, §4.
  • [7] W. Zhu et al. (2018) AnatomyNet: deep 3d squeeze-and-excitation u-nets for fast and fully automated whole-volume anatomical segmentation. CoRR abs/1808.05238. External Links: Link, 1808.05238 Cited by: §1.
  • [8] Z. Feng et al. (2017) Wing loss for robust facial landmark localisation with convolutional neural networks. CoRR abs/1711.06753. External Links: Link, 1711.06753 Cited by: §2.
  • [9] C. Kaul, S. Manandhar, and N. Pears (2019) FocusNet: an attention-based fully convolutional network for medical image segmentation. CoRR abs/1902.03091. External Links: Link, 1902.03091 Cited by: §1, §3.
  • [10] O. Ronneberger, P.Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351, pp. 234–241. Note: (available on arXiv:1505.04597 [cs.CV]) External Links: Link Cited by: §1, §3.
  • [11] S. S. M. Salehi, D. Erdogmus, and A. Gholipour (2017) Tversky loss function for image segmentation using 3d fully convolutional deep networks. In Machine Learning in Medical Imaging, Cham, pp. 379–387. External Links: ISBN 978-3-319-67389-9 Cited by: §1, §4.
  • [12] X. Wang, L. Bo, and F. Li (2019) Adaptive wing loss for robust face alignment via heatmap regression. CoRR abs/1904.07399. External Links: Link, 1904.07399 Cited by: §2.