Using uncertainty estimation to reduce false positives in liver lesion detection

01/12/2021 ∙ by Ishaan Bhat, et al.

Despite the successes of deep learning techniques at detecting objects in medical images, false positive detections occur, which may hinder an accurate diagnosis. We propose a technique to reduce false positive detections made by a neural network using an SVM classifier trained with features derived from the uncertainty map of the neural network prediction. We demonstrate the effectiveness of this method for the detection of liver lesions on a dataset of abdominal MR images. We find that a dropout rate of 0.5 produces the fewest false positives in the neural network predictions, and the trained classifier filters out approximately 90% of these false positive detections in the test set.




1 Introduction

Primary tumors such as neuroendocrine and colorectal tumors have a high likelihood of developing metastases in the liver. Early detection of (new) liver metastases is crucial since it may prolong patient life [15]. Automatic detection of these metastases is a challenging task and deep learning based systems are increasingly being used to address the challenge.

However, deep learning systems may make erroneous predictions. These arise for a variety of reasons, for example the model overfitting to the training data or the presence of noise and artefacts in the image. False positives in the prediction are one such type of error and may hinder accurate patient diagnosis.

Efficient and scalable uncertainty estimation techniques for deep learning-based systems such as MC-Dropout [3] and model ensembles [10] have been widely adopted by the medical imaging research community to estimate uncertainty in tasks such as classification and segmentation. False positive detections tend to have a higher estimated uncertainty, so uncertainty quantification can be used to filter such detections [12, 11, 17].

Modern neural networks have been shown to be poorly calibrated [4], which may degrade the quality of uncertainty estimates [13] and make any conclusion drawn solely from the uncertainty estimate unreliable. In this paper, we propose an approach that uses features based on shape and other attributes, in addition to the uncertainty estimate, to detect false positive predictions made by deep learning systems.

2 Related Work

There has been active research to address challenges in developing interpretable uncertainty metrics to detect segmentation failures and aid clinicians in their decision making [17, 7, 12, 18]. In the context of image segmentation, computing such a metric on an entire object, rather than on a per-voxel basis, may aid interpretability. The direct use of voxel-wise uncertainty estimates to detect failures has shown limited success [6]. In [6], it is also shown that aggregating voxel-wise uncertainties spatially can aid in detecting failed segmentations.

In [12], lesion-level uncertainties are computed by taking a log-sum of voxel-wise uncertainties over the lesion prediction, under the assumption that per-voxel uncertainty estimates within a single lesion volume are independent. It is shown that using lesion-level uncertainties to filter predicted lesions reduces the number of false positives and false negatives. In [17, 11], a negative correlation between the mean uncertainty over a structure and the Dice score is shown, leading to the conclusion that the mean entropy over the structure can be used to filter wrong predictions. Similarly, in [7] a doubt score is computed by summing up voxel-wise uncertainties in predicted foreground regions.
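The three aggregation schemes mentioned above can be compared side by side. The following is only a minimal sketch: the function name `aggregate_uncertainty`, the `eps` guard and the toy values are assumptions made for illustration, and the exact formulations in [12], [17, 11] and [7] may differ in detail.

```python
import math


def aggregate_uncertainty(voxel_uncertainties, eps=1e-12):
    """Aggregate per-voxel uncertainties of one predicted structure.

    Returns three scalar scores (illustrative simplifications):
      log_sum: sum of log-uncertainties (log-sum style aggregation)
      mean:    mean uncertainty over the structure
      sum:     plain sum of uncertainties (doubt-score style aggregation)
    """
    values = list(voxel_uncertainties)
    if not values:
        raise ValueError("structure contains no voxels")
    total = sum(values)
    return {
        # eps (an assumption) avoids log(0) for voxels with zero uncertainty
        "log_sum": sum(math.log(u + eps) for u in values),
        "mean": total / len(values),
        "sum": total,
    }


# Hypothetical per-voxel entropies of a three-voxel detection
scores = aggregate_uncertainty([0.2, 0.5, 0.3])
print(scores)
```

Note that the sum and log-sum scores grow with the number of voxels, whereas the mean does not, so the choice of aggregation also determines how strongly lesion size influences the score.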

An alternate approach to explicit aggregation of voxel-wise estimates has been to use a second neural network that uses the per-voxel uncertainty map and the network prediction to estimate the segmentation quality or refine detection [14, 2]. In [2] a second neural network supplied with the prediction and the spatial uncertainty map learns to predict the Dice score. In [14] the second neural network uses a 3-channel input of the original image patch, prediction and uncertainty estimate to predict if the detection of the nodule by the first stage was correct.

In this paper, we propose a two-stage process to detect false positive predictions. Instead of training a second neural network, we train an SVM classifier to predict whether a lesion detected by the segmentation network is a false positive. This classifier is trained on a low-dimensional feature vector computed for each lesion, comprising the aggregated uncertainty and shape-based attributes. Our approach requires less data to train the second stage (compared to the use of a neural network) and we demonstrate the effectiveness of this approach in Section


3 Methodology

3.1 Data

In this paper we included abdominal DCE and DWI MRI of 72 patients with liver metastases from the University Medical Center Utrecht, the Netherlands.

The DCE MR series was acquired in six breath holds resulting in a total of 16 3-D images. Voxel size for these images is x x mm . The liver and the metastases within the liver were manually segmented on the DCE-MRI by a radiologist in training and verified by a radiologist with more than 10 years of experience. The dataset mainly included colorectal metastases, neuroendocrine metastases and some other metastases types. The DCE-MR images were motion corrected using techniques presented in [5].

The DWI-MR images were acquired with three b-values: , , and s/mm2, using a protocol with the following parameters: TE: ms; TR: ms; flip angle: degrees. For each patient, the DWI MR image was nonlinearly registered to the DCE MR image using the elastix toolbox.

We apply the manually created liver masks to the abdominal DCE and DWI MR images and pre-process them using z-score normalization of the intensities.
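A minimal sketch of masked z-score normalization is given below. It assumes that the mean and standard deviation are computed over liver voxels only and that voxels outside the mask are set to zero; the text does not state either detail, and the function name `zscore_in_mask` is hypothetical.

```python
import math


def zscore_in_mask(intensities, mask):
    """Z-score normalize a flat list of intensities using only masked voxels.

    Assumption: statistics come from voxels inside the liver mask and
    voxels outside the mask are set to 0.0.
    """
    inside = [v for v, m in zip(intensities, mask) if m]
    if not inside:
        raise ValueError("mask is empty")
    mean = sum(inside) / len(inside)
    variance = sum((v - mean) ** 2 for v in inside) / len(inside)
    std = math.sqrt(variance) or 1.0  # guard against constant-intensity input
    return [(v - mean) / std if m else 0.0 for v, m in zip(intensities, mask)]


# Toy example: the last voxel lies outside the liver mask
normalized = zscore_in_mask([10.0, 20.0, 30.0, 99.0], [1, 1, 1, 0])
print(normalized)
```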

The data was split into 50 training patients, 5 validation patients and 17 test patients.

3.2 Neural Network Architecture and Training

Our choice of neural network architecture (Figure 1) is inspired by the U-Net [16] and Bayesian SegNet [8]. We use the standard encoder-decoder with skip connections like the U-Net and add dropout to the bottom-most encoder and decoder blocks since these positions are shown to be most effective [8]. Preceding the encoder-decoder structures, we use convolutions to process and fuse the DCE and DWI images.

Figure 1: Neural network architecture used to perform lesion detection and uncertainty estimation

The network is trained using 2-D slices from the 3-D DCE and DWI MR images. As described in [3], the network is trained with dropout and, at test-time, outputs obtained from multiple passes through the network (with dropout enabled) are used to estimate model uncertainty. Each pass can be thought of as a sample from the weight posterior distribution, and averaging the outputs can be thought of as marginalizing out the weight posterior to obtain an estimate of the model likelihood. Thus, the mean output over multiple passes is taken to be the final prediction. To quantify the uncertainty, we calculate the entropy of the mean softmax prediction, given by $H = -\sum_{c} p_c \log p_c$, where $c$ is the class index and $p_c$ is the mean softmax value for that class. We create the binary prediction by thresholding the mean softmax output at 0.5. To remove noisy detections and fill small holes, this step is followed by a post-processing step involving binary closing and opening. An example of a lesion prediction and associated uncertainty map is shown in Figure 2.
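The per-voxel computation described above (averaging the softmax outputs of several stochastic passes, taking the entropy of the mean, and thresholding at 0.5) can be sketched as follows. The function name, the use of the natural logarithm, the strict `>` comparison and the toy softmax values are assumptions for illustration; the morphological post-processing is omitted.

```python
import math


def mc_dropout_summary(softmax_samples, threshold=0.5, lesion_class=1):
    """Summarize MC-Dropout passes for a single voxel.

    softmax_samples: list of per-pass softmax vectors, one value per class.
    Returns (mean softmax, entropy of the mean softmax, binary lesion label).
    """
    n_passes = len(softmax_samples)
    n_classes = len(softmax_samples[0])
    # Mean softmax over all stochastic forward passes
    mean = [sum(sample[c] for sample in softmax_samples) / n_passes
            for c in range(n_classes)]
    # Entropy of the mean prediction; terms with p == 0 contribute nothing
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    # Binary prediction by thresholding the mean lesion probability
    is_lesion = mean[lesion_class] > threshold
    return mean, entropy, is_lesion


# Three hypothetical passes (the paper uses 20) for a two-class problem
mean, entropy, is_lesion = mc_dropout_summary(
    [[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]
)
print(mean, entropy, is_lesion)
```

For two classes the entropy is largest when the mean softmax is 0.5 for both classes, so the most uncertain voxels are the ones closest to the decision threshold.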

(a) DCE MR image overlaid with true and predicted lesion masks. The red prediction corresponds to a false positive detection, while the ground truth annotation is shown in yellow.
(b) Uncertainty map containing per-voxel entropy computed from the mean softmax prediction.
Figure 2: Lesion detection and uncertainty quantification

During training we use the Adam [9] optimizer with an initial learning rate of in combination with the PolyLR scheduler [1] to decrease the learning rate as training progresses. We extract 128x128 overlapping patches from the 256x256 size image slices (5 per image) and feed this to the neural network. We use rotations using angles sampled from a uniform distribution over to augment the training images. We use a weighted version of the standard cross-entropy loss to address the class-imbalance. To estimate uncertainty during test-time, we use 20 forward passes for each image patch. We train the neural networks for 120K iterations.
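A polynomial learning rate schedule of the kind popularized by [1] can be sketched as below. The base learning rate `1e-3` and the exponent `0.9` are placeholder assumptions, not the values used in this work (the initial learning rate is not recoverable from the text); only the 120K iteration count is taken from the description above.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iterations) ** power."""
    if not 0 <= iteration <= max_iterations:
        raise ValueError("iteration must lie in [0, max_iterations]")
    return base_lr * (1.0 - iteration / max_iterations) ** power


MAX_ITERATIONS = 120_000  # from the text
BASE_LR = 1e-3            # hypothetical value, not from the paper

for step in (0, 60_000, 120_000):
    print(step, poly_lr(BASE_LR, step, MAX_ITERATIONS))
```

The schedule starts at the base learning rate and decays smoothly to zero at the final iteration.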

3.3 Feature Extraction and Classification

Figure 3: Feature extraction and lesion classification pipeline

The feature extraction and classification pipeline is shown in Figure 3. For each patient, we extract 3-D patches from the uncertainty map corresponding to regions in the neural network output where lesions have been detected. In our analysis we found that false positive predictions tended to have a smaller volume as compared to true positive predictions. Therefore, in addition to the mean uncertainty, we selected the maximum diameter of the detection as a feature.
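The two manually chosen features can be illustrated with a short sketch. It assumes that a detection is given as a list of voxel indices, that the uncertainty map is a dictionary from voxel index to entropy, and that the maximum diameter is the largest distance between voxel centres; these representations and the function name are assumptions, and the diameter definition used in the paper may differ.

```python
import math
from itertools import combinations


def lesion_features(voxels, uncertainty, spacing=(1.0, 1.0, 1.0)):
    """Compute (mean uncertainty, maximum diameter) for one detection.

    voxels:      list of (z, y, x) index tuples belonging to the detection
    uncertainty: dict mapping (z, y, x) -> per-voxel entropy
    spacing:     voxel size per axis, used to convert indices to distances
    """
    mean_uncertainty = sum(uncertainty[v] for v in voxels) / len(voxels)
    # Convert voxel indices to physical coordinates
    points = [tuple(i * s for i, s in zip(v, spacing)) for v in voxels]
    # Largest pairwise distance between voxel centres (0.0 for one voxel)
    max_diameter = max(
        (math.dist(a, b) for a, b in combinations(points, 2)), default=0.0
    )
    return mean_uncertainty, max_diameter


# Toy detection with three voxels and hypothetical entropies
voxels = [(0, 0, 0), (0, 0, 1), (0, 3, 4)]
entropy_map = {(0, 0, 0): 0.2, (0, 0, 1): 0.4, (0, 3, 4): 0.6}
features = lesion_features(voxels, entropy_map)
print(features)
```

The pairwise search is quadratic in the number of voxels, which is acceptable for small detections but would need a faster method for large ones.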

Additionally, we automatically selected features from a set of 107 features extracted using PyRadiomics. Linear models with L1 penalty produce sparse classifiers with many of the feature coefficients set to zero after training. We use such a classifier trained on the validation patient dataset to select the top-2 features with non-zero coefficients for each configuration. We compare the performance of these automatically selected features with features we chose manually.

We use a support vector machine (SVM) to classify the feature vector as a true or a false positive lesion. The classifier is trained using patches extracted from uncertainty maps computed for the patients that are part of the neural network validation set (5 patients). Hyper-parameters are selected using a grid search in combination with cross-validation over this data.
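A grid search over SVM hyper-parameters with cross-validation could look like the sketch below. It assumes scikit-learn, which the text does not name; the parameter grid, the feature scaling step, the number of folds and the toy feature values are all hypothetical. Label 1 denotes a false positive detection and label 0 a true lesion.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def fit_false_positive_classifier(features, labels):
    """Fit an SVM on two-dimensional lesion features with a grid search."""
    # Scaling is an assumption; SVMs are sensitive to feature ranges
    pipeline = make_pipeline(StandardScaler(), SVC())
    param_grid = {
        "svc__C": [0.1, 1.0, 10.0],          # hypothetical grid
        "svc__kernel": ["linear", "rbf"],
    }
    search = GridSearchCV(pipeline, param_grid, cv=3)
    search.fit(features, labels)
    return search


# Toy features: [mean uncertainty, maximum diameter], made up for illustration
true_lesions = [[0.20 + 0.01 * i, 12.0 + i] for i in range(6)]
false_positives = [[0.60 + 0.01 * i, 3.0 + 0.2 * i] for i in range(6)]
X = true_lesions + false_positives
y = [0] * 6 + [1] * 6

model = fit_false_positive_classifier(X, y)
prediction = model.predict([[0.62, 3.5], [0.22, 14.0]])
print(model.best_params_, prediction)
```

With only a handful of training detections per patient, a low-dimensional feature vector keeps the number of parameters to fit small, which is the motivation given for preferring an SVM over a second neural network.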

4 Results

In this section we show results of lesion detection and false positive classification for three different configurations:

  • Baseline (No dropout)

  • Low dropout ()

  • High dropout (dropout rate of 0.5)

We vary the dropout rate to analyze the behavior of MC-Dropout over a range of values. Using a dropout rate higher than 0.5 led to unstable training. For each configuration, we train 5 different neural network instances, each of which has a different train-validation data split. The set of test patients used to report the performance is the same across all runs and configurations.

4.1 Lesion detection

The results for lesion detection are shown in Table 1. A connected region in the neural network prediction is counted as a single lesion prediction. If such a prediction has a non-zero overlap with ground truth annotation, it is considered detected i.e. a true positive. If there is no overlap, then that lesion is a false positive.
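The counting rule described above can be sketched on a small 2-D example. The real masks are 3-D and the connectivity used in the paper is not stated, so the 4-connectivity chosen here, like the function names and toy masks, is an assumption.

```python
from collections import deque


def connected_components(mask):
    """Return the 4-connected foreground regions of a 2-D binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            # Breadth-first search from an unvisited foreground pixel
            queue = deque([(r, c)])
            seen.add((r, c))
            component = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            components.append(component)
    return components


def count_detections(prediction, ground_truth):
    """Count predicted regions with (TP) and without (FP) ground-truth overlap."""
    true_positives = false_positives = 0
    for component in connected_components(prediction):
        if any(ground_truth[y][x] for y, x in component):
            true_positives += 1
        else:
            false_positives += 1
    return true_positives, false_positives


prediction = [
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
ground_truth = [
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(count_detections(prediction, ground_truth))
```

In this toy case the region in the top-left corner touches the annotation and counts as a true positive, while the region on the right has no overlap and counts as a false positive.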

We see that increasing the dropout rate reduces the number of false positives, which could be attributed to its regularizing effect and a slight improvement in calibration [6].

Configuration | True Positive | False Positive
Baseline | 75 | 41.4
Low Dropout | 75 | 38.6
High Dropout | 75 | 29.0
Table 1: Total number of true and false positive predictions by the neural network in the test-set. Mean taken over results of 5 separately trained neural networks per configuration.

4.2 False positive classification

In Table 2, we show the cross-validation results for the SVM training data (mean and standard deviation) for all configurations using manual and automatic feature selection. In all cases, the cross-validation accuracy for the manually selected features is higher than or equal to that of the automatically selected features. This led us to choose the manually selected features to perform classification and report results on the test set.

Configuration | Manual Feature Selection | Automatic Feature Selection
Low Dropout
High Dropout
Table 2: Cross-validation accuracy (mean and standard deviation) for manual and automatic feature selection

Model | Accuracy | Sensitivity | Specificity | F1-Score
Low dropout
High dropout
Table 3: False positive classification metrics for test patients using manually selected features

In Table 3 we report classification metrics for the false positive detection task. We see that the dropout configurations have a better accuracy and sensitivity i.e. they are much better at classifying false positive predictions made by the neural network correctly.

The specificity metric tells us the ability of the classifier to correctly classify a true positive lesion. On this metric, the high dropout configuration mis-classifies around of true lesions, the best among the 3 configurations.
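The metrics in Table 3 follow the usual definitions, with a false positive detection treated as the positive class, so that sensitivity measures how many false positive detections are flagged and specificity how many true lesions are kept. A minimal sketch is shown below; the argument names and the example counts are hypothetical and are not the paper's results.

```python
def classification_metrics(flagged_false, missed_false, kept_true, rejected_true):
    """Metrics for the false positive classifier.

    Positive class = false positive detection of the neural network.
      flagged_false: false positive detections correctly flagged
      missed_false:  false positive detections the classifier let through
      kept_true:     true lesions correctly kept
      rejected_true: true lesions wrongly flagged as false positives
    """
    total = flagged_false + missed_false + kept_true + rejected_true
    accuracy = (flagged_false + kept_true) / total
    n_false = flagged_false + missed_false
    n_true = kept_true + rejected_true
    n_flagged = flagged_false + rejected_true
    sensitivity = flagged_false / n_false if n_false else 0.0
    specificity = kept_true / n_true if n_true else 0.0
    precision = flagged_false / n_flagged if n_flagged else 0.0
    denominator = precision + sensitivity
    f1 = 2 * precision * sensitivity / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only
metrics = classification_metrics(
    flagged_false=27, missed_false=3, kept_true=72, rejected_true=3
)
print(metrics)
```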

In Table 4 we show the total number of predictions (true and false positives) in the test set before and after the feature-based classification. The number of false positives is smallest for the high dropout configuration after classification. Additionally, it retains the most true positives owing to its better specificity.

Configuration | True Positives (before) | True Positives (after) | False Positives (before) | False Positives (after)
Baseline | 75 | 72.8 | 41.4 | 8.9
Low Dropout | 75 | 73 | 38.6 | 3.3
High Dropout | 75 | 73.8 | 29.0 | 2.7
Table 4: Total number of true and false positive predictions in test-set (mean over runs) before and after classification.
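The relative effect of the classifier can be read off Table 4 with a short calculation. The sketch below only re-expresses the table values as fractions; the variable and function names are chosen for illustration.

```python
# Values copied from Table 4:
# (TP before, TP after, FP before, FP after), means over runs
TABLE_4 = {
    "Baseline": (75, 72.8, 41.4, 8.9),
    "Low Dropout": (75, 73, 38.6, 3.3),
    "High Dropout": (75, 73.8, 29.0, 2.7),
}


def summarize(row):
    """Fraction of false positives removed and of true positives retained."""
    tp_before, tp_after, fp_before, fp_after = row
    return {
        "fp_removed": (fp_before - fp_after) / fp_before,
        "tp_retained": tp_after / tp_before,
    }


summary = {name: summarize(row) for name, row in TABLE_4.items()}
for name, stats in summary.items():
    print(f"{name}: {stats['fp_removed']:.1%} of false positives removed, "
          f"{stats['tp_retained']:.1%} of true positives retained")
```

For the high dropout configuration this corresponds to roughly 91% of false positives removed while about 98% of true positives are retained.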

5 Discussion and Conclusion

Our results show that, for the neural network with a dropout rate of 0.5, the classifier filters out close to 90% of the false positive detections in the neural network output. By choosing MC-Dropout to estimate uncertainty, we consider only the uncertainty inherent in the model and not the data. The method might be further improved by combining MC-Dropout with techniques to estimate data uncertainty.

We could not use this approach to correct false negatives. These were extremely small in size and would get filtered out during post-processing of the predicted mask as noise.

Using more than two features did not improve the performance of the false positive classification. Further investigation is required into the robustness of the manually selected features across image modalities, organs, and uncertainty estimation techniques.

6 Compliance with Ethical Standards

The UMCU Medical Ethical Committee has reviewed this study and informed consent was waived due to its retrospective nature.

7 Acknowledgements

This work was financially supported by the project IMPACT (Intelligence based iMprovement of Personalized treatment And Clinical workflow supporT) in the framework of the EU research programme ITEA3 (Information Technology for European Advancement). The authors declare no conflict of interest.


References

  • [1] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017-05) DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv: 1606.00915. Accepted by TPAMI.
  • [2] T. DeVries and G. W. Taylor (2018-07) Leveraging Uncertainty Estimates for Predicting Segmentation Quality. arXiv: 1807.00502.
  • [3] Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, JMLR Workshop and Conference Proceedings, Vol. 48, pp. 1050–1059.
  • [4] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 1321–1330.
  • [5] M. J. A. Jansen, H. J. Kuijf, W. B. Veldhuis, F. J. Wessels, M. S. van Leeuwen, and J. P. W. Pluim (2017-09) Evaluation of motion correction for clinical dynamic contrast enhanced MRI of the liver. Physics in Medicine & Biology 62 (19), pp. 7556–7568. ISSN 1361-6560.
  • [6] A. Jungo, F. Balsiger, and M. Reyes (2020-04) Analyzing the Quality and Challenges of Uncertainty Estimations for Brain Tumor Segmentation. Frontiers in Neuroscience 14, pp. 282. ISSN 1662-453X.
  • [7] A. Jungo, R. Meier, E. Ermis, E. Herrmann, and M. Reyes (2018-06) Uncertainty-driven Sanity Check: Application to Postoperative Brain Tumor Cavity Segmentation. arXiv: 1806.03106.
  • [8] A. Kendall, V. Badrinarayanan, and R. Cipolla (2016-10) Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding. arXiv: 1511.02680.
  • [9] D. P. Kingma and J. Ba (2017-01) Adam: A Method for Stochastic Optimization. arXiv: 1412.6980. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.
  • [10] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pp. 6402–6413.
  • [11] A. Mehrtash, W. M. Wells, C. M. Tempany, P. Abolmaesumi, and T. Kapur (2020) Confidence calibration and predictive uncertainty estimation for deep medical image segmentation. IEEE Transactions on Medical Imaging, pp. 1–1. ISSN 1558-254X.
  • [12] T. Nair, D. Precup, D. L. Arnold, and T. Arbel (2018) Exploring Uncertainty Measures in Deep Networks for Multiple Sclerosis Lesion Detection and Segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, Cham, pp. 655–663. ISBN 978-3-030-00928-1.
  • [13] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, B. Lakshminarayanan, and J. Snoek (2019) Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems 32, pp. 13991–14002.
  • [14] O. Ozdemir, B. Woodward, and A. A. Berlin (2017-12) Propagating Uncertainty in Multi-Stage Bayesian Convolutional Neural Networks with Application to Pulmonary Nodule Detection. arXiv: 1712.00497.
  • [15] P. J. Robinson (2002-12) The early detection of liver metastases. Cancer Imaging 2 (2), pp. 1–3. ISSN 1470-7330.
  • [16] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Cham, pp. 234–241. ISBN 978-3-319-24574-4.
  • [17] A. G. Roy, S. Conjeti, N. Navab, and C. Wachinger (2018) Inherent Brain Segmentation Quality Control from Fully ConvNet Monte Carlo Sampling. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, Cham, pp. 664–672. ISBN 978-3-030-00928-1.
  • [18] J. Sander, B. D. de Vos, J. M. Wolterink, and I. Išgum (2019) Towards increased trustworthiness of deep learning segmentation methods on cardiac MRI. In Medical Imaging 2019: Image Processing, Vol. 10949, pp. 324–330.