Primary tumors such as neuroendocrine and colorectal tumors have a high likelihood of metastasizing to the liver. Early detection of (new) liver metastases is crucial, since it may prolong patient life. Automatic detection of these metastases is a challenging task, and deep learning based systems are increasingly being used to address this challenge.
However, deep learning systems may make erroneous predictions. These arise for a variety of reasons, for example the model overfitting to the training data or the presence of noise and artefacts in the image. False positives in the prediction are one such type of error and may hinder accurate patient diagnosis.
Efficient and scalable uncertainty estimation techniques for deep learning-based systems, such as MC-Dropout and model ensembles, have been widely adopted by the medical imaging research community to estimate uncertainty in tasks such as classification and segmentation. False positive detections tend to have a higher estimated uncertainty, so uncertainty quantification can be used to filter such detections [12, 11, 17].
Prior work has shown that modern neural networks exhibit poor calibration, which may degrade the quality of uncertainty estimates, thereby making any conclusion drawn solely from the uncertainty estimate unreliable. In this paper, we propose an approach that leverages shape-based and other attributes in addition to the uncertainty estimate to detect false positive predictions made by deep learning systems.
2 Related Work
There has been active research into interpretable uncertainty metrics that detect segmentation failures and aid clinicians in their decision making [17, 7, 12, 18]. In the context of image segmentation, computing such a metric over an entire object, rather than on a per-voxel basis, may aid interpretability. The direct use of voxel-wise uncertainty estimates to detect failures has shown limited success. In , it is also shown that aggregating voxel-wise uncertainties spatially can aid in detecting failed segmentations.
In , lesion-level uncertainties are computed by taking a log-sum of voxel-wise uncertainties over the lesion prediction, under the assumption that per-voxel uncertainty estimates within a single lesion volume are independent. It is shown that using lesion-level uncertainties to filter predicted lesions reduces the number of false positives and false negatives. In [17, 11], a negative correlation between the mean uncertainty over a structure and the Dice score is shown, leading to the conclusion that the mean entropy over a structure can be used to filter wrong predictions. Similarly, in , a doubt score is computed by summing voxel-wise uncertainties in predicted foreground regions.
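The log-sum aggregation described above can be sketched as follows. This is a minimal illustration under the stated independence assumption, not the cited authors' code; the function name `lesion_uncertainties` is ours:

```python
import numpy as np
from scipy import ndimage

def lesion_uncertainties(binary_pred, voxel_uncertainty):
    """Aggregate voxel-wise uncertainty into one score per predicted lesion.

    Assuming per-voxel estimates within a lesion are independent, the
    lesion-level score is the sum of per-voxel log-uncertainties (a
    log-sum), computed over each connected component of the prediction.
    """
    labels, n_lesions = ndimage.label(binary_pred)
    eps = 1e-12  # guard against log(0)
    scores = []
    for lesion_id in range(1, n_lesions + 1):
        u = voxel_uncertainty[labels == lesion_id]
        scores.append(np.sum(np.log(u + eps)))
    return scores
```

Lesions whose score falls below a chosen threshold would then be discarded as likely false positives.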
An alternative approach to explicit aggregation of voxel-wise estimates is to use a second neural network that takes the per-voxel uncertainty map and the network prediction as input to estimate segmentation quality or refine detection [14, 2]. In , a second neural network supplied with the prediction and the spatial uncertainty map learns to predict the Dice score. In , the second neural network uses a 3-channel input of the original image patch, prediction, and uncertainty estimate to predict whether the detection of the nodule by the first stage was correct.
In this paper, we propose a two-stage process to detect false positive predictions. Instead of training a second neural network, we train an SVM classifier to predict whether a lesion detected by the segmentation network is a false positive. This classifier is trained on a low-dimensional feature vector computed for each lesion, comprised of the aggregated uncertainty and shape-based attributes. Our approach requires less data to train the second stage (compared to the use of a neural network), and we demonstrate its effectiveness in Section 4.2.
3.1 Data

We included abdominal DCE and DWI MRI of 72 patients with liver metastases from the University Medical Center Utrecht, the Netherlands.
The DCE-MR series was acquired in six breath holds, resulting in a total of 16 3-D images. Voxel size for these images is  x  x  mm. The liver and the metastases within it were manually segmented on the DCE-MRI by a radiologist in training and verified by a radiologist with more than 10 years of experience. The dataset mainly comprised colorectal and neuroendocrine metastases, along with some other metastasis types. The DCE-MR images were motion corrected using the techniques presented in .
The DWI-MR images were acquired with three b-values ( ,  , and   s/mm2), using a protocol with the following parameters: TE:  ms; TR:  ms; flip angle:  degrees. For each patient, the DWI-MR image was nonlinearly registered to the DCE-MR image using the elastix toolbox (https://elastix.lumc.nl/).
We apply the manually created liver masks to the abdominal DCE and DWI MR images and pre-process them using z-score normalization of the intensities.
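As a concrete illustration of this pre-processing step, a minimal sketch could look like the following; the helper name `normalize_in_mask` is ours, not from an established pipeline:

```python
import numpy as np

def normalize_in_mask(image, mask):
    """Apply the liver mask and z-score normalize intensities, using
    the mean and standard deviation computed inside the mask only, so
    background voxels do not skew the statistics."""
    vals = image[mask > 0]
    out = np.zeros_like(image, dtype=np.float32)  # background stays zero
    out[mask > 0] = (vals - vals.mean()) / (vals.std() + 1e-8)
    return out
```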
The data was split into 50 training patients, 5 validation patients and 17 test patients.
3.2 Neural Network Architecture and Training
Our choice of neural network architecture (Figure 1) is inspired by the U-Net and the Bayesian SegNet. We use the standard encoder-decoder with skip connections like the U-Net and add dropout to the bottom-most encoder and decoder blocks, since these positions have been shown to be most effective. Preceding the encoder-decoder structure, we use convolutions to process and fuse the DCE and DWI images.
The network is trained using 2-D slices from the 3-D DCE and DWI MR images. As described in , the network is trained with dropout, and at test time the outputs obtained from multiple passes through the network (with dropout enabled) are used to estimate model uncertainty. Each pass can be thought of as a sample from the weight posterior distribution, and averaging the outputs can be thought of as marginalizing out the weight posterior to obtain an estimate of the model likelihood. Thus, the mean output over multiple passes is taken as the final prediction. To quantify the uncertainty, we calculate the entropy of the mean softmax prediction, given by $H = -\sum_{c} p_c \log p_c$, where $c$ is the class index and $p_c$ is the mean softmax value for that class. We create the binary prediction by thresholding the mean softmax output at 0.5. To remove noisy detections and fill small holes, this step is followed by a post-processing step involving binary closing and opening. An example of a lesion prediction and the associated uncertainty map is shown in Figure 2.
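The test-time procedure above can be sketched as follows; `stochastic_forward` is a placeholder for the network's dropout-enabled forward pass (returning per-voxel softmax probabilities, class axis last), not part of our actual implementation:

```python
import numpy as np

def mc_dropout_predict(stochastic_forward, x, n_samples=20):
    """MC-Dropout inference: run several stochastic forward passes
    (dropout kept active), average the softmax outputs, and derive
    both the binary prediction and a voxel-wise entropy map."""
    samples = np.stack([stochastic_forward(x) for _ in range(n_samples)])
    p_mean = samples.mean(axis=0)  # approximate marginalization over weights
    # Predictive entropy: H = -sum_c p_c * log(p_c)
    entropy = -np.sum(p_mean * np.log(p_mean + 1e-12), axis=-1)
    binary = (p_mean[..., 1] >= 0.5).astype(np.uint8)  # foreground threshold 0.5
    return binary, entropy
```

Binary closing and opening (e.g. `scipy.ndimage.binary_closing` / `binary_opening`) would then be applied to `binary` as post-processing.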
The learning rate is decreased as training progresses. We extract 128x128 overlapping patches from the 256x256 image slices (5 per image) and feed these to the neural network. We augment the training images with rotations, using angles sampled from a uniform distribution. We use a weighted version of the standard cross-entropy loss to address the class imbalance. To estimate uncertainty at test time, we use 20 forward passes for each image patch. We train the neural networks for 120K iterations.
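As an illustration of the weighted cross-entropy loss used to counter class imbalance, a minimal numpy sketch follows; the weight values in the example are illustrative, not those used in our experiments:

```python
import numpy as np

def weighted_cross_entropy(probs, targets, class_weights):
    """Weighted cross-entropy over a batch of voxels: each voxel's
    negative log-likelihood is scaled by the weight of its ground-truth
    class, so the rare foreground class contributes more to the loss."""
    w = np.asarray(class_weights)[targets]           # per-voxel weight
    p = probs[np.arange(len(targets)), targets]      # prob of true class
    return float(np.mean(-w * np.log(p + 1e-12)))
```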
3.3 Feature Extraction and Classification
The feature extraction and classification pipeline is shown in Figure 3. For each patient, we extract 3-D patches from the uncertainty map corresponding to regions in the neural network output where lesions have been detected. In our analysis, we found that false positive predictions tended to have a smaller volume than true positive predictions. Therefore, in addition to the mean uncertainty, we selected the maximum diameter of the detection as a feature.
Additionally, we automatically selected features from a set of 107 features extracted using PyRadiomics (https://github.com/Radiomics/pyradiomics). Linear models with an L1 penalty produce sparse classifiers, with many of the feature coefficients set to zero after training. We use such a classifier, trained on the validation patient dataset, to select the top-2 features with non-zero coefficients for each configuration. We compare the performance of these automatically selected features with features we chose manually.
We use a support vector machine (SVM) to classify the feature vector as a true or a false positive lesion. The classifier is trained using patches extracted from uncertainty maps computed for the patients in the neural network validation set (5 patients). Hyper-parameters are selected using a grid search in combination with cross-validation over this data.
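The two steps above (L1-based sparse feature selection, then an SVM with grid-searched hyper-parameters) can be sketched with scikit-learn; the penalty strength, grid values, and `k=2` default here are illustrative, not the settings used in our experiments:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def select_and_fit(X, y, k=2):
    """Select the top-k features by magnitude of L1-penalized linear
    model coefficients, then grid-search an SVM on the reduced vectors.

    X: (n_lesions, n_features) feature matrix; y: 1 for true positive,
    0 for false positive lesion.
    """
    selector = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    selector.fit(X, y)
    top = np.argsort(np.abs(selector.coef_).ravel())[::-1][:k]
    grid = GridSearchCV(SVC(),
                        {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]},
                        cv=3)
    grid.fit(X[:, top], y)
    return top, grid.best_estimator_
```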
4 Results

In this section, we show results of lesion detection and false positive classification for three different configurations:
Baseline (No dropout)
Low dropout ()
High dropout ()
We vary the dropout rate to analyze the behavior of MC-Dropout over a range of values; using a dropout rate higher than 0.5 led to unstable training. For each configuration, we train 5 different neural network instances, each with a different train-validation data split. The set of test patients used to report performance is the same across all runs and configurations.
4.1 Lesion detection
The results for lesion detection are shown in Table 1. A connected region in the neural network prediction is counted as a single lesion prediction. If such a prediction has a non-zero overlap with the ground truth annotation, it is considered detected, i.e., a true positive. If there is no overlap, the lesion is a false positive.
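The matching rule above (one connected component per lesion, non-zero overlap with ground truth) can be sketched as:

```python
import numpy as np
from scipy import ndimage

def count_detections(pred, gt):
    """Count each connected region in the prediction as one lesion;
    any non-zero overlap with the ground truth annotation makes it a
    true positive, otherwise it is a false positive."""
    labels, n_lesions = ndimage.label(pred)
    tp = fp = 0
    for lesion_id in range(1, n_lesions + 1):
        if np.any(gt[labels == lesion_id] > 0):
            tp += 1
        else:
            fp += 1
    return tp, fp
```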
We see that increasing the dropout rate reduces the number of false positives, which could be attributed to its regularizing effect and a slight improvement in calibration.
4.2 False positive classification
In Table 2, we show the cross-validation results on the SVM training data (mean and standard deviation) for all configurations, using manual and automatic feature selection. In all cases, the cross-validation accuracy for the manually selected features is higher than or equal to that of the automatically selected features. This led us to choose the manually selected features to perform classification and report results on the test set.
Table 2: Configuration | Manual Feature Selection | Automatic Feature Selection
In Table 3, we report classification metrics for the false positive detection task. We see that the dropout configurations have better accuracy and sensitivity, i.e., they are much better at correctly classifying false positive predictions made by the neural network.
The specificity metric reflects the ability of the classifier to correctly classify a true positive lesion. On this metric, the high dropout configuration mis-classifies the smallest fraction of true lesions, the best among the three configurations.
In Table 4, we show the total number of predictions (true and false positives) in the test set before and after the feature-based classification. The number of false positives is smallest for the high dropout configuration after classification. Additionally, it retains the most true positives, owing to its better specificity.
5 Discussion and Conclusion
Our results show that the neural network with a dropout rate of filters out close to % of false positive detections in the neural network output. By choosing MC-Dropout to estimate uncertainty, we consider only the uncertainty inherent in the model and not the data. The method might be further improved by combining MC-Dropout with techniques to estimate data uncertainty.
This approach could not be used to correct false negatives: these were extremely small in size and were filtered out as noise during post-processing of the predicted mask.
Using more than two features did not improve the performance of the false positive classification. Further investigation into the robustness of the manually selected features across image modalities, organs, and uncertainty estimation techniques is required.
6 Compliance with Ethical Standards
The UMCU Medical Ethical Committee has reviewed this study and informed consent was waived due to its retrospective nature.
This work was financially supported by the project IMPACT (Intelligence based iMprovement of Personalized treatment And Clinical workflow supporT) in the framework of the EU research programme ITEA3 (Information Technology for European Advancement). The authors declare no conflict of interest.
References

- (2017) DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv:1606.00915. Cited by: §3.2.
- (2018) Leveraging Uncertainty Estimates for Predicting Segmentation Quality. arXiv:1807.00502. Cited by: §2.
- (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), JMLR Workshop and Conference Proceedings, Vol. 48, pp. 1050–1059. Cited by: §1, §3.2.
- (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pp. 1321–1330. Cited by: §1.
- (2017) Evaluation of motion correction for clinical dynamic contrast enhanced MRI of the liver. Physics in Medicine & Biology 62 (19), pp. 7556–7568. Cited by: §3.1.
- (2020) Analyzing the Quality and Challenges of Uncertainty Estimations for Brain Tumor Segmentation. Frontiers in Neuroscience 14, pp. 282. Cited by: §2, §4.1.
- (2018) Uncertainty-driven Sanity Check: Application to Postoperative Brain Tumor Cavity Segmentation. arXiv:1806.03106. Cited by: §2.
- Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding. arXiv:1511.02680. Cited by: §3.2.
- (2015) Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations (ICLR 2015). arXiv:1412.6980. Cited by: §3.2.
- (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems 30, pp. 6402–6413. Cited by: §1.
- (2020) Confidence calibration and predictive uncertainty estimation for deep medical image segmentation. IEEE Transactions on Medical Imaging. Cited by: §1, §2.
- (2018) Exploring Uncertainty Measures in Deep Networks for Multiple Sclerosis Lesion Detection and Segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, pp. 655–663. Cited by: §1, §2.
- (2019) Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems 32, pp. 13991–14002. Cited by: §1.
- Propagating Uncertainty in Multi-Stage Bayesian Convolutional Neural Networks with Application to Pulmonary Nodule Detection. arXiv:1712.00497. Cited by: §2.
- (2002) The early detection of liver metastases. Cancer Imaging 2 (2), pp. 1–3. Cited by: §1.
- (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241. Cited by: §3.2.
- (2018) Inherent Brain Segmentation Quality Control from Fully ConvNet Monte Carlo Sampling. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, pp. 664–672. Cited by: §1, §2.
- (2019) Towards increased trustworthiness of deep learning segmentation methods on cardiac MRI. In Medical Imaging 2019: Image Processing, Vol. 10949, pp. 324–330. Cited by: §2.