The interpretation of medical images is an essential task in the practice of a radiologist, yet challenging due to inconclusive and ambiguous image information, image artifacts, occlusions and more. Consider the example of chest radiography, a key imaging exam that enables the early detection of abnormalities in the lungs, heart or chest wall (Rajpurkar2018). Driven by the need to improve and accelerate the interpretation of such images, several deep learning solutions have been proposed for the automatic classification of radiographic findings (Wang2017; Guendel2018; Guendel2019; Yao2018). However, the development, training and validation of these solutions are challenged by significant inter-rater variations in the detection and classification of such radiographic findings (Rajpurkar2018). This can be caused by inconclusive evidence in the data or subjective definitions of the appearance of different findings. Similar challenges are faced in other applications as well, e.g., the view-classification of abdominal ultrasound images or the assessment of brain metastases in magnetic resonance (MR) scans of the brain. We argue that modeling this variability when designing a system for assessing this type of data is essential – an aspect which was not considered in previous work.
Using principles of information theory and subjective logic (Josang2016), based on the Dempster-Shafer framework for modeling of evidence (Dempster1968), we present a method for training a parametric model that generates both an image-level class probability and a corresponding uncertainty measure. We evaluate this method on the image-level labeling of abnormalities on chest radiographs, the view-classification of abdominal ultrasound images, and the detection of small brain metastases in brain MR scans. There, we demonstrate that one can effectively use the uncertainty measure to avoid returning a prediction on the cases with highest uncertainty, thereby consistently achieving a more accurate classification on the remaining cases. We also propose uncertainty-driven bootstrapping as a means to filter out training samples with highest predictive uncertainty, in order to improve robustness and accuracy on unseen data. Finally, we empirically show that the uncertainty measure can distinguish radiographs with correct and incorrect labels according to a multi-radiologist-consensus study. This correlation indicates the potential of the uncertainty metric to help build trust between the user and the system. This paper is an extended version of our work presented in (Ghesu2019).
2 Background and Motivation
2.1 Machine Learning for Abnormality Assessment
Assessment of Chest Radiographs: The open access to the ChestX-Ray8 dataset (Wang2017) of chest radiographs has led to a series of publications that propose machine learning based systems for abnormality detection and classification. With this dataset, Wang2017
evaluated several state-of-the-art convolutional neural network architectures to address this problem and reported a baseline average area under the receiver operating characteristic curve (ROC-AUC) of 0.75. This performance level was increased by Islam2017 using an ensemble of classification models. Further improvements were achieved by modeling the correlation between different abnormalities based on prevalence and co-morbidity (Yao2017; Yao2018), and by explicitly integrating information observed in lateral/oblique radiographs paired with the frontal projection radiographs (Rubin2018). An alternative method focused on driving the attention of the learning model to the image sub-regions that are most relevant for the considered abnormalities (Guan2018). Cai2018
proposed an attention mining strategy to identify regions with abnormalities and showed that it significantly outperforms heuristic approaches, such as class activation maps. A curriculum learning method based on quantified disease severity-levels was proposed as an alternative (Tang2018).
State-of-the-art results on the official split of the ChestX-Ray8 dataset are reported in Guendel2018, including the follow-up work presented in Guendel2019. Using multi-task learning coupled with a location-aware dense neural network learning architecture, an average ROC-AUC of 0.81 was achieved. On the official split of the dataset from the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial (PLCO), an average performance of 0.88 (ROC-AUC) was reported for 12 different abnormalities.
In light of all these publications, a recent study compared the performance of such an AI system with the performance of 9 practicing radiologists (Rajpurkar2018). The authors selected 6 board-certified radiologists with an average experience of 12 years (ranging between 4 and 28 years) and 3 senior radiology residents. The 9 readers originated from 3 academic institutions. While the study indicates that the system can surpass human performance, it also highlights the high variability among different expert radiologists in the interpretation of chest radiographs. The reported average specificity of the readers was very high (over 95%), while their average sensitivity was 50% ± 8%. Such a large inter-rater variability raises several questions: How can one obtain accurate ground truth data? To what extent does the label noise affect the training process and the system performance? The aforementioned reference solutions do not consider this variability. In practice, this typically leads to models with overconfident predictions and limited generalization on unseen data. In this context, Irvin2019 recently presented a new public dataset with image-level annotations and additional uncertainty labels that can be taken into account during system training.
Detection of Brain Metastases: An additional example is the localization of brain metastases for treatment selection, planning and longitudinal monitoring. Contrast-enhanced magnetization-prepared rapid acquisition with gradient echo (MPRAGE) scans enable the detection and segmentation of brain metastases, which serves as essential information in guiding radiosurgery protocols and other treatment decisions. However, identifying and segmenting small metastases is challenging, and there is a significant degree of ambiguity in assessing and manually annotating such small metastases (pope2018brain). Computer-aided assistance could have a significant impact on the staging, selection and implementation of treatment, and the assessment of therapeutic response; however, the ambiguity in the visual appearance, and hence in the reference data annotation, also limits the performance of machine learning systems trained for this task.
2.2 Machine Learning for Image-View Classification
Abdominal ultrasonography (US) is a commonly performed imaging test for a variety of ailments. High patient throughput and decreasing reimbursement can lead to errors or a lack of anatomic markup and orientation in the acquired images (vannetti2015usability). A typical abdominal exam comprises a trained sonographer navigating to and capturing a series of views of abdominal organs, freezing the view, and recording clinically relevant measurements. The classification of these views depending on the underlying anatomy is a challenging problem; recent studies indicated an inter-rater agreement of less than 80%, mainly because of poor image quality or symmetry confusion (e.g., between the left and right kidney). The substantial manual interaction is not only burdensome for the operator and substantially lowers the workflow efficiency, but also introduces user bias to the acquired patient data, which has led to a series of publications on how to optimize US-based screening workflows (xu2018less; lin2019multi; otey2006automatic; aschkenasy2006unsupervised). This uncertainty poses a challenge to a machine learning system in solving the task.
2.3 Principles of Uncertainty Estimation
Explicitly quantifying the classification uncertainty based on observed data is a principled strategy to address the aforementioned challenges. Early contributions in this field were based on Bayesian estimation theory to measure model uncertainty (Hinton1993; Mackay1992). In the context of deep learning, techniques such as variational dropout (Molchanov2017; Gal2016; Kingma2015) have been proposed to approximate Bayesian learning, while better coping with the high computational requirements for large hierarchical models. Benefits are demonstrated in (Kuo2019). Alternatively, ensembles of deep learning models have been proposed by Laks2017 to implicitly model both epistemic and aleatoric uncertainty (see Figure 1). Nonetheless, the computational cost of these methods remains a limiting factor when training the large models that are often used in practice.
3 Proposed Method
We propose a model for joint sample classification and predictive uncertainty estimation, following the Dempster-Shafer theory of evidence (Dempster1968) and principles of subjective logic (Josang2016). This research is inspired by the work of Sensoy2018.
3.1 Modeling the Predictive Uncertainty
We focus on the problem of binary classification, and define the per-class estimated probabilities of an arbitrary data sample as $p^+$ (for the positive class) and $p^-$ (for the negative class), and the estimated predictive uncertainty as $u$, with $p^+ + p^- = 1$ and $u \in [0, 1]$.
The classification problem is reduced to a problem of estimating belief values (also called belief masses) that indicate the membership of a sample to a specific class (Dempster1968; Josang2016). We denote by $b^+$ and $b^-$ the belief values for the positive and negative class, respectively. In this theoretical framework, these values are computed from non-negative evidence values $e^+$ and $e^-$, which indicate, based on the features of a given sample, the likelihood of it being classified in the positive or the negative class: $b^+ = e^+ / E$ and $b^- = e^- / E$, with $E = e^+ + e^- + 2$ denoting the total evidence. With these variables, one can also quantify the so-called uncertainty mass, which is defined as $u = 2 / E$, such that $b^+ + b^- + u = 1$. For the considered binary classification setting, we propose to model the distribution of such evidence values using the beta distribution, defined by two parameters $\alpha$ and $\beta$ as:

$\mathrm{Beta}(p; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, p^{\alpha - 1} (1 - p)^{\beta - 1},$

where $\Gamma$ denotes the gamma function and $p \in [0, 1]$, with $\alpha = e^+ + 1$ and $\beta = e^- + 1$. The per-class probabilities can be derived as $p^+ = \alpha / E$ and $p^- = \beta / E$, where $E = \alpha + \beta$.
Figure 2 visualizes the beta distribution for different values of $\alpha$ and $\beta$. This model implicitly measures both epistemic and aleatoric uncertainty. More details about different types of uncertainty and their properties are depicted in Figure 1.
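To make the mapping concrete, the conversion from evidence values to belief masses, uncertainty mass and the expected class probability can be sketched as follows (a minimal sketch; the function name is ours, and the beta parameterization $\alpha = e^+ + 1$, $\beta = e^- + 1$ follows Sensoy2018):

```python
def evidence_to_outputs(e_pos, e_neg):
    """Map non-negative evidence values to belief masses, the
    uncertainty mass and the expected positive-class probability.

    Assumed parameterization (Sensoy2018): alpha = e+ + 1, beta = e- + 1.
    """
    alpha, beta = e_pos + 1.0, e_neg + 1.0
    E = alpha + beta                      # total evidence
    b_pos, b_neg = e_pos / E, e_neg / E   # belief masses
    u = 2.0 / E                           # uncertainty mass: b+ + b- + u = 1
    p_pos = alpha / E                     # expected positive-class probability
    return b_pos, b_neg, u, p_pos

# with no evidence at all, the model is totally uncertain
b_pos, b_neg, u, p_pos = evidence_to_outputs(0.0, 0.0)  # u = 1.0, p_pos = 0.5
```

Note that as evidence accumulates for either class, $E$ grows and the uncertainty mass $u$ shrinks toward zero.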
3.2 Learning to Predict Uncertainty from Labeled Data
Let us assume a labeled training dataset is given as $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, consisting of pairs of images $\mathbf{x}_i$ with a binary class assignment $y_i \in \{0, 1\}$. We propose to use a parametric model – a deep convolutional neural network – to estimate the per-class evidence values from the image data. Let $\boldsymbol{\theta}$ denote the parameters of this model. The evidence values are estimated as $(e_i^+, e_i^-) = f(\mathbf{x}_i; \boldsymbol{\theta})$, where $e_i^+$ and $e_i^-$ denote the estimated evidence values for the positive/negative class for sample $\mathbf{x}_i$, and $f$ denotes the model as a functional component.
Using maximum likelihood estimation, one can learn the network parameters $\boldsymbol{\theta}$ by optimizing the Bayes risk of the class predictor with a beta prior distribution:

$\mathcal{L}_i^{\mathrm{data}} = \int \left\| \mathbf{y}_i - \mathbf{p}_i \right\|_2^2 \, \mathrm{Beta}(p_i; \alpha_i, \beta_i) \, dp_i,$

where $i$ denotes the index of the training example from dataset $\mathcal{D}$; $\mathbf{p}_i = [p_i^+, p_i^-]^\top$ and $\mathbf{y}_i$ represent the predicted probability and label in vector form for training sample $\mathbf{x}_i$ (the label $\mathbf{y}_i$ is a one-hot encoding of the two considered classes). The term $\left\| \mathbf{y}_i - \mathbf{p}_i \right\|_2^2$ defines the goodness of fit. Using linearity properties of the expectation, the Bayes risk becomes:

$\mathcal{L}_i^{\mathrm{data}} = (y_i - p_i^+)^2 + (\bar{y}_i - p_i^-)^2 + \frac{p_i^+(1 - p_i^+) + p_i^-(1 - p_i^-)}{E_i + 1},$

where $\bar{y}_i = 1 - y_i$ and $\mathbf{p}_i$ denotes the probabilistic prediction. In this equation, the first two terms measure the goodness of fit, and the last term encodes the variance of the prediction (Sensoy2018).
A regularization term is added to the loss function to penalize low-uncertainty predictions for data samples with limited/low per-class evidence values. We denote this term $\mathcal{L}_i^{\mathrm{reg}}$ and define it as the relative entropy, i.e., the Kullback–Leibler divergence, between the beta-distributed prior term and the beta distribution with total uncertainty ($\alpha = \beta = 1$). In this way, we account for cost deviations from the total uncertainty state (i.e., $u = 1$) which do not contribute to the data fit (Sensoy2018):

$\mathcal{L}_i^{\mathrm{reg}} = \mathrm{KL}\left( \mathrm{Beta}(p_i; \tilde{\alpha}_i, \tilde{\beta}_i) \,\|\, \mathrm{Beta}(p_i; 1, 1) \right),$

where $\tilde{\alpha}_i$ and $\tilde{\beta}_i$ are the beta parameters after removing the evidence of the true class, i.e., $\tilde{\alpha}_i = 1$ for $y_i = 1$ and $\tilde{\beta}_i = 1$ for $y_i = 0$. Removing additive constants and using properties of the logarithm function, the regularization term becomes:

$\mathcal{L}_i^{\mathrm{reg}} = \log \frac{\Gamma(\tilde{\alpha}_i + \tilde{\beta}_i)}{\Gamma(\tilde{\alpha}_i)\,\Gamma(\tilde{\beta}_i)} + (\tilde{\alpha}_i - 1)\left[\psi(\tilde{\alpha}_i) - \psi(\tilde{\alpha}_i + \tilde{\beta}_i)\right] + (\tilde{\beta}_i - 1)\left[\psi(\tilde{\beta}_i) - \psi(\tilde{\alpha}_i + \tilde{\beta}_i)\right],$

where $\psi$ denotes the digamma function and $i$ is the sample index. In this context, we define the total loss:

$\mathcal{L} = \sum_{i} \left( \mathcal{L}_i^{\mathrm{data}} + \lambda\, \mathcal{L}_i^{\mathrm{reg}} \right).$

Using the stochastic gradient descent method, the total loss $\mathcal{L}$ is optimized on the training set, with the annealing coefficient $\lambda$ starting at a small value close to zero and gradually increased during training.
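The per-sample loss above can be sketched as follows (a minimal stdlib-only sketch under our stated assumptions; `edl_beta_loss` is a hypothetical name, and the digamma function is approximated numerically from `math.lgamma` to keep the example self-contained – in practice one would use an exact implementation such as `scipy.special.digamma`):

```python
import math

def digamma(x, h=1e-5):
    """Numerical digamma via a central difference of log-gamma
    (stdlib-only approximation for self-containment)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def edl_beta_loss(e_pos, e_neg, y, lam):
    """Per-sample loss: Bayes risk under the beta prior plus the
    annealed KL regularizer (binary case, following Sensoy2018).

    e_pos, e_neg: non-negative predicted evidence values
    y:            binary label (1 = positive class)
    lam:          annealing coefficient for the KL term
    """
    alpha, beta = e_pos + 1.0, e_neg + 1.0
    E = alpha + beta
    p_pos, p_neg = alpha / E, beta / E
    # goodness of fit plus prediction variance
    fit = ((y - p_pos) ** 2 + ((1.0 - y) - p_neg) ** 2
           + (p_pos * (1.0 - p_pos) + p_neg * (1.0 - p_neg)) / (E + 1.0))
    # remove the evidence of the true class before regularizing,
    # so confident correct predictions are not penalized
    a_t = 1.0 if y == 1 else alpha
    b_t = 1.0 if y == 0 else beta
    # KL( Beta(a_t, b_t) || Beta(1, 1) ): distance from total uncertainty
    kl = (math.lgamma(a_t + b_t) - math.lgamma(a_t) - math.lgamma(b_t)
          + (a_t - 1.0) * (digamma(a_t) - digamma(a_t + b_t))
          + (b_t - 1.0) * (digamma(b_t) - digamma(a_t + b_t)))
    return fit + lam * kl
```

With strong evidence for the correct class, the KL term vanishes and only a small fit term remains; with strong evidence for the wrong class, both terms penalize the prediction.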
An adequate sampling of the underlying distribution is essential to ensure stability during training and a robust estimation of the evidence values. We empirically found dropout (Srivastava2014) to be a simple and very effective strategy to address this problem. Through the random deactivation of neurons, dropout emulates an ensemble of deep models, enabling an effective sampling during training. Alternatively, an explicit ensemble of independently trained models may be used. Following the principles of deep ensembles (Laks2017), the per-class evidence can be computed from the ensemble estimates via averaging. In our experiments we empirically found no significant difference between the two approaches. More details can be found in Section 4.
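Under the ensemble alternative, the aggregation of per-class evidence can be sketched as follows (illustrative only; `aggregate_evidence` is a hypothetical helper, and the same averaging applies to repeated dropout forward passes):

```python
import numpy as np

def aggregate_evidence(samples):
    """Average per-class evidence over T stochastic forward passes
    (test-time dropout) or T ensemble members; both cases are treated
    identically here (our assumption, following deep-ensemble averaging).

    samples: sequence of (e_pos, e_neg) pairs, one per pass/member.
    """
    e_pos, e_neg = np.asarray(samples, dtype=float).mean(axis=0)
    u = 2.0 / (e_pos + e_neg + 2.0)  # uncertainty mass of the aggregate
    return e_pos, e_neg, u

# three passes that disagree on the amount of positive evidence
e_pos, e_neg, u = aggregate_evidence([[4.0, 0.0], [2.0, 0.0], [0.0, 0.0]])
```

Averaging in evidence space keeps the aggregate within the same beta parameterization, so the uncertainty mass of the combined prediction is computed exactly as for a single model.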
3.3 Uncertainty-driven Bootstrapping
Given a dataset $\mathcal{D}$, let us assume $\hat{\boldsymbol{\theta}}$ denotes the estimated model parameters, which are used to measure the predictive uncertainty $u_i$ for each sample $\mathbf{x}_i$. An efficient strategy to filter the dataset with the target of reducing label noise is to eliminate a fraction $\eta$ of samples with highest uncertainty. Without loss of generality, let us reorder the samples in $\mathcal{D}$ in descending order according to the predictive uncertainty value, such that $u_{(1)} \geq u_{(2)} \geq \dots \geq u_{(N)}$. We define the selected subset as:

$\mathcal{D}' = \left\{ \left(\mathbf{x}_{(i)}, y_{(i)}\right) : i > \lceil \eta N \rceil \right\}.$

The hypothesis is that by retraining the model on dataset $\mathcal{D}'$ one can increase the robustness during training and improve its performance on unseen data. Please note, the fraction of eliminated samples, i.e., the value of $\eta$, is highly dependent on the prior probability of label noise, the problem complexity, and the capacity of the learning model to capture the underlying distribution of the data.
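The filtering step above can be sketched as follows (a hypothetical helper; the function name and the rounding convention for the eliminated fraction are our assumptions):

```python
import numpy as np

def bootstrap_filter(uncertainties, frac):
    """Return the indices of the samples kept after removing the
    fraction `frac` of the dataset with highest predictive uncertainty
    (uncertainty-driven bootstrapping sketch)."""
    u = np.asarray(uncertainties, dtype=float)
    n_drop = int(round(frac * len(u)))
    order = np.argsort(u)                    # ascending uncertainty
    return np.sort(order[: len(u) - n_drop]) # keep the most certain samples

# keep 60% of a toy dataset of five samples
keep = bootstrap_filter([0.9, 0.1, 0.5, 0.95, 0.2], frac=0.4)
# keep -> indices 1, 2, 4 (the three most certain samples)
```

The model is then retrained from scratch on the retained subset; the fraction to remove is a tuning parameter, as discussed above.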
3.3.1 Relation to Robust M-Estimators
Conceptually, one can reformulate this strategy as optimizing the total loss defined in Section 3.2 using a per-sample multiplicative weight determined from the estimated uncertainty $u_i$. As such, this weight is proportional to the so-called inlier noise, with the per-sample loss bounded. This can be regarded as a robust M-estimator, described in more detail in (Meer2004). As this is part of our ongoing work, we do not include experiments related to this approach in this paper.
4 Experiments and Results
We investigated the performance and properties of our method on three different problems: the classification of abnormalities in frontal chest radiographs, the view-classification of abdominal ultrasound images, and the detection of small metastases in MR scans of the brain.
4.1 Assessment of Chest Radiographs
We considered several radiographic findings and abnormalities, including calcified nodules (which are often granulomas), fibrosis, scarring, osseous lesions, cardiac abnormalities (e.g., an enlarged cardiac silhouette, which can suggest cardiomegaly) and pleural effusions (accumulation of fluid in the pleural space). Often these abnormalities co-occur.
4.1.1 Dataset and Setup
We used two public datasets, the ChestX-Ray8 dataset published by Wang2017 and the dataset from the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial (PLCO). They contain a series of frontal chest radiographs in anterior-posterior (AP) or posterior-anterior (PA) view. Each image is associated with binary labels indicating the presence of the considered abnormalities. The ChestX-Ray8 dataset contains 112,120 images from over 30,000 patients, with binary label annotations for 14 findings. These were automatically generated by parsing radiological reports using natural language processing (NLP) software (Wang2017). On the other hand, the PLCO dataset was built as part of a screening trial, containing 185,421 images from over 55,000 patients and covering 12 different abnormalities. More details are provided in Table 1.
Table 1: Overview of the two chest radiograph datasets.

| | ChestX-Ray8 (Wang2017) | PLCO (PLCO) |
| --- | --- | --- |
| Number of images | 112,120 | 185,421 |
| Number of patients | 30,805 | 56,071 |
| Avg. images per patient | 3.6 | 3.3 |
We selected location-aware dense networks (Guendel2018; Guendel2019) as the reference method. On the official split of the ChestX-Ray8 dataset this method achieves an average ROC-AUC of 0.81 (for comparison, related competing methods report lower average scores (Wang2017; Yao2018)). On the official split of the PLCO dataset the reported average performance is higher, at a ROC-AUC of 0.88. We hypothesize that this difference in performance is explained by the better quality of the labels in the PLCO dataset. More details on this aspect are provided throughout this section. We also investigated the benefits of using deep ensembles instead of dropout to improve the sampling (an ensemble of models was trained on random subsets of 80% of the training data; we refer to this method with the keyword [ens]).
Random subsets of images were selected from both datasets for testing. These images were interpreted and manually labeled by multiple radiologists. For the PLCO dataset, the testing chest radiographs were annotated by board-certified expert radiologists; the final label for each image was determined using a majority vote among the opinions, i.e., the reads of the aforementioned radiologists and the original label of the image, established during the cancer screening trial. For the ChestX-Ray8 dataset, 689 images were selected for testing and read by 4 board-certified radiologists. For each image, the label was decided by a consensus discussion. For both datasets, the remaining data was split at patient level into 90% training and 10% validation. All images were rescaled to the network input resolution using bilinear interpolation.
4.1.2 Model Architecture and Training
We used a DenseNet-121 architecture (Huang2017) and inserted a dropout layer after the last convolutional layer. A fully connected layer with ReLU activation units completes the mapping between the input image and the output evidence values $e^+$ and $e^-$. A systematic grid search was used to find the optimal configuration of training meta-parameters: the learning rate, the batch size, and the number of training epochs – around 12, using an early-stopping strategy with a patience of 3 epochs.
4.1.3 Uncertainty-driven Sample Rejection
Given predictive uncertainty estimates for each sample of a testing set, we propose to use this measure for sample rejection, i.e., set a threshold $\tau$ and configure the system to not output its prediction on any case with an expected uncertainty larger than $\tau$. One can view this as a system that is empowered to answer: "I don't know for sure". Recall, the predictive uncertainty is an additional measure to the class probability, with increased values on out-of-distribution cases under the given model. Formally, we refer to the degree of sample rejection using the term coverage, i.e., the expected percentage of cases on which the system does output a prediction. For example, at a coverage of 100%, the system outputs its prediction on all cases, while at a coverage of 80% the system avoids outputting its prediction on the 20% of cases with highest uncertainty.
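The coverage-based rejection rule can be sketched as follows (illustrative; selecting the threshold as an empirical quantile of the uncertainties is our assumption, and ties in the uncertainty values can make the realized coverage only approximate):

```python
import numpy as np

def accept_mask(uncertainties, coverage):
    """Boolean mask of the cases on which the system returns a
    prediction at the given coverage level; the remaining cases are
    answered with "I don't know for sure"."""
    u = np.asarray(uncertainties, dtype=float)
    tau = np.quantile(u, coverage)  # uncertainty threshold for this coverage
    return u <= tau

# at 60% coverage, the two most uncertain of five cases are rejected
mask = accept_mask([0.05, 0.2, 0.9, 0.1, 0.6], coverage=0.6)
```

Performance metrics such as ROC-AUC or F1 are then computed on the accepted subset only.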
With this strategy one can significantly increase system accuracy on the remaining cases compared to the state-of-the-art, as reported in Table 2 and Figure 3. For example, for the identification of calcified nodules or granulomas, a rejection rate of 25% leads to an increase of over 20% in the micro-average F1 score. We found no significant difference in average performance when using ensembles (see Figure 3).
Table 2: Per-finding performance, comparing Guendel2018 with our method at coverage levels of 100%, 90%, 75% and 50%.
Considering a standard classifier trained, e.g., by minimizing the binary cross-entropy function as presented by Guendel2018, we emphasize that one cannot use the probability measure alone to effectively perform sample rejection. We investigated this by eliminating samples with a predicted probability close to the decision boundary. More details can be seen in Figure 4 on the example of pleural effusion.
4.1.4 Relating Predictive Uncertainty with Reader Uncertainty
We provide an analysis of the properties of the estimated predictive uncertainty on the example of pleural effusion, based on the ChestX-Ray8 dataset. For the testing set of 689 cases that was reread by a committee of 4 experts in consensus, we define the so-called critical subset: the set containing only cases for which the original label, established by Wang2017 from the radiographic report via NLP, was changed/flipped by the committee. According to the committee, this set contained both easier cases (for which the NLP technology presumably failed to extract the correct information from the radiographic report) and more difficult cases with subtle or atypical pleural effusions.
In Figure 5, we empirically show that the predictive uncertainty measure correlates with the decision of the committee to change the image label. In other words, for cases outside the critical subset (i.e., cases that preserve the original label) our algorithm yields lower uncertainty estimates (average 0.16) at an average AUC of 0.976 (coverage of 100%). In contrast, on cases of the critical subset, the algorithm showed higher uncertainties, distributed between 0.1 and the maximum value of 1 (average 0.41). This indicates the ability of the algorithm to recognize the cases where annotation errors occurred in the first place (through NLP or human reader error). In Figure 6 we show several qualitative examples.
4.1.5 Uncertainty-driven Bootstrapping
Using uncertainty-driven bootstrapping, one can filter the training data, i.e., remove a fraction of training cases with highest uncertainty, with the goal of reducing label noise. On the example of pleural effusion, based on the ChestX-Ray8 dataset, we show that one can retrain the system on the remaining data and achieve better performance on an unseen dataset. Performance is reported as a triple of values [AUC; F1-score (for the positive class); F1-score (for the negative class)].
After the initial training, the baseline performance of our method was measured on the testing set (including all 689 cases). We then constructed different versions of the training set by eliminating increasing fractions of the training data with highest predictive uncertainty. The metrics on the testing set improved with each retraining on the filtered training sets. This significant increase demonstrates the potential of this approach to improve the robustness of the model to label noise.
4.2 View-classification on Abdominal Ultrasound Images
A standard abdominal US examination typically consists of ten standard views and their corresponding measurements, covering five abdominal structures (right hepatic lobe, left hepatic lobe, right kidney, left kidney, and spleen) at two orientations (longitudinal/transverse). Here we focus on the longitudinal view-classification of the left versus the right kidney.
4.2.1 Dataset and Setup
To demonstrate the efficacy of our proposed approach in leveraging the predictive uncertainty, we trained a state-of-the-art binary classifier to be used as baseline. For this we selected the DenseNet-121 architecture (Huang2017). For training and validation of the classification framework, we used kidney US images acquired retrospectively from 706 subjects. For testing, US images from a disjoint set of subjects were used. The images were acquired both as longitudinal sequences and as single frames. The view information was assigned manually by at least one expert urologist, either at the time of acquisition or retrospectively. As a pre-processing step, the images were resampled to a fixed physical resolution and resized to a fixed image size. Finally, a mask was applied to each image to hide any text or icon-based information that could indicate the organ in the view.
The baseline performance of the classifier was measured at a ROC-AUC of 0.974. This is a competitive value, comparable to previous state-of-the-art results reported in (xu2018less), where a simultaneous view-classification and measurement framework for abdominal US exams was presented. To train the model for estimating predictive uncertainty we used the same architecture as for the baseline classifier and similar training meta-parameters as for the chest radiograph experiments.
4.2.2 Uncertainty-driven Sample Rejection
Following a similar strategy for rejecting samples with highest predictive uncertainty, one can significantly improve the classification performance of the trained model on the remaining cases, e.g., from a ROC-AUC of 0.972 at a coverage of 100% to a ROC-AUC of 0.991 at a coverage of 80% – that is more than 10% improvement in terms of precision. More details are shown in Figure 7. Several example images with both high and low predictive uncertainty are shown in Figure 8.
4.3 Brain metastases detection on MPRAGE images
Automated detection and segmentation of small metastases in 3D MRI scans could support therapy workflows. However, this task remains challenging due, in part, to the imbalance between metastatic tissue and normal tissue in an MRI volume. Reliable detection or exclusion of metastases on 2D slices within a volumetric image processing pipeline can mitigate this imbalance. Thus, in this work, we focus on classifying 2D slices with metastases in MPRAGE volumes.
4.3.1 Dataset and Results
We utilized a 2.5D encoder-decoder network to first obtain a segmentation mask showing potential areas of suspected metastases. The segmentation mask is subsequently used along with input slices as an input to a DenseNet121 model (Huang2017) to perform a slice-wise classification.
Our dataset included 480 contrast-enhanced MPRAGE image volumes from 442 patients treated primarily with stereotactic radiosurgery to one or more brain metastases. Metastasis gross tumor volumes, manually delineated in the course of standard clinical treatment, were reviewed for inclusion in the study. We excluded 47 cases where the planned treatment did not include all identifiable metastases. A further 13 cases were excluded due to non-standard orientations, fields-of-view or imaging artifacts. The dataset was split into a training set (341 cases), a validation set (36 cases) and a test set (43 cases). In order to evaluate the performance of detecting small metastases, we selected all 16 patients from the test set with all metastatic lesions under 1 cm. The total number of annotated small metastases in this subset was 35.
To evaluate the efficacy of rejecting samples with high predictive uncertainty, we measured the classification performance of the trained model at different coverage settings. The baseline performance of the model trained without the proposed approach was 0.85 in ROC-AUC. Using our method, the ROC-AUC increased to 0.88 at a coverage of 100%, where no test samples were rejected by the uncertainty-driven sample rejection described in Section 4.2.2. This indicates the improved ability of our model to capture the underlying noise in the training labels. The classification performance further increased to a ROC-AUC of 0.925 at a coverage of 50% and to a ROC-AUC of 0.96 at a coverage of 20%. More details are shown in Figure 9. Several example images with both high and low predictive uncertainty are shown in Figure 10.
5 Summary and Conclusion
In conclusion, this paper presents an effective method for the joint estimation of class probabilities and predictive uncertainty. Extensive experiments were conducted on large datasets in the context of chest radiograph assessment, abdominal ultrasound view-classification and the detection of small metastases in brain MR scans. We demonstrate that it is possible to achieve a significant increase in accuracy if sample rejection is performed based on the estimated uncertainty measure. For the assessment of chest radiographs, we highlight the capacity of the system to distinguish, based on the uncertainty measure, radiographs with correct and incorrect labels according to a multi-radiologist-consensus user study. Finally, we provide an insight into how to effectively use the predictive uncertainty to stratify the training dataset via bootstrapping to achieve higher accuracy on unseen data.
5.1 Discussion and Directions of Future Research
Based on these results and the potential impact of predictive uncertainty on system users, we believe that more research is required to address several open problems, including:
Investigation of additional sampling strategies, which may better capture the underlying distribution of the data and enable an improved estimate of the predictive uncertainty.
Establishing a formal measure to quantify the quality of the estimated predictive uncertainty, distinguishing between reducible epistemic uncertainty and aleatoric data uncertainty. One step may be to relate the uncertainty to a consensus or majority decision of experts. In this paper, we made a first step in this direction for chest radiograph assessment based on a reader-study involving 4 board-certified radiologists.
Based on the previous point, more research is needed to verify how predictive uncertainty contributes to building user trust and reducing negative user bias. This is a key aspect to ensure that such systems are accepted and successfully used in daily practice.
Acknowledgement The authors thank the National Cancer Institute for access to NCI’s data collected by the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. The statements contained herein are solely those of the authors and do not represent or imply concurrence or endorsement by NCI.
Disclaimer The concepts and information presented in this paper are based on research results that are not commercially available.