The interpretation of chest radiographs is an essential task in the practice of a radiologist, enabling the early detection of thoracic diseases [9, 12]. To accelerate and improve the assessment of the continuously increasing number of radiographs, several deep learning solutions have recently been proposed for the automatic classification of radiographic findings [12, 4, 13]. Due to large variations in image quality and subjective definitions of disease appearance, there is a large inter-rater variability, which leads to a high degree of label noise. Modeling this variability when designing an automatic system for assessing this type of data is essential, an aspect that was not considered in previous work.
Using principles of information theory and subjective logic [6] based on the Dempster-Shafer framework for modeling of evidence [1], we present a method for training a system that generates both an image-level label and a classification uncertainty measure. We evaluate this system for classification of abnormalities on chest radiographs. The main contributions of this paper include:
- proposing uncertainty-driven bootstrapping as a means to filter training samples with highest predictive uncertainty to improve robustness and accuracy;
- comparing methods for generating stochastic classifications to model classification uncertainty;
- presenting an application of this system to identify cases with uncertain classification, yielding more accurate classification on the remaining cases;
- showing that the uncertainty measure can distinguish radiographs with correct and incorrect labels according to a multi-radiologist-consensus study.
2 Background and Motivation
2.1 Machine Learning for the Assessment of Chest Radiographs
The open access to the ChestX-Ray8 dataset [12] of chest radiographs has led to a series of recent publications that propose machine learning based systems for disease classification. With this dataset, Wang et al. [12] also report a first performance baseline of a deep neural network at an average area under the receiver operating characteristic curve (ROC-AUC) of 0.75. These results have been further improved by using multi-scale image analysis [13], or by actively focusing the attention of the network on the most relevant sub-regions of the lungs [3]. State-of-the-art results on the official split of the ChestX-Ray8 dataset are reported in [4] (avg. ROC-AUC of 0.81), using a location-aware dense neural network. In light of these contributions, a recent study compares the performance of such an AI system and 9 practicing radiologists [9]. While the study indicates that the system can surpass human performance, it also highlights the high variability among different expert radiologists for the reading of chest radiographs. The reported average specificity of the readers is very high (over 95%), with an average sensitivity of 50% ± 8%. With such a large inter-rater variability, one may ask: How can real 'ground truth' data be obtained? Does the label noise affect the training? Current solutions do not consider this variability, which leads to models with overconfident predictions and limited generalization.
Principles of Uncertainty Estimation: One way to handle this challenge is to explicitly estimate the classification uncertainty from the data. Recent methods for uncertainty estimation in the context of deep learning rely on Bayesian estimation theory [8] or ensembles [7] and demonstrate increased robustness to out-of-distribution data. However, these approaches come with significant computational limitations, associated either with the high complexity of sampling parameter spaces of deep models for Bayesian risk estimation or with the challenge of managing ensembles of deep models. Sensoy et al. [10] propose an efficient alternative based on the theory of subjective logic [6], training a deep neural network to estimate the sample uncertainty based on observed data.
3 Proposed Method
Following the work of Sensoy et al. [10] based on the Dempster-Shafer theory of evidence [1], we apply principles of subjective logic [6] to derive a binary classification model that supports the joint estimation of per-class probabilities ($p^+$, $p^-$) and predictive uncertainty $u$. In this context, a decisional framework is defined through the assignment of so-called belief masses from evidence collected from observed data to individual attributes, e.g., membership to a class [1, 6]. Let us denote by $b^+$ and $b^-$ the belief values for the positive and negative class, respectively. The uncertainty mass $u$ is defined as $u = 1 - b^+ - b^-$, where $b^+ = e^+/E$ and $b^- = e^-/E$, with $e^+, e^- \geq 0$ denoting the per-class collected evidence and $E = e^+ + e^- + 2$ the total evidence. For binary classification, we propose to model the distribution of such evidence values using the beta distribution, defined by two parameters $\alpha$ and $\beta$ as

$$f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1},$$

where $\Gamma$ denotes the gamma function and $\alpha, \beta \geq 1$ with $\alpha = e^+ + 1$ and $\beta = e^- + 1$. In this context, the per-class probabilities can be derived as $p^+ = \alpha/E$ and $p^- = \beta/E$. Figure 1 visualizes the beta distribution for different $(\alpha, \beta)$ values.
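As a concrete illustration of how evidence maps to beliefs, uncertainty and class probabilities, the following minimal Python sketch assumes the two-class subjective-logic relations $u = 1 - b^+ - b^-$, $b^\pm = e^\pm / E$, $E = e^+ + e^- + 2$, $\alpha = e^+ + 1$ and $\beta = e^- + 1$. The function name `beliefs_from_evidence` and the dictionary keys are illustrative choices, not part of the method description.

```python
def beliefs_from_evidence(e_pos, e_neg):
    """Map non-negative per-class evidence to beliefs, uncertainty and probabilities."""
    if e_pos < 0 or e_neg < 0:
        raise ValueError("evidence must be non-negative")
    total = e_pos + e_neg + 2.0              # E = e+ + e- + 2 (assumed binary form)
    alpha, beta = e_pos + 1.0, e_neg + 1.0   # beta-distribution parameters
    return {
        "b_pos": e_pos / total,              # belief mass, positive class
        "b_neg": e_neg / total,              # belief mass, negative class
        "u": 2.0 / total,                    # equals 1 - b_pos - b_neg
        "p_pos": alpha / total,              # expected probability, positive class
        "p_neg": beta / total,               # expected probability, negative class
    }


no_evidence = beliefs_from_evidence(0.0, 0.0)
strong_positive = beliefs_from_evidence(18.0, 0.0)
```

Under these assumptions, a sample with no evidence has `u = 1` and `p_pos = 0.5`, while 18 units of positive evidence give `u = 0.1` and `p_pos = 0.95`.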
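A training dataset is provided: $\mathcal{D} = \{(I_k, y_k)\}_{k=1}^{N}$, composed of $N$ pairs of images $I_k$ with class assignment $y_k \in \{0, 1\}$. To estimate the per-class evidence values from the observed data, a deep neural network parametrized by $\theta$ can be applied, with $[e_k^+, e_k^-] = R(I_k; \theta)$, where $R$ denotes the network response function. Using maximum likelihood estimation, one can learn the network parameters $\hat{\theta}$ by optimizing the Bayes risk with a beta distributed prior:

$$\mathcal{L}_k^{data} = \int \lVert y_k - \hat{p}_k \rVert^2 \, \frac{\Gamma(\alpha_k + \beta_k)}{\Gamma(\alpha_k)\,\Gamma(\beta_k)} \, \hat{p}_k^{\alpha_k - 1} (1 - \hat{p}_k)^{\beta_k - 1} \, d\hat{p}_k, \tag{1}$$

where $k \in \{1, \ldots, N\}$ denotes the index of the training example from dataset $\mathcal{D}$, $\hat{p}_k$ the predicted probability on the training sample $k$, and $\mathcal{L}_k^{data}$ defines the goodness of fit. The squared norm is taken over both classes, i.e., between the label vector $(y_k, 1 - y_k)$ and the prediction $(\hat{p}_k, 1 - \hat{p}_k)$. Using linearity properties of the expectation, Eq. 1 becomes:

$$\mathcal{L}_k^{data} = (y_k - \hat{p}_k^+)^2 + (1 - y_k - \hat{p}_k^-)^2 + \frac{\hat{p}_k^+ (1 - \hat{p}_k^+) + \hat{p}_k^- (1 - \hat{p}_k^-)}{E_k + 1}, \tag{2}$$

where $\hat{p}_k^+$ and $\hat{p}_k^-$ denote the network's probabilistic prediction. The first two terms measure the goodness of fit, and the last term encodes the variance of the prediction.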
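The step from the Bayes risk integral to its closed form can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: it takes the closed form to be $(y - \hat{p}^+)^2 + (1 - y - \hat{p}^-)^2 + [\hat{p}^+(1 - \hat{p}^+) + \hat{p}^-(1 - \hat{p}^-)]/(E + 1)$ with $\hat{p}^+ = (e^+ + 1)/E$, $\hat{p}^- = (e^- + 1)/E$ and $E = e^+ + e^- + 2$, and compares it with a midpoint-rule integral of the two-class squared error under the beta density. The function names are hypothetical.

```python
import math


def data_loss(y, e_pos, e_neg):
    """Closed-form expected squared error for one sample (y is 0 or 1)."""
    total = e_pos + e_neg + 2.0
    p_pos = (e_pos + 1.0) / total
    p_neg = (e_neg + 1.0) / total
    fit = (y - p_pos) ** 2 + (1 - y - p_neg) ** 2
    variance = (p_pos * (1 - p_pos) + p_neg * (1 - p_neg)) / (total + 1.0)
    return fit + variance


def data_loss_numeric(y, e_pos, e_neg, steps=20000):
    """Midpoint-rule integral of the squared error under the beta density."""
    a, b = e_pos + 1.0, e_neg + 1.0
    norm = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))
    h = 1.0 / steps
    acc = 0.0
    for i in range(steps):
        x = (i + 0.5) * h                                  # probability of the positive class
        density = norm * x ** (a - 1) * (1 - x) ** (b - 1)
        sq_err = (y - x) ** 2 + ((1 - y) - (1 - x)) ** 2   # squared error over both classes
        acc += sq_err * density * h
    return acc
```

For example, with `y = 1`, `e_pos = 3` and `e_neg = 1`, both functions evaluate to 2/7 up to numerical integration error.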
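To ensure a high uncertainty value for data samples for which the gathered evidence is not conclusive for an accurate classification, an additional regularization term $\mathcal{L}_k^{reg}$ is added to the loss. Using information theory, this term is defined as the relative entropy, i.e., the Kullback-Leibler divergence, between the beta distributed prior term and the beta distribution with total uncertainty ($\alpha = \beta = 1$). In this way, cost deviations from the total uncertainty state, i.e., $u = 1$, which do not contribute to the data fit are accounted for [10]. With the additional term, the total cost becomes $\mathcal{L} = \sum_{k=1}^{N} \mathcal{L}_k$ with:

$$\mathcal{L}_k = \mathcal{L}_k^{data} + \lambda \, KL\big( f(\hat{p}_k; \tilde{\alpha}_k, \tilde{\beta}_k) \,\|\, f(\hat{p}_k; 1, 1) \big), \tag{3}$$

where $\lambda$ denotes the regularization factor, with $(\tilde{\alpha}_k, \tilde{\beta}_k) = (1, \beta_k)$ for $y_k = 1$ and $(\tilde{\alpha}_k, \tilde{\beta}_k) = (\alpha_k, 1)$ for $y_k = 0$. Removing additive constants and using properties of the logarithm function, one can simplify the regularization term to the following:

$$\mathcal{L}_k^{reg} = \log \frac{\Gamma(\tilde{\alpha}_k + \tilde{\beta}_k)}{\Gamma(\tilde{\alpha}_k)\,\Gamma(\tilde{\beta}_k)} + \sum_{x \in \{\tilde{\alpha}_k, \tilde{\beta}_k\}} (x - 1) \big( \psi(x) - \psi(\tilde{\alpha}_k + \tilde{\beta}_k) \big), \tag{4}$$

where $\psi$ denotes the digamma function and $k \in \{1, \ldots, N\}$. Using stochastic gradient descent, the total loss $\mathcal{L}$ is optimized on the training set $\mathcal{D}$.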
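The regularization term can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: it assumes the regularizer is the KL divergence between a beta distribution with parameters $(\tilde{\alpha}, \tilde{\beta})$ and the uniform beta distribution with parameters $(1, 1)$, where the evidence for the labelled class is removed before the divergence is computed, so that only misleading evidence is penalized. The helper `digamma` is a standard recurrence-plus-asymptotic-series approximation written out so that the example needs only the standard library; both function names are illustrative.

```python
import math


def digamma(x):
    """Digamma for x > 0 via upward recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:                 # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def kl_regularizer(y, e_pos, e_neg):
    """KL divergence between Beta(a, b) and Beta(1, 1) after removing label-consistent evidence."""
    alpha, beta = e_pos + 1.0, e_neg + 1.0
    # Assumption: for y = 1 keep only evidence for the negative class, and vice versa.
    a, b = (1.0, beta) if y == 1 else (alpha, 1.0)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + sum((x - 1.0) * (digamma(x) - digamma(a + b)) for x in (a, b))
```

Under this assumption the penalty is zero when all evidence supports the labelled class, and grows with the amount of evidence assigned to the other class.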
Sampling the Data Distribution: An important requirement to ensure training stability and to learn a robust estimation of evidence values is an adequate sampling of the data distribution. We empirically found dropout [11] to be a simple and very effective strategy to address this problem. In practice, dropout emulates an ensemble model combination driven by the random deactivation of neurons. Alternatively, one may use an explicit ensemble of $M$ models, each trained independently. Following the principles of deep ensembles [7], the per-class evidence can be computed from the ensemble estimates via averaging. In our work, we found dropout to be as effective as deep ensembles.
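The ensemble variant amounts to averaging the per-class evidence predicted by the individual models. A minimal sketch follows, assuming each model is a callable that returns a pair `(e_pos, e_neg)` and that the uncertainty is $u = 2/(e^+ + e^- + 2)$. The lambda functions are hypothetical stand-ins for trained networks and return fixed toy values.

```python
def ensemble_evidence(models, image):
    """Average the (e_pos, e_neg) estimates of independently trained models."""
    estimates = [model(image) for model in models]
    e_pos = sum(e[0] for e in estimates) / len(estimates)
    e_neg = sum(e[1] for e in estimates) / len(estimates)
    return e_pos, e_neg


# Hypothetical stand-ins for trained networks: each returns fixed toy evidence.
toy_models = [
    lambda img: (4.0, 1.0),
    lambda img: (6.0, 0.0),
    lambda img: (5.0, 2.0),
]

e_pos, e_neg = ensemble_evidence(toy_models, image=None)
uncertainty = 2.0 / (e_pos + e_neg + 2.0)   # assumed binary uncertainty mass
```

With these toy values the averaged evidence is `(5.0, 1.0)` and the resulting uncertainty is `0.25`.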
Uncertainty-driven Bootstrapping: Given the predictive uncertainty measure $u$, we propose a simple and effective algorithm for filtering the training set with the goal of reducing label noise. A fraction of the training samples with the highest uncertainty is eliminated and the model is retrained on the remaining data. Instead of sample elimination, robust M-estimators may be applied, using a per-sample weight that is inversely proportional to the predicted uncertainty. The hypothesis is that by focusing the training on 'confident' labels, we increase the robustness of the classifier and improve its performance on unseen data.
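Both variants described here, hard elimination and soft re-weighting, can be sketched as follows. The function names, the `eps` stabilizer and the normalization of the weights to a mean of 1 are assumptions made for illustration; the text only states that the weight is inversely proportional to the predicted uncertainty.

```python
def drop_most_uncertain(samples, uncertainties, fraction):
    """Return the samples left after removing the given fraction with highest uncertainty."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    num_drop = int(len(samples) * fraction)
    order = sorted(range(len(samples)), key=lambda i: uncertainties[i])
    kept = sorted(order[: len(samples) - num_drop])   # restore the original ordering
    return [samples[i] for i in kept]


def inverse_uncertainty_weights(uncertainties, eps=1e-6):
    """Soft alternative: weights proportional to 1/u, normalized to a mean of 1."""
    raw = [1.0 / (u + eps) for u in uncertainties]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


# Toy example: remove the 40% most uncertain of five samples.
remaining = drop_most_uncertain(["a", "b", "c", "d", "e"], [0.1, 0.9, 0.3, 0.8, 0.2], 0.4)
```

In the toy example, samples `b` and `d` have the highest uncertainty and are removed; the model would then be retrained on the remaining three.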
The evaluation is based on two datasets, ChestX-Ray8 [12] and PLCO [2]. Both datasets provide a series of AP/PA chest radiographs with binary labels on the presence of different radiological findings, e.g., granuloma, pleural effusion, or consolidation. The ChestX-Ray8 dataset contains 112,120 images from 30,805 patients, covering 14 findings extracted from radiological reports using natural language processing (NLP). In contrast, the PLCO dataset was constructed as part of a screening trial, containing 185,421 images from 56,071 patients and covering 12 different abnormalities.
For performance comparison, we selected location-aware dense networks [4] as baseline. This method achieves state-of-the-art results on this problem, with a reported average ROC-AUC of 0.81 (significantly higher than that of competing methods: 0.75 [12] and 0.77 [13]) on the official split of the ChestX-Ray8 dataset and a ROC-AUC of 0.88 on the official split of the PLCO dataset. To evaluate our method, we identified testing subsets with higher confidence labels from multi-radiologist studies. For PLCO, we randomly selected 565 test images and had 2 board-certified expert radiologists read the images, updating the labels to the majority vote of the 3 opinions (incl. the original label). For ChestX-Ray8, a subset of 689 test images was randomly selected and read by 4 board-certified radiologists. The final label was decided following a consensus discussion. For both datasets, the remaining data was split into 90% training and 10% validation. All images were down-sampled using bilinear interpolation.
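The PLCO relabeling rule, a majority vote over the original label and two additional reads, can be written in one line. The sketch below is purely illustrative; the function name and the 0/1 label encoding are assumptions.

```python
def majority_label(original, reader_a, reader_b):
    """Majority vote over three binary opinions (0 = finding absent, 1 = finding present)."""
    return 1 if (original + reader_a + reader_b) >= 2 else 0
```

For instance, an original label of 0 is overturned when both readers report the finding: `majority_label(0, 1, 1)` returns `1`.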
System Training: We constructed our learning model from the DenseNet-121 architecture [5]. A dropout layer with a dropout rate of 0.5 was inserted after the last convolutional layer. We also investigated the benefits of using deep ensembles to improve the sampling (models trained on random subsets of 80% of the training data; we refer to this with the keyword [ens]). A fully connected layer with ReLU activation units maps to the two evidence outputs $e^+$ and $e^-$. We used a systematic grid search to find the optimal configuration of training meta-parameters: learning rate, regularization factor $\lambda$ (decayed in two steps, after 1/3 and 2/3 of the epochs, respectively), training epochs (around 12, using an early stop strategy with a patience of 3 epochs) and a batch size of 128. The low number of epochs is explained by the large size of the dataset.
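The training schedule described above combines a step decay of the regularization factor with patience-based early stopping. The following sketch shows one way to express both; the three factor values in `values` are placeholders chosen for illustration and are not the settings used in the paper, and the function names are hypothetical.

```python
def regularization_factor(epoch, total_epochs, values=(1.0, 0.1, 0.01)):
    """Step decay of the regularization factor after 1/3 and 2/3 of the epochs.

    The three values are placeholders, not the settings used in the paper.
    """
    if epoch < total_epochs / 3:
        return values[0]
    if epoch < 2 * total_epochs / 3:
        return values[1]
    return values[2]


def early_stop_epoch(val_losses, patience=3):
    """Index of the epoch at which training halts, given per-epoch validation losses."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_losses) - 1
```

With 12 epochs (indexed from 0), the factor changes at epochs 4 and 8; training stops once the validation loss has failed to improve for 3 consecutive epochs.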
Uncertainty-driven Sample Rejection: Given a model trained for the assessment of an arbitrary finding, one can directly estimate the prediction uncertainty $u$. This is an orthogonal measure to the predicted probability, with increased values on out-of-distribution cases under the given model. One can use this measure for sample rejection, i.e., set a threshold $u_t$ and steer the system to not output its prediction on all cases with an expected uncertainty larger than $u_t$. Instead, these cases are labeled with the message "Do not know for sure; process case manually". In practice this leads to a significant increase in accuracy compared to the state-of-the-art on the remaining cases, as reported in Table 1 and Figure 2. For example, for the identification of granuloma, a rejection rate of 25% leads to an increase of over 20% in the micro-average F1 score. On the same abnormality, a 50% rejection rate leads to an F1 score over 0.99 for the prediction of negative cases. We found no significant difference in average performance when using ensembles (see Figure 2).
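Sample rejection can be sketched as a simple thresholding step followed by an evaluation restricted to the accepted cases. The snippet below is an illustration with toy values, not data from the experiments; the function names and the use of `None` to mark a rejected case are assumptions.

```python
def reject_uncertain(predictions, uncertainties, threshold):
    """Keep predictions whose uncertainty is at most the threshold; mark the rest as None."""
    return [p if u <= threshold else None for p, u in zip(predictions, uncertainties)]


def accuracy_on_accepted(decisions, labels):
    """Return (accuracy on accepted cases, rejection rate)."""
    accepted = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    rejection_rate = 1.0 - len(accepted) / len(decisions)
    if not accepted:
        return float("nan"), rejection_rate
    correct = sum(1 for d, y in accepted if d == y)
    return correct / len(accepted), rejection_rate


# Toy values for illustration only.
preds = [1, 0, 1, 0]
uncert = [0.1, 0.2, 0.7, 0.9]
labels = [1, 0, 0, 1]

acc_all, rate_all = accuracy_on_accepted(reject_uncertain(preds, uncert, 1.0), labels)
acc_rej, rate_rej = accuracy_on_accepted(reject_uncertain(preds, uncert, 0.5), labels)
```

In this toy setting the two misclassified cases are also the two most uncertain ones, so rejecting them raises the accuracy on the remaining cases from 0.5 to 1.0 at a rejection rate of 50%.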
| Finding | Guendel et al. [4] | Ours [0%] | Ours [10%] | Ours [25%] | Ours [50%] |
System versus Reader Uncertainty: To provide an insight into the meaning of the uncertainty measure and its correlation with the difficulty of cases, we evaluated our system on the detection of pleural effusion (excess accumulation of fluid in the pleural cavity) based on the ChestX-Ray8 dataset. In particular, we analyzed the test set of 689 cases that were relabeled by a committee of 4 experts. We defined a so-called critical set that contains only cases for which the label (positive or negative) was changed after the expert reevaluation. According to the committee, this set contained not only easy examples, for which the NLP algorithm probably failed to extract the correct labels from the radiographic report, but also difficult cases where either the image quality was limited or the evidence of effusion was very subtle. In Figure 3 (left), we empirically demonstrate that the uncertainty estimates of our algorithm correlate with the committee's decision to change the label. Specifically, for unchanged cases, our algorithm displayed very low uncertainty estimates (average 0.16) at an average AUC of 0.976 (rejection rate of 0%). In contrast, on cases in the critical set, the algorithm showed higher uncertainties distributed between 0.1 and the maximum value of 1 (average 0.41). This empirically demonstrates the ability of the algorithm to recognize the cases where annotation errors occurred in the first place (through NLP or human reader error). In Figure 3 (right) we show how cases of the critical set can be effectively filtered out using sample rejection. Qualitative examples are shown in Figure 4.
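The comparison between the critical set and the unchanged cases reduces to grouping the uncertainty estimates by whether the label was changed during relabeling. A minimal sketch is given below; the function name is hypothetical and the numbers are toy values, not data from the study.

```python
def mean_uncertainty_by_group(original_labels, consensus_labels, uncertainties):
    """Mean uncertainty for relabeled ('critical') and unchanged cases."""
    critical, unchanged = [], []
    for old, new, u in zip(original_labels, consensus_labels, uncertainties):
        (critical if old != new else unchanged).append(u)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(critical), mean(unchanged)


# Toy values for illustration only; these are not data from the study.
crit_mean, unch_mean = mean_uncertainty_by_group(
    original_labels=[1, 0, 0, 1, 0],
    consensus_labels=[1, 0, 1, 0, 0],
    uncertainties=[0.10, 0.20, 0.60, 0.40, 0.15],
)
```

Here the two relabeled cases have a mean uncertainty of 0.5, against 0.15 for the three unchanged cases.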
Uncertainty-driven Bootstrapping: We also investigated the benefit of using bootstrapping based on the uncertainty measure on the example of pleural effusion (ChestX-Ray8). We report performance as [AUC; F1-score (pos. class); F1-score (neg. class)]. After training our method, we measured the baseline performance on the test set. We then eliminated 5%, 10% and 15% of training samples with highest uncertainty, and retrained in each case on the remaining data. The metrics improved on the test set in all three cases. This is a significant increase, demonstrating the potential of this strategy to improve the robustness of the model to label noise. We are currently focused on further exploring this method.
In conclusion, this paper presents an effective method for the joint estimation of class probabilities and classification uncertainty in the context of chest radiograph assessment. Extensive experiments on two large datasets demonstrate a significant accuracy increase if sample rejection is performed based on the estimated uncertainty measure. In addition, we highlight the capacity of the system to distinguish radiographs with correct and incorrect labels according to a multi-radiologist-consensus user study, using the uncertainty measure only.
The authors thank the National Cancer Institute for access to NCI’s data collected by the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. The statements contained herein are solely those of the authors and do not represent or imply concurrence or endorsement by NCI.
Disclaimer: The concepts and information presented in this paper are based on research results that are not commercially available.
- [1] Dempster, A.P.: A generalization of Bayesian inference. Journal of the Royal Statistical Society: Series B (Methodological) 30(2), 205–232 (1968)
- [2] Gohagan, J.K., Prorok, P.C., Hayes, R.B., Kramer, B.S.: The prostate, lung, colorectal and ovarian (PLCO) cancer screening trial of the National Cancer Institute: History, organization, and status. Controlled Clinical Trials 21(6), 251–272 (2000)
- [3] Guan, Q., Huang, Y., Zhong, Z., Zheng, Z., Zheng, L., Yang, Y.: Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification. arXiv 1801.09927 (2018)
- [4] Guendel, S., Grbic, S., Georgescu, B., Zhou, K., Ritschl, L., Meier, A., Comaniciu, D.: Learning to recognize abnormalities in chest X-rays with location-aware dense networks. arXiv 1803.04565 (2018)
- [5] Huang, G., Liu, Z., v. d. Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR. pp. 2261–2269 (2017)
- [6] Jøsang, A.: Subjective Logic: A Formalism for Reasoning Under Uncertainty. Springer, 1st edn. (2016)
- [7] Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: NIPS. pp. 6402–6413 (2017)
- [8] Molchanov, D., Ashukha, A., Vetrov, D.: Variational dropout sparsifies deep neural networks. In: ICML. pp. 2498–2507 (2017)
- [9] Rajpurkar, P., Irvin, J., Ball, R.L., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A., Langlotz, C.P., et al.: Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Medicine 15(11) (2018)
- [10] Sensoy, M., Kaplan, L., Kandemir, M.: Evidential deep learning to quantify classification uncertainty. In: NIPS. pp. 3179–3189 (2018)
- [11] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. JMLR 15(1), 1929–1958 (2014)
- [12] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.: ChestX-Ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: CVPR. pp. 3462–3471 (2017)
- [13] Yao, L., Prosky, J., Poblenz, E., Covington, B., Lyman, K.: Weakly supervised medical diagnosis and localization from multiple resolutions. arXiv 1803.07703 (2018)