1 Introduction
The interpretation of chest radiographs is an essential task in the practice of a radiologist, enabling the early detection of thoracic diseases [9, 12]. To accelerate and improve the assessment of the continuously increasing number of radiographs, several deep learning solutions have been recently proposed for the automatic classification of radiographic findings [12, 4, 13]. Due to large variations in image quality or subjective definitions of disease appearance, there is a large inter-rater variability, which leads to a high degree of label noise [9]. Modeling this variability when designing an automatic system for assessing this type of data is essential, an aspect that was not considered in previous work.
Using principles of information theory and subjective logic [6] based on the Dempster-Shafer framework for modeling of evidence [1], we present a method for training a system that generates both an image-level label and a classification uncertainty measure. We evaluate this system for classification of abnormalities on chest radiographs. The main contributions of this paper include:

describing a system for jointly learning classification probabilities and classification uncertainty in a parametric model;

proposing uncertainty-driven bootstrapping as a means to filter training samples with highest predictive uncertainty to improve robustness and accuracy;

comparing methods for generating stochastic classifications to model classification uncertainty;

presenting an application of this system to identify cases with uncertain classification, yielding more accurate classification on the remaining cases;

showing that the uncertainty measure can distinguish radiographs with correct and incorrect labels according to a multi-radiologist consensus study.
2 Background and Motivation
2.1 Machine Learning for the Assessment of Chest Radiographs
The open access to the ChestX-Ray8 dataset [12] of chest radiographs has led to a series of recent publications that propose machine learning based systems for disease classification. With this dataset, Wang et al. [12] also report a first performance baseline of a deep neural network at an average area under the receiver operating characteristic curve (ROC-AUC) of 0.75. These results have been further improved by using multi-scale image analysis [13], or by actively focusing the attention of the network on the most relevant sub-regions of the lungs [3]. State-of-the-art results on the official split of the ChestX-Ray8 dataset are reported in [4] (avg. ROC-AUC of 0.81), using a location-aware dense neural network. In light of these contributions, a recent study compares the performance of such an AI system and 9 practicing radiologists [9]. While the study indicates that the system can surpass human performance, it also highlights the high variability among different expert radiologists for the reading of chest radiographs. The reported average specificity of the readers is very high (over 95%), with an average sensitivity of 50% ± 8%. With such a large inter-rater variability, one may ask: How can real 'ground truth' data be obtained? Does the label noise affect the training? Current solutions do not consider this variability, which leads to models with overconfident predictions and limited generalization.

Principles of Uncertainty Estimation: One way to handle this challenge is to explicitly estimate the classification uncertainty from the data. Recent methods for uncertainty estimation in the context of deep learning rely on Bayesian estimation theory [8] or ensembles [7] and demonstrate increased robustness to out-of-distribution data. However, these approaches come with significant computational limitations, associated either with the high complexity of sampling parameter spaces of deep models for Bayesian risk estimation, or with the challenge of managing ensembles of deep models. Sensoy et al. [10] propose an efficient alternative based on the theory of subjective logic [6], training a deep neural network to estimate the sample uncertainty based on observed data.
3 Proposed Method
Following the work of Sensoy et al. [10] based on the Dempster-Shafer theory of evidence [1], we apply principles of subjective logic [6] to derive a binary classification model that supports the joint estimation of per-class probabilities ($\hat{p}_+$, $\hat{p}_-$) and predictive uncertainty $\hat{u}$. In this context, a decisional framework is defined through the assignment of so-called belief masses, derived from evidence collected from observed data, to individual attributes, e.g., membership to a class [1, 6]. Let $b_+$ and $b_-$ denote the belief values for the positive and negative class, respectively. The uncertainty mass $u$ is defined as $u = 1 - b_+ - b_-$, where $b_+ = e_+/E$ and $b_- = e_-/E$, with $e_+, e_- \geq 0$ denoting the per-class collected evidence and $E = e_+ + e_- + 2$ the total evidence. For binary classification, we propose to model the distribution of such evidence values using the beta distribution, defined by two parameters $\alpha$ and $\beta$ as $f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$, where $\Gamma$ denotes the gamma function and $\alpha, \beta \geq 1$ with $\alpha = e_+ + 1$ and $\beta = e_- + 1$. In this context, the per-class probabilities can be derived as $p_+ = \alpha/E$ and $p_- = \beta/E$. Figure 1 visualizes the beta distribution for different $\alpha, \beta$ values.

A training dataset is provided: $\mathcal{D} = \{I_k, y_k\}_{k=1}^{N}$, composed of $N$ pairs of images $I_k$ with class assignment $y_k \in \{0, 1\}$. To estimate the per-class evidence values from the observed data, a deep neural network parametrized by $\theta$ can be applied, with $[e_+^{(k)}, e_-^{(k)}] = R(I_k; \theta)$, where $R$ denotes the network response function. Using maximum likelihood estimation, one can learn the network parameters $\hat{\theta}$ by optimizing the Bayes risk with a beta distributed prior:
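$\mathcal{L}_k^{data} = \int \|y_k - p_k\|^2 \, \frac{\Gamma(\alpha_k + \beta_k)}{\Gamma(\alpha_k)\Gamma(\beta_k)} \, p_k^{\alpha_k - 1} (1 - p_k)^{\beta_k - 1} \, dp_k$ (1)

where $k$ denotes the index of the training example from dataset $\mathcal{D}$, $p_k$ the predicted probability on the training sample $k$, and $\mathcal{L}_k^{data}$ defines the goodness of fit. The squared norm is taken over both classes, i.e., over the vectors $(y_k, 1 - y_k)$ and $(p_k, 1 - p_k)$. Using linearity properties of the expectation, Eq. 1 becomes:

$\mathcal{L}_k^{data} = (y_k - \hat{p}_k^+)^2 + (1 - y_k - \hat{p}_k^-)^2 + \frac{\hat{p}_k^+ (1 - \hat{p}_k^+) + \hat{p}_k^- (1 - \hat{p}_k^-)}{E_k + 1}$ (2)

where $\hat{p}_k^+$ and $\hat{p}_k^-$ denote the network's probabilistic prediction and $E_k = \alpha_k + \beta_k$ the total evidence for sample $k$. The first two terms measure the goodness of fit, and the last term encodes the variance of the prediction [10].

To ensure a high uncertainty value for data samples for which the gathered evidence is not conclusive for an accurate classification, an additional regularization term $\mathcal{L}_k^{reg}$ is added to the loss. Using information theory, this term is defined as the relative entropy, i.e., the Kullback-Leibler divergence, between the beta distributed prior term and the beta distribution with total uncertainty ($\alpha = \beta = 1$). In this way, cost deviations from the total uncertainty state, i.e., $u = 1$, which do not contribute to the data fit are accounted for [10]. With the additional term, the total cost becomes $\mathcal{L} = \sum_{k=1}^{N} \mathcal{L}_k$ with:

$\mathcal{L}_k = \mathcal{L}_k^{data} + \lambda \, \mathrm{KL}\left( f(\hat{p}_k; \tilde{\alpha}_k, \tilde{\beta}_k) \,\|\, f(\hat{p}_k; 1, 1) \right)$ (3)

where $\lambda$ is the regularization factor, $\hat{p}_k = \hat{p}_k^+$, with $(\tilde{\alpha}_k, \tilde{\beta}_k) = (1, \beta_k)$ for $y_k = 1$ and $(\tilde{\alpha}_k, \tilde{\beta}_k) = (\alpha_k, 1)$ for $y_k = 0$. Removing additive constants and using properties of the logarithm function, one can simplify the regularization term to the following:

$\mathcal{L}_k^{reg} = \log \frac{\Gamma(\tilde{\alpha}_k + \tilde{\beta}_k)}{\Gamma(\tilde{\alpha}_k)\Gamma(\tilde{\beta}_k)} + \sum_{x \in \{\tilde{\alpha}_k, \tilde{\beta}_k\}} (x - 1) \left( \psi(x) - \psi(\tilde{\alpha}_k + \tilde{\beta}_k) \right)$ (4)

where $\psi$ denotes the digamma function and $k \in \{1, \ldots, N\}$. Using stochastic gradient descent, the total loss $\mathcal{L}$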
is optimized on the training set $\mathcal{D}$.

Sampling the Data Distribution: An important requirement to ensure training stability and to learn a robust estimation of evidence values is an adequate sampling of the data distribution. We empirically found dropout [11] to be a simple and very effective strategy to address this problem. In practice, dropout emulates an ensemble model combination driven by the random deactivation of neurons. Alternatively, one may use an explicit ensemble of $M$ models $\{\theta_m\}_{m=1}^{M}$, each trained independently. Following the principles of deep ensembles [7], the per-class evidence can be computed from the ensemble estimates via averaging. In our work, we found dropout to be as effective as deep ensembles.

Uncertainty-driven Bootstrapping: Given the predictive uncertainty measure $\hat{u}$, we propose a simple and effective algorithm for filtering the training set with the target of reducing label noise. A fraction of training samples with highest uncertainty are eliminated and the model is retrained on the remaining data. Instead of sample elimination, robust M-estimators may be applied, using a per-sample weight that is inversely proportional to the predicted uncertainty. The hypothesis is that by focusing the training on 'confident' labels, we increase the robustness of the classifier and improve its performance on unseen data.
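As a concrete illustration of the training objective in Eqs. 2 to 4, the following is a minimal pure-Python sketch of the per-sample loss for scalar evidence values. It is not the authors' implementation. The function names, the series-based `digamma` helper, the default `lam = 1.0`, and the choice of which beta parameter is reset to 1 in the regularizer (here following the convention of [10], where evidence that agrees with the label is removed so that only misleading evidence is penalized) are assumptions made for illustration. The sketch also assumes the two-class relations $E = e_+ + e_- + 2$ and $u = 2/E$.

```python
import math


def digamma(x: float) -> float:
    """Digamma function for x > 0 (recurrence, then an asymptotic series)."""
    result = 0.0
    while x < 6.0:  # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def beliefs(e_pos: float, e_neg: float):
    """Return (p_pos, p_neg, u) for non-negative evidence values."""
    total = e_pos + e_neg + 2.0  # E = alpha + beta
    return (e_pos + 1.0) / total, (e_neg + 1.0) / total, 2.0 / total


def data_loss(e_pos: float, e_neg: float, y: int) -> float:
    """Goodness of fit plus prediction variance (Eq. 2)."""
    p_pos, p_neg, _ = beliefs(e_pos, e_neg)
    total = e_pos + e_neg + 2.0
    fit = (y - p_pos) ** 2 + (1 - y - p_neg) ** 2
    variance = (p_pos * (1.0 - p_pos) + p_neg * (1.0 - p_neg)) / (total + 1.0)
    return fit + variance


def kl_to_uniform(a: float, b: float) -> float:
    """KL divergence between Beta(a, b) and the uniform Beta(1, 1) (Eq. 4)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + sum((x - 1.0) * (digamma(x) - digamma(a + b)) for x in (a, b))


def sample_loss(e_pos: float, e_neg: float, y: int, lam: float = 1.0) -> float:
    """Total per-sample loss (Eq. 3); `lam` is the regularization factor."""
    alpha, beta = e_pos + 1.0, e_neg + 1.0
    # Assumption (convention of [10]): drop the evidence that agrees with the
    # label, so only misleading evidence is pulled toward total uncertainty.
    a_t, b_t = (1.0, beta) if y == 1 else (alpha, 1.0)
    return data_loss(e_pos, e_neg, y) + lam * kl_to_uniform(a_t, b_t)
```

For example, `beliefs(8.0, 0.0)` yields a positive-class probability of 0.9 with uncertainty 0.2, while zero evidence yields probabilities of 0.5 for both classes with the maximum uncertainty of 1. In an actual training setup the same expressions would be written with a tensor library so that gradients can flow back to the network parameters.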
4 Experiments
Dataset and Setup: The evaluation is based on two datasets, ChestX-Ray8 [12] and PLCO [2]. Both datasets provide a series of AP/PA chest radiographs with binary labels on the presence of different radiological findings, e.g., granuloma, pleural effusion, or consolidation. The ChestX-Ray8 dataset contains 112,120 images from 30,805 patients, covering 14 findings extracted from radiological reports using natural language processing (NLP) [12]. In contrast, the PLCO dataset was constructed as part of a screening trial, containing 185,421 images from 56,071 patients and covering 12 different abnormalities.

For performance comparison, we selected location-aware dense networks [4] as baseline. This method achieves state-of-the-art results on this problem, with a reported average ROC-AUC of 0.81 (significantly higher than that of competing methods: 0.75 [12] and 0.77 [13]) on the official split of the ChestX-Ray8 dataset and an ROC-AUC of 0.88 on the official split of the PLCO dataset. To evaluate our method, we identified testing subsets with higher confidence labels from multi-radiologist studies. For PLCO, we randomly selected 565 test images and had 2 board-certified expert radiologists read the images, updating the labels to the majority vote of the 3 opinions (incl. the original label). For ChestX-Ray8, a subset of 689 test images was randomly selected and read by 4 board-certified radiologists. The final label was decided following a consensus discussion. For both datasets, the remaining data was split into 90% training and 10% validation. All images were downsampled using bilinear interpolation.
System Training: We constructed our learning model from the DenseNet-121 architecture [5]. A dropout layer with a dropout rate of 0.5 was inserted after the last convolutional layer. We also investigated the benefits of using deep ensembles to improve the sampling ($M$ models trained on random subsets of 80% of the training data; we refer to this with the keyword [ens]). A fully connected layer with ReLU activation units maps to the two outputs $e_+$ and $e_-$. We used a systematic grid search to find the optimal configuration of training meta-parameters: learning rate, regularization factor $\lambda$ (decayed in two steps, after 1/3 and 2/3 of the epochs), number of training epochs (around 12, using an early-stopping strategy with a patience of 3 epochs) and batch size (128). The low number of epochs is explained by the large size of the dataset.
Uncertainty-driven Sample Rejection: Given a model trained for the assessment of an arbitrary finding, one can directly estimate the prediction uncertainty $\hat{u}$. This is an orthogonal measure to the predicted probability, with increased values on out-of-distribution cases under the given model. One can use this measure for sample rejection, i.e., set a threshold $u_t$ and steer the system to not output its prediction on all cases with an expected uncertainty larger than $u_t$. Instead, these cases are labeled with the message "Do not know for sure; process case manually". In practice this leads to a significant increase in accuracy compared to the state-of-the-art on the remaining cases, as reported in Table 1 and Figure 2. For example, for the identification of granuloma, a rejection rate of 25% leads to an increase of over 20% in the micro-average F1 score. On the same abnormality, a 50% rejection rate leads to an F1 score over 0.99 for the prediction of negative cases. We found no significant difference in average performance when using ensembles (see Figure 2).
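The rejection rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names, the threshold name `u_t`, and the way a threshold is derived from a target rejection rate (ranking the uncertainties and cutting at the corresponding position) are hypothetical choices.

```python
REJECT_MESSAGE = "Do not know for sure; process case manually"


def threshold_for_rejection_rate(uncertainties, rate):
    """Return a threshold u_t such that about `rate` of the cases exceed it."""
    ranked = sorted(uncertainties)
    # Small epsilon guards against floating-point error before truncation.
    n_keep = len(ranked) - int(rate * len(ranked) + 1e-9)
    if n_keep <= 0:
        return float("-inf")  # every case is rejected
    return ranked[n_keep - 1]  # largest uncertainty that is still accepted


def predict_or_reject(probabilities, uncertainties, u_t):
    """Keep the predicted probability when u <= u_t, else emit the message."""
    return [p if u <= u_t else REJECT_MESSAGE
            for p, u in zip(probabilities, uncertainties)]
```

Note that with tied uncertainty values the realized rejection rate can be lower than the requested one, since all cases at the threshold value are accepted. Performance metrics are then computed only on the cases that were not rejected.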
Table 1: ROC-AUC per finding; the bracketed percentage is the sample rejection rate.
Finding       Guendel et al. [4]   Ours [0%]   Ours [10%]   Ours [25%]   Ours [50%]
Granuloma     0.83                 0.85        0.87         0.90         0.92
Fibrosis      0.87                 0.88        0.90         0.92         0.94
Scarring      0.82                 0.81        0.84         0.89         0.93
Lesion        0.82                 0.83        0.86         0.88         0.90
Cardiac Ab.   0.93                 0.94        0.95         0.96         0.97
Average       0.85                 0.86        0.89         0.91         0.93
System versus Reader Uncertainty: To provide insight into the meaning of the uncertainty measure and its correlation with the difficulty of cases, we evaluated our system on the detection of pleural effusion (excess accumulation of fluid in the pleural cavity) based on the ChestX-Ray8 dataset. In particular, we analyzed the test set of 689 cases that were relabeled by a committee of 4 experts. We defined a so-called critical set that contains only cases for which the label (positive or negative) was changed after the expert re-evaluation. According to the committee, this set contained not only easy examples, for which the NLP algorithm probably failed to extract the correct labels from the radiographic report, but also difficult cases where either the image quality was limited or the evidence of effusion was very subtle. In Figure 3 (left), we empirically demonstrate that the uncertainty estimates of our algorithm correlate with the committee's decision to change the label. Specifically, for unchanged cases, our algorithm displayed very low uncertainty estimates (average 0.16) at an average AUC of 0.976 (rejection rate of 0%). In contrast, on cases in the critical set, the algorithm showed higher uncertainties distributed between 0.1 and the maximum value of 1 (average 0.41). This empirically demonstrates the ability of the algorithm to recognize the cases where annotation errors occurred in the first place (through NLP or human reader error). In Figure 3 (right) we show how cases of the critical set can be effectively filtered out using sample rejection. Qualitative examples are shown in Figure 4.
Uncertainty-driven Bootstrapping: We also investigated the benefit of using bootstrapping based on the uncertainty measure on the example of pleural effusion (ChestX-Ray8). We report performance as [AUC; F1-score (pos. class); F1-score (neg. class)]. After training our method, we measured the baseline performance on the test set. We then eliminated 5%, 10% and 15% of the training samples with highest uncertainty, and retrained in each case on the remaining data. The metrics improved on the test set. This is a significant increase, demonstrating the potential of this strategy to improve the robustness of the model to label noise. We are currently focused on further exploring this method.
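The bootstrapping procedure (train, score every training sample, drop the most uncertain fraction, retrain) can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: `train_fn` and `uncertainty_fn` are hypothetical callables standing in for model training and for the per-sample uncertainty estimate, and `inverse_uncertainty_weights` shows one possible form of the weighting alternative to elimination, with `eps` an assumed constant that avoids division by zero.

```python
def filter_most_uncertain(samples, uncertainties, fraction):
    """Drop the `fraction` of samples with the highest uncertainty."""
    # Small epsilon guards against floating-point error before truncation.
    n_drop = int(fraction * len(samples) + 1e-9)
    by_uncertainty = sorted(range(len(samples)), key=lambda i: uncertainties[i])
    kept = sorted(by_uncertainty[: len(samples) - n_drop])  # keep input order
    return [samples[i] for i in kept]


def inverse_uncertainty_weights(uncertainties, eps=1e-6):
    """Per-sample weights inversely proportional to uncertainty, mean 1."""
    raw = [1.0 / (u + eps) for u in uncertainties]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def bootstrap(samples, train_fn, uncertainty_fn, fraction):
    """Train, score every training sample, filter, then retrain."""
    model = train_fn(samples)
    scores = [uncertainty_fn(model, s) for s in samples]
    kept = filter_most_uncertain(samples, scores, fraction)
    return train_fn(kept), kept
```

Calling `bootstrap` with `fraction` set to 0.05, 0.10 and 0.15 would correspond to the three elimination settings evaluated above.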
5 Conclusion
In conclusion, this paper presents an effective method for the joint estimation of class probabilities and classification uncertainty in the context of chest radiograph assessment. Extensive experiments on two large datasets demonstrate a significant accuracy increase if sample rejection is performed based on the estimated uncertainty measure. In addition, we highlight the capacity of the system to distinguish radiographs with correct and incorrect labels according to a multi-radiologist consensus user study, using the uncertainty measure only.
The authors thank the National Cancer Institute for access to NCI’s data collected by the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. The statements contained herein are solely those of the authors and do not represent or imply concurrence or endorsement by NCI.
Disclaimer: The concepts and information presented in this paper are based on research results that are not commercially available.
References

[1] Dempster, A.P.: A generalization of Bayesian inference. Journal of the Royal Statistical Society: Series B (Methodological) 30(2), 205–232 (1968)
[2] Gohagan, J.K., Prorok, P.C., Hayes, R.B., Kramer, B.S.: The prostate, lung, colorectal and ovarian (PLCO) cancer screening trial of the National Cancer Institute: History, organization, and status. Controlled Clinical Trials 21(6), 251–272 (2000)
[3] Guan, Q., Huang, Y., Zhong, Z., Zheng, Z., Zheng, L., Yang, Y.: Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification. arXiv 1801.09927 (2018)
[4] Guendel, S., Grbic, S., Georgescu, B., Zhou, K., Ritschl, L., Meier, A., Comaniciu, D.: Learning to recognize abnormalities in chest X-rays with location-aware dense networks. arXiv 1803.04565 (2018)
[5] Huang, G., Liu, Z., v. d. Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR. pp. 2261–2269 (2017)
[6] Jøsang, A.: Subjective Logic: A Formalism for Reasoning Under Uncertainty. Springer, 1st edn. (2016)
[7] Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: NIPS. pp. 6402–6413 (2017)
[8] Molchanov, D., Ashukha, A., Vetrov, D.: Variational dropout sparsifies deep neural networks. In: ICML. pp. 2498–2507 (2017)
[9] Rajpurkar, P., Irvin, J., Ball, R.L., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A., Langlotz, C.P., et al.: Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Medicine 15(11) (2018)
[10] Sensoy, M., Kaplan, L., Kandemir, M.: Evidential deep learning to quantify classification uncertainty. In: NIPS. pp. 3179–3189 (2018)
[11] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. JMLR 15(1), 1929–1958 (2014)
[12] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.: ChestX-Ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: CVPR. pp. 3462–3471 (2017)
[13] Yao, L., Prosky, J., Poblenz, E., Covington, B., Lyman, K.: Weakly supervised medical diagnosis and localization from multiple resolutions. arXiv 1803.07703 (2018)