1 Introduction
While there is interest in using convolutional neural networks (CNNs) to identify findings in medical imaging
(Rajpurkar2017nh; Irvin2019lc; Johnson2019as; Wang2017py; Wang2018ry; Zech2018tg), some researchers have questioned the reliability (Sculley2015da; Papernot2016lz; Rahimi2017xb; Finlayson2019ed; Forde2019fv) and reproducibility (Vaswani2017iu; Melis2018mv; Lucic2018bp; Riquelme2018cn; Henderson2018yx) of deep learning methods. Often, researchers in medical imaging evaluate a single model’s predictions to measure performance
(gale; mabal; Rajpurkar2017nh; tienet). However, the loss surface of a CNN is nonconvex, and differences in training such as random seed (Henderson2018yx) and optimization method (Wilson2017ca; Choi2019od) can affect the learned model weights and, consequently, the predictions of the model. Within the machine learning community, there are concerns that evaluation of a single trained model does not provide a sufficient measurement of its variability
(Henderson2018yx; Forde2019fv). To mitigate this variability, some researchers in medical imaging have used cross-validation (thrun; Chang2018ch), which varies the data used for both training and testing, but still uses a single model to make predictions during each cross-validation run. Others have used ensembling (Sollich1996wj), combining predictions from 3 (titano), 5 (wu; mura), 10 (Gulshan2016mb; Rajpurkar2018vn; Pan2019ux), and 30 (Irvin2019lc) different trained models to optimize classification performance. While ensembles have been proposed as a simple method for estimating the uncertainty of the predictions of a CNN
(Lakshminarayanan2017en; Ovadia2019qr), such variability measurements, as seen in wu, are not common in medical imaging research.
If models generate different predictions when they are retrained, they may make inconsistent predictions for the same patient. The AUC of a single trained model will not give a direct indication of this inconsistency. AUC itself will likely vary between retrained models, which could complicate efforts to compare CNN performance to other models or human experts using statistical testing (SilvaAycaguer2010fu). Researchers have attempted to quantify the uncertainty of predictions for individual patients by statistical estimation (Schulam2019ui) and direct prediction (Raghu2019uk). Raghu2019uk proposed a machine learning method that identifies which retinal fundus photographs (Gulshan2016mb) would be likely to have human expert disagreement in diagnosing diabetic retinopathy; such a model could be used to identify high-uncertainty cases likely to benefit from a second opinion. Dusenberry2019xh examined the variability of RNN-based mortality prediction using the medical records of ICU patients in MIMIC-III (Johnson2019as) and recommended the use of Bayesian RNNs (Fortunato2017fo) with stochastic embedding over ensembling as a way to estimate the variability of predictions in clinical time-series data.
In this study, we explicitly characterized the variability in individual predicted findings and overall AUC of a CNN that was trained multiple times to predict findings on chest radiographs. Like Dusenberry2019xh, we found notable variability of predictions on individual patients with similar aggregate performance metrics. Because many real-world clinical decision support systems rely on single values of predicted probability rather than statistical distributions incorporating uncertainty, we focused our analysis on the use of ensembling for the purposes of robust prediction. We found that, in the case of chest radiographs, simple ensembling can reduce the variability of these probability estimates; ensembles of as few as ten models were found to reduce the variability of predictions by approximately 70% (mean coefficient of variation from 0.543 to 0.169, t-test 15.96, p-value < 0.0001).
2 Methods
Using an open source implementation
(Zech2018gf), we replicated the model described in Rajpurkar2017nh 50 times, varying the random seed with each fine-tuning (Lakshminarayanan2017en). Per Rajpurkar2017nh, a DenseNet-121 (Huang2017hi) pretrained on ImageNet
(Russakovsky2015ib) was fine-tuned to identify 14 findings in the NIH chest radiography dataset (n=112,120) (Wang2017py). The dataset was partitioned into 70% train, 10% tune, and 20% test data (train n=78,468, tune n=11,219, test n=22,433). These 50 models were fine-tuned using SGD with identical hyperparameters on the same train and tune datasets. Because the CNN was consistently initialized with parameters from a DenseNet-121 pretrained on ImageNet, the only difference in the training procedure across model runs was the order in which data was batched and presented to the model during each epoch of fine-tuning. Each model’s performance was assessed on the full test partition (n=22,433). To replicate the test set used for labeling by radiologists (Rajpurkar2017nh), a smaller test partition (n=792) was created by randomly sampling 100 normal radiographs and 50 positive examples for each finding except for hernia (n=42 in the test set).
Table 1
Metric  Symbol  Description (for each finding, radiograph pair)
Mean  $\mu$  Average predicted probability
Stdev.  $\sigma$  Standard deviation of predicted probability
Coefficient of variation  $\sigma/\mu$  Standard deviation divided by mean predicted probability
Log prob. ratio  $\ln(p_{\max}/p_{\min})$  Natural log of the highest predicted probability divided by the lowest predicted probability
Percentile (%ile) rank  $R(p)$  The percentile rank relative to all predictions for that finding among radiographs in the test set
%ile rank range  $R(p_{\max}) - R(p_{\min})$  The percentile rank of the highest probability predicted minus the percentile rank of the lowest probability predicted
We calculated several summary statistics for each finding on each radiograph across our 50 models in the full test set (n=22,433). Table 1 describes each metric in detail. In addition to mean, standard deviation, and coefficient of variation, we calculated $\ln(p_{\max}/p_{\min})$, where $p_{\max}$ is the greatest and $p_{\min}$ the least probability predicted by the 50 trained models for a given finding on a given radiograph. This log ratio provides a scaled measurement of the variability of the predicted probability of a finding on a radiograph. To contextualize predictions within the population of predictions for that finding, we calculated the percentile rank range of the predictions, $R(p_{\max}) - R(p_{\min})$, where $R(p)$ is the percentile rank of a prediction $p$ relative to all predictions for that finding in the test set.
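The per-radiograph metrics described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study: the function names, the use of the sample standard deviation, and the percentile-rank convention (share of predictions less than or equal to a value) are all assumptions, and the inputs are toy numbers.

```python
import math
import statistics
from bisect import bisect_right


def percentile_rank(value, sorted_population):
    """Percent of population values <= value (one common convention; an assumption)."""
    return 100.0 * bisect_right(sorted_population, value) / len(sorted_population)


def variability_metrics(probs, population):
    """Summarize one (finding, radiograph) pair.

    probs: predicted probabilities from each retrained model.
    population: all predictions for the same finding in the test set.
    """
    mean = statistics.fmean(probs)
    stdev = statistics.stdev(probs)  # sample standard deviation (an assumption)
    p_max, p_min = max(probs), min(probs)
    ranked = sorted(population)
    return {
        "mean": mean,
        "stdev": stdev,
        "coef_var": stdev / mean,
        "log_prob_ratio": math.log(p_max / p_min),
        "pct_rank_range": percentile_rank(p_max, ranked) - percentile_rank(p_min, ranked),
    }


# Toy inputs, not data from the study.
probs = [0.02, 0.04, 0.08]
population = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
metrics = variability_metrics(probs, population)
```

In this toy case the highest prediction is four times the lowest, so the log probability ratio is ln 4, and the two extremes sit 60 percentile points apart within the toy population.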
To evaluate the effectiveness of ensembling in reducing variance, we averaged predictions over disjoint groups of 10 models to yield 5 separate averaged predictions for each finding for each radiograph (n=5 groups × 10 models per group = 50 total models). We reported the standard deviation and coefficient of variation across these averaged groups. A paired t-test was used to compare coefficients of variation across the raw (n=50) and averaged (n=5) predictions.
We examined the variance in overall AUC for the 14 possible targets. For each of the 50 models, AUC was calculated on both the full and limited test sets using the pROC package in R (Robin2011lg; Ihaka1996kl). This calculation provided an empirical distribution of the test AUC relative to the order of samples in the training data. For both the full and limited test sets, we calculated the width of the empirical 95% confidence interval of this distribution by subtracting the second smallest AUC from the second largest of the 50 AUCs. For the limited test set, the average width of the 95% confidence interval was also estimated using DeLong’s method (DeLong1988je) and bootstrapping (Carpenter2000xn). DeLong’s method expresses AUC in terms of the Mann-Whitney U statistic (Mann1947kd), a nonparametric test statistic that is approximately normally distributed for large sample sizes, and thus can be used to calculate confidence intervals.
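The grouping, comparison, and interval-width steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the analysis code from the study (which used R): all function names are hypothetical, groups are formed from consecutive models, and the toy predictions are drawn from a seeded uniform distribution purely for illustration.

```python
import math
import random
import statistics


def disjoint_group_means(preds, group_size):
    """Average consecutive, non-overlapping groups of model predictions."""
    if len(preds) % group_size != 0:
        raise ValueError("number of models must be a multiple of group_size")
    return [
        statistics.fmean(preds[i:i + group_size])
        for i in range(0, len(preds), group_size)
    ]


def coef_var(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.fmean(values)


def paired_t_statistic(a, b):
    """t statistic of a paired t-test on the element-wise differences a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / standard_error


def empirical_interval_width(aucs):
    """Second-largest minus second-smallest value, as described in the text."""
    ordered = sorted(aucs)
    return ordered[-2] - ordered[1]


# Toy predictions for one (finding, radiograph) pair from 50 hypothetical models.
rng = random.Random(0)
preds = [rng.uniform(0.02, 0.20) for _ in range(50)]
group_means = disjoint_group_means(preds, group_size=10)

cv_raw = coef_var(preds)             # variability across 50 single models
cv_averaged = coef_var(group_means)  # variability across 5 averaged groups
```

In a full analysis, `coef_var` would be computed for every (finding, radiograph) pair, and `paired_t_statistic` would then compare the raw and averaged coefficients of variation across those pairs.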
3 Results
The variability in predictions for a given radiograph was substantial across models. An example radiograph classification is shown in Figure 1. Figure 1 compares the variability across models (n=50) in predicted probability of pneumonia for this radiograph to the variability of predictions for pneumonia in the full test set (n=22,433 cases × 50 models = 1,121,650 total predictions). In this example, the percentile rank range was 95.3% - 48.2% = 47.1%. The variability of each finding for this radiograph relative to all predictions for a given finding is shown in Figure 4.
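Table 2: Average across individual models (n=50) and across averaged groups of 10 models (n=5)
Finding  Mean (n=50)  Stdev. (n=50)  Coefficient of variation (n=50)  Log prob. ratio (n=50)  %ile rank range (n=50)  Stdev. (averaged, n=5)  Coefficient of variation (averaged, n=5)  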
Atelectasis  0.107  0.034  0.449  2.085  0.360  0.011  0.142  
Cardiomegaly  0.030  0.014  0.686  2.993  0.404  0.004  0.211  
Consolidation  0.041  0.014  0.439  2.046  0.368  0.004  0.133  
Edema  0.022  0.009  0.654  2.921  0.378  0.003  0.205  
Effusion  0.128  0.033  0.523  2.415  0.309  0.010  0.163  
Emphysema  0.023  0.010  0.703  3.033  0.479  0.003  0.219  
Nodule  0.056  0.021  0.444  2.029  0.493  0.007  0.140  
Pneumonia  0.012  0.004  0.403  1.867  0.451  0.001  0.126  
Fibrosis  0.016  0.007  0.531  2.435  0.446  0.002  0.171  
Hernia  0.002  0.001  0.608  2.784  0.494  0.0004  0.185  
Infiltration  0.172  0.042  0.299  1.401  0.425  0.013  0.091  
Mass  0.051  0.022  0.624  2.765  0.493  0.007  0.199  
Pleural Thickening  0.029  0.012  0.515  2.367  0.457  0.004  0.162  
Pneumothorax  0.046  0.022  0.723  3.196  0.465  0.007  0.227 
In the full test set (n=22,433), the mean coefficient of variation for an individual radiograph over 50 retrainings was 0.543, and the mean log probability ratio was 2.45 (Table 2, Figure 2); for a model with unvarying predictions, the log probability ratio would equal zero. The radiographs had a mean percentile rank range of 43.0%. In other words, the average difference between the percentile rank of a radiograph’s highest prediction, relative to all predictions for that finding in the test set, and the radiograph’s lowest prediction of that finding, was 43.0%, nearly half the available range.
Averaging model predictions significantly reduced the mean coefficient of variation from 0.543 to 0.169 (t-test 15.96, p-value < 0.0001). The distribution over AUC across models showed a degree of variability in both the full and limited test sets (Figure 3, Table 3). In the limited test set, the empirical variability in predictions did not exceed the average DeLong or bootstrap confidence interval for each model (Table 3). The DeLong and bootstrap 95% confidence intervals for AUC contained the mean AUC across models in 99.7% of cases (n=698/700).
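As a rough arithmetic check on the reported reduction, the snippet below computes the relative change in the mean coefficient of variation and compares it with an idealized reference. The reference is an assumption that is not part of the study: if the 10 averaged predictions were independent with equal variance, averaging would shrink the standard deviation (and, with the mean unchanged, the coefficient of variation) by a factor of the square root of 10.

```python
import math

cv_single = 0.543    # mean coefficient of variation, individual models (from the text)
cv_ensemble = 0.169  # mean coefficient of variation, averages of 10 models (from the text)
ensemble_size = 10

# Relative reduction in the mean coefficient of variation.
relative_reduction = 1 - cv_ensemble / cv_single

# Observed shrink factor versus the idealized factor for independent,
# equal-variance predictions (an assumption made only for comparison).
observed_ratio = cv_single / cv_ensemble
independent_ratio = math.sqrt(ensemble_size)
```

The relative reduction comes to about 0.69, and the observed shrink factor of about 3.21 is close to the idealized value of about 3.16.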

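Table 3: AUC across models (n=50) in the full test set (n=22,433) and limited test set (n=792)
Finding  Mean AUC (full)  Empirical 95% CI width (full)  Mean AUC (limited)  Empirical 95% CI width (limited)  DeLong 95% CI width (limited)  Bootstrap 95% CI width (limited)  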
Atelectasis  0.817  0.010  0.796  0.029  0.077  0.077  
Cardiomegaly  0.906  0.012  0.878  0.037  0.083  0.082  
Consolidation  0.802  0.010  0.736  0.030  0.097  0.097  
Edema  0.894  0.010  0.879  0.025  0.072  0.071  
Effusion  0.883  0.004  0.829  0.018  0.065  0.065  
Emphysema  0.923  0.012  0.910  0.028  0.067  0.066  
Nodule  0.772  0.011  0.681  0.047  0.111  0.109  
Pneumonia  0.760  0.023  0.715  0.054  0.137  0.136  
Fibrosis  0.827  0.017  0.836  0.033  0.100  0.100  
Hernia  0.911  0.067  0.897  0.082  0.108  0.105  
Infiltration  0.712  0.009  0.650  0.030  0.087  0.087  
Mass  0.837  0.015  0.766  0.040  0.103  0.103  
Pleural Thickening  0.784  0.016  0.725  0.051  0.112  0.111  
Pneumothorax  0.873  0.012  0.860  0.031  0.077  0.077 
4 Discussion
We found substantial variation in the predicted probabilities of findings when varying the sampling of batches in the training set (mean coefficient of variation across all findings 0.543, mean log probability ratio of 2.45; Figure 2, Table 2). We highlighted a case that demonstrated how predicted probabilities could vary across models (Figure 4), shifting its estimated risk relative to the test set population based on the random seed used to train the model. The average case had a percentile rank range of 43.0% between its highest and lowest estimated probability of disease across all 50 models.
We found that there was variability across models in AUC for all findings. The overall AUC for each finding in the full test set of over 20,000 cases was much more stable than the substantial variability in predictions for individual radiographs. As explained by delongarticle, “the area under the population ROC curve represents the probability that, when the variable is observed for a randomly selected individual from the abnormal population and a randomly selected individual from the normal population, the resulting values will be in the correct order (e.g., abnormal value higher than the normal value).” In our case, AUC represents the probability that a randomly selected radiograph that is ground-truth positive for pathology will be assigned a higher score by the CNN than a randomly selected radiograph that is ground-truth negative for pathology. Calculating AUC is thus identical to estimating $p$ for a Bernoulli random variable (i.e., a weighted coin flip) by repeatedly sampling from this distribution and calculating the average $\hat{p}$ over all draws. As our sample size grows larger and larger, our uncertainty interval over the true value of $p$ (and, equivalently, our uncertainty over AUC) grows progressively narrower. AUC can thus be relatively consistent across CNNs that make variable radiograph-level predictions, provided that these variable predictions are similar overall in their ability to classify positive and negative cases.
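The pairwise interpretation of AUC quoted above can be illustrated with a short sketch. It is not the pROC implementation used in the study; the function name is hypothetical, ties are counted as one half (a common convention, assumed here), and the scores are toy values for two hypothetical models.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Two hypothetical models: very different radiograph-level scores,
# but the same ordering of positive and negative cases.
pos_a, neg_a = [0.9, 0.6, 0.4], [0.5, 0.3, 0.1]
pos_b, neg_b = [0.30, 0.12, 0.08], [0.10, 0.05, 0.01]

auc_a = pairwise_auc(pos_a, neg_a)
auc_b = pairwise_auc(pos_b, neg_b)
```

Both toy models order 8 of the 9 positive-negative pairs correctly, so their AUCs are equal even though no individual score matches between them. This is the sense in which a stable AUC can coexist with variable predictions for individual radiographs.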
As expected, the variability in AUC was wider in the limited test set than in the full test set. We compared realized variability across models to 95% confidence intervals estimated by two commonly used methods, DeLong and bootstrapping, on the limited test set and found that the realized variability did not exceed these estimated bounds. We note that this comparison is limited and not fully powered; we used the sample mean instead of the unknown population mean, limiting our ability to detect true differences. Nevertheless, it provides evidence that variability in AUC does not grossly exceed the estimates of common statistical tests, and that these tests can be used to compare the performance of different CNNs, provided researchers are aware of their variability.
Stability in AUC across trained models can mask the wide variation in predictions for a single radiograph, and should not reassure researchers that predictions will remain consistent. In our experiments, each DenseNet-121 (Huang2017hi) model was initialized with the same pretrained weights from ImageNet (Russakovsky2015ib) and trained with the same train/tune/test data, optimizer, and hyperparameters. From this consistent configuration, we fine-tuned each model on the NIH chest radiograph dataset (Wang2017py), varying only the order in which training data was batched and presented to the model. The substantial variability we observed in predictions for individual radiographs might have been even wider had we allowed the model’s initialization parameters, choice of optimizer, or hyperparameters to vary (Wilson2017ca; Choi2019od). Raghu2019js suggested that pretraining may not be necessary to achieve competitive performance on medical imaging tasks. Our results raise the question of whether the absence of pretraining may induce additional variability in predictions.
In the context of healthcare, it is particularly important to remain aware of the variability in individual predictions. If deep learning-based decision support systems are deployed in clinical settings, their predictions will alter the diagnoses and treatments given to some patients. Justice, beneficence, and respect for persons are the three ethical principles proposed by the Belmont Report (belmont), which guides discussion of ethical considerations in medical research. An algorithm that treats identical patients differently challenges the value of justice and potentially leaves the care of patients up to a multidimensional coin flip. At the same time, radiologists are also far from perfectly consistent (Bruno2015cs). Rajpurkar2017nh observed relatively low inter-rater agreement between the radiologists who contributed the expert labels for pneumonia (0.387 average F1 score comparing each individual radiologist to the majority vote of three other radiologists). Similarly, a study of radiologists at Massachusetts General Hospital found 30% disagreement between colleagues’ interpretations of abdominopelvic CTs and 25% disagreement for the same radiologist viewing the CT at different times (Abujudeh2010re; Bruno2015cs). Machine learning algorithms may offer an opportunity to improve the consistency of medical decisions, but only if we are attentive to the inconsistency of which they, too, are capable.
Straightforward workarounds, such as averaging predictions across models (Sollich1996wj; Lakshminarayanan2017en), can substantially mitigate the effect of this individual-level variability (coefficient of variation reduced from 0.543 to 0.169, p-value < 0.0001). Reducing the variability in individual predictions is also likely to improve performance metrics such as AUC; ensembling of CNN predictions has been successfully demonstrated in the medical imaging literature (titano; wu; mura; Gulshan2016mb; Rajpurkar2018vn; Irvin2019lc; Pan2019ux), primarily to optimize model performance. To our knowledge, no prior work examines how variability in the predictions of CNNs for radiologic imaging may translate to the care of individual patients. We encourage researchers to be vigilant about the variability of deep learning models and to provide some measure of how consistently their final (possibly ensembled) model performs in predicting findings for individual patients, to assure readers that end users of the model will not be fooled by randomness.
Acknowledgments
We would like to thank the Internal Medicine residency program at California Pacific Medical Center (San Francisco, CA) for giving one of the authors (J.Z.) dedicated time to work on this research while he was a preliminary medicine resident.