While there is interest in using convolutional neural networks (CNNs) to identify findings in medical imaging (Rajpurkar2017-nh; Irvin2019-lc; Johnson2019-as; Wang2017-py; Wang2018-ry; Zech2018-tg), some researchers have questioned the reliability (Sculley2015-da; Papernot2016-lz; Rahimi2017-xb; Finlayson2019-ed; Forde2019-fv) and reproducibility (Vaswani2017-iu; Melis2018-mv; Lucic2018-bp; Riquelme2018-cn; Henderson2018-yx) of deep learning methods. Often, researchers in medical imaging evaluate a single model’s predictions to measure performance (gale; mabal; Rajpurkar2017-nh; tienet). However, the loss surface of a CNN is non-convex, and differences in training, such as the random seed (Henderson2018-yx) and optimization method (Wilson2017-ca; Choi2019-od),
can affect the learned model weights and, consequently, the predictions of the model. Within the machine learning community, there are concerns that evaluation of a single trained model does not provide sufficient measurement of its variability(Henderson2018-yx; Forde2019-fv). To mitigate this variability, some researchers in medical imaging have used cross-validation (thrun; Chang2018-ch), which varies the data used for both training and testing, but still uses a single model to make predictions during each cross-validation run. Others have used ensembling (Sollich1996-wj), combining predictions from 3 (titano), 5 (wu; mura), 10 (Gulshan2016-mb; Rajpurkar2018-vn; Pan2019-ux), and 30 (Irvin2019-lc)
different trained models to optimize classification performance. While ensembles have been proposed as a simple method for estimating the uncertainty of the predictions of a CNN (Lakshminarayanan2017-en; Ovadia2019-qr), such variability measurements, as seen in (wu), are not common in medical imaging research.
If models generate different predictions when they are retrained, they may make inconsistent predictions for the same patient. The AUC of a single trained model will not give a direct indication of this inconsistency. AUC itself will likely vary between retrained models, which could complicate efforts to compare CNN performance to other models or human experts using statistical testing (Silva-Aycaguer2010-fu). Researchers have attempted to quantify the uncertainty of predictions for individual patients by statistical estimation (Schulam2019-ui) and direct prediction (Raghu2019-uk). Raghu2019-uk proposed a machine learning method that identifies retinal fundus photographs (Gulshan2016-mb) likely to elicit disagreement among human experts diagnosing diabetic retinopathy; such a model could be used to identify high-uncertainty cases likely to benefit from a second opinion. Dusenberry2019-xh examined the variability of RNN-based mortality prediction using the medical records of ICU patients in MIMIC-III (Johnson2019-as) and recommended the use of Bayesian RNNs (Fortunato2017-fo) with stochastic embedding over ensembling as a way to estimate the variability of predictions in clinical time-series data.
In this study, we explicitly characterized the variability in individual predicted findings and overall AUC of a CNN that was trained multiple times to predict findings on chest radiographs. Like Dusenberry2019-xh, we found notable variability of predictions on individual patients despite similar aggregate performance metrics. Because many real-world clinical decision support systems rely on single values of predicted probability rather than statistical distributions incorporating uncertainty, we focused our analysis on the use of ensembling for the purposes of robust prediction. We found that, in the case of chest radiographs, simple ensembling can reduce the variability of these probability estimates; ensembles of as few as ten models reduced the variability of predictions by 70% (mean coefficient of variation reduced from 0.543 to 0.169; paired t-test, t = 15.96, p < 0.0001).
Using an open-source implementation (Zech2018-gf), we replicated the model described in Rajpurkar2017-nh 50 times, varying the random seed with each fine-tuning run (Lakshminarayanan2017-en). Per Rajpurkar2017-nh, a DenseNet-121 (Huang2017-hi) pre-trained on ImageNet (Russakovsky2015-ib) was fine-tuned to identify 14 findings in the NIH chest radiography dataset (n=112,120) (Wang2017-py). The dataset was partitioned into 70% train, 10% tune, and 20% test data (train n=78,468, tune n=11,219, test n=22,433). These 50 models were fine-tuned using SGD with identical hyperparameters on the same train and tune datasets. Because the CNN was consistently initialized with parameters from a DenseNet-121 pre-trained on ImageNet, the only difference in the training procedure across model runs was the order in which data was batched and presented to the model during each epoch of fine-tuning. Each model’s performance was assessed on the full test partition (n=22,433). To replicate the test set used for labeling by radiologists (Rajpurkar2017-nh), a smaller test partition (n=792) was created by randomly sampling 100 normal radiographs and 50 positive examples for each finding except for hernia (n=42 in the test set).
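The only difference across runs, the seed-dependent batch order, can be sketched as follows. This is a minimal illustration; the batch size and the use of Python’s `random` module are assumptions, not details of the original implementation.

```python
import random

# Sketch of the only source of run-to-run variation described above: each of
# the 50 fine-tuning runs sees the same training data in a different batch
# order, determined by its random seed. The batch size of 16 is illustrative.
train_indices = list(range(78468))  # one index per training radiograph

def batch_order(seed, batch_size=16):
    """Return the seed-specific ordering of training batches for one run."""
    rng = random.Random(seed)
    order = train_indices[:]
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

# 50 runs: identical data and hyperparameters, 50 distinct batch orderings.
orderings = [batch_order(seed) for seed in range(50)]
```

Each run is deterministic given its seed, so a run can be reproduced exactly, while any two seeds yield different orderings of the same underlying data.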
|Metric||Symbol||Description (for each finding, radiograph pair)|
|Mean||μ||Average predicted probability|
|Stdev.||σ||Standard deviation of predicted probability|
|Coefficient of variation||σ/μ||Standard deviation divided by mean predicted probability|
|Log prob. ratio||ln(p_max/p_min)||Natural log of the highest predicted probability divided by the lowest predicted probability|
|Percentile (%ile) rank||r||The percentile rank relative to all predictions for that finding among radiographs in the test set|
|%ile rank range||max(r) − min(r)||The percentile rank of the highest probability predicted minus the percentile rank of the lowest probability predicted|
We calculated various statistical measurements for each finding on each radiograph across our 50 models in the full test set (n=22,433). Table 1 describes each metric in detail. In addition to mean, standard deviation, and coefficient of variation, we calculated the log probability ratio ln(p_max/p_min), where p_max is the greatest and p_min the least probability predicted by the 50 trained models for a given finding on a given radiograph. This ratio provides a scaled measurement of the variability of the predicted probability of a finding on a radiograph. To contextualize predictions within the population of predictions for that finding, we calculated the percentile rank range of the predictions, max(r) − min(r), where r is the percentile rank of a prediction relative to all predictions for that finding in the test set.
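The per-radiograph metrics in Table 1 can be computed directly from a models × radiographs matrix of predicted probabilities. The sketch below uses NumPy with randomly generated probabilities standing in for real model outputs; the array sizes are illustrative.

```python
import numpy as np

# Toy stand-in for real predictions: 50 retrained models x 1,000 radiographs,
# each value the predicted probability of one finding.
rng = np.random.default_rng(0)
preds = rng.uniform(0.01, 0.99, size=(50, 1000))

mean = preds.mean(axis=0)                     # mean predicted probability
stdev = preds.std(axis=0, ddof=1)             # standard deviation
cv = stdev / mean                             # coefficient of variation
log_ratio = np.log(preds.max(axis=0) / preds.min(axis=0))  # log prob. ratio

# Percentile rank of every prediction relative to all predictions for this
# finding across the test set, then the per-radiograph range of those ranks.
flat_ranks = preds.ravel().argsort().argsort()
pct_rank = flat_ranks.reshape(preds.shape) / (preds.size - 1) * 100
pct_rank_range = pct_rank.max(axis=0) - pct_rank.min(axis=0)
```

Each resulting vector has one entry per radiograph, matching the per-(finding, radiograph) definitions in Table 1.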
To evaluate the effectiveness of ensembling in reducing variance, we averaged predictions over disjoint groups of 10 models to yield 5 separate averaged predictions for each finding on each radiograph (5 groups × 10 models per group = 50 total models). We reported the standard deviation and coefficient of variation across these averaged groups. A paired t-test was used to compare coefficients of variation across the raw (n=50) and averaged (n=5) predictions.
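The grouping-and-averaging scheme, and the paired t-statistic comparing coefficients of variation, can be sketched as follows, again with random probabilities standing in for real predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
preds = rng.uniform(0.01, 0.99, size=(50, 1000))  # toy: 50 models x radiographs

# Average over 5 disjoint groups of 10 models each -> 5 ensembled predictions.
ensembled = preds.reshape(5, 10, -1).mean(axis=1)

def coeff_var(p):
    """Per-radiograph coefficient of variation across models (rows)."""
    return p.std(axis=0, ddof=1) / p.mean(axis=0)

cv_raw = coeff_var(preds)       # across the 50 individual models
cv_ens = coeff_var(ensembled)   # across the 5 ensembles of 10

# Paired t-statistic on the per-radiograph coefficients of variation.
d = cv_raw - cv_ens
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(d.size))
```

Even on this synthetic data, averaging over groups of 10 shrinks the per-radiograph spread (roughly by a factor of √10), which is the effect the paired t-test quantifies.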
We examined the variance in overall AUC for the 14 possible targets. For each of the 50 models, AUC was calculated on both the full and limited test sets using the pROC package in R (Robin2011-lg; Ihaka1996-kl). This calculation provided an empirical distribution of the test AUC relative to the order of samples in the training data. For both the full and limited test sets, we estimated the width of the 95% confidence interval of this distribution as the difference between the second-largest and second-smallest of the 50 AUCs. For the limited test sets, the average width of the 95% confidence interval was also estimated using DeLong’s method (DeLong1988-je) and bootstrapping (Carpenter2000-xn). DeLong’s method expresses AUC in terms of the Mann-Whitney U statistic (Mann1947-kd).
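The identity underlying DeLong’s method, AUC = U / (n_pos × n_neg), can be checked with a direct pairwise computation; a minimal sketch:

```python
import numpy as np

def auc_mann_whitney(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive case outscores
    a randomly chosen negative case (ties counted as one half), i.e. the
    Mann-Whitney U statistic divided by n_pos * n_neg."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)

auc_mann_whitney([0.9, 0.8, 0.7], [0.6, 0.5])  # perfect separation: AUC = 1.0
```

This brute-force version is quadratic in the number of cases; pROC and other implementations use rank-based formulations of the same statistic.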
The variability in predictions for a given radiograph was substantial across models. An example radiograph classification is shown in Figure 1. Figure 1 compares the variability across models (n=50) in predicted probability of pneumonia for this radiograph to the variability of predictions for pneumonia in the full test set (22,433 cases × 50 models = 1,121,650 total predictions). In this example, the percentile rank range was 95.3% − 48.2% = 47.1%. The variability of each finding for this radiograph relative to all predictions for a given finding is shown in Figure 4.
|Average across individual models (n=50)||
|Finding||Mean (μ)||Stdev. (σ)||
In the full test set (n=22,433), the mean coefficient of variation for an individual radiograph over 50 retrainings was 0.543, and the mean log probability ratio was 2.45 (Table 2, Figure 2); for a model with unvarying predictions, this ratio would equal zero. The radiographs had a mean percentile rank range of 43.0%. In other words, the average difference between the percentile rank of a radiograph’s highest prediction for a finding, relative to all predictions for that finding in the test set, and the percentile rank of its lowest prediction was 43.0%, nearly half the available range.
Averaging model predictions significantly reduced the mean coefficient of variation from 0.543 to 0.169 (paired t-test, t = 15.96, p < 0.0001). The distribution of AUC across models showed a degree of variability in both the full and limited test sets (Figure 3, Table 3). In the limited test set, the empirical variability in predictions did not exceed the average DeLong or bootstrap confidence interval for each model (Table 3). The DeLong and bootstrap 95% confidence intervals for AUC contained the mean AUC across models in 99.7% of cases (698/700).
|Limited test set (n=792)|
We found substantial variation in the predicted probability of findings when varying the sampling of batches in the training set (mean coefficient of variation across all findings 0.543, mean log probability ratio 2.45; Figure 2, Table 2). We highlighted a case that demonstrated how predicted probabilities could vary across models (Figure 4), shifting its estimated risk relative to the test set population based on the random seed used to train the model. The average case had a 43.0% percentile rank range between its highest and lowest estimated probability of disease across all 50 models.
We found that there was variability across models in AUC for all findings. The overall AUC for each finding in the full test set of over 20,000 cases was much more stable than the substantial variability in predictions for individual radiographs. As explained by delong-article, “the area under the population ROC curve represents the probability that, when the variable is observed for a randomly selected individual from the abnormal population and a randomly selected individual from the normal population, the resulting values will be in the correct order (e.g., abnormal value higher than the normal value).” In our case, AUC represents the probability that a randomly selected radiograph that is ground-truth positive for pathology will be assigned a higher score by the CNN than a randomly selected radiograph that is ground-truth negative for pathology. Calculating AUC is thus identical to estimating the success probability p of a Bernoulli random variable (i.e., a weighted coin flip) by repeatedly sampling from this distribution and calculating the average over all draws. As our sample size grows larger, our uncertainty interval over the true value of p (and, equivalently, our uncertainty over AUC) grows progressively narrower. AUC can thus be relatively consistent across CNNs that make variable radiograph-level predictions, provided that these variable predictions are similar overall in their ability to classify positive and negative cases.
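The narrowing of the uncertainty interval with test-set size follows the usual 1/√n behavior of a proportion estimate. A back-of-the-envelope sketch using the normal approximation (the AUC value of 0.85 is illustrative, not a result from this study):

```python
import math

def ci_half_width(p, n):
    """Normal-approximation 95% CI half-width for a proportion estimated
    from n Bernoulli draws with success probability p."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

# The same underlying probability is pinned down far more tightly by the
# full test set than by the limited one: half-width shrinks as 1/sqrt(n).
limited = ci_half_width(0.85, 792)
full = ci_half_width(0.85, 22433)
```

This is intuition rather than an exact AUC confidence interval: the positive-negative pairs entering AUC share cases and are not independent draws, which is why methods such as DeLong’s are used for formal inference.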
The variability in AUC was expectedly wider in the limited test set compared to the full test set. We compared realized variability across models to 95% confidence intervals estimated by two commonly used methods, DeLong and bootstrapping, on the limited test set and found that the realized variability did not exceed these estimated bounds. We note that this comparison is limited and not fully powered; we use sample mean instead of unknown population mean, limiting our ability to detect true differences. Nevertheless, it provides evidence that variability in AUC does not grossly exceed the estimates of common statistical tests, and that these tests can be used to compare the performance of different CNNs, provided researchers are aware of their variability.
Stability in AUC across trained models can mask wide variation in predictions for a single radiograph, and should not reassure researchers that predictions will remain consistent. In our experiments, each DenseNet-121 (Huang2017-hi) model was initialized with the same pre-trained weights from ImageNet (Russakovsky2015-ib) and trained with the same train/tune/test data, optimizer, and hyperparameters. From this consistent configuration, we fine-tuned each model on the NIH chest radiograph dataset (Wang2017-py), varying only the order in which training data was batched and presented to the model. The substantial variability we observed in predictions for individual radiographs might have been even wider had we allowed the model’s initialization parameters, choice of optimizer, or hyperparameters to vary (Wilson2017-ca; Choi2019-od). Raghu2019-js suggested that pre-training may not be necessary to achieve competitive performance on medical imaging tasks. Our results call into question whether the absence of pre-training may induce additional variability in predictions.
In the context of healthcare, it is particularly important to remain aware of the variability in individual predictions. If deep learning-based decision support systems are deployed in clinical settings, their predictions will alter the diagnoses and treatments given to some patients. Justice, beneficence, and respect for persons are the three ethical principles proposed by the Belmont Report (belmont), which guides discussion of ethical considerations in medical research. An algorithm that treats identical patients differently challenges the value of justice and potentially leaves the care of patients up to a multi-dimensional coin flip. At the same time, radiologists are also far from perfectly consistent (Bruno2015-cs). Rajpurkar2017-nh observed relatively low inter-rater agreement between the radiologists who contributed the expert labels for pneumonia (0.387 average F1 score comparing each individual radiologist to the majority vote of three other radiologists). Similarly, a study of radiologists at Massachusetts General Hospital found 30% disagreement between colleagues’ interpretations of abdominopelvic CTs and 25% disagreement for the same radiologist viewing the CT at different times (Abujudeh2010-re; Bruno2015-cs). Machine learning algorithms may offer an opportunity to improve the consistency of medical decisions, but only if we are attentive to the inconsistency of which they, too, are capable.
Straightforward workarounds, such as averaging predictions across models (Sollich1996-wj; Lakshminarayanan2017-en), can substantially mitigate the effect of this individual-level variability (coefficient of variation reduced from 0.543 to 0.169, p < 0.0001). Reducing the variability in individual predictions is also likely to improve performance metrics such as AUC; ensembling of CNN predictions has been successfully demonstrated in the medical imaging literature (titano; wu; mura; Gulshan2016-mb; Rajpurkar2018-vn; Irvin2019-lc; Pan2019-ux), primarily to optimize model performance. No prior work to our knowledge examines how variability in the predictions of CNNs for radiologic imaging may translate to the care of individual patients. We encourage researchers to be vigilant of the variability of deep learning models and to provide some measure of how consistently their final (possibly ensembled) model performs in predicting findings for individual patients, to assure readers that end users of the model will not be fooled by randomness.
We would like to thank the Internal Medicine residency program at California Pacific Medical Center (San Francisco, CA) for giving one of the authors (J.Z.) dedicated time to work on this research while he was a preliminary medicine resident.