Deep learning algorithms have recently achieved a high level of performance on chest x-ray interpretation (Rajpurkar et al., 2018; Singh et al., 2018; Nam et al., 2018). Although these advances have led many to suggest that such algorithms could soon provide accurate chest x-ray interpretation and increase access to radiology expertise, several major challenges remain for their translation to the clinical setting.
First, the performance of deep learning chest x-ray algorithms, trained mainly on US-based chest x-ray datasets, on endemic and globally relevant diseases not commonly found in the US, such as tuberculosis (TB), is unknown (Qin et al., 2018, 2019). Second, most chest x-ray algorithms have been developed and validated on digital x-rays, while the vast majority of the world relies on film for x-ray interpretation, a barrier that excludes these populations from the advancements of automated interpretation (Schwartz et al., 2014). As an interim digital solution, digital photographs of films can be taken for storage, interpretation, and consultation as a "workaround" (Handelman et al., 2018). Third, chest x-ray algorithms developed using data from one institution have not shown sustained performance when externally validated on data from a different, unrelated institution; instead, these models have been criticized as vulnerable to bias and non-medically relevant cues (Zech et al., 2018). We believe that tackling each of these challenges will inform improved translation of deep learning algorithms into safe and effective clinical decision support tools that can be validated prospectively with large impact studies and clinical trials.
The purpose of this work is to systematically address the aforementioned translation challenges for chest x-ray models. We validate the performance of chest x-ray models on the tasks of (1) TB detection, (2) pathology detection on digital photographs of chest x-rays, and (3) pathology detection on chest x-rays from a separate institution. Rather than choosing one model architecture or approach, we evaluate performance under each of these conditions using the top 10 performing models on the CheXpert challenge, a large public competition for chest x-ray analysis (Irvin et al., 2019).
In this work we report performance metrics for the generalizability of existing chest x-ray models on the three aforementioned tasks. First, we find that the top 10 chest x-ray models on the CheXpert competition, without fine-tuning and without TB labels in their training data, achieve an average AUC of 0.851 on the task of detecting TB on two public TB datasets, competitive with previously published approaches that trained and tested their models specifically on these same TB datasets (Qin et al., 2018, 2019). Second, we find that the average performance of the models on photos of x-rays (AUC = 0.916) is similar to their performance on the original chest x-ray images (AUC = 0.924). Third, we find that the models tested on an external dataset either perform comparably to or exceed the average performance of radiologists.
2. Experimental Setup
Top Models on CheXpert Leaderboard
We investigated the generalization performance of the top 10 models on the CheXpert (Irvin et al., 2019) competition leaderboard. CheXpert is a competition for automated chest x-ray interpretation that has been running since January 2019 and features a strong radiologist-labeled reference standard. As of November 2019, 94 models had been submitted to the CheXpert leaderboard by both academic and industry teams. We selected the top 10 available models on the leaderboard as of that date. All of the selected models were ensembles, with the number of models in each ensemble ranging from 8 to 32; the majority featured Densely Connected Convolutional Networks (Huang et al., 2017) as part of their ensemble.
Running Models on New Test Sets
CheXpert is a unique competition in that it uses a hidden test set for official evaluation of models. Teams submit their executable code, which is then run on a test set that is not publicly readable. Such a setup preserves the integrity of the test results. Models can be rerun on new test sets to evaluate the ability of the model to generalize to new domains.
We used the CodaLab platform to re-run these chest x-ray models on the new test sets. CodaLab is an online platform for collaborative and reproducible computational research. The system exposes a simple command-line interface through which one can upload code and data and subsequently submit jobs to run them. Once a team has submitted their model on CodaLab and successfully run inference on the hidden CheXpert test set, they are added to the leaderboard. We reproduced the runs of the top 10 teams using their model checkpoints and inference scripts, substituting the other datasets used in this study for the hidden CheXpert test set.
Our primary evaluation metric is the area under the receiver operating characteristic curve (AUC). We report the average AUC of the top 10 models on the new test sets, averaged over the available tasks. Additionally, in experiments comparing the models to board-certified radiologists, we compute the average sensitivities of all models per task thresholded at the specificities of the radiologists at each task. The sensitivities of the average of the radiologists are compared to the sensitivities of the models per task.
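The radiologist comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in the study: the function name `sensitivity_at_specificity`, the rule of predicting positive when the score is strictly above the threshold, and the toy labels and scores are all hypothetical.

```python
import math


def sensitivity_at_specificity(labels, scores, target_specificity):
    """Sensitivity at the lowest threshold whose specificity reaches the target.

    labels: 0/1 ground truth per x-ray; scores: model probabilities.
    Assumption: a case is predicted positive when its score is strictly
    greater than the threshold.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative case")
    # Raising the threshold never lowers specificity, so the first threshold
    # that meets the target also gives the highest achievable sensitivity.
    for threshold in [-math.inf] + sorted(set(scores)):
        specificity = sum(s <= threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            return sum(s > threshold for s in positives) / len(positives)
    raise ValueError("target specificity must not exceed 1.0")


# Toy data (hypothetical): two models scored on the same eight x-rays.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model_scores = [
    [0.1, 0.2, 0.3, 0.7, 0.4, 0.6, 0.8, 0.9],
    [0.2, 0.1, 0.4, 0.3, 0.5, 0.7, 0.6, 0.9],
]
radiologist_specificity = 0.75  # hypothetical mean radiologist specificity

per_model = [sensitivity_at_specificity(labels, s, radiologist_specificity)
             for s in model_scores]
mean_sensitivity = sum(per_model) / len(per_model)
```

The resulting `mean_sensitivity` is the quantity that would be compared with the mean radiologist sensitivity for the same task.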
Class activation maps (CAMs) were used to highlight regions with the greatest influence on a model’s decision. For a given x-ray, the CAM was produced for every class by taking the weighted average across the final convolutional feature map, with weights determined by the linear layer. The CAM was then scaled according to the output probability, so that more confident predictions appeared brighter. Finally, the map was upsampled to the input image resolution and overlaid onto the input image. The Stanford baseline model on the CheXpert leaderboard was used to generate the CAMs.
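The CAM steps described above can be sketched in plain Python. This is an illustrative sketch rather than the code used in the study: the function names, the min-max normalization, the nearest-neighbor upsampling, and the alpha-blended overlay are assumptions, since the text does not specify those details.

```python
def class_activation_map(features, class_weights, probability, out_h, out_w):
    """Return an out_h x out_w heat map for one class.

    features: K channels, each an h x w grid (final convolutional feature map).
    class_weights: K linear-layer weights for the class of interest.
    probability: the model's output probability for that class.
    """
    h, w = len(features[0]), len(features[0][0])
    # Weighted combination of the channels at every spatial position.
    cam = [[sum(wk * fmap[i][j] for wk, fmap in zip(class_weights, features))
            for j in range(w)] for i in range(h)]
    # Assumption: min-max normalize to [0, 1] before scaling.
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = (hi - lo) or 1.0
    # Scale by the output probability so confident predictions look brighter.
    cam = [[probability * (v - lo) / span for v in row] for row in cam]
    # Assumption: nearest-neighbor upsampling to the input resolution.
    return [[cam[i * h // out_h][j * w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def overlay(image, heat, alpha=0.5):
    """Alpha-blend a heat map onto a grayscale image of the same size."""
    return [[(1 - alpha) * p + alpha * q for p, q in zip(img_row, heat_row)]
            for img_row, heat_row in zip(image, heat)]


# Toy example (hypothetical): two 2x2 channels upsampled to a 4x4 input.
features = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 2.0]]]
heat = class_activation_map(features, [1.0, 1.0], probability=0.8,
                            out_h=4, out_w=4)
blended = overlay([[0.0] * 4 for _ in range(4)], heat)
```

A production pipeline would typically use a tensor library and bilinear interpolation; the pure-Python version only makes the order of operations explicit.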
3. TB Detection
We evaluated the models on the task of detecting tuberculosis (TB). TB is the leading cause of death from a single infectious disease agent and the leading cause of death for people living with human immunodeficiency virus (HIV) infection (MacNeil et al., 2019). Currently, chest x-ray models trained using large datasets from American institutions (Irvin et al., 2019; Johnson et al., 2019; Wang et al., 2017) do not include TB as one of the labeled pathologies because the disease is not prevalent in those settings. However, applying chest x-ray models in the global setting requires high performance on this globally relevant task. We hypothesized that we could use existing models trained on the CheXpert dataset to detect TB without any fine-tuning on the TB task or the TB datasets. Because consolidation is one of the most common chest x-ray manifestations of pulmonary TB, we used the consolidation label as a proxy for the task of detecting TB.
We tested the performance of the models on two datasets: the Shenzhen and Montgomery datasets released by the NIH (Jaeger et al., 2014). The Shenzhen dataset was collected in the Shenzhen No.3 Hospital, China. Of the 662 x-rays in the dataset, 326 are normal and 336 are abnormal with manifestations of TB; 34 cases are pediatric cases (defined as age 18 years). The Montgomery set was collected by the Department of Health and Human Services in Montgomery County, USA. Of the 138 x-rays in the dataset, 80 are normal and 58 are abnormal; 17 cases are pediatric cases.
We evaluated the performance of the models using their probability on the consolidation label as the predicted score for TB on an x-ray (see Figure 2). The average AUC of the models on the TB test sets ranged from 0.815 to 0.893 with an average of 0.851.
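A minimal sketch of this proxy evaluation is given below. It assumes each model returns a mapping from pathology name to probability; the key `"Consolidation"`, the function names, and the toy values are hypothetical. The AUC is computed in its pairwise form: the fraction of (TB, normal) pairs in which the TB x-ray receives the higher score, with ties counted as one half.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def tb_auc_from_consolidation(model_outputs, tb_labels):
    """Score TB with the consolidation probability, then compute the AUC."""
    scores = [output["Consolidation"] for output in model_outputs]
    return auc(tb_labels, scores)


# Toy data (hypothetical): one model's outputs on four x-rays.
outputs = [
    {"Consolidation": 0.9, "Edema": 0.1},
    {"Consolidation": 0.2, "Edema": 0.3},
    {"Consolidation": 0.6, "Edema": 0.2},
    {"Consolidation": 0.4, "Edema": 0.5},
]
tb_labels = [1, 0, 0, 1]  # 1 = TB, 0 = normal
proxy_auc = tb_auc_from_consolidation(outputs, tb_labels)
```

The pairwise form is quadratic in the number of x-rays, which is acceptable for datasets of a few hundred images such as those used here.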
We analyzed the strength of the relationship between the performance of the models on the source tasks and the target TB dataset. We ran a linear regression to predict the average AUC of the models on the TB datasets using (1) the average AUC of the models on the consolidation task in CheXpert, and (2) the average AUC across all 5 competition tasks in CheXpert.
We found that the relationship was weaker for the AUC on consolidation in CheXpert than for the average AUC on all five CheXpert tasks.
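The strength of such a relationship is commonly summarized by the coefficient of determination of a one-predictor least-squares fit. The sketch below shows that computation; the function name `fit_line` and the toy numbers are hypothetical and are not the values from this study.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares of the fitted line.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return slope, intercept, 1.0 - ss_res / syy


# Toy numbers (hypothetical): per-model source-task AUC vs. TB AUC.
source_auc = [1.0, 2.0, 3.0, 4.0]
target_auc = [1.0, 3.0, 2.0, 4.0]
slope, intercept, r2 = fit_line(source_auc, target_auc)
```

Running the fit once with the consolidation AUC as the predictor and once with the average AUC across tasks, then comparing the two R² values, mirrors the comparison described above.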
There have been a number of studies developing models for TB detection. Hwang et al. (2016) tested on the Shenzhen TB dataset without training on the data, but their models were explicitly trained on the TB task, and achieved an AUC of 0.884 on the Shenzhen dataset. Pasa et al. (2019) reported AUCs of 0.811 on Montgomery and 0.900 on the Shenzhen dataset when their model was trained on a combination of the two datasets and additional data. Similarly, Vajda et al. (2018) reported AUCs of 0.870 on Montgomery and 0.990 on the Shenzhen dataset after training on the same two datasets. Finally, Lakhani and Sundaram (2017) trained on a combination of four different TB datasets, and achieved an AUC of 0.990 on their test set with their ensemble model.
In our study, we found that the average AUC of the models on the TB test sets (0.851), obtained without exposure to TB datasets, was competitive with that of models that had been directly trained on these datasets for the task of tuberculosis detection. We also found that the average performance of a model across tasks was a stronger predictor of performance on the tuberculosis datasets than the performance of the model on any individual task. This suggests that training models to perform well across tasks may allow them to perform better on unseen images than models that optimize for a single task. A possible reason for this finding is that the shared representations learned by optimizing for multitask performance are exploited for better performance on different data distributions (Caruana, 1997).
4. Smartphone Photos
We evaluated the models on the task of detecting pathologies on smartphone photos of chest x-rays. While most deep learning models are trained on digital x-rays, scaled deployment demands a solution that can navigate an endless array of medical imaging and IT infrastructures. An appealing solution is to leverage the ubiquity of smartphones: clinicians and radiologists in parts of the world take smartphone photos of medical imaging studies to share with other experts or clinicians using messaging services like WhatsApp (Handelman et al., 2018). While using photos of chest x-rays as input to chest x-ray algorithms could enable any physician with a smartphone to get instant AI algorithm assistance, the performance of chest x-ray algorithms on photos of chest x-rays has not been thoroughly investigated. Outside chest x-ray classification, deep learning algorithms for image classification have been shown to attain lower performance on photos of images than on the images themselves (Kurakin et al., 2016). We conducted an experiment to determine whether existing chest x-ray models could generalize well to photos of chest x-rays.
We generated a dataset of photos of the CheXpert test set, consisting of studies from 500 patients. Chest X-rays from each test study were displayed on a non-diagnostic computer monitor. Photos of the monitor were taken with an Apple iPhone 7 by a physician. The physician was instructed to keep the mobile camera stable and center the lung fields in the camera view. A time-restriction of 5 seconds per image was imposed to simulate a busy healthcare environment. Subsequent inspection of photos showed that they were taken with slightly varying angles; some photos included artefacts such as Moiré patterns and subtle screen-glares. Photos were labeled using the ground truth for the corresponding digital x-ray image.
The models achieved a mean AUC of 0.916 on photos of the CheXpert test set, compared with an AUC of 0.924 on the original CheXpert test set. All of the models had mean AUCs higher than 0.9, and were within 0.01 AUC of their performance on the original images. The average AUCs of each of the top 10 models across the 5 CheXpert competition tasks are detailed in Figure 3.
Several studies have highlighted the importance of the generalizability of computer vision models to noise in images (Hendrycks and Dietterich, 2019). Dodge and Karam (2017) demonstrated that deep neural networks perform poorly compared to humans on image classification on distorted images. Schmidt et al. (2018) and Geirhos et al. (2019) found that convolutional neural networks trained on specific image corruptions did not generalize, and that the error patterns of network and human predictions were not similar on noisy and elastically deformed images.
In our study, the dataset we generated for this experiment allows for the direct comparison of the effect of photos against the source images on model performance, addressing a key deployment and generalization challenge. We found that the performance across top teams on photos of chest x-rays was comparable to their performance on the original x-rays. Figure 6 demonstrates that the model is able to detect the location of the pathology on a characteristic example where the distortion generated by taking photos of the x-rays did not affect the ability of the model to identify clinically relevant information in the x-rays.
Table 1: Categories of model mistakes on consolidation, analyzed using CAMs.
|Mistake category and description|Count (%)|
|---|---|
|Failed to correctly localize (False Negative). CAMs failed to localize the actual consolidation. Typically, the consolidation was smaller or less opaque than average; in some cases, the CAMs highlighted a feature that was visually similar but unrelated to consolidation.|36 (44.44%)|
|Failed to confidently detect (False Negative). CAMs accurately localized the consolidation, but the model was not confident enough to make a positive diagnosis. This was found to occur when the consolidation overlapped with other diseases (such as severe pulmonary edema) or anatomical structures.|29 (35.80%)|
|Mistaken for mimicking feature (False Positive). CAMs highlighted a visual feature that mimics consolidation, and the model made a false positive diagnosis. This was often the case in the presence of severe pulmonary edema, and in cases with other pulmonary opacities such as fibrosis, scarring, and lung lesion.|13 (16.05%)|
|Mistaken for non-mimicking feature (False Positive). The x-ray contains enlarged cardiac contours and bilateral mid and lower lung interstitial predominant opacities consistent with cardiomegaly and pulmonary edema. CAMs highlighted an area of the cardiac border and chest wall which bears no apparent visual resemblance to consolidation.|3 (3.70%)|
5. External Institution
We evaluated the performance of the top 10 CheXpert models on a dataset from an external institution. Chest x-ray algorithms developed using data from one institution have not shown sustained performance when externally validated on data from a different, unrelated institution, and have been criticized as vulnerable to bias and non-medically relevant cues (Zech et al., 2018). Furthermore, certain institutions may not allow access to patient data for privacy reasons. For wider deployment in the healthcare system, it is therefore important that models trained on one institution’s data generalize to others without fine-tuning or retraining.
We used a set of 420 frontal chest x-rays curated in the test set of Rajpurkar et al. (2018). These x-rays contained images from the ChestXray-14 dataset collected at the National Institutes of Health Clinical Center (Wang et al., 2017), sampled to contain at least 50 cases of each pathology according to the original labels provided in the dataset.
The models achieved an average performance of 0.897 AUC across the 5 CheXpert competition tasks on the test set from the external institution. On Atelectasis, Cardiomegaly, Edema, and Pleural Effusion, the mean sensitivities of the models of 0.750, 0.617, 0.712, and 0.806 respectively, are higher than the mean radiologist sensitivities of 0.646, 0.485, 0.710, and 0.761 (at the mean radiologist specificities of 0.806, 0.924, 0.925, and 0.883 respectively). On Consolidation, the mean sensitivity of the models of 0.443 is lower than the mean radiologist sensitivity of 0.456 (at the mean radiologist specificity of 0.935).
Because our primary performance measures do not reveal patterns of mistakes or systematic biases, we qualitatively analyzed chest x-rays for which the model output was wrong compared to the ground truth diagnosis of consolidation. We used CAMs to reason about model mistakes. The analysis revealed that the model mistakes could be pooled into four distinct categories, as shown in Table 1. Each chest x-ray was assigned to one or more of the four categories: failure to correctly localize the consolidation, failure to confidently detect the consolidation, mistaking a mimicking feature for consolidation, and mistaking a non-mimicking feature for consolidation. The most common mistake was failure to detect the consolidation, and, as can be expected, this was often the case for faint or small consolidations.
Given the variety of healthcare systems and patient populations, it is critical for deep learning models in healthcare to be able to generalize to new patient populations from different institutions (Kelly et al., 2019; Chen et al., 2019). There have been several studies investigating the generalization of models to different institutions. For chest x-ray interpretation models in particular, Zech et al. (2018) trained image classifiers on chest x-rays from three different institutions and found that models trained on data from one institution failed to generalize to other institutions. Chen et al. (2019) raised concerns about whether deep learning based approaches could generalize to smaller healthcare institutions with limited data. Kelly et al. (2019) detailed limitations of deep learning in generalizing to new populations, given that the models may learn confounders present in one population. However, McKinney et al. (2020) recently showed that the performance of deep learning models for breast cancer detection trained entirely on data from the UK generalized to healthcare data from the US. Kim et al. (2019) reported that only 6% of studies evaluating the performance of AI algorithms for diagnostic analysis of medical images performed external validation.
In our study, we found that CheXpert-trained models generalized to another institution’s data without any additional site-specific training. Furthermore, the models exceeded radiologists on sensitivity for the majority of the tasks when thresholded at the radiologists’ specificity, despite not having been trained on the dataset. The CAMs suggest that the model is learning clinically relevant information in the chest x-rays and not confounders.
Our primary assumption in testing the generalization of these models for these different tasks and circumstances is that these models had not been exposed to data used for the external test sets. All models used in the study were trained exclusively to classify CheXpert pathologies (and did not include TB or NIH-specific pathologies): we verified that the output of all models had complete intersection with the CheXpert pathologies.
Furthermore, the results of our study do not suggest guaranteed generalization of chest x-ray models to new clinical settings; future work should evaluate performance in clinical trials for further verification, a necessary step for the successful translation of diagnostic or predictive artificial intelligence tools into practice (Park and Han, 2018).
Despite advances in the performance of chest x-ray algorithms (Lakhani and Sundaram, 2017; Kallianos et al., 2019; Shih et al., 2019; Kashyap et al., 2019; Qin et al., 2019, 2018), the ability of these models to generalize has not been systematically explored. The purpose of this study was to systematically evaluate the ability of existing models to generalize to (1) diseases not explicitly included in model development, (2) smartphone photos of x-rays, and (3) x-rays from institutions not included in model development. Our results suggest that existing chest x-ray models may generalize to new clinical settings without fine-tuning.
Deep learning models, including those for chest x-ray interpretation, have been criticized for their inability to generalize to new clinical settings (Kelly et al., 2019). For instance, Zech et al. (2018) reported that chest x-ray models failed to generalize to new populations or institutions separate from the training data, relying on institution-specific and/or confounding cues to infer the label of interest. In contrast, our results suggest that existing models may generalize across institutions, modalities, and diseases without further engineering. Importantly, in our evaluation of the models, the class activation maps gave no indication of bias toward institution-specific features or of reliance on unrelated features for classification.
Our systematic examination of the generalization capabilities of existing models can be extended to other tasks in medical AI (Rajpurkar et al., 2018; Hannun et al., 2019; Park et al., 2019; Uyumazturk et al., 2019; Varma et al., 2019; Duan et al., 2019; Topol, 2019), and provides a framework for tracking technical readiness towards clinical translation.
We would like to acknowledge the Stanford Machine Learning Group (stanfordmlgroup.github.io) and the Stanford Program for Artificial Intelligence in Medicine and Imaging for infrastructure support (AIMI.stanford.edu). We would also like to acknowledge Wenwu Ye from JF healthcare, Hieu Pham from the Medical Imaging Team at Vingroup Big Data Institute (VinBDI) and Desmond from Beihang University who were among the top submitters in the competition and helped us understand the data and techniques used in their models.
- Multitask learning. Mach. Learn. 28 (1), pp. 41–75.
- Deep learning and alternative learning strategies for retrospective real-world clinical data. Nature News.
- A Study and Comparison of Human and Deep Learning Recognition Performance under Visual Distortions. In 2017 26th International Conference on Computer Communication and Networks (ICCCN), pp. 1–7.
- Clinical value of predicting individual treatment effects for intensive blood pressure therapy: a machine learning experiment to estimate treatment effects from randomized trial data. Circulation: Cardiovascular Quality and Outcomes 12 (3), pp. e005010.
- ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv:1811.12231 [cs, q-bio, stat]. Accepted at ICLR 2019 (oral).
- Media messaging in diagnosis of acute cxr pathology: an interobserver study among residents. SpringerLink.
- Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine 25 (1), pp. 65.
- Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. arXiv:1903.12261 [cs, stat]. ICLR 2019 camera-ready; datasets available at https://github.com/hendrycks/robustness; this article supersedes arXiv:1807.01697.
- Densely Connected Convolutional Networks. pp. 4700–4708.
- A novel approach for tuberculosis screening based on deep convolutional neural networks. In Medical Imaging 2016: Computer-Aided Diagnosis, Vol. 9785, pp. 97852W.
- CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. arXiv:1901.07031 [cs, eess]. Published in AAAI 2019.
- Two public chest x-ray datasets for computer-aided screening of pulmonary diseases. Quantitative imaging in medicine and surgery.
- MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv:1901.07042 [cs, eess].
- How far have we come? Artificial intelligence for chest radiograph interpretation. Clinical Radiology 74 (5), pp. 338–345.
- Artificial intelligence for point of care radiograph quality assessment. In Medical Imaging 2019: Computer-Aided Diagnosis, Vol. 10950, pp. 109503K.
- Key challenges for delivering clinical impact with artificial intelligence.
- Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: results from recently published papers. Korean journal of radiology.
- Adversarial examples in the physical world. CoRR abs/1607.02533.
- Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks. Radiology 284 (2), pp. 574–582.
- Global epidemiology of tuberculosis and progress toward achieving global targets - 2017. Centers for Disease Control and Prevention.
- International evaluation of an ai system for breast cancer screening. Nature News.
- Development and validation of deep learning–based automatic detection algorithm for malignant pulmonary nodules on chest radiographs. Radiology 290 (1), pp. 218–228.
- Deep learning–assisted diagnosis of cerebral aneurysms using the headxnet model. JAMA Network Open 2 (6), pp. e195600–e195600.
- Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. Radiology 286 (3), pp. 800–809.
- Efficient Deep Network Architectures for Fast Chest X-Ray Tuberculosis Screening and Visualization. Scientific Reports 9 (1), pp. 1–9.
- Computer-aided detection in chest radiography based on artificial intelligence: a survey. BioMedical Engineering OnLine 17 (1), pp. 113.
- Using artificial intelligence to read chest radiographs for tuberculosis detection: A multi-site evaluation of the diagnostic accuracy of three deep learning systems. Scientific Reports 9 (1), pp. 1–10.
- MURA: large dataset for abnormality detection in musculoskeletal radiographs. In 1st Conference on Medical Imaging with Deep Learning.
- Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLOS Medicine 15 (11), pp. e1002686.
- Adversarially Robust Generalization Requires More Data. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 5014–5026.
- The accuracy of mobile teleradiology in the evaluation of chest x-rays. Journal of Telemedicine and Telecare.
- Grad-cam: why did you say that? visual explanations from deep networks via gradient-based localization. CoRR abs/1610.02391.
- Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia. Radiology: Artificial Intelligence 1 (1), pp. e180041.
- Deep learning in chest radiography: Detection of findings and presence of change. PLoS ONE 13 (10).
- High-performance medicine: the convergence of human and artificial intelligence. Nature medicine 25 (1), pp. 44.
- Deep learning for the digital pathologic diagnosis of cholangiocarcinoma and hepatocellular carcinoma: evaluating the impact of a web-based diagnostic assistant. arXiv preprint arXiv:1911.07372.
- Feature selection for automatic tuberculosis screening in frontal chest radiographs. Journal of medical systems 42 (8), pp. 146.
- Automated abnormality detection in lower extremity radiographs using deep learning. Nature Machine Intelligence, pp. 1–6.
- Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106.
- Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine 15 (11), pp. e1002683.
- Learning deep features for discriminative localization. CoRR abs/1512.04150.