Machine learning research has produced strategies and algorithms to mitigate domain over-fitting through the study of domain adaptation (DA) (Wang and Deng, 2018) and domain generalization (DG) (Li et al., 2017). Modeling in medical imaging, however, comes with unique challenges not faced in day-to-day computer vision tasks. For instance, medical images are typically of much higher resolution in 2D (and are often 3D or 4D), contain subtle artifacts, and have small regions of interest. Moreover, interpreting medical images can involve a high degree of uncertainty, even for highly trained radiologists.
Reliable predictive modeling in medical imaging calls for remedies from DA and DG. This work illustrates the problem of domain over-fitting in the context of classifying chest X-rays (CXRs), the most commonly prescribed imaging exam worldwide. Experimental data are gathered from ten domains that vary in their patient distributions, clinical environments, and global locations. We empirically show the phenomenon of performance degradation in inter-domain generalization. In this preliminary work, we suggest a simple solution and quantitatively show its promise as a strong baseline for better generalization.
High performance in classification, detection, and segmentation is regularly observed in retrospective clinical studies and publications. For instance, an AUC of 0.99 was reported by Lakhani and Sundaram (2017) for classifying pulmonary tuberculosis from CXRs. An average AUC of 0.96 was recorded by Dunnmon et al. (2018) in triaging normal and abnormal CXRs. A Dice score of 0.98 was shown by Weston et al. (2018) in segmenting body parts in abdominal CTs. A sensitivity of 0.96 was shown by Thian et al. (2019) in detecting fractures in wrist X-rays. Ueda et al. (2018) reported a sensitivity of 0.93 in detecting cerebral aneurysms in head MR angiography. As Kim et al. (2019) pointed out, however, most clinical publications do not contain sufficient external validation beyond the source domain whose data are used to train the models. Among those that did, the recent work of Zech et al. (2018) empirically showed drastic performance gaps across three medical institutions for models classifying pneumonia in CXRs. Prevedello et al. (2019) also discussed the similar issue of coping with data heterogeneity, but offered no practical recommendations. Unlike previous work, we conduct an unprecedented study with ten datasets collected internationally, measuring the ability of state-of-the-art machine learning models to perform domain adaptation and domain generalization in the context of medical imaging. We establish baseline solutions that are intuitive and practical, and that lead to better generalization performance in our experiments.
We utilize ten datasets from diverse sources to empirically show the benefits of training with data from multiple domains for model generalization. For training, we use four publicly available datasets: ChestX-ray14 (Wang et al., 2017) (NIH), CheXpert (Irvin et al., 2019) (CHX), PadChest (Bustos et al., 2019) (PAD), and MIMIC-CXR (Johnson et al., 2019) (MIM), plus one private dataset from Australia (AUS). In addition, to evaluate generalization, we use one public dataset, Open-i (Demner-Fushman et al., 2015) (OPI), and four private datasets: one from Canada (CAN) and three from different sources in China (CHN1, CHN2, CHN3). Table 1 below summarizes the data used in our experiments.
| Dataset | Origin | # Patients | # Train Scans | # Test Scans |
|---------|--------|-----------|---------------|--------------|
| NIH | Bethesda, MD, USA | 30,806 | 89,322 | 22,798 |
| CHX | Stanford, CA, USA | 64,534 | 152,938 | 38,089 |
| MIM | Boston, MA, USA | 62,592 | 200,874 | 49,170 |
| OPI | Bloomington, IN, USA | 3,670 | - | 3,670 |
For all of the following experiments, we use a DenseNet-121 pretrained on ImageNet, and the task is classifying CXRs as normal or abnormal. In the first set of experiments, we train a model on each of the five training datasets and test on all ten sets. Table 2 shows AUCs from training on each source domain and evaluating on all target domains. We observe two notable effects of training on all source domains. First, when aggregating all source domain data for training, test performance on those domains is essentially as good as training on any single source. Second, training on all sources simultaneously yields consistent improvement and the best performance on each of the five target domains, on which the models were never trained.
Figure 1 shows AUCs for experiments in which one of the five source domains is left out during training. We observe that for NIH and CHX, leaving them out has a negligible impact on the model’s performance. For AUS, PAD, and MIM, however, we see the expected effect: a moderate decrease in performance when the source is held out during training. There are many possible reasons why performance suffers more for some domains than for others, which we leave to future research.
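The leave-one-domain-out protocol can be sketched as follows. The `train` and `evaluate` stubs and the synthetic scores are placeholders (in the actual experiments, `train` fine-tunes the DenseNet-121 on the pooled scans of the given domains); `auc` is the standard Mann-Whitney rank estimator of the ROC AUC.

```python
# Sketch of the leave-one-domain-out experiment: for each of the five
# source domains, train on the other four and evaluate AUC on every
# source domain. Stubs below are illustrative assumptions.
import random

SOURCES = ["NIH", "CHX", "PAD", "MIM", "AUS"]

def auc(labels, scores):
    """ROC AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def train(domains):
    # Placeholder for fine-tuning a DenseNet-121 on the pooled scans of
    # `domains`; here it just returns a deterministic random scorer.
    rng = random.Random(",".join(sorted(domains)))
    return lambda image: rng.random()

def evaluate(model, domain, n=50):
    # Placeholder test set: seeded random labels, scored by the model.
    rng = random.Random(domain)
    labels = [rng.randint(0, 1) for _ in range(n)]
    scores = [model(None) for _ in labels]
    return auc(labels, scores)

results = {}
for held_out in SOURCES:
    model = train([d for d in SOURCES if d != held_out])
    results[held_out] = {d: evaluate(model, d) for d in SOURCES}
```

Comparing `results[held_out][held_out]` against the corresponding entry from a model trained on all five sources quantifies the cost of excluding that domain, which is the quantity plotted in Figure 1.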
- Bustos et al. (2019) Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria de la Iglesia-Vayá. Padchest: A large chest x-ray image dataset with multi-label annotated reports, 2019.
- Demner-Fushman et al. (2015) Dina Demner-Fushman, Marc D. Kohli, Marc B. Rosenman, Sonya E. Shooshan, Laritza Rodriguez, Sameer Antani, George R. Thoma, and Clement J. Mcdonald. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2):304–310, 2015. doi: 10.1093/jamia/ocv080.
- Dunnmon et al. (2018) Jared A Dunnmon, Darvin Yi, Curtis P Langlotz, Christopher Ré, Daniel L Rubin, and Matthew P Lungren. Assessment of convolutional neural networks for automated classification of chest radiographs. Radiology, page 181422, 2018.
- Irvin et al. (2019) Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison, 2019.
- Johnson et al. (2019) Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih ying Deng, Roger G. Mark, and Steven Horng. Mimic-cxr: A large publicly available database of labeled chest radiographs, 2019.
- Kim et al. (2019) Dong Wook Kim, Hye Young Jang, Kyung Won Kim, Youngbin Shin, and Seong Ho Park. Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: Results from recently published papers. Korean Journal of Radiology, 20(3):405–410, 2019.
- Lakhani and Sundaram (2017) Paras Lakhani and Baskaran Sundaram. Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology, 284(2):574–582, 2017.
- Li et al. (2017) Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5542–5550, 2017.
- Mazurowski et al. (2018) Maciej A Mazurowski, Mateusz Buda, Ashirbani Saha, and Mustafa R Bashir. Deep learning in radiology: an overview of the concepts and a survey of the state of the art. arXiv preprint arXiv:1802.08717, 2018.
- Prevedello et al. (2019) Luciano M Prevedello, Safwan S Halabi, George Shih, Carol C Wu, Marc D Kohli, Falgun H Chokshi, Bradley J Erickson, Jayashree Kalpathy-Cramer, Katherine P Andriole, and Adam E Flanders. Challenges related to artificial intelligence research in medical imaging and the importance of image analysis competitions. Radiology: Artificial Intelligence, 1(1):e180031, 2019.
- Thian et al. (2019) Yee Liang Thian, Yiting Li, Pooja Jagmohan, David Sia, Vincent Ern Yao Chan, and Robby T Tan. Convolutional neural networks for automated fracture detection and localization on wrist radiographs. Radiology: Artificial Intelligence, 1(1):e180001, 2019.
- Ueda et al. (2018) Daiju Ueda, Akira Yamamoto, Masataka Nishimori, Taro Shimono, Satoshi Doishita, Akitoshi Shimazaki, Yutaka Katayama, Shinya Fukumoto, Antoine Choppin, Yuki Shimahara, and Yukio Miki. Deep learning for mr angiography: Automated detection of cerebral aneurysms. Radiology, 290:180901, 10 2018. doi: 10.1148/radiol.2018180901.
- Wang and Deng (2018) Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
- Wang et al. (2017) Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. doi: 10.1109/cvpr.2017.369.
- Weston et al. (2018) Alexander D Weston, Panagiotis Korfiatis, Timothy L Kline, Kenneth A Philbrick, Petro Kostandy, Tomas Sakinis, Motokazu Sugimoto, Naoki Takahashi, and Bradley J Erickson. Automated abdominal segmentation of ct scans for body composition analysis using deep learning. Radiology, page 181432, 2018.
- Zech et al. (2018) John R Zech, Marcus A Badgeley, Manway Liu, Anthony B Costa, Joseph J Titano, and Eric Karl Oermann. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS medicine, 15(11):e1002683, 2018.