Radiography is a common exam for diagnosing chest conditions, since it is a low-cost, fast, and widely available imaging modality. Abnormalities identified on radiographs are called radiological findings. Several chest radiological findings may indicate lung cancer, such as lesions, consolidation, and atelectasis. Lung cancer is the leading cause of cancer death worldwide, and the lack of effective early-detection methods is one of the main reasons for its poor prognosis. Lung cancer signs are mostly identified through imaging exams, but, at the same time, 90% of lung cancer misdiagnoses occur in radiographs, often due to observer error.
Deep learning is a growing field for image analysis and has recently been employed in several medical imaging tasks, where it may help to overcome observer error. For chest radiographs, deep learning is used in a multilabel classification scenario to provide radiological findings that assist physicians in the diagnostic process. Recent work in the field achieved near radiologist-level accuracy at identifying radiological findings through convolutional neural networks, one of the most successful deep learning methods.
One assumption underlying deep learning models is that training and test data are independent and identically distributed (i.i.d.). This assumption often does not hold when data come from different settings. This is a common case in medical imaging, a scenario in which image acquisition protocols and machines may vary between diagnostic centers. Another aspect of medical imaging is the epidemiological variation between populations, which may change the label distribution across datasets. This difference in data distribution for the same task is called domain shift. The domain from which training data is sampled is the source domain, with distribution $P_S$, and the one where the model is applied is the target domain, with distribution $P_T$. When $P_S = P_T$, the model will most likely handle test data the same way it handled the training data. As $P_T$ diverges from $P_S$, trained models tend to yield poor results, failing to effectively handle the input data.
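To make the notion concrete, one common way to quantify how far a target distribution drifts from the source is the Kullback-Leibler divergence. The sketch below uses made-up label frequencies (not taken from any of the datasets discussed here) to illustrate the idea for discrete label distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence D(p || q) between two discrete distributions.
    One simple way to quantify how far a target domain drifts from the source;
    it is zero only when the two distributions match."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical label frequencies in a source and a target hospital,
# e.g. for (No Finding, Atelectasis, Pneumonia)
source = [0.50, 0.30, 0.20]
target = [0.30, 0.30, 0.40]
shift = kl_divergence(target, source)  # strictly positive: the domains differ
```

Any other distance between distributions (e.g. total variation) would serve the same illustrative purpose.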
With deep learning becoming more widespread, predictive models will inevitably become a big part of health care. We believe that before health care providers trust predictive models as a second opinion, we must understand the extent of their generalization capabilities and how well they perform outside the source domain. In this work, we propose to evaluate how well models trained on a hospital-scale database generalize to unseen data from other hospitals or diagnostic centers, by analyzing the degree of domain shift among three large datasets of chest radiographs. We train a state-of-the-art convolutional neural network for multilabel classification on each of the three datasets and evaluate its performance at predicting labels on the remaining two.
This paper is organized as follows: we first describe our methods, detailing our experiment design and datasets. Then, we summarize our results in Table 2 and discuss our findings.
2 Materials and methods
Three large datasets of chest radiographs are available to date. ChestX-ray14, from the National Institutes of Health, contains 112,120 frontal-view chest radiographs from 32,717 different patients, labeled with 14 pathologies. CheXpert, from the Stanford Hospital, contains 224,316 frontal and lateral chest radiographs of 65,240 patients. MIMIC-CXR, from the Massachusetts Institute of Technology, presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. Both CheXpert and MIMIC-CXR are labeled with the same 13 pathologies. The labels of all three datasets are automatically extracted by natural language processing algorithms applied to the radiological reports.
We show the pixel intensity distribution of each dataset in Figure 2. We see a spike at low intensities for every center, but the distribution at higher intensities differs between centers, which may contribute to a decrease in the models' predictive performance.
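A pixel-intensity comparison like the one in Figure 2 can be reproduced by pooling intensities per dataset into normalized histograms. The sketch below uses synthetic images rather than real radiographs; the total variation distance at the end is just one illustrative way to summarize how far two intensity profiles differ:

```python
import numpy as np

def intensity_histogram(images, bins=256):
    """Pool pixel intensities from a set of grayscale images (values in 0-255)
    and return a normalized histogram, comparable across datasets of any size."""
    pixels = np.concatenate([img.ravel() for img in images])
    hist, _ = np.histogram(pixels, bins=bins, range=(0, 256))
    return hist / hist.sum()  # normalize so each histogram sums to 1

# Toy example: two synthetic "datasets" with different exposure characteristics
rng = np.random.default_rng(0)
dark = [rng.integers(0, 128, size=(8, 8)) for _ in range(4)]
bright = [rng.integers(64, 256, size=(8, 8)) for _ in range(4)]
h_dark, h_bright = intensity_histogram(dark), intensity_histogram(bright)

# Total variation distance between the two intensity profiles (0 = identical)
tv_distance = 0.5 * np.abs(h_dark - h_bright).sum()
```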
2.2 Experiment design
We develop a multi-label classification approach reproducing CheXNet, which achieved state-of-the-art results in the classification of multiple pathologies using a DenseNet121 neural network architecture. The model is pre-trained on the ImageNet dataset, and the images are resized to 224x224 pixels and normalized with the ImageNet mean and standard deviation. We train three models, one on each dataset's training set, and evaluate each model on the test sets of the other two. Each dataset keeps the same train, test, and validation sets across all experiments. For the ChestX-ray14 dataset, we use the original split; since the CheXpert and MIMIC-CXR test sets are not publicly available, we randomly re-split their data, keeping the ChestX-ray14 split ratio (70% train, 20% test, and 10% validation) and no patient overlap between the sets. Table 1 shows the frequency of the labels in each train and test split.
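The preprocessing described above (scaling to [0, 1] and normalizing with the ImageNet statistics) can be sketched in NumPy as follows. The resize to 224x224 is assumed to have happened beforehand, and the exact pipeline details of the original CheXNet code may differ:

```python
import numpy as np

# ImageNet channel statistics, as used by CheXNet-style pipelines
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(image_uint8):
    """Scale a 224x224 RGB image (grayscale radiographs are replicated to
    3 channels for ImageNet-pretrained models) to [0, 1], normalize each
    channel with the ImageNet statistics, and return it in CHW layout."""
    x = image_uint8.astype(np.float32) / 255.0   # HWC, values in [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD       # per-channel normalization
    return np.transpose(x, (2, 0, 1))            # CHW, as PyTorch expects

img = np.full((224, 224, 3), 128, dtype=np.uint8)  # dummy mid-gray image
tensor = preprocess(img)
```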
One limitation we encountered is that the ChestX-ray14 dataset has a different set of labels from the other two datasets. We address this by training each model with all available labels, but reporting results only for the labels common to all three (No Finding, Atelectasis, Cardiomegaly, Edema, Lesion, Consolidation, Pneumonia, Pneumothorax). We created a "Lesion" label on ChestX-ray14 by joining the samples annotated with "Nodule" or "Mass". Another limitation is that ChestX-ray14 has only frontal X-rays; therefore, we use only the frontal samples from the other two datasets, leaving 191,229 samples in CheXpert and 249,995 in MIMIC-CXR.
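The label harmonization can be sketched as below. The function name and the per-sample label sets are hypothetical, but the mapping (merging "Nodule" and "Mass" into "Lesion" and keeping only the eight common labels) follows the procedure described above:

```python
# The eight labels shared by ChestX-ray14, CheXpert, and MIMIC-CXR
COMMON_LABELS = ["No Finding", "Atelectasis", "Cardiomegaly", "Edema",
                 "Lesion", "Consolidation", "Pneumonia", "Pneumothorax"]

def harmonize_chestxray14(labels):
    """Map a ChestX-ray14 label set onto the common label space:
    'Nodule' or 'Mass' become the merged 'Lesion' label, and any label
    outside the common set is dropped from the reported results."""
    mapped = set(labels)
    if "Nodule" in mapped or "Mass" in mapped:
        mapped.add("Lesion")
    mapped -= {"Nodule", "Mass"}
    return sorted(mapped & set(COMMON_LABELS))

harmonize_chestxray14({"Mass", "Pneumonia"})  # -> ["Lesion", "Pneumonia"]
```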
To evaluate domain shift, we use a performance metric common to multi-label classification, the Area Under the Receiver Operating Characteristic curve (AUC), to report both the prediction results for individual radiological findings and their average, for an overall view of model performance. The AUC takes both the true positive rate and the false positive rate into account; a higher value indicates better performance.
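The AUC can be computed as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (the Mann-Whitney formulation), which is equivalent to the area under the ROC curve. A minimal sketch, with the per-label mean used for the overall view (in practice a library routine such as scikit-learn's `roc_auc_score` would be used):

```python
import numpy as np

def auc(y_true, y_score):
    """AUC as the probability that a random positive sample is ranked
    above a random negative one; ties count half."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(y_score)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()   # positive ranked higher
    ties = (pos[:, None] == neg[None, :]).sum()     # ties contribute 0.5
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def mean_auc(y_true_matrix, y_score_matrix):
    """Per-finding AUCs (one column per label) and their average."""
    aucs = [auc(y_true_matrix[:, j], y_score_matrix[:, j])
            for j in range(y_true_matrix.shape[1])]
    return aucs, float(np.mean(aucs))
```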
We train the same neural network architecture with the same hyperparameters on each of the three datasets individually. Training and testing on ChestX-ray14 achieves results similar to those originally reported. After training, we load each model and evaluate it with images from the remaining two datasets.
3 Results

Table 2: AUC for each pair of test set and training set, per radiological finding (Atelectasis, Cardiomegaly, Consolidation, Edema, Lesion, Pneumonia, Pneumothorax, No Finding) and their mean.
We summarize our results in Table 2. The best results for each test set occur when the training set comes from the same dataset. This suggests that clinicians should expect a decrease from the reported performance of machine learning models when applying them in real-world scenarios. This decrease varies with the distribution of the dataset on which the model was trained. For instance, both the model trained on CheXpert and the model trained on ChestX-ray14 lose mean AUC when run on MIMIC-CXR's test set. On the CheXpert test set, training on MIMIC-CXR shows a similar decrease in mean AUC, reducing the AUC for all findings. The model trained on ChestX-ray14 has the highest average AUC when tested on its own test set, but shows the most significant performance drop when tested on the other datasets, lowering the mean AUC on both CheXpert and MIMIC-CXR. "No Finding" is the label with the most notable domain shift in ChestX-ray14, calling its reliability into question. The models trained on CheXpert and MIMIC-CXR mostly preserve the ChestX-ray14 baseline mean AUC.
Clear evidence of domain shift's impact on model performance is how frequently the best AUC for each radiological finding comes from a model trained on the same dataset. On the ChestX-ray14 test set, the best per-finding AUC most often comes from training on the same set; the same pattern holds on both CheXpert and MIMIC-CXR. Furthermore, in all three test sets, the best average AUC comes from the respective training set.
One possible cause of domain shift is the label extraction method. CheXpert and MIMIC-CXR use the same labeler, while ChestX-ray14 has its own. The ChestX-ray14 labeler has also raised questions concerning its reliability: one work of visual inspection states that its labels do not accurately reflect the content of the images, with estimated label accuracies 10% to 30% lower than the values originally reported.
4 Related work
The impact of domain shift in medical imaging has been studied for brain tumors by AlBadawy et al. They showed how training models with data from an institution different from the one where they are tested impacted the results of brain tumor segmentation. They also found that using multiple institutions for training does not necessarily remove this limitation.
We also see methods focused on unsupervised domain adaptation, where the task is to mitigate the problems of domain shift using unlabeled data from the target domain. Madani et al. observed the problem of domain overfitting on chest radiographs and developed a semi-supervised learning method, based on generative adversarial networks (GANs), capable of detecting cardiac abnormalities alongside the adversarial objective. Their method was able to overcome the domain shift between two datasets and increased model performance when testing on a different domain. Chen et al. developed a domain adaptation method based on CycleGANs. Their work resembles CyCADA, with the difference of also introducing a discriminator for the network output, creating what they call semantic-aware GANs. Javanmardi and Tasdizen use a framework very similar to domain-adversarial neural networks, which uses a domain classification network with a gradient reversal layer (an effect similar to a discriminator's) to model a joint feature space.
Gholami et al. propose a biophysics-based data augmentation method that adds synthetic tumor-bearing images to the training set. The authors argue that this augmentation procedure improves model generalization. Mahmood et al. present a generative model that translates images from a simulated endoscopy domain to a realistic-looking domain as a form of data augmentation. The authors also introduce an L1-regularization loss between the translated and the original images to minimize distortions.
5 Discussion and Conclusion
In this work, we show how a state-of-the-art deep learning model fails to generalize to unseen datasets that follow a somewhat different distribution. Our experiments show that a model with reported radiologist-level performance suffers a substantial drop in performance outside its source dataset, pointing to the existence of domain shift in chest X-ray datasets. Despite recent efforts to create large radiograph datasets in the hope of training generalizable models, the data acquisition methodology of the available datasets does not seem to capture the heterogeneity required for this purpose. Among the analyzed datasets, CheXpert and MIMIC-CXR seem the most representative, as the models trained on them show a smaller performance drop relative to the baseline. The least representative seems to be ChestX-ray14, whose model does not predict as well outside its own test set, while the models trained on the other datasets perform well when tested on ChestX-ray14.
Although deep learning advances allow for new application scenarios, more steps of model validation must be conducted, with more emphasis on external validation. We argue that a case-by-case validation is ideal, where the model is validated on new data from each center. The reason is twofold. First, models must be able to properly handle data from each specific scenario. Second, the label distribution in each environment may change due to several external factors and thus may not match the prediction biases learned by the model. One alternative to address these limitations is to create small datasets from the specific machines where the model will be used and to fine-tune models trained on the large available datasets.
-  Fred R Hirsch, Wilbur A Franklin, Adi F Gazdar, and Paul A Bunn, “Early detection of lung cancer: clinical perspectives of recent advances in biology and radiology,” Clinical Cancer Research, vol. 7, no. 1, pp. 5–22, 2001.
-  Annemilia del Ciello, Paola Franchi, Andrea Contegiacomo, Giuseppe Cicchetti, Lorenzo Bonomo, and Anna Rita Larici, “Missed lung cancer: when, where, and why?,” Diagnostic and Interventional Radiology, vol. 23, no. 2, pp. 118–126, mar 2017.
-  Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. van der Laak, Bram van Ginneken, and Clara I. Sánchez, “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, Dec 2017.
-  Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al., “Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning,” arXiv preprint arXiv:1711.05225, 2017.
-  Antonio Torralba and Alexei A Efros, “Unbiased look at dataset bias,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2011.
-  Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers, “Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” CoRR, vol. abs/1705.02315, 2017.
-  Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al., “Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison,” arXiv preprint arXiv:1901.07031, 2019.
-  Alistair EW Johnson, Tom J Pollard, Seth Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng, “Mimic-cxr: A large publicly available database of labeled chest radiographs,” arXiv preprint arXiv:1901.07042, 2019.
-  Luke Oakden-Rayner, “Exploring large scale public medical image datasets,” Tech. Rep., The University of Adelaide, 2019.
-  Ehab A AlBadawy, Ashirbani Saha, and Maciej A Mazurowski, “Deep learning for segmentation of brain tumors: Impact of cross-institutional training and testing,” Medical physics, vol. 45, no. 3, 2018.
-  Ali Madani, Mehdi Moradi, Alexandros Karargyris, and Tanveer Syeda-Mahmood, “Semi-supervised learning with generative adversarial networks for chest X-ray classification with ability of data domain adaptation,” in IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). apr 2018, pp. 1038–1042, IEEE.
-  Cheng Chen, Qi Dou, Hao Chen, and Pheng-Ann Heng, “Semantic-aware generative adversarial nets for unsupervised domain adaptation in chest x-ray segmentation,” in Proceedings of the International Workshop on Machine Learning in Medical Imaging. Springer, 2018, pp. 143–151.
-  Mehran Javanmardi and Tolga Tasdizen, “Domain adaptation for biomedical image segmentation using adversarial training,” in Proceedings of the 15th International Symposium on Biomedical Imaging. IEEE, 2018, pp. 554–558.
-  Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, Apr 2016.
-  Amir Gholami, Shashank Subramanian, Varun Shenoy, Naveen Himthani, Xiangyu Yue, Sicheng Zhao, Peter Jin, George Biros, and Kurt Keutzer, “A novel domain adaptation framework for medical image segmentation,” in Proceedings of the International Medical Image Computing and Computer Assisted Intervention Brainlesion Workshop. Springer, 2018, pp. 289–298.
-  Faisal Mahmood, Richard Chen, and Nicholas J. Durr, “Unsupervised reverse domain adaptation for synthetic medical images via adversarial training,” IEEE Transactions on Medical Imaging, vol. 37, 2018.