Can we trust deep learning models diagnosis? The impact of domain shift in chest radiograph classification

by Eduardo H. P. Pooch, et al.

While deep learning models become more widespread, their ability to handle unseen data and generalize to any scenario has yet to be challenged. In medical imaging, image distributions are highly heterogeneous, depending on the equipment that generates them and its parametrization. This heterogeneity triggers a common issue in machine learning called domain shift, which is the difference between the training data distribution and the distribution of the data on which a model is deployed. A high domain shift tends to result in poor model performance. In this work, we evaluate the extent of domain shift on three of the largest datasets of chest radiographs. We show how training and testing with different datasets (e.g., training on ChestX-ray14 and testing on CheXpert) drastically affects model performance, raising serious questions about the reliability of deep learning models.




1 Introduction

Radiography is a common exam for diagnosing chest conditions, since it is a low-cost, fast, and widely available imaging modality. Abnormalities identified on radiographs are called radiological findings. Several chest radiological findings might indicate lung cancer, such as lesions, consolidation, and atelectasis. Lung cancer is the leading cause of cancer death worldwide, and the lack of effective early-detection methods is one of the main reasons for its poor prognosis [1]. Signs of lung cancer are mostly identified through imaging exams, yet 90% of lung cancer misdiagnoses occur in radiographs, often due to observer error [2].

Deep learning is a growing field for image analysis. It has recently been employed in several medical imaging tasks [3] and may help to overcome observer error. For chest radiographs, deep learning is used in a multilabel classification scenario to provide radiological findings that assist physicians in the diagnosis process. Recent work in the field achieved near radiologist-level accuracy at identifying radiological findings by using convolutional neural networks [4], one of the most successful deep learning methods.

Figure 1: Example of a chest radiograph positive for consolidation sampled from each of the three analyzed datasets.

One assumption underlying deep learning models is that training and test data are independent and identically distributed (i.i.d.). This assumption does not always hold when data comes from different settings. This is a common case in medical imaging, a scenario in which image acquisition protocols and machines may vary between diagnostic centers. Another aspect of medical imaging is the epidemiological variation between populations, which may change the label distribution across datasets. This difference in data distribution for the same task is called domain shift. The domain from which training data is sampled is the source domain, with distribution $p_s(x, y)$, and the one where the model is applied is the target domain, with distribution $p_t(x, y)$. When $p_s(x, y) = p_t(x, y)$, the model will most likely handle test data the same way it handled the training data. As $p_t$ diverges from $p_s$, trained models tend to yield poor results, failing to effectively handle the input data [5].

With deep learning becoming more widespread, predictive models will inevitably become a big part of health care. We believe that before health care providers trust predictive models as a second opinion, we must understand the extent of their generalization capabilities and how well they perform outside the source domain. In this work, we evaluate how well models trained on a hospital-scale database can generalize to unseen data from other hospitals or diagnostic centers by analyzing the degree of domain shift among three large datasets of chest radiographs. We train a state-of-the-art convolutional neural network for multilabel classification on each of the three datasets and evaluate its performance at predicting labels on the remaining two.

This paper is organized as follows: we first describe our methods, detailing our experiment design and datasets. Then, we summarize our results in Table 2 and discuss our findings.

                  ChestX-ray14         CheXpert             MIMIC-CXR
Finding           Train     Test       Train     Test       Train     Test
Atelectasis        7,996    2,420      20,630    6,132      34,653   10,071
Cardiomegaly       1,950      582      15,885    5,044      34,097    9,879
Consolidation      3,263      957       9,063    2,713       8,097    2,430
Edema              1,690      413      34,066   10,501      20,499    5,954
Lesion             7,758    2,280       4,976    1,411       5,025    1,341
Pneumonia            978      242       3,274      935      12,736    3,711
Pneumothorax       3,705    1,089      12,583    3,476       8,243    2,231
No Finding        42,405   11,928      12,010    3,293      58,135   16,670

Table 1: Positive label frequency (in number of radiographs) in the train and test split of each dataset.

2 Materials and methods

2.1 Datasets

Three large datasets of chest radiographs are available to date. ChestX-ray14 [6], from the National Institutes of Health, contains 112,120 frontal-view chest radiographs from 32,717 different patients labeled with 14 pathologies. CheXpert [7], from the Stanford Hospital, contains 224,316 frontal and lateral chest radiographs of 65,240 patients. MIMIC-CXR [8], from the Massachusetts Institute of Technology, presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. Both CheXpert and MIMIC-CXR are labeled with the same 13 pathologies. The labels of all three datasets are automatically extracted from radiological reports using natural language processing algorithms.

We show the pixel intensity distribution of each dataset in Figure 2. We see a spike at low intensities for every center, but the distribution at higher intensities differs somewhat between centers, which might decrease the models' predictive performance.

Figure 2: Dataset pixel intensity probability density function.
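A comparison like the one in Figure 2 can be sketched as a normalized intensity histogram per dataset. This is a minimal illustration, assuming images are already loaded as 2D lists of integer intensities; the function names and the total variation distance used to compare two histograms are illustrative choices, not taken from the paper.

```python
from collections import Counter


def intensity_pdf(images, levels=256):
    """Estimate the probability of each intensity level over a set of images.

    `images` is an iterable of 2D lists of integer intensities in [0, levels).
    """
    counts = Counter()
    total = 0
    for image in images:
        for row in image:
            counts.update(row)
            total += len(row)
    if total == 0:
        raise ValueError("no pixels provided")
    return [counts.get(level, 0) / total for level in range(levels)]


def total_variation(pdf_a, pdf_b):
    """Half the L1 distance between two discrete distributions (0 = identical)."""
    if len(pdf_a) != len(pdf_b):
        raise ValueError("distributions must have the same number of levels")
    return 0.5 * sum(abs(a - b) for a, b in zip(pdf_a, pdf_b))


# Toy example: two tiny 2x2 "images" with 4 intensity levels.
toy_images = [[[0, 0], [1, 3]], [[0, 2], [3, 3]]]
toy_pdf = intensity_pdf(toy_images, levels=4)
```

A distance near 0 between two datasets would suggest similar intensity statistics; it says nothing about label distribution or image content.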

2.2 Experiment design

We develop a multi-label classification approach reproducing CheXNet [4], which achieved state-of-the-art results in the classification of multiple pathologies using a DenseNet121 neural network architecture. The model is pre-trained on the ImageNet dataset; the images are resized to 224x224 pixels and normalized using the ImageNet mean and standard deviation. We train three models, one for each dataset. Each model is trained on its own training set and evaluated on the test sets of the other two datasets, with its own test set serving as the baseline. Each dataset keeps the same train, test, and validation sets in all experiments. For the ChestX-ray14 dataset, we use the original split, but as the CheXpert and MIMIC-CXR test sets are not publicly available, we randomly re-split their data, keeping the ChestX-ray14 split ratio (70% train, 20% test, and 10% validation) and ensuring no patient overlap between the sets. Table 1 shows the frequency of the labels in each train and test split.
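The patient-level re-split described above could look like the following. This is a sketch under assumptions: the function name, the seed, and the choice to apply the 70/20/10 ratio to unique patients rather than to images are illustrative, since the text does not say how the ratio was enforced.

```python
import random


def split_by_patient(patient_ids, percents=(70, 20, 10), seed=0):
    """Split sample indices into train/test/val with no patient in two sets.

    `patient_ids[i]` is the patient identifier of sample i. The percentages
    are applied to the number of unique patients, so image-level ratios are
    only approximate.
    """
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    n_train = len(unique) * percents[0] // 100
    n_test = len(unique) * percents[1] // 100
    groups = {
        "train": set(unique[:n_train]),
        "test": set(unique[n_train:n_train + n_test]),
        "val": set(unique[n_train + n_test:]),
    }
    split = {name: [] for name in groups}
    for index, patient in enumerate(patient_ids):
        for name, members in groups.items():
            if patient in members:
                split[name].append(index)
                break
    return split, groups


# Toy example: 30 radiographs from 10 patients (3 each).
toy_patients = ["p%d" % (i % 10) for i in range(30)]
toy_split, toy_groups = split_by_patient(toy_patients)
```

Splitting on patient identifiers first, then mapping samples to their patient's set, is what prevents images of the same patient from leaking between training and evaluation.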

One limitation we encountered is that the ChestX-ray14 dataset has a different set of labels from the other two datasets. We addressed this by training each model with all available labels, but reporting results only for the labels common to all three (No Finding, Atelectasis, Cardiomegaly, Edema, Lesion, Consolidation, Pneumonia, Pneumothorax). We created a "Lesion" label on ChestX-ray14 by joining the samples annotated with "Nodule" or "Mass". Another limitation is that ChestX-ray14 has only frontal X-rays; therefore, we only use the frontal samples from the other two datasets, leaving 191,229 samples on CheXpert and 249,995 on MIMIC-CXR.
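The label harmonization for ChestX-ray14 can be illustrated as follows. This is a minimal sketch: the helper name and the representation of a sample as a list of finding names are assumptions, and only the "Nodule" or "Mass" to "Lesion" rule and the list of common labels come from the text.

```python
# The eight labels reported for all three datasets, in the order given above.
COMMON_LABELS = [
    "No Finding", "Atelectasis", "Cardiomegaly", "Edema",
    "Lesion", "Consolidation", "Pneumonia", "Pneumothorax",
]


def harmonize_chestxray14(findings):
    """Map a sample's ChestX-ray14 finding names to a binary vector
    over COMMON_LABELS, adding "Lesion" when "Nodule" or "Mass" is present.
    """
    names = set(findings)
    if names & {"Nodule", "Mass"}:
        names.add("Lesion")
    # Findings outside the common set (e.g. "Hernia") are simply dropped here.
    return [1 if label in names else 0 for label in COMMON_LABELS]


toy_vector = harmonize_chestxray14(["Mass", "Edema", "Hernia"])
```

In this sketch the reduction is applied only when building evaluation targets; training with all available labels, as described above, would use the full label set of each dataset.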

To evaluate domain shift, we use a common performance metric for multi-label classification, the Area Under the Receiver Operating Characteristic curve (AUC), and report both the prediction results for individual radiological findings and their average for an overall view of model performance. The AUC takes both the true positive rate and the false positive rate into account. A higher AUC value suggests better performance.
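For reference, the AUC of a single finding equals the probability that a randomly chosen positive radiograph receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition and averages over findings; the function names are illustrative, and the pairwise loop is quadratic, so a library implementation would be preferable on full test sets.

```python
def roc_auc(labels, scores):
    """AUC for one finding, computed as the fraction of (positive, negative)
    pairs that the scores rank correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def mean_auc(per_finding):
    """Average AUC over findings; `per_finding` maps a finding name to a
    (labels, scores) pair."""
    aucs = {name: roc_auc(y, s) for name, (y, s) in per_finding.items()}
    return sum(aucs.values()) / len(aucs), aucs


toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
toy_mean, toy_aucs = mean_auc({
    "Edema": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "Lesion": ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
})
```

An AUC of 0.5 corresponds to random ranking, and values well below 0.5 mean that the scores are systematically inverted with respect to the labels.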

3 Results

We train the same neural network architecture with the same hyperparameters on each of the three datasets individually. Training and testing on ChestX-ray14 achieves results similar to the ones reported in [4]. After training, we load each model and evaluate it with images from the remaining two datasets.

Test set      Training set  Atelectasis  Cardiomegaly  Consolidation  Edema    Lesion   Pneumonia  Pneumothorax  No Finding  Mean
ChestX-ray14  ChestX-ray14  0.8165*      0.8998*       0.8181*        0.9066*  0.7935*  0.7633     0.8796*       0.7789*     0.8320*
              CheXpert      0.7850       0.8646        0.7771         0.8584   0.7291   0.7287     0.8464        0.7569      0.7933
              MIMIC-CXR     0.8024       0.8322        0.7898         0.8609   0.7457   0.7656*    0.8429        0.7652      0.8006
CheXpert      ChestX-ray14  0.5137       0.5736        0.6565         0.7097   0.6741   0.6259     0.7330        0.2682      0.5943
              CheXpert      0.6930*      0.8687*       0.7323*        0.8344*  0.7882*  0.7619*    0.8709*       0.8842*     0.8042*
              MIMIC-CXR     0.6576       0.8197        0.7002         0.7946   0.7465   0.7219     0.8046        0.8564      0.7627
MIMIC-CXR     ChestX-ray14  0.5810       0.6798        0.7692         0.8098   0.6561   0.6740     0.7675        0.2562      0.6492
              CheXpert      0.7587       0.7650        0.7936         0.8685   0.7527   0.6913     0.8142        0.8452      0.7861
              MIMIC-CXR     0.8177*      0.8126*       0.8229*        0.8922*  0.7788*  0.7461*    0.8845*       0.8718*     0.8283*

Table 2: Resulting AUCs for the 8 radiological findings common to the three datasets. Best results for each test set are marked with an asterisk.

We summarize our results in Table 2. The best results for each test set appear when the training set is from the same dataset. This shows that clinicians should expect a decrease in the reported performance of machine learning models when applying them in real-world scenarios. This decrease may vary according to the distribution of the dataset on which the model was trained. For instance, running a model trained on CheXpert on MIMIC-CXR's test set reduces the mean AUC by 0.0422 (from 0.8283 to 0.7861), while the model trained on ChestX-ray14 reduces it by 0.1791 (to 0.6492). On the CheXpert test set, training on MIMIC-CXR shows almost the same decrease in mean AUC (0.0415, from 0.8042 to 0.7627), reducing the AUC for all of the findings. The model trained on ChestX-ray14 has the highest average AUC when tested on its own test set, but when tested on the other datasets it shows the most significant performance drop, lowering CheXpert's mean AUC by 0.2099 and MIMIC-CXR's by 0.1791. "No Finding" is the label with the most notable domain shift for the model trained on ChestX-ray14 (AUCs of 0.2682 on CheXpert and 0.2562 on MIMIC-CXR), which calls the reliability of this label into question. The models trained on CheXpert and MIMIC-CXR both mostly preserve the ChestX-ray14 baseline mean AUC.

Clear evidence of the impact of domain shift on model performance is how frequently the best AUC for each radiological finding comes from the model trained on the same dataset. On the ChestX-ray14 test set, the best AUC comes from training on the same dataset for 7 out of 8 findings. The same phenomenon happens on both CheXpert (8 out of 8) and MIMIC-CXR (8 out of 8). Furthermore, in all three test sets, the best average AUC comes from the respective training set.
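The comparisons drawn from Table 2 can be re-derived with a few lines of arithmetic. The sketch below copies the per-finding AUCs from Table 2 and computes, for each test set, the drop in mean AUC relative to same-dataset training and the number of findings for which same-dataset training gives the best AUC; the data layout and function names are illustrative.

```python
# AUC[test_set][training_set] = per-finding AUCs in the column order of Table 2:
# Atelectasis, Cardiomegaly, Consolidation, Edema, Lesion, Pneumonia,
# Pneumothorax, No Finding.
AUC = {
    "ChestX-ray14": {
        "ChestX-ray14": [0.8165, 0.8998, 0.8181, 0.9066, 0.7935, 0.7633, 0.8796, 0.7789],
        "CheXpert": [0.7850, 0.8646, 0.7771, 0.8584, 0.7291, 0.7287, 0.8464, 0.7569],
        "MIMIC-CXR": [0.8024, 0.8322, 0.7898, 0.8609, 0.7457, 0.7656, 0.8429, 0.7652],
    },
    "CheXpert": {
        "ChestX-ray14": [0.5137, 0.5736, 0.6565, 0.7097, 0.6741, 0.6259, 0.7330, 0.2682],
        "CheXpert": [0.6930, 0.8687, 0.7323, 0.8344, 0.7882, 0.7619, 0.8709, 0.8842],
        "MIMIC-CXR": [0.6576, 0.8197, 0.7002, 0.7946, 0.7465, 0.7219, 0.8046, 0.8564],
    },
    "MIMIC-CXR": {
        "ChestX-ray14": [0.5810, 0.6798, 0.7692, 0.8098, 0.6561, 0.6740, 0.7675, 0.2562],
        "CheXpert": [0.7587, 0.7650, 0.7936, 0.8685, 0.7527, 0.6913, 0.8142, 0.8452],
        "MIMIC-CXR": [0.8177, 0.8126, 0.8229, 0.8922, 0.7788, 0.7461, 0.8845, 0.8718],
    },
}


def mean(values):
    return sum(values) / len(values)


def mean_auc_drop(test_set, training_set):
    """Mean AUC lost on `test_set` when training on `training_set`
    instead of on the test set's own dataset."""
    rows = AUC[test_set]
    return mean(rows[test_set]) - mean(rows[training_set])


def same_dataset_wins(test_set):
    """Number of findings whose best AUC on `test_set` comes from the
    model trained on the same dataset."""
    rows = AUC[test_set]
    n_findings = len(rows[test_set])
    return sum(
        max(rows, key=lambda train: rows[train][i]) == test_set
        for i in range(n_findings)
    )


for test_set in AUC:
    for training_set in AUC[test_set]:
        if training_set != test_set:
            print(test_set, training_set, round(mean_auc_drop(test_set, training_set), 4))
    print(test_set, "same-dataset wins:", same_dataset_wins(test_set))
```

Because the drops are recomputed from values rounded to four decimals, they may differ in the last digit from figures derived from unrounded results.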

One possible cause of domain shift is the label extraction method. CheXpert and MIMIC-CXR used the same labeler, while ChestX-ray14 has its own. The ChestX-ray14 labeler has also raised questions concerning its reliability. A visual inspection study [9] states that its labels do not accurately reflect the content of the images, with estimated label accuracies 10% to 30% lower than the values originally reported.

4 Related work

The impact of domain shift in medical imaging has been studied for brain tumors by AlBadawy et al. [10]. They showed how training models with data from an institution different from the one where they are tested affected the results of brain tumor segmentation. They also found that using multiple institutions for training does not necessarily remove this limitation.

We also see methods focused on unsupervised domain adaptation, where the task is to mitigate the problems of domain shift using unlabeled data from the target domain. Madani et al. [11] observed the problem of domain overfitting on chest radiographs and developed a semi-supervised learning method based on generative adversarial networks (GANs) capable of detecting cardiac abnormality alongside the adversarial objective. Their method was able to overcome the domain shift between two datasets and increased model performance when testing on a different domain. Chen et al. [12] developed a domain adaptation method based on CycleGANs. Their work resembles CyCADA, with the difference that it also introduces a discriminator for the network output, creating what they called semantic-aware GANs. Javanmardi and Tasdizen [13] use a framework very similar to domain-adversarial neural networks [14], which use a domain classification network with a gradient reversal layer (with an effect similar to that of a discriminator) to model a joint feature space.

Gholami et al. [15] propose a biophysics-based data augmentation method that produces synthetic tumor-bearing images for the training set. The authors argue that this augmentation procedure improves model generalization. Mahmood et al. [16] present a generative model that translates images from a simulated endoscopic image domain to a realistic-looking domain as data augmentation. The authors also introduce an L1-regularization loss between the translated and the original image to minimize distortions.

5 Discussion and Conclusion

In this work, we show how a state-of-the-art deep learning model fails to generalize to unseen datasets when they follow a somewhat different distribution. Our experiments show that a model with reported radiologist-level performance had a large drop in performance outside its source dataset, pointing to the existence of domain shift in chest X-ray datasets. Despite recent efforts to create large radiograph datasets in the hope of training generalized models, it seems that the data acquisition methodology of the available datasets does not capture the heterogeneity required for this purpose. Among the analyzed datasets, CheXpert and MIMIC-CXR seem to be the most representative of the other two, as the models trained on them show a smaller performance drop compared to the baseline. The least representative seems to be ChestX-ray14: the model trained on it did not predict as well outside its own test set, while the models trained on the other datasets performed well when tested on ChestX-ray14.

Although deep learning advances allow for new application scenarios, more model validation steps must be conducted, with more emphasis on external validation. We argue that case-by-case validation is ideal, where the model is validated on new data from each center. The reason is twofold. First, models must be able to properly handle data from a specific scenario. Second, the label distribution in each environment might change due to several external factors and might not match the prediction biases learned by the model. One way to address these limitations is to create small datasets from the specific machines where the model will be used and to fine-tune models trained on large available datasets.