
Generalization of Deep Convolutional Neural Networks – A Case-study on Open-source Chest Radiographs

by   Nazanin Mashhaditafreshi, et al.

Deep Convolutional Neural Networks (DCNNs) have attracted extensive attention and have been applied in many areas, including medical image analysis and clinical diagnosis. One major challenge is to build a DCNN model that performs well on both internal and external data. We demonstrate that DCNNs may not generalize to new data, but that increasing the quality and heterogeneity of the training data helps to improve generalizability. We use the InceptionResNetV2 and DenseNet121 architectures to predict the risk of 5 common chest pathologies. The experiments were conducted on three publicly available databases: CheXpert, ChestX-ray14, and MIMIC Chest X-ray JPG. The results show that, for each of the 5 pathologies, internal performance exceeded external performance for both models. Moreover, exposing the models to a mix of different datasets during the training phase helps to improve model performance on the external dataset.






The proliferation of big data coupled with non-linear data abstraction (filters) and high performance computing [1] has spurred rapid advancement in deep learning applications including speech recognition, sentiment analysis, computer vision, and machine translation. These areas were previously thought to be extremely hard for computers to analyze and required hundreds of hours of manual feature engineering, yet deep learning techniques deliver state-of-the-art performance with minimal human intervention. Medicine is witnessing rapid adoption and application of deep learning. For example, a large volume of radiology studies is performed daily in most centers, yet the number of available trained radiologists remains constant [2]. The opportunity to standardize the clinical workflow is thus seen as low-hanging fruit for automation using deep learning, although many efforts that try to replace radiologists using deep learning are regarded as hype.

Deep Convolutional Neural Networks (DCNNs) apply multiple layers of convolution operations to extract translation- and scale-invariant features from images, and are widely used to analyze radiology image content to assist in diagnosis. DCNNs have achieved expert-level performance for various chest pathologies [3, 4]. Beyond classification tasks on radiology images, researchers have attempted to rebuild the imaging workflow, assessing DCNN performance after non-image data fusion for classification of multi-label chest X-ray images [5]. Despite a plethora of publications improving on the state of the art, validation and scalability of deep learning in medicine remain limited, since model development and validation are frequently performed on a single institutional dataset. A review of studies published in 2018 found that only 6% (31 of 516) of published studies performed external validation (i.e., studies had a diagnostic cohort design, included data from multiple institutions, and performed prospective data collection) [6].

Overfitting is a well-known limitation of complex DCNN models and may produce overly optimistic performance estimates. Therefore, it is important for an optimized DCNN model to sustain its performance on unseen external datasets, which promotes model generalizability and the translation of models to real-life clinical work. Despite the high cost of labeling medical datasets, there are several publicly available datasets for chest radiographs that can be used for testing model generalization. These datasets include the MIMIC Chest X-ray JPG (MIMIC-CXR-JPG) Database v2.0.0 from Beth Israel Deaconess Medical Center in Boston [7, 8, 9]; the CheXpert dataset released by Stanford Hospital, comprising studies performed between October 2002 and July 2017 and coded with 14 common radiographic diseases [10]; and the ChestX-ray14 dataset from the U.S. National Institutes of Health [11]. The labels for these three datasets were derived from radiology text reports using natural language processing algorithms. Many publications based on these three datasets focus on novel DCNN design and development and present state-of-the-art performance. However, only a limited number of studies address the generalizability of a DCNN model trained on chest X-ray images, specifically assessing whether the model retains its performance and generalizes well on unseen datasets.


In this study, we perform thorough experiments to understand the generalizability of state-of-the-art DCNN models using data from three publicly available chest radiograph datasets. We selected 5 common pathologies (Cardiomegaly, Edema, Atelectasis, Consolidation, Pleural Effusion) and trained two state-of-the-art DCNN models (DenseNet121 [13], InceptionResNetV2 [14]). To evaluate the performance, we adopted performance metrics that have been previously published for disease recognition tasks [15]. We compared the external and internal performance of the models by training them on different partitions of data from the three datasets, and subsequently tested each model on various combinations of the test sets of these datasets. We report the test AUC of each experiment; the results show that the performance of the DCNNs on internal test sets exceeds their performance on external test sets.

Materials and Methods

In this section, we present details of the public datasets and architecture of the two DCNN models that were used for our experiments. We also describe the experimentation details to test the generalization capability of the DCNN models.


The CheXpert [10] dataset comprises frontal and lateral chest radiographs of patients. The ChestX-ray14 [11] dataset contains frontal-view X-ray images of patients. The MIMIC-CXR-JPG [7, 8, 9] dataset consists of images of patients. We selected the five diseases (Atelectasis, Edema, Pleural Effusion, Consolidation, Cardiomegaly) that were common to the above datasets. We randomly split the CheXpert dataset into training ( images), validation ( images), and test ( images) sets. The ChestX-ray14 dataset was also randomly divided into training ( images), validation ( images), and test ( images) sets. There were no overlapping patients between the training, validation, and test sets for the CheXpert and ChestX-ray14 datasets. In addition, we preserved the original test set of the MIMIC-CXR-JPG dataset ( images) and treated it as an external test set for every configuration.
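The patient-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_patient`, the `(patient_id, image_path)` record layout, and the 80/10/10 fractions are all assumptions chosen for the example, since the actual split sizes are not recoverable from the text.

```python
import random

def split_by_patient(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split image records so that no patient appears in two sets.

    records:   list of (patient_id, image_path) tuples (hypothetical layout).
    fractions: assumed train/validation/test shares of patients.
    """
    # Shuffle patients, not images, so all images of a patient stay together.
    patients = sorted({pid for pid, _ in records})
    random.Random(seed).shuffle(patients)

    n_train = int(fractions[0] * len(patients))
    n_val = int(fractions[1] * len(patients))
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    return {name: [r for r in records if r[0] in ids]
            for name, ids in groups.items()}

# Toy data: 10 patients with 3 images each.
records = [(f"p{i}", f"p{i}_img{j}.jpg") for i in range(10) for j in range(3)]
splits = split_by_patient(records)
```

Splitting on patient identifiers rather than on images is what prevents near-duplicate radiographs of one patient from leaking between the training and test sets.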

The ChestX-ray14 and CheXpert datasets provide some non-image features such as age, gender, and radiographic positioning. Figure 1(a) shows the distribution of patients’ age and gender for the ChestX-ray14 dataset. The average age is years with a standard deviation of years for this dataset. Patients’ age and gender distributions for the CheXpert dataset are shown in Figure 1(b). The average age and the standard deviation are and years, respectively. The distribution of gender is quite similar between the two datasets, but the ChestX-ray14 dataset has a larger proportion of younger patients than the CheXpert dataset.

Figure 1: Distributions of patients’ age and gender. (a) ChestX-ray14 Dataset; (b) CheXpert Dataset.

Data Preprocessing

For the ground truth labels, we used the binary mapping approach for handling uncertainty in labels, in which the uncertain labels were replaced with 1 (U-Ones model), or 0 (U-Zeroes model). We used U-Ones model for uncertain labels for Atelectasis, Edema, and Pleural Effusion and U-Zeroes model for Cardiomegaly and Consolidation based on the CheXpert results [10].
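A minimal sketch of this per-pathology uncertainty mapping is shown below. It assumes that uncertain labels are encoded as -1 and that a finding absent from the record is treated as negative; both are assumptions of this example, and the names `UNCERTAINTY_POLICY` and `resolve_labels` are hypothetical.

```python
# Fill value for uncertain labels, per pathology, following the text above.
UNCERTAINTY_POLICY = {
    "Atelectasis": 1,       # U-Ones
    "Edema": 1,             # U-Ones
    "Pleural Effusion": 1,  # U-Ones
    "Cardiomegaly": 0,      # U-Zeroes
    "Consolidation": 0,     # U-Zeroes
}
UNCERTAIN = -1  # assumed encoding of an uncertain label

def resolve_labels(raw):
    """Map one raw label record to binary labels for the five pathologies."""
    resolved = {}
    for pathology, fill in UNCERTAINTY_POLICY.items():
        value = raw.get(pathology)
        if value is None:
            resolved[pathology] = 0      # assumption: unmentioned means negative
        elif value == UNCERTAIN:
            resolved[pathology] = fill   # apply U-Ones or U-Zeroes
        else:
            resolved[pathology] = int(value)
    return resolved

example = {"Atelectasis": -1, "Edema": 1, "Cardiomegaly": -1, "Consolidation": 0}
resolved = resolve_labels(example)
```

In this example the uncertain Atelectasis label becomes positive while the uncertain Cardiomegaly label becomes negative, which is the asymmetry the two policies introduce.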

For image preprocessing, we applied the contrast-limited adaptive histogram equalization (CLAHE) technique [16] for contrast enhancement on all training and validation images before feeding them into the network. Thereafter, we normalized all images based on the mean and standard deviation of images in the ImageNet [17] training set. All the images were resized to pixels for InceptionResNetV2 and pixels for the DenseNet121 architecture. The scikit-image transform module, which applies first-order spline interpolation for image downscaling and a Gaussian filter to eliminate aliasing artifacts, was used to resize the images. A constant value of 0 was used to fill the points outside the input boundaries. 50% of the training data was augmented with random horizontal flipping.
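The normalization and flipping steps from the paragraph above might look roughly like the sketch below. It covers only those two steps (CLAHE and resizing are left out to keep it dependency-light) and rests on several assumptions: images are already scaled to [0, 1] with three channels, the commonly quoted ImageNet channel statistics are the ones intended, and "50% augmented" is read as a flip probability of 0.5. The function names are hypothetical.

```python
import numpy as np

# Commonly used ImageNet channel statistics for RGB images scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(image):
    """Standardize an (H, W, 3) float image channel by channel."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD

def augment(image, rng, flip_probability=0.5):
    """Flip the image left-to-right with the given probability."""
    if rng.random() < flip_probability:
        return image[:, ::-1, :]  # reverse the width axis
    return image

rng = np.random.default_rng(0)
image = rng.random((4, 6, 3))          # stand-in for a resized radiograph
out = augment(normalize(image), rng)   # augmentation applies to training data only
```

Because chest radiographs are single-channel, a real pipeline would first have to replicate the grayscale channel three times to match the ImageNet-pretrained input layout; that step is also an assumption here.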

Model Architecture and Implementation

We compared two well-known architectures: DenseNet121 [13] and InceptionResNetV2 [14]. InceptionResNetV2 combines the Inception architecture (which is a very deep convolutional neural network) with residual connections, while DenseNets simplify the connectivity pattern between layers by connecting all layers directly with each other. While residual connections (used in InceptionResNetV2) sum up the outputs of multiple layers, DenseNets concatenate the outputs of multiple connected layers. DenseNets are known to avoid learning redundant feature maps and have much better feature reuse than traditional convolutional neural networks.
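The difference between summing and concatenating layer outputs can be illustrated with a toy sketch. Plain Python lists stand in for the channel dimension of a feature map; the function names are hypothetical and no real network layer is involved.

```python
def residual_merge(x, fx):
    """Residual connection: element-wise sum, channel count unchanged."""
    return [a + b for a, b in zip(x, fx)]

def dense_merge(features):
    """Dense connection: concatenate all earlier outputs along the channels."""
    merged = []
    for feature in features:
        merged.extend(feature)
    return merged

x = [1.0, 2.0]     # input with 2 channels
fx = [0.5, -1.0]   # output of some layer applied to x, also 2 channels

res = residual_merge(x, fx)    # still 2 channels
dense = dense_merge([x, fx])   # grows to 4 channels
```

The sum keeps the width fixed and mixes the two signals, whereas the concatenation preserves every earlier feature map unchanged for later layers, which is the basis of the feature-reuse argument above.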

InceptionResNetV2 consists of layers and trainable parameters (Figure 2) and is more complex than DenseNet121, which has layers and trainable parameters (Figure 3). It intuitively follows that DenseNet121 requires less memory than InceptionResNetV2 and is less susceptible to the vanishing-gradient problem. InceptionResNetV2 achieves better top-1 accuracy on the ImageNet-1k validation set for the image classification task.


Figure 2: InceptionResNetV2 Architecture. Schematic of the InceptionResNetV2 model; A, B, and C are Inception modules, each comprising several convolutional layers; A and B are reduction modules, which reduce the size of the output. [18]
Figure 3: DenseNet121 Architecture (Dx: Dense Block x, Tx: Transition Block x, DLx: Dense Layer x) [19]

We selected these two architectures to assess the impact of model complexity on external test performance. We largely preserved each architecture while adapting it to our classification task. We removed the top layer and replaced it with a global average 2D pooling layer and a dense layer with sigmoid activation as the last fully-connected layer. Parameter was set to for the last layer to match the labels of our classification problem.

We used binary cross-entropy loss and the Adam accumulate optimizer for training each network, starting with pre-trained ImageNet weights. For model training, we set , and . We used a GTX 1080 Ti GPU and the Keras Python library.
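For a multi-label output with one sigmoid unit per pathology, binary cross-entropy is computed independently for each label and then averaged. The sketch below shows that calculation in plain Python for a single image; the function names, the clipping constant `eps`, and the example logits are assumptions of this illustration, not values from the study.

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(labels, logits, eps=1e-7):
    """Mean binary cross-entropy over the labels of one image.

    eps clips probabilities away from 0 and 1 to avoid log(0).
    """
    total = 0.0
    for y, z in zip(labels, logits):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# Five labels, one per pathology (illustrative values only).
labels = [0, 1, 1, 0, 1]
logits = [-2.0, 1.5, 0.0, -0.5, 3.0]
loss = binary_cross_entropy(labels, logits)
```

Treating each pathology as its own binary problem is what allows an image to be positive for several findings at once, unlike a softmax over mutually exclusive classes.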


We used three distinct configurations to assess the generalization capabilities of state-of-the-art neural networks. For ChestX-ray14 and CheXpert, we randomly split the patients into train, validation, and test sets (Figures 4 and 5). For MIMIC-CXR-JPG, we kept the original test set of images from patients (Figure 6). The three evaluation configurations are described in detail below.

In the first configuration, we trained our models on the train set of CheXpert. We tested the models on the test set of the CheXpert dataset for the in-sample data and the test set of the MIMIC-CXR-JPG dataset as the external test set. In the second configuration, we trained our models on the train set of ChestX-ray14. Models were tested on the test set of the ChestX-ray14 dataset as the in-sample test and the test set of the MIMIC-CXR-JPG dataset as the external test. In the third configuration, to increase the variation of training samples, we trained our models on the combination of train sets of CheXpert and ChestX-ray14 datasets. Models were tested on combined test sets of CheXpert and ChestX-ray14 datasets as the in-sample and the MIMIC-CXR-JPG test set as the external test.
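The three configurations can be written down as a small data structure, which makes the train/test pairing explicit. This is only an organizational sketch: the `EvalConfig` class, the configuration names, and the `evaluation_plan` helper are hypothetical and do not come from the study's code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConfig:
    name: str
    train_sets: tuple                           # datasets used for train/validation
    internal_test: tuple                        # in-sample test sets
    external_test: tuple = ("MIMIC-CXR-JPG",)   # held out in every configuration

CONFIGS = [
    EvalConfig("first", ("CheXpert",), ("CheXpert",)),
    EvalConfig("second", ("ChestX-ray14",), ("ChestX-ray14",)),
    EvalConfig("third", ("CheXpert", "ChestX-ray14"),
               ("CheXpert", "ChestX-ray14")),
]

def evaluation_plan(configs, models=("DenseNet121", "InceptionResNetV2")):
    """List every (model, configuration, comparison type, test sets) run."""
    plan = []
    for model in models:
        for cfg in configs:
            plan.append((model, cfg.name, "internal", cfg.internal_test))
            plan.append((model, cfg.name, "external", cfg.external_test))
    return plan

plan = evaluation_plan(CONFIGS)  # 2 models x 3 configurations x 2 test types
```

Keeping MIMIC-CXR-JPG out of every `train_sets` entry is the property that makes it a true external test set in all three configurations.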

Figure 4: Flowchart of images used in this study from the ChestX-ray14 dataset for train, validation, and test sets.
Figure 5: Flowchart of images used in this study from the CheXpert dataset for train, validation, and test sets.
Figure 6: Flowchart of images used in this study from the MIMIC-CXR-JPG dataset for test set.


In this section, we present a discussion on the results of our experiments. We also include observations made by a radiologist on the performance of the two models.

Model Performance

We performed thorough experimentation to evaluate the robustness and generalization capabilities of two very popular deep learning classification architectures, i.e., InceptionResNetV2 and DenseNet121, for the classification task of diagnosing common chest diseases. As explained in the previous section, we used three different training schemes and tested each trained model on internal and external test sets. Our diagnosis task includes five labels (Cardiomegaly, Edema, Atelectasis, Consolidation, Pleural Effusion) common to the three datasets that we used. We employed AUC (area under the receiver operating characteristic curve) as the evaluation measure. The three public datasets have severe class imbalance for the five selected pathologies. The overall performance of both models in terms of AUC values is reported in Table 1.
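AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition for one label; it is a quadratic-time illustration with made-up scores, not the evaluation code used in the study, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of 4 pairs ranked correctly -> 0.75
```

Because the measure depends only on how positives rank against negatives, it does not require choosing a decision threshold, which is one reason it is commonly reported for imbalanced labels like these.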


The two architectures differ widely in complexity, both in the depth of the network and in the number of trainable parameters. Still, both architectures perform comparably on the internal and external test sets. In general, the performance of each model on the internal test set is better than its performance on the external test set for every configuration. As shown in Table 1, both architectures perform better at diagnosing Cardiomegaly and Edema on the internal test set than on the external test set under the first evaluation configuration (training on the CheXpert dataset). On the other hand, performance for the diagnosis of Atelectasis, Consolidation, and Pleural Effusion is quite similar for the internal and external test sets under the same configuration for both architectures.

Both architectures perform much better on the internal test set for all labels under the second configuration (training on the ChestX-ray14 dataset) than under the first configuration. On the other hand, performance on the external test set is worse than the corresponding internal test performance for both architectures for all labels under this configuration.


Evaluation Configuration (Train/Validation set)    Comparison Type    Test Set                      Pleural Effusion
Configuration (CheXpert)                           Internal           CheXpert
Configuration (ChestX-ray14)                       Internal           ChestX-ray14
Configuration (CheXpert and ChestX-ray14)          Internal           CheXpert and ChestX-ray14
Configuration (CheXpert)                           Internal           CheXpert
Configuration (ChestX-ray14)                       Internal           ChestX-ray14
Configuration (CheXpert and ChestX-ray14)          Internal           CheXpert and ChestX-ray14
Table 1: Performance of InceptionResNetV2 and DenseNet121 models for all evaluation configurations for internal and external test sets in terms of AUC-ROC value

The third configuration involves training on a larger and more varied set obtained by combining the training sets of the ChestX-ray14 and CheXpert datasets. The comparison between internal and external test sets for both architectures is similar to that observed in the other two configurations: performance on internal sets is generally better than performance on external sets. Internal performance shows a mixed trend relative to the other two configurations. For both architectures, the AUCs for Edema and Pleural Effusion are better than the corresponding values of the other configurations for the same models. On the other hand, external test set performance improves for almost all labels under this configuration for both models. Hence, more varied training sets seem critical in improving the generalization capabilities of trained models.

Figure 7: ROC curves for DenseNet121 for all evaluation configurations. Panels: (a) Train: CheXpert, Test: Internal; (b) Train: ChestX-ray14, Test: Internal; (c) Train: CheXpert-ChestX-ray14, Test: Internal; (d) Train: CheXpert, Test: External; (e) Train: ChestX-ray14, Test: External; (f) Train: CheXpert-ChestX-ray14, Test: External.
Figure 8: ROC curves for InceptionResNetV2 for all evaluation configurations. Panels: (a) Train: CheXpert, Test: Internal; (b) Train: ChestX-ray14, Test: Internal; (c) Train: CheXpert-ChestX-ray14, Test: Internal; (d) Train: CheXpert, Test: External; (e) Train: ChestX-ray14, Test: External; (f) Train: CheXpert-ChestX-ray14, Test: External.

Combining the datasets changes the prevalence of all the diseases in the training data, and the resulting models achieve better AUCs on the external test set than the corresponding models of the other configurations, with two exceptions. First, DenseNet121 trained on CheXpert to detect Consolidation has a better external test AUC than the model trained jointly on CheXpert-ChestX-ray14 (AUC ). Second, InceptionResNetV2 trained on CheXpert to detect Edema has a better external test AUC than the model trained jointly on CheXpert-ChestX-ray14 (AUC ).

ROC curves for all evaluation configurations are displayed in Figures 7 and 8 for DenseNet121 and InceptionResNetV2 respectively. It is evident that ROC curves have larger areas under the curve for internal test sets for all configurations for both models than corresponding curves for external test sets. For the joint training set (CheXpert-ChestX-ray14), curves have larger areas under the curve for the external test than the other two configurations for both models. This trend indicates better generalization capacity of the trained model under joint training configuration.

Radiologist Evaluation

We observed that the models tend to perform very differently for different labels. For example, the performance of both models is quite poor for Cardiomegaly under all evaluation configurations. We consulted a board-certified radiologist to further investigate this trend. The radiologist reviewed randomly chosen MIMIC-CXR-JPG images (external test set) with the Cardiomegaly label that were misclassified by the models trained under the third configuration, i.e., trained on the joint CheXpert and ChestX-ray14 datasets. Table 2 shows a few samples of such images with their groundtruth label and the probability of that label predicted by each model.

Radiologists rely on the cardiothoracic ratio to diagnose Cardiomegaly. It is the ratio of the maximal horizontal cardiac diameter to the maximal horizontal thoracic diameter () and is measured on a posteroanterior (PA) chest X-ray. A normal measurement should be less than . The first and third images shown in Table 2 were incorrectly predicted to have Cardiomegaly by DenseNet121, whereas InceptionResNetV2 correctly predicted the absence of Cardiomegaly for these images. DenseNet121 also incorrectly predicted that the second image shown in Table 2 does not have Cardiomegaly, while InceptionResNetV2 correctly predicted Cardiomegaly for this image. On the other hand, InceptionResNetV2 incorrectly predicted the fourth image shown in Table 2 to have Cardiomegaly. The radiologist assessed that these are all borderline Cardiomegaly cases based on the cardiothoracic ratio. The models therefore performed poorly on borderline cases. Another observation is that the presence of a pacemaker is closely tied to a Cardiomegaly prediction for both models. The presence of this foreign object may bias the decision of the DCNN models.
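The cardiothoracic ratio described above is a simple quotient of two measured diameters. The sketch below computes it and flags borderline values; the 0.5 cutoff is the commonly quoted convention for PA radiographs and is assumed here rather than taken from the text, while the `margin` defining "borderline" and the example measurements are purely hypothetical.

```python
def cardiothoracic_ratio(cardiac_diameter, thoracic_diameter):
    """Ratio of maximal horizontal cardiac to thoracic diameter (same units)."""
    if thoracic_diameter <= 0:
        raise ValueError("thoracic diameter must be positive")
    return cardiac_diameter / thoracic_diameter

def exceeds_cutoff(ratio, cutoff=0.5):
    """True when the ratio is above the assumed normal limit."""
    return ratio > cutoff

def is_borderline(ratio, cutoff=0.5, margin=0.03):
    """True when the ratio lies within a hypothetical margin of the cutoff."""
    return abs(ratio - cutoff) <= margin

# Illustrative measurements in centimetres (not from the study).
ratio = cardiothoracic_ratio(14.2, 29.0)
enlarged = exceeds_cutoff(ratio)
borderline = is_borderline(ratio)
```

A ratio this close to the cutoff would flip between normal and enlarged with a measurement change of a few millimetres, which gives some intuition for why borderline cases are hard for both readers and models.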

Sample Images        Image 1            Image 2         Image 3            Image 4
Groundtruth Label    No Cardiomegaly    Cardiomegaly    No Cardiomegaly    No Cardiomegaly
InceptionResNetV2    0.496              0.605           0.472              0.565
DenseNet121          0.696              0.477           0.553              0.419
Table 2: Incorrectly classified radiographic images of borderline Cardiomegaly pathology; the groundtruth label for each image as well as the probabilities of that label predicted by both models are included. The maximum diameter of the heart and the thoracic diameter were measured by the radiologist (shown by red lines) to estimate the cardiothoracic ratio to diagnose Cardiomegaly. All images were classified as Borderline Cardiomegaly by the radiologist.
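The probabilities in Table 2 can be checked against the groundtruth labels by thresholding them. The sketch below uses the values from the table; the decision threshold of 0.5 is an assumption (it is consistent with the errors described in the text but is not stated), and the helper name `misclassified` is hypothetical.

```python
# Values copied from Table 2; 1 = Cardiomegaly, 0 = No Cardiomegaly.
GROUND_TRUTH = [0, 1, 0, 0]
PREDICTED = {
    "InceptionResNetV2": [0.496, 0.605, 0.472, 0.565],
    "DenseNet121": [0.696, 0.477, 0.553, 0.419],
}
THRESHOLD = 0.5  # assumed decision threshold

def misclassified(probabilities, truth, threshold=THRESHOLD):
    """Return 1-based image indices where the thresholded prediction is wrong."""
    return [i + 1
            for i, (p, y) in enumerate(zip(probabilities, truth))
            if int(p >= threshold) != y]

errors = {model: misclassified(probs, GROUND_TRUTH)
          for model, probs in PREDICTED.items()}
# errors == {"InceptionResNetV2": [4], "DenseNet121": [1, 2, 3]}
```

Under this threshold, every image is misclassified by exactly one of the two models, and several of the probabilities lie within a few hundredths of 0.5, in line with the radiologist's assessment that these are borderline cases.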


In this paper, we thoroughly studied the robustness and generalization capabilities of complex deep learning classification models. We used three publicly available chest X-ray datasets (CheXpert, ChestX-ray14, MIMIC-CXR-JPG) and experimented with two different DCNN models (InceptionResNetV2, DenseNet121). Our experiments indicate that these models have limited generalization capacity when tested on images from outside the training dataset, i.e., an external test set. For all class labels under every evaluation configuration, the performance of each model is better on the internal test set than on the external test set. We also worked on improving the generalization capabilities of these models. Our technique relies on improving the quality of the training data by combining images from different datasets, thus increasing the data variation the models are exposed to during the training phase. This technique proved effective, as the performance of the models improved on the external test set for almost all labels. In this case, even incorrectly predicted labels tend to be borderline cases of their corresponding pathologies. Interestingly, the performance of each model on the internal test sets remains at approximately the same level. Therefore, we conclude that the generalization of deep learning classification models to a larger variety of data depends heavily on the quality and heterogeneity of the training dataset. Exposing the model to multiple datasets with wide variation during the training phase is an effective technique for improving the generalization capabilities of the trained model.