CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning

by Pranav Rajpurkar, et al.

We develop an algorithm that can detect pneumonia from chest X-rays at a level exceeding practicing radiologists. Our algorithm, CheXNet, is a 121-layer convolutional neural network trained on ChestX-ray14, currently the largest publicly available chest X-ray dataset, containing over 100,000 frontal-view X-ray images with 14 diseases. Four practicing academic radiologists annotate a test set, on which we compare the performance of CheXNet to that of radiologists. We find that CheXNet exceeds average radiologist performance on pneumonia detection on both sensitivity and specificity. We extend CheXNet to detect all 14 diseases in ChestX-ray14 and achieve state of the art results on all 14 diseases.


1 Introduction

More than 1 million adults are hospitalized with pneumonia and around 50,000 die from the disease every year in the US alone (CDC, 2017). Chest X-rays are currently the best available method for diagnosing pneumonia (WHO, 2001), playing a crucial role in clinical care (Franquet, 2001) and epidemiological studies (Cherian et al., 2005). However, detecting pneumonia in chest X-rays is a challenging task that relies on the availability of expert radiologists. In this work, we present a model that can automatically detect pneumonia from chest X-rays at a level exceeding practicing radiologists.

Our model, CheXNet (shown in Figure 1), is a 121-layer convolutional neural network that takes a chest X-ray image as input and outputs the probability of pneumonia along with a heatmap localizing the areas of the image most indicative of pneumonia. We train CheXNet on the recently released ChestX-ray14 dataset (Wang et al., 2017), which contains 112,120 frontal-view chest X-ray images individually labeled with up to 14 different thoracic diseases, including pneumonia. We use dense connections (Huang et al., 2016) and batch normalization (Ioffe & Szegedy, 2015) to make the optimization of such a deep network tractable.

Detecting pneumonia in chest radiography can be difficult for radiologists. The appearance of pneumonia in X-ray images is often vague, can overlap with other diagnoses, and can mimic many other benign abnormalities. These discrepancies cause considerable variability among radiologists in the diagnosis of pneumonia (Neuman et al., 2012; Davies et al., 1996; Hopstaken et al., 2004). To estimate radiologist performance, we collect annotations from four practicing academic radiologists on a subset of 420 images from ChestX-ray14. On these 420 images, we measure the performance of individual radiologists and the model.

We find that the model exceeds the average radiologist performance on the pneumonia detection task. To compare CheXNet against previous work using ChestX-ray14, we make simple modifications to CheXNet to detect all 14 diseases in ChestX-ray14, and find that we outperform the best published results on all 14 diseases. Automated detection of diseases from chest X-rays at the level of expert radiologists would not only have tremendous benefit in clinical settings, but would also be invaluable in the delivery of health care to populations with inadequate access to diagnostic imaging specialists.

                    F1 Score (95% CI)
Radiologist 1       0.383 (0.309, 0.453)
Radiologist 2       0.356 (0.282, 0.428)
Radiologist 3       0.365 (0.291, 0.435)
Radiologist 4       0.442 (0.390, 0.492)
Radiologist Avg.    0.387 (0.330, 0.442)
CheXNet             0.435 (0.387, 0.481)

Table 1: We compare radiologists and our model on the F1 metric, which is the harmonic average of the precision and recall of the models. CheXNet achieves an F1 score of 0.435 (95% CI 0.387, 0.481), higher than the radiologist average of 0.387 (95% CI 0.330, 0.442). We use the bootstrap to find that the difference in performance is statistically significant.

2 CheXNet

2.1 Problem Formulation

The pneumonia detection task is a binary classification problem, where the input is a frontal-view chest X-ray image $X$ and the output is a binary label $y \in \{0, 1\}$ indicating the absence or presence of pneumonia respectively. For a single example in the training set, we optimize the weighted binary cross entropy loss

$$L(X, y) = -w_{+} \cdot y \log p(Y = 1 \mid X) - w_{-} \cdot (1 - y) \log p(Y = 0 \mid X),$$

where $p(Y = i \mid X)$ is the probability that the network assigns to the label $i$, $w_{+} = |N| / (|P| + |N|)$, and $w_{-} = |P| / (|P| + |N|)$, with $|P|$ and $|N|$ the number of positive cases and negative cases of pneumonia in the training set respectively.
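As an illustration of this loss, the following is a minimal sketch in plain Python, assuming the class weights are w_+ = |N| / (|P| + |N|) and w_- = |P| / (|P| + |N|), so that the rarer positive class receives the larger weight. The function name `weighted_bce` and the clipping constant `eps` are our own illustrative choices, not part of the described training code.

```python
import math

def weighted_bce(p_pos, y, n_pos, n_neg, eps=1e-7):
    """Weighted binary cross entropy for a single example.

    p_pos : predicted probability of pneumonia, p(Y=1|X)
    y     : ground-truth label, 0 or 1
    n_pos : number of positive training cases, |P|
    n_neg : number of negative training cases, |N|
    eps   : clipping constant to avoid log(0) (illustrative choice)
    """
    p_pos = min(max(p_pos, eps), 1.0 - eps)
    w_pos = n_neg / (n_pos + n_neg)  # weight on the positive term
    w_neg = n_pos / (n_pos + n_neg)  # weight on the negative term
    return -(w_pos * y * math.log(p_pos)
             + w_neg * (1 - y) * math.log(1.0 - p_pos))
```

With 1 positive and 99 negative training cases, an uncertain prediction of 0.5 costs 99 times more on a positive example than on a negative one, which is the intended effect of the weighting under heavy class imbalance.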

2.2 Model Architecture and Training

CheXNet is a 121-layer Dense Convolutional Network (DenseNet) (Huang et al., 2016) trained on the ChestX-ray14 dataset. DenseNets improve flow of information and gradients through the network, making the optimization of very deep networks tractable. We replace the final fully connected layer with one that has a single output, after which we apply a sigmoid nonlinearity.

The weights of the network are initialized with weights from a model pretrained on ImageNet (Deng et al., 2009). The network is trained end-to-end using Adam with standard parameters ($\beta_1 = 0.9$ and $\beta_2 = 0.999$) (Kingma & Ba, 2014). We train the model using minibatches of size 16. We use an initial learning rate of 0.001 that is decayed by a factor of 10 each time the validation loss plateaus after an epoch, and pick the model with the lowest validation loss.
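The learning-rate schedule and model selection can be sketched as follows. This is a simplified illustration, assuming an initial learning rate of 0.001, a decay factor of 10, and that a "plateau" means the validation loss failed to improve on the best value seen so far; the exact plateau criterion is not specified here, and the function name `train_schedule` is hypothetical.

```python
def train_schedule(val_losses, lr=1e-3, factor=10.0):
    """Replay a sequence of per-epoch validation losses.

    Returns the learning rate used in each epoch and the index of the
    epoch with the lowest validation loss (the model that would be kept).
    """
    best_loss = float("inf")
    best_epoch = -1
    lrs = []
    for epoch, loss in enumerate(val_losses):
        lrs.append(lr)              # rate in effect during this epoch
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch      # a checkpoint would be saved here
        else:
            lr /= factor            # assumed plateau rule: no improvement
    return lrs, best_epoch
```

For example, calling `train_schedule([0.5, 0.4, 0.41, 0.39, 0.39])` keeps the rate at 0.001 for the first three epochs, drops it tenfold after the third epoch fails to improve, and selects the fourth epoch as the best model.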

Pathology            Wang et al. (2017)   Yao et al. (2017)   CheXNet (ours)
Atelectasis          0.716                0.772               0.8094
Cardiomegaly         0.807                0.904               0.9248
Effusion             0.784                0.859               0.8638
Infiltration         0.609                0.695               0.7345
Mass                 0.706                0.792               0.8676
Nodule               0.671                0.717               0.7802
Pneumonia            0.633                0.713               0.7680
Pneumothorax         0.806                0.841               0.8887
Consolidation        0.708                0.788               0.7901
Edema                0.835                0.882               0.8878
Emphysema            0.815                0.829               0.9371
Fibrosis             0.769                0.767               0.8047
Pleural Thickening   0.708                0.765               0.8062
Hernia               0.767                0.914               0.9164
Table 2: Per-class AUROC on the test set. CheXNet outperforms the best published results on all 14 pathologies in the ChestX-ray14 dataset. In detecting Mass, Nodule, Pneumonia, and Emphysema, CheXNet has a margin of more than 0.05 AUROC over previous state of the art results.

3 Data

3.1 Training

We use the ChestX-ray14 dataset released by Wang et al. (2017) which contains 112,120 frontal-view X-ray images of 30,805 unique patients. Wang et al. (2017) annotate each image with up to 14 different thoracic pathology labels using automatic extraction methods on radiology reports. We label images that have pneumonia as one of the annotated pathologies as positive examples and label all other images as negative examples. For the pneumonia detection task, we randomly split the dataset into training (28744 patients, 98637 images), validation (1672 patients, 6351 images), and test (389 patients, 420 images). There is no patient overlap between the sets.

Before inputting the images into the network, we downscale the images to 224×224 and normalize based on the mean and standard deviation of images in the ImageNet training set. We also augment the training data with random horizontal flipping.
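The normalization and augmentation steps can be sketched with NumPy as below. The sketch assumes the image has already been resized and converted to a floating-point RGB array with values in [0, 1]; the per-channel statistics are the values commonly used for ImageNet-pretrained models and are an assumption on our part, as is the 0.5 flip probability. The function name `preprocess` is illustrative.

```python
import numpy as np

# Commonly used ImageNet per-channel RGB statistics (assumed values).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(image, rng=None, train=False):
    """Normalize an (H, W, 3) image in [0, 1]; optionally flip it.

    rng   : a numpy Generator, needed only when train=True
    train : if True, flip left-right with probability 0.5 (assumed)
    """
    out = (np.asarray(image, dtype=float) - IMAGENET_MEAN) / IMAGENET_STD
    if train and rng is not None and rng.random() < 0.5:
        out = out[:, ::-1, :]  # reverse the width axis
    return out
```

Flipping is applied only at training time, so validation and test images pass through the same deterministic normalization.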

3.2 Test

We collected a test set of 420 frontal chest X-rays. Annotations were obtained independently from four practicing radiologists at Stanford University, who were asked to label all 14 pathologies in Wang et al. (2017). The radiologists had 4, 7, 25, and 28 years of experience, and one of the radiologists is a sub-specialty fellowship trained thoracic radiologist. Radiologists did not have access to any patient information or knowledge of disease prevalence in the data. Labels were entered into a standardized data entry program.

(a) Patient with multifocal community acquired pneumonia. The model correctly detects the airspace disease in the left lower and right upper lobes to arrive at the pneumonia diagnosis.

(b) Patient with a left lung nodule. The model identifies the left lower lobe lung nodule and correctly classifies the pathology.

(c) Patient with primary lung malignancy and two large masses, one in the left lower lobe and one in the right upper lobe adjacent to the mediastinum. The model correctly identifies both masses in the X-ray.
(d) Patient with a right-sided pneumothorax and chest tube. The model detects the abnormal lung to correctly predict the presence of pneumothorax (collapsed lung).
(e) Patient with a large right pleural effusion (fluid in the pleural space). The model correctly labels the effusion and focuses on the right lower chest.
(f) Patient with congestive heart failure and cardiomegaly (enlarged heart). The model correctly identifies the enlarged cardiac silhouette.
Figure 2: CheXNet localizes pathologies it identifies using Class Activation Maps, which highlight the areas of the X-ray that are most important for making a particular pathology classification. The captions for each image are provided by one of the practicing radiologists.

4 CheXNet vs. Radiologist Performance

4.1 Comparison

We assess the performance of both radiologists and CheXNet on the test set for the pneumonia detection task. Recall that for each of the images in the test set, we have 4 labels from four practicing radiologists and 1 label from CheXNet. We compute the F1 score for each individual radiologist and for CheXNet against each of the other 4 labels as ground truth. We report the mean of the 4 resulting F1 scores for each radiologist and for CheXNet, along with the average F1 across the radiologists. We use the bootstrap to construct 95% bootstrap confidence intervals (CIs), calculating the average F1 score for both the radiologists and CheXNet on 10,000 bootstrap samples, sampled with replacement from the test set. We take the 2.5th and 97.5th percentiles of the F1 scores as the 95% bootstrap CI. We find that CheXNet achieves an F1 score of 0.435 (95% CI 0.387, 0.481), higher than the radiologist average of 0.387 (95% CI 0.330, 0.442). Table 1 summarizes the performance of each radiologist and of CheXNet.

To determine whether CheXNet's performance is statistically significantly higher than radiologist performance, we also calculate the difference between the average F1 score of CheXNet and the average F1 score of the radiologists on the same bootstrap samples. If the 95% CI on the difference does not include zero, we conclude that there is a significant difference between the F1 score of CheXNet and the F1 score of the radiologists. We find that the difference in F1 scores is 0.051 (95% CI 0.005, 0.084); since this interval does not contain 0, we conclude that the performance of CheXNet is statistically significantly higher than radiologist performance.
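The evaluation procedure can be sketched with NumPy as follows. This is a hedged illustration rather than the evaluation code itself: it assumes the five label sets (four radiologists and the model) are stacked as rows of a binary array, scores each rater against each of the other four as ground truth, and bootstraps the difference in mean F1 over resampled test images. The function names and the convention of returning F1 = 0 when there are no positives are our own choices.

```python
import numpy as np

def f1(pred, truth):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def mean_pairwise_f1(labels, rater):
    """Mean F1 of one rater against each other rater as ground truth."""
    others = [j for j in range(labels.shape[0]) if j != rater]
    return float(np.mean([f1(labels[rater], labels[j]) for j in others]))

def bootstrap_f1_difference(labels, model_row, n_boot=10_000, seed=0):
    """labels: (n_raters, n_images) binary array; model_row: the model's row.

    Returns the mean difference (model F1 minus average radiologist F1)
    and its 95% percentile bootstrap confidence interval.
    """
    rng = np.random.default_rng(seed)
    n_images = labels.shape[1]
    rad_rows = [r for r in range(labels.shape[0]) if r != model_row]
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_images, size=n_images)  # with replacement
        sample = labels[:, idx]
        model_f1 = mean_pairwise_f1(sample, model_row)
        rad_f1 = np.mean([mean_pairwise_f1(sample, r) for r in rad_rows])
        diffs[b] = model_f1 - rad_f1
    low, high = np.percentile(diffs, [2.5, 97.5])
    return float(np.mean(diffs)), (float(low), float(high))
```

Computing both averages on the same resampled images is what makes the interval on the difference meaningful: the comparison is paired, so image-level difficulty affects the model and the radiologists together.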

4.2 Limitations

We identify two limitations of this comparison. First, only frontal radiographs were presented to the radiologists and model during diagnosis, but it has been shown that up to 15% of accurate diagnoses require the lateral view (Raoof et al., 2012); we thus expect that this setup provides a conservative estimate of performance. Second, neither the model nor the radiologists were permitted to use patient history, the absence of which has been shown to decrease radiologist diagnostic performance in interpreting chest radiographs (Berbaum et al., 1985; Potchen et al., 1979); for example, given a pulmonary abnormality with a history of fever and cough, pneumonia would be an appropriate diagnosis rather than less specific terms such as infiltration or consolidation (Potchen et al., 1979).

5 CheXNet vs. Previous State of the Art on the ChestX-ray14 Dataset

We extend the algorithm to classify multiple thoracic pathologies by making three changes. First, instead of outputting one binary label, CheXNet outputs a vector $y$ of binary labels indicating the absence or presence of each of the following 14 pathology classes: Atelectasis, Cardiomegaly, Consolidation, Edema, Effusion, Emphysema, Fibrosis, Hernia, Infiltration, Mass, Nodule, Pleural Thickening, Pneumonia, and Pneumothorax. Second, we replace the final fully connected layer in CheXNet with a fully connected layer producing a 14-dimensional output, after which we apply an elementwise sigmoid nonlinearity. The final output is the predicted probability of the presence of each pathology class. Third, we modify the loss function to optimize the sum of unweighted binary cross entropy losses

$$L(X, y) = \sum_{c=1}^{14} \left[ -y_c \log p(Y_c = 1 \mid X) - (1 - y_c) \log p(Y_c = 0 \mid X) \right],$$

where $p(Y_c = 1 \mid X)$ is the predicted probability that the image contains the pathology $c$ and $p(Y_c = 0 \mid X)$ is the predicted probability that the image does not contain the pathology $c$.
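A minimal sketch of the multi-label loss in plain Python is given below, assuming the network's 14 sigmoid outputs are available as a list of probabilities and the labels as a list of 0/1 values. The function name `multilabel_bce` and the clipping constant `eps` are illustrative choices.

```python
import math

def multilabel_bce(probs, labels, eps=1e-7):
    """Sum of unweighted binary cross entropies over pathology classes.

    probs  : predicted probabilities, one per pathology class
    labels : ground-truth 0/1 labels, one per pathology class
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total
```

Because each class contributes an independent binary term, an image may carry several positive labels at once, unlike a softmax formulation that would force the classes to compete.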

We randomly split the dataset into training (70%), validation (10%), and test (20%) sets, following previous work on ChestX-ray14 (Wang et al., 2017; Yao et al., 2017). We ensure that there is no patient overlap between the splits. We compare the per-class AUROC of the model against the previous state of the art held by Yao et al. (2017) on 13 classes and Wang et al. (2017) on the remaining 1 class.
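One way to guarantee no patient overlap is to split at the level of patients rather than images, as in the sketch below. It assumes a mapping from image identifier to patient identifier is available and applies the 70/10/20 fractions to patients, so the image-level proportions will only be approximate; the function name `split_by_patient` and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict

def split_by_patient(image_to_patient, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign every image of a patient to the same split.

    image_to_patient : dict mapping image id -> patient id
    fractions        : (train, val, test) fractions applied to patients
    """
    patients = sorted(set(image_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    assignment = {}
    for i, pid in enumerate(patients):
        if i < n_train:
            assignment[pid] = "train"
        elif i < n_train + n_val:
            assignment[pid] = "val"
        else:
            assignment[pid] = "test"
    splits = defaultdict(list)
    for image_id, pid in image_to_patient.items():
        splits[assignment[pid]].append(image_id)
    return dict(splits)
```

Splitting by image instead would let different X-rays of the same patient land in both training and test sets, which tends to inflate test performance.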

We find that CheXNet achieves state of the art results on all 14 pathology classes. Table 2 illustrates the per-class AUROC comparison on the test set. On Mass, Nodule, Pneumonia, and Emphysema, we outperform previous state of the art considerably (> 0.05 increase in AUROC).
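For reference, the AUROC of a single pathology class can be computed directly from its definition as the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a simple quadratic-time illustration with a hypothetical function name; a production evaluation would more likely use a library routine.

```python
def auroc(scores, labels):
    """Area under the ROC curve for one class.

    scores : predicted probabilities for the class
    labels : ground-truth 0/1 labels for the class
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Count positive/negative pairs ranked correctly; ties count as 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Applying this function to each of the 14 output columns in turn yields per-class values of the kind reported in Table 2.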

6 Model Interpretation

To interpret the network predictions, we also produce heatmaps to visualize the areas of the image most indicative of the disease using class activation mappings (CAMs) (Zhou et al., 2016). To generate the CAMs, we feed an image into the fully trained network and extract the feature maps that are output by the final convolutional layer. Let $f_k$ be the $k$th feature map and let $w_{c,k}$ be the weight in the final classification layer for feature map $k$ leading to pathology $c$. We obtain a map $M_c$ of the most salient features used in classifying the image as having pathology $c$ by taking the weighted sum of the feature maps using their associated weights. Formally,

$$M_c = \sum_k w_{c,k} f_k.$$

We identify the most important features used by the model in its prediction of the pathology $c$ by upscaling the map $M_c$ to the dimensions of the image and overlaying it on the image.
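The CAM computation amounts to a weighted sum over feature maps followed by upscaling, which can be sketched with NumPy as below. The sketch assumes the final convolutional feature maps are available as a (K, H, W) array and the classification-layer weights for the pathology of interest as a length-K vector; nearest-neighbour upscaling is used purely for simplicity, since the interpolation method is not specified here, and the function name is illustrative.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights, out_size=224):
    """Compute a class activation map and upscale it to the image size.

    feature_maps  : (K, H, W) output of the final convolutional layer
    class_weights : (K,) final-layer weights for one pathology class
    out_size      : side length of the (square) input image
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    class_weights = np.asarray(class_weights, dtype=float)
    # Weighted sum over the K feature maps -> (H, W)
    cam = np.tensordot(class_weights, feature_maps, axes=1)
    h, w = cam.shape
    # Nearest-neighbour upscaling (assumed; any interpolation could be used)
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return cam[np.ix_(rows, cols)]
```

The resulting array can be normalized to [0, 1] and rendered as a semi-transparent heatmap over the X-ray, as in Figure 2.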

Figure 2 shows several examples of CAMs on the pneumonia detection task as well as the 14-class pathology classification task.

7 Related Work

Recent advancements in deep learning and large datasets have enabled algorithms to surpass the performance of medical professionals in a wide variety of medical imaging tasks, including diabetic retinopathy detection (Gulshan et al., 2016), skin cancer classification (Esteva et al., 2017), arrhythmia detection (Rajpurkar et al., 2017), and hemorrhage identification (Grewal et al., 2017).

Automated diagnosis from chest radiographs has received increasing attention with algorithms for pulmonary tuberculosis classification (Lakhani & Sundaram, 2017) and lung nodule detection (Huang et al., 2017). Islam et al. (2017) studied the performance of various convolutional architectures on different abnormalities using the publicly available OpenI dataset (Demner-Fushman et al., 2015). Wang et al. (2017) released ChestX-ray14, an order of magnitude larger than previous datasets of its kind, and also benchmarked different convolutional neural network architectures pre-trained on ImageNet. Recently, Yao et al. (2017) exploited statistical dependencies between labels in order to make more accurate predictions, outperforming Wang et al. (2017) on 13 of 14 classes.

8 Conclusion

Pneumonia accounts for a significant proportion of patient morbidity and mortality (Gonçalves-Pereira et al., 2013). Early diagnosis and treatment of pneumonia is critical to preventing complications including death (Aydogdu et al., 2010). With approximately 2 billion procedures per year, chest X-rays are the most common imaging examination tool used in practice, critical for screening, diagnosis, and management of a variety of diseases including pneumonia (Raoof et al., 2012). However, two thirds of the global population lacks access to radiology diagnostics, according to an estimate by the World Health Organization (Mollura et al., 2010). There is a shortage of experts who can interpret X-rays, even when imaging equipment is available, leading to increased mortality from treatable diseases (Kesselman et al., 2016).

We develop an algorithm which detects pneumonia from frontal-view chest X-ray images at a level exceeding practicing radiologists. We also show that a simple extension of our algorithm to detect multiple diseases outperforms previous state of the art on ChestX-ray14, the largest publicly available chest X-ray dataset. With automation at the level of experts, we hope that this technology can improve healthcare delivery and increase access to medical imaging expertise in parts of the world where access to skilled radiologists is limited.

9 Acknowledgements

We would like to acknowledge the Stanford Center for Artificial Intelligence in Medicine and Imaging for clinical dataset infrastructure support (