Deep-COVID: Predicting COVID-19 From Chest X-Ray Images Using Deep Transfer Learning

by   Shervin Minaee, et al.

The COVID-19 pandemic is causing a major outbreak in more than 150 countries around the world, having a severe impact on the health and life of many people globally. One of the crucial steps in fighting COVID-19 is the ability to detect infected patients early enough and put them under special care. Detecting this disease from radiography and radiology images is perhaps one of the fastest ways to diagnose patients. Some of the early studies showed specific abnormalities in the chest radiograms of patients infected with COVID-19. Inspired by earlier works, we study the application of deep learning models to detect COVID-19 patients from their chest radiography images. We first prepare a dataset of 5,000 chest X-rays from publicly available datasets. Images exhibiting COVID-19 disease presence were identified by a board-certified radiologist. Transfer learning on a subset of 2,000 radiograms was used to train four popular convolutional neural networks (ResNet18, ResNet50, SqueezeNet, and DenseNet-121) to identify COVID-19 disease in the analyzed chest X-ray images. We evaluated these models on the remaining 3,000 images, and most of these networks achieved a sensitivity rate of 97% (± 5%), while having a specificity rate of around 90%. While the achieved performance is very encouraging, further analysis is required on a larger set of COVID-19 images to obtain a more reliable estimate of the accuracy rates. Besides sensitivity and specificity rates, we also present the receiver operating characteristic (ROC) curve, the area under the curve (AUC), and the confusion matrix of each model. The dataset, model implementations (in PyTorch), and evaluations are all made publicly available for the research community, here:



I Introduction

Fig. 1: Three sample COVID-19 images, and the corresponding marked areas by our radiologist.

Since December 2019, a novel coronavirus (SARS-CoV-2) has spread from Wuhan to the whole of China and to many other countries. By April 18, more than 2 million confirmed cases and more than 150,000 deaths had been reported worldwide [1]. Due to the unavailability of a therapeutic treatment or vaccine for the novel COVID-19 disease, early diagnosis is of real importance: it allows immediate isolation of the suspected person and decreases the chance of infecting the healthy population. Reverse transcription polymerase chain reaction (RT-PCR) and gene sequencing of respiratory or blood specimens have been introduced as the main screening methods for COVID-19 [2]. However, the total positive rate of RT-PCR for throat swab samples is reported to be 30 to 60 percent, which leaves some patients undiagnosed, and these patients may in turn infect a large population of healthy people [3]. Chest radiography imaging (e.g., X-ray or computed tomography (CT) imaging), a routine tool for pneumonia diagnosis, is easy to perform and gives a fast diagnosis. Chest CT has a high sensitivity for the diagnosis of COVID-19 [4], and X-ray images show visual indexes correlated with COVID-19 [5]. Reports of chest imaging have demonstrated multilobar involvement and peripheral airspace opacities. The opacities most frequently reported are ground-glass (57%) and mixed attenuation (29%) [6]. During the early course of COVID-19, a ground-glass pattern is seen in areas that edge the pulmonary vessels and may be difficult to appreciate visually [7]. Asymmetric patchy or diffuse airspace opacities are also reported for COVID-19 [8]. Such subtle abnormalities can only be interpreted by expert radiologists. Considering the large number of suspected cases and the limited number of trained radiologists, automatic methods for identifying such subtle abnormalities can assist the diagnosis procedure and increase the rate of early diagnosis with high accuracy. Artificial intelligence (AI)/machine learning solutions are potentially powerful tools for solving such problems.

So far, due to the lack of publicly available images of COVID-19 patients, there has not been any detailed study of the potential of AI/machine learning solutions for automatic detection of COVID-19 from X-ray (or chest CT) images. Recently, a small dataset of COVID-19 X-ray images has been collected, which made it possible for AI researchers to train machine learning models to perform automatic COVID-19 diagnostics from X-ray images [10]. These images were extracted from academic publications reporting results on COVID-19 X-ray and CT images. With the help of a board-certified radiologist, we re-labeled those images and kept only the ones that our radiologist judged to show a clear sign of COVID-19. Three sample images with the corresponding areas marked by our radiologist are shown in Figure 1. We then used a subset of images from the ChexPert [11] dataset as the negative samples for COVID-19 detection. The combined dataset has around 5,000 chest X-ray images (called COVID-Xray-5k), which is divided into 2,000 training and 3,000 testing samples. It is worth mentioning that some of the earlier works in the past few weeks used images of pediatric patients aged one to five years (from a Kaggle competition) as the negative class. This may not be the best idea, as in that case there is a large difference between the age ranges of the positive and negative classes.

We then use a machine learning framework to predict COVID-19 from the chest X-ray images. Unlike the classical approaches to medical image classification, which follow a two-step procedure (hand-crafted feature extraction followed by recognition), we use an end-to-end deep learning framework that predicts COVID-19 directly from raw images, without any need for feature extraction. Deep learning based models (and more specifically convolutional neural networks (CNNs)) have been shown to outperform the classical AI approaches in most computer vision and medical image analysis tasks in recent years, and have been used in a wide range of problems, from classification, segmentation, and face recognition to super-resolution and image enhancement [12, 18, 19, 20, 21].

Here, we train four popular convolutional networks that have achieved promising results in several tasks in recent years (ResNet18, ResNet50, SqueezeNet, and DenseNet-121) on the COVID-Xray-5k dataset, and analyze their performance for COVID-19 detection. Since only a few X-ray images are publicly available for the COVID-19 class so far, we cannot simply train these models from scratch. We used two strategies to address the COVID-19 image scarcity issue in this work:

  • We use data augmentation to create transformed versions of the COVID-19 images (such as flipping, small rotations, and small amounts of distortion), to increase the number of samples by a factor of 4.

  • Instead of training these models from scratch, we fine-tune the last layer of versions of these models pre-trained on ImageNet. In this way, the model can be trained with fewer labeled samples from each class.

The above two strategies help us to train these networks with the available images, and achieve reasonable performance on the test set of 3,000 images.

The best performing of the above four networks achieves a sensitivity of 97.5% and a specificity of around 95%. Since the number of samples for the COVID-19 class is limited, we also calculate the confidence interval of the performance metrics. To summarize the performance of these models, we provide the receiver operating characteristic (ROC) curve and the area under the curve (AUC) for each of them.

The contributions of this paper are as follows:

  • We prepared a dataset of 5,000 images with binary labels for COVID-19 detection from chest X-ray images. This dataset can serve as a benchmark for the research community. The images in the COVID-19 class are labeled by a board-certified radiologist, and only those with a clear sign are used for testing purposes.

  • We trained 4 promising deep learning models on this dataset, and evaluated their performance on a test set of 3,000 images. Our best performing model achieved a sensitivity rate of 97.5%, while having a specificity of 95%.

  • We also provided the ROC curve, AUC, and the histogram of the predicted scores by these models.

  • We provided a detailed experimental analysis on the performance of these models, in terms of sensitivity, specificity, ROC curve, area under the curve, and confusion matrix.

  • We make the dataset, the trained models, and the implementation publicly available.

It is worth mentioning that, although the results of this work are very encouraging, they are still preliminary given the amount of labeled data, and a more concrete conclusion requires further experiments on a larger dataset of COVID-19 labeled X-ray images. We believe this work can serve as a benchmark for future works and comparisons.

The rest of this paper is structured as follows. Section II provides a summary of the prepared COVID-Xray-5k dataset. Section III describes the overall proposed framework. Section IV provides the experimental studies and comparison with previous works. Finally, Section V concludes the paper.

II COVID-Xray-5k Dataset

We have used the X-ray images from two datasets, to create the COVID-Xray-5k dataset. The COVID-Xray-5k dataset contains 2,031 training images, and 3,040 test images.

One of the datasets used is the recently published Covid-Chestxray-Dataset, which contains a set of images from publications on COVID-19 topics, collected by Joseph Paul Cohen [9, 10]. This dataset contains a mix of chest X-ray and CT images. As of March 23, 2020, it contained 23 CT images (22 COVID-19 and 1 Non-COVID) and 126 X-ray images (102 COVID-19 and 24 Non-COVID). The dataset is reported to be continuously updated. It also contains some meta-data about each patient, such as sex and age. All of our COVID-19 images come from this dataset. The provided 102 COVID-19 images were examined by our board-certified radiologist, which led to the elimination of all lateral images and some less reliable anterior-posterior images, yielding 71 X-ray images with COVID-19. We then chose 40 COVID-19 images for the test set (to meet a maximum confidence interval value) and 31 COVID-19 images for the training set. Data augmentation is applied to the training set to increase the number of COVID-19 samples to 496 (by a combination of flipping, rotation, small distortion, and over-sampling). We made sure that all images of each patient go to only one of the training or test sets. It is worth mentioning that our radiologist also marked some of the regions that are likely to show signs of COVID-19.

Since the number of Non-COVID images in the [9] dataset was very small, additional images were taken from the ChexPert dataset [11], a large public dataset for chest radiograph interpretation consisting of 224,316 chest radiographs of 65,240 patients, labeled for the presence of 14 sub-categories (no-finding, Edema, Pneumonia, etc.). For the Non-COVID samples in the training set, we only used images belonging to a single sub-category: 700 images from the no-finding class and 100 images from each of the remaining 13 sub-classes, resulting in 2,000 Non-COVID images. For the Non-COVID samples in the test set, we selected 1,700 images from the no-finding category and around 100 images from each of the remaining 13 sub-classes in distinct sub-folders, resulting in 3,000 images in total. The exact number of images of each class for both training and testing is given in Table I.

Split          COVID-19                       Non-COVID
Training Set   31 (496 after augmentation)    2000
Test Set       40                             3000

TABLE I: Number of images per category in COVID-Xray-5k dataset.

Figure 2 shows 16 sample images from COVID-Xray-5k dataset, including 4 COVID-19 images (the first row), 4 normal images from ChexPert (the second row), and 8 images with one of the 13 diseases in ChexPert (third and fourth rows).

Fig. 2: Sample images from COVID-Xray-5k dataset. The images in the first row show 4 COVID-19 images. The images in the second row are 4 sample images of the no-finding category in Non-COVID images from ChexPert. The images in the third and fourth rows give 8 sample images from other sub-categories in ChexPert.

III The Proposed Framework

Since the number of publicly available images labeled as COVID-19 is still very limited, it may not be possible to train a deep convolutional neural network from scratch to detect COVID-19 from X-ray images. To overcome this issue, we use a well-known strategy in machine learning called "transfer learning", and fine-tune four popular pre-trained deep neural networks on the training images of the COVID-Xray-5k dataset. We first provide a quick introduction to transfer learning, and then discuss the proposed framework.

III-A Transfer Learning Approach

In transfer learning, a model trained on one task is re-purposed for another related task, usually with some adaptation toward the new task. For example, one can imagine using an image classification model trained on ImageNet (which contains millions of labeled images) to initiate task-specific learning for COVID-19 detection on a smaller dataset. Transfer learning is mainly useful for tasks where not enough training samples are available to train a model from scratch, such as medical image classification for rare or emerging diseases, in which sufficiently large numbers of labeled samples may not be available. This is especially the case for models based on deep neural networks, which have a large number of parameters to train. With transfer learning, the model parameters start from already-good initial values that need only small modifications to be adapted to the new task.

There are two main ways in which the pre-trained model is used for a different task. In one approach, the pre-trained model is treated as a feature extractor (i.e., the internal weights of the pre-trained model are not adapted to the new task), and a classifier is trained on top of it to perform classification. In another approach, the whole network, or a subset thereof, is fine-tuned on the new task. Therefore the pre-trained model weights are treated as the initial values for the new task, and are updated during the training stage.

In our case, since the number of images in COVID-19 category is very limited, we only fine-tune the last layer of the convolutional neural networks, and essentially use the pre-trained models as a feature extractor. We evaluate the performance of four popular pre-trained models, ResNet18 [14], ResNet50 [14], SqueezeNet [15], and DenseNet-121 [16]. In the next section we provide a quick overview of the architecture of these models, and how they are used for COVID-19 recognition.
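The last-layer fine-tuning described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' released code: the helper `freeze_and_replace_head` and the `TinyBackbone` class are hypothetical names, and the tiny backbone merely stands in for a network pre-trained on ImageNet so that the sketch runs offline. The sketch assumes the classifier is a linear layer stored in an attribute named `fc`, as in torchvision's ResNet models; other architectures name their final layer differently.

```python
import torch
from torch import nn


def freeze_and_replace_head(model: nn.Module, num_classes: int = 2) -> nn.Module:
    """Freeze every existing weight, then attach a new trainable last layer.

    Assumes the classifier is an nn.Linear stored in `model.fc`.
    """
    for param in model.parameters():
        param.requires_grad = False  # backbone becomes a fixed feature extractor
    in_features = model.fc.in_features
    model.fc = nn.Linear(in_features, num_classes)  # new layer is trainable by default
    return model


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pre-trained network (runs without downloads)."""

    def __init__(self) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8, 1000)  # 1000 outputs, mimicking an ImageNet classifier

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.features(x))


model = freeze_and_replace_head(TinyBackbone(), num_classes=2)

# Only the parameters of the new last layer remain trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]

# A batch of four 224x224 RGB images yields one score per class.
logits = model(torch.randn(4, 3, 224, 224))
```

Freezing the backbone keeps the number of trainable parameters small, which matters when only a few dozen positive images are available.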

III-B COVID-19 Detection Using Residual ConvNet – ResNet18 and ResNet50

One of the models used in this work is ResNet18, pre-trained on the ImageNet dataset. ResNet is one of the most popular CNN architectures; it provides easier gradient flow for more efficient training, and was the winner of the 2015 ImageNet competition. The core idea of ResNet is the so-called identity shortcut connection, which skips one or more layers. This gives the gradient a direct path to the very early layers of the network, making the gradient updates for those layers much easier.
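The identity shortcut can be written compactly. The expression below follows the notation of the ResNet paper [14]; the symbols $x$, $y$, $\mathcal{F}$, and $W_i$ are not defined elsewhere in this text and are introduced here only for illustration.

```latex
y = \mathcal{F}(x, \{W_i\}) + x
```

Here $x$ and $y$ are the input and output of a residual block, and $\mathcal{F}$ is the residual mapping learned by the stacked layers with weights $W_i$. Since $\partial y / \partial x = \partial \mathcal{F} / \partial x + I$, the gradient always contains an identity term, which is what provides the direct path to earlier layers.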

The overall block diagram of the ResNet18 model, and how it is used for COVID-19 detection, is illustrated in Figure 3. The ResNet50 architecture is similar to ResNet18; the main difference is that it has more layers.

Fig. 3: The architecture of ResNet18 model [14].

III-C COVID-19 Detection Using SqueezeNet

SqueezeNet [15], proposed by Iandola et al., is a small CNN architecture that achieves AlexNet-level [13] accuracy on ImageNet with 50x fewer parameters. Using model compression techniques, the authors were able to compress SqueezeNet to less than 0.5MB, which made it very popular for applications that require light-weight models. The architecture alternates a 1x1 layer that "squeezes" the incoming data in the vertical dimension with two parallel 1x1 and 3x3 convolutional layers that "expand" the depth of the data again. The three main strategies used in SqueezeNet are: replacing 3x3 filters with 1x1 filters, decreasing the number of input channels to 3x3 filters, and down-sampling late in the network so that convolution layers have large activation maps. Figure 4 shows the architecture of a simple SqueezeNet.

Fig. 4: The architecture of SqueezeNet based on "fire modules". Courtesy of Google [17].

III-D COVID-19 Detection Using DenseNet

Dense Convolutional Network (DenseNet) [16] is another popular architecture, which received the best paper award at CVPR 2017. In DenseNet, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers. Each layer thus receives a "collective knowledge" from all preceding layers. Since each layer receives feature maps from all preceding layers, the network can be thinner and more compact, i.e., the number of channels can be smaller, which gives it higher computational and memory efficiency. The architecture of a sample DenseNet is shown in Figure 5.

Fig. 5: The architecture of a DenseNet with 5 layers, with expansion of 4. Courtesy of [16].
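The dense connectivity pattern can be stated in one line. The expression below uses the notation of the DenseNet paper [16]; the symbols $x_\ell$ and $H_\ell$ are not defined elsewhere in this text and are introduced here only for illustration.

```latex
x_\ell = H_\ell\big([x_0, x_1, \ldots, x_{\ell-1}]\big)
```

Here $x_\ell$ is the output of layer $\ell$, $[\cdot]$ denotes concatenation of feature maps along the channel dimension, and $H_\ell$ is the composite function applied by the layer (batch normalization, ReLU, and convolution in [16]). Because every layer reuses all earlier feature maps, each layer only needs to add a small number of new channels.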

III-E Model Training

All employed models are trained with a cross-entropy loss function, which minimizes the distance between the predicted probability scores and the ground-truth probabilities (derived from the labels), and is defined as:

$$
\mathcal{L}_{CE} = -\sum_{i} p_i \log q_i \tag{1}
$$

where $p_i$ and $q_i$ denote the ground-truth and predicted probabilities for each image, respectively. We can then minimize this loss function using the stochastic gradient descent algorithm (and its variations). We also tried adding regularization to the loss function, but the resulting model was not better than the one trained without regularization.
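To make the behaviour of the cross-entropy loss concrete, the following sketch evaluates it for a single image with a one-hot ground truth. The function name `cross_entropy`, the clipping constant `eps`, and the example probabilities are illustrative assumptions, not values from the paper.

```python
import math


def cross_entropy(p_true, q_pred, eps=1e-12):
    """Cross-entropy -sum_i p_i * log(q_i) between two discrete distributions.

    `eps` guards against log(0); it is an implementation detail of this sketch.
    """
    return -sum(p * math.log(max(q, eps)) for p, q in zip(p_true, q_pred))


# One-hot ground truth for a COVID-19 image: [P(Non-COVID), P(COVID-19)].
ground_truth = [0.0, 1.0]

# A confident correct prediction gives a small loss ...
loss_right = cross_entropy(ground_truth, [0.1, 0.9])
# ... while a confident wrong prediction is penalized heavily.
loss_wrong = cross_entropy(ground_truth, [0.9, 0.1])

print(f"correct: {loss_right:.4f}, wrong: {loss_wrong:.4f}")
```

With a one-hot ground truth, the sum reduces to the negative log of the probability assigned to the true class, so the loss grows without bound as that probability approaches zero.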

IV Experimental Results

In this section we provide the experimental results of the four neural networks trained for COVID-19 detection, the histogram of their predicted scores on the test images, and quantitative performance.

IV-A Model Hyper-parameters

We fine-tuned each model for 100 epochs. The batch size is set to 20, and the ADAM optimizer is used to optimize the loss function, with a learning rate of 0.0001. All images are down-sampled to 224x224 before being fed to the neural network (as these pre-trained models are usually trained with a specific image resolution). All our implementations are done in PyTorch [22], and are publicly available here:
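The hyper-parameters above map onto a standard PyTorch training loop. The sketch below is an illustration under stated assumptions, not the released implementation: the random tensors stand in for real 224x224 X-ray images, the small `nn.Sequential` model stands in for a pre-trained network with a new last layer, and only one epoch is run to keep the example fast.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hyper-parameters quoted in the text.
BATCH_SIZE = 20
LEARNING_RATE = 1e-4
IMAGE_SIZE = 224
NUM_EPOCHS = 100  # the paper's setting; this sketch runs a single epoch

# Hypothetical stand-ins: random "images" with binary labels.
images = torch.randn(40, 3, IMAGE_SIZE, IMAGE_SIZE)
labels = torch.randint(0, 2, (40,))
loader = DataLoader(TensorDataset(images, labels), batch_size=BATCH_SIZE, shuffle=True)

# Hypothetical stand-in for a fine-tuned network with two output classes.
model = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 8 * 8, 2))
criterion = nn.CrossEntropyLoss()
# Optimize only the parameters that are left trainable.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=LEARNING_RATE
)


def train_one_epoch() -> float:
    """Run one pass over the data and return the mean loss per image."""
    model.train()
    total = 0.0
    for batch_images, batch_labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_images), batch_labels)
        loss.backward()
        optimizer.step()
        total += loss.item() * batch_images.size(0)
    return total / len(loader.dataset)


epoch_loss = train_one_epoch()
```

In a full run, `train_one_epoch` would be called `NUM_EPOCHS` times on the real training set.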

IV-B Evaluation Metrics

There are different metrics that can be used for evaluating the performance of classification models, such as classification accuracy, sensitivity, specificity, precision, and F1-score. Since the current test dataset is highly imbalanced (there are 40 images with COVID-19 and 3000 images that are Non-COVID), sensitivity and specificity are two proper metrics for reporting the model performance. These metrics are also widely used in the medical domain, and can be defined as in Eq 2:

$$
\text{Sensitivity} = \frac{\#\,\text{images correctly predicted as COVID-19}}{\#\,\text{total COVID-19 images}}, \qquad
\text{Specificity} = \frac{\#\,\text{images correctly predicted as Non-COVID}}{\#\,\text{total Non-COVID images}} \tag{2}
$$

IV-C Model Predicted Scores

As mentioned earlier, we focused on four popular convolutional networks: ResNet18, ResNet50, SqueezeNet, and DenseNet-121. These models predict a probability score for each image, which shows the likelihood of the image being detected as COVID-19. By comparing this probability with a cut-off threshold, we can derive a binary label showing whether the image is COVID-19 or not. An ideal model should predict a probability close to 1 for all COVID-19 samples, and close to 0 for all Non-COVID samples.

Figures 6, 7, 8, and 9 show the distribution of predicted probability scores for the images in the test set, by ResNet18, ResNet50, SqueezeNet, and DenseNet-121, respectively. Since the Non-COVID class in our study contains both normal cases and other types of diseases, we provide the distribution of predicted scores for three classes: COVID-19, Non-COVID normal, and Non-COVID other diseases. As we can see, the Non-COVID images with other types of disease have slightly larger scores than the Non-COVID normal cases. This makes sense, since those images are more difficult to distinguish from COVID-19 than normal samples.

Based on these figures, the images for COVID-19 patients, are predicted to have much higher probabilities than the Non-COVID images, which is really encouraging, as it shows the model is learning to discriminate COVID-19 from non-COVID images. Among different models, it can be observed that SqueezeNet does a much better job in pushing the predicted scores for COVID-19 and Non-COVID images far apart from each other.

Fig. 6: The predicted probability scores by ResNet18 on the test set.
Fig. 7: The predicted probability scores by ResNet50 on the test set.
Fig. 8: The predicted probability scores by SqueezeNet on the test set.
Fig. 9: The predicted probability scores by DenseNet-121 on the test set.

IV-D Model Sensitivity and Specificity

As we saw in the previous part, each model predicts a probability score showing the chance of the image being COVID-19. We can then compare these scores with a threshold to infer whether the image is COVID-19 or not (if the score is bigger than the threshold, the image is predicted as COVID-19). The predicted labels are then used to estimate the sensitivity and specificity of each model. Depending on the value of the cut-off threshold, we get different sensitivity and specificity rates for each model.
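The thresholding step and the resulting rates can be sketched in a few lines of plain Python. The scores and labels below are made-up toy values chosen only to show the trade-off; they are not taken from any of the models in this paper, and the function name is illustrative.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Threshold the scores and return (sensitivity, specificity).

    labels: 1 for COVID-19, 0 for Non-COVID.
    An image is predicted as COVID-19 when its score exceeds the threshold.
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy predicted scores: three COVID-19 images followed by five Non-COVID images.
scores = [0.91, 0.40, 0.07, 0.30, 0.05, 0.02, 0.12, 0.01]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

results = {t: sensitivity_specificity(scores, labels, t) for t in (0.04, 0.10, 0.35)}
for threshold, (sens, spec) in results.items():
    print(f"threshold={threshold:.2f}  sensitivity={sens:.3f}  specificity={spec:.3f}")
```

Raising the threshold trades sensitivity for specificity, which is the pattern visible in the per-model tables.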

Tables II, III, IV, and V show the sensitivity and specificity rates for different thresholds, using ResNet18, ResNet50, SqueezeNet, and DenseNet-121 models, respectively. As we can see, all these models achieve very promising results, in which for a sensitivity rate of around 97%, their specificity rate is in the range of 84-97%. SqueezeNet and ResNet50 achieve slightly better performance than the other models.

Threshold   Sensitivity   Specificity
0.04        100%          81.6%
0.055       97.5%         88.8%
0.08        95%           94.1%
0.18        92.5%         98.8%
0.2         87.5%         99.2%

TABLE II: Sensitivity and specificity rates of ResNet18 model, for different threshold values.

Threshold   Sensitivity   Specificity
0.11        97.5%         90.5%
0.2         95%           97.5%
0.24        92.5%         98.8%
0.28        87.5%         99.2%

TABLE III: Sensitivity and specificity rates of ResNet50 model, for different threshold values.

Threshold   Sensitivity   Specificity
0.08        100%          95.6%
0.18        97.5%         97.8%
0.2         95.0%         98.2%
0.32        87.5%         99.3%

TABLE IV: Sensitivity and specificity rates of SqueezeNet model, for different threshold values.

Threshold   Sensitivity   Specificity
0.08        100%          74.2%
0.1         97.5%         81.3%
0.15        92.5%         92.4%
0.2         87.5%         96.3%

TABLE V: Sensitivity and specificity rates of DenseNet-121 model, for different threshold values.

IV-E Small Number of COVID-19 Cases and Model Reliability

It is worth mentioning that, since the number of reliably labeled COVID-19 X-ray images is still very limited and we have only 40 test images in the COVID-19 class, the sensitivity and specificity rates reported above cannot all be assumed to be reliable. Ideally, more experiments on a larger number of COVID-19 test samples are needed to derive a more reliable estimate of the sensitivity rates. We can, however, estimate the 95% confidence interval of the reported sensitivity and specificity rates, to see the possible range of these values for the current number of test samples in each class. The confidence interval of the accuracy rates can be calculated as in Eq 3:

$$
r = z \sqrt{\frac{\text{accuracy}\,(1 - \text{accuracy})}{N}} \tag{3}
$$

where $z$ denotes the significance level of the confidence interval (the number of standard deviations of the Gaussian distribution), accuracy is the estimated accuracy (in our case, sensitivity or specificity), and $N$ denotes the number of samples for that class. Here we use the 95% confidence interval, for which the corresponding value of $z$ is 1.96.

Since having a sensitive model is crucial for COVID-19 diagnostics, we choose the cut-off threshold corresponding to a sensitivity rate of 97.5% for each model, and compare their specificity rates. Table VI provides a comparison of the performance of these four models on the test set. As we can see, the confidence intervals of the specificity rates are small (around 1%), since we have around 3000 samples for this class, whereas for the sensitivity rate we get a wider confidence interval (around 4.8%) because of the limited number of samples.
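The interval widths quoted above can be checked numerically. The sketch below assumes the usual normal-approximation interval, half-width = z * sqrt(a * (1 - a) / N) with z = 1.96 for 95% confidence; the function name is illustrative. Under that assumption it reproduces the ±4.8% reported for sensitivity with 40 COVID-19 test images.

```python
import math


def confidence_half_width(rate, n_samples, z=1.96):
    """Half-width of the normal-approximation confidence interval of a rate.

    rate: estimated sensitivity or specificity, as a fraction in [0, 1].
    n_samples: number of test samples in the corresponding class.
    z: number of standard deviations (1.96 for a 95% interval).
    """
    return z * math.sqrt(rate * (1.0 - rate) / n_samples)


# Sensitivity of 97.5% measured on 40 COVID-19 test images.
sens_ci = confidence_half_width(0.975, 40)
# Specificity of 97.8% measured on 3000 Non-COVID test images.
spec_ci = confidence_half_width(0.978, 3000)

print(f"sensitivity: 97.5% +/- {100 * sens_ci:.1f}%")
print(f"specificity: 97.8% +/- {100 * spec_ci:.1f}%")
```

The half-width shrinks with the square root of the sample count, which is why the Non-COVID class, with roughly 75 times more test images, has a much tighter interval.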

Model          Sensitivity      Specificity
ResNet18       97.5% ± 4.8%     88.8% ± 1.1%
ResNet50       97.5% ± 4.8%     90.5% ± 1.1%
SqueezeNet     97.5% ± 4.8%     97.8% ± 0.5%
DenseNet-121   97.5% ± 4.8%     81.3% ± 1.4%

TABLE VI: Comparison of sensitivity and specificity of four state-of-the-art deep neural networks.

IV-F The ROC Curve of Each Model and Confusion Matrix

As we can see, it is hard to compare different models based only on their sensitivity and specificity rates, since these rates change as the cut-off threshold varies. For an overall comparison between these models, we need to compare them across all possible threshold values. One way to do this is through the receiver operating characteristic (ROC) curve, which gives the true positive rate as a function of the false positive rate. The ROC curves of these four models are shown in Figure 10. As we can see, all models have relatively similar areas under the curve, but SqueezeNet achieves a slightly higher AUC than the other models.

Fig. 10: The ROC curve of four CNN architectures on COVID-19 test set.
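As an illustration of how an ROC curve and its AUC are obtained from predicted scores, the sketch below sweeps the threshold over every distinct score and integrates the curve with the trapezoidal rule. The function names and the six toy scores are assumptions made for this example; they do not come from the evaluated models. At each step an image is counted as positive when its score is at least the current threshold.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs for all thresholds."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy scores: three positive (label 1) and three negative (label 0) images.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(f"AUC = {auc:.4f}")
```

The AUC equals the fraction of positive/negative pairs in which the positive image receives the higher score (8 of 9 pairs in this toy example), which makes it independent of any single cut-off threshold.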

To see the exact number of samples correctly classified as COVID-19 and Non-COVID, we also provide the confusion matrices of the two top-performing models. The confusion matrices of the fine-tuned ResNet50 and SqueezeNet models on the set of 3040 test images are shown in Figures 11 and 12.

Fig. 11: The confusion matrix of the proposed ResNet50 model.
Fig. 12: The confusion matrix of the proposed SqueezeNet framework.

V Conclusion

In this work we propose a deep learning framework for COVID-19 detection from chest X-ray images, by fine-tuning four pre-trained convolutional models (ResNet18, ResNet50, SqueezeNet, and DenseNet-121) on our training set. We prepared a dataset of around 5k images, called COVID-Xray-5k (using images from two datasets), with the help of a board-certified radiologist to confirm the COVID-19 labels. We make this dataset publicly available for the research community to use as a benchmark for training and evaluating future machine learning models for the COVID-19 binary classification task. We performed a detailed experimental analysis evaluating the performance of each of these four models on the test set of the COVID-Xray-5k dataset, in terms of sensitivity, specificity, ROC, and AUC. For a sensitivity rate of 97.5%, these models achieved a specificity rate of around 90% on average. This is really encouraging, as it shows the promise of using X-ray images for COVID-19 diagnostics. This study is conducted on a set of publicly available images, which contains fewer than 100 COVID-19 images and around 5,000 Non-COVID images. Due to the limited number of COVID-19 images publicly available so far, further experiments are needed on a larger set of cleanly labeled COVID-19 images for a more reliable estimate of the sensitivity rates.


The authors would like to thank Joseph Paul Cohen for collecting the COVID-Chestxray-dataset. We would also like to thank the providers of the ChexPert dataset, whose images are used as the negative samples in our case.