COVIDGR dataset and COVID-SDNet methodology for predicting COVID-19 based on Chest X-Ray images

06/02/2020, by S. Tabik et al.

Currently, Coronavirus disease (COVID-19), one of the most infectious diseases of the 21st century, is diagnosed using RT-PCR testing, CT scans and/or Chest X-Ray (CXR) images. CT (Computed Tomography) scanners and RT-PCR testing are not available in most medical centers; hence, in many cases, CXR images become the most time/cost effective tool for assisting clinicians in making decisions. Deep learning neural networks have great potential for building triage systems for detecting COVID-19 patients, especially patients with low severity. Unfortunately, current databases do not allow building such systems, as they are highly heterogeneous and biased towards severe cases. The contribution of this paper is three-fold: (i) we demystify the high sensitivities achieved by most recent COVID-19 classification models; (ii) in close collaboration with Hospital Universitario Clínico San Cecilio, Granada, Spain, we built COVIDGR-1.0, a homogeneous and balanced database that includes all levels of severity, from Normal with positive RT-PCR, Mild and Moderate to Severe; COVIDGR-1.0 contains 377 positive and 377 negative PA (PosteroAnterior) CXR views; and (iii) we propose the COVID Smart Data based Network (COVID-SDNet) methodology for improving the generalization capacity of COVID classification models. Our approach reaches good and stable results, with accuracies of 97.37% ± 1.86%, 88.14% ± 2.02% and 66.5% ± 8.04% in the severe, moderate and mild COVID severity levels respectively. Our approach could help in the early detection of COVID-19. The COVIDGR-1.0 dataset will be made available after the review process.


1 Introduction

In recent months, the world has witnessed how the COVID-19 pandemic is rapidly infecting large numbers of people all over the world. The trends are not yet clear, but some research suggests that this problem may persist until 2024 kissler2020projecting . Besides, prevalence studies conducted in several countries reveal that only a tiny proportion of the population has developed antibodies after exposure to the virus, e.g., 5% in Spain (https://english.elpais.com/society/2020-05-14/antibody-study-shows-just-5-of-spaniards-have-contracted-the-coronavirus.html). This means that large numbers of patients will frequently need to be assessed in short time intervals by few clinicians and with very limited resources.

In general, COVID-19 diagnosis is carried out using at least one of the following three tests:

  • Computed Tomography (CT) scan-based assessment: it consists of analyzing 3D radiographic images taken from different angles. The equipment needed for this assessment is not available in most hospitals, and it takes more than 15 minutes per patient in addition to the time required for CT decontamination american2020acr .

  • Reverse Transcription Polymerase Chain Reaction (RT-PCR) test: it detects viral RNA from sputum or a nasopharyngeal swab wong2020frequency . It requires specific material and equipment that are not easily accessible, and it takes at least 12 hours, which is not desirable as positive COVID-19 patients should be identified and tracked as soon as possible. Some studies found that the results of several RT-PCR tests taken at different time points from the same patients varied during the course of the illness, producing a high false-negative rate li2020stability . The authors suggested that the RT-PCR test should be combined with other clinical tests such as CT.

  • Chest X-Ray (CXR): the required equipment for this assessment is less cumbersome and can be lightweight and transportable. In general, this type of resource is more widely available than those required for the RT-PCR and CT-scan tests. In addition, a CXR test takes about 15 seconds per patient wong2020frequency , which makes CXR one of the most time/cost effective assessment tools.

Few recent studies provide estimates of expert radiologists' sensitivity in the diagnosis of COVID-19 based on CT scans, RT-PCR and CXR. A study on a set of 51 patients with chest CT and RT-PCR assay performed within 3 days reported a sensitivity of 98% for CT compared with 71% for RT-PCR fang2020sensitivity . A different study on 64 patients (26 men; mean age 56 ± 19 years) reported a sensitivity of 69% for CXR compared with 91% for initial RT-PCR wong2020frequency . According to an analysis of 636 ambulatory patients weinstock2020chest , most patients presenting to urgent care centers with confirmed coronavirus disease 2019 have normal or mildly abnormal findings on CXR, and only a minority of these patients are correctly diagnosed by the expert eye.

In a recent study wong2020frequency , the authors proposed simplifying the quantification of the level of severity by adapting a previously defined Radiographic Assessment of Lung Edema (RALE) score warren2018severity to COVID-19. This new score is calculated by assigning a value between 0 and 4 to each lung, depending on the extent of visual features such as consolidation and ground-glass opacities in the four parts of each lung, as depicted in Figure 1. Based on this score, experts can identify the level of severity of the infection among four severity stages: Normal (0), Mild (1-2), Moderate (3-5) and Severe (6-8). In practice, a patient classified by an expert radiologist as Normal can still have a positive RT-PCR; we refer to these cases as Normal-PCR+. The expert annotation adopted in this work is based on this score.

Figure 1: The stratification of the radiological severity of COVID-19, with examples of how the RALE index is calculated.
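The severity stratification above is simple enough to express programmatically. The following Python helper is a minimal illustrative sketch (the function name and interface are ours, not from the paper) that maps the two per-lung scores to the four severity stages:

    def covid_severity(left_lung_score: int, right_lung_score: int) -> str:
        """Map per-lung RALE-style scores (0-4 each) to a severity stage."""
        for score in (left_lung_score, right_lung_score):
            if not 0 <= score <= 4:
                raise ValueError("each lung score must be between 0 and 4")
        total = left_lung_score + right_lung_score  # total score in 0-8
        if total == 0:
            return "Normal"    # corresponds to Normal-PCR+ if the RT-PCR is positive
        if total <= 2:
            return "Mild"      # 1-2
        if total <= 5:
            return "Moderate"  # 3-5
        return "Severe"        # 6-8

    print(covid_severity(2, 3))  # -> "Moderate"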

Automated image analysis via Deep Learning (DL) models has great potential to optimize the role of CXR images in the fast diagnosis of COVID-19. A robust and accurate DL model could serve as a triage method and as a support for medical decision making. An increasing number of recent works claim impressive sensitivities, far higher than those of expert radiologists. These high sensitivities are due to the bias in the most used COVID dataset, the COVID-19 Image Data Collection cohen2020covid . This dataset includes a very small number of COVID positive cases coming from highly heterogeneous sources (at least 15 countries), and most cases are severe patients, an issue that drastically reduces its clinical value. To populate the Non-COVID and Healthy classes, AI researchers are using CXR images from diverse pulmonary disease repositories. The obtained models will have no clinical value either, since they will be unable to detect patients with low and moderate severity, who are the target of a clinical triage system. In view of this situation, there is still a huge need for higher quality datasets built under the same clinical protocol and in close collaboration with expert radiologists.

The concept of Smart Data refers to the process of converting raw data into higher quality data with a higher concentration of useful information LuengoGRGH20 . Multiple studies have shown that higher quality data yields higher quality models. Smart Data includes all pre-processing methods that improve the value and veracity of data, such as noise elimination, data augmentation tabik2017snapshot and data transformation FuCiTNet20 , among other techniques.

In this work, we designed a high clinical quality dataset, named COVIDGR-1.0, that includes four levels of severity: Normal-PCR+, Mild, Moderate and Severe. We identified these four severity levels from a recent COVID radiological study wong2020frequency . We also propose the COVID Smart Data based Network (COVID-SDNet) methodology, which combines segmentation, data augmentation and data transformations together with an appropriate Convolutional Neural Network (CNN) for inference.

The contributions of this paper can be summarized as follows:

  • To analyze reliability, potential and limitations of the most used COVID CXR datasets and models.

  • To provide a high quality dataset, called COVIDGR-1.0, for building triage systems with high clinical value.

  • To design a novel methodology, named COVID-SDNet, with a high generalization capacity for COVID classification based on CXR images. COVID-SDNet combines segmentation, a data transformation that increases the discrimination capacity of the classification model, data augmentation, and a suitable CNN model together with an inference approach to obtain the final class.

Experiments demonstrate that our approach reaches good and stable results, especially in the moderate and severe levels, with accuracies of 88.14% ± 2.02% and 97.37% ± 1.86% respectively. Lower accuracies were obtained in the mild and Normal-PCR+ severity levels, with 66.5% ± 8.04% and 38.68% ± 2.44% respectively.

This paper is organized as follows: a review of the most used datasets and COVID classification approaches is provided in Section 2. Section 3 describes how COVIDGR-1.0 was built and organized. Our approach is presented in Section 4. Experiments, comparisons and results are provided in Section 5, the inspection of the model's decisions in Section 6, and finally conclusions are drawn in Section 7.

2 Related works

The last three months have seen an increasing number of works exploring the potential of deep learning models for automating COVID-19 diagnosis based on CXR images. The results are promising, but much work remains to be done in terms of data and model design. Given the potential bias in this type of problem, several studies add explanation methods to their models. This section analyzes the advantages and limitations of current datasets and models for building automatic COVID-19 diagnosis systems, with and without decision explanation.

2.1 Datasets

There does not yet exist a high quality collection of CXR images for building COVID diagnosis systems of high clinical value. Currently, the main source for the COVID class is the COVID-19 Image Data Collection cohen2020covid . It contains 76 positive and 26 negative PA views. These images were obtained from highly heterogeneous equipment from all around the world. To build the Non-COVID classes, most studies use CXR images from one or multiple public pulmonary disease datasets.

For instance, COVIDx 1.0 wang2020covidnet was built by combining three public datasets: (i) COVID-19 Image Data Collection cohen2020covid , (ii) Figure 1 COVID-19 Chest X-ray Dataset Initiative agchung2020covid and (iii) the RSNA Pneumonia Detection Challenge dataset 2010nocovid . COVIDx 2.0 was built by re-organizing COVIDx 1.0 into three classes, Normal (healthy), Pneumonia and COVID-19, using 201 CXR images for the COVID class, including both PA (PosteroAnterior) and AP (AnteroPosterior) views (see Table 1). Notice that, for correct learning, front (PA) and back (AP) views cannot be mixed in the same class.

Version  Normal (healthy)  Pneumonia                  COVID-19
1.0      1,583             4,273 (bacterial + viral)  76
2.0      8,066             8,614                      190
Table 1: A brief description of the COVIDx dataset cohen2020covid (only PA views are counted).

Although the value of these datasets is unquestionable, as they have been useful for carrying out first studies and reformulations, they do not guarantee useful triage systems for the following reasons. It is not clear what annotation protocol was followed to construct the positive class in the COVID-19 Image Data Collection. The included data is highly heterogeneous, and hence DL models can rely on aspects other than COVID visual features to differentiate between the involved classes. Moreover, this dataset does not provide a representative spectrum of COVID-19 severity levels; most positive cases are severe patients kundu2020might .

Our claim is that the design of a high quality dataset must be carried out in close collaboration between expert radiologists and AI experts. The annotations must follow the same protocol, and representative numbers of all levels of severity, especially the Mild and Moderate levels, must be included.

2.2 DL classification models

Existing related works are not directly comparable, as they consider different combinations of public datasets and different experimental setups. A brief summary of these works is provided in Table 2.

Ref.                         Classes                          Datasets                                       Model                  Partition  Sens.   Acc.
wang2020covidnet             Normal, Pneumonia, COVID         COVIDx 1.0                                     COVIDNet               98% - 2%   87.1%   92.6%
afshar2020covid              Normal, COVID                    COVIDx 1.0                                     COVID-CAPS             98% - 2%   90%     95.7%
ozturk2020automated          No-Findings, COVID               cohen2020covid + wang2017chestx                DarkCovidNet           5-FCV      90.65%  98.08%
                             No-Findings, Pneumonia, COVID                                                                          5-FCV      97.9%   87.02%
karim2020deepcovidexplainer  Normal, Pneumonia, COVID         COVIDx 2.0 + 2010nocovid                       VGG-19 + DenseNet-161  70% - 30%  93%     96.77%
ghoshal2020estimating        Normal, Bacterial, Viral, COVID  cohen2020covid + 2010nocovid                   Bayesian ResNet50V2    80% - 20%  85.71%  89.82%
apostolopoulos2020covid      Normal, Pneumonia, COVID         cohen2020covid + 2010nocovid + other sources   MobileNet              10-FCV     98.66%  96.78%
Table 2: Summary of related works that analyze variations of COVIDx with CNNs.

The studies most closely related to ours, in that they propose models different from the typical ones, are wang2020covidnet and afshar2020covid . In wang2020covidnet , the authors designed a deep network called COVIDNet and affirmed that it reaches an overall accuracy of 92.6%, with a sensitivity of 97.0% in the Normal class, 90.0% in Non-COVID-19 and 87.1% in COVID-19. The authors of a smaller network, called COVID-CAPS afshar2020covid , also claim that their model achieves an accuracy of 95.7%, a sensitivity of 90% and a specificity of 95.8%. These results look too impressive when compared with expert radiologist sensitivity, 69% wong2020frequency . This can be explained by the fact that the used dataset is biased towards severe COVID cases kundu2020might . In addition, the experiments performed in both cited works are not statistically reliable, as they were evaluated on one single partition. The stability of these models, in terms of standard deviation, has not been reported.

DL classification models with explanation approaches: several interesting explanation approaches have been proposed to help inspect the predictions of DL models ghoshal2020estimating ; karim2020deepcovidexplainer , although all their classification models were trained and validated on variations of COVIDx. The authors in karim2020deepcovidexplainer first use an ensemble of two CNNs to predict the class of the input image as Normal, Pneumonia or COVID, then highlight class-discriminating regions in the input CXR image using gradient-guided class activation maps (Grad-CAM++) and layer-wise relevance propagation (LRP). In ghoshal2020estimating , the authors proposed explaining the decision of the classification model to radiologists using different types of saliency maps together with uncertainty estimations (i.e., how certain the model is about its prediction).

3 COVIDGR 1.0: Data acquisition, annotation and organization

It is well known that the larger the database, the more effective the learning of ML algorithms; even when the data is of lower quality, algorithms can actually perform well as long as useful information can be extracted by the model. Alternatively, instead of starting with an extremely large and noisy dataset, one can build a small, smart dataset and then augment it in a way that increases the performance of the model. This approach has proven effective in multiple studies, and is particularly relevant in the medical field, where access to data is heavily protected due to privacy concerns and expert annotation is costly.

In close collaboration with four highly trained radiologists from Hospital Universitario Clínico San Cecilio, Granada, Spain, we first established a protocol on how CXR images are selected and annotated for inclusion in the dataset. A CXR image is annotated as COVID-19 positive if both the RT-PCR test and the expert radiologist confirm that decision within less than 24 hours. CXR images with a positive RT-PCR but no radiological findings are labeled as Normal-PCR+. The involved radiologists annotated the level of severity of positive cases based on the RALE score as Normal-PCR+, Mild, Moderate or Severe. Patients with a positive RT-PCR who were annotated by expert radiologists as Normal are actually asymptomatic patients.

Dataset      Class     #images  women  men  #img. per severity level
COVIDGR-1.0  Negative  377      211    166  -
             COVID-19  377      164    213  Normal-PCR+: 76, Mild: 80, Moderate: 145, Severe: 76
Table 3: A brief summary of the COVIDGR-1.0 dataset. All samples in COVIDGR-1.0 are segmented CXR images considering only the PA view.

COVIDGR-1.0 is organized into two classes, positive and negative. It contains 754 images, distributed into 377 positive and 377 negative cases; more details are provided in Table 3. All the images were obtained from the same equipment and under the same X-ray regime, and only the PosteroAnterior (PA) view is considered. COVIDGR-1.0 will be made available to the scientific community after review at https://github.com/ari-dasci/covidgr.

Figure 2: Flowchart of the proposed COVID-SDNet methodology.

4 COVID-SDNet methodology

In this section, we describe the COVID-SDNet methodology in detail, from the pre-processing that produces smart data, including segmentation and a data transformation that increases the discrimination between the positive and negative classes, to the deep CNN used for classification.

One of the pieces of COVID-SDNet is the CNN-based classifier. We selected ResNet-50 initialized with ImageNet weights, following a transfer learning approach. To adapt this CNN to our problem, we removed the last layer of the network and added a 512-neuron layer with ReLU activation followed by a two- or four-neuron layer (according to the considered number of classes) with softmax activation. All the layers of the network were fine-tuned. We used a batch size of 16 and SGD as optimizer. A minimal sketch of this adaptation is shown below.
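The following PyTorch snippet is a minimal sketch of this adaptation, assuming torchvision's ResNet-50; the learning rate and momentum are our own assumptions, as the paper only reports the optimizer and batch size:

    import torch
    import torch.nn as nn
    from torchvision import models

    def build_classifier(num_classes: int = 4) -> nn.Module:
        """ResNet-50 with ImageNet weights; the original fc layer is replaced
        by a 512-neuron ReLU layer plus a num_classes-neuron output layer."""
        model = models.resnet50(pretrained=True)
        model.fc = nn.Sequential(
            nn.Linear(model.fc.in_features, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, num_classes),
            # Softmax is left to the loss: nn.CrossEntropyLoss expects logits;
            # apply nn.Softmax(dim=1) explicitly at inference time.
        )
        return model

    model = build_classifier(num_classes=4)          # all layers are fine-tuned
    optimizer = torch.optim.SGD(model.parameters(),  # batch size 16 in the paper
                                lr=1e-3, momentum=0.9)  # lr/momentum: assumptions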

COVID-SDNet consists of three main stages: two pre-processing stages that produce quality data (smart data stages), plus the learning and inference process. A flowchart of COVID-SDNet is depicted in Figure 2.

  1. Segmentation: Unnecessary information elimination

    Different CXR equipment brands include different extra information about the patient on the sides and contour of CXR images. The position and size of the patient may also imply the inclusion of more parts of the body, e.g., arms, neck or stomach. As this information may alter the learning of the classification model, we first used the pre-trained U-Net segmentation model provided in Kaggle-seg to extract the smallest rectangle that includes the left and right lungs. Then, to avoid eliminating useful information, we add a small margin of pixels to the left, right, top and bottom sides of the rectangle. An illustrative example of this pre-processing is shown in Figure 3, and a sketch of the cropping step is given below.

    Figure 3: The segmentation process applied in this work.
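    A minimal sketch of this cropping step, assuming the U-Net has already produced a binary lung mask; the margin fraction is our own placeholder, since the exact value used by the authors is not recoverable from this text:

        import numpy as np

        def crop_lungs(image: np.ndarray, lung_mask: np.ndarray,
                       margin: float = 0.05) -> np.ndarray:
            """Crop the smallest rectangle enclosing both lungs, expanded by
            a small pixel margin on every side so borderline findings are kept."""
            ys, xs = np.nonzero(lung_mask)  # pixels predicted as lung by the U-Net
            y0, y1 = ys.min(), ys.max()
            x0, x1 = xs.min(), xs.max()
            dy = int(margin * (y1 - y0))    # margin proportional to the box size
            dx = int(margin * (x1 - x0))
            h, w = image.shape[:2]
            return image[max(0, y0 - dy): min(h, y1 + dy + 1),
                         max(0, x0 - dx): min(w, x1 + dx + 1)]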
  2. Class-inherent transformations Network


    To increase the discrimination capacity of the classification model, we used a Class-inherent transformations (CiT) Network inspired by GANs (Generative Adversarial Networks). This transformation method is actually an array of two generators, G_P and G_N. G_P learns the inherent-class transformations of the positive class P, and G_N learns the inherent-class transformations of the negative class N. In other words, G_P learns the transformations that bring an input image x_i, with i in {P, N}, from its own domain to the P class domain, while G_N learns the transformations that bring the input image x_i from its own domain to the N class domain. The classification loss is introduced in the generators to drive the learning of each specific class transformation. More details about these transformation networks can be found in FuCiTNet20 .

    The architecture of the generators consists of 5 identical residual blocks. Each block has two convolutional layers with 3×3 kernels and 64 feature maps, followed by batch-normalization layers and Parametric ReLU as activation function. The last residual block is followed by a final convolutional layer which reduces the output image channels to 3 to match the input's dimensions. The classifier is a ResNet-18, which consists of an initial convolutional layer with 7×7 kernels and 64 feature maps followed by a max-pooling layer; then 4 blocks of two convolutional layers with 3×3 kernels and 64, 128, 256 and 512 feature maps respectively, followed by an average pooling and one fully connected layer which outputs a vector with as many elements as classes. ReLU is used as activation function.

    Figure 7: Class-inherent transformations applied to a negative sample: (a) original negative sample; (b) negative transformation; (c) positive transformation.

    Once the generators learn the corresponding transformations, the dataset is processed using G_P and G_N. A pair of images is obtained from each input image x: x_P = G_P(x) and x_N = G_N(x), respectively the positively and negatively transformed versions of x. If x belongs to class P, G_P and G_N will produce the positive transformation P+ and the negative transformation P-. If x belongs to class N, G_P and G_N will produce its positive (N+) and negative (N-) transformations. Figure 7 illustrates the transformations applied by G_P and G_N with an example. Notice that these transformations are not meant to be interpretable by the human eye, but rather to help the classification model better distinguish between the different classes.

    The original binary problem is thus converted into a four-class problem, where the new classes are N+, N-, P+ and P-. A sketch of this dataset transformation is shown below.
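    The relabeling into the four classes can be sketched as follows, assuming g_pos and g_neg are the trained CiT generators G_P and G_N (names and class coding are illustrative, not the authors' code):

        import torch

        CLASSES = {"N+": 0, "N-": 1, "P+": 2, "P-": 3}

        @torch.no_grad()
        def build_four_class_dataset(images, labels, g_pos, g_neg):
            """Apply both generators to every image and relabel the outputs
            into N+, N-, P+ and P-. `labels`: 0 = negative, 1 = positive."""
            out_images, out_labels = [], []
            for x, y in zip(images, labels):
                x_p = g_pos(x.unsqueeze(0)).squeeze(0)  # positively transformed
                x_n = g_neg(x.unsqueeze(0)).squeeze(0)  # negatively transformed
                prefix = "P" if y == 1 else "N"
                out_images += [x_p, x_n]
                out_labels += [CLASSES[prefix + "+"], CLASSES[prefix + "-"]]
            return torch.stack(out_images), torch.tensor(out_labels)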

  3. Learning and inference based on the fusion of CNN twins

    The CNN classification model described above (ResNet-50) is trained to predict the new four classes. The outputs for the two transformed images associated with an original image are two four-class probability vectors. Herein, we propose an inference process to fuse these outputs so that, for each pair (x_P, x_N), the prediction for the original image x is either P or N. Let c_P and c_N be the ResNet-50 predictions for x_P and x_N respectively, where p_P and p_N are the corresponding vectors of probabilities of belonging to each class. Then:

    1. If c_P = N+ and c_N = N-, then the final prediction is N.

    2. If c_P = P+ and c_N = P-, then the final prediction is P.

    3. If none of the above applies, the final prediction is the parent class (P or N) of the most confident of the two predictions:

       class(x) = parent( argmax_{c in {N+, N-, P+, P-}} max( p_P(c), p_N(c) ) )     (1)

       where parent(·) maps N+ and N- to N, and P+ and P- to P. A minimal sketch of this fusion is given below.
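    The sketch below implements the three rules, with rule 3 reflecting our reading of Eq. (1), whose body was lost in extraction (keep the parent class of the most confident of the two predictions):

        import torch

        N_POS, N_NEG, P_POS, P_NEG = 0, 1, 2, 3  # N+, N-, P+, P- (coding as above)

        @torch.no_grad()
        def fuse_predictions(p_p: torch.Tensor, p_n: torch.Tensor) -> str:
            """Fuse the four-class softmax outputs for (x_P, x_N) into P or N."""
            c_p, c_n = int(p_p.argmax()), int(p_n.argmax())
            if c_p == N_POS and c_n == N_NEG:
                return "N"                                 # rule 1
            if c_p == P_POS and c_n == P_NEG:
                return "P"                                 # rule 2
            best = c_p if p_p.max() >= p_n.max() else c_n  # rule 3, Eq. (1)
            return "P" if best in (P_POS, P_NEG) else "N"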

5 Experiments and Results

In this section, we (1) provide all the information about the experimental setup used, (2) evaluate two state-of-the-art COVID classification models on our dataset, then analyze (3) the impact of data pre-processing and (4) the impact of the Normal-PCR+ severity level on our approach.

5.1 Experimental setup

Due to the high variation between different executions, we performed five different 5-fold cross-validations in all the experiments. Each experiment uses 80% of COVIDGR-1.0 for training and the remaining 20% for testing. To choose when to stop the training process, we used a random 10% of each training set for validation. In each experiment, a proper set of data-augmentation techniques is carefully selected. All results, in terms of sensitivity, specificity, precision, F1 and accuracy, are presented as the average values and standard deviations of the 25 executions. The used metrics are calculated as follows:

Sensitivity = TP / (TP + FN), Specificity = TN / (TN + FP), Precision = TP / (TP + FP), F1 = 2 · (Precision · Sensitivity) / (Precision + Sensitivity), Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP, TN, FP and FN refer respectively to the number of true positives, true negatives, false positives and false negatives.
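For reference, a direct Python transcription of these standard formulas:

    def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
        """Standard binary classification metrics used throughout Section 5."""
        sensitivity = tp / (tp + fn)  # recall of the positive class
        specificity = tn / (tn + fp)  # recall of the negative class
        precision = tp / (tp + fp)
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        return {"sensitivity": sensitivity, "specificity": specificity,
                "precision": precision, "F1": f1, "accuracy": accuracy}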

5.2 Analysis of COVIDNet and COVID-CAPS

We compare our approach with the two approaches most closely related to ours, COVIDNet wang2020covidnet and COVID-CAPS afshar2020covid .

  • COVIDNet: currently, the authors of this network provide three versions, namely A, B and C, available at COVIDNet . A has the largest number of trainable parameters, followed by B and C. We performed two evaluations of each network, in such a way that the results are comparable to ours.

    • First, we tested COVIDNet-A, COVIDNet-B and COVIDNet-C, pre-trained on COVIDx, directly on our dataset by considering only two classes: Normal (negative), and COVID-19 (positive). The whole dataset (377 positive images and 377 negative images) is evaluated. We report in Table 4 recall and precision results for Normal and COVID-19 classes.

    • Second, we retrained COVIDNet on our dataset. It is important to note that, as only a checkpoint of each model is available, we could not remove the last layer of these networks, which has three neurons. We used five different 5-fold cross-validations. In order to be able to retrain the COVIDNet models, we had to add a third Pneumonia class to our dataset; we randomly selected 377 images from the Pneumonia class of the COVIDx dataset. We used the same hyper-parameters as the ones indicated in their training script, that is, 10 epochs, a batch size of 8 and a learning rate of 0.0002. We changed covid_weight to 1 and covid_percent to 0.33, since we had the same number of images in all the classes. Similarly, we report in Table 4 the recall and precision of our two classes, Normal and COVID-19, and omit the recall and precision of the Pneumonia class. The accuracy reported in the same table only takes into account the images from our two classes. As with our models, we report the mean and standard deviation of all metrics.

    Although we analyzed all three A, B and C variations of COVIDNet, for simplicity we only report the results of the best one.

  • COVID-CAPS: this is a capsule network-based model proposed in afshar2020covid and available at covidcaps2020web . Its architecture is notably smaller than COVIDNet's, which implies a dramatically lower number of trainable parameters. Since the authors also provide a checkpoint with weights trained on the COVIDx dataset, we were able to follow a procedure similar to the one used with COVIDNet:

    • First, we tested the weights pretrained on COVIDx on the COVIDGR-1.0 dataset. COVID-CAPS is designed to predict two classes, so we reused the same architecture with the new dataset and computed the evaluation metrics shown in Table 4.

    • Second, the COVID-CAPS architecture was retrained on the COVIDGR-1.0 dataset. This process fine-tunes the weights to improve class separation. The retraining was performed using the same setup and hyper-parameters reported by the authors: the Adam optimizer across 100 epochs with a batch size of 16. Class weights were omitted, as with COVIDNet, since our dataset contains balanced classes in training as well as in test. Evaluation metrics are computed over the test subsets of the five 5-fold cross-validations and summarized in Table 4.

Class                            Negative                  Positive (COVID-19)       Accuracy
Metric                           Specificity  Precision    Sensitivity  Precision
COVIDNet-CXR A wang2020covidnet  0.27         20           99.74        33.78        50
Retrained COVIDNet-CXR A         89.37±8.88   60.93±6.20   41.57±17.98  82.34±8.82   65.47±5.53
COVID-CAPS afshar2020covid       26.58        50.78        74.25        50.27        50.41
Retrained COVID-CAPS             64.84±10.48  61.76±6.40   57.89±15.77  62.21±4.86   61.37±5.24
Table 4: COVIDNet and COVID-CAPS results on our dataset.

The results in Table 4 show that COVIDNet and COVID-CAPS trained on COVIDx overestimate the COVID-19 class in our dataset, i.e., most images are classified as positive, resulting in very high sensitivities at the cost of a very low positive predictive value. When COVIDNet and COVID-CAPS are re-trained on COVIDGR-1.0, they achieve a slightly better overall accuracy and a better balance between sensitivity and specificity, although they seem to acquire a bias favoring the negative class. In general, none of these models performs adequately for detecting the disease from the CXR images in our dataset.

5.3 Results and Analysis of COVID prediction

The results of the baseline COVID classification model considering all the levels of severity, with and without segmentation, and of COVID-SDNet are shown in Table 5.

Class         N                                       P                                        Accuracy
Metric        Specificity  Precision   F1             Sensitivity  Precision   F1
COVIDNet-CXR  89.37±8.88   60.93±6.20  71.84±2.94     41.57±17.98  82.34±8.82  52.27±14.89     65.47±5.53
COVID-CAPS    64.84±10.48  61.76±6.40  62.44±4.97     57.89±15.77  62.21±4.86  58.81±10.65     61.37±5.24
Without seg.  75.25±6.78   71.04±3.13  72.84±2.87     68.95±6.27   74.04±4.45  71.09±2.88      72.10±2.31
With seg.     71.37±9.25   73.89±5.41  71.97±4.39     73.68±9.33   72.59±4.39  72.63±4.19      72.54±3.19
COVID-SDNet   79.20±6.29   76.58±3.92  77.67±3.21     75.43±5.91   78.82±5.04  76.82±3.08      77.31±2.92
Table 5: Results of COVID prediction using ResNet-50 with and without segmentation, COVID-SDNet, retrained COVIDNet-CXR A and retrained COVID-CAPS. All four levels of severity in the positive class are taken into account.

In general, COVID-SDNet achieves better and more stable results than the rest of the approaches. In particular, COVID-SDNet achieved the highest balance between specificity and sensitivity, with an F1 of 77.67% ± 3.21% in the negative class and 76.82% ± 3.08% in the positive class. Most importantly, COVID-SDNet achieved the highest accuracy, 77.31% ± 2.92%, with a specificity of 79.20% ± 6.29% and a sensitivity of 75.43% ± 5.91%. When comparing the results of the baseline classification model with and without segmentation, we can observe that the use of segmentation substantially improves the sensitivity, which is the most important criterion for a triage system. This can be explained by the fact that segmentation allows the model to focus on the most important parts of the CXR image.

Analysis per severity level

To determine which levels are the hardest to distinguish for the best approach, we analyzed the accuracy per severity level S, with S in {Normal-PCR+, Mild, Moderate, Severe}. The results are shown in Table 6.

S (Severity level)  accuracy(S) (%)
Normal-PCR+         38.68±2.44
Mild                66.5±8.04
Moderate            88.14±2.02
Severe              97.37±1.86
Table 6: Results of COVID-SDNet per severity level.

As can be seen from these results, COVID-SDNet correctly distinguishes the Moderate and Severe levels, with accuracies of 88.14% ± 2.02% and 97.37% ± 1.86% respectively. This is due to the fact that Moderate and Severe CXR images contain more salient visual features than Mild and Normal-PCR+ ones, which eases the classification task. Normal-PCR+ and Mild cases are much more difficult to identify, as they contain few or no visual features. These results are consistent with the clinical studies in weinstock2020chest and wong2020frequency , which report that expert sensitivity is very low at the Normal-PCR+ and Mild infection levels. Recall that the expert eye does not see any visual signs in Normal-PCR+ cases although the RT-PCR is positive; those cases are actually considered asymptomatic patients.

5.4 Analysis of the impact of Normal-PCR+

To analyze the impact of the Normal-PCR+ class on COVID-19 classification, we trained and evaluated the baseline model, the COVID-SDNet classification stage, COVIDNet-CXR-A and COVID-CAPS on COVIDGR after eliminating Normal-PCR+. The results are summarized in Table 7.

Class         N                                       P                                        Accuracy
Metric        Specificity  Precision   F1             Sensitivity  Precision    F1
COVIDNet-CXR  90.14±9.73   63.24±7.71  73.50±3.97     50.51±18.31  78.75±12.81  59.25±14.70    70.32±5.96
COVID-CAPS    72.16±7.04   66.01±5.94  68.64±4.42     61.91±10.97  69.16±5.29   64.81±7.44     67.04±5.03
With seg.     80.28±6.98   77.12±4.93  78.33±3.36     75.47±8.11   79.78±4.87   77.16±4.16     77.87±3.29
COVID-SDNet   81.06±5.32   81.58±4.76  81.15±3.34     81.33±5.94   81.34±4.17   81.16±3.56     81.20±3.32
Table 7: Results of the baseline classification model with segmentation, COVID-SDNet, retrained COVIDNet-CXR-A and retrained COVID-CAPS. Only three levels of severity are considered: Mild, Moderate and Severe.

Overall, all the approaches systematically provide better results when Normal-PCR+ is eliminated from the training and test processes, including COVIDNet-CXR-A and COVID-CAPS. In particular, COVID-SDNet remains the best and most stable approach.

Analysis per severity level

A further analysis of the accuracy at each severity level (see Table 8) shows that eliminating Normal-PCR+ decreases the accuracy in the Mild and Moderate severity levels by 10% and 3.75% respectively.

S (Severity level)  accuracy(S) (%)
Mild                59.5±3.22
Moderate            84.83±2.51
Severe              97.63±0.98
Table 8: Results of COVID-SDNet per severity level, without considering Normal-PCR+.

These results show that, although Normal-PCR+ is the hardest level to predict, its presence improves the accuracy of the lower severity levels, especially the Mild level.

6 Inspection of model’s decision

Figure 11: Original positive (Mild) CXR image (a), heatmap showing the parts of the input image that triggered the positive prediction (b), and counterfactual explanation (c).
Figure 15: Original positive (Moderate) CXR image (a), heatmap showing the parts of the input image that triggered the positive prediction (b), and counterfactual explanation (c).
Figure 19: Original positive (Severe) CXR image (a), heatmap showing the parts of the input image that triggered the positive prediction (b), and counterfactual explanation (c).
Figure 23: Original negative CXR image (a), heatmap explaining the parts of the input image that triggered the counterfactual explanation (b) and the actual negative prediction (c).

Automatic DL diagnosis systems alone are not yet mature enough to replace expert radiologists. To help clinicians make decisions, these tools must be interpretable, so that clinicians can decide whether or not to trust the model arrieta2020explainable . We inspect what led our model to make a decision by showing the regions of the input image that triggered that decision, along with its counterfactual explanation, i.e., the parts that support the opposite class. We adapted the Grad-CAM method selvaraju2017grad to explain the decisions for the negative and positive classes, as sketched below.
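The sketch below shows a hook-based Grad-CAM in PyTorch; it is our own minimal implementation of the technique, not the authors' code. Calling it with the opposite class index yields the counterfactual explanation:

    import torch
    import torch.nn.functional as F

    def grad_cam(model, x, target_class, conv_layer):
        """Heat-map of `conv_layer` activations weighted by the gradient of
        the `target_class` score, upsampled to the input resolution."""
        feats, grads = {}, {}
        h1 = conv_layer.register_forward_hook(
            lambda m, i, o: feats.update(a=o))
        h2 = conv_layer.register_full_backward_hook(
            lambda m, gi, go: grads.update(a=go[0]))
        try:
            model.eval()
            score = model(x)[0, target_class]
            model.zero_grad()
            score.backward()
            weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # GAP of grads
            cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
            cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                                align_corners=False)
            return (cam / (cam.max() + 1e-8)).squeeze().detach()
        finally:
            h1.remove()  # always detach the hooks
            h2.remove()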

Figures 11, 15 and 19 show (a) the original CXR image, (b) a visual explanation by means of a heat-map that highlights the regions/pixels which led the model to output the actual prediction, and (c) its counterfactual explanation, a heat-map that highlights the regions/pixels which had the highest impact on predicting the opposite class. The larger high-intensity areas in the heat-map determine the final class. Note that, in Figure 23, panel (b) represents the counterfactual explanation and panel (c) the explanation of the actual decision.

As expected, the negative and positive interpretations are complementary, i.e., the areas which triggered the correct decision are, in most cases, opposite to the areas that pushed the decision towards the other class. In CXR images with different severity levels, the heat-maps correctly point out opaque regions due to different levels of infiltrates and consolidations, and also to osteoarthritis.

In particular, in Figure 11(b), the red areas in the right lung point out a region with infiltrates, as well as osteoarthritis in the spine region. Figure 15(b) correctly shows moderate infiltrates in the right lower and lower-middle lung fields, in addition to a dilation of the ascending aorta and aortic arch (red color in the center). Figure 11(c) shows normal upper-middle fields of both lungs (less marked on the left due to the aortic dilation). Figure 19(b) indicates an important bilateral pulmonary involvement with consolidations.

As can be observed in Figure 23(c), the explanation of the negative class correctly highlights a symmetric bilateral pattern that occupies a large lung volume, especially in regions with high density. In fact, a very similar pattern is shown in the counterfactual explanations of the positive class in Figures 11(c), 15(c) and 19(c).

7 Conclusions

This paper introduced a dataset with high clinical value, named COVIDGR. COVIDGR includes the four main COVID severity levels identified by a recent radiological study wong2020frequency . We proposed a methodology, called COVID-SDNet, that combines segmentation, data augmentation and data transformation. The obtained results show the high generalization capacity of COVID-SDNet, especially on the severe and moderate levels, as they include important visual features. The existence of few or no visual features in Mild and Normal-PCR+ cases reduces the opportunities for improvement.

As main conclusions, we must highlight that COVID-SDNet can be used in a triage system to detect especially moderate and severe patients. Finally, we must also mention that a more robust and accurate triage system could be built by fusing our approach with other approaches, such as the one proposed in CohenAnnot .

As future work, we are working on enriching COVIDGR with more CXR images coming from different hospitals. We are planning to explore the use of additional clinical information along with CXR images to improve the prediction performance.

Acknowledgments

This work was supported by the project DeepSCOP-Ayudas Fundación BBVA a Equipos de Investigación Científica en Big Data 2018 and the Spanish Ministry of Science and Technology under the project TIN2017-89517-P. S. Tabik was supported by the Ramon y Cajal Programme (RYC-2015-18136). A. Gómez-Ríos was supported by the FPU Programme 998758-2016. E.G. was supported by the European Research Council (ERC Grant agreement 647038 [BIODESERT]).

Ethics

This project is approved by the Provincial Research Ethics Committee of Granada.

References