
Dual-Sampling Attention Network for Diagnosis of COVID-19 from Community Acquired Pneumonia

The coronavirus disease (COVID-19) is rapidly spreading all over the world, and has infected more than 1,436,000 people in more than 200 countries and territories as of April 9, 2020. Detecting COVID-19 at an early stage is essential to deliver proper healthcare to patients and to protect the uninfected population. To this end, we develop a dual-sampling attention network to automatically diagnose COVID-19 from community acquired pneumonia (CAP) in chest computed tomography (CT). In particular, we propose a novel online attention module with a 3D convolutional neural network (CNN) to focus on the infection regions in the lungs when making diagnostic decisions. Note that the sizes of the infection regions are distributed unevenly between COVID-19 and CAP, partially due to the fast progress of COVID-19 after symptom onset. Therefore, we develop a dual-sampling strategy to mitigate the imbalanced learning. Our method is evaluated on (to the best of our knowledge) the largest multi-center CT data for COVID-19, from 8 hospitals. In the training-validation stage, we collect 2186 CT scans from 1588 patients for 5-fold cross-validation. In the testing stage, we employ another independent large-scale testing dataset including 2796 CT scans from 2057 patients. Results show that our algorithm can identify the COVID-19 images with an area under the receiver operating characteristic curve (AUC) value of 0.944, accuracy of 87.5%, sensitivity of 86.9%, specificity of 90.1%, and F1-score of 82.0%. With this performance, the proposed algorithm could potentially aid radiologists with COVID-19 diagnosis from CAP, especially in the early stage of the COVID-19 outbreak.



I Introduction

The disease caused by the novel coronavirus, or Coronavirus Disease 2019 (COVID-19), is quickly spreading globally. It has infected more than 1,436,000 people in more than 200 countries and territories as of April 9, 2020 [world2020coronavirus80]. On February 12, 2020, the World Health Organization (WHO) officially named the disease caused by the novel coronavirus as Coronavirus Disease 2019 (COVID-19) [world2020director]. The number of COVID-19 patients is now increasing dramatically every day around the world [world2020coronavirus]. Compared with the prior Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS), COVID-19 has spread to more places and caused more deaths, despite its relatively lower fatality rate [wu2020characteristics, mahase2020coronavirus]. Given the pandemic scale of COVID-19, it is important to detect the disease early, which could help slow viral transmission and thus contain the disease.

In clinics, real-time reverse-transcription–polymerase-chain-reaction (RT-PCR) is the gold standard for making a definitive diagnosis of COVID-19 infection [zu2020coronavirus]. However, the high false negative rate [chan2020familial] and the unavailability of RT-PCR assays in the early stage of an outbreak may delay the identification of potential patients. Due to the highly contagious nature of the virus, such delays constitute a high risk of infecting a larger population. At the same time, thoracic computed tomography (CT) is relatively easy to perform and can produce a fast diagnosis [ai2020correlation]. For example, almost all COVID-19 patients have some typical radiographic features in chest CT, including ground-glass opacities (GGO), multifocal patchy consolidation, and/or interstitial changes with a peripheral distribution [chung2020ct]. Thus chest CT has been recommended as a major tool for clinical diagnosis, especially in hard-hit regions such as Hubei, China [zu2020coronavirus]. Considering the need for high-throughput screening by chest CT and the workload for radiologists, especially during the outbreak, we design a deep-learning-based method to automatically diagnose COVID-19 infection from community acquired pneumonia (CAP) infection.

Fig. 1: Examples of CT images and infection segmentations of two COVID-19 patients (upper left) and two CAP patients (bottom left), and the size distribution of the infection regions of COVID-19 and CAP in our training-validation set (right). The segmentation results of the lungs and infection regions are obtained from an established VB-Net toolkit [shan+2020lung]. The sizes of the infection regions are denoted by the volume ratio of the segmented infection regions and the whole lung. Compared with CAP, the COVID-19 cases tend to have more severe infections in terms of the infection region sizes.

With the development of deep learning [lecun2015deep, krizhevsky2012imagenet, he2016deep, huang2017densely], the technology has found a wide range of applications in medical image processing, including disease diagnosis [wang2017chestx] and organ segmentation [ronneberger2015u]. The convolutional neural network (CNN) [lecun1989backpropagation], one of the most representative deep learning technologies, has been applied to reading and analyzing CT images in many recent studies [pang2019automatic, park2019lung]. For example, Koichiro et al. use a CNN for differentiation of liver masses on dynamic contrast agent–enhanced CT images [yasaka2018deep]. Also, some studies focus on the diagnosis of lung diseases in chest CT, e.g., pulmonary nodules [huang2018added, ardila2019end] and pulmonary tuberculosis [lakhani2017deep]. Although deep learning has achieved remarkable performance for abnormality diagnosis in medical images [wang2017chestx, irvin2019chexpert, cruz2013deep], physicians have concerns, especially about the lack of model interpretability and understanding [zhang2018visual], which is important for the diagnosis of COVID-19. To provide more insight into model decisions, the class activation mapping (CAM) [zhou2016learning] and gradient-weighted class activation mapping (Grad-CAM) [selvaraju2017grad] methods have been proposed to produce localization heatmaps highlighting important regions that are closely associated with the predicted results.

In this study, we propose a dual-sampling attention network to classify COVID-19 and CAP infection. To focus on the lung, our method leverages a lung mask to suppress the image context of non-lung regions in chest CT. At the same time, we refine the attention of the deep learning model through an online mechanism, in order to better focus on the infection regions in the lung. In this way, the model facilitates interpreting and explaining the evidence for the automatic diagnosis of COVID-19. The experimental results also demonstrate that the proposed online attention refinement can effectively improve classification performance.

In our work, an important observation is that COVID-19 cases usually have more severe infection than CAP cases [shi2020large], although some COVID-19 cases and CAP cases do have similar infection sizes. To illustrate it, we use an established VB-Net toolkit [shan+2020lung] to automatically segment lungs and pneumonia infection regions on all the cases in our training-validation (TV) set (with details of our TV set provided in Section IV), and show the distribution of the ratios between the infection regions and lungs in Fig. 1. We can see the imbalanced distribution of the infection size ratios in both COVID-19 and CAP data. In this situation, the conventional uniform sampling on the entire dataset to train the network could lead to unsatisfactory diagnosis performance, especially concerning the limited cases of COVID-19 with small infections and also the limited cases of CAP with large infections. To this end, we train the second network with the size-balanced sampling strategy, by sampling more cases of COVID-19 with small infections and also more cases of CAP with large infections within mini-batches. Finally, we apply ensemble learning to integrate the networks of uniform sampling and size-balanced sampling to get the final diagnosis results, by following the dual-sampling strategy.

In summary, the contributions of our work are threefold:

  • We propose an online module that utilizes the segmented pneumonia infection regions to refine the attention of the network. This ensures that the network focuses on the infection regions and increases the adoption of visual attention for model interpretability and explainability.

  • We propose a dual-sampling strategy to train the network, which further alleviates the imbalanced distribution of the sizes of pneumonia infection regions.

  • To our knowledge, we have used the largest multi-center CT data in the world for evaluating automatic COVID-19 diagnosis. In particular, we conduct extensive cross-validations in a TV dataset of 2186 CT scans from 1588 patients. Moreover, to better evaluate the performance and generalization ability of the proposed method, a large independent testing set of 2796 CT scans from 2057 patients is also used. Experimental results demonstrate that our algorithm is able to identify the COVID-19 images with the area under the receiver operating characteristic curve (AUC) value of 0.944, accuracy of 87.5%, sensitivity of 86.9%, specificity of 90.1%, and F1-score of 82.0%.

II Related Works

II-A Computer-Assisted Pneumonia Diagnosis

Chest X-ray (CXR) is one of the first-line imaging modalities to diagnose pneumonia, which manifests as increased opacity [franquet2018imaging]. CNNs have been successfully applied to pneumonia diagnosis in CXR images [wang2017chestx, rajpurkar2017chexnet]. Since the release of the Radiological Society of North America (RSNA) pneumonia detection challenge dataset [challenge2018radiological], object detection methods (e.g., RetinaNet [lin2017focal] and Mask R-CNN [he2017mask]) have been used for pneumonia localization in CXR images. At the same time, CT has been used as a standard procedure in the diagnosis of lung diseases [wielputz2014radiological]. An automated classification method has been proposed to use regional volumetric texture analysis for usual interstitial pneumonia diagnosis in high-resolution CT [depeursinge2015automated]. For COVID-19, GGO and consolidation along the subpleural area of the lung are the typical radiographic features [chung2020ct]. Chest CT, especially high-resolution CT, can detect small areas of GGO [macmahon2017guidelines].

Some recent works have focused on distinguishing COVID-19 from other pneumonia in CT images [wang2020deep, xu2020deep, song2020deep]. This requires the chest CT images to reveal some typical features, including GGO, multifocal patchy consolidation, and/or interstitial changes with a peripheral distribution [chung2020ct]. Wang et al. [wang2020deep] propose a 2D CNN to classify between COVID-19 and other viral pneumonia based on manually delineated regions. Xu et al. [xu2020deep] use a V-Net model to segment the infection region and apply a ResNet18 network for the classification. Ying et al. [song2020deep] use a ResNet50 network to process all the slices of each 3D chest CT image and form the final prediction for that image. However, all these methods are evaluated on small datasets. In this paper, we have collected 4982 CT scans from 3645 patients, provided by 8 collaborating hospitals. To the best of our knowledge, it is the largest multi-center dataset for COVID-19 to date, which allows a more convincing evaluation of the effectiveness of the method.

Note that, in the context of pneumonia diagnosis, lung segmentation is often an essential preprocessing step in analyzing chest CT images to assess pneumonia. In the literature, Alom et al. [alom2018recurrent] utilize U-net, residual network and recurrent CNN for lung lesion segmentation. A convolutional-deconvolutional capsule network has also been proposed for pathological lung segmentation in CT images. In this paper, we use an established VB-Net toolkit for lung segmentation, which has been reported with high Dice similarity coefficient of 98% in evaluation [shan+2020lung]. Also, this VB-Net toolkit achieves Dice similarity coefficient of 92% between automatically and manually delineated pneumonia infection regions, showing the state-of-the-art performance [9069255]. For more related works, a recent review paper of automatic segmentation methods on COVID-19 could be found in [9069255].

Fig. 2: Illustration of the pipeline of the proposed method, which includes two steps. 1) We train two 3D ResNet34 networks [hara2018can] with different sampling strategies. The online attention mechanism generates attention maps during training, which refer to the segmented infection regions to refine the attention localization. 2) We use ensemble learning to integrate predictions from the two trained networks. In this figure, “Attention RN34 + US” means the 3D ResNet34 (RN34) with attention module and uniform sampling (US) strategy, while “Attention RN34 + SS” means the 3D ResNet34 with attention module and size-balanced sampling (SS) strategy. “GAP” indicates the global average pooling layer, and “FC” indicates the fully connected layer. “1×1×1 Conv” refers to the convolutional layer with a 1×1×1 kernel, which takes the parameters from the fully connected layer as its kernel weights. “MSE Loss” refers to the mean square error function.

II-B Class Re-sampling Strategies

When training networks on datasets with a long-tailed data distribution, the common paradigm of sampling the entire dataset uniformly runs into problems [van2017devil]. In such datasets, some classes contain relatively few samples, and the information in these cases may be ignored by the network if uniform sampling is applied. To address this, some class re-sampling strategies have been proposed in the literature [zhou2019bbn, buda2018systematic, shen2016relay, he2009learning, japkowicz2002class]. The aim of these methods is to adjust the numbers of examples from different classes within mini-batches, which achieves better performance on long-tailed datasets. Generally, class re-sampling strategies can be categorized into two groups, i.e., over-sampling by repeating data for minority classes [zhou2019bbn, buda2018systematic, shen2016relay] and under-sampling by randomly removing samples so that all classes have the same number of samples [buda2018systematic, he2009learning, japkowicz2002class]. COVID-19 data is precious and hard to collect, so discarding data is not a good choice. In this study, we adapt the over-sampling strategy [zhou2019bbn] to COVID-19 cases with small infections and CAP cases with large infections to form a size-balanced sampling method, which can better balance the distribution of the infection regions of COVID-19 and CAP cases within mini-batches. However, over-sampling may lead to over-fitting on these minority classes [cui2019class, chawla2002smote]. We thus propose the dual-sampling strategy to integrate results from the two networks trained with uniform sampling and size-balanced sampling, respectively.

Fig. 3: The pneumonia infection region (upper right) and the lung segmentation (bottom right) from the VB-Net toolkit [shan+2020lung].

II-C Attention Mechanism

Attention mechanisms have been widely used in many deep networks, and can be roughly divided into two types: 1) activation-based attention [wang2018non, fu2019dual, hu2018squeeze] and 2) gradient-based attention [zhou2016learning, selvaraju2017grad]. Activation-based attention usually serves as an inserted module that refines the hidden feature maps during training, which makes the network focus on the important regions. Within activation-based attention, channel-wise attention assigns weights to each channel in the feature maps [hu2018squeeze], while position-wise attention produces heatmaps of importance for each pixel of the feature maps [wang2018non, fu2019dual]. The most common gradient-based attention methods are CAM [zhou2016learning] and Grad-CAM [selvaraju2017grad], which reveal the important regions influencing the network prediction. These methods are normally conducted offline and provide a form of model interpretability during the inference stage. Recently, some studies [fukui2019attention, li2018tell] argue that the gradient-based methods can be developed into an online module during training for better localization. In this study, we extend gradient-based attention into an online trainable component and to the scenario of 3D input. The proposed attention module utilizes the segmented pneumonia infection regions to ensure that the network can make decisions based on these infection regions.

III Method

The overall framework is shown in Fig. 2. The input for the network is the 3D CT image masked to the lungs only. We use an established VB-Net toolkit [shan+2020lung] to segment the lungs for all CT images, and perform auto-contouring of possible infection regions as shown in Fig. 3. The VB-Net toolkit is a modified network that combines V-Net [milletari2016v] with bottleneck layers to reduce and integrate feature map channels. The toolkit is capable of segmenting the infected regions as well as the lung fields, achieving a Dice similarity coefficient of 92% between automatically and manually delineated infection regions [shan+2020lung]. By labeling all voxels within the segmented lung regions as 1 and the rest as 0, we obtain the corresponding lung mask, and then obtain the input image by masking the original CT image with this lung mask.
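The masking step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `apply_lung_mask` and the toy volume are hypothetical, and the sketch assumes that masked-out voxels are simply set to zero, which the text does not spell out.

```python
import numpy as np

def apply_lung_mask(ct, lung_mask):
    """Keep voxels inside the lung mask (label 1) and zero out the rest (label 0)."""
    if ct.shape != lung_mask.shape:
        raise ValueError("CT volume and lung mask must have the same shape")
    # Binarise the mask so that any positive label counts as lung.
    binary = (lung_mask > 0).astype(ct.dtype)
    return ct * binary

# Toy 2x2x2 volume (hypothetical values); only the first slice is "lung".
ct = np.arange(8, dtype=np.float32).reshape(2, 2, 2) + 1.0
mask = np.zeros((2, 2, 2), dtype=np.uint8)
mask[0] = 1
masked = apply_lung_mask(ct, mask)
```

In this toy case the first slice is kept unchanged and the second slice becomes all zeros.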

As shown in Fig. 2, the training pipeline of our method consists of two stages: 1) using different sampling strategies to train two 3D ResNet34 models [hara2018can] with the online attention module; 2) training an ensemble learning layer to integrate the predictions from the two models. The details of our method are introduced in the following subsections.

III-A Network

We use the 3D ResNet34 architecture [hara2018can] as the backbone network. It is the 3D extension of the residual network [he2016deep], which uses 3D kernels in all the convolutional layers. In 3D ResNet34, we set the stride of each dimension to 1 in the last residual block instead of 2. This makes the resolution of the feature maps before the global average pooling (GAP) [lin2013network] operation 1/16 of the input CT image in each dimension. Compared with downsampling the input image by a factor of 32 in each dimension, as in the original 3D ResNet34, this greatly improves the quality of the generated attention maps, since they are based on higher-resolution feature maps.
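The effect of the stride change on feature-map resolution is simple arithmetic, sketched below. The per-stage stride layout is an assumption based on the usual ResNet design (stem convolution, max pooling, four residual stages); only the overall factor of 32 for the original network comes from the text. The input size is a hypothetical example.

```python
from math import ceil, prod

# Assumed per-stage strides of a standard ResNet-style backbone.
ORIGINAL_STRIDES = {"stem_conv": 2, "max_pool": 2, "layer1": 1,
                    "layer2": 2, "layer3": 2, "layer4": 2}
# Modification discussed above: stride 1 in the last residual block.
MODIFIED_STRIDES = {**ORIGINAL_STRIDES, "layer4": 1}

def total_stride(strides):
    """Overall downsampling factor along one axis."""
    return prod(strides.values())

def feature_map_size(input_size, strides):
    """Spatial size of the feature maps before global average pooling."""
    factor = total_stride(strides)
    return tuple(ceil(s / factor) for s in input_size)

input_size = (128, 256, 256)  # hypothetical input volume, not from the paper
original = feature_map_size(input_size, ORIGINAL_STRIDES)
modified = feature_map_size(input_size, MODIFIED_STRIDES)
```

For this hypothetical input, the modified network keeps twice as many feature-map positions per axis as the original one, which is what gives the attention maps their finer resolution.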

III-B Online attention module

To exhaustively learn all features that are important for classification, and also to produce the corresponding attention maps, we use an online attention mechanism of 3D class activation mapping (CAM). The key idea of CAM [zhou2016learning, selvaraju2017grad, fukui2019attention] is to back-propagate weights of the fully-connected layer onto the convolutional feature maps for generating the attention maps. In this study, we extend this offline operation to become an online trainable component for the scenario of 3D input. Let $F$ denote the feature maps before the GAP operation and let $W$ denote the weight matrix of the fully-connected layer. To make our attention generation procedure trainable, we use $W$ as the kernel of a $1 \times 1 \times 1$ convolution layer and apply a ReLU layer [nair2010rectified] to generate the attention feature map $A$ as:

$$A = \mathrm{ReLU}\big(\mathrm{conv}(F, W)\big) \tag{1}$$

where $A$ has the shape $X \times Y \times Z$, and $X$, $Y$, $Z$ are each $1/16$ of the corresponding size of the input CT image. Given the attention feature map $A$, we first upsample it to the input image size, then normalize it to have intensity values between 0 and 1, and finally perform sigmoid for soft masking [li2018tell], as follows:

$$T(A) = \frac{1}{1 + \exp\big(-\alpha (A - \beta)\big)} \tag{2}$$

where $\alpha$ and $\beta$ are fixed constants. $T(A)$ is the generated attention map of this online attention module, where $A$ is defined in Eq. 1. During the training, the parameters in the $1 \times 1 \times 1$ convolution layer are always copied from the fully-connected layer and only updated by the binary cross entropy (BCE) loss for the classification task.
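A minimal NumPy sketch of the attention computation in Eq. 1 and Eq. 2 may help make the data flow concrete. It is not the authors' implementation: it assumes a single-logit classifier (so the fully-connected weights form one vector over channels), uses nearest-neighbour upsampling with integer factors, and the sigmoid constants `alpha=10.0` and `beta=0.5` are placeholder values chosen only for illustration.

```python
import numpy as np

def attention_map(features, fc_weight, out_shape, alpha=10.0, beta=0.5):
    """features: (C, X, Y, Z) maps before GAP; fc_weight: (C,) FC weights."""
    # A 1x1x1 convolution with the FC weights is a weighted sum over channels.
    a = np.tensordot(fc_weight, features, axes=([0], [0]))  # shape (X, Y, Z)
    a = np.maximum(a, 0.0)                                   # ReLU
    # Nearest-neighbour upsampling to the input size (assumed interpolation).
    for axis, target in enumerate(out_shape):
        a = np.repeat(a, target // a.shape[axis], axis=axis)
    # Min-max normalisation to [0, 1]; an all-constant map becomes all zeros.
    span = a.max() - a.min()
    a = (a - a.min()) / span if span > 0 else np.zeros_like(a)
    # Sigmoid soft masking.
    return 1.0 / (1.0 + np.exp(-alpha * (a - beta)))

# Toy example: 2 channels on a 2x2x2 grid, upsampled to 4x4x4.
features = np.zeros((2, 2, 2, 2))
features[0, 0, 0, 0] = 1.0
fc_weight = np.array([1.0, -1.0])
att = attention_map(features, fc_weight, out_shape=(4, 4, 4))
```

In the toy example only one low-resolution voxel is activated, so the upsampled attention map is close to 1 inside the corresponding 2x2x2 block and close to 0 elsewhere.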

III-C Size-balanced Sampling

The main idea of size-balanced sampling is to repeat the data sampling for the COVID-19 cases with small infections and also the CAP cases with large infections in each mini-batch during training. Normally, we use uniform sampling over the entire dataset for network training (i.e., the “Attention RN34 + US” branch in Fig. 2). Specifically, each sample in the training dataset is fed into the network only once, with equal probability, within one epoch. Thus, the model can review the entire dataset while maintaining the intrinsic data distribution. Due to the imbalance of the distribution of infection size, we train a second network via the size-balanced sampling strategy (i.e., the “Attention RN34 + SS” branch). It aims to boost the sampling probability of the small-infection-area COVID-19 and large-infection-area CAP cases in each mini-batch. To this end, we split the data into 4 groups according to the volume ratio of the pneumonia infection regions to the lung: 1) small-infection-area COVID-19, 2) large-infection-area COVID-19, 3) small-infection-area CAP, and 4) large-infection-area CAP. For COVID-19, we define the cases whose ratio falls below a fixed threshold as small-infection-area COVID-19, and the rest as large-infection-area COVID-19. For CAP, we define the cases whose ratio exceeds a fixed threshold as large-infection-area CAP and the rest as small-infection-area CAP. We denote the numbers of samples in the 4 groups as $N_1, N_2, N_3, N_4$, following the group order above. Then, inspired by the class re-sampling strategy in [zhou2019bbn], we define a weight for each group, $w_1, w_2, w_3, w_4$. Since the numbers of small-infection-area COVID-19 and large-infection-area CAP cases are relatively small, the weights $w_1$ and $w_4$ are higher than 1. The values of these two weights are approximately 1.5 in each training fold. Then, the sampling probabilities of the 4 groups are calculated as the weight of each group divided by the sum of all weights, $p_i = w_i / \sum_{j=1}^{4} w_j$. In a mini-batch, we randomly select a group according to the probabilities $p_i$, and uniformly pick a sample from the selected group. This strategy makes it more likely to sample cases from the two groups of 1) COVID-19 with small infections and 2) CAP with large infections. We apply the size-balanced sampling strategy to all mini-batches when training the “Attention RN34 + SS” model.
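The group-level sampling procedure can be illustrated with the standard library alone. The sketch below assumes inverse-frequency weights of the form N_max / N_g, in the spirit of the re-sampling scheme cited above; the exact weight definition is not recoverable from this text, so treat it as an assumption. The group sizes and scan identifiers are made up, chosen so that the two minority weights come out at 1.5.

```python
import random

def group_probabilities(counts):
    """Sampling probability per group: weight divided by the sum of weights."""
    n_max = max(counts.values())
    # Assumed weight definition: inverse frequency relative to the largest group.
    weights = {g: n_max / n for g, n in counts.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

def sample_batch(groups, batch_size, rng):
    """Pick a group by probability, then a sample uniformly within that group."""
    probs = group_probabilities({g: len(items) for g, items in groups.items()})
    names = list(groups)
    chosen = rng.choices(names, weights=[probs[g] for g in names], k=batch_size)
    return [rng.choice(groups[g]) for g in chosen]

# Hypothetical group sizes, not taken from the paper.
sizes = {"covid_small": 400, "covid_large": 600, "cap_small": 600, "cap_large": 400}
groups = {g: [f"{g}_{i}" for i in range(n)] for g, n in sizes.items()}
probs = group_probabilities(sizes)
batch = sample_batch(groups, batch_size=20, rng=random.Random(0))
```

With these made-up sizes the minority groups each receive probability 0.3 and the majority groups 0.2, so small-infection COVID-19 and large-infection CAP cases appear more often per mini-batch than under uniform sampling.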

III-D Objective Function

Two losses are used to train the “Attention RN34 + US” and “Attention RN34 + SS” models, i.e., the classification loss $L_{cls}$ and the extra attention loss $L_{ex}$ for COVID-19 cases. We adopt the binary cross entropy as the COVID-19/CAP classification loss $L_{cls}$. For the COVID-19 cases, given the pneumonia infection segmentation mask $M$, we can use it to directly refine the attention maps from our model, and $L_{ex}$ is thus formulated as:

$$L_{ex} = \frac{\sum_{ijk} \big(T(A)_{ijk} - M_{ijk}\big)^2}{\sum_{ijk} T(A)_{ijk} + \sum_{ijk} M_{ijk}} \tag{3}$$

where $T(A)$ is the attention map generated from our online attention module (Eq. 2), and $i$, $j$, and $k$ index the $(i, j, k)$-th voxel in the attention map. The proposed $L_{ex}$ is modified from the traditional mean square error (MSE) loss, using the sum of the attention map and the corresponding mask as an adaptive normalization factor. It can adjust the loss value dynamically according to the sizes of the pneumonia infection regions. Then, the overall objective function for training the “Attention RN34 + US” and “Attention RN34 + SS” models is expressed as:

$$L = L_{cls} + \lambda L_{ex} \tag{4}$$

where $\lambda$ is a weight factor for the attention loss. It is set to 0.5 in our experiments. For the CAP cases, only the classification loss $L_{cls}$ is used for model training.
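As a worked illustration of the attention loss and the combined objective, the following NumPy sketch computes a size-normalised squared error between an attention map and an infection mask, and adds it to a given classification loss only for COVID-19 cases. The function names, the small `eps` guard against an empty denominator, and the toy arrays are assumptions added for the example.

```python
import numpy as np

def attention_loss(att, mask, eps=1e-8):
    """Squared error normalised by the summed attention map and mask."""
    numerator = np.sum((att - mask) ** 2)
    denominator = np.sum(att) + np.sum(mask) + eps  # eps is an added safeguard
    return float(numerator / denominator)

def total_loss(cls_loss, att, mask, is_covid, lam=0.5):
    """Classification loss plus the weighted attention loss for COVID-19 cases."""
    if not is_covid:
        return cls_loss  # CAP cases use the classification loss only
    return cls_loss + lam * attention_loss(att, mask)

# Toy attention map and infection mask (hypothetical values).
att = np.array([1.0, 0.5, 0.0, 0.0]).reshape(1, 2, 2)
mask = np.array([1.0, 1.0, 0.0, 0.0]).reshape(1, 2, 2)
l_ex = attention_loss(att, mask)
l_covid = total_loss(0.3, att, mask, is_covid=True)
l_cap = total_loss(0.3, att, mask, is_covid=False)
```

Here the squared error is 0.25 and the normaliser is 1.5 + 2 = 3.5, so the attention loss is about 0.071. Because the normaliser grows with the infection size, large and small infections contribute losses on a comparable scale.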

III-E Ensemble Learning

The size-balanced sampling method could gain more attention on the minority classes and remedy the infection area bias in COVID-19 and CAP patients. A drawback is that it may suffer from the possible over-fitting of these minority classes. In contrast, the uniform sampling method could learn feature representation from the original data distribution in a relatively robust way. Taking the advantages of both sampling methods, we propose a dual-sampling method via an ensemble learning layer, which gauges the weights for the prediction results produced by the two models.

After training the two models with different sampling strategies, we use an ensemble learning layer to integrate the predictions from the two models into the final diagnosis result. We combine the prediction scores with different weights for different ratios of the pneumonia infection regions to the lung:

$$P_{final} = w \, P_{US} + (1 - w) \, P_{SS} \tag{5}$$

where $P_{US}$ and $P_{SS}$ are the prediction scores of the “Attention RN34 + US” and “Attention RN34 + SS” models, and $w$ is the weight factor. In our experiment, $w$ is set to 0.35 for the cases whose infection-to-lung ratio meets either of two threshold criteria, and to 0.96 for the remaining cases. The factor values are determined with a hyperparameter search on the TV set. Then, $P_{final}$ is the final prediction result of the dual-sampling model. As presented in Eq. 5, the dual-sampling strategy combines the characteristics of uniform sampling and size-balanced sampling. For the minority classes, i.e., COVID-19 with small infections as well as CAP with large infections, we assign extra weight to the “Attention RN34 + SS” model. For the remaining cases, more weight is assigned to the “Attention RN34 + US” model.
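The weighted combination in Eq. 5 is small enough to show directly. In this sketch the weight multiplies the uniform-sampling score, which matches the statement that the size-balanced model receives extra weight for minority-size cases. The ratio criteria themselves are not available here, so the caller passes a boolean flag `meets_size_criterion` (a hypothetical name) instead of a ratio.

```python
def dual_sampling_score(p_us, p_ss, meets_size_criterion,
                        w_criterion=0.35, w_rest=0.96):
    """Blend the scores of the uniform-sampling and size-balanced models."""
    w = w_criterion if meets_size_criterion else w_rest
    return w * p_us + (1.0 - w) * p_ss

# Hypothetical scores from the two models for one scan.
score_minority = dual_sampling_score(0.8, 0.4, meets_size_criterion=True)
score_rest = dual_sampling_score(0.8, 0.4, meets_size_criterion=False)
```

For the same pair of scores, the blended result moves toward the size-balanced model (0.54) when the criterion is met and stays close to the uniform-sampling model (0.784) otherwise.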

IV Experimental Results

IV-A Dataset

In this study, we use a large multi-center CT dataset to evaluate the proposed method for the diagnosis of COVID-19. In particular, we have collected a total of 4982 (2mm) chest CT images from 3645 patients, including 3389 COVID-19 CT images and 1593 CAP CT images. All recruited COVID-19 patients were confirmed by RT-PCR test. The images were provided by the Tongji Hospital of Huazhong University of Science and Technology, Shanghai Public Health Clinical Center of Fudan University, the Second Xiangya Hospital of Central South University, China-Japan Union Hospital of Jilin University, Ruijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Affiliated Hangzhou First People’s Hospital of Zhejiang University, the Beijing Chaoyang Hospital of Capital Medical University, and Sichuan University West China Hospital. According to the data collection dates, we separate the images into two datasets. The first dataset (TV dataset) is used for training and cross-validation, and includes 1094 COVID-19 images and 1092 CAP images. The second dataset serves for independent testing, and includes 2295 COVID-19 images and 501 CAP images. Note that the split is done at the patient level, which means that the images of the same subject are kept in the same group, either training or testing. More details are shown in Table I.

Characteristics             TV set          Test set
No. of images (patients)
  COVID-19                  1094 (960)      2295 (1605)
  CAP                       1092 (628)      501 (452)
  Total                     2186 (1588)     2796 (2057)
Age (years)
  COVID-19                  50.0 (14-89)    50.0 (8-95)
  CAP                       57.0 (12-94)    42.0 (15-98)
  Total                     53.0 (12-94)    49.0 (8-98)
Gender
  COVID-19                  479/481         800/805
  CAP                       322/306         255/197
  Total                     801/787         1055/1002
TABLE I: Demographics of the training-validation (TV) dataset and test dataset. “Age” is presented as median (range).
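Keeping all scans of one patient in the same partition, as noted for the TV/test split, can be sketched as follows. The same idea is applied here to building cross-validation folds, which is an assumption: the text states the patient-level rule only for the TV/test split. The function name and the scan and patient identifiers are hypothetical.

```python
import random

def patient_level_folds(scan_to_patient, n_folds=5, seed=0):
    """Assign scans to folds so that no patient appears in two folds."""
    patients = sorted(set(scan_to_patient.values()))
    random.Random(seed).shuffle(patients)
    # Distribute patients (not scans) round-robin over the folds.
    fold_of_patient = {p: i % n_folds for i, p in enumerate(patients)}
    folds = [[] for _ in range(n_folds)]
    for scan, patient in scan_to_patient.items():
        folds[fold_of_patient[patient]].append(scan)
    return folds

# Hypothetical scans; patients p1 and p3 each have two scans.
scans = {"s1": "p1", "s2": "p1", "s3": "p2", "s4": "p3",
         "s5": "p3", "s6": "p4", "s7": "p5", "s8": "p6"}
folds = patient_level_folds(scans, n_folds=3, seed=0)
```

Splitting by patient rather than by scan avoids the leakage that would occur if follow-up scans of a training patient ended up in a validation fold.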

Thin-slice chest CT images are used in this study with the CT thickness ranging from 0.625 to 1.5mm. CT scanners include uCT 780 from UIH, Optima CT520, Discovery CT750, LightSpeed 16 from GE, Aquilion ONE from Toshiba, SOMATOM Force from Siemens, and SCENARIA from Hitachi. Scanning protocol includes: 120 kV, with breath hold at full inspiration. All CT images are anonymized before sending them for conducting this research project. The study is approved by the Institutional Review Board of participating institutes. Written informed consent is waived due to the retrospective nature of the study.

IV-B Image pre-processing

Data are pre-processed in the following steps before being fed into the network. First, we resample all CT images and the corresponding masks of lungs and infection regions to the same spacing (0.7168mm, 0.7168mm, 1.25mm for the x, y, and z axes, respectively) to normalize them to the same voxel size. Second, we down-sample the CT images and segmentation masks to approximately half size for efficient computation. To avoid morphological change in down-sampling, we use the same scale factor in all three dimensions and pad zeros to reach a fixed final size. We should emphasize that our method is capable of handling full-size images. Third, we apply “window/level” (window: 1500, level: -600) scaling to the CT images for contrast enhancement. That is, we truncate the CT image to the window [-1350, 150], which sets intensity values above 150 to 150, and those below -1350 to -1350. Finally, following the standard protocol of data pre-processing, we normalize the voxel-wise intensities in the CT images to a fixed interval.
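The window/level step follows directly from the stated numbers: a window of 1500 centred at level -600 spans [-1350, 150]. The sketch below clips to that range and then rescales linearly. The target interval of the final normalisation is not given in this text, so the rescaling to [0, 1] is an assumption, as is the function name.

```python
import numpy as np

def window_and_normalise(hu, window=1500.0, level=-600.0):
    """Clip intensities to the window, then rescale linearly to [0, 1] (assumed)."""
    low = level - window / 2.0    # -1350 for the stated settings
    high = level + window / 2.0   # 150 for the stated settings
    clipped = np.clip(hu, low, high)
    return (clipped - low) / (high - low)

# A few sample intensities, including values outside the window.
hu = np.array([-2000.0, -1350.0, -600.0, 150.0, 400.0])
scaled = window_and_normalise(hu)
```

Values below the window map to 0, values above it map to 1, and the window centre (-600) maps to 0.5.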

IV-C Training Details and Evaluation Methods

We implement the networks in PyTorch [paszke2019pytorch], and use NVIDIA Apex for lower memory consumption and faster computation. We use the Adam [kingma2014adam] optimizer with momentum set to 0.9, a weight decay of 0.0001, and a learning rate of 0.0002 that is reduced by a factor of 10 after every 5 epochs. We set the batch size to 20 during training. In our experiments, all the models are trained from scratch. On the TV set, we conduct 5-fold cross-validation. In each fold, the model is evaluated on the validation set at the end of each training epoch. The checkpoint with the best evaluation performance within 20 epochs is used as the final model and then evaluated on the test set. All the models are trained on 4 NVIDIA TITAN RTX graphics processing units (GPUs), and the inference time for one sample is approximately 4.6s on one NVIDIA TITAN RTX GPU. For evaluation, we use five different metrics to measure the classification results of the model: area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, and F1-score. AUC measures the degree of separability between the two classes. In this study, we calculate the accuracy, sensitivity, specificity, and F1-score at the threshold of 0.5.
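The threshold-based metrics named above can be computed from prediction scores as in the following sketch. It assumes COVID-19 is the positive class (label 1) and that a score equal to the threshold counts as positive; both are assumptions, and the scores and labels are made-up values for illustration.

```python
def classification_metrics(scores, labels, threshold=0.5):
    """Accuracy, sensitivity, specificity and F1-score at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))

    def ratio(num, den):
        # Return 0.0 instead of failing when a class is absent.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, len(labels)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Hypothetical scores; labels: 1 = COVID-19, 0 = CAP.
scores = [0.9, 0.6, 0.4, 0.2, 0.7, 0.1]
labels = [1, 1, 1, 0, 0, 0]
metrics = classification_metrics(scores, labels)
```

In this toy case there are 2 true positives, 2 true negatives, 1 false positive and 1 false negative, so all four metrics equal 2/3.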

Iv-D Results

| Metric | Model | TV set | Test set |
|---|---|---|---|
| AUC | RN34 + US | 0.984 | 0.934 ± 0.011 |
| AUC | Attention RN34 + US | 0.986 | 0.948 ± 0.003 |
| AUC | Attention RN34 + SS | 0.987 | 0.938 ± 0.002 |
| AUC | Attention RN34 + DS | 0.988 | 0.944 ± 0.003 |
| Accuracy | RN34 + US | 0.945 | 0.859 ± 0.013 |
| Accuracy | Attention RN34 + US | 0.947 | 0.879 ± 0.012 |
| Accuracy | Attention RN34 + SS | 0.951 | 0.869 ± 0.008 |
| Accuracy | Attention RN34 + DS | 0.954 | 0.875 ± 0.009 |
| Sensitivity | RN34 + US | 0.931 | 0.856 ± 0.029 |
| Sensitivity | Attention RN34 + US | 0.941 | 0.872 ± 0.018 |
| Sensitivity | Attention RN34 + SS | 0.953 | 0.868 ± 0.020 |
| Sensitivity | Attention RN34 + DS | 0.954 | 0.869 ± 0.016 |
| Specificity | RN34 + US | 0.959 | 0.870 ± 0.071 |
| Specificity | Attention RN34 + US | 0.953 | 0.907 ± 0.029 |
| Specificity | Attention RN34 + SS | 0.948 | 0.876 ± 0.048 |
| Specificity | Attention RN34 + DS | 0.954 | 0.901 ± 0.025 |
| F1-score | RN34 + US | 0.945 | 0.798 ± 0.011 |
| F1-score | Attention RN34 + US | 0.947 | 0.825 ± 0.013 |
| F1-score | Attention RN34 + SS | 0.951 | 0.811 ± 0.004 |
| F1-score | Attention RN34 + DS | 0.954 | 0.820 ± 0.008 |

TABLE II: Comparison of classification results of different models on the TV set and test set (RN34: 3D ResNet34; US: Uniform Sampling; SS: Size-balanced Sampling; DS: Dual-sampling). The results of AUC, accuracy, sensitivity, specificity and F1-score are presented in this table. The results on the TV set are the combined results of the 5 validation sets. For results on the test set, we show mean ± std (standard deviation) scores of the five trained models, one from each training-validation fold.
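The mean ± std entries in the test-set column summarize five scores, one per fold model. A minimal sketch of that aggregation is given below. The per-fold values are made up for illustration and are not results from this study, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not say which estimator is used.

```python
import statistics


def mean_std(per_fold_scores):
    """Return (mean, sample standard deviation) over per-fold scores."""
    mean = statistics.mean(per_fold_scores)
    std = statistics.stdev(per_fold_scores)  # assumption: sample std (n - 1)
    return mean, std


def format_mean_std(per_fold_scores):
    """Format the scores as 'mean ± std' with three decimals."""
    mean, std = mean_std(per_fold_scores)
    return f"{mean:.3f} ± {std:.3f}"


# Hypothetical scores of five fold models on one shared test set.
hypothetical_fold_scores = [0.90, 0.92, 0.91, 0.93, 0.94]
summary = format_mean_std(hypothetical_fold_scores)
```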

Fig. 4: ROC curves of the TV set and the test set. (A) ROC curves of TV set for 5 folds. (B) ROC curve of test set by using the model from TV set fold 1. (C) ROC curve of test set by using the model from TV set fold 2. (D) ROC curve of test set by using the model from TV set fold 3. (E) ROC curve of test set by using the model from TV set fold 4. (F) ROC curve of test set by using the model from TV set fold 5.

First, we conduct 5-fold cross-validation on the TV set. The experimental results, which combine the results of all 5 validation sets, are shown in Table II. The receiver operating characteristic (ROC) curves are shown in Fig. 4(A). The models with the proposed attention refinement technique improve the AUC and sensitivity scores. Moreover, “Attention RN34 + DS”, which combines the two models trained with different sampling strategies, achieves the highest AUC, accuracy, sensitivity, and F1-score. Its specificity (0.954) is slightly lower than that of ResNet34 with uniform sampling (0.959).

We further investigate the generalization capability of the model by deploying the five trained models, one from each fold, on the independent testing dataset. From Fig. 4(B-F), we can see that the trained model of each fold achieves similar performance, implying consistent performance with different training data. Compared with the results on the TV set in Fig. 4(A), the AUC score of the model with the proposed attention module (“Attention RN34 + DS”) on the independent test set drops from 0.988 to 0.944, while the AUC score of “RN34 + US” drops from 0.984 to 0.934. This indicates the robustness of our model, trained with our attention module, against possible over-fitting. The proposed attention module can also ensure that the decisions made by the model depend mainly on the infection regions, suppressing the contributions from unrelated parts of the images. All 501 CAP images in the test set are from a single site that was not included in the TV set. Because specificity is computed on the CAP images only, the test-set specificities in Table II apply to exactly these images: 0.907 for “Attention RN34 + US” and 0.901 for “Attention RN34 + DS”. We can see that our algorithm maintains good performance on data acquired from different centers. In the next section, the effects of different sampling strategies are presented. To confirm whether the proposed attention module yields a significant difference, paired t-tests are applied. The p-values between “RN34 + US” and the three proposed methods are calculated. All the p-values are smaller than 0.01, implying that the proposed methods achieve significant improvements over “RN34 + US”.
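The significance test mentioned above can be illustrated with a short standard-library sketch of the paired t statistic. The text does not state which paired observations enter the test, so the sketch assumes one score per fold for each model; the scores below are made up for illustration and are not results from this study. Instead of computing an exact p-value, the statistic is compared with 4.604, the two-sided critical value of Student's t distribution at the 0.01 level for 4 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for the paired differences a - b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample std of the differences
    if std_diff == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    t = mean_diff / (std_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-fold scores for a baseline and a proposed model.
hypothetical_baseline = [0.920, 0.935, 0.940, 0.930, 0.945]
hypothetical_proposed = [0.940, 0.950, 0.962, 0.947, 0.966]

t_value, dof = paired_t_statistic(hypothetical_proposed, hypothetical_baseline)

# Two-sided critical value of t at alpha = 0.01 for 4 degrees of freedom.
T_CRITICAL_DF4_ALPHA_001 = 4.604
significant = abs(t_value) > T_CRITICAL_DF4_ALPHA_001
```

The pairing matters here: both models are evaluated on the same folds, so the test operates on the per-fold differences rather than on two independent samples.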

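| Results | Model | TV set, < 0.005 | TV set, 0.005-0.030 | TV set, > 0.030 | Test set, < 0.005 | Test set, 0.005-0.030 | Test set, > 0.030 |
|---|---|---|---|---|---|---|---|
| No. of COVID-19 | | 151 | 318 | 625 | 363 | 718 | 1214 |
| No. of CAP | | 838 | 183 | 71 | 436 | 41 | 24 |
| Total No. | | 989 | 501 | 696 | 799 | 759 | 1238 |
| AUC | RN34 + US | 0.949 | 0.975 | 0.972 | 0.796 ± 0.032 | 0.914 ± 0.021 | 0.905 ± 0.011 |
| AUC | Attention RN34 + US | 0.958 | 0.974 | 0.986 | 0.835 ± 0.012 | 0.923 ± 0.005 | 0.906 ± 0.016 |
| AUC | Attention RN34 + SS | 0.958 | 0.981 | 0.986 | 0.816 ± 0.007 | 0.919 ± 0.004 | 0.906 ± 0.014 |
| AUC | Attention RN34 + DS | 0.960 | 0.981 | 0.987 | 0.830 ± 0.011 | 0.919 ± 0.004 | 0.907 ± 0.015 |
| Accuracy | RN34 + US | 0.930 | 0.930 | 0.976 | 0.719 ± 0.015 | 0.848 ± 0.037 | 0.955 ± 0.007 |
| Accuracy | Attention RN34 + US | 0.932 | 0.930 | 0.981 | 0.752 ± 0.017 | 0.871 ± 0.017 | 0.965 ± 0.008 |
| Accuracy | Attention RN34 + SS | 0.938 | 0.942 | 0.974 | 0.747 ± 0.006 | 0.858 ± 0.018 | 0.955 ± 0.009 |
| Accuracy | Attention RN34 + DS | 0.941 | 0.942 | 0.981 | 0.755 ± 0.012 | 0.859 ± 0.016 | 0.962 ± 0.007 |
| Sensitivity | RN34 + US | 0.675 | 0.925 | 0.995 | 0.514 ± 0.093 | 0.851 ± 0.042 | 0.962 ± 0.007 |
| Sensitivity | Attention RN34 + US | 0.722 | 0.937 | 0.997 | 0.534 ± 0.050 | 0.875 ± 0.021 | 0.972 ± 0.008 |
| Sensitivity | Attention RN34 + SS | 0.815 | 0.953 | 0.987 | 0.569 ± 0.061 | 0.862 ± 0.020 | 0.960 ± 0.010 |
| Sensitivity | Attention RN34 + DS | 0.795 | 0.953 | 0.994 | 0.549 ± 0.049 | 0.863 ± 0.018 | 0.968 ± 0.008 |
| Specificity | RN34 + US | 0.976 | 0.940 | 0.803 | 0.889 ± 0.074 | 0.810 ± 0.078 | 0.617 ± 0.062 |
| Specificity | Attention RN34 + US | 0.970 | 0.918 | 0.845 | 0.933 ± 0.024 | 0.785 ± 0.090 | 0.642 ± 0.037 |
| Specificity | Attention RN34 + SS | 0.961 | 0.923 | 0.859 | 0.896 ± 0.051 | 0.785 ± 0.047 | 0.667 ± 0.051 |
| Specificity | Attention RN34 + DS | 0.968 | 0.923 | 0.873 | 0.926 ± 0.025 | 0.790 ± 0.051 | 0.650 ± 0.037 |
| F1-score | RN34 + US | 0.853 | 0.926 | 0.928 | 0.698 ± 0.022 | 0.643 ± 0.035 | 0.663 ± 0.018 |
| F1-score | Attention RN34 + US | 0.863 | 0.925 | 0.946 | 0.732 ± 0.022 | 0.662 ± 0.015 | 0.702 ± 0.026 |
| F1-score | Attention RN34 + SS | 0.882 | 0.938 | 0.929 | 0.732 ± 0.009 | 0.648 ± 0.020 | 0.671 ± 0.018 |
| F1-score | Attention RN34 + DS | 0.885 | 0.938 | 0.947 | 0.737 ± 0.017 | 0.649 ± 0.017 | 0.692 ± 0.018 |

TABLE III: Group-wise results on the TV set and test set. Based on the volume ratio of the pneumonia regions to the lung, the data are divided into 3 groups: volume ratios of < 0.005, 0.005-0.030, and > 0.030, respectively.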
Fig. 5: Visualization results of our methods on three COVID-19 cases from the small-infection group (< 0.005), medium-infection group (0.005-0.030), and large-infection group (> 0.030) of the test set, shown from left to right. For each case, we show the visualization results in both axial view and coronal view. We show the original images (first row), and the segmentation results of the lung and pneumonia infection regions (second and third rows) by the VB-Net toolkit [shan+2020lung]. For the attention results, we show the Grad-CAM results of “RN34 + US” (fourth row), and the attention maps obtained by our proposed attention module of the “Attention RN34 + US” and “Attention RN34 + SS” models (fifth and sixth rows).

IV-E Detailed Analysis

To demonstrate the effectiveness in diagnosing pneumonia of different severity, we use the VB-Net toolkit [shan+2020lung] to get the lung mask and the pneumonia infection regions for all CT images. Based on the quantified volume ratio of pneumonia infection regions over the lung, we roughly divide the data into 3 groups in both the TV set and the test set, according to the ratios, i.e., 1) < 0.005, 2) 0.005-0.030, and 3) > 0.030. As shown in Table III, most COVID-19 images have high ratios (higher than 0.030), while most CAP images have ratios lower than 0.005, which may indicate that the severity of COVID-19 is usually higher than that of CAP in our collected dataset. Furthermore, the classification results for COVID-19 are highly related to the ratio. In Table III, the sensitivity scores are relatively high for the large-infection group (> 0.030) and low for the small-infection group (< 0.005), whereas the specificity scores show the opposite trend: they are relatively high for the small-infection group and low for the large-infection group. This behavior matches the distribution of COVID-19 and CAP in the collected dataset.
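The grouping by infection-to-lung volume ratio can be sketched as follows. The thresholds 0.005 and 0.030 are those named in the text; the function names, the use of voxel counts as a stand-in for volume, and the assignment of a ratio lying exactly on a threshold to the medium group are assumptions made for illustration.

```python
def infection_ratio(infection_voxels, lung_voxels):
    """Volume ratio of the segmented infection region to the segmented lung.

    Assumption: both masks share the same voxel spacing, so the ratio of
    voxel counts equals the ratio of volumes.
    """
    if lung_voxels <= 0:
        raise ValueError("lung mask must contain at least one voxel")
    return infection_voxels / lung_voxels


def infection_group(ratio, low=0.005, high=0.030):
    """Assign a scan to the small, medium or large infection group."""
    if ratio < low:
        return "small"
    if ratio <= high:  # assumption: boundary values fall in the medium group
        return "medium"
    return "large"


# Hypothetical voxel counts (infection, lung), for illustration only.
example_groups = [
    infection_group(infection_ratio(10, 10000)),   # ratio 0.001
    infection_group(infection_ratio(100, 10000)),  # ratio 0.010
    infection_group(infection_ratio(500, 10000)),  # ratio 0.050
]
```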

When the size-balanced sampling strategy (“Attention RN34 + SS”) is applied in the training procedure, the test-set sensitivity of the small-infection group (< 0.005) increases from 0.534 to 0.569, compared with the case of using the uniform sampling strategy (“Attention RN34 + US”). The specificity of the large-infection group (> 0.030) also increases, from 0.642 to 0.667. These results indicate that the size-balanced sampling strategy can improve the classification robustness when the infection sizes are imbalanced between the two classes. However, if we only utilize the size-balanced sampling strategy in the training process, the sensitivity of the large-infection group (> 0.030) decreases from 0.972 to 0.960 (Table III), and the specificity of the small-infection group (< 0.005) decreases from 0.933 to 0.896. This reflects that some advantages of the network may be sacrificed in order to achieve specific requirements. To achieve a balance between the two extreme conditions, we present the results of ensemble learning with the dual-sampling model (i.e., “Attention RN34 + DS”). Judging from the sensitivity and specificity in both the small- and large-infection groups, the dual-sampling strategy preserves the classification ability obtained by uniform sampling, and slightly improves the classification performance on the COVID-19 cases in the small-infection group and the CAP cases in the large-infection group. Furthermore, the p-values between “Attention RN34 + US” and “Attention RN34 + DS” in both the small-infection group (< 0.005) and the large-infection group (> 0.030) are calculated. All the p-values are smaller than 0.01, which supports the effectiveness and necessity of the dual-sampling strategy.
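How the two models might be combined at inference time can be sketched as below. This section only states that the dual-sampling result comes from ensembling the uniform-sampling and size-balanced-sampling models; the equal-weight averaging of their predicted COVID-19 probabilities, the function names, and the example probabilities are assumptions made for illustration.

```python
def ensemble_probability(p_uniform, p_size_balanced, weight=0.5):
    """Weighted average of the COVID-19 probabilities of the two models.

    Assumption: weight = 0.5 gives both models equal influence.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_uniform + (1.0 - weight) * p_size_balanced


def classify(probability, threshold=0.5):
    """Map an ensembled probability to a class label at the given threshold."""
    return "COVID-19" if probability >= threshold else "CAP"


# Hypothetical case: the uniform-sampling model is unsure about a scan with a
# small infection, while the size-balanced model leans towards COVID-19.
p_ensemble = ensemble_probability(0.40, 0.70)
label = classify(p_ensemble)
```

Averaging lets a confident prediction from one model offset an uncertain prediction from the other, which is one way to read the trade-off between the two sampling strategies described above.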

Finally, we show typical attention maps obtained by our models (Fig. 5) trained in one fold. For comparison, we take the naive ResNet34 (“RN34 + US”) trained in the same fold, which uses neither the online attention module nor the infection mask refinement, and apply a model explanation technique (Grad-CAM [selvaraju2017grad]) to obtain heatmaps for its classification. We can see that the output of Grad-CAM roughly indicates the infection location, yet sometimes appears far outside of the lung. In contrast, the attention maps from our models (“Attention RN34 + US” and “Attention RN34 + SS”) reveal the precise locations of the infection. The conspicuous areas in the attention maps are similar to the infection segmentation results, which suggests that the final classification results determined by our model are reliable and interpretable. The attention maps could thus be used as a basis for deriving the COVID-19 diagnosis in clinical practice.

Fig. 6: Visualization results of two failure cases.

IV-F Failure Analysis

We also show two failure cases in Fig. 6, where the COVID-19 cases are mistakenly classified as CAP by all the models. As can be observed from the results shown in Fig. 6, the attention maps from all the models are incorrectly activated on many areas unrelated to pneumonia. The “RN34 + US” model even generates many highlighted areas in the non-lung region instead of focusing on the lungs. With the proposed attention constraint, the attention maps of “Attention RN34 + US” and “Attention RN34 + SS” partially alleviate this problem. However, the visual evidence is still insufficient to reach a correct final prediction.

V Discussion and Conclusion

For COVID-19, it is important to get the diagnosis result as soon as possible. Although RT-PCR is the current ground truth for diagnosing COVID-19, it can take days to get the final results, and the testing capacity is also limited in many places, especially in the early outbreak [ai2020correlation]. CT has been shown to be a powerful tool and can provide chest scan results in several minutes. It is therefore beneficial to develop an automatic diagnosis method based on chest CT to assist COVID-19 screening. In this study, we explore a deep-learning-based method to automatically distinguish COVID-19 from CAP in chest CT images. We evaluate our method on what is, to the best of our knowledge, the largest multi-center CT dataset in the world. To further evaluate the generalization ability of the model, we use independent data from different hospitals (not included in the TV set), achieving an AUC of 0.944, accuracy of 87.5%, sensitivity of 86.9%, specificity of 90.1%, and F1-score of 82.0%. At the same time, to better understand the decisions of the deep learning model, we also refine the attention module and show the visual evidence, which reveals the important regions used by the model for diagnosis. Our proposed method could be further extended to the differential diagnosis of pneumonia, which can greatly assist physicians.

There also exist several limitations in this study. First, when longitudinal data become available, the proposed model should be tested for its consistency in tracking the development of COVID-19 during treatment. Second, although the proposed online attention module could largely improve the interpretability and explainability of COVID-19 diagnosis in comparison with conventional methods such as Grad-CAM, future work is still needed to analyze the correlation between these attention localizations and the specific imaging signs that are frequently used in clinical diagnosis. There also exist some failure cases in which the visualization results do not appear correctly at the pneumonia infection regions, as shown in Fig. 6. This motivates us to further improve the attention module in future research, so that it better focuses on the related regions and reduces the distortion that confounding visual information introduces into the classification task. Third, we also notice that the accuracy on COVID-19 cases with small infection areas is not quite satisfactory. This indicates the necessity of combining CT images with clinical assessment and laboratory tests for precise diagnosis of early COVID-19, which will also be covered by our future work. Last but not least, the CAP cases used in this study do not include subtype information, i.e., bacterial, fungal, and non-COVID-19 viral pneumonia. Assisting the clinical diagnosis of pneumonia subtypes would also be beneficial.

To conclude, we have developed a 3D CNN with both online attention refinement and a dual-sampling strategy to distinguish COVID-19 from CAP in chest CT images. The generalization performance of this algorithm is also verified on what is, to the best of our knowledge, the largest multi-center CT dataset in the world.